Intelligent Computing: Proceedings of the 2020 Computing Conference, Volume 2 [1st ed.] 9783030522452, 9783030522469

This book focuses on the core areas of computing and their applications in the real world, presenting papers from the 2020 Computing Conference.


English · XI, 717 pages [728] · 2020



Table of contents :
Front Matter ....Pages i-xi
Urban Mobility Swarms: A Scalable Implementation (Alex Berke, Jason Nawyn, Thomas Sanchez Lengeling, Kent Larson)....Pages 1-18
Using AI Simulations to Dynamically Model Multi-agent Multi-team Energy Systems (D. Michael Franklin, Philip Irminger, Heather Buckberry, Mahabir Bhandari)....Pages 19-32
Prediction of Cumulative Grade Point Average: A Case Study (Anan Sarah, Mohammed Iqbal Hossain Rabbi, Mahpara Sayema Siddiqua, Shipra Banik, Mahady Hasan)....Pages 33-42
Warehouse Setup Problem in Logistics: A Truck Transportation Cost Model (Rohit Kumar Sachan, Dharmender Singh Kushwaha)....Pages 43-62
WARDS: Modelling the Worth of Vision in MOBA’s (Alan Pedrassoli Chitayat, Athanasios Kokkinakis, Sagarika Patra, Simon Demediuk, Justus Robertson, Oluseji Olarewaju et al.)....Pages 63-81
Decomposition Based Multi-objectives Evolutionary Algorithms Challenges and Circumvention (Sherin M. Omran, Wessam H. El-Behaidy, Aliaa A. A. Youssif)....Pages 82-93
Learning the Satisfiability of Ł-clausal Forms (Mohamed El Halaby, Areeg Abdalla)....Pages 94-102
A Teaching-Learning-Based Optimization with Modified Learning Phases for Continuous Optimization (Onn Ting Chong, Wei Hong Lim, Nor Ashidi Mat Isa, Koon Meng Ang, Sew Sun Tiang, Chun Kit Ang)....Pages 103-124
Use of Artificial Intelligence and Machine Learning for Personalization Improvement in Developed e-Material Formatting Application (Kristine Mackare, Anita Jansone, Raivo Mackars)....Pages 125-132
Probabilistic Inference Using Generators: The Statues Algorithm (Pierre Denis)....Pages 133-154
A Q-Learning Based Maximum Power Point Tracking for PV Array Under Partial Shading Condition (Roy Chaoming Hsu, Wen-Yen Chen, Yu-Pi Lin)....Pages 155-168
A Logic-Based Agent Modelling Paradigm for Investment in Derivatives Markets (Jonathan Waller, Tarun Goel)....Pages 169-180
An Adaptive Genetic Algorithm Approach for Optimizing Feature Weights in Multimodal Clustering (Manar Hosny, Sawsan Al-Malak)....Pages 181-197
Extending CNN Classification Capabilities Using a Novel Feature to Image Transformation (FIT) Algorithm (Ammar S. Salman, Odai S. Salman, Garrett E. Katz)....Pages 198-213
MESRS: Models Ensemble Speech Recognition System (Ben Zagagy, Maya Herman)....Pages 214-231
DeepConAD: Deep and Confidence Prediction for Unsupervised Anomaly Detection in Time Series (Ahmad Idris Tambuwal, Aliyu Muhammad Bello)....Pages 232-244
Reduced Order Modeling Assisted by Convolutional Neural Network for Thermal Problems with Nonparametrized Geometrical Variability (Fabien Casenave, Nissrine Akkari, David Ryckelynck)....Pages 245-263
Deep Convolutional Generative Adversarial Networks Applied to 2D Incompressible and Unsteady Fluid Flows (Nissrine Akkari, Fabien Casenave, Marc-Eric Perrin, David Ryckelynck)....Pages 264-276
Improving Gate Decision Making Rationality with Machine Learning (Mark van der Pas, Niels van der Pas)....Pages 277-290
End-to-End Memory Networks: A Survey (Raheleh Jafari, Sina Razvarz, Alexander Gegov)....Pages 291-300
Enhancing Credit Card Fraud Detection Using Deep Neural Network (Souad Larabi Marie-Sainte, Mashael Bin Alamir, Deem Alsaleh, Ghazal Albakri, Jalila Zouhair)....Pages 301-313
Non-linear Aggregation of Filters to Improve Image Denoising (Benjamin Guedj, Juliette Rengot)....Pages 314-327
Comparative Study of Classifiers for Blurred Images (Ratiba Gueraichi, Amina Serir)....Pages 328-336
A Raspberry Pi-Based Identity Verification Through Face Recognition Using Constrained Images (Alvin Jason A. Virata, Enrique D. Festijo)....Pages 337-349
An Improved Omega-K SAR Imaging Algorithm Based on Sparse Signal Recovery (Shuang Wang, Huaping Xu, Jiawei Zhang, Boyu Wang)....Pages 350-357
A-Type Phased Array Ultrasonic Imaging Testing Method Based on FRI Sampling (Dai GuangZhi, Wen XiaoJun)....Pages 358-366
A Neural Markovian Multiresolution Image Labeling Algorithm (John Mashford, Brad Lane, Vic Ciesielski, Felix Lipkin)....Pages 367-379
Development of a Hardware-Software System for the Assembled Helicopter-Type UAV Prototype by Applying Optimal Classification and Pattern Recognition Methods (Askar Boranbayev, Seilkhan Boranbayev, Askar Nurbekov)....Pages 380-394
Skin Capacitive Imaging Analysis Using Deep Learning GoogLeNet (Xu Zhang, Wei Pan, Christos Bontozoglou, Elena Chirikhina, Daqing Chen, Perry Xiao)....Pages 395-404
IoT Based Cloud-Integrated Smart Parking with e-Payment Service (Ja Lin Yu, Kwan Hoong Ng, Yu Ling Liong, Effariza Hanafi)....Pages 405-414
Addressing Copycat Attacks in IPv6-Based Low Power and Lossy Networks (Abhishek Verma, Virender Ranga)....Pages 415-426
On the Analysis of Semantic Denial-of-Service Attacks Affecting Smart Living Devices (Joseph Bugeja, Andreas Jacobsson, Romina Spalazzese)....Pages 427-444
Energy Efficient Channel Coding Technique for Narrowband Internet of Things (Emmanuel Migabo, Karim Djouani, Anish Kurien)....Pages 445-466
An Internet of Things and Blockchain Based Smart Campus Architecture (Manal Alkhammash, Natalia Beloff, Martin White)....Pages 467-486
Towards a Scalable IOTA Tangle-Based Distributed Intelligence Approach for the Internet of Things (Tariq Alsboui, Yongrui Qin, Richard Hill, Hussain Al-Aqrabi)....Pages 487-501
An Architecture for Dynamic Contextual Personalization of Multimedia Narratives in IoT Environments (Ricardo R. M. do Carmo, Marco A. Casanova)....Pages 502-521
Emotional Effect of Multimodal Sense Interaction in a Virtual Reality Space Using Wearable Technology (Jiyoung Kang)....Pages 522-530
Genetic Algorithms as a Feature Selection Tool in Heart Failure Disease (Asmaa Alabed, Chandrasekhar Kambhampati, Neil Gordon)....Pages 531-543
Application of Additional Argument Method to Burgers Type Equation with Integral Term (Talaibek Imanaliev, Elena Burova)....Pages 544-553
Comparison of Dimensionality Reduction Methods for Road Surface Identification System (Gonzalo Safont, Addisson Salazar, Alberto Rodríguez, Luis Vergara)....Pages 554-563
A Machine Learning Platform in Healthcare with Actor Model Approach (Mauro Mazzei)....Pages 564-571
Boundary Detection of Point Clouds on the Images of Low-Resolution Cameras for the Autonomous Car Problem (Istvan Elek)....Pages 572-581
Identification and Classification of Botrytis Disease in Pomegranate with Machine Learning (M. G. Sánchez, Veronica Miramontes-Varo, J. Abel Chocoteco, V. Vidal)....Pages 582-598
Rethinking Our Assumptions About Language Model Evaluation (Nancy Fulda)....Pages 599-609
Women in ISIS Propaganda: A Natural Language Processing Analysis of Topics and Emotions in a Comparison with a Mainstream Religious Group (Mojtaba Heidarysafa, Kamran Kowsari, Tolu Odukoya, Philip Potter, Laura E. Barnes, Donald E. Brown)....Pages 610-624
Improvement of Automatic Extraction of Inventive Information with Patent Claims Structure Recognition (Daria Berduygina, Denis Cavallucci)....Pages 625-637
Translate Japanese into Formal Languages with an Enhanced Generalization Algorithm (Kazuaki Kashihara)....Pages 638-655
Authorship Identification for Arabic Texts Using Logistic Model Tree Classification (Safaa Hriez, Arafat Awajan)....Pages 656-666
The Method of Analysis of Data from Social Networks Using Rapidminer (Askar Boranbayev, Gabit Shuitenov, Seilkhan Boranbayev)....Pages 667-673
The Emergence, Advancement and Future of Textual Answer Triggering (Kingsley Nketia Acheampong, Wenhong Tian, Emmanuel Boateng Sifah, Kwame Obour-Agyekum Opuni-Boachie)....Pages 674-693
OCR Post Processing Using Support Vector Machines (Jorge Ramón Fonseca Cacho, Kazem Taghva)....Pages 694-713
Back Matter ....Pages 715-717


Advances in Intelligent Systems and Computing 1229

Kohei Arai · Supriya Kapoor · Rahul Bhatia, Editors

Intelligent Computing Proceedings of the 2020 Computing Conference, Volume 2

Advances in Intelligent Systems and Computing Volume 1229

Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Nikhil R. Pal, Indian Statistical Institute, Kolkata, India Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba Emilio S. Corchado, University of Salamanca, Salamanca, Spain Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil Ngoc Thanh Nguyen , Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within ”Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

Kohei Arai · Supriya Kapoor · Rahul Bhatia



Editors

Intelligent Computing Proceedings of the 2020 Computing Conference, Volume 2


Editors Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan

Supriya Kapoor The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

Rahul Bhatia The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-3-030-52245-2 ISBN 978-3-030-52246-9 (eBook) https://doi.org/10.1007/978-3-030-52246-9 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Editor’s Preface

On behalf of the Committee, we welcome you to the Computing Conference 2020. The aim of this conference is to give a platform to researchers with fundamental contributions and to be a premier venue for industry practitioners to share and report on up-to-the-minute innovations and developments, to summarize the state of the art, and to exchange ideas and advances in all aspects of computer science and its applications.

For this edition of the conference, we received 514 submissions from 50+ countries around the world. These submissions underwent a double-blind peer review process. Of those 514 submissions, 160 submissions (including 15 posters) were selected for inclusion in these proceedings. The published proceedings have been divided into three volumes covering a wide range of conference tracks, such as technology trends, computing, intelligent systems, machine vision, security, communication, electronics and e-learning, to name a few.

In addition to the contributed papers, the conference program included inspiring keynote talks, streamed live during the conference, whose thought-provoking claims were anticipated to pique the interest of the entire computing audience. The authors also presented their research papers very professionally, and these presentations were viewed by a large international audience online. All this digital content engaged significant contemplation and discussions amongst all participants.

Deep appreciation goes to the keynote speakers for sharing their knowledge and expertise with us and to all the authors who have spent the time and effort to contribute significantly to this conference. We are also indebted to the Organizing Committee for their great efforts in ensuring the successful implementation of the conference. In particular, we would like to thank the Technical Committee for their constructive and enlightening reviews on the manuscripts in the limited timescale.

We hope that all the participants and the interested readers benefit scientifically from this book and find it stimulating in the process. We are pleased to present the proceedings of this conference as its published record.

Hope to see you in 2021, in our next Computing Conference, with the same amplitude, focus and determination. Kohei Arai


Urban Mobility Swarms: A Scalable Implementation

Alex Berke, Jason Nawyn, Thomas Sanchez Lengeling, and Kent Larson

MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
{aberke,nawyn,thomassl,kll}@media.mit.edu
https://www.media.mit.edu/groups/city-science/overview/

Abstract. We present a system to coordinate “urban mobility swarms” in order to promote the use and safety of lightweight, sustainable transit, while enhancing the vibrancy and community fabric of cities. This work draws from behavior exhibited by swarms of nocturnal insects, such as crickets and fireflies, whereby synchrony unifies individuals in a decentralized network. Coordination naturally emerges in these cases and provides a compelling demonstration of “strength in numbers”. Our work is applied to coordinating lightweight vehicles, such as bicycles, which are automatically inducted into ad-hoc “swarms”, united by the synchronous pulsation of light. We model individual riders as nodes in a decentralized network and synchronize their behavior via a peer-to-peer message protocol and algorithm, which preserves individual privacy. Nodes broadcast over radio with a transmission range tuned to localize swarm membership. Nodes then join or disconnect from others based on proximity, accommodating the dynamically changing topology of urban mobility networks. This paper provides a technical description of our system, including the protocol and algorithm to coordinate the swarming behavior that emerges from it. We also demonstrate its implementation in code, circuitry, and hardware, with a system prototype tested on a city bike-share. In doing so, we evince the scalability of our system. Our prototype uses low-cost components, and bike-share programs, which manage bicycle fleets distributed across cities, could deploy the system at city-scale. Our flexible, decentralized design allows additional bikes to then connect with the network, enhancing its scale and impact.

Keywords: Cities · Mobility · Swarm behavior · Decentralization · Distributed network · Peer-to-peer protocol · Synchronization · Algorithms · Privacy

1 Introduction

Cities comprise a variety of mobility networks, from streets and bicycle lanes, to rail and highways. Increasing the use of the lightweight transit options that navigate these networks, such as bicycles and scooters, can increase the sustainability of cities and public health [1–3]. However, infrastructure to promote and protect lightweight transit, such as bicycle lanes, is limited, and riders are vulnerable on streets designed to prioritize the efficient movement of heavier vehicles, such as cars and trucks.

In this paper we present our design and implementation of a system that synchronizes lights of nearby bicycles, automatically inducting riders into unified groups (swarms), to increase their presence and collective safety. Ad-hoc swarms emerge from our system, in a distributed network that is superimposed on the physical infrastructure of existing mobility networks. We designed and tested our system with bicycles, but our work can be extended to unify swarms of the other lightweight and sustainable transit alternatives present in cities.

As bicycles navigate dark city streets, they are often equipped with lights. The lights are to make their presence known to cars or other bikers, and make the hazards of traffic less dangerous. As solitary bikes equipped with our system come together, their lights begin to softly pulsate, at the same cadence. The cyclists may not know each other, or may only pass each other briefly, but for the moments they are together, their lights synchronize. The effect is a visually united presence, as swarms of bikes illuminate themselves with a gently breathing, collective light source. As swarms grow, their visual effect and ability to attract more cyclists is enhanced. The swarming behavior that results is coordinated by our system technology without effort from cyclists, as they collaboratively improve their aggregate presence and safety.

We provide a technical description of our light system that includes the design of a peer-to-peer message protocol, algorithm, and low-cost hardware. We also present our prototypes that were tested on a city bike-share network. The system’s low cost and the opportunity for bike-share programs to deploy it city-wide allow the network of swarms to quickly scale. In addition, the decentralized and flexible nature of our design allows new bikes to join a network, immediately coordinate with other bikes, and further grow a network of swarms.

Our system is designed for deployment in a city, yet draws inspiration from nature. Swarms of insects provide rich examples of synchrony unifying groups of individuals in a decentralized network. We focus on examples particular to the night. The sound of crickets in the night is the sound of many individual insects, chirping in synchrony. A single cricket’s sound is amplified when it joins the collective whole. The spectacle of thousands of male fireflies gathering in trees in southeast Asia to flash in unison has long been recorded and studied by biologists [4,5].

These examples of synchrony emerging via peer-to-peer coordination within a decentralized network are of interest in our design for urban swarms. They have also interested biologists, who have studied the coordination mechanisms of these organisms [6]. Applied mathematicians and physicists have also analyzed these systems and attempted to model the dynamics of their synchronized behavior [7,8]. We draw from these prior technical descriptions in order to describe the coordination of our decentralized bike light system.


In doing so, we describe the individual bikes that create and join swarms as nodes in a distributed network. These nodes are programmed to behave as oscillators, and their synchronization is coordinated by aligning their phases of oscillation via exchange of peer-to-peer messages. Our message protocol and underlying algorithm accommodate the dynamically changing nature of urban mobility networks. New nodes can join the network, and nodes can drop out, and yet our system maintains its mechanisms of synchrony. Moreover, our system is scalable due to its simplicity, flexibility, and features that allow nodes to enter the swarm network with minimal information and hardware. Namely, – – – –

There is no global clock Nodes communicate peer-to-peer via simple radio messages Nodes need not be predetermined nor share metadata about their identities Nodes can immediately synchronize

Before we provide our technical description and implementation, we first describe the bicycles, their lights, and their swarming behavior. We then describe them as nodes in a dynamic, decentralized network of swarms, before presenting our protocol and algorithm that coordinates their behavior. Lastly, we show how we prototyped and tested our system with bicycles from a city bike-share program.

2

Swarm Behavior and Bicycles Lights

Similar to our swarms of bicycles, swarms of nocturnal insects, such as crickets and fireflies, display synchronous behavior within decentralized networks. In these cases, the recruitment and coordination of individuals in close proximity emerges from natural processes and provides a compelling demonstration of “strength in numbers”. This concept of “strength in numbers” demonstrated in natural environments can be extended to the concept of “safety in numbers” for urban environments. “Safety in numbers” is the hypothesis that individuals within groups are less likely to fall victim to traffic mishaps, and its effect has been well studied and documented in bicycle safety literature [9,10]. The cyclists within swarms coordinated by our system are safer due to their surrounding numbers, but also because their presence is pronounced by the visual effect swarms produce with their synchronized lights. Unlike insects, the coordination of bikes swarms is due to peer-to-peer radio messages and software, yet swarms can still form organically when cyclists are in proximity. The visual display of synchronization is due to the oscillating amplitude of LED lights. Lights line both sides of the bicycle frame, and a front light illuminates the path forward (Fig. 1). The lights stay steadily on when a bike is alone. When a bike is joined by another bike that is equipped with the system, a swarm of two is formed, and the lights on both bikes begin to gently pulsate. The amplitude of the lights oscillates from high to low and back to high, in synchrony. As other

4

A. Berke et al.

bikes come in proximity, their lights begin to pulsate synchronously as well, further growing the swarm and amplifying its visual effect.

Fig. 1. Bicycle with lights.

The system synchronizes swarms as they merge, as well as the momentary passing of bicycles. When any bike leaves the proximity of others, its lights return to their steady state. The effect is a unified pulsation of light, illuminating swarms of bikes as they move through the darkness. This visual effect enhances their safety as well as their ability to attract more members to further grow the swarm. Additionally, as a swarm grows and its perimeter expands, the reach of its radio messages expands as well, further enhancing its potential for growth. While this paper focuses on the technical system that enables these swarms, we note that the swarming behavior that emerges can also be social. Members of swarms may not know each other, but by riding in proximity, they collaboratively enhance the swarm’s effects.

3 3.1

Technical Description A Decentralized Network of Swarms

In order to model swarms, we describe individual bikes as nodes in a decentralized network. We consider swarms to be locally connected portions of the network, comprised of synchronized nodes. The nodes synchronize by passing messages peer-to-peer and by running the same synchronization algorithm. When nodes come within message-passing range of one another, they are able to connect and synchronize. Two or more connected and synchronized nodes form a swarm. When a node moves away from a swarm, and is no longer in range of message passing, it disconnects from that portion of the network, leaving the swarm. The network’s topology changes as nodes (bikes) move in or out of message passing range from one another, and connect or disconnect, and swarms thereby form, change shape, or dissolve (Fig. 2). There may be multiple swarms of synchronized bikes in the city, with each swarm not necessarily in synchrony with another distant swarm. As such, the network of nodes may have a number of connected portions (swarms) at any given time, and these swarms may not be connected to one another (Fig. 3). Our system exploits the transitive nature of synchrony: If node 1 is synchronized with node 2, and node 2 is synchronized with node 3, then node 1 and

Urban Mobility Swarms

5

Fig. 2. Nodes synchronize when near each other, and fall out of synchrony when they move apart.

Fig. 3. Examples of network states.

node 3 are synchronized as well. Since all nodes in a connected swarm are in synchrony with each other, a given node needs only to connect and synchronize with a single node in a swarm in order to synchronize with the entire swarm (Fig. 4).

Fig. 4. Synchrony of nodes in the network is transitive.

When two synchronized swarms that are not in synchrony with each other come into proximity and connect for the first time, our message broadcasting protocol and synchronization algorithm facilitates their merge and transition towards a mutually synchronized state (Fig. 5).


Fig. 5. Two swarms come in proximity with each other and merge as one swarm.

A feature of the message passing and synchronization protocol is that the nodes in the network need not be predetermined. New nodes can enter this decentralized network at any given time and immediately begin exchanging messages and synchronizing with pre-existing nodes.

3.2 Nodes as Oscillators

The behavior of the nodes (bikes) that needs to be synchronized is the timed pulsation of light. We can characterize this behavior by describing a node as an oscillator, similar to simple oscillators modeled in elementary physics. Nodes have two states:

1. Synchronized: the node’s behavior is periodic and synchronized with another node.
2. Out of sync: the node’s behavior remains steady; the node is not in communication with other nodes.

All nodes share a fixed period, T. When a node is in a state of synchrony, its behavior transitions over time, t, until t = T, at which point it returns to its behavior at time t = 0. We denote the phase of node i at time t as φ_i(t), such that φ_i(t) ∈ [0, T] and the phases 0 and T are identical. When nodes are in synchrony, their phases are aligned. Thus for two nodes, node i and node j, to be synchronized, φ_i(t) = φ_j(t) (Fig. 6).

When a node is out of sync (i.e. the bike is not in proximity of another bike and therefore not exchanging messages with other bikes), then it ceases to act as an oscillator. When out of the synchronous state, the node’s phase remains stable at φ = 0.
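As a minimal sketch of this model (not the authors' code), a node can be represented as a state flag plus a phase value that only advances while the node is synchronized; the period value is taken from Sect. 3.3, and the type and field names are illustrative assumptions.

#include <cstdint>

const uint32_t T_MS = 2200;  // shared fixed period T, in milliseconds (Sect. 3.3)

enum class NodeState { OutOfSync, Synchronized };

struct Node {
  NodeState state = NodeState::OutOfSync;
  uint32_t phaseMs = 0;  // phase in [0, T); held at 0 while out of sync

  // Advance the phase by the elapsed time; only synchronized nodes oscillate.
  void advance(uint32_t elapsedMs) {
    if (state == NodeState::Synchronized) {
      phaseMs = (phaseMs + elapsedMs) % T_MS;
    } else {
      phaseMs = 0;
    }
  }
};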


Fig. 6. Out of sync nodes, and synchronized nodes.

3.3 Phase and Light

The pulsating effect of a bike node’s light is the decay and growth of the light’s amplitude over the node’s period, T. The amplitude of the light is a function of the node’s phase: A = f_A(φ) (see Fig. 7). We denote the highest amplitude for the light as HI, and the lowest as LO, such that:

f_A(0) = f_A(T) = HI,   f_A(T/2) = LO   (1)

Fig. 7. Graph of f_A(φ).

When a node is in the synchronized state, and its phase oscillates, φ(t) ∈ [0, T], the amplitude of its light can be plotted as a function of time, t (Fig. 8). Note that nodes do not share a globally synchronized clock, so time t is relative to the node. Without loss of generality, we plot t = 0 as the moment the given node enters a state of synchrony. When a node is in the out of sync state, the value of its phase, φ, is steady at φ = 0, so its light stays at the HI amplitude, A = HI = f_A(0) (Fig. 9).

In our implementation, the amplitude of light does not reach as low as 0 (LO > 0). This decision was made due to our desired aesthetics and user experience.


Fig. 8. Amplitude, A, plotted as a function of relative time, t, for a node in the synchronized state.

Fig. 9. Amplitude, A, plotted for a node in the out of sync state.

As soon as an out of sync node encounters another node and enters a state of synchrony, its phase begins to oscillate and the amplitude of its light transitions from HI to LO along the fA (φ) path (Fig. 10).

Fig. 10. Amplitude, A, plotted as a function of time, for a node transitioning into the synchronous state.

Implementation Notes. For our bicycle lighting system we chose period T = 2200 ms and chose f_A(φ) as a sinusoidal curve:

f_A(φ) = ((cos(2π · φ / T) + 1) / 2) · (HI − LO) + LO   (2)

We visually tested a variety of period lengths and functions. We chose the combination that best produced a gentle rhythmic effect that would be aesthetically pleasing and noticeable, yet not distracting to drivers.
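As an illustration, Eq. (2) maps directly to a small amplitude function. The sketch below is not the published firmware; the HI/LO values and the use of 8-bit PWM duty cycles are assumptions made for the example.

#include <cmath>
#include <cstdint>

const uint32_t T_MS = 2200;      // period T from the text, in ms
const float HI = 255.0f;         // assumed maximum LED amplitude (PWM duty)
const float LO = 40.0f;          // assumed minimum amplitude, LO > 0
const float TWO_PI_F = 6.2831853f;

// A = f_A(phi): returns HI at phi = 0 and phi = T, and LO at phi = T/2.
float amplitude(uint32_t phaseMs) {
  float c = std::cos(TWO_PI_F * phaseMs / T_MS);  // in [-1, 1]
  return ((c + 1.0f) / 2.0f) * (HI - LO) + LO;
}

On an Arduino-style board the returned value could, for example, be passed to analogWrite to drive the LEDs.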


We also considered f_A(φ) as a piecewise linear function (Fig. 11). For a slightly different effect, one might choose any other continuous function such that Eq. 1 holds. As long as the period, T, is the same as in other implementations, the nodes can synchronize.

f_A(φ) = HI − k · φ           when φ < T/2
f_A(φ) = LO + k · (φ − T/2)   when φ ≥ T/2   (3)

Here the slope is k = (HI − LO)/(T/2), so that Eq. 1 holds.
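For comparison with the sinusoidal version, the piecewise linear alternative of Eq. (3) can be sketched the same way (again with assumed HI/LO values; this is illustrative, not the authors' implementation):

#include <cstdint>

const uint32_t T_MS = 2200;
const float HI = 255.0f;   // assumed maximum amplitude
const float LO = 40.0f;    // assumed minimum amplitude

// Linear ramp down from HI to LO over [0, T/2), then back up to HI over [T/2, T).
float amplitudePiecewise(uint32_t phaseMs) {
  const float halfT = T_MS / 2.0f;
  const float k = (HI - LO) / halfT;  // slope chosen so that Eq. (1) holds
  if (phaseMs < halfT) {
    return HI - k * phaseMs;
  }
  return LO + k * (phaseMs - halfT);
}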

Fig. 11. Graph of f_A(φ) as a piecewise linear function.

4 Message Broadcasting and Synchronization

4.1 Protocol

Nodes maintain anonymity by communicating information pertaining only to timing over a broadcast and receive protocol. Synchronization is coordinated by a simple set of rules that govern how nodes handle messages received.

Broadcasting Messages. The messages broadcast by a node are simply integers representing the node’s phase, φ, at the time of broadcasting, t, i.e. nodes broadcast φ(t). Nodes in the out of sync state broadcast the message of 0 (zero), as φ = 0 for out of sync nodes.

Receiving Messages. Nodes update their phase values to match the highest phase value of nearby nodes. When a message is received by a node out of the synchronous state, the phase represented in the message, φ_m, is necessarily greater than or equal to the out of sync node’s phase value of φ = 0 (φ_m ≥ φ). The out of sync node then sets its phase to match the phase in the received message, φ = φ_m, and enters a state of synchrony. Its phase then begins to oscillate from the value of φ_m, and the bike lights pulsate in synchrony.

When a message is received by a node that is already in a state of synchrony, the node compares its own phase, φ, to the phase represented in the received message, φ_m. If the node’s phase value is less than the phase value in the received message, φ < φ_m, then the node updates its phase to match the received phase, φ = φ_m. The node then continues in a state of synchrony, with its phase still oscillating, but now from the phase value of φ_m. The node is now in synchrony with the node that sent the message of φ_m (see Fig. 12).

Fig. 12. Node updates its phase value to match the phase value received in message.

There is an allowed phase shift, ϕ_allowed, to accommodate latency in message transit and receipt, and to keep nodes from changing phase more often than necessary (Fig. 13). Nodes do not update their phase to match a greater phase value if the difference between the phases is less than ϕ_allowed. For example, suppose node 1 has phase value φ_1 and node 2 has phase value φ_2, and φ_1 < φ_2. If (φ_1 + ϕ_allowed > φ_2) or ((φ_2 + ϕ_allowed) mod T > φ_1), then node 1 does not update its phase to match φ_2 upon receiving a message of φ_2. In our implementation, ϕ_allowed is so small that the possible phase shift between the light pulsations of bike nodes is imperceptible.
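The wrap-around in this rule (phases live on a circle of circumference T) is easier to see in code. The following sketch follows the prose rule above; the threshold value is a hypothetical placeholder, not the one used in the authors' firmware.

#include <cstdint>
#include <algorithm>

const uint32_t T_MS = 2200;
const uint32_t ALLOWED_SHIFT_MS = 40;  // hypothetical value for illustration

// Smallest separation between two phases on the circle [0, T).
uint32_t phaseShift(uint32_t a, uint32_t b) {
  uint32_t d = (a > b) ? (a - b) : (b - a);
  return std::min(d, T_MS - d);
}

// A synchronized node with phase `mine` adopts a received phase only if the
// received phase is greater and the circular difference is at least the allowance.
bool shouldAdoptReceivedPhase(uint32_t mine, uint32_t received) {
  return (mine < received) && (phaseShift(mine, received) >= ALLOWED_SHIFT_MS);
}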

Fig. 13. There is an allowed phase shift ϕ_allowed.

Once a node updates its phase to match a greater phase received in a message, φ_m, it then broadcasts its new phase. Nodes in range of this new message may have been out of range of the original message, but these nearby nodes can now all synchronize around the new common phase φ_m. This simple protocol works as a mechanism for multiple swarms to merge and synchronize.

Moreover, whenever nodes come in proximity of each other’s messages, they will synchronize. Even when node i with phase value φ receives message φ_m < φ from node m, and node i does not update its phase to match φ_m, node i and node m will still synchronize. Since they are in message passing range, node m will receive the message broadcast by node i of φ > φ_m, and node m will then update its own phase to match φ. Fig. 14 illustrates various scenarios for receipt of the broadcast message.

Once nodes synchronize, minimal messages are required to keep them synchronized, as all nodes share the same period of oscillation, T. When enough time passes without a node receiving any messages, the node then leaves its synchronous state and returns to the out of sync state, where its phase stays steady at φ = 0 (and its lights cease to pulsate).

Consider the cases of Fig. 14 where nodes come in range of each other’s messages and synchronize. We let the reader extend these small examples to the larger network topology of nodes previously provided. The broadcast messages are minimal, and the synchronization rule set simple, and we consider this simplicity a feature. We demonstrate its implementation as an algorithm.

4.2 Algorithm

The implementation of our algorithm used for our working prototype is provided open source (https://github.com/aberke/city-science-bike-swarm/tree/master/Arduino/PulseInSync). Nodes execute their logical operations through a continuous loop. Throughout the loop, they listen for messages, as well as update their phase as time passes. Algorithm 1 and Algorithm 2 outline the loop operations.

This simple protocol and algorithm offer the following benefits across the network, with the only requirements being that all nodes in the network run loops with this same logic, and share the same fixed period.

– Nodes need not share a globally synchronized clock in order to synchronize their phases. Time can be kept relative to a node.
– Nodes need not share any metadata about their identity, nor need to know any information about other nodes, in order to synchronize. Unknown nodes can arbitrarily join or leave the network at any time while the network maintains its mechanisms for synchrony.



Fig. 14. Scenarios of nodes receiving broadcast messages and updating their state of synchrony.


Algorithm 1. Routine to update phase

  currentTime ← getCurrentTime()
  if node is inSync then
    timeDelta ← currentTime − lastTimeCheck
    phase ← (phase + timeDelta) % period
  else
    phase ← 0
  end if
  lastTimeCheck ← currentTime
  return phase

Algorithm 2. Main loop

  inSync ← FALSE
  if currentTime − lastReceiveTime < timeToOutOfSync then
    inSync ← TRUE
  end if
  phase ← updatePhase()
  phaseM ← receive()
  if phaseM not null then
    lastReceiveTime ← getCurrentTime()
    phase ← updatePhase()
    if (phase < phaseM) & (computePhaseShift(phase, phaseM) < allowedPhaseShift) then
      phase ← phaseM + expectedLatency
      lastTimeCheck ← lastReceiveTime
    end if
  end if
  broadcast(phase)
  phase ← updatePhase()
  updateLights(phase)

inSync ← FALSE if currentTime − lastReceiveTime < timeToOutOfSync then inSync ← TRUE end if phase ← updatePhase() phaseM ← receive() if phaseM not null then lastReceiveTime ← getCurrentTime() phase ← updatePhase() if (phase < phaseM) & (computePhaseShift(phase, phaseM) < allowedPhaseShift) then phase ← phaseM + expectedLatency lastTimeCheck ← lastReceiveTime end if end if broadcast(phase) phase ← updatePhase() updateLights(phase)

Addressing Scheme

A requirement of the system is that any two nodes must be able to communicate upon coming in proximity of one another, without knowing information about the other beforehand. Moreover, any new node that enters an existing network must be able to immediately begin broadcasting and receiving messages to synchronize with pre-existing nodes in the network. Thus the challenge is to accomplish this communication without nodes sharing identities or addresses. Because these nodes are broadcasting and receiving messages over radio, nodes cannot simply all broadcast and receive messages on the same channel, or else their messages will conflict and communication will be lost.

Methods have been developed to facilitate resource sharing among nodes in a wireless network such as our network of bike nodes (e.g. TDMA implementations [11]). These methods are designed to avoid the problems of nodes sending messages on the same channel at conflicting times by coordinating the timing at which messages are sent. The DESYNC algorithm [12] even supports channel sharing across decentralized networks of nodes that do not share a globally synchronized clock (such as our network), by nodes monitoring when other messages are sent, and then self-adjusting the time at which they send messages, until gradually the nodes send their messages at equally spaced intervals.

These strategies are not as well suited for our network of bike nodes, because its topology continuously changes (as new bikes join or leave the network, and as bikes pass each other, or collect at stoplights, or go separate ways), and nodes need to exchange messages as soon as they enter proximity of each other. In addition, immediately after a node updates its own phase to match a phase received in a message, it must broadcast its phase so that other nearby nodes can resynchronize with it. This immediate resynchronization would be hindered by a resource sharing algorithm that required a node to wait its turn in order to broadcast a message. Bike nodes should be able to continuously listen for messages sent by other nodes, and be able to broadcast messages at any time.

We designed and use an addressing scheme to handle these requirements. The scheme exploits the fact that when multiple nodes are in proximity of each other, the messages they broadcast are often redundant: when nodes are in message passing range, they synchronize and the messages they then broadcast contain the same information about their shared phase.

In our addressing scheme, we allocate N predetermined addresses, which we number as address 1, address 2, address 3, . . . , address N. All nodes in the network know these common addresses in the same way they all know the common period, T. We also consider our nodes as numbered: node 1, node 2, node 3, . . . Each node uses one of the N addresses to broadcast messages, and listens for messages on the remaining N − 1 addresses:

– node i broadcasts on address i,
– node i listens on address i + 1 mod N, address i + 2 mod N, . . . , address i + (N − 1) mod N.

For example, node 1 broadcasts on address 1, while node 2 broadcasts on address 2. Since node 1 also listens on address 2, and node 2 listens on address 1, the two nodes can exchange messages without conflict. Nodes determine their own node numbers by randomly drawing from a discrete uniform distribution over {1, 2, 3, . . . , N}, such that node i has a 1/N chance of choosing any i ∈ {1, 2, 3, . . . , N}.

When a node in the out of sync state comes in proximity of another node, there is a small (1/N) probability that the nodes share a node number and therefore will not be able to exchange messages. To overcome this issue, nodes in the out of sync state regularly change their node numbers by redrawing from the discrete uniform distribution and then re-configuring which addresses they broadcast and listen on based on their node number. This change allows two nearby nodes with conflicting node numbers and addresses to get out of conflict. If a node encounters multiple synchronized nodes, it needs only to have a non-conflicting node number with one of them in order to synchronize with all of them, since the synchronized nodes share and communicate the same phase messages.
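A small sketch of the address bookkeeping described above (radio library calls are deliberately omitted; the value of N and the random source are illustrative assumptions, not values from the paper):

#include <cstdlib>
#include <vector>

const int N = 6;  // assumed number of predetermined addresses

struct Addressing {
  int nodeNumber = 1;                 // drawn uniformly from {1, ..., N}
  int broadcastAddress = 1;           // node i broadcasts on address i
  std::vector<int> listenAddresses;   // the remaining N - 1 addresses

  void redraw() {
    nodeNumber = 1 + std::rand() % N;
    broadcastAddress = nodeNumber;
    listenAddresses.clear();
    for (int k = 1; k < N; ++k) {
      // address (i + k) mod N, mapped back into {1, ..., N}
      listenAddresses.push_back(((nodeNumber - 1 + k) % N) + 1);
    }
  }
};

An out-of-sync node would call redraw() periodically, so that two nearby nodes that happen to share a node number eventually stop conflicting.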

Discussion of Alternatives. We also considered an alternative synchronization scheme that would allow all nodes to share one common address to broadcast and receive messages. In this simpler alternative, nodes only broadcast messages when their phase is at 0. (Nodes in the out of sync state broadcast at random intervals.) Upon receiving such a message, a node sets its own phase to 0 and enters a state of synchrony with the message sender. Simplifying the message protocol in this way circumvents the issue of nodes sending messages over a shared channel and their messages conflicting. If two nodes do happen to send a message at the same time, then they must already be synchronized (they share a phase of 0 at the time of sending). Any other node in proximity that receives this broadcast will synchronize to phase 0 and then also broadcast its messages at the same time as the other nodes.

This message passing protocol has been studied and modeled in relation to pulse-coupled biological oscillators where the oscillation is episodic rather than smooth [7]. Examples include the flashing of a firefly, or the chirp of a cricket, where instead of the system interacting throughout the period of oscillation, there is a single “fire” (e.g. flash or chirp) event that occurs at the end of the period.

This simplified synchronization scheme works well for discrete episodic events among oscillators, and while it could work for our bike nodes, we chose not to use it because our bike nodes have a continuous behavior (Fig. 15). Because they update the amplitude of their light continually throughout their phase, synchronizing at phase 0 is as important as synchronizing at any other phase value. Moreover, this simplified messaging protocol would make the time to synchronization longer, dependent on the length of the period, T. Two nodes that come into proximity for the first time but that are already oscillating with phases that are out of synchrony with each other would not have the opportunity to synchronize until one of their phases reaches 0 again.

4.4 Faulty and Malicious Nodes

We note the unlikely case of faulty nodes, which broadcast messages to the bike swarm network without following the same protocol as other nodes. These faults may occur because one bike’s system breaks or was badly implemented, or may be due to malicious actors. These faulty nodes can destabilize the synchronization of nearby swarms. However, the issue will be spatially isolated to the swarms within broadcasting range of the faulty node, while the rest of the network continues to function successfully.


Fig. 15. The timing of message broadcasts in our synchronization scheme versus episodic message broadcasts in the simplified synchronization scheme.

5 Circuitry, Prototype Fabrication, and Tests

Our system design includes an integrated circuit. The circuit connects a low-cost radio to broadcast and receive the protocol messages, a microcontroller programmed to run our algorithm, and lights controlled by the microcontroller. The radio transmission range is limited by design in order to control swarm membership to only include nearby nodes (bikes). Our prototypes use nRF24 [13] radio transceivers to broadcast and receive the protocol messages without requiring individual nodes to pair. The nRF24 specification allows for software control of transmission range, which is used to constrain the spatial distance between connected nodes. The Arduino Nano [14] microcontroller was selected to run the synchronization protocol and algorithm. The other components in our circuit were used for the management of power and the pulsation of lights. The circuit schematic and Arduino code are open source (https://github.com/aberke/city-science-bike-swarm).

We implemented and tested our system for urban mobility swarms by fabricating a set of 6 prototypes. The prototypes strap on and off bicycles from a city bike-share program, and we rode throughout our city with them over a series of three nights. We tested various scenarios of bikes forming, joining, and leaving swarms, as well as swarms passing, and swarms merging, as shown in video footage that is available online: https://youtu.be/wUl-CHJ6DK0. Also available online is detailed photographic documentation of the prototype development and deployment: https://www.media.mit.edu/projects/bike-swarm.
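To make the range-limiting concrete, the sketch below shows a minimal radio setup assuming the widely used Arduino RF24 library for the nRF24 transceiver. The pin numbers, pipe address, and power level are illustrative assumptions, and the per-node addressing of Sect. 4.3 is omitted for brevity; lowering the transmit power level is one way to keep the effective swarm radius local.

#include <SPI.h>
#include <RF24.h>

RF24 radio(9, 10);                   // CE, CSN pins: illustrative choices
const byte kAddress[6] = "swarm";    // 5-byte pipe address (assumption)

void setup() {
  radio.begin();
  radio.setPALevel(RF24_PA_MIN);     // lowest transmit power: short, local range
  radio.openWritingPipe(kAddress);
  radio.openReadingPipe(1, kAddress);
  radio.startListening();
}

void loop() {
  if (radio.available()) {
    unsigned long receivedPhase = 0;
    radio.read(&receivedPhase, sizeof(receivedPhase));
    // ... handle the phase message per Sect. 4.1 ...
  }
}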

6

Conclusion

We designed a system for the urban environment that draws from swarming behavior exhibited in the natural environment. In this paper we presented urban

3

https://github.com/aberke/city-science-bike-swarm.



mobility swarms as a means to promote the use and safety of lightweight, sustainable transit. We described and demonstrated a system for their implementation, with a radio protocol, synchronization algorithm, and tested prototypes. The prototypes we designed are specific to synchronizing the lights of nearby bicycles in the dark. Riders within swarms collaboratively amplify the swarm’s effect and collective safety, yet coordination and formation of swarms requires no effort from the riders. The riders are automatically inducted into ad-hoc swarms when in proximity due to our simple, yet powerful system design. The system we implemented can be easily extended and applied to transit options beyond bicycles. More generally, our system treats individual riders as nodes in a decentralized network, and coordinates swarms as connected portions of the network, with a peer-to-peer message protocol and algorithm. Our design accommodates a dynamically changing network topology, as necessitated by the nature of an urban mobility network in which individuals are constantly moving, joining the network, or leaving altogether. Furthermore, the features of our decentralized design afford its flexible and secure implementation. There is no global clock and nodes communicate with minimal radio messages without sharing metadata, allowing new nodes to immediately coordinate with the system while maintaining an individual’s privacy. Moreover, our system can be deployed at scale, which we demonstrated by implementing it with simple, low-cost circuit and hardware components, and by testing with bikes from a city bike-share. Bike-share programs manage fleets of bikes distributed across cities and could deploy the system at city-scale. The system can be integrated into bicycles, or strapped on and off, as riders typically do with bike lights. Once deployed, our modular hardware and decentralized system design allows arbitrary bikes to form or further grow a network with ease.

References 1. De Hartog, J.J., Boogaard, H., Nijland, H., Hoek, G.: Do the health benefits of cycling outweigh the risks? Environ. Health Perspect. 118(8), 1109–1116 (2010) 2. Johansson, C., Lövenheim, B., Schantz, P., Wahlgren, L., Almström, P., Markstedt, A., Strömgren, M., Forsberg, B., Sommar, J.N.: Impacts on air pollution and health by changing commuting from car to bicycle. Sci. Total Environ. 584, 55–63 (2017) 3. BBC: Air pollution: Benefits of cycling and walking outweigh harms - study, May 2016. https://www.bbc.com/news/health-36208003. Accessed 25 Sept 2019 4. Buck, J.B.: Synchronous rhythmic flashing of fireflies. Q. Rev. Biol. 13(3), 301–314 (1938) 5. Buck, J., Buck, E.: Synchronous fireflies. Sci. Am. 234(5), 74–85 (1976) 6. Strogatz, S.H., Stewart, I.: Coupled oscillators and biological synchronization. Sci. Am. 269(6), 102–109 (1993) 7. Mirollo, R.E., Strogatz, S.H.: Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math. 50(6), 1645–1662 (1990) 8. Werner-Allen, G., Tewari, G., Patel, A., Welsh, M., Nagpal, R.: Firefly-inspired sensor network synchronicity with realistic radio effects. In: Proceedings of the 3rd International Conference on Embedded Networked Sensor Systems, pp. 142–153. ACM (2005)



9. International Transport Forum: Cycling, Health and Safety (2013) 10. Jacobsen, P.L.: Safety in numbers: more walkers and bicyclists, safer walking and bicycling. Inj. Prevent. 21(4), 271–275 (2015) 11. Miao, G., Zander, J., Sung, K.W., Slimane, S.B.: Fundamentals of Mobile Data Networks. Cambridge University Press, Cambridge (2016) 12. Degesys, J., Rose, I., Patel, A., Nagpal, R.: DESYNC: self-organizing desynchronization and TDMA on wireless sensor networks. In: Proceedings of the 6th International Conference on Information Processing in Sensor Networks, pp. 11–20. ACM (2007) 13. Nordic-Semiconductor: nRF24 series (2019). https://www.nordicsemi.com/Products/Low-power-short-range-wireless/nRF24-series. Accessed 25 Sept 2019 14. Wikipedia: Arduino—Wikipedia, the free encyclopedia (2019). http://en.wikipedia.org/w/index.php?title=Arduino&oldid=917538577. Accessed 25 Sept 2019

Using AI Simulations to Dynamically Model Multi-agent Multi-team Energy Systems D. Michael Franklin1(B), Philip Irminger2, Heather Buckberry3, and Mahabir Bhandari3

1 College of Computing, Kennesaw State University, Marietta, GA, USA [email protected] 2 Power and Energy Systems Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA 3 Building Technology Research and Integration Center, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Abstract. Energy systems are well known to be complex and intricate. As a result, many extant studies have used simplifications or generalizations that do not accurately reflect the nature of these complex systems. In particular, most HVAC systems are modeled as a single unit, or several large units, rather than as a hierarchical composite (e.g., as a floor rather than as a collection of disparate rooms). The net result is that the simulations are too generic to perform meaningful analysis, machine learning, or integrated simulation. We propose using a multi-agent multi-team strategic simulation framework called SiMAMT to better define, model, simulate, and learn the HVAC environment. SiMAMT allows us to create distinct models for each type of room, hierarchically aggregate them into units (like floors, or sections) and then into larger sets (like buildings or a campus), and then perform a simulation that interacts with each sub-element individually, the teams of sub-elements collectively, and the entire set in aggregation. Further, and most importantly, we additionally model another ‘team’ within the simulation framework - the users of the systems. Again, each individual is modeled distinctly, aggregated into sub-sets, then collected into large sets. Each user, or agent, acts on their own but with respect to the larger team goals. This provides a simulation that has much higher model fidelity and more applicable results that match the real world. Notice: This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). © Springer Nature Switzerland AG 2020 K. Arai et al. (Eds.): SAI 2020, AISC 1229, pp. 19–32, 2020. https://doi.org/10.1007/978-3-030-52246-9_2


Keywords: Artificial intelligence · Multi-agent systems · Strategy · Machine learning · HVAC · Building management systems

1


Introduction

Energy systems and thermodynamic interactions are intricate and complex [1,2]. Peak-shaving algorithms attempt to mitigate the high cost of operating HVAC systems by applying machine learning to discover moments where the HVAC control system can be reduced, shut down, or adjusted [6,10]. In truth, the system can never be truly shut down because air must always flow, but it can be approached. The overall effort is to learn the movements of the agents within the system to better understand their patterns of behavior. Once these patterns are understood, adjustments can be made to the control algorithm to adjust for peak shaving - reducing usage during peak moments of power utilization from HVAC operation. In this paper, we wish to produce a new paradigm for the modeling and simulation of high-demand HVAC systems. This new paradigm utilizes the SiMAMT framework [4], a system designed to simulate multi-agent, multi-team strategic interactions. This framework is utilized to create a high-fidelity simulation to be the basis of the machine learning occurring in the HVAC system [3]. We chose this simulation framework because much of the existing work in this area uses lower-fidelity, generalized simulations. These common methodologies use average customers progressing across average days within average temperature ranges, marginalizing the variation within the system. This creates a system poorly tuned to the nuances of day-to-day operations within a complex system. Further, they miss the impact of the interactive nature of these environments. In the general example, these other simulation tools, even when they are highly detailed, model the building as a whole or in simple zones, discarding the people coming and going, differing configurations of rooms and floors, and the subtleties of which windows get more sun throughout the day; rather, they just simulate the outside temperature, the inside thermostat setting, and then render the effective temperature within the building. We wish to model the individual elements of the buildings and the nuances of the people as thoroughly as possible to capture the most data that we can. Additionally, the simulation uses predictive data analytics and machine learning to determine maximal responses to events within the simulation [11]. Our goal is to improve on the existing modeling and simulation systems by including these factors. We wish to set up two teams - one team is the campus, comprised, hierarchically, of rooms, units of rooms, floors, and buildings; the other team is the student body, comprised, hierarchically, of the individuals, the groups of individuals, and the sets of groups. The two teams will interact with each other in the simulated system to produce a high-fidelity model of the HVAC system. Additionally, multi-agent systems are difficult to model and make predictions in [12], but even more so when we have hierarchical layers of agents and groups of agents in teams.


2


Background

There are many factors that must be considered when understanding the energy profile of a building. The EnergyPlusTM [8] modeling system is a highly-detailed building simulation tool developed by the Department of Energy (DOE). It factors in a wide variety of thermal loads from thermal envelopes, vents, air handlers, ceilings, floors, etc. It measures the thermal loads generated by each of these various inputs and calculates a total energy requirement to cool or heat the environment. This program has been used with great effect to model energy usage and HVAC system performance for several years. SiMAMT uses a set of models to declare the behavior of each disparate element of the simulated environment. The model-based approach allows for each agent being managed to hold varied, flexible, expressive, and locally powerful behaviors. These models can be designed as Finite-State Automatons (FSAs) or Probabilistic Graphical Models (PGMs). The system architecture allows for these models to be represented as either a diverse set of graphs, where each one represents a policy as a walk through a graph of possible sets of actions or choices, or as a set of multiple isomorphic graphs where the weights of the edges encode the decision matrices. Thus the system allows for multiple layers of hierarchical reasoning and representation where, at each level, there is an overarching directional policy guiding the collected actions of the group of agents to accomplish a shared goal. This allows for a shared, distributed intelligence while preserving the individual agent policies. Further, the large-scale policies flow downward (the coalition strategy flows down to the teams, the strategy directs each team uniquely, and the team policies direct each agent by assigning behaviors to each), while the information used for decisions flows upward (each agent collects observations and information and sends them to their team; the team collects, collates, and processes that information and sends higher-level observations upstream to the coalition). The output of the SiMAMT-based simulation is the total energy consumption of the campus. We will use EnergyPlusTM to calibrate our simulation to make sure that our version of the modeling and simulation works accurately. The total energy cost from the SiMAMT simulation is then compared to the EnergyPlusTM total cost to test the veracity of the system. Once this is confirmed, the system will then be pushed to learn the best algorithms to run the system most efficiently. These final results will be the final output of our tests.
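SiMAMT's actual implementation is not reproduced here; the following minimal sketch only illustrates, under invented class and method names, the hierarchical pattern described above, with strategy flowing downward and observations flowing upward.

```python
class Agent:
    """Leaf of the hierarchy: executes an assigned behavior and reports observations."""
    def __init__(self, name):
        self.name = name
        self.behavior = None

    def assign(self, behavior):   # policy flows downward
        self.behavior = behavior

    def observe(self):            # information flows upward
        return {"agent": self.name, "behavior": self.behavior}


class Team:
    """Middle layer: turns a team-level strategy into per-agent behaviors
    and aggregates agent observations for the coalition."""
    def __init__(self, name, agents):
        self.name = name
        self.agents = agents

    def direct(self, strategy):
        for agent in self.agents:
            agent.assign(strategy.get(agent.name, "default"))

    def report(self):
        return {self.name: [a.observe() for a in self.agents]}


class Coalition:
    """Top layer: holds the overarching strategy and collects team-level reports."""
    def __init__(self, teams):
        self.teams = teams

    def direct(self, coalition_strategy):
        for team in self.teams:
            team.direct(coalition_strategy.get(team.name, {}))

    def collect(self):
        merged = {}
        for team in self.teams:
            merged.update(team.report())
        return merged


# Toy usage: one 'building' team and one 'occupant' team, echoing the HVAC scenario.
building = Team("building", [Agent("room_101"), Agent("room_102")])
occupants = Team("occupants", [Agent("student_A"), Agent("student_B")])
campus = Coalition([building, occupants])
campus.direct({"building": {"room_101": "hold_70F"},
               "occupants": {"student_A": "class_schedule_1"}})
print(campus.collect())
```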

3

Methodology

HVAC systems have large diversity in operating systems, methods, and realizations, but, generally, they operate on a few standard principles. For this discussion, the HVAC system has a set point, the temperature that it is working to achieve. If the ambient temperature is 68 degrees and the set point is 70◦ , the HVAC system will work to increase the temperature to 70◦ . In higher-end HVAC systems there are systems that can regulate temperatures within certain



sub-zones by using baffles to increase or decrease the amount of regulated air flowing into that sub-zone. The effect is that the sub-zone can have a different temperature from this set point, and this delta of temperature is called the offset. There are limitations, of course, to the amount of offset achievable, but it is reasonable to have an offset of ±3°. From a function perspective, the HVAC system can be told to hold a certain temperature, adding cool air or warm air to maintain that temperature. This is the primary function of the thermostat, but not all systems have the ability to switch between cooling and heating. Many systems, mostly home HVAC systems, are designed to only do one or the other. For example, if a home thermostat function was set to Heat and the temperature set to 68°, the HVAC system would turn on the heat anytime the temperature went below 68° in an effort to maintain the set point. However, if the temperature were to rise to 80° the HVAC system would not react since it is in Heat mode. The converse is also true: if set to Cool and 68°, it will cool when the air is above 68°, but will not react when the temperature drops as far down as even 32°. Higher-end home HVAC systems and most commercial systems use an additional set of relays to shift the function of the HVAC system from Heat to Cool as appropriate. Many of these systems would have two set points, one for the Heat function set point and one for the Cool function set point. In this scenario the HVAC system will work to maintain a comfortable temperature range. This terminology is used throughout the rest of the paper. First, we had to model the occupancy of the building by creating a population of agents within the system. Recall that the overall system goal is to increase fidelity, so it was imperative that the population of agents within the system accurately match the real-world population that they are modeling. To that end, a study was made of the population with regard to classifying them into groups, understanding their schedules and preferences, and matching their behaviors to the tasks within the system. While this is the general process, specifics are provided here to ground the work. In the example scenario, a large campus HVAC management system, the population is the students on the campus. These students have a wide variety of schedules that vary between taking classes at various times, attending and participating in extracurricular activities, and socializing with other students. The study revealed that although there is a lot of variation within this population, there are several major groupings that emerge. The final result was a series of five different types of schedules that represent the major combinatorial variants. Each student within the population was then assigned to one of these five groups in order to gather a distribution that matched the student population. In the general sense, these population distributions can be modeled in any standard, normal, or algorithmic distribution, or they can be modeled by hand (if known) or learned through machine learning. In this case, the distribution was learned. The number of groups settled at five, but the algorithm tested many options for groupings to find the grouping with the lowest statistical error relative to the actual population. In addition to modeling their schedules, there was the need to model their preferences. As outlined before in brief, this meant understanding their preferences in relation to the set temperature.
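To make the set-point, offset, and Heat/Cool terminology defined above concrete, here is a minimal control-step sketch; the ±3° offset limit comes from the text, while the function signature and example temperatures are purely illustrative.

```python
MAX_OFFSET = 3.0  # maximum achievable sub-zone offset, in degrees

def hvac_action(ambient, set_point, mode, offset=0.0):
    """Return 'heat', 'cool', or 'idle' for one control step.

    mode is 'Heat', 'Cool', or 'Auto' (the dual-set-point case is handled by the caller);
    offset shifts the effective target for a baffled sub-zone, clamped to +/-3 degrees.
    """
    offset = max(-MAX_OFFSET, min(MAX_OFFSET, offset))
    target = set_point + offset
    if mode == "Heat":
        return "heat" if ambient < target else "idle"   # never cools, even at 80 degrees
    if mode == "Cool":
        return "cool" if ambient > target else "idle"   # never heats, even at 32 degrees
    # 'Auto': relays switch the function around the target
    if ambient < target:
        return "heat"
    if ambient > target:
        return "cool"
    return "idle"

# Examples from the text: Heat mode set to 68 degrees ignores an 80-degree room.
print(hvac_action(80, 68, "Heat"))   # -> 'idle'
print(hvac_action(66, 68, "Heat"))   # -> 'heat'
```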



Again, the study determined that the preferences fell within ±3◦ of the set point. The distribution of these preferences follows the normal distribution with most people preferring to be within one degree of the set point and few wishing to be three degrees away. This distribution is vital to the veracity of the model. Each of these components formed the high-fidelity model that constitutes the agent population, hierarchically modeled as individuals comprising teams comprising the population. Second, we had to model the buildings within the campus. As previously explained, this is modeled hierarchically from the room to the unit to the floor to the building to the campus. The main component of cost from the building perspective is heat load. This heat load, measured in BTUs, is the amount of heat transferred into the building from the external factors, such as sunlight, and internal factors, such as lighting and motors. The EnergyPlusTM system was utilized to model the loads induced into the system from the windows, roofs, floors, walls, ambient loads, etc. This modeling was exhaustive and every effort was made to include all heat loads into the system so that the fidelity of the model was maintained. This data was then fed into the simulation as the contribution to the overall cost function for each room in the system (each room, as indicated earlier, has individual elements that determine its contribution to the heat load). Each room was then aggregated, along with its heat load, into a unit of rooms that all share some of these characteristics (like the fact that they are South facing). These units add their shared heat load characteristics to the sum total of the rooms, and this continues on up the hierarchy to the floors. Each floor has its own characteristics, like being on the ground floor, or being the top floor, that also adds to or subtracts from the heat load. Again each element was modeled algebraically with the variables of outside temperature and incident sunlight left as unknowns. This provides a model for the whole building made up of many smaller models that all vary with the outside temperature and the incident sunlight on each individual unit. To model any given day, data is collected from NOAA [7] for the region of the world where the campus is located. This data is then played out across the simulation day to vary the heat loads across the day. As this happens, the heat load is generated by each room, then up to each unit, then up to the floor, then the building, and ultimately the campus. To be clear, this is not a calculated heat load derived en masse, but rather it is being aggregated by each individual unit being modeled throughout the day as temperatures and sunlight vary. This means that each slice of each day of an entire year can be utilized to produce a high-fidelity, high-accuracy simulation of the HVAC system for the campus. Third, we modeled the interactions. There were a few options for how the HVAC system could operate, so we modeled each type of interactive behavior. The first was a basic hold function where the HVAC system holds one temperature throughout the day, or through large portions of the day. The second was with a machine learning algorithm controlling the thermostat to adjust it throughout the day as heat load and temperature vary to use micro-adjustments to save money. In this second scenario there are no occupancy detectors and no



baffles to create sub-zones. The third was to introduce sub-zones with offsets so that there were unique preferences for each room in the building, and to look at how well the system could manage the HVAC with those dynamic load issues. Further, the movements of the students were factored in en masse, but not per room, because there were no occupancy sensors. The fourth scenario included the occupancy sensors so that the simulation could track the students as they moved to and from classes. With this additional data, the simulation could fully perform machine learning on the patterns of movements and behaviors of the agents within the simulation and thus much more closely model the real-world scenario. The initial setup of the simulation is shown in Fig. 1. Please note, due to privacy concerns, the image is small. The goal is to show a typical single-floor layout of a multi-floor building, similar to a dormitory. It shows one of the buildings being simulated on the campus. This particular building has six floors. The color of each room indicates its current occupancy setting. The small dots that are visible in this figure, and in Fig. 2, show the agents as they move about the building on their daily routine. Fig. 3 shows a single floor view. The simulation is an interactive, real-time application, so the user can view the entire campus, a single building, or a single floor, and see tabular data for each view showing energy consumption, cost estimates, current set points, distribution of offsets, and much more. The SiMAMT framework provides the interactive simulation and gathers the data for the final results.
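The hierarchical building-side model described earlier (rooms aggregated into units, floors, buildings, and the campus, each contributing heat load as outside temperature and incident sunlight vary) can be summarized in a short sketch. The per-room load formula below is only a placeholder stand-in for the EnergyPlusTM-derived models, and all coefficients are illustrative assumptions.

```python
def room_load(room, outside_temp, sunlight):
    """Placeholder per-room heat load (BTU) as a function of the two unknowns
    left as variables in the text: outside temperature and incident sunlight."""
    return room["base_btu"] + room["temp_coeff"] * outside_temp + room["sun_coeff"] * sunlight

def aggregate(node, outside_temp, sunlight):
    """Recursively sum heat loads: a node is either a room (leaf) or a container
    (unit, floor, building, campus) with children plus its own extra contribution."""
    if "children" not in node:
        return room_load(node, outside_temp, sunlight)
    total = node.get("extra_btu", 0.0)  # e.g. a ground-floor or top-floor contribution
    for child in node["children"]:
        total += aggregate(child, outside_temp, sunlight)
    return total

# Tiny example: two south-facing rooms in one unit, on one (top) floor of one building.
campus = {"children": [{                       # building
    "children": [{                             # floor
        "extra_btu": 500.0,                    # top-floor roof load
        "children": [{                         # unit
            "children": [
                {"base_btu": 200.0, "temp_coeff": 40.0, "sun_coeff": 1.5},
                {"base_btu": 180.0, "temp_coeff": 35.0, "sun_coeff": 1.2},
            ]
        }]
    }]
}]}

# One time slice of a simulated day, e.g. 85 degrees outside and 600 units of sunlight.
print(aggregate(campus, outside_temp=85.0, sunlight=600.0))
```

Evaluating this aggregation for every time slice of a day, with weather data driving the two inputs, mirrors the bottom-up accumulation described in the text.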

Fig. 1. Initial single building view of the simulation

There are some additional real-world considerations that must be factored in. While a simpler machine learning algorithm might just turn off the HVAC



Fig. 2. Initial single building progress view of the simulation

Fig. 3. Single floor view of the simulation

system during off-peak or reduced load times, the real-world system must keep some air circulating to avoid bad air quality and mold and other stagnant air problems. There are also issues with hard-locking compressors and other mechanical components when the system switches rapidly back and forth from Heat to Cool. While this may be fine in a computer simulation, it is not sustainable in the real world because of the damage to the system. As a result, we factored in reset times and switchover delays (like using a small ε value between function changes) to avoid such issues. It is also important to recognize the reality of the exponential nature of heat exchange. Allowing the environment to reach an untenable temperature will require a disproportionate amount of energy to



resolve. This non-linear factorization must be a part of any energy calculations because it is how the real world works, no matter how inconvenient it may be for the simulation. In all, these factors reduce the efficiency of the resultant simulation by 4–9% in aggregation, but they increase veracity by over 10%, and that is critical to us. Additionally, it is critical to note that the simulation system introduces statistical noise and variance into the process to produce realistic results. There are many small but significant factors that affect the performance of real-world systems, and these factors need to have a place in any simulation that attempts to model such real-world scenarios. Once the models were completed, the simulation was run with a variety of settings and factors. The factors are covered in the Experiments section, but the reinforcement learning is described here. While the SiMAMT framework allows for many types of learning, and even supports mixing learning styles, we will illustrate with a variant of SARSA-λ originally created in [5]. Each layer of the agent model is performing SARSA-λ, though with different ranges and setups. Each layer is learning according to the update function shown in Eq. (1). This updates the Q(s, a) table by utilizing the reward r for moving to the next state and the value provided from taking the chosen next action (a′) in the next state (s′) (stored as Q(s′, a′)). It is mitigated by the learning rate, α. The algorithm for the updates and the movement tracking history is shown in Fig. 4. This shows the step-by-step updates given in Eq. (2). The update amount, δ, is calculated in Eq. (3). The e-table is incremented for every space that is visited, according to Eq. (4). The decaying updates in the e-table are applied according to Eq. (5), using the discount rate γ and the decay rate λ. This results in an eligibility trace (a history of decaying rewards based on the previously visited, and, thus, eligible spaces that can receive an update/reward). These traces are similar to those shown in Fig. 5.

Q(s, a) = Q(s, a) + α(r(s′, a′) + γQ(s′, a′) − Q(s, a))  (1)

Q(s, a) = Q(s, a) + αδe(s, a)  (2)

δ = r(s′, a′) + γQ(s′, a′) − Q(s, a)  (3)

e(s, a) = e(s, a) + 1  (4)

e(s, a) = γλe(s, a)  (5)

While this is only one example, it shows the flexibility of the framework: even a normally lower-performing algorithm like SARSA-λ can, with modification, provide in-depth insight into the modeling of complex systems. This framework also works with any temporal difference learning or Q-Learning technique, as well as modified genetic algorithms and deep learning [4].



Fig. 4. SARSA-λ algorithm [9]

Fig. 5. SARSA-λ eligibility traces [9]

4

Experiments

First, it was critical to establish veracity, ensuring that the simulation produced the same final total heat loads and total costs that the subject system did. Using a model day for consistency, we compared the final total heat cost for both the simulation and the system to make sure that we had calibrated our system correctly. The experiment was run successfully and confirmed that the simulation was accurate to the real-world heat loads. Fig. 6 shows that the simulation results are right in line with the EnergyPlusTM results, showing only about an 8% variance across the entire simulation. Importantly, the shape of the day’s energy usage matches in both iterations. An important note: this graph is below the 0 line,

Fig. 6. SiMAMT vs EnergyPlus



meaning that lines higher on the graph represent a lower energy cost, and those on the bottom of the graph represent a larger energy cost (the rate is expressed in negative numbers). Having verified the veracity of the system, the next step was to examine the effect of the hold temperature on the total energy usage. The model day was fixed to examine the effects of only one variable at a time. As mentioned previously, there is still some variation because of the statistical noise and variance of the simulated elements, but the consistency is still there across the variety of experiments. In this experiment the hold temperature was incremented slowly across the range from 70° to 80° in two-degree increments. Each full day was run multiple times at each set point, and the data was recorded for each time slice. The results are shown in Fig. 7 and show that the energy usage scales linearly across the set points. This finding is important as it indicated one of the first surprising results. While, predictably, total energy cost was higher at each set point, it did not increase beyond linearity, so any set point is a viable option without additional penalties beyond the linear increase in cost. The consistent shape reaffirms the energy usage pattern of the day and further validates the model. Knowing the relationship across the set points, we wanted to allow the algorithm to learn a control schema to decrease the costs, if possible, by making small micro-adjustments throughout the day. The machine learning algorithm, a variant of Q-learning that uses Reinforcement Learning to make adjustments over time, was rewarded for lower energy costs. As a result, even with no other data, it learned to be approximately 5% more efficient than the standard hold control function. This was an exciting result because it showed that even placing the thermostats under computer or algorithmic control was already paying dividends. Fig. 8 shows the energy costs across the day under the standard hold function versus under the learned model control. This model, again, was trained using reinforcement learning, but based on the models of the system already discovered during the initial study phase. These models were tuned, so they took the domain knowledge of the hold function and learned from that point (rather than learning from scratch). The result was that it took much less

Fig. 7. Energy cost by variation of setpoint



time (fewer iterations) to see positive results. This was both a surprising and welcome finding, and a next step in the progression of algorithmic control for the HVAC system.

Fig. 8. Energy cost of hold vs learned

Building on this success, the next step was to include occupancy sensor data in the simulation. We ran the same day of temperature and sunlight data and added in the ability to learn the habits of the occupants as the simulation progressed. Initially, for this experiment, it was a reactive system. There was a three-stage system for determining occupancy to make it more realistic for real-world usage. A naive algorithm would perhaps just shift to ‘Away’ mode immediately upon exit, but if the agent returned immediately it would shift right back. This is not healthy for the system, though it would make for more impressive algorithmic results; however, our goal was fidelity. When there was at least one occupant in the room, the room was ‘Occupied’. When the last person left the room it would shift to ‘Recently Occupied’. After remaining unoccupied for a set period of time (e.g., 15 min), it would shift to ‘Unoccupied’. The algorithm makes no adjustments to rooms until this final state, preserving the integrity of the system. The results, shown in Fig. 9, show the dramatic energy savings achievable once occupancy is used. Recall that the graphs show negative numbers, so the higher the line the more efficient the algorithm and the lower the total cost.
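A minimal sketch of the three-stage occupancy logic described above follows; the 15-minute dwell time comes from the text, while the class name, time units, and update interface are illustrative assumptions.

```python
UNOCCUPIED_DELAY = 15 * 60  # seconds a room must stay empty before setbacks apply

class RoomOccupancy:
    """Tracks 'Occupied' -> 'Recently Occupied' -> 'Unoccupied'; the controller
    only applies temperature setbacks once the final state is reached."""
    def __init__(self):
        self.count = 0
        self.state = "Unoccupied"
        self.empty_since = 0.0

    def update(self, now, entered=0, left=0):
        self.count += entered - left
        if self.count > 0:
            self.state = "Occupied"
        elif self.state == "Occupied":          # the last person just left
            self.state = "Recently Occupied"
            self.empty_since = now
        elif self.state == "Recently Occupied" and now - self.empty_since >= UNOCCUPIED_DELAY:
            self.state = "Unoccupied"
        return self.state

room = RoomOccupancy()
print(room.update(0, entered=1))   # Occupied
print(room.update(60, left=1))     # Recently Occupied
print(room.update(300))            # still Recently Occupied
print(room.update(1000))           # Unoccupied (empty for at least 15 min)
```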

Fig. 9. Energy cost with and without occupancy sensors



Once we had proven these methodologies via these initial experiments, we moved on to modeling a larger series of days. In the next experiment, the algorithm ran for 45 days and learned as it went. This experiment was run for three trials in one instance and four trials in the second. The first instance used an algorithm for Maximum Offset. In this method, unoccupied rooms were shifted to the maximum allowable offset temperature (e.g., +3° offset above set point). Once the room became occupied, the system would then reactively shift to the desired offset for the occupant, moving back towards that temperature for them. The prediction was that this schema would produce the highest savings because it was the least concerned with comfort and was rewarded for saving money, not creating comfort. The surprising comparative results will follow the second part of this experiment. For now, Fig. 10 shows the results of the Max Offset algorithm for the total energy costs across the 45 days. The three trials show the progression of the learning (Energy 1 is the first trial, Energy 2 builds on that, etc.). Next, the AI shifted from reactive to proactive. The reward for this machine learning was based on Maximum Comfort (keeping the agents as comfortable as possible, keeping the room temperature closest to their desired offset). In this modality, the AI learns the occupancy behaviors of the agents within the system. Now, once the room is unoccupied, the learning offers new insight into the next behavior. As this unoccupied time progressed, the room would warm up until the algorithm predicted the return of the occupant (part of the larger artificial intelligence within the SiMAMT system to learn behaviors of the agents within the simulation, predict those behaviors, and adapt the system to proactively adjust based on predicted actions). The algorithm calculates the time needed to return the room to the desired temperature from its current temperature and starts the cooling process that long before the predicted arrival of the occupant. If it would take 15 min to return the room to the occupant’s offset temperature, the HVAC system would start cooling 15 min before the occupant’s predicted arrival. The results of this algorithm, shown in Fig. 11, were predicted to be better than the normal offset, but not as good as the Max Offset formulation. However, the

Fig. 10. Energy cost of max offset algorithm



results were a surprise - the Max Comfort algorithm actually uses less energy than the Max Offset algorithm. The four trials show the progression of the algorithm as it learns. The learning rate is also shown in the next figure, Fig. 12, to indicate the progression of the learning over the four trials.
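The proactive Max Comfort behavior just described amounts to a small scheduling rule: given a predicted arrival time and an estimated recovery rate, start cooling just early enough to hit the occupant's offset on arrival. The sketch below illustrates that rule; the recovery-rate model, function name, and numbers are illustrative assumptions, not the learned model inside SiMAMT.

```python
def precool_start_time(predicted_arrival, current_temp, desired_temp, degrees_per_minute=0.2):
    """Return the minute at which cooling should begin so the room reaches the
    occupant's desired temperature by the predicted arrival time."""
    degrees_to_recover = max(0.0, current_temp - desired_temp)
    lead_minutes = degrees_to_recover / degrees_per_minute
    return predicted_arrival - lead_minutes

# Example echoing the text: a room drifted 3 degrees warm while unoccupied and the
# occupant is predicted to return at minute 480; at 0.2 deg/min the system needs a
# 15-minute lead, so cooling starts at minute 465.
print(precool_start_time(480, current_temp=73.0, desired_temp=70.0))  # -> 465.0
```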

Fig. 11. Energy cost of max comfort algorithm

Each of the progressive experiments proved in greater detail that the system works and that a computer-controlled HVAC system with predictive AI will consistently reduce energy consumption and save money.

Fig. 12. Learning rate of the max comfort algorithm

Table 1. Total energy consumption comparison

Cooling schema              Total energy   Percent savings
Baseline                    −54,813.30     —
Predictive w Max Offset     −38,891.46     29.0%
Predictive w Max Comfort    −36,631.41     33.2%


5


Conclusions

These experiments show that the SiMAMT multi-team multi-agent framework can produce reliable, repeatable results. The framework is shown to model both the population of agents and the buildings accurately and with high-fidelity. Further, the learning algorithms are effective at learning, the AI is effective at predicting, and the simulation framework can model an environment for learning, reacting, predicting, and controlling the HVAC system of the future. The final results of the simulation show that the potential energy savings for this fully-modeled system, with both occupancy sensors and predictive analytics, can reduce energy consumption (and costs) by over 30%, as shown in Table 1.

References 1. Cook, D.: Discovering activities to recognize and track in a smart environment. IEEE Trans. Knowl. Data Eng. 23(4), 527–539 (2011) 2. Cook, D.: Learning setting-generalized activity models for smart spaces. IEEE Intell. Syst. (2011, to appear) 3. Franklin, D.M.: Strategy inference in multi-agent multi-team scenarios. In: Proceedings of the International Conference on Tools for Artificial Intelligence (2016) 4. Franklin, D.M., Hu, X.: SiMAMT: a framework for strategy-based multi-agent multi-team systems. Int. J. Monit. Surv. Technol. Res. 5, 1–29 (2017) 5. Franklin, D.M., Martin, D.: eSense: BioMimetic modeling of echolocation and electrolocation using homeostatic dual-layered reinforcement learning. In: Proceedings of the ACM SE 2016 (2016) 6. Leemput, N., Geth, F., Claessens, B., Van Roy, J., Ponnette, R., Driesen, J.: A case study of coordinated electric vehicle charging for peak shaving on a low voltage grid. In: 2012 3rd IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), pp. 1–7, October 2012 7. US Department of Commerce: National Oceanic and Atmospheric Administration (2019) 8. Department of Energy: EnergyPlus - A Whole Building Modeling Software (2018) 9. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, Chap. 4, 5, 8 (1998) 10. Wang, Z., Wang, S.: Grid power peak shaving and valley filling using vehicle-to-grid systems. IEEE Trans. Power Delivery 28(3), 1822–1829 (2013) 11. Weber, B.G., Mateas, M.: A data mining approach to strategy prediction. In: 2009 IEEE Symposium on Computational Intelligence and Games, pp. 140–147, September 2009 12. Yang, E., Gu, D.: Multiagent reinforcement learning for multi-robot systems: a survey. Department of Computer Science, University of Essex, Technical report (2004)

Prediction of Cumulative Grade Point Average: A Case Study Anan Sarah(B) , Mohammed Iqbal Hossain Rabbi, Mahpara Sayema Siddiqua, Shipra Banik, and Mahady Hasan Independent University, Bangladesh, Dhaka, Bangladesh [email protected], [email protected], [email protected], {banik,mahady}@iub.edu.bd

Abstract. Cumulative Grade Point Average (CGPA) prediction is an important area for understanding the tertiary education performance trends of students and identifying the demographic attributes needed to devise effective educational strategies and infrastructure. This paper aims to analyze the accuracy of CGPA prediction of students resulting from predictive models, namely the ordinary least square model (OLS), the artificial neural network model (ANN) and the adaptive network based fuzzy inference model (ANFIS). We have used standardized examination (Secondary School Certificate and High School Certificate) results from secondary and high school boards and current CGPA in respective disciplines of 1187 students from Independent University, Bangladesh from the period of April 2013 to April 2015. Evaluation measures such as mean absolute error, root mean square error and coefficient of determination are used to evaluate the performances of the above-mentioned models. Our findings suggest that the mentioned predictive models are unable to predict CGPA values of the students accurately with currently used parameters. Keywords: Prediction · CGPA · Classical prediction model · Soft computing models · Mean square error · Root mean square error · Coefficient of determination

1 Introduction

Economic, technological, and innovative growth of a society is vastly dependent on the ratio of the population receiving tertiary education. To achieve sustainable growth in industry, both in governmental and private sectors, employers search for a highly skilled workforce. Tertiary education ensures the increased individual specialized skills necessary for sustainable economic growth. Cumulative Grade Point Average (CGPA) is a reliable performance metric of a student’s overall tertiary academic achievement. CGPA measures students’ performance by calculating their average grade points obtained from all courses that the students have completed. Academic administrators need to understand the student demographic to provision effective strategies for improvement of academic curriculum and infrastructure. © Springer Nature Switzerland AG 2020 K. Arai et al. (Eds.): SAI 2020, AISC 1229, pp. 33–42, 2020. https://doi.org/10.1007/978-3-030-52246-9_3



It is imperative for the administration to identify the crucial underlying factors and their impact on an individual student’s CGPA. Identifying students who are at risk of academic failure and formulating strategies to reduce that risk will help increase the academic success rate. In this context, it has proven particularly important to construct a predictive model for student academic performance. In this paper, our plan is to find an efficient predictive model for predicting the CGPA of Independent University, Bangladesh (IUB) students. Predictive models will be constructed based on the following parameters: (i) prior performances of students in standardized board examinations such as the SSC (secondary school certificate) and the HSC (high school certificate), and (ii) university admission test results (Mathematics and English). We have used the Ordinary Least Square (OLS) method, the ANN model and the ANFIS model to estimate IUB students’ CGPA [2, 12]. To our knowledge, this type of study has not yet been conducted in the context of our country, Bangladesh. Thus, we believe that this study will make an important contribution to the CGPA forecasting literature. The paper gives a brief description of related work on prediction models used to predict CGPA in the literature review section, which is followed by the methodology of the selected predictive models in Sect. 3. In Sect. 4, we describe the experimental design of the predictive models. The next section, Results and Discussion, presents the findings made by the implemented models. The paper concludes with the scope of subsequent work.

2 Background Details and Related Work

Numerous works [1–7] have been done in the literature associated with CGPA prediction and improving student learning based on techniques such as neural networks, Bayesian models, maximum-weight dependence trees and many others. Hien and Haddawy [1] used a Bayesian model to predict CGPA based on Master’s and Doctoral applicants’ backgrounds at the time of admission, with the mean absolute error and relative mean square as evaluation measures. Their results show that the Doctoral model performed better as compared to the other model. Oladokun et al. [2] built an artificial neural network (ANN) model to determine student performance based on several factors that affect a student’s academic results. The accuracy of the artificial neural network was 74%. The ANN topology was constructed on a multilayer perceptron, and the network’s performance was measured with the mean square error. Wang and Mitrovic [3] also established a model to foresee students’ results based on an ANN. A feed-forward ANN with four inputs, a single hidden layer, and a single output, where delta-bar-delta back-propagation (BP) and a linear tanh transfer function were used, was applied to test the accuracy of the prediction, which was 98.06%. Gedeon and Turner [4] used a full causal index method based on an ANN to calculate the probable concluding grade to be achieved by a student from their current performance and partial marks. A back-propagation trained feed-forward ANN was trained to perform this experiment on an array of 153 samples, which were the class results of an undergraduate Computer Science subject. The applied model shows correct output for 94% of cases. Fausett and Elwasif [5] trained an ANN model to foresee students’ performance in an assignment trial. Two ANNs, namely back-propagation and counter-propagation, were trained to anticipate students’ reasonable performance in a



subject based on their placement test responses. Their findings show that the BP networks achieved a very high level of accuracy in predicting student performances. Zollanvari et al. [6] created a predictive model of GPA with a maximum-weight first-order dependence tree structure. The assembled model classifies the set of training data with 82% precision. Rusli et al. [7] created three predictive models (namely logistic regression, ANN and an adaptive network-based fuzzy inference system (ANFIS)) based on students’ previous performance and the first semester’s CGPA of the undergraduate degree. The models’ efficiency was estimated by the root mean squared error, and their results show that the ANFIS is superior to the other models. To our knowledge, these are the studies available in the recent literature.

3 Proposed Approach

In this case study, three methods have been proposed due to the nature of the data. These forecasting models are also broadly used in diverse scenarios [14, 15]. The methods are the Ordinary Least Square (OLS) method, the Artificial Neural Network (ANN) method, and the Adaptive Network based Fuzzy Inference (ANFIS) method; this linear regression model and the two neural network models were proposed in order to identify the algorithm that works best for the predictive model. The data set used in this study is the CGPA (which measures students’ performance by calculating their average grade points obtained from all courses that the students have completed) of IUB students dated from April 2013 to April 2015, whose number of credits completed is greater than or equal to 90 [Data source: IUB database (www.iras.iub.edu.bd)]. The data set consists of the students’ GPA in the SSC (SSC_GPA) and HSC (HSC_GPA) examinations and the university admission test marks in English (IUBAENG_S) and Mathematics (IUBAMAT_S), along with their current CGPA. The size of the data set is 1187. Before applying our selected prediction models to predict CGPA, we have created a numerical summary (to give an idea about basic properties of the considered raw data sets), and the results are tabulated in Table 1.

Table 1. Numerical summary of the considered data set

3 Proposed Approach In this case study three methods have been proposed due to the nature of the data. These forecasting models are also broadly used in diverse scenarios [14, 15]. The methods are The Ordinary Least Square (OLS) method, The Artificial Neural Network (ANN) Method, The Adaptive Network based Fuzzy Inference (ANFIS) Method, this linear regression model and two neural networking were proposed in order to perceive the algorithm that works best for the predictive model. The data set used in this study CGPA (measures students’ performance by calculating their average grade points obtained from all courses that the students have been completed) of IUB students dated from April 2013 to April 2015, whose number of credits completed is greater than or equal to 90 [Data source: IUB database (www.iras.iub.edu.bd)]. Data set consists of the student’s GPA of SSC (SSC_GPA) and HSC (HSC_GPA) examinations, university admission test marks of English (IUBAENG_S) and Mathematics (IUBAMAT_S) along their current CGPA. The size of the dataset is 1187. Before applying our selected predicted models to predict CGPA, we have created a numerical summary (to give an idea about basic properties of the considered raw data sets) and results are tabulated in the Table 1. Table 1. Numerical summary of the considered data set CGPA

IUBAENG_S

IUBAMAT_S

HSC_GPA

SSC_GPA

Minimum

1.90

8.00

0.00

3.20

3.44

Maximum

3.99

48.00

50.00

5.00

5.00

Mean

2.81

24.78

20.05

4.51

4.65

Standard deviation

0.43

7.65

8.59

0.46

0.39

Skewness

0.34

0.63

0.75

−0.57

−1.01

Correlation with CGPA



0.30

0.11

0.32

0.34

From Table 1, it is observed that the minimum CGPA is 1.90 and the maximum CGPA is 3.99. We found, for example, that most students’ CGPA is around 2.81. To understand clearly whether every student’s CGPA is equal to 2.81 or not, we have



calculated the standard deviation of CGPA. We noticed that CGPA varies from the mean value of 2.81; so, within one standard deviation, we may conclude that CGPA ranges from 2.38 to 3.24. In addition, to understand whether our considered variables are symmetric or not, we have calculated the coefficient of skewness. For example, the skewness of CGPA is 0.34, which means that CGPA is positively skewed. More clearly, we may say that most of the students’ CGPAs are below 2.81. The skewness of HSC_GPA is −0.57, meaning that HSC_GPA is negatively skewed, which means that most of the students’ HSC_GPAs are over 4.51. To understand the relationship of CGPA with the other considered independent variables, we have used the graphical method (scatter diagrams [see Fig. 1(a–d)]) and numerically calculated the coefficient of correlation. For example, Fig. 1(c) shows that a weak positive relationship exists between CGPA and SSC_GPA. More clearly, if we select one student randomly, we can conclude that if that student’s SSC score increases, there is a 34% chance that the CGPA might increase.
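The numerical summary in Table 1 can be reproduced with a few lines of code; the sketch below assumes the IUB records have been exported to a CSV file (a hypothetical file name) and loaded into a pandas DataFrame with the column names used above.

```python
import pandas as pd

# Assumed layout: one row per student with the five columns used in this study.
df = pd.read_csv("iub_students.csv")  # hypothetical export of the IUB data set
cols = ["CGPA", "IUBAENG_S", "IUBAMAT_S", "HSC_GPA", "SSC_GPA"]

summary = pd.DataFrame({
    "Minimum": df[cols].min(),
    "Maximum": df[cols].max(),
    "Mean": df[cols].mean(),
    "Standard deviation": df[cols].std(),
    "Skewness": df[cols].skew(),
    "Correlation with CGPA": df[cols].corrwith(df["CGPA"]),
}).T.round(2)

print(summary)
```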

Fig. 1. (a) Scatter diagram between CGPA and IUBAMAT_S, (b) scatter diagram between CGPA and IUBAENG_S, (c) scatter diagram between CGPA and SSC_GPA, (d) scatter diagram between CGPA and HSC_GPA (each a considered independent variable)

Figure 2 represents the general approach that was used for all the methods. The beginning step was to collect the data and filter them as needed. Then the filtered data were trained and tested using the three proposed methods. The results, i.e., the accuracy levels of the proposed methods, were then compared and analyzed to make a decision.

[Fig. 2 flowchart: Data Collection → Data Filtering → Training and Testing with OLS, ANN and ANFIS → Comparing the results → Decision]

Fig. 2. General approach used for the methods

The widely used forecasting models applied to predict CGPA are briefly discussed as follows:

3.1 The Ordinary Least Square (OLS) Model

This is a technique used to estimate the unknown parameters of a population linear regression model. To understand this model clearly, consider the following model: CGPA_i = β0 + X_i β1 + e_i, i = 1, 2, . . . , n, where CGPA is the dependent variable, β0 is a constant, X is a vector of selected independent variables including SSC_GPA, HSC_GPA, IUBAENG_S and IUBAMAT_S, β1 is the vector of parameters of the independent variables, and e_i ~ iid(μ, σ²), which means e_i is independent and identically distributed (iid) with mean μ and variance σ². In matrix form, we can rewrite the above equation as CGPA = Xβ + e. Our target is to estimate β of the above model based on the collected data sets by minimizing the error sum of squares with respect to the parameters, i.e., ∂(eᵀe)/∂β = 0. Solving the above equations, we get β̂ = (XᵀX)⁻¹XᵀCGPA. Thus, the estimated model becomes ĈGPA = Xβ̂, the predicted CGPA based on our selected independent variables.

3.2 The ANN Model

Culloch and Pitts [8] proposed this model, which contains the following processing functions: (i) receiving inputs, (ii) allotting proper weight coefficients to the inputs, (iii) computing the weighted sum of the inputs, (iv) comparing this sum with some threshold, and (v) producing a suitable output value (see Fig. 3). In Fig. 3, a configuration of an ANN with 1 input layer, two hidden layers (with a sufficient number of neurons) and 1 output layer is also presented for the reader’s better understanding.



Fig. 3. An ANN design

The net input is defined as: n = WX + b, where R is the number of units in the input vector and N is the number of neurons in the hidden layers. The training algorithm is standard back-propagation, which uses gradient descent to minimize the error over all training data. During training, each desired output CGPA is compared with the actual output CGPA and the error is computed at the output layer. The backward pass is the error back-propagation and the adjustment of weights. Thus, the network is adjusted based on a comparison of the output CGPA and the target until the network output CGPA matches the target. After the training process has ended, the network with the specified weights can be used for testing on a set of data different from that used for training, which will be used to predict CGPA based on our selected independent variables. For details see Culloch and Pitts [8].

3.3 The Adaptive Network Based Fuzzy Inference (ANFIS) Model

Based on fuzzy set theory and fuzzy logic, Jang [9] developed this very important soft computing forecasting model in the forecasting literature. A brief description of this model is as follows (for details see Jang [9]): ANFIS is a mixture of an ANN and a fuzzy inference system (FIS) in such a way that the ANN learning algorithm is used to find the parameters of the FIS. An ANFIS architecture is presented in Fig. 4, which shows that this design has 5 layers (1 input layer, 3 hidden

Fig. 4. An ANFIS architecture



layers that represent membership functions and fuzzy rules, and 1 output layer). Usually, ANFIS uses the Sugeno FIS model. In Fig. 4, the circular nodes represent fixed nodes, whereas the square nodes are nodes that have parameters to be learnt. Each layer in this figure is associated with a particular step in the FIS (for details see Jang [9]). The output structure can be rearranged as CGPA = XW, where X = [W̄1x, W̄1y, W̄1, W̄2x, W̄2y, W̄2] and W = [p1, q1, r1, p2, q2, r2]ᵀ (for details see Jang [9]). When input-output training patterns exist, the vector W can be solved for using the ANFIS learning algorithm.

4 Results

To develop a predictive model that works efficiently to estimate CGPA based on our selected inputs, a trial-and-error approach is used. Two different sets of data are used for training and testing (for details, see [13]). To compare the performance of ANFIS, CGPA is also predicted using the ANN and the OLS models. An ANN topology of 4:10:6:1, a learning rate of 0.15 and a momentum parameter of 0.98 were chosen using the trial-and-error method. The learning rate parameter controls the step size in each iteration (for details, refer to [10, 11]). The momentum parameter avoids getting stuck in local minima or slow convergence (for details see [12]). We also performed the same prediction using the OLS model, minimizing the error sum of squares with respect to each parameter for the considered inputs. The selected prediction models’ performances have been measured by the mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), root mean square percentage error (RMSPE) and also by the coefficient of determination R². MAE (MAPE) and RMSE (RMSPE) are used to measure the accuracy of prediction by representing the degree of scatter. R² is a measure of the accuracy of prediction of the trained models. Lower values of MAE (MAPE) and RMSE (RMSPE) and higher R² values indicate better prediction (for details see [12]). In Table 2, we report the performances of the different considered prediction models achieved for the CGPA values using the selected error measures. To aid understanding, the selected error measures are visualized in Figs. 5, 6 and 7 respectively. It is observed that, for both error measures, the ANFIS prediction model has the lowest MAPE/RMSPE as compared to the other selected prediction models, ANN and OLS. Figure 7 presents the values of the coefficient of determination for all chosen prediction models. We found that the R² value for ANFIS (indicating a prediction accuracy of about 29%) again outperforms ANN and OLS [12]. However, given these low values, it may be concluded that even the ANFIS prediction model cannot be used to predict CGPA accurately based on our given inputs.
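A minimal sketch of the OLS estimate from Sect. 3.1 and the error measures used in Table 2 follows. It assumes NumPy arrays X (independent variables with an intercept column) and y (CGPA) split into training and test portions; the random data stand in for the SSC/HSC/admission-test features, and this is not the MATLAB-toolbox pipeline used in the study.

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS: beta_hat = (X^T X)^(-1) X^T y (Sect. 3.1)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def evaluate(y_true, y_pred):
    """MAE, MAPE, RMSE, RMSPE and R^2 as used to compare the models."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mape = 100 * np.mean(np.abs(err / y_true))
    rmse = np.sqrt(np.mean(err ** 2))
    rmspe = 100 * np.sqrt(np.mean((err / y_true) ** 2))
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "RMSPE": rmspe, "R2": r2}

# Toy usage with synthetic data: 100 students, 4 features plus an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 4))])
y = X @ np.array([2.8, 0.1, 0.02, 0.1, 0.12]) + rng.normal(0, 0.4, 100)
beta = ols_fit(X[:80], y[:80])
print(evaluate(y[80:], X[80:] @ beta))
```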


Table 2. Performance measures of selected systems.

            OLS                  ANN                  ANFIS
            Train     Test       Train     Test       Train     Test
MAE         0.3038    0.3236     0.2998    0.3181     0.2634    0.2823
MAPE (%)    11.3601   11.7203    11.2245   12.0256    10.1334   10.9684
RMSE        0.3723    0.3959     0.3689    0.3904     0.3470    0.3725
RMSPE (%)   14.2300   14.6900    13.6732   13.4268    11.0453   11.9674
R²          0.2077    0.1414     0.2222    0.2576     0.2467    0.2890


Fig. 5. MAPE of training and testing data


Fig. 6. RMSPE of training and testing data


Fig. 7. R-square value of training and testing data

5 Conclusion

This paper aims to find out if CGPA values can be predicted based on selected inputs using several prediction models, such as the OLS model, the ANN model and the ANFIS model. We measured the performances of the considered models using various evaluation measures, namely MAE, MAPE, RMSE, RMSPE and the coefficient of determination. We observed that the ANFIS prediction model has the lowest MAE/MAPE for testing and training data respectively, as compared to the ANN and OLS prediction models. We also found that with respect to RMSE/RMSPE, the ANFIS prediction model performs better than the other two selected models. The coefficient of determination value for the ANFIS model is also higher, indicating a better prediction rate than the other selected prediction models. Nevertheless, our study findings based on the selected inputs conclude that the ANFIS, ANN and OLS prediction models are not able to predict CGPA with good accuracy. We believe that the results of this study will be considered a helpful contribution in forecasting areas. There are other prediction models besides the aforementioned three models in the prediction literature, such as the weighted least square model, Bayesian prediction model, decision tree model, genetic algorithm prediction model and others. Our next plan is to implement these models on qualitative data along with the quantitative data, which might improve the accuracy rate of CGPA prediction. The qualitative data might include information on students’ psychological state, physical condition, and economic and social state, which can be helpful in our future research.

References 1. Hien, N.T.N., Haddawy, P.: A decision support system for evaluating international student applications. In: 37th ASEE/IEEE Frontiers in Education Conference, Milwaukee (2007) 2. Oladokun, V., Adebanjo, A., Charles-Owaba, O.: Predicting students’ academic performance using artificial neural network. Pac. J. Sci. Technol. 9 (2008) 3. Wang, T., Mitrovic, A.: Using neural networks to predict student’s performance. In: International Conference on Computers in Education, Auckland, New Zealand (2002)



4. Gedeon, T., Turner, S.: Explaining student grades predicted by a neural network. In: Neural Networks 1993, IJCNN 1993, Nagoya, Japan (1993) 5. Fausett, L., Elwasif, W.: Predicting performance from test scores using backpropagation and counterpropagation. In: 1994 IEEE International Conference on Neural Networks. IEEE World Congress on Computational Intelligence, Orlando, FL, USA (1994) 6. Zollanvari, A., Kizilirmak, R.C., Kho, Y.H.: Predicting students’ CGPA and developing intervention strategies based on self-regulatory learning behaviors. IEEE Access 5, 23792–23802 (2017) 7. Rusli, N.M., Ibrahim, Z., Janor, R.M.: Predicting students’ academic achievement: comparison between logistic. In: International Symposium on Information Technology, Kuala Lumpur, Malaysia (2008) 8. Culloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943) 9. Jang, J.S.R.: ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23, 665–685 (1993) 10. Fuzzy Logic Toolbox for Use with MATLAB, MathWorks, New York (2015) 11. Neural Network Toolbox for Use with MATLAB, MathWorks, New York (2015) 12. Banik, S., Chanchary, F.H., Rouf, A.R., Khan, K.: Modeling chaotic behavior of Dhaka stock market index values using the neuro-fuzzy model. In: 10th International Conference on Computer and Information Technology (2007) 13. Banik, S., Chanchary, F.H., Khan, K., Rouf, A.R., Anwer, M.: Neural network and genetic algorithm approaches for forecasting Bangladeshi monsoon rainfall. In: 11th International Conference on Computer and Information Technology (2008) 14. Banik, S., Khan, A.F.M.K.: Forecasting US NASDAQ stock index values using hybrid forecasting systems. In: 18th International Conference on Computer and Information Technology (ICCIT) (2015) 15. Banik, S., Anwer, M., Khan, A.F.M.K.: Soft computing models to predict daily temperature of Dhaka. In: 13th International Conference on Computing and Information Technology (ICCT) (2010)

Warehouse Setup Problem in Logistics: A Truck Transportation Cost Model
Rohit Kumar Sachan and Dharmender Singh Kushwaha
Motilal Nehru National Institute of Technology Allahabad, Allahabad, India
[email protected]

Abstract. Fast, efficient and timely delivery of goods and optimal transportation cost are the major challenges in the logistics industry. A well-planned transportation system overcomes these challenges and reduces the operational and investment costs of a logistics company. This transportation system is based on the Warehouse-and-Distribution Center (W&DC) network, which is similar to a Hub-and-Spoke (H&S) network. This paper presents a new hub location model based on truck transportation cost instead of the unit cost of goods transportation, since this is more suitable for the real-world goods transportation scenario from the perspective of a logistics company. The Anti-Predatory Nature Inspired Algorithm (APNIA) is used to find the optimal solution of the proposed model. It finds an optimal solution in terms of the W&DC (or H&S) network and the respective total logistics cost. The proposed approach first finds the locations of warehouses and DCs, and then allocates the DCs to warehouses in order to reduce the total logistics cost. Experimental evaluations are conducted on a real-life Warehouse Setup Problem (WSP) of 10 locations of Kanpur city, India. They reveal that the proposed Truck Transportation Cost based Model (TTCM) gives an approximately 2%–10% more accurate overall logistics cost from the perspective of a logistics company.

Keywords: Anti-predatory NIA · Hub Location Problem · Transportation · Logistics · Nature-inspired algorithms · Optimization

1 Introduction

Logistics is the management of the flow of goods from their origin to their destination. It includes various activities such as packaging, order processing, material handling, transportation, inventory control and warehousing [3]. Out of these, transportation and warehousing are the two key activities [1]. Transportation is responsible for the end-to-end movement of goods, and the warehouse is responsible for providing intermediate storage of goods during transportation [1]. According to the Knight Frank report 2018 [2], the logistics cost in India is 13–17% of the Gross Domestic Product (GDP), which is nearly double the 6–9% logistics-cost-to-GDP ratio of developed countries such as the US, Hong Kong and France.


This statistic signifies the absence of an efficient transportation system in India. The other major challenges in Indian logistics are fast, reliable, on-time delivery and seamless transportation of goods at optimal logistics cost [17]. These challenges are overcome by an efficient transportation system, which directly relates to savings on logistics cost as the sole objective [2]. The most effective way to develop an efficient transportation system is to build a Warehouse-and-Distribution Center (W&DC) network [32] between the cities/locations, which is commonly termed a logistics network. In a W&DC network, warehouses act as distribution hubs/dispatch hubs/logistics hubs/return centers and are responsible for various activities such as consolidation, sortation, connectivity, switching and distribution of goods between stipulated origin and destination (O-D) points, while Distribution Centers (DCs) act as origin or destination points of goods and are responsible for the booking, distribution and packaging of goods [32]. An end-to-end transportation system based on the W&DC network is illustrated in Fig. 1. It shows the movement of goods from an origin DC to a destination DC via warehouses.

Fig. 1. An end-to-end transportation system based on the W&DC

A W&DC network is similar to a Hub-and-Spoke (H&S) network [16], where a warehouse acts as a hub and a DC acts as a spoke. In an H&S network, a few cities/locations act as hubs and the remaining cities/locations act as spokes. The spokes are connected to the hubs in a way that ensures that all flow of goods passes through the hub(s) before it reaches the destination spoke. The H&S network fulfills the demand for the flow of goods via a smaller set of transportation links between O-D pairs than a fully connected network [30]. Finding a W&DC (or H&S) network from a given set of locations is a two-step process [5]. The first step is to identify the locations of the warehouses and distribution centers; the second is to allocate the DCs to the identified warehouses in such a way that the total logistics cost of routing the flow of goods for every O-D pair is optimized [7]; a toy example of this allocation step is sketched below. The problem of finding a W&DC network is termed the Warehouse Setup Problem (WSP).
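As an illustration of the second step (allocating DCs to already selected warehouses), the snippet below assigns every non-warehouse location to its nearest warehouse by distance. This greedy rule is only a sketch for intuition; it is not the cost-optimal allocation procedure developed later in the paper, and the distance matrix and warehouse set are hypothetical.

def allocate_dcs(distance, warehouses):
    # distance[i][j] is the distance between locations i and j (a square matrix).
    # Returns a dict mapping every location to the warehouse serving it.
    allocation = {w: w for w in warehouses}          # a warehouse serves itself
    for i in range(len(distance)):
        if i not in warehouses:
            allocation[i] = min(warehouses, key=lambda w: distance[i][w])
    return allocation

# Hypothetical 5-location instance with warehouses at locations 0 and 3.
D = [[0, 4, 7, 9, 3],
     [4, 0, 2, 6, 5],
     [7, 2, 0, 5, 6],
     [9, 6, 5, 0, 4],
     [3, 5, 6, 4, 0]]
print(allocate_dcs(D, warehouses={0, 3}))   # {0: 0, 3: 3, 1: 0, 2: 3, 4: 0}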


The WSP is an application of the Hub Location Problem (HLP). The HLP finds a solution in terms of an H&S network that has optimal logistics cost. The HLP is useful in those logistics problems where some goods must be transported between every O-D pair and where it is impossible, too expensive, or otherwise unreasonable to establish a direct transportation link between each pair of locations [14]. Several HLP models have been proposed in the past [11,15]. These models are broadly classified into five major categories: p-Hub Median Problem (pHMP), p-Hub Location Problem (pHLP), p-Hub Center Problem (pHCP), Hub Covering Problem and Hub Arc Location Problem (HALP) [6]. Further, these HLPs are classified based on the area of solution (discrete or continuous), type of objective function (minimax or minisum), assessment policy of the number of hubs (exogenous or endogenous), capacity of hubs (unlimited or limited), cost of hub establishment (no cost, fixed cost or variable cost), node allocation scheme (single or multiple) and cost of connection establishment between spokes and hubs (no cost, fixed cost or variable cost) [15]. Some of these models are: Uncapacitated Single Allocation p-HMP (USApHMP) [21], Uncapacitated Multiple Allocation HLP with Fixed Costs (UMAHLP-FC) [20], Capacitated Single Allocation HLP (CSAHLP) [14], Uncapacitated Multiple Allocation HLP (UMAHLP) [12], Capacitated Multiple Allocation HLP (CMAHLP) [8], Single Allocation p-HCP (SApHCP) [18], Capacitated Asymmetric Allocation HLP (CAAHLP) [29], Multiple Allocation p-HMP under Intentional Disruptions (MApHMP-I) [23], Capacitated Multiple Allocation Cluster HLP (CMACHLP) [19], Uncapacitated Multiple Allocation HLP (UMAHLP) [22], and Uncapacitated Single Allocation p-HLP with Fixed Cost (USApHLP-FC) [24]. These models have different mathematical formulations and associated constraints for calculating the logistics cost. To the best of our knowledge, all the aforementioned HLP models/formulations are based on the unit cost of goods transportation; none of them is based on a real-world goods transportation scenario that incorporates the truck transportation cost. Also, none of these models includes labour cost in the total logistics cost; most of them use the sum of the transportation cost and the fixed establishment cost of hubs as the total cost. Unit-cost-based models are not well suited to the real-world goods transportation scenario from the perspective of a logistics company. A logistics company transports goods between different locations via trucks, and many times a few under-loaded trucks also travel between locations. The cost of transporting these under-loaded trucks is not captured in a unit-cost-based model, but the logistics company still has to bear the cost of the underutilized space of the trucks. For this reason, there is a need for a new HLP model that addresses this issue in the real-world goods transportation scenario. This different charging issue is illustrated in Fig. 2. Due to the possibility of several solutions, finding a solution with optimal logistics cost and H&S network is challenging. As the number of location points increases, the complexity of the problem increases exponentially. Operations research and heuristic methods are best suited to solving small problems, but


Fig. 2. Illustration of different charging issue

when the number of locations is high, meta-heuristic-algorithm-based approaches are used. A meta-heuristic algorithm finds the optimal solution from randomly generated initial solutions, based on the fitness value (or logistics cost) of the problem. The mathematical formulation of the HLP model is used as the fitness/objective function during the optimization process. Commonly used meta-heuristic algorithms for HLPs are Genetic Algorithm (GA) [24,31], Particle Swarm Optimization (PSO) [4,22], Ant Colony Optimization (ACO) [18,19], Simulated Annealing (SA) [9,10,13] and the Anti-predatory NIA (APNIA) [28]. This paper proposes a new model for the real-world goods transportation scenario that is based on the transportation cost of trucks, named the "Truck Transportation Cost based Model (TTCM)". The proposed TTCM formulation of the HLP is solved using APNIA [25]. For experimental evaluation, a WSP of 10 location points is created based on the real-world scenario of Kanpur city, India. The obtained results are compared with the Unit Cost of Goods Transportation based Model (UCGTM). The rest of the paper is organized as follows: Sect. 2 presents the mathematical formulation of the proposed models with general assumptions for the HLP. Section 3 briefly describes APNIA, and the APNIA-based approach for solving the HLP is presented in Sect. 4. Section 5 discusses the experimental evaluations of APNIA on UCGTM and TTCM on a real-life WSP. The analysis of the obtained results is presented in Sect. 6. Section 7 provides the conclusion and future remarks.

2 Proposed Models for Hub Location Problem (HLP)

Out of the dozen variants of HLP models, the Uncapacitated Single Allocation p-Hub Location Problem with Fixed Cost (USApHLP-FC) [24] is an HLP variant with no capacity constraints on hubs (i.e., unlimited capacity), single-allocation constraints on spokes (i.e., a spoke is allocated to only one hub) and a fixed establishment/development cost for each hub location. The aim of USApHLP-FC is to find an optimal H&S network to route the flow of goods between O-D pairs of location points so that the total logistics cost is optimal (minimum). This logistics cost includes the total transportation cost and the total hub establishment cost. This section proposes two models and their mathematical formulations based on USApHLP-FC. The first model, UCGTM, is an extension of the unit cost of goods transportation model which incorporates the labour cost in the total logistics


cost. The second model, TTCM, is a novel model for the real-world goods transportation scenario which is based on the truck transportation cost instead of the unit cost of goods transportation. Both models consider the sum of the transportation cost of goods (TC), the labour cost of loading and unloading of goods (LC), and the fixed hub and spoke establishment cost (HSC) as the total logistics cost. Both models share some common assumptions, decision variables and constraints, as listed below:

General Assumptions
– Number of hub locations is fixed and known.
– Hubs do not have capacity constraints.
– A spoke location is allocated to only one hub.
– All hubs are connected to each other via a direct link.
– Direct connection between the spokes is not allowed.
– Distance between every pair of O-D locations is known.
– Flow of goods between every pair of O-D locations is known.
– Every location has a hub establishment cost which is fixed and known.
– Every location has a spoke rent cost which is fixed and known.
– Labour wage and labour productivity are known.
– Transportation cost between hubs is always lower than the transportation cost between a hub and a spoke.
– Two different capacity trucks are used for transportation of goods. The lower-capacity trucks are used for transportation between hub and spoke, and the higher-capacity trucks are used for transportation between hubs.

Output (or Decision) Variables
– Yk : a hub allocation variable
– Xij : a spoke-to-hub allocation variable

$Y_k = \begin{cases} 1 & \text{if a hub is located at node } k \\ 0 & \text{otherwise} \end{cases}$  (1)

$X_{ij} = \begin{cases} 1 & \text{if node } i \text{ is connected to a hub located at node } j \\ 0 & \text{otherwise} \end{cases}$  (2)

Constraints

$\sum_{j=1}^{N} X_{ij} = 1 \quad \forall i$  (3)

$\sum_{k=1}^{N} Y_k = P$  (4)

$Y_k, X_{ij} \in \{0, 1\}$  (5)

$X_{ijkm} \geq 0$  (6)


Constraint (3) ensures that every spoke location is allocated to exactly one hub, and constraint (4) ensures that exactly P hubs are selected. Constraints (5) and (6) are the standard integrality and non-negativity constraints. The details of both models are discussed in the following subsections.
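To make the single-allocation structure concrete, the sketch below encodes a candidate solution as a set of hub locations together with a spoke-to-hub assignment and checks constraints (3) and (4); the binary constraint (5) is implicit in this representation. The sketch is our illustration rather than part of the original formulation, and the instance values are hypothetical.

def is_feasible(assignment, hubs, n_locations, p):
    # assignment[i] = hub that location i is allocated to; a hub is assumed to serve itself.
    if len(hubs) != p:                       # constraint (4): exactly P hubs are selected
        return False
    for i in range(n_locations):             # constraint (3): every location has exactly one hub
        if assignment.get(i) not in hubs:
            return False
    for k in hubs:                           # each hub is allocated to itself
        if assignment.get(k) != k:
            return False
    return True

# Hypothetical 10-location instance with P = 3 hubs.
hubs = {0, 4, 7}
assignment = {0: 0, 1: 0, 2: 4, 3: 4, 4: 4, 5: 7, 6: 7, 7: 7, 8: 0, 9: 4}
print(is_feasible(assignment, hubs, n_locations=10, p=3))   # True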

2.1 Model 1: Unit Cost of Goods Transportation Based Model (UCGTM)

The mathematical formulation of the objective function of UCGTM is discussed in this subsection, which covers the input variables and the objective function. This model is mainly based on the unit cost of goods transportation and a discount factor. The unit cost of goods transportation represents the monetary expense required to transport one kilogram of goods over one kilometer between a warehouse and a DC, and the discount factor represents the reduction in the unit cost of goods transportation between warehouses with respect to the unit cost of goods transportation between a warehouse and a DC.

Input Variables

– N : number of location points
– P : number of hubs
– Dij : distance between the O-D locations (in kilometers (km))
– Wij : flow of goods between the O-D locations (in kilograms (kg))
– Fk : hub establishment cost at location k (in rupees (Rs.))
– Sk : spoke rent at location k (in rupees)
– LW : labour wage (in rupees)
– LP : labour productivity (kilograms per day)
– UC : unit cost of goods transportation (in rupees)
– Cij : unit cost of goods transportation between O-D locations (Cij = Dij × UC)
– α : discount factor for hub-to-hub transportation (0 < α

Epoch Accuracy

WARDS

87

63.3%

Detection

93

69.3%

Duration

1339

55.7%

Item reveal

103

64.9%

Level reveal

92

65.3%

Consequence kills

97

59.6%

Due to the novelty of the model, particularly its ability to report performance during the running game, there is no consistent baseline to compare against. We have looked at similar prediction algorithms that are aimed at different aspects of the game [3,12,20]. Although none of those prediction models have looked at warding, nor at a similar time frame of approximately six minutes, we have found the overall performance of the network to be in line with the predictive capabilities we have encountered. Furthermore, we have produced two simple baselines where we have (1) run a random guess algorithm and (2) made a small improvement to the guessing capability of the random guessing by weighting guesses closer to the mean more heavily. We have found that baseline (1)


produced a very low guessing accuracy of 0.3% when the same error tolerance was applied. This performance matches what is expected of the continuous value space. In order to produce a better baseline, we modified the algorithm to produce random guesses with a heavier weight towards the mean (refer to Sect. 6.2). Model (2) produced a higher accuracy of 9.2%. Despite the improvement observed in algorithm (2), it is clear that our suggested Neural Network model is undoubtedly more accurate than simply guessing.
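As an illustration of the second baseline, the sketch below draws random WARDS guesses weighted towards the sample mean and scores them against an error tolerance. The normal-distribution weighting, the spread and the tolerance value are our assumptions for illustration; the paper does not specify the exact scheme.

import random

def mean_weighted_baseline(actual_scores, tolerance, spread=2.0, seed=0):
    # Guess each WARDS value by sampling around the dataset mean and count a guess
    # as correct when it falls within the given error tolerance.
    rng = random.Random(seed)
    mean = sum(actual_scores) / len(actual_scores)
    correct = 0
    for actual in actual_scores:
        guess = rng.gauss(mean, spread)       # guesses concentrated near the mean
        if abs(guess - actual) <= tolerance:
            correct += 1
    return correct / len(actual_scores)

# Hypothetical WARDS values and tolerance, for illustration only.
scores = [1.2, 3.9, 6.7, 0.5, 8.4, 2.2, 5.1]
print(mean_weighted_baseline(scores, tolerance=1.0))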

8 Discussion/Conclusions

Relatively little work has been done towards measuring and improving the effects of vision and information-gathering mechanics in esports games with imperfect information. In particular, the study of warding in MOBAs like League of Legends and Dota 2 has been limited to simple metrics despite the mechanic's significance. This paper analysed the current industry standard for measuring warding success, called the Vision Metric. We then used detailed expert interviews to model each individual ward with a technique named the Wards Aggregate Record Derived Score (WARDS), and used the WARDS model to objectively measure the effect and impact of warding in Dota 2. Although this paper has focused primarily on MOBAs and Dota 2, the WARDS model described can be generalised to any title with similar mechanics as long as all of the necessary data can be retrieved. Furthermore, we used the WARDS model to generate a large amount of labelled data. This data was then utilised in the design, training and evaluation of an Artificial Neural Network aimed at predicting the final WARDS value for any given ward prior to its expiration. Although the results obtained with those Neural Networks had a relatively low accuracy value, we have found that, given the complexity of the problem and the large time frame, the performance matches that of other predictive models that focus on different aspects of the game when the respective time frames are taken into account. We have also compared the network with a simple guessing solution and found that our network considerably outperformed it. The WARDS model as described by this paper has multiple applications. The first is game analytics, where WARDS can help a coach assess their team's warding abilities or evaluate and explore different warding positions and their relative value based on what they want to achieve. The second is training and education, where WARDS can be used to improve a casual player's gameplay by helping them pick optimal warding positions during a game or evaluate their past games with alternate simulated warding placements. This feedback should accelerate a player's ability to learn effective information management in MOBAs. In addition to those applications, the WARD Score is a novel measurement that can be used in conjunction with existing metrics for Machine Learning purposes. For example, it would be possible to utilise the WARDS model as


an additional parameter for win prediction models. This could assist with the accuracy of those models, as it would be a step towards a better understanding of this complex game feature. The model's ability to predict short-term increases in team gold net worth in approximately 83% of cases in our dataset could be useful for accounting for unique variances and predicting team success. Furthermore, WARDS provides a mathematical model for a complex area of Dota 2 which can assist with understanding the game's noisy and complicated environment. The WARDS model can also serve as a baseline for other imperfect-information games. Examples of possible applications are titles such as Counter Strike: Global Offensive (CSGO) [21] and Overwatch [7]. Neither of these games has wards as in-game items; instead, players themselves act as scouts and have to move around the map with the sole intent of acquiring game-state information and relaying it back to their team-mates. The same principle explored in the model can be utilised to measure how effective their performance has been when gaining intel for their team. Lastly, it addresses a mechanic that is well established as advantageous in gameplay situations. The vision and warding mechanic enables, for example, a character to move to a different area in order to kill an enemy character, which may not have been possible without the knowledge that a ward provides [10]. After reviewing the performance of the Artificial Neural Network and the predictive problem itself, we suggest that the consistency of the scores has demonstrated the potential for future work in this area. Our current Neural Network architecture makes predictions based on a single state snapshot taken at the start of each ward. One improvement that could increase prediction accuracy is to modify the architecture to incorporate updated state information throughout the ward's lifetime into its prediction. This modification could increase the overall accuracy of the network by reducing the amount of uncertainty the network has to contend with as time progresses.

Acknowledgments. This work has been created as part of the Weavr project (weavr.tv) and was funded within the Audience of the Future programme by UK Research and Innovation through the Industrial Strategy Challenge Fund (grant no. 104775) and supported by the Digital Creativity Labs (digitalcreativity.ac.uk), a jointly funded project by EPSRC/AHRC/Innovate UK under grant no. EP/M023265/1. We would also like to thank the "Ogreboy's Free Coaching Serve" Discord server for agreeing to let us use their server as a platform and the following players for their input: Fyre, Water, Arzetlam, Trepo and Ogreboy. Lastly we would like to thank Isabelle Noelle for enabling the timely delivery of this project.

References 1. API Documentation - Riot Games API 2. Matchmaking | Dota 2 3. Block, F., Hodge, V., Hobson, S., Sephton, N., Devlin, S., Ursu, M.F., Drachen, A., Cowling, P.I.: Narrative bytes: data-driven content production in esports. In: Proceedings of the 2018 ACM International Conference on Interactive Experiences for TV and Online Video, pp. 29–41. ACM (2018)


4. Bonny, J.W., Castaneda, L.M., Swanson, T.: Using an international gaming tournament to study individual differences in MOBA expertise and cognitive skills. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI 2016, New York, NY, USA, pp. 3473–3484. ACM (2016) 5. Developer Community. Steam - valve developer community, March 2011 6. Drachen, A., Yancey, M., Maguire, J., Chu, D., Wang, I.Y., Mahlmann, T., Schubert, M., Klabajan, D.: Skill-based differences in spatio-temporal team behaviour in defence of the Ancients 2 (DotA 2). In: 2014 IEEE Games Media Entertainment, pp. 1–8, October 2014 7. Blizzard Entertainment. Overwatch (2019) 8. Ferrari, S.: From generative to conventional play: MOBA and league of legends. In: DiGRA Conference (2013) 9. Raiol, G., Sato, G.: League of legends - challenger’s ranked games, June 2019 10. Hodge, V., Devlin, S., Sephton, N., Block, F., Drachen, A., Cowling, P.: Win prediction in esports: Mixed-rank match prediction in multi-player online battle arena games. arXiv preprint arXiv:1711.06498 (2017) 11. Hoffman, R.R.: The problem of extracting the knowledge of experts from the perspective of experimental psychology. AI Mag. 8(2), 53–53 (1987) 12. Katona, A., Spick, R., Hodge, V., Demediuk, S., Block, F., Drachen, A., Walker, J.A.: Time to die: death prediction in DOTA 2 using deep learning. arXiv preprint arXiv:1906.03939 (2019) 13. Kinkade, N., Jolla, L., Lim, K.: Dota 2 win prediction. Univ Calif. 1, 1–13 (2015) 14. Muhammad, L.J., Garba, E.J., Oye, N.D., Wajiga, G.M.: On the problems of knowledge acquisition and representation of expert system for diagnosis of coronary artery disease (CAD). Int. J. u e Serv. Sci. Technol. 11(3), 49–58 (2018) 15. do Nascimento Junior, F.F., da Costa Melo, A.S., da Costa, I.B., Marinho, L.B.: Profiling successful team behaviors in league of legends. In: Proceedings of the 23rd Brazillian Symposium on Multimedia and the Web, WebMedia 2017, Gramado, RS, Brazil, pp. 261–268. ACM, New York, NY, USA (2017) 16. Olson, J.R., Rueter, H.H.: Extracting expertise from experts: Methods for knowledge acquisition (1987) 17. Riot Games. Riot games: Who we are (2017) 18. Riot Games. Welcome to league of legends (2019) 19. Riot GMang. Vision score details (2017) 20. Schubert, M., Drachen, A., Mahlmann, T.: Esports analytics through encounter detection. In: Proceedings of the MIT Sloan Sports Analytics Conference, vol. 1 (2016) 21. The CSGO Team. Counter strike: Global offensive (2019) 22. Valve Corporation. DOTA 2, July 2013. Accessed 24 July 2019 23. Various Authors. Lane creeps, May 2019 24. Various Authors. Matchmaking rating, June 2019 25. Various Authors. Movement speed, July 2019 26. Various Authors. Observer wards, June 2019 27. Various Authors. Runes, May 2019 28. Wagner, W.P.: Trends in expert system development: a longitudinal content analysis of over thirty years of expert system case studies. Expert Syst. Appl. 76, 85–96 (2017) 29. Xia, B., Wang, H., Zhou, R.: What contributes to success in moba games? An empirical study of defense of the ancients 2. Games Cult. 14(5), 498–522 (2019)

Decomposition Based Multi-objectives Evolutionary Algorithms Challenges and Circumvention
Sherin M. Omran1, Wessam H. El-Behaidy1, and Aliaa A. A. Youssif1,2
1 Faculty of Computers and Artificial Intelligence, Helwan University, Cairo, Egypt

[email protected], [email protected]
2 College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt
[email protected]

Abstract. Decomposition based Multi Objectives Evolutionary Algorithms (MOEA/D) have become one of the main research foci in the last decade. That is due to their simplicity as well as their effectiveness in solving both constrained and unconstrained problems with different Pareto Front (PF) geometries. This paper presents the challenges in different research areas concerning MOEA/D. Firstly, the original MOEA/D algorithm is explained. Its basic idea is to decompose the Multi Objectives Optimization (MOO) problem into multiple single-objective optimization sub-problems and to solve these sub-problems concurrently, where each sub-problem is optimized with the help of the information gained from its neighborhood. Then, two major factors that affect the search ability of decomposition-based MOO algorithms are discussed: scalarization functions (SF) and weight vector generation and adaptation. Furthermore, research on two categories of MOEA/D variants is illustrated. Finally, the real-world application areas to which the decomposition approach has been applied are mentioned.

Keywords: Multi Objective Optimization · MOEA/D · Decomposition approach · Evolutionary Algorithms

1 Introduction

Problems in the real world usually have conflicting objectives. For these kinds of problems, there does not exist a single solution that simultaneously fits all the objectives. Such problems are called Multi Objective Optimization (MOO) problems. A MOO problem can be stated as in [1]:

$\text{Maximize } F(x) = (f_1(x), \ldots, f_m(x))^T \quad \text{subject to } x \in \Omega$  (1)



Such that: m is the number of objectives,  is the variable space (also called decision space), and F:  → Rm is the objective space. The achievable objective set can be defined as the set {F(x)| x ∈ }. Since these objectives oppose each other, the solution for these problems is a set of all the nondominated points [2, 3]. Let u, v ∈ Rm , u is said to dominate v if and only if ui ≥ vi for every i ∈ {1, . . . . . . , m} and uj > vj for at least one index j ∈ {1, . . . . . . ., m}. A point x∗ ∈  is Pareto-optimal to Eq. (1) if it cannot be dominated by any other point x ∈ . In this case, F(x∗ ) is called Pareto-optimal objective vector. The refinement in an objective for any Pareto-optimal point always leads to regression into at least another objective. The set containing all such points is called a Pareto-optimal Set (PS) while, the set containing the whole Pareto-optimal objective vectors is called Pareto Front (PF) [2, 4]. The algorithms used for MOO problems lie into two categories; Pareto dominance based algorithms and decomposition based algorithms. Pareto dominance based algorithms such as Non-dominated Sorting Genetic Algorithm (NSGA2) [5], Speed-constrained Multi objectives Particle Swarm Optimization (SMPSO) [6], and Strength Pareto Evolutionary Algorithm (SPEA) [7] usually work well to approximate the Pareto Front in 2 or 3 objectives. However, the performance is extremely reduced due to the increased number of objectives. In this case, almost all the solutions cannot be dominated by each other [3, 4]. For most MOO problems, it is very time consuming to cover the whole Pareto Front (PF) as they always have an infinite set of Pareto-optimal vectors. So, Zhang et al. [3] proposed a newly implemented Multi-Objectives Evolutionary Algorithm using Decomposition (MOEA/D). Decomposition based MOO algorithms try to solve the problems of the previous dominance based techniques. Amongst these problems are the dominance resistance phenomenon and the increasing complexity [4]. The main idea of decomposition based MOO techniques is to break down the MOO problem into a group of single objective sub problems. The optimization process for each sub problem is accomplished in a concurrent and a collaborative way using the information acquired from its neighboring sub problems. That makes MOEA/D have a reduced complexity. This technique has proved that it is one of the most promising techniques to handle both Multi and Many objective optimization problems (i.e. problems with more than three objectives) [3]. Researches on decomposition based algorithms lie into four categories: • • • •

Scalarizing Functions (SF) adaptation. Weight generation or adaptation. Generation of new versions that can be applied for more complex problems. Applying the decomposition based algorithms to different applications (i.e. real world applications).

Some new scalarizing functions (SF) were provided in the literature such as the work of GRABISCH [8], Santiago [9], Miettinen [10], and Jiang [11]. The concurrent use of

84

S. M. Omran et al.

different SFs in order to get the benefits of all of them was found in Ishibuchi [12], Rojas [13]. Xiaoliang Ma [14] and Wang [15] proposed some new adaptive strategies to study the effect of Lp scalarizing functions on problems with different PF geometries. An extensive number of researches studied the effect of weight vector generation and adaptation strategies. Amongst these researches, some new adaptive weight vector generation strategies [16, 17], and [25], Uniformly-random mechanisms [18], some mechanisms to handle complex PFs [19], and some Artificial Intelligence-based mechanisms [19, 20], and [21]. The decomposition approach has been extended to different number of evolutionary mechanisms [22, 23] and [24] or by using different new operators to handle the tradeoff between diversity and convergence [25, 26]. The MOEA/D has been also used to solve real world application problems [27, 28]. The next sections of this paper are organized as follows: The original MOEA/D algorithm is described in Sect. 2, followed by a review of Scalarizing Functions (SF) adaptation techniques in Sect. 3. Section 4 presents the weight generation or adaptation mechanisms. Different MOEA/D versions are explained in Sect. 5. Real world applications are reviewed in Sect. 6. Finally, conclusion is given in Sect. 7.

2 Decomposition Based Multi Objective Evolutionary Algorithm (MOEA/D) Framework The first step to solve the MOO problem using MOEA/D is to decompose or split the MOO problem defined by Eq. (1) into several scalar sub problems and to work on these sub problems concurrently. According to Zhang et al. [3], there are many decomposition techniques; the Weighted Sum (WS), the Penalty Boundary Intersection (PBI), and the Weighted Tchebycheff (W-Tch) technique. The W-Tch will only be considered, in this section, as it is considered the most effective one [14]. The MOO problem of Eq. (1) can be handled as a group of N scalar sub problems using W-Tch technique where, the objective function of the jth sub problem is given as stated in [3]: Minimize gtch (x|λj , z∗ ) = max1≤i≤m {λi |fi (x) − zi∗ |} j

Subject to x ∈  (2)  ∗ T is the reference point, S.T z ∗ = max{f (x)|x ∈ } for Where z = z1∗ , z2∗ , . . . . . . zm i i i = 1 → m. For each Pareto-optimal point (i.e. non-dominated solution) x∗ , there is a weight m j vector λ = (λ1 , . . . ., λm )T such that, λi ≥ 0, and i=1 λi = 1 for all i from 1 to m objectives and for all j from 1 to N sub-problems. In this case, all the non-dominated solutions found for Eq. (2) are considered as Pareto-optimal to Eq. (1). By changing the weight vector, various Pareto-optimal solutions could be achieved. So that, selecting the appropriate weight vectors is one of the factors that affect the solution quality. For each weight vector λi there is a neighborhood which is a set of the T closest weight vectors in {λ1 , . . . ., λN }. Hence, the neighborhood of the ith sub problem contains every sub problem that has a weight vector at distance ≤ T from λi . “Figure 1” shows the complete algorithm steps. 

Decomposition Based Multi-objectives Evolutionary Algorithms

85

MOEA/D algorithm Inputs: • Multi-Objective Optimization (MOO) problem. • : The number of sub problems (the population size). • : A set of evenly sampled weight vectors { • : The neighborhood size • A stopping criterion Steps: 1.

2.

Initialization: Generate the initial population at random such is the current that, is the population size where, solution to the sub problem. Initialize the reference point as mentioned in Eq.(2). Calculate the Euclidean distances for each couple of weight vectors to determine the neighbors for each vector (the set of its closest weight vectors). For each set are the closest weight vector to . For each :Evaluate the fitness value . Update: For Reproduction: Select at random two indexes from then, by using the genetic operators (i.e crossover and . and mutation) generate a new solution from Repair: Apply a problem specific improvement heuristic to generate . on then Update : For each , If set Neighboring solutions update: For each index and

3.

If stopping criteria is met, then stop. Else go back to step 2. Fig. 1. MOEA/D algorithm

3 Scalarizing Functions (SF) Adaptation The Scalarizing Function (SF) or aggregation function is the function used to transform the MOO problem into a group of scalar sub problems [29, 30]. It is one of the main factors that affect the MOEA/D search ability. Selecting the most appropriate SF affects both the diversity and convergence of the algorithm. Many researches have investigated and proposed new SFs or studied the effect of changing the SF control parameters on the final results. Previously, Zhang et al. [3]

86

S. M. Omran et al.

suggested three SFs; Weighted Sum (WS), Penalty Boundary Intersection (PBI), and Weighted Tchebycheff (W-Tch). While GRABISCH et al. [8] and Santiago et al. [9] proposed SFs as Weighted exponential sum, Weighted product, Quadratic mean, and Exponential mean. More other SFs were explained by Miettinen et al. [10]. Recently, Jiang et al. [11] presented two new SFs and studied their impact on MOEA/D algorithms; MSF and PSF. The Multiplicative Scalarizing Function (MSF) is a general form that is the weighted Tchebycheff (W-Tch) is a special case of. Penalty based SF (PSF) updated the weighted Tchebycheff function in such a way inspired by the Penalty Boundary Intersection. It uses different linearly decreasing α values through the search stages, where α is the penalty value used for diversity adjustment. The improved region of the PSF extremely varies with α where, α ≥ 0 is preferable to maintain the diversity. Results proved the effectiveness of the framework based on the two proposed SFs called eMOEA/D as compared to other recent approaches on some classical problems with 2 or 3 objectives. However, further investigations should be provided for larger number of objectives. The problem with this framework is that the linearly decreasing approach for the control parameter α is not always suitable for all kinds of problems. Other researches combined different SFs to make advantage of each. Ishibuchi et al. [12] examined the concurrent use of both the WS and the W-Tch in a single algorithm. Two implementation schemes have been proposed. The first one is called the multi-grid scheme. In this scheme, each SF has a complete grid of uniformly distributed weight vectors and the two similar grids for each SF are used simultaneously. As a result of this design, the two grids can be overlapped and both the population size and the actual number of neighbors will be doubled. The second scheme is a single grid scheme with different SFs. Each SF has a non-complete grid of weight vectors where, each function is assigned alternately to each weight vector. So that, the result is a single complete grid with two SFs instead of one in the original algorithm. The two implementation schemes were examined on multi objectives 0/1 knapsack problems with different number of objectives. Results showed that the proposed schemes could outperform the MOEA/D using a single SF. The main advantages of the two proposed schemes are their simplicity and their ability to be applied for different types of SFs. Furthermore, Rojas and Coello [13] proposed a technique that supports collaborative populations mechanism (similar to the single-grid idea proposed by Ishibuchi et al. [12]) using several SFs and model parameter values through an adaptive selection strategy. The selection strategy selects from a pool the SF that fits each sub population according to the improvement fitness rate that is calculated for each sub problem at each time span. They suggested combining SFs with similar target directions so as to generate a uniformly distributed solutions all over the PF. They proposed a pool of strategies S1 and S2 to combine different SFs. The S1 pool includes The Augmented Tchebycheff (ACHE), Modified Tchebycheff (MCHE), and Weighted Norm (WN), while S2 pool includes Augmented Achievement Scalarizing Function (AASF), Modified ASF (MASF) and Penalty Boundary Intersection (PBI). Results showed that the proposed technique could outperform the other counterparts for different kinds of problems over a large set of objectives. 
Some researches studied the effect of using Lp-norm/Lp scalarizing methods in some adaptive strategies on problems with different PF geometries. Xiaoliang Ma et al. [14]

Decomposition Based Multi-objectives Evolutionary Algorithms

87

proposed a W-Tch decomposition with constrained Lp-norm on direction vectors (p-Tch) with clear geometric property objective functions. In p-Tch, sub-problems have been constructed on basis of a direction vector instead of weight vector. The direction vector λ can be thought of as a positive vector with ||λ||p = 1. They used the 2-Tch as a representative example of their theory. The proposed algorithm proved its effectiveness as compared to literature MOEA algorithms on both benchmark and real world problems. Wang et al. [15] analyzed the Lp scalarizing method showing the impact of the p value on the MOEA/D algorithm selection pressure. They found that as p increases the search ability reduces while it becomes more robust on PF geometries. They also proposed a new Pareto adaptive scalarizing approximation called Pas that approximates the p value. The proposed MOEA/D-Pas proved its effectiveness as compared to other benchmarks on various problems with different PF geometries over a large number of conflicting objectives.

4 Weight Generation or Adaptation Mechanisms Weight vectors generation is the second factor that affects the decomposition based algorithm search ability. Similar weight vectors lead to poor solutions because in this case, the resultant solutions will not be evenly distributed over the Pareto front [3, 9]. The state of the art methodologies suggest two types of weight vector generation; systematic and random weight vector generation [9, 31]. In the systematic weight vector generation, the weight vectors are distributed evenly on a unit simplex [32, 33]. For irregular Pareto fronts, the random weight vector generation is sometimes recommended [34]. These methods work very well in case of hyperplane Pareto Fronts, but they cannot guarantee solutions diversity in case of more complex PFs geometry [35]. According to the high influence of the weight vector distribution on the final solutions, many researches concerning weight vector adaptation have been provided. Jiang Siwei et al. [16] presented a Pareto adaptive weight vector methodology called (paλ) that depends on Mixture Uniform Design (MUD). The proposed methodology modifies the weight vectors depending on the PF shape. Empirical results proved that the proposed paλ-MOEA/D methodology provided higher hybervolume, and better solutions as compared to both the classical MOEA/D and NSGAII on 12 benchmark problems. Guo et al. [17] provided a new MOEA/D algorithm called (AWD-MOEA/D) that is based on an adaptive weight vector adjustment method. To get an adaptive weight vector set, a two phase methodology was provided. In the first phase, the initial weight vectors are created using the uniform design method so as to evenly investigate the objective space. In the second phase, an adaptive weight vector generation method that combines generalized decomposition as well as uniform design is used. This method is used in this stage in order to dynamically adapt the weight vector settings which in turn helps generating a uniform non-dominated solutions. The proposed algorithm provided the best diversity and convergence against both UMOEA/D (Uniform MOEA/D) and MOEA/D over 2 standard test problems. Another adaptive weight generation method called MOEA/D-AWG was presented in [36]. The proposed method generates the weight vectors related to the geometrical properties of the PFs which are estimated first by using Gaussian process regression.

88

S. M. Omran et al.

Results verified the efficiency of the proposed adaptively weight generation method as compared to MOEA/D alternatives with uniformly generated weight vectors. Farias et al. [18] proposed a Uniformly-Randomly-Adaptive Weights generation algorithm called (MOEA/D-URAW) that combined the uniform random generation mechanism with the adaptation mechanism presented in [37] for sub problems generation. Sub problems are generated according to the sparseness of the population. The MOEA/D-URAW performance was evaluated on Waving Fish Group (WFG) problems with various PF geometries. Results showed that the algorithm provided the best results as compared to other state of the art techniques. Ch. Zhang et al. [19] presented a new weight vector adjustment method for biobjective problems with non-continuous PFs called MOEA/D-ABD. The proposed method starts with detecting weight vectors that requires some adjustments and allocates vectors along the Pareto front depending on the length. The solutions for these vectors are produced by means of linear interpolation mechanism. MOEA/D-ABD algorithm proposed the best solutions as compared to MOEA/D-AWA [37]. The main problem with that algorithm is that it is not suitable for problems with larger number of objectives as well as it doesn’t guarantee good solutions in case of continuous PFs. Different Artificial Intelligence based mechanisms were also used for weight vector generation such as Artificial Neural Networks (ANN) and Evolutionary techniques. Gu et al. [20] developed an innovative weight generation mechanism called MOEA/D-SOM that is based on Self Organizing Maps (SOM). By using the recent individuals’ objective vectors, the SOM network was periodically trained with N neurons such that, both the weight and objective vectors are of the same dimensions. The neurons’ weights were considered as the weight vectors. Results showed that the proposed SOM-based mechanism could highly outperform the other counterparts on a set of both redundant and nonredundant test problems. Meneghini et al. [21] proposed an evolutionary-based weight vector generation technique. The main idea is to evolve a set of weight vectors depending on the desired characteristics in a MOEA/D framework. The proposed EA could prohibit weight vectors creation through the border of the orthant where, this area has poor solutions. Experiments proved that the proposed mechanism can provide a group of generally well spread vectors close to the uniform distribution, without forming clusters or empty spaces.

5 Different MOEA/D Versions The Recent researches proposed different versions or variants of MOEA/D. These researches lie into two categories. The first one is to apply the decomposition approach to other efficient evolutionary algorithms in order to benefit from both of them and to get balance between both diversity and convergence. The other one is to update the original MOEA/D to fit the more complex problems. Among the first category, algorithms like MOEA/DD [38], MOEA/DD-CMA [22], MO-GPSO/D [23], and MOEA/D-ACO [24] will be clarified. Li et al. [38] presented a novel algorithm called MOEA/DD that collects both decomposition and dominance approaches into a single algorithm. The proposed algorithm

Decomposition Based Multi-objectives Evolutionary Algorithms

89

proved its superiority to both state of the art and recent algorithms on some constrained and unconstrained problems. Castro Jr et al. [22] combined MOEA/D-CMA [39] which is one of the variants of Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [40] with MOEA/DD [38] algorithm. This combined algorithm called MOEA/DD-CMA. This algorithm was compared against MOEA/D-CMA over a large number of problems with objectives ranging from 5 to 15 objectives. Martínez et al. [23] studied the incorporation of Geometric Particle Swarm Optimization (GPSO) [41] one of the variants of PSO that is used for discrete optimization problems with the decomposition mechanism. The algorithm was tested on 1/0 knapsack problems with different number of objectives (more than 3). Experiments showed that MO-GPSO/D algorithm provided a very Promising results as compared to three other benchmark algorithms; NSGA3, MOEA/D, and MOEA/D*. Liangjun Ke et al. [24] proposed an Ant Colony Optimization (ACO) algorithm using the decomposition principle called MOEA/D-ACO such that, the number of ants is the number of sub-problems where each ant tries to solve one sub-problem. Ants are split into groups where, each group targets a particular part in the PF. The neighbors for each ant may be members of the same or a different group. The final solution is obtained using the shared information from all the group members. MOEA/D-ACO could outperform both the original MOEA/D and the Bicriterion-Ant algorithm on all the test instances. The algorithms of second category include a large number of variants on the original MOEA/D to solve more complex problems such as problems with complex or noncontinuous PF. Here, the most recent and effective algorithms will be mentioned. Li et al. [25] proposed an update to their original MOEA/D algorithm proposed in [3] in order to handle more complex PS shapes. The new algorithm MOEA/D-DE incorporated the original decomposition approach with Differential Evolution (DE) operator. The DE is used to carry out the mating procedure. A polynomial mutation is then presented after DE as it slightly improves the algorithm performance. Although the results did not prove that the proposed algorithm is always superior to NSGA2 using the same operators (NSGA2-DE) but, it’s still a promising technique for such kind of problems. To solve complex problems such as problems with non-continuous regions, sharp peaks and long tails, Jiang et al. [42] solved through a Two Phase (TP) strategy and a niching method, namely, MOEA/D-TPN. During the first phase, the algorithm searches for areas of solutions crowdedness so as to detect the shape of the PF. In the second phase, the algorithm determines the sub problem form that will be used depending on the results of the first phase. The niching method is presented to avoid duplicate solutions/off-springs by guiding the mating procedure with parents in the regions with minimal crowd. The reproduction process is carried out using the same operators that were previously proposed by Li et al. [25]. Results showed the superiority of the algorithm as compared to both SPEA2 [43] with Shift-based Density Estimation (SPEA2 + SDE) [44] and NSGA3 [45]. Xu et al. [26] proposed a hierarchical decomposition based MOEA named (MOEA/HD) that divides the sub problems into different layers/hierarchies. MOEA/HD then adjust the direction of the search of the lower hierarchy sub problems using superior

90

S. M. Omran et al.

guiding sub problems. Results showed that the algorithm provided promising results on problems with different PF features.

6 Real World Applications According to the good results provided by the decomposition approach and its different variants on large and different sets of benchmark problems, it has to be tested for real world application areas. Zhang et al. [27] extended their original decomposition approach presented in [3] into a new one gathering both Normal Boundary Intersection (NBI) and the W-Tch approaches into one algorithm called NBI-style Tchebycheff. The algorithm was tested on portfolio management optimization problem with bi-objectives; return and variance or risk. The proposed approach provided a promising and comparable results to NSGA2. Xing et al. [28] incorporated the MOEA/D approach with Population-based incremental learning (PBIL) components and proposed (MOEA/D-PBIL). The proposed algorithm was applied on the multicast routing with network coding optimization problem with three objectives; the coding cost, the end to end delay, and the link cost. Results proved the superiority of the algorithm as compared to 6 other MOEAs variants. Decomposition mechanism was also used with many other applications such as natural and medical image segmentation [46], the sizing of a folded-cascode amplifier [47], aerospace applications [48], reservoir flood control operation [49] and agile Satellite Mission Planning [50].

7 Conclusions The Multi-objective optimization (MOO) techniques can be categorized as dominance based and decomposition based algorithms. The dominance based algorithms work very well with problems with at most three objective functions. However, its performance deteriorates with more than three objective optimization problems. The decomposition based approach works mainly to solve such kinds of problems. It decomposes the problem into several sub problems and tries to solve each of them separately. Due to the promising results of MOEA/D algorithms, this paper presents the challenges performed on the decomposition functions/SFs, weight generation mechanisms, MOEA/D different versions, and real world applications. The researches of new SFs, their combinations and adaptive Lp-norm scalarizing method on problems with complex Pareto Fronts are discussed. Also, the new adaptively weight generation mechanisms and AI based weight generation mechanisms; paλ-MOEA/D, AWD-MOEA/D, MOEA/D-AWG, MOEA/DURAW, MOEA/DABD, MOEA/D-SOM, and Evolutionary-based weight vector were discussed to handle complex problems. Different variants of MOEA/D that combines the decomposition with different other strategies were also presented such as MOEA/DD, MOEA/DD-CMA, MO-GPSO/D, MOEA/D-ACO, MOEA/D-TPN, MOEA/D-DE, and MOEA/HD. MOEA/D proved its effectiveness in many different fields such as financial optimization, network routing, aerospace, image segmentation, and satellite mission planning.

Decomposition Based Multi-objectives Evolutionary Algorithms

91

References 1. Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms. Wiley, New York (2001) 2. Bui, L.T., Alam, S.: Multi Objective Optimization in Computational Intelligence: Theory and Practice. IGI Global, Hershey (2008) 3. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007) 4. Purshouse, R.C., Fleming, P.J.: On the evolutionary optimization of many conflicting objectives. IEEE Trans. Evol. Comput. 11(6), 770–784 (2007) 5. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002) 6. Nebro, A.J., Durillo, J.J., García-Nieto, J., Coello, C., Luna, F., Alba, E.: SMPSO: a new PSObased metaheuristic for multi-objective optimization. In: IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making (MCDM), Nashville, TN, USA (2009) 7. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999) 8. Grabisch, M., Marichal, J.-L., Mesiar, R., Pap, E.: Aggregation functions. part I: means. Inf. Sci. 181(1), 1–22 (2011) 9. Santiago, A., Huacuja, H.J.F., Dorronsoro, B., Pecero, J.E., Santillan, C.G., Barbosa, J.J.G., Monterrubio, J.C.S.: A survey of decomposition methods for multi-objective optimization. In: Recent Advances on Hybrid Approaches for Designing Intelligent Systems. Springer, vol. 547, pp. 453–465, 2014 10. Miettinen, K., Makela, M.M.: On scalarizing functions in multiobjective optimization. Oper. Res. Spektrum 24, 193–213 (2002) 11. Jiang, S., Yang, S., Wang, Y., Liu, X.: Scalarizing functions in decomposition-based multiobjective evolutionary algorithms. IEEE Trans. Evol. Comput. 22(2), 296–313 (2018) 12. Ishibuchi, H., Sakane, Y., Tsukamoto, N., Nojima, Y.: Simultaneous use of different scalarizing functions in MOEA/D. In: GECCO 2010, Portland, Oregon, USA (2010) 13. Pescador-Rojas, M., Coello, C.A.C.: Collaborative and adaptive strategies of different scalarizing functions in MOEA/D. In: IEEE Congress on Evolutionary Computation (CEC) (2018) 14. Ma, X., Zhang, Q., Tian, G., Yang, J., Zhu, Z.: On Tchebycheff decomposition approaches for multi-objective evolutionary optimization. IEEE Trans. Evol. Comput. 22(2), 226–244 (2018) 15. Wang, R., Zhang, Q., Zhang, T.: Decomposition based algorithms using Pareto adaptive scalarizing methods. IEEE Trans. Evol. Comput. 20(6), 821–837 (2016) 16. Siwei, J., Zhihua, C., Jie, Z., Yew-Soon, O.: Multiobjective optimization by decomposition with Pareto-adaptive weight vectors. In: Seventh International Conference on Natural Computation, Shanghai, China (2011) 17. Guo, X., Wang, X., Wei, Z.: MOEA/D with adaptive weight vector design. In: 11th International Conference on Computational Intelligence and Security, Shenzhen, China (2015) 18. Farias, L.R.C.d., Braga, P.H.M., Bassani, H.F., Araújo, A.F.R.: MOEA/D with uniformly randomly adaptive weights. In: GECCO 2018, Kyoto, Japan (2018) 19. Zhang, C., Tan, K.C., Lee, L.H., Gao, L.: Adjust weight vectors in MOEA/D for bi-objective optimization problems with discontinuous Pareto fronts. Soft Comput. 22(12), 3997–4012 (2018) 20. Gu, F., Cheung, Y.-M.: Self-organizing map-based weight design for decomposition-based many-objective evolutionary algorithm. IEEE Trans. Evol. Comput. 22(2), 211–225 (2018)

92


Learning the Satisfiability of Ł-clausal Forms

Mohamed El Halaby(B) and Areeg Abdalla

Department of Mathematics, Faculty of Science, Cairo University, Giza 12613, Egypt
{halaby,areeg}@sci.cu.edu.eg

Abstract. The k-SAT problem for Ł-clausal forms has been found to be NP-complete if k ≥ 3. Similar to Boolean formulas in Conjunctive Normal Form (CNF), Ł-clausal forms are important from theoretical and practical points of view for their expressive power and easy-hard-easy pattern, as well as for exhibiting a phase transition phenomenon. In this paper, we investigate predicting the satisfiability of Ł-clausal forms by training different classifiers (Neural Network, Linear SVC, Logistic Regression, Random Forest and Decision Tree) on features extracted from randomly generated formulas. Firstly, a random instance generator is presented and used to generate instances in the phase transition area over 3-valued and 7-valued Łukasiewicz logic. Next, numeric and graph features were extracted from both datasets. Then, different classifiers were trained and the best classifier (Neural Network) was selected for hyper-parameter tuning, after which the mean of the cross-validation scores (CVS) increased from 92.5% to 95.2%.

Keywords: Satisfiability · Ł-clausal forms · Łukasiewicz logic · Fuzzy logic · Machine learning

1 Introduction and Preliminaries

In propositional logic, a Boolean variable x can take one of two possible values: 0 or 1. A literal l is a variable x or its negation ¬x. A disjunction C is a group of r literals joined by ∨. This is expressed as C = \bigvee_{i=1}^{r} l_i. A Boolean formula φ in Conjunctive Normal Form (CNF) is a group of m disjunctions joined by ∧ (i.e., a conjunction of disjunctions). From now on, we will refer to a disjunction in a CNF formula as a clause. If φ consists of m clauses where each clause C_i is composed of r_i literals, then φ can be written as

\varphi = \bigwedge_{i=1}^{m} C_i, \quad \text{where } C_i = \bigvee_{j=1}^{r_i} l_{i,j}


Given a propositional formula φ in CNF, the satisfiability problem (SAT) [1] is deciding whether φ has an assignment to its variables that satisfies every clause. SAT is a core problem in theoretical computer science because of its central position in complexity theory [2]. Moreover, numerous NP-hard practical problems have been successfully solved using SAT [3].

Fuzzy logic is an extension of Boolean logic introduced by Lotfi Zadeh in 1965 [4], based on the theory of fuzzy sets, which is a generalization of classical set theory. By introducing the notion of degree in the verification of a condition, enabling a condition to be in a state other than true or false (and thus allowing infinitely many truth degrees), fuzzy logic provides a very valuable flexibility for reasoning, which makes it possible to take inaccuracies and uncertainties into account. The SAT problem has been studied in fuzzy logic, and specifically in Łukasiewicz logic [5], as has its optimization version (maximizing the number of satisfied clauses) [6], but less attention has been paid to developing efficient solvers for the problem. One of the recent attempts [7] consists of enhancing the state-of-the-art Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm. This was done by having multiple CMA-ES populations running in parallel and then recombining their distributions if this leads to improvements. Another finding in [8] showed that a hill-climber approach outperformed CMA-ES on some problem classes. A different idea was recently proposed which involves encoding the formula as a Satisfiability Modulo Theories (SMT) program, then employing flattening methods and CNF conversion algorithms to derive an equivalent Boolean CNF SAT instance [9].

For formulas in Łukasiewicz logic, the variables can take a value from a finite (or infinite) set of truth values, but this paper is concerned with finite truth sets. Moreover, the basic connectives of Łukasiewicz logic are defined in Table 1. We will be dealing with five operations, namely negation (¬), the strong and weak disjunction (⊕ and ∨, respectively) and the strong and weak conjunction (⊙ and ∧, respectively).

Table 1. Logical operations in Łukasiewicz logic.

Operation name         Definition
Negation ¬             ¬x = 1 − x
Strong disjunction ⊕   x ⊕ y = min{1, x + y}
Strong conjunction ⊙   x ⊙ y = max{x + y − 1, 0}
Weak disjunction ∨     x ∨ y = max{x, y}
Weak conjunction ∧     x ∧ y = min{x, y}
Implication →          x → y = min{1, 1 − x + y}
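As a quick illustration (not part of the original paper), the connectives in Table 1 translate directly into code; the sketch below simply mirrors the table over truth values in [0, 1].

```python
# Łukasiewicz connectives from Table 1, over truth values in [0, 1].
def neg(x):        return 1 - x                    # negation
def sdisj(x, y):   return min(1, x + y)            # strong disjunction
def sconj(x, y):   return max(x + y - 1, 0)        # strong conjunction
def wdisj(x, y):   return max(x, y)                # weak disjunction
def wconj(x, y):   return min(x, y)                # weak conjunction
def implies(x, y): return min(1, 1 - x + y)        # implication

# In (k+1)-valued Łukasiewicz logic the variables range over {0, 1/k, ..., 1},
# e.g. {0, 1/2, 1} for the 3-valued case used later in the paper.
print(sdisj(0.5, 0.5), sconj(0.5, 0.5))  # 1.0 0.0
```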

One obvious extension of Boolean CNF formulas in Łukasiewicz logic is called simple Ł-clausal forms [10], where the Boolean negation is replaced by the Łukasiewicz negation and the Boolean disjunction by the strong disjunction.


The resulting form is

\bigwedge_{i=1}^{m} \Big( \bigoplus_{j=1}^{r_i} l_{ij} \Big)

where each l_{ij} is a variable (that can take a truth value belonging to either a finite or an infinite set) or its negation. In this paper, we are concerned with a slightly different and more interesting class of formulas called Ł-clausal forms [10]. The following definition describes how these formulas are constructed.

Definition 1. Let X = {x_1, ..., x_n} be a set of variables. A literal is either a variable x_i ∈ X or ¬x_i. A negated term is a literal or an expression of the form ¬(l_1 ⊕ ··· ⊕ l_k), where l_1, ..., l_k are literals. An Ł-clause is a weak disjunction of such terms. An Ł-clausal form is a weak conjunction of Ł-clauses.

For example, (x_1 ⊕ ¬(x_2 ⊕ x_3)) ∧ (x_3 ⊕ ¬x_2 ⊕ ¬x_1) is an Ł-clausal form, but not a simple Ł-clausal form, due to the presence of the negated term ¬(x_2 ⊕ x_3). It has been shown [10] that the satisfiability problem for any simple Ł-clausal form is solvable in linear time, contrary to its counterpart in Boolean logic, which is NP-complete in the general case. In addition, the expressiveness of simple Ł-clausal forms is limited, meaning that not every Łukasiewicz formula has an equivalent simple Ł-clausal form. The reason Ł-clausal forms are interesting is that if at most three literals appear in each Ł-clause (i.e., 3-SAT), the satisfiability problem becomes NP-complete (the proof involves reducing Boolean 3-SAT to the SAT problem for Ł-clausal forms). The authors also showed that 2-SAT is solvable in linear time for Ł-clausal forms.

For satisfiability problems over randomly generated instances, the transition from under-constrained instances with a very high probability of satisfiability to over-constrained problems with a very low probability of satisfiability is called the phase transition phenomenon. This phenomenon has been observed in Boolean satisfiability [11] as well as in the satisfiability of Ł-clausal forms [10]. Therefore, predicting the satisfiability of instances generated at or near the phase transition is more challenging than predicting the satisfiability of formulas generated elsewhere.

Due to the good performance of currently available Boolean SAT algorithms, SAT solving procedures are becoming interesting alternatives for an increasing number of problem domains. Applications of SAT include knowledge compilation [12], hash functions, cryptanalysis [13] and many others. Fuzzy logic has numerous applications [14–16]. Several problems in computer-aided design (CAD) and electronic design automation (EDA), for example, can be naturally stated as satisfiability problems over a multi-dimensional, multi-valued (e.g. fuzzy) solution space. If these practical problems are formulated using Boolean SAT, then an encoding (such as one-hot encoding) of the multi-valued dimensions using a set of Boolean variables must also be used. Such encodings require defining new constraints, which exclude encoded values that do not occur in the original problem. For example, if a five-valued domain variable is encoded using three binary variables (having eight possible settings), additional constraints must be specified to

exclude the remaining three assignments. Therefore, understanding and solving the SAT problem efficiently over a powerful class of formulas such as Ł-clausal forms is a step towards taking advantage of this generic problem solving strategy in solving combinatorial optimization problems over variables in Łukasiewicz logic. There has been work on predicting the satisfiability of Boolean formulas, for example in [17] and [11], but to the best of our knowledge, there is no research yet that focuses on predicting the satisfiability of Ł-clausal forms. In this paper, we demonstrate that it is possible to achieve classification accuracies higher than 95% based on features computed in polynomial time. The rest of the paper is structured as follows. First, the instance generator used to produce the formulas is described. Second, the details of the dataset used are illustrated along with the parameters chosen and the features extracted. Third, the machine learning models trained over the dataset are listed as well as their initial results. Finally, hyper-parameter tuning is performed on the best-scoring model and the corresponding improved results are reported.

2 Instance Generator

We have carried out a similar experiment to the one done by Bofill et al. in [10] on 3-valued Ł-clausal forms. The instances used were generated in the following manner: given the number of variables n and the number of clauses m, each clause is generated from three variables x_{i1}, x_{i2} and x_{i3} picked uniformly at random. Then, one of the following eleven Ł-clauses is drawn uniformly at random: (x_{i1} ⊕ x_{i2} ⊕ x_{i3}), (¬x_{i1} ⊕ x_{i2} ⊕ x_{i3}), (x_{i1} ⊕ ¬x_{i2} ⊕ x_{i3}), (x_{i1} ⊕ x_{i2} ⊕ ¬x_{i3}), (¬x_{i1} ⊕ ¬x_{i2} ⊕ x_{i3}), (¬x_{i1} ⊕ x_{i2} ⊕ ¬x_{i3}), (x_{i1} ⊕ ¬x_{i2} ⊕ ¬x_{i3}), (¬x_{i1} ⊕ ¬x_{i2} ⊕ ¬x_{i3}), (¬(x_{i1} ⊕ x_{i2}) ⊕ x_{i3}), (¬(x_{i1} ⊕ x_{i3}) ⊕ x_{i2}) and (x_{i1} ⊕ ¬(x_{i2} ⊕ x_{i3})).

We aim to generalize the construction and generation of Ł-clauses to study them in more detail experimentally and theoretically. Our model generates Ł-clausal forms with parameters (m, n, k, p), where m is the number of Ł-clauses, n is the number of variables, k is the number of variables appearing in each Ł-clause and p is the degree of absence of negated terms. It is important to note that no generated Ł-clause has a variable appearing more than once. The decision of whether or not to put a negated term in a clause is made as follows: given p, we generate a random integer r ∈ {0, 1, ..., p − 1}, and if r = 0, then we add a negated term with length less than or equal to k. So, as p increases, the probability of adding a negated term decreases. For example, when p approaches 1, the sum of the lengths of the negated terms in each Ł-clause approaches k, and when p approaches ∞, that sum approaches 0. In the next section, we will discuss the relationship between p and the cost. Unlike Boolean CNF formulas, the phase transition phenomenon in Ł-clausal forms does not depend only on the ratio between m and n, but also on p. In other words, using our model, by changing p one can generate different Ł-clausal forms with the same values of m and n but totally different costs. The instances produced are then translated into Satisfiability Modulo Theories (SMT) programs and solved using Z3 [18].
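A minimal sketch of such a generator is given below. It is not the original implementation: the clause representation (tuples tagged "lit"/"neg_term") and the exact policy for where a negated term starts and ends are our own assumptions, since the text only fixes the parameters (m, n, k, p) and the r = 0 test.

```python
import random

def random_l_clause(n, k, p):
    """One random Ł-clause over n variables: k distinct variables, each literal
    possibly negated; a run of literals is wrapped into a negated term whenever
    a random draw from {0, ..., p-1} equals 0 (grouping policy is our assumption)."""
    variables = random.sample(range(1, n + 1), k)
    literals = [v if random.random() < 0.5 else -v for v in variables]
    clause, i = [], 0
    while i < len(literals):
        if random.randrange(p) == 0:                  # r = 0: add a negated term
            j = random.randint(i + 1, len(literals))  # term length <= k
            clause.append(("neg_term", literals[i:j]))
            i = j
        else:
            clause.append(("lit", literals[i]))
            i += 1
    return clause

def random_l_clausal_form(m, n, k, p, seed=None):
    """Ł-clausal form with m clauses under the (m, n, k, p) model described above."""
    if seed is not None:
        random.seed(seed)
    return [random_l_clause(n, k, p) for _ in range(m)]

# Example: the parameters reported for the 3-valued phase-transition dataset.
phi = random_l_clausal_form(m=1000, n=510, k=3, p=28, seed=0)
```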

3 Dataset and Features

A dataset of 1030 instances in the phase transition area was produced using the described instance generator. The numbers of satisfiable and unsatisfiable instances in the dataset are 565 and 465, respectively. The formulas are 3-valued Ł-clausal forms with k = 3, 1000 Ł-clauses, p = 28 and 510 variables. These parameters were chosen because of the phase transition reported in [10]. This phase transition is reported to occur at a clauses-to-variables ratio of 1.9, which corresponds to 510 variables when the number of Ł-clauses is 1000. Moreover, the value p = 28 was chosen as it leads to a probability of satisfiability of the generated instances of 1/2. The following two subsections describe the two sets of features, numeric and graph features, that are extracted from every instance.

3.1 Numeric Features

Let φ be an Ł-clausal form with m clauses, n variables and m_neg negated terms.

1. m/n
2. m_neg/m
3. m_horn/m, where m_horn is the number of Horn clauses in φ. A Horn clause is an Ł-clause with at most one positive literal.
4. Statistics on the numbers of positive and negative occurrences of each variable.
5. Statistics on the number of negated literals each clause contains.

The statistics calculated for the numbers of positive and negative occurrences and the number of negated literals are: minimum, maximum, mean, variance, standard deviation, geometric mean, quadratic mean, Shannon's entropy, kurtosis, 25th, 50th and 75th percentiles, skewness and the total sum of squares. Some of these statistics were used in [11].

3.2 Features from Graph Representations

1. The variable-clause graph is a bipartite graph with a node for each variable, a node for each clause, and an edge between them whenever a variable occurs in a clause (positively or negatively). The negations over the negated terms are removed before computing the variable-clause graph of the formula. For example, let φ = (x̄_1 ⊕ x_3 ⊕ x_4) ∧ (x_1 ⊕ x_2 ⊕ x̄_4 ⊕ x_5) ∧ (x_2 ⊕ x̄_3 ⊕ x_5). Figure 1 shows the variable-clause graph of φ.
2. The variable graph has a node for each variable and an edge between variables that occur together in at least one clause.

The following features are extracted from the degrees of the nodes in each of the graph representations mentioned: minimum, maximum, mean, variance, standard deviation, geometric mean, quadratic mean, Shannon's entropy, kurtosis, 25th, 50th and 75th percentiles, skewness and the total sum of squares.
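For illustration only, the degree-based statistics can be computed as in the following sketch (the clause encoding and the exact statistic definitions, e.g. the total sum of squares, are our own assumptions):

```python
import numpy as np
from scipy import stats

def degree_stats(values):
    """The summary statistics listed above (labels are ours), applied to a vector
    of node degrees or variable-occurrence counts."""
    d = np.asarray(values, dtype=float)
    counts = np.bincount(d.astype(int))
    p = counts[counts > 0] / counts.sum()
    return {
        "min": d.min(), "max": d.max(), "mean": d.mean(), "var": d.var(),
        "std": d.std(), "gmean": float(stats.gmean(d + 1e-12)),
        "qmean": float(np.sqrt(np.mean(d ** 2))), "entropy": float(stats.entropy(p)),
        "kurtosis": float(stats.kurtosis(d)), "skew": float(stats.skew(d)),
        "p25": np.percentile(d, 25), "p50": np.percentile(d, 50),
        "p75": np.percentile(d, 75), "sum_sq": float(np.sum((d - d.mean()) ** 2)),
    }

def variable_graph_degrees(clauses, n):
    """Degrees in the variable graph: variables are adjacent if they co-occur in a clause.
    Each clause is assumed to be a list of signed variable indices (negated terms flattened)."""
    neighbours = [set() for _ in range(n + 1)]
    for clause in clauses:
        vs = {abs(v) for v in clause}
        for v in vs:
            neighbours[v] |= vs - {v}
    return [len(neighbours[v]) for v in range(1, n + 1)]
```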


Fig. 1. Variable-clause graph of φ.

4 Results

In this section, we first describe the features extracted and their final selection. Second, the classifiers used in the exploratory stage are listed along with their evaluation. Finally, hyper-parameter tuning is performed on the best-scoring classifier and the final accuracy is reported.

4.1 Feature Engineering and Selection

The following steps are carried out to engineer new features based on the original ones and to select the best among them.

1. A standard scaler transformation (the dataset is transformed such that its distribution has a mean of 0 and a standard deviation of 1) is applied to the data.
2. A Logistic Regression model with an L1 penalty is used to produce an importance score for every feature. The model takes the feature matrix X and the target Y, then outputs a score for each feature such that the higher the score, the more relevant the feature is towards the target. The number of iterations used for the model is 1000.
3. The features selected in the second step are then transformed using a polynomial transformation of the third degree. This works as follows: given a feature vector (a, b), a polynomial transformation of the third degree produces the feature vector (1, a, b, ab, a², b², a³, b³, ab², ba²).
4. A standard scaler transformation is applied to the new features.
5. Due to the large number of features produced by the polynomial transformation, the second step is repeated using the features obtained from the fourth step. The final number of features we end up with is 80 and the number of iterations used for the Linear SVC model is 1000.
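The five steps can be sketched with scikit-learn as follows; this is our own illustration under the assumption that standard scikit-learn estimators (StandardScaler, SelectFromModel, PolynomialFeatures) were used, which the text does not state explicitly.

```python
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

def engineer_features(X, y):
    """Steps 1-5 above, expressed with scikit-learn (estimator choices are assumptions)."""
    X1 = StandardScaler().fit_transform(X)                         # step 1
    lr = LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)
    X2 = SelectFromModel(lr).fit(X1, y).transform(X1)              # step 2
    X3 = PolynomialFeatures(degree=3).fit_transform(X2)            # step 3
    X4 = StandardScaler().fit_transform(X3)                        # step 4
    svc = LinearSVC(penalty="l1", dual=False, max_iter=1000)
    return SelectFromModel(svc).fit(X4, y).transform(X4)           # step 5
```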

4.2 Classifiers and Evaluation

In the exploratory stage, we experimented with different classifiers. The following models were tested using stratified K-fold cross-validation with K = 10. Using a value of K = 10 has been shown through experimentation to generally result in a model skill estimate with low bias and modest variance [19]:


1. Logistic Regression with an L1 penalty and an inverse of regularization strength of 1.0.
2. Neural Network with a tanh activation and a single hidden layer with 30 nodes.
3. Random Forest with 1000 trees and a minimum of 2 samples required to split a node.
4. Linear Support Vector Machine with an L1 penalty and an inverse of regularization strength of 1.0.
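A sketch of this exploratory comparison follows; it is ours, assumes scikit-learn estimators for the listed models, and uses a synthetic stand-in for the real feature matrix. The Decision Tree settings are not given in the text, so defaults are used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the experiments X, y are the engineered features and satisfiability labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
    "Neural Network": MLPClassifier(activation="tanh", hidden_layer_sizes=(30,), max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=1000, min_samples_split=2),
    "Linear SVC": LinearSVC(penalty="l1", C=1.0, dual=False),
    "Decision Tree": DecisionTreeClassifier(),   # settings not specified in the text
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean CVS = {scores.mean():.3f}, std = {scores.std():.3f}")
```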

Table 2. Exploratory stage results, indicating the mean and the standard deviation of the cross-validation scores (CVS).

Classifier            Mean CVS   STD CVS
Logistic Regression   90.3%      3.3%
Neural Network        92.5%      2.3%
Random Forest         61.4%      2.2%
Linear SVC            90.8%      4.2%
Decision Tree         54.1%      5.8%

The Neural Network model had the highest mean CVS and the second lowest standard deviation, as Table 2 shows. Hyper-parameter tuning was performed using grid search augmented by cross-validation to optimize the model. The following list describes the parameter grid chosen for the search; the best parameter for each item is in bold.

– Learning rate: constant, invscaling and adaptive.
– Solver: Adam, Stochastic Gradient Descent (SGD) and Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS).
– Sizes of hidden layer(s): (30), (35), (40), (45) and (30, 30).
– Activation functions: logistic, relu and tanh.
– Nesterov's accelerated gradient: off, on.
– L2 penalty parameter: 0.01, 0.001 and 0.0001.

After creating a Neural Network with the best parameters, the resulting mean and standard deviation of the CVS are 95.2% and 1.98%, respectively. This result is an improvement on the initial results described in Table 2. A different dataset of 1012 instances (531 satisfiable and 481 unsatisfiable instances) with the same parameters as the first dataset was generated, but over 7-valued instead of 3-valued logic. The produced formulas are also in the phase transition area. A Neural Network whose parameters were hyper-tuned was tested on this dataset and the achieved mean CVS was 92%.
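The grid search described above can be sketched as follows, assuming a scikit-learn MLPClassifier (the parameter names map onto the grid items, e.g. alpha is the L2 penalty); X and y again denote the engineered feature matrix and labels from the previous sketches.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

param_grid = {
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "solver": ["adam", "sgd", "lbfgs"],
    "hidden_layer_sizes": [(30,), (35,), (40,), (45,), (30, 30)],
    "activation": ["logistic", "relu", "tanh"],
    "nesterovs_momentum": [False, True],
    "alpha": [0.01, 0.001, 0.0001],              # L2 penalty parameter
}

search = GridSearchCV(
    MLPClassifier(max_iter=1000),
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)                                  # X, y as in the previous sketch
print(search.best_params_, search.best_score_)
```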

5 Conclusion and Future Work

This paper is concerned with the problem of predicting the satisfiability of an interesting class of logical formulas in Łukasiewicz logic, called Ł-clausal forms. In particular, we first presented an instance generator and used it to produce two datasets with instances at the phase transition over 3-valued and 7-valued Łukasiewicz logic. Next, numeric and graph features were extracted from both datasets. Then, different classifiers were trained and the best classifier (Neural Network) was selected for hyper-parameter tuning, after which the mean of the CVS increased from 92.5% to 95.2%. Many practical problems are naturally described in multi-valued logic, and this work will be beneficial in encoding many of these applications into Ł-clausal forms and solving them in a short time. The same technique has gained tremendous success and popularity over the years in the case of Boolean satisfiability [20–22]. A limitation of this study is the size of the dataset. Future studies will involve generating larger datasets with variable occurrences having different statistical distributions (power, exponential and normal distributions). This is important since variable occurrences in formulas coming from industrial applications often follow various distributions and not just a uniform distribution. In addition, we will investigate the ability of our model to predict the satisfiability of formulas from real-life applications. Finally, we will explore more efficient methods of solving the satisfiability of Ł-clausal forms. This will enable generating instances with more variables and Ł-clauses.

References

1. Biere, A., Heule, M., van Maaren, H. (eds.): Handbook of Satisfiability. IOS Press, Amsterdam (2009)
2. Cook, S.: The complexity of theorem-proving procedures. In: Proceedings of the Third Annual ACM Symposium on Theory of Computing, pp. 151–158. ACM (1971)
3. Marques-Silva, J.: Practical applications of Boolean satisfiability. In: 9th International Workshop on Discrete Event Systems, pp. 74–80. IEEE (2008)
4. Zadeh, L.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965)
5. Rushdi, M., Rushdi, M., Zarouan, M., Ahmad, W.: Satisfiability in intuitionistic fuzzy logic with realistic tautology. Kuwait J. Sci. 45(2), 15–21 (2018)
6. El Halaby, M., Abdalla, A.: Fuzzy maximum satisfiability. In: Proceedings of the 10th International Conference on Informatics and Systems, pp. 50–55. ACM (2016)
7. Brys, T., Drugan, M., Bosman, P., De Cock, M., Nowé, A.: Solving satisfiability in fuzzy logics by mixing CMA-ES. In: Proceedings of the 15th Conference on Genetic and Evolutionary Computation, pp. 1125–1132. ACM (2013)
8. Brys, T., Drugan, M., Bosman, P., De Cock, M., Nowé, A.: Local search and restart strategies for satisfiability solving in fuzzy logics. In: 2013 IEEE International Workshop on Genetic and Evolutionary Fuzzy Systems (GEFS), pp. 52–59. IEEE (2013)
9. Soler, J., Manyà, F.: A bit-vector approach to satisfiability testing in finitely-valued logics. In: 2016 IEEE 46th International Symposium on Multiple-Valued Logic (ISMVL), pp. 270–275. IEEE (2016)


10. Bofill, M., Manyà, F., Vidal, A., Villaret, M.: New complexity results for Łukasiewicz logic. Soft Comput. 23(7), 2187–2197 (2019)
11. Devlin, D., O'Sullivan, B.: Satisfiability as a classification problem. In: Proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science (2008)
12. Darwiche, A.: New advances in compiling CNF to decomposable negation normal form. In: Proceedings of the 16th European Conference on Artificial Intelligence, pp. 318–322. IOS Press (2004)
13. Mironov, I., Zhang, L.: Applications of SAT solvers to cryptanalysis of hash functions. In: International Conference on Theory and Applications of Satisfiability Testing, pp. 102–115. Springer (2006)
14. Yager, R., Zadeh, L. (eds.): An Introduction to Fuzzy Logic Applications in Intelligent Systems, vol. 165. Springer, New York (2012)
15. De Silva, C.: Intelligent Control: Fuzzy Logic Applications. CRC Press, Boca Raton (2018)
16. Srivastava, P., Bisht, D.: Recent trends and applications of fuzzy logic. In: Advanced Fuzzy Logic Approaches in Engineering Science, pp. 327–340. IGI Global (2019)
17. Xu, L., Hoos, H., Leyton-Brown, K.: Predicting satisfiability at the phase transition. In: 26th AAAI Conference on Artificial Intelligence (2012)
18. De Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–340. Springer, Heidelberg (2008)
19. Koller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M., Heckerman, D., Meek, C., Neville, J.: Introduction to Statistical Relational Learning. MIT Press (2007)
20. Favalli, M., Dalpasso, M.: Applications of Boolean satisfiability to verification and testing of switch-level circuits. J. Electron. Test. 30(1), 41–55 (2014)
21. Vizel, Y., Weissenbacher, G., Malik, S.: Boolean satisfiability solvers and their applications in model checking. Proc. IEEE 103(11), 2021–2035 (2015)
22. Aloul, F., El-Tarhuni, M.: Multipath detection using Boolean satisfiability techniques. J. Comput. Netw. Commun. 2011 (2011)

A Teaching-Learning-Based Optimization with Modified Learning Phases for Continuous Optimization

Onn Ting Chong1, Wei Hong Lim1(B), Nor Ashidi Mat Isa2, Koon Meng Ang1, Sew Sun Tiang1, and Chun Kit Ang1

1 Faculty of Engineering, Technology and Built Environment, UCSI University, 56000 Kuala Lumpur, Malaysia
[email protected]
2 School of Electrical and Electronic Engineering, Universiti Sains Malaysia, 14300 Nibong Tebal, Malaysia

Abstract. The deviation between the modelling of the teaching-learning based optimization (TLBO) framework and the actual scenario of the classroom teaching and learning process is considered one factor contributing to the imbalance of the algorithm's exploration and exploitation searches, hence restricting its search performance. In this paper, the TLBO with modified learning phases (TLBO-MLPs) is proposed to achieve better search performance through further refinement of the learning framework so that it reflects the actual teaching and learning processes in the classroom more accurately. A modified teacher phase is first introduced in TLBO-MLPs, where each learner is modelled to have a different perspective of the mainstream knowledge in the classroom in order to maintain the diversity of the population's knowledge. A modified learner phase consisting of an adaptive peer learning mechanism and a self-learning mechanism is also proposed in TLBO-MLPs. The former mechanism enables each learner to interact with multiple learners in gaining new knowledge for different subjects, while the latter facilitates the update of new knowledge through personal effort. The overall performance of TLBO-MLPs in solving the CEC 2014 test functions is compared with seven competitors. Extensive simulation results show that TLBO-MLPs demonstrates the best search performance among all compared methods in solving the majority of the test functions.

Keywords: Global optimization · Modified learning phases · Teaching-learning based optimization (TLBO)

1 Introduction

Recently, overwhelming attention has been received in research areas related to optimization, given its promising performance in decision making. An optimization problem consists of an objective function that represents the intended goal. Depending on the problem characteristics, the optimal combination of decision variables can lead to the


largest or smallest objective function values. Given the rapid advancement of technology, substantial numbers of modern engineering problems are represented as highly complex, nonlinear and large-scale optimization problems that are difficult to address with deterministic mathematical programming methods. It is crucial to develop alternative optimization schemes that can handle these increasingly complex problems more robustly. Metaheuristic search algorithms (MSAs) employ unique search strategies that emulate certain natural phenomena, and their robust search performance enables them to tackle various complex real-world optimization problems. Swarm intelligence (SI) algorithms and evolutionary algorithms (EAs) are two main branches of MSAs with different sources of inspiration. For EAs, the search mechanisms are inspired by Darwin's theory of evolution, and some representatives include the genetic algorithm (GA) [1], evolutionary strategies (ES) [2], genetic programming (GP) [3], differential evolution (DE) [4], etc. Meanwhile, SI algorithms emulate the collective behavior of a group of simple agents that interact locally with each other and their environments in self-organized and decentralized manners. Some notable examples of SI are particle swarm optimization (PSO) [5], ant colony optimization [6], artificial bee colony (ABC) [7], the grey wolf optimizer (GWO) [8], etc. These MSAs are widely used by practitioners to tackle a wide range of modern optimization problems, such as those reported in [9–19], because of their competitive advantages such as high efficiency and simple implementation. Motivated by the pedagogy of the conventional classroom, a new SI algorithm known as teaching-learning based optimization (TLBO) has emerged [20, 21]. Similar to most MSAs, TLBO has a large population of learners that represent different candidate solutions of a given optimization problem, while the teacher represents the best solution found. The search trajectory of each learner is adjusted based on its interaction with the teacher and other peer learners during the optimization process. In contrast to most existing MSAs, TLBO has the additional advantage of not requiring the tuning of algorithm-specific parameters, such as the inertia weight and acceleration coefficients of PSO or the mutation and crossover rates of GA. Given these appealing features, TLBO has been extended to solve more challenging types of optimization problems with greater complexities of fitness function landscapes, such as multi-objective problems [22–24], constrained problems [25–27], etc. Although different TLBO variants have been proposed to solve optimization problems with enhanced performance, their robustness in addressing problems with different complexities of fitness landscapes (e.g. multimodal, expanded, hybrid and composite functions) remains arduous. Most TLBO variants only demonstrate good performance in limited classes of problems and deliver inferior optimization results for other types of problems due to the imbalance of the algorithm's exploration and exploitation strengths [21]. The design of robust mechanisms to attain a proper regulation of the exploration and exploitation strengths of TLBO variants remains an open research question that is actively pursued by researchers in this arena. It is also observed that some mechanisms adopted in the algorithmic framework of TLBO are not aligned with the real-world teaching and learning processes in the classroom. For instance, the knowledge of each learner is updated based on the same teacher and the same mainstream knowledge of the classroom, which are represented by the best and mean solutions of the population, respectively. In a real-world


scenario, different learners tend to have different perspectives of the mainstream knowledge in order to preserve the diversity level of the knowledge acquired. Additionally, it is also observed that each learner tends to interact with only one peer learner for updating his or her knowledge in all subjects taken, as shown in the TLBO's learner phase. Nevertheless, it is more realistic for a learner to interact with several peers for enhancing the knowledge of different subjects. The discrepancies found between the TLBO framework modelling and the real-world teaching and learning processes in the classroom can be another factor restricting the robustness of TLBO in handling a wide range of optimization problem types [28]. In this paper, a new TLBO variant known as teaching-learning based optimization with modified learning phases (TLBO-MLPs) is proposed to overcome the challenges mentioned above. The modelling of the teaching and learning mechanisms proposed in the TLBO-MLPs framework is refined further so that it reflects the actual teaching and learning processes in the classroom more accurately, hence leading to better search performance. Some notable contributions of TLBO-MLPs are highlighted below:

– A modified teacher phase is designed in TLBO-MLPs by introducing a new concept of weighted mean position that aims to simulate the different perspectives of mainstream knowledge that are perceived by different learners in order to update their knowledge in the teacher phase.
– A modified learner phase is designed in TLBO-MLPs by allowing each learner to interact with multiple peers in updating his or her knowledge in different subjects (i.e., dimensional components). A peer learning probability is also introduced to quantify the tendency of each learner to interact with its peers in the modified learner phase based on the knowledge level of the learner.
– The modified learner phase of TLBO-MLPs incorporates a self-learning process that aims to simulate the tendency of some learners to update their knowledge using personal effort instead of interacting with their peers.
– Rigorous performance evaluations are performed on TLBO-MLPs with the CEC 2014 test functions and verified using statistical analyses.

The remaining sections of this paper are organized as follows. The literature review of this work is provided in Sect. 2. Detailed descriptions of TLBO-MLPs are given in Sect. 3. An extensive performance evaluation of TLBO-MLPs in solving the complete set of CEC 2014 test functions is described in Sect. 4, followed by performance validation with statistical analyses. The conclusions drawn from this work and its future works are finally summarized in Sect. 5.

2 Literature Review

2.1 Conventional TLBO

The teaching-learning-based optimization (TLBO) algorithm was proposed in [20] and its search mechanisms are motivated by the classroom's conventional teaching and learning processes. At the beginning stage of optimization, random initialization is used to produce a group of learners with a population size of I. Each i-th learner,


X_i = (X_{i,1}, ..., X_{i,d}, ..., X_{i,D}), represents a potential solution for a given optimization problem, where d ∈ [1, D] and D refer to the dimension index and the total dimensional size of the problem to be optimized, respectively. Suppose that f(X_i) is the objective function value of the i-th solution; it reflects the knowledge level of the i-th learner in the classroom, which can be enhanced via the teacher or learner phases. All learners are updated in the teacher phase by interacting with the best learner in the population, i.e., the teacher solution represented by X^{teacher}, while taking the average knowledge level of the population, denoted as X^{mean}, into account, where:

X^{mean} = \frac{1}{I} \sum_{i=1}^{I} X_i    (1)

Suppose that r_1 denotes a uniformly distributed random number in the range of 0 to 1, and T_f refers to a teaching factor with an integer value of either 1 or 2 that emphasizes the influence of the mainstream knowledge of the population X^{mean}. Let X_i^{new} be the new solution of the i-th learner obtained in the teacher phase; then:

X_i^{new} = X_i + r_1 (X^{teacher} - T_f X^{mean})    (2)

On the other hand, each i-th learner interacts with its peers in the population during the learner phase in order to enhance its knowledge level (i.e., fitness). Denote s as the index of a randomly selected peer learner assigned to the i-th learner in the learner phase, where s ∈ [1, I] and s ≠ i, and r_2 as a uniformly distributed random number in the range of 0 to 1. If the randomly selected learner X_s is fitter than X_i, the inferior X_i is attracted towards the fitter X_s as depicted in Eq. (3). In contrast, a repel scheme is incorporated into Eq. (4) in order to prohibit the rapid convergence of the fitter X_i towards the inferior X_s, which can lead to premature convergence.

X_i^{new} = X_i + r_2 (X_s - X_i)    (3)

X_i^{new} = X_i + r_2 (X_i - X_s)    (4)
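For illustration only, Eqs. (1)–(4) amount to the following minimal Python sketch of one TLBO iteration (assuming minimisation; the per-dimension random numbers, bound clipping and the greedy acceptance described in the next paragraph are modelled here as our own assumptions):

```python
import numpy as np

def tlbo_step(X, f, lb, ub):
    """One TLBO iteration over population X (I x D) with objective f to minimise."""
    I, D = X.shape
    fit = np.array([f(x) for x in X])

    # Teacher phase, Eqs. (1)-(2)
    teacher = X[fit.argmin()]
    X_mean = X.mean(axis=0)
    for i in range(I):
        Tf = np.random.randint(1, 3)                      # teaching factor, 1 or 2
        new = np.clip(X[i] + np.random.rand(D) * (teacher - Tf * X_mean), lb, ub)
        fn = f(new)
        if fn < fit[i]:                                   # greedy acceptance
            X[i], fit[i] = new, fn

    # Learner phase, Eqs. (3)-(4)
    for i in range(I):
        s = np.random.choice([j for j in range(I) if j != i])
        if fit[s] < fit[i]:                               # attracted towards fitter peer
            new = X[i] + np.random.rand(D) * (X[s] - X[i])
        else:                                             # repelled from inferior peer
            new = X[i] + np.random.rand(D) * (X[i] - X[s])
        new = np.clip(new, lb, ub)
        fn = f(new)
        if fn < fit[i]:
            X[i], fit[i] = new, fn
    return X, fit
```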

The new solution X_i^{new} obtained by the i-th learner during the teacher and learner phases is used to update the original X_i if the former solution is found to be superior to the latter one. Otherwise, X_i^{new} is discarded due to its inferior fitness. The knowledge of all TLBO learners is updated iteratively via both the teacher phase and the learner phase until the termination conditions are satisfied. At the end of the search process, the teacher solution X^{teacher} is returned as the best solution to address a given optimization problem.

2.2 TLBO Variants and Improvement

A significant amount of extended work on TLBO has been reported since its inception for enhancing its performance [21]. A promising research direction to improve TLBO's performance is the employment of parameter adaptation strategies. In [29], an


adaptive weight factor that decreases linearly with iteration number was introduced to encourage the TLBO learners focus more on exploration during the earlier phase of optimization and then exploit the smaller search space in later phase. Another similar adaptive weight was also proposed in [30], aiming to improve the exploration ability of TLBO. The learning efficiency of TLBO was enhanced in [31] using the concepts of acceleration coefficients and inertia weight to determine the influence of previous learner and learning step size, respectively, based on the fitness of each learner. A nonlinear inertia weighted TLBO (NIWTLBO) was proposed in [32]. The nonlinear inertia weight factor was employed to regulate the memory rate and scale the learner’s existing knowledge, while a time-varying weight value was used to enhance the differential increment and impose the random fluctuations on existing TLBO learners. A TLBO variant with population size varies in triangle form was proposed in [33]. Gaussian distribution was used to produce the new learners when population size of TLBO increases from minimum to maximum, while the similarity concept was employed to discard the redundant learners when the population size decreases from maximum to minimum. Neighborhood structure is another common strategy used to improve performance of TLBO given its ability to govern the propagation rate of best solution in the population. A ring topology was introduced in [34] to enhance the exploration strength of TLBO by facilitating the information exchange with its two nearest neighbors during the teacher phase. In [35], a quasi von Neumann topology was integrated into the learner phase of TLBO to enable each learner improving its knowledge level via the conventional search operator or neighborhood search operator with certain probability. Multi-population scheme is also commonly used to construct different neighborhood structures of TLBO so that the diverse areas of search space can be explored by different subpopulations simultaneously. For instance, a dynamic group strategy (DGS) was proposed into DGSTLBO in [36] to divide the population into multiple groups with equal numbers of learners based on their Euclidean distance. Apart from learning via the teacher and subpopulation’s mean, there is a probability for learners to update their knowledge with a quantum-behaved learning strategy. Different clustering techniques such as fuzzy K-means [37], adaptive clustering [38], random clustering [39] and etc. were applied to divide the main population of TLBO into certain cluster numbers by referring to a metric known as spatial distribution. A two-level hierarchical multi-swam cooperative TLBO (HMCTLBO) with enhanced exploration strength and population diversity was proposed in [40]. Each subpopulation was first constructed at the bottom layer via random partition and evolved independently. The best learner in each subpopulation at bottom layer was then selected to form the top layer and evolved through Gaussian sampling learning. Another popular approach widely used for enhancing the TLBO’s performance is through the modification of learning strategy. A modified TLBO variant was designed in [41] by replacing the original learning strategy of teacher phase with Gaussian probabilistic model, while the learner phase was incorporated with a neighborhood search operator and a permutation-based crossover to guide learners better in searching for more promising areas. 
In [42], the concept of a tutorial class was introduced into the learner phase, and performance enhancement of the modified TLBO (mTLBO) was observed because of close collaborative interactions between the teacher and learners as well as among the learners.


An improved TLBO with learning experience of other learners (LETLBO) was proposed in [43] to facilitate the knowledge improvement of each learner by accessing to the experience information of other learners during the teacher and learner phases. In [44], an improved TLBO with differential learning (DLTLBO) was developed to achieve better population diversity and exploration capability. During the teacher phase of DLTLBO, two trial vectors were first produced using a neighborhood learning operator and a differential learning operator and this is followed by a crossover operation on these two trial vectors in order to produce a new learner. Another similar TLBO variant known as the TLBO-DE was proposed in [45] to generate new solution efficiently by using a hybrid search operator developed from the learning mechanism of original TLBO and differential learning. A TLBO with learning enthusiasm mechanism (LebTLBO) was proposed in [46] and it was inspired by the correlation between the grade of learners and their enthusiasm in acquiring knowledge. It was assumed that the learners with better grade have higher tendency to pursue new knowledge due to the higher learning enthusiasm and vice versa. In [47], a generalized oppositional TLBO (GOTLBO) with improved convergence characteristics was proposed by leveraging the benefits of opposition-based learning in producing new learners.

3 TLBO-MLPs

This section elaborates the detailed search mechanism of the proposed TLBO-MLPs. First, the concept of weighted mean position is introduced into the modified teacher phase, aiming to avoid premature convergence of the population by simulating the behavior of learners with different perspectives of the mainstream knowledge in the classroom. Second, the modified learner phase is equipped with a probabilistic mutation operator to simulate the tendency of some learners to choose self-learning over peer-learning in enhancing their knowledge. Third, an adaptive peer learning mechanism is devised in the modified learner phase of TLBO-MLPs to determine the likelihood of a particular learner to interact with multiple learners in order to learn different aspects of subjects (i.e., decision variables).

3.1 Modified Teacher Phase of TLBO-MLPs

During the teacher phase of conventional TLBO, it is observed from Eq. (1) that all learners contribute equally to obtain the population mean value X^{mean}. In addition, the search processes of all learners in the teacher phase are guided by the same direction information obtained from the best learner (i.e., the teacher) and the mainstream knowledge of the population (i.e., X^{mean}), as demonstrated in Eq. (2). If the teacher with the best knowledge level is trapped in a suboptimal region, the search processes of the remaining TLBO learners will be misguided towards the local optimum by the identical X^{teacher} and X^{mean}. This behavior explains the tendency of conventional TLBO to experience rapid diversity loss and premature convergence of the population, particularly in tackling optimization problems with complicated search spaces. In order to address the aforementioned drawback, an alternative is proposed in the teacher phase of TLBO-MLPs to obtain the mean value of the population. Unlike the conventional approach, it is suggested that each TLBO-MLPs learner has slightly different


perceptions of the mainstream knowledge of the population, hence a different mean position should be derived as the unique source of influence to guide the search process of each learner in the teacher phase. Let X_a be any a-th learner in the population, which plays a crucial role in deriving a unique mean position for guiding each i-th learner. In order to preserve the diversity level, define r_a as a uniformly distributed random number in the range between 0 and 1 to indicate the different weightage of X_a in deriving the unique mean position. Assume that \bar{X}_i^{mean} is the weighted mean position of each i-th learner; then

\bar{X}_i^{mean} = \frac{\sum_{a=1}^{I} r_a X_a}{\sum_{a=1}^{I} r_a}    (5)

Based on X^{teacher} and \bar{X}_i^{mean}, the modified teacher phase of TLBO-MLPs updates the new solution of each i-th learner as follows:

X_i^{new} = X_i + r_3 (X^{teacher} - T_{f1} X_i) + r_4 (\bar{X}_i^{mean} - T_{f2} X_i)    (6)

where r_3 and r_4 are uniformly distributed random numbers in the range between 0 and 1, and T_{f1} and T_{f2} refer to two parameters known as teaching factors that can be set with values between 1 and 2. According to Fig. 1 and Eq. (6), every i-th TLBO-MLPs learner is able to update its knowledge by learning directly based on the knowledge gap observed between (i) the teacher and the i-th learner and (ii) the weighted mean position of the other learners in the classroom and the i-th learner.

Algorithm 1: Modified Teacher Phase
Input: I, X_i, i
1: Identify the best learner in the population as X^{teacher};
2: Calculate the weighted mean \bar{X}_i^{mean} of each learner using Eq. (5);
3: Randomly generate T_{f1} and T_{f2}, where T_{f1}, T_{f2} ∈ [1, 2];
4: Update the new position X_i^{new} of the i-th learner using Eq. (6);
Output: X_i^{new}

Fig. 1. Pseudo-code for modified teacher phase of TLBO-MLPs.
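For illustration, the modified teacher phase of Eqs. (5)–(6) can be sketched in Python as follows (our own sketch, assuming minimisation and adding bound clipping, which the algorithm description does not specify):

```python
import numpy as np

def modified_teacher_phase(X, fit, lb, ub):
    """Sketch of Algorithm 1: each learner i uses its own weighted mean (Eq. (5))
    and the update rule of Eq. (6). X is I x D, fit holds objective values."""
    I, D = X.shape
    teacher = X[np.argmin(fit)]
    X_new = np.empty_like(X)
    for i in range(I):
        r_a = np.random.rand(I, 1)                        # weights r_a for Eq. (5)
        X_mean_i = (r_a * X).sum(axis=0) / r_a.sum()
        Tf1, Tf2 = np.random.uniform(1, 2, size=2)        # teaching factors in [1, 2]
        r3, r4 = np.random.rand(D), np.random.rand(D)
        X_new[i] = X[i] + r3 * (teacher - Tf1 * X[i]) + r4 * (X_mean_i - Tf2 * X[i])
    return np.clip(X_new, lb, ub)
```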

3.2 Modified Learner Phase of TLBO-MLPs

The exploration strength of TLBO is emphasized during the learner phase of the algorithm via the repulsion scheme introduced in Eq. (4), particularly when a poorly performing peer is randomly assigned to the i-th learner. When the search process of TLBO progresses further, the majority of the learners tend to converge in certain regions of the search space and the population tends to stabilize without significant changes in terms of diversity. Under this circumstance, it is less likely for a randomly selected peer learner to assist a given learner in jumping out of a local optimum region, especially for problems with complicated fitness landscapes. Given these limitations, two major modifications


are introduced into the modified learner phase of TLBO-MLPs and their mechanisms are described as follows.

A stochastic-based mutation operator that aims to perform perturbation on the TLBO-MLPs learners with a probability of P^{MUT} = 1/D is first incorporated into the modified learner phase as a new learning strategy. From the perspective of the teaching and learning paradigm, learners with different learning styles exist in the same classroom. Certain learners choose to adopt the self-learning approach over peer interaction in enhancing their knowledge level. The incorporation of the mutation scheme enables the selected TLBO-MLPs learners to perform self-learning during the modified learner phase after the modified teacher phase is completed. If any i-th learner is triggered with the self-learning mechanism with the probability of P^{MUT}, a randomly selected dimension of the i-th learner, denoted as d_r ∈ [1, D], is perturbed as follows:

X_{i,d_r}^{new} = X_{i,d_r} + r_5 (X_{d_r}^{U} - X_{d_r}^{L})    (7)

where r_5 is a uniformly distributed random number in the range between −1 and 1; X_{i,d_r}^{new}, X_{d_r}^{U} and X_{d_r}^{L} represent the d_r-th dimension of the i-th learner as well as the upper and lower boundary values of the decision variables, respectively. Figure 2 provides the detailed description of the aforementioned self-learning mechanism.

Algorithm 2: Self Learning
Input: X_i, D, X^U, X^L
1: Randomly generate a dimension index d_r ∈ [1, D];
2: Extract the d_r-th component of X_i, X^U, and X^L;
3: Perform perturbation on X_{i,d_r} to produce X_{i,d_r}^{new} using Eq. (7);
4: Return X_{i,d_r}^{new} as the perturbed solution;
Output: X_{i,d_r}^{new}

Fig. 2. Pseudo-code for self-learning mechanism of TLBO-MLPs.

For other learners that prefer to achieve knowledge improvement through peer interaction, an adaptive peer learning mechanism is introduced into TLBO-MLPs, intending to enhance the search efficiency of the algorithm. During the learner phase of conventional TLBO, each learner can only interact with one peer to update all of its dimensional components, as shown in Eqs. (3) and (4). These formulations might not describe the actual classroom scenario accurately, because the peer learning process can be more effective if every learner can interact with more peers. Furthermore, different learners might be more knowledgeable in certain subjects (i.e., dimensions), hence the weaker subjects have a higher urgency to be improved further via peer interaction. Based on these motivations, an adaptive peer-learning strategy is developed to enhance the search efficiency of TLBO-MLPs by emulating a more accurate peer learning mechanism, explained as follows. After completing the modified teacher phase, all TLBO-MLPs learners are sorted based on their current fitness values from the best to the worst. Let R_i be the ranking of each i-th learner; then

R_i = I - i    (8)


From Eq. (8), the fitter learners are assigned higher ranking values and vice versa. Let P_i^{PL} ∈ [0, 1] be the peer learning probability of every i-th learner, i.e.,

P_i^{PL} = 1 - \frac{R_i}{I}    (9)

Given the peer learning probability P_i^{PL}, the new solution of each i-th learner, represented by X_i^{new}, can be produced using the following procedure. For every d-th dimension of the i-th new solution, denoted as X_{i,d}^{new}, a random number r_6 ∈ [0, 1] is generated from a uniform distribution and then compared with the peer learning probability P_i^{PL} of the i-th learner. If r_6 is smaller than P_i^{PL}, three learners denoted as X_j, X_k and X_l are randomly selected from the population to produce a new value for X_{i,d}^{new}, as indicated in Eq. (10), where i ≠ j ≠ k ≠ l. Otherwise, the i-th learner retains its original value X_{i,d} in X_{i,d}^{new}. Define φ_i ∈ [0.5, 1] as the peer learning factor of the i-th learner, randomly generated from a uniform distribution; the adaptive peer learning mechanism of TLBO-MLPs can then be formulated as follows:

X_{i,d}^{new} = \begin{cases} X_{j,d} + \phi_i (X_{k,d} - X_{l,d}), & \text{if } r_6 < P_i^{PL} \text{ or } d = d_{rand} \\ X_{i,d}, & \text{otherwise} \end{cases}    (10)

As shown in Eq. (10), the i-th learner with a worse fitness value (i.e., higher P_i^{PL}) has a higher tendency to interact with its peers in order to update most of the dimensional components of X_i^{new}, as compared to those with better fitness (i.e., lower P_i^{PL}). Unlike the conventional TLBO, the adaptive peer learning mechanism proposed in TLBO-MLPs not only allows a given learner to interact with multiple peers, but can also adaptively determine the tendency of the learner to update its dimensional components via peer interaction based on its fitness value. The pseudocode used to describe the modified learner phase of TLBO-MLPs is presented in Fig. 3.

Algorithm 3: Modified Learner Phase
Input: X_i, D, X^U, X^L, P_i^{PL}
1: Randomly generate rand ∈ [0, 1];
2: if rand ≤ P^{MUT} then /*Perform self-learning*/
3:   Produce X_i^{new} using Self Learning (Algorithm 2);
4: else /*Perform adaptive peer-learning*/
5:   for d = 1 to D do
6:     Calculate X_{i,d}^{new} using Eq. (10);
7:   end for
8: end if
Output: X_i^{new}

Fig. 3. Pseudo-code for modified learner phase of TLBO-MLPs.
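For illustration, the modified learner phase of Eqs. (7)–(10) can be sketched as follows (our own sketch, assuming minimisation, lower/upper bounds given as arrays, and bound clipping added):

```python
import numpy as np

def modified_learner_phase(X, fit, lb, ub):
    """Sketch of Algorithms 2-3: self-learning with probability P_MUT = 1/D (Eq. (7)),
    otherwise adaptive peer learning driven by P_i^PL (Eqs. (8)-(10))."""
    I, D = X.shape
    rank = np.empty(I, dtype=int)
    rank[np.argsort(fit)] = I - np.arange(1, I + 1)       # best learner gets R_i = I - 1, Eq. (8)
    P_PL = 1 - rank / I                                   # Eq. (9)
    X_new = X.copy()
    for i in range(I):
        if np.random.rand() <= 1 / D:                     # self-learning, Eq. (7)
            dr = np.random.randint(D)
            X_new[i, dr] = X[i, dr] + np.random.uniform(-1, 1) * (ub[dr] - lb[dr])
        else:                                             # adaptive peer learning, Eq. (10)
            j, k, l = np.random.choice([a for a in range(I) if a != i], 3, replace=False)
            phi = np.random.uniform(0.5, 1)
            d_rand = np.random.randint(D)
            for d in range(D):
                if np.random.rand() < P_PL[i] or d == d_rand:
                    X_new[i, d] = X[j, d] + phi * (X[k, d] - X[l, d])
    return np.clip(X_new, lb, ub)
```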

3.3 Overall Framework of TLBO-MLPs

The overall framework of TLBO-MLPs is summarized in Fig. 4, where γ and γ_max represent a counter used to record the number of fitness evaluations (FEs) consumed and


the maximum FE number used as the termination criterion of TLBO-MLPs, respectively. At the beginning stage of optimization, all learners of TLBO-MLPs are randomly initialized and their associated fitness values are evaluated. Crucial information such as the teacher solution, weighted mean position, peer learning probability, etc. can be determined based on these fitness values. The new solution of each learner can be obtained via the modified teacher phase and the modified learner phase explained in Figs. 1 and 3, respectively. The search process is repeated until the termination condition, defined as γ > γ_max, is satisfied. The teacher solution obtained in the final stage is returned as the best solution to solve a given optimization problem.

4 Performance Evaluation on Test Functions

4.1 Test Functions and Performance Metric

Performance evaluation of TLBO-MLPs is conducted using the 30 real-parameter single-objective optimization functions introduced in CEC 2014 [48]. These test functions have different characteristics and can be categorized as unimodal functions (F1–F3), simple multimodal functions (F4–F16), hybrid functions (F17–F22) and composition functions (F23–F30). The mean fitness F_mean and standard deviation SD are adopted to measure the search performance of all compared algorithms. In particular, F_mean indicates the mean error value between the fitness value of the best solution obtained by a compared algorithm and the theoretical global optimum of a function over multiple simulation runs. Meanwhile, the consistency of a compared algorithm in solving a given test function is evaluated using SD. Smaller values of F_mean and SD imply the capability of an algorithm to tackle a test function consistently with promising search accuracy. Non-parametric statistical procedures are also employed to compare the proposed TLBO-MLPs and the other peer algorithms rigorously. The Wilcoxon signed rank test [49] is first applied to compare TLBO-MLPs with each of its peers in a pairwise manner at the significance level of α = 0.05, with the results reported as R+, R−, p and h values. The sums of ranks that indicate the outperformance and underperformance of TLBO-MLPs against each of its compared peers are represented by R+ and R−, respectively. The p value is a threshold level to identify significant performance deviations between the compared algorithms: the better results of an algorithm are statistically significant if the p value obtained is smaller than α. Based on the p and α values, the h value is used to conclude whether TLBO-MLPs is significantly better (i.e., h = '+'), insignificantly different (i.e., h = '='), or significantly worse (i.e., h = '−') than its compared peers. As for the multiple comparison of algorithms with non-parametric statistical analysis, the Friedman test is first performed to determine the average rank values of all compared algorithms [50]. If significant global differences are observed between these algorithms based on the p and α values, three post-hoc procedures, Bonferroni-Dunn, Holm and Hochberg, are employed to further investigate the concrete performance deviations among all algorithms by referring to their adjusted p-values [50].
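As a rough illustration of these statistical procedures (not part of the original paper), the F_mean/SD summaries and the pairwise and multiple comparisons can be computed with SciPy; the direction rule used for the '+'/'−' label below is our own simplification.

```python
import numpy as np
from scipy import stats

def summarise(errors):
    """errors: array of shape (runs, functions) of |f(best) - f(optimum)| values
    for one algorithm; returns F_mean and SD per function."""
    errors = np.asarray(errors)
    return errors.mean(axis=0), errors.std(axis=0)

def wilcoxon_h(fmean_mlps, fmean_peer, alpha=0.05):
    """Pairwise Wilcoxon signed-rank test on per-function F_mean values."""
    _, p = stats.wilcoxon(fmean_mlps, fmean_peer)
    if p >= alpha:
        return p, "="
    return p, "+" if np.mean(fmean_mlps) < np.mean(fmean_peer) else "-"

def friedman(all_fmeans):
    """Friedman test across all compared algorithms (one F_mean vector per algorithm)."""
    return stats.friedmanchisquare(*all_fmeans)
```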

Fig. 4. Pseudo-code for complete framework of TLBO-MLPs.


4.2 Parameter Settings for Compared Algorithms

The search performance of TLBO-MLPs in tackling all CEC 2014 test functions was compared with seven well-established TLBO variants known as: conventional TLBO [20], modified TLBO (mTLBO) [42], differential learning TLBO (DLTLBO) [44], nonlinear inertia weighted TLBO (NIWTLBO) [32], TLBO with learning experiences of other learners (LETLBO) [43], generalized oppositional TLBO (GOTLBO) [47] and TLBO with learning enthusiasm mechanism (LebTLBO) [46]. The parameter settings of all compared algorithms are presented in Table 1. To ensure a fair comparison, all TLBO variants are simulated independently 30 times in solving all CEC 2014 test functions at D = 30, using a maximum fitness evaluation number of γ_max = 10000 × D. All simulations are performed using Matlab 2019a on a workstation equipped with an Intel® Core i7-7500 CPU @ 2.0 GHz.

Table 1. Parameter settings of all eight selected TLBO variants.

Algorithm   Parameter settings
TLBO        Population size I = 50
mTLBO       I = 50
DLTLBO      I = 50, scale factor F = 0.5, crossover rate CR = 0.9 and neighborhood size ns = 3
NIWTLBO     I = 50, inertia weight ω = 0 ∼ 1.0
LETLBO      I = 50
GOTLBO      I = 50, jumping rate Jr = 0.3
LebTLBO     I = 50, maximum learning enthusiasm LEmax = 1.0, minimum learning enthusiasm LEmin = 0.3 and F = 0.9
TLBO-MLPs   I = 50, peer learning factor φ_i ∈ [0.5, 1]

4.3 Comparison of Search Performance for All Algorithms

The F_mean and SD values produced by the proposed TLBO-MLPs and the seven other peer algorithms in tackling all of the CEC 2014 functions are provided in Table 2. The best and second best results are shown in boldface and underlined text, respectively. Table 2 also summarizes the performance analysis between TLBO-MLPs and its peers in terms of w/t/l and #BR. In particular, w/t/l shows that TLBO-MLPs performs better in w functions, similarly in t functions and worse in l functions. #BR is the number of test functions with the best F_mean result produced by each algorithm.

For the unimodal functions F1 to F3, the proposed TLBO-MLPs produces two best F_mean values that reach the global optimum of functions F2 and F3. The performance of LebTLBO in tackling the unimodal functions is also promising because it produces the best and second best results in functions F1 and F2, respectively. In contrast, the search performance of mTLBO in solving the unimodal functions is inferior, as it produces the two worst F_mean results on functions F1 and F2.

Table 2. Performance comparison between TLBO-MLPs and seven peer algorithms on the 30 CEC 2014 test functions.

[The per-function F_mean and SD values of the eight compared algorithms are not reproduced here; the summary rows of the table are as follows.]

w/t/l (TLBO-MLPs vs. peer): TLBO 24/3/3, mTLBO 27/1/2, DLTLBO 24/1/5, NIWTLBO 24/2/4, LETLBO 25/2/3, GOTLBO 21/2/7, LebTLBO 23/2/5
#BR: TLBO 2, mTLBO 1, DLTLBO 2, NIWTLBO 3, LETLBO 1, GOTLBO 5, LebTLBO 5, TLBO-MLPs 21

For the simple multimodal functions F4 to F16, TLBO-MLPs demonstrates the most dominant search accuracy, producing ten best F_mean values (functions F4 to F8, F10, F12 to F14 and F16) and one second best F_mean value (function F15). The search performances of DLTLBO and LebTLBO in solving the multimodal functions are promising as well. In particular, DLTLBO produces the best F_mean value in function F11 and the second best F_mean values in functions F6, F8, F9, F10, F14 and F16. Meanwhile, LebTLBO demonstrates good performance by solving functions F9 and F15 with the best search accuracy and functions F4, F7 and F13 with the second best search accuracy.

For the hybrid functions F17 to F22, the proposed TLBO-MLPs solves these six functions successfully with five best F_mean values (functions F17 to F19, F21 and F22) and one second best F_mean value (function F20). DLTLBO is the second most competitive algorithm in handling the hybrid functions, solving function F20 with the best search accuracy and functions F17, F18, F21 and F22 with the second best F_mean values. Although LebTLBO solves the unimodal and simple multimodal functions with relatively good results, some performance degradation is observed when it tackles the more complex hybrid functions, where it only achieves the second best F_mean value for function F19.

For the composition functions F23 to F30, the search accuracies demonstrated by TLBO-MLPs and GOTLBO in addressing these eight challenging problems are comparable. In particular, the proposed TLBO-MLPs solves these composition functions with four best F_mean values (functions F24, F25, F29 and F30) and three second best F_mean values (functions F23, F27 and F28). GOTLBO produces five best F_mean values, for functions F23 to F25, F27 and F28. The other compared algorithms that did not perform well on the unimodal and multimodal functions, such as TLBO, mTLBO, NIWTLBO and LETLBO, also produce at least one best or second best result on the tested composition functions.

In summary, the proposed TLBO-MLPs delivers the best optimization performance on these 30 CEC 2014 test functions, producing a total of 21 best F_mean values and five second best F_mean values. Notably, TLBO-MLPs is the only algorithm that locates the global optima of functions F2, F3, F6 and F7. The modified teacher and learner phases incorporated into the proposed TLBO-MLPs give it competitive robustness in handling optimization problems with various types of fitness landscapes (unimodal, multimodal, hybrid and composition functions) with excellent search accuracy. In contrast, the other compared methods can only solve certain types of problems competitively. For example, the search performance of LebTLBO on the unimodal, multimodal and composition functions is promising, but its search accuracy on the hybrid functions is relatively poor. DLTLBO solves both the multimodal and hybrid functions with good accuracy but has relatively inferior performance on the unimodal and composition functions.
The Wilcoxon signed rank test is also used to compare TLBO-MLPs and each of the selected peers in a pairwise manner [49]. The non-parametric statistical analysis results of the R+, R−, p and h values are reported in Table 3. Notable enhancements of TLBO-MLPs


over TLBO, mTLBO, DLTLBO, NIWTLBO, LETLBO and LebTLBO are observed in Table 3, as shown by the h-values of ‘+’ at the significance level α = 0.05. While no significant difference is observed between TLBO-MLPs and GOTLBO in Table 3, as indicated by the h-value of ‘=’, the Wilcoxon signed rank test would confirm that TLBO-MLPs significantly outperforms GOTLBO at α = 0.10.

Table 3. Wilcoxon signed rank test for the pairwise comparison between TLBO-MLPs and the peer algorithms.

TLBO-MLPs vs. | R+    | R−    | p-value  | h-value
TLBO          | 376.5 | 58.5  | 5.63E−04 | +
mTLBO         | 407.0 | 28.0  | 4.00E−05 | +
DLTLBO        | 337.0 | 98.0  | 9.47E−03 | +
NIWTLBO       | 394.5 | 70.5  | 8.31E−04 | +
LETLBO        | 405.5 | 59.5  | 5.59E−04 | +
GOTLBO        | 326.5 | 138.5 | 5.19E−02 | =
LebTLBO       | 362.5 | 102.5 | 7.27E−03 | +

Apart from the pairwise comparison, the optimization performance of TLBO-MLPs is also evaluated using the Friedman test for multiple comparison analysis [50]. Table 4 reports that TLBO-MLPs and all the peer algorithms are ranked according to their corresponding F_mean values as follows: TLBO-MLPs, LebTLBO, DLTLBO, GOTLBO, TLBO, LETLBO, NIWTLBO and mTLBO. The p-value obtained from the chi-square statistic of the Friedman test is 0.00E+00, which is smaller than α = 0.05. This result shows that notable differences exist among all algorithms from the global perspective. Based on the Friedman test results, a set of post-hoc statistical analyses known as the Bonferroni-Dunn, Holm and Hochberg procedures is further performed to investigate the significant differences with respect to the proposed TLBO-MLPs [50]. Table 5 reports the adjusted p-values (APVs) associated with the three aforementioned post-hoc procedures. Referring to the threshold significance level α = 0.05, all post-hoc procedures verify the substantial performance improvements demonstrated by TLBO-MLPs over mTLBO, NIWTLBO, LETLBO, TLBO and GOTLBO. No significant differences are reported between TLBO-MLPs and DLTLBO or LebTLBO through these three post-hoc analysis procedures.
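A compact sketch of how such a multiple-comparison analysis could be reproduced is given below; it uses SciPy's Friedman test and a hand-rolled Holm adjustment, with placeholder error values rather than the results of this study.

# Hypothetical sketch: Friedman test over per-function errors of the eight
# algorithms, followed by a Holm step-down adjustment of pairwise p-values.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
errors = rng.random((30, 8))   # rows: test functions, columns: algorithms (placeholders)

stat, p_global = friedmanchisquare(*[errors[:, j] for j in range(8)])
avg_rank = np.argsort(np.argsort(errors, axis=1), axis=1).mean(axis=0) + 1
print("chi-square:", stat, "global p:", p_global, "average ranks:", avg_rank)

def holm(p_values):
    """Holm step-down adjustment of unadjusted pairwise p-values."""
    p_values = np.asarray(p_values, dtype=float)
    m = len(p_values)
    order = np.argsort(p_values)
    adjusted, running_max = np.empty(m), 0.0
    for i, idx in enumerate(order):
        running_max = max(running_max, (m - i) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# example: adjust seven placeholder pairwise p-values against the control
print(holm([1e-6, 2e-3, 0.04, 0.2, 0.5, 0.7, 0.9]))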


Table 4. Average ranking and the p-value obtained from the Friedman test (chi-square statistic = 8.51E+01, p-value = 0.00E+00).

Algorithm  | Ranking
TLBO-MLPs  | 2.17
TLBO       | 5.17
mTLBO      | 6.85
DLTLBO     | 3.33
NIWTLBO    | 5.73
LETLBO     | 5.4
GOTLBO     | 4.2
LebTLBO    | 3.15

Table 5. APVs for the Bonferroni-Dunn, Holm and Hochberg procedures.

TLBO-MLPs vs. | Bonferroni-Dunn p | Holm p   | Hochberg p
mTLBO         | 0.00E+00          | 0.00E+00 | 0.00E+00
NIWTLBO       | 0.00E+00          | 0.00E+00 | 0.00E+00
LETLBO        | 2.00E−06          | 2.00E−06 | 2.00E−06
TLBO          | 1.50E−05          | 8.00E−06 | 8.00E−06
GOTLBO        | 9.13E−03          | 3.91E−03 | 3.91E−03
DLTLBO        | 4.56E−01          | 1.30E−01 | 1.20E−01
LebTLBO       | 8.40E−01          | 1.30E−01 | 1.20E−01

5 Conclusions

A new TLBO variant known as teaching-learning-based optimization with modified learning phases (TLBO-MLPs) is proposed in this paper, aiming to enhance robustness in handling optimization problems with different types of characteristics. Substantial efforts are made to refine the framework of TLBO-MLPs so that the adopted teaching and learning mechanisms represent the real-world classroom paradigm more accurately, hence leading to an improvement in search performance. A modified teacher phase is first designed in TLBO-MLPs to preserve the population diversity level by enabling the learners to have different perceptions of the mainstream knowledge when updating their knowledge. Meanwhile, the modified learner phase incorporated into TLBO-MLPs aims to improve the convergence characteristic by allowing each learner to interact with different peers in different subjects based on an adaptive peer learning probability.


A self-learning mechanism is also introduced to allow certain learners to update their knowledge through personal effort rather than via peer interactions. Comprehensive analyses were conducted to assess the performance of TLBO-MLPs in tackling the CEC 2014 test functions. The rigorous comparison between TLBO-MLPs and seven other TLBO variants is verified through non-parametric statistical analysis procedures. As future work, an extensive theoretical framework can be formulated to investigate the convergence characteristics of TLBO-MLPs. It is also worth studying the applicability of the proposed TLBO-MLPs to real-world optimization problems such as material machining optimization and robust controller design optimization. Finally, the feasibility of TLBO-MLPs in tackling more challenging optimization problems with multimodal, multi-objective, constrained and dynamic characteristics is another promising future research direction.

References 1. Whitley, D., Sutton, A.M.: Genetic algorithms — a survey of models and methods. In: Rozenberg, G., Bäck, T., Kok, J.N. (eds.) Handbook of Natural Computing, pp. 637–671. Springer, Heidelberg (2012) 2. Kramer, O.: Evolutionary self-adaptation: a survey of operators and strategy parameters. Evol. Intell. 3(2), 51–65 (2010) 3. Burke, E., Gustafson, S., Kendall, G.: A survey and analysis of diversity measures in genetic programming. In: Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, New York (2002) 4. Das, S., Suganthan, P.N.: Differential evolution: a survey of the state-of-the-art. IEEE Trans. Evol. Comput. 15(1), 4–31 (2011) 5. Valle, Y.D., Venayagamoorthy, G.K., Mohagheghi, S., Hernandez, J., Harley, R.G.: Particle swarm optimization: basic concepts, variants and applications in power systems. IEEE Trans. Evol. Comput. 12(2), 171–195 (2008) 6. Dorigo, M., Blum, C.: Ant colony optimization theory: a survey. Theor. Comput. Sci. 344(2– 3), 243–278 (2005) 7. Karaboga, D., Gorkemli, B., Ozturk, C., Karaboga, N.: A comprehensive survey: artificial bee colony (ABC) algorithm and applications. Artif. Intell. Rev. 42(1), 21–57 (2014) 8. Faris, H., Aljarah, I., Al-Betar, M.A.: Mirjalili, S,: Grey wolf optimizer: a review of recent variants and applications. Neural Comput. Appl. 30(2), 413–435 (2018) 9. Ang, C.K., Tang, S.H., Mashohor, S., Ariffin, M.K.A.M., Khaksar, W.: Solving continuous trajectory and forward kinematics simultaneously based on ANN. Int. J. Comput. Commun. Control 9(3), 253–260 (2014) 10. Alrifaey, M., Tang, S.H., Supeni, E.E., As’arry, A., Ang, C.K.: Identification and priorization of risk factors in an electrical generator based on the hybrid FMEA framwork. Energies 12(4), 649 (2019) 11. Lim, W.H., Isa, N.A.M.: Particle swarm optimization with dual-level task allocation. Eng. Appl. Artif. Intell. 38, 88–110 (2015) 12. Yao, L., Shen, J.Y., Lim, W.H.: Real-time energy management optimization for smart household. In: 2016 IEEE International Conference on Internet of Things (iThings), Chengdu, China, pp. 20–26 (2016)


13. Yao, L., Damiran, Z., Lim, W.H.: Energy management optimization scheme for smart home considering different types of appliances. In: 2017 IEEE International Conference on Environment and Electrical Engineering and 2017 IEEE Industrial and Commercial Power Systems Europe (EEEIC/I&CPS Europe), Milan, Italy, pp. 1–6 (2017) 14. Solihin, M.I., Akmeliawati, R., Muhida, R., Legowo, A.: Guaranteed robust state feedback controller via constrained optimization using differential evolution. In: 6th International Colloquium on Signal Processing & its Applications, pp. 1–6 (2010) 15. Solihin, M.I., Wahyudi, Akmeliawati, R.: PSO-based optimization of state feedback tracking controller for a flexible link manipulator. In: International Conference of Soft Computing and Pattern Recognition, pp. 72–76 (2009) 16. Lim, W.H., Isa, N.A.M., Tiang, S.S., Tan, T.H., Natarajan, E., Wong, C.H., Tang, J.R.: Selfadaptive topologically connected-based particle swarm optimization. IEEE Access 6, 65347– 65366 (2018) 17. Sathiyamoorthy, V., Sekar, T., Natarajan, E.: Optimization of processing parameters in ECM of die tool steel using nanofluid by multiobjective genetic algorithm. Sci. World J. 2015, 6 (2015) 18. Yao, L., Lim, W.H., Tiang, S.S., Tan, T.H., Wong, C.H., Pang, J.Y.: Demand bidding optimization for an aggregator with a genetic algorithm. Enegies 11(10), 2498 (2018) 19. Yao, L., Damiran, Z., Lim, W.H.: A fuzzy logic based charging scheme for electric vehicle parking station. In: 2016 IEEE 16th International Conference on Environment and Electrical Engineering, Florence, Italy (2016) 20. Rao, R.V., Savsani, V.J., Vakharia, D.P.: Teaching–learning-based optimization: a novel method for constrained mechanical design optimization problems. Comput. Aided Des. 43(3), 303–315 (2011) 21. Zou, F., Chen, D., Xu, Q.: A survey of teaching–learning-based optimization. Neurocomputing 335, 366–383 (2019) 22. Natarajan, E., Kaviarasan, V., Lim, W.H., Tiang, S.S., Parasuraman, S., Elango, S.: Non-dominated sorting modified teaching-learning-based optimization for multi-objective machining of polytetrafluoroethylene (PTFE). J. Intell. Manuf. 31, 911–935 (2020) 23. Rao, R.V., Waghmare, G.G.: Multi-objective design optimization of a plate-fin heat sink using a teaching-learning-based optimization algorithm. Appl. Therm. Eng. 76, 521–529 (2015) 24. Natarajan, E., Kaviarasan, V., Lim, W.H., Tiang, S.S., Tan, T.H.: Enhanced multi-objective teaching-learning-based optimization for machining of Delrin. IEEE Access 6, 51528–51546 (2018) 25. Yua, K., Wang, X., Wang, Z.: Constrained optimization based on improved teaching–learningbased optimization algorithm. Inf. Sci. 352–353, 61–78 (2016) 26. Savsani, V.J., Tejani, G.G., Patel, V.K.: Truss topology optimization with static and dynamic constraints using modified subpopulation teaching–learning-based optimization. Eng. Optim. 48(11), 1990–2006 (2016) 27. Zheng, H., Wang, L., Zheng, X.: Teaching–learning-based optimization algorithm for multiskill resource constrained project scheduling problem. Soft. Comput. 21(6), 1537–1548 (2017) 28. Akhtar, J., Koshul, B., Awais, M.: A framework for evolutionary algorithms based on charles sanders peirce’s evolutionary semiotics. Inf. Sci. 236, 93–108 (2013) 29. Satapathy, S.C., Naik, A., Parvathi, K.: Weighted teaching-learning-based optimization for global function optimization. Appl. Math. 4(3), 429–439 (2013) 30. 
Cao, J., Luo, J.: A study on SVM based on the weighted elitist teaching-learning-based optimization and application in the fault diagnosis of chemical process. MATEC Web Conf. 22, 05016 (2015)


31. Li, G., Niu, P., Zhang, W., Liu, Y.: Model NOx emissions by least squares support vector machine with tuning based on ameliorated teaching–learning-based optimization. Chemom. Intell. Lab. Syst. 126, 11–20 (2013) 32. Wu, Z.-S., Fu, W.-P., Xue, R.: Nonlinear inertia weighted teaching-learning-based optimization for solving global optimization problem. Comput. Intell. Neurosci. 2015(292576), 15 (2015) 33. Chen, D., Lu, R., Zou, F., Li, S.: Teaching-learning-based optimization with variablepopulation scheme and its application for ANN and global optimization. Neurocomputing 173, 1096–1111 (2016) 34. Wang, L., Zou, F., Hei, X., Yang, D., Chen, D., Jiang, Q.: An improved teaching–learningbased optimization with neighborhood search for applications of ANN. Neurocomputing 143, 231–247 (2014) 35. Chen, D., Zou, F., Li, Z., Wang, J., Li, S.: An improved teaching–learning-based optimization algorithm for solving global optimization problem. Inf. Sci. 297, 171–190 (2015) 36. Zou, F., Wang, L., Hei, X., Chen, D., Yang, D.: Teaching–learning-based optimization with dynamic group strategy for global optimization. Inf. Sci. 273, 112–131 (2014) 37. Zhai, Z., Li, S., Liu, Y., Li, Z.: Teaching-learning-based optimization with a fuzzy grouping learning strategy for global numerical optimization. J. Intell. Fuzzy Syst. 29(6), 2345–2356 (2015) 38. Reddy, S.S.: Clustered adaptive teaching–learning-based optimization algorithm for solving the optimal generation scheduling problem. Electr. Eng. 100(1), 333–346 (2018) 39. Li, M., Ma, H., Gu, B.: Improved teaching–learning-based optimization algorithm with group learning. J. Intell. Fuzzy Syst. 31(4), 2101–2108 (2016) 40. Zou, F., Chen, D., Lu, R., Wang, P.: Hierarchical multi-swarm cooperative teaching–learningbased optimization for global optimization. Soft. Comput. 21(23), 6983–7004 (2017) 41. Shao, W., Pi, D., Shao, Z.: An extended teaching-learning based optimization algorithm for solving no-wait flow shop scheduling problem. Appl. Soft Comput. 61, 193–210 (2017) 42. Satapathy, S.C., Naik, A.: Modified teaching–learning-based optimization algorithm for global numerical optimization—a comparative study. Swarm Evol. Comput. 16, 28–37 (2014) 43. Zou, F., Wang, L., Hei, X., Chen, D.: Teaching-learning-based optimization with learning experience of other learners and its application. Appl. Soft Comput. 37, 725–736 (2015) 44. Zou, F., Wang, L., Chen, D., Hei, X.: An improved teaching-learning-based optimization with differential learning and its application. Math. Probl. Eng. 2015(754562), 19 (2015) 45. Wang, L., et al.: A hybridization of teaching–learning-based optimization and differential evolution for chaotic time series prediction. Neural Comput. Appl. 25(6), 1407–1422 (2014) 46. Chen, X., Xu, B., Yu, K., Du, W.: Teaching-learning-based optimization with learning enthusiasm mechanism and its application in chemical engineering. J. Appl. Math. 2018(1806947), 19 (2018) 47. Chen, X., Yu, K., Du, W., Zhao, W., Liu, G.: Parameters identification of solar cell models using generalized oppositional teaching learning based optimization. Energy 99, 170–180 (2016) 48. Liang, J.J., Qu, B.Y., Suganthan, P.N.: Problem definitions and evaluation criteria for the CEC 2014 special session and competition on single objective real-parameter numerical optimization. Zhengzhou University, Zhengzhou China Computational Intelligence Laboratory (2013) 49. 
García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 special session on real parameter optimization. J. Heuristics 15(6), 617 (2008) 50. Derrac, J., García, S., Molina, D., Herrera, F.: A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 1(1), 3–18 (2011)

Use of Artificial Intelligence and Machine Learning for Personalization Improvement in Developed e-Material Formatting Application Kristine Mackare1(B) , Anita Jansone1 , and Raivo Mackars2 1 Faculty of Science and Engineering, Liepaja University, Liepaja, Latvia

[email protected] 2 IK, Liepaja, Latvia

Abstract. Although screen technology is used for most daily tasks, including work tasks and education, a significant part of users and learners have problems reading for an extended time without complaints. The developed e-material formatting application works on the basis of the developed methodology for e-material formatting, aimed at improving text perception from the screen. The methodology is limited: it includes several variables for general formatting improvement and primary personalization. More in-depth and more specific personalization requires a broader range of variables to be involved, which leads to an enormous number of potential configurations in total. This is not something that the human mind can handle in order to generate appropriate methodologies. The use of Artificial Intelligence and Machine Learning can improve personalization, as it can process a huge amount of data and analyze algorithms faster to reach a solution and the stated goal. Methodology: use case analysis of the e-material formatting application. Results: the paper includes a short description of the current situation and the developed application; a description of the existing application limitations and challenges; an analysis of Artificial Intelligence and Machine Learning use in this case; an analysis of the data necessary for Machine Learning development; a description of the information types the Artificial Intelligence generates after the learning process; and a description of the Artificial Intelligence process for e-material formatting personalization. Conclusions: the use of Artificial Intelligence and Machine Learning is reasonable for personalization improvement in the developed e-material formatting application. Keywords: Artificial Intelligence · E-material formatting application · Machine learning

1 Introduction

Smart technologies with screens are in daily use for achieving most of humans' daily tasks. This includes personal and work tasks as well as educational purposes and acquiring knowledge [1]. Despite the accessibility of a wide range of different visual, graphical and audio solutions and materials, textual information is still the most popular type of e-material for everyday use [2].


Reading from screens differs from reading from paper [3]. People need to adapt and use a new reading model [4], but this is a slow process, and the evolution of human visual perception [5] is not as fast as the development of technologies [6]. A significant part of users have complaints after screen reading [7]. This can cause various problems, including health problems [8], that affect people's quality of life. The human visual system needs help; a solution is needed. A possible solution is an e-material formatting application. Appropriate formatting of e-materials should be applied for natural and comfortable perception of the text and its content. Typographical aspects of the text closely interact with visual processes. They should also support the learning process and facilitate memorization.

2 Developed Application

2.1 Methodologies

The developed methodology is the result of several years of research on developing recommendations for e-material formatting based on vision science, user behavior, and user needs and preferences. Several prior tasks were carried out for the methodology development:

• A broad literature review of the currently offered and available recommendations, guidelines and methodologies for e-materials was carried out, and it showed ambiguity in the suggestions.
• Statistical research on digital device and internet use in the population, as well as their involvement in educational activities worldwide, in Europe and in Latvia, was performed.
• Research on users' needs, preferences and habits was conducted.
• Users' complaints and vision problems related to near work and screen work were reviewed.
• Patient data record analysis from practice showed vision conditions, the most frequent symptoms and complaints after screen work, as well as refraction changes and other findings in ocular health.

An extended presentation of all collected and analyzed data can be found in previous publications [9–12]. In previous work [9, 13] it was proposed that five main parameters, namely font type, font size, line spacing, text color and background color, are suggested as the basic parameters for e-material formatting. The target audience was primarily divided into nine age groups: children as pre-school 3–5 and grade-schooler 6–11, teens 12–15, youth 16–25, young adults 26–35, adults 36–39, middle-aged adults 40–55, senior adults 55+, and elderly 65+ [14]. However, the methodology is limited. It includes several variables for general formatting improvement and primary personalization. Based on this methodology, the application's possibilities for formatting personalization are also limited.


2.2 The Application

Concept of the Application. A concept for the application prototype has been created. It is based on the developed recommendations for e-material formatting and on users' individual factors and needs. The concept involves the idea of the tool itself: what it must do and how, what information the database must contain, and a vision of the design. It must give the programmer a clear understanding of what has to be built. The app must work for both e-material creators and e-material users. The main idea of the concept is represented by three main edges of the app: user, interface and database. The scheme also represents the collaboration processes between the edges and the main idea of the app, text formatting of documents based on user groups. A more detailed description of the concept has been published previously [15–17].

The Prototype of the Application

The First Prototype of the Application. The already developed first-version prototype gives an overview of how the application works, following the implemented working schemes, the relationships between the main edges, and the collaboration process between the user, the material and the database. To be able to give formatting recommendations, the application collects the necessary data from the user. This is followed by solution finding with step-by-step recommendations and a so-called tree scheme of users' answers and related application responses, which produces a text formatting recommendation and applies it. After the user has tried the new formatting of the e-material, the application provides a short questionnaire for user feedback. The application can be described from four sides: developers, e-material formatting users as readers, e-material creators, and researchers. The application prototype is developed on a Moodle-type platform base, but with possible transformation and adaptation for different environments and a wider range of use. As the application is developed to give researchers access to the database, it helps in user-habit research and allows the application to be kept up to date so that the application learning process can be reached. This is an important part of today's user-centric designs for user satisfaction. A more detailed description of the application prototype has been published previously [15, 18].

The Second Prototype of the Application. After the development of the first version of the prototype, the authors concluded that not only is it possible to make such an application, but that there is room for improvement [18]. As the app will be deployed in an e-study environment, the second prototype was written in the PHP 7.3 programming language, a server-side language used for interaction between the browser and the server, which makes it a suitable option for this app development. Using PHP 7.3 functions allows overwriting the XML code, thus changing and modifying the e-materials and adapting them to learner needs. Second, the improved version of the prototype focuses on making the app accessible and workable for all of the most popular e-material formats, including PDF. In theory, it is possible to develop the app as a web browser add-on which can modify the base code of any loaded webpage. It would then be able to edit the HTML format by modifying its CSS parameters too; this will be realized in the final version of the app. The third significant upgrade is that the app is made in such a way that it can be used with any e-material system, not only Moodle. Moreover, as e-learning is getting more and more popular on the web, beyond traditional schools and universities, the authors


concluded that the app must also be able to format web materials that are viewable only in a browser, and therefore edit HTML and CSS. Since it is planned to implement machine learning based artificial intelligence to suggest the best parameters, such as fonts, sizes and colors, for each person, and since this initially needs a decent-sized dataset of user choice preferences, it is important to collect as much data during alpha tests as possible. The second prototype can be used for this. It was therefore decided that, on requesting a file, the user will see two file previews on screen: the original file on one side and the prototype's modified version on the other, with an option to download either version. If the user chooses to download the modified version, this indicates interest in the personalized file seen in the preview, and the next time the user receives or downloads another file in the system, a small survey can ask whether the personalized version was liked and whether there is something the user would change. Alternatively, if the user chooses the original file, an instant survey is given to find out whether the user was not interested in a personalized file as an option at all or simply did not like the personalized option that was offered.

3 Artificial Intelligence and Machine Learning Use in Improvement of the e-Material Formatting Application

3.1 Benefits of Using Artificial Intelligence and Machine Learning in This Case

Machine learning involves feeding complex algorithms, designed to carry out data processing tasks in a way similar to the human brain, with huge amounts of data. The result is computer systems which become capable of learning [19]. Why could the machine learning process make the automated formatting app so powerful as a tool for e-material personalization? Consider the following (a small enumeration sketch is given after this list):

• There are 5 basic parameters - font style, font size, space between lines, text color and background color - which are used as the most important parameters to format text appropriately.
• Each parameter could be formatted with 3 recommended values (described in the recommendations of the e-material formatting methodologies) and at least 3 additional possible values (which are not in the general recommendations, as they are more specific to individual users) - that is, from 125 general combinations that could be used for formatting improvement up to 7776 total possible combinations for formatting personalization.

These combinations should be applied based on a huge number of user variables (about 34381):

• users' general features such as age and gender - about 81 combinations, if only the nine complex age groups are considered rather than each individual age,
• and a variety of individual characteristics such as vision problems, disabilities, limitations, reading or learning disorders, as well as possible cultural and professional aspects (at least 343 combinations).
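The size of this configuration space can be illustrated with a few lines of code; the concrete parameter values below are placeholders and not the recommended values of the developed methodology.

# Illustrative sketch of the formatting-configuration space (placeholder values).
from itertools import product

parameters = {
    "font_type":    ["Verdana", "Arial", "Calibri", "Tahoma", "Georgia", "Open Sans"],
    "font_size":    [12, 14, 16, 18, 20, 22],
    "line_spacing": [1.0, 1.15, 1.5, 1.75, 2.0, 2.5],
    "text_color":   ["#000000", "#1a1a1a", "#333333", "#00264d", "#4d2600", "#262626"],
    "background":   ["#ffffff", "#f5f5dc", "#e8f0fe", "#fdf6e3", "#f0f0f0", "#fffde7"],
}

# six candidate values per parameter give 6**5 = 7776 possible combinations;
# keeping only the recommended values per parameter shrinks the space to the
# "general" subset mentioned above
all_combinations = list(product(*parameters.values()))
print(len(all_combinations))   # 7776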


That is an enormous number of potential configurations in total. It is not something that the human mind can handle and optimize, and it does not even take into consideration specific individual features and personal preferences that can emerge only after an intensive learning process. So, with AI and machine learning there is an opportunity to take a further step and create an e-material formatting tool with a more personalized and individual approach, as it allows as many variables as necessary to be taken into consideration for the best solution and opens possibilities to use other features that can be observed only while the tool is being used and during its learning process. The goal is to build a product that completely automates the e-material formatting functions for the individual user.

3.2 Necessary Data for Machine Learning Development

For a successful Machine Learning process in the application, there is a need for a huge amount of data from which it is possible to learn. The necessary data for Machine Learning are (a possible record structure combining these items is sketched after this list):

• Previously developed methodologies and recommendations
• List of parameters for formatting
• List of variations of parameter values
• List of possible features which are involved in or potentially affect user preferences or satisfaction with formatting
• Users' general information, such as
  – age
  – gender
• Users' specific personal information, such as the existence or absence of
  – complaints during or after screen reading,
  – vision problems or ocular health problems,
  – disabilities,
  – specific limitations,
  – reading disorders,
  – learning disorders.
• Information about users' possible cultural and professional aspects.
• Information about past satisfaction or dissatisfaction of users with various types of formatting.
• Information about users' manual changes.
• Feedback from users.
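A possible shape for one such training record is sketched below; every field name is an illustrative assumption, not the application's actual schema.

# Hypothetical sketch of a single training record combining the data listed above.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormattingFeedbackRecord:
    # general user information
    age_group: str                        # e.g. "young adults 26-35"
    gender: Optional[str] = None
    # specific personal information (presence/absence of each factor)
    screen_reading_complaints: bool = False
    vision_problems: bool = False
    disabilities: bool = False
    specific_limitations: bool = False
    reading_disorders: bool = False
    learning_disorders: bool = False
    # cultural / professional context
    cultural_aspects: List[str] = field(default_factory=list)
    professional_aspects: List[str] = field(default_factory=list)
    # the formatting that was offered and how the user reacted to it
    offered_formatting: dict = field(default_factory=dict)   # values of the five parameters
    manual_changes: dict = field(default_factory=dict)       # parameters the user overrode
    satisfied: Optional[bool] = None                          # survey feedback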


3.3 Information Types Artificial Intelligence Generates After the Learning Process

The types of information the AI builds on include:

• Information from the user's personally saved application settings or set of features.
• Information in which this data is coupled with past satisfaction or dissatisfaction of users with various types of formatting.
• Statistics gathered from previous uses, users' manual changes and successful matches.
• Preparation of the individual e-material formatting options.

3.4 Description of the Artificial Intelligence Process for e-Material Formatting Personalization

A user can use the Moodle platform or another educational website and search for the necessary materials as they always would, but for personalization they need to choose the required information or set of features and upload it into the system. All this data is saved, and the app can immediately give general and basic personalized recommendations. The AI then immediately begins gathering all data, syncing it, and incorporating and building on other data points to improve the formatting matching and deliver more appropriate and more individual e-material formatting options to the user. The real power behind this technology, however, is how the AI interacts with the user's information and prepares the individual e-material formatting options. The product allows the user to enjoy good-quality e-material with increased comfort in screen reading, quickly and with much less human interaction. A simplified sketch of this loop is given below.
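The sketch starts from the rule-based recommendation of the methodology and, once feedback accumulates, prefers the formatting with the best recorded satisfaction among similar users. The rules, the similarity criterion and all names are illustrative assumptions, not the application's actual logic.

# Hypothetical sketch of the personalization loop (illustration only).
from collections import defaultdict

def baseline_recommendation(profile: dict) -> dict:
    """General recommendation derived from the methodology (placeholder rules)."""
    formatting = {"font_type": "Verdana", "font_size": 14, "line_spacing": 1.5,
                  "text_color": "#000000", "background": "#ffffff"}
    if profile.get("age_group") in ("senior adults 55+", "elderly 65+"):
        formatting["font_size"] = 18
    return formatting

# (profile signature, formatting as sorted items) -> list of satisfaction flags
feedback_log = defaultdict(list)

def signature(profile: dict) -> tuple:
    """Crude notion of 'similar users' for this sketch."""
    return (profile.get("age_group"), profile.get("vision_problems", False))

def record_feedback(profile: dict, formatting: dict, satisfied: bool) -> None:
    feedback_log[(signature(profile), tuple(sorted(formatting.items())))].append(satisfied)

def personalized_recommendation(profile: dict) -> dict:
    """Prefer the formatting with the best recorded satisfaction for similar users."""
    candidates = [(sum(flags) / len(flags), dict(fmt))
                  for (sig, fmt), flags in feedback_log.items()
                  if sig == signature(profile) and flags]
    if candidates:
        return max(candidates, key=lambda c: c[0])[1]
    return baseline_recommendation(profile)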

4 Results

The first result is an application with a fully automated e-material formatting process. The new automated process:

• can collect and analyze data very fast;
• takes a few seconds to produce an outcome, compared to the long hours or days a human would need to first develop a pattern and algorithms and then make further attempts before finding correct formatting combinations;
• can offer a much more personalized e-material formatting combination for each user by taking more individual variables and specifications into consideration;
• can learn and improve recommendations much faster than a human;
• increases users' satisfaction with e-material formatting.

The second result is the development of a full and complete e-material formatting methodology for user-centric and adaptive e-material creation and formatting.


5 Conclusions

The developed methodology and the application prototype based on it are suitable for general e-material formatting and primary personalization, but they have limitations. The limitations stem from the huge number of possible variations of individual features and, consequently, the even larger number of e-material formatting combinations needed to satisfy the user. Machine Learning will help deal with all the variables and combinations and make the e-material formatting process fully automated once the learning process has been developed in the application. For that to be done, a huge amount of data is necessary. It is believed that the second prototype of the application will be able to gather all the necessary data through alpha testing. The AI uses learning based on experience to make future decisions and makes the decision fully automated. The use of AI in improving the application is vital for a user-centered approach and personalization, as it is in line with the aim of AI to solve problems.

Acknowledgments. The article is written with the financial support of European Regional Development Fund project Nr. 1.1.1.5/18/I/018 “Pētniecības, inovāciju un starptautiskās sadarbības zinātnē veicināšana Liepājas universitātē”.

References 1. Hargreaves, T., Wilson, C.: Who uses smart home technologies? Representations of users by the smart home industry. In: Proceedings of ECEEE 2013 Summer Study – Rethinking, Renew, Restart, pp. 1769–1780 (2013) 2. Khan, M., Khushdil: Comprehensive study on the basis of eye blink, suggesting length of text line, considering typographical variables the way how to improve reading from computer screen. Adv. Internet Things 3(1), 9–20 (2013) 3. W3C: Web Content Accessibility Guidelines (WCAG) 2.0. In: Caldwell, B., Cooper, M., Reid, L.G., Vanderheiden, G. (eds.) W3C (2008) 4. Nielsen, J.: Designing Web Usability: The Practice of Simplicity. New Riders Publishing, Indianapolis (2000) 5. Ramamurthy, M., Lakshminarayanan, V.: Human vision and perception. In: Handbook of Advanced Lighting Technology. Springer (2015) 6. UNCTAD: Technology and Innovation Report 2018: Harnessing Frontier Technologies for Sustainable Development, United Nations, NY and Geneva (2018) 7. Holden, B.A., Fricke, T.R., Wilson, D.R.: Global prevalence of Myopia and High Myopia and temporal trends from 2000 through 2050, Ophthalmology, Epub, Nr., February 2016 (2016) 8. Low, W., Dirani, M., Gazzard, G., Chan, Y.H., Zhou, H.J., Selvaraj, P., et al.: Family history, near work, outdoor activity, and myopia in Singapore Chinese preschool children. Br. J. Ophthalmol. 94(8), 1012–1016 (2010) 9. Mackare, K., Jansone, A.: Research of guidelines for designing e-study materials. In: Proceedings of ETR17, The 11th International Scientific and Practical Conference Environment. Technology, Resources, Latvia, 15–17 June 2017, vol. 2 (2017) 10. Mackare, K., Jansone, A.: Digital devices use for educational reasons and related vision problems. In: Proceedings of ICLEL18, The 4th International Conference on Lifelong Education and Leadership, Poland, 2–4 July 2018 (2018)


11. Mackare, K., Jansone, A.: Habits of using internet and digital devices in education. In: Proceedings of SIE18, The International Scientific Conference Society. Integration. Education, Latvia, 25–26 May 2018, vol. V (2018) 12. Mackare, K., Zigunovs, M., Jansone, A.: Justification of the need for a custom e-material creation program. In: Proceedings of the conference Society and Culture, Liepaja, Latvia, May 2018 (2018) 13. Mackare, K., Jansone, A.: Recommended formatting parameters for e-study materials. IJLEL 4(1), 8–14 (2018) 14. Mackare, K., Jansone, A.: Personalized learning: Effective e-material formatting for users without disabilities or specific limitations, In: Proceedings of ERD2019, The 10th anniversary International Conference of Education, Research and Development, Burgas, Bulgaria, 23–27 August 2019 (2019) 15. Zigunovs, M., Jansone, A., Mackare, K.: E-learning material adaptive software development, In: Presentation of ICIC18, The 2nd International Conference of Innovation and Creativity, Liepaja, Latvia, 5–7 April 2018 (2018) 16. Mackare, K., Jansone, A., Zigunovs, M.: E-material creating and formatting application. Adv. Intell. Syst. Comput. 876, 135–140 (2018) 17. Mackare, K., Jansone, A.: The concept for e-material creating and formatting application prototype. Period. Eng. Nat. Sci. 7, 197–204 (2019). ISSN 2303-4521 18. Mackare, K., Jansone, A., Konarevs, I.: The prototype version for e-material creating and formatting application. BJMC 7(3), 383–392 (2019) 19. Singh, P.: Understanding AI and ML for Mobile app development, Towards Data Science, December 2018

Probabilistic Inference Using Generators: The Statues Algorithm Pierre Denis(B) Louvain-la-Neuve, Belgium [email protected]

Abstract. The Statues algorithm is a new probabilistic inference algorithm that gives exact results in the scope of discrete random variables. This algorithm calculates the marginal probability distributions on graphical models defined as directed acyclic graphs. These models are made up of five primitives that allow expressing, in particular, conditioning, joint probability distributions, Bayesian networks, discrete Markov chains and probabilistic arithmetic. The Statues algorithm relies on an original technique based on the generator construct, a special form of coroutine. This new algorithm aims to promote both efficiency and scope of application. This makes it valuable regarding other probabilistic inference approaches, especially in the field of probabilistic programming. Keywords: Probability · Probabilistic programming · Probabilistic arithmetic · Graphical model · Bayesian network · Algorithm · Generator

1 Introduction

Problems characterized by some uncertainty can be modeled using different approaches, formalisms and primitives. These include, among others, joint probability distributions, Bayesian networks, Markov chains, hidden Markov models, probabilistic arithmetic and probabilistic logic [5,6,19]. In order to perform actual problem resolution, each modeling approach has its own catalogue of algorithms, characterized by different merits and trade-offs. Several algorithms produce exact results but are limited practically by complexity barriers [4] whilst other algorithms can deal with intractable problems by delivering approximate results. In the specific case of Bayesian networks [14,15], exact algorithms include enumeration, belief-propagation, clique-tree propagation, variable elimination and clustering algorithms. On the other hand, approximate algorithms include rejection sampling, Gibbs sampling and Monte-Carlo Markov Chain (MCMC). Beside the Bayesian reasoning domain, probabilistic arithmetic and, more generally, the study of functions applied on random variables (+, −, ×, /, min, max, etc.) is a research field on its own that includes, for example, convolution-based approaches [1,10,23,24] and discrete envelope determination [2,3].


These well-established algorithms, in their original formulation, are specialized for one single modeling approach. In particular, algorithms for probabilistic arithmetic do not handle Bayesian reasoning or even simple conditioning. On the other hand, above-cited inference algorithms for BN do not handle arithmetic (e.g. the sum of two random variables of the BN, whether latent or observed). Also, many BN algorithms handle only observations expressible as conjunctions of equalities; without extensions, these algorithms cannot treat the conditioning in its generality, that is considering any boolean function of the BN variables as a possible assertion. In short, early probabilistic models and associated algorithms have been constrained by some compartmentalization. These limitations tend now to disappear with the advent of probabilistic programming (PP) and richer probabilistic models [11,12,16] that mix several approaches together. Following this trend, the present paper introduces a unifying modeling framework in the scope of discrete random variables. Then, it presents an inference algorithm for such framework, namely the Statues algorithm. This algorithm is in essence a (distant) variant of the enumeration algorithm that provides significant improvements in terms of scope and efficiency. The enabler of this algorithm is the generator construct, a special form of coroutine (as defined by Knuth in [13]). This construct, which is available in several modern programming languages, is of great interest for combinatorial generation [20]. It seems however to be overlooked in computer science literature: coroutines/generators are not widely used in published algorithms, for which the subroutine construct is prevalent. At the time of writing, to the best of author’s knowledge, no probabilistic inference algorithm using generators has been published yet.1 The paper is organized as follows. Section 2 introduces the aforementioned modeling framework. Section 3 details the Statues algorithm, with pseudocode and some example of execution. Section 4 discusses the salient points of the algorithm as well as possible extensions. Section 5 presents existing implementations, with some examples of usage. Section 6 gives the conclusions.

2 p-Expressions: A Unifying Probabilistic Framework

The Statues algorithm uses a modeling framework that scopes discrete random variables having a finite domain. It aims, in particular, to unify Bayesian reasoning with probabilistic arithmetic in such scope. There is no restriction on the domains of the variables, provided that they are discrete and finite: these can be numbers, matrices, symbols, booleans, tuples, functions, etc. An order relationship on these domains is not required. In the following, the objects characterized above shall be referred to as “random variables”, or simply “RV”. In the present framework, the probabilistic model for any RV is defined either by a probability mass function or by a precise dependency to other RV’s.

(Footnote 1: The continuation construct, originated from functional languages, is another way to achieve coroutines. It is worth pointing out that “continuation passing style” (CPS) is used in marginalization algorithms of WebPPL [12], a modern probabilistic programming language based on JavaScript.)

(Footnote 2: The term “random variable” is stricto sensu specific to real number domains. This limitation is deliberately set aside here for the sake of generality. Consistently, the wording “probability mass distribution” could be replaced by “categorical distribution”. Also, the mathematical formalism of probability spaces (Ω, F, P) is avoided here, even if the present framework could be expressed using this formalism.)

In order to express such a model in a form usable by an algorithm, a set of primitives is introduced hereafter under the name of “p-expressions” or “pex”, for short (“pexes” in the plural). Five types of p-expression are defined, namely the elementary pex, the tuple pex, the functional pex, the conditional pex and the table pex. As will be shown, these primitives allow building up graphical models [6] for, among others, joint probability distributions, Bayesian networks, discrete Markov chains, conditioning and probabilistic arithmetic. In the following, the typographical convention uses uppercase (e.g. X) for random variables and bold lowercase (e.g. x) for p-expressions.

2.1 Elementary Pex

An elementary pex models an RV X characterized by a given probability mass function x.

Elementary pexes are the most basic type of pex. They require a probability mass function (pmf) specifying a prior probability for each possible value of their domain. This function shall obey Kolmogorov's axioms: in particular, the individual probabilities shall be non-negative and the sum of all probabilities over the domain shall be 1. As a simple example, an RV C giving the result obtained by flipping a fair coin can be specified by the elementary pex c defined as

c := {(tail, 1/2), (head, 1/2)}    (1)

using the pmf notation borrowed from Williamson [23]. Continuous random variables are excluded from the definition of elementary pex, but their probability density functions can be approximated by a pmf through discretization; several methods exist for this purpose, with known shortcomings [1,2]. The Poisson and hypergeometric distributions are excluded also since, although discrete, they are not finite; such distributions could however be approximated, e.g. by considering only the finite set of values having a probability above a given threshold and normalizing the probabilities to have a total of 1.

Elementary pexes model RV’s that are pairwise independent. It is important to avoid confusion between the pex and pmf concepts: each pex represents one defined event or outcome, even if several pexes may have identical pmf. For instance, n throws of the same die shall be represented by n pexes that are defined using the same pmf; the same applies also if n similar dice are thrown together.

As a special case, an elementary pex may have a domain of one unique element; this element is then certain and has a probability of 1. For instance,


the π number can be represented as the elementary pex {(π, 1)} and the empty tuple [ ] as {([ ], 1)}. For easiness, the notation π, [ ], etc. shall be used for such special pexes. Even if there is no randomness in such a contrived construct, this assimilation is meant to simplify the inference algorithm when constants and random variables are mixed in the same probabilistic model. Since any finite domain is accepted for an elementary pex, two other special cases are worth mentioning because they are ubiquitous.

– A Boolean elementary pex is used to model a proposition that is uncertain, having a given probability p to be true. It is modeled as an elementary pex having Booleans as domain:

b := {(true, p), (false, 1 − p)}    (2)

For convenience, the notation b := t(p) shall be used to represent such an elementary Boolean pex. Remark that, according to the notation given above, t(1) = true and t(0) = false.

– A joint elementary pex is used to define a given joint probability distribution. Its domain is a set of tuples of the same length; each tuple represents a possible outcome and each element of the tuple represents a given attribute or measure of this outcome. For example, the following joint elementary pex links the weather (W) and someone's mood (M) by means of tuples [ W, M ]:

d := {([ rainy, sad ], 0.20), ([ rainy, happy ], 0.10), ([ sunny, sad ], 0.05), ([ sunny, happy ], 0.65)}

The joint elementary pex is one of the ways to model interdependence between random phenomena. Such interdependence shows up in the present example since, in particular, the probability Pr([ W, M ] = [ sunny, happy ]) = 0.65 is not equal to the product of the marginal probabilities Pr(W = sunny) × Pr(M = happy) = (0.05 + 0.65) × (0.10 + 0.65) = 0.525. It is well known that joint probability distributions are not well suited if the number of attributes (or their domains' size) is large; see for example [15] and [19]. The following sections introduce derived pex types, which allow modeling interdependence in more expressive and more compact ways.
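To make these definitions concrete, here is a minimal sketch (independent of the Statues algorithm itself) in which the pmf of an elementary pex is represented as a Python dict; it reproduces the coin pex of Eq. (1), the Boolean notation t(p) of Eq. (2), and the marginal check performed on d above.

# Minimal sketch: pmfs of elementary pexes as dicts (illustration only).
from fractions import Fraction

c = {"tail": Fraction(1, 2), "head": Fraction(1, 2)}      # fair coin, Eq. (1)

def t(p):
    """Boolean elementary pex with probability p of being true, Eq. (2)."""
    return {True: p, False: 1 - p}

d = {("rainy", "sad"): 0.20, ("rainy", "happy"): 0.10,
     ("sunny", "sad"): 0.05, ("sunny", "happy"): 0.65}    # joint elementary pex

p_sunny = sum(p for (w, m), p in d.items() if w == "sunny")   # 0.70
p_happy = sum(p for (w, m), p in d.items() if m == "happy")   # 0.75
print(d[("sunny", "happy")], p_sunny * p_happy)               # 0.65 vs 0.525 -> dependent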

2.2 Tuple Pex

A tuple pex models a given tuple of n RVs [ X1, ..., Xn ], with n ≥ 1. It is noted x1 ⊗ ... ⊗ xn, where the xi are the pexes modeling the Xi.


For instance, a tuple pex y can be defined to model a tuple of two RVs [ X1, X2 ] characterized by two elementary pexes x1, x2 having Bernoulli distributions:

  x1 := {(0, 1/2), (1, 1/2)}
  x2 := {(0, 3/4), (1, 1/4)}
  y := x1 ⊗ x2

Without any further condition involving x1 or x2, the pmf related to y can be calculated by enumeration as {([ 0, 0 ], 3/8), ([ 0, 1 ], 1/8), ([ 1, 0 ], 3/8), ([ 1, 1 ], 1/8)}. The order of elements in a tuple is significant; so the tuple pex x1 ⊗ x2 is different from x2 ⊗ x1 (incidentally, the pmf calculated on the latter tuple pex differs from the former one). Note that, by definition, the empty tuple [ ] is not a tuple pex: as seen in Sect. 2.1, it can be assimilated to the elementary pex [ ]. It is important to understand that a tuple pex containing elementary pexes (such as y) is not equivalent to a joint elementary pex (as seen in Sect. 2.1), despite the fact that the domains involved are sets of tuples in both cases. A tuple pex is a derived pex and, as such, it cannot be used to specify a joint probability distribution. The true interest of a tuple pex is to express dependencies in other derived pexes, as will be shown soon. The tuple pex is the first type of derived pex. Generally speaking, a derived pex may refer to several pexes, which may themselves be derived. Before going any further, two important constraints that apply to any type of pex must be introduced.

No Cycle – Cycles are forbidden in any pex. The interdependence between RVs shall be expressible as a directed acyclic graph (DAG); see Sect. 2.4.

Referential Consistency – As will be shown soon, a derived pex may contain multiple references to the same pex x at different places. Since any given pex is meant to represent one given random variable, it is required that the values for any outcome are consistent between all occurrences of x. This constraint is referred to throughout the present paper as referential consistency. A rather contrived example is given when a given pex appears twice in the same tuple, e.g. v := x2 ⊗ x2. Referential consistency forces the two occurrences to refer to the very same pex; using the definition of x2 given above, the pmf calculated from v is then {([ 0, 0 ], 3/4), ([ 1, 1 ], 1/4)}. More meaningful examples will be given in the next subsections. Referential consistency is closely linked to the concept of stochastic memoization found at least in Church [11] and WebPPL [12]. The two concepts actually enforce the same consistency constraint. The difference lies essentially in the fact that referential consistency applies to exact probabilistic inference, whereas stochastic memoization applies to approximate probabilistic inference (e.g. MCMC).
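The enumeration giving the pmf of y can be reproduced by the following Python sketch, valid only for independent elementary pexes (the referential-consistency subtleties discussed above are deliberately ignored here); the helper name tuple_pmf and the dict-based pmf representation are assumptions made for illustration.

from fractions import Fraction
from itertools import product

def tuple_pmf(*pmfs):
    """Pmf of a tuple of independent pmfs, computed by enumeration."""
    result = {}
    for combo in product(*(pmf.items() for pmf in pmfs)):
        values = tuple(v for v, _ in combo)
        prob = Fraction(1)
        for _, p in combo:
            prob *= p
        result[values] = result.get(values, Fraction(0)) + prob
    return result

x1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
x2 = {0: Fraction(3, 4), 1: Fraction(1, 4)}
print(tuple_pmf(x1, x2))
# pmf: (0, 0) -> 3/8, (0, 1) -> 1/8, (1, 0) -> 3/8, (1, 1) -> 1/8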

2.3 Functional Pex

A functional pex models the RV f(X) obtained by applying a given unary function f to a given RV X. It is noted f̂(x), where x is the pex modeling X.


f shall be a pure function, that is, deterministic and without side effects. It can use any algorithm, provided that it can evaluate the result in a finite time and in a deterministic way, whatever the argument given. Any given n-ary function with n > 1 can easily be converted into a unary function by packing the arguments into an n-tuple. A functional pex can then be defined by grouping the arguments in an n-tuple pex. It is important to understand the distinction between the notations f(x) and f̂(x): the former denotes a value conforming to the usual mathematical meaning, while the latter denotes an assembly made up of the function f and the pex x, as objects in their own right. Actually, the latter is close to the concept of lazy evaluation found in programming language theory. Functional pexes allow representing, first and foremost, basic mathematical operations on RVs (in the following, N, X, Y, Z have numerical domains and A, B have Boolean domains):

– arithmetic: X + Y, X − Y, X·Y, −X, √X, X^Y, etc.
– comparison: X = Y, X ≠ Y, X < Y, X ≤ Y, etc.
– logical: ¬A, A ∧ B, A ∨ B, A ⇒ B, etc.

and any combination of these operations using usual function composition, like E := (X + Y ≥ 6) ∧ (Y ≤ 4). Using the notations introduced above and replacing infix subexpressions by named unary functions, E could be modeled by the functional pex e := ând(ĝe(âdd(x ⊗ y) ⊗ 6) ⊗ l̂e(y ⊗ 4)). In this last example, referential consistency (Sect. 2.2) shall ensure that all occurrences of y refer to the same RV Y. More generally, referential consistency enforces the rules of algebra and logic in the context of probabilistic models. It ensures for example that a pex representing X + X shall result in the same probability distribution as the pex representing 2X (or Y − X − Y + 3X, or (X + 1)² − X² − 1, or ...). The lack of referential consistency is referred to as a "dependency error" in [23] and [24]. In contrast to these approaches, which investigate how such errors can be bounded, any such dependency error is outlawed here. As will be shown in the following, referential consistency is essential to enable conditioning and Bayesian reasoning. Beside the aforementioned usual mathematical functions, many other useful functions could be added: checking the membership of an element in a set, getting the minimum/maximum element of a set, summing/averaging the numbers of a dataset, getting an attribute of an object, etc. Among these functions, the indexing of a given tuple is worth mentioning. Reconsidering the joint probability distribution d seen in Sect. 2.1 and defining extract([ t, i ]) as the function giving the ith element of tuple t, the functional pexes w := êxtract(d ⊗ 1) and m := êxtract(d ⊗ 2) model the weather and mood RVs (resp. W, M). Referential consistency on d enforces the interdependence between the two pexes and guarantees the same marginal probability results as those exemplified in Sect. 2.1.

For instance, a 2-ary function g shall be converted into a unary function g′ such that g′([ x, y ]) := g(x, y).
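To illustrate how a unary function transforms a pmf when its argument is a single elementary pex (the dependent cases handled by referential consistency are beyond this fragment), the following Python sketch applies the function value-wise and merges equal results; the helper name apply_pmf is an assumption made for illustration.

from fractions import Fraction

def apply_pmf(f, pmf):
    """Pmf of f(X) for an elementary pmf of X: apply f to each value, then merge equal values."""
    result = {}
    for v, p in pmf.items():
        w = f(v)
        result[w] = result.get(w, Fraction(0)) + p
    return result

die = {v: Fraction(1, 6) for v in range(1, 7)}
print(apply_pmf(lambda v: v % 2 == 0, die))   # even/odd: False -> 1/2, True -> 1/2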


2.4 Conditional Pex

A conditional pex models a given RV X under the condition that a given Boolean RV E is true. It is noted x ∣ e, where x and e are the pexes modeling resp. X and E. The RV E could represent an evidence, an assumption or a constraint. E has its own prior probability of being true but, in the conditional pex context, it is assumed to be certainly true. A valid conditional pex x ∣ e requires that x can produce at least one value verifying the condition expressed in e. If this is not the case, then no pmf can be calculated and the inference algorithm shall report an error. Although not required, the interesting cases happen of course when X and E are interdependent; this entails the sharing of one or several pexes, e.g. f̂(... ⊗ y ⊗ ...) ∣ ĝ(... ⊗ y ⊗ ...). For example, consider the following pexes that represent the throwing of two fair dice and the total of their values:

  d1 := {(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}
  d2 := {(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}
  s := âdd(d1 ⊗ d2)

Suppose now that some evidence ensures that the dice total is greater than or equal to 6 while the second die's value is less than or equal to 4. The conditional pex expressing the dice total given this evidence is

  x := s ∣ ând(ĝe(s ⊗ 6) ⊗ l̂e(d2 ⊗ 4))    (3)

The pmf related to x can be calculated using the definition of conditional probability: {(6, 4/14), (7, 4/14), (8, 3/14), (9, 2/14), (10, 1/14)}; actually, the Statues algorithm presented in Sect. 3 aims to calculate exactly this kind of result. At this stage, it may be worthwhile to represent derived pexes as graphs, in accordance with the concept of graphical model [6]. One may remark that a tree structure is usually inadequate since the same pex may be referred to multiple times (here, s and d2), as dictated by referential consistency. In full generality, any pex can be represented as a directed acyclic graph (DAG). Figure 1 shows the DAG corresponding to x. The arrow direction, from parent node p to child node c, is meant to represent that p depends on c.

One may deplore that this is the exact opposite of the convention used in graphical models. Actually, the point of view adopted here is more suited for an algorithm: arrows represent references, as these are drawn for example in the abstract syntax trees representing arithmetic expressions.
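The pmf stated above for x can be checked independently by brute-force enumeration, as in the following Python sketch (a verification aid only, not the Statues algorithm itself; the name conditional_pmf is an assumption).

from fractions import Fraction

def conditional_pmf():
    """Pmf of the dice total s given (s >= 6) and (d2 <= 4), by enumeration."""
    unnormalized = {}
    for d1 in range(1, 7):
        for d2 in range(1, 7):
            s = d1 + d2
            if s >= 6 and d2 <= 4:                     # the evidence
                unnormalized[s] = unnormalized.get(s, Fraction(0)) + Fraction(1, 36)
    total = sum(unnormalized.values())
    return {v: p / total for v, p in unnormalized.items()}

print(conditional_pmf())
# matches the pmf stated above: (6, 4/14), (7, 4/14), (8, 3/14), (9, 2/14), (10, 1/14)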


Fig. 1. x as a DAG

Many approaches, including those handling Bayesian networks, constrain conditions to be observations [15,19], which are equalities of the form X = x or conjunctions of such equalities X1 = x1 ∧ ... ∧ Xn = xn. Compared to these approaches, the combination of conditional and functional pexes provides a significant generalization. The sole constraint is to be able to express the evidence as a Boolean function applying to some RVs. Beside equalities and conjunctions, this includes inequalities, negations, disjunctions, membership, etc.

2.5 Table Pex

A table pex models an RV obtained by associating a given RV Xi with each value ci of a given RV C, with dom(C) = {c1, ..., cn}. It is noted c ▷ g, where the pex c models C and where g := {c1 → x1, ..., cn → xn} is an associative array with pexes xi modeling the Xi. Table pexes allow defining conditional probability tables (CPT), which are used in particular for defining Bayesian networks. To exemplify the idea, table pexes combined with tuple pexes can be used to model the classical "Rain-Sprinkler-Grass" BN. Three Boolean RVs are defined: R represents whether it is raining, S represents whether the sprinkler is on and G represents whether the grass is wet. R has a prior probability of 0.20; the other probabilities and dependencies are quantified using CPTs. S's probability depends on the weather: if it is raining, then the probability of S is 0.01, otherwise it is 0.40. G depends on both the weather and the sprinkler state; the probabilities of G depending on the values of the tuple [ R, S ] are: [ false, false ]: 0.00, [ true, false ]: 0.80, [ false, true ]: 0.90, [ true, true ]: 0.99. This BN can be modeled as follows, using the elementary pex r for R and the table pexes s and g for S and G:


  r := t(0.20)
  s := r ▷ {true → t(0.01),
            false → t(0.40)}
  g := (r ⊗ s) ▷ {[ false, false ] → t(0.00),
                  [ true, false ] → t(0.80),
                  [ false, true ] → t(0.90),
                  [ true, true ] → t(0.99)}

In order to make queries on such a BN given obtained information, new pexes mixing conditional, functional and table pexes can be built up. This allows forward chaining (e.g. g ∣ n̂ot(s), for calculating Pr(G | ¬S) = 0.2336), as well as Bayesian inference (e.g. n̂ot(s) ∣ g, for calculating Pr(¬S | G) = 0.3533). The number of entries in a table pex shall be the cardinality of the domain of the conditioning RV C. This can be cumbersome if this domain is large, e.g. if the condition is a tuple having many inner RVs (the domain of C is usually the Cartesian product of these RVs' domains). However, in several CPTs, such as those having the property of contextual independence [14,17], redundancies can be avoided by nesting table pexes within one another. Other applications of the table pex include mixture models and discrete-time Markov chains (DTMC). For the latter, the initial state can be modeled by an elementary pex (a pmf with the probability of being in each state), while the transition matrix can be expressed as a table pex giving the pmf of the next state for each current state.
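As a plain-Python illustration of the DTMC remark above (independent of the pex machinery; the dict encoding, the two-state weather chain and all probabilities are assumptions made for the example), an initial-state pmf and a per-state transition table suffice to propagate the state distribution one step at a time.

from fractions import Fraction

# Initial-state pmf and transition table of a two-state weather chain
initial = {"sunny": Fraction(4, 5), "rainy": Fraction(1, 5)}
transition = {"sunny": {"sunny": Fraction(9, 10), "rainy": Fraction(1, 10)},   # pmf of next state
              "rainy": {"sunny": Fraction(1, 2),  "rainy": Fraction(1, 2)}}    # given current state

def step(pmf, transition):
    """One DTMC step: marginalize the current state out of the joint (current, next)."""
    nxt = {}
    for state, p in pmf.items():
        for new_state, q in transition[state].items():
            nxt[new_state] = nxt.get(new_state, Fraction(0)) + p * q
    return nxt

print(step(initial, transition))   # sunny -> 41/50 (0.82), rainy -> 9/50 (0.18)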

3 The Statues Algorithm

Assuming that some RV D has been modeled as a pex d, marginalization inference consists in calculating, from d only, the pmf associated with D. Following the classification given in [5], this inference function shall be named marg. The aim of the Statues algorithm is to calculate marg(d) as the exact pmf {..., (vi, Pr(D = vi)), ...}. The name "Statues" is borrowed from the popular children's game of the same name. The analogy with the algorithm should hopefully become clear after the explanations given below. The examples seen so far show that, in full generality, marg(d) cannot proceed by simple recursive evaluation, as done for example in usual arithmetic; this is valid actually only if the underlying RVs are independent, that is, if each inner pex occurs only once in the pex under evaluation.

Other names for this game include "Red Light, Green Light" (US), "Grandmother's Footsteps" (UK), "1–2–3, Soleil !" (France), "1–2–3, Piano !" (Belgium) and "Annemaria Koekkoek !" (Netherlands).


Generally speaking, a given subsidiary pex x cannot be replaced by marg(x): this would transform it into an elementary pex, removing any dependency of x on other pexes. To obtain correct results in all cases, referential consistency (see Sect. 2.2) shall be enforced everywhere in the model. The divide-and-conquer paradigm needs to be revisited here. The Statues algorithm uses a construction called a generator, a special case of coroutine, as presented in [13] and [20]. Generators are available in several modern programming languages (e.g. C#, Python, Ruby, Lua, Go, Scheme), whether natively or as libraries. To state it in simple words, a generator is a special form of coroutine, which can suspend its execution to yield some object towards the caller and which can be resumed as soon as the caller has treated the yielded object. In the following algorithms, generators are identified as such by the presence of yield x statement(s) and, incidentally, by the absence of any return y statement. At the time yield x is executed, the generator yields the value x to its caller. The execution control is then passed to the caller: it treats x, then waits for the next value. Execution resumes in the generator until the next yield x statement, and so on until the generator terminates. Generators can be recursive, which makes them particularly well suited for combinatorial generation (see [20]). This ability is extensively used in the algorithm presented below. For detailing the algorithm, the term atom shall be used to designate a couple (v, p) made up of a value v and an associated probability p. An atom relates to a particular event that excludes the events related to other atoms. Such a condition makes it possible to add, without error, the probabilities of atoms in a condensation treatment: more precisely, if n atoms (v, p1), (v, p2), ..., (v, pn) are collected for some value v, then Σ_{i=1..n} pi is the unnormalized probability of v. For instance, when throwing two fair dice, the probability to get the total 3 can be obtained by collecting the two atoms ([ 1, 2 ], 1/36) and ([ 2, 1 ], 1/36) for the 2-tuple pex, then converting them to atoms (3, 1/36) and (3, 1/36) by the addition functional pex; these two probabilities can then be safely added, giving the expected probability 1/18 (here, already normalized). Another important concept used in the algorithm is that of binding. At any stage of the execution, a given pex is either bound or unbound. At start-up, all pexes are unbound, which means that they have not yet been assigned a value. When an unbound pex is required to browse the values of its domain, each yielded value is bound to this pex until the next value is yielded; when there are no more values, the pex is unbound again. Consistently with referential consistency, once a pex is bound, it yields the bound value for any subsequent occurrence of this pex. The fact that a bound value stays immobile for a while during execution explains why it can be likened to a statue in the aforementioned game. In the algorithm below, the pex binding is materialized in the binding store β, which is an associative array {pex → value}, initially empty. The Statues algorithm is made up of three parts. The entry point is the subroutine marg. This subroutine calls the genAtoms generator, which itself may call the genAtomsByType generator. These two generators are mutually recursive, as shown in the call graph given in Fig. 2.
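For readers unfamiliar with generators, the following Python fragment (a generic illustration, not the algorithm itself; the helper names are invented here) shows the suspend-and-resume behavior and the recursion that the algorithm relies on, using the dice-total example above.

from fractions import Fraction

def gen_atoms_of_pmf(pmf):
    """Yield the atoms (value, probability) of an elementary pmf, one by one."""
    for value, prob in pmf.items():
        yield (value, prob)       # execution suspends here until the caller asks for the next atom

def gen_pairs(pmf1, pmf2):
    """Combine two pmfs by nested generators, yielding atoms of the pair (v1, v2)."""
    for v1, p1 in gen_atoms_of_pmf(pmf1):
        for v2, p2 in gen_atoms_of_pmf(pmf2):
            yield ((v1, v2), p1 * p2)

die = {v: Fraction(1, 6) for v in range(1, 7)}
total_3 = [atom for atom in gen_pairs(die, die) if sum(atom[0]) == 3]
print(total_3)   # two atoms for the total 3, each with probability 1/36, summing to 1/18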


Fig. 2. Call graph of Statues algorithm

The entry-point subroutine marg is given, as pseudocode, in Algorithm 1. marg takes the given pex d to be evaluated as argument. It invokes the genAtoms generator and collects the atoms yielded one by one. Using the associative array a {value → probability}, the condensation treats atoms containing the same value so that they are merged together by summing their probabilities. Once the genAtoms generator is exhausted, a check verifies that at least one atom has been received; otherwise an error is reported and the subroutine halts (as seen in Sect. 2.4, this may occur if the evaluated pex is conditional while the given condition is impossible). The final step normalizes the distribution a to ensure that the sum of probabilities equals 1, a post-processing step common to many marginalization algorithms. The pmf is then returned. The genAtoms generator (Algorithm 2) uses the binding store β to check whether, at the current stage of the algorithm, the given pex is bound or not. If the given pex is unbound, which is the case at least for the very first call on this pex, then genAtomsByType is called and the value of each yielded atom is bound to the pex before the atom is yielded in turn to genAtoms's caller. Otherwise, if the pex is currently bound, then the atom yielded is the bound value with probability 1. This behavior is actually the crux of the algorithm for enforcing referential consistency. The genAtomsByType generator (Algorithm 3) is the last part of the Statues algorithm. It yields the atoms according to the semantics of each type of pex. The dispatching is presented here as a pattern-matching switch construct, although other constructs are feasible, e.g. using object orientation (inheritance/polymorphism). In the case of an elementary pex, the treatment is simple and non-recursive. In the case of a derived pex, the dependent pexes are accessed by calling genAtoms on them; this causes recursive calls, yielding atoms and updating the current bindings.

It can be shown that the probability sum may differ from 1 only if the evaluated pex contains some conditional pex. Actually, the division performed is closely related to the formula of conditional probability: Pr(A | C) := Pr(A ∧ C) / Pr(C).


Algorithm 1 Statues algorithm – marg subroutine (entry point)
 1: function marg(d)
 2:   β ← {}                                   // init global binding store
 3:   a ← {}                                   // init unnormalized pmf
 4:   for (v, p) ← genAtoms(d) do              // collect atoms
 5:     if ∄ a[v] then
 6:       a[v] ← 0
 7:     end if
 8:     a[v] ← a[v] + p                        // condense pmf
 9:   end for
10:   if a = {} then                           // pmf is empty: error
11:     halt with error
12:   end if
13:   s ← Σ_{(v, p) ∈ a} p                     // normalize pmf
14:   return {(v, p/s) | (v, p) ∈ a}
15: end function

Algorithm 2 Statues algorithm – genAtoms generator
 1: generator genAtoms(d)
 2:   if ∃ β[d] then                           // d is bound
 3:     yield (β[d], 1)                        // yield unique atom to caller
 4:   else                                     // d is unbound
 5:     for (v, p) ← genAtomsByType(d) do
 6:       β[d] ← v                             // (re)bind d to value v
 7:       yield (v, p)                         // yield atom to caller
 8:     end for
 9:     delete β[d]                            // unbind d
10:   end if
11: end generator

To ease the writing of the algorithm in a recursive way, the LISP-like notation [ h | t ] is used to represent a tuple with h as first element (the "head") and t as the tuple of remaining elements (the "tail"). For instance, the 2-tuple [ x, y ] could be written as [ x | [ y | [ ] ] ]. The tuple pex shall follow a similar recursive structure: for any pexes x and y, the notation x ⊗ y introduced in Sect. 2.2 shall actually be interpreted as x ⊗ (y ⊗ [ ]). This comment is required to accurately trace the treatment of tuple pexes in genAtomsByType. To get an in-depth understanding of the algorithm, one has to remember that genAtoms and genAtomsByType are not subroutines returning a list of atoms; they are generators working cooperatively and yielding atoms one by one. During algorithm execution, two generators (genAtoms and genAtomsByType) are created for each pex reachable from the root pex under evaluation. All these generators live together, the flow of control switching between the generators at each yield statement.


Algorithm 3 Statues algorithm – genAtomsByType generator
 1: generator genAtomsByType(d)
 2:   switch d do
 3:     case {...}                             // d is an elementary pex
 4:       for (v, p) ∈ {...} do
 5:         yield (v, p)
 6:       end for
 7:     case f̂(x)                              // d is a functional pex
 8:       for (v, p) ← genAtoms(x) do
 9:         yield (f(v), p)
10:       end for
11:     case h ⊗ t                             // d is a tuple pex
12:       for (v, p) ← genAtoms(h) do
13:         for (s, q) ← genAtoms(t) do
14:           yield ([ v | s ], pq)
15:         end for
16:       end for
17:     case x ∣ e                             // d is a conditional pex
18:       for (v, p) ← genAtoms(e) do
19:         if v then
20:           for (s, q) ← genAtoms(x) do
21:             yield (s, pq)
22:           end for
23:         end if
24:       end for
25:     case c ▷ g                             // d is a table pex
26:       for (v, p) ← genAtoms(c) do
27:         for (s, q) ← genAtoms(g[v]) do
28:           yield (s, pq)
29:         end for
30:       end for
31: end generator

Furthermore, at each yield statement, new bindings are created or removed. For instance, in the treatment of the conditional pex in genAtomsByType, the outer for loop at line 18 creates bindings that act on the inner statements: then, only atoms (s, q) compatible with these bindings are yielded in the inner for loop (line 20). A proof of correctness of the Statues algorithm is provided in [9]. In essence, this proof shows that the atoms collected by marg(d) form a partition of the sample space (or of a subset of it) and that they conform to the semantics of the p-expression d. Beside this proof, and well before it, good confidence in the correctness of the algorithm has been gained through informal reasoning and, above all, by verification against results given in examples from the literature ([5,16] and [19], in particular).
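The pseudocode above translates quite directly into Python, whose generators provide the required suspend/resume behavior. The sketch below is a compact illustrative re-implementation, not the code of the Lea or MicroLea libraries: the class names, the identity-keyed binding store and the pmf encoding are choices made here for the example. It covers the five pex types and reproduces the dice query of Sect. 2.4.

from fractions import Fraction

class Pex: pass

class Elem(Pex):                       # elementary pex: explicit pmf {value: probability}
    def __init__(self, pmf): self.pmf = dict(pmf)

class Func(Pex):                       # functional pex: unary pure function applied to a pex
    def __init__(self, f, x): self.f, self.x = f, x

class Tuple2(Pex):                     # tuple pex head ⊗ tail (tail yields tuple values)
    def __init__(self, head, tail): self.head, self.tail = head, tail

class Cond(Pex):                       # conditional pex: x given that e is true
    def __init__(self, x, e): self.x, self.e = x, e

class Table(Pex):                      # table pex: CPT selecting a pex from the value of c
    def __init__(self, c, g): self.c, self.g = c, g

beta = {}                              # global binding store {pex id: bound value}

def gen_atoms(d):
    if id(d) in beta:                  # d is bound: yield its bound value with probability 1
        yield (beta[id(d)], Fraction(1))
    else:                              # d is unbound: browse its atoms, binding each value in turn
        for v, p in gen_atoms_by_type(d):
            beta[id(d)] = v
            yield (v, p)
        beta.pop(id(d), None)          # unbind d

def gen_atoms_by_type(d):
    if isinstance(d, Elem):
        yield from d.pmf.items()
    elif isinstance(d, Func):
        for v, p in gen_atoms(d.x):
            yield (d.f(v), p)
    elif isinstance(d, Tuple2):
        for v, p in gen_atoms(d.head):
            for s, q in gen_atoms(d.tail):
                yield ((v,) + s, p * q)
    elif isinstance(d, Cond):
        for v, p in gen_atoms(d.e):
            if v:                      # prune branches violating the evidence
                for s, q in gen_atoms(d.x):
                    yield (s, p * q)
    elif isinstance(d, Table):
        for v, p in gen_atoms(d.c):
            for s, q in gen_atoms(d.g[v]):
                yield (s, p * q)

def marg(d):
    beta.clear()
    a = {}
    for v, p in gen_atoms(d):          # collect and condense atoms
        a[v] = a.get(v, Fraction(0)) + p
    if not a:
        raise ValueError("impossible condition")
    total = sum(a.values())            # normalize
    return {v: p / total for v, p in a.items()}

EMPTY = Elem({(): Fraction(1)})        # the empty tuple, as an elementary pex

def tup(*pexes):
    """Build a right-nested tuple pex x1 ⊗ (x2 ⊗ (... ⊗ []))."""
    t = EMPTY
    for x in reversed(pexes):
        t = Tuple2(x, t)
    return t

# Dice query of Sect. 2.4: total of two dice given (total >= 6) and (d2 <= 4)
d1 = Elem({v: Fraction(1, 6) for v in range(1, 7)})
d2 = Elem({v: Fraction(1, 6) for v in range(1, 7)})
s  = Func(sum, tup(d1, d2))
evidence = Func(lambda t: t[0] >= 6 and t[1] <= 4, tup(s, d2))
print(marg(Cond(s, evidence)))         # pmf: 6 -> 2/7, 7 -> 2/7, 8 -> 3/14, 9 -> 1/7, 10 -> 1/14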


3.1 Example of Execution

As support for understanding how the Statues algorithm works in practice, the present section shows the key steps of the execution for a toy problem involving an addition and a condition. This problem involves two Bernoulli RVs B1 and B2, with respective probabilities 2/3 and 1/4. Supposing that some evidence ensures that the sum S := B1 + B2 does not exceed 1, the goal is to calculate the odds of B1 given this evidence. This problem can be modeled by the conditional pex d (and its subsidiary pexes) defined as follows:

  b1 := {(0, 1/3), (1, 2/3)}
  b2 := {(0, 3/4), (1, 1/4)}
  s := âdd(b1 ⊗ b2)
  d := b1 ∣ l̂e(s ⊗ 1)

This model can be represented by the DAG shown in Fig. 3. Note that, as stated earlier, the tuple pexes shown above (⊗ nodes) are slightly simplified for the sake of conciseness: in the present model, b2 stands for b2 ⊗ [ ] while 1 stands for 1 ⊗ [ ]. Table 1 shows the sequence of steps executed for calculating marg(d). A step is defined by all the actions made by the main generator genAtoms to yield a new atom (line 4 of marg). Each row shows some key data present or exchanged at a given step. The first two columns show the values bound to b1 and b2 during the given step. The remaining columns, labeled c→p, show the atom yielded by pex c to its parent pex p during the given step; this atom is the one yielded at line 3 or line 7 of genAtoms(c). The rightmost column, →, shows the atom yielded by the main generator genAtoms: it is collected in marg, which is the final action of the step. When starting marg(d), the genAtoms / genAtomsByType generators are created in cascade for each pex, in a top-down order, i.e. from d down to the elementary pexes b1, b2, 1 and [ ]. Since the root node is a conditional pex, the first processing is the evaluation of the condition defined on the l̂e pex (line 18 of genAtomsByType). During the execution of each step, the atoms are yielded one by one until they reach the root conditional pex; graphically, they climb the DAG from bottom to top. The first three steps yield three atoms to marg, viz. (0, 1/4), (0, 1/12) and (1, 1/2). In this processing, the atoms yielded on b1→∣ have probability 1 because b1 is already bound at this stage (see line 3 of genAtoms). Step #4 blocks the last atom yielded since it does not verify the given condition (see line 19 of genAtomsByType). During this process, marg has made on the fly the condensation of the three received atoms into the associative array a = {(0, 1/3), (1, 1/2)}. The final treatment of marg(d) consists in normalizing a to get the final pmf {(0, 2/5), (1, 3/5)}. This result is correct with regard to the definition of conditional probability. Incidentally, it is different from the pmf of b1; this shows that the given evidence, here, brings information on top of the prior beliefs.


Fig. 3. d as a DAG

Table 1. Execution trace of marg(d)

|    | b1 | b2 | b1→⊗     | b2→⊗       | ⊗→add          | add→⊗     | ⊗→le           | le→∣          | b1→∣   | →         |
| #1 | 0  | 0  | (0, 1/3) | ([0], 3/4) | ([0, 0], 1/4)  | (0, 1/4)  | ([0, 1], 1/4)  | (true, 1/4)   | (0, 1) | (0, 1/4)  |
| #2 | 0  | 1  |          | ([1], 1/4) | ([0, 1], 1/12) | (1, 1/12) | ([1, 1], 1/12) | (true, 1/12)  | (0, 1) | (0, 1/12) |
| #3 | 1  | 0  | (1, 2/3) | ([0], 3/4) | ([1, 0], 1/2)  | (1, 1/2)  | ([1, 1], 1/2)  | (true, 1/2)   | (1, 1) | (1, 1/2)  |
| #4 | 1  | 1  |          | ([1], 1/4) | ([1, 1], 1/6)  | (2, 1/6)  | ([2, 1], 1/6)  | (false, 1/6)  |        |           |
The case above is decidedly basic. The Statues algorithm is nonetheless able to treat correctly all examples given in the present paper, as well as far more involved problems (see, in particular, the libraries referred to in Sect. 5).
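For completeness, this toy model can be reproduced with the illustrative Python sketch given after Algorithm 3 (reusing its assumed Elem, Func, Cond, tup and marg names):

b1 = Elem({0: Fraction(1, 3), 1: Fraction(2, 3)})
b2 = Elem({0: Fraction(3, 4), 1: Fraction(1, 4)})
s  = Func(sum, tup(b1, b2))
d  = Cond(b1, Func(lambda t: t[0] <= 1, tup(s, Elem({1: Fraction(1)}))))
print(marg(d))    # pmf: 0 -> 2/5, 1 -> 3/5, as derived above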

4 Discussion

Due to the usage of generators, the execution model of the Statues algorithm is quite singular compared to the large majority of algorithms based on subroutines. During algorithm execution, each pex involved in the evaluated query gives rise to two generators, namely genAtoms and genAtomsByType. These generators live together and their call graph mimics the DAG on which they operate; the yielded atoms are passed through the arcs, from child node to parent node. Each genAtomsByType node performs a very simple treatment where probabilities are multiplied together. The collecting of atoms and their condensation are done at the root of the DAG by the marg subroutine, which is hence the only place where probabilities are added together. As already stated, the Statues algorithm belongs to the category of exact probabilistic algorithms. At its very heart, it explores all possible paths or "possible worlds" [5] compatible with the given query. Without much surprise, it is limited by the NP-hard nature of inference on unconstrained BNs [4]. However, it performs far more efficiently than a naive inference by enumeration. There are three reasons supporting this assertion.


Firstly, due to the way the algorithm travels through the DAG model starting from the given node d, the variables that do not impact the query at hand (i.e. those that are unreachable by any path from d) are not considered in the calculation. There is then a de facto elimination of unused variables. Secondly, when looking for possible bindings, the treatment of the conditional pex performs a pruning of the solution branches that do not comply with the given evidence (lines 19–23 of genAtomsByType). In many cases, this prevents wasteful calculations; see below for an extension that may improve such pruning even further. Finally, since the binding done by genAtoms occurs for every pex (whether elementary or derived), it has the virtue of memoizing on the fly the results of functional pexes, avoiding redoing the same calculation over and over. For instance, suppose that, among a large set of RVs, a variable D is defined as D := √(X² + Y²). Even if D is used in multiple places of the query, as in the conditional expression D² − U × V ∣ (A ≤ D) ∧ (D ≤ B), the values of D will be calculated only once for each pair of values [X, Y], hence not for each combination of [X, Y, A, B, U, V]. This memoization is allowed without restriction since the functional pexes use pure functions, by definition. Although the five pex types introduced in Sect. 2 have a large scope in probabilistic modeling, it is possible to add new pex types in order to improve modeling expressiveness or execution performance. Handling them in the algorithm just requires adding case clauses in the genAtomsByType generator, the rest of the algorithm remaining unchanged. A first example of extension consists in generalizing the conditional pex to handle chained conjunctions of conditions C1 ∧ ... ∧ Cn. This would enable short-circuit evaluation when a (false, p) atom is encountered for any Ci. This extension may then perform early pruning, which can dramatically speed up the calculations in several cases. This kind of optimization can be extended to any operation having an absorbing element (true for disjunction, 0 for multiplication, etc.). A second example of extension is a variant of the table pex: instead of providing an explicit CPT as a lookup table, the modeler may provide a function that returns a specific pex depending on the value of the decision RV. Such a construct allows defining a CPT in an algorithmic way, which may be far more compact than an explicit table. This may be helpful in particular for defining noisy-OR and noisy-MAX models (see [15,19]). The construction of suitable probabilistic models, using pexes or other formal frameworks, is usually a difficult task for a human being. If enough observation data are available, then the Statues algorithm could be coupled to machine learning algorithms, e.g. maximum-likelihood or expectation-maximization [19], to help determine the best model fitting these data. Exact algorithms, such as the Statues algorithm, have general merits and liabilities relative to approximate algorithms. As stated before, any exact probabilistic inference algorithm is limited in practice by the intractability of many problems, including large or densely connected BNs. For such intractable problems at least, approximate algorithms like MCMC provide a fallback. Despite this constraint, exact algorithms remain very useful for a number of reasons, beside their exactness. Firstly, several problems can be solved exactly in an acceptable time;


these cover, at the very least, many sparsely connected BNs and the example cases used for education. Secondly, exact algorithms offer the opportunity to represent probabilities in different manners, beyond the prevailing floating-point numbers. Representing probabilities as fractions enables perfect accuracy of results, tackling the usual, and annoying, rounding errors. Furthermore, symbolic computation is made possible by defining probabilities with variable symbols instead of numbers, e.g. p, q, ..., and by coupling the algorithm with a symbolic computation system (such as SymPy [22]); the output of marg is then a parametric pmf made up of probability expressions instead of numbers, e.g. p²(1 − q). Such an approach using probability symbols instead of numbers is useful when the same query is made over and over on the same complex model, with only the prior probability values varying: the query result may be compiled offline, once and for all, into a raw arithmetic expression (possibly taking a long processing time); then, the resulting expression can be evaluated many times using fast arithmetic computations, with varying parameters. As an even bolder objective, one may envision the usage of probability amplitudes, i.e. complex numbers, in the context of quantum computing. This would allow defining pseudo-probability distributions for simulating qubits, quantum registers and quantum circuits. This extension would require a careful analysis of what can be kept or changed in the algorithm; at the very least, a "measure" post-processing should be set up to square the modulus of the probability amplitudes, in order to obtain true probabilities (the so-called Born rule). Besides the proposed extensions, further research is definitely needed to factually assess the assets and liabilities of the Statues algorithm among existing probabilistic inference algorithms. This includes at least the following research tracks:

– to make an objective comparison of the expressiveness of the p-expressions framework with those used in other approaches,
– to study the performance of the algorithm, both for space and time aspects, and to put these results in perspective with other comparable algorithms.
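To illustrate the symbolic-probability idea, the following small sketch uses SymPy directly, outside any pex machinery; the two-variable model and the symbol names are invented for the example. Prior probabilities are kept as symbols, and a query evaluated by enumeration yields a parametric expression that can later be re-evaluated cheaply with numeric values.

import sympy

p, q = sympy.symbols("p q", positive=True)

# Two independent Boolean RVs A and B with symbolic priors p and q
pmf_a = {True: p, False: 1 - p}
pmf_b = {True: q, False: 1 - q}

# Pr(A and not B), obtained by enumeration and kept as a symbolic expression
expr = sum(pa * pb
           for a, pa in pmf_a.items()
           for b, pb in pmf_b.items()
           if a and not b)
print(expr)                                                   # p*(1 - q)
print(expr.subs({p: sympy.Rational(1, 2), q: sympy.Rational(1, 4)}))   # 3/8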

5 Implementations: The Lea Libraries

The Statues algorithm has been successfully implemented in the Python programming language [18], following the concept of "probabilistic programming language" (PPL) [11,12,16]. Python is well suited for the task because it natively supports generators [20,21]. The primary implementation in Python is an open-source library named Lea [7]. Lea is fully usable, comprehensive and well documented. It also encompasses several useful extensions, such as those presented in Sect. 4; this includes the support of fractions, floating-point numbers and symbols for representing probabilities (for symbolic computation, Lea uses the SymPy package [22]). However, understanding the core Statues algorithm from Lea is quite difficult because the implementation contains many optimizations and extraneous functions, such as standard indicators, information theory, random sampling, machine learning, etc.


Also, beside the Statues algorithm, Lea implements an approximation algorithm based on Monte-Carlo rejection sampling, which provides a fallback alternative for intractable problems. To help the understanding of the Statues algorithm, a lighter open-source Python library has been developed: MicroLea, abbreviated as μLea [8]. μLea is much smaller and much simpler than Lea: it focuses on the Statues algorithm and nothing more. μLea has limited functionality and usability compared to Lea, but it is well suited to studying how the Statues algorithm can be practically implemented. Incidentally, the names of classes and methods match exactly the terminology used in the present paper. As a short introduction to μLea, the Rain-Sprinkler-Grass Bayesian network seen in Sect. 2.5 is demonstrated below. Here are the statements for instantiating this BN in μLea:

from microlea import *

rain      = ElemPex.bool(0.20)
sprinkler = TablePex(rain,
                     { True : ElemPex.bool(0.01),
                       False: ElemPex.bool(0.40) })
grass_wet = TablePex(TuplePex(sprinkler, rain),
                     { (False, False): False,
                       (False, True ): ElemPex.bool(0.80),
                       (True , False): ElemPex.bool(0.90),
                       (True , True ): ElemPex.bool(0.99) })

Note that μLea makes automatic conversion of fixed values into elementary pexes when needed; this is why False is allowed in place of ElemPex.bool(0.0) in the first entry of grass_wet. From these definitions, μLea allows making several queries, for which the marg subroutine is called implicitly. If the resulting pmf r is Boolean, the convenience function P(r) is useful to extract the probability of true. The method given builds a conditional pex from the Boolean pex passed as argument; operator overloading is used to build functional pexes for arithmetic operators and the logical operators NOT (~), AND (&) and OR (|). Lines beginning with # -> display the returned objects.

sprinkler                          # -> {False: 0.6780, True: 0.3220}
P(sprinkler)                       # -> 0.3220
P(rain & sprinkler & grass_wet)    # -> 0.00198
P(grass_wet.given(rain))           # -> 0.8019
P(rain.given(grass_wet))           # -> 0.35768767563227616

For checking the consistency of these results, it is possible to retrieve the very last calculated probability thanks to the following expressions, which check respectively the definition of conditional probability and Bayes' theorem:

P(rain & grass_wet) / P(grass_wet)
# -> 0.35768767563227616


P(grass_wet.given(rain)) * P(rain) / P(grass_wet) # -> 0.35768767563227616

Other relationships, including the axioms of probability and the chain rule, can be verified similarly in μLea. One may note that these relationships do not appear explicitly in the Statues algorithm; actually, they are emergent properties thereof. Functional pexes allow expressing more complex queries or evidences:

P(rain.given(grass_wet & ~sprinkler))       # -> 1.0
P(rain.given(~grass_wet | ~sprinkler))      # -> 0.27889355229430157
P((rain | sprinkler).given(~grass_wet))     # -> 0.12983575649903917
P((rain == sprinkler).given(~grass_wet))    # -> 0.87020050034444

It is easy to get the full joint probability distribution of the BN by using the tuple pex construct. This gives the probability of each atomic state of the three variables, taking their interdependence into account:

TuplePex(rain, sprinkler, grass_wet)
# -> {(False, False, False): 0.48000, (False, True, False): 0.03200,
#     (False, True, True): 0.28800, (True, False, False): 0.03960,
#     (True, False, True): 0.15840, (True, True, False): 0.00002,
#     (True, True, True): 0.00198}

One can notice that the case (False, False, True) is absent, since it is impossible according to the given grass_wet CPT. Using tuple pexes, it is possible to derive any joint probability distribution, whether full or partial, of any pex model (e.g. the factors calculated by the variable elimination algorithm [6,19]). This may provide useful clues to understand returned results, such as those given above. To provide an example involving a numerical RV, the above use case can be extended by adding a hygrometer, showing a measure of the grass humidity as an integer from 0 to 4. Assuming that this device is imprecise, the measure is modeled as a CPT depending on the state of the grass:

measure = TablePex(grass_wet,
                   { True : ElemPex({2: 0.125, 3: 0.375, 4: 0.500}),
                     False: ElemPex({0: 0.500, 1: 0.375, 2: 0.125}) })

Booleans, numerical values and comparison operators can then be freely mixed in the same query:

measure
# -> {0: 0.2758, 1: 0.2069, 2: 0.1250, 3: 0.1681, 4: 0.2242}
measure.given(~rain)
# -> {0: 0.3200, 1: 0.2400, 2: 0.1250, 3: 0.1350, 4: 0.1800}
P((measure <= 2).given(~rain))        # -> 0.685
P(~rain.given(measure <= 2))          # -> 0.9018089662521034


Finally, to elaborate on functional pexes in this use case, a variable dry is defined hereafter as the negation of rain, while a new variable norm_measure converts measure into a normalized value ranging from −1.0 to +1.0. The queries made below are consistent with the previous results:

dry = ~rain
norm_measure = (measure - 2.0) / 2.0
norm_measure.given(dry)
# -> {-1.0: 0.3200, -0.5: 0.2400, 0.0: 0.1250, 0.5: 0.1350, 1.0: 0.1800}
P(dry.given(norm_measure <= 0.0))     # -> 0.9018089662521034

These examples demonstrate that μLea, supported by the Statues algorithm, lets the user express models and queries in a natural way, quite close to the underlying random variables. They also show that Bayesian reasoning and usual functions (such as inequalities, arithmetic and logical operators) can be combined together, providing capabilities absent from classical probabilistic approaches.

6 Conclusions

The present paper has introduced a new probabilistic framework, namely the p-expressions, that allows modeling discrete random variables having a finite domain. In essence, this framework provides primitives to define graphical models capturing the dependencies between random variables, up to those having known prior probabilities. As sketched in the provided examples, this formalism appears to be rich enough to model probabilistic arithmetic, conditioning, discrete-time Markov chains and Bayesian networks. Then, a new inference algorithm has been presented, the Statues algorithm, which calculates the exact marginal probability distribution of any given p-expression. This algorithm relies on a special binding mechanism that uses recursive generators. Beside the validity of the results on several problems covered in the literature, a proof of the algorithm's correctness is available. The Statues algorithm has been successfully implemented in the Lea and MicroLea libraries, using the Python programming language. The usage of MicroLea has been demonstrated on a simple Bayesian network, with some non-standard variations mixing Boolean and numerical random variables. The merits and liabilities of the Statues algorithm have been briefly discussed, as well as possible extensions. The algorithm handles only discrete random variables and it does not overcome the computational limitations of exact probabilistic inference. However, one of its interests from the perspective of probabilistic programming resides in its ability to address a set of problems traditionally handled by different specialized probabilistic modeling approaches. On the question of time efficiency, the Statues algorithm appears to have several strengths for competing with other exact algorithms, notably through its pruning and memoization features. As for the algorithm's inner machinery, the binding mechanism based on recursive generators has proven to be elegant and powerful for handling the dependencies between random variables.


Despite these promising results, further research is needed to assess the Statues algorithm's actual assets and liabilities among other probabilistic inference algorithms.

Acknowledgments. The author warmly thanks Nicky van Foreest for reviewing the early version of the present paper and for providing fruitful advice to improve it. The author is grateful to Frédéric and Marie-Astrid Buelens for their wise guidelines on writing scientific papers. The author thanks Gilles Scouvart, Nicky van Foreest, Zhibo Xiao, Noah Goodman, Rasmus Bonnevie, Paul Moore, Thomas Laroche and Guy Lalonde for their feedback, support, suggestions or contributions provided for the Lea library. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

1. Agrawal, M.K., Elmaghraby, S.E.: On computing the distribution function of the sum of independent random variables. Comput. Oper. Res. 28(5), 473–483 (2001)
2. Berleant, D., Goodman-Strauss, C.: Bounding the results of arithmetic operations on random variables of unknown dependency using intervals. Reliable Comput. 4(2), 147–165 (1998)
3. Berleant, D., Xie, L., Zhang, J.: Statool: a tool for Distribution Envelope Determination (DEnv), an interval-based algorithm for arithmetic on random variables. Reliable Comput. 9(2), 91–108 (2003)
4. Cooper, G.F.: The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell. 42(2–3), 393–405 (1990)
5. De Raedt, L., Kimmig, A.: Probabilistic programming concepts. arXiv preprint arXiv:1312.4328 (2013)
6. Jordan, M.I.: Graphical models. Statist. Sci. 19(1), 140–155 (2004)
7. Denis, P.: Lea: discrete probability distributions in Python (2014). http://www.bitbucket.org/piedenis/lea
8. Denis, P.: MicroLea: probabilistic inference in Python (2017). http://www.bitbucket.org/piedenis/microlea
9. Denis, P.: Probabilistic inference using generators – the Statues algorithm, appendix C. arXiv preprint arXiv:1806.09997 (2018)
10. Evans, D.L., Leemis, L.M.: Algorithms for computing the distributions of sums of discrete random variables. Math. Comput. Modell. 40(13), 1429–1452 (2004)
11. Goodman, N., Mansinghka, V., Roy, D.M., Bonawitz, K., Tenenbaum, J.B.: Church: a language for generative models. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (2012)
12. Goodman, N., Stuhlmüller, A.: The design and implementation of probabilistic programming languages (2014). http://dippl.org
13. Knuth, D.E.: The Art of Computer Programming: Fundamental Algorithms, vol. 1, pp. 193–200, 3rd edn. Addison-Wesley, Boston (1997)
14. Pearl, J.: Reverend Bayes on inference engines: a distributed hierarchical approach, pp. 133–136. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles (1982)
15. Pearl, J.: Fusion, propagation, and structuring in belief networks. Artif. Intell. 29(3), 241–288 (1986)


16. Pfeffer, A.: Practical Probabilistic Programming. Manning Publications Co., Greenwich (2016)
17. Poole, D., Zhang, N.L.: Exploiting contextual independence in probabilistic inference. J. Artif. Intell. Res. 18, 263–313 (2003)
18. Python Software Foundation (2001). http://www.python.org
19. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice Hall, Upper Saddle River (2003)
20. Saba, S.: Coroutine-based combinatorial generation. Doctoral dissertation, University of Victoria (2014)
21. Schemenauer, N., Peters, T., Hetland, M.L.: PEP 255 – Simple Generators (2001). http://www.python.org/dev/peps/pep-0255/
22. SymPy Development Team: SymPy: Python library for symbolic mathematics (2016). http://www.sympy.org
23. Williamson, R.C.: Probabilistic arithmetic. Doctoral dissertation, University of Queensland (1989)
24. Williamson, R.C., Downs, T.: Probabilistic arithmetic. I. Numerical methods for calculating convolutions and dependency bounds. Int. J. Approximate Reasoning 4(2), 89–158 (1990)

A Q-Learning Based Maximum Power Point Tracking for PV Array Under Partial Shading Condition

Roy Chaoming Hsu 1(B), Wen-Yen Chen 2, and Yu-Pi Lin 1

1 Electrical Engineering Department, National Chiayi University, Chiayi City 600, Taiwan
[email protected]
2 Computer Science and Information Engineering Department, National Chiayi University, Chiayi City 600, Taiwan

Abstract. Due to the rise of environmental awareness and the impact of the nuclear disaster at the Fukushima nuclear power plant, green energy has become an important development area. Solar power generation is pollution-free and noise-free, making it one of the most important directions of green energy development. However, when a portion of a PV array is shaded, the P-V characteristic curve exhibits multiple peaks, so that traditional maximum power point tracking (MPPT) methods have difficulty tracking the true maximum. In this paper, a maximum power point tracking method based on reinforcement learning is proposed for PV arrays under partial shading condition; the states, the required actions and the reward function of the reinforcement learning are designed to effectively track the maximum power point of a PV array in a partial shading environment.

Keywords: Reinforcement learning · PV array · Partial shading condition · MPPT

1 Introduction

In recent years, the demand for electric energy has kept increasing, while awareness of environmental protection has risen and the safety of nuclear power generation has been questioned. Therefore, many alternative energy sources have been developed. To reduce the risks of nuclear power generation, the nuclear waste generated by nuclear power plants, and the air pollution caused by thermal power generation, alternative green energy generation has been promoted, including wind energy, hydropower, solar energy, etc. Among these alternative power generation sources, solar energy is the most popular and widely used one. A photovoltaic array (PV array for short) is a photoelectric component [1] of solar power generation, which converts solar and radiant energy into electrical energy. Because the main function of a PV array is to convert solar energy into electrical energy, how to improve the photoelectric conversion efficiency of the PV array has become an important issue in solar power research.


Maximum power point tracking (MPPT) [2–5] is the technology used to track the maximum power point of a PV array under uniform insolation for solar power generation. The power vs. voltage characteristic curve (P-V curve for short) of a solar array under uniform insolation has a single maximum power point for a fixed illumination and temperature. When the PV array is exposed to different sunlight illuminance and temperature, the characteristic curve of the solar array changes accordingly. Technically, if the load of a PV array can be adjusted to the operating point of highest power transfer on the P-V characteristic curve, the photoelectric conversion of the PV system will have the best efficiency. MPPT is a method of adjusting the load such that the maximum power point of the P-V characteristic curve can be tracked and maintained. However, PV systems are generally built by connecting PV modules in series and in parallel into a PV array; if the array is partially shaded, the P-V characteristic curve will have multiple local maxima, and traditional MPPT methods such as perturb-and-observe (P&O) have difficulty achieving tracking because they are likely to be trapped in a local maximum. A Two-Stage method [6] has been proposed to solve the MPPT problem for PV arrays under partial shading condition (PSC for short). To find the maximum power point of a PV array under PSC, the Two-Stage method divides the MPPT into two stages. The first stage uses the open-circuit voltage Voc and the short-circuit current Isc to define a load line for the first move of the operating point voltage; the second stage then uses the incremental conductance method to move the operating point voltage. After the maximum operating power point of the current stage has been found, it is compared with the maximum power point found in the first stage. If the maximum power stored in the first stage is greater than the maximum power found in the second stage, the operating point voltage is moved back to the operating point voltage stored in the first stage; in this way, the global maximum power point is found. In this study, we apply the Q-Learning [7] method of reinforcement learning to track the power changes caused by shadow shading and to track the maximum power point, such that higher power generation efficiency can be achieved, and the results are compared with those of the Two-Stage method [6]. In the following, Sect. 2 gives the background of the proposed methodology. Section 3 presents the system architecture and simulation of this study. Section 4 exhibits the experimental results, and the conclusion comes in Sect. 5.

2 Reinforcement Learning and Partial Shading Condition of PV Array

Reinforcement learning [7–9] is an area of machine learning that emphasizes how to act based on the environment so as to maximize the expected benefit. In the reinforcement learning framework, an agent explores and learns in the environment: it recognizes the current state according to the information given by the environment, executes an action after a specific decision-making process, and then receives a reward value from the environment. A common model of reinforcement learning is shown in Fig. 1. In Fig. 1, the agent observes the state of the unknown environment and takes action(s) through specific decisions; a reward associated with the state-action pair, exhibiting its positive or negative impact, is then determined and given by the environment. Through continuous updates of actions and states to accumulate the maximum reward value, a set of optimal (state, action) strategies can consequently be learned.


Fig. 1. Model of reinforcement learning.

The difference between reinforcement learning and supervised learning is that reinforcement learning does not require correct input/output pairs, nor does it require explicit correction of suboptimal behavior. Reinforcement learning focuses on online planning and on finding a balance between exploring the known/unknown environment and exploiting the existing knowledge. During learning, the agent first explores the environment and tries a variety of different actions to obtain enough information. Once certain knowledge about the environment has been acquired, the agent then exploits the state-action pairs with higher chances of obtaining a higher reward value. In reinforcement learning, generally either the ε-greedy [10] or the softmax [11] method is employed for action selection.

2.1 Q-Learning

Q-Learning [7] is one reinforcement learning algorithm among various reinforcement learning methods. The Q(state, action) value (Q-value) in the Q-table is used to judge the quality of the agent's actions in given states. The Q-value is determined based on the action A chosen by the agent in the specific state S. The resulting Q-value is obtained after the agent takes the action through a specific exploration strategy and, to find the best strategy, the Q-values are constantly updated, such that the agent, after learning, will consequently perform the same action in a similar environment according to the associated Q-value. The Q-value is the result of learning after each exercised action in a certain state, and the final Q-values each tend to reach the best value Q* through many rounds of learning. Q* is given by Eq. (1):

  Q* = maxa Q(s, a)    (1)

and each Q-value in Q(S, A) is updated by Eq. (2):

  Q(St, At) ← Q(St, At) + α [ Rt + γ maxa Q(St+1, a) − Q(St, At) ]    (2)

In (2), α is the learning coefficient, with a value within [0, 1]; the value of α is normally decreased over time to obtain the convergence of Q. Rt is the immediate reward the agent receives after each exercised action, exhibiting its positive or negative impact. maxa Q(St+1, a) is the best value estimated from the experience the agent has learned so far, representing the accumulated reward expected from the state St+1 onward. γ is the discount factor, which decreases the influence of future rewards.
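A minimal Python sketch of the tabular update of Eq. (2) with ε-greedy action selection is given below; it is a generic illustration, with the action set, parameter values and helper names (choose_action, q_update) assumed for the example rather than taken from the paper.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1      # assumed learning parameters
actions = ["decrease_V", "hold", "increase_V"]
Q = defaultdict(float)                     # Q[(state, action)], initialized to 0

def choose_action(state):
    """Epsilon-greedy selection over the Q-table."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One application of Eq. (2)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])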


2.2 The Shading Effect of PV Array

To facilitate the discussion of the MPP of the PV array, the equivalent circuit of the PV cell is analyzed. The equivalent circuit of an ideal solar cell is composed of a current source in parallel with a diode, together with a shunt resistance and a series resistance, where the magnitude of the current source is proportional to the sunlight illuminance, as shown in Fig. 2.

Fig. 2. The equivalent circuit of a solar cell and the load.

Ideally, the shunt resistance Rp of the solar cell is very large, so the current flowing through it can be ignored. The output current I of the solar cell can hence be expressed as Eq. (3):

  I = IL − ID = IL − IOS [exp( q(V + I·R) / (A·kB·T) ) − 1] ≈ IL − IOS [exp( qV / (A·kB·T) ) − 1]    (3)

In (3), the parameters are:

I: output current (A)
V: output voltage (V)
IL: light-generated current (A)
IOS: dark saturation current (A)
kB: Boltzmann constant
q: electron charge
A: non-ideality factor
R: series resistance (Ω)
T: temperature
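As a numeric illustration of Eq. (3) in its simplified form (series resistance neglected), the following Python sketch sweeps the output voltage and locates the maximum power point; all cell parameter values here are assumed for the example only and are not taken from the paper.

import math

# Assumed example parameters
I_L  = 5.0               # light-generated current (A)
I_OS = 1e-9              # dark saturation current (A)
A    = 1.3               # non-ideality factor
k_B  = 1.380649e-23      # Boltzmann constant (J/K)
q    = 1.602176634e-19   # electron charge (C)
T    = 298.0             # temperature (K)

def cell_current(v):
    """Simplified Eq. (3): I = IL - IOS * (exp(qV / (A kB T)) - 1)."""
    return I_L - I_OS * (math.exp(q * v / (A * k_B * T)) - 1.0)

# Sweep the voltage and keep the point of maximum power P = V * I
voltages = [v / 1000.0 for v in range(0, 700)]
powers = [v * cell_current(v) for v in voltages]
i_max = max(range(len(powers)), key=powers.__getitem__)
print(voltages[i_max], powers[i_max])   # maximum power point for these assumed parameters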

2.3 The PV Array Under Partial Shading Condition

A PV array placed outdoors is often shaded by uncertain factors such as clouds, leaves, or dust. When the PV array is partially shaded, the photoelectric conversion efficiency of the PV array may be affected. The following example shows the case where two solar cells are connected in series and one of them is shaded. The influence on the I-V and P-V characteristic curves is shown in Fig. 3(a) and (b), respectively.


Fig. 3. Two solar cells connected in series, one of which is shaded.

The case where two solar cells are connected in parallel and one of them is shaded is shown next. The influence on the I-V and P-V characteristic curves is shown in Fig. 4(a) and (b), respectively.

Fig. 4. Two solar cells connected in parallel, one of which is shaded.

The shadow generated by external shading of the PV array will cause multiple local maxima in the P-V characteristic curve, as illustrated in Fig. 5 by taking a 2 × 2 PV array as an example. The shaded solar cells will generate less current than the other, unshaded solar cells and will act as resistors in the circuit, such that the so-called hot spot [12] will occur.

Fig. 5. The P-V characteristic curve under partial shading condition.


2.4 Calculating the Number of Partial Shading Patterns

To calculate the number of partial shading patterns, it is assumed that an M × N PV array consists of N modules connected in series by M strings connected in parallel [14]. An example of a 3 × 4 PV array, with 2 and 1 shaded PV modules in the first and second parallel strings respectively, is shown in Fig. 6.

Fig. 6. An example of a 3 × 4 PV array.

First, consider the case of 1 shaded PV module in any parallel-connected string, denoted by i = 1. Extending in the parallel direction, there will be two cases, i.e. (i = 1, j = 1) and (i = 1, j = 2). It can then be confirmed by counting that there are two cases when i = 1, as shown in Fig. 7.

Fig. 7. All the shading patterns when i = 1, i.e., one module shaded in a parallel string.

Next, the shading patterns for i = 2, i.e., 2 shaded modules in a parallel connection, are counted. When i = 2, there are two kinds of shading cases when extending to j = 2, as Fig. 8 shows. Here the shaded numbers i and j are regarded as an i × j sub-array: fixing i in the series direction, the shading patterns are counted over the j shaded modules in the parallel direction. The shading patterns of a 2 × 2 solar array are shown in Fig. 7 for the case i = 1, and in Fig. 8 for the case i = 2 and j = 2, respectively. When the size of the PV array is known to be M × N, the partial shading patterns can be counted over i and j, and a table for calculating the shading patterns of an M × N PV array can be built as in Fig. 9. For example, if M = 2 and N = 2, i.e., the array is 2 × 2, looking up the table gives 6 shading patterns.


Fig. 8. The shading pattern when i = 2 and j = 2

Fig. 9. Table for calculating shading pattern of M × N PV array.

It can be seen from the table that for a 2 × 2 PV array, when the series shading is i = 1 there are two parallel cases (j = 1 and j = 2), and when the series shading is i = 2 there are three cases over j = 1 and j = 2, giving five shaded patterns in total. Since this count does not include the case with no shading, the unshaded pattern must be added. Hence, there are six shading patterns for the 2 × 2 PV array.

3 System Architecture and Simulation

3.1 Bullseye Reward Function of RLMPPT for PV Array with PSC

To simulate the bullseye reward function of RLMPPT for a PV array with PSC, a 3 × 2 PV array, as shown in Fig. 10, is taken as an example, where in Fig. 10(a), (b) and (c), respectively, no modules, three modules, and two modules are shaded. For each kind of partial shading condition, the pre-trained bullseye reward function of the RLMPPT is different. The bullseye reward functions for the shading patterns of Fig. 10(a), (b), and (c) are shown in Fig. 11(a), 11(b), and 11(c), respectively. The simulation flowchart of the proposed RLMPPT [13] for a PV array with PSC is shown in Fig. 12.


Fig. 10. Three different kinds of partial shading conditions for the 3 × 2 PV array.

Fig. 11(a). Bullseye reward function without shading

Fig. 11(b). Bullseye reward function with 3 PV modules under shading.


Fig. 11(c). Bullseye reward function with 2 PV modules under shading.

Fig. 12 summarizes the simulation flow:

1. Start; initialize the learning parameters and calculate the shading patterns.
2. Detect the current and voltage of the series- and parallel-connected PV array.
3. Decide the shading type and select its bullseye reward function.
4. Execute the RLMPPT and detect the current and voltage again.
5. If the shading type has changed, return to step 3; otherwise continue with step 4.

Fig. 12. The simulation flowchart of the RLMPPT for PV array with PSC


3.2 The Algorithm of RLMPPT for PV Array with PSC

The algorithm of RLMPPT for a PV array with partial shading condition is shown in Fig. 13.

1. Read the value of temperature and illuminance.
2. Read the partial shading pattern of the PV array.
3. Initialize all Q(s, a) ← 0
4. While the execution continues:
5.   Select an action by ε-greedy and obtain P(i), V(i)
6.   If P(i) > MaxP then
7.     MaxP ← P(i)
8.     MaxV ← V(i)
9.   end
10.  Set i = i + 1, for the next execution time
11.  Calculate reward r′ and observe s′
12.  Update the Q value by Eq. (2)
13.  s ← s′
14. end while loop

Fig. 13. The algorithm of the proposed RLMPPT for PV array with PSC

Line 1 reads in the current temperature and illuminance values for the PV array simulation. Line 2 senses the data to detect the shading pattern of the PV array and searches memory for the bullseye reward function of RL corresponding to the detected shading pattern. In the reinforcement learning loop, the agent explores the current environment and selects an action by ε-greedy, then judges whether the current pair of working voltage and obtained power, i.e., (V, P), falls inside the bullseye reward function. If (V, P) is inside the bullseye, a reward of 50 is given; otherwise, a reward of 0 is given. The state is then determined by the signs of P and V after the agent's action: if P > 0 and V > 0, i.e., (+, +), the state value is 0; if P > 0 and V < 0, i.e., (+, −), the state value is 1; and if P < 0 and V > 0, i.e., (−, +), the state value is 3. After the action, reward, and state are updated, the Q value is updated and the predicted working voltage is sent out, which ends this single execution.
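The following Python fragments (our own simplification; the voltage perturbations, the rectangular bullseye region, and the fallback state value are assumptions, not the authors' code) show the shape of one execution of this loop:

import random

ACTIONS = [-1.0, 0.0, +1.0]   # assumed voltage perturbations in volts

def state_from_signs(P, V):
    # State encoding described above; the remaining (-, -) case is assumed to be 2
    if P > 0 and V > 0: return 0   # (+, +)
    if P > 0 and V < 0: return 1   # (+, -)
    if P < 0 and V > 0: return 3   # (-, +)
    return 2

def bullseye_reward(V, P, bullseye):
    # Reward 50 if the operating point (V, P) lies inside the pre-trained bullseye, else 0
    (v_lo, v_hi), (p_lo, p_hi) = bullseye
    return 50.0 if (v_lo <= V <= v_hi and p_lo <= P <= p_hi) else 0.0

def choose_action(Q, s, eps=0.1):
    # epsilon-greedy selection over the tabular Q values
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[s][a])

# One execution: pick an action, measure (V, P) from the array, compute the bullseye
# reward, derive the next state, then update Q with the rule of Eq. (2).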

4 Experimental Results

In the experiment and simulation, the PV module used is the MSX60, whose specification is shown in Table 1. In the pre-training used to obtain the reinforcement learning bullseye reward function, 5 days of real climate data recorded at Loyola Marymount University are used, with the average values given in Table 2. To show the advantage of the proposed method, our results are compared with those of the Two-stage method. It can be seen from Fig. 14 that for the unshaded PV array of Fig. 10(a), our method reaches convergence starting from about 45 s.


Table 1. The PV module used is the MSX60 and its specification.

Parameter   Value
Pmax        60 W
Voc         21.1 V
Isc         3.8 A
VMPP        17.1 V
IMPP        3.5 A

Table 2. Averaged climate values over the 5 days (2015/08/01–08/05, 10:00 AM–2:00 PM).

Parameter                             Value
Averaged temperature                  25.025 °C
Temperature variation                 0.4387 °C
Average illumination                  785.9202 W/m2
Standard deviation of illumination    86.14 W/m2

Fig. 14. The distribution of maximum power difference vs. time of the proposed method.

To simulate the shading pattern changing from Fig. 10(b) to Fig. 10(c), the distributions of maximum power difference vs. time are shown in Fig. 15 and Fig. 16, respectively. It can be seen from Fig. 15 that convergence takes about 50 s, while in Fig. 16, because the shading pattern changes from 3 shaded PV modules to 2 shaded PV modules, it takes a longer time to reach convergence. Under the same climate and PV array specification, the results of the Two-stage method for the unshaded PV array, for 3 shaded PV modules, and for the change from 3 shaded PV modules to 2 shaded PV modules are shown in Figs. 17, 18, and 19, respectively. In Figs. 17, 18, and 19 the Two-stage method converges at about 38 s, 20 s, and 170 s, respectively, which is comparable with our proposed RLMPPT for the PV array with PSC.


Fig. 15. The distribution of maximum power difference vs. time of the proposed method for the shading condition of Fig. 10(b)

Fig. 16. The distribution of maximum power difference vs. time of the proposed method for the change from 3 shaded PV modules to 2 shaded PV modules

Fig. 17. The distribution of maximum power difference vs. time of the Two-stage method.

Compared with the Two-stage method, our proposed RLMPPT converges slightly more slowly, yet it achieves better accuracy of the maximum power, as can be seen from Fig. 20 and Fig. 21. In Fig. 20, the maximum power difference of the proposed method is below 1 W, while that of the Two-stage method is about 6–7 W.


Fig. 18. The distribution of maximum power difference vs. time of the Two-stage method for the shading condition of Fig. 10(b).

Fig. 19. The distribution of maximum power difference vs. time of the Two-stage method for the change from 3 shaded PV modules to 2 shaded PV modules.

Fig. 20. The distribution of maximum power difference of the proposed method.

Fig. 21. The distribution of maximum power difference of the Two-stage method.


5 Conclusions

In this study, a 3 × 2 PV array was used to verify the proposed RLMPPT for a PV array under partial shading condition. Using actual environmental parameters, the proposed method converges to the maximum power point in a short time, even when the array changes from one shading pattern to another. Compared to an existing Two-stage method, the proposed method reaches the maximum power point with better accuracy, which indicates that our method also finds the actual global maximum better than the Two-stage method. In practice, a real PV array may be larger than the experimental one, but as long as the proposed method can calculate and detect the actual shading pattern as described in this paper, global maximum power point tracking can be achieved even under variable conditions, yielding more accurate and efficient power generation.

References
1. Singh, P.O.: Modeling of photovoltaic arrays under shading patterns with reconfigurable switching and bypass diodes. The University of Toledo Digital Repository, Paper 723 (2011)
2. Bahgata, A.B.G., Helwab, N.H., Ahmad, G.E., Shenawy, E.T.E.: Estimation of the maximum power and normal operating power of a photovoltaic module by neural networks. J. Renew. Energy 29(3), 443–457 (2004)
3. Femia, N., Petrone, G., Spagnuolo, G., Vitelli, M.: Optimization of perturb and observe maximum power point tracking method. IEEE Trans. Power Electron. 20(4), 963–973 (2005)
4. Bahgata, A.B.G., Helwab, N.H., Ahmad, G.E., Shenawy, E.T.E.: Maximum power point tracking controller for PV systems using neural networks. J. Renew. Energy 30(8), 1257–1268 (2005)
5. Esram, T., Chapman, P.L.: Comparison of photovoltaic array maximum power point tracking techniques. IEEE Trans. Energy Convers. 22(2), 439–449 (2007)
6. Kobayashi, K., Takano, I., Sawada, Y.: A study of a two stage maximum power point tracking control of a photovoltaic system under partially shaded insolation conditions. Sol. Energy Mater. Sol. Cells 90(18–19), 2975–2988 (2006)
7. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996)
8. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
9. Zhan, Z., Wang, Q., Chen, X.: Reinforcement learning model, algorithms and its application (2011)
10. Rodrigues Gomes, E., Kowalczyk, R.: Dynamic analysis of multiagent Q-learning with ε-greedy exploration. In: ICML 09 Proceedings of the 26th Annual International Conference on Machine Learning (2009)
11. Hinton, G.E., Salakhutdinov, R.R.: Replicated softmax: an undirected topic model (2009)
12. Herrmann, W., Wiesner, W., Vaassen, W.: Hot spot investigations on PV modules - new concepts for test standard and consequences for module design with respect to bypass diodes. In: IEEE PV Specialists Conference, pp. 1129–1132 (1997)
13. Hsieh, H.-I., Wang, H., Liu, C.-T., Chen, W.-Y., Hsu, R.C.: A reinforcement learning based maximum power point tracking method for photovoltaic array. Int. J. Photoenergy 2015 (2015)
14. Walker, G.: Evaluating MPPT converter topologies using a MATLAB PV model. J. Electr. Electron. Eng. 21, 49–55 (2001)

A Logic-Based Agent Modelling Paradigm for Investment in Derivatives Markets Jonathan Waller(B) and Tarun Goel Endfield Derivatives, LLC, Wilmington, DE 19808, USA [email protected], [email protected]

Abstract. The paper describes Multiagent Systems, going into the details of the agents themselves and how they can be configured for financial gain in the stock market. How the agents of a Multiagent System interact with each other and behave within their environment is detailed, following their growth from a single, relatively simple model to a large network of models constantly interacting with different agents. The stock market is used as the setting to demonstrate that Multiagent Systems can be more efficient than other conventional models such as Machine Learning models. The common problems that can arise from bad agents are described, introducing a real-world system scenario to better explain the concept. Finally, the model itself is presented, with an insight into how it works and some of the high-level profitable returns made from the model over the course of its deployed timeframe. The future of the design is then discussed.

Keywords: Multiagent Systems · Quantitative finance · Optimization · Investment decisions

1 Introduction

1.1 Features

Multiagent systems (MASs) are generally described by their individual agents, their interactions with one another, and the environments in which they reside. Agents can be described in many ways, but for this paper they are considered as separate nodes participating in an activity and reporting the results either to a central user or to each other, with some sort of moderator or overseer analysing the decisions made. The agents themselves are described by attributes such as uniformity, autonomy, goals, abilities, and flexibility, which in turn help to describe the overall system they serve. Uniformity asks whether the agents are homogeneous or heterogeneous, i.e., whether the agents are the same or have different goals and abilities. The agents pursue goals rather than following preset procedures fixed at their creation: instead of locking in explicit instructions for how to achieve those goals, optimal methods are learned by the agents, given only certain constraints set by the designer at the initialization of the system. The agents constantly learn from their actions and interactions with one another as time passes.


1.2 Interactions

MASs achieve these goals by interacting with other agents via sensors, cognition and the like to obtain information (percepts) about their environment. The ways in which these agents socialize within a MAS can range from simple signal exchange to more knowledge-intensive, semantically rich language exchange.

1.3 Environment

It is important to know what type of environment the agents will be interacting with. Of primary interest are accessibility, predictability, periodicity, dynamics and the number of states in which the environment can exist. An accessible environment is one in which the agent can obtain complete, accurate, up-to-date information about the environment's state [1]. The stock market would be considered inaccessible, as one cannot know all the information about that environment, e.g., earnings reports before they happen, what every investor is thinking, or national security concerns that will affect the market. This leads to the static versus dynamic question. A static environment is one that remains unchanged except for the performance of actions by the agent [1]. However, there are outside agents beyond those in the stock market that affect our environment, making it dynamic in nature. A dynamic environment is one that has external processes which affect it. A recent paper in the Federal Reserve Bank of St. Louis Review argues for a mathematical fear-gauge model of forward guidance. It models connections from various announcements by members of the Federal Reserve Board to subsequent movements in a basket of securities [2]. Even if this model is sound, it is believed to be already priced in to the market. This is a more game-theoretic belief that whatever actions or inactions are made by external actors will influence the environment, namely securities prices. This is why the model focuses upon prices rather than upon the actions that lead to prices. Each price point of a security represents a potential state of the environment, i.e., a countable set of states. Securities prices and asset returns may appear discrete, as prices move by discrete ticks; however, there are an incalculable number of price movements. In a chess game, the environment is discrete, as there are a finite number of possible moves. Furthermore, the environment is non-episodic: there is an intertemporal relation between present and future episodes. People do care about past performance in the market, and current sentiment will affect future performance somewhat. Linking this to the stock market, only by carefully analysing real-time actions can one achieve a desired end goal, whether profiting from stock market trades or winning a game of chess. Financial markets belong to the general environment class of inaccessible, non-deterministic, dynamic, continuous and non-episodic, making them the most complex to model [2].

1.4 Objectives

The hierarchy then explains that investment strategies are not completely random but in principle can be proven to be deterministic. When considering a risk-based scale and


training a MASs model to lower risk and maximise profit, a valid investment decision opportunity can be created. The MASs will use a combination of agents, which will make all possible investment-based decisions for a shareholder and report the potential successes and failures back to a central mediator. This in turn will lead to an incremental strategy where, after vigorous training regimes, a least-risk solution will be highlighted, leading to profit maximisation. The training itself is also an important topic to consider, as the processing power required for such a system calls for unique solutions given limited hardware performance. Why MASs can be favoured over other conventional models will also be discussed. Section 2 looks into the individual single agents, how they develop into a system of multiple agents, and gives an overview of the introduction of disturbances into the agents together with a comparison of the design against other common artificial-intelligence-based approaches. Section 3 revisits the concept of disturbances in the system and their effects. Section 4 describes the created model, while Sect. 5 looks at the results from the model.

2 Background

2.1 Intelligent Agents

Intelligent agents exhibit goal-directed behavior. The system has two goals in mind: profit maximisation and risk minimisation. Intelligent agents socialise, i.e. cooperate and/or compete, otherwise known as satisfying sociability. There is also the desire to seek as many trades as possible (see Fig. 1). This creates a signal-to-noise problem. Taking more trades may increase returns, but it will also increase risk, which may inevitably decrease returns. Various metrics can be used for determining a marginal rate of transformation between expected returns and risk to maximize utility (profit). There is also competition about when to place trades and whether each is a call or a put. The agent can perceive (stock prices) and can react in a timely manner. This means that at some point it must decide whether the goal is still feasible. A hard stop is programmed if its objective is not reached by a certain date. The agents may also decide to stop trying earlier should they reach their objective, or if the objective becomes untenable (see Fig. 1 and Fig. 2).

2.2 Amplification and Disturbances

One way to further understand the risk scenario is by looking at a common example. Heavy traffic is a major problem in many cities, and many solutions have been put in place to maximise traffic flow, such as infrared sensors to detect vehicle presence and inductive coils embedded in the roads at junctions, whose states feed algorithms that choose which lights to make red and which to make green for maximum efficiency [4]. However, when people break the rules in this setting, it can compromise the algorithm's integrity: for example, a car runs a red light, but there is not enough space for it to safely clear the junction, so it is stuck in the middle of the junction, blocking traffic from other lanes. This same concept can be applied to an agent-based model to combat the negative impacts that one bad action can have on an overall system.


Fig. 1. Single agent showing its continuous interactions with the system around and how this affects its output signal, or logic. The figure shows an interaction event and three behavioral states: Behavior 1—null/quit, Behavior 2—buy/enter, Behavior 3—sell/exit through time. An orthogonal edge-routing algorithm is used because it limits edge crossing and length for a small number of nodes [3].

The vehicle-based traffic analogy can also be translated to the stock market workflow and to how the structure of the investment market is ordered. A break in the flow of the organisations can propagate, sometimes exponentially, and have rapid positive or negative consequences.

2.3 Why Multiagent Systems

With the massive advances in computing power due to decreases in cost and increases in scalability from cloud computing, quantitative finance has blossomed. The field of machine learning (ML) offered promise in discovering and creating financial models once considered too computationally expensive [5]. However, the model presented herein attempts to explain how and why MASs should be used above other techniques such as ML.


Fig. 2. The development of a single agent into a multiple agented network, creating a network of interactions that converge onto a consensus, otherwise it clusters consensuses into multiple competing scenarios. The figure shows multiple agents and interaction events and three behavioral states: Behavior 1—null/quit, Behavior 2—buy/enter, Behavior 3—sell/exit through time. Clusters can be seen representing interaction events. Edges represent travelling from behavioral state to state.

ML generally comes in two flavours: supervised and unsupervised [5]. Supervised learning involves a great deal more human intervention than MASs, as it requires the architect to provide the inputs (percepts) and the desired outputs. For example, a vending machine may be required to determine the various denominations of coins being inserted. Upon being provided with features about said coins, the supervised ML algorithm will then attempt to classify them; here, the algorithm is given preset classes to fit the data into. In the unsupervised arena, an ML system would determine the classes by clustering data from a large dataset, at which time the architect must name, or attempt to name, the clusters discovered by the ML system. However complex the ML system becomes, (i) it will only prescribe one optimal decision at a time, and (ii) it will suffer from being monolithic. That is, it behaves like a single-agent system: the algorithm does not compete or cooperate with itself, and all processes are centralized. Even if there are multiple inputs, sensors, actuators or robots, a single agent parses the inputs and decides its action [6]. Monolithic System. Monolithic systems have monolithic architectures, that is, there is a single overarching control of all internal processes [7].


Fig. 3. The figure shows part of the network in Fig. 2 with interaction events and three behavioral states: Behavior 1—null/quit, Behavior 2—buy/enter, Behavior 3—sell/exit through time. Clusters can be seen representing interaction events. Edges represent travelling from behavioral state to state. The blue box in Fig. 2 is magnified and represented here in Fig. 3.

Certain problems would prove intractable in monolithic systems. The monolithic architecture suffers from low fault tolerance, as its components are tied closely together, giving the system limited flexibility. This type of structure means that the architect must understand and code for how all the constituent parts of the system will fit together [7]. On the other hand, a subsumption agent architecture allows for multiple actions to be prescribed at any one state in which the environment may exist [6]. The MAS may suggest entering 100 securities at once, or just one. The investment advice for Apple doesn't affect the investment advice for Microsoft. It also allows for paraconsistent or contradictory prescriptive actions. In an environment describing the stock market, this may lead to buying calls and puts on the same security, or even buy and don't-buy advice contemporaneously. One might then consider that one of these actions would have to prove ill advised to avoid contradiction. However, such actions, like straddles, could be conducted in a securities market and could be considered contradictory, as the agent would be both betting for and against the same security. Straddle. A Straddle order is an order to purchase calls and puts in the same quantity, having the same expiration and strike price on the same security at the same time [8].

2.4 Overtraining

Lastly, despite any ML algorithm's ability to learn and evolve, at any state space within said model it is effectively static. A top-down or a bottom-up approach would suffer from a division fallacy or a composition fallacy, respectively. A top-down approach would in essence use macroeconomic indicators to model markets; this would fallaciously assume that component parts behave the same as the macroeconomic whole. Conversely, if a model were trained with data from a single sector or security, say Apple (AAPL), it would falsely try to model the whole based upon movements seen in AAPL. The problem remains that a single monolithic model is generated to try to describe both the whole and its constituents. We have seen how this paradigm falls apart in physics: relativity describes the macroscopic universe fairly well, but fails disastrously in the quantum world, and the converse is true for quantum mechanics. An argument for the obsolescence of machine learning is being made here; however, an agented model running multiple, independent, heterogeneous ML algorithms contemporaneously against each component part of the whole is needed. In this way, any ML system won't over-train itself by fixing what is not broken.

3 Literature Review 3.1 String Stability In the context of autonomous vehicles, string stability refers to the maintenance of a constant distance between a series of vehicles moving in the same direction behind one another [9]. Sometimes, a disturbance can occur which disrupts this distance making each vehicle alter its distance somewhat where the noise is amplified through the set of vehicles (see Fig. 4). This problem is typically present when trying to model a controller that controls the acceleration of an object due to the following derivation. P˙ = v

(1)

v = Gc

(2)

P¨ = Gc

(3)

H (s) =

1 s2

(4)
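As a quick check (ours, using SciPy; not part of the original derivation), the double-integrator transfer function in (4) indeed has both poles at the origin, which is the source of the instability noted below:

from scipy import signal

# H(s) = 1 / s^2 : numerator [1], denominator s^2
H = signal.TransferFunction([1.0], [1.0, 0.0, 0.0])
print(H.poles)   # [0. 0.] -> two poles at the origin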


Fig. 4. Simulation of the positions of cars travelling one behind another, with a disturbance introduced to Car 1 that is amplified as the other cars adjust their positions to compensate, producing inconsistencies.

P is the position, v is the velocity, Gc is the controller (the acceleration in this case), and the transfer function of the model is given as H(s). Modelling a controller on acceleration tends to yield two poles at the origin, which tends to yield instability. This sort of problem can be related to the compliance levels of an individual: the past actions witnessed affect future outcomes, hence the cost at each junction may need to scale based on the results obtained at the previous junction, creating the need to manage the entire network appropriately [10].

3.2 Flash Crashes, Bad Actors and Market Perturbations

Per our string stability theory of markets, one bad behavior begets many more of its kind, descending systems into chaos. In traffic flux dynamics, a Cambridge study found that autonomous driving vehicles could reduce traffic congestion by up to 35% over egocentric driving; researchers programmed a small fleet of cars and recorded changes to traffic flow when one random car stopped [11]. Bad agents with poor decisions can be avoided in a MAS, since no decision will be made until all agents have reported back a result that is found to be least-risk based. However, the introduction of rogue traders to the system could be a concern. They could have the negative effect of training the MAS model to follow their decision pattern, effectively yielding a solution that is non-optimal. A perfect example of this is when the rogue trader does not directly introduce extra risk, but introduces more costs, which has the unrecognized effect of a greater risk.


Along with rogue traders, market perturbations have a high impact on risk. Smaller companies that slightly depend on their larger competitors could be exponentially impacted by fluctuations of their co-dependent companies’ stock prices. Taking a look at an example, in 1990, Harry Markovitz won the Nobel Prize in economics for his Modern Portfolio Theory (MPT), a mathematical formulation and extension upon the idea of diversification to maximize asset return given a specific level of risk aversion. This model prescribes an optimal portfolio diversification based upon covariances of asset returns [12]. This only reduces static risk: the risk as determined by MPT at one point in time. Static risk can be alleviated, but MPT does nothing about intertemporal risk. To consider the broader idea, another diversification model must be taken into account. This means that we need to diversify to remove intertemporal risk. The idea is to minimise the probability of an unexpected situation—black swans or flash crashes, having untoward effects upon assets. Current models of diversification are long-term, accounting only for static risk. As evidence of this, the California Public Employee Retirement System (CalPERS)—the largest pension fund in the United States—experienced a 51.4% loss in three quarters from July 2008 to March 2009 [13]. At seemingly no point did the managers—of which there were 76 of them acting as external securities managers—decide to start betting against the market [14]. Either they could not—due to bylaw restrictions— or would not hedge for intertemporal risk. As evidence of that, there was a point within that three-quarter stretch that the managers could turn around and start betting against the market or pull out. Admittedly, there are potential geopolitical and feasibility concerns. Subsequently, CalPERS experienced three major corrections in their domestic equity investment some time after the Great Recession: in 2010Q2 (~11%), 2011Q3 (−11%) and 2015 (−11%), proving that the consensus ideology was diversification would be enough having foreknowledge of a severe market downturn, or else a false sense of security has been placed into these fiduciaries [15]. The main reason for divesting some of the total assets among a large number of external securities is to allow for nimble transaction maneuvering. However, it would seem that the general consensus is to remain exposed in the market, and ride through the corrections. This means current models do not account for a second dimension of diversification: exposure and non-exposure. The purpose of intertemporal risk minimization is to avoid these events, by limiting the percentage of AUM exposed at a given state, and limiting portfolio exposure to the markets across time. This decreases the probability that an investment will be exposed during a black swan event. 3.3 Existing Models The process in which MASs are used to assess patterns within financial markets is not solely unique. Others have attempted to implement their own versions of models by characterizing single agents with particular parameters so that a MAS is generated with a slightly altered end goal. However, most of these solutions that are successful essentially yield the same output, which is financial profit. One such example attempts to simulate the behaviour of the entire stock market through agents [16]. They do this by specifying a fixed number of starting agents, three in this case, and label them as investors with altered personas. 
The agents are then given categories for the ways in which they make investment decisions. The interaction between the agents helps to form the model and determine, probabilistically speaking, what should actually be invested in. The research in [16] also does well to highlight the existing work in this field. In many ways, most of the work in this area differs only in the initial tuning stage, otherwise known as the creation of the agents. The constraints that the agents then rely on, typically a mathematical formula, are defined based on what is considered more significant than other factors in forming the MAS.

4 Proposed Investment Strategy

4.1 The Created Model

Agents with memory architecture: agents have percepts; in this case, only price data is used, both historical and real-time. Percepts about the market are used to make determinations about the state of the environment. The agents use a rule set to determine optimal actions given a specific state of the environment. These states are databased to make intertemporal reasonings about the evolution of the environment, more specifically the market and its constituent parts, i.e., securities. The agents also catalog decisions across time to measure the performance of said decisions and course-correct. These decisions make tiny market perturbations in the environment, which creates a feedback loop and generates new percepts.
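A minimal sketch (ours; the rule set and thresholds are placeholders, not the production logic) of such an agent with memory could look like:

from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryAgent:
    # Stores price percepts, applies a simple rule set, and catalogs its decisions over time
    prices: List[float] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)

    def perceive(self, price: float) -> None:
        self.prices.append(price)

    def act(self) -> str:
        # Placeholder rule: compare the latest price with a short moving average
        if len(self.prices) < 5:
            action = "null/quit"
        else:
            avg = sum(self.prices[-5:]) / 5
            action = "buy/enter" if self.prices[-1] > avg else "sell/exit"
        self.decisions.append(action)   # kept so past performance can be measured later
        return action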

5 Outputs from the Model

5.1 Results

The described model was then put through a test. The statistics of the output were analysed and the returns were recorded throughout the trial period of the model. An initial sum was invested into a proof-of-concept fund. The model assumes a game-theoretic movement of price, makes decisions about when to enter, speculates about direction and timeframe, and maximizes price target trajectories whilst minimizing exposure time in the market, i.e., intertemporal risk minimization. The agents act like state machines, re-evaluating their choices depending on the state. The graph below shows the cumulative benchmark returns year-to-date (see Fig. 5). The overall model has yielded over a 200% return on what was initially invested into the design. Although there are stages in which the model's decision incurs a loss, the overall trajectory has yielded an almost linear progression of profit after an initial period of fine-tuning of the design parameters.

5.2 Limitations and Weaknesses

The current model potentially suffers from over-communication, or at the very least the need to communicate, which leads to information lag. Unlike a system underpinned by ML algorithms, a MAS doesn't have constant, instantaneous information about its component agents and their behaviors. This is because the agents are semi-autonomous or fully autonomous and can make novel decisions and/or change their belief structure. There is not a central controller (monolithic system) that predetermines the behavior.


Fig. 5. Cumulative benchmark returns for the Dow Jones Industrial Average (green), NASDAQ Composite (purple) and S&P500 (red) measured against the prototypical account year-to-date 2019, stopping at 8th Oct 2019 based on the decisions, which the model indicated, should be taken. The INDU represents the Dow Jones Industrial Average, the COMP represents the NASDAQ composite, and the SPX represents the S&P 500 with PROTOTYPICAL ACCOUNT (blue) being the rate of the models account.

6 Conclusion

The idea behind the use of MASs was discussed in detail throughout the paper, highlighting its implications and applications for maximising returns when investing in stocks. The design, when run, demonstrated that the return will in some way be greater than the amount invested. There is always a risk with such a model, but with further training and data the model can be improved to optimally minimise that risk. One evident approach would be to analyse the stock prices and trends of companies on a regular basis by exporting the model to a cloud-based solution, so that new data is always read in and used as training information and the model stays up to date with the news affecting stock price values.

References
1. Weiss, G.: Multiagent Systems, 2nd edn. The MIT Press, Cambridge (2019)
2. Kliesen, K., Levine, B., Waller, C.: Gauging market responses to monetary policy communication. Federal Reserve Bank of St. Louis Review 101(2), 69–91 (2019). Second Quarter
3. Freivalds, K., Glagolevs, J.: Graph compact orthogonal layout algorithm. In: Fouilhoux, P., Gouveia, L., Mahjoub, A., Paschos, V. (eds.) Combinatorial Optimization. ISCO 2014. LNCS, vol. 8596. Springer, Cham
4. Engineers Journal. http://www.engineersjournal.ie/2015/06/30/traffic-monitoring-nra/. Accessed 10 Nov 2019
5. Simon, A., Singh, M., Venkatesan, S., Ramesh, D.: An overview of machine learning and its applications. Int. J. Electr. Sci. Eng. 1(1), 22–24 (2015)
6. Stone, P., Veloso, M.: Multiagent systems: a survey from a machine learning perspective. Auton. Robot. 8(3), 1–57 (2000)


7. Stephens, R.: Beginning Software Engineering, 1st edn. Wiley, Indianapolis (2015)
8. SEC Options Trading Rule 6. https://www.sec.gov/rules/sro/pcx/34-49451_a6.pdf. Accessed 07 Nov 2019
9. Oncu, S., Wouw, N., Heemels, W., Nijmeijer, H.: String stability of interconnected vehicles under communication constraints. In: IEEE 51st IEEE Conference (2012). https://doi.org/10.1109/cdc.2012.6426042
10. Ferraro, P., King, C., Shorten, R.: Distributed ledger technology for smart cities, the sharing economy, and social compliance. arXiv e-prints, arXiv:1807.00649, October 2018
11. Cambridge University: Driverless Cars Working Together Can Speed Up Traffic By 35 Percent. https://www.cam.ac.uk/research/news/driverless-cars-working-together-can-speed-up-traffic-by-35-percent. Accessed 07 Nov 2019
12. Markovitz, H.: Portfolio selection. J. Financ. 7(1), 77–91 (1952)
13. CalPERS Comprehensive Annual Financial Report, Fiscal Year Ending June 30, p. 17 (2009)
14. CalPERS Comprehensive Annual Financial Report, Fiscal Year Ending June 30, pp. 71–72 (2009)
15. California Public Employee Retirement System 13F Metrics. https://whalewisdom.com/filer/california-public-employees-retirement-system. Accessed 11 Nov 2019
16. Souissi, M.A., Bensaid, K., Ellaia, R.: Multi-agent modeling and simulation of a stock market. Invest. Manag. Financ. Innov. 15(4), 123–134 (2018)

An Adaptive Genetic Algorithm Approach for Optimizing Feature Weights in Multimodal Clustering Manar Hosny(B) and Sawsan Al-Malak Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia [email protected]

Abstract. Social Media is a popular channel of communication, where people exchange different types of high volume and multimodal data. Cluster analysis is used to categorize this data to extract useful information. However, the variation of features that can be used in clustering makes the clustering process difficult, since some features may be more important than others, and some may be irrelevant or redundant. An alternative to traditional feature selection techniques, especially with the absence of domain knowledge, is to assign feature weights that depend on their importance in the clustering process. In this paper, we introduce a multimodal adaptive genetic clustering (MAGC) algorithm that clusters data according to multiple features. This is done by adding feature weights as an extension to the clustering solution. In other words, feature weights evolve and improve alongside the original clustering solution. In addition, the number of clusters is not determined a priori, but it is adapted and optimized during the evolutionary process as well. Our approach was tested on a large collection of Flickr images metadata and was found to perform better than a non-adaptive genetic algorithm clustering approach and to produce semantically related clusters. Keywords: Clustering · Genetic algorithms · Multimodal data · Feature selection · Adaptive feature weights

1 Introduction

User generated content has been vastly growing on the World Wide Web (WWW). In fact, the essence of Web 2.0 is to transform users into co-developers who generate and share content in websites. It has developed over the last few years into becoming a channel of communication that is widely known as Social Media. People now share, like, and annotate content of all types, such as text (posts or tweets), documents, images, and videos. Social media data is complex and multimodal in nature. Multimodality refers to seeing an object from different perspectives. For example, a post, a tweet, a shared photo or video may have associated tags, geolocation features, temporal features, visual characteristics, etc. Multimodality makes it hard to manage social media data, due to its volume and structure diversity [1]. Research efforts have been made to manage and


categorize data in social media using a variety of approaches, one of which is cluster analysis. Clustering techniques, in general, construct groups of data objects such that objects belonging to the same group are similar and objects belonging to different groups are dissimilar. Clustering data based on their semantic meaning has recently gained the attention of researchers, and efforts in optimizing the clustering quality in such domain are still ongoing. Using metadata in clustering has proven effective in many studies, yet exploiting the multimodal nature of shared information has been proven to give better semantically related clusters [1]. One problem associated with multimodality, though, is how to weigh the different features associated with the data, since some features may be more important than others during the clustering process. In addition, traditional clustering approaches usually require certain parameters, such as the number and shape of clusters to be known a priori. The latter is considered a limitation since the algorithm that requires such prior knowledge does not discover efficiently the naturally-existing groups within the dataset. In this research, our aim is to implement an optimization model that considers multimodal feature weights during clustering, where the number of clusters and feature weights are not determined a priori. Genetic algorithms (GAs) are well-known evolutionary methods that have been successfully used to handle clustering problems [2]. Thus, we introduce in this paper a novel multimodal adaptive genetic clustering algorithm that clusters information based on multiple features. Our approach adds feature weights as an extension to the clustering solution, such that feature weights are also optimized with the original clustering solution. Optimizing feature weights aims to achieve a clustering in which the most important features, i.e., those that are more relevant to the clustering solution, are targeted by adaptively adjusting their weights during the evolutionary process. The rest of this paper is organized as follows: Sect. 2 overviews some related work to our study. Section 3 formally defines the problem. Section 4 presents the details of the proposed method. Sections 5 and 6 introduce the experimental setup and the architecture design and implementation environment of our system, respectively. Section 7 details the results of the experiments and discusses some challenges and limitations of this work. Finally, Sect. 8 concludes this paper.

2 Related Work Clustering is considered as an unsupervised learning approach, because it tries to identify natural occurring groups of objects without having prespecified labels. Examinable features of data clusters include cluster compactness (the closeness of data objects belonging to the same cluster to each other), and external separation (how far separated clusters are from one another) [3]. Clustering has been widely used in a variety of fields, such as statistics, pattern recognition machine learning, image processing, data compression and data mining. In addition, recently, there has been an emergent need for robust and efficient systems that can manage the exploding volume of the data available on the WWW. Thus, clustering has been applied in many applications of Web indexing and data retrieval as well as recommender systems [1].


As previously mentioned, another issue that has emerged with the multimedia available on the Web is how to handle multimodality in clustering. A common approach in this context is to consider each modality separately. This approach, though, does not consider the interaction between modalities. For example, a dataset that is clustered based on temporal modality only may not be able to detect clusters of data that are related in both spatial and temporal modalities, which could be interpreted as a relation to some event. Multimodal clustering has thus been adopted to obtain more meaningful clustering [1]. Clustering data has gradually progressed from one-way to multi-way, proving the effectiveness of exploiting multimodality. One-way clustering is based on the traditional bag of keywords, which uses the text associated with the document or within it. On the other hand, two-way clustering groups together documents based on keywords and then groups keywords based on the common documents they appear in at the same time. The latter approach proved to be effective in documents that are high in dimensionality [4]. The works in [5–7] exploited spatial and temporal modalities in addition to other modalities in order to identify events in social media. The work in [8] focused on the fusion of text features (tags, text descriptions, etc.) and spatial knowledge to provide a better description of data and add extra semantics. The approach is claimed to be effective in many realistic applications, such as content classification, clustering, and tag recommendation. Images are one of the most common types of contributed content on the Web, due to the ease of capturing it on devices and its richness in content. Online public photo-sharing websites are widespread and billions of images are shared through them. Managing such large collections of images is challenging, and has resulted in many attempts of archiving and clustering algorithms that automate the process. Unlike text documents, images can be difficult to automatically make sense of, due to their entire dependence on metadata, which in many cases may not be fully available or consistent. Clues of the content of images may be extracted from tags or associated captions, and possibly some visual features. The research in [9] and [10] depends on tags and visual features in enhancing the experience of browsing images in social media. Search results can be better enhanced through multimodal clustering, which would also depend on tags and visual features of images [11]. The work in [12] added an extra probabilistic modality, which is user preference, in order to deliver a personalized experience in image retrieval. Visual features and tags can also be used to enhance computer vision and identify salient objects inside images [13]. Analyzing images in social media with reliance on tags only lacks precision, since tags are user generated, and hence error prone. In addition, this reliance on tags motivates users to overwhelm data with relevant tags (tag synonymy), irrelevant tags, and invalid tags, which would lead to inaccurate retrieval. In [14], the authors proposed a method that aims to generate additional semantics after mining images in social media. With other modalities, these extracted semantics are employed in an unsupervised learning model. 
The works in [14] and [15] also proposed approaches for generating additional semantics based on multimodal (textual, visual, and spatial) analysis; however, their approach uses visual notes provided in Flickr as part of the visual features.


The technique proposed in our work is an evolutionary algorithm that is inspired by natural biological evolution. Evolutionary algorithms are meta-heuristic methods that are considered to be efficient in solving complex problems, where they usually provide near-optimal solutions. The clustering problem is considered an NP-hard grouping problem [16], which justifies the use of meta-heuristic algorithms to solve it. Many evolutionary algorithms have been used to solve clustering problems in the literature. Some of these techniques can be found in [6, 17–19]. For more information about evolutionary clustering techniques, the reader is referred to the survey in [2]. Besides evolutionary approaches that considered classical clustering methods without adaptation of feature weights, few studies considered the use of evolutionary algorithms for adapting feature weights while clustering. For example, the work in [20], which in turn adopts some features from the work in [21], presents a co-evolutionary framework to optimize feature weights in multi-dimensional data clustering. The idea is to have two populations evolving simultaneously, one for the clustering solution and another for the weights of the features. The approach was tested on datasets obtained from the UCI machine learning repository, where the results indicate the superiority of the approach compared with another version of the algorithm that does not adapt feature weights, and it also significantly outperforms classical K-means clustering. Finally, Ant Colony Optimization (ACO) [22] is another popular technique that is inspired by the behavior of ants when finding the shortest path from their colony to food sources. The works in [1] and [23] adopt ACO in clustering a set of social media images. ACO is used to optimize feature weights for applications that handle large and high-dimensional datasets.

3 Problem Statement

As defined in [7] and [20], we assume a dataset consisting of N objects X = {X1, X2, ..., XN}. Each object Xi is described by a set of P features, Xi = (xi1, xi2, ..., xiP), where xij is the feature value of the ith data object (Xi) in terms of the jth feature. The objective is to partition the N data objects into K non-overlapping clusters, where K is not previously defined. In addition, we need to find a set W = {w1, w2, ..., wP} of P feature weights, such that each wj is in the range [0, 1] and Σ_{j=1..P} wj = 1 (i.e., the sum of all feature weights is 1). The partition should optimize a certain objective function, described in Sect. 4.5.
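In code, the inputs of this formulation are simply an N × P feature matrix and a weight vector summing to one; a small sketch (ours, with illustrative sizes) of generating a valid random weight vector:

import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 6                    # illustrative sizes; the experiments use six selected features
X = rng.random((N, P))           # N data objects, each with P feature values

def random_weights(P):
    # Random feature weights w_j in [0, 1] normalized so that they sum to 1
    w = rng.random(P)
    return w / w.sum()

W = random_weights(P)
assert abs(W.sum() - 1.0) < 1e-9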

4 Methodology: A Multimodal Adaptive Genetic Clustering (MAGC) In this section, we introduce the details of our proposed Multimodal Adaptive Genetic Clustering (MAGC), where the number of clusters and feature weights are not previously determined. Rather, they are optimized together with the clustering solution (i.e., the partitioning of objects into clusters). Our approach is explained under four sections: Genetic algorithm overview, the solution representation, the objective function, and the genetic operators.


4.1 Genetic Algorithm (GA) GA is a popular optimization technique that is inspired from natural genetics [2]. It constructs a collection of solutions, through selection and combination, in search for a near-optimal solution. An individual solution is called a chromosome, and a set of chromosomes is a population. As an initial step, a population is created randomly, and then a degree of goodness is assigned to each individual chromosome based on a fitness function. A selection of the fittest individuals is then undertaken for further operations that are genetically inspired, such as crossover and mutation, in order to produce a new population. The latter operations are applied for multiple iterations until a specific number of generations is reached or a termination condition is satisfied. The solution to the problem is the fittest one of the last population [24]. 4.2 Solution Representation Each genetic solution (chromosome) in our MAGC is divided into two parts, one for the clustering solution, and the second is for the feature weights solution, as follows: Clusters. To represent the clusters, we use a label-based representation that represents the solution with integer encoding. In this representation, clusters’ centroids are actual data objects (a.k.a. medoids) that represent the cluster [21]. Assuming that we have N data objects that we need to cluster, we generate a number of solutions for the initial population. For each solution, k medoids are randomly selected from the N data objects. The value k (the number of clusters) is randomly chosen within a range of [kmin , kmax ], 

where kmin ≥ 2 and kmax ≈ N2 as recommended in [25]. In a chromosome of length N + 1, the first gene value at index 0 contains the number of clusters (k) that the partition contains. Gene values of medoids are −1, and gene values of other data objects are the indices of their nearest medoid. The choice of −1 gene value is justified by the fact that no data object has a −1 index, and hence a medoid is easily distinguished from other objects. A representation of a solution is illustrated in Fig. 1, where the clustering part is the non-shaded part of the chromosome. Feature Weights. Feature weights are added as an extension at the end of the chromosome, where its length is equal to the number of features considered in the clustering, and weights are real numbers falling in the range [0, 1]. At the beginning, weights are chosen randomly, such that, in a single chromosome, they should sum up to 1. For example, Fig. 1 shows a chromosome that represents a partition of five data objects into two clusters using three feature weights. In this figure, the number of clusters is stored in the first gene with index 0, objects 3 and 4 are the chosen medoids, objects 1 and 5 belong to the same cluster as the medoid with index 3, and object 2 belongs to the same cluster as the medoid in index 4. The shaded part of the chromosome represents the feature weights for the three features.


Index:   0    1    2    3    4    5    |  w1    w2    w3
Gene:    2    3    4   -1   -1    3    |  0.41  0.23  0.36

Fig. 1. Solution representation
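A small Python sketch (ours) of building and decoding the encoding of Fig. 1:

# Gene 0 holds k; genes 1..N hold -1 for medoids or the index of the nearest medoid;
# the last P genes hold the feature weights.
chromosome = [2, 3, 4, -1, -1, 3] + [0.41, 0.23, 0.36]
N, P = 5, 3

def decode(chrom, N, P):
    k = chrom[0]
    genes = chrom[1:N + 1]
    weights = chrom[N + 1:]
    medoids = [i + 1 for i, g in enumerate(genes) if g == -1]
    members = {m: [i + 1 for i, g in enumerate(genes) if g == m] for m in medoids}
    return k, medoids, members, weights

print(decode(chromosome, N, P))
# (2, [3, 4], {3: [1, 5], 4: [2]}, [0.41, 0.23, 0.36])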

4.3 Genetic Operators Crossover. For the crossover operator, parents transfer some properties to the children. We propose a Join and Split (J&S) crossover [21]. In this crossover, parents randomly pass down the number of medoids they carry to the children. For example, if two parents contain partitions of k1 and k2 clusters, then one child will randomly have k1 clusters, and the other will have k2 clusters. The (k1 + k2 ) cluster medoids are then randomly distributed between the two children, taking into consideration that duplicate medoids are not allowed in the same child. In other words, in case the same medoid appears in both parents, it should appear in both children, which makes the J&S crossover heritable (i.e., respecting the decision made by both parents), which is a desirable feature in crossover [2]. After that, both children will be amended by having their objects reallocated to the nearest clusters’ medoids.

Index:  0    1    2    3    4    5      w1    w2    w3
P1:     2    3    4   -1   -1    3      0.41  0.23  0.36
P2:     3   -1    4    5   -1   -1      0.15  0.52  0.33

Crossover

C1:     3   -1    4   -1   -1    3      (weights of one parent)
C2:     2    5    4    5   -1   -1      (weights of the other parent)

Fig. 2. Crossover example

Regarding the weights part of the chromosome, this part will also be passed randomly to the children. In other words, one child will randomly inherit the weights of parent 1,


while the other child will randomly inherit the weights of parent 2. Figure 2 illustrates a crossover example. Mutation. With a small probability, mutation is applied in a cluster-oriented fashion, where we add or remove a cluster in the solution, updating the number of clusters in the chromosome correspondingly. When a cluster is added, all data objects must be redistributed, whereas in the case of removing a cluster, we only need to redistribute the data objects belonging to the removed cluster to other clusters. For the feature weights part, we apply mutation by subtracting a small value ε from one weight and adding it to another weight [21]. Mutation takes place right after the crossover step in the reproduction process. Figure 3 illustrates a weights mutation example.

Fig. 3. Weights mutation example with ε = 0.01: the weights (0.41, 0.23, 0.36) become (0.42, 0.23, 0.35), while the clustering part of the chromosome is unchanged.
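A minimal sketch of this weights mutation follows; skipping the move when the selected weight is smaller than ε is our assumption for keeping the weights valid.

```java
import java.util.Random;

/** Sketch of the weights mutation: move a small amount ε from one feature weight to another. */
public class WeightsMutation {
    private final double epsilon;

    public WeightsMutation(double epsilon) { this.epsilon = epsilon; }

    public double[] mutate(double[] weights, Random rng) {
        double[] mutated = weights.clone();
        int from = rng.nextInt(mutated.length);
        int to = rng.nextInt(mutated.length);
        if (from == to || mutated[from] < epsilon) {
            return mutated; // nothing to move; weights still sum to 1
        }
        mutated[from] -= epsilon; // e.g. 0.36 -> 0.35 with epsilon = 0.01
        mutated[to]   += epsilon; // e.g. 0.41 -> 0.42, as in Fig. 3
        return mutated;
    }
}
```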

4.4 Selection Strategy

The selection strategy is a scheme followed to select individuals for reproduction. A good selection scheme is one that gives all individuals a chance of reproduction, yet fitter individuals are more likely to be selected [26]. In the literature, various methods have been used in assigning this probability; one of the most commonly used is Roulette Wheel Selection [27]. As its name implies, each individual is given a slice of the roulette wheel, where fitter individuals have bigger slices, and hence a larger chance of selection. According to [26], this stochastic selection scheme and many others have almost the same performance and are not reported to have an effect on the overall result. Our algorithm uses the roulette wheel selection scheme to select individuals for reproduction.

4.5 Objective Function

Recalling the problem definition in Sect. 3, the objective function of our algorithm depends on the quality of the clustering solution. In other words, it is our cluster validity measure.


Our chosen cluster validity measure is the Davies-Bouldin (DB) index [28], since it is non-monotonically increasing with k, and hence it is suitable for a clustering algorithm with a variable number of clusters. It has also been proven to be computationally efficient [29]. The DB index is described in Eq. (1):

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \frac{SC_i + SC_j}{M_{ij}}    (1)

where k is the number of clusters and C_i and C_j are clusters such that i ≠ j. The scatter SC_i is the average distance between objects belonging to C_i and their cluster medoid R_i, while M_{ij} is the distance between the two cluster medoids R_i and R_j. The scatter is obtained as shown in Eq. (2):

SC_i = \frac{1}{|C_i|} \sum_{X \in C_i} d(X, R_i)    (2)

The smaller the DB index, the better the clustering solution. In other words, individuals with a smaller DB value have better fitness. Thus, our fitness function f(z) is as shown in Eq. (3), where DB is the clustering measure given by Eq. (1):

\text{minimize } f(z) = DB    (3)

To measure the distance between two objects X_i, X_j, we use the Euclidean distance (Eq. (4)), which measures the distance over each feature x_l, where l = 1, 2, ..., p and p is the number of features:

d(X_i, X_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2}    (4)
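The sketch below shows how Eqs. (1), (2) and (4) combine into the fitness computation; the data layout (each object as a double[] of feature values) is an assumption for illustration, and a real run would read precomputed distances from the lookup table described later.

```java
import java.util.List;

/** Sketch of the fitness computation from Eqs. (1)-(4); the data layout is assumed. */
public class DaviesBouldinIndex {

    /** Eq. (4): Euclidean distance between two p-dimensional objects. */
    static double distance(double[] xi, double[] xj) {
        double sum = 0.0;
        for (int l = 0; l < xi.length; l++) {
            double diff = xi[l] - xj[l];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Eq. (2): average distance of a cluster's objects to its medoid R_i. */
    static double scatter(List<double[]> cluster, double[] medoid) {
        double sum = 0.0;
        for (double[] x : cluster) sum += distance(x, medoid);
        return sum / cluster.size();
    }

    /** Eq. (1): DB index over k clusters; smaller means better fitness (Eq. (3)). */
    static double dbIndex(List<List<double[]>> clusters, List<double[]> medoids) {
        int k = clusters.size();
        double total = 0.0;
        for (int i = 0; i < k; i++) {
            double worst = 0.0;
            for (int j = 0; j < k; j++) {
                if (i == j) continue;
                double ratio = (scatter(clusters.get(i), medoids.get(i))
                              + scatter(clusters.get(j), medoids.get(j)))
                              / distance(medoids.get(i), medoids.get(j));
                worst = Math.max(worst, ratio);
            }
            total += worst;
        }
        return total / k;
    }
}
```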

5 Experimental Setup

The computational experimentation aims to apply our MAGC algorithm and compare its performance to a non-adaptive genetic clustering version, where all features are assumed to have equal weights. In the following subsections we explain the details of the experimental setup and the implementation environment used to test our approach.

5.1 Data Set

Our algorithm was tested on the CoPhIR test collection, which is the largest Flickr metadata collection that is available for studies on scalable similarity search techniques [30]. It consists of 106 million images processed and extracted from Flickr, with metadata structured as XML files containing a number of standard MPEG-7 image features alongside the Flickr image entries (e.g. title, location, tags, comments, etc.).


5.2 Feature Selection

For the feature selection phase, we followed a manual approach to select a subset of the images' features in the CoPhIR dataset. The selected features were categorized as follows: 1) the upload date, 2) photograph owner's location, 3) photograph title, 4) photograph tags, 5) photograph's location, and 6) photograph's visual descriptors. These features will be referred to hereafter as feature 1 to feature 6, respectively.

5.3 XML Parsing

In order to extract the information related to images that is embedded in XML files, a tool can be used to ease the process of reading the content in-between XML tags. For this purpose, we have used the Java Architecture for XML Binding (JAXB) [31] tool to parse the XMLs. This parser follows a document object model (DOM) approach, which creates a tree of objects that represents the content and hierarchy of data in the document, and which will then be stored in memory. The XML schema of the CoPhIR dataset is used with JAXB to create image objects and the objects within them, in the correct object hierarchy.

5.4 Data Pre-processing

The CoPhIR dataset contains parts of text that are not preprocessed, and hence may introduce noise that can affect the clustering process. We have used the Apache Lucene API [32] to preprocess text fields such as image title and tags. This API is a Java-based, full-featured text search engine library that provides various capabilities to process text. A part of the API is the Analysis package, which provides features to convert text into searchable (or in our case, comparable) tokens. The first step of text pre-processing is removing stop-words, followed by removing non-word characters, and finally stemming the resulting text. After text pre-processing, a lookup table [33] is generated. The lookup table is a symmetric matrix data structure transferred to a text file to store the distances between each image and all other images in the data collection. This is very important to reduce the time of recalculating the distances between images repeatedly. The lookup table will be loaded into memory whenever it is needed during runtime. Each image will contain a distance entry per feature, between it and all other images' features.
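As an illustration of the text pre-processing step, the sketch below tokenizes a text field with Lucene's Analysis package. The specific analyzer shown (EnglishAnalyzer, which combines stop-word removal with Porter stemming) is our assumption, since the exact analyzer chain is not named above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Sketch of text-field pre-processing with Lucene; the analyzer choice is an assumption. */
public class TextPreprocessorSketch {

    // EnglishAnalyzer removes English stop-words and applies Porter stemming.
    private final Analyzer analyzer = new EnglishAnalyzer();

    public List<String> preprocess(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString()); // stemmed, lower-cased, stop-word-free tokens
            }
            stream.end();
        }
        return tokens;
    }
}
```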

6 Architecture Design and Implementation

In this section, we explain the design decisions and implementation of our MAGC. First, we explain the components and their interactions, then introduce an important component of the implementation of the algorithm, the Watchmaker Framework.

6.1 Software Components

Our algorithm is composed of many components that interact with one another in order to produce the final result.


Figure 4 illustrates the software architecture of our MAGC and its components' interactions. The XMLUnmarshaller is a component built on the JAXB Unmarshaller, which is used to transform each XML file into a Java object. First, this component is used to produce a list of objects that represents the dataset. The objects are then updated with the preprocessed text that is produced by the TextPreprocessor. After this step, the list is ready to be used as an input to the Lookup Table Generator. The Lookup Table Generator outputs the results to the Lookup Table store. The Lookup Table Reader fetches the distance values stored in the lookup table into working memory before the start of the algorithm. The next component is based on the Watchmaker Framework, which is explained further in the following subsection.

6.2 The Watchmaker Framework

The Watchmaker framework [34] is an efficient, extensible, object-oriented framework for developing evolutionary/genetic algorithms in Java. It is open source software that provides a variety of ready-to-use operators, such as crossover and mutation implementations for common data types.

Fig. 4. MAGC software architecture

The core component of the framework is the EvolutionEngine, which performs the evolution of the populations. The evolution takes multiple objects as inputs.


These objects are implementations of the already-defined interfaces: the CandidateFactory, FitnessEvaluator, EvolutionaryOperator, and EvolutionObserver. First, an implementation of the CandidateFactory, which is the ChromosomeFactory, is created to randomly initialize the first population of candidates. ClusterEvaluator is our implementation of the FitnessEvaluator interface that utilizes the Davies-Bouldin (DB) index. The evolution engine accepts a pipeline of evolutionary operators, which in our case are the JoinAndSplitCrossover and the ClusterMutation, based on the EvolutionaryOperator interface. Finally, the EvolutionLogger is our implementation of the EvolutionObserver interface, whose core functionality is to output the best candidate of each generation, the value of its fitness, and the average fitness of the complete generation.
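The wiring of these components into the engine might look as follows. The Watchmaker classes (GenerationalEvolutionEngine, EvolutionPipeline, RouletteWheelSelection, Stagnation) belong to the framework, but the constructors of ChromosomeFactory, ClusterEvaluator, JoinAndSplitCrossover, ClusterMutation and EvolutionLogger shown here, and the Chromosome type sketched earlier, are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.uncommons.watchmaker.framework.EvolutionEngine;
import org.uncommons.watchmaker.framework.EvolutionaryOperator;
import org.uncommons.watchmaker.framework.GenerationalEvolutionEngine;
import org.uncommons.watchmaker.framework.operators.EvolutionPipeline;
import org.uncommons.watchmaker.framework.selection.RouletteWheelSelection;
import org.uncommons.watchmaker.framework.termination.Stagnation;

/** Sketch of wiring the MAGC components into a Watchmaker evolution engine. */
public class MagcRunnerSketch {

    public static Chromosome run() {
        // Evolutionary operators applied in sequence: crossover, then mutation.
        List<EvolutionaryOperator<Chromosome>> operators = Arrays.asList(
                new JoinAndSplitCrossover(0.8),   // pc = 0.8 (assumed constructor)
                new ClusterMutation(0.2));        // pm = 0.2 (assumed constructor)

        EvolutionEngine<Chromosome> engine = new GenerationalEvolutionEngine<>(
                new ChromosomeFactory(),          // random initial partitions and weights
                new EvolutionPipeline<>(operators),
                new ClusterEvaluator(),           // Davies-Bouldin fitness (lower is better)
                new RouletteWheelSelection(),
                new Random());

        engine.addEvolutionObserver(new EvolutionLogger()); // logs best/average fitness per generation

        // Population of 100; halt after 10 generations without fitness improvement
        // (naturalFitness = false because a lower DB index is better).
        return engine.evolve(100, 0, new Stagnation(10, false));
    }
}
```

The Stagnation termination condition corresponds directly to the termination rule described in Sect. 7.1, and RouletteWheelSelection implements the selection scheme of Sect. 4.4.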

7 Results and Evaluation

To illustrate the performance of our MAGC, we conducted three experiments. First, we show the results of multiple runs of the algorithm on five different data collection sizes. Second, we compare the results of the algorithm with a non-adaptive genetic clustering algorithm on larger collections of the dataset. Third, we demonstrate some visually-evaluated results of a clustering solution found by MAGC. Finally, we highlight some of the challenges and limitations of this work.

7.1 Multi-run Experiment

In this experiment we have run our MAGC multiple times on a number of non-overlapping collections of the dataset, whose sizes range from 100 to 500 images. The experiments were conducted on a machine with an Intel Core i5 CPU with a clock speed of 1.70 GHz and 4 GB of RAM. We have set the population size to 100, and the crossover and mutation probabilities to pc = 0.8 and pm = 0.2, respectively.

The number of clusters in all individuals ranges between kmin ≈ √(N/8) and kmax ≈ √(N/2). The latter parameter values were chosen by trial (more explanation in Sect. 7.4). The termination condition halts the evolution if no improvement in fitness is observed within 10 consecutive generations. Table 1 lists the best and average fitness values calculated in terms of the DB index over 10 runs for each collection size. Recall that the lower the DB index, the better the result (i.e., the fitness) obtained. From the results in Table 1, we can observe that the fitness value of the best individual slightly decreases as the dataset size increases. This means that the clustering quality becomes slightly worse as the collection size increases. This is expected, since the increase in the number of images in the collection makes the clustering more difficult. On the other hand, the average number of generations remains relatively stable irrespective of the dataset size (ranging from approximately 27 to 40). With respect to processing time, the algorithm is quite fast, with less than 7.5 s needed to process the largest dataset of 500 images. Throughout the experiment, the evolvement of the weights was recorded in order to detect any possible patterns in terms of the assigned feature weights.


Table 1. MAGC multi-run experiment results

Dataset size | Best fitness (DB index) | Average fitness (DB index) | Average no. generations | Processing time (sec.)
100 | 1.07 | 1.57 | 31.70 | 2.62
200 | 1.39 | 1.86 | 38.60 | 3.41
300 | 1.47 | 1.76 | 40.50 | 4.73
400 | 1.54 | 1.86 | 27.00 | 5.04
500 | 1.48 | 1.78 | 35.20 | 7.49

The results, however, were non-conclusive, since it was observed that each run produced different weights. This means that the weights were not inclined towards certain features; i.e., overall, no particular feature(s) can be considered more important in the clustering process as far as the algorithm is concerned. Nevertheless, observing the fittest individuals, it was noticed that the weights for the location (features 2 and 5) and the title (feature 3) were relatively higher than the other weight values. The relation between 10 of the fittest solutions found in multiple runs (sorted ascendingly) and their feature weights is illustrated in Fig. 5.

Fig. 5. MAGC weight values of multiple runs: the weights of features 1–6 for the 10 fittest solutions, plotted against their DB index values (from 1.253 to 1.472)


7.2 Comparing with Non-adaptive Genetic Clustering

In this experiment, we have run the same MAGC on larger dataset sizes that range from 100 to 1500 images per dataset. The algorithm was run 10 times on each dataset. In addition, another version of the algorithm (the non-adaptive genetic clustering algorithm (GCA)), which excludes the adaptation of weights, i.e., all features have equal weights, was run the same number of times on the same datasets. The parameters of the population size, crossover, and mutation rates are all identical to the former experiment. The results of this experiment are listed in Table 2.

Table 2. Results of genetic clustering with and without adaptive weights

Dataset size | MAGC: Best DB ind. | MAGC: Avg DB ind. | MAGC: Avg no. gen. | MAGC: Proc time (sec.) | Non-Adaptive GCA: Best DB ind. | GCA: Avg DB ind. | GCA: Avg no. gen. | GCA: Proc time (sec.)
100 | 1.04 | 1.50 | 35.00 | 1.32 | 1.22 | 1.79 | 18.00 | 1.04
300 | 1.46 | 1.85 | 24.90 | 2.40 | 1.53 | 1.94 | 20.40 | 2.06
500 | 1.40 | 1.76 | 40.10 | 5.04 | 1.56 | 1.97 | 21.40 | 3.70
700 | 1.50 | 1.77 | 47.80 | 13.78 | 1.61 | 1.95 | 20.50 | 9.35
1000 | 1.51 | 1.72 | 62.40 | 27.07 | 1.64 | 1.91 | 24.00 | 14.96
1500 | 1.44 | 1.64 | 73.60 | 56.65 | 1.63 | 1.92 | 29.40 | 25.65
Avg. | 1.39 | 1.71 | 47.30 | 17.71 | 1.53 | 1.91 | 22.30 | 9.46

From the results in Table 2, we can observe that the non-adaptive GCA produced solutions that are worse in fitness (i.e., have a higher DB index) than the MAGC. Observing the overall average in the last row of the table, we can see that our MAGC outperforms the non-adaptive GCA by approximately 10% in terms of the best solution fitness, and by approximately 12% in terms of the average solution fitness. Moreover, as indicated by the smaller number of generations, the non-adaptive version converged faster than the adaptive version, which probably indicates that it has been stuck in a local optimum and cannot improve further.

7.3 Visual Evaluation

To visualize the results of our genetic clustering algorithm, we have chosen one of the solutions that assigned a relatively high weight to visual features (feature 6), and collected some of the images that belonged to the same cluster. Figure 6 displays the images that were collected from three clusters. It is clear that MAGC was able to produce semantically-related image clusters, since images belonging to one cluster had visual similarities. For example, cluster B contains two images of the same child, as well as similar color ranges to the other images.


Fig. 6. Sample images of three clusters A, B, and C

7.4 Challenges and Limitations

The following are some of the challenges and limitations of our work. First, in the assessment of the clustering solutions, we have observed that sometimes there are solutions with empty clusters among the fittest individuals. This required adding a code fragment to detect and remove empty clusters. However, this also had an effect on the minimum number of clusters, which was previously defined as kmin = 2, in case one of the two clusters happened to be empty. Thus, we increased the minimum threshold of the number of clusters to make it depend on the number of objects N (i.e., it was changed from kmin = 2 to kmin ≈ √(N/8)), as explained in Sect. 7.1. Second, our approach slightly altered the values of the weights as part of mutation only. Mutation is usually given a very low probability, which might make the change of weights insignificant in some cases. More experiments are needed in terms of the mutation probability and the amount of change allowed for the weights part of the chromosome, or handling the weights part differently in the crossover operation. This may have a better effect in terms of determining the features that are more important in the clustering solution, if their weights are evolved more significantly. Finally, one limitation of our work is handling the missing values in the dataset. For example, the best solutions found emphasized the weight of the location feature, even though it is not available for all data objects. In addition, our work needs more analysis in terms of interpreting solutions and assessing visual results. The latter is a tedious and time-consuming process when done manually, due to the large size of the dataset. Moreover, some image URLs are inaccessible, which makes it a challenge to visually evaluate these images.


8 Summary and Future Work

Research efforts have been made to manage and categorize data in social media using a variety of approaches, one of which is cluster analysis. The focus on multimedia is due to two main reasons: it is one of the most commonly shared content types in social media, and it has many modalities that can contribute to learning the semantic meanings contained within it. As a result, research efforts are directed towards mining information in social media, with multiple modalities employed in an unsupervised learning model. Evolutionary techniques can be used to optimize the fusion of modalities, aiming to enhance the clustering performance. In this work, we have implemented a Multimodal Adaptive Genetic Clustering (MAGC) algorithm that exploits multimodality in clustering in order to produce semantically-related clusters. Our algorithm optimizes both the number of clusters and the feature weights in order to accomplish this objective. We have extensively tested the algorithm on the CoPhIR dataset. The test results of our adaptive algorithm showed that it performs better than a non-adaptive genetic clustering algorithm and produces semantically related clusters. Our future work will emphasize the adaptation of feature weights by involving them more in the evolutionary process through crossover and mutation. This may improve the cohesiveness of the produced clusters. We will also investigate the scalability of the algorithm by testing it on larger datasets and using other sources of data.

References
1. Nikolopoulos, S., Giannakidou, E., Kompatsiaris, I.: Combining multi-modal features for social media analysis. In: Hoi, S.C.H., Luo, J., Boll, S., Xu, D., Jin, R., King, I. (eds.) Social Media Modeling, pp. 71–96. Springer, London (2011)
2. Hruschka, E., Campello, R., Freitas, A.A.: A survey of evolutionary algorithms for clustering. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 39, 133–155 (2009)
3. Hansen, P., Jaumard, B.: Cluster analysis and mathematical programming. Math. Program. 79, 191–215 (1997)
4. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2003, p. 89. ACM Press, New York (2003)
5. Becker, H., Naaman, M., Gravano, L.: Event identification in social media. In: 12th International Workshop on the Web and Databases (WebDB), Rhode Island, USA (2009)
6. Sheng, W., Liu, X.: A hybrid algorithm for k-medoid clustering of large data sets. In: Evolutionary Computation, CEC2004. IEEE (2004)
7. Liu, Y., Zheng, F., Cai, K., Jiang, B.: Cross-media retrieval method based on temporal-spatial clustering and multimodal fusion. In: 2009 Fourth International Conference on Internet Computing for Science and Engineering, pp. 78–84. IEEE (2009)
8. Sizov, S.: GeoFolk: latent spatial semantics in web 2.0 social media. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining - WSDM 2010, p. 281. ACM Press, New York (2010)
9. Olivares, X., Ciaramita, M., van Zwol, R.: Boosting image retrieval through aggregating search results based on visual annotations. In: Proceedings of the 16th ACM International Conference on Multimedia - MM 2008, p. 189. ACM Press, New York (2008)


10. Aurnhammer, M., Hanappe, P., Steels, L.: Augmenting navigation for collaborative tagging with emergent semantics. In: International Semantic Web Conference (ISWC), pp. 58–71. Springer, Heidelberg (2006)
11. Wu, F., Pai, H.-T., Yan, Y.-F., Chuang, J.: Clustering results of image searches by annotations and visual features. Telemat. Inform. 31, 477–491 (2014)
12. Zhuang, Y., Chiu, D.K.W., Jiang, N., Jiang, G., Wu, Z.: Personalized clustering for social image search results based on integration of multiple features. In: Zhou, S., Zhang, S., Karypis, G. (eds.) Advanced Data Mining and Applications, pp. 78–90. Springer, Heidelberg (2012)
13. Chatzilari, E., Nikolopoulos, S., Patras, I.: Enhancing computer vision using the collective intelligence of social media. In: New Directions in Web Data Management 1, pp. 235–271. Springer, Heidelberg (2011)
14. Giannakidou, E., Kompatsiaris, I.: SEMSOC: semantic, social and content-based clustering in multimedia collaborative tagging systems. In: 2008 IEEE International Conference on Semantic Computing (2008)
15. Lienhart, R., Romberg, S., Hörster, E.: Multilayer pLSA for multimodal image retrieval. In: Proceedings of the ACM International Conference on Image and Video Retrieval, p. 9 (2009)
16. Falkenauer, E.: Genetic Algorithms and Grouping Problems. Wiley, Hoboken (1998)
17. Lu, Y., Lu, S., Fotouhi, F., Deng, Y., Brown, S.: FGKA: a fast genetic k-means clustering algorithm. In: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 622–623 (2004)
18. Ma, P., Chan, K., Yao, X., Chiu, D.K.: An evolutionary clustering algorithm for gene expression microarray data analysis. IEEE Trans. Evol. Comput. 10, 296–314 (2006)
19. Alhenak, L., Hosny, M.: Genetic-frog-leaping algorithm for text document clustering. Comput. Mater. Contin. 61, 1045–1074 (2019)
20. Hosny, M.I., Hinti, L.A., Al-Malak, S.: A co-evolutionary framework for adaptive multidimensional data clustering. Intell. Data Anal. 22, 77–101 (2018)
21. Al-Malak, S., Hosny, M.: A multimodal adaptive genetic clustering algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), pp. 1453–1454, Denver, Colorado. ACM (2016)
22. Dorigo, M.: Optimization, learning and natural algorithms. Ph.D. thesis, Politecnico di Milano, Italy (1992)
23. Piatrik, T., Izquierdo, E.: Subspace clustering of images using ant colony optimisation. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 229–232. IEEE (2009)
24. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Boston (1989)
25. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate analysis. Analysis 97, 1–4 (1979)
26. Goldberg, D.E., Deb, K.: A comparative analysis of selection schemes used in genetic algorithms. Found. Genet. Algorithms 1, 69–93 (1991)
27. De Jong, K.A.: An Analysis of the Behavior of a Class of Genetic Adaptive Systems. Ph.D. thesis, University of Michigan, USA (1975)
28. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1, 224–227 (1979)
29. Petrovic, S.: A comparison between the silhouette index and the Davies-Bouldin index in labelling IDS clusters. In: Proceedings of the 11th Nordic Workshop of Secure IT Systems, pp. 53–64 (2006)
30. Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, R., Piccioli, T., Rabitti, F.: CoPhIR: a test collection for content-based image retrieval. CoRR abs/0905.4627 (2009)
31. JAXB Reference Implementation - Project Kenai
32. Apache Lucene 5.3.1 Documentation


33. Lin, H., Yang, F., Kao, Y.: An efficient GA-based clustering technique. Tamkang J. Sci. 8, 113–122 (2005)
34. The Watchmaker Framework for Evolutionary Computation (evolutionary/genetic algorithms for Java)

Extending CNN Classification Capabilities Using a Novel Feature to Image Transformation (FIT) Algorithm

Ammar S. Salman1(B), Odai S. Salman2, and Garrett E. Katz1

1 Syracuse University, Syracuse, NY 13244, USA
{assalman,gkatz01}@syr.edu
2 Carleton University, Ottawa, ON K1S-5B6, Canada
[email protected]

Abstract. In this work, we developed a novel approach with two main components to process raw time-series and other data forms as images. This includes a feature extraction component that returns 18 Frequency and Amplitude based Series Timed (FAST18) features for each raw input signal. The second component is the Feature to Image Transformation (FIT) algorithm, which generates uniquely coded image representations of any numerical feature sets to be fed to Convolutional Neural Networks (CNNs). The study used two datasets: 1) a behavioral biometrics dataset in the form of time-series signals and 2) an EEG eye-tracker dataset in the form of numerical features. In earlier work, we used FAST18 to extract features from the first dataset. Different classifiers were used, and a Deep Neural Network (DNN) was the best. In this work, we used FIT on the same features and invoked a CNN, which scored 96% accuracy, surpassing the best DNN results. For the second dataset, the FIT with CNN significantly outperformed the DNN, scoring ~90% compared to ~60%. An ablation study was performed to test noise effects on classification, and the results show high tolerance to large noise. Possible extensions include time-series classification, medical signals, and physics experiments where classification is complex and critical.

Keywords: Fingerprint · Biometrics · Spoofing · Feature to Image Transformation FIT · Anti-spoofing protection · CNN stochastic gradient descent optimizer · Ablation · Frequency and Amplitude based Series Timed (FAST18) · CFS

1 Introduction

A novel approach to transform raw signals into images promises to add a major extension to CNN capabilities [1]. We have transformed signals into images by first extracting signal features using the FAST18 algorithm [2, 3], or using ready-made features, and then using our novel FIT algorithm to transform them into coded images. No previous work has used these specialized combinations of the FAST18 or FIT algorithms. We ran tests comparing the CNN, with these added capabilities, against DNN and other classifiers.


The first experiment, reported separately, involves a spoofing detection system using behavioral biometrics [3], which collects data from several sensors prior to any unlocking attempt. The sensors detect a variation between intentional and forced application of the fingerprint. That work used Naïve-Bayes [4], Support Vector Machine (SVM) [5], and Deep Neural Network (DNN) [1] classifiers, and the DNN was the most successful. SVMs came close when Correlation-based Feature Selection (CFS) [6] was used. In this work we have used a CNN for classification of the same data, and of another set featuring eye-print data [7]. The biometrics raw data we have is made of timestamped signals, and the eye-print set has some features that are time dependent. Features were extracted using the FAST18 algorithm developed by one of the authors, as reported in the previous work [3]. The novel method we have developed and tested in this work can effectively transform signals or features into coded images that are optimized for CNN classification. The method differs from other works in terms of transformations, feature extraction, and optimization. Success in building a robust transformation can open the door to extending CNN capabilities toward a general classifier, not restricted to images, where CNNs already excel. Section 2 covers related work, and Sect. 3 presents the methodology. Sections 4 and 5 provide a description of the datasets and extensive tests, including accuracy, ablation and optimizations. We outline our conclusions and future work in Sect. 6.

2 Related Work

There are three main types of related work. Many works transform time-series signals into images through amplitude correlations, and then apply a CNN [8, 9]. The second type splits the signal into amplitude-based and frequency-based content, then extracts features of each part in a format suitable for a CNN [10]. The third type projects the signal into a 2D format and extracts features from the images by mapping the pixels, recording the information as features for a non-CNN learner (SVM) [11]. Finally, there are review studies of the various Time-Series Classification (TSC) works and their observations about the University of California Riverside (UCR) archive. None of these approaches addresses our strategy of creating coded images with maximum contrast between classes; hence the work is truly novel in using the FIT and feature extraction algorithms for CNN use.

Hatami et al. [8] used Recurrence Plots to transform time-series into 2D texture images and then invoked a CNN classifier. They reported competitive results on the UCR archive compared to deep architectures and state-of-the-art TSC algorithms.

Wang and Oates [9] proposed a novel framework to encode time series data as different types of images, namely Gramian Angular Fields (GAF) and Markov Transition Fields (MTF). Using a polar coordinate system, GAF images are represented as a matrix with elements given by the trigonometric sum of different time intervals. MTF images represent the first-order Markov transition probability along one dimension and temporal dependency along the other. They used Tiled CNNs on 12 standard datasets to learn high-level features from individual GAF, MTF, and combined GAF-MTF images. They reported results competitive with five state-of-the-art approaches.


Cui et al. [10] proposed a Multi-Scale CNN (MCNN) method, which incorporates feature extraction and classification in a single framework with a multi-branch layer and learnable convolutional layers. MCNN extracts features at different scales and frequencies. They made an empirical evaluation with various methods and benchmark datasets and claim that MCNN showed superior accuracy. Comparing an ordinary CNN with their method, they claim their system fared better, but they did not specify what CNN or what feature extraction method was used for the ordinary CNN, so their conclusions are not solidly tested.

The approach of Azad et al. [11] transforms 1-D signals into 2-D grayscale images and extracts features by taking the pixel values to calculate and present energy as a gray image. They normalize by measuring the signals within time intervals, and use Empirical Mode Decomposition to remove low-frequency noise. They used the Segmentation-based Fractal Texture Analysis (SFTA) algorithm to create the feature vectors, and an SVM for classification. They claim their method preserves more information from the 1D signal compared to the correlation losses of other methods. Their accuracy is 88.57%, but they do not explain how their signal mapping preserves more correlations.

Dau et al. [12] review various TSC works and observations about the UCR archive sets. The archive of Chen et al. [13] contains standard datasets with an exact form or time duration for any class signal. The sets are rigidly constructed and divided into train and test sets that cannot be changed. They do not represent real-life data but are exact, with zero noise, for use in benchmark testing. The high achievements for some signals do not reflect a reliable measure of a method's strength or robustness in real applications, but they can produce fair and reasonable rankings since they use exactly the same standard.

Hu et al. [14] challenge the assumption in TSC that the beginning and ending points of the signal or pattern can be correctly identified during training or testing, which can lead to over-optimism about the algorithms' performance. They show that correctly extracting individual gait cycles, heartbeats, gestures, and behaviors is more difficult than the classification itself. They propose a solution by introducing an alignment-free TSC that requires only weakly annotated data. They claim that extending machine learning data editing to streaming/continuous data enabled building robust, fast and accurate classifiers. By testing real-world problems, they claim their framework is faster and more accurate than other solutions. Their work is related to our method in terms of testing on real-world data, in the sense that our method addresses their concerns implicitly through the contrasted transformed images.

3 Methodology

3.1 System Layout

Figure 1 shows the main configuration of the experiment scope. The first stage is data collection; we have used the same data as provided in references [3] and [7]. Feature extraction can be accomplished by many means, and for the biometric data the FAST18 algorithm was used [2]. The FIT Algorithm and its trainable parameters are described in Sect. 3.2.


Fig. 1. General system layout

The CNN receives the transformed data; learning and optimization are then incorporated in the interface between the FIT and the CNN. In the following we provide some details of the various pipeline steps.

3.2 Features to Images Transformation (FIT)

The developed FIT algorithm has parameters that can be tuned for each dataset to maximize the features' contrast in the generated images. It transforms the signal features' values into gray-scale spatial gradients with different angles based on their amplitudes and exponents (Fig. 2). "Gradient" here refers to a gradual change of brightness within a 2D image from light to dark. The rotation hyper-parameters can be optimized based on the dataset. The constructed images are then fed to the CNN for classification. Results during the learning stage can be used to calibrate the parameters of the FIT algorithm. The Stochastic Gradient Descent with Momentum (SGDM) optimizer is used as part of the CNN. In addition, we studied the accuracy dependence on the rotation hyper-parameters to get the best settings.

Mapping a machine problem for efficient use with CNNs requires the generation of an image that effectively represents features as a valid input. For this reason, a unique representation of features is highly valued. To achieve uniqueness, one must identify the parameters that help distinguish features from each other. Once each feature has its own unique representation, the features are distributed on an N × N grid image. This way, the neural network can maximize learning of feature-to-class correlations. Moreover, since these features are grouped onto one grid, and given the process of a CNN passing a filter along the whole image, the filters will also learn some feature-to-feature correlations (e.g. when a filter covers portions of several features).

Computing Feature Map Generation Parameters. The first step in mapping features to images is to represent the feature value X using m significant digits and an exponent. We used m = 3, which provides all needed variations. Setting m to other values could be needed for complex datasets, but m = 3 covers a sufficient range of variations for our purposes.


We write a number X with three significant digits and an exponent in the form X = (ddd) × 10^p (e.g., 45905 → 459 × 10^2). This representation is important to generate a unique combination of feature intensity I in the range [0–255] and rotation R in degrees (range ±p[0–360]). I and R are calculated using

I(x) = \begin{cases} 255 - (x \bmod 255), & x > 255 \\ x, & x \le 255 \end{cases}    (1)

R(x, p) = \begin{cases} \,(p + \frac{x - 255}{1000}), & x > 255 \\ p, & x \le 255 \end{cases}    (2)

(3)

as the multiplier of the cell border intensity. In this setting we have three defining quantized parameters with a wide range of values. I takes 0–255 values, R values are coarse multipliers of p, plus some fine sub range, and N f takes 0–255 values. The number of unique representations of micro images is 2 × range N f × range I × coarse R. In our settings the maximum is around 130,000p. The coding has a wide range of possibilities and can be extended if needed.  can be optimized to maximize classification accuracy, and that generally requires the training and testing results on the dataset.

Fig. 2. Example feature-vector representation

Generating the Features Image. Figure 2 shows how I and R are used to generate a feature micro image. It generates a rectangle that has a rotation of R degrees, intensity I in the range [0–255], and a border with a normalized intensity. Other parameters that add valuable information are the signs of the feature itself, and the rotation. The feature sign is added as a character to the top-left corner of the micro image, and the rotation negative sign applies a color inversion to the generated feature map and this doubles the number of possible representations. A feature matrix for the instance is created by concatenating the micro images.

Extending CNN Classification Capabilities

203

3.3 Designing a CNN Given that the generated input differs from one dataset to another, we considered designing a neural network architecture that is adaptive to such variations. Figure 3 shows the CNN network architecture. It consists of 5 convolutional layers, each followed by a maxpooling layer of stride of “(1, 1)” and a ReLu.

Fig. 3. A CNN architecture to classify generated images

We used a working empirical formula that was established from extensive optimizations. This is valid for datasets having fpr less than or equal to 12. For the input layer, the filter size is set to [3 × round (12/fpr)]2 (where fpr = number of features per row) and a stride of (1, 1) × round (12/fpr). As for the remaining layers, the filter size should be [2 × round (12/fpr)]2 and always having a stride of (1, 1). The optimum CNN has a constant learning rate of 0.001, mini-batch size of 64, and an SGDM optimizer. We have tested and recorded results for several mini-batch sizes of {16, 32, 64} with number of epochs {3, 5,1 0} respectively and two values for #filters/layer of {32, 64}. An epoch is the cycle of iterations needed to complete the dataset training, and an iteration is the process of feed-forward/backpropagation needed to adjust weights using a mini batch. Figure 2 shows also the feature matrix for an instance. The matrix dimension dim = √ Int ( #F + 1) where #F is the number of features. In our case #F = 126, hence dim = 12. the excess matrix elements are blackened. 2D or 3D representations provide more flexibility for correlations between the features. We decided to use 2D for simplicity. 3.4 Optimizations The parameter  is a multiplier within the rotation angle of the generated gradient. From the definition, it does not have a preferred value for the best unique representation of features, hence we do not expect it to have large influence on accuracy. In that sense our algorithm constructs a well-behaved form that preserves and adds more correlations for the features. We do not expect to need optimizations beyond the standard CNN’s SGDM [15].

204

A. S. Salman et al.

Studying Error Dependence on . This method requires selecting a number of values for  and, on each one, training a CNN to obtain the final accuracy. There is generally no function form for error vs.  available priori; there is no function indeed in our case, and hence optimization requires constructing the function. Therefore, we need to run the CNN first with adequate cross-validation for  values from (0–2π], then we plot the error vs . If  is critical for data conversion, the initial functions construction will show responses with a wide error range. If it has no significance beyond statistical fluctuations, the CNN will optimize despite the initial value. In this sense measuring the dependence adds another layer for optimization when conversion is sensitive to .

3.5 Training, Configurations, and Ablation Settings We used SGD with Momentum (SGDM). We have built and tested 6 configurations for the number of filters/layer and mini-batches. The 4-fold Cross Validation (CV) datasets are split into 25% for Test1, the other 75% for Training, and half of Test1 for Test2. For 10-fold CV the Test is 10% and the other 90% is for Training; then repeat the run 10 times covering all portions for testing. Standard errors are generally small, and their training errors asymptotically decay to zero. We have applied the same settings with random shuffling of the dataset elements for the various runs covering the configurations, or the ablation studies. The number of iterations used are well beyond saturation, and as it turned out, most near optimum configurations needed around 60 iterations, while the weak ones required over a hundred. For the ablation studies we have three types of noise on the constructed images; Gaussian; salt and pepper, and speckle. In addition, we have applied different levels of Gaussian noise on the numerical features before using the FIT.

4 Datasets 4.1 Biometrics Data Table 1 [3] shows the biometric dataset description. A total of 6 runs with nearly 100 instances each represented a reasonable range of values, and statistical samples for training and testing the classifiers. The application we developed runs on Android devices and collects the sensors data, when a fingerprint push is made on the scanner. The data is collected during the time span when the system is processing the authenticity of the fingerprint. We have the original raw signals and features as constructed by our feature extraction algorithm which is designed to provide highly contrasted features.


Table 1. Biometric dataset: run codes and settings (events from runs 1, 2 and 4 were used). A total of 426 instances and 18 features/sensor. Class = 1 intentional; class = 0 forced. Pressing force ratio F01 = F0/F1; Fx = force for class x. Number of features 126. Composition for all runs: 50% class = 0 and 50% class = 1. From [3].

Run code | Type | F0/F1 | Configuration
R1 | Classification | 2.5/1 | CP1: two different fixed pushes for the two classes to optimize contrast between the two cases for most sensors
R2 | Reproduction of R1 | 2.5/1 | CP1
R3 | Classification | 2/1 | CP2: two different pushes for the two classes; push changes slightly during the application for each class; the push ratio is slightly less than CP1
R4 | Classification | 1.5/1 | CP3: two different pushes for the two classes; push changes during the application for each class; the push ratio is a little less than R3
R01 | Calibration | 1/1 | CP4: push and positioning are fixed for all instances, but labeling is 50% class = 1 and 50% class = 0
R02 | Calibration | 2/1 | CP5: slightly different positioning and different pushes for the two classes

4.2 EEG Eye State Dataset

The second dataset has a large number of instances (Roesler 2019) [7], and the purpose is to classify the eye state. The same procedure was applied and achieved similar CNN accuracy results. Table 2 shows the dataset description. The data comes from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 s. The eye state was detected via a camera during the EEG measurement and added manually to the file later, after analyzing the video frames. '1' indicates eye-closed and '0' eye-open.

Table 2. The EEG Eye State Dataset consists of ~15k instances. The number of attributes = 15 (14 for features and 1 for the class). From [7].

Property | Description
Dataset characteristics | Multivariate, Sequential, Time-Series
Attribute characteristics | Integer, Real
Associated tasks | Classification, two classes
Number of instances | 14980
Number of features | 15; matrix of dimension 4
Date donated | 2013-06-10


5 Results

5.1 Data Transformation Results

Figure 4 shows an instance demonstration from the biometric dataset. A feature vector of 126 features was created by extracting and concatenating FAST18 features from seven different sensors in the form of time-series readings. The vector was transformed by the FIT algorithm into a coded image matrix that was fed to a trained CNN, which classified it as not forced.

Fig. 4. Instance demonstration: the image representation and the CNN classification from the biometric dataset

5.2 Six Different Hyper-parameter Configurations

We ran experiments with six hyper-parameter configurations (variable filters/layer and mini-batch settings). Figure 5 shows the learning curves (classification error vs. iteration). The first two configurations are not saturated, while the other four are. All six configurations used exactly the same set of data and reshuffling, reflecting the configuration and statistical contributions only. The results are generally good, and saturation is reached fairly early. These experiments were applied only to the fingerprint biometric dataset, given its small size and the nature of its instances. Each experiment was run using 10-fold cross validation. The EEG dataset was tested using a single configuration (1024 batch size and 32 filters, Sect. 5.4).

5.3 Ablation Study

After training, we added noise to the data. Figure 6 shows the impact of four types of noise, with training on noiseless data only. Data-noise refers to noise applied on the numerical features before the FIT is used to generate images, while the other three were applied on the images after transformation. The degradation is about 5% for noise on the signal, while on the images it went up to 20–30%. These effects do not seriously impact the robustness of the main algorithm, because what is critical is noise on the signals.


Fig. 5. Classification accuracy percentage (mean ± SD) for the 6 configurations on the biometric dataset (10-fold cross validation, learning rate = 0.001, train = 90%, test = 10%). The results are stable and the variation between them is within the statistical error. A fold is 10% of the 426-instance dataset.

Fig. 6. Training on noiseless data, 64/32 config. 10-Fold CV


Figure 7 shows the results after applying the same four noise types, with training on both noiseless and noisy data. The noise training helps accuracy compared to Fig. 6, with little improvement for data-noise and the largest improvements for the image noise types.

Fig. 7. Training on noiseless and noisy data. 64/64 config. 10-Fold CV

On the other hand, it is important to note the drop in standard deviation from Fig. 6 to Fig. 7. This indicates a significant increase in classification precision after including noisy data in the training. Therefore, we can conclude that the results are consistent with applying training on noise, and while it did not dramatically affect the accuracy mean on noiseless data, it certainly improved precision. Figure 8 shows the results after applying the same four noise types and testing their effects individually. Four training and testing sessions were conducted, one for each noise type. Data-noise had the least effect on accuracy since the SNR was high enough and the CNN was trained only on noisy instances. As for the image noise types, the results are consistent with the Fig. 7 experiment.

Fig. 8. Training on noisy data only, 64/32 config. 4-Fold CV


We have also made a fourth test by training on a mixed set of all noise types and the original. The accuracy dropped more than in the case of training on a single noise type. Noiseless data classification was also affected negatively, suggesting the complexity of using many noise types at once in the training. We should note that the original signals are real-life measurements and carry certain amounts of error with them, which was more pronounced in this test. Figure 9 shows the classification error vs. the noise-to-signal ratio for data-noise, where training is made on noiseless data only. The error increase resembles a log function and nearly saturates at around 25% when the ratio reaches about 20%. This is impressive, because the global structure of the image can still be seen even if the noise ratio is 1. This demonstrates the power of vision vs. numbers.

Fig. 9. Error vs. noise to signal ratio. The noise is applied to the original signal then the test is made. Training is on signal only.

5.4 Testing EEG Eye State Dataset

Figure 10 shows the results of the EEG Eye State dataset. We have selected near-optimal settings, where the learning curve was stable after 60 iterations. The standard deviation is smaller as a result of the larger number of instances. The end results are comparable with the biometric data, taking into account the provided feature content of the dataset. The signals of this set are not available, which did not allow us to see the impact of applying our FAST18 on the EEG for the DNN. The point is that the FIT algorithm has provided good contrast, yielding high CNN accuracy for both datasets. On the other hand, the features of the EEG set are not as well separated, and the DNN suffered, while the CNN kept almost the same accuracy. The FIT algorithm provided enough contrast through the image representation. Figure 11 shows the scatter of the feature values for both sets.


Fig. 10. CNN classification accuracy percentage (mean ± SD) for the EEG eye state dataset; 1024/32 configuration, 4-fold CV, train = 75%, test = 25%. DNN accuracy mean: 57.7% ± 1.8

Fig. 11. (top) Feature values for the two classes of the biometric data are well separated due to the good FE algorithm. (bottom) Normalized feature values for the two classes of the EEG data are not as well separated, resulting in low accuracy for the DNN, while there is no negative effect on the CNN due to the FIT solution.

5.5 Testing Dependence on 

We made a number of test runs as outlined below. Our initial estimation was that the error vs.  curve is almost flat; there is little to expect from changing . The FIT algorithm is designed to give a unique representation for all features due to the versatility of the transformation variables, independent of . We sought to confirm this empirically, and to ensure stability, we made a 4-fold cross-validation CNN run using 12 different  points covering the range (0–2π]. Figure 12 shows the error vs. . The results are consistent with the assumption, and the curve indicates that accuracy is not sensitive to .


The error range goes from 8% to 12%, and the data statistical error is around 3%. The variations mainly reflect statistical errors. This leads to the conclusion that we can calibrate the behavior and select a stable minimum. Since we have to perform these calibrations before the final experiments, it seems natural to always plot such a curve and select the best operating range. On the other hand, if the angular parameter is critically unstable for the operation, and the variation impact is significantly beyond the statistical range, we can always extend the calibration to operate over stable regions only. Our FIT algorithm is powerful, reliable, stable with minimum variations, and that is good.

Fig. 12. Error vs. . For the biometric CNN data. The variations are within sigma. We used  = 1 rad that shows the least error. From 4 fold CV.

6 Discussion and Conclusions

In summary, we have utilized CNNs to classify signals that are otherwise hard to classify with CNNs directly. The main point is to transform raw data signals into featured images. The CNN is highly capable of image classification; hence, by doing the transformation, we utilized the best of the CNN. The FIT algorithm has some tunable parameters in the case of data dependence. The system learns to fit the best parameter values for the image transformation, and from there the CNN takes the best classification path. The ability of the CNN system to achieve comparable accuracies for two different datasets, using different feature extraction methods that had an extreme impact on the DNN classification, suggests that the FIT algorithm is stable and powerful. This means that, in principle, we can take variable real-life datasets (it is better, though not critical, to use an efficient feature extraction algorithm), apply the FIT algorithm, and the CNN will perform well. In that sense, we think testing many diverse datasets with a large number of classes is needed to find the limits of the procedure. In the reported tests, the accuracy limit reflects the nature of the data more than the system accuracy. The comparison shows that time and resources are fair compared to other methods, which is an additional point for the FIT-CNN utilization.

The ablation studies showed that the method is not sensitive to a wide range of noise levels, and that is good. This is mainly because the image represents a global structure, and a small variation does not change the overall indicators.


This is a very strong point for extending the process to the maximum limits. It supports our main vision that image processing can make recognition much easier and more reliable. The main task is how to get the transformation that gets the best out of the CNN and SGDM. Applying different noise levels over the coded images is more complicated, but not critical; what usually matters is the noise on the signals. Still, this sensitivity can be utilized to create augmented sets for small datasets without the need to generate noisy copies of the originals. One only needs to add images with a little noise that does not harm the selection efficiency, and a statistical enhancement can be provided.

We are working on extending the testing to other diverse datasets. It is natural to extend it to data with multiple classes. In principle, we do not anticipate this would be a problem, but we would rather test the limits of the FIT algorithm's uniqueness capabilities. The FIT algorithm can handle very large fluctuations while still providing a unique representation. The maximum number of unique representations for micro images is about 1.3 × 10^5 · p, where p is the exponent power in the 3-digit representation. The combination with other features' micro images extends the limit further. Theoretically the limit is not exhaustible, and we expect the practical limit to remain high, with the ability to manage feature classes in the thousands, with stable capabilities, through the 2D capabilities of the CNN. We expect applications with good success to a wide range of signal-based feature extraction processes, and our transformation algorithm can provide maximum uniqueness for instance representations. The configuration of features as images provides an advantage for the CNN compared to numbers only for other deep learners. This work shows the capacity of our novel FIT algorithm to extend CNN applications to cover different data types, including time-series classification, medical signals, high energy physics experiments, big data mining, and speech recognition, where classification is complex.

We conclude with some notes. In separate work [2] we have shown that FIT + CNN compares favorably with many other classifiers, such as DNN, applied to non-temporal features extracted from the raw data. In future work, FIT could be compared with other models, such as Long Short-Term Memory (LSTM), that are designed to operate on raw time-series data rather than extracted features. However, there is no mathematical limitation of FIT to time-series data, so it could also be tested on other types of datasets to empirically demonstrate its generality. We hypothesize that on other types of data, FIT will be similarly insensitive to parameter tuning, as we observed here, and that could be verified empirically as well.

Acknowledgments. This work was thoroughly and critically reviewed and evaluated, and the manuscript corrected, by Professor Salman M Salman from Alquds University.

References
1. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015). https://doi.org/10.1016/j.neunet.2014.09.003
2. Salman, A.S., Salman, O.S.: Spoofed/unintentional fingerprint detection using secondary biometric features. In: SAI Computing Conference, London (2020)


3. Salman, O., Jary, C.: Frequency and amplitude based series timed signals 18 features extraction algorithm (FAST18), pattern classification project report. SCE Carleton University, Spring 2018
4. Rish, I.: An empirical study of the Naive Bayes classifier. T.J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, NY 10532 (2001)
5. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). https://doi.org/10.1023/A:1022627411411
6. Hall, M.A.: Correlation-based feature selection for machine learning. Ph.D. thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand (1999)
7. Roesler, O.: EEG Eye State Dataset. Baden-Wuerttemberg Cooperative State University (DHBW), Stuttgart (2013)
8. Hatami, N., Gavet, Y., Debayle, J.: Classification of time-series images using deep convolutional neural networks. In: Proceedings of the Tenth International Conference on Machine Vision. International Society for Optics and Photonics, Vienna (2017). https://doi.org/10.1117/12.2309486
9. Wang, Z., Oates, T.: Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In: Trajectory-Based Behavior Analytics: AAAI Workshop 2015 (2015)
10. Cui, Z., Chen, W., Chen, Y.: Multi-scale convolutional neural networks for time series classification (2016). arXiv preprint arXiv:1603.06995 [cs.CV]. Cornell University Library, Ithaca, NY
11. Azad, M., Khaled, F., Pavel, M.I.: A novel approach to classify and convert 1D signal to 2D greyscale image implementing support vector machine and imperial mode decomposition algorithm. Int. J. Adv. Res. (IJAR) 7(1), 328–335 (2019). https://doi.org/10.21474/IJAR01/8331
12. Dau, H.A., Bagnall, A., Kamgar, K., Yeh, C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, S.A., Keogh, E.: The UCR time series archive (2018). arXiv preprint arXiv:1810.07758 [cs.LG]. Cornell University Library, Ithaca, NY
13. Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G.: UCR Time Series Classification Archive (2015). www.cs.ucr.edu/~eamonn/time_series_data/
14. Hu, B., Chen, Y., Keogh, E.: Time series classification under more realistic assumptions. In: Proceedings of the 2013 SIAM International Conference on Data Mining, Austin, Texas (2013). https://doi.org/10.1137/1.9781611972832.64
15. Bergen, K., Chavez, K., Ioannidis, A., Schmit, S.: Distributed Algorithms and Optimization. CME-323, Stanford Lecture Notes, Institute for Computational & Mathematical Engineering (ICME), Stanford University, CA (2015)

MESRS: Models Ensemble Speech Recognition System Ben Zagagy(B) and Maya Herman Department of Mathematics and Computer Science, The Open University of Israel, Ra’anana, Israel [email protected], [email protected]

Abstract. Speech recognition (SR) technology is used to recognize spoken words and phonemes recorded in audio and video files. This paper presents a novel method for SR, based on our implementation of an ensemble of multiple deep learning (DL) models with different architectures. Contrary to standard SR systems, we ensemble the most commonly used DL architectures, followed by dynamic weighted averages, in order to classify audio clips correctly. Model training is performed by converting audio signals from the audio space into the image space. We use the converted images as training input for the models. This way, most of the default parameters originally used for training models on images can also be used for training our models. We show that the combination of space conversion and model ensembling can achieve high accuracy results. This paper has two main objectives. The first is to represent a new trend by expanding the DL process to the audio space. The second is to present a new platform for deep learning model ensembles using weighted averages. Previous works in this field tend to stay in the comfort zone of a single DL architecture, fine-tuned to capture all edge cases. We show that applying a dynamic weighted average over multiple architectures can improve the final classification results significantly. Since models that classify high pitch audio might not be as good at classifying low pitch audio and vice versa, we harness the capabilities of multiple architectures in order to handle the various edge cases. Keywords: Data mining · Deep learning · Ensemble classifier · Speech recognition

1 Introduction Speech recognition is the ability of a device or program to recognize words in spoken language and convert them into text. The most common speech recognition applications include turning speech into text, voice dialing, and voice search. Although some of these applications work properly for end users, improvement is still needed in several aspects: speech is sometimes difficult to detect due to variations in pronunciation, speech recognition performs poorly for most non-English languages, and background noise must be filtered. All these factors can lead to inaccuracies; hence speech recognition is still an interesting area of research.


The goal of identifying voice commands is to enable people to communicate naturally and efficiently with machines as well as with other people who do not speak the same language. A voice recognition process, also known as Automatic Speech Recognition (ASR), is a process of converting voice signals into words. By using fully connected deep neural networks (DNN), deep learning methods were successful in facing the problem of speech recognition [1–11]. As shown in [12] and [10], over the past few years, DNN-based systems were used by multiple research groups and provided speech recognition solutions with significantly higher accuracy for continuous phone and word recognition than the earlier GMM-based systems [13] could provide. Advanced deep learning techniques that were successfully applied to the speech recognition problem include recurrent neural networks (RNN) [5, 8, 11, 12] and convolutional neural networks (CNN) [9, 14–16]. In recent years, a new cutting-edge method called "Deep Learning" has emerged and made a huge impact upon the entire Computer Vision community. Since they first appeared, deep learning, deep neural networks and mixed neural networks have achieved results with high quality and accuracy in many computer vision problems, even those research teams considered the most difficult to crack. Deep learning is, roughly, a type of neural network consisting of several layers. When applied to computer vision problems, such networks are capable of automatically finding a set of highly expressive features. Based on empirical results, in many cases these features have been shown to be better than those produced manually for solving the given problem. This methodology has another great benefit – there is no need for researchers to manually design these features, since the network is responsible for that. In addition, the features learned by the deep neural networks can be considerably abstracted. The most effective way to train a deep neural network model on new input is to take an existing network, with weights already trained on its original training files, and retrain it on the new input. This approach is called Transfer Learning [5]. In the groundbreaking article "CNN Features off-the-shelf: an Astounding Baseline for Recognition", published by researchers at Stockholm University, it has been shown that a network that has learned to classify a particular problem can undergo a simple basic conversion process and adapt to a different classification problem. This approach is successful thanks to the fact that a deep neural network contains many layers of abstraction and modeling for the received input. These layers focus on different and partial features of the original image. These features are determined by the architecture of each network and are not always visible to the human eye. Focusing on these features helps the network to improve the quality of the classification it returns. The prevalent approach among deep learning researchers, reinforced, among other things, in the above article, is that abstraction in the various layers of the neural network is generic and similar for problems from other spaces as well.


The implication for the system suggested in this paper is that deep neural networks trained for purposes other than voice signal classification can be used to classify voice signals in the audio space. The purpose of this paper is to present an effective method for speech recognition with an ensemble deep learning technique. Contrary to the basic approach presented in the literature, where one highly tuned DNN is trained for the purpose of classifying all spoken words, we train multiple deep learning neural networks with various architectures in order to cover the different edge cases, utilizing the different architectures' built-in properties. Eventually the system classifies a specific audio clip with a weighted average performed on top of the trained networks' classifications, in order to retrieve the most likely classification result. An implementation that considers the classifications of multiple architectures instead of just one is more robust. Such an implementation can cover multiple edge cases which a single architecture cannot cover (since models that classify high pitch audio might not be as good at classifying low pitch audio and vice versa). The paper is organized as follows:
• Section 2 overviews our methodology for solving the problem.
• Section 3 provides concrete platform implementation details.
• Section 4 presents results of a real ensemble of deep learning models trained for classifying human speech.
• Section 5 provides conclusions and a discussion of future improvements and extensions.

2 Methodology The developed system consists of two core components: an offline batch process for training the system and a real-time online process for producing test data output. The offline batch process is responsible for data retrieval and data preprocessing, including converting the signals into images in the spectral space. Deep learning models are then applied to find the most promising models. Finally, the models are deployed to a shared location, to be used during the online process. 2.1 Offline Process The offline process is responsible for generating and storing deep learning neural networks with various architectures, to be used during the online process. The offline flow shown in Fig. 1 describes data retrieval from audio files in the sound space. Then, the audio data is preprocessed into data in the image space in the form of Mel Spectrograms (as shown in Fig. 2), to be loaded as input for the deep learning models' training mechanism. The deep learning training mechanism is responsible for generating the output of the offline batch by creating a series of deep learning models, to be used later by the online flow. The offline batch process can (and should) be executed on a separate powerful computer, regardless of the machine on which the online process will be executed.


Fig. 1. The offline process flow

Fig. 2. An example of the Mel Spectrogram image generated from the word "Down"

Data Source. The data sources that were used for the system contain more than 60,000 audio files, each containing different words spoken by different people. All files are labeled according to the spoken words within them. The labeled audio folders contain 1 s clips of voice commands, with the folder name being the label of the audio clip. The folders contain 31 words, including: "bed", "bird", "cat", "dog", "down", "eight", "five", "four", "go", "happy", "house", "left", "marvin", "nine", "no", "off", "on", "one", "right", "seven", "sheila", "six", "stop", "three", "tree", "two", "up", "wow", "yes", "zero". This data source was taken from TensorFlow's speech commands data [17] and was collected using crowdsourcing [18].


The files contained in the training audio are not uniquely named across labels but are unique within their labeled folder. Data Preprocessing. Prior to the training phase for the labeled input data, or to the classification phase for non-labeled data, a conversion is performed. We convert the audio files' content into the image space. The conversion generates Mel Spectrogram images out of the audio clips and uses the open source Python library LibROSA [19]. Deep Learning. Deep learning neural network models are created using the preprocessed data; their classification output will be used during the online phase. The better and more accurately the system is trained on the training files, the more accurate the system's classification of the input data during the online phase will be.
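The paper does not state the exact conversion settings; the minimal sketch below shows how such a wav-to-Mel-spectrogram-image conversion could look with LibROSA [19]. The sampling rate, number of Mel bands, colormap and file paths are illustrative assumptions, not the authors' parameters.

```python
import librosa
import numpy as np
import matplotlib.pyplot as plt

def wav_to_mel_image(wav_path, png_path, sr=16000, n_mels=64):
    """Convert a short speech clip into a Mel spectrogram image file."""
    # Load the clip; Speech Commands recordings are 16 kHz mono wav files.
    y, sr = librosa.load(wav_path, sr=sr)
    # Compute the Mel-scaled power spectrogram and move it to a log (dB) scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Persist the spectrogram as an image so it can be fed to image-oriented CNNs.
    plt.imsave(png_path, mel_db, origin="lower", cmap="magma")
    return mel_db

# Hypothetical usage on one labeled training clip:
# wav_to_mel_image("train/down/0a7c2a8d_nohash_0.wav", "train_img/down/0a7c2a8d_nohash_0.png")
```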

2.2 Online Process The online process is responsible for classifying the content of a given audio input (in other words – returning the output classification of our speech recognition system). The online flow shown in Fig. 3 describes data retrieval and setting the weight and impact of each of the models produced during the offline stage. Then it describes data preprocessing, including conversion of the given input from the audio space to the image space. Each of the deep learning models generated during the offline phase produces its own classification for the given input; the majority voting component then performs a weighted average over the identification results obtained, using the given models' weights, thus generating a single classification for a given audio input. Note that the online phase is performed "live" on the user's machine and is not machine dependent. Data Source. The online process receives two main inputs and data sources. The first input is the weights given to each of the models trained during the offline phase (the weights sum up to 100). These weights will be used during the majority voting stage for creating a weighted average. The second input is an audio clip that the system is required to classify. Data Preprocessing. Prior to classification of the given input audio clip, a conversion is performed, similar to the conversion performed during the offline process, when audio files from the audio space were converted into the image space in the form of Mel Spectrograms. Majority Voting. This component is responsible for producing the system's final classification. The final system classification is made following the performance of a weighted average on top of the system's models according to their configured weights. Since the system contains more than one neural network model, and since different models can produce different classifications, getting a single classification for a specific input audio file requires calculating a weighted average of the classification results produced by the various models in the system. The calculation considers the different weights (percentages) of the various neural network models when setting the system's final classification of a given audio input file.


Fig. 3. The online process flows

3 Implementation Our speech recognition system relies on several modules. There was a need to separate the offline and the online processes, so they can run on different machines at different times. Most of our code was written in Python, as it has multiple free and easy-to-use audio-to-image conversion libraries. The same goes for the deep learning neural network frameworks used for training models. We chose to use LibROSA [19] as our audio conversion library and PyTorch [20] (by Facebook) as our deep learning framework. Using PyTorch we created six deep learning neural network models, based on the following deep learning architectures: DenseNet [21], ResNet [22], SeNet [23], VGG [24].
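The exact layer configurations of the six models are not given in the paper; the sketch below only illustrates the general PyTorch pattern of adapting a standard image-classification backbone (a torchvision ResNet is used here as one illustrative choice) to single-channel Mel spectrogram inputs and the twelve output labels used by the system. The class name and input sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

class SpeechCommandNet(nn.Module):
    """Wrap an image-classification backbone for 12-way speech command classification."""
    def __init__(self, num_classes=12):
        super().__init__()
        backbone = models.resnet18(pretrained=False)
        # Mel spectrogram "images" have a single channel, so replace the first convolution.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Replace the final fully connected layer with a 12-class head
        # (Yes, No, Up, Down, Left, Right, On, Off, Stop, Go, Silence, Unknown).
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)

model = SpeechCommandNet()
logits = model(torch.randn(8, 1, 64, 64))   # a batch of 8 spectrogram images
print(logits.shape)                          # torch.Size([8, 12])
```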


3.1 Architecture Components The Models Ensemble Speech Recognition System (MESRS) architecture is described in Fig. 4; its components are:

Fig. 4. MESRS Architecture

Training Data Storage. This section stores all information and data required for deep learning neural network training. It contains labeled audio files in "wav" format, categorized according to the spoken word contained in them. Test Data Storage. This section stores all information and data required for testing the system. It contains non-labeled audio files in "wav" format. Audio to Image Conversion Service. This service converts wav file content into image format. The conversion is performed using LibROSA, an open source library written in Python, which converts an audio file into a Mel Spectrogram image. Training Service. This service is the heart of the system. In this service, the various models are trained according to input data entered from the training files. The neural network models created in this process will be activated during the classification phase. The better and more accurately the system is trained on the training files, the more accurate the system's classification of the testing files will be. The deep learning neural network training library PyTorch, powered by Facebook, is widely used in this service. Models Storage. This module stores the models created during the training phase. These models are in the standard format of PyTorch models (having the "pth" extension), and in the later stages they will be used to classify the various test files.


CSV Generator Service. The purpose of this service is to produce CSV files (a file for each built neural network model), containing the classification results of each network for each test file. Each CSV file will contain as many rows as the number of test files. Each row will contain two columns: the first column will contain the test file name, and the second column will contain the classification result of the current model for that file. For example, for the model "vgg1d_mel", a file called "vgg1d_mel_predictions.csv" will be generated. Its content will contain classifications for each and every test file; for example, in lines 60 and 61 it will hold the following information: "clip_0018ff8e9.wav, Up", "clip_001a5ce9c.wav, Stop". It means that for the clip_0018ff8e9.wav test file, the model called vgg1d_mel classifies the word "Up". Models Classifications Storage. Here the classification results are stored for each of the built models' CSVs (where each model has a corresponding CSV file). Majority Voting Service. The purpose of this service is to obtain the system's final classification, using the classifications of each model for the various test files. The need for this service stems from the system architecture: since the system contains more than one neural network model, and since different models can produce different classifications, getting a single classification for a specific test file requires a weighted average of the classification results of the various models in the system. Verification Service. The purpose of the verification service is to compare the final classification results obtained from the system for each test file with the correct identification results for each test file. This service returns a percentage score between 0 and 100 that describes the quality and correctness of the system results. UI Client. The UI interface is designed to load the different weight percentages for each model, which will be entered as input to the Majority Voting service. After receiving the results from the Majority Voting phase, the interface shows the user the obtained result regarding the quality of the model, after comparing the system's results with the correct results during the verification phase. The interface is written in HTML and communicates via the Flask framework with the project's Python libraries.
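As a rough illustration of the Majority Voting service described above, the sketch below reads hypothetical per-model prediction CSVs and picks the label with the largest summed weight. The model names, file names and weight values are placeholders; the numeric example mirrors the weights and per-model labels used later in Sect. 4.2.

```python
import csv
from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: {model_name: label}; weights: {model_name: percent}. Returns the winning label."""
    score = defaultdict(float)
    for model_name, label in predictions.items():
        score[label] += weights.get(model_name, 0)
    return max(score, key=score.get)

def load_model_csv(path):
    """Read one model's prediction CSV: each row is 'clip file name, predicted word'."""
    with open(path, newline="") as f:
        return {row[0]: row[1].strip() for row in csv.reader(f) if row}

# Per-model CSVs could be loaded with, e.g.:
# per_model = {name: load_model_csv(f"{name}_predictions.csv") for name in weights}

weights = {"vgg1d_raw": 30, "vgg1d_mel": 10, "vgg2d_mel": 15,
           "densenet_mel": 15, "senet_mel": 15, "resnet_mel": 15}
predictions = {"vgg1d_raw": "Up", "vgg1d_mel": "Go", "vgg2d_mel": "Go",
               "densenet_mel": "Go", "senet_mel": "Up", "resnet_mel": "Up"}
print(weighted_vote(predictions, weights))   # "Up" (60%) beats "Go" (40%)
```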

3.2 The System's Training Phase Figure 5 describes a sequence diagram of all the events during the system's training phase. The person responsible for creating the system sends all the training files in their initial form ("wav" audio files) to the service, which converts them into image files to be used to train the deep learning system. The generated information is then sent to the training service, whose role is to produce models using Facebook's PyTorch framework.


Fig. 5. MESRS training phase sequence diagram

When the training phase ends (this step, as mentioned above, can take anywhere from hours on a powerful computer to days or weeks on a home computer), the newly created model will be stored in a folder containing the system's model pool. Following various attempts, we have concluded that in order to obtain the best system score, a golden number of six different neural network models is needed. Following the creation of the six different neural network models in the way described, the system produces six CSV files containing the classifications of each model. First, the system converts the test files the same way it converted the training files earlier (from "wav" audio files to a format that the system's deep network models can classify - that is, to image file format). The system then goes over each test file, and for each of the system's neural network models it creates its own CSV file. Each CSV file line contains the test file name and the classification that the model gave it, from among the words the neural network model can classify. The CSV files for the various models are stored in a folder which will be used during the next phase of system classification. An example of a few rows from the CSV file created for the VGG1D model which classifies Mel Spectrogram images is shown in Table 1.


Table 1. An example for different classifications between the system’s models. Unknown clip_00147bbb6.wav Yes

clip_0014ea384.wav

No

clip_0014ed3d5.wav

No

clip_0014f2f18.wav

Unknown clip_00150496f.wav Unknown clip_0015fa156.wav Up

clip_00169a7f7.wav

Go

clip_0017365f5.wav

Unknown clip_0017714af.wav

3.3 The System’s Query Phase Figure 6 describes a sequence diagram for all events during the system’s query phase.

Fig. 6. MESRS query phase sequence diagram

A system query is performed using the user interface. In this interface, the user determines the weight percentages for each of the six classifications created by the six deep learning neural network models. After determining the weight percentages for each of the model classifications, these percentages are sent to the Majority Voting service, which performs a weighted average based on the classification results and the user-entered data. The Majority Voting service returns the classification obtained from the data for each of the testing files. At this point, the received classifications will be


sent to another service which compares these classifications to the correct classifications of the testing files. After the comparison is done, this service returns to the user a final score for the weights he chose, in relation to the correct classification results received in previous work. The system then calculates an overall grade according to the weighted average method shown in Eq. (1):

Grade = ( Σ_{i=0}^{N−1} [SystemClassification[i] == CorrectClassification[i]] ) / N    (1)

where N is the amount of test data files and the bracketed term contributes 1 when the system classification of file i equals the correct classification and 0 otherwise. Each test file in which the classification given by the system and the original classification are the same is added to the final calculation. The overall score of the system is thus the number of times the classification given by the system was identical to the correct classification, divided by the total amount of files to be classified. This way a score between 0 and 100 (as a percentage) can be quantified to reflect the accuracy of the system.
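A minimal sketch of the grade calculation in Eq. (1), expressed as a percentage; the label lists used in the usage line are hypothetical.

```python
def overall_grade(system_labels, correct_labels):
    """Eq. (1): fraction of test files whose system classification matches the correct one, as a percent."""
    assert len(system_labels) == len(correct_labels)
    correct = sum(1 for s, c in zip(system_labels, correct_labels) if s == c)
    return 100.0 * correct / len(correct_labels)

# Example: 3 of 4 hypothetical test clips classified correctly -> grade 75.0
print(overall_grade(["Up", "Go", "Stop", "No"], ["Up", "Go", "Stop", "Yes"]))
```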

4 Experimental Results 4.1 Experiment Phases Our experiments comprise four main stages. We begin with training deep learning neural networks (colored in blue in Fig. 7). Then we classify the testing data files using the trained neural networks (colored in orange in Fig. 7). Afterwards we perform a weighted average on top of the testing data classifications (colored in yellow in Fig. 7) in order to retrieve a single system classification for each of the testing files. Finally, we compare the system's classifications to the correct classifications of the testing files (colored in green in Fig. 7) in order to retrieve an overall grade for the system's classifications. Train Deep Neural Networks. Colored in blue in Fig. 7, this phase consists of two main parts: the first part is the conversion of the labeled audio clips designated for training the system from the audio space to the image space in the form of Mel Spectrograms. The second is the actual training of neural networks using deep learning techniques on top of the converted audio clips. The outputs of this action are six neural network models that were generated using deep learning and have the ability to classify audio clip content (following its conversion into the image space) into one of the following options: Yes, No, Up, Down, Left, Right, On, Off, Stop, Go, Silence, Unknown. Classify Testing Data Using the Trained Networks. Colored in orange in Fig. 7, this phase consists of two main parts: the first is conversion of the non-labeled audio clips designated for testing the system from the audio space into the image space, in the Mel Spectrogram form. The second is retrieving and saving the classifications of the neural network models (generated in the previous phase) for the Mel Spectrograms of the testing audio files.


Fig. 7. The complete system flows

The outputs of this action are the classification results of each of the neural network models generated in the previous phase, for each of the testing files. Perform Majority Voting and Retrieve a Single System Classification Per Testing File. Colored in yellow in Fig. 7, in the third phase we perform a weighted average over all the different model classifications for each of the testing files. The outputs of this action are the system's overall classification results for each of the testing files. Following this action, there will be only one classification per testing file (unlike the output of the second phase, in which each testing file had six different classifications, according to the number of trained models). Comparison Between the System's Classifications and the Correct Classifications. Colored in green in Fig. 7, in the fourth phase a comparison is held between the MESRS system classification results received during the third phase and the correct results received from previous work. The output of this phase is a final classification grade that represents the correctness of the entire system.


4.2 An Experimental Ensemble-Learning Paradigm The uniqueness of MESRS is in the use of an ensemble of very different deep learning neural network architectures. Thus, it can cover many edge cases that one neural network architecture cannot cover. The ensemble of the deep learning neural network models is performed by taking different weights (percentages) for the different neural network models and applying a weighted average to their classifications in order to select the final classification of each test file in the system. For example, given the following data:

1. An audio file named "sample.wav".
2. Six neural network models based on the VGG, SeNet, DenseNet and ResNet architectures.
3. Weights, in percentages, for the above models, as shown in Table 2.

Table 2. An example of weights distribution between the system's models.

DL model architecture                  Weight of model in the final classification decision
VGG-1D on Raw Data                     30%
VGG-1D on Mel Spectrogram images       10%
VGG-2D on Mel Spectrogram images       15%
DenseNet on Mel Spectrogram images     15%
SeNet on Mel Spectrogram images        15%
ResNet on Mel Spectrogram images       15%

An example of the different classifications that could be received from the above models on the sample file is shown in Table 3.

Table 3. An example for different classifications between the system's models.

DL model architecture                  Classification for "sample.wav"
VGG-1D on Raw Data                     Up
VGG-1D on Mel Spectrogram images       Go
VGG-2D on Mel Spectrogram images       Go
DenseNet on Mel Spectrogram images     Go
SeNet on Mel Spectrogram images        Up
ResNet on Mel Spectrogram images       Up


The classification obtained from the weighted average of the above data is "Up", since this classification holds 60%, compared to the "Go" classification which only holds 40%. Choosing the Right Architectures. These are the criteria for choosing deep learning neural network architectures for MESRS:

1. They have been proven in the past.
2. They have a wide pool of online support and documentation.
3. They have been able to classify images with high accuracy.

The architectures selected for MESRS were VGG [24], SeNet [23], DenseNet [21] and ResNet [22]. Since all the carefully selected networks for the MESRS implementation are commonly used in the deep learning world, ready-made implementations of them exist in PyTorch. In order to adapt the implementations of the above architectures in Python, one can inherit from the PyTorch model object (torch.nn.Module) [25]. Table 4 shows the models' classification grades in the case where a single architecture is used to perform the classification of the testing audio clips. Note that the grades are verified against the correct classifications of the audio clips dedicated for testing.

Table 4. Single architecture classification results (without models' ensemble)

Model                                  Grade (out of 100)
VGG-1D on Raw Data                     91.55
VGG-1D on Mel Spectrogram images       90.71
VGG-2D on Mel Spectrogram images       92.04
DenseNet on Mel Spectrogram images     92.13
SeNet on Mel Spectrogram images        92.05
ResNet on Mel Spectrogram images       89.93

Table 5 holds the ideal weights that produce the best result for the system's six trained models, when compared to the correct classifications of the audio clips dedicated for testing.

Table 5. The ideal weights distribution for the system's models

ResNet on Mel Spectrogram images       15
SeNet on Mel Spectrogram images        15
DenseNet on Mel Spectrogram images     15
VGG-2D on Mel Spectrogram images       15
VGG-1D on Mel Spectrogram images       10
VGG-1D on Raw Data                     30

Audio Clips Classification Using Ensemble of Neural Networks Models. As shown in Fig. 8, a simple user interface was created in order to help calculate the different grades given by MESRS for different weight configurations. The grade that the MESRS implementation received using the above weights for the entire testing data was 94.03. We see that there is no single dominant model - only with the combination of the many predictions from the different models can the optimal result be obtained. Moreover, when examining the score of a single model (for each of the six models), the highest score is 92.04, less than two percent below the optimal score obtained by using all models together. It could be thought that those 2% are not a difference that justifies the "complexity" of the MESRS system, but this is not true: 2% of the test files (158,539 files) equals about 3,170, meaning that by using six models and the weighted average mechanism, the system achieves about 3,170 more correct identifications than it achieves without them. This is a very large number of correct identifications that would otherwise be lost; in a world where neural network models can be deployed in critical systems, this many correct identifications are essential. The best results are obtained when there is one model with 30% and the other 70% of the result consists of a relatively uniform distribution between the other models.

Fig. 8. The user interface to load and calculate different weights into MESRS with the calculated grade.

Comparative Outcome Analysis. The above case study was developed and tested within the restrictions of Kaggle's speech recognition competition [26]. As part of the competition, two large databases were provided; these databases are significant and contain large amounts of audio files. Training set - an audio clip pool containing 65,000 files. The sound clips were recorded by thousands of different people, each file including a single command. The various files are tagged by content. Testing set - a repository containing almost 160,000 files which, like the training set, have been recorded by thousands of different people and contain different voice commands. Unlike the training set files, the testing set files are not tagged according to their content. This case study has two main objectives:


1. Present the optimal situation and the optimal system score for the above input.
2. Demonstrate the ease of use of the system for testing various weights for the various trained neural network models, thereby also demonstrating the ease of obtaining a system quality indicator score.

4.3 Comparative Outcome Analysis and Other Experiments Additional experiments on different weight distributions are shown in Table 6; the different weights are set during the Majority Voting phase for each model. For each different distribution of the models' weights, the obtained score is indicated in the "Overall System Grade" column of Table 6.

Table 6. MESRS grades during testing different weights distribution over the system models
(All model columns refer to models trained on Mel Spectrogram images, except VGG-1D (Raw), which was trained on raw data.)

ResNet  SeNet  DenseNet  VGG-2D  VGG-1D (Mel)  VGG-1D (Raw)  Overall System Grade
15      15     15        15      10            30            94.03%
20      20     10        10      30            10            93.46%
15      15     15        15      35            5             93.19%
10      20     5         5       50            10            91.43%
10      20     10        30      15            15            93.56%
20      10     30        10      15            15            93.68%
10      10     10        10      10            50            92.47%
10      10     10        10      50            10            91.43%
10      10     10        50      10            10            92.47%
10      10     50        10      10            10            92.44%
10      50     10        10      10            10            91.49%
50      10     10        10      10            10            90.29%
30      10     10        10      10            30            93.62%
10      20     20        20      20            10            93.51%
0       0      0         0       0             100           91.55%
0       0      0         0       100           0             90.71%
0       0      0         100     0             0             92.04%
0       0      100       0       0             0             92.13%
0       100    0         0       0             0             92.05%
100     0      0         0       0             0             89.93%


5 Conclusions This paper describes an automated speech recognition system that was developed under the name MESRS (Models Ensemble Speech Recognition System). The conventional data format used for training and working with deep learning neural networks is data from the visual image space. This paper presented MESRS, a system designed for speech recognition using an ensemble of deep learning models that gives high quality classification results. As deep learning is a new and pioneering field, deep learning on audio files is significantly newer, which makes the various developments in this field insufficiently robust and usable for our system. MESRS was at first based on Google's TensorFlow Simple Audio Recognition library [27], which trains on audio data without any further conversion; the classification results from this library were not robust enough for the MESRS system. The method that proved to be the most efficient for training such networks was using regular image-aimed deep learning neural networks, with the twist of converting their input data from the audio space into the image space. This is a well-known and highly robust method for deep learning purposes. The results clearly indicate that creating a quality, robust system using an ensemble of deep learning model architectures is a viable task. This paper could serve as a base for future studies in the area of speech recognition or other deep learning model ensembles. It has been clearly shown that by using an ensemble of deep learning model architectures, one can obtain classification results for trivial and non-trivial problems that a single architecture would not be able to obtain.

References
1. Sainath, T., Kingsbury, B., Ramabhadran, B., Novak, P., Mohamed, A.: Making deep belief networks effective for large vocabulary continuous speech recognition. In: Proceedings of the ASRU (2011)
2. Deng, L., Abdel-Hamid, O., Yu, D.: A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In: Proceedings of the ICASSP (2013)
3. Yu, D., Deng, L., Seide, F.: The deep tensor neural network with applications to large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 21(2), 388–396 (2013)
4. Mohamed, A., Dahl, G., Hinton, G.: Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22 (2012)
5. Sainath, T., Kingsbury, B., Soltau, H., Ramabhadran, B.: Optimization techniques to improve training speed of deep neural networks for large speech tasks. IEEE Trans. Audio Speech Lang. Process. 21(11), 2267–2276 (2013)
6. Sainath, T., Kingsbury, B., Mohamed, A., Dahl, G., Saon, G., Soltau, H., Beran, T., Aravkin, A., Ramabhadran, B.: Improvements to deep convolutional neural networks for LVCSR. In: Proceedings of the ASRU (2013)
7. Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y., Acero, A.: Recent advances in deep learning for speech research at Microsoft. In: Proceedings of the ICASSP (2013)
8. Deng, L., Hinton, G., Kingsbury, B.: New types of deep neural network learning for speech recognition and related applications: an overview. In: Proceedings of the ICASSP (2013)
9. Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C.-H., Morgan, N., O'Shaughnessy, D.: Research developments and directions in speech recognition and understanding. IEEE Signal Process. Mag. 26(3), 75–80 (2009)
10. Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
11. Seide, F., Li, G., Yu, D.: Conversational speech transcription using context-dependent deep neural networks. In: Proceedings of the Interspeech (2011)
12. Yu, D., Deng, L., Dahl, G.E.: Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2010)
13. Jaitly, N., Nguyen, P., Vanhoucke, V.: Application of pre-trained deep neural networks to large vocabulary speech recognition. In: Proceedings of the Interspeech (2012)
14. Kingsbury, B., Sainath, T., Soltau, H.: Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In: Proceedings of the Interspeech (2012)
15. Sainath, T., Mohamed, A., Kingsbury, B., Ramabhadran, B.: Convolutional neural networks for LVCSR. In: Proceedings of the ICASSP (2013)
16. Dahl, G., Yu, D., Deng, L., Acero, A.: Context-dependent, pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 30–42 (2012)
17. Training & Testing sets used in Kaggle's Speech Recognition Challenge. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data
18. Google's Crowdsourcing Open Speech Recording. http://aiyprojects.withgoogle.com/open_speech_recording
19. LibROSA – Python package for music and audio analysis. https://librosa.github.io/librosa
20. PyTorch – An open source deep learning platform by Facebook. https://pytorch.org
21. Densely Connected Convolutional Networks (DenseNet). https://arxiv.org/pdf/1608.06993.pdf
22. Deep Residual Learning for Image Recognition (ResNet). https://arxiv.org/abs/1512.03385
23. Squeeze-and-Excitation Networks (SeNet). https://arxiv.org/abs/1709.01507
24. Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG). https://arxiv.org/pdf/1409.1556.pdf
25. PyTorch neural networks module inheritance documentation. https://pytorch.org/docs/stable/nn.html
26. TensorFlow Speech Recognition Challenge. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge
27. TensorFlow Simple Audio Recognition Library. https://github.com/tensorflow/docs/blob/master/site/en/r1/tutorials/sequences/audio_recognition.md

DeepConAD: Deep and Confidence Prediction for Unsupervised Anomaly Detection in Time Series Ahmad Idris Tambuwal(B) and Aliyu Muhammad Bello Faculty of Engineering and Informatics, University of Bradford, Bradford, UK [email protected], [email protected]

Abstract. The current digital era of Industrial IoT and Automotive Technologies has made it standard for a large number of sensors to be installed on machines or vehicles, and for time-series data from such sensors to be captured and exploited for health monitoring tasks such as anomaly detection, fault detection, as well as prognostics. Anomalies or outliers are unexpected observations which deviate significantly from the expected observations and typically correspond to critical events. Current literature demonstrates good performance of Autoencoders for anomaly and novelty detection problems due to their efficient data encoding in an unsupervised manner. Despite the unsupervised nature of autoencoder-based anomaly detection methods, they are limited by the identification of anomalies using a threshold that is defined based on the distribution of reconstruction cost. Often, it is difficult to set a precise threshold when the distribution of reconstruction cost is not known. Motivated by this, we propose a new unsupervised anomaly detection method (DeepConAD) that combines an Autoencoder with a forecasting model and uses uncertainty estimates, or confidence intervals, from the forecasting model to identify anomalies in multivariate time series. We performed an experimental evaluation and comparison of DeepConAD with two other anomaly detection methods using the Yahoo benchmark dataset, which contains both real and synthetic time series. Experimental results show DeepConAD outperforms the other anomaly detection methods in most of the cases. Keywords: Time series · Anomaly detection · Uncertainty estimation · Deep Neural Networks · Autoencoder

1 Introduction Anomalies have become familiar in our daily life activities, where every day we observe and identify abnormalities. When something deviates markedly from the usual sequence of these activities, it is labelled as an anomaly or an outlier. In the field of data mining, anomalies and outliers are used interchangeably, referring to unexpected observations which deviate significantly from the expected observations and typically correspond to critical events [1, 2]. The current digital era of the Industrial Internet of Things [3] and Automotive Technologies has made it common for a large number of sensors to be installed on devices, and for


time-series data from such sensors to be captured and exploited for health monitoring tasks such as anomaly detection, fault detection, as well as prognostics. In time-series signals, an anomaly is any unexpected change in the pattern of one or more of the signals. Timely detection of abnormalities in such signals is vital to avoid total system failure [4–6]. In the context of the automotive industry, for example, anomaly detection can provide prior information on mechanical faults [7] and sensor faults [8]. Several types of anomalies exist in the literature, but the three most popular types are point, contextual, and collective anomalies [6, 9]. Point anomaly is generally studied in the context of multidimensional dependency-oriented data types such as time series, discrete sequences, spatial and temporal data. In the context of this paper, we focus on point anomalies in time series data. Several techniques for point anomaly detection exist in the literature, but we will focus on prediction or regression-based models due to their ability to handle temporal features that exist within the time series [6]. In prediction models, the anomaly score is computed as the rate of deviation of a point from its predicted value. Current literature also shows the usage of deep learning-based regression models such as Recurrent Neural Networks (RNN), mainly based on Long Short Term Memory (LSTM) [10, 11] or Gated Recurrent Units (GRU) [12], and Convolutional Neural Networks (CNN) [13] for anomaly detection in time-series data. Their performance in unsupervised representation learning of time sequences applicable to text, video, and speech recognition shows their ability to handle the temporal nature of time series data [14]. Despite the performance of these techniques and their unsupervised learning approach, they use the prediction error for the detection of anomalies. However, in most real-life scenarios that involve complex systems (e.g. automotive driving), there are often external factors or variables which are not captured by sensors and which lead to unpredictable time series. Detecting anomalies in such scenarios becomes challenging using a standard approach based on prediction models that utilize prediction errors to detect anomalies; hence the introduction of Autoencoders. The Autoencoder is another type of deep neural network, which is trained to reconstruct its input. Autoencoders are used for dimensionality reduction, which helps in classification and visualization tasks, and for learning a hidden representation of time series. As a result of their efficient data encoding in an unsupervised manner, they are also gaining popularity in anomaly and novelty detection problems [15–17]. Despite the unsupervised nature of autoencoder-based anomaly detection methods, they are limited by the identification of anomalies using a threshold that is defined based on the distribution of reconstruction cost. Often, the distribution of reconstruction cost is not known, or the experimenter does not aim at making any assumption about a specific distribution. As such, it is not possible to define a precise threshold value that can aid the identification of anomalies; hence the need for probabilistic time-series forecasting to identify anomalies. This area has not been extensively explored in deep learning, where prediction models focus more on estimating an optimal value as defined by the loss function.
In contrast, Lingxue and Nikolay proposed a deep learning model that provides time-series predictions along with uncertainty estimates and used that for forecasting of extreme events [18]. Similarly, Reunanen et al. demonstrated the use of Autoencoder reconstruction cost and Chebyshev's inequality to calculate the upper and lower outlier detection limits in sensor streams [2].


Motivated by probabilistic time series forecasting and the ability of the Autoencoder to learn a hidden representation of the data, we propose a new unsupervised anomaly detection method (DeepConAD) that utilizes the hidden representation provided by the autoencoder and uncertainty estimates from the prediction model to identify anomalies in multivariate time series. We employ window-based sequence learning by combining an LSTM autoencoder and an LSTM forecaster for encoding and forecasting of the next time step based on a window of previous time steps. In this context, an encoder is used to encode the input sequence and capture the patterns of the time series. This encoded sequence is then passed to a regression model, which then forecasts the next step for each value along with an uncertainty estimate. We achieve this by developing a quantile regression model, which approaches the regression problem as estimating a continuous probability distribution instead of estimating a single value or sequence of values. By computing the upper and lower quantiles of the model's predicted output, we assume the model has captured the most likely values reality can produce. We then compute the interval between the upper and lower quantiles to see whether our model finds any anomalies. We consider any observed value that is outside this range an anomaly. DeepConAD also captures deviations in the correlation of data features via the Autoencoder, which enhances its performance in handling the multivariate and temporal characteristics of time series. As such, DeepConAD is suitable for domains where time series are collected from heterogeneous sensors. When tested with the publicly available Yahoo anomaly benchmark dataset, DeepConAD outperforms most of the state-of-the-art anomaly detection methods. In summary, the main contributions of this paper are as follows:
1) To the best of our knowledge, DeepConAD is the first deep learning-based anomaly detection method that uses model uncertainties for detecting point anomalies in time series data.
2) The proposed method is flexible and can easily be adapted to different use cases and domains where dynamic behaviour and complex temporal dependencies exist among the sensors.
3) In contrast to the LSTM and CNN based approaches, DeepConAD uses an Autoencoder to learn the hidden representation in the time series, which increases its performance.
The rest of the paper is organized as follows: Sect. 2 provides a literature review on anomaly detection methods. In Sect. 3, we provide a theoretical background and detailed description of our approach. Section 4 provides an experimental evaluation, and Sect. 5 includes a conclusion and future work.

2 Literature Review The broad diversity of application domains, together with the different types of data, affects the choice of the anomaly detection technique. Temporal data such as social network data streams, computer network traffic, astronomy data, sensor data and commercial transactions generated from different application domains have led to the rise of a field of anomaly detection for temporal data. Anomaly detection problems in the field of temporal data can be categorized in different ways due to the diversity of the area.


One of the categorizations is based on data type, nature of the data, types of abnormality in context, and availability of labelled anomalies [6]. Different types of data exist from different application domains, such as continuous series (e.g. sensor readings), discrete series (e.g. weblogs), data streams (e.g. text streams), and network data (e.g. graph and social streams). In this section, we review anomaly detection methods designed for time series data. The literature shows two main types of anomaly detection techniques for time series. The first involves anomaly detection in a time series database, where the focus is on identifying entire time series sequences in the database that are anomalous [19, 20]. The second involves identification of anomalies within a given time series sequence, which includes point (instantaneous) anomalies and window-based anomalies. Window-based detection requires identification of an anomalous subsequence within the time series, where the current window is unexpected and deemed abnormal. Although several window-based anomaly detection methods have been proposed in the literature [21, 22], these methods require the definition of the expected pattern. Unfortunately, without expert knowledge of the system, it will be difficult to define such patterns, which affects the performance of the model. Point-based anomalies involve identification of an element or point within a subsequence of the time series as an anomaly. This means that, given a window of the time series, the aim is to identify an abnormal point within the window. Point-based anomaly detection methods are also used for identification of sensor drift, where an individual continuous sensor suddenly drifts. Many point anomaly detection techniques exist in the literature and a review of those techniques is beyond the scope of this paper. We refer the reader to [6] for a detailed study and understanding of those techniques. Similarly, current literature shows the use of Deep Neural Networks (DNNs) such as Recurrent Neural Networks (RNN), mainly based on Long Short Term Memory (LSTM) or Gated Recurrent Units (GRU), for point anomaly detection in time series data. Their performance in unsupervised representation learning of time sequences applicable to text, video, and speech recognition shows their ability to handle the temporal nature of time series data. Sangha et al. [8] used RBF networks for online fault diagnosis on real engine sensor data from the automotive industry. An anomaly detection method based on stacked LSTMs was also proposed in [10]. The authors developed a predictive model that was trained using the normal sequence, which is further evaluated to compute error vectors based on its performance on the anomalous sequence. An anomaly is then defined by setting an error threshold that is determined using the validation set. A similar approach was also used to detect anomalies in ECG data [23]. A deep CNN was also used as a regression model for anomaly detection on time series data [13]. The model predicts the next timestamp using a window of previous time stamps and uses the Euclidean distance to measure the difference between the predicted and actual values. Based on these differences, an anomaly is identified at a given timestamp using a threshold value. The techniques reviewed above work on Univariate Time Series (UTS) and therefore cannot handle the correlations of Multivariate Time Series (MTS). Saurav et al. [12] proposed another highly related anomaly detection method that used a sliding window to handle both the multidimensional and streaming nature of time series.


Similarly, the authors in their proposed approach used GRU units, which are a simplified version of LSTM units. Even though the GRU is simpler than LSTM units, the LSTM is more powerful than the GRU because of its ability to learn long-term correlations in a sequence and to accurately model complex multivariate sequences without the need for a pre-specified time window [24]. As such, it becomes more appropriate to consider LSTM units with multiplicative gates that enforce constant error flow through the internal states of individual units called "memory". LSTMs also have an internal memory that operates like a local variable, allowing them to accumulate state over the input sequence. Despite the ability of the method mentioned above to handle the multidimensional and streaming nature of time series, the technique uses the prediction error for the detection of anomalies. However, in most real-life scenarios that involve complex systems (e.g. automotive driving), there are often external factors or variables which are not captured by sensors and which lead to unpredictable time series. Detecting anomalies in such scenarios becomes challenging using a standard approach based on prediction models that utilize prediction errors to detect anomalies, and hence the introduction of Autoencoders. The Autoencoder is another type of deep neural network, which is trained to reconstruct its input. Autoencoders are used for dimensionality reduction, which helps in classification and visualization tasks, and also for learning the hidden representation of time series. As a result of their efficient data encoding in an unsupervised manner, they are also gaining popularity in anomaly and novelty detection problems. In this context, an LSTM Encoder-Decoder has been used to handle MTS characteristics and shown to be useful for anomaly detection [15]. In that paper, the Encoder-Decoder model learns to capture the normal behaviour of the machine by learning to reconstruct MTS corresponding to normal functioning in an unsupervised manner, thereby using the reconstruction error to detect anomalies. Since their model is trained only on time series corresponding to normal behaviour, it is expected to perform well, with a small error, while reconstructing normal MTS, and poorly, with a larger error, on abnormal MTS. The reconstruction error is then used as an anomaly score to identify anomalies. In a similar approach, Amarbayasgalan et al. [17] combined autoencoding and a clustering method for unsupervised novelty detection. The authors used the compressed data and the reconstruction error threshold obtained from autoencoders and applied density-based clustering on the compressed data to detect novelty groups with low density. Schreyer et al. [16] also used deep autoencoders to detect anomalies in large-scale accounting data in the area of fraud detection. Despite the unsupervised nature of autoencoder-based anomaly detection methods, they are limited by the assumption of the distribution of reconstruction cost. They use the distribution of reconstruction cost to define a threshold which can help in identifying anomalies. Often, the distribution of reconstruction cost is not known, or the experimenter does not aim at making any assumption about a particular distribution. As such, it is not possible to define a precise threshold value that can aid the identification of anomalies; hence the need for probabilistic time-series forecasting to identify anomalies.
This area has not been extensively explored in deep learning, where prediction models focus more on estimating an optimal value as defined by the loss function. In contrast, Lingxue and Nikolay proposed a deep learning model that provides time-series predictions along


with uncertainty estimates and used that for forecasting of extreme events [18]. Similarly, Reunanen et al. demonstrated the use of Autoencoder reconstruction cost and Chebyshev's inequality to calculate the upper and lower outlier detection limits in sensor streams [2]. Motivated by probabilistic time series forecasting and the ability of the Autoencoder to learn the hidden representation of data, we propose a new unsupervised anomaly detection method (DeepConAD) that utilizes the hidden representation provided by the autoencoder and uncertainty estimates from the prediction model to identify anomalies in multivariate time series. We employ window-based sequence learning by combining an LSTM autoencoder and an LSTM forecaster for encoding and forecasting of the next time step based on a window of previous time steps. In this context, an encoder is used to encode the input sequence and capture the patterns of the time series. This encoded sequence is then passed to a regression model, which then forecasts the next step for each value along with an uncertainty estimate. We achieve this by developing a quantile regression model, which approaches the regression problem as estimating a continuous probability distribution instead of estimating a single value or sequence of values. By computing the upper and lower quantiles of the model's predicted output, we assume the model has captured the most likely values reality can produce. We then compute the interval between the upper and lower quantiles to see whether our model finds any anomalies. We consider any observed value that is outside this range an anomaly. DeepConAD also captures deviations in the correlation of data features via the Autoencoder, which enhances its performance in handling the multivariate and temporal characteristics of time series data, and it is therefore suitable for domains where time series are collected from heterogeneous sensors.
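The paper describes training a quantile regression model and flagging observations that fall outside the predicted quantile interval. The sketch below illustrates this idea with the standard pinball loss, a common choice for training quantile regression that is assumed here rather than taken from the paper, together with an interval-based anomaly flag; the numbers are made up for illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Standard quantile (pinball) loss for a quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

def flag_anomalies(y_obs, lower_q, upper_q):
    """Mark observations that fall outside the predicted [lower, upper] quantile interval."""
    return (y_obs < lower_q) | (y_obs > upper_q)

# Hypothetical one-step-ahead predictions at the lower and upper quantiles:
lower = np.array([0.2, 0.3, 0.25, 0.3])
upper = np.array([0.8, 0.9, 0.85, 0.9])
obs   = np.array([0.5, 0.95, 0.6, 0.1])
print(flag_anomalies(obs, lower, upper))   # [False  True False  True]
```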

3 DeepConAD: Proposed Approach

In this section, we describe our proposed approach, which is divided into seven stages: Normalization, Time Series Segmentation, Auto Encoding, Forecasting, Uncertainty Estimation, Anomaly Identification, and Visualization of Anomalies. The overall steps are depicted in Fig. 1.

Fig. 1. The workflow of the proposed approach


Normalization. Motivated by the existence of different value scales in each time series, normalization of the MTS is carried out to enhance the performance of the regression model. Consider an MTS x = {x_1, x_2, ..., x_t}, where t is the length of the time series and each point x_i ∈ R^m (for i = 1 ... t) in the time series is an m-dimensional vector corresponding to the m features or sensor channels read at time t. We scale each feature between 0 and 1 (x_{ij} ∈ [0, 1], where j = 1 ... m), as shown in (1).

\[ x_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \tag{1} \]

where x_max and x_min are vectors that contain the maximum and minimum values of the features. The scaling is done for each point per feature. The output of this stage is an array of scaled values representing the MTS.

Time Series Segmentation. The input to this stage is an array of scaled MTS from the previous stage. In order to leverage our regression model for sequence-to-sequence learning, the input MTS is transformed into multiple sequences of time steps. This transformation involves converting the MTS from one sequence into pairs of input and output sub-sequences. The MTS is segmented into two overlapping fixed-length windows. First is the history window (h_w), which defines the number of previous time steps in history that will be used as input to the autoencoder. That is, given an MTS, we use a history window (x_{t-(w-1)}, ..., x_{t-1}, x_t) of length w. Second is the predicted window (p_w), which represents the number of time steps required to be forecasted as output from the regression model. The aim is to predict the next time step given a window of previous time steps, as shown in (2) for w = 5 and p_w = 1.

\[ (x_{t-4}, x_{t-3}, x_{t-2}, x_{t-1}, x_t) \rightarrow x_{t+1} \tag{2} \]

In a regression problem such as ours, the left-hand side serves as input data to the model and the right-hand side as the output, which is treated as a label for the input data. The output of this stage is two arrays of sub-sequences representing h_w and p_w.

Auto Encoding. The input to this stage is an array of history windows from the previous step. In this stage, an encoder-decoder model is used to reconstruct each window of sub-sequence and extract a useful representation or pattern from the window. The last LSTM cell states of the encoder are extracted; these contain both the learned features for forecasting and any unusual input captured in the feature space, and they are propagated to the regression model in the next stage. The output of this stage is an array of features extracted from each window of sub-sequence, which is passed as input for forecasting in the next step.

Forecasting. In this step, an array of extracted features obtained from the auto-encoding step and an array of prediction windows (p_w) are used as input. An LSTM-based regression model is trained to forecast the next time step (p_w) using the learned features. The model forecasts the next time step, taking uncertainty into account, as explained in the next stage.
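As an illustration of the first two stages, the following sketch shows a minimal NumPy implementation of min-max scaling and window-based segmentation of an MTS into (h_w, p_w) pairs. The function and variable names are ours, and the window length of 5 only mirrors the example of equation (2); it is not a value prescribed by the method.

import numpy as np

def normalize(mts):
    """Scale each feature (column) of the MTS to [0, 1], as in Eq. (1)."""
    x_min = mts.min(axis=0)
    x_max = mts.max(axis=0)
    return (mts - x_min) / (x_max - x_min + 1e-12)  # small epsilon avoids division by zero

def segment(mts, w=5, p_w=1):
    """Split a scaled MTS of shape (t, m) into history windows and next-step targets.

    Returns:
        hw: array of shape (n_windows, w, m), used as autoencoder input
        pw: array of shape (n_windows, p_w, m), treated as labels for the forecaster
    """
    hw, pw = [], []
    for start in range(len(mts) - w - p_w + 1):
        hw.append(mts[start:start + w])
        pw.append(mts[start + w:start + w + p_w])
    return np.array(hw), np.array(pw)

# Example: 1000 time steps, 3 sensor channels
mts = normalize(np.random.rand(1000, 3))
hw, pw = segment(mts, w=5, p_w=1)
print(hw.shape, pw.shape)  # (995, 5, 3) (995, 1, 3)

The history windows hw are then fed to the LSTM autoencoder described above, and the targets pw to the forecaster.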


Uncertainty Estimation. Uncertainty estimation in prediction is the process of quantifying the uncertainty (or confidence) of prediction models: instead of estimating an optimal value, the model estimates a probability distribution. It amounts to performing Bayesian inference, which aims at finding the posterior distribution over the model parameters p(ω|X, Y), given the sets of sub-sequences X and Y, where X represents h_w and Y represents p_w. With a new data point x ∈ X, the distribution of the prediction is obtained by marginalizing out the posterior distribution, as shown in (3), where ω is the collection of model parameters. Taking the variance of this distribution quantifies the prediction uncertainty, which is further elaborated using the law of total variance, as shown in (4).

\[ p(y \mid x) = \int p(y \mid x, \omega)\, p(\omega \mid X, Y)\, d\omega \tag{3} \]

\[ \operatorname{Var}(y \mid x) = \operatorname{Var}\big[\mathbb{E}(y \mid \omega, x)\big] + \mathbb{E}_{\omega}\big[\operatorname{Var}(y \mid \omega, x)\big] = \operatorname{Var}(\omega(x)) + \delta^2 \tag{4} \]

From the last part of (4), we can see that the variance is decomposed into two terms: (i) Var(ω(x)), which represents the model uncertainty reflecting our ignorance over the model parameters ω; and (ii) δ², which represents the noise level of the data generation process, referred to as the inherent noise. The underlying assumption is that y is generated from the same distribution, which may not be the case in most scenarios; in anomaly detection, for example, we expect some time series sub-sequences to contain unexpected points that differ from the sequences used in training the model. Therefore, a complete measurement of prediction uncertainty should be a combination of model uncertainty, model misspecification and inherent noise level [18]. In order to achieve this, a quantile regression model was developed which focuses on forecasting extreme values (lower (10th), upper (90th), and classical (50th) quantiles). The model is implemented with a Keras LSTM by creating a quantile loss function which penalizes errors based on the quantile and on whether the error is positive or negative. By computing upper and lower quantiles, the model captures the most likely values reality can assume. The width of this range can vary: it will be small when the model is confident about the future and large when the model is not able to anticipate significant changes within the time series. This behaviour is used to let the model detect whether a predicted point is an anomaly or not. The computed quantiles are stored in arrays and passed as input to the next stage for anomaly identification. A detailed description of this procedure is given in the following subsection.

Anomaly Identification. The input to this step is the arrays of computed quantiles. Anomalies are identified by computing an interval as an array of differences between upper and lower quantiles (the 90th–10th quantile range). The interval is expected to be small when the data is normal and within the range learned by the model; on the contrary, a larger interval is expected for anomalous values. We consider any observed value x_t which falls outside of the 95% prediction interval as an anomaly. The output of this stage is an array that is given to the next step for visualization.

Visualization of Anomalies. In this stage, the array of computed intervals received as input from the previous step is plotted against the actual sequence to visualize the actual points that are anomalous.
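The quantile (pinball) loss mentioned above can be written as a custom Keras loss. The following is a minimal sketch of how such a loss, a small LSTM regression head and the resulting interval-based anomaly flag could look; the layer sizes, the helper names and the use of three separately trained quantile models are our own illustration choices, not values taken from the paper.

import tensorflow as tf

def quantile_loss(q):
    """Pinball loss for quantile q: penalizes positive and negative errors asymmetrically."""
    def loss(y_true, y_pred):
        e = y_true - y_pred
        return tf.reduce_mean(tf.maximum(q * e, (q - 1.0) * e))
    return loss

def build_forecaster(n_features, encoded_dim=32, q=0.5):
    """Small LSTM regression head trained for one quantile of the next time step."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(None, encoded_dim)),
        tf.keras.layers.Dense(n_features),
    ])
    model.compile(optimizer="adam", loss=quantile_loss(q))
    return model

# One model per quantile (10th, 50th, 90th); an observation outside [low, high] is flagged.
def flag_anomalies(y_obs, y_low, y_high):
    return (y_obs < y_low) | (y_obs > y_high)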


4 Experimental Evaluation

In this section, we design and conduct extensive experiments to evaluate and compare the performance of DeepConAD with current state-of-the-art methods using both real and artificial datasets. The section starts with a description of the datasets used, which is followed by the experimental setup, and finally the results and discussion.

4.1 Datasets Descriptions

DeepConAD is evaluated using the Yahoo Webscope dataset, which is a commonly used anomaly benchmark dataset in the literature. The Yahoo Webscope dataset is a publicly available dataset released by Yahoo Labs. The benchmark has been widely used in research to validate anomaly detection algorithms and to determine the accuracy of detection of various types of anomalies, including outliers and change-points [12, 13]. This dataset was chosen because of the availability of anomaly labels that we can use to validate our model. The dataset consists of 367 real and synthetic time series with labelled anomaly points. The real datasets consist of time series representing the metrics of various Yahoo services. The synthetic datasets consist of time series that exhibit trend, noise, and seasonality changes, with anomalies present at random positions. Each time series contains 1,420 to 1,680 instances. The benchmark dataset is further divided into four sub-benchmarks: A1Benchmark, A2Benchmark, A3Benchmark, and A4Benchmark. In each dataset file, there is a Boolean attribute named "is_anomaly" with values 1 and 0 that indicates whether the value at a particular timestamp is an anomaly or not. Because we are using an unsupervised learning approach in our evaluation, we drop the label attribute from each dataset.

4.2 Experimental Setup

All the experiments are run on the same computer, with an Intel Core i7 processor, the Windows operating system and Python (Anaconda 3.7) configured with deep learning libraries. 60% of each dataset is considered as the training set and the remaining 40% as the test set. Similarly, to test the power of the model on unseen time series data, the training set is further divided, and 20% is used as unseen data to validate the model. Window sizes of 35 and 45 were used, which gave an optimal performance as demonstrated in [13]. Similarly, 100 iterations were used as the number of times the model was evaluated, with a dropout probability of 0.5. The F-score was used as an evaluation metric for DeepConAD and all anomaly detection methods used in our comparison; it evaluates the models in terms of the number of detected and rejected point anomalies. Average F-scores (5) per sub-benchmark are reported for each anomaly detection method.

\[ F\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{5} \]
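As a minimal illustration of this metric, the sketch below computes precision, recall and the F-score from binary point-anomaly labels; the function name is ours, and an equivalent result could be obtained with scikit-learn's f1_score.

def f_score(true_labels, predicted_labels):
    """Point-anomaly F-score from binary label sequences (1 = anomaly)."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0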


4.3 Experimental Results and Discussion

This section contains two sets of results. It starts by demonstrating how the autoencoder recognizes the pattern of the time series, thereby improving the prediction performance of the model with uncertainty estimation. Then, it shows the performance of DeepConAD on anomaly detection and its comparison with other anomaly detection methods.

Prediction Performance and Uncertainty Estimation. Table 1 reports the Mean Absolute Error (MAE) and related uncertainties of DeepConAD and a single LSTM model using the validation set. As can be observed from the table, DeepConAD has an overall improvement of at least 1% in accuracy and at least a 0.3% lower degree of uncertainty on all the datasets. With this result, we can assert that the LSTM autoencoder used in DeepConAD improves its performance thanks to its ability to extract important unseen features from the time series.

Table 1. MAE and its relative uncertainties for the DeepConAD and LSTM models

Datasets       Models      MAE      Uncertainties
A1Benchmark    DeepConAD   0.0296   0.0008
               LSTM        0.0824   0.0009
A2Benchmark    DeepConAD   0.0342   0.0006
               LSTM        0.0386   0.0007
A3Benchmark    DeepConAD   0.0253   0.0002
               LSTM        0.0370   0.0007
A4Benchmark    DeepConAD   0.0288   0.0003
               LSTM        0.0332   0.0006

In order to demonstrate the performance of DeepConAD on anomaly detection, the original A1Benchmark time series is depicted against the prediction interval in Fig. 2, with the actual values in red and blue dots showing the interval range between the predicted quantiles. As illustrated in the figure, the quantile interval range (blue dots) is higher in the periods of uncertainty, and the model tends to generalize well in the other cases. We then investigated the periods of high uncertainty in more depth and noticed that they coincide with the anomaly points in the original labelled time series, which shows the performance of DeepConAD in detecting anomalous points within the time series.

Performance Comparison. In this subsection, we describe the performance comparison of DeepConAD with other state-of-the-art anomaly detection methods. On a detailed level, Table 2 shows a comparison of the technique with the DeepAnT [13] and LSTM-AD [10] anomaly detection methods. The table shows that our method outperforms the other methods on three sub-benchmarks, whereas for A2Benchmark our approach and DeepAnT are joint runners-up, both being outperformed by the LSTM-AD model. In this table, it is


Fig. 2. An example of identified anomalous points in A1Benchmark time series is shown in this figure. Original time series is shown in red lines and quantile interval in blue dots.

also important to note that our method performs better than LSTM-AD on three sub-benchmarks and performs only slightly worse on one sub-benchmark. This demonstrates how the autoencoder improves the prediction performance by extracting essential features from the time series.

Table 2. Average F-score of DeepConAD, LSTM-AD, and DeepAnT on the Yahoo dataset

Yahoo dataset   DeepConAD   LSTM-AD   DeepAnT
A1Benchmark     0.57        0.44      0.46
A2Benchmark     0.94        0.97      0.94
A3Benchmark     0.98        0.72      0.87
A4Benchmark     0.98        0.59      0.68

5 Conclusion and Future Work

This paper presented a new unsupervised anomaly detection method (DeepConAD) that utilizes uncertainty estimates from a prediction model to detect anomalies in multivariate time series. We employ window-based sequence learning using an LSTM autoencoder and an LSTM forecaster for predicting the next time step based on previous time steps. In this context, an encoder is used to encode the input sequence and capture the patterns of the time series. This encoded sequence is then passed to a prediction model, which makes the next-step prediction for each value along with the uncertainty estimate in the output sequence. This is achieved by developing a quantile regression model, which approaches the regression problem as estimating a continuous probability distribution instead of a single value or sequence of values. By computing the upper and lower


quantiles of the model's predicted output, we assume the model has captured the most likely values reality can take. The interval between upper and lower quantiles is then computed to determine whether the model finds any anomalies: observed values that fall outside the 95% confidence interval are considered anomalies. Experiments were conducted to evaluate the performance of DeepConAD and to compare it with other state-of-the-art anomaly detection methods. The results show that in most cases, DeepConAD outperforms the other methods. DeepConAD also handles the multivariate and temporal characteristics of time series data and is therefore suitable for domains where unlabeled time series are collected from heterogeneous sensors. Our future work will focus on extending the method to include an anomaly prediction part that will use the anomaly labels obtained with the current approach to train a classifier for the prediction of anomalies. Furthermore, an online model will be explored to detect both concept drift and anomalies from sensor streams.

References 1. Wang, J.: Outlier detection in big data. J. Clean. Prod. 16(15), 2862 (2014) 2. Reunanen, N., Räty, T., Jokinen, J.J., Hoyt, T., Culler, D.: Unsupervised online detection and prediction of outliers in streams of sensor data. Int. J. Data Sci. Anal. 9(3), 285–314 (2019) 3. Da Xu, L., He, W., Li, S.: Internet of things in industries: a survey. IEEE Trans. Ind. Inform. 10(4), 2233–2243 (2014) 4. Teng, M.: Anomaly detection on time series. In: International Conference on Progress in Informatics and Computing, vol. 1, pp. 603–608 (2010) 5. Galeano, P., Pena, D., Tsay, R.S.: Outlier detection in multivariate time series by projection pursuit. J. Am. Stat. Assoc. 101, 654–669 (2006) 6. Gupta, M., Gao, J., Aggarwal, C.C., Han, J.: Outlier detection for temporal data: a survey. IEEE Trans. Knowl. Data Eng. 26(9), 2250–2267 (2014) 7. Theissler, A., Dear, I.: An anomaly detection approach to detect unexpected faults in recordings from test drives. In: Proceedings of the WASET International Conference on Vehicular Electronics and Safety 2013, vol. 7, no. 7, pp. 195–198 (2013) 8. Sangha, M.S., Yu, D.L., Gomm, J.B.: Sensor fault diagnosis for automotive engines with real data evaluation. Multicr. Int. J. Eng. Sci. Technol. 3(8), 13–25 (2011) 9. Kandhari, R., Chandola, V., Banerjee, A., Kumar, V., Kandhari, R.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–6 (2009) 10. Malhotra, P.A.P Vig, L., Shroff, G., Rinard, M.: Long short term memory networks for anomaly detection in time series. In: Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges (Belgium), 22–24 April 2015, (2015) 11. Aldosari, M.S.: Unsupervised anomaly detection in sequences using long short term memory recurrent neural networks. PhD Diss. Georg. Mason Univ., p. 98 (2016) 12. Saurav, S., Malhotra, P., Vishnu, T.V., Gugulothu, N., Vig, L., Agarwal, P., Shroff, G.: Online anomaly detection with concept drift adaptation using recurrent neural networks. In: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data - CoDS-COMAD 2018, pp. 78–87 (2018) 13. Munir, M., Siddiqui, S.A., Dengel, A., Ahmed, S.: DeepAnT: a deep learning approach for unsupervised anomaly detection in time series. IEEE Access 7, 1991–2005 (2019) 14. Gugulothu, N., Vishnu, T.V., Malhotra, P., Vig, L., Agarwal, P., Shroo, G.: Predicting remaining useful life using time series embeddings based on recurrent neural networks. In: 2nd ML PHM Work. SIGKDD 2017, vol. 10 (2017)


15. Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P., Shroff, G.: LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv.org, July 2016 16. Schreyer, M., Sattarov, T., Borth, D., Dengel, A., Reimer, B.: Detection of anomalies in large scale accounting data using deep autoencoder networks arXiv:1709.05254, September 2017 17. Amarbayasgalan, T., Jargalsaikhan, B., Ryu, K.: Unsupervised novelty detection using deep autoencoders with density based clustering. Appl. Sci. 8(9), 1468 (2018) 18. Zhu, L., Laptev, N.: Deep and confident prediction for time series at Uber. In: IEEE International Conference on Data Mining Workshops, ICDMW, vol. 2017–November, pp. 103–110 (2017) 19. Hyndman, R.J., Wang, E., Laptev, N.: Large-scale unusual time series detection. In: Proceedings - 15th IEEE International Conference on Data Mining Workshop, ICDMW 2015, pp. 1616–1619 (2016) 20. Keogh, E., Lonardi, S., Chiu, B.Y.: Finding surprising patterns in a time series database in linear time and space. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2002, p. 550 (2002) 21. Keogh, E., Lin, J., Fu, A.: HOT SAX: efficiently finding the most unusual time series subsequence. In: Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 226–233 (2005) 22. Bontemps, L., Cao, V.L., McDermott, J., Le-Khac, N.A.: Collective anomaly detection based on long short-term memory recurrent neural networks. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNCS, vol. 10018, pp. 141–152 (2016) 23. Chauhan, S., Vig, L.: Anomaly detection in ECG time signals via deep long short-term memory networks. In: Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015 (2015) 24. Hochreiter, S., Schmidhuber, J.J.: Long short-term memory. Neural Comput. 9(8), 1–32 (1997)

Reduced Order Modeling Assisted by Convolutional Neural Network for Thermal Problems with Nonparametrized Geometrical Variability

Fabien Casenave1(B), Nissrine Akkari1, and David Ryckelynck2

1 Safran Tech, Rue des Jeunes Bois, 78114 Châteaufort, Magny-Les-Hameaux, France
[email protected]
2 Centre des Matériaux, Mines ParisTech PSL Research University, CNRS UMR 7633, 63-65 rue Henri Auguste Desbruères, Corbeil-Essonnes, France

Abstract. In this work, we consider a nonlinear transient thermal problem numerically solved by a high-fidelity model. The objective is to derive fast approximations of the solutions to this problem, under nonparametrized variability of the geometry, and the convection and radiation boundary conditions, using physics-based reduced order models (ROM). Nonparametrized geometrical variability is a challenging task in model order reduction, which we propose to address using deep neural networks. First, a convolutional neural network (CNN) is trained to compute the discretization error of a fastly simulated solution on a coarse mesh, under the aforementioned geometry and boundary conditions variability. Then, for a fixed geometry, a ROM is constructed under the boundary conditions variability; the data used to construct the ROM being the coarse solutions and the CNN predicted discretization errors. We illustrate that in all our tested configurations, the reduced order model improves the accuracy of the coarse and CNN predictions. Keywords: Convolutional neural network · Thermal problem · Physical reduced order modeling · Nonparametrized geometrical variability · Discretization error

1 Introduction

In the industry, fast procedures to simulate physical problems are very important in design and uncertainty quantification processes. Meta modeling, deep learning or physical reduced order modeling are possible candidates to speed up numerical simulations. All these techniques require first solving some instances of the problem, in order to construct a procedure able to approximate the solution under new variability. In this work, the industrial problem of interest is the thermal analysis of mechanical parts of aircraft engines. Such parts are subjected


to thermal loads, and their strength is usually limited by the maximum temperature reached by the material: too close to the melting point, the mechanical properties of the part quickly deteriorate. We find in the literature many papers where the authors are interested in using regression in deep learning combined with a physics-based approach in order to control the sharpness of the data of interest. In [16], it is proposed to train convolutional neural networks (CNNs) without any labeled data, where the high-fidelity partial differential operators are incorporated in the likelihood loss functions. These CNNs are called physics-constrained deep learning models. In [7], the authors propose to use machine learning in order to quantify errors made by some approximations of partial differential equations. These approximations could be obtained at the end of an iterative scheme, from a lower-fidelity model or even from projection-based reduced order models. The machine learning framework for model errors is based on a regression method that constructs the noise model as a zero-mean Gaussian random variable. In [13], a Gaussian process regression method is used to construct a set of response functions for the errors between the high-fidelity model and a parametric nonintrusive reduced order model (PNIROM). In [1], physics-based data-driven models are proposed, where the distance to the admissible points in a physical sense is used to find the optimal weights of a regression model. In [9], it is shown that one-dimensional flow models can be used to constrain the outputs of deep neural networks, so that their predictions satisfy the conservation of mass and momentum principles. In [11], a new method is proposed to transform the source domain knowledge to fit the target domain, using a deep learning method and limited samples from the target domain to transform the source or input domain dataset. In [15], it is proposed to find feature representations which minimize the distance between source and target domains, as this is not naturally done in classical deep learning methods: a novel semi-supervised representation deep learning framework is proposed, based on a softmax regression model auto-encoder for the manifold regularization. In this work, we are interested in the fast simulation of nonlinear transient thermal problems, under a nonparametrized variability of the geometry, as well as of the boundary conditions. We propose to combine physical reduced order models and deep learning in the following fashion: a first coarse prediction is computed by solving the problem on a coarse mesh. We then use CNNs to predict the error of this coarse prediction with respect to the reference solution, namely the discretization error. This new prediction is then improved using physical reduced order modeling. More precisely, we use the snapshots Proper Orthogonal Decomposition [3,12] combined with the Empirical Cubature Method [8] to reduce this nonlinear problem. Having a physical reduced order model compute the final prediction has numerous advantages: the boundary conditions and physical equations are weakly satisfied, i.e. on the reduced order basis (besides, the homogeneous Dirichlet boundary conditions are exactly enforced), the constitutive law is exactly solved, and in some cases, we even have an error estimate of the approximation [10]. We illustrate in our numerical applications


that the reduced order model improves the prediction of the coarse model and the CNNs in all the tested configurations. In what follows, we first present in Sect. 2 our problem of interest: a transient nonlinear thermal problem under a nonparametrized variability of the geometry, as well as the convection and radiation boundary conditions. The proposed CNNs to predict the discretization error of the coarse solution are presented in Sect. 3 and elements on physical reduced order modeling are provided in Sect. 4. Our proposed framework combining CNNs with reduced order modeling is detailed in Sect. 5, and numerical applications are given in Sect. 6.

2 Description of the Problem of Interest

Consider a structure of interest denoted Ω, such that the boundary ∂Ω is partitioned into d = 4 surfaces: $\partial\Omega = \cup_{i=1}^{d} \Gamma^{(i)}$, $\Gamma^{(i)\circ} \cap \Gamma^{(j)\circ} = \emptyset$, $1 \le i, j \le d$, where $\cdot^{\circ}$ denotes the interior of a set, see Fig. 1.


Fig. 1. Representation of the structure of interest Ω, with a partitioning of the boundary $\partial\Omega = \cup_{i=1}^{d} \Gamma^{(i)}$, $\Gamma^{(i)\circ} \cap \Gamma^{(j)\circ} = \emptyset$, $1 \le i, j \le d = 4$.

We are interested in the prediction of the temperature field T(x, t) over the structure Ω during a time interval [0, t_f]. In the absence of work (the geometry of the structure is fixed) and of volumic heat source, the heat equation reads

\[ \frac{\partial u}{\partial t}(x, t) + \nabla \cdot q(x, t) = 0, \tag{1} \]

where u is the volumic internal energy (in J.m⁻³) and q is the heat flux density (in J.s⁻¹.m⁻²). We make the following additional assumptions: the density ρ (in kg.m⁻³), the massic heat capacity c_p (in J.kg⁻¹.K⁻¹) and the heat conductivity λ (in J.s⁻¹.m⁻¹.K⁻¹) are supposed uniform over Ω and constant in time. The massic internal energy U = u/ρ (in J.kg⁻¹) is assumed to depend only on the temperature. Moreover, the heat exchanges between the structure and the exterior are modeled by convection and radiation boundary conditions:

\[ q(x,t) \cdot n(x) = h(x,t)\,\big(T(x,t) - T_{1,e}(x,t)\big) + (\epsilon\sigma)(x,t)\,\big(T^4(x,t) - T^4_{2,e}(x,t)\big), \quad (x,t) \in \partial\Omega \times [0, t_f], \]

where h denotes the convection coefficient (in J.s⁻¹.m⁻².K⁻¹), σ the Stefan–Boltzmann constant (in J.s⁻¹.m⁻².K⁻⁴), ε the emissivity coefficient (dimensionless), and T_{1,e} and T_{2,e} are external temperature values. The coefficients σ and ε are fixed, and h is uniform over each surface and constant in time:

\[ \begin{cases} h(x,t) = h^{(i)}, & \Gamma^{(i)} \times [0,t_f] \\ T_{1,e}(x,t) = T^{(i)}_{1,e}(t), & \Gamma^{(i)} \times [0,t_f] \\ (\epsilon\sigma)(x,t) = \epsilon\sigma, & \partial\Omega \times [0,t_f] \\ T_{2,e}(x,t) = T_{2,e}(t), & \partial\Omega \times [0,t_f] \end{cases} \tag{2} \]

Under these assumptions, T(x,t), x ∈ Ω, t ∈ [0, t_f], is the solution of the following system of equations:

\[ \begin{cases} \rho c_p \dfrac{\partial T}{\partial t}(x,t) - \lambda \Delta T(x,t) = 0, & \Omega \times [0,t_f] \\ \lambda \nabla T(x,t) \cdot n(x) = h^{(i)}\big(T(x,t) - T^{(i)}_{1,e}(t)\big) + \epsilon\sigma\big(T^4(x,t) - T^4_{2,e}(t)\big), & \Gamma^{(i)} \times [0,t_f],\ 1 \le i \le d \\ T(x, t=0) = T_{init}(x), & \Omega \times \{0\} \end{cases} \tag{3} \]

The strong form (3) of the partial differential equations of interest is weakened into a variational formulation, then discretized in space using the Galerkin method with finite elements, and in time using a backwards Euler time stepping scheme. This leads to a system of nonlinear equations, for which an approximate solution is computed using a Newton algorithm.

The solution temperature is obtained as $T(x, t_s) = \sum_{k=1}^{N} T_k(s)\,\varphi_k$, $1 \le s \le J$, where $\{0, t_1, ..., t_J = t_f\}$ is the time discretization and $\{\varphi_k\}_{1 \le k \le N}$, the finite element basis of cardinal N, is the space discretization. At time step s + 1, the p-th iteration of the Newton algorithm writes:

\[ \begin{cases} T^{(0)}(s+1) = T(s) \\ \dfrac{DF_s}{DV}\big(T^{(p)}(s+1)\big)\,\big(T^{(p+1)}(s+1) - T^{(p)}(s+1)\big) = -F_s\big(T^{(p)}(s+1)\big), \end{cases} \tag{4} \]

where $T^{(p)}(s+1) \in \mathbb{R}^N$ is the p-th iteration at time step s + 1 of the solution temperature coefficients on the finite element basis, T(s) is the known solution at the previous time step s, and where, for $V \in \mathbb{R}^N$, $F_s(V) \in \mathbb{R}^N$, $0 \le s \le J-1$, is such that

\[ \begin{aligned} F_{s,l}(V) = {} & \frac{\rho c_p}{dt} \sum_{k=1}^{N} \left( \int_\Omega \varphi_k(x)\varphi_l(x)\,dx \right) (V_k - T_k(s)) + \lambda \sum_{k=1}^{N} \left( \int_\Omega \nabla\varphi_k(x)\cdot\nabla\varphi_l(x)\,dx \right) V_k \\ & - \epsilon\sigma \int_{\partial\Omega} \left( \Big( \sum_{k=1}^{N} V_k \varphi_k(x) \Big)^4 - T^4_{2,e}(s+1) \right) \varphi_l(x)\,dx \\ & - \sum_{i=1}^{d} h^{(i)} \int_{\Gamma^{(i)}} \left( \sum_{k=1}^{N} V_k \varphi_k(x) - T^{(i)}_{1,e}(s+1) \right) \varphi_l(x)\,dx, \quad 1 \le l \le N, \end{aligned} \tag{5} \]

and where $\dfrac{DF_s}{DV}\big(T^{(p)}(s+1)\big) \in \mathbb{R}^{N \times N}$ is such that

\[ \begin{aligned} \left( \frac{DF_s}{DV} \right)_{k,l} \big(T^{(p)}(s+1)\big) = {} & \frac{\partial F_{s,k}}{\partial V_l}\big(T^{(p)}(s+1)\big) \\ = {} & \frac{\rho c_p}{dt} \int_\Omega \varphi_k(x)\varphi_l(x)\,dx + \lambda \int_\Omega \nabla\varphi_k(x)\cdot\nabla\varphi_l(x)\,dx \\ & - 4\epsilon\sigma \int_{\partial\Omega} \Big( \sum_{k'=1}^{N} T^{(p)}_{k'}(s+1)\,\varphi_{k'}(x) \Big)^3 \varphi_k(x)\varphi_l(x)\,dx \\ & - \sum_{i=1}^{d} h^{(i)} \int_{\Gamma^{(i)}} \varphi_k(x)\varphi_l(x)\,dx, \quad 1 \le k, l \le N. \end{aligned} \tag{6} \]

The Newton algorithm iterations (4) are stopped when $\dfrac{\|T^{(p+1)}(s+1) - T^{(p)}(s+1)\|_2}{\|b^{ext}(s+1)\|_2} \le \epsilon^{tol}_{HF}$,

where $\|\cdot\|_2$ denotes the Euclidean norm on

$\mathbb{R}^N$, and where $b^{ext}_l(s+1) = \epsilon\sigma T^4_{2,e}(s+1) \int_{\partial\Omega} \varphi_l(x)\,dx + \sum_{i=1}^{d} h^{(i)} T^{(i)}_{1,e}(s+1) \int_{\Gamma^{(i)}} \varphi_l(x)\,dx$, $1 \le l \le N$. The problem (3), solved by the Newton algorithm (4), is our reference high fidelity model (HFM). This HFM will be solved to generate the data needed in our training tasks. In our application, we are interested in fast numerical strategies to approximate the solution temperature under nonparametrized variations of the geometry Ω, the convection coefficients h and the time evolutions of the external temperatures T_{1,e} and T_{2,e}. To generate the random geometries, the coordinates of the four corners of the unit 2D square are shifted by a value taken randomly following a uniform probability distribution over [−0.25, 0.25]. In practice, the unit square is first meshed, and then the mesh is morphed using radial basis function interpolation, see [5]. The convection coefficients h^{(i)}, 1 ≤ i ≤ 4, are taken randomly following a uniform probability distribution over [0, 10000]. To generate the evolutions of T_{1,e} and T_{2,e}, for each temperature, 11 values are taken following a uniform probability distribution over [0, 1000], corresponding to the time instances 0, 100, ..., 1000, and the time evolution is obtained by linear interpolation between these values, see Fig. 2 for an example.



Fig. 2. Example of external temperature time evolution (temperature in °C versus time in s)

Remark 1 (external temperatures). The external temperatures $T^{(i)}_{1,e}$, 1 ≤ i ≤ 4, and T_{2,e} can be different at the same time since they do not model the same physical phenomenon. The convection models the heat exchanged by contact at the surface $\Gamma^{(i)}$, where $T^{(i)}_{1,e}$ is the temperature of the fluid medium in contact. The radiation models the heat exchanged with the exterior surfaces seen by ∂Ω, in the linear optics sense, where T_{2,e} is the temperature of these external seen surfaces (assumed here uniform in space).

Remark 2 (nonparametrized variability). The fact that we choose some parametrization to generate our data in the previous paragraph does not contradict the fact that our variability is nonparametrized. When exploiting our fast approximation, we want to be able to use more general variations. In practice, the boundary conditions come from numerical simulations performed by another solver for another physics, for which we do not know any explicit parametrization.

3 Convolutional Neural Networks

In this section, we consider pairs of meshes for the same geometry: one set of coarse meshes for which the HFM is solved very fast, and one set of fine meshes for which the HFM provides our reference solutions, in a runtime considered too long for efficient exploitation. A possibility for deriving fine solutions from coarse ones is to use super-resolution strategies. These strategies are popular for computational fluid dynamics simulations, where the data is often produced as piece-wise constant fields on grids by finite volume schemes: the coarse solution is already available in the form of a coarse grid, a natural candidate for the input of a deep neural network. In our case, taking a subsampling of the coarse


solutions would neglect a lot of available information: the solutions of (most) finite element problems are available as continuous functions over the complete structure through the finite element interpolation. We recall that the difference between a coarse solution and the reference solution at each point of the structure is called the discretization error. Hence, we propose to use deep learning to learn the discretization error of the coarse solutions. The proposed approximation is the prediction of the discretization error by the neural network, added to the coarse solution. The considered data consist of 125 simulations of 100 time steps, for random geometries and boundary conditions, each computed on a coarse and a fine mesh. We keep 100 simulations for the training set, and 25 for the testing set. The coarse solution and the difference between the fine and the coarse solutions (the discretization errors) are projected on a regular grid of 81 by 81 cells. The obtained tensors are scaled between 0 and 1. The regression task addressed by the CNNs has as input a tensor of size (10000 × 81 × 81 × 2): 100 time steps for 100 random geometries as the number of samples, a 2D grid of 81 × 81 cells, and 2 channels: one for the projected coarse field and another one for the mask of the fine mesh projected onto the grid. The output tensor is of size (10000 × 81 × 81 × 1): still 10000 samples on the same 81 × 81-cell grid, and only one channel: the prediction of the discretization error. The data preparation for the training of the CNNs is illustrated in Fig. 3. For the computation of the prediction using the trained CNNs, the input data is prepared in the same way as for the training (the scaling of the data is done using the same function as the one fitted on the training data). After the CNNs have been applied to the data, an inverse projection is done to represent the data on the fine mesh, see Fig. 4. Notice that both the coarse and the fine meshes are two different approximations of an underlying geometry. To help the CNNs and prevent them from having to learn how to transform the coarse boundary into the fine one, all the projection and inverse projection operations are done so that the mask of the fine mesh is used as geometrical support. Depending on the convexity of the boundary, data is suppressed or extrapolated so that the fine mask is always used, see Fig. 5. Two CNNs are trained for 24 h on two Nvidia Quadro K5200 Graphics Processing Units using keras [4]; their architectures are represented in Fig. 6. They consist of a succession of 2D convolution layers with an increasing number of filters, followed by 2D deconvolution layers with a decreasing number of filters. In all layers, the kernel size is (3 × 3); 'tanh' activation functions are chosen for the first CNN whereas 'relu' ones are chosen for the second CNN. The batch size is 100, and the Adam optimizer is chosen with a learning rate of 10⁻⁴ and the mean square error metric. In this work, we use for predictions the average of the predictions of these two CNNs. Some other tested architectures led to worse performance on our particular data, namely the stochastic gradient descent optimizer, and adding 50% dropout layers after each convolution and deconvolution layer.
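For illustration, a minimal Keras sketch of a CNN in the spirit described above is given below: a Conv2D encoder with an increasing number of filters, a Conv2DTranspose decoder with a decreasing number of filters, (3 × 3) kernels, the Adam optimizer with learning rate 10⁻⁴ and a mean square error loss. The exact number of layers and of filters per layer is given by Fig. 6 in the paper and is not reproduced here; the filter counts below are placeholders of our own.

import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(activation="tanh", grid=81, channels_in=2):
    """Sketch of one discretization-error CNN: Conv2D encoder + Conv2DTranspose decoder.

    Filter counts are illustrative placeholders, not the values of the paper.
    """
    inputs = tf.keras.Input(shape=(grid, grid, channels_in))
    x = inputs
    for filters in (16, 32, 64):                      # increasing number of filters
        x = layers.Conv2D(filters, (3, 3), padding="same", activation=activation)(x)
    for filters in (64, 32, 16):                      # decreasing number of filters
        x = layers.Conv2DTranspose(filters, (3, 3), padding="same", activation=activation)(x)
    outputs = layers.Conv2D(1, (3, 3), padding="same")(x)  # predicted discretization error
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
    return model

# The second CNN uses 'relu' activations; predictions are averaged over the two networks.
cnn_tanh = build_cnn("tanh")
cnn_relu = build_cnn("relu")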


Fig. 3. Data manipulation for the training of a CNN.

Two predictions using data from the testing set are presented in Fig. 7. We see that the CNNs successfully reproduce patterns and values of the discretization error on a configuration not seen during the training.

Remark 3 (Prediction by windows). Notice that the learning of the discretization error using the CNNs is restricted to a square subdomain (a window), see the images at the top of Fig. 3. When using the CNNs, the discretization error is not predicted outside the window, see the bottom-right image in Fig. 4. This enables us to use the CNNs on larger structures by predicting the discretization error window by window, as done in Sect. 6.2 and as sketched below.

In this section, we have constructed a framework to generate fast approximations of the HFM (3) from rapidly computed coarse solutions using CNNs. As we will see in Sect. 6, such predictions can in some cases lead to worse L∞-errors with respect to the coarse solution.
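The window-by-window use of the trained CNNs mentioned in Remark 3 could be sketched as follows; the tiling logic and the names are ours, and a trained model with the (81 × 81 × 2) input described above is assumed.

import numpy as np

def predict_by_windows(cnn, coarse_grid, fine_mask, window=81):
    """Apply a trained discretization-error CNN window by window on a larger grid.

    coarse_grid, fine_mask: 2D arrays of the same shape, larger than `window`.
    Returns the predicted discretization error on the full grid.
    """
    ny, nx = coarse_grid.shape
    error = np.zeros((ny, nx))
    for j0 in range(0, ny, window):
        for i0 in range(0, nx, window):
            j1, i1 = min(j0 + window, ny), min(i0 + window, nx)
            patch = np.zeros((window, window, 2))
            patch[: j1 - j0, : i1 - i0, 0] = coarse_grid[j0:j1, i0:i1]
            patch[: j1 - j0, : i1 - i0, 1] = fine_mask[j0:j1, i0:i1]
            pred = cnn.predict(patch[np.newaxis, ...])[0, :, :, 0]
            error[j0:j1, i0:i1] = pred[: j1 - j0, : i1 - i0]
    return error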


Fig. 4. Data manipulation for the prediction using a trained CNN.

4 Model Order Reduction

Physical reduced order model (ROM) techniques can also be used to accelerate the computation of the high fidelity problem (3). ROM procedures consist of two stages: an offline stage, where information from the HFM is learned using machine learning, and an online stage, where the reduced order model is constructed in the form of an approximation of the physical equations and solved. Expensive tasks are computed during the offline stage, whereas the online stage is required to be efficient, since only operations whose computational complexity is independent of the number N of degrees of freedom of the HFM are usually allowed. In our case, as for most physical ROM techniques, the online approximation consists in solving the same equations as the HFM, namely the Newton algorithm (4), but where the Galerkin method is applied on a particular basis, called the reduced-order basis (ROB), instead of the finite element basis. The ROB being


Fig. 5. Zoom over a couple of coarse and fine meshes represented together: depending on the convexity of the boundary, the coarse or the fine mask is larger.

Fig. 6. Architecture of the two tested CNNs.

specifically tailored for our problem of interest, its construction needs solving instances of the HFM. The cardinality n of the ROB is much smaller than N , leading to reduced runtimes for the reduced problem. As far as the ROM task is concerned, our objective contains important challenges. First, the nonparametrized variabilities of the geometry are moved out of the scope of the ROM: each time a new geometry is considered, a new complete ROM procedure has to be restarted. Then, the other challenges are the nonparametrized variability for the external temperature time evolutions and


Fig. 7. Two predictions using data from the testing set at two different time steps. Left: predictions from the CNNs, right: exact discretization error.

the nonlinearity of the equations of the HFM; these have been addressed recently in the literature. The ROB is obtained using a principal component analysis (PCA) (called proper orthogonal decomposition in the ROM community [12]) on a collection of solution temperature fields. The efficiency of the reduced problem is obtained by replacing the costly integrals in (5) and (6) with precomputed tensors when possible, and with reduced quadrature schemes constructed using a Non-Negative Orthogonal Matching Pursuit algorithm otherwise, see [14, Algorithm 1], [6] and [8] (this last approximation is often called hyperreduction).
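A minimal sketch of how such a reduced-order basis can be extracted from snapshots by a truncated singular value decomposition (the usual algebraic route to snapshot POD) is given below; the snapshot matrix layout and the energy-based truncation criterion are our own assumptions for illustration, not the exact choices of the paper.

import numpy as np

def snapshot_pod(snapshots, tol=1e-4):
    """Compute a reduced-order basis from a snapshot matrix.

    snapshots: array of shape (N, n_snapshots), one temperature field per column.
    Returns the (N, n) reduced-order basis, with n chosen from the singular values.
    """
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    n = int(np.searchsorted(energy, 1.0 - tol) + 1)
    return U[:, :n]

# Reduced coordinates of a field T (shape (N,)) on the basis, and its reconstruction:
# coords = basis.T @ T;  T_approx = basis @ coords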

5 Proposed Framework

We recall our objective: the fast computation of the HFM (3) under nonparametrized variations of the geometry Ω, the convection coefficients h and


the time evolutions of the external temperatures T_{1,e} and T_{2,e}. In this section, we propose to improve the accuracy of the CNN predictions using a physical ROM in the following framework. First, CNNs are trained to provide a first prediction under the considered variabilities, using some training data. Then, for each geometry of interest (in the exploitation phase), we construct a ROM using some training data representing the variabilities of h, T_{1,e} and T_{2,e} in the ROM offline stage. The data used in the training of the ROM, on which the PCA is applied to construct the ROB, are the coarse solutions and the predictions of the discretization errors from the CNNs. Finally, in the ROM online stage, we can rapidly construct an approximation of the HFM for any variability of h, T_{1,e} and T_{2,e}. The framework is illustrated in Fig. 8.

Fig. 8. Proposed framework combining CNNs on discretization error and physical ROMs.

Using a ROM constructed with the projection of the coarse solution on the fine mesh would not improve the accuracy of the coarse prediction, since the ROB is a subspace of the linear space generated by the data. On the contrary, CNN predictions do not account for the underlying physics, as explained in [16], where physics-based constraints have been imposed in the loss function. In this work, the data provided to the ROM is the coarse solution enriched by the predictions of the CNNs. The intuition is that the ROM will filter this data to keep only what is relevant to the current configuration and partial differential equations; the data that is not relevant will be discarded by the ROM when solving the physical reduced problem. As stated in the introduction, the advantages of the ROM prediction are, in our case, that the boundary conditions and physical equations are satisfied on the ROB and that the constitutive law is exactly solved (more precisely, only on the reduced integration scheme for the terms needing hyperreduction in the online assembling). For these reasons, we expect the ROM predictions to be more accurate than the CNN ones.

6 Numerical Results

In this section, we apply the framework proposed in Sect. 5 to our problem of interest. For the moment, we do not consider the variabilities of h, T_{1,e} and T_{2,e} in the ROM: the CNNs are constructed under the complete variability, but when


considering the ROM, the offline and online stages are computed for one set of h and temporal profiles for T1,e and T2,e . In what follows, we denote Tcoarse the coarse temperature solution, TCNN the prediction temperature from the CNNs (namely, the prediction of the discretization error added to the coarse solution), TROM the prediction temperature of the ROM using the framework described in Sect. 5 and Tref the reference temperature solution (which is, in our case, the fine temperature solution). We define the following indicators, where Tpred is taken among Tcoarse , TCNN and TROM : – E ∞ (Tpred ) := max max |Tpred − Tref | (x, ts ), 1≤s≤J x∈Ω J

¯ ∞ (Tpred ) := 1 – E J – EL2 (Tpred ) :=

– E Q (Tpred ) := ts ) · n(x)dx, – Emat (Tpred ) :=

max |Tpred − Tref | (x, ts ),

x∈Ω s=1 

J 1 s=1 J Ω

1 J

2

(Tpred − Tref ) (x, ts )dx  , 2 max (T ) (x, t )dx ref s Ω 1≤s≤J 

J 

 s=1 QTpred (ts ) − QTref (ts ) , where QT (ts ) := ∇T (x, max |QTref (ts )| ∂Ω 1≤s≤J

1 D



|Tpred − Tref | (x , ts ), where D := {(x , s ) ∈ (Ω ×

(x ,s )∈D

{1, · · · , J}) | Tref (x , ts ) > 0.98 cardinal of D.

max

(x,s)∈(Ω×{1,··· ,J})

(Tref (x, ts ))} and D is the

¯ ∞ (Tpred ) and EL2 (Tpred ) quantify L∞ - and L2 The indicators E ∞ (Tpred ), E Q errors, E (Tpred ) quantifies the error on the prediction of the amount of heat exchanged with the exterior, and Emat (Tpred ) quantifies the error where the reference temperature is close to its maximal value. We explained in the introduction that this last indicator is related to the material integrity in an industrial context, where materials are pushed to the limit of their strength. 6.1

Geometrical Variabilities for Structures of the Same Size as the Training of the CNNs

We first consider random geometries located inside the limits of the grid used for the training of the CNNs of Sect. 3, so that we cannot have parts of the geometry going outside the grid, as illustrated in Fig. 3 and 4. The proposed framework is tested in the geometries illustrated in Fig. 9. The predicted discretization errors using CNNs and ROM, as well as the reference discretization errors, for the third geometry at t = 450 s are illustrated in Fig. 10. We notice that the ROM improves the CNNs prediction in the areas where the difference Tref −Tcoarse is maximal and minimal, and on the boundaries of the structure.

258

F. Casenave et al.

Fig. 9. Five geometries used for testing the proposed framework.

Fig. 10. Predicted discretization errors using CNNs and ROM, and reference discretization error, for the third geometry at t = 450 s.

The accuracy indicators of the different predictions for the five geometries, under five random instances of the boundary conditions, are provided in Table 1. In all cases and for all indicators, the ROM provided the best approximation, with a significant accuracy improvement for the last two indicators, more related to the physics of the problem and the industrial stakes. We notice that for the last two indicators, the CNNs predictions can be worse than the coarse ones. The duration of the different steps of the framework are given in Table 2. We remind that the ROM online stage is in computational complexity independent of N , the number of degrees of liberty of the associated HFM. Without accounting for the training of the CNNs, the duration of the complete procedure is approximately the same as the duration of the fine reference simulation. Here, we do not consider variability for the boundary conditions in the ROM. Suppose that the constructed ROM is accurate in a certain variability regime for the boundary conditions, then any new set of boundary condition in this regime can be computed to the price of the ROM online stage, which is 4.2 s in this case. The indicated durations for our framework can be improved by optimizing the code for the projection and inverse projection operations.

CNN-Assisted ROM for Thermal Problems

6.2

259

Geometrical Variability for Structures Larger Than the Training of the CNNs

In this experiment, we consider a structure 8 times larger than the previous ones (twice as high and four times as large), see Fig. 11. Table 1. Accuracy indicators applied to five different configurations. The bold results correspond to the more accurate prediction, and the underlined results to the least accurate one. Geometry 1 Geometry 2 Geometry 3 Geometry 4 Geometry 5 E∞

Tcoarse 45.1 ◦ C TCNN 45.4 ◦ C TROM 33.1 ◦ C

55.1 ◦ C 50.0 ◦ C 46.9 ◦ C

70.8 ◦ C 66.0 ◦ C 58.3 ◦ C

46.1 ◦ C 44.9 ◦ C 37.8 ◦ C

63.8 ◦ C 58.8 ◦ C 49.4 ◦ C

¯∞ E

Tcoarse 18.7 ◦ C TCNN 17.2 ◦ C TROM 16.6◦ C

29.2 ◦ C 26.6 ◦ C 25.9◦ C

32.5 ◦ C 30.0 ◦ C 26.9◦ C

23.1 ◦ C 21.0 ◦ C 19.2◦ C

29.8 ◦ C 27.6 ◦ C 24.0◦ C

EL2

Tcoarse 0.00926 TCNN 0.00753 TROM 0.00733

0.00872 0.00688 0.00650

0.0107 0.0080 0.0072

0.00713 0.00561 0.00525

0.00901 0.00699 0.00659

EQ

Tcoarse 0.0212 TCNN 0.0210 TROM 0.0097

0.0165 0.0165 0.0097

0.0191 0.0192 0.0129

0.0145 0.0146 0.0088

0.0228 0.0222 0.013

Emat Tcoarse 2.69 ◦ C TCNN 2.76 ◦ C TROM 2.16 ◦ C

14.2 ◦ C 13.4 ◦ C 7.6 ◦ C

6.48 ◦ C 7.33 ◦ C 4.77 ◦ C

5.04 ◦ C 4.98 ◦ C 3.75 ◦ C

3.65 ◦ C 3.58 ◦ C 3.23 ◦ C

Table 2. Duration of the different steps of the framework when testing with small geometries. Fine simulation Coarse simulation Projection coarse mesh to grid CNNs prediction Projection grid to fine mesh ROM offline ROM online

61.83 s 0.46 s 63.76 s 6.2 s 4.8 s 11.1 s 37 s 4.2 s

260

F. Casenave et al.

Fig. 11. Large geometry used for testing the proposed framework.

Fig. 12. Predicted discretization errors using CNNs and ROM, and reference discretization error, for the large geometry at t = 770 s. Table 3. Accuracy indicators applied to the configuration with the large geometry. The bold results correspond to the more accurate prediction, and the underlined results to the least accurate one. Large geometry ∞

Tcoarse 30.1 ◦ C TCNN 29.2 ◦ C TROM 28.9 ◦ C

¯∞ E

Tcoarse 13.5 ◦ C TCNN 18.8 ◦ C TROM 13.1 ◦ C

EL2

Tcoarse 0.00306 TCNN 0.00301 TROM 0.00242

EQ

Tcoarse 0.0184 TCNN 0.0237 TROM 0.0138

E

Emat Tcoarse 3.69 ◦ C TCNN 4.32 ◦ C TROM 3.50 ◦ C

CNN-Assisted ROM for Thermal Problems

261

Table 4. Duration of the different steps of the framework when testing with a large geometries. Fine simulation Coarse simulation Projection coarse mesh to grid CNNs prediction Projection grid to fine mesh ROM offline ROM online

483 s 2.4 s 340 s 61 s 10 s 90 s 176 s 1s

The predicted discretization errors using CNNs and ROM, as well as the reference discretization errors, for the large geometry at t = 770 s are illustrated in Fig. 12. We notice that the ROM improves the CNN predictions on the boundaries of the structure and removes some irregularities. The accuracy indicators of the different predictions for the large geometry, under random instances of the boundary conditions, are provided in Table 3. The comments made in Sect. 6.1 are also valid for this experiment. The durations of the different steps of the framework are given in Table 4. Notice that the faster ROM online stage with respect to the previous section is due to a less stringent accuracy requirement enforced for the ROM approximation in this case.

7 Conclusions and Future Works

In this work, we considered a transient nonlinear thermal problem, for which we proposed fast numerical approximations in a context of nonparametrized geometrical and temporal boundary condition scenario variability. This approximation is based on a first numerical resolution on a coarse mesh, from which the discretization error is predicted by convolutional neural networks. As illustrated in our numerical experiments by error indicators adapted to the physical problem, this first prediction can sometimes be less accurate than the coarse solution. This prediction is then improved by constructing a physical reduced order model on data composed of the coarse solution and the prediction of the discretization error by the neural networks. In all our numerical experiments, the prediction of the reduced order model is more accurate than both the coarse and the neural network predictions. Hence, among the data generated by the neural networks, the reduced order model only keeps those pertinent for the problem at hand. The numerical applications contain an experiment where the neural network predicts the discretization error window by window on a structure 8 times larger than the ones used during the training, and the reduced order model, constructed on the complete structure, could also improve the accuracy of the prediction.

For the moment, we do not consider variations of the boundary condition temporal scenarios in the reduced order model procedure, and the complete framework computes an approximation in the same runtime as the reference high-fidelity model in this case. As a perspective, we plan to consider this variability in the physical model order reduction, which should lead to practical speedups with respect to the high-fidelity model, in a many-queries context. With our second numerical application, we experimented with the use of the CNNs window by window. In a parallel computing context, the proposed framework can generate in parallel with distributed memory the data for the offline stage of the reduced order modeling method, which can also be computed in parallel with distributed memory using the procedure detailed in [2]. Another way to put this in perspective is: for any new geometry, a reduced order model strategy can be derived without having to solve any global equilibrium over the high-fidelity structure, since the data required to construct the reduced order model consists only of coarse computations and discretization errors computed locally in parallel by the CNNs. This must be confirmed by also increasing the prediction ability of the CNNs to provide better data to the reduced order modeling procedure.

References 1. Ayensa-Jim´enez, J., Doweidar, M.H., Sanz-Herrera, J.A., Doblar´e, M.: An unsupervised data completion method for physically-based data-driven models. Comput. Methods Appl. Mech. Eng. 344, 120–143 (2019) 2. Casenave, F., Akkari, N., Bordeu, F., Rey, C., Ryckelynck, D.: A nonintrusive distributed reduced order modeling framework for nonlinear structural mechanics - application to elastoviscoplastic computations. Int. J. Numer. Methods Eng. 121, 32–53 (2020) 3. Chatterjee, A.: An introduction to the proper orthogonal decomposition. Curr. Sci. 78(7), 808–817 (2000) 4. Chollet, F., et al.: Keras (2015). https://github.com/fchollet/keras 5. de Boer, A., van der Schoot, M.S., Bijl, H.: Mesh deformation based on radial basis function interpolation. Comput. Struct. 85(11), 784–795 (2007). Fourth MIT Conference on Computational Fluid and Solid Mechanics 6. Farhat, C., Avery, P., Chapman, T., Cortial, J.: Dimensional reduction of nonlinear finite element dynamic models with finite rotations and energy-based mesh sampling and weighting for computational efficiency. Int. J. Numer. Methods Eng. 98(9), 625–662 (2014) 7. Freno, B.A., Carlberg, K.T.: Machine-learning error models for approximate solutions to parameterized systems of nonlinear equations. Comput. Methods Appl. Mech. Eng. 348, 250–296 (2019) 8. Hern´ andez, J.A., Caicedo, M.A., Ferrer, A.: Dimensional hyper-reduction of nonlinear finite element models via empirical cubature. Comput. Methods Appl. Mech. Eng. 313, 687–722 (2017) 9. Kissas, G., Yang, Y., Hwuang, E., Witschey, W.R., Detre, J.A., Perdikaris, P.: Machine learning in cardiovascular flows modeling: predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 358, 112623 (2020)


10. Ryckelynck, D., Gallimard, L., Jules, S.: Estimation of the validity domain of hyper-reduction approximations in generalized standard elastoviscoplasticity. Adv. Model. Simul. Eng. Sci. 2(1), 19 (2015) 11. Salaken, S.M., Khosravi, A., Nguyen, T., Nahavandi, S.: Seeded transfer learning for regression problems with deep learning. Expert Syst. Appl. 115, 565–577 (2019) 12. Sirovich, L.: Turbulence and the dynamics of coherent structures, parts I, II and III. Q. Appl. Math. XLV, 561–590 (1987) 13. Xiao, D.: Error estimation of the parametric non-intrusive reduced order model using machine learning. Comput. Methods Appl. Mech. Eng. 355, 513–534 (2019) 14. Yaghoobi, M., Wu, D., Davies, M.E.: Fast non-negative orthogonal matching pursuit. IEEE Sig. Process. Lett. 22(9), 1229–1233 (2015) 15. Zhu, Y., Wu, X., Li, P., Zhang, Y., Hu, X.: Transfer learning with deep manifold regularized auto-encoders. Neurocomputing 369, 145–154 (2019) 16. Zhu, Y., Zabaras, N., Koutsourelakis, P.-S., Perdikaris, P.: Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. J. Comput. Phys. 394, 56–81 (2019)

Deep Convolutional Generative Adversarial Networks Applied to 2D Incompressible and Unsteady Fluid Flows Nissrine Akkari1(B) , Fabien Casenave1 , Marc-Eric Perrin2 , and David Ryckelynck2 1

Safran Tech, Modelling and Simulation, Rue des Jeunes Bois, 78114 Châteaufort, Magny-Les-Hameaux, France [email protected] 2 Centre des Matériaux, Mines ParisTech PSL Research University, CNRS UMR 7633, 63-65 Rue Henri Auguste Desbruères, Corbeil-Essonnes, France

Abstract. In this work, we are studying the use of Deep Convolutional Generative Adversarial Networks (DCGANs) for numerical simulations in the field of Computational Fluid Dynamics (CFD) for engineering problems. We claim that these DCGANs could be used in order to represent high-dimensional realistic samples in an efficient fashion. Let us take the example of the computation of unsteady velocity and pressure fields of fluid flows subjected to random variations associated, for example, with different design configurations or with different physical parameters such as the Reynolds number and the boundary conditions. The evolution of all these variables is usually very hard to parameterize and to reproduce in a reduced order space. We would like to be able to reproduce the time coherence of these unsteady fields and their variations with respect to design or physical variables. We claim that the use of the data generation field in Deep Learning will enable this exploration in numerical simulations of large dimensions for CFD problems in engineering sciences. Therefore, it is important to point out that the training procedure in DCGANs is completely legitimate because we need to explore afterwards large-dimensional variabilities within the Partial Differential Equations. In the literature it is stated that theoretically the generative model could learn to memorize training examples, but in practice it is shown that the generator did not memorize the training samples and was capable of generating new realistic ones. In this work, we show an application of DCGANs to a 2D incompressible and unsteady fluid flow in a channel with a moving obstacle inside. The input of the DCGAN is a Gaussian vector of dimension 100 and the outputs are the generated unsteady velocity and pressure fields in the 2D channel with respect to time and to the obstacle position. The training set consists of 44 unsteady and incompressible simulations of 450 time steps each, on a cartesian mesh of dimension 79 × 99. We discuss the architectural and optimization hyper-parameter choices in our case, following guidelines from the literature on stable GANs. We quantify the GPU cost needed to train a generative model of the 2D unsteady flow fields: 892 s on one Nvidia Tesla V100 GPU, for 40 epochs and a batch size equal to 128.

Keywords: DCGAN · CFD · Velocity field · Pressure field · High dimensional realistic samples · Data generation

1 Introduction

Unsteady simulations for fluid mechanics in the domain of aeronautics are increasingly costly. Due to the statistical nature of unsteady and turbulent fluid flows, data-driven algorithms could potentially reduce the computational cost through trained reduced models. In fluid mechanics for computer graphics, the abundant amount of high-fidelity simulations has been used to train deep neural networks to approximate the behavior of a complex solver [1], to compress and decompress fluid simulations [7], or to synthesize high-resolution fluid flows starting from low-resolution velocities or vorticities [10]. Among the different techniques from the deep learning community, Generative Adversarial Networks (GANs) [2] are particularly interesting for our task. GANs aim to capture the data distribution so that they can then easily generate new realistic samples similar to the real ones. A large number of papers discuss empirical and heuristic choices made in order to obtain a stable GAN architecture for a given domain of application. In the paper of Goodfellow et al. [2], GANs were trained on a range of datasets including MNIST [3], the Toronto Face Database [4], and CIFAR-10 [5]. The empirical rules of the model architectures are based on rectifier linear and sigmoid activations for the generative model and maxout activations for the discriminative one; dropout was used only at the final layer of the generator. Goodfellow also showed in [6] domains of application of GANs other than the representation and manipulation of high-dimensional realistic samples, such as reinforcement learning and super-resolution. In [7], the authors applied deep neural networks to parameterized fluid simulations. They called these networks generative networks because the inputs are defined by a reduced latent space of the corresponding problem parameters. These generative models were, however, optimized during the training phase of an auto-encoder whose reduced latent space is the parameter latent space; the decoder part of this auto-encoder is then used for the generation of parameterized solutions. In [9], constraints on the architectural topology of GANs were proposed that make them stable to train: use batch normalization in both the generator and the discriminator, remove fully connected hidden layers for deeper architectures, use the ReLU activation in the generator for all layers except for the output, which uses Tanh, and use the LeakyReLU activation in the discriminator for all layers. These guidelines were applied to the generation of human faces with different poses and of bedroom images.

As already stated in the abstract, we are interested in the use of DCGANs for engineering problems, more particularly in the domain of aeronautics, in order to generate simulation data (usually produced by high-fidelity solvers of partial differential equations, PDEs) in an efficient fashion. More precisely, we need to generate realistic high-dimensional samples very fast. This could be very useful in a design exploration procedure for finding the optimal design with respect to a given industrial criterion. In this work, we propose to study the performance and robustness of DCGANs for generating high-dimensional realistic samples of high-fidelity simulations associated with 2D incompressible and unsteady fluid flows, with geometrical variations. To our knowledge, this is the first time that DCGANs are applied to generate unsteady fluid flows with geometrical variations. Nevertheless, there exist applications of GANs in the field of numerical simulation aimed at super-resolution and upscaling of simulation data; see for instance [10] for more details. The architectures of the generative and discriminative models used in this work are inspired by the DCGAN tutorial of Pytorch, see [8]; these architectures follow the main guidelines mentioned above for stable GANs. The paper is organised as follows: in Sect. 2 we present the simulation data set considered for the training phase of the GAN. In Sect. 3 we provide the models' architectures, the choice of the gradient descent algorithm and the hyper-parameter values such as the learning rate, the batch size and the generator inputs. In Sect. 4, we show the generated 2D fluid flows in the training and test phases after 40 epochs of training. In Sect. 5, we give some conclusions and prospects of this work.

2 Training Fluid Simulations

The training set consists of 44 CFD simulations of unsteady and incompressible fluid flows in a 2D channel around a moving squared solid obstacle. The dimensions of this 2D channel are x ∈ [0.0 m, 0.07536 m] and y ∈ [−0.006 m, 0.006 m]. The position of the obstacle in each of the 44 training simulations is imposed randomly, as an immersed solid boundary using a level set function, see for instance [15]. This test case runs on a cartesian mesh discretized into 79 parts in the x-axis direction and 99 parts in the y-axis direction. The boundary conditions are given by an inlet constant velocity equal to 2 m/s, an outlet condition and two wall boundary conditions; see Fig. 1 for a sample of these training simulations. The physical time for each of these simulations is set to 45 ms, and snapshots of the instantaneous velocity and pressure fields are extracted every 0.1 ms, i.e. we have 450 snapshots of the velocity and the pressure per simulation. This makes a total of 19800 snapshots in the training set, each of dimension 3 × (79 × 99), as we concatenated the 2D velocity field and the 1D pressure field into three channels for the training phase of the GAN.
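To make the data layout concrete, the sketch below shows one way such snapshots could be stacked into the 3-channel arrays described above; the loader function and file layout are hypothetical placeholders, only the shapes (44 simulations × 450 snapshots × 3 × 79 × 99) come from the text.

```python
import numpy as np

# Hypothetical loader: returns the velocity (2, 79, 99) and pressure (79, 99)
# of one snapshot of one simulation; the actual file format is not specified
# in the paper.
def load_simulation(sim_id, step):
    vel = np.zeros((2, 79, 99), dtype=np.float32)   # placeholder fields
    p = np.zeros((79, 99), dtype=np.float32)
    return vel, p

snapshots = []
for sim_id in range(44):            # 44 unsteady simulations
    for step in range(450):         # 450 snapshots per simulation
        vel, p = load_simulation(sim_id, step)
        # Concatenate the two velocity components and the pressure field
        # into a single 3-channel "image", as described in Sect. 2.
        snapshots.append(np.concatenate([vel, p[None, ...]], axis=0))

data = np.stack(snapshots)          # shape (19800, 3, 79, 99)
```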


Fig. 1. Three chosen couples of the magnitude of the high-fidelity velocity fields (on the top) and the high-fidelity pressure fields (at the bottom) in a batch of the training set.

3 GAN and Discriminator Architectures and the Optimization Hyper-parameters

GAN and Discriminator Architectures. As already mentioned in the introduction, we adopted the configurations of the deep convolutional generative and discriminative models from the Pytorch tutorial on DCGANs [8], as they satisfy the main features and guidelines for stable GANs found in the literature.

The Optimization Hyper-parameters. In what follows, we enumerate the hyper-parameter choices for the optimization algorithm of the weights of the generative and discriminative models, following [8]:

– The 19800 snapshots are loaded using a dataloader in Pytorch; the number of workers is set to 4.
– The data are scaled to the interval [−1, 1].
– The data are resized to an image size of 64 × 64.
– The batch size is set to 128.
– The number of channels is set to 3, as we concatenate the two-dimensional velocity field with the one-dimensional pressure field.
– The size of the feature maps in the generator is 64.
– The size of the feature maps in the discriminator is 64.
– The ADAM optimizer is used, with a learning rate of 0.0002.
– During the training phase, a random noise is added to the target labels of the generator and the discriminator.
– The generative model inputs are random distributions of dimension 100.
– The number of epochs is 40, in order to obtain sharp and realistic generated samples.
– The number of GPUs is 1.
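For concreteness, the following is a minimal generator sketch in the spirit of the Pytorch DCGAN tutorial [8], using the hyper-parameters listed above (latent size 100, 64 feature maps, 3 channels, 64 × 64 output in [−1, 1]); it illustrates the guidelines and is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

nz, ngf, nc = 100, 64, 3   # latent size, generator feature maps, channels

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            # latent vector z (nz x 1 x 1) -> (ngf*8) x 4 x 4
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # (ngf*8) x 4 x 4 -> (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # (ngf*4) x 8 x 8 -> (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # (ngf*2) x 16 x 16 -> ngf x 32 x 32
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # ngf x 32 x 32 -> nc x 64 x 64; Tanh matches data scaled to [-1, 1]
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.main(z)

g = Generator()
fake = g(torch.randn(128, nz, 1, 1))   # a batch of 128 generated 3x64x64 fields
```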

The training phase runs on one Nvidia Tesla V100 GPU node. We are able to complete the 40 epochs in 892 s.
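The adversarial optimization itself can be sketched as follows; the exact label-noise scheme used by the authors is not specified, so the soft labels below are an assumption, and `netG`/`netD` stand for generator and discriminator modules such as those of the Pytorch tutorial.

```python
import torch
import torch.nn as nn

def train_epoch(netG, netD, dataloader, nz=100, device="cpu"):
    """One epoch of DCGAN training with noisy target labels (sketch)."""
    criterion = nn.BCELoss()
    # In practice the optimizers would be created once, outside the epoch loop.
    optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
    optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for real in dataloader:
        real = real.to(device)
        b = real.size(0)
        # Noisy targets: the paper only says random noise is added to the
        # labels, not how much, so these magnitudes are assumptions.
        real_labels = 0.9 + 0.1 * torch.rand(b, device=device)
        fake_labels = 0.1 * torch.rand(b, device=device)

        # --- update the discriminator on real and generated samples ---
        netD.zero_grad()
        loss_real = criterion(netD(real).view(-1), real_labels)
        noise = torch.randn(b, nz, 1, 1, device=device)
        fake = netG(noise)
        loss_fake = criterion(netD(fake.detach()).view(-1), fake_labels)
        (loss_real + loss_fake).backward()
        optD.step()

        # --- update the generator to fool the discriminator ---
        netG.zero_grad()
        loss_g = criterion(netD(fake).view(-1), torch.ones(b, device=device))
        loss_g.backward()
        optG.step()
```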

4 Experimental Results

Generated 2D Fluid Flows by the Generator at the Last Epoch of the Training Phase. The generator and discriminator losses during the training stage are shown in Fig. 2. We can deduce from this plot that the training phase was very stable; this can also be seen in Fig. 3, which shows the generator outputs on a fixed noise during the training stage. More precisely, we show the generator outputs at the last epoch of the training stage for 64 different inputs. These 64 output images are compared to 64 training images picked from a batch (of 128 images) of the dataloader. Newly Generated 2D Fluid Flows by the Optimized Generative Model Trained for 40 Epochs. In this part, we apply the optimized generative model to a set of input distributions of dimension 100 obtained as a linear interpolation between two fixed random distributions of dimension 100: this allows us to verify the assumption that the generator did not memorize the training images and is able to generate new realistic ones. We repeat this application several times and obtain the results in Figs. 4, 5, 6, 7, 8, 9, 10 and 11. These results show that the generator was able to learn the temporal evolution and the geometrical variation during the training phase. Variation Laws of the Newly Generated Data in Time. In the following part, we show some logarithmic laws describing the evolution of the velocity magnitude of the newly generated fields, for given points of the fluid domain, with respect to the physical simulation time. Thanks to these laws, we aim to detect whether the generator is producing time-coherent


Fig. 2. The generator loss in blue and the discriminator loss in orange.

Fig. 3. Comparison between 64 training images (RGB ones containing a concatenation of the velocity field and the pressure one) on the right and 64 fake ones generated after 40 epochs of the training phase.

velocity fields and/or velocity fields associated with different positions of the obstacle. We note that we do not consider the obstacle's position as an explicit variable, as we do not have a parameter that describes this variation. However, we are able to establish a non-linear logarithmic law of the velocity with respect to time when the obstacle's position changes, and a linear one when the simulation time goes forward or backward for a given channel configuration. Therefore, we might be able to establish, in real time, physical models with respect to design/time variables and to calibrate them


Fig. 4. Test phase: From the left to the right a newly generated couple of velocity and pressure fields from a linear interpolation between two fixed random distributions of dimension 100.

Fig. 5. Test phase: From the left to the right a newly generated couple of velocity and pressure fields from a linear interpolation between two fixed random distributions of dimension 100.


Fig. 6. Test phase: From the left to the right a newly generated couple of velocity and pressure fields from a linear interpolation between two fixed random distributions of dimension 100.

Fig. 7. Test phase: From the left to the right three newly generated couples of velocity and pressure fields from a linear interpolation between three couples of fixed random distributions of dimension 100.


Fig. 8. Test phase: From the left to the right three newly generated couples of velocity and pressure fields from a linear interpolation between three couples of fixed random distributions of dimension 100.

Fig. 9. Test phase: From the left to the right three newly generated couples of velocity and pressure fields from a linear interpolation between three couples of fixed random distributions of dimension 100.

with respect to data whose realistic behavior and distribution are generated thanks to the DCGAN. In Figs. 12, 13 and 14, we confirm the different logarithmic laws with respect to time. Moreover, we recover the fact that the distance between the fields generated from the interpolated distributions and the first generated field tends to increase at a decreasing rate, see Figs. 12, 13 and 14.
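As an illustration of this test procedure, the sketch below interpolates linearly between two fixed latent vectors, feeds the interpolated vectors to a trained generator, and measures the l2 distance between each generated velocity magnitude and the first one; `netG` is assumed to be a trained generator such as the one sketched in Sect. 3, and the probe-point selection used in the paper is replaced here by the whole grid.

```python
import torch

def interpolate_and_measure(netG, n_steps=20, nz=100):
    """Generate fields along a latent interpolation and measure distances."""
    z0, z1 = torch.randn(1, nz, 1, 1), torch.randn(1, nz, 1, 1)
    with torch.no_grad():
        ref = netG(z0)                           # first generated field
        v0 = ref[:, :2].norm(dim=1)              # velocity magnitude (channels 0-1)
        dists = []
        for k in range(n_steps + 1):
            alpha = k / n_steps
            z = (1 - alpha) * z0 + alpha * z1    # linear latent interpolation
            out = netG(z)
            v = out[:, :2].norm(dim=1)
            # l2 distance of the velocity magnitude to the first generated field.
            dists.append(torch.norm(v - v0).item())
    return dists
```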


Fig. 10. Test phase: From the left to the right three newly generated couples of velocity and pressure fields from a linear interpolation between three couples of fixed random distributions of dimension 100.

Fig. 11. Test phase: From the left to the right three newly generated couples of velocity and pressure fields from a linear interpolation between three couples of fixed random distributions of dimension 100.

5 Conclusions and Prospects

In this paper, we applied Deep Convolutional Generative Adversarial Networks (DCGANs) to the generation of unsteady and incompressible 2D fluid flows in the wake of a moving squared obstacle, with a latent reduced space of dimension 100 and high-fidelity fields of dimension 3 × (79 × 99). The 2D fluid flows newly generated by the optimized generative model with random distribution


Fig. 12. Test phase: Evolution of ln(d(V − V0)) with respect to ln(|t − t0|): d is the l2-norm on given points of the fluid domain, V denotes the magnitude of the generated velocity from an interpolation point and V0 denotes the magnitude of the generated velocity field from the first interpolation random distribution. t is the time value.

Fig. 13. Test phase: Evolution of ln(d(V − V0 )) with respect to ln(|t − t0 |).

inputs are very interesting. We were able to identify different realistic characteristics inherited from the real training samples, such as the temporal coherence and the geometrical variability. This is a first step towards the efficient manipulation and representation of high-dimensional realistic numerical data in engineering sciences. Moreover, we were able to quantify the GPU cost needed for this learning phase: 892 s on one Nvidia Tesla V100 GPU node. These results indicate that the GPU cost of the generator training might become significant on a data set with 3D geometrical variations and unsteady fluid flows. In the prospects of this work, we want to be able to prove that the generative model


Fig. 14. Test phase: Evolution of ln(d(V − V0 )) with respect to ln(|t − t0 |).

never memorizes the training samples. This is very important in order to accomplish the desired objectives of fast design conception in engineering sciences. Moreover, these results are very encouraging and promising for further studies, in which we would like to assist a large number of velocity and pressure fields newly generated by the generative model with physical reduced order modeling, based on the projection of the high-fidelity equations on a Proper Orthogonal Decomposition (POD) basis built with the generated fields, see for instance [11–14]. In other words, we aim to add more physical constraints to these GAN-generated velocity and pressure fields by solving efficient reduced order systems of the Navier-Stokes equations.
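For context, a POD basis of the kind referred to above can be computed from a matrix of (generated) snapshots by a singular value decomposition; the sketch below is a generic POD construction under that assumption, not the specific reduced order method of [11–14].

```python
import numpy as np

def pod_basis(snapshots, n_modes):
    """Compute a POD basis from snapshots of shape (n_snapshots, n_dofs),
    e.g. flattened GAN-generated velocity/pressure fields."""
    mean = snapshots.mean(axis=0)
    fluct = snapshots - mean                       # centered snapshot matrix
    # Left singular vectors of the snapshot matrix give the spatial POD modes.
    u, s, vt = np.linalg.svd(fluct.T, full_matrices=False)
    return mean, u[:, :n_modes], s[:n_modes]
```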

References
1. Tompson, J., Schlachter, K., Sprechmann, P., Perlin, K.: Accelerating Eulerian fluid simulation with convolutional networks. arXiv preprint (2016). https://arxiv.org/abs/1607.03597
2. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. In: Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014), pp. 2672–2680 (2014)
3. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
4. Susskind, J., Anderson, A., Hinton, G.E.: The Toronto face dataset. Technical report UTML TR 2010-001, University of Toronto (2010)
5. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
6. Goodfellow, I.J.: NIPS 2016 tutorial: generative adversarial networks. arXiv:1701.00160
7. Byungsoo, K., Vinicius, C.A., Nils, T., Theodore, K., Markus, G., Barbara, S.: Deep fluids: a generative network for parameterized fluid simulations. Eurographics 38(2), 59–70 (2019)
8. Inkawhich, N.: Pytorch tutorial, DCGAN. https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
9. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: Conference paper at ICLR 2016 (2016)
10. Xie, Y., Franz, E., Chu, M., Thuerey, N.: tempoGAN: a temporally coherent, volumetric GAN for super-resolution fluid flow. ACM Trans. Graph. 37(4), Article 95 (2018). arXiv:1801.09710
11. Akkari, N., Mercier, R., Moureau, V.: Geometrical reduced order modeling (ROM) by proper orthogonal decomposition (POD) for the incompressible Navier Stokes equations. In: 2018 AIAA Aerospace Sciences Meeting, AIAA SciTech Forum (AIAA 2018-1827) (2018)
12. Akkari, N., Casenave, F., Moureau, V.: Time stable reduced order modeling by an enhanced reduced order basis of the turbulent and incompressible 3D Navier–Stokes equations. Math. Comput. Appl. 24(45) (2019). http://www.mdpi.com/2297-8747/24/2/45
13. Akkari, N.: A velocity potential preserving reduced order approach for the incompressible and unsteady Navier-Stokes equations. In: AIAA SciTech Forum and Exposition (2020)
14. Akkari, N., Casenave, F., Ryckelynck, D.: A novel gappy reduced order method to capture non-parameterized geometrical variation in fluid dynamics problems. Working paper (2019). https://hal.archives-ouvertes.fr/hal-02344342
15. Abgrall, R., Beaugendre, H., Dobrzynski, C.: An immersed boundary method using unstructured anisotropic mesh adaptation combined with level-sets and penalization techniques. J. Comput. Phys. 257(Part A), 83–101 (2014)

Improving Gate Decision Making Rationality with Machine Learning

Mark van der Pas1(B) and Niels van der Pas2

1 European Center for Digital Transformation, Roermond, The Netherlands
[email protected]
2 European Center for Machine Learning, Roermond, The Netherlands

Abstract. Canceling ideas and projects is an important part of the Innovation Portfolio Management (IPM) process, as stopping the unsuccessful ones avoids sunk costs and frees up resources for successful ideas and projects. A large body of literature is available on decision making in IPM. In this study, we analyzed within IPM the cancellation of ideas and projects by gatekeeping boards as well as the possibilities of applying machine learning. The hypotheses were tested with data from three large European telecommunication organizations. In total the three organizations shared 9,118 canceled ideas and projects, of which 0.3% were canceled in the gatekeeping boards; 2.7% of the 1,469 gate requests on the agenda of these boards were canceled. The dataset of one organization was used to train four machine learning models to predict the likelihood of idea and project cancellation. The first model is trained to predict the ideas that will be canceled before the first gate approval; the second model does the same for projects canceled after the first gate is approved but before the second gate is approved. Models three and four are trained for projects in a later phase; the fourth model predicts the cancellation of projects that hold a go-on-implementation approval. All four models have an area under the curve of at least 0.802, making them potentially valuable instruments for predicting project cancellation.

Keywords: Innovation Portfolio Management · Mortality curve · Machine learning

1 Introduction

Where numerous ideas for innovations are generated and present in organizations, not all of these are turned into projects and implemented. Organizations can use several mechanisms to determine which ideas to realize and, if they are not realized, which to cancel. These mechanisms could embrace the ideas with the strongest contribution to realizing organizational goals whilst canceling the remaining ones. It is generally accepted that ideas and projects that need to be stopped should be canceled as early as possible [11, 28]. Sifting out successful, value-contributing ideas from the total set of ideas is not always done error-free. Back in 2004, Cooper, Edgett and Kleinschmidt claimed that products


fail at an alarming rate; amongst others, they found that 21.3% of businesses' total NPD efforts meet their project objectives [5]. In a more recent study, Van der Pas and Furneaux [29] found that 43% of new products and 69% of cost-saving investments deliver at least 80% of the expected revenue, whereas 26% of new products and 7% of cost-saving investments never earned back more than the cost they incurred and thus delivered a bottom-line negative value. Practitioners from different telecom organizations explained several mechanisms to cancel ideas and projects. E.g. some ideas never get captured, as the idea generator never takes the effort to register the idea. In a Spanish telecommunication organization, the capturing of ideas was structured in a way that the registering of an idea could only be done by a small group of employees. Idea generators that could not convince their colleagues would see their idea being canceled even before it was formally captured in a process or system. For the captured ideas that were registered, other mechanisms were identified in telecommunication organizations. Practitioners from a German organization explained that ideas, as well as projects, were canceled automatically after nobody had worked on them for 90 days. In a Dutch organization, ideas were only worked on if they were budgeted for that year. Idea generators needed to wait for the next year's budget cycle in case their ideas were not on the budget list. During that waiting time, a lot of ideas lost their initial energy and never made it onto the next budget cycle. Practitioners from organizations in several countries explained the usage of IRR and payback hurdles; not passing these would mean the cancellation of an idea or project. This study focuses on two specific instruments to cancel ideas and projects. First, a well-known governance instrument, a board that approves or rejects innovation calls, is studied; the focus is on the contribution of these boards to canceling ideas and projects. Second is the usage of machine learning to predict the probability that an idea or project will be canceled. The purpose of this study is to better understand these instruments so organizations can improve the quality of decision making as well as the structure of their innovation funnel. This could improve their Innovation Portfolio Management (IPM) performance, which is important as it is related to firm performance [24]. Researchers can use the output of this study to better understand the contribution these instruments make to canceling ideas and projects. This brings us to the research question: what contribution can decisions of gatekeeping boards and machine learning deliver in canceling ideas and projects? We elaborate on our research in the following sections, commencing with a discussion of our theoretical framework, including the resulting hypotheses. We then outline our research methodology and present our results. Finally, we present our conclusion as well as the managerial implications, limitations and suggestions for future research.

2 Theoretical Framework

Innovation Portfolio Management (IPM) [25] is defined as the dynamic decision-making process in which projects are evaluated and selected, and resources are allocated to them [19]. This process can be used to strive for an efficient portfolio [22], meaning that it is impossible to obtain a greater average return (benefits and costs) without incurring a greater standard deviation (risk), or that it is impossible to obtain a smaller standard deviation


without giving up return. Alternatively, it can be focused on growing profitability over the long term [18] or on selecting high-value projects in a balanced portfolio reflecting the business strategy, with a good balance between projects and available resources [5, 6]. Regardless of the goals pursued with Innovation Portfolio Management, IPM is the decision-making process to select and cancel projects. The decision-making process can be organized along gates. A well-known example of this is the Stage-Gate® process, which holds gates and is used to evaluate, select or cancel projects by managing ideas through key steps into launched products [8, 10]. The gates in the Stage-Gate® process are formal decision points that split the project into multiple stages. Cooper [9] presents a Stage-Gate overview with five gates and five stages. Alternative models use different setups with a different number of gates or a different split of work between those gates (e.g. [13, 14]). Before an idea enters this formal product development process, it is in the pre-Stage-Gate® phase [7]. This 'front end' of IPM can include work such as technical feasibility, financial viability analysis, business model development and business plan preparation [20]. In this study, we focus on the captured ideas and follow these through the formal gate decisions. Within those, we studied the early decisions up to the decision to technically realize the project and to start the development [2]. With this, we capture the decisions determining the allocation of a large part of the project resources. Moreover, these early IPM phases are important as they determine the structure of the innovation funnel and have even been shown to influence product success, time to market and financial performance [21]. Captured ideas can be canceled by the gatekeeper formally rejecting a requested gate or by the next gate not being requested. In the latter case, the "decision" to cancel the project is not a gate decision but is made in a stage of the gated process. This can be an offline decision by senior management, but the project can also be abandoned and left to bleed to death. This cancellation of projects can be supported by numerous reasons. Examples are not passing a financial threshold [12] and extensive project risks [22]. Other reasons mentioned are that the idea or project cannot be financed, is technically not feasible, does not fit the strategy or the existing product portfolio, is illegal, is immoral, can hurt the company's reputation, is politically not wanted, or that the initiator's credibility is too low. Gate approvals are normally organized in review teams [14], boards [11] or with experienced gatekeepers [10]. In contrast, we have seen IPM structures in several telecommunication organizations where early gates were approved by one person. Practitioners have chosen this setup as it assured a clear commitment of one manager as the overall end-responsible manager of the investment, since he or she approved the start of the initiative. Governance that empowers an individual senior manager to approve the first gate and a board to approve the next gate also has an effect on escalation of commitment [11, 30]. Since the board did not commit to the first gate, it can decide on the second gate without any prior commitment, immunizing it against escalation of commitment.
Furthermore, practitioners were convinced that turning down an idea by one senior manager has a less negative impact on the idea generator than turning it down in front of a board with multiple senior gatekeepers. And even though an idea was not prioritized, the organizations are nevertheless interested in the upcoming, better proposals of that


idea generator. Finally, practitioners set up this governance as they were convinced that a large part of the requests on the agenda of the board is approved. They believed this was valid for IPM as for all kinds of other board decisions, and considered a board an inappropriate instrument to turn down large amounts of investment calls. As one CEO put it, 'the colleague determining the agenda holds more influence on what the organization will be doing than every board member alone and all board members together'. Although no exact figures were known, this CEO claimed that 85% or even more of the requested gates on the board agenda were approved. Several reasons were mentioned for a high board approval rate. In one organization the board members agreed to work as a team and to stop fighting and arguing. This could steer board participants to avoid tough decisions (assuming that approving is easier than canceling). The group dynamic in the gate-approving board could also influence the cancellation rate, as board members do not criticize or reject proposals of board colleagues in order to avoid future push back on their own board requests. Finally, it was mentioned that board presentations are prepared in detail and hold a lot of upfront (emotional) investment. Turning down an idea, especially in front of a board with senior managers, would kill this positive energy and could even lead to a culture where innovative ideas are not respected. Board members could be reluctant to cancel projects as they value innovations. In hypothesis 1 we test the practitioners' thesis that more than 85% of requests are being approved by the gatekeeping board:

H1. More than 85% of gate requests on the agenda of a board is being approved.

In case more than 85% of the gate requests on the agenda of boards are approved and less than 15% is rejected, the latter is, compared to published cancel rates, a low rate. Cooper, Edgett and Kleinschmidt [5] found that average businesses kill 19.0% of projects that entered the development stage prior to launch. As IPM also includes the stages before the development in which ideas and projects can be canceled, the IPM cancellation rate will be higher than 19%. The cancellation rate can be visualized and benchmarked with mortality curves presenting the progressive rejections of projects through the stages of the gated process [13]. Although the structure of the mortality curve improved strongly from 1982 to 2004, approximately 75% of the captured ideas never make it to commercialization [2]. The termination rates are even higher in case ideas are captured outside the organization by customers or suppliers. Mendely canceled 91.1% (feedback.mendely.com) of the captured ideas, the Idea boxes of Ericsson canceled 96.3% [3], whereas Ideastorm [16] had a cancellation rate of 97.8% and MyStarbucksIdeas canceled 99.8% [15]. Based on the published cancellation rates and the low number of gate requests rejected by boards, we expect that a large part of the cancellations is not done by gatekeepers in their boards. The cancellations outside board meetings would support the suggestion that IPM is a decision-making process made up of more than gate decisions at single points in time [19]. In the second hypothesis, we investigate the contribution of gatekeeping boards to the total amount of cancellations.

H2. Decisions of gatekeeping boards deliver a contribution of less than 5% of the total number of canceled initiatives.


Organizations hold an interest in the likelihood of cancellation as it can be used to avoid sunk costs and to mitigate the risks that lead to cancellation. If cancellation is a rational, reproducible process, the organization needs all the relevant data points as well as the capacity to process them in order to predict cancellation. Relevant data points can be split into structured and unstructured data. Examples of structured data are a classification of the innovation, e.g. into new product, sales expansion and cost reduction [4] or into new-to-the-world, new-to-the-firm, major revisions and incremental improvements, repositioning and cost reduction [13], as well as the name of the initiator and of the project leader, the time-to-market in calendar days, the NPV and the project budget. Unstructured data are the text documents describing the product, the business case or the weekly report of the project leader. Thousands of structured data points can be found on each project and the unstructured data can be dozens of pages of documentation. A lot of project information is scattered over emails, presentations, minutes, spreadsheets and text documents spread over all kinds of (personal) drives. It is also available in databases, on whiteboards or 'in the air' for spoken words. Collecting this data is a task consuming significant resources. Collecting, internalizing and processing project data is, due to its size, challenging for most human beings (from here on called organizational agents), especially when considering portfolios of hundreds or even thousands of projects. Also, the time component makes this analysis of projects challenging, as the project data evolves together with the project over time. As an alternative or a complementary instrument to organizational agents processing this data, machine learning could be applied. Machine learning models could be trained to predict the likelihood of cancellation based on the available information. Machine learning can be an interesting IPM instrument in case it can outperform organizational agents, that is, in case it can better predict project cancellation than organizational agents. Given the size and complexity of the data, we expect machine learning to outperform organizational agents in predicting the likelihood of project cancellation.

H3. Machine learning can outperform organizational agents in predicting the likelihood of project cancellation.

The amount, level of detail and quality of information is expected to grow as projects progress along the stages and gates. Furthermore, due to the structure of the mortality curve, the percentage of canceled projects normally reduces when more gates are passed. Based on these two effects we expect that the predictability improves when different machine learning models are used for the different gates.

H4. Machine learning performance improves when different models are used for each of the gates.

3 Methodology

In this study, we applied data from three European telecommunication organizations. These organizations run operations in three different countries and all three used boards as gatekeepers. In two organizations (A and B) the first three gates are approved by boards, whereas the third organization (C) holds a governance where the first gate was approved


by a senior manager and the second and third gates are approved by gatekeeping boards. The board approvals are recorded in minutes and the senior manager approval is captured over an automated workflow. Two organizations (A and C) hold a low threshold to enter ideas into the NPD process, by allowing every employee to enter ideas in a low-threshold, browser-based system. In organization B the ideas are entered by a PMO organization: the idea owner contacts the PMO and requests them to enter the idea. The organizations have included the activities normally defined in the front end of IPM into their stage-gated process. The exact allocation of activities into the early stages differs between the three organizations, but they all hold three gates before the actual development and realization of the idea starts. As the names of the gates in the organizations differ, we normalized them for this study as Gate 1, 2 and 3. The data we received for H1 and H2 covered different time frames, ranging from August 2008 up to August 2018. In total, we received information on 9,118 canceled projects and 1,469 early gate decisions. Details on the time frame as well as the number of decisions are in Table 1.

Table 1. The number of board decisions as well as the total number of cancellations for the three organizations

|                                | Organization A               | Organization B          | Organization C          |
| Time frame covered             | November 2014–September 2016 | February 2017–July 2018 | August 2008–August 2018 |
| Number of board gate decisions | 285                          | 126                     | 1,058                   |
| Average decisions per month    | 12.8                         | 7.0                     | 8.8                     |
| Number of cancelled projects   | 523                          | 67                      | 8,528                   |

For H1 we used the number of approvals, cancellations as well as the number of gate requests (being the sum of approvals and rejections) on the agenda of the gatekeeping boards. For postponed gate decisions we neglected the delay and counted the outcome of the decision. Some of the cancellations turned out to be temporary. For that we checked whether a rejected gate was approved in a later stage; in those cases, we counted both decisions together as one approval since the gate was finally approved. For H1 we calculated the cancellations divided by the number of topics on the agenda. For H2 we compared the number of board cancellations to the total number of cancellations during the time frame defined by the received board meeting minutes. H3 and H4 were tested using data from organization A. In total 2,451 ideas were captured that were canceled or launched in the time frame from November 2014 up to September 2019. Figure 1 presents the mortality curve for this data set. For each of the 2,451 ideas, we received 37 data points. These data points were used to train the machine learning models for H3 and H4. The output of the model (Y) is a


Fig. 1. Mortality curve of the data set for H3 and H4 from organization A (number of projects at Pre Gate 1, Gate 1, Gate 2 and Gate 3)

Boolean parameter with the values 1 for canceled and 0 for implemented and launched. The remaining data points are features on:

1. The project organization. This includes categorical identifiers for the demand owner who initiated the idea as well as the project leader. It also includes Booleans on departments involved in the project (e.g. IT, Technology and Digital).
2. Project type. The project type includes a classification into business-to-consumer products, business-to-business products or business-to-partner products and cost-saving projects, but also an indication of the Net Promoter Score as well as the targeted number of customers affected.
3. Project financials. This includes information on the estimated project cost as well as net present value and payback period.
4. Time to Market. Information on the date the idea was captured as well as the estimated duration in days between gates 1, 2, 3 and the closing of the project.
5. Project execution. Information on the number of days a project was on hold, how long a project was flagged with a red risk and how often it was flagged with red risks.

The data is used to feed a linear model-based learning algorithm using binary classification, and the loss was minimized using stochastic gradient descent. The dataset was shuffled randomly and, from that set, the first 70% was used to train the model. The trained model was evaluated with the remaining 30%. Four models were created. In the first model, Pre Gate 1, all 2,451 captured ideas were used; in the second model, Gate 1, all captured ideas with an approved gate 1 were used. The third model, Gate 2, used all projects with an approved gate 2, and the fourth model, Gate 3, required an approved gate 3. Some data points are only available after certain gates have been approved. E.g. the approvals of gates 1 and 2 as well as any budget approvals are unknown in organization A before gate 1 is approved. In the trained models only data points were included that are available at the time of decision making.
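The paper states that a linear model was trained with stochastic gradient descent on a binary classification loss, with a random 70/30 split; the sketch below shows one way this could be set up with scikit-learn. The feature matrix and labels here are random stand-ins for the 37 data points per idea, which are not public.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Illustrative stand-in for the 37 data points per idea/project;
# X would hold features such as project cost, NPV, payback period, etc.
rng = np.random.default_rng(0)
X = rng.normal(size=(2451, 37))
y = rng.integers(0, 2, size=2451)            # 1 = canceled, 0 = launched

# Random shuffle with a 70/30 train/evaluation split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=True, random_state=0)

# Linear model trained with stochastic gradient descent on a
# binary-classification loss (log loss gives calibrated probabilities).
model = SGDClassifier(loss="log_loss", max_iter=1000, random_state=0)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]    # likelihood of cancellation
print("AUC:", roc_auc_score(y_test, probs))
```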


To test H3 we compared the trained machine learning models to organizational agents' performance. Based on the data set the following results of the machine learning models are known:

• true positive (TP): the model predicted a cancellation and the project was canceled,
• false positive (FP): the model predicted a cancellation and the project was launched and implemented,
• true negative (TN): the model predicted a launch and implementation and the project was launched and implemented, and
• false negative (FN): the model predicted a launch and implementation and the project was canceled.

The dataset includes ideas and projects that were captured in organization A. Based on inputs from practitioners we assume that they could only be captured in the case at least one colleague expects the idea to be a success (the idea capturing colleague). Against this expectation, false and true negatives can be measured as a cancellation or a launch and implemented project. Due to this baseline which only includes expected launches both the true and false positives of the ideas of organizational agents are unknown. Metrics like recall, accuracy, precision and false positive rate cannot be calculated for organizational agents as they all require false or true positives. The percentage of false negatives of the total data set can be calculated for both machine learning as well as organizational agents' performance. This false negative percentage is defined as:

false negative percentage = FN / (FN + TN)    (1)

The thresholds used in the machine learning models can be used to optimize the false negative rate. For this reason, we test H3 with two thresholds: 0.5 and 0.33. The first was chosen as the de facto default in machine learning models, the second since it optimizes significantly towards our focus area for H3: the false negative percentage. H4 was tested using new machine learning models trained for pre gate 1 and gate 3 from hypothesis 3. As stated earlier, 30% of the total dataset was used to evaluate the pre gate 1 model. Of this 30%, the projects without an approval for gate 3 were filtered out, leaving projects from the evaluation dataset with a gate 3 approval. It is likely that the training data set used to create the gate 3 model included projects that were used to evaluate the pre gate 1 model. Evaluating a model on the same data it was trained by is considered a bad research method as it may give unreliable results due to the risk of overfitting. Therefore, a new gate 3 model was trained on the total dataset for gate 3, excluding the gate-3-approved projects in the evaluation set of the pre gate 1 model. The gate 3 model holds additional features not included in the pre gate 1 model. These additional features are excluded when using the pre gate 1 model but included when evaluating the gate 3 model. This approach allows testing both models using the same projects whilst these projects have not been used for training the models. As described e.g. by Powers [26], recall, accuracy, precision and false positive rate (FPR) are measures to evaluate machine learning models; these measures are defined as follows:

Recall = TP / (TP + FN)    (2)

Precision = TP / (TP + FP)    (3)

Accuracy = (TN + TP) / (TN + TP + FN + FP)    (4)

FPR = FP / (FP + TN)    (5)

By comparing these four measures as well as the AUC of both models, H4 can be tested.
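These definitions translate directly into code; the following helper is a generic implementation of Eqs. (1)–(5) and is not taken from the paper.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Metrics from Eqs. (1)-(5) given confusion-matrix counts."""
    return {
        "false_negative_percentage": fn / (fn + tn),       # Eq. (1)
        "recall": tp / (tp + fn),                          # Eq. (2)
        "precision": tp / (tp + fp),                       # Eq. (3)
        "accuracy": (tn + tp) / (tn + tp + fn + fp),       # Eq. (4)
        "false_positive_rate": fp / (fp + tn),             # Eq. (5)
    }

# Example with made-up counts
print(confusion_metrics(tp=50, fp=30, tn=400, fn=20))
```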

4 Results

The cancellation rates of the gatekeeping boards are shown in Table 2. Of the 1,469 captured gate decisions over the three organizations, 1,430 (= 97.3%) were approved. Organization A has the lowest approval rate, at 95.4%. In organization B 100% of the requested gates were approved; this can be explained by the way ideas are captured, as in organization B only a select group of senior managers is empowered to capture and register an idea. Overall, in the three organizations in this study, the percentage of approved board requests is clearly above 85%. H1 was tested with a binomial test with N = 1,569 and p = 0.85; the cumulative probability of getting up to 1,430 approvals equals 1.00, and based on this H1 is accepted.

Table 2. The percentage of approvals and cancellations by gatekeeping boards

|          | Organization A | Organization B | Organization C | Total  |
| Approved | 95.4%          | 100.0%         | 97.5%          | 97.3%  |
| Canceled | 4.6%           | 0.0%           | 2.5%           | 2.7%   |
| Total    | 100.0%         | 100.0%         | 100.0%         | 100.0% |
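For reference, the binomial tests reported for H1 and H2 can be reproduced with a standard statistics library; the following sketch simply plugs in the counts stated in the text and is not the authors' analysis script.

```python
from scipy.stats import binom

# H1: cumulative probability of observing up to 1,430 approvals out of the
# N gate requests used in the paper, if the true approval rate were 85%.
p_h1 = binom.cdf(1430, 1569, 0.85)
print(f"H1 cumulative probability: {p_h1:.6f}")   # ~1.0, so H1 is accepted

# H2: 1.00 minus the probability of getting 39 or fewer board cancellations
# out of 9,118 canceled initiatives if boards accounted for 5% of cancellations.
p_h2 = 1 - binom.cdf(39, 9118, 0.05)
print(f"H2 test quantity: {p_h2:.6f}")             # ~1.0, as reported
```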

Table 3 shows the percentage of gates canceled in the gatekeeping board (in board) as well as the percentage of cancellations outside the board (not in board). On average over all three organizations, 99.6% of the cancellations are done outside the gatekeeping board. The gatekeeping board of organization A accounts for 2.5% of the total cancellations in that organization, and since no cancellations were made by the gatekeeping board of organization B, all cancellations there were done outside the board. With 0.4% of all cancellations made in gatekeeping boards, the data seem to support H2. H2 was also tested with a binomial test with N = 9,118 and p = 0.05; 1.00 minus the probability of getting 39 or fewer cancellations is also 1.00, allowing us to accept H2. The results of the tests of the four machine learning models are presented in Table 4. The models show an area under the curve (AUC) of 0.802 or higher. An AUC of 0.5 is equal to guessing at random, and the closer the figure is to 1, the better the prediction. The AUC for all four models is classified as "very good" or better.

Table 3. The percentage of cancellations in a board as well as outside the board

|              | Organization A | Organization B | Organization C | Total  |
| In board     | 2.5%           | 0.0%           | 0.3%           | 0.4%   |
| Not in board | 97.5%          | 100.0%         | 99.7%          | 99.6%  |
| Total        | 100.0%         | 100.0%         | 100.0%         | 100.0% |

Table 4. The area under the curve of the four machine learning models

|            | Implemented and launched projects (y = 0) | Canceled (y = 1) | Total | AUC   |
| Pre Gate 1 | 1,004                                     | 1,447            | 2,451 | 0.942 |
| Gate 1     | 960                                       | 308              | 1,268 | 0.871 |
| Gate 2     | 734                                       | 118              | 852   | 0.859 |
| Gate 3     | 727                                       | 105              | 832   | 0.802 |

Table 5. False negative percentage of organizational agents and the machine learning models with thresholds of 0.5 and 0.33

|            | Organizational agents | Machine learning threshold 0.5 | Machine learning threshold 0.33 |
| Pre Gate 1 | 0.590                 | 0.103                          | 0.033                           |
| Gate 1     | 0.243                 | 0.142                          | 0.110                           |
| Gate 2     | 0.138                 | 0.073                          | 0.051                           |
| Gate 3     | 0.126                 | 0.082                          | 0.081                           |

The false negative percentages for both the current decisions of organizational agents and the machine learning models are presented in Table 5. Machine learning outperforms organizational agents on the false negative percentage by a factor of 1.5 to 5.7 for threshold 0.5 and 1.5 to 17.9 for threshold 0.33. Based on this, H3 is accepted. To test H4 we used the model created for pre gate 1 and evaluated it with data from gate 3. Since the models are now evaluated with different test data compared to the models created for H3, the AUC might differ from H3. The AUC as well as the four metrics (recall, accuracy, precision and false positive rate) are presented in Table 6. For the metrics, the threshold was set to 0.5. The gate 3 model returns a better AUC as well as better precision and a lower false positive rate. The recall of the Pre Gate 1 model is better. Based on the strong reduction of the recall we cannot accept H4 for this dataset.
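The threshold tuning used for Table 5 can be illustrated as follows: given predicted cancellation probabilities and true outcomes (both illustrative stand-ins below, not the study's data), the false negative percentage of Eq. (1) is evaluated at the two thresholds discussed above.

```python
import numpy as np

def false_negative_percentage(y_true, probs, threshold):
    """Eq. (1) for a given decision threshold on predicted probabilities."""
    pred_cancel = probs >= threshold
    fn = np.sum((y_true == 1) & ~pred_cancel)   # canceled, but predicted launch
    tn = np.sum((y_true == 0) & ~pred_cancel)   # launched, and predicted launch
    return fn / (fn + tn)

# y_true and probs would come from an evaluated model; random values here.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
probs = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=500), 0, 1)

for t in (0.5, 0.33):
    print(t, false_negative_percentage(y_true, probs, t))
```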


Table 6. AUC and metrics of the Pre Gate 1 and Gate 3 machine learning models

|            | AUC   | Recall | Accuracy | Precision | False positive rate |
| Pre Gate 1 | 0.800 | 0.581  | 0.884    | 0.529     | 0.073               |
| Gate 3     | 0.824 | 0.161  | 0.884    | 0.625     | 0.014               |

5 Conclusion

H1, more than 85% of gate requests on the agenda of a board is being approved, has been accepted based on a binomial test with N = 1,569 and p = 0.85. The cumulative probability of getting up to 1,430 approvals equals 1.00, which is sufficient to accept the hypothesis. H2, decisions of gatekeeping boards deliver a contribution of less than 5% of the total number of canceled initiatives, has also been accepted based on a binomial test with N = 9,118 and p = 0.05; 1.00 minus the probability of getting 39 or fewer cancellations is also 1.00, and based on this H2 was accepted. H3, machine learning can outperform organizational agents in predicting the likelihood of project cancellation, has been accepted based on the data in Table 5: machine learning outperforms organizational agents on the false negative percentage by a factor of 1.5 to 5.7 for threshold 0.5 and 1.5 to 17.9 for threshold 0.33. H4, machine learning performance improves when different models are used for each of the gates, has been rejected based on the data in Table 6. The rejection was mainly based on the strong reduction of the recall between the models.

6 Discussion and Managerial Implications

According to H1, gatekeeping boards approve most of the requests on the agenda. For practitioners, this creates more insight into the limitations of gatekeeping boards, as the decisions of these boards will only help in a modest way in canceling project calls. Furthermore, accepting H2 makes clear that these boards are, in the studied organizations, not in use as a main instrument to cancel initiatives; canceling projects in IPM is done outside the gates. The conclusion that gatekeeping boards are not a main cancellation instrument does not mean that these boards make only a limited contribution to the cancellation of projects. Practitioners could use the agenda of the board as an important cancellation instrument by blocking projects from the agenda in the preparation process for the boards. These projects blocked from the agenda are cancelled outside the board without the need of a cancellation by the gatekeeping board. A gatekeeping board can also support cancellation by defining clear thresholds: a project could be canceled outside the board as it becomes clear that it cannot pass the board-defined threshold of e.g. an IRR > 20%. With AUCs that can be tuned above 0.800, the generated machine learning models hold an interesting capability in predicting the likelihood of project cancellation. Machine learning could provide additional information that the organizational agents


can consider. E.g. gatekeepers could be informed of the probability of cancellation, so that each gate request would be accompanied by a figure between 0 and 1. Furthermore, they could be shown previous gate requests that got a similar valuation from the machine learning models; this enables gatekeepers to compare the probability of cancellation. Project leaders could run scenario analyses by changing features and studying the effects of those changes on the predictions. Machine learning could be a new category of instruments supporting IPM, complementary to the known instruments defined by e.g. Mauerhoefer, Strese and Brettel [23]. Currently, the performance of machine learning models and organizational agents is compared based on the false negative percentage. A high false negative percentage is an indication of sunk costs; reducing it could reduce sunk costs. For that, the machine learning models can be used to support gatekeeper decision making to reduce sunk costs.

7 Limitations and Future Research

The hypotheses are only tested on telecommunication companies operating in Europe. Innovations in this industry are almost completely executed using IT; no physical production processes nor any distribution optimizations were in scope. This limits the possibility to apply the findings to the manufacturing, logistics and services industries. The findings can probably be transferred more easily to the finance, IT or insurance industries. In case the machine learning findings are used to reduce the cancellation rate as studied in H3 and the counteractions are successful, then this could reduce the accuracy of the machine learning models. In future research, these countermeasures should be used to train the machine learning models. Future research could also focus on the self-fulfilling prophecy which happens in case organizations decide to cancel projects based on machine learning models predicting the cancellation. Adding additional insights into the decision process based on machine learning does not guarantee an improvement of the outcome of that process: projects are carried on even when there are clear indications that the project needs to be canceled [27]. Ajzen developed the theory of planned behavior [1], which can be used to explain organizational agents' behavior [31]. Behavior is, amongst others, influenced by the subjective norm, being the inputs received from other persons. This influence can, but is not per definition, aligned with rationality. Kahneman has done extensive research on the limitations of humans in acting in an optimized way [17]. Nevertheless, the usage of a conscious, rational thinking style outweighs the disadvantages of the limitations of rationality of organizational agents [11]. The machine learning models were only trained with project metadata such as the budget, NPV and time-to-market. The actual content of the project (what product the project is creating) was not an input parameter. In future research, this could be added to the features to improve the models. A further limitation of this study is that the false kills are left out of scope. False kills are ideas or projects that are canceled even though they would have been a success. Both gatekeepers, as well as machine learning models, are prone to false kills, but the data provided did not allow us to study this effect. A future study could focus on the canceled projects and on the false kills.


Finally, we only studied the cancellation process. In future research machine learning models could be tested to predict budget overruns, not delivering on time and even the percentage of expected value delivered. Acknowledgements. We would like to thank Stijn van Rozendaal for setting up together the first machine learning models as well as Ruben van der Linden and Dan Heinen for optimizing the models. Finally, we thank Dominik Mahr for his inputs and thoughts on this paper.

References 1. Ajzen, I.: The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 50(2), 179–211 (1991) 2. Barczak, G., Griffin, A., Kahn, K.B.: Perspective: trends and drivers of success in NPD practices: results of the 2003 PDMA best practices study. J. Prod. Innov. Manage 26(1), 3–23 (2009) 3. Björk, J., Karlsson, M.P., Magnusson, M.: Turning ideas into innovations - introducing demand-driven collaborative ideation. Int. J. Innov. Reg. Dev. 5(4–5), 429–442 (2014) 4. Bower, J.L.: Managing the Resource Allocation Process, Revised edn. Harvard Business Review Press, Boston (1986) 5. Cooper, R.G., Edgett, S.J., Kleinschmidt, E.J.: Benchmarking best NPD practices I. Res. Technol. Manage. 47(1), 31–43 (2004) 6. Cooper, R.G., Edgett, S.J., Kleinschmidt, E.J.: Benchmarking best NPD practices II. Res. Technol. Manage. 47(3), 50–59 (2004) 7. Cooper, R.G.: Fixing the fuzzy front end of the new product process: building the business case. CMA Mag. 71(8), 21–23 (1997) 8. Cooper, R.G.: Perspective: the stage-gate® idea-to-launch process-update, what’s new, and nextgen systems. J. Prod. Innov. Manage 25(3), 213–232 (2008) 9. Cooper, R.G.: Stage-gate systems: a new tool for managing new products. Bus. Horiz. 33(3), 44–54 (1990) 10. Cooper, R.G.: The invisible success factors in product innovation. J. Prod. Innov. Manage 16(2), 115–133 (1999) 11. Eliens, R., Eling, K., Gelper, S., Langerak, F.: Rational versus intuitive gatekeeping: escalation of commitment in the front end of NPD. J. Prod. Innov. Manage 35(6), 890–907 (2018) 12. Figueiredo, P.S., Loiola, E.: Ehancing new product development (NPD) portfolio performance by shaping the development funnel. J. Technol. Manage. Innov. 7(4), 20–35 (2012) 13. Griffin, A.: PDMA research on new product development practice: updating trends and benchmarking best practices. J. Prod. Innov. Manage 14(6), 429–458 (1997) 14. Grönlund, J., Rönnberg-Sjödin, D., Frishammer, J.: Open innovation and the stage-gate process: a revised model for new product development. Calif. Manage. Rev. 52(3), 106–131 (2010) 15. Hossain, M., Islam, K.M.Z.: Generating ideas on online platforms: a case study of “My Starbucks Idea”. Arab Econ. Bus. J. 10(2), 102–111 (2015) 16. IdeaStorm Homepage. http://www.ideastorm.com. Accessed 06 Aug 2018 17. Kahneman, D.: Thinking Fast and Slow. Penguin Books Ltd, London (2011) 18. Kester, L., Griffin, A., Hultink, E.J., Lauche, K.: Exploring portfolio decision-making processes. J. Prod. Innov. Manage 28(5), 641–661 (2011) 19. Kock, A., Heising, W., Gemünden, H.G.: Antecedents to decision-making quality and agility in innovation portfolio management. J. Prod. Innov. Manage 33(6), 670–686 (2016)


20. Markham, S.K., Ward, S.J., Aiman-Smith, L., Kingon, A.I.: The valley of death as context for role theory in product innovation. J. Prod. Innov. Manage 27(3), 402–417 (2010) 21. Markham, S.K.: The impact of front-end innovation activities on product performance. J. Prod. Innov. Manage 30(S1), 77–92 (2013) 22. Markowitz, H.M.: Portfolio Selection: Efficient Diversification of Investment. BookCrafters, Michigan (1959) 23. Mauerhoefer, T., Strese, S., Brettel, M.: The impact of information technology on new product development performance. J. Prod. Innov. Manage 34(6), 719–738 (2017) 24. McNally, R.C., Durmus, oˇglu, S.S., Calantone, R.J.: New product portfolio management decisions: antecedents and consequences. J. Prod. Innov. Manage 30(2), 245–261 (2012) 25. Meifort, A.: Innovation portfolio management: a synthesis and research agenda. Creativity Innov. Manage. 25(2), 251–269 (2016) 26. Powers, D.M.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. 2, 37–63 (2011) 27. Schmidt, J.B., Calantone, R.J.: Escalation of commitment during new product development. J. Acad. Mark. Sci. 30(2), 103–118 (2002) 28. Unger, B.N., Kock, A., Gemünden, H.G., Jonas, D.: Enforcing strategic fit of project portfolios by project termination: an empirical study on senior management involvement. Int. J. Proj. Manage. 30(6), 675–685 (2012) 29. Van der Pas, M., Furneaux, B.: Improving the predictability of it investment business value. In: 2015 ECIS proceedings, paper 190. AIS Electronic Library, Münster (2015) 30. Van der Pas, M., Van der Pas, N.: Escalation of commitment in NPD and cost saving IT projects. In: Nunes, M.B., Isaías, P., Powell P., Ravesteijn, P., Ongena, G. (eds.) 12th IADIS International Conference Information Systems 2019, Utrecht, Netherlands, pp. 276–280 (2019) 31. Van der Pas, M., Walczuch, R.: Behaviour of organisational agents to improve information management impact. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Advances in Intelligent Systems and Computing. Paper presented at the Science and Information Conference: Intelligent Computing, London, United Kingdom, pp. 774–788. Springer, Cham (2018)

End-to-End Memory Networks: A Survey

Raheleh Jafari1(B), Sina Razvarz2, and Alexander Gegov3

1 School of Design, University of Leeds, Leeds LS2 9JT, UK
[email protected]
2 Departamento de Control Automático, CINVESTAV-IPN (National Polytechnic Institute), Mexico City, Mexico
[email protected]
3 School of Computing, University of Portsmouth, Buckingham Building, Portsmouth PO1 3HE, UK
[email protected]

Abstract. Constructing a dialog system which can speak naturally with a human is considered a major challenge of artificial intelligence. The end-to-end dialog system is a primary research topic in the area of conversational systems. Since an end-to-end dialog system is built by learning a dialog policy from transactional dialogs to a defined extent, useful datasets are required for evaluating the learning procedures. In this paper, different deep learning techniques are applied to the Dialog bAbI datasets, and their performance on this dataset is analyzed. The performance results demonstrate that all the considered techniques attain decent precision on the Dialog bAbI datasets. The best performance is obtained utilizing the end-to-end memory network with a unified weight tying scheme (UN2N).

Keywords: Memory networks · Deep learning · Dialog bAbI dataset

1 Introduction

Instructing machines to converse like a human for real-world objectives is possibly one of the crucial challenges in artificial intelligence. In order to construct a meaningful conversation with a human, the dialog system is required to be capable of understanding natural language, making intelligent decisions and producing proper replies [1–3]. Dialog systems, also known as interactive conversational agents, communicate with humans through natural language in order to aid, supply information and entertain. They are utilized in an extensive range of applications, from technical support services to language learning tools [4,5]. Artificial intelligence techniques have been viewed as among the most efficient techniques in recent decades [6–17]. For example, fuzzy logic systems are broadly utilized to model systems characterized by vague and unreliable information [18–37]. In the artificial intelligence area [38,39], end-to-end dialog systems have attracted interest because of the recent progress of deep neural networks.


In [40], a gated end-to-end trainable memory network is proposed which learns in an end-to-end procedure without the utilization of any extra supervision signal. In [41], the original task is broken down into short tasks which are individually learned by the agent and then combined in order to perform the original task. In [42], a long short-term memory (LSTM) model is suggested which learns to interact with APIs on behalf of the user. In [43], a dynamic memory network is introduced which covers tasks for part-of-speech classification as well as question answering, and uses two gated recurrent units in order to carry out inference. In [44], a memory network has been implemented which needed supervision in every layer of the network. In [45], a set of four tasks for testing the capability of end-to-end dialog systems has been introduced which focuses on the domain of movie entities. In [46], a word-based method for dialog state tracking utilizing recurrent neural networks (RNNs) is proposed which needs less feature engineering. Even though neural network models include a small learning pipeline, they need a remarkable amount of training data. Gated recurrent unit (GRU) and LSTM units permit RNNs to deal with the longer texts needed for question answering. Additional advancements such as attention mechanisms, as well as memory networks, permit the network to focus on the most relevant facts. In this paper, the applications of different types of memory networks are studied on data from the Dialog bAbI datasets. The performance results demonstrate that all the considered techniques attain decent precision on the Dialog bAbI datasets. The best performance is obtained utilizing UN2N. The remainder of the article is organized as follows. In Sect. 2, different types of memory networks are demonstrated and explained in detail. Experimental results are given in Sect. 3. Section 4 concludes the work.

2 Memory Networks

2.1 End-to-End Memory Network with Single Hop

The end-to-end memory network (N2N) with a single hop has two story embeddings A and C, as well as a question embedding B, see Fig. 1. Matrix dot products are utilized in order to match each word in the story with each word in the question, which creates the attention. By passing the attention through a softmax layer, it is turned into a probability distribution over the words of the story. Afterward, these probabilities are applied to the story embedding C, and the sum of that with the question embedding B passes through a dense layer and the softmax prediction layer.
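The following minimal NumPy sketch is not from the paper; it illustrates the single-hop read-out described above under the assumption of per-sentence bag-of-words memories, with toy dimensions chosen only for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: V = vocabulary, d = embedding size, n = story sentences.
V, d, n = 50, 8, 6
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, V, d))      # story embeddings (A, C) and question embedding (B)
W = rng.normal(size=(d, V))               # final prediction matrix

story = rng.integers(0, V, size=(n, 4))   # n sentences of 4 word ids each
question = rng.integers(0, V, size=3)

m = A[story].sum(axis=1)                  # input memories, shape (n, d)
c = C[story].sum(axis=1)                  # output memories, shape (n, d)
u = B[question].sum(axis=0)               # embedded question, shape (d,)

p = softmax(m @ u)                        # attention over the story
o = p @ c                                 # weighted sum of output memories
answer = softmax((o + u) @ W)             # softmax prediction layer
```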

2.2 End-to-End Memory Network with Stacked Hops

The N2N architecture contains two major components: supporting memories and final answer prediction [47]. Supporting memories consist of a set of input and output memories represented by memory cells.


Fig. 1. End-to-end memory network with single hop

In complicated tasks with the requirement of multiple supporting memories, the model can be developed in order to contain more than one set of input-output memories by stacking a number of memory layers. Each memory layer in the model is called a hop, and the input of the (κ + 1)th hop is the output of the κth hop:

ũ^(κ+1) = õ^κ + ũ^κ   (1)

Each layer contains its own embedding matrices A^κ, C^κ, utilized in order to embed the inputs x̃_i. The prediction of the answer to the question q̃ is carried out by

ã = softmax(W(õ^κ + ũ^κ))   (2)

where ã is taken to be the predicted answer distribution, W (of size V × d) is considered to be a parameter matrix for the model to learn, and κ is the total number of hops. The N2N architecture with three hop operations is shown in Fig. 2. The hard max operations within each layer are substituted with a continuous weighting from the softmax. The method takes a discrete set of inputs x̃_1, ..., x̃_n which are stored in the memory, and a question q̃, and outputs a reply ã. The model can write all x̃ to the memory up to a fixed buffer size, and it obtains a continuous representation for x̃ and q̃. Afterward, the continuous representation is processed with multiple hops in order to generate ã. This permits backpropagation of the error signal through multiple memory accesses back to the input while training.
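A self-contained toy sketch of the stacked-hop read-out in Eqs. (1)–(2) (illustrative only; sizes, data and initialization are assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V, d, n, K = 50, 8, 6, 3                  # vocabulary, embedding size, story sentences, hops
rng = np.random.default_rng(0)
A = rng.normal(size=(K, V, d))            # per-hop input embeddings A^k
C = rng.normal(size=(K, V, d))            # per-hop output embeddings C^k
B = rng.normal(size=(V, d))               # question embedding
W = rng.normal(size=(d, V))               # answer prediction matrix

story = rng.integers(0, V, size=(n, 4))
question = rng.integers(0, V, size=3)

u = B[question].sum(axis=0)               # initial controller state u^1
for k in range(K):
    m = A[k][story].sum(axis=1)           # input memories of hop k
    c = C[k][story].sum(axis=1)           # output memories of hop k
    p = softmax(m @ u)                    # attention over the memories
    o = p @ c                             # read-out o^k
    u = o + u                             # Eq. (1): u^{k+1} = o^k + u^k
answer = softmax(u @ W)                   # Eq. (2): prediction after the last hop
```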

2.3 Gated End-to-End Memory Network

The gated end-to-end memory network (GN2N) is able to dynamically condition the memory reading operation on the controller state ũ^κ at every hop, see Fig. 3. In GN2N, (1) is reformulated as below [48],

T^κ(ũ^κ) = σ(W_T^κ ũ^κ + b_T^κ)   (3)

ũ^(κ+1) = õ^κ ⊙ T^κ(ũ^κ) + ũ^κ ⊙ (1 − T^κ(ũ^κ))   (4)

where W_T^κ and b_T^κ are taken to be the hop-specific parameter matrix and bias term for the κth hop, respectively. T^κ(x̃) is the transform gate for the κth hop. ⊙ is the Hadamard product.

Fig. 2. A three layer end-to-end memory network
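A minimal NumPy sketch of the gated hop update in Eqs. (3)–(4); the dimensions and random initialization are illustrative assumptions only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
rng = np.random.default_rng(1)
W_T, b_T = rng.normal(size=(d, d)), np.zeros(d)   # hop-specific transform-gate parameters

u = rng.normal(size=d)        # controller state u^k
o = rng.normal(size=d)        # memory read-out o^k of the current hop

t = sigmoid(W_T @ u + b_T)    # Eq. (3): transform gate
u_next = o * t + u * (1 - t)  # Eq. (4): gated update of the controller state
```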

Fig. 3. Gated end-to-end memory network

2.4 End-to-End Memory Networks with Unified Weight Tying

In [47], two kinds of weight tying are proposed for N2N, namely adjacent and layer-wise. The layer-wise approach shares the input and output embedding matrices across the various hops (i.e., A^1 = A^2 = ... = A^κ and C^1 = C^2 = ... = C^κ). The adjacent approach shares the output embedding of a given layer with the corresponding input embedding (i.e., A^(κ+1) = C^κ). Furthermore, the matrix W which predicts the answer, as well as the question embedding matrix B, are constrained as W^T = C^κ and B = A^1. In [48], a dynamic mechanism is designed which permits the model to choose the proper kind of weight tying on the basis of the input. Therefore, the embedding matrices are developed dynamically for every instance, which makes UN2N more efficient compared with N2N and GN2N, where the same embedding matrices are used for each input. In UN2N a gating vector z̃, described in (8), is used in order to develop the embedding matrices A^κ, C^κ, B, and W. The embedding matrices are influenced by the information transported by z̃ related to the input question ũ^0 and the context sentences in the story m̃_t. Therefore,

A^(κ+1) = A^κ ⊙ z̃ + C^κ ⊙ (1 − z̃)   (5)

C^(κ+1) = C^κ ⊙ z̃ + Ĉ^(κ+1) ⊙ (1 − z̃)   (6)

where ⊙ is taken to be the column element-wise multiplication operation, and Ĉ^(κ+1) is the unconstrained embedding matrix. In (5) and (6), a large value of z̃ leads UN2N towards the layer-wise approach and a small value of z̃ leads UN2N towards the adjacent approach. In UN2N, at first, the story is encoded by reading the memory one step at a time with a gated recurrent unit (GRU) as below,

h̃_(t+1) = GRU(m̃_t, h̃_t)   (7)

such that t is considered to be the recurrent time step, and m̃_t is taken to be the context sentence in the story at time t. Afterward, the following relation is defined,

z̃ = σ(W_z̃ [ũ^0; h̃_T] + b̃_z̃)   (8)

where h̃_T is the last hidden state of the GRU, which represents the story, W_z̃ is considered as a weight matrix, b̃_z̃ is a bias, σ is taken to be the sigmoid function, and [ũ^0; h̃_T] is the concatenation of ũ^0 and h̃_T. A linear mapping G ∈ R^(d×d) is added for updating the connection between memory hops as below,

ũ^(κ+1) = õ^κ + (G ⊙ (1 − z̃)) ũ^κ   (9)
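A toy NumPy sketch of the gate z̃ in Eq. (8) and the dynamic embedding update in Eqs. (5)–(6); the GRU story encoding is replaced here by a random final state h_T for brevity, and all sizes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d = 50, 8
rng = np.random.default_rng(2)

u0 = rng.normal(size=d)                  # initial controller state (embedded question)
h_T = rng.normal(size=d)                 # assumed final GRU state summarizing the story
W_z = rng.normal(size=(d, 2 * d))
b_z = np.zeros(d)

z = sigmoid(W_z @ np.concatenate([u0, h_T]) + b_z)    # Eq. (8)

A_k = rng.normal(size=(V, d))            # embeddings of hop k
C_k = rng.normal(size=(V, d))
C_unconstrained = rng.normal(size=(V, d))

A_next = A_k * z + C_k * (1 - z)                      # Eq. (5): column element-wise mix
C_next = C_k * z + C_unconstrained * (1 - z)          # Eq. (6)
```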

3 Experiments and Results

3.1 Experiment Setup

In this section, an extensive range of parameter settings along with dataset configurations is utilized in order to validate the techniques considered in this paper.

3.2 Task Explanations

The tasks in the dataset are divided into 5 groups, where each group focuses on a specific objective.

Task 1: Issuing API calls. The chatbot asks questions in order to fill the missing fields, and finally produces a valid corresponding API call. The questions asked by the bot collect the information needed to make the prediction possible.

Task 2: Updating API calls. In this part, users update their requests. The chatbot asks users whether they have finished their updates, then generates the updated API call.

Task 3: Demonstrating options. The chatbot provides options to users utilizing the corresponding API call.

Task 4: Generating additional information. Users can ask for the phone number and address, and the bot should use the knowledge base facts correctly in order to reply.

Task 5: Organizing entire dialogs. Tasks 1–4 are combined in order to generate entire dialogs.

For evaluating the capability of the techniques to deal with out-of-vocabulary (OOV) items, a set of test data is used which contains entities different from the training set. Task 6 is the Dialog State Tracking 2 task (DSTC-2) [49] with real dialogs, and only has one setup.

3.3 Experimental Results

Efficiency results on the Dialog bAbI tasks are demonstrated in Table 1 for seven of the most important techniques, namely rule-based systems, TF-IDF match, nearest neighbor, supervised embedding, N2N, GN2N, and UN2N. As shown in Table 1, the rule-based system has a high performance on tasks 1–5. However, its performance drops when dealing with the DSTC-2 task. The TF-IDF match has poor performance compared with the other methods, on both the simulated tasks 1–5 and the real data of task 6. The performance of the TF-IDF match with match-type features increases considerably but is still behind the nearest neighbor technique. Supervised embedding has higher performance compared with the TF-IDF match and the nearest neighbor technique. In task 1, supervised embedding is fully successful, but its performance drops in tasks 2–5, even with match-type features. The GN2N and UN2N models outperform the other methods on the DSTC-2 task and the Dialog bAbI tasks, respectively.
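As an illustration of the TF-IDF match baseline mentioned above, the following generic scikit-learn sketch (not the authors' implementation; candidate responses and dialog context are made-up toy strings) scores candidate responses by cosine similarity to the dialog context:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy candidate responses and a dialog context; the baseline picks the candidate
# whose TF-IDF vector is most similar to the context.
candidates = [
    "api_call italian rome four",
    "what price range are you looking for?",
    "sure, is there anything else?",
]
context = "i would like to book a table for four in rome with italian food"

vec = TfidfVectorizer()
X = vec.fit_transform(candidates + [context])
scores = cosine_similarity(X[-1], X[:-1]).ravel()
best = candidates[scores.argmax()]
```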


Table 1. The accuracy results of rule-based systems, TF-IDF, nearest neighbor, supervised embedding, N2N, GN2N, and UN2N methods

4 Conclusion

The end-to-end learning scheme is suitable for constructing dialog systems because of its simplicity in training as well as its effectiveness in model updating. In this paper, the applications of various memory networks are studied on data from the Dialog bAbI datasets. The performance results demonstrate that all the considered techniques attain decent precision on the Dialog bAbI datasets. The best performance is obtained utilizing UN2N. In order to evaluate the true performance of the considered methods, additional experiments are required utilizing large non-synthetic datasets.

References 1. Araujo, T.: Living up to the chatbot hype: The influence of anthropomorphic design cues and communicative agency framing on conversational agent and company perceptions. Comput. Hum. Behav. 85, 183–189 (2018) 2. Hill, J., Ford, W.R., Farreras, I.G.: Real conversations with artificial intelligence: a comparison between human-human online conversations and human-chatbot conversations. Comput. Hum. Behav. 49, 245–250 (2015) 3. Quarteroni, S.: A chatbot-based interactive question answering system. In: 11th Workshop on the Semantics and Pragmatics of Dialogue, pp. 83–90 (2007) 4. Young, S., Gasic, M., Thomson, B., Williams, J.D.: POMDP-based statistical spoken dialog systems: a review. Proc. IEEE 101, 1160–1179 (2013) 5. Shawar, B.A., Atwell, E.: Chatbots: are they really useful? LDV Forum 22, 29–49 (2007) 6. Dote, Y., Hoft, R.G.: Intelligent Control Power Electronics Systems. Oxford Univ. Press, Oxford (1998) 7. Mohanty, S.: Estimation of vapour liquid equilibria for the system carbon dioxidedifluoromethane using artificial neural networks. Int. J. Refrig. 29, 243249 (2006)


8. Razvarz, S., Jafari, R., Yu, W., Khalili, A.: PSO and NN Modeling for photocatalytic removal of pollution in wastewater. In: 14th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE) Electrical Engineering, pp. 1–6 (2017) 9. Jafari, R., Yu, W.: Artificial neural network approach for solving strongly degenerate parabolic and burgers-fisher equations. In: 12th International Conference on Electrical Engineering, Computing Science and Automatic Control (2015). https:// doi.org/10.1109/ICEEE.2015.7357914 10. Jafari, R., Razvarz, S., Gegov, A.: A new computational method for solving fully fuzzy nonlinear systems. In: Computational Collective Intelligence. ICCCI 2018. Lecture Notes in Computer Science, vol. 11055, pp. 503–512. Springer, Cham (2018) 11. Razvarz, S., Jafari, R.: ICA and ANN modeling for photocatalytic removal of pollution in wastewater. Math. Comput. Appl. 22, 38–48 (2017) 12. Razvarz, S., Jafari, R., Gegov, A., Yu, W., Paul, S.: Neural network approach to solving fully fuzzy nonlinear systems. In: Fuzzy modeling and control Methods Application and Research, pp. 45–68. Nova science publisher Inc., New York (2018). ISBN: 978-1-53613-415-5 13. Razvarz, S., Jafari, R.: Intelligent techniques for photocatalytic removal of pollution in wastewater. J. Elect. Eng. 5, 321–328 (2017). https://doi.org/10.17265/ 2328-2223/2017.06.004 14. Graupe, D.: Chapter 112. In: Chen, W., Mlynski, D.A. (eds.): Principles of Artificial Neural Networks. Advanced Series in Circuits and Systems, 1st edn. vol. 3, p. 4e189. World Scientific (1997) 15. Jafari, R., Yu, W., Li, X.: Solving fuzzy differential equation with Bernstein neural networks. In: IEEE International Conference on Systems, Man, and Cybernetics, Budapest, Hungary, pp. 1245–1250 (2016) 16. Jafari, R. Yu, W.: Uncertain nonlinear system control with fuzzy differential equations and Z-numbers. In: 18th IEEE International Conference on Industrial Technology, Canada, pp. 890–895 (2017). https://doi.org/10.1109/ICIT.2017.7915477 17. Jafarian, A., Measoomy, N.S., Jafari, R.: Solving fuzzy equations using neural nets with a new learning algorithm. J. Adv. Comput. Res. 3, 33–45 (2012) 18. Werbos, P.J.: Neuro-control and elastic fuzzy logic: capabilities, concepts, and applications. IEEE Trans. Ind. Electron. 40, 170180 (1993) 19. Jafari, R., Yu, W., Razvarz, S., Gegov, A.: Numerical methods for solving fuzzy equations: a Survey. Fuzzy Sets Syst. (2019). ISSN 0165-0114, https://doi.org/10. 1016/j.fss.2019.11.003 20. Kim, J.H., Kim, K.S., Sim, M.S., Han, K.H., Ko, B.S.: An application of fuzzy logic to control the refrigerant distribution for the multi type air conditioner. In: Proceedings of IEEE International Fuzzy Systems Conference, vol. 3, pp. 1350– 1354 (1999) 21. Wakami, N., Araki, S., Nomura, H.: Recent applications of fuzzy logic to home appliances. In: Proceedings of IEEE International Conference on Industrial Electronics, Control, and Instrumentation, Maui, HI, pp. 155–160 (1993) 22. Jafari, R., Razvarz, S.: Solution of fuzzy differential equations using fuzzy Sumudu transforms. In: IEEE International Conference on Innovations in Intelligent Systems and Applications, pp. 84–89 (2017)


23. Jafari, R., Razvarz, S., Gegov, A., Paul, S.: Fuzzy modeling for uncertain nonlinear systems using fuzzy equations and Z-numbers. In: Advances in Computational Intelligence Systems: Contributions Presented at the 18th UK Workshop on Computational Intelligence, Advances in Intelligent Systems and Computing, 5–7 September 2018, Nottingham, UK, vol. 840, pp. 66–107. Springer, Cham (2018) 24. Jafari, R., Razvarz, S.: Solution of fuzzy differential equations using fuzzy Sumudu transforms. Math. Comput. Appl. 23, 1–15 (2018) 25. Jafari, R., Razvarz, S., Gegov, A.: Solving differential equations with z-numbers by utilizing fuzzy Sumudu transform. In: Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol. 869, pp. 1125-1138. Springer, Cham (2019) 26. Yu, W., Jafari, R.: Modeling and Control of Uncertain Nonlinear Systems with Fuzzy Equations and Z-Number. IEEE Press Series on Systems Science and Engineering. Wiley-IEEE Press, John Wiley & Sons, Inc., Hoboken (2019). ISBN-13: 978-1119491552 27. Negoita, C.V., Ralescu, D.A.: Applications of Fuzzy Sets to Systems Analysis. Wiley, New York (1975) 28. Zadeh, L.A.: Probability measures of fuzzy events. J. Math. Anal. Appl. 23, 421– 427 (1968) 29. Zadeh, L.A.: Calculus of fuzzy restrictions. In: Fuzzy Sets and Their Applications to Cognitive and Decision Processes, pp. 1-39. Academic Press, New York (1975) 30. Zadeh, L.A.: Fuzzy logic and the calculi of fuzzy rules and fuzzy graphs. Multiple Valued Logic 1, 1–38 (1996) 31. Razvarz, S., Jafari, R.: Experimental study of Al2O3 nanofluids on the thermal efficiency of curved heat pipe at different tilt angle. J. Nanomater. 2018, 1–7 (2018) 32. Razvarz, S., Vargas-Jarillo, C., Jafari, R.: Pipeline monitoring architecture based on observability and controllability analysis. In: IEEE International Conference on Mechatronics (ICM), Ilmenau, Germany, vol. 1, pp. 420–423 (2019). https://doi. org/10.1109/ICMECH.2019.872287 33. Razvarz, S., Vargas-jarillo, C., Jafari, R., Gegov, A.: Flow control of fluid in pipelines using PID controller. IEEE Access 7, 25673–25680 (2019). https://doi. org/10.1109/ACCESS.2019.2897992 34. Razvarz, S., Jafari, R.: Experimental study of Al2O3 nanofluids on the thermal efficiency of curved heat pipe at different tilt angle. In: 2nd International Congress on Technology Engineering and Science, ICONTES, Malaysia (2016) 35. Jafari, R., Razvarz, S., Gegov, A.: Neural network approach to solving fuzzy nonlinear equations using Z-numbers. IEEE Trans. Fuzzy Syst. (2019). https://doi. org/10.1109/TFUZZ.2019.2940919 36. Jafari, R., Yu, W., Li, X., Razvarz, S.: Numerical solution of fuzzy differential equations with Z-numbers using Bernstein neural networks. Int. J. Comput. Intell. Syst. 10, 1226–1237 (2017) 37. Jafari, R., Yu, W., Li, X.: Numerical solution of fuzzy equations with Z-numbers using neural networks. In: Intelligent Automation and Soft Computing, pp. 1–7 (2017) 38. Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: Proceedings of ICML-2011 (2011) 39. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of CVPR-2015 (2015) 40. Liu, F., Perez, J.: Gated end-to-end memory networks. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pp. 1–10 (2017)


41. Bordes, A., Weston, J.: Learning end-to-end goal-oriented dialog, arXiv preprint. arXiv:1605.07683 (2016) 42. Williams, J.D., Zweig, G.: End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning, arXiv preprint arXiv:1606.01269 (2016) 43. Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Gulrajani, I., Socher, R.: Ask me anything: dynamic memory networks for natural language processing. In: Proceedings of ICML-2016 (2016) 44. Weston, J., Chopra, S., Bordes, A.: Memory networks. In: International Conference on Learning Representations (ICLR) (2015) 45. Dodge, J., Gane, A., Zhang, X., Bordes, A., Chopra, S., Miller, A.H., Szlam, A., Weston, J.: Evaluating prerequisite qualities for learning end-to-end dialog systems. In: Proceedings of ICLR-2016 (2016) 46. Henderson, M., Thomson, B., Young, S.: Word-based dialog state tracking with recurrent neural networks. In: Proceedings of SIGDIAL-2014 (2014) 47. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-end memory networks. In: Proceedings of Advances in Neural Information Processing Systems (NIPS 2015), Montreal, Canada, pp. 2440–2448 (2015) 48. Liu, F., Cohn, T., Baldwin, T.: Improving end-to-end memory networks with unified weight tying. In: Proceedings of the 15th Annual Workshop of The Australasian Language Technology Association (ALTW 2017), Brisbane, Australia, pp. 16–24 (2017) 49. Henderson, M., Thomson, B., Williams, J.D.: The second dialog state tracking challenge. In: Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2014), Philadelphia, USA, pp. 263–272 (2014)

Enhancing Credit Card Fraud Detection Using Deep Neural Network

Souad Larabi Marie-Sainte, Mashael Bin Alamir, Deem Alsaleh, Ghazal Albakri, and Jalila Zouhair(B)

College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
{slarabi,jzouhair}@psu.edu.sa, [email protected], [email protected], [email protected]

Abstract. With the development of technology, e-commerce has become an essential part of individuals' lives, allowing them to easily purchase and sell products over the internet. However, fraud attempts, specifically credit card fraud attacks, are rapidly increasing. Cards may be stolen, fake records may be used, and credit cards are subject to being hacked. Artificial Intelligence techniques tackle these credit card fraud attacks by identifying patterns that predict false transactions. Both Machine Learning and Deep Learning models are used to detect and prevent fraud attacks. Machine Learning techniques provide positive results only when the dataset is small and does not have complex patterns. In contrast, Deep Learning deals with huge and complex datasets. However, most of the existing studies on Deep Learning have used private datasets and therefore did not provide a broad comparative study. This paper aims to improve the detection of credit card fraud attacks using the Long Short-Term Memory Recurrent Neural Network (LSTM RNN) with a public dataset. Our proposed model proved to be effective: it achieved an accuracy rate of 99.4%, which is higher than other existing Machine and Deep Learning techniques.

Keywords: Recurrent Neural Network · Long Short-Term Memory · Deep Learning · Machine Learning · Fraud detection

1 Introduction

Due to the rapid development of the internet, online shopping is becoming a trend. It enables customers to shop and pay online. Online payments cannot always be trusted, however, as theft is an issue. Theft is a crime committed by stealing payment information such as credit card information. This can also be referred to as credit card fraud [1]. It can occur in various ways, such as using a stolen card, using a fake card record, or hacking a credit card by making a fake copy of it. In the United States, the occurrence of credit card fraud has been increasing since 2014. According to [2], the US reported a financial loss of 9.1 billion dollars due to online fraud attacks. Crimes related to credit card fraud can generally be detected and prevented through technology and the application of Artificial Intelligence (AI).


Machine Learning (ML) is one field of AI that aims to find patterns in data and extract useful information [3]. ML is widely used in credit card fraud detection, as it enables the detection of suspicious credit card transactions. Many techniques have been used, such as Decision Trees, Random Forests, Majority Voting, Artificial Neural Networks, and others [4, 5]. However, ML has limitations, such as the long period of time it takes to train the classifier when the data size is large [6]. Additionally, these techniques do not detect patterns that have biases in the provided training data [6]. Moreover, feature reduction is essential for ML. For these reasons, this research proposes using Deep Learning (DL) as a method of detecting credit card fraud. DL is another field used for data classification and prediction. It can perform tasks directly on raw data of huge size, reaching millions of records. It accomplishes its purpose by learning through a hierarchy of concepts, thus enabling computers to build complex concepts out of simpler ones. DL is applied through various techniques such as Deep Boltzmann Machines, Deep Feedforward Networks, Deep Neural Network based Hidden Markov Models (HMM), Deep Convolutional Networks, etc. Applications of DL techniques range from object detection and speech recognition to natural language processing [7]. These techniques have already been applied to credit card fraud detection, including Deep Autoencoders [8–10], Recurrent Neural Networks [5, 11], Convolutional Neural Networks [12], Deep Belief Networks [13], and others. The Recurrent Neural Network (RNN) is a common Deep Neural Network (DNN) technique that has proved its efficiency in many domains such as Natural Language Processing [15], image processing [16], and music classification [17]. However, in credit card fraud detection it has only been applied to a private dataset from an institutional bank. To the best of our knowledge, no study has applied the RNN to the Kaggle dataset and compared it with the state-of-the-art techniques. The Kaggle dataset is a well-known credit card dataset available at [18] and used in many studies (for example [19]). Hence, in this paper, we aim to investigate the effectiveness of applying the Long Short-Term Memory Recurrent Neural Network (LSTM RNN) to the Kaggle dataset to improve the detection rate of credit card fraudulent transactions. The rest of the paper is organized as follows. Section 2 highlights the related works addressing the detection of credit card fraud. In Sect. 3, the methodology and the dataset used in this paper are discussed. Section 4 presents the experimental results. Finally, Sect. 5 addresses the discussion and conclusion of this study.

2 Related Works

The research studies related to the detection of credit card fraud have been trending in the last few years. The following sections cover the research related to this area since 2016, based on ML and DL.

2.1 Machine Learning Related Works

For the purpose of detecting fraud in credit cards, the authors in [4] proposed two algorithms, namely the Random Forest and the Decision Tree. The aforementioned


methods were applied to the credit cards’ transactions of a Chinese e-Commerce company. Their classification method resulted in an accuracy rate of 98.67%. However, the authors mentioned the issue of imbalanced dataset used in their work. So, the accuracy can be improved by handling the dataset. Moreover, in [20], the authors compared four different methods for detecting credit cards fraud. They used the K-Nearest Neighbor (KNN), Random Tree, AdaBoost, and Logistic Regression. The dataset used was acquired from the UCI repository. The accuracy of the results achieved by the KNN was 96.9%, whilst Random Tree was 94.3%, the AdaBoost algorithm was 57.7%, and the Logistic Regression was 98.2%. The results revealed that the Logistic Regression method was the most effective model for credit card fraud detection. However, it was not deemed to be a practical method for detecting frauds at the time of the transaction. Furthermore, in [12], the authors combined the Back Propagation Neural Network (NN) algorithm with the Whale optimization algorithm (WOA) on the Kaggle dataset. The authors used WOA to optimize the weights of the NN and enhance its results. The proposed method yielded an accuracy rate of 96.4%. In [21], the authors proposed a fraud detection system based on the Neuro-Fuzzy expert system. It combined evidence obtained from a rule-based fuzzy inference system as well as a learning component that uses Back Propagation NN. The authors analyzed different transaction attributes and the deviation of a client’s behavior compared to their normal spending profile. Data was obtained from a synthetic dataset developed with a simulator. The authors did not mention the accuracy value of their work. However, in their results, they affirmed that the proposed system has higher accuracy compared to the previously proposed systems. Another model was proposed by [22]. The model used the Decision Trees with a combination of Luhn’s and Hunt’s algorithms to detect credit card fraud. It also used an address matching rule to check whether the customer’s billing address and shipping address matched or not. The authors did not mention any accuracy rate. Additionally, [23] developed and implemented a fraud detection system in a large e-tail merchant. The experiments were applied to real data obtained from the company. The authors compared three distinct algorithms, the Logistic Regression (LR), the Random Forests (RF), and the Support Vector Machines (SVM). The RF resulted in a classification accuracy of 93.5%, the SVM resulted in 90.6%, and the LR scored 90.7%. However, the proposed model demonstrated a limitation in terms of its integration with real retail systems and services. Incidentally, the model showed weakness due to the variant difference between the training and testing accuracy results. In [24], the authors used two Machine Learning techniques based on outliers’ detection to discover the credit card fraud called Local Outlier Factor and Isolation Forest algorithms. A private dataset from a Germany bank collected in 2006 was used. The authors preprocessed the dataset using the Principal Component Analysis. The algorithms reached an accuracy of 99.6% and precision of 28%. The authors explained that the low value of precision was due to the unbalanced dataset. In [10], the authors experimented with the use of a new model based on the linear mapping, nonlinear mapping, and probability, along with the use of the RUSMRN algorithm for imbalanced datasets. The data was retrieved in October 2005 from a bank


in Taiwan. The classification result is 79.73% which outperformed the RUS Boost, the AdaBoost, and the Naïve Bayes classifiers. In [13], the paper presented ANN to detect credit card fraud. To solve the issue of the imbalanced dataset, the Meta Cost algorithm was applied. The data was retrieved from a big Brazilian credit card issuer. However, the authors did not mention the accuracy of their results. 2.2 Deep Learning Related Works In [25], the authors used a three-layer Autoencoder to detect credit card fraud. Two datasets were collected from companies in Turkey. In order to evaluate their method, they calculated its accuracy and precision. The accuracy reached was 68% whilst the precision was shown to be more than 61%. The accuracy and precision of the results were low despite the fact that they were using the Autoencoder. Perhaps the datasets used were the reason for these low percentages. In [5], the authors used multiple classifiers to detect the credit card fraud, ANN, the RNN, the Long Short-term Memory (LSTM), and the Gated Recurrent Units (GRUs). The dataset used in this study was provided by a US financial institution engaged in the retail banking. The results of their classifiers are as follows: 88.9% for ANN, 90.433% for RNN, 91.2% for LSTM, and finally 91.6% for GRU. Although the authors achieved highly accurate results, they mentioned that there was an issue regarding the limited number of instances in the used dataset. In [19], the authors attempted to detect credit card fraud by using Autoencoders. The Kaggle dataset was used. To evaluate their model, they calculated the recall and the precision. The recall achieved 0.9 and the precision 0.009. Therefore, the accuracy of the results were not very high and needed improvement. The authors in [26] stated that since the fraud behavior is changing continuously, it is better to use unsupervised learning to detect it. They proposed using a model of Deep Autoencoder and Restricted Boltzmann Machine (RBM) that can identify both normal and suspicious transactions. Autoencoders is a DL algorithm for unsupervised learning. Their experiments were conducted on three datasets, the German (1,000 instances), the Australian (600 instances), and the European (284,807 instances) datasets. The results of their experiments demonstrated that DL is effective with huge datasets, since they achieved 96% accuracy rate for applying their method on the European dataset, while the accuracy for the other two datasets was around 50%. In [27], the authors proposed using DL methods to detect fraud behavior. They tried using several methods such as, Autoencoder, Restricted Boltzmann Machine, Variational Autoencoder, and Deep Belief Network. The authors applied their models to the European dataset (284,807 instances). The results of their work for Autoencoder, Variational Autoencoder, Restricted Boltzmann Machine, and Deep Belief Network achieved accuracy rates of 96.26%, 96.55%, 96.00%, and 96.31%, respectively. In [28], the authors aimed to detect credit card fraud by using the Deep Autoencoders Artificial Neural Networks. They applied the proposed model to the German Credit dataset achieving an accuracy result of 82%. However, the accuracy rate achieved was considered to be low compared to other studies that were conducted.


In [8], the authors aimed to detect credit card fraud transactions by applying Backpropagation DNN algorithm. Two open source DL libraries, known as TensorFlow and Scikit-learn were used. The authors chose Logistic Regression (LR) as the benchmark model, which yielded in accuracy of 96.04% demonstrating a better performance result on the test set than NN. Then, different DNN were tested with a different number of hidden layers. The average validation accuracy differed based on the learning rates, for learning rate 0.1 the accuracy was 96.27%, for learning rate 0.01 the accuracy was 99.59%, and for 0.001 the accuracy was 99.12%. This study shows that the learning rate value can enhance the classification accuracy. In [11], the authors used the Deep Belief Networks (DBN) with multilayer belief networks to detect credit card fraud. The authors used the Restricted Boltzmann Machines (RBM) as hidden layers in the model. The dataset used was the Markit which contains credit transactions from different regions in the United States. The authors did not mention the accuracy results of their experiment. However, they made a comparison between the proposed model, the SVM classifier, and the Multinomial Logistic Regression. The authors claimed that their approach accomplished the highest accuracy rate. In [30], the authors aimed to detect credit card fraud by using the Convolutional Neural Network (CNN). They used a credit card transactions dataset obtained from a commercial bank. Due to the imbalanced dataset, the authors used cost-based sampling on their experiments. The accuracy of the results was not mentioned. The authors in [29] applied Autoencoders Deep Learning to detect fraud in credit card used in insurance companies. They used a private data collected in September 2013 in Europe. Although the dataset was unbalanced, this technique achieved a high accuracy of 91% which outperformed the logistic regression algorithm that yielded an accuracy of 57%. The authors stressed that the Autoencoders algorithm is efficient in handling unbalanced datasets. To summarize then, the papers that have been discussed all possess a similar purpose in detecting credit card fraud. They used different classification methods which are based on ML (see Table 1) such as the ANN, the Decision Trees, the SVM, and many others. However, in ML, the classification accuracy is highly depending on the data prepossessing, cleaning, and feature selection. Moreover, DL techniques have been also used (see Table 2) including the Deep Autoencoders, the Deep Belief Networks, as well as the Restricted Boltzmann Machine, etc. DL can directly use raw data while achieving high results. As is evident from the literature that has been discussed, the DL models achieved higher accuracy rates than the ML techniques. On the other hand, most of the used datasets were obtained from either private banks or companies, which make any related studies closed and special. To the best of our knowledge, the RNN has not been used with a public dataset. In this research paper, the RNN and Kaggle dataset are used to enhance the classification accuracy of the present study and make the use of RNN open.

Table 1. Related works based on Machine Learning techniques

Ref | Year | Method | Dataset | Accuracy
[24] | 2019 | Local Outlier Factor and Isolation Forest algorithms | Germany bank dataset | 99.6%
[4] | 2018 | Random Forest Algorithm and Decision Tree | e-Commerce Chinese company dataset | 98.67%
[19] |  | KNN / Random Tree / AdaBoost / Logistic Regression | UCI dataset | 96.9% / 94.3% / 57.7% / 98.2%
[12] |  | Back Propagation Neural Network with Whale Optimization Algorithm | Kaggle | 96.4%
[21] | 2017 | Neuro Fuzzy based system | Synthetic datasets developed with a simulator | not mentioned
[22] |  | Decision Tree with Luhn's and Hunt's algorithms | – | not mentioned
[23] |  | Logistic Regression / Random Forest / Support Vector Machines | e-Commerce dataset | not mentioned
[10] | 2016 | RUS based on linear mapping, non-linear mapping, and probability | Payment data of Taiwanese bank | 79.73%
[13] |  | (CSNN) model based on Artificial Neural Networks (ANN) and Meta Cost procedure | Brazilian credit card issuer dataset | not mentioned

Table 2. Related works based on Deep Learning Techniques

Ref | Year | Method | Dataset | Accuracy
[29] | 2019 | Autoencoders Deep Learning | Private data | 91%
[27] | 2018 | Deep Autoencoder and Deep Networks | German Credit dataset | 82%
[5] |  | Artificial Neural Networks (ANNs) / Recurrent Neural Networks (RNNs) / Long Short-term Memory (LSTMs) / Gated Recurrent Units (GRUs) | Financial institution engaged in retail banking in the US | 88.9% / 90.433% / 91.2% / 91.6%
[14] |  | Deep Autoencoder | Kaggle | 90%
[25] |  | Deep Autoencoder with Restricted Boltzmann Machine | German dataset / Australian dataset / European dataset | 48% / 50% / 96%
[26] |  | Deep Autoencoder / Restricted Boltzmann Machine / Variational Autoencoder / Deep Belief Networks | European dataset | 96.26% / 96.55% / 96.00% / 96.31%
[28] | 2017 | Convolutional Neural Network (CNN) | Commercial bank credit card transaction dataset | Not mentioned
[8] |  | BP Deep Neural Networks | Not mentioned | 96%
[23] | 2016 | Deep Autoencoder | Turkish companies | 68%
[10] |  | Deep Belief Networks with Restricted Boltzmann Machine | Markit dataset | Not mentioned

3 Methodology

3.1 Recurrent Neural Network (RNN)

Deep Learning is derived from Machine Learning, where the machine learns from experience. Deep Learning uses raw data as input and then processes it in hierarchical levels of learning in order to obtain useful conclusions [7]. The Recurrent Neural Network was first introduced in the 1980s [31]. It is considered to be a supervised learning algorithm. The RNN processes input data in sequence loops passing through hidden layers, also known as state vectors, which store the history of past inputs in an internal state memory. While learning the output, it learns from the current input and from the previous data stored in the internal state memory. During each iteration, the RNN takes two inputs: the new input and the recently stored past input. For the output, the RNN can match one input to one output, one input to many outputs, many inputs to many outputs, or many inputs to one output. Figure 1 shows the structure of the RNN.

3.2 Long Short-Term Memory Recurrent Neural Network (LSTM RNN)

RNNs have the disadvantage of vanishing gradients, meaning that the RNN fails to derive new learning from past inputs and outputs, such as reading a sequence of input at time 0 and then reading another sequence of input at time 1. The second output should be derived from the earlier inputs and outputs. However, the RNN has a tendency to forget the first input at time 0 and then produces an output from the current input only. On account of this, the Long Short-Term Memory (LSTM) RNN was introduced in [9]. The LSTM has a long-term memory which helps the network to remember better.


Fig. 1. RNN structure

Fig. 2. LSTM RNN structure

This powerful characteristic contributes to enhancing the result of our problem. Figure 2 shows the architecture of the LSTM layer. The LSTM RNN has three steps to determine the weights' values:

1. Forget Gate Operation: This step takes the current input x at time t and the output h at time t−1, combines them, and passes them through a sigmoid into a single tensor called f_t.

f_t = σ(W_f [h_(t−1), x_t] + b_f)   (1)

Where: f_t: the tensor, h_(t−1): the output at time t−1, x_t: the input at time t, W_f: the weight for the forget gate, b_f: the bias vector. The output of this equation, f_t, will be between 0 and 1 due to the sigmoid operator. f_t is then multiplied by the internal state, and that is why the gate is called the forget gate. In the case of f_t = 0, the previous internal state is completely forgotten, but if f_t = 1, it passes through unaltered.

2. Update Gate Operation: The LSTM joins values from the current and previous steps, and then applies the joint data to the tanh function. After that, the LSTM chooses which data to pick from the tanh results and updates them using the update gate.

C̃_t = tanh(W_i [h_(t−1), x_t] + b_i)   (2)

Where: C̃_t: the new candidate cell state, tanh: the ratio of the corresponding hyperbolic sine and hyperbolic cosine, W_i: the weight for the input, x_t: the input at time t, h_(t−1): the previous output, b_i: the bias vector.

3. Output Gate Operation: In this step, the old cell state C_(t−1) is updated into the new cell state C_t. The old state is multiplied by f_t to forget the previous state. Afterward, i_t ∗ C̃_t is added as the new candidate values, scaled by how much we want to update each state value.

C_t = f_t ∗ C_(t−1) + i_t ∗ C̃_t   (3)

Where: C_t: the new cell state, C_(t−1): the old cell state, f_t: the tensor, i_t ∗ C̃_t: the new candidate values scaled by an update value. The last step is to decide what to output. The output is based on the cell state. A sigmoid function decides what part of the state is presented as an output.

O_t = σ(W_o [h_(t−1), x_t] + b_o)   (4)

h_t = O_t ∗ tanh(C_t)   (5)

Where: O_t: the output, W_o: the weight for the output, h_(t−1): the previous output at time t−1, x_t: the input at time t, b_o: the bias vector. The cell state goes through tanh to push the values between −1 and 1. The result is multiplied by the sigmoid gate. Only selected candidates will be used as an output.
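As an illustration of Eqs. (1)–(5), the following minimal NumPy sketch performs one LSTM step. It is not part of the original paper: the input gate i_t, which the text uses but does not define explicitly, is assumed to be computed like the other gates, and the candidate weights from Eq. (2) are renamed W_c to keep the gates distinct.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_hid = 4, 3                       # illustrative input and hidden sizes
rng = np.random.default_rng(0)

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_c, W_o = rng.normal(size=(4, d_hid, d_hid + d_in))
b_f = b_i = b_c = b_o = np.zeros(d_hid)

h_prev, C_prev = np.zeros(d_hid), np.zeros(d_hid)
x_t = rng.normal(size=d_in)
hx = np.concatenate([h_prev, x_t])

f_t = sigmoid(W_f @ hx + b_f)            # Eq. (1): forget gate
i_t = sigmoid(W_i @ hx + b_i)            # assumed input gate (not spelled out in the text)
C_cand = np.tanh(W_c @ hx + b_c)         # Eq. (2): candidate cell state
C_t = f_t * C_prev + i_t * C_cand        # Eq. (3): new cell state
o_t = sigmoid(W_o @ hx + b_o)            # Eq. (4): output gate
h_t = o_t * np.tanh(C_t)                 # Eq. (5): new hidden state / output
```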

4 Experimental Study

For the experiment, Python 3 was used to implement the model through the KERAS library. The experiment was conducted using a MacBook Pro with a 2.3 GHz Intel Core i5 processor and a memory capacity of 8 GB (2133 MHz LPDDR3).

4.1 The Dataset

The data, obtained from [18], contains information about credit card transactions made in September 2013 by European users. The dataset is highly imbalanced since it only has 39206 fraud instances out of 284,807 transactions. Due to confidentiality, the original features are not mentioned in the dataset and were replaced with different names (V1, V2, …, V28). The dataset contains only numerical attributes (features) as a result of the Principal Component Analysis (PCA) transformation that was applied to all features except for Time (i.e. the number of seconds which elapses between the current transaction and the first transaction in the dataset), Amount (the transaction amount), and Class (Class: 1 for a fraud transaction with 39206 instances, 0 for a normal transaction with 246,102 instances).
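A minimal sketch of loading the data and producing the 70/15/15 split described later; it is not from the paper, and the local file name creditcard.csv, the stratification, and the random seed are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed local copy of the Kaggle credit card fraud dataset.
df = pd.read_csv("creditcard.csv")

X = df.drop(columns=["Class"]).values
y = df["Class"].values

# 70% training, then split the remaining 30% equally into validation and test (15% / 15%).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```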


4.2 Experiment Settings

At first, the data is divided into 70% training, 15% testing, and 15% validation. The classifier is based on three hidden layers: the LSTM layer, which has 12 neurons, a Flatten layer, and a Dense layer. The LSTM and Dense layers can have different parameters related to the kernel initializer, activation function, and others. In contrast, the Flatten layer has no parameters and is only used to resolve the dimensionality obstacle, meaning that the data needs to become accessible to the Dense layer. Different parameters are added to the hidden layers to avoid the issue of overfitting and to construct a powerful classifier. To find the best combination of parameters, several experiments are conducted using different combinations of parameters for each of the LSTM and Dense layers. After repeating each experiment many times, we decided to set the number of epochs to 50. This guaranteed that the accuracy of the results obtained in that way cannot be improved upon further. To save space, Table 3 summarizes only the best-found parameters that are used for the classification. First, the loss function is used for the purpose of penalizing false classifications. Incidentally, minimizing the loss function leads to maximizing the accuracy rate. Different loss functions were used to determine the most suitable one. From the conducted experiments, the Mean Square Error (mse) loss function is the best match and raised the classification accuracy. It works by summing the squared distance between the target and predicted values. In addition, the kernel initializer was also used to randomly initialize the weights of the layers. These initialized values are then passed to the LSTM and Dense layers. Each layer has its own kernel_initializer, including uniform, lecun_uniform, glorot_normal, and others. The best kernel_initializer in this study is the uniform one, which is based on the uniform distribution within [-limit, limit], where limit is sqrt(3/fan_in) and fan_in is the number of input units. Moreover, the activation function is used to determine the output of the layer. Many activation functions are available, such as tanh, sigmoid, relu, softmax, and others. In this study, the tanh activation function is used for both layers, as previously explained in Sect. 3.2. Furthermore, an optimizer is required for compiling the Keras model. There are many available optimizers such as rmsprop, Adam, Adagrad, Adadelta, etc. The rmsprop, Adagrad, and Adam optimizers provided better accuracy results. However, rmsprop is deemed to be the best in this cohort.

4.3 Results and Discussion

After implementing the LSTM RNN with the highlighted parameters in Table 3, the resulting testing accuracy rate is found to be 99.4%, whilst the training accuracy rate is 99.97%. Firstly, we can notice that the training accuracy is very high. Secondly, the testing accuracy remained high, with the difference between the training and testing accuracy being only 0.57%. In fact, the LSTM RNN aims to remember the past information needed to predict the output in the long or the short term. This characteristic is important when several layers exist. This is a crucial benefit which the plain RNN does not possess. Moreover, the LSTM RNN can outperform a CNN when the data does not possess a hierarchical structure needed for the prediction. This was exactly our case, as the dataset used in this study does not contain any hierarchical structure.
In addition, predicting the incidence of credit card fraud does not require hierarchical information.
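For concreteness, the following sketch mirrors the configuration described in Sect. 4.2. It is not from the paper: it assumes a tf.keras environment, reuses the X_train/X_val split from the earlier loading sketch, and treats each transaction as a length-1 sequence so it can be fed to the LSTM.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Flatten, Dense

# Assumption: reshape each transaction into a sequence of length 1.
X_train_seq = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_val_seq = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))

model = Sequential([
    LSTM(12, activation="tanh", kernel_initializer="uniform",
         return_sequences=True, input_shape=(1, X_train.shape[1])),
    Flatten(),
    Dense(1, activation="tanh", kernel_initializer="uniform"),
])
# Loss, initializer, activation, optimizer and epochs follow the choices reported in Sect. 4.2.
model.compile(optimizer="rmsprop", loss="mse", metrics=["accuracy"])
model.fit(X_train_seq, y_train, validation_data=(X_val_seq, y_val), epochs=50)
```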


Table 3. Experimentation Results

# | Epoch | Layer | Loss function | kernel_initializer | Activation | Optimizer | Training | Validation
1 | 50 | LSTM | mse | lecun_uniform | tanh | rmsprop | 99.6 | 99.53
  |    | Dense |     | lecun_uniform | tanh |         |      |
2 | 50 | LSTM | mse | uniform | tanh | rmsprop | 99.97 | 99.97
  |    | Dense |     | uniform | tanh |         |       |
3 | 50 | LSTM | mse | lecun_uniform | tanh | Adam | 99.4 | 99.4
  |    | Dense |     | lecun_uniform | tanh |      |      |
4 | 50 | LSTM | mse | uniform | tanh | Adagrad | 99.4 | 99.4
  |    | Dense |     | uniform | tanh |         |      |

Secondly, the achieved accuracy result (testing accuracy) was higher than the accuracy of the state-of-the-art methods discussed above (see Table 4). The LSTM RNN yielded a better result than the Back Propagation Neural Network combined with an optimization method; the accuracy is shown to increase by 3%. Moreover, the accuracy obtained by the Deep Autoencoder is also improved upon by 9.4%.

Table 4. Comparison with the related works

Ref | Year | Method | Dataset | Accuracy
[12] | 2018 | Back propagation neural network with whale optimization algorithm | Kaggle | 96.4%
[32] | 2003 | Frequency domain | Kaggle | 77%
[32] | 2003 | Random Forest | Kaggle | 95%
[14] | 2018 | Deep encoder | Kaggle | 90%
 |  | LSTM RNN | Kaggle | 99.4%

To validate the performance of the proposed method, the study presented in [32] is used for comparison. In that study, two models are proposed, namely the Frequency domain and the Random Forest. The same dataset is used along with 10-fold cross validation. The accuracy reached 77% and 95%, respectively, for the two models (as shown in Table 4), which is less than that obtained by the proposed method. This proves the effectiveness of the LSTM RNN in credit card fraud detection.

5 Conclusion

To conclude, the application of Deep Learning techniques in credit card fraud detection has proved to be effective.


The LSTM RNN outperformed both the Machine Learning and Deep Learning techniques that have been previously applied in credit card fraud detection. Though the data is imbalanced, the achieved results are high. For future work, the data will be balanced using the Synthetic Minority Over-sampling Technique (SMOTE) algorithm. Furthermore, a larger dataset will be used in order to achieve higher results. Comparisons with different state-of-the-art methods will be performed using a common comparison framework.

Acknowledgment. We would like to acknowledge the Artificial Intelligence and Data Analytics (AIDA) Lab, Prince Sultan University, Riyadh, Saudi Arabia for supporting this work.

References 1. Inc. US Legal. Internet fraud law and legal definition 2. U.S. payment card fraud losses by type 2018 | statistic 3. Mohammed, M., Khan, M.B., Bashier, E.B.M.: Machine Learning: Algorithms and Applications. Crc Press, Boca Raton (2016) 4. Xuan, S., Liu, G., Li, Z., Zheng, L., Wang, S., Jiang, C.: Random forest for credit card fraud detection. In 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), pp. 1–6. IEEE (2018) 5. Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., Beling, P.: Deep learning detecting fraud in credit card transactions. In: Systems and Information Engineering Design Symposium (SIEDS), pp. 129–134 (2018) 6. Machine Learning: the Power and Promise of Computers That Learn by Example. The Royal Society, 2017, Machine Learning: the Power and Promise of Computers That Learn by Example. royalsociety.org/~/media/policy/projects/machine-learning/publications/machinelearning-report.pdf 7. Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning, vol. 1. MIT press, Cambridge (2016) 8. Lu, Y.: Deep Neural Networks and Fraud Detection (2017) 9. Gandhi, R.: Introduction to Sequence Models - RNN, Bidirectional RNN, LSTM, GRU. Towards Data Science, Towards Data Science, 26 June 2018. towardsdatascience.com/introd uction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15 10. Charleonnan, A.: Credit card fraud detection using RUS and MRN algorithms. In: Management and Innovation Technology International Conference (MITicon), 2016, pp. MIT-73. IEEE (2016) 11. Luo, C., Desheng, W., Dexiang, W.: A deep learning approach for credit scoring using credit default swaps. Eng. Appl. Artif. Intell. 65, 465–470 (2017) 12. Wang, C., Wang, Y., Ye, Z., Yan, L., Cai, W., Pan, S.: Credit card fraud detection based on whale algorithm optimized bp neural network. In: 2018 13th International Conference on Computer Science & Education (ICCSE), pp. 1–4. IEEE (2018) 13. Ghobadi, F., Rohani, M.: Cost sensitive modeling of credit card fraud using neural network strategy. In: International Conference of Signal Processing and Intelligent Systems (ICSPIS), pp. 1–5. IEEE (2016) 14. Zhang, X.-Y., Yin, F., Zhang, Y.-M., Liu, C.-L., Bengio, Y.: Drawing and recognizing Chinese characters with recurrent neural network. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 849–862 (2018) 15. Yogatama, D., Dyer, C., Ling, W., Blunsom, P.: Generative and discriminative text classification with recurrent neural networks. arXiv preprint arXiv:1703.01898 (2017)



16. Toderici, G., Vincent, D., Johnston, N., Hwang, S.J., Minnen, D., Shor, J., Covell, M.: Full resolution image compression with recurrent neural networks. In: CVPR, pp. 5435–5443 (2017) 17. Choi, K., Fazekas, G., Sandler, M., Cho, K.: Convolutional recurrent neural networks for music classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2392–2396. IEEE (2017) 18. Kaggle: Your home for data science. https://www.kaggle.com 19. Tom SweersCredit Card Fraud. Radboud University, Bachelor thesis Computer Science (2018) 20. Naik, H.: Credit card fraud detection for online banking transactions. Int. J. Res. Appl. Sci. Eng. Technol. 6(4), 4573–4577 (2018) 21. Behera, T.K., Panigrahi, S.: Credit card fraud detection using a neuro-fuzzy expert system. In: Computational Intelligence in Data Mining, pp. 835–843. Springer, Singapore (2017) 22. Save, P., Tiwarekar, P., Jain, K.N., Mahyavanshi, N.: A novel idea for credit card fraud detection using decision tree. Int. J. Comput. Appl. 161(13) (2017) 23. Carneiro, N., Figueira, G., Costa, M.: A data mining based system for credit-card fraud detection in e-tail. Decis. Support Syst. 95, 91–101 (2017) 24. Maniraj, S.P., Aditya, S., Shadab, A., Deep Sarkar, S.: Credit card fraud detection using machine learning and data science. Int. J. Eng. Res. Technol. (IJERT) 08(09), September 2019 25. Renstrom, M., Holmsten, T.: Fraud Detection on Unlabeled Data with Unsupervised Machine Learning. The Royal Institute of Technology (2018) 26. Pumsirirat, A., Yan, L.: Credit card fraud detection using deep learning based on auto-encoder and restricted boltzmann machine. Int. J. Adv. Comput. Sci. Appl. 9(1), 18–25 (2018) 27. Reshma, R.S.: Deep learning enabled fraud detection in credit card transactions. Int. J. Res. Sci. Innov. (IJRSI) V(VII), 111–115 (2018) 28. Kazemi, Z., Zarrabi, H.: Using deep networks for fraud detection in the credit card transactions. In: 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), pp. 0630–0633. IEEE (2017) 29. Al-Shabi, M.: Credit card fraud detection using autoencoder model in unbalanced datasets. J. Adv. Math. Comput. Sci. 33(5), 1–16 (2019). https://doi.org/10.9734/jamcs/2019/v33i53 0192 30. Fu, K., Cheng, D., Tu, Y., Zhang, L.: Credit card fraud detection using convolutional neural networks. In: International Conference on Neural Information Processing, pp. 483–490. Springer, Cham (2016) 31. Pozzolo, A.D., Caelen, O., Johnson, R.A., Bontempi, G.: Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166. IEEE (2015) 32. Financial Transaction Card Originated Messages, document ISO 8583-1 (2003). https://www. iso.org/standard/31628.html

Non-linear Aggregation of Filters to Improve Image Denoising

Benjamin Guedj1,2(B) and Juliette Rengot3

1 Inria, Paris, France
[email protected]
2 University College London, London, UK
3 Ecole des Ponts ParisTech, Paris, France
[email protected]
https://bguedj.github.io

Abstract. We introduce a novel aggregation method to efficiently perform image denoising. Preliminary filters are aggregated in a non-linear fashion, using a new metric of pixel proximity based on how the pool of filters reaches a consensus. We provide a theoretical bound to support our aggregation scheme, its numerical performance is illustrated and we show that the aggregate significantly outperforms each of the preliminary filters.

Keywords: Image denoising · Statistical aggregation · Ensemble methods · Collaborative filtering

1 Introduction

Denoising is a fundamental question in image processing. It aims at improving the quality of an image by removing the parasitic information that randomly adds to the details of the scene. This noise may be due to image capture conditions (lack of light, blurring, wrong tuning of field depth, . . . ) or to the camera itself (increase of sensor temperature, data transmission error, approximations made during digitization, . . . ). Therefore, the challenge consists in removing the noise from the image while preserving its structure. Many denoising methods have been introduced in the past decades – while good performance has been achieved, denoised images still tend to be too smooth (some details are lost) and blurred (edges are less sharp). Improving the performance of these algorithms is a very active research topic. The present paper introduces a new approach for denoising images, by bringing to the computer vision community ideas developed in the statistical learning literature. The main idea is to combine different classical denoising methods to obtain several predictions of the pixel to denoise. As each classical method has pros and cons and is more or less efficient according to the kind of noise or to the image structure, an asset of our method is that it makes the best out of each method's strong points, exploiting the "wisdom of the crowd". We adapt the



strategy proposed by the algorithm "COBRA - COmBined Regression Alternative" [2,10] to the specific context of image denoising. This algorithm is implemented in the python library pycobra, available at https://pypi.org/project/pycobra/.

Aggregation strategies may be rephrased as collaborative filtering, since information is filtered by using a collaboration among multiple viewpoints. Collaborative filters have already been exploited in image denoising. [8] used them to create one of the best-performing denoising algorithms: block-matching and 3D collaborative filtering (BM3D). It puts together similar patches (2D fragments of the image) into 3D data arrays (called "groups"). It then produces a 3D estimate by jointly filtering grouped image blocks. The filtered blocks are placed again in their original positions, providing several estimations for each pixel. The information is aggregated to produce the final denoised image. This method is praised for preserving fine details well. Moreover, [13] proved that the visual quality of denoised images can be increased by adapting the denoising treatment to the local structures. They proposed an algorithm, based on BM3D, that uses different non-local filtering models in edge or smooth regions. Collaborative filters have also been combined with neural network architectures, by [18], to create new denoising solutions.

When several denoising algorithms are available, finding the relevant aggregation has been addressed by several works. [16] focused on the analysis of patch-based denoising methods and shed light on their connection with statistical aggregation techniques. [6] proposed a patch-based Wiener filter which exploits patch redundancy. Their denoising approach is designed for near-optimal performance and reaches high denoising quality. Furthermore, [17] showed that usual patch-based denoising methods are less efficient on edge structures. The COBRA algorithm differs from the aforementioned techniques, as it combines preliminary filters in a non-linear way. COBRA has been introduced and analysed by [2].

The paper is organized as follows. We present our aggregation method, based on the COBRA algorithm, in Sect. 2. We then provide a thorough numerical experiments section (Sect. 3) to assess the performance of our method, along with an automatic tuning procedure of preliminary filters as a byproduct.

2 The Method

We now present an image denoising version of the COBRA algorithm [2,10]. For each pixel p of the noisy image x, we may call on M different estimators (f_1, ..., f_M). We aggregate these estimators by taking a weighted average of the intensities:

f(p) = \frac{\sum_{q \in x} \omega(p,q)\, x(q)}{\sum_{q \in x} \omega(p,q)},   (1)

and we define the weights as

\omega(p,q) = \mathbf{1}\left\{ \sum_{k=1}^{M} \mathbf{1}\big(|f_k(p) - f_k(q)| \le \varepsilon\big) \ge M\alpha \right\},   (2)



Fig. 1. General model

where ε is a confidence parameter and α ∈ (0, 1) a proportion parameter. Note that while f is linear with respect to the intensity x, it is non-linear with respect to each of the preliminary estimators f_1, . . . , f_M. These weights mean that, to denoise a pixel p, we average the intensities of pixels q such that a proportion of at least α of the preliminary estimators f_1, . . . , f_M have the same value at p and at q, up to a confidence level ε. Let us emphasize here that our procedure averages the pixels' intensities based on these weights (which involve this consensus metric). The intensity predicted for each pixel p of the image is f(p) and the COBRA-denoised image is the collection of pixels {f(p), p ∈ x}. This aggregation strategy is implemented in the python library pycobra [10]. The general scheme is presented in Fig. 1, and the pseudo-code in Algorithm 1. Users can control the number of used features thanks to the parameter "patch size". For each pixel p to denoise, we consider the image patch, centered on p, of size (2 · patch size + 1) × (2 · patch size + 1). In the experiments section, patch size = 2 is usually a satisfying value. Thus, for each pixel, we construct a vector of nine features.

The COBRA aggregation method has been introduced by [2] in a generic statistical learning framework, and is supported by a sharp oracle bound. For the sake of completeness, we reproduce here one of the key theorems.

Theorem 1 (adapted from Theorem 2.1 in [2]). Assume we have M preliminary denoising methods. Let |x| denote the total number of pixels in image x, and let \varepsilon \propto |x|^{-1/(M+2)}. Let f^\star denote the perfectly denoised image and \hat f the COBRA aggregate defined in (1). Then

\mathbb{E}\big[\hat f(p) - f^\star(p)\big]^2 \;\le\; \min_{m=1,\dots,M} \mathbb{E}\big[f_m(p) - f^\star(p)\big]^2 + C\,|x|^{-\frac{2}{M+2}},   (3)

where C is a constant and the expectations are taken with respect to the pixels.



Algorithm 1. Image denoising with COBRA aggregation
INPUT:
  im_noise = the noisy image to denoise
  psize    = the pixel patch size to consider
  M        = the number of COBRA machines to use
OUTPUT:
  Y = the denoised image

  Xtrain ← training images with artificial noise
  Ytrain ← original training images (ground truth)
  cobra  ← initial COBRA model
  cobra  ← adjust the COBRA model parameters with respect to the data (Xtrain, Ytrain)
  cobra  ← load the M COBRA machines
  cobra  ← aggregate the predictions
  Xtest  ← feature extraction from im_noise into a matrix of size (nb_pixels, (2·psize + 1)^2)
  Y      ← prediction of Xtest by cobra
  Y      ← add to Y the im_noise values lost at the borders of the image because of the patch processing

What Theorem 1 tells us is that, on average over all the image's pixels, the quadratic error between the COBRA-denoised image and the perfectly denoised image is upper bounded by the best (i.e., minimal) such error among the preliminary pool of M denoising methods, up to a remainder term which decays to zero as a negative power of the number of pixels. As highlighted in the numerical experiments reported in the next section, M is of the order of 5–10 machines and this remainder term is therefore expected to be small in most useful cases for COBRA. Note that in (3), the leading constant (in front of the minimum) is 1: the oracle inequality is said to be sharp. Note also that, contrary to more classical aggregation or model selection methods, COBRA matches or outperforms the best preliminary filter's performance even though it does not need to identify this champion filter. As a matter of fact, COBRA is adaptive to the pool of filters, as the champion is not needed in (1). More comments on this result, and proofs, are presented in [2].
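For illustration, the consensus-weighted average of (1)-(2) can be written compactly with NumPy. This is only a minimal sketch, not the authors' implementation (which relies on pycobra); the function and variable names are ours, preliminary_filters is assumed to be a list of callables mapping a noisy image to a denoised one, and the default epsilon and alpha values are the ones reported in Sect. 3.4. The quadratic loop over pixel pairs is kept for readability, not speed.

import numpy as np

def cobra_denoise(noisy, preliminary_filters, epsilon=0.2, alpha=4/7):
    """Aggregate preliminary filters with the consensus weights of Eq. (1)-(2).

    noisy               : 2-D float array (the noisy image x)
    preliminary_filters : list of callables, each returning a denoised image f_k
    epsilon             : confidence parameter
    alpha               : required proportion of agreeing filters
    """
    M = len(preliminary_filters)
    # Stack the M preliminary estimates, shape (M, number of pixels).
    F = np.stack([f(noisy).ravel() for f in preliminary_filters])
    x = noisy.ravel()
    out = np.empty_like(x)
    for p in range(x.size):
        # For every pixel q, count how many filters agree with pixel p up to epsilon.
        agree = (np.abs(F - F[:, p:p + 1]) <= epsilon).sum(axis=0)
        w = agree >= alpha * M               # indicator weights of Eq. (2)
        out[p] = x[w].mean() if w.any() else x[p]
    return out.reshape(noisy.shape)

In the paper the weights are computed on a separate data split and on patch features rather than on the full image (see Sects. 2 and 3.3); the sketch above only mirrors the formula itself.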

3 Numerical Experiments

This section illustrates the behaviour of COBRA. All code material (in Python) to replicate the experiments presented in this paper is available at https://github.com/bguedj/cobra_denoising.

3.1 Noise Settings

We artificially add some disturbances to good quality images (i.e. without noise). We focus on five classical settings: the Gaussian noise, the salt-and-pepper noise, the Poisson noise, the speckle noise and the random suppression of patches (summarised in Fig. 2).

Fig. 2. The different kinds of noise used in our experiments.
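For reference, the five corruptions can be reproduced, for instance, with scikit-image's random_noise utility plus a small helper for patch suppression. This is a hedged sketch: the helper name is ours and the parameter values are placeholders (the values actually used in the paper are given in Sect. 3.6).

import numpy as np
from skimage.util import random_noise

def corrupt(img, kind, rng=np.random.default_rng(0)):
    """Return a noisy copy of img (float image in [0, 1])."""
    if kind == "gaussian":
        return random_noise(img, mode="gaussian", var=0.01)
    if kind == "salt_pepper":
        return random_noise(img, mode="s&p", amount=0.1, salt_vs_pepper=0.2)
    if kind == "poisson":
        return random_noise(img, mode="poisson")
    if kind == "speckle":
        return random_noise(img, mode="speckle")
    if kind == "missing_patches":            # random suppression of patches
        out = img.copy()
        for _ in range(20):
            i = rng.integers(0, img.shape[0] - 4)
            j = rng.integers(0, img.shape[1] - 4)
            out[i:i + 4, j:j + 4] = 1.0      # 4x4 patches set to white
        return out
    raise ValueError(kind)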

3.2 Preliminary Denoising Algorithms

We focus on ten classical denoising methods: the Gaussian filter, the median filter, the bilateral filter, Chambolle's method [5], non-local means [3,4], the Richardson-Lucy deconvolution [14,15], the Lee filter [12], K-SVD [1], BM3D [8] and the inpainting method [7,9]. This way, we intend to capture different regimes of performance (Gaussian filters are known to yield blurry edges, the median filter is known to be efficient against salt-and-pepper noise, the bilateral filter preserves edges well, non-local means are praised for better preserving the details of the image, Lee filters are designed to address Synthetic Aperture Radar (SAR) image despeckling problems, K-SVD and BM3D are state-of-the-art approaches, inpainting is designed to reconstruct lost parts, etc.), as the COBRA aggregation scheme is designed to blend together machines with various levels of performance and adaptively use the best local method.
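Most of these preliminary filters are available in standard Python packages; a possible pool could be assembled as follows. This is an illustrative sketch only: parameter values are not those of the paper, the Richardson-Lucy point-spread function is an assumption, and K-SVD, BM3D, the Lee filter and inpainting (which require third-party implementations or extra inputs such as a mask) are omitted.

from scipy import ndimage
from skimage.restoration import (denoise_bilateral, denoise_tv_chambolle,
                                 denoise_nl_means, richardson_lucy)
import numpy as np

psf = np.ones((5, 5)) / 25          # assumed blur kernel for Richardson-Lucy

preliminary_filters = [
    lambda im: ndimage.gaussian_filter(im, sigma=1),      # Gaussian filter
    lambda im: ndimage.median_filter(im, size=3),         # median filter
    lambda im: denoise_bilateral(im, sigma_spatial=2),    # bilateral filter
    lambda im: denoise_tv_chambolle(im, weight=0.1),      # Chambolle's method [5]
    lambda im: denoise_nl_means(im, h=0.05),              # non-local means [3,4]
    lambda im: richardson_lucy(im, psf, 10),              # Richardson-Lucy [14,15]
]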

3.3 Model Training

We start with 25 images (y_1, ..., y_25), assumed not to be noisy, that we use as "ground truth". We artificially add noise as described above, yielding 125 noisy images (x_1, ..., x_125). Then two independent copies of each noisy image are created by adding a normal noise: one goes to the data pool used to train the preliminary filters, the other one to the data pool used to compute the weights defined in (2) and perform aggregation. This separation is intended to avoid over-fitting issues (as discussed in [2]). The whole dataset creation process is illustrated in Fig. 3.

3.4 Parameters Optimisation

The meta-parameters for COBRA are α (how many preliminary filters must agree to retain a pixel) and ε (the confidence level with which we declare two pixel intensities similar). For example, choosing α = 1 and ε = 0.1 means that we impose that all the machines must agree on pixels whose predicted intensities differ by at most 0.1. The python library pycobra ships with a dedicated class to derive the optimal values using cross-validation [10]. Optimal values are α = 4/7 and ε = 0.2 in our setting.
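Equivalently to pycobra's built-in tuning class, a simple hold-out grid search over (α, ε) can be sketched with the hypothetical cobra_denoise helper defined in Sect. 2, scoring each candidate pair by RMSE against a clean validation image. The function name and candidate grids below are ours, not the paper's.

import itertools
import numpy as np

def tune_cobra(noisy, clean, filters, alphas, epsilons):
    """Pick the (alpha, epsilon) pair minimising the RMSE on a held-out image pair."""
    best = None
    for a, e in itertools.product(alphas, epsilons):
        est = cobra_denoise(noisy, filters, epsilon=e, alpha=a)
        rmse = np.sqrt(np.mean((est - clean) ** 2))
        if best is None or rmse < best[0]:
            best = (rmse, a, e)
    return best[1], best[2]

# Example call (hypothetical validation pair x_val, y_val):
# alpha, eps = tune_cobra(x_val, y_val, preliminary_filters,
#                         alphas=[k / 7 for k in range(1, 8)],
#                         epsilons=[0.05, 0.1, 0.2, 0.4])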



Fig. 3. Data set construction.

3.5 Assessing the Performance

We evaluate the quality of the denoised image Id (whose mean is denoted μd and standard deviation σd) with respect to the original image Io (whose mean is denoted μo and standard deviation σo) with four different metrics.

– Mean Absolute Error (MAE – the closer to zero the better), given by
  \mathrm{MAE} = \frac{1}{N \times M} \sum_{x=1}^{N} \sum_{y=1}^{M} \big|I_d(x, y) - I_o(x, y)\big|.

– Root Mean Square Error (RMSE – the closer to zero the better), given by
  \mathrm{RMSE} = \sqrt{\frac{1}{N \times M} \sum_{x=1}^{N} \sum_{y=1}^{M} \big(I_d(x, y) - I_o(x, y)\big)^2}.

– Peak Signal to Noise Ratio (PSNR – the larger the better), given by
  \mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{d^2}{\mathrm{RMSE}^2}\right),
  with d the signal dynamic (maximal possible value for a pixel intensity).

– Universal image Quality Index (UQI – the closer to one the better), given by
  \mathrm{UQI} = \underbrace{\frac{\mathrm{cov}(I_o, I_d)}{\sigma_o \cdot \sigma_d}}_{(i)} \cdot \underbrace{\frac{2 \cdot \mu_o \cdot \mu_d}{\mu_o^2 + \mu_d^2}}_{(ii)} \cdot \underbrace{\frac{2 \cdot \sigma_o \cdot \sigma_d}{\sigma_o^2 + \sigma_d^2}}_{(iii)},
  where term (i) is the correlation, (ii) is the mean luminance similarity, and (iii) is the contrast similarity [19, Eq. 2].
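These four criteria are straightforward to compute with NumPy; a minimal sketch for single-channel images follows (the function name is ours, and d defaults to 255 for 8-bit images).

import numpy as np

def quality_metrics(Io, Id, d=255.0):
    """Return MAE, RMSE, PSNR and UQI between the original Io and the denoised Id."""
    Io, Id = np.asarray(Io, float), np.asarray(Id, float)
    mae = np.mean(np.abs(Id - Io))
    rmse = np.sqrt(np.mean((Id - Io) ** 2))
    psnr = 10 * np.log10(d ** 2 / rmse ** 2)
    mu_o, mu_d = Io.mean(), Id.mean()
    s_o, s_d = Io.std(), Id.std()
    cov = np.mean((Io - mu_o) * (Id - mu_d))
    uqi = (cov / (s_o * s_d)) \
        * (2 * mu_o * mu_d / (mu_o ** 2 + mu_d ** 2)) \
        * (2 * s_o * s_d / (s_o ** 2 + s_d ** 2))
    return mae, rmse, psnr, uqi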

3.6 Results

Our experiments run on the gray-scale "lena" reference image (range 0–255). In all tables, experiments have been repeated 100 times to compute descriptive statistics. The green line (respectively, red) identifies the best (respectively, worst) performance. The yellow line identifies the best performance among the preliminary denoising algorithms when COBRA achieves the best performance. In each figure, the first image is noisy, the second is what COBRA outputs, and the third is the difference between the ideal image (with no noise) and the COBRA-denoised image.

Results – Gaussian noise (Fig. 4). We add to the reference image "lena" a Gaussian noise of mean μ = 127.5 and standard deviation σ = 25.5. Unsurprisingly, the best filter is the Gaussian filter, and the performance of the COBRA aggregate trails slightly when the noise level is unknown. When the noise level is known, COBRA outperforms all preliminary filters. Note that the bilateral filter gives better results than non-local means. This is not surprising: [11] reaches the same conclusion for high noise levels.

Results – salt-and-pepper noise (Fig. 5). The proportion of white to black pixels is set to sp ratio = 0.2 and the proportion of pixels to replace to sp amount = 0.1. Even if the noise level is unknown, COBRA outperforms all filters, even the champion BM3D.

Results – Poisson noise (Fig. 6). COBRA outperforms all preliminary filters.

Results – speckle noise (Fig. 7). When confronted with a speckle noise, COBRA outperforms all preliminary filters. Note that this is a difficult task and most filters have a hard time denoising the image. The message of aggregation is that even in adversarial situations, the aggregate (strictly) improves on the performance of the preliminary pool of methods.

Results – random patches suppression (Fig. 8). We randomly suppress 20 patches of size (4 × 4) pixels from the original image. These pixels become white. Unsurprisingly, the best filter is the inpainting method – as a matter of fact this is the only filter which succeeds in denoising the image, as it is quite a specific noise.

Results – images containing several kinds of noise (Fig. 9). On all previous examples, COBRA matches or outperforms the performance of the best filter for each kind of noise (with the notable exception of missing patches, where inpainting

Fig. 4. Results – Gaussian noise. (a) Noisy image, (b) COBRA, (c) Difference ideal – COBRA.

Fig. 5. Results – salt-and-pepper noise. (a) Noisy image, (b) COBRA, (c) Difference ideal – COBRA.


methods are superior). Finally, as the type of noise is usually unknown and even hard to infer from images, we are interested in putting all filters and COBRA to the test when facing multiple types of noise. We apply a Gaussian noise in the upper left-hand corner, a salt-and-pepper noise in the upper right-hand corner, a Poisson noise in the lower left-hand corner and a speckle noise in the lower right-hand corner. In addition, we randomly suppress small patches on the whole image (see Fig. 9a). In this much more adversarial situation, none of the preliminary filters can achieve proper denoising. This is the kind of setting where aggregation is the most interesting, as it makes the best of each filter's abilities. As a matter of fact, COBRA significantly outperforms all preliminary filters.

3.7 Automatic Tuning of Filters

Clearly, the internal parameters of the classical preliminary filters may have a crucial impact. For example, the median filter is particularly well suited for salt-and-pepper noise, although the filter size has to be chosen carefully as it should grow with the noise level (which is unknown in practice). A nice byproduct of our aggregation scheme is that we can also perform automatic and adaptive tuning of those parameters, by feeding COBRA with as many machines as there are candidate values for these parameters. Let us illustrate this on a simple example: we train our model with only one classical method but with several values of the parameter to tune.

Fig. 6. Results – Poisson noise. (a) Noisy image, (b) COBRA, (c) Difference ideal – COBRA.

Fig. 7. Results – speckle noise. (a) Noisy image, (b) COBRA, (c) Difference ideal – COBRA.

Fig. 8. Results – random suppression of patches. (a) Noisy image, (b) COBRA, (c) Difference ideal – COBRA.


Fig. 9. Denoising an image afflicted with multiple noise types. (a) Noisy image, (b) COBRA (unknown noise), (c) COBRA (known noise), (d) Bilateral filter, (e) Non-local means, (f) Richardson-Lucy deconvolution, (g) Gaussian filter, (h) Median filter, (i) TV Chambolle, (j) Inpainting, (k) K-SVD, (l) BM3D, (m) Lee filter.


For example, we can define three machines applying median filters with different filter sizes: 3, 5 or 10 (a sketch is given below). Whatever the noise level, our approach achieves the best performance (Fig. 10). This casts our approach onto the adaptive setting where we can efficiently denoise an image regardless of its (unknown) noise level.
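Concretely, the three "machines" can simply be the same filter with different sizes, fed to the aggregate exactly like heterogeneous filters. This short sketch reuses the hypothetical cobra_denoise helper from Sect. 2 and assumed default meta-parameters.

from scipy import ndimage

median_machines = [lambda im, s=s: ndimage.median_filter(im, size=s) for s in (3, 5, 10)]
denoised = cobra_denoise(noisy_image, median_machines, epsilon=0.2, alpha=4/7)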

Fig. 10. Automatic tuning of the median filter using COBRA.

4 Conclusion

We have presented a generic aggregated denoising method—called COBRA— which improves on the performance of preliminary filters, makes the most of their abilities (e.g., adaptation to a particular kind of noise) and automatically adapts to the unknown noise level. COBRA is supported by a sharp oracle inequality demonstrating its optimality, up to an explicit remainder term which



quickly goes to zero. Numerical experiments suggest that our method achieves the best performance when dealing with several types of noise. Let us conclude by stressing that our approach is generic, in the sense that any preliminary filters could be aggregated, regardless of their nature and specific abilities. While stacking many preliminary filters obviously induces an extra computational cost, the COBRA aggregate benefits from it from a statistical accuracy perspective. We hope to help diffuse non-linear aggregation in the denoising community.

References 1. Aharon, M., Elad, M., Bruckstein, A., et al.: K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311 (2006) 2. Biau, G., Fischer, A., Guedj, B., Malley, J.D.: Cobra: a combined regression strategy. J. Multivariate Anal. 146, 18–28 (2016) 3. Buades, A., Coll, B., Morel, J.: A non-local algorithm for image denoising. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 60–65 (2005) 4. Buades, A., Coll, B., Morel, J.M.: Non-local means denoising. Image Process. Line 1, 208–212 (2011) 5. Chambolle, A.: Total variation minimization and a class of binary MRF models. Energy Minimization Methods Comput. Vis. Pattern Recogn. 3757, 132–152 (2005) 6. Chatterjee, P., Milanfar, P.: Patch-based near-optimal image denoising. IEEE Trans. Image Process. 21(4), 1635–1649 (2012) 7. Chuiab, C., Mhaskar, H.: MRA contextual-recovery extension of smooth functions on manifolds. Appl. Comput. Harmon. Anal. 28, 104–113 (2010) 8. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080– 2095 (2007) 9. Damelin, S., Hoang, N.: On surface completion and image inpainting by biharmonic functions: numerical aspects. Int. J. Math. Math. Sci. 2018, 8 (2018) 10. Guedj, B., Srinivasa Desikan, B.: Pycobra: a python toolbox for ensemble learning and visualisation. J. Mach. Learn. Res. 18(190), 1–5 (2018) 11. Kumar, B.S.: Image denoising based on non-local means filter and its method noise thresholding. SIViP 7(6), 1211–1227 (2013) 12. Lee, J.S., Jurkevich, L., Dewaele, P., Wambacq, P., Oosterlinck, A.: Speckle filtering of synthetic aperture radar images: a review. Remote Sens. Rev. 8(4), 313–340 (1994) 13. Liu, J., Liu, R., Chen, J., Yang, Y., Ma, D.: Collaborative filtering denoising algorithm based on the nonlocal centralized sparse representation model. In: 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (2017) 14. Lucy, L.: An iterative technique for the rectification of observed distributions. Astron. J. 19, 745 (1974) 15. Richardson, W.H.: Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55–59 (1972)



16. Salmon, J., Le Pennec, E.: Nl-means and aggregation procedures. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 2977–2980, November 2009 17. Salmon, J.: Agr´egation d’estimateurs et m´ethodes ` a patch pour le d´ebruitage d’images num´eriques. PhD thesis, Universit´e Paris-Diderot-Paris VII (2010) 18. Strub, F., Mary, J.: Collaborative filtering with stacked denoising autoencoders and sparse inputs. In: NIPS workshop on machine learning for eCommerce (2015) 19. Wang, Z., Bovik, A.C.: A universal image quality index. IEEE Signal Process. Lett. 9(3), 81–84 (2002)

Comparative Study of Classifiers for Blurred Images

Ratiba Gueraichi(B) and Amina Serir

Image Processing and Radiations Laboratory (LTIR), Houari Boumediene University of Science and Technology (USTHB), Algiers, Algeria
{rgueraichi,aserir}@usthb.dz

Abstract. In this paper, we take a first step towards classifying images according to their degree of blur, based on the subjective DMOS image-quality scores of the Gblur database. For this purpose, we carried out a comparative study of several classifiers in order to build a robust learning model based on the DCT transform. The class imbalance forced us to look for appropriate performance evaluation metrics so that the comparison is not biased. It was found that random forests (RF) give the best overall performance, but that other classifiers discriminate certain types of images (depending on the degree of blur) better than others. Finally, we compared the classification by the proposed model with another classification based on the NIQE quality measurement algorithm. Given its simplicity, the results of the proposed model are very promising.

Keywords: Discrete Cosine Transform (DCT) · Blur classification · Random Forests (RF)

1 Introduction

In recent years, we have seen a significant technological deployment of audio-visual content, particularly image acquisition and processing systems such as the digital cameras used in smartphones, video surveillance, etc. However, these technological advances are often accompanied by new issues such as the introduction of artifacts during acquisition, coding or transmission. To control and improve image quality, it is imperative that acquisition, management, communication and processing systems are able to detect, identify and quantify the degradation introduced into the image. In practice, blur is considered a major cause of image quality degradation. It manifests itself as a loss of sharpness at the edges and a decrease in the visibility of fine details. There are several types of blur, such as Gaussian blur, motion blur, de-focus blur, etc. To detect blur in images, one of two main approaches can be used: modeling the blur effect or analyzing its disturbing effect on the human visual system (HVS) [1–6]. The most popular techniques for characterizing the blur effect are approaches based on



edge analysis and transformation [7–9]. The widely used transformation-based blur identification methods are generally applied in certain frequency domains: local DCT coefficients [9, 10] and image wavelet coefficients [11–13]. A good restoration of a blurred image depends on choosing a restoration algorithm that matches the blur rate of the image [14]. Thus, it would be wise to automatically classify images according to their degree of blurring and then apply an adequate restoration. In this work, we limited ourselves to classifying the images of the LIVE database according to three degrees of blur (low blur, medium blur and high blur), based on a descriptor vector formed by statistics of the DCT coefficients of (8 × 8) blocks; to carrying out a comparative analysis of the following classifiers, namely k-NN, Naïve Bayes (NB), multilayer perceptron (MLP), Support Vector Machines (SVMs) and random forests (RF); and finally to finding the classifier which best discriminates each category of blur. The remainder of this paper is organized as follows: Sect. 2 describes the approach taken for extracting features using the DCT transformation. In Sect. 3, we briefly describe the different classifiers used. In Sect. 4, we present the experimental results obtained. Finally, the last section is devoted to the conclusion and gives some perspectives.

2 Features Extraction

The descriptor vector is computed locally in the frequency domain (a sketch of the extraction is given after this list):

• An image of the Gblur dataset (from the LIVE database) is divided into blocks of size (8 × 8), on which the discrete cosine transform (DCT) is applied.
• Each block obtained is divided (in addition to the DC band) into three regions R1, R2 and R3, which represent the low-frequency (LF), medium-frequency (MF) and high-frequency (HF) regions, respectively (Fig. 1). These frequency regions are delimited according to their sensitivity to distortions and are consistent with experimental psychophysical results [15].
• On each frequency range, local statistics (mean, variance, kurtosis, skewness, entropy and energy of all AC components constituting each region) are calculated.
• The descriptor vector of the image consists of the averages of the respective statistics over all the blocks of the image. The descriptor vector is therefore composed of 18 features.
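A possible realisation of this 18-dimensional descriptor uses SciPy's 2-D DCT. Note that the exact index sets delimiting R1, R2 and R3 are only given graphically in Fig. 1, so the band thresholds below are assumptions, as are the function names.

import numpy as np
from scipy.fft import dctn
from scipy.stats import kurtosis, skew

def block_descriptor(img, bsize=8):
    """18-feature descriptor: 6 statistics x 3 frequency regions, averaged over blocks."""
    H, W = img.shape
    feats = []
    for i in range(0, H - bsize + 1, bsize):
        for j in range(0, W - bsize + 1, bsize):
            c = dctn(img[i:i + bsize, j:j + bsize], norm="ortho")
            u, v = np.meshgrid(range(bsize), range(bsize), indexing="ij")
            band = u + v                                  # distance from the DC corner
            regions = [c[(band > 0) & (band <= 4)],       # R1: low frequencies (assumed split)
                       c[(band > 4) & (band <= 9)],       # R2: medium frequencies
                       c[band > 9]]                       # R3: high frequencies
            row = []
            for r in regions:
                p = np.abs(r) / (np.abs(r).sum() + 1e-12)  # normalised magnitudes for entropy
                row += [r.mean(), r.var(), kurtosis(r), skew(r),
                        -(p * np.log2(p + 1e-12)).sum(), (r ** 2).sum()]
            feats.append(row)
    return np.mean(feats, axis=0)                          # average the statistics over all blocks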

3 Classification

At this step, a test image I is classified as "slightly blurred", "moderately blurred" or "strongly blurred" according to three labels based on the DMOS (Difference Mean



Fig. 1. Representation of the (8 × 8) DCT block frequency bands.

Opinion Score) values provided by LIVE's Gblur database, as expressed below:

  I is slightly blurred   if 19 < DMOS ≤ 30
  I is moderately blurred if 30 < DMOS ≤ 60          (1)
  I is strongly blurred   if DMOS > 60
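Expression (1) translates directly into a labelling function (the function name is ours; images with DMOS ≤ 19 simply fall outside the three classes considered).

def blur_class(dmos):
    """Map a DMOS score to one of the three blur classes of expression (1)."""
    if 19 < dmos <= 30:
        return 1          # slightly blurred
    if 30 < dmos <= 60:
        return 2          # moderately blurred
    if dmos > 60:
        return 3          # strongly blurred
    return None           # below the range covered by the study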

A priori, all types of classifiers can be used for image classification, but in practice only classifiers capable of dealing with the complexity of the data should be used. There are two main families: the first is based on a statistical and probabilistic approach which presupposes the form of the underlying laws, of which Bayesian networks are a good example; the second, which makes no such a priori assumption, includes the multilayer perceptron (MLP), k-nearest neighbors (k-NN) and support vector machines (SVMs), as well as a decision-tree-based classifier called random forests (RF). We give an overview of each classifier below.

3.1 Naive Bayes Networks (NB)

They are based on Bayesian decision theory. The decision problem is assumed to be probabilistic in nature, and it is a question of calculating posterior probabilities for a given class. Naive Bayes networks are a special case of Bayesian networks in which the features of the data are assumed to be statistically independent, which facilitates the use of the Bayes rule [16–19].

3.2 The Multilayer Perceptron (MLP)

MLPs are neural networks built from hidden layers. Training involves adjusting the number of hidden layers and the weights in order to minimize the prediction error on data that are generally non-linearly separable, using a learning algorithm, the most popular being gradient back-propagation [16–20].



3.3 k-Nearest Neighbors (k-NN)

The idea is to assign to a given point the class that dominates among a certain number of surrounding points. To do this, the distance from this point to as many surrounding points as possible is calculated. Theoretically, several distances can be used (Euclidean, Mahalanobis, Minkowski, etc.); the most common is the Euclidean distance. The value of k that gives the best recognition rate is adopted [20, 22].

3.4 Support Vector Machines (SVMs)

SVMs are a learning algorithm whose purpose is to search for a decision rule based on a separating hyperplane of maximum margin. The search for the optimal hyperplane rests on the key idea behind SVMs: the separation margin between two classes can always be maximized while classification errors are minimized, by finding the optimal parameters such as the parameter of the kernel used (sigmoid, polynomial, RBF, ...) and the regularization parameter C, which constitutes a compromise between the maximization of the margin and the classification error due to non-linearly separable data. In the multi-class case, several approaches can be used, the best known being the one-against-one approach and the one-against-all approach. In our case, we opted for the latter. It is the simplest and oldest decomposition method. It consists of using one binary classifier (with real-valued output) per class; the k-th classifier is intended to distinguish the class of index k from all the others. To assign an example, it is therefore presented to the Q classifiers, and the decision is obtained according to the "winner-takes-all" principle: the label selected is the one associated with the classifier that returned the highest value [16–21].

3.5 Random Forests (RF)

A Random Forest is a classifier composed of a set of elementary classifiers of the decision-tree type, noted {h(x, θk) | k = 1, . . . , L}, where {θk} is a family of independent and identically distributed random vectors, and within which each tree votes for the most popular class for an input x. The main advantage of this type of classifier is that even when the number of decision trees increases, the model does not tend towards overfitting [20, 23].

4 Experimental Part and Results

Before explaining the outputs of each classifier applied to this database, it is important to make some relevant remarks:

• To compare the performance of the above-mentioned classifiers, we used k-fold cross-validation with k = 10.
• For the k-NN classifier, the best result is obtained with the parameter k = 3.

A sketch of this comparison protocol is given below.
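With scikit-learn, the comparison can be reproduced along these lines; this is a hedged sketch, not the paper's own code. X and y denote the 18-feature descriptors and the blur labels built above, and all hyper-parameters other than k = 3 for k-NN are illustrative.

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, cohen_kappa_score

classifiers = {
    "N-B":  GaussianNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "SVMs": SVC(kernel="rbf", C=1.0, decision_function_shape="ovr"),  # one-against-all
    "MLP":  MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
    "RF":   RandomForestClassifier(n_estimators=200),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
kappa = make_scorer(cohen_kappa_score)        # imbalance-aware metric of Sect. 4.1

for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring=kappa)
    print(f"{name:5s} kappa = {scores.mean():.3f}")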



The results of the learning algorithms must be carefully analyzed and interpreted correctly. This is why several studies have been carried out to find evaluation metrics that respond to each type of data in an appropriate way [24, 25]. These studies focus essentially on the robustness of each metric in the presence of unbalanced data, as is the case for our data [26]. The chosen database of blurred images is LIVE's Gblur database, which consists of 145 blurred images. Applying the class selection criteria given by expression (1), 39 low-blur images (class 1), 78 medium-blur images (class 2) and 28 high-blur images (class 3) are found. The distribution of classes is thus quite unbalanced. So, in order to properly compare the performance of the learning algorithms, it would be wise to choose metrics that are not sensitive to class imbalance.

4.1 Selection of Evaluation Metrics for Unbalanced Data

Different evaluation methods are sensitive to unbalanced data when there are more samples of one class in a dataset than of the other classes. Take the example of the (2 × 2) confusion matrix given in Fig. 2:

Fig. 2. (2 × 2) Confusion matrix.

The class distribution is the ratio between positive and negative samples (P/N), i.e. the relationship between the two columns. Any evaluation metric that uses the values of both columns will be sensitive to unbalanced data, such as accuracy and precision, except when changes in class distribution cancel each other out [27].

• This is not the case of the geometric mean (GM) (also called G-score), whose equation is given below [27]:
  \mathrm{GM} = \sqrt{\mathrm{TPR} \times \mathrm{TNR}} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},   (2)
  where TPR and TNR denote the true positive rate and the true negative rate, respectively.

• A very good measure worth mentioning is Cohen's kappa statistic, because it handles multi-class and unbalanced-class problems very well. It is defined as
  \kappa = \frac{p_0 - p_e}{1 - p_e} = 1 - \frac{1 - p_0}{1 - p_e},   (3)


where p_0 is the observed agreement and p_e the expected (chance) agreement; κ indicates how much better our classifier performs than a classifier that simply guesses at random according to the frequency of each class [27].

• There are metrics based on graphs, such as the well-known ROC curve (Receiver Operating Characteristics), which plots the TPR as a function of the FPR (false positive rate). This graph shows the trade-off between specificity and sensitivity. However, it requires particular caution and can be misleading when applied in unbalanced classification scenarios. In addition, it is difficult in many cases to interpret false positive results.

• Many studies have shown that the alternative to ROC is the PRC (Precision-Recall Curve), which plots the precision as a function of the recall (or TPR). The PRC can provide the user with an accurate prediction of future classification performance, as it evaluates the fraction of true positives among the positive predictions. Thus, it is a robust metric even with unbalanced data [28]. The score of the area under the PRC curve, noted AUC (PRC), is also effective in multiple classifier comparisons [29]. The AUC can also be generalized to the multi-class setting. This approach is based on fitting one-vs-all classifiers where, in the i-th iteration, the i-th group is defined as the positive class, while all other classes are considered negative.

4.2 Obtained Results

The performances of the five classifiers, calculated from the first two selected (scalar) metrics, are summarized in Table 1.

Table 1. Kappa and G-score values for the five classifiers.

Classifier   Kappa (%)   G-score (%)
N-B          57.6        78.1
k-NN         66.2        82.9
SVMs         70          84.1
MLP          74.2        85.9
RF           75.3        86.3

The last metric (AUC-PRC) was applied from both a global and a local perspective, i.e. for each class. The results are given in the graph of Fig. 3. From these results, we can recognize that the classifier that best discriminates these images into slightly blurred, moderately blurred and highly blurred images is the Random Forests classifier. In addition, its overall performance (for the kappa statistic, the G-score or the global AUC-PRC) is also the best. However, the MLP and SVMs classifiers follow closely: the SVMs classifier is better than the MLP at recognizing slightly and



Fig. 3. Comparison of (micro and macro) performance (AUC-PRC) of the five classifiers.

strongly blurred images, while the MLP classifier is better than the SVMs for moderately blurred images.

4.3 Comparison of the Results of the Proposed Descriptor with Methods Based on Quality Measurement

As explained above, the three classes that represent the blur rate in images were based on the value of the DMOS (Difference Mean Opinion Score), which is a subjective measure of image quality. Therefore, our classification results were compared against a classification based on an objective, blind (no-reference) image quality measurement method (NR-IQA): the NIQE (Natural Image Quality Evaluator) algorithm [30].

The NIQE Algorithm: The principle of this algorithm is based on (natural) images for which no human judgments are available; we speak of an "opinion unaware" (OU) model. Moreover, the types of distortion that can affect these images are not known; we speak of a "distortion unaware" (DU) model. As a result, this algorithm implements an "NSS-driven blind OU-DU IQA" model [30]. We limited ourselves to comparing our results with those of this method (applied to the Gblur database of LIVE), using Cohen's kappa, which is a relevant metric. In addition, the comparison is made with the best result obtained (the random forests classifier) (Table 2).

Table 2. Evaluation of the proposed results relative to other methods.

Method          Kappa (%)
NIQE            59.4
RF (proposed)   75.3

According to these results, the proposed method has the best performance, with a kappa value higher than 0.7; it thus outperforms the classification based on the NIQE method.


5 Conclusion and Perspectives

The comparative study carried out on the five classifiers for the classification of images according to three degrees of blur is only the first step towards obtaining the training model that best discriminates the blur rate in an image, using a descriptor vector based on statistics of the DCT coefficients of image blocks of size (8 × 8). This vector gives good classification performance with classifiers such as SVMs, MLP and RF, for which the kappa rate is greater than or equal to 0.7 on blurred images. Moreover, this study has shown the superior performance of the Random Forests (RF) classifier and its robustness compared to the other classifiers, particularly with metrics adapted to unbalanced data. This RF-based model, which was applied to the Gblur database of LIVE, can be tested on other known databases such as the TID and CSIQ databases. It can also serve in practice for the restoration of blurred images in an adequate and simple manner according to their blur classification and their quality measurement. These two aspects will be the subject of future work.

References 1. Wang, Z., Bovic, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment from error visibility to structure similarity. IEEE Trans. Image Process. 4(113), 600–612 (2004) 2. George, A.G., Prabavathy, K.A.: A survey on different approaches used in image quality assessment. Int. J. Emerg. Technol. Adv. Eng. 3(2) (2013) 3. Ferzli, R., Karam, L.J.: A Human Visual System Based No-Reference Objective Image Sharpness Metric. In: Editor, F., IEEE International Conference on Image Processing, Atlanta, pp. 2949–2952 (2006) 4. Ferzli, R., Karam, L.J.: A No-reference objective image sharpness based on the notion of just-noticeable of blur (JNB). IEEE Trans. image processing 18(4), 717–728 (2009) 5. Zhu, X., Milanfar, P.: A No-Reference Sharpness Metric sensitive to blur and noise. In: First International Workshop on Quality Multimedia Experience, San Diego, pp. 64–69 (2009) 6. Marziliano, P., Dufaux, F., Winkler, S. and Ebrahimi, T.: A No-Reference Perceptual blur metric. In: International Conference on Image Processing, vol 3, pp. 57–60 (2002) 7. Ong, E.P, Lin, W.S, Lu, Z.K, Yao, S.S., Yang, X.K, Jiang, L.F.: No-Reference quality Metric for measuring image. In: Proceedings IEEE International Conference on Image Processing, vol. 1, pp. 469–472 (2003) 8. Caviedes, G.E., Oberti, F.: A new sharpness metric based on local kurtosis, edge and energy information. Sign. Process. Image Commun. 19, 147–163 (2004) 9. Marichal, X., Ma, W.Y., Zhang, H.: Blur determination in the compressed domain using DCT information. In: Conference: Image Processing, ICIP 99, vol. 2 (1999) 10. Zhang, N., Vladar, A., Postek, M., Larrabee, B.: A Kurtosis-based statistical for two dimensional process and its application to image sharpness. In: Proceedings Section of physical and engineering Sciences of American Statistical Society, pp. 4730–4736 (2003) 11. Tang, H., Ming Jing, L. I., Zhang, H.J, Zhang, C.: Blur Detection for Images Using Wavelet Transform Conference of Multimedia and Expositions, vol. 1, pp 17–20 (2009) 12. Kerrouh, F., Serir, A.: A no-reference quality metric for evaluating blur image in wavelet domain. Int. J. Digital Inf. Wireless Commun. (IJDIWC) 1(4), 767–776 (2012)


13. Tang, H., Ming Jing, L I., Zhang, H.J, Zhang, C.: Blur Detection for Images using Wavelet Transform. In: Conference of Multimedia and Expositions, vol. 1, pp. 17–20 (2009) 14. Kerrouh, F.: A No-Reference Quality measure of blurred images (videos), PhD Thesis in Electronics, Univ, Algiers (2014) 15. Bae, S.H., Kim, M.: A novel DCT-based JND model for luminance adaptation effect in DCT frequency. IEEE Sign. Process. Lett. 20(9), 893–896 (2013) 16. Cheriet, M., Kharma, N., Liu, C.L., Suen, C.Y.: Character Recognition Systems. Wiley, New Jersey (2007) 17. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, Fourth Edition, Edited by Academic Press, Elsevier Inc. (2009) 18. Duda, R.O., Hart, P.O. Stork, D.G.: Pattern Classification, snd Ed. Wiley, New Jersey (1997) 19. de Sá, J.P.M.: Pattern Recognition. In: Concepts, Methods and Applications. Edited by Springer (2001) 20. Witten, I.H, Frank, E.: Data Mining, Practical Machine Learning Tools and Technics. Morgan Kauffman Publishers, Elsevier, Burlington (2005) 21. Vapnick, V.: The Nature of Statistical Learning Theory. Springer, New York (2000) 22. Mathieu-Dupas, E.: Algorithmes des k plus proches voisins pondérés et application en diagnostic. 42èmes Journées de Statistique, 2010, Marseille, France (2010) 23. Breiman, L.: Random Forests Machine Learning, vol. 45, no. 1, pp. 5–32. Kluwer academic Publishers, Berlin (2001) 24. Hamdi, F.: Learning in unbalanced distributions, Doctorate Thesis in Computer Sciences, Univ Paris, vol. 13 (2012) 25. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009) 26. Sokolova, M., Japkowicz, N., Szpakowicz, S.: Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In: Australasian Joint Conference on Artificial Intelligence, pp. 1015–1021 (2006) 27. Tharwat, A.: Classification assessment methods. Journal homepage. http://www.sciencedi rect.com. Accessed August 2019 28. Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets, vol. 10, no. 3 (2015) 29. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Conference: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240 (2006) 30. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2013)

A Raspberry Pi-Based Identity Verification Through Face Recognition Using Constrained Images

Alvin Jason A. Virata1,2(B) and Enrique D. Festijo1

1 Technological Institute of the Philippines, Arlegui, Manila, Philippines
[email protected], [email protected]
2 St. Dominic College of Asia, Bacoor, Philippines

Abstract. In this paper, we propose a novel, cost-effective and energy-efficient framework: a Raspberry Pi-based identity verification system that performs face recognition in offline mode. A Raspberry Pi device and a mobile phone are wirelessly connected to a standard pocket WiFi to process face detection and face verification. The experimental tests were done using 3000 public constrained images. The proposed method is implemented on a Raspberry Pi 3 running Python 3.7, where the datasets and trained datasets were tested using the LBP algorithm for face detection and face verification in five split tests. The results were interpreted using the confusion matrix, the Area Under the Curve (AUC) and the Receiver Operating Characteristics (ROC). To sum up, the proposed method showed an average accuracy of 0.98135 (about 98%), with a 98% recall score and an F1-score of 0.9881. During offline-mode testing, the average face detection and verification time is 1.4 s.

Keywords: Raspberry Pi · Local Binary Pattern (LBP) · Face recognition · Confusion matrix · Area Under the Curve (AUC) and Receiver Operating Characteristics (ROC)

1 Introduction

The study of face recognition has been one of the most challenging topics since the 1970s due to its weaknesses and limitations. The increase of studies on face recognition has caught the attention of many researchers who aim to improve work on biometrics, particularly with regard to security implementation [1]. With the fast-paced development of technology, industry leaders have adopted these technological changes and innovations, playing significant roles in the development of embedded devices like the Raspberry Pi and Arduino that greatly contribute to the expansion of mobile applications. Noticeably, with the current impact of mobile technology, it is becoming part of the human life system, particularly with regard to effective and convenient communication [2]. However, in the Philippines, the adoption of identity verification systems through face recognition is still considered to be in its infancy, even though consumers are already fascinated and excited to experience the benefits of implementing face recognition as part of security measures [3].


In spite of the rapid growth of mobile technology and its recognition as a powerful trend in the field of commerce, the limitations of mobile technology resources in terms of battery life, storage and bandwidth may influence mobility and security with regard to communication efficiency and effectiveness. These weaknesses are a core dilemma for the quality of mobile technology services, as they affect the capacity to perform computationally intensive applications and increase the demand on storage; such obstacles may hinder the capabilities of a mobile application and usually lead to considering at least a cloud application to resolve the issue [4–6]. In addition, the human face is a complex pattern, and identifying human faces automatically in a certain domain can be difficult [7]. This issue has strongly attracted the attention of researchers, even from multidisciplinary studies, due to its possible unique contribution to the research community. However, at the implementation phase, there are still several challenges regarding the parameters that need to be considered and calculated to generate a precise result during face recognition and verification [8], because a human face is a dynamic object with a high level of variability in its attributes. A variety of techniques, from simple edge-based algorithms to complex high-level approaches, have been recommended to expedite the process of pattern recognition [9]. Another challenge in face recognition arises because the face is not a rigid object and face images may be taken from different viewpoints [10]. Commonly, face recognition relies on a face detection process that greatly contributes to the performance of the subsequent identity verification operations. Rapid developments in face detection algorithms have also made it possible to process multiview (multipose) images within a similar framework [11]. In conducting face recognition, there are three tasks involved: (1) verification, (2) identification and (3) the watch list. Verification happens when the query face image is matched against an expected identity, while identification takes place when the identity of the query face image is determined; the watch list consists of the records in the database used for future queries of a person's identity [12]. Therefore, face detection is the process of locating faces in a given domain using different algorithms [13, 14], while face recognition is mainly used for the tasks of identification or verification. Face identification is the process of confirming the identity of an unknown face image by comparing the face data of the person with the face images stored in the database. Hence, identity verification is guaranteed genuine when the face image matches the individual's features or attributes. Within this wide scope, face recognition is performed based on facial features, emerging technologies and algorithms [15, 16]. To resolve the aforementioned issues and challenges of face recognition in a mobile technology environment, and to make it possible to process face recognition without using a cloud environment, the main original contributions of the proposed framework are as follows:


1. A Raspberry Pi-based identity verification through face detection and recognition was done in offline mode, with a microSD memory card inserted to be used as storage.
2. The local binary pattern (LBP) was incorporated in the framework; it is computationally inexpensive, making a Raspberry Pi-based framework possible.
3. The framework is computationally inexpensive, as the training sets were prepared separately using an ordinary desktop computer or laptop with a minimal amount of processing.
4. An average accuracy of 98% was established by conducting five split tests using the 3,000 downloaded constrained images.

The remaining structure of this study is organized as follows: Sect. 2 discusses the system design and development, followed by the results and discussions in Sect. 3, and lastly, Sect. 4 explains the conclusion and future research directions.

2 System Design and Development

2.1 System Configuration

Identity verification through face recognition has widely attracted the research community due to its diverse applicability in different facets of real-time situations in the categories of security, monitoring, marketing, segmental analysis and data science. However, the mobile-application challenges in terms of computational power consumption and memory storage open an opportunity to address these issues. In this study, the mobile phone is wirelessly connected to the Raspberry Pi device, which is used to store the trained datasets and to process face detection and verification. The Raspberry Pi device used has the specifications, configuration and hardware details presented in Table 1.

Table 1. Raspberry Pi specifications for the proposed framework

Name           Configuration
Processor      Broadcom BCM2837B0, Cortex-A53 (ARMv8) 64-bit SoC @ 1.4 GHz
RAM            1 GB LPDDR2 SDRAM
Connectivity   2.4 GHz and 5 GHz IEEE 802.11.b/g/n/ac wireless LAN, Bluetooth 4.2, BLE
Power          5 V/2.5A DC power input

The Raspberry Pi has a Broadcom BCM2837B0 system on a chip which includes a Cortex-A53 (ARMv8) 64-bit SoC with a 1.4 GHz 64-bit quad-core processor, dual-band wireless LAN, Bluetooth 4.2/BLE, faster Ethernet, and Power-over-Ethernet support (with a separate PoE HAT). Also, to facilitate the offline face detection and face verification, the training of the data was executed on an ordinary laptop with the hardware and software specifications shown in Table 2.

Table 2. Hardware and software specification

Name        Configuration
Display     AMD Radeon R7 Graphics
Processor   AMD A12-9720P Radeon R7, 12 Compute Cores 4C + 8G, 2.70 GHz
RAM         8 GB (6.97 GB usable)
System      64-bit Operating System, x64-based processor

During the training process, most studies recommend utilizing a Graphical Processing Unit (GPU) to speed up the training of datasets. However, a breakthrough in this study is that the performance of face detection and verification was improved without primarily using the GPU, but instead by re-configuring the memory allocation, maximizing the page file so as to help the primary storage memory cope with the processing of the training datasets.

2.2 Face Images Collection

During the face image collection, the proponents administered the following:

• The images were downloaded from www.essex.ac.uk, selecting the face94 and face96 databases.
• The images are stored in 24-bit RGB, JPEG format.
• The constrained images have a resolution of 180 × 200 pixels in portrait format with a green background.

2.3 Experimental Test Set-up

Figure 1 presents the system framework of the proposed project. The framework also describes how the experimental test setup was conducted to validate the prediction accuracy, the timing, as well as the training speed. The proponents downloaded 3000 images to be used as samples. Out of the 3000 images collected, 80% (2400 images) were used as the training set, while the remaining 20% (600 images) were used as test data. During the testing process, the proponents aimed to measure the prediction accuracy, the recall and the F1-score. Approximately, the time consumed in preparing the training set with 2400 constrained images was less than three minutes using a laptop with the following specifications: AMD A12-9720P Radeon R7, 12 Compute Cores 4C + 8G 2.70 GHz, 8 GB (6.97 GB usable), 64-bit Operating System, x64-based processor.

2.4 System Architecture

Illustrated in Fig. 2, the system architecture of the project describes the process of identity verification, whereby, using a mobile phone, an image is captured to process the face


Fig. 1. System framework

Using a mobile phone, an image is captured, and face detection and face extraction are processed on the Raspberry Pi device, since the classifier/model is already installed there. The classifier that processes the identity verification was established using the LBP algorithm, since the use of LBP in recent studies has been found successful for face authentication and face detection and recognition [17]. Hence, in spite of the existing challenges [18] concerning the implementation of face recognition in mobile application development, the proponents opted to pursue the proposed project and address the limitations in memory allocation and computation power by adopting offline identity verification. Offline identity verification was made possible because the preparation of the training set was done separately on a desktop computer.

2.5 System Prototype

A prototype was created to test the proposed project. As illustrated in Fig. 3, four menu options can be selected from the prototype: (1) the capture menu, where images can be collected or gathered; (2) the detect menu, which compares the image against the taken photos; (3) the identify menu, an offline process for verifying the person's identity; and (4) the person's information menu, a database where the person's information is registered. In addition, the home option lets the user return to the main page/window, the settings let the user easily manage the application, and the help menu provides detailed instructions to assist the user.
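The paper does not state which implementation of LBP was used. As an illustration only, the sketch below shows how the same train-offline/verify-on-device flow could be realized with OpenCV's LBPH face recognizer; the file names, identity labels, and match threshold are assumptions, not values from this study.

# Illustrative sketch of LBP-based face verification with OpenCV's LBPH
# recognizer (opencv-contrib-python). Paths, labels, and the confidence
# threshold are placeholders.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()

def extract_face(image_path):
    """Detect the largest face in an image and return it as grayscale."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return cv2.resize(gray[y:y + h, x:x + w], (180, 200))

# Training on the laptop: face crops and integer identity labels.
samples = [("person1_01.jpg", 1), ("person2_01.jpg", 2)]   # hypothetical files
faces, labels = [], []
for path, label in samples:
    face = extract_face(path)
    if face is not None:
        faces.append(face)
        labels.append(label)
recognizer.train(faces, np.array(labels))
recognizer.write("lbph_model.yml")        # model file copied to the Raspberry Pi

# Offline verification on the Raspberry Pi: lower confidence = closer match.
recognizer.read("lbph_model.yml")
probe = extract_face("captured.jpg")
label, confidence = recognizer.predict(probe)
print("match" if confidence < 60 else "no match", label, confidence)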

3 Results and Discussions

Using the 3,000 downloaded public controlled/constrained images (face94 and face96 image databases), the proponents came up with different methodologies to test the algorithm being used, specifically the local binary pattern (LBP).


Fig. 2. System architecture


Fig. 3. System prototype

As described by the source, the University of Essex (United Kingdom) Department of Computer Science and Electronic Engineering database (www.essex.ac.uk), the images were taken at a fixed distance from the camera and the subjects were instructed to speak while a series of camera shots was taken. The source also lists the variations considered in the images: a green background, no head-scale variation, minor variation of head turn, tilt and slant, no lighting variation, and no required hairstyle variation. Some of the subjects who participated have beards and glasses. The collected data were used by the proponents to produce a training data set and a test data set: the images were divided into two subsets, with 80% used as the training set and 20% as the test set.
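For illustration, the 80/20 split and the reported metrics (accuracy, recall, and F1 score) could be computed as in the following sketch; scikit-learn and the placeholder label vectors are assumptions, not the tooling used by the proponents.

# Minimal sketch of the 80/20 split and the reported metrics.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score

images = list(range(3000))            # stand-ins for the 3,000 face images
labels = [i % 20 for i in images]     # stand-in identity labels

train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)
print(len(train_x), len(test_x))      # 2400 training images, 600 test images

# Predictions would come from the LBP classifier; here they are copied labels.
predictions = test_y
print("accuracy:", accuracy_score(test_y, predictions))
print("recall:  ", recall_score(test_y, predictions, average="macro"))
print("F1 score:", f1_score(test_y, predictions, average="macro"))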


During the process of testing, the results for accuracy, prediction, and timing were a success. The challenge of timing prediction was also concluded to be remarkable, at less than

ℵi ← [ <blsrcip, detectioncount, status> ], i = 1, ..., Nodemax
    Structure of a blacklisted node entry in the blacklist table, where blsrcip represents the blacklisted node's IP address, detectioncount represents the total number of times the node has been detected as an attacker, and status represents the status of the blacklisted node, i.e., set as FALSE for suspected and TRUE for permanently blocked
Υi ← [ <from, tprevious, trecent, DIOcount> ], i = 1, ..., Nodemax
    Structure of a node entry in the neighbor table, where from represents the DIO sender's IP address, tprevious represents the time of the previous DIO reception, trecent represents the time of the most recent DIO reception, and DIOcount represents the total number of DIOs received from that neighbor up to the current time
Nblacklist   Number of blacklisted nodes
λcurrent     Current system clock time
Ψ            Flag to check whether the node is present in the neighbor table
ρ            Flag to check whether the node is present in the blacklist table
srcip        Source IP address of the DIO sender node
τ            Null IP address
σ            Safe DIO interval
β            Block threshold
l            Length of the node table at that time
Active       Indicates that the IDS detection procedure is ready to check for attackers present in the neighbor table; it is set to TRUE by the legitimate node every 30 s
δ            Tuning parameter
x̃            Median
Q1           First quartile
Q3           Third quartile
IQR          Interquartile range
Upper limit  Safe threshold for the number of DIOs received from a neighbor


Algorithm 1. Pseudo-code of the proposed IDS
 1: procedure IDS()                      ▷ Checks for malicious neighbors present in neighbor table
 2:   l ← 0                              ▷ Variable to store current length of neighbor table
 3:   for i ← 1, Nodemax do
 4:     if Q[Υi.from] != τ then
 5:       l++
 6:     end if
 7:   end for
 8:   if l > 1 then
 9:     sort Q on DIOcount column
10:   end if
11:   compute x̃, Q1, Q3 values of Q[Υi.DIOcount], where i = 1, ..., l
12:   IQR ← Q3 − Q1
13:   Upper limit ← Q3 + (δ × IQR)
14:   for i ← 1, l do
15:     if Q[Υi.DIOcount] > Upper limit then
16:       if Q[Υi.trecent] − Q[Υi.tprevious] ≤ σ then
17:         ρ ← FALSE
18:         for j ← 1, Nblacklist do
19:           if Z[ℵj.blsrcip] = Q[Υi.from] then
20:             ρ ← TRUE
21:             if Z[ℵj.detectioncount] < β then
22:               Z[ℵj.detectioncount] ← Z[ℵj.detectioncount] + 1
23:               if Z[ℵj.detectioncount] = β then
24:                 Z[ℵj.status] ← TRUE       ▷ Neighbor is permanently blocked
25:                 call remove neighbor table entry(i) procedure
26:               end if
27:             end if
28:           end if
29:         end for
30:         if ρ = FALSE then
31:           k ← Nblacklist++
32:           Z[ℵk.blsrcip] ← Q[Υi.from]
33:           Z[ℵk.detectioncount] ← Z[ℵk.detectioncount] + 1
34:           Z[ℵk.status] ← FALSE             ▷ Neighbor is suspected to be an attacker
35:         end if
36:       end if
37:     end if
38:   end for
39: end procedure
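For illustration, the core outlier test of Algorithm 1 can be sketched as follows. The data structures, parameter values, and quartile method below are assumptions; the actual IDS is implemented inside ContikiRPL and operates on its own neighbor and blacklist tables.

# Illustrative Python sketch of the IQR-based outlier test in Algorithm 1.
from dataclasses import dataclass
from statistics import quantiles

SIGMA = 4.0    # safe DIO interval sigma (s), assumed value
DELTA = 1.5    # tuning parameter delta for the upper limit
BETA = 3       # block threshold beta (detections before permanent block)

@dataclass
class Neighbor:
    ip: str
    t_previous: float
    t_recent: float
    dio_count: int

@dataclass
class BlacklistEntry:
    ip: str
    detection_count: int = 0
    blocked: bool = False

def detect_attackers(neighbors: list[Neighbor],
                     blacklist: dict[str, BlacklistEntry]) -> None:
    if len(neighbors) < 2:
        return
    counts = sorted(n.dio_count for n in neighbors)
    q1, _, q3 = quantiles(counts, n=4)          # first and third quartiles
    upper_limit = q3 + DELTA * (q3 - q1)        # Q3 + delta * IQR
    for n in neighbors:
        if n.dio_count > upper_limit and (n.t_recent - n.t_previous) <= SIGMA:
            entry = blacklist.setdefault(n.ip, BlacklistEntry(n.ip))
            if entry.detection_count < BETA:
                entry.detection_count += 1
                if entry.detection_count == BETA:
                    entry.blocked = True        # permanently block the neighbor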

5 Performance Evaluation

We have evaluated our proposed IDS through experiments by implementing it on a popular embedded operating system. Next subsection presents the details


of our experimental setup, the attack's impact on the network, and the performance results of our proposed IDS.

Table 2. Simulation parameters

Parameter                     Values
Radio model                   Multipath Ray-Tracer Medium (MRM)
Simulation area               150 m × 150 m
Simulation time               1800 s
Objective function            Minimum Rank with Hysteresis Objective Function (MRHOF)
Number of attacker nodes      4
Number of gateway nodes       1
Number of sensor nodes        16
DIO minimum interval          4 s
DIO maximum interval          17.5 min
Replay interval               1, 2, 3, 4 s
Data packet size              30 bytes
Data packet sending interval  60 s
Transmission power            0 dBm

5.1 Experimental Setup

The proposed IDS is implemented by modifying the ContikiRPL library of the Contiki operating system. We have evaluated the proposed IDS using the Cooja simulator. Table 2 lists the simulation parameters considered in the experiments. The Multipath Ray-Tracer Medium (MRM) radio model is used in all the experiments to simulate a realistic channel. To realize the copycat attack, an attacker node is programmed to eavesdrop on and capture a DIO message from any legitimate node and then replay the captured message at fixed replay intervals. The Random Waypoint Mobility Model is used to simulate the behavior of mobile nodes, with node speeds set between 1 m/s and 2 m/s. The attacker node is programmed to launch the attack 90 s after network initialization. Similarly, the proposed IDS is scheduled to activate 120 s after network initialization and monitors the neighbors every 30 s. The mean values of the results obtained from 10 independent replications, with errors at a 95% confidence interval, are reported.

5.2 Simulation Results

First, the attack impact on network performance is studied in terms of PDR and AE2ED. Then, the performance evaluation of the proposed IDS is discussed in terms of Accuracy and First Response Time (FRT).
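As a sketch (not the authors' evaluation scripts), PDR and AE2ED can be computed from per-packet send and receive timestamps as follows; the log format is an assumption, not Cooja's output format.

# Minimal sketch of PDR and AE2ED computed from send/receive logs.
def pdr(sent: dict, received: dict) -> float:
    """Packet Delivery Ratio: percentage of sent data packets that reached the sink."""
    return 100.0 * len(received) / len(sent)

def ae2ed(sent: dict, received: dict) -> float:
    """Average End-to-End Delay over the packets that were delivered."""
    delays = [received[p] - sent[p] for p in received if p in sent]
    return sum(delays) / len(delays)

sent = {1: 0.00, 2: 60.00, 3: 120.00}      # packet id -> send time (s)
received = {1: 0.42, 3: 121.10}            # packet id -> receive time (s)
print(f"PDR = {pdr(sent, received):.1f}%, AE2ED = {ae2ed(sent, received):.2f} s")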

5.3 Impact of Attack on Network Performance

The performance of Static RPL (the static reference model without attack), Static RPL-Under-Attack (the static reference model under attack), Mobile RPL (the mobile reference model without attack), and Mobile RPL-Under-Attack (the mobile reference model under attack) is evaluated and compared. In the non-attack scenarios (Static RPL and Mobile RPL), the replay interval has no significance. Figure 1 shows the PDR obtained under attack with different replay intervals, i.e., 1, 2, and 3 s. It is observed that the attack severely degrades the network's performance in both the static and mobile cases. This is confirmed by comparing the PDR values obtained in the attack and non-attack scenarios. The copycat attack has more impact on the mobile network than on the static network. Figure 2 shows the AE2ED obtained with different replay intervals. It is observed that the attack increases the network latency, as confirmed by the AE2ED values obtained in the different scenarios. Besides, it is also observed that the attack does not have any significant impact on the AE2ED of the static network. On the other hand, in the case of the mobile network, AE2ED increases significantly. The reason for this is the network dynamicity.

5.4 Performance Evaluation of Proposed IDS

Figure 3 shows the accuracy achieved by the proposed IDS in different attack scenarios, where IDS-Static and IDS-Mobile denote the accuracy achieved in the static and mobile networks, respectively. The results indicate the effectiveness of the proposed IDS. As can be seen, the proposed IDS detects the attackers with

Fig. 1. PDR values obtained in different scenarios


Fig. 2. AE2ED values obtained in different scenarios

Fig. 3. Accuracy of proposed IDS

high accuracy. The IDS achieves a maximum accuracy of ≈94% in the static scenario and ≈85% in the mobile scenario. The main reason for the better accuracy in the static network is the network's stability, while in the mobile scenario the network dynamicity limits the detection accuracy of the IDS. It is concluded that the accuracy of the proposed IDS is inversely proportional to the attacker's replay interval. The FRT of the proposed IDS is studied to analyze its responsiveness. Figure 4 illustrates the FRTs of the IDS against different attackers, where A1, A2, A3, and A4 represent different attackers. The proposed IDS takes less time


to detect the attacker in a static scenario as compared to the mobile scenario. This is because of the stable network that makes it easy for the detection mechanism to quickly find the malicious neighbor present in neighbor table. The reason for delayed attacker detection in case of mobile scenario is the network dynamicity which increases the DIO transmission of legitimate nodes. Hence, it becomes very difficult for the detection mechanism to differentiate between normal and attacker neighbors present in neighbor table.

Fig. 4. FRT of proposed IDS to detect attackers with different replay intervals

6 Conclusion and Future Work

Many IoT applications are built upon LLNs due to the requirement of longer operational time. LLNs are vulnerable to different insider and outsider threats, thus putting users' security and privacy at risk. In this paper, we have addressed a routing attack named the copycat attack, which is shown to have a major negative impact on LLN performance. We presented an IDS to secure LLNs against copycat attacks. The experimental results show that our proposed IDS detects such attacks quickly and with high accuracy in both static and mobile network scenarios. As future work, we are interested in performing testbed experiments using real LLN nodes.

Acknowledgment. This research was supported by the Ministry of Human Resource Development, Government of India.


On the Analysis of Semantic Denial-of-Service Attacks Affecting Smart Living Devices

Joseph Bugeja, Andreas Jacobsson, and Romina Spalazzese

Internet of Things and People Research Center, Department of Computer Science and Media Technology, Malmö University, Malmö, Sweden
{joseph.bugeja,andreas.jacobsson,romina.spalazzese}@mau.se

Abstract. With the interconnectedness of heterogeneous IoT devices being deployed in smart living spaces, it is imperative to assure that connected devices are resilient against Denial-of-Service (DoS) attacks. DoS attacks may cause economic damage but may also jeopardize the life of individuals, e.g., in a smart home healthcare environment since there might be situations (e.g., heart attacks), when urgent and timely actions are crucial. To achieve a better understanding of the DoS attack scenario in the ever so private home environment, we conduct a vulnerability assessment of five commercial-off-the-shelf IoT devices: a gaming console, media player, lighting system, connected TV, and IP camera, that are typically found in a smart living space. This study was conducted using an automated vulnerability scanner – Open Vulnerability Assessment System (OpenVAS) – and focuses on semantic DoS attacks. The results of the conducted experiment indicate that the majority of the tested devices are prone to DoS attacks, in particular those caused by a failure to manage exceptional conditions, leading to a total compromise of their availability. To understand the root causes for successful attacks, we analyze the payload code, identify the weaknesses exploited, and propose some mitigations that can be adopted by smart living developers and consumers.

Keywords: Denial-of-Service (DoS) · Internet of Things (IoT) · OpenVAS · Smart home · Security vulnerabilities

1 Introduction

The availability of affordable Internet-connected household devices, such as IP cameras, Wi-Fi-enabled light bulbs, and smart TVs, is stimulating the growth of smart living spaces, a typical case of which being the smart connected home. A smart connected home is a residence that uses IoT technologies, such as sensors, smart devices, and communication protocols, allowing for remote access, control, and management, typically via the Internet [12]. As much as we rely more on IoT devices in daily life, these devices are vulnerable to active cyberattacks such as Denial-of-service (DoS) [1]. DoS is typically


described as a widely used attack vector by various malicious threat agents such as hackers, hacktivists, and thieves. Indeed, traditional DoS attacks on information systems can be threats to the smart home, given Internet-connected components [10]. Such attacks may be the first step in removing a smart home component from a network to exploit a vulnerability in its disconnected failure state [10]. The impact of a DoS attack may range from a nuisance to loss of revenues to even loss of life. As an example, in 2016, a major IoT-oriented malware, i.e., Mirai [24], caused severe monetary damage by exploiting devices, mostly consumer IoT devices such as IP cameras found in homes, and converting them into a botnet. Mirai had the capabilities to perform various types of Distributed Denial of Service (DDoS) attacks – DDoS is a DoS attack that uses a high number of hosts to make the DoS attack even more disruptive [11] – like DNS, UDP, STOMP, SYN, and ACK flooding [20]. In 2019, Kaspersky Lab indicated that DDoS attacks have escalated in the IoT by 84%, and their average duration increased by 4.21 times [25]. This highlights the insecurity of current IoT devices and justifies the importance of studying DoS attacks.

Research on DoS attacks tends to be primarily focused on attack detection techniques (e.g., anomaly detection) and response mechanisms (e.g., distributed packet filtering) [21]. To a lesser extent, scholarly studies have been published that focus on the actual DoS attack scenario, which is crucial to determine the resilience of an IoT device. In fact, most of the available studies are made by professional penetration testers (cf. white paper "The IoT Threat Landscape and Top Smart Home Vulnerabilities in 2018" by Bitdefender [34]). The mentioned report [34] indicates that DoS is the most common vulnerability present in the smart home, followed by code execution and buffer overflow.

There are two broad categories of DoS attacks: semantic and flooding attacks [30], [13], [21]. Respectively, these are also called software exploits or application-based attacks and brute-force attacks in scholarly literature. While in flooding attacks a victim is sent a voluminous amount of network traffic to exhaust its bandwidth or computing resources, in semantic attacks packets that exercise specific software bugs (vulnerabilities) are sent to a victim's operating system or application. Although flooding attacks are important, in this paper we focus on semantic attacks for three main reasons: i) these attacks can be an enabler for other security and privacy threats; ii) most of the existing studies target traditional computer devices or resources, e.g., application servers, and not consumer IoT devices; and iii) arguably, while devices are in theory always prone to flooding, they may not be susceptible to software exploits if their software is updated. These characteristics make semantic DoS attacks interesting to study from a scientific perspective.

Specifically, we conduct an experiment on five commercial-off-the-shelf devices: a gaming console, media player, lighting system, connected TV, and IP camera. These consequently represent three different categories of smart living devices commonly found in a home – energy and resource management, entertainment systems, and security and safety. All the devices used in this paper are


manufactured by established industry leaders. The assessment is done through vulnerability scanning. Vulnerability scanning is the process of detecting potential weaknesses on a computer, network or services. Specifically, we leverage Open Vulnerability Assessment System (OpenVAS)1 framework. To understand the root causes for successful attacks, we analyze the payload code, identify the weaknesses exploited, and propose some mitigations that can be adopted by smart living developers and consumers. The remainder of this paper is organized as follows. In Sect. 2, we provide an overview of a typical smart connected home architecture. Next, we summarize related work on DoS. The description of a DoS attack and the experiment design is elaborated on in Sect. 4. In Sect. 5, we summarize the achieved results. Subsequently, we discuss some implications of our findings and provide some guidance for mitigating such vulnerabilities in Sect. 6. Finally, in Sect. 7, we conclude and specify directions for future work.

2 Smart Connected Home Architecture

A smart connected home consists of heterogeneous devices. These typically exchange data about the state of the home, environment, and activities of residents. Commonly, the IoT devices are connected to an IoT gateway, which is in turn connected to the residential Internet router. The gateway/router is the endpoint that connects the IoT devices to the Internet Service Provider (ISP). Some connected home devices, in particular, resourceful nodes such as certain smart TVs, may also have built-in gateway functionality allowing them to connect to the Internet router and sometimes to an ISP directly. The connection between the gateway and router tends to be Ethernet or Wi-Fi based; whereas the communication between the IoT devices and the gateway usually leverages wireless protocols such as Zigbee, Z-wave, and Thread. These protocols are designed for power-efficiency making them ideal for batteryoperated devices. Users can interact with the IoT devices and manage their smart connected home devices through different platforms, most commonly through smartphones. The interaction modalities are in general two: i) directly interacting with them using the services provided by the gateway, and ii) accessing Internet cloud services that interact with the gateway and the connected IoT devices. Typically, the smart connected home relies on a cloud-based infrastructure. These two scenarios are often present simultaneously to support local and remote interactions with the IoT devices. In Fig. 1, we provide a graphical overview of the smart connected home architecture.

3 Related Work

Karig and Lee [22] classify DoS attacks into five different categories: network-device level, OS level, application level, data flood, and protocol feature attack.

https://www.openvas.org/ [accessed December 21, 2019].


Fig. 1. Typical smart connected home architecture. The smart home devices tend to be connected to a gateway(s), typically via wireless protocols, which is in turn connected to the Internet through a broadband router. End-users commonly access the home through a mobile device, e.g., a smartphone.

This categorization is based on the attacked protocol level. The authors also provide countermeasures that mostly reflect the classification of attacks. This work is useful as a basis for understanding DoS attacks and their impact, however it falls short in elaborating on the causes of certain attack categories, e.g., application-based attacks. Mirkovic and Reiher [29] group DDoS attacks into two categories: semantic and brute-force attacks. Brute-force attacks are related to the data flood attacks in Karig and Lee [22] classification as they involve the sending of a large volume of attack packets to a target, whereas the rest are non-flooding attacks. The authors also provide a taxonomy of defense mechanisms differentiating between preventive and reactive mechanisms. While this work is relevant for comprehending DoS attacks, it is primarily focused on DDoS attacks. DDoS attacks tend to be more related to brute-force attacks and have specific attack types such as DNS, NTP, Chargen, and SSDP, which may not be as relevant to DoS. Bonguet and Bellaiche [11] classify DoS and DDoS attacks into two broad categories: overwhelm the resources and vulnerabilities. Respectively, these correspond to the brute-force and semantic attacks as described by Mirkovic and Reiher [29]. The authors present new types of DoS and DDoS attacks, in particular the XML-DoS and HTTP-DoS, that affect cloud computing. They also discuss some detection and mitigation techniques. In our case, we are mainly


interested in investigating the causes of DoS attacks affecting devices found in smart living spaces. The Open Web Application Security Project (OWASP) [33] focuses on the type of vulnerabilities at the application level allowing a malicious user to make certain functionality or, sometimes, the entire website unavailable. They identified eight test cases, such as buffer overflows, each leading to DoS. We leverage the work of OWASP indirectly by conducting an experiment on connected home devices. In reviewing the existing work, we observe that the majority of the published work whilst providing a solid theoretical basis, it does not elaborate much on the method used for conducting a DoS attack. Specifically, we observe the shortage of such studies that test IoT devices against semantic DoS attacks aimed at the application and data processing layers. Except for a few, also most of these tend to run such tests on web applications, instead of services which may also include network and operating system-based software components. With the rise of increasingly targeted attacks and motivated attackers, we believe that semantic DoS attacks are likely to be exploited and thus are important to study. Finally, we observe that most of the mitigations proposed while generic enough, may not necessarily address certain discovered vulnerabilities. Thus, it is useful to investigate firsthand the causes of such attacks to propose more appropriate solutions.

4 Method

4.1 The DoS Attack

DoS attacks attempt to exhaust or disable access to resources at the victim. These resources are either network bandwidth, computing power, or operating system data structures. Effectively, DoS attacks can target all the different protocol layers of the TCP/IP protocol stack. In the home environment, DoS can occur directly at the IoT devices, at the residential router, and at cloud endpoints [16]. Typically, web servers embedded inside IoT devices are a frequent target of attacks.

In this work, we focus on semantic attacks. These attacks take advantage of specific bugs in network services that are running on a target host or by using such applications to drain the resources of their victim [22]. It is also possible that the attacker may have found points of high-algorithmic complexity and leverages them to consume all available resources on a remote host [14].
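As a generic illustration of such a high-complexity point (an assumption for exposition, not an example taken from this paper or from the tested devices), a single short request field can keep a CPU busy when it hits a backtracking regular expression:

# Generic illustration of an algorithmic-complexity DoS: catastrophic
# backtracking in a naive regular expression applied to attacker-controlled input.
import re
import time

pattern = re.compile(r"^(a+)+$")   # nested quantifiers trigger exponential backtracking
payload = "a" * 24 + "b"           # short field an attacker controls; it never matches

start = time.perf_counter()
pattern.match(payload)             # fails, but only after exploring a huge search space
print(f"one malformed field kept the CPU busy for {time.perf_counter() - start:.2f} s")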

4.2 Experiment Setup

An experiment was devised to test IoT devices for their resiliency against DoS attacks. The experiment was conducted in April 2019, and it featured smart devices that had their firmware upgraded to the latest as detailed in [3].


The experimental platform is based on Liang et al. [26] framework which implemented a DoS attack method for IoT systems. Effectively, our experiment setup is an instance of the smart connected home architecture described in Sect. 2. Each smart device had embedded gateway functionality and was directly connected via its Wi-Fi interface to the router. The smartphone role is delegated to the PC. The router is in turn connected to the broadband modem via Ethernet. A smart connected home is typically characterised by a mix of devices, but often contains a so-called starter kit with a few core devices that are typically manufactured by one supplier [36]. Our testbed devices are chosen to reflect this; however we selected devices that were produced by different vendors to have a more generic overview of devices’ exposure to DoS attacks. A schematic illustration of the setup is shown in Fig. 2 and consists of the following components:

Fig. 2. Experiment design setup. The setup consisted of five IoT devices connected to the broadband router over Wi-Fi protocol. The attacker platform (Kali Linux) was connected to the LAN and to the different smart devices over the Wi-Fi channel. Different DoS attacks were executed through OpenVAS software.

– PC: Portable workstation that reads data from smart devices, and furnishes data to users through the help of software applications. The PC, Windows 10, had virtualization software installed; specifically, Oracle VM VirtualBox – a free and open-source hosted hypervisor for x86 virtualization – that is used to host the "attacker platform". The PC had one physical network card installed.
– Attacker platform: A virtual machine installed with Kali Linux and the OpenVAS vulnerability scanner. The attacker platform was connected to the Internet in order to install OpenVAS and later to download the latest vulnerability tests for it. Also, it was connected to the Local Area Network (LAN) in order to execute DoS attacks on the smart devices. Kali Linux was configured

https://www.virtualbox.org/ [accessed December 21, 2019]. https://www.kali.org/ [accessed December 21, 2019].


with the Network Address Translation (NAT) networking mode as a means for accessing the Internet alongside the smart devices.
– Router: A networking device that forwards data packets between the connected devices and the Internet and assigns IP addresses to the PC and smart devices. In our case, the router was a Compal router that connected the PC, smart devices, and the attacker platform in a LAN setup.
– Smart devices: Five commercial-off-the-shelf IoT devices: a gaming console, media player, lighting system, connected TV, and IP camera [3]. The IP addresses for the devices were automatically assigned by the router using the Dynamic Host Configuration Protocol (DHCP).

Smart devices process data, which are transferred to the PC via the router. In reality, the role of the PC could be that of, for instance, a smartphone application or a web page that displays processed results from the smart devices. The components and their IP addresses are summarized in Table 1.

Table 1. Device types alongside their assigned IP addresses and roles.

Device type        IP address    Role
PC                 192.168.0.10  Attacker host
Attacker platform  192.168.0.10  Attacker
Router             192.168.0.1   Local network gateway
Gaming console     192.168.0.13  Victim
Lighting system    192.168.0.4   Victim
Media player       192.168.0.9   Victim
Connected TV       192.168.0.7   Victim
IP camera          192.168.0.29  Victim

The network utility ping was used to check the connection between the PC and smart devices. This was used prior to running the vulnerability scans.

4.3 Vulnerability Scanning

Various security tools (scanners) exist that can assist in finding and analyzing security vulnerabilities. Tundis et al. [42] in their review of network vulnerabilities scanning tools, group such tools into two main categories: automatic scanning tools with publicly shared results and personal interaction-based scanning tools. Whereas in the former category tools automatically scan the Internet and render their results publicly, in the latter results are only returned to the tool operator. In our case, we rely on personal interaction-based scanning tools for ethical reasons and as the devices were not configured with a public IP address. Three personal interaction-based scanning tools that are used by security researchers, e.g., in [18], are: Nessus, Metasploit Pro, and OpenVAS. Nessus


is a proprietary vulnerability scanner produced by Tenable Network Security. Metasploit Pro is a security scanner that also allows for the exploitation of vulnerabilities (i.e., penetration testing). Both Nessus and Metasploit Pro are commercial tools that are used by various security professionals for security compliance and assessment purposes. OpenVAS is free software – effectively a fork of Nessus – for vulnerability scanning and management. In our case, given that OpenVAS is free, that it offers a comprehensive vulnerability management platform with similar features to commercial tools, and that other security researchers have used it for purposes similar to our study, we rely on it as our scanner.

In the experiment, we assumed an attack model where the malicious threat agent is located inside the smart home network, having both physical and digital access to the connected devices and attacker platform. Nonetheless, we only consider semantic DoS exploits and not DoS caused by physically disabling a device.

For the experiment, we configured OpenVAS on Kali Linux according to its official documentation [38]. First, we ensured that Kali Linux was updated and then installed the latest OpenVAS through the command "openvas-setup". Once the setup was completed, the command "netstat -antp" was entered to verify that OpenVAS' requisite network services – in particular, its manager, scanner, and the Greenbone Security Assistant Daemon (GSAD) – were open and listening. Next, the command "openvas-start" was keyed to start all the services. Once the services were successfully started, we connected to the OpenVAS web interface by pointing the web browser (in our case Mozilla Firefox) to it. Therein, we configured OpenVAS scanning to "Full and very deep ultimate" and used as input the port list "All TCP and Nmap 5.51 top 100 UDP" [19]. This allowed the scanner to test most of the smart devices' network ports (in total 65,535 TCP ports and 99 UDP ports) for a broad range of vulnerability classes. Nonetheless, we limited the test cases to DoS attacks only; at the time of the experiment, OpenVAS had 1,384 network tests for DoS.

4.4 Attack Introspection

After the scans were completed, results were displayed on the PC. For each successful attack, we inspected the attack payload, i.e., the exploit code, that caused the DoS attack to succeed. This was done to understand the mechanics of the attack. Online security vulnerability databases were used as a source for getting details about the exploits and their code. In doing so, the following public databases were used: SecurityFocus, CVE Details, and Vulners. The aforementioned databases were used in tandem with the actual test case code as executed by OpenVAS.

https://www.securityfocus.com/ [accessed December 21, 2019]. https://www.cvedetails.com/ [accessed December 21, 2019]. https://vulners.com/ [accessed December 21, 2019].


Furthermore, to identify the root causes for an attack to succeed we leveraged the classification scheme employed by the National Vulnerability Database (NVD) of the National Institute of Standards and Technology7 . NVD has gained recognition from organizations such as MITRE Corporation8 and has been used by researchers for similar purposes [2] to ours. This classification is based on the causes of vulnerabilities, grouping them into eight classes: input validation error, access validation error, exception condition error handling, environmental error, configuration error, race condition error, design error, and others [2].

5 Results

5.1 Smart Living Devices Vulnerabilities

Following the execution of vulnerability scanning as described in Sect. 4.3, a total of 13 DoS-related vulnerabilities were found to affect the tested smart living devices. The device that was most prone to semantic DoS attacks was a gaming console. This had nine vulnerabilities, two of which were reported as having critical severity. Critical severity indicates that the effects of exploiting the vulnerability can result in total compromise of the device. One of the discovered vulnerabilities – Linksys WRT54G DoS – was rated with the most severe score (CVSS score: 10), allowing an intruder to "freeze" the gaming console web server simply by sending empty GET requests. This leads to a total compromise of confidentiality, integrity, and availability of the system. A similar high severity (CVSS score: 9.3) vulnerability – LiteServe URL Decoding DoS – was found in an IP camera device. Here, a remote web server could simply become unavailable by parsing a URL consisting of a long invalid string of % symbols.

Overall, seven of the thirteen vulnerabilities were ranked with medium severity – medium severity means that the vulnerability can reduce the performance or lead to a loss of some functionality of the targeted device – four ranked with critical severity, and two ranked as high severity. Furthermore, none of the discovered vulnerabilities required the attacker to authenticate to the victim host in order to exploit them. No DoS-related vulnerabilities were found to affect the tested lighting system and the media player.

While all the conducted attacks involved semantic attacks, certain vulnerabilities, while in a minority, compromised not only the high-level application (e.g., the administration console of an embedded web server) but also the underlying operating system (e.g., Windows) and hardware (i.e., the device's firmware). Only one vulnerability – HTTP Windows 98 MS/DOS device names DoS – targeted the operating system software. Table 2 is a summary of discovered DoS-related vulnerabilities occurring on each category of tested smart devices. The severity follows the qualitative severity ranking scale as identified in the CVSS v3.0 specification [15].

http://www.cve.mitre.org/ [accessed December 21, 2019]. https://www.mitre.org/ [accessed December 21, 2019].

5.2 DoS Attack Characteristics

The outcome of the attack introspection stage described in Sect. 4.4 is summarized in Fig. 3.

Table 2. Summary of semantic DoS-related vulnerabilities found in five commercial smart devices.

Device type      Vulnerability title                                        Affected component  Severity  Availability
Gaming console   Linksys WRT54G DoS                                         Hardware            Critical  Complete
                 Mongoose webserver content-length DoS                      Application         High      Complete
                 HTTP Windows 98 MS/DOS device names DoS                    Operating system    Critical  Complete
                 Format string on HTTP method name                          Application         Medium    Complete
                 Webseal DoS                                                Application         Medium    Partial
                 Jigsaw web server MS/DOS device DOS                        Application         Medium    Partial
                 HTTP unfinished line Denial                                Application         Medium    Partial
                 mod access referer 1.0.2 NULL point dereference            Application         Medium    Partial
                 Mereo "GET" request remote buffer overflow vulnerability   Application         Medium    Partial
Connected TV     Mongoose "Content-Length" HTTP header remote DoS           Application         Critical  Partial
                 Jigsaw web server MS/DOS device DOS                        Application         Medium    Partial
IP camera        LiteServe URL decoding DoS                                 Application         Critical  Complete
                 Polycom ViaVideo DoS                                       Hardware            High      Partial
Lighting system  /                                                          /                   /         /
Media player     /                                                          /                   /         /

From Fig. 3, we observe that most of the DoS-attacks target the high-level application and belong to the “Exception condition error handling” vulnerability class. Vulnerabilities in this class arise due to failures in responding to unexpected data or conditions.

Fig. 3. Summary of discovered vulnerability causes grouped by the affected component. The majority of DoS attacks were successful on the device’s application, typically the web server software, and were the result of a failure in managing exceptional conditions.


Table 3. Characteristics of DoS attacks that resulted in the compromise of IoT availability.

Vulnerability title                                        Attack feature                       Payload      Attack method  Remote  Reference
Linksys WRT54G DoS                                         Empty HTTP request                   Script       HTTP HEAD      Yes     [8]
HTTP unfinished line Denial                                Malformed HTTP request               Script       HTTP HEAD      Yes     [5]
Mereo "GET" request remote buffer overflow vulnerability   Large user-supplied input            Script       HTTP GET       Yes     [17]
Format string on HTTP method name                          Malformed HTTP request               Script       HTTP HEAD      Yes     [4]
LiteServe URL decoding DoS                                 Large user-supplied input            Script       HTTP GET       Yes     [9]
Polycom ViaVideo DoS                                       Incomplete HTTP request              Shellcode    HTTP GET       Yes     [41]
Webseal DoS                                                Malformed HTTP request               Web browser  HTTP GET       Yes     [40]
Mongoose webserver content-length DoS                      "Content-Length" HTTP header field   Script       HTTP GET       Yes     [37]
Mongoose "content-length" HTTP header remote DoS           "Content-Length" HTTP header field   Script       HTTP GET       Yes     [40]
Jigsaw web server MS/DOS device DOS                        Resource request                     Script       HTTP GET       Yes     [7]
HTTP Windows 98 MS/DOS device name                         Filename in URL                      Script       HTTP GET       Yes     [6]
mod access referer 1.0.2 NULL point dereference            "Referer" HTTP header field          Script       HTTP GET       Yes     [39]

The rest of the vulnerabilities correspond to "Input validation error" and "Design error". Input validation error includes vulnerabilities that fail to verify incorrect input (boundary condition errors) and read/write operations involving an invalid memory address (buffer overflows). Design errors are caused by improper design of the software structure. In Table 3, we summarize the characteristics of the attacks that exploited these vulnerability classes. We observe that all of the attacks were remote exploits. Remote exploits work over a network, such as the Internet, exploiting the security vulnerability without requiring any prior access to the vulnerable system. This is in contrast to a local exploit, which requires prior access to the vulnerable system. The majority of the attacks required basic programming knowledge to develop. At a minimum, this required familiarity with the workings of the HTTP protocol (e.g., HTTP methods, in particular, the GET method) and network


programming (e.g., TCP/IP socket management). This is needed to create and send specifically crafted packets to an IoT component. Mostly, the attack payload was transferred to the connected device by manipulating the content of a legitimate HTTP header field, e.g., the "Content-Length" attribute, which specifies the length of the request body.
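For illustration only, the sketch below shows the general shape of such a single crafted request; the target address and the Content-Length value are placeholders rather than any of the actual test payloads, and probes of this kind should only be run against devices one owns.

# Illustrative shape of a semantic DoS probe: one request whose Content-Length
# header is manipulated. TARGET/PORT and the header value are placeholders.
import socket

TARGET, PORT = "192.168.0.13", 80       # e.g., a device on one's own test LAN

request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {TARGET}\r\n"
    "Content-Length: -2147483648\r\n"   # nonsensical value for a GET request
    "Connection: close\r\n\r\n"
)

try:
    with socket.create_connection((TARGET, PORT), timeout=5) as s:
        s.sendall(request.encode("ascii"))
        print(s.recv(1024)[:200])        # a robust server answers with an error status
except OSError as err:
    print("no usable response:", err)    # a hang or reset here is the symptom of interest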

6 Discussion

6.1 The Impact of DoS Attacks

Even though the tested device types represent only around 6% of the available device categories in a smart connected home [12], the attained results already highlight the gravity of the current situation. This is especially so as these represent on average around 25% of the number of available devices in a regular smart home [34], the devices belong to the three categories of functionality with the most device types [12], and our test subjects are manufactured by international companies with overall high security maturity. The majority of the remaining manufacturers are IoT startups that tend to prioritize simplicity and ease of use over security.

Due to the limited energy capacities and interconnectedness of IoT devices, the impact of DoS attacks can be severe. For instance, DoS attacks can cause battery-draining issues leading to node outages or a failure to report an emergency situation. This can happen, as an example, if an attacker targets an Internet-connected smoke detector, which consequently may disable the fire detection system and possibly lead to a fatality. In some cases, a successful DoS attack can also allow an attacker to lock down an entire building or access to a room, for instance, by making certain online authentication services, e.g., a cloud service required by a smart lock, unavailable. In extreme cases, DoS attacks may lead to permanent damage to a system, requiring a device replacement or re-installation of the hardware. This can happen, as an example, when fake data are sent to connected thermostats in an attempt to cause irreparable damage via extreme overheating.

Beyond affecting the availability of a system, DoS attacks conducted at different architecture layers can compromise other security requirements such as accountability, auditability, and privacy [31]. For instance, when devices are offline, adversaries can use that window of time to hack sensitive information or infer more information. Furthermore, when a high number of hosts are combined, as in the case of DDoS, the effects could be even more disruptive. For instance, in 2016, a DDoS attack with compromised IoT devices targeting the DNS service provider Dyn effectively took offline sites such as GitHub, Airbnb, and Amazon [28]. Overall, this resulted in reputation damage, diminished IT productivity, and revenue losses to different stakeholders.

6.2 On the Causes of DoS Attacks

Analyzing the software weaknesses that were exploited by the successful DoS attacks, we find that improper checks for unusual or exceptional conditions are


the root cause of such vulnerabilities. This could be indicative that: i) IoT developers are making assumptions that certain events or conditions will never occur; ii) IoT developers are reusing software libraries without performing proper security testing; iii) IoT developers are not properly trained in software security; or iv) security is not a top priority for an organization. Moreover, this raises generic concerns about the way IoT devices are being developed.

Our study is similar in scope to that of Bonguet and Bellaiche [11]. However, instead of focusing on cloud computing DoS and DDoS, we focused specifically on consumer-based IoT devices. Moreover, we expanded on the semantic-based DoS attack category, which the aforementioned study classifies as "design flaws" and "software bugs" vulnerabilities, with vulnerability classes we identified firsthand.

Analyzing the successful attacks, we observe the prevalence of HTTP GET DoS attacks, where the application layer protocol HTTP is exploited. Interestingly, the HTTP GET DoS attacks did not have to be used repeatedly – for example, by running the attack in a loop, as is required in an HTTP flood attack [27] – and yet had the same consequences. This signals the dangers of these attacks and the challenges involved in detecting them. While HTTP flooding can trigger an alert about a possible intrusion since multiple HTTP requests are sent to a target device, the chance of detecting semantic DoS is relatively slim as only one request could be needed to achieve the same effect.

Many IoT devices share generic components from a relatively small set of manufacturers. This means that a vulnerability in one class of IoT hardware is likely to be repeated across a vast range of products. For instance, one of the vulnerabilities we discovered with the gaming console was targeting the open source Mongoose web server. Mongoose is identified as GitHub's most popular embedded web server and multi-protocol networking library. With this, it is a likely threat that other IoT platforms are prone to the same security risk. This also makes us reflect on the state of the other IoT devices available in the smart home market, and in general, on the security practices being adopted by companies, especially since most companies develop their software by reusing existing software libraries. This indicates that besides the functionality aspects, vulnerabilities are automatically inherited, putting the customers at risk but also the vendor's reputation at stake. A case in point: at the DefCon 22 conference [35], a popular cloud-based Wi-Fi camera was revealed to be using a vulnerable version of the OpenSSL library – a widely used software library for applications to secure communications – with the Heartbleed vulnerability. Exploiting this could allow eavesdropping on seemingly encrypted communications, stealing private data, and impersonating services and users.

In this study, we have investigated how DoS attacks are conducted, and how exploit code takes advantage of lapses in security practices such as input validation. We believe that it is relatively easy to exploit those weaknesses and potentially launch large-scale attacks without the knowledge of the owner. This is also amplified with the availability of automatic scanning tools with publicly shared

https://cesanta.com/ [accessed December 21, 2019].


results, such as Shodan, that simplify the process of discovering and exploiting Internet-connected devices. Furthermore, this is aggravated considering that some vendors offer services, oftentimes referred to as "stresser" or "booter" services, which can be used to perform, at a cost, unauthorized remote DoS attacks on Internet hosts.

6.3 Mitigating DoS Attacks

Protection against DoS and its distributed counterpart (DDoS) is a challenging task, especially for IoT architectures considering their constraints, e.g., in terms of battery, memory, and bandwidth. Only a limited number of research solutions, e.g., [23], have been proposed for the protection of IoT against DoS attacks. However, such approaches do not focus on the application layer but mainly deal with network layer protection. This also concurs with reports from leading industry vendors which underscore the difficulty of defending against application attacks and, simultaneously, the rise of attacks in this category [32]. Hereunder, we present some approaches that can be adopted by smart living developers and end-users to prevent, detect, and react to semantic DoS attacks.

Data Controller Mitigations. This represents safeguards that can be adopted by IoT device manufacturers, IoT developers, and service providers.
– Authentication mechanisms. This plays a critical role in the security of any IoT device and service. It is useful for detecting and blocking unauthorized devices and services [43]. Strong authentication can be applied potentially at the home gateway, with this device often acting as the gatekeeper mediating requests between connected devices, services, and users.
– Input validation. As a secure coding principle, this helps prevent semantic-based DoS attacks. Additionally, if input validation is performed properly, including on the HTTP headers, this can also help prevent SQL injection, script injection, and command execution attacks (a minimal validation sketch is given at the end of this subsection).
– Secure architecture. IoT devices need to sustain their availability at desired levels. Possibly, a robust architecture should leverage a defense-in-depth strategy, e.g., having multiple layers of controls at the device level, cloud level, and service level, thus reducing the risk of having the entire system or stack become unavailable.
– Secure configuration. IoT devices should be configured not to disclose information about the internal network, server software, and plug-ins/modules installed (e.g., banner information). Primarily, this is important as otherwise such information may get indexed and picked up by online scanners, which could then be used to conduct attacks.
– Security testing. Code should be inspected for vulnerabilities before it gets released to consumers. Here, software auditing and penetration testing could be used, e.g., to detect test interfaces and weak configurations that could lead to compromise.

https://www.shodan.io/ [accessed December 21, 2019].


Furthermore, a company may offer incentives, e.g., through bug bounty programs, especially to help discover zero-day vulnerabilities. At the same time, it is also key for vendors to release updates, possibly on a cyclical basis, to improve the security of their product.

Consumer Mitigations. This represents controls that can be adopted by end-users, in particular by the IoT device users.
– Filtering. Filtering techniques, e.g., ingress/egress filtering or history-based filtering, prevent unauthorized network traffic from entering a protected network [27]. Filtering can be applied to residential routers and can also be used as a strategy to respond to DoS attacks.
– Intrusion prevention/detection system. Intrusion prevention/detection mechanisms, such as signature-based detection and anomaly-based detection, can be used to proactively block malicious traffic and threats from reaching IoT devices. This system could be a separate physical device connected to the residential Internet router.
– Secure configurations. Operating system and server vendor-specific security configurations should be implemented where appropriate, including the disabling or removal of unnecessary users, groups, and default accounts.
– Secure network services. To prevent unauthorized users from connecting to IoT devices and implanting an attack, remote access options (e.g., Telnet or SSH) to the router and other network devices that may have them enabled for remote administration should be disabled or otherwise securely configured.
– Secure overlay. This method involves the creation of an overlay network, typically through a firewall, on top of the IP network. This overlay network then acts as the entry point for the outside network, ensuring that only trusted traffic can get entry to the protected network.
– Security patches. IoT devices should be kept updated with the latest security patches as issued by the vendor regularly to ensure that the system is not affected by malware. When updates are not available, some possible alternatives are: putting another control, e.g., a perimeter firewall or intrusion prevention/detection system, in front of the vulnerable device; changing the IP address of the affected device; disabling the compromised feature; or replacing the hardware with a newer release.

Beyond the data controller and consumer-based mitigations, we also see the need for three other requirements that must be met to ensure the overall security and resiliency of IoT devices. First, more stringent regulations and potentially certification programs are needed for IoT device manufacturers. Second, the early integration of security from the design stage and the enforcement of a risk management strategy, potentially as a joint effort of legislators, security experts, and manufacturers. Third, recognizing that classical security solutions are challenging to port to the IoT domain, it is crucial to increase security awareness among consumers. This could, for instance, be done through government initiatives, but manufacturers can also educate consumers about security.
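As a minimal illustration of the input validation mitigation mentioned above (an assumption for exposition, not code from the paper or from any of the tested devices), a service can validate the Content-Length header before allocating any buffer:

# Minimal input-validation sketch: reject requests whose Content-Length header
# is missing, non-numeric, or out of bounds. Limits and messages are illustrative.
MAX_BODY = 16 * 1024        # largest body this endpoint is willing to buffer

def validated_content_length(headers: dict) -> int:
    raw = headers.get("content-length", "0").strip()
    if not raw.isdigit():                     # rejects "", "-1", "1e9", "0x10", ...
        raise ValueError("400 Bad Request: malformed Content-Length")
    length = int(raw)
    if length > MAX_BODY:
        raise ValueError("413 Payload Too Large")
    return length

print(validated_content_length({"content-length": "512"}))   # -> 512
try:
    validated_content_length({"content-length": "-2147483648"})
except ValueError as err:
    print(err)                                # 400 Bad Request: malformed Content-Length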

7 Conclusion and Future Work

The growth and heterogeneity of connected devices being deployed in smart living spaces, in particular, inside homes, raises the importance of an assessment of their security. In this paper, we conducted a vulnerability assessment focusing on the availability of Internet-connected devices. The experiment was carried out using OpenVAS and it featured five commercial-off-the-shelf IoT devices: a gaming console, media player, lighting system, connected TV, and IP camera. The attained results indicate that the majority of the tested devices are prone to severe forms of semantic DoS attacks. Exploiting these attacks may lead to a complete compromise of the security of the entire smart living system. This indicates the gravity of the current situation, serving as a catalyst to raise awareness and stimulate further discussion of DoS related issues within the IoT community. Furthermore, to understand the root causes for successful attacks, we analyzed the payload code, profiled the attacks, and proposed some mitigations that can be adopted by smart living developers and consumers.

As part of future work, we intend to generalize this study in three areas. First, we plan to include a broader selection of devices, including routers. Routers tend to be one of the most vulnerable components that a successful attack can leverage to potentially disable legitimate access to the entire smart connected home. Second, we aim to consider an attack model where the malicious threat agent is located remotely behind a cloud or service provider infrastructure. Finally, we plan to research methods that can proactively allow for the detection of DoS attacks. Possibly, this will involve the use of machine learning to learn a baseline security profile for each device.

Acknowledgments. This work has been carried out within the research profile "Internet of Things and People," funded by the Knowledge Foundation and Malmö University in collaboration with 10 industrial partners.

References

1. Alanazi, S., Al-Muhtadi, J., Derhab, A., Saleem, K., AlRomi, A.N., Alholaibah, H.S., Rodrigues, J.J.: On resilience of wireless mesh routing protocol against DoS attacks in IoT-based ambient assisted living applications. In: 17th International Conference on E-health Networking, Application & Services (HealthCom), pp. 205–210. IEEE (2015)
2. Alhazmi, O.H., Woo, S.-W., Malaiya, Y.K.: Security vulnerability categories in major software systems. Commun. Netw. Inf. Secur. 2006, 138–143 (2006)
3. Andersson, S., Josefsson, O.: On the assessment of denial of service vulnerabilities affecting smart home systems (2019)
4. Arboi, M.: Format string on http method name. https://vulners.com/openvas/OPENVAS:11801
5. Arboi, M.: Http unfinished line denial. https://vulners.com/openvas/OPENVAS:136141256231011171


6. Arboi, M.: Http windows 98 MS/DOS device names DOS. https://vulners.com/openvas/OPENVAS:136141256231010930
7. Arboi, M.: Jigsaw webserver MS/DOS device DOS. https://vulners.com/openvas/OPENVAS:11047
8. Arboi, M.: Linksys WRT54G DOS. https://vulners.com/openvas/OPENVAS:136141256231011941
9. Arboi, M.: LiteServe URL decoding DOS. https://vulners.com/openvas/OPENVAS:11155
10. Barnard-Wills, D., Marinos, L., Portesi, S.: Threat landscape and good practice guide for smart home and converged media. In: European Union Agency for Network and Information Security (ENISA) (2014)
11. Bonguet, A., Bellaiche, M.: A survey of denial-of-service and distributed denial of service attacks and defenses in cloud computing. Future Internet 9(3), 43 (2017)
12. Bugeja, J., Davidsson, P., Jacobsson, A.: Functional classification and quantitative analysis of smart connected home devices. In: Global Internet of Things Summit (GIoTS), pp. 1–6. IEEE (2018)
13. Carl, G., Kesidis, G., Brooks, R.R., Rai, R.: Denial-of-service attack-detection techniques. IEEE Internet Comput. 10(1), 82–89 (2006)
14. Douligeris, C., Mitrokotsa, A.: DDoS attacks and defense mechanisms: classification and state-of-the-art. Comput. Netw. 44(5), 643–666 (2004)
15. FIRST: CVSS v3.1 specification document. https://www.first.org/cvss/specification-document
16. Geneiatakis, D., Kounelis, I., Neisse, R., Nai-Fovino, I., Steri, G., Baldini, G.: Security and privacy issues for an IoT based smart home. In: 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)
17. GmbH, G.N.: Mereo 'get' request remote buffer overflow vulnerability. https://vulners.com/openvas/OPENVAS:100776
18. Gordin, I., Graur, A., Potorac, A., Balan, D.: Security assessment of OpenStack cloud using outside and inside software tools. In: International Conference on Development and Application Systems (DAS), pp. 170–174. IEEE (2018)
19. Greenbone.net: 16. Performance — Greenbone Security Manager (GSM) 4 documentation. https://docs.greenbone.net/GSM-Manual/gos-4/en/performance.html#about-ports
20. Herzberg, B., Bekerman, D., Zeifman, I.: Breaking down Mirai: an IoT DDoS botnet analysis. Incapsula Blog, Bots and DDoS, Security (2016)
21. Hussain, A., Heidemann, J., Heidemann, J., Papadopoulos, C.: A framework for classifying denial of service attacks. In: Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 99–110. ACM (2003)
22. Karig, D., Lee, R.: Remote denial of service attacks and countermeasures. Princeton University Department of Electrical Engineering, Technical report CE-L2001002, 17 (2001)
23. Kasinathan, P., Pastrone, C., Spirito, M.A., Vinkovits, M.: Denial-of-service detection in 6LoWPAN based internet of things. In: IEEE 9th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 600–607. IEEE (2013)
24. Kolias, C., Kambourakis, G., Stavrou, A., Voas, J.: DDoS in the IoT: Mirai and other botnets. Computer 50(7), 80–84 (2017)
25. Kupreev, A.G.O., Badovskaya, E.: DDoS attacks in Q1 2019 — Securelist. https://securelist.com/ddos-report-q1-2019/90792/


26. Liang, L., Zheng, K., Sheng, Q., Huang, X.: A denial of service attack method for an IoT system. In: 8th International Conference on Information Technology in Medicine and Education (ITME), pp. 360–364. IEEE (2016)
27. Mahjabin, T., Xiao, Y., Sun, G., Jiang, W.: A survey of distributed denial-of-service attack, prevention, and mitigation techniques. Int. J. Distrib. Sens. Netw. 13(12), 1550147717741463 (2017)
28. Mansfield-Devine, S.: DDoS goes mainstream: how headline-grabbing attacks could make this threat an organisation's biggest nightmare. Netw. Secur. 2016(11), 7–13 (2016)
29. Mirkovic, J., Reiher, P.: A taxonomy of DDoS attack and DDoS defense mechanisms. ACM SIGCOMM Comput. Commun. Rev. 34(2), 39–53 (2004)
30. Moore, D., Shannon, C., Brown, D.J., Voelker, G.M., Savage, S.: Inferring internet denial-of-service activity. ACM Trans. Comput. Syst. (TOCS) 24(2), 115–139 (2006)
31. Mosenia, A., Jha, N.K.: A comprehensive study of security of internet-of-things. IEEE Trans. Emerg. Top. Comput. 5(4), 586–602 (2016)
32. Muncaster, P.: DDoS attacks jump 18% YoY in Q2 — Infosecurity Magazine. https://www.infosecurity-magazine.com/news/ddos-attacks-jump-18-yoy-in-q2/
33. OWASP: OWASP testing guide. https://www.owasp.org/images/5/56/OWASP Testing Guide v3.pdf
34. Pascu, L.: The IoT threat landscape and top smart home vulnerabilities in 2018. https://www.bitdefender.com/files/News/CaseStudies/study/229/BitdefenderWhitepaper-The-IoT-Threat-Landscape-and-Top-Smart-Home-Vulnerabilitiesin-2018.pdf
35. Patrick Wardle, C.M.: Optical surgery; implanting a dropcam. https://www.defcon.org/images/defcon-22/dc-22-presentations/Moore-Wardle/DEFCON-22Colby-Moore-Patrick-Wardle-Synack-DropCam-Updated.pdf
36. Pătru, I.-I., Carabaş, M., Bărbulescu, M., Gheorghe, L.: Smart home IoT system. In: 15th RoEduNet Conference: Networking in Education and Research, pp. 1–6. IEEE (2016)
37. SecPod: Mongoose webserver content-length denial of service vulnerability. https://vulners.com/openvas/OPENVAS:1361412562310900268
38. Security, O.: OpenVAS 8.0 vulnerability scanning — Kali Linux. https://www.kali.org/penetration-testing/openvas-vulnerability-scanning
39. SecurityFocus: Apache mod access referer null pointer dereference denial of service vulnerability. https://www.securityfocus.com/bid/7375/exploit
40. SecurityFocus: IBM Tivoli policy director WebSeal denial of service vulnerability. https://www.securityfocus.com/bid/3685/exploit
41. SecurityFocus: Polycom ViaVideo denial of service vulnerability. https://www.securityfocus.com/bid/5962/exploit
42. Tundis, A., Mazurczyk, W., Mühlhäuser, M.: A review of network vulnerabilities scanning tools: types, capabilities and functioning. In: Proceedings of the 13th International Conference on Availability, Reliability and Security, p. 65. ACM (2018)
43. Yoon, S., Park, H., Yoo, H.S.: Security issues on smarthome in IoT environment. In: Park, J., Stojmenovic, I., Jeong, H., Yi, G. (eds.) Computer Science and Its Applications, pp. 691–696. Springer, Heidelberg (2015)

Energy Efficient Channel Coding Technique for Narrowband Internet of Things

Emmanuel Migabo1,2(B), Karim Djouani1,2, and Anish Kurien1

1 Tshwane University of Technology (TUT), Staatsartillerie Road, Pretoria 0001, South Africa
[email protected]
2 Université de Paris-Est Créteil, Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), Vitry sur Seine, 94400 Paris, France

Abstract. Most of the existing Narrowband Internet of Things (NB-IoT) channel coding techniques are based on repeating transmission data and control signals as a way to enhance the network's reliability and, therefore, enable long-distance transmissions. However, most of these efforts come at the expense of the energy consumption of the NB-IoT network and do not always consider the channel conditions. Therefore, this work proposes a novel NB-IoT Energy Efficient Adaptive Channel Coding (EEACC) scheme. The EEACC approach is a two-dimensional (2D) approach which not only selects an appropriate channel coding scheme based on the estimated channel conditions, but also minimizes the transmission repetition number under a pre-assessed probability of successful transmission. It is aimed at enhancing the energy efficiency of the network by dynamically selecting the appropriate Modulation Coding Scheme (MCS) number and efficiently minimizing the transmission repetition number. Link-level simulations are performed under different channel conditions (good, medium or bad) in order to assess the performance of the proposed uplink adaptation technique for NB-IoT. The obtained results demonstrate that the proposed technique outperforms the existing Narrowband Link Adaptation (NBLA) as well as the traditional repetition schemes in terms of the achieved energy efficiency as well as reliability, latency and network scalability.

Keywords: Link adaptation · Adaptive · Energy efficiency · Data rates · Throughput · Modulation Coding Scheme (MCS) · Repetition number · Narrowband IoT (NB-IoT)

1 Introduction

Low Power Wide Area Networks (LPWAN) are very promising Internet of Things (IoT) technologies for future wireless communications. In the recent decade, LPWANs have developed rapidly and have, therefore, been catching the significant attention of many researchers around the globe. A study by


[2] has identified the Narrowband Internet of Things (NB-IoT) and Long Range (LoRa) as the two leading LPWAN technologies within the licensed and the unlicensed bands, respectively, towards enabling the future of the Internet of Things. However, due to the rapid growth in the number of connected IoT devices, a couple of issues have been identified, among which is the energy consumption of the network, which is directly related to its lifetime [3]. Narrowband Internet of Things (NB-IoT) is a new narrowband radio technology introduced in Third Generation Partnership Project Release 13 towards the 5th generation evolution for providing low-power wide-area IoT [9]. Techniques regarding the performance enhancement of NB-IoT wireless communication systems are being studied in current research. Specifically, in [4], the authors presented a systematic review of IoT which includes its different definitions, key technologies, open issues and major challenges. Furthermore, in [5], the authors provided, for the first time, a systematic survey regarding NB-IoT in industry. It reviewed extensive research, key enabling technologies and major NB-IoT applications in industry, and identified research trends and challenges. At a recent plenary meeting in South Korea, the Third Generation Partnership Project (3GPP) completed the standardization of NB-IoT [5], in which NB-IoT is regarded as a very important technology and a large step in the 5G IoT evolution. Industries, including Ericsson, Nokia, and Huawei, have shown great interest in NB-IoT as part of 5G systems and have focused significant effort towards standardization. In 3GPP standardization, repeating transmission data and the associated control signaling several times has been utilized as a base solution to achieve coverage enhancement for NB-IoT [1]. However, repeated transmissions come with a multiplicative energy cost: it has been clearly demonstrated by several wireless network energy consumption modelling studies, such as [6] and [7], that the major contributors to the overall average energy consumption of a wireless network are the energy consumed during transmission and the energy consumed during reception, according to Eq. (1),

E_{Trans} = \frac{(P_{Tx} + P_{Rx}) \times M}{B \times P_{st}} \quad (1)

where:
– E_{Trans} is the energy consumed by the transceiver circuit on a single wireless link,
– P_{Tx} is the transmission power,
– P_{Rx} is the reception power,
– B is the transmission bit rate and
– P_{st} is the probability of successful transmission.

This high energy consumption is what this article mitigates by proposing an adaptive energy-efficient channel coding approach.
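To make the role of each term concrete, the following minimal Python sketch evaluates Eq. (1) for a single link. Interpreting M as the message size in bits is an assumption (the symbol is not defined in the list above), and the values in the example call are purely illustrative, not measurements from this study.

```python
def transmission_energy(p_tx_w, p_rx_w, msg_bits, bitrate_bps, p_success):
    """Eq. (1): average energy spent to deliver a message of msg_bits bits over a
    single wireless link, where failed attempts are accounted for through the
    probability of successful transmission p_success."""
    if not (0.0 < p_success <= 1.0):
        raise ValueError("p_success must lie in (0, 1]")
    return (p_tx_w + p_rx_w) * msg_bits / (bitrate_bps * p_success)

# Illustrative (assumed) values: 200 mW Tx, 80 mW Rx, 680-bit block,
# 20 kbps uplink rate, 90% probability of successful transmission.
e_trans = transmission_energy(0.2, 0.08, 680, 20e3, 0.9)
print(f"E_Trans ≈ {e_trans * 1e3:.2f} mJ per delivered block")
```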


It is also very important to note that, on one hand, the choice of different Modulation Coding Scheme (MCS) levels considerably influences the overall network performance of the NB-IoT system in terms of its reliability, energy efficiency, scalability and latency. In fact, the use of a low MCS coupled with a high transmitting power has been demonstrated to be capable of improving the transmission reliability and, therefore, enhancing the network coverage in terms of longer transmission distance and immunity to noise. However, this reduces the network's throughput and causes the overall energy consumption of the network to be significantly high because, although the number of repeated re-transmissions is significantly cut down, the transceiver's energy consumption as modelled by Eq. (2) remains quite high. This model is based on the fact that the NB-IoT receiving node has three operation states: it is either in synchronization mode, in active mode (handling of the received packets), or in idle mode. Most of its energy consumption normally occurs in its active state, a medium amount is consumed in its synchronization state, and the lowest amount is consumed in its idle state [11].

E_{Rx} = K P_{Rx} t_{synch} + \sum_{k=1}^{K} P_{sleep} t^{k}_{sleep} + P_{Idle} t^{K}_{active} \quad (2)

where:
– t_{synch} is the synchronization time,
– K is the number of cycles or iterations involved in the synchronization process for the connection to be effectively established,
– t^{k}_{sleep} is the time spent in the sleep state in a reception cycle k,
– t^{K}_{active} is the total active time during all the K cycles and
– P_{Rx}, P_{sleep} and P_{Idle} are the power values for the receiving, sleeping and idle states, respectively.

On the other hand, according to 3GPP Release 13, repeating transmission data or control signals has been selected as a promising approach to enhance the coverage of NB-IoT systems, since a higher repetition number enhances the transmission reliability, but it results in quite significant spectral efficiency loss [12]. Thus, the present work proposes a 2-dimensional channel coding and link adaptation scheme capable of providing a trade-off between the transmission reliability and the throughput of the system by selecting a suitable MCS on one hand and the appropriate transmission repetition number on the other hand. This work is motivated by the observation that most existing link adaptation schemes found in the literature are solely focused on the selection of a suitable MCS value without consideration of the repetition number, which, as demonstrated in the paragraphs above, is a key player towards addressing the energy efficiency of the NB-IoT network.
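For completeness, the receiver-side model of Eq. (2) can be sketched in a few lines of Python. The sketch follows the equation exactly as printed above (synchronization, sleep and active contributions over the K cycles); the power and timing values in the example call are illustrative assumptions only.

```python
def reception_energy(p_rx_w, p_sleep_w, p_idle_w, t_synch_s, t_sleep_s, t_active_total_s):
    """Eq. (2): energy consumed by an NB-IoT receiving node over K cycles.
    t_sleep_s is a list holding the sleep time of each cycle k = 1..K."""
    k_cycles = len(t_sleep_s)
    e_synch = k_cycles * p_rx_w * t_synch_s            # synchronization in every cycle
    e_sleep = sum(p_sleep_w * t_k for t_k in t_sleep_s)
    e_active = p_idle_w * t_active_total_s             # total active time over the K cycles
    return e_synch + e_sleep + e_active

# Illustrative (assumed) values: 3 cycles, 80 mW Rx, 1 mW sleep, 10 mW idle.
print(reception_energy(0.08, 0.001, 0.01, 0.05, [0.5, 0.4, 0.6], 0.2), "J")
```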

2 Methods/Experimental Approach

In this paper, we focus on analyzing key technologies in uplink scheduling and designing an uplink link adaptation scheme for NB-IoT systems. The remainder


of the article is organized as follows. First, in Sect. 3, the article provides a comprehensive literature survey that concisely and critically discusses the different existing channel coding and MCS selection approaches found in the literature. Then, in Sect. 4, the proposed algorithm is presented, modelled and discussed. This is followed, in Sect. 5, by a detailed description of the methods used for the validation of the proposed approach, where a comprehensive table of key simulation parameters is provided and the different design considerations and assumptions are presented and justified. The obtained results are also presented, critically analysed and discussed in Sect. 5: we present various simulation results obtained using MATLAB to demonstrate the effectiveness of the proposed scheme and, furthermore, the existing approaches are compared under the same settings to show the superiority of the proposed EEACC scheme compared to the traditional repetition-based as well as the NBLA schemes. Finally, in Sect. 6, conclusions are drawn based on the research objectives and a recommendation for future work is formulated.

3 Related Work

Low-Power Wide-Area Network (LPWAN) technologies, both in the licensed and in the unlicensed bands, such as NB-IoT, Long Range (LoRa) and many others, strive to be energy efficient over very long distances [13]. NB-IoT is designed for long-life devices and targets a battery lifetime of more than 10 years. To this end, the careful design of smart channel coding schemes has been identified as a potential approach towards enhancing NB-IoT energy efficiency. Channel coding is one of the most important aspects of digital communication systems and can be considered the main difference between analog and digital systems, making error detection and correction possible [14]. In its current form, NB-IoT reuses the Long Term Evolution (LTE) design extensively, including the numerologies, downlink orthogonal frequency-division multiple-access (OFDMA), uplink single-carrier frequency-division multiple-access (SC-FDMA), channel coding, rate matching, interleaving, etc. To the best of our knowledge, the only reason for this extensive reuse of the LTE channel coding was to significantly reduce the time required to develop the full NB-IoT specifications [15]. However, there are issues very specific to NB-IoT network designs, among which is the limited energy capacity issue. Therefore, researchers [16,17] have identified a crucial need to develop novel channel coding techniques specific to NB-IoT with different design goals. Our research work, as introduced by the present article, has identified the energy efficiency issue as a potential research problem. However, other researchers have previously looked into this problem from different perspectives, and their approaches are concisely reported in the next paragraphs.

3.1 Why is Energy Efficient Channel Coding Important for NB-IoT?

One of the most important issues in the design of NB-IoT systems is error correction because, if well designed, the channel coding technique for NB-IoT can help save a considerable amount of energy by significantly reducing the number of required re-transmissions. This justifies the fact that a good number of research works have proposed channel coding techniques with the objective of achieving energy efficiency. In its current form, the NB-IoT uplink baseband processing can be divided into channel coding and modulation. In the case of the NB-IoT uplink, the channel coding includes Cyclic Redundancy Check (CRC) generation and attachment, turbo or convolutional coding, and rate matching. Reliability is a key performance criterion in any form of wireless communication, including NB-IoT. It is therefore very crucial that any NB-IoT design ensures that end-to-end communication happens reliably between terminals. As in most engineering designs, the effort to enhance one performance aspect often comes with an associated cost in terms of another. Channel coding is used in most communication systems to ensure resilience to channel impairments and reliable link transmissions. Therefore, research works before the present one have proposed a couple of channel coding approaches to ensure that reliability. The main identified approaches have been the traditional repetition approach and the Narrowband Link Adaptation (NBLA) approach [9]. Despite their aim of enhancing communication reliability, the latter have been identified to suffer significant energy costs, and that is what the present research work addresses. It strives to find a balance between ensuring reliable communication on NB-IoT uplinks and maintaining energy efficiency.

3.2 Existing NB-IoT Energy Efficient Channel Coding (CC) Approaches

From the survey of the literature, the following main approaches have been selected as the most relevant and recent works.

1. Automatic Repeat Request (ARQ) Approaches. In this category of approaches, the receiver requests re-transmission of data packets, if errors are detected, using some error detection mechanism. Sami Tabbane in [16] proposes an open-loop forward error correction technique for NB-IoT networks with the objective of optimizing ARQ signaling. In this approach, signaling only needs to indicate the DL data transfer completion and does not have to be specific about which particular Packet Data Units (PDU) are lost during the transmission. This reduces the complexity of the channel coding approach and, therefore, allows saving on computational energy consumption. This approach has been demonstrated to be efficient in enhancing the data rate performance in the Downlink (DL) of the NB-IoT network. Due to its low complexity level, this method has further proven to also enhance the energy efficiency of the NB-IoT network as it considerably reduces the computational energy consumption, due to data reception, on the transceiver of the IoT node. One approach proposed


in [17] consists of using a hybrid automatic repeat request (HARQ) process in scenarios where the NB-IoT network can only support half-duplex operations. The HARQ approach has been demonstrated to be capable of reducing the processing time at the IoT node. The results obtained in [17] prove that the use of the HARQ approach can lead to savings of up to 20% on the overall energy consumption of the network. This energy efficiency performance has also been proven not to be significantly affected by an increase in the scalability of the NB-IoT network when the HARQ approach is used. The authors in [18] propose a hybrid channel coding approach. It consists of signaling hybrid automatic repeat request (HARQ) acknowledgements for the narrowband physical downlink shared channel (NPDSCH) and uses a repetition code for error correction. In this case, the User Equipment (UE) can be allocated 12, 6 or 3 tones. However, the 6 and 3 tone formats are introduced for NB-IoT devices that, due to coverage limitations, cannot benefit from the higher device bandwidth allocation, and they result in higher energy consumption.

2. Forward Error Correction (FEC) Approaches. Recent research ([19–21]), published from 2011 and the first quarter of 2012 to date, has investigated the efficiency of different re-transmission and FEC techniques in NB-IoT systems. Several researchers have quantified the effect of a number of network parameters on the efficiency of error correction techniques (and their associated network costs). However, no effort has yet been made to unify these studies into a systematic approach that could help with the selection of the most effective technique given certain network conditions. The authors in [22] have proposed an improved error correction algorithm for multicast over the LTE network and, by extension, over the Narrowband IoT network. The used model assumes a random distribution of packet losses and a constant loss rate in each scenario. The model can be expanded to include different error distributions and varying loss conditions during a series of NB-IoT downlink transmissions. The obtained results demonstrate that the use of a hybrid approach (HARQ and FEC combined) outperforms both the HARQ method used alone and the FEC approach used alone in terms of energy efficiency. A research study [23] has proposed the use of an open-loop forward error correction technique as a mechanism to not only enhance the energy efficiency of the NB-IoT network, but also to concurrently achieve efficient downlink data rate performance. The benefit of this approach lies in the fact that it enables extremely reliable firmware downloads, which is an important IoT feature in a number of applications, among which are sensor network applications. Another Forward Error Correction channel coding approach for Narrowband IoT, as proposed by [25], has been specifically designed to reduce the number of re-transmission attempts. This is mainly because it has been identified and demonstrated by [28], [29] and [30] that, in the Internet of Things and Wireless Sensor Networks (WSNs), most energy is consumed during the transmission and the reception phases.


Another quite unique Forward Error Correction approach, based on algebraic-geometric theory, compares the BER performance of algebraic-geometric codes and Reed-Solomon codes for different modulation schemes over additive white Gaussian noise [26]. In this approach, it is found that there is a gain in terms of BER performance improvement, at the cost of high system complexity, when algebraic-geometric codes and Chase-Pyndiah's algorithm [27] are used in conjunction. Two main channel coding and uplink link adaptation schemes have been found in our literature survey. The first one is what we would term the MCS-dominated approach, in which we first adjust the MCS level based on feedback signals and then adjust the repetition number. The second one is the repetition-dominated approach, in which the focus is first on determining the appropriate repetition number based on predefined NB-IoT network design criteria and only then on selecting the MCS level, using the currently determined repetition number as part of the decision criteria. Apart from these two dominating approaches, there exist other approaches in the literature, such as the cooperative approaches [24], in which the impact of uplink interference on resource (energy, spectral efficiency, etc.) utilization efficiency is investigated prior to making transmission decisions, by exploiting cooperation among base stations, which needs to have been designed beforehand.

3.3 Efficient Selection of Modulation Coding Scheme (MCS)

The design of an energy efficient channel coding scheme for NB-IoT is directly linked to the selection of an appropriate MCS. In order to achieve long-range communication, some works on efficient NB-IoT designs found in the literature [9,31,35,36] have proposed efficient techniques for modulation scheme selection. The common idea behind most proposed approaches consists of trading off a high data rate for more energy in each transmitted bit (or symbol) at the physical layer (PHY). This design technique allows for a signal that is more immune to noise and that can travel longer transmission distances. Therefore, in general, the identified aim of most designs is to achieve a link budget of 150 ± 10 dB, which can translate into a few kilometers and tens of kilometers in urban and rural areas, respectively [9]. Encoding more energy into the signal's bits (or symbols) results in very high decoding reliability on the receiver side. Typical receiver sensitivities can, therefore, be as low as −130 dBm. Modulation techniques used for most LPWAN technologies can be classified into two main categories, namely the narrowband technique and the spread spectrum technique. Spread spectrum techniques spread a narrowband signal over a wider frequency band but with the same power density. The actual transmission is a noise-like signal that is harder to detect by an eavesdropper, more resilient to interference, and robust to jamming attacks (secure) [37]. As opposed to other LPWAN technologies such as LTE Cat-M1, which mainly use spread spectrum modulation techniques, most works on NB-IoT designs found in the literature [9,37] and [35] propose to use narrowband modulation techniques. In general, narrowband modulation techniques provide a high


link budget by confining each signal to a narrow band, often less than 25 kHz wide. They are very efficient at sharing the frequency spectrum between multiple links, and only a very small noise level is experienced within each individual narrow band. In order to further reduce the experienced noise, some LPWAN technologies, such as SIGFOX, WEIGHTLESS-N and TELENSA, use an ultra narrow band (UNB) of width as short as 100 Hz [35]. They are, therefore, capable of achieving longer transmission ranges. One of the major differences between narrowband modulation techniques and spread spectrum techniques remains that spread spectrum techniques often require more processing gain on the receiver side to decode the received signal (below the noise floor), while no processing gain through frequency despreading is required to decode the signal at the receiver in the case of narrowband modulation techniques, resulting in simpler and less expensive transceiver designs. Different variants of spread spectrum techniques, such as Chirp Spread Spectrum (CSS) and Direct Sequence Spread Spectrum (DSSS), are used by existing standardized LPWA technologies.

3.4 Repetition-Dominated Channel Coding Approaches

In the repetition-dominated method, based on the feedback ACK/NACKs, we first adjust the repetition number and then update the MCS level. Repetition is the key solution adopted by most NB-IoT designs with the objective of achieving enhanced coverage with low complexity. On the other hand, for one complete transmission, repetition is required to be applied to both the data transmission and the associated control signaling transmission. Therefore, in most NB-IoT systems, before each Narrowband Physical Uplink Shared Channel (NPUSCH) transmission, the corresponding control signal data, which include the Resource Unit (RU) number, the chosen MCS and the repetition numbers, are required to be sent via the Narrowband Physical Downlink Control Channel (NPDCCH) [9]. The sequence of transmission with repetition during a single transmission is illustrated in Fig. 1.

(Timeline over frequency and time: NPDCCH repetitions of 1 ms each, separated by at least max{3 ms, time until the beginning of the next search space}, with a periodicity of {8, 16, 32, 64} ms as indicated in the DCI.)

Fig. 1. Illustration of data repetition during a single transmission

Figure 1 illustrates repetition in NB-IoT, where both the NPDCCH and the NPUSCH transmission blocks, carrying the same content, are repeated four times within the duration of a single transmission. It is also very important to point out that, according to


the 3GPP TS 36.211 standard [32], the repetition number for the same block in NB-IoT can only be selected from 1, 2, 4, 8, 16, 32, 64 or 128.

3.5 The NBLA and Its Open-Loop Power Control Approach

The NB-IoT Link Adaptation (NBLA) approach, as proposed by [9], is mainly an inner-loop link adaptation scheme focused on addressing the rapid changes that are often observed in the transmission Block Error Ratio (BLER) of NB-IoT systems. The NBLA approach works in the following manner: during a single period T, all transmission ACK/NACKs are counted in order to work out an estimated value of the BLER. Based on the obtained BLER value, the transmission repetition number is adjusted accordingly in order to cope with the variation in the channel's condition, reduce the probability of failed transmissions, and in this way save on the energy consumption of the NB-IoT network. The main challenge faced by the NBLA approach resides in the fact that, despite its effort in estimating the channel condition based on the computed BLER value, the NBLA power control strategy often fails to ensure reliable uplink transmissions because of the very narrow bandwidth and the considerably unstable channel conditions of NB-IoT systems. This results in significant energy wastage due to repeated unsuccessful transmissions. A new and more energy efficient approach is therefore highly needed, one capable of adaptively addressing the variations of channel conditions by looking not only at the previous BLER performance but also at the MCS level when selecting an appropriate repetition number. This is expected to contribute towards adequate energy management within NB-IoT systems.

3.6 NBLA Open-Loop Power Control

In the uplink scenario, the NB-IoT network normally only supports open-loop power control, as stated in [32]. The reason for this open-loop power control exclusivity is mainly the limited energy and processing capacity of most NB-IoT nodes, such as sensor nodes, which most of the time run on batteries. The manner in which this open-loop power control is implemented within the NB-IoT nodes is that, based on the MCS and RU information alone, the NB-IoT node works out an estimate of the power required to achieve an uplink transmission. This means that the Base Station (the eNB in the case of LTE) does not send any form of power control command (information) to the NB-IoT node prior to the uplink transmission. According to research studies carried out by [32] and [34], the transmit power P_{NPUSCH,c}(i) required by a NB-IoT node within a Narrowband Physical Uplink Shared Channel (NPUSCH) during an uplink session in a given uplink slot i for a serving cell c, given that the number of repetitions of the allocated NPUSCH RUs is less than 2, can be modelled as follows:

P_{NPUSCH,c}(i) = \min\{P_{CMAX,c}(i),\; 10\log_{10}(M_{NPUSCH,c}(i)) + P_{O\_NPUSCH,c}(j) + \alpha_c(j) \cdot PL_c\} \quad (3)


where:
– P_{CMAX,c}(i) is the configured NB-IoT node uplink transmit power in slot i for serving cell c,
– M_{NPUSCH,c}(i) takes values in {1/4, 1, 3, 6, 12} as defined in [34],
– P_{O\_NPUSCH,c}(j) is a parameter composed of the sum of two components provided by the higher layers within the NPUSCH data re-transmission channel model,
– PL_c is the downlink path loss estimate calculated in the UE for serving cell c and
– \alpha_c(j) is a coefficient configured by higher layers based on the estimated total loss over the link.

Should the number of repetitions of the allocated NPUSCH RUs be higher than or equal to 2, then P_{NPUSCH,c}(i) can simply be modelled as

P_{NPUSCH,c}(i) = P_{CMAX,c}(i) \quad (4)
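The open-loop rule of Eqs. (3) and (4) can be sketched in Python as below. The function name and the example parameter values are assumptions made for illustration; the sketch only mirrors the arithmetic of the two equations and is not an implementation of the 3GPP specification.

```python
import math

def npusch_tx_power_dbm(p_cmax_dbm, m_npusch, p_o_npusch_dbm, alpha, path_loss_db,
                        repetitions):
    """Open-loop NPUSCH power estimate following Eqs. (3)-(4).

    With 2 or more repetitions the node simply transmits at its configured
    maximum power (Eq. (4)); otherwise the bandwidth-, target- and
    path-loss-dependent estimate of Eq. (3) is used, capped at P_CMAX."""
    if repetitions >= 2:
        return p_cmax_dbm                                  # Eq. (4)
    estimate = (10 * math.log10(m_npusch)                  # bandwidth term
                + p_o_npusch_dbm                           # higher-layer target
                + alpha * path_loss_db)                    # fractional path-loss compensation
    return min(p_cmax_dbm, estimate)                       # Eq. (3)

# Illustrative (assumed) call: 12 subcarriers, -100 dBm target, alpha = 0.8, 140 dB path loss.
print(npusch_tx_power_dbm(23.0, 12, -100.0, 0.8, 140.0, repetitions=1))
```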

4 The Proposed Adaptive Channel Coding Technique

The objective of the EEACC approach is to design an appropriate link adaptation scheme, integrated with a proper selection mechanism for the repetition number and the MCS for NB-IoT systems, based on the aim of achieving higher energy efficiency, long transmission distance and high throughput while maintaining good transmission reliability. The channel coding approach proposed by the present study is a 2-dimensional (2D) link adaptation approach, which translates into a bi-objective optimization problem aiming at enhancing the NB-IoT network coverage without compromising on its energy efficiency performance. The proposed adaptive channel coding technique is twofold: it is composed of an inner-loop and an outer-loop adaptation scheme, both aimed at enhancing the energy efficiency as well as the throughput of the network. In particular, the inner-loop adaptation scheme is designed based on the channel conditions to guarantee transmission reliability and, consequently, enhance the data rate of the network. The outer-loop scheme, on the other hand, is designed based on the Modulation Coding Scheme (MCS) number and the transmission repetition number. Due to the fact that the channel conditions of NB-IoT systems are known to be quite unstable [10], as their transmission block error ratio (BLER) rapidly changes, the present approach introduces an inner-loop link adaptation procedure which focuses on dynamically varying the transmission repetition number based on a periodically sampled and estimated channel condition quantified by means of its BLER performance. The current BLER performance is each time used to predict the channel conditions for the next transmission based on the sequential channel estimation in the presence of random phase noise in NB-IoT systems proposed by [8]. In this channel estimation model, the coherence time of the fading channel is assumed to be fairly long due to the assumed low mobility of NB-IoT user equipments (UEs). Therefore, phase


noise is compensated before combining the channel estimates over repetitions, as a mechanism to improve the accuracy of the approach. With phase noise \phi_l[n] caused by oscillator fluctuations and a residual frequency offset f_e normalized by the sub-carrier frequency, the time-domain baseband received signal at the n-th sampling time of the l-th orthogonal-frequency-division-multiplexing (OFDM) symbol can be expressed as

s_{l(Estimated)}[n] = e^{j\phi_l[n]} \left( \frac{1}{\sqrt{N}} \sum_{k=-N/2}^{(N/2)-1} S_l[k]\, e^{\frac{2\pi j n (f_e + k)}{N}} \right) * h_l[n] + w(n) \quad (5)

where "∗" denotes the linear convolution, S_l[k] is the transmit symbol on the l-th OFDM symbol and the k-th sub-carrier, h_l[n] represents the discrete fading channel taps, w(n) is additive white Gaussian noise (AWGN), and N is the Fast Fourier Transform (FFT) size. The purpose of the inner-loop link adaptation is to keep the transmission BLER at the target value. Accordingly, we refer to the MCS level selection and repetition number determination as the outer-loop link adaptation. The proposed link adaptation method is presented as follows.

4.1 The Inner Loop Approach

As discussed in a previous section, the inner-loop link adaptation approach is designed to cope with rapid transmission BLER fluctuations. The proposed inner-loop approach works as follows:

– In one period T, all transmissions with both positive and negative acknowledgments (ACKs and NACKs, respectively) are counted to work out the average BLER for that period, which is then recorded. Specifically, the appropriate evaluation period T for LTE systems is chosen to be in the order of tens of milliseconds, while hundreds of milliseconds are used for NB-IoT systems. This selection is mainly motivated by the realistic expected traffic rate on LTE systems, which is normally higher than that of the NB-IoT counterpart.
– At the end of the considered period, the obtained BLER value is passed to the outer loop and used as a parameter in the selection of the appropriate transmission repetition number, as presented in the algorithm of Table 1.
– If the current BLER (the one of the present period) is found to be less than 7%, the repetition number for the next transmission should be decreased, because it means that the channel conditions are good and, therefore, the probability of successful transmission is high since there are fewer channel impairments.
– On the other hand, if the BLER is found to be between 7% and 13%, the channel is considered to be in a medium condition. Our proposed link adaptation approach maintains the repetition number for the next period.


– Finally, if the current BLER is greater than 13%, the channel is considered to be in bad condition and, therefore, the probability of successful transmission is reduced. This requires that the number of transmission repetitions be increased in order to guarantee a certain level of transmission reliability, as sketched below.
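A minimal Python sketch of this inner-loop rule is given below. The 7%/13% thresholds and the halving/adding-one/doubling of N follow the list above; bounding the result to the Nmin/Nmax limits used later in Table 1 is an assumption made for this sketch, and a real NB-IoT scheduler would additionally restrict N to the 3GPP set {1, 2, 4, ..., 128} mentioned in Sect. 3.4.

```python
def update_repetition_number(current_n, bler, n_min=2, n_max=10):
    """Inner-loop rule: adapt the repetition number from the BLER measured over
    the last period T (good < 7%, medium 7-13%, bad > 13%)."""
    if bler < 0.07:
        candidate = max(1, current_n // 2)   # good channel: halve the repetitions
    elif bler <= 0.13:
        candidate = current_n + 1            # medium channel: keep/slightly increase
    else:
        candidate = current_n * 2            # bad channel: double the repetitions
    return max(n_min, min(n_max, candidate))

print(update_repetition_number(8, 0.20))     # bad channel -> clamped to n_max = 10
```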

4.2 The Outer Loop Link Adaptation Approach

The outer-loop link adaptation approach consists of the MCS level selection, which is performed as follows:

– If a certain number of ACKs are successively successfully decoded at the NB-IoT receiver, then the MCS level is increased.
– On the other hand, when a certain number of NACKs are successively decoded at the NB-IoT receiver, then the MCS level is decreased.

Generally, the required number of ACKs is larger than that of NACKs, to ensure a slow increase of the MCS level with ACK feedback and a quick decrease of the MCS level with NACK feedback. Because of the narrow band and low data rate of NB-IoT systems, the settings for LTE systems might no longer be applicable. Therefore, two aperiodic and event-based actions are defined for the EEACC approach, namely the fast upgrade (FUG) and the emergency downgrade (EDG). In the event of a FUG, the MCS is increased by one, while it is decreased by one in the event of an EDG. Thus, the EEACC approach introduces a compensation factor \Delta C(t), modelled as

\Delta C(t) =
\begin{cases}
\min\{\Delta C(t-1) + C_{stepup},\ \Delta C_{max}\}, & \text{if HARQ feedback} = \text{ACK}; \\
\max\{\Delta C(t-1) - C_{stepdown},\ \Delta C_{min}\}, & \text{if HARQ feedback} = \text{NACK}; \\
\Delta C(t-1), & \text{if HARQ feedback} = \text{N/A};
\end{cases} \quad (6)

where:
– \Delta C_{max} and \Delta C_{min} are the upper and lower limits of the compensation factor \Delta C(t),
– C_{stepup} and C_{stepdown} are the incremental compensation step sizes, modelled as per the formula

C_{stepdown} = C_{stepup} \cdot \frac{1 - BLER_{target}}{BLER_{target}} \quad (7)

– N/A means discontinuous transmission (DTX), which practically happens when the eNB does not detect any NPUSCH signal.
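The compensation-factor update of Eqs. (6) and (7) can be sketched in Python as follows. The function names, and the way the MCS level is stepped and the compensation reset once ΔC saturates, are illustrative assumptions that follow the FUG/EDG description above rather than a definitive implementation.

```python
def step_sizes(bler_target=0.10, c_step_up=0.2):
    """Eq. (7): derive the down-step from the up-step and the BLER target."""
    return c_step_up, c_step_up * (1.0 - bler_target) / bler_target

def update_compensation(delta_c, feedback, c_step_up, c_step_down,
                        delta_c_max=5.0, delta_c_min=-5.0):
    """Eq. (6): compensation factor update on ACK, NACK or DTX (N/A)."""
    if feedback == "ACK":
        return min(delta_c + c_step_up, delta_c_max)
    if feedback == "NACK":
        return max(delta_c - c_step_down, delta_c_min)
    return delta_c                        # DTX / no feedback: keep the value

def update_mcs(mcs, delta_c, l_min=4, l_max=12, delta_c_max=5.0, delta_c_min=-5.0):
    """Fast upgrade / emergency downgrade once the compensation saturates
    (resetting delta_c after a step is an assumption for illustration)."""
    if delta_c >= delta_c_max and mcs < l_max:
        return mcs + 1, 0.0               # FUG: step the MCS level up
    if delta_c <= delta_c_min and mcs > l_min:
        return mcs - 1, 0.0               # EDG: step the MCS level down
    return mcs, delta_c

up, down = step_sizes()                   # 0.2 up, 1.8 down for a 10% BLER target
```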


A number of simulation parameter values are selected to be used within the simulation of the proposed algorithm. The choice of these parameter values is each time motivated by the objective of making the simulation process as realistic as possible. These parameter values include:

– A targeted Block Error Rate (BLER) value of 10%: the choice of this targeted BLER value is guided by 3GPP Release 13 [32]. This targeted BLER value is considered to be the normal out-of-sync error rate condition for LTE/4G technology during Radio Link Monitoring (RLM) [33].
– The 7% and 13% thresholds are typical values for classifying the channel state of a NB-IoT system as bad, medium or good. This choice is directly linked to the objective of maintaining a ±3% margin around the targeted 10% BLER as per the 3GPP standard specifications [32].
– The C_{stepup} and C_{stepdown} values are the incremental compensation step sizes. As in any iterative process, there is a need to choose a reasonable initial step size. The authors have chosen an initial C_{stepup} value and not a C_{stepdown} one because the C_{stepdown} value is modelled as dependent on C_{stepup}.
– The initial value of 0.2 for the C_{stepup} parameter is used to ensure that it is only under an initial 20% probability of communication error that the MCS level is stepped down. This is because 20% is the maximum 3GPP standard value for the BLER when using LTE/4G technology.

The proposed algorithm is summarized in Table 1. The inner-loop part is covered by lines 2–8 of the pseudo-code presented in Table 1. In this part, the Block Error Rate (BLER) of the encoded blocks is checked every predefined period T against the threshold values of 7% and 13%, which are the typical values for classifying the channel state of a NB-IoT system as good, medium or bad. Depending on the observed channel condition type (good, medium or bad), the repetition number N is respectively reduced to half, increased by one, or doubled. On the other hand, the pseudo-code for the outer-loop part is presented in lines 11–52 of Table 1. This part consists of three scenarios, namely:

– When the Modulation Coding Scheme (MCS) value is between the two predefined threshold values (L_{min} < L < L_{max}). In this situation, the compensation value \Delta C is updated towards its lowest or highest value depending on whether the transmitter receives a positive feedback (ACK) or a negative feedback (NACK), and the MCS level is increased by one or reduced by one accordingly.
– When the MCS reaches the minimum value (L_{min}), then, based on the type of feedback received by the transmitter (ACK or NACK) and the position of the current repetition number N with respect to the predefined bounds N_{min} and N_{max}, a new MCS level is defined as an increase by one or maintained, while the transmission repetition number N is reduced by half or doubled accordingly.


Table 1. The link adaptation pseudo-code

Algorithm: Proposed Uplink Link Adaptation Algorithm for NB-IoT Systems
1: Initialization: BLER_target = 10%, C_stepup = 0.2, C_stepdown = C_stepup × (1 − BLER_target)/BLER_target, ΔC = 0, ΔC_max = +5, ΔC_min = −5, MCS level L and its bounds L_max, L_min, repetition number N and its bounds N_max, N_min. We empirically initialize the MCS level and repetition number based on the channel condition.
2: if period T expired then
3:   if BLER < 7% then
4:     N = N/2
5:   else if 7% ≤ BLER ≤ 13% then
6:     N = N + 1
7:   else if BLER > 13% then
8:     N = 2N
9:   end if
10: end if
11: if L > L_min and L < L_max then
12:   if HARQ feedback = ACK then
13:     ΔC = min{ΔC + C_stepup, ΔC_max}
14:   else if HARQ feedback = NACK then
15:     ΔC = max{ΔC − C_stepdown, ΔC_min}
16:   end if
17:   if ΔC = ΔC_max then
18:     L = L + 1
19:   if ΔC = ΔC_min then
20:     L = L − 1
21:   end if
22: else if L = L_min then
23:   if HARQ feedback = ACK then
24:     if N = N_min then
25:       ΔC = min{ΔC + C_stepup, ΔC_max}
26:       if ΔC = ΔC_max then
27:         L = L + 1
28:       end if
29:     if N > N_min then
30:       N = N/2
31:     end if
32:   else if HARQ feedback = NACK then
33:     if N = N_max then
34:       the current channel condition is very bad
35:     else if N < N_max then
36:       N = 2N
37:     end if
38:   end if
39: else if L = L_max then
40:   if HARQ feedback = ACK then
41:     if N = N_min then
42:       the current channel condition is very good
43:     else if N > N_min then
44:       N = N/2
45:     end if
46:   else if HARQ feedback = NACK then
47:     ΔC = max{ΔC − C_stepdown, ΔC_min}
48:     if ΔC = ΔC_min then
49:       L = L − 1
50:     end if
51:   end if
52: end if

– Similarly, when the MCS value reaches the maximum value L_{max}, then, once again, based on the type of feedback received by the transmitter (ACK or NACK) and the position of the current repetition number N with respect to the minimum predefined value N_{min}, a new MCS level is defined as a decrease by one or maintained accordingly.

It is important to note that the validity of our proposed EEACC approach is solidly based on the accuracy of our channel condition assessment. Some key characteristics of this channel condition assessment have been empirically defined and standardized under 3GPP Release 13. The parameter \Delta C(t), as used within our proposed approach, plays the role of a channel characteristic compensation value; in other terms, it expresses how much channel noise and interference needs to be suppressed from the currently assessed channel conditions in order to move to the classification as bad, medium or good condition as defined in the standard. The C_{stepdown} and C_{stepup} parameters are, therefore, the exact channel coding scheme decrease or increase needed to achieve the \Delta C(t) channel compensation.

5 Performance Evaluation

Evaluation Setup. In order to assess and validate the performance of the proposed EEACC approach, simulations are carried out under the NB-IoT network conditions summarized in the set-up parameters of Table 2. Furthermore, it is important to note that a 4:1 Multiple Input Multiple Output (MIMO) configuration with Alamouti decoding, as well as an eNB antenna design, is considered in these simulations. Because the study is carried out in the context of a Narrowband Internet of Things (NB-IoT) network deployed on an LTE cellular network, the Quadrature Phase Shift Keying (QPSK) LTE modulation scheme is considered for our study. Normally, eNodeBs in an LTE network are built to support the QPSK, 16-QAM and 64-QAM modulation techniques in the downlink direction. But considering that the end nodes (NB-IoT motes) are often computation-limited

devices, most of the time micro-controller based (sensor nodes, smart water meters, etc.), the choice of the modulation scheme is often oriented towards QPSK.

Obtained Results and Discussion. The obtained results are as follows. First, the BLER performance of our proposed adaptive channel coding approach (EEACC) is assessed against the NB-IoT Link Adaptation (NBLA) scheme as well as the traditional repetition scheme as the transmission power is varied. The obtained results are illustrated in Fig. 2. For a fair performance evaluation and comparison between the three considered schemes, namely the NBLA, the traditional repetition approach and our proposed EEACC approach, all three systems use the same block size. As per the 3GPP TS 36.213 standard [32], a NB-IoT device can select a downlink Transport Block Size (TBS) on the MAC layer from 2 bytes (16 bits) up to 85 bytes (680 bits). For our simulation, we use an average block size of 44 bytes.

Table 2. Key simulation parameters

System bandwidth: 200 kHz
Carrier frequency: 900 MHz
Subcarrier spacing: 15 kHz
Channel estimation for NPDCCH: Sequential channel estimation in the presence of random phase noise [8]
Interference rejection combiner: MRC
Number of Tx antennas: 1
Number of receive antennas: 2
Frequency offset: 200 Hz
Lmin: 4
Lmax: 12
Nmin: 2
Nmax: 10
Time offset period: 2.5 us
Network deployment model: Mesh network (Meshnet)
Channel model: Additive white Gaussian noise (AWGN)
Nodes mobility: Static nodes (no mobility considered)
Fading: No fading channel considered
LTE modulation scheme: Quadrature phase shift keying (QPSK)
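For readers who wish to reproduce a comparable link-level study, the parameters of Table 2 can be collected in a small configuration object. The sketch below is only an illustration of how such a setup might be encoded in Python (the authors' simulations were run in MATLAB); every field simply restates a value from Table 2, and the class name is an assumption.

```python
from dataclasses import dataclass

@dataclass
class EEACCSimulationConfig:
    """Link-level simulation setup mirroring Table 2."""
    system_bandwidth_hz: float = 200e3
    carrier_frequency_hz: float = 900e6
    subcarrier_spacing_hz: float = 15e3
    channel_estimation: str = "Sequential estimation under random phase noise [8]"
    combiner: str = "MRC"
    num_tx_antennas: int = 1
    num_rx_antennas: int = 2
    frequency_offset_hz: float = 200.0
    mcs_bounds: tuple = (4, 12)           # (Lmin, Lmax)
    repetition_bounds: tuple = (2, 10)    # (Nmin, Nmax)
    time_offset_period_s: float = 2.5e-6
    deployment: str = "Mesh network"
    channel_model: str = "AWGN, static nodes, no fading"
    modulation: str = "QPSK"

cfg = EEACCSimulationConfig()
print(cfg.mcs_bounds, cfg.repetition_bounds)
```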


(Plot: Comparison of BLER performance as SNR is varied; y-axis: BLER (log scale), x-axis: SNR (dB) from −18 to −4; curves: EEACC, NBLA, Traditional Repetition.)

Fig. 2. BLER performance comparison as power is varied

As can be clearly observed in Fig. 2, the Block Error Rate (BLER) of the traditional repetition approach, the Narrowband Internet of Things Link Adaptation (NBLA) approach and our proposed Energy Efficient Adaptive Channel Coding (EEACC) approach all decrease as the Signal-to-Noise Ratio (SNR) increases. As the slope of the EEACC curve shows, the BLER of the EEACC decreases much faster than that of the others beyond an SNR of −14 dB. This can be explained by the fact that its repeated transmission capability is coupled with its smart MCS selection, both adapted to the channel conditions. The EEACC's transmissions are already quite reliable, to the extent that an increase in the transmission power P_{Tx}, which results in an increase in SNR (assuming that, for a single period T, the channel conditions remain more or less the same), simply reduces even further the number of possible transmission failures. It is also important to observe that, on one hand, although for SNR values from −18 dB to −14 dB all the BLER values are quite similar, the BLER performance of the NBLA as well as the traditional repetition approaches is higher compared to that of the proposed EEACC approach. On the other hand, it can be clearly noticed that the BLER performance of the NBLA and the traditional repetition approaches remain close to each other as the transmission power is increased, while that of the EEACC drops significantly faster, to as low as 0.18% for an SNR of −4 dB. Secondly, the average energy consumption of the NB-IoT network is computed as the number of NB-IoT nodes is varied for each of the three approaches, namely the traditional repetition, the NBLA and the proposed EEACC approach. The obtained results are depicted in Fig. 3.


(Plot: Average energy consumption in a cell variation with scalability; y-axis: Average Energy Consumption (J), x-axis: Number of IoT Nodes from 200 to 1600; curves: EEACC, NBLA, Traditional Repetition.)

Fig. 3. Average energy consumption variation with increased scalability

As can be clearly observed from Fig. 3, the EEACC is more energy efficient than the NBLA and the traditional repetition approaches. Also, the average energy consumption of all three approaches increases as the number of NB-IoT nodes in the network is increased. For a total of 800 NB-IoT nodes within a cell, the proposed EEACC approach consumes 44.01% less energy than the NBLA approach and 49.51% less energy than the traditional repetition approach. Thirdly, the behaviour of the BLER performance as the number of repeated transmissions is increased is also assessed. The aim of this particular performance evaluation is to assess the impact of increasing the transmission repetition on the transmission reliability of the NB-IoT network. Therefore, the transmission power has been held constant at 0.1 W, which translates into a constant SNR = −10 dB,

(Plot: BLER performance with repetition number variation; y-axis: BLER (log scale), x-axis: Repetition Number (N) from 2 to 10; curves: Traditional Repetition, NBLA, EEACC.)

Fig. 4. BLER Performance Behaviour with increased transmission repetition number


assuming that the channel conditions do not change significantly during this particular simulation period. The obtained results are as follows. As can be clearly seen in Fig. 4, when the transmission repetition number is doubled (N = 2), the BLER is almost half of what it was for an SNR of −10 dB in Fig. 2. This means that, by doubling the number of transmission repetitions, the probability of a failed block transmission is almost reduced by half. It can also be clearly noticed that, as the number of transmission repetitions is increased, the BLER is reduced for all three approaches. It is important to notice that the BLER of the traditional repetition approach and that of the NBLA remain close to each other as the number of transmission repetitions is increased, while that of the EEACC is significantly reduced, to as low as 0.18%, as the repetition number N reaches its preset maximum value N_{max} = 10. Lastly, the average transmission delay on a link basis is evaluated as the number of NB-IoT nodes is scaled up within a cell. The obtained results are shown in Fig. 5.

(Plot: Latency performance comparison against scalability; y-axis: Average Propagation Delay (ms), x-axis: Number of IoT nodes from 200 to 1600; curves: NBLA, EEACC, Traditional Repetition.)

Fig. 5. Average transmission delay with increased scalability

As can be noticed in Fig. 5, despite its increased computational intelligence compared to the NBLA and the traditional repetition approaches, the EEACC still achieves lower latency on a link basis. This is mainly due to the fact that, thanks to the advancement of today's processors, the computational energy and time are not significant compared to the propagation delay in a wireless network of object nodes such as sensor nodes. Once again, it can be noticed that the latency performance of the NBLA and the traditional repetition approaches are quite close, although the NBLA still performs with lower latency than the traditional repetition approach.

6 Conclusion

This research work consists of the development of a novel 2-D adaptive channel coding and link adaptation technique aimed at minimizing the energy consumption of the NB-IoT system. The obtained results have demonstrated that the proposed EEACC approach outperforms the traditional repetition approach and exhibits better performance in terms of energy, scalability, latency and reliability than the existing improved version of the traditional repetition approach, the NBLA.

Acknowledgment. This work is supported in part by the National Research Foundation of South Africa (Grant Number: 90604). Opinions, findings and conclusions or recommendations expressed in any publication generated by the NRF supported research are those of the author(s) alone, and the NRF accepts no liability whatsoever in this regard. The authors would like to also thank the Telkom Centre of Excellence (CoE) for their support.

References

1. Atzori, L., Iera, A., Morabito, G.: The internet of things: a survey. Comput. Netw. 54(15), 2787–2805 (2010)
2. Miao, Y., Li, W., Tian, D., Hossain, M., Alhamid, M.: Narrow band internet of things: simulation and modelling. IEEE Internet Things J. 5, 2304–2314 (2017)
3. Li, S., Xu, L.D., Zhao, S.: The internet of things: a survey. Inf. Syst. Frontiers 17(2), 243–259 (2015)
4. Xu, L.D., He, W., Li, S.: Internet of things in industries: a survey. IEEE Trans. Ind. Inf. 10(4), 2233–2243 (2014)
5. Introduction of NB-IoT in 36.331, 3GPP RP-161248, 3GPP TSG-RAN Meeting 72, Ericsson, Nokia, ZTE, NTT DOCOMO Inc., Busan, South Korea, June 2016
6. Sharif, A., Vidyasagar, V., Ahmad, R.F.: Adaptive channel coding and modulation scheme selection for achieving high throughput in wireless networks. In: The Proceedings of the IEEE 24th International Conference on Advanced Information Networking and Applications Workshops (AINAW), Perth, WA, Australia, pp. 200–207, 20–23 April 2010
7. Migabo, M.E., Djouani, K., Kurien, A.M., Olwal, T.O.: A stochastic energy consumption model for wireless sensor networks using GBR techniques. In: The Proceedings of the IEEE African Conference (AFRICON), Addis Ababa, Ethiopia, pp. 1–5, 14–17 September 2015. https://doi.org/10.1109/AFRCON.2015.7331987
8. Rusek, F., Hu, S.: Sequential channel estimation in the presence of random phase noise in NB-IoT systems. In: The Proceedings of the IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), Montreal, QC, Canada, 8–13 October 2017. https://doi.org/10.1109/PIMRC.2017.8292588
9. Yu, C., Yu, L., Wu, Y., He, Y., Lu, Q.: Uplink scheduling and link adaptation for narrowband internet of things systems. IEEE Access 5, 1724–1734 (2017)
10. Pollet, T., Bladel, M., Moeneclaey, M.: BER sensitivity of OFDM systems to carrier frequency offset and Wiener phase noise. IEEE Trans. Commun. 43(2), 191–193 (1995)

Energy Efficient Channel Coding Technique for NB-IoT

465

11. El Soussi, M., Zand, P., Pasveer, F., Dolmans, G.: Evaluating the performance of eMTC and NB-IoT for smart city applications, in Semantic Scholars arXiv:1711.07268v1 [cs.IT] repository, pp. 1–18, November 2017 12. Chakrapani, A.: Efficient resource scheduling for eMTC/NB-IoT communications in LTE Rel. 13. In: The Proceedings of the IEEE Conference on Standards for Communications and Networking (CSCN), Helsinki, Finland, 18–20 September 2017. https://doi.org/10.1109/CSCN.2017.8088600 13. Chen, M., Miao, Y., Hao, Y., Hwang, K.: Narrow band internet of things. IEEE Access 5 (2017). https://doi.org/10.1109/ACCESS.2017.2751586 14. Zarei, S.: LTE: channel coding and link adaptation. In: The Seminar on Selected Chapters of Communications Engineering, pp. 1–14, Erlangen, Germany (2009) 15. Spajic, V.: Narrowband internet of things. J. Mech. Autom. Ident. Tech. (JMAIT) 2(1), 1–6 (2017). Vip Mobile 16. Tabbane, S.: Internet of things: a technical overview of the ecosystem. In: The ¨ Proceedings of the Regional Workshop for Africa on Developing the ICT Ecosystem to Harness Internet-of-Things (IoT) Mauritius, pp. 1–6, 28–30 June 2017 17. Ratasuk, R., Mangalvedhe, N., Kaikkonen, J., Robert, M.: Data channel design and performance for LTE narrowband IoT. In: Proceedings of the IEEE 84th Vehicular Technology Conference (VTC-Fall), pp. 1–5, September 2016 18. Inoue, T., Vye, D.: Simulation speeds NB-IoT product development. Microwave J. China 10(1), 38–44 (2018) 19. Wibowo, F.X.A., Bangun, A.A.P., Kurniawan, A.: Multimedia broadcast multicast service over single frequency network (MBSFN) in LTE based femtocell. In: Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, pp. 1–5 (2011) 20. Alexiou, A., Asimakis, K., Bouras, C., Kokkinos, V., Papazois, A., Tseliou, G.: Reliable multicasting over LTE: a performance study. In: IEEE Symposium on Computers and Communications (ISCC) 2011, pp. 603–608 (2011) 21. Bouras, C., Alexiou, A., Papazois, A.: Adopting forward error correction for multicasting over cellular networks. In: European Wireless Conference (EW) 2010, pp. 361–368 (2010) 22. Cornelius, J.M., Helberg, A.S.J., Hoffman, A.J.: An improved error correction algorithm for multicasting over LTE networks, University of North West thesis (2014) 23. Havinga, P.J.M., Smit, G.J.M.: Energy efficient wireless networking for multimedia applications, in wireless communications and mobile computing. Wirel. Commun. Mob. Comput. 1, 165–184 (2001). https://doi.org/10.1002/wcm.9 24. Li, Q., Wu, Y., Feng, S., Zhang, P., Zhou, Y.: Cooperative uplink link adaptation in 3GPP LTE heterogeneous networks. In: Proceedings of the IEEE Vehicular Technology Conference (VTC Spring), pp. 1–5, June 2013 25. Singh, M.P., Kumar, P.: An efficient forward error correction scheme for wireless sensor network. Procedia Tech. 4, 737–742 (2012). https://doi.org/10.1016/j. protcy.2012.05.120 26. Alzubi, J.A., Alzubi, O.A., Chen, T.M.: Forward error correction based on algebraic-geometric theory. Springer, 2014 edition. ISBN- 978-3319082929 27. Chase, D.: Class of algorithms for decoding block codes with channel measurement information. IEEE Trans. Inform. Theory 18(1), 170–182 (1972) 28. Roca, V., Cunche, M., Lacan, J., Bouabdallah, A., Matsuzono, K.: Reed-Solomon Forward Error Correction (FEC) Schemes for FECFRAME, IETF FECFRAME Working Group, draft-roca-fecframe-rs-00.txt (Work in Progress), March 2009. (ASCII) (HTML)

466

E. Migabo et al.

29. Walther, F.: Energy modelling of MICAz: a low power wireless sensor node. Technical report, University of Kaiserslautern, February 2006. http://www.eit.unkl.de/ wehn/files/reports/micazpowermodel.pdf. Accessed on 31 May 2018 30. Migabo, M.E., Djouani, K., Kurien, A.M., Olwal, T.O.: A stochastic energy consumption model for wireless sensor networks using GBR techniques. AFRICON 2015, 1–5 (2015) 31. Lee, J., Lee, J.: Prediction-based energy saving mechanism in 3GPP NB-IoT networks. Sensors (Switzerland) 17(9), 2008 (2017). https://doi.org/10.3390/ s17092008 32. Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Channels and Modulation, 3GPP TS 36.211, 2016. http://www.3gpp.org/ftp/Specs/archive/36 series/36.211/36211-d20.zip 33. Helmersson, K., Englund, E., Edvardsson, M., Edholm, C.: System performance of WCDMA enhanced uplink. In: IEEE 61st Vehicular Technology Conference, Stockholm, Sweden, pp. 1–5 (2005) 34. Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and Channel Coding, 3GPP TS 36.212 (2016). http://www.3gpp.org/ftp/Specs/archive/36 series/36.211/36212-d20.zip 35. Massam, P., Bowden, P., Howe, T.: Narrow band transceiver, 9 January 2013, eP Patent 2,092,682. http://www.google.com/patents/EP2092682B1?cl=pt-PT 36. Maldonado, P.A., Ameigeiras, P., Prados-Garzon, J., Navarro-Ortiz, J., LopezSoler, J.M.: Narrowband IoT data transmission procedures for massive machinetype communications. IEEE Netw. J. 31(6), 8–15 (2017). https://doi.org/10.1109/ MNET.2017.1700081 37. Raza, U., Kulkarni, P., Sooriyabandara, M.: Low power wide area networks: an overview. IEEE Commun. Surv. Tutor. 19, 855–873 (2017)

An Internet of Things and Blockchain Based Smart Campus Architecture

Manal Alkhammash1,2(B), Natalia Beloff2, and Martin White2

1 Jazan University, Jazan, Kingdom of Saudi Arabia
2 Sussex University, Brighton, UK

{ma979,n.beloff,m.white}@sussex.ac.uk

Abstract. Rapid development in science and information technologies, such as the Internet of things, has led to a growth in the number of studies and research papers on smart cities in recent years, and more specifically on the construction of smart campus technologies. This paper will review the concept of a smart campus, discuss the main technologies deployed, and then propose a novel framework for a smart campus. The architecture of this new smart campus approach will be discussed with particular consideration of security and privacy systems, the Internet of things, and blockchain technologies.

Keywords: Smart campus · Internet of things · Blockchain · Security · Privacy

1 Introduction

Information and communications technology (ICT) development is a never-ending process, which has led to a growth in the number of studies and research papers on smart cities in recent years. The concept of a smart city is not only about constructing traditional infrastructure, such as a transportation system, but also involves ICT infrastructure in order to improve quality of life and enhance the profile of the city [1]. Therefore, the term 'smart city' can be generally defined as dynamically integrating the physical and the digital worlds, in which different data resources are automatically gathered in real time [1–3]. By utilising high-speed networks, the changes in the physical world can be captured and transferred to data centres so that they can be stored and processed [4]. This means that in order to capture the necessary data, there need to be significant numbers of sensors at diverse locations that can capture this 'big data'. In addition, the Cloud needs to be utilised in order to store and analyse the data. Consequently, there are many areas that can be developed under the intelligent city framework to achieve the overall goal of improving citizens' quality of life. There have been many contributions and research papers in different areas to develop smart systems, such as medical and health care [5–8], supply chain management [9, 10], traffic [11, 12], and education systems [13–15], that together can build smart cities.

Since a smart campus constitutes an essential element of a smart city, and the concept of the smart campus comes from the notion of smart cities [16], many researchers have focused their attention on developing smart campuses, trying to address the topical question of 'how to develop an intelligent campus' by applying the same ideas and foundations of intelligent cities to the smart campus [17]. Therefore, the aim of this paper is to study different technologies and to design a novel architecture for a smart campus in order to develop an intelligent campus. Such a smart and intelligent campus architecture (or framework) is likely to exploit the Internet of things (IoT), blockchain, and smart contracts as part of its many technology solutions.

The paper is structured as follows. In section two, a brief description of the smart campus concept is given. In section three, the paper provides a brief background of the smart campus and delineates the main areas of the campus. In section four, the paper reports some issues related to a previous generic smart campus architecture, proposes a new one, and discusses it in depth in the following section. Finally, in section six, the paper provides conclusions and future work.

2 Smart Campus Concept

Traditionally, a campus can be defined as a land or an area where different buildings constitute an educational establishment. A campus often includes classrooms, libraries, student centres, residence halls, dining halls, parking, etc. Nowadays, campuses have adopted advanced technologies, such as virtual learning environments [18, 19] and timetabling systems [20, 21], in order to provide high-quality services for stakeholders (e.g. academics, students, administrators, and service functions) on campus and to monitor and control facilities. These developments should be evolving constantly in order to increase efficiency, cut operational costs, reduce effort, lead to better decision-making, and enhance the student experience [22]. Thus, the term 'smart campus' can be defined as a place where digital infrastructure can be developed and that has the ability to gather information, analyse data, make decisions, and respond to changes occurring on campus without human intervention [22, 23]. The authors in [24] define a smart campus as an environment where the structure of ambient learning spaces – application context based on virtual spaces – integrates social and digital services into physical learning resources.

If we think of a smart campus as a holistic framework, it encompasses several themes, including but not limited to automated security surveillance and control, intelligent sensor management systems, smart building management, communication for work, cooperation and social networking, and healthcare. Several innovations have been proposed for smart campuses, ranging from whole frameworks that use technologies such as mobile technologies, blockchain, the IoT, and the Cloud to assist learning, to security systems utilising technologies such as ZigBee and radio frequency identification (RFID) [25–28].

3 Smart Campus Background

Many studies and architectural plans with different goals have been undertaken on the subject of the smart campus [29]. This smart campus research largely breaks down into the following areas: teaching and learning, data analysis and services, building management and energy use on campus, campus data mining, water and waste management on campus, campus transportation, and campus security.


3.1 Smart Campus Learning Environments

Much research has been focused on constructing smart campuses by developing suitable technologies and applications that involve teaching and learning. Therefore, the common purpose of designing and developing a smart campus has often been from a learning and educational perspective. The authors in [27] developed a novel holistic environment for a smart campus known as iCampus. The aim of their research is to propose a beginning-to-end lifecycle within the knowledge ecosystem in order to enhance learning. Atif and Mathew designed a framework for a smart campus that integrates a campus social network within a real-world educational facility [30]. The study's goal was to provide a social community where knowledge could be shared between students, teachers, and the campus's physical resources. Further, [1] proposed a model of a smart campus to enable stakeholders on the campus to shape and understand their learning futures within the learning ecosystem. Based on cloud computing and IoT, [31] stated the concept of a smart campus and demonstrated some issues related to intelligent application platforms after establishment. However, these approaches were focused only on proposing a smart campus by using IoT technology.

3.2 Smart Campus Data Analysis and Service Orientation

Other research has considered the development of smart campuses based on data analysis. According to [32], a smart campus should be able to gather data from a crowd and analyse it by using crowdsourcing technologies in order to deliver services of added value. In 2011, [33] explained the prototype of a smart campus implementation that uses semantic technologies in order to integrate heterogeneous data. However, some researchers have envisioned smart campuses from social networking aspects. For instance, [34] elaborated upon an architectural system that can be deployed on campus in order to support social interaction by using service-oriented specifications. This depends upon their proposed social network platform (WeChat) and an examination of its architecture, functions, and features. Xiang et al. developed a smart campus framework based on information dissemination [17]. However, these approaches did not address blockchain technology in order to eliminate centralisation.

3.3 Building Management and Energy Efficiency on Smart Campuses

Several of the current initiatives that are developing smart campuses have been based on high-energy-efficiency perspectives. In order to decrease the energy consumption of buildings, monitoring and controlling environmental conditions is essential, such as controlling both natural and artificial lighting, humidity, and temperature. An example of this is a project that was undertaken at the University of Brescia in Italy in 2015 that aimed to enhance energy efficiency inside buildings by monitoring lighting, temperature, and electrical equipment using control systems, automation, and grid management. The project progressed in stages towards this goal. First of all, it aimed to reduce the consumption of the buildings' energy by analysing possible actions. Then it attempted to implement different measures and evaluated their efficiency.

Simultaneously, in order to enhance users' awareness of energy consumption, a system for monitoring operational conditions was also developed. Finally, the project evaluated the energy balance between consumption and generation, renewable energy production, and energy reduction. The outcome displayed a significant energy consumption reduction of 37.3% while improving the buildings' thermal properties [35]. In addition, [36] proposed and implemented a web-based system to manage energy in campus buildings known as CAMP-IT. The system aimed to optimise the operation of energy systems in order for buildings to achieve goals of reducing energy consumption while at the same time enhancing the quality of the indoor environment in terms of visual comfort and air quality. The modelling collected, controlled, and analysed the energy load for each building and for the campus as a whole. The results showed a reduction in energy consumption of nearly 30%. Again, these approaches did not study the integration of blockchain into the proposed architectures.

3.4 Smart Campus Data Mining

Additionally, some researchers have focused on applying interest mining, which is based on location, context awareness, proximity, and user profiles as well as other related information, to assist users in meeting their needs within the campus environment. In 2014, [37] studied web log mining, which is an essential technique in web data mining to determine users' characteristic interests, by developing a reliable and efficient method of data pre-processing. In 2010, [38] proposed a data-mining method for e-learning systems to identify users' interests and obtain information about learners' logs and knowledge background. Along these lines, the model would be able to automatically recommend resources that may be of interest to individual students. However, blockchain technology could be used to protect user profiles and preferences.

3.5 Water Management on Smart Campuses

Regarding water and waste management, since they are considered expensive and important services on smart campuses, several research studies have focused on proposing management systems for these services in order to reduce their environmental and financial impact [22]. In terms of sustainable water management, there are three important pillars: water harvesting processes, water recycling, and water consumption reduction [39]. Different approaches have been proposed to manage water. Some focused on controlling and monitoring the water level and water consumption on campus. For instance, [40] developed a water monitoring system to reduce water consumption on campus. The system designed a three-dimensional map of the campus and used a geographical information system (GIS) to display the water pipeline in the electronic map with detailed status information in real time. Therefore, the model can monitor water directly from pipelines; detect any problems that occur in the equipment, such as leaking; and analyse the amount of water consumption. In 2015, [41] developed a suitable system for medium-sized campuses to monitor the water balance in real time. The design used an ultrasound level sensor, a cloud software stack, and communication links and carefully considered industrial design.

To be able to monitor the water, the system measured the water level in tanks by sending ultrasound pulses to the water's surface. After observing the reflection, the sensors can estimate the distance and calculate the tank volume. Based on previous work, [42] developed an automatic water distribution system for large campuses so that each tank on the campus would have enough water to meet the local needs. The authors utilised ultrasonic ranging sensors, which are suitable for measuring water levels in large tanks, and a wireless network using sub-GHz radio frequency to connect sensors across long distances for further analysis. Moreover, many other experiments have proved efficient for developing water management systems, and they can be implemented on smart campuses to reduce water consumption [43]. For example, [44] developed a smart water meter that can provide a user with real-time reading information, analyse their consumption data, and present it in visual graphs to improve readability. Simultaneously, the system monitors the consumption and alerts the user if there is unusual water usage. However, these approaches did not address blockchain technology.

3.6 Waste Management on Smart Campuses

Similarly, numerous studies have been devoted to developing waste management systems. Authors in [45, 46] stated that general research studies in this area focused on developing waste trucks and bins with sensor devices attached to collect and analyse real-time data. This information can be used for several purposes, for example, for developing an efficient cleaning timetable and preventing overfilling of bins. Ebrahimi et al. [47] in 2017 investigated the current waste and recycling infrastructure on the Western Kentucky University campus to determine whether it had an adequate service by using spatial techniques, such as GIS, to track, recognise, and visualise waste and recycling bins in a large-scale area. They used spatial information for analysis and decision making to reduce the solid waste stream and improve the university's recycling stream. Furthermore, they drew an accurate roadmap for a suitable waste management plan for the campus. Although most papers use different techniques for waste management systems on smart campuses, they are still at the primary stages, and they lack a generic model.

3.7 Smart Campus Transportation

Recently, the global positioning system (GPS) has become the most common method for streaming a location and tracking a moving object, such as a vehicle on the road. To improve the accuracy of GPS, external information is needed, such as Wi-Fi, digital imaging, and computer vision [48]. The authors in [49] developed a tracking system for buses using GPS devices that reported the buses' locations every ten seconds. The location was sent from the server via SMS. The system also had safety features, such as the ability to send alerts or emergency reports when the vehicle crashed or was stolen. Other studies have tracked the location of a college bus using a mobile phone and Google Maps [50, 51]. Saad et al. [48] developed a real-time monitoring system for a university bus that used a GPS service to send the location of the bus to a cloud database every second. The system could also analyse the data to estimate the bus's arrival time. However, these approaches did not use blockchain technology to improve system security.
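To make the arrival-time idea concrete, the sketch below shows one naive way such an estimate could be computed from two GPS fixes. It is only an illustration, not the method used in the cited systems; the coordinates, function names, and the 20 km/h rolling average speed are assumptions made for the example.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two GPS fixes."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def estimate_arrival_minutes(bus_fix, stop_fix, recent_speed_kmh):
    """Naive ETA: remaining straight-line distance divided by the recent average speed."""
    if recent_speed_kmh <= 0:
        return None  # bus is stationary; no meaningful estimate
    return 60.0 * haversine_km(*bus_fix, *stop_fix) / recent_speed_kmh

# Hypothetical fixes: current bus position and the next stop, with a 20 km/h average speed.
eta = estimate_arrival_minutes((50.8673, -0.0874), (50.8625, -0.0805), 20.0)
print(f"Estimated arrival in {eta:.1f} minutes")
```

A deployed system would of course use the route distance rather than the straight-line distance, but dividing the remaining distance by the observed speed is the core of the estimate.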


3.8 Smart Campus Security

Many mobile applications have been developed for campus safety. Some of them allow users to contact campus security guards, such as EmergenSee and CampusSafe, whereas others, such as Guardly and CircleOf6, allow friends to contact each other [52]. These applications allow user location, photos, and situation descriptions to be shared with campus security guards. In addition, [53] also proposed a smart campus framework that includes several aspects, of which security was a notable one. They pointed out that a smart system can reduce burglaries by detecting glass breaking or any distinct sound; the system then has the ability to alert security to the location. Also, the system may have the ability to reduce drug or alcohol abuse by alerting public safety to the presence of alcohol.

Therefore, a smart campus can be described as an environment that has the ability to provide a suitable infrastructure in order to deliver the required services in light of contextual awareness. In addition, it is a well-structured place that can deliver huge amounts of information to a number of users by using their profiles and locations in order to best address their needs. Consequently, the desirable characteristics of a smart campus are accurate context awareness, ubiquitous access to networks, efficient and optimal utilisation of its many varied resources, and the use of objective principles as a basis to make smart decisions or predictions.

3.9 Summary

All the above approaches and implementations are useful and contribute to building smart campuses; however, they rely on IoT technologies with a centralised system architecture, which can lead to many issues, as will be discussed in the next section. Next, we describe a new architecture that incorporates a distributed approach exploiting blockchain and smart contracts to overcome some of the prevalent issues.

4 Smart Campus Architecture

Developing an architecture for a smart campus while considering advanced technologies, such as IoT, blockchain, and other technologies, is a complicated and difficult task, since there is a large variety of devices and objects, of services associated with such a system, and of link layer technologies. Many different smart campus architectures have been developed with different aims [22, 29]. However, most of these frameworks usually contain three essential layers that interact with each other. First is the perception layer, which contains physical technologies, such as sensors, that collect all kinds of data from the surrounding environment. Second is the network layer, which contains all communication networks that are responsible for receiving and transmitting data. Third is the application layer, which is responsible for supporting business and personalised services and interacting with individual users. Figure 1 shows a generic illustration of this layered architecture approach.


Fig. 1. A generic smart campus layered architecture [54]

Here, the lowest level of the architecture is the perception layer, which accommodates sensors to extract and gather data from physical devices. In the middle of the architecture, the network layer is utilised to aggregate, filter, and transmit data. The last layer is used by the Cloud or servers to store and analyse smart campus data.

There are several problems with this generic architecture since it relies on IoT architecture. IoT systems rely on centralised computing and storage platforms, such as cloud platforms, which are a suitable place to start for joining, managing, and controlling a massive number of different objects and devices as well as providing the required authentication and identification for various IoT devices. However, the centralised system architecture suffers from several limitations. Atlam and Wills [55] studied these limitations as follows. First, the centralised system has privacy vulnerabilities because data is collected from different devices and then stored in a centralised platform, which can be easily breached. Second, security is a major aspect for any system, since processing and storing data through a centralised platform can make it an easy target for attacks, such as distributed denial of service (DDoS) and denial of service (DoS) attacks. In addition, the devices in the IoT system are heterogeneous in nature, while a centralised platform uses a single operating system to connect to various devices. In this case, a centralised platform could prevent some objects from connecting to the system. Lastly, scalability is another issue related to a centralised platform since the number of connected devices in the system is increasing. This is especially a problem for large business organisations, such as campuses, that are distributed across different areas. According to Piekarska and Halpin [56], there are concerns about the efficiency of operating and scaling an IoT system with a centralised architecture, taking into account the increasing demands.

Recently, blockchain technology has been involved in various application areas beyond the cryptocurrency domain since it has multiple features, such as decentralisation, support for integrity, resiliency, autonomous control, and anonymity [57]. Blockchain eliminates a central authority by using a distributed ledger and is decentralised to provide more efficiency for operating and controlling communication among all participating nodes.

Fig. 2. A new smart campus framework (six layers: physical, communication, platform, data, business, and application, with blockchain and a security system spanning all layers)

It also eliminates the single point of failure that arises if the centralised platform goes down, which could lead to the failure of the whole system [55]. Therefore, blockchain can be an efficient technology to handle the issues related to a centralised IoT, particularly security. Thus, we propose a more detailed smart campus architectural framework that combines IoT and blockchain technology, as shown in Fig. 2. Our smart campus framework consists of the blockchain and six layers:

1. physical, which includes several objects, such as campus sensors and devices;
2. communication, which includes the communication protocols and IoT gateway;
3. platform, which is a cloud component, since the Cloud is currently considered an ideal technique for storing and analysing large volumes of data as well as for running several services;
4. data, which is used to store campus data and includes real-time events;
5. business, which produces high-level reports and analysis; and
6. application, which provides services to the end user for connecting to and controlling the smart campus environment.

In addition, this framework has a security system to provide a secure data connection that ensures the secure transfer of trusted data from the physical, communication, platform, data, and business layers to the final application layer. The following sub-sections describe each layer in more detail.

4.1 Physical Layer

The first layer, the physical layer, includes devices and sensors to detect data, such as motion, temperature, humidity, locations, attendance in the physical environment, etc.


When the sensors sense the physical campus environment, the parameters are converted to data signals to be handled in the Cloud for analysis. On the way, such data may pass through brokerage protocols, such as MQTT, to a suitable blockchain-based distributed storage in the data layer. In the physical layer, actuators operate in the opposite way: they convert data signals into physical actions [58], perhaps as a response to sensor data stored on the blockchain, which is subsequently analysed and results in an actuator event. The devices in this layer represent hardware components that are connected to the upper layers of the architecture either wirelessly or by wires.

4.2 Communication Layer

The communication layer is sometimes known as the network layer or transmission layer [59, 60]. The different data sources that are provided by the perception layer need to be connected to the upper architecture layers to handle the collected data. Devices and sensors use protocols and adequate communication technology to connect to the Internet. The diverse data sources in a smart environment lead to diverse communication technologies. For example, Wi-Fi/IEEE 802.11 utilises radio waves to allow smart devices to exchange data and communicate within a 100 m range, without utilising a router in some ad hoc configurations [61]. The IEEE 802.15.1 standard uses short-wavelength radio to exchange data between smart devices while minimising power, as in Bluetooth low energy (BLE), which operates for a longer period of time within a 100 m range. Recently, BLE has been considered a suitable technology to support IoT applications [62]. In addition, the IEEE 802.15.4 protocol is specified for low message throughput, low cost, low data rate, and low power consumption, and it is also a good candidate for machine-to-machine (M2M) communication, wireless sensor networks, and the IoT. This standard is used as the basis of the ZigBee protocol for more reliable communication and a high level of security [61]. Therefore, the main objective of this layer is to transmit data from and to different objects through gateways to integrated networks. Biswas and Muthukkumarasamy [63] discussed using blockchain in a smart city to provide a secure communication platform. They illustrated that blockchain should be integrated with the network layer to provide privacy and security for the transmitted data. They recommended that transaction data can be grouped into blocks using the TeleHash protocol for broadcast in the network.

4.3 Platform Layer

Generally, a smart environment based on IoT uses a large number of data sources, including actuators and sensors, that produce big data, from which knowledge needs to be extracted by using complex computations, applying data mining algorithms, and managing services and allocation tasks [64]. Thus, cloud computing presents a suitable technology and a powerful computational resource for IoT to process, compute, and store big data. In addition, blockchain is used to eliminate a centralised system architecture.

4.4 Data Layer

The data layer represents a database for the system and the processing of the data. A huge amount of data is stored in this layer, which is called 'big data'.


The previous layer uses this layer to generate useful information. In the case of a smart campus, a blockchain with a decentralised structure is needed to add security and privacy to the data.

4.5 Business Layer

The business layer relies on middleware technology, which manages the system services and activities. It is responsible for building flowcharts, graphs, and business models as well as analysing, monitoring, evaluating, designing, and developing smart systems. Based on big data analysis, the business layer has the ability to support decision making, visualise the outcomes to the user, and operate the controlling actuators.

4.6 Application Layer

This layer can consist of many different application types and services required by many different end users. For example, in a smart campus, this layer can provide data related to air humidity and temperature measurements. Therefore, the application layer's main objectives are to provide high-quality intelligent services to stakeholders [65, 66] and to allow users to interact with the system and visualise the data via an interface. In addition, the application layer has several protocols to deal with. For instance:

• The Constrained Application Protocol (CoAP) is a one-to-one communication protocol that is inspired by the Hypertext Transfer Protocol (HTTP). CoAP is suitable for smart devices and IoT technology because it is thin, lightweight, and causes as little traffic as possible [58].
• Message Queue Telemetry Transport (MQTT) is a protocol for messaging, and it is responsible for connecting networks and smart devices with middleware and applications [61]. Several applications use MQTT, such as monitoring and social media notifications [58]. Thus, this protocol is able to provide an ideal messaging protocol for M2M and IoT communications due to its suitability for low-bandwidth networks, low power, and low cost.
• Moreover, the Advanced Message Queuing Protocol (AMQP) is an open standard protocol that supports reliable transport and communication and focuses on a message-oriented environment. The Data Distribution Service (DDS) is a publish–subscribe protocol for real-time communication [65].

The application layer is responsible for providing high reliability and excellent quality of service to the applications. Therefore, there are a variety of communication protocols that can each work in a different scenario and with a different device manufacturer.
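As a concrete illustration of the messaging style described above, the following minimal sketch publishes one environmental reading over MQTT using the paho-mqtt Python client. The broker hostname, topic naming scheme, and payload fields are assumptions made for the example, not part of the proposed framework.

```python
import json
import time

import paho.mqtt.client as mqtt  # pip install paho-mqtt (1.x style; 2.x also needs a CallbackAPIVersion argument)

BROKER_HOST = "broker.campus.example"              # hypothetical campus broker
TOPIC = "campus/building-a/room-101/environment"   # hypothetical topic naming scheme

client = mqtt.Client(client_id="room-101-env-node")
client.connect(BROKER_HOST, port=1883, keepalive=60)
client.loop_start()  # background network loop handles acknowledgements

# Publish one temperature/humidity reading as a small JSON payload with QoS 1.
reading = {"temperature_c": 21.4, "humidity_pct": 48.0, "ts": int(time.time())}
info = client.publish(TOPIC, json.dumps(reading), qos=1)
info.wait_for_publish()

client.loop_stop()
client.disconnect()
```

The same pattern would apply to any of the sensor streams in the physical layer; a gateway or the platform layer would subscribe to the relevant topics and forward validated readings onwards.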

4.7 Blockchain

Blockchain is a distributed ledger technology that implements transactions with a decentralised digital database. A transaction is verified by a network of computers before it is added to the ledger. Blockchain allows parties to exchange assets in real time without going through intermediaries [67]. Blockchain technology is a peer-to-peer (P2P) distributed ledger technology that records contracts, transactions, and agreements [63, 68]. In other words, blockchain verifies data after receiving it from a physical layer and then constructs it into a transaction. It should be stated that the details of blockchain technology and how it works are outside the scope of this paper. For more information about blockchain technology principles, [69] and [70] can be helpful. To decide which type of blockchain to use in our framework, the types will be addressed in a comparative analysis.

Recently, blockchain technology has been classified into three types: public blockchain, private blockchain, and consortium blockchain [71]. The first type is also called a permission-less blockchain because no permission from a single entity is needed to join a network such as Bitcoin [72] or Ethereum [73]. Anyone can engage and participate successfully by downloading the blockchain and executing the code as well as by sending transactions to the network. Therefore, a public blockchain is fully decentralised, so all transactions or ledgers are shared and verified by all nodes, and there is no need for a central authority. In order to prove identities, peers in the network have to solve the proof-of-work puzzle, which requires time and power. This means the chain is not centralised, and once the data is validated the ledger is updated; thereafter, the ledger and its transactions are immutable.

However, a private blockchain designates its participants in advance and controls who is allowed to take part in its writing, reading, and consensus processes. In other words, this blockchain is a permission-based chain, and only those who are authorised can join the network. This type of blockchain is useful for organisations or groups of individuals that share the ledger privately. Thus, malicious nodes cannot enter the network without permission. Specific nodes or services can be removed or added as needed, which provides better scalability for the network. Since the private blockchain is controlled on the network by a single trusted node and has fewer authorised participants than a public blockchain, it performs much faster on a ledger and processes more transactions for each block. Furthermore, this blockchain has many consensus methods, such as practical Byzantine fault tolerance, proof of elapsed time, and proof of stake. A private blockchain is used widely in environments that need more security and privacy, such as companies and the banking sector. Corda [74] is an example of a private blockchain.

In addition, a consortium blockchain is a hybrid that combines private and public blockchains, and it is classified as a permission-based blockchain [71]. In this blockchain, the participants engage in writing and reading on the blockchain across organisations. Preselected nodes control the consensus process in this blockchain. In other words, several institutions govern this blockchain, unlike a private blockchain, which is operated by a single node. A hybrid blockchain has many of the advantages of a private blockchain, such as privacy and efficiency of the ledger as well as higher scalability and faster transactions. In addition, a consortium blockchain is an easily implemented environment and more energy-efficient compared to a public blockchain [71, 75].
To summarise, all blockchain types are decentralised P2P networks, and all nodes share a verified ledger. All blockchains provide a ledger's immutability, and all users in all types of blockchain maintain a replica of the ledger. However, the main difference between public and private blockchains is authorisation. A public blockchain allows any user to participate in the network. In addition, private and consortium blockchains are more efficient for IoT networks since they both have faster response times in the network and lower computational requirements. While the public blockchain has proved over the years to be suitable and efficient for cryptocurrencies, it is not that effective for IoT applications due to its bandwidth requirements and high computational requirements [76].

In our architecture, we suggest using a consortium blockchain, for example, the Hyperledger Fabric blockchain platform [77–79], for many key features. A Hyperledger blockchain is widely used for businesses and enterprises. It is designed to support pluggable implementations of components, delivering high degrees of confidentiality, resilience, and scalability with low latency. Hyperledger has a modular architecture and can be used very flexibly. In addition, modular consensus protocols are used, which permit users to choose trust models and tailor the system for particular use cases. This platform runs smart contracts, or chaincode: executable program code that allows participants to write their own scripts without a middleman [80].
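To make the ledger idea above more tangible, the following toy sketch shows how hash-linking blocks of transactions makes tampering detectable. It only illustrates the immutability principle; it is not how Hyperledger Fabric (with its consensus, endorsement, and chaincode machinery) is implemented, and all class and field names here are invented for the example.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass
class Block:
    index: int
    previous_hash: str
    transactions: list
    timestamp: float = field(default_factory=time.time)

    def hash(self) -> str:
        # Digest over the block's full contents; any later change alters this value.
        payload = json.dumps(
            {"index": self.index, "previous_hash": self.previous_hash,
             "transactions": self.transactions, "timestamp": self.timestamp},
            sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

class Ledger:
    def __init__(self):
        self.chain = [Block(0, "0" * 64, [{"genesis": True}])]

    def add_block(self, transactions: list) -> Block:
        block = Block(len(self.chain), self.chain[-1].hash(), transactions)
        self.chain.append(block)
        return block

    def is_valid(self) -> bool:
        # Tampering with any earlier block breaks every hash link that follows it.
        return all(self.chain[i].previous_hash == self.chain[i - 1].hash()
                   for i in range(1, len(self.chain)))

ledger = Ledger()
ledger.add_block([{"sensor": "room-101-temp", "value": 21.4}])
ledger.add_block([{"sensor": "room-101-temp", "value": 21.6}])
print(ledger.is_valid())                           # True
ledger.chain[1].transactions[0]["value"] = 99.9    # tamper with an earlier record
print(ledger.is_valid())                           # False: the change is detectable
```

In the proposed framework, the analogous role is played by the consortium blockchain, where peers from the participating campus organisations validate and replicate the ledger.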

5 Smart Campus Exploiting the Internet of Things, Blockchain, and Security Requirements

The main reason for developing a blockchain in 2008 was to address the potential problem related to stakeholders' trust in various use cases, including financial and non-financial fields [81, 82]. It provides security for transactions by using several cryptography mechanisms, such as signatures, asymmetric cryptography, and hashes. A lot of research has explored whether blockchain technology meets the need for providing more secure, trusted, and immutable data by adopting the blockchain into existing software, such as in the financial industry [83] and healthcare fields [84–86]. However, integrating blockchain technology into education institutions is still in its early stages and needs more research. We have therefore provided a discussion of the security requirements for the proposed smart campus framework, since the security aspect is the main concern in most of the recent blockchain applications. We study this aspect in more detail in the following sub-sections, covering authentication and privacy in addition to the CIA triad of confidentiality, integrity, and availability.

5.1 Authentication

Authentication is one of the key security aspects and is the process of verifying a peer's identity in order to use a system and communicate with others [87]. There are many studies that have focused on user authentication, with the majority of cases looking at data leaks and identity theft. The current authentication mechanisms, which have been used in most applications, vary from using a single factor, for example, a password or user ID, to using multi-factor authentication, such as a smart card or a biological characteristic.


These traditional methods are not effective in providing appropriate protection and can cause various issues and damage; for example, passwords have recently been easily and frequently hacked [88]. Multi-factor authentication relies on centralisation or on trusting third-party services, which, as we discussed previously, carry high security risks. Recently, blockchain has been used to improve protection against illegitimate access for several IoT applications without the need for centralised services. For example, Cha et al. [89] designed a blockchain gateway by integrating the blockchain into an IoT gateway to securely protect user preferences while connecting to IoT devices. This approach can raise the authentication level between the users and the connected devices. In addition, Sanda and Inaba [90] used blockchain technology with a Wi-Fi network to provide authentication to the connected users and protect the network from malicious usage. The blockchain in this implementation was used to encrypt the communication and ensure the security of the network. Therefore, the blockchain has the benefit of increasing the security of the authentication process.

5.2 Privacy

Privacy is an essential aspect of most systems. The majority of researchers have taken advantage of blockchain technology to increase the level of privacy in the IoT environment and protect individuals' private data from being revealed [91]. For example, Kianmajd et al. [92] presented a framework that integrates blockchain to preserve users' privacy while using community resources. The framework highlighted that the decentralised environment of the blockchain can be used to increase users' data privacy. In addition, Zyskind et al. [93] structured a personal data management platform in order to provide privacy for users. The study proposed a protocol integrated with a blockchain to produce 'an automated trustless access-control manager'. The constructed platform achieved privacy by encrypting the data and storing pointers to it in the ledger, rather than sending the data itself to the network. Thus, personal data should be secured and controlled by the user and not be entrusted to a third party.

5.3 Confidentiality, Integrity, and Availability (CIA)

Data confidentiality is an aspect of protecting data from unauthorised access. Since blockchain uses cryptography mechanisms, it offers confidentiality and protects data, such as bank account [81] and personal data [94], from parties that do not have permission. Data integrity is another security aspect that is concerned with assuring and preserving the consistency, reliability, and accuracy of the data [95]. In other words, the data stored in the database should be kept from changing throughout its lifecycle. In this case, through the use of various cryptography mechanisms, blockchain technology provides data integrity and promises to protect data from unauthorised change [96, 97]. Banerjee et al. [98] combined the blockchain with IoT devices' firmware to maintain the integrity of shared data. Moreover, Liu et al. [99] implemented a framework for a data integrity service using blockchain to verify the integrity of IoT data without the need for a third party.


Data availability is one of many important terms in any system and means ensuring that the required data is available and accessible when needed [100]. One of the benefits of blockchain technology with a decentralised structure and distributed ledger is that it is resistant to outages [101].
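As a minimal sketch of the cryptographic building blocks mentioned in this section (encryption for confidentiality, hashing for integrity, and digital signatures for authentication), the example below uses the widely available Python `cryptography` package. The record contents, key handling, and the idea of putting only the digest on the ledger are illustrative assumptions, not the specific mechanisms of the cited systems.

```python
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.fernet import Fernet                      # pip install cryptography
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical campus record to be protected.
record = b'{"student_id": "s1234567", "attendance": "2020-02-14T09:02"}'

# Confidentiality: keep the record itself encrypted in off-chain storage.
storage_key = Fernet.generate_key()
encrypted_record = Fernet(storage_key).encrypt(record)

# Integrity: only a fixed-size digest (a pointer to the data) would be written to the ledger.
digest = hashlib.sha256(record).hexdigest()

# Authentication: the submitting node signs the digest so peers can verify its origin.
node_key = Ed25519PrivateKey.generate()
signature = node_key.sign(digest.encode())

# A verifier holding the corresponding public key checks the signature before accepting it.
try:
    node_key.public_key().verify(signature, digest.encode())
    print("signature valid; on-ledger digest:", digest[:16], "...")
except InvalidSignature:
    print("rejected: signature does not match")
```

Availability, by contrast, comes from replication: every peer in the consortium keeps a copy of the ledger, so no single outage removes access to the recorded digests.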

6 Conclusion

Recently, many researchers have focused on the study of developing smart and intelligent environments in many fields, such as smart cities, hospitals, and homes, that mostly rely on IoT systems. The privacy and security aspects have been attracting research interest since they are considered critical issues and challenges for connected IoT devices. This paper surveyed a number of schemes and frameworks for smart campuses that were proposed in the literature as an example of IoT and addressed the issues related to security and privacy. This paper presented an overview of the smart campus concept, including architectures and enabling technologies, such as IoT, cloud computing, and blockchain, with the aim of improving the quality of life on campuses. It studied eight varied domains in the smart campus and defined the problem aspects per domain. In addition, the paper discussed the generic framework of a smart campus and its limitations. Furthermore, we proposed a new smart campus framework combining IoT and blockchain to mitigate the IoT issues in the previous architectures, particularly in relation to security and privacy, since blockchain technology has multiple properties, such as autonomous and decentralised control, support for integrity, and resiliency. Moreover, this study discussed the security requirements for the proposed framework of a smart campus.

References 1. Kwok, L.: A vision for the development of i-campus. Smart Learn. Environ. 2, 2 (2015) 2. Szabo, R., Farkas, K., Ispany, M., Benczur, A.A., Batfai, N., Jeszenszky, P., Laki, S., Vagner, A., Kollar, L., Sidlo, C., Besenczi, R., Smajda, M., Kover, G., Szincsak, T., Kadek, T., Kosa, M., Adamko, A., Lendak, I., Wiandt, B., Tomas, T., Nagy, A.Z., Feher, G.: Framework for smart city applications based on participatory sensing. In: Proceedings of the 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013, pp. 295– 300 (2013) 3. Caragliu, A., Del Bo, C., Nijkamp, P.: Smart cities in Europe. In: 3rd Central European Conference on Regional Science 0732, pp. 1–15 (2015) 4. Perera, C., Liu, C.H., Jayawardena, S., Chen, M.: A survey on internet of things from industrial market perspective. IEEE Access 2, 1660–1679 (2015) 5. Pramanik, M.I., Lau, R.Y.K., Demirkan, H., Azad, M.A.K.: Smart health: big data enabled health paradigm within smart cities. Expert Syst. Appl. 87, 370–383 (2017) 6. Catarinucci, L., De Donno, D., Mainetti, L., Palano, L., Patrono, L., Stefanizzi, M.L., Tarricone, L.: An IoT-aware architecture for smart healthcare systems. IEEE Internet Things J. 2, 515–526 (2015)


7. Farahani, B., Firouzi, F., Chang, V., Badaroglu, M., Constant, N., Mankodiya, K.: Towards fog-driven IoT eHealth: promises and challenges of IoT in medicine and healthcare. Futur. Gener. Comput. Syst. 78, 659–676 (2018) 8. Amendola, S., Lodato, R., Manzari, S., Occhiuzzi, C., Marrocco, G.: RFID technology for IoT-based personal healthcare in smart spaces. IEEE Internet Things J. 1, 144–152 (2014) 9. Tachizawa, E.M., Alvarez-Gil, M.J., Montes-Sancho, M.J.: How “smart cities” will change supply chain management. Supply Chain Manag. 20, 237–248 (2015) 10. Luki´c, J., Radenkovi´c, M., Despotovi´c-Zraki´c, M., Labus, A., Bogdanovi´c, Z.: Supply chain intelligence for electricity markets: a smart grid perspective. Inf. Syst. Front. 19, 91–107 (2017) 11. Ghazal, B., Elkhatib, K., Chahine, K., Kherfan, M.: Smart traffic light control system. In: 3rd International Conference on Electrical, Electronics, Computer Engineering and their Applications, EECEA 2016 (2016) 12. Galán-García, J.L., Aguilera-Venegas, G., Rodríguez-Cielos, P.: An accelerated-time simulation for traffic flow in a smart city. J. Comput. Appl. Math. 270, 557–563 (2014) 13. Nair, P.K., Ali, F., Lim, C.L.: Interact. Technol. Smart Educ. 12, 183–201 (2015) 14. Alelaiwi, A., Alghamdi, A., Shorfuzzaman, M., Rawashdeh, M., Hossain, M.S., Muhammad, G.: Enhanced engineering education using smart class environment. Comput. Human Behav. 51, 852–856 (2015) 15. Ibrahim, M.S., Razak, A.Z.A., Kenayathulla, H.B.: Smart principals and smart schools. Procedia Soc. Behav. Sci. 10, 826–836 (2013) 16. Muhamad, W., Kurniawan, N.B., Suhardi, S., Yazid, S.: Smart campus features, technologies, and applications: a systematic literature review. In: Proceedings of the International Conference on Information Technology Systems and Innovation, ICITSI 2017 (2018) 17. Dong, X., Kong, X., Zhang, F., Chen, Z., Kang, J.: OnCampus: a mobile platform towards a smart campus Background. Springerplus 5, 974 (2016) 18. Ahern, N., Wink, D.M.: Virtual learning environments: second life. Nurse Educ. 35, 225–227 (2010) 19. Alam, A., Ullah, S.: Adaptive 3D-virtual learning environments: from students’ learning perspective. In: Proceedings of the 14th International Conference on Frontiers of Information Technology, FIT 2016 (2017) 20. Komaki, H., Shimazaki, S., Sakakibara, K., Matsumoto, T.: Interactive optimization techniques based on a column generation model for timetabling problems of university makeup courses. In: Proceedings of the IEEE 8th International Workshop on Computational Intelligence and Applications, IWCIA 2015 (2016) 21. Mei, R., Guan, J., Li, B.: University course timetable system design and implementation based on mathematical model. In: 2nd International Conference on Computer and Automation Engineering, ICCAE 2010 (2010) 22. Abuarqoub, A., Abusaimeh, H., Hammoudeh, M., Uliyan, D., Abu-Hashem, M.A., Murad, S., Al-Jarrah, M., Al-Fayez, F.: A survey on internet of thing enabled smart campus applications. In: Proceedings of the International Conference on Future Networks and Distributed Systems - ICFNDS 2017, pp. 1–7 (2017) 23. Khamayseh, Y., Mardini, W., Aljawarneh, S., Yassein, M.B.: Integration of wireless technologies in smart university campus environment. Int. J. Inf. Commun. Technol. Educ. 11, 60–74 (2015) 24. Atif, Y., Mathew, S.S., Lakas, A.: Building a smart campus to support ubiquitous learning. J. Ambient Intell. Humaniz. Comput. 6, 223–238 (2015)


25. Chen, Y., Zhang, R., Shang, X., Zhang, S.: An intelligent campus space model based on the service encapsulation. In: Proceedings of 2nd International Conference on Logistics, Informatics and Service Science, LISS 2012, pp. 919–923 (2013) 26. Chen, Y., Li, X., Wang, Y., Gao, L.: The design and implementation of intelligent campus security tracking system based on RFID and ZigBee. In: Proceedings of the 2nd International Conference on Mechanic Automation and Control Engineering, MACE 2011, pp. 1749–1752 (2011) 27. Ng, J.W.P., Azarmi, N., Leida, M., Saffre, F., Afzal, A., Yoo, P.D.: The intelligent campus (iCampus): end-to-end learning lifecycle of a knowledge ecosystem. In: Proceedings of the 6th International Conference on Intelligent Environments, IE 2010, pp. 332–337 (2010) 28. Jackson, P.M.: Intelligent campus. In: Proceedings of the First International Symposium on Pervasive Computing and Applications, SPCA 2006, p. 3 (2007) 29. Hirsch, B., Ng, J.W.P.: Education beyond the cloud: anytime-anywhere learning in a smart campus environment. In: International Conference for Internet Technology and Secured Transactions, pp. 718–723 (2011) 30. Atif, Y., Mathew, S.: A social web of things approach to a smart campus model. In: Proceedings of the IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing, GreenCom-iThings-CPSCom 2013, pp. 349–354 (2013) 31. Liu, Y.L., Zhang, W.H., Dong, P.: Research on the construction of smart campus based on the internet of things and cloud computing. Appl. Mech. Mater. 543, 3213–3217 (2014) 32. Adamkó, A., Kollár, L.: A system model and applications for intelligent campuses. In: Proceedings of the IEEE 18th International Conference on Intelligent Engineering Systems, INES 2014, pp. 193–198 (2014) 33. Boran, A., Bedini, I., Matheus, C.J., Patel-Schneider, P.F., Keeney, J.: A smart campus prototype for demonstrating the semantic integration of heterogeneous data. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 238–243 (2011) 34. Yu, Z., Liang, Y., Xu, B., Yang, Y., Guo, B.: Towards a smart campus with mobile social networking. In: Proceedings of the IEEE International Conferences on Internet of Things and Cyber, Physical and Social Computing, iThings/CPSCom 2011 (2011) 35. De Angelis, E., Ciribini, A.L.C., Tagliabue, L.C., Paneroni, M.: The Brescia smart campus demonstrator. Renovation toward a zero energy classroom building. Procedia Eng. 118, 735–743 (2015) 36. Kolokotsa, D., Gobakis, K., Papantoniou, S., Georgatou, C., Kampelis, N., Kalaitzakis, K., Vasilakopoulou, K., Santamouris, M.: Development of a web based energy management system for university campuses: the CAMP-IT platform. Energy Build. 123, 119–135 (2016) 37. Han, Y., Xia, K.: Data preprocessing method based on user characteristic of interests for web log mining. In: Proceedings of the 4th International Conference on Instrumentation and Measurement, Computer, Communication and Control, IMCCC 2014, pp. 867–872 (2014) 38. Kuang, W., Luo, N.: User interests mining based on topic map. In: Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2010, pp. 2399–2402 (2010) 39. Amr, A.I., Kamel, S., Gohary, G.El, Hamhaber, J.: Water as an ecological factor for a sustainable campus landscape. Procedia Soc. Behav. Sci. 216, 181–193 (2016) 40. 
Shi, G.B.: The design of campus monitoring and managing system for water-saving based on WebGIS. In: Proceedings of the IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), iThings-GreenCom-CPSCom-SmartData 2017, January 2018, pp. 951–954 (2018)


41. Kudva, V.D., Nayak, P., Rawat, A., Anjana, G.R., Kumar, K.R.S., Amrutur, B., Kumar, M.S.M.: Towards a real-time campus-scale water balance monitoring system. In: Proceedings of the IEEE International Conference on VLSI Design, pp. 87–92 (2015) 42. Verma, P., Kumar, A., Rathod, N., Jain, P., Mallikarjun, S., Subramanian, R., Amrutur, B., Kumar, M.S.M., Sundaresan, R.: Towards an IoT based water management system for a campus. In: IEEE 1st International Smart Cities Conference, ISC2 2015 (2015) 43. Alghamdi, A., Shetty, S.: Survey toward a smart campus using the internet of things. In: Proceedings of the IEEE 4th International Conference on Future Internet of Things and Cloud, FiCloud 2016, pp. 235–239 (2016) 44. Mudumbe, M.J., Abu-Mahfouz, A.M.: Smart water meter system for user-centric consumption measurement. In: Proceeding of the IEEE International Conference on Industrial Informatics, INDIN 2015, pp. 993–998 (2015) 45. Goenka, S., Mangrulkar, R.S.: Robust waste collection: exploiting IoT potentiality in smart cities. i-Manager’s J. Softw. Eng. 11, 10–18 (2017) 46. Folianto, F., Low, Y.S., Yeow, W.L.: Smartbin: smart waste management system. In: IEEE 10th International Conference on Intelligent Sensors, Sensor Networks and Information Processing, ISSNIP 2015 (2015) 47. Ebrahimi, K., North, L., Yan, J.: GIS applications in developing zero-waste strategies at a mid-size American university. In: International Conference on Geoinformatics (2017) 48. Saad, S.A., Hisham, A.A.B., Ishak, M.H.I., Fauzi, M.H.M., Baharudin, M.A., Idris, N.H.: Real-time on-campus public transportation monitoring system. In: Proceedings of the IEEE 14th International Colloquium on Signal Processing and its Application, CSPA 2018 (2018) 49. Ramadan, M., Al-Khedher, M., Al-Kheder, S.: Intelligent anti-theft and tracking system for automobiles. Int. J. Mach. Learn. Comput. 2, 83 (2012) 50. Priya, S., Prabhavathi, B., Shanmuga Priya, P., Shanthini, B., Scholar, U.: An android application for tracking college bus using google map. Int. J. Comput. Sci. Eng. Commun. 3, 1057–1061 (2015) 51. Suresh Mane, M.P., Khairnar, P.V.: Analysis of bus tracking system using Gps on smart phones. IOSR J. Comput. Eng. (2014) 52. Ferreira, J.E., Visintin, J.A., Okamoto, J., Pu, C.: Smart services: a case study on smarter public safety by a mobile app for University of São Paulo. In: IEEE SmartWorld, Ubiquitous Intelligence and Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People and Smart City Innovation, SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI 2017, pp. 1–5 (2018) 53. Wang, Y., Saez, B., Szczechowicz, J., Ruisi, J., Kraft, T., Toscano, S., Vacco, Z., Nicolas, K.: A smart campus internet of things framework. In: IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, UEMCON 2017 (2018) 54. Cheng, X., Xue, R.: Construction of smart campus system based on cloud computing. In: Proceedings of the 6th International Conference on Applied Science Engineering and Technology, Atlantis Press, pp. 187–191 (2016) 55. Atlam, H.F., Wills, G.B.: Intersections between IoT and distributed ledger (2019) 56. Halpin, H., Piekarska, M.: Introduction to security and privacy on the blockchain. In: Proceedings of the 2nd IEEE European Symposium on Security and Privacy Workshops, EuroS and PW 2017 (2017) 57. Chowdhury, M., Ferdous, S., Biswas, K.: Blockchain Platforms for IoT Use-cases, pp. 3–4 (2018) 58. 
Hejazi, H., Rajab, H., Cinkler, T., Lengyel, L.: Survey of platforms for massive IoT. In: IEEE International Conference on Future IoT Technologies, Future IoT 2018 (2018)

484

M. Alkhammash et al.

59. Lin, J., Yu, W., Zhang, N., Yang, X., Zhang, H., Zhao, W.: A survey on internet of things: architecture, enabling technologies, security and privacy, and applications. IEEE Internet Things J. 4, 1125–1142 (2017) 60. Leo, M., Battisti, F., Carli, M., Neri, A.: A federated architecture approach for Internet of Things security. In: Euro Med Telco Conference - From Network Infrastructures to Network Fabric: Revolution at the Edges, EMTC 2014 (2014) 61. Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., Ayyash, M.: Internet of things: a survey on enabling technologies, protocols, and applications. IEEE Commun. Surv. Tutorials 17, 2347–2376 (2015) 62. Decuir, J.: Introducing bluetooth smart: Part 1: a look at both classic and new technologies. IEEE Consum. Electron. Mag. 3, 12–18 (2014) 63. Biswas, K., Muthukkumarasamy, V.: Securing smart cities using blockchain technology. In: Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016 (2017) 64. Bryant, R., Katz, R., Lazowska, E.: Big-data computing: creating revolutionary breakthroughs in commerce, science and society. Comput. Res. Assoc. (2008) 65. Khan, R., Khan, S.U., Zaheer, R., Khan, S.: Future internet: The internet of things architecture, possible applications and key challenges. In: Proceedings of the 10th International Conference on Frontiers of Information Technology, FIT 2012 (2012) 66. Yang, Z., Yue, Y., Yang, Y., Peng, Y., Wang, X., Liu, W.: Study and application on the architecture and key technologies for IOT. In: International Conference on Multimedia Technology, ICMT 2011 (2011) 67. Morkunas, V.J., Paschen, J., Boon, E.: How blockchain technologies impact your business model. Bus. Horiz. 62, 295–306 (2019) 68. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things (2016) 69. Olleros, F., Zhegu, M., Pilkington, M.: Blockchain technology: principles and applications. In: Research Handbook on Digital Transformations (2016) 70. Ahram, T., Sargolzaei, A., Sargolzaei, S., Daniels, J., Amaba, B.: Blockchain technology innovations. In: IEEE Technology and Engineering Management Society Conference, TEMSCON 2017 (2017) 71. Sankar, L.S., Sindhu, M., Sethumadhavan, M.: Survey of consensus protocols on blockchain applications. In: 4th International Conference on Advanced Computing and Communication Systems, ICACCS 2017 (2017) 72. Satoshi, N., Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. Bitcoin (2008) 73. Dannen, C.: Introducing Ethereum and Solidity: Foundations of Cryptocurrency and Blockchain Programming for Beginners (2017) 74. Hearn, M.: Corda: A distributed ledger. Whitepaper (2016) 75. Lim, S.Y., Tankam Fotsing, P., Almasri, A., Musa, O., Mat Kiah, M.L., Ang, T.F., Ismail, R.: Blockchain technology the identity management and authentication service disruptor: a survey. Int. J. Adv. Sci. Eng. Inf. Technol. 8, 1735–1745 (2018) 76. Salimitari, M., Chatterjee, M.: A Survey on consensus protocols in blockchain for IoT networks, pp. 1–15 (2018) 77. Windley, P.J.: Hyperledger Welcomes Project Indy. Hyperledger (2017) 78. Cachin, C.: Architecture of the Hyperledger. Blockchain Fabric. Work. Distrib. Cryptocurrencies Consens. Ledgers (DCCL 2016) (2016) 79. 
Androulaki, E., Manevich, Y., Muralidharan, S., Murthy, C., Nguyen, B., Sethi, M., Singh, G., Smith, K., Sorniotti, A., Stathakopoulou, C., Vukoli´c, M., Barger, A., Cocco, S.W., Yellick, J., Bortnikov, V., Cachin, C., Christidis, K., De Caro, A., Enyeart, D., Ferris, C., Laventman, G.: Hyperledger fabric: A distributed operating system for permissioned blockchains. In: Proceedings of the 13th European Conference on Computer System, ACM, pp. 1–15 (2018)

An Internet of Things and Blockchain Based Smart Campus Architecture

485

80. Buterin, V., Abarbanell, J.S., Bushee, B.J., Adcock, C., Adebiyi, A.A., Adewumi, A.O., Ayo, C.K., Atzei, N., B, M.B., Cimoli, T., Bartoletti, M., Cimoli, T., B, Y.H., Chakraborty, K., Mehrotra, K., Mohan, C.K., Ranka, S., Chen, M., Narwal, N., Schultz, M., Choi, H.K., Choudhry, R., Garg, K., Chrystus, J., Connor, J.T., Martin, R.D., Atlas, L.E., Corbet, S., Lucey, B., Yarovaya, L., Dechow, P.M., Hutton, A.P., Meulbroek, L.K., Sloan, R.G., Duarte Lima Freire Lopes, G., Falinouss, P., Faugeras, O.D., Fischer, T., Krauss, C., Frisiani, N., Hebrero-Martínez, M., Lerma, R.V., Trollé, C.M., Pérez-Cuevas, R., Muñoz, O., Hu, Z., Liu, W., Bian, J., Liu, X., Liu, T.-Y., Kadiri, E., Alabi, O., Kim, Y. Bin, Kim, J.G.J.H., Kim, W., Im, J.H., Kim, T.H., Kang, S.J., Kim, C.H., Lee, J., Park, N., Choo, J., Kim, J.G.J.H., Kim, C.H., Kimoto, T., Asakawa, K., Yoda, M., Takeoka, M., Kohara, K., LeCun, Y., Bengio, Y., Maciel, L.S., Ballini, R., Mu, S., Guo, Y., Yang, P., Wang, W., Yu, L., Nelson, D.M.Q., Pereira, A.C.M., De Oliveira, R.A., Of, a B., Counsel, P., Pagolu, V.S., Reddy, K.N., Panda, G., Majhi, B., Persson, S., Shaw, I., Phaladisailoed, T., Numnonda, T., Richardson, S., Tuna, I., Wysocki, P., Roche, J., Mcnally, S., Roondiwala, M., Patel, H. and Varma, S., Shukla, N., Fricklas, K., Song, Y.-G., Zhou, Y.-L., Han, R.-J., Tang, Z., de Almeida, C., Fishwick, P.A., Vargas, M.R., Lima, B.S.L.P. De, Evsukoff, A.G., Chohan, U., Nakamoto, S., DemirgucKunt, A., Klapper, L., Singer, D., Ansar, S., Hess, J., Wiederhold, B.K., Riva, G., Graffigna, G., Schöneburg, E., Guo, T., Bifet, A., Antulov-Fantulin, N., Wood, G., Vineeth, N., Ayyappa, M., Bharathi, B.: A next-generation smart contract and decentralized application platform. PLoS One (2018) 81. Crosby, M., Nachiappan, Pattanayak, P., Verma, S., Kalyanaraman, V.: Blockchain Technology - BEYOND BITCOIN. Berkley Eng (2016) 82. Davidson, S., De Filippi, P., Potts, J.: Economics of Blockchain. SSRN (2016). https://ssrn. com/abstract=2744751. https://doi.org/10.2139/ssrn.2744751 83. Khan, C., Lewis, A., Rutland, E., Wan, C., Rutter, K., Thompson, C.: A distributed-ledger consortium model for collaborative innovation. Computer 50, 29–37 (2017) 84. Benchoufi, M., Porcher, R., Ravaud, P.: Blockchain protocols in clinical trials: transparency and traceability of consent. F1000. Res. 6, 66 (2018) 85. Azaria, A., Ekblaw, A., Vieira, T., Lippman, A.: MedRec: Using blockchain for medical data access and permission management. In: Proceedings of the 2016 2nd International Conference on Open and Big Data, OBD 2016 (2016) 86. Dagher, G.G., Mohler, J., Milojkovic, M., Marella, P.B.: Ancile: privacy-preserving framework for access control and interoperability of electronic health records using blockchain technology. Sustain. Cities Soc. 39, 283–297 (2018) 87. Wazid, M., Das, A.K., Hussain, R., Succi, G., Rodrigues, J.J.P.C.: Authentication in clouddriven IoT-based big data environment: survey and outlook. J. Syst, Archit (2019) 88. Mhenni, A., Cherrier, E., Rosenberger, C., Amara, N.E.B.: Double serial adaptation mechanism for keystroke dynamics authentication based on a single password. Comput. Secur. 83, 151–166 (2019) 89. Cha, S.C., Chen, J.F., Su, C., Yeh, K.H.: A blockchain connected gateway for ble-based devices in the internet of things. IEEE Access 6, 24639–24649 (2018) 90. Sanda, T., Inaba, H.: Proposal of new authentication method in Wi-Fi access using Bitcoin 2.0. In: IEEE 5th Global Conference on Consumer Electronics, GCCE 2016 (2016) 91. 
Mohsin, A.H., Zaidan, A.A., Zaidan, B.B., Albahri, O.S., Albahri, A.S., Alsalem, M.A., Mohammed, K.I.: Blockchain authentication of network applications: taxonomy, classification, capabilities, open challenges, motivations, recommendations and future directions (2019) 92. Kianmajd, P., Rowe, J., Levitt, K.: Privacy-preserving coordination for smart communities. In: Proceedings of the IEEE INFOCOM (2016) 93. Zyskind, G., Nathan, O., Pentland, A.S.: Decentralizing privacy: using blockchain to protect personal data. In: Proceedings of the 2015 IEEE Security and Privacy Workshops, SPW 2015 (2015)

486

M. Alkhammash et al.

94. Peterson, K., Deeduvanu, R., Kanjamala, P., Boles, K.: A blockchain-based approach to health information exchange networks. In: Proceedings of the NIST Workshop Blockchain Healthcare (2016) 95. Moin, S., Karim, A., Safdar, Z., Safdar, K., Ahmed, E., Imran, M.: Securing IoTs in distributed blockchain: analysis, requirements and open issues. Futur. Gener. Comput. Syst. 100, 325–343 (2019) 96. Wüst, K., Gervais, A.: Do you need a Blockchain? IACR Cryptology ePrint Archive(2017) 97. Apte, S., Petrovsky, N.: Will blockchain technology revolutionize excipient supply chain management? (2016) 98. Banerjee, M., Lee, J., Choo, K.K.R.: A blockchain future for internet of things security: a position paper. Digit. Commun. Netw. 4, 149–160 (2018) 99. Liu, B., Yu, X.L., Chen, S., Xu, X., Zhu, L.: Blockchain based data integrity service framework for IoT data. In: Proceedings of the IEEE 24th International Conference on Web Services, ICWS 2017 (2017) 100. Scarfone, K., Tracy, M.: Guide to General Server Security. Natl. Inst. Stand. Technol. 800, 123 (2008) 101. Zhu, H., Zhou, Z.Z.: Analysis and outlook of applications of blockchain technology to equity crowdfunding in China (2016)

Towards a Scalable IOTA Tangle-Based Distributed Intelligence Approach for the Internet of Things

Tariq Alsboui, Yongrui Qin, Richard Hill, and Hussain Al-Aqrabi

School of Computing and Engineering, University of Huddersfield, Huddersfield, UK
{tariq.alsboui,y.qin2,r.hill,h.al-aqrabi}@hud.ac.uk

Abstract. Distributed Ledger Technology (DLT) brings a set of opportunities for the Internet of Things (IoT), which leads to innovative solutions for existing components at all levels of existing architectures. IOTA Tangle has the potential to overcome current technical challenges identified for the IoT domain, such as data processing, infrastructure scalability, security, and privacy. Scaling is a serious challenge that influences the deployment of IoT applications. We propose a Scalable Distributed Intelligence Tangle-based approach (SDIT), which aims to address the scalability problem in IoT by adapting the IOTA Tangle architecture. It allows the seamless integration of new IoT devices across different applications. In addition, we describe an offloading mechanism to perform proof-of-work (PoW) computation in an energy-efficient way. A set of experiments has been conducted to prove the feasibility of the Tangle in achieving better scalability, while maintaining energy efficiency. The results indicate that our proposed solution provides highly-scalable and energy efficient transaction processing for IoT DLT applications, when compared with an existing DAG-based distributed ledger approach.

Keywords: Scalability · Distributed Ledger Technology (DLT) · IOTA Tangle · Internet of Things (IoT) · Distributed Intelligence (DI)

1 Introduction

Internet of Things (IoT) applications connect everyday objects to the Internet and enable the gathering and exchange of data to increase the overall efficiency of a common objective [1]. It is estimated that there will be approximately 125 billion devices connected to the Internet in 2030 [2–4]. Consequently, most IoT applications are required to be highly scalable and energy efficient, so that they are capable of dynamically responding to a growing number of IoT devices [5]. IoT applications have a number of common elements: (1) sensing to perceive the environment; (2) communication for efficient data transfer between objects; and (3) computation, which is usually performed to generate useful information from the raw data.


Distributed Intelligence (DI) is an approach that could address the challenges presented by the proliferation of IoT applications. DI is a sub-discipline of artificial intelligence that distributes processing, enables collaboration between smart objects, and mediates communications, thus supporting IoT system optimisation and the achievement of goals [6]. This definition is the basis for the research that is described in this article.

Contribution: We propose a system architecture for IoT, called the Scalable Distributed Intelligence Tangle-based Approach (SDIT). This research successfully addresses some of the technical challenges presented by the IoT, whilst also supporting the necessary proof-of-work (PoW) mechanism in an energy-efficient way. The key contributions are summarised as follows:

• A Tangle-based architecture that manages resources and enables the deployment of IoT applications, with the primary motivation being scalability.
• A task offloading mechanism for performing the proof-of-work (PoW) on powerful IoT devices, to minimise energy consumption when resources (such as power) are constrained.
• A set of experimental results that verify the effectiveness and contribution of the proposed approach.

The ultimate objective of this paper is to design and develop a scalable and energy-efficient IOTA Tangle-enabled IoT intelligent architecture to support DI. The proposed approach differs from other solutions by using a Tangle-based DLT with the primary motivation of being energy-efficient and scalable, to accommodate the growth of IoT while taking resource constraints into consideration. This work outlines the design of a scalable system that can be used in various IoT applications. The IOTA Tangle is used to achieve scalability, and a higher-level node is responsible for performing the proof of work (PoW) to minimise the energy consumption of IoT devices. The initial idea can be found in our previous positioning work [5].

The remainder of this paper is organised as follows: Sect. 2 describes distributed ledger technology from the perspective of IOTA. Section 3 presents the suitability of the IOTA Tangle for IoT. Section 4 discusses the differences between our work and other closely related work. In Sect. 5 we present our proposed SDIT system architecture. The performance of the proposed implementation is evaluated in Sect. 6. Finally, we conclude this paper and present interesting future directions in Sect. 7.

2 Distributed Ledger Technology

Distributed Ledger Technology (DLT) can be divided into three main types based on the data structure used for the ledger: BlockChain (BC) [7], IOTA Tangle (DAG) [8], and Hashgraph [9]. BC is a distributed, decentralised, and immutable ledger for storing transactions and sharing data among all network participants [10]. Hashgraph is considered an alternative to BC and is used to replicate state machines; it guarantees Byzantine fault tolerance under asynchrony and decentralization, with no need for proof-of-work (PoW), eventual consensus with probability one, and high speed in the consensus process [11]. BC has been criticized for its cost, energy consumption and lack of scalability. To overcome these limitations, the IOTA Tangle technology has been introduced as a decentralized data storage architecture and a consensus protocol based on a Directed Acyclic Graph (DAG). Each node in the DAG represents a transaction, and the connections between transactions represent the transaction validators [8].

BC technology has recently started to receive attention from both academia and industry, since it offers a wide range of potential benefits to areas beyond cryptocurrency (in particular the IoT), as it has unique characteristics such as immutability, reliability, fault-tolerance, and decentralization [12]. It is predicted that BC will transform IoT ecosystems by enabling them to be smarter and more efficient. According to the International Data Corporation (IDC), 20% of IoT deployments will employ a basic level of BC-enabled services [13]. This number will continue to increase, as the adoption of BC in the IoT is still in the early stages of innovation. Overall, BC is considered an effective solution to be integrated with the IoT to address some of the IoT technical challenges [14]. BC is potentially able to overcome some IoT issues, such as privacy and security [15]. However, building energy-efficient and scalable IoT applications remains a challenge. Firstly, all BC consensus mechanisms, in either private or public BC, require all fully participating nodes to retain copies of all transactions recorded in the history of the BC, which comes at the cost of scalability [12]. Furthermore, IoT devices have limited computational, memory and networking resources, which poses an issue when using BC-based architectures. Some IoT devices will not be able to engage in performing Proof of Work (PoW) consensus operations due to their limited computational power and battery life. Also, IoT devices do not always come with the storage space required to hold a complete copy of the BC [16]. With the IOTA Tangle, transactions are attached directly to the ledger without the need to wait, as each new transaction approves two previous transactions, called tips. Hence, the Tangle is more efficient than traditional BC under a well-designed consensus mechanism [17].
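To make this tip-approval structure concrete, the following is a minimal, illustrative Python sketch of a Tangle-like DAG in which every new transaction approves (up to) two existing tips. The class names, identifiers and the uniform random tip choice are our own assumptions for illustration only; they are not part of the IOTA reference implementation.

```python
import random

class Transaction:
    def __init__(self, tx_id, approves):
        self.tx_id = tx_id          # identifier of this transaction
        self.approves = approves    # the previous transactions it validates

class Tangle:
    """A minimal DAG ledger: each new transaction approves up to two tips."""
    def __init__(self):
        self.transactions = {"genesis": Transaction("genesis", [])}
        self.tips = {"genesis"}     # transactions not yet approved by anyone

    def attach(self, tx_id):
        # Pick up to two unapproved transactions (tips) to validate.
        chosen = random.sample(sorted(self.tips), k=min(2, len(self.tips)))
        tx = Transaction(tx_id, chosen)
        self.transactions[tx_id] = tx
        # Approved transactions stop being tips; the new one becomes a tip.
        self.tips.difference_update(chosen)
        self.tips.add(tx_id)
        return tx

tangle = Tangle()
for i in range(5):
    tangle.attach(f"tx{i}")
print("current tips:", tangle.tips)
```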

3 The IOTA Tangle Suitability for IoT

The IOTA Tangle is intuitively understandable, and the benefits offered by it can be employed to realise a DI approach. It offers a wide range of prospective modifications to fit specific goals. The scalability and flexibility essential for IoT can be obtained by IOTA technology. IOTA can facilitate IoT interactions in the form of transactions. The following are the potential benefits and motivations for integrating the IOTA technology in the IoT infrastructure:


• Scalability: IoT demands a scalable infrastructure to cope with the increasing number of IoT applications. The IOTA Tangle private network offers high scalability due to the unique design of the decentralized consensus Tangle-based architecture, in which users are also validators, and it has no scaling limitations.
• Decentralization: in centralized network architectures, the exchange of data is validated and authorized by central third-party entities. This leads to a higher cost in relation to centralized server maintenance. In an IOTA Tangle-based architecture, nodes exchange transactions with each other without relying on a central entity. Therefore, any participants who want to exchange transactions on the Tangle need to actively engage in consensus operations.
• Security and privacy: one of the most critical technical challenges of coping with IoT is related to network security and privacy. In order to ensure confidentiality and data protection, IOTA technology has developed Masked Authenticated Messaging (MAM), a second-layer data communication protocol that provides the ability to transmit and access encrypted data streams over the Tangle. MAM encryption offers three modes to control visibility and access to channels: public, private, and restricted. Consequently, it can encrypt, authenticate, and broadcast data to the IOTA network.
• Zero-fee transactions: IOTA does not require miners, as IOTA participants perform the proof-of-work (PoW) themselves. The transaction cost is regarded as the electricity required to validate two previous transactions in the working mechanism. This means that all network participants are required to contribute their computational power to maintain the network, thus removing transaction fees. The Tangle method allows IOTA to operate fee-free, making the network even more distributed.
• Energy-efficiency: IoT devices have limitations in terms of power consumption, and applications have to be developed to maximise energy efficiency in order to extend device and network life. IOTA technology enables the proof-of-work (PoW) to be outsourced to a more powerful device to reduce the energy consumption of constrained IoT devices.
• Resiliency: IoT applications require integrity in the data being transmitted and analyzed; therefore the IoT infrastructure is required to be resilient against data leaks and breakage (i.e., offline capability). An IOTA network has replicas of records stored over IOTA peers. This assists the maintenance of data integrity and, together with the offline Tangle capability, provides additional resilience for the IoT infrastructure.

4 Related Work

Recently, DI gained new momentum from researchers to overcome the technical challenges of IoT [18–23]. A distributed dataflow programming model to enable DI is proposed in [18]. The system consists of fog devices that are classified according to their computing resources into edge IO (input/output) and compute nodes. The input nodes are capable of brokering communications and data transfer to compute nodes. The compute nodes are responsible for processing the data arriving from input devices. Decisions about assigning logical nodes to physical devices are based on the capabilities of the nodes and on the system designer. The proposed architecture achieves scalability and mobility, and can cope with heterogeneity. However, privacy, offline capability, and resource efficiency are not considered in its design, and the approach is not suitable for time-critical applications that require fast responses.

An approach named PROTeCt (Privacy aRquitecture for integratiOn of internet of Things and Cloud computing) to enable DI is presented in [19]. The proposed approach consists of IoT devices and a cloud platform. The IoT devices are responsible for sensing and for implementing a cryptographic mechanism, i.e., a symmetric algorithm, to ensure privacy before transmitting the data to a cloud. Similarly, in [24], the authors presented an approach based on Mobile Cloud Computing to support DI. The main idea is to merge sensing and processing at different levels of the network by sharing the application's workload between the server side and the smart things, using a cloud computing platform when needed. The proposed approach enables real-time monitoring and analysis of all the data acquired by networked devices and provides flexibility in executing the application by using resources from cloud computing services. However, these approaches are neither scalable nor suitable for time-critical applications. Furthermore, resiliency of the system (i.e., offline capability), multi-party authentication for data security [25], and the fusion of data sources from external devices are not considered in their designs.

More advanced approaches are proposed in [20,21,23,26], which rely on fog computing to enable DI, for example the work presented in [20], in which the authors applied two techniques, device-driven and human-driven intelligence, to reduce energy consumption and latency. The approach relies on machine learning (ML) to detect user behaviors and adaptively adjusts the sampling rate of sensors and the resource schedules (timeslots in the MAC layer) between sensor nodes. Furthermore, an algorithm was designed to deal with the offloading of local tasks among a cluster of fog nodes to further reduce energy consumption, which may reduce energy demands and system latency. However, the approach is not scalable, interactions and information sharing among sensor nodes are not explicitly defined, and it lacks the mechanisms to deal with privacy and offline processing capability.

An architecture composed of three layers is proposed in [26]. The approach employs several technologies to achieve DI at different layers. It consists of three layers, each of which is responsible for managing a specific task. The first layer consists of IoT devices and sensor devices, which are responsible for measuring and capturing environmental data. The second layer comprises fog nodes, which are responsible for providing an offloading path for the data captured from a group of IoT devices. The third layer is the cloud, which is responsible for managing computing resources and data, and provides overall control and monitoring of the application.


The proposed approach leads to a reduction in energy consumption and latency. However, cooperation amongst the physical devices is not provided, and issues related to privacy [27] and offline capability are not considered.

In [21], the authors present a novel three-tier architecture to support DI. In the three-tier architecture, IoT components such as sensors, mobile phones, vehicles, base stations, network connections, and management elements are connected in a multi-tier distributed schema consisting of different levels of intelligence, named as follows: the group-of-devices tier, the regional tier, and the global tier. The group-of-devices tier consists of IoT devices and is responsible for managing distributed services (data) generated by sensors. The regional tier is made up of fog colony nodes that are considered intermediate nodes, responsible for data preprocessing and integration. The global tier consists of cloud data centres that are responsible for further data processing. The proposed approach is robust and reduces the cost of maintaining the fog computing paradigm. However, it lacks scalability, resource utilisation mechanisms and privacy, which are considered major challenges in IoT. Furthermore, it uses a predetermined static orchestration, which results in system failure due to the depletion of node energy.

Another architecture proposed to support DI is the Distributed Internet-like ArchiTecture (DIAT) [22]. The architecture is divided into three layers: virtualization of physical objects (VO), corresponding virtual object (CVOL), and service level (SL), all of which have their own functionalities and responsibilities. The virtualization of physical objects layer provides a semantic description of the capabilities and features of the associated real-world objects. The second layer is responsible for communicating and coordinating tasks coming from the VO layer. Finally, the service layer (SL) is responsible for the creation and management of services, and it handles various requests from users. The proposed architecture is scalable and interoperable, and privacy is considered. However, other IoT technical challenges, i.e., offline capability and the conservation of IoT resources, are not supported.

Another approach is introduced in [23], where the authors have developed several layers to achieve DI. The approach is divided into four layers: the first layer is cloud computing, consisting of a data center that provides wide monitoring and centralized control. The second layer comprises intermediate computing nodes that are responsible for identifying dangerous events and acting upon them. The third layer comprises low-power, high-performance edge devices connected to a group of sensors, responsible for handling the raw data coming from the sensors and performing analysis in a timely manner. Finally, the fourth layer consists of sensor nodes distributed to monitor the environment. The advantages of this approach are optimal responses in real time and low latency. However, IoT-related issues such as energy efficiency, scalability and privacy [28] are not considered in its design.

Most recently, a DAG-based scalable transactive smart home infrastructure is proposed in [17]. The approach adopts the IOTA Tangle to build an IoT smart home infrastructure focusing on scalability and security. A network of 40 nodes was established and divided into three main parts: smart homes, the Tangle of inter-house transactions (TXs), and smart devices in the homes. In all smart home deployments, there is an always-online computation device ("Home Node") with pre-installed firmware and a corresponding Tangle reference implementation. All home nodes are connected to their neighbours via TCP/UDP protocols for communication and for synchronizing the distributed ledger. The approach is scalable to a small number of nodes, and would consume a lot of energy since all nodes are required to perform the proof-of-work. Also, other IoT-related issues, such as offline capability, are not considered in its design. Furthermore, since the approach relies on a coordinator, full decentralization is not achieved. The DAG-based smart home approach is similar to the SDIT approach proposed in this paper. Both approaches utilise the IOTA Tangle with different numbers of nodes to achieve scalability, whereas our approach focuses more on the energy efficiency of constrained IoT devices as well as on maintaining a decentralized architecture.

5 SDIT: A Scalable Distributed Intelligence Tangle-Based Approach

In this section, we present our proposed Scalable Distributed Intelligence Tangle-based approach (SDIT), which aims at tackling scalability, energy efficiency, and decentralisation by adopting the IOTA Tangle technology.

5.1 SDIT: System Architecture

Figure 1 illustrates an abstract view of the proposed system architecture. The architecture is divided into three main parts: IoT devices, the Tangle that processes transactions (txs), and a PoW-enabled server. Each IoT device is connected with neighbour nodes via TCP/IP protocols for communication, and interaction with the Tangle is in the form of transactions. The Tangle is responsible for managing, collecting and processing the transactions. The PoW-enabled server has rich resources and is mostly responsible for performing all of the computations on behalf of the IoT devices. This is a critical task for minimising energy consumption. The Tangle can act as a data management layer for processing and storing data in an efficient way; however, the management of data processing is beyond the scope of this paper. The green boxes in Fig. 1 represent fully confirmed transactions, which means that they are approved by all of the current tips, whereas the red boxes are unconfirmed transactions. The blue boxes are the tips.

5.2 Consensus Mechanism Employed

Since we are utilising the IOTA Tangle to deal with transactions, we follow the same working principles, in which a new transaction must choose two previous unapproved transactions, called tips, to approve, based on the tip selection algorithm.


Fig. 1. The Scalable Distributed Intelligence Tangle-based approach (SDIT)

After tips are selected, the IOTA nodes are able to publish their new transactions to the Tangle. In the advanced Markov Chain Monte Carlo Tip Selection Algorithm (MCMC), N independent random walks are generated on the Tangle; the walks begin at the genesis or at a random node and keep moving along the edges of the Tangle based on a probability function. The MCMC algorithm ensures that the tips are selected nondeterministically along the path of the largest cumulative weight for a reasonable amount of time. The transition probability Pxy of a walk stepping from a transaction Lx towards a transaction Ly that approves it is proportional to f(−α(cx − cy)), where f is an increasing function (generally an exponential), α is a constant, and ci represents the cumulative weight of transaction i. The process ends when the walker reaches a tip, which is then selected for approval; typically, the first tip reached is the one selected.


For further details on the working mechanism of the MCMC algorithm, we refer the interested reader to the IOTA white paper [8]. To support the advanced tip selection process, the MCMC technique [8] applies a set of rules for deciding the probability of each step in a random walk, and works as follows:

1. Run the MCMC algorithm N times to choose 100 new transactions (tips). The probability of a transaction being accepted is therefore M of N (M is the number of times a tip is reached that has a direct path to the transaction).
2. Calculate how many tips are directly or indirectly connected to a particular transaction and decide with what probability the transaction will be accepted, as follows:
(a) if it is less than 50%, the transaction is not yet approved (not confirmed);
(b) if it is above 50%, the transaction has a fair chance of being approved (awaiting confirmation);
(c) if it reaches the level of 98% or 100%, the transaction is considered approved (fully confirmed).

A simplified sketch of this tip-selection and confirmation-confidence procedure is given below.
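The following Python sketch illustrates the weighted random walk and the confirmation-confidence rule described above. The toy graph, the cumulative weights, the exponential weighting and the value of α are assumptions made purely for illustration; this is not the authors' implementation, nor the IOTA reference code.

```python
import math
import random

# Toy Tangle: for each transaction, the transactions that approve it,
# plus an (illustrative) cumulative weight for every transaction.
approvers = {
    "genesis": ["a", "b"],
    "a": ["c"], "b": ["c", "d"],
    "c": [], "d": [],            # c and d are tips
}
cum_weight = {"genesis": 5, "a": 3, "b": 4, "c": 1, "d": 1}
ALPHA = 0.5                      # assumed value of the walk parameter

def weighted_walk(start="genesis"):
    """Random walk from the genesis towards a tip, biased by cumulative weight."""
    current = start
    while approvers[current]:
        nxt = approvers[current]
        # Step probability proportional to exp(-alpha * (c_current - c_next)).
        weights = [math.exp(-ALPHA * (cum_weight[current] - cum_weight[n])) for n in nxt]
        current = random.choices(nxt, weights=weights, k=1)[0]
    return current               # a tip

def confirmation_confidence(tx, tips_referencing, n_walks=100):
    """Fraction of walks ending on a tip that directly or indirectly approves tx."""
    hits = sum(1 for _ in range(n_walks) if weighted_walk() in tips_referencing[tx])
    return hits / n_walks

# Tips that reference each transaction, precomputed for the toy graph above.
tips_referencing = {"a": {"c"}, "b": {"c", "d"}, "genesis": {"c", "d"}}
conf = confirmation_confidence("b", tips_referencing)
print("confidence:", conf)       # >= 0.98 would be treated as fully confirmed
```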

5.3 Computation Offloading

Offloading can be divided into two categories: data offloading and computation offloading. The former refers to the use of novel network techniques to transmit mobile data originally planned for transport via cellular networks. The latter refers to the offloading of heavy computation tasks to conserve resources [29]. It is commonly assumed that the implementation of computation offloading depends heavily on the design of the network architecture. The main goal of offloading is to reduce total energy consumption, overall task execution time, or both. The IOTA PoWbox (Proof of Work box) is a service provided by the IOTA Foundation that allows the offloading of the PoW to nodes with rich resources, thus reducing the energy consumption of constrained IoT devices and speeding up the development workflow [30]. The authors in [31] suggested conserving the energy of IoT devices by performing the proof-of-work operation offline on a device with rich resources, thus achieving improved energy efficiency. Figure 2 illustrates the computation offloading mechanism used in the SDIT approach, in which the computation of performing the PoW is offloaded to a device with richer resources. This leads to a reduction in energy consumption in constrained IoT devices. In particular, we achieve scalability, decentralization and energy efficiency by adopting the IOTA Tangle and its consensus mechanism. We have presented the proposed approach in view of the architecture, the consensus mechanism, and the computation offloading technique employed. A sketch of such an offloading configuration is given below.
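As a concrete illustration of this offloading idea, the sketch below uses iota.lib.py (PyOTA), the Python library also used for the experiments in Sect. 6. The node URL and address are placeholders, and the availability of the local_pow switch depends on the PyOTA version, so this should be read as an assumed configuration rather than the exact setup used in this work.

```python
from iota import Iota, ProposedTransaction, Address, Tag, TryteString

# Placeholder URL of a resource-rich node that performs the PoW on our behalf.
POW_NODE = 'https://pow-server.example.com:14265'

# local_pow=False asks the remote node to do the PoW (availability of this
# flag depends on the PyOTA version); the constrained device only builds and sends.
api = Iota(POW_NODE, local_pow=False)

reading = ProposedTransaction(
    address=Address(b'9' * 81),               # placeholder address (all 9s)
    value=0,                                  # zero-value data transaction
    tag=Tag(b'SDIT9DEMO'),
    message=TryteString.from_unicode('temperature=21.5C'),
)

# MWM 14 is one of the values evaluated in Sect. 6; depth 3 is a common default.
result = api.send_transfer(transfers=[reading], depth=3, min_weight_magnitude=14)
print('attached bundle:', result['bundle'].hash)
```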

6 Experimental Results, Evaluation and Analysis

In this section, we describe the implementation used in our experiments, followed by an evaluation of the proposed solution with regard to scalability, energy efficiency, and decentralization. We then conduct an analysis and discuss the results obtained, to highlight the useful characteristics of the IOTA Tangle for IoT.

Fig. 2. Computation offloading in SDIT approach

6.1 Environment Setup

We have deployed the latest release of the IOTA Reference Implementation (IRI 1.8.1), the official Java build embodying the IOTA network specifications (https://github.com/iotaledger/iri/releases/tag/v1.8.1-RELEASE), on the DigitalOcean cloud platform (https://www.digitalocean.com), and another IOTA Reference Implementation (IRI 1.8.1) on a local server dedicated to performing the Proof of Work (PoW). The functionality related to IOTA addresses, transactions, broadcasting, routing, and multi-signatures has been implemented using iota.lib.py [32], the official Python library of the IOTA distributed ledger. In total, a large number of nodes with the specifications of medium-size virtual machines (4 GB RAM, 2 VCPU and 60.0 GB disk) are used to create the network. We have used medium-size nodes and nodes with rich resources because this is more representative of real-world IoT scenarios. Nodes with rich resources enhance the performance by reducing the time it takes to perform the PoW. In order to measure transaction speed and scalability, we configured each sending node to initiate a fixed number of transactions (5); a minimal sketch of this sending-node configuration is given at the end of this subsection. We also used a set of different MWM values (9, 11, 13, 14). These transactions are broadcast among all nodes through TCP/IP. We have tested the Transactions Per Second (TPS) and Confirmed Transactions Per Second (CTPS) under different numbers of nodes (50, 100, 150, up to 290; due to resource constraints, we could only run up to 290 nodes) with the different MWM configurations presented above, as shown in Fig. 3 and Fig. 4, respectively. TPS is defined as the number of transactions published to the network per second, and CTPS is defined as the number of transactions that move from pending to confirmed per second.

Fig. 3. Scalability in Tangle with 290 Nodes

Fig. 4. Scalability in Tangle with 290 Nodes

Fig. 5. Performance of TPS under different MWM

Fig. 6. Performance of CTPS under different MWM
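The following is a minimal sketch of the sending-node configuration described above, assuming PyOTA and a reachable IRI endpoint (the URL and address are placeholders). It issues the fixed batch of five zero-value transactions at a chosen MWM and records publish timestamps from which TPS can be estimated; it is an illustration of the described setup, not the benchmarking code used to produce Figs. 3 to 6.

```python
import time
from iota import Iota, ProposedTransaction, Address, Tag, TryteString

IRI_NODE = 'http://iri-node.example.com:14265'   # placeholder IRI 1.8.1 endpoint
MWM = 14                                         # one of the evaluated values: 9, 11, 13, 14
TX_PER_NODE = 5                                  # fixed number of transactions per sending node

api = Iota(IRI_NODE)
publish_times = []

for i in range(TX_PER_NODE):
    tx = ProposedTransaction(
        address=Address(b'9' * 81),              # placeholder address (all 9s)
        value=0,
        tag=Tag(b'SDIT9BENCH'),
        message=TryteString.from_unicode(f'benchmark tx {i}'),
    )
    api.send_transfer(transfers=[tx], depth=3, min_weight_magnitude=MWM)
    publish_times.append(time.time())

# TPS: transactions published per second over the measurement window.
elapsed = max(publish_times[-1] - publish_times[0], 1e-9)
print('TPS estimate:', (len(publish_times) - 1) / elapsed)
# CTPS would be computed the same way from the timestamps at which each
# transaction is first reported as confirmed (e.g. via get_inclusion_states).
```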

6.2 Results and Analysis

In this part, we present the performance in terms of scalability and energy efficiency, evaluated over several runs to obtain accurate results. We compared it against one of the recent approaches in the literature, namely the DAG-based smart communities approach [17]. Their publication gave the full specifications, making it possible for researchers to implement and reproduce the published results. Finally, they achieved promising results for smart homes and are planning to extend their work with further comparisons.

Scalability: The results can be seen from Fig. 3 to Fig. 6. As shown in Fig. 3 and Fig. 4, which give the TPS/CTPS results for different numbers of nodes when the MWM is set to 14, it is clear that as the number of nodes increases, the TPS/CTPS transaction speed increases approximately linearly. Therefore, the transaction speed has good linear scalability when the number of nodes increases. For example, when 50 nodes are sending transactions, the SDIT-based approach reaches a TPS of 1.376 tx/s and a CTPS of 6.418 tx/s, whereas with the DAG-based approach the TPS reaches 1.743 tx/s and the CTPS reaches 7.519 tx/s. This demonstrates that our proposed SDIT-based approach outperforms the DAG-based approach and performs well when the number of nodes increases.

Throughput: As shown in Fig. 3 and Fig. 4, it is clear that our proposed SDIT-based approach outperforms the DAG-based approach in terms of efficiency in processing transactions. For example, in the situation in which 10 nodes are sending, the average TPS reaches 1.132 tx/s and the CTPS 6.234 tx/s in the SDIT-based approach, whereas in the DAG-based approach the TPS reaches 1.314 tx/s and the CTPS reaches 7.256 tx/s. This is due to the computation offloading mechanism used in the SDIT-based approach.

Energy-Efficiency: The nodes involved in performing PoW have an impact upon total energy consumption. Therefore, computation offloading not only conserves energy but also improves the time to process transactions. The DAG-based approach consumes more energy than our proposed SDIT-based approach, since its IoT devices are required to perform the PoW. Consequently, the SDIT-based approach outperforms the DAG-based approach in terms of energy consumption. The experiments in Fig. 5 and Fig. 6 were conducted to test the effect of the MWM on the TPS and CTPS. In these experiments, we set the MWM to 9, 11, 13 and 14 to measure the effect on the TPS/CTPS. In Fig. 5, it is clear that the TPS is affected by the use of different MWM values: when it is set to 13, the TPS almost reaches 5.321 tx/s, and when it is set to 14, it almost reaches 6.591 tx/s. From Fig. 6, the changes in MWM have almost no influence on the CTPS.

Decentralization: Our proposed SDIT-based approach outperforms the DAG-based approach in terms of decentralization. This is due to the use of the consensus mechanism in the SDIT-based approach.

7 Conclusions and Future Work

The work proposed in this paper is an important step towards the integration of the IOTA Tangle DLT with the IoT. The results indicate that an IOTA Tangle can scale to a large number of IoT devices, thus addressing the scalability challenges in the IoT domain.


The IOTA Tangle can achieve considerable energy savings since IoT devices do not engage in performing the PoW. Compared to existing work, SDIT enables high scalability and energy efficiency for building large-scale IoT applications. There are a number of limitations in the paper so far that need to be addressed in the future, for example, the cost incurred by maintaining and deploying dedicated servers for performing the PoW. There are several interesting directions for future work that we intend to follow. First, we plan to investigate the usefulness of a Mobile Agent in assisting the Tangle in supporting DI, i.e., delegating low-level intelligence to various network and application functions. For instance, IoT devices running the IOTA light node are not engineered to allow collaboration among these nodes in the network. This, in particular, would be useful for IoT applications such as target tracking, where data needs to be shared among nodes to provide accurate location information. Second, a Tangle can be used to solve the problem of offline capability. This task is not simply a network entity configuration problem; the major issue is related to clustering the network. However, it can be achieved by creating offline Tangles, where a certain number of nodes can effectively go offline and issue transactions among themselves. This means that an active Internet connection is not needed at all times for the Tangle to function. Upon completion, it is possible to attach the transactions of the offline Tangle back to the online one. Third, we intend to explore the average confirmation rate of the transactions, which provides an insight into transaction time latency. This metric will be affected by the value of α, the constant that affects confirmation rates. Therefore, an exploration will configure and test α under different scenarios to improve confirmation rates. Finally, we are planning to explore the benefits offered by IOTA to other areas, such as Wireless Sensor Networks (WSNs). This will not necessarily be pertinent to the scalability and energy-efficiency issues; in particular, the work will focus on customising the IOTA Tangle to drive an efficient routing protocol for IoT, taking into consideration various factors, such as Quality of Service. In addition to that, we shall investigate the possibility of adapting it to suit Information Extraction (IE) techniques in WSNs [33], therefore not limiting the benefits of IOTA to a specific problem or problem domain.

References

1. Atzori, L., Iera, A., Morabito, G.: The internet of things: a survey. Comput. Netw. 54(15), 2787–2805 (2010) 2. Cisco. Internet of things at a glance, 1 December 2016 3. Gartner. Gartner says the internet of things installed base will grow to 26 billion units by 2020, 1 December 2013 4. API Research. More than 30 billion devices will wirelessly connect to the internet of everything in 2020, 1 May 2013 5. Alsboui, T.A.A., Qin, Y., Hill, R.: Enabling distributed intelligence in the internet of things using the iota tangle architecture. In: IoTBDS (2019)


6. Lynne, P.: Distributed intelligence: overview of the field and its application in multi-robot systems. In: The AAAI Fall Symposium Series. AAAI Digital Library (2007) 7. Nakamoto, S., et al.: Bitcoin: a peer-to-peer electronic cash system (2008) 8. Popov, S.: The tangle, 1 October 2017 9. El Ioini, N., Pahl, C.: A review of distributed ledger technologies. In: Panetto, H., Debruyne, C., Proper, H.A., Ardagna, C.A., Roman, D., Meersman, R. (eds.) On the Move to Meaningful Internet Systems. OTM 2018 Conferences, pp. 277–288. Springer, Cham 2018 10. Antonopoulos, A.M.: Mastering Bitcoin: Unlocking Digital Crypto-Currencies, 1st edn. O’Reilly Media, Inc., Sebastopol (2014) 11. Cao, B., Li, Y., Zhang, L., Zhang, L., Mumtaz, S., Zhou, Z., Peng, M.: When internet of things meets blockchain: challenges in distributed consensus. IEEE Netw. 33, 1–7 (2019) 12. Ali, M.S., Vecchio, M., Pincheira, M., Dolui, K., Antonelli, F., Rehmani, M.H.: Applications of blockchains in the internet of things: a comprehensive survey. IEEE Commun. Surv. Tutor. 21(2), 1676–1717 (2019) 13. I-SCOOP. Blockchain and the internet of things: the IoT blockchain opportunity and challenge, 1 February 2018. Accessed 19 Sept 2019 14. Dorri, A., Kanhere, S.S., Jurdak, R.: Towards an optimized blockchain for IoT. In: 2017 IEEE/ACM Second International Conference on Internet-of-Things Design and Implementation (IoTDI), pp. 173–178, April 2017 15. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016) 16. Biswas, K., Muthukkumarasamy, V.: Securing smart cities using blockchain technology. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 1392–1393, December 2016 17. Fan, C., Khazaei, H., Chen, Y., Musilek, P.: Towards a scalable dag-based distributed ledger for smart communities. In: 2019 IEEE 5th World Forum on Internet of Things (WF-IoT), pp. 177–182, April 2019 18. Giang, N.K., Blackstock, M., Lea, R., Leung, V.C.M.: Developing IoT applications in the fog: a distributed dataflow approach. In: 2015 5th International Conference on the Internet of Things (IOT), pp. 155–162, October 2015 19. Pacheco, L.A.B., Alchieri, E.A.P., Barreto, P.A.S.M.: Device-based security to improve user privacy in the internet of things. Sensors 18(8), 2664 (2018) 20. La, Q.D., Ngo, M.V., Dinh, T.Q., Quek, T.Q.S., Shin, H.: Enabling intelligence in fog computing to achieve energy and latency reduction. Digital Commun. Netw. 5(1), 3–9 (2019). Artificial Intelligence for Future Wireless Communications and Networking 21. Tran, M.-Q., Nguyen, D.T., Le, V.A., Nguyen, D.H., Pham, T.V.: Task placement on fog computing made efficient for IoT application provision. Wirel. Commun. Mob. Comput. (2019) 22. Sarkar, C., SN, A.U.N., Prasad, R.V., Rahim, A., Neisse, R., Baldini, G.: Diat: a scalable distributed architecture for IoT. IEEE Internet Things J. 2(3), 230–239 (2015) 23. Tang, B., Chen, Z., Hefferman, G., Wei, T., He, H., Yang, Q.: A hierarchical distributed fog computing architecture for big data analysis in smart cities. In: Proceedings of the ASE BigData & SocialInformatics 2015, ASE BD&SI 2015, pp. 28:1–28:6. ACM, New York (2015)


24. Mora, H., Pont, M.T., Gil, D., Johnsson, M.: Collaborative working architecture for IoT-based applications. Sensors 18, 1676 (2018) 25. Al-Aqrabi, H., Hill, R.: Dynamic multiparty authentication of data analytics services within cloud environments. In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, United States, pp. 742–749. IEEE Computer Society (2019) 26. Muthanna, A., Ateya, A.A., Khakimov, A., Gudkova, I., Abuarqoub, A., Samouylov, K., Koucheryavy, A.: Secure IoT network structure based on distributed fog computing, with SDN/blockchain (2019) 27. Al-Aqrabi, H., Johnson, A., Hill, R.: Dynamic multiparty authentication using cryptographic hardware for the internet of things. In: IEEE Smartworld Congress 2019, United States. IEEE Computer Society, May 2019 28. Al-Aqrabi, H., Johnson, A.P., Hill, R., Lane, P., Liu, L.: A multi-layer security model for 5G-enabled industrial internet of things. In: 7th International Conference on Smart City and Informatization (iSCI 2019), Guangzhou, China, 12–15 November 2019. Lecture Notes in Computer Science, Switzerland. Springer, Singapore (2019) 29. Peng, K., Leung, V., Xiaolong, X., Zheng, L., Wang, J., Huang, Q.: A survey on mobile edge computing: focusing on service adoption and provision. Wirel. Commun. Mob. Comput. 2018, 10 (2018) 30. IOTA Foundation. Minimum weight magnitude, 1 November 2017. Accessed 6 Jan 2019 31. Elsts, A., Mitskas, E., Oikonomou, G.: Distributed ledger technology and the internet of things: a feasibility study, pp. 7–12, November 2018 32. IOTA Foundation. PyOTA: The IOTA Python API Library, 1 February 2018. Accessed 8 Aug 2019 33. Alsbou´ı, T., Hammoudeh, M., Bandar, Z., Nisbet, A.: An overview and classification of approaches to information extraction in wireless sensor networks (2011)

An Architecture for Dynamic Contextual Personalization of Multimedia Narratives in IoT Environments

Ricardo R. M. do Carmo 1 and Marco A. Casanova 2

1 University of the State of Amazonas, Manaus, AM, Brazil
2 Pontifical University of Rio de Janeiro, Rio de Janeiro, RJ, Brazil

Abstract. The proliferation of shared multimedia narratives on the Internet is due to three main factors: increasing number of narrative producers, availability of narrative-sharing services, and increasing popularization of mobile devices that allow recording, editing, and sharing narratives. These factors characterize the emergence of an environment we call Internet of Narratives. One of the issues that arise with this environment is the cognitive overload experienced by users when consuming narratives. In this context, consuming a narrative means not only choosing from a large number of possibilities, but also how a narrative must be personalized to suit the user profile and the ubiquitous features of presentation environments. Narrative personalization is not restricted to removing, reordering and restructuring narratives, but it also covers configuring presentation environments according to narrative and user profiles. Through Internet of Things devices, narratives can sense the environment context and actuate on it to offer a more immersive and interactive consumption experience. This article proposes a middleware architecture for personalization of multimedia narratives. This architecture considers the ubiquitous characteristics of the IoT presentation environment of multimedia narratives and the continuous stream of unstructured information of such environments.

Keywords: Adaptation · Synchronization · Personalization middleware · Multimedia narratives · IoT systems · Internet of Narratives · Context-aware pervasive systems

1 Introduction

A narrative is the specification of a logical ordering, possibly non-linear, of real and fictitious events to present a story [1]. Events are occurrences in time that have some essential associated properties. A multimedia narrative is a narrative that can be presented using multiple synchronized media types. The growing number of distributed multimedia narratives and the diversity of presentation contexts has led to the emergence of the crisis of choice [2].


Reducing this cognitive overload means adopting more sophisticated search and personalization mechanisms. The user profile and information about the presentation environment are important elements that can help diminish that overload. The advent of Ubiquitous Computing, the Internet of Things (IoT) and Cognitive Computing are other factors that have contributed to changing the multimedia narrative consumption paradigm. Consider the following scenario. John works in the area of home automation. His home has many intelligent IoT appliances and devices, for example, lighting, air conditioning, TVs and entertainment systems, etc. His preferred multimedia narratives are about sports, news and films. His wife Martha has similar interests, except for the sports and certain film genres. As they have no time for selecting narrative content, they acquired a personalization system, called Jarvis, that, based on their profiles and contextual information, searches and personalizes narratives for them. The system personalizes narratives, i.e., it selects and reorders the events that make up the original narrative and adapts these events in terms of the media objects used to present the narrative. For example, considering John's preferences, the system sets up alternative content with news about economy and sports, selects fiction and adventure films, and others that fit his profile. As John also likes an alternative order for the events about news and sports, i.e., an order different from the original news narrative, the system reorders these events. Martha doesn't like sports commercials, so Jarvis automatically extracts such commercials from Martha's preferred narratives and adapts the quality of the media objects (audio, video, and image) of these narratives to be presented on her tablet. Their children, Joseph and Mary, five and seven years old, respectively, like cartoons and educational programmes. Not all content is suitable for the children. For example, some adult commercials are not to be presented to them. The system takes these restrictions into consideration during the narrative personalization process and sets up for them, possibly dynamically, an alternative narrative, i.e., a narrative that has some of its parts adapted. One of John's and Martha's favorite movies is Star Wars: Rogue One. The narrative plot has many scenes in which lighting and temperature change. To offer an immersive interaction with the movie, the system dynamically synchronizes the IoT appliances and devices with the narrative, changing the temperature, lighting, sound and other room contextual aspects. In this scenario, it is not hard to notice some challenges that current systems are still not able to handle very well. In this paper we propose a middleware architecture to handle the challenges the scenario exposes. The proposed middleware aims to offer ways for dynamically personalizing multimedia narratives and presentation environments according to user contexts. In order to enable this, we use a narrative meta-model called NEMo [1], which helps specify multimedia narratives at a higher abstraction level, the event abstraction level. A consequence of this approach is that media objects are a way of presenting events, i.e., they are considered a property of events and they offer different contextual visions of events. Our proposal considers multimedia narratives as a means of sharing and presenting experiences.
This approach provides a better abstraction for specifying multimedia narratives and allows for greater flexibility with respect to their personalization.


Multimedia Narrative Personalization (MNP), in this paper, involves removing, restructuring and reordering the sequence of events that defines a narrative, adapting media objects to the requirements of the presentation system and user profiles, and dynamically configuring the presentation environment. Dynamic configuration of the presentation environment is achieved through IoT devices that are bound to a narrative. Through IoT devices, narratives can sense the environment context (room and user contexts) and actuate on it (adjusting room temperature and light intensity, sound system preferences, etc.) in order to offer a more immersive and interactive consumption experience. Usually narrative personalization is mostly a manual process, but in our proposal some tasks, for example, event selection, presentation template selection, and media object selection and adaptation, are considered automated processes. Another important feature we highlight is that user interaction is less obtrusive thanks to sensors and services that capture user contextual information. The rest of this paper is organized as follows. Section 2 presents basic concepts. Section 3 discusses related work. Section 4 analyses the requirements that guided our proposal. Section 5 introduces the middleware architecture. Section 6 presents a usage example. Finally, Sect. 7 presents our conclusions.

2 Basic Concepts

2.1 Synchronization

Synchronization is used to coordinate elements of an application domain to make them work in harmony. Synchronizing assumes three essential parts: the application domain, the synchronization criteria and the method. A set of criteria defines the rules of the synchronization method. As there are an unlimited number of application domains and criteria, the types of synchronization are also unlimited. However, most types share a causal structure of relationship. A causal structure is a cause-and-effect relation between two parts, where the first part is partially responsible for the second and the second is partially dependent on the first. A complex event is a compound event that has a logical internal structure. Soccer matches and TV programmes (film, newscast, soap opera, etc.) are some events that require synchronization of (sub-)events. Complex events are also called narratives. Events are occurrences that, in addition to the causal, temporal and spatial properties, have structure, media objects and two main types of information: a semantic description and a technical description [1]. Text, image, audio, video, among others, and their compositions are some examples of media objects. In this paper, to synchronize events means to define a total or partial order of events bound through causal relationships. Causal relations may be defined using a set of synchronization criteria. Temporal synchronization may be used to define a chronological ordering of events, and spatial synchronization to trace a route of the places of occurrence of events. Media objects may be used to specify a presentation ordering in which events with video are presented first, followed by events with images. The informational property, in addition to contributing to the semantic and technical description of events and media objects, can be used to define a synchronization of events according to a set of structured concepts.

news about sports, economics and science interleaved by commercials. Event technical description is metadata that record the event technical attributes, such as creation information (author, date and time of creation) and media object technical description (production and creation information, capturing process, file formats, resolutions, color profiles, etc.). Event semantic information is semantic metadata that record concepts and agents that participate in the event. Multimedia presentation is a composition of synchronized different types of media elements. This composition defines a multimedia document, which is a natural extension of a conventional textual document in multimedia area. Multimedia documents rely on the experiential property, i.e., media objects for synchronizing spatially and temporally events. Notice that media objects are the main representatives of events in multimedia area. Models in which events are attached to media objects are inappropriate to specify narrative. The main reason is that media content is just another source of information and, as stated before, they are used to present events. Other reasons are the difficulty in classifying media content using tags, references, etc., they can lead to duplication of information about events [1], and scalability problems [3, 4]. In this work we do not treat event synchronization using media objects, but we use events itself. For this we use NEMo, a meta-model that specifies narratives in terms of atomic and compound events that have some associated properties [1]. Besides date and time, duration and name of events, it may be available information about agents (presenter, players, reporters etc.), a short description, a list of similar events, among others. Some synchronization methods, such as block programming, cross-programming, hammocking, spoiling [6], use this information to set a pre-sync. Personalized Electronic Program Guides (pEPGs), proposed by [7], use temporal, spatial and informational properties to recommend multimedia narratives. Few authors, such as [8], use Radio Frequency Identification (RFID) devices to gather contextual user information and semantic information to provide pEPGs. 2.2 Narratives As defined in Sect. 2.1, a narrative is a complex event resulting from the synchronization of real and fictitious events. This synchronization specifies a logical ordering, possibly non-linear, of those events to present a story. A narrative has five elementary context-dependent aspects (structural, causal, temporal, spatial and experiential) and one partially context-independent aspect (informational). The experiential aspect perceptually registers and presents events. This aspect is characterized by media types and multimedia documents. Narratives represented by multimedia documents are called multimedia narratives. Electronic games, online newspapers, webpages, interactive TV programmes are some examples. Media objects can be considered low-level abstractions of events. Events therefore are situated at a higher abstraction level, the event abstraction level. At this level, events are first-class entities that have context-dependent and partially independent properties, which can be used to define synchronization criteria.
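To make the notion of events as first-class entities more concrete, the sketch below models an atomic event carrying the temporal, spatial, causal, informational and experiential properties discussed above, and derives a presentation order from its causal links. This is only an illustration written for this text: NEMo's actual concrete syntax is defined in [1], and the class and attribute names used here are our own assumptions.

# Illustrative sketch (not NEMo's concrete syntax): an event as a
# first-class entity with the property groups discussed in Sect. 2.1.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    name: str
    start: Optional[datetime] = None                         # temporal property
    duration_s: Optional[int] = None
    place: Optional[str] = None                               # spatial property
    agents: List[str] = field(default_factory=list)           # informational property
    description: str = ""
    media_objects: List[str] = field(default_factory=list)    # experiential property
    caused_by: List["Event"] = field(default_factory=list)    # causal links


def causal_order(events: List[Event]) -> List[Event]:
    """Return a total order compatible with the (partial) causal order:
    an event is presented only after all events that caused it.
    Assumes acyclic causality and that all causes are in the list."""
    ordered, placed = [], set()
    while len(ordered) < len(events):
        for e in events:
            if e.name not in placed and all(c.name in placed for c in e.caused_by):
                ordered.append(e)
                placed.add(e.name)
    return ordered


# Usage: a goal event is caused by a pass event, so the pass is presented first.
pass_evt = Event("pass", agents=["player A"], media_objects=["pass.mp4"])
goal_evt = Event("goal", agents=["player B"], caused_by=[pass_evt])
print([e.name for e in causal_order([goal_evt, pass_evt])])  # ['pass', 'goal']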

2.3 Internet of Narratives The growth of multimedia narratives in recent years is due to (i) the increasing number of users producing narratives, (ii) the availability of multimedia narrative sharing services, and (iii) the increasing popularization of handheld devices, such as smartphones and tablets, which enable recording, editing and sharing of multimedia narratives. Considering this, it is possible to notice that users are immersed in an Internet of multimedia narratives, or simply Internet of Narrative (IoN). IoN is in the context of connected entertainment, a subset of IoT entertainment industry. In IoN, IoT is used for gathering, measuring, analyzing and processing huge amounts of data and triggering actions to provide more intelligent, automated, and personalized multimedia narrative distribution services. In this context data are particularly fundamental to provide better services. One of these services is narrative personalization. 2.4 Personalization Personalization consists of pre-tailoring a product or service to the unique characteristics and needs of a user profile [9] and it aims to increase user satisfaction and make interactions faster and easier. Specifying user profiles enables mass personalization [10– 12]. A user profile is a unique set of users (or potential users) that share some common characteristics that make them different from other users. Personalize narratives means tailoring a narrative to the needs, characteristics and specifications of a user. Adaptive Hypermedia (AH) [13] systems use personalization processes that explore the essential aspects of narratives to allow insertion, adaptation and recommendation of relevant content to the user. Due to explicit user model, a distinctive feature of adaptive systems, it is possible to set user profile (implicit behavior, user knowledge, preferences, interests and needs, etc.), system interaction data, and any other explicit information provided [14]. Although personalization processes vary according to the domain of the system, they can be classified into four distinct general classes: Adaptive Content Selection (ACS); Adaptive Content Presentation (ACP); Adaptive Navigation Support (ANS); and the Composite Class. As synchronization plays an important role in all these classes, personalized narratives can be specified. In ACS, a personalized narrative is an alternative narrative represented by a Web page composed of ads and news that may be of interest of the user. An alternative narrative is an interpretation or vision of the original narrative [1]. An alternative narrative exists when at least some of the parts of the original narrative are adapted. ACS techniques are also used in adaptive recommendation systems, such as Spotify®, Deezer®, Amazon®, Netflix®, and YouTube® to predict the probability of a user preferring an item [15]. In these systems, however, better recommendations could be obtained if ambiguity were removed with the use of ontologies [16]. In ACP systems the presentation of the content is adapted according to the user context. In this field researches focus on educational hypermedia and web information systems, such as virtual museums, electronic encyclopedias, electronic guides, etc. In video on demand (VoD) services, such as YouTube, Netflix and Amazon Prime Video, only video quality is adapted.

ACP can be useful to enable alternative narrative in VoD and audio on demand (AoD) systems. [17–19] propose methods for dynamically inserting content in the main multimedia narrative. [17, 18] suggest real time insertion methods. [19] uses contextual information. However, some challenges still need to be overcome, such as automatic annotation and summarization, dynamic hiding and highlighting of media types, synchronization, ordering and continuity maintenance [20, 21], to name a few. According to [23] the theories behind ASN go beyond the scope of hypermedia. We argue that theories of all personalization processes classes also go beyond that scope and can be useful to multimedia narratives. Narrative conceptual models associated with these techniques can be used to dynamically generate personalized narratives [24] with a high level of personalization. This is especially true for multimedia narratives. However, to achieve this level of automation some requirements, such as segmentation of multimedia narratives [25], and models and methods that allow specifying abstractly these clips, are needed. Another important requirement is the use of user context to reduce pre-defined selection options and user explicit interaction. In this work this type of personalization is called Dynamic Contextual Personalization (DCP). 2.5 IoT as a Source of Contextual Information Until now IoT devices were not considered as a source of contextual information for personalization of narratives. The main relevance of IoT is to maximize the user’s consumption by offering personalized content. To do this, IoT sensors helps gathering and integrating user data and devices to generate behavioral statistics. Such contextual information could help multimedia systems by supporting the generation of alternative narratives. For example, supposing John has some heart disease, in order to offer him a narrative that do not compromise his health condition, reports on his blood pressure could be used to help adapting the presentation of some scenes of Rogue One. IoT smart sensor are usually used to authenticate users. A sensor is an autonomous physical object used to detect and/or respond to changes in the environment. An intelligent sensor is a smart physical object (SPO) with integrated computational resources that detects environment changes and performs predefined functions to process the gathered data. Wearable and face/voice recognition SPOs might support multimedia systems generating alternative narrative by providing identification, user mood and health status. Using the user preferences, it could be also possible block or stop dynamically multimedia narratives presentation, adapt sound and image, select scenes with the preferred characters in order to generate a personalized narrative. In such scenarios, the multimedia narrative presentation systems become the command center of personalization, in which narratives, SPOs and user profiles are synchronized in order to offer an immersive experience to users. The main purpose in synchronizing all things, that is, data and events generated by IoT devices, user context management systems, and multimedia presentation systems, is not only to provide personalized recommendation and search, but also narrative consumption that is more targeted to the user. 
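As a concrete illustration of the John example above, the sketch below shows how readings from a wearable blood-pressure SPO might gate the presentation of an intense scene. The threshold, the function names and the idea of tagging scenes with an intensity level are assumptions made for the example, not part of the proposal's specification.

# Hedged sketch: deciding whether to adapt a scene based on wearable SPO data.
from statistics import mean


def recent_systolic_readings(user_id: str) -> list:
    # Stand-in for a wrapper/AAS call that would fetch readings from the
    # user's wearable SPO; values here are hard-coded for the example.
    return [148, 152, 150]


def choose_scene_version(user_id: str, scene: dict) -> str:
    """Return which version of a scene to present for this user.

    'scene' is assumed to carry an 'intensity' tag and two alternative
    media objects; 140 mmHg is an arbitrary illustrative threshold."""
    if scene["intensity"] == "high" and mean(recent_systolic_readings(user_id)) >= 140:
        return scene["calm_version"]     # adapted presentation
    return scene["original_version"]


scene = {"intensity": "high",
         "original_version": "battle_scene_full.mp4",
         "calm_version": "battle_scene_softened.mp4"}
print(choose_scene_version("john", scene))  # battle_scene_softened.mp4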
The main challenges to achieve this level of synchronization are: the need of a specification model so narratives can be generated and personalized dynamically; the lack of open standard for integrating proprietary devices [26]; and an approach to deal with growing number of multimedia narratives.

One approach to address these challenges is through the adoption of a middleware. A middleware could manage the massive amount of data generated in IoT environments, increase interoperability between devices and services of different manufacturers, deal with each aspect of communication between devices, manage profiles and contextual data of user and presentation room, personalize narratives according to user needs, among other challenges.

3 Related Work IoT management architectures, such as [27], are designed to be scalable and to operate autonomously, causing contextual data gathering and processing to continue even though the related application is not active. Given the heterogeneity of device models and data formats, promoting interoperability between devices and the architecture is therefore essential. To ensure this an integration reference model is crucial. Although the paper defines Communication Coordinator module as responsible for ensuring interoperability, it does not define a standard or reference model for integration. Furthermore, it does not define a model for specifying profiles, an essential requirement for the management of those devices. In [28] is described a distributed architecture of an adaptive pervasive system, whose purpose is to provide real-time, precise and effective access control of devices. Two important architectural features are: distributed and dynamic rule base, which promotes scalability; and contextual access authorization, which allows to provide more correct results and more efficient inferences of access authorization. To save devices power consumption, context changes and updates are actively identified and performed, respectively, by a context manager module. Although power consumption is an important constraint for devices, we argue that context manager module should play a passive role, accepting notifications only when contextual changes occur. In [29] a solution for direct and indirect management of multiple classes of smart home IoT devices is proposed. It supports time-sensitive devices, i.e., devices that are only available for a period of time. Due to the distributed characteristic, the number of devices and power consumption, we argue that only indirect management is an appropriate approach in order to provide decoupling of space and time for device access. In [22] is proposed a meta-operating system built as a distributed middleware infrastructure for coordinating software systems and SPOs. Called Gaia, the middleware is designed to support the development and execution of portable applications for active spaces, which are programmable ubiquitous computing environments where users interact with several devices and services simultaneously. Gaia provides a framework that abstracts the services for the users and let them to develop user-centric, resource-aware, multidevice, context-sensitive, and mobile applications. The main contribution of Gaia is letting users and developers abstract ubiquitous computing environments as a single reactive and programmable entity instead of a collection of heterogeneous individual devices. As it was designed to coordinate portable applications it lacks support for multimedia narratives, an important feature in our proposal. In [30] is proposed a layered reference architecture for IoT environments, whose main characteristics are modularity and scalability. These characteristics aim to support

the addition and removal of resources according to the requirements of each category. Given the distributed characteristic of the IoT applications, in this work devices are univocally identified through a UUID code, which facilitates their management, data synchronization and diminishes information leakage. The architecture adopts a centralized approach for interoperability, integration and isolation of devices. In spite of favoring low computational resource devices, we argue that it discourages modularity and scalability. A unique feature of this proposal is the use of local and cloud platforms for data analysis. This is particularly interesting because of the large volume of data generated by IoT environments. Authors [31–33] propose methods for selecting media objects that should be part of the personalized narrative. Each media object is represented by an abstract description (metadata), which is composed by information of media types, users and presentation environment. The selection mechanism uses a media abstract representation and user profiles to query a media database and select those media that meet the requirements defined in the description. As in our proposal, the approach adopted in these works is a solution to the problem of the selection engine options explosion. Similarly to [24, 32] and [33] defines an abstract representation model (a template) for the dynamic generation of adaptive multimedia narratives. It seeks to integrate different multimedia documents models into an abstract model. We argue that this helps the generation of alternative narratives, which is an important feature in our proposal. Authors in [34, 35] and [25] argue that, because each segment of multimedia narrative contains only one subject (or event), segmentation can lead to a high degree of personalization. In these works, the segmentation of narratives is performed manually. A user identifies subjects in the segment and associates descriptive metadata to it. We agree that segmentation of narratives has many benefits, for instance, segment reuse, subjects and events documentation, and it also favors segment indexation. In [36] is proposed an extended IPTV/IMS service delivery architecture. This proposal aims to adapt an IPTV/IMS architecture to the user-centered interactive TV model. To do this, it integrates a context-aware system to provide personalized delivery service and adapt multimedia narratives according to the user and environment contexts. The adaptation of narratives is limited to adjust media objects quality (see Sect. 2). Personalization of the delivery service is restricted to the implicit choice of the multimedia narrative presentation device. Like [17, 19] and [37], contextual information is used to enable personalization. User profile, device information and network usage data are gathered implicitly to determine the narrative presentation device and media objects quality. We argue that for more complex scenarios, such as the one we describe in our proposal, those modules do no cover personalization of logical order of events, spatial and acoustic layouts, narrative structure and media objects selection. Authors in [17, 19] and [34] adopt models, such as MPEG-7 [5] and TV-Anytime Metadata Specification [20], and delivery protocols, such as MPEG-DASH [38] and Apple HLS, to personalize the delivery of commercials. 
These works highlight that, in order to effectively select relevant commercials, a user model is needed, and it must capture contextual information, user preferences and user consumption data. Authors in [17] pack commercials into the multimedia stream to force users to consume them.

We argue that forcing a user to consume non-desirable commercials hurts his rights of choice. In general, our architecture proposal addresses the limitations discussed in all mentioned work. Contrasting with them, we consider multimedia narratives as a means of sharing and presenting experiences in the form of multimedia documents (see Sect. 2). This approach, in addition to providing better abstractions for specifying multimedia narratives, allows for greater flexibility with respect to their personalization. Authors [17, 19, 24, 31–33] and [37] propose abstraction models for adapting and synchronizing media objects. As stated before (see Sect. 2.1), media objects as firstclass entities favor the undesirable duplication of event metadata. In our proposal, media objects present events and personalization process begins at the event abstraction level. Media object adaptation and synchronization are performed at the last stages of the personalization process. Unlike [25, 31–36], we segment events, not media objects. Media objects are linked to events and they offer different contextual visions of events. Unlike [17], our proposal does not restrict the freedom of the user and the use of multimedia distribution technologies. MNP is not only about inserting commercials into real-time media streams, but also removing, restructuring and reordering the sequence of events, and synchronize the presentation room with the narrative. It is domain dependent. For instance, in the field of computer-assisted education, where most of the proposals are focused, personalization requires explicit user interaction [39]. In other domains, such as in the entertainment industry, implicit user interaction must be maximized. Like [17], we argue that to make user interaction less conscious, models are necessary for capturing profiles and contextual information of users and presentation environment. Moreover, it is necessary an infrastructure of SPOs for managing contextual information. Like [22] we argue that context information must be described using easily writing rules. For this, semantic reference models, based on ontologies, are natural candidates due to its decidability and to improve middleware semantic interoperability. Considering the distributed characteristic of such infrastructure, the management of large amount of unstructured data, SPO connectivity, and issues related to scalability, security and privacy, we argue that the infrastructure for SPO management must be indirect and guided by a reference model. In addition, similarly to [22] the infrastructure must be built based on the concept of microservice to improve scalability, modularity and resilience of applications.

4 Requirements Analysis
The discussions and definitions presented in Sects. 2 and 3 may be rephrased as a set of requirements for a dynamic contextual personalization for multimedia narratives in IoT environments:
• A narrative is a logical causal sequence, total or partial, of events;
• Multimedia narratives are ways for sharing and presenting events and must be segmented in order to maximize the personalization;

• Media object is one of the properties of events and is used to register, represent and present events. Media object must be specified abstractly in order to enable dynamic contextual personalization of multimedia narratives;
• SPOs gather and process contextual information, are transparently accessed and seen as a single reactive and programmable entity;
• User profile must be specified through an explicit abstract model capable of capturing implicit and explicit contextual information;
• Profiles of users, SPO and presentation environment must be specified using semantic reference models in order to increase the middleware interoperability and support automatic knowledge inference.
Besides the requirements coming from the discussion about narratives and SPOs, the following requirements are added:
• The personalization process must consider contextual information of the narrative domain, presentation environment and user;
• Narrative personalization must consider event properties;
• Narrative templates are used to generate alternative narratives automatically;
• User abstract profiles help generate personalized narratives;
• The middleware must be distributed, scalable, resilient, and modular;
• Media objects are segmented according to the segmentation of narratives.
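To illustrate the requirement above that profiles be expressed with semantic reference models, the snippet below encodes a fragment of a user profile in RDF/Turtle and loads it with the rdflib library. The ex: vocabulary is invented for the example; the actual profile model is left to future work, as noted in Sect. 5.9.

# Hedged sketch: a user-profile fragment as RDF, loaded with rdflib.
from rdflib import Graph

profile_ttl = """
@prefix ex: <http://example.org/jarvis#> .

ex:Mary a ex:User ;
    ex:age 7 ;                          # illustrative value, not from the paper
    ex:preferredColor "yellow" ;
    ex:allowedViewingMinutes 45 ;
    ex:interest "color learning", "direction learning" .
"""

g = Graph()
g.parse(data=profile_ttl, format="turtle")
print(len(g), "triples loaded")   # 6 triples loaded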

5 Middleware Architecture
The middleware architecture proposed in this work, shown in Fig. 1, aims at delivering personalized multimedia narratives. It synchronizes narratives and SPOs in order to provide an immersive experience to the user.

Fig. 1. Middleware architecture.

According to the requirements listed in Sect. 4, there are several hardware and software components at various abstraction layers. We distribute them in a multitier architecture in order to help meeting the scalability requirement, distributing a layer across multiple components and possibly external services. The middleware architecture is distributed in three horizontal tiers. The first tier has five layers: Narrative Layer (NL), Presentation Layer (PL), Context Information Layer (CIL), Acquisition/Actuation Layer (AAL), and SPO Layer (SPOL). The second and third horizontal tiers have one layer each, the Management Layer (ML) and the Persistence Layer (PerL), respectively. The following sections describe the layers and their respective essential services. Non-essential services can be defined by user to help the middleware performing its tasks. 5.1 SPO Layer – SPOL All the SPOs of the middleware are located at SPOL. Each SPO is a programmable or non-programmable sensor or actuator node. SPOs are used for gathering contextual information or configuring the presentation environment appropriately. A single SPO can also add multiple sensors and actuators. Sensors and actuators are the glue that enables synchronizing IoT environments with narratives. For example, a presence sensor indicates which users are in the environment and thereby helps personalize narratives. Actuators can help narratives modify an environment by setting temperature, adjusting lights and sound preferences, etc. As in [30], each SPO has a UUID [40] associated to it in order to facilitate management, data synchronization, access policies definition, and to prevent information leakage and improper alteration. 5.2 Acquisition/Actuation Layer – AAL As in [27], AAL optimize the context information management and gathering in terms of direct access to SPOs. Each set of SPOs is managed by an essential service called wrapper. A wrapper is a proxy service responsible for formatting data and handling actuation, acquisition and configuration tasks. A wrapper adds features similar to those of Manage Agent, Intelligent Applications and ProxyIn/PorxyOut of [29]. In addition to having its own UUID it associates a UUID to each SPO it is responsible for. A wrapper meets a high priority requirement in respect to low resource SPOs [41], offers independence of access to each SPO and prevents the propagation of operational differences of SPOs to other components, thus protecting the middleware from crashes. It also provides support for legacy protocols, maps between SPOs data models and the middleware, and standardizes the middleware communication with SPOs. As in [30], it is possible to rely on cloud infrastructure in order to easily scale the amount of SPOs. Wrappers can remain active throughout the lifetime of a SPO or just for handling a request and then terminated. Active wrappers are generally used to deal with sensor nodes, while the others to deal with actuators. Another essential service of AAL is the Acquisition/Actuation Service (AAS). AAS is responsible for abstracting to the layer above the data acquisition and actuation details.
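Before detailing how AAS uses wrappers, the sketch below illustrates the wrapper abstraction described above: a proxy that owns a UUID, assigns one to each SPO it manages, and exposes uniform acquisition and actuation calls. The method names and the dictionary-based registry are assumptions made for the example.

# Hedged sketch of an AAL wrapper: a proxy with its own UUID that
# registers SPOs, formats their data and forwards actuation commands.
import uuid


class Wrapper:
    def __init__(self, name: str):
        self.uuid = uuid.uuid4()          # wrapper's own UUID
        self.name = name
        self.spos = {}                    # SPO UUID -> device driver object

    def register_spo(self, driver) -> uuid.UUID:
        spo_id = uuid.uuid4()             # UUID assigned to the SPO
        self.spos[spo_id] = driver
        return spo_id

    def acquire(self, spo_id) -> dict:
        raw = self.spos[spo_id].read()    # device-specific acquisition
        return {"spo": str(spo_id), "wrapper": self.name, "value": raw}

    def actuate(self, spo_id, command: str, **params) -> None:
        self.spos[spo_id].send(command, **params)   # device-specific actuation


class FakeThermostat:
    """Stand-in driver used only to make the example runnable."""
    def read(self):
        return {"temperature_c": 26}

    def send(self, command, **params):
        print("thermostat:", command, params)


w = Wrapper("climate-wrapper")
spo = w.register_spo(FakeThermostat())
print(w.acquire(spo))
w.actuate(spo, "setTemperatureTo", celsius=23)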

When upper layer services request contextual information acquisition and environment actuation, AAS checks for wrappers to fulfill the requests. AAS is designed to run autonomously, i.e., contextual data acquisition and processing keep running, even if upper layer services interested in contextual information are inactive. Similar to Home Gateway [29] and Aggregation/BUS Layer [30], it gives indirect access to wrappers. Similar to [27] and together with wrappers it gives support to scalability and interoperability between SPOs and the middleware. 5.3 Context Information Layer – CIL CIL hides all the complexity regarding actuation requests and contextual information gathering. This layer has three essential services: Context Information Service (CIS), Rule Engine (RulEng) and Context Information Publisher (CIPher). Due to the distributed characteristic of the middleware, CIS is responsible for hiding all the complexity of CIL and, like Aggregation/Bus Layer [30], it acts as a resource service. It triggers CIS agents to handle requests made by upper layers services and, similar to the Access Control Manager of [28], all requests must be authenticated. As requests are handled, CIS agents are terminated. CIS offers two types of request: prompt and scheduled. A prompt request is used when upper layer services need to setup the presentation environment and to obtain an immediate contextual information update. For example, in order to get the current status of the presentation environment and users present in it, Personalization Engine (PE) makes a prompt request to CIS, which in turn triggers agents to interact with AAS. AAS contacts wrappers to get the requested information. After receiving the status, CIS agents reply it to PE. PE can make a prompt request asking CIS to setup the environment for the presentation of a narrative. Scheduled requests are those that must be triggered at a specific date and time. When such a request is made, CIS subscribe a request rule to RulEng defining the date and time, how long the rule must remain active, the information to be obtained, and the upper layers services that must be notified. For example, NL services can make a scheduled request to be triggered every Friday at 6PM to verify if a user is in the presentation environment. Like Event Processing and Analytics Layer of [30], RulEng is responsible for processing request rules, which are composed of a condition and an action that must be triggered when the condition is satisfied. The middleware does not restrict the number of rules that can be specified. Rules are specified using RDF/OWL syntax and stored at PerL. RulEng supports automatic inference mechanisms for processing the rules. CIPher is responsible for publishing and announcing contextual information. When RulEng process a rule it also triggers a CIPher agent that will be responsible for handling the AAS replies. As soon as AAS replies, this agent announces contextual information to interested upper layers services and publishes it to PerL. As soon as these tasks are fulfilled, the agents are terminated. This approach prevents propagation of errors and possible crashes and avoids service overloading. Contextual information updates are published to PerL in order to prevent information loss and to allow later recovery of information.
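The sketch below shows what a scheduled request rule might look like when written down in RDF/Turtle and stored at PerL, following the description above (date and time, how long the rule stays active, the information to obtain and the services to notify). The ex: vocabulary is hypothetical; the middleware's actual rule ontology is not specified in this paper.

# Hedged sketch: a scheduled-request rule expressed in RDF/Turtle.
from rdflib import Graph

rule_ttl = """
@prefix ex: <http://example.org/jarvis#> .

ex:FridayPresenceCheck a ex:ScheduledRequest ;
    ex:triggerDay "Friday" ;
    ex:triggerTime "18:00" ;
    ex:activeFor "P90D" ;                 # how long the rule stays active
    ex:informationToObtain ex:UserPresence ;
    ex:notify ex:NarrativeLayer .
"""

g = Graph()
g.parse(data=rule_ttl, format="turtle")

# RulEng-style check: list every scheduled request and who must be notified.
q = """
PREFIX ex: <http://example.org/jarvis#>
SELECT ?rule ?service WHERE {
    ?rule a ex:ScheduledRequest ;
          ex:notify ?service .
}"""
for rule, service in g.query(q):
    print(rule, "notifies", service)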

5.4 Presentation Layer – PL PL provides services for handling multimedia narratives presentation. At this layer there are three essential services: Narrative Presentation Service (NPS), Narrative Converter Service (NCS), and Narrative Players (NPlrs). NPS receives personalized narratives from NL services and triggers NCS for converting the narrative into formats supported by the presentation environment. For example, if the environment supports HTML5, SMIL, and NCL formats, the personalized narrative is converted to one or all of these formats before being presented. After a personalized narrative is converted, NPS triggers a NPlr. NPlr is an agent responsible for presenting a narrative and requesting continuous context information updates from CIS. For example, if any change in the environment occurs (a user leaves or enters the environment), a NPlr receives an announcement from CIS and requests from PE a new personalized narrative so this narrative can meet the new context. NPlr is also responsible for requesting environment setup. For example, if a narrative requires a sound adjustment, NPlr requests it to CIS. 5.5 Narrative Layer – NL NL services are responsible for personalizing and generating narratives according to contextual information of user and presentation environment. There are four essential services at this layer: Narrative Generation Service (NGS); Narrative Provider (NPdr), Narrative Access Service (NAS); and Personalization Engine (PE). NGS offers support for specifying rules for automatic generation of narratives. The rules may cover types of event, types of synchronization, profiles and contextual information of users and environments, among others. NAS provides access to narratives, whether they come from NPrds or Persistence Layer. NAS first checks for narratives at PerL. If no narrative is found, it requests narratives to NPdr. NAS also segments narratives into events to facilitate narrative personalization. A NPdr is a proxy service responsible for obtaining narratives from various external sources and feeding the middleware with narratives. In case external sources do not provide NEMo specifications of the narratives, NPdr parses the narratives and generate a NEMo specification for each one. For example, if an external source only provides narratives in HTML format, NPdr generates a NEMo specification for each one. Both the external format of narrative and its NEMo specification are stored at PerL. As discussed in Sect. 2.1, a NEMo specification defines the logical ordering of events. In this proposal, we use NEMo specifications generated by NPdrs as a preliminary synchronization plan that will be used by PE to personalize narratives. A NPrd has two types of operation mode: active and passive. In the active mode, a NPrd is configured to periodically feed PerL with narratives so NAS can recover them when necessary. In passive mode, a NPrd only provides narratives when it is requested by NAS. In this mode a NPdr sends narratives to PerL and replies them to NAS. PE personalizes a narrative according to the restrictions imposed by contextual information of the user and the presentation environment and to the rules specified at NGS. It defines, for example, which events should compose the personalized narrative, which narrative template must be used, the adaptations that must be made to media objects

and presentation environment, among others. Indicative classification, user profile and preferences, such as favorite events, presentation order of event, media object formats, are some context information. A narrative template is a document that specifies the structural, causal, spatial, temporal, informational and experiential properties of a narrative. Similar to I-Objects [32], these templates are used to guide PE and NGS during personalization and generation processes. Narrative templates are stored at PerL. Once the template is selected, PE uses it to define what events from the original narrative should be in the personalized narrative. PE is equipped with mechanisms of complex events processing to select and synchronize events and uses contextual information of the environment to personalize layout and media objects. 5.6 Management Layer – ML ML has direct access to all other layers of the middleware. Loading and unloading of the middleware essential services is done by the Core Management Service (CMS). In addition to CMS this layer has other two essential services, Profile Registration Management Service (PRMS) and SPO Management Service (SPOMS). CMS is also responsible for monitoring the middleware services and for keeping the middleware operational. CMS periodically sends keepalive messages in order to check if a service is running. In case CMS does not receive a reply, the service is assumed to be down, and an appropriate action is performed. If the unresponsive service is an essential service, CMS initiates the service recovery procedure. As in [22], when a management request is made, CMS instantiates an agent to handle the respective request. For example, when a management request for updating an essential service, monitoring services and triggering service recovery task is made, CMS instantiates a specific management service to handle it. After the management request is handled the agent is terminated. PRMS offers management interface for (de)registering, (de)activating, updating user and environment profiles and non-essential services. SPOMS is responsible for managing SPOs and wrappers. No other the first-tier service has direct access to Persistence Layer. In order to access information from the Persistence Layer, all first-tier services must contact ML services. For example, to query for narratives NAS contact CMS that triggers an agent, called CMS Proxy, to handle the query and interact with Persistence Services. 5.7 Persistence Layer – PerL Persistence Layer stores rules, narratives, narrative and presentation templates, user and environment profiles, contextual information, media objects and their descriptions, authentication codes, encryption keys, certificates, etc. As in [30], it relies on local and cloud distributed infrastructure of repositories in order to easily scale the volume of data the middleware uses and generates. This layer has two essential services, Persistence Service (PerS) and Registry Service (RegS).
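Returning briefly to the CMS monitoring behaviour described in Sect. 5.6, the sketch below shows the keepalive idea: services that do not answer within a timeout are assumed to be down, and essential services additionally trigger the recovery procedure. The timeout value, the handling of non-essential services and the function names are assumptions made for the example.

# Hedged sketch of CMS service monitoring via keepalive messages.
import time


class MonitoredService:
    def __init__(self, name: str, essential: bool):
        self.name = name
        self.essential = essential
        self.last_reply = time.time()

    def reply_to_keepalive(self):
        self.last_reply = time.time()     # called when the service answers a keepalive


def check_services(services, timeout_s: float = 10.0):
    now = time.time()
    for s in services:
        if now - s.last_reply > timeout_s:        # no reply: service assumed down
            if s.essential:
                print(f"{s.name} is down - starting recovery procedure")
            else:
                print(f"{s.name} is down - unloading non-essential service")


services = [MonitoredService("PE", essential=True),
            MonitoredService("CustomStats", essential=False)]
services[0].reply_to_keepalive()
check_services(services, timeout_s=0.0)   # zero timeout just to force output in the example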

PerS provides an interface for browsing, querying, adding, removing, updating and reasoning on the basis of semantic descriptions. In order to improve semantic interoperability and decidability, the middleware adopts semantic reference models, based on description logics and ontologies, as the standard format for representing contextual and non-contextual information. We developed a SPARQL-based querying language for accessing information of the PerL. In addition to SPARQL functionalities, the language has constructors for event concept and its properties. As in [32], PerS also allow users, through PRMS Proxies web interface to update their profile. RegS maintains a registry of local and cloud repositories where all information is stored. CMS Proxies contact RegS in order to obtain the correct repository service that must be accessed. For example, when CIPher publishes contextual information, it contacts CMS which requests RegS for the respective PerS agent responsible for managing the respective repository. The agent reference is returned to CMS Proxy and from now on this agent handles the CIPher publishing request. After the publishing request is finished the CMS Proxy is terminated. 5.8 Narrative Specification Model According to [32], a model is needed for specifying the multimedia content present in narrative. However, as the proposed middleware is based on the concept of narrative defined in Sect. 2.2, another approach is necessary. The narrative specification model must capture the semantics of the relationships between events. We use NEMo [1] to meet this requirement. Thus, the narratives supported by the middleware are those specified according this model. 5.9 Representation Model of User and Presentation Environment According to [32], advanced models are needed to capture contextual information of user for personalizing narratives. In addition, models are required to represent contextual information of the presentation environment, so that it is possible, for example, to interact with the environment through SPOs. These models should therefore capture preferences, local settings (time, location, etc.) and other domain-specific information that is relevant to the personalization process. The specification of a user profile usually consists of two phases [14]. In the first phase it is performed user data gathering from several data sources. In the second phase methods of data science and machine learning are used to consolidate the profile. Due to the lack of space, both the user profile specification model and presentation environment specification model will be discussed in future work.
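The PerS query interface described in Sect. 5.7 is SPARQL-based with extra constructors for events; since those constructors are not detailed here, the sketch below uses plain SPARQL over an invented event vocabulary to show the kind of query NAS or PE might issue, for instance retrieving narratives tagged for colour learning and suitable for a given age.

# Hedged sketch: a plain-SPARQL query of the kind PerS might answer.
from rdflib import Graph

data_ttl = """
@prefix ex: <http://example.org/jarvis#> .

ex:Narrative42 a ex:Narrative ;
    ex:semanticTag "color learning" ;
    ex:minimumAge 4 .
"""

g = Graph()
g.parse(data=data_ttl, format="turtle")

query = """
PREFIX ex: <http://example.org/jarvis#>
SELECT ?narrative WHERE {
    ?narrative a ex:Narrative ;
               ex:semanticTag "color learning" ;
               ex:minimumAge ?age .
    FILTER (?age <= 7)
}"""
for (narrative,) in g.query(query):
    print("candidate narrative:", narrative)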

6 Using the Middleware
In this section we present how the middleware works, using Fig. 1 as an illustration. To explore all the features of the system, John must configure Jarvis. After installing the system, John sets up the profiles of his family. Using the PRMS Web interface, John

inserts information about each member of the family to help the personalization system learn about the users and the presentation environment. John informs his profile: age = 40 years old, gender = male; consumption habits: watches narratives from Monday to Thursday between 8PM and 8:30PM, Fridays and Saturdays between 9PM and 11PM, and Sundays, together with his wife and children, between 7AM and 10AM. He inserts his wife's profile: age = 37 years old, gender = female; consumption habits: watches narratives from Monday to Thursday between 7PM and 7:30PM, and Fridays and Saturdays, with her husband, to enjoy his company. He also inserts the children's information. The children, Joseph and Mary, are allowed to enjoy narratives that are appropriate for their age and in periods of time that do not conflict with school and homework activities. Because of this, in addition to the children's profiles, John sets up Jarvis with condition-action rules that automatically turn the system on when it detects that the conditions are true, and off when the children have watched 45 min of narratives. For instance, John defines the ChildrenShowTime rule:

IF ((location(userId("Joseph"), "Presentation Room") OR location(userId("Mary"), "Presentation Room"))
    AND currentTime("6 PM") AND isAllowedDay(today(), between("Monday", "Friday")))
THEN turnOn("Air Conditioning System"), setTemperatureTo("Air Conditioning System", "23 °C"),
    setLightTo("Light System", diminish("45%")), turnOn("Presentation System"),
    play("Children Narratives List").

The turn-off rule is defined as ChildrenShowTimeOver:

IF ((location(userId("Joseph"), "Presentation Room") OR location(userId("Mary"), "Presentation Room"))
    AND watchedTime("Children Narratives List", "45 min") AND isAllowedDay(today(), between("Monday", "Friday")))
THEN stop("Children Narratives List"), turnOff("Presentation System"),
    turnOff("Air Conditioning System"), turnOff("Light System").

PRMS triggers PRMS Proxy 1, which sends the rules to RulEng to be validated; if valid, they are stored at PerL with the help of a PerS 1 agent, which handles a set of repositories. For instance, PerS 1 is responsible for a relational repository and an RDF/OWL repository. The rules are stored locally or in a cloud infrastructure. The advantage of storing them in the cloud is that the rules can be made available to other instances of Jarvis in different locations. Two AAL wrappers, Presence Notification Service (PNS) and DateTime Notification Service (DTNS), are responsible for detecting presence and for periodically sending date and time updates, respectively. In order to work properly, PNS needs a presence detection SPO. This device must be registered with the help of SPOMS. For this, John must access the SPOMS Web interface and insert the ID and authentication information of the SPO. SPOMS validates the information and activates the SPO, which becomes available through PNS. When PNS detects that one or both of the children are in the "Presentation Room", it sends a notification to AAS, which then notifies RulEng. CIPher is contacted by RulEng to publish date and time information to PerS 2 using SPOMS Proxy 2. As soon as RulEng receives a DTNS update from CIS, it verifies that the rule ChildrenShowTime is valid and sends an announcement to NPS indicating that the Children Narratives List presentation must be started. When NPS receives the announcement from CIS, it triggers the turnOn(), setTemperatureTo(), setLightTo(), and play() actions.
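To make the rule processing behind this scenario concrete, the sketch below evaluates a ChildrenShowTime-style condition against the contextual facts published by PNS and DTNS and, when it holds, dispatches the corresponding actions. The representation of facts and actions is an assumption made for the example; in the middleware the rule itself would be stored in RDF/OWL form at PerL.

# Hedged sketch: RulEng-style evaluation of the ChildrenShowTime rule.
def children_show_time_holds(ctx: dict) -> bool:
    """Condition: Joseph or Mary is in the presentation room, it is 6 PM,
    and today is an allowed weekday (Monday to Friday)."""
    present = {"Joseph", "Mary"} & set(ctx["users_in_presentation_room"])
    weekday = ctx["day"] in {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday"}
    return bool(present) and ctx["time"] == "18:00" and weekday


def fire_children_show_time(actuate):
    # 'actuate' stands in for the CIS/AAS calls that reach the SPO wrappers.
    actuate("Air Conditioning System", "turnOn")
    actuate("Air Conditioning System", "setTemperatureTo", "23 C")
    actuate("Light System", "setLightTo", "diminish 45%")
    actuate("Presentation System", "turnOn")
    actuate("Presentation System", "play", "Children Narratives List")


context = {"users_in_presentation_room": ["Mary"],
           "day": "Wednesday", "time": "18:00"}

if children_show_time_holds(context):
    fire_children_show_time(lambda device, command, *args: print(device, command, args))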
Like the PNS SPO, the temperature, light, sound system and TV SPOs must also be registered through SPOMS to be handled by AAL

wrappers. For instance, in order for ChildrenShowTime to trigger turnOn(), play(), setTemperatureTo(), and setLightTo() to actuate on the presentation environment through wrappers, the respective SPOs must have been registered. The children's profiles and the additional context data informed by John are used as search and personalization parameters. For example, age and semantic tags, such as "color learning" and "direction learning", are used to search for TV programmes suitable for the children. These tags are used by a NAS agent to request narratives from NPdrs. NPdrs are sources of narratives and must be registered through CMS. The tags are based on semantic models in order to disambiguate queries. Each contacted NPdr executes a query and sends the result in NEMo format to NAS, which in turn publishes the query result to PerS. As the middleware adopts RDF/OWL, queries are in SPARQL syntax. Another NAS agent is responsible for creating the Children Narratives List. This list is a playlist of personalized narratives. The NAS agent retrieves from PerS 3 the narratives that meet the children's profiles and segments them into events to facilitate the personalization task. Once these events are selected, a narrative template is used to synchronize the events according to the order defined by John during the profile configuration. Event synchronization is a task performed by PE and, once the synchronization is finished, it adapts the media objects that will be used to present the personalized narrative. Media object handling is performed in at least two steps: media object selection and media object adaptation. Each event may have a set of media objects that represent it. PE takes into account the formats supported by the presentation environment and the children's profiles to select the appropriate media object to represent an event. For example, if Mary prefers yellow to blue, PE tries to choose media objects that have some yellow element. Media object adaptation may be performed if no appropriate media object is found. In this case, the media properties may be adjusted. As soon as media object adaptation is performed, the list of personalized narratives is sent to PerS 3. When the play() action of ChildrenShowTime is performed, NPlr 1 retrieves the Children Narratives List from PerS 3, and an NCS agent is loaded to convert the narrative to the multimedia document format supported by the presentation room. Notice that this is not an adaptation procedure, but only a conversion from the middleware synchronization document format to the TV and sound system presentation format. During the conversion, the NCS agent annotates the presentation document with context synchronization tags. These tags are the elements that make it possible to synchronize narratives with presentation environments. They store rules that are used by NPlrs to actuate through SPOs in the presentation room. For example, NPlr 1 synchronizes the narratives with the sound system to stimulate direction learning (turn right and left, go ahead) by making the children follow beeps around the room, and with the light system to teach numbers (blinking one lamp ten times) and colors (turning yellow, red, and green lights on). After the NCS agent finishes, NPlr 1 presents the narratives to Joseph and Mary.
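Finally, a small sketch of the media object selection step performed by PE in this scenario: among the media objects linked to an event, prefer those in a format the presentation room supports and, when possible, those matching the child's preferred colour, falling back to adaptation only if nothing suitable is found. The attribute names and the fallback behaviour are assumptions made for the example.

# Hedged sketch of PE's media object selection for one event.
def select_media_object(event_media: list, supported_formats: set,
                        preferred_color: str):
    """event_media: list of dicts describing the media objects of an event."""
    candidates = [m for m in event_media if m["format"] in supported_formats]
    if not candidates:
        return None                        # would trigger media object adaptation
    preferred = [m for m in candidates if preferred_color in m.get("colors", [])]
    return (preferred or candidates)[0]


media = [{"uri": "shapes_blue.webm", "format": "webm", "colors": ["blue"]},
         {"uri": "shapes_yellow.mp4", "format": "mp4", "colors": ["yellow"]}]

chosen = select_media_object(media, supported_formats={"mp4"},
                             preferred_color="yellow")
print(chosen["uri"])   # shapes_yellow.mp4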

7 Concluding Remarks An issue that arises with the Internet of Narratives is the Crisis of Choice. One way to address this problem is adopting narrative personalization strategies. Narrative personalization assumes the existence of up-to-date contextual information of users and

presentation environment. Current architectural proposals focus only on the management of IoT devices. Based on this, we propose a middleware architecture for dynamic contextual personalization of narratives that orchestrates the IoT devices. Our middleware uses contextual information to enable personalization of narratives and synchronization of the presentation environment with both narrative and users. Although we have not presented the narrative, user and presentation environment specification models, it is important to notice that they are essential to achieve the level of personalization we propose in this work. We have implemented a narrative query model that covers the specificities of events and their properties. Due to the lack of space, these models will be discussed in future work. In addition to the benefits a middleware commonly brings to the integration of complex and heterogeneous environments, we highlight four that the proposed middleware brings. First, it reduces the cognitive effort of users: contextual information about users and the presentation environment is gathered without user interaction. Second, it allows the narrative to interact with the user and dynamically configure the environment as the narrative unfolds. Third, narrative personalization, not covered in this paper, is decomposed into two levels of abstraction. Finally, the architecture supports different multimedia narrative specification models. The Personalization Engine is an important module of our middleware. Due to the lack of space, we intend to discuss it in future work. Narrative Provider conversion procedures, Persistence Service agents, actuation/acquisition wrappers, authentication and further management services, the user profile specification model and the presentation environment specification model are other topics that we will discuss in future work.
Acknowledgment. This work was partly funded by grants CNPq/302303-2017-0, FAPERJ/E26-202.818-2017, CAPES/PRINT/88881.310592-2018-01 and FAPEAM/SECT-020/2009.

References
1. Carmo, R.R.M., Soares, L.F.G., Casanova, M.A.: Nested event model for multimedia narratives. Presented at the 2013 IEEE International Symposium on Multimedia (ISM), Anaheim, CA, USA, pp. 106–113 (2013)
2. Donoso, V., Geerts, D., Cesar, P., de Grooff, D. (eds.): Networked Television. Adjunct proceedings of EuroITV 2009, Leuven, Belgium, pp. 1–190 (2009)
3. Perera, C., Zaslavsky, A., Christen, P., Georgakopoulos, D.: Context aware computing for the internet of things: a survey, arXiv, vol. 16, no. 1, pp. 414–454, March 2014
4. Westermann, U., Jain, R.: E – A generic event model for event-centric multimedia data management in eChronicle applications. Presented at the 22nd International Conference on Data Engineering Workshops (ICDEW 2006), p. x106 (2006)
5. Martínez, J.M. (ed.): MPEG-7 Overview, 10 ed. Palma de Mallorca, Spain (2004)
6. Adams, W.J., Eastman, S.T.: Prime-time network entertainment programming. In: Broadcast Television Strategies, no. 4 (2007). wadsworthmedia.com
7. Bellekens, P., Van Kerckhove, G., Kaptien, A.: iFanzy: A ubiquitous approach towards a personalized EPG. Presented at the Euro iTv 2009, Leuven, Belgium, pp. 130–131 (2009)
8. Aroyo, L., Conconi, A., Dietze, S., Kaptein, A., Nixon, L., Nufer, C., Palmisano, D., Vignaroli, L., Yankova, M.: NoTube – making TV a medium for personalized interaction. In: Euro iTv 2009, pp. 22–26, April 2009

9. Mathur, P.: Technological Forms and Ecological Communication: A Theoretical Heuristic, pp. 1–229. Lexington Books, Laham (2017)
10. Tseng, M.M., Jiao, J.: Mass customization. In: Salvendy, G. (ed.) Handbook of Industrial Engineering, 3rd edn, pp. 684–709. Wiley, Hoboken (2001)
11. Kaplan, A.M., Haenlein, M.: Toward a parsimonious definition of traditional and electronic mass customization. J. Prod. Innov. Manag. 23(2), 168–182 (2006)
12. McCarthy, I.P.: Special issue editorial: the what, why and how of mass customization. Prod. Plan. Control 15(4), 347–351 (2007)
13. Schneider-Hufschmidt, M., Kühme, T., Malinowski, U.: Adaptive User Interfaces. North Holland (1993)
14. Brusilovsky, P.: Adaptive hypermedia for education and training. In: Durlach, P.J., Lesgold, A.M. (eds.) Adaptive Technologies for Training and Education, no. 3, pp. 46–66. Cambridge University Press, Cambridge (2012)
15. Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems handbook. In: Recommender Systems Handbook, no. 1, pp. 1–35. Springer, Boston (2010)
16. Munir, K., Anjum, M.S.: The use of ontologies for effective knowledge modelling and information retrieval. Appl. Comput. Inf. 14(2), 116–126 (2017)
17. Bringuier, L.: Increasing ad personalization with server-side ad insertion, pp. 403–412. Amsterdam, Netherlands (2016)
18. Seeliger, R., Silhavy, D., Arbanowski, S.: Dynamic ad-insertion and content orchestration workflows through manifest manipulation in HLS and MPEG-DASH. Presented at the CNS, vol. 2017, pp. 450–455 (2017)
19. Thawani, A., Gopalan, S., Sridhar, V.: Context Aware Personalized Ad Insertion in an Interactive TV Environment, pp. 1–7 (2004)
20. McParland, A.: TV-Anytime - using all that extra data. British Broadcasting Corp (2002)
21. Masthoff, J., Pemberton, L.: Adaptive hypermedia for personalised TV. In: Adaptable and Adaptive Hypermedia Systems, no. 13, pp. 246–263. IGI Global (2005)
22. Román, M., Hess, C.K., Cerqueira, R., Ranganathan, A., Campbell, R.H., Nahrstedt, K.: A middleware infrastructure for active spaces. IEEE Pervasive Comput. 1(4), 74–83 (2002)
23. Brusilovsky, P. (ed.): The Adaptive Web: Methods and Strategies of Web Personalization, vol. 4321. Springer, Heidelberg (2007)
24. Scherp, A.: A Component Framework for Personalized Multimedia Applications. ansgarscherp.net, Oldenburg, Germany (2006)
25. Cutts, S., Davies, P., Newell, D., Rowe, N.: Requirements for an adaptive multimedia presentation system with contextual supplemental support media. Presented at the 2009 First International Conference on Advances in Multimedia (MMEDIA), Colmar, France, pp. 62–67 (2009)
26. Rose, K., Eldridge, S., Chapin, L.: The Internet of Things: An Overview, October 2015
27. Masoero, R., Buono, S., Malatesta, L.: Internet of Things: The Next Big Opportunity for Media Companies. Accenture, April 2017
28. Souza, R., Lopes, J., Geyer, C., Garcia, C., Davet, P., Yamin, A.: Context awareness in UbiComp: An IoT oriented distributed architecture. Presented at the 2015 IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Cairo, Egypt, pp. 535–538 (2016)
29. Jih, W.-R., Cheng, S.-Y., Hsu, J.Y.-J., Tsai, T.-M.: Context-aware access control in pervasive healthcare. Presented at the EEE 2005 Workshop: Mobility, Agents, and Mobile Services (MAM), pp. 1–8 (2005)
30. Pham, C., Lim, Y., Tan, Y.: Management architecture for heterogeneous IoT devices in home network. Presented at the IEEE 5th Global Conference on Consumer Electronics, Kyoto, Japan, pp. 1–5 (2016)
31. Fremantle, P.: A Reference Architecture for the Internet of Things, October 2015

32. Jourdan, M., Bes, F.: A new step towards multimedia documents generation. Presented at the International Conference on Media Futures, pp. 25–28 (2001)
33. Sebe, N.: Personalized multimedia retrieval: the new trend? Presented at the International Workshop, New York, USA, no. Special Session on Personalized Multimedia Information Retrieval, pp. 299–306 (2007)
34. Brusilovsky, P.: Adaptive hypermedia. In: UM'03, vol. 11, no. 1, pp. 87–110 (2001)
35. Davies, P., Newell, D., Rowe, N., Atfield-Cutts, S.: An adaptive multimedia presentation system. IJAS 4(1), 1–11 (2011)
36. De Bra, P., Knutov, E., Smits, D., Stash, N., Ramos, V.F.C.: GALE: a generic open source extensible adaptation engine. New Rev. Hypermedia Multimedia 19(2), 182–212 (2013)
37. Song, S., Moustafa, H., Afifi, H.: Personalized TV service through employing context-awareness in IPTV/IMS architecture. In: Zeadally, S., Cerqueira, E., Curado, M., Leszczuk, M. (eds.) Adaptive Hypermedia and Adaptive Web-Based Systems, vol. 6157, no. 8, pp. 75–86. Springer, Heidelberg (2010)
38. ISO/IEC 14496-20:2008 Information technology – Coding of audio-visual objects – Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF), San Jose, CA, USA, December 2008
39. Pham, S., Krauss, C., Silhavy, D., Arbanowski, S.: Personalized dynamic ad insertion with MPEG DASH. Presented at the 2016 Asia Pacific Conference on Multimedia and Broadcasting (APMediaCast), Bali, Indonesia, pp. 1–6 (2016)
40. Rowe, N., Davies, P.: The anatomy of an adaptive multimedia presentation system (AMPS). Presented at the Third International Conference on Advances in Multimedia, Budapest, Hungary, pp. 30–37 (2011)
41. Huang, Z., Nahrstedt, K., Steinmetz, R.: Evolution of temporal multimedia synchronization principles. ACM Trans. Multimedia Comput. Commun. Appl. 9(1), 1–23 (2013)

Emotional Effect of Multimodal Sense Interaction in a Virtual Reality Space Using Wearable Technology Jiyoung Kang(B) Graduate School of Cinematic Contents, Dankook University, Yongin-si, Gyeonggi-do, Korea [email protected]

Abstract. The virtual reality (VR) market has been expanding fast with the extraordinary progress of relevant hardware and software. Particularly, with the commercialization of standalone head-mounted display (HMD) devices in recent years, there has been an increasing amount of interest in interfaces providing more expanded and sensory information to users. To intensify user immersion in a VR space, it is necessary to provide physical sense information similar to the real world. However, the sense information provided to users is mostly limited to simple vibration or force feedback, and the emotional responses of users have not yet been investigated with respect to this information. Therefore, this study presents an approach for providing multimodal sense feedback to a user in a VR space through the “Emoract” interface, which can be worn on both hands, and also investigates the resulting emotional responses by measuring the user’s biological response signals. All the experiments were conducted using the interactive VR animation “Lonely Noah”. In the VR animation, the user’s emotional response to the sensory feedback changes depending on the story and situation, and interaction with context-based multimodal sensory feedback proved more effective than interaction with vibration sensations alone, as in existing VR interfaces. Keywords: Multimodal interaction · Virtual reality · Emotional effect · Wearable technology

© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): SAI 2020, AISC 1229, pp. 522–530, 2020. https://doi.org/10.1007/978-3-030-52246-9_37

1 Introduction
The year 2016 was when virtual reality (VR) went mainstream, and as of 2019, the growth of the VR market has been accelerating along with the full-scale commercialization of 5G. Meanwhile, the VR content market has also been growing, based on games and other diverse areas such as movies, education, and medical treatment. Furthermore, Gartner, a market research organization, has selected “immersive technologies” as one of the “top 10 strategic technologies of 2019” [1]. Consequently, the development of VR technologies and their associated contents has been attracting an increasing amount of


attention worldwide. Due to the saturation of the smartphone-based mobile platform market, software development companies as well as hardware suppliers seeking new markets are now concentrating on the development of VR technologies. The extension of VR into diverse areas such as movies has led to a large-scale requirement for VR content. In particular, the VR movies introduced at international film festivals recently have shown an extremely different style from the VR movies of past years, whereby an entirely independent market is being developed by combining movies with VR experiences. In the VR movies of the early years, a linear structure based on narratives combined with the traditional movie format was the mainstream approach, and interaction with the audience was limited to looking around a 360° space. This can be viewed as a technical limitation of the earlier versions of VR movies, from which new versions of VR movies subsequently stemmed. Recently, however, along with the extraordinary progress of VR hardware and software, VR movies have started providing new experiences to audiences by deviating from the conventional formats of traditional movies. Studies have continually been conducted on methods attempting to satisfy the five senses of a user by going one step further than audiovisual immersion, and efforts are continually being made to improve the emotional immersion of audiences. However, studies have not actively examined whether the use of senses in a VR space indeed has a positive immersion effect on its users. Therefore, this study aimed to investigate the emotional immersion effect of sense information provided to users in VR. All the experiments were conducted by developing an interface that can observe the biological response signals of the audience through the left hand of the user in real time when multimodal sense information is delivered to the right hand of the user in a VR space.

2 Related Work

With the arrival of the fourth industrial revolution in recent times, VR-related hardware devices and software technology, including software development kits (SDK), have undergone advancements. VR contents that represent this fourth industrial revolution era have grown quickly in the present era of 5G-based surrealism, in which large VR content data containing high-definition images and high functionalities can be transmitted over wired/wireless services extremely fast and with high accuracy. Based on the aforementioned points, VR contents relying on audiovisual immersion are now delivering a variety of sense information to users, thereby increasing immersion. VR has been defined as I3, for "Immersion-Interaction-Imagination" [2]. Here, immersion in a VR space can be called telepresence, which allows a user to feel as if he/she is present in the VR space. The sense of presence is provided through stimuli such as vision, hearing, and the sense of touch from a remote location [3–5]. Lombard et al. explained telepresence by classifying it into two categories: physical presence and presence [6]. The present study focused on research into a sense interface by means of the hand; here, physical presence is enhanced when a user performs realistic physical interactions. Preliminary studies on physical presence using such hand interfaces have been continuously conducted as VR continues to advance. Schlumberger's Gupta developed a CAD assembly simulation by collaborating with Sheridan and MIT's Whitney [7]. They developed the PHANToM force feedback interface, in which a designer can use their thumb


and index finger to identify the designed components and feel the force when touching them. This multimodal simulation integrated voice and finger position inputs, including visual, auditory, and force feedback. By comparing the handling, insertion, and assembling time of components in the virtual world, this study demonstrated that such force feedback leads to an increase in work efficiency. However, the feedback interface presented in that study was not devised to provide various sense feedbacks like those of the real world. Wolverine [8] is a portable wearable haptic device designed to simulate the gripping of a hard object in a VR interface. Unlike previous wearable force feedback gloves, Wolverine focuses on creating a low-cost, lightweight device that simulates grasping a rigid object in a pad-opposition (precision) grip by directly exerting force between the thumb and three fingers. The system can withstand a force of more than 100 N between each finger and the thumb using a low-power brake-based locking slider, and it consumes only 0.24 mWh (0.87 J) for each braking interaction. This force feedback glove, however, has difficulty delivering other types of sense responses to its user. "Emogle" [9], which was developed in a previous study by the present researcher, is a wearable glove that provides a variety of sense information to a user through vibration, heat, wind, and electrical stimulation in a VR environment. That study proposed an individual sense VR system focused on the sense of touch to deliver high physical immersion to the user in a VR environment. Emogle is a personal haptic VR interface worn on the hand by the user. Through that study, we confirmed that immersion was high in a virtual space when various multimodal sense stimulations were delivered to the user. However, we found that a more detailed study is required to determine which senses induced emotional immersion in the user. Therefore, the present study proposes "Emoract", a multimodal emotion interaction interface, by upgrading the previously developed "Emogle" interface, which was worn on only one hand. The upgraded device can be worn on both hands. Emoract can deliver various sense stimulations to a user and simultaneously collect the corresponding emotional responses of the user in real time.

3 Multimodal Interface for VR

3.1 Body Awareness

Current commercialized VR devices such as the Oculus, HTC Vive, and Samsung Gear VR provide a haptic interface through which the user can interact using both hands. Oculus's Touch controllers and Vive's controllers sense in real time, through separate sensors, the manner in which both of the user's hands move, and provide vibration feedback reflecting it. Gear VR and Daydream VR also provide a remote-controller type of interface that can be held in one hand to interact, and they identify the controller's position and button clicks through communication with the smartphone in the HMD, thereby providing appropriate vibration feedback. Furthermore, more advanced types of interfaces have appeared, such as the Tesla Suit [10], which can be worn on the entire body, or VRFree [11], which is a glove type interface. As such, while conventional VR interfaces basically provided a degree of freedom through 360° head movement, research and development have been focusing on the free movement of both of the user's hands and the


whole body. The primary reason such changes are required is to enhance the body awareness of the user by adding the physical interaction experienced in the real world; this is achieved by going beyond the high-resolution 360° 3D visuals provided in conventional VR contents, as well as auditory information such as spatial sound, which provides stereophonic sound similar to that in the real world. Body awareness refers to the familiarity and understanding that people have about their own bodies regardless of the environment they are in [12]. For example, a person knows the sensations that accompany perceiving the relative positions (proprioceptive senses), motion range, and behavior of the four limbs. The various skills of controlling the movements of the head, eyes, and so on, for crawling, walking, or kicking, are acquired by people while growing up. An interface for a VR space requires enriched senses like those felt by a user in the real world and therefore has to include the use of both hands and the entire body for interaction similar to that in the real world. However, as mentioned earlier, current VR contents, especially VR movies, provide limited sense feedback to users. Furthermore, the emotional effects on a user associated with these simple feedbacks have also not yet been verified, thereby limiting them in terms of VR experience. Therefore, in this study, an interface was developed to provide multimodal sense feedback to a user based on suitable VR contents, and the user's emotional responses were measured in order to investigate the emotional effects of each sense interaction.

3.2 Emoract: A Multimodal Sense Interface

Emoract (Fig. 1) is a compound word formed from the two words emotion and interact, and it refers to an emotion interface based on a user's sense interaction. It is a wearable glove interface worn on both hands. By providing multimodal sense information to the user and simultaneously collecting the user's emotional response in real time, it allows the effectiveness of the sense information to be determined. The user can conveniently wear the glove on the left hand using a band, and this glove delivers vibration, heat, wind, and electrical stimulation suitable for contents such as a VR movie or game. The other glove is worn on the right hand for measuring the user's hand movements and therefore acts as an emotion measuring device, which can promptly measure the user's bio-signal responses. A vibration sensor, an electrical stimulation sensor, a heat sensor, and a small fan are mounted onto the sense stimulation part worn on the left hand of the user, so various sense feedbacks can be delivered to the user through one or a combination of the aforementioned sensors. The emotion measurement part worn on the right hand can sense the user's free hand motion through the IMU sensor and, at the same time, observe the user's biological response signals through heart rate and skin conductance in order to investigate the emotional responses. We collected and analyzed the subjects' three physiological signals, namely their hand movements, heart rate, and degree of skin irritation, while they were experiencing the VR animation "Lonely Noah". In the present study, different sense responses were delivered to the user through an interactive VR animation, "Lonely Noah", which was produced by the present researcher;


Fig. 1. “Emoract”, a wearable multimodal emotion interface

and the emotional effect corresponding to each sense response is investigated by measuring the physiological signals of the users. We found in previous studies that the heart rate and skin irritation level increase when users are immersed in provocative situations in movies [13]. Thus, in this study the user's emotional immersion level was determined by the change in the user's heart rate and skin irritation level. In addition, the degree of movement of the user's hand gestures was analyzed together with these signals, relating the user's immersion to active behavior.
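As a rough illustration of this kind of analysis (not the authors' actual processing pipeline), the sketch below compares the mean heart rate and skin conductance inside each branch interval against a pre-viewing baseline and summarises gesture activity as the mean magnitude of IMU acceleration samples. The array names, sampling rates, and interval boundaries are all assumptions invented for the example.

```python
import numpy as np

# hypothetical recordings for one participant (assumed sampling rates)
heart_rate = np.random.default_rng(0).normal(75, 5, 600)        # 1 Hz, 10 min session
skin_cond  = np.random.default_rng(1).normal(2.0, 0.2, 600)     # microsiemens, 1 Hz
imu_accel  = np.random.default_rng(2).normal(0, 1, (30000, 3))  # 50 Hz, 3-axis

baseline = slice(0, 300)                                  # 5 min before the VR animation
branches = {"branch1": slice(300, 360), "branch2": slice(360, 420),
            "branch3": slice(420, 480), "branch4": slice(480, 540)}

hr_base, sc_base = heart_rate[baseline].mean(), skin_cond[baseline].mean()
for name, seg in branches.items():
    d_hr = heart_rate[seg].mean() - hr_base               # change in heart rate
    d_sc = skin_cond[seg].mean() - sc_base                # change in skin conductance
    # gesture activity: mean magnitude of acceleration during the branch
    imu_seg = imu_accel[seg.start * 50: seg.stop * 50]
    gesture = np.linalg.norm(imu_seg, axis=1).mean()
    print(f"{name}: dHR={d_hr:+.1f} bpm, dSC={d_sc:+.2f} uS, gesture={gesture:.2f}")
```

Comparing these per-branch summaries between the two feedback groups is one simple way to relate immersion to the kind of sensory feedback received.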

4 Interactive VR Animation "Lonely Noah"

"Lonely Noah" (Fig. 2) is an interactive VR animation lasting roughly 5 min with a four-branch structure. It is the story of rescuing a lonely child called Noah from a dark house and subsequently undertaking an adventure in pursuit of a bright future. The user becomes a helper of Noah, who stays alone in a dark room, and helps him through four dangerous situations. Furthermore, "Lonely Noah" is designed to provide two types of sense information based on the gesture interaction of the user in every branching structure. Each branching structure provided different sensory responses along with gestures, and the user's emotional responses were analyzed by observing the user's hand gesture movements, pulse, and skin irritation, as shown in Fig. 3. The user's gestures and the sense feedbacks in each branching story are presented in Table 1 below, and the sense information is provided by distinguishing sense feedbacks A and B. This sensory information was designed with senses suitable for the gestures taken by users in each branch: sensory feedback A was centered on the vibrations provided by many existing VR controllers, while sensory feedback B provided


context-appropriate sensory stimuli of heat, electric stimulation, and wind, excluding vibration. For example, for the gesture of striking a bat flying toward Noah with a rod, group A received vibration like that of existing VR interfaces, while group B received wind and electric stimulation instead of vibration, a multimodal combination meant to convey the wind of the bat's wings nearby and the impulse of hitting it with the rod. Thus, the intention was to compare the vibration-centered sensory information provided by existing interfaces with sensory information, other than vibration, suited to the scenario and situation.

Fig. 2. Scenes of VR animation "Lonely Noah"

5 User Test

Forty normal subjects (15 males and 15 females) aged between 20 and 40 years (male: 35.2 ± 0.3 years, female: 28 ± 1.5 years) participated in the experiment. Physiological signals were measured for 5 min before the subjects watched the VR animation and for 5 min while they watched it. Prior to the experiment, the participants were screened for claustrophobia and informed of possible side effects in the virtual reality space. In addition, since the experiment is accompanied by weak electrical stimulation, consent was obtained after informing the participants that the degree of electrical stimulation is not harmful to the human body and is only slightly higher than normal static electricity. The participants were divided into groups A and B of twenty people each. Subsequently, different sense feedbacks were given to each group, and real-time bio-signal data were collected and analyzed. First, both groups A and B showed the biggest emotional response in the second branch, when they fended off the flying bats: in both groups, the heart rate and skin irritation rate showed their largest changes in response to the sensory feedback received when hitting the incoming bats.


Fig. 3. Flow chart of analyzing the users' emotional responses to the different sense feedbacks

Table 1. Sense feedbacks for the four branches

First branching: Gesture: Hold Noah's hand. Sense feedback A: Vibration. Sense feedback B: Heat.
Second branching: Gesture: Use a stick to hit a bat flying toward the user. Sense feedback A: Vibration. Sense feedback B: Electrical stimulation, wind.
Third branching: Gesture: Perform a firework. Sense feedback A: Vibration. Sense feedback B: Heat, wind.
Fourth branching: Gesture: Hug Noah. Sense feedback A: Vibration. Sense feedback B: Heat.

Among other things, we observed that heart rate and skin irritation increased more on average in group B, whose electric stimulation and wind led users to become more immersed in the context of the story, than in group A, which received only vibration. The next biggest emotional reaction occurred at the firework display in the third branch; in that scene as well, group B, whose sensory feedback of heat and wind together reflected the context of the firework site well, showed a larger change in heart rate and skin irritation than group A, which again received only vibration.


However, for the interaction scenes with Noah in the first and fourth branches, the opposite was observed: the heart rate and skin irritation of group A, which received only vibration, changed significantly more than with group B's heat feedback. This showed that instant sensory feedback through vibration stimulated the user's emotional response more than the contextual feedback, which tried to make the user feel as if touching a live character by conveying heat. One reason is that it takes a little time for users to perceive the heat from the heat sensors installed in "Emoract"; there may also be other causes, such as the difference between heat sensed in real life and heat felt through the device. In addition, the analysis of the IMU sensor data collected in real time showed that the hand gestures of group A, which received only vibration, kept average motion values for all 20 users without significant changes. In the case of group B, the users' hand gesture movement increased on average relative to group A after the second interaction interval, which delivered the electric stimulation and wind. This means that the users' interaction became more active when they received context-appropriate, non-vibration sensory feedback. Regarding the delivery of sensory feedback suitable for interaction to users in a virtual reality space, it was thus shown that there are situations in which instant feedback such as vibration is effective and others in which multimodal sensory information suited to the context of the story situation is effective. In addition, with context-appropriate multimodal sensory information, it could be inferred that as users become more immersed in the situation, the radius of their gesture movements increases and they become further absorbed in the situation.

6 Conclusion

This study proposed "Emoract", a multimodal sense interface that can be worn on both hands of a user in a VR environment and that measures the user's emotional response in real time. Unlike conventional VR interfaces, which provided simple feedback such as force feedback and vibration to increase the physical presence of the user, a variety of sense information was provided using the proposed interface, and the emotional effect of the sense feedbacks was investigated. Through "Lonely Noah", an interactive VR animation with a four-branch structure, users were divided into a vibration feedback group, as with existing VR interfaces, and a context-based sensory feedback group suited to the story and situation, other than vibration. Through this experiment, we found that there are scenes in which multimodal sensory feedback of electric stimulation, wind, and heat, provided according to the situation and context of the story, is effective, and scenes in which the existing vibration feedback yields more effective interaction. We also found that the range of hand gesture movements increases as users become immersed in the experience through the various sensory stimuli. Based on this study, we believe that when designing sensory feedback for virtual reality films or games, verifying these emotional responses can lead to content that provides a more effective physical presence. Since this study evaluated the emotional response based solely on the user's bio-signal information, further research and analysis will be necessary, including in-depth interviews of users and the association between visual and sensory feedback in virtual reality spaces.


References

1. Gartner: Gartner Top 10 Strategic Technology Trends for 2019 (2018)
2. Burdea, G.C., Coiffet, P.: Virtual Reality Technology. Wiley, New York (1980)
3. Minsky, M.: Telepresence: a manifesto. Omni Magazine, pp. 44–52 (1980)
4. Tachi, S.: Telexistence. World Scientific Publishing Company, Singapore (2009)
5. Guizzo, E.: When my avatar went to work. IEEE Spectrum 9, 24–30 (2010)
6. Lombard, M., Ditton, T.: At the heart of it all: the concept of presence. J. Comput. Mediated Commun. 3(2), JCMC321 (1997)
7. Gupta, R., Sheridan, T., Whitney, D.: Experiments using multimodal virtual environments in design for assembly analysis. Presence 6(3), 318–338 (1997)
8. Choi, I., Hawkes, E.W., Christensen, D.L., Ploch, C.J., Follmer, S.: Wolverine: a wearable haptic interface for grasping in virtual reality. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 986–993 (2016)
9. Kang, J., Lee, J., Jin, S.: Personal sensory VR interface utilizing wearable technology. In: 2018 International Conference on Information and Communication Technology Convergence (ICTC), pp. 546–548 (2018)
10. Teslasuit. https://teslasuit.io/. Accessed 21 Sept 2019
11. VRfree Glove. http://www.sensoryx.com/product/vrfree_glove_system/. Accessed 11 Aug 2019
12. Jacob, R.J., Girouard, A., Hirshfield, L.M., Horn, M.S., Shaer, O., Solovey, E.T., Zigelbaum, J.: Reality-based interaction: a framework for post-WIMP interfaces. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 201–210. ACM (2008)
13. Kang, J., Lim, J., Kim, C.: Emotion collector, a wearable multi-sensor band to recognize fear. J. Intell. Fuzzy Syst. 1–7 (2018, preprint)

Genetic Algorithms as a Feature Selection Tool in Heart Failure Disease

Asmaa Alabed1(B), Chandrasekhar Kambhampati2, and Neil Gordon2

1 Faculty of Science and Engineering, Computer Science Department, University of Hull, Hull, UK
[email protected]
2 Faculty of Science and Engineering, University of Hull, Hull, UK

Abstract. A great wealth of information is hidden in clinical datasets, which could be analyzed to support decision-making processes or to better diagnose patients. Feature selection is one of the data pre-processing steps that selects a set of input features by removing unneeded or irrelevant features. Various algorithms have been used in healthcare to solve such problems involving complex medical data. This paper demonstrates how Genetic Algorithms offer a natural way to perform feature selection on such data sets, where the fittest individual choice of variables is preserved over different generations. In this paper, a Genetic Algorithm is introduced as a feature selection method and shown to be effective in aiding understanding of such data. Keywords: Feature selection · Decision-making · Algorithms · Genetic algorithm

1 Introduction

The performance of pattern modeling and classification is greatly affected if the dataset has a very high dimensionality. At the same time, the computational complexity, both numerically and in terms of space, increases [1–4]. The rapid development of technology, and the corresponding ability to gather data, has led to an explosion in the size of datasets. This does not imply that all of the features/attributes in a dataset are necessary and sufficient in terms of the information required to determine patterns accurately and provide predictions. Feature selection methods can be used to identify and remove redundant or irrelevant features from a given dataset without loss of accuracy in predictions. At the same time, feature selection can provide an insight into the features in terms of their importance [1, 3]. Feature selection can be defined as the process of choosing a minimum subset of features from the original dataset where [3]:

• The classification accuracy does not significantly decrease
• The resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution, given all the features.


Feature selection algorithms consist of four key steps: subset generation, subset evaluation, stopping criteria and result validation [4, 5]. Subset generation is a heuristic search that generates a subset of features for evaluation procedures. Each generated subset is evaluated by certain evaluation criteria to determine the 'goodness' of the generated subset of features. The generated subset is validated by carrying out different tests and comparisons with the previous best subset: if the new subset is found to be better, it replaces the previous best subset. This process is repeated until the stopping criterion is reached, as shown in Fig. 1.

Fig. 1. Four steps for the feature selection process [3] (Original Dataset → Subset Generation → Subset Evaluation → Stopping Criterion; "No" loops back to subset generation, "Yes" leads to Result Validation)
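As an illustration of the four-step loop in Fig. 1 (and not of the experiments reported later in this paper), the sketch below implements it in Python with deliberately simple stand-ins: random subset generation as the search heuristic, cross-validated accuracy of a k-nearest-neighbour classifier as the evaluation criterion, a fixed iteration budget as the stopping criterion, and a held-out test set for result validation. The scikit-learn breast cancer dataset is used only as a placeholder for a clinical table.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in dataset, not the heart failure data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_features = X.shape[1]
best_subset, best_score = np.ones(n_features, dtype=bool), -np.inf

for _ in range(200):                                     # stopping criterion: fixed budget
    subset = rng.random(n_features) < 0.5                # subset generation (random heuristic)
    if not subset.any():
        continue
    score = cross_val_score(KNeighborsClassifier(),      # subset evaluation
                            X_train[:, subset], y_train, cv=5).mean()
    if score > best_score:                               # keep the better subset
        best_subset, best_score = subset, score

# result validation on data not used during the search
clf = KNeighborsClassifier().fit(X_train[:, best_subset], y_train)
print("selected features:", np.flatnonzero(best_subset))
print("cv score: %.3f, test accuracy: %.3f"
      % (best_score, clf.score(X_test[:, best_subset], y_test)))
```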

There are three approaches to feature selection: the filter, wrapper or embedded approach [1, 6–8]. Filter feature selection methods apply a statistical measure to assign a weight to each feature according to its degree of relevance. Filters independently measure the relevance of feature subsets to classifier outcomes, where each feature is evaluated with a measure such as the distance to outcome classes, correlation or Euclidean distance. All the features in the dataset are then ranked according to these measures. The advantages of filter methods are that they are fast, scalable and independent of a learning algorithm. The most distinguishing characteristic of filters is that the relevance index is calculated solely on a single feature without considering the values of other features [9]. Such an implementation implies that the filter assumes orthogonality of features, which is often not true in practice. Therefore, filters omit any conditional dependences (or independences) that might exist, which is known to be one of the weaknesses of filters. Wrapper methods use the predictor as a black box and the predictor performance as the objective function to evaluate the feature subset [1]. The expression wrapper approach covers the category of variable subset selection algorithms that apply a learning algorithm in order to conduct the search for the optimal or a near-optimal subset [10]. Since the number of possible subsets is 2^n, finding the best one is an NP-hard problem, so a suboptimal subset is selected by applying a search algorithm that finds a subset heuristically. The


embedded approach uses specific learning algorithms that perform feature selection in the process of training. An important aspect of using feature selection algorithms is that they can improve inductive learning, either in terms of general capabilities, learning speed or reducing the complexity of the induced model and classification accuracy [2]. Often a compromise is reached in achieving these various objectives in a feature selection approach. This work focuses on applying Genetic Algorithms (GAs) as a feature selection technique for heart failure data sets in order to improve the classification accuracy and reduce the number of features. The GA was tested as a 'wrapper' feature selection method. GAs are among the global methods for optimization and for searching in complex, large and multidimensional datasets [1, 9, 11–15]. First, the GA was built using different populations, generations, and neighborhood sizes (k). Secondly, the selected features from the best performing GA were tested again, using different populations and k values. Finally, the GA investigation was carried out by setting a population of up to 800. In terms of classification accuracy, two different classifiers were used, namely Bayes Nets (BN) and Random Forest (RF).

2 Genetic Algorithms (GAs) as a Feature Selection Tool

GAs are an optimization and search technique based on the natural theory of biological evolution (survival of the fittest) [1, 6, 7]. Over successive generations, the population "evolves" toward an optimal solution. The advantage of GAs over other methods is that they allow the best solution to emerge from the best of the prior solutions. The idea of GAs is to combine different solutions generation after generation to extract the best genes from each one. GAs can manage data sets with a large number of features and do not need any extra knowledge about the problem under study. The subsets of features selected by genetic algorithms are generally more efficient than those obtained by classical methods of feature selection, since they can produce a better result while using a lower number of features [16].

Fig. 2. Genetic Algorithms


The individuals in the genetic space are called chromosomes. A chromosome is a collection of genes, where genes can generally be represented by real values or by binary encoding. The number of genes is the total number of features in the data set. If genes are binary, each gene in a chromosome of the GA population has a value of 1 or 0. A value of 1 in a chromosome representation means that the corresponding feature is included in the specified subset; a value of 0 indicates that the corresponding feature is not included in the specified subset. Each solution in a genetic algorithm is represented through a chromosome. The collection of all chromosomes is called the 'population', as shown in Fig. 2. As a first step of a GA, an initial population of individuals is generated at random or heuristically. In each generation, the population is evaluated using fitness functions.
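The snippet below is a small sketch (not the paper's MATLAB code) of this binary encoding: each chromosome is a 0/1 vector of length 60, a decoding helper returns the indices of the included features, and a random initial population is created.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes = 60            # total number of features in the dataset
pop_size = 100          # one of the population sizes used in the paper

# each row is one chromosome; gene value 1 means "feature included"
population = rng.integers(0, 2, size=(pop_size, n_genes))

def decode(chromosome):
    """Return the indices of the features selected by a chromosome."""
    return np.flatnonzero(chromosome)

print(population[0])          # e.g. [0 1 1 0 ...]
print(decode(population[0]))  # indices of the selected features
```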

Fig. 3. GA as a feature selection [9]

Genetic Algorithms as a Feature Selection Tool in Heart Failure Disease

535

The next step is the selection process, wherein high-fitness chromosomes are used to eliminate low-fitness chromosomes. Better feature subsets have a greater chance of being selected to form a new subset through crossover or mutation. In this manner, good subsets are "evolved" over time [17]. Commonly used methods for reproduction or selection are Roulette-wheel selection, Boltzmann selection, Tournament selection, Rank selection, and Steady-state selection. The selected subsets are ready for reproduction using crossover and mutation. Crossover combines different features from a pair of subsets into a new subset, as shown in Fig. 2, and tends to create better strings. Mutation randomly changes some of the values (thus adding or deleting features) in a subset, as shown in Fig. 2. The newly generated population undergoes further selection, crossover, and mutation until the termination criterion is satisfied or the maximum number of generations is reached, as shown in Fig. 3.
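A compact sketch of one such generation step is given below. Since the chromosomes here are binary, the illustration uses single-point crossover and bit-flip mutation together with elitism and fitness-proportionate selection; the paper's own runs used the MATLAB toolbox's arithmetic crossover and uniform mutation (Table 2), so the operators shown are stand-ins, not the exact ones used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def next_generation(population, fitness, elite_count=2,
                    crossover_rate=0.8, mutation_rate=0.02):
    """One GA generation for binary chromosomes. Assumes larger fitness values
    are better (e.g. 1/(1+error) if the raw score is an error to be minimized)."""
    pop_size, n_genes = population.shape
    order = np.argsort(fitness)[::-1]
    new_pop = [population[i].copy() for i in order[:elite_count]]   # elitism

    # fitness-proportionate (roulette wheel) selection probabilities
    probs = fitness - fitness.min() + 1e-9
    probs = probs / probs.sum()

    while len(new_pop) < pop_size:
        i, j = rng.choice(pop_size, size=2, p=probs)                # select parents
        a, b = population[i].copy(), population[j].copy()
        if rng.random() < crossover_rate:                           # single-point crossover
            point = rng.integers(1, n_genes)
            a[point:], b[point:] = b[point:].copy(), a[point:].copy()
        for child in (a, b):                                        # bit-flip mutation
            flip = rng.random(n_genes) < mutation_rate
            child[flip] = 1 - child[flip]
            new_pop.append(child)
    return np.array(new_pop[:pop_size])
```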

3 Genetic Algorithms (GAs) Experiments

In this experiment, the Matlab GA toolbox is used. The GA starts by creating a random population, which is then evaluated using a fitness function. The elite children are then passed automatically to the next generation, and the remaining children in the current population are allowed to pass genetically through the crossover and mutation functions to form a new generation [13]. The dataset is a real-life heart failure dataset. In this dataset, there are 60 features for 1944 patient records. The class is "dead" or "alive". The data sets were imputed by different methods such as Concept Most Common Imputation (CMCI) and Support Vector Machine (SVM). Different classification methods were applied to these datasets to select which dataset would be used for training [18]. The performance of these datasets was measured using accuracy, sensitivity, and specificity. The SVM-imputed dataset was chosen since its accuracy, sensitivity and specificity were the best. The experiments were designed using Weka (version 3.8.1-199-2016). The classifiers tested were Bayes net, random forest, decision tree, REP tree and J48. In this work, BN and RF were selected as classifiers since their accuracy was the highest, as shown in Table 1. A list of all features considered is given in Table 7 (Appendix A).

Table 1. Imputed dataset (SVM)

Classification algorithms: Accuracy / Sensitivity / Specificity
J48: 77.8% / 86.09% / 52.99%
Random Forest: 84.72% / 96.78% / 48.45%
Decision Tree: 83.6% / 95.27% / 48.87%
REP tree: 81.2% / 92.66% / 46.8%
Bayes.Net: 87.34% / 89.1% / 82.06%

GA parameters are shown in Table 2.

Table 2. GA parameters

Number of features: 60
Population size: 50, 75, 100
Genome length: 60
Population type: Bit strings
Fitness function: kNN-based classification error
Number of generations: 100, 130
Crossover: Arithmetic crossover
Mutation: Uniform mutation
Selection scheme: Roulette wheel
Elite count: 2

As discussed above, the number of chromosomes used in a particular implementation is of particular interest in evolutionary computation [14, 19, 20]. Various results about the appropriate population size can be found in the literature [21, 22]. Researchers usually argue that a "small" population size could guide the algorithm to poor solutions [23–25] and that a "large" population size could make the algorithm expend more computation time in finding a solution [23, 25, 26]. For a GA to select a feature subset, a fitness function must be defined to evaluate the fitness of each subset. In this work, the fitness function was based on Oluleye's fitness function [14], which is based on error minimization and reducing the number of features. The fitness of each chromosome in the population is evaluated using the kNN-based fitness function as defined in FSP1. The kNN algorithm computes the Euclidean distance between the test data and the training set and then finds the nearest points from the training set to the test set. The individuals are evaluated and their fitness is ranked based on the kNN-based classification error. Individuals with minimum error have a better chance of surviving into the next generation. This ensures that the GA reduces the error rate and picks the individual with the best fitness, which also reduces the number of features. The model representation for kNN is the entire training dataset. Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbours) and summarizing the output variable for those K instances; for classification problems, this is typically the mode (most common) class value. Roulette wheel selection was used as the selection method for these experiments, as discussed in the earlier section. With roulette wheel selection, each individual is assigned a 'slice' of the wheel in proportion to its fitness value; therefore, the fitter an individual is, the larger the slice of the wheel. The wheel is simulated by normalization of the fitness values of the population of individuals.
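The two ingredients described here, the kNN-error fitness and the roulette wheel, could look roughly as follows in Python; the validation split, the value k = 5, and the transformation of error into a wheel slice via 1/(1 + error) are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def knn_error(chromosome, X_train, y_train, X_val, y_val, k=5):
    """kNN-based fitness: classification error on a validation split using only
    the features switched on in the chromosome (a sketch of the idea, not the
    exact MATLAB fitness function used in the paper). Assumes integer labels."""
    cols = np.flatnonzero(chromosome)
    if cols.size == 0:
        return 1.0                                   # empty subsets get the worst error
    A, B = X_train[:, cols], X_val[:, cols]
    # Euclidean distances between every validation point and every training point
    d = np.sqrt(((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]
    pred = np.array([np.bincount(v).argmax() for v in votes])   # majority vote
    return float(np.mean(pred != y_val))

def roulette_select(errors, n_parents, rng):
    """Roulette wheel selection: individuals with lower error get a larger slice;
    error is turned into a fitness of 1/(1+error) (an assumed transformation)
    and the slices are obtained by normalisation."""
    fitness = 1.0 / (1.0 + np.asarray(errors))
    probs = fitness / fitness.sum()
    return rng.choice(len(errors), size=n_parents, p=probs)
```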


4 Results and Discussions

In this work, different population sizes were tested to find the optimal size. The optimal accuracy was achieved using the GA where the population is 100 and k = 5, as shown in Table 3. The number of features dropped from 60 to 27. As k is increased, the accuracy changes as well, as shown in Table 3; the researcher should try different values of k to reach the optimal solution. The BN accuracy was 87.8%, which can be interpreted as 12.2% of cases being falsely classified.

Table 3. The performance of classification algorithms using various GA variables

RF accuracy
Population 100: K=3: 84.82%, K=5: 85.03%, K=9: 85.4%
Population 75: K=3: 83.75%, K=5: 80.76%, K=9: 84.51%
Population 50: K=3: 86.7%, K=5: 85%, K=9: 83.69%

BN accuracy
Population 100: K=3: 83.79%, K=5: 87.8%, K=9: 86.21%
Population 75: K=3: 84.92%, K=5: 82.25%, K=9: 83.84%
Population 50: K=3: 86.7%, K=5: 83.07%, K=9: 85.18%

The number of features was 60. For kNN, the trick is in how to determine the similarity between data instances. The simplest technique, if the attributes are all on the same scale (all in inches, for example), is to use the Euclidean distance, which can be calculated directly from the differences between each input variable. In this case that is not possible, because the features are recorded on different scales. Moreover, the idea of distance or closeness can break down in very high dimensions (many input variables), which can negatively affect the performance of the algorithm on this problem; this is called the curse of dimensionality. In order to improve the GA performance, it is suggested to use only those input variables that are most relevant to predicting the output variable [27, 28]. In the next experiments, the features selected by the GA where the accuracy was highest (population 100, generation 130, k = 5) were tested, and the results are shown in Table 4. The BN accuracy was 86.77%, which can be interpreted as 13.23% of cases being falsely classified. The sensitivity of 91.02% can be interpreted as the algorithm predicting 8.98% of cases as alive when they should have been predicted as dead. The specificity of 74.02% can be interpreted as the algorithm predicting 25.98% false positives (alive). The performance of the GA did not improve significantly regarding accuracy; however, the number of selected features was reduced from 27 to 14, as shown in Table 4.
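A tiny numerical illustration of why raw Euclidean distances are misleading here: with hypothetical values for three attributes on very different scales, the distance is dominated by the blood-pressure column until the attributes are rescaled (the ranges used for min-max scaling below are invented for the example).

```python
import numpy as np

# two hypothetical patients: (urea mmol/L, white cell count 10^9/L, systolic BP mmHg)
a = np.array([ 7.0,  8.0, 120.0])
b = np.array([ 9.0, 12.0, 160.0])

print(np.linalg.norm(a - b))        # ~40.2, driven almost entirely by the BP column

lo = np.array([ 2.0,  2.0,  80.0])  # assumed attribute ranges for min-max scaling
hi = np.array([20.0, 30.0, 200.0])
a_s, b_s = (a - lo) / (hi - lo), (b - lo) / (hi - lo)
print(np.linalg.norm(a_s - b_s))    # ~0.38, all three attributes now contribute
```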

Table 4. The results of the GA for different generations and k, using the 27 features

GA 100, 130, k = 3 (selected features: 2, 4, 6, 16, 21, 23, 31, 32, 34, 39, 41, 46)
  Random Forest: Accuracy 83.02%, Sensitivity 93.35%, Specificity 51.95%
  Bayes.Net: Accuracy 86.36%, Sensitivity 91.09%, Specificity 72.16%

GA 100, 130, k = 5 (selected features: 4, 6, 16, 21, 23, 24, 25, 31, 32, 34, 39, 45, 48, 55)
  Random Forest: Accuracy 84.92%, Sensitivity 94.65%, Specificity 55.67%
  Bayes.Net: Accuracy 86.77%, Sensitivity 91.02%, Specificity 74.02%

GA 100, 130, k = 9 (selected features: 6, 16, 21, 23, 24, 25, 32, 34, 39, 41, 55)
  Random Forest: Accuracy 83.84%, Sensitivity 94.03%, Specificity 53.19%
  Bayes.Net: Accuracy 84.49%, Sensitivity 90.88%, Specificity 67.21%

GA 75, 130, k = 3 (selected features: 4, 21, 23, 31, 32, 34, 39, 40, 41, 55)
  Random Forest: Accuracy 83.02%, Sensitivity 93.35%, Specificity 49.89%
  Bayes.Net: Accuracy 86.36%, Sensitivity 92.39%, Specificity 60.61%

GA 75, 130, k = 5 (selected features: 4, 6, 16, 21, 23, 25, 31, 32, 42, 46, 55)
  Random Forest: Accuracy 82.71%, Sensitivity 94.24%, Specificity 50.72%
  Bayes.Net: Accuracy 84.46%, Sensitivity 91.02%, Specificity 69.27%

GA 75, 130, k = 9 (selected features: 6, 16, 23, 24, 25, 32, 39, 41, 55)
  Random Forest: Accuracy 84%, Sensitivity 94.04%, Specificity 53.81%
  Bayes.Net: Accuracy 85.39%, Sensitivity 91.91%, Specificity 65.77%

GA 50, 130, k = 3 (selected features: 2, 4, 6, 21, 23, 31, 32, 39, 41)
  Random Forest: Accuracy 82.20%, Sensitivity 93.42%, Specificity 48.45%
  Bayes.Net: Accuracy 84.00%, Sensitivity 91.43%, Specificity 61.64%

GA 50, 130, k = 5 (selected features: 4, 6, 24, 28, 31, 32, 34, 39, 40, 41, 46, 48)
  Random Forest: Accuracy 83.12%, Sensitivity 93.35%, Specificity 52.57%
  Bayes.Net: Accuracy 85.85%, Sensitivity 90.95%, Specificity 70.51%

GA 50, 130, k = 9 (selected features: 4, 7, 16, 21, 23, 23, 28, 32, 34, 40, 41, 45, 46, 48)
  Random Forest: Accuracy 84.1%, Sensitivity 96.02%, Specificity 48.24%
  Bayes.Net: Accuracy 85.03%, Sensitivity 90.95%, Specificity 67.21%

The population size was increased to 400, 600, and 800 in order to investigate whether there would be any improvement in the GA performance. Table 5 shows the accuracy for these runs; the optimal accuracy is 86.3%, which is less than the 87.8% achieved using a population of 100. The results showed that these runs took a long time and selected almost the same number of features.


Table 5. GA results for populations of 400, 600 and 800, with k = 3

Genetic algorithm (400, 100), K = 3 (selected features: 1, 5, 7, 13, 15, 16, 17, 18, 24, 28, 39, 31, 33, 34, 38, 39, 40, 42, 45, 46, 49, 55, 59, 60)
  Random Forest: Accuracy 85.03%, Sensitivity 95.54%, Specificity 53.40%
  Bayesian Networks: Accuracy 86.3%, Sensitivity 88.8%, Specificity 78.96%

Genetic algorithm (600, 100), K = 3 (selected features: 1, 7, 10, 11, 14, 15, 16, 17, 19, 23, 24, 25, 28, 29, 30, 32, 33, 39, 40, 43, 46, 48, 50, 52, 53, 55, 56, 58, 59, 60)
  Random Forest: Accuracy 84.77%, Sensitivity 96.23%, Specificity 50.3%
  Bayesian Networks: Accuracy 85.75%, Sensitivity 88.8%, Specificity 76.9%

Genetic algorithm (800, 100), K = 3 (selected features: 1, 2, 3, 5, 6, 7, 8, 14, 19, 20, 28, 30, 35, 37, 38, 42, 45, 47, 48, 50, 54, 59, 60)
  Random Forest: Accuracy 82.76%, Sensitivity 94.37%, Specificity 47.83%
  Bayesian Networks: Accuracy 82.30%, Sensitivity 87.11%, Specificity 67.83%

Al Khaldy [28] investigated several feature selection methods, including wrapper and filter methods, and used a representative set of classification methods for evaluating the features selected. These methods enabled the identification of a core set of features from the same dataset. As shown in Table 6, there are many common features between his findings and this work.

Table 6. Common factors (GA / Al Khaldy)

Urea (mmol/L): 1 / 4
Uric acid (mmol/L): 1 / 4
MCV (fL): 1 / 5
Iron (umol/L): 1 / 6
Ferritin (ug/L): 1 / 4
CRP (mg/L): 1 / 3
White cell count: 1 / 2
CT-proET1: 1 / 7
LVEDD (HgtIndexed): 1 / 6
E: 1 / 3
Height (Exam)(m): 1 / 2
PCT: 1 / 1
MR-proADM: 1 / 5
FVC (L): 1 / 6

5 Conclusions

The experiments in this paper demonstrate the feasibility of using a GA as a feature selection tool for large data sets. While the number of features was reduced from 60 to 27 using the GA, the accuracy, at 87.8%, remained almost the same. In order to improve the GA performance, only the input variables most relevant to predicting the output variable (the 27 selected features) were then used. Whilst the performance of the GA did not improve significantly regarding accuracy, the number of selected features was reduced from 27 to 14, thus identifying the most important features. The GA picked up the three variables that are used by clinicians in diagnosing heart failure [29], namely Urea, Uric acid and Creatinine. In order to validate the performance of the GA, different feature selection experiments were carried out using the WEKA tool, showing that this is a viable technique for such problems.

Appendix A

Table 7. List of all features considered

1. Age                          31. MR-proADM
2. Sodium (mmol/L)              32. CT-proET1
3. Potassium (mmol/L)           33. CT-proAVP
4. Chloride (mmol/L)            34. PCT
5. Bicarbonate (mmol/L)         35. Rate (ECG) (bpm)
6. Urea (mmol/L)                36. QRS width (ms)
7. Creatinine (umol/L)          37. QT
8. Calcium (mmol/L)             38. LVEDD (cm)
9. Adj Calcium (mmol/L)         39. LVEDD (HgtIndexed)
10. Phosphate (mmol/L)          40. BSA (m2)
11. Bilirubin (umol/L)          41. Left Atrium (cm)
12. Alkaline Phosphatase (iu/L) 42. Left Atrium (BSAIndexed)
13. ALT (iu/L)                  43. Left Atrium (HgtIndexed)
14. Total Protein (g/L)         44. Aortic velocity (m/s)
15. Albumin (g/L)               45. E
16. Uric acid (mmol/L)          46. Height (Exam) (m)
17. Glucose (mmol/L)            47. Weight (Exam) (kg)
18. Cholesterol (mmol/L)        48. BMI
19. Triglycerides (mmol/L)      49. Pulse (Exam) (bpm)
20. Haemoglobin (g/dL)          50. Systolic BP (mmHg)
21. White cell count (10^9/L)   51. Diastolic BP (mmHg)
22. Platelets (10^9/L)          52. Pulse BP (mmHg)
23. MCV (fL)                    53. Pulse BP (mmHg)
24. Hct (fraction)              54. FEV1 (L)
25. Iron (umol/L)               55. FEV1 Predicted (L)
26. VitaminB12 (ng/L)           56. FEV1
27. Ferritin (ug/L)             57. FVC (L)
28. CRP (mg/L)                  58. FVC Predicted (L)
29. TSH (mU/L)                  59. FVC
30. MR-proANP                   60. PEFR (L)

References

1. Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Engineering 40(1), 16–28 (2014)
2. Panthong, R., Srivihok, A.: Wrapper feature subset selection for dimension reduction based on ensemble learning algorithm. Procedia Comput. Sci. 72, 162–169 (2015)
3. Dash, M., Liu, H.: Feature selection methods for classifications. Intell. Data Anal. 1(3), 131–156 (1997)
4. Kumar, V., Minz, S.: Feature selection: a literature review. Smart Comput. Rev. 4(3), 211–229 (2014)
5. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
6. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 4–37 (2000)
7. Cai, J., Luo, J., Wang, S., Yang, S.: Feature selection in machine learning: a new perspective. Neurocomputing 300, 70–79 (2018)
8. Shikhpour, R., Sarram, M.A., Gharaghani, S., Ali, M., Chahooki, Z.: A survey on semi-supervised feature selection methods. Pattern Recognit. 64, 141–158 (2017)
9. Masilamani, A., Anupriya, Iyenger, N.: Enhanced prediction of heart disease with feature subset selection using genetic algorithm. Int. J. Eng. Sci. Technol. 2(10), 5370–5376 (2010)
10. Kohavi, R., John, G.H.: The wrapper approach. In: Liu, H., Motoda, H. (eds.) Feature Extraction, Construction and Selection, p. 33. Kluwer Academic Publisher (1998)
11. Tiwari, R., Singh, M.P.: Correlation-based attribute selection using genetic algorithm. Int. J. Comput. Appl. 4(8), 28–34 (2010)
12. Karthikeyan, T., Thangaraju, P.: Genetic algorithm based CFS and Naïve Bayes algorithm to enhance the predictive accuracy. Indian J. Sci. Technol. 8(27) (2015)
13. Oluleye, B., Armstrong, L.J., Leng, J., Diepeveen, D.: Zernike moments and genetic algorithm: tutorial and application. Br. J. Math. Comput. Sci. 4(15), 2217–2236 (2014)
14. Alander, J.T.: On optimal population size of genetic algorithms. In: Proceedings of the IEEE Computer Systems and Software Engineering, pp. 65–69 (1992)
15. Jabbar, M., Deekshatulu, B.L., Chandra, P.: Classification of heart disease using K-nearest neighbor and genetic algorithm. In: International Conference on Computational Intelligence: Modeling Techniques and Applications, vol. 10, pp. 85–94 (2013)
16. Boggia, R., Riccardo, L., Marco, T.: Genetic algorithms as a strategy for feature selection. J. Chemom. 6(5), 267–281 (1992)
17. Siedlecki, W., Sklansky, J.: A note on genetic algorithms for large-scale feature selection. Pattern Recognit. Lett. 10, 335–347 (1989)
18. Khaldy, M., Kambhampati, C.: Performance analysis of various missing value imputation methods on heart failure dataset. In: SAI Intelligent Systems Conference, London, UK, 20–22 September (2016)
19. Diaz-Gomez, P.A., Hougen, D.F.: Initial population for genetic algorithms: a metric approach. In: Proceedings of the 2007 International Conference on Genetic and Evolutionary Methods, GEM 2007, Las Vegas, June 2007
20. Piszcz, A., Soule, T.: Genetic programming: optimal population sizes for varying complexity problems. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 953–954 (2006)
21. Reeves, C.R.: Using genetic algorithms with small populations. In: Forrest, S. (ed.) Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 92–99. Morgan Kaufmann, San Mateo (1993)
22. Roeva, O.: Improvement of genetic algorithm performance for identification of cultivation process models. In: Advanced Topics on Evolutionary Computing, Book Series: Artificial Intelligence Series-WSEAS, pp. 34–39 (2008)
23. Koumousis, V.K., Katsaras, C.P.: A sawtooth genetic algorithm combining the effects of variable population size and reinitialization to enhance performance. IEEE Trans. Evol. Comput. 10(1), 19–28 (2006)
24. Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: Bayesian optimization algorithm, population sizing, and time to convergence. Illinois Genetic Algorithms Laboratory, University of Illinois, Technical report (2000)
25. Lobo, F.G., Goldberg, D.E.: The parameterless genetic algorithm in practice. Inf. Sci. Inform. Comput. Sci. 167(1–4), 217–232 (2004)
26. Lobo, F.G., Lima, C.F.: A review of adaptive population sizing schemes in genetic algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 228–234 (2005)
27. Raymer, M.L., Punch, W.F., Goodman, E.D., Khun, L.A., Jain, A.K.: Dimensionality reduction using genetic algorithm. IEEE Trans. Evol. Comput. 4(2), 164–171 (2000)
28. Al Khaldy, M.: Clinical data issues and autoencoder framework to compress datamining methodology. Ph.D. thesis, University of Hull, June 2017
29. Kirke, L.: Datamining for heart failure: an investigation into the challenges in real life clinical datasets. Ph.D. thesis, The University of Hull, June 2015

Application of Additional Argument Method to Burgers Type Equation with Integral Term

Talaibek Imanaliev(B) and Elena Burova

American University of Central Asia, Bishkek, Kyrgyz Republic
{imanaliev t,burova e}@auca.kg
http://auca.kg

Abstract. The application of the Additional Argument Method to a Burgers type equation with an integral term on the right-hand side is considered. A scheme of the method for the Burgers type equation is constructed. The validity of the scheme construction was proved using the Computer Algebra System (CAS) Maple, demonstrating the capability of modern CAS to prove mathematical theorems. Graphical solutions of a series of sample equations were constructed. Keywords: Partial differential equation · Burgers type · Integral term · Additional argument method · Computer algebra system

1 Introduction

A new method for studying partial differential equations, later called "the Additional Argument Method" (AAM), was developed in the works [1–5]. This method allows reducing a partial differential equation to a system of integral equations, which is much easier to analyze in terms of the existence and uniqueness of solutions. Naturally, the idea appeared of applying the method to investigate classic problems. One of them is the Burgers equation, which is a particular one-dimensional case of the Navier–Stokes equation. Besides its applications in hydrodynamics, the Burgers equation has applications in a wide variety of fields. For example, the Burgers equation is used in macroeconomics to model the development of the "World Economy" system [6]:

dY'/dt + Y' · dY'/dL_Q = K_S · d²Y'/dL_Q²,

where t is the time interval of consideration,



K_S is the self-organization coefficient: a structural characteristic, the parameter which describes the economic usefulness and effectiveness of the structure of the political system and characterizes the minimization of dissipation and the capability to optimize the distribution of resources between industry and consumption. Y is the production of goods in the time interval (GDP); Y' = dY/dt is the output speed, or economic growth, in the time interval; Y'' = d²Y/dt² is the rate of economic growth; L_Q = L·K_N is skilled labor, the population of the country taking labor qualification into account, where L is the population, L' is the population growth rate, N is the population with higher education, and K_N is the coefficient of qualification of the work of the public system, a characteristic of the growth of structural information expressed by newly created knowledge. Creating new knowledge is the intellectual labour of the population with higher education, expressed by the growth of the population with higher education N.

Another example of an application of the Burgers equation is single-lane transport stream modelling. This macroscopic (hydrodynamic) model was developed by Whitham ([7], 1974). In contrast to the previously proposed models, based on the law of conservation of the number of vehicles, the "farsightedness" of drivers was taken into account [10]:

∂p/∂t + ∂Q(p)/∂x = ∂/∂x ( D(p) ∂p/∂x ),   (1)

where Q(p) = p v(p); p(t, x) is the amount of transport per unit length at time t in the neighborhood of a route point x; v(t, x) is the velocity at time t in the neighborhood of a route point. The left-hand side of Eq. (1) represents the conservation law for transport. The diffusion term on the right-hand side corresponds to the fact that drivers reduce speed when the traffic density ahead increases and increase speed when it decreases.

In the work [2] the scheme of the additional argument method is implemented for various classes of equations. In this paper the solution of a particular equation is verified. Solving this problem is quite complicated and time-consuming; therefore, the decision was taken to use the CAS Maple to facilitate verification of the correctness of the solution. In the following part of the paper the theoretical basis of the Additional Argument Method is given, and a detailed proof of the theorem using Maple is provided. A test example with a known analytical solution is compared to the solution obtained by AAM. Then the solution of the problem with an integral term on the right-hand side is provided. This theoretical analysis is followed by sample equations with numerical solutions and their graphs.

2 Construction of a Solution Scheme

The scheme for the Burgers equation in the classical form was considered in [8]. In this paper, we consider a scheme for constructing a solution for a more general case. Consider the initial value problem for the Burgers type equation:

u_t(t, x) + u(t, x) u_x(t, x) = μ u_xx(t, x) + ∫_0^t u(s, x) ds,   (2)

u(0, x) = ϕ(x),   (3)

where 0 < μ < 1 is a constant and D = {x ∈ R, t ∈ R_+}. According to the AAM, the problem (2)–(3) is equivalent to the integral equation

u(t, x) = ϕ(x − ∫_0^t v(s, t, x) ds) + μ ∫_0^t v_xx(s, s, p(s, t, x)) ds + ∫_0^t ∫_0^q v(s, s, p(q, t, x)) ds dq,   (4)

where the functions v(s, t, x) and p(s, t, x) are defined from the system

v(τ, t, x) = ϕ(x − ∫_0^t v(s, t, x) ds) + μ ∫_0^τ v_xx(s, s, p(s, t, x)) ds + ∫_0^τ ∫_0^q v(s, s, p(q, t, x)) ds dq,   (5)

p(τ, t, x) = x − ∫_τ^t v(s, t, x) ds.

Theorem 1. Let ϕ(x) ∈ C^(2)(R^n) ∩ Lip(L),

∂v(τ, t, x)/∂t + u(t, x) ∂v(τ, t, x)/∂x = 0,   v(t, t, x) = u(t, x),

∂p(τ, t, x)/∂t + u(t, x) ∂p(τ, t, x)/∂x = 0,   p(t, t, x) = x.

Then the system (4)–(5) is a solution to the problem (2)–(3), and there exists T > 0 such that the solution u(t, x) ∈ C((0, T] × R) of (2)–(3) exists and is unique.

An algorithm to solve the system (4)–(5) was created in CAS Maple. The algorithm is based on the fixed-point iteration principle, and the correctness of the constructed solution is proved in the corresponding Maple code.
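The authors' Maple worksheet itself is not reproduced in this text. As a rough illustration of the fixed-point idea behind the system (4)–(5), the sketch below performs a single Picard step symbolically in Python/SymPy for the test function ϕ(x) = μ sin(x) used later in Example 2; the starting approximations v0 = ϕ(x), p0 = x and the helper function `at` are assumptions of this sketch, not part of the paper.

```python
import sympy as sp

t, tau, s, q, x = sp.symbols('t tau s q x', real=True)
mu = sp.symbols('mu', positive=True)
phi = lambda z: mu * sp.sin(z)

# zeroth approximation: v0(tau, t, x) = phi(x), p0(tau, t, x) = x
v0, p0 = phi(x), x

def at(expr, tau_val, t_val, x_val):
    """Evaluate an approximation expr(tau, t, x) at the given arguments."""
    return expr.subs({tau: tau_val, t: t_val, x: x_val}, simultaneous=True)

# p1(tau, t, x) = x - int_tau^t v0(s, t, x) ds
p1 = x - sp.integrate(at(v0, s, t, x), (s, tau, t))

# v1(tau, t, x) = phi(x - int_0^t v0 ds) + mu * int_0^tau v0_xx(s, s, p0(s,t,x)) ds
#                 + int_0^tau int_0^q v0(s, s, p0(q,t,x)) ds dq
v0_xx = sp.diff(v0, x, 2)
v1 = (phi(x - sp.integrate(at(v0, s, t, x), (s, 0, t)))
      + mu * sp.integrate(at(v0_xx, s, s, at(p0, s, t, x)), (s, 0, tau))
      + sp.integrate(sp.integrate(at(v0, s, s, at(p0, q, t, x)), (s, 0, q)), (q, 0, tau)))

print(sp.simplify(p1))   # equals x - mu*(t - tau)*sin(x)
print(sp.simplify(v1))   # equals mu*sin(x - mu*t*sin(x)) + mu*(tau**2/2 - mu*tau)*sin(x)
```

Further iterations repeat the same pattern, which is exactly the repetitive symbolic work the authors delegate to Maple.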


The values of v(s, t, x) and p(s, t, x) are substituted into (4), and the analytic solution to the original problem (2)–(3) is obtained. It should be noted that the analytical solution is very voluminous and cannot be given within the framework of this paper. However, the solution can be represented graphically.

3 Example 1. Test for Analytic Solution

As is known, the Burgers equation has two basic types of solutions [9]: the stationary (self-similar) solution and the travelling wave type solution. When investigating the behavior of the obtained solution, the question arose as to which type the solution belongs. To answer this question, let us compare the solution obtained by AAM with the well-known solutions of the classic Burgers equation. In particular, let us consider the IVP

u_t(t, x) + u(t, x) u_x(t, x) = μ u_xx(t, x),   u(0, x) = 1 − 2μ tanh(1 + x).

Its solution can be found either by the well-known Hopf–Cole transformation [11, 12] or by standard procedures of Maple:

u(t, x) = 1 − 2μ tanh(1 + x − t).

Solving the same problem by the method of an additional argument, we obtain analytical results, presented in the form of animated graphs. Analyzing the behavior of the obtained solution at different instants of time t and different values of μ, we came to the following results.
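Because the exact travelling-wave solution is known here, it can also be cross-checked with an ordinary explicit finite-difference integration, independently of both AAM and Maple. The following short sketch is not from the paper; the grid, time step and far-field boundary treatment are assumptions chosen only so that the run is stable.

```python
import numpy as np

mu = 0.1
x = np.linspace(-10.0, 10.0, 801)
dx = x[1] - x[0]
T = 2.0
dt = 0.4 * min(dx, dx**2 / (2.0 * mu))       # conservative explicit stability limit
nt = int(round(T / dt))
dt = T / nt                                  # land exactly on t = T

u = 1.0 - 2.0 * mu * np.tanh(1.0 + x)        # initial condition
for _ in range(nt):
    ux  = (u[2:] - u[:-2]) / (2.0 * dx)                  # central difference
    uxx = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
    u[1:-1] += dt * (-u[1:-1] * ux + mu * uxx)
    u[0], u[-1] = 1.0 + 2.0 * mu, 1.0 - 2.0 * mu         # far-field values of the wave

exact = 1.0 - 2.0 * mu * np.tanh(1.0 + x - T)
print("max |numerical - exact| =", np.abs(u - exact).max())
```

The printed discrepancy should stay small for this smooth profile, which is the same qualitative agreement seen between the dashed and solid curves in Fig. 1(a).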

Fig. 1. Solutions for µ = 0.001 (a) and µ = 0.01 (b)


The graph in Fig. 1(a) is a snapshot of the animation executed in Maple. It shows that the solution obtained by AAM (dashed line) coincides with the exact analytic solution (solid line). It has the form of a travelling wave. The graph has no shocks for big values of time t and spatial variable x due to the vanishing viscosity μ = 0.001, which corresponds to the physical significance of the equation. This example illustrates that AAM is applicable to these types of equations. The graph in Fig. 1(b) shows the appearance of a shock at sufficiently large values of time t and spatial variable x. Maple allows analysis of pre-shocks and after-shocks.

Fig. 2. Solutions for µ = 0.1 (a, b) and µ = 1 (c)

For the larger viscosity value μ = 0.1, as shown in Fig. 2, shocks appear at the very beginning of the wave, which corresponds to the physical significance of the problem. Even in this case AAM gives a solution corresponding to the classic results, and there exists an interval of existence and uniqueness. The limiting value μ = 1 has theoretical significance only; the shocks appear at small t. Here one can see the fulfillment of the Maxwell rule: the loops to the right and left of the shock should have equal areas.

4 Example 2. Application for Burgers Type Equation with Integral Term

Consider the initial value problem for the Burgers type equation:

u_t(t, x) + u(t, x) u_x(t, x) = μ u_xx(t, x) + ∫_0^t u(s, x) ds,

u(0, x) = μ sin(x).


To find the solution of this problem we will use the following system of integro-differential equations:

u(t, x) = μ sin(x − ∫_0^t v(s, t, x) ds) + μ ∫_0^t v_xx(s, s, p(s, t, x)) ds + ∫_0^t ∫_0^q v(s, s, p(q, t, x)) ds dq,

v(τ, t, x) = μ sin(x − ∫_0^t v(s, t, x) ds) + μ ∫_0^τ v_xx(s, s, p(s, t, x)) ds + ∫_0^τ ∫_0^q v(s, s, p(q, t, x)) ds dq,

p(τ, t, x) = x − ∫_τ^t v(s, t, x) ds.

Fig. 3. Solutions for µ = 0.01 (a) and µ = 0.05 (b)

There is no known exact solution for the Burgers equation with an integral term. The case with integration with respect to x on [0, 1] and positive values of the integral corresponds to the Aubry–Mather theory [13]; it represents a chemical reaction of particles within a tube and is mathematically known as the forced Burgers equation. We suggest integration with respect to the time variable t. If we consider a transport stream model, this corresponds to the behavior of drivers in the case of an accident within a time interval. In Fig. 3(a) we provide solutions obtained by AAM using Maple for μ = 0.01 and different time values. For these values of time t there are no shocks. The choice of the initial function is determined by the paper [14], in which Sobolevski analyzed periodic solutions for positive integral terms.


With the increased viscosity μ = 0.05 one can observe shocks at t = 6. The graph shows that the Maxwell rule holds as well. For the lower values t = 2 and t = 4 the solution remains continuous and bounded.

5 Example 3. Application for Burgers Type Equation with Integral Term

Consider the initial value problem for the Burgers type equation:

$$u_t(t,x) + u(t,x)\,u_x(t,x) = \mu u_{xx}(t,x) + \int_0^t u(s,x)\,ds, \qquad u(0,x) = \tanh(x).$$

To find the solution of this problem we will use the following system of integro-differential equations:

$$u(t,x) = \tanh\Big(x - \int_0^t v(s,t,x)\,ds\Big) + \mu\int_0^t v_{xx}(s,s,p(s,t,x))\,ds + \int_0^t\!\!\int_0^q v(s,s,p(q,t,x))\,ds\,dq,$$

$$v(\tau,t,x) = \tanh\Big(x - \int_0^t v(s,t,x)\,ds\Big) + \mu\int_0^\tau v_{xx}(s,s,p(s,t,x))\,ds + \int_0^\tau\!\!\int_0^q v(s,s,p(q,t,x))\,ds\,dq,$$

$$p(\tau,t,x) = x - \int_\tau^t v(s,t,x)\,ds.$$

Fig. 4. Solutions for µ = 0.001 (a) and µ = 0.1 (b)


As expected, with the vanishing viscosity μ = 0.001 shown in Fig. 4, the time span is very wide and shock-free. The initial function tanh(x) is a natural initial condition for the Burgers equation. The graph also shows that even for smooth initial conditions shocks appear as the viscosity μ increases; in this case a shock appears at t = 3. The Maxwell rule holds, and Theorem 1 on existence and uniqueness also holds.

6 Conclusion

In this paper we proved that the Additional Argument Method is applicable to the Burgers type equation with an integral term. Sufficient conditions for the existence and uniqueness of a continuous and bounded solution are provided. The proof was supported by the Maple software package. We provided several examples, all of them solved symbolically using Maple. The scope of this paper does not allow us to present all of them here; we can provide them to interested readers on request. We demonstrated some graphs of these solutions in the paper, and animations of the graphs in the presentation. We have chosen illustrative examples with different initial conditions. The Burgers equation is considered a test equation for the Navier–Stokes equation and possesses known analytic solutions; the solutions found by the Additional Argument Method agree with them. We now plan to use this method to analyze the Clay Mathematics Institute millennium problem "Navier–Stokes existence and smoothness" with the use of the power of symbolic computation in Maple.

References
1. Imanaliev, M.I., Ved', U.A.: On the first order partial differential equation with the integral coefficient. In: Differenzialnye uravneniya, vol. 25, no. 3, pp. 465–477 (1989)
2. Imanaliev, T.M.: The substantiation and development of the additional argument method to solve partial differential equations. Ph.D. thesis, Bishkek (2001)
3. Imanaliev, M.I., Alekseenko, S.N.: On the theory of Whitham type partial integro-differential equations. In: DAN, vol. 323, no. 3, pp. 410–414 (1992)
4. Imanaliev, M.I., Alekseenko, S.N.: On the theory of nonlinear equations with the total derivative with respect to the time variable. In: DAN, vol. 329, no. 5, pp. 543–546 (1993)
5. Imanaliev, M.I., Pankov, P.S., Imanaliev, T.M.: Additional argument method in the theory of the wave equation. In: DAN, vol. 343, no. 5, pp. 596–598 (1995)
6. Chistilin, D.K.: Comparative analysis of the results of simulation modeling of the development of the world economy and 12 civilizations for the period 1970–2005. In: Preview Sketch. Economic Strategies, vol. 1, pp. 87–97 (2011)
7. Whitham, J.: Linear and Nonlinear Waves. Mir, Moscow (1977)
8. Burova, E.S.: Symbolic and graphical solution of the Burgers equation by the additional argument method. In: Vestnik OshGU, vol. 1, pp. 110–114 (2013)
9. Polyanin, A.D., Zaitsev, V.F.: Handbook of Nonlinear Partial Differential Equations. A CRC Press Company, Boca Raton (2004)


10. Gasnikov, A.V.: Intro to Mathematical Modelling of Transport Flows. MFTI, Moscow (2010)
11. Hopf, E.: The partial differential equation u_t + uu_x = μu_xx. Commun. Pure Appl. Math. 3, 201–230 (1950)
12. Cole, J.D.: On a quasi-linear parabolic equation occurring in aerodynamics. Quart. Appl. Math. 9, 225–236 (1951)
13. Weinan, E.: Aubry-Mather theory and periodic solutions for the forced Burgers equation. Commun. Pure Appl. Math. 52, 811–828 (1999)
14. Sobolevski, A.N.: Periodic solutions of the Hamilton-Jacobi equation with a periodic non-homogeneous term and Aubry-Mather theory. Sbornik Math. 190, 1487–1504 (1999)

Comparison of Dimensionality Reduction Methods for Road Surface Identification System

Gonzalo Safont1(B), Addisson Salazar1, Alberto Rodríguez2, and Luis Vergara1

1 Institute of Telecommunications and Multimedia Applications, Universitat Politècnica de València, Valencia, Spain
[email protected], {asalazar,lvergara}@dcom.upv.es
2 Departamento de Ingeniería de Comunicaciones, Universidad Miguel Hernández de Elche, Elche, Spain
[email protected]

Abstract. Road surface identification has been attracting more attention in recent years as part of the development of autonomous vehicle technologies. Most works consider multiple sensors and many features in order to produce a more reliable and robust result. However, on-board limitations and generalization concerns dictate the need for dimensionality reduction methods. This work considers four dimensionality reduction methods: principal component analysis, sequential feature selection, ReliefF, and a novel feature ranking method. These methods are used on data obtained from a modified passenger car with four types of sensors. Results were obtained using three classifiers (linear discriminant analysis, support vector machines, and random forests) and a late fusion method based on alpha integration, reaching up to 96.10% accuracy. The considered dimensionality reduction methods were able to greatly reduce the number of features required for classification and increased classification performance. Furthermore, the proposed method was faster than ReliefF and sequential feature selection and yielded similar improvements. Keywords: Classification · Decision fusion · Feature selection · Road surface identification · Self-driving vehicles

1 Introduction

Road surface identification (RSI) is one of the key parts in the development of autonomous or semi-autonomous vehicle technologies. Driving decisions and adjustments could be made depending on the road the vehicle is traversing, improving traffic flow, assisting the driver, and improving safety and driving experience. Existing studies on RSI can be roughly classified in three groups, depending on the road surfaces considered: (i) road roughness profiling and other road maintenance works; (ii) detecting hazardous weather conditions (sleet, snow, rain…); (iii) detecting different types of terrain (grass, asphalt, stone…). Road roughness profiles and related studies are typically concerned with cost-effective solutions for supervising and planning road maintenance works. For instance,


Tudon-Martinez et al. estimated road roughness profiles from multisensor data as a cost-effective alternative to existing laser-based solutions [1], and Park et al. detected road damage and anomalies using deep ensemble networks [2]. Conversely, studies on hazardous weather conditions are typically concerned with increasing the safety of the vehicle occupants. For instance, Alonso et al. identified wet road conditions using tire/road noise to prevent aquaplaning [3], and Zhao, Wu and Chen employed video feeds to detect four hazardous weather conditions (wet road, snow, ice, and water) [4]. Finally, studies that detect the type of terrain have goals that depend on the considered classes. For instance, Masino et al. classified five types of pavement using the sound in the tire cavity for the purpose of estimating road traffic using support vector machines and postprocessing [5], and Bystrov et al. have considered sonar, ultrasound, and radar to perform classification of the road surface in front of the car for the purposes of autonomous car technology [6–8]. Both studies reached over 90% accuracy in their respective problems. This work is related to the third category: detecting different types of terrain. It presents a combination of four different sensors for road surface identification: accelerometers, microphones, speed signals, and the torque and position of the steering wheel. The last two sensors are already included in the vehicle’s electric power steering (EPS) system, while the former two were specially fitted for this work. The number of features extracted from these sensors is relatively large compared to the number of available samples. Therefore, dimensionality reduction was carried out to identify significant features and eliminate irrelevant or redundant features. There are two main reasons for dimensionality reduction. Firstly, we would like to reduce the number of features for computational reasons: faster evaluation times, lower memory consumption, lower implementation cost, and so on. This is particularly important in autonomous vehicle technologies, where on-board systems are constrained. Secondly, reducing the number of features might improve performance or reduce the variability of the result. While hierarchical and knowledge discovery methods could perform a similar function [9–12], they tend to require large computational or temporal costs, which defeats the first reason. This work compares several dimensionality reduction methods for a road surface identification system. The following methods are considered: principal component analysis (PCA, [13]); sequential feature selection [19]; and the ReliefF feature ranking method [20]. Furthermore, a novel feature ranking method is also proposed. The effect of these dimensionality reduction methods is considered on three different classifiers: linear discriminant analysis (LDA), support vector machines (SVM), and random forests (RDF). Furthermore, late fusion of the three classifiers using alpha integration [21–23] is also considered.

2 Dimensionality Reduction Methods Given the high dimensionality of the data in this work, dimensionality reduction was performed on the extracted features before classification. Dimensionality reduction methods are typically classified in two categories: feature extraction, where new features are derived from the original ones; and feature selection, where one or more of the original features are selected and the rest are discarded [19]. In turn, feature selection is typically approached in one of two ways: ranking features according to some criterion and selecting the top Q features (feature ranking); or selecting a subset of features that keep or


improve classification performance (subset selection) [19]. Subset selection algorithms can automatically determine the number of selected features, while feature ranking algorithms need to rely on a user-determined threshold (or some other method) to set the number of selected features.

2.1 Principal Component Analysis

PCA is a linear transformation of the data that maximizes the uncorrelation of the transformed components. The original features are projected onto an orthogonal space where the projections (components) are sorted by the amount of variance of the original data they explain. In many applications, the first few components contain most of the original variance, thus the rest can be discarded to reduce the dimensionality of the problem. PCA has been heavily used as a feature extraction method and to investigate data structure [13]. However, unlike feature selection methods, PCA requires all the original features to compute the projected components. The consideration of more advanced feature extraction methods such as stochastic PCA or independent component analysis [14–18] is outside the scope of this work.

2.2 Sequential Feature Selection

Sequential feature selection (SFS, [19]) is a feature subset selection method that performs a greedy search of the optimal subset of features. There are two main types of SFS: forward, where the subset of selected features is iteratively grown from the empty set; and backward, where the subset of features is iteratively reduced from the full set. Essentially, at each iteration, the method determines the effect of adding one of the unselected features to the subset (forward SFS) or removing one of the selected features from the subset (backward SFS) using a wrapped classification method. For each iteration, forward SFS adds the feature that would increase performance the most if added to the current subset. Conversely, backward SFS removes the feature that would increase performance the most if removed from the current subset. In both cases, iterations continue until performance cannot be improved any more. In a trained system, discarded features can be omitted from the feature extraction stage, thus lightening the load of the system. In this work, we considered forward SFS, with each of the classifiers as the wrapped method.

2.3 Feature Ranking

Feature ranking methods assign a score or importance to each feature and then rank them by score [19]. This ranked list can then be used to perform feature selection, e.g., by choosing the Q best-ranked features. As with feature subset selection, the features that are not selected are never used and, in a trained system, could be omitted from the feature extraction stage.

ReliefF. ReliefF is a feature selection method that considers the interactions between features, returning a score that can be used later for feature ranking [20]. The feature scores computed by ReliefF are based on the distances between nearest neighbors. Essentially, the distances between nearest neighbors of the same class decrease the score of


the feature, and the distances between nearest neighbors of different classes increase the score of the feature.

Proposed Feature Ranking Method. In this work, we propose a feature ranking method based on simple classifiers. The method is robust, quick to compute, and does not require a lot of memory. For binary classification problems, the score of the ith feature is computed as the informedness [24] of a simple classifier fit to the feature:

informedness = specificity + recall − 1    (1)

The simple one-variable classifier used for feature ranking is shown in Algorithm 1. The algorithm maximizes the informedness of the result when splitting the values using two thresholds; any value x is assigned to class 1 if x1 < x ≤ x2 and class 0 otherwise. The search for the pair of values that optimize the informedness of the classification is done in only one pass through the set of values. This method is fast to compute and only considers the order of the values of the input variable, while disregarding the actual values. Thus, the method is robust with respect to isolated outliers, extreme values, and any data transformation that does not affect the order of the values of the features, such as: centering, scaling, exponentiation (when negative values of x are possible, only for odd powers), and logarithmic transformation.

Algorithm 1. Simple one-feature classifier based on the optimization of the informedness.
0. Given the input feature values and the binary labels (1 if the sample belongs to class 1, and 0 otherwise).
1. Sort the values in ascending order, then arrange the binary labels in the same order; call these the sorted values and the sorted labels.
2. For each position in the sorted list:
3. Compute the probability of detection and the probability of false alarm for values up to that position.
4. Compute the informedness as in Eq. (1), i.e., the probability of detection minus the probability of false alarm.
5. Find the indices of the maximum and minimum values of the informedness.
6. If the minimum precedes the maximum, the score of the feature is the informedness obtained by assigning class 1 within the range delimited by the two corresponding values.
7. Else, the score of the feature is the informedness obtained by assigning class 1 outside that range.

For multiclass problems such as the one shown in this work, the problem with K classes is first divided into K binary 1-vs-all problems. Then, the score of the ith feature is obtained as the average of its scores for each of the K binary problems.
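As an illustration only, a minimal Python sketch of this ranking procedure is given below; function and variable names are ours, and the handling of the "within/outside the range" cases of steps 6–7 follows our reading of Algorithm 1 (in both cases the score equals the difference between the maximum and the minimum of the cumulative informedness):

```python
import numpy as np

def informedness_score(x, y):
    """Score one feature for a binary problem (steps 0-7 of Algorithm 1).
    x: feature values, y: binary labels (1 for class 1, 0 otherwise)."""
    order = np.argsort(x)                      # step 1: only the order of x matters
    y_sorted = np.asarray(y)[order]
    n_pos = max(y_sorted.sum(), 1)
    n_neg = max(len(y_sorted) - y_sorted.sum(), 1)
    pd = np.cumsum(y_sorted) / n_pos           # step 3: detection probability up to each cut
    pfa = np.cumsum(1 - y_sorted) / n_neg      # step 3: false-alarm probability up to each cut
    j = pd - pfa                               # step 4: informedness = PD - PFA (Eq. 1)
    # steps 5-7: the best two-threshold rule scores J(max) - J(min); whether class 1
    # is assigned inside or outside the range depends on which index comes first
    return j.max() - j.min()

def rank_features(X, y):
    """Multiclass score: average of the binary one-vs-all scores, as in the paper."""
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for c in classes:
        yc = (y == c).astype(int)
        scores += np.array([informedness_score(X[:, i], yc) for i in range(X.shape[1])])
    return scores / len(classes)
```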


3 Experiments

The dimensionality reduction methods described in Sect. 2 were used on a set of data from a road surface identification problem. These data were obtained using a specially converted passenger car driving over three different road surfaces: cobblestones, smooth flat asphalt, and stripes. Ten channels were recorded:

• Three channels from an accelerometer on the intermediate shaft (X, Y, and Z directions).
• Three mono sound channels from microphones, two located close to the driver's head and one on the upper side of the electric power steering (EPS) system column.
• Two channels with the speed of the left and right wheels of the car.
• The torque and position of the steering wheel.

These four sensors were sampled at a rate of 48 kHz. A grand total of 63 files were taken, each with a different combination of vehicle speed and road surface, and the features shown in Table 1 were extracted from each channel in epochs of size 1.5 s, with an overlap of 1.4 s between windows. This resulted in a total of 8,309 samples and 560 features available for classification. The high dimensionality of the data and the restrictions of the problem (on-board car systems) justify the necessity of dimensionality reduction methods.

We considered the following classifiers because of their widespread application in machine learning problems: linear discriminant analysis (LDA), support vector machines with linear kernel (SVM), and random forests with 50 trees (RDF). While SVM and RDF might be computationally expensive to train, the trained model is fast to evaluate, making them appropriate for the task. LDA included a regularization term equal to γ = 0.01, in order to compensate for the large amount of features. Aside from the classifiers, we also considered late fusion using separated score integration (SSI, [21–23]). SSI is a method based on alpha integration that optimally combines the scores from several classifiers into one response. The parameters of SSI were obtained by the least mean squares criterion [23].

To test the effect of the considered dimensionality reduction methods on performance, we performed a series of Monte Carlo experiments that followed the procedure shown in Fig. 1. First, the files were randomly split in three subsets: training (50% of the files), validation (25% of the files), and testing (the remaining 25% of the files). For training, all the features shown in Table 1 were extracted. Then, one of the dimensionality reduction methods described in Sect. 2 was trained using said features, and the features remaining after reduction were used to train the LDA, SVM and RDF classifiers. During validation, the results of training were used to modify the feature extraction stage by omitting discarded features, and then the scores of LDA, SVM and RDF were obtained on the remaining features. SSI was trained using the scores of the classifiers on the validation stage. Finally, for testing, the modified extracted features were used to obtain results for LDA, SVM and RDF, and the scores of said classifiers were fused using SSI. This experiment was repeated for 100 iterations for each dimensionality reduction method, and the average and standard deviation of the results were obtained. As explained in Sect. 2, PCA and feature ranking methods cannot automatically determine the number of selected features. To determine the optimal number of selected


Table 1. Features extracted from data x in epochs of length Δ.

– Third-order autocorrelation: $\frac{1}{\Delta-2}\sum_{n=3}^{\Delta} x(n)\,x(n-1)\,x(n-2)$  (2)
– Time reversibility: $\Big(\frac{1}{\Delta}\sum_{n=1}^{\Delta} x^2(n)\Big)^{-3/2}\,\frac{1}{\Delta-1}\sum_{n=2}^{\Delta}\big(x(n)-x(n-1)\big)^3$  (3)
– Average power: $\frac{1}{\Delta}\sum_{n=1}^{\Delta} x^2(n)$  (4)
– Centroid frequency: $\frac{f_S}{\Delta}\,\sum_{f=1}^{\Delta} f\,|X(f)|^2 \Big/ \sum_{f=1}^{\Delta} |X(f)|^2$  (5)
– Maximum frequency: $\frac{f_S}{\Delta}\,\arg\max_f |X(f)|$  (6)
where X(f) is the direct Fourier transform of x within the epoch taken at Δ points, and f_S is the sampling rate.
– Spectral contrast: $\max_f |X(f)| \big/ \min_f |X(f)|$ for $f_{o1} \le f \le f_{o2}$, where f_{o1}, f_{o2} are respectively the start and end indices of the oth octave, taking 440 Hz as the reference (440 Hz is the end limit of the 4th octave) [25]; 10 octaves were considered  (7)
– Spectral slope: trend a of the model $\log|X(f)| = a\log f + b$  (8)
– Spectral flatness: $\Big(\prod_{f=f_{o1}}^{f_{o2}} |X(f)|\Big)^{1/(f_{o2}-f_{o1})} \Big/ \Big(\frac{1}{f_{o2}-f_{o1}}\sum_{f=f_{o1}}^{f_{o2}} |X(f)|\Big)$, where f_{o1}, f_{o2} are respectively the start and end indices of the oth quarter of octave [25]  (9)
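For illustration, a minimal Python sketch of this epoch-based extraction for two of the Table 1 features (average power, Eq. (4), and centroid frequency, Eq. (5)) is given below; the window size and hop follow the values stated in the text, while the names and FFT details are our own choices:

```python
import numpy as np

FS = 48_000                       # sampling rate stated in the paper
EPOCH = int(1.5 * FS)             # 1.5 s epochs
HOP = int(0.1 * FS)               # 1.4 s overlap -> 0.1 s hop between windows

def epoch_features(x):
    """Average power (Eq. 4) and centroid frequency (Eq. 5) of one epoch."""
    avg_power = np.mean(x ** 2)
    X = np.abs(np.fft.rfft(x))            # |X(f)| on the positive-frequency bins
    f = np.arange(len(X))
    centroid = FS / len(x) * np.sum(f * X ** 2) / np.sum(X ** 2)
    return avg_power, centroid

def extract(channel):
    """Slide the 1.5 s window over one recorded channel."""
    feats = [epoch_features(channel[s:s + EPOCH])
             for s in range(0, len(channel) - EPOCH + 1, HOP)]
    return np.array(feats)
```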

Fig. 1. Diagram of each iteration of the proposed Monte Carlo experiments.

features, the number of selected features was changed from 2 to 560 (all features) and the result of each classifier was verified over the validation subset for 100 iterations. SFS was not included in this experiment because the optimal number of features belonging to the subset is automatically decided by the method. Likewise, SSI was also not included in the experiment because it will be used to combine the optimal scores of LDA, SVM, and RDF, regardless of the features used for those classifiers.
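The late-fusion step (SSI) is based on alpha integration [21–23]; as an illustration only, a minimal Python sketch of an alpha-integration mean of positive classifier scores is shown below. The fixed weights and alpha here stand in for the parameters that SSI learns by least mean squares, and all names are ours:

```python
import numpy as np

def alpha_integration(scores, weights, alpha):
    """Combine positive scores s_i with weights w_i: transform with f_alpha,
    take the weighted sum, and map back with the inverse transform.
    alpha = -1 gives the weighted arithmetic mean; alpha -> 1 the geometric mean."""
    s = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    if np.isclose(alpha, 1.0):
        return float(np.exp(np.sum(w * np.log(s))))
    p = (1.0 - alpha) / 2.0
    return float(np.sum(w * s ** p) ** (1.0 / p))

# example: fuse the class scores produced by LDA, SVM and RDF for one sample
fused = alpha_integration([0.80, 0.65, 0.90], weights=[0.4, 0.3, 0.3], alpha=-1.0)
```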


The results of this training are shown in Fig. 2. For LDA, ReliefF yielded the best performance with 15 features, the proposed method yielded a very similar result with 10 features, and PCA yielded the worst result with 325 features. For SVM, ReliefF yielded the best result with 250 features, PCA yielded the next best result with 6 features, and the proposed method yielded the worst result with 5 features. Finally, for RDF, the proposed method and ReliefF yielded the best result with 560 features (i.e., no feature selection), and PCA yielded the worst result with 5 features. This might be due to the feature subset selection performed internally by RDF, which might have interfered with the variable selection procedure.

Fig. 2. Performance of each of the considered classifiers with respect to the number of features selected using PCA, ReliefF, and the proposed feature ranking method.

Overall, PCA yielded the lowest performance, and both feature ranking methods (ReliefF and the proposed method) yielded similar performances. PCA, ReliefF and the proposed method showed different trends and results for different classifiers, which suggested that the optimal number of remaining features would depend on the subsequent stages in the classification. Thus, in the following, we used a different number of remaining features for each dimensionality reduction method on each classifier.


After selecting the optimal number of features for each dimensionality reduction method and each classifier, we repeated the experiment and added the results for SFS and SSI. The results of this experiment are shown in Table 2.

Table 2. Results of the proposed RSI system with different dimensionality reduction methods.

Dimensionality reduction method   Classifier  # features  Accuracy avg. (%)  Std. error
Sequential feature selection      LDA         4           94.81              0.29
                                  SVM         4           93.64              0.49
                                  RDF         4           92.82              0.66
                                  SSI         n/a         96.03              0.20
PCA                               LDA         325         93.36              0.69
                                  SVM         6           93.35              0.61
                                  RDF         5           92.12              0.69
                                  SSI         n/a         94.13              1.44
ReliefF                           LDA         15          95.22              0.28
                                  SVM         250         94.46              0.56
                                  RDF         560         95.53              0.37
                                  SSI         n/a         96.10              0.71
Proposed feature ranking method   LDA         10          94.50              0.35
                                  SVM         5           93.10              0.46
                                  RDF         560         95.48              0.36
                                  SSI         n/a         96.02              0.28

In accordance with the results in Fig. 2, PCA yielded the worst overall result. ReliefF yielded the best result, and the proposed feature ranking method and SFS yielded similar results. With respect to the computation times, PCA and the proposed method took an average of 0.28 s to compute, ReliefF took an average of 53.33 s to compute, and SFS took an average of 50 min to compute. These calculations were performed in Matlab R2016b, running on a Windows 7 machine with an Intel Xeon E3 CPU and 16 GB of RAM. Thus, the proposed feature ranking method was much faster than the other considered feature selection methods, and just as fast as PCA.

4 Conclusion

This work has considered the effect of several dimensionality reduction methods on a road surface identification system using a real dataset with three types of road surfaces. Four dimensionality reduction methods were considered: one feature extraction method,


PCA; one subset feature selection method, SFS; and two feature ranking methods, ReliefF and a novel method proposed in this work. The results of these dimensionality reduction methods were then tested using three single classifiers (LDA, SVM and RDF) and a late fusion method based on alpha integration (SSI). The considered dimensionality reduction methods were able to significantly reduce the number of features required for classification and improve classification performance, reaching a maximum classification accuracy of 96.10%. Out of the original 560 features, most of the combinations of dimensionality reduction method and classifier were able to use 15 features or less (2.68% of the features). Furthermore, the proposed feature ranking method yielded results competitive with successful dimensionality reduction techniques, such as ReliefF and forward sequential feature selection, and outperformed the classical PCA technique. In terms of computational cost, the proposed feature ranking method was just as fast as PCA and considerably faster than ReliefF and SFS. This demonstrates the potential of the proposed method for dimensionality reduction. Acknowledgment. This work was supported by Spanish Administration and European Union grant TEC2017-84743-P, and Generalitat Valenciana under grant PROMETEO/2019/109.

References 1. Tudon-Martinez, J.C., Fergani, S., Sename, O., Martinez, J.J., Morales-Menendez, R., Dugard, L.: Adaptive road profile estimation in semiactive car suspensions. IEEE Trans. Control Syst. Technol. 23(6), 2293–2305 (2015) 2. Park, J., Min, K., Kim, H., Lee, W., Cho, G., Huh, K.: Road surface classification using a deep ensemble network with sensor feature selection. Sensors 18, (2018). Article no. 4342 3. Alonso, J., López, J.M., Pavón, I., Recuero, M., Asensio, C., Arcas, G., Bravo, A.: On-board wet road surface identification using tyre/road noise and support vector machines. Appl. Acoust. 76, 407–415 (2014) 4. Zhao, J., Wu, H., and Chen, L.: Road surface state recognition based on SVM optimization and image segmentation processing. J. Adv. Transp. 2017 (2017). Article no. 6458495 5. Masino, J., Pinay, J., Reischl, M., Gauterin, F.: Road surface prediction from acoustical measurements in the tire cavity using support vector machine. Appl. Acoust. 125, 41–48 (2017) 6. Bystrov, A., Hoare, E., Tran, T.Y., Clarke, N., Gashinova, M., Cherniakov, M.: Automotive surface identification system. In: IEEE International Conference on Vehicular Electronics and Safety (ICVES), Vienna, Austria, pp. 115–120 (2017) 7. Bystrov, A., Hoare, E., Tran, T.Y., Clarke, N., Gashinova, M., Cherniakov, M.: Road surface classification using automotive ultrasonic sensor. Procedia Eng. 168, 19–22 (2016) 8. Bystrov, A., Abbas, M., Hoare, E., Tran, T.Y., Clarke, N., Gashinova, M., Cherniakov, M.: Analysis of classification algorithms applied to road surface recognition. In: 2015 IEEE Radar Conference (RadarCon), Piscataway, NJ, USA, pp. 907–911 (2015) 9. Igual, J., Salazar, A., Safont, G., Vergara, L.: Semi-supervised Bayesian classification of materials with impact-echo signals. Sensors 15(5), 11528–11550 (2015) 10. Salazar, A., Igual, J., Vergara, L., Serrano, A.: Learning hierarchies from ICA mixtures. In: IEEE International Joint Conference on Artificial Neural Networks, Orlando, FL, USA, pp. 2271–2276 (2007)


11. Salazar, A., Gosalbez, J., Bosch, I., Miralles, R., Vergara, L.: A case study of knowledge discovery on academic achievement, student desertion and student retention. In: IEEE ITRE 2004 - 2nd International Conference on Information Technology: Research and Education, London, United Kingdom, pp. 150–154 (2004) 12. Salazar, A., Igual, J., Safont, G., Vergara, L., Vidal, A.: Image applications of agglomerative clustering using mixtures of non-Gaussian distributions. In: International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, pp. 459–463 (2015) 13. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002) 14. Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: International Conference on Machine Learning, Lille, France, pp. 144–155 (2015) 15. Llinares, R., Igual, J., Salazar, A., Camacho, A.: Semi-blind source extraction of atrial activity by combining statistical and spectral features. Digit. Signal Process. Rev. J. 21(2), 391–403 (2011) 16. Safont, G., Salazar, A., Rodriguez, A., Vergara, L.: On Recovering missing ground penetrating radar traces by statistical interpolation methods. Remote Sens. 6(8), 7546–7565 (2014) 17. Safont, G., Salazar, A., Vergara, L., Gomez, E., Villanueva, V.: Probabilistic distance for mixtures of independent component analyzers. IEEE Trans. Neural Netw. Learn. Syst. 29(4), 1161–1173 (2018) 18. Safont, G., Salazar, A., Vergara, L., Gómez, E., Villanueva, V.: Multichannel dynamic modeling of non-Gaussian mixtures. Pattern Recognit. 93, 312–323 (2019) 19. Lui, H., Motoda, H. (eds.): Computational Methods of Feature Selection. CRC Press, Boca Ratón (2007) 20. Kononenko, I., Šimec, E., Robnik-Šikonja, M.: Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell. 7(1), 39–55 (1997) 21. Amari, S.: Information Geometry and its Applications. Springer, Berlin (2016) 22. Soriano, A., Vergara, L., Bouziane, A., Salazar, A.: Fusion of scores in a detection context based on alpha-integration. Neural Comput. 27, 1983–2010 (2015) 23. Safont, G., Salazar, A., Vergara, L.: Multiclass alpha integration of scores from multiple classifiers. Neural Comput. 31(4), 806–825 (2019) 24. Powers, D.M.W.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2(1), 37–63 (2011) 25. Peeters, G.: A large set of audio features for sound description (similarity and classification) in the CUIDADO project. CUIDADO IST Project Report 54 (2004)

A Machine Learning Platform in Healthcare with Actor Model Approach

Mauro Mazzei(B)

National Research Council, Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", Via di Taurini, 19, 00185 Rome, Italy
[email protected]

Abstract. Today Information Technology is pervaded by disciplines known as Big Data and Machine Learning, which play a key role in all sectors involved in the current digital revolution. Cloud Computing now provides excellent computational resources, and the requirement to obtain near real-time results emerges; here the actor model plays its primary role. This paper describes the state of the art of Big Data architectures for near real-time processing, and then describes the actor model. A possible solution is to offer the programmer a level of abstraction, independent of the domain being dealt with, which makes it possible not to deal with the technical aspects of scalability, concurrency and interaction with the various Big Data frameworks adopted, making use of the actor-based programming paradigm. As a test of the platform, a case study was implemented in the healthcare domain. An analysis model is presented to assess the effectiveness of a wearable sensor in identifying elderly patients with Parkinson's disease (PD) (the "Tele Parkinson" use case) with no history of falls who will undergo falls in the following 12 months. Keywords: Data mining · Machine learning · Big Data · Actor model

1 Introduction

The actor model has in its simplicity its best weapon: it allows the programmer to focus exclusively on the behaviors and application logic of the program, adopting a flexible and lean paradigm for the realization of concurrent and distributed solutions, based essentially on minimal independent units of computation, the actors, which cooperate and share information through asynchronous and non-blocking message exchange. The actor programming model is certainly not new. Introduced in 1973 in Carl Hewitt's paper [1] "A universal modular ACTOR formalism for artificial intelligence", the concept of actor is described as "a fundamental unit of calculation that includes processing, archiving and communication" [2]. It was first adopted by the Erlang language, a language born in the laboratories of the Swedish multinational Ericsson for the design of applications on distributed systems in the telecommunications sector.


The actor model is based on a fundamental unit of independent computation, defined as an actor, which performs its work based essentially on a few simple principles:

– independence between actors: an actor performs its own behavior independently of the other actors with which it can nevertheless communicate;
– asynchronous communications: communications between actors take place exclusively through asynchronous and non-blocking exchange of immutable messages;
– private state: an actor has its own state during its life cycle, making it a stateful component, but this state is private and not directly accessible from the outside; the only tool for sharing information is the exchange of messages.

The fundamental rule to guarantee effective decoupling between actors is the "tell, don't ask" principle, i.e. the use of non-blocking asynchronous calls for message exchange. In the actor model we can distinguish:

– "fire-and-forget" (tell): I send a message without waiting for the reply, which I will eventually receive in the form of a new message at the end of the requested action;
– "request-response" (ask): the approach traditionally observed in web-type enterprise applications; I send a synchronous message and wait for the answer, blocking the rest of my activities; this is highly discouraged because it conflicts with the principles of reactive applications and because it requires a greater number of threads to handle callback calls.

1.1 Proposed Platform

The proposed platform is presented as a framework supporting the implementation of Machine Learning and Advanced Analytics models by providing an abstraction layer composed of a set of abstract classes that generalize behaviors and operations to interact with the various Big Data frameworks used to implement machine learning models [5–7]. The platform follows the data flow from the initial generation phase, through processing, up to saving, by defining an application pipeline. It is possible to identify three reference application macro layers: the first for the management and transmission of input data, the second for the routing of requests, and the third for the analysis core, distinguished by machine learning and advanced analysis activities (see Fig. 1). By decoupling the analysis layer from the rest of the system, there are two jobs, one for the creation and training of Machine Learning models and one for advanced data analysis; these jobs are run on a regular basis and completely independently of the rest of the components [8]. After each iteration, the results produced are promptly transmitted to the concerned components, which can immediately use them to replace those previously loaded in the respective memory buffers. The Machine Learning component generates regression or classification models based on the historical data already present on the distributed file system and on the data that are processed in real time by the platform [10–12]. In fact it is an adaptive system: every message that is produced and analyzed contributes to the enrichment of the training set. At the end of each generation, the model that best fits and shows the highest accuracy for the data currently available is chosen from the implemented regression or classification algorithms [13–15].
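As an illustration of this "retrain periodically and keep the best model" step, here is a minimal sketch in Python/scikit-learn (the platform itself targets Big Data frameworks; the candidate algorithms below are the three that appear later in Table 1, and all names are ours):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def retrain(X_train, y_train, X_test, y_test):
    """One trainer iteration: fit the candidate models on the current
    (historical + newly collected) training set and keep the most accurate one."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(n_estimators=100),
    }
    best_name, best_model, best_acc = None, None, -1.0
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_name, best_model, best_acc = name, model, acc
    return best_name, best_model, best_acc
```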


Fig. 1. The machine learning platform.

1.2 Components and Functions

The design is based on a pipeline that includes the following components:

Producer: the actor responsible for constantly monitoring the raw dataset produced by the source devices/systems; this dataset is used as input for the pipeline, and at each variation of the files contained in it the producer receives the change notification and appends a new input message to the message broker.

Consumer: the actor responsible for extracting a new input message received through the message broker; once the message is extracted it is analyzed and, depending on its type (specific for each case study), the consumer will:
– route it to the predictor if it is a prediction request;
– route it to the feeder if it is a request to retrieve statistical data;
– save it in the distributed file system if it is a raw output message produced by the source device/system, thus enriching the training set.

Trainer: the actor responsible for creating and training Machine Learning models; it is executed at regular intervals and retrieves both historical data and the data collected by the platform during execution; starting from training sets regenerated at each iteration, it creates different models and applies them to new test sets, selecting at the end the best model according to specific metrics, distinct for regression and classification models; at the end of the processing [4] it transmits the model to the respective predictor.

Analyzer: the actor responsible for generating statistical data on the historical dataset and on the data acquired in real time by the platform; it is executed at regular intervals and transmits the results to the feeder.

Predictor: the actor responsible for receiving prediction requests and satisfying them by using the models generated by the trainer; it always keeps a copy of the model in memory and replaces it when it receives a new model from the trainer.

Feeder: the actor responsible for providing the statistical data generated by the analyzer in response to specific requests; similarly to the predictor, it keeps in memory a copy of the statistical dataset, which it replaces with the newly generated one when received.
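To make the message-passing style of this pipeline concrete, a minimal sketch of two of these actors is given below in plain Python with threads and queues (the platform itself relies on a production actor framework; class and message names are ours):

```python
import queue
import threading

class Actor(threading.Thread):
    """Minimal actor: private state, a mailbox, and non-blocking 'tell' messaging."""
    def __init__(self):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()

    def tell(self, message):
        # fire-and-forget: the sender never blocks waiting for a reply
        self.mailbox.put(message)

    def run(self):
        while True:
            self.receive(self.mailbox.get())

    def receive(self, message):
        raise NotImplementedError

class Predictor(Actor):
    """Keeps the latest model in memory and serves prediction requests."""
    def __init__(self):
        super().__init__()
        self.model = None                      # private state, never read from outside

    def receive(self, message):
        kind, payload = message
        if kind == "new_model":
            self.model = payload               # replace the model sent by the trainer
        elif kind == "predict" and self.model is not None:
            print("prediction:", self.model.predict([payload])[0])

class Consumer(Actor):
    """Routes incoming messages, as described for the consumer component."""
    def __init__(self, predictor):
        super().__init__()
        self.predictor = predictor

    def receive(self, message):
        kind, payload = message
        if kind == "prediction_request":
            self.predictor.tell(("predict", payload))
        # statistics requests would go to the feeder; raw data would be appended
        # to the distributed training set
```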


2 Case Study

To validate the platform, it was decided to implement a case study in the healthcare domain, specifically in remote healthcare. The pervasiveness of technology and the evolution of consumer hardware, which has led to increasingly powerful, smaller and consequently easily wearable or transportable devices, have made data an actor of our daily life, no longer relegated to a passive role. From business to social media, through sports, data are the backbone of our daily life and, consciously or not, often a bargaining chip for the use of internet services. The health field is no exception: modern technologies allow the use of low-cost devices to monitor the behavior of the human body and most of its vital signs: heart rate monitors to detect heart rate, combined accelerometers, gyroscopes, magnetometers and GPS to detect movements, up to the most modern smartwatches equipped with electric cardiac sensors capable of performing an electrocardiogram by contact, directly from the wrist [16, 17]. The dematerialization of medical exams and medical records does the rest, allowing diagnostic and statistical processing of the available data through sophisticated algorithms aimed not only at detecting pathologies but also at predicting their future evolution by comparison with an ever-widening knowledge base. The domain of remote healthcare is placed in this scenario. The use of these devices now allows continuous monitoring, 24 h a day, seven days a week, as the observation period can be extended considering the low invasiveness [18] of these tools in the daily life of the patient, who does not have to be hospitalized for a long stay. The same goes for the doctor: he can now constantly monitor the patient's vital parameters and be alerted, not only during his presence at the health facilities, of any anomalies found, accompanied by all the acquired data. In addition to reactive behaviors, the doctor can also act proactively by projecting the potential results of a therapy according to a specific statistical model, or correct it along the way by immediately evaluating the changes produced.

2.1 Data Analysis

This case study aims to recognize the motor activity and the related status of the monitored patient. The processed data have been downloaded from the UC Irvine Machine Learning Repository, which makes numerous Big Data datasets freely available for research, dedicated in particular to their use through Machine Learning techniques [3]. This dataset was produced by researchers from the University of Adelaide (Shinmoto Torres, R.L., Ranasinghe, D.C., Shi, Q., Sample, A.P.: Sensor enabled wearable RFID technology for mitigating the risk of falls near beds. In: 2013 IEEE International Conference on RFID, pp. 191–198) [9, 20]. The dataset is named "Activity recognition with healthy older people using a batteryless wearable sensor" and collects sequential motion data of 14 healthy older people, with ages ranging from 66 to 86 years, wearing a batteryless sensor above the breastbone and moving in a room according to a programmed model. The data are produced by the sensor and by RFID antennas positioned in two rooms with four and three antennas respectively. The activities performed and surveyed are:


walking towards the chair, sitting on the chair, getting up from the chair, walking towards the bed, lying on the bed, getting out of bed and walking towards the door. The possible classes identifying the activity status assigned to each observation are four: sitting on the bed, sitting on the chair, lying on the bed, and standing or walking inside the room. The data are organized in CSV files containing the following columns:

– time in seconds;
– frontal axis acceleration;
– vertical axis acceleration;
– lateral axis acceleration;
– identification of the antenna that produced the observation;
– RSSI (Received Signal Strength Indicator) to detect the quality of the signal;
– phase;
– frequency;
– label indicating the planned activity carried out by the patient among the four available.
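For illustration, a minimal sketch of loading one of these CSV files in Python/pandas follows; the file name and the column identifiers are hypothetical placeholders for the nine columns listed above:

```python
import pandas as pd

# hypothetical names for the nine columns described in the text
columns = ["time_s", "acc_frontal", "acc_vertical", "acc_lateral",
           "antenna_id", "rssi", "phase", "frequency", "activity_label"]

df = pd.read_csv("subject_observations.csv", header=None, names=columns)  # hypothetical file name
X = df.drop(columns=["time_s", "activity_label"]).values   # sensor readings used as features
y = df["activity_label"].values                            # one of the four activity classes
```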

Following the previously described data structure, 52,000 elements were processed, producing output in comma-separated values (CSV) files. These were then passed to the prediction dataset, which contains the predictions made in real time in the format: message input data, input values and, as the last column, the predicted class. The last activity is the evaluation of the classification models, which contains the list of classification algorithms applied, with the corresponding models and accuracy values, in the format: model construction date, algorithm, accuracy indicator, training set elements, test set elements [19]. The accuracy values obtained with respect to the ML models used are reported below (see Fig. 2).

Fig. 2. Accuracy of the classification algorithms used.


Finally, Table 1 shows the confusion matrices of the models, i.e. the values for the construction of a confusion matrix based on the predicted classes and the real classes.

Table 1. Confusion matrix

                                        Predicted
Actual                      Sit on bed  Sit on chair  Lying  Ambulating
Logistic regression
  Sit on bed                3193        0             1389   0
  Sit on chair              396         0             944    0
  Lying                     0           0             9281   0
  Ambulating                497         0             83     0
Decision tree classifier
  Sit on bed                4514        10            41     17
  Sit on chair              8           1300          12     20
  Lying                     39          0             9242   0
  Ambulating                224         35            2      319
Random forest classifier
  Sit on bed                4513        21            48     0
  Sit on chair              429         905           0      6
  Lying                     4           0             9277   0
  Ambulating                279         17            3      281
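As a check that can be reproduced from Table 1 alone, the sketch below (Python/NumPy, using the matrices as reconstructed in the table above) computes the overall accuracy and the per-class recall of each model:

```python
import numpy as np

# Confusion matrices from Table 1 (rows = actual, columns = predicted,
# class order: sit on bed, sit on chair, lying, ambulating)
cm = {
    "logistic_regression": np.array([[3193,    0, 1389,   0],
                                     [ 396,    0,  944,   0],
                                     [   0,    0, 9281,   0],
                                     [ 497,    0,   83,   0]]),
    "decision_tree":       np.array([[4514,   10,   41,  17],
                                     [   8, 1300,   12,  20],
                                     [  39,    0, 9242,   0],
                                     [ 224,   35,    2, 319]]),
    "random_forest":       np.array([[4513,   21,   48,   0],
                                     [ 429,  905,    0,   6],
                                     [   4,    0, 9277,   0],
                                     [ 279,   17,    3, 281]]),
}

for name, m in cm.items():
    accuracy = np.trace(m) / m.sum()        # correct predictions over all samples
    recall = np.diag(m) / m.sum(axis=1)     # per-class recall
    print(f"{name}: accuracy={accuracy:.3f}, per-class recall={np.round(recall, 3)}")
```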

3 Conclusions

The diffusion of Big Data in everyone's daily life has produced remarkable changes, not only in the IT world, and has launched new challenges to solve what has become the current technological dilemma: to process every piece of data immediately. The presented platform provides the programmer with a level of abstraction that avoids involvement in the complexities of interaction with the various frameworks adopted, delegating to the abstraction the task of communicating with them in order to implement an application pipeline that takes the data in its raw form directly from the origin, transmits it asynchronously to the components appointed to analyze it, and finally routes it to the prediction and statistical analysis functions; it is these functions that are the core of the platform and that allow the iterative construction of Machine Learning models which, at each iteration, are reconstructed based on the union of a historical dataset and a continuously enriched real-time dataset, showing an adaptive behavior in selecting the most accurate model for data subject to periodic change. The case study dealt with in the area of remote healthcare made use of studies and literature datasets to carry out the recognition of the activity status of elderly people with the sole use of passive RFID sensors and antennas positioned inside a room. From the output files that are periodically generated by the platform, graphs were traced according to the discipline of advanced analytics, demonstrating the fact


that such an approach also provides the necessary tools to carry out more in-depth analysis of the data. In conclusion, the main novelty proposed in this work, namely the actor-based model for the definition of an application abstraction level that coordinates predictive analysis activities, can therefore be a harbinger of new ideas in a changing architectural context that must constantly adapt to the ever consolidating digital revolution, driven not only by the birth of new devices and new services but also by the continuous search for improvement. This work is still under development for possible additions and extensions. Compared to the case study treated here, it would be valuable to examine a case study which produces data with very high throughput, with the aim of assessing the scalability potential of the platform in terms of data streaming.

References 1. Hewitt, C., Bishop, P., Steiger, R.: A universal modular actor formalism for artificial intelligence. ACM (1973) 2. Clinger, W.D.: Foundations of actor semantics. ACM (1981) 3. UCI Machine Learning Repository, Activity recognition with healthy older people using a batteryless wearable sensor Data Set. https://archive.ics.uci.edu/ml/datasets/Activity+recogn ition+with+healthy+older+people+using+a+batteryless+wearable+sensor. Accessed 10 Sept 2019 4. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. ACM (2004) 5. Marz, N.: Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning, Shelter Island (2015) 6. Estrada, R.: Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka. Apress, New York (2016) 7. Akka case studies. https://www.lightbend.com/case-studies. Accessed 01 Sept 2019 8. Gamma, E., Helm, R., Johnson, R., Vlissidies, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Boston (1994) 9. Shinmoto Torres, R.L., Ranasinghe, D.C., Shi, Q., Sample, A.P.: Sensor enabled wearable RFID technology for mitigating the risk of falls near beds. In: IEEE International Conference on RFID, pp. 191–198 (2013) 10. Chambers, B., Zaharia, M.: Spark: The Definitive Guide: Big Data Processing Made Simple. O’Reilly, Sebastopol (2018) 11. Karau, H.: Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly, Sebastopol (2015) 12. Let-it-crash Reactive design pattern. https://www.reactivedesignpatterns.com/patterns/let-itcrash.html. Accessed 01 Aug 2019 13. Mazzei, M.: An unsupervised machine learning approach in remote sensing data. In: ICCSA, no. 3, pp. 435–447 (2019) 14. Mazzei, M., Palma, A.L.: Spatial multicriteria analysis approach for evaluation of mobility demand in urban areas. In: ICCSA, no. 4, pp. 451–468 (2017) 15. Mazzei, M.: Software development for unsupervised approach to identification of a multi temporal spatial analysis model. In: Muller, J. (ed.) The Proceedings of the 2018 International Conference on Image Processing, Computer Vision, & Pattern Recognition. Computer Science, Computer Engineering & Applied Computing (2018)


16. Wickramasinghe, A., Ranasinghe, D.C., Fumeaux, C., Hill, K.D., Visvanathan, R.: Sequence learning with passive RFID sensors for real time bed-egress recognition in older people. IEEE J. Biomed. Health Inform. PP(99), 1 (2016) 17. Shinmoto Torres, R.L., Visvanathan, R., Hoskins, S., van den Hengel, A., Ranasinghe, D.C.: Effectiveness of a batteryless and wireless wearable sensor system for identifying bed and chair exits in healthy older people. Sensors 16(4), 546 (2016) 18. Wickramasinghe, A., Ranasinghe, D. C.: Recognising activities in real time using body worn passive sensors with sparse data streams: to interpolate or not to interpolate? In: Proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 21–30. ICST, August 2015 19. Shinmoto Torres, R.L., Ranasinghe, D.C., Shi, Q.: Evaluation of wearable sensor tag data segmentation approaches for real time activity classification in elderly. In: International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services, pp. 384–395. Springer, December 2013 20. Shinmoto Torres, R.L., Ranasinghe, D.C., Shi, Q., Sample, A.P.: Sensor enabled wearable RFID technology for mitigating the risk of falls near beds. In: 2013 IEEE International Conference on RFID, pp. 191–198. IEEE, April 2013

Boundary Detection of Point Clouds on the Images of Low-Resolution Cameras for the Autonomous Car Problem

Istvan Elek(B)

Eotvos Lorand University, Budapest, Hungary
[email protected]
http://mapw.elte.hu/elek

Abstract. In this article, we present a procedure that uses low-resolution cameras to define the boundaries of point clouds quickly. Using the same procedure, we get results almost identical to those from high-resolution images. The emphasis is not so much on the method presented, but on the fact that applying this method to images of low-resolution cameras makes the process much quicker. This result is remarkable because segmentation and clustering methods, when applied to big and high-resolution images, are sometimes overwhelmingly slow. If low-resolution images are used to detect the boundaries of larger objects, the process accelerates remarkably. The approach described here attempts to provide a possible solution to recognize the desired objects.

Keywords: Boundary detection · Computer vision · Autonomous car enhanced safety

1 Introduction

Because of technological advances, cameras with ever higher resolution are being commercialized. The rapid development of space technology produces better and better satellite imagery. Developers of self-driving cars are installing increasingly high-resolution cameras in their experimental vehicles, so that devices with lower resolution are already used for hobby purposes only. No one examines the possibilities of lower resolution cameras, because high-resolution images are indeed more detailed. In this article, we attempt to draw attention to the fact that in some cases we want to detect objects, coherent fields or groups very quickly. Classification and segmentation are frequent methods when the purpose is to define the boundaries of objects on an image, for both high- and low-resolution images. For high-resolution images, the running time of these processes can be extremely long. In remote sensing, a frequently used technique is principal component analysis, which reduces the dimensions of multi-band images and thus reduces the running time of image processing algorithms. This method is probably not


ideal to apply to the image processors of autonomous cars. Our approach is based on the supposition that the boundaries can be recognized on low-resolution images too. The idea is to detect the large boundaries on the low-resolution image first, and after that to move the focus to the high-resolution images for further processing. Once we have the larger segment boundaries, we can continue with image-processing algorithms inside these boundaries or on any of the recognized segments. For example, in the case of an autonomous car, the process must recognize a person or another vehicle by their contours. This is a primary task that needs to be performed very quickly. Similarly, it is also mandatory to recognize traffic signs quickly, but reading the inscription under the traffic sign (e.g. its scope) is no longer as time-sensitive, and its interpretation can come from a higher resolution image. However, in this case there is no need to process the whole image; it is enough to process the small environment of the traffic sign, which is a manageable mass of data even in the case of high-resolution images. In the following, a method is presented that demonstrates the usability of images of smaller resolution cameras, not as a compromise solution, but for real use. The process will be demonstrated on synthetic images.

2 Creating and Analysing Synthetic Images

We will demonstrate the concept on synthetic images and point clouds. These images have been created so that dense territories, nodes and clusters can be clearly identified. With visual interpretation, within a fraction of a second, we are able to recognize the existence and boundaries of these clusters; that is, we are able to interpret the seen image very quickly. Interpretation is the emphasized word here, because in the case of a moving vehicle the quick interpretation of the images seen by the cameras and the recognition and identification of the objects in them are essential. Let us generate an image of dots randomly scattered around three nodes (Fig. 1). When we look at such images, we recognize these condensation fields in a fraction of a second, immediately perceiving the areas that can be identified by the thickening. Such tasks can be considered a common problem when we have to group the pixels of the images (not necessarily all of them). Well-known methods of clustering and classification are used for this purpose [1]. However, their running time is usually large, so they are not suitable for fast grouping in real time. This is especially true for applications where short run times are important, e.g. processing the images of the cameras of self-driving cars. Based on the knowledge of the physiology of vision, it is known that the human eye is not a high-resolution device. Yet the images interpreted by our brain are quite good, since we are able to recognize many fine details, and on that basis it even looks as if our eyes were a high-resolution device. The reason for this phenomenon is that these low-resolution images are processed by our brain, and our consciousness interprets this processed image. Based on the results of vision psychology experts, we know that our eyes are also equipped with edge detectors. There are Gabor filters in the human eye in eight different directions,


Fig. 1. The colour depth of the generated image is 24-bit; the scattered points are white on a black background

which perform edge detection [3]. As is known, edge detection is a fundamental method of shape recognition [2]. In the course of evolution, our eyes had to identify the contours of objects in the surroundings, as a person had to recognize the enemy, another member of the same tribe, the family, and anything that had an impact on survival [7]. There is no doubt that no animal uses clustering or any classifying methods to recognize important shapes among pixels. However, interpreting images is impossible at the level of raw pixels: without finding edges and groups, it is impossible to produce any interpreted image. We can only interpret a vectorised shape instead of raw pixels. In the following, a method package based on edge detection will be presented. It determines the boundaries of the point clouds coming from Fig. 1 and generates them as vector data. The procedure consists of smoothing, thresholding, edge detection and vectorisation, as can be seen in Fig. 2.

Fig. 2. The workflow of vectorising process

We omit the presentation of the operation of digital filters, as it can be studied in detail in these works [1,2,8,9]. The operation of the procedure package is schematically as follows: 1. Let us smooth the image, which transforms the image into a cloud (Fig. 3). The run time of this step is about 12,000 ms for the picture shown in Fig. 1.


2. We create a picture processed with thresholding and edge detection, which has two values (edge or not edge). 3. The threshold value is obtained from the distribution of the intensity values. 4. Finally, we define the boundaries of the point clouds. A minimal sketch of this pipeline is given below.
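The following is a minimal sketch of such a smoothing, thresholding, edge-detection and vectorisation chain using OpenCV and NumPy. The kernel size, threshold value and function names are illustrative assumptions, not the exact parameters used in our experiments.

```python
# Hypothetical sketch of the smoothing -> thresholding -> edge detection ->
# vectorisation pipeline; parameter values are illustrative assumptions.
import cv2
import numpy as np

def cloud_boundaries(gray_image: np.ndarray, blur_sigma: float = 15.0,
                     threshold: int = 32) -> list:
    """Return the vectorised boundaries (contours) of dense point clouds."""
    # 1. Smooth the binary point image so that scattered dots merge into a cloud.
    smoothed = cv2.GaussianBlur(gray_image, (0, 0), blur_sigma)
    # 2.-3. Threshold the smoothed intensities; the threshold would normally be
    # derived from the intensity distribution of the smoothed image.
    _, binary = cv2.threshold(smoothed, threshold, 255, cv2.THRESH_BINARY)
    # Boundary pixels get intensity 255, everything else stays 0.
    edges = cv2.Canny(binary, 50, 150)
    # 4. Vectorise: extract the boundary polylines as coordinate lists.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [c.reshape(-1, 2) for c in contours]
```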

Fig. 3. The smoothed image

The results of steps 1–4 are shown in Fig. 4. The run time of this part of the process is 25 ms. The pixels have two values: the intensity is 0 everywhere except at the boundaries, where the intensity is 255. For such an image, the vectorisation is a trivial task [9]. Let us compare the starting image with the result (Fig. 5). It can be seen that the boundaries of the point cloud were correctly detected by the procedure. It is also apparent that there can be a number of boundaries, depending on which parameter is given to the procedure. For the small and the high-resolution images we will always use the same parameters for correct comparability. Determining the boundaries of point clouds is a common problem in image processing and in some GIS tasks [4–6], but we do not discuss it in this article. Our goal is to demonstrate that low-resolution images can also be used to solve some specific problems, especially those that are speed-sensitive. The process package described so far consists of elements well known in image processing. Producing the vector image takes approximately 12 s, which is due to the smoothing. The time for producing cloud boundaries from the smoothed image is negligible (25 ms). That is to say, the time needed to produce a smooth point cloud should be drastically reduced if we want a fast procedure that can be used in practice. To do this, think of the physics behind the camera.


Fig. 4. The boundaries of clouds in the high-resolution image

Fig. 5. The comparison of the point cloud and the vectorisation result

3 The Photoelectric Effect

According to classical electromagnetic theory, this effect can be attributed to the transfer of energy from the light to an electron. The photoelectric effect occurs when light falls on the surface of a material (mostly metal). An electron is ejected by an electromagnetic wave quantum, such as light, at a frequency higher than the threshold level. There is no electron emission below this boundary frequency, because the photon cannot provide enough energy to eject the electrons from the atomic bond. The particles of a light ray (photons) have a wavelength-dependent energy. If we increase the intensity of the light ray, it does not change the energy of


the photons that make up the light beam, but only the number of emitted electrons, so the phenomenon is suitable for mapping surfaces. The operation of the semiconductor chips of cameras is based on this phenomenon (Fig. 6).

Fig. 6. The photoelectric effect [10]

It can be seen from the above that the size of an image is determined by the resolution of the photosensitive tile: as many photosensitive cells are integrated into the tile as there are pixels in the resulting image. However, it can also be seen that each elementary cell of the tile averages the incoming light intensity and produces a proportional number of electrons. Therefore, a high-resolution camera averages over a smaller area than a lower-resolution one. We know from the theory of digital filters that averaging results in smoothing, because a smoothed image is the result of convolving the image with a kernel [1,2,8]. For now, let us ignore what averaging means exactly. Suppose that the kernel consists of the values of a proper Gaussian function, but a kernel containing 1, 1, 1, . . . is also acceptable (this is a box filter). Consequently, if smoothing is done not by convolutional filtering of a high-resolution image but by using a smaller-resolution camera, the image is already smoother because the physics of the sensor has done the convolution, and we can eliminate the most time-consuming step from the vectorisation process.
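A small NumPy sketch illustrates the point: averaging non-overlapping b × b blocks of a high-resolution image (roughly what a lower-resolution sensor does physically) is the same as convolving with a b × b box kernel and then subsampling. The function name and block size are assumptions made for illustration.

```python
import numpy as np

def sensor_like_downsample(high_res: np.ndarray, block: int = 8) -> np.ndarray:
    """Average non-overlapping block x block cells, mimicking a lower-resolution sensor."""
    h, w = high_res.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    cells = high_res[:h, :w].reshape(h // block, block, w // block, block)
    # Each output pixel is the mean intensity of one cell: a box-filter convolution
    # followed by subsampling, i.e. the smoothing step comes "for free".
    return cells.mean(axis=(1, 3))
```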

4 Comparison of Processing of High- and Low-resolution Images

Let us look at a much smaller resolution image and run the procedure described above on it. Figure 7 shows the low-resolution image before processing. Smooth the low-resolution image with the same limit frequency as we used for the high-resolution one. Figure 8 shows the cloud of the smoothed image. Perform edge detection, thresholding, and vectorisation with the same parameters as for the high-resolution image. Figure 9 shows the vector boundaries of the point cloud. Let us examine whether the cloud contours obtained from the low-resolution image are correct. When we look at Fig. 10, we can say that the point cloud contours meet


Fig. 7. The low-resolution image

Fig. 8. The smoothed low-resolution image

our expectations. If we had established the boundaries by visual inspection, the result would be the same, so the result of our process is appropriate. It is an interesting task to compare the two results. Since we do not have a benchmark to which we can compare the result, it is natural to compare these two results, i.e. the contours obtained from the processing of the high- and low-resolution images. As can be seen, the two results are very close to each other (Fig. 11), though of course not entirely the same. What is remarkable is the difference in running time. While the processing time for the high-resolution image is 12,000 ms, for the low-resolution image it is only 167 ms, a speed-up of more than seventyfold.


Fig. 9. Vector boundaries of point clouds in low-resolution image

Fig. 10. The low-resolution image, cloud and processing result

Let us also compare the results of runs with a different parameter (allowing more dropped-out points), drawing the contours resulting from the processing of the low- and high-resolution images on top of each other. As shown in Fig. 12, the thin curve, which comes from the high-resolution image, and the thick curve, which comes from the low-resolution image, run together.


Fig. 11. The left part of the figure shows the larger image, the right side shows the lower resolution image, and the results of the processing

Fig. 12. Comparison of the contours from low- and high-resolution images

5 Conclusion

In conclusion, smaller resolution cameras are able to detect the boundaries of larger objects relatively quickly, much faster than doing the same on high-resolution images. This does not mean, of course, that it is not worth using high-definition cameras. In runtime-sensitive cases, such as autonomous driving, it seems worth using a combination of cameras with different resolutions. Of course, after defining the boundaries of the larger objects, high-resolution images should be used to determine the fine details. There are countless examples of this in wildlife. After identifying the larger objects, creatures are able to focus on one detail. In some cases, e.g. eagles, their eyes can zoom. Even in the case of non-zoomable eyes, such as the human eye, we know the phenomenon of concentrating on small details very


well. After the quick evaluation that defines the large objects, we turn to observing the small objects in more detail. It seems that smaller resolution cameras should not be written off, because smart combinations of images of different resolutions can significantly speed up some image processing tasks. Acknowledgment. Project no. ED 18-1-2019-0030 (Application-specific highly reliable IT solutions) has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme.

References
1. Gonzalez, R., Woods, R.: Digital Image Processing. Pearson, London (2018)
2. Allen, R., Mills, D.: Signal Analysis: Time, Frequency, Scale, and Structure. Wiley-IEEE (2004)
3. Gregory, R.: Eye and Brain: The Psychology of Seeing. Princeton University Press, Princeton (1998)
4. Almqvist, H., Magnusson, M., Kucner, T., Lilienthal, A.: Learning to detect misaligned point clouds. Wiley (2017). https://doi.org/10.1002/rob.21768
5. Mineo, C., Pierce, S., Summan, R.: Novel algorithms for 3D surface point cloud boundary detection and edge reconstruction. J. Comput. Des. Eng. 6(1), 81–91 (2019)
6. Sidiropoulos, A., Lakakis, K.: Edge points detection in unorganized point clouds. Int. J. Constr. Res. Civil Eng. (IJCRCE) 2(3), 8–15 (2016). https://doi.org/10.20431/2454-8693.0203002
7. Elek, I.: Emergence of Intelligence. NOVA Science Publishers, New York (2016). ISBN 978-1-53613-545-9
8. Elek, I.: Adatbázisok, térképek, információs rendszerek. ELTE Eötvös kiadó (2010). ISBN 978 963 312 039 2
9. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009). ISBN-13: 978-0262033848
10. https://en.wikipedia.org/wiki/Photoelectric_effect

Identification and Classification of Botrytis Disease in Pomegranate with Machine Learning

M. G. Sánchez1(B), Veronica Miramontes-Varo1, J. Abel Chocoteco1, and V. Vidal2

1 TecNM/Instituto Tecnológico de Cd. Guzmán, 49100 Jalisco, Mexico
[email protected], [email protected], [email protected]
2 Universitat Politècnica de València, 46022 Valencia, Spain
[email protected]

Abstract. Botrytis cinerea represents an economic risk for the pomegranate industry because the quality of pomegranate-derived products is mainly affected by the number of bad arils present in the fruit. Manual identification and classification of this fruit requires expertise and professional skill and is time-consuming, expensive, and subjective. Automated identification and classification of Botrytis can be an alternative to the traditional manual methods. Machine learning techniques such as K-nearest neighbor algorithms, support vector machines (SVMs), random forest, and artificial neural networks have been successfully applied in the literature to fruit classification problems. In this paper, we propose a new method to identify and classify Botrytis disease of the pomegranate by combining machine-learning techniques. The method also uses techniques such as the Gaussian filter and morphological operations, among others, to extract the image features. The results show that 96% classification accuracy can be achieved using the proposed method.

Keywords: Botrytis cinerea disease · Feature extraction · Image classification · K-means clustering · Random forest · Support vector machines

1 Introduction

Botrytis cinerea is a phytopathogenic fungus that infects a wide variety of plants, causing economic losses before and after harvesting [1]. Crops such as vine, tomato, strawberry, pomegranate, and ornamentals are affected by this fungus. Pomegranate fruit is infected through flower parts, surface wounds or insect injuries, and the infection remains latent until after harvest. It causes brown discoloration of the entire fruit, starting from the blossom end and progressing to the whole fruit [2]. The aforementioned disease is a threat to the pomegranate industry


because the quality of pomegranate-derived products is mainly affected by the number of bad arils present in the fruit. Currently, some companies dedicated to pomegranate-derived products perform the selection of healthy pomegranates and those containing Botrytis through human observation. Manual identification and classification require expertise and professional skill and are time-consuming, expensive, and subjective. Techniques to identify and classify diseases in various types of fruits, vegetables, plants, and agricultural crops have been proposed in various works. However, few investigations directed at the pomegranate fruit have been reported. Machine learning and pattern recognition have been successfully applied to the classification problems reported in the literature, especially for images of fruit, leaves, and plants. In machine learning, pattern recognition is an important tool for classification and identification; its purpose is to classify a group of patterns, known as a test set, into two or more classes or categories [3]. In this paper, a new method to classify the pomegranate fruit is proposed. It consists of four stages: (i) preprocessing, (ii) segmentation, (iii) feature extraction and (iv) classification. Preprocessing includes techniques such as noise reduction and detail enhancement. Segmentation is the process of partitioning a digital image into multiple regions or segments in order to extract objects of interest; in agriculture, the main aim of image segmentation is to separate the disease and the background from the images. Feature extraction is the process by which features are obtained to differentiate one type of object from another; i.e., it makes it possible to extract the features of an image in order to achieve an adequate classification. Finally, classification is the process of identifying the objects that are present in the images [4]. The main objective of fruit classification systems is to detect anomalies. Some preprocessing techniques include color transformation and image enhancement [5]. There are different techniques used in image segmentation, such as K-means clustering, thresholding, color segmentation, learning-based segmentation, edge detection, region-based segmentation, model-based segmentation, the Otsu method, region growing, neural networks, etc. [5,6]. There are several methods of feature extraction; the most used are texture, color, and shape [5]. Classifiers reported in the literature include the probabilistic neural network (PNN), support vector machine (SVM), artificial neural network (ANN), radial basis network (RBN), K-nearest neighbor (KNN), backpropagation neural network (BPNN), naive Bayes classifier (NBC), convolutional neural network (CNN) and random forest [5,7]. In this paper, we present a method for the identification and classification of Botrytis disease in the pomegranate fruit through machine learning techniques. The proposed method uses morphological operations and a Gaussian filter to improve image quality; manual and K-means segmentation; texture feature extraction (mean, standard deviation, smoothness, asymmetry, energy, and entropy) and the discrete wavelet transformation (DWT); and finally, SVM, random forest and CNN to classify the image.


In order to measure the accuracy of the classifier, we use evaluation functions such as accuracy, precision, recall, and F1-score. In addition, statistical measures such as the mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), and root relative squared error (RRSE) are also used [8]. This paper has five main sections. Section 2 introduces some previous works on detection, identification, and classification reported in the literature for several types of fruits, plants, or leaves. The proposed method for the classification of Botrytis is presented in Sect. 3. Section 4 includes the experiment design, results, and analysis of the results. Finally, Sect. 5 presents the conclusions.

2 Previous Works

Several methods have been proposed to detect, identify, and classify diseases in various types of fruits, crops, plants, or leaves, as shown in Table 1. Each of these works uses different techniques to achieve its objective, including color transformation, thresholding, K-means clustering, SIFT, SVM, and neural networks. Some of them operate at harvest and others after harvest. Several surveys have been reported in the literature for the same purpose. A review of image processing techniques for the detection and classification of citrus plant diseases is reported in [5]. A survey of deep learning techniques applied to agricultural and food production is presented in [23]. Applications of machine learning in agriculture production systems are presented in [8]. An analysis of neural networks (NN) in plant disease detection is presented in [24]. Different image preprocessing techniques, segmentation, feature extraction, feature selection, and classification methods for the detection and classification of citrus plants are presented in [5]. Approaches to automate the process of plant disease detection and classification using leaf images are presented in the survey [25]. In addition, several applications have been developed for detecting and classifying different types of diseases that affect plants [26]. Other applications send messages to alert farmers of disease detection using IoT technology [27]. Few methods for detection and classification in pomegranate fruit have been reported in the literature. Some of them are focused on detecting pomegranate fruits on the tree and counting the number of pomegranates [28]. Other methods include the detection of defects in the skin of the pomegranate using color texture characteristics and the DWT [27]. There are also methods for the detection of pomegranate diseases [27] through leaf images [29].

3 The Proposed Approach

The method for the classification of pomegranate fruit consists of four stages: (i) preprocessing, (ii) segmentation, (iii) feature extraction and (iv) classification. Before the preprocessing there is an image acquisition stage (see Fig. 1).

Table 1. Previous investigations

Authors | Issue | Plant/fruit/leaf | Technique
[9] | Detection and classification | Citrus leaves | Intensity-invariant texture analysis, and random forest classifier
[10] | Disease detection | Soya leaf | SVM classifier and scale invariant feature transform (SIFT)
[11] | Disease detection and classification | Apple scab, black rot canker, and core | Contrast enhancement, K-means clustering, GLCM features, SVM classifier
[12] | Disease detection | Fruit | Morphological operations, thresholding, E-nose
[13] | Identification of diseases | Grape leaf | Wiener filter method, Otsu method and morphological algorithms, Prewitt operator, back-propagation neural networks (BPNN)
[14] | Disease detection | Different crop species, vine leaves | Local binary patterns (LBPs) for feature extraction
[15] | Classification | Leaf | Deep neural networks (DNNs)
[16] | Segmentation and classification | Rice leaf | Threshold value, color transformation
[17] | Detection and classification | Leaf | Study of digital image processing techniques
[18] | Detection and classification | Tomato leaf | CNN and learning vector quantization (LVQ) algorithm
[19] | Classification | Tomato plant | SIFT texture feature, SVM classifier, K-means clustering, color statistics feature, quadratic SVM
[20] | Identification and classification | Plant leaf | Bacterial foraging optimization based radial basis function neural network
[21] | Detection and classification | Leaves | Converting the RGB image to an HSV image, neural network, histogram of oriented gradients (HOG), random forest, SVM, Gaussian naive Bayes, logistic regression, linear discriminant analysis
[22] | Classification | Plant diseases | CNN


Fig. 1. Steps of the method.

3.1 Image Acquisition

The first stage of any vision system is image acquisition. The images are often captured under different illumination conditions and, consequently, the texture features extracted from the images can vary considerably [30]. A cabin that shields the fruit from ambient light and other environmental conditions has therefore been implemented in order to provide a robust classification of the disease. The images of the pomegranate are captured with a digital camera in RGB format.

3.2 Image Preprocessing

This stage is necessary to enhance the quality of the pomegranate image for further processing. The preprocessing adjusts the image intensities in order to make the areas containing Botrytis disease easier to identify than in the original image. Preprocessing uses techniques such as image resizing or cropping and image enhancement (color space transformation, image filtering, and morphological operations). Resizing and Cropping Image. Resizing images is a common preprocessing step, mainly for two reasons. First, images can have different sizes, and they must all have the same dimensions. Second, very large images occupy a lot of storage space, and resizing them reduces this usage. To this end, the captured images are resized to reduce the computational load in subsequent processing. The original images are of size 5184 × 3456 pixels and are cropped to a smaller size of 2353 × 2353 pixels. Contrast Enhancement. The colors of the diseased part and the healthy part of the pomegranate are similar. In order to enhance the contrast of the image, an algorithm is applied to accentuate the colors of the diseased areas of the pomegranate fruit and to facilitate the segmentation. The image is first converted


from RGB color space to HSI space and to the CIELAB color space. Afterwards, we apply morphological operations such as dilation or erosion and filter the image with a Gaussian filter.
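A possible realization of this preprocessing chain with OpenCV is sketched below; the 5 × 5 kernel sizes and the choice of the CIELAB conversion are assumptions made for illustration, not the exact settings of the implementation described here.

```python
import cv2
import numpy as np

def preprocess(bgr_image: np.ndarray) -> np.ndarray:
    """Color conversion, morphology and Gaussian filtering to accentuate diseased areas."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)   # RGB/BGR -> CIELAB
    kernel = np.ones((5, 5), np.uint8)                 # illustrative 5x5 structuring element
    dilated = cv2.dilate(lab, kernel, iterations=1)    # or cv2.erode(...), depending on the test
    return cv2.GaussianBlur(dilated, (5, 5), 0)        # Gaussian smoothing
```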

3.3 Image Segmentation

In this phase, the regions containing Botrytis disease are identified. There are many segmentation techniques; one of them is K-nearest neighbor (KNN). The most common technique is the K-means clustering algorithm, and K-means has been used in this work. This algorithm classifies the pixels into k classes based on a set of features. The classification of the pixels is done by minimizing the sum of the squared distances between each object and the corresponding cluster centroid. Another way to perform image segmentation is manually.
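A minimal sketch of K-means pixel clustering with scikit-learn is shown below, assuming k = 3 clusters and the color values of each pixel as the only features (both assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_segment(image: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster pixels by color and return a label map of shape (H, W)."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float32)   # one feature vector per pixel
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    return labels.reshape(h, w)                        # each pixel assigned to one of k clusters
```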

3.4 Feature Extraction

Image texture features were extracted from the images. The training set includes 290 images of healthy and 290 images of diseased pomegranate, 580 images in total. The input image was divided into a 3 × 3 grid of blocks (as shown in Fig. 2), and statistical features were computed for each block, resulting in a total of 54 features per image. On the other hand, features of the image were also extracted through the discrete wavelet transformation (DWT). Thus, two feature vectors, one statistical and one DWT-based, are obtained. These features are fed to the SVM and Random Forest classifiers for the classification of the pomegranate. A sketch of the block-wise extraction is given after Fig. 2.

Fig. 2. Matrix image division.
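The block-wise feature extraction can be sketched as follows; block_statistics is a hypothetical helper that here returns only the mean and standard deviation, whereas the actual method uses the statistical moments defined in the next subsection.

```python
import numpy as np

def block_statistics(block: np.ndarray) -> list:
    """Illustrative per-block statistics; the paper uses the six moments of Table 2."""
    return [float(block.mean()), float(block.std())]

def grid_features(gray_image: np.ndarray, grid: int = 3) -> np.ndarray:
    """Split the image into a grid x grid arrangement of blocks and concatenate their statistics."""
    h, w = gray_image.shape
    bh, bw = h // grid, w // grid
    features = []
    for i in range(grid):
        for j in range(grid):
            block = gray_image[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            features.extend(block_statistics(block))
    return np.asarray(features)  # with six moments per block this yields 9 x 6 = 54 features
```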

Statistical Moments. In the literature there are several techniques to analyze texture in images, one of which is based on the statistical properties of the intensity histogram. In this work, we use the mean, standard deviation, smoothness, asymmetry, uniformity, and entropy measures. These moments are summarized in Table 2.


The nth-order moment about the mean is given by (1):

$\mu_n = \sum_{i=0}^{l-1} (x_i - m)^n \, p(x_i)$,   (1)

where $l$ is the number of possible intensity levels, $x_i$ is a random variable indicating intensity, $p(x_i)$ is the histogram of the intensity levels in a region, and $m$ is the average intensity [31]. The mean value represents the average intensity of an image, and it is then possible to determine the average contrast. The smoothness measures the relative smoothness of the intensity in a region. The asymmetry describes the skewness of a distribution (skewness measures the symmetry of the shape of a distribution). The uniformity measures the homogeneity of intensity in the histogram, and entropy is a measure of randomness and takes low values for smooth images.

Table 2. Statistical moments

Moment | Expression
First moment / mean | $\mu_1 = m = \sum_{i=0}^{l-1} x_i \, p(x_i)$
Second moment / standard deviation | $\mu_2 = \sigma = \sqrt{\mu_2(x)} = \sqrt{\sigma^2}$
Smoothness | $R = 1 - \frac{1}{1+\sigma^2}$
Third moment / asymmetry | $\mu_3 = \sum_{i=0}^{l-1} (x_i - m)^3 \, p(x_i)$
Uniformity | $U = \sum_{i=0}^{l-1} p^2(x_i)$
Entropy | $e = -\sum_{i=0}^{L-1} p(x_i) \log_2 p(x_i)$
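A NumPy sketch of these moments, computed from the normalized intensity histogram of an 8-bit image block, is given below; it is an illustrative helper rather than the exact code used in this work.

```python
import numpy as np

def texture_moments(block: np.ndarray, levels: int = 256) -> dict:
    """Compute the histogram-based moments of Table 2 for an 8-bit image block."""
    hist, _ = np.histogram(block, bins=levels, range=(0, levels))
    p = hist / hist.sum()                       # p(x_i), the normalized histogram
    x = np.arange(levels)
    m = np.sum(x * p)                           # mean (first moment)
    var = np.sum((x - m) ** 2 * p)              # second central moment (sigma^2)
    eps = np.finfo(float).tiny                  # avoids log2(0) in the entropy term
    return {
        "mean": m,
        "std": np.sqrt(var),
        "smoothness": 1.0 - 1.0 / (1.0 + var),  # R = 1 - 1/(1 + sigma^2)
        "asymmetry": np.sum((x - m) ** 3 * p),  # third central moment
        "uniformity": np.sum(p ** 2),
        "entropy": -np.sum(p * np.log2(p + eps)),
    }
```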

DWT. The wavelet transform with a discrete set of wavelet scales gives rise to the Discrete Wavelet Transform (DWT). An image transformed in this way is decomposed into four subbands, denoted LL, LH, HL, and HH. LH, HL, and HH represent the finest-scale wavelet coefficients and carry the vertical, horizontal, and diagonal information of the original image. LL holds the coarse-level coefficients, and the original image can be reconstructed by considering only the LL band image.
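Using PyWavelets, a single-level 2D DWT that yields the four subbands can be sketched as follows; the choice of the Haar wavelet and the energy summary are assumptions for illustration.

```python
import numpy as np
import pywt

def dwt_features(gray_image: np.ndarray) -> np.ndarray:
    """Single-level 2D DWT; returns a simple feature vector built from the four subbands."""
    ll, (lh, hl, hh) = pywt.dwt2(gray_image.astype(float), "haar")
    # LL holds the coarse approximation; LH, HL and HH hold horizontal, vertical
    # and diagonal detail coefficients. Here each subband is summarized by its energy.
    return np.array([np.mean(np.square(band)) for band in (ll, lh, hl, hh)])
```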

3.5 Classification

The performance of three different learning algorithms was compared to select the best model for this problem. The extracted features are classified in order to identify the Botrytis disease using the SVM and Random Forest classifiers. The feature vector is given as input to the classifier and is divided randomly into training and testing vectors: 60% of the data has been used for training and 40% for testing.
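With scikit-learn, this train/test protocol can be sketched as follows; the feature matrix X and label vector y are assumed to come from the extraction stage above, and the classifier hyperparameters shown are defaults chosen for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def evaluate_classifiers(X, y, seed: int = 0) -> dict:
    """Train SVM and Random Forest on 60% of the data and test on the remaining 40%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed, stratify=y)
    scores = {}
    for name, model in {"SVM": SVC(kernel="rbf"),
                        "RandomForest": RandomForestClassifier(n_estimators=100)}.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```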


On the other hand, the Inception V3 CNN has been used as another classifier. The term CNN is used from now on to refer to this classifier, which is used to compare the results with SVM and Random Forest. As with the previous classifiers, 60% of the data has been used for training and 40% for testing. It is important to note that the data used in the tests are not used in the training phase. The set of features extracted from the images was labeled according to the state of the pomegranate, i.e., class one (C1) for Botrytis disease and class two (C2) for the healthy pomegranate.

3.6 Evaluation Function/Performance

Many accuracy measures have been proposed to evaluate the performance of such methods: on the one hand, three standard performance indicators, namely precision, recall, and F-score, and on the other hand, scale-independent and scale-dependent measures of the data. The measures accuracy, precision, recall, and F-measure are built from a confusion matrix [32], which records correctly and incorrectly recognized examples for each class. The matrix of size n × n associated with a classifier shows the predicted and actual classifications, where n is the number of different classes; in our case, n = 2, with classes C1 and C2. Table 3 shows a confusion matrix for binary classification, where:

– TP (true positive) is the number of correct positive predictions;
– FN (false negative) is the number of incorrect negative predictions;
– FP (false positive) is the number of incorrect positive predictions;
– TN (true negative) is the number of correct negative predictions.

Table 3. Confusion matrix

                 | Predicted positive | Predicted negative
Actual positive  | TP                 | FN
Actual negative  | FP                 | TN

The prediction accuracy and classification error can be obtained from the matrix as in (2) and (3):

$Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$,   (2)

$Error = \frac{FP + FN}{TP + FN + FP + TN}$.   (3)
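As a minimal sketch, the entries of the confusion matrix translate directly into these two measures; the small function below is illustrative only.

```python
def accuracy_and_error(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """Prediction accuracy and classification error from the confusion matrix, Eqs. (2)-(3)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total   # Eq. (2)
    error = (fp + fn) / total      # Eq. (3)
    return accuracy, error
```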


Precision and Recall. Precision may be defined as the probability that an object is relevant, given that it is returned by the system. Precision is a function of true positives and examples misclassified as positives (false positives). The precision is given by (4):

$precision = \frac{TP}{TP + FP}$.   (4)

Recall is a function of correctly classified examples (true positives) and misclassified examples (false negatives). This measure is calculated by (5):

$recall = \frac{TP}{TP + FN}$.   (5)

F-Score. Taking the (weighted) harmonic average of precision and recall leads to the F-score. This measure is calculated by (6):

$F\text{-}measure = \frac{(\beta^2 + 1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}$.   (6)

The F-score is evenly balanced when β = 1; it favors precision when β > 1 and recall otherwise [33].

Dependent and Independent Measures. Other accuracy measures are also used: on the one hand, measures whose scale depends on the scale of the data, and on the other hand, scale-independent measures that use relative errors based on the errors produced by a benchmark method. The most widely used scale-dependent accuracy measures are based on the absolute or squared error [34].

Mean Absolute Error:

$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{\theta}_i - \theta_i \right|$.   (7)

Root Mean Squared Error:

$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{\theta}_i - \theta_i \right)^2}$.   (8)

Some measures based on relative errors are the Relative Absolute Error (RAE) and the Root Relative Squared Error (RRSE). They are given in (9) and (10):

$RAE = \frac{\sum_{i=1}^{N} \left| \hat{\theta}_i - \theta_i \right|}{\sum_{i=1}^{N} \left| \theta_i - \bar{\theta} \right|}$,   (9)

where $\bar{\theta}$ is the mean value of $\theta$, and

$RRSE = \sqrt{\frac{\sum_{i=1}^{N} \left( \hat{\theta}_i - \theta_i \right)^2}{\sum_{i=1}^{N} \left( \theta_i - \bar{\theta} \right)^2}}$.   (10)

4 Results

4.1 Dataset

The dataset contains 580 images of both healthy and infected pomegranate, with 290 representative images of each type. All the images were captured by a camera at 5184 × 3456 pixels and saved in 24-bit JPEG format. Each image was cropped to a smaller size of 2353 × 2353 pixels.

4.2 Set Configurations

Several image processing techniques were tested and compared in each of the phases, in order to identify which offers the best results for the classification of the Botrytis. The K-means segmentation algorithm requires the user to select an appropriate value of k. In this work, several tests were performed from k = 2 to k = 5, and we conclude that k = 3 is the appropriate number of clusters. The RGB, CIELab, and HSV color spaces were used in K-means clustering, with the absolute difference. Likewise, two morphological operations, dilation and erosion, are applied, as well as a Gaussian filter to enhance the image quality. Three experimental configurations were used to obtain the best classification. The first configuration considers the morphological operations and the Gaussian filter; the K-means segmentation algorithm is then applied, and statistical or DWT features are then extracted. The feature vectors are finally given as input to the Random Forest and SVM classifiers. The second configuration is similar to the first, but segmentation is carried out manually, that is, the disease is extracted by cropping a region of interest (ROI) of size 657 × 657 pixels. The third configuration consists of first applying the Gaussian filter and then the CNN classifier; when the Gaussian filter is applied, manual segmentation or K-means segmentation is then performed, and finally the images are classified with the CNN classifier. These three configurations are shown in Fig. 3.

4.3 Experimental Setup

We performed experimental simulations to determine the accuracy of the method. The experiments were run on a 3.1 GHz four-core Intel Core i5 (Mac) with 8 GB of RAM, under the operating system OS X 10.11.6. Matlab 2014b was used to implement the classification method.


Fig. 3. Configurations experiment; (a) first configuration; (b) second configuration; (c) third configuration.

Fig. 4. Pomegranate fruit, (a) outside; (b) inside.

4.4 Results

A pomegranate that is visibly healthy but has Botrytis inside is shown in Fig. 4. A preliminary experiment was carried out to perform the segmentation stage using K-means with the RGB, CieLab, and HSV color spaces. The results are shown in Fig. 5. As can be seen, the HSV color space performs better than the other color spaces; that is, the disease is grouped better in the HSV color space. From now on, the experiments with K-means use this color space. The results of both the first and the second configuration, in terms of performance evaluation measures such as accuracy, precision, recall, and F-measure, are shown in Fig. 6. Figure 6a shows that the best results in accuracy are obtained using K-means segmentation and the SVM classifier. However, the best overall performance is obtained using the Gaussian filter, statistical features, manual segmentation, and the SVM classifier, with 96% accuracy. The same behavior is observed for precision, recall, and F-measure (Fig. 6b, 6c, 6d, respectively). Table 4 shows the results obtained in MAE, RMSE, RAE, and RRSE using K-means segmentation, the statistical and DWT features, and the random forest classifier.


Fig. 5. Segmentation of the image. RGB color spaces: (a) cluster 1, (b) cluster 2, (c) cluster 3. CieLab color spaces: (d) cluster 1, (e) cluster 2, (f) cluster 3. HSV color spaces: (g) cluster 1, (h) cluster 2, (i) cluster 3.

In these measures, the smaller is better. Therefore, the best results are presented using the Gaussian filter and statistical features.

Table 4. K-means segmentation and random forest classifier.

         | Statistic feature               | DWT feature
         | Dilatation | Erosion | Gaussian | Dilatation | Erosion | Gaussian
MAE      | 0.1757     | 0.1351  | 0.0676   | 0.2027     | 0.1486  | 0.1892
RMSE     | 0.4191     | 0.3676  | 0.2599   | 0.4502     | 0.3855  | 0.435
RAE(%)   | 34.8129    | 26.7792 | 13.3896  | 40.1687    | 29.4571 | 37.4908
RRSE(%)  | 82.8714    | 72.6831 | 51.3947  | 89.0182    | 76.2306 | 85.9998

Table 5 shows the results of the classification using manual segmentation and random forest. The best results are obtained when the Gaussian filter and DWT features are used. Table 6 shows the results obtained using K-means segmentation and the SVM classifier. As noted in this table, the best results are obtained by extracting the DWT features instead of the statistical ones, regardless of which preprocessing technique was used.


Fig. 6. Performance evaluation measures


Table 5. Manual segmentation and random forest classifier.

         | Statistic Feature                | DWT Feature
         | Dilatation | Erosion  | Gaussian | Dilatation | Erosion  | Gaussian
MAE      | 0.3327     | 0.3327   | 0.1886   | 0.3714     | 0.3543   | 0.1429
RMSE     | 0.5768     | 0.5768   | 0.4342   | 0.6094     | 0.5952   | 0.378
RAE(%)   | 66.5353    | 66.5353  | 37.7137  | 74.2845    | 70.856   | 28.5709
RRSE(%)  | 115.3546   | 115.3546 | 119.9975 | 121.8873   | 119.0413 | 86.8478

Table 6. K-means segmentation and SVM classifier.

         | Statistic Feature               | DWT Feature
         | Dilatation | Erosion | Gaussian | Dilatation | Erosion | Gaussian
MAE      | 0.1216     | 0.0946  | 0.1486   | 0.0541     | 0.0541  | 0.0541
RMSE     | 0.3487     | 0.3076  | 0.3855   | 0.2325     | 0.2325  | 0.2325
RAE(%)   | 24.1012    | 18.7454 | 29.4571  | 10.7117    | 10.7117 | 10.7117
RRSE(%)  | 68.9532    | 60.811  | 76.2306  | 45.9688    | 45.9688 | 45.9688

Table 7 shows the results obtained using manual segmentation and the SVM classifier. As we can see, the best results are obtained when the Gaussian filter and statistical features are used, except for the RRSE measure, whose best results were obtained using the DWT features.

Table 7. Manual segmentation and SVM classifier.

         | Statistic Feature                | DWT Feature
         | Dilatation | Erosion  | Gaussian | Dilatation | Erosion  | Gaussian
MAE      | 0.2686     | 0.3486   | 0.04     | 0.2857     | 0.2971   | 0.1029
RMSE     | 0.5182     | 0.5904   | 0.2      | 0.5345     | 0.5451   | 0.3207
RAE(%)   | 53.7134    | 69.7131  | 7.9999   | 57.1419    | 59.4276  | 20.5711
RRSE(%)  | 103.6456   | 118.0774 | 105.8278 | 106.9022   | 109.0193 | 64.1413

Considering the results presented in the previous tables, it is significant to note that the best results are obtained using the Gaussian filter in the preprocessing stage, manual segmentation, statistical features, and the SVM classifier. Finally, the percentages of pomegranates correctly and incorrectly classified with the different configurations are given in Table 8. As noted in this table, the best classification is obtained using manual segmentation, whether CNN or SVM is used, with 96.93% and 96% correct classifications, respectively. On the other hand, the segmentation can also be obtained by using the K-means


algorithm. K-means segmentation, DWT features, and the SVM classifier form the best configuration with K-means.

Table 8. Percentage of classification.

Configuration | Correctly classified (%) | Incorrectly classified (%)
Manual segmentation, statistic feature and SVM classifier | 96.00 | 4.00
Manual segmentation, DWT feature and SVM classifier | 89.71 | 10.29
K-means segmentation, statistic feature and SVM classifier | 85.14 | 14.86
K-means segmentation, DWT feature and SVM classifier | 94.59 | 5.41
Manual segmentation, statistics feature and random forest classifier | 81.14 | 18.86
Manual segmentation, DWT feature and random forest classifier | 85.71 | 14.29
K-means segmentation, statistics feature and random forest classifier | 93.24 | 6.76
K-means segmentation, DWT feature and random forest classifier | 85.14 | 14.86
Manual segmentation and CNN classifier | 96.93 | 1.32
Gaussian filter and CNN classifier | 81.58 | 1.75
K-means segmentation and CNN classifier | 74.58 | 25.42

5 Conclusion

In this paper, we have proposed a method to identify and classify Botrytis disease of the pomegranate through machine learning techniques. This method utilizes a combination of several image processing, segmentation, feature extraction, and classification techniques in order to identify which of them offers the best results in the classification. Manual segmentation with the CNN or SVM classifier achieves an accuracy of 96.93% and 96%, respectively. K-means clustering with DWT features and the SVM classifier reaches 94.59% accuracy. The results obtained indicate that it is possible to carry out the selection of the pomegranate with a high percentage of accuracy, helping to identify and classify Botrytis disease automatically. The automated identification and classification of Botrytis can be an alternative to visual methods, and thus the quality of pomegranate-derived products will improve. It would be worthwhile to study this methodology in other crops.


References
1. Jarvis, W.: Botrytinia and Botrytis species: taxonomy and pathogenicity. [Canada Department of Agriculture Monograph No. 15]. Agriculture Canada, Ottawa (1977)
2. Teksur, P.K.: Alternative technologies to control postharvest diseases of pomegranate. Stewart Postharvest Rev. 11, 1–7 (2015)
3. Zaccone, G.: Getting Started with TensorFlow. Packt Publishing Ltd (2016)
4. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference (2002)
5. Iqbal, Z., Khan, M.A., Sharif, M., Shah, J.H., ur Rehman, M.H., Javed, K.: An automated detection and classification of citrus plant diseases using image processing techniques: a review. Comput. Electron. Agric. 153, 12–32 (2018)
6. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297, June 1967
7. Sharif, M., Khan, M.A., Iqbal, Z., Azam, M.F., Lali, M.I.U., Javed, M.Y.: Detection and classification of citrus diseases in agriculture based on optimized weighted segmentation and feature selection. Comput. Electron. Agric. 150, 220–234 (2018)
8. Liakos, K.G., Busato, P., Moshou, D., Pearson, S., Bochtis, D.: Machine learning in agriculture: a review. J. Sens. 18(8), 2674 (2018)
9. Gómez-Flores, W., Garza-Saldaña, J.J., Varela-Fuentes, S.E.: Detection of Huanglongbing disease based on intensity-invariant texture analysis of images in the visible spectrum. Comput. Electron. Agric. 162, 825–835 (2019)
10. Abraham, A., Dutta, P., Mandal, J.K., Bhattacharya, A., Dutta, S.: Emerging technologies in data mining and information security. In: Proceedings of IEMIS 2018. Springer, Heidelberg (2018)
11. Agarwal, A., Sarkar, A., Dubey, A.K.: Computer vision-based fruit disease detection and classification. In: Smart Innovations in Communication and Computational Sciences, pp. 105–115. Springer, Singapore (2019)
12. Awate, A., Deshmankar, D., Amrutkar, G., Bagul, U., Sonavane, S.: Fruit disease detection using color, texture analysis and ANN. In: 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), pp. 970–975. IEEE, October 2015
13. Zhu, J., Wu, A., Wang, X., Zhang, H.: Identification of grape diseases using image analysis and BP neural networks. Multimedia Tools Appl. 1–13 (2019)
14. Pantazi, X.E., Moshou, D., Tamouridou, A.A.: Automated leaf disease detection in different crop species through image features analysis and one class classifiers. Comput. Electron. Agric. 156, 96–104 (2019)
15. Kaya, A., Keceli, A.S., Catal, C., Yalic, H.Y., Temucin, H., Tekinerdogan, B.: Analysis of transfer learning for deep neural network based plant classification models. Comput. Electron. Agric. 158, 20–29 (2019)
16. Archana, K.S., Sahayadhas, A.: Automatic rice leaf disease segmentation using image processing techniques. Int. J. Eng. Technol. 7(3.27), 182–185 (2018)
17. Dhingra, G., Kumar, V., Joshi, H.D.: Study of digital image processing techniques for leaf disease detection and classification. Multimedia Tools Appl. 77(15), 19951–20000 (2018)
18. Sardogan, M., Tuncer, A., Ozen, Y.: Plant leaf disease detection and classification based on CNN with LVQ algorithm. In: 2018 3rd International Conference on Computer Science and Engineering (UBMK), pp. 382–385. IEEE, September 2018


19. Hlaing, C.S., Zaw, S.M.M.: Tomato plant diseases classification using statistical texture feature and color feature. In: 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS), pp. 439–444. IEEE, June 2018
20. Chouhan, S.S., Kaul, A., Singh, U.P., Jain, S.: Bacterial foraging optimization based radial basis function neural network (BRBFNN) for identification and classification of plant leaf diseases: an automatic approach towards plant pathology. IEEE Access 6, 8852–8863 (2018)
21. Maniyath, S.R., Vinod, P.V., Niveditha, M., Pooja, R., Shashank, N., Hebbar, R.: Plant disease detection using machine learning. In: 2018 International Conference on Design Innovations for 3Cs Compute Communicate Control (ICDI3C), pp. 41–45. IEEE, April 2018
22. Gandhi, R., Nimbalkar, S., Yelamanchili, N., Ponkshe, S.: Plant disease detection using CNNs and GANs as an augmentative approach. In: 2018 IEEE International Conference on Innovative Research and Development (ICIRD), pp. 1–5. IEEE, May 2018
23. Kamilaris, A., Prenafeta-Boldú, F.X.: Deep learning in agriculture: a survey. Comput. Electron. Agric. 147, 70–90 (2018)
24. Golhani, K., Balasundram, S.K., Vadamalai, G., Pradhan, B.: A review of neural networks in plant disease detection using hyperspectral data. Inf. Process. Agric. 5(3), 354–371 (2018)
25. Kaur, S., Pandey, S., Goel, S.: Plants disease identification and classification through leaf images: a survey. Arch. Comput. Methods Eng. 26(2), 507–530 (2019)
26. Huang, K.Y.: Application of artificial neural network for detecting Phalaenopsis seedling diseases using color and texture features. Comput. Electron. Agric. 57(1), 3–11 (2007)
27. Pawara, S., Nawale, D., Patil, K., Mahajan, R.: Early detection of pomegranate disease using machine learning and internet of things. In: 2018 3rd International Conference for Convergence in Technology (I2CT), pp. 1–4. IEEE, April 2018
28. Akin, C., Kirci, M., Gunes, E.O., Cakir, Y.: Detection of the pomegranate fruits on tree using image processing. In: 2012 First International Conference on Agro-Geoinformatics, pp. 1–4. IEEE, August 2012
29. Pawar, R., Jadhav, A.: Pomogranite disease detection and classification. In: 2017 IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), pp. 2475–2479. IEEE, September 2017
30. Masotti, M., Campanini, R.: Texture classification using invariant ranklet features. Pattern Recognit. Lett. 29(14), 1980–1986 (2008)
31. Sheshadri, H.S., Kandaswamy, A.: Experimental investigation on breast tissue classification based on statistical feature extraction of mammograms. Comput. Med. Imaging Graph. 31(1), 46–48 (2007)
32. Provost, F., Kohavi, R.: Guest editors' introduction: on applied research in machine learning. Mach. Learn. 30(2), 127–132 (1998)
33. Sokolova, M., Japkowicz, N., Szpakowicz, S.: Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In: Australasian Joint Conference on Artificial Intelligence, pp. 1015–1021. Springer, Heidelberg, December 2006
34. Chen, C., Twycross, J., Garibaldi, J.M.: A new accuracy measure based on bounded relative error for time series forecasting. PLoS ONE 12(3), e0174202 (2017)

Rethinking Our Assumptions About Language Model Evaluation

Nancy Fulda(B)

Brigham Young University, Provo, UT 84602, USA
[email protected]

Abstract. Many applications of pre-trained language models use their learned internal representations, also known as word- or sentence embeddings, as input features for other language-based tasks. Over recent years, this has led to the implicit assumption that the quality of such embeddings is determined solely by their ability to facilitate transfer learning. In this position paper we argue that pre-trained linguistic embeddings have value above and beyond their utility as input features for downstream tasks. We adopt a paradigm in which they are instead treated as implicit knowledge repositories that can be used to solve commonsense reasoning problems via linear operations on embedded text. To validate this paradigm, we apply our methodology to tasks such as threat detection, emotional classification, and sentiment analysis, and demonstrate that linguistic embeddings show strong potential at solving such tasks directly, without the need for additional training. Motivated by these results, we advocate for empirical evaluations of language models that include vector-based reasoning tasks in addition to more traditional benchmarks, with the ultimate goal of facilitating language-based reasoning, or ‘reasoning in the linguistic domain’. We conclude by analyzing the structure of currently available embedding models and identifying several shortcomings which must be overcome in order to realize the full potential of this approach. Keywords: Language models · Language model evaluation · Word embeddings · Sentence embeddings · Common-sense reasoning

1 Introduction

When evaluating language models, and particularly their learned sentence representations, researchers often focus on cross-task generalization. The objective is to determine how well the learned sentence embeddings function as input features for other language-based tasks. The quality of the embedding space is thus, by default, defined in terms of its facilitation of transfer learning. This is a valid paradigm, but not the only possible one, and this paper encourages a community-wide re-examination of our assumptions about language model evaluation. We begin by taking note of the way these models are being used in


the wild – by hobbyists and industry professionals. When one reads blog posts and online articles about word embeddings, or when one browses through discussion forums, the most common application of these learned representations is not as uninterpretable input features for downstream tasks. Instead, one observes an inherent fascination with the embeddings themselves. For example, the AI infrastructure website Skymind features an article that explores meaningful word2vec associations such as Iraq - violence = Jordan and library - books = hall [25]; computer science blogger Adrian Colyer writes about “the amazing power of word vectors” at representing semantic meaning [6]; and Chris Moody at StitchFix explores the idea of using vector addition to augment semantic search [24]. A common theme in these and other online articles is the idea that cosine distance between embedded texts can be used as an analogue for semantic similarity [14,30,31]. Many web sites also allow users to ‘play’ with various embedding spaces by calculating cosine similarity or projecting embeddings into interesting subspaces for observation [1,2,12,21,28]. These online artifacts leave us facing a strange dichotomy. Multi-word embedding spaces like skip-thoughts, BERT, and Google’s universal sentence encoder gained prestige by exceeding previous transfer learning benchmarks [5,7,19], and yet the average user seems to want to use the linguistic embeddings directly, independent of any transfer learning they facilitate. Undeniably, there is a certain intuitive appeal to this desire. After all, if words and sentences can be represented as numbers, ought one not to be able to manipulate them mathematically?

1.1 Doing Math with Language

The answer, of course, is that one can, at least at the level of single words. In 2013, Mikolov et al. observed what has since become a hallmark feature of word-level embedding models: their ability to represent linguistic regularities in the form of analogical relations [23]. Although trained for different purposes entirely, most word-level embedding models can be used to solve analogical queries of the form a:b::c:d (a is to b as c is to d). Leveraging this principle, it is possible to use simple linear relations to discover the unknown value d, e.g.

Madrid − Spain + France = Paris

The possibilities are tantalizing. Researchers have demonstrated that mathematical operations on word embeddings can be used to detect affordances [9], infer the locations of everyday objects [11], and condition agent behaviors on natural language instructions [8]. A natural extension of this phenomenon would be to apply these same principles at the sentence level. However, there is a notable dearth of papers demonstrating such applications, perhaps because coherent results are more difficult to achieve. For example, using a pre-trained skip-thought encoder [18] and corresponding decoder trained on reddit data, the following equivalences hold:

‘I am angry’ − ‘angry’ + ‘happy’ = ‘I happy’


‘thank you’ + ‘you’re welcome’ = ‘enjoy you’

At first glance, it appears that this sentence-level embedding space is functioning with the same precision as its word-level predecessors. Most people find it intuitively obvious that if you remove anger and add happiness the sentence ‘I am angry’ would transmute into ‘I am happy’, and the skip-thought embedding space has produced an acceptable approximation to that result. Similarly, the pleasantries ‘thank you’ and ‘you’re welcome’ are generally used when we wish to create an atmosphere of congeniality and enjoyment of one another’s company. The decoded phrase ‘enjoy you’ is suggestive of this idea. So far so good, but alas, the illusion of analogical coherence breaks down as soon as more equations are attempted:

‘the sky is blue’ − ‘blue’ + ‘orange’ = ‘the orange is the orange’

We would have expected the output phrase to be ‘the sky is orange’ or perhaps ‘the sunset is orange’, but instead we end up with a nonsense statement. Taking these examples together, it seems clear that the potential for direct mathematical manipulation of the embedding space is present, but the space is insufficiently structured (or the decoder insufficiently trained) to allow this information to be reliably extracted. The goal of this paper is twofold: First, to demonstrate that this same potential is present in many of the currently available sentence embedding models, albeit in primordial form. Second, to outline steps that may lead to the full realization of this potential in future models. We will proceed as follows: In Sect. 2 we discuss related work as pertaining to linguistic embedding models. Section 3 introduces a series of quantitative experiments that measure, to some extent, the amount of semantic knowledge encoded within a given embedding space, and presents our experimental results. Section 4 interprets and lends context to these results, issues a call for researchers to reevaluate their default assumptions about language model evaluation, and lays out a roadmap for future work on this topic.
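As a concrete illustration of the word-level arithmetic discussed in Sect. 1.1, the query Madrid − Spain + France can be posed to any word2vec-format embedding with gensim; the model path below is a placeholder, and the exact nearest neighbours depend on the embedding used.

```python
from gensim.models import KeyedVectors

# Path is a placeholder; any word2vec-format embedding file could be used here.
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# Madrid - Spain + France: positive terms are added, negative terms subtracted.
for word, score in vectors.most_similar(positive=["Madrid", "France"],
                                        negative=["Spain"], topn=3):
    print(word, round(score, 3))   # with typical models, 'Paris' ranks near the top
```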

2 Related Work: Linguistic Embeddings as Knowledge Repositories

Common-sense knowledge is intrinsic to our perception of the world. We see a closed door and instantly understand that a new environment lies beyond it. We see gyrating reflections and immediately know we are looking at a body of water. Common-sense knowledge also helps us to make predictions: A dropped ball will bounce. A tipped glass will spill its contents. These and similar experiences are so ubiquitous that we seldom notice the assumptions we make or the way our expectations shape our experience. We assert that one reason people find linguistic embeddings so fascinating is because they represent common-sense knowledge about the world in interpretable and manipulable ways. For example, you can add the words Vietnam


and Capitol and get Hanoi [6], or you can calculate that human − animal = ethics [25]. Somehow, although they have not been explicitly trained for it, the embedding models are allowing us to play with language and produce meaningful results. This is simultaneously intriguing and puzzling, particularly when one considers the way such embeddings are produced.

2.1 Overview of Embedding Models

Linguistic embeddings, also known as vector space models or distributed representations of words, rose to unprecedented prominence within the research community with the introduction of the GLoVE [26] and word2vec [22] algorithms, both of which use unsupervised methods to produce vector representations of words. Additional word-level embedding models followed, including the FastText algorithm [3], which utilizes subword information to enrich the resulting word vectors. More recently, the ELMo architecture [27] uses a deep bidirectional language model which was pre-trained on a large text corpus; the resulting word vectors are learned functions of the language model’s internal states. In 2016, Kiros et al. presented skip-thoughts [19], an extension to multi-word text of the context-prediction task used to train word2vec. Other neural embedding models quickly followed. Google’s Universal Sentence Encoder [5] features two variants: a lightweight implementation that disregards syntax in favor of a quickly trainable bag-of-words representation [15], and a large model based on a Transformer architecture structured around attention mechanisms [32]. Most recently, the BERT architecture utilizes a multi-layer bidirectional Transformer encoder to create general purpose embeddings that generalize to a variety of downstream tasks [7].

2.2 Knowledge Extraction via Vector Offsets

A number of researchers have explored vector-based methods for extracting common-sense knowledge from learned embedding spaces. Georgios et al. used centroids of word embeddings in combination with Word Mover’s distance in a biomedical document retrieval task [4]. Fulda et al. used a similar approach to determine object affordances in text-based video games [9]. Linguistic embeddings have also been used to link entities through an ontology [16], identify correlations between images and captions [17], and augment the behavior of regex matching [34]. A particularly interesting application is diachronic word embeddings [13], which were trained on a series of temporally discrete corpora and then used to analyze the evolution of cultural attitudes over time. While these applications utilize the semantic and extractive properties of linguistic embeddings for common-sense reasoning, they also combine the embeddings with other computational techniques in order to achieve the desired result. Our work is distinct in that we explore the behavior of the embedding space directly via a form of n-shot learning in which a small number of example cases are used to generalize to a broader reasoning task.


3 Quantitative Experiments

To demonstrate the potential of linguistic embedding spaces to act as commonsense knowledge repositories, we apply a simple distance metric within the embedding space in order to solve three classification tasks.

1. Task 1: Threat detection. Utilizing the Skyrim dataset presented in [10], we classify each human-generated caption as representing one of four possible interaction modes: Threat, Barter, Explore, or Puzzle. An example from the dataset is shown in Fig. 1.
2. Task 2: Emotional classification. Using a subset of the Daily Dialog dataset [20], we classify each sentence according to the emotion it expresses: anger, disgust, fear, happiness, sadness, surprise. (Sentences in which no emotion was expressed were removed from the dataset prior to classification).
3. Task 3: Sentiment analysis. Using data from SemEval 2013 [33], we classify each tweet as being positive, negative or neutral/objective.

[Figure 1 example — Human text: 'An archer ready to fight against the enemy'; Label: Threat]

Fig. 1. Example image and associated caption from the SkyRim dataset. The goal of the algorithm was to determine which of four possible interaction modes was indicated based on the input text and a set of example sentences like those shown in Fig. 2.

3.1 Classification Algorithm

Classification was accomplished strictly on the basis of cosine distance metrics within the embedding space. A set of ten exemplars per category is extracted from each dataset prior to evaluation (in the case of the SkyRim dataset, we used the same exemplar sentences provided by the original authors). During evaluation, each new sentence or tweet is assigned the same category as the nearest exemplar sentence (thus, we are using a KNN classification algorithm with K = 1). The purpose of this highly simplified algorithm was to explore the native properties of the embedding space. We wanted to know how much common-sense knowledge was implicitly encoded within the geometry of the embeddings, and whether it was sufficient to solve sophisticated common-sense reasoning tasks. We specifically wanted tasks that did not rely on semantic similarity alone, but instead required the agent to distinguish between various categories of emotion, sentiment, or situation regardless of specific semantic content. We compared results using several popular linguistic embedding models currently available for download, as well as a random baseline for comparison. It is worth noting that we also tried taking the centroid of the exemplars rather than using a nearest-neighbor approach; this method performed worse overall.

Interaction Mode: Threat
'You see a soldier holding a sword'
'You are badly wounded'
'A massive troll bars the path'
'The bull paws the ground, then charges toward you'
'The poisonous spider advances, ready with its deadly bite'
'You are in danger'
'If you fall from this height, you will die'
'The battle rages around you'
'The angry man begins to attack you'
'You are plummeting to your death, with only a few seconds before you hit the ground'

Fig. 2. Example texts used to define the ‘Threat’ mode, meaning that an immediate physical danger is present. Similar example texts were available for the interaction modes ‘Explore’, ‘Barter’, and ‘Puzzle’.
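To make the procedure concrete, the following minimal sketch implements the exemplar-based classification described above. The `embed` function is a placeholder standing in for any of the sentence encoders compared in this paper; this is an illustrative reconstruction, not the authors' original code.

```python
# Minimal sketch of the 1-nearest-neighbor classifier described in Sect. 3.1.
# `embed` is a placeholder for any sentence encoder (USE, skip-thought, BERT, ...).
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(sentence, exemplars, embed):
    """Assign `sentence` the label of the nearest exemplar (KNN with K = 1).

    exemplars: list of (example_sentence, label) pairs, ten per category.
    embed:     callable mapping a sentence to a fixed-size numpy vector.
    """
    query = embed(sentence)
    best_label, best_dist = None, float("inf")
    for text, label in exemplars:
        dist = cosine_distance(query, embed(text))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical usage:
# exemplars = [("You see a soldier holding a sword", "Threat"),
#              ("A merchant offers you a discount", "Barter"), ...]
# label = classify("An archer ready to fight against the enemy", exemplars, embed)
```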

3.2 Results

Results are shown in Fig. 3. Note that the interesting aspect of these experiments is not the classification accuracy per se, but what the results reveal about the underlying nature of the various embedding spaces. Our objective was to create a quantifiable measurement of the amount of semantic knowledge that could be extracted from each embedding model on the basis of cosine similarity alone. (Other extraction methods could also be explored, of course; but since the general usage of linguistic embeddings by hobbyists and developers relies on cosine distance as an estimate of semantic similarity, we chose to support that paradigm.) If such knowledge is demonstrably present and extricable, this provides a foundation for researchers to reconsider whether these properties should be explicitly encouraged via our evaluation metrics, rather than allowing them to develop haphazardly as a byproduct of current training methods.

                   Skyrim    Emotions   Sentiment   Average
Skip-thought       45.45%    39.48%     39.94%      41.62%
Google USE lite    54.54%    37.83%     38.96%      43.78%
Google USE large   63.63%    43.45%     41.50%      49.53%
Spacy              27.27%    20.06%     37.13%      28.15%
FastText           51.52%    28.80%     39.80%      40.04%
BERT               54.54%    35.79%     34.78%      41.70%
random             24.24%    16.62%     32.31%      24.39%

Fig. 3. Categorization accuracy on three n-shot tasks that require common-sense reasoning. The spacy vectors were generated using spacy version 2.0.11, which is based on a (possibly weighted) average of GLoVE vectors [29]. The FastText embedding was generated by averaging the FastText vectors for each word in the sentence. Other embeddings were generated using the models cited in their respective papers [5, 7, 19]. The highest accuracy in each column is bolded.

In all cases, the sole use of vector offsets within the embedding space is able to outperform a random baseline, thus demonstrating that some amount of semantic information and common-sense knowledge is present. At the same time, the generally poor performance of the algorithms reveals that the embedding spaces are not sufficiently structured to fully realize this potential. Of the algorithms explored, Google's large encoding model appears to be the most effective, with BERT, USE lite, and Skip-thought more or less tied for second place. A simple averaging of FastText word vectors performs remarkably well given that it retains no information about word order or grammatical structure.

It is interesting that performance on the sentiment analysis task is lower than on other tasks despite the relatively small number of categories. With only three options to choose from, one would expect the algorithms to perform better. We speculate that the abbreviations, URLs, and webisms of Twitter may be functioning as distractors for embedding models that were trained on the more traditional text found in Google News, Wikipedia, or the Toronto Book Corpus.

As mentioned earlier, cosine distance is not the only possible method for extracting semantic information from learned sentence representations. Other distance metrics such as correlation, Manhattan distance, or Mahalanobis distance could be explored. But since the common practice among developers is to take the cosine distance of word vectors when estimating their similarity, it seems logical to design an embedding space that matches these expectations.

3.3 Semantic Analysis

Linguistic embedding spaces are attractive for reasoning tasks because they contain implicit knowledge about language, causation, the physical behavior of objects, and the social behavior of humans. Unfortunately, the structure of currently available embedding spaces does not fully utilize this potential. In particular, the inability to distinguish between polar opposites such as hot/cold, beautiful/ugly, or yes/no can become a hindrance to many analogical reasoning tasks, as can the inability to detect the difference between a sentence and its negation. Various applications ranging from embedding grammars [35] to language-based information transfer [8] would benefit from linguistic representations that made these distinctions easy to detect. To determine the extent to which these distinctions are represented in current state-of-the-art embedding spaces, we conducted a small case study based on cosine distance. Figure 4 shows the calculated distances between pairs of related sentences under six commonly used embedding models. Examination of the data reveals that across all six embedding models, semantically similar sentence pairs (e.g. "In Tahiti, the cat chased the dog" and "The cat chased the dog in Tahiti") are consistently assigned a higher cosine distance than a semantically distinct pairing (e.g. "the cat chased the dog" and "the dog chased the cat"). Only one of the semantic distance challenges was successfully solved by any of the models.

[Fig. 4 data: pairwise cosine distances between the sentence pairs 'the cat chased the dog' / 'the dog chased the cat', 'In Tahiti, the cat chased the dog' / 'The cat chased the dog in Tahiti', 'I am a cat' / 'I am not a cat', and 'I am a cat' / 'I am a domesticated cat' under the skip-thought, USE lite, USE large, spacy, FastText, ELMo, and BERT models.]

Fig. 4. Case study exploring cosine distance between sentence pairs under various embedding models. Distance tuples that represent semantically appropriate relative distances are shown in bold-face text. Of the embedding models surveyed, only Elmo was able to rank semantically disparate sentences as having a higher cosine distance than a related synonymous pair, and it succeeded on only one of the two examples depicted.
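A reader can reproduce the flavor of this case study with off-the-shelf tools. The sketch below uses spaCy's averaged word vectors (the `en_core_web_md` model, assumed to be installed) purely as an illustration; `Doc.similarity` returns the cosine similarity of the averaged vectors, so one minus that value is the cosine distance. This is not the exact pipeline used to produce Fig. 4.

```python
# Illustrative check of pairwise cosine distances between sentence pairs,
# using spaCy's averaged word vectors (assumes `en_core_web_md` is installed).
import spacy

nlp = spacy.load("en_core_web_md")

pairs = [
    ("the cat chased the dog", "the dog chased the cat"),
    ("In Tahiti, the cat chased the dog", "The cat chased the dog in Tahiti"),
    ("I am a cat", "I am not a cat"),
    ("I am a cat", "I am a domesticated cat"),
]

for s1, s2 in pairs:
    # Doc.similarity is the cosine similarity of the averaged word vectors,
    # so 1 - similarity is the cosine distance examined in the case study.
    dist = 1.0 - nlp(s1).similarity(nlp(s2))
    print(f"{s1!r} vs {s2!r}: cosine distance = {dist:.4f}")
```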

This (small) case study suggests that serious semantic flaws are a common occurrence in current state-of-the-art embedding spaces. One might consider this a major setback, but from our point of view it represents a critical opportunity for linguistic embedding spaces to chart new territory. If one were able to design and train an embedding model that correctly reflects the semantic meaning of sentences via pairwise cosine distance, then a form of language-based common-sense reasoning, or 'reasoning in the linguistic domain', becomes possible. Such an embedding space would facilitate reasoning tasks via vector offset methods, such as determining that a jilted lover + a dangerous weapon + an argument late at night → murder. At present, only word-level embedding spaces are able to function with such precision, but we envision a future in which things might be different.
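As an illustration of the word-level vector-offset reasoning alluded to here, the sketch below runs a textbook analogy query with gensim and a pretrained word2vec model. The model identifier is the standard gensim-data name, and the king/queen analogy is a classic example rather than a result from this paper.

```python
# Illustrative word-level vector-offset ("analogy") query with gensim.
# Assumes the pretrained Google News word2vec vectors are available via the
# gensim downloader (a large download on first use).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# king - man + woman  ->  queen (classic vector-offset example)
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```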

4 Conclusions and Future Work

In this position paper we have shown that latent potential for language-based reasoning exists in current state-of-the-art embedding spaces. Our ability to leverage this potential, however, is limited by semantic flaws within the structure of the embedding space itself. We believe that this drawback can be overcome by re-examining our chosen evaluation metrics for neural language models.

As researchers continue to develop new architectures and training curricula for large language models, it becomes important to carefully consider what kind of performance we are measuring and whether it is leading us to the outcomes we desire. There is (obviously) nothing wrong with linguistic embeddings that are trained for, and evaluated based upon, the model's performance with respect to established task-transfer benchmarks. However, a myopic focus on benchmark-based evaluations might lead us to an increasingly large selection of embedding spaces that are increasingly unsuited for language-based reasoning tasks.

We strongly urge researchers to consider common-sense reasoning tasks based on cosine distance and vector offsets as potential evaluation metrics for future language models. By doing so, we will open the door to creating a new kind of embedding space, one that is able to facilitate effective task transfer while still enabling language-based reasoning. Such models, if we are able to develop them, could support research in fields such as planning and prediction, explainable AI, question answering, and language-based interfaces.

Future work in this area should focus on neural architectures and training methods that are designed with the explicit goal of capturing semantic knowledge within the structure of the learned embedding space. Various network architectures including recurrent networks, convolutional networks, and transformers should be evaluated based on the quality of their learned embedding spaces instead of, or in addition to, their facilitation of downstream learning tasks. Finally, novel extraction methods should be customized to the nature of each kind of embedding space, and researchers should develop improved analytical methods for determining the amount of semantic knowledge contained within a linguistic embedding space.

References

1. Embedding projector. https://projector.tensorflow.org/
2. Dandelion API: Text similarity: estimate the degree of similarity between texts. https://dandelion.eu/semantic-text/text-similarity-demo/
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
4. Brokos, G.-I., Malakasiotis, P., Androutsopoulos, I.: Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering. CoRR, abs/1608.03905 (2016)
5. Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., St. John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., Kurzweil, R.: Universal sentence encoder. CoRR, abs/1803.11175 (2018)
6. Colyer, A.: The amazing power of word vectors (2016). https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
7. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Fulda, N., Murdoch, B., Ricks, D., Wingate, D.: Informing action primitives through free-form text. In: NIPS Workshop on Visually Grounded Interaction and Language (2017)
9. Fulda, N., Ricks, D., Murdoch, B., Wingate, D.: What can you do with a rock? Affordance extraction via word embeddings. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, pp. 1039–1045 (2017)
10. Fulda, N., Ricks, D., Murdoch, B., Wingate, D.: Threat, explore, barter, puzzle: a semantically-informed algorithm for extracting interaction modes. In: AAAI Workshop on Knowledge Extraction from Games (2018)
11. Fulda, N., Tibbetts, N., Brown, Z., Wingate, D.: Harvesting common-sense navigational knowledge for robotics from uncurated text corpora. In: Proceedings of the First Conference on Robot Learning (CoRL) (2017)
12. The Turku NLP Group: Word embeddings demo. http://bionlp-www.utu.fi/wv_demo/
13. Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. CoRR, abs/1605.09096 (2016)
14. Horan, C.: Using sentence embeddings to automate customer support, part one, December 2018. https://blog.floydhub.com/automate-customer-support-part-two/
15. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Association for Computational Linguistics (2015)
16. Karadeniz, İ., Özgür, A.: Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinformatics 20(1), 156 (2019)
17. Karpathy, A., Joulin, A., Li, F.: Deep fragment embeddings for bidirectional image sentence mapping. arXiv:1406.5679 (2014)
18. Kiros, R.: Sent2vec encoder and training code from the paper "Skip-thought vectors" (2017). https://github.com/ryankiros/skip-thoughts
19. Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R.S., Torralba, A., Urtasun, R., Fidler, S.: Skip-thought vectors, pp. 3294–3302 (2015)
20. Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: DailyDialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957 (2017)
21. Liu, A.: Word to vec JS demo (2016). http://turbomaze.github.io/word2vecjson/
22. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR, abs/1301.3781 (2013)
23. Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. Association for Computational Linguistics, May 2013
24. Moody, C.: A word is worth a thousand vectors (2015). https://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
25. Nicholson, C.: A beginner's guide to word2vec and neural word embeddings. https://skymind.ai/wiki/word2vec
26. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014, pp. 1532–1543 (2014)
27. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
28. Shima, H.: WS4J demo. http://ws4jdemo.appspot.com/
29. spaCy: Word vectors and semantic similarity (2016–2019). https://spacy.io/usage/vectors-similarity
30. username DaveTheAl: Best practical algorithm for sentence similarity (2017). https://datascience.stackexchange.com/questions/25053/best-practical-algorithm-for-sentence-similarity
31. username whs2k: How is the similarity method in spaCy computed (2017). https://stats.stackexchange.com/questions/304217/how-is-the-similarity-method-in-spacy-computed
32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008. Curran Associates, Inc. (2017)
33. Wilson, T., Kozareva, Z., Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V.: Sentiment analysis in Twitter (2013). http://www.cs.york.ac.uk/semeval-2013/task2/
34. Wingate, D., Myers, W., Fulda, N., Etchart, T.: Embedding grammars. arXiv preprint arXiv:1808.04891 (2018)
35. Wingate, D., Myers, W., Fulda, N., Etchart, T.: Embedding grammars (2018)

Women in ISIS Propaganda: A Natural Language Processing Analysis of Topics and Emotions in a Comparison with a Mainstream Religious Group

Mojtaba Heidarysafa, Kamran Kowsari, Tolu Odukoya, Philip Potter, Laura E. Barnes, and Donald E. Brown

Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA, USA; School of Data Science, University of Virginia, Charlottesville, VA, USA; Department of Politics, University of Virginia, Charlottesville, VA, USA
[email protected]

Abstract. Online propaganda is central to the recruitment strategies of extremist groups and in recent years these efforts have increasingly extended to women. To investigate Islamic State's approach to targeting women in their online propaganda and uncover implications for counterterrorism, we rely on text mining and natural language processing (NLP). Specifically, we extract articles published in Dabiq and Rumiyah (Islamic State's online English language publications) to identify prominent topics. To identify similarities or differences between these texts and those produced by non-violent religious groups, we extend the analysis to articles from a Catholic forum dedicated to women. We also perform an emotional analysis of both of these resources to better understand the emotional components of propaganda. We rely on Depechemood (a lexicon-based emotion analysis method) to detect emotions most likely to be evoked in readers of these materials. The findings indicate that the emotional appeals of ISIS and Catholic materials are similar.

Keywords: ISIS propaganda · Topic modeling · Emotion detection · Natural language processing

1 Introduction

Since its rise in 2013, the Islamic State of Iraq and Syria (ISIS) has utilized the Internet to spread its ideology, radicalize individuals, and recruit them to their cause. In comparison to other Islamic extremist groups, Islamic State's use of technology was more sophisticated, voluminous, and targeted. For example, during its advance toward Mosul, ISIS related accounts tweeted some 40,000 tweets in one day [6]. However, this heavy engagement forced social media platforms to institute policies to prevent unchecked dissemination of terrorist propaganda to their users, forcing ISIS to adapt to other means to reach their target audience.


One such approach was the publication of online magazines in different languages including English. Although discontinued now, these online resources provided a window into ISIS ideology, recruitment, and how they wanted the world to perceive them. For example, after predominantly recruiting men, ISIS began to also include articles in their magazines that specifically addressed women. ISIS encouraged women to join the group by either traveling to the caliphate or by carrying out domestic attacks on behalf of ISIS in their respective countries. This tactical change concerned both practitioners and researchers in the counterterrorism community. New advancements in data science can shed light on exactly how the targeting of women in extremist propaganda works and how it differs from mainstream religious rhetoric. We utilize natural language processing methods to answer three questions:

– What are the main topics in women-related articles in Islamic State's online magazines?
– What similarities and/or differences do these topics have with non-violent, non-Islamic religious material addressed specifically to women?
– What kind of emotions do these articles evoke in their readers, and are there similarities in the emotions evoked in the religious materials of non-violent religious organizations?

To understand what, if anything, makes extremist appeals distinctive, we need a point of comparison with outreach efforts to women from a mainstream, non-violent religious group. For this purpose, we rely on an online Catholic women's forum. Comparison between Catholic material and the content of Islamic State's online magazines allows for novel insight into the distinctiveness of extremist rhetoric targeted towards women. To accomplish this task, we employ topic modeling and an unsupervised emotion detection method. The remainder of the paper is organized as follows: in Sect. 2, we review related works on ISIS propaganda and applications of natural language methods. Section 3 describes data collection and pre-processing. Section 4 describes in detail the approach. Section 5 reports the results, and finally, Sect. 6 presents the conclusion.

2 Related Work

Soon after ISIS emerged and declared its caliphate, counterterrorism researchers and practitioners turned their attention towards understanding how the group operated. Researchers investigated the origins of ISIS, its leadership, funding, and how it rose to become a globally dominant non-state actor [12]. This interest in the organization's distinctiveness immediately led to inquiries into Islamic State's rhetoric, and particularly their use of social media and online resources in recruitment and ideological dissemination. For example, Al-Tamimi examines how ISIS differentiated itself from other jihadist movements by using social media with unprecedented efficiency to improve its image with locals [2]. One of Islamic State's most impressive applications of its online prowess was in the recruitment process. The organization has used a variety of materials, especially videos, to recruit both foreign and local fighters. Research shows that ISIS propaganda is designed to portray the organization as a provider of justice, governance, and development in a fashion that resonates with young westerners [7]. This propaganda machine has become a significant subject of research, with scholars such as Winter identifying key themes such as brutality, mercy, victimhood, war, belonging and utopianism [26]. However, there has been insufficient attention focused on how these approaches have particularly targeted and impacted women. This is significant given that scholars have identified the distinctiveness of this population when it comes to nearly all facets of terrorism.

For a significant period of time, Twitter was an effective platform for ISIS recruitment and radicalization. The Arabic Twitter app allowed ISIS to tweet extensively without triggering spam-detection mechanisms [6]. Scholars followed the resulting trove of data and this became the preeminent way by which to assess ISIS messages. For example, Bodine-Baron et al. used both lexical analysis of tweets as well as social network analysis to examine ISIS support or opposition on Twitter [4]. Other researchers used data mining techniques to detect pro-ISIS user divergence behavior at various points in time [18]. Text mining and lexical analysis allowed the research community to analyze the large mass of unstructured data produced by ISIS. This approach, however, became less productive as the social media networks began cracking down and ISIS recruiters moved off of them.

With their ability to operate freely on social media now curtailed, ISIS recruiters and propagandists increased their attentiveness to another longstanding tool: English-language online magazines targeting western audiences. Al Hayat, the media wing of ISIS, published multiple online magazines in different languages including English. The first ISIS English online magazine, Dabiq, first appeared on the dark web in July 2014 and continued publishing for 15 issues. This publication was followed by Rumiyah, which produced 13 English-language issues through September 2017. The content of these magazines provides a valuable but underutilized resource for understanding ISIS strategies and how they appeal to recruits, specifically English-speaking audiences. They also provide a way to compare Islamic State's approach with other radical groups. Ingram compared Dabiq contents with Inspire (an al Qaeda publication) and suggested that al Qaeda heavily emphasized identity-choice, while Islamic State's messages were more balanced between identity-choice and rational-choice [8]. In another research paper, Wignell et al. [25] compared Dabiq and Rumiyah by examining their style and what both magazines' messages emphasized. Despite the volume of research on these magazines, only a few researchers used lexical analysis, relying instead on experts' opinions. Vergani and Bliuc are an exception; they used word frequency on 11 issues of Dabiq publications and compared attributes such as anger, anxiety, power, and motive [23].


This paper seeks to establish how ISIS specifically tailored propaganda targeting western women, who became a particular target for the organization as the "caliphate" expanded. Although the number of recruits is unknown, in 2015 it was estimated that around 10 percent of all western recruits were female [15]. Some researchers have attempted to understand how ISIS propaganda targets women. Kneip, for example, analyzed women's desire to join as a form of emancipation [9]. We extend that line of inquiry by leveraging technology to answer key outstanding questions about the targeting of women in ISIS propaganda. To further assess how ISIS propaganda might affect women, we used emotion detection methods on these texts. Emotion detection techniques can be divided into lexicon-based and machine learning-based methods. Lexicon-based methods rely on several lexicons, while machine learning (ML) methods use algorithms, usually trained on a large corpus, that map texts (as inputs) to emotions (as the target). Unsupervised methods usually use Non-negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA) [5] approaches. An important distinction that should be made when using text for emotion detection is that the emotion detected in the text and the emotion evoked in the reader of that text might differ. In the case of propaganda, it is more desirable to detect possible emotions that will be evoked in a hypothetical reader. In the next section, we describe methods to analyze content and techniques to find evoked emotions in a potential reader using available natural language processing tools.

3 Data Collection and Pre-processing

3.1 Data Collection

Finding useful collections of texts where ISIS targets women is a challenging task. Most of the available material does not reflect Islamic State's official point of view or does not talk specifically about women. However, Islamic State's online magazines are valuable resources for understanding how the organization attempts to appeal to western audiences, particularly women. Looking through both Dabiq and Rumiyah, many issues of the magazines contain articles specifically addressing women, usually with "to our sisters" incorporated into the title. Seven out of fifteen Dabiq issues and all thirteen issues of Rumiyah contain articles explicitly targeting women, clearly suggesting an increase in attention to women over time. We converted all the ISIS magazines to text using PDF readers, and all articles that addressed women in both magazines (20 articles) were selected for our analysis. To facilitate comparison with a mainstream, non-violent religious group, we collected articles from catholicwomensforum.org, an online resource catering to Catholic women. We scraped 132 articles from this domain. While this number is larger, the articles themselves are much shorter than those published by ISIS. These texts were pre-processed by tokenizing the sentences and eliminating non-word tokens and punctuation marks. All words were lower-cased, and numbers and English stop words such as "our, is, did, can, etc." were removed from the produced tokens. For the emotion analysis part, we used the spaCy library for part-of-speech tagging to identify the exact role of each word in the sentence. A word together with its role was then used to look up the emotional values of that word in the same role in the sentence.

3.2 Pre-processing

Text Cleaning and Pre-processing. Most text and document datasets contain many unnecessary words such as stopwords, misspellings, slang, etc. In many algorithms, especially statistical and probabilistic learning algorithms, noise and unnecessary features can have adverse effects on system performance. In this section, we briefly explain some techniques and methods for cleaning and pre-processing text datasets [10].

Tokenization. Tokenization is a pre-processing method which breaks a stream of text into words, phrases, symbols, or other meaningful elements called tokens [24]. The main goal of this step is to investigate the words in a sentence [24]. Both text classification and text mining require a parser which processes the tokenization of the documents; for example, consider the sentence [1]: "After sleeping for four hours, he decided to sleep for another four." In this case, the tokens are as follows: {"After", "sleeping", "for", "four", "hours", "he", "decided", "to", "sleep", "for", "another", "four"}.

Stop Words. Text and document classification includes many words which do not hold important significance for classification algorithms, such as {"a", "about", "above", "across", "after", "afterwards", "again", ...}. The most common technique to deal with these words is to remove them from the texts and documents [19].

Term Frequency-Inverse Document Frequency. K. Sparck Jones [20] proposed inverse document frequency (IDF) as a method to be used in conjunction with term frequency in order to lessen the effect of implicitly common words in the corpus. IDF assigns a higher weight to words with either high or low term frequency in the document. This combination of TF and IDF is well known as term frequency-inverse document frequency (tf-idf). The mathematical representation of the weight of a term in a document by tf-idf is given in Eq. 1:

W(d, t) = TF(d, t) · log(N / df(t))    (1)

Here N is the number of documents and df(t) is the number of documents containing the term t in the corpus. The first term in Eq. 1 improves the recall while the second term improves the precision of the word embedding [22]. Although tf-idf tries to overcome the problem of common terms in the document, it still suffers from some other descriptive limitations. Namely, tf-idf cannot account for the similarity between the words in the document since each word is independently presented as an index. However, with the development of more complex models in recent years, new methods, such as word embedding, have been presented that can incorporate concepts such as similarity of words and part of speech tagging.
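As an illustration of the weighting scheme in Eq. 1, the following sketch computes tf-idf features with scikit-learn. It is a generic example rather than the exact configuration used in this study; scikit-learn applies its own smoothing and normalization on top of the basic formula.

```python
# Illustrative tf-idf computation with scikit-learn (Eq. 1 up to the smoothing
# and normalization handled internally by TfidfVectorizer).
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "After sleeping for four hours, he decided to sleep for another four.",
    "The articles address women and their role in the community.",
    "Online magazines were published in several languages.",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(documents)           # document-term matrix
print(vectorizer.get_feature_names_out()[:10])        # first few vocabulary terms
print(tfidf.shape)                                    # (n_documents, n_terms)
```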

4 Method

In this section, we describe the methods used for comparing topics and evoked emotions in both ISIS and non-violent religious materials.

4.1 Content Analysis

The key task in comparing ISIS material with that of a non-violent group involves analyzing the content of these two corpora to identify the topics. For our analysis, we considered a simple uni-gram model where each word is considered as a single unit. Understanding which words appear most frequently provides a simple metric for comparison. To do so, we normalized the count of words by the number of words in each corpus to account for the size of each corpus. It should be noted, however, that a drawback of word frequencies is that a few dominant words may overshadow all the other content without conveying much information.

Topic modeling methods are a more powerful technique for understanding the contents of a corpus. These methods try to discover abstract topics in a corpus and reveal hidden semantic structures in a collection of documents. The most popular topic modeling methods use probabilistic approaches such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA). LDA is a generalization of pLSA where documents are considered as a mixture of topics and the distribution of topics is governed by a Dirichlet prior (α). Figure 1 shows the plate notation of the general LDA structure, where β represents the prior of the word distribution per topic and θ refers to the topic distribution for documents [3]. Since LDA is among the most widely utilized algorithms for topic modeling, we applied it to our data. However, the coherence of the topics produced by LDA was poorer than expected. To address this lack of coherence, we applied non-negative matrix factorization (NMF). This method decomposes the term-document matrix into two non-negative matrices, as shown in Fig. 2, such that their product closely approximates the original data. Mathematically speaking, given an input document-term matrix V, NMF finds two matrices by solving the following equation [13]:

min_{W,H} ||V − WH||_F    s.t.    H ≥ 0, W ≥ 0

where W is the topic-word matrix and H represents the topic-document matrix.


Fig. 1. Plate notation of LDA model

Fig. 2. NMF decomposition of document-term matrix [11]

NMF appears to provide more coherent topics on specific corpora. O'Callaghan et al. compared LDA with NMF and concluded that NMF performs better on corpora from specific and non-mainstream domains [14]. Our findings align with this assessment, and thus our comparison of topics is based on NMF.
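A minimal sketch of this NMF topic-modeling step with scikit-learn, applied to tf-idf features, is shown below. The number of topics (10) follows the paper, but the corpus loading and remaining parameters are illustrative assumptions; note also that scikit-learn's naming of the factor matrices differs from the convention used in the equation above.

```python
# Minimal NMF topic-modeling sketch: V ≈ W H on a tf-idf document-term matrix.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_per_topic(documents, n_topics=10, n_top_words=20):
    vectorizer = TfidfVectorizer(stop_words="english")
    V = vectorizer.fit_transform(documents)            # document-term matrix
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    doc_topic = nmf.fit_transform(V)                   # per-document topic weights
    topic_term = nmf.components_                       # per-topic term weights
    terms = vectorizer.get_feature_names_out()
    topics = []
    for topic in topic_term:
        top = topic.argsort()[::-1][:n_top_words]      # indices of strongest terms
        topics.append([terms[i] for i in top])
    return topics

# usage: topics = top_terms_per_topic(list_of_articles)
```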

4.2 Emotion Detection

Propaganda effectiveness hinges on the emotions that it elicits. But detecting emotion in text requires that two essential challenges be overcome. First, emotions are generally complex, and emotional representation models are correspondingly contested. Despite this, some models proposed by psychologists have gained widespread usage that extends to text-emotion analysis. Robert Plutchik presented a model that arranges emotions from basic to complex in a circumplex, as shown in Fig. 3. The model categorizes emotions into 8 main subsets and, with the addition of intensity and interactions, classifies emotions into 24 classes [17]. Other models have been developed to capture all emotions by defining a 3-dimensional model of pleasure, arousal, and dominance.


The second challenge lies in using text to detect the emotion evoked in a potential reader. Common approaches use either lexicon-based methods (such as keyword-based or ontology-based models) or machine learning-based models (usually using a large corpus with labeled emotions) [5]. These methods are suited to addressing the emotions that exist in the text, but in the case of propaganda we are more interested in the emotions that are elicited in the reader of such materials. The closest analogy to this problem can be found in research that seeks to model the feelings of people after reading a news article. One solution for this type of problem is to use an approach called Depechemood. Depechemood is a lexicon-based emotion detection method gathered from crowd-annotated news [21]. Drawing on approximately 23.5K documents with an average of 500 words per document from rappler.com, researchers asked subjects to report their emotions after reading each article. They then multiplied the document-emotion matrix and the word-document matrix to derive an emotion-word matrix for these words. Due to limitations of their experimental setup, the emotion categories that they present do not exactly match the categories of the Plutchik wheel. However, they still provide a good sense of the general feeling of an individual after reading an article. The emotion categories of Depechemood are: AFRAID, AMUSED, ANGRY, ANNOYED, DON'T CARE, HAPPY, INSPIRED, SAD. Depechemood simply creates dictionaries of words where each word has scores between 0 and 1 for all of these 8 emotion categories. We present our findings using this approach in the results section.
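The sketch below illustrates the kind of lexicon-based scoring described above: words are POS-tagged with spaCy and looked up in a word–emotion score table, and the per-article score is an average over matched words. The tiny inline lexicon is a hypothetical stand-in for the actual DepecheMood resource, and the mean-based aggregation is one reasonable choice rather than the paper's exact procedure.

```python
# Sketch of lexicon-based evoked-emotion scoring in the spirit of DepecheMood.
# The inline `lexicon` is a toy stand-in for the real word/POS -> emotion table.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

EMOTIONS = ["AFRAID", "AMUSED", "ANGRY", "ANNOYED", "DONT_CARE",
            "HAPPY", "INSPIRED", "SAD"]

# toy lexicon: (lemma, coarse POS tag) -> {emotion: score in [0, 1]}
lexicon = {
    ("merciful", "ADJ"): {"INSPIRED": 0.9, "HAPPY": 0.6},
    ("danger", "NOUN"): {"AFRAID": 0.8, "SAD": 0.3},
}

def emotion_scores(text):
    totals = defaultdict(float)
    matched = 0
    for token in nlp(text):
        entry = lexicon.get((token.lemma_.lower(), token.pos_))
        if entry:
            matched += 1
            for emo, score in entry.items():
                totals[emo] += score
    # average over matched words; emotions with no matches default to 0
    return {emo: (totals[emo] / matched if matched else 0.0) for emo in EMOTIONS}

print(emotion_scores("The merciful leader warned of danger ahead."))
```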

Fig. 3. 2D representation of Plutchik wheel of emotions [16]

[Table 1. NMF topics of women in ISIS: ten topics, each represented by its top 20 terms, with manually assigned labels: Early islam women, Islam/khilafah, Marriage, Islamic praying, Women's life, Hijrah, Islamic, Divorce, Motherhood, Spousal relationship.]

[Table 2. NMF topics of women in catholic forum: ten topics, each represented by its top 20 terms, with manually assigned labels: Feminism, Law, Gender identity, Divorce, Church, Motherhood, Birth control, Life, Sexuality, Parenting.]


Fig. 4. Word frequency of most common words in the Catholic corpus

Fig. 5. Word frequency of most common words in the Dabiq corpus

5 Results

In this section, we present the results of our analysis based on the contents of ISIS propaganda materials as compared to articles from the Catholic women forum. We then present the results of emotion analysis conducted on both corpora.

5.1 Content Analysis

After pre-processing the text, both corpora were analyzed for word frequencies. These word frequencies were normalized by the number of words in each corpus. Figures 4 and 5 show the most common words in each of these corpora. A comparison of common words suggests that those related to marital relationships (husband, wife, etc.) appear in both corpora, but the religious theme of the ISIS material appears to be stronger. A stronger comparison can be made using topic modeling techniques to discover the main topics of these documents.
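A normalized word-frequency comparison of this kind can be sketched as follows; the tokenization and stop-word handling here are simplified assumptions rather than the exact pipeline used for Figs. 4 and 5.

```python
# Sketch: relative word frequencies (counts normalized by corpus size).
import re
from collections import Counter

def relative_frequencies(texts, top_n=15):
    tokens = []
    for text in texts:
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_n)]

# usage: compare relative_frequencies(isis_articles) with
#        relative_frequencies(catholic_articles)
```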


Although we also applied LDA, the NMF results outperform the LDA topics, due to the nature of these corpora. The smaller number of ISIS documents might also contribute to LDA's comparatively worse performance. Therefore, we present only the NMF results. Based on their coherence, we selected 10 topics for analysis within both corpora. Table 1 and Table 2 show the most important words in each topic, with a general label that we assigned to each topic manually. Based on the NMF output, ISIS articles that address women include topics mainly about Islam, women's role in early Islam, hijrah (moving to another land), spousal relations, marriage, and motherhood. The topics generated from the Catholic women forum are clearly quite different. Some, however, exist in both contexts. More specifically, marriage/divorce, motherhood, and to some extent spousal relations appeared in both sets of generated topics. This suggests that, when addressing women in a religious context, these topics may be broadly effective in appealing to a female audience. More importantly, suitable topic modeling methods are able to identify these similarities no matter the size of the corpus we are working with. Although finding the similarities/differences between topics in these two groups of articles might provide some new insights, we turn to emotional analysis to also compare the emotions evoked in the audience.

5.2 Emotion Analysis

We rely on the Depechemood dictionaries to analyze emotions in both corpora. These dictionaries are freely available and come in multiple arrangements. We used a version that includes words with their part-of-speech (POS) tags. Only words that exist in the Depechemood dictionary with the same POS tag are considered for our analysis. We aggregated the score for each word and normalized each article by emotion. To better contextualize the result, we added a baseline of 100 random articles from the Reuters news dataset, a non-religious general resource available in the NLTK Python library. Figure 6 shows the aggregated score for different feelings in our corpora. Both the Catholic and the ISIS-related materials score highest in the "inspired" category. Furthermore, in both cases, being afraid has the lowest score. However, this is not the case for random news material such as the Reuters corpus, which is not that inspiring and, according to this method, seems to cause more fear in its audience. We investigate these results further by looking at the most inspiring words detected in these two corpora. Table 3 presents 10 words that are among the most inspiring in each corpus. The comparison of the two lists indicates that the method picks very different words in each corpus to reach the same conclusion. We also looked at the separate articles in each of the 20 issues of ISIS material addressing women. Figure 7 shows the emotion scores for each of these issues. As demonstrated, in every separate article this method gives the highest score to evoking inspiration in the reader, and in most of these issues "being afraid" receives the lowest score.


Fig. 6. Comparison of emotions of both our corpora along with Reuters news

Fig. 7. Feelings detected in ISIS magazines (first 7 issues belong to Dabiq and last 13 belong to Rumiyah)

Table 3. Words with highest inspiring scores

          Catholic          ISIS
Word1     Avarice           Uprightness
Word2     Perceptive        Memorization
Word3     Educationally     Merciful
Word4     Stereotypically   Affliction
Word5     Distrustful       Gentleness
Word6     Reverence         Masjid
Word7     Unbounded         Verily
Word8     Antichrist        Sublimity
Word9     Loneliness        Recompense
Word10    Feelings          Fierceness

In this paper, we have applied natural language processing methods to ISIS propaganda materials in an attempt to understand these materials using available technologies. We also compared these texts with a non-violent religious groups’ (both focusing on women related articles) to examine possible similarities or differences in their approaches. To compare the contents, we used word frequency and topic modeling with NMF. Also, our results showed that NMF outperforms LDA due to the niche domain and relatively small number of documents. The results suggest that certain topics play particularly important roles in ISIS propaganda targeting women. These relate to the role of women in early Islam, Islamic ideology, marriage/divorce, motherhood, spousal relationships, and hijrah (moving to a new land). Comparing these topics with those that appeared on a Catholic women forum, it seems that both ISIS and non-violent groups use topics about motherhood, spousal relationship, and marriage/divorce when they address women. Moreover, we used Depechemood methods to analyze the emotions that these materials are likely to elicit in readers. The result of our emotion analysis suggests that both corpuses used words that aim to inspire readers while avoiding fear. However, the actual words that lead to these effects are very different in the two contexts. Overall, our findings indicate that, using proper methods, automated analysis of large bodies of textual data can provide novel insight insight into extremist propaganda that can assist the counterterrorism community.

References

1. Aggarwal, C.C.: Machine Learning for Text. Springer, Heidelberg (2018)
2. Al-Tamimi, A.J.: The dawn of the Islamic State of Iraq and ash-Sham. Curr. Trends Islamist Ideol. 16, 5 (2014)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Bodine-Baron, E., Helmus, T.C., Magnuson, M., Winkelman, Z.: Examining ISIS support and opposition networks on Twitter. Technical report, RAND Corporation, Santa Monica, United States (2016)
5. Canales, L., Martínez-Barco, P.: Emotion detection from text: a survey. In: Proceedings of the Workshop on Natural Language Processing in the 5th Information Systems Research Working Days, JISIC, pp. 37–43 (2014)
6. Farwell, J.P.: The media strategy of ISIS. Survival 56(6), 49–55 (2014)
7. Gates, S., Podder, S.: Social media, recruitment, allegiance and the Islamic State. Perspect. Terrorism 9(4), 107–116 (2015)
8. Ingram, H.J.: An analysis of Inspire and Dabiq: lessons from AQAP and Islamic State's propaganda war. Stud. Confl. Terrorism 40(5), 357–375 (2017)
9. Kneip, K.: Female jihad – women in the ISIS. Politikon 29, 88–106 (2016)
10. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.: Text classification algorithms: a survey. Information 10(4), 150 (2019)
11. Kuang, D., Brantingham, P.J., Bertozzi, A.L.: Crime topic modeling. Crime Sci. 6(1), 12 (2017)
12. Laub, Z., Masters, J.: Islamic State in Iraq and greater Syria. The Council on Foreign Relations (2014)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788 (1999)
14. O'Callaghan, D., Greene, D., Carthy, J., Cunningham, P.: An analysis of the coherence of descriptors in topic modeling. Expert Syst. Appl. 42(13), 5645–5657 (2015)
15. Perešin, A.: Fatal attraction: western muslimas and ISIS. Perspect. Terror. 9(3), 21–38 (2015)
16. Plutchik, R.: A general psychoevolutionary theory of emotion. In: Plutchik, R., Kellerman, H. (eds.) Theories of Emotion, pp. 3–33. Academic Press, Cambridge (1980)
17. Plutchik, R.: The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. Am. Sci. 89(4), 344–350 (2001)
18. Rowe, M., Saif, H.: Mining pro-ISIS radicalisation signals from social media users. In: Tenth International AAAI Conference on Web and Social Media (2016)
19. Saif, H., Fernández, M., He, Y., Alani, H.: On stopwords, filtering and data sparsity for sentiment analysis of Twitter (2014)
20. Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28(1), 11–21 (1972)
21. Staiano, J., Guerini, M.: Depechemood: a lexicon for emotion analysis from crowd-annotated news. arXiv preprint arXiv:1405.1605 (2014)
22. Tokunaga, T., Makoto, I.: Text categorization based on weighted inverse document frequency. In: Special Interest Groups and Information Process Society of Japan (SIG-IPSJ) (1994)
23. Vergani, M., Bliuc, A.M.: The evolution of the ISIS' language: a quantitative analysis of the language of the first year of Dabiq magazine. Sicurezza, Terrorismo e Società 2(2), 7–20 (2015)
24. Verma, T., Renu, R., Gaur, D.: Tokenization and filtering process in RapidMiner. Int. J. Appl. Inf. Syst. 7(2), 16–18 (2014)
25. Wignell, P., Tan, S., O'Halloran, K., Lange, R.: A mixed methods empirical examination of changes in emphasis and style in the extremist magazines Dabiq and Rumiyah. Perspect. Terror. 11(2), 2–20 (2017)
26. Winter, C.: The Virtual 'Caliphate': Understanding Islamic State's Propaganda Strategy, vol. 25. Quilliam, London (2015)

Improvement of Automatic Extraction of Inventive Information with Patent Claims Structure Recognition

Daria Berduygina and Denis Cavallucci

INSA of Strasbourg, Strasbourg, France
{daria.berdyugina,denis.cavallucci}@insa-strasbourg.fr

Abstract. Our recent research produced methods for the automatic extraction of inventive information from patents thanks to the use of NLP, notably automatic text processing. However, these methods have drawbacks due to a high amount of noise (duplicates, errors) in the output, which prevents the further use of the TRIZ methodology. In the meantime, we observed that patent claims are the most important source of inventive information. These text paragraphs nevertheless have a dual nature (combining legal and technical vocabulary), and this nature engenders part of the observed noise. We postulate that taking into consideration the hierarchical structure of claims and its structural information can reduce the extraction time and refine the quality of the final output, which is the principal aim of this paper. In this paper, we report on the methodology we have employed, based on patent claim structure recognition, as a way to address our objectives.

Keywords: TRIZ · Text mining · Natural Language Processing (NLP)

1 Introduction

Today, progress comes fast. For this reason, engineers and scientists aim to find creative ideas that can lead to inventions. To help them, researchers have developed methods that facilitate the inventive process, such as Brainstorming [1], the Delphi method [2] and Synectics [3]. TRIZ (the Theory of Inventive Problem Solving) [4] began to be developed and adopted in the 1990s with the aim of making the inventive process easier and faster. This theory has earned its place among creativity techniques as an effective method which can be applied in all areas of human activity. However, the classic TRIZ methods are difficult to understand because of the absence of a formalized ontology. A further drawback is the difficulty of performing any computation on its abstract concepts. The IDM (Inventive Design Methodology) was created by our laboratory to overcome these limitations of TRIZ. In the IDM ontology, the core elements for defining a problem situation and a solution consist mainly of three concepts: problems, partial solutions and parameters. We aim to extract these three concepts to automate the problem-solving process.

Patents represent an abundant source of information related to IDM. By examining patents, we can learn about technological advances over time and, more significantly, the technological challenges and solutions that have been invented by specialists and engineers in the area. Given the importance of patents as a source of information, a number of academic research and patent exploitation activities have been carried out in recent years. Nowadays, the number of patent applications is increasing, so it is mandatory to use adequate methods and processing tools, because this can lead to better results in any patent-related activity. NLP (Natural Language Processing) techniques adapted to the distinctiveness of the patent field are promising for enhancing the quality of patent document processing. It is known that patents have extremely long and complex sentence structures with a peculiar style. This feature is due to the dual nature of patent text, which is at the same time a legal and a technical document aiming to protect the inventor and identify the boundaries of the invention. The use of an NLP analyzer for patents (with morphological, syntactic or semantic modules) is therefore an essential goal. The overall task of patent analysis is to find repeatable inventive steps that can be applied to new problems. During the last few years, our team has constructed such a tool for the automatic extraction of IDM-related knowledge from English-language patents. However, our tool does not take into account the hierarchical structure of patent text, notably the structure of patent claims. This is one of the reasons that our tool produces a lot of noise at the output. Adequate automatic treatment of this part of the patent document could be a rich source of IDM-related knowledge; yet, because of the difficulties of processing it, there is no efficient technique to extract this precious knowledge out of claims.

The dual function of a patent document is reflected in its two central parts. Firstly, the Description defines the invention; secondly, the Claims 'define the matter for which protection is sought' [5]. The description is written in a style similar to scientific papers. It may contain examples that help engineers understand the content. The claims are the central point of the patent disclosure. They describe the essential properties of the invention and are subject to legal protection. This part is usually written by specialized patent agents for other patent experts; for this reason, legal language is used. A detailed analysis of claim structure shows that claims refer to each other. Referencing is a main feature of a legal document which aims to elucidate all aspects of the invention in the face of legislation. A simple inspection makes the hierarchical structure of claims obvious: there are independent claims explaining the general characteristics of the invention, and dependent claims further clarifying what has already been claimed. A dependent claim may refer to one or more previous claims. For clarification purposes, an example would be:

1. 'In a fluid transport hose comprising […]; a passage composed of […].
2. The transport hose according to claim 1, wherein said resilient body is helically formed and said channel is formed between adjacent turns of said helically formed resilient body.
3. […]
4. […]
5. The transport hose according to claim 3, wherein said flexible bag is provided under its deflated and folded condition with a coloring agent sandwiched between two outside folded surfaces of the folded flexible bag.' [6]


As noted above, the set of claims forms a hierarchical structure, which can be represented as a directed graph. As shown in Fig. 1 (the structure of the claims of patent US4259553A [6], generated by our claims-analysis code), the directed graph represents the hierarchical structure of a set of 18 patent claims, 11 of which are dependent. We can observe that claims 1–9 form one group, while claims 10–12 and 13–18 form other groups. This is a relatively simple structure, but patent documents can have much more complex structures, since the number of claims can exceed 30.

Fig. 1. Structure of the claims of patent US4259553A [6], generated by our claims-analysis code.

Taking this structure into account could improve the extraction quality of our tool in terms of noise reduction: restricting the information retrieval algorithms to a single group of claims could drastically reduce the noise produced during extraction. In this article, we first give an overview of the IDM method and its tool for automatic knowledge extraction from patent documents, together with a literature review on patent-claim structure recognition and other methods for processing claim text (State of the Art). We then briefly describe the tool for automatic extraction of IDM concepts from patent texts recently built by our laboratory (Extraction Tool). Finally, we describe our methodology for improving IDM-related information extraction (Methodology) and present the results of our experimental work (Evaluation and Implementation).
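To make the graph representation of Fig. 1 concrete, the following is a minimal Python sketch of a claims dependency graph. It assumes the dependencies have already been identified as (dependent, parent) pairs and uses the networkx library; the only dependencies shown are the ones quoted from US4259553A above (claim 2 refers to claim 1, claim 5 to claim 3), the rest of the setup is illustrative.

```python
# Minimal sketch: a patent's claims as a directed graph.
# Assumes dependencies were already extracted as (dependent_claim, parent_claim) pairs;
# only the two dependencies quoted from US4259553A (2 -> 1, 5 -> 3) are used here.
import networkx as nx

def build_claim_graph(num_claims, dependencies):
    """One node per claim, one edge from each dependent claim to the claim it refers to."""
    graph = nx.DiGraph()
    graph.add_nodes_from(range(1, num_claims + 1))
    graph.add_edges_from(dependencies)
    return graph

if __name__ == "__main__":
    deps = [(2, 1), (5, 3)]                     # dependent -> parent, from the example above
    g = build_claim_graph(5, deps)
    # Independent claims are those that refer to no other claim.
    independent = [n for n in g.nodes if g.out_degree(n) == 0]
    print("independent claims:", independent)   # [1, 3, 4]
```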

2 State of the Art
In this section, we describe the IDM method and its tool for automatic knowledge extraction from patent documents, and we review the literature on patent-claim structure recognition and other methods for processing claim text.


2.1 The Inventive Design Method
For our goal of extracting IDM-related information, we first have to define the main notions and basic statements of this method. The theory developed by Genrich Altshuller, TRIZ, is the basis for a significant part of the work carried out by the CSIP team. This theory rests on four fundamental elements [7]. The Inventive Design Method (IDM), based on TRIZ, extends the limits of the grounding theory. The IDM describes the four steps necessary for the problem-solving process. The first step consists of extracting the information and organizing it into a graphical form comprising 'problems' and 'partial solutions'. The second step involves using this graph to formulate a list of contradictions according to a specific model. The third step is the individual solving of each key contradiction. Finally, the fourth step is to select, using statistics and evaluation by engineers, the most suitable Solution Concept [8]. To extend the limits of TRIZ and to make the theory useful for industrial innovation, the IDM proposes a practical definition of the notion of contradiction [9]. According to this definition, a contradiction is "[…] characterized by a set of three parameters and where one of the parameters can take two possible opposite values Va and ¬Va" [9]. It is therefore important to distinguish the action parameter (AP) and the evaluation parameter (EP). The first one, the AP, "[…] is characterized by the fact that it has a positive effect on another parameter when its value tends to Va and that it has a negative effect on another parameter when its value tends to ¬Va (that is, in the opposite direction)" [9]. The two other parameters in the contradiction definition are EPs, which "[…] can evolve under the influence of one or more action parameters" and which make it possible to "evaluate the positive aspect of a choice made by the designer" [9]. For clarification purposes, we give a possible formulation of the model of contradiction according to the IDM postulates (Fig. 2).

Fig. 2. Possible representation model of contradictions (physical and technical) [9]

Understanding the way contradictions are formulated helps in the process of information retrieval and information extraction. The information that we aim to extract from patent text comprises four elements: problems, partial solutions, APs and EPs and, where possible, their Va and ¬Va values. Thanks to the research on IDM, we have a basic definition of the notions of problem and partial solution (how they should be represented syntactically and graphically and which information they should contain). The following schemas show the graphical representation of a problem (Fig. 3) and of a partial solution (Fig. 4).
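As an illustration only, the contradiction model above (one action parameter with two opposite values, and the evaluation parameters it affects) could be represented with a small data structure such as the following sketch; the class, field names and the "wall thickness" example are hypothetical and not part of the IDM tools.

```python
# Illustrative sketch of the IDM contradiction model: one action parameter (AP)
# with two opposite values (Va / not-Va) and the evaluation parameters (EPs)
# that improve or degrade when the AP tends to Va. All names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Contradiction:
    action_parameter: str                                        # AP
    value_a: str                                                 # Va
    value_not_a: str                                             # opposite value (not-Va)
    improved_when_a: List[str] = field(default_factory=list)     # EPs improved when AP -> Va
    degraded_when_a: List[str] = field(default_factory=list)     # EPs degraded when AP -> Va

example = Contradiction(
    action_parameter="wall thickness",
    value_a="large",
    value_not_a="small",
    improved_when_a=["strength"],
    degraded_when_a=["weight"],
)
print(example)
```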


Fig. 3. Graphical representation of a problem [10]

Fig. 4. Graphical representation of a partial solution [10]

A problem (Fig. 3) "describes a situation where an obstacle prevents progress, an advance or the achievement of what has to be done" [10]. A partial solution (Fig. 4) "expresses a result that is known in the domain and verified by experience" [10].
2.2 Extraction from Patent Texts
Patent texts are an important source of IDM-related information. However, this type of text presents a challenge for NLP applications because of its dual nature (legal and technical at the same time) [11]. The technical knowledge issued from patent documents is rare, and it is difficult to find innovative information of the same quality elsewhere, for instance in scientific papers [12]. A basic inventive principle of TRIZ and IDM relies on the fact that an inventive solution can be found in another domain thanks to analogy; i.e., to find a solution for a problem, it is necessary to search for an analogous solved problem belonging to another domain. Such analogies can be found in patent texts, because this type of text describes available inventive solutions. However, searching for the required patents and reading a mass of texts, even for professionals, is a time-consuming process. To save time, our team built a tool [13] that extracts from a patent database the IDM-related information selected by the user, for English-language patents (see Sect. 3). This tool can also construct a graph of problems, solutions and parameters, which helps the user understand their problem through contradiction representations and find an appropriate solution from the same or even from a distant domain [7]. However, this tool has drawbacks, notably noisy extraction. To evaluate the state of the tool before starting, the authors analyzed the quality of its extraction: we took 20 patents at random from the domain of Machine Translation and ran our algorithm on them. This analysis showed that duplicates ("doubles") are abundant in the extraction output, as shown in Fig. 5. Doubles are extracted for each concept, and the presence of this redundant information drastically deteriorates the quality of extraction, notably the statistical scores (Table 1).


[Fig. 5. Quality of extraction: counts of correct, false and duplicate ("doubles") extractions for problems, solutions and parameters]

Table 1. Statistical calculation of the extraction quality for each IDM concept

            Problems               Solutions              Parameters
Precision   0.2913043478 (29%)     0.2913043478 (29%)     0.461988304 (46%)
Recall      0.881578947 (88%)      0.817073171 (81.7%)    0.975308642 (97.5%)
F-score     0.437908497 (43.7%)    0.521400778 (52%)      0.626984127 (62.6%)

Although recall is relatively high (in this analysis it is not possible to compute an exact figure, which is why we assume that the tool misses nothing), precision is still poor. The F-score (see Table 1) seems acceptable at this evaluation point, yet it is only an approximation; in reality, the results can be even worse.
2.3 Patent Claims
Nowadays, a number of tools can recognize, at least partially, the hierarchical structure of patent claims. These tools are created by companies or institutions for their own analysis purposes. For example, Espacenet [14], the European Patent Office's search engine, can build a tree-like representation of the claim structure in its viewer. Dependent-claim detection is offered by the French company Intellixir [15]. TotalPatent [16] (a LexisNexis product) also constructs a hierarchical visualization of patent claims. Moreover, recent research in patent information retrieval has focused on claims. The Information Retrieval Facility [17] is a series of conferences, started in 2006, dedicated to patent documents. In addition, several projects financed by the European Union conduct research on patent searching and analysis (PATEXPERT [18] and iPatDocs [19]). Linguistic approaches prevail in the works of Sheremetyeva [20] and Shinmori et al. [21], which aim to break complex sentences into sub-sentences to make them easier


to read and understand. This topic continues to interest researchers, for instance Parapatics et al. [22]. Efforts towards structural parsing are reported in the works of Verberne et al. [23], D'hondt et al. [24], and Yang and Soo [25]; however, these structural parsers focus on finding grammatical relations within claims. The dependency between patent claims is the central theme of the work of Hackl-Sommer et al. [26]. They make two strong hypotheses that lead us to state that "the occurrence of references in patent claims is a direct indicator to identify and separate independent from dependent claims" [26]; i.e., the presence in a claim's text of a phrase such as "according to claim 1" leads us to conclude that this claim is dependent, while the absence of this type of phrase indicates an independent claim. The second hypothesis is that 'the formulaic language of patent claims allows for pattern-based analysis of the claims to identify references' [26]; i.e., the legal language used in patent claims facilitates the analysis and extraction of the claims structure, with the aim of minimizing the extraction of redundant information.

3 Extraction Tool
In this section, we briefly describe the tool for the automatic extraction of IDM concepts from patent texts, which was recently built by our laboratory.
3.1 Tool Description
Before describing our methodology, we briefly present the toolkit for automatic extraction of IDM-related concepts. The toolkit [13] to be improved uses linguistic and statistical methods to extract concepts related to IDM. It is based on a knowledge-oriented approach (in contrast to a data-oriented approach, i.e., tokenization, lemmatization, segmentation and named-entity recognition, generally used for structured data) [27]. Patent text, however, is unstructured data, which is why the knowledge-oriented approach was used. This approach consists of the automatic extraction of relevant linguistic patterns for each concept (problem, partial solution and parameters). First, two corpora of patent texts were built (the first corpus was used to build the list of linguistic markers and the second one for result evaluation). Classical NLP steps such as corpus preprocessing, stop-word elimination, linguistic-marker weighting, part-of-speech tagging and lemmatization were applied to the training corpus [28]. The linguistic markers are extracted from the patent corpora with the help of the TF-IDF method (term frequency–inverse document frequency) [29] and the identification of contiguous sequences of n items, also called n-gram identification. The latter extracts all word sequences from 1 to 10 tokens and computes the most frequent ones; all these n-grams were then analyzed in context to choose the most relevant linguistic markers. For example, problems are preceded by markers such as 'it is known that…' or 'resulting in…', and partial solutions are


preceded by phrases like 'the present invention relates to…' or '…an object of this invention is to…' [13]. After constructing and classifying the list of linguistic markers for each IDM concept, an API was built in Python to operate on these data. The user provides a patent text as input, and the algorithm then performs the extraction based on the lists of linguistic markers.
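The following is a simplified sketch of this marker-based extraction. The marker lists contain only the example phrases quoted above (the real tool uses much larger lists learned with TF-IDF and n-gram analysis), and the function names and sample text are illustrative, not the actual API of the tool [13].

```python
# Simplified sketch of marker-based extraction of IDM concepts.
# MARKERS holds only the examples quoted in the text above.
import re

MARKERS = {
    "problem": ["it is known that", "resulting in"],
    "partial_solution": ["the present invention relates to",
                         "an object of this invention is to"],
}

def extract_concepts(text):
    """Return, for each concept type, the sentence fragments that follow a marker."""
    results = {concept: [] for concept in MARKERS}
    for sentence in re.split(r"(?<=[.;])\s+", text):
        lowered = sentence.lower()
        for concept, markers in MARKERS.items():
            for marker in markers:
                idx = lowered.find(marker)
                if idx != -1:
                    results[concept].append(sentence[idx + len(marker):].strip())
    return results

if __name__ == "__main__":
    sample = ("It is known that long hoses kink easily, resulting in flow interruption. "
              "An object of this invention is to provide a kink-resistant hose.")
    print(extract_concepts(sample))
```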

4 Methodology
This section describes our methodology for improving IDM-related information extraction.
4.1 Corpus Analysis
In order to extract from a patent the information about its structure, we need to find the linguistic clues that allow us to establish the dependency between patent claims. The formulaic language used in the construction of patent claims suggests that there exists a limited set of fixed dependency constructions, so we need to obtain the list of these constructions. For this purpose, we selected patents in text format from different technical domains in our database. The writing style of patent documents is formalistic, but the lexical and syntactic constructions can differ, which is why we analyze as many domains as possible. We chose 20 patents at random from chemistry, engineering science and linguistics. We then used AntConc [30], a freely available corpus analysis toolkit for concordancing and text analysis, which finds all occurrences of a queried word or feature in a corpus. Owing to the formalistic style and language used for writing patent claims, the dependency constructions repeat in each document. The lexical and syntactic structure of the phrases is independent of the domain of knowledge, i.e., the same constructions are used in each domain. For example, in the engineering domain [31]:

1. […]
2. The seal device as set forth in claim 1, wherein said contact surfaces have a ring configuration.
3. The seal device as set forth in claim 2, wherein said sensing member also has a ring configuration.
[…]

In the linguistic domain [32]:

1. […]
2. The method of claim 1 wherein providing the translation output comprises […]
3. The method of claim 2 and further comprising: calculating a confidence measure for each translation output.


4. The method of claim 3 wherein calculating comprises: calculating the confidence measure […]

On the assumption that all dependent claims contain the word 'claim' and the number(s) of the claim(s) to which they refer, we conducted the search using this keyword. Through this analysis we found 34 typical claim-dependency clues, for example 'according to claim Num., wherein' and 'in accordance with claim Num., wherein.' During the analysis, we divided the claim dependency structures into the following groups (a regular-expression sketch of these groups is given after the list):

1. foregoing term, for example, 'referenced above,' 'one of the,' 'above-mentioned';
2. interval, for example, 'Num. to Num.,' 'between Num. and Num.';
3. filler adverbs, for example, 'before,' 'previous';
4. enumeration, for example, 'according to claims 1, 3 to 5 and 10–20.'
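The sketch below encodes the example clues of each group as Python regular expressions. The patterns and the function name are illustrative: they cover only the examples listed above, not the full list of 34 clues used by our tool.

```python
# Regular-expression sketch for the four groups of claim-dependency clues.
# Only the example phrases listed above are covered.
import re

NUM = r"\d+"
DEPENDENCY_PATTERNS = {
    "foregoing_term": re.compile(r"referenced above|one of the|above-mentioned", re.I),
    "interval":       re.compile(rf"{NUM}\s+to\s+{NUM}|between\s+{NUM}\s+and\s+{NUM}", re.I),
    "filler_adverb":  re.compile(r"\b(before|previous)\b", re.I),
    "enumeration":    re.compile(rf"claims?\s+{NUM}(\s*,\s*{NUM})*(\s*(and|to|–|-)\s*{NUM})*", re.I),
}

def find_dependency_clues(claim_text):
    """Return the clue groups whose pattern matches the claim text."""
    return [name for name, pattern in DEPENDENCY_PATTERNS.items()
            if pattern.search(claim_text)]

print(find_dependency_clues("The hose according to claims 1, 3 to 5 and 10-20, wherein ..."))
```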

Moreover, we can find numerous combinations of these types of dependency clues, such as a foregoing term + interval, for example 'one or more of claims 1 to 5'. To obtain the best results, we should take all types of dependency clues into account, including the combinations. To close our corpus analysis, we need to mention the different ways of referencing claim numbers. The authors of patent claims use Arabic as well as Roman numerals; these two types are relatively easy to process. However, to establish the hierarchical structure we also need to take into account spelled-out numerals, which are frequently used in patent claim constructions.
4.2 Workflow
After constructing the list of dependency structures, we can start identifying and extracting the claims hierarchy. The workflow is relatively straightforward and consists of the following steps:

1. Segmentation. We need to detect the beginning and the end of each patent claim, as well as find the claims section in the patent text.
2. Number recognition. Each claim is numbered consecutively, and this number needs to be identified for each claim.
3. Classification. Each claim needs to be categorized as dependent or independent.
4. Selection. In the case of dependent claims, the parent claim has to be extracted.

We discuss each step of the workflow in the following sections.
4.2.1 Segmentation and Number Recognition
To identify the beginning and the end of each claim, as well as the claims section in the patent text, we need a reliable method. The claims section is always located at the end of the patent text, and its beginning is easy to identify: it usually starts with the sequence "Claims:". Thus, the automatic retrieval of this section is a relatively easy task.


Segmenting the individual claims is more complex. However, we introduce a number of rules that identify the beginning of each claim. The beginning may be marked by a new line, by a number preceded by the term "Claim" (e.g., 'Claim 1'), or by a number followed by a character such as a blank, a dot, a closing parenthesis or a hyphen (e.g., "1.", "1)" or "1—"). These rules have been included in our claim-segmentation algorithm. Occasionally, the beginning of a new claim is not separated by a new line, so the claims appear as a single block of text; alternatively, a blank or some other character may appear between the clues mentioned above, for example "Claim 1 3. A method…". We added rules to our algorithm to handle this type of structure as well.
4.2.2 Classification and Selection
The next two steps of our workflow aim to classify the claims according to our definition of dependent and independent claims and to select which claims we need to extract. To achieve this, we use an algorithm that finds the dependency structures described in the corpus analysis (Sect. 4.1) and then extracts the numbers of the parent claims. When this information is not found for a claim, we consider it independent.
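A condensed sketch of the four workflow steps is given below. The segmentation rule and the reference pattern are simplified versions of those described above, and the function names and the two-claim example are illustrative rather than the actual implementation.

```python
# Condensed sketch of the claims-processing workflow:
# segmentation, number recognition, classification and parent selection.
import re

CLAIM_START = re.compile(r"(?m)^\s*(?:Claim\s+)?(\d+)\s*[.)\u2014-]?\s+")
REFERENCE = re.compile(r"claims?\s+(\d+)", re.I)

def segment_claims(claims_section):
    """Split the claims section into (number, text) pairs (steps 1 and 2)."""
    matches = list(CLAIM_START.finditer(claims_section))
    claims = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(claims_section)
        claims.append((int(m.group(1)), claims_section[m.end():end].strip()))
    return claims

def classify(claims):
    """Mark each claim as dependent or independent and record its parents (steps 3 and 4)."""
    result = {}
    for number, text in claims:
        parents = [int(n) for n in REFERENCE.findall(text)]
        result[number] = {"dependent": bool(parents), "parents": parents}
    return result

if __name__ == "__main__":
    section = ("1. A transport hose comprising a resilient body.\n"
               "2. The transport hose according to claim 1, wherein said body is helical.\n")
    print(classify(segment_claims(section)))
```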

5 Evaluation and Implementation
The results of our experimental work are presented in this section.
5.1 Method Evaluation
To evaluate our methodology of claim segmentation and hierarchical structure extraction, we processed random patent texts from different domains taken from our database. In particular, we are interested in patents containing a large number of claims (more than 20). The output of our algorithm is a list of dependent and independent claims with their numbers, as well as a directed graph representing the hierarchical structure of the claims. Since the language used for claim construction is formal, with a limited number of constructions, the class of an analyzed claim is evident. For example, the claim "2. The sealing ring of claim 1 wherein said electrode is embedded within said body and is spaced from said exterior surface." [33] depends on claim 1. In our analysis we processed 13 documents. The results are:

• accuracy of claim segmentation = 92.3% (12 of 13 documents)
• accuracy of dependency recognition = 100% (12 of 12 documents).

Naturally, the algorithm does not process claims that were wrongly segmented or not recognized. The single failure was due to the format of the original document; after changes to the code, we managed to segment this 13th document as well.


5.2 Implementation
After recognizing the hierarchical structure of the claims, we made our IDM-concept analysis tool ignore child (dependent) claims. Previously, we processed our dataset without any changes. The results are the following:

• before child-claim filtering:
  • 42 concepts found, of which 13 problems and 29 partial solutions,
  • processed in 81.31 s;
• after child-claim filtering:
  • 32 concepts found, of which 13 problems and 19 partial solutions,
  • processed in 75.31 s.

As shown above, our method reduces the processing time of our tool as well as the number of extracted partial solutions. This is expected, because claims list ready solutions, not problems. Of the 10 dropped patterns considered as partial solutions, 8 were doubles and the other 2 were errors. Of the 19 partial solutions still extracted, 16 are correct and 3 are incorrect, and only 3 duplicate phrases remain in the output. This allows us to conclude that the performance of our IDM-concept extraction tool has been improved:

• improvement of processing time = 7.38%
• noise reduction = 76.92%.

The remaining doubles exist because these partial solutions are also extracted from other parts of the patent texts.

6 Discussion
The method of automatic extraction of IDM-related information described in this article confirms our hypothesis that the dual nature of patent texts makes the whole document structure more complex. This complexity poses many problems for automatic concept extraction as well as for text understanding. Noisy output makes it difficult to analyze the extraction results, since our overall goal is to help engineers find an appropriate solution using as many sources of information as possible. The precision of information extraction matters because, in a real application of the algorithm, doubles and errors are obstacles and need to be eliminated from the output as much as possible. Although we improved the quality of extraction, we still need to refine our approach, because we processed only a small number of texts to build the list of dependency structures.


As future work, we need to further refine the quality of the output. Firstly, we suggest exploiting the hierarchical structure of the whole patent text, notably to reduce the noise in the output of problems, which are located mostly in the Abstract and Description sections. Secondly, resolving co-references such as anaphora, cataphora or split antecedents, which are frequent in patent texts, could also reduce the noise and make the output phrases more coherent and clear. Thirdly, the most efficient way to improve the quality of extraction would be to implement user validation of the output: for example, once the user reports an extracted sentence as noise, the algorithm records it and learns not to extract similar sentences.

7 Conclusion
The contribution of our method described in Sect. 4 is important. However, the analysis was performed on a small patent corpus, i.e., we need to repeat the tests with a bigger corpus to find more limitations and fix them before implementation in the toolkit. Moreover, we suggest that the whole patent document, not only the claims section, has a hierarchical structure. Treating this hypothesis and implementing it adequately could drastically reduce the noise, or even remove it in certain cases.

References
1. Parker, J.P., Begnaud, L.G.: Developing Creative Leadership. Libraries Unlimited, Westport (2004)
2. Dalkey, N.C., Helmer-Hirschberg, O.: An Experimental Application of the Delphi Method to the Use of Experts (1962). https://www.rand.org/pubs/research_memoranda/RM727z1.html. Accessed 09 Apr 2019
3. Prince, G.M.: The Practice of Creativity: A Manual for Dynamic Group Problem Solving. Collier Books, New York (1972)
4. Altshuller, G.: Find the Idea: An Introduction to TRIZ, the Theory of Inventive Problem Solving. Alpina Publisher (2008) (in Russian)
5. European Patent Office: Guidelines for Examination in the European Patent Office (2018)
6. Tanaka, M., Saito, H.: Transport hose with leak detecting structure, US 4259553A, 31 March 1981
7. Cavallucci, D. (ed.): TRIZ — The Theory of Inventive Problem Solving: Current Research and Trends in French Academic Institutions. Springer, Cham (2017)
8. Cavallucci, D.: From TRIZ to Inventive Design Method (IDM): towards a formalization of Inventive Practices in R&D Departments (2012)
9. Rousselot, F., Zanni-Merk, C., Cavallucci, D.: Towards a formal definition of contradiction in inventive design. Comput. Ind. 63(3), 231–242 (2012)
10. Cavallucci, D., Rousselot, F., Zanni, C.: Initial situation analysis through problem graph. CIRP J. Manuf. Sci. Technol. 2(4), 310–317 (2010)
11. Guyot, B., Normand, S.: Le document brevet, un passage entre plusieurs mondes. Document et organisation, Paris (2004)
12. Bonino, D., Ciaramella, A., Corno, F.: Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. https://www.sciencedirect.com/science/article/pii/S0172219009000465. Accessed 10 Apr 2019


13. Souili, A.W.M.: Contribution à la Méthode de conception inventive par l'extraction automatique de connaissances des textes de brevets d'invention. Université de Strasbourg, École Doctorale Mathématiques, Sciences de l'Information et de l'Ingénieur, Laboratoire de Génie de la Conception (LGéCo) – INSA de Strasbourg (2015)
14. Espacenet Patent Search. https://worldwide.espacenet.com/. Accessed 10 Apr 2019
15. Questel: Orbit Intellixir (2019). https://www.questel.com/software/orbit-intellixir/. Accessed 11 Apr 2019
16. Patent Research & Analysis Software | LexisNexis TotalPatent One™, LexisNexis® IP
17. Information Retrieval Facility. http://www.ir-facility.org/. Accessed 22 Mar 2019
18. Advanced patent document processing techniques | Projects | FP6 | CORDIS | European Commission. https://cordis.europa.eu/project/rcn/79394/factsheet/en. Accessed 11 Apr 2019
19. Brugmann Software GmbH: iPatDoc (2013)
20. Sheremetyeva, S.: Natural language analysis of patent claims. In: Proceedings of the ACL-2003 Workshop on Patent Corpus Processing, vol. 20, pp. 66–73 (2003)
21. Shinmori, A., Okumura, M.: Aligning patent claims with detailed descriptions for readability. In: Proceedings of the Fourth NTCIR Workshop, vol. 12, no. 3, pp. 111–128, July 2005
22. Parapatics, P., Dittenbach, M.: Patent claim decomposition for improved information extraction. ResearchGate (2011). https://www.researchgate.net/publication/226411853_Patent_Claim_Decomposition_for_Improved_Information_Extraction. Accessed 11 Apr 2019
23. Verberne, S., D'hondt, E., Oostdijk, N.: Quantifying the challenges in parsing patent claims. ResearchGate (2010). https://www.researchgate.net/publication/228739952_Quantifying_the_challenges_in_parsing_patent_claims. Accessed 11 Apr 2019
24. D'hondt, E., Verberne, S., Alink, W., Cornacchia, R.: Combining document representations for prior-art retrieval, p. 9 (2011)
25. Yang, S.-Y., Soo, V.-W.: Extract conceptual graphs from plain texts in patent claims. Eng. Appl. Artif. Intell. 25(4), 874–887 (2012)
26. Hackl-Sommer, R., Schwantner, M.: Patent claim structure recognition. Arch. Data Sci. Ser. A (Online First) (2017). https://publikationen.bibliothek.kit.edu/1000069936. Accessed 11 Apr 2019
27. Souili, A., Cavallucci, D.: Automated extraction of knowledge useful to populate inventive design ontology from patents. In: Cavallucci, D. (ed.) TRIZ — The Theory of Inventive Problem Solving, pp. 43–62. Springer, Cham (2017)
28. Souili, A., Cavallucci, D., Rousselot, F.: A lexico-syntactic pattern matching method to extract IDM–TRIZ knowledge from on-line patent databases. Proc. Eng. 131, 418–425 (2015)
29. Salton, G., Yang, C.S.: On the Specification of Term Values in Automatic Indexing, June 1973
30. Anthony, L.: AntConc. Waseda University, Tokyo, Japan (2019)
31. Bennett, B.E.: Seals with integrated leak progression detection capability, US 7316154B1, 08 January 2008
32. Zhou, M., Huang, J.-X., Huang, C.N.T., Wang, W.: Example based machine translation system, US 7353165B2, 01 April 2008
33. Sunkara, M.K.: Sealing ring with electrochemical sensing electrode, US5865971A, 02 February 1999

Translate Japanese into Formal Languages with an Enhanced Generalization Algorithm Kazuaki Kashihara(B) Department of Computer Science, Arizona State University, Tempe, AZ 85281, USA [email protected]

Abstract. In this paper, we propose an extension of the semi-automated semantic parsing platform NL2KR to Japanese. Japanese is an agglutinative language, and it is difficult to assign a meaning to each word since different meanings are created from a single root word. We introduce two algorithms, Phrase Override and an enhanced Generalization. The Phrase Override algorithm provides the same capability as the Syntax Override of the original NL2KR, which adjusts the Combinatory Categorial Grammar (CCG) parse tree structure produced by its English CCG parser. To extend the platform to other languages, however, a CCG parser for each language is needed. A Japanese CCG parser is available, and Phrase Override provides a way to adjust the CCG parse tree structure it generates. The Generalization used in NL2KR generates the meanings of missing words from other known words. Our proposed enhanced Generalization algorithm uses words that are semantically similar to the missing word and applies the templates of these words to generate the missing word's meanings. The evaluation results show that this approach improves accuracy not only for Japanese but also for English, with smaller learned lexicons. For the evaluation, we provide new data corpora for Japanese: the GeoQuery corpus has been translated into several languages, including Japanese, but the existing Japanese GeoQuery is a transliteration. We provide a GeoQuery translated into Japanese script, which is the first such Japanese corpus. Our proposed approach can be extended to other agglutinative languages such as Turkish, Finnish and Esperanto when a CCG parser is available for them. Our platform is Java-based and does not depend on the machine environment.

Keywords: Natural language processing · Natural Language Understanding · Semantic parsing

1 Introduction

Within the field of Natural Language Understanding, many systems require semantic parsing capabilities in order to convert natural language text to


formal statements, where formal statements are needed for reasoning and generating appropriate responses. For example, some robotic interfaces accept instructions to a robot in the form of natural language, but the input must be converted into a logic-based Robot Control Language [18]. Most such systems focus on English, and few support other languages. There are systems that handle agglutinative languages, namely Japanese and Turkish, such as WASP [33], λ-WASP [34], UBL & UBL-s [14], and FUBL [15]. However, these systems treat only Japanese transliteration written in the Roman alphabet. By using Japanese transliteration, several aspects of the Japanese language are overlooked or lost. For example, the sentence "Give me the cities in Virginia" [33] is written there in transliterated Japanese as "baajinia no toshi wa nan desu ka" for the Japanese sentence " ". The transliteration loses the information carried by the Japanese characters, especially the Kanji. For instance, the Japanese sound "hashi" has several written representations and meanings in Japanese: " " (bridge), " " (edge), " " (chopsticks) and " " (ladder). The Japanese transliteration "watashi ha hashi kara tobu" is therefore ambiguous, with several meanings corresponding to different Japanese sentences: " " (I jump from the bridge), " " (I jump from the edge), and " " (I jump from the ladder). In other words, if a sentence is written directly in Japanese script (and not as a transliteration), it is easy to distinguish the different meanings of "hashi". ccg2lambda [21] is a framework for compositional semantics based on a wide-coverage Combinatory Categorial Grammar (CCG) parser combined with a higher-order inference system. This system parses Japanese sentences with the Japanese CCG parser [22], and more recent works [37,38] show improvements in parsing accuracy with ccg2lambda. However, this system requires the meanings of all words in the given Japanese sentence in order to convert the sentence into a higher-order logical form. NL2KR [1,10,32] is a semantic parsing platform that builds a translation system from English text to a desired target language, given examples of sentences and their translations and some initial lexicon. Recent research shows that the NL2KR system can construct such a target-language translation system easily and with high accuracy [32]. To parse English sentences, NL2KR uses a CCG parser; we believe that, given a CCG parser for another natural language, the NL2KR system can be extended to that language. CCG is a radically lexicalized theory of natural language grammar, in which the lexicon is the only resource specifying language-specific information such as the order of constructions, and it is widely accepted in the NLP field [27]. Noji et al. [22] created the first Japanese CCG parser, and the recent works of Yoshikawa et al. [37,38] show improvements in parsing accuracy. However, since the grammatical structures of English and Japanese are very different and the Japanese CCG parser still has room to improve its accuracy, we need to analyze the outputs of the Japanese CCG parser on a variety of Japanese sentences and add algorithms that adapt Japanese grammar to the NL2KR system.


We then apply this modified Japanese CCG parser, together with Phrase Override, which can trim the parsed CCG tree, to the NL2KR system. Using this approach, we developed Japanese NL2KR (JNL2KR), the first semantic parsing platform that converts Japanese sentences written with Kanji into formal languages. Once this approach works, it can easily be applied to other agglutinative languages that have a CCG parser. The NL2KR system has Inverse-λ and Generalization [1] for learning word meanings. Inverse-λ can learn the semantic representation of one child node if the semantic representations of the parent node and of the other child node are known. Generalization is called when Inverse-λ is not sufficient to learn new semantic representations of words. In contrast to Inverse-λ, which learns the exact meaning of a word in a particular context, Generalization generates the meanings of a word from similar words with existing representations. Thus, Generalization helps NL2KR learn the meanings of words that are not even present in the training dataset. Generalization addresses a key challenge of semantic parsing: semantic parsers must be able to ascertain the meaning of never-before-seen words in order to accurately translate entire utterances into formal representations that machines can understand, in the same way a human must figure out the meaning of a new word when trying to understand something they have never heard before. However, Generalization can be imprecise in how it generalizes meaning from known words to unknown words. For example, the meaning of a new word might be identified based only on whether it belongs to the same grammatical class as known dictionary entries: if an unknown word is a noun, the meanings of all known nouns are applied to the unknown word. This can result in multiple, inaccurate meanings, and in the creation of a formal representation for an entire utterance these inaccurate meanings can have a snowball effect on a system's effectiveness. We explore how to address these issues of Generalization by incorporating context through word vectors trained with Word2Vec and through WordNet. We are specifically interested in how different types of vector representations can improve Generalization. WordNet focuses on sense representations [20], while Word2Vec is a prediction-based model [19]. These two approaches are based on different assumptions about the relationships that matter between words, and there are open questions about whether these assumptions have different downstream effects on the accuracy of semantic parsing. We propose two specific research questions regarding the use of vector representations for improving Generalization in semantic parsing:

– How do the different underlying assumptions of Word2Vec and WordNet influence the overall accuracy of a semantic parser when these representations are used to enhance Generalization?
– What are the differences when using Word2Vec and WordNet to enhance Generalization for a semantic parser for English (NL2KR) [32] versus our JNL2KR system?

We have evaluated JNL2KR on a standard dataset: GeoQuery [33]. GeoQuery250 is a database of geographical questions that supports English,


Spanish, Turkish, and Japanese (herein referred to as "Japanese transliteration"). It is a subset of the original dataset, which has 880 English questions. A professional Japanese interpreter translated the original 880 English sentences into Japanese, and we use these Japanese sentences for evaluation; we call the Japanese translation of the original dataset GeoQuery880. These Japanese GeoQuery250 and GeoQuery880 are the first standard datasets in Japanese script. Experiments demonstrate that JNL2KR exhibits state-of-the-art performance with a fairly small initial lexicon for both GeoQuery250 and GeoQuery880 in Japanese. We also evaluated our JNL2KR and the original NL2KR with the enhanced Generalization, and the results show improved performance with a smaller learned lexicon compared to the original Generalization. Therefore, we can extend the JNL2KR system to other languages that have a CCG parser, using the enhanced Generalization.

Fig. 1. Example of Syntax Override: the left CCG parse tree is obtained after giving the syntax override information "has (S\NP)/S", and the right CCG parse tree is the original one. In the left tree, "running through it" correctly modifies "the most rivers".

2 Background and Related Work

Over the years, various models have been proposed to learn semantic parsers from natural language expressions paired with their meaning representations [14,15,34,39,41]. These systems learn lexicalized mapping rules and scoring models to construct a meaning representation for a given sentence. Recently, neural sequence-to-sequence models have been applied to semantic parsing with promising results [7], eschewing the requirement for extensive feature engineering. The recent work by Dong and Lapata [8] presents a coarse-to-fine decoding framework for neural semantic parsing. All of the above systems are evaluated on the GeoQuery dataset [33] in two agglutinative languages, Japanese and Turkish. However, the Japanese GeoQuery dataset uses segmented Japanese transliteration. In the GeoQuery


dataset, Japanese is the only language written in romanized, segmented transliteration; the other languages (English, Spanish, and Turkish) are all written in their own script and not in transliteration. There has been considerable work on addressing or sidestepping the issue of unseen words in semantic parsing. Some approaches use external resources [3,36] or an external paraphrase model [4,13]. Other work parses language into some kind of intermediate representation, such as a syntactic tree, and then deterministically transforms that intermediate representation. This latter method uses an external parser with larger lexicons and mechanisms for handling unseen words [9,12,13,25].

2.1 Ccg2lambda

The ccg2lambda system [21] is a higher-order automatic inference system which converts CCG derivation trees into semantic representations and conducts natural deduction proofs. The system supports Japanese, using the Japanese CCG parser [22] to generate the CCG tree. Templates for English and Japanese accompany the system; these templates are easy to understand, use, and extend to cover other linguistic phenomena or languages. However, the system operates by composing semantic formulas bottom-up over a given CCG parse tree and requires a semantic template for every word: if the semantic template of any word is missing, the system cannot build the semantic representation of the sentence.

2.2 NL2KR

The NL2KR platform [1,10,32] is a user-friendly platform that takes examples of sentences and their translations (in a desired target language that varies with the application) together with some bootstrap information (an initial lexicon), and constructs a translation system from text to that target language. It is a Java-based platform, so it does not depend on the machine environment. NL2KR has a Syntax Override feature that can adjust the CCG parse tree of a given sentence, since its CCG parser sometimes generates incorrect parse trees. For example, the CCG parser generates the wrong parse tree for "what state has the most rivers running through it", shown on the right side of Fig. 1, and the Syntax Override information "has (S\NP)/S" helps to generate the correct parse tree shown on the left side of the figure. In addition, the authors used "_" to treat some phrases as single words so that the sentences could be parsed correctly; for example, the sentence "How many states are in the USA" is converted to "How_many states are in the USA". Baral et al. and Vo et al. [1,32] call Generalization the approach in which the semantic meaning of unseen or unknown words is learned from the semantic meaning of other, similar words. With Generalization, 'similar' words are usually identified by their CCG category: if the semantic parser comes across a word with no known meaning, the parser will search for all words of the


same CCG category. The Generalization will apply the meanings from all of the words of the same CCG category to the unknown or unseen word. Occasionally an unknown word is actually present in the training set under a different CCG category. In that case the meaning of the counterpart is applied to the unknown word.

Fig. 2. English CCG Parse Tree of “Can you tell me the capital of Texas” with meanings

Currently, the NL2KR system supports only English sentences. Building on the idea of NL2KR, we can apply the Japanese Combinatory Categorial Grammar (CCG) parser and some additional algorithms to our system in order to support Japanese sentences.

2.3 Japanese CCG Parser

The derivation of a sentence in Combinatory Categorial Grammar (CCG) determines the way the constituents combine to establish the meaning of the whole sentence. Natural language sentences can have various CCG parse trees, where each tree expresses a different meaning of the sentence. Existing English CCG parsers [6,17] either return a single best parse tree for a given sentence or parse it in all possible ways with no preferential ordering among them. For Japanese, a comprehensive theory of Japanese syntax based on CCG was proposed by Bekki [2]. Noji et al. [22] created a Japanese CCG parser based on the Japanese CCG bank constructed by Uematsu et al. [30,31]. The parser is implemented as a shift-reduce parser [40] combined with beam search. Recent works show improvements in the accuracy of Japanese CCG parsing


[37,38]. We evaluated the performance of the Japanese CCG parsers Jigg [22] and depccg [37]. Their performance is not significantly different, and Jigg is a Java-based parser that we can use directly in our JNL2KR system; thus, we use Jigg in this work.

Fig. 3. Japanese CCG Parse Tree of “Can you tell me Texas’s capital” with meanings. (Japanese translation of “Can you tell me the capital of Texas” is slightly different)

Figure 2, Fig. 3, and Fig. 4 show the different CCG parse tree structures of the sentence "Can you tell me the capital of Texas" in English and Japanese. Figures 2 and 3 show the different formal meanings of the phrases "Can you tell me" and " ": in English, "tell" carries the key meaning of the phrase, "λz.λy.λx.z@(x@answer(y))", while " " has the meaning "λx.answer(x)". Because the grammatical structures and CCG parse trees differ, the formal meanings of some words and phrases also differ. The Japanese CCG parser also faces the challenge of parsing Japanese transliteration, since it is hard to segment the transliteration correctly. Figure 4 illustrates this issue: the sentence is segmented incorrectly. For example, the word "Texas" is split into two meaningless parts, and the phrase "Can you tell me the capital of" is also split into two parts, each of which is incorrect and meaningless.

3 Methodology

In Japanese, semantic information such as tense and modality is extensively expressed using auxiliary verbs. The auxiliaries connect to the main


verb in sequential order. An example of the output graph of the sentence " " (Can you tell me Texas's capital) is shown in Fig. 3.

Fig. 4. Japanese Transliterated CCG Parse Tree of “Can you tell me the capital of Texas” with meanings. However, the edge meanings are missing due to incorrect parse tree.

For instance, " " (Can you tell me; verb phrase, CCG category: S\NP) in Fig. 3 is split into " " (independent verb, CCG category: S\NP), " " (conjunctive particle, CCG category: S\S) and " " (dependent verb, CCG category: S\S). However, it is hard to break the phrase meaning "λx.answer(x)" down into the meaning of each word; if we force such a breakdown, the resulting meanings of each word in the phrase " " (Can you tell me) are those shown in Fig. 3. In NL2KR, "_" was used to preprocess the original sentence so that phrases become single words, such as "How much" to "How_much", "Tell me" to "Tell_me", and "Through which" to "Through_which", combined with Syntax Override to adjust the output CCG parse trees. However, our system uses the Japanese CCG parser Jigg [22], and we are not able to change the tree structure while the parser builds the CCG parse tree of the given Japanese sentence. Thus, we introduce Phrase Override, which trims the CCG parse tree output by the Japanese CCG parser. This Phrase Override algorithm is applicable to other languages' CCG parsers.

3.1 Phrase Override

The user provides a Phrase Override set containing pairs of a phrase and its CCG category; the Japanese CCG parser with Phrase Override then outputs the CCG parse tree such that, if one of the pairs in the Phrase Override set occurs in the tree as a node, that node becomes a terminal node. For example, if a user


Algorithm 1: Phrase Override algorithm
Input: parseTrees, OverridePhraseSet
Output: updated parseTrees
for tree ∈ parseTrees do
    for node ∈ tree do
        if node is not a terminal node then
            if the pair of phrase and CCG category of node matches a pair in OverridePhraseSet then
                node becomes a terminal node
            end
        end
    end
end

wants to stop the noun phrase " " (How many people; its CCG category is NP) as a terminal node, the user simply specifies the tab-separated pair " NP" in the phrase override file. Manual Phrase Override reads the phrase override file and stops that noun phrase as a terminal node when the Japanese CCG parser outputs the CCG parse tree. The Phrase Override procedure is shown in Algorithm 1. In the boxed part of Fig. 3, our Japanese CCG parser with Phrase Override outputs the CCG parse tree with the verb phrase " S\NP" as a terminal node, so the children of that node are not shown. The JNL2KR system can also use the Japanese CCG parser independently through an interactive GUI. If we want to use Phrase Override, we can provide the Phrase Override file as the Phrase Override set, and the GUI will show the overridden CCG parse tree, which can be zoomed in or out and whose nodes can be moved. This makes it easier to work with longer sentences that contain several phrases and keeps the parse tree simple. For instance, if we give " S\NP" (the phrase and CCG category separated by a tab) in the Phrase Override file, then the CCG parse tree of " " (Can you tell me Texas's capital) will be the tree in which the rectangle part (the verb phrase) of Fig. 3 is removed.
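A minimal Python sketch of the Phrase Override idea of Algorithm 1 follows. The actual JNL2KR implementation is Java-based; the Node class, the transliterated placeholder phrases and the override pair below are purely illustrative.

```python
# Sketch of Phrase Override (Algorithm 1): if a non-terminal node's
# (phrase, CCG category) pair is in the override set, its children are dropped,
# so the node becomes a terminal node. Node class and phrases are placeholders.
class Node:
    def __init__(self, phrase, category, children=None):
        self.phrase = phrase
        self.category = category
        self.children = children or []

def phrase_override(node, override_set):
    """Trim one CCG parse tree in place according to the override set."""
    if node.children and (node.phrase, node.category) in override_set:
        node.children = []            # the node becomes a terminal node
        return
    for child in node.children:
        phrase_override(child, override_set)

overrides = {("oshiete kuremasu ka", r"S\NP")}        # placeholder verb phrase
tree = Node("texas no shuto wo oshiete kuremasu ka", "S", [
    Node("texas no shuto wo", "S/S"),
    Node("oshiete kuremasu ka", r"S\NP",
         [Node("oshiete", r"S\NP"), Node("kuremasu ka", r"S\S")]),
])
phrase_override(tree, overrides)
print(len(tree.children[1].children))   # 0 -> the verb phrase is now a terminal node
```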

3.2 Comparing Formal Meanings in English and Japanese

English and Japanese grammatical structures are very different; however, most of the meanings of the keywords are the same. For instance, in Fig. 2 and Fig. 3, the meaning of "Texas" is the same, "stateid(Texas)", and the meaning of "capital" is also the same, "λy.λx.capital(x, y)". In addition, "of" expresses the possession relation between "capital" and "Texas", and " " also expresses this possession relation; both "of" and " " therefore have the same meaning, "λy.λx.x@y". Thus, most meanings of keywords such as nouns, adjectives and adverbs can be reused directly from their English meanings. On the other hand, multiple English phrases often translate into the same phrase in Japanese. For instance, the English phrases "Can you


tell me", "Could you tell me", "Tell me", "Give me", and "Teach me" are all translated to a single phrase, " ", in Japanese. Thus, we can reduce the number of verb phrases in Japanese. Moreover, in most cases the verb phrase is a function and the noun phrase is an argument. However, because of the CCG parse tree issue in Japanese, the noun phrase sometimes becomes a function and the verb phrase an argument. We call this case "flipping", and we then need to give an additional meaning to the adposition. For instance, " " (How large is Texas), with category S and meaning answer(size(stateid(Texas))), splits into the noun phrase " " (Texas), with category S/S and meaning λx.x@stateid(Texas), and the verb phrase " " (How large is), with category S and meaning λx.answer(size(x)). In this example, the noun phrase is the function and the verb phrase is the argument. If the relationship is flipped, the formal meaning of the noun phrase is simply stateid(Texas). However, the adposition of the noun phrase, " ", then needs to have a special meaning, λy.λx.x@y, to generate the meaning λx.x@stateid(Texas) for the noun phrase " ".
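As a toy illustration of how such λ-meanings compose, the sketch below uses Python closures to stand in for λ-terms, with application ("@") realised as an ordinary function call; the tuple encoding of the logical forms is an assumption made only for this example.

```python
# Toy composition of the lambda meanings discussed above.
def stateid(name):            # meaning of "Texas": stateid(Texas)
    return ("stateid", name)

def capital(y):               # meaning of "capital": \y.\x.capital(x, y)
    return lambda x: ("capital", x, y)

# meaning of "of" (and of the possessive adposition): \y.\x.x@y
of = lambda y: (lambda x: x(y))

# "capital of Texas": (of @ stateid(Texas)) @ capital  ==>  \x.capital(x, stateid(Texas))
capital_of_texas = of(stateid("Texas"))(capital)
print(capital_of_texas("?x"))   # ('capital', '?x', ('stateid', 'Texas'))
```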

3.3 An Enhanced Generalization

We are interested in improving how unknown words are handled when they are encountered in a semantic parsing task. Generalization, one of the more popular solutions for this problem, has several limitations, as described in the Related Work section. To counter these limitations, we introduce a more human-like approach. When a person encounters an unknown word, they can often guess its meaning in one of two ways: (1) by searching for other words which may occur in a similar context, or (2) by searching for other words which are similar in meaning. For the first strategy, consider the example from the introduction: an individual trying to identify the meaning of the word 'chateau' may look to other words which occur in similar contexts. For the second strategy, consider the word 'rainforest': if an individual knows the meaning of the word 'forest' but has never seen the word 'rainforest', they may infer that 'forest' and 'rainforest' are similar words and apply the meaning of 'forest' to 'rainforest'. Both of these thought processes inspire our methodology for a new Generalization approach that utilizes different types of vector space representations. In the next section, we provide a brief overview of several approaches to vector space representations and why we chose them; we then describe how we use these representations in a new approach to Generalization.

3.4 Word Vector Representations

Representing words in vector space has become an extremely popular and powerful technique within the field of natural language processing. It involves representing individual words with real-valued vectors; applying vector arithmetic on


these word vectors can capture fine-grained semantic and syntactic regularities depending on how the vectors are trained. Application of these vectors to many problems involving some level of word-sense disambiguation has shown great success, including in word analogy [23] detection, grammatical parsing [26], and named entity recognition [29].

Fig. 5. New triplet template set for our approach. *word is any given word

We posed two specific research questions around using vector representations to improve Generalization. Regarding the first question, how Word2Vec and WordNet, as methods of training word vectors, differ in their influence on the accuracy of a semantic parser, it is possible that the answer depends on how semantic parsing maps a natural-language sentence into a formal representation of its meaning. Regarding the second question, how different approaches to training word vectors may influence semantic parsing in English versus Japanese, it is possible that training based on word sense is more suitable due to the grammatical differences inherent in Japanese. For training based on WordNet, we adopt the WS4J (WordNet Similarity for Java) API1 and the Wu and Palmer similarity metric for WordNet [35] for evaluating word similarity. For Word2Vec, vectors can be trained using one of two prediction-based methods, continuous bag of words and skip-gram; we use word vectors trained with skip-gram.
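The sketch below contrasts the two similarity views in Python, using gensim and NLTK's WordNet interface as stand-ins (the paper's own implementation uses the Java WS4J API); the model file path is a placeholder.

```python
# Sketch of the two similarity measures: prediction-based (Word2Vec) vs.
# sense-based (WordNet, Wu-Palmer). gensim and NLTK stand in for the Java WS4J API.
from gensim.models import Word2Vec
from nltk.corpus import wordnet as wn

def w2v_similarity(model, w1, w2):
    """Cosine similarity between skip-gram vectors."""
    return model.wv.similarity(w1, w2)

def wup_similarity(w1, w2):
    """Best Wu-Palmer similarity over all sense pairs of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

if __name__ == "__main__":
    model = Word2Vec.load("word2vec_text8.model")        # placeholder path
    print(w2v_similarity(model, "forest", "rainforest"))
    print(wup_similarity("forest", "rainforest"))
```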

3.5 A New Generalization Approach

We introduce a new Generalization approach with a new semantic template structure. In this new structure, we utilize a template triplet list which has the CCG category of a word, the word itself, and the semantic template of the word. This triplet list helps to find the most similar word under the same CCG category with some vector space representation. Figure 5 shows the example of the triplet list. First, the list is categorized by CCG category. Then, it is classified by word and its template(s). In the proposed framework, when the new Generalization approach is called, either pre-trained vector space representations are utilized to identify a single lexicon entry which is the most similar to the unknown word or we utilize the word-sense representation (measured using Wu and Palmer metric) to identify 1

goo.gl/MAj70v.


Algorithm 2: Enhanced Generalization algorithm
Input: missingWord, KnownWordList
Output: generated meanings of missingWord
bestMatchWord ← ∅
bestSimilarityScore ← 0
for word ∈ KnownWordList do
    similarityScore ← CheckSimilarity(missingWord, word)
    if similarityScore ≥ bestSimilarityScore then
        bestMatchWord ← word
        bestSimilarityScore ← similarityScore
    end
end
Generate the meaning of missingWord with the template(s) related to bestMatchWord

a single lexicon entry to match to the unknown word. The vector space representations can be trained prior to running the system or during the training phase, when an initial input dictionary is given. For future work, the algorithm is flexible enough to allow for any method of word representation. To train the vector space representations for this new Generalization, we combine the initial input to the parser with the Text8 corpus. For JNL2KR, we train the vector representations with the first 1 GB of the Japanese Wikipedia corpus, without including the data input to the parser. The structure of the proposed Generalization approach is described in Algorithm 2, where missingWord is the word which is unknown or missing a meaning, and KnownWordList is the list of words from the template triplet list that share the same CCG category as the missingWord.
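The following is a minimal Python sketch of Algorithm 2. It assumes a similarity function (for example one of the two sketched in Sect. 3.4) and a triplet list keyed by CCG category; the toy triplet list, the naive template substitution and the trivial similarity function in the usage example are illustrative only.

```python
# Minimal sketch of the enhanced Generalization (Algorithm 2): pick the known
# word of the same CCG category most similar to the missing word and reuse
# its semantic template(s). Similarity and triplet list are assumed given.
def enhanced_generalization(missing_word, ccg_category, triplet_list, similarity):
    """triplet_list maps a CCG category to {word: [templates]}."""
    candidates = triplet_list.get(ccg_category, {})
    best_word, best_score = None, -1.0
    for word in candidates:
        score = similarity(missing_word, word)
        if score >= best_score:
            best_word, best_score = word, score
    if best_word is None:
        return []
    # Naive instantiation: substitute the best match's surface form in its templates.
    return [template.replace(best_word, missing_word)
            for template in candidates[best_word]]

# Hypothetical usage with a toy triplet list and a trivial similarity function.
triplets = {"N": {"forest": ["forest(x)"], "city": ["city(x)"]}}
print(enhanced_generalization("rainforest", "N", triplets,
                              lambda a, b: 1.0 if b in a else 0.0))
# -> ['rainforest(x)']
```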

4 Experimental Evaluation

We evaluate our methods on GeoQuery250 and GeoQuery880 for both English and Japanese, and on BioKR and Jobs for English. The BioKR corpus is specific to the biology domain: we collected 300 biology "how"-type questions from a question-and-answer site2 and manually converted them to the Answer Set Programming language [11]. We follow the same simplification strategy for BioKR, replacing proper nouns and biology keywords such as helium with generic terms like ent0 and ent1, resulting in a total of 101 unique sentences. The Jobs corpus [28] has 640 job-related queries written in English; the meanings of the queries are written in the Prolog programming language. This corpus contains a training split of 500 sentences and a test split of 140 sentences. We follow the same simplification strategy for Jobs, replacing the

http://www.biology-questions-and-answers.com.


We report the performance in terms of precision, recall, F1 score, and accuracy. The initial experimentation on GeoQuery250 and GeoQuery880 (both English and Japanese) is performed using 10-fold cross-validation for GeoQuery250, and we report the performance in terms of the number of learned vocabulary entries, precision (percentage of returned logical forms that are correct), recall (percentage of sentences for which the correct logical form was returned), and F1 score (harmonic mean of precision and recall). In order to compare performance on the Jobs corpus directly to the original NL2KR system, we follow the same evaluation approach laid out by Vo et al. [32], utilizing a manually prepared input dictionary. To evaluate under the same conditions, we prepare two initial lexicons, one for Japanese and one for English. The Japanese initial lexicon for our JNL2KR system is manually created, and the English initial lexicon is the one used by Vo et al. [32]. We compare the performance of our system, JNL2KR, and the enhanced Generalization algorithm with λ-WASP [34], UBL & UBL-s [14], and FUBL [15]. Since their systems deal with segmented Japanese transliteration sentences, we also create Japanese transliteration sentences from our Japanese sentences and compare performance on them (JNL2KR-Tran in Table 1 and Table 2). In addition, we compare the performance of the JNL2KR system with and without the Phrase Override algorithm (JNL2KR w/o PO). When we apply the enhanced Generalization algorithm to both the NL2KR system and the JNL2KR system, the performance increases in most cases.

Table 1 shows the results for GeoQuery250 in Japanese. The size of the initial lexicon is 250. The enhanced Generalization methods outperform the original system in terms of precision, recall, F1 score, and accuracy. JNL2KR+WS4J outputs a smaller learned lexicon after training, which means that many of the meanings in the initial lexicon are removed and few new words are added; JNL2KR+WS4J performs best on GeoQuery250 in Japanese. Table 2 shows the results for GeoQuery880 in Japanese. The original JNL2KR system performs better in this case. We suspect that the enhanced Generalization with a small lexicon cannot generate as many missing meanings as the original Generalization, since GeoQuery880 contains longer and more complicated sentences with more unknown words.

Table 1. Recall, precision, F1 score, and accuracies on GeoQuery250 in Japanese

Method        | Trained dict size | # of new meanings | Recall | Precision | F1   | Acc
λ-WASP        | –                 | –                 | 81.2   | 90.1      | 85.8 | –
UBL           | –                 | –                 | 78.9   | 90.9      | 84.4 | –
UBL-s         | –                 | –                 | 83.0   | 83.2      | 83.1 | –
FUBL          | –                 | –                 | 83.2   | 83.8      | 83.5 | –
JNL2KR        | 530.4             | 280.4             | 88.4   | 99.4      | 93.5 | 88.1
JNL2KR w/o PO | 511.5             | 261.6             | 84.7   | 96.7      | 90.2 | 82.4
JNL2KR-Tran   | 470.2             | 220.2             | 14.9   | 98.3      | 25.3 | 14.8
JNL2KR+W2V    | 360.8             | 110.8             | 91.1   | 99.5      | 95.1 | 90.8
JNL2KR+WS4J   | 244.2             | −5.8              | 92.4   | 99.5      | 95.8 | 92.1


Table 2. Recall, precision, F1 score, and accuracies on GeoQuery880 in Japanese

Method        | Trained dict size | # of new meanings | Recall | Precision | F1   | Acc
JNL2KR        | 994.2             | 737.2             | 44.7   | 99.4      | 61.7 | 44.7
JNL2KR w/o PO | 952.4             | 695.4             | 40.9   | 98.3      | 57.8 | 40.7
JNL2KR+W2V    | 599               | 342               | 43.8   | 99.8      | 60.8 | 43.8
JNL2KR+WS4J   | 472.5             | 215.5             | 18.9   | 98.5      | 31.6 | 18.9

Table 3. Accuracies on GeoQuery880 in English

Method           | Accuracy
ZC07 [39]        | 86.1
UBL              | 87.9
FUBL             | 88.6
KCAZ13 [13]      | 89.0
DCS+L [16]       | 87.9
TISP [41]        | 88.9
SEQ2SEQ [7]      | 84.6
SEQ2TREE [7]     | 87.1
ASN [24]         | 85.7
ONESTAGE [8]     | 85.0
COARSE2FINE [8]  | 88.2
NL2KR            | 88.6
NL2KR+W2V        | 92.5
NL2KR+WS4J       | 89.6

This is the first time GeoQuery880 has been evaluated in Japanese. The original JNL2KR performs better in recall, F1 score, and accuracy, but our approach with Word2Vec (JNL2KR+W2V) achieves the best precision while learning almost half as many new meanings as JNL2KR. For the Japanese corpus, both Word2Vec and WS4J achieve the highest precision, and WS4J has the highest F1 score. Since the Japanese CCG parser cannot parse most of the Japanese transliteration sentences, the recall of JNL2KR-Tran is the lowest of all methods. In addition, the recall of all approaches on Japanese GeoQuery880 is low, because its sentences are longer and more complicated than those of Japanese GeoQuery250, so correct parse trees cannot be generated for more of the Japanese sentences.


Table 4. Statistics of learned dictionary on GeoQuery880 in English

Method      | Trained dict size | # of new meanings
NL2KR       | 1792              | 1557
NL2KR+W2V   | 612               | 377
NL2KR+WS4J  | 512               | 277

Table 5. Exact-match accuracy on the BioKR data set

System      | Trained dict size | # of new meanings | Recall | Precision | F1
NL2KR       | 169               | 114               | 64     | 100       | 76.4
NL2KR+W2V   | 83                | 28                | 70.9   | 100       | 82.1
NL2KR+WS4J  | 82.2              | 27.2              | 72.1   | 100       | 83

Table 6. Exact-match accuracy on the Jobs data set

System      | Recall | Precision | F1
NL2KR       | 96.9   | 91.9      | 94.3
NL2KR+W2V   | 96.4   | 100       | 98.2
NL2KR+WS4J  | 96.4   | 100       | 98.2

Table 3 shows the results for GeoQuery880 in English. Our enhanced Generalization also works in the NL2KR system and improves its accuracy; NL2KR+W2V reaches the highest accuracy on GeoQuery880 in English. Moreover, Table 4 shows that our enhanced Generalization reduces the number of learned lexicon entries for both Word2Vec and WS4J. Table 5 shows the results on BioKR. The output dictionary of our approach with Word2Vec (W2V) for BioKR is no more than half the size of the unmodified NL2KR dictionary, and we achieve better recall and F1 scores with both approaches. The smaller output dictionary and improved accuracy indicate that decreasing the number of miscellaneous and inaccurate meanings can improve overall accuracy on multiple corpora. Interestingly, our approach utilizing WS4J reaches the highest recall and F1 score on BioKR. For the Jobs corpus, we follow Vo et al. [32] and utilize a manually prepared input dictionary. This means the dictionary is the same for all approaches; this is unique to this corpus and allows us to evaluate not only how these approaches perform on learning, but specifically how they perform on the translation process. The results in Table 6 show that the approaches using W2V and WS4J achieve higher F1 scores than Vo et al., largely as a result of perfect precision. While the recall is not as high, this result shows that our approach is useful not only in the learning process but also in the translation process.

5 Conclusion

We described the system architecture and algorithms, then evaluated our system on the GeoQuery250 and GeoQuery880 datasets in Japanese, and on GeoQuery880, BioKR, and Jobs in English. The proposed Phrase Override and enhanced Generalization algorithms are useful for extending NL2KR to Japanese. Our enhanced Generalization approach with word vector representations is a simple method; however, the experimental results in both English and Japanese show that it reduces the number of newly learned meanings during training and improves performance in most cases. In the experiments with the GeoQuery250 dataset, our system achieves the highest recall, precision, and F1 scores, which means that it can learn and convert Japanese sentences into the target language effectively. However, there is still scope for improvement, since this is the first time the Japanese CCG parser has been applied to the NL2KR system. For future work, we have three major ideas. First, we will work with Noji et al. [22] to improve the accuracy of the Japanese CCG parser. Their latest work reports improvements in the Japanese CCG parser's accuracy [37,38], but we found that eight sentences in Japanese GeoQuery880 still cannot be parsed correctly. If we can improve this part and parse the problematic sentences that currently have no parse tree or an incorrect parse tree, our recall will be better. Second, we will evaluate our system on larger datasets. In Japanese, our GeoQuery corpus is the only one available for this kind of task; in English, there are larger corpora such as the ATIS dataset (5,410 queries to a flight booking system). We need to find a larger corpus for evaluating our system in Japanese. Finally, we will extend our system to other languages, especially agglutinative languages; for example, Turkish has a CCG parser [5] that we plan to apply to our system.

References 1. Baral, C., Dzifcak, J., Kumbhare, K., Vo, N.H.: The NL2KR system. In: Proceedings of LPNMR 2013 (2013) 2. Bekki, D.: Formal Theory of Japanese Syntax. Kuroshio Shuppan (2010). (in Japanese) 3. Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on freebase from question-answer pairs. In: Proceedings of EMNLP, pp. 1533–1544 (2013) 4. Berant, J., Liang, P.: Semantic parsing via paraphrasing. In: Proceedings of ACL (2014) 5. Cakici, R.: Automatic induction of a CCG grammar for Turkish. In: ACL 2005, pp. 73–78 (2005) 6. Curran, J., Clark, S., Bos, J.: Linguistically motivated large-scale NLP with C&C and boxer. In: Proceedings of ACL 2007, pp. 33–36. ACL, Prague, June 2007 7. Dong, L., Lapata, M.: Language to logical form with neural attention. In: Proceedings of ACL 2016 (2016)


8. Dong, L., Lapata, M.: Coarse-to-fine decoding for neural semantic parsing. In: Proceedings of ACL 2018, pp. 731–742 (2018). https://www.aclweb.org/anthology/ P18-1068/ 9. Gardner, M., Krishnamurthy, J.: Open-vocabulary semantic parsing with both distributional statistics and formal knowledge. In: Proceedings of AAAI, pp. 3195– 3201 (2017) 10. Gaur, S., Vo, N.H., Kashihara, K., Baral, C.: Translating simple legal text to formal representations. In: JSAI-isAI 2014, pp. 259–273 (2014) 11. Gelfond, M., Lifschitz, V.: The stable model semantics for logic programming. In: Kowalski, R., Bowen, K. (eds.) Logic Programming: Proceedings of the Fifth International Conference and Symposium, pp. 1070–1080. MIT Press (1988) 12. Krishnamurthy, J., Mitchell, T.M.: Learning a compositional semantics for freebase with an open predicate vocabulary. TACL 3, 257–270 (2015) 13. Kwiatkowski, T., Choi, E., Artzi, Y., Zettlemoyer, L.S.: Scaling semantic parsers with on-the-fly ontology matching. In: Proceedings of EMNLP 2013, pp. 1545–1556 (2013) 14. Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., Steedman, M.: Inducing probabilistic CCG grammars from logical form with higher-order unification. In: Proceedings of the EMNLP 2010, pp. 1223–1233. Association for Computational Linguistics (2010) 15. Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., Steedman, M.: Lexical generalization in CCG grammar induction for semantic parsing. In: Proceedings of the EMNLP 2011, pp. 1512–1523. Association for Computational Linguistics (2011). http://dl.acm.org/citation.cfm?id=2145593 16. Liang, P., Jordan, M.I., Klein, D.: Learning dependency-based compositional semantics. Comput. Linguist. 39(2), 389–446 (2013) 17. Lierler, Y., Sch¨ uller, P.: Parsing combinatory categorial grammar via planning in answer set programming. In: Correct Reasoning, pp. 436–453. Springer (2012) 18. Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Experimental Robotics, pp. 403– 415. Springer (2013). http://link.springer.com/chapter/10.1007/978-3-319-000657 28 19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Word2vec (2014). https://code. google.com/p/word2vec 20. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995) 21. Mineshima, K., Tanaka, R., Mart´ınez-G´ omez, P., Miyao, Y., Bekki, D.: Building compositional semantics and higher-order inference system for a wide-coverage Japanese CCG parser. In: Proceedings of EMNLP 2016, pp. 2236–2242 (2016) 22. Noji, H., Miyao, Y., Johnson, M.: Using left-corner parsing to encode universal structural constraints in grammar induction. In: Proceedings of the EMNLP 2016, Austin, Texas, USA, 1–4 November 2016, pp. 33–43 (2016) 23. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014). http://www.aclweb.org/anthology/D14-1162 24. Rabinovich, M., Stern, M., Klein, D.: Abstract syntax networks for code generation and semantic parsing. In: Proceedings of ACL 2017, pp. 1139–1149 (2017) 25. Reddy, S., T¨ ackstr¨ om, O., Collins, M., Kwiatkowski, T., Das, D., Steedman, M., Lapata, M.: Transforming dependency structures to logical forms for semantic parsing. TACL 4, 127–140 (2016)


26. Socher, R., Bauer, J., Manning, C.D., Ng, A.Y.: Parsing with Compositional Vector Grammars. In: ACL, no. 1, pp. 455–465 (2013) 27. Steedman, M.: The Syntactic Process. MIT Press, Cambridge (2000) 28. Tang, L.R., Mooney, R.J.: Using multiple clause constructors in inductive logic programming for semantic parsing. In: j-LECT-NOTES-COMP-SCI, vol. 2167, 466–477 (2001) 29. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of ACL 2010, pp. 384–394. Association for Computational Linguistics (2010) 30. Uematsu, S., Matsuzaki, T., Hanaoka, H., Miyao, Y., Mima, H.: Integrating multiple dependency corpora for inducing wide-coverage Japanese CCG resources. In: Proceedings of ACL 2013 (2013) 31. Uematsu, S., Matsuzaki, T., Hanaoka, H., Miyao, Y., Mima, H.: Integrating multiple dependency corpora for inducing wide-coverage Japanese CCG resources. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 14(1), 1:1–1:24 (2015) 32. Vo, N.H., Mitra, A., Baral, C.: The NL2KR platform for building natural language translation systems. In: Proceedings of ACL (2015) 33. Wong, Y.W., Mooney, R.J.: Learning for semantic parsing with statistical machine translation. In: Proceedings of HLT-NAACL 2006 (HLT-NAACL 2006), pp. 439– 446. ACL, Stroudsburg (2006). https://doi.org/10.3115/1220835.1220891 34. Wong, Y.W., Mooney, R.J.: Learning synchronous grammars for semantic parsing with lambda calculus. In: Proceedings of the 45th Annual Meeting of the ACL (ACL 2007), Prague, Czech Republic, June 2007. http://www.cs.utexas.edu/users/ ai-lab/?wong:acl07 35. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: ACL 1994, pp. 133– 138. Association for Computational Linguistics, Stroudsburg (1994) 36. Yao, X., Durme, B.V.: Information extraction over structured data: question answering with freebase. In: Proceedings of ACL, pp. 956–966 (2014) 37. Yoshikawa, M., Mineshima, K., Noji, H., Bekki, D.: Consistent CCG parsing over multiple sentences for improved logical reasoning. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, 1–6 June 2018, Volume 2 (Short Papers), pp. 407–412 (2018). https://www. aclweb.org/anthology/N18-2065/ 38. Yoshikawa, M., Noji, H., Matsumoto, Y.: A* CCG parsing with a supertag and dependency factored model. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 277–287. Association for Computational Linguistics, Vancouver, July 2017. http://aclweb. org/anthology/P17-1026 39. Zettlemoyer, L.S., Collins, M.: Online learning of relaxed CCG grammars for parsing to logical form. In: Proceedings of EMNLP-CoNLL-2007 (2007) 40. Zhang, Y., Clark, S.: Shift-reduce CCG parsing. In: Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, pp. 683–692. ACL (2011). http://aclweb.org/anthology/P11-1069 41. Zhao, K., Huang, L.: Type-driven incremental semantic parsing with polymorphism. In: NAACL HLT 2015, pp. 1416–1421 (2015)

Authorship Identification for Arabic Texts Using Logistic Model Tree Classification Safaa Hriez(B) and Arafat Awajan Princess Sumaya University for Technology, King Hussein Faculty of Computing Sciences, Amman, Jordan [email protected], [email protected]

Abstract. Identifying the author of an anonymous text based on a predefined collection of sample texts for candidate authors is called authorship identification. Due to the huge increase in the volume of digital texts, authorship identification has become a necessity. This paper proposes a novel set of features, in addition to the known features in the field, in order to improve the authorship identification process for Arabic texts. The proposed framework is evaluated using a dataset that is part of the Authorship attribution of Ancient Arabic Texts (AAAT) dataset. The classifier used in this research is the Logistic Model Tree classifier, which works well with small datasets. The experimental results show an accuracy of 88.73%. Such results support the efficacy of the proposed framework for identifying the author of an anonymous text. Keywords: Authorship identification · AAAT dataset · Logistic model tree · Writing style features

1 Introduction
Authorship identification is the process of identifying the writer of unknown texts based on a predefined list of texts for a group of authors. This process was used for the first time in the nineteenth century on the plays of Shakespeare. These days, the need for authorship identification has increased because of the huge amount of anonymous text transmitted over the network, such as blogs, tweets, posts, and emails [1]. The process of identifying authors is applied in many fields, such as computer forensics, criminal investigation, and intelligence. In computer forensics, it is used to identify the writer of source code that may damage data or computers [2]. Authorship identification concerns two issues. The first issue is the extraction of features from the texts that can aid in determining the writing styles of the authors. The second is the choice of appropriate methods that may help in predicting the right author. In the field of authorship identification, the features extracted from the text are categorized, based on the level of analysis, into six main categories: lexical, character, syntactic, structural, semantic, and content-specific features [1, 3].


In the lexical level of analysis, the lengths of the sentences and words are calculated [4], the richness of the vocabulary is measured [5], and the frequencies of words and word n-grams are considered. In the character level of analysis, the frequencies of characters and their types, in addition to character n-grams, are extracted from the texts. For syntactic features, researchers examine the patterns that are used to compose a sentence or chunk of text [3], and study the part-of-speech tag of each word in the text. The structural features take into consideration the total numbers of lines, sentences, and paragraphs; furthermore, the numbers of sentences, characters, and words per paragraph are considered in the analysis. The semantic features study the synonyms of the words and the semantic dependencies between words. The last category, content-specific features, measures the frequencies of keywords in the texts. Authorship identification has been considered for many languages, such as English, French, Chinese, and Arabic. In some languages researchers have achieved impressive results, but in other languages the work is still not mature. For the Arabic language, the amount of work done on authorship identification is less than for English. In this paper, a novel set of features is proposed in addition to the known features in the field of authorship identification. It is named the Writing Style feature set and includes four features. A total of twenty features are used in this research as input to the Logistic Model Tree classifier in order to distinguish the author of an anonymous text written in the Arabic language. The rest of the paper presents the proposed framework in detail. Section 2 illustrates the features of the Arabic language. Section 3 presents the research that has been done in the field of authorship identification. The proposed framework is presented in Sect. 4. Section 5 discusses the experiments and the evaluation of the proposed framework. Finally, a summary of the paper is presented in Sect. 6.

2 Arabic Language Features
In natural languages, there is a set of features that characterizes each language. In this section, the main features of the Arabic language, including word structure, word categories, and morphological patterns, are presented.
Words Structure. In the Arabic language, the alphabet is formed of 28 characters (letters). Words are formed by groups of letters; the start and end of a word are determined by spaces, and consecutive words are separated by a space. Special marks are used as short vowels; they are also known as diacritical marks [6]. These marks follow each letter in the word. However, the use of these marks is no longer popular, especially in digital texts, which creates an additional challenge for researchers who use such texts in their studies.
Words Categories. In the Arabic language, words are categorized into two main categories: derivative and non-derivative words. The derivative words are those generated by applying derivative rules to basic entities named roots. Most of the


derivative words are generated from three- or four-letter roots. On the other hand, the non-derivative words are those that do not have roots, such as words borrowed from other languages and functional words. The functional words include conjunctions, prepositions, demonstrative pronouns, relative pronouns, question words, etc. They are the most frequent words in Arabic texts [7].
Morphological Patterns. All types of derivative words, including nouns, adjectives, verbs, and adverbs, are derived from roots based on predefined patterns and derivative rules. The patterns play a role in determining many characteristics of the words, including gender (masculine or feminine) and tense (past, present, and imperative). The patterns also determine the number (singular, plural, or dual). In addition, morphemes, including suffixes and prefixes, can be attached to the derivative words [6, 7].

3 Related Works
Several works have considered authorship identification for Arabic texts. These works studied different types of texts, including books, poems, articles, emails, messages, and tweets. This section presents a brief review of these works. In an early study, Abbasi et al. [3] used lexical, syntactic, structural, and content-specific features in order to identify the authors of messages taken from Yahoo groups; they applied Support Vector Machine (SVM) and decision tree classifiers. In [8], Estival et al. extracted lexical, syntactic, and semantic features from 8,028 emails written by 1,030 authors in order to identify the authors' traits. Another study was done by Shaker et al. [9], which considered functional words to identify the authors; they used an evolutionary algorithm with a linear discriminant analysis classifier. In [10], a dataset called the Authorship attribution of Ancient Arabic Texts (AAAT) was proposed; the authors applied an SVM classifier with lexical features, and the best results (80%) were achieved using word n-grams, character n-grams, and the rare words in the texts. In [11], the researchers investigated the best classifier for the AAAT dataset using lexical features and concluded that SVM is the most appropriate classifier among the six classifiers tested. Baraka et al. [12] used lexical and syntactic features with an SVM classifier to identify the authors of 313 articles written by eight authors. Another work was done by Altheneyan et al. [13], who used the Naïve Bayes classifier to identify the authors of thirteen books written by ten authors, using lexical and structural features. In [14], the authors of five hundred articles written by five authors were identified using SVM and Naïve Bayes classifiers; that research was based on lexical features only. Otoom et al. [15] proposed a model to identify authors by extracting lexical, syntactic, structural, and content-specific features; they investigated 456 articles written by seven authors using an SVM classifier. Al-Ayyoub et al. [16] used the same sets of features to identify the authors of 14,039 articles written by 42 authors, using three classifiers: SVM, Naïve Bayes, and Bayes Networks.


Albadarneh et al. [17] used big data analytics to address the authorship authentication problem for 53,205 messages written by twenty authors. In [18], the authors of eighteen books written by three authors were identified using lexical features, employing the computational method, the Burrows-Delta method, and the Winnow algorithm. Bourib et al. [19] used lexical features to identify 25 documents written by five authors, using SVM, MLP, linear regression, the Stamatatos distance, and the Manhattan distance. In summary, the most used classifier in the literature is the Support Vector Machine, and the most used set of features is the lexical features. This paper proposes a novel set of features not mentioned in any previous work, together with the Logistic Model Tree classifier, which has not been used in any previous work, in order to improve the authorship identification process.

4 Methodology
The proposed framework consists of three phases: text analysis, feature extraction, and author classification. Figure 1 illustrates the proposed framework. The input of the framework is a set of texts written by multiple authors in the Arabic language. In the first phase, the input texts are analyzed in order to calculate some measures and preprocess the texts so that they are ready for the next phase.

Fig. 1. The proposed framework for Authorship Identification.


Feature extraction is the second phase. The types of features extracted in this phase are Structural, Lexical, Syntactic, Semantic, and Writing Style features. These features are the input to the third phase, in which the Logistic Model Tree classifier is used to classify the authors. The following subsections present these phases in detail.

4.1 Text Analysis
This phase starts with tokenizing the texts. Then, a suitable tag is assigned to each word in the texts. After that, the type of each phrase in the texts is determined; three phrase types are used: the noun phrase, the verb phrase, and the partial phrase. The active or passive voice of the verb phrases is also determined, as are the pronouns and the function words. Next, for each text, the unigram, bigram, and trigram probabilities are calculated. The unigram probability of each word is calculated using Eq. 1, where $P_{W_i}$ is the probability of the $i$th word, $count(W_i)$ is the frequency of the $i$th word in the text, and $\#\ of\ words$ is the total number of words in the text.

$P_{W_i} = \dfrac{count(W_i)}{\#\ of\ words}$   (1)

Then, Eq. 2 is used to find the unigram probability of each text, where $n$ is the number of words in the text and $P_{W_i}$ is the probability of the $i$th word.

$Unigram\ for\ text_i = \prod_{i=1}^{n} P_{W_i}$   (2)

The bigram probability of every pair of consecutive words is calculated using Eq. 3, where $P_{W_i|W_{i-1}}$ is the probability of the $i$th word occurring preceded by word $i-1$, $Count(W_{i-1}W_i)$ is the frequency of the $i$th word preceded by word $i-1$, and $Count(W_{i-1})$ is the frequency of word $i-1$ in the text.

$P_{W_i|W_{i-1}} = \dfrac{Count(W_{i-1}W_i)}{Count(W_{i-1})}$   (3)

After that, Eq. 4 is used to calculate the bigram probability of each text, where $m$ is the total number of words in the text minus one and $P_{W_i|W_{i-1}}$ is the conditional probability of the $i$th word given that word $i-1$ precedes it.

$Bigram\ for\ text_i = \prod_{i=1}^{m} P_{W_i|W_{i-1}}$   (4)

The trigram probability of every three consecutive words is calculated using Eq. 5, where $P_{W_i|W_{i-2}W_{i-1}}$ is the probability of the $i$th word occurring preceded by words $i-1$ and $i-2$, $Count(W_{i-2}W_{i-1}W_i)$ is the frequency of the three consecutive words in the text, and $Count(W_{i-2}W_{i-1})$ is the frequency of words $i-1$ and $i-2$ occurring together in the text.

$P_{W_i|W_{i-2}W_{i-1}} = \dfrac{Count(W_{i-2}W_{i-1}W_i)}{Count(W_{i-2}W_{i-1})}$   (5)

Finally, the trigram probability of the text is calculated using Eq. 6, where $c$ is the total number of words in the text minus two and $P_{W_i|W_{i-2}W_{i-1}}$ is the conditional probability of the $i$th word given that words $i-1$ and $i-2$ precede it.

$Trigram\ for\ text_i = \prod_{i=1}^{c} P_{W_i|W_{i-2}W_{i-1}}$   (6)
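The following is a small, illustrative Python reimplementation of Eqs. 1-6 (not the authors' code); it assumes the text has already been tokenized into a list of words, and it works in log space only to avoid numerical underflow of the products.

from collections import Counter
import math

def ngram_scores(tokens):
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

    # Eq. 2: product of Eq. 1 over all words of the text.
    unigram = math.exp(sum(math.log(uni[w] / n) for w in tokens))

    # Eq. 4: product of Eq. 3 over all pairs of consecutive words.
    bigram = math.exp(sum(math.log(bi[(a, b)] / uni[a])
                          for a, b in zip(tokens, tokens[1:])))

    # Eq. 6: product of Eq. 5 over all triples of consecutive words.
    trigram = math.exp(sum(math.log(tri[(a, b, c)] / bi[(a, b)])
                           for a, b, c in zip(tokens, tokens[1:], tokens[2:])))
    return unigram, bigram, trigram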

4.2 Feature Extraction
The features used in this research are categorized into five sets, as shown in Table 1. The total number of features is twenty, categorized into Structural, Lexical, Syntactic, Semantic, and Writing Style features.

Table 1. Feature sets used in this research.

Feature set             | Description
Structural              | The number of characters; the number of words; the number of chunks; the average length of the words; the maximum length of the words; the minimum length of the words; the unigram probability of the text; the bigram probability of the text; the trigram probability of the text
Lexical features        | The number of pronouns; the number of function words
Syntactic features      | The number of noun phrases; the number of verb phrases; the number of partial phrases
Semantic features       | The number of active voice verbs; the number of passive voice verbs
Writing style features  | Does the author use the same pattern for the last word in all sentences (assonance)? Does the author use nonessential sentences in the text? Does the author use quotes? Does the author use reported speech?
Total                   | 20 features


Structural Features. This set consists of nine features, including the average, maximum, and minimum length of the words in the text. It also includes the number of characters, the total number of words, and the total number of chunks in the text. The other structural features are the unigram, bigram, and trigram probabilities, which were calculated in the previous phase.
Lexical Features. This set includes two features: the total number of pronouns in the text and the total number of function words.
Syntactic Features. This set contains three features: the total numbers of noun phrases, verb phrases, and partial phrases.
Semantic Features. Two features are extracted in this set: the total numbers of active and passive voice verbs in the text.
Writing Style Features. Four features are extracted in this set. The first feature determines whether the author uses the same pattern for the last word of each sentence, which is called assonance. The second feature determines whether the author uses sentences enclosed between two hyphens to indicate that the sentence is nonessential. The third feature checks whether the author uses quotes in the texts. Finally, the last feature determines whether the author uses reported speech in the writing.
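A rough Python sketch of how the four Writing Style features could be computed is shown below. The surface-level heuristics (sentence splitting on punctuation, the final two letters as the "pattern", the quote characters, and a few Arabic reporting verbs) are our own simplifying assumptions, not the paper's implementation.

import re

def writing_style_features(text):
    sentences = [s.strip() for s in re.split(r"[.!?\u061F]+", text) if s.strip()]
    # 1) Assonance: do all sentences end with the same word-final pattern?
    endings = {s.split()[-1][-2:] for s in sentences if s.split()}
    assonance = len(sentences) > 1 and len(endings) == 1
    # 2) Nonessential sentences enclosed between two hyphens.
    nonessential = bool(re.search(r"-[^-]+-", text))
    # 3) Quotations.
    quotes = any(q in text for q in ('"', '\u00AB', '\u00BB'))
    # 4) Reported speech, approximated by a small list of reporting verbs (assumed).
    reported = any(marker in text for marker in ["قال", "قالت", "ذكر"])
    return [int(assonance), int(nonessential), int(quotes), int(reported)]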

4.3 Logistic Model Tree Classifier
The logistic model tree combines a decision tree structure in the non-terminal nodes with logistic regression functions at the terminal nodes, much as a model tree is a regression tree with regression functions at the terminal nodes. As in ordinary decision trees, a test on an attribute is associated with every inner node. For a nominal attribute with n values, the node has n child nodes, and instances are sorted down one of the n branches depending on their values. The model is thus made up of a tree structure consisting of a set of inner nodes N and a set of terminal nodes (leaves) T. S refers to the whole instance space, spanned by all the attributes that describe the data. The tree structure divides S into disjoint subdivisions $S_t$, each of which is represented by a leaf in the tree. Equation 7 illustrates the relationship between these subdivisions.

$S = \bigcup_{t \in T} S_t, \quad S_t \cap S_{t'} = \emptyset \ \ \text{for } t \neq t'$   (7)

Building a logistic model tree starts with building a standard classification tree. Then, a logistic regression model is built at each node using the LogitBoost algorithm, since during the pruning step every node is a candidate leaf node. Each node benefits from the parameters of the logistic regression model fit in its parent node. After the tree and the logistic model at each node have been built, CART-based pruning is used to prune the tree [20].
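To make the idea concrete, the toy Python class below pairs shallow decision-tree splits with logistic regression models at the leaves. It is only a stand-in for illustration: the real LMT of Landwehr et al. [20], as implemented in WEKA and used in this paper, builds the leaf models with LogitBoost and prunes with CART, which this sketch does not do.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

class TinyLMT:
    """Toy illustration: decision-tree splits with logistic regression at the leaves."""
    def __init__(self, max_depth=2):
        self.tree = DecisionTreeClassifier(max_depth=max_depth)
        self.leaf_models = {}

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)              # leaf id of every training instance
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            if len(np.unique(y[idx])) > 1:       # mixed leaf: fit a logistic model
                self.leaf_models[leaf] = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            else:                                # pure leaf: store the single class
                self.leaf_models[leaf] = y[idx][0]
        return self

    def predict(self, X):
        X = np.asarray(X)
        out = []
        for x, leaf in zip(X, self.tree.apply(X)):
            model = self.leaf_models[leaf]
            out.append(model.predict(x.reshape(1, -1))[0] if hasattr(model, "predict") else model)
        return np.array(out)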


5 Experimental Results and Discussion
This section reports the experimental results of the proposed framework and describes the dataset (corpora) used in these experiments. The following subsections present both in detail.

5.1 Dataset (Corpora)
The dataset used in this research is part of the Authorship attribution of Ancient Arabic Texts (AAAT) dataset. It contains pages of books written by seven authors. From these pages of text, a group of paragraphs was extracted; the total number of paragraphs is seventy-one. Table 2 describes the dataset used in this research.

Table 2. Description of the dataset that was used in this research.

Author name     | Date (AD)  | Book title                                                 | Chosen pages       | # of paragraphs
Ibn Batuta      | 1325–1352  | Travels of Ibn Batuta                                      | Pages 4, 10        | 21
Ibn Jubayr      | 1182–1185  | Travels of Ibn Jubayr                                      | Pages 4, 37, 110   | 10
Nasser Khasru   | 1045       | Book of the Travels                                        | Pages 2, 25        | 10
Ibn Fathlan     | 921        | Travels of Ibn Fathlan                                     | Pages 2, 5, 14     | 6
Ibn Al Mujawer  | 1233       | History of the Mustabsir                                   | Pages 8, 25        | 5
Lessan Addin    | 1684       | Khatrat Al Tife during the travel of the winter and summer | Pages 4, 6         | 9
Al Hamawi       | 1542–1608  | Hady Alathaan Annajdia to the Egypt houses                 | Pages 20, 30       | 10

5.2 Experimental Results
In order to measure the performance of the proposed framework, a 10-fold cross-validation test was used. In this test, the dataset is divided into 10 sub-groups. Each


group is used as a testing set while the remaining nine groups are used as a training set. This process is repeated 10 times so that each group is used as a testing set exactly once. The extracted features from all texts were entered into the Logistic Model Tree classifier as implemented in WEKA version 3.8.1 [21]. Then the evaluation metrics are calculated: precision, recall, F-measure, and accuracy. In order to test the quality of each feature set, six experiments were conducted: one experiment tests the whole feature set, and each of the other five experiments excludes one of the feature sets and computes the evaluation metrics. Table 3 shows the results of these experiments; the results for the whole feature set are shown in the last column of the table.

Table 3. The achieved results using each feature set.

Measure                          | Exclude structural features (11 features) | Exclude lexical features (18 features) | Exclude syntactic features (17 features) | Exclude semantic features (18 features) | Exclude writing style features (16 features) | Include all feature sets (20 features)
Correctly classified instances   | 62    | 61    | 57    | 63    | 56    | 63
Incorrectly classified instances | 9     | 10    | 14    | 8     | 15    | 8
Precision                        | 0.872 | 0.870 | 0.813 | 0.888 | 0.789 | 0.891
Recall                           | 0.873 | 0.859 | 0.803 | 0.887 | 0.789 | 0.887
F-Measure                        | 0.871 | 0.857 | 0.806 | 0.886 | 0.782 | 0.887
Accuracy                         | 87.32 | 85.91 | 80.28 | 88.73 | 78.87 | 88.73

The experiments that exclude the Structural, Lexical, Syntactic, or Semantic feature set show results close to those obtained with the whole feature set, which means that each of these sets contributes only a little to the authorship identification process. Excluding the Writing Style feature set produces a large difference from the results of the whole feature set, which means that this set contributes a great deal to the authorship identification process. In order to test the quality of the chosen classifier, an experiment was conducted to compare the results of the Decision Tree and the Logistic Model Tree; Table 4 presents the results of the two classifiers. It is clear from the table that there is a big difference between the results of the Decision Tree and the Logistic Model Tree. This difference is reasonable, since the Logistic Model Tree applies logistic regression at the leaves and the dataset is small. The achieved accuracy is 88.73%, which shows that the proposed framework possesses a strong capability to identify the author of an anonymous text.


Table 4. The achieved results using Decision Tree and Logistic Model Tree.

Measure                          | Decision Tree | Logistic Model Tree
Correctly classified instances   | 56            | 63
Incorrectly classified instances | 15            | 8
Precision                        | 0.793         | 0.891
Recall                           | 0.789         | 0.887
F-Measure                        | 0.781         | 0.887
Accuracy                         | 78.87%        | 88.73%
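A sketch of the evaluation protocol is shown below, using scikit-learn stand-ins: since the Logistic Model Tree itself is a WEKA 3.8.1 classifier, plain logistic regression and a decision tree are substituted here, and X (the 20 features per paragraph) and y (the seven author labels) are assumed to be prepared arrays.

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

def evaluate(clf, X, y):
    # 10-fold cross-validation: every instance is predicted exactly once as test data.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=cv)
    print(classification_report(y, pred, digits=3))        # precision, recall, F-measure
    print("Accuracy:", round(100 * accuracy_score(y, pred), 2))

# evaluate(DecisionTreeClassifier(random_state=0), X, y)
# evaluate(LogisticRegression(max_iter=1000), X, y)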

6 Conclusion
This paper proposed a novel set of features in order to identify the author of an anonymous text written in the Arabic language. This set of features is named the Writing Style features; it covers the use of assonance in all sentences, the use of nonessential sentences, the use of quotes, and the use of reported speech in the text. This set of features, in addition to the known structural, lexical, syntactic, and semantic features, is used to identify the author. The total number of features used in this research is 20. The proposed framework uses the Logistic Model Tree classifier, which works well with small datasets. The dataset used in this research is part of the AAAT dataset; it consists of 71 paragraphs written by seven authors. The achieved accuracy in this research is 88.73%. Such a result provides evidence of the efficacy of the proposed framework.

References 1. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 538–556 (2009) 2. Iqbal, F., Binsalleeh, H., Fung, B.C., Debbabi, M.: A unified data mining solution for authorship analysis in anonymous textual communications. Inf. Sci. 231, 98–112 (2013) 3. Abbasi, A., Chen, H.: Applying authorship analysis to extremist-group web forum messages. IEEE Intell. Syst. 20(5), 67–75 (2005) 4. Mendenhall, T.C.: The characteristic curves of composition. Science IX, 237–249 (1887) 5. de Vel, O., Anderson, A., Corney, M., Mohay, G.: Mining e-mail content for author identification forensics. SIGMOD Rec. 30(4), 55–64 (2001) 6. Ryding, K.C.: A Reference Grammar of Modern Standard Arabic. Cambridge University Press, Cambridge (2005) 7. Awajan, A.: Multilayer model for Arabic text compression. Int. Arab J. Inf. Technol. 8, 188–196 (2011) 8. Estival, D., Gaustad, T., Pham, S.B., Radford, W., Hutchinson, B.: Tat: an author profiling tool with application to Arabic emails. In: Proceedings of the Australasian Language Technology Workshop, pp. 21–30 (2007) 9. Shaker, K., Corne, D.: Authorship attribution in Arabic using a hybrid of evolutionary search and linear discriminant analysis. In: Computational Intelligence (UKCI), pp. 1–6. IEEE (2010)


10. Ouamour, S., Sayoud, H.: Authorship attribution of ancient texts written by ten Arabic travelers using a SMO-SVM classifier. In: Communications and Information Technology (ICCIT), International Conference. pp. 44–47. IEEE (2012) 11. Ouamour, S., Sayoud, H.: Authorship attribution of short historical Arabic texts based on lexical features. In: International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp. 144–147 (2013) 12. Baraka, R.S., Salem, S., Abu Hussien, M., Nayef, N., Abu Shaban, W.: Arabic Text Author Identification Using Support Vector Machines (2014) 13. Altheneyan, A.S., Menai, M.E.B.: Naïve Bayes classifiers for authorship attribution of Arabic texts. J. King Saud Univ. Comput. Inf. Sci. 26(4), 473–484 (2014) 14. Alwajeeh, A., Al-Ayyoub, M., Hmeidi, I.: On authorship authentication of Arabic articles. In: 5th International Conference on IEEE Information and Communication Systems (ICICS), pp. 1–6 (2014) 15. Otoom, A.F., Abdullah, E.E., Jaafer, S., Hamdallh, A., Amer, D.: Towards author identification of Arabic text. In: The Fifth International Conference on Information and Communication Systems (2014) 16. Al-Ayyoub, M., Alwajeeh, A., Hmeidi, I.: An extensive study of authorship authentication of Arabic articles. Int. J. Web Inf. Syst. 13(1), 85–104 (2017) 17. Albadarneh, J., Talafha, B., Al-Ayyoub, M., Zaqaibeh, B., Al-Smadi, M., Jararweh, Y., Benkhelifa, E.: Using big data analytics for authorship authentication of Arabic tweets. In: Utility and Cloud Computing (UCC) (2015) 18. Mustafa, T.K., Razzaq, A.A.A., Al-Zubaidi, E.A.: Authorship Arabic text detection according to style of writing by using (SABA) method. Asian J. Appl. Sci. (2017). (ISSN: 2321–0893) 19. Bourib, S., Sayoud, H.: Author identification on noise Arabic documents. In: 5th International Conference on Control, Decision and Information Technologies, pp. 216–221 (2018) 20. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Mach. Learn. 59(1–2), 161–205 (2005) 21. Bouckaert, R.R.: Bayesian Network Classifiers in Weka (2004)

The Method of Analysis of Data from Social Networks Using Rapidminer Askar Boranbayev1(B) , Gabit Shuitenov2 , and Seilkhan Boranbayev3 1 Nazarbayev University, Nur-Sultan, Kazakhstan

[email protected] 2 Department of Information Systems and Technology, Kazakh University of Economics,

Finance and International Trade, Nur-Sultan, Kazakhstan 3 L.N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan

Abstract. The article is devoted to the analysis of content presented in text form in social networks. Natural language text, in addition to information, can express an emotional assessment of what is being reported. The analysis of the tonality of information flows has great potential for various analytical and monitoring systems. A method is proposed, and an analysis of the tonality of information flows for the selected language environment is carried out. The method classifies each message as positive, negative, or neutral. Keywords: Data · Method · Analysis · Information · Content · Social network · Classification

1 Introduction
Data mining methods help solve many of the tasks that an analyst faces. Particular interest in data analysis methods arose in connection with the development of data collection and storage tools that allowed the accumulation of large amounts of information. Specialists from different fields of human activity faced the question of processing the collected data and turning it into knowledge. Known statistical methods cover only part of the data processing needs, and their use requires a clear idea of the desired patterns. In such a situation, data mining techniques are of particular relevance. Their main feature is to establish the presence and nature of hidden patterns in the data, while traditional methods deal mainly with a parametric assessment of patterns that have already been established [1]. With the development of social networks, a huge amount of user-generated content has appeared, a significant part of which is presented in text form. Natural language text, in addition to information, can express an emotional appreciation of what is being reported. The analysis of the tonality of information flows has great potential for application in analytical, monitoring, and other systems. There are various tools for implementing Text Mining methods, such as IBM Intelligent Miner for Text, SAS Text Miner, Semio Corporation Semio Map, Oracle Text, and RapidMiner.


2 Processing of Data in Social Media Using RapidMiner
In our work, a study of text message analytics using the RapidMiner program was carried out. In the course of the work, text mining methods, text analysis methods, and methods for classifying and clustering text were applied. Text mining is the process of studying large volumes of text resources to generate new information and transform unstructured text into structured data for use in further analysis. Today there are several basic software solutions on the market of text analytics tools. To compare functionality, the following programs were selected:

GATE (General Architecture for Text Engineering). Knime Analytics Platform. Orange software. LPU (Learning from Positive and Unlabeled Examples). RapidMiner.

Based on the analysis, it was concluded that RapidMiner meets all modern requirements in the field of text mining, it is well suited for academic purposes as it has a convenient interface and an extensive help system. RapidMiner is a program used for scientific research and analytics, it is a machine learning environment in which the user is protected from all “rough work”. Instead, he is invited to “draw” the entire desired data analysis process in the form of a chain (graph) of operators and run it for execution. The chain of operators is represented as an interactive graph and as an expression in XML (eXtensible Markup Language, the main language of the system). The system is written in Java and distributed under the AGPL Version 3 license. All main functions are accessible through the Java API and include data mining functions, such as data preprocessing, visualization, predictive analysis, etc. In addition to standard data mining functions such as data cleaning, filtering, clustering, it has built-in templates, reproducible workflows, and professional visualization environment. To work correctly with this program, we have installed the following extension packages: 1. Textprocessing Text extensions support several text formats, including plain text, HTML or PDF, as well as other data sources. It provides standard filters for tokenization, definition of passwords, etc. to provide everything necessary for the preparation and analysis of texts. 2. Web Mining The web extension provides access to various Internet sources such as web pages, RSS channels and web services. In addition to the operators for accessing these data sources, the extension also provides special operators for processing and transforming the contents of web pages in order to prepare it for further processing. To further process the web pages available through this extension, the Text Mining extension must be installed separately.

The Method of Analysis of Data from Social Networks Using Rapidminer

Fig. 1. Map of popular tweets

Fig. 2. Establishing connection with Twitter

669

670

A. Boranbayev et al.

3. WordNet Extension The WordNet extension provides operators with the ability to use the WordNet database to identify and detect related words (synonyms, hyper-names, hyponyms, etc.). 4. TextAnalysisbyAylien. Text analysis with AYLIEN allows you to analyze and extract information from text, text data, news articles, social comments, tweets and reviews - all from RapidMiner. To use trends, Twitter is used - a service that tracks all trends and displays a list of dozens of terms that have recently been popular in tweets. Most trends are simple hashtags that are required for our research. The website Trendsmap.com is a map of the world (Fig. 1), which displays various trending terms on Twitter depending on the region. The terms actually grow and shrink in size depending on their popularity, which is useful for a quick overview of global trends on Twitter. To work with Twitter (Fig. 2), we needed this service.

Fig. 3. Tweets in Excel

The Method of Analysis of Data from Social Networks Using Rapidminer

671

A special operator called searchtwitter was placed in the RapidMiner environment and its configuration was made. A connection was created to work with the tweeter site. For a correct connection to the site and further work with it, it was necessary to request an access token, and get a permission code. RapidMiner program independently sent a request, it was necessary to log in through your account. Further, after gaining access, a check of this connection was started and as the parameters of such a connection, we determined that 10,000 tweets are enough to determine the mood of users leaving comments with such a hashtag. The Russian-language environment was chosen for the study, the Russian language was chosen as the language of the studied tweets. Adding the WriteExcel operator in the RapidMiner environment allowed us to organize the import of all studied tweets on a given topic into an Excel file, and after starting the created construct, the environment produced the following results (Fig. 3). In order to directly analyze the received tweets, we needed the AnalyzeSentiment tool, for which we created our connection. To configure this connection, you had to register on the official website to get applicationID and applicationKEY. The Analyze Sentiment statement assigns each message one of three labels - positive, negative or neutral. It should be noted that the text of the messages is analyzed to determine the public’s attitude to a particular event, and not the negativity or positivity of the fact itself. The final stages are the classification of the extracted tweets in accordance with various classification schemes and taxonomy, as well as the analysis and visualization of the results obtained (Fig. 4).

Fig. 4. Classification of received tweets

672

A. Boranbayev et al.

3 Conclusion To solve this problem, we use the Excel XLMiner add-in. It is an add-on for data mining with neural networks, classification and regression trees, logistic regression, linear regression, Bayesian classifier, K-nearest neighbors, discriminant analysis, association rules, clustering, main components and much more. XLMiner provides the necessary data from various sources - PowerPivot, Microsoft/IBM/Oracle databases or spreadsheets; allows you to explore and visualize data using several related diagrams; pre-process and “clear” the data, select data mining models, etc. Interpreting the results of the extracted sample, we can conclude that for this language environment 12.4% of messages are classified as positive, 36.8% as negative and 50.8% as neutral. The results of the study allow us to conclude that the proposed method can conveniently and quickly solve problems associated with text analysis. In the future, we are planning to increase the number of experiments in order to obtain more interesting and accurate results. The proposed methods can be compared to other similar methods, using a common comparison framework. This paper can be relevant for people interested in similar case study. Also, this technique may be used for further research and analytics in [2–22].

References 1. Kotelnikov, E.V.: Automatic analysis of tonality of texts based on machine learning methods. Comput. Linguist. Intell. Technol. 11(18), 7–10 (2012) 2. Boranbayev, A., Boranbayev, S., Nurbekov, A., Taberkhan, R.: The software system for solving the problem of recognition and classification. Adv. Intell. Syst. Comput. 997, 1063–1074 (2019) 3. Boranbayev, A., Boranbayev, S., Nurbekov, A., Taberkhan, R.: The development of a software system for solving the problem of data classification and data processing. Adv. Intell. Syst. Comput. 800, 621–623 (2019) 4. Boranbayev, S., Nurkas, A., Tulebayev, Y., Tashtai, B.: Method of processing big data. Adv. Intell. Syst. Comput. 738, 757–758 (2018) 5. Boranbayev, A., Shuitenov, G., Boranbayev, S.: The method of data analysis from social networks using Apache Hadoop. Adv. Intell. Syst. Comput. 558, 281–288 (2018) 6. Boranbayev, A., Boranbayev, S., Nurusheva, A.: Analyzing methods of recognition, classification and development of a software system. Adv. Intell. Syst. Comput. 869, 690–702 (2018) 7. Boranbayev, A., Boranbayev, S., Nurusheva, A., Yersakhanov, K.: Development of a software system to ensure the reliability and fault tolerance in information systems. J. Eng. Appl. Sci. 13(23), 10080–10085 (2018) 8. Boranbayev, A., Boranbayev, S., Nurusheva, A., Yersakhanov, K.: The modern state and the further development prospects of information security in the Republic of Kazakhstan. Adv. Intell. Syst. Comput. 738, 33–38 (2018) 9. Boranbayev, A., Boranbayev, S., Yersakhanov, K., Nurusheva, A., Taberkhan, R.: Methods of ensuring the reliability and fault tolerance of information systems. Adv. Intell. Syst. Comput. 738, 729–730 (2018)

The Method of Analysis of Data from Social Networks Using Rapidminer

673

10. Akhmetova, Z., Boranbayev, S., Zhuzbayev, S.: The visual representation of numerical solution for a non-stationary deformation in a solid body. Adv. Intell. Syst. Comput. 448, 473–482 (2016) 11. Akhmetova, Z., Zhuzbaev, S., Boranbayev, S.: The method and software for the solution of dynamic waves propagation problem in elastic medium. Acta Phys. Pol., A 130(1), 352–354 (2016) 12. Hritonenko, N., Yatsenko, Y., Boranbayev, S.: Environmentally sustainable industrial modernization and resource consumption: is the Hotelling’s rule too steep? Appl. Math. Model. 39(15), 4365–4377 (2015) 13. Boranbayev, S., Altayev, S., Boranbayev, A.: Applying the method of diverse redundancy in cloud based systems for increasing reliability. In: The 12th International Conference on Information Technology: New Generations (ITNG 2015), 13–15 April 2015, Las Vegas, Nevada, USA, pp. 796–799 (2015) 14. Turskis, Z., Goranin, N., Nurusheva, A., Boranbayev, S.: A fuzzy WASPAS-based approach to determine critical information infrastructures of EU sustainable development. Sustainability (Switzerland) 11(2), 424 (2019) 15. Turskis, Z., Goranin, N., Nurusheva, A., Boranbayev, S.: Information security risk assessment in critical infrastructure: a hybrid MCDM approach. Informatica (Netherlands) 30(1), 187– 211 (2019) 16. Boranbayev, A.S., Boranbayev, S.N., Nurusheva, A.M., Yersakhanov, K.B., Seitkulov, Y.N.: Development of web application for detection and mitigation of risks of information and automated systems. Eurasian J. Math. Comput. Appl. 7(1), 4–22 (2019) 17. Boranbayev, A.S., Boranbayev, S.N., Nurusheva, A.M., Seitkulov, Y.N., Sissenov, N.M.: A method to determine the level of the information system fault-tolerance. Eurasian J. Math. Comput. Appl. 7(3), 13–32 (2019) 18. Boranbayev, A., Boranbayev, S., Nurusheva, A., Seitkulov, Y., Nurbekov, A.: Multi criteria method for determining the failure resistance of information system components. Adv. Intell. Syst. Comput. 1070, 324–337 (2020) 19. Askar, B., Seilkhan, B., Assel, N., Kuanysh, Y., Yerzhan, S.: A software system for risk management of information systems. In: Proceedings of the 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT 2018), 17–19 October 2018, Almaty, Kazakhstan, pp. 284–289 (2018) 20. Seilkhan, B., Askar, B., Sanzhar, A., Askar, N.: Mathematical model for optimal designing of reliable information systems. In: Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies-AICT2014, Astana, Kazakhstan, 15–17 October 2014, pp. 123–127 (2014) 21. Seilkhan, B., Sanzhar, A., Askar, B., Yerzhan, S.: Application of diversity method for reliability of cloud computing. In: Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies-AICT2014, Astana, Kazakhstan, 15–17 October 2014, pp. 244–248 (2014) 22. Seilkhan, B.: Mathematical model for the development and performance of sustainable economic programs. Int. J. Ecol. Dev. 6(1), 15–20 (2007)

The Emergence, Advancement and Future of Textual Answer Triggering

Kingsley Nketia Acheampong1(B), Wenhong Tian1,2, Emmanuel Boateng Sifah3, and Kwame Obour-Agyekum Opuni-Boachie3

1 Information Intelligence Tech Lab, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
[email protected], tian [email protected]
2 Chongqing Institute of Green and Intelligent Technology, The Chinese Academy of Sciences, Beijing, China
3 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
{emmanuelsifah,obour539}@std.uestc.edu.cn

Abstract. The recent emergence of Answer Triggering has been annexed to the task of Answer Sentence Selection. Fundamentally, Answer Triggering challenges Question Answering systems to be capable of handling Answer Retrieval with two clauses: detecting whether a candidate answer set contains an answer or not; and, if yes, retrieving that answer for the user. Until now, this task has proved comparatively harder, with most Answer Triggering models achieving F1 scores below 0.44. Meanwhile, due to the limited literature on the task, the potential of this task to unlock vital contributions to Natural Language Processing is hindered. Insights into this "new" NLP task assuredly need to be made known to researchers so that substantial contributions can start again. This survey presents extensive details and analyses of a myriad of methods used by benchmark models on the task of Answer Triggering. Additionally, it discusses the datasets that emerged concurrently with the task. It also identifies the challenges, discusses solutions and provides recommendations for the advancement of such current systems and future ones. Keywords: Answer Triggering · Natural Language Processing · Question Answering · Deep learning

1 Introduction

Question answering (QA) has been a longstanding task of the fields of information retrieval (IR) and natural language processing (NLP) [13,34,38]. As NLP focuses on programming computers to process and analyze natural language, the QA aspect of NLP ensures that these computers are capable of determining answers to human language questions [8,19,36]. QA systems usually tend to answer questions with typical answers, e.g. "What is the colour of Mars?" and


open-ended questions, e.g. "Why do we exist?". In order to answer these questions, many QA systems rely on available data. They may arrive at their answers by querying a knowledge base or an unstructured compilation of natural language documents [4,18,28]. The questions themselves can come in a variety of forms, such as text, speech, image or video.

Over the years of its existence, Question Answering has been placed under two main categories: Closed-Domain Question Answering (C-DQA) and Open-Domain Question Answering (O-DQA) [26,33,39]. The first category, C-DQA, covers QA systems that deal with questions from a distinct domain (e.g. Pharmacy or Medicine). More often than not, C-DQA is seen as the more manageable task because C-DQA systems can draw on domain-specific knowledge, which is generally stored as ontologies. On the other hand, O-DQA is seen as more difficult, as O-DQA systems are required to answer human queries about virtually everything [1,4,9,40]. Hence, O-DQA systems depend on broad, generalized world knowledge and reasoning.

Both C-DQA and O-DQA have several subcategories. One essential subcategory of O-DQA is Selection-Based Question Answering (S-BQA) [14,39]. An S-BQA system is developed to answer a query by selecting a piece of a given document that adequately answers the query. The piece of the document could be a single sentence, a section of a document, adjoining sentences or even a context. One classic task of S-BQA is Answer Sentence Selection. Like other QA systems, the task of Answer Sentence Selection requires a QA system to:

1. Choose a sentence from a set of candidate sentences.
2. Ensure the chosen sentence is valid and can sufficiently answer the question.

Usually, QA systems are developed to work in this manner, where it is often assumed that there exists at least one valid answer in the set of candidate sentences. Therefore, taking n as the number of valid sentences or answers in the set of candidate sentences, Sc, it is assumed that n > 0. However, in real-life scenarios and practical question answering, this might not be the case. A more subtle case exists, where the set of candidate sentences, Sc, may or may not contain a valid answer that could sufficiently answer the question being posed. In order to answer a question of this nature, QA systems are expected to:

1. Choose a sentence from a set of candidate sentences.
2. Ensure the chosen sentence is valid and can sufficiently answer the question.
3. If no valid answer is found, refrain from choosing, thus returning no sentence.

Establishing the conditions above, QA systems in this track of QA are expected to be capable of triggering answers from a set of candidate sentences Sc, and hence the emergence of the task of Answer Triggering [1,14,39]. It is important to note that, in the set of candidate sentences Sc, the number of valid sentences or answers is evaluated as n ≥ 0. The elusiveness of the task in judging the


possibility of answers, and triggering these answers correctly if they really exist, has been affirmed by the low accuracies of contributing models. To a greater extent, outstanding models and frameworks that perform well on Answer Sentence Selection tasks still perform poorly on the Answer Triggering task. This eminent challenge signals that the Answer Triggering (AT) task needs researchers' attention as much as other Selection-Based Question Answering tasks. Models that are robust on Answer Triggering will inherently perform well on other Selection-Based Question Answering tasks such as Answer Sentence Selection.

The content of this study is geared towards discussions that clarify the emergence of the Answer Triggering task, as well as the details of its advancement and its future potential. The key contributions of this work include:

1. Presenting a thorough digest of existing Answer Triggering literature, together with its recent model variations, since Answer Triggering emerged [1,14,16,39,42].
2. Analyzing the features of the datasets that also emerged with the task of Answer Triggering and discussing possible data features that might enhance the performance of the task.
3. Pinpointing and analyzing the challenges faced by those models and amassing solutions for the challenges from the literature.
4. Providing recommendations on how future Answer Triggering systems should be developed, based on the insights of the advancement of Answer Triggering.
5. Extending the survey with cognitive crowdsourcing that gathers cognitive features of the datasets to further enhance the performance of Answer Triggering systems.

2 The New Datasets for Triggering Answers

Typically, QA datasets are created in relation to the task of the QA systems involved [2,29,39,43]. As such, several datasets have been created for the training and evaluation of S-BQA systems like Answer Sentence Selection, which has been annexed to Answer Triggering [14,39]. Despite being very much related to Answer Sentence Selection, Answer Triggering demanded a new form of dataset due to its peculiar nature. Because the number of valid answers in a candidate set, n ∈ {0, 1, 2, ..., N}, may be n = 0, Answer Triggering required a dataset whose question sets would have candidate sets, Sc, both with and without answers; that is, a question may or may not have an answer. In this study, we address the features of these distinctive datasets, their sources, methods of creation, and question and answer distributions.

2.1 WIKIQA

The first dataset for the task of Answer Triggering, the WIKIQA dataset, was created alongside the first publication that brought attention to this QA task [39].


Table 1. A detail composition of QASENT dataset [39]

                    Train   Dev     Test    Total
  # of ques.        94      65      68      227
  # of sent.        5,919   1,117   1,442   8,478
  # of ans.         475     205     248     928
  Avg. ques. len.   11.39   8.00    8.63    9.59
  Avg. sent. len.   30.39   24.90   25.61   28.85

There were some major differences between the WIKIQA dataset and other existing S-BQA datasets, such as the QASENT dataset (see Table 1). Having no editorial revision, unlike other datasets, the WIKIQA dataset became a benchmark dataset for Answer Sentence Selection and Answer Triggering. It provided two tracks of datasets, one for Answer Triggering and the other for Answer Sentence Selection. This survey focuses on WikiQA's Answer Triggering dataset.

Method of Creation. WIKIQA consists of real user Bing queries, unlike existing datasets such as QASENT, where sentences were manually revised to ensure all candidate sentences share content words with their questions. This makes WIKIQA questions more varied and natural, reflecting actual questions posed by humans. About 20.3% of candidate sentences in WIKIQA have no shared content words with their questions.

Composition of WIKIQA. WIKIQA has two separate sets of datasets, one for Answer Sentence Selection and the other for Answer Triggering. The Answer Triggering dataset of WIKIQA consists of 3047 questions, of which 1805 have no answers (about 60% labelled 0) (see Table 2). Each of these questions has a set of candidate sentences, with each QA pair labelled 0 or 1 through crowdsourcing. It also includes these features:

1. QLen - the length of a question.
2. SLen - the length of a candidate answer.
3. Cnt - IDF-weighted and unweighted word-overlapping counts between questions and candidate answers.

By far, WIKIQA has been used by most Answer Triggering literature, and hence the majority of the works discussed in this survey applied this dataset.
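To make the question-level labelling concrete, the following minimal sketch (our own illustration, not code released with WIKIQA; the input format is assumed) derives the answer-triggering label of a question from its crowdsourced QA-pair labels: a question is answerable only if at least one of its candidates is labelled 1.

from collections import defaultdict

def question_level_labels(qa_pairs):
    """qa_pairs: iterable of (question_id, candidate_sentence, label) tuples, label in {0, 1}.
    Returns question_id -> 1 if any candidate answers the question, else 0."""
    labels = defaultdict(int)
    for qid, _sentence, label in qa_pairs:
        labels[qid] = max(labels[qid], int(label))
    return dict(labels)

# Toy example with one answerable and one unanswerable question.
pairs = [("Q1", "s1", 0), ("Q1", "s2", 1), ("Q2", "s1", 0), ("Q2", "s2", 0)]
print(question_level_labels(pairs))  # {'Q1': 1, 'Q2': 0}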

Table 2. A detail composition of WIKIQA dataset [39]

                        Train    Dev     Test    Total
  # of ques.            2,118    296     633     3,047
  # of sent.            20,360   2,733   6,165   29,258
  # of ans.             1,040    140     293     1,473
  Avg. ques. len.       7.16     7.23    7.26    7.18
  Avg. sent. len.       25.29    24.59   24.95   25.15
  # of ques. w/o ans    1245     170     390     1805

2.2 SELQA

The SELQA datasets for the tasks of Answer Sentence Selection and Answer Triggering followed suit after the emergence of the WIKIQA datasets [14]. The SELQA dataset was introduced with the goal of providing a more extensive and natural (unrevised by editors) dataset, just as WIKIQA intended. Again, we focus only on the Answer Triggering dataset. There exist a good number of differences between these two datasets.

Method of Creation. The SELQA dataset was created as a new, automatically generated answer triggering dataset through an annotation scheme (see Fig. 1). The annotation scheme was structured as a list of five individual crowdsourcing tasks used in the creation of the question answering datasets. Out of the five tasks, only the last task was used to generate the Answer Triggering dataset. The annotation scheme also indexed all 14M sections of the entire Wikipedia. Questions from the first four tasks were queried through ElasticSearch, and the top 5 candidate answers from every section were selected. As a result, candidate sentences for questions in SELQA are more numerous and very good for training and testing purposes.

Fig. 1. The graphical overview of SELQA data collection and annotation scheme [14]


Table 3. The distribution of the SELQA Answer Triggering corpus (Q/SEC/SEN: number of questions/sections/sentences) [14]

  Set     # of Ques   # of SEC   # of SEN
  Train   5,529       27,645     205,075
  Dev     785         3,925      28,798
  Test    1,590       7,950      59,845

Composition of SELQA. SELQA likewise has two separate datasets, one for Answer Sentence Selection and the other for Answer Triggering. Although akin to WIKIQA, SELQA has more diversified topics. It is also made up of a significantly larger number of questions: about 2.5 times larger for its Answer Triggering dataset, and six times larger for its Answer Sentence Selection dataset, than WikiQA's (see Table 3). It also builds richer context by deriving contexts from the entire article, rather than from only the abstract. The subsequent sections of this survey give insights into and discuss the various Answer Triggering models. Again, the challenges of these models are highlighted and potential solutions are laid out for their improvement.

3 Robust Methods and Strategies

Several models and their respective variations have been proposed ever since the task of Answer Triggering emerged. All these models have contributed immensely towards arriving at a good solution for the Answer Triggering task. This work presents these contributing models in a distinctive categorization scheme to inform further research on the task. In our study, we review and assess the models':

1. Question and Answer Representation
2. Question and Answer Interaction
3. Scoring Relevance and Answer Retrieval

3.1 Question and Answer Representation

The representations of questions and candidate sentences (which might include an answer) in the Answer Triggering task are generally modified versions of those found in existing QA representations. It is quite interesting to note that Word Embeddings are virtually necessary for the task of Answer Triggering. All the model variations discussed in this survey employ Word Embedding methods.

Word Embeddings. Word Embedding is a common connotation given to methods that use feature learning to map words or phrases from a vocabulary to real number vectors, and it is used to model language [15,37]. Researchers


could generate these vectors of real numbers using a vocabulary or simply use pre-trained embeddings from repositories [21,22]. In recent existing work on QA systems, using pre-trained embeddings has been indispensable, both for a fair comparison of network models and for ease of use [15,17]. Two such pre-trained vector sets are:

1. Stanford's GloVe Pretrained Vectors [25].
2. Google's 300-dimensional Pretrained Vectors [20,21].

The latter has been used for all the contributing models discussed in this survey. The first solution was provided after the invocation of the concept of Answer Triggering in QA systems [39]. Inspired by an early Deep Convolutional Neural Network model [41], CNN-Cnt [39] became a vital model for the Answer Triggering task, in which word embeddings are used to compute the question and answer representations. Since then, word embeddings have been used as the primary step to represent the questions and candidate answers in successive works. Although the input to this representation step can be customized, using the pre-trained word2vec model from Google would be ideal [1,14,16,39,42].
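As a minimal sketch of how such pre-trained vectors are typically loaded and applied (our own illustration, assuming the gensim library and a locally downloaded copy of Google's 300-dimensional word2vec binary; the file name is a placeholder):

from gensim.models import KeyedVectors

# Path to Google's pre-trained 300-dimensional vectors (placeholder file name).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

question = ["what", "is", "the", "colour", "of", "mars"]
# Represent each in-vocabulary word by its 300-dimensional embedding.
embedded = [vectors[w] for w in question if w in vectors]
print(len(embedded), embedded[0].shape)  # e.g. 6 (300,), depending on vocabulary coverage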

3.2 Question and Answer Interaction

Despite the few models proposed for the task of Answer Triggering, various variations of these models have been proposed, most of which entail different encoding strategies. This section addresses the methods that improved the solution of the task.

Convolutional Neural Networks with Count Features. The best model presented by Yang et al. for solving the Answer Triggering task is CNN-Cnt. CNN-Cnt was based on Yu et al.'s work [41], which involved the use of Convolutional Neural Networks [8] (see Fig. 2). The design principle of this model was to evaluate Answer Triggering at the question level for the first time using CNN-Cnt, a model that worked well for Answer Sentence Selection [39]. CNN-Cnt combines the question and answer-sentence representations built by a convolutional neural network (CNN) with logistic regression. Three variations of the CNN-Cnt model were proposed as solutions [39] by coupling the baseline CNN-Cnt with additional features, namely,

1. Length of Question (QLen)
2. Length of Candidate Sentences (SLen)
3. Class of the Questions (QClass)
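To make the role of the count features concrete, the sketch below (our own simplification, not the authors' implementation) couples a precomputed CNN question-answer similarity score with the Cnt, QLen and SLen surface features in a logistic regression, mirroring how CNN-Cnt combines its sentence representations with such features; all feature values here are invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [cnn_similarity, word_overlap_count, question_length, sentence_length];
# cnn_similarity stands in for the score produced by the convolutional sentence model.
X_train = np.array([
    [0.91, 3, 7, 24],   # correct candidate
    [0.40, 0, 7, 31],   # incorrect candidate
    [0.85, 2, 6, 22],   # correct candidate
    [0.35, 1, 6, 40],   # incorrect candidate
])
y_train = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X_train, y_train)
# Thresholding this probability is what turns sentence selection into answer triggering.
print(clf.predict_proba([[0.70, 1, 8, 27]])[0, 1])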

Convolutional Neural Networks with Subtree Matching. Jurczyk et al. presented a replication of CNN-Cnt and further coupled this baseline model with a novel subtree matching mechanism that evaluates the contextual similarity between questions and their respective candidate sentences [6,14].


Fig. 2. The architecture of a 1D convolutional neural network [41]

The comparator used by each of the CNN models utilizing subtree matching varies; the comparators are:

1. Word-forms comparator
2. Word embeddings comparator

The Word-forms comparator (fc(x, y)) returns 1 if x and y have the same form and 0 otherwise. The Word embeddings comparator uses the cosine similarity between x and y. In addition, a function takes a list of scores and returns the maximum score as the answer to the question, with an answer triggering threshold built into its implementation.

Cognitive-Based Implausible Sentence Matching Strategies. Acheampong et al. presented a model based on a human thinking process, with a focus on "Implausible Sentences" in the candidate set and writing them off by deletion [10,31]. OCM-IS uses a question classification algorithm to identify the candidate sentences suitable to answer the questions [11,12]. OCM-IS uses other sentence-level computations, such as Latent Semantic Indexing (LSI), to compare the topics of the question and candidate sentences and capture the topmost answer as the correct answer [3,32]. OCM-IS works on the basis that, if no sentence is left after the implausible sentence check is executed, the question is deemed to have no answer. Otherwise, the first sentence in the LSI ranking is the answer [3,23].

Recurrent Neural Network with Subtree Matching. Recurrent Neural Networks (RNNs) have hidden recurrent states and extend the concepts of feedforward neural networks. The activations of the hidden recurrent states depend on previous activations. This mechanism enables RNNs to include sequential and


timing information in their processing of data. The goal of learning long-term dependencies gave rise to a variation of the traditional RNNs, that is, Long Short-Term Memory (LSTM) networks, which possess a gradient-based learning algorithm. LSTM networks have been invaluable to several NLP tasks and QA systems, and Answer Triggering is one of the beneficiaries. In solving the task of Answer Triggering, one of the earliest RNNs used was developed by Jurczyk et al., which is addressed as SELQA-RNN in this work. The objective of SELQA-RNN was to compete with SELQA-CNN with Subtree Matching in order to ascertain their respective performances; hence, the model also attempted to capture the contextuality between questions and their respective candidate sentences. SELQA-RNN utilized an attentive pooling mechanism while using Gated Recurrent Units as its memory cells [30]. This approach improves the convergence rate of the learning without affecting the performance, as opposed to using LSTM networks [5,7]. Attentive pooling enables SELQA-RNN to jointly learn a similarity measure between a question and a candidate sentence using the hidden states of the question and the candidate sentence [30].

Attention-Based Recurrent Neural Networks. The Inner Attention-based Recurrent Neural Network, the IARNN-Gate model, is also based on a Recurrent Neural Network [35] (see Fig. 3). IARNN-Gate utilizes the representation of the question to build the GRU gates for every candidate answer. That is, IARNN-Gate applies attention to its GRU inner activations, instead of applying the attention information to its original input [16,35]. By adding attentional information to the active update gates and forget gates, these units focus on short-term, long-term and attention information at the same time.

Fig. 3. The structure of an IARNN-Gate model (IABRNN) [35]


Recurrent Neural Tensor Approach. The Hierarchical Gated Recurrent Neural Tensor (HGRNT) model also increased the F1 score for the Answer Triggering task [16]. The HGRNT model captures context information and deep interactions between the candidate sentences and the question through the use of a Gated Recurrent Neural Network (GRNN). The GRNN model is similar to the IARNN model; however, unlike IARNN-Gate, the question vector is not first calculated and then added to compute the gates of the answers, as is seen in the structure of IARNN-Gate. Variations of GRNN were also developed by Zhao et al. These were:

1. GRNN + tensor
2. GRNN + content
3. GRNN + tensor + content

Using a neural tensor network, the HGRNT model becomes effective in modelling sentence similarity, and its variant, the GRNN with tensor and content model, surpasses all Answer Triggering models in precision [16,27]. Details can be found in Table 4 of this work.

GAT. A Group-level Answer Triggering framework, denoted as GAT, has, to the best of our knowledge, been the most recent framework to address the task of Answer Triggering [42]. GAT possesses a unique way of jointly learning to:

1. T1: build an individual-level model to rank candidate sentences
2. T2: make a group-level binary prediction on the possibility of answers in the candidate sentence set

thus optimizing T1 and T2 at the same time. Also, the GAT framework provides a structure where the encoder can be modified to improve the model [42]. The next section focuses on the evaluation of the contributing models of Answer Triggering and gives in-depth details about their performances.

3.3 Evaluation and Performance Comparison

Evaluation Metrics. The Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) are accepted methods for evaluating the retrieval performance of QA systems. Since its usage in the TREC Q/A track in 1999, MRR, together with MAP, has been used frequently as a performance measure in the task of Answer Sentence Selection. MAP for a set of queries is the mean of the average precision scores for each query [18,24]. Thus,

    MAP = \frac{\sum_{q=1}^{Q} AveP(q)}{Q}    (1)

where Q is the number of queries.

The MRR performance measure evaluates a list of possible responses for a sample of queries Q, ordered by the probability of validity [34]. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first valid answer, that is, 1, 1/2, 1/3, ..., with respect to the ranking positions rank_i. MRR is then the average of the reciprocal ranks of the results over the sample Q [28,34]. Mathematically,

    MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}    (2)

where rank_i is the rank position of the first relevant document for the i-th query.

Notwithstanding that MAP and MRR are the standard performance measures for evaluating Answer Sentence Selection systems, they are not used for the task of Answer Triggering due to the nature of the task. The task of Answer Triggering requires metrics that can ascertain the existence of answers in the candidate set of a question and the validity of the system's predictions [1,14,39,42]. Yang et al. employed Precision, Recall and F1 scores for answer triggering at the question level. Since then, works on the Answer Triggering task have all used these metrics, not only for evaluation but also for a reasonable comparison between their works and other Answer Triggering works [1,14,16,42]. Although the metrics carry their traditional meanings, this study limits them to the task of Answer Triggering.

– Precision is the fraction of retrieved candidate sentences that are relevant to the question. That is,

    Precision = \frac{|\{RelevantSents\} \cap \{RetrievedSents\}|}{|\{RetrievedSents\}|}    (3)

– Recall is the fraction of the relevant candidate sentences (answers) that are successfully retrieved. That is,

    Recall = \frac{|\{RelevantSents\} \cap \{RetrievedSents\}|}{|\{RelevantSents\}|}    (4)

– F1 Score is the harmonic mean of the precision and the recall. That is,

    F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}    (5)
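The question-level use of these scores can be sketched as follows (our own illustration of how Eqs. (3)-(5) are applied at the question level, not a reference implementation): a question counts towards precision when the system triggers some answer for it, and towards recall when its candidate set actually contains a correct answer.

def question_level_scores(predictions, gold):
    """predictions: question_id -> triggered sentence id, or None if the system abstains.
    gold: question_id -> set of correct sentence ids (empty set for unanswerable questions)."""
    triggered = {q: s for q, s in predictions.items() if s is not None}
    correct = {q for q, s in triggered.items() if s in gold.get(q, set())}
    answerable = {q for q, answers in gold.items() if answers}
    precision = len(correct) / len(triggered) if triggered else 0.0
    recall = len(correct) / len(answerable) if answerable else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy run: correct trigger on Q1, correct abstention on Q2, wrong trigger on Q3.
preds = {"Q1": "s2", "Q2": None, "Q3": "s1"}
gold = {"Q1": {"s2"}, "Q2": set(), "Q3": set()}
print(question_level_scores(preds, gold))  # (0.5, 1.0, 0.666...)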

All the works surveyed in this work employed the metrics listed here to evaluate their respective QA systems.

Performance Comparison. In Table 4, we present a tabular summary of the major contributing works on the task of Answer Triggering. All contributing works are text-based. To the best of our knowledge, Answer Triggering has not been extended to other modalities, such as audio and video. The table illustrates that more models have been trained and evaluated with WikiQA, while only the creators of the SelQA dataset trained and evaluated their models on SelQA.

Table 4. The performance comparison of major contributing models to the task of Answer Triggering

                                           WikiQA                   SelQA
  Model                                    Prec   Recall  F1        Prec   Recall  F1
  CNN-Cnt                                  26.09  37.04   30.61     –      –       –
  CNN-Cnt + QLen                           27.96  37.86   32.17     –      –       –
  CNN-Cnt + QLen                           26.14  37.86   30.92     –      –       –
  CNN-Cnt + QClass                         27.84  33.33   30.34     –      –       –
  CNN-Cnt + {Q,S}Len + QClass              28.34  35.80   31.64     –      –       –
  GRNN                                     38.03  25.51   30.54     –      –       –
  GRNN + tensor                            39.36  30.45   34.34     –      –       –
  GRNN + context                           37.55  42.80   39.99     –      –       –
  GRNN + content + tensor                  40.91  44.44   42.60     –      –       –
  IARNN-Gate                               25.94  42.39   32.19     –      –       –
  IARNN-Gate + context + tensor            36.82  44.86   40.45     –      –       –
  Compare aggregate                        27.64  39.92   32.65     –      –       –
  Compare aggregate + context + tensor     29.71  50.62   37.44     –      –       –
  OCM + IS                                 31.46  45.16   37.09     –      –       –
  GAT                                      32.70  48.59   39.09     –      –       –
  GAT + Cnt (Full)                         33.54  60.92   43.27     –      –       –
  GAT + Cnt + QLen                         33.12  59.09   42.45     –      –       –
  GAT + Cnt + SLen                         28.03  64.60   39.10     –      –       –
  GAT + All                                31.35  58.82   40.90     –      –       –
  SELQA-CNN                                29.70  37.45   32.73     52.10  40.34   45.47
  SELQA-CNN + max + word                   29.77  42.39   34.97     52.22  47.30   49.64
  SELQA-CNN + max + emb                    29.77  42.39   34.97     53.69  48.38   50.89
  SELQA-CNN + max + emb+                   29.43  48.56   36.65     52.14  47.14   49.51
  SELQA-RNN + attn-pool                    24.32  47.74   32.22     47.96  43.59   45.67
  Average Model Score                      31.06  44.11   35.81     51.62  45.35   48.24

We also uncover an interesting observation which is worth noting. In Table 4, there is a gross difference in the performance of the five SelQA models between their evaluations on the WikiQA dataset and the SelQA dataset. The best performing model, SELQA-CNN + max + emb, achieves an F1 score of 50.89 on SelQA, whereas its F1 score on WikiQA was just 34.97. This observation runs through all the varying models of SelQA. We deduce two main possibilities:

1. The Hardness of WikiQA: The large margin between the F1 scores of the same model when employing the WikiQA dataset and the SelQA dataset could imply that the questions and candidate sentences in WikiQA are much more subtle and thus more difficult to trigger than the questions and candidate sentences in SelQA.


2. The Large Size of SelQA: The SelQA dataset is vast, with five times more questions and candidate sentences than WikiQA. This size implies that models trained and evaluated with SelQA might have had the advantage of the size of the dataset, reflected in their high F1 scores.

These two possibilities, coupled with the subtle observation of the high F1-score margin realized in this survey, give a useful direction towards unravelling the hardness of the task of Answer Triggering. We propose that future Answer Triggering models should be evaluated on both datasets for a more accurate view of the dataset comparison. During the evaluation of models, researchers could also employ a multi-dataset learning approach, using both datasets during the learning process.

Model Score Analyses. There is a degree of concern realized as we surveyed the performances of the models over the datasets. Taking WikiQA, and referring to Table 2, there exist 3047 questions, of which 1805 have no answers; 1242 questions, forming about 40.11% of the entire set, have answers. Nevertheless, the triggering precision at the question level of most of the models still lingers well below 40, with the best being GRNN + content + tensor at 40.91. This low performance suggests that the capability of representing perception, learning, reasoning and, consequently, effectively triggering answers is far from being reached, and current models might be naively selecting questions. Due to the limitation in precisely selecting valid questions for further processing, Recall is also hindered, which affects the overall F1 score.

Fig. 4. A graphical illustration of Precision, Recall and F1 scores of various contributing models, with their respective trends.


Despite the seemingly poor F1 scores of contributing models, there is an upward trend in Precision, Recall and F1 scores from the emergence of the task till now. The graphical plot of the scores is illustrated in Fig. 4, together with the increasing pattern realized in this survey.

4 Major Challenges with Plausible Solutions

In this section, we address all the profound challenges realized through the survey and initiate plausible solutions that could be applied to mitigate them. Since the task of Answer Triggering has in itself two minor tasks, of which one falls under the ranking of candidate sentences, various limitations of ranking affect the models, just as is seen in the task of Answer Sentence Selection. However, in this work, we highlight the limitations that directly hinder Answer Triggering models' performance and give suggestive solutions to them.

4.1 Syntactic Similarity Bias

Even though modern QA models, including models for S-BQA systems, make assurances of the ability to capture semantic relations between questions and candidate sentences, biases geared toward syntactic similarity persist. Due to this bias, answers which could easily be triggered by humans are inherently difficult for systems to trigger. In Fig. 5, one can observe that A1, which is the outright wrong option, although fundamental enough, is ranked the highest, whereas A2, the correct answer, ranks the lowest, with a margin of ≈ 0.2325 between their similarities with this "simple" question. A similar observation is realized (A1 ≈ 0.8986; A2 ≈ 0.7352; difference in margin ≈ 0.1634) when the spaCy model with 300-dimensional word vectors trained on Common Crawl with GloVe is used. From this observation, we propose that models designed for the task of answer triggering could be validated on increasingly general semantic relatedness, graded by the question's cognitive difficulty, as opposed to an overall generalized semantic relatedness. Generalizing models along semantic relatedness would enhance the triggering of simple questions while seeking to improve the model to trigger more advanced questions. Datasets with more features that capture the semantic or cognitive features of sentences could be employed, so that models are capable of answering difficult questions of this sort. Thus, the combination of syntactic information, semantic relatedness and cognitive approaches would aid in triggering answers better.
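A minimal sketch of how such similarity scores can be reproduced, assuming spaCy and a vector-bearing model such as en_core_web_md are installed; the question and candidate sentences below are stand-ins rather than the exact example of Fig. 5:

import spacy

nlp = spacy.load("en_core_web_md")  # a spaCy model that ships word vectors

question = nlp("What is the colour of Mars?")
candidates = [
    nlp("Mars is often described as the Red Planet."),   # answers the question
    nlp("Mars is the fourth planet from the Sun."),      # related but not an answer
]
# Document-level cosine similarity over averaged word vectors, the kind of score
# behind the Fig. 5 comparison; a syntactically closer sentence can outrank the answer.
for candidate in candidates:
    print(round(question.similarity(candidate), 4), candidate.text)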

Fig. 5. An example of semantic similarity using spaCy's trained en_core_web_sm model

4.2 Ancillary Content Verification

With the use of word embeddings and various probability-based models, distances between vectors (especially for phrases, clauses and sentences) play key roles in ranking an answer for triggering. The ability of an Answer Triggering system to detect ancillary content and explanatory components of sentences is a salient one that should not be overlooked. In some cases, sentences that further explain the answer steer the models away from the required answer, thus ranking a positive answer lower. In other cases, descriptive parts of the sentences may not even be about the answer [1,42]. Figure 6 illustrates an example of an ancillary context in one WikiQA sentence. To suppress the weight of the ancillary content of sentences, AT systems could be enhanced with episodic memories that retain the salient parts of the sentences. Those parts of the sentences could be prioritized and compared with the idea behind the question being posed. This would improve the overall triggering process and overcome the challenge of ancillary content affecting the ranking score.

4.3 Lexical Ambiguity

Ambiguity in a question occurs when there exists more than a single intent with which to interpret the question. When a question is equivocal, this creates an ambiguous interpretation when training an Answer Triggering model (see Fig. 7). Although A1 and A2 both rightfully answer the question, A2 is labelled "1" in the WIKIQA dataset, with A1 labelled "0". This is a typical example in an Answer Triggering task where the intent of the question is not obvious. The phrase "originally stood" could mean that this name may not be used again, or may have been replaced with another meaning, in which case A1 would have been correct.

Fig. 6. An example of a typical answer ranked lower, due to unrelated contents (in red), that forms a part of the answer [42]


Fig. 7. An Example of an ambiguous question in Answer Triggering dataset

Hence, with regard to ambiguity in the Answer Triggering task, more diverse datasets could be curated to include the intent, category or conceptual information a question holds. Answer Triggering models being developed could then be trained for disambiguation. Furthermore, NLP researchers should be strategic when training word embeddings for Answer Triggering tasks, in that word vectors that are close may become distant in embeddings trained on newer data, such as a 2017 Wikipedia dump as opposed to a 2015 Wikipedia dump. If this is ensured, triggering is bound to reflect the state at the time the dataset was created. Cognitive and other semantic features (as discussed in Sect. 4.1) can also be applied.

4.4 Fixed Triggering Thresholds

In the ranking and retrieval of candidate sentences in QA systems, setting fixed thresholds is the norm, a practice that has carried over to most of its subsidiary tasks, including Answer Triggering. Although fixing a threshold seems to be one of the best approaches, it sometimes gives false negatives on positive answers that rank lower than the threshold. The application of fixed thresholds (the 0.5 mark in most QA systems) plays a part in the learning process of Answer Triggering models. This is a considerable shortfall to overcome in Answer Triggering, yet it is possible. A range of thresholds may be experimented with during a model's learning process, and the best threshold chosen, rather than predefining it. Again, during the model's learning process, positive candidate sentences that are ranked negatively could be rewarded, whereas negative candidate sentences that are ranked highest could be penalized [42]. This method could aid the overall performance of Answer Triggering models when coupled with attentional information.
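One way to realize the suggestion of learning the threshold rather than fixing it at 0.5 is a simple sweep over a development set, keeping the value that maximizes F1; the sketch below is our own illustration and the scores and labels are invented.

import numpy as np
from sklearn.metrics import f1_score

# Ranking scores produced by a model on a development set, with their gold labels.
dev_scores = np.array([0.91, 0.62, 0.48, 0.35, 0.80, 0.22, 0.55, 0.10])
dev_labels = np.array([1, 1, 0, 0, 1, 0, 1, 0])

best_threshold, best_f1 = 0.5, 0.0
for threshold in np.arange(0.05, 0.95, 0.05):
    predictions = (dev_scores >= threshold).astype(int)
    score = f1_score(dev_labels, predictions)
    if score > best_f1:
        best_threshold, best_f1 = threshold, score

print(round(best_threshold, 2), round(best_f1, 4))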

5 Answer Triggering Data Contribution

Inspired by the need for extensive training features for triggering answers, we extended this comprehensive study to initiate cognitive crowdsourcing that gathers cognitive and semantic features of sentences, in the light of establishing communicative inferences that connect question-answer pairs in Answer Triggering¹. In this data gathering, annotators are requested to provide cognitive and semantic features of sentences from the two answer triggering datasets (i.e. WikiQA and SelQA) (see Fig. 8). Both datasets have only three features (Question (Q), Candidate sentence (C), Polarity (P)) and have no cognitive features.

Fig. 8. Crowdsourcing 24 cognitive features for training Answer Triggering

With reference to the project, a question sentence in WikiQA, "How African Americans were immigrated to the US", can be inferred to have a:

1. Cognitive Categorization: law, govt and politics ← legal issues ← civil rights
2. Conceptual Information: Southern United States
3. Entities: US
4. Keywords: African Americans

with their degrees of certainty, as well as for all the question's candidate answers. This extension has 27 fields, of which 24 were gathered solely for boosting Answer Triggering models cognitively. Moreover, the new features, which include the concept information, cognitive categorisation, keywords and entities, together with their respective degrees of certainty, can be applied directly to eliminate question-answer pairs that are not related conceptually. We hope to release this dataset as open data and contribute it towards solving answer triggering tasks.

¹ A view of the new data: https://wertyworld.com/cloud/index.php/s/gvRs42yuuWU71Zw.

6 Conclusion

In this comprehensive survey, we have presented a thorough top-down digest of the various contributing works that have produced substantial models for the task of Answer Triggering since its emergence. We have also presented and discussed the datasets this new Question Answering challenge emerged with, giving possible ways these datasets could be enhanced to aid the task of Answer Triggering. We also analyzed the challenges that existing Answer Triggering models face and provided new directions that would enhance the overall performance of these models, and of any new Answer Triggering models that will emerge.

Again, we have observed that current state-of-the-art models are far from effectively solving the task of Answer Triggering. This observation is reflected in the respective F1 scores of AT models, which are well below 50%. From our comprehensive survey of existing Answer Triggering models, we propose that research should be geared towards merging the essential features of the presented models, thus leveraging their positive aspects. In doing so, the F1 scores of new models will increase. Moreover, considering that most of these models lack high precision, new models should be trained to enhance their precision in triggering answers, and this might readily increase the F1 performance, as Recall is usually high in current models. Models could also be tested with other datasets, such as the SELQA dataset, in order to ascertain their capabilities well enough, with good evaluations for enhancement.

The task of Answer Triggering is subtle and demands a conscious research effort to solve it, hence the awareness raised through this survey. From the trend illustrations shown in this survey, it is evident that despite the few research efforts contributed to solving this task, there is a positive trend. The trend is a strong indication that the task could be solved after all, despite its hardness. Solutions to the Answer Triggering task would be essential to QA systems, and its models would substantially improve other QA systems.

Acknowledgment. We thank Wertyworld, Co., for its contribution to the data gathering process and for offering the online servers used in this work. We also appreciate the effort of the students of the University of Electronic Science and Technology of China (UESTC) who participated in the validation and verification of this data. This research is partially sponsored by the Natural Science Foundation of China (NSFC) Grants 61672136 and 61828202.

References
1. Acheampong, K.N., Pan, Z.H., Zhou, E.Q., Li, X.Y.: Answer triggering of factoid questions: a cognitive approach, pp. 33–37 (2016). https://ieeexplore.ieee.org/document/8079800/
2. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Parikh, D., Batra, D.: VQA: visual question answering. Int. J. Comput. Vision (2017). www.visualqa.org
3. Alghamdi, R., Alfalqi, K.: A survey of topic modeling in text mining. Int. J. Adv. Comput. Sci. Appl. 6, 147–194 (2015)


4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python (2009)
5. Cho, K., Van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)
6. Choi, J.D., Mccallum, A.: Transition-based dependency parsing with selectional branching. In: ACL 2013 (2013)
7. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555 (2014)
8. Collobert, R., Weston, J.: A unified architecture for natural language processing. In: Proceedings of the 25th International Conference on Machine Learning (ICML 2008) (2008)
9. Dong, L., Wei, F., Zhou, M., Xu, K.: Question answering over freebase with multicolumn convolutional neural networks. In: Proceedings ACL 2015 (2015)
10. Dufresne, R.J., Leonard, W.J., Gerace, W.J.: Making sense of students' answers to multiple-choice questions. The Physics Teacher (2002)
11. Galitsky, B., Pampapathi, R.: Can many agents answer questions better than one? First Monday (2005)
12. Huang, Z., Thint, M., Qin, Z.: Question classification using head words and their hypernyms. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008) (2008)
13. Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., Daumé III, H.: A neural network for factoid question answering over paragraphs. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
14. Jurczyk, T., Zhai, M., Choi, J.D.: SelQA: a new benchmark for selection-based question answering. In: Proceedings - 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI 2016) (2017)
15. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML (2014)
16. Li, W., Wu, Y.: Hierarchical gated recurrent neural tensor network for answer triggering. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017)
17. Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015) (2015)
18. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval (2009)
19. Martin, J.H., Jurafsky, D.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2001)
20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
21. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR 2013) (2013)
22. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT (2013)
23. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (2011)
24. National Institute of Standards and Technology: Common evaluation measures. In: The Eighteenth Text REtrieval Conference (TREC 2009) Proceedings (2009)


25. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
26. Prager, J.: Open-domain question-answering. Found. Trends Inf. Retrieval (2006)
27. Qiu, X., Huang, X.: Convolutional neural tensor network architecture for community-based question answering. In: IJCAI International Joint Conference on Artificial Intelligence (2015)
28. Radev, D., Qi, H., Wu, H., Fan, W.: Evaluating web-based question answering systems. In: Proceedings of the Third International Conference on Language Resources and Evaluation (2002)
29. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP (2016)
30. dos Santos, C.N., Tan, M., Xiang, B., Zhou, B.: Attentive pooling networks. CoRR abs/1602.03609 (2016)
31. Smith, M.A., Karpicke, J.D.: Retrieval practice with short-answer, multiple-choice, and hybrid tests. Memory (2014)
32. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Handbook of Latent Semantic Analysis (2007)
33. Sun, H., Ma, H., Yih, W.T., Tsai, C.T., Liu, J., Chang, M.W.: Open domain question answering via semantic enrichment. In: Proceedings of the 24th International Conference on World Wide Web (WWW 2015) (2015)
34. Voorhees, E.M.: The TREC-8 Question Answering Track Report (1999)
35. Wang, B., Liu, K., Zhao, J.: Inner attention based recurrent neural networks for answer selection. In: ACL (2016)
36. Weikum, G.: Foundations of statistical natural language processing. ACM SIGMOD Record (2002)
37. Xiao, M., Guo, Y.: Distributed word representation learning for cross-lingual dependency parsing. In: CoNLL (2014)
38. Xiong, C., Merity, S., Socher, R.: Dynamic memory networks for visual and textual question answering. In: Proceedings of the 33rd International Conference on Machine Learning (2016)
39. Yang, Y., Yih, W.T., Meek, C.: WIKIQA: a challenge dataset for open-domain question answering. In: Proceedings of EMNLP 2015 (2015)
40. Yih, W.T., He, X., Meek, C.: Semantic parsing for single-relation question answering. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2014)
41. Yu, L., Hermann, K.M., Blunsom, P., Pulman, S.G.: Deep learning for answer sentence selection. CoRR abs/1412.1632 (2014)
42. Zhao, J., Su, Y., Guan, Z., Sun, H.: An end-to-end deep framework for answer triggering with a novel group-level objective. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2017)
43. Zhu, Y., Groth, O., Bernstein, M.S., Fei-Fei, L.: Visual7w: grounded question answering in images. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4995–5004 (2016)

OCR Post Processing Using Support Vector Machines

Jorge Ramón Fonseca Cacho(B) and Kazem Taghva

Department of Computer Science, University of Nevada, Las Vegas, Las Vegas, USA
{Jorge.FonsecaCacho,Kazem.Taghva}@unlv.edu

Abstract. In this paper, we introduce a set of detailed experiments using Support Vector Machines (SVM) to try to improve the accuracy of selecting the correct candidate word for correcting OCR-generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google Web 1T corpus and use them to train and test our SVM model, which will then generalize to the rest of the unseen text. We then improve on our initial results using a polynomial kernel, feature standardization with min-max normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest future improvements.

Keywords: OCR · Support Vector Machines · SVM · OCR Post Processing · SMOTE

1 Introduction

In previous research we have discussed correcting OCR-generated errors during the post processing stage by using a confusion matrix and context for identifying likely character replacements [1]. For context based corrections, we generate all possible candidate words by searching the Google Web 1T corpus [2] for all 3-grams that have a matching predecessor and successor word. Once generated, we selected the candidate with the smallest Levenshtein edit distance [3], as this represents the word with the least amount of changes required to transform it into the correct word. However, we also took into consideration the frequency of the trigram in the Google Web 1T corpus by assuming that common phrases are more likely to be correct than obscure ones. In doing this, the biggest challenge was how we could weight, or balance, both of those features when choosing our candidate. In that case, we experimented with different weights and decided on a threshold that seemed to work best for the majority of the cases. While experimentation decided the ideal weights, humans were very much involved. Because of this, at the conclusion of that research we proposed using machine learning to find the ideal weights between features to select a candidate word given these and, possibly, other additional features, with the goal of automating the process and removing the need for the human factor.


In this paper, we provide an application of machine learning to generalize the past ad hoc approaches to OCR error corrections. As an example, we investigate the use of Support Vector Machines (SVM) to select the correct replacement for misspellings in the OCR text in an automated way. In one experiment we achieved 91.58% precision and 58.63% recall with an accuracy of 96.72%. This results in the ability to automatically correct 58.63% of all of the OCR-generated errors with very few false positives. In another experiment we achieved 91.89% recall and therefore were able to correct almost all of the errors (8.11% error rate), but at the cost of a high number of false positives. These two results provide a promising solution that could be a combination of both models. Our contributions in these models are an automated way to generate the training data and test data (features and class output) using, among other tools, the alignment algorithm from our previous work [4] and a set of tools we developed in Python and MySQL to automate the workflow entirely. To achieve the machine learning aspect, we use the existing LIBSVM library [5] for creating a Support Vector Machine that we can train using any of our datasets, including the original, normalized, or balanced class distribution datasets.

2 Support Vector Machines

For this set of experiments we will use supervised learning, as it allows us to train a model with a given number of input features that in return have one expected binary output. Once trained, we can then use this model on new data it has not been exposed to in order to predict the output. Support Vector Machines (SVMs) are a set of supervised learning methods that are very popular for both classification and regression. SVMs are a combination of a modified loss function and the kernel trick: "Rather than defining our feature vector in terms of kernels, φ(x) = [κ(x, x1), ..., κ(x, xN)], we can instead work with the original feature vectors x, but modify the algorithm so that it replaces all inner products of the form ⟨x, x′⟩ with a call to the kernel function, κ(x, x′)" [6]. The kernel trick can be understood as lifting the feature space to a higher dimensional space where a linear classifier is possible. SVMs can be very effective in high dimensional spaces, while remaining very versatile due to the variety of kernel functions available. Four popular kernel functions are [7]:

– linear: κ(x_i, x_j) = x_i^T x_j
– polynomial: κ(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0
– radial basis function: κ(x_i, x_j) = exp(−γ ||x_i − x_j||^2), γ > 0
– sigmoid: κ(x_i, x_j) = tanh(γ x_i^T x_j + r)

To use these kernels, “First, they encode sparsity in the loss function rather than the prior. Second, they encode kernels by using an algorithmic trick, rather than being an explicit part of the model” [6].
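As an illustration of how a kernel is selected in practice, the sketch below uses scikit-learn's SVC, which wraps the same LIBSVM implementation used in this work; the toy feature rows follow the five-feature layout described later in Sect. 4, but the values are invented.

from sklearn.svm import SVC

# Toy rows: [distance, confWeight, uniFreq, bkwdFreq, fwdFreq], label 1 = correct candidate.
X = [
    [1, 918, 83, 83, 10],
    [1, 459, 40, 20, 5],
    [3, 5, 0, 0, 0],
    [4, 2, 0, 0, 0],
]
y = [1, 1, 0, 0]

# gamma and C parameterize the RBF kernel above; other kernels can be swapped in
# with kernel="linear", "poly" or "sigmoid".
model = SVC(kernel="rbf", gamma="scale", C=1.0)
model.fit(X, y)
print(model.predict([[1, 900, 80, 80, 10]]))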

3 Methodology

Because the goal is to find the ideal weights between the given features in order to correctly select a candidate word, this is a decision problem. While the corrections are at the character level, the candidates are words with the corrections (possibly multiple character corrections in one word) included. Therefore, we are trying to decide which candidate word has the highest probability of being the correct candidate. Because of this, we can use either linear regression, providing a numerical score that indicates how confident we are in a candidate and therefore picking the candidate with the highest score, or logistic regression (binary classification) to decide whether a candidate is a correct suggestion or not. The advantage of linear regression is that we can pick a single winner. On the other hand, with logistic regression it is possible to have multiple acceptable candidates with no way of knowing which one is more likely, since the classification is binary. Even if we did multi-class classification with three classes (Bad, Acceptable, Highly Acceptable), we would still run into a problem when two candidates classify into the Highly Acceptable class. Some logistic regression models will give a probability that a candidate belongs to a class, but this is not always the case. However, when it comes to training, after running our alignment algorithm we have only matched the error with the correct word, and when training our model we can only say whether a candidate is correct or not, as we have no way of quantifying how likely a candidate is. One possible way to work around that is to give very high scores to correct candidates and low scores to incorrect ones, with candidates that are close based on Levenshtein edit distance receiving some intermediary score; however, that causes problems of its own, as we would be giving the edit distance priority over other potential features. Because of the aforementioned reasons, we have chosen to use a classification algorithm that we can easily train with yes or no responses to possible candidates, since the target variable is dichotomous. Now that we have defined everything, we can word our problem as: "Given a set of features and given a candidate word, our binary classifier will predict if that candidate word is correct or not." Next, we will discuss what features we will be providing our model for each candidate.

4 Features and Candidate Generation

For each candidate generated using ispell or OCRSpell [8,9], we will generate the following five features with the corresponding values that apply to each of them. Using these features we will train our model:

– Levenshtein Edit Distance: distance
– Confusion Weight: confWeight
– Unigram Frequency: uniFreq
– Predecessor Bigram Frequency: bkwdFreq
– Successor Bigram Frequency: fwdFreq

4.1 Levenshtein Edit Distance

Between a candidate word and the error word we measure the Levenshtein edit distance [3], which is the smallest number of insertions, deletions, or substitutions required to transform one word into the other. For example, hegister → register would require only one substitution, h → r. Ideally, the lower the value, the more closely related the words are and the higher the likelihood that it is a correct candidate; however, it will be up to the learning algorithm to decide if this correlation is appropriate and, furthermore, if this feature is as important as the others.
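A straightforward dynamic-programming implementation of this distance (a standard textbook version, shown here only to make the feature concrete):

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free when characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("hegister", "register"))  # 1, the single substitution h -> r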

4.2 Confusion Weight

We can use the confusion matrix's weight for the candidate word in relation to the OCR-generated word. For example, if the OCR-generated word is "negister", the candidate word is "register", and the confusion weight for 'n → r' is 140, then we store that value in this feature. This way, the learning algorithm will pick up that higher values are preferred, as they signify that it is a typical OCR error to misclassify an 'r' as an 'n', in comparison to, say, a 'w' as an 'n'. We generate this confusion matrix from a subset of training data by running our alignment algorithm over it, aligning each error word to the correct word, and analyzing the changes necessary to correct the word. These counts can then be used as our confusion weights and will not bias the results of our learning algorithms as long as we do not use the same data in any of the other sets. The top 20 highest weighted confusion matrix entries (the most common substitutions/deletions/insertions on a character basis) can be seen in Table 1. Note that the blank entries in the toletter column represent a deletion. Similarly, a blank entry in the fromletter column represents an insertion. As we can see from Table 1, the most common error is 1 → comma, meaning that commas are being misread as the number one. Furthermore, words with the letter d are being misread with an l, since the most common corrections involve l → d. Many of these, like y → k and l → i, are popular errors with OCR software and typically appear in such confusion matrices.
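To make the construction of these weights concrete, the following sketch (our own simplification; the actual pipeline works from the full alignment output) counts character-level edits between aligned error/correct word pairs using Python's difflib and accumulates them into confusion frequencies:

from collections import Counter
from difflib import SequenceMatcher

def confusion_counts(aligned_pairs):
    """aligned_pairs: iterable of (ocr_word, correct_word) from the alignment step.
    Returns a Counter of (fromletter, toletter) -> frequency; '' marks a blank column."""
    counts = Counter()
    for ocr, truth in aligned_pairs:
        for op, i1, i2, j1, j2 in SequenceMatcher(None, ocr, truth).get_opcodes():
            if op == "replace":
                for f, t in zip(ocr[i1:i2], truth[j1:j2]):
                    counts[(f, t)] += 1
            elif op == "delete":
                for f in ocr[i1:i2]:
                    counts[(f, "")] += 1   # character that should be deleted
            elif op == "insert":
                for t in truth[j1:j2]:
                    counts[("", t)] += 1   # character that the OCR dropped
    return counts

print(confusion_counts([("negister", "register"), ("hegister", "register")]))
# Counter({('n', 'r'): 1, ('h', 'r'): 1})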

4.3 Unigram Frequency

Another feature we generate per candidate is the candidate frequency, also known as the term frequency, in the Google Web 1T unigram list and in the corpus the data came from (the TREC-5 corpus) [10]. This can be zero if we are using a word from a dictionary that is not found in the Google Web 1T unigram list or in the corpus, as would be the case if all instances of the word were misspelled. The idea behind using this frequency is that, when deciding which candidate to pick, a frequent word is a more probable correction than a rare word. Furthermore, if a word appears often in other parts of the corpus, then it is highly likely that it is being used again.

Table 1. Confusion matrix table, top 1–20 weights (most common)

fromletter  toletter  frequency
1           ,         3347
l           d         1421
y           k         1224
l           i         1188
fl          i         1164
h           r          918
k           z          620
s                      549
s                      493
1                      485
:                      444
                       370
.           a          267
d           j          240
o                      235
0           6          232
n           h          226
.                      221
f           3          219
5                      211

4.4 Predecessor Bigram Frequency and Successor Bigram Frequency

To build on the context based correction system we have developed, we will generate individual bigram frequencies for the predecessor and successor words of a given candidate word. In other words, for each trigram of the form WORD1 WORD2 WORD3, where WORD2 contains an OCR generated error, we query the bigram frequency of WORD1 followed by our candidate word for WORD2. Similarly, we query the bigram frequency of the candidate word followed by WORD3. We record these frequencies as the predecessor bigram frequency bkwdFreq and the successor bigram frequency fwdFreq, respectively, for each candidate. These bigram frequencies come from both the Google Web 1T 2-gram list and the corpus itself.

Given these features x_i for each candidate word, we train our model to output a decision ŷ on whether to accept the candidate (1) or reject it (0), which we compare with the correct answer y for accuracy. We do this for all training data points, update our weights, and continue until the algorithm converges. To do this we attach an output column to our feature training set so that we can both train our model and test it. A sample of what our candidates table looks like can be seen in Table 2. Note that the blank entry in the 5th row's toword column means deletion of the error word in its entirety.
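To make the five features concrete, the following sketch assembles the feature vector for one candidate inside a WORD1 WORD2 WORD3 trigram, reusing the levenshtein function sketched in Sect. 4.1. The lookup tables unigram_freq and bigram_freq and the conf_weight function stand in for the Google Web 1T/TREC-5 frequency lists and the confusion matrix described above; they are assumptions of this example, not part of our released code.

def candidate_features(prev_word, error_word, next_word, candidate,
                       conf_weight, unigram_freq, bigram_freq):
    """Build [distance, confWeight, uniFreq, bkwdFreq, fwdFreq] for one candidate."""
    return [levenshtein(error_word, candidate),          # distance (Sect. 4.1)
            conf_weight(error_word, candidate),          # confWeight (Sect. 4.2)
            unigram_freq.get(candidate, 0),              # uniFreq (Sect. 4.3)
            bigram_freq.get((prev_word, candidate), 0),  # bkwdFreq
            bigram_freq.get((candidate, next_word), 0)]  # fwdFreq

Running this over every candidate of every detected error, and appending the 0/1 output from the alignment, yields one row of the training data per candidate.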

Table 2. Candidates table

location fromword  toword      distance confusionweight unigramfrequency backwardbigramfreq forwardbigramfreq output decision docsource
1        hegister  existed     3        37              2                0                  0                 0      -1       FR940104-0-00001
1        hegister  Register    1        918             83               83                 0                 1      -1       FR940104-0-00001
1        hegister  registered  3        459             4                0                  0                 0      -1       FR940104-0-00001
1        hegister  semester    3        1               4                0                  0                 0      -1       FR940104-0-00001
2        )                     1        8               1                0                  0                 0      -1       FR940104-0-00001
2        )         (2),        3        93              2                0                  0                 0      -1       FR940104-0-00001
2        )         ),          1        186             1                0                  0                 0      -1       FR940104-0-00001
2        )         2           1        2               34               0                  0                 1      -1       FR940104-0-00001

5

TREC-5 Data Set

As part of our ongoing research on OCR post processing and error correction, we needed a data set that had a large corpus and included both the original ‘source’ text and the OCR’d version, so that we could easily test our corrections and compare their accuracy with the ground truth. The U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) Text REtrieval Conference (TREC) TREC-5 Confusion Track [10] fits these requirements. The TREC-5 Confusion Track file confusion track.tar.gz [11] is freely available for download from the TREC website. Along with visiting the TREC-5 website, please see our previous work with the data set for more details on how we use it and on its benefits and limitations [1,4].

For this experiment we use the first 99 documents of the TREC-5 corpus, ‘FR940104-0-00001’ to ‘FR940104-0-00099’. From these we generated 325,547 individual data points that consist of the 5 features mentioned along with the expected output: distance, confWeight, uniFreq, bkwdFreq, fwdFreq, output. Of these, 14,864 are correct candidates and 310,683 are incorrect candidates. This means that only 4.57% of the data points belong to class 1 (the correct candidate) and the majority of them are incorrect candidates. We will discuss this imbalance, the issues it poses, and how to tackle them later. For now, this data is split into an 80% train and 20% test ratio. The following is an example of what the features along with the expected output look like:

3, 37, 2, 0, 0, 0
1, 918, 83, 83, 0, 1
3, 459, 4, 0, 0, 0
3, 1, 4, 0, 0, 0
1, 8, 1, 0, 0, 0
3, 93, 2, 0, 0, 0
1, 186, 1, 0, 0, 0
1, 1, 78, 0, 0, 0
1, 2, 34, 0, 0, 1
3, 0, 27, 0, 0, 0
2, 120, 141, 3, 91, 1
3, 120, 1, 0, 1, 0
3, 263, 15, 0, 0, 0
3, 1, 30, 0, 1, 0
3, 0, 9, 0, 2, 0
3, 79, 3, 0, 0, 0
3, 1, 13, 0, 1, 0
2, 363, 5, 0, 0, 0
1, 620, 7, 3, 3, 1
3, 0, 3, 0, 0, 0
2, 612, 158, 17, 4, 1

Note that the first entry in this data set matches the first row of the sample candidates (Table 2). This shows how the data from the candidates table is converted into the text file shown above. The order is still the same, but unnecessary columns have been removed and only the five feature values and the output value are kept. Ultimately, the process of generating the candidates table was complex, but it is beyond the scope of this paper; all data along with documentation is available for review and is easily reproducible. Now that we have our regression matrix.txt we can proceed to training our model.
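As an illustration of how such a file could be loaded and split into the 80% train and 20% test portions, consider the sketch below; the file name and the exact split indices are assumptions of the example, not our released code.

def load_regression_matrix(path):
    """Read rows of 'distance, confWeight, uniFreq, bkwdFreq, fwdFreq, output'."""
    features, labels = [], []
    with open(path) as handle:
        for line in handle:
            values = [float(v) for v in line.split(",") if v.strip()]
            if values:
                features.append(values[:-1])
                labels.append(int(values[-1]))
    return features, labels

X, y = load_regression_matrix("regression matrix.txt")
split = int(0.8 * len(X))                      # first 80% for training
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]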

6

Experiment: Linear Classification

To have a baseline against which to compare our SVM, we first try logistic regression, a linear classifier. To do so we run the model on our training data and, after about 30 million iterations and several hours, the model achieves a 92.78% accuracy. Aside from the training time, everything looks promising so far, but when we run it on our test data we achieve only a 4.57% accuracy rate. This means that the model did not generalize at all.


EXP5: Logistic Regression
-Ran on Full DataSet
-100% Train, 100% Test (so same data we train we test with)
-Achieved 302053/325547 classifications correctly
-That's a 92.7832% accuracy. GOOD!
-Time: 31545.431312084198 seconds (8.76 hours)

EXP6: Logistic Regression
-Ran on Full DataSet
-First 80% Train, 20% Test
-Achieved 2978/65110 classifications correctly
-That's a 4.5738% accuracy. BAD!
-Time: 17780.923701047897 seconds (4.94 hours)
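For readers who wish to reproduce a comparable baseline, a logistic regression run along these lines can be set up in a few lines with scikit-learn; this is an illustrative sketch using the X_train/y_train/X_test/y_test split from Sect. 5, not the exact implementation behind EXP5 and EXP6.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(max_iter=1000)          # plain linear logistic regression
clf.fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))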

Because the accuracy is so extremely low, one's initial, but naive, reaction would be to merely flip the model's output and achieve the opposite accuracy (100% − 4.5738% = 95.43%), ending up with a model that predicts 95.43% of the cases correctly. And while the accuracy would indeed increase to 95.43%, this would be no better than a model that, regardless of input, returns a 0 (x_i → ŷ = 0, not the right candidate). This is because of the class imbalance in the data set we mentioned earlier: only 4.57% of our candidates belong to class 1, so simply assuming every data point is 0 gives a high accuracy and yet a completely useless model for choosing the right candidate.

What this tells us is that the data is not linearly separable using the five features. Because each feature can be considered a dimension, we are separating points with a hyperplane in a five-dimensional space; that is not enough, and we instead need a hypersurface. In a two-dimensional space we would be dealing with a line, and in a three-dimensional space with a plane. However, because the data is not separable by such a line or plane, we need to curve it so we can properly assign each data point to the right class.

7

Non-linear Classification with Support Vector Machines

As mentioned earlier, Support Vector Machines can be very effective in high dimensional spaces and are very versatile depending on the kernel used. In this case, because we know the data is not linearly separable, we will use a polynomial kernel, κ(x_i, x_j) = (γ x_i^T x_j + r)^d with γ > 0, and specifically we will use (u'v + 1)^3. To better understand what non-linearly separable data is, we can look at Figs. 1 and 2, where we are trying to separate the o and x points into different classes (binary classification). Figure 2 shows 2-dimensional data that can be easily separated using a straight line. There is more than one solution for what this line may be; the three gray lines in the center graph show three possible lines. On the other hand, in Fig. 1 it is impossible to draw one straight line that can separate the two classes without misclassifying multiple points. The center drawing of this figure shows two possible lines that each misclassify four points (12/16 correct, 75% accuracy). Using a non-linear classification method, however, we can achieve 100% accuracy with the circle shown in the right graph. That graph is an example of an SVM with a polynomial kernel, but many learning algorithms are able to classify non-linearly separable data.

Fig. 1. Binary classification of non-linear data using a linear kernel versus a polynomial kernel
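Written out in code, the polynomial kernel used here is simply the following (a sketch with numpy; the two example vectors are arbitrary five-feature candidates, not values from our data set):

import numpy as np

def polynomial_kernel(x_i, x_j, gamma=1.0, coef0=1.0, degree=3):
    """LIBSVM-style polynomial kernel: (gamma * x_i'x_j + coef0)^degree."""
    return (gamma * np.dot(x_i, x_j) + coef0) ** degree

x_i = np.array([2.0, 222.0, 158.0, 0.0, 7.0])
x_j = np.array([2.0, 1674.0, 23.0, 0.0, 5.0])
print(polynomial_kernel(x_i, x_j))   # (u'v + 1)^3 with gamma = 1 and r = 1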

A good comparison between SVM and logistic regression lies in how the loss function is defined in each: “SVM minimizes hinge loss while logistic regression minimizes logistic loss [and] logistic loss diverges faster than hinge loss. So, in general it will be more sensitive to outliers. Logistic loss does not go to zero even if the point is classified sufficiently confidently” [12]. The last point is known to lead to degradation in accuracy, and because of this SVMs typically perform better than logistic regression even with linearly separable data [12]. To better understand this, let us look at the right graph in Fig. 2. In this graph, the light gray lines above and beneath the dark gray line are known as the support vectors. The space between them is known as the maximum margin. They are the margins that decide what class each point belongs to.

Fig. 2. Logistic Regression versus Support Vector Machines for linearly separable data


Ultimately, SVMs try to find “the widest possible separating margin, while Logistic Regression optimizes the log likelihood function, with probabilities modeled by the sigmoid function. [Furthermore,] SVM extends by using kernel tricks, transforming datasets into rich features space, so that complex problems can be still dealt with in the same ‘linear’ fashion in the lifted hyperspace” [12]. With this in mind, we will now proceed to using an SVM for our classification of candidates.

8

Software: LIBSVM

To run the machine learning algorithms we used several software packages for different tasks. Most importantly, to run the SVM we used LIBSVM – A Library for Support Vector Machines by Chih-Chung Chang and Chih-Jen Lin [5]. LIBSVM is an integrated software package for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), and distribution estimation (one-class SVM). It supports multi-class classification [5]. It is available in many different programming languages, including Python, which is the version we used. The main available options and features are:

options:
-s svm_type : set type of SVM (default 0)
    0 -- C-SVC
    1 -- nu-SVC
    2 -- one-class SVM
    3 -- epsilon-SVR
    4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v
    1 -- polynomial: (gamma*u'*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
    3 -- sigmoid: tanh(gamma*u'*v + coef0)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)

Since we are using the polynomial kernel (u'v + 1)^3, we set the following options:

param = svm_parameter('-s 0 -c 10 -t 1 -g 1 -r 1 -d 3')
"Classify a binary data with polynomial kernel (u'v+1)^3 and C = 10"

If we wanted to run a linear kernel for comparison, we would set:

param = svm_parameter('-s 0 -t 0 -h 0')
"Linear binary classification C-SVC SVM"
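Putting these options together, a minimal end-to-end training and prediction run with LIBSVM's Python interface could look like the sketch below, again reusing the 80%/20% split from Sect. 5. Depending on the installed version, the import may instead be from svmutil import *.

from libsvm.svmutil import svm_problem, svm_parameter, svm_train, svm_predict

param = svm_parameter('-s 0 -c 10 -t 1 -g 1 -r 1 -d 3')   # C-SVC, (u'v+1)^3, C = 10
problem = svm_problem(y_train, X_train)                    # labels first, then features
model = svm_train(problem, param)

# Returns predicted labels, (accuracy, MSE, squared correlation), and decision values.
predicted, accuracy, values = svm_predict(y_test, X_test, model)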


9


SVM Results with Polynomial Kernels

First we run the SVM model with a polynomial kernel on our training data as we did before.

EXP4:
param = svm_parameter('-s 0 -c 10 -t 1 -g 1 -r 1 -d 3')
"Classify a binary data with polynomial kernel (u'v+1)^3 and C = 10"
-Ran on Full DataSet
-100% Train, 100% Test (so same data we train we test with)

optimization finished, #iter = 32554700
nu = 0.050608
obj = -1531685219869132324864.000000, rho = -60738085076395.093750
nSV = 16578, nBSV = 16395
Total nSV = 16578
Accuracy = 96.2918% (313475/325547) (classification)
Time: 26164.813225507736 seconds (7.27 hours)

To understand what each line above means, we look at LIBSVM's website FAQ (https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html):

– iter is the number of iterations performed.
– obj is the optimal objective value of the dual SVM problem.
– rho is the bias term in the decision function sgn(w^T x − rho).
– nSV and nBSV are the number of support vectors and bounded support vectors (i.e., α_i = C).
– nu-svm is a somewhat equivalent form of C-SVM where C is replaced by nu.
– nu simply shows the corresponding parameter.

More details can be found in the LIBSVM document [5]. Furthermore, the above data, along with all the experiments run, is located in the Results folder. For further details and instructions on how to run and replicate each and all experiments, or run new ones, please see QuickStart.txt and [13].

After about 32.5 million iterations and several hours, the model achieves a 96.29% accuracy. This is statistically significant compared to the 92.78% achieved by the logistic regression; however, this is just training data, and the real test comes when we run it on the test data to see how well it generalized. Therefore, we run with an 80% Train and 20% Test distribution using the same SVM model parameters and polynomial kernel:

EXP8:
param = svm_parameter('-s 0 -c 10 -t 1 -g 1 -r 1 -d 3')
"Classify a binary data with polynomial kernel (u'v+1)^3 and C = 10"
"Re-run of EXP2, but with Precision and Recall output"
-Ran on Full DataSet
-First 80% Train, 20% Test

optimization finished, #iter = 26043700
nu = 0.044441
obj = -1033776563990009085952.000000, rho = -1896287955440235.250000
nSV = 11840, nBSV = 11439
Total nSV = 11840
Accuracy = 93.6231% (60958/65110) (classification)

Metrics:
Total Instances: 65110
True Positive:   494
False Positive:  85
False Negative:  4067
True Negative:   60464
Precision:       85.31951640759931%
Recall:          10.830958123218593%
Accuracy:        93.62309937029643%
F-Score:         0.19221789883268484

Time: 18892.750891447067 seconds (5.25 hours)

After about 26 million iterations and 5.25 hours, the model achieves a 93.62% accuracy on the test data. This may not be as high as the 96.29% accuracy on the training data, but that is to be expected, as the model has never seen the test data. This is a strong indication that there is very little, if any, overfitting in our model and that it has generalized successfully. Note that we ran additional experiments (EXP3, EXP7) to conclude that the best training/test ratio was 80% Train and 20% Test. By this we mean that we need at least about 260,000 data points to train our algorithm properly. Because the TREC-5 data set is far larger than this, we will experiment on further parts of it with just this small percentage of the data as training in future tests.

At this point we have improved tremendously between the linear example and the polynomial kernel based on accuracy alone, but accuracy does not tell the full story of the performance of our model on the test data. This is where Precision, Recall, and F-Score come into play. These metrics allow us to better judge how well the learning algorithm is classifying and what the 6.38% error rate (1 − accuracy = error rate) really means for our task of trying to choose the right candidate word. But first let us take a look at the four possible outcomes that our learning algorithm can have. If, given the five features, the model accepts the candidate word and that word is the correct word from the aligned ground truth, then that is a True Positive. If the model accepts the candidate word but that word was not the correct word, then that is a False Positive; false positives count as errors and are included in the error rate. If, given the five features, the model rejects the candidate word and that word was the correct word from the aligned ground truth text, then that is a False Negative. This means that the model should not have rejected the word, but it did; we can think of false negatives as the model not having the confidence to accept the candidate given the five features. Finally, if the model rejects the word and the word was not the correct candidate, then that is a True Negative. These are good because they mean that the model did not accept bad choices. Together these four outcomes add up to the total number of test cases: TP + FP + FN + TN = Total Instances, or 494 + 85 + 4067 + 60464 = 65,110.

For our problem, in deciding whether we want a high recall or a high precision model, we have to decide whether we want an OCR post processing system that will correct a few words without making mistakes (high precision) but still leave mistakes in the text, or a model that will detect and correct many more mistakes but will sometimes correct to the wrong choice (high recall). To decide, we must consider whether this is the final step of the OCR post processing or just one part of a pipeline of several models it will be fed into. If the latter is the case, we can choose the second variant, because while we will suggest incorrect words as correct, we will also include the correct words; we are narrowing the candidate list and then leaving it for a model down the pipeline to hopefully make the right choice. In other words, we detect the correct candidate along with some incorrect ones and pool them together for a later stage to decide, which is the ideal choice for that kind of system. However, if this is the final post processing step before the text reaches the end user, then we want a high precision model that only chooses correct candidates in order not to introduce harder to detect errors into the OCR'd text. If a human reading the text sees a misspelled word, they will know that it is most likely an OCR generated error and may even correct it on the fly if it is not too hard; but if the same human sees a correctly spelled word that does not match the original text, they may assume it is a mistake of the text and not of the OCR process.

Coming back to our results, even though our accuracy was high at 93.62%, the recall is very low at only 10.83%. This means there were a lot of false negatives in our test, or instances where the right candidate was suggested but the model rejected it: 4,067 correct words were rejected while only 494 candidates were accepted. This, of course, is not very good. Nevertheless, the precision was very high at 85.32%, with only 85 instances accepted that should not have been accepted (false positives). Therefore, because we are using the harmonic mean, our F-score was low at 0.1922. This raises the question: how can we improve upon this result? Among many options there are two that we will tackle: rescaling our data and balancing our classes.
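For reference, the metrics follow directly from these four counts; the short sketch below reproduces the EXP8 numbers reported above.

def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_score

# EXP8 counts: TP = 494, FP = 85, FN = 4067, TN = 60464
print(classification_metrics(494, 85, 4067, 60464))
# -> (0.8532..., 0.1083..., 0.9362..., 0.1922...)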

9.1 LIBSVM Experiment with MinMax Normalization

Normalization, also known as MinMax scaling, rescales the data into the range between 0 and 1 for a positive data set, like ours, or into the range −1 to 1 for a data set that contains negative values. To do this, one computes the range of each feature, as we did earlier, and then uses these max(x_i) and min(x_i) values to update each instance x_ij in the dataset:

x_ij = (x_ij − min(x_i)) / (max(x_i) − min(x_i))
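A per-feature implementation of this rescaling is sketched below; it is equivalent to what scikit-learn's MinMaxScaler does, and in practice the minimum and maximum can be taken from the training split and then reused on the test split.

def minmax_fit(rows):
    """Per-feature minimum and maximum over a list of feature rows."""
    mins = [min(column) for column in zip(*rows)]
    maxs = [max(column) for column in zip(*rows)]
    return mins, maxs

def minmax_transform(rows, mins, maxs):
    """Rescale every feature to the [0, 1] range using the fitted min/max."""
    return [[(value - lo) / (hi - lo) if hi > lo else 0.0
             for value, lo, hi in zip(row, mins, maxs)]
            for row in rows]

mins, maxs = minmax_fit(X_train)
X_train_norm = minmax_transform(X_train, mins, maxs)
X_test_norm = minmax_transform(X_test, mins, maxs)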

One disadvantage of minmax scaling compared to z-score standardization is that minmax is sensitive to outlier data points: if a feature has outliers, they can unnecessarily compress the range of the majority of the non-outlier data. We normalized the dataset using a MinMax algorithm and then re-ran the polynomial kernel experiment with an 80% Train and 20% Test distribution.

EXP9:
param = svm_parameter('-s 0 -c 10 -t 1 -g 1 -r 1 -d 3')
"Classify a binary data with polynomial kernel (u'v+1)^3 and C = 10"
"Re-run of EXP8, but with Normalized dataset"
-Ran on Full DataSet
-First 80% Train, 20% Test

optimization finished, #iter = 130782
nu = 0.074323
obj = -192342.464438, rho = -1.000211
nSV = 19409, nBSV = 19303
Total nSV = 19409
Accuracy = 96.724% (62977/65110) (classification)

Metrics:
Total Instances: 65110
True Positive:   2674
False Positive:  246
False Negative:  1887
True Negative:   60303
Precision:       91.57534246575342%
Recall:          58.62749397062048%
Accuracy:        96.7240055291046%
F-Score:         0.7148776901483759

Time: 393.0401563644409 seconds (6.55 minutes!)

As we can see, normalization (normalizing the full dataset took only a few seconds) decreased the training time from hours to only 6 minutes and 33 seconds, with 130,782 iterations. The model achieved a 96.72% accuracy on the test data. This is even slightly higher than its performance on the training data; the model definitely generalized successfully. Analyzing it further, we can see that precision was at 91.58%, meaning that there was a very low number of false positives in comparison to the number of true positives. Do note that we still had more false positives in this run than with the original unnormalized data set, where we had only 85 false positives. However, in the previous version we accepted only 494 candidates that were correct, whereas in this version we accepted 2,674 candidates that were correct. This is why the precision still went up even though we had more false positives. Recall also greatly increased, to 58.63%, meaning that the number of false negatives decreased greatly, from 4,067 to 1,887.

Normalizing is a well known trick to help speed up and improve several machine learning algorithms. As we can see, our accuracy increased from 93.62% to 96.72%, a statistically significant difference, but most importantly recall increased from 10.83% to 58.63% while precision was maintained and even slightly increased, from 85.32% to 91.58%. This also means that our F-score increased from 0.1922 to 0.7149. Aside from the small increase in false positives, the learning algorithm clearly performed better, and was trained much faster, thanks to the minmax normalized dataset. We have now successfully corrected 58.63% of the errors detected in the OCR post processing stage. Next we try to improve further by balancing our class distribution in the training data.

10

Over-Sampling with SMOTE to Balance Class Distribution

Next we use SMOTE [14], as implemented in the Imbalanced-learn [15] library for scikit-learn, to balance our classes and try to improve our precision and recall. The class distribution of the training data is 250,135 instances belonging to class 0 (incorrect candidates) and 10,303 instances belonging to class 1 (correct candidates), which means that class 1 represents only 3.95% of all training instances. After we run SMOTE, class 0 still has 250,135 instances, but class 1 now also has 250,135 instances, so the class distribution is balanced. The new size of the training set is almost twice as big, at 500,270. However, because we still split our train and test data at 80% Train and 20% Test, and because the added instances are synthetic, we still consider it to have that same ratio even though the number of training instances nearly doubled. We re-train our model and run it.

EXP14:
param = svm_parameter('-s 0 -c 10 -t 1 -g 1 -r 1 -d 3')
"Classify a binary data with polynomial kernel (u'v+1)^3 and C = 10"
"Re-run of EXP9, but with SMOTE applied to the Normalized dataset"
-Ran on Full DataSet but using the split datasets we made.
-Only training DataSet has SMOTE applied to it to avoid biasing test set.
-First 80% Train, 20% Test

optimization finished, #iter = 517981
nu = 0.278321
obj = -1369922.733221, rho = -2.488460
nSV = 139339, nBSV = 139145
Total nSV = 139339
Accuracy = 91.3069% (59449/65109) (classification)

Metrics:
Total Instances: 65109
True Positive:   4191
False Positive:  5290
False Negative:  370
True Negative:   55258
Precision:       44.204197869423055%
Recall:          91.88774391580795%
Accuracy:        91.30688537682963%
F-Score:         0.5969235151687794

Time: 4078.6771759986877 seconds (67.98 minutes!)
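The oversampling step itself takes only a few lines with the Imbalanced-learn library; as in EXP14, only the (normalized) training split is resampled so the test set is not biased. This is a sketch whose variable names continue the earlier examples.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)                 # synthetic minority over-sampling
X_train_bal, y_train_bal = smote.fit_resample(X_train_norm, y_train)
print(sum(1 for label in y_train_bal if label == 1),
      sum(1 for label in y_train_bal if label == 0))   # both classes now equal in size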

The model performed better in terms of accuracy on the test data than it did on the training data, with a 91.30% accuracy versus the training data's 87.81% accuracy. Both, however, fall short of the 96.72% accuracy of the first experiment with the imbalanced normalized data set. In terms of precision, we had a low 44.20%, meaning the model did not generalize well in that respect: we had a much higher number of false positives than before. However, we achieved the highest recall yet, even surpassing the training set, with a recall of 91.89% compared to the training set's 85.71% and the imbalanced model's 58.63%; this means we have very few false negatives. It is very clear that balancing the classes with SMOTE greatly improved the recall at the cost of precision, an interesting trade-off. This resulted in an F-Score of 0.5969, compared to the imbalanced model's 0.7149. On closer analysis we can see that the model chooses roughly as many incorrect candidates as correct ones. All in all, while we had a very high recall, the trade-off loss in precision means SMOTE is not the clear winner; the normalized imbalanced dataset still gives the best results so far.

11

Discussion and Error Analysis

Given the varying results with the SMOTE dataset, it appears an ensemble learning technique could help combine both models into one that maintains a high recall without so much precision loss. Further exploration of this approach will be done in the near future. Furthermore, there are many other class balancing algorithms to experiment with, including variations of SMOTE, ADASYN [16], and under-sampling techniques like Tomek-Link [17], that are worth exploring.

In addition, we can analyze the errors that the learning algorithm is making to try to add features or to help us modify hyperparameters to improve the model. To do this we will look carefully at the confusion matrix and convert our normalized data back to the original values by reversing the minmax scaler formula. The first error we will look at is a false positive where the word ‘projects’ has an OCR generated error resulting in the word ‘prodects1’; when fed through the model, the word returned as what the model believes to be the correct candidate is ‘products,’. To understand why, we can take a closer look at the features of all the candidates for that word:

116529, prodects1, products,   2, 222,  158, 0, 7, 0, 0
116529, prodects1, products,,  2, 1674, 23,  0, 5, 0, 1
116529, prodects1, products.,  2, 30,   28,  0, 0, 0, 0
116529, prodects1, projected,  3, 120,  7,   0, 0, 0, 0
116529, prodects1, projects,   2, 342,  7,   0, 2, 0, 0
116529, prodects1, projects,,  2, 1793, 1,   1, 1, 1, 0
116529, prodects1, Projects.,  3, 100,  3,   0, 0, 0, 0
116529, prodects1, protect,    3, 39,   18,  0, 1, 0, 0

The columns, in order from left to right, are location, fromword, toword, distance, confusionweight, unigramfrequency, backwardbigramfreq, forwardbigramfreq, output (ground truth), and decision (model's output). From this we can see that the model rejected most of the incorrect suggestions, which is good, but it also rejected the correct one and accepted an incorrect one. The confusion weights were similar for the incorrect candidate it picked and for the correct candidate it did not pick, 1674 and 1793 respectively, but ‘products,’ had a higher unigram frequency, 23 versus 1 for ‘projects,’. ‘products,’ also had a higher forward bigram frequency, 5 versus 1, than ‘projects,’. However, ‘projects,’ did have one instance of a backwards bigram versus a frequency of 0 for ‘products,’. Ultimately both are very similar, and context is absolutely necessary to see which one is correct. To examine this we can identify the surrounding 6-gram phrase from both the original text and the OCR'd text thanks to the alignment algorithm [4]:

Original 6-gram:   bany protection, projects, and NMFS anticipates
OCR'd text 6-gram: bank protection, prodects1, and NMFS anticipates

Note that both words ‘projects’ and ‘products’ would fit as correct candidates given the context of the 6-gram. In this sense the algorithm made the human choice to pick the more common word and hope for the best; it is doubtful a human would know which of the two to pick with just this information either. So what could we have done? Do we give up? Of course not: in future models we could look further than the bigram frequency in our feature set and try to see what the document is about. The more context we have, the more features we have and the better the chances of predicting the right word. However, with more context and more features the dataset gets bigger and the training time slower, so at a certain size the trade-off is too high; further exploration into this will be done in the future.

Next we can look at the first example of a false negative we encountered. In this case, the suggested candidate word ‘Spring’ was rejected as the correct candidate for the error word ‘sprlng’.

115768, sprlng, spread,  3, 0,   2,  0, 0, 0, 0
115768, sprlng, Spring,  2, 594, 10, 1, 0, 1, 0
115768, sprlng, Spring,, 3, 458, 1,  0, 0, 0, 0
115768, sprlng, string), 3, 399, 1,  0, 0, 0, 0
115768, sprlng, strong,  2, 29,  5,  0, 0, 0, 0
115768, sprlng, wrong,   3, 29,  3,  0, 0, 0, 0
115769, creey,  agree,   3, 73,  22, 0, 0, 0, 0
115769, creey,  came,    3, 1,   1,  0, 0, 0, 0


The character ‘i’ was misread as an ‘l’ by the OCR scanning software, a very common mistake. In fact, it is such a common mistake that our confusion matrix ranks the corresponding correction (l → i) number 4 among the most common error corrections.

Query:
SELECT * FROM confusionmatrix ORDER BY frequency DESC LIMIT 5;

Output:
1,  ,, 3347
l,  d, 1421
y,  k, 1224
l,  i, 1188
fl, i, 1164

As we can see, the fourth most common edit is (l → i); however, the confusion weight for it (1188) does not match the confusion weight for our candidate entry. The reason is that technically there are two edits happening: the ‘s’ is being capitalized (s → S) in addition to the l → i. In future versions we intend to improve on this by lowercasing the entire corpus, thereby eliminating inflated Levenshtein edit distances between the candidate word and the error word. In addition, we will remove all punctuation from candidate words and correct words. By doing this we will be able to combine entries like ‘Spring’ and ‘Spring,’ in order to make our current features more robust. Next we identify the 5-gram of both the original text and the OCR'd text:

Original 5-gram:   to enlarge Spring Creek Debris
OCR'd text 5-gram: . enlarge Sprlng Creey Lebris

Looking at the 5-grams above, the second problem that caused the misclassification is having words with errors within the trigram of the word, which makes it hard to accurately measure the forward and backward bigram frequencies. We mentioned a similar issue when working on the Google Web 1T experiment, and as we mentioned then, a possible approach to solving this is executing multiple passes when correcting the text, using our previous corrections to correct neighboring words by tackling the easy words first and then the harder ones. Another solution could be ensemble learning: a model that heavily focuses on the confusion matrix feature would certainly have corrected this error instance, and by having other models that focus on other types of errors we can put them together and hope that the right correction appears in more than 50% of the models. Alternatively, we can stack the models, similar to how multilayer perceptrons work, so that each focuses on correcting specific errors and lets others pass through to be corrected by later models. Ultimately, the fact that the same issues that appear here also appeared in the Google Web 1T context based corrections is a testament to how machine learning is a way to automate corrections, but not a magic solution to it all. It may be called machine learning, but there very much is still a human factor involved in nurturing the learning algorithm.

12

Conclusion and Future Work

There are still many different types of errors to report on and further analyze. Furthermore, one may ask where the validation tests are. We are not done experimenting yet, as we have mentioned throughout, so we will reserve those until we publish more concrete results in the near future. In the meantime, the test results have shown that the model using the normalized data set displays the most promising results, while balancing the dataset gave us the highest recall (91.89%) at the cost of precision.

All in all, we have shown our workflow for correcting OCR generated errors using Support Vector Machines, which consists of taking part of the errors and correcting them using the ground truth text and our alignment algorithm. We then generate five values for each candidate word: the Levenshtein edit distance, the character confusion weight for the correction, the unigram frequency of the candidate, the bigram frequency of the candidate and the previous word, and the bigram frequency of the candidate and the following word. These five features, in combination with the output of the alignment algorithm on whether this is the correct candidate or not, are used to generate the dataset that we then use to train and test our models, with the goal of creating a model that generalizes successfully so that it can be used on future unseen text. Because choosing the right kernel is very important in SVMs, in the future we intend to test with the Radial Basis Function (RBF) kernel, among others. We then showed how normalization of a data set can be very important for some machine learning techniques, improving not just the training time (hours to minutes in our case) but also the performance of the learning algorithm. Finally, we attempted to balance the two classes with SMOTE and, while that increased our recall, it decreased our overall F-score and accuracy.

Much work remains to be done. We plan to use ensemble learning or stacked learning to combine multiple methods to improve our results, and we intend to continue this research and publish our findings in the near future. Due to our ongoing goal to produce and encourage accessible reproducible research [18,19], our implementation source code and ongoing experiments and results are available in multiple repositories including Docker, Zenodo, & Git (search: unlvcs).


References

1. Fonseca Cacho, J.R., Taghva, K., Alvarez, D.: Using the Google Web 1T 5-gram corpus for OCR error correction. In: 16th International Conference on Information Technology-New Generations (ITNG 2019), pp. 505–511. Springer (2019)
2. Brants, T., Franz, A.: Web 1T 5-gram version 1 (2006)
3. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Doklady 10(8), 707–710 (1966)
4. Fonseca Cacho, J.R., Taghva, K.: Aligning ground truth text with OCR degraded text. In: Intelligent Computing-Proceedings of the Computing Conference, pp. 815–833. Springer (2019)
5. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). http://www.csie.ntu.edu.tw/~cjlin/libsvm
6. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
7. Hsu, C.-W., Chang, C.-C., Lin, C.-J., et al.: A practical guide to support vector classification (2003)
8. Taghva, K., Stofsky, E.: OCRSpell: an interactive spelling correction system for OCR errors in text. Int. J. Doc. Anal. Recogn. 3(3), 125–137 (2001)
9. Taghva, K., Nartker, T., Borsack, J.: Information access in the presence of OCR errors. In: Proceedings of the 1st ACM Workshop on Hardcopy Document Processing, pp. 1–8. ACM (2004)
10. Kantor, P.B., Voorhees, E.M.: The TREC-5 confusion track: comparing retrieval methods for scanned text. Inf. Retrieval 2(2–3), 165–176 (2000)
11. TREC-5 confusion track. https://trec.nist.gov/data/t5 confusion.html. Accessed 10 Oct 2017
12. Drakos, G.: Support vector machine vs logistic regression. https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f. Accessed 21 June 2019
13. Fonseca Cacho, J.R.: Improving OCR post processing with machine learning tools. Ph.D. dissertation, University of Nevada, Las Vegas (2019)
14. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
15. Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017). http://jmlr.org/papers/v18/16-365.html
16. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008)
17. Devi, D., Purkayastha, B., et al.: Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance. Pattern Recogn. Lett. 93, 3–12 (2017)
18. Fonseca Cacho, J.R., Taghva, K.: Reproducible research in document analysis and recognition. In: Information Technology-New Generations, pp. 389–395. Springer (2018)
19. Fonseca Cacho, J.R., Taghva, K.: The state of reproducible research in computer science. In: Latifi, S. (ed.) 17th International Conference on Information Technology-New Generations (ITNG 2020). Advances in Intelligent Systems and Computing, vol. 1134. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43020-7_68
