Artificial Intelligence Applications and Innovations: 15th IFIP WG 12.5 International Conference, AIAI 2019, Hersonissos, Crete, Greece, May 24–26, 2019, Proceedings [1st ed.] 978-3-030-19822-0;978-3-030-19823-7

This book constitutes the refereed proceedings of the 15th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2019, held in Hersonissos, Crete, Greece, in May 2019.


English · XXIV, 689 [694] pages · 2019


Table of contents :
Front Matter ....Pages i-xxiv
Front Matter ....Pages 1-1
The Power of the “Pursuit” Learning Paradigm in the Partitioning of Data (Abdolreza Shirvani, B. John Oommen)....Pages 3-16
Front Matter ....Pages 17-17
Cyber-Typhon: An Online Multi-task Anomaly Detection Framework (Konstantinos Demertzis, Lazaros Iliadis, Panayiotis Kikiras, Nikos Tziritas)....Pages 19-36
Investigating the Benefits of Exploiting Incremental Learners Under Active Learning Scheme (Stamatis Karlos, Vasileios G. Kanas, Nikos Fazakis, Christos Aridas, Sotiris Kotsiantis)....Pages 37-49
The Blockchain Random Neural Network in Cybersecurity and the Internet of Things (Will Serrano)....Pages 50-63
Front Matter ....Pages 65-65
A Visual Neural Network for Robust Collision Perception in Vehicle Driving Scenarios (Qinbing Fu, Nicola Bellotto, Huatian Wang, F. Claire Rind, Hongxin Wang, Shigang Yue)....Pages 67-79
An LGMD Based Competitive Collision Avoidance Strategy for UAV (Jiannan Zhao, Xingzao Ma, Qinbing Fu, Cheng Hu, Shigang Yue)....Pages 80-91
Mixture Modules Based Intelligent Control System for Autonomous Driving (Tangyike Zhang, Songyi Zhang, Yu Chen, Chao Xia, Shitao Chen, Nanning Zheng)....Pages 92-104
Front Matter ....Pages 105-105
An Adaptive Temporal-Causal Network Model for Stress Extinction Using Fluoxetine (S. Sahand Mohammadi Ziabari)....Pages 107-119
Clustering Diagnostic Profiles of Patients (Jaakko Hollmén, Panagiotis Papapetrou)....Pages 120-126
Emotion Analysis in Hospital Bedside Infotainment Platforms Using Speeded up Robust Features (A. Kallipolitis, M. Galliakis, A. Menychtas, I. Maglogiannis)....Pages 127-138
FISUL: A Framework for Detecting Adverse Drug Events from Heterogeneous Medical Sources Using Feature Importance (Corinne G. Allaart, Lena Mondrejevski, Panagiotis Papapetrou)....Pages 139-151
Front Matter ....Pages 153-153
A New Topology-Preserving Distance Metric with Applications to Multi-dimensional Data Clustering (Konstantinos K. Delibasis)....Pages 155-166
Classification of Incomplete Data Using Autoencoder and Evidential Reasoning (Suvra Jyoti Choudhury, Nikhil R. Pal)....Pages 167-177
Dynamic Reliable Voting in Ensemble Learning (Agus Budi Raharjo, Mohamed Quafafou)....Pages 178-187
Extracting Action Sensitive Features to Facilitate Weakly-Supervised Action Localization (Zijian Kang, Le Wang, Ziyi Liu, Qilin Zhang, Nanning Zheng)....Pages 188-201
Image Recognition Based on Combined Filters with Pseudoinverse Learning Algorithm (Xiaodan Deng, Xiaoxuan Sun, Ping Guo, Qian Yin)....Pages 202-209
Front Matter ....Pages 211-211
Design-Parameters Optimization of a Deep-Groove Ball Bearing for Different Boundary Dimensions, Employing Amended Differential Evolution Algorithm (Parthiv B. Rana, Jigar L. Patel, D. I. Lalwani)....Pages 213-222
Exploring Brain Effective Connectivity in Visual Perception Using a Hierarchical Correlation Network (Siyu Yu, Nanning Zheng, Hao Wu, Ming Du, Badong Chen)....Pages 223-235
Solving the Talent Scheduling Problem by Parallel Constraint Programming (Ke Liu, Sven Löffler, Petra Hofstedt)....Pages 236-244
Front Matter ....Pages 245-245
A Deep Reinforcement Learning Approach for Automated Cryptocurrency Trading (Giorgio Lucarelli, Matteo Borrotti)....Pages 247-258
Capacity Requirements Planning for Production Companies Using Deep Reinforcement Learning (Harald Schallner)....Pages 259-271
Comparison of Neural Network Optimizers for Relative Ranking Retention Between Neural Architectures (George Kyriakides, Konstantinos Margaritis)....Pages 272-281
Detecting Violent Robberies in CCTV Videos Using Deep Learning (Giorgio Morales, Itamar Salazar-Reque, Joel Telles, Daniel Díaz)....Pages 282-291
Diversity Regularized Adversarial Deep Learning (Babajide O. Ayinde, Keishin Nishihama, Jacek M. Zurada)....Pages 292-306
Interpretability of a Deep Learning Model for Rodents Brain Semantic Segmentation (Leonardo Nogueira Matos, Mariana Fontainhas Rodrigues, Ricardo Magalhães, Victor Alves, Paulo Novais)....Pages 307-318
Learning and Detecting Stuttering Disorders (Fabio Fassetti, Ilaria Fassetti, Simona Nisticò)....Pages 319-330
Localization of Epileptic Foci by Using Convolutional Neural Network Based on iEEG (Linfeng Sui, Xuyang Zhao, Qibin Zhao, Toshihisa Tanaka, Jianting Cao)....Pages 331-339
Review Spam Detection Using Word Embeddings and Deep Neural Networks (Aliaksandr Barushka, Petr Hajek)....Pages 340-350
Tools for Semi-automatic Preparation of Training Data for OCR (Ladislav Lenc, Jiří Martínek, Pavel Král)....Pages 351-361
Training Strategies for OCR Systems for Historical Documents (Jiří Martínek, Ladislav Lenc, Pavel Král)....Pages 362-373
A Review on the Application of Deep Learning in Legal Domain (Neha Bansal, Arun Sharma, R. K. Singh)....Pages 374-381
Long-Short Term Memory for an Effective Short-Term Weather Forecasting Model Using Surface Weather Data (Pradeep Hewage, Ardhendu Behera, Marcello Trovati, Ella Pereira)....Pages 382-390
Segmentation Methods for Image Classification Using a Convolutional Neural Network on AR-Sandbox (Andres Ovidio Restrepo Rodriguez, Daniel Esteban Casas Mateus, Paulo Alonso Gaona Garcia, Adriana Gomez Acosta, Carlos Enrique Montenegro Marin)....Pages 391-398
Front Matter ....Pages 399-399
A Hybrid Model Based on Fuzzy Rules to Act on the Diagnosed of Autism in Adults (Augusto J. Guimarães, Vinicius J. Silva Araujo, Vanessa S. Araujo, Lucas O. Batista, Paulo V. de Campos Souza)....Pages 401-412
An Unsupervised Fuzzy Rule-Based Method for Structure Preserving Dimensionality Reduction with Prediction Ability (Suchismita Das, Nikhil R. Pal)....Pages 413-424
Interpretable Fuzzy Rule-Based Systems for Detecting Financial Statement Fraud (Petr Hajek)....Pages 425-436
Front Matter ....Pages 437-437
Learning Automata-Based Solutions to the Single Elevator Problem (O. Ghaleb, B. John Oommen)....Pages 439-450
Optimizing Self-organizing Lists-on-Lists Using Enhanced Object Partitioning (O. Ekaba Bisong, B. John Oommen)....Pages 451-463
EduBAI: An Educational Platform for Logic-Based Reasoning (Dimitrios Arampatzis, Maria Doulgeraki, Michail Giannoulis, Evropi Stefanidi, Theodore Patkos)....Pages 464-472
Front Matter ....Pages 473-473
A Machine Learning Tool for Interpreting Differences in Cognition Using Brain Features (Tiago Azevedo, Luca Passamonti, Pietro Lió, Nicola Toschi)....Pages 475-486
Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages (András Dobó, János Csirik)....Pages 487-499
Distributed Community Prediction for Social Graphs Based on Louvain Algorithm (Christos Makris, Dionisios Pettas, Georgios Pispirigos)....Pages 500-511
Iliou Machine Learning Data Preprocessing Method for Suicide Prediction from Family History (Theodoros Iliou, Georgia Konstantopoulou, Christina Lymperopoulou, Konstantinos Anastasopoulos, George Anastassopoulos, Dimitrios Margounakis et al.)....Pages 512-519
Ontology Population Framework of MAGNETO for Instantiating Heterogeneous Forensic Data Modalities (Ernst-Josef Behmer, Krishna Chandramouli, Victor Garrido, Dirk Mühlenberg, Dennis Müller, Wilmuth Müller et al.)....Pages 520-531
Random Forest Surrogate Models to Support Design Space Exploration in Aerospace Use-Case (Siva Krishna Dasari, Abbas Cheddad, Petter Andersson)....Pages 532-544
Stacking Strong Ensembles of Classifiers (Stamatios-Aggelos N. Alexandropoulos, Christos K. Aridas, Sotiris B. Kotsiantis, Michael N. Vrahatis)....Pages 545-556
Front Matter ....Pages 557-557
An Agent-Based Framework for Complex Networks (Alexander Wendt, Maximilian Götzinger, Thilo Sauter)....Pages 559-570
Studying Emotions at Work Using Agent-Based Modeling and Simulation (Hanen Lejmi-Riahi, Mouna Belhaj, Lamjed Ben Said)....Pages 571-583
Towards an Adaption and Personalisation Solution Based on Multi Agent System Applied on Serious Games (Spyridon Blatsios, Ioannis Refanidis)....Pages 584-594
Front Matter ....Pages 595-595
Constant Angular Velocity Regulation for Visually Guided Terrain Following (Huatian Wang, Qinbing Fu, Hongxin Wang, Jigen Peng, Shigang Yue)....Pages 597-608
Motion Segmentation Based on Structure-Texture Decomposition and Improved Three Frame Differencing (Sandeep Singh Sengar)....Pages 609-622
Using Shallow Neural Network Fitting Technique to Improve Calibration Accuracy of Modeless Robots (Ying Bai, Dali Wang)....Pages 623-631
Front Matter ....Pages 633-633
Banner Personalization for e-Commerce (Ioannis Maniadis, Konstantinos N. Vavliakis, Andreas L. Symeonidis)....Pages 635-646
Hybrid Data Set Optimization in Recommender Systems Using Fuzzy T-Norms (Antonios Papaleonidas, Elias Pimenidis, Lazaros Iliadis)....Pages 647-659
MuSIF: A Product Recommendation System Based on Multi-source Implicit Feedback (Ioannis Schoinas, Christos Tjortjis)....Pages 660-672
On the Invariance of the SELU Activation Function on Algorithm and Hyperparameter Selection in Neural Network Recommenders (Flora Sakketou, Nicholas Ampazis)....Pages 673-685
Back Matter ....Pages 687-689


IFIP AICT 559

John MacIntyre Ilias Maglogiannis Lazaros Iliadis Elias Pimenidis (Eds.)

Artificial Intelligence Applications and Innovations 15th IFIP WG 12.5 International Conference, AIAI 2019 Hersonissos, Crete, Greece, May 24–26, 2019 Proceedings


IFIP Advances in Information and Communication Technology

Editor-in-Chief
Kai Rannenberg, Goethe University Frankfurt, Germany

Editorial Board Members
TC 1 – Foundations of Computer Science: Jacques Sakarovitch, Télécom ParisTech, France
TC 2 – Software: Theory and Practice: Michael Goedicke, University of Duisburg-Essen, Germany
TC 3 – Education: Arthur Tatnall, Victoria University, Melbourne, Australia
TC 5 – Information Technology Applications: Erich J. Neuhold, University of Vienna, Austria
TC 6 – Communication Systems: Aiko Pras, University of Twente, Enschede, The Netherlands
TC 7 – System Modeling and Optimization: Fredi Tröltzsch, TU Berlin, Germany
TC 8 – Information Systems: Jan Pries-Heje, Roskilde University, Denmark
TC 9 – ICT and Society: David Kreps, University of Salford, Greater Manchester, UK
TC 10 – Computer Systems Technology: Ricardo Reis, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
TC 11 – Security and Privacy Protection in Information Processing Systems: Steven Furnell, Plymouth University, UK
TC 12 – Artificial Intelligence: Ulrich Furbach, University of Koblenz-Landau, Germany
TC 13 – Human-Computer Interaction: Marco Winckler, University of Nice Sophia Antipolis, France
TC 14 – Entertainment Computing: Rainer Malaka, University of Bremen, Germany


IFIP – The International Federation for Information Processing

IFIP was founded in 1960 under the auspices of UNESCO, following the first World Computer Congress held in Paris the previous year. A federation for societies working in information processing, IFIP's aim is two-fold: to support information processing in the countries of its members and to encourage technology transfer to developing nations. As its mission statement clearly states:

IFIP is the global non-profit federation of societies of ICT professionals that aims at achieving a worldwide professional and socially responsible development and application of information and communication technologies.

IFIP is a non-profit-making organization, run almost solely by 2500 volunteers. It operates through a number of technical committees and working groups, which organize events and publications. IFIP's events range from large international open conferences to working conferences and local seminars.

The flagship event is the IFIP World Computer Congress, at which both invited and contributed papers are presented. Contributed papers are rigorously refereed and the rejection rate is high. As with the Congress, participation in the open conferences is open to all and papers may be invited or submitted. Again, submitted papers are stringently refereed.

The working conferences are structured differently. They are usually run by a working group and attendance is generally smaller and occasionally by invitation only. Their purpose is to create an atmosphere conducive to innovation and development. Refereeing is also rigorous and papers are subjected to extensive group discussion.

Publications arising from IFIP events vary. The papers presented at the IFIP World Computer Congress and at open conferences are published as conference proceedings, while the results of the working conferences are often published as collections of selected and edited papers.

IFIP distinguishes three types of institutional membership: Country Representative Members, Members at Large, and Associate Members. A wide variety of organizations can apply for membership, including national or international societies of individual computer scientists/ICT professionals, associations or federations of such societies, government institutions or government-related organizations, national or international research institutes or consortia, universities, academies of sciences, and companies, as well as national or international associations or federations of companies.

More information about this series at http://www.springer.com/series/6102


Editors John MacIntyre University of Sunderland Sunderland, UK

Ilias Maglogiannis University of Piraeus Piraeus, Greece

Lazaros Iliadis Democritus University of Thrace Xanthi, Greece

Elias Pimenidis University of West England Bristol, UK

ISSN 1868-4238 ISSN 1868-422X (electronic)
IFIP Advances in Information and Communication Technology
ISBN 978-3-030-19822-0 ISBN 978-3-030-19823-7 (eBook)
https://doi.org/10.1007/978-3-030-19823-7

© IFIP International Federation for Information Processing 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

AIAI 2019

According to Professor Klaus Schwab (founder and executive chairman of the World Economic Forum), we are living in the era of a great revolution that is rapidly bringing huge changes and challenges to our daily lives. This is the Fourth Industrial Revolution, which has a major impact on all disciplines, even on the way we communicate and interact with each other.

Artificial intelligence (AI) is a major and significant part of the Fourth Industrial Revolution. Its rapid technical breakthroughs are enabling superhuman performance by machines in real time. Machine vision (e.g., face recognition) and language translators and assistants such as Siri and Alexa are characteristic examples. AI promises a brave new world in which businesses and economies will expand their productivity and innovation. Machine learning and deep learning are part of our everyday interactions on our mobile phones and on social media, and numerous applications of AI are used in almost all domains, from cybersecurity to financial and medical cases. However, mankind also faces historic challenges: potential unethical use of AI may violate democratic human rights and may alter the character of Western societies.

The 15th Artificial Intelligence Applications and Innovations (AIAI) conference offered insight into all timely challenges related to the technical, legal, and ethical aspects of intelligent systems and their applications. New algorithms and potential prototypes employed in diverse domains were introduced.

AIAI is a mature international scientific conference held in Europe and is well established in the scientific area of AI. Its history is long and very successful, following and spreading the evolution of intelligent systems. The first event was organized in Toulouse, France, in 2004. Since then, it has had a continuous and dynamic presence as a major global, but mainly European, scientific event. More specifically, it has been organized in China, Greece, Cyprus, Australia, and France. It has always been technically supported by the International Federation for Information Processing (IFIP), and more specifically by Working Group 12.5, which is interested in AI applications.

Following a long-standing tradition, this Springer volume belongs to the IFIP AICT Springer series and contains the papers that were accepted for oral presentation at the AIAI 2019 conference. An additional volume comprises the papers accepted and presented at the workshops, which were held as parallel events.


The diverse nature of the papers presented demonstrates the vitality of AI algorithms and approaches, and it certainly proves the very wide range of AI applications as well. The event was held during May 24–26, 2019, at the five-star Aldemar Knossos Royal Hotel in Crete, Greece.

The response of the international scientific community to the AIAI 2019 call for papers was more than satisfactory, with 101 papers initially submitted. All papers were peer reviewed by at least two independent academic referees. Where needed, a third referee was consulted to resolve any potential conflicts. A total of 49 papers (48.5% of the submitted manuscripts) were accepted for publication as full papers (12 pages long) in the proceedings. Owing to the high quality of the submissions, the Program Committee decided to accept six more papers to be published as short ones (10 pages long).

Three scientific workshops on timely AI subjects were organized under the framework of AIAI 2019:
– The 8th Mining Humanistic Data Workshop (MHDW 2019), organized by the University of Patras and the Ionian University, Greece
– The 4th Workshop on 5G – Putting Intelligence to the Network Edge (5G-PINE 2019), organized by the research team of the Hellenic Telecommunications Organization (OTE) in cooperation with 22 major partner companies
– The First Workshop on Emerging Trends in AI (ETAI 2019), sponsored by the Springer journal Neural Computing and Applications (an open workshop without submission of papers)

We are grateful to Professor John MacIntyre from the University of Sunderland, UK, for organizing this workshop and, moreover, for his continuous support of the AIAI and EANN conferences. We also wish to thank Professor Andrew Starr for his contribution to this very interesting workshop.

AI is in a new "boom" period, with exponential growth in the commercialization of research and development, and with products with embedded AI, as well as "intelligent systems" of various types, being introduced into the market.
Projections for commercial revenue from AI show exponential growth; such is the ubiquitous nature of AI in the modern world that members of the public interact with intelligent systems or agents every day, even though they are often not aware of it! This workshop, led by Professor John MacIntyre, considered emerging themes in AI, covering not only the technical aspects of where AI is going but also the wider question of ethics and the potential for future regulatory frameworks for the development, implementation, and operation of intelligent systems, and their role in our society. The workshop format included three short presentations by the keynote speakers, followed by an interactive panel Q&A session in which the panel members and the audience engaged in a lively debate on the topics discussed! The subjects of the presentations were the following:
– John MacIntyre: "The Future of AI – Existential Threat or New Revolution?"
– Andrew Starr: "Practical AI for Practical Problems"


This was an open workshop without submission of papers.

Four keynote speakers were invited to give lectures on timely aspects of AI. The following talks were given:
1. Professor Plamen Angelov, University of Lancaster, UK: "Empirical Approach: How to Get Fast, Interpretable Deep Learning"
2. Dr. Evangelos Eleftheriou, IBM Fellow, Cloud and Computing Infrastructure, Zurich Research Laboratory, Switzerland: "In-Memory Computing: Accelerating AI Applications"
3. Dr. John Oommen, Carleton University, Ottawa, Canada: "The Power of the 'Pursuit' Learning Paradigm in the Partitioning of Data"
4. Professor Panagiotis Papapetrou, Stockholm University: "Learning from Electronic Health Records: From Temporal Abstractions to Time Series Interpretability"

A tutorial was hosted on the topic "Automated Machine Learning for Bioinformatics and Computational Biology." The tutorial (3 hours) was given by Professor Ioannis Tsamardinos (Computer Science Department of the University of Crete, co-founder of Gnosis Data Analysis PC, a university spin-off company, and affiliated faculty at IACM-FORTH) and Professor Vincenzo Lagani (Ilia State University, Tbilisi, Georgia, and Gnosis Data Analysis PC co-founder). Numerous bioinformaticians, computational biologists, and life scientists in general are applying supervised learning techniques and feature selection in their research work. The tutorial was addressed to this audience, intending to shield them against methodological pitfalls, inform them about new methodologies and tools emerging in the field of Auto-ML, and increase their productivity.
The accepted papers of the 15th AIAI conference are related to the following thematic topics:
– Deep learning
– ANN
– Genetic algorithms – optimization
– Constraints modeling
– ANN training algorithms
– Social media intelligent modeling
– Text mining/machine translation
– Fuzzy modeling
– Biomedical and bioinformatics algorithms and systems
– Feature selection
– Emotion recognition
– Hybrid intelligent models
– Classification-pattern recognition
– Intelligent security modeling
– Complex stochastic games
– Unsupervised machine learning
– ANN in industry
– Intelligent clustering
– Convolutional and recurrent ANN


– Recommender systems
– Intelligent telecommunications modeling
– Intelligent hybrid systems using the Internet of Things

The authors of the submitted papers came from 23 different countries from all over the globe, namely: Austria, Brazil, Canada, Colombia, Czech Republic, Finland, France, Germany, Greece, The Netherlands, Hungary, India, Italy, Japan, P.R. China, Peru, Poland, Portugal, Spain, Sweden, Tunisia, the UK, and the USA.

May 2019

John MacIntyre Ilias Maglogiannis Lazaros Iliadis Elias Pimenidis

Organization

Executive Committee

General Chairs
John MacIntyre – University of Sunderland, UK (Dean of the Faculty of Applied Sciences and Pro Vice Chancellor of the University of Sunderland)
Ilias Maglogiannis (President of the IFIP WG 12.5) – University of Piraeus, Greece
Plamen Angelov – University of Lancaster, UK

Program Chairs
Lazaros Iliadis – Democritus University of Thrace, Greece
Elias Pimenidis – University of the West of England, Bristol, UK

Advisory Chairs
Stefanos Kolias – University of Lincoln, UK
Spyros Likothanasis – University of Patras, Greece
Georgios Vouros – University of Piraeus, Greece

Honorary Chair
Barbara Hammer – Bielefeld University, Germany

Workshop Chairs
Christos Makris – University of Patras, Greece
Phivos Mylonas – Ionian University, Greece
Spyros Sioutas – University of Patras, Greece

Publication and Publicity Chair
Antonis Papaleonidas – Democritus University of Thrace, Greece

Program Committee
Michel Aldanondo – IMT Mines Albi, France
Athanasios Alexiou – NGCEF, Australia
Mohammed Alghwell – Freelancer, Libya
Ioannis Anagnostopoulos – University of Central Greece, Greece
George Anastassopoulos – Democritus University of Thrace, Greece
Vardis-Dimitris Anezakis – Democritus University of Thrace, Greece


Costin Badica – University of Craiova, Romania
Kostas Berberidis – University of Patras, Greece
Nik Bessis – Edge Hill University, UK
Varun Bhatt – Indian Institute of Technology, Bombay, India
Giacomo Boracchi – Politecnico di Milano, Italy
Farah Bouakrif – University of Jijel, Algeria
Antonio Braga – Federal University of Minas Gerais, Brazil
Peter Brida – University of Zilina, Slovakia
Ivo Bukovsky – Tohoku University, Japan
Paulo Vitor Campos Souza – CEFET-MG, Brazil
George Caridakis – National Technical University of Athens, Greece
Jheymesson Cavalcanti – UPE, Brazil
Ioannis Chamodrakas – National and Kapodistrian University of Athens, Greece
Ioannis Chochliouros – Hellenic Telecommunications Organization S.A. (OTE), Greece
Adriana Mihaela Coroiu – Babes Bolyai University, Romania
Dawei Dai – Fudan University, China
Vilson Luiz Dalle Mole – UTFPR, Brazil
Debasmit Das – Purdue University, USA
Bodhisattva Dash – IIIT Bhubaneswar, India
Konstantinos Demertzis – Democritus University of Thrace, Greece
Antreas Dionysiou – University of Cyprus, Cyprus
Ioannis Dokas – DUTH, Greece
Sergey Dolenko – D.V. Skobeltsyn Institute of Nuclear Physics, M.V. Lomonosov Moscow State University, Russia
Xiao Dong – Institute of Computing Technology, China
Shirin Dora – University of Amsterdam, The Netherlands
Rodrigo Exterkoetter – LTrace Geophysical Solutions
Mauro Gaggero – National Research Council, Italy
Claudio Gallicchio – University of Pisa, Italy
Ignazio Gallo – University of Insubria, Italy
Spiros Georgakopoulos – University of Thessaly, Greece
Eleonora Giunchiglia – Università di Genova, Italy
Giorgio Gnecco – IMT School for Advanced Studies, Italy
Ioannis Gkourtzounis – University of Northampton, UK
Foteini Grivokostopoulou – University of Patras, Greece
Hakan Haberdar – University of Houston, USA
Petr Hajek – University of Pardubice, Czech Republic
Xue Han – China University of Geosciences, China
Ioannis Hatzilygeroudis – University of Patras, Greece
Jian Hou – Bohai University, China
Lazaros Iliadis – Democritus University of Thrace, Greece
Jacek Kabziński – Lodz University of Technology, Poland
Antonios Kalampakas – AUM, Kuwait
Andreas Kanavos – University of Patras, Greece

Savvas Karatsiolis – University of Cyprus, Cyprus
Kostas Karatzas – Aristotle University of Thessaloniki, Greece
Antonios Karatzoglou – Karlsruhe Institute of Technology, Germany
Ioannis Karydis – Ionian University, Greece
Petros Kefalas – University of Sheffield International Faculty, Greece
Katia Lida Kermanidis – Ionian University, Greece
Nadia Masood Khan – University of Engineering and Technology Peshawar, Pakistan
Sophie Klecker – University of Luxembourg, Luxembourg
Yiannis Kokkinos – University of Macedonia, Greece
Petia Koprinkova-Hristova – Bulgarian Academy of Sciences, Bulgaria
Athanasios Koutras – TEI of Western Greece, Greece
Ondrej Krejcar – University of Hradec Kralove, Czech Republic
Efthyvoulos Kyriacou – Frederick University, Cyprus
Guangli Li – Institute of Computing Technology, Chinese Academy of Sciences, China
Annika Lindh – Dublin Institute of Technology, Ireland
Ilias Maglogiannis – University of Piraeus, Greece
George Magoulas – University of London, Birkbeck College, UK
Christos Makris – University of Patras, Greece
Mario Malcangi – Università degli Studi di Milano, Italy
Boudjelal Meftah – University Mustapha Stambouli, Mascara, Algeria
Nikolaos Mitianoudis – Democritus University of Thrace, Greece
Haralambos Mouratidis – University of Brighton, UK
Phivos Mylonas – National Technical University of Athens, Greece
Shigang Yue – University of Lincoln, UK
Yancho Todorov – Aalto University, Espoo, Finland
George Tsekouras – University of the Aegean, Greece
Mihaela Oprea – Petroleum-Gas University of Ploiesti, Romania
Paul Krause – University of Surrey, UK
Rafet Sifa – Fraunhofer IAIS, Germany
Alexander Ryjov – Lomonosov Moscow State University, Russia
Giannis Nikolentzos – Ecole Polytechnique, France
Duc-Hong Pham – VNU, Vietnam
Elias Pimenidis – University of the West of England, UK
Hongyu Li – Zhongan Tech, China
Marcello Sanguineti – University of Genoa, Italy
Zhongnan Zhang – Xiamen University, China
Doina Logofatu – Frankfurt University of Applied Sciences, Germany
Ruggero Labati – Università degli Studi di Milano, Italy
Florin Leon – Technical University of Iasi, Romania
Aristidis Likas – University of Ioannina, Greece
Spiros Likothanassis – University of Patras, Greece
Francesco Marcelloni – University of Pisa, Italy
Giorgio Morales – INICTEL-UNI, Peru
Stavros Ntalampiras – University of Milan, Italy

Basil Papadopoulos – Democritus University of Thrace, Greece
Antonios Papaleonidas – DUTH, Greece
Isidoros Perikos – University of Patras, Greece
Nicolai Petkov – University of Groningen, The Netherlands
Miltos Petridis – Middlesex University, UK
Jielin Qiu – Shanghai Jiao Tong University, China
Juan Qiu – Tongji University, China
Bernardete Ribeiro – University of Coimbra, Portugal
Simone Scardapane – Sapienza University of Rome, Italy
Andreas Stafylopatis – National Technical University of Athens, Greece
Antonino Staiano – Parthenope University of Naples, Italy
Ioannis Stephanakis – Hellenic Telecommunications Organisation SA, Greece
Ricardo Tanscheit – PUC-Rio, Brazil
Francesco Trovò – Politecnico di Milano, Italy
Nicolas Tsapatsoulis – Cyprus University of Technology, Cyprus
Nikolaos Vassilas – TEI of Athens, Greece
Petra Vidnerová – The Czech Academy of Sciences, Czech Republic
Panagiotis Vlamos – Ionian University, Greece
George Vouros – University of Piraeus, Greece
Xin-She Yang – Middlesex University, UK
Drago Žagar – University of Osijek, Croatia
Rabiaa Zitouni – University of Tunis el Manar, Tunisia

Abstracts of Invited Talks

Learning from Electronic Health Records: From Temporal Abstractions to Time Series Interpretability

Panagiotis Papapetrou
Department of Computer and Systems Sciences, Stockholm University
[email protected]

Abstract. The first part of the talk will focus on data mining methods for learning from Electronic Health Records (EHRs), which are typically perceived as big and complex patient data sources. Using them, scientists strive to predict patients' progress, to understand and predict response to therapy, to detect adverse drug effects, and to address many other learning tasks. Medical researchers are also interested in learning from cohorts of population-based studies and of experiments. Learning tasks include the identification of disease predictors that can lead to new diagnostic tests and the acquisition of insights on interventions. The talk will elaborate on data sources, methods, and case studies in medical mining.

The second part of the talk will tackle the issue of interpretability and explainability of opaque machine learning models, with a focus on time series classification. Time series classification has received great attention over the past decade, with a wide range of methods focusing on predictive performance by exploiting various types of temporal features. Nonetheless, little emphasis has been placed on interpretability and explainability. This talk will formulate the novel problem of explainable time series tweaking: given a time series and an opaque classifier that provides a particular classification decision for the time series, the objective is to find the minimum number of changes to be performed to the given time series so that the classifier changes its decision to another class. Moreover, it will be shown that the problem is NP-hard. Two instantiations of the problem will be presented. The classifier under investigation will be the random shapelet forest classifier. Two algorithmic solutions for the two problem instantiations will be presented along with simple optimizations, as well as a baseline solution using the nearest neighbor classifier.

Empirical Approach: How to Get Fast, Interpretable Deep Learning

Plamen Angelov
Department of Computing and Communications, University of Lancaster
[email protected]

Abstract. We are witnessing an explosion of data (streams) being generated and growing exponentially. Nowadays we carry Gigabytes of data in our pockets, in the form of USB flash memory sticks, smartphones, smartwatches, etc. Extracting useful information and knowledge from these big data streams is of immense importance for society, the economy, and science. Deep Learning has quickly become synonymous with a powerful method for endowing items and processes with elements of AI, in the sense that it makes human-like performance possible in recognizing images and speech. However, the currently used methods for deep learning, which are based on neural networks (recurrent, belief, etc.), are opaque (not transparent), require huge amounts of training data and computing power (hours of training using GPUs), and are offline; their online versions based on reinforcement learning have no proven convergence and do not guarantee the same result for the same input (they lack repeatability). The speaker recently introduced a new concept of an empirical approach to machine learning and fuzzy sets and systems, proved convergence for a class of such models, and exploited the link between neural networks and fuzzy systems (neuro-fuzzy systems are known to exhibit a duality between radial basis function (RBF) networks and fuzzy rule-based models, and the key property of universal approximation has been proven for both). In this talk he will present, in a systematic way, the basics of the newly introduced Empirical Approach to Machine Learning, Fuzzy Sets and Systems, and its applications to problems such as anomaly detection, clustering, classification, prediction and control.
The major advantages of this new paradigm are the liberation from restrictive and often unrealistic assumptions and requirements concerning the nature of the data (random, deterministic, fuzzy), the need to formulate and assume a priori the type of distribution models, membership functions, the independence of the individual data observations, their large (theoretically infinite) number, etc. From a pragmatic point of view, this direct approach from data (streams) to complex, layered model representation is fully automated and leads to very efficient model structures. In addition, the proposed new concept learns in a way similar to the way people learn: it can start from a single example. The proposed approach makes this possible because it is prototype-based and non-parametric.

“In-memory Computing”: Accelerating AI Applications

Evangelos Eleftheriou
IBM Fellow, Cloud and Computing Infrastructure, Zurich Research Laboratory, Zurich, Switzerland
[email protected]

Abstract. In today's computing systems, based on the conventional von Neumann architecture, there are distinct memory and processing units. Performing computations results in a significant amount of data being moved back and forth between the physically separated memory and processing units. This costs time and energy, and constitutes an inherent performance bottleneck. It is becoming increasingly clear that, for application areas such as AI (and indeed cognitive computing in general), we need to transition to computing architectures in which memory and logic coexist in some form. Brain-inspired neuromorphic computing and the fascinating new area of in-memory computing, or computational memory, are two key non-von Neumann approaches being researched. A critical requirement in these novel computing paradigms is a very-high-density, low-power, variable-state, programmable and non-volatile nanoscale memory device. There are many examples of such nanoscale memory devices in which the information is stored either as charge or as resistance. One particular example is phase-change memory (PCM) devices, which are very well suited to address this need, owing to their multi-level storage capability and potential scalability. In in-memory computing, the physics of the nanoscale memory devices, as well as the organization of such devices in cross-bar arrays, are exploited to perform certain computational tasks within the memory unit. I will present how computational memories accelerate AI applications, and will show small- and large-scale experimental demonstrations that perform high-level computational primitives, such as ultra-low-power inference engines, optimization solvers including compressed sensing and sparse coding, linear solvers, and temporal correlation detection.
Moreover, I will discuss the efficacy of this approach to efficiently address not only inferencing but also training of deep neural networks. The results show that this co-existence of computation and storage at the nanometer scale could be the enabler for new, ultra-dense, low-power, and massively parallel computing systems. Thus, by augmenting conventional computing systems, in-memory computing could help achieve orders of magnitude improvement in performance and efficiency.

Contents

Invited Paper

The Power of the “Pursuit” Learning Paradigm in the Partitioning of Data
(Abdolreza Shirvani and B. John Oommen) . . . 3

AI Anomaly Detection - Active Learning

Cyber-Typhon: An Online Multi-task Anomaly Detection Framework
(Konstantinos Demertzis, Lazaros Iliadis, Panayiotis Kikiras, and Nikos Tziritas) . . . 19

Investigating the Benefits of Exploiting Incremental Learners Under Active Learning Scheme
(Stamatis Karlos, Vasileios G. Kanas, Nikos Fazakis, Christos Aridas, and Sotiris Kotsiantis) . . . 37

The Blockchain Random Neural Network in Cybersecurity and the Internet of Things
(Will Serrano) . . . 50

Autonomous Vehicles - Aerial Vehicles

A Visual Neural Network for Robust Collision Perception in Vehicle Driving Scenarios
(Qinbing Fu, Nicola Bellotto, Huatian Wang, F. Claire Rind, Hongxin Wang, and Shigang Yue) . . . 67

An LGMD Based Competitive Collision Avoidance Strategy for UAV
(Jiannan Zhao, Xingzao Ma, Qinbing Fu, Cheng Hu, and Shigang Yue) . . . 80

Mixture Modules Based Intelligent Control System for Autonomous Driving
(Tangyike Zhang, Songyi Zhang, Yu Chen, Chao Xia, Shitao Chen, and Nanning Zheng) . . . 92

Biomedical AI

An Adaptive Temporal-Causal Network Model for Stress Extinction Using Fluoxetine
(S. Sahand Mohammadi Ziabari) . . . 107

Clustering Diagnostic Profiles of Patients
(Jaakko Hollmén and Panagiotis Papapetrou) . . . 120

Emotion Analysis in Hospital Bedside Infotainment Platforms Using Speeded up Robust Features
(A. Kallipolitis, M. Galliakis, A. Menychtas, and I. Maglogiannis) . . . 127

FISUL: A Framework for Detecting Adverse Drug Events from Heterogeneous Medical Sources Using Feature Importance
(Corinne G. Allaart, Lena Mondrejevski, and Panagiotis Papapetrou) . . . 139

Classification - Clustering

A New Topology-Preserving Distance Metric with Applications to Multi-dimensional Data Clustering
(Konstantinos K. Delibasis) . . . 155

Classification of Incomplete Data Using Autoencoder and Evidential Reasoning
(Suvra Jyoti Choudhury and Nikhil R. Pal) . . . 167

Dynamic Reliable Voting in Ensemble Learning
(Agus Budi Raharjo and Mohamed Quafafou) . . . 178

Extracting Action Sensitive Features to Facilitate Weakly-Supervised Action Localization
(Zijian Kang, Le Wang, Ziyi Liu, Qilin Zhang, and Nanning Zheng) . . . 188

Image Recognition Based on Combined Filters with Pseudoinverse Learning Algorithm
(Xiaodan Deng, Xiaoxuan Sun, Ping Guo, and Qian Yin) . . . 202

Constraint Programming - Brain Inspired Modeling

Design-Parameters Optimization of a Deep-Groove Ball Bearing for Different Boundary Dimensions, Employing Amended Differential Evolution Algorithm
(Parthiv B. Rana, Jigar L. Patel, and D. I. Lalwani) . . . 213

Exploring Brain Effective Connectivity in Visual Perception Using a Hierarchical Correlation Network
(Siyu Yu, Nanning Zheng, Hao Wu, Ming Du, and Badong Chen) . . . 223

Solving the Talent Scheduling Problem by Parallel Constraint Programming
(Ke Liu, Sven Löffler, and Petra Hofstedt) . . . 236

Deep Learning - Convolutional ANN

A Deep Reinforcement Learning Approach for Automated Cryptocurrency Trading
(Giorgio Lucarelli and Matteo Borrotti) . . . 247

Capacity Requirements Planning for Production Companies Using Deep Reinforcement Learning: Use Case for Deep Planning Methodology (DPM)
(Harald Schallner) . . . 259

Comparison of Neural Network Optimizers for Relative Ranking Retention Between Neural Architectures
(George Kyriakides and Konstantinos Margaritis) . . . 272

Detecting Violent Robberies in CCTV Videos Using Deep Learning
(Giorgio Morales, Itamar Salazar-Reque, Joel Telles, and Daniel Díaz) . . . 282

Diversity Regularized Adversarial Deep Learning
(Babajide O. Ayinde, Keishin Nishihama, and Jacek M. Zurada) . . . 292

Interpretability of a Deep Learning Model for Rodents Brain Semantic Segmentation
(Leonardo Nogueira Matos, Mariana Fontainhas Rodrigues, Ricardo Magalhães, Victor Alves, and Paulo Novais) . . . 307

Learning and Detecting Stuttering Disorders
(Fabio Fassetti, Ilaria Fassetti, and Simona Nisticò) . . . 319

Localization of Epileptic Foci by Using Convolutional Neural Network Based on iEEG
(Linfeng Sui, Xuyang Zhao, Qibin Zhao, Toshihisa Tanaka, and Jianting Cao) . . . 331

Review Spam Detection Using Word Embeddings and Deep Neural Networks
(Aliaksandr Barushka and Petr Hajek) . . . 340

Tools for Semi-automatic Preparation of Training Data for OCR
(Ladislav Lenc, Jiří Martínek, and Pavel Král) . . . 351

Training Strategies for OCR Systems for Historical Documents
(Jiří Martínek, Ladislav Lenc, and Pavel Král) . . . 362

A Review on the Application of Deep Learning in Legal Domain
(Neha Bansal, Arun Sharma, and R. K. Singh) . . . 374

Long-Short Term Memory for an Effective Short-Term Weather Forecasting Model Using Surface Weather Data
(Pradeep Hewage, Ardhendu Behera, Marcello Trovati, and Ella Pereira) . . . 382

Segmentation Methods for Image Classification Using a Convolutional Neural Network on AR-Sandbox
(Andres Ovidio Restrepo Rodriguez, Daniel Esteban Casas Mateus, Paulo Alonso Gaona Garcia, Adriana Gomez Acosta, and Carlos Enrique Montenegro Marin) . . . 391

Fuzzy Modeling

A Hybrid Model Based on Fuzzy Rules to Act on the Diagnosed of Autism in Adults
(Augusto J. Guimarães, Vinicius J. Silva Araujo, Vanessa S. Araujo, Lucas O. Batista, and Paulo V. de Campos Souza) . . . 401

An Unsupervised Fuzzy Rule-Based Method for Structure Preserving Dimensionality Reduction with Prediction Ability
(Suchismita Das and Nikhil R. Pal) . . . 413

Interpretable Fuzzy Rule-Based Systems for Detecting Financial Statement Fraud
(Petr Hajek) . . . 425

Learning Automata - Logic Based Reasoning

Learning Automata-Based Solutions to the Single Elevator Problem
(O. Ghaleb and B. John Oommen) . . . 439

Optimizing Self-organizing Lists-on-Lists Using Enhanced Object Partitioning
(O. Ekaba Bisong and B. John Oommen) . . . 451

EduBAI: An Educational Platform for Logic-Based Reasoning
(Dimitrios Arampatzis, Maria Doulgeraki, Michail Giannoulis, Evropi Stefanidi, and Theodore Patkos) . . . 464

Machine Learning - Natural Language

A Machine Learning Tool for Interpreting Differences in Cognition Using Brain Features
(Tiago Azevedo, Luca Passamonti, Pietro Lió, and Nicola Toschi) . . . 475

Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages
(András Dobó and János Csirik) . . . 487

Distributed Community Prediction for Social Graphs Based on Louvain Algorithm
(Christos Makris, Dionisios Pettas, and Georgios Pispirigos) . . . 500

Iliou Machine Learning Data Preprocessing Method for Suicide Prediction from Family History
(Theodoros Iliou, Georgia Konstantopoulou, Christina Lymperopoulou, Konstantinos Anastasopoulos, George Anastassopoulos, Dimitrios Margounakis, and Dimitrios Lymberopoulos) . . . 512

Ontology Population Framework of MAGNETO for Instantiating Heterogeneous Forensic Data Modalities
(Ernst-Josef Behmer, Krishna Chandramouli, Victor Garrido, Dirk Mühlenberg, Dennis Müller, Wilmuth Müller, Dirk Pallmer, Francisco J. Pérez, Tomas Piatrik, and Camilo Vargas) . . . 520

Random Forest Surrogate Models to Support Design Space Exploration in Aerospace Use-Case
(Siva Krishna Dasari, Abbas Cheddad, and Petter Andersson) . . . 532

Stacking Strong Ensembles of Classifiers
(Stamatios-Aggelos N. Alexandropoulos, Christos K. Aridas, Sotiris B. Kotsiantis, and Michael N. Vrahatis) . . . 545

Multi Agent - IoT

An Agent-Based Framework for Complex Networks
(Alexander Wendt, Maximilian Götzinger, and Thilo Sauter) . . . 559

Studying Emotions at Work Using Agent-Based Modeling and Simulation
(Hanen Lejmi-Riahi, Mouna Belhaj, and Lamjed Ben Said) . . . 571

Towards an Adaption and Personalisation Solution Based on Multi Agent System Applied on Serious Games
(Spyridon Blatsios and Ioannis Refanidis) . . . 584

Nature Inspired Flight and Robot Control - Machine Vision

Constant Angular Velocity Regulation for Visually Guided Terrain Following
(Huatian Wang, Qinbing Fu, Hongxin Wang, Jigen Peng, and Shigang Yue) . . . 597

Motion Segmentation Based on Structure-Texture Decomposition and Improved Three Frame Differencing
(Sandeep Singh Sengar) . . . 609

Using Shallow Neural Network Fitting Technique to Improve Calibration Accuracy of Modeless Robots
(Ying Bai and Dali Wang) . . . 623

Recommendation Systems

Banner Personalization for e-Commerce
(Ioannis Maniadis, Konstantinos N. Vavliakis, and Andreas L. Symeonidis) . . . 635

Hybrid Data Set Optimization in Recommender Systems Using Fuzzy T-Norms
(Antonios Papaleonidas, Elias Pimenidis, and Lazaros Iliadis) . . . 647

MuSIF: A Product Recommendation System Based on Multi-source Implicit Feedback
(Ioannis Schoinas and Christos Tjortjis) . . . 660

On the Invariance of the SELU Activation Function on Algorithm and Hyperparameter Selection in Neural Network Recommenders
(Flora Sakketou and Nicholas Ampazis) . . . 673

Author Index . . . 687

Invited Paper

The Power of the “Pursuit” Learning Paradigm in the Partitioning of Data

Abdolreza Shirvani¹ and B. John Oommen¹,²

¹ School of Computer Science, Carleton University, Ottawa, Canada
[email protected]
² Centre for Artificial Intelligence Research, University of Agder, Grimstad, Norway

Abstract. Traditional Learning Automata (LA) work with the understanding that the actions are chosen purely based on the “state” in which the machine is. This modus operandi completely ignores any estimation of the Random Environment's (RE's) (specified as E) reward/penalty probabilities. To take these into consideration, Estimator/Pursuit LA utilize “cheap” estimates of the Environment's reward probabilities to make them converge by an order of magnitude faster. The concept is quite simply the following: inexpensive estimates of the reward probabilities can be used to rank the actions. Thereafter, when the action probability vector has to be updated, this is done not on the basis of the Environment's response alone, but also based on the ranking of these estimates. While this phenomenon has been utilized in the field of LA, until recently, it had not been incorporated into solutions that solve partitioning problems. In this paper, we submit a complete survey of how the “Pursuit” learning paradigm can be, and has been, used in Object Partitioning. The results demonstrate that incorporating this paradigm can hasten the partitioning by an order of magnitude.

Keywords: Object Partitioning · Learning Automata · Object Migration Automaton · Partitioning-based learning

1 Introduction

The second author gratefully acknowledges the partial support of NSERC, the Natural Sciences and Engineering Research Council of Canada.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019.
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 3–16, 2019. https://doi.org/10.1007/978-3-030-19823-7_1

The Pursuit Concept in LA: Absolutely expedient LA are absorbing, and there is always a small probability that they will not converge to the best action. Thathachar and Sastry recognized this phenomenon and proposed using Maximum Likelihood Estimators (MLEs) to hasten the LA's convergence. Such an MLE-based update method utilizes estimates of the reward probabilities in the update equations. At every iteration, the estimated reward vector is also used to update the action probabilities, instead of updating them based only on the RE's feedback. In this way, the probabilities of choosing the actions with higher reward estimates are increased, and those with lower estimates are significantly reduced; on this basis, they proposed the family of estimator algorithms. The Pursuit strategy of designing LA is a special derivative of the family of estimator algorithms. Pursuit algorithms “pursue” the currently-known best action, and increase the action probability associated with this action. The pursuit concept was first introduced by Thathachar et al., and the corresponding LA was proven to be ε-optimal. Its discretized version was proposed by Lanctot et al. in [7], who also discretized the original estimator algorithm. Agache et al. [10] then analyzed all four linear combinations, i.e., the LRI and LRP paradigms.

The Object Partitioning Problem (OPP): Consider the problem of partitioning a set A = {A1, · · · , AW} of W physical objects into R groups Ω = {G1, · · · , GR}. We assume that the true but unknown state of nature, Ω*, is a partitioning of the set A into mutually exclusive and exhaustive subsets {G*1, G*2, · · · , G*R}. The composition of {G*i} is unknown, and the elements in the subsets fall together based on some criteria which may be mathematically formulated, or may even be ambiguous. These objects are presented to a learning algorithm, for example, in pairs or tuples. The goal of the algorithm is to partition A into a learned partition, Ω+. The hope is that Ω+ converges to Ω*. In most cases, neither the underlying partitioning Ω* nor the joint access probabilities by which the pairs/tuples of A are presented to the learning algorithm are known. This problem is known to be NP-hard [9]. Clearly, as we increase the number of objects, the number of possible partitions, and with it the problem's complexity, grows exponentially.
To resolve this, one could explore all partition combinations, use a ranking index, and thereafter report the best plausible partition. The goal of the OPP is to identify the best or most likely realizable partitioning. This requires the AI algorithm to perceive the semantic physical-world aspects of the objects, and to then make local decisions based on the best partition in the abstract domain [4,5].

Real versus Abstract Objects: If there exists a mutual relation between the real objects in the semantic domain A and a domain of abstract objects O = {O1, · · · , OW}, we define the partitioning of O in such a way that the partitions of O map onto the partitions of the real objects in A so as to mimic the state of nature. Thus, while we operate on the abstract objects in O, the objects in A are not necessarily moved, because they constitute real-life objects which cannot be easily moved. A special case of the OPP is the Equi-Partitioning Problem (EPP), in which all the partitions are equi-sized.

The Object Migration Automaton (OMA): Due to the poor convergence of prior OPP/EPP solutions, they were never utilized in real-life applications. The introduction of an LA-based partitioning algorithm, the OMA (explained in Sect. 2), made real-life applications possible. The OMA resolved the EPP both efficiently and accurately. This solution is regarded as a benchmark for the EPP. Indeed, since 1986¹, it has been applied to a variety of real-life problems and domains, which include keyboard optimization, image retrieval, distributed computing, graph partitioning, constraint satisfaction problems, cryptanalysis, reputation systems, parallel and distributed mapping, etc.

The Intent of this Paper: Although the “Pursuit” learning paradigm has been utilized in the theory and applications of LA as fundamental machines, until recently it had not been incorporated into solutions that solve partitioning problems. The goal of this paper is to submit a comprehensive survey of how this paradigm can be used in Object Partitioning, and to optimize various versions of the OMA. We also include simulation results on benchmark environments that demonstrate the advantages of incorporating it into the respective machines.

2 The Object Migration Automata

The OMA is a fixed-structure LA designed to solve the EPP. It is defined as a quintuple with R actions², each of which represents a specific class; for every action there exists a fixed number of states, N. Every abstract object from the set O resides in a state identified by a number, and can move from one state to another, or migrate from one group to another. If the abstract object Oi is in state ξi belonging to a group αk, we say that Oi is assigned to class k. If two objects Oi and Oj happen to be in the same class and the OMA receives a query ⟨Ai, Aj⟩, they are jointly rewarded by E, the Environment. Otherwise, they are penalized. We formalize the movements of {Oi} on reward/penalty as follows: For every action αk, there is a set of states {φk1, · · · , φkN}, where N is the fixed depth of the memory, and where 1 ≤ k ≤ R represents the number of desired classes. We also assume that φk1 is the most internal state and that φkN is the boundary state of the corresponding action. The responses to the reward and penalty feedback are as follows:

– Reward: Given a pair of physical objects presented as a query ⟨Ai, Aj⟩, if both Oi and Oj happen to be in the same class αk, the reward scenario is enforced, and both are moved one step toward the action's most internal state, φk1. This is depicted in Figure 3.2(a) in [11]³.
– Penalty: If, however, they are in different classes, αk and αm (i.e., Oi is in state ξi, where ξi ∈ {φk1, · · · , φkN}, and Oj is in state ξj, where ξj ∈ {φm1, · · · , φmN}), they are moved away from φk1 and φm1 as follows:
  1. If ξi ≠ φkN and ξj ≠ φmN, we move Oi and Oj one state toward φkN and φmN, respectively, as shown in Figure 3.2(b) in [11].

¹ The bibliography in this paper is necessarily limited. The majority of the present results very briefly summarize the results in the Ph.D. thesis of the first author.
² To be consistent with the terminology of LA, we use the terms “action”, “class” and “group” synonymously.
³ The OMA's algorithms/figures are in [11], and omitted here in the interest of space.


  2. If ξi = φkN or ξj = φmN, but not both (i.e., only one of the abstract objects is in its boundary state), the object which is not in the boundary state, say Oi, is moved toward its boundary state, as shown in Figure 3.2(c) in [11]. Simultaneously, the object that is in the boundary state, Oj, is moved to the boundary state of αk, the class of Oi. Since this reallocation results in an excess of objects in αk, we choose one of the objects in αk (which is not accessed) and move it to the boundary state of αm. In this case, we choose the object nearest to the boundary state of ξi, as shown in Figure 3.2(c) in [11].
  3. If ξi = φkN and ξj = φmN (i.e., both objects are in their boundary states), one object, say Oi, is moved to the boundary state of αm. Since this reallocation will, again, result in an excess of objects in αm, we choose one of the objects in αm (which is not accessed) and move it to the boundary state of αk. In this case, we choose the object nearest to the boundary state of ξj, as shown in Figure 3.2(d) in [11].

To assess the partitioning accuracy and the convergence speed of any EPP solution, there must be an “oracle” with a pre-defined number of classes, each class containing an equal number of objects. The OMA's goal is to migrate the objects between its classes, using the incoming queries. E is characterized by three parameters: (a) W, the number of objects, (b) R, the number of partitions, and (c) a probability p quantifying how E pairs the elements in the query.

Table 1. Experimental results for the OMA, for an ensemble of 100 experiments; only experiments where convergence occurred are included.

W    W/R  R  OMAp9           OMAp8         OMAp7
4    2    2  (2, 26)         (2, 36)       (2, 57)
6    2    3  (3, 44)         (4, 62)       (4, 109)
-    3    3  (22, 66)        (20, 88)      (26, 153)
9    3    3  (44, 110)       (43, 144)     (70, 261)
12   2    6  (10, 101)       (12, 146)     (15, 285)
-    3    4  (82, 172)       (84, 228)     (128, 406)
-    4    3  (401, 524)      (252, 405)    (256, 552)
-    6    2  (2240, 2370)    (1151, 1299)  (1053, 1486)
15   3    5  (152, 265)      (155, 325)    (191, 607)
-    5    3  (1854, 2087)    (918, 1136)   (735, 1171)
18   2    9  (17, 167)       (24, 252)     (29, 582)
-    3    6  (180, 319)      (202, 413)    (288, 839)
-    6    3  (5660, 5786)    (1911, 2265)  (1355, 2111)
-    9    2  (11245, 11456)  (6494, 7016)  (3801, 4450)
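The reward and penalty moves of Sect. 2 can be sketched in code. The following is a minimal illustrative simulation, written by us with our own naming conventions (the authors' actual algorithms and figures are in [11]); it assumes W/R ≥ 2, takes the true class of object o to be o // (W/R), and generates queries from an environment E characterized by (W, R, p):

```python
import random
from collections import Counter

def make_query(W, R, p, rng):
    """Sample one query from E: first object from a uniformly chosen class;
    second object from the same class w.p. p, else from another class."""
    k = W // R
    c1 = rng.randrange(R)
    i = c1 * k + rng.randrange(k)
    if rng.random() < p:                      # intra-class pair
        j = c1 * k + rng.randrange(k)
        while j == i:
            j = c1 * k + rng.randrange(k)
    else:                                     # divergent (noise) pair
        c2 = rng.randrange(R - 1)
        c2 += c2 >= c1                        # skip c1
        j = c2 * k + rng.randrange(k)
    return i, j

class OMA:
    """Each object holds (class, depth): depth 1 is the most internal state,
    depth N the boundary state of its class."""

    def __init__(self, W, R, N=10, seed=0):
        rng, k = random.Random(seed), W // R
        ids = list(range(W))
        rng.shuffle(ids)                      # arbitrary initial partition
        self.N = N
        self.cls = {o: c for c in range(R) for o in ids[c * k:(c + 1) * k]}
        self.depth = {o: N for o in ids}      # all start at the boundary

    def _swap_candidate(self, c, exclude):
        # un-accessed object of class c nearest to its boundary state
        return max((o for o in self.cls if self.cls[o] == c and o not in exclude),
                   key=lambda o: self.depth[o])

    def process(self, i, j):
        if self.cls[i] == self.cls[j]:        # reward: both move inward
            self.depth[i] = max(1, self.depth[i] - 1)
            self.depth[j] = max(1, self.depth[j] - 1)
            return
        at_i, at_j = self.depth[i] == self.N, self.depth[j] == self.N
        if not at_i and not at_j:             # penalty, case 1
            self.depth[i] += 1
            self.depth[j] += 1
        elif at_i != at_j:                    # penalty, case 2
            inner, outer = (j, i) if at_i else (i, j)
            self.depth[inner] += 1
            swap = self._swap_candidate(self.cls[inner], {inner})
            old = self.cls[outer]
            self.cls[outer], self.depth[outer] = self.cls[inner], self.N
            self.cls[swap], self.depth[swap] = old, self.N
        else:                                 # penalty, case 3
            swap = self._swap_candidate(self.cls[j], {j})
            old = self.cls[i]
            self.cls[i], self.depth[i] = self.cls[j], self.N
            self.cls[swap], self.depth[swap] = old, self.N

rng = random.Random(42)
oma = OMA(W=6, R=3, N=10, seed=42)
for _ in range(2000):
    oma.process(*make_query(6, 3, 0.9, rng))
# every migration swaps two objects, so class sizes are preserved throughout
assert sorted(Counter(oma.cls.values()).values()) == [2, 2, 2]
```

The sketch makes the key structural property visible: rewards only deepen the "conviction" about an assignment, while penalties either weaken it or trigger a size-preserving swap of two objects across classes.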


Every query presented to the OMA by E consists of two objects. E randomly selects an initial class with probability 1/R, and then chooses the first object in the query, say q1, from it. The second element of the pair, q2, is then chosen with probability p from the same class, and with probability (1 − p) from one of the other classes, each of these being chosen uniformly with probability 1/(R − 1). Thereafter, it chooses a random element from that second class uniformly. We assume that E generates an “unending” continuous stream of query pairs. The results of the simulations are given in Table 1, where in OMApX, X refers to the probability specified above, W is the number of objects, W/R is the number of objects per class, and R is the number of classes. The results are given as a pair (a, b), where a refers to the number of iterations required for the OMA to reach the first correct classification, and b refers to the iteration at which the OMA has fully converged. In all experiments, the number of states of the OMA was set to 10. Also, the OMA's convergence, plotted for a single run and for an ensemble of runs, displays a monotonically decreasing pattern (with time) for the latter.
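This sampling scheme can be checked numerically. Working through it, a specific unordered intra-class pair is accessed with probability 2p/(R·k·(k − 1)) and a specific cross-class pair with probability 2(1 − p)/(R(R − 1)k²), where k = W/R. A quick Monte Carlo sketch (our own code, not the authors'; object o belongs to true class o // k):

```python
import random
from collections import Counter

def make_query(W, R, p, rng):
    """One query from E: first object uniform within a uniformly chosen
    class; second object from the same class w.p. p, else from another."""
    k = W // R
    c1 = rng.randrange(R)
    i = c1 * k + rng.randrange(k)
    if rng.random() < p:
        j = c1 * k + rng.randrange(k)
        while j == i:
            j = c1 * k + rng.randrange(k)
    else:
        c2 = rng.randrange(R - 1)
        c2 += c2 >= c1                        # skip c1
        j = c2 * k + rng.randrange(k)
    return i, j

W, R, p, n = 6, 3, 0.9, 100_000
k = W // R
rng = random.Random(0)
freq = Counter(frozenset(make_query(W, R, p, rng)) for _ in range(n))

intra_theory = 2 * p / (R * k * (k - 1))             # 0.30 for these settings
cross_theory = 2 * (1 - p) / (R * (R - 1) * k * k)   # ~0.0083 for these settings
intra = [v / n for q, v in freq.items() if len({o // k for o in q}) == 1]
assert all(abs(f - intra_theory) < 0.01 for f in intra)
```

The wide gap between the two theoretical values (here 0.30 versus roughly 0.0083) is precisely what the pursuit machinery of the next section exploits.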

3 Developing the Pursuit Concept: The Environment

In an “un-noisy” Environment, we can denote the actual value of the relation between Ai and Aj by the quantity μ*(i, j), expressed as:

    μ*(i, j) = P(Rk) · P(Aj | Ai) · P(Ai)  for all i, j with ⟨Ai, Aj⟩ ∈ Rk, and
    μ*(i, j) = 0  otherwise,

where P(Rk) is the probability that the first element, Ai, is chosen from the group Rk, and P(Aj | Ai) is the conditional probability of choosing Aj, which is also from Rk, after Ai has been chosen. Since E chooses the elements of the pairs from the other groups uniformly, with a possible re-numbering operation, the matrix M* = [μ*(i, j)] is the block-diagonal matrix given by Eq. (1):

    M* = diag(M*1, M*2, · · · , M*R),    (1)

where every off-diagonal block is a square matrix containing only 0's.

Theorem 1. The matrix M*r (1 ≤ r ≤ R) is a matrix of probabilities of size (W/R) × (W/R) of the form:

    M*r = (R / (W(W − R))) · (J − I),    (2)

i.e., with 0's on its diagonal and every off-diagonal entry equal to R/(W(W − R)), where J is the all-ones matrix and I is the identity matrix of size W/R.

Proof. The proof of the theorem is omitted here. It is found in [11]. □

In a real-world scenario where E is noisy, i.e., where objects from different groups can be paired together in a query, the general form of M* is:

    M* = the matrix with the blocks M*1, · · · , M*R on its diagonal and the block θ in every off-diagonal position,    (3)

where θ and the M*r's are specified as per Eqs. (4) and (5).

Theorem 2. In the presence of noise in E, the entries of the pair ⟨Ai, Aj⟩ can be selected from two distinct classes, and hence the matrix specifying the probabilities of the accesses of the pairs obeys Eq. (3), where:

    θ = θo · J,    (4)
    M*r = θd · (J − I),    (5)

in which J is the all-ones matrix and I is the identity matrix of the appropriate sizes, 0 < θd < 1 is the coefficient which specifies the accuracy of E, and θo is related to θd as θd = (1 − θo(W − W/R)) / (W/R − 1).

Proof. The proof of the theorem is omitted here and found in [11]. □

3.1 The Design and Results of the POMA

In a real world scenario, since E’s true statistical model is unknown, the expressions in Eqs. (1) and (2) can only be estimated through observing a set of queries. In the presence of noise though, we need to devise a measurable quantity which makes the algorithm capable of recognizing divergent pairs. Observe that whenever a real query Ai , Aj  appears, we will be able to obtain a simple ML estimate of how frequently Ai and Aj are accessed concurrently. Clearly, by virtue of the Law of Large Numbers, these underlying estimates will converge to the corresponding probabilities of E actually containing the elements Ai and Aj in the same group. As the number of queries processed become larger, the quantities inside M∗i will become significantly larger than the quantities in each of the θ matrices. From the plot of these estimates [11], one will observe that the estimates corresponding to the matrix M∗i have much higher values than

The Power of the Pursuit Learning Paradigm in the Partitioning of Data


the off-diagonal entries. This implies that these off-diagonal entries represent divergent queries which move the objects away from their accurate partitions.

Intuitively, the pursuit concept for the OPP can best be presented by a matrix of size W × W, where every entry captures the same statistical measure about the stream of the input pairs. For the sake of simplicity, we use a simple averaging and denote this matrix by P. Every entry represents a pair, and its value is set to the frequency count of the corresponding pair. To obtain the average frequency of each pair, we let the OMA iterate for a sufficient time, say J iterations, and at every incident we update the value of the matrix P accordingly. In this way, at the end of the J-th iteration, we have estimated the frequency of each pair. At this point, by observing the values of the matrix, the user can determine an appropriate threshold (τ > 0) to be adopted as the accept-or-reject policy for any future occurrence of this particular pair of objects. If we permit the algorithm to collect a large enough number of pairs, we see that ∃θ* such that ∀θ_o ≤ θ*, ∀ i, j : μ_{i,j} ≫ θ_{i,j}.

If we utilize a user-defined threshold, τ (which is reasonably close to 0), we will be able to compare every estimate to τ and make a meaningful decision about the identity of the query. In other words, by merely comparing the estimate to τ we can determine whether a query pair ⟨A_i, A_j⟩ should be processed or, quite simply, be ignored. This leads us to the algorithm POMA on Page 74 of [11], in which every query which is inferred to be divergent is ignored. Otherwise, one invokes the Reward and Penalty functions of the original OMA algorithm. The issue of

Table 2. Experimental results for the POMA approach done for an ensemble of 100 runs.

W   W/R  R  POMA_p9       POMA_p8       POMA_p7
4    2   2  (2, 25)       (3, 30)       (3, 38)
6    2   3  (4, 44)       (4, 52)       (5, 67)
-    3   2  (20, 65)      (22, 77)      (24, 106)
9    3   3  (44, 105)     (70, 148)     (85, 169)
12   2   6  (10, 88)      (12, 103)     (20, 173)
-    3   4  (77, 166)     (105, 205)    (292, 462)
-    4   3  (328, 417)    (228, 372)    (202, 487)
-    6   2  (1563, 1836)  (945, 1091)   (1088, 1395)
15   3   5  (112, 213)    (142, 274)    (179, 315)
-    5   3  (1534, 1655)  (766, 998)    (556, 931)
18   2   9  (20, 151)     (26, 161)     (29, 566)
-    3   6  (245, 410)    (198, 417)    (226, 395)
-    6   3  (3146, 3270)  (2182, 2371)  (1145, 1542)
-    9   2  (5500, 5621)  (5064, 5523)  (4104, 4711)

determining the parameters of the POMA algorithm is detailed in [11], and omitted here in the interest of space.

In the initial phase of the algorithm, the estimates for the queries are unavailable. Thus, it only makes sense to consider every single query and to process them using the OMA's Reward and Penalty functions. Since the objects in each class are equally likely to occur and the classes are equi-probable, $k \ge \left(\left(\frac{W}{R}\right)^2 - \frac{W}{R}\right) \times R$ is chosen as the lower bound of the number of iterations for any meaningful initialization.

We have compared our results with those presented in [5] and those reported for the original OMA, for various values of R and W. The number of states in every action was set to 10, and convergence was expected to have taken place as soon as all the objects in the POMA fell within the last two internal states. The results (specified using the same notation as in Table 1) obtained are outstanding, and are summarized in Table 2. The simulation results are based on an ensemble of 100 runs with different uncertainty values (i.e., values of p). To observe the efficiency of the POMA, consider an easy-to-learn Environment of 6 groups with 2 objects in each group and where p = 0.9. It took the OMA 599 iterations to converge. As opposed to this, the POMA converged in only 69 iterations, which represents almost a ten-fold improvement. On the other hand, given a difficult-to-learn Environment with 12 objects in 2 groups, the OMA needs 6,506 iterations to converge. The POMA required only 2,112 iterations to converge, which is more than a three-fold improvement.
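The estimation-and-thresholding scheme described above can be sketched as follows. This is a minimal illustration with hypothetical names (`PursuitFilter`, `observe`, `accept`), not the authors' POMA implementation; the ML estimates are simple relative frequencies, and any pair whose estimate stays below the user-defined τ is treated as divergent:

```python
import numpy as np

class PursuitFilter:
    """Maintain ML estimates of pair frequencies and reject divergent queries."""
    def __init__(self, W, tau):
        self.counts = np.zeros((W, W))  # the matrix P of pair frequency counts
        self.total = 0
        self.tau = tau

    def observe(self, i, j):
        """Update the estimates with the observed query pair <A_i, A_j>."""
        self.counts[i, j] += 1
        self.counts[j, i] += 1          # queries are symmetric
        self.total += 1

    def accept(self, i, j):
        """Process the pair only if its estimated frequency exceeds tau."""
        return self.counts[i, j] / max(self.total, 1) > self.tau

f = PursuitFilter(W=6, tau=0.05)
for i, j in [(0, 1)] * 40 + [(0, 3)]:   # 40 in-class queries, 1 divergent one
    f.observe(i, j)
print(f.accept(0, 1), f.accept(0, 3))   # True False
```

In the POMA proper, accepted queries are forwarded to the OMA's Reward/Penalty functions, while rejected ones are simply ignored.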

4 Enhanced OMA (EOMA)

The learning of an enhanced LA proposed by Gale et al. [5] is based on the same principles as the OMA, and leads to the Enhanced OMA (EOMA). They introduced three enhancements to improve its efficiency and speed, as below:

1. Initial Boundary State Distribution: All of the objects are initially distributed at the respective boundary states of their respective classes;
2. Redefinition of Internal States: They diminished the vulnerability of the convergence criterion of the OMA by redefining the internal state to include "the two innermost states of each class", rather than a single innermost state.
3. Breaking the Deadlock: The original OMA possesses a deadlock-prone infirmity (please see Section 4.3 of [11]) in which the machine can cycle between two identical configurations by virtue of a sequence of query pairs. This is especially evident in noise-free Environments. The EOMA remedies this as follows. Given a query pair of objects ⟨O_i, O_j⟩, let us assume that O_i is in the boundary state, and O_j is in a non-boundary (internal) state of another class. If there exists an object in the boundary state of O_j's class, we propose that it gets swapped with the boundary object O_i, so as to bring both of the queried objects together in the same class. Simultaneously, a non-boundary object has to be moved toward the boundary state of its class. Otherwise, if there is no object in the boundary state of the class that contained O_j, the algorithm performs identically to the OMA.
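The deadlock-breaking rule (enhancement 3) can be sketched as below. The data structures are assumptions made purely for illustration (dictionaries `cls` and `state`, with state 0 taken as the boundary state), not the authors' representation:

```python
BOUNDARY = 0  # assumed: state 0 is the boundary (outermost) state of a class

def break_deadlock(state, cls, oi, oj):
    """EOMA rule for a query <oi, oj> with oi at its boundary state and oj
    internal in a different class: swap oi with a boundary object of oj's
    class, so that both queried objects end up in the same class."""
    if state[oi] != BOUNDARY or cls[oi] == cls[oj] or state[oj] == BOUNDARY:
        return False
    candidates = [o for o in cls
                  if cls[o] == cls[oj] and state[o] == BOUNDARY and o != oj]
    if not candidates:
        return False                          # fall back to plain OMA behaviour
    ok = candidates[0]
    cls[oi], cls[ok] = cls[ok], cls[oi]       # the swap
    return True

cls = {0: "A", 1: "A", 2: "B", 3: "B"}        # current class of each object
state = {0: 0, 1: 1, 2: 0, 3: 1}              # 0 = boundary, 1 = internal
swapped = break_deadlock(state, cls, 0, 3)    # query <O0, O3> triggers a swap
```

After the call, objects 0 and 3 share the same class, which is precisely the configuration the swap is designed to reach.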


The simulation results obtained by introducing all three of the above-mentioned modifications are given in Table 3. In this table, we have reported the results from various Environments, with probabilities p = 0.9, 0.8 and 0.7, where p = 0.9 is the near-optimal Environment. Such an Environment is easy to learn from. On the other hand, for the case where p = 0.7, we encounter a difficult-to-learn scenario. We have also compared our own implementation of the OMA algorithm described in [5] with the EOMA's simulation results. These are the results displayed in Table 3. All the simulations reported were done on an ensemble of 100 experiments to guarantee statistically stable results.

From the table we clearly see that as the value of p increases, the queries are more informative, and the convergence occurs at a faster rate. As before, the complexity of the classification problem has two criteria, which are both observable in the tables, i.e., the number of objects in every group, W/R, and the number of groups, R. As the number of objects and groups increases, the problem becomes increasingly complex to solve. The reader will easily observe the advantages gleaned by the above three modifications in the EOMA, by comparing Tables 1 and 3.

The convergence of the EOMA with respect to time starts with a large number of objects which are located in random partitions. This number steadily decreases with time to a very small value. This graph is not monotonic for any

Table 3. Experimental results for the Enhanced OMA (EOMA) done for an ensemble of 100 runs.

W   W/R  R  EOMA_p9     EOMA_p8     EOMA_p7
4    2   2  (2, 26)     (2, 30)     (3, 60)
6    2   3  (4, 46)     (4, 65)     (5, 106)
-    3   2  (6, 50)     (8, 74)     (11, 127)
8    2   4  (6, 64)     (7, 95)     (8, 158)
-    4   2  (14, 75)    (20, 110)   (32, 185)
9    3   3  (18, 91)    (24, 132)   (35, 233)
10   2   5  (8, 85)     (10, 118)   (13, 226)
-    5   2  (25, 106)   (33, 153)   (70, 277)
12   2   6  (10, 102)   (12, 154)   (17, 291)
-    3   4  (43, 136)   (56, 207)   (81, 380)
-    4   3  (54, 150)   (66, 196)   (99, 388)
-    6   2  (40, 133)   (64, 208)   (105, 405)
15   3   5  (65, 187)   (92, 284)   (134, 554)
-    5   3  (75, 191)   (108, 295)  (192, 617)
18   2   9  (19, 170)   (26, 253)   (36, 630)
-    3   6  (106, 258)  (140, 389)  (242, 827)
-    6   3  (114, 255)  (167, 392)  (261, 857)
-    9   2  (112, 246)  (142, 363)  (311, 854)


given experiment. But from the perspective of an ensemble, the performance is much more monotonic in behavior.

5 Enhancing the EOMA with a Pursuit Paradigm

The methodology by which we incorporated the pursuit concept into the OMA required us to formally model noise-free and noisy queries in Sect. 3. However, this is the precise dilemma that the EOMA faces. On the one hand, it would be advantageous, from a partitioning perspective, to have a noise-free Environment. However, it is precisely such noise-free Environments that lead to deadlock situations. Consequently, an attempt to elevate a noisy Environment to become noise-free would only defeat the purpose by exaggerating deadlock scenarios. Rather than seeking to make the Environment noise-free, we will again accept or reject queries from the incoming stream. To accomplish this, we again apply the same "Pursuit" paradigm, explained below for this specific setting.

To design the PEOMA, as before, we again incorporate the pursuit principle by estimating the Environment's reward/penalty probabilities. This could, of course, be done based on either an ML or a Bayesian methodology. As these estimates become more accurate, we force the LA to converge to the superior actions at a faster rate. In all brevity, the PEOMA utilizes the exact same Pursuit principles explained in Sect. 3.1. Essentially, the EOMA, which mitigates the "deadlock" situation, is now augmented with the Pursuit concept, and thus:

– The stream of queries is processed using an estimation phase;
– The divergent queries are filtered using a thresholding phase, which serves as a filter for the above estimates;
– The deadlock scenarios are resolved using the enhancements of the EOMA over the OMA;
– The relaxed convergence criterion, by which the two innermost states of every group report convergence, makes the entire process converge even faster.

The experimental results compared the PEOMA with the EOMA, and we were able to show how the PEOMA out-performed both the EOMA and the OMA. The results are given in Table 4, which uses the same notation and the same settings as Tables 1 and 2.
One can also compare its performance with the results presented in [5] and those reported in Table 3, for various values of R and W and in different Environments. The number of states in every action was set to 10, as in Table 4. The convergence condition was also identical to the one specified in Sect. 4, and convergence was assumed to have taken place when all the objects in the PEOMA fell within the last two internal states. Further, the query probability approximations were updated after receiving every single query. The performance significance of the PEOMA is really not noticeable for easy problems, where we had a small number of objects and groups, and where the noise level was low. But it becomes invaluable when we encounter a large number of actions as well as a stream of divergent queries (when p is "smaller") throughout the simulation, especially if we factor in the number of iterations


used to obtain an estimate of τ. Indeed, the PEOMA's performance can be up to more than two times better than the EOMA's. But if we compare the results with those of the original OMA, the immense performance gain amounts to about forty times fewer iterations for a complete convergence – which is by no means insignificant. It is fascinating to note that the reduction in the number of iterations required by the PEOMA can again be seen to be a consequence of a Pursuit-like filtering phase in all problem domains.

Table 4. Experimental results for the PEOMA approach done for an ensemble of 100 runs.

W   W/R  R  PEOMA_p9    PEOMA_p8    PEOMA_p7
4    2   2  (2, 23)     (2, 37)     (3, 44)
6    2   3  (4, 42)     (4, 52)     (5, 73)
-    3   2  (7, 47)     (8, 62)     (10, 91)
8    2   4  (6, 59)     (6, 76)     (8, 102)
-    4   2  (15, 73)    (23, 100)   (36, 145)
9    3   3  (20, 85)    (24, 110)   (40, 146)
10   2   5  (8, 79)     (10, 102)   (12, 141)
-    5   2  (26, 100)   (36, 140)   (54, 213)
12   2   6  (10, 97)    (12, 129)   (17, 181)
-    3   4  (38, 126)   (55, 165)   (74, 222)
-    4   3  (44, 134)   (58, 165)   (87, 241)
-    6   2  (34, 127)   (60, 182)   (110, 310)
15   3   5  (72, 174)   (88, 228)   (147, 308)
-    5   3  (76, 185)   (105, 249)  (155, 348)
18   2   9  (19, 166)   (26, 218)   (36, 323)
-    3   6  (98, 231)   (139, 310)  (207, 419)
-    6   3  (118, 246)  (162, 328)  (239, 472)
-    9   2  (100, 236)  (133, 330)  (280, 553)

6 Cohesiveness in the EPP: The Transitive PEOMA

The first issue that we encounter when we want to advance the field of resolving the EPP is to see if we can use new criteria to identify which objects belong to the same partition. We intend to investigate how this can be inferred without considering the issues that have been analyzed earlier. It is easy to see that all the objects within an underlying partition should be strongly and directly related to each other, and that they should frequently co-appear in the queries. Such structural patterns are, in turn, based on so-called causal propositions which


should lead towards relational "interactions" between the objects themselves. This is the avenue that we now investigate.

Structural relations that are imposed by the Environment can orient the objects towards a uniformity when there is an "interaction" between a pair of objects. Such relations may be "transmitted" through intermediaries even when two objects are not explicitly examined at any given time instant. This interconnection is directly associated with the relational bonds that these objects possess. We shall now investigate whether this property, which already relates subgroups and not just pairs, can be quantified by various specific properties that can be extracted from the Environment. They can be seen to be:

1. The frequency of objects co-occurring;
2. The relative frequency of the objects in a pair belonging to distinct partitions;
3. The symmetric property of the queries in any pair presented by E;
4. The reachability of the objects in a partition within the graph representing the set of all objects.

We first formalize the partitioning problem's symmetry and transitivity properties, proven in [11].

Theorem 3. The model of E and the solution invoked by any pursuit-based paradigm of the EPP possess the property of symmetry.

Theorem 4. The model of E proposed for the EPP possesses the property of transitivity from a probabilistic perspective.

Since E is transitive, our aim is now to have the LA infer this transitivity and to further enhance the PEOMA. Indeed, if the pursuit matrix is appropriately thresholded, the entries become unity and zero, which allows us to demonstrate transitivity and thus invoke reward/penalty operations even while the Environment is dormant and not generating any new queries. Without going into the explicit details (omitted due to space limitations), this is essentially done by invoking the assertion: ∀O_i, O_j, O_k ∈ W : (O_i R O_j ∧ O_j R O_k) ⟹ O_i R O_k. This leads us to the Transitive PEOMA (TPEOMA).

The experimental results for the TPEOMA are given in Table 5 for the same settings as in the previous tables. By way of example, if the TPEOMA is compared with the previously best-reported algorithm, the PEOMA of Sect. 5, one can see that the PEOMA can solve the partitioning problem with p = 0.9 and 3 groups with 3 objects in each group in 85 iterations. For the same problem, the TPEOMA required only 65 iterations to converge. For a difficult-to-learn Environment (p = 0.7) and a more complex partitioning problem with 18 objects in 3 groups, the PEOMA needed 472 iterations to converge. The TPEOMA required only 244 iterations to converge, which is nearly two times better than the PEOMA. It is certainly the fastest partitioning algorithm reported to date, and its behavior is monotonically decreasing for an ensemble of many experiments. The reader should observe the considerable performance that is gained at very little additional computational cost.
Again, by comparing Tables 4 and 5, one observes that although the gain is not significant for simple problems and easy Environments, it becomes remarkably high for complex partitioning experiments.
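The transitive inference itself can be sketched with a standard transitive-closure computation (Warshall's algorithm) over the thresholded pursuit matrix. This is an illustrative reading of the assertion above, not the authors' TPEOMA code:

```python
import numpy as np

def transitive_closure(related):
    """Warshall's algorithm on the thresholded (0/1) pursuit matrix:
    (Oi R Oj) and (Oj R Ok) imply (Oi R Ok)."""
    C = related.copy()
    n = len(C)
    for k in range(n):
        for i in range(n):
            if C[i, k]:
                C[i] |= C[k]       # row i inherits all of row k's relations
    return C

# Thresholded pursuit matrix for 4 objects: pairs (0,1) and (1,2) exceeded
# tau; object 3 never co-occurred with the others.
P = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=bool)
C = transitive_closure(P)
print(C[0, 2])   # True: 0 R 2 is inferred although <O0, O2> never appeared
```

Each inferred pair can then drive reward/penalty operations even while the Environment is dormant, exactly as described above.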


Table 5. Experimental results for the TPEOMA approach done for an ensemble of 100 runs.

W   W/R  R  TPEOMA_p9   TPEOMA_p8   TPEOMA_p7
4    2   2  (2, 24)     (2, 30)     (3, 40)
6    2   3  (4, 41)     (4, 51)     (5, 64)
-    3   2  (6, 37)     (8, 50)     (13, 74)
8    2   4  (7, 57)     (7, 71)     (8, 91)
-    4   2  (14, 50)    (25, 78)    (41, 125)
9    3   3  (19, 65)    (21, 78)    (29, 113)
10   2   5  (8, 75)     (10, 95)    (14, 121)
-    5   2  (26, 69)    (41, 92)    (76, 178)
12   2   6  (12, 95)    (15, 123)   (18, 155)
-    3   4  (30, 91)    (37, 110)   (52, 155)
-    4   3  (34, 86)    (47, 107)   (66, 157)
-    6   2  (43, 86)    (62, 121)   (111, 209)
15   3   5  (48, 123)   (61, 159)   (81, 203)
-    5   3  (51, 101)   (71, 133)   (105, 205)
18   2   9  (20, 156)   (28, 199)   (36, 275)
-    3   6  (66, 153)   (85, 194)   (126, 283)
-    6   3  (63, 126)   (95, 170)   (136, 244)
-    9   2  (77, 129)   (148, 222)  (268, 391)

7 Conclusions

In this paper we have shown how we can utilize the "Pursuit" concept to enhance solutions to the general problem of partitioning. Unlike traditional Learning Automata (LA), which work with the understanding that the actions are chosen purely based on the "state" in which the machine is, the "Pursuit" concept has been used to estimate the Random Environment's (RE's) reward probabilities, and to take these into consideration in the design of Estimator/Pursuit LA. These utilize "cheap" estimates of the Environment's reward probabilities to make the automata converge by an order of magnitude faster. This is achieved by using inexpensive estimates of the reward probabilities to rank the actions. Thereafter, when the action probability vector has to be updated, it is done not on the basis of the Environment's response alone, but also based on the ranking of these estimates. In this paper we have shown how the "Pursuit" learning paradigm can be, and has been, used in Object Partitioning. The results demonstrate that incorporating this paradigm can hasten the partitioning by an order of magnitude. This paper comprehensively describes all the Object Migration Automaton (OMA)-related machines to date, including the Enhanced OMA [5]. It then incorporates the


Pursuit paradigm to yield the Pursuit OMA (POMA), the Pursuit Enhanced OMA (PEOMA) and the Transitive Pursuit Enhanced OMA (TPEOMA). Apart from the schemes themselves, the paper reports the experimental results that have been obtained by testing them on benchmark Environments.

References

1. Godsil, C., Royle, G.F.: Algebraic Graph Theory. Graduate Texts in Mathematics, vol. 207. Springer, New York (2013). https://doi.org/10.1007/978-1-4613-0163-9
2. Biggs, N.: Algebraic Graph Theory. Cambridge University Press, Cambridge (1993)
3. Fayyoumi, E., Oommen, B.J.: Achieving microaggregation for secure statistical databases using fixed-structure partitioning-based learning automata. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 39(5), 1192–1205 (2009)
4. Freuder, E.C.: The object partition problem. Vision Flash (4) (1971)
5. Gale, W., Das, S., Yu, C.T.: Improvements to an algorithm for equipartitioning. IEEE Trans. Comput. 39(5), 706–710 (1990)
6. Jobava, A.: Intelligent traffic-aware consolidation of virtual machines in a data center. Master's thesis, University of Oslo (2015)
7. Lanctot, J.K., Oommen, B.J.: Discretized estimator learning automata. IEEE Trans. Syst. Man Cybern. 22(6), 1473–1483 (1992)
8. Mamaghani, A.S., Mahi, M., Meybodi, M.: A learning automaton based approach for data fragments allocation in distributed database systems. In: 2010 IEEE 10th International Conference on Computer and Information Technology (CIT), pp. 8–12. IEEE (2010)
9. Oommen, B.J., Ma, D.C.Y.: Stochastic automata solutions to the object partitioning problem. Carleton University, School of Computer Science (1986)
10. Oommen, B.J., Agache, M.: Continuous and discretized pursuit learning schemes: various algorithms and their comparison. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 31(3), 277–287 (2001)
11. Shirvani, A.: Novel solutions and applications of the object partitioning problem. Ph.D. thesis, Carleton University, Ottawa, Canada (2018)
12. Yazidi, A., Granmo, O.C., Oommen, B.J.: Service selection in stochastic environments: a learning-automaton based solution. Appl. Intell. 36(3), 617–637 (2012)
13. Amer, A., Oommen, B.J.: A novel framework for self-organizing lists in environments with locality of reference: lists-on-lists. Comput. J. 50(2), 186–196 (2007)

AI Anomaly Detection - Active Learning

Cyber-Typhon: An Online Multi-task Anomaly Detection Framework

Konstantinos Demertzis¹, Lazaros Iliadis¹, Panayiotis Kikiras², and Nikos Tziritas³

¹ School of Engineering, Department of Civil Engineering, Faculty of Mathematics Programming and General Courses, Democritus University of Thrace, Kimmeria, Xanthi, Greece
[email protected], [email protected]
² School of Science, Department of Computer Science, University of Thessaly, Lamia, Greece
[email protected]
³ Research Center for Cloud Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
[email protected]

Abstract. According to Greek mythology, Typhon was a gigantic monster with one hundred dragon heads, bigger than all mountains. His open hands extended from East to West, his head could reach the sky, and flames came out of his mouth. His body below the waist consisted of coiled snakes. This research effort introduces "Cyber-Typhon" (CYTY), an Online Multi-Task Anomaly Detection Framework. It aims to fully upgrade old passive infrastructure through an intelligent mechanism, using advanced Computational Intelligence (COIN) algorithms. More specifically, it proposes an intelligent Multi-Task Learning framework, which combines On-Line Sequential Extreme Learning Machines (OS-ELM) and Restricted Boltzmann Machines (RBMs) in order to control data flows. The final target of this model is the intelligent classification of Critical Infrastructures' network flow, resulting in Anomaly Detection due to Advanced Persistent Threat (APT) attacks.

Keywords: Deep content inspection · Anomaly detection · Multi-task learning · Online learning · Restricted Boltzmann Machine · Critical Infrastructure Protection



1 Introduction

Information generated by complex environments such as the Internet of Things (IoT) ecosystem has increased exponentially. The result is the inefficient management and storage of the total volume of generated information. This requires the adoption of complex data mining and analysis architectures [1]. These architectures should incorporate specialized processing algorithms that dynamically adapt to new standards or data, or to scaled data production as a function of time [2].

© IFIP International Federation for Information Processing 2019 Published by Springer Nature Switzerland AG 2019 J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 19–36, 2019. https://doi.org/10.1007/978-3-030-19823-7_2


K. Demertzis et al.

Although mining Data Streams (DAST) is an emerging area, it poses enormous challenges to the data mining community. High transmission speed, change of data distribution, and high volume raise the following issues that should be addressed [3]:

• High Velocity: Online DAST arrive at a very high speed. Thus, it has become almost practically infeasible to scan all of them. This is also the case for the offline ones.
• Concept Drift: Frequent Patterns (FREP) keep changing, as data streams are time-varying in nature. During the mining process, as new incoming data are added to the existing ones, some FREP may change their status to become infrequent, and vice versa. This issue is known as the Concept Drift (CDR) problem.
• Unbounded Size (UNS): DAST are unbounded in size. Their size is unknown to the user in advance, unlike the static data.
• Enormous Space Requirement (ESR): Huge amounts of data are generated both in online and offline applications. There might not be enough space to store the data stream before processing.
• Unsteady Analysis Results (UAR): High speed as well as varying data distribution may affect the analyzed results, due to which the mining outcome may decline. In order to cope with this, DAST mining must be an incremental process.

Anomaly Detection [4] over multiple data streams is initially determined by the observation of a single (multi-variable) time series frequency, which constitutes the system's quantitative performance parameters. A data stream s_i may be coming from a system of sensors, and it consists of numerical values, where s_i(t) stands for the data stream flow value at time t, with t ∈ [0, +∞). If the sensors' flows are synchronized periodically to report their values, the whole of the multi-variable information at each time t is represented by the frame vector shown in Eq. 1 [5]:

$$DP_t = (s_1(t), s_2(t), \ldots, s_n(t)) \in \mathbb{R}^n \quad (1)$$

In practice, each flow forms a one-dimensional time series, while the frame vector flow (FVF) represents a multi-variable time series. Event detection on DAST is intended to determine the values s_i(t) which represent abrupt changes within an FVF. Particularly, each FVF of length n is converted to a binary vector of the same length, where each value represents a possible change in the corresponding sensor flux. Such deviations from the normal behavior are called events, and the binary vectors are called event vectors. An event may be an observation that does not conform to an expected standard in the data set (an anomaly). Incidents may have been caused by a variety of reasons, like sensor failure or malfunction, or deviations and substantial changes that may affect the system's behavior, such as Cyber Attacks (CYA) [7]. Therefore, a vector of events at time t is represented by Eq. 2 below [3, 5]:

$$DR_t = (e_1^t, e_2^t, \ldots, e_n^t) \in \{0, 1\}^n \quad (2)$$
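As an illustration of the conversion from a frame vector DP_t to an event vector DR_t, the sketch below uses a hypothetical z-score change detector; the paper does not prescribe a specific detection rule, so the threshold logic here is only an assumption:

```python
import numpy as np

def event_vector(frame, mean, std, z=3.0):
    """Convert a frame vector DP_t into a binary event vector DR_t:
    e_i = 1 when s_i(t) deviates from its running mean by more than z sigmas."""
    return (np.abs(frame - mean) > z * std).astype(int)

mean = np.array([20.0, 1.0, 50.0])     # per-sensor running statistics
std = np.array([0.5, 0.1, 2.0])
frame = np.array([20.3, 1.9, 49.0])    # sensor 2 spikes abnormally
print(event_vector(frame, mean, std))  # [0 1 0]
```

In a streaming setting, `mean` and `std` would themselves be maintained incrementally over the sliding window discussed below.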


where $e_i^t = e_i(t)$ is the binary value which represents the occurrence of an abnormal flow behavior, i.e., $e_i(t) = 1$ if the flow $s_i$ behaves abnormally at time t, with $e_i(t) \in \{0, 1\}$.

The error is calculated at each iteration, as the data characteristics can change drastically and in an unpredictable way, changing the typical, normal behavior. An object that may be considered abnormal can later be included in the set of normal observations, due to rapid developments in the data stream. Since the data volume is unlimited, data mining is performed on a subset of the flow, called a sliding window, which contains a small but recent percentage of the observations. The goal of the data flow processing algorithms is to minimize the cumulative error over all iterations, which can be calculated by the following Eq. 3 [3, 5]:

$$I_n[w] = \sum_{j=1}^{n} V(\langle w, x_j \rangle, y_j) = \sum_{j=1}^{n} \left(x_j^T w - y_j\right)^2 \quad (3)$$

where $x_j \in \mathbb{R}^d$, $w \in \mathbb{R}^d$ and $y_j \in \mathbb{R}$. We consider that $X_{i \times d}$ is a data matrix and $Y_{i \times 1}$ is a matrix with target values, after the arrival of the first i data points. Assuming that the covariance matrix $\Sigma_i = X^T X$ is invertible, the optimal solution $f^*(x) = \langle w^*, x \rangle$ is given by the following Eq. 4 [3, 5]:

$$w^* = \left(X^T X\right)^{-1} X^T Y = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j \quad (4)$$

First the covariance matrix is calculated by using the following Eq. 5:

$$\Sigma_i = \sum_{j=1}^{i} x_j x_j^T \quad (5)$$

The initial time complexity (CM) is calculated to be of order $O(id^2)$; but after we invert the $(d \times d)$ matrix $X^T X$, it increases to $O(d^3)$, while the rest of the required multiplications have an $O(d^2)$ CM. This produces an overall CM of $O(id^2 + d^3)$ [3, 5] (d is the size of the window). It is conceivable that robust systems, ensuring reliability and high accuracy rates without requiring high availability of resources, are required to safely approach problems stemming from knowledge mining processes. The above argument is further supported as follows: suppose that the number of data points is equal to n and that, after the arrival of each new data point i = 1, 2, ..., n, the recalculation of the solution is required. In this case the total time complexity would be equal to $O(n^2 d^2 + nd^3)$ [5].

Multi-Task Learning (MTL) is a robust method for facing some of the major challenges of Big Data Stream processing with Online Learning algorithms. MTL is a subfield of Machine Learning (ML) in which multiple learning tasks are solved at the same time, utilizing common elements or differences arising from the multiple tasks included in the case study [6, 7]. More generally, MTL is an inductive transfer method which improves generalization. The common features or differences that arise between distributed tasks during the training process are


transferred or shared as guaranteed and unambiguous knowledge in subsequent relevant or unrelated tasks, maximizing the accuracy of the model. MTL is efficient because the regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. The following approaches are characteristic cases of MTL [6, 7]:

• Task grouping and overlapping, where tasks are grouped or provided in an overlapping way so that there is a relevance capable of leading to the use of cognitive or learning relationships.
• Exploiting unrelated tasks, where the common learning of non-relevant tasks using the same input data can be beneficial for learning the main tasks of an application field.
• Transfer of knowledge, where transfer of relevant knowledge is carried out to achieve learning from correspondingly trained models.
• Group online adaptive learning, where transfer of previous experience or knowledge into continually changing environments takes place.
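Returning to the complexity argument of the Introduction: the $O(n^2 d^2 + nd^3)$ cost of recomputing $w^*$ from scratch after every arrival is precisely what incremental schemes avoid. A standard recursive least-squares sketch (a textbook technique, not part of the paper's framework) maintains $\Sigma_i^{-1}$ with a rank-one Sherman-Morrison update, so each new point costs only $O(d^2)$:

```python
import numpy as np

class RecursiveLS:
    """Maintain w = Sigma_i^{-1} * sum_j x_j y_j incrementally: each new
    point costs O(d^2) via the Sherman-Morrison identity, instead of the
    O(d^3) re-inversion of X^T X that full recomputation would require."""
    def __init__(self, d, eps=1e-3):
        self.P = np.eye(d) / eps      # P ~= Sigma_i^{-1} (eps regularizes)
        self.b = np.zeros(d)          # running sum of x_j * y_j

    def update(self, x, y):
        Px = self.P @ x
        self.P -= np.outer(Px, Px) / (1.0 + x @ Px)   # rank-1 inverse update
        self.b += y * x
        return self.P @ self.b        # current weight estimate

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
model = RecursiveLS(d=3)
for _ in range(500):
    x = rng.normal(size=3)
    w = model.update(x, x @ w_true)
```

After a few hundred points the estimate converges to the generating weights, at a per-point cost independent of the stream length.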

2 Literature Review

Chen and Abdelwahed [8] have applied autonomous computing technology to monitor SCADA system performance. Their approach proactively estimates upcoming attacks for a given system model of a physical infrastructure. Soupionis et al. [9] have proposed a combinatorial method for the automatic detection and classification of faults and cyber-attacks occurring on the power grid system, when there is limited data from the power grid nodes due to cyber implications. In addition, Zhu et al. [10] have described the network attack knowledge based on the theory of the factor expression of knowledge, and have studied the formal knowledge theory of SCADA networks from the factor state space and equivalence partitioning. This approach utilizes the Factor Neural Network (FNN) theory, which contains high-level knowledge and quantitative reasoning, to establish a predictive model including an analytic FNN and an analogous FNN. This model abstracts and builds an equivalent and corresponding network attack and defense knowledge factors system. Finally, [11] has introduced a new European Framework-7 project, Cockpit CI (Critical Infrastructure), and the roles of intelligent machine learning methods in preventing SCADA systems from cyber-attacks.

Existing multi-task learning methods can be categorized into two main categories: learning with feature covariance and learning with task relations [12]. Different from prior solutions to distributed multi-task learning, which are focused on the former category [13], our proposed multi-task learning falls into the latter category. On the other hand, distributed machine learning has attracted more and more interest recently [14], and tremendous efforts have been devoted to different machine learning problems. Online multi-task learning assumes that instances from different tasks arrive in a sequence, and an adversary chooses the task to learn. Cavallanti et al. [15] exploited online multi-task learning with a given task relationship encoded in a matrix, which is known

Cyber-Typhon: An Online Multi-task Anomaly Detection Framework

23

beforehand. Parallel multi-task learning aims to develop parallel computing algorithms for multi-task learning in a shared-memory computing environment. Recently, Zhang [7] proposed a parallel multi-task learning algorithm named PMTL. In this work, dual forms of three losses are presented and accelerated proximal gradient method is applied to make the problem decomposable, and thus possible to be solved in parallel. Finally, distributed multi-task learning is an area that has not been much exploited. Wang et al. [13] proposed a distributed algorithm for multi-task learning by assuming that different tasks are related through shared sparsity. In another work [6], asynchronous distributed multi-task learning method is proposed for multi-task learning with shared subspace learning or shared feature subset learning. Different from the above-mentioned approaches, our method aims at solving multi-task learning by learning task relationships from data, which can be positive, negative, or unrelated, via a task-covariance matrix.

3 The Proposed Cyber-Typhon Framework

This research proposes the Cyber-Typhon model, which combines the algorithmic power of OS-ELM and RBM in a hybrid mode. It is an innovative Multi-Task Learning approach [16] that performs control of network traffic in Critical Infrastructures [17–19]. The final target is the detection of vulnerabilities that are usually due to APT attacks [20–22].

More specifically, the Cyber-Typhon initially extracts features related to network traffic, which are used as input to an OS-ELM neural network. The OS-ELM has been trained with proper data so that it can either classify the traffic as normal or, in the opposite case, identify the threat or the attack type. It thus performs multiclass classification in order to identify one of the following eight (8) classes: Normal, Naïve Malicious Response Injection (NMRI), Complex Malicious Response Injection (CMRI), Malicious State Command Injection (MSCI), Malicious Parameter Command Injection (MPCI), Malicious Function Code Injection (MFCI), Denial of Service (DoS) and Reconnaissance (Recon).

If the network traffic is normal, further communication is allowed. In the opposite case, the type of anomaly is determined and the data flow is redirected to a dedicated, fully specialized Restricted Boltzmann Machine. If the first RBM does not recognize the specific anomaly for which it is specialized, the data are redirected to the next RBM, responsible for the detection of another anomaly, and so on until successful identification is achieved. If detection cannot be accomplished by any of the trained RBMs (which are as many as the types of known anomalies), the network flow data return to the initial OS-ELM, which can perform online sequential learning; the classification effort can thus be re-adjusted. The following Fig. 1 presents the architecture of the Cyber-Typhon.
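The cascade just described (multiclass screening by the OS-ELM, then a chain of specialized one-class RBMs, with a fallback to online re-learning) can be sketched in Python. All names here (`oselm`, `rbms`, `recon_error`, the thresholds) are illustrative stand-ins, not APIs from the paper:

```python
ATTACK_CLASSES = ["NMRI", "CMRI", "MSCI", "MPCI", "MFCI", "DoS", "Recon"]

def typhon_route(sample, oselm, rbms, thresholds):
    """Route one traffic sample through the OS-ELM / RBM cascade."""
    label = oselm.predict(sample)          # multiclass screening
    if label == "Normal":
        return "allow"                     # normal traffic continues
    # Try the RBM dedicated to the suspected class first, then the rest.
    ordered = [label] + [c for c in ATTACK_CLASSES if c != label]
    for cls in ordered:
        # Each RBM accepts only samples it reconstructs within its threshold.
        if rbms[cls].recon_error(sample) <= thresholds[cls]:
            return "block:" + cls
    # No specialized RBM recognized the anomaly: return the flow to the
    # OS-ELM, whose online sequential learning re-adjusts the classifier.
    oselm.partial_fit(sample)
    return "reexamine"
```

The per-class thresholds and the ordering heuristic are design choices; the paper only specifies that the suspected class is tried first and that unrecognized flows are fed back for online learning.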
K. Demertzis et al.

Fig. 1. Graphic representation of the Cyber-Typhon

The proposed ELM is a Single-Hidden Layer Feed Forward Neural Network (SHLF2N2) [23] with N hidden neurons, randomly selected input weights and random bias values in the hidden layer. The output weights are calculated with a single vector-matrix multiplication [13]. The hidden nodes (hidden-level parameters) can be created randomly before seeing the training data; remarkably, non-differentiable activation functions can be handled, and known neural network problems such as the stopping criterion, the learning rate and the number of learning epochs do not need to be addressed. Specifically, the input data are mapped to a random L-dimensional space with a discrete training set of $N$ samples $(x_i, t_i)$, $i \in [1, N]$, with $x_i \in \mathbb{R}^d$ and $t_i \in \mathbb{R}^c$. The output of the network is given by the following Eq. 6 [23]:

$$f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta, \qquad i \in [1, N] \qquad (6)$$

Vector $\beta = [\beta_1, \ldots, \beta_L]^T$ is the output weight vector matrix connecting the hidden and output nodes. On the other hand, $h(x) = [g_1(x), \ldots, g_L(x)]$ is the output of the hidden nodes for input $x$, and $g_i(x)$ is the output of the $i$-th neuron. Based on a training set $\{(x_i, t_i)\}_{i=1}^{N}$, the ELM solves the learning problem $H\beta = T$, where $T = [t_1, \ldots, t_N]^T$ are the target labels and the output vector matrix $H$ of the hidden layer is given by the following Eq. 7 [23]:

$$H(\omega_j, b_j, x_i) = \begin{bmatrix} g(\omega_1 \cdot x_1 + b_1) & \cdots & g(\omega_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\omega_1 \cdot x_N + b_1) & \cdots & g(\omega_L \cdot x_N + b_L) \end{bmatrix}_{N \times L} \qquad (7)$$

The input weight vector matrix $\omega$ of the hidden layer (before training) and the bias vectors $b$ are created randomly in the interval $[-1, 1]$, by employing Eqs. 8a and 8b:

$$\omega_j = [\omega_{j1}, \omega_{j2}, \ldots, \omega_{jm}]^T \qquad (8a)$$

and

$$b_j = [b_{j1}, b_{j2}, \ldots, b_{jm}]^T \qquad (8b)$$

The output vector matrix $H$ of the hidden layer is calculated by applying the activation function to the training dataset, based on Eq. 9a, and the output weights $\beta$ are calculated based on Eq. 9b:

$$H = g(\omega x + b) \qquad (9a)$$

$$\beta = \left(\frac{I}{C} + H^T H\right)^{-1} H^T X \qquad (9b)$$

where $H = [h_1, \ldots, h_N]$ is the output vector matrix of the hidden layer and $X = [x_1, \ldots, x_N]$ is the input vector matrix of the hidden layer. Indeed, $\beta$ can be calculated by the following Eq. 10:

$$\beta = H^{+} T \qquad (10)$$

where $H^{+}$ is the generalized Moore-Penrose inverse of matrix $H$ [23].

The Cyber-Typhon employs the ELM algorithm with the kernel of the Gaussian radial basis function $K(u, v) = \exp(-c\|u - v\|^2)$ (5). The number of hidden neurons is $k = 20$, the assigned random input weights are $w_i$, and $b_i$, $i = 1, \ldots, N$ are the biases. The hidden-layer output matrix $H$ has been calculated by employing Eq. 11 below [23]:

$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix} \qquad (11)$$

where $h(x) = [h_1(x), \ldots, h_L(x)]$ is the output (row) vector of the hidden layer with respect to the input $x$. Also, $h(x)$ maps the data from the $D$-dimensional input space to the $L$-dimensional hidden-layer feature space (ELM feature space), and thus $h(x)$ is indeed a feature mapping. ELM aims to minimize the training error $\|H\beta - T\|^2$ as well as the norm $\|\beta\|$ of the output weights, where $H$ is the hidden-layer output matrix of Eq. 11. Minimizing the norm of the output weights $\|\beta\|$ is actually achieved by maximizing the distance $2/\|\beta\|$ between the separating margins of the two different classes in the ELM feature space. To calculate the output weights $\beta$ we used Eq. 12 [23]:

$$\beta = \left(\frac{I}{C} + H^T H\right)^{-1} H^T T \qquad (12)$$

where the value of $C$ (a positive constant) and the value of $T$ are obtained from the function approximation of the SHLF2N2 with additive neurons (see Eq. 13 below):

$$t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in \mathbb{R}^m, \qquad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix} \qquad (13)$$

The OS-ELM [16] is an alternative technique for large-scale computing and machine learning that is used when data become available in sequential order, in order to determine a mapping from the data set to the corresponding labels. The main difference between online learning and batch learning techniques is that in online learning the mapping is updated after the arrival of every new data point, in a scalable fashion, whereas batch techniques are used when one has access to the entire training data set at once. OS-ELM is a versatile sequential learning algorithm, because the training observations are presented to the learning algorithm sequentially (one-by-one or chunk-by-chunk, with varying or fixed chunk length). At any time, only the newly arrived single observation or chunk of observations (instead of the entire past data) is seen and learned. A single or a chunk of training observations is discarded as soon as the learning procedure for that particular (single or chunk of) observation(s) is completed. The learning algorithm has no prior knowledge as to how many training observations will be presented. Unlike other sequential learning algorithms, which have many control parameters to be tuned, OS-ELM only requires the number of hidden nodes to be specified [23].

The proposed OS-ELM algorithm consists of two main phases, namely the Boosting phase (BPh) and the Sequential Learning phase (SLPh). The BPh is used to train the SLFNs by applying basic ELM with a batch of training data in the initialization stage. The boosting training data are discarded as soon as the BPh is completed. The volume of the required training data vectors is very small, and it can be equal to the number of hidden neurons. A detailed description of the OS-ELM classifier is given below, where Eqs. 9a and 9b are employed.


Phase 1 (BPh)

The BPh, for a small initial training set $N = \{(x_i, t_i) \mid x_i \in \mathbb{R}^n, t_i \in \mathbb{R}^m, i = 1, \ldots, \tilde{N}\}$, is the following:

(a) Assign arbitrary input weights $w_i$ and biases $b_i$, or centers $\mu_i$ and their corresponding impact widths $\sigma_i$, $i = 1, \ldots, \tilde{N}$, where $\tilde{N}$ is the number of hidden neurons used by the RBF kernel for a specific application.

(b) Calculate the initial hidden-layer output matrix $H_0 = [h_1, \ldots, h_{\tilde{N}}]^T$, where $h_i = [g(w_1 \cdot x_i + b_1), \ldots, g(w_{\tilde{N}} \cdot x_i + b_{\tilde{N}})]^T$, $i = 1, \ldots, \tilde{N}$, and $g$ is the activation function or the RBF kernel.

(c) Estimate the initial output weight $\beta^{(0)} = M_0 H_0^T T_0$, where $M_0 = (H_0^T H_0)^{-1}$ and $T_0 = [t_1, \ldots, t_{\tilde{N}}]^T$.

(d) Set $k = 0$.
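Assuming sigmoid additive hidden nodes and omitting the $I/C$ regularization term for brevity, the boosting phase above and the recursive least-squares updates of the sequential phase described next can be sketched with NumPy. All dimensions and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class OSELM:
    """Minimal OS-ELM sketch with sigmoid additive hidden nodes."""

    def __init__(self, n_inputs, n_hidden):
        # BPh (a): arbitrary input weights and biases in [-1, 1].
        self.W = rng.uniform(-1.0, 1.0, (n_hidden, n_inputs))
        self.b = rng.uniform(-1.0, 1.0, n_hidden)

    def _h(self, X):
        # Hidden-layer output g(w . x + b) with sigmoid activation g.
        return 1.0 / (1.0 + np.exp(-(X @ self.W.T + self.b)))

    def boost(self, X0, T0):
        # BPh (b)-(c): H0, M0 = (H0^T H0)^-1 and beta0 = M0 H0^T T0.
        H0 = self._h(X0)
        self.M = np.linalg.inv(H0.T @ H0)
        self.beta = self.M @ H0.T @ T0

    def sequential(self, X, T):
        # SLPh: recursive least-squares update for a new chunk (X, T),
        # so past observations never have to be revisited.
        H = self._h(X)
        K = np.linalg.inv(np.eye(len(X)) + H @ self.M @ H.T)
        self.M -= self.M @ H.T @ K @ H @ self.M
        self.beta += self.M @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._h(X) @ self.beta
```

The boosting batch must contain at least as many samples as hidden neurons so that $H_0^T H_0$ is invertible, matching the remark that the initial batch can be as small as the number of hidden neurons.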

Phase 2 (SLPh)

In the second phase (SLPh), the OS-ELM algorithm learns the training data one-by-one or chunk-by-chunk, and all of the training data are discarded once the learning procedure on them is completed. The actual steps of this phase, for each incoming observation $(x_i, t_i)$, are described below.

(a) Calculate the hidden-layer output vector $h^{(k+1)} = [g(w_1 \cdot x_i + b_1), \ldots, g(w_{\tilde{N}} \cdot x_i + b_{\tilde{N}})]^T$.

(b) Calculate the latest output weight $\beta^{(k+1)}$ by employing the Recursive Least-Squares (RLS) algorithm, where $\hat{\beta} = (H^T H)^{-1} H^T T$.

(c) Set $k = k + 1$, where $x_i \in \mathbb{R}^n$, $t_i \in \mathbb{R}^m$ and $i = \tilde{N} + 1, \tilde{N} + 2, \tilde{N} + 3, \ldots$

Obviously, if the network flow is classified as normal, it is allowed to continue. In the opposite case, there are 7 RBMs, as many as the abnormal classes, each of which has been trained to perform One-Class Classification (OCC) in order to exclusively recognize one specific network attack [24].

OCC, also known as the Unary Classification Method (UCM), implements intelligent categorization of cases belonging to a specific class among an existing set of records. OCC learns from a training set that contains only records of one specific class. Usually in this method the negative class is absent because it is not sampled; thus, the decision boundaries are effectively determined with knowledge of the positive class only. This process is much more difficult than training a traditional binary or multiclass classifier [25], as the model is trained to accept target objects and to reject those that deviate significantly. Minimizing the errors is also a difficult process, because in this type of


categorization, cross-validation is unavailable since there are no data from the other classes. Finally, it should be stressed that the one-class problem-solving technique is the inverse of the generalization approaches pursued in other machine learning problems, as it tends to provide a fully defined configuration of parameters. This can exponentially increase the complexity of the classifier in trying to correctly classify the target data. The more complex the model, the smaller the accepted range within the target data, and the less likely it is for outliers to be categorized correctly. In practice, one can create a complex model by setting all its possible parameters without being at risk of overfitting. In this sense, OCC is the most appropriate approach for detecting abnormalities and identifying patterns or trends in a set of data that displays more divergent behavior than expected. OCC achieves high levels of successful detection, while maintaining low false-alarm rates [24, 25].

RBMs [26] belong to the family of energy-based models, where each configuration of the variables of interest corresponds to a finite scalar energy value used for training. The learning process is performed by modifying the energy function (ENF) so that its shape acquires the desirable form. Specifically, the RBM is a symmetric graphical model: the units of one layer are connected (and thus dependent) only with the units of the next one. The ENF of the proposed RBM, with $V$ visible units and $H$ hidden units, is defined by Eq. 14 [26]:

$$E(v, h) = -\sum_{i=1}^{V} \sum_{j=1}^{H} v_i h_j w_{ij} - \sum_{i=1}^{V} v_i b_i^v - \sum_{j=1}^{H} h_j b_j^h \qquad (14)$$

where $v$ and $h$ are binary vectors related to the states of the visible and hidden units respectively. Moreover, $v_i$ and $h_j$ correspond to the individual states of each visible unit (VU) $i$ and each hidden unit (HU) $j$ respectively, and $w_{ij}$ is the weight assigned to their connection. Finally, $b_i^v$ and $b_j^h$ are the biases of VU $i$ and HU $j$. The conditional probability $p(v \mid h)$ is given by the following Eq. 15:

$$p(v \mid h) = \frac{e^{-E(v, h)}}{\sum_{g} e^{-E(g, h)}} \qquad (15)$$

In the specific case of examining a unit $k$ of the visible level, the assigned conditional probability distribution (given the state $h$ of the hidden layer) is calculated by Eq. 16:

$$p(v_k = 1 \mid h) = \frac{1}{1 + e^{-\left(\sum_{j=1}^{H} h_j w_{kj} + b_k^v\right)}} \qquad (16)$$

The conditional probabilities $p(h \mid v)$ and $p(h_k = 1 \mid v)$ are defined by Eqs. 17 and 18:

$$p(h \mid v) = \frac{e^{-E(v, h)}}{\sum_{g} e^{-E(v, g)}} \qquad (17)$$

and

$$p(h_k = 1 \mid v) = \frac{1}{1 + e^{-\left(\sum_{i=1}^{V} v_i w_{ik} + b_k^h\right)}} \qquad (18)$$

These relations express the independence of the units of the two levels from the units of the same level. The training process of the RBM is the process of finding values for its parameters that maximize the mean logarithmic probability of the occurrence of the training set $C$ (Eq. 19):

$$\sum_{c=1}^{C} \log p(v^c) = \sum_{c=1}^{C} \log \frac{\sum_{g} e^{-E(v^c, g)}}{\sum_{u} \sum_{g} e^{-E(u, g)}} \qquad (19)$$

The following Eq. 20 estimates the derivative of the cost function with respect to $w_{ij}$, which represents the renewal of the weights:

$$\frac{\partial}{\partial w_{ij}} \sum_{c=1}^{C} \log p(v^c) = \frac{\partial}{\partial w_{ij}} \sum_{c=1}^{C} \log \sum_{g} e^{-E(v^c, g)} - \frac{\partial}{\partial w_{ij}} \sum_{c=1}^{C} \log \sum_{u} \sum_{g} e^{-E(u, g)} \qquad (20)$$

The first term calculates the average value of $v_i^c g_j$ when the visible level of the RBM is driven by the data $(v^c)$, whereas the second term corresponds to the values of $v_i g_j$ when the data are "produced" by the model. An equivalent formulation would suggest that every weight $w_{ij}$ should change by $\Delta w_{ij}$ (Eq. 21):

$$\Delta w_{ij} = \epsilon_w \left( E_{data}[v_i h_j] - E_{model}[v_i h_j] \right) \qquad (21)$$

In this way, the RBM model starts approaching the actual distribution of the data. The first term $E_{data}[v_i h_j]$ can be calculated easily, since, knowing the values of the units at the visible level, we can calculate through the above equations the conditional probability for each unit at the hidden level. The calculation of the second term, which presupposes the existence of samples from the model itself, is more difficult. In this paper we use optimization through Contrastive Divergence (CD) [27]. The steps that we used to produce a single sample can be summarized as follows:

1. Take a training sample $v$, compute the probabilities of the hidden units and sample a hidden activation vector $h$ from this probability distribution.
2. Compute the outer product of $v$ and $h$ and call this the positive gradient.
3. Use $h$ to sample a reconstruction $v'$ of the visible units, and then resample the hidden activations $h'$ (Gibbs sampling step).
4. Compute the outer product of $v'$ and $h'$ and call this the negative gradient.
5. Let the update to the weight matrix $W$ be the positive gradient minus the negative gradient, times a learning rate: $\Delta W = \epsilon (v h^T - v' h'^T)$.
6. Update the biases $a$ and $b$ analogously: $\Delta a = \epsilon (v - v')$, $\Delta b = \epsilon (h - h')$.
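Steps 1-6 can be written down directly for a Bernoulli RBM. This is a generic CD-1 sketch in NumPy (mini-batch averaged), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, a, b, v, lr=0.1):
    """One CD-1 update for a Bernoulli RBM.
    W: (V, H) weights, a: visible biases, b: hidden biases,
    v: batch of binary samples with shape (n, V)."""
    # Step 1: hidden probabilities (Eq. 18) and a sampled activation h.
    ph = sigmoid(v @ W + b)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Step 3: reconstruction v' (Eq. 16), then resampled hidden probs h'.
    pv = sigmoid(h @ W.T + a)
    v_rec = (rng.random(pv.shape) < pv).astype(float)
    ph_rec = sigmoid(v_rec @ W + b)
    # Steps 2, 4, 5: positive gradient minus negative gradient, times lr.
    W += lr * (v.T @ ph - v_rec.T @ ph_rec) / len(v)
    # Step 6: bias updates.
    a += lr * (v - v_rec).mean(axis=0)
    b += lr * (ph - ph_rec).mean(axis=0)
    return W, a, b

def recon_error(W, a, b, v):
    """Mean squared reconstruction error of a trained RBM."""
    pv = sigmoid(sigmoid(v @ W + b) @ W.T + a)
    return float(np.mean((v - pv) ** 2))
```

After repeated `cd1_step` calls on data from one attack class, the reconstruction error on that class drops, which is the quantity the framework later thresholds per class.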


The CD method gives lower energy to the actual data and much higher energy to the "reconstructions" resulting from them. This helps the model approach the actual data distribution.

The Typhon also employs the Multi-Task Grouping and Overlap Learning approach, combined with the optimal use of the OS-ELM and RBM methodologies. Sliding windows (SLIW) [28] are used to partition the data stream. Using the SLIW technique, the system estimates a table of indices after accepting the data flow vectors as input. The data are divided into partitions of 1,000 samples, with 400 of them overlapping between adjoining windows. This enables continuous and uninterrupted scanning of the data, which results in faster and more accurate control, because a small SLIW is much more likely to be uniform than a larger one, and is therefore more predictable. Each new sample is checked by the OS-ELM algorithm, which forwards it to the appropriate RBM. Once the optimal model has been created, it is applied to the control window of the SLIW. This process is followed until the input data of each window are exhausted. The total knowledge of the entire first window is stored in MM1. MM1 is then transferred to window 2 through MTL, which means that the total knowledge of the first window is transferred to the second window. The process continues for all the data in window 2, and so on for all other windows, until the full analysis of the data flow is done. The full representation of the proposed process is presented in Fig. 2 below.

Fig. 2. The proposed Multi-Task Grouping and Overlap Learning approach
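A minimal sketch of the SLIW partitioning (1,000-sample windows, 400 samples shared between adjoining windows). The function name and the decision to drop a trailing partial window are illustrative choices, not specified in the paper:

```python
def sliding_windows(stream, size=1000, overlap=400):
    """Partition a data stream into windows of `size` samples,
    with `overlap` samples shared between adjoining windows."""
    step = size - overlap            # 600 new samples per window
    windows = []
    for start in range(0, len(stream), step):
        win = stream[start:start + size]
        if len(win) < size:          # drop the trailing partial window
            break
        windows.append(win)
    return windows
```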

When the OS-ELM classification detects an attack type, the network flow data are directed to the corresponding RBM. In a false-positive case, the data are directed to the next RBM, and so on. If every output is a false alarm, the traffic is redirected to the initial OS-ELM, which has the online learning potential, and it re-examines the data flow from the beginning as if it were new. This architecture has been chosen due to the fact that in multi-parametric, high-complexity problems related to big data (like the one examined herein), the classification results are unstable, especially regarding the analysis and incorporation of the data flows. The introduced architecture is a serious hybrid method for the combined resolution of APT attacks. The system is flexible and effective, and it not only offers a robust


approach but also a faster convergence of the overall model. Finally, the proposed architecture can exploit multi-task learning in order to enhance generalization. Each one of the developed RBMs is totally dedicated to its specific problem, which can result in bias and variance reduction. This can eliminate overfitting and offers a robust framework capable of coping with high-complexity problems.

4 Datasets

Appropriate datasets that closely simulate Industrial Control System (ICS) communication and transaction data were chosen and used in the development and evaluation of the proposed model. They contain preprocessed network transaction data, from which lower-layer transmission data (e.g. TCP, MAC) have been stripped [29]. Specifically, the Gas_Dataset that was chosen for the purpose of this research includes 26 independent parameters and 210,770 instances, of which 61,156 are normal and 149,614 abnormal, spanning 7 categories of attacks: Naïve Malicious Response Injection (NMRI) 16,578, Complex Malicious Response Injection (CMRI) 15,466, Malicious State Command Injection (MSCI) 28,152, Malicious Parameter Command Injection (MPCI) 30,548, Malicious Function Code Injection (MFCI) 20,628, Denial of Service (DoS) 11,022 and Reconnaissance (Recon) 27,220. The following Fig. 3 offers a graphical representation of the abnormal class distribution in the final Gas_Dataset.

Fig. 3. Abnormal class distribution

The dataset is normalized in the interval [−1, 1] in order to face the problem of the prevalence of features with a wider range over those with a narrower range, even though they are not more important. Also, the outliers and extreme values that were spotted were removed based on the Inter-Quartile Range technique [30]. The reader can find details regarding the dataset and the data collection and assessment methodology in [29].
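The two preprocessing steps, IQR-based outlier removal followed by min-max normalization into [−1, 1], can be sketched as follows. The 1.5 fence factor is the textbook default, not stated in the paper:

```python
import numpy as np

def iqr_filter(X, k=1.5):
    """Drop rows containing outliers per the interquartile-range rule."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    mask = np.all((X >= q1 - k * iqr) & (X <= q3 + k * iqr), axis=1)
    return X[mask]

def scale_minus1_1(X):
    """Min-max normalize every feature into [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return 2.0 * (X - lo) / span - 1.0
```

Filtering before scaling matters: a single extreme value would otherwise compress the useful range of its feature.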


4.1 Training the RBMs

The RBMs were trained based on an innovative application of the One-Class Classification methodology. More specifically, according to this approach the system is exclusively trained with data related to one vulnerability of the network, in order to be able to identify the specific behavior (anomaly) that has a potential relation with APT attacks. This is achieved by estimating the RBM's Reconstruction Error (RER), which is a basic criterion for the safe determination of the classes. It is evaluated against a threshold which is unique for each class and which emerged after several trial-and-error attempts, aiming at the optimal output of the system. If the error is higher than this predefined limit (which is characteristic for each class), the sample is rejected as belonging to an unknown class and the data are redirected to the next RBM.
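A possible way to set and apply the per-class RER limit; using a high percentile of the training reconstruction errors is our illustrative stand-in for the paper's trial-and-error tuning:

```python
import numpy as np

def calibrate_threshold(train_errors, q=95):
    """Pick the class-specific RER limit as a high percentile of the
    reconstruction errors the dedicated RBM yields on its own class."""
    return float(np.percentile(train_errors, q))

def unary_decision(rer, threshold):
    """Accept the sample as this class iff its RER is within the limit;
    otherwise it is rejected and redirected to the next RBM."""
    return rer <= threshold
```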

5 Results

When a machine learning classifier is used, estimating the real error during training requires knowledge of the full probability density of both categories. The classification performance is estimated by the development of a Confusion Matrix (CM), where the main diagonal values (top left corner to bottom right) correspond to correct classifications and the rest of the numbers correspond to the comparatively few cases that were misclassified. The following Table 1 presents the CM results of the OS-ELM (rows: actual class, columns: predicted class):

Table 1. Confusion Matrix of the OS-ELM

         Normal   NMRI     CMRI     MSCI     MPCI     MFCI     DoS      Recon
Normal   59,826   428      93       289      453      2        65       0
NMRI     632      15,944   0        2        0        0        0        0
CMRI     40       0        15,426   0        0        0        0        0
MSCI     264      0        0        27,888   0        0        0        0
MPCI     503      0        0        0        29,900   125      20       0
MFCI     2        0        0        0        157      20,469   0        0
DoS      139      0        0        1        24       0        10,858   0
Recon    0        0        0        0        0        0        0        27,220

All of the performance metrics in this test were estimated based on the One-Versus-All approach, because this is a multi-class classification case. The numbers of misclassifications are related to the False Positive (FP) and False Negative (FN) indices appearing in the confusion matrix. The True Positive Rate (TPR), also known as Sensitivity, the True Negative Rate (TNR), also known as Specificity, and the False Positive Rate (FPR) are defined by Eqs. 22, 23 and 24 respectively [31]:

$$TPR = \frac{TP}{TP + FN} \qquad (22)$$

$$TNR = \frac{TN}{TN + FP} \qquad (23)$$

$$FPR = \frac{FP}{FP + TN} = 1 - TNR \qquad (24)$$

The Precision (PRE), the Recall (REC), the F-Score and the Total Accuracy (TA) indices are defined as in Eqs. 25, 26, 27 and 28 respectively [31]:

$$PRE = \frac{TP}{TP + FP} \qquad (25)$$

$$REC = \frac{TP}{TP + FN} \qquad (26)$$

$$F\text{-}Score = 2 \cdot \frac{PRE \cdot REC}{PRE + REC} \qquad (27)$$

$$TA = \frac{TP + TN}{N} \qquad (28)$$
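Given a confusion matrix such as Table 1, the one-versus-all indices of Eqs. 22-28 can be computed directly. This is a generic sketch; rows are assumed to hold the actual classes:

```python
import numpy as np

def one_vs_all_metrics(cm, k):
    """TPR/TNR/FPR/PRE/F-score for class k (Eqs. 22-27) from a square
    confusion matrix whose rows are the actual classes."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[k, k]
    fn = cm[k].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    tpr = tp / (tp + fn)              # Eq. 22, identical to REC (Eq. 26)
    tnr = tn / (tn + fp)              # Eq. 23
    fpr = fp / (fp + tn)              # Eq. 24, equals 1 - TNR
    pre = tp / (tp + fp)              # Eq. 25
    f1 = 2 * pre * tpr / (pre + tpr)  # Eq. 27
    return tpr, tnr, fpr, pre, f1

def total_accuracy(cm):
    """Eq. 28: correctly classified samples over all samples."""
    cm = np.asarray(cm, dtype=float)
    return float(np.trace(cm) / cm.sum())
```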

The following Table 2 presents the analytical results of the proposed method.

Table 2. Classification Accuracy and Performance Metrics (classifier: OS-ELM)

Fold   TA       RMSE     Precision   Recall   F-Score   AUC
1st    98.51%   0.0548   0.980       0.980    0.9800    0.998
2nd    98.63%   0.0541   0.990       0.990    0.9900    0.999
3rd    97.96%   0.0482   0.976       0.976    0.9760    0.989
4th    98.63%   0.0543   0.990       0.990    0.9900    0.996
5th    98.98%   0.0578   0.989       0.989    0.9890    0.997
6th    98.00%   0.0490   0.981       0.981    0.9810    0.995
7th    98.60%   0.0549   0.986       0.986    0.9860    0.999
8th    98.75%   0.0560   0.987       0.987    0.9870    0.999
9th    98.28%   0.0567   0.986       0.986    0.9860    0.999
10th   98.30%   0.0536   0.985       0.985    0.9850    0.999
Avg    98.46%   0.0539   0.985       0.985    0.9850    0.997

The 10-fold cross validation (10_FCV) is employed to obtain performance indices. Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it.


6 Discussion and Conclusions

This research introduces a highly innovative, reliable and effective anomaly detection system, which is based on advanced computational intelligence approaches [32–34]. The Cyber-Typhon performs a series of sophisticated anomaly detection functions, using Multi-Task Learning and effectively combining On-Line Sequential Extreme Learning Machines with Restricted Boltzmann Machines. It ensures, in an effective and intelligent way, safe network communication among the interacting devices in a critical infrastructure environment.

The proposed system significantly enhances the security mechanisms of Critical Infrastructures, which are a constant target due to their high importance. This architecture also offers the potential of developing a safe platform that can control and integrate network transactions, without the need for human intelligence or a central authority. It has also been shown that collective intelligence technologies offer a smart solution for the determination of digital security and for the monitoring of assets related to critical infrastructure.

The development of the model was based on the extremely effective OS-ELM algorithm and on the multi-task learning method, which can transfer knowledge between related processes that are executed in parallel. The training process was based on RBMs trained with the Unary Classification method, using specific datasets, each corresponding to the behavior of a certain attack, in order to ensure the absolute reliability of the classifier. The performance of the proposed system was tested on a multidimensional dataset of high complexity, which resulted from extensive research into the operation of critical infrastructure control systems and from comparisons, audits and tests. The target was the identification of the most appropriate thresholds, which realistically represent the corresponding classes. The high accuracy of the results has greatly enhanced the validity of the general methodology.

It is important to mention that this particular model is presented for the first time in the literature. It constitutes an important guideline for the further exploitation of intelligent technologies in the automations that compose industrial networks. Future improvements of this system should focus on further optimizing the parameters of the RBMs used, in order to achieve an even more efficient, accurate and quicker classification, capable of dividing even more precisely the boundaries between the states of the systems. It would also be important to study the extension of the proposed algorithm with meta-learning methods, which could further improve the anomaly detection process. Finally, the introduced model could employ adaptive learning in order to gain self-improvement potential, which would fully automate the whole process.


References

1. Dedić, N., Stanier, C.: Towards differentiating business intelligence, big data, data analytics and knowledge discovery. In: Piazolo, F., Geist, V., Brehm, L., Schmidt, R. (eds.) ERP Future 2016. LNBIP, vol. 285, pp. 114–122. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58801-8_10
2. Kiran, M., Murphy, P., Monga, I., Dugan, J., Baveja, S.S.: Lambda architecture for cost-effective batch and speed big data processing. In: IEEE International Conference on Big Data (Big Data), Santa Clara, CA, pp. 2785–2792 (2015). https://doi.org/10.1109/bigdata.2015.7364082
3. Lin, J.: The Lambda and the Kappa. IEEE Internet Comput. 21(5), 60–66 (2017). https://doi.org/10.1109/MIC.2017.3481351
4. Demertzis, K., Iliadis, L.: A hybrid network anomaly and intrusion detection approach based on evolving spiking neural network classification. In: Sideridis, A.B., Kardasiadou, Z., Yialouris, C.P., Zorkadis, V. (eds.) e-Democracy 2013. CCIS, vol. 441, pp. 11–23. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11710-2_2
5. Krawczyk, B., Cano, A.: Online ensemble learning with abstaining classifiers for drifting and noisy data streams. Appl. Soft Comput. J. 68, 677–692 (2018)
6. Baytas, I.M., Yan, M., Jain, A.K., Zhou, J.: Asynchronous multi-task learning. In: ICDM, pp. 11–20 (2016)
7. Zhang, Y.: Parallel multi-task learning. In: ICDM, pp. 629–638 (2015)
8. Chen, Q., Abdelwahed, S.: A model-based approach to self-protection in computing system. In: Proceedings of the ACM Cloud and Autonomic Computing Conference (CAC 2013), Article no. 16 (2013)
9. Soupionis, Y., Ntalampiras, S., Giannopoulos, G.: Faults and cyber attacks detection in critical infrastructures. In: Panayiotou, C.G.G., Ellinas, G., Kyriakides, E., Polycarpou, M.M.M. (eds.) CRITIS 2014. LNCS, vol. 8985, pp. 283–289. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31664-2_29
10. Zhu, W.T., et al.: Detecting node replication attacks in wireless sensor networks: a survey. J. Netw. Comput. Appl. 35(3), 1022–1034 (2012)
11. Cruz, T., et al.: Improving cyber-security awareness on industrial control systems: the CockpitCI approach. J. Inf. Warfare 13(4) (2015). ISSN 1445-3347 (online), ISSN 1445-3312 (printed)
12. Zhang, Y., Yeung, D.: A convex formulation for learning task relationships in multi-task learning. In: UAI, pp. 733–742 (2010)
13. Wang, J., Kolar, M., Srebro, N.: Distributed multi-task learning. In: AISTATS, pp. 751–760 (2016)
14. Xing, E.P., Ho, Q., Xie, P., Wei, D.: Strategies and principles of distributed machine learning on big data. Engineering 2(2), 179–195 (2016)
15. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Linear algorithms for online multitask classification. In: COLT 2008, Helsinki, Finland, June 2008
16. Demertzis, K., Iliadis, L., Anezakis, V.: MOLESTRA: a multi-task learning approach for real-time big data analytics. In: 2018 Innovations in Intelligent Systems and Applications (INISTA), Thessaloniki, pp. 1–8 (2018). https://doi.org/10.1109/inista.2018.8466306
17. Demertzis, K., Iliadis, L., Spartalis, S.: A spiking one-class anomaly detection framework for cyber-security on industrial control systems. In: Boracchi, G., Iliadis, L., Jayne, C., Likas, A. (eds.) EANN 2017. CCIS, vol. 744, pp. 122–134. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65172-9_11
18. Demertzis, K., Iliadis, L.S., Anezakis, V.-D.: An innovative soft computing system for smart energy grids cybersecurity. Adv. Build. Energ. Res. 12(1), 3–24 (2018). https://doi.org/10.1080/17512549.2017.1325401
19. Demertzis, K., Iliadis, L.: A computational intelligence system identifying cyber-attacks on smart energy grids. In: Daras, N.J., Rassias, T.M. (eds.) Modern Discrete Mathematics and Analysis. SOIA, vol. 131, pp. 97–116. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-74325-7_5
20. Demertzis, K., Kikiras, P., Tziritas, N., Sanchez, S.L., Iliadis, L.: The next generation cognitive security operations center: network flow forensics using cybersecurity intelligence. Big Data Cogn. Comput. 2, 35 (2018)
21. Demertzis, K., Tziritas, N., Kikiras, P., Sanchez, S.L., Iliadis, L.: The next generation cognitive security operations center: adaptive analytic lambda architecture for efficient defense against adversarial attacks. Big Data Cogn. Comput. 3, 6 (2019)
22. Cyber-Security and Information Warfare. Cybercrime and Cybersecurity Research, Chap. 5. NOVA Science Publishers. ISBN 978-1-53614-385-0
23. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
24. El-Yaniv, R., Nisenson, M.: Optimal single-class classification strategies. In: Proceedings of the 2006 NIPS Conference, vol. 19, pp. 377–384. MIT Press (2007)
25. Munroe, D.T., Madden, M.G.: Multi-class and single-class classification approaches to vehicle model recognition from images. In: Proceedings of Artificial Intelligence and Cognitive Science, Portstewart (2005)
26. Zhang, N., Ding, S., Zhang, J., Xue, Y.: An overview on restricted Boltzmann machines. Neurocomputing 275, 1186–1199 (2018). https://doi.org/10.1016/j.neucom.2017.09.065
27. Ma, X., Wang, X.: Convergence analysis of contrastive divergence algorithm based on gradient method with errors (2015)
28. Dietterich, T.G.: Machine learning for sequential data: a review. In: Caelli, T., Amin, A., Duin, R.P.W., de Ridder, D., Kamel, M. (eds.) SSPR/SPR 2002. LNCS, vol. 2396, pp. 15–30. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-70659-3_2
29. Morris, T.H., Thornton, Z., Turnipseed, I.: Industrial control system simulation and data logging for intrusion detection system research. Int. J. Netw. Secur. (IJNS) 17(2), 174–188 (2015)
30. Zwillinger, D., Kokoska, S.: CRC Standard Probability and Statistics Tables and Formulae, p. 18. CRC Press, Boca Raton (2000). ISBN 1-58488-059-7
31. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)
32. Demertzis, K., Iliadis, L.: A bio-inspired hybrid artificial intelligence framework for cyber security. In: Daras, N.J., Rassias, M.Th. (eds.) Computation, Cryptography, and Network Security, pp. 161–193. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18275-9_7
33. Demertzis, K., Iliadis, L.: SAME: an intelligent anti-malware extension for Android ART virtual machine. In: Núñez, M., Nguyen, N.T., Camacho, D., Trawiński, B. (eds.) ICCCI 2015. LNCS (LNAI), vol. 9330, pp. 235–245. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24306-1_23
34. Demertzis, K., Iliadis, L.: Evolving smart URL filter in a zone-based policy firewall for detecting algorithmically generated malicious domains. In: Gammerman, A., Vovk, V., Papadopoulos, H. (eds.) SLDS 2015. LNCS (LNAI), vol. 9047, pp. 223–233. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17091-6_17

Investigating the Benefits of Exploiting Incremental Learners Under Active Learning Scheme

Stamatis Karlos1, Vasileios G. Kanas2, Nikos Fazakis2, Christos Aridas1, and Sotiris Kotsiantis1

1 Department of Mathematics, University of Patras, Rio Campus, 26504 Patras, Greece
{stkarlos,char}@upatras.gr, [email protected]
2 Department of Electrical and Computer Engineering, University of Patras, Rio Campus, 26504 Patras, Greece
[email protected], [email protected]

Abstract. This paper examines the efficacy of incrementally updateable learners under the Active Learning concept, a well-known iterative semi-supervised scheme where the initially collected instances, usually a few, are augmented by the combined actions of both the chosen base learner and the human factor. Instead of exploiting conventional batch-mode learners and refining them at the end of each iteration, we introduce the use of incremental ones, so as to apply favorable query strategies and detect the most informative instances before they are provided to the human factor for annotation. Our assumption about the benefits of this kind of combination in a suitable framework is verified by the achieved classification accuracy against the baseline strategy of Random Sampling and the corresponding learning behavior of the batch-mode approaches over numerous benchmark datasets, under the pool-based scenario. The measured execution time also reveals a faster response of the proposed framework, since each classification model constructed in the core of the Active Learning concept is built partially, updating the existing information without ignoring the already processed data. Finally, all the conducted comparisons are presented along with the appropriate statistical testing processes, so as to verify our claim.

Keywords: Incremental learners · Active Learning scheme · Stochastic Gradient Descent · Query strategy · Unlabeled data

© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 37–49, 2019.
https://doi.org/10.1007/978-3-030-19823-7_3

1 Introduction

Today, more and more applications from various scientific domains produce large volumes of data, changing the needs of current predictive mechanisms that mainly stem from the Machine Learning (ML) field. Since time and memory constitute the two main factors that highly define the performance of intelligent algorithms, especially when they tackle problems in the era of Big Data, data scientists and ML/data engineers have to prioritize the structure of new predictive tools according to these specifications [1]. Incremental learning is the answer of the ML community to such issues: the principal idea is to update an existing or previously built learning model by exploiting the newly available data, reducing the total time demands while possibly producing less accurate models [2].

Besides the simple approach, according to which vast amounts of labeled data (L) are provided or arrive as data streams, a more realistic scenario has to cope with a shortage of L, in contrast with high enough volumes of unlabeled data (U). One representative reason why this may happen is that in several real-world applications (e.g. medical tasks or long-term experiments) the final state of the target variable may demand large time periods to be verified or to converge. Another reason is the inherent complexity of the data. In the case of text mining and natural language processing, numerous articles, chapters and posts on social media are freely available through the web. However, because of the complex structure that may characterize these sources – such as the complicated or unexplored semantic meanings of languages other than English – automated solutions do not always produce decent learning results, nor could manual scanning by humans be time efficient [3].

In order to handle this phenomenon, a new class of algorithms has arisen, often called Semi-Supervised Learning (SSL) or, more generically, Partially-Supervised Learning (PSL), where the former category is contained in the latter [4]. The ambition of PSL algorithms is to exploit the existing labeled instances (li) along with the collected unlabeled examples (ui) and construct a model that maps the unknown instances to the target variable better than the corresponding model based exclusively on the L subset.
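The incremental refinement idea sketched above can be illustrated with scikit-learn's SGDClassifier and its partial_fit method, which updates an already built model with newly arriving chunks instead of retraining from scratch. The synthetic dataset and the chunk size below are illustrative assumptions, not part of the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Illustrative binary dataset; any stream of (X, y) chunks works the same way.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
classes = np.unique(y)

# 'modified_huber' is one of the probabilistic losses exploited later in the paper.
clf = SGDClassifier(loss="modified_huber", penalty="l2", random_state=0)

# Update the existing model chunk by chunk instead of refitting on all data.
for start in range(0, len(X), 100):
    chunk = slice(start, start + 100)
    clf.partial_fit(X[chunk], y[chunk], classes=classes)

print(round(clf.score(X, y), 3))
```

Each call to partial_fit performs a few stochastic gradient steps on the new chunk only, which is exactly the property exploited later inside the Active Learning loop.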
One main division among PSL algorithms depends on the way in which the corresponding ui get labeled before they are merged into the L subset, so as to contribute to the increase of the predictive performance of the whole algorithm. While in SSL algorithms this process is usually automated by a base learner, Active Learning (AL) algorithms differ in that a human oracle is inserted into the learning process and is responsible for annotating, with accurate enough decisions, the ui selected by a criterion [5]. Although in several domains only human experts could be exploited, it has been studied and generally verified that the decisions of numerous simple users tend to converge to those produced by human specialists in domains like sound/music signal categorization [6]. This means that a large range of applications could be satisfied through AL approaches without consuming much expert effort, a cost factor that might otherwise render this kind of solution unaffordable.

Our ambition in this work is to investigate the benefits of exploiting incremental learners (IncL) under a simple AL scheme that follows the pool-based scenario. Thus, IncL would be responsible for detecting the most suitable ui, as well as for exporting the final decisions. The main rivals here are the AL method whose query strategy coincides with a random selection of ui, the supervised scenario, where the same base learner has been trained on the full dataset incrementally, and the same approaches trained under batch-mode operation. The amount of initial L plays a crucial role, producing per each different value a new variant of each algorithm exported from the proposed framework. More comments are presented in the corresponding paragraph.

To sum up, in Sect. 2, a number of related works are presented briefly, regarding mainly recent publications on the incremental learning task and secondly AL.


Section 3 contains a description of the selected optimizer that injects the desired incremental-update property into the underlying linear classifiers, along with the necessary information about the proposed AL framework, while Sect. 4 includes more technical information, as well as information about the examined datasets and the conducted experiments. Finally, this work closes with the concluding Section, which discusses the posed ambitions and highlights future work.

2 Related Work

2.1 Incremental Learning

Incremental learning refers to online learning strategies applicable to real-life streaming scenarios [7] with limited memory resources. So far, IncL is widely used, ranging from Big Data and Internet of Things technology [7] to outlier detection for surveillance systems. One of the most popular areas in IncL is image/video data processing. Typical case scenarios are object detection [8] and recognition [9], image segmentation [10] and classification [11], surveillance [12], visual tracking [13] and prediction [14]. Moreover, the inherent nature of data in robotics makes online learning an appropriate approach for mining streaming signals [15]. In another study [16], an incremental image semantics learning framework is proposed, which aims to learn image semantics from scratch (without a priori knowledge) and enrich the knowledge incrementally through human-robot interactions based on a teaching-and-learning procedure. Khan et al. [17] present a mechanism to build a consistent topological map for self-localization robotics applications; the proposed appearance-based loop closure detection mechanism builds a binary vocabulary consisting of visual words in an online, incremental manner by tracking features between consecutive video frames to incorporate pose invariance. In another relevant study [18], the authors propose a new method to incrementally learn end-effector and null-space motions via kinesthetic teaching, allowing the robot to execute complex tasks and to adapt its behavior to dynamic environments; they combine IncL with a customized multi-priority kinematic controller to guarantee a smooth human-robot interaction. Another rapidly emerging domain which utilizes this concept is automotive robotics [19].

2.2 Active Learning

However, all of the above applications require lots of labelled data, and data labelling is usually difficult, time-consuming, and/or expensive. Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of ui to be labeled by an oracle, aiming to significantly reduce the cardinality of the L subset that is needed. Active learning is still being heavily researched, under either more theoretical or more experimental approaches. In recent years, several attempts have been made to combine AL with the Deep Learning (DL) concept, especially targeting specific applications that demand much computational power. Hence, ML researchers have begun exploring the benefits of using CNNs and LSTMs and how to improve their efficiency when they are applied along with AL frameworks [20, 21]. There is also research on implementing Generative Adversarial Networks (GANs) in this kind of task [22]. With the increasing interest in deep reinforcement learning, researchers are trying to reframe AL as such a problem [23]. Also, there are papers which try to learn AL strategies via a meta-learning setting [24]. This does not mean that products of ML or simpler probabilistic base learners have been ignored by the corresponding community. On the contrary, a recent demonstration examines the chance of achieving a fast and non-myopic AL strategy in the context of binary classification datasets [25].

3 Proposed Framework

In order to conduct our investigation, a series of properties have to be defined, formulating the corresponding framework under which our experiments will be executed. To be more specific, the base learner used in the core of the proposed framework is based on regularized linear models manipulated by Stochastic Gradient Descent (SGD) learning [26]. A more in-depth analysis follows later in this Section. The same learner is used both in the selected Query strategy of the AL framework and in the stage of building the final classifier, after having augmented the L subset during each of the k executed iterations. As concerns the Query strategy, the Uncertainty Sampling (UncS) approach has been selected in the context of this work, favoring the integration of the AL framework with probabilistic classifiers and also boosting the time response of each produced approach [27, 28]. Depending on the metric applied within the UncS approach, a number of variants can be produced. Finally, the human factor is replaced by an ideal oracle (Horacle) that always exports the correct decision about the label of each queried instance, playing the role of annotator. The last generic parameter that has to be set is the Labeled Ratio value – usually depicted as R – measured in percentage values. This factor defines the amount of the initial li in comparison with the total amount of both li and ui. Its formula is:

R(%) = cardinality(L) / (cardinality(L) + cardinality(U))    (1)

It is evident that, by acting under small R values, only a small part of the totally available information is provided initially to the AL framework. Hence, the predictions of the base learner are based on poor L subsets that may not reveal useful insights into the specific problem being tackled, hampering the achievement of accurate classification behavior. Thus, the quadruple that defines each product of the proposed framework consists of the base learner, the specific metric of UncS, the number of iterations and the Labeled Ratio value; its notation hereinafter will be (base-cl, UncSmetric, k, R). The learning behavior obtained by such an algorithm, according to our assumptions, would depict the ability of the selected base learner to operate efficiently under a fast and confident Query strategy, choosing among a pool of ui over which it will be trained incrementally for k iterations, before exporting a final classifier, based initially on an amount of labeled instances defined by the R parameter.


Gradient descent (GD) is by far the most popular optimization strategy used in ML and DL at the moment. It is an optimization algorithm, based on a convex function, that tweaks its parameters iteratively to minimize a given cost function towards its local minimum. In a simple supervised learning setup, each training example is composed of an arbitrary input x and a scalar output y in the form (x, y). For our ML model, we choose a family G of functions such that y ≈ g_w(x) + b, with w being a weight vector and b an intercept term, which is necessary for obtaining a better fit. Consequently, our goal is to minimize a cost function Ψ(ŷ, y) = Ψ(g_w(x), y) that measures the cost of predicting ŷ given the actual outcome y (or y_actual), averaged over the n training examples. In other words, we seek a solution to the following problem:

E_n(f_w, w) = (1/n) Σ_{j=1}^{n} l(f_w(x_j), y_j) + α Reg(w)    (2)

The cost function (E_n) of Eq. 2 depends mainly on the loss function (l) and the regularization term Reg(w). The multiplicative constant α is a non-negative hyperparameter. Following the original GD process, the minimization of Eq. 2 takes place by iterating the following update at each step t:

w_{t+1} = w_t − η ( (1/n) Σ_{j=1}^{n} ∇_w l(f_w(x_j), y_j) + α ∇_w Reg(w) )    (3)

where the positive scalar η is called the learning rate or step size. In order for the algorithm to achieve linear convergence, sufficient regularity assumptions should be made, while the initial estimate w_0 should be close enough to the optimum and the gain η sufficiently small. It is important to highlight that the evaluation of n derivatives is required at each step, so the per-iteration computational cost scales linearly with the training set size n, making the algorithm inapplicable to huge datasets. Thus, the stochastic version of the algorithm (SGD) is used instead, which offers a lighter-weight solution. More specifically, at each iteration, SGD randomly picks one example and calculates the gradient for this specific example:

w_{t+1} = w_t − η_t ( ∇_w l(f_w(x_t), y_t) + α ∇_w Reg(w) )    (4)

In other words, SGD approximates the actual gradient using only one data point, saving a lot of time compared to summing over all data. SGD often converges much faster than GD, although the error function is not as well minimized as in the case of GD. However, in most cases the close approximations calculated by SGD for the parameter values are enough, because they reach the vicinity of the optimal values and keep oscillating there. Another advantage of SGD is its ability to process incoming data online in a deployed system, since no memory of the previously chosen examples is necessary. In such a situation, SGD directly optimizes the expected risk, since the examples are randomly drawn from the ground-truth distribution [29]. As concerns the Reg(w) term, three different choices are generally used in the literature: the l1 norm, which favors sparse solutions; the l2 norm, which is the most usually met; and the elastic net (elnet), which is formed by a convex combination of the previous two norms
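As a minimal numerical sketch of the update in Eq. 4, the loop below assumes a squared loss and an l2 regularizer (both illustrative choices) and recovers the weights of a noise-free linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: y = <w_true, x>, with squared loss l = 0.5 * (w.x - y)^2.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true

w = np.zeros(2)
eta, alpha = 0.05, 1e-4  # learning rate and regularization strength

for t in range(1000):
    i = rng.integers(len(X))               # pick one random example (Eq. 4)
    grad_loss = (X[i] @ w - y[i]) * X[i]   # gradient of the loss at that example
    grad_reg = 2 * w                       # gradient of the l2 term ||w||^2
    w -= eta * (grad_loss + alpha * grad_reg)

print(np.round(w, 2))
```

After a thousand single-example steps the estimate oscillates close to the true weights, illustrating the behavior discussed above.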


and offers sparsity with better stabilization than the simple l1 norm [30]. Before introducing the proposed framework through suitable pseudocode, we have to define the amount of ui examples that should be mined per iteration. Although many approaches prefer to mine only one example at a time, probably leading to a more accurate actively trained classifier but clearly demanding much more computational resources because of the large number of iterations that should be executed under a specific budget plan (B), a heuristic method is applied here: the quantity of mined instances per iteration is computed by dividing the initial size of the L subset by the number of executed iterations k. Thus, after k steps, the finally augmented training set will contain double the number of instances. Additionally, since each ui is defined by a pair (x, y), where the scalar y value is not known, the assumed human oracle is defined as a function Horacle: R^f → y_actual, where the f parameter denotes the dimensionality of each dataset. The corresponding pseudocode follows here (Fig. 1):

Incremental Active Learning Framework based on SGD(loss function, reg)

Input:    Initially collected Labeled (L0) / Unlabeled (U0) subsets for pool-based scenario
          Quadruple (base-cl, UncSmetric, k, R), where loss function ≡ base-cl
          Annotator (Horacle)
          Budget (B)
          Regularization term (reg)

Process:  Compute UncInst = round(cardinality(L0) / k)
          Set iter = 0
          While B > 0 do
              Train/Update incrementally base-cl on Liter
              Assign through UncSmetric a confidence value to each ui in Uiter
              Remove from Uiter the top-UncInst instances
              Provide them to Horacle and assign its decisions to their class value
              B := B – UncInst
              iter := iter + 1

Output:   Actively trained classifier (ALSGD(base-cl, reg)) built on Lk

Testing:  Measure Output's learner performance over test set for any specified classification metric

Fig. 1. Pseudocode of the proposed Incremental Active Learning framework
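A compact Python rendering of the loop in Fig. 1 can be sketched as follows, under stated assumptions: scikit-learn's SGDClassifier plays the role of base-cl, Least Confident is used as UncSmetric, the data are synthetic, and h_oracle is an illustrative stand-in for the ideal annotator. This is a sketch, not the libact-based implementation used in the experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
labeled = np.arange(50)                    # L0: initially labeled instances
unlabeled = np.arange(50, 1000)            # U0: pool of unlabeled candidates

def h_oracle(indices):                     # ideal oracle: always returns correct labels
    return y[indices]

k = 15
unc_inst = round(len(labeled) / k)         # instances mined per iteration
budget = unc_inst * k                      # budget B consumed over the k iterations

clf = SGDClassifier(loss="modified_huber", penalty="elasticnet", random_state=1)
clf.partial_fit(X[labeled], y[labeled], classes=np.unique(y))

while budget > 0:
    # Least Confident: query the instances whose top class probability is lowest.
    proba = clf.predict_proba(X[unlabeled])
    queried = unlabeled[np.argsort(proba.max(axis=1))[:unc_inst]]
    # The oracle annotates them; move them from U to L and update incrementally.
    labeled = np.concatenate([labeled, queried])
    unlabeled = np.setdiff1d(unlabeled, queried)
    clf.partial_fit(X[queried], h_oracle(queried))
    budget -= unc_inst
```

The final clf corresponds to the actively trained classifier ALSGD(base-cl, reg) of Fig. 1; note that each iteration only touches the newly annotated instances, which is where the time savings over batch-mode retraining come from.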

4 Experimental Procedure and Results

In order to verify the efficacy of the proposed AL framework, 19 binary datasets have been selected from the UCI repository. Their details are described in Table 1, along with the corresponding cardinality of the initial training set (L0) for all the selected R-based scenarios: 5%, 15% and 25%. Moving further, all the conducted experiments are implemented using the libact library [31], which supports AL pool-based approaches via well-known Python libraries [32]. Thus, 3 different metrics have been inserted into the Query


Strategy of the proposed framework: Smallest Margin (sm), Least Confident (lc) and Entropy (ent), apart from the Random Sampling (random) variant, which constitutes the baseline strategy of the AL concept. Moreover, each Supervised approach is included in our comparisons, so as to verify both the relative improvement and the corresponding importance of the implemented algorithms per R-based case and operation mode.

Table 1. Informative quantities of the examined datasets.

Dataset         Instances   Features   L's cardinality for R = 5% – 15% – 25%
bands           365         20         16 – 49 – 82
breast-cancer   286         49         13 – 39 – 64
bupa            345         7          16 – 47 – 78
chess           3196        39         144 – 431 – 719
colic           368         472        17 – 50 – 83
credit-a        690         44         31 – 93 – 155
credit-g        1000        62         45 – 135 – 225
heart-statlog   270         14         12 – 36 – 61
heart           270         14         12 – 36 – 61
housevotes      232         17         10 – 31 – 52
kr-vs-kp        3196        41         144 – 431 – 719
mammographic    830         6          37 – 112 – 187
monk-2          432         7          19 – 58 – 97
pima            768         9          35 – 104 – 173
saheart         462         10         21 – 62 – 104
sick            3772        34         170 – 509 – 849
tic-tac-toe     958         28         43 – 129 – 216
vote            435         17         20 – 59 – 98
wdbc            569         31         26 – 77 – 128
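The three UncS metrics named above can be computed directly from a probabilistic classifier's class-probability estimates; the sketch below uses a made-up probability matrix for illustration:

```python
import numpy as np

def least_confident(proba):
    # 1 minus the probability of the most likely class; higher = more uncertain.
    return 1.0 - proba.max(axis=1)

def smallest_margin(proba):
    # Gap between the two most probable classes; smaller = more uncertain.
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(proba):
    # Shannon entropy of the class distribution; higher = more uncertain.
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

proba = np.array([[0.9, 0.1],    # a confident prediction
                  [0.5, 0.5]])   # a maximally uncertain one
print(least_confident(proba), smallest_margin(proba), entropy(proba))
```

Each query strategy simply ranks the pool of ui by one of these scores and forwards the top candidates to the oracle.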

Regarding the base learners to be combined with the SGD optimizer, 6 different approaches are presented here. This means that 2 different choices of the base-cl parameter were made, along with all 3 regularization terms referred to above. To be more specific, and following the notation of the scikit-learn library [32], the corresponding loss functions, using their default properties, are:

• base-cl = ‘log’, which implements the well-known Logistic Regression learner, whose output is filtered appropriately so as to be used in classification tasks [33],
• base-cl = ‘mhuber’ (or ‘modified huber’), which implements a smoothed hinge loss function equivalent to a quadratically smoothed Support Vector Machine (SVM) with gamma parameter equal to 2, offering robust behavior to outliers.

To begin with, a comparison of the time response of the exploited IncL against their corresponding batch-mode variants is presented. Thus, all 12 supervised algorithms, either incrementally updated or operating under batch mode, are measured regarding their execution time during a 10-fold cross validation (10-CV) procedure,


along with two approaches based on the well-known Naive Bayes (NB) algorithm, so as to compare the exploited learners with algorithms that are popular for their simplicity and also support incremental updates. Because of lack of space, only a sample of the produced results is presented here; a link to the full volume of our results is provided at the end of this Section. From the depicted results, a speed-up of at least 20% is observed for the incremental SGD-based learners, while their time performance is comparable with that of the Multinomial NB (MNB). A similar kind of improvement is also met in the proposed framework. A quad-core machine (Intel Core Q9300, 2.50 GHz, 8 GB RAM) was used.

Table 2. Execution time of incremental and batch-mode supervised classifiers for 10-CV.

Dataset           SGD(mhuber,l1)    SGD(mhuber,l2)    SGD(mhuber,elnet)   Multinomial NB
                  IncL    Batch     IncL    Batch     IncL     Batch      IncL    Batch
bands             0.184   0.169     0.274   0.519     0.279    0.241      0.332   0.176
breast-cancer     0.224   0.225     0.526   0.197     0.147    0.277      0.217   0.225
bupa              0.206   0.235     0.366   0.401     0.465    0.457      0.253   0.179
chess             1.414   1.448     0.847   0.788     0.676    3.862      1.685   3.773
colic             0.409   0.306     0.221   0.182     0.418    0.486      0.223   0.936
credit-a          0.277   0.387     0.249   0.714     0.288    0.644      0.276   0.385
credit-g          0.570   1.825     0.351   1.480     1.118    0.927      0.783   0.811
heart-statlog     0.156   0.689     0.590   0.281     0.354    0.250      0.169   0.485
heart             0.141   0.152     0.149   0.244     0.221    0.285      0.234   0.217
housevotes        0.150   0.147     0.210   0.429     0.135    0.164      0.292   0.234
kr-vs-kp          0.895   1.482     0.683   1.310     2.883    3.033      1.123   1.880
mammographic      0.250   0.170     0.336   0.490     0.142    0.213      0.167   0.183
monk-2            0.126   0.206     0.139   0.316     0.156    0.164      0.158   0.156
pima              0.229   0.324     0.178   0.291     0.172    0.963      0.318   0.546
saheart           0.161   0.253     0.214   0.286     0.227    0.263      0.516   0.382
sick              2.939   2.363     1.239   1.901     2.601    1.948      1.156   1.525
tic-tac-toe       0.477   0.474     0.383   0.406     0.761    0.705      0.399   0.572
vote              0.167   0.271     0.276   0.747     0.408    0.698      0.210   0.304
wdbc              0.299   0.733     0.382   0.253     0.785    0.401      0.401   0.441
Total time (sec)  9.148   11.859    7.613   11.235    12.236   15.981     8.912   13.410

As concerns the classification accuracy scored by the selected algorithms, the number of iterations has been fixed to 15. This value was selected via an empirical process; its tuning could provide better results, trading off the spent human effort and the available B. Table 3 presents only the accuracies averaged over the 19 selected datasets, so as to compare the achieved accuracy per actively trained classifier against the random strategy and the corresponding supervised variant that uses the whole dataset. The format of Table 3 enables the direct comparison of IncL and batch-based approaches. The accuracy of the best performing metric per R-based scenario and same base learner, independently of its operation mode, is highlighted in bold format. Only two R-scenarios have been included in Table 3.


It is evident that the incrementally based algorithms obtain a superior learning behavior against the conventional batch-mode operating approaches, since in all cases they outperformed the latter approaches. For providing a more detailed insight of the obtained results concerning the produced AL algorithms of the proposed framework, we notice that: in all the 90 1-vs-1 comparison between IncL and batch-based learner the former prevailed, sm metric was ranked as the best metric in 13 out of 18 cases, UncS strategy outperformed random sampling in 33 out of 54 cases, while the Supervised approaches were also outreached 22 times, regarding the incremental scenario. Keeping in mind that the proposed algorithms consume less computational resources, it seems that this kind of combination leads to more remarkable ML tools, regarding both the aspects of accuracy and time efficacy (Table 3). Table 3. Classification accuracy of Incremental and Batch-mode algorithms for 10-CV. R = 5% lc ent SGDinc(mhuber,l1) 71.36 72.98 SGDbatch(mhuber,l1) 66.90 67.41 SGDinc(mhuber,l2) 70.91 72.07 SGDbatch(mhuber,l2) 65.58 66.55 SGDinc(mhuber, 70.66 72.87 elnet) SGDbatch(mhuber, 66.81 66.88 elnet) SGDinc(log,l1) 72.47 74.02 SGDbatch(log,l1) 67.37 68.69 SGDinc(log,l2) 71.81 72.93 SGDbatch(log,l2) 67.45 66.15 SGDinc(log,elnet) 72.01 73.21 SGDbatch(log,elnet) 67.49 67.57

sm 73.95 68.87 73.24 68.23 74.12

Random 73.82 67.49 72.85 66.22 73.69

R = 25% lc ent 79.42 78.71 74.26 73.33 77.59 77.91 71.78 71.56 77.42 77.93

Super sm 79.32 72.36 78.76 73.24 78.28

Random 78.10 73.44 76.22 72.61 77.41

77.89 72.12 75.23 71.33 75.64

66.99 67.05

72.13 72.10 72.01 73.84

70.82

74.62 68.37 74.35 66.09 74.97 67.38

78.14 73.55 77.71 72.41 78.24 72.44

77.34 72.02 75.29 71.96 75.09 71.89

74.91 67.08 73.88 67.00 73.21 66.89

79.56 73.80 77.82 71.92 78.61 72.21

79.88 73.11 78.52 72.64 77.86 73.29

80.24 72.78 77.30 72.56 78.02 73.43

The statistical verification of the produced results is visualized through CD diagrams. According to this method, appropriate rankings are provided to a post-hoc test, in our case the Bonferroni-Dunn, computed by Friedman statistical test, and corresponding critical differences are computed for significance level equal to 0.05 [34]. Every algorithm that is connected via a horizontal line to another one, depicts that their learning behavior did not present significant difference Fig. 2. For obtaining a more explanatory view of these comparisons, a series of violin plots has been selected to highlight the differences of the IncL and batch-mode algorithms. Through this tactic, the distribution of the scored classification accuracies is visualized, along with the average and the quartile values. In Fig. 3, the corresponding algorithms that use ‘elnet’ regularization term are presented. The complete results are provided in https://github.com/terry07/ke80537.


Fig. 2. A CD diagram for mhuber-based learners and ‘elnet’ as regularization term for R = 25%.

Fig. 3. A violin plot of log-based learners and ‘elnet’ as regularization term for R = 5%.

5 Conclusions and Future Work

This work constitutes a first product of our research on involving IncL under SSL schemes, so as to compensate for the iterative character of the latter by exploiting the beneficial refinement assets of the former with regard to the base learner. Our results over a wide range of binary datasets demonstrate the remarkable classification accuracy achieved in the case of the AL concept, based on 3 different amounts of labeled examples and relying on an ideal human oracle for the annotation stage. The common factor over all these experiments was the use of the SGD method, which injects its incremental property into the linear learners that are applied. Three different regularization terms were also used, creating a series of SGD-based learners whose learning behavior outperformed the baseline Random Sampling strategy and the corresponding conventional batch-based methods in the majority of the examined cases, while their performance, mainly under the Smallest Margin metric of the UncS strategy, was close enough to that of their supervised rivals.

The next steps are oriented towards the examination of both multiclass datasets and binary datasets that come from more specific tasks, like intrusion detection, which suffers from distribution drift [35], or text classification [36]. Furthermore, a larger variety of AL query strategies could be applied, exploiting either more sophisticated ML


techniques [37] or margin-based queries that offer robustness to noisy input data [38]. Moreover, the scheme of ALBL (Active Learning By Learning) [39] could be a really promising solution, where a number of AL strategies are evaluated through a meta-learning stage. Finally, the combination of SSL and AL strategies seems powerful [40], heavily reducing human effort, since only a small number of iterations could be selected for asking feedback, while the incremental asset could be retained.

Acknowledgements. This research is implemented through the Operational Program Human Resources Development, Education and Lifelong Learning and is co-financed by the European Union (European Social Fund) and Greek national funds.

References

1. Domingos, P., Hulten, G.: Mining high-speed data streams. In: KDD, pp. 71–80 (2000)
2. Pratama, M., Anavatti, S.G., Lughofer, E.: An incremental classifier from data streams. In: Likas, A., Blekas, K., Kalles, D. (eds.) SETN 2014. LNCS (LNAI), vol. 8445, pp. 15–28. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07064-3_2
3. Mahmoud, M.: Semi-supervised keyword spotting in Arabic speech using self-training ensembles (2015)
4. Schwenker, F., Trentin, E.: Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recogn. Lett. 37, 4–14 (2014)
5. Aggarwal, C.C., Kong, X., Gu, Q., Han, J., Yu, P.S.: Active learning: a survey. In: Data Classification: Algorithms and Applications, pp. 571–605 (2014)
6. Zhang, Z., Cummins, N., Schuller, B.: Advanced data exploitation in speech analysis. IEEE Signal Process. Mag. 34, 107–129 (2017)
7. Hoens, T.R., Polikar, R., Chawla, N.V.: Learning from streaming data with concept drift and imbalance: an overview. Prog. Artif. Intell. 1, 89–101 (2012)
8. Dou, J., Li, J., Qin, Q., Tu, Z.: Moving object detection based on incremental learning low rank representation and spatial constraint. Neurocomputing 168, 382–400 (2015)
9. Bai, X., Ren, P., Zhang, H., Zhou, J.: An incremental structured part model for object recognition. Neurocomputing 154, 189–199 (2015)
10. Tasar, O., Tarabalka, Y., Alliez, P.: Incremental learning for semantic segmentation of large-scale remote sensing data. CoRR abs/1810.1 (2018)
11. Ristin, M., Guillaumin, M., Gall, J., Van Gool, L.: Incremental learning of random forests for large-scale image classification. IEEE Trans. Pattern Anal. Mach. Intell. 38, 490–503 (2016)
12. Shin, G., Yooun, H., Shin, D., Shin, D.: Incremental learning method for cyber intelligence, surveillance, and reconnaissance in closed military network using converged IT techniques. Soft Comput. 22, 6835–6844 (2018)
13. Dou, J., Li, J., Qin, Q., Tu, Z.: Robust visual tracking based on incremental discriminative projective non-negative matrix factorization. Neurocomputing 166, 210–228 (2015)
14. Wibisono, A., Jatmiko, W., Wisesa, H.A., Hardjono, B., Mursanto, P.: Traffic big data prediction and visualization using fast incremental model trees-drift detection (FIMT-DD). Knowl. Based Syst. 93, 33–46 (2016)
15. Wang, M., Wang, C.: Learning from adaptive neural dynamic surface control of strict-feedback systems. IEEE Trans. Neural Netw. Learn. Syst. 26, 1247–1259 (2015)
16. Zhang, H., Wu, P., Beck, A., Zhang, Z., Gao, X.: Adaptive incremental learning of image semantics with application to social robot. Neurocomputing 173, 93–101 (2016)

48

S. Karlos et al.

17. Khan, S., Wollherr, D.: IBuILD: incremental bag of binary words for appearance based loop closure detection. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 5441–5447. IEEE (2015) 18. Saveriano, M., An, S., Lee, D.: Incremental kinesthetic teaching of end-effector and nullspace motion primitives. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3570–3575. IEEE (2015) 19. Thrun, S.: Toward robotic cars. Commun. ACM 53, 99 (2010) 20. Shen, Y., Yun, H., Lipton, Z.C., Kronrod, Y., Anandkumar, A.: Deep active learning for named entity recognition. In: Blunsom, P., et al. (eds.) Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, 3 August 2017, pp. 252–256. Association for Computational Linguistics (2017) 21. Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. In: International Conference on Learning Representations (2018) 22. Liu, Y., et al.: Generative adversarial active learning for unsupervised outlier detection. CoRR. abs/1809.1 (2018) 23. Fang, M., Li, Y., Cohn, T.: Learning how to active learn: a deep reinforcement learning approach. In: Palmer, M., Hwa, R., Riedel, S. (eds.) EMNLP 2017, Copenhagen, Denmark, pp. 595–605. Association for Computational Linguistics (2017) 24. Contardo, G., Denoyer, L., Artières, T.: A meta-learning approach to one-step activelearning. In: Brazdil, P., Vanschoren, J., Hutter, F., and Hoos, H. (eds.) AutoML@PKDD/ECML, pp. 28–40. CEUR-WS.org (2017) 25. Krempl, G., Kottke, D., Lemaire, V.: Optimised probabilistic active learning (OPAL): for fast, non-myopic, cost-sensitive active classification. Mach. Learn. 100, 449–476 (2015) 26. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent Algorithms. In: ICML, pp. 919–926 (2004) 27. Settles, B.: Active Learning. Morgan & Claypool Publishers, San Rafael (2012) 28. 
Sharma, M., Bilgic, M.: Evidence-based uncertainty sampling for active learning. Data Min. Knowl. Discov. 31, 164–202 (2017) 29. Tsuruoka, Y., Tsujii, J., Ananiadou, S.: Stochastic gradient descent training for L1regularized log-linear models with cumulative penalty. In: ACL/IJCNLP, pp. 477–485 (2009) 30. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. 67, 301–320 (2005) 31. Yang, Y.-Y., Lee, S.-C., Chung, Y.-A., Wu, T.-E., Chen, S.-A., Lin, H.-T.: libact: Poolbased active learning in Python (2017) 32. Buitinck, L., et al.: API design for machine learning software: experiences from the scikitlearn project. In: CoRR abs/1309.0238 (2013) 33. Harrell, F.E.: Regression Modeling Strategies. Springer, New York (2015). https://doi.org/ 10.1007/978-3-319-19425-7 34. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-84858-7 35. Xiang, Z., Xiao, Z., Wang, D., Georges, H.M.: Incremental semi-supervised kernel construction with self-organizing incremental neural network and application in intrusion detection. J. Intell. Fuzzy Syst. 31, 815–823 (2016) 36. Lin, Y., Jiang, X., Li, Y., Zhang, J., Cai, G.: Semi-supervised collective extraction of opinion target and opinion word from online reviews based on active labeling. J. Intell. Fuzzy Syst. 33, 3949–3958 (2017) 37. Akusok, A., Eirola, E., Miche, Y., Gritsenko, A.: Advanced Query Strategies for Active Learning with Extreme Learning Machine. In: ESANN, pp. 105–110 (2017)

Investigating the Benefits of Exploiting Incremental Learners

49

38. Wang, Y., Singh, A.: Noise-adaptive margin-based active learning for multi-dimensional data. CoRR. abs/1406.5 (2014) 39. Hsu, W.-N., Lin, H.-T.: Active learning by learning. In: Bonet, B., Koenig, S. (eds.) Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 25–30 January 2015, Austin, Texas, USA, pp. 2659–2665. AAAI Press (2015) 40. Zhao, J., Liu, N., Malov, A.: Safe semi-supervised classification algorithm combined with active learning sampling strategy. J. Intell. Fuzzy Syst. 35, 4001–4010 (2018)

The Blockchain Random Neural Network in Cybersecurity and the Internet of Things

Will Serrano

Intelligent Systems and Networks Group, Electrical and Electronic Engineering, Imperial College London, London, UK
[email protected]

Abstract. The Internet of Things (IoT) enables increased connectivity between devices; however, this benefit also intrinsically increases cybersecurity risks, as cyber attackers are provided with expanded network access and additional digital targets. To address this issue, this paper presents a holistic digital and physical cybersecurity user authentication method based on the Blockchain Random Neural Network. The Blockchain Neural Network connects increasing numbers of neurons in a chain configuration, providing an additional layer of resilience against cybersecurity attacks in the IoT. The proposed user access authentication holistically covers the user's digital access through the seven OSI layers and their physical identity, such as a passport, before the user is accepted into the IoT network. The user's identity is kept secret, codified in the neural weights; in case of a cybersecurity breach, however, the physical identity can be mined and the attacker identified, therefore enabling safe decentralized confidentiality. The validation results show that the addition of the Blockchain Neural Network provides a user access control algorithm with increased cybersecurity resilience and decentralized user access and connectivity.

Keywords: Neural Network · Internet of Things · Blockchain · Cybersecurity · User Management · Access credentials

1 Introduction

In the Internet of Things (IoT), things are objects of the physical world (physical things) that can be sensed, or objects of the information world (virtual things) that can be digitalized; both are capable of being identified and integrated into information, and transmitted via sensor and wired or wireless communication networks [1]. The IoT enables comprehensive connectivity between devices; however, this benefit also intrinsically increases cybersecurity risks, as cyber attackers are provided with expanded network access and additional digital targets [2–4]. Blockchain enables the digitalization of contracts, as it provides authentication between parties and encryption of data that gradually increments while it is processed in a decentralized network such as the IoT [5]. Due to these features, Blockchain has already been applied in Cryptocurrency [6], Smart Contracts [7], Intelligent Transport Systems [8] and Smart Cities [9].

© IFIP International Federation for Information Processing 2019 Published by Springer Nature Switzerland AG 2019 J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 50–63, 2019. https://doi.org/10.1007/978-3-030-19823-7_4

1.1 Research Motivation

To address the increased cybersecurity risk of the IoT, this paper proposes a holistic digital and physical cybersecurity user authentication method based on the Blockchain Random Neural Network [10]. The Blockchain Neural Network connects neurons in a chain configuration, providing an additional layer of resilience against cybersecurity attacks in the IoT. The Cybersecurity and IoT application presented in this paper can be generalized to emulate an Authentication, Authorization and Accounting (AAA) server where user access information is encrypted in the neural weights and stored in decentralized servers. The Blockchain Neural Network solution is equivalent to the Blockchain, with the same properties: user authentication, data encryption and decentralization, where user access credentials are gradually incremented and learned while travelling or roaming. The Neural Network configuration has biological properties analogous to the Blockchain: neurons are gradually incremented and chained through synapses as variable user access credentials increase, and information is stored and codified in decentralized neural network weights. The main advantage of this research proposal is the biological simplicity of the solution; however, it suffers a high computational cost as the number of neurons increases.

1.2 Research Proposal

Internet of Things and Blockchain related work is described in Sect. 2. The proposed user access authentication, described in Sect. 3, holistically covers the user's digital access through the seven OSI layers and their physical identity, such as a passport ID, before the user is allowed to use IoT network resources. The method forces the user to be physically authenticated before establishing the connection that allows access to the IoT network; cybersecurity is therefore increased by reducing the likelihood of criminal network access. The user's digital OSI-layer identification, such as MAC and IP address, and physical identification, such as biometrics, generate the Private Key, whereas there is no need for a Public Key; this paper therefore defines a truly decentralized solution with the same Blockchain validation process: mining the input neurons until the neural network solution is found, as presented in Sect. 4. Experimental results in Sect. 5 show that the additional Blockchain neural network provides increased cybersecurity resilience and decentralized confidentiality to user access and connectivity. The main conclusion, presented in Sect. 6, is that the user's physical identity is kept secret, codified in the neural weights, although in case of a cybersecurity breach the identity can be mined and the attacker identified by their passport ID or biometrics.

2 Related Work

2.1 Internet of Things

The IoT has provided new services and applications, along with additional cybersecurity issues. Lee et al. [11] present the evolution of IoT technology, starting from Machine to Machine, which connects machines and devices; Interconnections of Things, which connect any physical or virtual object; and finally the Web of Things, which enables collaboration between people and objects. The IoT is formed of three layers, as proposed by Jing et al. [12]: sensor, transportation and application, which, similar to traditional networks, also have security issues and integration challenges. Roman et al. [13] state that because physical, virtual and private user information is captured, transmitted and shared by IoT sensors, cybersecurity aspects of data confidentiality and authentication, access control within the IoT network, identity management, and privacy and trust among users and things must be considered, and the enforcement of security and privacy policies shall also be implemented. Sicari et al. [14] declare that the dynamic IoT is formed by heterogeneous technologies that provide innovative services in various application domains, which shall meet flexible security and privacy requirements; traditional security countermeasures cannot be directly applied due to the different standards and communication protocols, and scalability issues arise because of the high number of interconnected devices. An important challenge for supporting diverse multimedia applications in the IoT is the security heterogeneity of wired and wireless sensor and transmission networks, which requires a balance between flexibility and efficiency, as presented by Zhou et al. [15]. Secure and Safe Internet of Things (SerIoT) was proposed by Gelenbe et al. [16] to improve the information and physical security of different operational IoT application platforms in a holistic and cross-layered manner. SerIoT covers areas such as mobile telephony, networked health systems, the Internet of Things, Smart Cities, Smart Transportation Systems, Supply Chains and Industrial Informatics [17].

2.2 Neural Networks in Cryptography

Neural Networks have already been applied to cryptography. Pointcheval [18] presents a linear scheme based on the Perceptron problem, or N-P problem, suited for smart card applications. Kinzel et al. [19] train two multilayer neural networks on their mutual output bits with discrete weights to achieve a synchronization that can be applied to secret key exchange over a public channel. Klimov et al. [20] propose three cryptanalytic attacks (genetic, geometric and probabilistic) against the above neural network. Volna et al. [21] apply feed-forward neural networks as an encryption and decryption algorithm with a permanently changing key. Yayık et al. [22] present a two-stage cryptography multilayered neural network where the first stage generates neural network-based pseudo-random numbers and the second stage encrypts information based on the non-linearity of the model. Schmidt et al. [23] present a review of the use of artificial neural networks in cryptography.

2.3 Blockchain in Security

Currently, there is a great research effort in Blockchain algorithms applied to security applications. Xu et al. [24] propose a punishment scheme based on the action record on the blockchain to suppress the attack motivation of the edge servers and the mobile devices in the edge network. Cha et al. [25] utilize a blockchain network as the underlying communication architecture to construct an ISO/IEC 15408-2 compliant security auditing system. Gai et al. [26] propose a conceptual model for fusing blockchains and cloud computing over three deployment modes: Cloud over Blockchain, Blockchain over Cloud and Mixed Blockchain-Cloud. Gupta et al. [27] propose a Blockchain consensus model for implementing IoT security. Agrawal et al. [28] present a Blockchain mechanism that continuously evaluates the legitimate presence of a user in a valid IoT zone without user intervention.

3 Blockchain Neural Network in the Internet of Things

Blockchain [6] is based on cryptographic concepts which can be applied similarly by the use of Neural Networks. Information in the Blockchain is contained in blocks that also include a timestamp, the number of attempts to mine the block and the previous block hash. Decentralized miners then calculate the hash of the current block to validate it. Information contained in the Blockchain consists of transactions, which are authenticated by a signature that uses the user's private key, the transaction origin, destination and value (Fig. 1).

Signature = Function(Private Key, From, To, Value)
Verify Signature = Function(Public Key, Signature, From, To, Value)

Fig. 1. Blockchain model: each block contains a hash, a timestamp, transactions, the number of mining iterations and the previous block's hash; each transaction contains origin, destination, data and signature.
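The block structure described above can be sketched as a minimal Python example. This is an illustrative toy, not the paper's implementation: the field names, the SHA-256 choice and the leading-zeros proof-of-work are assumptions introduced only to make the chaining and mining concepts concrete.

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    """SHA-256 over the block's canonical (sorted-key) JSON serialization."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def make_block(transactions, previous_hash, difficulty_prefix="00"):
    """Assemble a block and 'mine' it by incrementing an iteration counter
    until the hash starts with the required prefix (toy proof-of-work)."""
    block = {
        "timestamp": time.time(),
        "transactions": transactions,
        "previous_hash": previous_hash,
        "iterations": 0,
    }
    while not block_hash(block).startswith(difficulty_prefix):
        block["iterations"] += 1
    return block, block_hash(block)

# Two chained blocks: block 1 embeds the hash of the genesis block.
genesis, h0 = make_block([{"from": "A", "to": "B", "data": 1}], "0" * 64)
block1, h1 = make_block([{"from": "B", "to": "C", "data": 2}], h0)
assert block1["previous_hash"] == h0  # chain linkage
```

Tampering with any field of `genesis` would change `h0` and break the linkage check, which is the property the neural variant below reproduces with network weights instead of hashes.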

3.1 The Random Neural Network

The proposed Blockchain configuration is based on the Random Neural Network (RNN) [29–31], which is a spiking neuronal model that represents the signals transmitted in biological neural networks, where they travel as spikes or impulses rather than as analogue signal levels. The RNN is a spiking recurrent stochastic model for neural networks whose main analytical properties are the "product form" and the existence of a unique network steady-state solution.

3.2 The Random Neural Network with Blockchain Configuration

The Random Neural Network with Blockchain configuration consists of L input neurons, M hidden neurons and N output neurons (Fig. 2). Information in this model is contained in the network weights w⁺(j,i) and w⁻(j,i) rather than in the neurons x_L, z_M, y_N.

Fig. 2. The random neural network with Blockchain configuration. w⁺(j,i): excitatory network weights; w⁻(j,i): inhibitory network weights; Λ: external excitatory signal; λ: external inhibitory signal.

• I = (Λ_L, λ_L), a variable L-dimensional input vector I ∈ [−1, 1]^L, represents the pair of excitatory and inhibitory signals entering each input neuron, where the scalar L ranges over 1 < L < ∞;
• X = (x_1, x_2, …, x_L), a variable L-dimensional vector X ∈ [0, 1]^L, represents the input state q_L for neuron L, where the scalar L ranges over 1 < L < ∞;
• Z = (z_1, z_2, …, z_M), an M-dimensional vector Z ∈ [0, 1]^M, represents the hidden neuron state q_M for neuron M, where the scalar M ranges over 1 < M < ∞;
• Y = (y_1, y_2, …, y_N), an N-dimensional vector Y ∈ [0, 1]^N, represents the neuron output state q_N for neuron N, where the scalar N ranges over 1 < N < ∞;
• w⁺(j,i) is the (L+M+N) × (L+M+N) matrix of weights that represents the excitatory spike emission from neuron i to neuron j, where i ∈ {x_L, z_M, y_N} and j ∈ {x_L, z_M, y_N};
• w⁻(j,i) is the (L+M+N) × (L+M+N) matrix of weights that represents the inhibitory spike emission from neuron i to neuron j, where i ∈ {x_L, z_M, y_N} and j ∈ {x_L, z_M, y_N}.

The main concept of the Random Neural Network Blockchain configuration is that the neuron vector sizes L, M and N are variable instead of fixed. Neurons or blocks are iteratively added, where the value of the additional neurons consists of both the value of the additional information and the value of previous neurons, therefore forming a neural chain. Information in this model is transmitted in the matrices of network weights w⁺(j,i) and w⁻(j,i) rather than in the neurons. The input layer X represents the user's incremental verification data; the hidden layer Z represents the values of the chain, and the output layer Y represents the user's Private Key.
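The growth of the input vector across roamings can be sketched as follows. Only the vector bookkeeping is shown — the `next_input` helper and all placeholder values are illustrative, not part of the paper's model, which additionally retrains the RNN weights at each step.

```python
# Sketch of the neural-chain input growth across roamings: the input for
# roaming t+1 is the previous input plus the current hidden-layer values
# (the "chain") plus the new verification data.
def next_input(prev_input, hidden_state, new_verification):
    return prev_input + hidden_state + new_verification

M = 4                       # hidden neurons z_M, as in the validation section
v1 = [0.25] * 48            # first verification data, e.g. a 48-bit MAC
x = v1
for t in range(2, 5):       # roamings 2..4
    z = [0.5] * M           # placeholder hidden-layer values (the chain)
    v_new = [0.75] * 48     # verification data of the next node
    x = next_input(x, z, v_new)
# input sizes grow 48 -> 100 -> 152 -> 204, matching the x_L column of Table 3
assert len(x) == 204
```

Each roaming therefore adds M + 48 input neurons, which is why the input layer in Table 3 grows by 52 neurons per step.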

4 Cybersecurity and the IoT Blockchain Model

Cybersecurity and the Internet of Things in the Neural Network Blockchain model described in this section is based on the main concepts shown in Fig. 3:

• Private key, y_N;
• Roaming, R(t), and Verification, V;
• Neural Chain network and Mining;
• Decentralized information, w⁺(j,i) and w⁻(j,i).


Fig. 3. Cybersecurity and the Internet of Things in the Neural Blockchain model

4.1 Private Key

The private key Y = (y_1, y_2, …, y_N) consists of the user's digital AAA authentication credentials, which cover the seven layers of the OSI model, and physical information such as a passport, biometrics or both. The private key is presented by the user every time their credentials require verification by the accepting roaming node (Table 1).

Table 1. Private key

| Private key | Bits | Type | Interface |
|---|---|---|---|
| y8 | 72 | User Physical | Passport - Biometrics |
| y7 | 16 | Web | OSI Layer 7 |
| y6 | 16 | Digital | OSI Layer 6 |
| y5 | 16 | Middleware | OSI Layer 5 |
| y4 | 16 | Socket Port | OSI Layer 4 |
| y3 | 32 | IP | OSI Layer 3 |
| y2 | 48 | MAC | OSI Layer 2 |
| y1 | 16 | Bit | OSI Layer 1 |

4.2 Roaming and Verification

Let us define Roaming and Verification as:

• Roaming, R(t) = {R(1), R(2), …, R(t)}, a variable vector where t is the roaming number;
• Verification, V = {v_1, v_2, …, v_t}, a set of t I-vectors where v_o = (e_o1, e_o2, …, e_oI) and the e_o are the I different dimensions, for o = 1, 2, …, t.


The first Roaming R(1) has an associated input state X = x_I which corresponds to v_1 and represents the user verification data. The output state Y = y_N corresponds to the user's Private Key, and the hidden layer Z = z_M corresponds to the value of the neural chain that will be inserted in the input layer for the next roaming. The second Roaming R(2) has an associated input state X = x_I which corresponds to the user verification data v_1 from the first Roaming R(1), the chain (the value of the hidden layer z_M) and the additional user data v_2. The output state Y = y_N still corresponds to the user's Private Key, and the hidden layer Z = z_M corresponds to the value of the neural chain for the next transaction. This process iterates as more user verification data is inserted. The neural chain can be formed from the values of the entire hidden layer, a selection of neurons, or any other combination, to avoid reverse engineering of the user's identity from the stored neural weights.

4.3 Neural Chain Network and Mining

The first Roaming R(1) calculates the Random Neural Network weights with an error E_k < T for the input data I = (Λ_L, λ_L) and the user private key Y = y_N. The calculated network weights w⁺(j,i) and w⁻(j,i) are stored in the decentralized network and are retrieved in the mining process. After the first Roaming, the user must be validated at each additional Roaming with their private key, where their verification data is validated and the verification data v_t from Roaming R(t) is added to the user. Verification data is validated, or mined, by calculating the outputs of the Random Neural Network using the transmitted network weights w⁺(j,i) and w⁻(j,i) at variable random inputs i, or following any other method. The solution is found, or mined, when the quadratic error function E_k is less than a determined minimum error or threshold T:

E_k = (1/2) · Σ_{n=1}^{N} (y′_n − y_n)² < T    (1)

where E_k is the quadratic error, y′_n is the output of the Random Neural Network with mining or random input I, and y_n is the user's Private Key. The mining complexity can be tuned by adjusting the threshold T. The Random Neural Network with Blockchain configuration is mined when an input I is found that delivers an output Y with an error E_k less than a threshold T for the retrieved user network weights w⁺(j,i) and w⁻(j,i). When the solution is found, or mined, the user data can be processed; the potential value of the neural hidden layer Z = z_M is added to form the Neural Chain as the input of the next transaction, where more user data is added. Then, the system trains the Random Neural Network with the gradient descent learning algorithm for the new pair (I, Y), and the newly generated network weights w⁺(j,i) and w⁻(j,i) are stored in the decentralized network. The more roaming and verification data, the more complex the validation or mining process becomes.
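The mining criterion of Eq. (1) can be sketched as a random-search loop. Note the forward pass here is a deliberately trivial stand-in (averaging the candidate input), not the actual RNN of [29–31]; the function names and the toy network are assumptions made only to illustrate the stopping condition E_k < T.

```python
import random

def quadratic_error(y_pred, y_key):
    """E_k of Eq. (1): half the sum of squared output deviations."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y_pred, y_key))

def mine(forward, private_key, n_inputs, threshold, max_iters=100_000, seed=0):
    """Try random candidate inputs until the stored network maps one of them
    to the private key with E_k < T (the paper's notion of 'mining')."""
    rng = random.Random(seed)
    for iteration in range(1, max_iters + 1):
        # candidate neuron potentials in the 0.25-0.75 codification range
        candidate = [rng.uniform(0.25, 0.75) for _ in range(n_inputs)]
        if quadratic_error(forward(candidate), private_key) < threshold:
            return candidate, iteration
    return None, max_iters

# Toy stand-in network: every output neuron receives the input average.
key = [0.5, 0.5]
forward = lambda x: [sum(x) / len(x)] * len(key)
solution, iters = mine(forward, key, n_inputs=8, threshold=1e-5)
```

Lowering `threshold` makes each random candidate less likely to qualify, which is the sense in which the mining complexity is tuned by T.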

4.4 Decentralized Information

The user network weights w⁺(j,i) and w⁻(j,i) are stored in the decentralized network rather than the user data I directly, where I is calculated with the mining process. The network weights expand as more verification data is inserted, creating an adaptable method. In addition, the user data can only be extracted when the user presents their biometric key, therefore making it secure to store information in a decentralized system.

5 Neural Blockchain in Cybersecurity and IoT Validation

This section proposes a practical validation of the Neural Blockchain model in the Internet of Things and Cybersecurity using the network simulator Omnet++ with Java for a network of ten nodes. Three independent experiments emulate a Bluetooth network with roaming validation of MAC addresses, a WLAN network with roaming validation of MAC and IP addresses, and an LTE mobile network with roaming validation of MAC, IP and user passport (Table 2).

Table 2. Neural Blockchain in IoT and cybersecurity validation – node values

| Type | Use | Coverage | Layer | Node v_t | User y_N |
|---|---|---|---|---|---|
| Bluetooth Master | Room - Floor | 10 m - 3140 m² | MAC | 48 bits: 01-23-45-67-89-XX | 48 bits: 01-23-45-67-89-AB |
| Wireless LAN Access Point | Building - Campus | 100 m - 0.314 km² | MAC-IP | 48 + 32 bits: 192.168.11.XX | 48 + 32 bits: 192.168.11.11 |
| Mobile LTE Base Station | City - Country | 1 km - 31.4 km² | MAC-IP-PASSPORT | 48 + 32 bits: N/A | 48 + 32 + 72 bits: VGD12345F |

The user is assigned a private key y_N that requires validation before the user is allowed to transmit. As the user travels through the space, the credential private key is validated by the roaming node: the decentralized system retrieves the neural weights associated with the private key, mines the block, adds the node code and stores the network weights back in the decentralized system. This validation considers mining as the selection of random neuron values until E_k < T. When the user roams, the private key is presented and the information of the node (MAC and IP address) v_t is added to the neural chain once it is mined. Each bit is codified as a neuron; however, rather than the binary 0–1, the neuron potential is codified as 0.25–0.75 (Fig. 4); this approach removes overfitting in the learning algorithm, as neurons would otherwise only represent saturated binary values.
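The bit-to-neuron codification described above can be sketched as follows; the helper names are illustrative, not taken from the paper.

```python
def bits_to_potentials(bits):
    """Map binary credential bits to neuron potentials: 0 -> 0.25, 1 -> 0.75,
    avoiding the saturated 0/1 training targets that cause overfitting."""
    return [0.75 if b else 0.25 for b in bits]

def mac_to_bits(mac: str):
    """Expand a 48-bit MAC address such as '01-23-45-67-89-AB' into a
    most-significant-bit-first list of 48 bits."""
    value = int(mac.replace("-", ""), 16)
    return [(value >> i) & 1 for i in reversed(range(48))]

# A single MAC address becomes 48 input-neuron potentials.
potentials = bits_to_potentials(mac_to_bits("01-23-45-67-89-AB"))
assert len(potentials) == 48
assert set(potentials) <= {0.25, 0.75}
```

The same mapping would apply to the 32 IP bits and the 72 passport bits of the LTE experiment, giving the per-roaming input sizes used in the simulations.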


Fig. 4. Neural Blockchain in cybersecurity validation

The simulations are run 100 times for a Bluetooth MAC network (Table 3). The information shown is the number of iterations the Random Neural Network with Blockchain configuration requires to achieve E_k < 1.0E−10, the learning error E_k, the number of iterations to mine the Blockchain, the mining threshold and error, and the number of neurons in each layer: input x_L, hidden z_M and output y_N.

Table 3. Bluetooth MAC simulation – learning and mining

| Roaming | Learning iteration | Learning error | Mining iteration | Mining threshold | Mining error E_k | Neurons (x_L, z_M, y_N) |
|---|---|---|---|---|---|---|
| 1 | 233.00 | 9.96E−11 | 36.22 | 1.00E−05 | 2.48E−06 | 48-4-48 |
| 2 | 190.52 | 9.46E−11 | 24.88 | 1.00E−05 | 3.30E−06 | 100-4-48 |
| 3 | 171.42 | 9.40E−11 | 56.32 | 1.00E−05 | 3.01E−06 | 152-4-48 |
| 4 | 160.90 | 9.18E−11 | 758.37 | 1.00E−05 | 3.97E−06 | 204-4-48 |
| 5 | 150.67 | 9.12E−11 | 109.10 | 1.00E−05 | 3.64E−06 | 256-4-48 |
| 6 | 144.31 | 9.43E−11 | 116.59 | 1.00E−05 | 3.13E−06 | 308-4-48 |
| 7 | 140.00 | 9.47E−11 | 354.38 | 1.00E−05 | 3.09E−06 | 360-4-48 |
| 8 | 137.00 | 9.06E−11 | 2134.31 | 1.00E−05 | 3.70E−06 | 412-4-48 |
| 9 | 132.99 | 8.75E−11 | 12.13 | 1.00E−05 | 3.56E−06 | 464-4-48 |
| 10 | 131.00 | 9.30E−11 | 141.99 | 1.00E−05 | 3.26E−06 | 516-4-48 |

With four neurons in the hidden layer, the number of learning iterations gradually decreases while the number of input neurons increases, due to the additional information added when roaming between nodes. The results for the mining iterations are not as linear as expected, because mining is performed using random values (Fig. 5). Surprisingly, mining is easier in some roaming stages where it would have been expected to be harder as the number of neurons increases.

Fig. 5. Bluetooth MAC simulation – learning and mining

To be effective, the Blockchain Random Neural Network algorithm shall detect tampering (Table 4), where D represents the number of tampered bits.

Table 4. Bluetooth and WLAN simulation – tampering error

| Roaming | Bluetooth MAC error, D = 0.0 | Bluetooth MAC error, D = 1.0 | Neurons (x_L, z_M, y_N) | WLAN IP error, D = 0.0 | WLAN IP error, D = 1.0 | Neurons (x_L, z_M, y_N) |
|---|---|---|---|---|---|---|
| 1 | 9.96E−11 | 1.31E−03 | 48-4-48 | 9.28E−11 | 1.18E−03 | 80-4-80 |
| 2 | 9.49E−11 | 1.50E−04 | 100-4-48 | 9.57E−11 | 1.63E−04 | 164-4-80 |
| 3 | 9.18E−11 | 4.32E−05 | 152-4-48 | 9.36E−11 | 5.21E−05 | 248-4-80 |
| 4 | 9.38E−11 | 1.95E−05 | 204-4-48 | 9.50E−11 | 2.23E−05 | 332-4-80 |
| 5 | 9.01E−11 | 8.54E−06 | 256-4-48 | 9.21E−11 | 1.02E−05 | 416-4-80 |
| 6 | 9.07E−11 | 4.28E−06 | 308-4-48 | 9.36E−11 | 5.22E−06 | 500-4-80 |
| 7 | 9.49E−11 | 2.56E−06 | 360-4-48 | 9.12E−11 | 3.07E−06 | 584-4-80 |
| 8 | 9.33E−11 | 1.71E−06 | 412-4-48 | 9.28E−11 | 2.10E−06 | 668-4-80 |
| 9 | 8.93E−11 | 9.00E−07 | 464-4-48 | 9.25E−11 | 1.15E−06 | 752-4-80 |
| 10 | 9.59E−11 | 7.22E−07 | 516-4-48 | 9.55E−11 | 7.81E−07 | 836-4-80 |

The effects of tampering the Neural Blockchain (Fig. 6) are detected by the learning algorithm even when the tampered values differ by only one bit, D = 1.0, although this error reduces with increasing roaming as the number of neurons grows. Both the Bluetooth and WLAN networks perform similarly.

Fig. 6. Bluetooth and WLAN simulation – tampering error

6 Conclusions

This paper has presented the application of the Blockchain Random Neural Network in Cybersecurity and the IoT, where neurons are gradually incremented as user validation data increases through travelling and roaming, although this research can be generalized to any AAA server or access control solution. This configuration gives the proposed algorithm the same properties as the Blockchain, security and decentralization, with the same validation process: mining the input neurons until the neural network solution is found. The Random Neural Network in Blockchain configuration has been applied to an IoT AAA server that covers the seven digital layers of the OSI model and the physical user credentials such as a passport or biometrics. Experimental results show that Blockchain applications can be successfully implemented using neural networks, with a gradually increasing mining effort, user authentication and data encryption in a decentralized network, therefore removing centralized validation mechanisms. This paper has provided a holistic physical and digital cybersecurity application in the IoT where access to the network in an area requires prior physical user verification between decentralized parties. User data is encrypted and information is decentralized, and attackers can be identified if a criminal attack is delivered.


References

1. International Telecommunication Union: Overview of the Internet of Things. Y.2060, 1–22 (2012)
2. Andrea, I., Chrysostomou, C., Hadjichristofi, G.: Internet of Things: security vulnerabilities and challenges. In: IEEE Symposium on Computers and Communication, pp. 180–187 (2015)
3. Deogirikar, J., Vidhate, A.: Security attacks in IoT: a survey. In: IEEE International Conference on IoT in Social, Mobile, Analytics and Cloud, pp. 32–37 (2017)
4. Granjal, J., Monteiro, E., Sá Silva, J.: Security for the Internet of Things: a survey of existing protocols and open research issues. IEEE Commun. Surv. Tutor. 17(3), 1294–1312 (2015)
5. Huh, S., Cho, S., Kim, S.: Managing IoT devices using Blockchain platform. In: International Conference on Advanced Communication Technology, pp. 464–467 (2017)
6. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system, pp. 1–9 (2008). Bitcoin.org
7. Watanabe, H., Fujimura, S., Nakadaira, A., Miyazaki, Y., Akutsu, A., Kishigami, J.: Blockchain contract: securing a Blockchain applied to smart contracts. In: International Conference on Consumer Electronics, pp. 467–468 (2016)
8. Yuan, Y., Wang, F.-Y.: Towards Blockchain-based intelligent transportation systems. In: International Conference on Intelligent Transportation Systems, pp. 2663–2668 (2016)
9. Biswas, K., Muthukkumarasamy, V.: Securing smart cities using Blockchain technology. In: International Conference on High Performance Computing and Communications/SmartCity/Data Science and Systems, pp. 1392–1393 (2016)
10. Serrano, W.: The random neural network with a BlockChain configuration in digital documentation. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) ISCIS 2018. CCIS, vol. 935, pp. 196–210. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00840-6_22
11. Lee, G.M., Crespi, N., Choi, J.K., Boussard, M.: Internet of Things. In: Bertin, E., Crespi, N., Magedanz, T. (eds.) Evolution of Telecommunication Services. LNCS, vol. 7768, pp. 257–282. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41569-2_13
12. Jing, Q., Vasilakos, A., Wan, J., Lu, J., Qiu, D.: Security of the Internet of Things: perspectives and challenges. Wireless Netw. 20, 2481–2501 (2014)
13. Roman, R., Najera, P., Lopez, J.: Securing the Internet of Things. IEEE Computer, 51–58 (2011)
14. Sicari, S., Rizzardi, A., Grieco, L.A., Coen-Porisini, A.: Security, privacy and trust in Internet of Things: the road ahead. Comput. Netw. 76, 146–164 (2015)
15. Zhou, L., Chao, H.: Multimedia traffic security architecture for the Internet of Things. IEEE Netw., 35–40 (2011)
16. Gelenbe, E., Domanska, J., Czàchorski, T., Drosou, A., Tzovaras, D.: Security for Internet of Things: the SerIoT project. In: IEEE International Symposium on Networks, Computers and Communications, pp. 1–5 (2018)
17. Domańska, J., Nowak, M., Nowak, S., Czachórski, T.: European cybersecurity research and the SerIoT project. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) ISCIS 2018. CCIS, vol. 935, pp. 166–173. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00840-6_19
18. Pointcheval, D.: Neural networks and their cryptographic applications. In: Livre des résumés, Eurocode, Institute for Research in Computer Science and Automation, pp. 1–7 (1994)
19. Kinzel, W., Kanter, I.: Interacting neural networks and cryptography: secure exchange of information by synchronization of neural networks. In: Advances in Solid State Physics, vol. 42, pp. 383–391 (2002)
20. Klimov, A., Mityagin, A., Shamir, A.: Analysis of neural cryptography. In: Zheng, Y. (ed.) ASIACRYPT 2002. LNCS, vol. 2501, pp. 288–298. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36178-2_18
21. Volna, E., Kotyrba, M., Kocian, V., Janosek, M.: Cryptography based on the neural network. In: European Conference on Modelling and Simulation, pp. 1–6 (2012)
22. Yayık, A., Kutlu, Y.: Neural network based cryptography. Int. J. Neural Mass Parallel Comput. Inf. Syst. 24(2), 177–192 (2014)
23. Schmidt, T., Rahnama, H., Sadeghian, A.: A review of applications of artificial neural networks in cryptosystems. In: World Automation Congress, pp. 1–6 (2008)
24. Xu, D., Xiao, L., Sun, L., Lei, M.: Game theoretic study on Blockchain based secure edge networks. In: IEEE International Conference on Communications in China, pp. 1–5 (2017)
25. Cha, S.-C., Yeh, K.-H.: An ISO/IEC 15408-2 compliant security auditing system with Blockchain technology. In: IEEE Conference on Communications and Network Security, pp. 1–2 (2018)
26. Gai, K., Raymond, K.-K., Zhu, L.: Blockchain-enabled reengineering of cloud datacenters. IEEE Cloud Comput. 5(6), 21–25 (2018)
27. Gupta, Y., Shorey, R., Kulkarni, D., Tew, J.: The applicability of Blockchain in the Internet of Things. In: IEEE International Conference on Communication Systems & Networks, pp. 561–564 (2018)
28. Agrawal, R., Verma, P., Sonanis, R., Goel, U., De, A., Anirudh, S., Shekhar, S.: Continuous security in IoT using Blockchain. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6423–6427 (2018)
29. Gelenbe, E.: Random neural networks with negative and positive signals and product form solution. Neural Comput. 1, 502–510 (1989)
30. Gelenbe, E.: Learning in the recurrent random neural network. Neural Comput. 5, 154–164 (1993)
31. Gelenbe, E.: G-networks with triggered customer movement. J. Appl. Prob. 30, 742–748 (1993)

Autonomous Vehicles - Aerial Vehicles

A Visual Neural Network for Robust Collision Perception in Vehicle Driving Scenarios

Qinbing Fu1,2(B), Nicola Bellotto1, Huatian Wang1, F. Claire Rind3, Hongxin Wang1, and Shigang Yue1,2(B)

1 Lincoln Centre for Autonomous Systems (L-CAS), University of Lincoln, Lincoln, UK {qifu,syue}@lincoln.ac.uk
2 School of Mechanical and Electrical Engineering, Guangzhou University, Guangzhou, China
3 Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK

Abstract. This research addresses the challenging problem of visual collision detection in very complex and dynamic real physical scenes, specifically vehicle driving scenarios. It takes inspiration from a large-field looming sensitive neuron, i.e., the lobula giant movement detector (LGMD) in the locust's visual pathways, which responds with high spike frequency to rapidly approaching objects. Building upon our previous models, in this paper we propose a novel inhibition mechanism that is capable of adapting to different levels of background complexity. This adaptive mechanism works effectively to mediate the local inhibition strength and tune the temporal latency of local excitation reaching the LGMD neuron. As a result, the proposed model is effective at extracting colliding cues from complex dynamic visual scenes. We tested the proposed method using a range of stimuli including simulated movements in grating backgrounds and shifting of a natural panoramic scene, as well as vehicle crash video sequences. The experimental results demonstrate that the proposed method is feasible for fast collision perception in real-world situations, with potential applications in future autonomous vehicles.

Keywords: LGMD · Collision detection · Adaptive inhibition mechanism · Vehicle crash · Complex dynamic scenes

1 Introduction

Autonomous vehicles, though still in early stages of development, have demonstrated huge potential for shaping our future lifestyles and benefiting a variety of human activities. Before they can serve human society well, one critical issue must be solved: trustworthy collision perception. Nowadays the number of fatalities from road crashes remains high. To improve driving safety, the cutting-edge approaches for vehicle collision detection, such as radar, GPS-based methods and normal vision sensors, are often ineffective in terms of reliability, cost, energy consumption or size. For ground vehicle collision detection, the most effective systems comprise automated collision avoidance with emergency steering and braking assistance, as well as active lane keeping systems, e.g. [3]. The vast majority of vision-based methods implement object-and-scene segmentation, estimation or classification algorithms [13]. State-of-the-art visual sensors like RGB-D and event-driven cameras can provide vehicles with more abundant visual features compared to normal cameras. However, these solutions are either computationally costly or heavily reliant on the specific sensors. A new type of reliable, low-cost, energy-efficient and miniaturised collision detection technique is needed for future autonomous vehicles. Nature provides a rich source of inspiration for designing artificial visual systems for collision perception and avoidance (e.g. [4,6–8,16,19]). Insects have compact visual brains that deal with motion perception. For instance, locusts can fly for long distances in very dense swarms without collision; nocturnal insects likewise successfully forage in the forest at night free of collision. These naturally developed visual systems are perfect sensory models for collision detection and avoidance. In the future, vehicles with or without a driver should possess a similar ability to navigate as effectively as animals do. Locusts are well known for fast collision avoidance behaviour on the basis of visual cues. A group of lobula giant movement detectors (LGMDs) has been found by biologists (e.g. [14]).

Supported by the EU Horizon 2020 projects STEP2DYNA (691154) and ULTRACEPT (778062).
© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 67–79, 2019. https://doi.org/10.1007/978-3-030-19823-7_5

Fig. 1. Schematic illustration of the LGMD neuromorphology: the red 'dendrite tree' area indicates pre-synaptic visual processing; the green 'dendrite trees' field denotes a separate feed-forward inhibitory (FFI) pathway; DCMD (descending contra-lateral motion detector) is a one-to-one post-synaptic target neuron conveying the LGMD's spikes to further motion control neural systems. (Color figure online)
The LGMD1 (namely LGMD in this paper) was first identified as a moving objects detector and then gradually recognised as a large-field looming sensitive neuron, which responds most vigorously to quickly approaching objects rather than to other kinds of movements [14]. The neuromorphology of the LGMD is illustrated in Fig. 1. Such a fascinating neuron has been computationally modelled as collision selective visual neural networks or models (e.g. [2,11,17,19]), applied in mobile machines like ground robots [5,7,10] and UAVs [15], and also embodied in hardware implementations like the FPGA [12]. These works have partially reproduced the LGMD's responses and demonstrated that it features efficient neural computation for quick and reliable looming or collision sensing. However, the performance of these methods is still restricted by the background complexity, i.e., real physical environmental noise including irrelevant motion cues greatly affects the looming detection. For outdoor vehicle driving scenarios, there are many challenges for artificial collision detection vision systems. The more unpredictable and dynamic environments include optical flows caused by ego-motion, with movements of lane markings or rapidly approaching ground shadows, and situations like nearby vehicles approaching or surpassing. All of these significantly challenge the performance of visual collision detection systems. There have been some modelling studies showing the LGMD's efficacy for collision detection in vehicle driving scenarios. For example, a seminal work by Keil et al. demonstrated the effectiveness of ON and OFF mechanisms in an LGMD model for proximity detection in real-world scenes [11]. Yue et al. introduced a genetic algorithm with an LGMD visual neural network to reduce the false collision alert rate using video sequences from a camera mounted inside a car [20]. Stafford et al. proposed a method to combine the elementary motion detector (EMD) from the fly visual system with the LGMD for amplifying the colliding and translating stimuli [18].
More recently, a prominent work by Hartbauer constructed an LGMD-based collision detection system for vehicles [9]. In that study, the author specified a 'danger zone' in the centre of the vehicle's view to help exclude irrelevant optical flows in the surrounding area. In this research, building on recent biological research [14], we looked into the signal processing within the pre-synaptic area of the LGMD giant neuron. Compared to the aforementioned works and our previous model [7], we propose an adaptive inhibition mechanism, which enables the LGMD to perceive looming cues against complex dynamic backgrounds like gratings. Importantly, such a mechanism further sharpens the LGMD's selectivity by rigorously suppressing background shifting and other irrelevant motion, including receding and translating objects. To verify the effectiveness of this new inhibition mechanism, we investigated its collision detection performance in complex dynamic scenes. The rest of this paper is organised as follows: Sect. 2 introduces the proposed method; Sect. 3 presents our experiments and results, with further discussion; this research is summarised in Sect. 4.

2 Model Description

In this section we present the proposed LGMD visual neural network. The neural computation flowchart is illustrated in Fig. 2.

2.1 Spatiotemporal Neural Computation in the Pre-synaptic Area

The first layer of photoreceptors arranged in a 2-D matrix calculates the luminance change between every two successive frames at every local pixel:

P(x, y, t) = L(x, y, t) − L(x, y, t − 1) + Σ_{i=1}^{n_p} a_i · P(x, y, t − i).   (1)

The persistence of the luminance change can last for a short while of several (n_p) frames. The decay coefficient a_i is calculated by a_i = (1 + e^{u·i})^{−1}, with u = 1.
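As a concrete illustration, the photoreceptor layer of Eq. (1) can be sketched in NumPy as below; the function name, the zero initial frame, and the summation range i = 1..n_p are our assumptions, since the paper does not provide its implementation.

```python
import numpy as np

def photoreceptor_layer(frames, n_p=2, u=1.0):
    """Per-pixel luminance change with a short persistence of n_p frames (Eq. 1)."""
    # Decay coefficients a_i = 1 / (1 + e^(u*i)) for i = 1..n_p
    a = 1.0 / (1.0 + np.exp(u * np.arange(1, n_p + 1)))
    P = [np.zeros_like(frames[0], dtype=float)]  # assume no change at t = 0
    for t in range(1, len(frames)):
        p = frames[t].astype(float) - frames[t - 1].astype(float)
        # Add the decayed persistence of earlier luminance changes
        for i in range(1, n_p + 1):
            if t - i >= 0:
                p += a[i - 1] * P[t - i]
        P.append(p)
    return P
```

A step change between two frames thus yields P = 1 at every pixel, which then decays over the following n_p frames.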

Fig. 2. Schematic of the LGMD visual neural network. The input is grey-scale imagery. Pixel-wise luminance (L) is captured by n number of photoreceptors (P), each relayed into ON and OFF pathways with multi-layers inside partial neural networks (PNN). ON and OFF cells implement the functions of half-wave rectifiers. E, I, S and G stand for excitation, inhibition, summation and grouping cells; t indicates temporal latency. There are spatiotemporal convolutions in the ON/OFF channels. The output is spike frequency. A separate FFI-M pathway from the photoreceptor layer adjusts the local inhibition strength and tunes the local excitation latency at every frame.

In addition, from the photoreceptor layer, we also compute the object-size-dependent feed-forward inhibition (FFI). Differently from former LGMD neural networks, e.g. [7,10,14,19], where the FFI can directly suppress the LGMD giant neuron if luminance changes rapidly over a large field of view, we propose a new FFI mechanism, namely the feed-forward inhibition mediation (FFI-M), which will be used to tune the local inhibitions and excitations in the ON and OFF pathways. We define the output delayed signal as F̂, which is computed as:

F(t) = Σ_{x}^{C} Σ_{y}^{R} |P(x, y, t)| · n_cell^{−1},   dF̂(t)/dt = (1/τ_1) · (F(t) − F̂(t)),   (2)

where C and R denote the columns and rows of the photoreceptor layer, and n_cell stands for the total number of units, i.e. n_cell = C × R. The output is delayed by first-order low-pass filtering with a time constant τ_1 = 10 ms.
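A minimal discrete sketch of Eq. (2), assuming a forward-Euler update with a frame interval dt (the discretisation scheme and names are illustrative, not from the paper):

```python
import numpy as np

def ffi(P_t):
    """Mean absolute luminance change over the field of view: sum |P| / n_cell."""
    return np.abs(P_t).mean()

def low_pass_step(F_hat, F_t, dt=1.0, tau1=10.0):
    """One forward-Euler step of dF^/dt = (F - F^)/tau1."""
    return F_hat + (dt / tau1) * (F_t - F_hat)
```

Iterating `low_pass_step` makes F̂ track F with a lag governed by τ_1, which is what delays the feed-forward inhibition signal.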


After that, the photoreceptors pass motion information through parallel ON and OFF pathways, depending on luminance increments (ON) or decrements (OFF):

P_on(x, y, t) = [P(x, y, t)]^+,   P_off(x, y, t) = −[P(x, y, t)]^−,   (3)

where [x]^+ and [x]^− denote max(0, x) and min(x, 0), respectively. Subsequently, the polarity signals flow into parallel pathways, each possessing multiple layers including excitation (E), inhibition (I) and summation (S). Firstly, the ON cell leads the excitation to pass directly to the ON-E layer; meanwhile, it is fed into a low-pass filter, which gives feedback on delayed information:

E_on(x, y, t) = P_on(x, y, t),   dÊ_on(x, y, t)/dt = (1/τ_2) · (E_on(x, y, t) − Ê_on(x, y, t)),   (4)

where Ê_on is the output and τ_2 is the latency, varying between 60 and 180 ms. The local inhibition in the ON-I layer is then convolved from the delayed ON-excitations:

I_on(x, y, t) = Σ_{i=−1}^{1} Σ_{j=−1}^{1} Ê_on(x + i, y + j, t) · W(i + 1, j + 1),   (5)

where W denotes a convolution kernel fitting the following matrix:

W = ⎡ 1/8  1/4  1/8 ⎤
    ⎢ 1/4   1   1/4 ⎥
    ⎣ 1/8  1/4  1/8 ⎦.   (6)

It is important to note that the local excitation also performs self-inhibition. Next, in the ON-S layer, there is a competition between the local excitation and inhibition, which is a linear calculation:

S_on(x, y, t) = E_on(x, y, t) − w · I_on(x, y, t),   (7)

where w is a local bias to the inhibition. Note that only the non-negative values can reach the forthcoming computation. In this LGMD neural network, the ON and OFF pathways share the same spatiotemporal neural computation; therefore, for simplicity, we illustrate the processing in the ON pathway only. After the generation of local ON and OFF excitations in the ON-S and OFF-S layers, there are interactions between both polarity channels, which represent supralinear computations. That is,

S(x, y, t) = θ_1 · S_on(x, y, t) + θ_2 · S_off(x, y, t) + θ_3 · S_on(x, y, t) · S_off(x, y, t),   (8)

where {θ_1, θ_2, θ_3} denote the term coefficients balancing the contributions of the ON and OFF pathways, all set to 1 in this model. The proposed visual neural network also features a grouping (G) layer for the purpose of reducing noise and clustering local excitations by expanding edges. The G layer processing adopts the methods used in a former LGMD neural network [19], which are omitted here. Moreover, as shown in Fig. 2, the grouped local excitation has a temporal latency before reaching the LGMD cell. The computational role is consistent with first-order low-pass filtering, with a dynamic time constant τ_g, updated at every frame (initially set to 10 ms).
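The S-layer competition of Eq. (7) and the supralinear ON/OFF interaction of Eq. (8) amount to a few array operations; the function names are illustrative, with the θ coefficients all set to 1 as in the text.

```python
import numpy as np

def summation_layer(E, I, w):
    """Linear ON-S competition (Eq. 7); only non-negative values are passed on."""
    return np.maximum(E - w * I, 0.0)

def supralinear_interaction(S_on, S_off, theta=(1.0, 1.0, 1.0)):
    """ON/OFF interaction of Eq. (8); the product term makes it supralinear."""
    t1, t2, t3 = theta
    return t1 * S_on + t2 * S_off + t3 * S_on * S_off
```

The multiplicative term rewards pixels where ON and OFF excitations co-occur, which is characteristic of an expanding looming edge.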

2.2 Adaptive Inhibition Mechanism

In the proposed LGMD visual neural network, the FFI-M pathway is crucial to adjust the local biases to both the ON-inhibition and the OFF-inhibition in the PNN at every frame. That is,

w = max(σ_1, F̂(t)/T_f),   τ̂_g = τ_g · max(σ_2, 1 − F̂(t)/T_f),   (9)

where T_f = 20 is a threshold and σ_1 = 0.5. This mechanism works effectively to make the neural network adapt to different levels of background complexity. More precisely, the model is inhibited by dramatic changes of background clutter, like background shifting and grating movements. Most importantly, the model can still detect looming objects within dynamic background clutter. In addition, the FFI-M pathway tunes the temporal latency of grouped local excitations, where the time delay at every frame is updated by a coefficient bounded below by a very small real number σ_2 = 0.01. This dynamic temporal tuning means that the latency of local excitations reaching the LGMD becomes shorter as the object grows on the field of view, i.e., in the looming case.
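Eq. (9) is a simple per-frame update; a direct sketch (the function name is ours):

```python
def adapt(F_hat, tau_g, T_f=20.0, sigma1=0.5, sigma2=0.01):
    """FFI-mediated adaptation (Eq. 9): a busier background (larger F_hat)
    raises the inhibition bias w and shortens the grouping latency tau_g."""
    w = max(sigma1, F_hat / T_f)
    tau_g_new = tau_g * max(sigma2, 1.0 - F_hat / T_f)
    return w, tau_g_new
```

With a quiet background (F̂ ≈ 0) the defaults w = σ_1 and τ̂_g = τ_g are kept; when F̂ exceeds T_f the inhibition dominates and the latency collapses towards its σ_2 floor.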

2.3 The LGMD Cell

Following the pre-synaptic visual processing, the LGMD giant neuron pools all the local excitations forming the neural potential, which is then exponentially transformed to the sigmoid membrane potential as follows:

k(t) = Σ_{x}^{C} Σ_{y}^{R} G(x, y, t),   K(t) = (1 + e^{−k(t)·(n_cell·σ_3)^{−1}})^{−1},   (10)

where the coefficient σ_3 = 1 shapes the neural potential within [0.5, 1), and G is the grouped local excitation in the G layer (Fig. 2). After that, we apply a spike frequency adaptation mechanism, which contributes to further sharpening the looming selectivity by weakening the LGMD's responses to translating and receding stimuli. The corresponding computation can be found in [7].
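The pooling and sigmoid transform of Eq. (10) can be sketched as below (the function name is illustrative); with non-negative grouped excitation, K(t) indeed stays within [0.5, 1).

```python
import numpy as np

def membrane_potential(G, sigma3=1.0):
    """LGMD pooling and sigmoid membrane potential (Eq. 10)."""
    n_cell = G.size          # n_cell = C * R
    k = G.sum()              # pooled neural potential k(t)
    return 1.0 / (1.0 + np.exp(-k / (n_cell * sigma3)))
```

Zero excitation maps to K = 0.5 (the model's non-response level), and growing pooled excitation pushes K towards 1.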

2.4 Output Spike Frequency

The membrane potential is then exponentially mapped to spikes by an integer-valued function:

S^spike(t) = ⌊e^{σ_4·(K(t)−T_sp)}⌋,   (11)

Fig. 3. Membrane potentials of the proposed LGMD model by sinusoidal gratings with a wide range of spatial (SF) and temporal (TF) frequencies: the firing threshold is set at 0.7; the potentials at 0.5 denote non-response of the LGMD model.

where T_sp = 0.7 indicates the spiking threshold and σ_4 = 10 is a scale parameter affecting the firing rate at every frame. After that, we compute the spike frequency (spikes per second) within a range of discrete frames as the model output to indicate the recognition of collisions or not:

Coll(t) = True, if Σ_{i=t−N_ts}^{t} S^spike(i) · 1000/(N_ts · τ_i) ≥ T_sf; False, otherwise,   (12)

where N_ts = 6 denotes the number of frames constituting a short time window, T_sf = 15–30 spikes/s is the spike frequency threshold, and τ_i stands for the discrete time interval in milliseconds between successive frames.
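The spike mapping of Eq. (11) and the collision decision of Eq. (12) can be sketched as follows; a constant frame interval tau_ms is our simplifying assumption, and the names are illustrative.

```python
import math

def spikes(K, T_sp=0.7, sigma4=10.0):
    """Integer-valued spike count per frame (Eq. 11)."""
    return math.floor(math.exp(sigma4 * (K - T_sp)))

def collision(spike_train, tau_ms, N_ts=6, T_sf=20.0):
    """Collision decision (Eq. 12): True if the spike frequency over the
    last N_ts + 1 frames (i = t - N_ts .. t) reaches the threshold T_sf."""
    window = spike_train[-(N_ts + 1):]
    rate = sum(window) * 1000.0 / (N_ts * tau_ms)   # spikes per second
    return rate >= T_sf
```

Below the spiking threshold the floor function yields zero spikes, so the model stays silent until the membrane potential approaches T_sp.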

3 Experimental Evaluation

All the experiments can be categorised into two types of tests: we first tested the proposed LGMD visual neural network against synthetic movements in dynamic background clutter, with various spatiotemporal sinusoidal grating movements and shifting of a panoramic natural scene; we then investigated its performance in very complex vehicle driving scenarios consisting of many crash and near-miss scenes. The simulated stimuli input to the proposed model were grey-scale videos with a 30 Hz sampling frequency. The vehicle video sequences were adapted from dashboard camera recordings at 18 Hz [1]. The resolutions are 380 × 334, 540 × 270 and 352 × 288 for the grating, panoramic and vehicle videos, respectively.

3.1 Synthetic Visual Stimuli Testing

In the first type of experiments, we aim to demonstrate that the proposed LGMD model can detect objects approaching in complex dynamic backgrounds. Firstly, as basic trials for biological visual systems, we tested it against grating movements with a wide range of spatial and temporal frequencies. As shown in Fig. 3, challenged by grating movements alone, the proposed LGMD is completely inhibited during each course. The results verify that the proposed adaptive inhibition mechanism works effectively to deal with grating movements, which reconciles well with the responses of a biological LGMD. Subsequently, we simulated dark and light objects looming against grating movements in the background. In this case, the object approaches or moves away from the field of view at a constant linear speed of 10.8 cm/s; the spatial frequency varies at 40, 60 and 80 units/m. The results in Fig. 4 demonstrate that the proposed LGMD can robustly recognise either a dark or a light object approaching within grating stimuli, regardless of the background grating frequencies. The model shows dramatically increasing spike frequencies near the end of approaching, when the object gets close to the retina. Notably, the light looming object generates a much higher spiking rate, which indicates the LGMD's sensitivity to the contrast between looming objects and the background, i.e., the LGMD responds more strongly to looming objects with larger contrast. In the last part of the synthetic stimuli experiments, we simulated movements including looming, recession and translation embedded in the shifting of a panoramic natural scene. As illustrated in Fig. 5, the proposed LGMD responds most strongly to dark and light looming stimuli compared to receding and translating ones. The spike frequency gradually increases when looming objects grow on the field of view, and then exceeds the collision alert level, remaining high for a long period. On the other hand, the LGMD shows a lower spike frequency at the start of the object recession only, and little or no response to translations.
The results verify the effectiveness of the proposed LGMD model for extracting only looming cues from a dynamic cluttered background.

3.2 Real-World Driving Scenes Testing

In the second type of experiments, the proposed neural network was tested in much more complex vehicle driving scenes, which comprise mixed movements that interfere with the detection of real 'dangers'. Our goal was to investigate its performance for fast collision perception on vehicles. We categorised the vehicle driving video sequences adapted from [1] into two types of scenarios, i.e., fatal crashes and near-miss scenes. In the first case, as depicted in Fig. 6, the proposed neural system works effectively in perceiving imminent collisions very quickly, either in daylight or night driving scenarios. More specifically, the spike frequency increases very quickly and goes beyond the alert threshold before the highlighted real 'crash moments'; otherwise, the LGMD maintains a low rate below the threshold. Perceiving dangers before crashes is of great importance for vehicles with or without drivers. The proposed LGMD is highly activated before crashes, so it can serve as a reliable assistant alert system to improve collision avoidance during navigation. For comparison, we also challenged the proposed neural system with some near-miss situations where the vehicle avoided the collision, or faced nearby-lane approaching and translating vehicles.

Fig. 4. Spike frequency of the proposed LGMD by dark and light objects looming in grating stimuli with three different SF. The alert level is set at 30 spikes/s.

Fig. 5. Spike frequency of the proposed LGMD by dark and light objects approaching, receding and translating embedded in shifting natural background. The alert level is set at 15 spikes/s.

Fig. 6. Spike frequency of the proposed LGMD by vehicle crash scenarios. The spikes and the frequencies are both depicted. The vertical dashed lines indicate the real ‘crash moments’. A red circle marks the potential colliding object in the snapshots. (Color figure online)

Fig. 7. Spike frequency of the proposed LGMD by vehicle near-miss scenes.

As illustrated in Fig. 7, compared to the crash cases, the proposed model generates only a few sparse spikes or remains quiet. The results demonstrate that the proposed LGMD visual neural network can well discriminate urgent collisions caused by rapidly approaching objects from other irrelevant and less dangerous situations.

3.3 Discussion

Building upon our previous work, we have shown the efficacy of an adaptive inhibition mechanism in the LGMD visual neural network for looming perception in complex dynamic scenes. Notably, we showed its robust collision perception ability in vehicle driving scenarios including fatal crashes. However, there are still challenges to be solved by the proposed method. Indeed, although it can perceive impending crashes from the frontal view, we found that nearby surpassing or approaching objects may affect the collision detection performance by causing false alerts. In this case, single-neuron computation is insufficient, whereas the integration of multiple neural systems could be more effective. In addition, the alert firing threshold, which is now manually defined, should also be made adaptive to the varying complexity of environmental dynamics.

4 Concluding Remarks

This paper has introduced a bio-plausible visual neural network inspired by the locust looming sensitive giant neuron – the LGMD – for fast collision perception in complex dynamic scenes, including vehicle driving scenarios. Compared to previous studies, we focused on an adaptive inhibition mechanism capable of dealing with different levels of background complexity. The experimental results verified the feasibility and robustness of the proposed method in potential real-world applications. To improve road safety, the proposed model could be a good collision detection system embodied in miniaturised sensors for future autonomous vehicles and robots.

References

1. Best dash cam accidents. https://www.youtube.com/channel/UCM9Bwpf5RucT6j516NAA8sg. Accessed 01 Jan 2019
2. Bermudez i Badia, S., Bernardet, U., Verschure, P.F.: Non-linear neuronal responses as an emergent property of afferent networks: a case study of the locust lobula giant movement detector. PLoS Comput. Biol. 6(3), e1000701 (2010)
3. Eichberger, A.: Contributions to Primary, Secondary and Integrated Traffic Safety. Verlag Holzhausen GmbH (2011)
4. Franceschini, N.: Small brains, smart machines: from fly vision to robot vision and back again. Proc. IEEE 102, 751–781 (2014)
5. Fu, Q., Hu, C., Liu, T., Yue, S.: Collision selective LGMDs neuron models research benefits from a vision-based autonomous micro robot. In: IEEE International Conference on Intelligent Robots and Systems, pp. 3996–4002 (2017)
6. Fu, Q., Hu, C., Liu, P., Yue, S.: Towards computational models of insect motion detectors for robot vision. In: Giuliani, M., Assaf, T., Giannaccini, M.E. (eds.) Towards Autonomous Robotic Systems Conference, pp. 465–467 (2018)
7. Fu, Q., Hu, C., Peng, J., Yue, S.: Shaping the collision selectivity in a looming sensitive neuron model with parallel ON and OFF pathways and spike frequency adaptation. Neural Netw. 106, 127–143 (2018). https://doi.org/10.1016/j.neunet.2018.04.001
8. Fu, Q., Yue, S., Hu, C.: Bio-inspired collision detector with enhanced selectivity for ground robotic vision system. In: British Machine Vision Conference (2016)
9. Hartbauer, M.: Simplified bionic solutions: a simple bio-inspired vehicle collision detection system. Bioinspiration Biomim. 12(2), 026007 (2017)
10. Hu, C., Arvin, F., Xiong, C., Yue, S.: Bio-inspired embedded vision system for autonomous micro-robots: the LGMD case. IEEE Trans. Cognit. Dev. Syst. 9(3), 241–254 (2017)
11. Keil, M.S., Roca-Moreno, E., Rodriguez-Vazquez, A.: A neural model of the locust visual system for detection of object approaches with real-world scenes. In: Proceedings of the Fourth IASTED, pp. 340–345 (2004)
12. Meng, H., Appiah, K., Yue, S., Hunter, A., Hobden, M., Priestley, N., Hobden, P., Pettit, C.: A modified model for the lobula giant movement detector and its FPGA implementation. Comput. Vis. Image Underst. 114, 1238–1247 (2010)
13. Mukhtar, A., Xia, L., Tang, T.B.: Vehicle detection techniques for collision avoidance systems: a review. IEEE Trans. Intell. Transp. Syst. 16(5), 2318–2338 (2015). https://doi.org/10.1109/TITS.2015.2409109


14. Rind, F.C., Wernitznig, S., Polt, P., Zankel, A., Gutl, D., Sztarker, J., Leitinger, G.: Two identified looming detectors in the locust: ubiquitous lateral connections among their inputs contribute to selective responses to looming objects. Sci. Rep. 6, 35525 (2016). https://doi.org/10.1038/srep35525
15. Salt, L., Indiveri, G., Sandamirskaya, Y.: Obstacle avoidance with LGMD neuron: towards a neuromorphic UAV implementation. In: Proceedings - IEEE International Symposium on Circuits and Systems (2017)
16. Serres, J.R., Ruffier, F.: Optic flow-based collision-free strategies: from insects to robots. Arthropod Struct. Dev. 46(5), 703–717 (2017)
17. Silva, A., Santos, C.: Computational model of the LGMD neuron for automatic collision detection. In: IEEE 3rd Portuguese Meeting in Bioengineering (2013)
18. Stafford, R., Santer, R.D., Rind, F.C.: A bio-inspired visual collision detection mechanism for cars: combining insect inspired neurons to create a robust system. Biosystems 87(2–3), 164–171 (2007)
19. Yue, S., Rind, F.C.: Collision detection in complex dynamic scenes using an LGMD based visual neural network with feature enhancement. IEEE Trans. Neural Netw. 17(3), 705–716 (2006)
20. Yue, S., Rind, F.C., Keil, M.S., Cuadri, J., Stafford, R.: A bio-inspired visual collision detection mechanism for cars: optimisation of a model of a locust neuron to a novel environment. Neurocomputing 69(13–15), 1591–1598 (2006)

An LGMD Based Competitive Collision Avoidance Strategy for UAV

Jiannan Zhao1, Xingzao Ma2, Qinbing Fu1, Cheng Hu3, and Shigang Yue1(B)

1 University of Lincoln, Brayford Pool LN6 7TS, Lincoln, UK {jzhao,qfu,syue}@lincoln.ac.uk
2 College of Electro-Mechanical and Engineering, Lingnan Normal University, Zhanjiang 524048, China [email protected]
3 Machine Life and Intelligence Research Centre, Guangzhou University, Guangzhou, China [email protected]

Abstract. Building a reliable and efficient collision avoidance system for unmanned aerial vehicles (UAVs) is still a challenging problem. This research takes inspiration from locusts, which can fly in dense swarms for hundreds of miles without collision. In the locust's brain, a visual pathway of LGMD-DCMD (lobula giant movement detector and descending contra-lateral motion detector) has been identified as the collision perception system guiding fast collision avoidance, which makes it ideal for designing artificial vision systems. However, there are very few works investigating its potential in real-world UAV applications. In this paper, we present an LGMD based competitive collision avoidance method for UAV indoor navigation. Compared to previous works, we divide the UAV's field of view into four subfields, each handled by an LGMD neuron. Four individual competitive LGMDs (C-LGMDs) therefore compete to guide the directional collision avoidance of the UAV. With more degrees of freedom than ground robots and vehicles, the UAV can escape from collision along four cardinal directions (e.g. an object approaching from the left side triggers a rightward shift of the UAV). Our proposed method has been validated by both simulations and real-time quadcopter arena experiments.

Keywords: UAV collision avoidance · LGMD · Bio-inspired neural network

J. Zhao and X. Ma—Contributed equally. This research is funded by the EU HORIZON 2020 projects STEP2DYNA (grant agreement No. 691154) and ULTRACEPT (grant agreement No. 778062).
© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 80–91, 2019. https://doi.org/10.1007/978-3-030-19823-7_6

1 Introduction

The UAV is one of the most attractive yet vulnerable robot platforms, with potential applications in numerous scenarios such as geographic surveying, agricultural fertilization, exploration of dangerous or disaster regions, product delivery, and photography. Safety is always a vital property in UAV applications; thus, researchers are constantly seeking better Sense and Avoid (SAA) techniques for UAVs. Classic UAVs use GPS or optic flow [12,18] to navigate, and onboard distance sensors such as ultrasonic, infrared or laser sensors, or a cooperative system, to avoid obstacles, as reviewed in [23]. However, these distance sensors depend largely on the obstacles' materials and texture and the background's complexity; thus, they can only work in simple, structured environments [6]. Lidar and vision based methods are more diverse and applicable. One popular vision based method is to detect and locate obstacles in a reconstructed map, mark out the frontiers of the obstacles as banded fields in the map, and then use a specified pathfinding algorithm (e.g. a heuristic algorithm) to generate safe trajectories that avoid collision [1,2,16]. This approach is known as Simultaneous Localization and Mapping (SLAM), but its high computational burden keeps it off small or micro UAVs. On the other hand, bio-inspired vision based collision detection methods stand out for their efficiency. For example, Optic Flow (OF) is a widely used vision based motion detection method inspired by biological mechanisms in flies and bees [21]. It has also been introduced into collision avoidance technology: Zufferey [30] applied a 1D OF sensor to a 30 g lightweight fixed-wing UAV and achieved automatic obstacle avoidance in an indoor (GPS-denied) structured environment, and later, in 2009 [5], their group achieved autonomous avoidance of trees with 7 OF sensors on a fixed-wing platform.
Griffiths [11] used an optical-mouse (key-point matching) based OF sensor to fly through a canyon; besides the OF sensor, the system also integrated a laser ranger for directly approaching obstacles. Serres [20] used a pair of EMD based OF sensors to avoid lateral obstacles with a hovercraft. Sabo [18] applied OF to quadcopters and repeated some benchmark experiments to analyse the behaviours of a honeybee-like flying robot; however, the algorithm was still computed off board. Stevens [22] achieved collision avoidance in cluttered 3D environments. The Lobula Giant Movement Detector (LGMD) is another bio-inspired neural network, inspired by the locust's vision system, and is especially good at detecting approaching obstacles and avoiding imminent collisions. Compared to Optic Flow, the LGMD is more specialised in detecting directly approaching obstacles and eliminates the redundant image differences caused by shifting objects and backgrounds. The LGMD neuron and its presynaptic neural network have been modeled [17] and improved by many researchers [8,9,24]. As a collision detection model, the LGMD has been applied to mobile robots [4,13], embedded systems [10,14], a hexapod walking robot [7], a blimp [3] and cars [25,26]. The basic LGMD model provides the threat level of collision over the whole field of view (FoV), but this is not enough to make wise avoidance behaviour; hence, early research generated random turn directions for mobile robots [13]. Shigang [27]


divided the field of view into two bilateral halves and discussed both winner-take-all and steering-wheel networks in the direction control system of a mobile robot. Compared to mobile robots, a UAV has more degrees of freedom and is more vulnerable during flight. In the extremely limited literature on LGMD research for UAV platforms, Salt [19] implemented a neuromorphic LGMD model using recordings from a UAV platform and divided the FoV in half twice for direction information, but no real-time flight was conducted. Our previous research proved the applicability of the LGMD to real-time quadcopter flight and collision avoidance [28]. Previously, the quadcopter could only avoid obstacles by randomly turning left or right in the horizontal plane. To acquire information about the coming direction of imminent obstacles, this research proposes a new image partition strategy, designed especially for LGMD applications on UAVs, and a corresponding steering method for 3D avoidance behaviour. Both video simulations and real-time flights demonstrate the performance of this method.

2 Model Description

2.1 LGMD Process

The LGMD processing algorithm used in this paper is inherited from our previous research [28]. The LGMD process is composed of five groups of cells: P-cells (photoreceptors), I-cells (inhibitory), E-cells (excitatory), S-cells (summing) and G-cells (grouping). Compared to the previous model, we add four individual competitive LGMD cells representing the LGMD output of four sections: Left, Right, Up, and Down. The image is divided as shown in Fig. 1. The first layer of the neural network is composed of P cells, which are arranged in a matrix and formed by the luminance change between adjacent frames. The output of a P cell is given by:

$$P_f(x, y) = L_f(x, y) - L_{f-1}(x, y) \tag{1}$$

where $P_f(x, y)$ is the luminance change of pixel $(x, y)$ at frame $f$, and $L_f(x, y)$ and $L_{f-1}(x, y)$ are the luminance values at frame $f$ and the previous frame. The output of the P cells forms the input of the next layer and is processed by two different types of cells: I (inhibitory) cells and E (excitatory) cells. The E cells pass the excitatory flow directly to the S layer, so each E cell has the same value as its counterpart in the P layer, while the I cells pass the inhibitory flow convolved with the surrounding delayed excitations. The I layer can be described by a convolution operation:

$$[I]_f = [P]_f \otimes [w]_I \tag{2}$$

where [w]I is the convolution mask representing the local inhibiting weight distribution from the centre cell of P layer to neighbouring cells in S layer, a neighbouring cell’s local weight is reciprocal to its distance from the centre cell.


Fig. 1. A schematic illustration of the proposed LGMD based competitive neuron network for collision detection. [#] denotes the inherited LGMD process as described in our previous research [28].

To adapt to the fast image motion during UAV flight, $[w]_I$ is set differently from the mobile robot case [13]: the inhibition radius is expanded to 2 pixels:

$$[w]_I = 0.25 \begin{bmatrix} \frac{1}{\sqrt{8}} & \frac{1}{\sqrt{5}} & \frac{1}{2} & \frac{1}{\sqrt{5}} & \frac{1}{\sqrt{8}} \\ \frac{1}{\sqrt{5}} & \frac{1}{\sqrt{2}} & 1 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{5}} \\ \frac{1}{2} & 1 & 0 & 1 & \frac{1}{2} \\ \frac{1}{\sqrt{5}} & \frac{1}{\sqrt{2}} & 1 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{5}} \\ \frac{1}{\sqrt{8}} & \frac{1}{\sqrt{5}} & \frac{1}{2} & \frac{1}{\sqrt{5}} & \frac{1}{\sqrt{8}} \end{bmatrix} \tag{3}$$
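As a worked illustration of Eqs. (1)–(3), the kernel can be generated from the reciprocal-distance rule rather than typed by hand. The following Python sketch (our illustration, not the authors' embedded implementation; the function names are ours) builds the 5×5 mask and applies the P- and I-layer computations with NumPy:

```python
import numpy as np

def inhibition_mask(radius=2, gain=0.25):
    """Build the local inhibition kernel: each neighbour's weight is the
    reciprocal of its Euclidean distance from the centre cell (centre = 0)."""
    size = 2 * radius + 1
    w = np.zeros((size, size))
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx or dy:
                w[dx + radius, dy + radius] = 1.0 / np.hypot(dx, dy)
    return gain * w

def p_layer(frame, prev_frame):
    """Eq. (1): luminance change between adjacent frames."""
    return frame.astype(float) - prev_frame.astype(float)

def i_layer(p, w):
    """Eq. (2): convolve the P layer with the inhibition kernel (zero padding)."""
    r = w.shape[0] // 2
    padded = np.pad(p, r)
    out = np.zeros_like(p, dtype=float)
    for x in range(p.shape[0]):
        for y in range(p.shape[1]):
            out[x, y] = np.sum(padded[x:x + 2 * r + 1, y:y + 2 * r + 1] * w)
    return out
```

Because the kernel is symmetric, correlation and convolution coincide here, so the sliding-window sum above realizes Eq. (2) directly.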

The next layer is the S (Sum) layer, where the excitation and inhibition from the E and I layers are combined by linear subtraction. After summation, the G (Group) layer is involved to reduce the noise caused by sporadic image changes or backgrounds. Detailed equations and parameters can be found in our previous work [28]. At the G layer, the unnormalized membrane potentials of the four C-LGMDs are calculated respectively:

$$U\_LGMD_0 = \sum_x \sum_{y=0}^{\min[Diag1,\,Diag2]} |\tilde{G}_f(x, y)| \tag{4}$$

$$D\_LGMD_0 = \sum_x \sum_{y=\max[Diag1,\,Diag2]}^{y_{max}} |\tilde{G}_f(x, y)| \tag{5}$$

$$L\_LGMD_0 = \sum_x \sum_{y=Diag1}^{Diag2} |\tilde{G}_f(x, y)| \tag{6}$$

$$R\_LGMD_0 = \sum_x \sum_{y=Diag2}^{Diag1} |\tilde{G}_f(x, y)| \tag{7}$$

where $Diag1$ and $Diag2$ denote the y-axis coordinates of the two diagonals at column $x$, $y_{max}$ denotes the last pixel row, and $\tilde{G}_f(x, y)$ is the cell value of the G layer, as illustrated in Fig. 2; the inner sums of Eqs. (6) and (7) contribute only where their bounds are ordered, i.e. over the left and right halves of the image respectively. For more details on the process from $P_f(x, y)$ to $\tilde{G}_f(x, y)$, please refer to our previous work [28].
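The region sums of Eqs. (4)–(7) can be sketched by classifying each pixel against the two diagonals. This helper (an illustration with our own naming, assuming image coordinates with y increasing downwards) makes the four-triangle partition explicit:

```python
import numpy as np

def c_lgmd_potentials(g):
    """Split |G| into Up/Down/Left/Right triangles along the two image
    diagonals and sum each region (Eqs. 4-7). g has shape (rows, cols)."""
    rows, cols = g.shape
    a = np.abs(g.astype(float))
    u = d = le = ri = 0.0
    for x in range(cols):
        diag1 = rows * x / cols                 # main diagonal at column x
        diag2 = rows * (cols - 1 - x) / cols    # anti-diagonal at column x
        lo, hi = min(diag1, diag2), max(diag1, diag2)
        for y in range(rows):
            if y < lo:            # Eq. (4): above both diagonals -> Up
                u += a[y, x]
            elif y > hi:          # Eq. (5): below both diagonals -> Down
                d += a[y, x]
            elif diag1 < diag2:   # Eq. (6): between diagonals, left half
                le += a[y, x]
            else:                 # Eq. (7): between diagonals, right half
                ri += a[y, x]
    return u, d, le, ri
```

Every pixel falls in exactly one region, so the four sums add up to the whole-image sum used later in Eq. (9).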

Fig. 2. Image dividing method. The image scene is split along the diagonals.

Previously, the membrane potential of the LGMD cell, $K_{f0}$, was the summation over every pixel in the G layer:

$$K_{f0} = \sum_x \sum_y |\tilde{G}_f(x, y)| \tag{8}$$

Now it also equals the summation of the four C-LGMD neurons:

$$K_{f0} = U\_LGMD_0 + D\_LGMD_0 + L\_LGMD_0 + R\_LGMD_0 \tag{9}$$

and then $K_{f0}$ is adjusted to the range (0, 255) by a sigmoid equation:

$$\kappa_f = \tanh\!\left(\frac{K_{f0} - n_{cell}\, C_1}{n_{cell}\, C_2}\right) \times 255 \tag{10}$$

where $C_1$ and $C_2$ are constants shaping the normalizing function, limiting the excitation $\kappa_f$ to [0, 255], and $n_{cell}$ is the total number of pixels in one frame of the image. The membrane potentials of the four C-LGMDs are also limited to (0, 255), by taking their proportions of $K_{f0}$ instead of applying the sigmoid function again:

$$U\_LGMD = \frac{U\_LGMD_0}{K_{f0}} \times \kappa_f \tag{11}$$

$$D\_LGMD = \frac{D\_LGMD_0}{K_{f0}} \times \kappa_f \tag{12}$$

$$L\_LGMD = \frac{L\_LGMD_0}{K_{f0}} \times \kappa_f \tag{13}$$

$$R\_LGMD = \frac{R\_LGMD_0}{K_{f0}} \times \kappa_f \tag{14}$$

If $\kappa_f$ exceeds its threshold, an LGMD spike is produced:

$$S_f^{spike} = \begin{cases} 1, & \text{if } \kappa_f \geq T_s \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

An impending collision is confirmed if successive spikes last for no fewer than $n_{sp}$ consecutive frames:

$$C_f^{LGMD} = \begin{cases} 1, & \text{if } \sum_{f-n_{sp}}^{f} S_f^{spike} \geq n_{sp} \\ 0, & \text{otherwise} \end{cases} \tag{16}$$

Then, based on the result of the competitive C-LGMDs, the DCMD switches to the corresponding escape command, which is sent through a USART interface to the flight control system. The process from the DCMD to the PID based motor control system is shown in pseudocode in Algorithm 1.
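The normalization and spike logic of Eqs. (10)–(16) can be sketched as follows. This is our illustration, not the embedded firmware; the constants C1, C2, Ts and n_sp used here are placeholder assumptions, not the paper's tuned parameters:

```python
import math
from collections import deque

def normalize_kf(k_f0, n_cell, c1=0.0, c2=1.0):
    """Eq. (10): squash the raw potential into [0, 255)."""
    return math.tanh((k_f0 - n_cell * c1) / (n_cell * c2)) * 255

def scale_c_lgmds(potentials, k_f0, kappa_f):
    """Eqs. (11)-(14): each C-LGMD keeps its proportion of the total."""
    return [p / k_f0 * kappa_f for p in potentials]

class SpikeDetector:
    """Eqs. (15)-(16): spike when kappa_f >= Ts; confirm a collision when
    the last n_sp frames all spiked."""
    def __init__(self, ts=200.0, n_sp=3):
        self.ts, self.n_sp = ts, n_sp
        self.history = deque(maxlen=n_sp)

    def update(self, kappa_f):
        self.history.append(1 if kappa_f >= self.ts else 0)
        full = len(self.history) == self.n_sp
        return int(full and sum(self.history) >= self.n_sp)
```

Keeping only the last n_sp spike flags in a bounded deque mirrors the sliding window of Eq. (16).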

Algorithm 1. Escape direction steering algorithm

Input: the four competitive C-LGMDs for stimuli in the FoV: U_LGMD, D_LGMD, L_LGMD, R_LGMD
Output: expected quadcopter speed on the x, y, z axes (PID motor control input): exp_sp[x], exp_sp[y], exp_sp[z]

while C_f^LGMD = 1 do
    minDirection ← U_LGMD
    if minDirection ≥ D_LGMD then minDirection ← D_LGMD
    if minDirection ≥ L_LGMD then minDirection ← L_LGMD
    if minDirection ≥ R_LGMD then minDirection ← R_LGMD
    if minDirection = U_LGMD then exp_sp[z] ← speed0
    else if minDirection = D_LGMD then exp_sp[z] ← −speed0
    else if minDirection = L_LGMD then exp_sp[y] ← speed0
    else if minDirection = R_LGMD then exp_sp[y] ← −speed0
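Algorithm 1 effectively steers towards the quadrant with the weakest C-LGMD excitation. A compact Python sketch of the same selection follows (our illustration; the sign conventions follow the pseudocode, speed0 is a tunable escape speed, and the tie-breaking order is our assumption):

```python
def escape_command(u, d, l, r, speed0=0.5):
    """Pick the least-excited C-LGMD and return (exp_sp_y, exp_sp_z), the
    expected lateral/vertical speeds fed to the PID motor control."""
    exp_sp_y = exp_sp_z = 0.0
    # Loser-take-all: escape towards the direction with minimum excitation.
    winner = min(('U', u), ('D', d), ('L', l), ('R', r), key=lambda kv: kv[1])[0]
    if winner == 'U':
        exp_sp_z = speed0      # escape upwards
    elif winner == 'D':
        exp_sp_z = -speed0     # escape downwards
    elif winner == 'L':
        exp_sp_y = speed0      # escape to the left
    else:
        exp_sp_y = -speed0     # escape to the right
    return exp_sp_y, exp_sp_z
```

An obstacle approaching from the left drives L_LGMD high, so the minimum typically lies on the opposite side and the quadcopter shifts rightwards, matching the behaviour described in the abstract.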

3 System Overview

In this section, the outline of the whole system is described. The system, composed of the quadcopter, the embedded LGMD detector, a ground station, a remote control and auxiliary sensors, is depicted in Fig. 3. Luminance information is collected by the camera on the detector board and fed into the LGMD algorithm; the output command is passed through a USART port to the flight controller to drive the avoidance manoeuvres.

3.1 Quadcopter Platform

The UAV platform used in this research is a customized quadcopter with a skeleton size of 33 cm between diagonal rotors. The flight control module is based on an STM32F407V and provides five USART interfaces for extra peripherals. Multiple sensors are applied for data collection and to enhance the stability of the quadcopter, including an IMU (Inertial Measurement Unit), an ultrasonic sensor, an optic flow sensor and the LGMD detector, as illustrated in Fig. 3. The Pix4Flow optic flow module [12] is employed for position and velocity feedback in the horizontal plane. The flight control module works as the central controller combining the other parts: it receives source data from the embedded IMU module (MPU6050), the Pix4Flow optic flow sensor, and the LGMD detector, calculates the PWM (Pulse-Width Modulation) values output to the four motors, and also sends back real-time data for analysis through the nRF24L01 module.

Fig. 3. The structure of the quadcopter platform.

4 Experiments and Results

To verify the performance of the proposed algorithm, both video simulations and real-time arena flights were conducted.

4.1 Video Simulation

The algorithm was first implemented in MATLAB and tested on a series of recorded videos, to verify whether it can distinguish stimuli coming from different directions. The results in Fig. 4 indicate that the new network is able to respond differently to objects approaching from different directions.

Fig. 4. Simulation results with snapshots. (a), (b), (c), (d) show the membrane potentials of the C-LGMDs for stimuli from the upside, downside, left side, and right side, respectively.

4.2 Hovering and Feature Analysis

To further analyze performance on the quadcopter platform, we ported the algorithm to the embedded LGMD detector, mounted the detector on the quadcopter, and stimulated it with test patterns while the quadcopter hovered in the air. An object was manually pushed towards the detector from each of the four directions, with ten repetitions per direction. Figure 5 shows an example trial scene, in which the object is pushed towards the detector from the left. According to the results in Fig. 6, the four competitive LGMDs distinguished the coming direction of the object accurately. In all four types of trials, when the LGMD exceeded its threshold, the C-LGMD of the main direction led the other average values, even at its lowest performance (the lower boundary of the shaded region).


Fig. 5. Hovering experiment scene.

Fig. 6. Average membrane potential during hovering tests. (a), (b), (c), (d) show the average membrane potentials of the four competitive LGMD neurons for objects approaching (a) from upside down, (b) from downside up, (c) from left to right, and (d) from right to left. The shadow is the continuous error of the C-LGMD of the main direction.

4.3 Arena Real-Time Flight

Finally, real-time flight and obstacle avoidance experiments were conducted to test the performance and robustness of the proposed directional obstacle avoidance method. Trials covering the four approach directions were set up in two types: obstacles on the left and right side, or on the upside and downside, of the UAV's route. The quadcopter was first challenged by a static obstacle and then by a dynamic intruder. The results showed that the system is able to make smart escape behaviour based on the coming direction of the obstacle. The trajectories of these trials have been extracted and overlaid on a screenshot from the video, as

Fig. 7. Real-time obstacle avoidance tests. (a) Left and right object avoidance in the arena test; (b) up and down object avoidance in the arena test.

shown in Fig. 7. Trajectories were detected by a Python program using background subtraction [29] and template matching [15], and then drawn onto a screenshot from the recorded video.

5 Conclusion

To conclude, a novel competitive LGMD and a corresponding UAV control algorithm are proposed to address practical problems met in LGMD applications on UAVs. Both simulation and real-time flight experiments were conducted to analyze the proposed method, and the results showed high robustness. Based on the proposed competitive LGMD, real-time 3D collision avoidance by a quadcopter is achieved in an indoor environment. In future work, fully autonomous flight in a larger arena should be undertaken to explore the limits of this new method.

References
1. Achtelik, M., Bachrach, A., He, R., Prentice, S., Roy, N.: Stereo vision and laser odometry for autonomous helicopters in GPS-denied indoor environments. In: Unmanned Systems Technology XI, vol. 7332, p. 733219. International Society for Optics and Photonics (2009)
2. Bachrach, A., He, R., Roy, N.: Autonomous flight in unknown indoor environments. Int. J. Micro Air Veh. 1(4), 217–228 (2009)
3. Bermúdez i Badia, S., Pyk, P., Verschure, P.F.: A fly-locust based neuronal control system applied to an unmanned aerial vehicle: the invertebrate neuronal principles for course stabilization, altitude control and collision avoidance. Int. J. Rob. Res. 26(7), 759–772 (2007)
4. Bermúdez i Badia, S., Bernardet, U., Verschure, P.F.: Non-linear neuronal responses as an emergent property of afferent networks: a case study of the locust lobula giant movement detector. PLoS Comput. Biol. 6(3), e1000701 (2010)
5. Beyeler, A., Zufferey, J.C., Floreano, D.: Vision-based control of near-obstacle flight. Autonom. Rob. 27(3), 201 (2009)
6. Chee, K., Zhong, Z.: Control, navigation and collision avoidance for an unmanned aerial vehicle. Sens. Actuators A Phys. 190, 66–76 (2013)


7. Čížek, P., Milička, P., Faigl, J.: Neural based obstacle avoidance with CPG controlled hexapod walking robot. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 650–656. IEEE (2017)
8. Fu, Q., Hu, C., Liu, T., Yue, S.: Collision selective LGMDs neuron models research benefits from a vision-based autonomous micro robot. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3996–4002. IEEE (2017)
9. Fu, Q., Hu, C., Peng, J., Yue, S.: Shaping the collision selectivity in a looming sensitive neuron model with parallel on and off pathways and spike frequency adaptation. Neural Netw. 106, 127–143 (2018)
10. Fu, Q., Yue, S., Hu, C.: Bio-inspired collision detector with enhanced selectivity for ground robotic vision system. In: BMVC (2016)
11. Griffiths, S., Saunders, J., Curtis, A., Barber, B., McLain, T., Beard, R.: Obstacle and terrain avoidance for miniature aerial vehicles. In: Valavanis, K.P. (ed.) Advances in Unmanned Aerial Vehicles, pp. 213–244. Springer, Dordrecht (2007). https://doi.org/10.1007/978-1-4020-6114-1_7
12. Honegger, D., Meier, L., Tanskanen, P., Pollefeys, M.: An open source and open hardware embedded metric optical flow CMOS camera for indoor and outdoor applications. In: 2013 IEEE International Conference on Robotics and Automation (ICRA), pp. 1736–1741. IEEE (2013)
13. Hu, C., Arvin, F., Xiong, C., Yue, S.: A bio-inspired embedded vision system for autonomous micro-robots: the LGMD case. IEEE Trans. Cogn. Dev. Syst. PP(99), 1 (2016)
14. Hu, C., Arvin, F., Yue, S.: Development of a bio-inspired vision system for mobile micro-robots. In: Joint IEEE International Conferences on Development and Learning and Epigenetic Robotics, pp. 81–86. IEEE (2014)
15. Lewis, J.P.: Fast normalized cross-correlation. In: Vision Interface, vol. 10, pp. 120–123 (1995)
16. Richter, C., Bry, A., Roy, N.: Polynomial trajectory planning for aggressive quadrotor flight in dense indoor environments.
In: Inaba, M., Corke, P. (eds.) Robotics Research. STAR, vol. 114, pp. 649–666. Springer, Cham (2016). https://doi.org/ 10.1007/978-3-319-28872-7 37 17. Rind, F.C., Bramwell, D.: Neural network based on the input organization of an identified neuron signaling impending collision. J. Neurophysiol. 75(3), 967–985 (1996) 18. Sabo, C., Cope, A., Gurny, K., Vasilaki, E., Marshall, J.: Bio-inspired visual navigation for a quadcopter using optic flow. In: AIAA Infotech@ Aerospace 404 (2016) 19. Salt, L., Indiveri, G., Sandamirskaya, Y.: Obstacle avoidance with LGMD neuron: towards a neuromorphic UAV implementation. In: 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–4. IEEE (2017) 20. Serres, J., Dray, D., Ruffier, F., Franceschini, N.: A vision-based autopilot for a miniature air vehicle: joint speed control and lateral obstacle avoidance. Autonom. Rob. 25(1–2), 103–122 (2008) 21. Serres, J.R., Ruffier, F.: Optic flow-based collision-free strategies: from insects to robots. Arthropod Struct. Dev. 46(5), 703–717 (2017) 22. Stevens, J.L., Mahony, R.: Vision based forward sensitive reactive control for a quadrotor VTOL. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5232–5238. IEEE (2018) 23. Yu, X., Zhang, Y.: Sense and avoid technologies with applications to unmanned aircraft systems: review and prospects. Prog. Aerosp. Sci. 74, 152–166 (2015)


24. Yue, S., Rind, F.C.: Collision detection in complex dynamic scenes using an LGMD-based visual neural network with feature enhancement. IEEE Trans. Neural Netw. 17(3), 705–716 (2006)
25. Yue, S., Rind, F.C.: A synthetic vision system using directionally selective motion detectors to recognize collision. Artif. Life 13(2), 93–122 (2007)
26. Yue, S., Rind, F.C., Keil, M.S., Cuadri, J., Stafford, R.: A bio-inspired visual collision detection mechanism for cars: optimisation of a model of a locust neuron to a novel environment. Neurocomputing 69(13), 1591–1598 (2006)
27. Yue, S., Santer, R.D., Yamawaki, Y., Rind, F.C.: Reactive direction control for a mobile robot: a locust-like control of escape direction emerges when a bilateral pair of model locust visual neurons are integrated. Autonom. Rob. 28(2), 151–167 (2010)
28. Zhao, J., Hu, C., Zhang, C., Wang, Z., Yue, S.: A bio-inspired collision detector for small quadcopter. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2018)
29. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 2, pp. 28–31. IEEE (2004)
30. Zufferey, J.C., Floreano, D.: Toward 30-gram autonomous indoor aircraft: vision-based obstacle avoidance and altitude control. In: Proceedings of the 2005 IEEE International Conference on Robotics and Automation, ICRA 2005, pp. 2594–2599. IEEE (2005)

Mixture Modules Based Intelligent Control System for Autonomous Driving Tangyike Zhang1,2 , Songyi Zhang1,2 , Yu Chen1,2 , Chao Xia1,2 , Shitao Chen1,2(B) , and Nanning Zheng1,2 1

Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China [email protected], {ericzhang,zhangsongyi, alan19960212,xc06210417,chenshitao}@stu.xjtu.edu.cn 2 National Engineering Laboratory for Visual Information Processing and Applications, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China

Abstract. As a typical artificial intelligence system, a self-driving vehicle needs a safe and comfortable control system to reach the same level of driving ability as a human driver. This paper proposes a novel control system for autonomous driving vehicles based on mixture modules, which aims to ensure the accuracy of path tracking while meeting the requirements of safety and ride comfort. The mixture modules consist of a lateral controller that controls the steering wheel angle of the vehicle for path tracking and a longitudinal controller that adjusts the speed of the vehicle. We conducted a series of experiments on our simulation platform and on real self-driving vehicles to test the proposed control system, and compared it with widely used traditional methods. The experimental results indicate that our control system runs effectively on real vehicles: it accurately tracks the intended driving path and adjusts the driving speed comfortably and smoothly, demonstrating a high level of intelligence.

Keywords: Mixture modules · Control system · Path tracking · Ride comfort

1 Introduction

Generally, an autonomous driving system can be divided into three subsystems: sensing, planning and control [20,21]. The control system needs to control actuators such as the throttle, brake and steering wheel according to the smooth and drivable path generated by the planning system. The efficiency of control directly determines the safety and comfort of the ride [9]. With the advancement of

This work was supported by the National Natural Science Foundation of China (NO. 61773312, 61790563).
© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 92–104, 2019. https://doi.org/10.1007/978-3-030-19823-7_7


the commercialization of self-driving technology, people have put forward higher requirements for the safety and ride comfort of self-driving vehicles [22]. However, autonomous vehicles usually face complex road scenarios, and control efficiency is affected by numerous factors, including the vehicle dynamics, the change of path curvature and the road condition. Therefore, the design of the control system is a key and challenging problem for self-driving technology. A traditional control system typically consists of a lateral controller and a longitudinal controller [12,24]. The lateral controller calculates the steering wheel angle of the vehicle to achieve path tracking, and the longitudinal controller controls the throttle and brake pedal percentages to regulate the speed of the vehicle [3]. Although the traditional control method can provide a good solution to the problem of tracking accuracy, it is often difficult to tune for ride comfort. In addition, traditional methods tend to have poor control performance at high speed. An important reason is that the vehicle models used in traditional methods ignore the influence of longitudinal speed on lateral control; only by considering this influence when designing the control system can it perform well at different speeds. Therefore, we put considerable effort into tuning a proper look-ahead distance and designing a vehicle model that improves on traditional methods. Since the overall control performance is determined by the performance of the modules, a better control strategy is selected automatically. To enable the vehicle to track the target path quickly and accurately, we propose a nonlinear lateral controller. It fits a quintic spline curve that starts from the current pose of the vehicle and converges to the trajectory to be tracked. The goal point of the quintic spline curve is determined by the look-ahead distance.
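The quintic curve fit described above can be sketched as a boundary-value polynomial solve: choose coefficients so that the curve matches the vehicle's lateral offset, slope and second derivative at both the current pose and the goal point. The sketch below is our illustration of this standard technique, not the authors' exact formulation; the boundary triples and the look-ahead distance L are assumed inputs:

```python
import numpy as np

def quintic_coeffs(start, goal, L):
    """Solve y(x) = sum(a_i * x^i), i = 0..5, matching (y, y', y'') at
    x = 0 and x = L. start/goal are (y, dy, ddy) triples expressed in the
    vehicle frame; L is the look-ahead distance along x."""
    A = np.zeros((6, 6))
    b = np.array(start + goal, dtype=float)
    for row, x in ((0, 0.0), (3, float(L))):
        A[row] = [x**i for i in range(6)]                                   # y
        A[row + 1] = [i * x**(i - 1) if i >= 1 else 0 for i in range(6)]    # y'
        A[row + 2] = [i * (i - 1) * x**(i - 2) if i >= 2 else 0
                      for i in range(6)]                                    # y''
    return np.linalg.solve(A, b)
```

With the coefficients in hand, the curvature near the start of the curve, together with the current speed, gives the commanded steering angle via the vehicle model.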
After the quintic spline curve is generated, the required curvature towards the goal point can be determined for the vehicle, and the corresponding steering wheel angle can then be calculated from the curvature and the current speed. The quintic spline curve is updated in real time at a certain frequency, which provides a new reference for the lateral controller. In self-driving systems, the speed commands generated by the planning system may not be smooth and continuous. If such speed commands are executed directly without processing, they cause sharp changes in the output of the longitudinal controller, which affects ride comfort. Therefore, when designing the longitudinal controller, we first determine the constraint on the speed command through passengers' feedback on ride comfort. After a speed command is input to the longitudinal controller, it is processed according to the constraint condition and then executed by the controller. Different control strategies are adopted for the different dynamic characteristics of the throttle pedal and brake pedal, and throttle or brake commands that meet the comfort requirement are obtained. Compared with traditional systems, our control system is easy to implement, and it can accurately track the target path while controlling the speed comfortably and smoothly. The "Pioneer" self-driving vehicle, which is equipped with the proposed control system, participated in the 10th IVFC [23] and won first place in the total score.
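The comfort constraint on speed commands described above amounts to bounding acceleration (and in practice jerk as well) before a command reaches the pedal controllers. A minimal rate-limiter sketch, with the limits as assumed tunables rather than the paper's calibrated values:

```python
def smooth_speed(cmd, prev_speed, dt, a_max=1.5, d_max=2.0):
    """Clamp the requested speed change to comfortable acceleration and
    deceleration limits (m/s^2), so a raw step command from the planner
    becomes a ramp the throttle/brake controllers can track smoothly."""
    delta = cmd - prev_speed
    delta = min(delta, a_max * dt)    # limit acceleration
    delta = max(delta, -d_max * dt)   # limit deceleration
    return prev_speed + delta
```

Called once per control cycle, this turns a discontinuous planner command into a bounded-slope speed reference; small changes within the limits pass through unmodified.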


In this paper, we first review relevant research on self-driving vehicle control systems, then introduce our proposed control system and describe the specific details of its implementation. Following that, the performance of the control system is verified through simulated experiments and experiments on the "Pioneer" self-driving experimental platform, indicating that our control system can work effectively on real vehicles. A comparison with widely used traditional methods is presented as well. Finally, we summarize our contributions and future work.

Fig. 1. Block diagram of the proposed control system. Our control system includes a lateral controller and a longitudinal controller. The lateral controller receives the target path and current pose of vehicle, converts the path to vehicle coordinate system firstly, then generates a quintic spline curve in real time, and finally calculates a steering wheel angle command by vehicle model. The longitudinal controller smoothes the speed command, then adopts different control strategies for throttle and brake pedal.

2 Related Work

A great deal of exploration has been conducted in the construction of self-driving control systems. For solving path tracking problems, a core issue of the control system, there are quite a few available approaches, including geometric controllers [6,8,18], dynamic controllers [18], optimization controllers [13], model predictive controllers [25], and adaptive controllers [10]. The geometric controller is relatively simple to configure and implement while showing strong robustness in most driving scenarios, and is thus widely used. One of the most classic geometric controllers is based on the pure pursuit algorithm [6,19], which calculates the steering wheel angle according to a circular arc. Many self-driving platforms, such as Autoware [14], integrate the pure pursuit algorithm or its modified versions in the control system. The "Stanley Method" [20] is based on the vehicle kinematics model, considering the cross track error and orientation error. Filho et al. [15] proposed a geometric controller based on the cubic Bezier curve; however, that work did not take much into consideration the influence of tire slip, and the cubic Bezier curve only guaranteed the continuity

Mixture Modules Based Intelligent Control System for Autonomous Driving


of orientation but not that of curvature, so the curve does not exactly match the actual travel trajectory of the vehicle. Amidi [2] proposed a geometric controller based on a quintic curve that reduces discontinuous motion of the steering wheel. Dominguez [7] analyzed the performance of several geometric controllers and performed several comparative experiments. In determining the geometric controller's look-ahead distance, a traditional method is to set the look-ahead distance as a constant or as a linear or quadratic function of speed, but this usually fails to keep the controller at optimal performance. The core issue is that the trend of curvature change of the forward path segment is not considered, so the look-ahead distance cannot be adjusted in time while driving. Moreover, the relative position and orientation of the vehicle and the desired path are not considered. Ollero [17] analyzed the effect of look-ahead distance on stability in the pure pursuit algorithm. Chen et al. [4] proposed a robust fuzzy-logic-based look-ahead distance tuning strategy, which can operate with fewer related variables, but this method cannot guarantee the continuity of control, which brings certain limitations in real-car applications. In the construction of integrated self-driving control systems, Xu et al. [24] proposed a design scheme that integrates multiple adaptive and robust algorithms to improve the compatibility of the control system. Kang et al. [13] improved the lateral safety of the vehicle by combining the steering controller with a speed controller that maintains a lateral acceleration limit.

3 Lateral and Longitudinal Controller

3.1 Lateral Controller

In our control system, we define the lateral controller as the module that calculates steering wheel angle commands to keep the self-driving vehicle on the target path, based on the path information and the current state of the vehicle (see Fig. 1), where the target path is defined as a set of ordered points. Each point contains information about position, orientation and curvature. Vehicle Model. In our lateral controller, how to map the driving trajectory to the steering wheel angle of the vehicle is a key issue. For easy implementation, the model should not be too complex. Thus, we apply the Ackermann model [16] and its simplified version, the bicycle model [1] (shown in Fig. 2). The Ackermann model is the most commonly used kinematic model in vehicle modeling; in it, the centers of the curvature circles corresponding to the four wheels intersect at a point O. The bicycle model is an effective simplification of the Ackermann model, based on the following assumptions: 1. The movement of the vehicle in the vertical direction is ignored. 2. The front and rear tires of the vehicle can each be described by a single tire, centered on the front and rear axles respectively.


T. Zhang et al.

Fig. 2. Ackermann model and bicycle model. Bicycle model simplifies the four wheels of ackermann model into two wheels, centered on the front and rear axles of vehicle.

3. The rear wheel does not deflect; the control input on the steering wheel is mapped to the steering of the front wheel, and the ratio between the steering wheel angle and the front wheel angle of the vehicle is fixed.

According to the bicycle model, we get the relationship between the steering wheel angle and the curvature of the driving path:

θsteering = ksf tan−1(L/R) = ksf tan−1(Lγ),   (1)

where L is the wheelbase of the vehicle, R is the radius of the driving track, γ is the curvature of the driving track, and ksf is the steering ratio coefficient from steering wheel angle to front wheel angle. However, the bicycle model does not take into account the effects of the lateral forces on the wheels at different speeds. When driving at high speed, tire slip increases, making control performance worse than under normal conditions. We therefore introduce a curvature compensation coefficient, so that the relationship between the steering wheel angle and the curvature of the traveling track becomes:

θsteering = ksf tan−1((L + kγ vc²)γ),   (2)

where kγ is the curvature compensation coefficient, determined empirically, and vc is the current speed of the vehicle.

Tuning Look-Ahead Distance. The look-ahead maneuver simulates the driving behavior of a human driver, enabling the self-driving vehicle to calculate steering wheel angle commands in advance based on the road information ahead, thereby improving the stability of the control system. Our controller uses a single-point look-ahead model, and the look-ahead point is determined by the look-ahead distance l. As in most look-ahead-based algorithms, the selection of the look-ahead distance in our controller has a large impact on control performance. In our lateral controller, the look-ahead distance is calculated as follows:

l = tanh(γavg(kvc + l0))/γavg + ke · cte,   if γavg > 10⁻⁴
l = (kvc + l0) − γavg²(kvc + l0)³/3.0 + γavg⁴(kvc + l0)⁵/15.0 + ke · cte,   otherwise   (3)


where vc is the current speed of the vehicle, cte is the cross track error of the vehicle, and γavg is the average curvature of the path within the original look-ahead distance kvc + l0. When calculating the look-ahead distance, we thus take multiple factors into account: the current speed of the vehicle, the curvature of the path ahead, and the current cross track error. Quintic Spline Curve Fitting. In order to ensure continuous movement of the steering wheel while the vehicle tracks the path, we generate a quintic spline curve that converges from the current state [xcurrent, ycurrent, θcurrent, γcurrent]T to the state of the look-ahead point [xlookahead, ylookahead, θlookahead, γlookahead]T on the target path. Compared to circular arcs and cubic Bezier curves, the curvature of the quintic spline curve is always continuous, thus improving the stability of the steering wheel.
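As a minimal sketch, the look-ahead distance of Eq. (3) and the speed-compensated steering relation of Eq. (2) can be written as below. The parameter values (k, l0, ke, ksf, kγ) are illustrative assumptions, not the tuned values used on the real vehicle; the second branch of Eq. (3) is the Taylor expansion of the first, used to avoid dividing by a near-zero average curvature.

```python
import math

def steering_angle(curvature, speed, wheelbase=2.85, k_sf=14.8, k_gamma=0.1):
    """Speed-compensated steering command of Eq. (2).
    k_gamma (curvature compensation coefficient) is an illustrative value."""
    return k_sf * math.atan((wheelbase + k_gamma * speed ** 2) * curvature)

def look_ahead_distance(speed, avg_curvature, cte, k=0.5, l0=3.0, k_e=0.2):
    """Look-ahead distance of Eq. (3): tanh(g * b) / g + ke * cte, with a
    series-expansion fallback when the average curvature g is near zero."""
    base = k * speed + l0
    if abs(avg_curvature) > 1e-4:
        return math.tanh(avg_curvature * base) / avg_curvature + k_e * cte
    g2 = avg_curvature ** 2
    return base - g2 * base ** 3 / 3.0 + g2 ** 2 * base ** 5 / 15.0 + k_e * cte
```

Note how a larger average curvature ahead shrinks the look-ahead distance (tanh saturates), while the cross-track-error term ke · cte adjusts it when the vehicle is off the path.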

Fig. 3. Quintic spline curve generated by the lateral controller, which converges from the current state of the vehicle to the state of the look-ahead point on the path.

In the vehicle coordinate system, the current state of the vehicle is [0, 0, 0, tanδc/(L + kγ vc²)]T, where δc is the front wheel angle of the vehicle, calculated as θsteering/ksf. The state of the look-ahead point is [xl, yl, θl, γl]T, where xl, yl, θl, γl are the position, orientation and curvature of the look-ahead point converted to the vehicle coordinate system. Following [2], in solving the polynomial we need to assume that the absolute value of θl is less than π/2, and we can describe x as a quintic polynomial in y:

x(y) = a0 + a1 y + a2 y² + a3 y³ + a4 y⁴ + a5 y⁵.   (4)

According to the constraints on x(y), ẋ(y) and ẍ(y) at the current state and the state of the look-ahead point, the quintic polynomial can be solved as follows [2]:

x(y) = a2 y² + (10k1 − 4k2 + k3/2) y³ + (−15k1 + 7k2 − k3) y⁴ + (6k1 − 3k2 + k3/2) y⁵,   (5)

where a2 = −tanδc/(2(L + kγ vc²)), k1 = xl + tanδc/(2(L + kγ vc²)), k2 = −tanθl + tanδc/(L + kγ vc²), and k3 = −γl sec³θl + tanδc/(L + kγ vc²). The solved quintic spline curve is shown in Fig. 3.
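As a sketch, the coefficients of Eq. (5) (with y normalized to [0, 1]) can be computed as below. Function and parameter names are illustrative, and the default kγ is an assumed value, not the paper's calibrated one.

```python
import math

def quintic_coeffs(x_l, theta_l, gamma_l, delta_c, L=2.85, k_gamma=0.1, v_c=0.0):
    """Coefficients [a0..a5] of x(y) in Eq. (5), with y normalized to [0, 1].
    delta_c is the current front wheel angle; (x_l, theta_l, gamma_l)
    describe the look-ahead point in the vehicle frame."""
    a2 = -math.tan(delta_c) / (2.0 * (L + k_gamma * v_c ** 2))
    k1 = x_l - a2                          # x_l + tan(delta_c)/(2(L + k_gamma v_c^2))
    k2 = -math.tan(theta_l) - 2.0 * a2
    k3 = -gamma_l / math.cos(theta_l) ** 3 - 2.0 * a2
    return [0.0, 0.0, a2,
            10 * k1 - 4 * k2 + k3 / 2,
            -15 * k1 + 7 * k2 - k3,
            6 * k1 - 3 * k2 + k3 / 2]

def eval_poly(coeffs, y):
    """Evaluate the polynomial at a normalized arc parameter y."""
    return sum(c * y ** i for i, c in enumerate(coeffs))
```

By construction x(0) = ẋ(0) = 0 and x(1) = xl, so the curve leaves the vehicle's current pose and lands exactly on the look-ahead point.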


Calculate the Steering Wheel Angle. After obtaining the quintic spline curve, the ideal driving curvature of the vehicle needs to be determined. The curvature corresponding to a position ahead of the current pose is used as the driving curvature, so that the problem caused by mechanical delay can be solved. Then, according to Eq. 2, we get the steering wheel angle command.

3.2 Longitudinal Controller

The structure of our longitudinal controller is shown in Fig. 1. In our control system, the longitudinal controller is the module that receives the speed commands sent by the planning system and, based on the real-time state of the vehicle, calculates the throttle pedal percentage and brake pedal percentage and sends them to the vehicle, so that the speed gradually approaches the required value with high ride comfort. The work [11] pointed out that ride comfort is strongly correlated with acceleration and jerk. However, in experiments we found that as long as the longitudinal acceleration and jerk are within a certain range, ride comfort can be guaranteed in most cases. Therefore, we first preprocess the raw input speed command vcmd to get a new speed command vcomfort that meets the acceleration and jerk constraints:

vcomfort = ∫ acomfort dt,   (6)

where

acomfort = aul,             if ∫ jcomfort dt > aul
           all,             if ∫ jcomfort dt < all   (7)
           ∫ jcomfort dt,   otherwise

jcomfort = jul,     if v̈cmd > jul
           jll,     if v̈cmd < jll   (8)
           v̈cmd,    otherwise

where aul and all are the upper and lower limits of acceleration, and jul and jll are the upper and lower limits of jerk; the constraints are selected to take the passenger's ride comfort requirements and the controller's dynamic performance into account while ensuring safety. In practical applications, the constraints can also be switched flexibly according to the driving scenario. Since it is not allowed to depress the throttle pedal and brake pedal simultaneously, after feeding acomfort to the controller as an input signal we calculate espeed = vcomfort − vc. If espeed is greater than zero, the throttle sub-controller is used to calculate the throttle pedal percentage and the brake pedal percentage is not calculated; if espeed is less than zero, the brake sub-controller is used to calculate the brake pedal percentage and the throttle pedal percentage is not calculated. The resulting throttle pedal percentage and brake pedal percentage outputs are as follows:

uthrottle = Kpt ((Ka espeed − ac) + ∫ (Ka espeed − ac)/Tit dt),   if espeed > 0
            0,                                                    otherwise   (9)

ubrake = 0,                                                       if espeed > 0
         Kpb ((KT espeed − Tc) + ∫ (KT espeed − Tc)/Tib dt),      otherwise   (10)

where Kpt, Kpb, Tit and Tib are the parameters of the PI controllers, ac is the current acceleration of the vehicle, and Tc is the current braking torque of the vehicle.
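A discrete-time sketch of this longitudinal pipeline is given below: the raw speed command is smoothed under acceleration and jerk limits in the spirit of Eqs. (6)-(8), and the pedal split of Eqs. (9)-(10) ensures throttle and brake are never applied together. All limit values and the simplified single-error pedal logic are illustrative assumptions, not the tuned controller of the paper.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

class SpeedSmoother:
    """Discrete version of Eqs. (6)-(8): jerk is clamped first, then the
    integrated acceleration, so v_comfort approaches v_cmd comfortably."""
    def __init__(self, a_limits=(-3.0, 2.0), j_limits=(-2.0, 2.0), dt=0.05):
        self.a_ll, self.a_ul = a_limits
        self.j_ll, self.j_ul = j_limits
        self.dt = dt
        self.v = 0.0   # v_comfort
        self.a = 0.0   # a_comfort

    def step(self, v_cmd):
        a_des = (v_cmd - self.v) / self.dt          # acceleration that would reach v_cmd now
        jerk = clamp((a_des - self.a) / self.dt, self.j_ll, self.j_ul)
        self.a = clamp(self.a + jerk * self.dt, self.a_ll, self.a_ul)
        self.v += self.a * self.dt
        return self.v

def pedal_split(e_speed, u_pi):
    """Mutually exclusive pedals as in Eqs. (9)-(10): throttle only when
    e_speed > 0, brake only otherwise; u_pi is the PI controller output."""
    if e_speed > 0:
        return max(u_pi, 0.0), 0.0       # (throttle, brake)
    return 0.0, max(-u_pi, 0.0)
```

Even for a step change in the commanded speed, the smoothed speed ramps up with bounded acceleration and jerk, which is what keeps the ride comfortable.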

4 Experiments and Results

4.1 Experimental Platform

Our real-car experimental platform "Pioneer" is shown in Fig. 4(a). "Pioneer" is based on the 2017 Lincoln MKZ hybrid model. It is equipped with a drive-by-wire system and communicates with an IPC through the CAN bus. Developers can use the IPC to send control commands for the throttle pedal percentage, brake pedal percentage, steering wheel angle and gear status, while receiving vehicle information in real time. Our experimental platform is equipped with a high-precision integrated navigation system for accurate position estimation. The software platform is developed based on ROS (Robot Operating System). Specific parameters of our vehicle are shown in Table 1. Besides, we also conducted experiments on a simulation platform [5]. The simulation platform includes a simulated vehicle model and simulated roads to reproduce real vehicle motion as faithfully as possible, as shown in Fig. 4(b). As with a real vehicle, we can control the driving of the simulated vehicle through the control system to achieve safe and efficient algorithm testing.


Fig. 4. (a) The self-driving experimental platform “Pioneer”, based on 2017 Lincoln MKZ hybrid model. (b) The simulation platform based on Gazebo. Various experimental scenarios can be built in this simulation platform.

4.2 Path Tracking Experiment

In the path tracking experiment, we conducted a series of comparative experiments to verify the effectiveness of our proposed lateral control method.

Table 1. Vehicle parameters.

Parameter                                   Value
Maximum horsepower                          253 Ps
Maximum brake torque                        3400 Nm
Range of steering wheel angle output        −470° to 470°
Resolution of steering wheel angle output   0.1°
Maximum rotation speed of steering wheel    500°/s
Steering ratio                              14.8 : 1
Vehicle wheelbase                           2850 mm

The experimental path on campus is shown in Fig. 5 and is suitable for testing the performance of controllers in curvature-changing scenarios. We chose two widely used path tracking algorithms, Pure Pursuit [6] and the Stanley Method [20], to compare with our method by having the real car drive the entire road. The cross track error and orientation error box plots of the different methods, and an excerpt of the steering wheel angle while crossing the U-turn, are shown in Figs. 6 and 7. When using our method, the distributions of cross track error and orientation error are more concentrated. In addition, our method produces less jitter, and the actual steering wheel movement is closer to that of a human driver. On the simulation platform, we carried out the same comparative experiments. The experimental results are shown in Figs. 8 and 9. Similar to the real-vehicle results, our approach has the best performance.

Fig. 5. Target path in path tracking experiment. The vehicle needs to track the path on which there are scenarios that require maneuvers including straight-line driving, quarter-turn, U-turn, etc.


Fig. 6. Cross track error and orientation error in realistic environment.

Fig. 7. Excerpt of steering wheel angle while crossing U-turn in real environment.

Fig. 8. Cross track error and orientation error in simulation.

Fig. 9. Excerpt of steering wheel angle while crossing U-turn in simulation.

4.3 Speed Control Experiment

The speed control experiment on the real car is carried out on a straight two-lane road, with the planning system sending changing speed commands to the longitudinal controller. The experimental results in the realistic environment are shown in Fig. 10. Although the speed command issued by the planning system changes drastically, the longitudinal controller guides the vehicle to reach the reference speed smoothly. Meanwhile, there is no significant overshoot in the speed control process. The corresponding experimental results in simulation are shown in Fig. 11.

Fig. 10. Experimental results of speed control in realistic environment.

Fig. 11. Experimental results of speed control in simulation.

5 Conclusion

This paper proposes a mixture-modules-based self-driving control system that guarantees both accuracy and ride comfort, comprising a lateral controller that controls the steering wheel angle and a longitudinal controller that controls the vehicle speed. By improving the vehicle model and the look-ahead distance tuning strategy, and by implementing a geometric controller based on a quintic spline curve, smoother output and better control performance are achieved. This paper presents the implementation details of the lateral and longitudinal controllers, with experimental verification carried out both in a realistic environment and in simulation. The experimental results show that the performance of our control system is superior. In future work, we plan to further improve the generality and robustness of the controller to better meet the requirements of different vehicles and different scenarios.


References

1. Amer, N.H., Zamzuri, H., Hudha, K., Kadir, Z.A.: Modelling and control strategies in path tracking control for autonomous ground vehicles: a review of state of the art and challenges. J. Intell. Rob. Syst. 86(2), 225–254 (2017)
2. Amidi, O., Thorpe, C.E.: Integrated mobile robot control. In: Mobile Robots V, vol. 1388, pp. 504–524. International Society for Optics and Photonics (1991)
3. Cai, L., Rad, A.B., Chan, W.L.: An intelligent longitudinal controller for application in semiautonomous vehicles. IEEE Trans. Ind. Electron. 57(4), 1487–1497 (2010)
4. Chen, L., Liu, N., Shan, Y., Chen, L.: A robust look-ahead distance tuning strategy for the geometric path tracking controllers. In: 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 262–267. IEEE (2018)
5. Chen, Y., Chen, S., Zhang, T., Zhang, S., Zheng, N.: Autonomous vehicle testing and validation platform: integrated simulation system with hardware in the loop. In: 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 949–956. IEEE (2018)
6. Coulter, R.C.: Implementation of the pure pursuit path tracking algorithm. Technical report, Carnegie Mellon University, Robotics Institute, Pittsburgh, PA (1992)
7. Dominguez, S., Ali, A., Garcia, G., Martinet, P.: Comparison of lateral controllers for autonomous vehicle: experimental results. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 1418–1423. IEEE (2016)
8. Girbés, V., Armesto, L., Tornero, J., Solanes, J.E.: Continuous-curvature kinematic control for path following problems. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4335–4340. IEEE (2011)
9. Hemami, A., Mehrabi, M.: On the steering control of automated vehicles. In: Proceedings of Conference on Intelligent Transportation Systems, pp. 266–271. IEEE (1997)
10. Hessburg, T., Tomizuka, M.: Fuzzy logic control for lateral vehicle guidance. IEEE Control Syst. Mag. 14(4), 55–63 (1994)
11. Huang, Q., Wang, H.: Fundamental study of jerk: evaluation of shift quality and ride comfort. Technical report, SAE Technical Paper (2004)
12. Jo, K., Kim, J., Kim, D., Jang, C., Sunwoo, M.: Development of autonomous car – part II: a case study on the implementation of an autonomous driving system based on distributed architecture. IEEE Trans. Ind. Electron. 62(8), 5119–5132 (2015)
13. Kang, J., Hindiyeh, R.Y., Moon, S.W., Gerdes, J.C., Yi, K.: Design and testing of a controller for autonomous vehicle path tracking using GPS/INS sensors. IFAC Proc. Vol. 41(2), 2093–2098 (2008)
14. Kato, S., Takeuchi, E., Ishiguro, Y., Ninomiya, Y., Takeda, K., Hamada, T.: An open approach to autonomous vehicles. IEEE Micro 35(6), 60–68 (2015)
15. Massera Filho, C., Wolf, D.F., Grassi, V., Osório, F.S.: Longitudinal and lateral control for autonomous ground vehicles. In: 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 588–593. IEEE (2014)
16. Mitchell, W.C., Staniforth, A., Scott, I.: Analysis of Ackermann steering geometry. Technical report, SAE Technical Paper (2006)
17. Ollero, A., Heredia, G.: Stability analysis of mobile robot path tracking. In: Proceedings 1995 IEEE/RSJ International Conference on Intelligent Robots and Systems: Human Robot Interaction and Cooperative Robots, vol. 3, pp. 461–466. IEEE (1995)


18. Rossetter, E.J.: A potential field framework for active vehicle lanekeeping assistance. Ph.D. thesis, Stanford University (2003)
19. Snider, J.M., et al.: Automatic steering methods for autonomous automobile path tracking. Technical report CMU-RI-TR-09-08, Robotics Institute, Pittsburgh, PA (2009)
20. Thrun, S., et al.: Stanley: the robot that won the DARPA Grand Challenge. J. Field Rob. 23(9), 661–692 (2006)
21. Urmson, C., et al.: Autonomous driving in urban environments: Boss and the Urban Challenge. J. Field Rob. 25(8), 425–466 (2008)
22. Watzenig, D., Horn, M.: Automated Driving: Safer and More Efficient Future Driving. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31895-0
23. Xin, J., Wang, C., Zhang, Z., Zheng, N.: China future challenge: beyond the intelligent vehicle. IEEE Intell. Transp. Syst. Soc. Newslett. 16(2), 8–10 (2014)
24. Xu, L., Wang, Y., Sun, H., Xin, J., Zheng, N.: Integrated longitudinal and lateral control for Kuafu-II autonomous vehicle. IEEE Trans. Intell. Transp. Syst. 17(7), 2032–2041 (2016)
25. Yu, R., Guo, H., Sun, Z., Chen, H.: MPC-based regional path tracking controller design for autonomous ground vehicles. In: 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2510–2515. IEEE (2015)

Biomedical AI

An Adaptive Temporal-Causal Network Model for Stress Extinction Using Fluoxetine

S. Sahand Mohammadi Ziabari
Social AI Group, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
[email protected]

Abstract. In this paper, an adaptive temporal-causal network model based on a drug therapy, fluoxetine, to decrease the stress level in post-traumatic stress disorder is presented. Stress extinction is activated by a cognitive drug therapy (here fluoxetine) involving continuous use of the medicine. The aim of this therapy is to reduce the connectivity between certain components inside the brain which are responsible for causing stress. This computational model aims to realistically demonstrate the activation of different portions of the brain when the therapy is applied. The cognitive model starts with a situation of strong and continuous stress in an individual; after using fluoxetine the stress level begins to decrease over time. As a result, the patient will have a reduced stress level compared to not using the drug.

Keywords: Temporal-causal network model · Cognitive · Extreme emotion · Drug-therapy · Fluoxetine

1 Introduction

Stress is a vital response to physical and emotional threats with strong roots in human evolution. Stress is important to protect humans from dangerous conditions, where in early history it could have life-or-death consequences. A particular situation might trigger a fight-or-flight reaction, which could result in unnecessarily avoiding certain (social) circumstances. As described in [1], depression is one of the most grueling psychiatric illnesses and may decrease life expectancy by up to 20%. Recent literature [8] shows that fluoxetine suppresses or decreases synaptic changes associated with stress. It has also been noted that fluoxetine relatively suppresses the impact of stress on synaptic plasticity in the medial prefrontal cortex, which receives direct fibers from the hippocampus [5, 6]. Several temporal-causal network-oriented models for decreasing stress have been proposed previously [20–30]. The paper is organized as follows. In Sect. 2 the underlying neurological principles concerning the parts of the brain involved in stress, and in the suppression of stress, are addressed. In Sect. 3 the integrative temporal-causal network model is introduced. In Sect. 4 the results of the simulation model are discussed, in Sect. 5 the mathematical analysis of the model is presented, and in the last section a conclusion is given.

© IFIP International Federation for Information Processing 2019 Published by Springer Nature Switzerland AG 2019 J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 107–119, 2019. https://doi.org/10.1007/978-3-030-19823-7_8


2 Underlying Neurological Principles

Much recent research [12, 13] has shown that fluoxetine decreases or suppresses changes in synapses caused by stress. Several studies [55–57] have shown that repeated stressful conditions and experiences have a remarkable effect on neural plasticity in many brain components, especially in limbic structures such as the hippocampus, prefrontal cortex (PFC), and amygdala. As clearly stated in [9, 13]:
‘Acute stress inhibits long-term potentiation (LTP) at synapses from the hippocampus to prefrontal cortex in the rat, a model of the dysfunction in the anterior cingulate/orbitofrontal cortices which has been observed in human depression. In major depressive disorder, decreased blood flow and metabolism have been regularly described in multiple areas of the prefrontal cortex (PFC) with occasional changes in the hippocampal region. Conversely, a beneficial response to antidepressants has been associated with reduced blood flow in the hippocampus and a return to baseline metabolism level or increase in blood flow in the anterior cingulate cortex. Plasticity at hippocampal to PFC synapses can be regulated up and down, as assessed by long-term potentiation (LTP) and long-term depression (LTD), depending on specific patterns of afferent activation and this circuit contributes to working memory processes. Antidepressant effects may be obtained by several mechanisms, such as inhibition of serotonin uptake, for fluoxetine.’ [13]

Also, previous studies [15, 16] revealed that chronic stress changes dendritic morphology not just in the hippocampus, but also in the mPFC (medial Prefrontal Cortex). ‘Depression is said to be caused by chronically low levels of serotonergic transmission. SSRIs interfere with the activity of the serotonin transporter (5-HTT), a reuptake molecule that removes serotonin from the synapses. The putative low levels of synaptic serotonin in the depressed patient are elevated, and depression is relieved. These manipulations of serotonin levels have little effect on mood except in individuals who are depressed or recently recovered from depression.’ [11, p. 1] ‘Studies of neurotransmitter release with microdialysis have demonstrated that acute olanzapine significantly increases both dopamine and norepinephrine levels in rat prefrontal cortex, nucleus accumbens, and striatum, and the combination of olanzapine plus fluoxetine produces a greater increase in levels of dopamine and norepinephrine in the rat prefrontal cortex than fluoxetine alone.’ [2], p. 776.

The effect of chronic stress on brain regions, particularly the hippocampus, is described in [3, 4]:
‘The volume of the hippocampus is decreased in patients with depression or posttraumatic stress disorder.’ [3], p. 975
‘The reduction in hippocampal volume is inversely proportional to the amount of time a patient is medicated with an antidepressant, and reduced hippocampal volume is partially reversed after antidepressant treatment.’ [4], p. 577
‘In the striatum, there was a tendency for an increase in the number of BrdU-positive cells that is similar in magnitude to that in hippocampus. This effect is consistent with highly significant and robust induction of cell proliferation reported in a recent study, and the greater increase


could be due to the higher dose of olanzapine used (10 mg/kg) relative to the current study (2 mg/kg). In the current study, we found that the combination of olanzapine plus fluoxetine did not produce a greater increase in the number of BrdU-positive cells than either drug alone. This suggests that fluoxetine alone would be sufficient to produce a maximum response and therefore could not account for the augmentation that has been observed clinically; however, the clinical approach has been to add olanzapine after a patient has failed to respond to an SSRI like fluoxetine’ [4], p. 577.

Many studies illustrate that the hippocampus and other regions of the medial temporal lobe are involved in novelty detection [7, 16, 17]. Other research [18] has shown that the use of antidepressant drugs (ADs) enhances the levels of extracellular epinephrine and serotonin.
‘The ability to detect unusual events occurring in the environment is essential for survival. Several studies have pointed to the hippocampus as a key brain structure in novelty detection, a claim substantiated by its wide access to sensory information through the entorhinal cortex and also distinct aspects of its intrinsic circuitry.’ [15], p. 18286

In [10] it has been shown that small amounts of fluoxetine might block stress-facilitated hippocampal LTD and eventually help with memory retrieval impairment.
‘Chronic fluoxetine treatment reinstates ocular dominance plasticity in the primary visual cortex of adult rats, a form of developmentally regulated plasticity that is significantly reduced in the mature brain, and enhances long-term potentiation (LTP) in the dentate gyrus of adult mice. Results show that chronic fluoxetine treatment suppresses LTP in the primary auditory cortex and hippocampus of adult rats. It has been well documented that exposure to acute stress impairs LTP and facilitate LTD in rats, as well as to produce learning and memory impairment in rats and monkeys. A single systematic injection of fluoxetine is able to reverse the impairment in LTP at synapses from the hippocampus to prefrontal cortex in the rats, caused by stress on elevated platform.’ [10], p. 1

In [10, p. 7] the influence of antidepressant agents such as fluoxetine is clearly described: ‘The chronic effects of antidepressant agents including fluoxetine, are involved in the regulation of intracellular transduction pathways, implicating changes in the cyclic adenosine monophosphate (cAMP) second messenger system, cAMP response element binding protein (CREB) and brain-derived neurotrophic factor (BDNF) in antidepressant action.’

In [14] the effect of chronic stress on the medial prefrontal cortex and amygdala is explicitly described:
‘Chronic stress significantly suppressed cytogenesis in the mPFC and neurogenesis in the dentate gyrus, but had minor effect in nonlimbic structures. Fluoxetine treatment counteracted the inhibitory effect of stress. Hemispheric comparison revealed that the rate of cytogenesis was significantly higher in the left mPFC of control animals, whereas stress inverted this asymmetry, yielding a significantly higher incidence of newborn cells in the right mPFC. Fluoxetine treatment abolished hemispheric asymmetry in both control and stressed animals. Structural alterations including suppressed dentate neurogenesis may contribute to the pathogenesis of depression. Antidepressant treatment with fluoxetine or electroconvulsive seizure modulates cell proliferation not only in the dentate gyrus, but also in the medial PFC (mPFC) in adult rats.’ (p. 1490)


3 The Temporal-Causal Network Model

First the Network-Oriented Modelling approach used to model the integrative overall process is briefly explained. As discussed in detail in [17, Ch. 2, 18, 19], this approach is based on temporal-causal network models, which can be represented at two levels: by a conceptual representation and by a numerical representation. A conceptual representation of a temporal-causal network model in the first place involves representing, in a declarative manner, states and the connections between them that represent (causal) impacts of states on each other, as assumed to hold for the application domain addressed. The states are assumed to have (activation) levels that vary over time. In reality, not all causal relations are equally strong, so some notion of strength of a connection is used. Furthermore, when more than one causal relation affects a state, some way to aggregate multiple causal impacts on that state is used. Moreover, a notion of speed of change of a state is used for the timing of the processes. These three notions form the defining part of a conceptual representation of a temporal-causal network model:
• Strength of a connection ωX,Y. Each connection from a state X to a state Y has a connection weight value ωX,Y representing the strength of the connection, often between 0 and 1, but sometimes also below 0 (negative effect) or above 1.
• Combining multiple impacts on a state cY(..). For each state, (a reference to) a combination function cY(..) is chosen to combine the causal impacts of other states on state Y.
• Speed of change of a state ηY. For each state Y a speed factor ηY is used to represent how fast the state changes upon causal impact.
Combination functions can have different forms, as there are many different possible approaches to combining multiple impacts. Therefore, the Network-Oriented Modelling approach based on temporal-causal networks incorporates for each state, as a kind of label or parameter, a way to specify how multiple causal impacts on this state are aggregated by some combination function. For this aggregation a number of standard combination functions are available as options, and a number of desirable properties of such combination functions have been identified. Figure 1 shows the conceptual representation of the temporal-causal network model. The components of the conceptual representation shown in Fig. 1 are explained here. The state wsc is the world state of the contextual stimulus c. The states ssc and ssee are the sensor states for the context c and for the body state ee of the extreme emotion. The states srsc and srsee are the sensory representations of the contextual stimulus c and the extreme emotion, respectively. The state srsee is a stimulus influencing the activation level of the preparation state. Furthermore, psee is the preparation state of an extreme emotional response to the sensory representation srsc of the context c, and fsee is the feeling state associated with this extreme emotion. The state esee indicates the execution of the body state for the extreme emotion. All these relate to the affective processes. The (cognitive) goal state represents the goal of absorbing fluoxetine in the body. The (cognitive) state pspil is the preparation state of taking a pill (here fluoxetine). The state espil is the execution state of taking the pill (fluoxetine). The other states relate to biological brain parts (Norepinephrine, Hippocampus, Thalamus,
Therefore, the Network-Oriented Modelling approach based on temporal-causal networks incorporates for each state, as a kind of label or parameter, a way to specify how multiple causal impacts on this state are aggregated by some combination function. For this aggregation a number of standard combination functions are available as options, and a number of desirable properties of such combination functions have been identified. Figure 1 shows the conceptual representation of the temporal-causal network model. The components of the conceptual representation shown in Fig. 1 are explained here. The state wsc is the world state of the contextual stimulus c. The states ssc and ssee are the sensor state for the context c and the sensor state of the body state ee for the extreme emotion. The states srsc and srsee are the sensory representations of the contextual stimulus c and of the extreme emotion, respectively. The state srsee is a stimulus influencing the activation level of the preparation state. Furthermore, psee is the preparation state of an extreme emotional response to the sensory representation srsc of the context c, and fsee is the feeling state associated with this extreme emotion. The state esee indicates the execution of the body state for the extreme emotion. All these relate to the affective processes. The (cognitive) goal state represents the goal of absorbing fluoxetine in the body. The (cognitive) state pspil is the preparation state of taking a pill (here fluoxetine), and espil is the execution state of taking the pill. The other states relate to biological brain parts (Norepinephrine, Hippocampus, Thalamus, Serotonin, Prefrontal Cortex, Amygdala, Lateral Cerebellum, Striatum) which are involved in the stress condition and in the influence of the fluoxetine applied (Table 1).

Table 1. Explanation of the states in the model

X1   wsee                 World (body) state of extreme emotion ee
X2   ssee                 Sensor state of extreme emotion ee
X3   wsc                  World state for context c
X4   ssc                  Sensor state for context c
X5   srsee                Sensory representation state of extreme emotion ee
X6   srsc                 Sensory representation state of context c
X7   fsee                 Feeling state for extreme emotion ee
X8   psee                 Preparation state for extreme emotion ee
X9   esee                 Execution state (bodily expression) of extreme emotion ee
X10  goal                 Goal of using fluoxetine
X11  pspil                Preparation state of using pill
X12  espil                Execution of using pill
X13  Norepinephrine       Brain part
X14  Hippocampus          Brain part
X15  Thalamus             Brain part
X16  Serotonin            Brain part
X17  Prefrontal Cortex    Brain part
X18  Amygdala             Brain part
X19  Lateral Cerebellum   Brain part
X20  Striatum             Brain part
X21  psact                Preparation of action inside the brain

The connection weights ωi in Fig. 1 are as follows. The sensor states ssee and ssc have incoming connections from wsee and wsc (weights ω1, ω2). The world state of extreme emotion wsee has one arriving connection, from esee with weight ω11, forming a body loop. The sensory representation state of the extreme emotion srsee has an incoming connection with weight ω8 from the preparation state psee. The feeling state fsee has an incoming connection with weight ω5 from srsee. The preparation state psee has two incoming connections with weights ω36 and ω37, from the states Striatum and psact, respectively, and three outgoing connections, to esee, to the Thalamus, and to the connection between the states Hippocampus and Prefrontal Cortex (ω10, ω15, ω20, respectively). The goal has one arriving connection, with weight ω27, from the sensory representation srsee, and the preparation state pspil has an incoming connection from the goal with weight ω28. The execution of taking the drug (here fluoxetine) is named espil and has an incoming connection with weight ω29 from the preparation state pspil. The state Thalamus has three incoming connections with weights ω12, ω14 and ω22, from the preparation state psee, the Hippocampus and the Amygdala, respectively. The Norepinephrine state has an arriving connection with weight ω16 from espil. The Hippocampus has four incoming connections with weights ω17, ω24, ω15 and ω13, from Serotonin, the Prefrontal Cortex, Norepinephrine and the Thalamus. Note that the connection between the states Prefrontal Cortex and Hippocampus is adaptive and changes over time through Hebbian learning. The state Serotonin has an arriving connection from espil (ω19). The Prefrontal Cortex has an incoming connection with weight ω25 from the Amygdala and three outgoing connections, to the Amygdala, the Lateral Cerebellum and psact (ω26, ω30 and ω31). The state Lateral Cerebellum has two incoming connections, from the Prefrontal Cortex and the Amygdala (ω30, ω32), and an outgoing connection to the Striatum (ω34). The state Striatum has two outgoing connections, to psact and psee (ω35, ω36). Finally, the state psact has an outgoing connection to psee with weight ω37.

Fig. 1. Conceptual representation of the temporal-causal network model


This conceptual representation was transformed into a numerical representation as follows [17, Ch. 2, 18, 19]:

• At each time point t, each state Y in the model has a real number value in the interval [0, 1], denoted by Y(t).
• At each time point t, each state X connected to state Y has an impact on Y defined as impactX,Y(t) = ωX,Y X(t), where ωX,Y is the weight of the connection from X to Y.
• The aggregated impact of multiple states Xi on Y at t is determined using a combination function cY(..):

aggimpactY(t) = cY(impactX1,Y(t), ..., impactXk,Y(t)) = cY(ωX1,Y X1(t), ..., ωXk,Y Xk(t))

where the Xi are the states with connections to state Y.
• The effect of aggimpactY(t) on Y is exerted over time gradually, depending on the speed factor ηY:

Y(t + Δt) = Y(t) + ηY [aggimpactY(t) − Y(t)] Δt   or   dY(t)/dt = ηY [aggimpactY(t) − Y(t)]

• Thus, the following difference and differential equations for Y are obtained:

Y(t + Δt) = Y(t) + ηY [cY(ωX1,Y X1(t), ..., ωXk,Y Xk(t)) − Y(t)] Δt
dY(t)/dt = ηY [cY(ωX1,Y X1(t), ..., ωXk,Y Xk(t)) − Y(t)]

For the states the following combination functions cY(...) were used: the identity function id(.) for states with impact from only one other state, and for states with multiple impacts the scaled sum function ssumλ(...) with scaling factor λ and the advanced logistic sum function alogisticσ,τ(...) with steepness σ and threshold τ:

id(V) = V
ssumλ(V1, ..., Vk) = (V1 + ... + Vk)/λ
alogisticσ,τ(V1, ..., Vk) = [1/(1 + e^(−σ(V1 + ... + Vk − τ))) − 1/(1 + e^(στ))] (1 + e^(−στ))
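These update equations can be simulated directly with an Euler loop. The sketch below (Python rather than the authors' Matlab environment [22]; the toy two-state chain, its weight and its speed factor are illustrative assumptions, not values from the paper) implements the combination functions and one Euler step:

```python
import math

def ssum(lam, impacts):
    # scaled sum: ssum_lambda(V1, ..., Vk) = (V1 + ... + Vk) / lambda
    return sum(impacts) / lam

def alogistic(sigma, tau, impacts):
    # advanced logistic sum with steepness sigma and threshold tau;
    # rescaled so that zero impact maps to 0 and the upper limit is 1
    v = sum(impacts)
    return ((1.0 / (1.0 + math.exp(-sigma * (v - tau)))
             - 1.0 / (1.0 + math.exp(sigma * tau)))
            * (1.0 + math.exp(-sigma * tau)))

def euler_step(y, eta, aggimpact, dt):
    # Y(t + dt) = Y(t) + eta_Y * [aggimpact_Y(t) - Y(t)] * dt
    return y + eta * (aggimpact - y) * dt

# toy chain: a constant world state ws drives a sensor state ss through a
# connection of weight 1 with the identity combination function
ws, ss, eta, dt = 1.0, 0.0, 0.5, 1.0
for _ in range(30):
    ss = euler_step(ss, eta, 1.0 * ws, dt)
# ss gradually converges to its aggregated impact (here 1.0)
```

At a fixed point the state value equals its aggregated impact, which is exactly the stationary-point condition exploited in Sect. 5.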

4 Example Simulation

The simulation results of the cognitive temporal-causal network model, which was constructed based on neuroscientific knowledge containing qualitative empirical information (such as fMRI findings), both for the mechanisms by which the brain components work and for the emerging result of the processes, are shown in Fig. 2. From these results one can infer that, given the use of fluoxetine, the stress level declines as intended. The model was simulated using the Matlab environment implemented in [22]. Choosing appropriate connection weights makes the model numerical and adapted to the qualitative empirical information. Table 2 shows the connection weights that were used; the listed value for the adaptive connection is an initial value, as this weight changes over time. The time step was Δt = 1. The scaling factors λi for the nodes with more than one incoming connection are also given in Table 2. At first, an external world state of the stress-inducing context c (represented by wsc) influences the affective internal states of the individual, triggering the emotional response esee (via ssc, srsc, and psee), which manifests the extreme emotion in the body state wsee. As a consequence, the stressed individual senses the extreme emotion (and at the same time the activations of all biological brain components increase over time); as a next, cognitive step, the goal to decrease this stress level by using fluoxetine becomes active at around time 300.

Table 2. Connection weights and scaling factors for the example simulation

Connection weight  ω1   ω2   ω3   ω4   ω5   ω6   ω7   ω8   ω10  ω11  ω12  ω13
Value              1    1    1    1    1    1    1    1    1    1    1    1

Connection weight  ω14  ω15  ω16  ω17  ω18   ω19  ω20  ω21  ω22  ω23  ω24  ω25
Value              1    1    1    1    -0.7  1    0.7  1    1    0.4  0.4  1

Connection weight  ω26  ω27  ω28  ω29  ω30  ω31  ω32  ω33  ω34  ω35  ω36   ω37
Value              1    1    1    1    1    1    1    1    1    1    -0.9  -0.9

State  X5   X8   X14  X15  X17  X18  X19  X20  X21
λi     2    3    3.4  3    1.4  2    2    2    2

Fig. 2. Simulation results for temporal-causal network modeling of the therapy by fluoxetine


As a biological process, the goal and, in further steps, the execution of taking the drug trigger the suppression of the execution of stress, which in turn makes the other brain components less active from around time 300 and lowers the stress level around time 600. However, this effect is only temporary: as the stressful context c is still present all the time, after a while the stress level goes up again, which again leads to activation of the goal and another intake of fluoxetine, and so on, until the person or the doctor decides to stop taking the drug. The fluctuation in Fig. 2 shows how, as in real life, repeated use of the medicine (here fluoxetine) decreases the stress level with each intake. It is worth noting that all of this fluctuation is produced internally by the model; the environment is constant, and the only external input to the model is the constant world state wsc. The simulation results therefore show that the model for the drug therapy (fluoxetine) works as expected. Figure 3 shows the equilibrium situation in which no goal (intake) is active. When there is no intake (the goal is blocked in an artificial manner here), the stress level and the activity of the brain parts go up and stay high.

Fig. 3. Simulation results for the equilibrium state without drug intake

The adaptive connection (Hebbian learning) between the two brain parts Hippocampus and Prefrontal Cortex, and its suppression, is shown in Fig. 4. As can be seen from the figure, the adaptation (learning to cope with the stress and decreasing it over time) starts at around time 100 and continues until about time 600, after which the weight stays constant.
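The paper does not spell out the Hebbian update rule used for this adaptive weight; a common form in the temporal-causal framework of [17] (assumed here, with illustrative parameters and activations) lets the weight grow when both connected states are active and decay through a persistence factor:

```python
def hebbian_step(omega, x1, x2, eta=0.1, zeta=0.02, dt=1.0):
    # d omega / dt = eta * [X1 * X2 * (1 - omega) - zeta * omega]
    # the (1 - omega) factor saturates the weight below 1; zeta models extinction
    return omega + eta * (x1 * x2 * (1.0 - omega) - zeta * omega) * dt

omega = 0.4                  # initial value of the adaptive weight, as in Table 2
for _ in range(600):
    # hypothetical sustained co-activation of Prefrontal Cortex and Hippocampus
    omega = hebbian_step(omega, 0.25, 0.25)
```

With constant activations the weight settles at X1·X2 / (X1·X2 + ζ), mirroring the levelling-off seen in Fig. 4.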


Fig. 4. Simulation results for the adaptive connection weight between Prefrontal Cortex and Hippocampus

5 Mathematical Analysis

Emerging dynamic properties of dynamical models can be analyzed by simulation experiments, but some types of properties can also be found analytically, here using the WIMS Linear Solver.¹ For verification of the proposed temporal-causal network model, stationary points are investigated. To analyze the model mathematically, the linear equilibrium equations for the states of the model are solved, and the model is verified by comparing the outcome with the simulation results obtained in Matlab [22]:

x1 = x9
x2 = x1
x3 = 1
x4 = x3
2 x5 = x2 + x8
x6 = x4
x7 = x5
2 x8 = x6 + x7 − 0.9 x20 − 0.9 x21
x10 = 0.5
x11 = x10
x12 = x11
x13 = x12

¹ https://wims.unice.fr/wims/wims.cgi?session=K06C12840B.2&+lang=nl&+module=tool%2Flinear%2Flinsolver.en.


4 x14 = x13 + x15 + x16 + 0.47 x17
3 x15 = x14 + x12 + x19
x16 = x12
2 x17 = x18 + 0.47 x14
2 x18 = x17 + x15
2 x19 = x17 + x18
2 x20 = x18 + x19
2 x21 = x17 + x20

To compare the mathematical results with the simulation results, in particular those illustrated in Figs. 1 and 2, the parameter values X3 = 1 and X10 (goal) = 0.5 were used, because otherwise the goal-dependent states cannot go up and no equilibrium would be reached. A comparison between simulation and mathematical analysis for a number of states is given in Table 3.

Table 3. Comparing analysis and simulation

State       wsc (X3)  ssc (X4)  srsc (X6)  goal (X10)  pspil (X11)  espil (X12)  Norepine (X13)
Simulation  1.0000    1.0000    1.0000     0.5000      0.5000       0.5000       0.5000
Analysis    0.9999    0.9999    0.9999     0.4900      0.4900       0.4900       0.4900
Deviation   0.0001    0.0001    0.0001     0.0100      0.0100       0.0100       0.0100

State       Thal (X15)  PFC (X17)  Amyg (X18)  Late.Cer (X19)  Striatum (X20)
Simulation  0.3158      0.2460     0.2482      0.2144          0.2312
Analysis    0.3225      0.1891     0.2538      0.2195          0.2367
Deviation   0.0067      0.0569     0.0056      0.0051          0.0055
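The stationary-point equations for the brain-part subsystem form a small linear system that can be checked numerically. The sketch below is an assumption-laden reconstruction: the coefficients are taken from the printed equations (which are partly garbled in the original), and x12 = x13 = x16 are fixed at their analysis value 0.49:

```python
import numpy as np

x12 = x13 = x16 = 0.49          # values from the analysis column of Table 3
# unknowns, in order: [x14, x15, x17, x18, x19, x20, x21]
A = np.array([
    [4.0, -1.0, -0.47, 0.0, 0.0, 0.0, 0.0],   # 4 x14 = x13 + x15 + x16 + 0.47 x17
    [-1.0, 3.0, 0.0, 0.0, -1.0, 0.0, 0.0],    # 3 x15 = x14 + x12 + x19
    [-0.47, 0.0, 2.0, -1.0, 0.0, 0.0, 0.0],   # 2 x17 = x18 + 0.47 x14
    [0.0, -1.0, -1.0, 2.0, 0.0, 0.0, 0.0],    # 2 x18 = x17 + x15
    [0.0, 0.0, -1.0, -1.0, 2.0, 0.0, 0.0],    # 2 x19 = x17 + x18
    [0.0, 0.0, 0.0, -1.0, -1.0, 2.0, 0.0],    # 2 x20 = x18 + x19
    [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 2.0],    # 2 x21 = x17 + x20
])
b = np.array([x13 + x16, x12, 0.0, 0.0, 0.0, 0.0, 0.0])
sol = np.linalg.solve(A, b)     # equilibrium values of the seven brain-part states
```

All solved values stay inside [0, 1], as required for state values in this framework.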

6 Conclusion

In this paper a cognitive temporal-causal network model of a drug therapy (fluoxetine) for individuals under stress was introduced. The proposed model can be used to test different hypotheses and neurological principles about the effects that different brain areas have on the extinction of stress, and on other processes as well.


Several simulations have been carried out, one of which was presented in this paper. The model can serve as the basis of a chatbot or a virtual agent model, to gain insight into such processes, and to support the design of cures or treatments for the extreme emotions of individuals with post-traumatic stress disorder.

References

1. Blazer, D.: Mood Disorders: Epidemiology. Lippincott, Williams and Wilkins, New York (2000)
2. Kuroki, T., Meltzer, H.Y., Ichikawa, J.: Effects of antipsychotic drugs on extracellular dopamine levels in rat medial prefrontal cortex and nucleus accumbens. J. Pharmacol. Exp. Ther. 288, 774–781 (1999)
3. Bremner, J.D., et al.: MRI-based measurement of hippocampal volume in patients with combat-related posttraumatic stress disorder. Am. J. Psychiatry 152, 973–981 (1995)
4. Kodama, M., Fujioka, T., Duman, R.S.: Chronic olanzapine or fluoxetine administration increases cell proliferation in hippocampus and prefrontal cortex of adult rat. Biol. Psychiatry 57(2), 199 (2005)
5. Pereira, A., et al.: Processing of tactile information by the hippocampus. Proc. Natl. Acad. Sci. U.S.A. 104(46), 18286–18291 (2007)
6. Rocher, C., Spedding, M., Munoz, C., Jay, T.M.: Acute stress-induced changes in hippocampal/prefrontal circuits in rats: effects of antidepressants. Cereb. Cortex 14, 224–229 (2004)
7. Kumaran, D., Maguire, E.A.: Which computational mechanisms operate in the hippocampus during novelty detection? Hippocampus 17(9), 735–748 (2007)
8. Vetencourt, J.F.M., et al.: The antidepressant fluoxetine restores plasticity in the adult visual cortex. Science 320(5874), 385–388 (2008). https://doi.org/10.1126/science.1150516
9. Spennato, G., Zerbib, C., Mondadori, C., Garcia, R.: Fluoxetine protects hippocampal plasticity during conditioned fear stress and prevents fear learning potentiation. Psychopharmacology 196(4), 583–589 (2007)
10. Han, H., Dai, C., Dong, Z.: Single fluoxetine treatment before but not after stress prevents stress-induced hippocampal long-term depression and spatial memory retrieval impairment in rats. Sci. Rep. 5, 12667 (2015). https://doi.org/10.1038/srep12667
11. Schafer, W.R.: How do antidepressants work? Prospects for genetic analysis of drug mechanisms. Cell 98, 551–554 (1999)
12. Kessal, K., Deschaux, O., Chessel, A., Xu, L., Moreau, J.L., Garcia, R.: Fluoxetine reverses stress-induced fimbria-prefrontal LTP facilitation. NeuroReport 17, 319–322 (2006)
13. Rocher, C., Spedding, M., Munoz, C., Jay, T.M.: Acute stress-induced changes in hippocampal/prefrontal circuits in rats: effects of antidepressants. Cereb. Cortex 14, 224–229 (2004)
14. Czeh, B., et al.: Chronic social stress inhibits cell proliferation in the adult medial prefrontal cortex: hemispheric asymmetry and reversal by fluoxetine treatment. Neuropsychopharmacology 32, 1490–1503 (2007)
15. Cook, S.C., Wellman, C.L.: Chronic stress alters dendritic morphology in rat medial prefrontal cortex. J. Neurobiol. 60, 236–248 (2004)
16. Radley, J.J., Rocher, A.B., Janssen, W.G., Hof, P.R., McEwen, B.S., Morrison, J.H.: Reversibility of apical dendritic retraction in the rat medial prefrontal cortex following repeated stress. Exp. Neurol. 196, 199–203 (2005)
17. Treur, J.: Network-Oriented Modeling: Addressing Complexity of Cognitive, Affective and Social Interactions. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45213-5


18. Treur, J.: Verification of temporal-causal network models by mathematical analysis. Vietnam J. Comput. Sci. 3, 207–221 (2016)
19. Treur, J.: The ins and outs of network-oriented modeling: from biological networks and mental networks to social networks and beyond. In: Nguyen, N.T., Kowalczyk, R., Hernes, M. (eds.) Transactions on Computational Collective Intelligence. Paper for keynote lecture at ICCCI 2018, pp. 120–139. Springer, Heidelberg (2018). https://doi.org/10.1007/978-3-662-58611-2_2
20. Treur, J., Mohammadi Ziabari, S.S.: An adaptive temporal-causal network model for decision making under acute stress. In: Nguyen, N.T., Pimenidis, E., Khan, Z., Trawiński, B. (eds.) ICCCI 2018. LNCS (LNAI), vol. 11056, pp. 13–25. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98446-9_2. Journal version: Mohammadi Ziabari, S.S., Treur, J.: An adaptive temporal-causal network model for decision making under acute stress. Vietnam J. Comput. Sci. (2018, submitted)
21. Mohammadi Ziabari, S.S., Treur, J.: Computational analysis of gender differences in coping with extreme stressful emotions. In: Proceedings of the 9th International Conference on Biologically Inspired Cognitive Architectures (BICA 2018). Elsevier (2018)
22. Mohammadi Ziabari, S.S., Treur, J.: A modeling environment for dynamic and adaptive network models implemented in Matlab. In: Proceedings of the 4th International Congress on Information and Communication Technology (ICICT 2019), 25–26 February. Springer, London (2019)
23. Mohammadi Ziabari, S.S., Treur, J.: Integrative biological, cognitive and affective modeling of a drug-therapy for a post-traumatic stress disorder. In: Fagan, D., Martín-Vide, C., O'Neill, M., Vega-Rodríguez, M.A. (eds.) TPNC 2018. LNCS, vol. 11324, pp. 292–304. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04070-3_23
24. Mohammadi Ziabari, S.S., Treur, J.: An adaptive cognitive temporal-causal network model of a mindfulness therapy based on music. In: Tiwary, U.S. (ed.) IHCI 2018. LNCS, vol. 11278, pp. 180–193. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04021-5_17
25. Mohammadi Ziabari, S.S., Treur, J.: Cognitive modeling of mindfulness therapy by autogenic training. In: Satapathy, S.C., Bhateja, V., Somanah, R., Yang, X.-S., Senkerik, R. (eds.) Information Systems Design and Intelligent Applications. AISC, vol. 863, pp. 53–66. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-3338-5_6
26. Mohammadi Ziabari, S.S., Treur, J.: A temporal cognitive model of the influence of methylphenidate (Ritalin) on test anxiety. In: Proceedings of the 4th International Congress on Information and Communication Technology (ICICT 2019), 25–26 February. Springer, London (2019)
27. Mohammadi Ziabari, S.S., Treur, J.: An adaptive cognitive temporal-causal network model of a mindfulness therapy based on humor. In: International Conference on Computational Science (ICCS 2019) (2019, submitted)
28. Mohammadi Ziabari, S.S.: Integrative cognitive and affective modeling of deep brain stimulation. In: Proceedings of the 32nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE 2019) (2019, submitted)
29. Andrianov, A., Guerriero, E., Mohammadi Ziabari, S.S.: Cognitive modeling of mindfulness therapy: effects of yoga on overcoming stress. In: Proceedings of the 16th International Conference on Distributed Computing and Artificial Intelligence (DCAI 2019) (2019, submitted)
30. de Haan, R.E., Blanker, M., Mohammadi Ziabari, S.S.: Integrative biological, cognitive and affective modeling of caffeine use on stress. In: Proceedings of the 16th International Conference on Distributed Computing and Artificial Intelligence (DCAI 2019) (2019, submitted)

Clustering Diagnostic Profiles of Patients

Jaakko Hollmén (1) and Panagiotis Papapetrou (2)

(1) Department of Computer Science, Aalto University, Espoo, Finland, [email protected]
(2) Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden, [email protected]

Abstract. Electronic Health Records provide a wealth of information about the care of patients and can be used for checking the conformity of planned care, computing statistics of disease prevalence, or predicting diagnoses based on observed symptoms, for instance. In this paper, we explore and analyze the recorded diagnoses of patients in a hospital database in retrospect, in order to derive profiles of diagnoses in the patient database. We develop a data representation compatible with a clustering approach and present our clustering approach to perform the exploration. We use a k-means clustering model for identifying groups in our binary vector representation of diagnoses and present appropriate model selection techniques to select the number of clusters. Furthermore, we discuss possibilities for interpretation in terms of diagnosis probabilities, in the light of external variables and with the common diagnoses occurring together.

Keywords: Medical records · Binary representations · Clustering

1 Introduction

Electronic Health Records (EHR) provide a wealth of information for retrospective analysis of patients. They can be a source for validating the conformance to planned treatment, or can be used to mimic the doctor by predicting diagnoses of patients with the use of vital signs and other symptoms. The adoption of EHRs in healthcare research has been increasing rapidly [7,18]. In contrast to traditional data sources, including spontaneous reports [12] or social media [16], EHRs comprise disparate data types and can convey critical information which could potentially allow medical practitioners to prevent critical conditions or provide a timely intervention when necessary. A wide body of research using supervised learning approaches on EHR data exists in the literature, e.g., [5,6,11,17,19], mostly focusing on critical events such as heart failure [10] or adverse drug interactions [2,9]. On the other hand, descriptive analytics approaches focusing on frequent pattern mining or subgroup discovery have been proposed.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 120–126, 2019. https://doi.org/10.1007/978-3-030-19823-7_9


More specifically, several ways of improving the detection of adverse drug events (ADEs) have been explored [1] by combining sequential pattern mining with disproportionality analysis. In particular, the use of sequential pattern mining for finding frequent sequences of drug prescriptions has been explored; these sequences then form the basis for the disproportionality analysis. In other words, instead of looking for unexpected drug-diagnosis pairs, the main focus is placed on extracting unexpected pairs of drug sequences and diagnoses. Since this method is better suited to handle drug interactions, it is expected to handle cases where a sequential administration of interacting drugs is responsible for a certain ADE. An empirical investigation of the method has been performed using a subset of the Stockholm EPR corpus [3]. In this paper, we focus on unsupervised learning, and more specifically on the detection of cluster structure in a medical database. Specifically, we use a large database of newborn babies treated at the neonatal intensive care unit [13] in Helsinki Children's Hospital in Finland. We use the data in retrospect to explore and investigate the diagnostic profiles of patients. For this aim, we use a clustering model to group the patient database into distinct patient groups, each having a particular diagnosis signature over the recorded diagnoses. We aim at identifying interpretable diagnostic profiles by characterizing the patient profiles by the most common diagnoses in the clusters. In the rest of the paper, we describe the background and origins of the data and present related work in Sect. 2. The methodology and the experimental part of clustering the diagnostic profiles are presented in Sect. 3. In Sect. 4, we summarize our findings and conclude the paper.

2 Patient Data and Diagnoses as Profiles

The data set under study has been recorded in the Helsinki University Hospital Neonatal Intensive Care Unit (NICU) between the years 1999 and 2013. The data set in question consists of some 2000 preterm babies born in the hospital and treated in the NICU. The treatment in the intensive care unit results in a lot of data recordings that can be analyzed in retrospect. Of particular interest in the data set have been the so-called very low birth weight (VLBW) infants, which by definition are babies with a birth weight of less than 1500 g. The statistics of these data are presented in more detail in [13]. In our previous work, we have explored the possibility to predict diagnoses based on the vital signs and other symptom data, using Gaussian process classification [14]. Contrary to our previous work [13,14], where diagnoses were estimated or predicted from vital signs and symptom-based data, we consider a different approach: we treat the diagnoses of NICU patients as an individual data resource and analyze the heterogeneity and the statistical dependencies within the data and the hypothesized groups in the data. For each patient, there is a list of diagnoses given during the NICU stay. We have extracted a list of diagnoses and some other variables and cross-referenced them with an ICD-10 database retrieved from the Finnish health organization THL. For our diagnostic profile, we include only those variables which are found in a standard ICD-10 database. As a result, we get a list of 437 possible diagnoses which have occurred in this data set. We must note that this is not a standardized, comprehensive list of all diagnoses but is specific to this set of patients. As the list of diagnoses for a patient may vary, we seek to represent them as a unified diagnostic profile. Therefore, we represent the diagnoses for an individual patient as a list of truth values, which can be numerically represented as 0's and 1's. Our vectorial data representation has the possible diagnoses as attributes, and the binary 0–1 values denote whether a particular patient has a diagnosis (1) or not (0). In this manner, we can represent the diagnoses as a binary 0–1 vector for a patient. This gives rise to a matrix where each row represents the diagnostic profile of a patient.
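As a small sketch of this representation (hypothetical patients and a tiny subset of the 437 codes; the codes shown are ones discussed later in the paper):

```python
# hypothetical patients mapped to their lists of ICD-10 diagnosis codes
patients = {
    "p1": ["P59.0", "P22.9"],
    "p2": ["P59.0", "P22.9", "H35.1"],
    "p3": ["P59.0"],
}
codes = sorted({c for dx in patients.values() for c in dx})  # attribute order
# one 0-1 row per patient: 1 if the patient has the diagnosis, 0 otherwise
profiles = [[1 if c in set(patients[p]) else 0 for c in codes]
            for p in sorted(patients)]
```

Stacking the rows gives the patient-by-diagnosis 0–1 matrix used in the experiments.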

3 Experiments: Methodology and Evaluation

Recall that our data is 0–1 data collected into a vectorial representation, where each vector describes the set of diagnoses for that patient with a collection of 0's (no diagnosis) and 1's (diagnosis) for particular codes in ICD-10. The analysis may now proceed as the analysis of multivariate 0–1 data. If we wished to describe occurrences of diagnoses together, we could extract frequent itemsets from the 0–1 data [4], or extract frequent itemsets combined with a clustering approach [8]. Here, we are taking the first steps in exploring the data and are content with exploring the clustering structure with a clustering approach only. During the experiments, the goal is to use the data of the diagnostic profiles described earlier and to learn a cluster model from the data. For this aim, we use the k-means cluster model, where the cluster profiles are represented in terms of prototype vectors, computed locally from the clustered data [4]. A central question regarding the clustering is the choice of the number of clusters in the model: an optimal choice is a trade-off between the richness of representation (many clusters) and the compactness of representation (just a few clusters). As a guiding criterion, we use the silhouette index [15] computed from the clustered data and decide on the number of clusters based on a series of silhouette indices computed on the cluster models. In order to reduce the impact of individual cluster models learned from data, we compute multiple cluster solutions from different initial values and average our chosen model selection criterion. We present the statistics of these figures for solutions between 2 and 20 clusters. The results of the model selection experiment are illustrated in Fig. 1. The choice of the number of clusters is a trade-off between the richness of the representation and the compactness of the result. Whereas compactness would make the result set very interpretable, richness would make the clustered data sets very homogeneous. In order to balance between these extremes, we resort to the silhouette index [15] and select the number of clusters to be J = 3. We observe a declining silhouette score between 2 and 3 clusters, but increasing the number of clusters beyond 3 does not seem to affect the silhouette score.
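A minimal NumPy sketch of this procedure is given below: plain Lloyd's k-means and a hand-rolled silhouette score, run on synthetic 0–1 data with two planted groups (the real study uses the d = 437 patient matrix, 2 to 20 clusters, and 50 restarts):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    # Lloyd's algorithm with centers initialized from k random data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def silhouette(X, labels):
    # mean of s(i) = (b - a) / max(a, b); singleton clusters are skipped
    if len(np.unique(labels)) < 2:
        return 0.0
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# synthetic stand-in for the binary diagnosis matrix, with two planted groups
X = rng.integers(0, 2, size=(60, 10)).astype(float)
X[:30, :3] = 1.0
X[30:, 7:] = 1.0
scores = {k: silhouette(X, kmeans(X, k)[1]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

Averaging such scores over repeated random initializations, as in Fig. 1, reduces the dependence on any single k-means run.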


Fig. 1. The silhouette indices computed for multiple realizations of cluster models and numbers of clusters. We have clustered all the data in the patient matrix, with data dimension d = 437. We have run k-means clustering with random initializations, repeating the runs for 2 to 20 clusters 50 times each. For each clustering result, we have computed the silhouette score for the solution at hand. The average of the scores is plotted with a black line, the 25th and 75th percentiles with dash-dotted lines, and the individual scores as points.

There is, however, some variance in the score, indicated by the spread of the individual scores marked by black points as well as the bounds given by the percentiles, marked by the dash-dotted lines. We proceed to the final clustering by fixing the number of clusters to J = 3 and training a final model on the data. In order to avoid degenerate results, we train a model 7 times and select the model with the median silhouette score; this is likely to avoid minima with extreme values of the silhouette index. Since the k-means algorithm estimates the cluster centers as averages of the data, and the data are either 0's or 1's, the cluster centers have a natural interpretation as probabilities of the individual diagnoses in the cluster, and can thus be related to the risk of a diagnosis in a given group. The cluster centers for the three clusters are illustrated in Fig. 2. There are apparent similarities between the profiles and they do indeed share some characteristics in terms of diagnoses. For instance, each cluster is characterised by the diagnosis P59.0, which indicates neonatal jaundice associated with preterm delivery, and P22.9, which in turn corresponds to unspecified respiratory distress of the newborn. These are diagnoses that are associated with the selection of the patient material from the NICU, rather than distinguishing factors between the clusters. Some of the diagnoses appear in only one cluster, like H35.1, retinopathy of prematurity, which offers an avenue to explore further which factors occur together in this particular group.
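The probability reading of the cluster centers can be made concrete with a toy example (hypothetical profiles and labels; P07.3 is an extra illustrative code, not one reported for these clusters):

```python
import numpy as np

codes = ["P59.0", "P22.9", "H35.1", "P07.3"]
X = np.array([[1, 1, 0, 0],        # toy 0-1 diagnostic profiles
              [1, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 0, 1]])
labels = np.array([0, 0, 1, 1])    # cluster assignment, e.g. from k-means

centers = {j: X[labels == j].mean(axis=0) for j in np.unique(labels)}
# centers[j][d] is the fraction of cluster-j patients carrying diagnosis d,
# i.e. an estimate of the diagnosis probability within that cluster
top = {j: [codes[d] for d in np.argsort(centers[j])[::-1][:2]]
       for j in centers}
```

Listing the top diagnoses per center in this way yields the interpretable cluster signatures discussed above.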


Fig. 2. The cluster centers of the identified diagnosis profiles are illustrated. Each panel describes one of the three clusters in terms of probabilities of diagnoses in the cluster.

4 Summary and Conclusions

We have analyzed a database of patients treated at the neonatal intensive care unit (NICU) of the Helsinki Children's Hospital in Finland. In particular, we focused on the set of diagnoses of the patients and developed a vectorial 0–1 data representation for further analysis. The diagnostic profile for a patient is the listing of all diagnoses of the patient and can be represented as a vector of 0–1 data with all diagnoses as vector components, or attributes. We then proceeded with a clustering approach and developed a suitable clustering model for the data through a model selection procedure. We presented the prototypes of the clusters and discussed further possibilities for describing the data more accurately. A clustering model can be used to yield a practical, yet nontrivial description of the patient diagnoses as such. Some of the highlighted diagnoses are generic hallmarks of the patient material in question, but others may yield interesting information about subsets of patients. These diagnostic profiles can be used to further describe the statistical dependencies of the individual diagnoses in subsets of patients, which can yield interesting, but unexplored knowledge about the domain. In order to derive more medical relevance from the profiles, we will discuss the findings further with medical experts, and reflect the findings against external patient data, which has not been used in the clustering.


Acknowledgments. This work was partly supported by the VR-2016-03372 Swedish Research Council Starting Grant. The study was approved by the Helsinki University Central Hospital Ethics Committee, decision number 115/13/03/00/14 dated 8 April 2014. We thank Olli-Pekka Rinta-Koski for technical assistance with the data extraction.


126

J. Hollm´en and P. Papapetrou


Emotion Analysis in Hospital Bedside Infotainment Platforms Using Speeded Up Robust Features

A. Kallipolitis (1,2), M. Galliakis (1,2), A. Menychtas (2,3), and I. Maglogiannis (1)

(1) Department of Computer Science and Biomedical Informatics, University of Thessaly, Volos, Greece
{nasskall,michaelgalliakis,imaglo}@unipi.gr
(2) Department of Digital Systems, University of Piraeus, Piraeus, Greece
[email protected]
(3) BioAssist S.A., Kastritsiou 4, 26504 Rion, Greece

Abstract. Far from the heartless aspect of bits and bytes, the field of affective computing investigates the emotional condition of human beings interacting with computers by means of sophisticated algorithms. Systems that integrate this technology into healthcare platforms allow doctors and medical staff to monitor the sentiments of their patients while they are being treated in their private spaces. It is common knowledge that the emotional condition of patients is strongly connected to the healing process and their health. Therefore, being aware of the psychological peaks and troughs of a patient provides the advantage of timely intervention by specialists or closely related kinsfolk. In this context, the developed approach describes an emotion analysis scheme which exploits the fast and consistent properties of the Speeded-Up Robust Features (SURF) algorithm in order to identify the existence of seven different sentiments in human faces. The whole functionality is provided as a web service for the healthcare platform during regular WebRTC video teleconference sessions between authorized medical personnel and patients. The paper discusses the technical details of the implementation and incorporation of the proposed scheme and provides initial results on its accuracy and operation in practice.

Keywords: Healthcare platforms · Affective computing · Hospital bedside infotainment systems · WebRTC · Speeded Up Robust Features (SURF) · Emotion analysis



1 Introduction

© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 127–138, 2019. https://doi.org/10.1007/978-3-030-19823-7_10

While the relation between the psychological status of human beings and their health has been acknowledged in numerous studies [4–6] over the past years, conventional medicine has failed to exploit this notion. In practice, it is only recently that medical experts, in parallel with routine treatment, have been investing in the improvement of the emotional status of their patients to reinforce the effects of the provided therapy. In the same direction, bioinformatics researchers are investigating methods to better interpret,


distinguish, process and quantify sentiments from various human expressions (body posture [3], speech [1], facial expression [2]), all summarized in what is called Affective Computing (AC) or Artificial Emotional Intelligence. Depending on the source of the human expression, affective computing is divided into three main categories: (a) Facial Emotion Recognition (FER), (b) Speech Emotion Recognition (SER), and (c) Posture Emotion Recognition (PER).

The importance of affective computing systems is highlighted by the engagement of many IT colossi (Google [7], IBM [9], Microsoft [8]) in implementing systems for real-time affective analysis of multimedia data depicting human faces and silhouettes. As far as healthcare platforms are concerned, integrating such schemes in systems that bear the responsibility of monitoring and managing patients’ biosignals is of great significance to the healing procedure, especially in the case of chronic diseases. In brief, the generation of positive emotions assists in keeping the patient in a stable psychological condition, which is the basis for fast and efficient treatment [32], whereas negative ones have the opposite effect. Apart from the integration of affective systems in healthcare platforms, rapid development of emotional AI techniques has been reported in a wide range of areas, namely Virtual Reality, Augmented Reality, Advanced Driver Assistance and Smart Infotainment, as part of a general trend towards alignment with human-centered computing.

In this paper, we describe the design and deployment of a FER system, incorporated in a healthcare management system as a web service, to provide functionalities throughout the entire lifecycle of the Medical Staff – Kinsfolk – Patient interaction.
Motivated by the improved results that a treatment can have when combined with the psychological management of the patient, this system provides the medical staff with the ability to measure and quantify the patient’s emotions in real time, so that they can assess and act upon them. Moreover, correlating the emotion measurements with health-related markers collected by the system may lead to important newly discovered knowledge.

The remainder of this paper is structured in six sections, as follows: Sect. 2 presents the related research works, while Sect. 3 describes the proposed emotion analysis system architecture. Section 4 describes the system in practice and Sect. 5 reports the experiments conducted and the corresponding results. Finally, Sect. 6 concludes the paper.

2 Related Work

As stated earlier, the analysis, recognition and evaluation of human sentiment via pattern recognition techniques does not rely solely on the processing of facial expressions, but also on the quantification of body posture and speech. Focusing on SER, several approaches have been proposed in the literature for the extraction of vocal features and their exploitation in forming appropriate classification models. Methods based on the extraction of low-level features such as raw pitch and energy contour [11, 12] are outperformed by high-level features utilizing Deep Neural Networks [13], by a margin of up to 20% in accuracy. PER is the least examined territory in the field of AC. The interpretation of human emotions from body posture in an attempt to assist


individuals that suffer from autism spectrum disorder is described in [14], while an approach based on theoretical frameworks investigates the correlation between patterns of body movement and emotions [15]. FER methodologies, on the other hand, vary from the exploitation of Deep Belief Networks combined with machine learning data pipeline features [16], the utilization of a Hierarchical Bayesian Theme Model based on the extraction of Scale Invariant Feature Transform features [17], and the Online Sequential Extreme Learning Machine method [18], to Stepwise Linear Discriminant Analysis with Hidden Conditional Random Fields [19]. In addition, hybrid implementations that combine FER and SER are available in the literature, completing the puzzle of affective computing methodologies [20].

In general, affective computing has been widely deployed in the blooming field of electronic healthcare. As examples of such applications, patients’ breathing is managed via emotion recognition carried out by a Microsoft Kinect sensor in [21], while in [22] sentiments of patients suffering from Alzheimer’s disease are analyzed via a facial landmark detection algorithm. Another application in electronic healthcare systems is the detection of potential Parkinson’s disease patients by recognizing facial impairment when certain expressions are formed during the generation of specific emotions [23].

An innovative notion concerning healthcare solutions is the hospital bedside infotainment system (HBIS). These systems are designed to enhance communication between medical staff and patients and to promote the patients’ clinical experience. Combining internet, video, movies, radio, music, video or telephone chat with authorized personnel or kinsfolk, and biosignal monitoring in one device connected to the Electronic Health Record (EHR), an HBIS can prove a productive tool for healthcare ecosystems [10].
Furthermore, constant monitoring of patients can assist in the improvement of their health status and lead to early detection of potential setbacks, such as outliers in vital signs [33], poor medication adherence, and changes in sleep habits. Although HBIS and emotion analysis services exist as stand-alone cloud-based applications, combining the two in one platform is a novel idea with positive effects concerning the timely intervention of specialists and kinsfolk when negative emotions or depression are detected.

3 System Architecture

3.1 Overview

The FER RESTful web service is built to provide functionality as an additional feature of an existing hospital bedside infotainment system and assisted living solution [24]. The target group of this system are patients who suffer from chronic diseases or are obliged to stay in rehabilitation centers for long periods due to reduced mobility. Another affected group are elders who live independently or in remote regions and conduct routine medical teleconsultations with doctors and caregivers [25]. Although the existing system provides numerous features, such as the monitoring of patients’


biosignals through a mobile application while conducting measurements via wearables and Bluetooth-enabled devices, as illustrated in Fig. 1, the contribution of this paper focuses on the real-time video communication functionality, through which patients are able to communicate with their medical experts and kinsfolk on a 24/7 basis. The FER service operates in parallel with the video communication functionality and is invoked upon request of the doctor. As mentioned earlier, automated analysis of facial emotion expression is highly important, especially for patients and elderly people whose health status is strongly connected to their psychological condition and emotion management. The FER service is divided into two modules: (a) the face extraction module and (b) the emotion recognition module. The face extraction module runs in the web browser on the client side, while the emotion recognition module runs on the cloud platform (server side).

Fig. 1. Overview of the homecare platform after the integration of the FER service

3.2 The Emotion Analysis Process

In general, the basic skeleton of methodologies related to FER consists of five steps: (a) preprocessing of images, (b) face acquisition, (c) landmark acquisition (if necessary), (d) facial feature extraction, and (e) facial expression classification. The proposed method, specifically, comprises seven steps, as described in the pseudocode in Fig. 2: (a) frame extraction from the real-time streaming video, (b) face detection, (c) cropping of the picture to the dimensions of the detected face (Fig. 4), (d) resizing of the face picture to 256 × 256 pixels (if needed), (e) analysis of the face picture for emotions, (f) presentation of the emotions to the medical expert during the video conference, and (g) storage of the generated results in the patient’s personal health record. The analysis of facial images and their classification into seven different

Emotion Analysis in Hospital Bedside Infotainment Platforms

131

sentiments (anger, disgust, happiness, neutral, sadness, surprise, fear) is accomplished by the extraction of Speeded Up Robust Features (SURF), which form a k-dimensional vector as a result of applying the Bag of Words technique to the extracted features. Given a collection of r images, an algorithm that extracts local features is utilized to create the visual vocabulary. In our case, the Speeded Up Robust Features (SURF) algorithm [28] extracts n 64-dimensional vectors, where n is the number of interest points that are automatically detected by the algorithm and, in turn, described using a Fast Hessian matrix (SURF descriptor) in each one of the r images (Fig. 4). Upon completion of the feature extraction process from the r images, a collection of r × n 64-dimensional vectors is formed, which represent corresponding points in a 64-dimensional space. This collection is grouped using a clustering algorithm (k-means++ is utilized) into k groups. The centroid of each group represents a visual word, resulting in the formation of a visual vocabulary of k visual words. The process of extracting SURF features is implemented using ImageJ [26], face detection is based on the OpenCV library [27], while clustering and classification use the WEKA tool [29].
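The vocabulary-building step described above can be sketched as follows. This is an illustrative reimplementation, not the paper's ImageJ/WEKA pipeline: since SURF is patented and absent from many OpenCV builds, random 64-dimensional arrays stand in for real SURF descriptors, and a minimal k-means++ is written out directly.

```python
# Hedged sketch of the Bag-of-Visual-Words step: per-image 64-dimensional
# local descriptors are clustered with k-means++ into k visual words, and
# each image becomes a k-bin visual-word histogram.
import numpy as np

def kmeans_pp_init(D, k, rng):
    """k-means++ seeding: spread initial centroids via squared distances."""
    centers = [D[rng.integers(len(D))]]
    for _ in range(k - 1):
        d2 = np.min([((D - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(D[rng.choice(len(D), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(D, k, n_iter=20, seed=0):
    """Plain Lloyd iterations on top of k-means++ seeding."""
    rng = np.random.default_rng(seed)
    C = kmeans_pp_init(D, k, rng)
    for _ in range(n_iter):
        labels = ((D[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = D[labels == j].mean(axis=0)
    return C

def bovw_histogram(descriptors, vocabulary):
    """Map one image's descriptors to a normalized visual-word histogram."""
    labels = ((descriptors[:, None, :] - vocabulary[None]) ** 2).sum(-1).argmin(1)
    hist = np.bincount(labels, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
images = [rng.normal(size=(40, 64)) for _ in range(5)]  # n = 40 descriptors per image
vocab = kmeans(np.vstack(images), k=8)                  # visual vocabulary of k words
hists = np.array([bovw_histogram(d, vocab) for d in images])
```

The resulting k-dimensional histograms are the image representations that are subsequently fed to the classifiers.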

while (videoconference) {
    frame ← extractFrame()
    if (detectFace(frame)) {
        face ← cropFace(frame)
        resizedFace ← resize(face, 256, 256)
        emotions ← analyzeFace(resizedFace)
        if (emotions) {
            showEmotions(emotions)
            storeEmotions(emotions)
        }
    }
}

Fig. 2. Emotion analysis process as pseudocode. Green color indicates functions performed client-side; red color indicates operations performed server-side (Color figure online)

The emotion recognition service is called during a video call (Fig. 5). A sequence of image frames (1 frame per second) is captured during the WebRTC video conference. In order to avoid additional load on the network, the face detection module is executed locally in the web browser. Cropping the image to a face bounding box reduces the amount of data sent from client to server, which in turn results in overall improved performance of the system. This is accomplished by the recent implementation of the OpenCV library in JavaScript (OpenCV.js), which provides the functionality of OpenCV models in the JavaScript runtime environment of web browsers.
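The client-side cropping and resizing steps can be sketched as below. This is a hedged stand-in: the bounding box is a hypothetical stub, whereas in the system it comes from OpenCV's Haar-cascade detector running in the browser via OpenCV.js, and a nearest-neighbour resize substitutes for the platform's resizer.

```python
# Sketch of pipeline steps (c) and (d): crop a frame to a detected face
# bounding box, then resize to 256 x 256 before uploading to the server.
import numpy as np

def crop_face(frame, box):
    """box = (x, y, width, height) in pixel coordinates (stubbed detection)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def resize_nearest(img, out_h=256, out_w=256):
    """Nearest-neighbour resize: pick source rows/columns for each output pixel."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # one captured 640 x 480 frame
face = crop_face(frame, (200, 100, 180, 220))     # hypothetical bounding box
resized = resize_nearest(face)                    # 256 x 256 x 3 crop for upload
```

Sending only the cropped, resized face rather than the full frame is what keeps the per-frame upload small, as discussed above.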


Fig. 3. JAFFE image dataset depicting seven different emotion expressions

Fig. 4. (left) Interest points detected in a dataset image using Speeded Up Robust Features (ImageJ); (right) cropped image using a Haar cascade classifier (OpenCV)

Fig. 5. Emotion analysis operational scenario


4 The System in Practice

The functionality of the proposed solution takes place transparently as far as the users are concerned, upon selection by the medical experts. This provides the discreet capability of monitoring and registering the emotional status of patients while performing a regular video conference ‘visit’ (Fig. 6).

Fig. 6. Medical expert interface showing the FER results of the patient during a video conference

The results of FER are returned from the cloud service in JSON format (Fig. 7) and subsequently visualized in the user interface.

{
  "prediction": {
    "angry": 0.12,
    "disgust": 0.11,
    "fear": 0.14,
    "happy": 0.08,
    "neutral": 0.20,
    "sad": 0.12,
    "surprise": 0.23
  },
  "status": 0
}

Fig. 7. Sample response of the FER service
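A client consuming a response like the one in Fig. 7 only needs standard JSON handling. The sketch below parses the payload and picks the dominant emotion; the field names follow Fig. 7, while treating status 0 as success is an assumption about the service's convention.

```python
# Minimal sketch of client-side handling of the FER service response.
import json

response_text = '''{"prediction": {"angry": 0.12, "disgust": 0.11, "fear": 0.14,
                    "happy": 0.08, "neutral": 0.20, "sad": 0.12, "surprise": 0.23},
                    "status": 0}'''

payload = json.loads(response_text)
dominant = None
if payload["status"] == 0:                   # 0 assumed to mean success
    scores = payload["prediction"]
    dominant = max(scores, key=scores.get)   # -> "surprise" for this sample
```

In the platform, the per-emotion scores would be plotted in the doctor's interface, and the dominant label could drive alerts when negative emotions persist.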

Testing the system in practice was performed by conducting 50 video sessions of 1-min duration. In these sessions, the client side was handled by a PC with a quad-core Intel Core i5-7400, while the server side (cloud services) was deployed to an IaaS cloud environment with two cores of an Intel Xeon CPU E5-2650. The internet connection between the two sides was provided by a


typical 24 Mbps ADSL connection. The resolution of the images captured by the camera was set to 640 × 480 pixels. Average times in milliseconds for basic operations conducted on the client side and the server side are given in Tables 1 and 2, respectively. The average time for uploading the cropped image file from client to server is 70 ms. Observation of the measurements on both sides demonstrates that the most time-consuming operation is the feature extraction from the cropped image (server side), followed by the uploading of the image file on the client side. In addition, operations performed on the server side are far more expensive in time than those on the client side, which was expected and strategically planned for, in order to discharge the web browser of all computationally demanding tasks.

Table 1. Average time allotment for performing operations on the client side

| FER operation (client side)  | Time   |
|------------------------------|--------|
| Frame extraction             | 10 ms  |
| Face detection               | 105 ms |
| Face bounding box generation | 3 ms   |
| Face cropping                | 5 ms   |
| Image file creation          | 6 ms   |
| Time (sum)                   | 129 ms |

Table 2. Time allotment for performing operations on the server side

| FER operations (server side) | Web service invocation | Feature extraction | Visual vocabulary loading (only once) | Image-to-vector transformation | Prediction model loading (only once) | Image classification | Time (sum) |
|------------------------------|------------------------|--------------------|---------------------------------------|--------------------------------|--------------------------------------|----------------------|------------|
| Beginning of session         | 200 ms                 | 178 ms             | 142 ms                                | –                              | 157 ms                               | 1 ms                 | 678 ms     |
| All other calls              | 200 ms                 | –                  | 142 ms                                | –                              | –                                    | 1 ms                 | 243 ms     |

Further experimentation on the requirement of running face detection on the front end is presented in Table 3. The table illustrates the produced overhead for the network, the browser’s memory, and the CPU for two scenarios: one with the image size set to 320 × 240, indicated as (s) for small, and the other set to 640 × 480, indicated as (l) for large.

Table 3. Resource allotment for face detection in the web browser

| Face detection (front end) | Idle (l) | Frame (l) | Face (l) | Frame (s) | Face (s) |
|----------------------------|----------|-----------|----------|-----------|----------|
| Memory                     | 3.9 GB   | 4.1 GB    | 4.1 GB   | 4.0 GB    | 4.0 GB   |
| CPU                        | ~30%     | ~33%      | ~45%     | ~33%      | ~45%     |
| Data                       | –        | ~400 KB   | ~65 KB   | ~122 KB   | ~18 KB   |


When idle, the image is processed at 640 × 480; therefore, an idle measurement for the (s) scenario does not exist. The experiment was conducted using the Mozilla Firefox browser (version 66.0), but the module also operates in Opera 58.0.3135.117 (64-bit) and Google Chrome 73.0.3683.86 (64-bit) without any issues. The experiments demonstrated that memory consumption is insignificantly influenced in all scenarios, whereas large variations are evidenced in data length, as expected.

5 Experimental Results

While the main objective of this paper is the presentation of the integration of a FER web service into a homecare platform, initial results for two scenarios of classification on the JAFFE [30] dataset are provided. The first scenario splits the dataset into two emotional categories (positive and negative emotions, under the assumption that anger, fear, disgust and sadness are negative emotions), while in the second scenario seven emotional categories (anger, fear, disgust, neutral, happiness, surprise, sadness; Fig. 3) are considered, with the utilization of various classifiers. The procedure follows 10-fold cross-validation of the whole JAFFE dataset (214 images, 256 × 256 pixels, grayscale). In order to discover the most efficient space representation of the training dataset, extensive testing of the Bag of Visual Words scheme was conducted. The k-means++ method (350 clusters, 70 seeds) was selected among the k-means, Canopy and Farthest First implementations of WEKA for its ability to better distinguish inter-class and intra-class relationships. k-means++ improves the initialization phase of the k-means clustering algorithm by selecting the initial seeds strategically [31].

Table 4. Classification accuracy results for the JAFFE dataset

| Classifier                      | Scenario 1 (two emotional categories) | Scenario 2 (seven emotional categories) |
|---------------------------------|---------------------------------------|-----------------------------------------|
| Random forest                   | 92.01%                                | 77.46%                                  |
| Naïve Bayes                     | 87.79%                                | 54.92%                                  |
| MLP                             | 93.48%                                | 73.23%                                  |
| Sequential minimal optimization | 93.42%                                | 74.64%                                  |
| Logistic                        | 92.01%                                | 71.00%                                  |
| K Star                          | 93.42%                                | 84.03%                                  |
| LMT                             | 89.67%                                | 67.43%                                  |

The accuracy of the emotion detection module is provided in Table 4. A Multilayer Perceptron (learning rate: 0.3, momentum: 0.2, epochs: 500, threshold for number of consecutive errors: 20) reaches 93.48% classification accuracy in the first scenario, while the K Star classifier (manual blend: 20%, missing values replaced with average) from the WEKA library achieves the best accuracy (84.03%) in the second scenario.


6 Conclusion

Whereas other affective computing systems operate as stand-alone applications, this paper presents an innovative facial emotion recognition web service, integrated into a healthcare information system for monitoring and timely management of emotional fluctuations of elders and patients with chronic diseases, as part of a human-centric treatment. The provided functionality of classifying faces into corresponding sentiments in real time during video communication sessions is of great significance, especially in cases of patients with diseases related to their psychosomatic condition. Future work will focus on the realization of a service that can execute emotion recognition entirely in the web browser; this feature will liberate the application from the restrictions imposed by its cloud-based design. Concerning classification performance, towards improving the accuracy of the current prediction model, other Bag of Words schemes will be tested in an effort to provide weighted and localized information on the visual words. Although the results are promising, further testing with larger and Caucasian-oriented labeled datasets should be performed towards a more thorough evaluation of the system. Correlating emotion recognition results with information related to the biosignals and everyday routine activities of individual patients can lead to the discovery of specific patterns and valuable knowledge for the medical community.

Acknowledgment. This research has been co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE (SISEI: Smart Infotainment System with Emotional Intelligence, project code: T1EDK-01046).

References

1. Gunawan, T., Alghifari, M.F., Morshidi, M.A., Kartiwi, M.: A review on emotion recognition algorithms using speech analysis. Indonesian J. Electr. Eng. Inf. 6, 12–20 (2018)
2. Ko, B.C.: A brief review of facial emotion recognition based on visual information. Sensors 18(2), 401 (2018)
3. Dael, N., Mortillaro, M., Scherer, K.: Emotion expression in body action and posture. Emotion 12, 1085 (2011). https://doi.org/10.1037/a0025737
4. DuBois, C.M., Lopez, O.V., Beale, E.E., Healy, B.C., Boehm, J.K., Huffman, J.C.: Relationships between positive psychological constructs and health outcomes in patients with cardiovascular disease: a systematic review. Int. J. Cardiol. 195, 265–280 (2015). https://doi.org/10.1016/j.ijcard.2015.05.121
5. Burger, A.J., et al.: The effects of a novel psychological attribution and emotional awareness and expression therapy for chronic musculoskeletal pain: a preliminary, uncontrolled trial. J. Psychosom. Res. 81, 1–8 (2016)
6. Huffman, J.C., Millstein, R.A., Mastromauro, C.A., et al.: J. Happiness Stud. 17, 1985 (2016)
7. Google Cloud Vision API Homepage: https://cloud.google.com/vision/
8. Microsoft Cognitive Services Homepage: https://azure.microsoft.com/en-us/services/cognitive-services/


9. IBM Watson Visual Recognition Homepage: https://www.ibm.com/watson/services/visual-recognition/
10. Dale, Ø., Boysen, E.S., Svagård, I.: One size does not fit all: design and implementation considerations when introducing touch-based infotainment systems to nursing home residents. In: Miesenberger, K., Bühler, C., Penaz, P. (eds.) ICCHP 2016. LNCS, vol. 9758, pp. 302–309. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41264-1_41
11. Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model-based speech emotion recognition. In: Proceedings of IEEE ICASSP 2003, vol. 2, pp. I–II. IEEE (2003)
12. Nwe, T.L., Hieu, N.T., Limbu, D.K.: Bhattacharyya distance based emotional dissimilarity measure for emotion classification. In: Proceedings of IEEE ICASSP 2013, pp. 7512–7516. IEEE (2013)
13. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Interspeech 2014, pp. 223–227 (2014)
14. Libero, L.E., Stevens, C.E., Kana, R.K.: Attribution of emotions to body postures: an independent component analysis study of functional connectivity in autism. Hum. Brain Mapp. 35, 5204–5218 (2014)
15. Dael, N., Mortillaro, M., Scherer, K.R.: Emotion expression in body action and posture. Emotion 12, 1085–1101 (2012)
16. Uddin, M.Z., Hassan, M.M., Almogren, A., Zuair, M., Fortino, G., Torresen, J.: A facial expression recognition system using robust face features from depth videos and deep learning. Comput. Electr. Eng. 63, 114–125 (2017)
17. Mao, Q., Rao, Q., Yu, Y., Dong, M.: Hierarchical Bayesian theme models for multipose facial expression recognition. IEEE Trans. Multimed. 19(4), 861–873 (2017)
18. Cossetin, M.J., Nievola, J.C., Koerich, A.L.: Facial expression recognition using a pairwise feature selection and classification approach. In: 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016, pp. 5149–5155. IEEE (2016)
19. Siddiqi, M.H., Ali, R., Khan, A.M., Park, Y., Lee, S.: Human facial expression recognition using stepwise linear discriminant analysis and hidden conditional random fields. IEEE Trans. Image Process. 24(4), 1386–1398 (2015)
20. Ekman, P.: Facial expression and emotion. Am. Psychol. 48(4), 384 (1993)
21. Dantcheva, A., Bilinski, P., Broutart, J.C., Robert, P., Bremond, F.: Emotion facial recognition by the means of automatic video analysis. Gerontechnol. J. Int. Soc. Gerontechnol. 15, 12 (2016)
22. Tivatansakul, S., Chalumporn, G., Puangpontip, S., Kankanokkul, Y., Achalaku, T., Ohkura, M.: Healthcare system focusing on emotional aspect using augmented reality: emotion detection by facial expression. In: Advances in Human Aspects of Healthcare, vol. 3, p. 375 (2014)
23. Almutiry, R., Couth, S., Poliakoff, E., Kotz, S., Silverdale, M., Cootes, T.: Facial behaviour analysis in Parkinson’s disease. In: Zheng, G., Liao, H., Jannin, P., Cattin, P., Lee, S.-L. (eds.) MIAR 2016. LNCS, vol. 9805, pp. 329–339. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43775-0_30
24. Menychtas, A., Tsanakas, P., Maglogiannis, I.: Automated integration of wireless biosignal collection devices for patient-centred decision-making in point-of-care systems. Healthc. Technol. Lett. 3(1), 34–40 (2016)
25. Panagopoulos, C., et al.: Utilizing a homecare platform for remote monitoring of patients with idiopathic pulmonary fibrosis. In: Vlamos, P. (ed.) GeNeDis 2016. AEMB, vol. 989, pp. 177–187. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57348-9_15
26. ImageJ Homepage: https://imagej.net


27. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly Media Inc., Sebastopol (2008)
28. Bay, H., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
29. Weka 3, Data Mining Software in Java Homepage: https://cs.waikato.ac.nz/ml/weka
30. Lyons, M.J., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding facial expressions with Gabor wavelets. In: 3rd IEEE International Conference on Automatic Face and Gesture Recognition, pp. 200–205 (1998)
31. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia (2007)
32. Chakhssi, F., Kraiss, J.T., Sommers-Spijkerman, M., Bohlmeijer, E.T.: The effect of positive psychology interventions on well-being and distress in clinical samples with psychiatric or somatic disorders: a systematic review and meta-analysis. BMC Psychiatry 18(1), 211 (2018)
33. Fouad, H.: Continuous health-monitoring for early detection of patient by web telemedicine system. In: International Conference on Circuits, Systems and Signal Processing, 23–25 September 2014. Saint Petersburg State Polytechnical University, Russia (2014)

FISUL: A Framework for Detecting Adverse Drug Events from Heterogeneous Medical Sources Using Feature Importance

Corinne G. Allaart1,2, Lena Mondrejevski1,2, and Panagiotis Papapetrou2(B)

1 Department of Learning, Informatics, Management and Ethics, Karolinska Institute, Solna, Sweden
{corinne.allaart,lena.schlegel}@stud.ki.se
2 Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
[email protected]

Abstract. Adverse drug events (ADEs) are considered to be highly important and critical conditions, accounting for around 3.7% of hospital admissions all over the world. Several studies have applied predictive models for ADE detection; nonetheless, only a restricted number and type of features have been used. In this paper, we propose a framework for identifying ADEs in medical records, by first applying the Boruta feature importance criterion, and then using the top-ranked features for building a predictive model as well as for clustering. We provide an experimental evaluation on the MIMIC-III database considering 7 types of ADEs, illustrating the benefit of the Boruta criterion for the task of ADE detection.

Keywords: Adverse drug events · Feature importance · Predictive models · Clustering

1 Introduction

Adverse drug events (ADEs) refer to diagnoses corresponding to injuries that result from the use of a drug, including harm caused by the normal use of a drug, drug overdose, and use-related harms, such as harms from drug dose reductions and discontinuation of drug administration [21]. ADEs possess high clinical relevance, as they account for approximately 3.7% of hospital admissions around the world [16]. Unfortunately, many ADEs are currently not being identified as such, due to limited knowledge about the effects of medical treatments, e.g., drugs being tested only in limited clinical trials under controlled conditions. An alternative approach is to resort to machine learning and the exploitation of the constantly growing amounts of information stored in electronic healthcare records (EHRs), so as to extract knowledge from past observations and learn how to identify new patient cases with a high risk of leading to an ADE. With the adoption of EHRs, the amount of healthcare documentation is larger than ever, and there are several efforts underway to involve patients in their healthcare process through the use of patient-generated data. While data management and machine learning models have traditionally been developed by utilizing information from structured data fields [1,14] as well as clinical text [12], little attention has been devoted to combining different data sources for the creation of richer overall models [33]. More importantly, these data sources are naturally characterized by a high degree of sparsity and missing values. Consider for example a drug prescription variable (e.g., beta-blockers), which is typically administered to patients suffering from heart-related disorders. We should expect that this variable will be substantially empty for patients not suffering from any heart disease. The problem of missing values in EHRs has been identified by several earlier studies [1,3]. More recently, Bagattini et al. [1] proposed three simple approaches for handling sparse features in EHRs for the task of ADE detection. Nonetheless, only one type of EHR feature was used, corresponding to blood test measurements before the occurrence of an ADE, while diagnosis codes and drug prescriptions were excluded from the study. Moreover, the goal of that paper was to define simple temporal abstractions that take into account such temporal features with high degrees of sparsity. The objective of our study in this paper is to take a different research angle and approach the problem using feature importance to assess the statistical significance of multiple, heterogeneous EHR features in terms of their predictive performance.

© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 139–151, 2019. https://doi.org/10.1007/978-3-030-19823-7_11
Moreover, we aim to define a more general approach to the problem of ADE detection in EHRs that can handle disparate feature types and, in the presence of sparse and noisy features, identify the subset of most significant class-distinctive features that can then be used for both classification and clustering of ADEs in EHRs. The main contributions of the paper include: (1) the formulation of a framework for identifying and assessing the importance of medical features in terms of their predictive performance, as well as their descriptive power, for the problem of ADE detection; (2) a framework that employs the Boruta feature importance criterion as a first step, and then pipelines the selected features into building a predictive model for ADE prediction, as well as into identifying clusters of patients under different ADE classes; (3) an extensive experimental evaluation on patient records obtained from the MIMIC database1, including patients with 7 ADE types, assessing (a) the predictive performance of four classification models using sets of features extracted by the Boruta criterion, as well as (b) the descriptive performance of clusters obtained using the highest-scoring features in terms of the Boruta criterion under K-medoids.

1 https://mimic.physionet.org.

2 Related Work

The wide usage of EHRs in medical research has recently increased the interest in the use of clinical data sources by medical practitioners as well as researchers from various fields [13,32]. Numerous research directions arise for the problem of ADE identification, which is the key focus of this study [13]. Compared to traditional data sources, such as spontaneous reports [26], as well as other popular resources, such as social media data [28], EHRs contain data types and information that allow for incidence estimation and provide class labels for supervised machine learning. Research on mining both structured and unstructured EHR data for ADE detection is nascent, see e.g. [10,11,25,29,33]. The traditional approach to ADE identification is performed before the deployment of a drug. This is achieved by several rounds of clinical trials, which, however, are hampered by the fact that only a limited sample of patients is usually employed and monitored for a short or limited time period. Consequently, the phenomenon of ADE under-reporting arises, as several serious ADEs are not detected during clinical trials but rather after the market deployment of a new drug. This typically results in several drugs being withdrawn from the market. These limitations can be overcome by defining and employing rules for ADE detection [4,8]. Machine learning offers an alternative for ADE detection by exploiting rich data features in EHRs, such as, for example, blood tests [23]. More importantly, the development and application of machine learning models, both supervised and unsupervised, in a clinical setting can facilitate substantial improvements in terms of ADE detection while maintaining low hospitalization and treatment costs. We can identify four major lines of research on learning from EHRs [17]: (1) detection and analysis of comorbidities, (2) clustering patients with similar characteristics, (3) supervised learning, and (4) cohort querying and analysis.
Examples within the above four categories include itemset mining, association rule extraction, and disproportionality analysis; prediction of critical healthcare and patient conditions, such as smoking status quantification for a patient [31]; patient safety and automated surveillance of ADEs [15]; comorbidity and disease networks [4]; processing of clinical text [11]; identification of suitable individuals for clinical trials [24]; as well as identification of temporal associations between medical events and first prescriptions of medicines for signaling the presence of an ADE [22].

3 The FISUL Framework

We present Feature Importance for Supervised and Unsupervised Learning (FISUL), a framework for predictive and descriptive modeling of ADEs from EHRs. FISUL has three phases: (1) feature importance, (2) predictive modeling, and (3) clustering. In Fig. 1 we provide an outline of the proposed framework. Next, we describe each phase in more detail.


Fig. 1. An outline of the FISUL framework.

3.1 Phase I: Boruta Feature Importance

We employ the Boruta method [19] as a feature importance criterion for reducing the number of data features. Boruta is a variable importance method that is defined for the random forest classifier, mainly measuring the total decrease in impurity from performing a feature split over all nodes of a tree, averaged over all trees in the random forest. The Boruta method was selected for its ability to provide an unbiased and stable selection of all relevant features [18]. The main idea is to create randomized copies of the existing features, merge the copies with the original data features, build a classifier using all features, including the randomized ones, and iteratively identify the most important features for the classification task at hand.

More concretely, let D be the original dataset and let F denote the original feature space. The key objective of Boruta is to define a mapping process T, such that T : F → F̂, where F̂ is a set of randomized features originating from F. The following steps are performed:

– Randomization: a replica, called a shadow feature, f̂_i ∈ F̂, is created for each feature f_i ∈ F by random permutation of its values; as a result, possible correlations that may exist between the original features and the class attribute are diminished.
– Model building: a random forest R is built using the union of the features F ∪ F̂. This procedure is repeated n times, i.e., for n iterations.
– Importance score: for each f_i ∈ F and f̂_i ∈ F̂, we define an importance score, called the Z-score, over all trees in R where each feature appears. The mean and standard deviation of the accuracy loss are defined as μ_{f_i}, μ_{f̂_i} and σ_{f_i}, σ_{f̂_i}, respectively, using the out-of-bag samples. Finally, the Z-scores of each feature f_i and each shadow feature f̂_i are defined as

    Z_{f_i} = μ_{f_i} / σ_{f_i}   and   Z_{f̂_i} = μ_{f̂_i} / σ_{f̂_i},

respectively. Intuitively, the Z-score reflects the degree of fluctuation of the mean accuracy loss among the trees in R.
– Statistical significance: for each original f_i ∈ F, we compute a statistical significance score using a two-tailed binomial test. More specifically, let Z^j_max be the maximum Z-score of all shadow features in iteration j, i.e.,

    Z^j_max = max_{f̂_i ∈ F̂} Z_{f̂_i}.    (1)

We use a vector, called the hit vector H, to store for each f_i ∈ F in how many iterations it achieved a Z-score higher than Z^j_max, i.e.,

    H_i = Σ_{j=1}^{n} 1_{{f_i : Z_{f_i} ≥ Z^j_max}}(f_i),    (2)

where 1 is the indicator function, i.e.,

    1_A(x) = 1 if x ∈ A, and 0 if x ∉ A.    (3)

If feature f_i performs significantly better than expected compared to its shadow features in terms of Z-score, it is marked as "important". Note that under the binomial distribution assumptions, the expected number of times H_i that f_i may outperform its shadow replicas is simply E(f_i, n) = n/2, with a standard deviation σ(f_i, n) = √(0.25n), assuming that H_i ∼ B(n, 0.5). Conversely, f_i is considered "important" if H_i is significantly higher than E(f_i, n). Finally, the features that survive the significance test constitute the set of Boruta features F′.

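The shadow-feature, hit-vector, and binomial-threshold logic described above can be sketched in a hedged way. This is not the authors' implementation: it replaces the random-forest Z-score with a toy importance function (absolute Pearson correlation with the label), but keeps the Boruta control flow.

```python
import numpy as np

def boruta_select(X, y, importance_fn, n_iter=50, z_crit=1.96, seed=0):
    """Boruta-style relevance test (sketch): each round, append a shadow
    (column-permuted) copy of X, score all columns, and count a 'hit' for
    every real feature scoring above the best shadow. Features whose hit
    count exceeds the binomial n/2 + z*sqrt(n/4) bound are kept."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    hits = np.zeros(p, dtype=int)
    for _ in range(n_iter):
        shadow = rng.permuted(X, axis=0)         # break feature/label link
        scores = importance_fn(np.hstack([X, shadow]), y)
        hits += scores[:p] > scores[p:].max()    # one hit per winning feature
    threshold = n_iter / 2 + z_crit * np.sqrt(0.25 * n_iter)
    return np.flatnonzero(hits > threshold)

def abs_corr(X, y):
    """Toy importance: absolute Pearson correlation with the label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs(Xc.T @ yc) / np.maximum(denom, 1e-12)
```

With one informative feature among noise, only that feature accumulates enough hits to pass the binomial bound; in a real pipeline `importance_fn` would be the random-forest Z-score of the paper.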
3.2 Phase II: Predictive Modeling

The set of Boruta features F′ extracted from Phase I is next passed to Phase II for building a predictive model using the new feature space. The main objective is to learn a classification function τ : o → y that assigns to a given data object o a class label from a set C of predefined class labels, such that y ∈ C. More specifically, we can couple τ with a set of features θ selected and employed during the training phase. In our case, the set of class labels corresponds to a selected set of ADEs. More information about the selected class labels can be found in Sect. 4. The training phase of a predictive model is more formally defined as τ = L(θ, T), where L is the learning function corresponding to a chosen predictive model and T is the training set. Finally, the label of a newly seen data example o is obtained by applying τ, configured with the same chosen feature set θ, i.e., y = τ(o; θ). In our framework, we choose the top-k most important Boruta features, i.e., θ = F′_k.

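A minimal sketch of this phase, assuming scikit-learn-style fit/predict classifiers; the `NearestCentroid` learner here is an illustrative stand-in, not one of the models benchmarked in Sect. 4.

```python
import numpy as np

def train_on_topk(X, y, feature_scores, k, learner):
    """Phase II sketch: keep the k highest-scoring (e.g. Boruta) features
    and fit any classifier on the reduced matrix. `learner` is assumed to
    expose fit(X, y) and predict(X)."""
    top = np.argsort(feature_scores)[::-1][:k]   # indices of top-k features
    model = learner.fit(X[:, top], y)
    return model, top

class NearestCentroid:
    """Minimal stand-in classifier for the sketch."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None]) ** 2).sum(-1)
        return self.classes_[d.argmin(1)]
```

At prediction time the same column subset `top` must be applied to new examples, mirroring y = τ(o; θ) with θ = F′_k.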
3.3 Phase III: Clustering

An alternative approach for exploiting F′ is clustering. The main objective is to define a partitioning G = {g_1, ..., g_K} of K groups, such that intra-group similarity is maximized and inter-group similarity is minimized.


Since in our case the data objects contain features that are not necessarily numerical, we employ K-medoids using the Gower distance. This distance function computes the average dissimilarity across the data objects. Let o_i, o_j be two data objects in our dataset and |θ| be the size of our feature space. The Gower distance is computed as follows:

    G_dist(o_i, o_j) = (1 / |θ|) Σ_{f=1}^{|θ|} d^f_{i,j},    (4)

where d^f_{i,j} is a function computing the dissimilarity of feature f between objects o_i and o_j, depending on the feature type, after standardizing each feature. For example, in the case of numerical features, d^f_{i,j} is defined as follows:

    d^f_{i,j} = |o^f_i − o^f_j| / Z_f,    (5)

where Z_f is the maximum distance range across all data objects and o^f_i, o^f_j denote the values of feature f for objects o_i, o_j, respectively. In the case of categorical features, d^f_{i,j} = 0 if o^f_i = o^f_j, and 1 otherwise. The final clustering is obtained by running K-medoids under the Gower distance given by Eq. 4, tuning K using the Silhouette coefficient [27] and selecting the K with the highest Silhouette value.
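A minimal sketch of the Gower distance of Eqs. (4)-(5) for a single pair of mixed-type objects; the feature kinds and numeric ranges Z_f are assumed to be precomputed over the whole dataset.

```python
def gower_distance(oi, oj, kinds, ranges):
    """Gower distance sketch: numeric features contribute |oi - oj| / Z_f,
    categorical features contribute a 0/1 mismatch; the result is the
    average over all features. `kinds[f]` is 'num' or 'cat'; `ranges[f]`
    holds the max-min range Z_f of numeric feature f (ignored for 'cat')."""
    d = 0.0
    for f, kind in enumerate(kinds):
        if kind == 'num':
            d += abs(oi[f] - oj[f]) / ranges[f]
        else:
            d += 0.0 if oi[f] == oj[f] else 1.0
    return d / len(kinds)
```

A pairwise matrix of these distances would then feed a PAM/K-medoids implementation, with K chosen by the Silhouette coefficient (not shown).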

4 Experimental Evaluation

We outline the experimental setup by first describing our dataset, the benchmarked methods, and the under-sampling procedure we used to tackle the high class imbalance, and finally we present our findings.

Dataset. We used the Medical Information Mart for Intensive Care III (MIMIC-III) database [30], a freely available medical database for intensive care (ICU) research, released in 2016 and comprising over 40,000 patients. Several studies have been conducted on this dataset using predictive models, such as prediction of hospital stay [9] or mortality rate [6]. However, little attention has been given to the prediction of ADEs, even though they are common in ICU patients [2]. In MIMIC-III the ADEs are coded as ICD-9 diagnosis codes, and for this study we explored 7 of the most commonly occurring codes depicting ADEs, grouped as caused by one of four specific drug classes: (1) antibiotics, (2) anticoagulants, (3) antineoplastic and immunosuppressive drugs, or (4) corticosteroids. We hence considered five datasets, one being the whole dataset including all ADEs, while each of the remaining four corresponded to one of the four specific drug classes. All hospital admissions where one of these drugs was prescribed were considered in the preprocessing, with a positive class label signifying at least one of the selected ADEs during the hospital stay. A summary of the used datasets is given in Table 1.


Four different types of features were selected from MIMIC-III, either based on previous relevance to ADE prediction or because they had been identified clinically as risk factors or indications of ADEs in critical care. These features were: admission characteristics, undergone procedures, laboratory tests, and prescribed drugs. For the last three, one-hot encoding was applied based on clinically relevant groupings of their coding systems. The NDC drug codes extracted from MIMIC-III were converted to ATC codes [20], as the ATC grouping system has more clinical relevance. The full set of selected features is described in Table 2. The drug-specific datasets excluded the drug feature group the ADE was caused by; for example, the dataset with ADEs caused by corticosteroids excluded the corticosteroid drug group as a feature.

Table 1. The table provides information about the whole dataset used for our experimental evaluation, and the four subset datasets of ADEs. For each dataset we indicate the total number of examples, the number and percentage of positive class labels, the gender ratio of the patients (in terms of % of female patients), and their average age.

                            | Whole dataset | Anticoagulants | Immunosuppressive | Corticosteroids | Antibiotics
Total # of examples         | 47506         | 42449          | 2389              | 12198           | 37145
# of positive class labels  | 2078          | 600            | 223               | 511             | 469
% of positive class labels  | 4.4           | 1.4            | 9.3               | 4.2             | 1.3
% of female patients        | 43.9          | 43.9           | 42.8              | 49.3            | 44.2
Average age                 | 58.8          | 61.9           | 57.2              | 61.2            | 58.4

Table 2. The table provides information on the features of the datasets used in our experimental evaluation. Per feature type, the total number of features for the whole dataset is indicated, as well as the type of grouping used for their one-hot encoding.

Feature type              | # of Features | Type of grouping
Admission characteristics | 5             | -
Procedures                | 18            | ICD-9 (a) procedure groups
Laboratory tests          | 10            | LOINC (b) (parent) groups
Prescribed drugs          | 94            | ATC (c) level 2

(a) http://icd9.chrisendres.com/index.php?action=contents
(b) https://loinc.org/groups/
(c) https://www.whocc.no
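The grouping-based one-hot encoding described above can be sketched as follows; the code-to-group lookup table and the example codes here are hypothetical, standing in for the ICD-9/LOINC/ATC groupings of Table 2.

```python
def one_hot_by_group(records, code_to_group, groups):
    """Sketch: map each admission's raw codes (e.g. ICD-9 procedures or
    ATC level-2 drug codes) to clinically meaningful groups, and emit one
    binary feature per group. `records` is a list of code lists;
    `code_to_group` is an assumed lookup table; `groups` fixes the
    output column order."""
    rows = []
    for codes in records:
        present = {code_to_group[c] for c in codes if c in code_to_group}
        rows.append([1 if g in present else 0 for g in groups])
    return rows
```

Unknown codes are simply dropped, which mirrors restricting the feature space to the predefined clinical groups.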

Setup. We benchmarked six predictive modeling techniques that have demonstrated competitive predictive performance in earlier works on ADE detection [1,33]: (1) Random Forests (RF100) with 100 trees, (2) simple Feed-Forward Neural Networks (NNet), (3) eXtreme gradient boosting (XGBoost), (4) SVM with a radial basis kernel (SVMRadial), (5) SVM with a polynomial kernel with degree 3 (SVMPolynomial), and (6) SVM with a linear kernel (SVMLinear). Due to the high class imbalance in all datasets, we performed under-sampling of the majority class for each dataset. All models used 3 feature sets: all features, the relevant features as selected by Boruta, and Boruta's top 10 (after under-sampling). The performance metrics were AUC and AUPRC, under 10-fold cross-validation. For clustering we used the original imbalanced datasets. We applied K-medoids for different values of K, using the Gower distance on the top-10 Boruta selected features.

Results. Next, we present our experimental findings for each of the three phases of the FISUL framework.

– Boruta feature importance. When applied to the whole dataset, the Boruta criterion rejected 56 features, mainly those that were extremely sparse (

    … > A then t ∈ RC
    if A > Amax then
        Save the current λ1 and λ2
    while λ1 and λ2 are not optimal do
        Define the new values of λ1 and λ2
        Evaluate A_DRV
    for x ∈ X_test do
        for t ∈ RC do
            Determine RC_f with the help of Eq. (2)
        MajorityVote(RC_f)

Dynamic Reliable Voting in Ensemble Learning

4 Experimentation Results and Discussion

To evaluate our algorithm, a series of experiments were performed on eight different datasets. The next section discusses the datasets used and the protocol, and then the results are presented.

4.1 Data Description and Protocol

Table 1 provides the information of the datasets that were used in these experiments. Eight datasets from the UCI repository [3] were used: Breast Cancer Wisconsin Diagnostic (BCWD), Vertebral Column (Vertebral), Ionosphere, Musk (version 1), Indians Diabetes (Diabetes), Spambase, Phishing Websites, and EEG Eye State (EES). These data were selected in order to study the behavior of our algorithm, ranging from BCWD (286 instances and 10 attributes) to EES (14980 instances and 15 attributes). The attributes vary between numeric (integer and real) and categorical. Since our focus is to study the reliability aspect of the various expertise of base predictors, we avoid using imbalanced datasets so that the performances of the algorithms are not distracted by these conditions. Class imbalance can be measured by the imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number of instances in the minority class [1]. Balanced data are indicated in Table 1 by an IR score close to the value 1.

Table 1. Dataset information.

ID | Dataset    | X     | Att. Num. | Att. Cat. | IR
1  | BCWD       | 286   | 0         | 9         | 2.365
2  | Vertebral  | 310   | 6         | 0         | 2.1
3  | Ionosphere | 351   | 33        | 0         | 1.786
4  | Musk       | 476   | 166       | 0         | 1.3
5  | Diabetes   | 768   | 8         | 0         | 1.866
6  | Spambase   | 4601  | 57        | 0         | 1.538
7  | Phishing   | 11055 | 0         | 30        | 1.257
8  | EES        | 14980 | 14        | 0         | 1.228

The experiments were conducted by train-test evaluation, and the data were split into a 67% training set and a 33% testing set. Five base classifiers were applied based on different knowledge representations to obtain diversity among the model combination. We used the Weka1 library to build the models of C4.5 (Decision Tree), Naive Bayes (Bayesian), JRip (Rule), Sequential Minimal Optimization (Function), and k-nearest neighbors (Lazy). Then, we evaluated our proposed algorithm against MV, WMV [9], Stacking, and Multi-Scheme (MS) by accuracy score. The parameters of all algorithms were not changed and we considered the default settings of Weka.
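As a hedged illustration of the imbalance ratio reported in Table 1, a minimal helper computing IR from class labels (the example counts are chosen to reproduce the BCWD value):

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = (# instances in majority class) / (# instances in minority class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

For instance, a 286-instance dataset split 201/85 yields IR ≈ 2.365, matching the BCWD row.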
Ensemble algorithms were tested in a condition where the base classifiers do not contain a spammer, as the first experiment. Then, 25 spammers were added to the base input as a second attempt. This second scenario, where the number of random predictors is higher than the number of original classifiers, is important for studying the reliability aspect of combination methods [7]. Both experiments were conducted in Java. We set the precision of the threshold p to 0.1, with the interval of the bin equal to 0.1.

1 http://www.cs.waikato.ac.nz/ml/weka/.
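The two voting baselines used throughout this section can be sketched generically; this is a hedged, minimal formulation (the weighting scheme in the cited WMV work may differ in detail).

```python
from collections import Counter

def majority_vote(predictions):
    """MV: each base classifier contributes one equal vote."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_majority_vote(predictions, weights):
    """WMV: votes are weighted, e.g. by each classifier's past accuracy."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

With many random spammers added to `predictions`, both rules are easily swamped, which is exactly the failure mode the second experiment probes.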

A. B. Raharjo and M. Quafafou

4.2 Results and Discussion

Table 2 shows the accuracy comparison between the base classifiers and the ensemble methods. The ID column represents the sequence number of the dataset according to Table 1. The best algorithm is defined as the one that has the highest accuracy score and the smallest standard deviation. kNN shows better results than the other learners on the side of the base classifiers, while C4.5 provides the smallest standard deviation among them. Four out of five ensemble methods exceed the average scores of all single classifiers. This scenario confirms the benefit of ensemble methods in giving a better accuracy score than relying on a single classifier. The three voting-based algorithms (MV, WMV, DRV) show superior average results compared to the results of Stacking and MS. It is normal to see that the voting-based methods have good results, since the base classifiers' scores are quite good. WMV and DRV have the same standard deviation score even though their accuracy scores on each dataset are different. If we consider the accuracy scores of the ensemble methods individually, DRV provides the highest accuracy for six datasets.

Table 2. Accuracy comparison between base classifiers and ensemble methods.

ID       | C4.5  | NB    | JRip  | SMO   | kNN   | MV    | WMV   | DRV   | Stacking | MS
1        | 0.663 | 0.642 | 0.653 | 0.663 | 0.663 | 0.653 | 0.674 | 0.663 | 0.642    | 0.642
2        | 0.786 | 0.748 | 0.786 | 0.796 | 0.806 | 0.757 | 0.757 | 0.777 | 0.748    | 0.777
3        | 0.897 | 0.906 | 0.889 | 0.906 | 0.880 | 0.923 | 0.923 | 0.923 | 0.889    | 0.880
4        | 0.823 | 0.816 | 0.709 | 0.816 | 0.835 | 0.835 | 0.835 | 0.842 | 0.816    | 0.816
5        | 0.711 | 0.719 | 0.703 | 0.699 | 0.727 | 0.797 | 0.770 | 0.785 | 0.695    | 0.789
6        | 0.907 | 0.898 | 0.916 | 0.937 | 0.926 | 0.937 | 0.942 | 0.943 | 0.943    | 0.911
7        | 0.943 | 0.927 | 0.941 | 0.940 | 0.966 | 0.966 | 0.957 | 0.966 | 0.956    | 0.966
8        | 0.779 | 0.671 | 0.713 | 0.727 | 0.796 | 0.796 | 0.766 | 0.796 | 0.770    | 0.796
mean     | 0.814 | 0.791 | 0.789 | 0.811 | 0.825 | 0.834 | 0.828 | 0.837 | 0.807    | 0.822
std dev. | 0.098 | 0.112 | 0.112 | 0.109 | 0.100 | 0.105 | 0.103 | 0.103 | 0.108    | 0.106

Table 3. Accuracy comparison of ensemble methods after 25 spammers were added.

ID       | MV    | WMV   | DRV   | Stacking | MS
1        | 0.6   | 0.653 | 0.663 | 0.642    | 0.642
2        | 0.657 | 0.755 | 0.777 | 0.748    | 0.777
3        | 0.769 | 0.376 | 0.88  | 0.863    | 0.88
4        | 0.747 | 0.589 | 0.829 | 0.854    | 0.816
5        | 0.664 | 0.641 | 0.734 | 0.664    | 0.711
6        | 0.799 | 0.605 | 0.924 | 0.918    | 0.937
7        | 0.815 | 0.46  | 0.966 | 0.955    | 0.966
8        | 0.673 | 0.559 | 0.796 | 0.7      | 0.796
mean     | 0.716 | 0.580 | 0.821 | 0.793    | 0.816
std dev. | 0.077 | 0.118 | 0.100 | 0.110    | 0.120


In contrast to the results of the first experiment, Table 3 shows significantly decreased values for MV and WMV after 25 spammers were added. DRV gives the best result, followed by MS, Stacking, MV, and WMV, respectively. MV tends to give similar accuracy scores for the eight datasets, as indicated by the smallest standard deviation, while the accuracy scores of WMV are the lowest among the others on six datasets. This means that the decisions of MV and WMV are distracted by the presence of the spammers, while DRV is able to select the best combination and to eliminate the weak predictions.

The ability of the ensemble methods to maintain their accuracy score is illustrated in Fig. 3. The X-axis shows the sequence of the datasets, while the Y-axis shows the absolute accuracy distance between the first and second trials (lower value is better). This score is formulated as ΔA = |A1 − A2|, where A1 is the accuracy value of the first experiment and A2 is the accuracy score in the presence of spammers. The distance scores of MV are higher than those of DRV, Stacking, and MS on all datasets, while WMV shows the highest accuracy instability. This measure allows us to see the effect of random predictors on the popular voting techniques. On the other hand, DRV improves on this drawback by considering the predictor reliability aspect, as indicated by low scores similar to those of MS and Stacking.

Fig. 3. Accuracy distance before and after 25 spammers were added for eight datasets (smaller value is better). Mean distances: MV 0.118, WMV 0.248, DRV 0.016, Stacking 0.025, MS 0.013.

Figure 4 illustrates the computation time of the five ensemble methods during the training phase. It consists of two conditions, where five base classifiers were used as the input (see Fig. 4a) and after the spammers were added (see Fig. 4b).


Fig. 4. Computation time comparison during training phase (smaller value is better).


The X-axis represents the eight datasets used and the Y-axis indicates the number of seconds needed, on a log scale. The values written in the diagram describe the lowest and highest time in each dataset. The performances of MV and WMV were computed from the sum of the training times of the base classifiers, while the score of DRV was obtained from MV plus the reliability diagram building time (see Eq. 3). Due to the same complexity, the MV line is not visible in the figure and is overwritten by the WMV line. According to Fig. 4a, all ensemble methods require similar time to train when the number of instances is less than 500. It also shows that the number of instances generally influences the computation time. However, the performances on BCWD and Diabetes show the opposite result due to their specific characteristics. The superiority of the voting-based methods compared to Stacking and MS can be seen on Diabetes, Spambase, Phishing, and EES. Similar results are also presented in Fig. 4b. Stacking and MS computed Vertebral, Ionosphere, and Musk faster than the others. On the contrary, the deviation between their running times and those of the voting algorithms in the second setup is greater than in the first experiment. MV, WMV, and DRV do not have varied results because the spammers do not need significant time to compute. Based on the comparison of the first and the second figures, the number of base classifiers clearly affects the computation time during the training phase.

5 Conclusion

A diverse group of classifiers is likely to make better decisions compared to a single learner. However, in the ensemble learning context, each classifier has its own performance. Hence, reliability is a crucial problem when such classifiers have contrasting performances. We propose dynamic reliable voting to solve the problem of how to select the best combination of reliable classifiers and how to handle uncertain labelers, i.e., spammers. The confidence score of a prediction is used as the main information to produce a reliability diagram of each algorithm, and several filters are set to select the best candidates. Five classifiers are chosen as the base models, and the voting combination of their predictions for each datum is changed dynamically according to the past experience of their probability estimates. The results show that our proposed algorithm provides a reliable performance against the previous approaches on eight datasets, before and after the presence of spammers. In future work, we will improve our approach to handle uncertainty and imbalanced classes. We will also enhance our algorithm to handle multi-class and multi-label classification.

References

1. García, V., Sánchez, J., Mollineda, R.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 25(1), 13–21 (2012). Special Issue on New Trends in Data Mining
2. Ho, T.K., Hull, J.J., Srihari, S.N.: Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 16(1), 66–75 (1994)
3. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
4. Murphy, A.H., Winkler, R.L.: Reliability of subjective probability forecasts of precipitation and temperature. J. R. Stat. Soc. Ser. C (Appl. Stat.) 26(1), 41–47 (1977)
5. Nachouki, G., Quafafou, M.: Mashup web data sources and services based on semantic queries. Inf. Syst. 25(2), 151–173 (2011)
6. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, pp. 61–74. MIT Press (1999)
7. Raharjo, A.B., Quafafou, M.: The combination of decision in crowds when the number of reliable annotator is scarce. In: Adams, N., Tucker, A., Weston, D. (eds.) IDA 2017. LNCS, vol. 10584, pp. 260–271. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68765-0_22
8. Rajnarayan, D., Wolpert, D.: Bias-variance trade-offs: novel applications. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 101–110. Springer, Boston (2010). https://doi.org/10.1007/978-0-387-30164-8
9. Raschka, S.: Mlxtend, April 2016. https://doi.org/10.5281/zenodo.594432
10. Raykar, V.C., Yu, S.: Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J. Mach. Learn. Res. 13, 491–518 (2012)
11. Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1), 1–39 (2010)
12. Elfirdoussi, S., Jarir, Z., Quafafou, M.: Ranking web services using web service popularity score. Int. J. Inf. Technol. Web Eng. 9(2), 78–89 (2014)
13. Valdovinos, R.M., Sánchez, J.S.: Combining multiple classifiers with dynamic weighted voting. In: Corchado, E., Wu, X., Oja, E., Herrero, Á., Baruque, B. (eds.) HAIS 2009. LNCS (LNAI), vol. 5572, pp. 510–516. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02319-4_61
14. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Chapter 4 - Algorithms: the basic methods. In: Data Mining, 4th edn., pp. 91–160. Morgan Kaufmann (2017)
15. Zadrozny, B., Elkan, C.: Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002, pp. 694–699. ACM, New York (2002)
16. Zhang, Y., Zhang, H., Cai, J., Yang, B.: A weighted voting classifier based on differential evolution. Abstr. Appl. Anal. 2014, 1–6 (2014). https://doi.org/10.1155/2014/376950

Extracting Action Sensitive Features to Facilitate Weakly-Supervised Action Localization

Zijian Kang1, Le Wang1 (B), Ziyi Liu1, Qilin Zhang2, and Nanning Zheng1

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, Shaanxi, People's Republic of China
[email protected]
2 HERE Technologies, Chicago, IL 60606, USA

Abstract. Weakly-supervised temporal action localization has attracted much attention among researchers in video content analytics, thanks to its relaxed requirement of video-level annotations instead of frame-level labels. However, many current weakly-supervised action localization methods depend heavily on naive feature combination and empirical thresholds to determine temporal action boundaries, which is practically feasible but could still be sub-optimal. Inspired by the momentum term, we propose a general-purpose action detection criterion that replaces explicit empirical thresholds. Based on this criterion, we analyze different combinations of streams and propose the Action Sensitive Extractor (ASE), which produces action sensitive features. Our ASE sets the temporal stream as the main stream and extends it with complementary spatial streams. We build our Action Sensitive Network (ASN) and evaluate it on THUMOS14 and ActivityNet1.2 with different selection methods. Our network yields state-of-the-art performance on both datasets.

Keywords: Action localization · Weakly-supervised · Two-stream

1 Introduction

Temporal action localization (TAL) in untrimmed videos has attracted increasing attention in recent years, and many methods [9,11,14,23,28,38,41,43] that greatly enhance performance have been developed. Because labeling action boundaries in untrimmed videos is expensive, some researchers [25,26,30,33,38] proposed using video-level action annotations to produce snippet-level action localization results, which greatly reduces the demand for human labeling while yielding comparable performance. These studies combine Multiple Instance Learning (MIL) [7] and attention mechanisms [25,26,38] with Deep Convolutional Neural Networks (DCNNs) to produce clip representations. An action detection criterion then maps the clip representations to a Class Activation Sequence (CAS), which determines which snippets contain action.

© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 188–201, 2019. https://doi.org/10.1007/978-3-030-19823-7_15


However, these weakly-supervised methods share two convenient assumptions that might be too optimistic in the real world. The first assumption is that empirical thresholds to determine temporal action boundaries can be obtained in a trivial manner. This implicit assumption could be far from reality, given the diversity of datasets and applications. The second assumption is that straightforward fusion strategies are adequate in weakly-supervised TAL because of the prevailing two-stream networks [3,32,39], where the CAS is either generated separately per stream and fused by weighted average [25,30,38], or generated from concatenated features [26] and regression methods [30]. With the two-stream network, each stream is independently trained via backpropagation and no interaction happens between streams. These two strategies are straightforward to implement, but we argue that there could be a better alternative. To address these challenges, we design a general-purpose action detection criterion and an alternative stream fusion strategy. Specifically, we design the action detection criterion based on an attention mechanism with a momentum-inspired threshold generated during the training stage. An analysis of stream combination options results in the proposed Action Sensitive Extractor (ASE), as shown in Fig. 1. Inspired by recent literature on spatial and temporal interaction [10,35], the proposed ASE prudently selects action sensitive features between the two streams and produces activations. In the ASE, we handle the spatial and temporal streams asymmetrically with respect to their different sensitivities to actions. With our action detection criterion and the ASE, we build the Action Sensitive Network (ASN) for weakly-supervised TAL.


Fig. 1. Illustration of different strategies to combine two-stream features. (a) Lateral fusion of two-stream features. (b) Concatenating two-stream features for processing. (c) Our action sensitive extractor.

The main contributions of this paper include (1) a comparative analysis of stream fusion strategies with the proposed Action Sensitive Extractor (ASE), and (2) a new flexible action localization criterion that generates a high-quality CAS. The performance gains of the proposed ASN algorithm are verified on two challenging public datasets.

2 Related Works

Video action analysis has been widely discussed in recent years. Most studies focus on action recognition in trimmed videos. Many novel structures have been proposed for videos [3,8,15,19] based on DCNN architectures [16,17,37]. The two-stream network [32] was one design that employs RGB images and optical flow with lateral fusion. Based on the two-stream network, the temporal segment network (TSN) [39] was proposed to analyze long-term temporal data. TSN has been used as a backbone in different tasks [38,43] with good performance. To further leverage optical flow, [35] proposed a novel structure for optical flow. The recently proposed SlowFast network [10] uses two pathways to process videos, similar to the two-stream network. In SlowFast, a fast pathway handles wide temporal motions and a slow pathway handles rich local details.

Action localization has greatly improved based on video action analysis. Many neural architectures and methods [9,13,21,24] have been developed for supervised learning. However, those studies rely heavily on annotations of action sequences, which are expensive to acquire. To incorporate more data in training, Sun et al. [34] proposed using web images and video-level annotations to handle TAL. Moreover, Hide-and-Seek [33] showed how to force a network to focus on the most discriminating parts. UntrimmedNet [38] designed a novel structure that trains a high-quality network on untrimmed videos and proposed a method that efficiently selects action segments. UntrimmedNet not only provides a good solution for localization but is also a good baseline model that generates local representations. Based on extracted feature representations, AutoLoc [30] developed an anchor generation and selection standard over feature sequences. W-TALC [26] and Nguyen et al. [25] explored feature-based networks with different auxiliary loss functions and attention mechanisms.

3 Action Sensitive Network

In this section, our proposed Action Sensitive Network is introduced. Section 3.1 describes the proposed ASE. Section 3.2 describes our momentum-inspired action detection criterion. The last subsection introduces the details of the ASN.

3.1 Action Sensitive Extractor

In this section, we propose models to extract action sensitive features. Our goal is to train a network that maximally leverages the action sensitive features in the two streams. Because actions are described by moving images in videos, the spatial stream, with a perception of only one frame, is unlikely to recognize actions directly, while the temporal stream has a wider temporal perception and is inherently sensitive at motion boundaries [27]. A detailed analysis can be found in our experiment section. Inspired by the SlowFast network [10], where the spatial (slow) and temporal (fast) streams are fed into different network architectures with different channels and


different temporal perception, we propose models that treat the temporal and spatial streams asymmetrically. In general, we use the learned action sensitive knowledge (inherited from the temporal stream) as the main stream, and we explore different structures to extract beneficial features that reinforce the main stream. We adopt the strategy of DenseNet [17] in that we concatenate our main stream and the reinforcing stream together for classification and attention calculation. We call our extraction model the Action Sensitive Extractor (ASE); different settings of the ASE are shown in Fig. 2. For simplicity, we still use a single fully-connected layer for the classification and attention branches. The ASE with classification and attention branches is referred to as the ASE model.


Fig. 2. Data flow of different ASE settings. We set the temporal stream as the main stream and the spatial stream as the reinforcing stream. The reinforced streams are then fed into the classification and attention branches. (a) Fusion model. (b) Bottleneck model. (c) Bilinear bottleneck model.

Fusion with Temporal Knowledge. To leverage temporal features that are related to actions, we propose to build a network that is initialized with temporal features and extended with spatial features. To achieve this goal, we adopt the method of [4]: our network on the fused (concatenated) features is initialized with pretrained temporal weights and zero spatial weights. For example, Eq. 1 shows the classification branch for the fused features. To inherit the knowledge in the temporal classifier, we set W^t and b^t to the pretrained temporal weights, while W^s and b^s are set to 0. We apply the same method to the attention branch.

c = W^f · x^f + b^f = [W^t, W^s] · [x^t; x^s] + b^t + b^s    (1)
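A minimal sketch of this initialization (dimensions and variable names are hypothetical stand-ins for the real backbone feature and class sizes): the fused classifier is seeded so that it initially reproduces the pretrained temporal-only classifier exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, d_s, n_cls = 1024, 1024, 20  # hypothetical feature/class sizes

# Stand-ins for the pretrained temporal classifier weights.
W_t = rng.standard_normal((n_cls, d_t)) * 0.01
b_t = rng.standard_normal(n_cls) * 0.01

# Fused classifier (Eq. 1): inherit temporal weights, zero the spatial half.
W_f = np.concatenate([W_t, np.zeros((n_cls, d_s))], axis=1)
b_f = b_t.copy()

x_t = rng.standard_normal(d_t)   # temporal feature of one clip
x_s = rng.standard_normal(d_s)   # spatial feature of one clip
x_f = np.concatenate([x_t, x_s])

# At initialization the fused branch reproduces the temporal branch.
assert np.allclose(W_f @ x_f + b_f, W_t @ x_t + b_t)
```

Training then updates the spatial half of W_f away from zero, so spatial features are only incorporated where they improve the loss.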

Bottleneck Model. To limit overfitting on spatial features, we further study limiting and distilling the spatial features. Different from former studies that enforce an auxiliary loss [26], we simply use a specially designed network architecture. As a first attempt, we use a bottleneck layer to extract knowledge from the spatial features. The bottleneck layer consists of dropout, a fully-connected layer, and ReLU activation. The features extracted from the bottleneck layer are concatenated with the temporal features and fed to the classification and attention branches. Our bottleneck layer extracts the most expressive spatial features that help to identify actions.
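The bottleneck step above can be sketched as follows (feature sizes are illustrative; the 64-d bottleneck mirrors the ablation settings reported later):

```python
import numpy as np

def bottleneck_concat(x_s, x_t, W_b, b_b, drop_mask=None):
    """Distill spatial features through a bottleneck (dropout -> fc -> ReLU)
    and concatenate the result with the temporal main stream."""
    h = x_s if drop_mask is None else x_s * drop_mask  # dropout (training only)
    h = np.maximum(W_b @ h + b_b, 0.0)                 # fc + ReLU
    return np.concatenate([x_t, h])

rng = np.random.default_rng(1)
d_s, d_t, d_b = 1024, 1024, 64   # hypothetical sizes, 64-d bottleneck
W_b = rng.standard_normal((d_b, d_s)) * 0.01
out = bottleneck_concat(rng.standard_normal(d_s), rng.standard_normal(d_t),
                        W_b, np.zeros(d_b))
assert out.shape == (d_t + d_b,)  # temporal stream kept whole, spatial distilled
```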


Bilinear Bottleneck Model. The bottleneck model removes unnecessary spatial features but cannot introduce interactions between streams. In recent works [5,40], bilinear layers were proposed to aggregate spatial-temporal features. To make use of the connection between streams, we propose to use a bilinear block to aggregate features. In our work, we use two fully-connected bottleneck layers to aggregate the features of each stream and a bilinear layer to combine the temporal and spatial features. We use dropout of 0.5 before the bottleneck layers and the bilinear layer, and ReLU activation after the fully-connected layers and the bilinear layer. The aggregated features are concatenated with the temporal features as in the bottleneck model. The hidden sizes of the bottleneck layers and the bilinear layer are set to the same value for simplicity.

3.2 Action Detection Criterion

Here we present our action detection criterion. In our study, we propose to trim a fixed proportion of clips as background, since the proportion of background frames is relatively stable within each dataset. We set our threshold as a quantile of the attention values during training; similar to batch normalization [18], where the mean and standard deviation of each batch are recorded and reused, this deals with fluctuation. The quantile level describes the desired proportion of clips: a quantile at 30% means that around 30% of the clips in each batch have lower attention than the quantile. For each batch in training, we sort the attentions of the clips in the batch and sample an attention value q at the desired level. The current quantile is then updated with a momentum factor according to Eq. 2. The quantile is fixed during testing.

q^{t+1} = α q^t + (1 − α) q    (2)
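The momentum-updated quantile can be sketched as follows (class and variable names are illustrative, not the authors' code; the uniform toy data is only for demonstration):

```python
import numpy as np

class QuantileThreshold:
    """Momentum-updated attention quantile (Eq. 2), analogous to the running
    statistics of batch normalization. Level 0.3 means roughly 30% of clips
    in a batch fall below the threshold and are treated as background."""
    def __init__(self, level=0.5, momentum=0.9):
        self.level, self.alpha, self.q = level, momentum, None

    def update(self, attentions):            # called once per training batch
        q_batch = np.quantile(attentions, self.level)
        self.q = q_batch if self.q is None else \
            self.alpha * self.q + (1 - self.alpha) * q_batch
        return self.q

rng = np.random.default_rng(0)
thr = QuantileThreshold(level=0.5, momentum=0.9)
for _ in range(100):                          # simulated training batches
    thr.update(rng.uniform(size=512))
# After training, thr.q is frozen; clips with attention below it are background.
```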

Our method is simple and cross-modality; it is easy to apply our action detection criterion to any attention-based localization problem across different settings. Note that the quantile may differ across datasets, since the proportion of background frames may differ.

3.3 Network Details

Having explained the key components, we now introduce the details of our ASN, as shown in Fig. 3. To efficiently look over a long video, we break videos into different levels. At the bottom level, each frame is represented separately; we use features from our two-stream pretrained DCNN model as the frame representation. The middle level is the clip level: we average the features sampled in a short temporal period as the clip representation, since nearby frames in videos should be correlated. To distill key knowledge and trim noise, we use the Action Sensitive Extractor to extract features that are fed to the classification and attention branches. The highest level is the video level, which is aggregated by the attention mechanism; this level corresponds to the annotations. In our study, we build on the extracted features of UntrimmedNet [38]. Following UntrimmedNet [38], we randomly sample 7 clips



Fig. 3. Our full network for action recognition and detection. We use our ASE model to produce frame activations and attentions. Video-level classification activations are optimized with video-level annotations. CAS is generated by action detection criterion. Action segments are selected based on CAS.

for each untrimmed video and 1 clip for trimmed video clips. For each clip, 3 frames are sparsely sampled as in TSN [39] and averaged as the clip representation. Two fully-connected layers are used to produce the classification and attention, respectively. Dropout of 0.5 is used only before the classification layer. To fuse the clip-level activations, we apply a softmax operation on the attentions x^a of clips 1 to t. The normalized attentions

x̄^a_i = exp(x^a_i) / Σ_{j=1}^{t} exp(x^a_j)

are used to fuse the clip-level classifications into the video-level prediction x^c, where x^c = Σ_{i=1}^{t} x̄^a_i x^c_i. Next, we apply a softmax operation over each dimension of the prediction and optimize with the multi-label cross-entropy loss:

l(x^c, y) = − Σ_i y_i log( exp(x^c_i) / Σ_j exp(x^c_j) )    (3)
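The attention fusion and loss above can be sketched as follows (a simplified stand-in with illustrative names; shapes follow the 7-clip sampling):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def video_prediction(clip_cls, clip_att):
    """Fuse clip-level class activations x^c_i into the video-level
    prediction x^c using softmax-normalized attentions."""
    w = softmax(clip_att)          # normalized attentions over t clips
    return w @ clip_cls            # x^c = sum_i w_i * x^c_i

def multilabel_ce(video_cls, y):
    """Multi-label cross-entropy (Eq. 3) on the video-level activation."""
    return -np.sum(y * np.log(softmax(video_cls)))

rng = np.random.default_rng(2)
t, n_cls = 7, 20                   # 7 sampled clips, hypothetical class count
x_c = video_prediction(rng.standard_normal((t, n_cls)), rng.standard_normal(t))
y = np.zeros(n_cls); y[3] = 1.0    # one ground-truth video-level label
loss = multilabel_ce(x_c, y)
assert x_c.shape == (n_cls,) and loss > 0
```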

During testing, we use a strategy similar to [38] and [30]. Clips are aggregated every 15 frames. The ASE model produces classification and attention activations for each clip. For video recognition, we first soften our attentions by a temperature factor (set to 3); then the clips are fused into the video representation according to their attentions, as in training. For video detection, we generate a CAS of size (clip number × class number) and feed it into the selection method. First, we apply a softmax operation on the clip classification activations. Second, we apply a threshold on the video-level prediction: clip activations of video-unrelated classes are set to 0 in the CAS. Third, we apply the attention-level threshold: clips with attentions lower than the threshold are set to 0 in the CAS. Finally, we feed the CAS into the selection method to generate action segments.
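The three CAS thresholding steps can be sketched as follows (a simplified stand-in, not the authors' implementation; all names and toy values are illustrative):

```python
import numpy as np

def generate_cas(clip_cls, clip_att, video_scores, cls_thr, att_thr):
    """Build the test-time CAS (clips x classes): softmax the clip
    activations, zero classes whose video-level score is below cls_thr,
    and zero clips whose attention is below the learned quantile att_thr."""
    e = np.exp(clip_cls - clip_cls.max(axis=1, keepdims=True))
    cas = e / e.sum(axis=1, keepdims=True)
    cas[:, video_scores < cls_thr] = 0.0   # drop video-unrelated classes
    cas[clip_att < att_thr, :] = 0.0       # drop background clips
    return cas

clip_cls = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # 3 clips, 2 classes
clip_att = np.array([0.9, 0.1, 0.8])
video_scores = np.array([0.7, 0.1])        # per-class video-level prediction
cas = generate_cas(clip_cls, clip_att, video_scores, cls_thr=0.5, att_thr=0.5)
assert np.all(cas[:, 1] == 0)  # class 1 is below the video-level threshold
assert np.all(cas[1, :] == 0)  # clip 1 is below the attention threshold
```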

4 Experiments

4.1 Dataset

THUMOS14 [20] has 101 classes for recognition, 20 of which are used for action detection. THUMOS14 includes a training set, a validation set, and a testing set. The training set consists of action video clips, while the validation and testing sets consist of untrimmed videos. In THUMOS14, 15 action instances cover 29% of a video on average [28]. We train our model on the training and validation sets and test on the testing set. ActivityNet1.2 [2] has 100 classes for both detection and recognition. It is divided into a training set, a validation set, and a test set. In ActivityNet, 1.5 action instances cover 64% of a video on average [28]. We train our model on the training set and test on the validation set.

4.2 Implementation Details

We train our ASN using features extracted by the pretrained UntrimmedNet model, which was trained on the same datasets and subsets as UntrimmedNet. We train our network with Nesterov momentum [36] of 0.9 and weight decay of 0.0005. The batch size is set to 512 for the THUMOS14 validation set and 8192 for the THUMOS14 training set. The batch size is set to 512 for ActivityNet1.2. On THUMOS14 [20], we train 80 epochs jointly on the training and validation sets. The learning rate is set to 0.1 and decayed by a factor of 10 at the 40th and 60th epochs. On ActivityNet1.2 [2], we train 160 epochs on the training set. The learning rate is set to 0.1 and decayed at the 80th and 120th epochs.

4.3 Ablation Study

In this section, we explore our action detection criterion at different levels of quantiles and with different model settings. For simplicity and efficiency, we use the naive approach of UntrimmedNet [38] as the selection method in the ablation study on THUMOS14. This method simply selects consecutive activated frames in the CAS. For a selected snippet spanning clips k to k + n with label v, the confidence score s is evaluated from the video-level activation c_v and the average clip activation, as shown in Eq. 4, where we use λ = 0.2 in our experiments.

s = (1 / (n + 1)) Σ_{i=k}^{k+n} c^i_v + λ c_v    (4)
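The naive selection with Eq. 4 scoring can be sketched for a single class column of the CAS (illustrative names and values, not the original code):

```python
import numpy as np

def select_segments(cas_v, c_v, lam=0.2):
    """Naive selection over one class column of the CAS: take maximal runs
    of activated clips and score each run with Eq. 4
    (mean clip activation + lambda * video-level activation c_v)."""
    segments = []
    n, i = len(cas_v), 0
    while i < n:
        if cas_v[i] > 0:
            j = i
            while j + 1 < n and cas_v[j + 1] > 0:
                j += 1
            score = cas_v[i:j + 1].mean() + lam * c_v
            segments.append((i, j, score))
            i = j + 1
        else:
            i += 1
    return segments

cas_v = np.array([0, 0.9, 0.8, 0, 0, 0.7, 0.6, 0.5, 0])
segs = select_segments(cas_v, c_v=0.95)
# two segments: clips 1-2 and clips 5-7, each with its Eq. 4 score
```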

Evaluation of Action Detection Criterion. To demonstrate the efficiency of our action detection criterion, we train our network 10 times and record performance on the testing set under different quantiles. As a baseline, the spatial (RGB) and temporal (Flow) models are treated separately. Different levels of quantiles are recorded and tested on the localization task. We train our network 10 times and quantiles


are recorded at levels from 10% to 90% in steps of 10%. To compare with former studies, we also apply our method to the pretrained weights provided by [38]; the quantiles of the pretrained models are recorded by running on the THUMOS14 validation set. The CAS of the two-stream model for localization is generated in two steps. First, the clip-level classifications after softmax are averaged. Second, the attention scores of each stream are normalized by their respective thresholds and averaged. Video-level recognition results are set to the average of the two streams.


Fig. 4. Localization mAP of the Flow and RGB models under different quantiles on THUMOS14. mAP is recorded under an IoU threshold of 0.5.

Table 1. Comparison of different settings on THUMOS14. We compare localization mAP (at the given IoU threshold) and recognition accuracy. UntrimmedNet uses a slightly different recognition strategy.

Models                  | 0.3   | 0.4   | 0.5   | 0.6   | 0.7  | Recognition
Flow pretrained         | 27.68 | 21.26 | 14.87 | 9.79  | 5.64 | 73.93%
RGB pretrained          | 15.32 | 8.76  | 5.02  | 2.80  | 1.35 | 72.29%
Two-stream pretrained   | 28.50 | 21.06 | 14.40 | 8.75  | 4.78 | 82.04%
UntrimmedNet [38]       | 28.2  | 21.1  | 13.7  | -     | -    | *82.2%
Flow stream             | 28.67 | 22.43 | 16.60 | 10.42 | 5.63 | 74.15%
RGB stream              | 15.32 | 8.76  | 5.02  | 2.80  | 1.35 | 72.29%
Fusion stream           | 20.40 | 14.50 | 9.50  | 5.56  | 3.03 | 75.61%
Two-stream              | 27.87 | 20.76 | 14.46 | 8.59  | 4.68 | 79.95%
Two-stream (RGB)        | 21.88 | 14.73 | 9.19  | 4.97  | 2.29 | -
Two-stream (Flow)       | 28.57 | 22.23 | 16.40 | 10.04 | 5.62 | -

The results of the spatial and temporal models under an IoU threshold of 0.5 are shown in Fig. 4. The performance of the pretrained models under different quantiles is shown in Fig. 5. For the different models, performance peaks near the 50% quantile. During training, we find that the attention quantiles fluctuate but performance is generally stable. Notably, spatial performance is worse and more unstable than temporal performance. We also compare our method with the original UntrimmedNet [38].


The performance of our best models under different settings is shown in Table 1. Our action detection criterion achieves high performance with only the temporal stream.

Evaluation of Streams. We evaluate different combinations of streams, as shown in Table 1: spatial (RGB), temporal (Flow), two-stream, and fusion stream (concatenated RGB and Flow features). We also examine the attention quality of each stream. Surprisingly, the temporal stream yields the best localization performance, while streams with spatial features perform poorly. The poor behavior of the spatial-related streams may be because trivial details in the spatial features cause overfitting. In addition, we analyze the attention of each stream. For the two-stream model, we fix the CAS and apply only the temporal or only the spatial attention to our criterion. We find that the two-stream model with temporal attentions yields high performance similar to the temporal stream, while with spatial attentions it yields low performance similar to the fusion stream. Our experiments show the difference in action sensitivity between the two streams. Combining temporal and spatial information usually yields higher performance in action recognition but lower performance in localization. We also find that the commonly used two-stream and fusion strategies are inefficient for the weakly-supervised localization task, performing worse than the single temporal stream.

Evaluation of ASE. We evaluate different ASE model settings. For the inherit strategy, we use our best Flow model as the initial weights. For the fusion model, bottleneck model, and bilinear bottleneck model, we compare training from scratch with the inherit strategy at a feature size of 64. We also compare the inherit strategy with feature sizes of 64 and 128 for the bottleneck and bilinear bottleneck models. The results are shown in Table 2.


Fig. 5. Localization mAP of the pretrained models under different quantiles on THUMOS14. mAP is recorded under an IoU threshold of 0.5.


Table 2. Comparison of different ASE model settings on THUMOS14 (localization mAP at the given IoU threshold, and recognition accuracy).

Models                  | 0.3   | 0.4   | 0.5   | 0.6   | 0.7  | Recognition
Flow ours               | 28.67 | 22.43 | 16.60 | 10.42 | 5.63 | 74.15%
Two-stream ours         | 27.87 | 20.76 | 14.46 | 8.59  | 4.68 | 79.95%
Fusion from scratch     | 20.40 | 14.50 | 9.50  | 5.56  | 3.03 | 75.61%
Fusion inherit          | 26.21 | 19.38 | 12.72 | 7.45  | 4.02 | 81.39%
Bottleneck64 inherit    | 32.73 | 24.84 | 17.36 | 11.12 | 6.42 | 78.16%
Bottleneck64 scratch    | 29.35 | 22.61 | 15.91 | 9.94  | 5.39 | 80.92%
Bottleneck128 inherit   | 32.33 | 25.13 | 17.60 | 10.69 | 5.64 | 79.10%
BiBottleneck64 inherit  | 31.89 | 25.35 | 17.74 | 11.29 | 6.23 | 78.20%
BiBottleneck64 scratch  | 29.12 | 22.84 | 16.27 | 10.30 | 5.99 | 74.73%
BiBottleneck128 inherit | 32.21 | 25.34 | 18.16 | 11.42 | 6.23 | 78.57%

Compared with training from scratch, the inherit strategy greatly improves recognition and localization, except for the bottleneck model, where only localization is slightly improved. This phenomenon may indicate that our bottleneck model has already restrained overfitting. For the bottleneck and bilinear bottleneck models, increasing the feature size from 64 to 128 slightly improves performance. In recognition, the fusion model has the highest performance because it can access the full information, which also confirms that our bottleneck structure does restrain information. For localization, the bottleneck and bilinear bottleneck models perform much better than the fusion model. The bilinear bottleneck models perform slightly better than the bottleneck models, which indicates that the bilinear layer does improve interaction. The high performance of our proposed ASE models shows their ability to extract action sensitive features.

4.4 Experiments on AutoLoc

To evaluate our final Action Sensitive Network, we use AutoLoc [30] as the selection method and compare with state-of-the-art results. AutoLoc incorporates the Outer-Inner-Contrastive (OIC) loss, which evaluates action snippets accurately. To further adjust performance, we increase the weight of the outer boundary in OIC as follows:

L_OIC = λ A_o(φ) + A_i(φ)    (5)

On THUMOS14, we set λ to 2. We also increase the boundary inflation rate to 0.35. These settings help AutoLoc select the most distinguishable action snippets. We add more offset anchors to AutoLoc and use it only as a selection method over the CAS. We show the performance of our bilinear bottleneck model with feature size 128 and the inherit strategy in Table 4. For ActivityNet1.2 [2], we set λ to 5 and the boundary inflation to 0.7, and we use the quantile at 10%. Our results are shown in Table 3. Compared with other weakly-supervised TAL methods, our method has an advantage especially under higher IoU thresholds and reaches the state-of-the-art level on both datasets.


Table 3. Comparison with state-of-the-art methods on ActivityNet1.2 in terms of action localization mAP under different IoU thresholds. We only list weakly-supervised methods. All results in this table are based on UntrimmedNet features. We describe the selection methods we used in brackets.

Models            | 0.5  | 0.55 | 0.6  | 0.65 | 0.7  | 0.75 | 0.8  | 0.85 | 0.9 | 0.95 | Avg.
UntrimmedNet [38] | 7.4  | 6.1  | 5.2  | 4.5  | 3.9  | 3.2  | 2.5  | 1.8  | 1.2 | 0.7  | 3.6
AutoLoc [30]      | 27.3 | 24.9 | 22.5 | 19.9 | 17.5 | 15.1 | 13.0 | 10.0 | 6.8 | 3.3  | 16.0
W-TALC [26]       | 37.0 | 33.5 | 30.4 | 25.7 | 14.6 | 12.7 | 10.0 | 7.0  | 4.2 | 1.5  | 18.0
ASN (AutoLoc)     | 29.8 | 27.1 | 25.0 | 23.1 | 21.2 | 18.6 | 16.1 | 13.1 | 9.6 | 4.4  | 18.8

Table 4. Comparison with state-of-the-art methods on THUMOS14 in terms of action localization mAP under different IoU thresholds. All weakly-supervised results are based on UntrimmedNet features. We describe the selection methods we used in brackets.

Supervision | Models             | 0.3  | 0.4  | 0.5  | 0.6  | 0.7
Full        | S-CNN [31]         | 36.3 | 28.7 | 17.0 | 10.3 | 5.3
Full        | Yuan et al. [42]   | 33.6 | 26.1 | 18.8 | -    | -
Full        | CDC [29]           | 40.1 | 29.4 | 23.3 | 13.1 | 7.9
Full        | Dai et al. [6]     | -    | 33.3 | 25.6 | 15.9 | 9.0
Full        | SSAD [22]          | 43.0 | 35.0 | 24.6 | -    | -
Full        | Turn Tap [12]      | 44.1 | 34.9 | 25.6 | -    | -
Full        | R-C3D [41]         | 44.7 | 35.6 | 28.9 | -    | -
Full        | SS-TAD [1]         | 45.7 | -    | 29.2 | -    | 9.6
Full        | Gao et al. [11]    | 50.1 | 41.3 | 31.0 | 19.1 | 9.9
Full        | SSN [43]           | 51.9 | 41.0 | 29.8 | 19.6 | 10.7
Full        | BSN [23]           | 53.5 | 45.0 | 36.9 | 28.4 | 20.0
Weak        | Sun et al. [34]    | 8.5  | 5.2  | 4.4  | -    | -
Weak        | Hide and Seek [33] | 19.5 | 12.7 | 6.8  | -    | -
Weak        | UntrimmedNet [38]  | 28.2 | 21.1 | 13.7 | -    | -
Weak        | AutoLoc [30]       | 35.8 | 29.0 | 21.2 | 13.4 | 5.8
Weak        | W-TALC [26]        | 32.0 | 26.0 | 18.8 | -    | 6.2
Weak        | STPN [25]          | 31.1 | 23.5 | 16.2 | 9.8  | 5.1
Weak        | ASN (Naive)        | 32.2 | 25.3 | 18.2 | 11.4 | 6.2
Weak        | ASN (AutoLoc)      | 35.9 | 29.4 | 22.8 | 15.2 | 7.3

5 Conclusion

We propose a general action detection criterion that can generate a high-quality CAS and can be applied to different modalities. Based on this thresholding method, we analyze the performance of different combinations of streams. According to our


experiments, the spatial and temporal streams contain different information and have different sensitivities to actions. To combine the two streams properly, we propose our novel Action Sensitive Network. The two-stream features are treated asymmetrically to produce accurate representations without losing sensitivity to actions. We use the ASE model to produce clip features and a CAS that can be applied to different selection methods. Our network yields state-of-the-art performance with AutoLoc as the selection method. In the future, we can investigate higher-level relationships between different streams and apply our method to more modalities.

Acknowledgement. This work was supported partly by National Key R&D Program of China Grant 2017YFA0700800, National Natural Science Foundation of China Grants 61629301 and 61773312, and the Young Elite Scientists Sponsorship Program by CAST Grant 2018QNRC001.

References

1. Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., Niebles, J.: End-to-end, single-stream temporal action detection in untrimmed videos. In: BMVC (2017)
2. Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)
3. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the Kinetics dataset. In: CVPR (2017)
4. Chen, T., Goodfellow, I., Shlens, J.: Net2Net: accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641 (2015)
5. Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J.: A²-Nets: double attention networks. In: NIPS (2018)
6. Dai, X., Singh, B., Zhang, G., Davis, L.S., Qiu Chen, Y.: Temporal context network for activity localization in videos. In: ICCV (2017)
7. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 31–71 (1997)
8. Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)
9. Escorcia, V., Caba Heilbron, F., Niebles, J.C., Ghanem, B.: DAPs: deep action proposals for action understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 768–784. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_47
10. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982 (2018)
11. Gao, J., Yang, Z., Nevatia, R.: Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180 (2017)
12. Gao, J., Yang, Z., Sun, C., Chen, K., Nevatia, R.: TURN TAP: temporal unit regression network for temporal action proposals. In: ICCV (2017)
13. Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR (2015)
14. Gudi, A., van Rosmalen, N., Loog, M., van Gemert, J.: Object-extent pooling for weakly supervised single-shot localization. arXiv preprint arXiv:1707.06180 (2017)
15. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_23
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
17. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
18. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
19. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 221–231 (2013)
20. Jiang, Y.G., et al.: THUMOS challenge: action recognition with a large number of classes (2014). http://crcv.ucf.edu/THUMOS14/
21. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
22. Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: Proceedings of the 2017 ACM on Multimedia Conference (2017)
23. Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: BSN: boundary sensitive network for temporal action proposal generation. arXiv preprint arXiv:1806.02964 (2018)
24. Ma, S., Sigal, L., Sclaroff, S.: Learning activity progression in LSTMs for activity detection and early detection. In: CVPR (2016)
25. Nguyen, P., Liu, T., Prasad, G., Han, B.: Weakly supervised action localization by sparse temporal pooling network. In: CVPR (2018)
26. Paul, S., Roy, S., Roy-Chowdhury, A.K.: W-TALC: weakly-supervised temporal activity localization and classification. arXiv preprint arXiv:1807.10418 (2018)
27. Sevilla-Lara, L., Liao, Y., Guney, F., Jampani, V., Geiger, A., Black, M.J.: On the integration of optical flow and action recognition. arXiv preprint arXiv:1712.08416 (2017)
28. Seybold, B., Ross, D., Deng, J., Sukthankar, R., Vijayanarasimhan, S., Chao, Y.W.: Rethinking the faster R-CNN architecture for temporal action localization (2018)
29. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR (2017)
30. Shou, Z., Gao, H., Zhang, L., Miyazawa, K., Chang, S.-F.: AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 162–179. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_10
31. Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR (2016)
32. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (2014)
33. Singh, K.K., Lee, Y.J.: Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In: ICCV (2017)
34. Sun, C., Shetty, S., Sukthankar, R., Nevatia, R.: Temporal localization of fine-grained actions in videos by domain transfer from web images. In: Proceedings of the 23rd ACM International Conference on Multimedia (2015)
35. Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W.: Optical flow guided feature: a fast and robust motion representation for video action recognition. In: CVPR (2018)
36. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: ICML (2013)
37. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol. 4, p. 12 (2017)
38. Wang, L., Xiong, Y., Lin, D., Van Gool, L.: UntrimmedNets for weakly supervised action recognition and detection. In: CVPR (2017)
39. Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
40. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)
41. Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: ICCV (2017)
42. Yuan, J., Ni, B., Yang, X., Kassim, A.A.: Temporal action localization with pyramid of score distribution features. In: CVPR (2016)
43. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV, October 2017

Image Recognition Based on Combined Filters with Pseudoinverse Learning Algorithm

Xiaodan Deng1, Xiaoxuan Sun1, Ping Guo1, and Qian Yin2

1 Image Processing and Pattern Recognition Laboratory, School of Systems Science, Beijing Normal University, Beijing 100875, China
[email protected], [email protected], [email protected]
2 Image Processing and Pattern Recognition Laboratory, College of Information Science and Technology, Beijing Normal University, Beijing, China
[email protected]

Abstract. The deep convolutional neural network (CNN) is one of the most popular deep neural networks (DNNs) and has achieved state-of-the-art performance in many computer vision tasks. The most common way to train a DNN is a gradient descent-based algorithm such as backpropagation. However, the backpropagation algorithm usually suffers from vanishing or exploding gradients, and it relies on repeated iterations to reach an optimal result. Moreover, because many convolutional kernels have to be learned, the traditional convolutional layer is the main computational bottleneck of deep CNNs. Consequently, current deep CNNs are inefficient in terms of computing resources and computing time. To address these problems, we propose a method that combines Gabor kernels, random kernels and pseudoinverse kernels, together with the pseudoinverse learning (PIL) algorithm, to speed up DNN training. With multiple fixed convolution kernels and the pseudoinverse learning algorithm, the proposed method is simple and efficient to use. The performance of the proposed model is tested on the MNIST and CIFAR-10 datasets without using a GPU. Experimental results show that our model is faster than existing benchmark methods while achieving comparable recognition accuracy.

Keywords: Pseudoinverse learning autoencoder · Gabor kernel · Random kernel · Image recognition · Ensemble learning



1 Introduction

Recently, deep convolutional neural networks (CNNs) have been overwhelmingly successful across a variety of visual perception tasks. LeNet5 [1], designed by Yann LeCun and Yoshua Bengio in 1998, is considered the beginning of CNNs. Over the past several years, many successful CNN architectures have emerged, such as AlexNet [2], VGG [3], GoogLeNet [4], ResNet [5, 6], MobileNet [7], and DenseNet [8]. Most deep neural networks are trained by gradient descent (GD) based algorithms and their variations [1, 3]. However, the gradient descent based algorithm in deep neural networks has an inherent instability. This instability blocks the learning process of the previous or later layers. Though CNNs give good results, they require much professional knowledge to use and take a lot of time to train.

In this paper, we propose a method that combines Gabor kernels [9], random kernels and pseudoinverse kernels, corresponding to multiple convolutional kernels. The Gabor feature produced by a Gabor kernel is a handcrafted feature that can be obtained faster than learned features. In paper [10], a perturbation layer is proposed as an alternative to the convolutional layer; their theoretical analysis shows that the perturbation layer can approximate the response of a standard convolutional layer. Inspired by perturbative neural networks, we propose a kind of random kernel with the same size as the input data. The pseudoinverse learning algorithm was proposed by Guo et al. [11–13]; it is a fast feedforward training algorithm. In our method, a random weight is used as the input weight of the pseudoinverse learning algorithm. As a result, the training time is reduced significantly and the random weight regularizes the whole model.

Our model combines multiple fixed convolutional kernels, namely Gabor kernels, random kernels and pseudoinverse kernels. The parameters of the convolutional kernels can be obtained without iteration, so the training process is accelerated. Moreover, the different kernels contribute different image features, which facilitates the recognition task. Instead of a gradient-based algorithm, the pseudoinverse learning algorithm is used to speed up the training process significantly. Three base learners are trained, and their predictions are fed to a meta learner to obtain the final result. Our model was tested on the MNIST and CIFAR-10 datasets without using a GPU. The experimental results show that our model is faster than existing benchmark methods while achieving comparable recognition accuracy.

© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 202–209, 2019. https://doi.org/10.1007/978-3-030-19823-7_16
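The stacking scheme described here (three base learners feeding a meta learner) can be sketched as follows; the linear averaging meta learner and the toy class probabilities are our own illustrative assumptions, not the authors' exact configuration:

```python
import numpy as np

def meta_predict(base_probs, meta_weights):
    """Stack base learners: concatenate their class-probability predictions and
    feed them to a (here linear, hypothetical) meta learner; return the class."""
    stacked = np.concatenate(base_probs)   # predictions of the 3 base learners
    return int(np.argmax(meta_weights @ stacked))

# toy example: 3 base learners, 2 classes, meta learner that simply averages
probs = [np.array([0.6, 0.4]), np.array([0.7, 0.3]), np.array([0.4, 0.6])]
W = np.hstack([np.eye(2) / 3] * 3)         # averaging meta weights, shape (2, 6)
print(meta_predict(probs, W))  # 0
```

A trained meta learner would replace the fixed averaging weights W with weights fitted on the base learners' predictions.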

2 Related Work

Recently, random features have attracted researchers' attention and have shown significant success in many research fields. The test-of-time award paper at NIPS 2017 [14] presented two methods, random Fourier features and random binning features, to map the input data to random features. Random feature mapping speeds up the training of large-scale kernel methods. Perturbative Neural Networks [10] presented a perturbative layer as an alternative to the convolutional layer. The perturbative layer computes its response as a weighted linear combination of non-linearly activated, additive-noise-perturbed inputs. The input data with an added random, fixed noise is a kind of random feature. The perturbation layer in [10] suggests that convolutional layers may not need to be learned from the input image: perturbative neural networks perform as well as standard convolutional neural networks.

The pseudoinverse learning algorithm was originally proposed by Guo et al. [11–13] and is a fast feedforward training algorithm. As a variant of the pseudoinverse learning algorithm, the pseudoinverse learning autoencoder [15] is a useful method to train multilayer neural networks. Our previous works combined handcrafted features with the pseudoinverse learning algorithm [16, 17]. These works perform well in terms of training time; however, their accuracy is not satisfactory, especially on complicated data sets. In this paper, we propose a method that combines multiple fixed convolutional kernels and uses the pseudoinverse learning algorithm to accelerate the training. Our method is faster than other baseline methods and obtains comparable accuracy. Meanwhile, the proposed method does not need large computing resources, so it can meet the needs of edge learning.
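To illustrate the pseudoinverse learning idea referenced above [11–13], a single layer can be trained in closed form: a fixed random input weight produces hidden activations, and the output weight is solved with the Moore–Penrose pseudoinverse instead of gradient iterations. The dimensions and the tanh activation below are illustrative assumptions, not the authors' exact PIL configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def pil_train(X, Y, p=64):
    """One PIL-style layer: fixed random input weight V, closed-form output
    weight W = pinv(H) @ Y -- no gradient iterations."""
    V = rng.uniform(-1.0, 1.0, size=(X.shape[1], p))  # random input weight
    H = np.tanh(X @ V)                                 # hidden activations
    W = np.linalg.pinv(H) @ Y                          # pseudoinverse solution
    return V, W

def pil_predict(X, V, W):
    return np.tanh(X @ V) @ W

X = rng.random((100, 20))   # 100 samples, 20 features (toy data)
Y = rng.random((100, 3))    # 3 output targets
V, W = pil_train(X, Y)
print(pil_predict(X, V, W).shape)  # (100, 3)
```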

3 Proposed Methodology

3.1 Gabor Kernel

Base learner 1 is shown in Fig. 1. Features are first extracted from the input image by Gabor kernels and then trained by PIL1 [18]. PIL1 is the original pseudoinverse learning algorithm with a Gaussian noise perturbation matrix added. The Gabor kernel corresponds to the convolutional kernel, and the Gabor feature corresponds to the convolutional feature, as shown in formula (1):

$$I_G = I * G, \qquad (1)$$

where $I$ is the grayscale distribution of the image, $I_G$ is the feature extracted from $I$, "$*$" stands for the 2D convolution operator, and $G$ is the defined Gabor kernel. As a handcrafted feature, the Gabor feature can be obtained faster than learned features. Meanwhile, multiple Gabor features will facilitate the recognition.

Fig. 1. The Gabor kernels are varied and are set in advance.
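Formula (1) can be illustrated with a small fixed Gabor filter bank. The kernel parameters (sigma, wavelength, aspect ratio) and the bank size below are illustrative assumptions; the filtering helper is a plain zero-padded "same"-size pass rather than a library call:

```python
import numpy as np

def gabor_kernel(size, theta, sigma=2.0, lam=4.0, gamma=0.5):
    """Real part of a 2D Gabor kernel with orientation theta (the G of formula (1))."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def filter2d_same(img, k):
    """'Same'-size 2D filtering with zero padding (cross-correlation; the
    distinction from true convolution is immaterial for this illustration)."""
    kh, kw = k.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return out

# I_G = I * G for a small bank of fixed Gabor kernels (4 orientations)
img = np.random.rand(28, 28)                       # stand-in for a grayscale digit
bank = [gabor_kernel(7, k * np.pi / 4) for k in range(4)]
feats = [filter2d_same(img, g) for g in bank]
print(len(feats), feats[0].shape)  # 4 (28, 28)
```

Because the kernels are fixed in advance, this feature extraction needs no training at all, which is the source of the speed-up over learned convolutional kernels.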

3.2 Random Kernel

The second base learner is shown in Fig. 2. The PIL1 part is the same as described in Sect. 3.1; the difference is in the front part. The input image is added to a random kernel that has the same size as the input image. The values in the random kernel are drawn from a specific distribution; both the Gaussian and the uniform distribution work well. It is better to keep the mean of the drawn values at zero [19, 20]. At the same time, the noise values should be small, otherwise the original information in the input is covered by heavy noise. The noise-added features are activated by ReLU and then combined by linear weights. The obtained feature is as follows:

$$F = \sum_{i=1}^{q} W_i \cdot f_{relu}(X + R_i), \qquad (2)$$

where $q$ is the number of random features and $R_i$ is the $i$-th random kernel matrix.
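Formula (2) can be sketched directly; the noise scale and the Gaussian combination weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernel_feature(X, q=8, noise_scale=0.05):
    """F = sum_i W_i * relu(X + R_i), as in formula (2)."""
    feats = []
    for _ in range(q):
        R = rng.normal(0.0, noise_scale, size=X.shape)  # zero-mean, small noise
        feats.append(np.maximum(X + R, 0.0))            # ReLU activation
    W = rng.normal(0.0, 1.0, size=q)                    # linear combination weights
    return sum(w * f for w, f in zip(W, feats))

X = rng.random((28, 28))        # stand-in for an input image
F = random_kernel_feature(X)
print(F.shape)  # (28, 28)
```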


Fig. 2. The random kernel is a random noise matrix with the same size as the input image.

Random features are obtained by adding random noise to the input image. This is the simplest and fastest way to get random features. Moreover, adding noise to the input data of a neural network can regularize the performance.

3.3 Pseudoinverse Kernel

The third base learner is shown in Fig. 3. The input image is sent to PIL0 [18]. The input weight of PIL0 is a random weight whose values are within a small scale, such as [−1, 1]. The number of input data is n, and the number of hidden neurons is p (p

$$\text{reward} = \begin{cases} 1, & g_{profit} > 0 \\ 0, & g_{profit} = 0 \\ -1, & g_{profit} < 0 \end{cases} \qquad (9)$$
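The ternary profit reward of formula (9), together with a Sharpe-ratio style reward [20] for comparison, can be sketched as follows. The Sharpe computation is a generic textbook form, not necessarily the authors' exact implementation:

```python
import numpy as np

def profit_reward(g_profit):
    """Ternary reward of formula (9): the sign of the achieved profit."""
    return int(np.sign(g_profit))

def sharpe_reward(returns, risk_free=0.0):
    """Sharpe-ratio style reward [20]: mean excess return over its std deviation."""
    excess = np.asarray(returns, dtype=float) - risk_free
    return excess.mean() / excess.std()

print(profit_reward(12.5), profit_reward(0.0), profit_reward(-3.0))  # 1 0 -1
```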

Figure 1 shows the Q-learning trading system based on a Double Deep Q-learning Network with Sharpe ratio reward function.

Fig. 1. Double Deep Q-learning trading system with Sharpe reward function.

The Q-learning trading system based on the D-DQN is composed of 2 CNN layers with 120 neurons each. In the case of DD-DQN, 2 CNN layers with 120 neurons each are followed by two streams of FC layers: the first with 60 neurons dedicated to estimating the value function, and the second with 60 neurons to estimating the advantage function. In both cases, the number of epochs is set to 40, as is the batch size. For weight optimization, the ADAM algorithm [11] is applied. The loss function is the Mean Squared Error, $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The activation function is the Leaky Rectified Linear Unit (Leaky ReLU) function [14]. The discount factor, γ, is set to 0.98 in both D-DQN and DD-DQN. A similar setting is also used to implement the trading strategies based on DQN.
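The aggregation behind the two-stream DD-DQN head follows the dueling architecture of Wang et al. [22]. A minimal sketch of the combination step, with hypothetical stream outputs for three trading actions:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the value and advantage streams of a dueling DQN:
    Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))  (Wang et al. [22])."""
    return value + (advantages - advantages.mean())

# hypothetical outputs of the two 60-neuron FC streams for 3 actions
V = 1.5                             # value-stream estimate V(s)
A = np.array([0.2, -0.1, -0.1])     # advantage-stream estimates A(s, a)
Q = dueling_q(V, A)
print(Q)
```

Subtracting the mean advantage keeps the decomposition identifiable, so the two streams cannot trade a constant offset back and forth.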

G. Lucarelli and M. Borrotti

5 Experimental Data and Results

5.1 Bitcoin Historical Data

The proposed Q-learning trading systems are tested on bitcoin historical data. The data can be found on the well-known Kaggle (www.kaggle.com) platform1. We considered the bitcoin price in US dollars from the 1st December 2014 to the 27th June 2018, sampled at 1-minute intervals. For each observation, the time stamp, OHLC (Open, High, Low, Close) values, volume in bitcoin, volume in US dollars, and weighted bitcoin price are collected. The dataset is composed of roughly 2 million rows and 8 variables. Based on the time stamp, the data is aggregated hourly, yielding a final dataset with more than 30,000 observations and the same number of variables.
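The hourly aggregation step can be sketched with plain numpy (a pandas resample would be the more common route); the toy 1-minute series and the per-column aggregation rules are illustrative assumptions:

```python
import numpy as np

# hypothetical 1-minute close prices and volumes for 3 full hours (180 samples)
close_1min = np.linspace(300.0, 320.0, 180)
vol_1min = np.ones(180)

# hourly aggregation: keep the last close of each hour, sum the volume per hour
close_hourly = close_1min.reshape(3, 60)[:, -1]
vol_hourly = vol_1min.reshape(3, 60).sum(axis=1)
print(close_hourly.shape, vol_hourly[0])  # (3,) 60.0
```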

Fig. 2. Average percentage returns over the 10 trading periods, i.e. different combinations of start and end dates for the trading activity.

1 www.kaggle.com/mczielinski/bitcoin-historical-data.

Q-learning Trading System

5.2 Results

The Q-learning trading system is tested with four different settings based on:

1. Double Deep Q-Network with a profit reward function (ProfitD-DQN);
2. Double Deep Q-Network with a Sharpe ratio reward function (SharpeD-DQN);
3. Dueling Double Deep Q-Network with a profit reward function (ProfitDD-DQN);
4. Dueling Double Deep Q-Network with a Sharpe ratio reward function (SharpeDD-DQN).

The four settings are compared with a Deep Q-Network with a profit reward function (ProfitDQN) and a Deep Q-Network with a Sharpe ratio reward function (SharpeDQN).

Table 1. Average performance over the 10 trading periods.

Trading system  | Avg. return (%) | Max. return (%) | Min. return (%) | St. dev.
ProfitD-DQN     | 3.74            | 21.31           | −10.74          | 4.87
ProfitDD-DQN    | 4.85            | 17.34           | −8.49           | 5.10
ProfitDQN       | 2.32            | 22.59           | −17.97          | 7.93
SharpeD-DQN     | 5.81            | 26.14           | −5.64           | 5.26
SharpeDD-DQN    | 3.04            | 13.03           | −8.49           | 3.81
SharpeDQN       | 1.83            | 15.80           | −9.29           | 5.46

Test 1. All Q-learning trading system settings are compared by sampling 10 different periods of size 4,000. For each period, 80% is used for training and 20% for testing the performance. In Fig. 2, the cumulative average return (%) over the 10 test sets is reported, together with 95% confidence intervals around the mean. The DD-DQN and D-DQN trading systems clearly outperform the simpler DQN system. On average, the best cumulative return (%) is reached by SharpeD-DQN. Table 1 summarizes the main statistical indicators. The trading systems based on DD-DQN and D-DQN reach higher cumulative average returns (%). In fact, ProfitDQN and SharpeDQN obtain the worst results over all the test periods. Furthermore, DQN has the highest standard deviation, demonstrating high instability. SharpeD-DQN has the highest average return (5.81%) over all the test periods. It reaches a maximum return of 26.14% and a minimum of −5.64%. The DD-DQN and D-DQN trading systems based on the profit reward function have comparable results. From this preliminary analysis, SharpeD-DQN has proven to be the best Q-learning trading system.
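The Test 1 protocol (10 sampled periods of 4,000 observations, 80/20 train/test split) can be sketched as follows; the random sampling of start indices is an assumption about how the periods were drawn:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_periods(n_obs, n_periods=10, size=4000, train_frac=0.8):
    """Sample trading periods from a series of n_obs hourly observations and
    split each into train/test index ranges (indices are hypothetical)."""
    periods = []
    for start in rng.integers(0, n_obs - size, n_periods):
        split = start + int(size * train_frac)
        periods.append((range(start, split), range(split, start + size)))
    return periods

periods = sample_periods(30000)
train, test = periods[0]
print(len(periods), len(train), len(test))  # 10 3200 800
```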


Test 2. Given the previous results, the SharpeD-DQN is tested on the entire period (from the 1st December 2014 to the 27th June 2018).

(Axes: Cumulative Average Return (%) vs. Time)

Fig. 3. SharpeD-DQN performance over the entire period.

Observations from the 1st December 2014 to the 1st November 2017 are used by the SharpeD-DQN system to learn how to trade the cryptocurrency. After that period, the SharpeD-DQN system acted as an autonomous algorithmic trading system (from the 2nd November 2017 to the 26th June 2018). It achieved an average percentage return of almost 8% with a standard deviation of 2.77. In Fig. 3, the cumulative percentage return over the entire period is shown.

6 Conclusions and Future Work

In this work, the performance of different trading systems based on Deep Reinforcement Learning was tested on hourly cryptocurrency (i.e. bitcoin) prices. The trading systems were based on Double and Dueling Double Deep Q-learning Networks, and they were compared with a simpler Deep Q-learning Network. Each of them was tested with two different reward functions: the first based on the Sharpe ratio, a measure of the risk-adjusted return on an investment, and the second related to profit. Six different Q-learning trading system settings were then tested on bitcoin data from the 1st December 2014 to the 27th June 2018. Performance was evaluated in terms of percentage returns. All systems produced positive returns (on average) for a set of shorter trading periods (different combinations of start and end dates for the trading activity). The trading system based on Double Q-learning and the Sharpe ratio reward function (SharpeD-DQN) achieved the largest return values. SharpeD-DQN was also tested over the entire considered period, producing a positive percentage return (an average percentage return of 8%).

It is important to stress that this work has some limitations. First, a broader set of performance indicators should be used to compare the different approaches. Second, the proposed Deep Reinforcement Learning techniques should be compared with recent AI approaches for a more accurate comparison study. Third, parameter optimization should be done to improve the performance of the learning techniques. That said, the presented methods were able to generate positive returns in all conducted tests. Extending the current analysis by considering these elements is a direction for future work. A different yet promising approach is to study the impact of social media on bitcoin and other cryptocurrency price fluctuations, incorporating news and public opinion into the Deep Reinforcement Learning approach. In addition, uncertainty estimation should be investigated, since uncertainty is essential for efficient reinforcement learning. Lastly, the proposed approaches can be extended to anomaly detection. Following the work of Du et al. [8], Q-learning approaches can be used to build a framework for online log anomaly detection and diagnosis. Such an approach could be a critical step towards building a secure and trustworthy anomaly detection system.

References
1. Alessandretti, L., ElBahrawy, A., Aiello, L.M., Baronchetti, A.: Anticipating cryptocurrency prices using machine learning. Complexity 2018, 1–16 (2018)
2. Almahdi, S., Yang, S.Y.: An adaptive portfolio trading system: a risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Syst. Appl. 87, 267–279 (2017)
3. Bach, W.G., Kasper, L.N.: On Machine Learning Based Cryptocurrency Trading. Aalborg University, Denmark (2018)
4. Bellman, R.E., Dreyfus, S.E.: Applied Dynamic Programming. RAND Corporation, Santa Monica (1962)
5. Bu, S.-J., Cho, S.-B.: Learning optimal Q-function using deep Boltzmann machine for reliable trading of cryptocurrency. In: Proceedings of the 19th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2018), Madrid, SP, pp. 468–480 (2018)
6. Buduma, N.: Fundamentals of Deep Learning: Designing Next-Generation Artificial Intelligence Algorithms. O'Reilly Media, Sebastopol (2017)
7. Cheeda, S.R., Singh, A.K., Singh, P.S., Bhole, A.S.: Automated trading of cryptocurrency using twitter sentimental analysis. Int. J. Comput. Sci. Eng. 6, 209–214 (2018)
8. Du, M., Li, F., Zheng, G., Srikumar, V.: DeepLog: anomaly detection and diagnosis from system logs through deep learning. In: Proceedings of the Conference on Computer and Communications Security (CCS 2017), Dallas, TX, pp. 1285–1298 (2017)
9. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 11, 219–354 (2018)
10. Jiang, Z., Liang, J.: Cryptocurrency portfolio management with deep reinforcement learning. In: Proceedings of the Intelligent Systems Conference (IntelliSys 2017), pp. 905–913 (2017)
11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, pp. 1–15 (2015)
12. Li, Y.: Deep reinforcement learning. arXiv, pp. 1–150 (2018)
13. Lin, L.-J.: Programming robots using reinforcement learning and teaching. In: Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI 1991), pp. 781–786. AAAI Press, Anaheim (1991)
14. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, pp. 1–6 (2013)
15. McNally, S., Roche, J., Caton, S.: Predicting the price of Bitcoin using machine learning. In: Proceedings of the 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2018), pp. 339–343. IEEE, Cambridge (2018)
16. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv, pp. 1–9 (2013)
17. Moody, J., Saffell, M.: Learning to trade via direct reinforcement. IEEE Trans. Neural Netw. 12, 875–889 (2001)
18. Mousavi, S.S., Schukat, M., Howley, E.: Deep reinforcement learning: an overview. arXiv, pp. 1–17 (2018)
19. Patel, Y.: Optimizing market making using multi-agent reinforcement learning. arXiv, pp. 1–10 (2018)
20. Sharpe, W.F.: The Sharpe ratio. J. Portfolio Manage. 21, 49–58 (1994)
21. Sutton, R.S., Barto, A.G.: An Introduction to Reinforcement Learning. MIT Press, Cambridge (2015)
22. Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., de Freitas, N.: Dueling network architectures for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1–9. PMLR, New York (2016)
23. Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8, 279–292 (1992)

Capacity Requirements Planning for Production Companies Using Deep Reinforcement Learning

Use Case for Deep Planning Methodology (DPM)

Harald Schallner
Jade Hochschule, Wilhelmshaven, Germany
[email protected]

Abstract. In recent years, deep reinforcement learning has proven impressively successful in the area of games, without explicit knowledge about the rules and strategies of the games themselves, for instance Backgammon, Checkers, Go and Atari video games [1]. Deep reinforcement learning combines reinforcement-learning algorithms with deep neural networks. In principle, reinforcement-learning applications automatically learn an appropriate policy that maximizes an objective function in order to win a game. In this paper, a universal methodology named Deep Planning Methodology (DPM) is proposed for systematically creating a deep reinforcement learning application for a business planning process. This methodology is applied to the business process domain of capacity requirements planning. Therefore, this planning process was designed as a Markov decision process [2]. The proposed deep neural network learns a policy that chooses the best shift schedule, which provides the required capacity for producing orders in time, with high capacity utilization, minimized stock and a short throughput time. The deep learning framework TensorFlow™ [3] was used to implement the capacity requirements planning application for a production company.

Keywords: Artificial intelligence applications · Planning and resource management · Deep learning framework · Deep Planning Methodology (DPM)



© IFIP International Federation for Information Processing 2019
Published by Springer Nature Switzerland AG 2019
J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 259–271, 2019. https://doi.org/10.1007/978-3-030-19823-7_21

1 Introduction

Many production companies have to plan the capacity requirements for their work centers frequently. They have to decide how many working hours and shifts per day are needed for each production resource. Current standard software implementations of SAP® Enterprise Resource Planning (ERP) and Supply Chain Management (SCM) systems support capacity requirements planning by reporting capacity utilization and by capacity levelling functions based on shift schedules [4]. Users have to maintain a feasible shift schedule before capacity requirements planning can be executed. SAP® ERP and SCM systems do not offer any planning functionality to optimize shift schedules automatically [5]. One main reason for this missing standard system functionality is the significant uncertainty about changes of available production capacities and customer requirements in the near future. Another reason is the exponential runtime complexity of a complete enumeration algorithm, which searches for the optimal solution within an exponentially sized decision space. Formula (1) gives the complexity in terms of the number of alternative shifts A (e.g. 8, 16 or 24 working hours per day), the number of scheduled days T and the number of capacities C:

$$O\left(\left(A^{C}\right)^{T}\right) \qquad (1)$$
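The decision space of formula (1) can be counted directly; the small numbers below are illustrative:

```python
def shift_schedule_space(A, C, T):
    """Size of the complete-enumeration decision space, (A**C)**T, formula (1)."""
    return (A ** C) ** T

# 3 shift alternatives, 2 capacities, 5 scheduled days
print(shift_schedule_space(3, 2, 5))  # 59049
```

Even these tiny parameters already yield tens of thousands of candidate schedules, which illustrates why complete enumeration is impractical for realistic horizons.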

In order to achieve long-term goals over a sequence of inherently uncertain planning situations, a reinforcement-learning approach is proposed in this work. Reinforcement-learning applications provide evaluative feedback to a learning agent interacting with an environment. This approach was chosen because many reinforcement-learning implementations have successfully handled other decision problems with similar complexity and significant uncertainty. Related work can be found in [1].

2 Methodology for Applying Deep Reinforcement Learning to Planning Processes

The subsequent sections describe all steps that are proposed to implement a deep reinforcement learning application for a planning process, named Deep Planning Methodology (DPM).

Fig. 1. Generic business process model

2.1 Step 1: Model Business Process

To understand the business process, the data flow, the applications used and the sequence of planning process steps have to be modelled. The model has to comprise all relevant aspects of the business domain. Figure 1 shows the graphical representation of all generic elements of the proposed business process model. For all planning process steps, the following business objects have to be modelled:

• Business rules: Business rules describe the organizational constraints that have to be fulfilled by the planning result. The planning process has to comply with all business rules; planning results are feasible if all constraints are observed. Business rules have to be formulated by propositional logic or by procedures.
• Business objectives: Often, a huge number of different planning results comply with the business rules. In order to evaluate all feasible planning alternatives, an objective function measures the quality of planning results. Objective functions are based on key performance indicators (KPIs). Commonly, objective functions are defined as the sum of weighted key performance indicators, because trade-offs between different key performance indicators have to be balanced.
• Input and output data objects: All data objects have to be specified by their attributes, which are provided for or created by a planning process step.
• Application systems: Every application system that supports planning functions or provides data objects has to be modelled.
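A weighted-sum objective function over KPIs, as described for the business objectives, can be sketched as follows; the KPI names and weights are hypothetical:

```python
def objective(kpis, weights):
    """Weighted sum of key performance indicators; signs encode whether a KPI
    should be maximized (positive) or minimized (negative contribution)."""
    return sum(weights[k] * v for k, v in kpis.items())

# hypothetical KPI values and trade-off weights
kpis = dict(utilization=0.85, stock=-0.2, throughput_time=-0.1)
weights = dict(utilization=1.0, stock=0.5, throughput_time=0.3)
print(round(objective(kpis, weights), 2))  # 0.72
```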

2.2 Step 2: Identify Relevant Planning Process Steps

Every planning process step can be analyzed for its relevance for applying deep reinforcement learning, based on the business rules, the planner's performance and the planning problem complexity. The following checklist helps to decide which planning process steps have high potential for an implementation:

First check: If all business rules can be specified in detail, i.e. every organizational constraint can be formulated, then perform the second check. Else, go on with the third check.

Fig. 2. Checklist for deep reinforcement learning applications


Second check: Is there any algorithm, heuristic or system transaction available that computes a satisfactory planning result according to the objective function? If yes, the algorithm, heuristic or system transaction has to be analyzed in the fourth check. If there is no appropriate application available, then human planning skills are the focus of check three.

Third check: Find a human planner who can manually create satisfactory planning results complying with implicit business rules. If nobody with this competence is known, then go on with check four.

Fourth check: Analyze the computational complexity of the algorithm, heuristic or system transaction. If the runtime grows more than polynomially with the planning input size, or the complexity is unknown, then deep reinforcement learning could be suitable for computing satisfactory planning results.

Generally, deep reinforcement learning applications are recommended if the planning problem complexity is high or unknown, or business rules cannot be specified completely and there is no planner producing satisfactory results, as shown in Fig. 2.
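The checklist logic of Fig. 2 can be sketched as a small decision function; the argument names are ours, and the mapping of the four checks to branches is our reading of the text:

```python
def deep_rl_recommended(rules_specifiable, algorithm_available,
                        planner_available, complexity_polynomial):
    """Does this planning step have high potential for deep RL (Fig. 2)?"""
    if rules_specifiable and algorithm_available:   # checks 1 + 2 -> check 4
        return not complexity_polynomial            # super-polynomial or unknown
    if planner_available:                           # check 3: manual planning works
        return False
    return True                                     # no algorithm, no planner

print(deep_rl_recommended(True, True, False, False))  # True
```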

2.3 Step 3: Convert Planning Process into Markov Decision Process

In the previous step two, planning process steps were identified that do not admit any algorithm with polynomial runtime. Fortunately, Papadimitriou and Tsitsiklis [2] proved that the Markov decision process for finite horizons is solvable in polynomial runtime by dynamic programming (it is "P-complete"). Consequently, the planning process has to be converted into a Markov decision process. In contrast to other machine learning, evolutionary or optimization approaches, the Markov decision process needs neither exemplary supervision nor complete models of the business rules [1]. The task of the Markov decision process is to compute an appropriate policy within an uncertain environment in order to achieve long-term goals at the end of a given time horizon. A policy decides which action a_t has to be executed based on the current time t and state s_t. Thus, the planning process has to be defined as a sequence of state-dependent actions for each time of a finite horizon T ∈ ℕ. The sequence of actions and states has to be modelled by time buckets 0 ≤ t < T, e.g. weeks, days or hours. In addition, a finite set of states S, an initial state s_0 for the first time bucket, and states s_t for all other time buckets have to be specified. The state s_t describes the decision-relevant planning information for the current time bucket, provided by the planning environment. Furthermore, the action space has to be finite. Finally, a reward function r(s_t, a_t, t) ∈ ℝ has to be specified. To sum up, the Markov decision process has to find an optimal policy function δ(s_t, t) that maximizes the following value function v, as shown in [2]:

$$v = \sum_{t=0}^{T} r\left(s_t, \delta(s_t, t), t\right) \qquad (2)$$
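For a deterministic toy environment, the value function of formula (2) can be evaluated directly; all callables below are hypothetical stand-ins for the planning domain:

```python
def value(policy, reward, s0, transition, T):
    """v = sum_{t=0}^{T} r(s_t, delta(s_t, t), t) for a deterministic MDP.
    transition(s, a) gives the next state; the horizon is inclusive of T."""
    v, s = 0.0, s0
    for t in range(T + 1):
        a = policy(s, t)
        v += reward(s, a, t)
        s = transition(s, a)
    return v

# toy domain: the state counts how often an "extra shift" action was executed,
# the policy stops scheduling extra shifts after two of them
v = value(policy=lambda s, t: 1 if s < 2 else 0,
          reward=lambda s, a, t: float(a),
          s0=0,
          transition=lambda s, a: s + a,
          T=4)
print(v)  # 2.0
```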

2.4 Step 4: Develop Reinforcement Learning Environment

This methodology proposes the use of the deep learning framework TensorFlow that was developed as an open-source software library by the Google Brain Team and is available for everyone via GitHub [6, 7]. In recent years, TensorFlow has become very

Capacity Requirements Planning for Production Companies

263

popular with over a million source code downloads [3]. In addition to Google, many companies have selected TensorFlow for their machine learning applications, e.g. Airbus, eBay, NVIDIA, Coca-Cola. Especially, SAP has integrated TensorFlow into its Leonardo Machine Learning Foundation by providing the “Bing your own Model” (BYOM) web service [8]. Based on TensorFlow Kuhnle, Schaarschmidt, and Fricke [9] developed an open source library for applied deep reinforcement learning, named TensorForce. TensorForce offers a unified declarative interface to common reinforcement-learning algorithms [10]. TensorForce library provides an application-programming interface by the four following generic Python classes: Environment, Runner, Agent and Model [11]. The Agent class processes states, returns actions, stores past observations, loads and saves models. Each Agent class employs a Model class, that implements algorithms for calculating next action given the current state and for updating the model parameters from past experiences [11]. The Runner class implements the interaction between the Environment and the Agent. For every time step the Runner receives the current action from the Agent, executes this action in the Environment and passes the observation to the Agent. The Runner manages the duration and number of each learning episode and the cumulative rewards [11]. The planning horizon limits the number of time steps of each episode. The value function is implemented by cumulative rewards. Thus, the state sequence of one episode represents the planning process results for the time horizon. For each planning process step, following Environment methods and attributes have to be developed in Python [11]: • • • •

• Attribute actions returns the action space for all possible planning decisions.
• Attribute states returns the state space for all possible planning situations.
• Constructor method __init__() initializes the state based on the planning input.
• Method reset() sets up a new learning episode with its initial state for the first time step.
• Method execute(actions) performs the selected actions and returns the reward, the next state and a terminal indicator defining the end of the planning horizon.

TensorForce has no restriction on the number and type of different states and actions [10]. To sum up, a planning-specific Python class has to be developed which overrides attributes and methods from the Environment superclass. The subclass has to simulate the consequences of each planning decision in its domain. In order to handle the risk of unrealistic assumptions and oversimplification, relevant business constraints have to be considered in the execute(actions) method.
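Sketched in plain Python, the environment contract described above could look as follows. This is modeled on the interface described in the text, not on the actual TensorForce base class; the planning logic itself is reduced to a placeholder so the sketch is self-contained.

```python
class PlanningEnvironment:
    """Sketch of a planning-specific Environment class (illustrative only).

    In a real implementation this would subclass TensorForce's Environment
    and execute() would simulate the planning decision's consequences.
    """

    def __init__(self, planning_input, horizon):
        # Constructor: initialize the state based on the planning input.
        self.planning_input = planning_input
        self.horizon = horizon        # time steps per learning episode
        self.time_step = 0
        self.state = list(planning_input)

    @property
    def states(self):
        # State space for all possible planning situations.
        return {'type': 'float', 'shape': (len(self.planning_input),)}

    @property
    def actions(self):
        # Action space for all possible planning decisions (e.g. 3 shift levels).
        return {'type': 'int', 'num_actions': 3}

    def reset(self):
        # Set up a new learning episode with its initial state.
        self.time_step = 0
        self.state = list(self.planning_input)
        return self.state

    def execute(self, actions):
        # Perform the selected actions; return next state, terminal flag, reward.
        self.time_step += 1
        reward = 0.0                  # placeholder for the planning objective
        terminal = self.time_step >= self.horizon
        return self.state, terminal, reward
```

A Runner would then alternate between this environment and an agent until the terminal indicator marks the end of the planning horizon.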

2.5 Step 5: Configure Reinforcement Learning Agent

The TensorForce library offers eight different pre-built Agent classes that implement state-of-the-art reinforcement-learning algorithms [10]:

• DQNAgent: Deep Q-Network agent, applies the original Q-learning algorithm [12].
• NAFAgent: Normalized Advantage Function agent, uses the continuous Q-learning algorithm [13].
• DQFDAgent: Double Q-learning from demonstration agent, enriches learning with expert knowledge [14].
• VPGAgent: Vanilla policy gradient agent, implements the classic policy gradients algorithm known as REINFORCE [15].
• TRPOAgent: Trust Region Policy Optimization agent, supports categorical, bounded and continuous action spaces [16].
• PPOAgent: Proximal Policy Optimization agent, applies an alternative policy-based method [17].
• DQNNstepAgent: n-step Q-learning agent, performs asynchronous gradient descent for optimization [18].
• DDPGAgent: Deep Deterministic Policy Gradient agent, supports continuous action domains [19].

Agent classes have to be configured at least by neural network layers, exploration, action space, state space, learning algorithm parameters and update methods. There are three update methods for network weights: episode-based, batch-based and time-step-based. The selection of an appropriate agent class and the declarative configuration of its hyper-parameters are required when creating an instance in Python. The best way to choose a suitably configured class is to evaluate all eight classes experimentally, because each planning process step has different business rules, objective functions and planning data (see the next step in the following Subsect. 2.6).

2.6 Step 6: Evaluate Learning Results

The experimental evaluation of the configured agent classes interacting with the developed environment class requires representative input data. Therefore, data cleansing of real-world planning input data is recommended, in order to create meaningful test scenarios.

Fig. 3. Capacity requirements planning process model


Evaluation criteria are the convergence and level of the weighted objective function. Consequently, for each test scenario, each agent configuration and all training episodes, the cumulative reward values have to be compared. In addition, TensorBoard can be used as a dashboard for deep learning models implemented with TensorFlow [20]. TensorBoard is an open-source implementation and available via GitHub [21]. It generates web-based visualizations of the computation dataflow graphs and training curves, which help to fine-tune hyper-parameters.

3 Use Case for Deep Planning Methodology (DPM)

3.1 Standard Business Process Model for Production Companies

In this subsection, the described Deep Planning Methodology (DPM) is applied to the SAP® standard business process "plan-to-produce" for production companies. This standard planning process comprises the following consecutive planning steps [5]:

1. Demand Planning (DP)
2. Production Program Planning (PP)
3. Material Requirements Planning (MRP)
4. Capacity Requirements Planning (CRP)
5. Detailed Scheduling (DS)

Demand Planning consolidates the sales forecast, applies statistical methods and releases a demand plan to Production Program Planning. The goal of DP is to achieve a high forecast accuracy. Based on the demand plan, sales orders and delivery schedules, PP calculates balanced production quantities for every finished product and period. PP's objective comprises the on-time delivery performance and the probability that demand is met from stock, called the service level. PP plans the requirement quantities and dates for finished products and releases independent demand to MRP. Subsequently, MRP creates planned orders for each in-house production material, in order to meet the independent demands. A planned order schedules dependent demand for its components and a technical sequence of operations performed at capacities. These operations require adequate capacity availability for every scheduled operation. The duration of an operation defines the capacity requirement quantity. MRP assumes infinite capacities; that is, MRP does not check scheduled operations with regard to capacity availability. Thus, the availability of capacities has to be checked and ensured by the subsequent Capacity Requirements Planning (CRP). Based on planned orders, CRP has to plan feasible shift schedules for all production capacities. On the other hand, Detailed Scheduling (DS) has to consider the material availability of all components based on the shift schedule and planned orders. DS schedules the sequence of operations performed at capacities and aims to minimize set-up efforts.

3.2 Capacity Requirements Planning Process

Figure 3 visualizes the CRP process with its business rules, objectives, application systems and data objects. The shift schedule can be modeled by the decision variables d_{i,j} (a C × T matrix) that define the available working hours for all capacities C ∈ ℕ and each day within a defined planning horizon T ∈ ℕ:

d \in \mathbb{R}^{C \times T}, \quad 0 \le d_{i,j} \le 24 \qquad (3)

A planned order is specified by its operations i:

• Operation sequence: i ≺ j, i.e. i is a predecessor of operation j
• Scheduled time bucket: 0 ≤ s_i < T
• Production duration: ρ_i ∈ ℕ
• Required capacity: w_i ∈ ℕ

The capacity requirements k are obtained by summing all relevant operation durations:

k \in \mathbb{R}^{C \times T}, \quad k_{c,t} = \sum_{\forall i :\, s_i = t \,\wedge\, w_i = c} \rho_i \qquad (4)

A planned shift schedule is feasible if two business constraints are fulfilled. Firstly, finite capacity utilization is limited to 100% for all capacities c and time buckets t:

\forall c, t : \; U_{c,t} = \frac{k_{c,t}}{d_{c,t}} \le 100\% \qquad (5)

Secondly, the technical sequence of operations is ensured if succeeding operations are scheduled at a later time bucket:

\forall \text{ operations } i \prec j : \; s_i < s_j \qquad (6)

The objective function f that measures the quality of a shift schedule comprises classic key performance indicators used in production companies: capacity utilization U_{c,t}, weighted by w_u; throughput time P_{i,t}, weighted by w_p; inventory R_{i,t}, weighted by w_r; and delivery performance D_{i,t}, weighted by w_d:

f = w_u \sum_{\forall c,t} U_{c,t} + w_p \sum_{\forall i,t} P_{i,t} + w_r \sum_{\forall i,t} R_{i,t} + w_d \sum_{\forall i,t} D_{i,t} \qquad (7)

CRP was identified as a relevant planning step with high potential for deep reinforcement learning implementations, because the complete enumeration algorithm has exponential runtime complexity, see Eq. (1).
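To make the capacity-requirements sum and the two feasibility constraints concrete, here is a small, illustrative Python sketch. The data layout and function names are assumptions for illustration, not from the paper.

```python
def capacity_requirements(operations, C, T):
    """Eq. (4): sum operation durations rho_i per capacity c and bucket t."""
    k = [[0.0] * T for _ in range(C)]
    for op in operations:
        # op = {'id': ..., 's': bucket, 'rho': duration, 'w': capacity, 'preds': [...]}
        k[op['w']][op['s']] += op['rho']
    return k


def is_feasible(operations, d, C, T):
    """Eqs. (5) and (6): utilization <= 100% and operation sequence preserved."""
    k = capacity_requirements(operations, C, T)
    for c in range(C):
        for t in range(T):
            if k[c][t] > d[c][t]:            # utilization above 100%
                return False
    by_id = {op['id']: op for op in operations}
    for op in operations:
        for pred in op.get('preds', []):     # predecessor must be scheduled earlier
            if not by_id[pred]['s'] < op['s']:
                return False
    return True
```

A shift schedule `d` is then feasible exactly when `is_feasible` returns True for the given planned orders.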

3.3 Markov Decision Process for Capacity Requirements Planning

CRP is converted into a Markov decision process by defining actions, states and rewards:

• A C-dimensional vector specifies the actions, which decide how many working hours (here: 8, 16 or 24) are available for each capacity at time bucket t:

a_t \in \{8, 16, 24\}^C \qquad (8)

• The shift schedule d is built from the actions a_t performed for all capacities within one learning episode:

a_t = \begin{pmatrix} d_{0,t} \\ \vdots \\ d_{C-1,t} \end{pmatrix} \qquad (9)

• State s_t is designed via the capacity requirements over a time interval of b buckets; that is, the state slices the capacity requirements with a rolling time window:

s_t = \begin{pmatrix} k_{0,t} & \cdots & k_{0,t+b-1} \\ \vdots & \ddots & \vdots \\ k_{C-1,t} & \cdots & k_{C-1,t+b-1} \end{pmatrix} \qquad (10)

• In order to ensure that for each learning episode the cumulative reward v equals the objective function value f, the reward function r is calculated as:

r(s_t, a_t, t) = w_u \sum_{\forall c} U_{c,t} + w_p \sum_{\forall i} P_{i,t} + w_r \sum_{\forall i} R_{i,t} + w_d \sum_{\forall i} D_{i,t} \qquad (11)
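The rolling state window and the action-to-schedule mapping above can be sketched in a few lines of Python; the list-of-lists layout and the function names are illustrative assumptions.

```python
def state_window(k, t, b):
    """Slice the capacity requirements k (C x T matrix as list of rows)
    at bucket t over a rolling window of b buckets, as in Eq. (10)."""
    return [row[t:t + b] for row in k]


def apply_action(d, t, a_t):
    """Write the chosen working hours a_t (one entry per capacity,
    each in {8, 16, 24}) into column t of the shift schedule d."""
    for c, hours in enumerate(a_t):
        d[c][t] = hours
```

Over one episode, applying `apply_action` for t = 0, …, T−1 fills the complete C × T shift schedule, while `state_window` provides the agent's observation at each step.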

3.4 Learning Environment for Capacity Requirements Planning

The learning environment has to fulfill the two business constraints described in Subsect. 3.2. In consequence, the execute(actions) method ensures for each time step and each capacity that the planned shifts offer enough working hours to meet the capacity requirements. If capacity utilization is higher than 100%, operations are moved to the next time bucket. Otherwise, operations are pulled forward from subsequent to current time buckets, in order to achieve fully utilized capacities. Predecessors and successors of moved operations have to be rescheduled, too. This influences throughput time, inventory, delivery performance and consequently the reward. The implemented time window of states focuses on a rolling interval of two days for capacity requirements. This provides relevant information for the agent to find a satisfactory shift schedule.
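The push-out part of this logic might be sketched as follows for a single capacity; the pull-forward step and the rescheduling of predecessors and successors are omitted, and all names are illustrative.

```python
def push_overload(requirements, available):
    """Move capacity requirements that exceed the available working hours
    of a bucket into the following bucket (single capacity, one forward pass).

    The last bucket may remain overloaded; a full implementation would also
    pull operations forward and reschedule predecessors and successors.
    """
    req = list(requirements)
    for t in range(len(req) - 1):
        overload = req[t] - available[t]
        if overload > 0:                 # utilization above 100% in bucket t
            req[t] -= overload
            req[t + 1] += overload       # shift the excess to the next bucket
    return req
```

Delaying requirements this way lengthens throughput times and shifts delivery dates, which is exactly how the constraint handling feeds back into the reward.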

3.5 Learning Agent for Shift Schedule

Successful reinforcement learning often requires tuning the agent's hyper-parameters. This subsection proposes some best practices for customizing the selected PPOAgent class, which has proven productive in the domain of capacity requirements planning.

The hidden layers of the deep neural network are specified by the parameter network. State and action spaces define the structure of the input and output layers implicitly. Explicitly, three hidden layers have to be configured to meet the complexity of this capacity-planning problem. The number of hidden units corresponds with the planning complexity described in the O-formula (1): a high number of capacities, a longer planning horizon and a high number of alternative shifts require a higher number of hidden units, called the size of the layer:

1. The first layer type is embedding, in order to handle all dimensions of the state space. This layer represents states in a continuous vector space where semantically similar states are mapped to nearby points, coded as indices. The number of embedding indices should be within the range of 32 to 1024 (here: 100). The dimension of the vector space is defined by the size of the embedding layer, here 32 (range: 16–1024). The input layer is fully connected to the embedding layer. L1 and L2 regularization weights are not used in this layer, because overfitting is not an issue for learned policies.
2. The second layer, called flatten, reshapes the first layer's output from a multi-dimensional vector into a one-dimensional vector.
3. The third, dense layer with 32 hidden units (range: 32–1024) is fully connected to the second layer. The ReLU function [7] is chosen for activation.

Promising results concerning the convergence of cumulative rewards to a satisfactory solution can be achieved by the following configuration of the PPOAgent class:

• The update_mode specifies how many episodes or time steps are used for one iteration of a gradient descent update. The frequency defines the number of elapsed episodes or time steps that trigger the next update batch run. Both variables are set to one episode (range: 1–20 episodes).
• The capacity of the memory corresponds to the number of experiences collected before updating the neural network weights. It has to be a multiple of the update_mode batch size and was set to 100,000 to speed up training.
• The learning rate specifies the strength of each gradient descent update step. Best results for Adam optimization were achieved with rate 0.001 (range: 0.00001–0.008). If the scattering of the objective function values is too high, or the rewards do not consistently increase over many episodes, the learning rate has to be decreased. On the other hand, if the learning rate is too low, the optimal solution becomes too hard for the agent to find.
• The parameter entropy_regularization controls the randomness rate of the policy. This parameter is very sensitive and should have the value 0.03 within the interval 0.001 to 0.1. Other values have negative effects on the convergence and level of cumulative rewards: higher values create unstable solutions for many periods, while lower values prevent finding satisfactory solutions.
• The parameter likelihood_ratio_clipping specifies the acceptable threshold of divergence between the new and old policies for a gradient update step. High values correspond with a fast but unstable learning process. Best results are based on the value 0.02 (range: 0.005–0.35).
• The parameter discount defines the factor for future rewards. The value 0.99 ensures policies with high cumulative rewards at the end of an episode. The discount factor should have a high value within the interval 0.85 to 0.999, because the objective function of capacity requirements planning is calculated from the cumulative rewards over the time horizon.
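Collected in one place, the hyper-parameter choices above correspond roughly to the following configuration dictionary. This is a sketch only: the exact TensorForce keyword names vary between library versions, so treat the keys as illustrative.

```python
# Illustrative PPOAgent configuration mirroring the values discussed above;
# key names are assumptions and may differ from the TensorForce release used.
ppo_config = {
    'network': [
        # three hidden layers, sized to the planning complexity of formula (1)
        {'type': 'embedding', 'indices': 100, 'size': 32},   # indices: 32-1024
        {'type': 'flatten'},                                 # multi-dim -> 1-d
        {'type': 'dense', 'size': 32, 'activation': 'relu'}, # size: 32-1024
    ],
    'update_mode': {'unit': 'episodes', 'batch_size': 1, 'frequency': 1},
    'memory': {'type': 'latest', 'capacity': 100000},  # multiple of batch size
    'learning_rate': 0.001,             # range: 0.00001-0.008 (Adam)
    'entropy_regularization': 0.03,     # very sensitive; range: 0.001-0.1
    'likelihood_ratio_clipping': 0.02,  # range: 0.005-0.35
    'discount': 0.99,                   # range: 0.85-0.999
}
```

Such a dictionary would be passed to the agent constructor when creating the PPOAgent instance declaratively.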

3.6 Experimental Evaluation

Four of the eight TensorForce agent classes are suitable for CRP: DQNAgent, VPGAgent, PPOAgent and DQNNstepAgent. The other agents were not applicable, because they are designed for continuous action spaces or require additional expert knowledge. The qualified agents were evaluated on ten different test scenarios. The test scenarios differ in their randomized planned orders, number of capacities (range: 6–24), number of operations (interval: 2,500–10,000) and planning horizon from 30 to 120 days. Learning was finished after 4000 episodes. All test scenarios converged to a satisfactory result near the global optimum. The mean of the cumulative rewards increased steadily over the episodes, while the scattering of the objective function values decreased. The runtimes were reasonably short: from 0.3 to 1.2 s per episode. TensorFlow ran on the GPUs of an NVIDIA Quadro M1200 with a compute capability of grade 5.0. The PPOAgent achieved the best results with regard to the level and convergence of the objective function, as shown in Fig. 4.

Fig. 4. Evaluation of test scenario with six capacities, 2500 operations and 30 days horizon

4 Conclusion and Future Work

The proposed Deep Planning Methodology is an advanced implementation paradigm for planning processes in companies. It delivers satisfactory planning results by developing a learning environment and customizing an agent, without the need to implement all business rules and optimization algorithms explicitly. In the course of the CRP implementation, a challenge emerged: how to tune the hyper-parameters. Some hyper-parameters are sensitive concerning the convergence and level of planning results. To meet this challenge, many test scenarios are recommended in order to gain comprehensive insight into how to achieve reliable and reproducible results. To sum up, it is fascinating how complex decision problems can be solved by reinforcement learning algorithms. In the future, I expect more Eureka moments for other planning tasks in companies that current business applications have been unable to solve until today.

References

1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)
2. Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of Markov decision processes. Math. Oper. Res. 12(3), 441–450 (1987)
3. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016). USENIX Association, Savannah (2016). www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi. Accessed 17 Jan 2019
4. Gulyássy, F., Vithayathil, B.: Kapazitätsplanung mit SAP®. Galileo Press, Boston (2014)
5. Dickersbach, J.T.: Supply Chain Management with APO: Structures, Modelling Approaches and Implementation Peculiarities, 3rd edn. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-662-10145-2
6. TensorFlow Documentation (2017). github.com/tensorflow. Accessed 17 Jan 2019
7. Géron, A.: Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media, Sebastopol (2017)
8. Leukert, B., Müller, J., Noga, M.: Das intelligente Unternehmen: Maschinelles Lernen mit SAP zielgerichtet einsetzen. In: Buxmann, P., Schmidt, H. (eds.) Künstliche Intelligenz, pp. 41–62. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-662-57568-0_3
9. Kuhnle, A., Schaarschmidt, M., Fricke, K.: Tensorforce: a TensorFlow library for applied reinforcement learning (2017). https://github.com/tensorforce/tensorforce. Accessed 8 Jan 2019
10. Schaarschmidt, M., Kuhnle, A., Ellis, B., Fricke, K., Gessert, F., Yoneki, E.: LIFT: reinforcement learning in computer systems by learning from demonstrations. arXiv preprint arXiv:1808.07903v1 (2018)
11. TensorForce Documentation Release 0.3.3 (2018). media.readthedocs.org/pdf/tensorforce/latest/tensorforce.pdf. Accessed 17 Jan 2019
12. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236
13. Gu, S., Lillicrap, T., Sutskever, I., Levine, S.: Continuous deep Q-learning with model-based acceleration. arXiv:1603.00748v1 (2016)
14. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. arXiv:1509.06461v3 (2015)
15. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
16. Schulman, J., Levine, S., Abbeel, P., Jordan, M.I., Moritz, P.: Trust region policy optimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1889–1897. arXiv:1502.05477v5 (2017)
17. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347v2 (2017)
18. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. arXiv:1602.01783v2 (2016)
19. Lillicrap, T., et al.: Continuous control with deep reinforcement learning. arXiv:1509.02971v5 (2016)
20. Wongsuphasawat, K., et al.: Visualizing dataflow graphs of deep learning models in TensorFlow. IEEE Trans. Vis. Comput. Graph. 24(1), 1–12 (2018)
21. The TensorBoard repository on GitHub. https://github.com/tensorflow/tensorboard. Accessed 31 Jan 2019

Comparison of Neural Network Optimizers for Relative Ranking Retention Between Neural Architectures

George Kyriakides and Konstantinos Margaritis

University of Macedonia, 55236 Thessaloniki, Greece
[email protected]

Abstract. Autonomous design and optimization of neural networks is gaining increasingly more attention from the research community. The main barrier is the computational resources required to conduct experimental and production projects. Although most researchers focus on new design methodologies, the main computational cost remains the evaluation of candidate architectures. In this paper we investigate the feasibility of using reduced-epoch training, by measuring the rank correlation coefficients between sets of optimizers, given a fixed number of training epochs. We discover ranking correlations of more than 0.75, and up to 0.964 between Adam with 50 training epochs, stochastic gradient descent with Nesterov momentum with 10 training epochs, and Adam with 20 training epochs. Moreover, we show the ability of genetic algorithms to find high-quality solutions of a function by searching in a perturbed search space, given that certain correlation criteria are met.

Keywords: Deep learning · Neural architecture search · Ranking

1 Introduction

Optimization and autonomous design of neural architectures have been increasingly investigated in the past years [1–8], partly due to the popularity of deep learning [9–11] and partly due to the cumbersome process of designing and optimizing the networks. It is only natural that we wish to free ourselves of this repetitive process, and many methods have been proposed. Although most studies propose methods that produce better networks or converge faster, most of them are concerned with the way that the design or optimization algorithm acts. Nonetheless, the majority of computational resources are consumed during the candidate architectures' evaluation. Thus, many researchers distribute the workload among many machines in order to speed up the process. Although a viable and quite effective solution, not all algorithms favor distributed setups, and it requires the availability of additional hardware. On the other hand, the networks are evaluated only to further the search or optimization method, until the final architecture is produced. As such, the actual performance of a network does not matter as an absolute value; it matters in relation to the other alternatives that the design algorithm has at each point in time. Consequently, methods that enable a quick relative evaluation of network architectures can greatly reduce the computational workload of designing and optimizing architectures.

In this paper, we study the use of various neural network optimizers under a small number of training epochs as a proxy of relative network performance. Furthermore, we conduct Monte-Carlo simulations of searching for the best parameters in 3-d spaces with similar correlation coefficients. The motivation behind this work is to examine the feasibility of using faster evaluation schemes (a reduced number of epochs) when evaluating architectures in an autonomous design framework, where absolute performance has a smaller impact than relative performance between candidate architectures.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 272–281, 2019. https://doi.org/10.1007/978-3-030-19823-7_22

2 Related Work

As stated previously, most research focuses on faster and more efficient methodologies; nonetheless, much of this work has greatly contributed to the field. A well-known approach uses genetic algorithms to evolve networks of fixed sub-module design [1]. Two versions are proposed: DeepNEAT, which is an extension of the older NEAT [12] method for fully-connected networks, and CoDeepNEAT, which utilizes the co-evolution [13] of graph structure and sub-module populations. Another approach utilizes reinforcement learning in order to design CNN architectures [2] and transferable architectures [3]. In [4], an attempt to reduce the number of trainable parameters by freezing all network layers except the output layer results in positive relative rankings, indicating that freezing intermediate layers is a viable approach for neural architecture search. Another study attempts to design architectures directly on the target dataset, without the use of proxy tasks [7]. Although the authors argue that a proxyless search produces better results, it also demands greater computational resources; it is only logical that a search conducted on the target hardware and training conditions will produce the best results. Furthermore, in [1] the researchers noticed that by restricting the available training epochs during evolution, the final produced architecture required fewer epochs to achieve performance comparable to the state of the art. Other approaches employ random search [6], sequential model-based optimization [5] and boosting [8].

3 Methodology

3.1 Evolving Architectures

In this paper we use the methodology proposed in [1], called DeepNEAT, to evolve neural architectures. Although the original method uses speciated crossover to produce offspring from a population as well as mutations, we use only the mutation operator, as our goal is to produce an array of different architectures of increasing complexity. The methodology initializes an individual architecture comprised only of an input and an output node. At each mutation, there is a certain probability that a new connection will be added, or that an existing connection will be deactivated and replaced by a new module. Each module consists of a convolutional layer, a dropout layer, a weight scaling factor and a max pooling layer present with probability 0.5. Each gene in the chromosome contains information about the module's parameters: the number of filters in the convolutional layer, in the space [32, 256]; the dropout rate in [0, 0.7]; the weight scaling factor in [0, 2.0]; the kernel size in {1, 3}; and whether the max pooling layer is present or not. In our experiment, the probability of adding a new module node was set to 0.05 and the probability of adding a new connection to 0.1. Some of the architectures generated are depicted in Fig. 1. Number 1 is the starting architecture, consisting only of an input and an output layer. Each node denotes a whole module. Because many node modules must be scaled in order for the network to be functional, Number 20 has 15 modules that translate to 86 PyTorch layers.

Fig. 1. Sample architectures evolved.
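A module gene with the parameter ranges given above could be sampled as in the following sketch; the field and function names are illustrative assumptions, not taken from the DeepNEAT or NORD code.

```python
import random

def random_module_gene(rng=random):
    """Sample one DeepNEAT-style module gene using the parameter ranges
    stated in the text (illustrative field names)."""
    return {
        'filters': rng.randint(32, 256),        # conv filters in [32, 256]
        'dropout': rng.uniform(0.0, 0.7),       # dropout rate in [0, 0.7]
        'weight_scale': rng.uniform(0.0, 2.0),  # weight scaling in [0, 2.0]
        'kernel_size': rng.choice([1, 3]),      # kernel size in {1, 3}
        'max_pooling': rng.random() < 0.5,      # pooling present w.p. 0.5
    }
```

Mutation then amounts to occasionally inserting such a gene in place of a deactivated connection (probability 0.05 in our experiment) or adding a new connection (probability 0.1).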

3.2 Experimental Setup

For our experiments, we generated several distinct neural networks of increasing complexity by mutating a simple network using DeepNEAT for 20 generations, with the goal of classifying the CIFAR10 image recognition dataset. We trained each network using 1, 5, 10, 20, and 50 epochs respectively. We compared a total of seven optimizers: Adadelta, Adam, Adamax, RMSprop, Stochastic Gradient Descent (SGD), SGD with momentum (SGD-M), and SGD with Nesterov momentum (SGD-NM). Following, we calculated each network's rank for its specific optimizer and epochs combination. Our goal was to compare the relative rankings of low training cost combinations to the relative rankings of fully trained networks (50 epochs).


In order to compare the networks' rankings, we use Kendall's rank correlation coefficient (τ) [14], which is a measure of similarity between ranks (correlation of ranks). High positive or negative values imply correlation (positive and negative respectively). To further investigate the ability of optimizers to discern high-performing architectures, we perform the analysis on all architectures as a whole, as well as on the last ten generated architectures. The motivation behind this decision is that simple architectures will exhibit similar behavior, and minor differences in performance can be attributed to other factors, such as initial conditions. Complex architectures, on the other hand, demand more computational resources to train. Thus, design methods will benefit more from a faster but relatively consistent evaluation. Following, we conducted Monte-Carlo simulations of 3-d Rastrigin functions with correlation coefficients similar to those observed in our experiments, in order to demonstrate the feasibility of using reduced training epochs as proxies for neural architecture search. We employed a simple genetic algorithm in order to find the local minima, both in the original function as well as in the perturbed function. We initialized a population of 10 individuals, with a crossover rate of 0.9 and a mutation rate of 0.02. The population was selected for crossover by tournament selection. We experimented with 10, 100, and 1000 generations. Each experiment was repeated 1000 times, in order to obtain a reasonable distribution sample. Given the multi-modal nature of the Rastrigin function, its search space resembles those of neural architectures.
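For tie-free rankings, Kendall's τ simply counts concordant and discordant pairs; the following is a minimal sketch of what library routines such as scipy.stats.kendalltau compute (without tie handling or p-values).

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length rankings,
    assuming no ties: (concordant - discordant) / total pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1, fully reversed rankings give τ = −1, and unrelated rankings give values near 0.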

3.3 Implementation

For our experimental implementation we used the NORD benchmarking framework. DeepNEAT was implemented in Python, but only the mutation operator was used. Furthermore, we used ESA's Pygmo implementation of a simple genetic algorithm for our Monte-Carlo simulations [15]. The experiments were run on 7 Tesla K40 GPUs.

4 Results

4.1 Neural Architectures

In this study our focus is the retention of relative rankings while using reduced training epochs for candidate architectures. Nonetheless, the achieved accuracy of each architecture, optimizer and epoch combination provides us with an intuitive way to assess the quality of each network. Figure 2 depicts the accuracy of each optimizer for 1 and 50 epochs of training. It is evident that with 50 epochs the stability is greatly increased. Nevertheless, there seems to be a pattern that both groups follow: a sharp increase in accuracy at generation 10, a relatively stable performance for a small number of generations, and then a sharp decrease. This pattern is followed by all optimizers in both groups, except for the 'Ada' family (Adamax, Adadelta, Adam) when given 50 epochs of training. The training time required for each group is depicted in Table 1. As is evident, the number of training epochs is the most influential factor. Furthermore, the required time seems to scale linearly with the number of epochs, as expected.


Fig. 2. Comparison of achieved accuracy in 1 epoch of training to 50 epochs.

Table 1. Average training times (in seconds per architecture).

Optimizer | 1 epoch | 5 epochs | 10 epochs | 20 epochs | 50 epochs
Adadelta  |  36.1   |  158.6   |  312.3    |  622.1    |  1701.5
Adam      |  35.9   |  157.5   |  311.7    |  617.4    |  1724.8
Adamax    |  35.8   |  157.1   |  310.3    |  617.8    |  1718.6
RMSprop   |  35.4   |  156.5   |  310.2    |  614.6    |  1687.6
SGD       |  36.0   |  157.1   |  308.3    |  613.2    |  1687.3
SGD-M     |  35.7   |  157.2   |  309.3    |  613.5    |  1709.5
SGD-NM    |  35.5   |  156.6   |  311.2    |  617.7    |  1707.4

Although useful, it is not easy to judge relative performance from accuracy plots and training time tables. In order to better compare the relative performance of the various optimizer and training epoch combinations, we ranked the generated architectures by the accuracy that they achieved for the specific epoch-optimizer combination. We then computed the rank correlation coefficient of each list with the list generated by the same optimizer when the architectures are trained for 50 epochs before evaluation. The results show high correlation coefficients (above 0.65) for most combinations, except for RMSprop_01 and SGD_01, as depicted in Fig. 3.

Fig. 3. Rank correlation coefficients heatmap.


The results seem promising, although they do not show a clear or extremely strong correlation between the groups. This can partly be attributed to the fact that simple architectures will perform relatively close to each other, independently of the optimizer and the number of epochs trained. If the number of parameters is small and the problem domain large, there are simply too few degrees of freedom to adequately fit the model. Thus, we repeat the analysis, discarding the first half of the generated architectures. The goal is to test whether rank correlations are retained on more complex architectures. As is evident in Fig. 4, there are now strong relationships between some groups. Adam_01, Adam_05, Adam_20, Adamax_10, Adamax_20, RMSprop_20, SGD-M_10, and SGD-NM_10 have high correlations with Adam (τ > 0.75, p < 0.05). Furthermore, SGD-NM_10 and Adam_20 have τ = 0.964 with p < 0.01, indicating a strong correlation in relative rankings. This confirms that as architecture complexity increases, evaluation schemes come to a more robust agreement about relative performance.

Fig. 4. Rank correlation coefficients heatmap for the second half of generated architectures.

4.2 Monte-Carlo Simulations

We saw the need to further study the ability to optimize through a proxy task with a specific rank correlation coefficient relative to the original search space. We also saw the need to study the behavior of optimization algorithms when they are given more freedom to explore both search spaces (i.e. when they are not restricted by computational resources). In order to do so, a controlled simulation environment had to be created. The environment should have many local optima, similar to the search space of neural architectures. Furthermore, we wanted to search on a proxy space, where the proxy and original spaces had a correlation coefficient similar to the one found in our original experiment. This would enable us to study the feasibility of using any proxy task that is known to produce results correlated with the original search space. Thus, we generated a 3-d Rastrigin function and added uniform noise until a fixed rank correlation coefficient with the original function was achieved. We followed the same procedure as in our original experiment: the genetic algorithm searched the space of the perturbed function (the proxy task) and the final solution was tested on the original. A 2-d example of the two functions can be seen in Fig. 5, showing how the addition of noise changes the function's surface but leaves major features and patterns intact.
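The perturbation step can be sketched as follows. The noise amplitude here is illustrative; in practice it would be scaled until the target rank correlation (e.g. τ = 0.75) between the two functions is reached.

```python
import math
import random

def rastrigin(x, A=10.0):
    """n-dimensional Rastrigin function; global minimum 0 at the origin."""
    return A * len(x) + sum(xi * xi - A * math.cos(2 * math.pi * xi)
                            for xi in x)

def perturbed_rastrigin(x, noise=5.0, rng=random):
    """Proxy objective: Rastrigin plus uniform noise of fixed amplitude."""
    return rastrigin(x) + rng.uniform(-noise, noise)
```

The genetic algorithm then minimizes `perturbed_rastrigin`, and the resulting solution is scored with `rastrigin` to measure how much quality the proxy search loses.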


G. Kyriakides and K. Margaritis

Fig. 5. 2-D Rastrigin functions, original and perturbed with τ = 0.738

Table 2 depicts the average (E) and Table 3 the standard deviation (σ) of the produced solutions, when applied to the original function. We conducted the experiment for three levels of correlation (τ = 0.75, τ = 0.85, and τ = 0.95). In the original space, the genetic algorithm is able to produce solutions close to the top 1% by searching the unperturbed Rastrigin function. For τ = 0.75, by conducting the search on the perturbed space and then applying the final solutions to the original function, the genetic algorithm produces solutions near the top 4%. For τ = 0.85 and τ = 0.95 it produces solutions near the top 3% and 2%, respectively. By constructing a 95% confidence interval, the original function's space produces a solution that belongs in the top [0, 0.033]. For the worst-case scenario (τ = 0.75), the perturbed function's space produces solutions in the top [0.016, 0.08]. For τ = 0.85, the solutions lie in [0.006, 0.07], and for τ = 0.95 in [0.003, 0.039]. Furthermore, the solutions' standard deviation decreases as τ increases, given enough generations to search the given space. Evidently, the most important factor for the final solution's quality is the number of generations the algorithm was allowed to run.

Table 2. Mean of the achieved solution's top percentage for the original and perturbed search spaces, when applied to the original function.

Generations  E(original)  E(τ = 0.95)  E(τ = 0.85)  E(τ = 0.75)
10           0.160        0.162        0.184        0.195
100          0.050        0.058        0.086        0.103
1000         0.013        0.022        0.032        0.040

Comparison of Neural Network Optimizers


Table 3. Standard deviation of the achieved solution's top percentage for the original and perturbed search spaces, when applied to the original function.

Generations  σ(original)  σ(τ = 0.95)  σ(τ = 0.85)  σ(τ = 0.75)
10           0.075        0.066        0.075        0.084
100          0.031        0.034        0.045        0.052
1000         0.010        0.012        0.019        0.021
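The "top x%" quantities reported in Tables 2 and 3 can be estimated by Monte-Carlo sampling: the fraction of uniformly drawn points in the domain whose original-function value beats the candidate solution. A dependency-free sketch (the sample count and seed are illustrative choices, not the paper's procedure):

```python
import math, random

def rastrigin3(x):
    """3-d Rastrigin function for a point x given as a list of three floats."""
    return 10 * 3 + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def top_fraction(candidate, n_samples=20000, seed=1):
    """Fraction of random points in [-5.12, 5.12]^3 that beat the candidate."""
    rng = random.Random(seed)
    cand_val = rastrigin3(candidate)
    better = sum(
        1
        for _ in range(n_samples)
        if rastrigin3([rng.uniform(-5.12, 5.12) for _ in range(3)]) < cand_val
    )
    return better / n_samples

print(top_fraction([0.0, 0.0, 0.0]))  # 0.0: the global optimum is beaten by no sample
```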

5 Limitations

This study provides some interesting results, both in the real-world experiment and in the Monte-Carlo simulation. Nonetheless, more thorough research must be conducted in order to assess the feasibility of using reduced-epoch training as a proxy for evaluating the relative rankings of neural network architectures. The candidate architectures in this study are admittedly few, even though they are diverse. Moreover, we tested our results on a single image recognition task, although it utilized a popular and well-studied dataset. We hope in future work to overcome some of these limitations by running longer experiments on more recent hardware, thus being able to employ more architectures and datasets.

6 Conclusions

In this paper we studied the relative ranking correlations of neural architectures by evolving an initial architecture with DeepNEAT's mutation operator and training the candidates for a set number of epochs with selected optimizers. By comparing the 1, 5, 10, and 20-epoch groups to the 50-epoch group, we observed positive and relatively high correlations for most optimizers, except for the RMSprop_01 and SGD_01 optimizer-epochs combinations. In order to test the most complex of our candidate architectures, we split the candidates in two and tested the relative rankings of the second half. This revealed a strong correlation between Adam and the two optimizer-epochs groups SGD-NM_10 and Adam_20. The correlation coefficient was very high for both groups (τ = 0.964) and showed high statistical significance (p < 0.01). In order to test the ability of optimization algorithms to produce optimal solutions from correlated proxy tasks, we employed a genetic algorithm on a perturbed Rastrigin function with a rank correlation coefficient of 0.75 to the original. The genetic algorithm produced solutions that belong, on average, to the top 3.5% of the original function's solutions. From the above, we conclude that reduced-epoch training can plausibly be used to evaluate the relative rankings of neural architectures for design and optimization purposes. The design/optimization process is not concerned with the absolute performance of intermediate solutions, only with their relative performance. Furthermore, genetic algorithms perform relatively well when searching perturbed spaces, given that the correlation with the original space is at least 0.75. Nonetheless, more thorough research must be conducted in order to establish optimal proxies for neural architecture design and optimization in a variety of domains.


Finally, the most decisive factor for the algorithm's solution quality is the number of generations it is allowed to run. Thus, it can be advantageous to use proxy tasks with lower correlation to the original search space if the resulting reduction in the computational cost of evaluating a single network can be dedicated to running a more exhaustive search.

7 Future Work

In light of our recent findings, we hope to continue researching feasible solutions for the relative comparison of neural network architectures. A higher number of candidate architectures and a broader range of datasets will greatly enhance our confidence in the results, provided they agree with the current findings. Moreover, comparisons with an even higher number of training epochs will further strengthen our confidence in the results. Finally, a study that employs well-established design methodologies will contribute to the final verdict on the feasibility of using reduced-epoch training, by collecting data from the actual problem.

Acknowledgements. This work was supported by computational time granted by the Greek Research & Technology Network (GRNET) in the National HPC facility ARIS under project ID DNAD. Furthermore, this research is funded by the University of Macedonia Research Committee as part of the "Principal Research 2019" funding program.

References
1. Miikkulainen, R., Liang, J., Meyerson, E., et al.: Evolving deep neural networks. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312 (2019). https://doi.org/10.1016/b978-0-12-815480-9.00015-3
2. Zoph, B., Le, Q.: Neural architecture search with reinforcement learning (2016). https://arxiv.org/abs/1611.01578. Accessed 21 Feb 2019
3. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.: Learning transferable architectures for scalable image recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018). https://doi.org/10.1109/cvpr.2018.00907
4. Kyriakides, G., Margaritis, K.: Neural architecture search with synchronous advantage actor-critic methods and partial training. In: Proceedings of the 10th Hellenic Conference on Artificial Intelligence - SETN 2018 (2018). https://doi.org/10.1145/3200947.3208068
5. Liu, C., et al.: Progressive neural architecture search. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 19–35. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_2
6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)
7. Cai, H., Zhu, L., Han, S.: ProxylessNAS: direct neural architecture search on target task and hardware. https://arxiv.org/abs/1812.00332. Accessed 19 Mar 2019
8. Cortes, C., Gonzalvo, X., Kuznetsov, V., et al.: AdaNet: adaptive structural learning of artificial neural networks (2019). https://arxiv.org/abs/1607.01097. Accessed 21 Feb 2019


9. Dong, C., Shi, Y., Tao, R.: Convolutional neural networks for clothing image style recognition. DEStech Trans. Comput. Sci. Eng. (2018). https://doi.org/10.12783/dtcse/cmsms2018/25262
10. Lee, D., McNair, J.: Deep reinforcement learning agent for playing 2D shooting games. Int. J. Control Autom. 11, 193–200 (2018). https://doi.org/10.14257/ijca.2018.11.3.17
11. Gatys, L., Ecker, A., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
12. Stanley, K., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evol. Comput. 10, 99–127 (2002). https://doi.org/10.1162/106365602320169811
13. Moriarty, D., Miikkulainen, R.: Forming neural networks through efficient and adaptive coevolution. Evol. Comput. 5, 373–399 (1997). https://doi.org/10.1162/evco.1997.5.4.373
14. Kendall, M.: A new measure of rank correlation. Biometrika 30, 81 (1938). https://doi.org/10.2307/2332226
15. Biscani, F., Izzo, D., Jakob, W., et al.: esa/pagmo2: pagmo 2.10. Zenodo (2019). https://doi.org/10.5281/zenodo.1045336. Accessed 21 Feb 2019

Detecting Violent Robberies in CCTV Videos Using Deep Learning

Giorgio Morales, Itamar Salazar-Reque, Joel Telles, and Daniel Díaz

National Institute of Research and Training in Telecommunications (INICTEL-UNI), National University of Engineering, San Luis 1771, 15021 Lima, Peru
{gmorales,jtelles,ddiaz}@inictel.edu.pe, [email protected]

Abstract. Video surveillance through security cameras has become difficult, due to the fact that many systems require manual human inspection to identify violent or suspicious scenarios, which is practically inefficient. The contribution of this paper is therefore twofold: the presentation of a video dataset called UNI-Crime, and the proposal of a violent-robbery detection method for CCTV videos using a deep-learning sequence model. Each of the 30 frames of our videos passes through a pre-trained VGG-16 feature extractor; then, the whole sequence of features is processed by two convolutional long short-term memory (convLSTM) layers; finally, the last hidden state passes through a series of fully-connected layers in order to obtain a single classification result. The method is able to detect a variety of violent robberies (i.e., armed robberies involving firearms or knives, or robberies showing different levels of aggressiveness) with an accuracy of 96.69%.

Keywords: Action recognition · convLSTM · Robbery detection

1 Introduction

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 282–291, 2019. https://doi.org/10.1007/978-3-030-19823-7_23

Citizen insecurity is one of the most important problems affecting the quality of life of people today. This is especially true for developing countries, where the problem is exacerbated by poverty and the lack of opportunities [1]. Of the different ways in which insecurity manifests itself, robberies are the most frequent. To reduce their rate of occurrence, a common solution is to install either indoor cameras (such as in convenience stores, gas stations, or restaurants) or outdoor cameras (such as the public surveillance cameras on the streets managed by the government). Unfortunately, for this solution to be efficient, many resources must be spent. For instance, indoor cameras are normally used just to record the assault and subsequently identify the robber; to use them to warn the police while a robbery is being committed, human inspection is needed. This is the approach in public outdoor cameras, where robbery detection relies on the


use of continuously monitored surveillance cameras watched by security agents. Nevertheless, this is a limited solution, due to the small number of agents compared to the number of cameras to be monitored and to the inherent fatigue caused by this exhausting task. In this context, the use of artificial intelligence techniques to offer new tools for automatic robbery detection can be of great aid. However, this is a difficult mission, since a robbery can happen anywhere in the city, which means that there is a high variety of scenarios, making this problem a big challenge. To the best of our knowledge, no studies have been proposed for the automatic detection of robbery. Violence detection, however, has been the subject of several previous studies. Some used hand-crafted features, as in [2], where the authors proposed DiMOLIF, a new feature based on STIPs [3] and optical flow, to describe violence in surveillance videos. They used this feature to classify video clips from two datasets with an SVM classifier, obtaining accuracies of 88% and 85%, respectively. Deniz et al. [4] compute the power spectrum of two consecutive frames to detect sudden changes elicited by fast movements, achieving an accuracy improvement of up to 12% with respect to the compared methods. It is worth noting that both studies assume that the actions are fast enough for the difference between frames to encode a violent action. This is not necessarily true for some armed robberies, where the criminal can intimidate the victim by holding a gun without any sudden movement. Other approaches use deep convolutional neural networks (CNNs). In [5], the authors used a 6-layer CNN to classify frame images from videos as normal or abnormal. However, as images were sampled from video sequences, there might be a high correlation between samples, making the network prone to overfitting. Moreover, as robbery is an action, it requires more than one image to be correctly described.
Additionally, the practical use of this approach would require the system to work on every frame, which is computationally expensive. In [6], Trajectory-Pooled Deep-Convolutional Descriptors (TDD) were described. They were presented as a new video feature for action recognition that combines the merits of both hand-crafted features (improved trajectories [7]) and deep-learned features (two-stream ConvNets [8]). This was later used in [9] for violence detection with high accuracy. In [10], the authors proposed an image acceleration field calculated from the optical flow field, which serves as the input to a convolutional neural network called FightNet. Finally, Sudhakaran et al. [11] developed an end-to-end trainable deep neural network for violent video classification using convolutional long short-term memory (convLSTM) networks. In this work, we propose an end-to-end trainable sequence model for violent robbery classification similar to that proposed in [11]. We no longer have to design hand-crafted features and feed them into a classifier; instead, the input to our model is a sequence of 30 RGB frames extracted from the CCTV videos of our dataset. The first part of our architecture is the feature extractor, which processes each frame using a pre-trained CNN, such as VGG16 or NASNetMobile.
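As a rough illustration of the spectral idea attributed to [4] above (a simplified sketch, not the authors' exact method): the power spectrum of the difference between two consecutive frames vanishes when nothing changes, and spreads energy away from the low frequencies when there is motion.

```python
import numpy as np

def highfreq_energy_ratio(frame_a, frame_b, cutoff=4):
    """Share of the difference image's spectral power outside a low-frequency block."""
    diff = frame_b.astype(float) - frame_a.astype(float)
    power = np.abs(np.fft.fftshift(np.fft.fft2(diff))) ** 2
    h, w = power.shape
    cy, cx = h // 2, w // 2
    low = power[cy - cutoff:cy + cutoff + 1, cx - cutoff:cx + cutoff + 1].sum()
    total = power.sum()
    return (total - low) / total if total > 0 else 0.0

rng = np.random.default_rng(0)
still = rng.integers(0, 255, (64, 64))
moved = np.roll(still, 5, axis=1)           # simulated fast motion
print(highfreq_energy_ratio(still, still))  # 0.0: identical frames, no change
```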


G. Morales et al.

The second part is the sequence network; it is composed of convLSTM layers, which can encode the spatio-temporal changes of the processed features. We tried different configurations and selected the best network, achieving a classification accuracy of 96.69% on our validation dataset.

2 Proposed Method

2.1 UNI-Crime Dataset

One of the difficulties of training an optimal model for robbery detection is the lack of a proper public dataset. Namely, some common problems are the small number of samples, the poor diversity of scenarios, or the absence of spontaneity in simulated scenes [12–14]. Nevertheless, the main problem is that almost none of the previous datasets contains scenes of robberies recorded by CCTV cameras, which is what we aim to detect. On the other hand, the UCF-Crime dataset [15] contains 1900 CCTV videos of normal actions, robberies, abuse, explosions, and other anomalies. However, those videos have different durations and frame rates; what is more, many of them are edited videos that contain multi-camera views, which is not useful for analyzing the continuity of an action, or present advertisements at the beginning or at the end. For those reasons, we decided to construct the UNI-Crime dataset [16] based on the UCF-Crime dataset. For this, we discarded the multi-camera and repeated videos. Because some videos are too long (three minutes or more) and others too short (20 s), we standardized the duration of our videos to 10 s. To do this, we trim each 10 useful seconds of the videos and classify them as robbery or non-robbery; by doing so, we can get multiple scenes of both classes from a single video, which strengthens our model, since it is less prone to overfitting. We also re-sized all the videos to 256 × 256 pixels and standardized the frame rate to 3 frames per second (that is, 30 frames per video) in order to avoid redundant information. In addition, we downloaded extra videos from YouTube, mainly of robberies or normal actions at stores. In the end, we collected 1421 videos: 1001 non-robbery and 420 robbery. A sample of frames from the dataset is shown in Fig. 1.

Fig. 1. Sample frames from the UNI-Crime dataset. (a) (b) Robbery samples. (c) (d) Non-robbery samples.

2.2 Neural Network Training

We propose a sequence model for end-to-end robbery detection. The architecture of the model is divided into two parts: the feature extractor and the recurrent network. We tried the pre-trained VGG16 [17] and NASNetMobile [18] networks as feature extractors, and the convolutional LSTM (convLSTM) network [19] as the recurrent network. We give further details about these structures below.

ConvLSTM. The long short-term memory (LSTM) network [20] is a recurrent end-to-end architecture that is capable of efficiently encoding long and short temporal changes of a sequence thanks to the memory cells it contains. Recently, LSTM networks have been widely used for purposes such as speech recognition, natural language processing (NLP), and even human action recognition [21]. However, a standard LSTM network discards most of the spatial information because it vectorizes all incoming data. The convolutional long short-term memory (convLSTM) network [19] was therefore proposed in order to preserve both the spatial and the temporal information. The equations that govern a single convLSTM unit are as follows:

\Gamma_f = \sigma(W_{fx} * x^{\langle t \rangle} + W_{fa} * a^{\langle t-1 \rangle} + b_f)  (1)
\Gamma_u = \sigma(W_{ux} * x^{\langle t \rangle} + W_{ua} * a^{\langle t-1 \rangle} + b_u)  (2)
\tilde{c}^{\langle t \rangle} = \tanh(W_{cx} * x^{\langle t \rangle} + W_{ca} * a^{\langle t-1 \rangle} + b_c)  (3)
c^{\langle t \rangle} = \Gamma_f \circ c^{\langle t-1 \rangle} + \Gamma_u \circ \tilde{c}^{\langle t \rangle}  (4)
\Gamma_o = \sigma(W_{ox} * x^{\langle t \rangle} + W_{oa} * a^{\langle t-1 \rangle} + b_o)  (5)
a^{\langle t \rangle} = \Gamma_o \circ \tanh(c^{\langle t \rangle})  (6)

where x^{\langle t \rangle} is the two-dimensional input at time t; \Gamma_f, \Gamma_u, and \Gamma_o are the outputs of the forget, update, and output gates, respectively; c^{\langle t \rangle} is the cell state; \tilde{c}^{\langle t \rangle} is the candidate for replacing the previous cell state; a^{\langle t \rangle} is the hidden state; * denotes convolution; \circ denotes the Hadamard product; and W_{fx}, W_{fa}, W_{ux}, W_{ua}, W_{cx}, W_{ca}, W_{ox}, and W_{oa} are two-dimensional convolutional filters.

VGG16. Also called OxfordNet, this is a convolutional neural network trained on ImageNet [22]. Table 1 shows the performance of the model on the ImageNet validation dataset. The network is 16 layers deep and uses exclusively 3 × 3 convolutional filters. It is commonly used as a high-level feature extractor for tasks such as saliency detection [23], action recognition [24, 25], or semantic segmentation [26]. Hence, we take the convolutional features from its last pooling layer.
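A single-channel NumPy sketch of one convLSTM step as given in Eqs. (1)–(6) (assumptions: one input and one hidden channel, 3 × 3 filters, zero padding; real layers use multi-channel filter banks):

```python
import numpy as np

def conv2d_same(x, k):
    """Zero-padded 'same' 2-D correlation of a single-channel map with a filter."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, a_prev, c_prev, W, b):
    """One time step of Eqs. (1)-(6); * is convolution, gates act element-wise."""
    gf = sigmoid(conv2d_same(x_t, W["fx"]) + conv2d_same(a_prev, W["fa"]) + b["f"])       # (1)
    gu = sigmoid(conv2d_same(x_t, W["ux"]) + conv2d_same(a_prev, W["ua"]) + b["u"])       # (2)
    c_tilde = np.tanh(conv2d_same(x_t, W["cx"]) + conv2d_same(a_prev, W["ca"]) + b["c"])  # (3)
    c_t = gf * c_prev + gu * c_tilde                                                      # (4)
    go = sigmoid(conv2d_same(x_t, W["ox"]) + conv2d_same(a_prev, W["oa"]) + b["o"])       # (5)
    a_t = go * np.tanh(c_t)                                                               # (6)
    return a_t, c_t

rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((3, 3))
     for k in ("fx", "fa", "ux", "ua", "cx", "ca", "ox", "oa")}
b = {k: 0.0 for k in ("f", "u", "c", "o")}
x_t = rng.standard_normal((8, 8))
a, c = convlstm_step(x_t, np.zeros((8, 8)), np.zeros((8, 8)), W, b)
```

Note that the spatial layout of the input is preserved: the hidden and cell states have the same height and width as the input map.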


NASNet. This is a product of Google's AutoML project, whose aim is to automate the design of machine learning models. In [18], the authors proposed a novel search method so that AutoML could find the best layer or combination of layers (like those present in the ResNet [27] or Inception [28, 29] models), which can then be stacked many times in a flexible manner to create a final network. One of the networks derived from the large NASNet, trained on the ImageNet and COCO datasets, is NASNetMobile, which achieved better performance than equivalently-sized state-of-the-art models for mobile platforms [30, 31].

Table 1. Model performance on the ImageNet validation dataset.

Network        Top-1 accuracy  Top-5 accuracy  Parameters
VGG16          0.713           0.901           138,357,544
NASNetMobile   0.744           0.919           5,326,716

Proposed Architecture. Figure 2 illustrates the general architecture of the network for detecting violent robberies. Its input is a sequence of 30 frames of 224 × 224 × 3 pixels. Contrary to previous works such as [8] or [11], we do not use optical flow images or the difference between adjacent frames as inputs, for two reasons. First, some armed robberies do not necessarily involve rapid changes in the scene; instead, a robber could threaten a person by holding a gun without sudden movements. Second, we want to keep as much spatial information as we can and let the convLSTM encode the spatial changes.

Fig. 2. The proposed network architecture using two convLSTM layers and the first 13 layers of the VGG16 network as feature extractor.
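A quick sanity check on the tensor shapes implied by Fig. 2 and Table 2 (assuming VGG16's standard five 2 × 2 poolings with "same"-padded convolutions, and "same"-padded convLSTM layers; this is an illustrative trace, not the authors' code):

```python
def vgg16_feature_shape(height, width):
    """Spatial size after VGG16's five pooling stages; the last conv block has 512 channels."""
    for _ in range(5):
        height, width = height // 2, width // 2
    return height, width, 512

frames = 30
h, w, c = vgg16_feature_shape(224, 224)
sequence_in = (frames, h, w, c)    # (30, 7, 7, 512) fed to convLSTM layer 1
convlstm_1 = (frames, h, w, 128)   # many-to-many, 128 filters
convlstm_2 = (h, w, 64)            # many-to-one: last hidden state, 64 filters
print(sequence_in, convlstm_1, convlstm_2)
```

The flattened last hidden state is then what the three fully-connected layers consume to produce the single classification output.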


Each frame passes through a convolutional feature extractor, which can be derived from a pre-trained VGG16 or NASNet network. These high-level features are processed by the convLSTM layers. The last convLSTM layer is a many-to-one layer. Finally, the hidden state of the last convLSTM unit is processed by three fully-connected layers in order to obtain a single classification result. We tried different network architectures but show in Table 2 only the five architectures that achieved the greatest performance. The first one, ROBNet1, uses the first 13 layers of the pre-trained VGG16 network as feature extractor and two convLSTM layers (as shown in Fig. 2). The second one, ROBNet2, uses the first 253 layers of the pre-trained NASNetMobile network (i.e., up to the activation_74 layer, whose output is 28 × 28 × 88) as feature extractor and one convLSTM layer. The third one, ROBNet3, uses the same feature extractor as ROBNet2 and two convLSTM layers. The fourth one, ROBNet4, uses the first 769 layers of the pre-trained NASNetMobile network (i.e., up to the last convolutional layer, activation_188, whose output is 7 × 7 × 1056) as feature extractor and one convLSTM layer. The fifth one, ROBNet5, uses the same feature extractor as ROBNet4 and two convLSTM layers.

Table 2. Parameters of the proposed network architectures.

Network  Feature extractor    #convLSTM layers  #Filters layer 1  #Filters layer 2
ROBNet1  VGG16 (13 layers)    2                 128               64
ROBNet2  NASNet (253 layers)  1                 128               -
ROBNet3  NASNet (253 layers)  2                 128               64
ROBNet4  NASNet (769 layers)  1                 128               -
ROBNet5  NASNet (769 layers)  2                 128               64

3 Results

Since the UNI-Crime dataset contains 256 × 256-pixel videos and the input size of both the VGG16 and NASNetMobile networks is 224 × 224, we applied random cropping four times to each video; two of the cropped videos were horizontally flipped. In this way, we applied data augmentation and obtained a dataset of 5684 videos. We used 85% of the dataset for the training set and 15% for the validation set. The training algorithm was implemented using Python 3.6 on a PC with an Intel i7-8700 CPU at 3.7 GHz, 64 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. The proposed network was trained for 120 epochs using the Adam optimizer [32] with a learning rate of 0.001, a momentum term β1 of 0.9, a momentum term β2 of 0.999, and a mini-batch size of 8. Furthermore, we added a 10% dropout rate before the fully-connected layers to prevent overfitting. Figure 3 shows the evolution of network accuracy and loss over training time of all the networks.

Table 3. Metrics comparison of different robbery detection networks.

Network   ACC (%)  Parameters
ROBNet1   96.698   20,906,095
ROBNet2   95.401   101,609,199
ROBNet3   94.103   88,025,199
ROBNet4   93.514   11,989,743
ROBNet5   92.453   6,927,471
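Training uses Adam [32] with the hyperparameters reported above (learning rate 0.001, β1 = 0.9, β2 = 0.999). A minimal NumPy sketch of the Adam update rule on a toy quadratic objective (ε = 1e-8 is the usual default, assumed here):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: biased moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)            # bias-corrected first moment
    v_hat = v / (1 - b2**t)            # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 101):                # minimize f(theta) = ||theta||^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```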

Fig. 3. Comparison of metrics evolution over training time of all networks. (a) Epochs vs. Accuracy. (b) Epochs vs. Loss.

Fig. 4. Metrics evolution over training time of ROBNet1. (a) Epochs vs. Accuracy. (b) Epochs vs. Loss.

Table 3 compares the performance of the five networks in terms of classification accuracy on the validation set. From Table 3 we chose ROBNet1 as the best network, as it presented the highest classification accuracy and the lowest cost when evaluated on the validation set (Fig. 4). What is more, ROBNet1 is nearly 1.3% more accurate than the second-best network and has 80,703,104 fewer parameters. Furthermore, we observe in Fig. 3 that only ROBNet1 shows a small gap between the training and validation values over the training time, meaning that it avoids overfitting and performs better than the other networks when predicting new samples outside the training set. Finally, in Table 4 we compare our results with those achieved using the method of [11] on the UNI-Crime dataset. Although their training accuracy is slightly higher than ours, our validation accuracy is higher than theirs by more than five percentage points. This fact supports our preference for RGB inputs over optical flow inputs.

Table 4. Metrics comparison of different robbery detection networks.

Network          Training ACC (%)  Validation ACC (%)
ROBNet1          99.46             96.69
Sudhakaran [11]  99.63             91.25

4 Conclusion

In this paper, we have presented an efficient end-to-end trainable deep neural network to tackle the problem of violent robbery detection in CCTV videos. We presented a dataset that encompasses both normal and robbery scenarios. What is more, some of the normal videos contained in our dataset correspond to moments before or after a robbery, which ensures that our method can discern between normal and robbery events even in the same environment. The proposed method is able to encode both the spatial and temporal changes using convolutional LSTM layers that receive a sequence of features extracted by a pre-trained CNN from the original video frames. The classification accuracy evaluated on the validation dataset reaches 96.69%. Although this is a promising result, further improvements need to be made in order to offer a scalable and marketable product, such as substantially increasing the number of videos and scenarios in the dataset and reducing the computational cost, among others.

References
1. The Global Shapers Survey. http://shaperssurvey2017.org/. Accessed 4 Feb 2019
2. Mabrouk, A.B., Zagrouba, E.: Spatio-temporal feature using optical flow based distribution for violence detection. Pattern Recognit. Lett. 92, 62–67 (2017). https://doi.org/10.1016/j.patrec.2017.04.015
3. Laptev, I.: On space-time interest points. Int. J. Comput. Vis. 64(2–3), 107–123 (2005). https://doi.org/10.1007/s11263-005-1838-7


4. Deniz, O., Serrano, I., Bueno, G., Kim, T.K.: Fast violence detection in video. In: 2014 International Conference on Computer Vision Theory and Applications (VISAPP), pp. 478–485. IEEE, Lisbon (2014)
5. Tay, N.C., Connie, T., Ong, T.S., Goh, K.O.M., Teh, P.S.: A robust abnormal behavior detection method using convolutional neural network. In: Computational Science and Technology. LNEE, vol. 481, pp. 37–47. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-2622-6_4
6. Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4305–4314 (2015). https://doi.org/10.1109/CVPR.2015.7299059
7. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 3551–3558. IEEE, Sydney (2013). https://doi.org/10.1109/ICCV.2013.441
8. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 568–576. MIT Press, Montreal (2014)
9. Meng, Z., Yuan, J., Li, Z.: Trajectory-pooled deep convolutional networks for violence detection in videos. In: Liu, M., Chen, H., Vincze, M. (eds.) ICVS 2017. LNCS, vol. 10528, pp. 437–447. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68345-4_39
10. Zhou, P., Ding, Q., Luo, H., Hou, X.: Violent interaction detection in video based on deep learning. J. Phys. Conf. Ser. 844, 012044 (2017)
11. Sudhakaran, S., Lanz, O.: Learning to detect violent videos using convolutional long short-term memory. In: 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, pp. 1–6 (2017). https://doi.org/10.1109/AVSS.2017.8078468
12. Hassner, T., Itcher, I., Kliper-Gross, O.: Violent flows: real-time detection of violent crowd behavior. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Providence (2012). https://doi.org/10.1109/CVPRW.2012.6239348
13. Bermejo Nievas, E., Deniz Suarez, O., Bueno García, G., Sukthankar, R.: Violence detection in video using computer vision techniques. In: Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A., Kropatsch, W. (eds.) CAIP 2011. LNCS, vol. 6855, pp. 332–339. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23678-5_39
14. Li, W., Mahadevan, V., Vasconcelos, N.: Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 18–32 (2014)
15. Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. arXiv:1801.04264 (2018)
16. UNI-Crime Dataset. http://didt.inictel-uni.edu.pe/dataset/UNI-Crime Dataset.rar. Accessed 25 Jan 2019
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
18. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710. IEEE, Salt Lake City (2018)


19. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Cortes, C., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015), vol. 1, pp. 802–810. MIT Press, Cambridge (2015)
20. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
21. Salehinejad, H., Sankar, S., Barfett, J., Colak, E., Valaee, S.: Recent advances in recurrent neural networks. arXiv:1801.01078 (2018)
22. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Miami (2009). https://doi.org/10.1109/CVPR.2009.5206848
23. Lee, G., Tai, Y., Kim, J.: Deep saliency with encoded low level distance map and high level features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 660–668. IEEE, Las Vegas (2016)
24. Lan, Z., Zhu, Y., Hauptmann, A.G., Newsam, S.: Deep local video feature for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, Honolulu (2017)
25. Li, Z., Gavrilyuk, K., Gavves, E., Jain, M., Snoek, C.G.: VideoLSTM convolves, attends and flows for action recognition. Comput. Vis. Image Underst. 166, 41–50 (2018). https://doi.org/10.1016/j.cviu.2017.10.011
26. Liu, T., Stathaki, T.: Faster R-CNN for robust pedestrian detection using semantic segmentation network. Front. Neurorobot. 12, 64 (2018). https://doi.org/10.3389/fnbot.2018.00064
27. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE, Las Vegas (2016). https://doi.org/10.1109/CVPR.2016.90
28. Szegedy, C., et al.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. IEEE, Boston (2015). https://doi.org/10.1109/CVPR.2015.7298594
29. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI Conference on Artificial Intelligence, San Francisco (2017)
30. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
31. Zhang, X., Zhou, X., Mengxiao, L., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. arXiv:1707.01083 (2017)
32. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR 2015), San Diego (2015)

Diversity Regularized Adversarial Deep Learning

Babajide O. Ayinde1, Keishin Nishihama1,2, and Jacek M. Zurada1,2(B)

1 Electrical and Computer Engineering Department, University of Louisville, Louisville, KY, USA
{babajide.ayinde,jacek.zurada}@louisville.edu, [email protected]
2 Information Technology Institute, University of Social Science, 90-113 Lodz, Poland

Abstract. The two key players in Generative Adversarial Networks (GANs), the discriminator and the generator, are usually parameterized as deep neural networks (DNNs). On many generative tasks, GANs achieve state-of-the-art performance, but they are often unstable to train and sometimes miss modes. A typical failure mode is the collapse of the generator to a single parameter configuration where its outputs are identical. When this collapse occurs, the gradient of the discriminator may point in similar directions for many similar points. We hypothesize that some of these shortcomings are in part due to primitive and redundant features extracted by the discriminator, which can easily make the training get stuck. We present a novel approach for regularizing adversarial models by enforcing diverse feature learning. To do this, both generator and discriminator are regularized by penalizing both negatively and positively correlated features according to their differentiation, based on their relative cosine distances. In addition to the gradient information from the adversarial loss made available by the discriminator, diversity regularization also ensures that a more stable gradient is provided to update both the generator and discriminator. Results indicate that our regularizer enforces diverse features, stabilizes training, and improves image synthesis.

Keywords: Deep learning · Feature correlation · Generative model · Adversarial learning · Feature redundancy · Generative Adversarial Networks · Regularization

1 Introduction

Convolutional neural networks (CNNs) have become the powerhouse for tackling many image processing and computer vision tasks. By design, CNNs learn to automatically optimize a well-defined objective function that quantifies the quality of results and their performance on the task at hand. As shown in previous studies [1], designing effective loss functions for many image prediction problems is daunting and often requires manual effort and in-depth expert knowledge and insight. For instance, naively minimizing the Euclidean distance between predicted and ground-truth pixels has been shown to result in blurry outputs, since the Euclidean distance is minimized by averaging all conceivable outputs [1–4]. One plausible way of training models with high-level objective specifications is to allow CNNs to automatically learn appropriate loss functions that satisfy these desired objectives. One such objective could be as simple as asking the model to make the output indistinguishable from the ground truth. As established in [1,5–7], GANs are trained to automatically learn an objective function using a discriminator network to classify whether its input is real or synthesized, while simultaneously training a generative model to minimize the loss. In the GAN framework, both the discriminator and generator aim to minimize their own loss, and the solution to the game is the Nash equilibrium where neither player can independently improve their individual loss [5,8]. This framework can also be interpreted from the viewpoint of statistical divergence minimization between the learned model distribution and the true data distribution [9–11]. Even though GANs have resulted in new and interesting applications and achieved promising performance, they are still hard to train and very sensitive to hyperparameter tuning. A peculiar and common training challenge is the performance control of the discriminator. The discriminator is usually inaccurate and unstable in estimating the density ratio in high-dimensional spaces, leading to situations where the generator finds it difficult to model the multi-modal landscape of the true data distribution.

c IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 292–306, 2019. https://doi.org/10.1007/978-3-030-19823-7_24
In the event of a total disjoint between the supports of the model and true distributions, a discriminator can trivially distinguish the model distribution from that of the true data [12], leading to situations where the generator stops training because the derivative of the resulting discriminator with respect to its input has vanished. This problem has prompted many recent works proposing workable heuristics to address training problems such as mode collapse and missing modes. We argue, in line with this hypothesis, that some of the problems associated with the training of GANs are in part due to a lack of control of the discriminator. In light of this, we propose a simple yet powerful diversity regularizer for training GANs that encourages the discriminator to extract near-orthogonal filters. The problem abstraction is that, in addition to the gradient information from the adversarial loss made available by the discriminator, we also want the GAN system to benefit from extracting diverse features in the discriminator. Experimental results consistently show that, when correctly applied, the proposed regularization enforces diverse features in the discriminator and better stabilizes GAN training, with mostly positive effects on the generated samples.

The contribution of this work is two-fold: (i) we propose a new method to regularize adversarial learning by inhibiting the learning of redundant features and providing a stable gradient for weight updates during training, and (ii) we show that the proposed method stabilizes adversarial training and enhances the performance of many state-of-the-art methods across many benchmark datasets.

The rest of the paper is structured as follows: Sect. 2 highlights the state-of-the-art and Sect. 3 discusses in detail the formulation of diversity-regularized adversarial learning. Section 4 discusses the detailed experimental designs and presents the results. Finally, conclusions are drawn in Sect. 5.

2 Related Work

As originally introduced in [5], GANs consist of a generator and a discriminator that are parameterized by deep neural networks and are capable of synthesizing interesting local structure on select datasets. The representation capacity of the original GAN was extended in conditional GANs [13] by incorporating an additional vector that enables the generator to synthesize samples conditioned on some useful information. This extension has motivated several conditional variants of GAN in diverse applications such as edge maps [14,15], image synthesis from text [16], super-resolution [17], and style transfer [18], to mention a few. Learning useful representations with GANs has been shown to rely heavily on hyperparameter tuning due to various instability issues during training [8,12,19]. GANs are remarkably hard to train in spite of their success on a variety of tasks. Robustly and systematically stabilizing the training of GANs has been attempted in many forms, such as selective architectural design [6], matching of intermediate features [7], and unrolling the optimization of the discriminator [20] (Fig. 1).

Fig. 1. Schema of Diversity Regularized Adversarial Learning (DiReAL)

Many recent advances, inspired by either theoretical insights or practical considerations, have been attempted in the form of regularization and normalization to address some of the issues associated with training GANs. Imposing a Lipschitz constraint on the discriminator has been shown to stabilize adversarial training and avoid an over-optimization scenario in which the discriminator still distinguishes, and allots different scores to, nearly indistinguishable samples [12]. By satisfying the Lipschitz constraint, the discriminator's joint/compressed representation of the true and synthesized data distributions is guaranteed to be smooth, thus ensuring a non-zero learning signal for the generator [12,19].


Enforcing the discriminator to satisfy the Lipschitz constraint has been approximated and implemented via ancillary means such as gradient penalties [21] and weight clipping [12]. Using a Gaussian classifier over the real/fake indicator variables has also been shown to have a smoothing effect on the discriminator function [19]. Injecting label noise [7] and gradient penalties have equally been shown to have a tremendous regularizing effect on GANs. Schemes such as weighted gradients [22] and missing-modes penalties [23] have been utilized to alleviate training and missing-mode issues in GAN learning. Techniques such as batch normalization [24] and layer normalization [25] have also been reported in the context of GANs [6,21,26]. In batch normalization, pre-activations of nodes in a layer are normalized to mean β and standard deviation γ. Parameters β and γ are learned for each node in the layer, and normalization is done at the batch level and for each node separately [8,25]. Layer normalization, on the other hand, uses the same learned parameters β and γ to normalize all nodes in a layer and normalizes different samples differently [25]. Weight vectors of the discriminator have been l2-normalized with the Frobenius norm, which constrains the sum of the squared singular values of the weight matrix to be 1 [7]. However, normalizing with the Frobenius norm translates to utilizing a single feature to discriminate the model probability distribution from the target, thus reducing the rank and hence the number of discriminator features [27]. Like weight clipping [10,12], weight-normalization approaches yield a primitive discriminator model that maps the target distribution with only a select few features.
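The batch-versus-layer normalization distinction above can be sketched numerically. The following is an illustrative sketch (the function names and scalar treatment of β and γ are ours, not this paper's training code): batch normalization computes statistics per feature over the batch axis, while layer normalization computes statistics per sample over the feature axis.

```python
import numpy as np

def batch_norm(x, beta=0.0, gamma=1.0, eps=1e-5):
    """Normalize each node over the batch axis (per-feature statistics)."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layer_norm(x, beta=0.0, gamma=1.0, eps=1e-5):
    """Normalize each sample over the feature axis (same beta, gamma for all nodes)."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

After batch normalization each column has (approximately) zero mean over the batch; after layer normalization each row has zero mean over its features, so different samples are normalized differently, as noted above.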
The work most closely related to ours is orthonormal regularization of weights [28], which sets all the singular values of the weight matrix in the discriminator to one; this translates to using as many features as possible to distinguish the generator distribution from the target distribution. Our approach, however, imposes a much softer orthogonality constraint on the weight vectors by allowing a degree of feature sharing in the upper layers of the discriminator. Another related work is spectral normalization of weights, which guarantees 1-Lipschitzness for linear layers and ReLU activation units, resulting in discriminators of higher rank [27]. The advantage of spectral normalization is that the weight matrices are constrained and Lipschitz. However, bounding the spectral norm of the convolutional kernel to 1 does not bound the spectral norm of the convolutional mapping to unity.
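Spectral normalization of a linear layer, mentioned above as related work [27], can be sketched with a few power-iteration steps. This is an illustrative sketch under our own naming (it is not Miyato et al.'s implementation, which normalizes on the fly during training with a single persisted iteration):

```python
import numpy as np

def spectral_normalize(w, n_iter=50):
    """Divide a weight matrix by its largest singular value, estimated by
    power iteration, so that the linear map has spectral norm 1."""
    u = np.ones(w.shape[0]) / np.sqrt(w.shape[0])  # arbitrary start vector
    for _ in range(n_iter):
        v = w.T @ u
        v = v / np.linalg.norm(v)
        u = w @ v
        u = u / np.linalg.norm(u)
    sigma = u @ w @ v                              # leading singular value estimate
    return w / sigma
```

The normalized matrix has largest singular value 1, which bounds the Lipschitz constant of the corresponding fully connected layer; as the paragraph above notes, applying this to a convolutional kernel does not by itself bound the spectral norm of the full convolutional mapping.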

3 Method

The training of a GAN can be abstracted as a non-cooperative game between two players, the generator G and the discriminator D. The discriminator tries to distinguish whether a sample comes from the real data distribution ($p_{data}$) or the fake distribution ($p_z$), while G tries to trick D into believing that a generated sample comes from $p_{data}$ by moving the generation manifold towards the data manifold. The discriminator aims to maximize $\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]$ when the input is sampled from the real distribution and, given a fake image sample $G(z)$, $z \sim p_z(z)$, it is trained to output a probability $D(G(z))$ close to zero by maximizing $\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$. The generator network, however, is trained to maximize the chance of D producing a high probability for a fake image sample $G(z)$, i.e., by minimizing $\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$. The adversarial cost is obtained by combining the objectives of both D and G in a min-max game as given in (1) below:

$$J_{adv} = \min_G \max_D \ \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
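The two expectation terms in (1) can be estimated from a minibatch of discriminator outputs. The sketch below is our illustrative helper, not the paper's training code: it returns a loss for D (the negated objective D ascends) and the generator's half of the min-max game.

```python
import math

def adversarial_losses(d_real, d_fake):
    """Monte-Carlo estimate of the two terms in Eq. (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on fake samples, each in (0, 1)
    """
    # D ascends E[log D(x)] + E[log(1 - D(G(z)))]; negate to obtain a loss.
    d_loss = -(sum(math.log(p) for p in d_real) / len(d_real)
               + sum(math.log(1.0 - p) for p in d_fake) / len(d_fake))
    # G descends E[log(1 - D(G(z)))], its side of the min-max game.
    g_loss = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return d_loss, g_loss
```

When D is confident (D(x) near 1 and D(G(z)) near 0), d_loss approaches zero and g_loss becomes strongly negative, which is the vanishing-generator-gradient regime discussed in the introduction.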

Training D can be conceived as training an evaluation metric on the sample space [23] that enables G to use the local gradient ∇log D(G(z)) information made available by D to improve itself and move closer to the data manifold.

3.1 Feature Diversification in GAN

Both D and G are commonly parameterized as DNNs, and over the past few years the general trend has been that DNNs have grown deeper, amounting to a huge increase in the number of parameters. The number of parameters in DNNs is usually very large, offering the possibility to learn very flexible, high-performing models [29]. Observations from many previous studies [30–33] suggest that layers of DNNs typically rely on many redundant filters that can be either shifted versions of each other or very similar with little or no variation. For instance, this redundancy is evidently pronounced in the filters of AlexNet [34], as emphasized in [31,35,36]. To address this redundancy problem, we train layers of the discriminator under specific and well-defined diversity constraints. Since G and D rely on many redundant filters, we regularize them during training to provide a more stable gradient to update both G and D. Our regularizer enforces constraints on the learning process by encouraging diverse filtering and discouraging D from extracting redundant filters. We remark that convolutional filtering has been found to greatly benefit from diversity or orthogonality of filters because it can alleviate problems of vanishing or exploding gradients [28,37–39].

Typically, both D and G consist of input, output, and many intermediate processing layers. Let the number of channels, height, and width of the input feature map for the $l$-th layer be denoted $n_l$, $h_l$, and $w_l$, respectively. A convolutional layer in D transforms input $x_l \in \mathbb{R}^p$ into output $x_{l+1} \in \mathbb{R}^q$, where $x_{l+1}$ is the input to layer $l+1$, $p = n_l \times h_l \times w_l$, and $q = n_{l+1} \times h_{l+1} \times w_{l+1}$. $x_l$ is convolved with $n_{l+1}$ 3D filters $\chi \in \mathbb{R}^{n_l \times k \times k}$, resulting in $n_{l+1}$ output feature maps. Unrolling and combining all layer-$l$ filters into a single matrix yields the kernel matrix $\Theta_D^{(l)} \in \mathbb{R}^{m \times n_{l+1}}$, where $m = k^2 n_l$. Then $\theta_{D,i}^{(l)}$, $i = 1, \ldots, n_l$, denotes the filters in layer $l$; each $\theta_{D,i}^{(l)} \in \mathbb{R}^m$ corresponds to the $i$-th column of the kernel matrix $\Theta_D^{(l)} = [\theta_{D,1}^{(l)}, \ldots, \theta_{D,n_l}^{(l)}]$; the bias term of each layer is omitted for simplicity. Given that $\Theta_D^{(l)}$ contains $n_l$ normalized filter vectors as columns, each with $m$ elements corresponding to connections from layer $l-1$ to the $i$-th neuron of layer $l$, the diversity loss $J_D$ over all layers of D is given as:

$$J_D(\theta_D) = \frac{1}{2} \sum_{l=1}^{L} \sum_{i=1}^{n_l} \sum_{j=1}^{n_l} \left( \Omega_{ij}^{D(l)} M_{ij}^{D(l)} \right)^2 \tag{2}$$

where $\Omega^{D(l)} = (\Theta_D^{(l)})^T \Theta_D^{(l)} \in \mathbb{R}^{n_l \times n_l}$ contains the inner product of each pair of columns $i$ and $j$ of $\Theta_D^{(l)}$ in position $(i,j)$; $M^{D(l)} \in \mathbb{R}^{n_l \times n_l}$ is a binary mask for layer $l$ defined in (3); and $L$ is the number of layers to be regularized:

$$M_{ij}^{D(l)} = \begin{cases} 1 & \tau \le |\Omega_{ij}^{D(l)}| \le 1 \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

Similarly, the diversity loss $J_G$ for the generator G is given as:

$$J_G(\theta_G) = \frac{1}{2} \sum_{l=1}^{L} \sum_{i=1}^{n_l} \sum_{j=1}^{n_l} \left( \Omega_{ij}^{G(l)} M_{ij}^{G(l)} \right)^2 \tag{4}$$

with

$$M_{ij}^{G(l)} = \begin{cases} 1 & \tau \le |\Omega_{ij}^{G(l)}| \le 1 \text{ and } i \ne j \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
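A single-layer version of the diversity loss (2) with the mask (3) can be sketched as follows. This is our illustrative NumPy sketch (the paper's implementation is in PyTorch), assuming the filter columns are first normalized so that Ω holds pairwise cosine similarities:

```python
import numpy as np

def diversity_loss(theta, tau=0.5):
    """Diversity loss of Eq. (2) for one layer's kernel matrix.

    theta: (m, n) kernel matrix with one unrolled filter per column.
    tau:   threshold of Eq. (3); pairs with |Omega_ij| < tau are left
           unpenalized, allowing a controlled amount of feature sharing.
    """
    theta = theta / np.linalg.norm(theta, axis=0, keepdims=True)  # unit columns
    omega = theta.T @ theta                       # Omega = Theta^T Theta
    mask = (np.abs(omega) >= tau).astype(float)   # Eq. (3): tau <= |Omega_ij| <= 1
    np.fill_diagonal(mask, 0.0)                   # M_ii = 0: skip self-pairs
    return 0.5 * np.sum((omega * mask) ** 2)      # Eq. (2), single layer
```

Orthogonal filters incur zero loss, while duplicated (or strongly anti-correlated) filters are penalized most heavily, which is exactly the behavior the regularizer enforces.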

It is important to note the relevance of τ in (5). Setting τ = 0 results in layer-wise disjoint filters: weight vectors are forced to be orthogonal by pushing them towards the nearest orthogonal manifold. From a practical standpoint, however, fully disjoint filters are not desirable because some features sometimes need to be shared across layers. For instance, consider a model trained on the CIFAR-10 dataset [40], which has "automobiles" and "trucks" as two of its ten categories: if a particular lower-level feature captures "wheel" and two higher-layer features describe automobile and truck, then it is highly probable that the two upper-layer features share the feature that describes the wheel. The choice of τ determines the level of sharing allowed, that is, the degree of feature sharing across features of a particular layer. In other words, τ serves as a trade-off parameter that permits some degree of feature sharing across multiple high-level features while ensuring that features remain sufficiently dissimilar. To enforce feature diversity in both G and D while training GANs, the diversity regularization terms in (2) and (4) are added to the conventional adversarial cost $J_{adv}$ in (1), as given in (6):

$$J_{net} = J_{adv} + J_{div} \tag{6}$$



Fig. 2. Diversity loss of (a) generator JG with no regularization, (b) generator JG with DiReAL, (c) discriminator JD with no regularization, and (d) discriminator JD with DiReAL, trained on the MNIST dataset.

where $J_{div} = \lambda_G J_G(\theta_G) - \lambda_D J_D(\theta_D)$, and $\lambda_G$ and $\lambda_D$ are the diversity penalty factors for the generator and discriminator, respectively. The derivative of the diversity loss $J_D$ with respect to the weights of D is given as

$$\nabla_{\Theta_{i,j}^{D(l)}} J_D(\theta_D) = \sum_{k=1}^{n} \Theta_{i,k}^{D(l)} \, \Omega_{k,j}^{D(l)} \, M_{k,j}^{D(l)} \tag{7}$$

and the derivative of the diversity loss $J_G$ with respect to the weights of G is

$$\nabla_{\Theta_{i,j}^{G(l)}} J_G(\theta_G) = \sum_{k=1}^{n} \Theta_{i,k}^{G(l)} \, \Omega_{k,j}^{G(l)} \, M_{k,j}^{G(l)} \tag{8}$$
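The layer-wise gradient in (7) and (8) is, in matrix form, Θ(Ω ∘ M). An illustrative NumPy sketch of (7), under the same notation as above (in practice automatic differentiation would supply this gradient):

```python
import numpy as np

def diversity_grad(theta, tau=0.5):
    """Gradient of the diversity loss w.r.t. Theta, per Eqs. (7)/(8):
    grad_{ij} = sum_k Theta_{ik} * Omega_{kj} * M_{kj} = (Theta @ (Omega * M))_{ij}."""
    omega = theta.T @ theta                       # Omega = Theta^T Theta
    mask = (np.abs(omega) >= tau).astype(float)   # binary mask of Eq. (3)
    np.fill_diagonal(mask, 0.0)                   # diagonal pairs are masked out
    return theta @ (omega * mask)
```

For an orthogonal kernel matrix every masked entry of Ω is zero, so the gradient vanishes: orthogonal filter sets are stationary points of the diversity term, consistent with the loss pushing weights towards the nearest orthogonal manifold.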

The idea behind diversifying features is that, in addition to the adversarial gradient information provided by D, we provide an additional diversity loss with a more stable gradient to refine both G and D. The diversity loss encourages the weights of both generator and discriminator to be diverse by pushing them towards the nearest orthogonal manifold. Our proposed regularization provides more efficient gradient flow, a more stable optimization, richer layer-wise features in the resulting model, and improved sample quality compared to benchmarks and baselines. The diversity regularization ensures that the column spaces of $\Theta_D^{(l)}$ and $\Theta_G^{(l)}$ for the $l$-th layer do not concentrate in a few directions during training, thus preventing them from being sensitive in only a few, limited directions. The proposed diversity-regularized adversarial learning alleviates some of the main failure modes of GANs by ensuring that features are diverse.

4 Experiments

All experiments were performed on an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz with 64 GB of RAM, running 64-bit Ubuntu 16.04. The software implementation used the PyTorch library (https://pytorch.org/) on two Titan X 12 GB GPUs. An implementation of DiReAL will be available at https://github.com/keishinkickback/DiReAL. Diversity regularized adversarial learning (DiReAL) was evaluated on the MNIST dataset of handwritten digits [41], CIFAR-10 [40], STL-10 [42], and Celeb-A [43] databases.

In the first set of experiments, the ubiquitous deep convolutional GAN (DCGAN) of [6] was trained on MNIST digits. The standard MNIST dataset has 60000 training and 10000 testing examples; each example is a grayscale image of a handwritten digit scaled and centered in a 28 × 28 pixel box. Both the discriminator and generator networks contain 5 convolutional blocks. The Adam optimizer (β1 = 0.0, β2 = 0.9) [44] with a batch size of 64 was used to train the model for 100 epochs; τ and the learning rate in DiReAL were set to 0.5 and 0.0001, respectively, and λD and λG were set to 1.0 and 0.01, respectively.

Figure 2 shows the diversity loss of both generator and discriminator for DiReAL and the unregularized counterpart. It can be observed that DiReAL was able to minimize the pairwise feature correlations compared to the highly correlated features extracted by the unregularized counterpart. Specifically, DiReAL steadily minimized the diversity loss as training progressed, whereas in the unregularized DCGAN the extraction of similar features grows with the training epoch, increasing the diversity loss. The divergence between the discriminator output for real handwritten digits and generated samples over 30 batches for the regularized and unregularized networks is shown in Fig. 3a. The divergence was measured using the Wasserstein distance [45], and it can be observed that the regularizing effect of DiReAL stabilizes the adversarial training and prevents mode collapse. For the unregularized network, however, the mode started to collapse around the 45th epoch. A closer look at the diversity of the generator in Fig. 2a shows that just around the epoch of collapse the generator starts extracting more and more redundant filters. We suspect that DiReAL


was able to stabilize the training by pushing features to lie close to the orthogonal manifold, thus preventing the learned features from collapsing onto an undesirable manifold and highlighting one of the benefits of pushing weights near the orthogonal manifold. Figure 3b shows handwritten digit samples synthesized with and without DiReAL; it can be observed that diversification of features is beneficial for stabilizing adversarial learning and ultimately improving sample quality.
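The divergence curves of Fig. 3a rely on a one-dimensional Wasserstein distance [45]; for two equal-size empirical samples this reduces to the mean absolute difference of the sorted values. A sketch (our illustrative helper, not necessarily the exact estimator used in the experiments):

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples,
    e.g. discriminator outputs on real vs. synthesized batches.
    For sorted equal-size samples, W1 is the mean absolute difference."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    return float(np.mean(np.abs(a - b)))
```

A distance near zero means the discriminator scores real and synthesized batches almost identically; a sudden sustained drop toward zero with poor samples is one signature of the mode collapse discussed above.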


Fig. 3. (a) Divergence, as measured by Wasserstein distance, between the discriminator output for synthesized and real MNIST samples. (b) Synthesized hand-written digits with and without diversity regularization.

In the second set of large-scale experiments, the CIFAR-10 dataset was used to train a GAN with DiReAL, and the results were compared to unregularized training. The dataset is split into 50000 training and 10000 testing images. Similar to the experiments with MNIST, Fig. 4 shows the diversity loss of the discriminator with and without DiReAL trained on the CIFAR-10 database. It can be observed that DiReAL was able to minimize the diversity loss and encourage diverse features that benefit the adversarial training, whereas Fig. 4a shows that the diversity loss of the unregularized network is higher and unconstrained compared to that of DiReAL. The images synthesized with DiReAL were compared and contrasted with state-of-the-art methods such as batch normalization [24], layer normalization [25], weight normalization [46], and spectral normalization [27]. It is remarked that DiReAL can be used in tandem with these other regularization techniques and could also be deployed as a stand-alone regularization tool for stabilizing adversarial learning. In this light, DiReAL was also combined with these techniques. It must be noted that spectral normalization uses a variant of the DCGAN architecture with an eight-layer discriminator network; see [27] for more implementation details. It can be observed in Fig. 5 that diversity regularization was able to synthesize more diverse and complex images compared to the unregularized counterpart. Other benchmark regularizers were able to generate better image samples


Fig. 4. Diversity loss of the discriminator JD (a) with no regularization and (b) with DiReAL, trained on the CIFAR-10 dataset.

Fig. 5. Generated images with and without DiReAL trained on CIFAR-10 dataset.

compared to using only DiReAL. However, when DiReAL was combined with other regularizers, the quality of the generated samples was significantly improved. For quantitative evaluation of the generated examples, the inception score metric [46] was used. The inception score has been found to correlate highly with subjective human judgment of image quality [27,46]. Similar to [27,46], the inception score was computed for 5000 synthesized images using generators trained with each regularization technique. Every experiment was repeated five times and the results averaged to combat the effect of random initialization; the average and standard deviation of the inception scores are reported. The proposed regularization is also compared and contrasted in terms of inception score with many benchmark methods, as summarized in Table 1. It can again be observed that DiReAL improved the image generation quality compared to the unregularized counterpart; when combined with spectral normalization, we observed a 6% improvement in the inception score, and combining DiReAL with layer normalization yielded an 11.68% improvement. However, no significant improvement was observed when DiReAL was combined with batch normalization or weight normalization. It must be remarked that the calculation of inception scores is library-dependent, which is why the scores reported in Table 1 differ from those reported by Miyato et al. [27]: our implementation is in PyTorch, whereas [27] used Chainer (https://chainer.org/).

Table 1. Inception scores with unsupervised image generation on CIFAR-10

  Method                             Inception score
  Real data                          9.04
  -- Standard CNN --
  Unregularized [6]                  4.00 ± 0.15
  DiReAL (ours)                      4.17 ± 0.03
  Batch Normalization [24]           5.48 ± 0.19
  Layer Normalization [25]           5.05 ± 0.12
  Weight Normalization [46]          4.66 ± 0.14
  Spectral Normalization [27]        6.50 ± 0.30
  Weight Normalization + DiReAL      4.68 ± 0.06
  Batch Normalization + DiReAL       5.48 ± 0.15
  Layer Normalization + DiReAL       5.64 ± 0.15
  Spectral Normalization + DiReAL    6.87 ± 0.12

In the next set of large-scale experiments, the STL-10 dataset was used to train the generator under diversity regularization, and the results were compared with other state-of-the-art regularization techniques. As can be observed in Fig. 6, the generator trained with DiReAL synthesized images of competitive quality in comparison with the other regularization methods considered. The performance of DiReAL was also observed to be competitive with regularization methods such as WGAN-GP and spectral normalization.

Fig. 6. Qualitative comparison of generated images with four regularization techniques for models trained on the STL-10 dataset.

In Fig. 7 we show images produced by generators trained with DiReAL on the Celeb-A dataset. It can again be observed that DiReAL was able to stabilize the training and avoid mode collapse in comparison to the unregularized counterpart.

Fig. 7. Generated images with and without diversity regularization trained on the Celeb-A dataset.
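The Inception Score used for Table 1 is IS = exp(E_x[KL(p(y|x) ‖ p(y))]), computed from the class posteriors of a pretrained classifier. A minimal sketch of the score itself (the classifier forward pass is omitted, and the function name is ours):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, C) array of softmax outputs p(y|x):
    IS = exp(mean_x KL(p(y|x) || p(y))), where p(y) is the marginal
    class distribution over all N samples."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)                                   # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

A generator whose samples are confidently classified and spread across many classes scores highest; identical or ambiguous samples drive the score toward 1, which is why the metric is sensitive to mode collapse.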

5 Conclusion

This paper proposed a method of stabilizing the training of GANs using diversity regularization, which penalizes both negatively and positively correlated features according to their differentiation and their relative cosine distances. It has been shown that diversity regularization can help alleviate a common failure mode in which the generator collapses to a single parameter configuration and outputs identical points. This is achieved by providing additional, stable diversity gradient information, alongside the adversarial gradient information, to update the features of both the generator and discriminator. The performance of the proposed regularization in terms of extracting diverse features and improving adversarial learning was compared, on the basis of image synthesis, with recent regularization techniques, namely batch normalization, layer normalization, weight normalization, weight clipping, WGAN-GP, and spectral normalization. It has also been shown on select examples that the extraction of diverse features improves the quality of image generation, especially when used in combination with spectral normalization. These findings were illustrated on the MNIST handwritten digits, CIFAR-10, STL-10, and Celeb-A datasets.


Acknowledgement. This work was supported partially by the NSF under grant 1641042.

References

1. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
2. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
3. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40
4. Ayinde, B.O., Zurada, J.M.: Deep learning of constrained autoencoders for enhanced understanding of data. IEEE Trans. Neural Netw. Learn. Syst. 29(9), 3969–3979 (2018)
5. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
6. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
7. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
8. Kurach, K., Lucic, M., Zhai, X., Michalski, M., Gelly, S.: The GAN landscape: losses, architectures, regularization, and normalization (2018)
9. Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: training generative neural samplers using variational divergence minimization. In: Advances in Neural Information Processing Systems, pp. 271–279 (2016)
10. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223 (2017)
11. Mao, X., Li, X., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2813–2821. IEEE (2017)
12. Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017)
13. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
14. Zhu, J.-Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251. IEEE (2017)
15. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976. IEEE (2017)
16. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: 33rd International Conference on Machine Learning, pp. 1060–1069 (2016)


17. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 105–114. IEEE (2017)
18. Azadi, S., Fisher, M., Kim, V., Wang, Z., Shechtman, E., Darrell, T.: Multi-content GAN for few-shot font style transfer (2018)
19. Grewal, K., Hjelm, R.D., Bengio, Y.: Variance regularizing adversarial learning. arXiv preprint arXiv:1707.00309 (2017)
20. Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J.: Unrolled generative adversarial networks. In: ICLR (2017)
21. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)
22. Roth, K., Lucchi, A., Nowozin, S., Hofmann, T.: Stabilizing training of generative adversarial networks through regularization. In: Advances in Neural Information Processing Systems, pp. 2018–2028 (2017)
23. Che, T., Li, Y., Jacob, A.P., Bengio, Y., Li, W.: Mode regularized generative adversarial networks. In: ICLR (2017)
24. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
25. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
26. Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Advances in Neural Information Processing Systems, pp. 1486–1494 (2015)
27. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: ICLR (2018)
28. Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093 (2016)
29. Liu, C., Zhang, Z., Wang, D.: Pruning deep neural networks by optimal brain damage. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
30. Xie, P., Deng, Y., Xing, E.: On the generalization error bounds of neural networks under diversity-inducing mutual angular regularization. arXiv preprint arXiv:1511.07110 (2015)
31. Rodríguez, P., Gonzàlez, J., Cucurull, G., Gonfaus, J.M., Roca, X.: Regularizing CNNs with locally constrained decorrelations. arXiv preprint arXiv:1611.01967 (2017)
32. Dundar, A., Jin, J., Culurciello, E.: Convolutional clustering for unsupervised learning. arXiv preprint arXiv:1511.06241 (2015)
33. Ayinde, B.O., Zurada, J.M.: Building efficient convnets using redundant feature pruning. arXiv preprint arXiv:1802.07653 (2018)
34. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
35. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
36. Ayinde, B.O., Zurada, J.M.: Clustering of receptive fields in autoencoders. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 1310–1317. IEEE (2016)

306

B. O. Ayinde et al.

37. Ayinde, B.O., Inanc, T., Zurada, J.M.: Regularizing deep neural networks by enhancing diversity in feature extraction. IEEE Trans. Neural Netw. Learn. Syst., 1–12 (2019) 38. Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In: ICLR (2014) 39. Ayinde, B.O., Zurada, J.M.: Nonredundant sparse feature extraction using autoencoders with receptive fields clustering. Neural Netw. 93, 99–109 (2017) 40. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009) 41. LeCun, Y.: The MNIST database of handwritten digits (1998). http://yann.lecun. com/exdb/mnist/ 42. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011) 43. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015) 44. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 45. Vallender, S.: Calculation of the Wasserstein distance between probability distributions on the line. Theory Probab. Appl. 18(4), 784–786 (1974) 46. Salimans, T., Kingma, D.P.: Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems, pp. 901–909 (2016)

Interpretability of a Deep Learning Model for Rodents Brain Semantic Segmentation

Leonardo Nogueira Matos1(B), Mariana Fontainhas Rodrigues2, Ricardo Magalhães2, Victor Alves3, and Paulo Novais3

1 Computer Science Department, Federal University of Sergipe, São Cristóvão, Brazil
[email protected]
2 Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho, Braga, Portugal
[email protected], [email protected]
3 Algoritmi Center, University of Minho, Braga, Portugal
{valves,pjon}@uminho.pt

Abstract. In recent years, as machine learning research has turned into real products and applications, some of which are critical, it has become clear that other model evaluation mechanisms are needed. The commonly used metrics such as accuracy or F-statistics are no longer sufficient in the deployment phase. This has fostered the emergence of methods for model interpretability. In this work, we discuss an approach to improving the prediction of a model by interpreting what has been learned and using that knowledge in a second phase. As a case study we use the semantic segmentation of rodent brain tissue in Magnetic Resonance Imaging. By analogy with what happens in the human visual system, the experiment performed provides a way to draw more in-depth conclusions about a scene by carefully observing what attracts more attention after a first glance, en passant.

Keywords: Deep Learning · Model Interpretability · Magnetic Resonance Imaging

1 Introduction

A few years ago, deep learning technology achieved the state of the art in several areas of Artificial Intelligence and sparked a growing interest from the academic community, especially after the article [3], with more than 34,000 citations, was published. The authors described the model used to win the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [10]. The AlexNet model, used to win the competition, achieved an error rate of 15.3% on a test set of more than 100 thousand images organized into 1000 categories. In the previous year, the winning model, the last one that did not use deep learning, achieved an error rate of 26.2%. Since 2015, models have achieved an error rate of less than 5% [1], exceeding human performance on this classification task, which according to [1] is 5.1%. Deep neural networks are also being used successfully in other domains, such as image captioning [14], visual question answering [11], music recommendation [6], language translation [13], speech recognition [2], and medical image analysis [5], among others. The list goes beyond the classification and segmentation of images, which were the main objectives of the ImageNet challenge. Despite the unprecedented breakthroughs in a variety of computer vision tasks, model understanding and interpretation is of utmost importance in some critical areas. In criminal justice, for example, a decision that affects the life of an individual cannot be accepted if it is based on a method without transparency. Medicine is another critical area that mainly relies on interpretable models. Medical staff also require that decisions made by computers provide some sort of explanation, so interpretability is not only desirable but also necessary. Nevertheless, a model with a deep and complex architecture is naturally hard to interpret. On the other hand, shallow models such as decision trees and linear regressors are easy to interpret, and therefore their decisions are easy to explain. The more complex the model, the more difficult it becomes to interpret. The medical area therefore needs mechanisms that make complex models interpretable in order to accept them as trustworthy. In this work we explore a method for the interpretability of deep neural networks, called Guided Backpropagation (GBP), which provides insights to interpret decision making by showing parts, artifacts or patterns in the input that were relevant to the model. As a case study we use a CNN trained to segment the different parts that form the brain of a rodent in MRI data.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 307–318, 2019. https://doi.org/10.1007/978-3-030-19823-7_25
We then take advantage of the knowledge learned, as identified by GBP, by feeding it back into the system: once we identify the relevant parts, we refine the prediction by discarding the irrelevant ones, which in this case correspond to artifacts that do not belong to the animal's brain. For a successful use of the gradient, we find that this signal can be modeled by a log-normal pdf. This is one of the contributions of the work, since the hypothesis on the distribution can be used in a wider field of scientific applications.

2 Machine Learning Interpretability

There is a vast literature addressing machine learning interpretability, especially in recent years [16], but, as discussed in [4], the term interpretability is ill-defined. Therefore, we assume a consensual, yet subjective, meaning for that concept, as well as some desired characteristics, as described in [8] and [7]. We assume interpretability to be a prerequisite for trust. Ribeiro et al. argue that humans would trust a machine learning model if its predictions could be explained and the explanations were faithful and intelligible. They further state that explaining a prediction is related to presenting textual or visual artifacts that provide a qualitative understanding of the relationship between the instance's components (e.g. patches in an image) and the model's prediction.


According to Pereira's line of reasoning, methods can be categorized as model dependent when, unlike deep neural networks, the model is restricted to an inherently easy-to-interpret family, or model agnostic, a more comprehensive and flexible case, when the model is treated as a "black box". Methods can also be global, when the ability to interpret is concentrated on how the model learns the data from a population, i.e., it does not concern the prediction of an individual sample in isolation. A popular approach is t-SNE (Van der Maaten & Hinton, 2008), a visualization of high-dimensional data that projects it onto the Cartesian plane while preserving the notion of proximity. Pereira used global interpretability to test the coherence between the proposed method and the a priori knowledge of specialists. They developed a method to segment brain lesions on MRI and were able to observe, through global interpretability, which MRI sequences were most appropriate for different tasks, such as normal tissue segmentation versus lesion segmentation. Another important desired characteristic covered by Pereira is explicability, which deals with the reasoning about a particular decision. It is based on the assumption that it is possible to explain the reason for a given activation, identifying artifacts and regions in the input responsible for this activation. Some authors describe methods that propagate the signal back from the end to the beginning throughout the model. Techniques that adhere to this concept are saliency maps.
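As an aside, such a t-SNE projection of learned features can be obtained with scikit-learn; the following minimal sketch runs on synthetic feature vectors (all data and names here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical example: project high-dimensional feature activations
# (one 64-dimensional vector per sample) onto the plane for inspection.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (40, 64)),
                      rng.normal(4, 1, (40, 64))])  # two synthetic clusters

embedded = TSNE(n_components=2, perplexity=20,
                init="random", random_state=0).fit_transform(features)
print(embedded.shape)  # (80, 2)
```

Plotting the two columns of `embedded` would show the cluster structure that global interpretability methods inspect.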

3 Methods

The methods discussed in this section are addressed in two blocks: the machine learning system and the interpretability system.

3.1 Machine Learning System

A Convolutional Neural Network architecture was used, specifically the U-Net architecture [9]. This architecture presents two major novelties: it begins with a spatial contraction phase, where a combination of convolutional and max-pooling layers is used to highlight the information on condensed feature maps; this is followed by an expansion phase using up-sampling layers, which are concatenated with the matching down-sampling layers. The architecture was trained on a data-set of rodent MRI data, aiming to classify different tissue classes within these images. A supervised training method was used, with the Dice coefficient as the loss function, which was used to evaluate the performance of the model at each step. The optimizer adjusted the weights of the model at each step using a stochastic gradient algorithm, with a learning rate of 0.0003, a decay of 1.5 × 10⁻⁶, a momentum of 0.9, and a batch size of 5. 80% of the data-set was used for training the model (further divided 80%–20% for training and validation) and 20% for testing the model.
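A Dice-based loss of the kind described above can be sketched as follows (a NumPy illustration under our own naming, not the authors' implementation; the `eps` smoothing term is an assumption):

```python
import numpy as np

def dice_loss(probs, target, eps=1e-6):
    """1 minus the mean Dice coefficient over classes.

    probs:  (C, H, W) softmax probabilities
    target: (C, H, W) one-hot ground-truth masks
    """
    inter = (probs * target).sum(axis=(1, 2))
    union = probs.sum(axis=(1, 2)) + target.sum(axis=(1, 2))
    dice = (2.0 * inter + eps) / (union + eps)  # per-class Dice
    return 1.0 - dice.mean()
```

A perfect prediction yields a loss near 0, and fully disjoint masks yield a loss near 1, which is what makes it a convenient training objective for segmentation.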

3.2 Interpretability System

We will analyze the model of Sect. 3.1 from the perspective of local interpretability, based on the work of Springenberg et al. [12], which is referred to in the literature as Guided Backpropagation.

Saliency Maps. In this work, we try to present visually the relevant aspects that allow interpreting the model's functioning. In this case, since we are going to analyze the prediction of an isolated pattern, not a set, and we try to explain what the network sees to perform a prediction, our approach is based on a saliency map, i.e., an attentional map made by the network. Our analysis is based on methods that propagate the signal retroactively from the last to the first layer, such as those proposed by Zeiler and Fergus [15], called Deconvnet, and by Springenberg et al. [12], called Guided Backpropagation. Zeiler and Fergus [15] were the winners of the ILSVRC2013 contest. By publishing the model used, they also published a method to visually present the knowledge the network had learned. This method, called Deconvnet, has become quite popular and was followed by others. The idea is to propagate the signal through the network in the reverse order, that is, from end to beginning, until reaching the input level. The signal is the activation of an isolated neuron, usually the highest activation in a layer, although in the original article they used the 10 highest activations. There are two important aspects to consider in this approach: (1) the weights the network learned during training are the same used to backpropagate the signal through the convolutional layers; (2) the method has a forward phase, in which a pattern is presented and the positions of the maxima in max-pooling layers, called switches, are saved, and a backward phase, in which the saliency map is properly composed. In [12], Springenberg et al. established a different method of identifying the artifacts in the input that most influence decision making.
They observed that by propagating the gradient in the reverse order and projecting it onto the input, it is also possible to identify the patterns that influenced the activation of an output unit. In addition, they realized that the backpropagation of the gradient could be guided only by the higher values, neglecting the values of smaller amplitude, which justifies the name of the method: Guided Backpropagation. The control of the propagation proposed by Springenberg et al. is imposed on the ReLU layers. In these layers, the signal is propagated only in places where the ReLU function allows the signal to pass in both the forward and the backward phase. To illustrate this concept, consider the example in Fig. 1, adapted from [12]. Let f be the signal propagated in the forward phase and b the signal propagated in the backward phase. The superscript index indicates the layer and the subscript index its position. Springenberg et al. have shown that in the deeper layers the reconstructed image is cleaner than that obtained by the Deconvnet method.
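The guided rule at a ReLU can be written in a few lines (a NumPy illustration under our own naming, not the authors' code):

```python
import numpy as np

def relu_forward(x):
    # Standard ReLU: keep positive activations, zero out the rest.
    return np.maximum(x, 0.0)

def guided_relu_backward(grad_out, x):
    """Guided Backpropagation rule at a ReLU.

    The gradient passes only where BOTH the forward input was positive
    (the plain-backprop mask) AND the incoming gradient is positive
    (the Deconvnet mask); all other positions are zeroed.
    """
    return grad_out * (x > 0) * (grad_out > 0)
```

Plain backpropagation would use only the `(x > 0)` mask and Deconvnet only the `(grad_out > 0)` mask; combining the two is what produces the cleaner reconstructions reported in [12].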


Fig. 1. Gradient guided backpropagation. Adapted from [12]

4 Experiments

4.1 Database

A data-set of 144 images was used to train the network. Data was acquired on an 11.7 T scanner using an SE-EPI diffusion-sensitive acquisition with TR = 5 s, TE = 20, and a voxel resolution of 0.375 × 0.375 × 0.5 mm³. Data was pre-processed using FSL, correcting movement and averaging B0-weighted images. The ground truth was generated using the SPM segment tool to create a semantic classification of the brain. All in-vivo experiments were done in the context of the FCT-ANR co-funded SIGMA project and were conducted in accordance with the recommendations of the European Community (2010/63/EU) and the French legislation (decree n° 2013-118) for the use and care of laboratory animals, and were approved by the "Comité d'Éthique en Expérimentation Animale du Commissariat à l'Énergie Atomique et aux Énergies Alternatives - Direction des Sciences du Vivant, Ile-de-France" (CETEA/CEA/DSV IdF, protocol number ID 13-023).

4.2 Rodents' Brain Semantic Tissue Segmentation

The architecture was inspired by U-Net. It is a fully convolutional network, which means that it does not contain fully connected layers and, since it performs multiclass segmentation, it contains a softmax activation function in the last layer. The network segments three different classes: white matter (WM), gray matter (GM) and cerebrospinal liquid. There is also an extra class that corresponds to the background, i.e., it refers to the negation of the others, since the softmax output must sum to one. Hence, the last layer of the network contains four binary channels (feature maps), each one related to one class (Fig. 2).

4.3 Background Removal

The model was trained with sectioned brain images. That is, the images were preprocessed before being presented to the network. Preprocessing consisted of the removal, by the application of a mask, of existing artifacts that were not part of rat brain tissue. The pixels inside the mask were preserved; the outer ones were zeroed. This step allowed the network to deal with clean images, without the presence of foreign objects, and to produce an output with high precision. In the test phase, however, it is not possible to use the same mask used in training, because during the image acquisition procedure small displacements can occur, which leads to translated images and prevents the mask from being fitted on the target.

Fig. 2. Input and output. (a) Input; (b) Background; (c) White matter; (d) Gray matter; (e) Cerebrospinal liquid

Model Explanation. When applying GBP, it is possible to see that the edges are the most important parts the network takes into account to make a decision. Even when segmenting gray matter, Fig. 2(d), which is the innermost area, GBP is higher at the edges. In Fig. 3 we can also observe that, if a translation of the input occurs, a corresponding translation will also occur at the output. This can be explained by the fact that convolution is translation equivariant, so convolutional layers can identify the presence of patterns in any place. Another important fact is that, since the background used in the training images is fairly uniform (they always have a dark background), the presence of any artifacts in the input can easily confuse the network, leading to a wrong segmentation (Fig. 4). The use of the mask, therefore, is imperative, because without it the model cannot produce the expected results.

Fig. 3. Saliency maps of translated images

Fig. 4. Noisy segmentation

Image Enhancement. To identify the location of the mask, we use a method to enhance the projected gradient image, or simply gradient image, as we will call it from now on. This is the central element in this decision-making. That is, the gradient image, which corresponds to what the network is seeing, identifies the site where the mask should be positioned. When dealing with images of the gradient, more specifically of its magnitude, we assume that its histograms are governed by a log-normal distribution. Based on this hypothesis, we can adjust the histogram distribution of the gradient to that of a Gaussian distribution with μ = 0 and σ = 1, which allows us to identify a threshold for distinguishing between foreground and background. The transformation of the histogram to fit it into a Gaussian curve can be done by applying a linear expansion, followed by a logarithmic compression (Fig. 5). The next step is to normalize the values by the linear transformation z = (x − μx)/σx. The final step consists in applying a threshold function, Eq. 1, to z, in replacement of the step function, as is usually done. This allows us to leave some intermediate values, which are not labeled as background or foreground, postponing to the next step, which takes into account the geometry of the shape, the decision on which region these pixels belong to. The maximum in Eq. 1 occurs when x = 2. It means that positive values far from zero are mapped to 1; nevertheless, if they are too high, i.e., greater than 2, they decrease in size, which helps to reduce the influence of artifacts, since

they also have pixels associated with high gradients. In this work, by making the denominator of the exponent in (1) equal to 0.5, we narrow the interval where the maximum activations are located (Fig. 6).

f(x) = e^(−((x − 2)/0.5)²)    (1)
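The enhancement pipeline (linear expansion, logarithmic compression, z-normalization, and the soft threshold of Eq. 1) can be sketched as follows; the function name, parameter names and the exact expansion constants are our own assumptions, not the paper's code:

```python
import numpy as np

def enhance_gradient(g, center=2.0, width=0.5, eps=1e-8):
    """Sketch: Gaussianize a (log-normally distributed) gradient
    magnitude, normalize it, and apply the soft threshold of Eq. (1)."""
    m = np.abs(g)
    # Linear expansion to [0, 1] followed by logarithmic compression.
    x = np.log(1.0 + (m - m.min()) / (m.max() - m.min() + eps))
    # z-normalization: z = (x - mu_x) / sigma_x.
    z = (x - x.mean()) / (x.std() + eps)
    # Soft threshold f(z) = exp(-((z - 2) / 0.5)^2), peaking at z = 2.
    return np.exp(-(((z - center) / width) ** 2))
```

The output lies in [0, 1] and is largest for pixels whose normalized gradient falls near the chosen center, which is what leaves intermediate values undecided for the geometric step that follows.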

Fig. 5. Histogram transform. (a) Histogram of gray level magnitude; (b) Normal histogram; (c) Thresholding using sigmoid function; (d) Final histogram

Fig. 6. Image enhancement


Image Correlation. The second step in mask positioning is to obtain its location from the threshold applied to the gradient images. This is done by calculating the 2D cross-correlation between the enhanced gradient image and the mask. Cross-correlation involving binary, or roughly binary, images has maxima in the places where there is the greatest alignment between the shapes. Because gradient images generally have shapes and contours that are repeated across different acquisitions of magnetic resonance imaging, the use of cross-correlation can be successful in this type of application. To mitigate the effect caused by the presence of artifacts, the mask is preliminarily multiplied by the gradient image of the model, Fig. 7(a), which makes the resulting image more sparse, reducing the cross-correlation with parts belonging to the artifacts without changing the value of the correlation with the parts of the brain tissue, Fig. 7(c).

Fig. 7. Shape alignment. (a) Shape mask; (b) Gradient image; (c) Cross correlation 2D; (d) Input; (e) Input after background removal
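The cross-correlation step described above can be sketched with NumPy's FFT routines (an illustration under our own naming, not the authors' code; the argmax of the correlation map gives the best alignment offset):

```python
import numpy as np

def locate_mask(gradient_img, mask):
    """Locate the mask via circular 2D cross-correlation computed
    with the FFT; the peak of the correlation map is the offset at
    which the two shapes align best."""
    G = np.fft.rfft2(gradient_img)
    M = np.fft.rfft2(mask, s=gradient_img.shape)
    # corr[d] = sum_x gradient_img[x] * mask[x - d] (circularly).
    corr = np.fft.irfft2(G * np.conj(M), s=gradient_img.shape)
    return np.unravel_index(np.argmax(corr), corr.shape)
```

For example, if the mask pattern sits at the image origin and the corresponding shape in the gradient image is shifted by (10, 5) pixels, the returned offset is (10, 5).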

4.4 Results

We discuss the background removal on rodent magnetic resonance images by analysing the semantic segmentation of the tissues WM, GM and cerebrospinal fluid, as well as the segmentation of the whole brain, over 40 MRI slices of one individual. We compare the results of background removal based on GBP against a ground truth obtained by manual segmentation. The metrics accuracy, specificity, sensitivity and Dice similarity coefficient (DSC), commonly adopted in segmentation analysis, are presented in Tables 1 and 2.

Table 1. Network performance after background removal (with GBP)

Class                 Accuracy  Specificity  Sensitivity  DSC
Brain segmentation    98.52     99.17        99.07        99.12
White matter          98.38     87.9         84.75        86.05
Gray matter           98.41     81.83        73.1         76.76
Cerebrospinal liquid  98.41     53.61        68.55        60.15

Table 2. Network performance after background removal (manually)

Class                 Accuracy  Specificity  Sensitivity  DSC
White matter          98.92     92.67        89.65        91.05
Gray matter           98.96     89.45        81.02        84.65
Cerebrospinal liquid  99.12     71.79        84.59        77.61

It can be observed that the segmentation of the brain was performed very efficiently, as presented in the first row of Table 1. However, the semantic segmentation reached lower performance, especially for the cerebrospinal fluid. This is mainly due to the fact that the presence of artifacts external to the animal's brain harms the perfect alignment of the mask. As a result, the metrics associated with small parts, such as those occupied by the cerebrospinal fluid, become impaired.
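For reference, the four metrics reported in Tables 1 and 2 can all be computed from a per-class binary confusion matrix; a minimal sketch (function name is ours):

```python
import numpy as np

def seg_metrics(pred, truth):
    """Accuracy, specificity, sensitivity and DSC for one binary class.

    pred, truth: boolean arrays marking the class' pixels.
    """
    tp = np.sum(pred & truth)    # class pixels correctly found
    tn = np.sum(~pred & ~truth)  # non-class pixels correctly rejected
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "dsc": 2 * tp / (2 * tp + fp + fn),
    }
```

Note that for a small structure such as the cerebrospinal fluid, accuracy stays high even when DSC drops, since the true negatives dominate the image; this is why the tables report all four metrics.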

Fig. 8. Precision on segmenting cerebrospinal liquid. (a) Input image; (b) Background manually removed; (c) Background removal with GBP; (d) Semantic segmentation ground truth; (e) CNN semantic segmentation based on GBP background removal


Another important aspect that should be considered is the fact that removing the background by applying a mask coming from another animal, although well positioned, may leave residues or remove parts of the foreground, since the brain sizes are not necessarily the same. In Fig. 8 we present a particular case (slice 25) where precision is high for the brain area and low for the cerebrospinal fluid. Despite the tight alignment between the manual and GBP-based background removal, small displacements in the brain area severely affect the metrics for the cerebrospinal liquid, due to the fact that it occupies a small portion of the image.

5 Conclusion

We discuss in this work an approach for the interpretability of CNN models. We show that the prediction of a model can be improved using the knowledge acquired by the model itself. In a case study involving semantic segmentation of rodent brain tissues, we have shown that the important parts of the input, identified by the GBP method, can be used to enhance the prediction in a second phase. More specifically, we proposed to use what the network learns to find the place to put a mask for background removal. In analogy to what occurs with the human visual system, the experiment carried out presents a way to draw more accurate conclusions about a scene by carefully observing what draws more attention after a first interpretation, en passant. We also show a way to enhance the output of the GBP method, exploring the hypothesis that the gray-level distribution of the images is governed by a log-normal pdf. This is an interesting finding, as it may help support other works exploring the nature of the probability distribution of the gradient-propagated signal. As future work, we intend to explore more precise techniques for performing segmentation of the brain, i.e., the removal of the background, using the information provided by the GBP method. This can be done with the help of another connectionist system replacing the 2D cross-correlation approach explored in this article.

Acknowledgments. This work has been supported by FCT - Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019. We gratefully acknowledge the support of the NVIDIA Corporation with their donation of a Titan V board used in this research.

References

1. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
2. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)


3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
4. Lipton, Z.C.: The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016)
5. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
6. Van den Oord, A., Dieleman, S., Schrauwen, B.: Deep content-based music recommendation. In: Advances in Neural Information Processing Systems, pp. 2643–2651 (2013)
7. Pereira, S., et al.: Enhancing interpretability of automatically extracted machine learning features: application to a RBM-random forest system on brain lesion segmentation. Med. Image Anal. 44, 228–244 (2018)
8. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM (2016)
9. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
10. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
11. Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-CAM: why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016)
12. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
13. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
14. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
15. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
16. Zhang, Q.S., Zhu, S.C.: Visual interpretability for deep learning: a survey. Front. Inform. Technol. Electron. Eng. 19(1), 27–39 (2018)

Learning and Detecting Stuttering Disorders

Fabio Fassetti1(B), Ilaria Fassetti2, and Simona Nisticò1

1 DIMES Department, University of Calabria, Rende, Italy
{f.fassetti,s.nistico}@dimes.unical.it
2 Therapeia Rehabilitation Center, Rende, Italy
[email protected]

Abstract. Stuttering is a widespread speech disorder affecting about 5% of the population and 2.5% of children under the age of 5. Much work in the literature studies its causes, mechanisms and epidemiology, and much work is devoted to illustrating treatments, prognosis and how to diagnose stuttering. Notably, a stuttering evaluation requires the skills of a multi-disciplinary team. An expert speech-language therapist conducts a precise evaluation with a series of tests, observations, and interviews. During an evaluation, a speech-language therapist perceives, records and transcribes the number and types of speech disfluencies that a person produces in different situations. Stuttering is very variable in the number of repeated syllables/words and in the secondary aspects that alter the clinical picture. This work aims to help in the difficult task of evaluating stuttering and recognizing the occurrences of disfluency episodes such as repetitions and prolongations of sounds, syllables, words or phrases, silent pauses, and hesitations or blocks before speech. In particular, we propose a deep-learning based approach able to automatically detect disfluent production points in speech, helping in the early classification of the problem by providing the number of disfluencies and the time intervals where they occur. A deep learner is built to preliminarily evaluate audio fragments. However, the scenario at hand contains some peculiarities that make the detection challenging. Indeed, (i) fragments that are too short lead to ineffective classification, since a too-short audio fragment is not able to capture the stuttering episode; and (ii) fragments that are too long lead to ineffective classification, since a stuttering episode can have a very small duration and, then, the largely fluent speech contained in the fragment masks the disfluency.
Thus, we design an ad-hoc segment classifier that, exploiting the output of a deep learner working on fragments that are not too short, classifies each small segment composing an audio fragment by estimating its probability of containing a disfluency.

Keywords: Deep learning · Audio classification · Stuttering

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. J. MacIntyre et al. (Eds.): AIAI 2019, IFIP AICT 559, pp. 319–330, 2019. https://doi.org/10.1007/978-3-030-19823-7_26

1 Introduction

Stuttering is a communication disorder in which the smooth flow of speech is disrupted. It begins during childhood and, in some cases, lasts throughout life. This disfluency may interfere with the ability to be clear and understood. The effort of learning to speak and the normal stress of developmental growth can trigger in the child language manifestations characterized by brief repetitions, hesitations and prolongations of sounds that characterize both early stuttering and normal disfluency. About 5% of the child population experience a period of stuttering that lasts 6 months or more. Many of the children who start stuttering will have a remission of the disorder in late childhood. Most of the risk for stuttering onset is over by age 5, earlier than previously thought, with a male-to-female ratio near onset smaller than what had been thought [10]. There is strong clinical evidence that more than 60% of treated stuttering children have a stuttering relative in the family. Children who start stuttering before 42 months have a greater chance of overcoming and solving the problem. Physiological and normal developmental disfluencies are difficult to differentiate from the first signs of effective stuttering; however, if the subject stutters for more than 6 months, the problem is unlikely to resolve spontaneously. Signs of chronicity in older children (e.g., 6- or 7-year-olds) who have stuttered for two years may not be quite the same as those in 2- to 4-year-olds who have short stuttering histories [11]. It is important to remember that stuttering is not caused by nervousness, nor is it related to personality or intellectual capabilities. Despite not demonstrating more severe stuttering, socially anxious adults who stutter demonstrate more psychological difficulties and have a more negative view of their speech [4]. Parents have not done anything that may have caused their child's stuttering, even if they feel responsible in some way!
There also exists an idiopathic stuttering, possibly caused by a deficiency in motor inhibition in children who stutter [6]. Stuttering is characterized by an abnormally high number of disfluencies, abnormally long disfluencies, and physical tension that is often evident during speech [8]. Some signs of stuttering are [9]:

– repetitions of whole words (e.g., "We, we, we went")
– repetitions of parts of words (e.g., "Be-be-because")
– prolongation or stretching of sounds (e.g., "Ssssssee")
– silent blocks (getting stuck on a word or tense hesitations)

The child with severe stuttering often shows physical symptoms of stress, especially increased muscle tension, and tries to hide his stuttering, avoiding speaking and exposure to linguistic situations. Although severe stuttering is more common in older children, it may nevertheless start at any age between 1.5 and 7 years. Such a child can exhibit behaviors associated with stuttering: blinking, looking away, or muscle tension in the buccal region or in other parts of the face. Moreover, part of the tension and of the impact can be perceived in a strong increase of the vocal tone or of the intonation (increase of the vocal frequency) during the repetitions or during the prolongations. The subject with severe stuttering can resort to extraverbal sounds, interjections such as "um, uh, well...", at the beginning of a word in which he expects to stutter. Moderate to severe stuttering, especially, has a negative impact on overall quality of life [5].

This work aims at contributing to stuttering recognition by helping therapists and patients in detecting episodes of disfluency. Technically speaking, we propose a system that, fed with an input audio, outputs the time intervals related to stuttering phenomena. The system consists of several phases, the two main ones being devoted to classification.

The rest of the paper is organized as follows. Section 2 presents the basic notions exploited in this work; Sect. 3 introduces the architecture of our technique; Sect. 4 details the proposed technique and the main phases it requires; Sect. 5 describes the experimental campaign we performed to validate our technique; Sect. 6 draws the conclusions.

2 Preliminaries

In this section we report some preliminary notions. The input audio file is quite clean, since we assume that the therapist acquires the recording of the patient in a controlled environment. From the input wav file we obtain feature vectors by considering spectrograms [3] and Mel frequency cepstral coefficients [1].

3 The Proposed Architecture

The proposed architecture consists of several phases. For the sake of clarity, we introduce them next to provide a general overview; each phase is detailed in the following section. The main flow is reported in Fig. 1.

Fig. 1. Methodology flow

Cleaning Phase. This phase consists of cleaning the input audio file. Even if, as stated before, we assume that the input audio file is quite clean, without background sounds that could compromise quality, this phase is needed to clean the input audio file from intervals during which the patient does not speak. Also, during this phase the input file is normalized in terms of volume and sampling frequency. See Sect. 4.1 for details.

Audio Fragmentation Phase. This phase consists of splitting the input file into fragments of length flen, overlapped by ε seconds, as detailed in Sect. 4.2.


Feature Extraction Phase. This phase consists of extracting features from raw audio fragments and represents a critical part of the architecture. Each fragment is transformed into a numeric vector, as detailed in Sect. 4.3.

Fragment Classification Phase. This phase represents, together with the succeeding classification phase, the core of our architecture. Here, a trained deep learner assigns each fragment to the fluent or disfluent class with a certain probability. Details are reported in Sect. 4.4.

Segment Classification Phase. This phase is, together with the previous learning phase, the core of the architecture. Here, a probabilistic model allows the classification of each segment composing a fragment, exploiting the overlapping of the fragments and taking as input the probabilities of belonging to the fluent or disfluent class computed by the previous phase. Details are provided in Sect. 4.5.

4 Detection Technique

In this section we describe the proposed technique, providing details about all the phases introduced above.

4.1 Noise Removal

The input audio file is assumed to be free of background sounds that could compromise audio quality. However, there can be several time intervals during which the patient does not speak. These intervals cannot be completely removed since, in many cases, silence is symptomatic of a disfluency. Thus, in this phase the system recognizes these intervals and shortens each of them so that its duration remains large enough to guarantee that at least one fragment captures the segment just before the noise interval and one the segment just after it. Note that this operation is also performed on the audio files employed to train the system. To perform it, we trained a learner able to discriminate between spoken and non-spoken fragments. This learner is simple and highly accurate, since the classes are well separated.
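For illustration, the spoken/non-spoken discrimination and the shrinking of silent intervals can be sketched with a simple energy threshold (a minimal sketch, not the trained learner the system actually uses; the frame length, the RMS threshold, and the amount of silence kept on each side of a pause are all assumptions):

```python
import math

def shrink_silences(samples, rate, frame_ms=20, threshold=0.02, keep_s=1.0):
    """Flag frame_ms-long frames as spoken/non-spoken by RMS energy, then
    shorten every long silent run so keep_s seconds survive on each side."""
    flen = max(1, int(rate * frame_ms / 1000))
    frames = [samples[i:i + flen] for i in range(0, len(samples), flen)]
    spoken = [math.sqrt(sum(x * x for x in f) / len(f)) >= threshold
              for f in frames]
    keep = max(1, int(keep_s * 1000 / frame_ms))  # frames kept per side
    out, run = [], []                             # run = current silent frames
    for frame, is_spoken in zip(frames, spoken):
        if is_spoken:
            if len(run) > 2 * keep:               # shorten the silent run
                run = run[:keep] + run[-keep:]
            for r in run:
                out.extend(r)
            run = []
            out.extend(frame)
        else:
            run.append(frame)
    for r in run[:keep]:                          # trailing silence, if any
        out.extend(r)
    return out
```

With these defaults, a 5-second pause between two utterances is shortened to 2 seconds, while pauses shorter than 2·keep_s seconds are left untouched, reflecting the requirement that silence symptomatic of a disfluency must survive.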

4.2 Audio File Preparation

With the aim of obtaining classifiable fragments of the input audio file, we resort to a fixed-size sliding-window approach. Indeed, in order for the learner to work correctly, fragments cannot be too long, otherwise stuttering phenomena would be obfuscated by fluent speech, and they cannot be too short, since we need fragments of at least some seconds, as can be intuitively understood. Indeed, a human too, to recognize a stuttered phoneme, needs to hear at least the interval covering the stuttering phenomenon, which has a great variability but is, in general, wider than the duration of a segment. On the other hand, if there were no overlapping, the stuttering phenomenon could be split across two adjacent intervals and, then, not recognized. In other words, we need that at least one fragment contains the whole stuttering phenomenon if this is shorter than the duration of a fragment, and that a fragment and its adjacent ones cover the whole stuttering phenomenon otherwise.

Let S be the input audio stream and let d be its duration in seconds. Chosen a fragment length in seconds, denoted as flen, and an overlap size in seconds, denoted as ε, and letting

  n = flen / ε        be the number of segments per fragment,
  ns = d / ε          be the total number of segments,
  nf = ns − (n − 1)   be the total number of fragments,

then S is sectioned into nf fragments of equal size flen, overlapped by ε seconds. Each fragment is composed of n segments and, thus, S can be considered as partitioned into ns segments; each segment, due to the overlap, belongs to n distinct fragments, as illustrated in Fig. 2.


Fig. 2. Example of audio fragmentation with n = 5.
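The fragmentation above can be sketched directly from the three formulas (a minimal sketch; the rounding of the ratios and the sample parameter values in the usage below are assumptions):

```python
def fragment(samples, rate, flen_s, eps_s):
    """Split an audio stream into fragments of flen_s seconds, with
    consecutive fragments shifted by eps_s seconds (the segment size)."""
    d = len(samples) / rate        # duration d in seconds
    n = round(flen_s / eps_s)      # segments per fragment: n = flen / eps
    ns = round(d / eps_s)          # total segments:        ns = d / eps
    nf = ns - (n - 1)              # total fragments:       nf = ns - (n - 1)
    seg = round(eps_s * rate)      # samples per segment
    fragments = [samples[i * seg:(i + n) * seg] for i in range(nf)]
    return fragments, n, ns, nf
```

For a 10-second stream with flen = 1 s and ε = 0.25 s this yields n = 4, ns = 40 and nf = 37, with each segment shared by up to n consecutive fragments, as in Fig. 2.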

4.3 Feature Extraction

From each audio fragment we need to build a numeric vector representing audio features. In particular, we compute spectrograms [3] and Mel frequency cepstral coefficients [1].
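For illustration, a magnitude spectrogram can be obtained with a plain short-time DFT (a self-contained sketch; a real pipeline would use an optimized FFT and add the mel filter bank and cepstral step for the MFCCs, and the Hann window and frame sizes here are assumptions):

```python
import cmath
import math

def spectrogram(samples, frame=64, hop=32):
    """Short-time DFT magnitudes: one row per frame, one column per
    frequency bin up to the Nyquist bin (frame // 2)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame - 1))  # Hann
              for i in range(frame)]
    rows = []
    for start in range(0, len(samples) - frame + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame)]
        # naive DFT, keeping only the non-redundant half of the spectrum
        rows.append([abs(sum(chunk[t] * cmath.exp(-2j * math.pi * k * t / frame)
                             for t in range(frame)))
                     for k in range(frame // 2 + 1)])
    return rows
```

A pure tone whose frequency falls on bin k produces a ridge at column k across all rows, which is the kind of time-frequency structure the classifier consumes.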

4.4 Fragment Classification

As for fragment classification, we adopt a deep-learning-based classifier.


The learning phase through a deep learner provides each fragment fi, i ∈ [0 . . . nf] (see Fig. 2), with a classification πi^ℓ stating the probability that the fragment fi belongs to the class labeled ℓ, with ℓ ∈ {fluent, disfluent}.

4.5 Segment Classification

The classification described in the previous section provides, for each fragment fi, i ∈ [0 . . . nf], the probability πi^ℓ of belonging to a certain class labeled ℓ. Nevertheless, each segment si, i ∈ [0 . . . ns] (see Fig. 2), has to be classified in order to detect the time intervals of fluent speaking and the time intervals of disfluent speaking. Consider, for example, the scenario where each fragment consists of 5 segments and a time interval I with a voice of class c2 has a duration of 3 segments. There are, then, 7 overlapped fragments covering I, as illustrated in Fig. 3.


Fig. 3. Example of challenging segment classification.

Obviously, no fragment fi has a probability πi^c2 close to 1, since no fragment fully contains the c2 (female) voice, and the aim of the segment classification phase is to correctly individuate the 3 segments where the class label changes.

Let fi, with i ∈ [1 . . . (d − flen)/ε + 1], denote the fragment starting from the segment si. Then si belongs to the set of fragments {f_max(1, i−n+1), . . . , fi}, which are employed to evaluate the trend of the classification when si appears, and does not belong to the succeeding set of fragments {f_min(i+1, ns), . . . , f_min(i+n−1, ns)}, containing some segments of fragment fi. Thus, these fragments share with fi some segments, excepting si, and are, roughly speaking, employed to evaluate the trend of the classification when si disappears. As described in Sect. 4.4, the learning phase through a deep learner provides the fragment fi with a classification πi^ℓ stating the probability that the fragment fi belongs to the class labeled ℓ.
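The two index sets above can be transcribed directly (a minimal sketch; fragment and segment indices start at 1 as in the text, and the clipping follows the max/min expressions verbatim):

```python
def covering_fragments(i, n, ns):
    """For segment s_i: the fragments that contain it (used when s_i
    appears) and the succeeding ones that no longer do (used when
    s_i disappears)."""
    containing = list(range(max(1, i - n + 1), i + 1))
    succeeding = list(range(min(i + 1, ns), min(i + n - 1, ns) + 1))
    return containing, succeeding
```

For n = 5, segment s10 is contained in fragments f6 through f10, and its disappearance is evaluated against fragments f11 through f14; near the stream boundaries the sets shrink accordingly.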

Learning and Detecting Stuttering Disorders


In order to exploit the overlap to improve segment classification, we aim at evaluating the contribution that the i-th segment gives to the classification.

Case 1: evaluating the effect of the i-th segment when it appears. The segment si first appears as the n-th segment of the fragment f_(i−n+1). Then, consider fragment f_(i−n) as the reference fragment (si has not yet been seen) and let

  ϕ_i^ℓ(j) = π_(i−n+j)^ℓ / π_(i−n)^ℓ,   ∀j ∈ [1, n′), for each label ℓ,

where n′ ≤ n is a parameter representing how many contributions are to be taken into account. In this case, we denote π_ref^ℓ = π_(i−n)^ℓ and π_curr^ℓ = π_(i−n+j)^ℓ.

Case 2: evaluating the effect of the i-th segment when it disappears. The segment si first disappears in the fragment f_(i+1). Then, consider fragment f_(i+j) as the reference fragment (si is not seen) and let

  ϕ_i^ℓ(j) = π_i^ℓ / π_(i+j)^ℓ,   ∀j ∈ [1, n′), for each label ℓ,

where n′ ≤ n is a parameter representing how many contributions are to be taken into account. In this case, we denote π_curr^ℓ = π_i^ℓ and π_ref^ℓ = π_(i+j)^ℓ.

We first compute the probability of observing a ratio smaller than ϕ_i^ℓ(j):

  F(ϕ_i^ℓ(j), λ) = 1 − e^(−λ·ϕ_i^ℓ(j)),   with λ = ε·j / flen.

We use an exponential distribution since, when fi belongs to a class labeled ℓ, π_i^ℓ is high; we then want to capture that the ratio is lowered with high probability and raised further with low probability. Also, the dependence of λ on j allows us to alleviate the exponential trend. The idea is that the higher j is, the higher the change in the ratio can be; in other words, it is quite improbable that a single segment can drastically change the ratio, whereas, when j is high, more segments are taken into account and then the change can be high. How much this value exceeds the no-change case, namely ϕ_i^ℓ(j) = 1, is employed as a weight for π_curr^ℓ:

  g_i^ℓ(j) = F(ϕ_i^ℓ(j), λ) / F(1, λ) · π_curr^ℓ.

In order to normalize this value into a vote ranging from 0 to 1, we apply an exponential kernel function to it:

  h_i^ℓ(j) = 1 − e^(−λ·g_i^ℓ(j)),   with λ = −log(1 − π_curr^ℓ) / π_curr^ℓ,

so that, if ϕ_i^ℓ(j) = 1, then h_i^ℓ(j) coincides with π_curr^ℓ. To combine the votes h_i^ℓ(j), we compute the weighted arithmetic mean, where the weights are related to the


probability that observing π_curr^ℓ given π_ref^ℓ is due to chance. This probability follows a gamma distribution with parameters k and θ, and

  f^Γ(x) = 1/(Γ(k)·θ^k) · x^(k−1) · e^(−x/θ),    F^Γ(x) = 1/Γ(k) · γ(k, x/θ)

are, respectively, the probability density function and the cumulative distribution function, where γ is the lower incomplete gamma function. Since the gamma distribution is asymmetric and since both π_ref^ℓ ∈ [0, 1] and π_curr^ℓ ∈ [0, 1], in comparing π_curr^ℓ and π_ref^ℓ we adopt the following strategy: if π_ref^ℓ is smaller than 0.5, then we compute the probability of observing a value more extreme than π_curr^ℓ given π_ref^ℓ; otherwise, if π_ref^ℓ is greater than 0.5, then we compute the probability of observing a value more extreme than (1 − π_curr^ℓ) given (1 − π_ref^ℓ).

To compute F^Γ, we evaluate k and θ by fixing the value x (say t) maximizing the gamma probability density function and the value x such that the probability of observing x is equal to the probability of observing a value distant 4 standard deviations from the mean. In particular, we determine k and θ with the following constraints: (i) the maximum is when π_curr^ℓ = π_ref^ℓ, hence t = π_ref^ℓ; and (ii), since π_curr^ℓ is at most 1, we want 1 to be at 4 standard deviations from the mean. The following theorem accounts for the computation of these parameters.

Theorem 1 (Parameters k and θ of F^Γ). The values of the parameters k and θ of F^Γ such that the maximum is at a generic point t < 1 and 1 is at 4 standard deviations from the mean are

  k = ( (2 + √(4 + w·(w − 1))) / (w − 1) )²,   with w = −W−1(−(1/t)·e^(−1/t))    (1)

where W is the Lambert function, and

  θ = t / (k − 1).    (2)
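As a concrete check of Theorem 1, k and θ can be computed numerically (a minimal pure-Python sketch; approximating the W−1 branch by Newton iterations is an implementation choice of this sketch, not something prescribed by the paper):

```python
import math

def lambert_w_minus1(x, iters=100):
    """Newton iterations for the W_-1 branch of w*e^w = x, x in (-1/e, 0)."""
    w = math.log(-x) - math.log(-math.log(-x))  # asymptotic initial guess
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / ((w + 1.0) * ew)
    return w

def gamma_parameters(t):
    """k and theta of Theorem 1: pdf maximum at t, and 1 lying
    4 standard deviations past the mean."""
    w = -lambert_w_minus1(-(1.0 / t) * math.exp(-1.0 / t))
    k = ((2.0 + math.sqrt(4.0 + w * (w - 1.0))) / (w - 1.0)) ** 2  # Eq. (1)
    theta = t / (k - 1.0)                                          # Eq. (2)
    return k, theta
```

For t = 0.3 this yields k ≈ 5.42 and θ ≈ 0.068, for which the mode (k − 1)θ equals t and kθ + 4θ√k equals 1, so both constraints of the theorem hold.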

Proof. The first constraint can be imposed by computing the first derivative of f^Γ(x) and evaluating it at t:

  ∂f^Γ/∂x = 1/(Γ(k)·θ^k) · (k − 1) · x^(k−2) · e^(−x/θ) + 1/(Γ(k)·θ^k) · x^(k−1) · (−1/θ) · e^(−x/θ),

which is equal to 0 at t when

  1/(Γ(k)·θ^k) · (k − 1) · t^(k−2) · e^(−t/θ) + 1/(Γ(k)·θ^k) · t^(k−1) · (−1/θ) · e^(−t/θ) = 0
  ⇒ (k − 1) − t/θ = 0  ⇒  θ = t/(k − 1).




As for parameter k, since the mean is kθ and the standard deviation is θ√k, we have to solve the following equation:

  f^Γ(1) = f^Γ(kθ + 4θ√k)

and, thus,

  1/(Γ(k)·θ^k) · e^(−1/θ) = 1/(Γ(k)·θ^k) · (kθ + 4θ√k)^(k−1) · e^(−(kθ + 4θ√k)/θ)
  ⇒ e^(−1/θ) = (kθ + 4θ√k)^(k−1) · e^(−(kθ + 4θ√k)/θ)
  ⇒ −1/θ = (k − 1)·log(kθ + 4θ√k) − (kθ + 4θ√k)/θ
  ⇒ −1 = θ·(k − 1)·log(kθ + 4θ√k) − (kθ + 4θ√k).



By substituting Eq. (2) we obtain

  −1 = t·log((kt + 4t√k)/(k − 1)) − (kt + 4t√k)/(k − 1);

by setting

  w = (k + 4√k)/(k − 1)    (3)

we obtain

  w − 1/t = log(wt)  ⇒  e^(w − 1/t) = wt,

which can be solved by exploiting the Lambert function W, thus obtaining

  w = −W−1(−(1/t)·e^(−1/t)).

From Eq. (3), we have

  k = ( (2 + √(4 + w·(w − 1))) / (w − 1) )²

and, then, the theorem is proved.

Once F^Γ is fully determined, the weight of each vote h_i^ℓ(j) is 1 − F^Γ(h_i^ℓ(j)) if π_curr^ℓ