Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1 [1st ed. 2020] 978-3-030-29515-8, 978-3-030-29516-5

The book presents a remarkable collection of chapters covering a wide range of topics in the areas of intelligent systems and their applications.


English Pages XIV, 1302 [1316] Year 2020


Table of contents:
Front Matter ....Pages i-xiv
Can Human Evidence Accumulation Be Modeled Using the Set-Theoretic Nature of Dempster-Shafer Theory? (Samantha Lang, Eric Freedman, Michael E. Farmer)....Pages 1-9
Emotional Speech Recognition Using SMILE Features and Random Forest Tree (Ammar Mohsin Butt, Yusra Khalid Bhatti, Fawad Hussain)....Pages 10-17
A Switching Approach that Improves Prediction Accuracy for Long Tail Recommendations (Gharbi Alshammari, Jose L. Jorro-Aragoneses, Stelios Kapetanakis, Nikolaos Polatidis, Miltos Petridis)....Pages 18-28
A Novel Algorithm for Dynamic Student Profile Adaptation Based on Learning Styles (Shaimaa M. Nafea, François Siewe, Ying He)....Pages 29-51
Exploring Transfer Learning for Low Resource Emotional TTS (Noé Tits, Kevin El Haddad, Thierry Dutoit)....Pages 52-60
Emotional Speech Datasets for English Speech Synthesis Purpose: A Review (Noé Tits, Kevin El Haddad, Thierry Dutoit)....Pages 61-66
Feature Selection for Hidden Markov Models with Discrete Features (Stephen Adams, Peter A. Beling)....Pages 67-82
Location Tracking and Location Prediction Techniques for Smart Traveler Apps (Mohamad Amar Irsyad Mohd Aminuddin, Mohd Azam Osman, Wan Mohd Nazmee Wan Zainon, Abdullah Zawawi Talib)....Pages 83-96
Implementation Aspects of Tensor Product Variable Binding in Connectionist Systems (Alexander Demidovskij)....Pages 97-110
Timing Attacks on Machine Learning: State of the Art (Mazaher Kianpour, Shao-Fang Wen)....Pages 111-125
HORUS: An Emotion Recognition Tool (André Teixeira, Manuel Rodrigues, Davide Carneiro, Paulo Novais)....Pages 126-140
Frustrated Equilibrium of Asymmetric Coordinating Dynamics in a Marketing Game (Matthew G. Reyes)....Pages 141-159
Cloud Capacity Planning Based on Simulation and Genetic Algorithms (Riyadh A. K. Mehdi, Mirna Nachouki)....Pages 160-174
Simulation of Artificially Generated Intelligence from an Object Oriented Perspective (Bálint Fazekas, Attila Kiss)....Pages 175-193
Robustness of Keystroke Dynamics Identification Algorithms Against Brain-Wave Variations Associated with Emotional Variations (Enrique P. Calot, Jorge S. Ierache, Waldo Hasperué)....Pages 194-211
Explanatory AI for Pertinent Communication in Autonomic Systems (Marius Pol, Jean-Louis Dessalles, Ada Diaconescu)....Pages 212-227
Genetic Algorithm Modeling for Photocatalytic Elimination of Impurity in Wastewater (Raheleh Jafari, Sina Razvarz, Wen Yu, Alexander Gegov, Morten Goodwin, Mo Adda)....Pages 228-236
Bitcoin: A Total Turing Machine (Craig S. Wright)....Pages 237-252
Agent-Based Turing-Complete Transactions Integrating Feedback Within a Blockchain System (Craig S. Wright)....Pages 253-265
A High Resolution and Low Jitter 5-Bit Flash TDC Architecture for High Speed Intelligent Systems (Jagdeep Kaur Sahani, Anil Singh, Alpana Agarwal)....Pages 266-275
A Sensitivity Analysis for Harmony Search with Multi-Parent Crossover Algorithm (Iyad Abu Doush, Eugene Santos)....Pages 276-284
Decentralized Autonomous Corporations (Craig S. Wright)....Pages 285-298
A Proof of Turing Completeness in Bitcoin Script (Craig S. Wright)....Pages 299-313
Turbines Allocation Optimization in Hydro Plants via Computational Intelligence (Ramon Abritta, Frederico F. Panoeiro, Ivo C. da Silva Junior, André Luís Marques Marcato, Leonardo de Mello Honório, Luiz Eduardo de Oliveira)....Pages 314-329
A Dynamic Ensemble Selection Framework Using Dynamic Weighting Approach (Aiman Qadeer, Usman Qamar)....Pages 330-339
Applying Feature Selection and Weight Optimization Techniques to Enhance Artificial Neural Network for Heart Disease Diagnosis (Younas Khan, Usman Qamar, Muhammad Asad, Babar Zeb)....Pages 340-351
Immune Inspired Dendritic Cell Algorithm for Stock Price Manipulation Detection (Baqar Rizvi, Ammar Belatreche, Ahmed Bouridane)....Pages 352-361
Validity Evaluation for the Data Used for Artificial Intelligence System (Han Seong Son)....Pages 362-369
Advanced DBSCAN: A Clustering Algorithm for Personal Credit Reference System (Lu Han)....Pages 370-381
An Analysis on the Weibo Topic “US-China Trade War” Based on K-Means Algorithm (Shaochi Cheng, Yuan Gao, Xiangyang Li, Su Hu)....Pages 382-390
High Quality Dataset for Machine Learning in the Business Intelligence Domain (Luisa Franchina, Federico Sergiani)....Pages 391-401
A Machine Learning Approach to Shipping Box Design (Guang Yang, Cun (Matthew) Mu)....Pages 402-407
Utilizing Local Outlier Factor for Open-Set Classification in High-Dimensional Data - Case Study Applied for Text Documents (Tomasz Walkowiak, Szymon Datko, Henryk Maciejewski)....Pages 408-418
Dynamic Programming Models for Maximizing Customer Lifetime Value: An Overview (Eman AboElHamd, Hamed M. Shamma, Mohamed Saleh)....Pages 419-445
Big Data Modelling for Predicting Side-Effects of Anticancer Drugs: A Comprehensive Approach (Sai Jyothi Bolla, S. Jyothi)....Pages 446-456
Semantic Topic Discovery for Lecture Video (Jiang Bian, Mao Lin Huang)....Pages 457-466
Who Has Better Driving Style: Let Data Tell Us (Linna Wu, Huan Li, Hengtian Ding, Lizhuo Zhang)....Pages 467-485
A Hierarchical Missing Value Imputation Method by Correlation-Based K-Nearest Neighbors (Xin Liu, Xiaochen Lai, Liyong Zhang)....Pages 486-496
Predicting the Success of Kickstarter Projects in the US at Launch Time (Weifeng Hu, Rui Yang)....Pages 497-506
Patterns and Outliers in Temporal Point Processes (César Ali Marin Ojeda, Kostadin Cvejoski, Rafet Sifa, Jannis Schuecker, Christian Bauckhage)....Pages 507-526
Classification of Followee Recommendation Techniques in Twitter (Kamaljit Kaur, Kanwalvir Singh Dhindsa)....Pages 527-540
Firemen Prediction by Using Neural Networks: A Real Case Study (Christophe Guyeux, Jean-Marc Nicod, Christophe Varnier, Zeina Al Masry, Nourredine Zerhouny, Nabil Omri et al.)....Pages 541-552
A Community Detection Based Approach for Exploring Patterns in Player Reviews (Maren Pielka, Rafet Sifa, Rajkumar Ramamurthy, Cesar Ojeda, Christian Bauckhage)....Pages 553-565
Visual Ontology Sketching for Preliminary Knowledge Base Design (Tatiana Gavrilova, Elvira Grinberg)....Pages 566-576
Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset (I. Al-Mandhari, L. Guan, E. A. Edirisinghe)....Pages 577-589
An Effective and Efficient Constrained Ward’s Hierarchical Agglomerative Clustering Method (Abeer A. Aljohani, Eran A. Edirisinghe, Daphne Teck Ching Lai)....Pages 590-611
Embedding Intelligence Within Data Points for a Machine Learning Framework: “Hex-Elementization” (Bhuvan Unhelkar, Girish Nair)....Pages 612-628
High Dimensional Restrictive Federated Model Selection with Multi-objective Bayesian Optimization over Shifted Distributions (Xudong Sun, Andrea Bommert, Florian Pfisterer, Jörg Rähenfürher, Michel Lang, Bernd Bischl)....Pages 629-647
Initial Results from Using Preference Ranking Organization Methods for Enrichment of Evaluations to Help Steer a Powered Wheelchair (Malik Haddad, David Sanders, Giles Tewkesbury, Alexander Gegov, Mohamed Hassan, Favour Ikwan)....Pages 648-661
Using a Simple Expert System to Assist a Powered Wheelchair User (David Sanders, Ogechukwu Okono, Martin Langner, Mohamed Hassan, Sergey Khaustov, Peter Omoarebun)....Pages 662-679
Combining Multiple Criteria Decision Making with Vector Manipulation to Decide on the Direction for a Powered Wheelchair (Malik Haddad, David Sanders, Alexander Gegov, Mohamed Hassan, Ya Huang, Mohamed Al-Mosawi)....Pages 680-693
Intelligent Approach to Minimizing Power Consumption in a Cloud-Based System Collecting Sensor Data and Monitoring the Status of Powered Wheelchairs (Ogechukwu Okonor, Mo Adda, Alex Gegov, David Sanders, Malik Jamal Musa Haddad, Giles Tewkesbury)....Pages 694-710
Task Programming Methodology for Powered Wheelchairs (Giles Tewkesbury, David Sanders, Malik Haddad, Nils Bausch, Alexander Gegov, Ogechukwu Okonor)....Pages 711-720
Indoor Location and Collision Feedback for a Powered Wheelchair System Using Machine Learning (Nils Bausch, Peter Shilling, David Sanders, Malik Haddad, Ogechukwu Okonor, Giles Tewkesbury)....Pages 721-739
Development of Real-Time ADAS Object Detector for Deployment on CPU (Alexander Kozlov, Daniil Osokin)....Pages 740-750
Opponent Modeling Under Partial Observability in StarCraft with Deep Convolutional Encoder-Decoders (Hyungu Kahng, Seoung Bum Kim)....Pages 751-759
Object Localization and Detection for Real-Time Automatic License Plate Detection (ALPR) System Using RetinaNet Algorithm (Sohailah Safie, Nik Muhamad Aizuddin Nik Azmi, Rubiyah Yusof, Muhd Ridzuan Muhd Yunus, Mohammad Fairol Zamzuri Che Sayuti, Kok Keng Fai)....Pages 760-768
Human Rescue Based on Autonomous Robot KUKA YouBot with ROS and Object Detection (Carlos Gordón, Patricio Encalada, Henry Lema, Diego León, Dennis Chicaiza)....Pages 769-780
Application of Deep Learning for the Diagnosis of Cardiovascular Diseases (Giovanah Gogi, Alexander Gegov)....Pages 781-791
Anticipating Next Goal for Robot Plan Prediction (Edoardo Alati, Lorenzo Mauro, Valsamis Ntouskos, Fiora Pirri)....Pages 792-809
Propositional Deductive Inference by Semantic Vectors (Douglas Summers-Stay)....Pages 810-820
Shape Reconstruction from a Monocular Defocus Image Using CNN (Rulin Chen, Alex Noel Joseph Raj, Xun Ma, Zhemin Zhuang)....Pages 821-831
Deep Learning: A Brazilian Case (Dora Kaufman)....Pages 832-847
Characterizing High Level LIGO Gravitational Wave Data Using Deep Learning (Lavika Goel, Joy Mukherjee)....Pages 848-860
Hyper-Spectral Image Classification by Multi-layer Deep Convolutional Neural Networks (Tao Chi, Yang Wang, Ming Chen, Manman Chen)....Pages 861-876
ROM-Based Deep Learning Inference for Sleep Stage Classification (Mohamed H. AlMeer, Hanadi Hassen, Naveed Nawaz)....Pages 877-889
Deep Learning for SAR Image Classification (Hasni Anas, Hanifi Majdoulayne, Anibou Chaimae, Saidi Mohamed Nabil)....Pages 890-898
Predicting Emerging and Frontier Stock Markets Using Deep Neural Networks (Dennis Murekachiro, Thabang Mokoteli, Hima Vadapalli)....Pages 899-918
Application of Transfer Learning for Object Detection on Manually Collected Data (Elhassan Mohamed, Konstantinos Sirlantzis, Gareth Howells)....Pages 919-931
An Optimized Deep Convolutional Neural Network Architecture for Concept Drifted Image Classification (Syed Muslim Jameel, Manzoor Ahmed Hashmani, Hitham Alhussain, Mobashar Rehman, Arif Budiman)....Pages 932-942
Fuzzy Control Techniques for Energy Conversion Systems (Silvio Simani, Stefano Alvisi, Mauro Venturini)....Pages 943-955
Fuzzy Sliding Mode Control of Onboard Power Electronics for Fuel Cell Electric Vehicles (Amin Hajizadeh)....Pages 956-965
Effect of the Delay in Fuzzy Attitude Control for Nanosatellites (Ástor del Castañedo, Álvaro Bello, Karl Olfe, Victoria Lapuerta)....Pages 966-981
User-Friendly Interface for Introducing Fuzzy Criteria into Expressive Searches (Mohammad Halim Deedar, Susana Muñoz-Hernández)....Pages 982-997
Fuzzy Optimal State Observers for Takagi-Sugeno Fuzzy State Feedback Control (José Miguel Adánez, Basil Mohammed Al-Hadithi, Agustín Jiménez)....Pages 998-1016
Exploring the Uncanny Valley Theory in the Constructs of a Virtual Assistant Personality (Marta Perez Garcia, Sarita Saffon Lopez)....Pages 1017-1033
Design and Implementation of the IMU Human Body Motion Tracking System (Qi Jin, Zequan Zhang, Wenguang Jin)....Pages 1034-1043
Artificial Intelligence Teaching Methods in Higher Education (Yi Yang, Jiasong Sun, Lu Huang)....Pages 1044-1053
Artificial Swarm Intelligence (Louis Rosenberg, Gregg Willcox)....Pages 1054-1070
Conversational User Interface Integration in Controlling IoT Devices Applied to Smart Agriculture: Analysis of a Chatbot System Design (Eleni Symeonaki, Konstantinos Arvanitis, Dimitrios Piromalis, Michail Papoutsidakis)....Pages 1071-1088
Automation of Teaching Processes on e-Learning Platforms Using Recommender Systems (Marcin Albiniak)....Pages 1089-1100
A Pilot Study on Estimating Players Dispositional Profiles from Game Traces Analysis (Abir B. Karami, Benoît Encelle, Karim Sehaba)....Pages 1101-1120
A Memory-Based Decision-Making Model for Multilingual Alternatives: The Role of Memory, Emotion and Language (Zineb Djouamai, Li Ying)....Pages 1121-1137
Agent-Based Simulation of Cultural Events Impact on Social Capital Dynamics (Darius Plikynas, Rimvydas Laužikas, Leonidas Sakalauskas, Arūnas Miliauskas, Vytautas Dulskis)....Pages 1138-1154
Efficient Heuristics for Solving Precedence Constrained Scheduling Problems (Amy Khalfay, Alan Crispin, Keeley Crockett)....Pages 1155-1167
Solving the Service Technician Routing and Scheduling Problem with Time Windows (Amy Khalfay, Alan Crispin, Keeley Crockett)....Pages 1168-1177
Development of High Rate Wearable MIMU Tracking System Robust to Magnetic Disturbances and Body Acceleration (Hammad Tanveer Butt, Manthan Pancholi, Mathias Musahl, Maria Alejandra Sanchez, Didier Stricker)....Pages 1178-1198
Measuring and Reducing the Cognitive Load for the End Users of Complex Systems (James Oakes, Mark Johnson, James Xue, Scott Turner)....Pages 1199-1209
Sea Water Desalination for Coastal Area Using Concentrated Sunlight and Solar Panel (Nadia Nowshin, Md. Mukter Hossain Emon, Minul Khan Rahat, Md. Ashikul Islam, Romana Alam)....Pages 1210-1218
Investigating the Use of Interference Fringe of Spherical Film for Detecting Micro-force (Yong-hua Lu, Jing Li, Chi Zhang, Rui Wang, Jia Zhang)....Pages 1219-1231
Statistical Model Checking of Cyber-Physical Systems Using Hybrid Theatre (Libero Nigro, Paolo F. Sciammarella)....Pages 1232-1251
Network Modeling of the Structure of Conceptual Experience in the Context of Intellectual Competence in Older Adolescence (Yana Ivanovna Sipovskaya)....Pages 1252-1262
Asymmetric Information and e-Tourism: Literature Review (Samira Oukarfi, Hicham Sattar)....Pages 1263-1274
Using Low-Level Sensory Data to Recognize Events in a Smart Home (Thomas Reichherzer, Andrew Petrovsky)....Pages 1275-1284
The Role of Corporate Ontology in the IT Support of Processes Management (Cezary Stępniak, Tomasz Turek, Leszek Ziora)....Pages 1285-1297
Back Matter ....Pages 1299-1302


Advances in Intelligent Systems and Computing 1037

Yaxin Bi Rahul Bhatia Supriya Kapoor Editors

Intelligent Systems and Applications Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1

Advances in Intelligent Systems and Computing Volume 1037

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink ** More information about this series at http://www.springer.com/series/11156

Yaxin Bi • Rahul Bhatia • Supriya Kapoor



Editors

Intelligent Systems and Applications Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1


Editors Yaxin Bi School of Computing, Computer Science Research Institute Ulster University Newtownabbey, UK

Rahul Bhatia The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

Supriya Kapoor The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-3-030-29515-8 ISBN 978-3-030-29516-5 (eBook) https://doi.org/10.1007/978-3-030-29516-5 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Editor’s Preface

The Intelligent Systems Conference (IntelliSys) 2019 was held during September 5–6, 2019, in London, UK. The Intelligent Systems Conference is a prestigious annual conference on intelligent systems and artificial intelligence and their applications to the real world, and it builds on the success of the IntelliSys conferences held in London over the past five years. The conference not only presented state-of-the-art methods and valuable experience from researchers in the related research areas, but also provided the audience with a vision of further development in the field. The research that comes out of the series of IntelliSys conferences provides insights into complex intelligent systems and paves the way for future development.

The program committee of IntelliSys 2019 represented 25 countries, and the authors submitted 546 papers from 45 countries. This certainly attests to the widespread, international importance of the theme of the conference. Each paper was reviewed on the basis of originality, novelty, and rigorousness. After the reviews, 223 papers were accepted for presentation, out of which 189 papers are published in these proceedings. The event was a two-day program comprising 24 paper presentation sessions and poster presentations. The themes of the contributions and scientific sessions ranged from theories to applications, reflecting a wide spectrum of artificial intelligence.

We are very gratified to have an exciting lineup of featured speakers who are among the leaders in changing the landscape of artificial intelligence and its application areas. Plenary speakers included Grega Milcinski (CEO, Sinergise), Detlef D. Nauck (Chief Research Scientist for Data Science at BT Technology), Giulio Sandini (Director of Research, Italian Institute of Technology), and Iain Brown (Head of Data Science, SAS UK&I).

The conference would truly not function without the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, organizing committee members, steering committee members, and others in their various roles. Their valuable support, suggestions, dedicated commitment, and hard work have made IntelliSys 2019 successful.

It has been a great honor to serve as General Chair for IntelliSys 2019 and to work with the conference team. We believe this event will certainly help further disseminate new ideas and inspire more international collaborations.

Kind Regards,

Yaxin Bi
Conference Chair


Can Human Evidence Accumulation Be Modeled Using the Set-Theoretic Nature of Dempster-Shafer Theory?

Samantha Lang¹, Eric Freedman², and Michael E. Farmer¹

¹ University of Michigan-Flint, 303 E. Kearsley Street, Flint, MI 48502, USA
  {samantla,farmerme}@umflint.edu
² University of Michigan-Flint, Flint, USA

Abstract. Belief updating and the integration of streams of evidence are critical tasks for artificially intelligent systems. Traditional AI systems tend to employ Bayesian models of conditioning or approaches based on Dempster-Shafer theory. These approaches have assumptions that run counter to the cognitive models developed for human belief updating. In addition, human cognition has the added limitation of finite working memory, which requires the human cognitive system to develop and manage representations of data to manage capacity. While the evidence-updating approach of Dempster-Shafer has issues behaving as humans do, the approach's use of the concept of the power set and its management of beliefs through subsets of varying cardinality as evidence unfolds can serve as a promising model for human management of working memory. This paper uses a Clue® game approach to testing human subjects. The results show that the use of sets of information at varying cardinalities may be an effective model for describing how human subjects develop and manage beliefs.

Keywords: Working memory models · Dempster-Shafer · Evidence accumulation

1 Introduction

Belief updating has been a heavily researched area of Artificial Intelligence (AI) for many years. As Baratgin and Politzer note, “the most famous and widespread theory used for this purpose in the psychological literature certainly is the Bayesian model” [15]. Various researchers have found issues with using the Bayes model for human belief revision in three areas, namely: (i) order effects, (ii) reduction of the impact of evidence over long evidence streams, and (iii) evaluation versus estimation of evidence [5, 15, 17]. Past work by Farmer developed a Kalman filter-based evidence accumulation model, which closely mimicked the human cognitive model of Hogarth and Einhorn [11]. The model successfully addressed these three issues and demonstrated superior evidence updating not only to Bayes' method, but to a number of other belief-updating mechanisms popular in AI [4].

(E. Freedman—Deceased; formerly University of Michigan-Flint.)

There are three additional areas of difference between traditional AI methods (i.e., Bayes and Dempster-Shafer) and human cognition that were not directly addressed in previous studies. These areas are: (i) non-complementary decision-making, (ii) hypothesis resurrection, and (iii) the finite nature of human working memory. Recall that Bayes requires complementarity of evidence: if p(A) increases, then p(¬A) decreases. Various studies in human psychology have shown non-complementary updating of evidence, such as those by Freedman and Myers [2] and more recently by Curley [3]. Baratgin and Politzer also reviewed a number of belief revision studies and found a consistent “lack of observance of the complementarity constraint required by Bayes” [15].

The second issue, namely hypothesis resurrection, is not compatible with either the Bayesian or the Dempster-Shafer framework. Baratgin and Politzer define it as follows: “hypotheses whose probabilities were previously estimated at zero by participants, later have a probability different from zero” [15]. Even though hypothesis resurrection is not accommodated within the Bayesian or Dempster-Shafer frameworks, it is evident in human cognition [15].

The final issue that has not been addressed in artificial intelligence systems research is the capacity limitation of human working memory and the role this limitation may play in belief revision. This is important since, as Kareev notes, “people view the world through a window whose size reflects their working memory capacity” [6]. One mechanism for managing within a finite working memory capacity is to store information as more efficient representations. Gruszka and Necka note: “virtually every complex cognitive activity requires the temporal availability of a certain amount of cognitive representations” [14]. Interestingly, they also note that researchers have found that “stereotype-based impression formation has been shown as less resource-consuming than individuation” [14].

The level at which Bayes manages beliefs is the singleton level, while Dempster-Shafer manages beliefs across the entire power set [10]. The stereotyping mentioned above is an example of managing information in a set-theoretic format. The ability to manage evidence as subsets rather than as a collection of singletons can help mitigate the limited working memory capacity available. In addition to reducing the amount of working memory required, proper information representations can even reduce the complexity of tasks [14].

In this paper, we will explore the role that information representation, in the form of subsets, plays in human cognition for a task of evidence accumulation and belief updating using the common Clue® game. We will show that the set-theoretic nature of Dempster-Shafer belief functions more closely models the behavior of human subjects compared to Bayes belief updating of singletons.

2 Belief Revision and Working Memory in Human Psychology

Cognitive research has shown that there is a range of memory structures within the human brain. These structures consist of short-term memory, working memory, and long-term memory. Baddeley refers to short-term memory as being simply storage, while working memory provides storage and manipulation [1]. Logie notes, “knowledge is then held in a collection of interacting, domain-specific temporary memory systems—or components of working memory—and processed by a range of executive functions” [8]. Likewise, Baddeley noted that “useful goal-specific representations are temporarily stored in the working memory” [1]. Van Lamsweerde et al. note that the “limited capacity of visual working memory (VWM) can be maximized by combining multiple features into a single representation through grouping principles such as connection, proximity, and similarity” [9]. Likewise, Nie et al. note that “memory for real-world scenes has been shown to depend largely on organizational principles, that is, mechanism that impose structure” [7]. Nie et al. also posit that “observers have a strong tendency to structure and organize a given sensory input into some higher-order regularity, that is, ‘compression’ of the available information in order to spare the limited cognitive resources” [7].

The activation model of working memory provides an alternative theory to the multi-component models of working memory discussed previously [13]. Velichkovsky notes, “these models concentrate on the presence of the various representational states of information held in working memory” [13]. Baddeley describes Oberauer's activation-model view of working memory as being “to serve as a blackboard for information processing on which we can construct new representations…” [1]. Oberauer also defines maintaining structural representations as one of the key roles of working memory [1]. Wolf and Knauff discuss the task of belief revision without tying it to working memory and note that in “the mental model theory of reasoning, people construct a set of mental models of the possibilities that the situation embedded in the premises might represent” [12].

Note that all the working memory models share two concepts: (i) the capacity of working memory is finite, and (ii) working memory relies on information organization, representation, and manipulation. Management of subsets of information is a natural representation. The role of subsets in human cognition has been explored by Curley, who directed his subjects to consider “… assign belief to a set of suspects, e.g., selecting the set {B, D}”, thus guiding the subjects to consider managing evidence and beliefs using sets [3]. Van Lamsweerde et al. note: “it is possible to use top-down knowledge to influence the perception of a group” [7]. This implies it is possible that Curley's subjects used set-based groupings simply because they were directed to do so. In our research, we devised an experiment that will allow us to see set-based groupings of evidence for sequential belief revision.

3 Dempster-Shafer Approach to Evidence Representation

Dempster-Shafer theory begins with the concept of there being a fixed and exhaustive set of mutually exclusive objects called the environment, or frame of discernment, defined as Θ = {θ1, θ2, …, θN} [10]. For a simple three-element set, the power set is:

Bel({A}), Bel({B}), Bel({C}), Bel({A,B}), Bel({A,C}), Bel({B,C}), Bel({A,B,C})    (1)

where the last term Bel({A,B,C}) is belief in the set of complete ignorance. One critical property of beliefs in Dempster-Shafer theory is that they are non-complementary, meaning [10]:

Bel(X) ≠ 1 − Bel(¬X)    (2)

This is a critical distinction between Dempster-Shafer and Bayes, where complementarity is required. We will explore the role of non-complementarity in human cognition later in this paper.
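As a concrete illustration (not part of the original paper), the following Python sketch builds a small mass function over the power set of the three-element frame Θ = {A, B, C}, derives the belief of each subset as the sum of the masses of its sub-subsets, and checks the non-complementarity property of Eq. (2). The particular mass values are made up purely for illustration.

```python
from itertools import combinations

def powerset(frame):
    """All non-empty subsets of the frame, as frozensets."""
    items = sorted(frame)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

# Three-element frame of discernment from the text.
theta = frozenset({"A", "B", "C"})

# Illustrative basic mass assignment (numbers are not from the paper).
m = {
    frozenset({"A"}): 0.3,
    frozenset({"A", "B"}): 0.4,
    theta: 0.3,                      # mass on the full frame = ignorance
}

def bel(subset, masses):
    """Bel(X): total mass of all focal elements contained in X."""
    return sum(w for focal, w in masses.items() if focal <= subset)

for s in powerset(theta):
    print(sorted(s), round(bel(s, m), 3))

# Non-complementarity (Eq. 2): Bel(X) + Bel(not X) need not equal 1.
x = frozenset({"A"})
print(bel(x, m), bel(theta - x, m))   # 0.3 and 0.0 -> sum is 0.3, not 1
```

Under a Bayesian assignment over singletons, by contrast, p(A) + p(¬A) would be forced to equal 1.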

4 Mapping Incoming Beliefs into a Set Framework

The past work of Freedman and Myers [2] implied that subjects were assigning beliefs at more than the singleton level, since the sums of their beliefs were greater than one. Curley [3] showed that a set-based belief structure more closely modeled how humans handled legal case evidence. A perfect example of subjects' beliefs summing to greater than one comes from the study defined below, in which subjects were asked to assign quantitative belief values to four hypotheses for suspects in a murder mystery. For example, one test subject assigned their beliefs as:

Suspect 1: 1; Suspect 2: 0; Suspect 3: 0; Suspect 4: 1

It can be hypothesized that this test subject is actually considering the beliefs at a higher level of abstraction than the singleton level, and that forcing them to assign beliefs at the singleton level results in this odd assignment. This subject's beliefs would best be represented by Bel({suspect1, suspect4}) = 1. Another subject recorded his or her first beliefs as:

Suspect 1: 0.25; Suspect 2: 0.25; Suspect 3: 0.25; Suspect 4: 0.25

This, of course, is a perfect example of Bayesian ignorance and would optimally be represented by Bel({suspect1, suspect2, suspect3, suspect4}) = 1.

Fig. 1. Mapping of singleton beliefs into set-based beliefs [16].

An algorithm for interpreting singleton beliefs as higher-cardinality belief sets was motivated by these observations by Farmer in [16]. In this process, the singleton confidences are first placed in a histogram, and a histogram-slicing process is performed as shown in Fig. 1. The confidences of the subsets of the power set P(Θ) are assigned by horizontally slicing the histogram and grouping the levels into appropriate subsets. The residual amplitude at each level is the confidence assigned to each of the appropriate subsets. This yields the optimal subset representations for the two test subjects just discussed. Storing the entire power set would be an unwise use of resources, since working memory has significant capacity limitations; this approach instead manages working memory optimally through the set-theoretic representation of Dempster-Shafer. While this approach differs from traditional Dempster-Shafer belief generation, Josang, Costa, and Blasch observed, “Because different situations can involve different forms of belief fusion, there is no single formal model that is suitable for analyzing every situation” [18].
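A minimal sketch of how such a slicing procedure could be implemented is given below. This is our reading of the description above and of Fig. 1, not the authors' code; in particular, the final normalization of the slice masses to a unit total is our assumption.

```python
def slice_singleton_beliefs(confidences):
    """
    Map singleton confidences (e.g. {'Suspect 1': 1.0, ...}) to belief
    masses over subsets by horizontally slicing the confidence histogram.
    Each horizontal slice is assigned to the subset of hypotheses whose
    confidence reaches that slice; masses are then normalized (assumption).
    """
    levels = sorted({v for v in confidences.values() if v > 0})
    masses = {}
    previous = 0.0
    for level in levels:
        members = frozenset(k for k, v in confidences.items() if v >= level)
        masses[members] = masses.get(members, 0.0) + (level - previous)
        previous = level
    total = sum(masses.values())
    return {s: w / total for s, w in masses.items()} if total else {}

# The two subjects discussed in the text:
print(slice_singleton_beliefs(
    {"Suspect 1": 1.0, "Suspect 2": 0.0, "Suspect 3": 0.0, "Suspect 4": 1.0}))
# -> all mass on {Suspect 1, Suspect 4}

print(slice_singleton_beliefs(
    {"Suspect 1": 0.25, "Suspect 2": 0.25, "Suspect 3": 0.25, "Suspect 4": 0.25}))
# -> all mass on the full four-suspect set (Bayesian ignorance)
```

For mixed confidences such as (0.8, 0.5, 0.5, 0.1), the same slicing would place mass on nested subsets of cardinality one, three, and four, which is the kind of multi-cardinality representation discussed later in the results.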

5 Method of Experimentation

The findings of Freedman and Myers [2] showed non-complementary decisions in their human subjects. For this paper, we developed a set of experiments to test whether human subjects use an approach analogous to Dempster-Shafer, in which evidence is aggregated at various levels of abstraction. In the experiment by Curley [3], the subjects were specifically asked to consider subsets for evidence management. In this study, the subjects are not pre-biased towards subsets, but instead are asked to process clues that would be best managed by working with subsets of candidate perpetrators. The experiment had four iterations of different clues that implicated differing subsets of characters. In the experiment, the suspect characters were Susan, Robert, Claire, and Paul, while Jessica was the victim. The possible subsets for the characters can be found in Table 1.

Table 1. Subsets in which singleton characters fall.

  Subset          Characters
  Female          Susan & Claire
  Male            Robert & Paul
  Work            Robert, Paul & Susan
  Support Staff   Susan & Paul

For the experiment, the new evidence that comprised each new clue implicated subsets of suspects in each round of the evidence presentation as follows:

Round 1: Everyone
Round 2: Mainly Robert, Claire, Paul
Round 3: Mainly Robert, Claire, Paul
Round 4: Mainly Susan, Paul
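The clue rounds above can be related back to the Dempster-Shafer machinery of Sect. 3. The sketch below encodes each round as an illustrative mass function over the suspect subsets and fuses them with Dempster's rule of combination, the standard D-S fusion operator (the paper itself does not spell this rule out). The numeric weights attached to each clue are hypothetical; only the implicated subsets come from the list above.

```python
from functools import reduce

SUSPECTS = frozenset({"Susan", "Robert", "Claire", "Paul"})

def dempster_combine(m1, m2):
    """Dempster's rule: intersect focal elements, renormalize conflict."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    scale = 1.0 - conflict   # assumes the sources are not totally conflicting
    return {s: w / scale for s, w in combined.items()}

# Hypothetical encoding of the four clue rounds (weights are illustrative).
rounds = [
    {SUSPECTS: 1.0},                                                # Round 1
    {frozenset({"Robert", "Claire", "Paul"}): 0.7, SUSPECTS: 0.3},  # Round 2
    {frozenset({"Robert", "Claire", "Paul"}): 0.7, SUSPECTS: 0.3},  # Round 3
    {frozenset({"Susan", "Paul"}): 0.7, SUSPECTS: 0.3},             # Round 4
]

fused = reduce(dempster_combine, rounds)
for focal, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(focal), round(mass, 3))
```

With these made-up weights, most of the fused mass ends up on the singleton {Paul}, with smaller residual masses remaining on higher-cardinality subsets such as {Robert, Claire, Paul} and {Susan, Paul}; this loosely parallels the multi-cardinality belief maintenance reported in the results below.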

The participants for this study were attending the University of Michigan-Flint at the time of their participation. The subjects were students working towards an undergraduate or graduate degree within the Psychology and/or Computer Science departments. Eighty-eight students participated; however, three students did not complete the experiment, so there were 85 participants in the subsequent data analysis.

6 Results

The results were consistent with past experiments that showed participants' beliefs tend to sum to greater than one. Figure 2(a) shows that as the experiment advanced, the summation of the beliefs was dropping. However, nearly 75% of subjects still maintained a sum of beliefs greater than 100 through the last cycle of the clue gathering. The reduction in the sum of beliefs as the subjects gathered evidence is consistent with the hypothesis that, as the participants converged towards their final beliefs, the collection of subsets being managed in working memory may have been focusing down to fewer subsets of lower cardinality. This reduction in set cardinality can also be seen in Fig. 2(b), which shows the cardinality of the subjects' final decisions, where the significant majority of the subjects had a final decision of cardinality one. The percentage of subjects with beliefs greater than 100 in Fig. 2(a) did not drop further, indicating that even when the final decision was a singleton, the participants were still maintaining beliefs in higher-cardinality subsets that involved other suspects.

Figure 3(a) exemplifies the issue of hypothesis resurrection that was identified in the introduction as one of the key differences between human cognition and AI approaches. This figure shows that roughly 20 of the over 80 subjects did not have the suspect from their final decision appear in their top three candidate sets in any evidence-gathering pass. Figure 3(b) further reinforces the need for hypothesis resurrection in any system that strives to mimic human cognition. It shows that, when considering just the singleton final set, nearly half of the test subjects never had the singleton as a candidate in any of their evidence gatherings. This is particularly difficult for Bayes to deal with, since that theory only works on singletons. Figure 3(c) and (d) further imply the use of higher-cardinality subsets by the participants. These figures show that the participants used subsets of both cardinality two and cardinality three to hold the final singleton suspect whom they ultimately considered to be guilty, as they processed the incoming evidence over the three passes leading up to the final decision.

Fig. 2. Behavior of subjects in terms of belief assignment, (a) Percentage of subjects with sum of beliefs > 100, (b) Frequency of cardinality of final decisions.

Fig. 3. Management of beliefs at various cardinalities: (a) # passes where the final decision was maintained as a member in any cardinality set (except ignorance set), (b) # passes where the final decision singleton was in top 3 beliefs, (c) # passes where the final decision singleton was a member of a cardinality = 2 subset in top 3 beliefs, (d) # passes where the final decision singleton was a member of a cardinality = 3 subset in top 3 beliefs.

7 Conclusions and Future Work

Recall that the study presented here was interested in addressing three issues where human cognition and artificial intelligence differ in evidence accumulation: (i) non-complementary decision-making, (ii) hypothesis resurrection, and (iii) the finite nature of human working memory. The results of the research confirm previous human cognition studies in which non-complementarity of hypotheses was found. The study provided evidence that humans use sets to manage and combine evidence, as is required by the finite capacity of human working memory; working memory capacity can be optimally managed with subsets of appropriate cardinality and content. The study also showed the prevalence of hypothesis resurrection in human cognition. As previously noted, the research conducted fits well with the set-theoretic concepts proposed by previous studies. Looking to future directions of study, the researchers are considering re-running the study with time stamps indicating when a participant made each suspect selection. Recording time stamps would give the researchers a better idea of how long it took the participant to process the information given before producing an answer. Decisions that required additional time could be indicative of decisions that required higher levels of information combining, pulling down of information, or integration of new and old information into higher-level sets.

Another possible future study is to run two groups: a control group that could only input their beliefs at the singleton level, as done here, and a second group that would be given the opportunity to input their beliefs for subsets of higher cardinality. It would be interesting to see whether the subjects of the control group developed subsets similar to those of the group given explicit hints about using sets. Lastly, another possible future study would be to request that participants provide written responses after every suspect selection explaining their selections.

References

1. Baddeley, A.: Working memory: theories, models, and controversies. Annu. Rev. Psychol. 63, 1–29 (2012)
2. Freedman, E.G., Myers, A.M.: Effects of evidence type, number of hypotheses and alternative strength on hypothesis evaluation. In: Proceedings of the 37th Annual Meeting of the Psychonomic Society (1997)
3. Curley, S.P.: The application of Dempster-Shafer theory demonstrated with justification provided by legal evidence. Judgement Decis. Making 2(5), 257–276 (2007)
4. Farmer, M.E.: Sequential evidence accumulation via integrating Dempster-Shafer reasoning and estimation theory. In: Proceedings of the Intelligent Systems Conference (IntelliSys 2017), pp. 90–97 (2017)
5. Farmer, M.: Application of evidence accumulation based on estimation theory and human psychology for automotive airbag suppression. In: Proceedings of the Special Session on Artificial Neural Networks and Intelligent Information Processing at the International Conference on Informatics and Control (2011)
6. Kareev, Y.: Through a narrow window: working memory capacity and the detection of covariation. Cognition 56, 263–269 (1995)
7. Nie, Q.-Y., Müller, H.J., Conci, M.: Hierarchical organization in visual working memory: from global ensemble to individual object structure. Cognition 159, 85–96 (2017)
8. Logie, R.H.: The functional organization and capacity limits of working memory. Curr. Dir. Psychol. Sci. 4(20), 240–245 (2011)
9. van Lamsweerde, A.E., Beck, M.R., Johnson, J.S.: Visual working memory organization is subject to top-down control. Psychon. Bull. Rev. 23, 1181–1189 (2016)
10. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
11. Hogarth, R.M., Einhorn, H.J.: Order effects in belief updating: the belief-adjustment model. Cogn. Psychol. 1(24), 1–55 (1992)
12. Wolf, A.G., Knauff, M.: The strategy behind belief revision: a matter of judging probability or the use of mental models? In: Proceedings of the 30th Annual Conference of the Cognitive Science Society, pp. 64–70 (2008)
13. Velichkovsky, B.B.: Consciousness and working memory: current trends and research perspectives. Conscious. Cogn. 55, 35–45 (2017)
14. Gruszka, L., Necka, E.: Limitations of working memory capacity: the cognitive and social consequences. Euro. Manag. J. 35, 776–784 (2017)
15. Baratgin, J., Politzer, G.: The psychology of dynamic probability judgment: order effect, normative theories, and experimental methodology. Mind Soc. 6, 53–66 (2007)
16. Farmer, M.: Evidential reasoning for control of smart automotive airbag suppression. In: Proceedings of the IASTED International Conference on Intelligent Systems and Control, pp. 174–181 (2005)
17. Wang, H., Zhang, J., Johnson, T.R.: Order effects in human belief revision. In: Proceedings of the 1999 Cognitive Science Society Conference, pp. 547–552 (1999)
18. Jøsang, A., Costa, P., Blasch, E.: Determining model correctness for situations of belief fusion. In: Proceedings of the International Conference on Information Fusion, pp. 1886–1893 (2013)

Emotional Speech Recognition Using SMILE Features and Random Forest Tree

Ammar Mohsin Butt, Yusra Khalid Bhatti, and Fawad Hussain

Department of Computer Engineering, University of Engineering and Technology, Taxila, Pakistan
[email protected], [email protected], [email protected]

Abstract. The recognition of emotional speech and its accurate representation is an exciting and challenging area of research in the field of speech and audio processing. Existing methods for representing emotional speech do not provide features that discriminate well between different emotions, and they have several other limitations. In this work, we evaluate the openEAR toolkit features on a publicly available dataset, the SAVEE database. The low-level descriptors and their statistical functionals provide discriminating features for each emotion and yield state-of-the-art results for the given dataset. A random forest tree classifier model is trained in WEKA for classification. The accuracy obtained for the SAVEE emotional database is 76.1%.

Keywords: Feature extraction · Emotional speech · openEAR · Random Forest

1 Introduction

In human-computer interaction, emotion recognition from speech is one of the most emerging and widely researched topics. Furthermore, there are many benefits of knowing the emotional state of a human in the field of artificial intelligence. Recently, there has been increased interest in speech emotions because of their vast applications in academic as well as industrial settings. The emotions of the speaker are classified using machine learning classifiers with acoustic and linguistic input features; the more distinctive the features, the more accurate the trained model [1]. The results are better and more promising when the same corpus is used for training and testing data. However, such a system might not perform well in a real-time scenario with varying conditions such as different languages and age groups of speech utterances. In most machine learning schemes for speech emotion recognition, feature selection directly affects recognition accuracy. By far the most common features include pitch frequency, energy-related features, zero crossing rate, formants, spectral features, temporal features, linear predictive coding and Mel frequency cepstral coefficients, used in combination for emotion extraction. These extracted features are then fed to various classifiers for speech emotion recognition. Many researchers have proposed machine learning methods to train such classifiers, including artificial neural networks [2], Bayesian network models [3], Hidden Markov Models [4], Support Vector Machines (SVM) [5], Gaussian Mixture Models (GMM) [6] and multi-classifier fusion [7].


An advantage of these approaches is that a model can be trained without very large amounts of data. The disadvantage is that it is difficult to judge the quality of the features, and some key features may be lost, which decreases recognition accuracy. It is also difficult to ensure that good results can be achieved across a variety of databases. Furthermore, structural features can be learned using a deep structure [8] developed from autoencoders. During model training, many constraints are imposed in the deep structure; a sparse auto-encoder has been proposed to constrain the average response of every hidden unit to a small value. In the proposed approach, the main aim is to evaluate the openEAR toolkit features, which are extensively used by researchers for emotion recognition in speech. The SAVEE dataset is used as the emotional speech database, and the extracted features are classified using the WEKA software. The Surrey Audio-Visual Expressed Emotions (SAVEE) dataset was proposed in 2008 by S. Haq and P.J.B. Jackson at the University of Surrey. Four male actors aged between 27 and 31 recorded the dataset; each actor had a different accent (one southern English, one Scottish and two English). There are a total of seven emotions in the dataset, and each actor was asked to utter 15 sentences for each emotion. Along with the audio acquisition, visual data was also recorded. This dataset is now extensively used by many researchers to perform emotion recognition using audio and visual data. The paper is organized as follows: Sect. 2 describes the related work in emotional speech recognition, Sect. 3 explains the proposed methodology, Sect. 4 discusses the experimental setup and results and, finally, Sect. 5 concludes the work.

2 Related Work

Various speech emotion recognition models have been implemented using different sets of features. In recent years, the Extreme Learning Machine (ELM) has become a centre of research, and many variants of the ELM algorithm have been proposed to enhance its performance. ELM algorithms [9] have been employed in a number of different problems, such as computational intelligence, forecasting, image processing, pattern recognition and machine learning. Huang et al. [10] proposed a learning algorithm for Extreme Learning Machines using a feed-forward network with a single hidden layer. Their paper discusses the working of extreme learning machines, with emphasis on variants of ELM such as ordinal ELM, pruning ELM, error-minimized ELM, two-stage ELM, voting-based ELM, incremental ELM, symmetric ELM and fully complex ELM. Compared to other neural networks and learning methods, the extreme learning machine is much faster and provides better generalization performance. Han et al. [11] investigated the importance of higher-level features for emotional speech and proposed using deep neural networks to extract higher-level features from raw speech for emotional speech recognition. The deep neural network creates an emotional state probability distribution for the construction of high-level features.


The distributions are then used in an extreme learning machine for recognition of utterance-level emotions. The results suggest that emotional speech recognition is improved by this method, as the features are much more meaningful than those of earlier methods; the overall improvement in accuracy is about 20%. In order to find the best set of features at both low and high levels, researchers have proposed various methods. The most commonly used features for emotion recognition are energy, Mel frequency, and the velocity and acceleration coefficients of MFCC [12]. The researchers in [13] investigated macro- and micro-based features for emotional speech recognition, extracting emotions at utterance and segment level. They found that prosodic and spectral features extracted at frame level are better and provide the best results compared to spectral features from unsegmented speech samples; therefore, when micro-level features are used for emotional speech classification, the prosodic features are more meaningful. Similarly, another study [14] investigated two sets of extracted features, including 65 MFCC-based features with 39 Mel-frequency cepstral coefficients. A deep support vector machine and a simple support vector machine, along with auto-encoders, were used as classifiers on the SAVEE [15] dataset. The results show improved performance with auto-encoders and the deep support vector machine compared to the conventional SVM. Latif et al. [16] employed cross-language and cross-corpus analysis to address speech emotion recognition. The eGeMAPS [17] feature set is used within a transfer learning approach; it consists of 88 features covering energy, cepstral, spectral, frequency and dynamic information. The results suggest improved performance in speech emotion recognition for cross-corpus and cross-language scenarios. The evaluation is performed in three languages over five different corpora and shows that Deep Belief Networks (DBNs) outperform similar state-of-the-art techniques on cross-dataset emotional speech recognition, compared to an SVM baseline system and a sparse auto-encoder. Yogesh et al. [18] extracted higher-order spectral features using the openSMILE toolbox, combined with a feature selection scheme based on a biogeography-based algorithm, to develop an autonomous speech emotion recognition system. The datasets used in that study were the Surrey Audio-Visual Expressed Emotion database (SAVEE), the Berlin Emotional Speech dataset (BES) and Speech under Simulated and Actual Stress (SUSAS); an additional eight datasets were used for validation purposes. After conducting eight different experiments, recognition rates of 62.5%–78.4% for SAVEE, 85.8%–98.7% for SUSAS and 90.3%–99.4% for BES were obtained. The results show that the proposed feature selection method is better than earlier works. Glüge et al. [19] proposed an auto-encoder based Extreme Learning Machine for emotional speech recognition, evaluated on three different corpora: eNTERFACE [20], Emotional-DB [21] and SmartKom [22]. The recognition rates are compared with recent methods such as Generalized Discriminant Analysis (GeDA) based on deep learning and support vector machines. The comparison shows considerable improvement in accuracy: performance is enhanced by 3–14% over SVM for all three datasets and by 8–13% over GeDA for two datasets.


3 Methodology

The proposed methodology consists of two main steps. First, features are extracted using the SMILE feature set, and a descriptor is formed for each audio clip using low-level audio features. The second step is the classification of emotion classes using a random forest tree classifier. This section is divided into the following parts: Speech Database, Feature Extraction, and Classification.

A. Speech Database

The speech corpus used for experimentation is the Surrey Audio-Visual Expressed Emotions (SAVEE) dataset. It is widely used by many researchers to extract emotion information from audio as well as visual data. It was recorded by four male speakers. The audio files are in .wav format, recorded at 44.1 kHz, and contain 15 sentences for each of 7 emotions: anger, disgust, happiness, fear, surprise, neutral and sadness.

B. Feature Extraction

Feature extraction is performed using the openEAR toolkit, which is freely available for public use as open source software under the General Public License.

a. openEAR Toolkit Architecture

There are three main components of the toolkit. The major constituent is the SMILE (Speech and Music Interpretation by Large-Space Extraction) feature extractor, which can generate more than 500k features for both live and offline audio data. The toolkit also provides tools and scripts for classification and facilitates training models on arbitrary datasets. The input data can be classified using the built-in classifiers, or the extracted features can be exported in different formats, e.g. Hidden Markov Toolkit (.HTK) format, Comma Separated Values (.CSV) and ARFF. The implementation is very efficient and the code is optimized to avoid any double calculations; e.g. FFT coefficients are extracted only once and can be reused numerous times for the calculation of energy, cepstral and spectral features.

b. Features

SMILE features are extracted to represent low-level descriptors, i.e. low-level audio features, combined using transformational and statistical methods. The feature length is 6552, formed by applying 39 functionals to 56 low-level audio features and their first- and second-order delta regression coefficients. Table 1 shows the 33 low-level audio features employed in the acoustic analysis, and Table 2 shows the statistical functions applied to the low-level descriptors. This step converts the variable-length features into a static feature vector.

C. Classification

The features extracted with the openEAR toolkit are in an .arff file format, which can be used in WEKA for classification. WEKA contains many built-in classifiers which can be used for generating a classification model.

Table 1. 33 low-level audio features used for feature extraction

Feature              Group features
Raw signal           Zero crossings
Energy of signal     Logarithmic feature
Signal pitch         Autocorrelation and cepstrum for fundamental frequency F0; exponentially curved F0 envelope
Quality of voice     Voicing probability
Spectral             Energy in bands between 0 to 250 Hz, 0 to 650 Hz and 1 to 4 kHz; flux, centroid, relative position of signal spectrum max and min, 25%, 50%, 75%, 90% roll-off points
Signal mel-spectrum  Bands 1 to 26
Cepstral feature     MFCC 0 to 12

Table 2. The functions applied to the low-level descriptor regression coefficients and contours

Functionals                                                                        #
Relative position (max-min)                                                        2
Range                                                                              1
Arithmetic mean (max-min)                                                          2
Quadratic and arithmetic mean                                                      2
Non-zeros                                                                          1
Mean of non-zeros (quadratic-geometric)                                            2
Mean of non-zeros and absolute values                                              2
Ranges (inter-quartile - quartile)                                                 6
Percentile (95% and 98%)                                                           2
Variance, kurtosis, std. deviation and skewness                                    4
Centroid functional                                                                1
Rate of zero crossing                                                              1
No. of peaks, mean distance between peaks, arithmetic mean of peaks, overall mean  4
Approximation error and regression coefficients                                    4
Quadratic coefficients and approximation error                                     5

4 Experimentation and Results

The extracted features were classified using the WEKA software. The Random Forest Tree performed best among the classifiers available in WEKA, and the results are state-of-the-art for the SAVEE dataset. The overall data consists of 420 instances, each having 6552 attributes. A holdout validation scheme with random sampling is employed for training and testing, using 80% of the data for training and 20% for testing.
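The paper's pipeline (openEAR/SMILE features exported to ARFF, then a Random Forest in WEKA) can be approximated in Python as a rough sketch. The file name, the class-column handling and the hyper-parameters below are assumptions for illustration, not details taken from the paper, and scikit-learn's random forest will not reproduce WEKA's numbers exactly.

```python
# Hypothetical equivalent of the WEKA experiment using scikit-learn.
# Assumes "savee_features.arff" holds the 6552 openEAR features plus a nominal
# class attribute in the last column (file name and layout are illustrative).
import numpy as np
from scipy.io import arff
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

data, meta = arff.loadarff("savee_features.arff")
names = meta.names()
X = np.array([[row[n] for n in names[:-1]] for row in data], dtype=float)
y = np.array([row[names[-1]].decode() for row in data])

# 80/20 holdout with random sampling, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

With the 420 SAVEE utterances this gives an 84-instance test set, matching the total reported in Table 3; the exact accuracy will depend on the random split and on implementation differences between the two toolkits.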


The summary of the results is given in Table 3 below.

Table 3. The summary of results for the given samples

Correctly Classified Samples    76.1905%
Incorrectly Classified Samples  23.8095%
Kappa statistic                 0.7207
Mean absolute error             0.1754
Root mean squared error         0.2715
Relative absolute error         71.4884%
Root relative squared error     77.4159%
Total Samples                   84

The detailed accuracy for each class is given in Table 4 below.

Table 4. The detailed accuracy and other parameters for each emotion class

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area
anger          0.750    0.015    0.923      0.750   0.828      0.798  0.977     0.926
disgust        0.733    0.058    0.733      0.733   0.733      0.675  0.933     0.791
fear           0.727    0.041    0.727      0.727   0.727      0.686  0.958     0.828
happy          0.750    0.026    0.750      0.750   0.750      0.724  0.911     0.665
neutral        0.833    0.056    0.714      0.833   0.769      0.730  0.939     0.812
sad            1.000    0.027    0.818      1.000   0.900      0.892  1.000     1.000
surprise       0.615    0.056    0.667      0.615   0.640      0.578  0.944     0.761
Weighted avg.  0.762    0.041    0.766      0.762   0.761      0.721  0.952     0.830

The confusion matrix is shown in Fig. 1 below for the given dataset.

Fig. 1. Classifier output from WEKA


The results show the effectiveness of the applied method compared to the methods proposed earlier in [23], where the accuracy on the SAVEE dataset is 56% using LDA and 50% with PCA.

5 Conclusion and Future Work

In this work, we have achieved state-of-the-art results for the SAVEE emotional database. Feature extraction using the openEAR toolkit provides an excellent representation for emotional speech datasets, and the Random Forest Tree performed best among the classifiers tested in forming a trained classification model. The acoustic features extracted using this toolkit contain discriminating information that is helpful for the classifiers. In future work, this study can be extended by employing deep learning models (e.g. ELMs, ResNets, etc.) to identify emotions in music, singing and different genres of singing. Furthermore, evaluation schemes such as cross-cultural evaluation can be employed for better acoustic analysis and recognition.

References

1. Batliner, A., Schuller, B., Seppi, D., Steidl, S., Devillers, L., Vidrascu, L., Vogt, T., Aharonson, V., Amir, N.: The automatic recognition of emotions in speech. In: Emotion-Oriented Systems, pp. 71–99. Springer (2011)
2. Wang, S., et al.: Speech emotion recognition based on principal component analysis and back propagation neural network. In: 2010 International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), pp. 437–440 (2010)
3. Ververidis, D., Kotropoulos, C.: Fast and accurate sequential floating forward feature selection with the Bayes classifier applied to speech emotion recognition. Signal Process. 88(12), 2956–2970 (2008)
4. Mao, X., Chen, L., Fu, L.: Multi-level speech emotion recognition based on HMM and ANN. In: 2009 WRI World Congress on Computer Science and Information Engineering, pp. 225–229 (2009)
5. Zhou, J., et al.: Speech emotion recognition based on rough set and SVM. In: International Conference on Machine Learning and Cybernetics, pp. 53–61 (2005)
6. Neiberg, D., Laskowski, K., Elenius, K.: Emotion recognition in spontaneous speech using GMMs. In: INTERSPEECH 2006 - ICSLP, pp. 1–4 (2006)
7. Wu, C.H., Liang, W.B.: Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Trans. Affect. Comput. 2(1), 10–21 (2011)
8. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
9. Ding, S., et al.: Extreme learning machine: algorithm, theory and applications. Artif. Intell. Rev. 44(1), 103–115 (2015)
10. Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)
11. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
12. Chan, K., Hao, J., Lee, T., Kwon, O.W.: Emotion recognition by speech signals. In: Proceedings of the International Conference EUROSPEECH, Citeseer (2003)
13. Pervaiz, M., Amir, A.: Comparative study of features extraction for speech's emotion at micro and macro level. In: International Conference on Communication, Computing and Digital Systems (C-CODE), IEEE (2017)
14. Aouani, H., Ayed, Y.B.: Emotion recognition in speech using MFCC with SVM, DSVM and auto-encoder. In: 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), IEEE (2018)
15. Jackson, P., Haq, S.: Surrey Audio-Visual Expressed Emotion (SAVEE) database. University of Surrey, Guildford (2014)
16. Latif, S., et al.: Cross corpus speech emotion classification - an effective transfer learning technique. arXiv preprint arXiv:1801.06353 (2018)
17. Gideon, J., Khorram, S., Aldeneh, Z., Dimitriadis, D., Provost, E.M.: Progressive neural networks for transfer learning in emotion recognition. arXiv preprint arXiv:1706.03256 (2017)
18. Yogesh, C.K., et al.: A new hybrid PSO assisted biogeography-based optimization for emotion and stress recognition from speech signal. Expert Syst. Appl. 69, 149–158 (2017)
19. Glüge, S., Ronald, B., Thomas, O.: Emotion recognition from speech using representation learning in extreme learning machines. In: 9th International Joint Conference on Computational Intelligence, Funchal, Madeira, Portugal, 1–3 November 2017, vol. 1. SciTePress (2017)
20. Martin, O., Kotsia, I., Macq, B., Pitas, I.: The eNTERFACE 2005 audio-visual emotion database. In: 22nd International Conference on Data Engineering Workshops (ICDEW 2006), pp. 1–8. IEEE (2006)
21. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Proceedings of Interspeech, pp. 1517–1520 (2005)
22. Steininger, S., Rabold, S., Dioubina, O., Schiel, F.: Development of the user-state conventions for the multimodal corpus in SmartKom. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation (2002)
23. Haq, S., Jackson, P.J., Edge, J.: Audio-visual feature selection and reduction for emotion classification. In: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP 2008), Tangalooma, Australia (2008)

A Switching Approach that Improves Prediction Accuracy for Long Tail Recommendations

Gharbi Alshammari, Jose L. Jorro-Aragoneses, Stelios Kapetanakis, Nikolaos Polatidis, and Miltos Petridis

School of Computing, Engineering and Mathematics, University of Brighton, Moulsecoomb Campus, Lewes Road, Brighton BN2 4GJ, UK
{g.alshammari,s.kapetanakis,n.Polatidis}@brighton.ac.uk
Department of Software Engineering and Artificial Intelligence, Universidad Complutense de Madrid, Madrid, Spain
[email protected]
Department of Computer Science, Middlesex University London, The Burroughs, London NW4 4BT, UK
[email protected]

Abstract. Recommender systems are software tools that play an important role in generating lists of recommendations for unseen items based on users' past experience and interactions. One of the most popular approaches is Collaborative Filtering (CF), which considers the similarities between users to generate recommendations. Although recommender systems have been studied from many angles, popularity bias is still a challenge that needs to be addressed. We therefore propose a novel model that applies a switching technique to solve the long tail recommendation problem (LTRP) when collaborative filtering fails to find the target case, using a multi-level method. We evaluate the results on the public MovieLens 100K dataset. Our method outperforms the compared methods by reducing the recommendation error rates for items in the long tail.

Keywords: Collaborative filtering · Switching · Multi-level · Long tail

1 Introduction

The success of Recommender Systems (RS) in filtering and finding relevant products is one way to address the increased quantity of information available on the web. Hence, RS can make a huge impact on both sides: on businesses in the market and on users in finding and being recommended interesting items. These recommendations rely on user interaction, behaviour tracking and analysis using machine-learning techniques. Collaborative filtering (CF) is the most successful technique for RS. Given a set of users, items and ratings, CF attempts to recommend items to a target user based on users' ratings.


The main task of CF is to predict the rating of a certain item that meets the user's interest, based on the user's past behaviour and interactions. Hence, the rating, which can be gathered explicitly or implicitly, is the most important input to CF [22]. CF works on the idea of recommending items based on the similarity between users, and has been applied since the mid-1990s using the most common classification models: k-nearest neighbour (kNN) and matrix factorization (MF). The kNN method has practical advantages: it is simple to implement and, in terms of efficiency, needs no costly steps to train a model; hence, it is popular among researchers [10]. However, CF suffers from the long tail problem, which affects the accuracy of the recommendations [16]. The key issue in this technique is how to calculate the similarity between users or items by finding shared interests. It relies significantly on ratings, which allow users to assign a high or low value to an item according to their preference or dislike for it [18]. By contrast, content based filtering (CBF) is a second technique that considers the features of the items to find the similarity between them. In user terms, the user profile represents the content of the items that have been liked or rated, reflecting the user's interests and preferences; to make relevant recommendations matched against the user's profile, a similarity measure is adopted that calculates how similar or close items are to this profile. Many similarity measures have been adopted in recommender systems, such as Pearson's Correlation Coefficient (PCC) [22] and Cosine similarity [24], to provide recommendations based on the absolute ratings between users. Hence, modified similarity measures are one of the most important avenues for improving prediction accuracy in recommender systems. For hybrid recommender systems, the author in [7] proposed combining two or more techniques through seven hybridization methods: mixed, weighted, switching, cascade, feature augmentation, feature combination and meta-level. The main goal of combining these methods is to achieve a higher quality of recommendations and provide more reliable and accurate results than a single method. He presented one example of hybrid recommendation, EntreeC, a restaurant recommender system that combines Case-based Reasoning (CBR) and CF in a cascade: the knowledge representation is used as a first step to rank similar users based on their interests, and CF is then employed among those users. Many recommender system approaches have considered popular items or items with the highest ratings, an approach known as popularity-based recommendation. For example, a daily news site will recommend popular news based on people's reading frequency. However, a challenge arises when items are new to the system or have not gained enough ratings to become popular. In such cases it is essential to consider less-known items as well as the popular ones, since they can provide serendipity to the users. These items belong to the long tail problem, as introduced in [4].


Hence, those items should be considered, and the method should be able to suggest the relevant ones in the tail. For example, the authors in [1] presented an item weighting approach that filters the items in the long tail and places them among the top-ranked items. In this paper, a novel method is proposed that integrates the multi-level method with a switching hybrid system. The fundamental contributions of our method are as follows:

1. We propose a novel recommendation framework that applies a switching method between CF and CBF using the multi-level method, which improves prediction accuracy when recommending items in the long tail.
2. We examine the proposed method through a comprehensive experiment on a public dataset using two different training/testing approaches to show the quality of the proposed method, conducting a comparison with the baseline methods and a state-of-the-art alternative.

2 Related Work

A main challenge for recommender systems is to find the right recommendations for the target users. This challenge is mostly handled by first estimating the probability that the user will watch or purchase an item, through rating prediction, and then ranking the items that have a high impact. The literature details substantial studies in this area, focusing mainly on the two most applied methods: CF and CBF. CF relies on rating similarity, building on the assumption that similar users rate similar items, which can help predict unseen items [11]. CBF, in turn, relies on the similarity of item features, for example genres or text that represents the item, using information retrieval and filtering methods such as Term Frequency-Inverse Document Frequency (TF-IDF). However, the effectiveness of both methods is limited when they are applied individually; hence, the hybrid recommender system was proposed in 2002 by [7] to overcome the limitations of each method by using two or more methods together. Recommendations are generated for a specific user through a prediction that uses a similarity function to calculate how similar two users are; a classification model then identifies the closest users that can help calculate the predicted value. One of the most applied classifiers is k-nearest neighbour (kNN), which selects the most similar users using a pre-defined k. The most popular measures used in the literature are the Pearson correlation coefficient (PCC), Euclidean distance and Cosine similarity [6]. Pearson similarity has had the most successful applications and is defined in Eq. 1, where $Sim^{PCC}_{a,b}$ is the similarity of users $a$ and $b$, $r_{a,p}$ is the rating of user $a$ for product $p$, $r_{b,p}$ is the rating of user $b$ for product $p$, $\bar{r}_a$ and $\bar{r}_b$ are the users' average ratings, and $P$ is the set of all products.

$$Sim^{PCC}_{a,b} = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}} \qquad (1)$$


Recently, many authors have presented modifications to the similarity function to improve CF recommendations [3,5,21,23]. For example, a multi-level method was proposed by the authors in [21] that uses constraints to enhance the PCC similarity value of users who belong to specified categories based on their common (co-rated) items. In [23], the cosine similarity was modified using co-rated items as an adjustment factor to improve the similarity. Using two or more techniques together is called a hybrid recommender system, which has been applied to overcome the limitations of using one approach and to obtain better results [7]. For instance, in [2] a hybrid case-based reasoning approach was proposed to solve the long tail problem (items that have few ratings) by switching between CF and CBF. In addition, the authors in [17] implemented a hybrid recommender system that applied a clustering technique and an artificial algae algorithm with a multi-level CF approach. Co-rated items have also been used in recommender systems to improve predictive accuracy. The authors in [27] introduced a hybrid approach that addresses items that have not been rated in the user-item matrix through a weighted combination of user-based and item-based CF; these methods address the two main challenges of RS, recommendation accuracy and data sparsity, by jointly incorporating the correlations of items and users. In [26] the authors address the cold-start problem in user-based CF by considering both the distance between users and the co-rating of items using Jaccard factors. In [25], the authors proposed a new measure that integrates the triangle similarity approach with Jaccard similarity, which also considers non-co-rating users. The authors in [21] propose a multi-level constraint that improves the quality of recommendations using PCC. Equation 2 adjusts the similarity between users based on the PCC and on co-rated items at different levels, as detailed in Sect. 4.1.

$$Sim_{a,b} = \begin{cases} Sim^{PCC}_{a,b} + x_1 & \text{if } \frac{|I_a \cap I_b|}{T} \ge t_1 \text{ and } Sim^{PCC}_{a,b} \ge y \\ Sim^{PCC}_{a,b} + x_2 & \text{if } t_2 \le \frac{|I_a \cap I_b|}{T} < t_1 \text{ and } Sim^{PCC}_{a,b} \ge y \\ Sim^{PCC}_{a,b} + x_3 & \text{if } t_3 \le \frac{|I_a \cap I_b|}{T} < t_2 \text{ and } Sim^{PCC}_{a,b} \ge y \\ Sim^{PCC}_{a,b} + x_4 & \text{if } t_4 \le \frac{|I_a \cap I_b|}{T} < t_3 \text{ and } Sim^{PCC}_{a,b} \ge y \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

The long tail recommendation problem (LTRP) is a major challenge in recommender systems and refers to items with less popularity [4]. A number of ways to solve this problem have been presented in the literature; the majority use a pre-processing technique such as clustering or dividing the data into groups (head and tail) [12,19,28]. For example, the authors in [20] describe a clustering technique that boosts items belonging to the long tail. In [12], an item clustering approach was proposed using association rule mining. Matrix factorization was proposed in [9] to measure the performance of recommendations of items in the LTRP. Furthermore, in [28] the author proposed a graph-based method for the LTRP using user-item information along with undirected edge-weighted graphs. In [8] a case-based reasoning method is presented in which new artists and tracks are recommended; the authors proposed a method that could identify whether an item belongs in the long tail and, if it does, seek to improve its available meta-data through the addition of tag knowledge.


However, most existing algorithms require additional processing to solve the LTRP, and some decrease accuracy when recommending items in the long tail [28], whereas in our method the accuracy is increased and no additional processing or information is required.

3 Proposed Method

The number of co-rated items indicates the degree of connection between users; for instance, a low number of co-rated items might reflect a low level of similarity. However, traditional similarity measures do not take the number of co-rated items into account [23]. Hence, to solve the LTRP, a switching hybrid system is proposed that utilises the multi-level similarity method. Figure 1 shows our proposed architecture, which switches between two different techniques: a CF component that calculates the predicted rating based on other similar users, and a CBF component that calculates the predicted rating based on other similar movies that the user has rated previously. Our hybrid method also adopts a multi-level CF approach, which enhances the similarity of users that fall into certain groups and disregards the rest [21], as shown in Eq. 2; this improves the kNN process by a considerable margin in practice.

Fig. 1. A switching multi-level recommender system architecture.

The system receives a query (Q) that selects the user (u) and movie (m). Ratings are on a scale from 1 to 5, and the aim is to calculate the estimated rating $r(m, u)$ for the target movie.

$$Q = \langle u, m \rangle \qquad (3)$$

The first step in the model is to determine which method is more applicable for calculating the rating prediction. This decision is based on the number of ratings that the target movie has received. In order to make this decision, the system computes a vector $R_m$ representing the ratings that the target movie m has received:

$$R_m = \langle m, r_m \rangle = \big(\langle m, r(m, u_1) \rangle, \ldots, \langle m, r(m, u_j) \rangle\big) \qquad (4)$$

In this first step, the system gets the number $|R_m|$ of ratings that the query movie (m) has received, and compares this value with a threshold constant (δ). If the number of ratings of m is greater than δ, then the movie does not belong to the long tail and the multi-level collaborative filtering method can be utilised. On the other hand, CBF is used when the number of ratings is less than δ, because the target movie has not received a sufficient number of ratings; in this case the system switches to the content-based method. Below, we detail each method and present how the rating prediction of a target movie is computed.
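To make the switching rule concrete, the sketch below expresses it in a few lines of Python; the threshold value, the function names and the signatures of the two predictors are placeholders we introduce for illustration, not details from the paper.

```python
# Illustrative switching step; delta and both predictors are assumed to be
# defined elsewhere (the paper does not fix a specific delta at this point).
DELTA = 10  # hypothetical threshold on the number of ratings

def predict_rating(user_id, movie_id, ratings, predict_cf_multilevel, predict_cbf):
    """Estimate r(m, u) for a query <u, m> given a list of (user, movie, rating) tuples."""
    # |R_m|: how many ratings the target movie has received so far.
    num_ratings = sum(1 for (_, m, _) in ratings if m == movie_id)

    if num_ratings > DELTA:
        # Enough ratings: not a long-tail item, so use multi-level CF.
        return predict_cf_multilevel(user_id, movie_id, ratings)
    # Otherwise the movie is in the long tail: switch to content-based filtering.
    return predict_cbf(user_id, movie_id, ratings)
```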

4 Experimental Evaluation

4.1 Comparison

We ran the proposed switching hybrid algorithm and compared it with the baseline CF using Euclidean and Pearson similarity, with CBF, and with our previous ICCBR17 method [2]. The kNN algorithm is used to compute the rating prediction in each case.

User-Based CF. This model calculates the rating prediction based on the ratings of similar users. It represents all movies rated by a user as a list ($R_u$) and compares these lists to obtain the user similarity. The kNN algorithm is used to find the most similar users that rated the target movie m, and the similarity is computed using the two most popular similarity functions. The first is the Euclidean distance:

$$Sim^{Euc}(R_u, R_{u'}) = 1 - \sqrt{\sum_{m=0}^{|M|} \big(r(m, u) - r(m, u')\big)^2} \qquad (5)$$

$$\text{where } M = R_u \cap R_{u'} \qquad (6)$$

The other similarity function used in the experimental evaluation is the Pearson correlation defined in Eq. 1. If the system can find the k most similar users that have rated the target movie m, it computes the rating prediction from the ratings given by these users, using a weighted average of the ratings and the similarity values:

$$r(m, u) = \frac{\sum_{i=0}^{k} r_i(m, u') \cdot sim_i(R_u, R_{u'})}{\sum_{i=0}^{k} sim_i(R_u, R_{u'})} \qquad (7)$$

Finally, $r(m, u)$ is the rating prediction returned by this module; a small sketch of this predictor is given below, before the second method is described.
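As a rough illustration of Eqs. (1) and (7), the following Python sketch implements user-based kNN prediction with Pearson similarity over plain dictionaries of ratings; it is our own illustrative code, not the authors' implementation, and the data layout (a dict mapping user to {movie: rating}) is an assumption.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation (Eq. 1) computed over the items both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / (den_a * den_b) if den_a and den_b else 0.0

def predict_user_based(user, movie, all_ratings, k=10, sim=pearson):
    """Weighted-average kNN prediction (Eq. 7); all_ratings maps user -> {movie: rating}."""
    # Only users who have rated the target movie can act as neighbours.
    candidates = [(sim(all_ratings[user], all_ratings[v]), all_ratings[v][movie])
                  for v in all_ratings if v != user and movie in all_ratings[v]]
    neighbours = sorted(candidates, reverse=True)[:k]   # k most similar users
    sim_sum = sum(s for s, _ in neighbours)
    if sim_sum == 0:
        return None  # no usable neighbours for this movie
    return sum(s * r for s, r in neighbours) / sim_sum
```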


Content Based Filtering Based on User History. This model is based on the statistical averaging of ratings for each genre defined in the user profile, through the creation of a personal case base for the target user. Similar movies that the user has rated are used to calculate the predicted rating. Each case ($C_{CB}$) consists of the list of genres that describes the movie:

$$C_{CB} = \langle u, m, G_m \rangle \qquad (8)$$

$$G_m = \{g_1, \ldots, g_i, \ldots, g_n\} \qquad (9)$$

Movies are compared based on the number of common genres using the Jaccard similarity defined in Eq. 10:

$$sim(m, m') = \frac{|G_m \cap G_{m'}|}{|G_m \cup G_{m'}|} \qquad (10)$$

Using the k most similar movies, the rating prediction is then calculated with Eq. 7.

Multi-level. This method is based on multiple levels, from the top to the bottom, with a number of constraints at each level as defined in Eq. 2, where $Sim_{a,b}$ indicates the similarity value between user a and user b, T is the total number of co-rated items, and t1, t2, t3 and t4 are the predefined co-rated-item thresholds for user similarity. We set t1 = 50, t2 = 20, t3 = 10 and t4 = 5, and took x1 = 0.5, x2 = 0.375, x3 = 0.25, x4 = 0.125 and y = 0.33.

ICCBR17. This model applies a switching method between CF and CBF, using a constraint that specifies whether to apply CF or CBF depending on the number of ratings received by the target item.

The proposed method is based on a switching method that uses the multi-level algorithm with a number of constraints. The constraints assign a higher similarity value to users who share common items and whose PCC similarity value is greater than a certain threshold, which is not used in the other methods. It should be noted that, unlike Eq. 2 where the value is set to 0, the similarity between two users is kept at $Sim^{PCC}_{a,b}$ when none of the level constraints is satisfied. An experiment was conducted to compare the proposed method against the baseline CF and CBF algorithms and the ICCBR17 method. It was based on the publicly available MovieLens dataset and the most popular predictive accuracy measures in recommender systems: mean absolute error (MAE) and root mean squared error (RMSE). Below we give the results for the real dataset using two different training and testing percentages, evaluating the results and comparing them with all methods. In this experiment, k represents the number of neighbours and is set to 3, 5, 10, 20 and 30.
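As an illustration of the multi-level adjustment of Eq. (2) with the parameter values quoted above, the sketch below boosts a PCC similarity according to how many items two users have co-rated. Treating the level conditions as thresholds on the co-rated count is our reading of the equation, and the `pearson` helper is assumed to exist (for instance, the one sketched in the User-Based CF paragraph); this is not the authors' code.

```python
# Illustrative multi-level boost (Eq. 2) using the reported parameters:
# t1..t4 = 50, 20, 10, 5; x1..x4 = 0.5, 0.375, 0.25, 0.125; y = 0.33.
LEVELS = [(50, 0.5), (20, 0.375), (10, 0.25), (5, 0.125)]
Y = 0.33

def multilevel_similarity(ratings_a, ratings_b, pcc):
    """Return the PCC similarity boosted according to the number of co-rated items."""
    sim = pcc(ratings_a, ratings_b)
    co_rated = len(set(ratings_a) & set(ratings_b))  # our interpretation of T in Eq. 2
    if sim >= Y:
        for threshold, boost in LEVELS:  # levels are checked from the top down
            if co_rated >= threshold:
                return sim + boost
    # The proposed variant keeps the plain PCC value when no level applies
    # (Eq. 2 itself would set it to 0).
    return sim
```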

4.2 MovieLens 100K

This is the latest small MovieLens dataset that is publicly available. It comes from a web-based research recommender system that was in use from January 1995 to October 2016. It was generated by 671 users across 9,125 movies, and each user has rated at least 20 movies. It contains 100,004 ratings, all of which are in the range 1 to 5. The three main features are [UserID], [MovieID] and [Rating]; in addition, the movie information contains [movieId], [title] and [genres] [13].
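For readers who want to reproduce the setup, the small MovieLens release can be loaded along the following lines; the file names match the GroupLens "latest-small" distribution as far as we know, but the paths and the genre handling shown here are illustrative assumptions rather than part of the paper.

```python
import pandas as pd

# Files from the MovieLens "latest-small" release (names per the GroupLens download).
ratings = pd.read_csv("ratings.csv")   # columns: userId, movieId, rating, timestamp
movies = pd.read_csv("movies.csv")     # columns: movieId, title, genres

print(ratings["userId"].nunique(), "users,",
      ratings["movieId"].nunique(), "movies,",
      len(ratings), "ratings")

# Genres are pipe-separated strings (e.g. "Adventure|Comedy"); split them
# into sets so that the Jaccard similarity of Eq. 10 can be computed.
movies["genre_set"] = movies["genres"].str.split("|").apply(set)
```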

4.3 Evaluation Metrics

Recommender system researchers have applied different measures to evaluate the quality of proposed recommendation algorithms [15]. Since 1994 [22], most empirical studies examining recommender systems have focused on appraising the accuracy of these systems using different methods [14]. Appraisals of accuracy are useful for evaluating the quality of a system and its ability to forecast the rating for a particular item. Predictive accuracy metrics are widely used by the CF research community and measure the difference between real user ratings and predicted ratings. Hence, we apply both the mean absolute error (MAE) and the root mean squared error (RMSE) to measure the performance of the proposed methods and evaluate their prediction accuracy compared with other recommendation techniques. The MAE is defined in Eq. 11 and the RMSE in Eq. 12.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |p_i - r_i| \qquad (11)$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - r_i)^2} \qquad (12)$$

In the above equations, $p_i$ is the predicted rating and $r_i$ is the actual rating; lower values indicate better results.
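A short numerical illustration of Eqs. (11) and (12); the rating values below are made up purely for the example.

```python
import numpy as np

predicted = np.array([3.8, 2.5, 4.1, 1.9])   # p_i, illustrative values only
actual = np.array([4.0, 3.0, 4.0, 2.0])      # r_i

mae = np.mean(np.abs(predicted - actual))           # Eq. (11)
rmse = np.sqrt(np.mean((predicted - actual) ** 2))  # Eq. (12)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```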

4.4 Results

Figure 2 shows the MAE and RMSE error rates across the MovieLens 100K dataset for the aforementioned predictive methods, using 70% of the data for training and 30% for testing. Our proposed method outperforms all the other compared recommendation methods. It can be seen clearly that when the number of neighbours is small, for example k = 3, 5 and 10, the improvement is very significant; when k gets higher, our method still shows an improvement, but it is less pronounced. Figure 3 shows the MAE and RMSE rates for all methods using 60% of the dataset for training and 40% for testing.

Fig. 2. MAE and RMSE results for the MovieLens dataset using the 70/30 split

Fig. 3. MAE and RMSE results for the MovieLens dataset using the 60/40 split

5 Conclusions and Future Work

In this paper, a novel method that solves the LTRP is presented, using a switching approach integrated with a multi-level algorithm. First, the number of ratings received by the target movie is calculated to determine whether the movie belongs to the long tail. If the target movie does not have a sufficient number of ratings, i.e. it belongs to the long tail, the system switches to the CBF method to calculate the rating. In this case, the method considers only the movies that are in the user's profile; the genres are used to find the most similar movies, and the rating of the target movie is then computed by comparison with the most similar ones. The advantages of the presented method are that the data does not need pre-processing before producing the recommendations and that no additional information is required, since both techniques use the same information (the users' rating histories). We have shown that our approach increases the accuracy of detecting items in the long tail and applying a switching method to them. Furthermore, based on the comparisons with traditional CF similarity measures and some state-of-the-art measures, the switching method using a multi-level similarity algorithm is robust against the long tail recommendation problem. In future work, we will consider other datasets to validate the proposed method in terms of offline evaluation. In addition, a good recommendation algorithm needs to be examined on an online recommender system platform, to allow observation of the effectiveness of our method and to measure the quality of the recommendations from the users' perspective.


Furthermore, we will consider a popular related problem, called cold start, and adapt our method as a solution to it.

References

1. Abdollahpouri, H., Burke, R., Mobasher, B.: Value-aware item weighting for long-tail recommendation. arXiv preprint arXiv:1802.05382 (2018)
2. Alshammari, G., Jorro-Aragoneses, J.L., Kapetanakis, S., Petridis, M., Recio-García, J.A., Díaz-Agudo, B.: A hybrid CBR approach for the long tail problem in recommender systems. In: International Conference on Case-Based Reasoning, pp. 35–45. Springer (2017)
3. Alshammari, G., Kapetanakis, S., Polatidis, N., Petridis, M.: A triangle multi-level item-based collaborative filtering method that improves recommendations. In: International Conference on Engineering Applications of Neural Networks, pp. 145–157. Springer (2018)
4. Anderson, C.: The long tail: why the future of business is selling less of more. J. Prod. Innovation Manag. 24(3), 1–30 (2007)
5. Ayub, M., Ghazanfar, M.A., Maqsood, M., Saleem, A.: A Jaccard base similarity measure to improve performance of CF based recommender systems. In: 2018 International Conference on Information Networking (ICOIN), pp. 1–6. IEEE (2018)
6. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann Publishers Inc. (1998)
7. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adap. Inter. 12(4), 331–370 (2002)
8. Craw, S., Horsburgh, B., Massie, S.: Music recommendation: audio neighbourhoods to discover music in the long tail. Lect. Notes Comput. Sci. 9343, 73–87 (2015)
9. Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys 2010), p. 39 (2010)
10. Gedikli, F., Jannach, D.: Recommending based on rating frequencies: accurate enough? In: Proceedings of the 8th Workshop on Intelligent Techniques for Web Personalization & Recommender Systems at UMAP 2010 (ITWP 2010), pp. 65–70 (2010)
11. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Commun. ACM 35(12), 61–70 (1992)
12. Grozin, V., Levina, A.: Similar product clustering for long-tail cross-sell recommendations. In: AIST (Supplement), pp. 273–280 (2017)
13. Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19 (2015). http://doi.acm.org/10.1145/2827872
14. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 230–237. ACM (1999)
15. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. (TOIS) 22(1), 5–53 (2004)
16. Jeong, B., Lee, J., Cho, H.: Improving memory-based collaborative filtering via similarity updating and prediction modulation. Inf. Sci. 180(5), 602–612 (2010)
17. Katarya, R., Verma, O.P.: Effectual recommendations using artificial algae algorithm and fuzzy c-mean. Swarm Evol. Comput. 36, 52–61 (2017)
18. Konstan, J.A., Riedl, J.: Recommender systems: from algorithms to user experience. User Model. User-Adap. Inter. 22(1–2), 101–123 (2012)
19. Park, Y.J.: The adaptive clustering method for the long tail problem of recommender systems. IEEE Trans. Knowl. Data Eng. 25(8), 1904–1915 (2013)
20. Park, Y.J., Tuzhilin, A.: The long tail of recommender systems and how to leverage it. In: Proceedings of the 2008 ACM Conference on Recommender Systems, pp. 11–18. ACM (2008)
21. Polatidis, N., Georgiadis, C.K.: A multi-level collaborative filtering method that improves recommendations. Expert Syst. Appl. 48, 100–110 (2016)
22. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186. ACM (1994)
23. Shen, K., Liu, Y., Zhang, Z.: Modified similarity algorithm for collaborative filtering. In: International Conference on Knowledge Management in Organizations, pp. 378–385. Springer (2017)
24. Shi, Y., Larson, M., Hanjalic, A.: Collaborative filtering beyond the user-item matrix: a survey of the state of the art and future challenges. ACM Comput. Surv. (CSUR) 47(1), 3 (2014)
25. Sun, S.B., Zhang, Z.H., Dong, X.L., Zhang, H.R., Li, T.J., Zhang, L., Min, F.: Integrating triangle and Jaccard similarities for recommendation. PloS One 12(8), e0183570 (2017)
26. Tan, Z., He, L.: An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle. IEEE Access 5, 27211–27228 (2017)
27. Wei, S., Zheng, X., Chen, D., Chen, C.: A hybrid approach for movie recommendation via tags and ratings. Electron. Commer. Res. Appl. 18, 83–94 (2016)
28. Yin, H., Cui, B., Li, J., Yao, J., Chen, C.: Challenging the long tail recommendation. Proc. VLDB Endowment 5(9), 896–907 (2012). http://dl.acm.org/citation.cfm?id=2311916

A Novel Algorithm for Dynamic Student Profile Adaptation Based on Learning Styles

Shaimaa M. Nafea, François Siewe, and Ying He

School of Business, Arab Academy for Science Technology and Maritime, Cairo, Egypt
[email protected]
School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK
{fsiewe,ying.he}@dmu.ac.uk

Abstract. E-learning recommendation systems are used to enhance student performance and knowledge by providing differentiated instruction based on the students’ interests and learning styles (LSs), which are typically stored in student profiles. For such systems to be effective, the profiles need to be adaptable and reflect the students’ changing behaviour. In this paper, we introduce new algorithms that are designed to track student learning behaviour patterns, identify their LSs, and maintain dynamic student profiles within a recommendation system (RS). We also propose a new method to extract features that characterise student behaviour to identify student LSs with respect to the Felder-Silverman learning style model (FSLSM). To test the efficiency of the proposed algorithm, we present a series of experiments that use a dataset based on real students to demonstrate how our proposed algorithm can effectively model a dynamic student profile and adapt to changes in student learning behaviour. The results reveal that the students could effectively increase their learning efficiency and quality of the courses recommended when their LSs are identified using our method. Keywords: Recommender system · Dynamic student profile · Student modelling · Adaptation · Algorithms · Learning style · Behaviour patterns · FSLSM Model

1 Introduction

The dramatic growth of information in the World Wide Web (WWW) has inadvertently led to information overload. Hence, it has become difficult and time-consuming to find a specific piece of information [1,2]. Recommendation systems (RSs) are popular personalisation tools that help students find relevant information based on the preferences contained in their respective profiles. A student


profile [3] represents a student's background, goals, learning styles (LSs), interests, and preferences. A key element of the profile is the LS, which describes how a student engages with his or her learning environment [4]. LSs are especially important in e-learning environments in that they enable the system to personalise the learning process in line with an individual's characteristics. According to several studies, adaptive e-learning environments based on specific LSs are not only more productive, but also create higher student satisfaction levels, decrease learning times, and increase students' academic achievement [5–9]. Many studies [5,10,11] have found that students with a strong preference for a specific LS have difficulty learning when it is not supported by the teaching environment; such mismatches lead to poor student performance. Increasing the efficacy of adaptive learning processes relies on identifying each student's LS correctly. The conventional method for doing this is to have the learner complete a questionnaire (the collaborative approach) [5,20,21]. In spite of the widespread use of this method, it has limitations that impair the precision of LS identification. First, these questionnaires are quite long, requiring sustained concentration and effort, which leads to the student becoming bored and disinterested. Second, the technique works under the assumption that students are aware of their LSs and preferences, which in many cases does not hold true. Two other approaches to identifying LSs automatically are displayed in Fig. 1 and described below.

Fig. 1. Identification of learning styles and automatic detection of learning styles

1. Data-driven approach: This approach focuses on building a model imitating the Index of Learning Styles questionnaire and using sample data to build the student's profile. The common techniques used for the data-driven approach are decision trees, neural networks, fuzzy clustering, and Bayesian networks [12,13].
2. Literature-based approach: This approach uses the behaviours of the students to identify their learning style preferences. A simple rule-based method is then applied to calculate learning styles from behaviours, such as the number of matching hints, the time spent on learning objects, etc. [14–17].

1.1 Study Aim and Problem Statement

The aim of this study is to build dynamic student profiles based on learning behaviour patterns, called the Dynamic Personalised Student Profile (DPSP), for LS identification. Prior studies on student profile adaptation have several drawbacks that hinder the precise identification of LSs, such as the following:
– First, it takes a long time to identify the behaviour patterns of students who participate in online learning, and sometimes the patterns identified from the data are not strong enough [18,19].
– Second, there is uncertainty in student behaviour identification, as well as difficulty and complexity in developing and implementing rules that can infer LSs effectively based on student actions.
– Third, current e-learning RSs are not capable of detecting changes in student LSs owing to alterations in their behaviour patterns. For example, CS383 [20], eTeacher [21], and LSAS [5] use questionnaires (the static or collaborative approach). Furthermore, student profiles in these systems are created only once at the beginning of the course, without the possibility of later updates [22].
To avoid these drawbacks and improve the accuracy of LS identification, this research accounts for exceptional student behaviours and updates student LSs frequently, since they change over time.

1.2 Contributions

The following contributions are made in this paper.
– Firstly, a novel algorithm for dynamic student profile adaptation based on the FSLSM learning style model is proposed (see Sect. 3).
– Secondly, a new approach is proposed to identify learning behaviour patterns based on the time spent on learning objects (LOs), the number of messages exchanged, and the format of the LOs accessed by the student (see Sect. 4).
– Thirdly, a method is proposed to transform learning behaviour patterns into FSLSM learning style preferences (Sect. 5), which are then used to update the student profile dynamically (see Sect. 6).
Our algorithm, DPSP, has been implemented in C++ with Visual Studio, using Windows Presentation Foundation (WPF) to design the graphical user interface (GUI), and evaluated using a real dataset. Participants from the Arab Academy for Science and Technology (AAST) voluntarily engaged in the study (see Sect. 7). The rest of this paper is structured as follows. In the next section, the proposed algorithm's main concepts are defined. Sections 3 and 4 describe our proposed algorithm for automatically detecting learning styles through learning behaviour. Sections 5 and 6 describe the student profile adaptation, whilst Sect. 7 describes a case study. Sections 8 and 9 contain the experimental results and analyses. Related work is discussed in Sect. 10, whilst Sect. 11 concludes the paper with proposals for future avenues of enquiry.

2 Basic Concepts and Background

In this section, concepts regarding student profile development based on the Felder-Silverman Learning Style Model (FSLSM) and student modelling are introduced.

2.1 Description of the FSLSM

Several authors have proposed different LS definitions [23–25]. One study [26] defined LS as the complex manner in which, and conditions under which, learners most efficiently and effectively perceive, process, store, and recall what they are learning. There are several LS theories in the literature; for example, Kolb [27], Honey and Mumford [28], Dunn and Dunn [29], Myers-Briggs [30], and Felder-Silverman [6] have all developed LS models. In our case, we used the Felder and Silverman model [6] for several reasons. First, it is the most commonly used one due to its capacity to quantify student LSs, as illustrated in Table 2; [31] observed that the FSLSM stands out from other theories by combining the main LS models. Second, it provides comprehensive details of its dimensions (processing, input, understanding, and perception), identifies a teaching style for each, and comes with a reliable and validated LS assessment tool, as shown in Fig. 2. Finally, the FSLSM theory describes student LS preferences for each dimension, which can be strong, moderate, or balanced, based on a scale from +11 to −11. For example, the value +11 means that a learner has a strong preference for visual learning in the visual/verbal dimension, whereas the value −11 indicates that a learner has a strong preference for verbal learning. Felder and Soloman [32] devised the Index of Learning Styles (ILS) questionnaire, which has 44 questions for identifying the LS of individual learners effectively. Students' LS preferences, as shown by their answers, are then calculated [32,33].

2.2 Student Modelling

The student model tracks an individual student's information so that the system can adapt itself to the student's interests and preferences [43]. In this regard, [44] states that the student model is a critical piece of individualised behaviour in e-learning recommendations, which strongly depend on how the knowledge about the student is modelled internally. The process of building and updating a Student Profile (SP) is called student modelling, and Fig. 3 illustrates the phases involved. Student modelling can be classified into static or dynamic modelling, as described below:
– Static modelling refers to an approach that initialises the student model only once, usually when the student is enrolled on the system [45].
– Dynamic modelling refers to an approach that updates the student information and then allows the system to respond to changes in the student model during the course [45].

Fig. 2. Felder-Silverman's learning style dimensions

Table 1. Summary of student modelling approaches (explicit modelling via questionnaires and/or implicit modelling via behaviour patterns) in existing personalised e-learning systems

System name        Learning style model
MAS-PLANG [34]     FSLSM
WELSA [35]         Unified LS Model
ANGOW [36]         FSLSM (understanding and perception)
shaboo [37]        FSLSM
e-Teacher [21]     FSLSM (perception, processing, understanding)
CS388 [20]         FSLSM
iWeaver [38]       Dunn & Dunn Model
LSAS [5]           FSLSM (sequential/global)
PLORS [39]         FSLSM
DeLeS [40]         FSLSM
Protus [41]        FSLSM
OSCAR CITS [42]    FSLSM

Table 1 illustrates examples of systems that use an explicit (using questionnaire) or implicit (using behaviour) modelling approach or both. In the next section, the proposed algorithm for the dynamic student profile component is described.


Fig. 3. Student modelling in e-learning systems

3 Capturing and Modelling Dynamic Student Profile

A study is conducted in this section to develop an effective algorithm to build and update SP learning styles frequently. To reach this goal, the proposed algorithm observes a student's behaviour while taking the course through an e-learning system. The agent then records the student's actions and uses the data to construct the SP.

Definition 1 (Student profile learning style). A student profile learning style is represented by a vector of real values that range from 0 to 1 (or from 0% to 100%) as follows:

LS = [act, ref, vis, ver, seq, glo, sen, int]        (1)

Example 1. Table 2 shows some examples of students' learning style vectors. As explained in Sect. 2.1, the SP learning style vector is calculated by quantifying the learner's responses to the ILS questionnaire [33].

Table 2. Examples of student learning style vectors

         act   ref   vis    ver    seq   glo   sen   int
Fatima   0.4   0.6   0.35   0.75   0.5   0.5   0.8   0.2
Ali      0.5   0.5   0.6    0.4    0.8   0.2   0.7   0.3
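For illustration, the sketch below shows one plausible way of converting ILS questionnaire scores, which range from −11 to +11 per dimension (Sect. 2.1), into the 0–1 entries of the LS vector of Eq. (1). The linear mapping and the function name are assumptions made for illustration; the paper does not specify the exact conversion used to obtain the vectors in Table 2.

```python
# Illustrative sketch (assumption): convert ILS scores in [-11, +11] into the
# [0, 1] learning style vector of Eq. (1). The linear mapping is not from the paper.

def ils_to_ls_vector(act_ref, vis_ver, seq_glo, sen_int):
    """Each argument is an ILS score from +11 (first pole) to -11 (second pole)."""
    def pair(score):
        first = (11 + score) / 22.0   # preference for the first pole of the dimension
        return first, 1.0 - first     # here the two poles of a dimension sum to 1

    act, ref = pair(act_ref)
    vis, ver = pair(vis_ver)
    seq, glo = pair(seq_glo)
    sen, int_ = pair(sen_int)
    return [act, ref, vis, ver, seq, glo, sen, int_]

# Example: a mildly active (+3), strongly visual (+9) learner
print(ils_to_ls_vector(3, 9, -1, 5))
```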

3.1 Proposed Architecture for a Dynamic Student Profile

The proposed model’s overall structure is presented in Fig. 4. The algorithm includes the following basic steps.


Fig. 4. Architecture for updating student profile dynamically

Step 1: Information retrieval. The system log-file is analysed to extract information about the student's behaviour, such as the time spent by her/him on each learning object (LO), the type of LOs accessed (e.g. video, audio, or text), and the number of messages exchanged (see Sect. 4).
Step 2: Dynamic checking sessions. In order to build the student profile, we need to observe, process and then learn from the student's behaviour patterns. Hence, in this step the patterns identified in Step 1 are converted into FSLSM learning style preferences. The result is a feature vector of FSLSM learning style preferences extracted from the student's learning behaviour patterns (see Sect. 5).
Step 3: Profile adaptation phase. The student is observed over a number of sessions and the corresponding feature vectors are aggregated to approximate the student's current learning style preferences (see Sect. 5).
Step 4: Updating the student profile. The current learning style obtained in Step 3 and the previous learning style are used to calculate a new student learning style, which is then stored in the student profile database. The new learning style will be used to provide the student with personalised course LOs the next time s/he logs in (see Sect. 6).
The following section describes in detail how the algorithm calculates a student's learning style.

4 Student Learning Behaviour Patterns

The algorithm in Fig. 5 calculates a student's learning style dynamically based on learning behaviour patterns. In order to ensure that the proposed algorithm is general enough to apply to any learning system, it was key to base the approach on generic behaviour patterns that can be collected in any learning system. Three variables are used to model these patterns: the first is the time spent on each LO; the second is the number of messages sent; and the third is the format of the LOs accessed during a session. If a student watches more than 50% of a video, then it is more likely that s/he is an active learner than a reflective one.

Table 3. Behaviour pattern based on learning object type and Felder and Silverman

Learning object   Behaviour pattern (pattern description)
Video             Time spent: total time spent on video content (based on the predefined video actual time)
Audio             Time spent: total time spent on audio content (based on the predefined audio actual time)
Simulation        Time spent: total time spent on simulation content (based on the predefined simulation actual time)
PPT               Time spent: total time spent on PPT content (based on the total session duration)
PDF and Doc       Time spent: total time spent on PDF content (based on the total session duration)
Summary           Time spent: total time spent on summary content (based on the total session duration)
Outline           Time spent: total time spent on outline content (based on the total session duration)

For each learning object, the learning style criteria are defined by a 50% threshold: reaching at least 50% indicates one pole of the relevant FSLSM dimension(s) and falling below 50% indicates the opposite pole, the specific poles depending on the learning object type (see also Table 6).

Conversely, if a student spends more than 50% of her/his time on textual documents (e.g. PDF or Doc documents), then s/he is more likely to be reflective than active. The behaviour pattern for time spent, Wi, is calculated as in Eq. (2) for each LOi. Table 3 shows the rules for deciding the learning styles corresponding to a given behaviour pattern.

Wi = ti / Ti   (for Video, Audio, Simulation)
Wi = ti / T    (for PDF, PPT, Doc, Summary, Outline)        (2)

where ti is the time spent on LOi, Ti is the total duration of LOi, and T is the total session duration.

Example 2. Table 4 shows the learning styles of three students calculated based on the time they have spent on LOs.

Table 4. An example of time spent

Student (time spent)   LO (length)      W             Learning style
Fatima (30 min)        Video (40 min)   30/40 = 75%   LS = Visual
Ali (30 min)           PDF (35 min)     30/35 = 85%   LS = Verbal
Clara (45 min)         Audio (60 min)   45/60 = 75%   LS = Verbal
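The time-spent pattern of Eq. (2), together with the 50% threshold of Table 3, can be sketched as follows (the paper's implementation is in C++; this Python sketch and its names are illustrative only):

```python
# Illustrative sketch of Eq. (2): the time-spent behaviour pattern W_i.
# Video, audio and simulation are normalised by the LO's own duration T_i;
# the other formats are normalised by the total session duration T.

OWN_DURATION_FORMATS = {"video", "audio", "simulation"}

def time_spent_pattern(lo_format, t_i, T_i, T):
    """t_i: time spent on LO_i; T_i: total duration of LO_i; T: session duration."""
    if lo_format.lower() in OWN_DURATION_FORMATS:
        return t_i / T_i
    return t_i / T  # PDF, PPT, Doc, Summary, Outline

# Example from Table 4: Fatima spends 30 min on a 40-min video
w = time_spent_pattern("video", 30, 40, 75)
print(w, "meets the >= 50% threshold" if w >= 0.5 else "is below the 50% threshold")
```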

The number of messages posted in the course discussion forum can indicate a student's tendency regarding social orientation, i.e. whether the student is active or reflective. We used the following Eq. (3):

D = (number of messages sent by the student) / (average number of messages sent during the sessions)        (3)

The value of D is undefined if no messages have been exchanged during the session. A greater value of D indicates a stronger positive level of active learning, since active students are likely to post messages more often than passive ones. If the value of D is greater than or equal to 1, the student is considered to be active; otherwise s/he is reflective.

Example 3. Table 5 illustrates the case where three students, Fatima, Ali, and Clara, have exchanged 100 messages throughout the session, i.e. 100/3 = 33.3 messages on average have been sent per student. Fatima is reflective, because she has sent fewer than the average number of messages, while Ali and Clara are active as they have sent more than the average during the session.

Table 5. An example of forum discussion

Student (no. of messages)   D                Learning style
Fatima (25 messages)        25/33.3 = 0.75   LS = Reflective
Ali (40 messages)           40/33.3 = 1.20   LS = Active
Clara (35 messages)         35/33.3 = 1.05   LS = Active

5

From Behaviour Patterns to Learning Styles

The learning style adaptation algorithm in Fig. 5 observes the student behaviours over K sessions and calculates in the vector KSSP the number of hits for each of the eight learning style attributes according to the student learning behaviour patterns, as explained in Sect. 4. For example, KSSP [1] is the number of hits for the learning style “active”, KSSP [2] is the number of hits for the learning style “reflective”, and KSSP [8] is the number of hits for the learning style “intuitive”, using the same indexing as in Eq. (1). At the end of the K sessions, the number of hits is normalised in each dimension as in Eq. (4) to obtain the current learning style vector KSSP N . ⎫ for i = 1 : 2 : 8 do ⎪ ⎪ ⎪ ⎪ if (KSSP [i] + KSSP [i + 1] = 0) ⎪ ⎪ ⎪ KSSP [i] ⎪ ⎪ KSSP N [i] = KSSP [i]+KSSP [i+1] ⎪ ⎪ ⎪ KSSP [i+1] ⎬ KSSP N [i + 1] = KSSP [i]+KSSP [i+1] ⎪ (4) else ⎪ ⎪ ⎪ KSSP N [i] = 0 ⎪ ⎪ ⎪ ⎪ KSSP N [i + 1] = 0 ⎪ ⎪ ⎪ ⎪ endif ⎪ ⎪ ⎭ enddo Tables 8 and 9 illustrate how the vectors KSSP and KSSP N are calculated.

Table 6. Mapping of learning object formats and activities to FSLSM ("1": relevant positive learning object; "0": relevant negative learning object; "0.5": relevant positive to both learning style criteria within the same dimension; "-": irrelevant learning object)

The table maps each learning object format and activity (Video, Audio, Presentation (PPT), PDF, Doc, Exercise, Group Assignment, Individual Assignment, Summary, Outline, Simulation, Forum) onto the four FSLSM learning style dimensions: Information Processing (Active/Reflective), Information Perception (Sensing/Intuitive), Information Input (Visual/Verbal) and Information Understanding (Sequential/Global). In the Information Processing dimension, for example, Video, Exercise, Group Assignment, Simulation and Forum are relevant to Active; Audio, PDF, Doc and Individual Assignment are relevant to Reflective; PPT is relevant to both (0.5/0.5); and Summary and Outline are irrelevant.

6 Updating Student Profile

The final step is to update the student profile SP dynamically using the learning style KSSP_N calculated during the K sessions. This is done by calculating the new value of SP as in Eq. (5).

for i = 1 : 8 do
  if (KSSP_N[i] ≠ 0)
    SP[i] = (SP[i] + KSSP_N[i]) / 2
  endif
enddo        (5)

That is, for each learning style attribute i, i = 1, 2, ..., 8, if KSSP_N[i] ≠ 0, the new learning style value is calculated by taking the average between the previous and current learning style values; otherwise, the previous learning style value is kept. The new student profile SP will be used to recommend personalised LOs to the student the next time s/he logs in, using the learning style based recommender system proposed in [46]. An important question remains, however: is the proposed profile adaptation algorithm effective in predicting student learning styles? The next section explains how the algorithm works by utilising a real academic case study.
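A minimal Python sketch of the update rule of Eq. (5) (illustrative; the stored profile format is assumed to be a plain list of eight values):

```python
# Illustrative sketch of Eq. (5): update the stored profile SP with the learning
# style KSSP_N observed over the last K sessions (average only where behaviour
# was observed; otherwise keep the previous value).

def update_profile(sp, kssp_n):
    new_sp = list(sp)
    for i in range(8):
        if kssp_n[i] != 0:
            new_sp[i] = (sp[i] + kssp_n[i]) / 2.0
    return new_sp

# Example with Fatima's previous SP (Table 2) and her KSSP_N (Table 8)
fatima_sp = [0.4, 0.6, 0.35, 0.75, 0.5, 0.5, 0.8, 0.2]
fatima_kssp_n = [0.63, 0.37, 0.17, 0.83, 1, 0, 0.5, 0.5]
print(update_profile(fatima_sp, fatima_kssp_n))
```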

7 Modelling a Real Academic Case Study

After developing the proposed algorithm, our efforts were focused on modelling a real academic case study to demonstrate how the proposed algorithm works. Participants in the study from AAST were asked to fill out the FSLSM questionnaire in order to initialise their profiles, as shown in Table 2.

7.1 Profile Adaptation Scenario

The student profile (SP) learning style adaptation algorithm works as follows:
Step 1 – Recommend course LOs according to their similarity to the SP learning styles, as explained in [46] and in Table 7.
Step 2 – Collect student behaviour during K sessions (in this case K = 5). Table 7 depicts a sample of Fatima's and Ali's learning behaviour patterns.
Step 3 – Apply the adaptation rules as illustrated in Tables 3 and 6. First, the algorithm calculates the total time spent by Fatima and Ali on LOs (see Sect. 4). Second, it calculates the number of messages sent by Fatima and Ali during the sessions, as illustrated in Table 7. The average number of messages is 200/8 = 25, where 8 is the total number of participants in the forum discussion (see Sect. 4). Then, KSSP and KSSP_N are calculated, as shown in Tables 8 and 9, for Fatima and Ali, respectively.
Step 4 – Update Fatima's and Ali's profiles with the new learning styles as shown in Tables 10 and 11.
In order to evaluate the effectiveness of the proposed algorithm, the updated SPs are used in the following section to predict students' ratings of learning objects.

8 Student Ratings Prediction Based on Student Learning Style

This section presents an experimental student rating prediction algorithm. We aim to show that our approach to dynamic SP adaptation can capture students' LSs effectively from their behaviour patterns and can be used as part of the e-learning recommendation system to provide them with more effective and accurate learning objects. Let LS be the learning style vector of an active student. The recommender algorithm comprises the following four steps:
Step 1 – Calculate C, the closest learning object cluster to the active student LS, by using the cosine similarity metric. This is obtained by calculating the level of similarity between LS and the centroid of each cluster. Subsequently, the cluster

Fig. 5. Updating student profiles dynamically flowchart

Table 7. Examples of students' behaviour patterns

Learning object      Format    Fatima: time spent / duration     Ali: time spent / duration
Session 1            PPT       30 min / 75 min (session)         45 min / 60 min (session)
Session 2            Outline   10 min / 30 min (session)         0 / 0
Session 3            Video     30 min / 40 min (video length)    15 min / 40 min (video length)
Session 4            Summary   0 / 0                             15 min / 32 min (session)
Session 5            PDF       45 min / 80 min (session)         10 min / 50 min (session)
Forum discussion               30 min; 30 messages sent          20 min; 18 messages sent

Total number of messages sent during the sessions: 200

Table 8. KSSP calculation for Fatima's behaviour

Fatima's behaviour: PPT = 30/75 = 0.4 (PPT < 50%); Outline = 10/30 = 0.35 (Outline < 50%); Video = 30/40 = 0.75 (Video >= 50%); Summary = 0 (no behaviour); PDF = 45/80 = 0.56 (PDF >= 50%); Messages = 30/25 = 1.2 (D >= 50%)

          Active   Reflective   Visual   Verbal   Sequential   Global   Sensing   Intuitive
KSSP      2.5      1.5          0.5      2.5      3            0        1         1
KSSP_N    0.63     0.37         0.17     0.83     1            0        0.5       0.5

Table 9. KSSP calculation for Ali's behaviour

Ali's behaviour: PPT = 45/60 = 0.4 (PPT < 50%); Outline = 0 (no behaviour); Video = 15/40 = 0.37 (Video < 50%); Summary = 15/32 = 0.46 (Summary < 50%); PDF = 10/50 = 0.20 (PDF < 50%); Messages = 18/25 = 0.72 (D < 50%)

          Active   Reflective   Visual   Verbal   Sequential   Global   Sensing   Intuitive
KSSP      1.5      2.5          1.5      0.5      2            1        2         0
KSSP_N    0.37     0.63         0.75     0.25     0.67         0.33     1         0

that produces the highest degree of similarity is chosen using cosine similarity, as defined in Eq. (6):

c(x, y) = (x · y) / (||x|| · ||y||)        (6)


Table 10. Fatima's updated learning style profile

                   act    ref    vis    ver    seq    glo    sen    int
Fatima's SP        0.4    0.6    0.35   0.75   0.5    0.5    0.8    0.2
KSSP_N             0.63   0.37   0.17   0.83   1      0      0.5    0.5
Fatima's new SP    0.52   0.48   0.21   0.79   0.75   0.25   0.65   0.35

Table 11. Ali's updated learning style profile

                   act    ref    vis    ver    seq    glo    sen    int
Ali's SP           0.5    0.5    0.6    0.4    0.8    0.2    0.7    0.3
KSSP_N             0.37   0.63   0.75   0.25   0.67   0.33   1      0
Ali's new SP       0.44   0.56   0.68   0.32   0.73   0.27   0.85   0.15

Step 2 – Calculate the level of similarity between LS and each LO in C with the Pearson correlation coefficient similarity metric. For all OP ∈ C, calculate P(LS, OP) using the Pearson correlation coefficient, as in Eq. (7):

P(x, y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )        (7)

Step 3 – Select the top-n LOs most similar to LS. The number of selected LOs can be a fixed constant or determined using a similarity threshold.
Step 4 – Predict the student's rating of the top-n learning objects. A 5-level Likert scale is considered, where 1 is the lowest score and 5 is the highest score. Equation (8) is used to predict the student's ratings of LOs:

r̃(LS, OP_i) = int(0.5 + Sim²(LS, OP_i) × 5),   1 ≤ i ≤ n        (8)

where r̃(LS, OP_i) denotes the predicted rating of the learning object (profile) OP_i for the active student LS, and int(x) is the closest integer to the real value x, e.g. int(2.3) = 2 and int(2.5) = 3. We then compare the students' actual ratings of LOs with the predicted ratings based on the questionnaire responses (initial student profile, SL1s) and the predicted ratings based on the adapted student profile (LS2s). The next section presents the results of the experiments.
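The four prediction steps can be sketched as follows; the clusters, centroids and learning object profiles below are random placeholders rather than data from the study, and the sketch is illustrative only:

```python
import numpy as np

# Illustrative sketch of the rating-prediction steps (Eqs. 6-8). The clusters,
# centroids and learning object (LO) profiles below are random placeholders.

def cosine(x, y):                                    # Eq. (6)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):                                   # Eq. (7)
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def predict_ratings(ls, clusters, top_n=3):
    # Step 1: closest cluster C to LS, by cosine similarity with each centroid
    best = max(clusters, key=lambda c: cosine(ls, c["centroid"]))
    # Step 2: Pearson similarity between LS and every LO profile in C
    scored = [(op, pearson(ls, op)) for op in best["objects"]]
    # Step 3: keep the top-n most similar LOs
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 4: predicted rating on the 1-5 Likert scale, Eq. (8)
    return [(op, int(0.5 + sim ** 2 * 5)) for op, sim in scored[:top_n]]

ls = [0.52, 0.48, 0.21, 0.79, 0.75, 0.25, 0.65, 0.35]     # Fatima's updated SP
rng = np.random.default_rng(0)
clusters = [{"centroid": rng.random(8),
             "objects": [list(rng.random(8)) for _ in range(5)]} for _ in range(3)]
print(predict_ratings(ls, clusters))
```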

9 Experimentation

In this section, we describe the set of experiments we conducted to set the parameters and examine our proposed algorithm's effectiveness in terms of updated SP accuracy. The profile adaptation algorithm and the student ratings prediction algorithm were implemented in C++ using Visual Studio and Windows Presentation Foundation (WPF) to design the graphical user interface (GUI), as shown in Figs. 6 and 7. All experiments were run on a Windows-based PC with an Intel Core i5 processor running at 2.40 GHz and 16 GB of RAM.

44

S. M. Nafea et al.

Fig. 6. Dynamic student profile adaptation interface

Fig. 7. Student ratings prediction interface

9.1 Dataset Description

In order to measure the effectiveness of the proposed algorithm, we conducted an experiment at the School of Business at AAST. The learners' behaviour data were gathered from the AAST MOODLE log-file. Eighty students in total participated in this study. It should be noted that this dataset is of a similar size to, or larger than, those used in related works: [42] used 75–95 students, [48] utilised 77, [47] involved 75, [48] used 40, and [18] had 49 students. First, the students filled in the ILS questionnaire developed by FSLSM, as explained in [33]. Second, we collected the students' behaviour patterns throughout the course on different LO formats, as explained in [54]. During the course, the students were asked to rate each LO using a 5-point Likert scale, with 1 = "not at all useful" and 5 = "very useful" to their learning.

9.2 Evaluation Metrics

To evaluate the accuracy of the adaptation algorithm, two metrics were used to measure the preciseness of the learning style detection: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). Both metrics calculate the average magnitude of the errors in a set of predictions, without considering their direction. They range from 0 to ∞ and the smaller their value, the greater the accuracy. They are defined by Eqs. (9) and (10), where r_i is the actual student rating of the LO i and r̃_i is the predicted student rating for that learning object, 1 ≤ i ≤ n.

MAE = (1/n) Σ_{i=1}^{n} |r_i − r̃_i|        (9)

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (r_i − r̃_i)² )        (10)

RMSE is useful for detecting large errors in the prediction that might not be identified through MAE.

9.3 Experimental Results and Discussion

The proposed adaptation algorithm improved the prediction results in all learning style dimensions for a random sample of 80 students studying three topics, each with three lessons, which was not the case with the ILS. Based on these findings, it can be concluded that monitoring learners' behaviour, namely the time spent on different LO formats, enhances the accuracy of detecting students' learning styles in an e-learning recommendation system. Thus, this adaptation algorithm can help such a system to improve student performance by recommending the most suitable course LOs that match their learning styles. The experimental results have shown that the student rating prediction algorithm has the

Fig. 8. Adaptation algorithm accuracy using MAE


Fig. 9. Adaptation algorithm accuracy using RMSE

best accuracy when the student profile is adapted through behaviour (LS2s) compared to the ILS questionnaire (SL1s), as can be seen in Figs. 8 and 9, which use MAE and RMSE, respectively, as the measurement tools.

10 Critical Literature Overview

This section presents a review of the literature relevant to this study, providing a brief overview of the common techniques used for automatic student LS identification based on learning behaviour patterns, as shown in Table 12.

Table 12. A comparison between existing learning style detection approaches (the algorithms compared include decision trees, hidden Markov models, Bayesian networks, rule-based methods, support vector machines, dynamic Bayesian networks, naive Bayes, neuro-fuzzy networks and genetic algorithms)

Study   Description
[18]    Learners' preferences are diagnosed based on their behaviour in order to customise the user interface
[49]    Presents an adaptation mechanism to adapt the learner model dynamically
[12]    Presents a new mechanism for learner profile adaptation according to their behaviour
[50]    Presents a new technique for identifying students' learning styles based on their behaviour patterns
[51]    Presents a technique for building a student model in an e-learning environment using learning styles
[52]    Builds a student profile according to learning styles
[53]    Presents intelligent systems for automatic learning style identification to provide a personalised learning environment

11 Conclusions and Future Work

Personalised learner profiles are increasingly becoming an important area of research in e-learning RSs, for which each learner's preferences, interests, and contextual information are studied in detail. The characteristics of an LS play a vital role in the identification of a learner's LS preferences. In this paper, an algorithm for a dynamic student profile based on the FSLSM has been introduced. It is aimed at building and frequently updating SPs based on student behaviour during an online course. It relies on three major steps: the first extracts student learning behaviour patterns that reflect LSs from the MOODLE log file; the second captures student LSs through a quantitative method; and the third updates students' LSs dynamically after each topic. The experimental results have demonstrated that the proposed algorithm is more accurate than the ILS questionnaire. In future work, we will extend the algorithm to update SPs further based on similar students' LSs and ratings. Furthermore, the collection of patterns is planned to be extended with additional types of LOs and activities.

References 1. Chen, C.C., Chen, M.C., Sun, Y.: Pva: a self-adaptive personal view agent. J. Intell. Inf. Syst. 18(2–3), 173–194 (2002) 2. Challam, V., Gauch, S., Chandramouli, A.: Contextual search using ontologybased user profiles. In: Large Scale Semantic Access to Content (Text, Image, Video, and Sound). LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE 612–617 (2007) 3. Montaner, J., Alvarez-Sab´ın, J., Molina, C., Angl´es, A., Abilleira, S., Arenillas, J., Gonz´ alez, M.A., Monasterio, J.: Matrix metalloproteinase expression after human cardioembolic stroke: temporal profile and relation to neurological impairment. Stroke 32(8), 1759–1766 (2001) 4. Keefe, J.W.: Learning Style Theory and Practice. ERIC, Fukuchiama (1987) 5. Bajraktarevic, N., Hall, W., Fullick, P.: Incorporating learning styles in hypermedia environment: empirical evaluation. Proc. Workshop Adapt. Hypermedia Adapt. Web-Based Syst. 2003, 41–52 (1999) 6. Felder, R.M., Silverman, L.K., et al.: Learning and teaching styles in engineering education. Eng. Educ. 78(7), 674–681 (1988) 7. Graf, S., Lan, C.H., Liu, T.-C., et al.: Investigations about the effects and effectiveness of adaptivity for students with different learning styles. In: Ninth IEEE International Conference on Advanced Learning Technologies, ICALT 2009, IEEE, 2009, 415–419 (2009) 8. Alfonseca, E., Carro, R.M., Mart´ın, E., Ortigosa, A., Paredes, P.: The impact of learning styles on student grouping for collaborative learning: a case study. User Modeling and User-Adap. Interact. 16(3–4), 377–401 (2006) 9. Graf, S., Liu, T.-C., Kinshuk, K.: Interactions between students learning styles, achievement and behaviour in mismatched courses. In: Proceedings of the International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2008). IADIS International Conference, pp. 223–230. Citeseer (2008)


10. Taylor, W.A.: Computer-mediated knowledge sharing and individual user differences: an exploratory study. Euro. J. Inf. Syst. 13(1), 52–64 (2004) 11. Graf, S.: Using cognitive traits for improving the detection of learning styles. In: Workshop on Database and Expert Systems Applications (DEXA), et al., pp. 74– 78. IEEE 2010 (2010) 12. Graf, S., Viola, S.: Automatic student modelling for detecting learning style preferences in learning management systems. In: Proceedings International Conference on Cognition and Exploratory Learning in Digital Age, pp. 172–179 (2009) 13. Ahmad, N., Tasir, Z., Kasim, J., Sahat, H.: Automatic detection of learning styles in learning management systems by using literature-based method. Procedia-Soc. Behav. Sci. 103, 181–189 (2013) 14. Dung, P.Q., Florea, A.M.: A literature-based method to automatically detect learning styles in learning management systems. In: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, p. 46. ACM (2012) 15. Graf, S., Liu, T.-C., et al.: Identifying learning styles in learning management systems by using indications from students’ behaviour. In: Eighth IEEE International Conference Advanced Learning Technologies, ICALT 2008, pp. 482–486. IEEE 2008 (2008) 16. Garc´ıa, P., Amandi, A., Schiaffino, S., Campo, M.: Evaluating bayesian networks precision for detecting students learning styles. Comput. Educ. 49(3), 794–808 (2007) 17. Atman, N., Inceo˘ glu, M.M., Aslan, B.G.: Learning styles diagnosis based on learner behaviors in web based learning. In: International Conference on Computational Science and Its Applications. Springer, pp. 900–909 (2009) 18. Cha, H.J., Kim, Y.S., Park, S.H., Yoon, T.B., Jung, Y.M., and J.-H. Lee: Learning styles diagnosis based on user interface behaviors for the customization of learning interfaces in an intelligent tutoring system. In: International Conference on Intelligent Tutoring Systems, pp. 513–524. Springer (2006) ¨ Atman, N., Inceo˘ ˙ 19. S ¸ im¸sek, O., glu, and Y. D. ArikanM, M.: Diagnosis of learning styles based on active/reflective dimension of felder and silvermans learning style model in a learning management system. In: International Conference on Computational Science and Its Applications, pp. 544–555. Springer (2010) 20. Carver, C.A., Howard, R.A., Lane, W.D.: Enhancing student learning through hypermedia courseware and incorporation of student learning styles. IEEE Trans. Educ. 42(1), 33–38 (1999) 21. Schiaffino, S., Garcia, P., Amandi, A.: eteacher: providing personalized assistance to e-learning students. Comput. Educ. 51(4), 1744–1754 (2008) 22. Popescu, E.: Diagnosing students’ learning style in an educational hypermedia system. In: Cognitive and Emotional Processes in Web-Based Education: Integrating Human Factors and Personalization, pp. 187–208. IGI Global (2009) 23. Akbulut, Y., Cardak, C.S.: Adaptive educational hypermedia accommodating learning styles: a content analysis of publications from 2000 to 2011. Comput. Educ. 58(2), 835–842 (2012) 24. Zywno, M.S.: A contribution of validation of score meaning for felder-soloman’s’, In: Index of Learning Styles, Proceedings : Annual ASEE Conference, p. 2003. ASEE, Citeseer (2003) 25. Felder, R.M., Spurlin, J.: Applications, reliability and validity of the index of learning styles. Int. J. Eng. Educ. 21(1), 103–112 (2005) 26. Brown, B.L.: Learning styles and vocational education practice. practice application brief (1998)


27. Kolb, D.: Experiential learning as the science of learning and development (1984) 28. Honey, P., Mumford, A.: The Manual of Learning Styles. Mcgraw-hill, Maidenhead (1982) 29. Dunn, R., Dunn, K.: Learning style as a criterion for placement in alternative programs. The Phi Delta Kappan 56(4), 275–278 (1974) 30. Myers, I.B.: The myers-briggs type indicator: Manual (1962) 31. Liu, T.-C., Graf, S., et al.: Coping with mismatched courses: students behaviour and performance in courses mismatched to their learning styles. Educ. Technol. Res. Dev. 57(6), 739 (2009) 32. Felder, M.R., Soloman, B.A.: Index of learning styles questionnaire (1999). https:// www.webtools.ncsu.edu/learningstyles/, Accessed 05 Oct 2018 33. Nafea, S., Siewe, F., He, Y.: Ulearn: personalised learners profile based on dynamic learning style questionnaire. In: Proceedings of Intelligent Systems (IntelliSys). pp. 1257–1264. IEEE (2018) 34. Pe˜ na, C.-I., Marzo, J.-L., de la Rosa, J.-L.: Intelligent agents in a teaching and learning environment on the web. In: Proceedings of the International Conference on Advanced Learning Technologies, pp. 21–27. NZ, IEEE Learning Technology Task Force. Palmerston North (2002) 35. Popescu, E., Badica, C., Moraret, L.: Accommodating learning styles in an adaptive educational system. Informatic 34(4), 451–462 (2010) 36. Paredes, P., Rodriguez, P.: A mixed approach to modelling learning styles in adaptive educational hypermedia. Adv. Technol. Learn. 1(4), 210–215 (2004) 37. Baldiris, S., Fabregat, R., Mej´ıa, C., G´ omez, S.: Adaptation decisions and profiles exchange among open learning management systems based on agent negotiations and machine learning techniques. In: International Conference on HumanComputer Interaction, pp. 12–20. Springer (2009) 38. Wolf, C.: iweaver: towards learning style-based e-learning in computer science education. In: Proceedings of the fifth Australasian Conference on Computing Education, vol. 20, pp. 273–279. Australian Computer Society, Inc. (2003) 39. Imran, H., Belghis-Zadeh, M., Chang, T.-W., Graf, S., et al.: Plors: a personalized learning object recommender system. Vietnam J. Comput. Sci. 3(1), 3–13 (2016) 40. Graf, S.: Adaptivity in learning management systems focussing on learning styles (2007) 41. Klaˇsnja-Mili´cevi´c, A., Vesin, B., Ivanovi´c, M., Budimac, Z.: E-learning personalization based on hybrid recommendation strategy and learning style identification. Comput. Educ. 56(3), 885–899 (2011) 42. Latham, A., Crockett, K., McLean, D., Edmonds, B.: A conversational intelligent tutoring system to automatically predict learning styles. Comput. Educ. 59(1), 95–109 (2012) 43. Araniti, G., De Meo, P., Iera, A., Ursino, D.: Adaptively controlling the qos of multimedia wireless applications through “user profiling” techniques. IEEE J. Sel. Areas Commun. 21(10), 1546–1556 (2003) 44. Thompson, J.E.: Student Modeling in an Intelligent Tutoring System. Tech. Rep, Air Force Inst of Tech Wright-Patterson AFB OH (1996) 45. Graf, S., Ives, C., et al.: A flexible mechanism for providing adaptivity based on learning styles in learning management systems. In: 2010 IEEE 10th International Conference on Advanced Learning Technologies (ICALT), pp. 30–34. IEEE (2010) 46. Nafea, S., Siewe, F., He, Y.: A novel algorithm for course learning object recommendation based on student learning styles (2019)


47. Bernard, J., Chang, T.-W., Popescu, E., Graf, S.: Learning style identifier: improving the precision of learning style identification through computational intelligence algorithms. Expert Syst. Appl. 75, 94–108 (2017) 48. Özpolat, E., Akar, G.B.: Automatic detection of learning styles for an e-learning system. Comput. Educ. 53(2), 355–367 (2009) 49. Alkhuraiji, S., Cheetham, B., Bamasak, O.: Dynamic adaptive mechanism in learning management system based on learning styles. In: 2011 11th IEEE International Conference on Advanced Learning Technologies (ICALT), pp. 215–217. IEEE (2011) 50. Amir, E.S., Sumadyo, M., Sensuse, D.I., Sucahyo, Y.G., Santoso, H.B.: Automatic detection of learning styles in learning management system by using literature-based method and support vector machine, pp. 41–144. IEEE Press (2016) 51. Kelly, D., Tangney, B.: 'First aid for you': getting to know your learning style using machine learning. In: Fifth IEEE International Conference on Advanced Learning Technologies, ICALT 2005, pp. 1–3. IEEE (2005) 52. Carmona, C., Castillo, G., Millán, E.: Designing a dynamic Bayesian network for modeling students' learning styles. In: 2008 Eighth IEEE International Conference on Advanced Learning Technologies, pp. 346–350. IEEE (2008) 53. Zatarain, R., Barrón-Estrada, L., Reyes-García, C.A., Reyes-Galaviz, O.F.: Applying intelligent systems for modeling students' learning styles used for mobile and web-based systems. In: Soft Computing for Intelligent Control and Mobile Robotics, pp. 3–22. Springer (2010) 54. Nafea, S.M., Siewe, F., He, Y.: A novel algorithm for course learning object recommendation based on student learning styles. In: 2019 International Conference on Innovative Trends in Computer Engineering (ITCE), pp. 192–201. IEEE (2019)

Exploring Transfer Learning for Low Resource Emotional TTS

Noé Tits(B), Kevin El Haddad, and Thierry Dutoit

Numediart Institute, University of Mons, 7000 Mons, Belgium
{noe.tits,kevin.elhaddad,thierry.dutoit}@umons.ac.be

Abstract. During the last few years, spoken language technologies have improved considerably thanks to Deep Learning. However, Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. In particular, modeling the variability in speech of different speakers, different styles or different emotions with little data remains challenging. In this paper, we investigate how to leverage fine-tuning on a pre-trained Deep Learning-based TTS model to synthesize speech with a small dataset of another speaker. We then investigate the possibility of adapting this model to emotional TTS by fine-tuning the neutral TTS model with a small emotional dataset.

Keywords: Speech synthesis · Emotion · Deep learning · Transfer learning · Fine-tuning

1 Introduction

The current state of the art in Text-to-Speech (TTS) synthesis is based on deep learning algorithms. These systems are now capable of producing natural, human-like speech. More and more deep learning-based TTS systems are being developed. The Merlin toolkit [1] has played an important part in this development, first with simple architectures based on stacks of fully connected layers and then with more complex ones such as Recurrent Neural Networks (RNNs). The most recent TTS systems, such as Wavenet [2], Tacotron [3], WaveRNN [4], Char2Wav [5] and Deep Voice [6], achieve excellent results in terms of naturalness. However, they require tens of hours of speech data and a lot of computational power. A first system that aims to synthesize speech with little computational power is Deep Convolutional TTS [7] (DCTTS). This system is based only on Convolutional Neural Networks (CNNs) and avoids using RNNs, which are known to be difficult to train due to the vanishing gradient issue during gradient descent [8]. In their experiments, the authors of DCTTS were able to train their model in 15 h using a standard PC with two GPUs, resulting in nearly acceptable speech synthesis. Moreover, it is difficult to have fine control over speech quality and emotional content with such systems, while this has become an important challenge in


speech synthesis. Here again, data availability is an issue. Indeed, high quality speech datasets with the emotional content needed for speech synthesis are quite difficult to collect. The amount of data available is therefore relatively limited compared to what deep learning algorithms require to converge. Promising methods to tackle the problem of the quantity of data are those related to knowledge transfer, such as transfer learning [9], fine-tuning and multi-task learning. These techniques have proved useful in various applications of deep learning. In the field of Motion Capture and Analysis, [10] mapped a motion sequence to an RGB image to be able to use a CNN pre-trained for image classification in their motion classification task. They showed that fine-tuning the CNN on their motion data improved classification results. In a previous work of ours [11], we used a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We showed that the mapping between speech and text learned by the ASR system contains information useful for emotion recognition. In the TTS field as well, transfer learning is being investigated. In [12], knowledge was successfully transferred from a model trained to discriminate between speakers to a multi-speaker TTS model. These examples motivate our interest in investigating the use of knowledge transferability between models. Our goal is to use this to tackle the inherent problem of the amount of data needed in deep learning in the case of emotional TTS, which is a topic of growing interest. We can cite [13], which experimented with an unsupervised learning technique to change the prosody of synthesized sentences with style tokens, or [14], which modified Tacotron's architecture to synthesize speech given emotional labels. In this paper, we explain how to leverage fine-tuning on deep learning-based TTS systems to synthesize emotional speech with a small emotional speech dataset.

2 System

Our goal is to study the feasibility of fine-tuning a TTS system pre-trained on a big dataset with few new data and to analyze how well the model is able to fit them. This section describes the whole system; Fig. 1 represents its overall idea. First, in Sect. 2.1, we present the TTS system used as a basis for fine-tuning. We then briefly present the dataset we are using in Sect. 2.2. In Sect. 2.3, we explain the pre-processing of our dataset. Finally, in Sect. 2.4, we detail the fine-tuning procedure applied to obtain emotional TTS models.

2.1 TTS System

The number of state-of-the-art deep learning-based TTS systems is growing. To carry out our experiments, we chose DCTTS [7], a system that seems to combine the advantages of several systems. DCTTS models a sequence-to-sequence problem with an encoder-decoder structure along with an attention mechanism


Fig. 1. Block diagram of the system

like Tacotron [3]. However, unlike Tacotron, the modules of the architecture are all CNN-based and there is no RNN component. In [7], an open source implementation of Tacotron was compared to DCTTS, and DCTTS obtained a higher Mean Opinion Score (MOS). In this work, we use the TensorFlow implementation provided in [15]. There are two modules trained separately: Text2Mel and SSRN (Spectrogram Super-resolution Network). Text2Mel takes care of the mapping between character embeddings and the output of Mel Filter Banks (MFBs) applied on the audio signal, that is, a mel-spectrogram. The second module, SSRN, then maps the mel-spectrogram to a full resolution spectrogram. Finally, Griffin-Lim is used as a vocoder. The Text2Mel module models the sequence-to-sequence task. It is composed of a Text Encoder, an Audio Encoder, an Attention Mechanism, and an Audio Decoder.

2.2 Dataset Used

The dataset used in this work is EmoV-DB: The Database of Emotional Voices [16], which is available online1. EmoV-DB contains sentences uttered by male and female actors in English and by a male actor in French. The English actors were recorded in two different anechoic chambers of the Northeastern University campus, while the French actor was recorded in an anechoic chamber of the University of Mons. Each actor was asked to utter a subset of the sentences from the CMU Arctic [17] database for the English speakers and from the SIWIS [18] database for the French speaker. The actors uttered these sentences with five emotion classes, making it possible to build synthesis and voice transformation systems. For every speaker, the different emotions were recorded in different sessions. In this work, we used one of the English actresses to perform emotion adaptation of the TTS system. The experiments performed on this dataset also assess its usability with deep learning algorithms for voice generation systems.

1 https://github.com/numediart/EmoV-DB

2.3 Pre-processing

Important aspects of the pre-processing needed to use this model are:
– the sample frequency;
– the trimming of silences at the beginning and end of audio files;
– the removal of non-verbal expressions (laughter, yawns, etc.).
As the model was trained with the LJ-Speech database at a sample frequency of 22,050 Hz, we should use the same with our database. The trimming of silences is important because the model uses guided attention [7]. Guided attention helps the attention mechanism by assuming that the ordering of characters is almost linearly related to the time in the audio file. This is true only if the speech begins at the start of the file without a silence. We found that without this trimming, the synthesized sentences often omitted the first words of the text to pronounce. The implementation of [15] already uses trimming with the librosa library. However, we noticed that the default parameters were not suited to our database and changed top_db to 20 dB. The same problem occurs due to the non-verbal vocalizations (NVV) [19] present in the audio files, such as laughs, yawns or sighs. Indeed, in these cases the hypothesis behind guided attention is not verified either. To overcome this problem, we first manually selected utterances without such NVV for the amused dataset (156 utterances) and the sleepy dataset (361 utterances). To increase the amount of amused sentences, we manually trimmed the laughs of 82 of the remaining amused sentences to obtain a total of 238 utterances.
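A minimal sketch of this pre-processing with the librosa library (the file paths are illustrative, and the manual selection and trimming of NVV is not shown):

```python
import librosa
import soundfile as sf

# Illustrative sketch of the pre-processing: resample to 22,050 Hz and trim
# leading/trailing silence with a 20 dB threshold. The file paths are examples.

def preprocess(in_path, out_path, sr=22050, top_db=20):
    y, _ = librosa.load(in_path, sr=sr)              # load and resample to 22.05 kHz
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    sf.write(out_path, y_trimmed, sr)

preprocess("amused_0001.wav", "amused_0001_trimmed.wav")
```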

2.4 Fine-Tuning

In this section, we explain how we leveraged knowledge transfer for TTS by fine-tuning a part of a pre-trained model on our small dataset. The pre-training of the model was done using the LJ-Speech dataset. This dataset is available online2 and contains 23.9 h of speech uttered by a single female speaker. The fine-tuning was done with the dataset described in Sect. 2.2. There are several possibilities for how we could fine-tune the model. First, we can choose which parts of the pre-trained model we want to fine-tune with the new dataset and which parts we want to keep fixed. The second part of the model, SSRN, does the mapping between MFBs and the full spectrogram. Therefore, it should not depend on the speaker identity or speaking style, as it is just trained to do the mapping between two audio features. However, as the model has been pre-trained on one speaker, there is a possibility of over-fitting on the characteristics of that specific speaker. The first question we want to answer is whether the SSRN can generalize the mapping to other speaking styles.

2 https://keithito.com/LJ-Speech-Dataset


As for the Text2Mel module, it is composed of a Text Encoder, an Audio Encoder, an Attention Mechanism, and an Audio Decoder. As the text does not depend on the characteristics of the speaker or his/her speaking style, we tried to train only the audio part. However, we found that there were some problems of rhythm in the synthesized speech. We believe this is because the Attention module is not adapted to the new speaking style. As a consequence, we chose to fine-tune the entire Text2Mel module.
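The selective fine-tuning described above amounts to restoring the pre-trained weights and passing only the Text2Mel variables to the optimiser. The toy TensorFlow sketch below illustrates that mechanism only; the variable scopes, loss, learning rate and checkpoint handling are assumptions and this is not the actual DCTTS training code:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Toy graph standing in for Text2Mel (fine-tuned) and SSRN (kept fixed).
with tf.compat.v1.variable_scope("Text2Mel"):
    w_t2m = tf.compat.v1.get_variable("w", initializer=1.0)
with tf.compat.v1.variable_scope("SSRN"):
    w_ssrn = tf.compat.v1.get_variable("w", initializer=1.0)

loss = tf.square(w_t2m - 3.0) + tf.square(w_ssrn - 3.0)

# Pass only the Text2Mel variables to the optimiser: SSRN keeps its restored value.
t2m_vars = tf.compat.v1.get_collection(
    tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES, scope="Text2Mel")
train_op = tf.compat.v1.train.AdamOptimizer(0.1).minimize(loss, var_list=t2m_vars)

saver = tf.compat.v1.train.Saver()  # would restore the LJ-Speech checkpoint in practice
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    # saver.restore(sess, "path/to/pretrained/checkpoint")   # illustrative path
    for _ in range(100):
        sess.run(train_op)
    print(sess.run([w_t2m, w_ssrn]))   # only the Text2Mel weight has moved
```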

3 Experiment

In this section, we detail the experiments performed on the system. In the first experiment, we evaluate the usefulness of the fine-tuning technique compared to a random initialization of the parameters of the model. The evaluation is based on a measure of the intelligibility of the synthesized speech in terms of word accuracy, as proposed in [20]. In the second experiment, we evaluate the quality of the emotional speech synthesized through a MOS test for each emotion, according to the confidence in the perception of the emotion specified. The amount of speech data used for the experiments is shown in Table 1. Durations are rounded to the minute. The values between parentheses correspond to the amount of data before selection and NVV removal.

Table 1. Amount of data available for each emotion in terms of total duration and number of utterances

            Total duration [min]   Number of utterances
Amused      15 (20)                238 (296)
Angry       19                     304
Disgusted   29                     303
Neutral     23                     357
Sleepy      36 (51)                361 (496)

3.1 Objective Measures

In this experiment, we synthesized 100 of the Harvard sentences [21] with several models. An objective measure of the intelligibility of every sentence was then computed in terms of word accuracy [20]. The measure consists of using an ASR system to recognize the synthesized speech and computing a word accuracy by comparing the result to the text label. The mean word accuracy with a 95% confidence interval for all models is summarized in Table 2. The first line shows the word accuracy of the pre-trained model (using LJ-Speech). The second line corresponds to the model trained only on the neutral subset. Finally, the third line corresponds to the pre-trained model fine-tuned on the neutral subset.


Table 2. Intelligibility in terms of word accuracy

                                  Word accuracy
LJ-speech                         0.630 ± 0.042
Neutral (random initialization)   0.004 ± 0.004
Neutral (fine-tuning)             0.517 ± 0.048

These measures allow us to compare the fine-tuning of the parameters of a pre-trained model with a random initialization of the parameters of the model. The experiments clearly show that the model trained with the neutral subset of 20 min is unable to generate intelligible speech if the parameters are randomly initialized. However, if the parameters are initialized from the pre-trained model and then fine-tuned, the synthesized speech is intelligible.
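For reference, a word accuracy of the kind used above can be computed from a reference text and an ASR transcript as 1 minus the word error rate; the sketch below is an illustration of such a measure and is not the exact tool used in [20]:

```python
# Illustrative sketch: word accuracy between a reference text and an ASR transcript,
# computed here as 1 - WER (word-level edit distance divided by reference length).

def word_accuracy(reference, hypothesis):
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("the birch canoe slid on the smooth planks",
                    "the birch canoe slid on smooth planks"))
```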

3.2 Perception Tests

After fine-tuning from the neutral model to the emotional models, we synthesized 5 sentences not seen during training with each of these models. These sentences were used in a Mean Opinion Score (MOS) test. During this test, the participants were asked to complete a form. This survey contained 5 sections, each dedicated to one emotion. In every section, the participants were asked to rate utterances between 0 and 5 for the confidence in the perception of the emotion specified (0 = we cannot hear the emotion specified, 5 = we perfectly hear the emotion specified). This test was performed on both original files from the dataset and synthesized files. Table 3 gives the MOS with a 95% confidence interval for the original files and Table 4 gives them for the synthesized files.

Table 3. MOS test results of original files

            Confidence
Amused      4.60 ± 0.20
Angry       4.22 ± 0.25
Disgusted   3.28 ± 0.27
Neutral     4.37 ± 0.23
Sleepy      3.80 ± 0.27

The results in Table 3 should be considered as upper bounds, as they represent the opinions about the original files of the dataset. The results from Table 4 should be compared to these upper bounds. One can see that even these upper bounds do not reach the maximum value of 5.

Table 4. MOS test results of synthesized files

            Confidence
Amused      2.00 ± 0.27
Angry       2.10 ± 0.28
Disgusted   2.27 ± 0.30
Neutral     3.59 ± 0.24
Sleepy      3.29 ± 0.26

The confidence in the perception of an emotion is affected by the recording quality, the playback quality and the emotion expressed. Moreover, the test was performed with non-native English speakers. The experiment in [22] showed a disproportionate disadvantage for non-native English speakers when listening to synthesized speech compared to native English speakers. In Table 4, the Neutral category has the highest confidence value. A possible explanation is that the pre-trained model was trained on a neutral corpus and is therefore closer to the Neutral subset used for fine-tuning. For the other emotional categories, the values are somewhat degraded but still significantly higher than 0, which shows that the emotion is perceptible in the synthesized speech.

4 Conclusions and Future Work

In this paper, we present a technique for synthesizing emotional speech using a small emotional speech dataset. The technique first fine-tunes a deep learning-based TTS model on the neutral subset of the small dataset and then fine-tunes the resulting model on each emotional subset to obtain one model per emotional category. In the first experiment, we show that training the model with randomly initialized parameters gives completely unintelligible speech, whereas initializing from a pre-trained model and fine-tuning it yields intelligible speech. In the second experiment, we show through perception tests that the synthesized speech is perceived as emotional. To improve these results, we plan to explore a multi-speaker and multi-emotion system able to share knowledge of the emotional content across the several speakers and emotions of EmoV-DB. The approach in [12] seems promising for this purpose.

Acknowledgments. Noé Tits is funded through a PhD grant from the Fonds pour la Formation à la Recherche dans l'Industrie et l'Agriculture (FRIA), Belgium.


References 1. Wu, Z., Watts, O., King, S.: Merlin: an open source neural network speech synthesis system. In: Proceedings SSW, Sunnyvale, USA (2016) 2. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., Kavukcuoglu, K.: Wavenet: a generative model for raw audio. In: SSW (2016) 3. Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q.V., Agiomyrgiannakis, Y., Clark, R., Saurous, R.A.: Tacotron: towards end-to-end speech synthesis. In: Interspeech (2017) 4. Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., Kavukcuoglu, K.: Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435 (2018) 5. Sotelo, J., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., Courville, A., Bengio, Y.: Char2wav: end-to-end speech synthesis. In: ICLR2017 Workshop Submission (2017) 6. Arik, S.O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al.: Deep voice: real-time neural text-to-speech. arXiv preprint arXiv:1702.07825 (2017) 7. Tachibana, H., Uenoyama, K., Aihara, S.: Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. arXiv preprint arXiv:1710.08969 (2017) 8. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertainty, Fuzziness Knowl.-Based Syst. 6(02), 107–116 (1998) 9. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010) 10. Laraba, S., Brahimi, M., Tilmanne, J., Dutoit, T.: 3D skeleton-based action recognition by representing motion capture sequences as 2D-RGB images. Comput. Animat. Virt. W. 28(3–4), e1782 (2017) 11. Tits, N., El Haddad, K., Dutoit, T.: Asr-based features for emotion recognition: a transfer learning approach. arXiv preprint arXiv:1805.09197 (2018) 12. Jia, Y., Zhang, Y., Weiss, R.J., Wang, Q., Shen, J., Ren, F., Chen, Z., Nguyen, P., Pang, R., Moreno, I.L., et al.: Transfer learning from speaker verification to multispeaker text-to-speech synthesis arXiv preprint arXiv:1806.04558 (2018) 13. Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R.J., Saurous, R.A., Agiomyrgiannakis, Y., Wu, Y.: Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. CoRR, vol. abs/1712.05884 (2017) 14. Lee, Y., Rabiee, A., Lee, S.Y.: Emotional end-to-end neural speech synthesizer. arXiv preprint arXiv:1711.05447 (2017) 15. Kyubyong, P.: A tensorflow implementation of dc-tts: yet another text-to-speech model (2018). https://github.com/Kyubyong/dc tts 16. Adigwe, A., Tits, N., El Haddad, K., Ostadabbas, S., Dutoit, T.: The emotional voices database: towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514 (2018) 17. Kominek, J., Black, A.W.: The CMU arctic speech databases. In: Fifth ISCA Workshop on Speech Synthesis (2004) 18. Honnet, P.-E., Lazaridis, A., Garner, P.N., Yamagishi, J.: The siwis french speech synthesis database? Design and recording of a high quality french database for speech synthesis. Online Database (2017)


19. El Haddad, K., Tits, N., Dutoit, T.: Annotating nonverbal conversation expressions in interaction datasets. In: Proceedings of Laughter Workshop 2018, September 2018 20. Orozco-Arroyave, J.R., Vdsquez-Correa, J.C., H¨ onig, F., Arias-Londo˜ no, J.D., Vargas-Bonilla, J.F., Skodda, S., Rusz, J., Noth, E.: Towards an automatic monitoring of the neurological state of parkinson’s patients from speech. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6490–6494. IEEE (2016) 21. Rothauser, E.H.: IEEE recommended practice for speech quality measurements. IEEE Trans. Audio Electroacoust. 17, 225–246 (1969) 22. Alamsaputra, D.M., Kohnert, K.J., Munson, B., Reichle, J.: Synthesized speech intelligibility among native speakers and non-native speakers of english. Augmentative Altern. Commun. 22(4), 258–268 (2006)

Emotional Speech Datasets for English Speech Synthesis Purpose: A Review

Noé Tits(B), Kevin El Haddad, and Thierry Dutoit

Numediart Institute, University of Mons, 7000 Mons, Belgium
{noe.tits,kevin.elhaddad,thierry.dutoit}@umons.ac.be

Abstract. In this paper, we review the publicly available emotional speech datasets and their usability for state-of-the-art speech synthesis. This usability is conditioned by several characteristics of these datasets: the quality of the recordings, the quantity of data and the emotional content captured in the data. We then present a dataset that was recorded based on the observed needs in this area. It contains data for male and female actors in English and a male actor in French. The database covers five emotion classes, so it could be suitable for building synthesis and voice transformation systems with the potential to control the emotional dimension.

Keywords: Speech synthesis · Emotional speech · Deep learning

1 Introduction

One of the major components of human-agent interaction systems is the speech synthesis module. State-of-the-art speech synthesis systems such as WaveNet and Tacotron [19,23,24] are giving impressive results. They can produce intelligible, expressive, even human-like speech. However, they cannot yet be used to control the emotional dimension of speech, which is a crucial requirement for a human-like, controllable speech synthesis system. Although still relatively neglected by the affective computing community, interest in emotional speech synthesis systems has been growing for the past two decades. After the improvement that parametric systems brought to this field [10,15], deep learning-based systems were also employed for this task. One of the problems in the emotional speech synthesis research community is the lack of publicly available data and the difficulty of collecting it. In fact, to the best of our knowledge, no emotional speech database suitable for synthesis purposes and for deep learning systems is publicly available. In this paper, we try to tackle this problem. In what follows, we present a review of emotional speech datasets in Sect. 2. We then describe the motivations for collecting a new database in Sect. 3 and detail the content of a newly released database (https://github.com/numediart/EmoV-DB) that fulfills these motivations in Sect. 4.


2 Review

Emotions can be represented in different ways. A first representation is Ekman's six basic emotion model [7], which identifies anger, disgust, fear, happiness, sadness and surprise as six basic emotions from which the other emotions may be derived. Emotions can also be represented in a multidimensional continuous space, as in Russell's circumplex model [18] (valence and arousal being the most widely used dimensions). A more recent way of representing emotions is based on ranking, which prefers a relative preference method for annotating emotions rather than labeling them with absolute values [25].

Several open-source databases can be found but, to the best of our knowledge, none is really suitable for emotional speech synthesis purposes. In this section we explain why and mention some examples.

The RAVDESS database contains emotional data for 24 different actors [17]. The actors were asked to read 2 different sentences in a spoken and a sung way in North American English. The spoken style was recorded in 8 different emotional styles: neutral, calm, happy, sad, angry, fearful, disgust and surprise. Each utterance was expressed at two different intensities (except for the neutral emotion) and 2 times, giving a total of 1440 files. A perception test was then undertaken to validate the database on the emotional categories, intensity and genuineness.

The CREMA-D database [6] is similar to the RAVDESS. For this database, 12 different sentences were recorded by 91 different actors for the 6 basic emotions: happy, sad, anger, fear, disgust and neutral. Only one of the 12 sentences was expressed in 3 different intensities; for the other 11, the intensity was not specified. The authors report 7442 files in total. This database was also validated through perception tests, which helped validate the emotion category and intensity.

Also similar to the previous ones, the GEMEP database [1] is a collection of 10 French-speaking actors recorded uttering 15 different emotional expressions at three levels of intensity, in three different ways: improvised sentences, pseudo-speech and nonverbal affect bursts. This database counts a total of 1260 audio files. It was also validated through perception tests.

The Berlin Emotional Speech Dataset [3] contains recordings of 10 different utterances by 10 different actors in 7 different emotions (neutral, anger, fear, joy, sadness, disgust and boredom) in German, for a total of 800 utterances (counting second versions of some of the sentences). This database was, like the previous ones, validated using perception experiments.

These databases are not suitable for current state-of-the-art speech synthesis purposes because of the limited number of sentences recorded. Moreover, the six basic emotions do not really occur in daily conversations. Indeed, in Ekman's model, on which the choice of emotions was based for these datasets, the basic emotions are the ones from which other emotions derive, but that does not necessarily mean that they are frequently expressed in speech in our daily interactions.

The IMPROV [5] and IEMOCAP [4] databases both contain a large amount of diverse sentences of emotional data. IEMOCAP contains audio-visual recordings of 5 sessions of dyadic conversations between a male and a female subject.


In total, it contains 10 speakers and 12 h of data. IMPROV contains 6 sessions from 12 actors, resulting in 9 h of audiovisual data. Both databases were evaluated in terms of emotion categories [7] and emotional dimensions [18] by several subjects. However, they are not suitable for synthesis purposes either: although the data is well recorded and post-processed, it contains overlapping speech due to the recording setup (dyadic conversation) and some external noise.

The CMU Arctic Speech Database [16] and the SIWIS French Speech Synthesis Database [14] are collections of read utterances of phonetically balanced sentences in English and French, respectively. The CMU Arctic database contains approximately 1150 sentences recorded from each of 4 different speakers, while SIWIS contains a total of 9750 utterances from a single speaker. These databases are suitable for speech synthesis purposes, as a large number of different sentences is recorded from a single speaker in a noiseless environment. However, the sentences are neutral and do not express any emotion.

The AmuS database contains audio data dedicated to amused speech synthesis [13]. We showed in previous work [8,9,11] that this database is well suited for amused speech synthesis, but AmuS contains data only for amused speech and not for other emotions.

3 Motivations

This database’s2 primary purpose is to build models that could not only produce emotional speech but also control the emotional dimension in speech [20,21]. The techniques to allow this are either text-to-speech like systems where the system would map a given text sentence to a speech audio signal or voice transformation systems where a source voice would be converted to a specific target emotional voice. Considering this, it is obvious that a lot of data is required. One of the primary difficulties of building emotional speech-based generation systems is the collection of data. Indeed not only must the recording be of good quality and noise free, but the task of expression emotional sentences in a large enough amount is challenging. Also it is often preferable concerning these types of systems, that a certain category of emotion contains data that are similar on the acoustic level. The database presented here was built with these requirements in mind. The aim was also for it to fit with other currently open-source databases to maximize the quantity of data available. As mentioned previously, the CMUArctic database (English) and the SIWIS (French) databases are two datasets of neutral speech. Each of them contain a relatively large amount of data that can be used as source voices for a voice conversion system or as pre-training data for a system. They are also transcribed which makes the transcription also available for our database. The transcribed utterances as well as annotations at phonetic level are available. A subset of these were used to build our database. 2



The phonetic annotations are not yet time-aligned with our data, but forced alignment methods such as [2] can be used for this. We chose five different emotion classes: amusement, anger, sleepiness, disgust and neutral. These emotions are more likely to be expressed in daily conversations than Ekman's basic emotions. They were also chosen because they are relatively easy for actors to produce and because they cover a diverse region of the Russell circumplex, which allows experimenting with interpolation techniques to obtain intermediate emotions.

4 Database Content

The data was recorded in two different languages: English (North American) and French (Belgian). Four English natives (2 females and 2 males) and a single male French native were asked to read sentences while expressing one of the above-mentioned emotions. The English sentences were taken from the CMU Arctic database and the French ones from the SIWIS database; both contain freely available, open-source, phonetically balanced sentences. The recordings of the English data were carried out in two different anechoic chambers on the Northeastern University campus. Those of the French data were made in an anechoic room at the University of Mons. The utterances were recorded in several sessions of about 30 min of recording followed by a 5 to 15 min break, and the data collection was spread across several days depending on the availability of the actors. The actors were asked to repeat sentences that were mispronounced, and to record each emotion class separately in different sessions. At the time of writing, the sentences have been segmented manually for some of the speakers (annotation and segmentation are still ongoing). By segmentation we mean determining the start and end of each sentence. The total number of utterances obtained is summarized in Table 1.

Table 1. Gender and language of recorded sentences from each speaker and number of utterances segmented per speaker and per emotion.

Speaker   Gender  Language  Neutral  Amused  Angry  Sleepy  Disgust
Spk-Je    Female  English   417      222     523    466     189
Spk-Bea   Female  English   373      309     317    520     347
Spk-Sa    Male    English   493      501     468    495     497
Spk-Jsh   Male    English   302      298     -      263     -
Spk-No    Male    French    317      -       273    -       -

5 Conclusion

Amused speech can contain chuckling sounds that overlap and/or intermingle with speech, called speech-laughs [22], or can consist only of amused smiled speech [10]. For the amused data in our database, in order to collect as much data as possible given the relatively limited time the actors could provide, we focused on amused speech with speech-laughs. This choice was motivated by our previous study showing that this type of amused speech is perceived as more amused than amused smiled speech (without speech-laughs). In another study, we also showed that including laughter in synthesized speech is always perceived as amused no matter the style of speech it is inserted in (neutral or smiled) [11]. Based on the previous studies on amusement, the actors were encouraged, while simulating the other emotions, to use nonverbal expressions [12] before and even while uttering the sentences if they felt the need to (e.g. yawning for sleepiness, affect bursts for anger and disgust).

Acknowledgments. Noé Tits is funded through a PhD grant from the Fonds pour la Formation à la Recherche dans l'Industrie et l'Agriculture (FRIA), Belgium.

References 1. B¨ anziger, T., Mortillaro, M., Scherer, K.R.: Introducing the geneva multimodal expression corpus for experimental research on emotion perception. Emotion 12(5), 1161 (2012) 2. Brognaux, S., Roekhaut, S., Drugman, T., Beaufort, R.: Train&Align: a new online tool for automatic phonetic alignment. In: Spoken Language Technology Workshop (SLT), 2012 IEEE, pp. 416–421. IEEE (2012) 3. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of german emotional speech. In: Ninth European Conference on Speech Communication and Technology (2005) 4. Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: Iemocap: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335 (2008) 5. Busso, C., Parthasarathy, S., Burmania, A., Abdel-Wahab, M., Sadoughi, N., Provost, E.M.: Msp-improv: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 67–80 (2017) 6. Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R.: CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5(4), 377–390 (2014) 7. Ekman, P.: Basic emotions. In: Dalgleish, T., Powers, M.J. (eds.) Handbook of Cognition and Emotion, pp. 4–5. Wiley, New Jersey (1999) 8. El Haddad, K., Cakmak, H., Dupont, S., Dutoit, T.: Breath and repeat: an attempt at enhancing speech-laugh synthesis quality. In: European Signal Processing Conference (EUSIPCO 2015) Nice, France, 31 August–4 September 2015 9. El Haddad, K., Cakmak, H., Dupont, S., Dutoit, T.: An HMM approach for synthesizing amused speech with a controllable intensity of smile. In: IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Abu Dhabi, UAE, 7–10 December 2015


10. El Haddad, K., Dupont, S., d’Alessandro, N., Dutoit, T.: An HMM-based speechsmile synthesis system: an approach for amusement synthesis. In: International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE), Ljubljana, Slovenia, 4–8 May 2015 11. El Haddad, K., Dupont, S., Urbain, J., Dutoit, T.: Speech-laughs: an HMM-based approach for amused speech synthesis. In: Internation Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), pp. 4939–4943, Brisbane, Australia, 19–24 April 2015 12. El Haddad, K., Tits, N., Dutoit, T.: Annotating nonverbal conversation expressions in interaction datasets. In: Proceedings of Laughter Workshop, vol. 2018, p. 09 (2018) 13. El Haddad, K., Torre, I., Gilmartin, E., C ¸ akmak, H., Dupont, S., Dutoit, T., Campbell, N.: Introducing amus: the amused speech database. In: Camelin, N., Est`eve, Y., Mart´ın-Vide, C. (eds.) Statistical Language and Speech Processing, pp. 229– 240. Springer International Publishing, Cham (2017) 14. Honnet, P.-E., Lazaridis, A., Garner, P.N., Yamagishi, J.: The siwis french speech synthesis database? design and recording of a high quality french database for speech synthesis. Online Database (2017) 15. Kawanami, H., Iwami, Y., Toda, T., Saruwatari, H., Shikano, K.: Gmm-based voice conversion applied to emotional speech synthesis. In: Eighth European Conference on Speech Communication and Technology (2003) 16. Kominek, J., Black, A.W.: The CMU arctic speech databases. In: Fifth ISCA Workshop on Speech Synthesis (2004) 17. Livingstone, S.R., Russo, F.A.: The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american english. PLOS ONE 13(5), 1–35 (2018) 18. Posner, J., Russell, J.A., Peterson, B.S.: The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev. Psychopathol. 17(3), 715–734 (2005) 19. Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Yu., Wang, Y., Skerry-Ryan, R.J., Saurous, R.A., Agiomyrgiannakis, Y., Wu, Y.: Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. CoRR, abs/1712.05884 (2017) 20. Tits, N., El Haddad, K., Dutoit, T.: Asr-based features for emotion recognition: a transfer learning approach. arXiv preprint (2018). arXiv:1805.09197 21. Tits, N., El Haddad, K., Dutoit, T.: Exploring transfer learning for low resource emotional tts. arXiv preprint (2019). arXiv:1901.042761901.04276 22. Trouvain, J.: Phonetic aspects of “speech-laughs”. In: Oralit´e et Gestualit´e: Actes du colloque ORAGE, Aix-en-Provence. Paris: L’Harmattan, pp. 634–639 (2001) 23. van den Oord, A., Sander, D., Heiga, Z., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., Kavukcuoglu, K.: A generative model for raw audio. In: SSW, Wavenet (2016) 24. Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q.V., Agiomyrgiannakis, Y., Clark, R., Saurous, R.A.: Tacotron: towards end-to-end speech synthesis. In: INTERSPEECH (2017) 25. Yannakakis, G.N., Cowie, R., Busso, C.: The ordinal nature of emotions. In: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), vol. 00, pp. 248–255. Octobet 2017

Feature Selection for Hidden Markov Models with Discrete Features

Stephen Adams(B) and Peter A. Beling

Engineering Systems and Environment, University of Virginia, Charlottesville, VA, USA
{sca2c,pb3a}@virginia.edu

Abstract. Hidden Markov models (HMMs) are widely used for modeling multivariate time series data. However, all collected data is not always useful for distinguishing between states. In these situations, feature selection should be implemented to save the expense of collecting and processing low utility data. Feature selection for HMMs has been studied but most existing methods assume that the observed data follows a Gaussian distribution. In this paper, a method for simultaneously estimating parameters and selecting features for an HMM with discrete observations is presented. The presented method is an extension of the feature saliency HMM which was originally developed to incorporate feature cost into the feature selection process. Expectation-maximization algorithms are derived for features following a Poisson distribution and features following a discrete non-parametric distribution. The algorithms are evaluated on synthetic data sets and a real-world event detection data set that is composed of both Poisson and non-parametric features.

Keywords: Hidden Markov Models · Feature selection · EM algorithm

1 Introduction

Hidden Markov models (HMMs) are widely used models for time series data and have been applied in the fields of bioinformatics [1], building science [2], and robotics [3]. HMMs are composed of a sequence of latent random variables and corresponding observed random variables. Generally, the latent variables are assumed to be discrete, and their transitions are modeled using a Markov chain. When the observed variables are continuous, these variables are often assumed to follow a conditional Gaussian distribution, but they can also be modeled using a Gaussian mixture model (GMM), a gamma distribution, or an exponential distribution. Discrete distributions, such as the Poisson distribution or a non-parametric distribution, can be used when the observed data are discrete.

Feature selection is the process of selecting a relevant set of features from a larger group of collected features [4,5]. Feature selection methods are generally divided into three categories: filters, wrappers, and embedded techniques. Filters assess feature relevance independent of a model. Wrappers utilize a model


and sequentially move through the feature subset space utilizing an evaluation function. Embedded techniques simultaneously estimate model parameters and select features. Feature selection is becoming an ever-increasing need in data mining as the collection of diverse datasets with multiple data streams increases due to advances in technology such as the Internet of Things. Most models, including GMMs and HMMs [6], can benefit from some form of feature selection by reducing the effects of the curse of dimensionality. These effects include long training times, large storage-space requirements, and noisy data from irrelevant data streams.

In this paper, a feature selection method for discrete observation HMMs is proposed. Most of the research published on feature selection for HMMs focuses on continuous observations and assumes either a single Gaussian or a mixture of Gaussians for the emission distribution. The proposed method builds upon the feature saliency model proposed by Adams, Cogill, and Beling [7] and extends it to discrete features. Specifically, expectation maximization (EM) algorithms for feature saliency HMMs (FSHMMs) with Poisson features and with discrete non-parametric features are derived.

The concept of feature saliency, first developed for GMMs [8], has been adapted to HMMs and shown to be an effective method for simultaneously estimating model parameters and selecting features. Feature saliency methods estimate feature relevance by determining a feature's ability to distinguish between states. Zhu, He, and Leung [9] developed a feature saliency model for HMMs and estimate model parameters using a variational Bayesian method. Adams, Cogill, and Beling [7] use MAP estimation and add the ability to incorporate feature cost. To the best of our knowledge, the published work on feature saliency is exclusive to continuous observations and limited to Gaussian distributions (or GMMs). The primary contribution of the presented study is deriving an EM algorithm for simultaneously estimating model parameters and selecting features for a discrete observation HMM.

This paper is organized in the following fashion. Section 2 gives background information on HMMs and the FSHMM from [7]. Section 3 outlines the proposed discrete feature selection method for HMMs. In Sect. 4, we perform numerical experiments on synthetic data and an event detection data set. Section 5 provides our conclusions about the proposed method.

2 Background

In this section, a basic HMM and the FSHMM are described. Let I be the number of hidden states, let x = {x_1, x_2, ..., x_T} be the unobserved state sequence, and let y = {y_1, y_2, ..., y_T} be the sequence of observed data (either continuous or discrete). The transition matrix of the Markov chain associated with x is denoted by A, where the components of this matrix represent the transition probabilities a_ij = P(x_{t+1} = j | x_t = i). The initial state distribution is represented by π. In terms of these quantities, the complete data likelihood can be written as

P(x, y \mid \Lambda) = \pi_{x_1} f_{x_1}(y_1) \prod_{t=2}^{T} a_{x_{t-1}, x_t} f_{x_t}(y_t),   (1)

where Λ is the set of model parameters consisting of the initial distribution, transition probabilities and the emission distribution parameters, and f_{x_t}(y_t) is the emission distribution given x_t.

The EM algorithm [10] can be used to calculate maximum-likelihood (ML) estimates for the model parameters and iterates between two steps: the expectation step (E-step) and the maximization step (M-step). Prior distributions can be placed on the model parameters to calculate the maximum a posteriori (MAP) estimates [11]. Given the data and the current model parameters, the E-step finds the expected value of the complete log-likelihood with respect to the state. This quantity is designated the Q function and given by

Q(\Lambda, \Lambda') = E[\log P(x, y \mid \Lambda) \mid y, \Lambda'].   (2)

The M-step maximizes the Q function to find the next set of model parameters. Λ represents the set of model parameters for the current iteration, and Λ' is the set of parameters from the previous iteration. The E-step and the M-step are iterated until the selected stopping criterion is achieved. For MAP estimation, terms corresponding to the prior distributions on the model parameters, G(Λ), are added to the Q function:

Q(\Lambda, \Lambda') + \log G(\Lambda).   (3)

When implementing MAP estimation, the E-step remains the same as in the ML formulation, but Eq. 3 is maximized during the M-step. For both the ML and MAP formulations, the quantities in Eqs. 2 and 3 are maximized by computing roots of their partial derivatives with respect to the model parameters.

The FSHMM includes a binary latent variable, denoted by z, which represents the relevance of each feature. A feature is considered to be relevant if its distribution is dependent on the underlying state and irrelevant otherwise. Let L be the number of features. If z_l = 1, then the l-th feature is relevant, and if z_l = 0 the l-th feature is irrelevant. The feature saliency, represented by ρ, is interpreted as the probability that the corresponding feature is relevant. If it is assumed that the features are conditionally independent given the state, the conditional distribution of y_t given z and x_t can be written as

P(y_t \mid z, x_t = i, \Lambda) = \prod_{l=1}^{L} p(y_{lt} \mid \theta_l)^{z_l} \, q(y_{lt} \mid \psi_l)^{1 - z_l},   (4)

where p(y_{lt} | θ_l) is the conditional feature distribution for the l-th feature with parameters θ_l, and q(y_{lt} | ψ_l) is the state-independent feature distribution with parameters ψ_l.


The probability distribution of the latent variable z is

P(z \mid \Lambda) = \prod_{l=1}^{L} \rho_l^{z_l} (1 - \rho_l)^{1 - z_l}.   (5)

The joint distribution of the observation y_t and z, given the hidden state x_t, is

P(y_t, z \mid x_t = i, \Lambda) = P(y_t \mid z, x_t = i, \Lambda) P(z \mid \Lambda) = \prod_{l=1}^{L} [\rho_l \, p(y_{lt} \mid \theta_l)]^{z_l} [(1 - \rho_l) \, q(y_{lt} \mid \psi_l)]^{1 - z_l}.   (6)

The marginal distribution for y_t, given x_t, can be derived from Eq. 6 by summing over z:

f_{x_t}(y_t) = P(y_t \mid x_t = i, \Lambda) = \prod_{l=1}^{L} \left( \rho_l \, p(y_{lt} \mid \theta_l) + (1 - \rho_l) \, q(y_{lt} \mid \psi_l) \right).   (7)

Using the previously derived equations, the complete data likelihood for the FSHMM can be written as

P(x, y, z \mid \Lambda) = \pi_{x_1} P(y_1, z \mid x_1, \Lambda) \prod_{t=2}^{T} a_{x_{t-1}, x_t} P(y_t, z \mid x_t, \Lambda).   (8)
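A minimal sketch of how the marginal emission likelihood of Eq. (7) can be evaluated numerically is shown below; the array names are illustrative rather than part of the original formulation. With these emission values, the standard forward-backward and Viterbi recursions for an HMM apply unchanged.

```python
import numpy as np

def fshmm_emission(rho, p_eval, q_eval):
    """Eq. (7): f_i(y_t) = prod_l [ rho_l * p(y_lt|theta_l) + (1 - rho_l) * q(y_lt|psi_l) ].

    rho    : (L,) feature saliencies
    p_eval : (L,) state-dependent feature likelihoods evaluated at y_t for state i
    q_eval : (L,) state-independent feature likelihoods evaluated at y_t
    """
    return float(np.prod(rho * p_eval + (1.0 - rho) * q_eval))

# Toy example with two features: the first mostly relevant, the second mostly not.
rho = np.array([0.9, 0.1])
print(fshmm_emission(rho, p_eval=np.array([0.05, 0.02]), q_eval=np.array([0.01, 0.03])))
```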

3 Method

In this section, the proposed method for simultaneously selecting features and estimating model parameters for an HMM with discrete features is outlined. The method is an extension of the FSHMM described in [7]. Specifically, EM algorithms for features following a Poisson distribution and a discrete non-parametric distribution are derived. For each feature distribution, both an ML and a MAP formulation are outlined.

3.1 Poisson Features

Assume that the relevant features have a state-dependent Poisson distribution. Further, assume that the irrelevant features have a state-independent Poisson distribution. The conditional distribution for y_t, given x_t = i and z, is

P(y_t \mid z, x_t = i, \Lambda) = \prod_{l=1}^{L} p(y_{lt} \mid \mu_{il})^{z_l} \, q(y_{lt} \mid \epsilon_l)^{1 - z_l},   (9)

where

p(y_{lt} \mid \mu_{il}) = \frac{\mu_{il}^{y_{lt}} e^{-\mu_{il}}}{y_{lt}!},   (10)

and

q(y_{lt} \mid \epsilon_l) = \frac{\epsilon_l^{y_{lt}} e^{-\epsilon_l}}{y_{lt}!}.   (11)

Fig. 1. Graphical model for the MAP formulation of the FSHMM with Poisson features. Squares represent hidden variables. Filled circles are observable variables. Open circles are model parameters.

The marginal distribution for z is the same as Eq. 5. For the E-step, calculate γ_t(i) = P(x_t = i | y, Λ') and ξ_t(i, j) = P(x_t = i, x_{t+1} = j | y, Λ') using the forward-backward algorithm [10], as for a standard HMM. Additionally, calculate the following FSHMM-specific probabilities:

e_{ilt} = P(y_{lt}, z_l = 1 \mid x_t = i, \Lambda') = \rho_l \, p(y_{lt} \mid \mu_{il}),   (12)

h_{ilt} = P(y_{lt}, z_l = 0 \mid x_t = i, \Lambda') = (1 - \rho_l) \, q(y_{lt} \mid \epsilon_l),   (13)

g_{ilt} = P(y_{lt} \mid x_t = i, \Lambda') = e_{ilt} + h_{ilt},   (14)

u_{ilt} = P(z_l = 1, x_t = i \mid y, \Lambda') = \frac{\gamma_t(i) \, e_{ilt}}{g_{ilt}},   (15)

and

v_{ilt} = P(z_l = 0, x_t = i \mid y, \Lambda') = \frac{\gamma_t(i) \, h_{ilt}}{g_{ilt}} = \gamma_t(i) - u_{ilt},   (16)
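The following NumPy sketch computes the quantities of Eqs. (12)–(16) for the Poisson case, assuming γ_t(i) has already been obtained from a forward-backward pass run with the marginal emissions of Eq. (7); the array names and shapes are illustrative conventions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import poisson

def fshmm_estep_quantities(y, gamma, mu, eps, rho):
    """E-step quantities of Eqs. (12)-(16) for the Poisson FSHMM.

    y     : (T, L) integer observations
    gamma : (T, I) smoothed state posteriors gamma_t(i)
    mu    : (I, L) state-dependent Poisson rates
    eps   : (L,)   state-independent Poisson rates
    rho   : (L,)   feature saliencies
    Returns u, v with shape (I, L, T).
    """
    p = poisson.pmf(y[None, :, :], mu[:, None, :])    # (I, T, L): p(y_lt | mu_il)
    q = poisson.pmf(y, eps)                           # (T, L):    q(y_lt | eps_l)
    e = rho * p                                       # Eq. (12)
    h = (1.0 - rho) * q                               # Eq. (13), broadcast to (I, T, L)
    g = e + h                                         # Eq. (14)
    u = gamma.T[:, :, None] * e / g                   # Eq. (15): (I, T, L)
    v = gamma.T[:, :, None] * h / g                   # Eq. (16)
    return u.transpose(0, 2, 1), v.transpose(0, 2, 1) # reorder to (I, L, T)
```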

where γ_t(i), ξ_t(i, j), u_{ilt} and v_{ilt} will be used in the M-step. The M-step parameter updates for the ML formulation are

\hat{\pi}_i = \gamma_1(i),   (17)

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)},   (18)

\hat{\mu}_{il} = \frac{\sum_{t=1}^{T} u_{ilt} \, y_{lt}}{\sum_{t=1}^{T} u_{ilt}},   (19)

\hat{\epsilon}_l = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt} \, y_{lt}}{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt}},   (20)

and

\hat{\rho}_l = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} u_{ilt}}{\sum_{t=1}^{T} \sum_{i=1}^{I} (u_{ilt} + v_{ilt})} = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} u_{ilt}}{T}.   (21)
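A sketch of the ML M-step of Eqs. (17)–(21) in NumPy, under the same array conventions as the E-step sketch above; it illustrates the update formulas and is not the authors' implementation.

```python
import numpy as np

def fshmm_mstep_ml(y, gamma, xi, u, v):
    """ML M-step updates, Eqs. (17)-(21).

    y : (T, L); gamma : (T, I); xi : (T-1, I, I); u, v : (I, L, T) from the E-step.
    """
    pi_hat = gamma[0]                                                # Eq. (17)
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]         # Eq. (18)
    mu_hat = (u * y.T[None]).sum(axis=2) / u.sum(axis=2)             # Eq. (19)
    eps_hat = (v * y.T[None]).sum(axis=(0, 2)) / v.sum(axis=(0, 2))  # Eq. (20)
    rho_hat = u.sum(axis=(0, 2)) / y.shape[0]                        # Eq. (21)
    return pi_hat, A_hat, mu_hat, eps_hat, rho_hat
```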

The priors used for MAP estimation are the Dirichlet distribution, represented by Dir, and the Gamma distribution using shape and rate as hyperparameters, represented by G. The prior for the feature saliencies is a truncated exponential distribution. A_i represents the i-th row of the transition matrix. The specific prior distributions are listed below:

\pi \sim \mathrm{Dir}(\pi \mid \bar{p}),   (22)

A_i \sim \mathrm{Dir}(A_i \mid \bar{a}_i),   (23)

\mu_{il} \sim G(\mu_{il} \mid m_{il}, s_{il}),   (24)

\epsilon_l \sim G(\epsilon_l \mid b_l, c_l),   (25)

\rho_l \sim \frac{1}{Z} e^{-k_l \rho_l},   (26)

where Z is the normalizing constant. The parameter update equations are

\hat{\pi}_i = \frac{\gamma_1(i) + \bar{p}_i - 1}{\sum_{i=1}^{I} (\gamma_1(i) + \bar{p}_i - 1)},   (27)

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j) + \bar{a}_{ij} - 1}{\sum_{j=1}^{I} \left( \sum_{t=1}^{T-1} \xi_t(i, j) + \bar{a}_{ij} - 1 \right)},   (28)

\hat{\mu}_{il} = \frac{\sum_{t=1}^{T} u_{ilt} \, y_{lt} + m_{il} - 1}{\sum_{t=1}^{T} u_{ilt} + s_{il}},   (29)

\hat{\epsilon}_l = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt} \, y_{lt} + b_l - 1}{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt} + c_l},   (30)

and

\hat{\rho}_l = \frac{T + k_l - \sqrt{(T + k_l)^2 - 4 k_l \sum_{t=1}^{T} \sum_{i=1}^{I} u_{ilt}}}{2 k_l}.   (31)
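The closed-form saliency update of Eq. (31) can be implemented directly; the sketch below assumes the same (I, L, T) array layout as the earlier sketches, with k the weight of the truncated-exponential prior (larger k pushes ρ towards 0).

```python
import numpy as np

def rho_map_update(u, T, k):
    """Eq. (31): MAP update of the feature saliencies.

    u : (I, L, T) E-step quantities; returns rho_hat with shape (L,).
    """
    s = u.sum(axis=(0, 2))  # sum over states and time of u_ilt, per feature
    return (T + k - np.sqrt((T + k) ** 2 - 4.0 * k * s)) / (2.0 * k)
```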

Figure 1 displays a graphical model for the MAP FSHMM with Poisson features.

3.2 Discrete Non-parametric Features

For the discrete non-parametric features, let Y be a possible observation symbol, and let P_l be the number of possible observation symbols for the l-th feature. Further, let P_il(Y) be the probability that the observation symbol of the l-th feature equals Y given x_t = i. There are two possibilities for modeling the state-independent discrete distribution: (1) the state-independent distribution can be modeled as non-uniform and its probabilities learned by the algorithm, or (2) the state-independent distribution can be assumed uniform and set to a fixed value a priori. For the former, let q(y_{lt}) = P_l(Y), and for the latter let q(y_{lt}) = P_l^{-1}, where the conditional parameters have been dropped from the notation as these distributions are non-parametric. The conditional distribution for y_t, given x_t = i and z, is

P(y_t \mid z, x_t = i, \Lambda) = \prod_{l=1}^{L} P_{il}(y_{lt})^{z_l} \, q(y_{lt})^{1 - z_l}.   (32)

The E-step probabilities are the same as in the Poisson formulation. When q(y_{lt}) is estimated from the data, the ML M-step updates are

\hat{\pi}_i = \gamma_1(i),   (33)

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)},   (34)

\hat{P}_{il}(Y) = \frac{\sum_{t=1}^{T} u_{ilt} \, I(Y = y_{lt})}{\sum_{t=1}^{T} u_{ilt}},   (35)

\hat{P}_l(Y) = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt} \, I(Y = y_{lt})}{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt}},   (36)

and

\hat{\rho}_l = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} u_{ilt}}{\sum_{t=1}^{T} \sum_{i=1}^{I} (u_{ilt} + v_{ilt})} = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} u_{ilt}}{T},   (37)

where I(·) is an indicator function such that I(Y = y_{lt}) = 1 if Y = y_{lt} and 0 otherwise. When the state-independent distribution is assumed to be uniform and not estimated by the algorithm, \hat{P}_l(Y) = P_l^{-1}.

Fig. 2. Graphical model for the MAP formulation of the FSHMM with discrete non-parametric features. Squares represent hidden variables. Filled circles are observable variables. Open circles are model parameters.
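A sketch of the indicator-count updates of Eqs. (35)–(37) in NumPy, assuming the symbols are encoded as integers 0, …, P_l − 1; again an illustration of the formulas rather than the original implementation.

```python
import numpy as np

def nonparametric_ml_updates(y, u, v, n_symbols):
    """ML updates of Eqs. (35)-(37) for discrete non-parametric features.

    y : (T, L) integer symbols; u, v : (I, L, T) E-step quantities.
    Returns P_il (I, L, S), P_l (L, S) and rho_hat (L,).
    """
    T = y.shape[0]
    ind = (y.T[None, :, :, None] == np.arange(n_symbols))  # (1, L, T, S): I(Y = y_lt)
    P_il = (u[..., None] * ind).sum(axis=2) / u.sum(axis=2)[..., None]           # Eq. (35)
    P_l = (v[..., None] * ind).sum(axis=(0, 2)) / v.sum(axis=(0, 2))[..., None]  # Eq. (36)
    rho_hat = u.sum(axis=(0, 2)) / T                                             # Eq. (37)
    return P_il, P_l, rho_hat
```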

The priors used for MAP estimation are listed below. Here, P_il is a vector representing every possible observation symbol for state i and the l-th feature:

\pi \sim \mathrm{Dir}(\pi \mid \bar{p}),   (38)

A_i \sim \mathrm{Dir}(A_i \mid \bar{a}_i),   (39)

P_{il} \sim \mathrm{Dir}(P_{il} \mid m_{il}),   (40)

P_l \sim \mathrm{Dir}(P_l \mid b_l),   (41)

\rho_l \sim \frac{1}{Z} e^{-k_l \rho_l},   (42)

where Z is the normalizing constant. The parameter update equations are

\hat{\pi}_i = \frac{\gamma_1(i) + \bar{p}_i - 1}{\sum_{i=1}^{I} (\gamma_1(i) + \bar{p}_i - 1)},   (43)

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j) + \bar{a}_{ij} - 1}{\sum_{j=1}^{I} \left( \sum_{t=1}^{T-1} \xi_t(i, j) + \bar{a}_{ij} - 1 \right)},   (44)

\hat{P}_{il}(Y) = \frac{\sum_{t=1}^{T} u_{ilt} \, I(Y = y_{lt}) + m_{il}(Y) - 1}{\sum_{Y'} \left( \sum_{t=1}^{T} u_{ilt} \, I(Y' = y_{lt}) + m_{il}(Y') - 1 \right)},   (45)

\hat{P}_l(Y) = \frac{\sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt} \, I(Y = y_{lt}) + b_l(Y) - 1}{\sum_{Y'} \left( \sum_{t=1}^{T} \sum_{i=1}^{I} v_{ilt} \, I(Y' = y_{lt}) + b_l(Y') - 1 \right)},   (46)

and

\hat{\rho}_l = \frac{T + k_l - \sqrt{(T + k_l)^2 - 4 k_l \sum_{t=1}^{T} \sum_{i=1}^{I} u_{ilt}}}{2 k_l}.   (47)

As in the ML case, when the state-independent distribution is assumed to be uniform and not estimated, \hat{P}_l(Y) = P_l^{-1}. Figure 2 displays a graphical model for the MAP FSHMM with discrete non-parametric features.

4 Numerical Experiments

In this section, numerical experiments are performed on synthetic data sets and on an event detection data set. The synthetic experiments demonstrate that the proposed method can accurately estimate the parameters of the discrete FSHMM. The event detection data set demonstrates that the proposed methods can be applied to real-world data with features following different discrete distributions, i.e. a model that uses both the Poisson and the non-parametric discrete feature distributions.

4.1 Synthetic Data

First, the Poisson FSHMM is tested on a synthetic data set generated by a known model. An observation sequence of T = 2000 is generated from a two-state HMM with two relevant Poisson features. Three irrelevant features generated from a single Poisson distribution with an expected value of 30 are added to the data, resulting in a model with five features in total. The model parameters are

\mu_1 = [10 \;\; 20], \quad \mu_2 = [50 \;\; 100], \quad A = \begin{bmatrix} 0.75 & 0.25 \\ 0.4 & 0.6 \end{bmatrix}, \quad \pi = \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix}.
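The data-generating process just described can be sketched as follows (the random seed and variable names are arbitrary choices, not specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
A = np.array([[0.75, 0.25], [0.40, 0.60]])
pi = np.array([0.4, 0.6])
mu = np.array([[10.0, 20.0], [50.0, 100.0]])  # state-dependent rates, relevant features
eps_irrelevant = 30.0                          # rate of the three irrelevant features

# Sample the hidden state sequence from the Markov chain.
x = np.empty(T, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, T):
    x[t] = rng.choice(2, p=A[x[t - 1]])

# Relevant features depend on the state; irrelevant ones do not.
y_relevant = rng.poisson(mu[x])                      # (T, 2)
y_irrelevant = rng.poisson(eps_irrelevant, (T, 3))   # (T, 3)
y = np.hstack([y_relevant, y_irrelevant])            # (T, 5) observation matrix
```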

The hyperparameters for the priors in the MAP formulation are: \bar{p}_i = \bar{a}_{ij} = 2 and s_{il} = c_l = 1; b_l is the mean of the observations for the l-th feature; m_{1l} = [5 \; 15 \; 25 \; 30 \; 35] and m_{2l} = [40 \; 90 \; 35 \; 30 \; 25]. The weight parameter k_l for the feature saliencies is set to 50. Initial values for the ML and MAP formulations are set to the same values: equal initial state probabilities and transition probabilities, \mu_{init} = m, \epsilon_{init} = b, and \rho_{init} = 0.5. The maximum number of iterations for each algorithm is 500 and the convergence threshold is 10^{-6}.

Table 1. Estimated values of π and A for the Poisson FSHMM. The prior in MAP affects the estimates for the initial distribution. The estimates for the transition probabilities are within 0.03 units of the true probability.

Algorithm  π1    π2    a11   a12   a21   a22
ML         0     1     0.78  0.22  0.41  0.59
MAP        0.33  0.67  0.78  0.22  0.41  0.59

Table 2. Estimated values of μ for the relevant features for the Poisson FSHMM. All estimates are within 0.5 units of the true value.

Algorithm  μ11    μ12    μ21    μ22
ML         10.00  20.04  49.77  99.86
MAP        9.99   20.04  49.76  99.84

The ML method converged in 21 iterations, while the MAP formulation converged in 184. The estimated parameters are displayed in Tables 1, 2, 3 and 4. The ML formulation learns initial distribution estimates that match the training data. Although the training data started in state 2, MAP learns initial distribution estimates that more closely reflect the true parameters due to the prior distribution.

Table 3. Estimated values of ε for the irrelevant features for the Poisson FSHMM. All parameters are within 0.5 units of the true value.

Algorithm  ε3     ε4     ε5
ML         30.28  29.78  30.04
MAP        29.95  29.82  30.11

Table 4. Estimated values of ρ for the Poisson FSHMM. ML overestimates the relevance of the irrelevant features, while MAP successfully identifies the relevant and irrelevant features.

Algorithm  ρ1      ρ2      ρ3      ρ4      ρ5
ML         1.0000  1.0000  0.3847  0.4999  0.3895
MAP        1.0000  1.0000  0.0050  0.0105  0.0037

Both formulations give accurate estimates for the transition probabilities and the emission distribution parameters. It is interesting to note that ML overestimates ρ for the irrelevant features, while the prior on the feature saliencies forces the MAP estimates towards 0. For the relevant features, both formulations give accurate estimates of ρ.

The discrete non-parametric FSHMM is also tested on a synthetic data set generated by an HMM with known parameters. Both the model estimating P_l(Y) and the model treating it as uniform are evaluated. An observation sequence of T = 2000 is generated from a three-state HMM with two relevant discrete non-parametric features. Three irrelevant features are generated from a discrete uniform distribution and added to the data. The relevant and irrelevant features have three possible observation symbols. The model parameters are

P_{11}(Y) = [0.95 \; 0.025 \; 0.025], \quad P_{12}(Y) = [0.99 \; 0.005 \; 0.005],
P_{21}(Y) = [0.01 \; 0.98 \; 0.01], \quad P_{22}(Y) = [0.025 \; 0.95 \; 0.025],
P_{31}(Y) = [0.025 \; 0.025 \; 0.95], \quad P_{32}(Y) = [0.01 \; 0.01 \; 0.98],

A = \begin{bmatrix} 0.55 & 0.25 & 0.2 \\ 0.3 & 0.5 & 0.2 \\ 0.15 & 0.25 & 0.6 \end{bmatrix}, \quad \pi = \begin{bmatrix} 0.4 \\ 0.4 \\ 0.2 \end{bmatrix}.

The hyperparameters for the priors in the MAP formulation are \beta_i = \alpha_{ij} = m_{il}(Y) = b_l(Y) = 2. The weight parameter k_l for the feature saliencies is set to 50. As in the Poisson FSHMM example, both ML and MAP are initialized with the same values: equal initial state probabilities and transition probabilities are used. For the initial values of P_il(Y), the symbol with the largest true probability for state i and the l-th feature is set to 0.9, and the remaining two symbols are set to 0.05. When estimating P_l(Y) from the data, the initial values are the counts of the symbols divided by T. The maximum number of iterations for each algorithm is 500, and the convergence threshold is 10^{-6}.

The ML method converged in 53 iterations. ML assuming a uniform state-independent distribution (designated ML-U) also converged in 53 iterations. MAP converged in 310 iterations, and MAP-U converged in 322 iterations. The estimated parameters are displayed in Tables 5, 6, 7, 8 and 9. The difference between the methods that estimate P_l(Y) and those that assume it is uniform appears to be negligible. Further, the estimates of P_l(Y) reflect a uniform distribution. These results are likely biased by the fact that the irrelevant features are drawn from a uniform distribution.


ML and MAP give similar parameter estimates for the transition probabilities, but the priors skew the results for the initial distribution. For the relevant feature parameters, ML tends to overestimate some of the probabilities, while the use of priors in MAP prevents this. As seen in the Poisson formulation, ML overestimates ρ for the irrelevant features, while the prior on the feature saliencies forces the MAP estimates towards 0. Both formulations give accurate estimates for the feature saliencies of the relevant features.

Table 5. Estimated values of π for the non-parametric FSHMM. The prior in MAP affects the estimates of the initial distribution.

Algorithm  π1    π2    π3
ML         0     0     1
ML-U       0     0     1
MAP        0.25  0.25  0.5
MAP-U      0.25  0.25  0.5

Table 6. Estimated values of A for the non-parametric FSHMM. All estimates are within 0.03 units of the true probability.

Algorithm  a11   a12   a13   a21   a22   a23   a31   a32   a33
ML         0.58  0.24  0.18  0.31  0.48  0.21  0.15  0.22  0.63
ML-U       0.58  0.24  0.18  0.30  0.49  0.21  0.15  0.22  0.63
MAP        0.58  0.24  0.18  0.31  0.48  0.21  0.15  0.22  0.63
MAP-U      0.58  0.24  0.18  0.31  0.48  0.21  0.15  0.22  0.63

Table 7. Estimated values of P_il(Y) for the non-parametric FSHMM. F1 is feature 1 and F2 is feature 2. The estimates for the algorithms that assume a uniform state-independent distribution are similar to the estimates for the algorithms that estimate the state-independent distribution. The estimated probabilities are within 0.05 units of the true probabilities.

Algorithm  P1l(1)  P1l(2)  P1l(3)  P2l(1)  P2l(2)  P2l(3)  P3l(1)  P3l(2)  P3l(3)
ML F1      0.99    0       0.01    0       1.00    0       0.01    0       0.99
ML F2      1.00    0       0       0.03    0.95    0.02    0       0       1.00
ML-U F1    0.98    0.01    0.01    0       1.00    0       0.01    0.01    0.98
ML-U F2    1.00    0       0       0.03    0.94    0.02    0       0       1.00
MAP F1     0.95    0.03    0.02    0.01    0.98    0.01    0.02    0.03    0.95
MAP F2     0.98    0.01    0.01    0.04    0.94    0.02    0.01    0.01    0.98
MAP-U F1   0.95    0.03    0.02    0.01    0.98    0.01    0.02    0.03    0.95
MAP-U F2   0.98    0.01    0.01    0.04    0.94    0.02    0.01    0.01    0.98


Table 8. Estimated values of P_l(Y) (the state-independent parameters) for the non-parametric FSHMM. The estimates are within 0.02 units of the true value.

Algorithm  P1(1)  P1(2)  P1(3)  P2(1)  P2(2)  P2(3)  P3(1)  P3(2)  P3(3)
ML         0.34   0.33   0.33   0.35   0.32   0.33   0.34   0.32   0.34
MAP        0.34   0.33   0.33   0.34   0.32   0.34   0.33   0.33   0.34

Table 9. Estimated values of ρ for the non-parametric FSHMM. The ML formulation overestimates the relevance of the irrelevant features, while MAP successfully identifies the relevant and irrelevant features.

Algorithm  ρ1      ρ2      ρ3            ρ4      ρ5
ML         0.9785  0.9755  0.5004        0.5014  0.5012
ML-U       0.9662  0.9722  0.5004        0.5012  0.5010
MAP        1.0000  1.0000  8.8006 × 10^-4  0.0050  0.0030
MAP-U      1.0000  1.0000  6.7891 × 10^-4  0.0072  0.0025

4.2 CalIT2 Data

The CalIT2 data set, which is publicly available in the UCI Machine Learning Repository [12], is modeled to test the discrete feature FSHMM formulation on a real-world data set. This data set contains counts of people entering and exiting a building through the main entrance over a 15-week period. Each count is aggregated over a half-hour period, and each half hour has the corresponding date and time. The data includes a list of scheduled events and their dates and times.

Ihler, Hutchins, and Smyth [13] model the CalIT2 data set using a joint Markov and Poisson process. The day of the week and time of day are modeled as effects on the Poisson rate parameter, and the observations are the counts of people entering and exiting the building. The objective of this set of numerical experiments is to detect the known events. The method presented in [13] demonstrates remarkable accuracy at detecting the presence of an event, as well as the ability to estimate event attendance. The goal of this analysis is not to suggest that the discrete observation FSHMM is a better method for modeling this data set, but to demonstrate that the discrete feature formulation of the FSHMM can be applied to non-synthetic data and that features with different assumed distributions can be modeled together.

To accomplish this, a discrete observation FSHMM is trained assuming the entering and exiting counts have Poisson distributions, and the day of the week and time of day have discrete non-parametric distributions. A two-state Markov chain represents the presence of an event: the FSHMM assumes that state 1 is normal operation (no event) and state 2 is an event. Ihler et al. [13] do not split the data into training and testing sets; these numerical experiments follow the same procedure so that the presented analysis


is comparable to the original. The models are trained on the entire data set, and then the Viterbi algorithm is used to detect the presence of events on the same data. The ML and MAP formulations are initialized with the same values: π = [1; 0]; the self-transition probabilities in A are 0.9 and the transitions to the other state are 0.1; the Poisson parameters for the no-event state are set to 1, and the parameters for the event state are set to 20. The non-parametric probabilities are initialized so that there is a higher probability of an event on weekdays and from 8 o'clock in the morning to 8 in the evening:

P_{day} = \begin{bmatrix} 0.25 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.25 \\ 0.025 & 0.19 & 0.19 & 0.19 & 0.19 & 0.19 & 0.025 \end{bmatrix},

P_{time} = \begin{bmatrix} 0.0278 & \cdots & 0.0139 & \cdots & 0.0278 \\ 0.0139 & \cdots & 0.0278 & \cdots & 0.0139 \end{bmatrix}.

The state-independent Poisson parameters are set to the mean of the data, and the state-independent non-parametric probabilities are fixed to the inverse of the number of possible symbols (7^{-1} and 48^{-1}). The initial values for the feature saliencies are 0.5. For the MAP formulations, \bar{a} = 2, \bar{p} = 1, m = \lambda_{init}, s = 1, b = \epsilon_{init}, and c = 1. The hyperparameter for the non-parametric features is 2, and k is set to a quarter of the number of observations, k = 1260.

The estimated saliencies of the features are displayed in Table 10. Feature selection is performed by removing the feature with the lowest estimated ρ. The feature representing the count entering the building was removed for ML, while the feature representing the day of the week was removed for MAP. The ML full and reduced models detected 28 out of the 29 scheduled events within the time period in which the event occurred. The full and reduced models for the MAP formulation detected all scheduled events within the time period in which the event occurred (100% accuracy). Several unscheduled events were also detected by the method. Ihler et al. [13] were also able to achieve 100% scheduled event detection using a specific set of hyperparameters, and their method also detected numerous unscheduled events.

Table 10. Parameter estimates of ρ for ML and MAP. ML removes the count entering the building, while MAP removes the day of the week.

Algorithm  Count In  Count Out  Day     Time
ML         0.7895    0.8335     0.9952  0.9993
MAP        0.6876    0.7650     0.0848  0.5372
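For completeness, a generic log-space Viterbi decoder of the kind used for this event detection is sketched below; the emission term would be the log of the FSHMM marginal of Eq. (7) evaluated for the Poisson (counts) and non-parametric (day, time) features. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path given log initial, transition and emission terms.

    log_B : (T, I) log emission likelihoods per time step and state.
    Here, decoding to the event state flags a half-hour interval as an event.
    """
    T, I = log_B.shape
    delta = np.empty((T, I))
    back = np.zeros((T, I), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # previous state -> current state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # backtrack
        path[t] = back[t + 1, path[t + 1]]
    return path
```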

5 Conclusion

This paper presents a method for simultaneously estimating model parameters and selecting features for discrete observation HMMs. The presented feature selection method is an extension of the FSHMM formulation outlined in [7], which assumes a Gaussian distribution for the observed data. Feature saliency assumes that relevant features have feature distributions that depend on the state, while irrelevant features have distributions independent of the state. The feature saliency parameters represent the probability that the corresponding feature is relevant. The model parameters, including the feature saliencies, are estimated using the EM algorithm. EM algorithms are derived for an FSHMM with Poisson features and for an FSHMM with discrete non-parametric features, each in an ML version and a MAP version. Although not explored in this study, the MAP formulation allows the cost of features to be included in the feature selection process.

The discrete observation FSHMM was tested on synthetic data and on an event detection data set. The synthetic experiments demonstrate that the EM algorithms can accurately estimate the model parameters; however, the ML formulation overestimates the relevance of irrelevant features. This is consistent with the synthetic experiments performed in [7] using an FSHMM with Gaussian features. The experiments on the event detection data set demonstrate that (1) an FSHMM with multiple types of features can be utilized and (2) the discrete observation FSHMM can detect events at the same level as the method developed in [13].

In future work, the proposed method should be validated using a wider range of examples with greater numbers of states and features. In particular, the method should be tested in situations where the features have an associated cost, as in [7]. Further, the FSHMM formulation should be expanded to other types of conditional feature distributions, such as the gamma distribution or the exponential distribution.

References 1. Bian, J., Zhou, X.: Hidden Markov models in bioinformatics: SNV inference from next generation sequence. In: Hidden Markov Models, pp. 123–133. Springer (2017) 2. Candanedo, L.M., Feldheim, V., Deramaix, D.: A methodology based on hidden Markov models for occupancy detection and a case study in a low energy residential building. Energy Build. 148, 327–341 (2017) 3. Karg, M., Kuli´c, D.: Modeling movement primitives with hidden Markov models for robotic and biomedical applications. In: Hidden Markov Models, pp. 199–213. Springer (2017) 4. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003) 5. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J., Liu, H.: Feature selection: a data perspective. ACM Comput. Surv. (CSUR) 50(6), 94 (2017) 6. Adams, S., Beling, P.A.: A survey of feature selection methods for Gaussian mixture models and hidden Markov models. In: Artificial Intelligence Review pp. 1–41 (2017) 7. Adams, S., Beling, P.A., Cogill, R.: Feature selection for hidden Markov models and hidden semi-Markov models. IEEE Access 4, 1642–1657 (2016)


8. Law, M.H., Figueiredo, M.A., Jain, A.K.: Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1154– 1166 (2004) 9. Zhu, H., He, Z., Leung, H.: Simultaneous feature and model selection for continuous hidden Markov models. IEEE Signal Process. Lett. 19(5), 279–282 (2012) 10. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989) 11. Gauvain, J.L., Lee, C.H.: Maximum a posteriori estimation for multivariate gaussian mixture observations of Markov chains. IEEE Trans. Speech and Audio Process. 2(2), 291–298 (1994) 12. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ ml 13. Ihler, A., Hutchins, J., Smyth, P.: Adaptive event detection with time-varying Poisson processes. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 207–216. ACM (2006)

Location Tracking and Location Prediction Techniques for Smart Traveler Apps

Mohamad Amar Irsyad Mohd Aminuddin, Mohd Azam Osman(&), Wan Mohd Nazmee Wan Zainon, and Abdullah Zawawi Talib

School of Computer Sciences, Universiti Sains Malaysia, USM, 11800 Gelugor, Pulau Pinang, Malaysia
[email protected], {azam,nazmee,azht}@usm.my

Abstract. Traveling has become one of the most popular trends nowadays. People travel for business, study, vacation, religious activities and more. Nonetheless, while travelling, safety and health conditions can be a major concern to travelers as well as to their family members. Many smart traveler apps already exist for the purpose of locating a traveler; however, they may not produce accurate locations, and location prediction in these apps is largely based only on the walking speed of the traveler. This paper discusses techniques for more accurate location tracking and location prediction of a traveler. The techniques are able to track the movement of the traveler using location tracking techniques and to predict the location of a missing traveler based on parameters such as walking speed, age, gender, weather condition and health condition. The results, based on an app developed in this research, show that the techniques are more accurate and efficient. Moreover, with these techniques, a better emergency alert system can be incorporated into a smart traveler app.

Keywords: Location prediction · Smart traveler app · Location based tracking · People tracking



1 Introduction

When travelling in a foreign country, safety has always been one of the major concerns for travelers, families and friends. Based on past incidents such as violent conflicts, terrorist attacks, natural disasters and homicide, knowing the travelers' condition and location is important [1]. Examples of such incidents include the mass-shooting incident in Munich, the Mina stampede tragedy at Mecca and the Louisiana flood disaster [2, 3, 4]. When such an incident occurs, the travelers' family, relatives and friends should be able to track and communicate with the travelers. Sometimes, some travelers, especially senior citizens, need to be extra cautious about certain conditions such as changes in the weather, fatigue and an unconducive environment, since it is easier for them to be affected under these conditions. Keeping up with their health condition is very important to seniors while traveling. Hence, tracking travelers' location and communicating among travelers and with their family are very important.
Nowadays, there are a lot of ways to track and communicate with others. By using a smartphone, travelers can contact their family members and update their


conditions and whereabouts. For example, travelers can send messages through Facebook Messenger or WhatsApp. Travelers can also use a mobile tracking application to keep updating their whereabouts without the need to manually send the location information. Furthermore, if travelers happen to be in a difficult situation or are involved in any emergency, they need to notify their family and the authorities as soon as possible, since it is extremely important for these people to be informed so that they can act on the situation. For example, if someone is separated from the rest of the family members due to an accident, the family members and the authorities need to be informed immediately. This paper presents a location tracking technique and a location prediction technique that can be used in a smart traveler app.

2 Background and Related Work

2.1 Smart Apps for Travelers

Nowadays, tracking applications for travelers have been widely developed and are easily available on the Google Play and Apple App Store. Most of these applications focus on enabling the travelers to share their location with their family members and friends quickly without much hassle. Examples of such applications include Glympse – Share GPS Location by Glympse (Fig. 1), Family Locator & GPS Tracker by ZoeMob (Fig. 2) and HajjUmrah by Advanced Media Lab (Fig. 3). These applications have some similarities and differences between each other. One of the similarities is that all of them provide tracking service that enables the travelers to share their location information. Besides, these applications use map-based interface in their main screen to allow user access to the map and location information quickly. On the other hand, they do not provide support during emergency except for HajjUmrah which provides some basic contact information in case of emergency.

Fig. 1. Screenshots of Glympse – share GPS location application


Fig. 2. Screenshots of family locator & GPS tracker application.

Fig. 3. Screenshots of HajjUmrah application

2.2 Location Tracking Techniques

Location tracking is a technology that enables tracking of anyone around the world. Anyone that needs to be tracked is usually required to have a tracking device with them. There are a lot of tracking devices available currently such as the standalone tracking chip, mobile device with built-in tracker, automobile tracking system and the flight tracking system. In this paper, the proposed location tracking technique focuses more on the mobile devices with built-in tracker. Current generation of smartphones


usually have more than a single location tracking technique [5]. There are three main techniques used for location tracking in mobile devices: the Global Positioning System (GPS), GSM-based positioning and Wi-Fi positioning [6].
In the Global Positioning System, there are a number of satellites orbiting the earth to provide the location service globally. The GPS receiver communicates with a minimum of three satellites to obtain the satellites' location coordinates and the distance between the receiver and the satellites. By obtaining this information, the receiver calculates and determines its own location coordinates. GPS can provide precise coordinates with an accuracy of a few meters, making it one of the most widely used location tracking techniques. The best situation for using GPS is outdoor tracking under good weather conditions. The main disadvantage of GPS is that an accurate location cannot be provided if it receives an unclear signal from the satellites. This may happen if the receiver is used for indoor tracking or in rainy weather conditions.
Cellular network positioning, or GSM-based positioning, is a location tracking technique based on the GSM cellular network. Most mobile phones nowadays are GSM-enabled devices [6]. The coordinates of the mobile phone can be tracked based on the cell tower locations. The tower signal strength and latency information are used to determine the distance between the tower and the mobile device. Thus, with the combination of these data and the cell tower locations, the location of the mobile device can be tracked. The accuracy of this technique ranges from about 50 m up to a few miles. The main advantage of this location tracking technique is that it is simpler and faster, as it determines the location only by using the cell tower signal information. The main drawback of GSM-based positioning is that the accuracy depends on the number of cell towers surrounding the device: fewer towers surrounding a device make the location information less accurate.
The Wi-Fi positioning technique is based on the hotspot or access point that provides a network connection to the Wi-Fi enabled device. Various phone makers and phone carrier companies such as Google and Apple keep extensive lists of registered Wi-Fi hotspots and their locations globally [7]. By using the Wi-Fi signal intensity, a fingerprinting method and the hotspot location, the location of the connected device can be determined, with an accuracy of up to 20 to 30 m. This technique is best used for tracking indoors or around a hotspot area, and it is not affected by adverse weather conditions. The main disadvantage of Wi-Fi positioning is that if the device goes out of range of the Wi-Fi hotspot, location tracking stops immediately. In addition, only registered Wi-Fi hotspots can provide location information, and the hotspot list database needs to be updated regularly.
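To make the range-based positioning principle concrete, the following minimal Python sketch (not part of the original app) estimates a 2D position from known anchor coordinates and measured distances by linearizing the range equations; real GPS solves the same kind of system in 3D and additionally estimates the receiver clock bias.

```python
import numpy as np

def estimate_position_2d(anchors, distances):
    """Least-squares position fix from >= 3 anchors with known coordinates
    and measured distances (a 2D illustration of the GPS principle)."""
    anchors = np.asarray(anchors, dtype=float)
    d = np.asarray(distances, dtype=float)
    # Subtract the first range equation from the others to remove the quadratic terms
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (d[0] ** 2 - d[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(anchors[0] ** 2))
    position, *_ = np.linalg.lstsq(A, b, rcond=None)
    return position

# Anchors at (0, 0), (10, 0), (0, 10) with ranges measured from the point (3, 4)
print(estimate_position_2d([(0, 0), (10, 0), (0, 10)], [5.0, 65 ** 0.5, 45 ** 0.5]))
```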

2.3 Location Prediction Techniques

As the proposed location prediction technique focuses on walking people, this section discusses location prediction techniques for walking people. There are numerous elements that need to be considered in determining human walking speed. However, there are a few common elements that can be considered general factors affecting people's walking speed. Although these common factors are not applicable in every situation, they do provide a general basis for estimating walking speed.


The first two common factors are a person's age and gender, as mentioned by Fitzpatrick et al. [8] and by Crabtree [9]. Age and gender have been proven to have a significant impact on walking speed. For example, the elderly usually walk more slowly than younger adults. Furthermore, the weather condition is also a common factor that affects walking speed, as studied by Iraj [10]. Studies have shown that people walk differently depending on the weather. For example, people tend to walk faster in rainy weather than in normal weather conditions. Lastly, the health condition can sometimes have a big impact on walking speed. For example, a person who requires a walking stick would walk more slowly than a healthy person.

2.4 Issues

The accuracy of location tracking might be less precise, as the process of obtaining an accurate location via a mobile device can be very complex. This is due to several reasons. Firstly, when a combination of tracking technologies is used, it is tricky to determine which technology provides the most accurate location. Secondly, the calculation of those data is affected by the user's movement [2, 3]. Due to the user's movement, the location data will differ from time to time, which makes the tracked location data less accurate.
The efficiency of running mobile applications for a long time can become very low, as they require GPS and Internet data to be turned on. These two services consume a lot of battery, making it unsuitable for them to be turned on for a long time. In this paper, we present a location prediction technique that is more accurate and uses less battery resources.
The location prediction may not be useful due to certain limitations. The parameters used to determine how far the traveler could have walked from the last known location might not be valid in some cases. Moreover, location prediction focuses on determining the speed of a walking traveler. The location of travelers who are cycling or driving should not be predicted in the same manner as for a walking traveler. In this paper, we present a location prediction technique that takes into account the common factors affecting the walking speed of a traveler, such as age, gender, and weather and health conditions, as well as the travelling speed of people inside a moving vehicle.

3 Design of a Smart Traveler App

A smart traveler app was developed to incorporate the proposed location tracking and location prediction techniques. The app is a tracking application for travelers, travelers' family members and the relevant authorities. Travelers can use this application to update their locations, while the travelers' family members can receive updates on the travelers' location and health condition. The relevant authorities can monitor the traveler's condition. The app consists of three main components: the mobile application, the dashboard interface and the server. These components are integrated to form a reliable tracking and location prediction system. Figure 4 shows the system architecture, which describes the main structure, behavior and interaction between components in the system.


Fig. 4. System architecture of the smart traveler app

Using this app, traveler, family members and the relevant authorities will retrieve location data from the GPS system, get map information from the map server and continuously synchronize data with the app server. Data that are being synchronized are location information, account information and the alert notification information. Any location change or alert notification could be immediately updated or retrieved from the app server. For the communication with the map server and app server, Internet connection is required. The relevant authorities can use the dashboard interface to retrieve map information from the map server and continuously synchronize the data with the app server through the Internet connection. By using data synchronization, the authorities can monitor the travelers’ location and condition in real time. Any alert or emergency signs coming from the traveler’s side can also be retrieved immediately without any unnecessary delay. The map server is not included in the app server component since the map information is stored in a third-party map provider server (Google Maps). Thus, the mobile application and dashboard interface will request map data directly from the map server.

4 Tracking and Location Prediction Techniques

4.1 Proposed Tracking Technique

The tracking feature of the mobile application is implemented using a combination of three location tracking techniques: GPS, GSM-based and Wi-Fi tracking. In the mobile application, GPS coordinates consisting of latitude, longitude, location speed and timestamp are retrieved via the Google Awareness Snapshot API provided by the Google Play Services. This API handles the complexity of retrieving and processing data from all three location tracking techniques and provides location information


as accurate as possible. Figure 5 shows the strategy for retrieving location information. When a traveler switches on the mobile application, it starts retrieving location information from the API. Then it sends the location information to the server. Next, it sits idle, i.e., it does not retrieve the next location update for a certain amount of time. Although continuously receiving location updates would allow the tracking system to obtain a more accurate current location of the traveler, this approach is not efficient, as it drains the battery and is unsuitable for long-time usage. Thus, we propose an option to set the length of the idle time before retrieving the next location update. In a real-life situation, this length may differ based on the situation and condition. The location retrieval process also includes information on the currently detected user activity and the weather condition at the current user location.

Fig. 5. Strategy of retrieving location information
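A minimal Python sketch of this retrieve-send-idle strategy is given below; the helper names get_snapshot and send_to_server are hypothetical stand-ins for the Awareness API call and the server upload used by the app.

```python
import time

def tracking_loop(get_snapshot, send_to_server, idle_seconds=10):
    """Retrieve a location snapshot, push it to the server and then stay idle.
    idle_seconds models the user-configurable interval (10 s up to 30 min)."""
    while True:
        snapshot = get_snapshot()          # coordinates, speed, activity, weather
        if snapshot is not None:
            send_to_server(snapshot)
        time.sleep(idle_seconds)           # idle period keeps battery drain low
```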

By using the Google Awareness Snapshot API, the technique can retrieve the user's location information, which includes the location coordinates and the location speed. Code Snippet 1 shows the code for retrieving this information. From the coordinate data, the mobile application uses the Geocoder function to look up the location address name. Next, by using another two functions of the Awareness Snapshot API, the application can retrieve the detected user activity, the activity detection confidence and the weather at the user's location. The activity detection API gives the most probable detected activity (it might detect more than one activity at any given time) together with a confidence level on a scale of 0 to 100. This scale indicates how confident the API is about the detected activity. All information such as location coordinates, location time, location address, location speed, weather condition and detected activity is sent to the server. With the default settings, this location information gathering is triggered every ten seconds. However, the interval can be changed by the user to 30 s, 1 min, 10 min or 30 min.


Code Snippet 1: Mobile application code for retrieving user location information

Awareness.SnapshotApi.getLocation(mGoogleApiClient)
        .setResultCallback(new ResultCallback<LocationResult>() {
    @Override
    public void onResult(@NonNull LocationResult locationResult) {
        if (!locationResult.getStatus().isSuccess()) {
            if (getConfig(getApplicationContext(),
                    SETTINGS_LOCATION_UPDATE_NOTIFICATION, true)
                    != getInt(SETTINGS_LOCATION_UPDATE_NOTIFICATION_DISABLED)) {
                NotifyUI.display(getApplicationContext(), TST_LOCATION_SERVICE,
                        "Could not get location.", NOTIFICATION_ERROR, null);
            }
            // Call for update attempt
            hasLocation = true;
            attemptSendUpdate(true);
            return;
        }
        tempLocation = locationResult.getLocation();
        tempUpdateTime = new SimpleDateFormat(DATE_FULL_PROPER).format(new Date());

4.2 Proposed Location Prediction Technique

The location prediction technique is used to predict the location of a missing person based on the last known location. How far a traveler could possibly have walked, based on several known parameters, is shown on the map. The application and dashboard interface normally provide three sets of calculations in order to determine how far the traveler could have walked. The result of the calculation is shown on the map as a circular area indicating the furthest possible distance that the traveler could have walked. Several common factors affecting walking speed are chosen, namely age, gender, weather condition and health condition. Table 1 shows the average walking speed based on age group [8] and Table 2 shows the average walking speed based on gender and weather condition [9]. By using this information, the possible distance that a traveler could have walked in a given time can be predicted. Thus, the basic location prediction can be determined. As an example, if a 31-year-old male has been missing for about 30 min in normal weather conditions, the predicted distance that the traveler could have walked is calculated as follows:
• Calculation 1 – (30 × 60) s × 1.45 m/s = 2610 m
• Calculation 2 – (30 × 60) s × 1.10 m/s = 1980 m
So, the possible distance that the traveler could have walked is between 1980 m and 2610 m. Hence, the search for the missing traveler should be conducted within that range from the last known location. A sample GUI for the location prediction is illustrated in Fig. 6. The red circle is the area where the missing traveler could possibly have walked based on the prediction.
The above calculations are based on human walking speed data. However, for the sake of reliability, the location information retrieved using the Google Awareness Snapshot API also contains location speed data, which is the movement


speed of the device. There is also information on the detected user activity, such as walking or being in a vehicle, and the confidence level of the detection. Thus, with the location speed data, the walking speed prediction can be more reliable. Code Snippet 2 shows the partial code for the first two calculations in the mobile application.

Table 1. Average walking speed based on the age group [9]

Age group                                            Average speed (m/s)
Child (ages 0–12)                                    1.33
Teen (ages 13–18)                                    1.42
Young (ages 19–30)                                   1.46
Middle (ages 31–60)                                  1.45
Older (more than 60 but not classified as elderly)   1.34
Elderly or physically disabled                       1.03

Table 2. Average walking speed based on gender and weather condition [10]

Gender   Weather condition   Average speed (m/s)
Male     Normal              1.10
Female   Normal              0.97
Male     Rainy               1.21
Female   Rainy               1.12
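A small Python sketch (an illustration, not the app's actual code) reproduces the two-calculation band above from the values in Tables 1 and 2:

```python
AGE_SPEEDS = [(12, 1.33), (18, 1.42), (30, 1.46), (60, 1.45)]   # Table 1
GENDER_WEATHER_SPEEDS = {("male", "normal"): 1.10, ("female", "normal"): 0.97,
                         ("male", "rainy"): 1.21, ("female", "rainy"): 1.12}  # Table 2

def speed_by_age(age):
    for upper_bound, speed in AGE_SPEEDS:
        if age <= upper_bound:
            return speed
    return 1.34  # older than 60 (the elderly/disabled case would need an extra flag)

def prediction_band(age, gender, weather, minutes_missing):
    seconds = minutes_missing * 60
    d1 = seconds * speed_by_age(age)                              # Calculation 1
    d2 = seconds * GENDER_WEATHER_SPEEDS[(gender, weather)]       # Calculation 2
    return min(d1, d2), max(d1, d2)

print(prediction_band(31, "male", "normal", 30))   # (1980.0, 2610.0)
```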

Fig. 6. Sample GUI for location prediction information


Figure 7 shows the activity diagram for the complete calculation process of the location prediction. The application (mobile application and dashboard interface) obtains the location data (locally available when viewing the traveler's location on the map) when a user requests location prediction information. Based on the received data, it determines whether the location time has exceeded MAX_PREDICTION_TIME or not. MAX_PREDICTION_TIME contains the maximum value of time for which the prediction should be computed. For example, if the maximum value is one hour and the user's location time is more than one hour before the current time, the prediction will not be made. If the user's location time is within the maximum value, the prediction will be calculated. This is to ensure that the prediction does not produce an illogical result, which would happen if the user's location time was a long time ago and the user was assumed to have been walking continuously over that period. Taking a location time of five hours ago as an

Code Snippet 2: Partial location prediction implementation in the mobile application

// Check walking speed based on age (thresholds reconstructed from Table 1)
if (age >= 0 && age <= 12) {
    ws.FIRST = 1.33;
} else if (age >= 13 && age <= 18) {
    ws.FIRST = 1.42;
} else if (age >= 19 && age <= 30) {
    ws.FIRST = 1.46;
} else if (age >= 31 && age <= 60) {
    ws.FIRST = 1.45;
} else if (age > 60) {
    ws.FIRST = 1.34;    // older, but not classified as elderly
} else {
    ws.FIRST = 1.03;    // elderly or physically disabled
}
// Check walking speed based on rainy weather and gender
if (hasExtraData && (weather == Definition_Database.WEATHERCONDITION.RAINY)) {
    if (gender == 1) {
        ws.SECOND = 1.21;
    } else if (gender == 2) {
        ws.SECOND = 1.12;
    }
} else {
    if (gender == 1) {
        ws.SECOND = 1.10;
    } else if (gender == 2) {
        ws.SECOND = 0.97;
    }
}
ws.FIRST = (ws.FIRST * timeElapsed) / 1000;   // timeElapsed in ms
ws.SECOND = (ws.SECOND * timeElapsed) / 1000; // timeElapsed in ms


example, based on the human walking speed in Table 1, the possible distance for the traveler to walk would be 1.45 m per second (middle age group) multiplied by five hours. This gives a possible walking distance of 26.1 km, which is very unreliable. For the sake of demonstration, the maximum prediction time is set to one hour.
Next, when the location time is within the maximum time, the application checks for available extra information. The extra data are the location speed, the detected user activity, and the activity detection confidence. These three values are considered extra data since they might give unreliable values and should then not be included in the prediction calculation. For example, the data may show the detected user activity as 'IN_VEHICLE' (user in a vehicle) while the confidence level is only 25, which means the detected activity is most probably not correct. In the case where the detected activity is in a vehicle, the detection confidence is above 50% and the location speed is not zero, only the location speed data is used in the prediction, since the information in Table 2 might not be suitable for predicting a moving vehicle. Otherwise, the prediction is calculated as usual and is divided into three calculations. The first determines the walking speed based on the traveler's age. The second determines the walking speed based on the traveler's gender and the weather condition. The third is the walking speed determined using the location speed data. This third calculation might in some cases be unavailable, as the location speed data value may be zero.

Fig. 7. Activity diagram for location prediction calculation process
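The selection logic of Fig. 7 can be summarized by the following Python sketch; the constant and function names are illustrative and do not claim to match the app's actual identifiers.

```python
MAX_PREDICTION_TIME = 60 * 60  # one hour, as in the demonstration setting

def select_calculations(elapsed_seconds, activity, confidence, location_speed):
    """Return which of the three prediction calculations should be used."""
    if elapsed_seconds > MAX_PREDICTION_TIME:
        return []                               # location too old: no prediction
    if activity == "IN_VEHICLE" and confidence > 50 and location_speed > 0:
        return ["location_speed"]               # walking tables do not apply
    calculations = ["age", "gender_and_weather"]
    if location_speed > 0:
        calculations.append("location_speed")   # third calculation when available
    return calculations
```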


5 Testing and Evaluation Table 3 shows the test cases scenario details. Basically, all the critical functions of the system are tested and validated. However, there are some unavoidable issued in the test cases. Due to the network and mobile limitation, the transmit location test case sometimes might become failed. The location data retrieved using the API might be inaccurate sometimes since the device itself might have inaccurate location determination ability. The data that send to the server might be failure sometimes if the process happening while user is in the vehicle. This is because, sometimes, the cell tower network coverage is limited making it not very reliable to be used for sending data to server in short time while the device is moving in fast speed. Table 3. Test cases scenario Test case Transmit location (mobile application) Get location prediction (dashboard interface) Locate User (mobile application)

Activity Mobile application sends location information update to the server

Expected Result Server will receive the location update information

A user trying to view location prediction of a traveler in which the traveler’s location time was less than one hour ago Check for other users’ location information

A circle is shown on the map covering the possible walking distance area of the traveler Locations of other authorized users that are shown on the map

Status Passed. Sometimes the location data is inaccurate or fail to be retrieved Passed successfully. It shows the expected prediction information Passed successfully

Ten respondents comprising undergraduate students from the School of Computer Sciences, Universiti Sains Malaysia were involved in the evaluation of the relevant features of the smart traveler app. Figure 8 shows some snapshots of the app used in the evaluation. Figure 8(a) shows the location history of the traveler. Only the ten most recent locations are shown. If the location prediction is available, it will be shown as well. Figure 8(b) shows the current location of the traveler and the relevant authority. For the traveler and family members, the locations of the relevant authority and the traveler are shown. For the relevant authority, all travelers and relevant authorities appear on the map. The location information is shown when the user clicks on a marker. Information such as coordinates, address and location time is shown, and if the location prediction is available, it is also shown. Figure 8(c) shows the dialog screen for the traveler to send a location update. If the user has any specific request, especially for the relevant authority, the note field can be used.


Based on the responses of the subjects, the key findings of the evaluation are as follows:
• When asked if they found the location tracking facility better than some other location tracking facilities they are familiar with, 90% of them agreed while 10% disagreed.
• When asked if they found the location prediction facility efficient and effective, 70% of them agreed while 30% disagreed.
• When asked in general whether the location tracking and location prediction facilities are useful in monitoring a traveller, 80% of them agreed while 20% disagreed.
• When asked about the draining of the battery when using the app, 50% of the respondents raised the concern.

Fig. 8. Some snapshots of the app used in the evaluation (a) location history of the traveler, (b) location of current traveler and relevant authority, (c) Screen for the traveler to send location update.

6 Discussion and Conclusion

Compared to other existing apps, the location tracking and location prediction techniques proposed in this paper provide a better facility for monitoring a traveler by family members (location, health condition, finding a missing person) and also, in the case of this app, by the relevant authorities. Nevertheless, the location update process may sometimes become unstable or unreliable, as mentioned in Table 3. Even though we have attempted to update the location information only at a specified interval in order to reduce


the draining of the battery, half of the respondents raised the concern on this matter. With the proposed techniques, we are also able to provide an alert system in a smart travel app that is used to notify all the users in the event of emergency while traveling.

References

1. Reidenberg, J.R., Gellman, R., Debelak, J., Elewa, A., Liu, N.: Privacy and Missing Persons After Natural Disasters. Center on Law and Information Policy at Fordham Law School and Woodrow Wilson International Center for Scholars, Washington, DC and New York, NY, p. 128 (2013)
2. The Guardian: Munich attack: teenage gunman kills nine people at shopping centre (2016). https://www.theguardian.com/world/2016/jul/22/munich-shopping-centre-evacuated-after-reported-shooting-germany. Accessed 22 Sept 2016
3. The Guardian: Hajj pilgrimage: more than 700 dead in crush near Mecca (2015). https://www.theguardian.com/world/2015/sep/24/mecca-crush-during-hajj-kills-at-least-100-saudi-state-tv. Accessed 22 Sept 2016
4. The Guardian: Louisiana faces ongoing flood danger as Obama declares federal disaster zone (2016). https://www.theguardian.com/us-news/2016/aug/15/louisiana-flooding-disaster-obama-baton-rouge-mississippi. Accessed 23 Sept 2016
5. Wang, S., Min, J., Yi, B.K.: Location based services for mobiles: technologies and standards. In: Proceedings of the IEEE International Conference on Communication (ICC 2008), pp. 35–38 (2008)
6. Steiniger, S., Neun, M., Edwardes, A.: Foundations of Location Based Services - Lecture Notes. Department of Geography, University of Zurich, Zurich (2006). https://pdfs.semanticscholar.org/b56b/f42dd39a383c07e5eeb01498309e6e52b8af. Accessed 14 Oct 2016
7. Zahradnik, F.: Wi-Fi Positioning System (2016). https://www.lifewire.com/wifi-positioning-system-1683343. Accessed 20 Oct 2016
8. Fitzpatrick, K., Brewer, M.A., Turner, S.: Another look at pedestrian walking speed. Transp. Res. Rec. 1982, 21–29 (2006)
9. Crabtree, M., Lodge, C., Emmerson, P.: A Review of Pedestrian Walking Speeds and Time Needed to Cross the Road, p. 40. Transport Research Laboratory, London (2014)
10. Iraj, B., Vahid, N.M.G.: The effect of rainy weather on walking speed of pedestrians on sidewalks. Bullet. Teknol. Tanaman 12, 217–222 (2015)

Implementation Aspects of Tensor Product Variable Binding in Connectionist Systems

Alexander Demidovskij

National Research University Higher School of Economics, Bolshaya Pecherskaya Street 25/12, Nizhny Novgorod, Russia
[email protected]
https://www.hse.ru/staff/demidovs

Abstract. Tensor Product Variable Binding is an important aspect of building the bridge between the connectionist approach and the symbolic paradigm. It can be used to represent recursive structures in tensor form, a form acceptable to neural networks, which are highly distributed in nature and therefore promise computational benefits. However, the practical aspects of implementing tensor binding using modern neural frameworks are not covered in public research. In this work, we have made an attempt at building a topology that can perform the binding operation for the well-known framework Keras. We also analyse the proposed solution in terms of its applicability to other important connectionist aspects of Tensor Product Variable Binding. The proposed design of the binding network is the first step towards expressing any symbolic structure and operation in neural form. This will make it possible for traditional decision making algorithms to be replaced with a neural network, which brings scalability, robustness and guaranteed performance.

Keywords: Connectivism · Decision support systems · Tensor computations · Neural networks · Unsupervised learning

1 Introduction

Any knowledge should be expressed in the form of formalized structures in order to be used in mathematical models and computations. The question of how to express that knowledge is key in every symbolic and sub-symbolic, or connectivist, approach. With a certain simplification, we can think of the symbolic approach as representing traditional algorithms in which we operate with understandable structures, terms and operands, for example, multi-round decision making, auction algorithms and so on. At each step of a symbolic computation we can understand the intermediate results, as they are symbolic structures.


With the same simplification, the sub-symbolic or connectivist approach [1] can be described, for example, by neural computations during which multiple tensors (matrices, vectors) are created that are absolutely meaningless until we get the final result. There is no place for symbolic structures in this paradigm. This becomes especially important when we want to build the bridge between the connectivist and symbolic paradigms. This transition should have a sort of communication protocol, and formal knowledge plays that role.
In the decision making field we try to find a solution based on knowledge about the problem. Usually this knowledge is obtained from stakeholders who share it with us. One of the possible ways to express this knowledge is to use ontologies [2]. However, regardless of the way we express the knowledge, we need to create a hierarchy. The hierarchy is required to represent relations between elements; otherwise we lose the semantics imposed by the way those elements interact. In general, such a structure has unbounded nesting levels, which considerably complicates the creation of a representation in a vector format - a natural input for neural networks.
One simple example of such a hierarchy is natural language sentence parsing [3]. A hierarchy is essentially built into a sentence by the way the words are used and how they are connected. For example, an adjective necessarily relates to a noun, and an adverb to a verb. However, when there are several adjectives, the only way to understand which adjective relates to which noun is to look at the interconnections between them. There are multiple examples of natural language sentence parsing with the help of a hierarchy [4], where language is defined as a text and the text is processed as a set of sentences (sequences), each of which is used for structure construction.
To sum up, the built hierarchy can be considered an output of solving the task on the symbolic level. However, there are problems in interpreting this structure in the vector format, because in the end we want to perform computations in a neural network that accepts only numbers as inputs. This is where Tensor Product Variable Binding starts shining [1]. We describe its key aspects later in the paper, as well as the technical aspects of building a working network with modern means of network construction. To the best of our knowledge, this is the first attempt at applying Tensor Product Variable Binding ideas in modern neural network frameworks.
The structure of the paper is as follows. Section 2 covers the problem overview, including an analysis of existing solutions; Sect. 3 contains an example of applying Tensor Product Variable Binding to a simple structure. After that, in Sect. 4 we describe a high-level architecture of the Tensor Product Variable Binding network. Then, in Sect. 5 we describe the proposed architecture of the neural network that solves the binding task. Finally, in Sect. 6 we make conclusions and define directions for further research.

2 Problem Overview

As we have already noted, the transition between the symbolic and connectionist paradigms is a key component in making their integration possible.


This transition can be made by means of First-Order Logic (FOL). Given that expressions from FOL can be translated to some distributed sub-symbolic representation, we can translate our knowledge into a set of FOL expressions and then, using the known translation rules, transform those expressions into the vector representation [5–7]. There is a long list of different logics:
1. First-Order Logic [5]. It is important to note that the distributed representation is created from FOL expressions, which in turn contain predicates and variables.
2. Fuzzy Evidential Logic [6]. It is built around the following elements: facts, rules, a weighting scheme and a conclusion. Moreover, the expression must be split into a left and a right part.
3. Probabilistic Soft Logic [7]. In this type of logic it is expected that users define a family of theories that extend the logic, for example, to work with floating point, binary and complex values. Such theories are called Modulo Theories.
As we can see, this transition is possible and it plays a critical role in making the whole flow work. In particular, we have the problem situation, then we collect data about it from stakeholders and experts, then we need to somehow translate it into a structure and obtain its vector representation so that we can use it as an input to the neural network. In this way we can connect the world of concepts and terms with the world of numbers and activations. One of the ways of building it is the knowledge fitting mechanism [8]. A neural network is capable of building associations that construct definite structures on the basis of the training dataset. However, one of the most profound approaches to representing structures in the vector format is the Tensor Product Variable Binding approach proposed by Smolensky [1] and further described in [9].

2.1 Tensor Product Variable Binding

The main idea of Tensor Product Variable Binding [1] is representing the data in a hierarchical form containing elements of two types: role and filler.
Definition 1. Filler: an element of the structure that is characterized by the role that it plays in the structure.
Definition 2. Role: an action, function or model that the given element (filler) plays in the structure.
Definition 3. Binding: a connection between a filler and a role representing their relationship in the structure.
If we define a filler and a role in some space, we can use the rules of tensor algebra to translate the structure into some vector representation.


Therefore, we consider the structure as a system that consists of pairs {role, filler}, and each pair is represented as a tensor multiplication of the corresponding pair of vectors.
More importantly, there are local and distributed representations. Local representations are so-called one-hot encodings, where each element is represented as a vector that contains all zeros but a single one at some position. Distributed representations are, in turn, arbitrary vectors. In case we get a new element in the structure and all elements have a local representation, we need to add a new dimension with zero value to all elements in the structure in order to keep the orthogonality of the representation. For the distributed representation, the number of dimensions is kept the same. Distributiveness can be over the full space or over some sub-space [6].
It is important to note that it is possible to perform the unbinding procedure, which is, by definition, recovering the original role or filler vector from the given tensor representation of the structure and the filler or role that the element played in the structure. In other words, we can say what role a particular filler played in the structure or what filler played a particular role. This is performed with the usage of the bind space. Finally, it is possible to perform operations over the structure using only its tensor representation. This opens the door for potential computation of those manipulations with neural networks.
Tensor Product Variable Binding is not the only way to get a distributed representation of a structure. There are also such methods as Holographic Reduced Representations (HRRs), Binary Spatter Codes and so on [6]. Tensor Product Variable Binding is in the focus of the current research due to the closeness of its ideas to modern frameworks; therefore, we leave the discussion of alternative structure representation solutions out of the scope of this paper.
The general flexibility of Tensor Product Variable Binding inspired the creation of Vector Symbolic Architectures (VSAs) [9]. There are three consecutive stages in VSAs:
1. Pre-processing. During this step, the symbolic representation (structure) is translated to the sub-symbolic (vector) one. This step is usually performed once to generate the required vectors, for example, with the use of embeddings - special vectors that represent words while preserving the semantics of the context in which those words were used [10,11].
2. Sequence generation.
3. Prediction (output calculation).

2.2 Distributed Representations in Connectivist Approach

Distributed representations play a key role in sub-symbolic computations. For example, a neural network cannot work with anything but tensors [12]. Moreover, in this research we focus on data whose structure is of the same importance as the separate elements. There are various methods of obtaining distributed representations from natural language data that essentially have a structure: grounding [5], doc2vec [13] and word2vec [14].


We would like to draw attention to the mandatory inclusion of encoders and decoders in any workflow that contains sub-symbolic computations [15]. The role of the encoder is to transform the symbolic phrase into the vector form, and the role of the decoder is, on the contrary, to translate given vector representations into symbolic phrases. The main difficulty with distributed representations is that they should store not only the content of the original sequence (for example, a sentence) but also the structure included in that sequence. There are multiple cases where distributed representations are used for solving real-life problems, for example in Predication-based Semantic Indexing [16]. Usually, distributed representations have a huge dimension and this dramatically increases the computational complexity of the overall solution. There are multiple ways to reduce dimensionality, for example Singular Value Decomposition. For further analysis of the vector space, Hierarchical Clustering is usually used [17]. Finally, there is a recent effort to introduce a new entity called the Semantic Vector that is designed to be a unified part of any solution working with distributed representations [16].
To sum up, comparing the connectivist and symbolic levels, we can say that symbolic computations are serial and discrete and operations are performed in a user-friendly format, on known structures with known properties. At the same time, connectivist calculations are made over local or distributed representations [8]. Moreover, the symbolic level hugely depends on domain knowledge while neural computations, obviously, do not [18]. Interestingly, there are also some works devoted to the usage of multi-layer (deep) tensor topologies [19]. However, we strongly believe that there are still remaining questions that should be addressed, in particular, analysis of the existing theoretical architectures and the details of their adoption in terms of modern frameworks, and the applicability of such approaches to a broader set of tasks, for example in building expert systems and decision support systems.

3 Tensor Product Variable Binding Calculation Rules

Tensor Product Variable Binding is a way to transform the symbolic structure into the vector format. According to Smolensky [1] it is built on the simple tensor multiplication operation.

Definition 4. Tensor multiplication is the operation over two tensors f of rank N and r of rank M that results in a new tensor b of rank N+M such that $b_{ij} = f_i \cdot r_j$.

Definition 5. The tensor product $\psi$ for the given structure S containing pairs $\{f_i, r_i\}$ is calculated in the following way:

$$\psi = \sum_i f_i \otimes r_i \qquad (1)$$


[Fig. 1 diagram: a binary tree with root ε; its left child is the filler A (role r0), and its right child has two children, the fillers B (role r01) and C (role r11)]
Fig. 1. A simple structure for demonstration of Tensor Product Variable Binding

In order to understand the ground principles for constructing the distributed matrix manipulation mechanism in the form of a neural network, it is important to understand how the full workflow works for the simple structure (Fig. 1). Let the structure be a directed acyclic graph with three leaves A, B, C and the root ε. As we already know, the structure is needed to describe relationships between the elements. In the sample structure it is obvious that A is the left child of the root, while B and C are children of the right child of the root. We ignore that right child because it is not a filler and does not bring any value to our structure representation. However, we are still interested in the other two fillers B and C and their connections. Having defined fillers is not enough for representing the structure according to the Tensor Product Variable Binding rules, because we also need to describe roles. Again, from the diagram we can see that there are three different roles: r0, r01 and r11. Their meaning is quite simple: r0 denotes the role "left child of the root", r01 the role "left child of the right child of the root" and r11 the role "right child of the right child of the root". Semantically those roles and fillers are considered different and that should be reflected in the vector representation of those elements of the structure. Then we are free to define the vectors representing the fillers matrix F (2) and the roles matrix R (3). We need to make sure that they are at least linearly independent or, what is easier for us, orthogonal and normalized.

$$F = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \qquad (2)$$

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \qquad (3)$$

According to (1) we can translate the given structure to the vector representation by performing pairwise tensor multiplication over the fillers and roles (4) with the final sum of the terms. In particular:

$$f_1 \otimes r_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

$$f_2 \otimes r_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

$$f_3 \otimes r_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad (4)$$

As we have found the standalone tensor multiplications for all three pairs of fillers and roles in the structure, we are able to find the final tensor representation of the structure by performing a simple element-wise sum over the given matrices:

$$\psi = f_1 \otimes r_1 + f_2 \otimes r_2 + f_3 \otimes r_3 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad (5)$$

Finally, we get (5), which is the tensor representation of the given structure S (Fig. 1). In the next section we will learn how those operations can be performed in the neural computing paradigm.
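As a quick sanity check (not part of the original derivation), the same result can be reproduced in a few lines of NumPy:

```python
import numpy as np

# Filler vectors for A, B, C (Eq. 2) and role vectors for r0, r01, r11 (Eq. 3)
F = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])
R = np.array([[1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]])

bindings = [np.outer(f, r) for f, r in zip(F, R)]   # pairwise tensor products, Eq. (4)
psi = sum(bindings)                                 # element-wise sum, Eq. (5)
print(psi)   # 4x5 matrix with ones at (0, 0), (1, 1) and (2, 2)
```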

4 High-Level Theoretical Architecture for Tensor Product Variable Binding

It was already mentioned above that Tensor Product Variable Binding was proposed in [1] together with a network paradigm that can create the tensor representation of the given pairs of roles and fillers. This high-level architecture is presented in Fig. 2. We start reviewing this architecture from the case when the network performs one binding operation per given time period. Figure 2a represents such a network. As we can see, the idea is that the network has two inputs: for a filler f and a role r vector correspondingly. The network itself consists of sigma-pi units [20]. Each sigma-pi unit has one or several input sites {σ_i} that are connected with other units in the network. Each site σ_i performs a product of its input connections {I_{σ_i}}. The resulting value of the unit can be computed as a weighted sum of products from each input site (6):

$$v = \sum_{\sigma} w_{\sigma} \times \prod_{i} I_{\sigma_i} \qquad (6)$$


However, weights are ignored in the proposed architecture, so they are equal to 1 for each input site and the formula (6) is a bit simplified (7):

$$v = \sum_{\sigma} 1 \times \prod_{i} I_{\sigma_i} = \sum_{\sigma} \prod_{i} I_{\sigma_i} \qquad (7)$$
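For concreteness, a sigma-pi unit as defined by (6) and (7) can be sketched in a few lines of Python; this is an illustration, not code from the original proposal.

```python
from math import prod

def sigma_pi(input_sites, weights=None):
    """input_sites: one list of incoming signals per site sigma.
    With all weights equal to 1 this computes Eq. (7); with explicit weights, Eq. (6)."""
    if weights is None:
        weights = [1.0] * len(input_sites)
    return sum(w * prod(site) for w, site in zip(weights, input_sites))

# Two input sites, each multiplying a filler component with a role component
print(sigma_pi([[1.0, 1.0], [0.0, 1.0]]))   # 1.0
```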

A well-known advantage of neural networks is the high level of distributiveness of computations and easy scalability with a growing number of inputs, which is often called batching. There is an extension of the network that accepts two vectors (and therefore performs serial computations) to an architecture that provides simultaneous binding operations for N pairs of such vectors. This extension is presented in Fig. 2b. For simplicity, we refer to the case when the network performs two simultaneous binding operations. The overall idea of the network that consists of sigma-pi units is kept the same, although each sigma-pi unit now contains N input sites. In our case each input site receives signals from the corresponding values in the input vectors. The nice fact about this architecture is that adding new inputs and new connections to the sigma-pi units is enough to add parallelism to the network computations. In other words, we utilize natural properties of neural networks. However, a gap between theory and practice exists, so that practitioners face the problem of expressing the network architecture in terms of modern frameworks and approaches to training and inference of neural networks. The purpose of this research is to close this gap for the binding network, and we will demonstrate it in the following section.

5 Modern Architecture for Tensor Product Variable Binding

As seen in the previous section, the proposed network is rather high level and described in very abstract terms. This leaves a huge gap when someone decides to use this architecture in enterprise applications or for further research. Practically all the questions about the network inputs, outputs, mechanics and implementation details are left undescribed, although they are critical for building real-life solutions. Therefore, in this paper, we propose an implementation of the topology that performs the binding operation. The overall scheme can be found in Fig. 3. It is important to mention that we prototype the network in the Keras [21] framework, which is a superstructure over the popular frameworks TensorFlow [22] and PyTorch [23]. The main advantage of this framework is the high level of abstraction for network description when compared to other analogues. This framework is considered ideal for prototyping and is therefore our framework of choice.


Fig. 2. High-level theoretical architecture for Tensor Product Variable Binding network [1]


5.1 Topology Structure

Network Inputs. As we can see, the network accepts two inputs: one for fillers and one for roles. Each input is a batch of vectors, where the batch can vary depending on the number of roles and fillers in our structure. Using a batch is a common approach for building networks that automatically scale to the given number of input samples. Note that roles and fillers are not constrained to be of the same size, as in general they represent different spaces. We do not perform any additional manipulations with the input and let the network do it for us. In other words, we feed the network with raw data.

Preparing Inputs for Binding. The next step is preparing one of the inputs for the tensor multiplications. It is easy to see that the tensor multiplication over vectors can be expressed as a usual vector product operation with one of the vectors being transposed (8):

$$v = f_i \otimes r_i = f_i^{T} \times r_i \qquad (8)$$

This equivalence lets us avoid using the tensor product directly in the network described in terms of the Keras framework. It is extremely vital, as tensor product layers are absent in all modern frameworks. Therefore, we perform the permutation operation that switches dimensions in a way that allows us to use vector-vector multiplication with standard framework layers.

Vector Multiplication Layer. This brings us to the next layer, which has two inputs: the raw filler vectors left unchanged and the transposed role vectors. This layer performs the operation described in (8). Although this is the usual vector-vector multiplication, it was a surprise for us to find that a layer that receives two vectors and outputs their product is absent. However, as already stated, Keras has a flexible and expandable architecture, therefore we created a custom layer that computes the product of two vectors. It is a primitive Lambda layer (Listing 1.1).

Listing 1.1. Implementation of the vector-vector multiplication layer

from keras.layers import Lambda

def mulveconvec(tensors):
    # 'fillers' is the list of filler vectors defined elsewhere in the model code;
    # the product is taken per filler/role pair.
    return [tensors[0][i] * tensors[1][i] for i in range(len(fillers))]

bindingcell = Lambda(mulveconvec)

Sum Layer. This layer takes a variable number of input matrices as an input and performs the element-wise sum over them. According to the definition of the tensor product for the given structure (1), we need to perform exactly this operation to get the tensor representation.
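To illustrate how these layers fit together, the following is a minimal tf.keras sketch of a binding model. It is an assumption-laden variant of the design above: instead of summing over the batch axis, it keeps the filler/role pairs on a dedicated axis and uses Lambda layers for the outer product and the sum, so that it runs as-is with TensorFlow 2.x.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_binding_model(n_pairs, filler_dim, role_dim):
    fillers = layers.Input(shape=(n_pairs, filler_dim), name="fillers")
    roles = layers.Input(shape=(n_pairs, role_dim), name="roles")
    # Outer product per pair: (batch, n_pairs, filler_dim, 1) * (batch, n_pairs, 1, role_dim)
    pairwise = layers.Lambda(
        lambda t: tf.expand_dims(t[0], -1) * tf.expand_dims(t[1], -2))([fillers, roles])
    # Element-wise sum over the pairs axis gives the tensor representation, Eq. (1)
    summed = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(pairwise)
    return Model(inputs=[fillers, roles], outputs=summed)

# The worked example from Sect. 3: three filler/role pairs, 4- and 5-dimensional vectors
F = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=np.float32)
R = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]], dtype=np.float32)
model = build_binding_model(n_pairs=3, filler_dim=4, role_dim=5)
psi = model.predict([F[None, ...], R[None, ...]])[0]   # 4x5 matrix matching Eq. (5)
print(psi)
```

Treating the pairs as an ordinary axis rather than the batch axis is a design choice of this sketch: it keeps the model compatible with Keras batching for binding several structures at once.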

[Fig. 3 diagram: batched filler vectors (length 4 in the example) and role vectors (length 5 in the example) enter the network; the role input passes through a Permute layer (transposing the data), both feed the MulVec layer (a Keras Lambda computing the per-pair vector-vector product), and an Add layer sums the per-pair results into the final Tensor Product Variable Binding matrix of filler length by role length]
Fig. 3. Modern architecture for Tensor Product Variable Binding


5.2 Training and Inference

Once the topology is designed and described in the terms of the framework, we can move to its training and inference. However, for the given binding network we can see that it is designed for feed-forward inference without preliminary training. In point of fact, the topology does not keep any weights, which are usually trained in canonical neural models. Speaking about the inference side, it exactly follows the example that we examined in Sect. 3. The source code for further experiments can be found in the open source repository at https://github.com/demid5111/ldss-tensor-structures.

6 Conclusion

To sum up, we have demonstrated how the binding network can be implemented by means of the Keras framework and what the obstacles and limitations of modern frameworks are when they are applied to the implementation of such networks. We have also considered inference aspects. However, when analysing the proposed architecture, we see the following limitations of the design:
1. Absence of training. The network is not trainable due to the absence of weights. When it is planned to reuse existing binding values, the network has to be inferred again to obtain them. Instead, they could be trained and stored in a serialized network file.
2. Absence of any structure manipulation on the tensor level. From the perspective of the current research, it is possible to translate a structure to the tensor level; however, it is not clear what we can do with it afterwards. There are numerous potential operations that can be done over the structure: adding new children, replacing elements, removing them completely. That is why we should be able to make such transformations not only on the symbolic level, but also on the connectivist side or, in other words, on the tensor level.
3. Need for a decoder. The binding network by definition plays the role of an encoder translating the given arbitrary structure to the tensor representation. However, after the structure is encoded and changed on the tensor level, it should be translated to the symbolic form again. Otherwise it is not possible to analyse the result obtained by the neural network.
Apart from the fact that this research clearly closes the practitioner gap in implementing binding networks, we can highlight several advantages of such an architecture:
1. Scalability. The network does not depend on the number of fillers and roles. It can accept any quantity of them with the obvious requirement that each filler should have one and only one matching role. Moreover, this scalability is inherently built into the topology with the usage of the batch dimension.



2. Simplicity. The network does not contain overcomplicated parts that are hard to implement. For example, sigma-pi units can be expressed by a combination of primitive layers that do the same operations but with a clear and transparent data flow. Moreover, using the sigma-pi unit is an overcomplication as, by definition, it contains weights, but in the binding network they are ignored (set equal to 1).
3. Known language for practitioners. The binding topology architecture was proposed 30 years ago and the field has rapidly rocketed since that moment, as have the tooling and frameworks. The binding mechanism is a crucial part of Tensor Product Variable Binding and its derivatives, and it is vital to express the topology in modern terms for further development of the field.
To sum up, we consider items 2 and 3 from the list of current design limitations as concrete directions for further research. At the same time we formulate the following research questions that are still open:
1. Training in sub-symbolic systems. Do we need it to be supervised or unsupervised? Do we need labelled data, or can such networks generalize and learn patterns from raw data without a teacher?
2. Attainability of expressing any symbolic operation in terms of the neural paradigm so that the neural model produces results of acceptable accuracy. What about expressing traditional decision making approaches in the neural form? A positive answer could give a start to wide usage of connectivist ideas in dozens of actual problems, including those from the decision making domain.

References

1. Smolensky, P.: Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif. Intell. 46(1–2), 159–216 (1990)
2. Wang, H., Dou, D., Lowd, D.: Ontology-based deep restricted Boltzmann machine. In: International Conference on Database and Expert Systems Applications, pp. 431–445. Springer (2016)
3. Margem, M., Yilmaz, O.: How much computation and distributedness is needed in sequence learning tasks? In: Artificial General Intelligence, pp. 274–283. Springer (2016)
4. Dehaene, S., Meyniel, F., Wacongne, C., Wang, L., Pallier, C.: The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees. Neuron 88(1), 2–19 (2015)
5. Serafini, L., d'Avila Garcez, A.: Logic tensor networks: deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422 (2016)
6. Browne, A., Sun, R.: Connectionist inference models. Neural Netw. 14(10), 1331–1355 (2001)
7. Teso, S., Sebastiani, R., Passerini, A.: Structured learning modulo theories. Artif. Intell. 244, 166–187 (2017)

110

A. Demidovskij

8. Besold, T.R., K¨ uhnberger, K.-U.: Towards integrated neural-symbolic systems for human-level AI: two research programs helping to bridge the gaps. Biol. Inspired Cogn. Archit. 14, 97–110 (2015) 9. Gallant, S.I., Okaywe, T.W.: Representing objects, relations, and sequences. Neural Comput. 25(8), 2038–2078 (2013) 10. Blacoe, W., Lapata, M.: A comparison of vector-based representations for semantic composition. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 546–556. Association for Computational Linguistics (2012) 11. Cheng, J., Wang, Z., Wen, J.-R., Yan, J., Chen, Z.: Contextual text understanding in distributional semantic space. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 133–142. ACM (2015) 12. Rumelhart, D.E., McClelland, J.L., PDP Research Group, et al.: Parallel Distributed Processing, vol. 1. MIT Press, Cambridge (1987) 13. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014) 14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 15. Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364 (2015) 16. Widdows, D., Cohen, T.: Reasoning with vectors: a continuous model for fast robust inference. Log. J. IGPL 23(2), 141–173 (2014) 17. Frank, R., Mathis, D., Badecker, W.: The acquisition of anaphora by simple recurrent networks. Lang. Acquis. 20(3), 181–227 (2013) ¨ d’Avila Garcez, A.S., Silver, D.L.: A proposal for common dataset in 18. Yilmaz,O., neural-symbolic reasoning studies. In: NeSy@HLAI (2016) 19. Wang, H.: Semantic deep learning, pp. 1–42. University of Oregon (2015) 20. Rumelhart, D.E., Hinton, G.E., McClelland, J.L., et al.: A general framework for parallel distributed processing. Explor. Microstruct. Cogn. 1(45–76), 26 (1986) 21. Chollet, F., et al.: Keras (2015). https://keras.io 22. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man´e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Vi´egas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org 23. Paszke, A., Gross, S, Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)

Timing Attacks on Machine Learning: State of the Art
Mazaher Kianpour and Shao-Fang Wen
Norwegian University of Science and Technology, Gjøvik, Norway
{mazaher.kianpour,shao-fang.wen}@ntnu.no

Abstract. Machine learning plays a significant role in today's business sectors and governments, where it is increasingly used as a tool to support decision making and automation. However, these tools are not inherently robust and secure; they can be vulnerable to adversarial modification that causes false classification or puts system security at risk. As such, the field of adversarial machine learning has emerged to study vulnerabilities of machine learning models and algorithms and to make them secure against adversarial manipulation. In this paper, we present the recently proposed taxonomy for attacks on machine learning and draw distinctions to other taxonomies. Moreover, this paper brings together the state of the art in theory and practice needed for decision-time attacks on machine learning and defense strategies against them. Considering the increasing research interest in this field, we hope this study provides readers with the essential knowledge to successfully engage in research and practice of machine learning in adversarial environments.

Keywords: Adversarial machine learning · Timing attacks · Manipulation · Learning models
1 Introduction

A considerable body of research on cybersecurity indicates that machine learning techniques have a significant role in detecting security attacks and anomalies. Increasing interest in these techniques has led to the emergence of a wide variety of capabilities such as regression (or prediction), classification, clustering, association rule learning and dimensionality reduction. These capabilities have enabled researchers to make considerable progress towards solving difficult problems in different areas, including cybersecurity. Moreover, since machine learning is the process by which a machine can learn to make decisions without being explicitly told what to do, it is also becoming more utilized as a tool for businesses and governments to support decision making and automation.
However, these tools are not inherently robust and secure. Attackers that want to harm or evade the function of unprotected machine learning systems can do so with relative ease. The fundamental concept of machine learning in an adversarial environment is to anticipate that the adversary or attacker is trying to cause the machine learning system to fail in different ways. Consequently, the adversary pursues two goals at once: evading detection while at the same time succeeding with the attack and achieving their malicious goals.

Nevertheless, many examples show that the scope of adversarial machine learning is broader than evasion attacks. Poisoning attacks and forensic analysis of malware are two examples in which malicious actors manipulate the training data of algorithms. This can cause incorrect categorization and attribution of attacks.

Fig. 1. Machine learning process

Studying the state of the art of adversarial machine learning aims at investigating the modeling of attacks on machine learning and at developing learning techniques that are robust to adversarial manipulation. In this paper, we study a number of common problems in adversarial learning. We start with an overview of machine learning background in Sect. 2. In this section, we also discuss how machine learning techniques can be applied in adversarial environments. A taxonomy of attacks against machine learning techniques is presented in Sect. 3. Due to the rapid changes in machine learning techniques, this taxonomy may not be fully comprehensive. We continue that section by discussing the first class of this taxonomy, decision time attacks. We then proceed to discuss a number of techniques for defending against such attacks in Sect. 3.2. Finally, the paper is concluded in Sect. 4.

2 Machine Learning Background

Machine learning is a method of data analysis that automates modeling and training. It is a branch of Artificial Intelligence (AI), and the idea behind it is that systems can learn from data, identify patterns, build models, and make decisions with minimal human intervention. Figure 1 shows a schematic view of machine learning. Today, machine learning is increasingly used in security-sensitive applications and systems. In these applications, instances can be manipulated by an attacker to confound learning algorithms. This situation has led to the emergence of a competition between the designers of learning systems and attackers, and subsequently to an increase in the complexity of modern attacks and their countermeasures.


Learning algorithms take different approaches and are categorized into five types, based on their input and output data and on the problem they are intended to solve. In this section, we give an overview of these types, which will help in understanding the attacks and countermeasures in this field of study.

2.1 Supervised Learning

In supervised learning, there is a model M and a dataset D = {(xi, yi)}, i = 1, . . . , n, of feature vectors xi ∈ X ⊆ R^m, where X is the feature space, and labels yi from some label set Y. D is typically generated from an unknown distribution. It is used for training and for finding a candidate model that approximates the labels observed in this dataset. Six steps need to be performed in order to solve a given supervised learning problem (see Fig. 2). The most widely used learning algorithms in the fourth step are Support Vector Machines, Linear Regression, Naive Bayes, Decision Trees and Multilayer Perceptron Neural Networks. Each of these algorithms has its strengths and weaknesses. According to the No Free Lunch theorem, there is no algorithm that works best for all supervised learning problems [1]. The tradeoff between bias and variance, the amount of training data available compared to the complexity of the function, and the degree of noise in the desired output values are the major challenges that need to be considered in supervised learning [2].
Classification (i.e. the outputs are restricted to a limited set of values) and regression (i.e. the outputs may take any numerical value within a range) are included in supervised learning. Similarity learning is another area of supervised machine learning, closely related to regression and classification problems; however, the goal is to learn from examples by using a similarity function that measures how similar two instances are. Ranking systems, recommendation systems, visual identity tracking, etc. use similarity learning to do their tasks.
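As an illustration of the overall supervised workflow (not tied to the specific six steps of Fig. 2), a minimal scikit-learn sketch with an SVM on synthetic data might look as follows; all data and parameter choices are invented for the example.

```python
# A minimal sketch of a supervised learning workflow on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVC(kernel="rbf")          # one of the widely used algorithms listed above
model.fit(X_train, y_train)        # training on the labeled dataset D
y_pred = model.predict(X_test)     # predicting labels for unseen feature vectors
print("accuracy:", accuracy_score(y_test, y_pred))
```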

Fig. 2. Supervised learning steps


2.2 Semi-supervised Learning

In the semi-supervised learning class, only a small amount of the training data is labeled. The results of different studies show that this mechanism yields a significant improvement in learning accuracy compared to using the labeled training data alone. Moreover, these algorithms require less time and cost to solve the problems [3]. Just as in supervised learning, we have a dataset D = {(xi, yi)}, i = 1, . . . , n, of feature vectors xi ∈ X ⊆ R^m, where X is the feature space, and labels yi from some label set Y. Additionally, this dataset contains u unlabeled instances x_{n+1}, . . . , x_{n+u} ∈ X. The combination of labeled and unlabeled data in semi-supervised learning enhances the classification performance. At least one of the following assumptions is used in semi-supervised learning:
• Continuity. According to this assumption, neighboring instances are more likely to have the same label.
• Cluster. The training dataset tends to form distinct clusters, and the instances in one cluster are more likely to have the same label.
• Manifold. A topological space that locally resembles Euclidean space near each instance is called a manifold¹. In learning algorithms, the training dataset lies approximately on a manifold of lower dimension than the input space.

¹ Manifold learning algorithms build decision functions that are different along the manifolds occupied by the data. These different classes form separate manifolds, and the learning algorithms indirectly implement the cluster assumption by not cutting the manifolds [33].
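As a small illustration of how the cluster and continuity assumptions are exploited, the sketch below uses scikit-learn's label propagation on synthetic data with only a handful of labels; the dataset and the choice of labeled points are invented for the example.

```python
# A rough sketch of semi-supervised learning via label propagation
# (unlabeled instances are marked with -1).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_blobs(n_samples=200, centers=2, random_state=0)
y = np.full_like(y_true, -1)                     # start fully unlabeled
for c in np.unique(y_true):                      # keep three labels per cluster
    y[np.where(y_true == c)[0][:3]] = c

model = LabelPropagation(kernel="knn", n_neighbors=10).fit(X, y)
print("agreement with true labels:", (model.transduction_ == y_true).mean())
```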

2.3 Active Learning

Optimal experimental design, or active learning, is a special class of learning in which the learning algorithms are able to interact with users to obtain the desired outputs at new data points. Active learning algorithms can be used in situations where manual labeling is costly. Recently, researchers have been working on multi-label active learning [4] and hybrid active learning [5]. There are different scenarios in which learning algorithms may query the user. Membership Query Synthesis, Stream-based Selective Sampling and Pool-based Sampling are the three main scenarios that have been considered in the literature on active learning algorithms [6]. All of these scenarios involve the evaluation of unlabeled instances in order to find the best query (i.e. the most informative instance x_A). Below, we discuss one of the proposed query frameworks, which uses a decision-theoretic approach.
The Expected Model Change framework selects the instance that would cause the greatest change to the model if its label were known. Expected Gradient Length (EGL) is a query strategy in this framework [7]. This strategy is used in multiple-instance settings and can be applied to any learning problem that uses gradient-based training. In these problems, the change can be measured by the length of the training gradient. In the active learning task, unlabeled instances from a pool are repeatedly chosen. If the model is changed due to an outlier, the Expected Model Change framework will again select a good instance that can maximize the change in the next data selection round, so the negative effect of the outlier will be minimized. In practice, the number of outliers is usually very small. Hence, this framework leads to good generalization ability as more data are sampled.
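A rough sketch of the EGL strategy for a binary logistic model follows; for the logistic loss the gradient with respect to the weights is (p − y)x, so the expected gradient length reduces to 2p(1 − p)‖x‖. The pool, model and data below are illustrative only.

```python
# A rough sketch of the Expected Gradient Length (EGL) query strategy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
labeled, pool = np.arange(20), np.arange(20, 300)   # small labeled set, large pool

model = LogisticRegression().fit(X[labeled], y[labeled])
p = model.predict_proba(X[pool])[:, 1]              # P(y = 1 | x) for pool instances

# For log-loss the gradient w.r.t. the weights is (p - y) * x, so the expected
# gradient length over the unknown label y is 2 * p * (1 - p) * ||x||.
egl = 2 * p * (1 - p) * np.linalg.norm(X[pool], axis=1)
query = pool[np.argmax(egl)]                        # most informative instance x_A
print("instance selected for labeling:", query)
```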

2.4 Unsupervised Learning

When the training dataset is not labeled (D = {xi}), classified or categorized, we can use unsupervised learning algorithms. These algorithms are widely used for density estimation in statistics, where an estimate of an underlying probability density function is constructed based on the observed data. In contrast to supervised learning methods, unsupervised learning methods must learn the relationships among the instances in a dataset and classify them without help. To find these relationships, many different algorithms have been proposed. Figure 3 categorizes some of the most common unsupervised learning algorithms based on their application.
Clustering is the best-known example of an unsupervised learning problem. In these problems, the feature vectors are divided into several subsets S. Each feature vector in a subset s ∈ S is close to the mean feature vector of s. Clustering can be modeled as the following optimization problem:

min_{S,μ} Σ_{s∈S} Σ_{i∈s} ℓ(xi, μs),   (1)

where μs is an aggregation measure (e.g. the mean) of the data in set s. Image compression and document clustering are two main applications of clustering. Another application of clustering is in bioinformatics, to learn motifs. Motifs are sequences of amino acids that occur repeatedly in proteins.
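A minimal k-means-style sketch of the objective in (1), with ℓ the squared Euclidean distance and μs the mean of each subset, is given below; the data and the number of clusters are illustrative.

```python
# A minimal k-means-style sketch of the clustering objective in Eq. (1).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]

for _ in range(20):
    # assign each feature vector to the nearest centroid (the sets in S)
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    # update each mu_s as the aggregation (mean) of its set
    centroids = np.array([X[labels == s].mean(axis=0) if np.any(labels == s)
                          else centroids[s] for s in range(k)])

objective = sum(((X[labels == s] - centroids[s]) ** 2).sum() for s in range(k))
print("value of the objective in Eq. (1):", objective)
```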

Fig. 3. Most common algorithms used in unsupervised learning

2.5 Reinforcement Learning

Reinforcement learning (a.k.a. approximate dynamic programming or neuro-dynamic programming) is another class of machine learning algorithms, whose mathematical foundation is the Markov Decision Process [8]. In these algorithms, the learner is a decision-making agent in an environment. The agent receives rewards or penalties for its actions while trying to solve the problem. After a set of runs, the agent learns the best strategy, which is the sequence of actions maximizing the total reward.
To formulate the value function and Q-function in reinforcement learning, consider a tuple [S, A, T, r, δ]: S is a finite set of states, A is a finite set of actions, T the transition dynamics, r the expected reward function and δ ∈ [0, 1) the discount factor. The value function gives the optimal discounted sum of rewards that can be obtained:

V(s) = max_a ( r(s, a) + δ Σ_{s'} T^a_{ss'} V(s') ),   (2)

where

T^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }.   (3)

On the other hand, the Q-function is the discounted reward for taking an action a in state s:

Q(s, a) = r(s, a) + δ Σ_{s'} T^a_{ss'} V(s').   (4)

The reinforcement learning process starts with an arbitrary value function that converges to the true value function. The learner only knows S, A, and δ. A number of algorithms have been proposed for reinforcement learning, of which Q-learning is perhaps the best known. In this algorithm, the Q-function starts arbitrarily. In iteration i + 1, the agent in state si takes an action ai, and the Q-function is updated as follows:

Q_{i+1}(si, ai) = Q_i(si, ai) + β_{i+1} ( r_{i+1} + δ max_a Q_i(s_{i+1}, a) − Q_i(si, ai) ).   (5)
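A small tabular Q-learning sketch implementing the update (5) on a toy chain environment follows; the environment, learning rate and exploration scheme are illustrative and not part of the original text.

```python
# A small tabular Q-learning sketch implementing the update in Eq. (5).
import numpy as np

n_states, n_actions = 5, 2          # toy chain; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions)) # the Q-function starts arbitrarily
delta, beta = 0.9, 0.1              # discount factor and learning rate

rng = np.random.default_rng(0)
for _ in range(500):
    s = 0
    while s != n_states - 1:        # reward 1 only on reaching the last state
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Eq. (5): Q(s,a) <- Q(s,a) + beta * (r + delta * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += beta * (r + delta * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))             # learned greedy policy for the non-terminal states
```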

2.6 Machine Learning Algorithms in Adversarial Environment

In this section, we summarize the state of the art of using machine learning algorithms in adversarial environments. The field of adversarial machine learning has emerged to study vulnerabilities of machine learning approaches and attacks against them in adversarial environments. The aim of this field is to develop techniques that make learning algorithms robust to adversarial manipulation.
Regression Learning. Bojarski et al. [9] and Chen and Huang [10] presented two examples of using regression learning in adversarial environments in the context of self-driving cars. In these examples, a parametric controller learns from observations of actual control decisions. The attacker may manipulate the images captured by the vision system and introduce errors into the predicted outputs.
Classification Learning. Spam email filtering [11], credit fraud detection [12], and malware detection [13] are the main examples of applying binary classification in adversarial environments. In all these examples, the attacker avoids being detected and tries to manipulate the data to appear normal to the detectors.


Clustering. In [14], a new scalable clustering system is proposed for HTTP-based malware forensic analysis. In this system, various malware samples are clustered to identify where malware polymorphism has been used to hide the malware from detectors such as anti-virus tools, or to find the malware authors.
Anomaly Detection. The ultimate goal in anomaly detection is to identify anomalous behavior and determine whether this behavior is due to a system fault or an attack. Intrusion detection systems are the most well-known concrete examples used in adversarial environments. Centroid anomaly detection is a general anomaly detection approach that uses the training dataset D (in unsupervised learning) to obtain a mean μ = (1/n) Σ_{i∈D} xi and detects an instance x as anomalous if ‖x − μ‖_q^q ≥ r, where r is the threshold used to limit the false positive rate below a determined level. In this approach, μ can be updated online as new data arrive. Other examples of anomaly detectors are presented in [15] and [16]. In the former, Principal Component Analysis (PCA) is used to identify anomalous traffic flows in the network, and the latter uses an n-gram approach to determine anomalies in a Network Intrusion Detection System (NIDS).
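A minimal sketch of the centroid anomaly detector just described, with q = 2 and the threshold r chosen from a target false-positive rate on the training data, is given below; the data are synthetic.

```python
# A minimal sketch of the centroid anomaly detector: the mean of the training
# data is the centroid, and points farther than a threshold r are anomalous.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0, 1, (1000, 4))          # unlabeled training dataset D
mu = train.mean(axis=0)                      # mu = (1/n) * sum_i x_i

dist = np.linalg.norm(train - mu, axis=1)
r = np.quantile(dist, 0.99)                  # limit false positives to ~1%

def is_anomalous(x):
    return np.linalg.norm(x - mu) >= r

print(is_anomalous(np.zeros(4)))             # a typical point  -> False
print(is_anomalous(np.full(4, 10.0)))        # a far-away point -> True
```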

3 Timing Attacks on Machine Learning

In the previous section, we described the five major classes of machine learning and how they can be instantiated in adversarial environments. In this section, we address adversarial machine learning in more detail. Moreover, we investigate how such adversarial settings introduce vulnerabilities into regular learning approaches. We present a general categorization of attacks in the context of machine learning and provide a detailed description of specific attacks. This classification is along the following dimensions:
Timing. In modelling the attacks, the time at which an attack occurs is crucial. Considering when the attack takes place leads to the division of attacks on machine learning into attacks on models and attacks on algorithms. Attacks on models assume that the model has already been learned; in order to cause the model to make incorrect predictions, the attacker changes its behavior or the observed environment. Attacks on algorithms, in contrast, take place before the model is trained, and the attacker modifies part of the data that is used for training.
Information. The information that an attacker has about the learning model or algorithm is another important issue in modeling the attacks. In this category, the attacker might have full information about the model or algorithm, or limited or no information about them. We refer to these two situations as white-box and black-box attacks, respectively.


Goals. Evading detection, reducing the accuracy of the algorithms, etc. might be one or more goals of the attacker. The attacker's goals can be differentiated into two classes: targeted attacks and attacks on the reliability of the learning methods. In the former class, the attacker aims to cause a mistake on specific instances of a specific nature. As an example, they can cause a learning function to predict an incorrect label for an instance. In the latter, the attacker degrades the perceived reliability of the learning function by increasing the prediction error.
The first taxonomy of attacks on machine learning was presented by Barreno et al. [17]. In 2010, they presented a more comprehensive taxonomy of these attacks [18]. Their taxonomy also considers three aspects of attacks; however, there are some differences compared to the taxonomy presented in this paper. Attacker influence, security violation and specificity are the dimensions of their taxonomy. Causative attacks and exploratory attacks are identical to poisoning attacks and decision time attacks, respectively. Figure 4 summarizes these dimensions of attacks on machine learning. In the remainder of this section, we discuss the Timing dimension in more detail. The other dimensions are beyond the scope of this paper.

3.1 Timing Attack Strategies

As we mentioned earlier, timing attacks are divided into Decision Time and Training Data attacks. Adversarial evasion of spam, phishing, and malware detectors, which are trained to classify normal and malicious instances, provides some examples of decision time attacks on machine learning. In these attacks, the attacker manipulates the nature of the objects (e.g. introducing word misspellings or substituting code) in order to cause misclassification of instances.

Fig. 4. Dimensions of attacks on machine learning

One of the key challenges in making machine learning algorithms robust to decision time attacks is modeling these attacks. The attacker must keep a balance between introducing sufficient manipulation into the object to cause erroneous classification and limiting the changes they make in order to maintain malicious functionality.

In this paper, we discuss common mathematical models of decision time attacks in two dimensions: white-box attacks and black-box attacks. We start with attacks on binary classifiers, so-called evasion attacks. Then, we generalize to multiclass classifiers, anomaly detection and regression. Modeling decision time attacks enables us to understand them fundamentally.
Decision time attacks are against machine learning models. A linear support vector machine, as an example, yields a linear classifier f(x) = sgn(w^T x), where w is the vector of feature weights. In decision time attacks, we only care about f(x), not the algorithm that has produced it. An attacker is associated with a particular behavior or object, which normally is labeled as malicious by the learned model. Therefore, the attacker makes some changes to that behavior or object to fulfill their malicious objective. The attacker's knowledge about the learned model is a major challenge in modeling these attacks. First, we assume that the attacker knows the model that they are attacking (i.e. white-box attacks). Below, we list three constructs that white-box evasion attacks on binary classifiers start with [19]:
• Classifier f(x) = sgn(g(x)) for some scoring function g(x).
• Adversarial feature vector x^A, which corresponds to the feature characteristics of the attacker's behavior or object.
• Cost function c(x, x^A) assigning a cost to an attack which is characterized by feature vector x.
Considering these constructs, evasion attacks, which are targeted attacks, aim to appear normal to the classifier (i.e. g(x) ≤ 0 or f(x) = −1) and to minimize c(x, x^A). Distance-based cost functions are the most common way to model evasion costs. These functions are based on a measure of distance in feature space between the modified and the original feature vector. However, these functions fail to capture equivalence among attack features. This limitation is discussed in more detail for spam detection [11]. After choosing the cost function, the next challenge for the attacker is the tradeoff between appearing normal to the classifier and minimizing the evasion cost. This problem is formulated as

min_x [ min{g(x), 0} + λ c(x, x^A) ],   (6)

where λ trades off the importance of appearing normal against the cost. Another modeling approach, presented in [20], assumes that appearing normal is the only goal of the attacker. Several variations of this tradeoff's optimization problem have been presented; however, finding the solution of these problems is another step that the attacker should take. Assuming that g(x) and c(x, x^A) are convex in continuous x ∈ R^m, all of the attacker's decision problems are convex and can be solved using standard convex optimization techniques. However, in reality, optimization problems are often not convex. Hence, a locally optimal solution can be obtained using techniques like gradient descent.
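A rough sketch of a gradient-based white-box evasion attack is given below. For the sketch we use a hinge-style surrogate max{g(x), 0} in place of the min{g(x), 0} term of (6), so that plain gradient descent makes progress from the malicious region; the linear scorer, x^A, λ and step size are all invented for the example.

```python
# A rough sketch of a decision-time evasion attack against a known linear
# scorer g(x) = w.x + b, minimizing max{g(x), 0} + lambda * ||x - x_A||^2.
import numpy as np

w, b = np.array([1.0, 2.0, -0.5]), 0.3        # attacker's knowledge of g(x)
x_A = np.array([2.0, 1.0, 0.0])               # original malicious feature vector
lam, step = 0.05, 0.05

def grad(x):
    g = w @ x + b
    grad_hinge = w if g > 0 else np.zeros_like(w)   # subgradient of max{g(x), 0}
    return grad_hinge + lam * 2.0 * (x - x_A)       # gradient of the cost term

x = x_A.copy()
for _ in range(1000):                         # locally optimal solution only
    x -= step * grad(x)

print("score before:", w @ x_A + b, " after:", w @ x + b)   # pushed toward g <= 0
print("evasion cost ||x - x_A||^2:", float(np.sum((x - x_A) ** 2)))
```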


Generally, achieving an optimal solution for this problem is difficult. An alternative approach is to use approximation algorithms. Lowd and Meek presented an algorithm that approximates the optimal solution to within a factor of 2.
Binary classifiers are just one context of decision time attacks. Now, suppose that the attacker wants instance x^A to be labeled as their target class c; accordingly, the optimization problem is modelled as

min_x c(x, x^A),   (7)

such that f(x) = c. In this model, we can use the same kind of cost function as in binary classifiers. Multiclass classifiers can also be represented as

f(x) = arg max_y g_y(x).   (8)

In this case, the optimization problem transforms to

min_x [ −g_c(x) + λ c(x, x^A) ].   (9)

Evasion attacks are not limited to classifiers (supervised learning); they can also be conducted against anomaly detection systems (unsupervised learning). While there are significant differences between these two kinds of algorithms, [21] shows that decision time attacks on anomaly detectors are identical to attacks on binary classifiers.
A decision time attack on a model learned using reinforcement learning is an attack on the policy π(x), mapping an arbitrary state x into an action a. The attacker aims to change x to lead the defender to make a poor action. The attacker targets a particular state to transform it into another state during the attack. Therefore, the attacker wants π to take a target action a_t(x). As we discussed in Sect. 2.5, the Q-function Q(x, a) is defined as the expected reward in state x if action a is taken and an optimal policy is followed thereafter. If we define

ā = arg min_a Q(x, a),   (10)

then minimizing the expected reward is equivalent to the target action a_t(x) = ā. To model attacks on reinforcement learning we need to think of optimization problems and their solutions. The author in [22] shows that targeted attacks on reinforcement learning are equivalent to targeted attacks on multiclass classifiers. To prove this claim, they assumed a greedy policy with respect to the Q-function. This policy is defined as

π(x) = arg max_a Q(x, a).   (11)

Now if we define g_a(x) = Q(x, a), then

π(x) = arg max_a g_a(x),   (12)

which is identical in form to the decision time attack on multiclass classifiers that we mentioned earlier.


Assuming that the attacker knows everything about the system, learning models and algorithms is unrealistic. In black-box attacks, attackers may have only partial information about the learning models. There are two main questions that need to be addressed in black-box attacks: (1) what the attacker can achieve from the partial information that they may have; and (2) how the way the attacker obtains this information can be modelled. The authors in [23] have answered the first question by presenting a comprehensive taxonomy of these attacks, centered around the information that attackers may have about the model and the feature space used. The authors in [24] have answered the second question by proposing a framework that shows how an attacker can obtain information about the learning model. One of the advantages of this framework is that it does not approximate the classifier f(x). The model only asks multiple specific questions, which in some cases are costly.
Another class of timing attacks is attacks that target the learning algorithms by direct manipulation of the data used for training. The authors in [25] categorize data poisoning attacks into four categories based on the attacker's capability to modify training data and on the time of attack:
• Modification of Labels. In supervised learning datasets, the attacker can modify a limited number of the labels. The label flipping attack is a common form of label modification attack that is specific to binary classifiers.
• Insertion of Poisoned Feature Vectors. In this attack, the attacker adds a limited number of poisoned feature vectors based on their threat model. In supervised learning, the attacker uses labels that they are likely to control. In unsupervised learning, where there are no labels, the attacker can directly poison the feature vectors.
• Modification of Data. The attacker modifies feature vectors and labels of the training data.
• Frog-Boiling Attacks. In both supervised and unsupervised learning settings, the defender may retrain a model in different iterations. This process provides the attacker with an opportunity to inject poison and have a malicious impact in each iteration.
In poisoning attacks, the attacker starts with a fresh training dataset D0 and tries to transform it into another dataset D, which is used to train the learning algorithm. Similar to decision time attacks, a data poisoning attacker should consider a tradeoff between achieving their goal and minimizing the cost. This tradeoff can be modelled as

min_D R_A(D, S) + λ c(D0, D),   (13)

where the risk function R_A(D, S) depends on the learning parameters obtained by training the model using D. Another approach to model this concept is

min_D R_A(D, S)   such that   c(D0, D) ≤ C.   (14)

In this equation, D0 = {(xi, yi)} and C is the attacker's budget. The cost of flipping the label of instance i is ci, and zi denotes the decision to flip the label of i (i.e. zi = 1 flips the label and zi = 0 does not flip it). The optimization problem of this attack is a bi-level problem for which Xiao et al. have proposed an approximate solution [26]. Another important type of poisoning attack is the insertion of a set of data into the training feature vectors [27]. In this attack, the attacker cannot choose the labels assigned to the inserted data. Let D0 be the original training data. The attacker inserts an instance (xc, yc) into this dataset and transforms it into D.
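As a toy illustration of label flipping under a budget constraint in the spirit of Eq. (14) (with unit costs ci = 1), the sketch below flips the C labels a clean model is most confident about and retrains. This greedy heuristic is only illustrative and is not the bi-level formulation solved by Xiao et al. [26]; the dataset and budget are invented.

```python
# A toy sketch of label-flipping poisoning under a budget constraint.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=2)
clean = LogisticRegression().fit(X, y)

C = 20                                             # attacker's budget
conf = clean.predict_proba(X)[np.arange(len(y)), y]
flip = np.argsort(conf)[-C:]                       # z_i = 1 for these instances

y_poisoned = y.copy()
y_poisoned[flip] = 1 - y_poisoned[flip]            # transform D0 into D

poisoned = LogisticRegression().fit(X, y_poisoned)
print("clean model accuracy:   ", clean.score(X, y))
print("poisoned model accuracy:", poisoned.score(X, y))
```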

3.2 Defense Strategies

In the previous section, we investigated several classes of decision time attacks on machine learning models, as well as data poisoning attacks on machine learning algorithms. In this section, we discuss approaches that can defend against such attacks. The focus of this paper is on supervised learning; defense mechanisms for other learning paradigms are beyond the scope of this paper. We start with defenses against decision time attacks, and then discuss defense mechanisms against data poisoning. Below we express the objective of decision time attacks, that is, a learning function f, which

cov(ta, tb) > 0 for all θ ≥ 0, for all a, b, where a and b are each either a site in V or an edge in E. In other words, a statistic t that defines a family of equilibrium Gibbs distributions is positively correlated if the covariance between any two components of t is positive. The significance of whether the covariances of t are positive or negative lies in the relationship between the covariances and the entropy [8]

H(X; θ) = − Σ_x p(x; θ) log p(x; θ).

Namely, the change in entropy due to an incremental increase in the direct bias θi is

∂H(X; θ)/∂θi = − Σ_{j∈V} θj cov(ti, tj) − Σ_{{l,k}∈E} θl,k cov(ti, tl,k),

and similarly for social biases. One can see that if t is positively correlated, that is, cov(ti, tj) > 0 for all j ∈ V and cov(ti, tk,l) > 0 for all {k, l} ∈ E, then entropy decreases monotonically in θ ≥ 0 [19]. Decreasing entropy corresponds to increasing concentration of preference, which is akin to the phenomenon commonly referred to as a cascade or social contagion [10,15,18,28]. Griffiths [13] showed that for the binary choice problem considered in this paper, if tij(1, 1) = 1 for all {i, j} ∈ E, and either ti(1) = 1 for each i or ti(1) = −1 for each i, then t is positively correlated. That is, if the social biases are all coordinating and the direct biases are of uniform polarity, then t is positively correlated.

In Sect. 6 we show that statistics defining families of equilibrium Gibbs distributions may be positively correlated with anti-coordinating social biases and direct biases of non-uniform polarity. Moreover, if all social biases are coordinating, yet the direct biases are of non-uniform polarity, then the statistic t is not positively correlated. In Sect. 7, we discuss results that connect the combinatorial pattern of anti-coordinating social biases and the polarities of the direct biases with whether the statistic t is positively correlated or not. Note that positive correlation is a universal property: it says that cov(ti, tj) > 0 for all j ∈ V and cov(ti, tk,l) > 0 for all {k, l} ∈ E. On the other hand, to show that a statistic t is not positively correlated, one merely has to show the existence of a j ∈ V for which cov(ti, tj) ≤ 0, or a {k, l} ∈ E for which cov(ti, tk,l) ≤ 0.

5 Equivalence Classes of Dynamics Models

In this section we discuss how symmetry of social biases and uniformity of direct biases determine the Gibbs equilibrium to which dynamics converge. In particular, we are interested in whether the mapping from dynamics models to equilibrium models induces an equivalence relation on the set of dynamics models. There is a speculative component to this discussion, and while the speculation is based on the theoretical analysis by Godreche [12] and on the numerical analysis presented in this section, more results are needed for confirmation. Nevertheless, it is believed that an appreciation for the broader program will stimulate interest in this question. Figure 1 illustrates the speculated relationship between dynamics and equilibrium models.
Let us now recall the four classes of dynamics models, indicated at the end of Sect. 2. With consumers connected in a cycle, the probability of consumer i selecting alternative xi ∈ {−1, 1}, conditioned on the preferences x∂i(t) at time t, is

p(xi | x∂i(t)) ∝ exp{ (θs + Δ) xi xi−1(t) + (θs − Δ) xi xi+1(t) + θi xi }.   (7)

We ask the following two questions. One, whether the social biases are symmetric or asymmetric, that is, whether Δ = 0 or Δ > 0. And two, whether the direct biases are uniform with θi = θd for all i. Blume [5] showed that when Δ = 0, the dynamics (2) converge to the equilibrium Gibbs distribution

p(x; θ) ∝ exp{ Σ_{i∈V} θi xi + θs Σ_i xi xi+1 },   (8)

regardless of whether the θi are uniform.


Fig. 1. Diagram illustrating the speculated relationship between dynamics models and equilibrium models. In particular, how asymmetry in social biases and uniformity of direct biases induce equivalence classes on the set of dynamics models, the relationship being the equilibrium Gibbs distribution to which each member of a given equivalence class converges. (a) An illustration of two parameters that can be used to characterize the set of dynamics models on a cycle. For instance, the right portion of (a) indicates the space of dynamics models where the asymmetry Δ is zero, which vary from one another in the direct biases. The left portion of (a) indicates a dynamics model that has asymmetry Δ > 0 and the same direct biases as the dynamics model indicated by the top network in (a). The dynamics model indicated by the bottom network of (a) can be obtained from that of the network in the middle of (a) by changing both the asymmetry of the social biases and the direct biases. In (b) is indicated a set of equilibrium Gibbs models on the cycle with uniform social biases θi,i+1 = θs for all i.

Godreche [12] has shown that if θi = θd for all i, then for all Δ > 0, the dynamics (7) likewise converge to the symmetric equilibrium Gibbs model, i.e.,

p(x; θ) ∝ exp{ θd Σ_{i∈V} xi + θs Σ_i xi xi+1 }.   (9)

Now consider the case that Δ > 0 and θi ≠ θj for some i, j on the cycle. For example, suppose θs = 0.8 and Δ = 0.2, so that θi+1→i = 1 and θi−1→i = 0.6 for all i. The direct biases are θ4 = 2 and θi = 0 for i ≠ 4. A priori, it is not clear whether these dynamics will converge. However, we can use Monte Carlo samples to estimate parameters of an equilibrium Gibbs model on a cycle. We numerically fit parameters for an equilibrium Gibbs model on the cycle from Monte Carlo samples of (7), using minimum conditional description length (MCDL) estimation [20] to estimate the social and direct biases of an equilibrium Gibbs model. Figure 2(a) shows the average estimation errors |θ̂i − θi| and |θ̂i→i±1 − θs|.


Fig. 2. For Glauber dynamics with uniformly asymmetric social biases, i.e., θi+1→i = 1 and θi−1→i = 0.6 for all i, θ4 = 2, θi = 0 for i ≠ 4: (a) symmetric MCDL estimation error, and (b) estimated equilibrium direct biases.

We can see that MCDL estimates of the social biases converge to the symmetric values, which is consistent with Godreche's findings for the case of no direct biases. On the other hand, we see that MCDL estimates of the direct biases do not converge to the true values. Indeed, Fig. 2(b) shows direct biases for both the Glauber dynamics (2) and the steady-state Gibbs equilibrium of the form

p(x; θ) ∝ exp{ Σ_{i∈V} θ'i xi + θs Σ_i xi xi+1 },

where θ'i ≠ θi, to which the Glauber dynamics converge. The theoretical results of Godreche and the above numerical results suggest an equivalence relation on dynamics models where all dynamics models within a given equivalence class converge to the same equilibrium Gibbs model. This is illustrated and discussed in Fig. 1.
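The sketch below simulates the asymmetric Glauber dynamics (7) on a small cycle with the parameter values of this example (θs = 0.8, Δ = 0.2, θ4 = 2, remaining direct biases zero); only the sampler is sketched, not the MCDL estimation used for Fig. 2.

```python
# A small sketch of the asymmetric Glauber dynamics in (7) on a cycle.
import numpy as np

N, steps = 10, 200_000
theta_s, delta = 0.8, 0.2
theta = np.zeros(N)
theta[4] = 2.0

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=N)
avg = np.zeros(N)

for t in range(steps):
    i = rng.integers(N)
    # local field from (7): (theta_s + Delta) x_{i-1} + (theta_s - Delta) x_{i+1} + theta_i
    field = (theta_s + delta) * x[i - 1] + (theta_s - delta) * x[(i + 1) % N] + theta[i]
    x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1
    if t >= steps // 2:                       # discard the first half as burn-in
        avg += x

print("empirical mean preference per consumer:", avg / (steps - steps // 2))
```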

6 Frustrated Families of Equilibrium Models

The previous section posited an equivalence relation on dynamics models, determined by the equilibrium Gibbs model to which the dynamics models within an equivalence class converge. One can induce an equivalence relation with a coarser criterion for two dynamics models to be related. Rather than require that two dynamics models converge to the same equilibrium model, one can simply require that they converge to two equilibrium models within the same family of equilibrium models. It is the polarity of the statistic t that defines a family of equilibrium Gibbs models, under the convention that the exponential parameter θ is positive.

Fig. 3. Top row: direct biases of equal polarity; Bottom row: direct biases of opposing polarity. Left column: even number of anti-coordinating social biases; Right column: odd number of anti-coordinating biases.

In this section we induce a partition of the set of equilibrium models by the presence or absence of a set of patterns within the polarity of t. Depending on the presence or absence of this pattern in the polarity of t, the family of equilibrium models will be classified as frustrated or non-frustrated. We define frustration as the absence of a so-called ground state configuration, a particular configuration of preferences that maximizes the exponent of (3). Specifically, a ground state is a configuration x that simultaneously maximizes all components of the statistic t. That is, it is a configuration in which neighbors that coordinate preferences have the same preference; neighbors that anti-coordinate preferences have opposite preferences; and the preference of each consumer aligns with his direct bias. Such a ground state is invariant to a positive scaling of the direct and social biases, and as such, is a property of the statistic t defining the family of Gibbs equilibria.
Figure 3 illustrates a chain graph with non-zero direct biases at each endpoint. There are four cases to consider to determine frustration for such a network. One, whether or not the direct biases are of the same polarity. Two, whether there are an even or an odd number of anti-coordinating social biases on the chain. Figure 4 illustrates a cycle with two non-zero direct biases. For such a network, there are six cases to consider to determine frustration. One, whether the direct biases are of the same polarity. Two, whether there are an even or an odd number of anti-coordinating social biases along each of the two paths connecting the direct biases on the cycle.
Notions of anti-coordination and frustration were discussed in the context of social interactions explicitly for the first time by Bramoulle [6]. However, the study of anti-coordination and frustration within Ising models, in the context of so-called anti-ferromagnetism, has a long history [17,27]. In such models, a cycle within a larger network is considered frustrated if there are an odd number of antiferromagnetic interactions along the cycle. Owing to the correspondence between Ising models and social choice problems established by Blume [5], one can argue that the earlier concept of frustration as investigated by statistical physicists [17,27] is in fact the initial formulation of frustration for social interactions. Earlier examination of frustration within statistical physics has considered models with uniform direct biases corresponding to a uniform magnetic field.


Fig. 4. Top row: direct biases of equal polarity; Bottom row: direct biases of opposing polarity. Left column: even number of anti-coordinating social biases on each path; Middle column: odd number of anti-coordinating biases on each path; Right column: one path with an odd number of anti-coordinating social biases, the other path with an even number of anti-coordinating social biases.

Under this assumption, the absence of a ground state occurs only when there are an odd number of anti-coordinating social biases around a cycle. Another implication is that frustration cannot occur on an acyclic graph. However, one can see from Fig. 3(b) that frustration can occur without cycles if there are an odd number of anti-coordinating social biases connecting consumers whose direct biases are of the same polarity. Conversely, one can see from Fig. 3(c) that frustration also occurs when an even number of anti-coordinating social biases connects consumers whose direct biases are of different polarity.
The concept of coordinating social biases and direct biases of non-uniform polarity has been studied extensively within the framework of so-called battle of the sexes games [24]. In [7] it was pointed out that such interactions with opposing preferences involve an inherent tension due to "unfairness", as not everybody will be satisfied. Therefore, it is not surprising that, as we show in the next section, direct biases of opposite polarity manifest mathematically similarly to anti-coordinating social biases.

7 Frustrated and Positively Correlated Families of Equilibrium Models

This section presents the main technical results of the paper. It introduces the concepts necessary for a technical discussion and analysis of the relationship between frustration and positive correlation of a statistic t. As discussed in Sect. 6, frustration of t pertains to the pattern in the polarity of t, which refers to the pattern of coordinating and anti-coordinating social biases and the polarity of the direct biases. In particular, frustration of t is determined by the polarities of the social biases along paths connecting nearest sites with non-zero direct bias.

By nearest sites, we mean that all sites on the path except for the endpoints have zero direct bias. As discussed in Sect. 4, positive correlation of t concerns whether pairwise covariances between components of t are positive or negative under non-negative scalings θ ≥ 0. Moreover, whether such covariances are positive or negative is important as this determines whether entropy increases or decreases in response to incremental changes in components of θ ≥ 0. In this section we establish that, at least for a chain and a cycle, positive correlation is equivalent to non-frustration. We provide sketches of the proofs in the Appendix.
From (6), to determine whether covariances between components of t are positive or negative, we first compute second-order derivatives of the partition function. For instance, the covariance cov(t1, t2) between statistics t1(x1) and t2(x2) is

cov(t1, t2) = ∂² log Z(θ) / ∂θ2 ∂θ1
            = ∂/∂θ2 [ (∂Z(θ)/∂θ1) / Z(θ) ]
            = [ (∂²Z(θ)/∂θ1∂θ2) Z(θ) − (∂Z(θ)/∂θ1)(∂Z(θ)/∂θ2) ] / [Z(θ)]².

The partition function and subsequent derivatives are polynomials in the hyperbolic functions cosh θ and sinh θ. As such, it will be useful to introduce shorthand notation. The reason for this is that we would like to examine what happens when the strength of an interaction or bias increases or decreases while the orientation of the interaction or bias remains the same. If neighbors i and j anti-coordinate their preferences for Products A and B, then tij(1, 1) = tij(−1, −1) = −1 and tij(1, −1) = tij(−1, 1) = 1. We will abuse notation somewhat and say that tij = 1 if i and j coordinate and tij = −1 if i and j anti-coordinate. Similarly, if consumer i has a direct bias in favor of Product A, then ti(1) = 1 and ti(−1) = −1, and we will again abuse notation by saying, respectively, ti = 1 and ti = −1. We let Si = sinh θi and Ci = cosh θi for consumer i, and likewise Sij and Cij for neighbors i and j. Moreover,

S̄i = sinh(ti θi) = { Si,  ti = 1;  −Si,  ti = −1 }.   (10)

Taking derivatives,

(d/dθi) S̄i = (d/dθi) sinh(ti θi) = ti cosh(ti θi) = { Ci,  ti = 1;  −Ci,  ti = −1 }.   (11)


Likewise, C̄i and (d/dθi) C̄i are defined analogously. However, C̄i = Ci and (d/dθi) C̄i = Si for both ti = 1 and ti = −1. For neighbors i and j, S̄ij and C̄ij are defined similarly and have the same properties in terms of sign reversal and differentiation. That is, anti-coordinating social biases manifest in S̄ij and its derivative, and likewise S̄i for direct biases of opposing polarity.
We begin with the following result on the partition function of an Ising model on a chain with direct biases at the ends of the chain. For simplicity and economy, we omit constant scale factors.

Theorem 1. The partition function for an Ising model on a chain with direct biases at the ends of the chain is

Z(θ) = C̄1 C̄N ∏_{{i,j}∈E} C̄ij + S̄1 S̄N ∏_{{i,j}∈E} S̄ij.

We use Theorem 1 to show that for such an Ising model, the statistic t is positively correlated if and only if t is non-frustrated. Figure 3 illustrates the four basic patterns for the polarity of t corresponding to an Ising model on a chain with non-zero direct biases at the ends of the chain. If the two direct biases are of the same polarity, as in (a) and (b), then a ground state exists if there are an even number of anti-coordinating social biases on the chain, whereas a ground state does not exist if there are an odd number of anti-coordinating social biases on the chain. If, as in (c) and (d), the two direct biases are of opposing polarity, then a ground state exists if there are an odd number of anti-coordinating social biases on the chain, and a ground state does not exist if there are an even number of anti-coordinating social biases on the chain.

Theorem 2 (Ising Chain with Two Direct Biases). Let G be a chain, and t a statistic for a family of Ising models on G. Suppose all site statistics in t are neutral except for the end sites of G. Then, t is positively correlated if and only if t is non-frustrated.

We now compute the partition function for a cycle with N consumers and M ≤ N direct biases at sites i1, . . . , iM.

Theorem 3. The partition function for an Ising model p(x; θ) on a cycle consisting of N sites with a total of M ≤ N direct biases is

Z = H̄0 ∏_{i=1}^{M} C̄i + Σ_{i1,i2 ∈ I} H̄_{i1,i2} S̄_{i1} S̄_{i2} ∏_{i3 ≠ i1,i2} C̄_{i3},

where

H̄0 = ∏_{i,i+1} C̄_{i,i+1} + ∏_{i,i+1} S̄_{i,i+1}

and

H̄_{i1,i2} = ∏_{i=i1}^{i2−1} C̄_{i,i+1} ∏_{i=i2}^{i1−1} S̄_{i,i+1} + ∏_{i=i2}^{i1−1} C̄_{i,i+1} ∏_{i=i1}^{i2−1} S̄_{i,i+1}.   (12)


We now use Theorem 3 to show that for an Ising model on a cycle with two direct biases, the statistic t is positively correlated if and only if t is non-frustrated. Figure 4 illustrates the six basic patterns for the polarity of such an Ising model. As with the chain, the two direct biases may agree or disagree in polarity. In the cycle, however, there are two paths between the locations i1 and i2 of the two direct biases. Therefore, one has to consider cases where there are an even number of anti-coordinating social biases along each path; an odd number of anti-coordinating social biases along each path; and where one path has an even number and the other path has an odd number. The condition for the existence of a ground state is the same as in a chain, except that with a cycle, the condition has to be satisfied for both paths connecting i1 and i2.

Theorem 4 (Ising Cycle with Two Direct Biases). Let t be a statistic for a family of Ising models on a cycle G. Suppose there are two non-neutral site statistics in t, at sites i1 and i2. Then, t is positively correlated if and only if t is non-frustrated.

Though at the moment we do not have a proof of the generalization to M direct biases, the pattern evinced through the case of two direct biases suggests that the extension to an arbitrary number of direct biases, while likely tedious, does indeed hold. Likewise for a general graph. As such, we state these below as conjectures.

Conjecture 1 (General Ising Cycle). Let t be a statistic for a family of Ising models on a cycle G with N sites. Suppose there are M ≤ N non-neutral site statistics in t, at sites i1, . . . , iM. Then, t is positively correlated if and only if t is non-frustrated.

Conjecture 2 (General Graph). Let t be a statistic for a family of Ising models on a graph G. Then, t is positively correlated if and only if t is non-frustrated.
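A brute-force numerical check of the chain case is easy to run: the sketch below enumerates all configurations of a small chain, builds the Gibbs distribution for a non-frustrated polarity pattern (end biases of opposite polarity and a single anti-coordinating edge), and computes the pairwise covariances of the components of t, which Theorem 2 predicts to be positive. All parameter values are illustrative.

```python
# Brute-force check of Theorem 2 on a small chain.
import itertools
import numpy as np

N = 5
edge_sign = np.array([1, -1, 1, 1])          # +1 coordinating, -1 anti-coordinating
end_sign = (1, -1)                           # polarities of t_1 and t_N
theta = np.concatenate([[1.0, 1.0], 0.8 * np.ones(N - 1)])   # theta >= 0 throughout

def t(x):
    x = np.asarray(x, dtype=float)
    sites = np.array([end_sign[0] * x[0], end_sign[1] * x[-1]])
    edges = edge_sign * x[:-1] * x[1:]
    return np.concatenate([sites, edges])    # components of the statistic t

T = np.array([t(x) for x in itertools.product([-1, 1], repeat=N)])
p = np.exp(T @ theta)
p /= p.sum()                                 # Gibbs distribution over 2^N configurations

mean = p @ T
cov = T.T @ (p[:, None] * T) - np.outer(mean, mean)
print("smallest pairwise covariance:", cov[np.triu_indices_from(cov, k=1)].min())
```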

8 Application to a Marketing Game

This section discusses the connection between the polarity of t, whether entropy increases or decreases, and the recently introduced framework A Marketing Game (AMG). Consumers update their preferences according to (2), beginning from some initial preference configuration x(t0), generating a sequence of preference configurations x(t0), x(t0+1), . . . , x(t0+t), . . ., in which the configurations x(t) and x(t+1) at successive time points differ in at most a single consumer. The goal of each Company in A Marketing Game is to optimize the total preference

r(x(t0), . . . , x(t0+t)) = Σ_{τ=t0}^{t0+t} Σ_{i∈V} xi(τ).

Due to our convention, Company A seeks to maximize total preference, while Company B seeks to minimize total preference.


AMG has recently [22] been cast within the paradigm of reinforcement learning [25], in which a Company fits a model from data, then uses the model to estimate the total preference resulting from different allocations. The allocation that yields the highest expected total preference is selected. In [22], total preference over some time interval was estimated by simulating the preference dynamics using parameters estimated from data and parameters corresponding to candidate marketing allocations. It has been pointed out [26] that suboptimal variational methods with respect to an equilibrium model may give more accurate inference than exact Monte Carlo when computational resources are limited. To use an equilibrium model for forecasting expected total preference, it is important that one can map dynamics parameters to equilibrium parameters. While minimum conditional description length (MCDL) estimation can be used to infer either dynamics or equilibrium models, inferring equilibrium models requires more data and is therefore more computationally intensive. On the other hand, if one understands how to map dynamics models to equilibrium models, then one can estimate the dynamics model with less data and compute the expected total preference with the corresponding equilibrium model.
However, it may be possible to make good, principled allocation selections without recourse to a particular model, especially since any dynamics model will be coarse at best. Consider an equilibrium Gibbs distribution p(x; θ) ∈ Ft for preference configurations on the network. If one understands how the entropy of p(x; θ) changes with respect to the θi, then one can understand the relative value of marketing to consumer i. For instance, decreasing entropy amounts to increasing concentration of choice. If a company already has market share [4], then an allocation that decreases entropy would be good. On the other hand, if a company is losing in market share, a good allocation would be one that increases entropy. One can view this as an information-theoretic way of looking at the problem of influencing decision-making on a social network.

9 Summary, Limitations and Future Directions

In this paper we have discussed a notion of frustration within a family of equilibrium models of consumer choice and shown theoretically how it relates to positive correlation of the statistic t defining the family of equilibrium models. Moreover, we demonstrated numerically that when direct biases are non-uniform yet of uniform polarity, asymmetric yet coordinating social biases will converge to an equilibrium whose statistic is frustrated as a result of re-distributed direct biases of non-uniform polarity.
Limitations of the paper are a reflection of what remains to be done. For example, our use of MCDL to estimate both dynamics and equilibrium models provides an alternative means of evaluating the steady-state behavior of asymmetric dynamics, as compared to the theoretical analysis of Godreche. Future work should involve extending Godreche's theoretical analysis to determine the equilibrium direct biases that correspond to asymmetric social biases when the direct biases are non-uniform. In addition, the analysis related to frustration and positive correlation should be extended to more complicated topologies, for example using the message passing approach of [23].

A Appendix: Sketches of Derivations

Note that to determine whether a particular covariance is positive or negative, one needs to determine the sign of the numerator, which we creatively refer to as numerator. For both the chain and the cycle, numerator will be a product of factors, one of which will be positive or negative depending on whether the direct biases agree in polarity, another of which will be positive or negative depending on whether there are an odd or an even number of negative edges between the two direct biases. Each factor in numerator will be a sum of terms, each term of which is itself a product of sinh θ's and cosh θ's, where θ is a direct or social bias.
Without loss of generality, we simplify the analysis by assuming uniform social biases, i.e., θi,i+1 = θ. For both a chain and a cycle, direct biases are zero except for two sites. In the chain, the nonzero direct biases are at sites 1 and N, the ends of the chain. In the cycle, the nonzero direct biases are at sites 1 and k < N. In the interest of space, we will economize notation by letting Cθ ≜ cosh θ and Sθ ≜ sinh θ when θ is a social bias, and Ci ≜ cosh θi and Si ≜ sinh θi when θi is a nonzero direct bias at site i.

A.1 Theorems 1 and 2

Proof (Theorem 1). Using the concept of the transfer matrix [3], one can compute the partition function Z(θ) as

Z(θ) = (e^{θN}  e^{−θN}) ( [1 1; 1 −1] [Cθ 0; 0 Sθ] [1 1; 1 −1] )^{N−1} (e^{θ1}  e^{−θ1})^T
     = C_{θN} Cθ^{N−1} C_{θ1} + S_{θN} Sθ^{N−1} S_{θ1}.  □



We now sketch a proof that t is positively correlated if and only if t is nonfrustrated. Proof (Theorem 2). We will determine the sign of numerator for the covariance cov (t1 , t12 ) between statistics t1 (x1 ) and t12 (x1 , x2 ). In the case that θ1 > 0 and θN > 0, it is straightforward to show

  (13) numerator = CN SN CθN −1 Cθ SθN −2 − Sθ CθN −2 SθN −1 C12 − S12 The first factor CN SN is positive since θN > 0. The third factor, C12 − S12 , is positive for all θ1 .


Consider the middle factor, $C_\theta^{N-1} C_\theta S_\theta^{N-2} - S_\theta C_\theta^{N-2} S_\theta^{N-1}$. There are four cases to consider. First, whether there are an odd or even number of negative edges in the chain; for each of these cases, whether the social bias θ12 is positive or negative. One can see that when there are an odd number of negative edges on the chain, $C_\theta^{N-1} C_\theta S_\theta^{N-2} - S_\theta C_\theta^{N-2} S_\theta^{N-1} < 0$, and when there are an even number of negative edges along the chain, $C_\theta^{N-1} C_\theta S_\theta^{N-2} - S_\theta C_\theta^{N-2} S_\theta^{N-1} > 0$. Combining this with (13), we see that cov(t1, t12) < 0 when t is frustrated and cov(t1, t12) > 0 when t is non-frustrated. Note, however, that in order to show positive correlation, one has to show that all covariances are positive when t is non-frustrated. By recalling (13), one can convince oneself that this is the case. The case where θ1 < 0 and θN > 0 follows similarly, though one should recall (10) and (11) for sign reversals.
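The sign analysis above can also be explored numerically. The short script below is our own sketch; it assumes, for illustration only, that t1 and t12 are the node statistic x1 and the edge statistic x1x2 of a standard Ising parameterization (the paper's exact definition of t is given earlier in the paper). It computes cov(t1, t12) by brute force for a short chain, so one can vary the edge signs and the polarity of the end biases and watch the covariance change sign.

import itertools
import numpy as np

def chain_covariance(edge_biases, theta1, thetaN):
    # cov(x_1, x_1 x_2) for a chain with the given edge (social) and end (direct) biases.
    N = len(edge_biases) + 1
    states, weights = [], []
    for x in itertools.product([-1, 1], repeat=N):
        e = theta1 * x[0] + thetaN * x[-1]
        e += sum(J * x[i] * x[i + 1] for i, J in enumerate(edge_biases))
        states.append(x)
        weights.append(np.exp(e))
    p = np.array(weights) / sum(weights)
    t1 = np.array([x[0] for x in states])
    t12 = np.array([x[0] * x[1] for x in states])
    return float(np.sum(p * t1 * t12) - np.sum(p * t1) * np.sum(p * t12))

# All-coordinating edges versus one anti-coordinating edge:
print(chain_covariance([0.5, 0.5, 0.5], theta1=0.4, thetaN=0.6))
print(chain_covariance([0.5, -0.5, 0.5], theta1=0.4, thetaN=0.6))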

A.2 Proof Sketch for Theorem 4

The concept of the transfer matrix can again be used to compute the partition function Z(θ) in the case of an Ising model on a cycle. However, one does so by conditioning on a particular site, say i0, and computing the respective conditional partition functions $Z_{i_0}^{(A)}$ and $Z_{i_0}^{(B)}$ corresponding to conditioning on $x_{i_0} = 1$ and $x_{i_0} = -1$. Then, $Z = Z_{i_0}^{(A)} + Z_{i_0}^{(B)}$. We begin with the following lemma.

Lemma 1. Consider a cycle with M direct biases, at sites $i_1, \dots, i_M$. The conditional partition functions $Z_{i_0}^{(A)}$ and $Z_{i_0}^{(B)}$ are, respectively,

$$
Z_{i_0}^{(A)} = H_0 \prod_{k=1}^{M} C_k
+ \sum_{j=1}^{M-1} H_{i_j,i_{j+1}} S_{i_j} S_{i_{j+1}} \prod_{k \neq j,\, j+1} C_k
+ \Big(\prod_{k \neq 1} C_k\Big) S_1 H_{i_0,i_1}
+ \Big(\prod_{k \neq M} C_k\Big) S_M H_{i_0,i_M},
\tag{14}
$$

and

$$
Z_{i_0}^{(B)} = H_0 \prod_{k=1}^{M} C_k
+ \sum_{j=1}^{M-1} H_{i_j,i_{j+1}} S_{i_j} S_{i_{j+1}} \prod_{k \neq j,\, j+1} C_k
- \Big(\prod_{k \neq 1} C_k\Big) S_1 H_{i_0,i_1}
- \Big(\prod_{k \neq M} C_k\Big) S_M H_{i_0,i_M}.
$$

Proof. We will prove (14) by induction. It is straightforward¹ to verify the conclusion of the lemma in the case of three direct biases, which establishes the base step. That is, the statement holds for some M ≤ N. For the inductive step, we will first compute $Z_{i_0}^{(A)}$ as a function of the message m from site $i_M$ to site $i_M + 1$, and establish a relationship between m and $Z_{i_0}^{(A)}$ in the base case. We will then compute $Z_{i_0}^{(A)}$ as a function of m in the inductive case of M + 1 direct biases. For simplicity and without loss of generality, assume that for all {i, i+1}, θ_{i,i+1} = θ.

¹ For the reader who wishes to verify this claim, the author recommends first computing Z for zero, one, and two direct biases, then finally for three direct biases.

Suppose there are M direct biases. We compute $Z_{i_0}^{(A)}$ by conditioning on site i0 = 0. The sites where there exist direct biases are enumerated as $i_1, \dots, i_M$. Moreover, let Δ be the distance from $i_M$ to $i_0$ in the 'forward' direction, i.e., the direction in which no other direct biases exist between $i_M$ and $i_0$. One can show that

$$
Z_{i_0}^{(A)} = C_\theta^{\Delta}\,[m(A) + m(B)] + S_\theta^{\Delta}\,[m(A) - m(B)].
$$

Equating this with (14), which is established for the case of M direct biases, one can see that

$$
m(A) + m(B) = \alpha_0 C_\theta^{N-\Delta}
+ \sum_{j=1}^{M-1} \alpha_{i_j,i_{j+1}}\, C_\theta^{\,i_j - i_{j+1} - \Delta}\, S_\theta^{\,i_{j+1} - i_j}
+ \alpha_{0,1}\, C_\theta^{\,i_0 - i_1 - \Delta}\, S_\theta^{\,i_1 - i_0}
+ \alpha_{M,0}\, C_\theta^{\,i_M - i_0 - \Delta}\, S_\theta^{\,i_0 - i_M},
\tag{15}
$$

and

$$
m(A) - m(B) = \alpha_0 S_\theta^{N-\Delta}
+ \sum_{j=1}^{M-1} \alpha_{i_j,i_{j+1}}\, C_\theta^{\,i_{j+1} - i_j}\, S_\theta^{\,i_j - i_{j+1} - \Delta}
+ \alpha_{0,1}\, C_\theta^{\,i_1 - i_0}\, S_\theta^{\,i_0 - i_1 - \Delta}
+ \alpha_{M,0}\, C_\theta^{\,i_0 - i_M}\, S_\theta^{\,i_M - i_0 - \Delta}.
\tag{16}
$$

Now consider the inductive case of M + 1 direct biases, where $i_1, \dots, i_M$ are as they were in the base case, and $i_{M+1} > i_M$, i.e., it is the only direct bias between $i_M$ and $i_0$. Note that m is the same as in the base case of M direct biases, and hence m(A) − m(B) and m(A) + m(B) are the same as well. One can show that

$$
Z_{i_0}^{(A)} = \big(C_{M+1} C_\theta^{\Delta} + S_{M+1} C_\theta^{\Delta_1} S_\theta^{\Delta_2}\big)\,[m(A) + m(B)]
+ \big(C_{M+1} S_\theta^{\Delta} + S_{M+1} S_\theta^{\Delta_1} C_\theta^{\Delta_2}\big)\,[m(A) - m(B)].
$$

Substituting (15) and (16) establishes (14) for the inductive case of M + 1 direct biases.

We now sketch a proof of Theorem 4. Proof (Theorem 4). Let θ1 and θ2 denote the non-zero direct biases on the cycle. We assume that θ2 > 0. We will consider four cases. First, whether θ1 < 0 or θ1 > 0. Second, whether there are an odd or even number of anti-coordinating social biases along each path between i1 and i2 .


In the case that θ1 < 0,

$$
\text{numerator} = H_0 H_{1,2}\left(C_1^2 S_2^2 + S_1^2 S_2^2 - S_1^2 C_2^2 - C_1^2 C_2^2\right), \tag{17}
$$

where it is straightforward to verify that $C_1^2 S_2^2 + S_1^2 S_2^2 - S_1^2 C_2^2 - C_1^2 C_2^2 < 0$. If there are an odd number of anti-coordinating social biases along each path connecting i1 and i2,

$$
H_0 H_{1,2} = \left(C_\theta^{N} + S_\theta^{N}\right)\left(-C_\theta^{\Delta_1} S_\theta^{\Delta_2} - C_\theta^{\Delta_2} S_\theta^{\Delta_1}\right) < 0, \tag{18}
$$

while if there are an even number of anti-coordinating social biases along each path connecting i1 and i2, we have

$$
H_0 H_{1,2} = \left(C_\theta^{N} + S_\theta^{N}\right)\left(C_\theta^{\Delta_1} S_\theta^{\Delta_2} + C_\theta^{\Delta_2} S_\theta^{\Delta_1}\right) > 0. \tag{19}
$$

Substituting (18) into (17) shows that when the two direct biases are of opposite polarity and there are an odd number of anti-coordinating social biases along each path connecting i1 and i2, the covariance between t1 and t2 is positive. Substituting (19) into (17) shows that when the two direct biases oppose in polarity and there are an even number of anti-coordinating social biases along each path connecting i1 and i2, the covariance between t1 and t2 is negative. In the case that θ1 > 0,

$$
\text{numerator} = H_0 H_{1,2}\left(C_1^2 C_2^2 + S_1^2 S_2^2 - S_1^2 C_2^2 - C_1^2 S_2^2\right), \tag{20}
$$

where one can verify that $C_1^2 C_2^2 + S_1^2 S_2^2 - S_1^2 C_2^2 - C_1^2 S_2^2 > 0$. Substituting (18) and (19) into (20) yields analogous results.



References

1. Alonzo-Sanz, R.: Self-organization in the battle of the sexes. Int. J. Mod. Phys. C 22, 1–11 (2011)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. Oxford University Press, Oxford (1993)
3. Baxter, R.J.: Exactly Solved Models in Statistical Mechanics. Dover, Mineola (2007)
4. Bell, D.E., Keeney, R.L., Little, J.D.C.: A market share theorem. J. Mark. Res. 12(2), 136–141 (1975)
5. Blume, L.E.: Statistical mechanics of strategic interaction. Games Econ. Behav. 5(3), 387–424 (1993)
6. Bramoulle, Y.: Anti-coordination and social interactions. Games Econ. Behav. 58, 30–49 (2007)
7. Broere, J., Buskens, V., Weesie, J., Stoof, H.: Network effects on coordination in asymmetric games. Sci. Rep. 7, 17016 (2017)
8. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (1991)
9. Georgii, H.O.: Gibbs Measures and Phase Transitions. De Gruyter, Berlin (1988)
10. Gladwell, M.: The Tipping Point. Little, Brown, and Company, Boston (2000)
11. Glauber, R.J.: Time-dependent statistics of the Ising model. J. Math. Phys. 4, 294–307 (1963)
12. Godreche, C.: Dynamics of the directed Ising chain. J. Stat. Mech. Theory Exp. P04005 (2011). https://doi.org/10.1088/1742-5468/2011/04/P04005
13. Griffiths, R.B.: Correlations in Ising ferromagnets. I. J. Math. Phys. 8, 478–483 (1967)
14. Luce, D.: Individual Choice Behavior. Dover, Mineola (1959)
15. Kempe, D., Kleinberg, J., Tardos, E.: Influential nodes in a diffusion model for social networks. In: Proceedings of the 32nd International Conference on Automata, Languages, and Programming (ICALP), pp. 1127–1138 (2005)
16. McFadden, D.: Conditional logit analysis of qualitative choice behavior. In: Frontiers in Econometrics. Academic Press, New York (1974)
17. Moessner, R., Sondhi, S.L.: Ising models of quantum frustration. Phys. Rev. B 63, 224401 (2001)
18. Montanari, A., Saberi, A.: The spread of innovations in social networks. PNAS 107(47), 20196–20201 (2010)
19. Reyes, M.G., Neuhoff, D.L.: Entropy bounds for a Markov random subfield. In: ISIT 2009, Seoul, South Korea (2009)
20. Reyes, M.G., Neuhoff, D.L.: Minimum conditional description length estimation of Markov random fields. In: ITA Workshop, February 2016
21. Reyes, M.G.: A Marketing Game: a model for social media mining and manipulation. Accepted to Future of Information and Communication Conference, San Francisco, CA, 14–15 March 2019
22. Reyes, M.G.: Reinforcement learning in a marketing game. Accepted to Computing Conference, London, UK, 16–17 July 2019
23. Reyes, M.G., Neuhoff, D.L.: Monotonicity of entropy in positively correlated Ising trees. Accepted to ISIT, Paris, July 2019
24. Smith, J.M., Hofbauer, J.: The 'Battle of the Sexes': a genetic model with limit cycle behavior. Theor. Popul. Biol. 32, 1–14 (1987)
25. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
26. Wainwright, M.J.: Estimating the "wrong" graphical model: benefits in the computation-limited setting. J. Mach. Learn. Res. 7, 1829–1859 (2006)
27. Wannier, G.H.: Antiferromagnetism. The triangular Ising net. Phys. Rev. 79, 357 (1950)
28. Watts, D.J., Dodds, P.S.: Influentials, networks, and public opinion formation. J. Consum. Res. 34, 441–458 (2007)

Cloud Capacity Planning Based on Simulation and Genetic Algorithms

Riyadh A. K. Mehdi and Mirna Nachouki

Ajman University, Ajman, UAE
{r.mehdi,mirna}@ajman.ac.ae

Abstract. Reducing spending on information technology is one important way for enterprises to reduce cost, and cloud computing is one area where this can be done. Key benefits of cloud computing include scalability, instant provisioning, virtualized resources, and cost effectiveness. There are different ways to deploy cloud resources, such as public, private, and hybrid clouds, and business requirements determine the best deployment model to use. In this work, we have built a simulation model based on genetic programming to find the optimal combination of private and public cloud resources to satisfy a pattern of demand over the planning period, as well as the optimal guaranteed service level. Our main finding is that the optimal level of private computing capacity depends to a large extent on the shape of the demand curve, negative exponential or normal for example. Variations in demand within the same family of demand distributions have a very small effect on capacity for the same mean demand over the planning period, but a significant impact on capacity utilization and cost. The distinguishing feature of our model is that it can handle any theoretical or ad hoc demand probability distribution. In addition, our computational scheme allows for random variation in any of the parameters affecting the total cost of cloud resources consumed, as long as this variation can be described by an estimated parametric or empirical probability density function. The model can also be easily modified to determine the optimal total cost with respect to any parameters that can be used as decision variables. The accuracy and correctness of the model were tested against results obtained from a mathematical model based on an exponential probability distribution, with almost identical results.

Keywords: Cloud costing · Cloud pricing · Optimal cloud deployment · Cloud cost optimization · Genetic programming



1 Introduction

Cloud computing has resulted in a paradigm shift in the way computing services are procured and used. Cloud computing services are provided as a utility service over the network. The origin of the term cloud computing is inspired by the cloud symbol, which is often used to represent the Internet in diagrams. Originally, it was used to indicate that a service and/or resource is delivered over the internet. This technology, which is internet-based computing, started in late 2007 [1]. It allows sharing various services and resources such as databases, storage, servers, applications, etc. in a flexible and dynamic way with minimal management effort [2].

Many definitions have been given to cloud computing. Cloud computing allows business customers to plan their provisioning and to scale their resources when needed [3]. Islam et al. defined cloud computing as a set of applications, hardware and system software aimed to deliver a good quality of service to the end user through the use of the internet [4]. Cloud computing has two types of cloud models: service models and deployment models. The type of services offered by the cloud categorizes the service models; how these services are being used classifies the deployment models.

1.1 Service Models

There are three main categories of service models:

• Infrastructure as a Service (IaaS). In this model, the customer has access to core computing resources (hardware and software components) in the form of virtualized servers, storage, networks, processing and other resources, including operating systems [1]. Unlike traditional hardware machines that require maintenance, this model makes these machines available virtually, with flexible specifications. A common example of IaaS is Amazon Web Services EC2.
• Platform as a Service (PaaS). This model provides the customer, through the internet, with all the requirements to build, run and maintain custom web-based applications. These facilities are available to customers without software downloads or installations for developers [1]. A common example of PaaS is Microsoft Azure.
• Software as a Service (SaaS). This model provides customers with all required applications hosted by a vendor or service provider through an internet connection. It is a licensing model that involves a pay-as-you-go subscription [5]. In this model, there are the provider and the customer. The service provider controls everything, while the customer controls only the settings of the needed applications [6]. A common example of SaaS is Google's email.

In addition to the above, Mukundha and Vidyamadhuri [7] describe other services offered by the cloud, such as:

• MBaaS (Mobile Backend as a Service), a model that provides mobile application developers with a way to link their applications to backend cloud storage.
• DaaS (Data as a Service), which allows customers to access data files such as text, images, sound and videos through an internet connection.
• MaaS (Monitoring as a Service), which helps monitor various services, applications, servers and systems.

1.2 Deployment Models

There are four main categories of deployment models [7]:

• Public cloud. This model offers applications and services, often free of charge, to the public over an internet connection.
• Private cloud. This model is provisioned for a single business organization and serves its multiple users. It could be managed by the business itself or by a third-party service provider.
• Community cloud. This model is controlled by a group of organizations that share common concerns, such as security. It follows a multi-tenant infrastructure and might be operated by a single organization or multiple organizations in the community, or by a third party.
• Hybrid cloud. This model is a combination of two or more different cloud infrastructures (public, private or community). It allows an organization to have the required privacy to store its sensitive data, as well as the possibility of sharing access to an application with other organizations.

There are other deployment models for cloud, such as:

• Inter cloud. This model is known as a cloud of clouds. It allows connecting various (public, private and hybrid) clouds together to provide and exchange data in a transparent way.
• Multi cloud. This model allows the use of multiple and distributed cloud software, hardware and application services under a single heterogeneous architecture.
• Distributed cloud. This model refers to a cloud computing platform distributed across multiple geographic locations and multiple machines. It allows faster and more responsive communication services.

1.3 Pricing Models

Pricing is considered a challenge in cloud computing. A cloud computing provider aims to maximize its revenues by relating the cost to the quality of the services provided [8]. On the other hand, Al-Roomi et al. state in [9] that cloud customers try to obtain the best quality of service possible, with speedy responses to their requests, at a reasonable and competitive price. Therefore, satisfying both parties requires an optimal pricing methodology.

Cloud computing offers the possibility of sharing resources and costs among a large number of end users. It allows them to store, process, and manage data efficiently. End users can access their data when they need it without having to install any specific software on their computers. As the number of cloud providers has increased, the price of cloud computing has become an essential element and an indicator of the quality of the services delivered. Consequently, setting the right price is a difficult task for the cloud provider [10]. Typically, a cloud computing provider aims to maximize its revenues and a customer aims to get the best quality of service at a reasonable price. Satisfying both parties requires an optimal pricing methodology [9]. The pricing schemes of cloud computing providers are affected by different elements:


• Initial costs that providers spend annually on new resources.
• Leasing period of resources: cloud providers usually offer a better price for a longer subscription period.
• Maintenance costs that providers spend on a yearly basis to maintain and secure the services provided.
• Quality of the services provided, based on the technologies applied to enhance the user experience, such as data privacy and resource availability.

There are different factors that influence the pricing of cloud computing [11], such as the location (user access point to a service), content accessed, and time spent. Other factors are also important:

• Flat rate pricing: a fixed price for a specific time.
• Priority pricing: services are priced based on their priority.
• Usage pricing: charges are applied based on the use of a service for a period of time.
• Responsive pricing: charges are applied only on service congestion.
• Service type: charges are calculated based on the type of service used.
• Session oriented: charging is based on the use made of the session.

There are two common pricing models [8]:

• Fixed pricing model, where the amount to be paid is highly time-dependent. The time interval of the offered service is predetermined, and each resource type has a predefined price. Users may choose one of two mechanisms: (i) pay-per-use, where the user pays as a function of the time or quantity of the service used, or (ii) subscription, where the user pays a flat fee for using a service. These two mechanisms are straightforward but certainly are not fair to all users.
• Dynamic pricing model, which highly depends on the availability of the resource requested by the end user. The cloud provider computes the cost based on the user's request, and pricing is adjusted according to market conditions: when demand is high, the price is high, and when demand is weak, the price is low. In this model, cloud providers aim to maximize their profits, with each user paying to get a good quality of service. Moreover, providers need advanced technology for adjusting the price and calculating profit.

A hybrid cloud model can dynamically allow the customer to adjust the amount of capacity used in a public or private hosting environment, thereby achieving a high level of scalability and efficiency. The aim of this paper is to build a model that will help cloud computing customers decide on the optimal mix of private and public cloud provisioning depending on demand expectations and cost parameters. The optimal mix will be determined by a genetic algorithms approach based on a simulated demand for resources described by a theoretical or empirical probability distribution.

1.4 Genetic Algorithms

Genetic algorithms are search techniques that have been used to solve optimization problems. A distinguishing feature of genetic algorithms is that the mechanisms they use allow them to avoid local optima. Genetic algorithms are loosely modeled on processes that appear to be at work in biological evolution and the working of the immune system [12]. Central to the evolutionary system is the idea of a population of genotypes (chromosomes) that are elements of a high dimensional search space. A genotype can be thought of as an arrangement of genes, where each gene takes on values from a suitably defined domain. In this work, the value is the amount of private capacity needed to satisfy an expected level of demand over the planning period, where extra demand is met by hiring public cloud resources. The business objective is to minimize the cost of meeting demand through a mix of private and public cloud resources. Thus, each chromosome encodes one possible value for the private capacity.

An evolutionary algorithm starts with a population of randomly generated individuals. Once an initial population has been created, an evolutionary algorithm enters a loop. At the end of each iteration, applying a certain number of stochastic operators to the previous population will have created a new population. One such iteration is referred to as a generation. The first operator to be applied is selection. Its aim is to simulate the Darwinian law of "survival of the fittest". In order to create a new intermediate population of n individuals, pairs of individuals (parents) are selected based on their fitness scores. The probability of each individual being selected is linearly proportional to its fitness value. Therefore, above-average individuals will expectedly have more copies in the new population, while below-average individuals will risk extinction. For the computational model developed in this paper, the fitness value represents the overall cost of acquiring private and public cloud resources to meet the specified demand.

To create offspring from the selected parents, a crossover operation is applied. For each pair of parents to be mated, a crossover point is chosen at random from within the genes. Offspring are created by exchanging the genes of the parents up to the crossover point. After crossover, all chromosomes undergo mutation. The purpose of mutation is to simulate the effect of transcription errors that can happen with a very low probability when a chromosome is duplicated. This is accomplished by replacing each gene value by another from the domain of possible values with a very low probability of change. The above process is stopped when a termination condition is met, for example when a pre-determined number of generations has been reached, a satisfactory solution has been found, or no improvement in the solution quality has taken place for a pre-determined number of generations [12].
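As a concrete illustration of the loop just described, the following minimal sketch (our own, not taken from the paper; the binary encoding, population size, and rates are arbitrary choices) minimizes a one-dimensional cost function with fitness-proportional selection, single-point crossover, and bit-flip mutation. Because the goal here is cost minimization, the selection weight is taken as a decreasing function of cost, which is an assumed design choice.

import random

def genetic_minimize(cost, lower, upper, pop_size=30, generations=100,
                     crossover_rate=0.9, mutation_rate=0.02, bits=16):
    # Minimal GA sketch minimizing `cost` over the interval [lower, upper].
    def decode(chrom):
        return lower + int("".join(map(str, chrom)), 2) / (2**bits - 1) * (upper - lower)

    pop = [[random.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness-proportional ("roulette wheel") selection; cheaper solutions get larger weights.
        weights = [1.0 / (1.0 + cost(decode(c))) for c in pop]
        parents = random.choices(pop, weights=weights, k=pop_size)
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < crossover_rate:
                cut = random.randrange(1, bits)          # single-point crossover
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            nxt += [a[:], b[:]]
        # Mutation: flip each bit with a small probability.
        pop = [[1 - g if random.random() < mutation_rate else g for g in c] for c in nxt]
    return min((decode(c) for c in pop), key=cost)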


2 Review of Related Work

Khanafer et al. [13] have developed a cost optimization scheme based on a constrained version of the Ski-rental problem that allows a cloud user to decide whether to rent or buy infrastructure to meet computing requirements. The scheme assumes that the first or second moment of the arrivals distribution is known. They concluded that the scheme leads to significant cost savings when applied to cloud file systems. However, the scheme does not address the problem of a mixed strategy for provisioning computing resources.

Li et al. [14] have investigated the problem of optimizing both the server running cost and the software storage cost in cloud gaming. They have analyzed the behavior of a proposed stochastic model based on queuing theory under different request dispatching policies. Several classes of computationally efficient heuristic algorithms were experimentally evaluated by simulations with real world parameters. They determined that their proposed Ordered and Genetic algorithms perform quite well in most cases and are robust to dynamic changes.

Guo et al. [15] have developed a cloud bursting system, Seagull. Some enterprises deploy a hybrid cloud model to handle workload fluctuations by bursting into the cloud when local resources are insufficient. Seagull automates the decision process about which applications can be run in the cloud most efficiently. The system uses selective precopying to proactively replicate some applications from the private computing resources to the cloud, reducing the migration time of large applications by orders of magnitude. They showed that Seagull has reasonable performance in minimizing cost compared to an Integer Linear Programming solution, and that its scalability is much better.

Deniziak et al. [16] presented a methodology based on developmental genetic programming for mapping real-time cloud applications onto an IaaS cloud. The aim of the methodology is to find the mapping giving the minimal cost of IaaS services required for running the real-time applications in the cloud environment while keeping the level of quality of service as high as possible. Cost reduction of IaaS services is achieved by efficient resource sharing among cloud applications.

Henneberger [17] investigated the economics of hybrid cloud computing. He has developed a simplified stochastic optimization model to identify the conditions under which hybrid cloud computing becomes economically feasible. He stated that under certain conditions it is viable to use cloud services to cover peak demand, even if the price is high or if service levels are low. Furthermore, he indicated that a higher variance of demand for capacity should not automatically result in a more extensive use of cloud services. However, his model is not a closed form solution, as stated in [18].

Lee [18] has developed a closed form mathematical model to investigate the problem of determining an optimal mix of hybrid cloud computing for an enterprise. The model is used to derive a mathematical formula for the private capacity that minimizes the total cost of meeting a computing demand described by an exponential probability distribution over the planning period. The author also uses the mathematical model to derive the optimum level of public cloud to be negotiated in a service level agreement (SLA). The shortcoming of the model is that it does not allow for variations in the other parameters that influence the hybrid cloud decision problem, such as the price of public cloud computing resources. Moreover, demand for computing resources over a planning period may not follow a standard probability distribution amenable to the required mathematical analysis to derive the decision formulae.

Our objective in this research is to develop a computational model based on simulation and genetic algorithms that addresses the optimal mix problem of hybrid cloud computing and can be applied to demand variations described by any type of theoretical or empirical probability distribution derived from historical data.

3 Computational Model

3.1 Parameters, Variables and Algorithm

An enterprise needs to determine the mix of investment in private and public cloud to meet its need for computing resources. These requirements depend on the expected demand and cost parameters. The decision variables and parameters used in the model are:

• Decision variables:
  – Private capacity
• Parameters:
  – f(x): demand probability density function
  – pr: unit price of private cloud
  – pb: unit price of public cloud
  – g: guaranteed service level
  – pen: penalty of unsatisfied demand
  – t: time periods of the decision horizon

The following assumptions are made:

• Private cloud resources are available at the start of the decision horizon.
• Public cloud resources can be obtained, at a fixed cost, to satisfy demand that exceeds private capacity.
• The demand probability distribution can be theoretical or empirically estimated from real data.
• Demand can be divided between private and public cloud resources.
• The unit price of public cloud over the decision horizon either remains constant or its probability density function is known.

Most of these assumptions are based on Henneberger [17]. The fitness function used to measure the cost of each private cloud provisioning level is outlined below:


def cloud_fitness_function(capacity, demand_sampler, private_unit_price,
                           public_unit_price, penalty_unit_cost,
                           g_service_level, time_periods):
    # Total cost of a candidate private capacity over the planning horizon.
    # demand_sampler() draws one period's demand from the (parametric or
    # empirical) demand distribution.
    public_cost = 0.0
    penalty_cost = 0.0
    for _ in range(time_periods):
        demand = demand_sampler()
        if demand > capacity:
            shortage = demand - capacity
            # Shortage is met from the public cloud at the quoted unit price...
            public_cost += shortage * public_unit_price
            # ...and the fraction of the shortage not covered by the guaranteed
            # service level is penalized (assumed reading of "cumulative penalty cost").
            penalty_cost += shortage * (1.0 - g_service_level) * penalty_unit_cost
    # Total cost = private cost + public cloud cost + penalty cost.
    return capacity * private_unit_price + public_cost + penalty_cost
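As a usage illustration (our own, not from the paper), the fragment below wires the fitness sketch to an exponential demand sampler and to the hypothetical genetic_minimize helper sketched in Sect. 1.4; the parameter values loosely mirror the experiment reported in the next subsection, and the horizon is shortened to keep the sketch fast. Because the fitness is itself a random simulation, in practice one would average several simulation runs per candidate, or reuse a common random-number stream, before trusting the comparison.

import numpy as np
from functools import partial

rng = np.random.default_rng(seed=1)

# Exponential demand with rate 0.001 (mean 1000 units per period).
demand_sampler = lambda: rng.exponential(scale=1000.0)

cost = partial(cloud_fitness_function,
               demand_sampler=demand_sampler,
               private_unit_price=10_000.0,   # illustrative values; see Table 1
               public_unit_price=1.54,
               penalty_unit_cost=100.0,
               g_service_level=0.9945,
               time_periods=1_000)            # shortened horizon for the sketch

best_capacity = genetic_minimize(cost, lower=0.0, upper=2000.0)
print("estimated optimal private capacity:", round(best_capacity, 1))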

3.2 Experiments and Results

Parametric Demand Distributions. To test and compare the performance of the computational model developed in this work, we have used the parameters provided by Lee [18] in illustrating the operation of his analytical model based on an exponential demand density function. The values of the parameters used are: k = 0.001, pr = $10,000, pen = $100, t = 10,000, and g = 99.45%. The optimal capacity computed from the model is 439.4, which is very close to the value obtained from the analytical model (434.7). Table 1 compares the results obtained with those of Lee's analytical approach.

Table 1. Comparison of model output with analytical approach.

  Computed output              Lee's analytical model    Genetic-based model
  Optimal capacity             434.7                     439.4
  Unit cost of public cloud    $1.54                     $1.544
  Total cost                   $14,331,557               $14,387,446
  Capacity utilization         81.1%                     81.26%

Figure 1 shows a comparison of the public cloud unit price, the private cloud unit price, and the overall unit price of meeting computing demand. For the hypothetical demand used in this study (k = 0.001), private cloud capacity can be increased to approximately 950 units and still be more economical than depending entirely on public cloud services.

Fig. 1. Comparison of different cloud unit costs in relation to private capacity levels.

Optimal Public Cloud Service Level. A service-level agreement (SLA) defines the level of service guaranteed by the provider to the customer. The higher the guaranteed service level, the higher the price. To minimize the overall cost of acquiring computing resources to meet demand, a customer needs to find the optimal level of guaranteed service. Lee [18] has used the formula shown below to describe the relationship between public cloud cost and guaranteed service level:

p = base_price + (gsl − base_level) × psr                                  (1)

where p is the unit cost of public cloud, base_price is the unit cost of public cloud for a base level guarantee, gsl is the service level guarantee required by the customer, base_level is the base level guarantee offered by the provider, and psr is the premium service rate at which guaranteed service levels above the base level are offered. The model was extended so that the gsl parameter becomes a decision variable. The genetic algorithm optimization function is modified to find the optimal combination of gsl and private cloud capacity that minimizes the total cost. The results obtained below are based on a base service level of 99.45% and a base level price of $1.0/unit, as in the first version of the model. The computations were performed for different values of the premium service rate.

Figure 2 describes the relationship between the optimal guaranteed service level and the premium service rate. The graph indicates that when the premium rate is greater than the base rate ($1.0), it is more profitable for a cloud customer to acquire cloud services at the base level service guarantee. As the premium price drops below $1.0, it is more cost effective for a customer to contract service levels at the maximum level a cloud provider can make available. Figure 3 shows that as the premium service rate increases gradually from $1.0, the private cloud capacity increases correspondingly. As the premium price drops below $1.0, the required private capacity decreases. When the premium price drops to zero, customers contract at the highest guaranteed service level offered by the cloud provider and the private cloud capacity drops to 100 units. This situation is equivalent to a fixed public price scenario at a higher guaranteed service level. In fact, if we set the guaranteed service level at 99.45%, the optimal capacity becomes 439.38, as in the fixed price scenario.

Fig. 2. Optimal guaranteed service levels v premium service rates.

Fig. 3. Optimal private capacity v premium service rates.
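The sketch below (our own, building on the hypothetical cloud_fitness_function and genetic_minimize sketches given earlier; the premium service rate of 0.5 and the use of fractional service levels are illustrative assumptions) shows one way Eq. (1) can be folded into the fitness when both the private capacity and the guaranteed service level are decision variables. A two-variable chromosome encoding would be needed in the optimizer, which the earlier one-dimensional sketch does not provide.

def public_price(gsl, base_price=1.0, base_level=0.9945, psr=0.5):
    # Equation (1): unit price of public cloud as a function of the guaranteed service level.
    return base_price + (gsl - base_level) * psr

def two_variable_cost(params, demand_sampler, private_unit_price,
                      penalty_unit_cost, time_periods):
    # Cost when both the private capacity and the SLA service level are decision variables.
    capacity, gsl = params
    return cloud_fitness_function(capacity, demand_sampler, private_unit_price,
                                  public_price(gsl), penalty_unit_cost,
                                  gsl, time_periods)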

Figure 4 demonstrates the relationship between the optimal guaranteed service level and the total optimal cost for the customer to satisfy demand. As the premium service rate increases, the total optimal cost for the customer increases at a much slower rate when the premium service rate is above $1.0 than when it is below $1.0. At premium service rates greater than $1.0, it is in the interest of the cloud services consumer to acquire more private cloud capacity, thus reducing the influence of higher premium service rates.


Fig. 4. Total optimal cost v premium service rates.

To give a quantitative perspective of the results, Table 2 shows the private cloud capacity, the public cloud cost per unit, and the total optimal cost for various values of the premium service rate, based on a base level guaranteed service of 99.45% and a base level public cost of $1.0.

Table 2. Private cloud capacity, public cloud cost per unit, and the total optimal cost for various values of the premium service rate.

  Premium service rate   Private optimal capacity   Average unit cost of public cloud ($)   Optimal total cost ($)
  0                      121.99                     99.99                                   10,173,400
  0.2                    123.99                     99.99                                   11,138,453
  0.4                    206.37                     99.99                                   12,036,665
  0.6                    291.49                     99.99                                   12,900,850
  0.8                    373.24                     99.99                                   13,695,934
  1                      439.38                     99.46                                   14,387,709
  1.2                    439.75                     99.46                                   14,399,091
  1.4                    440.98                     99.46                                   14,413,303
  1.6                    443.64                     99.46                                   14,421,815
  1.8                    444.56                     99.46                                   14,433,144
  2                      446.01                     99.46                                   14,444,461

Ad Hoc Demand Distributions. To test the model on an empirical demand that can be obtained from historical data, we synthesized a random demand distribution shown in Fig. 5 with a mean of 1000 and a standard deviation of 186.4. The computed optimal private capacity is 1041.8. Figure 6 shows the average cost of demand met by cloud resources, average cost of demand met by private infrastructure and the average overall cost of meeting requirements per unit demand.
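For an ad hoc demand such as the one in Fig. 5, a sampler can be built directly from the histogram. The following sketch (with hypothetical bin values, not the actual data behind Fig. 5) shows one way to do this with numpy, producing a demand_sampler compatible with the fitness sketch above.

import numpy as np

def empirical_sampler(bin_centers, frequencies, seed=0):
    # Build a sampler for an ad hoc demand histogram (bin centers with observed frequencies).
    rng = np.random.default_rng(seed)
    p = np.asarray(frequencies, dtype=float)
    p = p / p.sum()
    centers = np.asarray(bin_centers, dtype=float)
    return lambda: float(rng.choice(centers, p=p))

# Hypothetical histogram standing in for the synthesized demand of Fig. 5.
demand_sampler = empirical_sampler([600, 800, 1000, 1200, 1400], [5, 20, 45, 22, 8])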


Fig. 5. Computing demand.

Fig. 6. Total cost of publicly and privately met demand and total overall cost.

To examine the effect of higher variation in demand data with the same mean, a new set of demand data was synthesized with a mean of 1000 and a standard deviation of 425.8, as shown in Fig. 7.

Fig. 7. Computing demand with a higher variance.


The computed optimal private capacity is 1053.57. Figure 8 shows the total cost of demand met by the two options and the overall total.

Fig. 8. Total cost of publicly and privately met demand and overall cost for high variance demand.

Figure 9 shows the capacity utilization for the two demand patterns discussed above.

Fig. 9. Capacity utilization for two demand patterns with same mean but different variance.

To examine the effect of demand variation further, we conducted a number of runs assuming a normal distribution with a mean of 1000 and different standard deviations, as shown in Table 3. Table 3 indicates that the optimal capacity and the utilization of private infrastructure decrease as the variation in demand increases. The average cost of meeting total demand using the hybrid cloud increases as the variation in demand increases. Comparing these results with the optimal private capacity of 439 units for the negative exponential demand with a mean of 1000 units leads to the observation that the shape of the demand distribution has a significant effect on the level of optimal private capacity.


Table 3. Cost statistics for a normally distributed demand with mean 1000 and different levels of variance.

  Standard deviation   Optimal capacity   Average utilization of private resources (%)   Overall cost per unit demand ($)
  50                   982                98.78                                          1.029
  100                  962                97.49                                          1.057
  200                  924                94.88                                          1.114
  300                  886                91.94                                          1.172

4 Conclusions

In this work, we have built a simulation model based on genetic programming to find the optimal combination of private and public cloud resources to satisfy a demand for computing resources that can be described over the planning period by an estimated parametric probability density function or an ad hoc demand histogram based on real data. The model has been applied to both parametric and ad hoc types of demand. The results obtained are almost identical to those of comparable analytical models in the literature. Our experiments have shown that the optimal level of private computing capacity depends to a large extent on the shape of the demand curve, negative exponential or normal for example. Variations in demand within the same family of demand distributions have a very small effect on capacity for the same mean demand over the planning period, but a significant impact on capacity utilization and cost. The distinguishing feature of our model is that it can handle any theoretical or empirical demand probability distribution. In addition, our computational scheme allows for random variation in any of the parameters affecting the total cost of cloud resources consumed, as long as this variation can be described by a theoretical or an ad hoc density function. The model can also be easily modified to determine the optimal total cost with respect to any group of parameters used as decision variables. This computational scheme will be extended in the future to compute the optimal mix of private and public resources such as IaaS, SaaS and PaaS.

References

1. Rao, C.C., Leelarani, M., Kumar, Y.R.: Cloud: computing services and deployment models. Int. J. Eng. Comput. Sci. 2(12), 3389–3392 (2013)
2. Mell, P., Grance, T.: The NIST Working Definition of Cloud Computing. National Institute of Standards and Technology (NIST), Special Publication 800-145, Gaithersburg (2011)
3. Subhash, L., Thooyamani, K.P.: Allocation of resource dynamically in cloud computing environment using virtual machines. Int. J. Adv. Technol. 8(4), 193 (2017)
4. Islam, S., Gregoire, J.-C.: Giving users an edge: a flexible Cloud model and its application for multimedia. Future Gener. Comput. Syst. 28(6), 823–832 (2012)
5. Diaby, T., Rad, B.: Cloud computing: a review of the concepts and deployment models. Int. J. Inf. Technol. Comput. Sci. 9(6), 50–58 (2017)
6. Ali, T., Ammar, H.: Pricing models for cloud computing services, a survey. Int. J. Comput. Appl. Technol. Res. 5(3), 126–131 (2016)
7. Mukundha, C., Vidyamadhuri, K.: Cloud computing models: a survey. Adv. Comput. Sci. Technol. 10(5), 747–761 (2017)
8. Ibrahimi, A.: Cloud computing: pricing model. Int. J. Adv. Comput. Sci. Appl. 8(6), 434–441 (2017)
9. Al-Roomi, M., Al-Ebrahim, S., Buqrais, S., Ahmad, I.: Cloud computing pricing models: a survey. Int. J. Grid Distrib. Comput. 6(5), 93–106 (2013)
10. Soni, A., Hasan, M.: Pricing schemes in cloud computing: a review. Int. J. Adv. Comput. Res. 7(29), 60–70 (2017)
11. Mazrekaj, A., Shabani, I., Sejdiu, B.: Pricing scheme in cloud computing: an overview. Int. J. Adv. Comput. Sci. Appl. 7(2), 80–86 (2016)
12. Yu, X., Gen, M.: Introduction to Evolutionary Algorithms, 1st edn. Springer, London (2010)
13. Khanafer, A., Kodialam, M., Puttaswamy, K.: To rent or to buy in the presence of statistical information: the constrained Ski-Rental problem. IEEE/ACM Trans. Netw. 23(4), 1067–1077 (2015)
14. Li, Y., Deng, Y., Tang, X., Cai, W., Liu, X., Wang, G.: Cost-efficient server provisioning for cloud computing. ACM Trans. Multimedia Comput. Commun. Appl. 14(3s), Article 55 (2018)
15. Guo, T., Sharma, U., Shenoy, P., Wood, T., Sahu, S.: Cost-aware cloud bursting for enterprise applications. ACM Trans. Internet Technol. 13(3), Article 10 (2014)
16. Deniziak, S., Ciopinski, L., Pawinski, G., Wieczorek, K., Bak, S.: Cost optimization of real-time cloud applications using developmental genetic programming. In: Proceedings of the IEEE/ACM 7th International Conference on Utility and Cloud Computing, London (2014)
17. Henneberger, M.: Covering peak demand by using cloud services – an economic analysis. J. Decis. Syst. 25(2), 118–135 (2016)
18. Lee, L.: Determining an optimal mix of cloud computing for enterprises. In: Companion Proceedings of the 10th International Conference on Utility and Cloud Computing, Austin, TX, USA (2017)

Simulation of Artificially Generated Intelligence from an Object Oriented Perspective

Bálint Fazekas and Attila Kiss

Department of Information Systems, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
{bfazekas,kiss}@inf.elte.hu

Abstract. The aim of this paper is to outline several different approaches to how the evolution of intelligence can be observed in agents placed inside a given environment. We fully explain a method that relies strongly on current object oriented programming paradigms and technologies, and give a complete explanation of our implementation of the chosen model. In the first parts of the paper, we point out the important characteristics of intelligence and what is to be measured. Then, we introduce ideas that can be used to model the simulation of artificial intelligence evolution. During the explanation of the models, we point out both the strengths and the weaknesses of the models. We then move on to a guide to our implementation, then show and explain the results of the simulation. Lastly, we suggest further possible improvements to our approach.

Keywords: Computational intelligence · Artificially generated intelligence · Evolution · Simulation · Agents · Object oriented programming

1 Introduction and Ideas

The aim of this project is to observe an entity – or a given set of entities – placed within a defined environment, while the entity observes its surroundings, tries to evolve, and deduces certain rules of the objects in its environment.

1.1 Refined Description

176

B. Fazekas and A. Kiss

this paper is strictly to concentrate on analyzing the intelligence, logical deductive capabilities, and environment observation of the given entities – from here on, which we will refer to as agents. Despite the fact, that researches, theories, and algorithmic methods – that had well–tried in the open field – related to artificial intelligence had been around for decades, the topic of this paper is on a higher level of abstraction of theoretical research, than that of artificial intelligence. In contrast to the fields of artificial intelligence, the aim of this paper is to observe whether agents are capable of deduce, logically separate (or classify), and use the objects in its environment for its own purposes. We assume that the agents are given a set of observatory senses (such as light sensors), and that the environment is a logically built, non–physically represented environment. Definitions. This section contains definitions which “cannot very well be defined” scientifically. This problem is present due to the fact that the current state of definitions related to psychological terms and behavioral observations are quite subjective in their nature, and therefore hard to define by default. The difficulty of constructing these definitions is also caused by our extremely limited capabilities to get a full map and process–diagram of a person’s thought–process. In the real world, the human brain works slightly different for each individual, and therefore develop differently. Hence, in the case of every individual, we would deal with a non–deterministic, and structurally deviant system. Obviously, the complexity of building such system that is capable of handling many different sub–systems is too large for the purposes of this paper. The following definitions were inspired by a one–hour long interview with Christof Koch, who is a famous and notorious German–American neuroscientist. Christof Koch has created countless models and publications regarding the relationship between human cognition and neuron networks. The interview is available for free to watch on one of MIT’s great professors, Lex Fridman’s YouTube channel on the following link: https://www.youtube.com/watch? v=piHkfmeU7Wo. Consciousness: In this paper, we define consciousness as one’s ability of self– reflection. By self–reflection, we mean the expression of how well a being can separate, or distance itself from its environment and from other being in its environment, and most importantly is capable of developing these skills by itself. Intelligence: We define intelligence as the ability to create relationships between the “known information” of an individual. Experience: In this context, experience is defined as an agent’s ability of how well, and what can it sense within its given environment. Object: Object is defined as a set of attributes and functions that can be executed on that object. The parameters of the functions of an object can be of

Simulation of AGI

177

arbitrary type and amount, as an agent has no knowledge of these function in the beginning. Object–space: Object–space is a physical space within the memory or disk of a computer, that stores all the objects that we intend to use in the environment. Agents that observe the environment are not part of the object–space. Environment: The environment is an interpretation (or representation) of the object–space that an agent is able to communicate with. Agent: An agent is an actor that can observe, sense, communicate, and have an impact on its environment. Our aims with this project can be summarized by observing the cognitive emergence of intelligent agents [1,5].

2

Model Creation

Based on the definitions above, it is not a straightforward task to define a model which is adequate for simulating intelligence–evolution. For this, a necessary amount of creativity and heuristic approach is needed, which can precisely fit for our purposes, but at the same time, neither restrict the problem at hand, and our ability to implement it. During the planning of a model, we tried to come up with several models that fit in the scope of our definitions. These models were developed considering different mathematical approaches and several services provided by certain programming languages. First, let us take a look at an approach that would be good for running a simulation, but otherwise very bad when it comes to implementation and analysis. 2.1

Analytic and Functional Approach Based on Mathematical Models

For this approach, we tried to build a system from a purely mathematical, physical, and analytic point of view. In this model, the environment is a static object that contains several physical objects. We can think of this environment as Space, or universe that has a defined set of rules, and every object contained are subjected to these rules. For the sake of simplicity, we can say that the environment does not alter its objects, and that the objects do not have an impact on each other. This environment is simply a container for the physical objects that we define. In this sense, this environment is the same as our known universe, and the rules can be interpreted as the mathematical laws that we know today. Each object in the environment has a simple identification attribute (for example, a name), a category list, and a set of executable functions.

178

B. Fazekas and A. Kiss

The only purpose of the identifier name of an object is so we can reference the object in the environment. Given a set of classified objects at the beginning, we expect our agents to be able to process and further classify new objects. Here, we can make two choices about the starting object set: we could leave it empty, and leave every classification process to the agents. The other approach is to create a few labeled objects, which observed by the agents, can guide the agents how to classify, and how to interpret the basic labels. Naturally, here we can also chose to create such a set of basic objects, which contains labeled basic objects (such as: “door”, “button”, “box”, etc.). One characteristic of these basic objects, is that they are fully labeled, meaning that the agents know everything about these objects; in other words, the agents have access to all attributes and methods of these objects. Outside of the set of basic objects, we can define a set of regular objects, which contain only partially labeled objects, and analyze how the agents observe (classify) these partially labelled objects. As we previously outlined, an object has a list of methods. These methods can be executed on that given object. Each of these methods can be interpreted as a mathematical function, more precisely, an analytically solvable function. Each function can alter the inner state (attributes) of an object, and the agents can query the change of state of that object. However, it is not entirely clear what kind of mathematical function is defined within a method, and what the actual function is trying to solve. Let us take a look at a complex, mathematical model. Suppose that our agent would like to wash its hands in a washroom. To reach the sink in the washroom, it must first get to the door of the washroom, and must open it, without having any knowledge of such object. After opening the door, the agent must find the sink, and wash its hands without knowing how the sink actually works. To solve this problem, the agent could try to find all methods of the sink object, and “bombard” the found functions with arbitrary types and arbitrary number of parameters, and try to find a method, that accepts these parameters, and results in a desired outcome. The agent will know that it found the right method for washing its hands by querying the change of state of the object after giving it a certain parameter list. After finding the desired method, the agent must undergo an optimization routine, which will find a reasonable value for the function’s parameter, to get an adequate result. The actual process of solving the problem can only begin after all of these have been found out by the agent. This small example demonstrates that even this small problem requires a lot of planning and technologically interesting fields, both from the agents’ and the developers’ perspective. This approach is further complicated by the aim of this research, in which we try to observe the evolution of an agent’s intelligence. This means that after each step, the agent must classify the observed objects, and try to dynamically create newer classifications while the simulation is running. From the perspective of the implementation, this means that the program must

Simulation of AGI

179

be able to create new data types during run-time, and use the created types. However, the purely mathematical approach requires to dynamically create isomorphic relations between data types, the problems we wish to solve, and the mathematical functions. In order to avoid this great complexity, we propose another model, which is easier to define and implement. 2.2

Object Oriented Approach Using Discrete Values

In the previous section we saw that the mathematical approach introduces too many complications to our goal. In order to avoid this, we chose a different approach. In our discrete model, let the environment be a container, that holds all the objects and the agents. In this model, we can define a new type objects that have attributes which change with the passing of time. We call this model the evolutionary environmental model. In the evolutionary environmental model, we have to account for the objects impact on each other, as well as the passing of an arbitrary unit of time. In this model, we measure the unit of time in the amount of steps taken since the beginning of starting the simulation. One step includes all the agents observing some of the objects in the environment. With each step, the step–counter increases, and some objects may change their inner states regarding the counter. For example, assume that our environment contains an iron door. We know that a property of iron is that it rusts if it is exposed to oxygen for long enough. This iron door object might have a function that determines the force which is needed to open it. In the beginning, if the iron is completely fresh, then it will be much easier for an agent to open it – in other words, it can apply a much lower force to open the door. However, after some steps, the door may start to “rust”, causing the increase of minimum force to open it. Therefore, we can see that this object will change its inner state without an agent’s interaction. The purpose of the evolutionary environment is to propagate all of its contained objects forward in time. An agent can observe its environment only if all the environmental objects are within the same evolutionary step. We denote the evolutionary environment with Γ , and denote each discrete state of the environment with Γ0 , Γ1 , ..., Γk (k ∈ N0 ). Furthermore, let us denote a step between these discrete state as Γk −→ Γk+1 . For several steps, we use the following denotation: Γk −→∗ Γk+m , (k ∈ N0 , m ∈ N0 ). During the execution of the simulation, we would like the agents to deduce the attributes and functions of the objects. Consequently, it is not enough for the agent to observe the objects in the environment, hence practically that would mean that the agent knows everything about that particular object (attributes and methods). This problem can be solved by introducing a new space, where the objects are held. We place the well defined objects in the object–space. We denote the object-space with Ω, and the objects within the object space with ω1 , ω2 , etc. (ω1 , ω2 , .. ∈ Ω)

180

B. Fazekas and A. Kiss

After we have placed the desired objects in the object-space, we map the object-space to an object representational table, which stores all the environment objects as serialized strings in a table. This representational table is the medium between the agents and the actual object, and it only represents the information about an object that has been “discovered” by an agent before. We denote the representational table with Σ. In the previous section, we mentioned that it is advised to include baseobjects in the environment, which play guiding roles for the agents. The purpose of these base-objects is especially important is this case, since these object are the ones that define how other objects will be interpreted. We understand baseobjects as static objects in the environment, therefore we denote these objects with κ1 , κ2 , etc., and the set of base objects with K. Clearly, we can see that K ⊆ Ω, and ∀κ ∈ K : κ ∈ Ω. This object oriented approach that we described above, is very similar to ideas that are outlined in two other papers [10,11]. Essentially, our model comes very close to a abstract state machine, where the agents represent the changing states within the state machine, and the environment provides the input for the agents [14]. 2.3

2.3 Attribute Assignment to Objects

One of the basic characteristics of intelligence is that the "being" is capable of categorical thinking. In other words, the being is able to categorize the objects experienced in its environment and to use the objects according to their categories. These classifications are not straightforward, and the observed attributes have a great impact on how an agent will classify an object. During the simulation we would like each agent to be able to create its own categorical system for the objects that it has observed, and we would like to see the agent's knowledge about the environmental objects become more and more specific with each passing step; consequently, the category attribute of that object also becomes more specific. This is only possible if the agents observe the same object in several different Γ states.
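One possible way to realize such a gradually sharpening category – given here only as a hypothetical sketch of ours, not as part of the model – is to keep, per object, the set of attribute values that are still consistent with every observation made so far:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: an agent-side category entry that becomes more specific
// each time the same object is observed in a new Γ state.
class CategoryEntry {
    private final Set<String> candidateAttributes = new HashSet<>();
    private boolean initialized = false;

    /** Record one observation of the object in the current Γ state. */
    public void observe(Set<String> attributesSeenThisStep) {
        if (!initialized) {
            candidateAttributes.addAll(attributesSeenThisStep);
            initialized = true;
        } else {
            // Narrow the category: keep only the attributes confirmed again.
            candidateAttributes.retainAll(attributesSeenThisStep);
        }
    }

    public Set<String> currentAttributes() {
        return new HashSet<>(candidateAttributes);
    }
}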

2.4 Comparison with General Neural Networks

Several questions can arise from the models explained above. For example, why is it necessary to create these models in the first place, when there are already well-defined methods in the field of artificial intelligence that can optimize any type of mechanically implementable algorithm; and how is our model actually different from neural networks? First, let us take a look at the well-defined methods in the field of artificial intelligence. One element of creating an AI algorithm is to define the problem at a higher level of abstraction in such a way that the problem can be easily mapped to a problem graph. The starting node of the graph indicates where we initially start our simulation, and we can move to different nodes on
the graph based on a set of rules. From these algorithms we expect to reach a node of the graph that constitutes the solution of the problem. There are many variations of these graph-search algorithms, and they try to find the solution as optimally as possible – if it exists. However, it is also possible that our problem does not have a clear solution. In this case we are forced to introduce heuristics into the algorithms, which help when the algorithm gets stuck in a state or in a sequence of states. These heuristics are always defined by the given problem and often cannot be generalized. During our simulation, and during the agents' observation of their environment, we work with many random numbers, which would greatly complicate our ability to define a problem space without actually benefiting the simulation. We would like to observe whether the machines are able to create or initialize a hierarchical observation system. Therefore, even though graph-search algorithms would be suitable if we had a clear, mechanical algorithm, this approach does not help much in our case.

Now let us explain why our model is different from neural networks. Neural networks are used in cases where we want to teach a program to solve a problem on a general level. The learning process can be guided (supervised) or unguided (unsupervised). In the case of guided learning, we tell the program what the outcome (result) should be, and we expect the program to adjust its parameters if the calculated result is not the expected one. By storing the parameters gained this way, the more sample data we feed to the learning process, the more likely our program will be able to solve the given problem outside of its learning phase. The samples given to the program have a great impact on each step of the learning process. For example, if we only show left turns to the automatic steering software of a vehicle, then the vehicle will have almost no chance of correctly taking a right turn. At first it would seem that this problem can be solved by giving a relatively large number of samples covering all possible cases that the software could encounter. However, in that case the software is subject to over-learning (overfitting), which means that it will account for parameters that are completely irrelevant in certain cases. In the majority of cases this leads to worse results than those of a program that was given fewer, but more general, samples.

Therefore, we must discard guided learning for the following reasons. Firstly, we cannot tell the agents what outcome we are looking for, since this is essentially what we are trying to obtain. (We hope to see results that reflect the categorical thinking of a human, however this requirement is very hard to ensure due to the lack of intervention.) Secondly, the sample data given to the learning process must help the categorization process, but the samples have no precedence, therefore there is no real "good" sample that we could supply to the learning process. Our model is much more similar to the unguided learning process: we leave the categorization process to the agents and only interfere by telling the agents whether a calculated value is adequate or not. The problem of not being able to give an adequate set of sample data is still present in this case.

3 Implementation

In the previous chapter we detailed the models that can, and should, be used for our purposes. In this chapter we explain how we implemented the object-oriented approach. Even though the object-oriented programming paradigms are well known, it is appropriate to mention an excellent work by Zeigler about how agent-based simulations should be planned and how the communication between the components flows [15]. Another fundamental work about object-oriented simulation was also studied when we planned our implementation [6]. Ideas about implementing simulation systems were inspired by an article written by Nepomuceno-Chamorro [12].

To implement our model, we require the means to reference an object by its name and to query all of its attributes and functions at run time. The Java programming language is a strongly object-oriented language, which is beneficial for defining the environment and the environmental objects. In Java 8 (version 1.8), which we use, the language provides a service called reflection. Reflection allows us to extract all information (including attributes, methods, and method parameters) about an object. We can exploit this service to give our agents a sense of "intuition". The Java Enterprise Edition (Java EE) includes services that support database connections and persistent data management, which is a convenient way for us to store our environmental objects without having to worry about extensive memory usage. With the so-called JPA – the Java Persistence API – we can annotate our objects to define exactly in which database and in which schema we would like to store them. The details of the modern Java JPA (also used in our implementation) can be read in a book by Keith and Schincariol [8]. By persisting our objects in databases, we know exactly where to look for them, and more importantly, the memory of the running simulation only contains the agents, the representational string table, and the object which is currently observed by an agent.

However, one problem arises from persistently storing our objects in the database: if an object is an evolutionary environmental object, then we expect it to undergo some inner change of state as we propagate through the steps of the simulation – even if the object is not observed by an agent (and therefore has not been pulled into the program's memory). We solve this problem by storing, for each object, the value of the step counter at which it was last observed by an agent. When the object is observed again, the step counter is refreshed and the new inner state of the object is calculated. After we calculate the new inner state, we pass the object to the agent for observation.
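This lazy update can be sketched as follows; the class and member names are our own assumptions for illustration, not the actual code of our implementation:

// Hypothetical sketch of the "lazy evolution" of persisted objects: the inner
// state is only recomputed when an object is next observed, using the step
// counter stored alongside it in the database.
public abstract class EvolvingEntity {
    protected long lastObservedStep = 0;   // persisted together with the entity

    /** Recompute the inner state for all steps missed since the last observation. */
    public final void refresh(long currentStep) {
        long missedSteps = currentStep - lastObservedStep;
        if (missedSteps > 0) {
            evolve(missedSteps);           // subclass-specific change of state
            lastObservedStep = currentStep;
        }
    }

    /** The object-specific rule for propagating the inner state forward in time. */
    protected abstract void evolve(long missedSteps);
}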

3.1 Environment Implementation

As mentioned before, the environment is responsible for maintaining the communication between the agents and the environmental objects. Since we want to store our environmental objects in a database, we need some sort of communication pipe between the agents and the actual objects.
This communication channel is very similar to the message-driven concept in P–colonies, where each side has a sender and a receiver method [2,4]. The environment is therefore responsible for receiving the requests from the agents and forwarding them to the objects in the object-space. Besides supporting the means of communication, the environment has another important task: it is also responsible for querying the database and creating the representational string table from the queried objects in the object-space. We use Java run-time reflection to extract all information about the queried objects and serialize these objects into the string table, based on a string representation of the objects. This is beneficial for us because, later on, we can tell exactly which data in the representational table can be interpreted by the agents.
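As a minimal sketch of this idea – our own illustration under assumed names, not the actual code of the implementation – the environment could serialize each queried object into the string table using run-time reflection:

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: building the representational string table from the
// queried objects with run-time reflection.
public class EnvironmentTableBuilder {

    /** Serialize one object's declared fields into a single string entry. */
    public static String serialize(Object obj) throws IllegalAccessException {
        StringBuilder sb = new StringBuilder(obj.getClass().getSimpleName()).append('(');
        for (Field field : obj.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            sb.append(field.getName()).append('=').append(field.get(obj)).append(' ');
        }
        return sb.append(')').toString();
    }

    /** Build the table: one serialized entry per queried object. */
    public static Map<String, String> buildTable(Iterable<Object> queriedObjects)
            throws IllegalAccessException {
        Map<String, String> table = new HashMap<>();
        int index = 0;
        for (Object obj : queriedObjects) {
            table.put(obj.getClass().getSimpleName() + "#" + index++, serialize(obj));
        }
        return table;
    }
}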

3.2 Environmental Objects

We implement the environmental objects using Plain Old Java Objects (POJOs). These object-defining classes have the characteristic of having getters and setters for all of their attributes, in a very well-defined format. Let us look at an example of what we mean by "well-defined format". Suppose our object has an attribute called "id". For this attribute, the getter function would be called "getId" and the setter would be called "setId". Each object class can have numerous attributes. Rather than writing the getters and setters for each attribute and expanding the code, we use the Lombok Java library, which automatically generates these methods for us. This is also a much safer way of doing so, because it eliminates the possibility of mistyping these methods. To identify an object class as a persistable entity, we use the annotations included in the Java Persistence API. We use the "@Entity" and "@Table(schema = "mySchema", name = "object")" annotations to tell which schema and table we want to persist the object in.
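As a small illustration of this convention (both classes below are hypothetical examples of ours, not entities of the simulation), the hand-written accessors and their Lombok-generated equivalent look like this:

// Without Lombok: the "well-defined format" of accessors for an attribute "id".
class PlainObject {
    private int id;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}

// With Lombok: the @Data annotation generates equivalent getters and setters
// at compile time, so they never appear in the source file.
@lombok.Data
class GeneratedObject {
    private int id;
}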

3.3 Agents

As outlined in the previous sections, the agents are objects which can observe (read) their environment and occasionally change the inner state of the objects via sent messages. We implement our agents such that they have a list of known categories, which stores the types of objects the agent has encountered before. The agents also have sender and receiver methods, which are generic in the sense that they are capable of sending and receiving different types of objects. The sender function sends a list of types and values, and tries to call a function of the object with the parameters given in the list. The purpose of the receiver function is to receive the result of that call and to interpret it.
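The following sketch illustrates this structure; the class and its methods are hypothetical simplifications of ours, not the actual agent implementation:

import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an agent with a list of known categories and generic
// sender/receiver methods.
public class Agent {
    private final List<String> knownCategories = new ArrayList<>();

    /** Sender: try to call a function of the observed object with the given arguments. */
    public void send(Object target, String methodName, List<Object> arguments) {
        // Build the parameter type list from the argument values.
        // Note: boxed vs. primitive parameter types are not reconciled in this sketch.
        Class<?>[] types = new Class<?>[arguments.size()];
        for (int i = 0; i < arguments.size(); i++) {
            types[i] = arguments.get(i).getClass();
        }
        try {
            Method method = target.getClass().getMethod(methodName, types);
            receive(method.invoke(target, arguments.toArray()));
        } catch (ReflectiveOperationException ex) {
            // The object has no matching function; the agent learns nothing this time.
        }
    }

    /** Receiver: interpret the result by remembering the category (type) of what came back. */
    public void receive(Object result) {
        if (result == null) {
            return;
        }
        String category = result.getClass().getSimpleName();
        if (!knownCategories.contains(category)) {
            knownCategories.add(category);
        }
    }
}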

3.4 Implementation with Results

We first show the results obtained as the implementation progressed. For the early tests, we used one type of entity object called Block. A block is an abstract object with very simple, observable characteristics, such as dimensions, weight, and the ability to be moved. The following Java class defines this object.

@Data
@Entity
@Table(schema = "WOAF", name = "BLOCK")
public class Block {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(nullable = false)
    private int id_;

    @Column(nullable = false)
    private int step_ = 1;

    @Column(nullable = false)
    private float width_;

    @Column(nullable = false)
    private float height_;

    @Column(nullable = false)
    private float depth_;

    @Column(nullable = false)
    private boolean groundAttached_;

    @Column(nullable = false)
    private boolean hollow;

    public Block() {
    }

    ...
}

As we mentioned in the previous section, the annotations above the class definition help the Java compiler and JPA to sort out what to do with the class. The @Data annotation automatically generates the getters and setters for all fields, without the code itself appearing in the source file. The reason we decided to put an underscore character after each property name is to avoid accidental name clashes with SQL names; we use this convention throughout all of our objects. We used the @Table annotation to specify in which database schema and table we wish to store our object. In order to initialize our database – which is essentially our environment with the known objects – we instantiated a few objects of the Block class and stored them in the database.

private static void fillDB() {
    List<Object> objects = new ArrayList<>();

    Block b = new Block(10, 20, 15, true, true);
    Block b2 = new Block(40, 20, 30, false, false);
    objects.add(b);
    objects.add(b2);

    EntityManagerFactory emf =
            Persistence.createEntityManagerFactory("ReflectionPU");
    EntityManager em = emf.createEntityManager();

    // persist every initial object in a single transaction
    em.getTransaction().begin();
    for (Object object : objects) {
        em.persist(object);
    }
    em.getTransaction().commit();
    em.close();
}

Luckily, we found that using a list of Object-typed references is adequate for storing the initializing entities, since the entity annotations make sure that each entity is persisted in its corresponding table. This is especially useful, since we can store every initializing object in one list and persist them in a single loop. After generating the basic environment objects and persisting them in our database, we moved on to querying these objects from a clean, empty environment, without ever mentioning the type of the object we would like to query. To do so, we first query the names of the tables that are present in our database.

private static List<String> setupDatabaseConnection() {
    Connection con;
    String driver = "org.apache.derby.jdbc.EmbeddedDriver";
    String dbName = "//localhost:1527/Reflection;user=woaf;password=123";
    String connectionURL = "jdbc:derby:" + dbName;
    List<String> tableNameList = new ArrayList<>();

    try {
        Class.forName(driver);
        con = DriverManager.getConnection(connectionURL);

        ResultSet sets = con.getMetaData().getTables(null, "WOAF", "%", null);
        while (sets.next()) {
            // skip the generated SEQUENCE helper table
            if (!sets.getString(3).equals("SEQUENCE")) {
                tableNameList.add(sets.getString(3));
            }
        }
        con.close();
    } catch (SQLException | ClassNotFoundException ex) {
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
    }
    return tableNameList;
}

This method returns the names of all the tables that are present in our database. However, there is one problem with the returned values. Following naming conventions, we gave all-capital names to our tables, but in our queries we have to refer to the entities with only the first letter of the name capitalized. For example, we would write "Block" instead of "BLOCK". The conversion of each name in the result list can be done easily. In the next few lines of code, we demonstrate how we managed to get all the persisted Block objects from the database, without referring explicitly to the Block table or type.

public static void main(String[] args) {
    fillDB();

    List<String> tables = setupDatabaseConnection()
            .stream()
            .map(s -> toCamelCase(s))
            .collect(Collectors.toList());

    EntityManagerFactory emf =
            Persistence.createEntityManagerFactory("ReflectionPU");
    EntityManager em = emf.createEntityManager();

    // query all stored objects of the first discovered type without naming it explicitly
    Query q = em.createQuery("SELECT b FROM " + tables.get(0) + " b");
    q.getResultList().forEach(System.out::println);
}
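The toCamelCase helper used above is not shown in this excerpt; a minimal sketch that matches the conversion described ("BLOCK" becomes "Block") could look like the following, with the caveat that it only handles single-word table names:

// Hypothetical sketch of the name conversion; not the actual helper.
private static String toCamelCase(String tableName) {
    if (tableName == null || tableName.isEmpty()) {
        return tableName;
    }
    return Character.toUpperCase(tableName.charAt(0))
            + tableName.substring(1).toLowerCase();
}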

Of course, we can iterate through all of the table names and query the stored objects of each type. Running the program gives us the following results.

Block(id_=2, step_=1, width_=40.0, height_=20.0, depth_=30.0, groundAttached_=false, hollow=false)
Block(id_=1, step_=1, width_=10.0, height_=20.0, depth_=15.0, groundAttached_=true, hollow=true)

As we can see, we managed to get our stored entities without referring to their types.

In the following lines of code we demonstrate how we can extract all of the information about any given class (identified by the table names), including function names, parameter numbers and types, method modifiers, and data members.

tables.forEach((name) -> {
    try {
        Class<?> arbitraryClass = Class.forName("base." + name);
        List