Interpretable Artificial Intelligence: A Perspective of Granular Computing 3030649482, 9783030649487

This book offers a comprehensive treatise on the recent pursuits of Artificial Intelligence (AI) – Explainable Artificial Intelligence (XAI).


English Pages 429 [430] Year 2021


Table of contents :
Preface
Contents
Explainable Artificial Intelligence for Process Mining: A General Overview and Application of a Novel Local Explanation Approach for Predictive Process Monitoring
1 Introduction
2 Background and Related Work
2.1 Process Mining
2.2 Predictive Business Process Management
2.3 Deep Learning for Predictive BPM and XAI
3 A Framework for Explainable Process Predictions
4 A Novel Local Post-Hoc Explanation Method
4.1 Binary Classification with Deep Learning
4.2 Local Region Identification by Using Neural Codes
4.3 Local Surrogate Model
5 Experiment Setting
5.1 Use Case: Incident Management
5.2 Evaluation Measures
5.3 Results
6 Discussion
7 Conclusion
References
Use of Visual Analytics (VA) in Explainable Artificial Intelligence (XAI): A Framework of Information Granules
1 Introduction
2 Explainability Strategies
2.1 Feature Selection
2.2 Performance Analysis
2.3 Model Explanations
3 Global and Local Interpretability
3.1 Information Scalability
3.2 Visual Scalability
4 Stability of Explanation
5 Visual Analytics for Granular Computing
6 Summary
References
Visualizing the Behavior of Convolutional Neural Networks for Time Series Forecasting
1 Introduction
2 Introduction to Neural Networks and Forecasting
2.1 Power Time Series Forecasting
2.2 Neural Networks
3 Relevant Literature
4 Training the CNN AE
4.1 Experiment
4.2 Data and Code
4.3 Setup
5 Visualization and Patterns
5.1 How to Interpret the Visualizations
5.2 Input Visualization
5.3 Kernel Visualization
5.4 Forecast Visualization
5.5 Activation Maps
5.6 How to Use the Individual Visualizations
6 Conclusion
References
Beyond Deep Event Prediction: Deep Event Understanding Based on Explainable Artificial Intelligence
1 Introduction
2 Why Current Machine Learning is Differentiated from Human Learning
3 Beyond Deep Event Prediction
4 Big Data, AI, and Critical Condition
5 DUE Architecture
6 Properties of DUE
7 The Concept of DUE
7.1 Human Critical Thinking
7.2 Contextual Understanding
8 Learning Model for DUE
8.1 Fundamental Computing for DUE
8.2 Computing Using CBNs-Based XAI
9 DUE Trends and Future Outlooks
9.1 Disasters
9.2 Economic Consequences
9.3 Safety and Security
10 Conclusions
References
Interpretation of SVM to Build an Explainable AI via Granular Computing
1 Introduction
1.1 The Era of Explainable AI with Granular Computing
2 The Problem with a Gap in Explainability
3 Related Work
4 Background
4.1 SVM Algorithm
4.2 Granular Computing
4.3 Syllogisms
4.4 Explainable Artificial Intelligence
5 Research Methodologies
5.1 A Constructive Approach in Developing XAI
5.2 A Human-Centric Approach at Early Development Stage
6 Implementation: A Syllogistic Approach to Interpret SVM's Classification from Information Granules
6.1 Data Selection
6.2 Identifying the Information Granules from These Data Sets
6.3 Analyzing and Interpretation of Syllogisms from SVM
6.4 The General Framework for Modelling Syllogistic Rules
6.5 Validating the Interpreted Syllogistic Rules with Physicians and CPGs
6.6 XAI Knowledge Base for CAD
6.7 XAI with Inference Engine
6.8 User Interface in Mobile Application
6.9 Preliminary Results
6.10 Iterative Retuning and Validation of XAI Mobile App with Physicians in the Loop
7 Final XAI Mobile App
7.1 XAI Mobile App
8 Testing Results from XAI Mobile App
8.1 Testing Phase I
8.2 Testing Phase II
8.3 Testing Phase III
8.4 Results from Testing
9 Conclusion and Discussion
10 Future Work
References
Factual and Counterfactual Explanation of Fuzzy Information Granules
1 Introduction
2 Background
3 Proposal
4 Illustrative Use Case
5 Experiments
5.1 Experimental Settings
5.2 Experiment 1: Relevance of Expert Knowledge-Based Counterfactual Explanations
5.3 Experiment 2: An Impact of Posterior Linguistic Approximation
6 Discussion
7 Conclusion and Future Work
References
Transparency and Granularity in the SP Theory of Intelligence and Its Realisation in the SP Computer Model
1 Introduction
2 Introduction to Transparency
3 Introduction to Granularity
4 The SP System in Brief
4.1 Information Compression
4.2 Abstract View of the SP System
4.3 Basic Structures in the SP System for Representing Knowledge
4.4 The Concept of SP-Multiple-Alignment
4.5 Unsupervised Learning
4.6 Existing and Potential Strengths of the SP System
4.7 SP-Neural
4.8 Future Developments
5 Information Compression and the Representation and Processing of Knowledge in the SP System
5.1 Information Compression via the Matching and Unification of Patterns
5.2 Discontinuous Patterns
5.3 Seven Variants of ICMUP
5.4 The DONSVIC Principle
5.5 Ideas Related to the Concept of a Granule
5.6 Tying Things Together?
6 Transparency via Audit Trails
7 Transparency via Granularity and Familiarity
7.1 Granularity, Familiarity, and Basic ICMUP
7.2 Granularity, Familiarity, and Chunking-With-Codes
7.3 Granularity, Familiarity, and Schema-Plus-Correction
7.4 Granularity, Familiarity, and Run-Length Encoding
7.5 Granularity, Familiarity, and Part-Whole Hierarchies
7.6 Granularity, Familiarity, and Class-Inclusion Hierarchies
7.7 Granularity, Familiarity, and SP-multiple-alignments
8 Interpretability and Explainability
9 Conclusion
References
Survey of Explainable Machine Learning with Visual and Granular Methods Beyond Quasi-Explanations
1 Introduction
1.1 What Are Explainable and Explained?
1.2 Types of Machine Learning Models
1.3 Informal Definitions
1.4 Formal Operational Definitions
1.5 Interpretability and Granularity
2 Foundations of Interpretability
2.1 How Interpretable Are the Current Interpretable Models?
2.2 Domain Specificity of Interpretations
2.3 User Centricity of Interpretations
2.4 Types of Interpretable Models
2.5 Using Black-Box Models to Explain Black Box Models
3 Overview of Visual Interpretability
3.1 What is Visual Interpretability?
3.2 Visual Versus Non-Visual Methods for Interpretability and Why Visual Thinking
3.3 Visual Interpretation Pre-Dates Formal Interpretation
4 Visual Discovery of ML Models
4.1 Lossy and Lossless Approaches to Visual Discovery in n-D Data
4.2 Theoretical Limitations
4.3 Examples of Lossy Versus Lossless Approaches for Visual Model Discovery
5 General Line Coordinates (GLC)
5.1 General Line Coordinates to Convert n-D Points to Graphs
5.2 Case Studies
6 Visual Methods for Traditional Machine Learning
6.1 Visualizing Association Rules: Matrix and Parallel Sets Visualization for Association Rules
6.2 Dataflow Tracing in ML Models: Decision Trees
6.3 IForest: Interpreting Random Forests via Visual Analytics
6.4 TreeExplainer for Tree Based Models
7 Traditional Visual Methods for Model Understanding: PCA, t-SNE and Related Point-to-Point Methods
8 Interpreting Deep Learning
8.1 Understanding Deep Learning via Generalization Analysis
8.2 Visual Explanations for DNN
8.3 Rule-Based Methods for Deep Learning
8.4 Human in the Loop Explanations
8.5 Understanding Generative Adversarial Networks (GANs) via Explanations
9 Open Problems and Current Research Frontiers
9.1 Evaluation and Development of New Visual Methods
9.2 Cross Domain Pollination: Physics & Domain Based Methods
9.3 Cross-Domain Pollination: Heatmap for Non-Image Data
9.4 Future Directions
10 Conclusion
References
MiBeX: Malware-Inserted Benign Datasets for Explainable Machine Learning
1 Introduction
2 Background and Related Works
2.1 Malware Analysis Overview
2.2 Granularity in Malware Analysis
2.3 Feature Visualization
2.4 Malware as Video
2.5 MetaSploit
2.6 Bash Commands
3 Dataset Generation
3.1 Gathering Benign Files
3.2 Trojan Insertion
3.3 Malware Verification
3.4 Dataset Generation Results
4 Malware Classification
4.1 Pre-processing
4.2 Network Specifications
4.3 Classification Results
5 Saliency Mapping
6 Conclusion and Future Work
References
Designing Explainable Text Classification Pipelines: Insights from IT Ticket Complexity Prediction Case Study
1 Introduction
2 Related Work
2.1 Explainability and Granularity
2.2 Text Representation
2.3 Text Classification
2.4 Ticket Classification Research
2.5 Summary
3 Methods
3.1 Feature Extraction
3.2 Machine Learning Classifiers
4 Experimental Evaluation
4.1 Case Study and Datasets
4.2 Experimental Settings
4.3 Comparison of SUCCESS and QuickSUCCESS
4.4 Results
5 Discussion
5.1 Explainability and Granularity Implications
5.2 Methodological Contributions
5.3 Managerial and Practical Contributions
6 Conclusion and Future Works
Appendix I: Taxonomy of Decision-Making Logic Levels
Appendix II: Business Sentiment Lexicon with Assigned Valences
References
A Granular Computing Approach to Provide Transparency of Intelligent Systems for Criminal Investigations
1 Introduction
2 Supporting Intelligence Analysts with Intelligent Systems
2.1 Faster Investigations with Intelligent Systems
2.2 The Need for Transparent Systems in Criminal Investigations
2.3 A Granular Computing Perspective to Design Transparent Systems
3 How Analysts Think
3.1 Cognitive Task Analysis (CTA) Interview Study [37]
4 Designing Recognisable Systems
4.1 Modelling Information Granules for Conversational Agent Intentions
4.2 Implementing Interpretable Conversational Agent Intentions
5 Insightful Investigative Agents
5.1 Modelling Information Granules for Investigation Pathways
5.2 Implementing Interpretable Recommendations for Investigation Paths
6 Evaluation Studies
6.1 Static Prototype Evaluation Study: Interpretability Requirements Depend Upon the System Component [53]
6.2 Interactive Prototype Evaluation Study
7 Conclusion
References
RYEL System: A Novel Method for Capturing and Represent Knowledge in a Legal Domain Using Explainable Artificial Intelligence (XAI) and Granular Computing (GrC)
1 Introduction
2 Related Work
3 RYEL: Explainable Artificial Intelligence and Granular Computing
3.1 Implementation
3.2 Granular Computing
3.3 Interpretation-Assessment/Assessment-Interpretation (IA-AI)
3.4 Explainable Artificial Intelligence
4 Results
5 Conclusions
References
A Generative Model Based Approach for Zero-Shot Breast Cancer Segmentation Explaining Pixels’ Contribution to the Model’s Prediction
1 Introduction
2 Related Works
2.1 Granular Computing in Image Understanding
2.2 Generative Adversarial Networks on Anomaly Detection
2.3 Explainable Artificial Intelligence (XAI)
2.4 XAI for Anomalous Region Segmentation
3 The Fundamentals
3.1 Why Adversarial Training?
3.2 What is an Anomaly?
3.3 Why GAN in Anomaly Detection?
3.4 Impact of XAI in Anomaly Detection
3.5 RISE Model
3.6 Motivation Behind This Approach
4 Proposed Methodology
4.1 Healthy GAN
4.2 Anomalous Region Segmentation
5 Evaluation
5.1 Evaluation Metric
5.2 Experiments and Results
6 Conclusion and Future Work
References
Index


Studies in Computational Intelligence 937

Witold Pedrycz Shyi-Ming Chen   Editors

Interpretable Artificial Intelligence: A Perspective of Granular Computing

Studies in Computational Intelligence Volume 937

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/7092

Witold Pedrycz · Shyi-Ming Chen

Editors

Interpretable Artificial Intelligence: A Perspective of Granular Computing


Editors

Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada

Shyi-Ming Chen, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-64948-7 ISBN 978-3-030-64949-4 (eBook) https://doi.org/10.1007/978-3-030-64949-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

In recent years, Artificial Intelligence (AI) has emerged as an important, timely, and far-reaching research discipline with a plethora of advanced methodologies and innovative applications. With the rapid progress of AI concepts and methods, there is also a recent trend to augment the paradigm by bringing aspects of interpretability and explainability. With the ever-growing complexity of AI constructs, their relationships with data analytics (and the inherent danger of cyberattacks and adversarial data) and the omnipresence of demanding applications in various critical domains, there is a growing need to associate the results with sound explanations and augment them with a “what-if” analysis and advanced visualization. All of these factors have given rise to the most recent direction of Explainable AI (XAI). Augmenting AI with the facets of human centricity becomes indispensable. It is desirable that the models of AI are transparent so that the results being produced are easily interpretable and explainable. There have been a number of studies identifying opaque constructs of artificial neural networks (including deep learning) and stressing the centrality of ways of bringing the aspect of transparency to the developed constructs. To make the results interpretable and deliver the required facet of explainability, one may argue that the findings have to be delivered at a certain level of abstraction (casting them in some general perspective); subsequently, information granularity and information granules play here a pivotal role. Likewise, the explanation mechanisms could be inherently associated with the logic fabric of the constructs, which facilitates the realization of interpretation and explanation processes. These two outstanding features help carry out a thorough risk analysis associated with actionable decisions based on conclusions delivered by the AI system.

The volume provides the readers with a comprehensive and up-to-date treatise on the studies at the junction of the areas of XAI and Granular Computing. The chapters contributed by active researchers and practitioners exhibit substantial diversity, naturally reflecting the breadth of the area itself. The methodology, advanced algorithms, and case studies and applications are covered. XAI for process mining, visual analytics, knowledge, learning, and interpretation are among the highly representative trends in the area.


The applications to text classification, image processing, and prediction covered by several chapters are a tangible testimony to the recent advancements of AI.

We would like to take this opportunity to thank the authors for sharing their research findings and innovative thoughts. We would like to express our thanks to the reviewers whose constructive input and detailed comments were instrumental to the process of quality assurance of the contributions. We hope that this volume will serve as a timely and vital addition to the rapidly growing body of knowledge in AI and intelligent systems in general.

Witold Pedrycz (Edmonton, Canada)
Shyi-Ming Chen (Taipei, Taiwan)

Contents

Explainable Artificial Intelligence for Process Mining: A General Overview and Application of a Novel Local Explanation Approach for Predictive Process Monitoring
Nijat Mehdiyev and Peter Fettke

Use of Visual Analytics (VA) in Explainable Artificial Intelligence (XAI): A Framework of Information Granules
Bo Sun

Visualizing the Behavior of Convolutional Neural Networks for Time Series Forecasting
Janosch Henze and Bernhard Sick

Beyond Deep Event Prediction: Deep Event Understanding Based on Explainable Artificial Intelligence
Bukhoree Sahoh and Anant Choksuriwong

Interpretation of SVM to Build an Explainable AI via Granular Computing
Sanjay Sekar Samuel, Nik Nailah Binti Abdullah, and Anil Raj

Factual and Counterfactual Explanation of Fuzzy Information Granules
Ilia Stepin, Alejandro Catala, Martin Pereira-Fariña, and Jose M. Alonso

Transparency and Granularity in the SP Theory of Intelligence and Its Realisation in the SP Computer Model
J. Gerard Wolff

Survey of Explainable Machine Learning with Visual and Granular Methods Beyond Quasi-Explanations
Boris Kovalerchuk, Muhammad Aurangzeb Ahmad, and Ankur Teredesai

MiBeX: Malware-Inserted Benign Datasets for Explainable Machine Learning
Wayne Stegner, Tyler Westland, David Kapp, Temesguen Kebede, and Rashmi Jha

Designing Explainable Text Classification Pipelines: Insights from IT Ticket Complexity Prediction Case Study
Aleksandra Revina, Krisztian Buza, and Vera G. Meister

A Granular Computing Approach to Provide Transparency of Intelligent Systems for Criminal Investigations
Sam Hepenstal, Leishi Zhang, Neesha Kodagoda, and B. L. William Wong

RYEL System: A Novel Method for Capturing and Represent Knowledge in a Legal Domain Using Explainable Artificial Intelligence (XAI) and Granular Computing (GrC)
Luis Raúl Rodríguez Oconitrillo, Juan José Vargas, Arturo Camacho, Alvaro Burgos, and Juan Manuel Corchado

A Generative Model Based Approach for Zero-Shot Breast Cancer Segmentation Explaining Pixels’ Contribution to the Model’s Prediction
Preeti Mukherjee, Mainak Pal, Lidia Ghosh, and Amit Konar

Index

Explainable Artificial Intelligence for Process Mining: A General Overview and Application of a Novel Local Explanation Approach for Predictive Process Monitoring

Nijat Mehdiyev and Peter Fettke

Abstract: The contemporary process-aware information systems possess the capabilities to record the activities generated during process execution. To leverage these process-specific, fine-granular data, process mining has recently emerged as a promising research discipline. As an important branch of process mining, predictive business process management pursues the objective of generating forward-looking, predictive insights to shape business processes. In this study, we propose a conceptual framework sought to establish and promote understanding of the decision-making environment, the underlying business processes and the nature of the user characteristics for developing explainable business process prediction solutions. Consequently, with regard to the theoretical and practical implications of the framework, this study proposes a novel local post-hoc explanation approach for a deep learning classifier that is expected to facilitate the domain experts in justifying the model decisions. In contrast to alternative popular perturbation-based local explanation approaches, this study defines the local regions from the validation dataset by using the intermediate latent space representations learned by the deep neural networks. To validate the applicability of the proposed explanation method, the real-life process log data delivered by the Volvo IT Belgium incident management system are used. The adopted deep learning classifier achieves a good performance with an area under the ROC curve of 0.94. The generated local explanations are also visualized and presented with relevant evaluation measures, which are expected to increase the users’ trust in the black-box model.

Keywords: Explainable Artificial Intelligence (XAI) · Deep learning · Process mining · Predictive process monitoring · Granular computing

N. Mehdiyev (B) · P. Fettke
German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
e-mail: [email protected]

P. Fettke
e-mail: [email protected]

Saarland University, Saarbrücken, Germany

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_1


1 Introduction

In order to attain a competitive edge, enterprises are called upon to consistently and sustainably improve their abilities to accelerate time-to-market, to ensure the quality of products and services and to increase the scale of their business operations [1]. For these purposes, securing a robust integration of various business processes, as well as maintaining data consistency upon such incorporation, is the primary driver of success. As more and more processes are being equipped and instrumented with a wide variety of information systems and corresponding data sources, the ability to incorporate the rapidly increasing volume of heterogeneous data into decision-making processes has become an indispensable prerequisite for managing processes efficiently.

The prevalence of process-aware information systems facilitates the capturing of digital footprints which are generated throughout the various stages of business processes. Process mining has recently emerged as an established research discipline aiming at delivering useful insights by using fine-granular event log data delivered by enterprise information systems; the new discipline lies at the intersection of artificial intelligence, particularly data mining, and business process management [2]. Although the initial approaches of process mining have been primarily concerned with process analysis of a descriptive and diagnostic nature, such as the discovery of business processes or bottleneck analysis, a substantial shift towards prospective analysis has recently been observed. The process owners exhibit in particular a considerable interest in the opportunities generated by predictive business process management, one of the rapidly evolving process mining branches, because such intelligence enables deviations from the desired process execution to be detected proactively and establishes a basis for defining prompt intervention measures.

The adoption of advanced machine learning methods facilitates the delivery of robust, consistent, and precise predictions for the examined targets of interest by learning the complex relationships among process sequences, process-specific features, and business-case-related characteristics [3]. Nevertheless, due to their black-box nature, these techniques suffer notably in delivering appropriate explanations about their outcomes, internal inference process and recommended courses of action. Consequently, their utilization potential for predictive business process management is substantially impaired, since the validity and reliability of the models cannot be verified and the justification of individual model judgements cannot be ascertained, which leads to a lack of trust and reliance in the machine learning models. The recent approaches from the explainable artificial intelligence (XAI) research domain pursue the objective of tackling these issues by facilitating a healthy collaboration between the human users and artificial intelligent systems. Generating relevant explanations tailored to the mental models, technical and business requirements and preferences of the decision makers is assumed to alleviate the barriers to operationalizing data-driven business process intelligence.


In this manuscript, we aim to examine the applicability and implications of explainable artificial intelligence for process mining, particularly for predictive business process management, to lay the foundations for prescriptive decision analytics. The main contributions of this study are multi-faceted. As a first contribution, this study proposes a conceptual framework which is sought to guide researchers and practitioners in developing explanation approaches for predictive process monitoring solutions. Used as a guideline, it is expected to enable designers and developers to identify the objectives of the explanations, the target audience, the suitability of the examined methods for the given decision-making environment, and the interdependencies among all these elements. Approaching the development process of explanation systems in such a systematic manner is crucial, as previous research on explanations in expert systems suggests that the specifications regarding the explanation have to be considered during the early design of the intelligent systems, otherwise the quality of the generated explanations is likely to deteriorate considerably [4]. For instance, the generated explanations may not meet the end users’ requirements by being too complex, time-consuming, not necessarily relevant and imposing high cognitive loads on the users, which reduces their usability and may even lead to failure in the adoption of the underlying artificial advice givers [5]. For this purpose, this study attempts to provide an initial holistic conceptual framework for examining the decision-making environment and assessing the properties of the explanation situation before developing or adopting a specific explanation method for predictive process monitoring.

As a second contribution, this study instantiates the proposed framework for analyzing the context of the examined use case scenario and the characteristics of the underlying process data-driven predictive analytics to develop an appropriate explanation method. After defining the relevant explanation requirements, a novel local post-hoc explanation approach is proposed to make the outcomes of the adopted deep learning based predictive process monitoring method understandable, interpretable, and justifiable. One of the novelties in our proposed approach lies in the identification process of the local regions, since the extracted intermediate representations of the applied deep neural networks are used to identify the clusters from the validation data. Local surrogate decision trees are then applied to generate relevant explanations. By following this approach, the proposed explanation method pursues the goal of overcoming the shortcomings of popular perturbation-based local explanation approaches while also using the model information. Furthermore, this study incorporates findings from the cognitive sciences domain to make the generated explanations more reliable and plausible.

This study also presents the relevance of granular computing for explainable process predictions. The ultimate purpose of our proposed local post-hoc explanation approach is to make the results of the black-box algorithms comprehensible by investigating the small local regions (granules) of the response functions, since these regions are assumed to be linear and monotonic, which leads to more precise explanations [6]. Such a formulation of the explanation problem aligns completely with the idea of granular computing, which is defined as a set of theories, methodologies, tools and techniques for using granules such as classes, clusters, subsets, groups and intervals to develop an efficient computational model for the process of problem solving with extensive data and knowledge [7–9].


The information granules are created by grouping elements together by indistinguishability, similarity, proximity or functionality, which allows the underlying problem to be approached at different hierarchical levels [10–12]. The relevance of granular computing for supporting interactive and iterative decision-making processes and designing intelligent systems has also already been illustrated throughout numerous use cases [13, 14]. Furthermore, an overview of recent applications of granular computing in various domains such as forecasting time-series, manufacturing, concept learning, optimization, credit scoring etc. can be found in the study by [15], which also presented the necessity of granular computing for data analytics and introduced the main design pursuits. The implications of granular computing studies for conducting machine learning based data-driven analysis have also been recently discussed, applied and demonstrated [16–21]. Our study goes beyond earlier research on granular computing for conventional machine learning problems and positions the appropriateness of granular computing for explainable artificial intelligence by identifying relevant explanation granules with novel approaches and extracting comprehensible representations for each information granule.

The remainder of the manuscript is organized as follows. Section 2 provides an overview of background and related work on process mining, particularly by focusing on predictive business process management methods and the application of explainable artificial intelligence, especially for deep learning approaches. Section 3 introduces the framework which can be used to carry out explainable process prediction projects and provides a brief overview of chosen use cases. In Sect. 4, we present the proposed post-hoc local explanation approach for a business process prediction scenario and discuss its stages in detail. Section 5 introduces the examined use case from the incident management domain, discusses the details of the evaluation measures for the black-box, clustering and local surrogate models and finally highlights the obtained results and explanations. Section 6 discusses the scientific and practical implications, limitations, and future work. Finally, Sect. 7 concludes the study with a summary.

2 Background and Related Work

2.1 Process Mining

The adoption and deployment of advanced analytics approaches deliver sustained flexibility, precision, and efficiency in business processes, and thereby generate substantial added value along the entire value chain. Nevertheless, despite the fact that data-driven analytics has already been established as a capable and effective instrument and has found successful applications in diverse disciplines, such approaches rarely embrace a business process perspective throughout the information processing and inferencing cycle.


Process mining techniques have recently proven to provide a resilient mechanism to address this challenge by bridging the gap between data and process science. Various process-aware information systems such as Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), Workflow Management Systems (WMS), Manufacturing Execution Systems (MES), and Case and Incident Management Systems have the ability to log the process activities generated during process execution. The delivered event log contains information about the process traces representing the execution of process instances, which are also referred to as cases. A trace consists of a sequence of process activities which are described by their names, by timestamps and possibly by other information such as responsible people or units, if available. Typical process data is similar to sequence data, but due to branching and synchronization points it gets much more complex [22]. Various process mining methods such as process discovery, conformance checking, process enhancement, predictive business process management etc. use the event log data to generate important insights into business processes. The main purpose of process discovery is the automatic data-driven construction of business process models from the event logs [23]. Heuristics mining, genetic process mining, region-based mining, alpha, alpha+, inductive mining etc. are different algorithms that can be used to synthesize process models from log data [24]. Conformance checking pursues the objective of examining the real process behavior by comparing the process models with the event log of these processes. Various alignment strategies and fitness measures can be used to compare the observed processes and process models that are hand-made or discovered from the event data [25]. The main idea behind process enhancement is extending the a-priori process models by analyzing the event logs. Generating process improvement strategies by analyzing bottlenecks, service levels, throughput time etc. is a typical example of process enhancement [23].
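To make the structure of such an event log concrete, the following minimal sketch (in Python with pandas) reconstructs traces by grouping and time-ordering the recorded activities per case. The column names and activity labels are illustrative placeholders and are not taken from this chapter or from any particular system.

```python
import pandas as pd

# A toy event log: each row is one recorded activity of a process instance (case).
# Real systems (CRM, ERP, incident management, ...) export similar fields,
# often with additional resource or cost attributes.
events = pd.DataFrame({
    "case_id":   ["c1", "c1", "c1", "c2", "c2"],
    "activity":  ["Accepted", "Queued", "Completed", "Accepted", "Completed"],
    "timestamp": pd.to_datetime([
        "2021-01-04 09:00", "2021-01-04 09:30", "2021-01-05 11:00",
        "2021-01-04 10:00", "2021-01-04 16:45",
    ]),
})

# A trace is the time-ordered sequence of activities of one case.
traces = (
    events.sort_values("timestamp")
          .groupby("case_id")["activity"]
          .apply(list)
)
print(traces.to_dict())
# {'c1': ['Accepted', 'Queued', 'Completed'], 'c2': ['Accepted', 'Completed']}
```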

2.2 Predictive Business Process Management

Predictive business process management, also referred to as predictive process monitoring or business process prediction, is another branch of process mining that aims at predicting pre-defined targets of interest by using the activities from the running process traces [26]. The primary underlying concept of predictive business process management is the anticipation of the user-defined target of interest by using the process activities from the running cases. Several studies have been proposed to address predictive process monitoring focusing on classification problems such as:

– next event prediction [22, 27–34]
– business process outcome prediction [35–38]
– prediction of service level agreement violations [31, 39].

There are also various studies that address regression problems in the business process prediction domain:

– remaining time prediction [40–43]
– prediction of activity delays [44]
– risk prediction [45, 46]
– cost prediction [47].

These studies use different machine learning approaches such as decision trees [37], support vector machines [32], Markov models [28] and evolutionary algorithms [30], among others.
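For orientation, the sketch below shows one common way such prediction tasks are framed as supervised learning problems: every prefix of a completed trace becomes a training example. This prefix-based encoding is a generic pattern from the predictive process monitoring literature and is not the exact feature construction used later in this chapter.

```python
# Minimal sketch: turning partial (running) traces into supervised examples.
# For next-event prediction, each prefix of a trace is an input and the
# following activity is the label; outcome prediction instead attaches a
# case-level label (e.g. "violated SLA") to every prefix of that case.
def prefixes_with_next_event(trace):
    """Yield (prefix, next_activity) pairs from a completed trace."""
    for i in range(1, len(trace)):
        yield trace[:i], trace[i]

trace = ["Accepted", "Queued", "Completed"]
for prefix, nxt in prefixes_with_next_event(trace):
    print(prefix, "->", nxt)
# ['Accepted'] -> Queued
# ['Accepted', 'Queued'] -> Completed
```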

2.3 Deep Learning for Predictive BPM and XAI

In the light of the reported predictive process analytics experiments and corresponding results, it is conceivable to state that deep learning approaches provide superior results to alternative machine learning approaches. Various deep learning architectures such as

– deep feedforward neural networks [3, 27, 48, 49],
– convolutional neural networks (CNN) [50–54],
– long-short term memory networks (LSTM) [22, 33, 49, 51, 55–57], and
– generative adversarial nets [58]

have already been successfully implemented for different classification and regression tasks in predictive process monitoring problems. Although these advanced models provide more precise results compared to conventional white-box approaches, their operationalization in process mining applications suffers from their black-box nature. The lack of understandability and trust in these non-transparent models results in an enlarging gap between advanced scientific studies and their low adoption in practice. Recently there has been considerable interest in making deep learning models understandable by using different explanation techniques in other research domains. A systematic analysis by [59] provides a comprehensive overview of explanation methods for deep neural networks by categorizing them as techniques explaining how the data are processed, approaches explaining the representation of the data, and explanation-producing systems. In another study by [60], a brief overview of various local explanation approaches for deep learning methods such as Input–Output Gradient [61], Integrated Gradients [62], Guided Backpropagation [63], Guided Grad-CAM [64] and SmoothGrad [65] is provided and the sensitivity of these approaches is examined. A further comparative analysis of RISE, Grad-CAM and LIME approaches for making deep learning algorithms explainable can be found in [66].


An extensive literature analysis carried out by [67] focuses especially on visualization approaches by categorizing the identified studies into the following classes: node-link diagrams for network architecture, dimensionality reduction and scatter plots, line charts for temporal metrics, instance-based analysis and exploration, interactive experimentation, and algorithms for attribution and feature visualization. Although generating explanations for deep learning and other black-box approaches has already attracted increased attention in other domains, there are just a few approaches for explainable artificial intelligence in the process mining domain. The study by [68] proposed to generate causal explanations for deep learning-based process outcome predictions by using a global post-hoc explanation approach, partial dependence plots (PDP). The explanations are generated for process owners who would like to define long-term strategies like digital nudging by learning from the causal relationships among transactional process variables. The study by [69] introduced and illustrated the necessity of explainable artificial intelligence in a manufacturing process analytics use case. The authors of the article [70] applied the perturbation-based local explanation approach, LIME, to generate explanations for business process predictions. In a brief position paper, the relevance of explainable artificial intelligence for business process management is highlighted as well [71]. However, there is still much work to be done for explainable predictive process monitoring. To fill this gap, we first propose a framework which is expected to provide guidance for choosing appropriate explanation approaches, and we illustrate the applicability of the proposed novel local post-hoc explanation approach in a real-world use case.
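For orientation only, a perturbation-based local explanation of the kind applied in [70] can be produced with the lime library along the following lines. The classifier, data, feature names and class names are placeholders and do not reproduce the setup of the cited study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Placeholder data and model standing in for a process prediction classifier.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=["f1", "f2", "f3", "f4"],   # e.g. activity n-grams, durations
    class_names=["desired", "undesired"],
    mode="classification",
)

# LIME perturbs the instance, queries the black box on the perturbed samples
# and fits a sparse local linear model around the examined prediction.
exp = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(exp.as_list())
```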

3 A Framework for Explainable Process Predictions

Notwithstanding the fact that explainable artificial intelligence has just recently emerged as one of the major research disciplines in the artificial intelligence research domain, the need to generate explanations of intelligent systems is not a new phenomenon. Despite the fact that a considerable number of studies have already been carried out over the last three decades, extensive investigations reveal that most attempts at explaining black-box systems have not been entirely satisfactory due to their insufficient functionality to meet the users’ requirements of understandability. Explainability is a broad and vague notion, as it encompasses a wide range of dimensions and objectives. To a considerable extent, their perceived adequacy and appropriateness rely on the prevailing circumstances of the decision-making environment and the nature of the user characteristics. Due to the inherent socio-cognitive nature of the explanation process, it is necessary to ensure an intensive interaction of decision makers with the underlying intelligent systems and the produced explanations [72]. In this context it is essential to facilitate adequate in-depth guidance that enables the human audience to clearly comprehend the nature and the causes of the examined phenomenon.


To cope with such a multi-faceted challenge, it is essential to systematically approach the explanation generation process by considering several relevant perspectives in a comprehensive way. Recent studies imply that the ongoing research and development initiatives pursued in the XAI area to this day still overlook numerous factors [73]. Apart from the researcher’s intuition concerning the soundness and reasonability of an explanation, in many cases process-specific, human-related, organizational, and economic considerations as well as the inherent properties of explanations and problem settings are disregarded. Therefore, there is a need for a holistic framework that can be used as guidance for the design and development of explanatory systems. To overcome these challenges, we propose a conceptual framework by analyzing, combining and adapting the propositions from explainable artificial intelligence research to the process mining domain (see Fig. 1). Below we provide a thorough discussion of the framework elements (subject, objectives, techniques, generation time, outcomes) and present their close links and implications for each other by illustrating examples.

Subject: The nature of the explanations is significantly influenced by the group of people having different backgrounds, motives, beliefs, and expectations who may use them for different decision-making situations. A series of recent studies have merely concentrated on the implications of explanation recipients for choosing the proper explanation techniques. The study by [74], which is partially built on various suggestions from the literature [72, 75], defined six different types of subjects. These subject types include the system creators developing the machine learning systems, system operators, executors making decisions by using the system outputs, decision subjects for whom the decisions are made, the data subjects whose data are used for training the models, and examiners such as auditors.

Fig. 1 A conceptual framework for explainable process prediction


By conducting a more global analysis, the study by [76] investigated the main stakeholders for explainable artificial intelligence. According to their analysis, developers, theorists, ethicists and end-users are the main subject groups which use or contribute to the process of generating explanations for intelligent systems. Similar to the studies outlined above, the study by [67] organizes the subjects into non-mutually exclusive groups and defines three main categories: model developers and builders, model users, and non-experts. To carry out process mining projects, especially the ones with predictive analytics features, various stakeholders have to be involved. Data/process engineers are the main developers and knowledge engineers who are responsible for creating and productionizing the machine learning based predictive process analytics systems. The data/process analysts are the executors or model users who use the underlying systems to formulate and verify hypotheses, make decisions and give recommendations. The process owners are mainly non-experts in terms of machine learning, but with their specific knowledge about the processes they define the scope of the project and identify the technical and business success criteria. The domain experts who have deep expertise in the examined business processes may provide valuable feedback for both process engineers and analysts when carrying out their specific tasks. The technical experts are mainly responsible for a secure extraction of the process-specific data from the process-aware information systems and for providing an initial understanding of the relevant data fields. Supervisory boards and regulatory bodies also have their specific interests in the structure and content of process predictions, especially related to compliance and fairness issues.

Objectives: The objectives of machine learning explainability, which are mainly driven by the requirements of various subject groups, are multifaceted and have significant implications for the success of the analytics project. A systematic analysis of failed expert systems by [77] has revealed that the main reason for the failure was the inability of these systems to address the users’ objectives in demanding explanations. The findings by [78] suggest that the experience levels of the experts imply various considerations for the generated explanations. Novice users prefer justification-based terminological explanations, avoiding the explanation of multi-level reasoning traces which may impose high cognitive loads. According to [79], verification of the system capabilities and reasoning process, investigating the knowledge of the system (coined as duplication), and ratification, which aims to increase the trust of the end users in the underlying intelligent system, are the three main objectives. A recent study by [80], which conducted an extensive analysis of explainable AI tools and techniques, has identified multiple other objectives.
According to their findings, explanation objectives can be defined as explaining how the system works (transparency), helping the domain experts to make reasonable decisions (effectiveness), convincing the users to invest in the system (persuasiveness), increasing the ease of use (satisfaction), allowing the decision makers to learn from the system (education), helping the users to make fast decisions (efficiency), enabling the decision makers to detect the defects in the system (debugging) and allowing the user to communicate with the system (scrutability).


Techniques: Over time, an extensive literature has developed on explainable artificial intelligence methods. Various systematic survey articles can be found in the studies by [81–85]. The explanation approaches can be categorized in accordance with various criteria. In terms of model relation, the explanation approaches can be model-specific, implying that they are limited to the underlying black-box model, or model-agnostic, generating the relevant explanations independent of the used machine learning model. The model-specific approaches by nature generate intrinsic explanations which can be used for verification purposes, whereas the model-agnostic approaches have mainly a post-hoc explanation character and facilitate the users in justifying the models and their outcomes. An alternative way to classify these methods is to differentiate them according to their scope. Global interpretability approaches such as PDP, global surrogate models [81, 86], SHAP dependence plots [87] and Accumulated Local Effects (ALE) plots [88] pursue the objective of providing explainability of the model for the whole data set, whereas local explanation approaches such as Individual Conditional Expectation (ICE) plots [89], Shapley Additive Explanations (SHAP) [90] and LIME [91] enable the instances to be examined individually or within defined local regions. Furthermore, depending on the objectives of the explanations and user preferences, visual and textual explanations, analogy/case-based explanations, causal explanations and counterfactuals can also be generated.

Generation Time: The explanations can be generated before building the models, which is referred to as pre-model explanation, during the model building, which is called in-model explanation, and after building the underlying machine learning model (post-model explanation) [85]. The pre-model explanation approaches are strongly related to exploratory data analysis and include mainly visualization and graphical representation approaches. In-model methods, which are mainly model-specific intrinsic approaches, attempt to generate explanations during the training phase by testing e.g. constraint options. Finally, the post-model approaches are adopted once a strong predictive model is obtained and aim to generate explanations without examining the reasoning trace or inference mechanism of the black-box model.

Outcomes: Finally, the generated explanations can be used to gain understanding in terms of different outcomes. Process analysts, domain experts and process owners may use the generated explanations to understand various business process specific analyses. By using causal explanations generated for next event predictions, the users can identify the limitations related to the process and user behavior and consequently define corresponding intervention measures. Justifying the process outcome predictions may facilitate defining the course of action for enhancing the business processes. Making resource-specific predictions comprehensible would enable various resource allocation strategies to be generated more efficiently. Interpretability may allow the reasons for deviations in process performance indicators (time delays, increasing costs) to be examined. Furthermore, in order to realize the concept of trustworthy artificial intelligence, it is very important to validate the technical robustness of the machine learning systems, to verify that the underlying model fulfills the required privacy and safety requirements, and finally to ensure fairness, non-discrimination and diversity.


Potential Use Cases for Explainable Predictive Process Monitoring: It is essential to emphasize that these constructs should not be considered in an isolated manner since they have direct and contextually strong relationships. Figure 2 provides an overview of chosen use cases for predictive business process management which illustrates the links among these various dimensions of the analytics situation. In the first scenario, the target audience for explanations is defined as domain experts with a limited machine learning background who aim to use the explanations to justify the model outcomes for individual observations. These users have limited interest and capability to understand the inner working mechanisms of the adopted black-box models. Therefore, it is reasonable to develop relevant local post-hoc explanation approaches for them. In the second scenario, the process owners are provided with explanations that enable them to make more strategic decisions rather than focusing on each decision instance separately. Thus, it makes more sense to generate global post-hoc explanation solutions that facilitate the process owners in understanding the relationships among features and model outcomes for the whole dataset. Once the relevant predictive insights are generated, they can decide how to enhance processes or improve the products and services.

According to the third scenario, the process and data scientists aim to improve the accuracy of the applied black-box prediction approach. Since these stakeholders are mainly interested in the reasoning trace of the algorithms, the generated intrinsic explanations create more added value. Finally, the supervisory board or regulatory bodies are interested in understanding whether the adopted data-driven decision-making violates compliance or fairness criteria. For this purpose, it is reasonable first to provide the explanations for all model outcomes and then allow them to examine specific individual model decisions. It is also essential to note that these are just some chosen illustrative use cases and the list can easily be extended with various other stakeholders, objectives, outcomes etc. for the target audience discussed above. Furthermore, in many cases the combination of various explanation approaches provides a more comprehensible and precise tool by facilitating the users to examine the process predictions from different perspectives. However, for this goal it is crucial to ensure smooth transitions among the various explanation types, since misleading explanations are worse than no explanation at all [4].

Fig. 2 Chosen use cases for explainable process predictions (subject; objective; technique; generation time; outcome):
Use Case 1: Domain Experts; Trust; Local Post-Hoc Explanation; Post-Model; Justification of Individual Model Outcomes
Use Case 2: Process Owners; Product and Process Enhancement; Global Post-Hoc Explanation; Post-Model; Investigating Features that Lead to Bottlenecks
Use Case 3: Process/Data Scientists; Verification/Duplication; Intrinsic Explanation; In-Model; Enhancing Machine Learning Model Performance
Use Case 4: Supervisory Board/Regulatory Body; Compliance/Fairness; Local and Global Post-Hoc Explanation; Pre-Model and Post-Model; Examining Compliance and Fairness Violations

4 A Novel Local Post-Hoc Explanation Method

In our study, we propose a novel explainable process prediction approach after identifying the key requirements and elements by using the conceptual framework introduced in Sect. 3. The ultimate objective of the proposed method is defined as developing an explainable deep learning model for identifying deviations from the desired business processes by using a real-world use case (see Sect. 5.1). For this purpose, it is important to design an explanation technique targeting end users such as domain experts or process owners rather than knowledge engineers. On these grounds, it is reasonable for us to follow the ratification/justification objective of machine learning interpretability by explaining why the results provided by the deployed deep learning approach are reliable (see Use Case 1 in Fig. 2). Therefore, we propose a post-hoc local explanation which uses a novel technique for identifying the local regions. By using the generated local explanations, the domain experts are expected to understand the process behavior resulting in undesired outcomes and to justify the model decisions.

Figure 3 presents an overview of the proposed approach. After preparing the process event data by extracting the relevant process-specific features and n-gram representations of the process transitions and defining the target of interest, the deep learning model is trained to learn the complex relationships among them. The trained black-box model does not only deliver precise prediction outcomes but also extracts useful representations by performing multiple layers of non-linear transformations. The intermediate latent space representations obtained from the last hidden layer of the network are then used as input variables to the k-means clustering approach with the purpose of defining the local regions. At the final stage, by using the original input variables and the prediction scores of the deep learning approach, individual local surrogate decision trees are trained for the identified clusters to approximate the behavior of the main black-box classifier in that locality. The learned representations of this comprehensible surrogate model, namely the decision trees and the consequently extracted rules, are then provided to the domain experts who aim to justify the decision for individual instances.


Fig. 3 Stages of the proposed local post-hoc explanation method

In the following subsections, a discussion of the details of each method used is provided.
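A compact sketch of the three stages just described is given below, using Keras and scikit-learn as stand-in components. The toy data, layer sizes, number of clusters and tree depth are assumptions made purely for illustration; the deep learning configuration actually used in the study is described in Sect. 4.1.

```python
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the engineered process features (n-grams, durations, ...).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 12)).astype("float32")
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype("int32")

# 1) Black-box classifier: a small fully connected network (placeholder sizes).
inputs = tf.keras.Input(shape=(12,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)
h = tf.keras.layers.Dense(16, activation="relu", name="neural_codes")(h)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=10, batch_size=64, verbose=0)

# 2) "Neural codes": activations of the last hidden layer, used instead of the
#    raw feature space to define local regions via k-means.
encoder = tf.keras.Model(inputs, model.get_layer("neural_codes").output)
codes = encoder.predict(X, verbose=0)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(codes)

# 3) One interpretable surrogate per cluster, trained on the ORIGINAL features
#    against the black-box predictions it should mimic in that local region.
black_box_labels = (model.predict(X, verbose=0).ravel() > 0.5).astype(int)
for c in range(4):
    idx = clusters == c
    surrogate = DecisionTreeClassifier(max_depth=3).fit(X[idx], black_box_labels[idx])
    fidelity = surrogate.score(X[idx], black_box_labels[idx])
    print(f"cluster {c}: fidelity to black box = {fidelity:.2f}")
    print(export_text(surrogate, feature_names=[f"x{i}" for i in range(12)]))
```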

4.1 Binary Classification with Deep Learning

To generate plausible explanations, it is very important to ensure that a good predictive model is obtained. Considering their ability to address discriminative tasks for different problem types from various domains, and particularly for business process prediction problems, in our study we adopt a deep learning method as our black-box classifier. For the examined binary classification problem, we use a fully connected deep neural network [92]. The stochastic gradient descent (SGD) optimization method was adopted to minimize the loss function, and a lock-free parallelization scheme was used to avoid the issues related to this approach. The uniform adaptive option was chosen as the weight initialization scheme. To take advantage of both learning rate annealing and momentum training and thereby avoid slow convergence, the adaptive learning rate algorithm ADADELTA was used. Furthermore, we used the dropout technique (with rectifier activations) and an early stopping metric as regularization approaches to prevent overfitting in the underlying deep feedforward neural network. The dropout ratio for the input layer was defined as 0.1, whereas this value


was set to 0.3 for the hidden layers. The area under the ROC curve (AUROC) was selected as the stopping metric, for which the relative stopping tolerance was defined as 0.01 and the number of stopping rounds as 10.
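To make this configuration concrete, the following sketch shows how such a classifier could be set up with the H2O deep learning implementation referenced in [92]. The file name, target column, and layer sizes are illustrative assumptions; the dropout ratios, initialization, adaptive learning rate, and stopping settings mirror the values reported above.

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Hypothetical feature table extracted from the event log; the target column
# marks whether the case exhibits the undesired (push-to-front) outcome.
frame = h2o.import_file("incident_features.csv")
frame["push_to_front"] = frame["push_to_front"].asfactor()
predictors = [c for c in frame.columns if c != "push_to_front"]
train, valid = frame.split_frame(ratios=[0.8], seed=42)

clf = H2ODeepLearningEstimator(
    hidden=[64, 64],                        # assumed layer sizes (not reported in the text)
    activation="RectifierWithDropout",      # rectifier units with dropout regularization
    input_dropout_ratio=0.1,                # dropout ratio for the input layer
    hidden_dropout_ratios=[0.3, 0.3],       # dropout ratio for the hidden layers
    initial_weight_distribution="UniformAdaptive",
    adaptive_rate=True,                     # ADADELTA adaptive learning rate
    stopping_metric="AUC",                  # early stopping on the AUROC
    stopping_tolerance=0.01,
    stopping_rounds=10,
    seed=42,
)
clf.train(x=predictors, y="push_to_front",
          training_frame=train, validation_frame=valid)
print(clf.auc(valid=True))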

4.2 Local Region Identification by Using Neural Codes

Our proposed approach aims to generate post-hoc explanations, particularly local explanations. It has been reported in the literature that there are two main types of local explanation approaches [93, 94]. Model-specific approaches such as saliency masks attempt to explain solely neural networks, whereas model-agnostic approaches such as LIME and SHAP generate explanations independently of the underlying black-box models. Although these perturbation-based approaches have recently gained increased attention in the community and have found applications for various problems ranging from text classification to image recognition, they have been criticized for several issues. A recent study by [93], which investigated the robustness of local explanation approaches, revealed that model-agnostic perturbation-based approaches are prone to instability, especially when used to explain non-linear models such as neural networks. Another study by [95] also concluded that, in addition to high sensitivity to a change in an input feature, the challenges in capturing the relationships and dependencies between variables are further issues that linear models suffer from in local approximation. A further limitation of these perturbation-based approaches is the random generation of instances in the vicinity of the examined sample, which does not consider the density of the black-box model around these neighboring instances [94]. In short, the literature pertaining to perturbation-based approaches strongly suggests that the local region should be identified and examined carefully to deliver stable and consistent explanations. To address the issues outlined above, various alternative approaches have been proposed. The study by [94] used genetic algorithms to identify the neighboring instances in the vicinity of the examined instance, followed by fitting local surrogate decision trees. The study by [96] proposed an approach that defines the local regions by fitting surrogate model-based trees and generating reason codes through linear models in the local regions defined by the obtained leaf node paths. Similar to this approach, the K-LIME approach proposed by [97] attempts to identify the local regions by applying the k-means clustering approach to the training data instead of to perturbed observation samples. For each cluster, a linear surrogate model is trained, and the obtained R2 values are used to identify the optimal number of clusters. One of the main criticisms of this approach is that the adopted clustering approach does not use the model information, which leads to issues in preserving the underlying model structure [96]. In this study, we use a technique similar to K-LIME; however, we follow a slightly different approach. The major difference lies in the identification procedure for the clusters, where we do not use the original feature space as input to the k-means clustering approach but a non-linearly transformed feature space. Our method is inspired by


the approach proposed by [98], which attempts to generate case-based explanations for non-case-based machine learning approaches. The main idea behind this technique is to use the activations of the hidden units of the learned neural network during a discrimination task as a distance metric to identify similar cases or clusters from the model's point of view. The findings in [98] imply that using the Euclidean distance on the original input space to identify the locality, as performed in the K-LIME approach, can be useful for explaining the training set to users. However, this approach fails to deliver plausible explanations regarding the trained model since it does not capture how the input variables are processed by the model. More specifically, in our approach, in addition to its discrimination task, we use the deep neural network as a feature extractor by using the learned intermediate representations from the last hidden layer of the network as input to the unsupervised partitioning approach. Previous studies have shown that using the latent space is a promising approach for neighborhood identification, and non-linear transformation approaches in particular deliver promising results [99–101]. It is also worth mentioning that even though deep learning models such as deep autoencoders can be used to extract the latent space in an unsupervised fashion, we use the learned representation from the network that was trained to classify the business processes, which is more relevant for generating the explanations as it preserves the structure of the black-box model in use.
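A minimal sketch of this local-region identification step, continuing the hypothetical H2O setup from Sect. 4.1, is given below; deepfeatures exposes the activations of a chosen hidden layer (the neural codes), which are then partitioned with k-means. The layer index and the number of clusters are illustrative assumptions.

from h2o.estimators import H2OKMeansEstimator

# Activations of the last hidden layer ("neural codes") for the validation data;
# layer indices are zero-based, so layer=1 addresses the second of the two assumed hidden layers.
codes = clf.deepfeatures(valid, layer=1)

# Partition the non-linearly transformed feature space into local regions.
kmeans = H2OKMeansEstimator(k=27, seed=42)   # k as reported in Sect. 5.3.2; in general it is
kmeans.train(training_frame=codes)           # tuned, e.g., by the average local accuracy

# One cluster label per validation instance (column "predict"), used to fit local surrogates.
cluster_ids = kmeans.predict(codes)
valid_with_clusters = valid.cbind(cluster_ids)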

4.3 Local Surrogate Model

Once the local regions (clusters) are defined, at the last stage of our proposed approach we fit a local surrogate decision tree model. Also referred to as emulation, model distillation, or model compression, the main idea of surrogate modeling is to approximate the behavior of the black-box model accurately by using white-box models and to provide explanations through their interpretable representations. More specifically, in our case we use the deep learning model to generate predictions for the validation data. Following this, for each cluster we train a decision tree by using the original variables from the validation set as input data and the prediction scores of the deep learning model as output data. The learned decision paths and extracted rules provide cluster-specific local explanations. To evaluate the approximation capability of the surrogate decision tree, we calculate the R2 value for the local surrogate decision tree in each cluster. The position paper by [102] discussed the advantages and shortcomings of various white-box approaches, including decision trees, classification rules, decision tables, nearest neighbors, and Bayesian network classifiers, by comparing their comprehensibility in user-based experiments. More specifically, the study by [103] compared the predictive performance of various comprehensible rule-based approaches such as RIPPER, PART, OneR, RIDOR, etc. It is worth mentioning that any of these alternative comprehensible, white-box approaches could be adopted instead of the decision trees used in the underlying study.
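The sketch below illustrates this surrogate step under the same assumptions as before: the validation features, the black-box prediction scores, and the cluster labels are pulled into pandas, and one shallow regression tree per cluster is fitted to the deep learning scores; the R2 of each tree quantifies how well it approximates the black-box model in that locality. The tree depth and the probability column name are assumptions.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Black-box prediction scores for the positive class on the validation set
# (the probability column name depends on the factor levels of the target).
scores = clf.predict(valid).as_data_frame()["p1"]
X = valid[predictors].as_data_frame()
clusters = cluster_ids.as_data_frame()["predict"]

surrogates, local_r2 = {}, {}
for c in sorted(clusters.unique()):
    mask = clusters == c
    # A shallow tree keeps the extracted decision paths and rules comprehensible.
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    tree.fit(X[mask], scores[mask])
    surrogates[c] = tree
    # R2 of the surrogate against the black-box scores in this local region.
    local_r2[c] = r2_score(scores[mask], tree.predict(X[mask]))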


5 Experiment Setting

5.1 Use Case: Incident Management

The applicability of the proposed local explainable artificial intelligence approach is examined for an incident management use case by analyzing real-world process log data delivered by Volvo IT Belgium's incident system called VINST [104]. The main purpose of incident management processes is to ensure the quality of normal service operations within the Service Level Agreements (SLA) by mitigating potentially adversarial impacts and risks of incidents to the continuity of the service operations. Once the process owners verify that the intended level of the services has been attained, the incidents can be resolved. In case of a suspected resurgence of such incidents, a problem management process should be undertaken with the aim of ascertaining their root causes and adopting the corresponding corrective and preventive procedures. By using various information systems, the experts from different service lines perform their specific tasks to avoid the disruption to the business caused by incidents of various characteristics. One such dimension is the impact criticality of an incident, which is measured by the magnitude of the influence on the deterioration of the service levels and the number of concerned parties. Major impact incidents include those that disrupt plant production or the business processes for designing, ordering, and delivering the vehicles, or that negatively influence the cash flow and even the public image. Examples of high impact incidents are server incidents that result in customer loss or infrastructural incidents that may trigger production losses. In case the incidents only affect a limited proportion of the clientele and have no negative consequences on their service provision capabilities, or only partially hamper internal IT services, they can be characterized as medium impact incidents. Finally, low impact incidents affect only a small number of customers and do not impede functionality and performance if the necessary measures are followed. In accordance with the specific content and criticality of the incidents, three main service lines take various actions. The first line includes the service desk, which is the first point of contact providing efficient and expeditious solutions for customers to restore normal services, and the expert desk, where application-specific assistance is carried out. Incidents that cannot be handled by the first line staff are processed by the second line. The professionals on the third line are specialized in their product domain and may also operate as a support unit. Similarly, when incidents remain unresolved by the second line, the experts of the third line will endeavor to resolve them. The company pursues the objective and strategic vision that most of the incidents are to be addressed and successfully completed by the teams of the first service line. By attaining this objective, the efficiency of the work can be increased significantly. However, the process owners have noticed an improper usage of the "push-to-front" mechanism, which refers to the transfer of incidents to the second and third service lines that could easily have been solved in the first line.


Since such an overload of support activities, which are not the main task of the experts in the second and third service lines, causes obstructions in the running of their core business activities, it is essential to examine the dynamics of such "push-to-front" processes. To address this challenge, this study aims to build an efficient prediction approach and to extract corresponding explanations, which may serve as a basis for generating proactive intervention measures. The proposed local explanation approach is expected to increase transparency by extracting easily interpretable rules of low complexity and to improve the reliability of the underlying black-box model by making each prediction decision explainable. The level of trust in the black-box machine learning system increases in case the extracted local decision trees and rules conform to domain knowledge and reasonable expectations [97].

5.2 Evaluation Measures

In this section, a brief overview of the evaluation procedures and measures for the different stages of the proposed approach is given. After introducing the binary classification evaluation measures, which are required to assess the performance of the deep neural network model, we briefly discuss how the fitness of the obtained clusters and the approximation capability of the adopted local surrogate decision trees can be measured. The predictive performance of the deep learning classifier for the examined classification problem can be assessed by calculating the number of correctly identified class instances (true positives), the number of correctly recognized instances that are not members of the class (true negatives), the number of observations that are wrongly recognized as class instances (false positives), and the number of examples that were not identified as class members (false negatives). By using these measures, a confusion matrix, also referred to as a contingency table, is constituted (see Table 1). To examine the predictive strength of the adopted classifier for predictive process monitoring, the binary classification evaluation measures are computed (see Table 2). Although these single-threshold classification evaluation measures provide useful insights into the model capabilities, a challenging problem arises in the identification of the correct threshold. For this purpose, it is important to define the cost function that should be optimized at the chosen threshold, which in turn requires a careful consideration of the features and dimensions related to the decision-making mechanism. In the examined incident management use case, the domain experts are interested in identifying the potential push-to-front activities more accurately, but without sacrificing the general performance of the classifier. To address these requirements, we identify the threshold where the misclassification error is minimized across both classes equally. In addition to the single-threshold measures, the area under the ROC curve is also calculated and visualized. This threshold-free metric not only gives more precise information about the model performance but can also guide the selection of the threshold more interactively. To assess the goodness and validity of the performed k-means clustering, a sum-of-squares-based ratio is calculated. The ratio of the sum-of-squares between clusters (SSBC) to the total sum-of-squares (i.e., SSBC plus the sum-of-squares within clusters, SSWC) measures the share of the total variance in the data that is explained by the performed clustering (see Table 3). Finally, to estimate the goodness of fit of the local surrogate model and to evaluate how well it can approximate the behavior of the global black-box model in the identified cluster locality, the R2 measure is calculated (see Table 4).

Table 1 Confusion matrix

                      Ground truth: Positive    Ground truth: Negative
Predicted: Positive   True positive (tp)        False positive (fp)
Predicted: Negative   False negative (fn)       True negative (tn)

Table 2 Binary classification evaluation measures

Accuracy              (tp + tn) / (tp + tn + fp + fn)
Precision             tp / (tp + fp)
Recall                tp / (tp + fn)
Specificity           tn / (tn + fp)
MCC                   (tp * tn - fp * fn) / sqrt((tp + fp)(tp + fn)(tn + fp)(tn + fn))
F1-Measure            2tp / (2tp + fp + fn)
False Negative Rate   fn / (fn + tp)
False Positive Rate   fp / (tn + fp)

Table 3 Clustering evaluation measures

Sum-of-squares between clusters (SSBC):  SSBC = \sum_{i=1}^{M} n_i \lVert c_i - \bar{x} \rVert^2
Sum-of-squares within clusters (SSWC):   SSWC = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2

where X = {x_1, ..., x_N} represents the data set with N D-dimensional points, \bar{x} = \sum_{i=1}^{N} x_i / N is the center of the entire data set, C = {c_1, ..., c_M} are the cluster centroids with M the number of clusters, n_i is the number of points assigned to cluster i, and p_i denotes the cluster to which x_i belongs.

Table 4 Surrogate model evaluation measure

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}

where y_i represents the prediction score delivered by the local surrogate model for observation i, \hat{y}_i denotes the black-box prediction score for instance i, and \bar{\hat{y}} is the mean of the black-box prediction scores.
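As an illustration of the threshold selection described in this section, the following sketch (not the authors' code) computes the confusion-matrix counts from labels and scores and sweeps candidate thresholds to find the one at which the false positive and false negative rates are most nearly equal, i.e., at which misclassification is minimized as equally as possible across both classes. The arrays y_true and scores are assumed inputs.

import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Confusion-matrix counts of Table 1 for a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn

def balanced_threshold(y_true, scores, candidates=None):
    """Threshold at which the false positive and false negative rates are closest."""
    candidates = np.linspace(0.01, 0.99, 99) if candidates is None else candidates
    best_t, best_gap = None, float("inf")
    for t in candidates:
        tp, tn, fp, fn = confusion_counts(y_true, scores, t)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        fnr = fn / (fn + tp) if (fn + tp) else 0.0
        if abs(fpr - fnr) < best_gap:
            best_t, best_gap = t, abs(fpr - fnr)
    return best_t

# The single-threshold measures of Table 2 follow directly from the four counts.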

5.3 Results

5.3.1 Classification with Deep Learning

The examined process instances are randomly split in a ratio of 80:20 into a training set and a validation set. This section reports the evaluation results obtained on the validation set using the deep neural network model. The area under the ROC curve (AUROC) is presented in Fig. 4. The applied black-box model achieves a remarkable performance with an AUROC of 0.94. To carry out a more detailed investigation, we further compute several single-threshold binary classification assessment measures at the predefined threshold of 0.9119 (see Table 5). The results achieved at this threshold, which minimizes the misclassification across both classes equally, suggest that the push-to-front processes can be identified effectively and in a timely manner.

Fig. 4 Area under the ROC curve


Table 5 Binary classification evaluation results

Measure       Value      Measure               Value
Accuracy      0.8935     MCC                   0.4704
Precision     0.9944     F1-Measure            0.9412
Recall        0.8934     False Negative Rate   0.1066
Specificity   0.8946     False Positive Rate   0.1054

5.3.2 Local Explanations with Surrogate Decision Trees

The optimal number of clusters is identified by maximizing the average local accuracy of the deep learning model in the clusters at the 0.9119 threshold. In the latest run (for which we also present the local results), the number of clusters was defined as 27. The ratio of the between-cluster sum of squares to the total sum of squares of 0.915 indicates a good fit, meaning that 91.5% of the total variance in the data is explained by the clustering. After identifying the clusters, the decision trees are fitted locally. In order to enable the users to judge the validity of the generated explanations, they should be provided with various pieces of predictive analytics information, including the prediction scores of the global deep learning model and of the surrogate decision tree for the examined instance, the R2 value of the local surrogate model, the ground truth label, and the obtained decision tree paths. With the following example, we introduce the explanation for a randomly chosen true negative prediction together with the relevant statistical properties. Figure 5 presents the explanations for the chosen instance from cluster 7, which was correctly classified as having a following push-to-front activity.
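The sketch below illustrates how such a selection of the number of clusters can be carried out; it assumes numpy arrays of neural codes, black-box scores, and ground-truth labels for the validation set (codes_np, scores_np, y_true_np are hypothetical names), and, for each candidate k, averages the per-cluster accuracy of the thresholded deep learning predictions. The candidate range and the score orientation are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def average_local_accuracy(codes, scores, y_true, k, threshold=0.9119, seed=0):
    """Mean per-cluster accuracy of the thresholded black-box predictions."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(codes)
    y_pred = (scores >= threshold).astype(int)
    per_cluster = [np.mean(y_pred[labels == c] == y_true[labels == c]) for c in range(k)]
    return float(np.mean(per_cluster))

# Pick the number of local regions that maximizes the average local accuracy.
candidates = range(5, 41)
accuracy_by_k = {k: average_local_accuracy(codes_np, scores_np, y_true_np, k) for k in candidates}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)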

Fig. 5 Local decision tree with relevant explanation information


Supplementary explanation information

Cluster number:                                                     7
Instance number:                                                    18
R2 value (for the local surrogate model in the examined cluster):   0.908
Deep learning prediction:                                           0.290
Surrogate tree prediction:                                          0.267
Prediction:                                                         Push-to-Front
Ground truth label:                                                 Push-to-Front

The applied deep learning approach confidently classifies the instance by assigning a prediction score of 0.290 to the class "push-to-front", which is significantly below the predefined classification threshold (0.9119) that is supposed to optimize the selected cost function. The local surrogate decision tree model delivers a similar prediction score of 0.267 for the examined instance, suggesting an acceptable approximation. Furthermore, the surrogate model has an R2 value of 0.908 in the cluster to which the examined instance belongs, which suggests that it is strongly capable of approximating the behavior of the applied deep learning model in that local region. After verifying the validity of the global black-box prediction and the suitability of the local surrogate model, the learned representations of the white-box model, namely the obtained decision paths and the correspondingly extracted rules, are used for explanation purposes. The examined instance follows the Left-Left-Left-Left path of the decision tree illustrated in Fig. 5. The extracted rule is as follows:

IF "Accepted-In Progress-...-Accepted-In Progress" is less than 1 AND "duration since start" is greater than 169 seconds AND "Impact" is Medium AND "duration since previous event" is less than 376 seconds THEN the prediction of the surrogate model is 0.267.

Furthermore, it is very important to incorporate the findings of the cognitive sciences into explanation models and interfaces to eliminate inherent cognitive biases. The study by [105] has empirically revealed that showing the confidence of the extracted rules leads to an increased plausibility of the explanation. Considering this suggestion, we present the confidence of each rule to the users. For the examined instance, the rule introduced above has a confidence of 0.60.
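To make the rule extraction concrete, the helper below (a sketch based on the scikit-learn tree API rather than the authors' implementation) walks the decision path of a fitted surrogate tree for a single instance and returns the corresponding IF-THEN rule; the rule confidence shown to the users would be computed separately from the instances that satisfy the rule. The usage line at the end refers to the hypothetical objects introduced in the earlier sketches.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def extract_rule(tree: DecisionTreeRegressor, feature_names, x: np.ndarray) -> str:
    """Return the IF-THEN rule that a fitted surrogate tree applies to instance x."""
    t = tree.tree_
    node_ids = tree.decision_path(x.reshape(1, -1)).indices  # nodes from root to leaf
    conditions = []
    for node in node_ids[:-1]:                                # split nodes only, leaf excluded
        name = feature_names[t.feature[node]]
        threshold = t.threshold[node]
        operator = "<=" if x[t.feature[node]] <= threshold else ">"
        conditions.append(f'"{name}" {operator} {threshold:.3f}')
    prediction = t.value[node_ids[-1]][0][0]                  # mean black-box score in the leaf
    return "IF " + " AND ".join(conditions) + f" THEN surrogate prediction = {prediction:.3f}"

# Hypothetical usage for the examined instance of cluster 7:
# print(extract_rule(surrogates[7], predictors, X.iloc[18].to_numpy()))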

6 Discussion

This study aims to design an overarching framework which can be used to develop explainable process prediction solutions that take into account various aspects of


the decision-making environment and the characteristics of the underlying business processes. For this purpose, an extensive analysis of the relevant studies from the explainable artificial intelligence domain was performed, and the relevant findings were examined, combined, and customized for process mining applications. Although the proposed conceptual framework provides a solid foundation and a preliminary basis for a more sophisticated understanding of the requirements for developing explanation methods and interfaces, future studies could fruitfully explore this issue further by addressing other dimensions. The relevant activities in this context should be devoted, e.g., to defining the relevant evaluation mechanisms, procedures, measures, and methods. For this purpose, it is crucial first to define the properties of the generated explanations, such as fidelity, consistency, stability, comprehensibility, certainty, importance, novelty, and representativeness, by exploring the relevant literature [85, 106]. Furthermore, various taxonomies for interpretability evaluation, such as the one proposed by [107], which includes application-grounded evaluation with real humans and real tasks, human-grounded evaluation with real humans and simplified tasks, and functionally-grounded evaluation with no humans and proxy tasks, should be positioned adequately in the proposed framework. Finally, providing an overview of the measurement types used in evaluation user studies, such as subjective perception questionnaires, explanation exposure delta, correct choices, learning score, domain-specific metrics, choice of explanation, interest in alternatives, recommendation acceptance likelihood, etc., would be beneficial for system developers in structuring their evaluation procedures [80]. After presenting the details, we instantiated the framework for developing an explainable process prediction approach for domain experts by using the data from the real-world incident management use case. For this purpose, a novel local post-hoc explanation approach was proposed, applied, and evaluated. Compared to alternative perturbation-based local explanation techniques that suffer from high computational costs and instabilities, we proposed to identify the local regions from the validation dataset by using the learned intermediate latent space representations delivered by the adopted deep feedforward neural network during the underlying classification task. The idea of using the learned representations can not only be extended to alternative deep learning architectures but can also easily be adapted to alternative black-box approaches such as random forests or gradient boosting. For these ensemble tree-based approaches, the leaf node assignments can be used as input to the clustering approaches. In the proposed approach, the decision tree was used as a surrogate model to approximate the behavior of the deep learning classifier. However, alternative comprehensible methods such as general linear models and various rule-based models can be implemented as well. Furthermore, it is worth mentioning that the presentation of the learned rules or reason codes, which are intended to explain the model decisions, can introduce various cognitive biases. For this reason, the findings from the cognitive sciences should be applied to develop appropriate debiasing techniques with the aim of increasing the quality and plausibility of the explanations [105].
This study mainly examined in depth the applicability of a local explanation approach, which enables the domain experts to justify individual model decisions.


Such explanations are important to understand the process behavior and to validate the judgment capabilities of the black-box approaches. The local explanation approach can also be used for alternative purposes, such as understanding the reasons for discrimination once it has been detected by using the relevant statistical measures. Alternatively, global explanation approaches can be developed, which are mainly suitable for generating long-term strategies for enhancing business processes.

7 Conclusion

In this study we proposed a novel local post-hoc explanation approach for the predictive process monitoring problem to make the adopted deep learning method explainable, justifiable, and reliable for the domain experts. To ensure the suitability of this approach for the examined incident management use case, we also incorporated the relevant information from the conceptual framework proposed in this manuscript into the explanation generation process. Since the strong predictive performance of the adopted black-box model is a crucial prerequisite for the meaningfulness of the post-hoc explanations, we first evaluated our classifier by using both threshold-free and single-threshold evaluation measures. The area under the ROC curve of 0.94 and the accuracy of 0.89 at the threshold where the misclassification in both classes is minimized equally suggest that a well-performing model was obtained. After validating the performance of the model, we introduced the explanations by presenting the local surrogate decision trees in the identified clusters and by showing relevant measures such as the R2 of the local model, the prediction scores from both the black-box and the surrogate model, and the confidence of the rule. By proposing a general framework for carrying out explainable process analytics and illustrating its applicability with a specific use case, we have attempted to emphasize the importance and relevance of explainable artificial intelligence for process mining applications.

Acknowledgements This research was funded in part by the German Federal Ministry of Education and Research under grant numbers 01IS18021B (project MES4SME) and 01IS19082A (project KOSMOX).

References

1. Fettke, P., Mayer, L., Mehdiyev, N.: Big-Prozess-Analytik für Fertigungsmanagementsysteme (MES). In: Steven, M., Klünder, T. (eds.) Big Data: Anwendung und Nutzungspotenziale in der Produktion, pp. 215–239. Kohlhammer, Stuttgart (2020) 2. van der Aalst, W.: Process Mining: Overview and Opportunities. ACM Trans. Manag. Inf. Syst. 3 (2012). 3. Mehdiyev, N., Evermann, J., Fettke, P.: A Novel Business process prediction model using a deep learning method. Bus. Inf. Syst. Eng., 1–15 (2018)


4. Swartout, W.R., Moore, J.D.: Explanation in second generation expert systems. In: Second Generation Expert Systems, pp. 543–585. Springer, Berlin, New York (1993) 5. Sørmo, F., Cassens, J., Aamodt, A.: Explanation in case-based reasoning—perspectives and goals. Artif. Intell. Rev. 24, 109–143 (2005) 6. Hall, P., Kurka, M., Bartz, A.: Using H2O Driverless AI Interpreting Machine Learning with H2O Driverless AI (2017). http//docs.h2o.ai/driverless-ai/lateststable/docs/booklets/MLIBooklet.pdf 7. Bargiela, A., Pedrycz, W.: The roots of granular computing. In: 2006 IEEE International Conference on Granular Computing, pp. 806–809. (2006) 8. Yao, Y.Y.: Granular Computing: Basic Issues and Possible solutions. In: Proceedings of the 5th Joint Conference on Information Sciences, pp. 186–189. Citeseer (2000) 9. Pedrycz, W., Skowron, A., Kreinovich, V.: Handbook of Granular Computing. Wiley (2008) 10. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst. 90, 111–127 (1997) 11. Chen, Y., Yao, Y.: Multiview intelligent data analysis based on granular computing. In: IEEE International Conference on Granular Computing. pp. 281–286 (2006) 12. Yao, Y.: A triarchic theory of granular computing. Granul. Comput. 1, 145–157 (2016) 13. Pedrycz, W., Chen, S.-M.: Granular Computing and Decision-Making: Interactive and Iterative Approaches. Springer (2015) 14. Pedrycz, W., Chen, S.-M.: Granular Computing and Intelligent Systems: Design with Information Granules of Higher Order and Higher Type. Springer Science & Business Media (2011) 15. Pedrycz, W.: Granular computing for data analytics: a manifesto of human-centric computing. IEEE/CAA J. Autom. Sin. 5, 1025–1034 (2018) 16. Yao, J.T., Yao, Y.Y.: A granular computing approach to machine learning. FSKD. 2, 732–736 (2002) 17. Bargiela, A., Pedrycz, W.: Toward a theory of granular computing for human-centered information processing. IEEE Trans. Fuzzy Syst. 16, 320–330 (2008) 18. Su, R., Panoutsos, G., Yue, X.: Data-driven granular computing systems and applications. Granul. Comput., 1–2 (2020) 19. Liu, H., Cocea, M.: Granular computing-based approach of rule learning for binary classification. Granul. Comput. 4, 275–283 (2019) 20. Chen, D., Xu, W., Li, J.: Granular computing in machine learning. Granul. Comput. 4, 299–300 (2019) 21. Liu, H., Cocea, M.: Granular computing-based approach for classification towards reduction of bias in ensemble learning. Granul. Comput. 2, 131–139 (2017) 22. Evermann, J., Rehse, J.R., Fettke, P.: Predicting process behaviour using deep learning. Decis. Support Syst. 100, 129–140 (2017) 23. van Der Aalst, W., Adriansyah, A., De Medeiros, A.K.A., Arcieri, F., Baier, T., Blickle, T., Bose, J.C., Van Den Brand, P., Brandtjen, R., Buijs, J.: Process mining manifesto. In: International Conference on Business Process Management, pp. 169–194 (2011) 24. van Dongen, B.F., De Medeiros, A.K.A., Wen, L.: Process mining: overview and outlook of petri net discovery algorithms. In: transactions on petri nets and other models of concurrency II, pp. 225–242. Springer (2009) 25. van der Aalst, W.: Wil: Process mining. ACM Trans. Manag. Inf. Syst. 3, 1–17 (2012) 26. Di Francescomarino, C., Ghidini, C., Maggi, F.M., Milani, F.: Predictive process monitoring methods: which one suits me best? In: International Conference on Business Process Management, pp. 462–479. Springer (2018) 27. 
Mehdiyev, N., Evermann, J., Fettke, P.: A multi-stage deep learning approach for business process event prediction. In: IEEE 19th Conference on Business Informatics, CBI 2017, pp. 119–128 (2017) 28. Le, M., Nauck, D., Gabrys, B., Martin, T.: Sequential clustering for event sequences and its impact on next process step prediction. In: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pp. 168–178. Springer (2014)


29. Le, M., Gabrys, B., Nauck, D.: A hybrid model for business process event and outcome prediction. Expert Syst. 34, e12079 (2017) 30. Márquez-Chamorro, A.E., Resinas, M., Ruiz-Cortés, A., Toro, M.: Run-time prediction of business process indicators using evolutionary decision rules. Expert Syst. Appl. 87, 1–14 (2017) 31. Di Francescomarino, C., Ghidini, C., Maggi, F.M., Petrucci, G., Yeshchenko, A.: An eye into the future: leveraging a-priori knowledge in predictive business process monitoring. In: International Conference on Business Process Management, pp. 252–268. Springer (2017) 32. Polato, M., Sperduti, A., Burattin, A., de Leoni, M.: Time and activity sequence prediction of business process instances. Computing 100, 1005–1031 (2018) 33. Tax, N., Verenich, I., La Rosa, M., Dumas, M.: Predictive business process monitoring with LSTM neural networks. In: International Conference on Advanced Information Systems Engineering, pp. 477–492 (2017) 34. Breuker, D., Matzner, M., Delfmann, P., Becker, J.: Comprehensible predictive models for business processes. Manag. Inf. Syst. Q. 40, 1009–1034 (2016) 35. Lakshmanan, G.T., Shamsi, D., Doganata, Y.N., Unuvar, M., Khalaf, R.: A markov prediction model for data-driven semi-structured business processes. Knowl. Inf. Syst. 42, 97–126 (2015) 36. Leontjeva, A., Conforti, R., Di Francescomarino, C., Dumas, M., Maggi, F.M.: Complex symbolic sequence encodings for predictive monitoring of business processes. In: International Conference on Business Process Management, pp. 297–313 (2015) 37. Maggi, F.M., Di Francescomarino, C., Dumas, M., Ghidini, C.: Predictive monitoring of business processes. In: International Conference on Advanced Information Systems Engineering, pp. 457–472. Springer (2014) 38. De Leoni, M., van der Aalst, W.M.P., Dees, M.: A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs. Inf. Syst. 56, 235–257 (2016) 39. Folino, F., Guarascio, M., Pontieri, L.: Discovering context-aware models for predicting business process performances. In: OTM Confederated International Conferences on the Move to Meaningful Internet Systems, pp. 287–304. Springer (2012) 40. Rogge-Solti, A., Weske, M.: Prediction of remaining service execution time using stochastic petri nets with arbitrary firing delays. In: International Conference on Service-Oriented Computing, pp. 389–403. Springer (2013) 41. van Dongen, B.F., Crooy, R.A., van der Aalst, W.M.P.: Cycle time prediction: When will this case finally be finished? In: OTM Confederated International Conferences on the Move to Meaningful Internet Systems, pp. 319–336. Springer (2008) 42. van der Aalst, W., Schonenberg, M.H., Song, M.: Time prediction based on process mining. Inf. Syst. 36, 450–475 (2011) 43. Polato, M., Sperduti, A., Burattin, A., de Leoni, M.: Data-aware remaining time prediction of business process instances. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 816–823. IEEE (2014) 44. Senderovich, A., Weidlich, M., Gal, A., Mandelbaum, A.: Queue mining for delay prediction in multi-class service processes. Inf. Syst. 53, 278–295 (2015) 45. Conforti, R., Fink, S., Manderscheid, J., Röglinger, M.: PRISM–a predictive risk monitoring approach for business processes. In: International Conference on Business Process Management, pp. 383–400. Springer (2016) 46. Rogge-Solti, A., Weske, M.: Prediction of business process durations using non-Markovian stochastic Petri nets. Inf. Syst. 54, 1–14 (2015) 47. 
Wynn, M.T., Low, W.Z., ter Hofstede, A.H.M., Nauta, W.: A framework for cost-aware process management: cost reporting and cost prediction. J. Univers. Comput. Sci. 20, 406–430 (2014) 48. Theis, J., Darabi, H.: Decay Replay Mining to Predict Next Process Events. IEEE Access. 7, 119787–119803 (2019) 49. Kratsch, W., Manderscheid, J., Röglinger, M., Seyfried, J.: Machine learning in business process monitoring: a comparison of deep learning and classical approaches used for outcome prediction. Bus. Inf. Syst. Eng., 1–16 (2020)


50. Al-Jebrni, A., Cai, H., Jiang, L.: Predicting the next process event using convolutional neural networks. In: 2018 IEEE International Conference on Progress in Informatics and Computing (PIC), pp. 332–338. IEEE (2018) 51. Park, G., Song, M.: Predicting performances in business processes using deep neural networks. Decis. Support Syst. 129, 113191 (2020) 52. Di Mauro, N., Appice, A., Basile, T.M.A.: Activity prediction of business process instances with inception cnn models. In: International Conference of the Italian Association for Artificial Intelligence, pp. 348–361. Springer (2019) 53. Pasquadibisceglie, V., Appice, A., Castellano, G., Malerba, D.: Using convolutional neural networks for predictive process analytics. In: 2019 International Conference on Process Mining (ICPM), pp. 129–136. IEEE (2019) 54. Weinzierl, S., Wolf, V., Pauli, T., Beverungen, D., Matzner, M.: Detecting Workarounds in Business Processes-a Deep Learning method for Analyzing Event Logs. In: Proceedings of the 28th European Conference on Information Systems (ECIS), An Online AIS Conference, June 15-17, 2020. https://aisel.aisnet.org/ecis2020_rp/67 55. Schönig, S., Jasinski, R., Ackermann, L., Jablonski, S.: Deep learning process prediction with discrete and continuous data features. In: Proceedings of the 13th International Conference on Evaluation of Novel Approaches to Software Engineering, pp. 314–319 (2018) 56. Camargo, M., Dumas, M., González-Rojas, O.: Learning accurate LSTM models of business processes. In: International Conference on Business Process Management, pp. 286–302. Springer (2019) 57. Tello-Leal, E., Roa, J., Rubiolo, M., Ramirez-Alcocer, U.M.: Predicting activities in business processes with LSTM recurrent neural networks. In: 2018 ITU Kaleidoscope: Machine Learning for a 5G Future (ITU K), pp. 1–7. IEEE (2018) 58. Taymouri, F., La Rosa, M., Erfani, S., Bozorgi, Z.D., Verenich, I.: Predictive business process monitoring via generative adversarial nets: the case of next event prediction (2020). arXiv2003.11268 59. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an approach to evaluating interpretability of machine learning (2018). arXiv1806.00069 60. Adebayo, J., Gilmer, J., Goodfellow, I., Kim, B.: Local explanation methods for deep neural networks lack sensitivity to parameter values. In: International Conference on Learning Representations Workshop (ICLR) (2018) 61. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps (2013). arXiv1312.6034 62. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks (2017). arXiv1703.01365 63. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net (2014). arXiv1412.6806 64. Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-CAM: Why did you say that? (2016). arXiv1611.07450 65. Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: Smoothgrad: removing noise by adding noise (2017). arXiv1706.03825 66. Petsiuk, V., Das, A., Saenko, K.: Rise: Randomized input sampling for explanation of blackbox models (2018). arXiv1806.07421 67. Hohman, F., Kahng, M., Pienta, R., Chau, D.H.: Visual analytics in deep learning: an interrogative survey for the next frontiers. IEEE Trans. Vis. Comput. Graph. 25, 2674–2693 (2018) 68. 
Mehdiyev, N., Fettke, P.: Prescriptive process analytics with deep learning and explainable artificial intelligence. In: 28th European Conference on Information Systems (ECIS). An Online AIS Conference (2020). https://aisel.aisnet.org/ecis2020_rp/122 69. Rehse, J.-R., Mehdiyev, N., Fettke, P.: Towards explainable process predictions for industry 4.0 in the DFKI-Smart-Lego-Factory. KI-Künstliche Intelligenz., 1–7 (2019) 70. Sindhgatta, R., Ouyang, C., Moreira, C., Liao, Y.: Interpreting predictive process monitoring benchmarks (2019). arXiv1912.10558


71. Jan, S.T.K., Ishakian, V., Muthusamy, V.: AI Trust in business processes: the need for processaware explanations (2020). arXiv2001.07537 72. Miller, T.: Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 267, 1–38 (2019) 73. Wang, D., Yang, Q., Abdul, A., Lim, B.Y.: Designing theory-driven user-centric explainable AI. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). Paper 601, pp 1–15. (2019) 74. Tomsett, R., Braines, D., Harborne, D., Preece, A., Chakraborty, S.: Interpretable to whom? A role-based model for analyzing interpretable machine learning systems (2018). arXiv1806.07552 75. Doshi-Velez, F., Kortz, M., Budish, R., Bavitz, C., Gershman, S., O’Brien, D., Schieber, S., Waldo, J., Weinberger, D., Wood, A.: Accountability of AI under the law: the role of explanation (2017). arXiv1711.01134 76. Preece, A., Harborne, D., Braines, D., Tomsett, R., Chakraborty, S.: Stakeholders in explainable AI (2018). arXiv1810.00184 77. Majchrzak, A., Gasser, L.: On using artificial intelligence to integrate the design of organizational and process change in US manufacturing. AI Soc. 5, 321–338 (1991) 78. Ji-Ye Mao, I.B.: The use of explanations in knowledge-based systems: cognitive perspectives and a process-tracing analysis. J. Manag. Inf. Syst. 17, 153–179 (2000) 79. Wick, M.R., Thompson, W.B.: Reconstructive expert system explanation. Artif. Intell. 54, 33–70 (1992) 80. Nunes, I., Jannach, D.: A systematic review and taxonomy of explanations in decision support and recommender systems. User Model. User-Adapt. Interact. 27, 393–444 (2017) 81. Adadi, A., Berrada, M.: Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access. 6, 52138–52160 (2018) 82. Lipton, Z.C.: The mythos of model interpretability (2016). arXiv1606.03490 83. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Pedreschi, D., Giannotti, F.: A survey of methods for explaining black box models. ACM Comput. Surv. 51, 5 (2018) 84. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an approach to evaluating interpretability of machine learning. In: IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). Turin, Italy, pp. 80–89 (2018) 85. Carvalho, D.V., Pereira, E.M., Cardoso, J.S.: Machine learning interpretability: a survey on methods and metrics. Electronics 8, 832 (2019) 86. Frosst, N., Hinton, G.: Distilling a Neural Network Into a Soft Decision Tree (2017). arXiv:1711.09784 87. Lundberg, S.M., Erion, G.G., Lee, S.-I.: Consistent individualized feature attribution for tree ensembles (2018). arXiv1802.03888 88. Apley, D.W.: Visualizing the effects of predictor variables in black box supervised learning models (2016). arXiv1612.08468 89. Goldstein, A., Kapelner, A., Bleich, J., Pitkin, E.: Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat. 24, 44–65 (2015) 90. Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, pp. 4765–4774 (2017) 91. Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you? Explaining the predictions of any classifier. In: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’16. New York, USA, pp. 1135–1144 (2016) 92. Candel, A., Parmar, V., LeDell, E., Arora, A.: Deep learning with h2o (2016) 93. 
Alvarez-Melis, D., Jaakkola, T.S.: On the robustness of interpretability methods (2018). arXiv1806.08049 94. Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., Giannotti, F.: Local rulebased explanations of black box decision systems (2018). arXiv1805.10820 95. Mittelstadt, B., Russell, C., Wachter, S.: Explaining explanations in AI. In: Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). Association for Computing Machinery, pp. 279–288 (2019)


96. Hu, L., Chen, J., Nair, V.N., Sudjianto, A.: Locally interpretable models and effects based on supervised partitioning (LIME-SUP) (2018). arXiv1806.00663 97. Hall, P., Gill, N., Kurka, M., Phan, W., Bartz, A.: Machine Learning Interpretability with H2O Driverless AI: First Edition Machine Learning Interpretability with H2O Driverless AI (2017) 98. Caruana, R., Kangarloo, H., Dionisio, J.D., Sinha, U., Johnson, D.: Case-based explanation of non-case-based learning methods. In: Proceedings of AMIA Symposium, 212–5 (1999) 99. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Sigmod Rec. 30, 151–162 (2001) 100. Min, R., Stanley, D.A., Yuan, Z., Bonner, A., Zhang, Z.: A deep non-linear feature mapping for large-margin KNN classification. In: Ninth IEEE International Conference on Data Mining, 2009. ICDM’09, pp. 357–366. IEEE (2009) 101. Salakhutdinov, R., Hinton, G.: Learning a nonlinear embedding by preserving class neighbourhood structure. In: Artificial Intelligence and Statistics, pp. 412–419 (2007) 102. Freitas, A.A.: Comprehensible classification models. ACM SIGKDD Explor. Newsl. 15, 1–10 (2014) 103. Mehdiyev, N., Krumeich, J., Enke, D., Werth, D., Loos, P.: Determination of rule patterns in complex event processing using machine learning techniques. Procedia Procedia Comput. Sci. 61, 395–401 (2015) 104. Steeman, W.: BPI Challenge 2013 (2013) 105. Fürnkranz, J., Kliegr, T., Paulheim, H.: On cognitive preferences and the plausibility of rulebased models. Mach. Learn., 1–46 (2019) 106. Robnik-Šikonja, M., Bohanec, M.: Perturbation-based explanations of prediction models. In: Zhou J., C.F. (ed.) Human and Machine Learning. Human–Computer Interaction Series, pp. 159–175. Springer (2018) 107. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning (2017). arXiv:1702.08608

Use of Visual Analytics (VA) in Explainable Artificial Intelligence (XAI): A Framework of Information Granules

Bo Sun

Abstract Research work in Visual Analytics (VA) has been increasing since the field was first defined in 2005. VA techniques that integrate Machine Learning (ML) models and interactive visualizations have formed a human-centered machine learning approach to assist in Data Analytics. VA aims to interpret the complexities of Big Data and underlying ML models by engaging analysts in an iterative process of observing, interpreting, and evaluating inputs, outputs, and architectures of these models. The process subsequently provides guidance to users, interaction techniques to control AI, and information about inner workings that are often hidden. This chapter defines the underlying stages of the ML pipeline, in feature selection and model performance, as components of Information Granules (IG) in VA; it also explores the use of VA in XAI. This study reviews 13 top-tier publications in the recent VA literature to demonstrate (1) the interpretability strategies of VA for feature relevance, model performance, and model architecture; (2) global and local interpretability in terms of information and visual scalability; and (3) the stability of explanations through user case studies, reusability, and design goals. The chapter concludes by analyzing the current stage of VA for granular computing. The future work of VA scientists will be to focus on broader behaviors of ML models, particularly neural networks, to gain public trust in AI. Keywords Visual analytics · Explainable AI · Explainable machine learning

1 Introduction

Techniques of Artificial Intelligence (AI) are commonly used in Data Analytics (DA). Visual analytics (VA), as a new data analytics field, focuses on using interactive visual interfaces to form analytical reasoning [1]. As an outgrowth of the areas of information visualization and scientific visualization, the technology is often used to explore the performance of AI, assist in parameter tuning, and offer guidance to interpret


Machine Learning (ML) models [2]. Therefore, it is an ideal tool to explain AI techniques. Visual analytics is primarily concerned with coupling interactive visual representations with underlying analytical processes such as statistical procedures and data mining techniques [3]. In this way, high-level and complex activities, such as sense-making, reasoning, and decision-making, can be performed effectively [3]. Techniques of visual analytics have also been shown to benefit analytical reasoning in many scenarios [4, 5]. Human-information discourse is thus enabled by applying human knowledge through interaction with the visualization interface. The field of VA was first defined by Thomas and Cook [6] from the National Visualization and Analytics Center at the U.S. Dept. of Homeland Security in 2005: "Visual Analytics is the science of analytical reasoning facilitated by interactive visual interface." This research is a multidisciplinary field comprising four focus areas: (1) analytical reasoning techniques, (2) visual representations and interaction technologies, (3) data representations and transformations, and (4) production, presentation, and dissemination. Specifically, the analytical reasoning process comprises four activities: gathering information, representing the information, developing insight, and producing results, as seen in Fig. 1.

Fig. 1 Analytics reasoning process [6]

These activities are often repeated in a different order and rely on interactive visualization to generate sense-making. Research in Visual Analytics has been increasing since its introduction. It is currently accepted that VA can aid in the process of interpretation and evaluation of ML results and can provide analysts with guidance toward more and better interpretable ML models [2, 7]. Although the definition of XAI lacks consensus in the field, this chapter utilizes the definition provided by Arrieta et al. [8]: "Given an audience, an explainable Artificial Intelligence is one that produces details or reasons to make its functioning clear or easy to understand." Users can better understand, trust, and manage an ML model when the model is explorable and transparent [7]. VA researchers have spent a decade integrating ML algorithms with interactive visualization to support both domain


scientists and ML researchers in understanding the underlying details and analytical reasoning of the data and models [8]. An ideal VA platform offers human-centered approaches to engage analysts in an iterative process of observing, interpreting, and evaluating inputs, outputs, and architectures of the system [2]. Such an approach plays an essential role in analyzing ML models; it provides guidance to users, interaction techniques to control AI, and information about inner workings that are often hidden in the black box [7, 9]. Pedrycz [10] defined Information Granules in AI: "Information Granularity is central to a way of problem solving through problem decomposition, where various subtasks could be formed and solved individually." The ultimate objective of Information Granules according to Pedrycz is to "describe the underlying phenomenon in an easily understood way and at a certain level of abstraction. This requires that we use a vocabulary of commonly encountered terms (concepts) and discover relationships between them and reveal possible linkages among the underlying concepts." A two-way, effective human–machine communication is crucial for Information Granules [9]. Human Computer Interaction (HCI) within VA aids in this process by enabling analysts to interact with different stages of an ML pipeline through a visual interface. This point of view will be detailed in Sect. 2 on explainability strategies. This chapter references 87 works of literature and adopts 13 top-tier publications as examples to demonstrate the interpretability functions of VA systems and to explore the use of VA in assisting in Explainable AI (XAI). Specifically, we focus on three characteristics of interpretability, namely interpretability strategies, global and local interpretability, and stability of explanations, to summarize the features provided by VA. The main contributions of this chapter are as follows: (1) providing a survey that demonstrates the interpretability of VA in terms of interpretability strategies, information scalability, and stability of explanations under the concept of XAI; (2) defining information granules in VA using the various stages of a predictive visual analytics pipeline evolved from the ML pipeline in Sect. 2; (3) classifying the recent top-tier VA research based on the underlying components of information granules; (4) analyzing the current stage and future research directions of VA for granular computing. All 13 papers reviewed in this chapter were collected from IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG), the Proceedings of the IEEE Conference on Visual Analytics Science and Technology (IEEE VAST), and the IEEE International Conference on Big Data (IEEE BigData). Table 1 shows the distribution of the publications among these venues, including their publication year and the number of citations retrieved from Google Scholar.

Table 1 Publications used in this work

Paper                                                        Year   Citations
Tableau [11]                                                 2007   406
Neural Document Embedder by Ji et al. [12]                   2019   6
INFUSE by Krause et al. [13]                                 2014   116
FeatureExplorer by Zhao et al. [14]                          2019   1
VA Tool for Ensemble Learning by Schneider et al. [15]       2018   4
Squares by Ren et al. [16]                                   2017   96
Clustervision by Kwon [17]                                   2018   51
FairVis by Cabrera et al. [18]                               2019   10
Tensor Flow Visualizer by Wongsuphasawat et al. [19]         2018   156
ActiVis by Kahng et al. [20]                                 2017   149
CNNVis by Liu et al. [21]                                    2017   233
explAIner by Spinner et al. [22]                             2020   15
RetainVis by Kwon et al. [23]                                2019   55


We structure the rest of the chapter as follows: in Sect. 2, we detail the interpretability strategies of VA by reviewing the 13 VA research works based on our defined information-granules concept in VA, specifically for feature relevance, model performance, and model architecture explanations; in Sect. 3, we review the global and local interpretability of VA by focusing on information and visual scalability; in Sect. 4, the stability of explanations is validated via user case studies, reusability, and design goals; in Sect. 5, the current stage and future research directions of VA in supporting granular computing are presented; in Sect. 6, we summarize the promising potential of VA to assist in XAI.

2 Explainability Strategies

Information visualization research has focused on the creation of visual approaches to convey abstract information in an intuitive way [6]. The technique itself was shown to amplify human cognition as early as the 1990s. Effective information visualization can increase cognitive resources and reduce the search effort of analytical reasoning, benefiting data analytics. Resnikoff [24] indicated that the human moving gaze system can partition limited cognitive channel capacity to combine high spatial resolution and a wide aperture in sensing the visual environment. Larkin and Simon [25] found that symbolic visual presentation can help offload work from cognitive to perceptual operations. They also stated that a dashboard visualization containing grouped information reduces search, making it extremely easy for humans. Norman [26] discovered that visualization can expand the working memory of a human subject solving a problem. Tufte [27] stated that visualization can often represent a large amount of data in a small space. Card et al. [28] and Resnikoff [24] found that visualization can simplify and organize information and reveal data patterns through data aggregation. Bauer et al. [29] further stated that visualizations can be constructed to enhance patterns in values, relationships, and trends. Unlike information visualization, which develops the visual approach itself, visual analytics requires both visual representations and interaction techniques to conduct the analytics process. The combination allows users to see, explore, and understand large amounts of information at once [6]. Visual representations in VA must convey the important content of the information instantly [6]. The analytical process typically needs a suite of visual metaphors and associated visual approaches to provide users with multiple views of their information in a flow [6]. Meanwhile, interaction techniques are required to support the communication between the users and the data [6]. More sophisticated interactions are also needed to support any given task, timescale, and interaction environment [6]. Nevertheless, AI techniques consist of a variety of machine learning algorithms. Lu et al. [4] defined a machine learning pipeline, as shown in Fig. 2. Data of interest are collected and cleaned initially. Then machine learning experts select essential features as input to a machine learning model. Next, the model is trained, tested, and refined based on the performance results; this process is typically uncertain, with many rounds of model selection and evaluation [4]. Visualization scientists have spent many years developing VA platforms to assist in machine learning development for data analytics [19, 20, 30].
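As a minimal, self-contained illustration of the ML pipeline of Fig. 2 (the dataset and model choices here are arbitrary assumptions, not taken from the reviewed works), the stages map onto a few lines of scikit-learn code; the iterative refinement discussed above corresponds to repeatedly revisiting the feature selection and model parameters after evaluation.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Data collection and cleaning are represented here by a ready-made dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature selection followed by model training, as in the pipeline of Fig. 2.
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Evaluation; in practice this step feeds back into feature and parameter refinement.
print(pipeline.score(X_test, y_test))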


Fig. 2 Machine learning pipeline [4]

Fig. 3 Predictive visual analytics pipeline [5]

In interactive model analysis, machine learning models are integrated with interactive visualization techniques capable of translating models into explanations that are understandable and useful to users [4]. Lu et al. [5] further defined a pipeline for predictive visual analytics based on Fig. 2, as seen in Fig. 3. In this process, different visualizations combined with visual interactions (the adjustment loop) are adopted throughout the analytical process to support data preprocessing, feature selection, model training, selection, and validation. As this chapter focuses on the perspective of Information Granules (IG) in XAI, we define each stage of the pipeline outlined in the ML process as a component of IG in VA. In Sects. 2.1–2.2, ML models are decomposed into the different stages of the pipeline; we then interpret the underlying phenomena of the individual stages to discover relationships and reveal possible linkages between the stages of the pipeline via visual interfaces. Furthermore, the methods for interpretability and explainability in XAI are often classified into two categories of approaches: transparency-based and post-hoc [8, 31]. The transparency-based approach is used for simple ML models that can explain themselves, such as linear models or decision trees [31]. The post-hoc approach, in contrast, targets complex ML models that prevent users from tracing the logic behind predictions, such as neural networks and support vector machines [5]. Most VA platforms are developed based on the post-hoc approach to assist in model interpretation of feature relevance, model performance, or model architecture/structure. Based on the predictive visual analytics pipeline defined by Lu et al. [5] (and Chatzimparmpas et al. [7]) and the explainability approaches in XAI, we identified three major components to present the state-of-the-art work in the VA field for XAI, namely, Feature Selection, Performance Analysis, and Model Explanation.
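To ground the post-hoc category, the following is a minimal sketch (not taken from any of the tools reviewed below) of a global surrogate explanation: an opaque model is approximated by a shallow decision tree trained on the opaque model's own predictions, so that its behavior can be read as rules. The dataset and model choices are illustrative assumptions only.

```python
# A minimal post-hoc surrogate sketch: approximate a "black-box" model
# with a shallow decision tree trained on the black box's own predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Black-box model: a random forest stands in for any opaque classifier.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Surrogate: a depth-limited tree fit on the black box's predictions,
# yielding human-readable rules that approximate its behavior.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

print("Fidelity to black box:", surrogate.score(X, black_box.predict(X)))
print(export_text(surrogate, feature_names=list(X.columns)))
```

The printed fidelity score indicates how faithfully the readable surrogate mimics the black box; VA tools typically layer interactive views on top of exactly this kind of derived explanation.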


Performance analysis is an essential procedure for the model evaluation and validation outlined in the ML pipeline. Feature selection and performance analysis thus represent two major stages of the ML process. From the perspective of information granules, interpreting these two underlying components, and how feature selection impacts model performance, is essential for understanding the inner workings of an ML model. Feature selection aims to discover the high-impact features that contribute to robust machine learning detection. Performance analysis focuses on analyzing machine learning results in order to refine the parameters of the model. Both areas help analysts and domain experts quickly select the best model for their data analytics tasks. Therefore, the main strategy used for feature selection and performance analysis can be explained using the VA process diagram for model selection from Bögl et al. [32], as seen in Fig. 4. Experienced analysts (or domain experts) typically make hypotheses about a dataset based on domain knowledge from prior experience. The hypotheses are refined in the analytics process through visual interactions with an underlying model that processes the data. In this iterative process, insights are gained by interpreting the interactive visual observations. The insights help to determine model fitness, guide parameter adjustment toward an adequate model, and refine the hypotheses. The developer provides the major judgment of the model through visual interaction in the user-interaction area (seen in Fig. 4). Unlike feature selection and performance analysis, model explanation attempts to make the computational process inside the black box transparent so that users can understand and trust the AI process. The primary approach of this component is to develop visualizations of the computational process, or of the architecture and structure, of an ML algorithm, particularly for a complex method such as a neural network.

Fig. 4 VA process for model selection [32]
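To make the iterative model-selection loop of Fig. 4 concrete, the hedged sketch below sweeps one hyperparameter, scores each candidate by cross-validation, and plots the scores so an analyst can judge model fitness visually; the dataset, model, and parameter are illustrative assumptions, not part of the framework in [32].

```python
# A minimal sketch of the iterative model-selection loop behind Fig. 4:
# sweep one hyperparameter, score each candidate by cross-validation,
# and plot the results so the analyst can judge model fitness visually.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

depths = range(1, 11)
scores = [cross_val_score(RandomForestClassifier(max_depth=d, random_state=0),
                          X, y, cv=5).mean() for d in depths]

plt.plot(list(depths), scores, marker="o")
plt.xlabel("max_depth (candidate model)")
plt.ylabel("mean 5-fold accuracy")
plt.title("Visual comparison of candidate models")
plt.show()
```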


2.1 Feature Selection

Once a dataset has been cleaned into a uniform format and is ready for the machine learning process, feature selection plays a significant role in the performance of a machine learning algorithm. Without appropriate selection, redundant and non-descriptive features can lead to inadequate model training and prediction [5]. Dash and Liu [33] stated that an ideal feature selection should search all subsets of the data; however, this exhaustive and costly process is often practically prohibitive even for small feature sets. When facing a large-scale, multidimensional dataset, machine learning scientists typically select some sample data and conduct dimensionality reduction or regression to pick the high-impact features for AI model training. The process hides data insight and is often described as "analyzing in the dark". Therefore, a visual analytics tool that enables data exploration and compresses large-scale data into a small space provides a "God's view", a data overview, to assist in feature selection. Furthermore, the visual exploration and presentation of data features make the input of an ML model transparent. This is an essential step for analysts to understand how an ML model reacts to the selected features when explaining feature relevance. Feature selection in VA therefore serves as a vital interpretation component in assisting XAI.

Tableau [34], an interactive data visualization tool, has become a leading data analytics software in recent years. The software produces data visualizations and allows exploration of a raw dataset for feature selection [11]. The most popular function associated with the tool is dashboard visualization, as seen in Fig. 5a, a group of visualizations used to interpret multidimensional data features on one screen. Analysts can select different data categories through filters (top right corner of Fig. 5a) and interact with a chart directly via tooltips (as seen on the map in Fig. 5a) when analyzing the data. Tableau also supports advanced analysis functions for feature selection, as seen in Fig. 5b, including data summaries, model-based analyses such as calculated fields, averages and medians with 95% confidence intervals, trend lines (using linear, logarithmic, exponential, and polynomial models), clustering, and customized analyses based on user-specified approaches. These functions can be performed easily on users' selected data features via drag-and-drop (Fig. 5c); the corresponding visual results are presented accordingly. Figure 5d shows a visualization of four clusters obtained with the k-means algorithm.

In the visual analytics literature, scientists have been developing approaches that incorporate human knowledge into feature selection and aim to reduce data features to a manageable set in support of a robust machine learning model [5]. Ji et al. [12] presented a visual exploration tool for neural document embedding. The embedding technique, which converts texts of variable length into semantic vector representations, is an effective way to represent documents as concise feature vectors that leverage the computational power of neural networks [12]. The tool presents several visualizations and interactions to support the complex feature embedding based on identified topics and keywords. Figure 6a adapted t-Distributed Stochastic Neighbor Embedding (t-SNE)


Fig. 5 Visualizations created by Tableau software [35]: a dashboard visualization; b advanced analysis features; c drag-and-drop feature; d cluster view

Fig. 6 Visualization of neural document embeddings [12]

to display the configured document map in clusters based on the document embeddings. A user can interact with Fig. 6c, d to specify the cluster algorithms, algorithm factors, and the choice of a clustering level. Figure 6b, f display the topics of clusters and cluster results in a color-coded map, respectively. Specifically, Fig. 6f shows a


Fig. 7 Cluster-level dimension behavior [12]

2D map with different aggregation magnitudes that reinforces the clusters by keeping the relative positions of clusters and intra-cluster documents from t-SNE. The tool uses a specifically developed feature selection paradigm, as seen in Fig. 7, in which a parallel-coordinates view renders dimensional ranks. The paradigm supports the selection of a subset of neural dimensions and reflects properties and patterns in the new feature space. With the feature selection box, as seen in Fig. 7b1–b4, users can further investigate any dimension or feed in a set of identified dimensions.

Krause et al. presented the INFUSE [13] system to support feature selection in predicting diabetes. The system adopted four feature selection algorithms, namely Information Gain, Fisher Score, Odds Ratio, and Relative Risk, to analyze patients' records on diagnoses, lab tests, medications, and procedures. Figure 8a shows the system overview with interactive visualization. The left side of Fig. 8a presents the feature view, an overview of all features categorized by data type (color-coded, as seen in the bottom left corner) and then sorted by importance. The overview is supported by a circular glyph design, as seen in Fig. 8b. The glyph is divided into four equally sized circular segments presenting the ranks assigned by the four feature selection algorithms. In each segment, an inward-growing bar further compares feature ranks among the cross-validation folds (random data samples). The buttons at the bottom of Fig. 8a allow a detailed investigation of the results of the ranking algorithms via scatterplots, as seen in Fig. 8c. The list view on the top right side of Fig. 8a, with a search box filter, presents a sorted list of all features for selection purposes. The classifier view at the bottom right gives access to the performance scores of the machine learning models adopted by INFUSE, including Logistic Regression, Decision Trees, Naïve Bayes, and K-Nearest Neighbors; each row in the classifier view corresponds to one of the feature selection algorithms. An analyst can easily compare the feature and model results via the highlighted bar graphs.

Zhao et al. proposed a simple visual analytics tool, FeatureExplorer [14], to support feature selection for hyperspectral images. The tool supports the dynamic evaluation of regression models and investigates the importance of feature subsets through the interactive selection of features in high-dimensional feature spaces.


Fig. 8 INFUSE [13]: a overview of the INFUSE system; b the glyph representation; c scatterplot view of the results of the feature selection algorithms

Figure 9a lists the selected and unselected features, a button to trigger regression, and an automatic feature selection button. Figure 9b shows the feature correlation panel using a correlation matrix and a scatterplot. The evaluation panel in Fig. 9c presents the correlation between ground truth and predicted values via scatterplots. The bottom of Fig. 9c presents a histogram showing the frequency of the used

Fig. 9 FeatureExplorer overview [14]


pertinent wavelengths, the importance score of each feature shown via a horizontal bar chart, and a table displaying the results with and without feature selection.

Feature selection is an essential part of the ML pipeline and directly impacts model performance. To understand an ML model, one must first understand how the data are fed into it. Tableau [34] is often used to explore and interpret the structure of multidimensional and streaming data. The neural document embedding tool [12] is dedicated to explaining a complicated feature-embedding process for neural networks. Both INFUSE [13] and FeatureExplorer [14] focus on feature relevance to model performance by connecting predictive results with feature selections. Interactive visualization in feature selection can clarify the complexity of the data features fed into ML models and how they impact model performance. These are fundamental approaches to explaining AI methods in XAI, as seen in [8, 31].
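As a rough illustration of the kind of evidence such feature-selection views are built on (and not a re-implementation of INFUSE or FeatureExplorer), the sketch below ranks features under two criteria across cross-validation folds; inspecting the resulting rank table is the tabular analogue of INFUSE's glyph comparison. The dataset and the two ranking criteria are assumptions for the example.

```python
# A hedged, INFUSE-style sketch (not the INFUSE implementation itself):
# rank features under two criteria, per cross-validation fold, so that
# rank stability across folds can be inspected visually or in a table.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.model_selection import KFold

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

records = []
for fold, (idx, _) in enumerate(KFold(n_splits=4, shuffle=True, random_state=0).split(X)):
    Xf, yf = X.iloc[idx], y.iloc[idx]
    for name, score in [("mutual_info", mutual_info_classif(Xf, yf, random_state=0)),
                        ("f_score", f_classif(Xf, yf)[0])]:
        ranks = pd.Series(score, index=X.columns).rank(ascending=False)
        records.append(ranks.rename(f"{name}_fold{fold}"))

rank_table = pd.concat(records, axis=1)
print(rank_table.mean(axis=1).sort_values().head(10))  # lowest average rank = most consistently important
```

Features whose rank stays low across folds and criteria are the stable, high-impact candidates an analyst would keep.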

2.2 Performance Analysis

Performance analysis is a crucial procedure for validating a trained AI algorithm. Machine learning scientists often use a set of testing data to conduct this process by observing the output measures. Many popularly used measures, including the accuracy rate, error rate, F1 score, and confusion matrix, report only a model's performance. However, they do not help analyze the results to diagnose and refine model behavior, particularly when the features are highly compressed and offer little human insight into the data. Performance analysis in VA attempts to interpret the results of predictive analytics conducted by ML. This interpretation is often linked with the input data and explains model performance by illustrating the predicted results. The visual interaction in VA enables analysts to discover the relationships between the input and output of an ML model and to reveal linkages between the two underlying stages (feature selection and model validation) of the ML pipeline as information granules. Therefore, performance analysis in VA helps to explain and understand ML models. In the Visual Analytics community, many tools [15–18] have been developed to assist in performance analysis. Most of these tools focus on multi-class or multi-classifier systems, areas that need more visual analytics support to interpret model performance.

Schneider et al. [15] developed a simple visual analytics platform that allows exploration of classification outputs and model selections in ensemble learning [36, 37]. Ensemble learning often combines different classifiers (e.g., training models successively on different data sets [38] or combining different model types [39]) or expands the representable functions in data encoding by using distinct learning approaches at the same time (e.g., bagging with random feature combinations in Random Forests [40, 41]) [15]. The ensemble approach complicates the classification process and decreases comprehensibility [15]. The proposed visual platform allows users to observe the classification distribution of data instances in a scatterplot (with a black background), as seen in Fig. 10. In this data space, the data clusters highlighted in red indicate classification errors. When analysts brush the erroneous data,


Fig. 10 A visual analytics platform for ensemble learning [15]

corresponding data instances are displayed (in the table next to the plot in Fig. 10) for detailed investigation. Each dot in the model space (the scatterplot with a white background in Fig. 10) represents an individual model used in the ensemble. The highlighted yellow dots indicate the models that contributed to the classification errors selected in the data space. Users can include, replace, or remove models from the ensemble to refine the classification performance. The interactive workflow focuses on the role of the user and establishes a theoretical framework for human-centered machine learning [15].

Ren et al. [16] developed Squares, an interactive visual analytics tool, to support performance analysis of multiclass classifiers. Each class is illustrated through parallel coordinates and color-encoded, as seen in Fig. 11. Stacks in the coordinates represent the classification results. The location of a stack shows the performance score marked on the y-axis; the stacks can be expanded into boxes for more classification detail, as seen for classes 3 (C3) and 5 (C5). Each box represents a data sample. The solid red boxes in C5's column indicate true-positive results, and the striped boxes on the left side represent false-positive results. The color code of the boxes hints at the true label of the data instances; for example, green striped boxes in C5 represent the number of C3 instances incorrectly classified as C5. Squares also provides a bi-directional coupling between the visualization and the table; this feature allows users to view instance properties by selecting boxes or stacks in the visualization. Ren et al. compared Squares with the confusion matrix in a user case study. Squares was found to require less time and to lead to significantly more correct answers than the confusion matrix. Participants found Squares helpful, and most of them preferred it to the confusion matrix.

Kwon et al. proposed Clustervision [17], a visual supervision tool for unsupervised clustering. Clustering is a common type of unsupervised machine


Fig. 11 Squares [16]

learning that is useful for aggregating complex multi-dimensional data [17]. Clustervision incorporates fifteen clustering methods and aims to interpret the data clustering results obtained from the different algorithms. Figure 12a presents an overview that ranks clustering results by aggregated quality measures. When analysts select one of the ranked results, Fig. 12b shows a projection of the data points with color-coded clusters. The feature value trends of the data points in the chosen cluster (highlighted in green in Fig. 12b) can be rendered in a parallel-coordinates plot, as seen in Fig. 12c. Users can select specific features via the checkboxes on the left side of Fig. 12c. Meanwhile, Fig. 12d details the quality metrics of the selected (green) cluster, and Fig. 12e shows the feature value distribution of all data points in the cluster (e.g., 372 data points for the green cluster). The comprehensive visual tool helps data scientists

Fig. 12 Clustervision [17]


observe and compare the data patterns produced by the various clustering methods and obtain results useful for their tasks.

Cabrera et al. [18] developed FairVis to discover machine learning bias associated with data base rates. FairVis consists of data feature distributions (a), model performance analysis based on various metrics (b), a detailed performance comparison between selected classes (c), and suggested similar subgroups in group views (d), as seen in Fig. 13. The dashboard visualization allows analysts to investigate how feature distribution and selection can impact classification performance under different data base rates. The interactive visualization brings human judgment into balancing certain inequities implied by the impossibility theorem for machine learning models [42, 43]. FairVis helps ensure the effectiveness of fairness and investigate trade-offs between metrics.

Performance analysis is the other essential component in understanding ML models. Interactive visual analysis linked with the input data instances is the key strategy to clarify which data entries caused performance errors. Direct access to the data helps analysts understand an error and quickly diagnose the underlying issue when refining an ML model. This iterative process discloses the computational process of an ML model, thereby assisting XAI. All four VA platforms [15–18] in this section present this vital feature. Additionally, FairVis [18] focuses on understanding ML bias, an important XAI research area that needs attention to gain trustworthy insights [8].

Fig. 13 FairVis [18]
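The sketch below shows, in plain code, the instance-level drill-down that tools such as Squares and the ensemble-learning platform support interactively: aggregate per-class scores first, then retrieve the exact rows behind the errors. The dataset and classifier are illustrative assumptions.

```python
# A minimal sketch of instance-level performance analysis: compute a
# confusion matrix and per-class scores, then pull out the misclassified
# rows so an analyst can inspect which data entries caused the errors.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
pred = model.predict(X_test)

print(confusion_matrix(y_test, pred))          # aggregate view per class
print(classification_report(y_test, pred))     # precision/recall/F1 per class

# Drill down: the actual instances behind the errors, with true/predicted labels.
errors = X_test.copy()
errors["true"], errors["predicted"] = y_test.values, pred
print(errors[errors["true"] != errors["predicted"]])
```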


2.3 Model Explanations

Model explanations in VA focus on interpreting an ML model's architecture or computational process. Visual analytics scientists have developed many techniques to better understand machine learning models. Because of the recent success of deep learning in image and video recognition, many model explanations in the VA field target Neural Networks (NN). As a sophisticated machine learning model, an NN often needs additional interpretation to assist both machine learning scientists and data scientists, and the VA tools are tailored for various audiences. Yu and Shi [30] categorized NN-related VA research into four groups: tools for teaching concepts to beginners; architecture assessment for practitioners; tools for debugging and enhancing models for developers; and visual explanations for performance analysis for domain experts. As the last two groups, concerning how VA can interpret AI through feature selection and performance analysis, have already been introduced, this section focuses the literature review on the architecture assessment and computational process of NNs. Understanding the architecture mechanism and the computational pipeline is essential for both developers and domain experts in developing an intuition of the model, specifically what the network looks like and how data flow through the model during the complicated process [30]. A neural network algorithm mimics the structure of the human brain and comprises neurons and weighted edges. Some visualization approaches [4] adopted standard node-link graphs. When considering forward propagation along the weighted edges, deep neural networks are often illustrated as directed acyclic graphs [30], as seen in CNNVis [21].

Google developed TensorBoard [44], a visualization toolkit, to guide the development of TensorFlow programs; TensorFlow is an open-source machine learning system from Google. TensorBoard consists of several dashboards supporting program development and performance analysis. Among them, the TensorFlow Graph [45], also called the model graph, assists in NN model design and development; the graph is automatically generated by TensorFlow when an NN model is developed via the system. The graph dashboard allows developers to quickly view a conceptual graph of their model's structure and ensure that it matches the intended design [45]. Developers can also view the op-level graph to refine model development. The graph provides a clear computational flow of a programmed NN model: the nodes represent various operations, and the edges represent the input and output data of those operations. However, when facing complex datasets, the network can include dozens of layers and millions of parameters, producing a massive nested network [4], as seen in Fig. 14a. To resolve this issue, Wongsuphasawat et al. [19] developed the TensorFlow Graph Visualizer, which is integrated into TensorBoard, to transform the giant network into a legible interactive diagram. The tool transforms the original graph in three steps: (1) extract non-critical operations to declutter the graph; (2) build a clustered graph using the hierarchical structure annotated in the source code; and (3) extract auxiliary nodes, as illustrated in Fig. 14. To better expose complex machine learning architectures, the TensorFlow Graph Visualizer supports the exploration of a nested structure by performing edge bundling to enable stable and responsive cluster expansion (Fig. 15). Developers can


Fig. 14 Transformation in tensor flow visualizer [19]

also inspect the metadata of a node, in terms of inputs, outputs, shapes, and other details, by clicking on it, as seen in Fig. 15 (top right corner). With the support of TensorBoard, developers can effectively build an NN model that conforms to the intended design. Because of the popularity of this visual illustration, Facebook developed a similar model graph (Fig. 16a) in an interactive visualization system named ActiVis [20] and deployed it in their internal systems. ActiVis added additional customized


Fig. 15 Overview of tensor flow visualizer [19]

Fig. 16 Overview of ActiVis [20]

graphs for performance analysis, as seen in Fig. 16. Users can observe activation patterns for classes and instance subsets via scatterplots (Fig. 16b), explore instance-based classification results using the Squares method (Fig. 16c), and investigate neuron activations at both the class and instance levels via a matrix view (Fig. 16b).

Spinner et al. developed a visual analytics framework named explAIner [22] to construct an iterative XAI pipeline for the understanding, diagnosis, and refinement of ML models. The framework takes one or more model states as input and then applies an XAI method to output an explanation or a transition function. The system was embedded in TensorBoard for a user study, as seen in Fig. 17. The XAI methods are listed on the left as a toolbox, in descending order from high abstraction to low abstraction. Explanations are shown in the upper right, combined with information about the explainer displayed beneath. The provenance bar at the bottom of Fig. 17 contains explanation cards that users may want to save after trying the various explainers listed in the toolbox. The user case study showed that explAIner led to an informed machine learning process.


Fig. 17 explAIner [22]
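As a small companion to the TensorBoard graph dashboard discussed earlier in this subsection, the sketch below (assuming TensorFlow 2.x with Keras) trains a toy model while logging it, so that its computational graph can afterwards be inspected with `tensorboard --logdir logs`. The architecture and data are placeholders, not models from the reviewed papers.

```python
# A minimal sketch (assuming TensorFlow 2.x) of producing the kind of model
# graph TensorBoard's Graphs dashboard renders: train briefly with the
# TensorBoard callback, then inspect the graph via `tensorboard --logdir logs`.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

X = np.random.rand(256, 20).astype("float32")     # stand-in inputs
y = np.random.randint(0, 3, size=256)             # stand-in labels

# write_graph=True exports the computational graph for the Graphs dashboard.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/demo", write_graph=True)
model.fit(X, y, epochs=2, batch_size=32, callbacks=[tb], verbose=0)
```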

Liu et al. developed CNNVis [21], a visual analytics system that helps experts understand, diagnose, and refine deep CNNs for image classification. CNNVis adopts a directed acyclic graph to render the computational flow. The system presents a hybrid visualization to disclose each neuron's multiple facets and their interactions [21]. Each neuron cluster is illustrated as a rectangle, as seen in Fig. 18, comprising several facets that describe learned features, activations, and contributions to the classification result. Each edge represents a connection between neurons. CNNVis adopts a rectangle-packing algorithm to compress the learned features of the neurons into a smaller rectangle (Fig. 18b1). Neuron activations

Fig. 18 CNNVis [21]


Fig. 19 RetainVis [23]

are encoded as a matrix, as seen in Fig. 18b2, using a matrix-reordering algorithm. The two facets can be switched according to the user's selection. CNNVis also introduces a biclustering-based edge-bundling algorithm to reduce the visual clutter caused by dense edges. The tool interprets the CNN-based image classification process in great detail: the graph intuitively illustrates the network structure, while the encoded facets guide model diagnosis and refinement.

Kwon et al. [23] proposed RetainVis, a VA tool for interpreting and interacting with Recurrent Neural Networks (RNN) on electronic medical records. RetainVis provides a dashboard visualization to connect prediction results with the input data, as seen in Fig. 19. Figure 19a details all patients' classification results along with a summary of the patients' attributes; once a cohort is selected (circled in Fig. 19a), the corresponding summary of the chosen patients is shown in Fig. 19b. Figure 19c presents individual patient records using rectangles; users can select a patient of interest there to view details in Fig. 19e. Figure 19d allows what-if analysis of the selected patient by testing hypothetical scenarios. Using RetainVis, health professionals can view the contribution of medical records to a prediction and can evaluate and steer the underlying RNN models.

Although NNs are successful in many AI-based technologies, VA approaches to model performance based on feature and attribute selection remain challenging because of the computational complexity of NNs. This research direction is particularly interesting to domain experts and ML scientists because it can explain the mystery of the black box. TensorBoard [44] and the TensorFlow Graph Visualizer [19] developed fundamental approaches to interpreting NN architectures using interactive network graphs. ActiVis [20] enhanced interpretability by adding exploration of instance-based classification results and investigation of neuron activations at both the class and instance levels. ExplAIner [22] further explained model states by adopting various XAI methods on top of TensorBoard. CNNVis [21] and RetainVis [23] focused on specific NN explanations and helped domain experts understand, diagnose, and refine models for image recognition and for the classification of electronic


medical records. The research prototypes mentioned above have been developed to explain deep learning models; some adopt XAI methods, although recent work suggests that attribution techniques such as saliency methods are not reliable [46]. Moreover, according to Olah et al. [47] from the Google AI team, the current building blocks of interpretability expose only specific aspects of model behavior. Future interpretability research should therefore focus on developing techniques that achieve broader coverage of model behavior [47].
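The raw material behind neuron-activation views such as those in CNNVis and ActiVis can be produced with a few lines. The hedged sketch below extracts an intermediate layer's activations as an instances-by-neurons matrix that a VA tool could then reorder, cluster, or color-encode; the network and inputs are stand-ins, not the models used in those papers.

```python
# A hedged sketch of extracting neuron activations for visualization:
# build a probe model that exposes an intermediate layer, then collect
# an (instances x neurons) activation matrix for a batch of inputs.
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(x)
gap = tf.keras.layers.GlobalAveragePooling2D(name="gap")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(gap)
model = tf.keras.Model(inputs, outputs)

# Probe model exposing the pooled conv activations (one value per filter).
probe = tf.keras.Model(inputs, model.get_layer("gap").output)

images = np.random.rand(64, 28, 28, 1).astype("float32")   # stand-in inputs
activations = probe.predict(images, verbose=0)              # shape: (64, 16)
print(activations.shape)                                     # rows: instances, columns: neurons
```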

3 Global and Local Interpretability

The design of a visual analytics tool is often evaluated in terms of scalability, an essential metric in both its visual and informational aspects. Analysts face massive, multidimensional, and time-varying information streams from multiple sources, yet the important information may be hidden in a few nuggets [6]. With advances in technology we have access to massive amounts of information; however, basic human skills and abilities do not change significantly over time [6]. Therefore, when we are given far more information than we can possibly process as humans, the scalability issue becomes essential for visual analytics scientists [6]. This section focuses on information scalability and visual scalability to detail the global and local interpretability of VA tools for information granules.

3.1 Information Scalability

Thomas and Cook [6] defined information scalability as the capability to extract relevant information from massive data streams. Methods of information scalability include methods to filter and reduce the amount of data, techniques to represent the data in a multiresolution manner, and methods to abstract the data sets.
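Two of the tactics named in this definition, filtering and multiresolution representation, can be sketched as follows on a hypothetical minute-level stream; the threshold and aggregation windows are illustrative assumptions.

```python
# A small sketch of two information-scalability tactics: filtering a large
# stream down to salient rows, and representing it at coarser (multiresolution)
# time scales via aggregation.
import numpy as np
import pandas as pd

rng = pd.date_range("2021-01-01", periods=100_000, freq="min")  # hypothetical minute-level stream
stream = pd.DataFrame({"value": np.random.randn(100_000).cumsum()}, index=rng)

relevant = stream[stream["value"].abs() > 50]        # filter: keep only salient readings
hourly = stream["value"].resample("1h").mean()        # coarser resolution
daily = stream["value"].resample("1D").mean()         # coarsest resolution
print(len(stream), len(relevant), len(hourly), len(daily))
```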

Data scientists have been developing various approaches to address information scalability. In the following, we summarize some popularly used sample selection and dimensionality reduction methods for large-scale data processing, based on the review by Xu et al. [48].

Sample Selection

Global data will grow from 33 ZB in 2018 to 175 ZB by 2025, and nearly 30% of the world's data will need real-time processing [49]. When facing large-scale data, sample selection plays an essential role in generating the proper amount of training and testing data for AI methods. Xu et al. [48] classified sample selection methods into three categories: supervised sample selection, unsupervised sample selection, and sample selection bias. The supervised sample selection methods are


typically tailored to fit the needs of a specific data type. The traditional approach is to randomly draw candidates, score them against some criteria, and pick the top-scored entries. However, the selection criteria vary according to the analytics task. Some of these customized methods address text data, as seen in [50–54]; others [55–57] are related to image data. Recent work on unsupervised sample selection targets the selection of unlabeled training samples to enhance classification results; this approach reduces the massive resources needed for data labeling, a major task in training an AI model. Representative work can be seen in [42, 58–60]. Many of these unsupervised selection methods are based on machine learning techniques such as fuzzy decision trees [59] and fuzzy clustering [60]. Sample selection bias occurs when samples are not randomly selected [48]. Such selection bias often exists in machine learning algorithms because of the theoretical formulation of the methods, as studied by Zadrozny [43] and Wu et al. [61]. Nevertheless, machine learning scientists often exploit sample selection bias to balance imbalanced datasets, as seen in the works of Romero et al. [62] and Krautenbacher et al. [63].

Dimensionality Reduction

Dimensionality reduction (DR) algorithms aim to reduce the multi-dimensionality of a dataset and are often used by both machine learning scientists and visual analytics scientists. Xu et al. [48] separated DR methods into feature selection and feature extraction. Feature selection attempts to find the highest-impact features based on a specific evaluation criterion [48]; this process often relies on machine learning techniques. Feature selection methods include wrappers, filters, and hybrids, classified by their search models [48]. Wrapper methods evaluate feature subsets, often combining clustering algorithms with heuristic searches such as particle swarm optimization [64], ant colony optimization [65], and genetic algorithms [48]. Filter methods are much more common; their ranking and space-search methods often use distance measures to select the majority of the features, and extended methods include Maximum Variance, Laplacian Score, and Fisher Score [48]. Hybrid methods combine the merits of wrapper and filter methods and are the current research focus of feature selection. Feature extraction compresses several original data features into newly defined features in order to reduce the dimensionality of a dataset. Visual analytics scientists often rely on these techniques to render multi-dimensional features in a 2D space. Classic linear approaches include Principal Component Analysis (PCA) [66] and Linear Discriminant Analysis (LDA) [67]. Representative nonlinear techniques, such as Kernel PCA [68], Multidimensional Scaling (MDS) [69], and Isometric Feature Mapping (Isomap) [70], fit complex nonlinear data well [48]. DR and sample selection algorithms are well developed in the literature, and these techniques enable machine learning, especially deep learning, to play a more important role in large-scale data analysis [48].
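For concreteness, the sketch below applies one linear and one nonlinear DR method from the families above to the same dataset, producing the 2-D coordinates a VA tool would render as a scatterplot; the dataset and neighborhood size are illustrative choices.

```python
# A minimal sketch of the two DR families mentioned above: linear PCA and a
# nonlinear manifold method (Isomap), each projecting the data to 2-D
# coordinates that a VA tool could render as a scatterplot.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)             # 64-dimensional inputs

pca_2d = PCA(n_components=2).fit_transform(X)
iso_2d = Isomap(n_components=2, n_neighbors=10).fit_transform(X)

print("PCA projection:", pca_2d.shape)           # (n_samples, 2)
print("Isomap projection:", iso_2d.shape)        # (n_samples, 2)
```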


3.2 Visual Scalability

Visual scalability refers to the capability of a visual representation to effectively display a massive number, or a massive dimensionality, of individual data items [71]. With advances in information scalability, visual analytics tools have benefited from the techniques presented in the literature. Depending on the objectives of a tool, sample selection and DR algorithms can be incorporated effectively into the VA tool for analytical tasks. Many visual approaches have also been developed specifically to support information scalability: data exploration features in tools like Tableau support overviews before and after sample selection, and many visual analytics works are dedicated to DR techniques, as seen in the work of Sacha et al. [72].

The majority of visual analytics tools rely on dashboard visualization to support global and local views of information granules. A global overview of a dataset is typically illustrated using graphs such as scatterplots, map charts, parallel coordinates, network graphs, and matrices. Local investigation is often triggered by visual interactions. Sacha et al. [72] collected the different interaction paradigms, as seen in Fig. 20. The main interaction techniques can be classified into two types, namely direct manipulation and controls. Direct manipulation allows users to interact with a graph directly to move, select, or label data points, or to draw lines between them; most VA tools implement this type of interaction for easy access. Controls are operated through standardized control items such as drop-down lists, sliders, buttons, and command lines. VA tools often adopt both interaction types for the users' convenience.

In Table 2, we summarize all the VA tools reviewed under the strategies above to highlight their visual scalability in the global and local views of their AI methods. We categorize these features based on the adopted graph types and the interaction types described above. Ten out of thirteen tools use dashboard visualization to support global and local interpretations; the other three tools use a single visualization allowing view

Fig. 20 Different interaction paradigms [72]


Table 2 Visual scalability summary. DM: Direct Manipulation; C: Controls

VA tool | Dashboard visualization | Global overview | Local investigation | Interaction types
Tableau [11] | Yes | Graphs in "show me" | Tooltip, filters | DM & C
Neural Document Embedder by Ji et al. [12] | Yes | Scatterplot | Parallel coordinates and scatterplot | DM & C
INFUSE by Krause et al. [13] | Yes | Matrix and scatterplot | Glyph graph | DM & C
FeatureExplorer by Zhao et al. [14] | Yes | Scatterplot and matrix | Bar graph | C
VA Tool for Ensemble Learning by Schneider et al. [15] | Yes | Scatterplot | Table view | DM & C
Squares by Ren et al. [16] | No | Parallel coordinates and stacks | Boxes and table view | DM & C
Clustervision by Kwon et al. [17] | Yes | Scatterplot and stacked bar graph | Parallel coordinates and bar graph | DM & C
FairVis by Cabrera et al. [18] | Yes | Customized bar chart | Histogram and bar graph | DM & C
Tensor Flow Visualizer by Wongsuphasawat et al. [19] | No | Network graph | Expanded node and pop-up window | DM & C
ActiVis by Kahng et al. [20] | Yes | Network graph | Matrix, squares, and scatterplot | DM
CNNVis by Liu et al. [21] | No | Parallel coordinates and customized rectangle stacks | Customized rectangles | DM
explAIner by Spinner et al. [22] | Yes | Network graph | Expanded node and pop-up bar/window | DM & C
RetainVis by Kwon et al. [23] | Yes | Scatterplot, bar, and area graphs | Customized rectangles and bar, line, and area graphs | DM & C


expansion to enable visual scalability. Scatterplots are often the choice for showing data distribution trends after the DR process, while parallel coordinates are suitable for illustrating multi-class classification or multi-layer NNs. Direct manipulation is the primary choice of visual interaction; nine out of thirteen tools combine direct manipulation with controls. Visual analytics techniques are thus capable of interpreting AI methods both globally and locally for information granules. The level of interpretation is often switched interactively via dashboard visualization and visual interactions, so analysts can control the stage of the analysis and bring maximum human intuition to bear.
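The global/local pattern summarized in Table 2 can be sketched minimally as a static stand-in for an interactive dashboard: a DR scatterplot provides the global overview, and a parallel-coordinates plot details a "brushed" subset. Because there is no interaction in a static script, the brushed subset is simply a random sample; the dataset and layout are illustrative assumptions.

```python
# A hedged sketch of the global/local pattern in Table 2: a scatterplot gives
# the global overview after DR, and a parallel-coordinates plot details a
# locally selected ("brushed") subset of instances.
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris(as_frame=True)
df = data.frame.rename(columns={"target": "class"})

# Global overview: 2-D PCA scatterplot of all instances.
xy = PCA(n_components=2).fit_transform(df.drop(columns="class"))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(xy[:, 0], xy[:, 1], c=df["class"])
ax1.set_title("Global overview (PCA scatterplot)")

# Local investigation: parallel coordinates of a stand-in "brushed" subset.
parallel_coordinates(df.sample(30, random_state=0), "class", ax=ax2)
ax2.set_title("Local investigation (parallel coordinates)")
plt.tight_layout()
plt.show()
```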

4 Stability of Explanation

The stability of the explanations interpreted by VA tools can be validated through three approaches commonly utilized to evaluate the usability of a visualization tool or system: design, evaluation, and reusability. VA tools are designed to support end-users such as analysts and domain experts, who often have basic knowledge about ML and use it to analyze their domain data. The validation approaches in design, evaluation, and reusability are all user-oriented: analysts are the targeted audience of a tool and participate in the design and evaluation stages of its development. The majority of VA tools focus on feature selection, model performance, or model structure, because these components are heavily associated with the domain data, where domain experts can quickly understand, diagnose, and refine an ML model.

Design is the initial step of visualization development. A majority of the VA literature presents well-defined design goals for the targeted analytics task based on the characteristics of a dataset. The goals typically involve visual, scalable, and interactive design approaches, along with analytical tasks to justify the designs. Some of these design approaches include meetings or surveys with domain experts to outline specific user requirements. The goals serve as a major principle in developing an effective visualization tool and are often compared against or highlighted in the VA implementation to strengthen the evaluation process.

Evaluation is a fundamental component of validating a VA tool, and for it VA scientists rely heavily on user studies. Liu et al. [73] stated: "User studies offer a scientifically sound method to measure visualization performance. As a result, they are an important means to translate laboratory Information Visualization research into practical applications." Lam et al. [74] further organized evaluation into seven scenarios: understanding environments and work practices, evaluating visual data analysis and reasoning, evaluating communication through visualization, evaluating collaborative data analysis, evaluating user performance, evaluating user experience, and evaluating visualization algorithms. Depending on the


evaluation scenario, the user study involves human subjects from the targeted audiences, such as domain experts, machine learning scientists, or randomly selected inexperienced users. The complexity of a study can range from comment and feedback collection to quantitative statistical analysis of specifically designed evaluation tasks.

Reusability is another way to validate the stability of VA tools. This approach is often observed in research focused on AI model performance, as seen in Clustervision [17], where the authors applied the tool to several unlabeled datasets to evaluate its clustering performance. We summarize the design, evaluation, and reusability approaches used in the reviewed publications in Table 3 to highlight the validation process in developing a VA platform. All the reviewed papers include design goals and/or analytics tasks to guide their visualization developments. All the platforms were evaluated through user studies based on use cases. All of these studies involved domain experts as human subjects; however, the evaluation scenarios varied depending on the analytics tasks. Four out of thirteen papers did not present a reusability test.

5 Visual Analytics for Granular Computing

Granular computing has become an essential field in recent years to address the challenge of the 3V characteristics of Big Data [75]. Information granules, produced by the process of granular computing, refer to data or information that are divided based on their similarity, functional or physical adjacency, and indistinguishability [76]. Much research [10, 75] has shown that granular computing in data analytics supports human-data interaction (HDI). Wilke and Portmann [77] argue that the ability to represent and reason with information granules is a prerequisite for data legibility, and also state:

Humans can process their data on different levels of granularity…. They can also offer other humans access to their data, again, at different self-determined granularity levels. These levels mark a relative size, scale, level of detail, or depth of penetration that characterizes certain data, which can also be used for analytics reasons.

Therefore, the ultimate objective of granular computing is similar to that of visual analytics, as VA aims to help with analytical reasoning by enabling human-information discourse. VA also presents many advantages for data legibility and human-centered computing through the cognitive intuition gained from interactive visualizations. However, granular computing is an emerging paradigm in computing and applied mathematics, while VA focuses on the visual representation of data and the underlying computational methods. The two fields can be well integrated to support intelligence amplification, which, according to Wilke and Portmann [77], merges computational and human intelligence and requires collaborative, iterative, interactive, and intuitive feedback. Figure 21 illustrates this idea by incorporating visual interaction as the analysts' interface for HDI. This visual interaction can be used at every stage of granular


Table 3 Evaluation approach summary. N.A.: Not Accessible; AT: Analytics Tasks; DC: Design Components; DG: Design Goals; C: Challenges

VA tool | Design | User studies | Reusability
Tableau [11] | 3 AT | User activity is monitored and studied via publicly released software | Approved by public usage with different datasets
Neural Document Embedder by Ji et al. [12] | 4 DC supported by 6 AT | 2 domain experts in healthcare with comment collections | None
INFUSE by Krause et al. [13] | 4 AT | A team of domain experts in healthcare with comment collections | None
FeatureExplorer by Zhao et al. [14] | 4 DG | One domain expert with feedback collection | 2 datasets
VA Tool for Ensemble Learning by Schneider et al. [15] | 2 AT | Tool-based quantitative performance analysis with hundreds of classifiers and models | Through benchmark datasets
Squares by Ren et al. [16] | Survey and 3 DG | Quantitative performance analysis involving human subjects | Through several datasets
Clustervision by Kwon et al. [17] | 5 DG | 7 domain experts from 2 domains with comment collections | Through datasets from different domains
FairVis by Cabrera et al. [18] | 4 DG supported by 6 AT | Tool-based quantitative performance analysis involving human subjects | None
Tensor Flow Visualizer by Wongsuphasawat et al. [19] | 5 AT | Comments and feedback collected from domain experts and public usage | Approved by public usage
ActiVis by Kahng et al. [20] | 3 DG supported by 6 C | 3 domain experts with comment collections | 3 different datasets and models from the experts themselves
CNNVis by Liu et al. [21] | 5 DG | 2 domain experts with comment collections | None
explAIner by Spinner et al. [22] | 8 DG | 9 participants from novice, user, and developer groups respectively | 3 use cases
RetainVis by Kwon et al. [23] | 7 AT | 1 domain expert | Electronic Medical Records from Korea


Fig. 21 Intelligence amplification combined with visual analytics, modified after [77]

computing to interpret both granular models and granular data. Nevertheless, although the VA field has been shown to support explanations of AI, the current literature presents very few VA tools that support granular computing. In [77], granular computing is used as a basis for human-data interaction in a collaborative urban planning use case in a cognitive city environment; an iterative process of user input, supported by interactive visualization and combined with human-oriented data processing driven by a spatial granular calculus of granular geometry, can effectively support collective decision-making. Toyota and Nobuhara [87] developed a network visualization to interpret a hierarchical network of laws using morphological analysis and granular computing; the system was confirmed to let users easily analyze and understand the network structure of the laws.

The fundamental techniques of granular computing provide various frameworks of information granules, such as fuzzy sets [78–81], shadowed sets [82, 83], and rough sets [83–86]. A potential VA platform could employ such granular models as its underlying computing models to support VA research for Big Data. The VA process demonstrated in Fig. 4 can be adopted in this case to select an appropriate granular model through interactive visual interfaces. On the other hand, multi-layer interactive graphs can also help interpret information granules and the relationships between them to reveal underlying linkages. A customized VA platform in this area would augment intelligence amplification, as shown in Fig. 21, for HDI. Granular computing can help explain AI by dividing the underlying data into granules, an essential way to address the challenge of the 3V characteristics of Big Data. When combined with VA techniques, it would amplify data legibility and human


intuition for HDI, and consequently strengthen the interpretations offered by XAI. Therefore, future research that integrates the techniques of VA and granular computing is critical for the fields of VA, granular computing, and XAI.
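As a toy illustration of information granules as underlying computing models (not a re-implementation of any specific granular framework cited above), the sketch below forms granules by clustering and attaches fuzzy membership grades, the kind of granular structure a VA layer could then expose through interactive views. The dataset, number of granules, and fuzzifier are assumptions.

```python
# A hedged illustration of fuzzy-set information granules: cluster the data,
# then compute the degree to which each instance belongs to every granule.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

k = 3
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

# Fuzzy membership of each instance in each granule (fuzzifier m = 2):
# u[i, c] = 1 / sum_j (d(i, c) / d(i, j))^2
dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
u = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** 2, axis=2)

print(u[:5].round(3))       # membership grades of the first five instances
print(u.sum(axis=1)[:5])    # rows sum to 1 across the three granules
```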

6 Summary

Visual analytics, which integrates computational models and human intuition, aims to maximize the human capacity to explore and understand the complexity of Big Data and the underlying ML models. The predictive visual analytics pipeline, which evolved from the ML pipeline, consists of various visual approaches for interpreting the different stages of the pipeline. Each visual approach is developed based on an understanding of the reasoning process and of the underlying cognitive and perceptual principles [6]. The stages of the pipeline represent the components of information granules in VA; we can then categorize the interpretability of VA by each component that contributes to a model explanation. This chapter reviewed 13 top-tier publications in the recent VA field to demonstrate how VA techniques can be appropriately used in XAI. Visual interaction allows analysts to have a real discourse with AI models in feature selection, performance analysis, model architecture, and the computational process. Customized graph design enables both global and local model interpretations. The stability of the interpretation is validated through human-based user case studies, design requirements, and reusability tests. In the future, visual analytics scientists can make significant contributions to XAI through interactive visualizations, by visually interpreting broader ML model behaviors and by incorporating granular computing techniques. This is an area that clearly needs more attention to gain public trust in AI.

References 1. Wong, P.C., Thomas, J.: Visual analytics. IEEE Comput. Graph. Appl. 5, 20–21 (2004) 2. Sacha, D., Sedlmair, M., Zhang, L., Lee, J.A., Peltonen, J., Weiskopf, D., Keim, D.A., et al.: What you see is what you can change: human-centered machine learning by interactive visualization. Neurocomputing 268, 164–175 (2017) 3. Earnshaw, R.A., Dill, J., Kasik, D.: Data Science and Visual Computing. Springer International Publishing (2019) 4. Liu, S., Wang, X., Liu, M., Zhu, J.: Towards better analysis of machine learning models: a visual analytics perspective. Vis. Inf. 1(1), 48–56 (2017) 5. Lu, J., Chen, W., Ma, Y., Ke, J., Li, Z., Zhang, F., Maciejewski, R.: Recent progress and trends in predictive visual analytics. Front. Comput. Sci. 11(2), 192–207 (2017) 6. Cook, K.A., Thomas, J.J.: Illuminating the path: The research and development agenda for visual analytics (No. PNNL-SA-45230). Pacific Northwest National Lab. (PNNL), Richland, WA (United States) (2005) 7. Chatzimparmpas, A., Martins, R.M., Jusufi, I., Kerren, A.: A survey of surveys on the use of visualization for interpreting machine learning models. In: Information Visualization, 1473871620904671 (2020).


8. Arrieta, A.B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Chatila, R., et al.: Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fus. 58, 82–115 (2020) 9. Gillies, M., Fiebrink, R., Tanaka, A., Garcia, J., Bevilacqua, F., Heloir, A., d’Alessandro, N., et al.: Human-centred machine learning. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 3558–3565 (2016) 10. Pedrycz, W.: Granular computing for data analytics: a manifesto of human-centric computing. IEEE/CAA J. Autom. Sinica 5(6), 1025–1034 (2018) 11. Mackinlay, J., Hanrahan, P., Stolte, C.: Show me: automatic presentation for visual analysis. IEEE Trans. Visual Comput. Graph. 13(6), 1137–1144 (2007) 12. Ji, X., Shen, H.W., Ritter, A., Machiraju, R., Yen, P.Y.: Visual exploration of neural document embedding in information retrieval: semantics and feature selection. IEEE Trans. Visual Comput. Graph. 25(6), 2181–2192 (2019) 13. Krause, J., Perer, A., Bertini, E.: INFUSE: interactive feature selection for predictive modeling of high dimensional data. IEEE Trans. Visual Comput. Graph. 20(12), 1614–1623 (2014) 14. Zhao, J., Karimzadeh, M., Masjedi, A., Wang, T., Zhang, X., Crawford, M.M., Ebert, D.S.: FeatureExplorer: interactive feature selection and exploration of regression models for hyperspectral images. In: 2019 IEEE Visualization Conference (VIS), pp. 161–165. IEEE (2019) 15. Schneider, B., Jäckle, D., Stoffel, F., Diehl, A., Fuchs, J., Keim, D.: Integrating data and model space in ensemble learning by visual analytics. In: IEEE Transactions on Big Data (2018) 16. Ren, D., Amershi, S., Lee, B., Suh, J., Williams, J.D.: Squares: supporting interactive performance analysis for multiclass classifiers. IEEE Trans. Visual Comput. Graph. 23(1), 61–70 (2016) 17. Kwon, B.C., Eysenbach, B., Verma, J., Ng, K., De Filippi, C., Stewart, W.F., Perer, A.: Clustervision: visual supervision of unsupervised clustering. IEEE Trans. Visual Comput. Graph. 24(1), 142–151 (2017) 18. Cabrera, Á.A., Epperson, W., Hohman, F., Kahng, M., Morgenstern, J., Chau, D.H.: FairVis: visual analytics for discovering intersectional bias in machine learning. In: 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 46–56. IEEE (2019) 19. Wongsuphasawat, K., Smilkov, D., Wexler, J., Wilson, J., Mane, D., Fritz, D., Wattenberg, M., et al.: Visualizing dataflow graphs of deep learning models in tensorflow. IEEE Trans. Vis. Comput. Graph. 24(1), 1–12 (2017) 20. Kahng, M., Andrews, P. Y., Kalro, A., & Chau, D. H. P.:Activis: visual exploration of industryscale deep neural network models. IEEE Trans. Vis. Comput. Graph. 25(1), 88-97 (2017). 21. Liu, M., Shi, J., Li, Z., Li, C., Zhu, J., Liu, S.: Towards better analysis of deep convolutional neural networks. IEEE Trans. Visual Comput. Graph. 23(1), 91–100 (2016) 22. Spinner, T., Schlegel, U., Schäfer, H., El-Assady, M.: explAIner: a visual analytics framework for interactive and explainable machine learning. IEEE Trans. Vis. Comput. Graph. 26(1), 1064–1074 (2019) 23. Kwon, B.C., Choi, M.J., Kim, J.T., Choi, E., Kim, Y.B., Kwon, S., Choo, J., et al.: Retainvis: visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Trans. Vis. Comput. Graph. 25(1), 299–309 (2018) 24. Resnikoff, H.L.: The illusion of reality. Springer Science & Business Media (2012) 25. 
Larkin, J.H., Simon, H.A.: Why a diagram is (sometimes) worth ten thousand words. Cognit. Sci. 11(1), 65–10 (1987) 26. Norman, D.: Things that make us smart: defending human attributes in the age of the machine. Addison–Wesley, Reading (1993) 27. Tufte, E.R.: The Visual Display of Quantitative Information (1983) 28. Card, S. K., Robertson, G. G., & Mackinlay, J. D. (1991, March). The information visualizer, an information workspace. In: Proceedings of the SIGCHI Conference on Human factors in computing systems, pp. 181–186. 29. Bauer, M., Kortuem, G., Segall, Z.: “Where are you pointing at?” A study of remote collaboration in a wearable videoconference system. In: Digest of Papers. Third International Symposium on Wearable Computers, pp. 151–158. IEEE (1999)



Visualizing the Behavior of Convolutional Neural Networks for Time Series Forecasting

Janosch Henze and Bernhard Sick

Abstract In recent years, Neural Networks and especially Deep Neural Networks (DNNs) have seen a rise in popularity. DNNs have increased the overall performance of algorithms in applications such as image recognition and classification, 2D and 3D pose detection, natural language processing, and time series forecasting. Especially in image classification and recognition, so-called Convolutional Neural Networks (CNNs) have gained high interest as they can reach high accuracy, which makes them a viable solution in cancer detection or autonomous driving. As CNNs are widely used in image tasks, different visualization techniques have been developed to show how their internals are working. Apart from image tasks, CNNs are applicable to other problems, e.g., time series classification, time series forecasting, or natural language processing. CNNs in those contexts behave similarly, allowing for the same visualization techniques to make them more interpretable. In this chapter, we adapt image visualization algorithms to time series problems, allowing us to build granular, intuitively interpretable feature hierarchies to make a time series forecast as understandable as an image recognition task. We do so by using our previous work on power time series forecasting using CNN Auto Encoders (AEs) and applying typical CNN visualization techniques to it. Thus, we guide computer scientists to provide more interpretable figures for a time series forecasting task to application domain experts.

Keywords Visualization · Convolutional neural networks · Time series data

J. Henze (B) · B. Sick Intelligent Embedded Systems Lab, University of Kassel, Kassel, Germany e-mail: [email protected] B. Sick e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_3


1 Introduction

Improved interpretability and explainability of Artificial Intelligence (AI) methods leverage acceptance of neural networks in real-world applications. Thus, the reasoning behind AI becomes understandable to a domain expert [1], which in turn helps to improve the acceptance of new applications using AI. Interpretability and understanding of such methods, even for non-computer science experts, is becoming more and more crucial, as AI applications enter fields such as medical devices or automated driving [2]. In those fields, understanding and documentation of the decision-making processes are crucial, and interpretability is vital. CNNs are well known for identifying patterns and shapes in images [3–5]. In image classification tasks, CNNs produce so-called feature hierarchies, providing access to the knowledge contained in an image by learning abstract features within an increasing number of feature hierarchies [6]. These different abstraction levels of an image allow for a granulation of image areas, resulting in access to more abstract features of an image. Analyzing these bottom-up feature hierarchies and their information makes the reasoning of complex CNN models more accessible and understandable [7]. Most of the available applications use well-known, pre-trained feature hierarchies to extract different features of the image, e.g., faces, eyes, animals, or objects [6]. Each kernel that convolves an image thereby acts as a feature extractor for the input of the respective layer. At the same time, many different visualization methods for CNNs exist that are able to interpret what a particular feature hierarchy describes in an image [8]. As mentioned earlier, CNNs are used in a variety of tasks such as natural language processing, time series classification, or time series forecasting [9–12]. In these cases, the inputs consist of one-dimensional features, e.g., a sentence or a sensor value. In other settings, especially in time series forecasting, we may have several of those one-dimensional features as inputs for a forecasting task. Forecasting tries to generate new information for applications such as power grid operation, autonomous driving, industrial automation, or stock market prices. In such applications, CNNs can also be used as feature extractors [11], which learn feature hierarchies similar to the ones in images. Each feature hierarchy uses the granular information of the input data to create more abstract information about the time series or the image. Such abstract information in the feature hierarchies can contain different information granules depending on the abstraction level. These feature hierarchies may contain information about the slope of a time series in the early stages and learn more precise information about time series patterns later on. Therefore, it is possible to utilize similar visualization techniques as used for images, allowing them and their reasoning to become more interpretable. This chapter aims to address the issue of the interpretability of time series forecasts using CNNs. We show how we can adapt image visualization methods to CNNs in a forecasting setting. We provide insights into the information stored on each feature hierarchy. In the end, we provide a way to determine the influence of each input feature on the forecast itself. Thus, we enable computer scientists to explain the reasoning and internals of a CNN forecasting task.


We start by providing an introduction to Artificial Neural Networks (ANNs), and especially CNNs, followed by a brief introduction to time series forecasting in Sect. 2. Afterward, we provide an overview of current work in the field of visualization and interpretability of CNNs and time series in Sect. 3. The foundational part is followed by Sect. 4, briefly introducing the experiment we performed to generate our visualizations, the data, and the neural network setup. The main part, in Sect. 5, presents different approaches to visualize the components of CNNs. Each visualization explains a specific part of the neural network, i.e., input, output, or internal representations of the data. In the end, in Sect. 6, we conclude our work on the interpretability of time series forecasting and provide the reader with an easy-to-follow approach on how to visualize their CNNs for time series forecasts.

2 Introduction to Neural Networks and Forecasting

This section gives a brief introduction to neural networks, especially CNNs, and forecasting. We start by explaining time series and time series forecasting. Afterward, we give an overview of ANNs, in particular Deep Learning (DL), with a focus on AEs and CNNs. Finally, we introduce the Convolutional AutoEncoder (CNN AE), how it can provide feature hierarchies, and how we can use a CNN AE to forecast power time series.

2.1 Power Time Series Forecasting

Forecasting is the act of determining currently unknown future values, e.g., of a time series. Such a forecast is created using some input data, e.g., the time series itself, or other exogenous information. In our case, we mainly work with power time series. Hence, we try to forecast future power consumption or generation using historical Numerical Weather Prediction (NWP) data as inputs. The process of forecasting power time series consists of two steps. The first step forecasts the weather features [13], typically with time steps of up to k = 72 h into the future. As creating these NWP features is computationally expensive, we assume this data to be given during our work. The second step uses some of these NWP features to forecast future power consumption or generation. The second step is typically done with methods such as non-linear regression, Support Vector Machines (SVMs), or ANNs. All of these methods use the NWP features to map historical and current weather situations to future power generation or consumption. Some of these methods can be adapted to use values of the forecasted time series itself in the forecast process, therefore getting an autoregressive component. In our case, during the forecast step, we try to connect historical NWP data to future power generation data. This regression problem has a time dependency due to the NWP data.


The overall process is a time series problem if we include the information about the forecasted time series itself [13]. Both kinds of data, the input data (NWP) and the output data (power generation), are considered a time series, which ultimately is an ordered list of tuples. Each of the tuples consists of a timestamp t ∈ R and a feature vector x ∈ R^D, which gathers all D data points for that time step. In the case of the power time series, x is a scalar, i.e., D = 1. In the case of an NWP time series, it is a mixture of different weather features, such as wind speed, wind direction, or solar irradiance. The time steps can be equidistant or not. For the sake of simplicity, we assume that they are equidistant. As already elaborated, we try to find a set of parameters that map our input time series onto our target time series. Here, we map inputs from an NWP time series onto a power time series. The simplest case, a regression, is shown in Eq. 1. Here, we want to find a set of weights w that, multiplied with the input NWP data x, results in the current power, allowing for a particular variance ε [14]:

Power(x) = w^T x + ε.  (1)

Performing such a regression task with a (deep) neural network is very similar. During a training phase, we try to learn the parameters of a neural network to generate our desired output, i.e., a power time series, based on the input, an NWP time series. From a fundamental viewpoint, most neural networks use several hierarchically structured regression models, as shown in Eq. 1, with non-linear activation functions. Due to this combination, neural networks are capable of learning nonlinear relations between the NWP time series and our output time series.
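To make the connection between Eq. 1 and a neural network concrete, the following minimal sketch (our illustration, not the code from this chapter; the feature count and layer sizes are made up) expresses the regression and a small non-linear network in PyTorch:

import torch
import torch.nn as nn

n_features = 7  # e.g., wind speed, wind direction, pressure, temperature (illustrative)

# Eq. 1: Power(x) = w^T x + eps, expressed as a single linear layer
linear_model = nn.Linear(n_features, 1)

# Several hierarchically structured regressions with non-linear activations in between
mlp_model = nn.Sequential(
    nn.Linear(n_features, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

x = torch.randn(32, n_features)  # a batch of 32 NWP feature vectors
print(linear_model(x).shape)     # torch.Size([32, 1])
print(mlp_model(x).shape)        # torch.Size([32, 1])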

2.2 Neural Networks

Neural networks come in different variants, originally as single-layer perceptrons [15] for solving binary classification tasks. They evolved into multilayer perceptrons (MLPs), also known as backpropagation networks, which connect many neurons, and evolved again more recently in the wake of Deep Learning [14, 16]. Even if simple MLPs were quite powerful already, it took until recently to be able to use neural networks for large datasets with high accuracy, e.g., for classification or forecasting tasks. Until a few years ago, problems such as vanishing or exploding gradients occurred, and not only with backpropagation through time for recurrent neural networks. Today these problems have mostly been solved by using different regularization techniques, e.g., the Rectified Linear Unit (ReLU) or dropout layers, new architectures such as Long Short Term Memory Networks (LSTMs), CNNs, or AEs, and new, faster hardware to train neural networks [17].


In this work, we focus on forecasting using representation learning. Representation learning tries to automatically learn those features which are important for the current machine learning task. In this chapter, we use CNN AEs to learn such a representation, and an MLP to forecast.

2.2.1 Auto Encoder

A neural network learns a representation by providing so-called feature hierarchies. A feature hierarchy is a set of layers within the neural network consisting of fully connected linear layers, convolutional layers, or pooling layers. A very common method that uses such hierarchies is the AE, as shown in Fig. 1. An AE learns feature hierarchies in each layer. By decreasing the layer size, it increases the abstraction power of each layer until the bottleneck is reached. The bottleneck contains condensed information about the original input data, which provides better forecasting power than the initial input data [11]. The layers of the AE change depending on the kind of AE we want to construct. In the simplest case, linear layers combined with non-linear activation functions are sufficient. More complex AEs can consist of a combination of convolutional layers, dropout layers, and different activation functions, or be designed to learn a distribution at the bottleneck, as in a Variational Auto Encoder (VAE). Feature hierarchies have to be trained over several epochs (training iterations) before representing information on each level. Each pre-trained representation consists of several feature hierarchies, each of which abstracts the data from the previous feature hierarchy. In image processing, for example, this can be an abstraction from the pixels to edges, up to the abstraction of edges to recognizable objects. In time series, it can be slopes or patterns, such as a sudden increase and a slow drop of the time series. To leverage the learned representation at the bottleneck in an application setting, the AE is split after the bottleneck. Another network is then attached to the bottleneck, e.g., a fully connected feed-forward neural network for forecasting.

Fig. 1 An exemplary structure of an autoencoder (data flows from the input through the encoder to the bottleneck with its latent features, and back through the decoder to the output)
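As a minimal sketch of the structure in Fig. 1 (not the chapter's actual model; all layer sizes are assumptions), an AE can be written in PyTorch as an encoder and a decoder around a small bottleneck:

import torch
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    """Fully connected AE: the encoder shrinks the representation towards the
    bottleneck, the decoder mirrors the encoder to reconstruct the input."""

    def __init__(self, n_inputs=7, bottleneck=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 5), nn.ReLU(),
            nn.Linear(5, bottleneck), nn.ReLU(),  # latent features at the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 5), nn.ReLU(),
            nn.Linear(5, n_inputs),               # reconstruction of the input
        )

    def forward(self, x):
        z = self.encoder(x)      # latent features
        return self.decoder(z)   # reconstruction x_hat

ae = SimpleAutoencoder()
x = torch.randn(10, 7)
x_hat = ae(x)                    # same shape as x

To use the bottleneck representation for forecasting, the encoder part would be kept and a forecasting network attached to its output, as described above.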

2.2.2 Convolutional Neural Networks

To allow for temporal feature extraction with the introduced AE, we can build the layers of the AE using convolutional layers. A CNN is a neural network that uses a set of kernels which convolve the input, as shown in Fig. 2. During a convolution, the kernel is moved along the temporal dimension of the input data. In this work, we use 1D convolutions, as such convolutions move over the temporal dimensions of our input time series individually. Furthermore, we use an AE structure, allowing us to ultimately reduce the original input dimension towards the bottleneck. A schematic overview of a CNN AE is shown in Fig. 3. As seen in the figure, each layer reduces the number of features while trying to keep the number of time steps the same. In this configuration, the CNN AE creates a more abstract view of the input data by combining different features. In the case of 1D convolutions in a time series setting, the kernel moves over the input time series and determines features in the temporal domain.

Fig. 2 Exemplary convolution of a kernel of size 3 as happening in a 1D CNN. Besides the convolution along the time axis (t0 ... t9), we also display the optional padding

Fig. 3 Exemplary CNN AE. This CNN AE contains two encoder layers with a latent feature size of 4 at the bottleneck
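The 1D convolution sketched in Fig. 2 corresponds to PyTorch's Conv1d. The following sketch (illustrative; the kernel size and padding differ from the chapter's exact settings in Table 1) shows how a kernel slides along the time axis of a (features, time steps) tensor and how the channel count shrinks layer by layer, as in Fig. 3:

import torch
import torch.nn as nn

n_features, time_steps = 7, 24
x = torch.randn(1, n_features, time_steps)  # (batch, features, time)

# One convolutional layer: 6 kernels of size 3; padding=1 keeps all 24 time steps
conv = nn.Conv1d(in_channels=n_features, out_channels=6, kernel_size=3, padding=1)
print(conv(x).shape)                         # torch.Size([1, 6, 24])

# Stacking such layers with decreasing channel counts gives a CNN AE encoder
encoder = nn.Sequential(
    nn.Conv1d(7, 6, kernel_size=3, padding=1), nn.LeakyReLU(),
    nn.Conv1d(6, 4, kernel_size=3, padding=1), nn.LeakyReLU(),
    nn.Conv1d(4, 2, kernel_size=3, padding=1), nn.LeakyReLU(),  # 2 latent features
)
print(encoder(x).shape)                      # torch.Size([1, 2, 24])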

2.2.3 Forecasting Using an AutoEncoder

When forecasting using an AE, we require a two-stage training process. During the first stage, a reconstruction X̂ of the input data X is learned using the complete autoencoder. This stage captures structure within the data and allows us to learn feature hierarchies. After the first training stage is completed, we proceed to the second stage, the forecasting stage. In this second stage, the AE is first split at the bottleneck, separating it into an encoder and a decoder. The encoder allows us to encode our input data into a feature representation Z = encoder(X). In addition to the encoder, we create a fully connected feed-forward network using the representation Z as an input to forecast our target y. As we have already trained the encoder, the weights of the encoder can be fixed during the forecasting network training. Another option is to fine-tune the representation during the training of the forecasting network. This is done by not fixing all of the weights of the encoder, but allowing the weights of specific layers of the encoder to be adjusted during the forecast network training. This technique allows us to learn a representation, e.g., from an NWP, that can be adapted for several wind power plants. More details on AEs and their use in forecasting can be found in [11].
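A rough sketch of the second stage (our own illustration under the assumption that a trained AE with an encoder producing 2 latent features over 24 time steps is available): the encoder is reused and a fully connected network maps Z = encoder(X) to the 24 h forecast.

import torch
import torch.nn as nn

class ForecastModel(nn.Module):
    """Encoder taken from a trained (CNN) AE plus a fully connected forecast head."""

    def __init__(self, encoder, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Flatten(),                  # (batch, 2, 24) -> (batch, 48)
            nn.Linear(48, 54), nn.LeakyReLU(),
            nn.Linear(54, 24),             # 24 h power forecast
        )
        if freeze_encoder:                 # fix the learned representation ...
            for p in self.encoder.parameters():
                p.requires_grad = False    # ... or skip this to fine-tune it

    def forward(self, x):
        z = self.encoder(x)                # Z = encoder(X)
        return self.head(z)

# Usage, assuming `trained_encoder` maps (batch, 7, 24) to (batch, 2, 24):
# model = ForecastModel(trained_encoder)
# y_hat = model(torch.randn(10, 7, 24))   # (10, 24) forecast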

3 Relevant Literature

Previous work in the area of CNNs for time series forecasting can be divided into two categories. The first category contains information about visualization techniques. It gives an overview of algorithms currently in use for the visualization of neural networks, and especially CNNs. In the second category, we show the applications of CNNs to time series problems.


This allows us to identify individuals working with the results of CNNs and, therefore, possible individuals with a need to understand our visualizations. After this section, we have a clear overview of our stakeholders and the different visualization techniques available for CNNs with a focus on time series data. Applications of CNNs to image tasks are vast; exemplary articles are [6, 18, 19]. The authors of [6] use CNNs to increase the mean average precision by more than 30% on the PASCAL VOC dataset using region proposals and CNNs. They do this by using bottom-up region proposals and pre-trained CNNs with fine-tuning on the target task. In [18], the authors aim to visualize and improve the understanding of CNNs in a diagnostic setting. Furthermore, they identify the contribution of each layer to the performance increase. By visualizing the different layers, they are able to debug problems in the model, which ultimately improved the model. The authors of [19] use a CNN consisting only of convolutions instead of pooling layers. The convolutions only use different strides in the layers, ultimately simulating a pooling layer. They furthermore use the visualization technique provided by [18] to display the learned features. Such visualizations were integrated by [20] into the CNN Explainer, a tool which allows for easy, high-level visualization and interpretation of CNNs on image data during the training and application phase. The tool helps scientists and practitioners to understand the reasoning behind image classifications by providing interactive visualizations at different CNN layers. Therefore, the tool helps to explain the interactions of activations and individual convolutional layers. With the help of questionnaires and interviews, the authors show how their CNN Explainer helps to understand the internal mechanisms of a CNN. The area of different CNN applications in time series tasks, such as classification and forecasting, is huge (examples are [11, 21–27]). All these applications mainly use CNNs as feature extractors with a second stage performing either forecasting or classification. The authors of [21] use CNNs to analyze financial time series to classify the price trend of the time series. To be able to apply 2D convolutions, they map the time series into a 2D space using different mappings. They show that by using these mappings in conjunction with 2D convolutions, they achieve excellent trading simulation performance. In [22], a so-called fully convolutional network is used to perform time series classification. The fully convolutional network consists of a combination of a convolution, a batch normalization, and a ReLU without any pooling operations. In the end, all feature hierarchies are fed into a global average pooling layer. With this approach, the authors achieve state-of-the-art performance in time series classification. Furthermore, the authors can provide class activation maps to determine which input regions contribute to a classification decision. The authors of [23] combine CNNs and LSTMs to extract spatio-temporal information to forecast household power consumption. The CNN is used for spatial information extraction and the LSTM is used for temporal feature extraction.


To further explain the functionality of the CNN-LSTM, they visualize the kernel outputs. Ultimately, their proposed model outperforms a standard LSTM approach. The authors of [24] use a CNN to perform human activity recognition. They use the CNN to learn a feature abstraction from high-dimensional input data, creating a high-level abstraction of the low-level raw time series signals. They show that the CNN outperforms more traditional methods. They attribute the performance to the features learned by the CNN, which increase the discriminative power of the model with respect to human activities. Additionally, all activities are described in one model, instead of having different models for specific tasks. The authors of [25] apply CNNs to perform condition monitoring for milling tools. As input data, they use spectrograms of audible sounds recorded during the machining process. In addition to the condition monitoring, they use visualization techniques to obtain insights into the tool wear prediction process. With the help of the visualization, they found that frequency features were more important than features that refer to the signal's time domain. The authors of [26] use CNNs for time series classification. As CNNs need large amounts of data for training, they propose a semi-supervised data augmentation training method to overcome the need for high amounts of data. By performing these augmentations, they improved the classification results on time series with small amounts of data. From the referenced literature in this section, we can see that much knowledge on the interpretability of neural networks exists. However, the available knowledge is often limited to specific research fields, such as image classification or object recognition. Applications of CNNs were in the financial sector, power sector, human activity recognition, and predictive maintenance. Most of these applications use CNNs directly as feature extractors instead of using an unsupervised stage involving an Auto Encoder. In time series applications, the interpretability gain through visualizations of CNNs is often not leveraged. Hence, our contribution is to use our knowledge about deep learning [10, 11, 27–29], and especially representation learning, to provide the readers of this work with different techniques to make their CNNs more interpretable. We focus on a forecast setting using a CNN AE to learn a feature representation, and a separate fully connected neural network as the forecasting stage. We will leverage different visualization techniques to visualize the different feature hierarchies, as well as the inputs and outputs. Thus, we provide a guide to different visualizations, how to create them, and how to interpret them.

4 Training the CNN AE

This section gives information about the training process of a CNN AE. As shown in Fig. 4, the training of a CNN AE is separated into two stages. The first stage only trains the AE part, and the second stage trains the forecast layer. Both stages and their layer composition are explained here in detail. After explaining the training of the two stages, we visualize the options shown in Fig. 4 in Sect. 5.


4.1 Experiment

In Fig. 4, we see the experiment we use to guide us through the different visualization techniques. The experiment is a two-stage process, with three entry points for visualizations. During the first stage, we learn a representation of our input data. In this stage, we can visualize the kernels and the encoded input data. This first view of the feature hierarchies allows us to get an impression of the important patterns and activations needed to reconstruct the input data. During the second stage, we separate the encoder and decoder of the AE. We then attach a fully connected neural network to the encoder. We limit the trainable layers of the encoder to the last layer before the bottleneck and the bottleneck itself. The attached forecasting neural network is also trained. This modification and limitation of the neural network allows us to fine-tune the learned feature hierarchies at the lower levels to the forecasting task. Furthermore, it allows the forecasting network to create precise forecasts. After the second stage, we can gain insightful visualizations of the forecast, as the learned feature hierarchies have been trained for the forecast task.
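In PyTorch, limiting the trainable encoder layers to the last convolutional layer before the bottleneck and the bottleneck itself can be done via requires_grad. The following is a hedged sketch (the slice index depends on how the encoder Sequential is laid out, and the optimizer choice is our assumption):

import torch

def limit_trainable_encoder_layers(encoder, forecast_head, n_trainable_modules=4, lr=0.01):
    """Freeze the encoder except for its last modules (e.g., the last two Conv1d
    layers and their activations), then build an optimizer over the remaining
    trainable encoder parameters and the forecast head."""
    for p in encoder.parameters():
        p.requires_grad = False
    for module in list(encoder.children())[-n_trainable_modules:]:
        for p in module.parameters():
            p.requires_grad = True
    params = [p for p in encoder.parameters() if p.requires_grad]
    params += list(forecast_head.parameters())
    return torch.optim.Adam(params, lr=lr)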

Fig. 4 Flow chart of a typical forecasting task using representation learning. The first part (read in data, preprocess and split data, create CNN AE, train CNN AE) describes the representation learning; the second part (split CNN AE, combine encoder and the forecast network, train forecast neural network, evaluate forecast neural network) shows the training for the forecasting. The dashed boxes show what can be visualized at the different stages: the inputs; the kernels and outputs; and the activation maps, errors, and forecasts


4.2 Data and Code

Our experiment uses the Europe Windfarm Dataset [30]. All visualizations shown in this section will be done for the file wf1.csv. All other wind parks can easily be visualized using the provided code. The dataset consists of a total of 45 wind farms scattered over Europe. It contains hourly averaged wind power generation data and the corresponding day-ahead NWP information from the European Centre for Medium-Range Weather Forecasts (ECMWF) weather model. Features from the ECMWF weather model include wind speed, wind direction, air pressure, and air temperature. The code for training and visualization is available for download via https://git.ies.uni-kassel.de/cnn-ts-visualization/cnn-ts-vis. The code was written using PyTorch and matplotlib. The data can be requested separately via https://www.ies.uni-kassel.de/ → Downloads.

4.3 Setup

The model and training parameters are shown in Table 1. The hyperparameters are derived from the experiments of our previous work in [11]. The encoder layer size allows for three feature hierarchies while still achieving a high compression rate. The padding allows us to retain the 24 h sequence as best as possible for each feature hierarchy. Our CNN AE is implemented in PyTorch [31]. In total, we train two different models. The first one is a CNN AE consisting only of 1D convolutions in the encoder part and 1D deconvolutions in the decoder part. The encoder uses LeakyReLUs [32], a modification of the ReLU which allows for a small negative activation, as seen in Eq. 2.

Table 1 The parameters used during training of the CNN AE and the forecast layer

Neurons in the encoder and decoder layers: [7, 6, 4, 2], [2, 4, 6, 7]
Neurons in the forecast layer: [54, 24]
Learning rate: 0.01
Kernel size: 4
Padding: 2
Batch size: 10
Epochs: 1000
Encoder levels retrained: 2
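A hedged sketch of how the values in Table 1 enter the first training stage (the reconstruction training); here, cnn_ae stands for a CNN AE module as sketched earlier, and the optimizer choice and the stand-in data are our assumptions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

LEARNING_RATE, BATCH_SIZE, EPOCHS = 0.01, 10, 1000  # values from Table 1

def train_reconstruction(cnn_ae, nwp_sequences):
    """Stage 1: train the autoencoder to reconstruct the NWP input sequences."""
    loader = DataLoader(TensorDataset(nwp_sequences), batch_size=BATCH_SIZE, shuffle=True)
    optimizer = torch.optim.Adam(cnn_ae.parameters(), lr=LEARNING_RATE)
    criterion = nn.MSELoss()
    for epoch in range(EPOCHS):
        for (batch,) in loader:
            optimizer.zero_grad()
            loss = criterion(cnn_ae(batch), batch)  # reconstruction error
            loss.backward()
            optimizer.step()
    return cnn_ae

# Example call with stand-in data of shape (samples, 7 features, 24 hours):
# train_reconstruction(cnn_ae, torch.randn(500, 7, 24))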


LeakyReLU(x) = x if x ≥ 0, and negative_slope × x otherwise.  (2)
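In PyTorch, Eq. 2 is available as nn.LeakyReLU; a quick check (the negative_slope shown is PyTorch's default, the chapter does not state the exact value used):

import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])
print(nn.LeakyReLU(negative_slope=0.01)(x))  # tensor([-0.0200,  0.0000,  3.0000])
print(nn.ReLU()(x))                          # tensor([0., 0., 3.])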

The decoder uses normal ReLUs, which have zero activation for negative input values. The CNN AE only uses the NWP data for training. The second model is a fully connected neural network and builds the forecasting stage. As stated previously, we combine the encoder part of the CNN AE with linear layers to create a forecasting model for future wind power production. The layer size is chosen to create a 24 h forecast from the output of the encoder. During this experiment, we take the saved CNN AE from the first experiment and attach a forecasting stage consisting of LeakyReLUs and linear fully connected layers to the encoder part of the AE. The data is split into training, validation, and test sets. Furthermore, each sequence is split into subsequences of 24 h, creating an input vector with the size of (batch size, 24, # of features). This allows the CNN AE to extract features over the 24-h sequence. By reducing the number of features in each layer, the AE combines the learned time series features. The forecast data is prepared in such a way that with each input sequence of 24 h, we also predict an output sequence of 24 h, resulting in an output vector size of (batch size, 24, 1).
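A minimal sketch of the 24 h windowing described above (the reshaping and the random stand-in data are our assumptions, not the published preprocessing code):

import numpy as np

def make_windows(features, target, window=24):
    """Cut aligned NWP features (T, n_features) and power target (T,) into
    non-overlapping 24 h input/output sequences."""
    n_windows = len(target) // window
    x = features[: n_windows * window].reshape(n_windows, window, -1)  # (n, 24, # features)
    y = target[: n_windows * window].reshape(n_windows, window, 1)     # (n, 24, 1)
    return x, y

# Example with stand-in data: 1000 hourly values of 7 NWP features and power
x, y = make_windows(np.random.rand(1000, 7), np.random.rand(1000))
print(x.shape, y.shape)  # (41, 24, 7) (41, 24, 1)

Note that PyTorch's Conv1d expects the feature (channel) dimension before the time dimension, so such windows would typically be transposed to (batch size, # of features, 24) before being fed to the CNN AE.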

5 Visualization and Patterns

In this section, we visualize the results of the experiment described in the previous section. We visualize the different elements shown in Fig. 4 by introducing each visualization technique and giving an interpretation of the resulting images. A flowchart of this section is shown in Fig. 4; it details the experiment and the process steps in which a visualization gives valuable new insights. As our goal is to provide methods to visualize the decision process within a CNN for forecasting, we focus on one neural network and one wind farm only. This focus helps to keep the explanations and interpretations understandable and interpretable. As mentioned above, a single convolutional layer consists of a kernel that convolves over the input, creating the output. This kind of structure allows us to visualize the following components:

• Inputs,
• Kernels,
• Outputs, and
• Activations.

In addition to these four types of visualizations, especially in an image classification setting, we can create so-called Class Activation Maps (CAMs). An Activation Map (AM) shows which regions of the image are used the most in the current decision process.


We adapted AMs to work with time series forecasting tasks, allowing us to show the particular importance of inputs for the forecast made by the CNN AE forecast network. During the next section, we will first explain how to interpret the visualizations, followed by the four types of visualization mentioned above. For some of the visualization types, we provide several visualizations because each of the different layers is visualized. Each of these visualizations follows the same principle: we first show the line plot representation followed by a heatmap representation of the same data. A guide on how to interpret the respective figure is presented in each subsection. Each figure caption provides a small introduction to what is visualized but does not provide the interpretation advice given in the corresponding section.

5.1 How to Interpret the Visualizations

A line plot allows us to see the development of a time series in its most commonly known form: time steps increase from left to right, and the line goes up or down in the graph. For a time series, we can derive information such as the slope, the amplitude, the duration, or the mean value from a line plot. The pattern favored by a kernel is visible in the line plot of the kernel. Such patterns can depict a positive or a negative slope, or a small rise followed by a sharp drop. Heatmaps display the value of the time series as a color. In our case, values below zero are displayed using colder colors, i.e., different shades of blue. Values above zero are displayed using warmer colors, i.e., different shades of red going to bright orange. Values around zero are colored black. The heatmap visualizations behave like the line plot visualizations: the time steps increase from left to right. However, instead of lines and dots, we display the time series value as a color in the color spectrum between blue, black, and red. Such a visualization of the inputs makes it easier to spot evolving patterns. We do not need to follow and compare the lines, but can focus more on color similarities. A similar color is more straightforward to detect than trying to compare the ups and downs of two lines in a line plot. Kernel visualizations using heatmaps again help to understand the activation patterns better. In a kernel visualization, dark areas identify areas with less importance, whereas brighter areas, either blue or red, signal higher importance. Whereas point and line plots allow for more accurate judgments, color shades such as those used in heatmaps allow for more generic judgments [33, pp. 120, 138]. We want to give a non-computer science expert a better understanding of the decision processes that are happening within a CNN and the accompanying forecast. We display both line plots and heatmaps, allowing readers to decide which visualization they favor. Heatmaps allow for a more general interpretation, but we lose interpretation capability at a finer level. Yet, the heatmaps make it easier to determine which parts of the inputs or feature hierarchies are favored.
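Such heatmaps can be produced with matplotlib, for example as in the following sketch (the colormap only approximates the blue-black-red scheme of the chapter's figures, since matplotlib's standard diverging maps use a light rather than a black center):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(7, 24)  # e.g., 7 input features over 24 time steps

fig, ax = plt.subplots(figsize=(8, 3))
vmax = np.abs(data).max()      # symmetric limits so that zero sits at the center color
im = ax.imshow(data, cmap="RdBu_r", vmin=-vmax, vmax=vmax, aspect="auto")
ax.set_xlabel("time step")
ax.set_ylabel("feature")
fig.colorbar(im, ax=ax, label="value")
plt.show()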


5.2 Input Visualization

A general visualization of the finest information level, the input data, is a straightforward visualization for neural networks. Figure 5 shows the input as a line plot and as a heatmap. In Fig. 5a, we plotted each input feature as a line plot. This allows us to see that some input features behave similarly, e.g., wind speed at 100 m and wind speed at 10 m. The wind direction zonal and meridional seem to be stable in the first part of the input sequence and drop in the second part at different speeds. Air pressure and temperature seem to stay constant, while the humidity varies during the 24 h time sequence. A heatmap plot of the input data, as shown in Fig. 5b, displays the described behavior better. It is more evident from the same color shades that the wind speeds behave similarly. Furthermore, it can easily be seen that the wind direction zonal drops slightly and the meridional wind direction rises slightly in the first part, and that they both drop in the second part with different values. A change in color in the visualization of the data shows this behavior very nicely. Here, similar colors indicate similar behavior, and a similar color gradient, from darker to brighter, shows an increase of the values. The inputs to the other convolutional layers are visualized in Figs. 6, 7, and 8. They all show the combination of the different kernels using the respective inputs from the previous layer. The figures show that we mainly have high positive activations with a few negative activations, as expected when using LeakyReLUs. In each step, the different features are combined by the kernels, resulting in high and low activation outcomes of the feature hierarchies. At the bottleneck, we then obtain one feature with high activations and one feature with small negative activations.

Fig. 5 Visualization of the original input data. In a, the input features are displayed as a line plot and in b, as a heatmap

Fig. 6 Visualization of the inputs to the second convolution. In a, the input features are displayed as a line plot and in b, as a heatmap

Fig. 7 Visualization of the inputs to the third convolution. In a, the input features are displayed as a line plot and in b as a heatmap

Fig. 8 Visualization of the inputs to the forecasting network. In a, the input features are displayed as a line plot and in b as a heatmap

5.3 Kernel Visualization

As mentioned in Section 2.2.2, a CNN uses kernels to process the input. Therefore, the weight matrices of the kernels hold information about the input features in the form of patterns. As the kernels act as the information extractors for each of the feature hierarchies, plotting them will allow us to identify patterns of the input data, which will become important later in a forecasting stage. The kernels are visualized as a line plot and a heatmap as well. Depending on the AE layer we visualize, we can gather different information about the importance of certain features and feature combinations. The visualization of the input layer can be seen in Fig. 9. In the line plots, in Fig. 9a, we see that each kernel has a set of input features it learned to be important. As we use LeakyReLUs after each convolutional layer, we have more significant activations in the next layer with a higher value in the input. Negative values only have a small impact on the activation in the next layer. This effect is shown in the heatmap in Fig. 9b: any color deviating from black shows high importance of the feature for the kernel. E.g., in kernel 0, the wind direction zonal seems to be important, whereas kernel 1 shows the importance of wind speed. Another detail that can be seen in the line plot is the activation pattern. Inputs following these patterns have higher activation in the output. Further down in the hierarchy of the autoencoder, the outputs of the layer's six kernels become the input of the second feature hierarchy. As we have seen in Section 5.2, the encoded feature 0 shows mainly high values where we had a strong influence of kernel 0. It means the first encoded feature after the input consists mainly of wind direction zonal information. The other outputs can be similarly explained if we look at the input layer's kernels. If we follow this argumentation for the subsequent layers, we can see in kernel 2 of the second layer, cf. Fig. 10, a high negative impact of the encoded feature, which can be attributed to the combination in the previous layer and thus to the wind direction zonal. This encoded feature 2 of layer 2 is only taken into account by kernel 0 of layer 3, the bottleneck layer, as seen in Fig. 11.

Fig. 9 Visualization of the CNN kernels of the first layer. In a, the kernels are displayed as a line plot and in b the kernels are displayed as a heatmap. The heatmap shows important features directly by assigning them a higher value, whereas the line plot shows the form of a pattern which is needed to achieve a high activation
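Kernel plots like the ones in Figs. 9, 10, and 11 can be reproduced along the following lines (a sketch; the stand-in encoder and the assumption that its first module is the first Conv1d layer are ours):

import torch.nn as nn
import matplotlib.pyplot as plt

encoder = nn.Sequential(nn.Conv1d(7, 6, kernel_size=3, padding=1), nn.LeakyReLU())  # stand-in

# First-layer kernels: weight shape is (out_channels, in_channels, kernel_size)
kernels = encoder[0].weight.detach().cpu().numpy()

vmax = abs(kernels).max()
fig, axes = plt.subplots(1, kernels.shape[0], figsize=(3 * kernels.shape[0], 3))
for i, ax in enumerate(axes):
    # Rows: input features, columns: positions of the kernel along the time axis
    ax.imshow(kernels[i], cmap="RdBu_r", vmin=-vmax, vmax=vmax, aspect="auto")
    ax.set_title(f"kernel {i}")
    ax.set_xlabel("kernel position")
    ax.set_ylabel("input feature")
plt.tight_layout()
plt.show()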


Fig. 10 Visualization of the CNN kernels of the second layer. In a, the kernels are displayed as a line plot and in b the kernels are displayed as a heatmap. The heatmap shows important features directly by assigning them a higher value, whereas the line plot shows the form of a pattern which is needed to achieve a high activation

5.4 Forecast Visualization

An important question is: how well did the forecast perform? For this scenario, we visualize the output, i.e., the forecasted values and the real values, shown in Fig. 13. In this figure, we again show the line plot and a heatmap. Both make it easy to compare visually how well our forecast performs. As mentioned earlier, we use time sequences of 24 h with a 1 h resolution.


Fig. 11 Visualization of the CNN kernels of the bottleneck layer. In a, the kernels are displayed as a line plot and in b the kernels are displayed as a heatmap. The heatmap shows important features directly by assigning them a higher value, whereas the line plot shows the form of a pattern which is needed to achieve a high activation

As we also apply zero padding at the beginning and end of the time series, we are more likely to have higher errors at the beginning and end of the forecast, which can be seen in the forecast results. The forecast error is shown in Fig. 12.
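The error curve and the RMSE reported in Fig. 12 can be computed along these lines (a sketch with random stand-in arrays in place of the actual measurements and forecasts):

import numpy as np
import matplotlib.pyplot as plt

y_true = np.random.rand(240)                  # stand-in: 10 concatenated 24 h sequences
y_pred = y_true + 0.1 * np.random.randn(240)  # stand-in forecast

error = y_pred - y_true
rmse = np.sqrt(np.mean(error ** 2))

plt.plot(error)
plt.xlabel("time step")
plt.ylabel("forecast error")
plt.title(f"Forecast error (RMSE = {rmse:.4f})")
plt.show()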


Fig. 12 This plot shows the error of the forecast result. Displayed is the error of the forecast shown in Fig. 13. It is one concatenated batch (10 time sequences of 24 h each). The overall RMSE for the forecast is 0.1473

5.5 Activation Maps

AMs are typically used in image classification tasks. In those tasks, they can help identify which regions of an image are essential for the classification. They do so by highlighting the region with the highest influence within the image. Such information can be beneficial for a better interpretation of how the CNN is making a decision. AMs also help in getting feedback from domain experts, e.g., by guiding them to important regions in the input, such as in cancer detection, where it is possible to guide a physician to a particular image region for further examination. One way of creating these AMs is to use the gradient information, creating so-called Grad-CAM images [34, 35]. The concept can easily be transferred to time series applications. We first calculated an AM in a similar manner as explained in [34], but instead of the one-hot encoding we used the individual time points of the input time series. With the obtained AM and the input layer's kernels, we calculated the individual importance of the kernels:

Per Kernel Influence = Kernel_{Layer 0}^T · Grad-AM  (3)

By obtaining the inner product of the transposed kernel matrices with the Grad-AM, we obtain the individual importance of each of the inputs for each of the kernels. Afterward, we average the Per Kernel Influence and normalize the obtained importance to [0, 1] to show the influence of each parameter on the output. If we follow this mechanism, we obtain an activation time series map as shown in Fig. 14.
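As a rough, simplified approximation of this procedure (not the authors' exact implementation of Eq. 3; it uses a plain gradient-times-input saliency aggregated over the forecast horizon instead of the per-kernel inner product), a gradient-based importance map for one input sample could be computed like this:

import torch

def input_importance(model, x):
    """Gradient-based importance of each input value for the 24 h forecast,
    normalized to [0, 1]; x has shape (1, n_features, 24)."""
    x = x.clone().requires_grad_(True)
    forecast = model(x)                          # (1, 24)
    forecast.sum().backward()                    # aggregate over the forecast horizon
    importance = (x.grad * x).abs().squeeze(0)   # gradient x input per feature and time step
    return (importance / importance.max()).detach()

# Usage, assuming `forecast_model` maps (1, 7, 24) NWP windows to (1, 24) forecasts:
# importance_map = input_importance(forecast_model, torch.randn(1, 7, 24))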


Fig. 13 Visualization of the output of the CNN AE forecast network. Displayed is one batch of 24 h time sequences concatenated (10 time sequences of 24 h each). In a, the output is displayed as a time series and in b the same output is displayed as a heatmap. Both plots show a clear difference between the forecast and the real values. The heatmap allows for a direct comparison by colors

Each time step has a different color showing its relative influence on the forecast. Brighter values show a stronger influence and darker points less influence. In the examples shown in Figs. 14 and 15, we can see that both wind speeds are always of high importance in calculating the forecasts.

5.6 How to Use the Individual Visualizations

So far, we presented different visualizations for the individual parts of the CNN AE. This section provides context to explain how the individual visualizations act together. The visualization of the inputs allows us to get a first overview of the individual input time series. The input time series are convolved with the kernels of the first layers. The visualizations of the kernels allow us to identify which input time series, with their individual parts, are important for the output of the layer. The layer's output will be the input to the following layer, and we restart the interpretation process. At the end, the activation after the bottleneck will be fed into the forecasting neural network. This results in a forecast which we can compare to the ground truth time series to calculate the forecast error. The information of those individual plots can be compiled into a single AM. Instead of tracing the individual kernels and their activations, everything is calculated into a single value. This value can be seen as the individual importance of the input at that point in time and is displayed as a color gradient on the input time series.

Fig. 14 Activation Map for Input Sample 0. Shown are inputs with their importance to the output. The importance is depicted by the color of the dots

Fig. 15 Activation Map for Input Sample 4. Shown are inputs with their importance to the output. The importance is depicted by the color of the dots

6 Conclusion

With our work, we wanted to show how to use CNN visualization techniques from image classification tasks in time series forecasting. We gave a brief overview of previous work on image visualization and time series forecasting. Afterward, we introduced the different components of a CNN and our two-stage training for CNN AEs while pointing out visualization opportunities in a flowchart. Finally, we applied four different visualization techniques to make CNNs more understandable. First, we provided a heatmap as an alternative overview of the input data. These heatmaps provided a simpler view of the time series, which allows for an easier understanding of the CNN kernels. The CNN kernel visualization allowed us to identify the essential input features at each layer in the CNN AE. The combination of kernels and inputs has been shown in the bottleneck output plots. The plots show the gradual assessment of features into essential and non-essential features. In a final visualization, we altered gradient class activation maps to work for time series forecasts. With the help of these activation maps, we can show which features and which time points in the input time series have the most influence on the forecast. In future applications, the interpretability of a neural network's reasoning is getting more and more critical, as we need to understand the decision-making process of neural networks. Only by obtaining a granular view of the neural network's reasoning are we able to successfully apply them to tasks where trust in the reasoning is needed, e.g., cancer treatment or fully autonomous cars. Our work provides a tool that allows computer scientists to explain time series forecasts using CNNs in an easier-to-read and understandable way to end-users, e.g., nurses, doctors, field technicians, or others who need to rely on the information provided by time series forecasts.

Acknowledgements This work was supported within the c/sells RegioFlexMarkt Nordhessen (03SIN119) project, funded by BMWi: Deutsches Bundesministerium für Wirtschaft und Energie/German Federal Ministry for Economic Affairs and Energy. Furthermore, the authors thank Hermine Takam Makoukam for her preliminary work helping to create some of the code for the visualization techniques during her Master's thesis.


References 1. Vellido, A.: The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Comput. Appl., 1–15 (2019) 2. Mathews, S.M.: Explainable artificial intelligence applications in NLP, biomedical, and malware classification: a literature review. In: Intelligent Computing, pp. 1269–1292. Springer International Publishing, Cham (2019) 3. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: The IEEE International Conference on Computer Vision (ICCV), October 2017 (2017) 4. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014 (2014) 5. Ma, C., Huang, J.-B., Yang, X., Yang, M.-H.: Hierarchical convolutional features for visual tracking. In: The IEEE International Conference on Computer Vision (ICCV), December 2015 (2015) 6. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014) 7. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization (2015). arXiv:1506.06579 8. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015) 9. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative Study of CNN and RNN for Natural Language Processing (2017) 10. Gensler, A., Henze, J., Raabe, N., Sick, B.: Deep learning for solar power forecasting—an approach using AutoEncoder and LSTM neural networks. In: 2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016—Conference Proceedings (2017) 11. Henze, J., Schreiber, J., Sick, B.: Representation learning in power time series forecasting. In: Pedrycz, W., Chen, S.-M. (eds.) Deep Learning: Algorithms and Applications, pp. 67–101. Springer International Publishing, Cham (2020) 12. Zifeng, W., Huang, Y., Wang, L., Wang, X., Tan, T.: A comprehensive study on cross-view gait based human identification with deep CNNs. IEEE Trans. Pattern Anal. Mach. Intell. 39(2), 209–226 (2016) 13. Gensler, A.: Wind Power Ensemble Forecasting: Performance Measures and Ensemble Architectures for Deterministic and Probabilistic Forecasts. Kassel University Press (2019) 14. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press (2012) 15. Rosenblatt, F.: The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell Aeronautical Laboratory (1957) 16. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016) 17. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015) 18. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision—ECCV 2014, pp. 818–833. Springer International Publishing, Cham (2014) 19. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M..: Striving for Simplicity, The All Convolutional Net (2014) 20. Zijie, J., Turko, R., Shaikh, O., Park, H., Das, N., Hohman, F., Kahng, M., Chau, D.H.: CNN Explainer: Learning Convolutional Neural Networks with Interactive Visualization(2020) 21. 
Chen, J.-F., Chen, W.-L., Huang, C.-P., Huang, S.-H., Chen, A.-P.: Financial time series data analysis using deep convolutional neural networks. In: 2016 7th International Conference on Cloud Computing and Big Data (CCBD), pp. 87–92 (2016)


22. Wang, Z., Yan, W., Oates, T.: Time series classification from scratch with deep neural networks: a strong baseline. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1578–1585 (2017) 23. Kim, T.-Y., Cho, S.-B.: Predicting the household power consumption using CNN-LSTM hybrid networks. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A.J. (eds.) Intelligent Data Engineering and Automated Learning—IDEAL 2018, pp. 481–490. Springer International Publishing, Cham (2018) 24. Yang, J., Nguyen, M.N., San, P.P., Li, X.L., Krishnaswamy, S.: Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition (2015) 25. Kothuru, A., Nooka, S.P., Liu, R.: Application of deep visualization in CNN-based tool condition monitoring for end milling. Procedia Manuf. 34, 995–1004. 47th SME North American Manufacturing Research Conference, NAMRC 47. Pennsylvania, USA (2019) 26. Le Guennec, A., Malinowski, S., Tavenard, R.: Data augmentation for time series classification using convolutional neural networks. In: ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, Riva Del Garda, Italy, September (2016) 27. Schreiber, J., Jessulat, M., Sick, B.: Generative adversarial networks for operational scenario planning of renewable energy farms: a study on wind and photovoltaic. In: Tetko, I.V., K˚urková, V., Karpov, P., Theis, F. (eds.) Artificial Neural Networks and Machine Learning—ICANN 2019: Image Processing, pp. 550–564. Springer International Publishing, Cham (2019) 28. He, Y., Henze, J., Sick, B.: Forecasting Power Grid States for Regional Energy Markets with Deep Neural Networks. IJCNN/WCCI 2020 (2020) 29. He, Y., Henze. J., Sick, B.: Continuous Learning of Deep Neural Networks to Improve Forecasts for Regional Energy Markets. IFAC 2020 (2020) 30. Gensler, A.: EuropeWindFarm Data Set. http://ies-research.de/Software, 2016. Last accessed 2020-08-30 31. Paszke, A, Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison. A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019) 32. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, vol. 30, pp. 3 (2013) 33. Cairo, A.: The Functional Art: An Introduction to Information Graphics and Visualization. New Riders (2012) 34. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D, Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128(2), 336–359 (2020). Feb 35. Ozbulak, U.: PyTorch CNN Visualizations. https://github.com/utkuozbulak/pytorch-cnnvisualizations, 2019. Last accessed 2020-08-30

Beyond Deep Event Prediction: Deep Event Understanding Based on Explainable Artificial Intelligence Bukhoree Sahoh and Anant Choksuriwong

Abstract Big data and machine learning are essential ingredients in artificial intelligence-based applications and systems for decision making. However, intelligence-based approaches such as neural networks and deep learning only aim to summarize and predict events extracted from raw data rather than to understand them. This latter ability is especially important for critical events, which are unpredictable and complex and have the potential to cause serious damage to the economy, society, and the environment. Good event understanding would help governments and enterprises manage an event so that they can make the best plan to relieve or prevent a critical situation. This means that high-level knowledge must be supplied to support understanding in uncertain and difficult situations. This chapter describes Deep Event Understanding (DUE), a new perspective for machine learning based on eXplainable Artificial Intelligence (XAI) driven by the paradigm of Granular Computing (GrC). It semantically models knowledge by imitating human thinking in a form that can be understood by decision-makers. DUE offers a level of understanding that is both argumentative and explainable for critical applications and systems, and it aims to mimic human intelligence by generating real-time knowledge using scientific causal structures of human thinking. We begin this chapter by highlighting the limitations of current machine learning models, which are unable to achieve a satisfactory level of understanding in terms of human intelligence. The fundamental human learning process shows the current disadvantages of curve-fitting technology such as deep learning. A DUE architecture for critical systems is proposed which employs a cycling process between the system and the environment. The desirable properties of DUE are introduced, along with its basic ability to mimic human-like intelligence from the viewpoint of computing technology in terms of critical thinking and contextual understanding. We describe a learning platform based on Causal Bayesian Networks (CBNs) that encodes critical thinking and contextual understanding in machine-interpretable and understandable forms, and which is ready to learn new experiences from big data. The outcome is a transparent DUE model that can be understood by domain experts, lay users, and machines, and which is manageable, maintainable, and shareable. Finally, we highlight research trends and future directions for DUE.

Keywords Bayes' Theorem · Causal Machine Learning · Causal Bayesian Networks · Interpretable AI · Counterfactual · Critical System

B. Sahoh · A. Choksuriwong
Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Songkla, Thailand
e-mail: [email protected]

1 Introduction A knowledge-driven approach to support critical decision-making processes plays a key role in various research areas such as marketing, transportation, industry, health care, and safety and security. The decision-making process requires sophisticated knowledge from domain experts and practitioners who can understand the situation by interpreting evidence from an environment. It employs context-driven information from diverse sources to determine a plan to manage the situation properly, and to prevent a situation from becoming worse. The advent of big data offers an unprecedented streaming source for real-time evidence which might supply the information and knowledge necessary to understand events deeply. However, the well-known characteristics of big data include exponential growth, unreliability, complicated data types, and high frequency that are difficult to process with traditional technologies. However, Artificial Intelligence (AI) technology such as machine vision, natural language processing, and machine learning can handle big data characteristics, and extract helpful information to support the decision-making process [1]. Pencheva et al. [2] investigated how the government could employ big data and AI-based machine learning to analyze the public sector, while Lee and Shin [3] examined AI-based machine learning and big data in enterprises. This work shows how AI-based applications are becoming a fundamental requirement in management systems supporting decision making. A critical situation is a complex problem such as economic damage from lockdown, human-made and natural disasters, international terrorism, and serious accident, which are unpredictable, random, and uncertain. It needs multidisciplinary and advanced technology such as model-based knowledge, evidence-driven approaches, and AI-based concepts to deal with the unknown patterns that make up the complex events [4]. Research on the intersecting areas between big data, machine learning, and critical systems has been extensive in the last decade. Shan and Yan [5] and Castillo [6] studied how machine learning-based systems could discover real-time information, and argued that systems based on these fields are vital to support authorities dealing with critical events. Increasingly, computing technologies based on Deep Event Prediction (DEP) drive automation systems to identify event information from the environment. For example, Ghafarian and Yazdi’s [7] machine learning-based


DEP approach called "learning on distributions" can identify event information with high accuracy. Kumar et al. [8] and Chaudhuri and Bose [9] proposed DEP-based models using deep learning to classify and identify critical events, and confirmed that deep learning technologies can perform well in terms of event classification and identification. However, current DEP-based research aims to determine only the predicted output from the sensors that observe the real-world environment. Enterprises and governments require a deeper, explainable, knowledge-driven approach, which can also explain critical events. Deep understanding of events is a human-like form of intelligence that learns from real-time evidence by reasoning and interpreting according to prior knowledge and experience. It must provide the details of why a certain output is predicted and why it is not something else. It should explain why a system might fail or succeed in predicting output, and how success may become a failure, or vice versa. All of these capabilities are beyond DEP. An understanding ability is mainly needed in critical systems that plan how an enterprise or government should deal with critical events effectively. However, current DEP research based on machine learning has not yet considered how to deal with understanding when the situation is unreliable, uncertain, and chaotic. It assumes the presence of experts and practitioners in the enterprise and government, but in these situations there is always a shortage of such people to analyze, interpret, and infer reasons, which are usually time-consuming and labor-intensive tasks. Also, any delay in making decisions causes serious damage in critical situations, leading to a high risk of failure. Granular Computing (GrC) fills a research gap in critical management by providing an approach to understanding events based on human-like intelligence [10]. XAI is a formalization that serves the needs of GrC by encoding knowledge transparently, which gives us the ability to explain and argue events semantically. Our research proposes Deep Event Understanding (DUE), a novel form of machine learning that uses XAI based on GrC to mimic human brain-like thinking, in order to explain the critical events utilized in decision making. The main contributions of this chapter are:
• To explain why the deep prediction of AI-based learning platforms cannot reach the goal of human-like intelligence comparable to the human learning process.
• To introduce DUE, an architecture based on flexible communication between the system and the environment utilizing XAI computing.
• To propose a human-like intelligence platform for machine learning using human critical thinking and contextual understanding, which aims to understand the events that support better decision making by staff.
• To describe a DUE technology based on XAI that uses this learning platform to mimic the human ability to support critical systems that understand situations deeply.
The rest of this chapter is organized as follows. The current limitations of AI-based machine learning models are examined in Sects. 2 and 3. The XAI-based DUE architecture is discussed in Sects. 4 and 5. Sections 6 and 7 investigate the DUE's definition


and its desirable properties. Section 8 introduces related technologies for computing-based learning models, including Bayes' theorem, Causal Bayesian Networks, and XAI-based DUE using Causal Bayesian Networks. The conclusions, research trends, and future directions appear in Sects. 9 and 10.

2 Why Current Machine Learning is Differentiated from Human Learning "How do humans capably learn and solve problems?" This question can be answered by explaining the fundamental human learning process [11]. Letter tracing in preschool is the simplest example for understanding how humans learn intelligently. The process can be described by five principal procedures: (1) set the goal; (2) define the reference knowledge; (3) design a method; (4) learn new things; and (5) grade the preschoolers. An example is shown in Fig. 1. Figure 1 consists of five steps that a preschooler must follow: (1) understand the task, (2) listen to the teacher explaining the characteristics of the letter "E" as prior knowledge, (3) understand the procedure defined by the teacher's learning design, (4) learn the new experience by themselves, and finally (5) evaluate and grade the scores for each preschooler according to the prior knowledge. This is a common learning process, and most current DEP-based machine learning platforms also learn a new task this way [12]. So what are the differences that make the current platforms unable to reach the goal of human ability? There is a need for a human-like intelligent platform based on the granular design thinking of semantic relationships that decision-makers can employ to understand events. This becomes a problem in critical cases when staff need to understand and explain a critical event, rather than just quickly generate predicted results. Understanding is the ability to argue and explain "why" a machine chooses a specific answer but not something else. Compared to the human learning process, one of the preschoolers got 70.83%: "why not" 100.00%? Another preschooler got 83.34%: "why not" 70.83%? If they need to improve their scores from 70.83% or 83.34% to 100.00%, what should they do? These questions are impossible to argue and explain, but humans can produce

Fig. 1 The fundamental human learning process: goal expectation, prior knowledge, learning design, new experience, and assessment, illustrated by tracing the letter "E" (tracing 1, ..., tracing n), with the first preschooler scoring 70.83% and the second 83.34%


reasons using explicit prior knowledge from expert experience. For example, in Fig. 1 the prior knowledge clearly defines four main arrows for the "E" character. The learning design is based on this prior knowledge, with the dots aligned to each arrow, which makes it understandable as a human learning process. This is a simple example of how humans conceptualize abstract knowledge and aggregate it into meaningful knowledge that can be organized and exchanged, and it allows us to explain and argue about general events on an everyday basis. It is a fundamental point of the GrC paradigm, which aims to model knowledge transparently so that it can be semantically understood by ordinary stakeholders [13]. This knowledge lets a human suggest a solution to the question "Why did the first preschooler get only 70.83%?" According to the prior knowledge, the "E" is encoded by four arrows and 24 dots, and the problem is solved by completely tracing through the center of each dot along all the arrows. The first preschooler missed three points on arrows 2 and 4, and one point on arrow 1. For the question "If the preschooler needs to improve from 70.83% to 100.00%, what should he do?" the solution is simple: he should re-practice drawing the "E" letter carefully, concentrating especially on arrows 2 and 4. This shows how human ability can produce reasons and advise on better solutions because of prior knowledge and a concrete learning design. Miller et al. [14] investigated the limitations of present AI models in terms of high prediction performance but did not consider the interpretation and explanations needed to understand critical events. Compared with our proposal, the encoding of systematic expert thinking related to "prior knowledge" and "learning design" is currently ignored. For example, big data offers an intelligent black-box model, which produces output with no insight into its learned model. The challenge is how to build software that mimics the insight of fundamental human learning using big data and that can handle the complexity and uncertainty of critical events. It should understand events rather than only predict output with high accuracy, and support users with good arguments and explanations to enhance their decision making.

3 Beyond Deep Event Prediction This section describes why DEP is insufficient when a system needs to handle critical events in uncertain and unknown patterns. Consider the critical event “a large explosion sound was heard near the coffee shop at the mall”. Users receiving this message might ask (1) “Is it likely to be critical?”, (2) “Why it is likely to be critical?” and (3) “If the event is critical, who is likely to be affected?” Such questions are helpful to prevent the situation from becoming worse. A DEP system would label “coffee shop” as a well-known “cafeteria”, the “a large explosion sound” as an “observed-bombing object” and “the mall” as a “shopping mall”. DEP can identify this information through pre-configured technologies such as curve-fitting (e.g.,


using an artificial neural network) which works well when environmental features are completely observed. Chen et al.’s [15] model for critical event prediction from big data using neural networks offers the best performance compared with other approaches. Also, Yu et al.[16] work on real-time situation awareness-based critical event classification using deep learning offered good accuracy and could be applied to critical systems. Kumar and Singh [17] proposed critical event location identification using deep learning, which achieved good performance and was ready to apply to location-based services. Sit et al. [18] proposed an identification-based model for critical event spatial and temporal information, and claimed that deep learning could effectively deal with big data and was suitable for critical management systems. Although this research provides helpful information, it cannot provide event details so that the systems can support decision making. The earlier example questions are beyond DEP such they require a level of understanding using prior knowledge. For instance, the status of a “cafeteria” during a holiday must differ from its status during in-office hours. Likewise, a “shopping mall”’ in a downtown (e.g., in an “economic zone”) is dissimilar to a “shopping mall” in a rural setting (e.g., in a “country zone”). Both must be interpreted using prior knowledge to place them in the context, which is dynamically changing depending on the stochastic and uncertain environment [19, 20]. An adaptive ability to learn, recognize, and reason about deep meanings according to dynamic environments (e.g., locations, periods, stakeholders, events) is one definition of DUE, and requires XAI to help guide its decision making. DUE requires a multidisciplinary approach to reach the necessary level of humanlike-intelligence, and the necessary background knowledge and advanced technology will be described in the next section.

4 Big Data, AI, and Critical Condition ML-based big data analysis is a well-known platform to discover real-time knowledge. Black-box ML techniques such as deep learning, support vector machines, and random forests are powerful models with high performance. The key disadvantage of a black-box model is that it offers no explanation or argument for the machine’s action (e.g., its output). The stakeholders cannot understand a black-box model’s action, and cannot ask for details about the factors affecting the model’s decisions. These disadvantages cause difficulties when the system needs to deal with critical tasks because the users often require explanations in human-understandable forms to help understand the worst-case scenarios that might happen shortly. Lecue [21] reviewed the role of knowledge graphs in XAI and stated that a critical system needs a model with an intelligent-comprehensible component to detail all its decisions that interact with the environment. D. Gunning [22] argued that the critical applications of XAI include medicine, law, finance, and defense, which rely on explainability to help manage situations effectively. Fernandez et al. [23] claimed


Fig. 2 The DUE intersection: XAI-based DUE sits at the overlap of big data, artificial intelligence, and critical conditions, combining real-time knowledge, intelligent systems, and transparent models

that XAI plays an important role in sensitive fields such as safety science, where an explanation is necessary. DUE encapsulates XAI abilities for critical systems and applications and emphasizes three main disciplines: big data, artificial intelligence, and critical conditions. DUE computational concepts intersect these areas as shown in Fig. 2. Figure 2 shows how XAI-based DUE utilizes real-time knowledge, transparent models, and intelligent systems. Real-time knowledge augments knowledge engineering and the XAI-based DUE real-time process so that critical conditions can be quickly understood. Murphy [24] reviewed knowledge engineering in critical situations and pointed out that real-time computing for emergency informatics is the key factor in real-time decision-making. A transparent model has a human-understandable form for explaining, which lets users interact with it by asking counterfactual questions. Barredo Arrieta et al. [25] argued that transparent models are one of the key XAI ingredients, being an essential factor for software to achieve human-like intelligence. An intelligent system is a mechanism based on automation that is added to a system when it needs to employ knowledge from big data [26]. In the last two decades, intelligent systems have been used in many research areas, such as smart cities and intelligent government. XAI-based DUE retains the ability to build self-determined systems but also adds the discovery of knowledge for supporting decision-makers in uncertain environments. It provides well-timed knowledge that decision-makers can better understand by interacting with the system's explanation. The XAI-based DUE multidisciplinary computational concept needs a concrete platform so it can be applied to a real-world environment. For a basic DUE-based critical system, the following section proposes a cycle architecture that permits collaboration between the DUE and the environment.


5 DUE Architecture Big data architectures related to the DUE model have been intensively studied recently. For example, Flouris et al. [27] and Terroso-Sáenz et al. [28] have proposed architectures that deal with critical and complex events. Both of them offer predictions based on context awareness and changing time but overlook the dynamic interaction between the system and the environment. Such a dynamic architecture is discussed by Wang et al. [29], which predicts future critical events based on the dynamic environment. However, none of them clearly define the cycle and the dynamic process that understands the event. This section proposes an overview DUE architecture that aims to achieve a level of human-like intelligent understanding, based around the AI lifecycle and a real-time action-reaction between the system and the environment, which uses the XAI-based understanding ability. The architecture is shown in Fig. 3. Figure 3 shows the architecture that cycles between the system and the environment to perceive, learn, reason about, and understand critical events with human-like intelligence. The environment is where critical events happen, while the brain-like computing analyzes and interprets the nature of the critical events. The adaptive knowledge taken from this architecture will support the staff. Real-time environmental data can be collected by multiple sensors (e.g., sentiment from text, body movement from images and video, heart rate from a physical sensor) and streamed to the system. The system extracts key event information and then reasons about it based on prior knowledge. The real-time knowledge extracted from the data passes through the decision support system, which interprets the staff's profiles and related questions to provide the best answers for making decisions.

Fig. 3 An overview of the DUE architecture: sensors (vision, text, voice, time/location) transmit authentic information from the environment; the brain-like computing performs perception (features), event identification (information), event understanding (knowledge), and decision support; the user supplies questions and profiles and receives real-time answers


In the last decade, the study of brain-like computing has focused on four main sub-components: (1) perception [30], (2) event identification [31, 32], (3) event understanding to reason about and generate real-time knowledge, and (4) decision support [33, 34] to match real-time knowledge. This shows that both existing infrastructure and other approaches can employ DUE. However, DUE needs contributions from XAI to help design and develop dynamic systems, and it furthers the need for multidisciplinary research. The DUE model is based on human-like intelligence and is still in its infancy in terms of its understanding ability, especially for the explanation of critical events. The most desirable properties for DUE are still unclear, but several possibilities are discussed in the next sections.

6 Properties of DUE The DUE model aims to emulate human-like intelligence so it can produce insight to support decision-making in critical situations. Granular Computing (GrC) is a brain-like computing framework of human cognition for intelligent decision-making systems, which helps a system understand events using prior knowledge [10, 35]. Loia et al. [36] and Li et al. [37] discussed how a combination of GrC, big data, and decision-making can generate meaningful knowledge for event understanding in intelligent systems. The GrC-based learning platform for DUE employs properties from multiple computing technologies to support explainable and argumentative abilities. In this way, the principal properties of the DUE model augment the basic needs of earlier studies by supplementing the critical conditions that are summarized in Table 1. Table 1 shows the desirable properties of DUE powered by the GrC framework. In summary, critical events are usually unexplained and the system cannot directly determine their patterns beforehand, which requires the properties listed in Table 1 to process the events. The system requires prior knowledge and evidence to infer and generate dynamic knowledge that is helpful for real-time planning. Moreover, big data is generated endlessly, and the system must iteratively observe the environment and dynamically learn and update itself according to counterfactual questions and its real-time observations over time. Finally, the system must provide real-time knowledge to support decision making. A key property is a transparent explanation of why the predicted output is provided, along with confidence scores, so that users can decide whether they require more evidence. Although these significant properties have been studied, there is no commonly used DUE learning platform, especially one for expressing prior knowledge and based around the human learning process mentioned in Sect. 2. The next section will study potential technologies for fulfilling these needs.


Table 1 The desirable properties of DUE
Evidence-driven: DUE uses real-time observations for interpreting events [36, 38]
Knowledge-driven: For unknown events, DUE integrates prior knowledge of earlier evidence to induce and predict the most probable event along with its reason [39]
Adaptability: DUE can generate dynamic knowledge either from real-time evidence or via counterfactual questions about staff preferences and roles [40, 41]
Uncertainty: Critical events occur randomly in a dynamic environment, so they cannot be defined using a deterministic language. This suggests that a probabilistic language should be used to model uncertain problems [42, 43]
Endless learning: DUE must continuously update itself according to environmental changes, newborn hypotheses, and new evidence [44, 45]
Transparency: Transparent knowledge allows both lay users and systems to reorganize models based on their reasoning. DUE may contradict human reasoning, and this can be accepted or rejected [46, 47]
Causality: Contextual computing and causal inference let DUE understand critical events and provide reasons, which explain the phenomena based on evidence [48, 49]

7 The Concept of DUE DUE aims to model expert knowledge and practitioner experience based on the human learning process. It offers explanation abilities that mimic human-like intelligence by observing, learning, reasoning about, and understanding events. DUE utilizes two main concepts from GrC: (1) human critical thinking, to recognize evidence from sensory perception, and (2) contextual understanding, to interpret events by aggregating prior knowledge and evidence in uncertain situations.

7.1 Human Critical Thinking Human critical thinking can be represented by the 5W1H questions ("Who", "What", "Where", "When", "Why", and "How"), but the process for answering those questions is very complex and must be expressed using systematic procedures. Experts and practitioners utilize 5W1H questions to explain activities, exchange ideas for simulating future scenarios, and understand how an event is evolving. 5W1H is a potent tool for inferring new knowledge and dealing with unknown patterns in daily life, such as safety and security issues. For example, Steel and Owen [50] studied advanced care planning, where healthcare professionals understand how to deal with their patients by applying critical thinking. They argued that human critical thinking helps professionals provide help to the right patients at the right time. Boggs et al. [51] applied a model based on human critical thinking to analyze events related to the transportation safety


Table 2 Human critical thinking for DUE
"Who": The stakeholders in a critical event
"What": The event type that randomly occurred in the real world
"When": Time factors when an event occurs, which will affect planning and response
"Where": The possible locations of an event
"Why": Event details for a counterfactual conditional
"How": Event details for a future situation based on expert experience and present evidence

and security of self-driving vehicles. Xu et al. [52] employed a similar approach to analyze critical events. DUE utilizes human critical thinking to mimic the prior knowledge of experts and practitioners, who can explain a critical event early. Critical thinking in the context of DUE is described in Table 2. Table 2 shows how 5W1H questions are needed for modeling events using expert knowledge and practitioner experience. The "How" question is one of the most important factors for explaining the reasons for a critical event. For example, for the question "What if we see injured victims?" the answer is "the situation might need urgent first aid and lifesaving", while for the question "What if we see infrastructure damage?" the answer is "the situation might affect area access planning". The challenge for DUE is the "Why" question. Xu et al. [52] argued that all the critical event questions can be answered from big data using data analysis except for the "Why" question, for example, "What if the critical event had not occurred?" or "What if we had seen a terror attack rehearsal?". These can be explained and answered only when the event has ended, but a critical system needs to determine answers during the event to plan and respond promptly. This means that current systems can explain only some 5W1H-type questions, while the rest require experts and practitioners, who must infer the context at the beginning in order to explain their "Why" answers. This is a well-known labor-intensive and time-consuming problem, which challenges the XAI field to apply new technologies. Contextual understanding might be one solution for dealing with that problem.

7.2 Contextual Understanding Contextual understanding aims to encode human intelligence and drives the system that imitates human reasoning for an explanation. It is a form of common sense based on knowledge and experience which allows the decision-making process to progress safely [53]. It reaches conclusions given the evidence that the system can observe, and can explain the facts or causes behind the conclusions. This is the basis of scientific belief using human intelligence [54]. Contextual understanding must be transparent, interpretable, shareable, manageable, and understood by humans and systems. Our approach uses cause-and-effect


Fig. 4 The cause-and-effect encoding: (a) What's cause-and-effect encoding, Context (explosion sound) → What (bombing); (b) Where's cause-and-effect encoding, Context (shopping mall) → Where (crowded zone); (c) Conclusion's cause-and-effect encoding, What (bombing) and Where (crowded zone) → Conclusion (critical)

to encode context as prior knowledge for arguments and explanations. The goal is to explain the thinking behind the answers to 5W1H questions. For example, if an "explosion sound" is located at a "shopping mall", then experts and practitioners will believe that there is a risk of a terrorist attack and conclude that the situation is "critical". Contextual understanding models their assumptions using cause-and-effect in a machine-interpretable form, which can be diagrammed as in Fig. 4. The causal diagrams in Fig. 4 encode human critical thinking and contextual understanding. Each of the rectangles represents a semantic concept. These concepts and the relationships between them are initially derived from the expert's hypothesis, with the cause concept pointing towards the effect concept. The state of each concept is represented by the italic words in brackets and can vary. This kind of cause-and-effect encoding allows the system to understand the situation by arguing and explaining things meaningfully. The dark-gray concepts in Fig. 4a, b represent the contexts that can be known or observed from the environment, while the light-gray concepts denote human critical thinking. For instance, a known "explosion sound" can explain the possibility of a "bombing" event. When a "shopping mall" is detected, it can then be used to plausibly explain the possibility of a "crowded zone". White concepts, such as critical in Fig. 4c, represent the hidden knowledge that can be inferred (in this case from "crowded zone" (Where) and "bombing" (What)). In these ways, the hidden knowledge contexts can be incorporated with indirect meaning between the evidence and conclusions, and so mimic human-like intelligence. Contextual understanding is a principal means for designing scientific concepts in the real world. It does not replace humans but instead provides a system to help users reach the best decision in an intelligent manner. Contextual understanding aims to encode knowledge in a manner that can serve the needs of critical systems for understanding events deeply. However, contextual understanding must encode its data in a machine-readable language, and the next section examines several fundamental DUE technologies for this.
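As a concrete illustration of what "machine-interpretable form" can mean here, the diagram in Fig. 4 reduces to a list of directed cause-and-effect edges. The sketch below is a minimal Python encoding; the concept names follow Fig. 4 (the context node names anticipate the fuller model of Fig. 7), while the tuple representation itself is only an illustrative assumption, not a prescribed DUE format.

```python
# A minimal sketch of one possible machine-readable encoding of the
# cause-and-effect diagram in Fig. 4 (cause on the left, effect on the right).
causal_edges = [
    ("What's Context",  "What"),        # Fig. 4a: explosion sound -> bombing
    ("Where's Context", "Where"),       # Fig. 4b: shopping mall   -> crowded zone
    ("What",            "Conclusion"),  # Fig. 4c: bombing         -> critical
    ("Where",           "Conclusion"),  #          crowded zone    -> critical
]
```

Such an edge list is the structural half of the model; the probabilistic half, which quantifies how strongly each cause explains its effect, is introduced in the next section.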


8 Learning Model for DUE A learning model must encode the deep event understanding concepts mentioned in Sect. 7 in a machine-understandable form. It acts as a fundamental framework that can learn new experiences in different environments. The DUE platform must be both transparent and sensible because it aims to utilize insightful knowledge rather than predict output without reasons. As a result, the model employs XAI computing technology to serve its needs. A literature review of XAI computing was carried out by Adadi and Berrada [55] but did not consider the uncertainty problem, which is the biggest issue when a critical system needs to handle big data in a real-world environment. Hagras [56] has reviewed XAI and the uncertainty problem, and proposed fuzzy logic as a suitable form of human-understandable AI. Unfortunately, his approach did not consider how to generate arguments and explanations for events, which is a fundamental requirement. Consequently, the goal of this section is to propose DUE computing based on the cause-and-effect perspective, and to outline a platform for supporting model learning for event arguments and explanations.

8.1 Fundamental Computing for DUE Bayes' theorem describes the plausible degree of a hypothesis (the effect) in terms of an observation (the cause). It can be written:

P(Effect \mid Cause) = \frac{P(Cause \mid Effect) \times P(Effect)}{P(Cause)}    (1)

The Effect stands for an event that we believe to have occurred (or will occur), while Cause represents an event that we have observed, and P() is the probabilistic encoding of the domain knowledge [57]. The theorem consists of four parts: (1) prior, (2) evidence, (3) likelihood, and (4) posterior, as defined below:
Prior P(Effect): the likelihood of the Effect event before we observe any Cause, which can be estimated from expert experience or empirical experiment.
Evidence P(Cause): the marginal likelihood, which is the probability of the Cause having already happened.
Likelihood P(Cause | Effect): the conditional probability of Cause given Effect, i.e., how likely it is that the Effect co-occurred with the Cause in the past.
Posterior P(Effect | Cause): the updated probability of the Effect after the Cause was observed.
For example, we receive a notification that an observed location is now "crowded" and are asked, "how likely is there to be a critical event given a crowded event?" The causal diagram in Fig. 4 can employ cause-and-effect computing based on Bayes' theorem to instantiate Cause (C) = crowded and Effect (E) = critical in Eq. (1):

P(E = critical \mid C = crowded) = \frac{P(C = crowded \mid E = critical) \times P(E = critical)}{P(C = crowded)}    (2)

The posterior in Eq. (2) stands for an argument that the situation is believed to be critical given the environmental observations. In contrast, if we receive a "critical" notification and are asked, "how likely is there to be a crowded event given a critical event?", then the problem runs in the opposite direction from the diagram in Fig. 4. However, it is simple to solve without redesigning or retraining the graph. Instead, both sides of Bayes' theorem can be multiplied by P(C = crowded) and divided by P(E = critical). The result is P(C = crowded | E = critical), which can be stated as:

P(C = crowded \mid E = critical) = \frac{P(E = critical \mid C = crowded) \times P(C = crowded)}{P(E = critical)}    (3)
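To make the arithmetic concrete, the sketch below evaluates Eqs. (2) and (3) in plain Python. The three input probabilities are illustrative placeholders only, not values reported in this chapter; in practice they would be estimated from expert experience or fitted from data.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem, Eq. (1): P(Effect | Cause) = P(Cause | Effect) * P(Effect) / P(Cause)."""
    return likelihood * prior / evidence

# Assumed, illustrative numbers: P(crowded | critical), P(critical), P(crowded).
p_crowded_given_critical = 0.70
p_critical = 0.08
p_crowded = 0.20

# Eq. (2): belief that the situation is critical after observing a crowded location.
p_critical_given_crowded = posterior(p_crowded_given_critical, p_critical, p_crowded)

# Eq. (3): the reverse question, answered from the same three quantities.
p_crowded_given_critical_check = posterior(p_critical_given_crowded, p_crowded, p_critical)

print(round(p_critical_given_crowded, 3))        # 0.28
print(round(p_crowded_given_critical_check, 3))  # 0.7, recovering the original likelihood
```

The same three numbers answer both directions of the question, which is exactly the flexibility that the following paragraph highlights.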

This means that the cause can be plausibly explained using existing causal diagrams and the evidence gathered from present observations. Equations (2) and (3) show that Bayes' theorem is both simple and powerful in terms of understanding. It is flexible enough to rearrange the assumptions used in cause-and-effect, and it can discover knowledge either from cause to effect or from effect to cause. However, understanding real-world events based on experts' causal thought is more complex, and requires a more sophisticated platform based on Bayes' theorem to encode those assumptions. This study will outline a suitable probabilistic representation for such complex events. According to the example from Sect. 3, when the user receives the notification "a large explosion sound was heard near the coffee shop at the mall", typical questions would be (1) "Is it likely to be critical?", (2) "Why is it likely to be critical?" and (3) "If it is critical, who is likely to be affected?". Only DUE-based arguments and explanations can answer these questions, since the answers are related to multiple concepts and the relationships between observations and conclusions. The system will require complex causal technology to mimic human-like intelligence. Our solution is Causal Bayesian Networks (CBNs), which represent knowledge using a probabilistic graphical model suitable for cause-and-effect understanding [58]. "Causal" represents a scientific hypothesis made to argue and explain an event, which must be evaluated by gathering real-world evidence from the environment. "Bayesian" refers to cause-and-effect computing based on Bayes' theorem. "Network" is a contextual bridge that represents cause-and-effect beliefs among the concepts as a transparent and flexible model, which both machines and humans can understand. It can be easily managed even when the scientific hypothesis changes and in an uncertain environment. CBNs adjust themselves as critical system requirements change, and so provide a very good way to model expert knowledge in real-world situations. CBNs regulate the learning model based on a probabilistic acyclic directed graph G = (V, E). V represents a group of random variables X_1, ..., X_n for the domain of interest, and E = {(X_m, X_n), ...} represents a set of cause-and-effect relationships


among V. The effects are viewed as child nodes that are conditionally influenced by their causes, which are parent nodes. The structural equation is:

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))    (4)

Pa(X_i) stands for the parents, i.e., the causes, of a random variable X_i. The marginal distribution of X_i is computed as follows:

P(X_i) = \sum_{\text{all } X_j,\; j \neq i} P(X_1, X_2, \ldots, X_n)    (5)

Fig. 5 The simple discrete Bayesian networks encoding the cause-and-effect concept: the three nodes Context, Question, and Conclusion are linked by two edges, and each node has states 1, 2, ..., n initialized to the uniform probability 1/n

From Eq. (5), X_i's marginal distribution is obtained from the joint probability, which factorizes according to the cause-and-effect assumptions, by summing over all random variables other than X_i (including the set Pa(X_i)). The cause-and-effect concepts from Fig. 4 in Sect. 7.2 can be modeled as CBNs by employing Eq. (4). This makes the child node conditioned on its parent nodes, as shown in Fig. 5. Figure 5 shows how CBNs encode the three random variables of Fig. 4 with two edges. A random variable models a concept as a node, such that a node's state_n represents an event randomly occurring according to its probability distribution. Edges represent cause-and-effect relationships between the nodes based on conditional independence using a Conditional Probability Distribution (CPD). For example, the CPD of the Conclusion node given the Question node, P(Conclusion | Question), is shown in Table 3.
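The factorization in Eq. (4), the marginalization in Eq. (5), and the uniform 1/n initialization of Fig. 5 and Table 3 can be written down directly with a probabilistic graphical model library. The sketch below is a minimal illustration using the open-source pgmpy package (one choice among several); the node names follow Fig. 5, each node is given three states, and all parameters are left at their uniform initial values.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Two edges linking the three concepts of Fig. 5: Context -> Question -> Conclusion.
model = BayesianNetwork([("Context", "Question"), ("Question", "Conclusion")])

uniform_root = [[1 / 3], [1 / 3], [1 / 3]]             # P(Context), a root factor of Eq. (4)
uniform_cpd = [[1 / 3] * 3, [1 / 3] * 3, [1 / 3] * 3]  # 3x3 CPT, the 1/n entries of Table 3

model.add_cpds(
    TabularCPD("Context", 3, uniform_root),
    TabularCPD("Question", 3, uniform_cpd, evidence=["Context"], evidence_card=[3]),
    TabularCPD("Conclusion", 3, uniform_cpd, evidence=["Question"], evidence_card=[3]),
)
assert model.check_model()

# Eq. (5): marginalize the joint distribution down to a single node.
inference = VariableElimination(model)
print(inference.query(["Conclusion"]))  # uniform, because nothing has been observed yet
```

Once evidence is attached to the Context node, the same query machinery returns an updated, non-uniform distribution over the Conclusion.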


Table 3 The Conclusion's CPD based on cause-and-effect relationships.
(a) The general form's CPD: P(Conclusion | Question) is a conditional probability table with one row per Conclusion state (state_1, ..., state_n) and one column per Question state (state_1, ..., state_m), and every entry is initialized to 1/n.
(b) The deterministic domain's CPD: P(Conclusion | Where question), with Conclusion in {critical, monitoring, normal} and the Where question in {crowded zone, normal zone, safe zone}; every entry is initialized to 1/3.

Fig. 6 Illimitable space between one and zero of the Conclusion given the Where's question: a bar chart of P(Conclusion | Where) for the states normal, monitoring, and critical across the crowded, normal, and safe zones

Figure 6 represents the proportions of all the possible Conclusion events given the Where question. They can vary dynamically when states are changed (e.g., by adding a new state or removing an existing one, or by creating a new cause or removing an existing one). This is the main reason why CBNs are so useful for modeling DUE. DUEs based on CBNs model knowledge in machine-and-human-readable forms, which helps critical systems argue and explain situations flexibly. Several systems employ CBNs to understand critical events. For instance, Tang et al. [59] used them to understand terrorist attacks, developing a methodology to explain the causal relationships between time, accident, and location. Wu et al. [60], Zarei et al. [61] and Li et al. [62] proposed CBN models to understand critical events in emergencies by interpreting how likely an event occurred under uncertain conditions. Although previous studies have applied CBNs to understanding critical events, none of them utilized DUE based on human critical thinking and contextual understanding. They employed the random variables and relationships to predict critical information from big data based on correlation structures but did not deeply consider


the meaningful context for supplying explanations based on causation. Human intervention is still necessary to reason about and interpret knowledge, which is both time-consuming and labor-intensive for experts to perform manually. There is still a need to integrate human critical thinking and contextual understanding for DUE using CBNs, which must represent knowledge about complex events so that answers can be made plausible. In the next section, we detail critical event understanding using CBNs-based XAI.

8.2 Computing Using CBNs-Based XAI DUE using CBNs-based XAI employs random variables and conditional probability distributions. Random variables transparently model the semantic concepts from experts, both for human critical thinking and for contextual understanding. Conditional probability distributions model the networks, or cause-and-effect relationships, where cause concepts point to the effects. The goal of this section is to highlight the understanding capabilities of CBNs-based XAI for manufacturing arguments, explanations, and counterfactuals. Counterfactuals are contrastive questions about real situations which are used to express possible future scenarios, and unexpected and undesirable events. Hilton and Slugoski [63] have argued that although we can predict factual events using visible evidence, it is only by asking diverse questions based on counterfactuals that people can adjust their thinking about deep knowledge and achieve a level of deep understanding. Modern AI-based Bayesian programming, on which CBNs-based XAI for DUE can build, was first proposed by Bessiere et al. [64] in a form that researchers can easily follow and apply in their research fields. Although AI-based Bayesian programming is outside the scope of this chapter, we still aim to propose a general computing perspective suitable for the qualitative design of a DUE model. Suppose that the semantic concepts and their relationships (the model structure) are determined by experts, and their parameters are fitted using big data. As an example, we shall introduce a model-based critical situation using the context of terrorism, which utilizes the raw data first studied by Sahoh and Choksuriwong [65]. The model using CBNs is diagrammed in Fig. 7. Figure 7 shows how the DUE model inherits prior experience and knowledge and fits the model's parameters using big data in a machine-understandable form. Each node models a concept using a random variable, and all its events utilize the random variable's states, although the node parameters in Fig. 7 only show the marginal distributions. The conditional nodes (the nodes with parents) additionally hold CPT parameters, such as the Conclusion displayed in Table 4. Table 4 shows the CPT of the Conclusion's distribution as a discrete node given all possible states of What and Where. Each state has a different chance of occurring, which can change dynamically because of its parents or ancestors. The probabilities between one and zero are treated as random, in keeping with the real world.

Fig. 7 CBNs encoding critical event-based terrorism: the nodes Who's Context, Who, Where's Context, Where, What's Context, What, and Conclusion, each shown with its marginal distribution (e.g., Conclusion: critical 0.076, monitoring 0.226, normal 0.696)

For example, the bold sub-table of Table 4 represents the distribution of P(Conclusion | Where, What = bombing), which can be viewed as Fig. 8. Figure 8 shows the Conclusion's probability, which can be sampled randomly based on its prior knowledge, represented by the different heights of the bars. For example, if What = bombing and Where = crowded, then the probability of Conclusion = normal is 0.009. This represents the most likely outcome but is not limited to it. Such a DUE model can be understood by machines and humans and is easy to modify and adjust if experts change their minds. In these ways, the early raw data ("a large explosion sound was heard near the coffee shop at the mall") can be automatically transformed into the observations (or model inputs) as 5W1H contexts. The "explosion sound" is the What while "the mall" is the Where. "The mall" conditionally explains the "crowded zone" as the state for Where. The Who state is assigned a "weak target" (i.e., people with no basic ability to survive a critical situation), which is influenced by "crowded zone" in the context of "citizens". A "crowded zone" helps users understand the difficulty and seriousness of the response to the critical event. The What's state is likely to be "bombing", inferred from "explosion sound" as its context. Finally, the Conclusion's state is "critical", which is explained by Where and What. Although the evidence for the random variables is incomplete (see Fig. 7, where only the bold text is observed), the DUE model must infer and give useful answers to all the questions even in an uncertain situation. "Why" is the highlight of the DUE model for producing answers about the different states of all the random variables. The interaction between the system and the environment based on questions and answers is shown in Table 5.
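A minimal sketch of how such evidence-conditioned questions can be computed is shown below, again using pgmpy. The structure follows the What's Context → What and Where's Context → Where → Conclusion fragment of Fig. 7, but every state list and numeric CPT value here is an assumed placeholder chosen only for illustration; the chapter's fitted parameters are not reproduced.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# A reduced fragment of Fig. 7: contexts cause the What/Where questions,
# which together cause the Conclusion. All numbers are illustrative only.
model = BayesianNetwork([
    ("WhatContext", "What"), ("WhereContext", "Where"),
    ("What", "Conclusion"), ("Where", "Conclusion"),
])

what_context = TabularCPD("WhatContext", 2, [[0.5], [0.5]],
                          state_names={"WhatContext": ["explosion sound", "fireworks"]})
where_context = TabularCPD("WhereContext", 2, [[0.5], [0.5]],
                           state_names={"WhereContext": ["mall", "park"]})
what = TabularCPD(
    "What", 2,
    [[0.8, 0.1],   # P(bombing   | explosion sound), P(bombing   | fireworks)
     [0.2, 0.9]],  # P(non-crime | explosion sound), P(non-crime | fireworks)
    evidence=["WhatContext"], evidence_card=[2],
    state_names={"What": ["bombing", "non-crime"],
                 "WhatContext": ["explosion sound", "fireworks"]})
where = TabularCPD(
    "Where", 2,
    [[0.9, 0.2],   # P(crowded zone | mall), P(crowded zone | park)
     [0.1, 0.8]],  # P(safe zone    | mall), P(safe zone    | park)
    evidence=["WhereContext"], evidence_card=[2],
    state_names={"Where": ["crowded zone", "safe zone"],
                 "WhereContext": ["mall", "park"]})
conclusion = TabularCPD(
    "Conclusion", 2,
    # columns: (bombing, crowded), (bombing, safe), (non-crime, crowded), (non-crime, safe)
    [[0.9, 0.4, 0.1, 0.01],
     [0.1, 0.6, 0.9, 0.99]],
    evidence=["What", "Where"], evidence_card=[2, 2],
    state_names={"Conclusion": ["critical", "normal"],
                 "What": ["bombing", "non-crime"],
                 "Where": ["crowded zone", "safe zone"]})

model.add_cpds(what_context, where_context, what, where, conclusion)
assert model.check_model()

# "Is it likely to be critical?" given the observed contexts of the example message.
answer = VariableElimination(model).query(
    ["Conclusion"],
    evidence={"WhatContext": "explosion sound", "WhereContext": "mall"})
print(answer)
```

With the contexts instantiated, the same engine can explore the contrastive questions of Table 5 by swapping the evidence (e.g., "fireworks" instead of "explosion sound") and comparing the resulting distributions.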

Table 4 The Conclusion's CPT given What and Where, P(Conclusion | What, Where): a conditional probability table over Conclusion (critical, monitoring, normal) for every combination of What (bombing, shooting, non-crime) and Where (crowded, normal, safe zone). The bombing/crowded-zone column, shown in bold in the original table, is critical 0.863, monitoring 0.128, normal 0.009; each of the remaining columns likewise sums to one.

Fig. 8 Conditional probability distribution between one and zero for the Conclusion given Where, What = bombing: a bar chart of P(Conclusion | Where, What = bombing) for the states normal, monitoring, and critical across the crowded, normal, and safe zones

Table 5 Critical questions and their plausible answers inferred by the DUE model
(1) Is it likely to be critical?
Yes, it is. (Factual answer)
(2) How likely is it to be critical?
It looks like a bombing in a crowded zone is the cause of the critical event, according to the evidence from your environment. (Explanation)
(3) If so, who is likely to be affected?
The event happened in a crowded zone, and the affected people are a weak target, who are likely to be citizens. (Argument)
(4) What if soldiers had suffered?
They would have been a hard target in a safe zone, which would have resulted in a monitoring situation. (Explanation)
(5) What if the sound had come from a gun?
The gun sound would have caused shooting rather than bombing. The conclusion would then have been normal. (Explanation)
(6) What if a suspected covid-19 case had been observed?
My model has no experience with covid-19, so you should ask a domain expert to teach me first. (Explanation and Argument)

Table 5 shows some common questions that users often ask in critical situations. Question number 1 is a whether-question, and its answer can be a factual answer (e.g., "yes" or "no"). Question number 2 is a how-question, a presupposition that extends upon the whether-question. How-questions rely on their parent nodes to argue the most possible causes. The factual answer to a how-question is based on determining the most associative cause, which could be controlled or prevented to stop its effect from happening. Question number 3 results in an argumentative answer that users employ to find the root cause of an event, which in this case is the people who suffer the most. The "crowded zone" caused by "shopping mall" supports the belief that the sufferers are a weak target such as citizens. Although whether-questions and how-questions provide helpful knowledge, they are not counterfactuals. A counterfactual is a not-factual event, often written as an if-clause, which utilizes a simulated event that is contrasted with another event. Questions 4-5 are the most challenging, since they utilize simulated events, but they are very useful for future planning and underpin human why-questions (i.e., why x rather


than non-x). The DUE model provides answers that depend on the knowledge gained from past experience, which mimics the human understanding ability. Question number 6 shows how a user might ask about a critical situation in a different domain. The system cannot provide knowledge about "covid-19 suspected" but can explain what it does not know. The question is related to the DUE model because covid-19 is considered a critical event, so the challenging research question is "Can the system learn new covid-19 outbreak problems while utilizing its previous experience?". One key human ability is to re-utilize old experience to help learn new tasks faster. When AlphaGo [66] defeated a professional Go player, the following questions were asked: "Does AlphaGo possess human-like intelligence?", "Can AlphaGo explain how it beat the human?" and "Can the software share, reuse, and manage its knowledge, and transfer it to humans?". Children do not learn new knowledge solely from observation but also via experience and prior knowledge; software must mimic these learning approaches. Critical situations especially require this ability to construct knowledge by utilizing experience and prior knowledge, in order to learn new tasks based on evidence from a new environment. This parallels how humans solve new problems using a mix of rationality, prior knowledge, and related context. Possessing such an ability will make software able to evolve to understand a situation in a human-like way. XAI can help since it uses an explicit causal model that allows the software to share, reuse, and manage its experience and knowledge. The CBNs-based XAI model employs random variables and semantic relationships when it considers the question "What if a suspected covid-19 case had been observed?". This is a new domain for the critical problem, which must be explained in the same way as critical situations such as terrorism. It does this by utilizing manageable, shareable, and reusable data to construct a new knowledge-based system. The construction of a covid-19 outbreak model uses prior knowledge (belief assumptions) and experience (prior probabilities) from the model in Fig. 7. These are transferred as a blueprint for a new model, because its problem base is relevant but needs to be modified for this particular context. For example, the Who context can be reused by modifying some of its states related to the outbreak type. The What question and its context can be customized by adding new states that are relevant to covid-19. The new model design for covid-19 is shown in Fig. 9.

(Figure 9 replaces the terrorism contexts with outbreak-specific states: Who's Context with older people, qualified people, patient/children, and citizens, each initialized to 0.25; What's Context with high temperature, cough, sneeze, and faint, each 0.25; and What with suspected epidemic 0.5 and non-epidemic 0.5; the Who, Where's Context, Where, and Conclusion nodes keep the marginals of Fig. 7.)


Fig. 9 CBNs encoding a critical event-based outbreak for Covid-19

monitoring situations. In general, models from different related environments have useful knowledge for understanding a new problem, which is a powerful feature of the XAI model. Earlier questions included: “Can AlphaGo explain how its reasons?” and “Can it share, reuse, and manage its knowledge, and transfer it to humans?” Only high-level human learning abilities can provide suitable answers related to deep knowledge and understanding. The XAI model mimics this kind of human intelligence by employing cause-and-effect thinking based on CBNs. In other words, DUE is ready to be applied to real-world critical systems in multi-domains. Open issues and future directions for DUE are discussed in the next section.

9 DUE Trends and Future Outlooks Critical systems require new paradigms to discover deep knowledge, especially as they become more complex in the era of big data, and more difficult to analyze due to chaotic and random events beyond the scope of deep event prediction. DUE fills these gaps when dealing with big data and uncertain events. Some of the remaining major challenges and opportunities for critical situations, including disasters, economics, and safety and security, are discussed below.

9.1 Disasters

Natural and human-made disasters are perhaps the most critical kind of events. When search-and-rescue personnel manage disasters through prevention, preparedness, response, and recovery, they utilize prior knowledge and real-time observations to understand the disaster’s impact. Recent communication technologies provide much more disaster data, including satellite streaming video, real-time weather (temperature, wind speed, and humidity), and social media platforms, which might reveal loss or damage. All of this makes a disaster harder to understand, since the evidence is dynamic, brief, or overwhelming, which complicates disaster management. To support search-and-rescue personnel dealing with uncertainty, an automatic disaster-analysis DUE approach is needed to monitor and understand the evolution of the situation. Such an approach is still in its infancy, and new contributions are needed to meet the challenge.

9.2 Economic Consequences

Analyzing economic trends and movements in a highly competitive world is difficult. The dynamic problems include the high throughput of the digital economy and marketing. Economic decisions become very sensitive, and economists must address unpredictable events carefully to avoid serious damage. Real-time knowledge is required to explain possible futures and prevent situations from turning bad and having economic consequences. Planning must consider economic conditions, internet trade, global marketing, and the interconnections between global customers and businesses based on modern devices and the Internet of Things (IoT) [68]. This is obviously a very hard and time-consuming task, first to discover knowledge and then to promptly understand it. Suitable cause-and-effect designs that integrate traditional and new models demand further research. Comparisons between existing prior knowledge and new market problems can be handled by applying DUE to decision-making. State-of-the-art DUE marketing platforms that combine prior knowledge with a new environment would be an impressive innovation.

9.3 Safety and Security

The rapid growth of the world population has caused many problems for urban planning, such as traffic, public transportation, environmental pollution, and energy usage (electricity, water, and gas). The “smart city” is a new concept [69] for improving the quality of urban living using information and communication technology (ICT)-based solutions. The aim is that public safety and security will utilize embedded
technology, such as sensors and the IoT, to improve communication between citizens and government through video surveillance, newsfeeds, social media, and other approaches. This requires a complete understanding of each citizen’s daily activities and lifestyle via systems that monitor events. Informed government administrators can better plan and notify citizens through real-time alerts and warning messages sent to their devices. The drawback is the need to analyze huge amounts of data to uncover insights that directly impact decision making. Planning-and-administration-based DUE should be able to support such decision-making systems by incorporating prior knowledge to deal with complex problems effectively.

10 Conclusions

Big data and artificial intelligence have become the raw materials of knowledge engineering in the twenty-first century, especially for critical systems. Although current automated critical systems work well in many diverse fields by employing Deep Event Prediction, their knowledge lacks explanation. In other words, such a system cannot generate deep knowledge that supplies background reasons for its critical events. This means that users must spend time and effort understanding the reasons for themselves in order to decide on the best plan. This chapter has proposed a novel solution to this problem which emulates human-like intelligence. Deep Event Understanding (DUE) extends eXplainable Artificial Intelligence (XAI) via the paradigm of Granular Computing (GrC), which provides a blueprint for machine learning that constructs an explainable and argumentative model for understanding. The differences between the learning processes of humans and machines are addressed by DUE using cause-and-effect reasoning that goes beyond the deep event prediction approach. It utilizes XAI-based human-like intelligence modeling and Bayes’s theorem encoded as Causal Bayesian Networks that model human critical thinking. Random variables and cause-and-effect relationships model the semantics needed to deal with complex events in uncertain and random situations. The resulting DUE model exposes deep knowledge through its argumentative and explainable abilities. Even in chaotic situations, it still generates useful knowledge for supporting decision making. DUE is manageable, shareable, and reusable, which means that it is able to solve new problems, particularly in critical situations that are random in nature. DUE trends for future systems will employ integration concepts to build models using prior knowledge to solve uncertain problems in dynamic environments.

Acknowledgements The authors wish to thank the editors, Professor Dr. Witold Pedrycz and Professor Dr. Shyi-Ming Chen, for inviting us to contribute this chapter, and for helping us to clarify its contributions. Our special thanks go to Professor Dr. Shyi-Ming Chen for his administrative support and encouragement during the writing process. We would also like to thank the three anonymous reviewers for their time and effort. Their comments and suggestions helped us to improve the chapter in several ways.

References 1. Ranajit, R., et al.: A short review on applications of big data analytics. In: Emerging Technology in Modelling and Graphics, pp. 265–278. Springer Singapore (2020) 2. Pencheva, I., Esteve, M., Mikhaylov, S.J.: Big Data and AI—A transformational shift for government: so, what next for research? Public Policy Adm. 35(1), 24–44 (2018) 3. Lee, I., Shin, Y.J.: Machine learning for enterprises: applications, algorithm selection, and challenges. Bus. Horiz. 63(2), 157–170 (2020) 4. Bernardi, S., Gentile, U., Nardone, R., Marrone, S.: Advancements in knowledge elicitation for computer-based critical systems. Futur. Gener. Comput. Syst. (2020) 5. Shan, S., Yan, Q.: Emergency Response Decision Support System. Springer, Singapore (2017) 6. Castillo, C.: Big Crisis Data: Social Media in Disasters and Time-Critical Situations. Cambridge University Press, New York, NY (2016) 7. Ghafarian S.H., Yazdi, H.S.: Identifying crisis-related informative tweets using learning on distributions. Inf. Process. Manag. 57(2) (2020) 8. Kumar, A., Singh, J.P., Dwivedi, Y.K., Rana, N.P.: A deep multi-modal neural network for informative Twitter content classification during emergencies. Ann. Oper. Res., 1–32 (2020) 9. Chaudhuri, N., Bose, I.: Exploring the role of deep neural networks for post-disaster decision support. Decis. Support Syst. 130, 113234 (2020) 10. Pedrycz, W.: Granular computing for data analytics: A manifesto of human-centric computing. IEEE/CAA Journal of Automatica Sinica 5(6), 1025–1034 (2018) . Institute of Electrical and Electronics Engineers Inc. 11. Kolb, D.A.: Management and the learning process. Calif. Manage. Rev. 18(3), 21–31 (1976) 12. Jordan, M.I., et al.: Machine learning: trends, perspectives, and prospects. Science 349(6245), 255–260 (2015) 13. Bargiela, A., Pedrycz, W.: Granular Computing as an Emerging Paradigm of Information Processing. In: Computing, G. (ed.) Boston, pp. 1–18. Springer, MA (2003) 14. Miller, T., Howe, P., Sonenberg, L.: Explainable AI: Beware of Inmates Running the Asylum or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences (2017) 15. Chen, J., Chen, H., Wu, Z., Hu, D., Pan, J.Z.: Forecasting smog-related health hazard based on social media and physical sensor. Inf. Syst. 64, 281–291 (2016) 16. Yu, M., Huang, Q., Qin, H., Scheele, C., Yang, C.: Deep learning for real-time social media text classification for situation awareness–using Hurricanes Sandy, Harvey, and Irma as case studies. Int. J. Digit. Earth, 1–18 (2019) 17. Kumar A., Singh, J.P.: Location reference identification from tweets during emergencies: a deep learning approach. Int. J. Disaster Risk Reduct. 33(October 2018), 365–375 (2019) 18. Sit, M.A., Koylu, C., Demir, I.: Identifying disaster-related tweets and their semantic, spatial and temporal context using deep learning, natural language processing and spatial analysis: a case study of Hurricane Irma. Int. J. Digit. Earth (2019) 19. Adams S., Zaharchuk, D.: Empowering governments through contextual computing. IBM (2015) 20. IBM Corporation 2016: IBM Intelligent Operations Center for Emergency Management Adapt rapidly to complex, changing environments. IBM Anal., 1–12 (2016) 21. Lecue, F.: On the role of knowledge graphs in explainable AI. Semant. Web 11(1), 41–51 (2020) 22. David, G., Mark, S., Jaesik, C., Timothy, M., Simone, S., Guang-Zhong, Y.: XAI—explainable artificial intelligence. Sci. Robot. 4(37) (2019) 23. 
Fernandez, A., Herrera, F., Cordon, O.: Evolutionary fuzzy systems for explainable artificial intelligence: why, when, what for, and where to? IEEE Comput. Intell. Mag. 14(1), 69–81 (2019) 24. Murphy, R.R.: Emergency informatics: using computing to improve disaster management. Computer (Long. Beach. Calif) 20(6), 19–27 (2016) 25. Barredo Arrieta, A., et al.: Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)

26. Liu, H., Cocea, M., Ding, W.: Multi-task learning for intelligent data processing in granular computing context. Granul. Comput. 3(3), 257–273 (2018) 27. Flouris, I., Giatrakos, N., Deligiannakis, A., Garofalakis, M., Kamp, M., Mock, M.: Issues in complex event processing: status and prospects in the Big Data era. J. Syst. Softw. 127, 217–236 (2017) 28. Terroso-Sáenz, F., Valdés-Vela, M., Campuzano, F., Botia, J.A., Skarmeta-Gómez, A.F.: A complex event processing approach to perceive the vehicular context. Inf. Fus.s 21(1), 187–209 (2015) 29. Wang, Y., Gao, H., Chen, G.: Predictive complex event processing based on evolving Bayesian networks. Pattern Recognit. Lett. 105, 207–216 (2018) 30. Tao, F., Zuo, Y., Da Xu, L., Zhang, L.: IoT-Based intelligent perception and access of manufacturing resource toward cloud manufacturing. IEEE Trans. Ind. Inf. 10(2), 1547–1557 (2014) 31. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., Alsaadi, F.E.: A survey of deep neural network architectures and their applications. Neurocomputing 234(December 2016), 11–26 (2016) 32. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R., Muharemagic, E.: Deep learning applications and challenges in Big Data analytics. J. Big Data 2(1), 1 (2015) 33. Portugal, I., Alencar, P., Cowan, D.: The use of machine learning algorithms in recommender systems: a systematic review (2015). arXiv1511.05263 [cs] 34. Chen, Y., Jiang, C., Wang, C.Y., Gao, Y., Liu, K.J.R.: Decision learning: data analytic learning with strategic decision making. IEEE Signal Process. Mag. 33(1), 37–56 (2016) 35. Andrzej B., Witold, P.: Granular computing. In: Handbook on Computational Intelligence: Volume 1: Fuzzy Logic, Systems, Artificial Neural Networks, and Learning Systems, pp. 43–66 (2016) 36. Loia, V., D’Aniello, G., Gaeta, A., Orciuoli, F.: Enforcing situation awareness with granular computing: a systematic overview and new perspectives. Granul. Comput. 1(2), 127–143 (2016) 37. Li, X., Zhou, J., Pedrycz, W.: Linking granular computing, Big Data and decision making: a case study in urban path planning. Soft Comput. 24(10), 7435–7450 (2020) 38. Ko, Y.C., Ting, Y.Y., Fujita, H.: A visual analytics with evidential inference for Big Data: case study of chemical vapor deposition in solar company. Granul. Comput. 4(3), 531–544 (2019) 39. Zhang, C., Dai, J.: An incremental attribute reduction approach based on knowledge granularity for incomplete decision systems. Granul. Comput. 5(4), 545–559 (2019) 40. Skowron, A., Jankowski, A., Dutta, S.: Interactive granular computing. Granul. Comput. 1(2), 95–113 (2016) 41. Wilke, G., Portmann, E.: Granular computing as a basis of human–data interaction: a cognitive cities use case. Granul. Comput. 1(3), 181–197 (2016) 42. Pownuk A., Kreinovich, V.: Granular approach to data processing under probabilistic uncertainty. Granul. Comput., 1–17 (2019) 43. Dutta, P.: Multi-criteria decision making under uncertainty via the operations of generalized intuitionistic fuzzy numbers. Granul. Comput., 1–17 (2019) 44. Kuang, K., et al.: Causal inference. Engineering 6(3), 253–263 (2020) 45. Crowder J.A., Carbone, J. N.: Methodologies for continuous life-long machine learning for AI systems. In: 2018 World Congress in Computer Science, Computer Engineering and Applied Computing, CSCE 2018—Proceedings of the 2018 International Conference on Artificial Intelligence, ICAI 2018, pp. 44–50 (2018) 46. 
Mannering, F., Bhat, C.R., Shankar, V., Abdel-Aty, M.: Big Data, traditional data and the tradeoffs between prediction and causality in highway-safety analysis. Anal. Methods Accid. Res. 25, 100113 (2020) 47. Yoshida, Y.: Dynamic risk-sensitive fuzzy asset management with coherent risk measures derived from decision maker’s utility. Granul. Comput., 1–17 (2019) 48. Pal, A., Rathore, A.S.: “Context-Aware Location Recommendations for Smart Cities”, in Smart Cities Performability, Cognition, & Security, pp. 105–114. Springer, Cham (2020) 49. Chen, D., Xu, W., Li, J.: Granular computing in machine learning. Granul. Comput. 4(3), 299–300 (2019)

50. Steel, A.J., Owen, L.H.: Advance care planning: the who, what, when, where and why. Br. J. Hosp. Med. 81(2), 1–6 (2020) 51. Boggs, A.M., Arvin, R., Khattak, A.J.: Exploring the who, what, when, where, and why of automated vehicle disengagements. Accid. Anal. Prev. 136(July 2019), 105406 (2020) 52. Xu, Z., et al.: Crowdsourcing based description of urban emergency events using social media Big Data. IEEE Trans. Cloud Comput. 8(2), 387–397 (2020) 53. Maynard, R.S.: What is commonsense? Aust. Q. 6(23), 111–115 (1934) 54. Mayes, G.R.: Argument-explanation complementarity and the structure of informal reasoning. Inf. Log. 30(1), 92–111 (2010) 55. Adadi, A., Berrada, M.: Peeking inside the black-box: a survey on Explainable Artificial Intelligence (XAI). IEEE Access 6, 52138–52160 (2018) 56. Hagras, H.: Toward human-understandable, explainable AI. Computer (Long. Beach. Calif). 51(9), 28–36 (2018) 57. Barber, D.: Bayesian reasoning and machine learning, 1st ed. Cambridge University Press, UK (2012) 58. Pearl, J., Mackenzie, D., Makenzie, D., Pearl, J.: The book of why: the new science of cause and effect. In: Basic Books, p. 402 (2018) 59. Tang, Z., Li, Y., Hu, X., Wu, H.: Risk analysis of urban dirty bomb attacking based on Bayesian network. Sustain. 11(2), 1–12 (2019) 60. Wu, B., Tian, H., Yan, X., Guedes Soares, C.: A probabilistic consequence estimation model for collision accidents in the downstream of Yangtze River using Bayesian networks. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. (2019) 61. Zarei, E., Khakzad, N., Cozzani, V., Reniers, G.: Safety analysis of process systems using Fuzzy Bayesian Network (FBN). J. Loss Prev. Process Ind. 57, 7–16 (2019) 62. Li, S., Chen, S., Liu, Y.: A method of emergent event evolution reasoning based on ontology cluster and bayesian network. IEEE Access 7, 15230–15238 (2019) 63. Hilton, D.J., Slugoski, B.R.: Knowledge-based causal attribution. The abnormal conditions focus model. Psychol. Rev. 93(1), 75–88 (1986) 64. Bessiere, P., Mazer, E., Ahuactzin, J.M., Mekhnacha, K.: Bayesian Programming Chapman & Hall/CRC Machine Learning & Pattern Recognition Series. CRC Press, Illustrate (2013) 65. Sahoh, B., Choksuriwong, A.: Automatic semantic description extraction from social Big Data for emergency management. J. Syst. Sci. Syst. Eng. 29(4), 412–428 (2020) 66. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017) 67. Marcot, B.G., Penman, T.D.: Advances in Bayesian network modelling: integration of modelling technologies. Environ. Model. Softw. 111(March 2018), 386–393 (2019) 68. Habibzadeh, H., Boggio-Dandry, A., Qin, Z., Soyata, T., Kantarci, B., Mouftah, H.T.: Soft sensing in smart cities : handling 3Vs using recommender systems , machine intelligence , and data analytics. IEEE Commun. Mag. 56(2), 78–86 (2018) 69. Ben Sta, H.: Quality and the efficiency of data in ‘Smart-Cities’. Futur. Gener. Comput. Syst. 74, 409–416 (2017)

Interpretation of SVM to Build an Explainable AI via Granular Computing Sanjay Sekar Samuel, Nik Nailah Binti Abdullah, and Anil Raj

Exploring the use of Syllogistic Rules to Diagnose Coronary Artery Disease

Abstract Machine learning (ML) is known to be one of the chief tools used for extracting information from data and predicting output with exceptional accuracy. However, this accuracy comes with a lack of explainability. This becomes an especially serious problem when it comes to analyzing medical data and making diagnoses from it. These ML models are usually built by expert coders without any incorporation of, or feedback on, contextual information from subject matter experts (physicians in our case). This in turn leads physicians to be skeptical about these ML medical diagnosis models due to their lack of transparency. Additionally, it may also cause the diagnosis given by the AI to have a gap in explainability. Thus, we aimed to overcome these challenges by incorporating physicians at an early development stage of our XAI, from the initial stage of validating the granular information extracted from the SVM algorithm to the final validation of the explanations given by the XAI mobile app. Through this, we were able to eliminate the gap in explainability which could have caused the AI to make incorrect diagnoses. Moreover, we also achieved higher levels of trust and improved the overall performance of the XAI, which aided the physicians in various ways.

Keywords Explainable AI · Granular computing · Information granules · Human-centric · Machine learning · Support vector machine · Coronary artery disease

S. S. Samuel (B) · N. N. B. Abdullah Monash University, Kuala Lumpur, Malaysia e-mail: [email protected] N. N. B. Abdullah e-mail: [email protected] A. Raj IHMC, Pensacola, FL, USA e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_5


1 Introduction

Artificial Intelligence, commonly known as AI, is the field of science and engineering concerned with the computational understanding of what is commonly called intelligent behavior, and with the creation of artifacts that exhibit such behavior like humans [1]. In recent years, AI has increasingly been featured in products and services we use on an everyday basis. For instance, Google uses an AI that acts as a spam filter to ensure that users are protected from scam artists that lurk around the internet. Even while having a conversation with the virtual assistant in our phone (Siri, Google Assistant), we are engaging in a conversation with an AI application that has been trained on large amounts of data. Today, AI is no longer an aspirational technology used exclusively by academics and researchers. Instead, it is being used in almost every field conceivable in the form of Machine learning (ML), a branch of AI used to identify patterns in a stream of inputs and predict the output. In recent times, ML has gained a lot more popularity in various fields of research like business analytics, medical diagnosis, and many more because of its accuracy, speed, and optimization capabilities [2]. In medical diagnosis, ML is now widely used in multiple different fields like tumor identification in cancer diagnosis, diabetic retinopathy, and many more [3]. Although ML algorithms have shown exceptional performance in diagnosing a patient’s condition, unfortunately, physicians are still unsatisfied with their outcomes [4]. This occurs mainly because of the black box within the ML algorithm that hides its internal workings from the users (physicians), so they cannot see how it arrived at a particular diagnostic decision [5, 6]. As [7] said, without having a complete understanding of a system, it is impossible to have trust towards something or someone; an important case in point is medical practice. ML cannot present its outcomes in a meaningful way to a medical expert (physician) who may ask “why?”, and the assurance “I simply know from probability” is hardly appropriate to settle the case. Consequently, this causes physicians to develop distrust in the machine and look for other means of diagnosis. The black-box system yields high-performance results even though it cannot provide a sound rationale [8]. An example of this is the use of an ML algorithm to classify whether a given patient has heart disease or not. Since the algorithm works on a binary system, the final output of the algorithm would be either a one (1) or a zero (0), where one (1) can infer that a patient has heart disease and zero (0) that they do not. Even though this ML algorithm could provide us with an explanation as to why it classifies a patient as having heart disease, it would not give a reasonable explanation [9]. For instance, it would fail to provide us with the multiple factors that led to the classification (e.g., the patient has blocked arteries, the patient has high cholesterol, etc.); instead, it would simply give the prediction result, which would seem incoherent and absurd to a regular person. Furthermore, a black-box ML system might not provide the evidence a subject matter expert would need to be convinced that the outcome given by it is trustworthy.

1.1 The Era of Explainable AI with Granular Computing

The success of ML computation in many fields [10–12], together with the rise of challenges in using ML, especially in the healthcare domain, led to the third wave of AI known as Explainable Artificial Intelligence (XAI) [13]. XAI is designed to combine the performance of statistical learning with the transparency of knowledge systems. XAI models are therefore able to be transparent in their decisions by explaining the reasoning behind why a particular decision was made. By analyzing the reasoning behind the model, a user can judge whether or not they should trust the decision made by it. We decided to harness this strength of XAI to build a mobile App that could be used to provide a transparent diagnosis of Coronary Artery Disease (CAD). The XAI is built by leveraging the power of granular computing in SVM. This is needed because, when we take the dataset as a whole and classify it with SVM, things tend to get very complex and almost impossible to decipher at a human level [14]. However, with information granules, we were able to decompose the complex data by zooming in and extracting small abstracts of information granules. These information granules were then individually classified with SVM to visualize how each abstract in the data contributed to the target. The entire process was undertaken while validating with physicians as the crucial human-centric stage to develop the XAI mobile app. With this, we organize the chapter as follows: In Sect. 2, we begin by giving the problem statement and the gap in research which we aim to overcome, and in Sect. 3 we give some related work to this study. In Sect. 4, we give a brief background on some of the important concepts used in this study. In Sects. 5 and 6, we exhibit the two main research methodologies undertaken for this study, followed by how they were implemented. In Sect. 7, we exhibit a sample outcome by sharing some screenshots of the mobile App, and in Sect. 8 we discuss the results obtained from it. Finally, in Sects. 9 and 10 we finish this chapter with the conclusion and future work.

2 The Problem with a Gap in Explainability

There have been several studies in which researchers developed various ML algorithms to design more interpretable and accurate models for diagnosing diseases [3, 15–18]. These studies include the use of ANN, fuzzy models, SVM, etc., to build more robust and precise models to perform accurate diagnosis. However, their work falls short when it comes to incorporating a human-centric model and taking into consideration the complexities involved in a medical setting. Before heading into how we aim to close the gap in explainability, we will first emphasize the importance of having a human-centric model, especially in a medical setting. Diagnosis in medical settings requires subject matter experts’ (physicians’) contextual information in interpreting given information. For instance, in our case, CAD occurs when the coronary artery of the heart gets blocked due to multifactorial
causes such as cholesterol level, smoking, and diabetes, and also including demographic and behavioral information such as a person’s age, lifestyle, genetic predisposition, and family medical history [19]. Given these multifactorial causes for CAD, which are also known as contextual information, an ML algorithm may perform a prediction based on its training data (which could have missing information) but with a lack of explainability. This means that a physician or a patient using it would never know if the diagnosis is right or wrong because of missing contextual information that causes a gap in explainability. Without explanations that carry contextual information from the physician’s experiential knowledge, an ML algorithm will always be handicapped when it comes to explainability in diagnosing diseases. We say this with affirmation because, when we analyzed the two datasets classified with ML algorithms in this study, we discovered that they were missing a lot of vital contextual information that would have been crucial in diagnosing a patient’s condition. Hence, regardless of which ML algorithm we chose to use, it would always be limited by this missing contextual information. On top of that, these multifactorial causes do not show up until they reach a climacteric stage [20]. Due to this, early diagnosis of CAD is usually invasive and costly, typically performed by an angiogram or echocardiogram test [21]. Applying XAI in the context of CAD diagnosis can significantly save costs by avoiding tests that may be unnecessary for the patients. Developing an XAI system that can address these challenges is especially cost- and time-saving in developing countries such as Malaysia and India, where patients are heavily subsidized by the government [22]. Therefore, with these given challenges, our research study aims to fill the gap in explainability by using contextual information from a human-centric approach at an early stage: to explore the components and information required to develop an XAI that can assist physicians in diagnosing CAD. This human-centric approach is especially crucial for medical AI research to develop trust and to improve the overall performance of the XAI by including explanations that have experiential contextual information from physicians. Thus, we focus on answering the following objectives in this chapter: (1) use granular computing to extract information granules and use them with SVM’s classification to model syllogistic rules in a knowledge base as the two primary components to investigate; (2) communicate the interpretation, with the missing information, to physicians in a manner that develops trust and removes the gap in explainability; (3) emphasize the importance of having a human-centric model at the earliest stage of development.

3 Related Work

Prior implementations by [18, 23–26] have shown great potential in using SVM-based CAD diagnostic models. Additionally, there has also been past work by [27, 28] in which they accomplished extracting understandable and accurate rules from SVM. Rule extraction from SVM has also shown great prospects in predicting the post-operative life expectancy of lung cancer patients, as shown by [29].

Neural Network (NN) based AI assistants have also shown effective performance in diagnosing CAD and related conditions in medical studies [30]. Regarding NNs, we also see the paper by [31] on how rules can be extracted from an NN by analyzing the input and output nodes in a single feed-forward NN layer. The paper by [32] presents an interesting approach where decision trees are used to decompose the entire NN and extract rules from it by using the UCI repository. As we see from the work of [33], decision tree algorithms also support design tools useful for explaining and interpreting predictions derived from ML algorithms. Work done by [34] shows that representing algorithmic predictions through human–machine explanatory dialogue systems that employ contrastive explanations and exemplar-based explanations can provide transparency and improve trust in XAI. In this context, the XAI uses contrastive rather than direct explanations to illustrate the cause of an event with respect to some other event. When users accept an explanation as good and relatively complete, they can develop a representative mental model, which in turn fosters appropriate trust towards the AI system, as shown by [35]. Moreover, explanation is key because enhancing the explanatory power of intelligent systems can result in easier understanding and also improves their overall decision-making and problem-solving performance [36]. There are also various known models built by using granular computing for medical classification, and these models have shown exceptional performance in diagnosing and classifying patients depending upon their medical condition [37]. Additionally, these models have also been shown to produce clear-cut transparent and interpretable results. Thus, with these information granules, the physicians could easily interpret the internal workings of the model and its outcomes [38]. This is crucial because, without trust, the physician may feel conflicted about accepting the diagnosis of the model even when the model outperforms the physician. And without the physician’s trust in the model, it follows that the patient would also have conflicting thoughts towards the outcome of the model, as shown by [39]. Thereby, we used this as an initiative to help the physicians develop trust with the XAI by allowing them to validate the information granules themselves and by involving them iteratively in all the primal stages of this study.

4 Background

In this section we give a brief overview of the main concepts used in this study and why they were specifically chosen.

4.1 SVM Algorithm

In simple terms, the main objective of SVM is to find a hyperplane in N-dimensional space (N = number of features) that can distinctly classify the given data points.

Though there can be various different hyperplanes that classify the given data points, with SVM we have the capability of finding the best hyperplane, the one with the maximum distance between the data points of both classes. Though other ML algorithms discussed in the related work section, like decision tree and neural network models, can demonstrate significant performance in classifying heart disease patients, SVM approaches are compatible with large and complex data (e.g., “omic” data) [9]. They can process data with higher accuracy than other supervised learning algorithmic approaches [40]. Omic data in this context refers to genomics, metabolomics, proteomics, and data from standardized electronic health records or precision medicine platforms. Additionally, compared to the hidden layers in NNs, SVM methods provide a direct pathway to visualize the underlying AI model’s feature characteristics through graphs while reaching similar or higher levels of accuracy [41]. This in turn led our physicians to find the visual explanation approach given by SVM’s graphs using hyperplane division (shown in Figs. 2 and 3) more accordant than the complex neural network and decision tree graphs. SVM also comes in handy for representing cases that are similar through the visualization of shared patient characteristics [11]. Resultantly, SVM’s position right after neural networks on the interpretability-accuracy spectrum, together with our physicians’ requirements, swayed us to choose SVM as the ML algorithm for our initial work to build the components of an XAI.
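As a brief illustration of the max-margin idea described above, the following sketch fits scikit-learn's SVC on a handful of made-up points (not the study's datasets); the hyperplane w·x + b = 0, the margin width 2/||w||, and the support vectors can be read directly from the fitted linear model.

```python
# A minimal max-margin sketch on made-up [age, cholesterol] points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[45, 250], [50, 280], [38, 240],    # assumed CAD cases
              [62, 180], [58, 190], [70, 170]])   # assumed non-CAD cases
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.4f*age + %.4f*chol + %.4f = 0" % (w[0], w[1], b))
print("margin width:", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```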

4.2 Granular Computing

Granular computing is defined as the method involved in processing basic chunks of information, called information granules. Information granules, as the name itself suggests, are collections of entities, usually at the numerical level, arranged together due to their functional adjacency, similarity, or the like [42]. Although there has been extensive work done on information granulation, it is still difficult to attain a perfect definition for it. However, from a philosophical and theoretical point of view, information granulation is the very essence of human problem solving, and hence has a significant impact on the design and implementation of intelligent systems [43]. The three significant factors that underlie human problem-solving cognition are granulation, organization, and causation [44]. Granulation involves decomposition of the whole into parts, organization involves the integration of parts into a whole, and causation involves the association of causes and effects [44]. Similarly, information granulation in computers works the same way, involving the need to split an incomplete, uncertain, or vague problem into a sequence of smaller and more manageable subtasks [45]. Once information is gathered from these subtasks, the results are combined, thereby helping to provide a better comprehension of the problem and gain insights from it. Additionally, information granules exhibit different levels of granularity, and they are grouped in each of these levels depending upon the size of the information granule
[42]. Granulation of information at different levels is an inherent and omnipresent activity of people, carried out with the intent of better understanding the problem. These information granules are transformed into syllogisms that can then be added to the knowledge base [46]. By using this methodology, as this chapter proceeds, we will see how these individually gathered information granules from the datasets play a vital role, when classified by SVM, in building a more interpretable intelligent system for medical settings.
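The granulate-organize cycle described above can be pictured with a small pandas sketch; the records and the age-band granulation below are made-up illustrations with hypothetical column names, not the study's data.

```python
# Granulation and organization on a toy patient table.
import pandas as pd

df = pd.DataFrame({
    "age":         [38, 45, 52, 60, 67, 72],
    "cholesterol": [210, 250, 230, 190, 260, 240],
    "cad":         [0, 1, 1, 0, 1, 1],
})

# Granulation: decompose the whole into parts (age-band granules).
df["age_band"] = pd.cut(df["age"], bins=[0, 50, 65, 120],
                        labels=["<50", "50-65", ">65"])

# Organization: integrate the per-granule summaries back into a whole.
summary = df.groupby("age_band", observed=True).agg(
    mean_chol=("cholesterol", "mean"),
    cad_rate=("cad", "mean"),
)
print(summary)
```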

4.3 Syllogisms

These generated granules of information need some form of representation for a computer to understand, and that is where syllogisms come in. As Aristotle said, if acquired knowledge cannot be represented in an understandable form, then the acquired knowledge is not empirical [47]. Thus, we use the method known as syllogism to represent the rules interpreted from the information granules. Based on the book by [48], syllogisms are short, two-premise deductive arguments in which at least one of the premises is a conditional. In other words, a syllogism is a valid argument form in logic, often used to describe a chain of events where one link leads to the next. Syllogisms are more reflective of reality, which comes in handy when making medical diagnoses [49]. Thus, a combined collection of these syllogisms interpreted from information granules represents a more comprehensible model of SVM’s classification. However, these syllogisms’ utility can only be properly manifested if they are stored in the right medium, and that is accomplished by using a knowledge base.

Knowledge Base

Knowledge bases (KB) are used to store relationships, or links among facts or data, in the form of semantic networks, syllogistic rules, or frames [50]. In general, a knowledge base is a public repository that contains the rules and inferences required to solve a problem. A knowledge base should not be considered a static collection of information, but a dynamic resource that may have the capacity to learn. Additionally, logic is the best-known method for describing patterns of relationships with crisscrossing dependencies that have no natural linearization. When such dependencies occur, graph logic is the best way to display them. The author [50] also discovered that conceptual graphs are the most successful medium for communicating with domain experts. When given a problem, a knowledge base works on the premise of how we understand a human mind to work: by tracing back to the background knowledge it has about that specific problem [47]. In this study, we use a KB to store the syllogisms (rules) used to diagnose CAD. For an XAI to be explainable, a highly sophisticated knowledge base with an inference engine is required in the background to identify the chain of rules that would affirm
the existence of CAD. Before we head into the workings of an inference engine, we will see why these syllogisms were validated against a gold-standard set of guidelines before being incorporated into the working KB.

Clinical Practice Guidelines

Documents that are agreed upon by all the members of a medical community are called clinical practice guidelines (CPG) [51]. CPGs were prepared with the goal of guiding clinical decisions. Clinical guidelines are used to help with the diagnosis, management, and treatment of diseases in a particular field of healthcare [51]. With information technology increasingly being integrated into healthcare industries, and with the availability of medical databases over the internet, accessible clinical guidelines are available to a heart specialist at every step of decision making [52]. Guidelines are a great source of information to guide clinical practitioners in making confident medical decisions. However, these international guidelines require some tuning and changes to be beneficial in making medical decisions for local and ethical situations [53]. Undeniably, the usage of clinical guidelines has led to substantial improvement in the quality of medical practice [54]. As a result, we emphasize validating the interpreted syllogisms with CPGs to investigate and validate the syllogisms built for the knowledge base. The validation of the interpreted syllogisms with CPGs is a crucial step to minimize the biases and variations that may occur between physicians while validating these syllogisms, as human subjective assessments may vary depending upon experience [55]. Henceforth, with that in mind, the CPG followed for this particular study is the American College of Cardiology guideline [56].

Inference Engine

The inference engine of a KB has the general problem-solving or reasoning abilities. It is the part of the knowledge base that works on the problem to come up with a solution. The inference engine incorporates the control strategy of the system [57]. In a rule-based system, the method used to find a path through a network of rules is known as the control strategy of the system. Rules are stored in the KB and must always be in a formalized form (syllogisms) that is understandable by the inference engine. To put it in a nutshell, the KB is the working memory that stores all facts, and the inference engine carries out all pattern matching, sequencing, and rule execution. Another element of the inference engine is its directionality: whether the inference engine works by forward chaining (premise to conclusion) or backward chaining (conclusion to premise) [58]. Forward chaining starts with the given initial condition and searches forward through the knowledge base to find a conclusion for the given initial condition. Whereas, in backward chaining, the opposite occurs; the inference engine chooses a hypothetical conclusion and works backward to the initial condition to prove or disprove the conclusion hypothesis. For this research, we use a forward-chaining approach because CAD is caused by multifactorial conditions like cholesterol, hypertension, and many more, which lead to a conclusion. Thus, a forward-chaining approach is more appropriate to be
used here than a backward-chaining approach. A clear description of how this was implemented can be found in Sect. 6.7.
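The following is a minimal forward-chaining sketch of the behavior just described, with each syllogism stored as a set of premises and a conclusion; the rule contents are illustrative placeholders, not the validated clinical syllogisms of this study.

```python
# Toy forward-chaining engine: fire rules against the working memory of
# patient facts until nothing new can be derived.
def forward_chain(facts, rules):
    facts = set(facts)
    fired = True
    while fired:
        fired = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)     # premise-to-conclusion step
                fired = True
    return facts

rules = [
    ({"cholesterol>200", "age<60"}, "cad_risk"),
    ({"st_depression", "exercise_induced_angina"}, "ischemia_suspected"),
    ({"cad_risk", "ischemia_suspected"}, "further_tests_recommended"),
]

patient = {"cholesterol>200", "age<60", "st_depression", "exercise_induced_angina"}
print(forward_chain(patient, rules))
```

Starting from the observed conditions, the engine first derives cad_risk and ischemia_suspected and only then the chained conclusion, which is why forward chaining suits multifactorial conditions that accumulate toward a diagnosis.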

4.4 Explainable Artificial Intelligence

Explainable artificial intelligence (XAI), or interpretable AI, is a branch of AI which ensures that the inner workings and complexities of a model can be easily trusted and understood by humans [59]. Traditional ML algorithms tend to have a “black box”, where the designers of the AI generally cannot explain the internal workings of the AI and why it arrived at a specific decision [60]. These ML algorithms are generally incomprehensible to human intuition and are quite opaque. Transparency is crucial because it builds trust between the AI and the human interacting with it [61]. Without transparency, it is impossible to know whether an AI came to a particular diagnosis through careful analysis of all the attributes or by cheating. For example, in 2017, a system that was tasked to recognize images learned to “cheat” by identifying the copyright tag that was associated with the horse pictures [62]. Thereby, whenever the system saw a copyright symbol in an image, it classified it as a horse. However, this is not acceptable and can sometimes be fatal in medical diagnosis, as patients’ lives are at stake [63, 64]. Thus, a transparent explanation does not only mean that the output of the operation is interpretable, but also that the whole process and the intention behind the model can be properly accounted for. In medical diagnosis, it is fundamental to know that the AI uses the right attributes to develop its classifications and infer diagnoses based on them. There are two main techniques used in the development of explainable systems: post-hoc and ante-hoc [65]. The ante-hoc technique ensures that the model is explainable from the beginning. Post-hoc techniques allow the model to be built normally, with the explainability only being added at the end during testing time. For this study, we follow an ante-hoc technique, as physicians were involved through every iteration of this study, from validating the information in the knowledge base to validating the explanations in the final stage.

5 Research Methodologies

In this section, we introduce the research methodologies implemented for this study. For this research, we followed an iterative and sequential process, with each stage having its own procedures and validations to develop the XAI. To identify the important components needed for this study, we undertook a combination of constructive and human-centric methods, which was iteratively conducted in two stages. At the first stage of the research, a constructive model is used to find a solution to a persisting problem: to identify the components needed to construct an XAI specifically for application in the domain of CAD diagnosis by using real-world
datasets. In the second stage, we applied a human-centric approach, where physicians were involved iteratively in the process (i.e., in the loop) of the research to make improvements on the XAI as we went [66]. Questionnaires were also used to get qualitative feedback from physicians related to any improvements needed for the XAI model. Finally, the XAI mobile App developed for this study is simply used as a tool to engage with physicians and give them ease of access to the XAI; thus, detailing the mobile App’s design is not the focus of this chapter. The following sections will detail further the steps involved in constructing the XAI model by using the two methodologies mentioned above.

5.1 A Constructive Approach in Developing XAI

The term “construct” often refers to a new concept, theory, or model being developed in research [67]. Following this approach, here are the iterative stages that were followed in this study.

(1) We used granular computing to extract information granules that have maximum coverage and specificity from the data and validated them with physicians.
(2) We classified the information granules with SVM to get graphs.
(3) From these graphs, we then interpreted syllogistic rules.
(4) CPGs and physicians then validated these syllogistic rules for any misinterpretation of the graphs caused by contextual information missing in the data. CPGs were used here, despite having human validators, in order to minimize the biases involved between multiple human validators.
(5) These validated syllogisms with contextual information from physicians were then incorporated into a knowledge base.
(6) The syllogisms stored in the knowledge base are then incorporated into the controller1 of the mobile App, which transforms them into explainable outcomes. This is done with the help of a voice system which uses the inference engine in the controller of the App to find a chain of relations between the input conditions and the syllogisms stored in the KB.
(7) These explanations are then validated and evaluated by physicians to measure the goodness and satisfaction of the model. Implementing this human-centric approach here allows the physicians to have more trust in the XAI model because they understand the reasoning behind each diagnosis. The physicians then gave qualitative feedback on the explanations given by the XAI mobile app.

1 Controller here refers to the main file in the mobile app software which has all the necessary components to control the entire process in the app.

Fig. 1 Research Methodology Workflow

(8) Once the feedback is received, the explanations are then adjusted based on the experts’ feedback with more contextual information that could have been missed during the Stage 4 validation, and the process continues again.

The highlighted segments above represent the stages where a human-centric approach was followed to overcome the gap in explainability with the XAI and to improve the trust in, and overall performance of, the model. Figure 1 below shows an overview of this approach.

5.2 A Human-Centric Approach at Early Development Stage

In this section, we take a closer look at the human-centric validation approach taken in stages 1, 4, and 7. We followed a human-centric approach from an early stage of this research. Next, we will elaborate on the three stages where we employed a human-centric approach in this research.

Human-centric Stages with Questionnaires

There were three different sets of questionnaires designed for the three human-centric validation stages. The stages where questionnaires were used are highlighted

Fig. 2 Interpretation of Syllogisms from SVM flow

Fig. 3 SVM graphical classification with hyperplane on Cholesterol VS Age

in orange in Fig. 1. The first questionnaire consists of ten polar questions for each data set. The main objective of this questionnaire is to validate whether the information granules discovered by the univariate selection are accurate from an expert’s (physician’s) perspective, and to obtain their inference and feedback on them. Through this process, we were able to eliminate granules of information that would not play a vital role in diagnosing CAD. The second questionnaire consists of four subjective-type questions. This questionnaire was developed to get an inference from a physician on the syllogisms interpreted

from the SVM’s graphs. Some of these syllogisms could be incomplete due to missing contextual information in the data. Thus, having a physician validate these syllogisms and add contextual information to them gave the XAI model better precision during diagnosis of a patient’s condition. The third questionnaire is the most vital for this research and was adopted from the work of [35]. This questionnaire was used during the live testing stage and covers all the necessary parts to determine the goodness and satisfaction of the vocal diagnosis explained by the XAI mobile App. This stage was divided into three phases, with the final phase including physicians as well as patients. A scale-based validation approach is used here to measure and rate the XAI’s vocal diagnosis. Through this scale-based scoring, we can gauge the entire performance of the XAI mobile App. We will elaborate more on this in Sect. 8.

6 Implementation: A Syllogistic Approach to Interpret SVM’s Classification from Information Granules

Following the research methodology shown in the previous section, in this section we explain the schematics of implementing the entire methodology. We start by giving a brief overview of the structure of this section: we begin by introducing the two data sets used for this study with SVM. Next, we elaborate on how we use granular computing to extract information granules from these two data sets. Then we give a brief overview of the parameter setting of SVM, which is used to interpret syllogistic rules from SVM’s classification graphs. These interpreted syllogistic rules are then incorporated into the knowledge base. Lastly, we give a model of how these rules are converted into explainable outcomes, and an overall framework of all the discovered components needed for an XAI.

6.1 Data Selection

The two datasets selected for this research study are from the Cleveland UCI heart disease repository and the Framingham heart disease repository [68, 69]. The complete UCI database has 74 attributes, but this study only uses 304 records (patients) with 22 attributes (physiological conditions), following the standard of all published records. In comparison, the Framingham heart study has 4241 records (patients) with 17 attributes (physiological conditions). This study began in 1948 and is still ongoing in the city of Framingham, Massachusetts, and it is considered to be one of the most extensive heart studies to date. Since the Framingham data had a few null values, it required some preprocessing before visualizing it with SVM.
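A brief sketch of this kind of preprocessing is shown below; the file names refer to assumed local copies of the two repositories, and dropping incomplete rows is only one possible way to handle the nulls.

```python
# Load the two datasets and handle the Framingham null values (assumed file names).
import pandas as pd

framingham = pd.read_csv("framingham.csv")       # hypothetical local file name
print(framingham.isnull().sum())                 # a few attributes contain nulls

framingham = framingham.dropna().reset_index(drop=True)   # or impute instead

uci = pd.read_csv("heart_cleveland.csv")         # hypothetical local file name
print(uci.shape, framingham.shape)
```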

6.2 Identifying the Information Granules from These Data Sets

In order to extract information granules from these two data sets, the two main criteria we need to establish are that the selected information granules must have maximum coverage and specificity. This means that the selected information granules carry the maximum weightage on the target values, which can also improve the overall performance of the model. To make this possible we used univariate selection and feature importance, which are discussed next.

Univariate Selection

Univariate selection is used to examine each attribute individually and determine the strength of its relationship with the response variable [70]. The attributes are selected based on the highest k-score values. Univariate selection is best for getting a better understanding of the data, its structure, and its characteristics.

Feature Importance

Feature importance is used to extract the importance that each attribute gives to the data [71]. The higher the score, the more relevant the attribute is to the output variable. The feature importance is extracted with the help of an inbuilt tree-based classifier. For these two datasets, we use an Extra Trees classifier to identify the top 10 attributes.

Information Granules used in this Study

From the two methods mentioned above, we were able to utilize granular computing to extract the information granules that have the most coverage and specificity in the entire dataset. Since we follow a human-centric approach from an early stage of development, these information granules were then validated with physicians and the CPG to verify whether the above two selection methods performed well in identifying the primary information necessary to diagnose a patient with CAD. After validation and removal of unnecessary information, the chosen attributes from the UCI dataset are: maximum heart rate, ST-depression, number of blood vessels highlighted by fluoroscopy, typical angina, exercise-induced angina, cholesterol level, age, and non-anginal pain. From the Framingham dataset we have: SysBP, DiaBP, age, sex, cholesterol level, cigsPerDay, diabetes, prevalent hypertension, and glucose level.
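The two selection steps can be sketched with scikit-learn as follows; the file name, the target column, and the choice of ANOVA F-scores as the univariate scoring function are assumptions rather than the study's exact settings.

```python
# Univariate selection and tree-based feature importance (assumed names).
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif

data = pd.read_csv("framingham.csv").dropna()          # hypothetical file name
X = data.drop(columns=["TenYearCHD"])                  # assumed target column
y = data["TenYearCHD"]

# Univariate selection: rank attributes by their individual test score.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns)
print(scores.sort_values(ascending=False).head(10))

# Feature importance from a tree ensemble.
trees = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(trees.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```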

6.3 Analyzing and Interpretation of Syllogisms from SVM

The validated information granules from the previous section were then decomposed into smaller subsets and classified with SVM to get classification graphs. SVM is used here because of its ability to dynamically visualize the non-linear features in the datasets by transforming the original data into higher dimensions. The defining
characteristics of dynamic visualization are animation, interaction, and real-time behavior; any one of these features is enough to meet the definition [72]. In our work, we use the interaction capability of SVM’s visualization to interpret rules from it. The hyperparameters selected for this study by using GridSearchCV are {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'} [73]. Both datasets have a target column, which tells whether a patient has CAD or not (1 for yes and 0 for no). From Fig. 2, we see that to interpret syllogistic rules, we first classify a subset of information granules (numerical data points) from X (age, cholesterol, etc.) individually with the target Y in SVM to get the classification output in the form of graphs. This process of attaining graphs from SVM is a form of granular computing, as the graphs are also information granules in a more interpretable form, which can then be acted upon to interpret syllogistic rules. In Sect. 6.4, we give a more detailed description of how this works, with real examples.
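The hyperparameter search mentioned above can be sketched as follows; only the reported best setting {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'} comes from the text, while the candidate grid, file name, and column names are assumptions.

```python
# Grid search over SVC hyperparameters (assumed grid and file/column names).
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

data = pd.read_csv("heart_cleveland.csv")        # hypothetical local file name
X, y = data.drop(columns=["target"]), data["target"]

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["rbf", "linear"],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy").fit(X, y)
print(search.best_params_)    # e.g. {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
```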

6.4 The General Framework for Modelling Syllogistic Rules

In this section, we present the general framework that was followed in this study to form syllogisms from information granules through the classification results of SVM graphs. The core of this framework is built on the fundamental concept of granular computing, where complex data is transformed into more interpretable information granules, which are then used to interpret the syllogisms that form the KB of the XAI. We define the following:

D = Entire dataset    (1)

X = {X_f1, X_f2, X_f3, ..., X_fn}    Set of information granules from D    (2)

Y = Target    (3)

First, to simplify the complexity of classifying the entire dataset with SVM, we extract a set of abstract information granules that have the most specificity and coverage from the entire data D in (1), giving the set X in (2). These {X_f1, X_f2, X_f3, ..., X_fn} represent the information granules that carry the maximum weightage on the target, and they were extracted by using the univariate selection and feature importance methods. From this set of information granules, we extract smaller subsets {X_f1, X_f2} which can then be individually classified with the target Y to attain graphs from SVM. These graphs are a form of interpretable information granules through which syllogisms A_n{X_f1 ∧ X_f2} can be interpreted, as shown in (5). We do this because it turns a very complex dataset into a group of human-interpretable syllogisms that carry all the information which could have been attained by classifying the entire complex dataset with SVM. However, while classifying the

Table 1 Variables explanation table

Variables            Representation
X_f1                 Feature 1
X_f2                 Feature 2
Y                    Target
{X_f1 ∧ X_f2}        Subset of two selected features from the set X
A_n{X_f1 ∧ X_f2}     The syllogism (fact) generated with the subset on the target variable

entire dataset with SVM, the acquired information would have been uninterpretable, as shown in (4). Thereby, to sum it all up, a combination of all these subsets [{X_f1, X_f2}, {X_f2, X_f3}, ...], classified individually on the target Y with SVM to attain syllogistic rules, accounts for a complete interpretation of the set X. Hence, a combination of all the interpreted syllogisms accounts for a complete interpretable outcome of SVM. The above can be formalized as follows (Table 1):

D → Y ⇒ (uninterpretable)

{X f 1 ∧ X f 2 } → ⇒ An {X f 1 ∧ X f 2 } ⇒ (interpretable)

(4) (5)
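A minimal sketch of this decomposition, under the assumption that the data are held in a pandas DataFrame with a binary target column, is shown below: every two-feature subset is classified individually against the target with the RBF-kernel SVM tuned earlier, and the decision boundary is drawn so that it can be read off as one of the interpretable graphs.

```python
# Sketch: fit one SVM per two-feature information-granule subset and plot its decision
# boundary; the boundary plot is the "graph" that is later read as a syllogism.
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
from sklearn.svm import SVC

def plot_pairwise_granules(df, features, target="target"):
    for f1, f2 in combinations(features, 2):       # e.g. ("age", "chol"), ("age", "thalach"), ...
        X = df[[f1, f2]].values
        y = df[target].values
        clf = SVC(C=1, gamma=0.001, kernel="rbf").fit(X, y)

        # evaluate the classifier on a grid in order to draw the hyperplane (decision boundary)
        xx, yy = np.meshgrid(
            np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
            np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
        zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

        plt.figure()
        plt.contour(xx, yy, zz, levels=[0], colors="black")   # the hyperplane
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap="RdYlGn_r", s=10)
        plt.xlabel(f1)
        plt.ylabel(f2)
        plt.title(f"SVM boundary for {{{f1}, {f2}}} vs {target}")
        plt.show()

# Example usage (assuming df was loaded as in the previous sketch):
# plot_pairwise_granules(df, ["age", "chol", "thalach"])
```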

An Excerpt of Interpreting Syllogistic Rules from SVM

To illustrate how syllogisms were extracted by classifying subsets of information granules against the target with SVM to obtain graphs, let us take the following example from the UCI dataset. Here, Cholesterol and Age represent the subset of information granules {X_f1, X_f2} extracted from the entire set of information granules X. This subset was classified against the target Y in SVM to obtain the hyperplane graph shown in Fig. 3. The black line in the figure represents the hyperplane, which divides the data points into two classes: the orange data points represent patients with CAD and the green data points represent patients without CAD. From this hyperplane's division, we can infer that patients below the age of 60 with a cholesterol level over 200 are more prone to having CAD. The corresponding syllogism can be interpreted as: Age(x) < 60 ∧ Cholesterol(x) > 200 ⇒ CAD Risk.

Let us take another example with age vs. maximum heart rate in Fig. 4. From this hyperplane's division (the black line running across the graph), we noticed a high concentration of data points for patients with CAD between 140 and 200 beats per minute. The maximum heart rate a person can reach is commonly estimated as 220 minus the patient's age, which implies that as the patient gets older, their maximum heart rate decreases.

Fig. 4 SVM graphical classification with hyperplane on maximum heart rate vs. age

We can additionally observe from the graph that, as the patient gets older, the maximum heart rate at which CAD appears also goes down. This interpretation is consistent with the formula and suggests that as the patient's heart gets weaker, the probability of having CAD increases. A similar approach was followed to interpret syllogisms from all the information granule subsets until an adequate number of syllogistic rules had been formed, as specified by the physicians.
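For illustration, the two syllogisms read off from Figs. 3 and 4 can be written down directly as executable predicates. The thresholds are the ones quoted above; treating the 140–200 bpm band as an interval check bounded by the age-dependent maximum is our own simplification.

```python
# Sketch: the syllogisms interpreted from the two hyperplane graphs, written as predicates.
def cad_risk_from_cholesterol(age: float, cholesterol: float) -> bool:
    # Age(x) < 60 AND Cholesterol(x) > 200  =>  CAD risk (read from Fig. 3)
    return age < 60 and cholesterol > 200

def max_heart_rate(age: float) -> float:
    # common estimate quoted in the text: 220 minus the patient's age
    return 220 - age

def cad_risk_from_heart_rate(age: float, measured_max_hr: float) -> bool:
    # high concentration of CAD cases between 140 and 200 bpm (read from Fig. 4),
    # bounded above by the age-dependent maximum heart rate
    return 140 <= measured_max_hr <= min(200, max_heart_rate(age))

print(cad_risk_from_cholesterol(age=52, cholesterol=230))      # True
print(cad_risk_from_heart_rate(age=52, measured_max_hr=160))   # True
```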

6.5 Validating the Interpreted Syllogistic Rules with Physicians and CPGs

The syllogistic rules interpreted from SVM's graphs need validation before being incorporated into the KB. For this, we chose five physicians and used the American College of Cardiology's (ACC) guideline to aid us. Validating these inferences is the most crucial stage, because the interpretations from SVM are only assumptions until they are confirmed by guidelines and physicians and thereby become rules (facts). The reason we emphasized validating these interpretations with both physicians and CPGs in the loop is that it is important to ensure there are no conflicts between the interpretations provided by the physicians, since physicians may follow different treatment regimens for similar symptoms based on their experience. Having the universally accepted CPGs allowed us to resolve such conflicts. In the next section, we will see how these syllogisms are stored in the KB of the XAI.


Fig. 5 Knowledge base architecture for CAD

6.6 XAI Knowledge Base for CAD

The syllogistic rules formed from information granules via SVM's classification were then used to build a KB. In our work, we do not attempt a complete KB for the representation of events. Instead, the KB stores background information extracted from SVM to provide an interpretation of its classification. Given that identifying CAD is highly complex, owing to the multifactorial variables that need to be examined, constructing a KB with all possible inference rules would be exhaustive. Including physicians interactively in the loop therefore allowed us to make incremental improvements by combining the interpretation from SVM with the experts' interpretation. Figure 5 gives a pictorial representation of the stages involved in how the XAI uses the CAD KB. In stage 1, the physiological condition of the anonymized patient is fed to the model for analysis. In stage 2, the model identifies whether the anonymized patient's conditions are symptoms of CAD by going through the information stored in the KB. Finally, in stage 3, the model evaluates the inference from the KB to identify whether the given patient condition may lead to CAD. However, this covers only one specific patient condition. To make a diagnosis when there are multiple patient conditions, which is usually the case when diagnosing a patient, we use an inference engine.

6.7 XAI with Inference Engine

Once all the validated syllogisms interpreted from the SVM graphs, together with their contextual information, have been collected and stored in the knowledge base, we move forward to designing the XAI architecture with the inference engine to make diagnoses. This architecture is built inside the controller of the mobile App and transforms a chain of triggered syllogisms (facts) into human-understandable natural language. Section 6.9 gives an excerpt of these syllogisms (facts) in the KB. The inference engine transforms these syllogisms into phrases that are assembled into sentences. Figure 6 illustrates this transformation.

Fig. 6 Inference engine transforming triggered facts to explainable outcome

While making a diagnosis for a patient condition, the XAI inference engine goes through all the syllogisms (facts) in the KB and triggers only those relevant to the input condition. The triggered facts (highlighted in red in Fig. 6) are then combined and explained together with the additive chain of syllogisms (facts) that led to the particular diagnosis. The additive facts are turned into an explainable outcome that is spoken aloud using the built-in Swift class AVSpeechUtterance. With the inference engine, the XAI's explanations became understandable and trustworthy for physicians: through validation with physicians and CPGs, the XAI covers the gap in explanation that would have occurred had we used only the information interpreted from the data. In the next section, we will see how the XAI communicates with physicians.
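The engine itself runs in the Swift controller of the iOS App; the short Python sketch below only illustrates the final assembly step described above, in which a chain of triggered facts is mapped to phrases and joined into one spoken/written explanation. The fact names and phrases are illustrative placeholders, not the App's actual wording.

```python
# Sketch of the explanation-assembly step: the additive chain of triggered syllogisms
# (facts) is mapped to phrases and joined into a single natural-language explanation.
# In the App this text is then spoken via the AVSpeechUtterance class.
PHRASES = {
    "HighBloodPressure": "the blood pressure readings indicate hypertension",
    "blood_sugar_med": "the patient is on glucose medication, which keeps the measured glucose level normal",
    "CoronaryArteryDisease": "taken together, these findings indicate a potential risk of coronary artery disease",
}

def verbalize(triggered_facts):
    """Turn the chain of triggered facts into one explanatory sentence."""
    parts = [PHRASES[f] for f in triggered_facts if f in PHRASES]
    return "The diagnosis is based on the following chain: " + "; ".join(parts) + "."

print(verbalize(["HighBloodPressure", "blood_sugar_med", "CoronaryArteryDisease"]))
```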

6.8 User Interface in Mobile Application

In the previous section, we saw how the XAI components (KB and inference engine) run in the background (controller) of the mobile App. Now we turn to the foreground of the mobile App, the user interface (UI). The UI is the platform through which the user (physician) interacts with the model. For this study, the UI is an App that communicates in natural language, so that it can be operated by physicians who are experts in CAD; the physician does not have to be familiar with AI to use the App. Since the XAI mobile App developed for this research is mainly a tool to incorporate the components of the XAI, the UI is simply a voice system that acts as a mediator between the XAI and the physician.

6.9 Preliminary Results

In this section, we describe the initial validated syllogistic results obtained by applying our process experimentally to the UCI and Framingham datasets. We partitioned each repository into 80% for training and 20% for testing, and we provide a representative excerpt of the syllogistic rules interpreted from the test data as our preliminary results. We did not include a separate validation set because the final validation of the App was conducted in real time with physicians and patients. The variable x represents the input value of a specific physiological condition, and the symbol ∃ represents the existence of a particular condition.

Blood pressure: (SystolicBloodPressure(x) > 140) ∧ (DiastolicBloodPressure(x) > 90) ⇒ ∃(HighBloodPressure).

Cholesterol: (ch_LDL_lev(x) > 160) ∧ (age(x) > 21) ∧ (ch_HDL_lev(x) < 40) ⇒ ∃(cholesterol_risk_high).

Blood sugar (normal): (blood_sugar(x) > 4) ∧ (blood_sugar(x) <= 5.4) ⇒ ¬∃(DiabeticRisk).

Chest pain risk: (anginal_cp = True) ∨ (atypical_anginal_cp = True) ∨ (non_anginal_cp = True) ⇒ ∃(chest_pain_risk).

Physiological triggering factors (KB inference engine)

Rule 1: if there is high cholesterol risk, diabetic risk and high blood pressure risk, then CAD is present: (cholesterol_risk_high = True) ∧ (DiabeticRisk = True) ∧ (HighBloodPressure = True) ⇒ ∃(CoronaryArteryDisease).

Rule 2: if the patient's age is between 30 and 33 years and diastolic blood pressure falls between 60 and 110 mmHg, then CAD is present: (age(x) > 30) ∧ (age(x) < 33) ∧ (DiastolicBloodPressure(x) > 60) ∧ (DiastolicBloodPressure(x) < 110) ⇒ ∃(CoronaryArteryDisease).

Rule 3: if there is diabetic risk and a body-mass index (BMI) risk, then CAD is present: (DiabeticRisk = True) ∧ (BMI_risk = True) ⇒ ∃(CoronaryArteryDisease).


Rule 4: if the blood sugar level is normal with high cholesterol risk and blood sugar medication is true (contextual information added by the physicians), then CAD is present: (blood_sugar(x) < 140 mg/dl) ∧ (blood_sugar_med = True) ⇒ ∃(CoronaryArteryDisease).

In Rule 4, we discovered that the rules interpreted from the two datasets did not contain medication information. Had this not been validated with the physicians to obtain the contextual information, it would have created a gap in explainability, leading the XAI to give an incorrect diagnosis, which could have been fatal for patients with CAD. The syllogistic rules shown above represent only a small portion of the syllogistic background of the KB. When a specific condition is given to the XAI, it analyzes the condition based on the contextual information relating to the syllogistic rules to determine whether a CAD condition is warranted. We incrementally improved the explainability of this reasoning process during the project by verbalizing the rules through the voice system so that the physicians could understand and validate these syllogisms. Below is an excerpt from the incremental development of the dialogues given by the XAI.

In the explanation dialogues used in Phase I testing (11 April 2020), the XAI model only reported the risk factors, without incorporating any contextual information:

A: “Male, cholesterol level risk, blood pressure risk, smoking risk. Has CAD.”
B: “Normal blood sugar level. Has CAD.”

The explanation dialogues used in Phase III testing (17 May 2020) adopted a more natural language to represent the XAI model's output, incorporating the missing links from the contextual data:

A: “The patient's blood pressure level with past history could indicate that this patient may have some blocked blood vessels. An echo test on the patient before going for consultation would be a suggestable opinion.”
B (example of the gap in explainability being overcome): “The patient seems to be on glucose medication, which normalizes blood glucose level. This infers why a patient may have a high potential risk for CAD though they have a normal glucose level.”

Comparing these two iterative excerpts, we found that including contextual information within the explanation eliminated the gap in explainability and gave a drastic improvement in trust, understandability and the overall performance of the model. We validate these findings in the testing results in Sect. 8. Next, we give a diagrammatic description of the entire XAI architecture developed using the concepts discussed in the previous sections, and we show how the entire architecture was built around a human-centric process that helped to validate not only the syllogistic rules developed in this study but also the XAI's explanations.
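Read literally, the KB excerpt listed in this section corresponds to a small declarative rule table such as the sketch below. The field names, the unit split between mmol/L (the 4–5.4 normal range) and mg/dl (Rule 4), and the way the primitive facts feed the triggering rules follow the text as closely as possible, but the encoding itself is our own illustration rather than the App's implementation.

```python
# Sketch: the preliminary KB excerpt transcribed as predicates over a patient record p.
PRIMITIVE_RULES = {
    "HighBloodPressure":     lambda p: p["systolic_bp"] > 140 and p["diastolic_bp"] > 90,
    "cholesterol_risk_high": lambda p: p["ch_LDL_lev"] > 160 and p["age"] > 21 and p["ch_HDL_lev"] < 40,
    "NormalBloodSugar":      lambda p: 4 < p["blood_sugar_mmol"] <= 5.4,   # implies no diabetic risk
    "chest_pain_risk":       lambda p: p["anginal_cp"] or p["atypical_anginal_cp"] or p["non_anginal_cp"],
}

# Physiological triggering factors (Rules 1-4); each returns True when CAD is asserted.
CAD_RULES = [
    lambda p, f: {"cholesterol_risk_high", "HighBloodPressure"} <= f and p["diabetic_risk"],   # Rule 1
    lambda p, f: 30 < p["age"] < 33 and 60 < p["diastolic_bp"] < 110,                          # Rule 2
    lambda p, f: p["diabetic_risk"] and p["bmi_risk"],                                         # Rule 3
    lambda p, f: p["blood_sugar_mgdl"] < 140 and p["blood_sugar_med"],                         # Rule 4 (contextual)
]

def diagnose(p):
    facts = {name for name, predicate in PRIMITIVE_RULES.items() if predicate(p)}
    cad = any(rule(p, facts) for rule in CAD_RULES)
    return facts, cad

patient = {"systolic_bp": 150, "diastolic_bp": 95, "age": 54, "ch_LDL_lev": 170, "ch_HDL_lev": 35,
           "blood_sugar_mmol": 5.0, "blood_sugar_mgdl": 120, "blood_sugar_med": True,
           "diabetic_risk": False, "bmi_risk": False,
           "anginal_cp": False, "atypical_anginal_cp": True, "non_anginal_cp": False}
print(diagnose(patient))   # Rule 4 fires: CAD asserted despite a normal measured glucose level
```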

6.10 Iterative Retuning and Validation of XAI Mobile App with Physicians in the Loop

In this section, we describe the iterative development process, with the physicians in the loop, used to build the components required for the XAI mobile app. All validation stages were conducted with questionnaires, as described in Sect. 5.2. First, we used granular computing to extract information granules from the two datasets using the univariate and feature selection methods; these information granules were then validated with physicians, forming the first stage of the human-centric loop (Questionnaire 1). Second, we decomposed the information granules into smaller subsets and classified them individually with SVM to obtain classification graphs, which turned out to be more interpretable forms of information granules. Third, these interpretable information granules enabled us to extract syllogistic rules easily; the combination of these extracted syllogistic rules represents the complete interpretation of SVM's classification of the UCI and Framingham datasets. Fourth, physicians were involved in validating and readjusting the interpreted syllogisms by adding any missing contextual information (Questionnaire 2). The syllogisms with the incorporated contextual information are then stored in the KB and transformed into explainable outcomes by an inference engine built into the iOS mobile app. For a given patient condition, the XAI mobile app explains the complete diagnosis in a natural dialogue manner. Lastly, these explanations from the XAI mobile app were again validated by the physicians to evaluate the human-centric element of trust and the overall performance of the XAI (Questionnaire 3). Figure 7 shows a pictorial representation of the workflow consisting of the different components developed for the XAI model.
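The chapter names "univariate selection and feature selection" without fixing an implementation; one common scikit-learn realization of the univariate step is sketched below. The file name framingham.csv, the target column TenYearCHD and the choice of keeping eight granules are assumptions.

```python
# Sketch: a common realization of the univariate-selection step with scikit-learn.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("framingham.csv").dropna()          # hypothetical file name
X, y = df.drop(columns=["TenYearCHD"]), df["TenYearCHD"]

selector = SelectKBest(score_func=f_classif, k=8).fit(X, y)
granules = X.columns[selector.get_support()]
print(sorted(zip(selector.scores_[selector.get_support()], granules), reverse=True))
```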

Fig. 7 Complete workflow with discovered components of XAI

7 Final XAI Mobile App

7.1 XAI Mobile App

The XAI mobile app developed for this research has two main screens. The first screen, also known as the landing page, is the page the user lands on upon opening the App. This page has the input fields for the main conditions required for the XAI to diagnose a particular CAD condition. Once the required fields are filled, the values from the input fields are passed to the controller of the App, which has the XAI with the KB incorporated into it. The controller then uses the inference engine to identify the chain of rules that are triggered for that particular input condition. Once the triggered rules are identified, they are combined to give an explainable output in the form of text and voice dialogue on the second screen of the App. In the next section, we illustrate this with the help of some screenshots taken from the XAI mobile app.

XAI Mobile App Landing Page

The landing page of the mobile App has the following input text fields, as per the physicians' requirements. These selected input text fields were deemed sufficient by the physicians to diagnose a particular CAD event. The chosen input text fields are shown in Fig. 8.

Fig. 8 XAI mobile app landing page

XAI Mobile App Output Page

Once the input fields are filled, the App takes the input conditions and performs an inference with the KB stored in the controller of the mobile App. Once the inference is evaluated, the recommended output is given by the XAI on the second screen of the App, as shown in Figs. 9 and 10. The output is given in the form of text and voice dialogue through the App to accommodate our physicians' requirements.

Fig. 9 Excerpt of XAI mobile app output 1

8 Testing Results from XAI Mobile App

In this section, we detail the testing results attained for the XAI. The main question being tested is whether adding a human-centric loop from the beginning of the development process, together with the contextual information gathered through it, has made the XAI more trustworthy and improved its overall performance. The evaluation was conducted using the goodness and satisfaction test taken from Dr. Hoffman's work at IHMC [35]. The testing of the goodness and satisfaction of the XAI model was conducted in three phases, as shown below. The score attained in each phase represents the overall performance level of the XAI; in our context, the performance score indicates how explainable and precise the XAI's diagnosis is. With the performance score, we can make a clear-cut comparison with SVM to demonstrate the significance of a human-centric approach in eliminating the gap in explainability for these AI models.


Fig. 10 Excerpt of XAI mobile app output 2

8.1 Testing Phase I

During Phase I of the app development, the physicians found many limitations in the App, especially with the content of the explanation, which resulted in the App being poorly rated during the live testing session. Table 2 summarizes the questionnaire results, followed by an excerpt of the reviews given by the physicians.

Table 2 Questionnaire Phase I result

No  Statement                                                            Score (1–5)
1   I understand the explanation of how the XAI app works                3
2   The explanation given by the XAI is satisfying                       1
3   The explanation given by the AI has sufficient details               1
4   The explanation lets me judge when I can trust or not trust the AI   2
5   The explanation says how accurate the AI is                          3
6   The explanation will aid me in diagnosis                             3

Comment from a physician: “Though the App is user friendly, and I understand the explanations compared to the previous version, which had a human mediator between the doctor and the AI, I feel the explanations are not sufficient for me to have complete trust with the AI. Most of the time, I had to make my own reasoning for the App's diagnosis. It would be better if the App gives out more reasoning for each diagnosis instead of me making my own assumptions of the App's reasoning.”

Performance score I.1 = attained score from Table 2 / total possible score from Table 2 = (4 + 1 + 1 + 2 + 3 + 3) / (5 + 5 + 5 + 5 + 5 + 5) = 14/30 ≈ 46% (Physician 1).
Performance score I.2: 38% (Physician 2).
Performance score I.3: 43% (Physician 3).
Performance score I.4: 51% (Physician 4).
Performance score I.5: 43% (Physician 5).
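The same percentage can be obtained with a small helper (a sketch; truncation rather than rounding matches the 46% reported above):

```python
# Sketch: performance score = attained questionnaire total / maximum possible total.
def performance_score(scores, max_per_item=5):
    return int(100 * sum(scores) / (max_per_item * len(scores)))

print(performance_score([4, 1, 1, 2, 3, 3]))   # 46
```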

8.2 Testing Phase II

During Phase II of the app development, we made additional changes to the App based on the feedback given by the physicians during Phase I. With the additional reasoning abilities provided to the inference engine in the controller of the App, the diagnostic process became slower but provided better reasoning. Table 3 summarizes the questionnaire results for Phase II, followed by an excerpt of the reviews given by the physicians.

Table 3 Questionnaire Phase II result

No  Statement                                                            Score (1–5)
1   I understand the explanation of how the XAI app works                4
2   The explanation given by the XAI is satisfying                       3
3   The explanation given by the AI has sufficient details               4
4   The explanation lets me judge when I can trust or not trust the AI   3
5   The explanation says how accurate the AI is                          4
6   The explanation will aid me in diagnosis                             4

Comment from a physician: “Overall satisfied with the app, but it can have additional information for a diagnosis like medication data and past medical history because it plays a vital role in patients' body conditions.”

Performance score II.1: 73% (Physician 1).
Performance score II.2: 59% (Physician 2).
Performance score II.3: 77% (Physician 3).
Performance score II.4: 72% (Physician 4).
Performance score II.5: 67% (Physician 5).


8.3 Testing Phase III

Before the final phase of testing, we added all the updates required by the physicians to the App and sent it in for live testing. The major update for the App in Phase III was adding the missing contextual information, such as demographic data, medication intake and the patients' innate physiological conditions, into the syllogistic rules used for explanation. Since we did not include a validation set for the final phase, we decided to perform a live test that included both physicians and patients in the testing loop to validate the final results of the App. Through this, the physicians found the App to be very helpful and trustworthy during the diagnostic process, as it was able to explicitly identify and diagnose the additive symptoms for a particular input patient condition. Table 4 summarizes the questionnaire results for Phase III, followed by an excerpt of the reviews given by the physicians.

Table 4 Questionnaire Phase III result

No  Statement                                                            Score (1–5)
1   I understand the explanation of how the XAI app works                5
2   The explanation given by the XAI is satisfying                       4
3   The explanation given by the AI has sufficient details               5
4   The explanation lets me judge when I can trust or not trust the AI   5
5   The explanation says how accurate the AI is                          4
6   The explanation will aid me in diagnosis                             5

Comment from a physician: “A well thought-out App, as it lightens the burden while helping doctors when there is a big patient queue. The App has also cut down my consultation time by half. Hope to see more developments in this app's knowledge bank in the future.”

Comment from a physician: “The XAI app was simple and user-friendly to use. It did aid me in diagnosis by providing the links between medication and the physiological condition of the patient. However, there can be more improvement in its knowledge bank.”

Performance score III.1: 93% (Physician 1).
Performance score III.2: 95% (Physician 2).
Performance score III.3: 92% (Physician 3).
Performance score III.4: 93% (Physician 4).
Performance score III.5: 96% (Physician 5).


8.4 Results from Testing

Across the three testing phases, we observed an incremental improvement in the performance score given by the five physicians. The incompleteness and lack of narrative in the Phase I XAI mobile App clearly hindered understanding of, and trust in, the XAI model. The third iterative phase, however, returned a much more comprehensive model explanation. This retuned model included the qualitative information, presented in the more natural dialogue form that the physicians felt was necessary for a more trustworthy and understandable interaction with the model's output. An additional interesting finding was that the final phase of our early XAI model scored 93% from the physicians (according to the goodness and satisfaction chart given by Robert Hoffman, IHMC), which is considerably better than the 81% scored by SVM standing alone with a training and testing split of 80% and 20% of the respective records. This shows the significance of a human-centric development strategy, with the added benefit of contextual information, for the diagnosis process. We elaborate on this and conclude in the next section.

9 Conclusion and Discussion

With our work exploring the notion of Explainable AI, we were able to accomplish the main goal of this study by using a human-centric approach from an early development stage. Through this human-centric loop, we developed an XAI that can explain a CAD diagnosis not just with the information interpreted from the data but also with the contextual information missing from it, as provided by the physicians. These explanations in turn developed trust among the physicians and removed the gap in explainability. Specifically, the gap in explainability we discovered was the contextual information we added from the physicians' experiential knowledge, which is usually not involved when an ML model makes a diagnosis based on information from the datasets alone. This was accomplished by harnessing the strengths of SVM and granular computing using syllogistic rules.

Moreover, our work revealed four main lessons. Firstly, the physicians mentioned that, by employing them to validate the XAI model from an early development stage, they were able to develop better trust in and understanding of the model, because they knew every component and piece of information that went into the entire XAI model architecture. This again reinforces the importance of a human-centric approach.

Secondly, when physicians were allowed to use the XAI mobile App themselves, they were able to explore its internal workings in a more personalized manner. Through this, we were able to remove the technical-expert middleman who would otherwise explain how the XAI works; our previous version of the XAI required a technical expert to be present in person to aid the physicians in working with the XAI. Additionally, this also aided junior physicians in making diagnoses and gaining knowledge, as the XAI is built on the contextual knowledge provided by more experienced physicians. Though the main goal of this study was to explore the development of XAI, the XAI mobile App allowed us to have more mobility during the testing phases and to add more functionality to the XAI.

Thirdly, with the aid of the XAI mobile app, the physicians were able to cut a large portion of their consultation time. Physicians use the XAI mobile App to perform a pre-diagnosis on patients with the help of a nurse, and are then left only with cross-referencing the diagnosis given by the XAI. This saves them the burden of doing the entire diagnostic process in person, which can be tiresome considering the number of patients with appointments every day.

Lastly, and most importantly, we discovered that employing a human-centric approach as an iterative method in constructing the interpretation of SVM's output helped to add more contextual information to the interpretation. Consequently, this enabled the XAI to remove the gap in explainability and to perform better in diagnosis. There was, however, a significant finding whereby some of the generated interpretations gave contradicting results when validated against the CPGs. By including physicians in the process, we found that the data were greatly lacking in demographic and treatment information in the patients' records, which provide important contextual information. For example, blood glucose results from the Framingham dataset suggested that patients with normal blood glucose are more prone to having CAD, whereas the physicians stated that blood glucose level is one of the most important factors contributing to CAD. They noted that most of these CAD patients could be under medication and compliant with treatment, which keeps their blood glucose level within the normal range even though they are shown to have CAD. Therefore, the method needs to be improved to discover more hidden rules and patterns within the data (specific types of CAD).

Although the concept of a KB is not new, the contribution of our work lies in the technique of using granular computing to interpret SVM's classification, together with the feedback loop we created to improve the XAI's overall performance by including contextual information from physicians in the process. Thus, we conclude this study by saying that this is not about machine versus human, but very much about optimizing the strengths of a human physician and patient care by harnessing the strengths of AI.

10 Future Work

The key challenge we aimed to solve in this research is just the tip of the iceberg. There are still many research challenges to overcome, from adaptive explainable AI and knowledge acquisition with sharing to human-agent collaboration. Nevertheless, our current work can serve as a proven skeleton model to help other researchers develop trustworthy XAI diagnostic models for diseases such as cancer, Alzheimer's, etc.


In the near future, we plan to explore two possible directions for making the current XAI more robust. First, we are investigating how a fully automated Explainable AI system, able to learn new knowledge without any human intervention, can be developed using a multi-agent architecture. Second, we will investigate the use of this automated XAI system to identify outliers in the CAD patient datasets, which are often misdiagnosed by ML algorithms due to the absence of contextual information, and which physicians also find hard to identify because of their vast complexity; this again underscores the importance of having an automated and robust XAI. Finally, we conclude this study with the hope of inspiring and convincing other researchers to join and invest their experience and expertise in this emerging research field.

References 1. Russell, S.J., Norvig, P.: Artificial intelligence a modern approach. Pearson, Boston (2018) 2. Alpaydin, E.: Machine learning: the new AI. MIT Press, Cambridge, MA (2016) 3. Sajda, P.: Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed, England (2006) 4. Watson, D.S., Krutzinna, J., Bruce, I.N., Griffiths, C.E., Mcinnes, I.B., Barnes, M.R., Floridi, L.: Clinical applications of machine learning algorithms: Beyond the black box. Bmj,L886 (2019). doi:https://doi.org/10.1136/bmj.l886 5. Medicine, T.L.: Opening the black box of machine learning. The Lancet Respiratory Med. 6(11), 801 (2018). doi:https://doi.org/10.1016/s2213-2600(18)30425-9 6. Zhou, J., Li, Z., Wang, Y., Chen, F.: Transparent machine learning—revealing internal states of machine learning. In: Proceedings of IUI2013 Workshop on Interactive Machine Learning, pp. 1–3 (2013) 7. Dierkes, M.: Between understanding and trust: the public, science and technology. Routledge, Place of publication not identified (2012) 8. It’s Time to Start Breaking Open the Black Box of AI, www.ibm.com/blogs/watson/2018/09/ trust-transparency-ai/, last accessed 2020/6/22 9. Narayanan, M., Chen, E., He, J., Kim, B., Gershman, S., Doshi-Velez, F.: How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation (2018). arXiv preprint arXiv:1802.00682 10. Libbrecht, M.W., Noble, W.S.: Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16(6), 321–332 (2015) 11. Kourou, K., Exarchos, T.P., Exarchos, K.P., Karamouzis, M.V., Fotiadis, D.I.: Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 13, 8–17 (2015) 12. Kolouri, S., Park, S.R., Thorpe, M., Slepcev, D., Rohde, G.K.: Optimal mass transport: signal processing and machine-learning applications. IEEE Signal Process. Mag. 34(4), 43–59 (2017) 13. Adadi, A., Berrada, M.: Peeking inside the black box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 6, 52138–52160 (2018) 14. Zhu, J., Chen, N., Xing, E. P.: Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 617–624 (2011) 15. Kononenko, I.: Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Med. 23(1), 89–109 (2001) 16. Anooj, P.: Clinical decision support system: risk level prediction of heart disease using weighted fuzzy rules and decision tree rules. Open Comput. Sci. 1(4) (2011). doi:https://doi.org/10.2478/ s13537-011-0032-y.


17. Shouman, M., Turner, T., Stocker, R.: Using decision tree for diagnosing heart disease patients. In: Proceedings of the Ninth Australasian Data Mining Conference, vol. 121, pp. 23–30 (2011) 18. Zhang, Y., Liu, F., Zhao, Z., Li, D., Zhou, X., Wang, J.: Studies on application of Support Vector Machine in diagnose of coronary heart disease. In 2012 Sixth International Conference on Electromagnetic Field Problems and Applications, pp. 1–4 (2012) 19. Castelli, W.P.: The lipid hypothesis: is it the only cause of atherosclerosis? Medical Science Symposia Series Multiple Risk Factors in Cardiovascular Disease, pp. 13–18 (1992). doi:https:// doi.org/10.1007/978-94-011-2700-4_2 20. Shavelle, D.M.: Almanac 2015: coronary artery disease. Heart 102(7), 492–499 (2016) 21. Robertson, J.H., Bardy, G.H., German, L.D., Gallagher, J.J., Kisslo, J.: Comparison of twodimensional echocardiographic and angiographic findings in arrhythmogenic right ventricular dysplasia. Am. J. Cardiol. 55(13), 1506–1508 (1985) 22. Abdullah, N.N.B., Clancey, W.J., Raj, A., Zain, A.Z.M., Khalid, K.F., Ooi, A.: Application of a double loop learning approach for healthcare systems design in an emerging market. In 2018 IEEE/ACM International Workshop on Software Engineering in Healthcare Systems (SEHS), pp. 10–13, (2018) 23. Zhu, Y., Wu, J., Fang, Y.: Study on application of SVM in prediction of coronary heart disease. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi = Journal of Biomedical Engineering = Shengwu Yixue Gongchengxue Zazhi, 30(6), 1180–1185 (2013) 24. Hongzong, S., Tao, W., Xiaojun, Y., Huanxiang, L., Zhide, H., Mancang, L., BoTao, F.: Support vector machines classification for discriminating coronary heart disease pa-tients from noncoronary heart disease. West Indian Med. J. 56(5), 451–457 (2007) 25. Xing, Y., Wang, J., Zhao, Z.: Combination data mining methods with new medical data to predicting outcome of coronary heart disease. In: 2007 International Conference on Convergence Information Technology. ICCIT, pp. 868–872 (2007) 26. Babao˘glu, I., Fındık, O., Bayrak, M.: Effects of principle component analysis on assessment of coronary artery diseases using support vector machine. Expert Syst. Appl. 37(3), 2182–2185 (2010) 27. Chen, F.: Learning accurate and understandable rules from SVM classifiers (Doctoral dissertation, Science: School of Computing Science) (2004) 28. Blachnik, M., Duch, W.: Prototype rules from SVM. In Rule extraction from support vector machines. Springer, Berlin, Heidelberg, pp. 163–182 (2008) ´ atek, J.: Boosted SVM for extracting rules from 29. Zi˛eba, M., Tomczak, J.M., Lubicz, M., Swi˛ imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Applied Soft Computing 14, 99–108 (2014) 30. Çolak, M.C., Çolak, C., Kocatürk, H., Sagiroglu, S., Barutçu, I.: Predicting coronary artery disease using different artificial neural network models/Koroner arter hastaliginin degisik yapay sinir agi modelleri ile tahmini. Anadulu Kardiyoloji Dergisi AKD 8(4), 249 (2008) 31. Chakraborty, M., Biswas, S.K., Purkayastha, B.: Rule extraction from neural network using input data ranges recursively. New Generat. Comput. 37(1), 67–96 (2019) 32. Sato, M., Tsukimoto, H.: “Rule extraction from neural networks via decision tree induction,” IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), vol. 3, pp. 1870–1875, Washington, DC, USA, (2001) 33. 
Sokol, K., Flach, P.: Conversational Explanations of Machine Learning Predictions Through Class- contrastive Counterfactual Statements, pp. 5785–5786 (2018) 34. Miller, T.: Explanation in artificial intelligence: insights from the social sciences. arXiv: 1706.07269 (2018) 35. Hoffman, R.R., Mueller, S.T., Klein, G., Litman, J.: Metrics for Explainable AI: Challenges and Prospects. arXiv preprint arXiv:1812.04608 (2018) 36. Nakatsu, R.: Explanatory power of intelligent systems. In: Intelligent Decision-making Support Systems. Springer, pp. 123–143 (2006) 37. Sharma, M., Singh, G., Singh, R.: An advanced conceptual diagnostic healthcare framework for diabetes and cardiovascular disorders. arXiv preprint arXiv:1901.10530 (2019)


38. An, A., Butz, C.J., Pedrycz, W., Ramanna, S., Stefanowski, J., Wang, G.: Rough Sets, Fuzzy Sets, Data Mining and Granular Computing: 11th International Conference, RSFDGrC 2007, Toronto, Canada, May 14–16, 2007. Proceedings. Springer, Berlin (2007) 39. Wang, W., Siau, K.: Trusting Artificial Intelligence in Healthcare (2018) 40. Krittanawong, C., Zhang, H.J., Wang, Z., Aydar, M., Kitai, T.: Artificial intelligence in precision cardiovascular medicine. J. Am. Coll. Cardiol. 69(21), 2657–2664 (2017) 41. Guidi, G., Pettenati, M.C., Melillo, P., Iadanza, E.: A machine learning system to improve heart failure patient assistance. IEEE J. Biomed. Health Inf. 18(6), 1750–1756 (2014) 42. Pedrycz, W.: Granular computing: an introduction. In: Proceedings joint 9th IFSA world congress and 20th NAFIPS international conference (Cat. No. 01TH8569), vol. 3, pp. 1349– 1354 (2001) 43. Zadeh, L.A.: Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst. 19, 111–127 (1997) 44. Yager, R.R., Filev, D.: Operations for granular computing: mixing words with numbers, Proceedings of IEEE International Conference on Fuzzy Systems, pp. 123–128 (1998) 45. Yao, Y.Y.: Granular computing: basic issues and possible solutions. In: Proceedings of the 5th joint conference on information sciences, vol. 1, pp. 186–189 (2000) 46. Akama, S., Kudo, Y., Murai, T.: Granular computing and aristotle’s categorical syllogism. Intelligent Systems Reference Library Topics in Rough Set Theory, pp. 161–172 (2019) 47. Sowa, J.F.: Knowledge representation: logical, philosophical, and computational foundations. Course Technology, Boston (2012) 48. Speca, A.: Hypothetical syllogistic and Stoic logic. Brill (2001) 49. Montgomery, E.B.: Medical reasoning: the nature and use of medical knowledge. Oxford University Press, New York, NY, United States of America (2019) 50. Slagle, J.S., Gardiner, D.A., Han, K.: Knowledge specification of an expert system. IEEE Intell. Syst. 4, 29–38 (1990) 51. Grimshaw, J.M., Russell, I.T.: Effect of clinical guidelines on medical practice: a systematic review of rigorous evaluations. The Lancet 342(8883), 1317–1322 (1993) 52. Lucas, P.: Quality checking of medical guidelines through logical abduction. Springer, London, pp. 309–21 (2004) 53. Kendall, E., Sunderland, N., Muenchberger, H., Armstrong, K.: When guidelines need guidance: considerations and strategies for improving the adoption of chronic disease evidence by general practitioners. J. Eval. Clin. Pract. 15(6), 1082–1090 (2009) 54. Saadat, S.H., Izadi, M., Aslani, J., Ghanei, M.: How well establishment of research plans can improve scientific ranking of medical universities. Iranian Red Crescent Med. J. 17(2) (2015) 55. Argote, L., Ingram, P., Levine, J.M., Moreland, R.L.: Knowledge transfer in organizations: learning from the experience of others. Organ. Behav. Hum. Decis. Process. 82(1), 1–8 (2000) 56. Gibbons, R.J., Balady, G.J., Beasley, J.W., Bricker, J.T., Duvernoy, W.F., Froelicher, V.F., WL, J.W.: ACC/AHA guidelines for exercise testing: a report of the American College of Cardiology/American Heart Association task force on practice guidelines (committee on exercise testing). J. Am. Coll. Cardiol. 30(1), 260–311 (1997) 57. Hadden, S. G.,Feinstein, J. L.: Symposium: Expert systems. Introduction to expert systems. Journal of Policy Analysis and Management,8(2), 182–187 (2007) 58. Sharma, T., Tiwari, N., Kelkar, D.: Study of difference between forward and backward reasoning. Int. J. 
Emerg. Technol. Advan. Eng. 2(10), 271–273 (2012) 59. Gunning, D.: Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA) (2017) 60. Krause, J., Perer, A., Ng, K.: Interacting with predictions: visual inspection of black-box machine learning models. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5686–5697 (2016) 61. Zhu, J., Liapis, A., Risi, S., Bidarra, R., Youngblood, G.M.: Explainable AI for designers: a human-centered perspective on mixed-initiative co-creation. In: 2018 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8 (2018)


62. Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., Müller, K.R.: Unmasking clever hans predictors and assessing what machines really learn. Nat. Commun. 10(1), 1–8 (2019) 63. Cabitza, F., Rasoini, R., Gensini, G.F.: Unintended consequences of machine learning in medicine. JAMA 318(6), 517–518 (2017) 64. London, A.J.: Artificial intelligence and black-box medical decisions: accuracy versus explainability. Hastings Cent. Rep. 49(1), 15–21 (2019) 65. Gandhi, P.: KDnuggets analytics big data data mining and data science. (Accessed 2019) Available at: www.kdnuggets.com/2019/01/explainable-ai.html. 66. Trentesaux, D., Millot, P.: A human-centred design to break the myth of the “magic human” in intelligent manufacturing systems. In: Service orientation in holonic and multi-agent manufacturing. Springer, Cham, pp. 103–113 (2016) 67. Lehtiranta, L., Junnonen, J.M., Kärnä, S., Pekuri, L.: The constructive research approach: Problem solving for complex projects. Designs, methods and practices for research of project management, pp. 95–106 (2015) 68. Ronit.: Heart disease UCI. (Accessed 2020) Available at: https://www.kaggle.com/ronitf/heartdisease-uci 69. Ajmera, A.: Framingham heart study dataset. (Accessed 2020) Available at: https://www.kag gle.com/amanajmera1/framingham-heart-study-dataset 70. Raj, J.T.: Dimensionality reduction for machine learning. (Accessed 2019) Available at: https:// towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e 71. Hooker, S., Erhan, D., Kindermans, P.J., Kim, B.: Evaluating feature importance estimates. arXiv preprint arXiv:1806.10758 (2018) 72. Lowe, R.: Interrogation of a dynamic visualization during learning. Learning and Instruction 14(3), 257–274 (2004) 73. Friedrichs, F., Igel, C.: Evolutionary tuning of multiple SVM parameters. Neurocomputing 64, 107–117 (2005)

Factual and Counterfactual Explanation of Fuzzy Information Granules Ilia Stepin, Alejandro Catala, Martin Pereira-Fariña, and Jose M. Alonso

Abstract In this chapter, we describe how to generate not only interpretable but also self-explaining fuzzy systems. Such systems are expected to manage information granules naturally, as humans do. We take as our starting point the Fuzzy Unordered Rule Induction Algorithm (FURIA for short), which produces a good interpretability-accuracy trade-off. FURIA rules have local semantics and manage information granules without linguistic interpretability. With the aim of making FURIA rules self-explaining, we have created a linguistic layer which endows FURIA with global semantics and linguistic interpretability. Explainable FURIA rules provide users with evidence-based (factual) and counterfactual explanations for single classifications. Factual explanations answer the question of why a particular class is selected in terms of the given observations. In addition, counterfactual explanations pay attention to why the rest of the classes are not selected. Thus, endowing FURIA rules with the capability to generate a combination of both factual and counterfactual explanations is likely to make them more trustworthy. We illustrate how to build self-explaining FURIA classifiers in two practical use cases regarding beer style classification and vehicle classification. Experimental results are encouraging. The generated classifiers exhibit accuracy comparable to a black-box classifier such as Random Forest. Moreover, their explainability is comparable to that provided by white-box classifiers designed with the Highly Interpretable Linguistic Knowledge fuzzy modeling methodology (HILK for short).

I. Stepin · A. Catala · J. M. Alonso (B)
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Spain
e-mail: [email protected]
I. Stepin
e-mail: [email protected]
A. Catala
e-mail: [email protected]
M. Pereira-Fariña
Departamento de Filosofía e Antropoloxía, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_6


Keywords Interpretable artificial intelligence · Granular computing · Counterfactual reasoning · Fuzzy rule-based classifiers

1 Introduction The Granular Computing (GrC) paradigm approaches Artificial Intelligence (AI) through human-centric processing of information granules [10, 11, 37]. The granulation task decomposes a whole into meaningful parts. Accordingly, information granules represent concepts which correspond to objects put together in terms of their indistinguishability, similarity, proximity or functionality [39]. GrC deals naturally with information granules and plays a central role in the development of Interpretable AI. One of the main challenges in Interpretable AI is producing transparent, interpretable and explainable AI-based models [2, 20]. There are many successful applications of Interpretable AI (e.g., in medicine [8] or in education [4]). The human centricity of GrC is rooted into the Fuzzy Set Theory [38, 40] which offers a mathematical framework to manage information granules. Information granularity can properly be represented by fuzzy sets. Fuzzy rules relate fuzzy sets and make it possible to infer meaningful information granules at certain levels of abstraction. To do so, information granules must be carefully designed to become interpretable and thus meaningful. Interpretability is one of the most appreciated properties of fuzzy sets and systems [6]. Indeed, the structure of linguistic rules extracted from a fuzzy inference system allows for deducting unambiguously the reasoning behind a fuzzy rule-based classifier for the given data instance. However, while linguistic information granules obtained as a result of fuzzy inference appear to be interpretable, they are not necessarily self-explanatory. While extracting the output explanation from the activated rules in the rule base is relatively straightforward, taking into account solely pure output may be confusing or misleading, as this piece of information does not consider a big picture of the fuzzy inference process. For example, two rules may have similar activation degrees. Hence, interpreting the output in terms of a single fired rule that led to the given output may result in misleading conclusions about the inference process. Instead, considering alternative non- or less activated rules when explaining the system’s factual output may enhance its self-explanatory capacity. The explanation corresponding to the system’s output may be formulated as an answer to the question “Why P and not Q?” where P is the predicted class label and Q is some alternative classification used to explain why the given output is not any different. This property of explanation, referred to as contrastiveness [25], has become the core of the pragmatic approach to explanation in philosophy of science [19]. For example, contrastive explanations have already been generated in the context of image classification [9, 30]. The theoretical foundations of contrastive explanation allow us to reformulate the problem of explainability of fuzzy information granules. Thus, the user can be informed of how the output could have been changed, given certain input features


have had different values. Contrary to factual explanation, it may be of crucial importance to go beyond summarizing available information on the reasoning behind the system’s output and offer an insight into how alternative outcomes could be reached. Such a piece of “contrary-to-fact” information constitutes counterfactual explanation [27]. Being contrastive by nature, counterfactual explanation allows us to find a minimal set of feature intervals that would have led the classifier to making a different decision. Such counterfactual explanations are claimed to enhance user’s trust in decisions made by recommender systems [24]. Moreover, counterfactual thinking [13] and reasoning [29] pave the way on the search of speculations [35]. It is worth noting that not all counterfactuals are equally explanatory for a particular data instance. Indeed, counterfactual explanation is required to be minimally different from the corresponding factual explanation in terms of the input features offered to the user as part of the explanation. Hence, given a set of all potential counterfactual explanations, it is of crucial importance to properly reduce the search space when looking for the most suitable one. Such feature-based explanations [12] are context-dependent, hence certain combinations of features have a stronger explanatory power than others given algorithm’s output. In this regard, counterfactual explanations have different degrees of relevance. Hence, measuring relevance is among the primary concerns when constructing the counterfactual explanation. Preliminary experiments on counterfactual explanation generation for fuzzy rulebased classifiers [34] show that their design offers an excellent opportunity to generate accurate and concise factual and counterfactual explanations for linguistic information granules. However, the nature of the rule bases of different fuzzy classifiers imposes further constraints on their linguistic interpretability. Linguistic fuzzy modeling (LFM) [15] favors interpretability at the expense of accuracy, while precise fuzzy modeling (PFM) [14] gives priority to accuracy. On the one hand, the HILK1 fuzzy modeling methodology, which imposes global semantics thanks to the use of strong fuzzy partitions (SFPs) [32], is an example of LFM. It is worth noting that SFPs satisfy all properties (e.g., coverage, distinguishability, etc.) demanded for interpretable fuzzy partitions [6]. On the other hand, FURIA,2 which deals with local semantics, is an example of PFM. In our previous work [34], we have shown how to generate factual and counterfactual explanations associated to interpretable white-box classifiers. Namely, we considered crisp decision trees and fuzzy decision trees designed with LFM. In this chapter, we investigate how to apply our method for generation of factual and counterfactual explanations in PFM. Namely, we propose the creation of a linguistic layer with global semantics on top of FURIA rules, what makes FURIA rules self-explaining. In practice, we investigate how to do a linguistic approximation, supported by SFPs, of fuzzy information granules in the premises of FURIA rules. Then, we evaluate our proposal in two use cases on beer style classification and vehicle classification where we describe how to generate not only interpretable

1 HILK stands for Highly Interpretable Linguistic Knowledge [7].
2 FURIA stands for Fuzzy Unordered Rule Induction Algorithm [22].
3 DIKW stands for Data, Information, Knowledge and Wisdom [1, 31].

but also self-explanatory fuzzy systems, i.e., how to facilitate the climbing of the DIKW3 pyramid from data to wisdom through GrC. The rest of the chapter is organized as follows. Section 2 introduces some preliminary concepts which are needed to understand the rest of the manuscript. Section 3 describes our method for factual and counterfactual explanation generation for PFM. Section 4 shows an illustrative example. Section 5 describes the experimental settings for testing the method and reports the empirical results. Section 6 discusses in detail the obtained results. Finally, we conclude and outline directions for future work in Sect. 7.

2 Background

We consider the supervised learning problem of multi-class classification, i.e., learning a mapping function f : X → Y from a dataset X = {x_i | 1 ≤ i ≤ n} containing n labeled instances to a discrete output variable (class) Y = {y_p | 1 ≤ p ≤ m}, where m is the number of classes. Each data instance {x_i = (F_i, cl_p) ∈ X | (1 ≤ i ≤ n), (1 ≤ p ≤ m)} is characterized by an output class label cl_p ∈ Y and k continuous numerical features F_i = {(f_j, v_j) | 1 ≤ j ≤ k}, where f_j is the title of the input feature and v_j is the corresponding numeric value in the interval [v_jmin, v_jmax]. The predicted output class cl_f ∈ Y is said to be the factual explanation class. Let us now formally introduce the main notions that we utilize throughout the rest of this chapter. Given a data instance x ∈ X, we operate on a set S = {s_1, …, s_|S|} of Fuzzy Rule-Based Classification Systems (FRBCS) [17] to find the explanation E(x, s) = e_f ∪ E_cf, which consists of a single factual explanation e_f(x, s) and a non-empty set of counterfactual explanations E_cf(x, s) = ∪_cl {e_cf(x, s, cl) : ∀cl ∈ {Y \ cl_f}} for the output of the classification system s ∈ S. An FRBCS is defined in terms of two components:

1. a knowledge base, which is in turn subdivided into two parts:
• a database that consists of a collection of fuzzy input and output variables associated with fuzzy linguistic terms, with the corresponding membership functions;
• a rule base that presents a set R = {r_1, r_2, …, r_|R|} of |R| weighted fuzzy rules of the form

r_k: IF f_1 is v_1 AND … AND f_k is v_k THEN x is y_p    (1)

where r_k ∈ R : 1 ≤ k ≤ |R|, (f_j, v_j) ∈ F_i, y_p ∈ Y. Each rule r ∈ R is a 4-tuple r = ⟨AC_r, cq_r, a, w⟩ that contains an antecedent AC_r and a consequent cq_r, fired by degree a and weighted by factor w. The rule activation degree is computed as the product of a and w. The antecedent AC_r is a non-empty set of single conditions mapping input features to linguistic terms. The consequent cq_r is a unique class label predicted by the classification system given the conditions in the antecedent.

2. a fuzzy processing structure which comprises three main elements:
• a fuzzification interface which translates crisp input data into fuzzy values;
• an inference mechanism which makes the mapping from input to output by means of applying a fuzzy reasoning method (i.e., it infers an output class for the given data instance in terms of the information stored in the knowledge base);
• a defuzzification interface which translates the inferred fuzzy values into a crisp value.

The fuzzy reasoning method of an FRBCS operates on fuzzy sets in the universe of discourse U by applying a membership function μ_t(·) to each feature defined in the corresponding fuzzy set:

μ_t(·) : v_j → [0, 1]    (2)

where t ∈ T , v j ∈ Fi . In case of LFM, each feature { f j ∈ Fi | 1 ≤ i ≤ n, 1 ≤ j ≤ k} f f is defined by a set of linguistic terms T ( f j ) = {T j1 , . . . , T jk }. In case of PFM, we can use generic names M F j for each feature f j but they lack linguistic interpretability. Then, T-norms (e.g., minimum or product) and T-conorms (e.g., maximum or sum) are applied to compute each rule firing degree as well as the aggregated output for the entire rule base. For example, in case of the popular Mamdani fuzzy systems [26], the inference mechanism is called min-max because both conjunction (AND) and implication (THEN) are implemented by the T-norm minimum, and the output accumulation is done by the T-conorm maximum. Even though there are several defuzzification methods to apply with Mamdani systems, the winner rule is usually applied in FRBCS (i.e., the output class is determined by the conclusion of the rule with the highest activation degree). In case of the HILK [7] LFM, each input feature f j is associated with an SFP that is defined in U. Such SFPs are defined according to both expert knowledge and data distribution, regarding granularity and semantics (i.e., each fuzzy set is associated to a meaningful linguistic term). In addition, the rule base is endowed with global semantics because the linguistic terms defined beforehand are shared among all rules. Moreover, all rules have the same weight (w = 1) so rule activation degree equals rule firing degree; what facilitates the understanding of the winner rule inference mechanism while reading the list of rules. In case of the FURIA [22] PFM, rules have local semantics and lack linguistic interpretability (i.e., no SFPs are defined beforehand). In addition, FURIA applies the winner rule inference mechanism with weighted rules in combination with the so-called rule stretching method. Rule stretching applies when no rules are fired for a given input data instance. Then, FURIA creates on the fly a new set of rules from the


initial rule base. In short, it iteratively tours every rule tuning antecedents in order one by one (taking advantage of the ordering of antecedents set in the induction of rules) from the least to the most important, until the instance is covered. If all antecedents are removed from an individual rule, then this rule is discarded. As a result, the new rule set includes at most the same number of initial rules. Even if FURIA usually produces a small number of compact rules and it is able to find out the right output class in most cases, understanding the inference mechanism is hard, especially when the rule stretching mechanism comes into play. In practice, HILK is implemented in the GUAJE4 [28] open source software for fuzzy modeling. In addition, FURIA is implemented in the WEKA [21, 36] data mining tool. Moreover, the ExpliClas [5] web service generates factual explanations in natural language of several WEKA classifiers, including FURIA. It is worth noting that both GUAJE and ExpliClas produce FRBCSs which comply with the IEEE standard for fuzzy markup language [33].
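To make the winner-rule mechanism concrete, the sketch below implements it for a toy rule base: each rule's firing degree is the T-norm (here, the product) of its condition memberships, its activation degree is the firing degree multiplied by the rule weight, and the class of the most activated rule wins. The membership functions and rules are placeholders and are not FURIA or HILK output.

```python
# Minimal sketch of winner-rule inference in an FRBCS.  Membership functions and
# rules below are toy placeholders, not rules induced by FURIA or HILK.
def trapezoid(a, b, c, d):
    """Trapezoidal membership function with support [a, d] and core [b, c]."""
    def mu(v):
        if v < a or v > d:
            return 0.0
        if b <= v <= c:
            return 1.0
        return (v - a) / (b - a) if v < b else (d - v) / (d - c)
    return mu

MF = {("length", "short"): trapezoid(0, 0, 2, 4),
      ("length", "long"):  trapezoid(2, 4, 10, 10)}

RULES = [  # (antecedent as {(feature, linguistic term)}, consequent class, rule weight)
    ({("length", "short")}, "class_A", 0.9),
    ({("length", "long")},  "class_B", 0.8),
]

def classify(instance):
    best_class, best_activation = None, 0.0
    for antecedent, consequent, weight in RULES:
        firing = 1.0
        for feature, term in antecedent:
            firing *= MF[(feature, term)](instance[feature])   # product T-norm
        activation = firing * weight                           # activation = firing x weight
        if activation > best_activation:
            best_class, best_activation = consequent, activation
    return best_class, best_activation

print(classify({"length": 3.0}))   # class_A wins with activation 0.45
```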

3 Proposal

We propose a method with the aim of generating an explanation consisting of one piece of factual explanation as well as of a non-empty set of counterfactual explanations. A factual explanation e_f(x, s) for the given data instance x ∈ X and an FRBCS s ∈ S is defined as a tuple:

$e_f(x, s) = \langle AC_f, cq_f \rangle$  (3)

where AC_f = {ac_{f_i} | 1 ≤ i ≤ |AC_f|} is the set of conditions in the corresponding rule r_f ∈ R, and cq_f ∈ Y is the consequent of rule r_f (for the general template of the rule, see Eq. 1). Every data instance x is assumed to have only one factual explanation e_f(x, s), which is associated with the winner rule.

Similarly, a (single) counterfactual explanation e_{cf}(x, s, y_{cf}) for the given data instance x ∈ X classified by an FRBCS s ∈ S and a non-predicted class y_{cf} ∈ {Y \ cq_f} is a tuple:

$e_{cf}(x, s, y_{cf}) = \langle AC_{cf_y}, cq_{cf_y} \rangle$  (4)

There exist as many candidate counterfactual explanations as rules r ∈ R leading to classes different from the factual explanation class. However, only the most relevant counterfactual explanations for each non-predicted class are included in the set of selected counterfactual explanations E_{cf} derived from the rule base. Hence, the exhaustive set of all potential counterfactual explanations for the data instance x is defined as the set of counterfactual explanations to all the classifications of each non-predicted class:


Fig. 1 A pipeline of the algorithmic workflow

$E_{cf}(x, s) = \bigcup_{y} \left\{ e_{cf}(x, s, y) : \forall y \in \{Y \setminus cq_f\} \right\}$  (5)

We construct the textual representation of both factual and counterfactual explanations in terms of linguistic information granules [23]. A factual explanation e_f(x, s) for an arbitrary FRBCS s ∈ S and a minimal set of minimal counterfactual explanations e_{cf}(x, s, Y_{cf}) are claimed to fully explain a particular classification outcome. In order to generate a corresponding conditional textual (factual and/or counterfactual) explanation, we further assume a set {c_1, …, c_k} of conditions (i.e., "c_i = f_i is v_{ik}" in Eq. 1) to be the antecedent of the output sentence, whereas the conclusive classification cl_j ∈ Y (i.e., "x is y_j" in Eq. 1) is referred to as the consequent of the output sentence.

Hereby we propose a method to generate a factual explanation for the given test instance x = (F, cl_x) ∈ X and a non-empty set of counterfactual explanations E_{cf} for all the classes opposed to the predicted one {cl | cl ∈ {Y \ cq_f}}, one counterfactual per class. Furthermore, we distinguish multiple counterfactual explanations by ranking them in accordance with their explanatory capacity with respect to the input information of the data instance. In addition to producing a logical representation of the computed counterfactuals, the output explanations are designed to be sentences that not only specify the classifier's prediction but also offer a comprehensive explanation justifying the classifier's behavior.

A factual explanation is assumed to be a sequence of conditions (i.e., rule premises) constituting the antecedent of the winner rule r_f ∈ R. It is trivially reconstructed by merging the antecedent constituents mapped to the corresponding linguistic terms. Conversely, counterfactual explanation generation presumes the four following steps, as depicted in Fig. 1. First, the rules whose consequents are classes different from the factual explanation consequent make up a collection of candidate counterfactuals. Second, the candidate counterfactuals are ranked on the basis of their distance to the factual explanation rule. Third, linguistic approximation is applied to derive meaningful linguistic explanation terms if expert knowledge-based linguistic terms are not readily available. Fourth, the selected best-ranked counterfactuals are passed to the natural language generation module, where they are transformed into comprehensive textual fragments understandable by end users. Let us consider each of these steps in detail.

• Candidate counterfactual explanation representation. At first, all the rules r_k ∈ R_{cf} = {R \ r_f} : {1 ≤ k ≤ |R_{cf}|} with consequents different from that of the winner rule are collected. In order to estimate the impact of each variation


of the given linguistic terms, we set up a |R_{cf}| × p binary rule matrix called RM, where each row corresponds to a rule in the rule base {r_k | 1 ≤ k ≤ |R_{cf}|} and the columns correspond to the set c = {c_1, c_2, …, c_p} of all the unique feature-value pairs. Each cell RM_{ij} (1 ≤ i ≤ |R_{cf}|, 1 ≤ j ≤ p) of the rule matrix is populated with binary values so that:

$RM_{ij} = \begin{cases} 1, & \text{if } c_j \in f_m \times v_m \ (1 \leq m \leq |F|) \\ 0, & \text{otherwise} \end{cases}$  (6)

Similarly, the test instance x ∈ X is vectorized in the form of a binary vector test_{1×p} = [test_1, test_2, …, test_p] (p = |c|). The test instance vector contains binarized membership function values obtained as a result of fuzzy inference and subsequent α-cutting:

$test_j = \begin{cases} 1, & \text{if } \mu(c_j) \geq \delta \\ 0, & \text{otherwise} \end{cases}$  (7)

where δ ∈ [0, 1] is a predefined threshold.

• Counterfactual relevance estimation. If there exists only one counterfactual prediction (|R_{cf}| = 1), the only identified counterfactual constitutes the set of counterfactual explanations E_{cf} = {r_{cf_1}}. Otherwise, candidate counterfactual explanations E_{cf} = {e_{cf_1}, e_{cf_2}, …, e_{cf_{|R_{cf}|}}} are ranked to determine the most relevant one in accordance with their distance to the factual classification. Ranking counterfactuals allows us to ensure that the test instance and the best counterfactual data point are minimally different. To do so, we calculate the bitwise XOR-based distance for each pair of vectors test_{1×p} and r_{1×p}, ∀r_k ∈ R_{cf} (1 ≤ k ≤ |R_{cf}|), normalized over the number of feature-value permutations and thus transformed into a scalar:

$dist(test, r_k) = \dfrac{\sum_{i=1}^{|r_k|} \mathbb{1}[test_i \neq r_i]}{|r_k|}$  (8)

All the obtained distance values are sorted to enable us to find the minimally distant counterfactual e_{cf}(x, s, y_{cf}), which is claimed to be the most relevant to complement the factual classification. If multiple counterfactuals have the same minimal distance, they are claimed equivalently relevant, so the counterfactual included in the final explanation is selected randomly.

• Linguistic approximation. Both factual and counterfactual FRBCS explanations are supported by linguistic approximations. In the absence of linguistic terms provided by experts, best-ranked candidates are passed on to the linguistic approximation layer. More precisely, each numerical condition c_j associated to feature f_j is verbalized as "f_j is T^f_a", T^f_a being the most similar linguistic term in T(f_j). In practice, we compute the similarity degree based on the Jaccard Similarity Index [18] between two numerical intervals, S(A, L), as follows:

$\forall L \approx T^f_a \in T(f_j) : \quad S(A, L) = \dfrac{A \cap L}{A \cup L} \in [0, 1]$  (9)

S(A, L) is 1 in case A perfectly matches L, and 0 if both intervals are disjoint. A is the numerical interval representing c_j. L is the numerical interval associated to the selected α-cut. For example, given a feature f_j ∈ U = [U_min, U_max] and c_j = "f_j ≤ a", a ∈ U, then A = [U_min, a]. In addition, if f_j were defined by a uniform SFP with two linguistic terms (Low, High) and α = 0.5, then L_Low = [U_min, (U_min + U_max)/2] and L_High = [(U_min + U_max)/2, U_max].

• Textual explanation generation. Explanations for each class are designed to be two-sentence pieces of text, where the first sentence is a linguistic realization of the factual explanation e_f(x, s), whereas the second offers the best-ranked counterfactual explanation e_{cf}(x, s, y_{cf}) ∈ E_{cf} for the given non-predicted class y_{cf} ∈ {Y \ cq_f}. Then, the linguistic realization module constructs an explanation from the antecedent {ac_i = (f_j, v_j)} : (1 ≤ i ≤ |AC_r|, 1 ≤ j ≤ |AC_i|) and the consequent of the corresponding explanations (cq_f ∈ Y for a factual explanation and y_{cf} ∈ {Y \ cq_f} for the best-ranked counterfactual explanation). The following template is used for all combinations of such explanations:

The test instance is of class ⟨cq_f⟩ because f_1 is v_1 (and f_2 is v_2 (and … (and f_k is v_k))). It would be of class ⟨y_{cf}⟩ if ⟨e_{cf_1}(x, s, y_{cf})⟩.

It is important to note that in the case of fuzzy partitions with more than 2 fuzzy sets, we consider not only the linguistic terms in the SFP but also additional combinations of adjacent linguistic terms. For example, given an SFP with 5 linguistic terms (e.g., Very low, Low, Medium, High, Very high), the additionally composed linguistic terms to take into account when computing the linguistic approximations are: Very low or Low, Low or Medium, Medium or High, High or Very high, Very low or Low or Medium, Low or Medium or High, Medium or High or Very high, Very low or Low or Medium or High, Low or Medium or High or Very high.
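The following Python sketch illustrates the core computations behind the ranking and linguistic approximation steps above: the normalized XOR-based distance of Eq. (8) between binarized rule vectors, and the Jaccard similarity of Eq. (9) between numerical intervals. The example rule vectors, intervals and term names are invented for illustration and do not come from the chapter's use case.

```python
def xor_distance(test_vec, rule_vec):
    """Normalized XOR-based distance between two binary vectors (Eq. 8)."""
    assert len(test_vec) == len(rule_vec)
    mismatches = sum(1 for t, r in zip(test_vec, rule_vec) if t != r)
    return mismatches / len(rule_vec)

def interval_jaccard(a, b):
    """Jaccard similarity between two closed numerical intervals (Eq. 9)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    inter = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = max(a_hi, b_hi) - min(a_lo, b_lo)
    return inter / union if union > 0 else 0.0

# Hypothetical binarized test instance and candidate counterfactual rules
# over p = 4 feature-value pairs.
test = [1, 0, 1, 0]
candidates = {"R3": [1, 1, 0, 0], "R5": [1, 0, 0, 1], "R8": [0, 1, 0, 1]}

ranking = sorted(candidates.items(), key=lambda kv: xor_distance(test, kv[1]))
print(ranking[0])  # the minimally distant (most relevant) candidate

# Linguistic approximation: pick the most similar term for the interval A = [0, 12].
terms = {"Low": (0.0, 14.0), "High": (14.0, 250.0)}   # hypothetical alpha-cut intervals
A = (0.0, 12.0)
best_term = max(terms, key=lambda t: interval_jaccard(A, terms[t]))
print(best_term)   # 'Low' for this made-up example
```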

4 Illustrative Use Case

As a use case, we have considered the beer style classification problem used for illustration in our previous work [34]. The classification task consists of identifying one out of eight beer styles (Blanche, Lager, Pilsner, IPA, Stout, Barleywine, Porter, and Belgian Strong Ale) in terms of three features (Color, Bitterness and Strength). The Beer dataset, hereafter denoted as X, was built up by Castellano et al. from a collection of beer recipes provided by expert brewers [16]; the dataset in arff format is available online at https://gitlab.citius.usc.es/jose.alonso/xai. The numerical intervals for all the features, provided by an expert brewer, are given in Table 1. These numerical intervals correspond to applying to the SFPs the α-cut that equals 0.5. The fuzzy sets associated to Color are illustrated in Fig. 2.


Table 1 Numerical intervals associated to each linguistic term according to an expert brewer

Feature    | Linguistic term | Range of values
Color      | Pale            | 0–3
           | Straw           | 3–7.5
           | Amber           | 7.5–19
           | Brown           | 19–29
           | Black           | 29–45
Bitterness | Low             | 7–21
           | Low-medium      | 21–32.5
           | Medium-high     | 32.5–47.5
           | High            | 47.5–250
Strength   | Session         | 0.035–0.0525
           | Standard        | 0.0525–0.0675
           | High            | 0.0675–0.09
           | Very high       | 0.09–0.136

Fig. 2 Fuzzy sets for expert knowledge-based linguistic terms (feature Color). For illustrative purposes, only one additional linguistic term (Amber or Brown) is depicted

In order to illustrate the method proposed in the previous section, we are going to outline a short illustrative example. Let us define as s a FURIA-based FRBCS whose behavior we aim to explain and assume that the said classification system's rule base contains a set R of 14 rules (all of them with w = 1) composed as follows:

R1: IF Bitterness is MF_b0 AND Color is MF_c0 THEN Beer is Blanche
R2: IF Bitterness is MF_b1 AND Color is MF_c1 THEN Beer is Blanche
R3: IF Bitterness is MF_b2 AND Color is MF_c2 THEN Beer is Lager
R4: IF Bitterness is MF_b3 AND Color is MF_c3 THEN Beer is Lager
R5: IF Bitterness is MF_b4 AND Color is MF_c4 THEN Beer is Pilsner
R6: IF Bitterness is MF_b5 AND Color is MF_c5 AND Strength is MF_s0 THEN Beer is IPA
R7: IF Bitterness is MF_b6 AND Strength is MF_s1 THEN Beer is IPA
R8: IF Color is MF_c6 THEN Beer is Stout
R9: IF Bitterness is MF_b7 AND Color is MF_c7 AND Strength is MF_s2 THEN Beer is Stout
R10: IF Bitterness is MF_b8 AND Strength is MF_s3 THEN Beer is Barleywine
R11: IF Bitterness is MF_b9 AND Color is MF_c8 THEN Beer is Barleywine
R12: IF Color is MF_c9 AND Strength is MF_s4 THEN Beer is Porter
R13: IF Color is MF_c10 AND Strength is MF_s5 THEN Beer is Porter
R14: IF Bitterness is MF_b10 AND Strength is MF_s6 THEN Beer is Belgian Strong Ale

The meaning of the membership functions in the previous rule base is given in Table 2. Each membership function is characterized by a trapezoidal function as follows:

$\mu_t(x) = \begin{cases} 0, & x \leq a \\ \dfrac{x-a}{b-a}, & a < x < b \\ 1, & b \leq x \leq c \\ \dfrac{d-x}{d-c}, & c < x < d \\ 0, & x \geq d \end{cases}$

… > 0, then (x1, x2) belongs to class 1, else to class 2. Next, the green line is interpolated by a dotted step function S. Then each step i is represented as a logical rule: if x1 is within the limits s_i11 ≤ x1 ≤ s_i12, and x2 is within the limits s_i21 ≤ x2 ≤ s_i22 defined by the step, then the pair (x1, x2) is in class 1, else in class 2. We interpolate the whole linear function with multiple steps and multiple such logical rules. This set of rules can be quite large, in contrast with a compact linear equation, but these rules have a clear interpretation for domain experts when x1 and x2 have a clear meaning in the domain. For each new case to be predicted, we just need to find the specific step that is applicable to this case and provide the single simple local rule of that step to the user as an explanation. Moreover, we do not need to store the step function and the rules; they can be generated on the fly for each new case to be predicted. To check that the linear function is meaningful before applying it to new cases, a user can randomly select validation cases, find their expected steps/rules, and evaluate how meaningful those rules and their predictions are.
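As a concrete illustration of the step-function idea just described, the following Python sketch converts a 2-D linear discriminant into a set of local interval rules. The linear coefficients, step width, and the example case are made-up values for illustration only.

```python
import numpy as np

# Hypothetical linear discriminant: w1*x1 + w2*x2 + b > 0 -> class 1, else class 2.
w1, w2, b = 0.8, -1.0, 2.0
ABOVE_IS_CLASS1 = w2 > 0   # which side of the line is class 1 depends on the sign of w2

def boundary_x2(x1):
    """x2 value on the discrimination line for a given x1 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

def step_rules(x1_min, x1_max, n_steps):
    """Approximate the line by a step function and emit one interval rule per step."""
    edges = np.linspace(x1_min, x1_max, n_steps + 1)
    rules = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        level = boundary_x2((lo + hi) / 2.0)   # step height at the interval midpoint
        rules.append((lo, hi, level))
    return rules

def explain(x1, x2, rules):
    """Find the single local rule applicable to a new case and phrase it."""
    for lo, hi, level in rules:
        if lo <= x1 <= hi:
            above = x2 > level
            cls = 1 if above == ABOVE_IS_CLASS1 else 2
            side = ">" if above else "<="
            return (f"Because {lo:.1f} <= x1 <= {hi:.1f} and x2 {side} {level:.1f}, "
                    f"the case is in class {cls}.")
    return "No local rule covers this case."

rules = step_rules(0.0, 10.0, n_steps=5)
print(explain(3.0, 7.0, rules))
```

Note that the rules are generated on the fly from the linear model, as discussed above, so nothing beyond the linear coefficients and the chosen step width needs to be stored.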


Quasi-explainable weights. It is often claimed that weights in linear models are a major and efficient tool for providing model interpretation to the user, e.g., [47]. Unfortunately, in general this is an incorrect statement, while multiple AutoML systems implement it as a model interpretation tool [68] for linear and non-linear discrimination models. The reason is the same as before: heterogeneity of the attributes. If a weighted sum has no interpretation, then the fact that the weight a for x1 is two times greater than the weight b for x2 does not justify saying that x1 is two times more important than x2. This is illustrated in Fig. 1 and explained below.

In the case of homogeneous attributes and a vertical discrimination line, a single attribute X1 would discriminate the classes. Similarly, if a horizontal line discriminates the classes, then a single attribute X2 would discriminate them, and X1 would be unimportant. If the discrimination line is the diagonal at 45°, then X1 and X2 seem equally important. Respectively, for angles larger than 45° X1 would be more important, otherwise X2 would be more important. In fact, in Fig. 1 (on the left) the angle is less than 45°, with X2 more important, but on the right the angle is greater than 45°, with X1 more important. However, both pictures show the discrimination for the same data. The only difference is that X1 is in decimeters on the left and in meters on the right. If X1 is the length of the object in decimeters and X2 is its weight in kilograms, then X2 is more important in Fig. 1, but if X1 is in meters, then X1 is more important. Thus, we get very different relative importance of these attributes. In both cases we get a quasi-explanation. In contrast, if X1 and X2 were homogeneous attributes, e.g., both lengths in meters, then the weights of the attributes could express their importance meaningfully and contribute to an actual, not quasi-, explanation.

Note that a linear model on heterogeneous data converted to logical rules is free of this confusion. The step intervals are expressed in actual measurement units. The length and height of those steps can help to derive the importance of attributes. Consider a narrow but tall step. At first glance it indicates high sensitivity to X1 and low sensitivity to X2. However, it depends on meaningful insensitivity units in X1 and X2, which a domain expert can set up. For instance, let the step length be 10 m and the step height 50 kg, with insensitivity units of 2 m and 5 kg. Respectively, we get 5 X1 units and 10 X2 units and can claim high sensitivity to X1 and low sensitivity to X2. In contrast, if the units were 2 m and 25 kg, then we would get 5 X1 units and 2 X2 units and could claim the opposite. Thus, meaningful scaling and insensitivity units need to be set up as part of the model discovery and interpretation process for linear models with heterogeneous attributes to avoid quasi-explanations.
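The following Python sketch, using scikit-learn, illustrates the scaling effect discussed above: the same synthetic data are fit twice, once with the first attribute expressed in decimeters and once in meters, and the ratio of the learned weights changes accordingly. The data and the model choice (logistic regression) are illustrative assumptions, not taken from the chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic 2-class data: X1 = length in decimeters, X2 = weight in kilograms.
n = 200
length_dm = rng.uniform(10, 60, n)           # 1-6 meters expressed in decimeters
weight_kg = rng.uniform(20, 100, n)
labels = (0.05 * length_dm + 0.03 * weight_kg + rng.normal(0, 0.3, n) > 3.2).astype(int)

X_dm = np.column_stack([length_dm, weight_kg])         # X1 in decimeters
X_m = np.column_stack([length_dm / 10.0, weight_kg])   # the same X1 in meters

for name, X in [("X1 in decimeters", X_dm), ("X1 in meters", X_m)]:
    w = LogisticRegression(max_iter=1000).fit(X, labels).coef_[0]
    print(f"{name}: weights = {w.round(3)}, |w1|/|w2| = {abs(w[0]) / abs(w[1]):.2f}")
# The reported weight ratio changes substantially between the two runs, although the
# underlying objects and the classification problem are identical.
```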

1.3 Informal Definitions

Often desirable characteristics of the explanation are used to define it, e.g., [7, 8, 9]:

• Give an explanation that is meaningful to a domain expert;


• Give an explanation comprehensible to humans in (i) natural language and in (ii) easy to understand representations;
• Give an explanation to humans using domain knowledge, not ML concepts that are external to the domain;
• Provide positive or negative arguments for the prediction;
• Answer why this prediction is being made and not an alternative one;
• Give an explanation of how inputs are mathematically mapped to outputs, e.g., in regression and generalized additive models.

The question is how to check that these desirable characteristics are satisfied. Microsoft and Google have started eXplainable AI (XAI) services, but do we have operational definitions of model comprehensibility, interpretability, intelligibility, explainability, and understandability to check these properties? Is a bounding box around the face with facial landmarks provided by the Google service an operational explanation, without telling in understandable terms how it was derived?

1.4 Formal Operational Definitions

It is stated in [47]: "There is no mathematical definition of interpretability." Fortunately, this statement is not true. Some definitions have been known for a long time. They require:

• showing how each training example can be inferred from background knowledge (domain theory) as an instance of the target concept/class by probabilistic first order logic (FOL), e.g., [22, 46, 49, 50].

Below we summarize more recent ideas from [51], which are inspired by the work of Donald Michie [43], who formulated three Machine Learning quality criteria:

• Weak criterion: ML improves predictive accuracy with more data.
• Strong criterion: ML additionally provides its hypotheses in symbolic form.
• Ultra-strong criterion (comprehensibility): ML additionally teaches the hypothesis to a human, who consequently performs better than a human studying the training data alone.

The definitions from [51], presented below, are intended to make the ultra-strong criterion testable. These definitions allow studying comprehensibility experimentally and operationally.

Definition (Comprehensibility, C(S, P)). The comprehensibility of a definition (or program) P, with respect to a human population S, is the mean accuracy with which a human s from population S, after brief study and without further sight, can use P to classify new material sampled randomly from the definition's domain.


Definition (Inspection time, T(S, P)). The inspection time T of a definition (or program) P, with respect to a human population S, is the mean time a human s from S spends studying P before applying P to new material.

Definition (Textual complexity, Sz(P)). The textual complexity Sz of a definition of a definite program P is the sum of the occurrences of predicate symbols, function symbols and variables found in P.

The ideas of these definitions, jointly with prior Probabilistic First Order Logic inference, create a solid mathematical basis for interpretability developments.
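As a small illustration of how the textual complexity Sz(P) could be computed operationally, the Python sketch below counts predicate symbols, function symbols and variables in a toy clause representation; the data structures and the example program are invented for illustration and are not part of the cited definitions.

```python
# A toy representation of a definite program: each clause is a list of literals,
# and each literal is (predicate, [arguments]); arguments are either variables
# (capitalized strings, Prolog-style), constants, or function terms (functor, [args]).
PROGRAM = [
    # grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    [("grandparent", ["X", "Z"]), ("parent", ["X", "Y"]), ("parent", ["Y", "Z"])],
    # likes(X, s(X)).  -- uses a function symbol s/1
    [("likes", ["X", ("s", ["X"])])],
]

def term_size(term):
    """Count function symbols and variables inside a single argument term."""
    if isinstance(term, tuple):                      # function term (functor, args)
        functor, args = term
        return 1 + sum(term_size(a) for a in args)   # 1 for the function symbol
    return 1 if term[0].isupper() else 0             # 1 for a variable, 0 for a constant

def textual_complexity(program):
    """Sz(P): occurrences of predicate symbols, function symbols and variables."""
    size = 0
    for clause in program:
        for predicate, args in clause:
            size += 1                                # the predicate symbol occurrence
            size += sum(term_size(a) for a in args)
    return size

print(textual_complexity(PROGRAM))   # 13 for this toy program
```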

1.5 Interpretability and Granularity

It was pointed out in [6] that interpretation differs from causal reasoning and is, in fact, much more than causal reasoning. The main point in [6] is that, even if we had perfect causal reasoning for the ML model, it would not be an explanation for a human: a detailed causal reasoning chain can be beyond human abilities to understand. Assuming this is true, we need granularity of the reasoning, generalized at different levels, in linguistic terms.

This can be similar to what is done in fuzzy control. In fuzzy control, a human expert formulates simple control rules in uncertain linguistic terms, without details, such as: if A is large and B is small, and C is medium, then control D needs to be slow. Such rules are formalized through membership functions (MFs) and different aggregation algorithms, which combine MFs. Then the parameters of the MFs are tuned using available data. Finally, the produced control model becomes competitive with a model built using solid physics, as multiple studies have shown.

The ML explanation task for detailed causal reasoning is the opposite task. We do not start from a linguistic rule, but from a complex reasoning sequence, and need to create simple rules. Moreover, these rules need to be at different levels of detail/granularity. How can we design this type of rules from black box machine learning models? This is a very complex and open question.

First, we want to build this type of granular rules for explainable models, such as decision trees. Assume that we have a huge decision tree, with literally hundreds of nodes, and branches with dozens or hundreds of elements on each branch. A human can trace and understand any small part of it. However, the total tree is beyond human capabilities for understanding, as is tracing a branch which has, say, 50 different conditions like x1 > 5, x2 < 6, x3 > 10 and so on 50 times. It is hard to imagine that anybody will be able to meaningfully trace and analyze it easily. In this situation, we need ways to generalize a decision tree branch, as well as the whole tree. Selecting the most important attributes, grouping attributes into larger categories, and matching with a representative case are among


the ways to decrease human cognitive load. Actual inequalities like x1 > 5, x2 < 6, and x3 > 10 can be substituted by "large" x1 and x3, and "small" x2. If x1 and x3 are the width and length and x2 is the weight of the object, then we can say that a large light object belongs to class 1. If an example of such an object is a bucket, we can say that objects like buckets belong to class 1. The perceptually acceptable visual explanation of an ML model is often at a different and coarser level of granularity than the ML model itself. Therefore, the visual ML models that we discuss in Sect. 4 have an important advantage of being at the same granularity level as their visual explanation.
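A minimal Python sketch of this kind of granular generalization is given below: numeric conditions on a decision tree branch are mapped to coarse linguistic terms using expert-chosen thresholds. The attribute names, thresholds, and branch are hypothetical and serve only to illustrate the idea.

```python
# Expert-chosen granulation thresholds per attribute (hypothetical values).
GRANULES = {
    "width":  [(0, 3, "small"), (3, 7, "medium"), (7, float("inf"), "large")],
    "length": [(0, 4, "small"), (4, 8, "medium"), (8, float("inf"), "large")],
    "weight": [(0, 6, "light"), (6, 20, "medium"), (20, float("inf"), "heavy")],
}

def linguistic_term(attribute, op, threshold):
    """Map a single numeric condition, e.g. width > 5, to a coarse linguistic term."""
    for lo, hi, term in GRANULES[attribute]:
        if op == ">" and lo <= threshold < hi:
            return f"{attribute} is {term} or above"
        if op == "<" and lo < threshold <= hi:
            return f"{attribute} is {term} or below"
    return f"{attribute} {op} {threshold}"     # fall back to the raw condition

# A hypothetical decision tree branch with raw numeric conditions.
branch = [("width", ">", 5), ("weight", "<", 6), ("length", ">", 10)]

summary = ", ".join(linguistic_term(*condition) for condition in branch)
print(f"IF {summary} THEN class 1")
# e.g. IF width is medium or above, weight is light or below, length is large or above THEN class 1
```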

2 Foundations of Interpretability

2.1 How Interpretable Are the Current Interpretable Models?

Until recently, the most interpretable large time series models were not really interpretable [59]. In general, the answer to this question depends on many factors, such as the definition of interpretable models and the domain's needs for explanation. We already discussed the definition issue. The needs issues are as follows: how severely the domain needs the explanation, what kind of explanation would suffice for the domain needs, how complete an explanation is needed, and the types/modality of data. These factors are summarized below:

• Level of domain needs. Problems in healthcare, e.g., risk of mortality, have more stringent explanation requirements than retail, e.g., placement of ads [1].
• Level of soundness needed. An explanation is sound if it adheres to how the model works [34]. An optical character recognition (OCR) model for text printed on a high-quality laser printer, based on neural networks, does not need to be sound as long as it has high OCR accuracy. In contrast, models for healthcare need to be sound.
• Level of completeness needed. An explanation is complete if it encompasses the complete extent of the model [34]. The OCR task for handwritten text needs explanation for poorly written characters and people with bad handwriting habits.
• Data types/modality. Different data types (structured or unstructured data, images, speech, text, time series and so on) can have different needs for explanation.

2.2 Domain Specificity of Interpretations

Domain specificity has multiple aspects. The major one is the need to describe the trained ML model in terms of the domain ontology, without using terms that are foreign to the domain where the ML task must be solved [22, 33]. It is much more critical for domains and problems with a high cost of errors, such as medicine.


The next question is to what extent terms and concepts which are foreign to the domain can be included into the explanations and still be useful. General statements like "the explanations/interpretation must make sense to the domain expert who is going to use the ML model" are not very helpful because they are not operational. Another important question is what the explanation should be for a domain without much theory and background knowledge. An example is predicting the rating of a new movie. The same question should be answered for a domain where the background knowledge is very inconsistent and expert opinions are very diverse on the same issue.

2.3 User Centricity of Interpretations

While asking the explanations to be in the right language and in the right context [9, 11] seems mandatory, the actual issue is how to define this within the domain terms and ontology. Not every term from the domain ontology can be used in the explanation efficiently. Moreover, one of several equivalent concepts can be preferred by some users. Some users can prefer decision trees, while others prefer logic rules, when both are applicable. The quest for simple explanation came to its purest form in the ELI5 principle: Explain it Like I am 5 years old. This obviously does not come for free; making sense may require sacrificing or deemphasizing model fidelity. User-centricity of the explanations often requires them to be role-based. A physician needs different explanations compared to a staffing planner in a hospital. Thus, the simplicity and fidelity of the explanation for people in these roles is not the same.

2.4 Types of Interpretable Models

Internally interpreted versus externally interpreted models. Internally interpreted models do not separate the predictive model and the explanation model. Such models are self-explanatory, e.g., decision trees. They are explained in terms of interpreted elements of their structure, not only of their inputs. An extra effort to convert the model to a more convenient form, including visualization, is often beneficial.

Externally interpreted models contain a separate predictive model and an external explanation model. Such ML models are explained in terms of interpretable input data and attributes, but typically without interpreting the model structure. An example would be providing a list of the most important input attributes without telling how it is supported by the model's structure. Another example is producing an interpretable decision tree or logic rules from a neural network (NN) [62] by interpolating a set of input–output pairs generated from the


NN. As with any interpolation, it can differ from the NN, for instance due to an insufficient size of the dataset used for interpolation. If a new case to be predicted is not represented in the generated set of input–output pairs by similar cases, then the decision tree prediction and explanation will not represent the NN model. This is a major problem of the external explanation approach, which can produce a quasi-explanation.

Explicit versus implicit interpretations. Decision trees and logic rules provide explicit interpretations. Many other ML methods produce only implicit interpretations, where a full interpretation needs to be derived using additional domain knowledge. An implicit heatmap explanation for a Deep Neural Network (DNN) model, which we discuss in detail in the next section, requires human knowledge beyond the image.
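A minimal Python sketch of the external (surrogate) interpretation approach described above is shown below, using scikit-learn: a neural network is trained, input-output pairs are generated from it, and a decision tree is fit to those pairs as an external explanation model. The dataset and model settings are illustrative assumptions, not the chapter's experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Black-box model: a small neural network on synthetic data.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

# Generate input-output pairs from the NN (here: random probes of the input space).
probes = np.random.default_rng(0).uniform(X.min(axis=0), X.max(axis=0), size=(2000, 4))
nn_labels = nn.predict(probes)

# External explanation model: a shallow decision tree interpolating the NN's behavior.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(probes, nn_labels)
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(1, 5)]))

# Fidelity of the surrogate to the NN on the probes; low fidelity (or probes that miss
# the region of a new case) means the tree's explanation may not represent the NN.
print("fidelity:", (surrogate.predict(probes) == nn_labels).mean())
```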

2.5 Using Black-Box Models to Explain Black-Box Models

The examples below show shallow, not deep, explanations that attempt to explain one black box using another black box. Often the end users very quickly recognize this, because a black-box explanation does not answer a simple question: why is this the right explanation? Deciphering that black box is left to the user.

Consider the task of recognizing a boat in Fig. 2 from [48] with an implicit DNN explanation. We can recognize a boat based on a group of pixels highlighted by the DNN as a heatmap that represents a mast. To derive a conceptual explanation that uses the concept of a mast, we need external human common-sense knowledge of what the mast is and how it differs from other objects. This is not a part of the heatmap DNN explanatory model that shows salient pixels. Without the concept of the mast, salient pixels provide a shallow black-box explanation, not a deep explanation. Thus, deep learning neural network models are deep in terms of the number of layers in the network; this number is larger than in prior traditional neural network models. However, as was illustrated above, DNNs are not deep in terms of the explanations. Fundamentally new approaches are needed to make the current quasi-explanation in DNN a really deep

Fig. 2 DNN boat explanation example [48]


explanation. An alternative approach, which we advocate, such as GLC, is building explainable models from the very beginning, in addition to or instead of explaining DNNs and other black boxes.

Similarly, in medical imaging, external domain knowledge is needed for a deep explanation. If an expert radiologist cannot match DNN salient pixels with domain concepts such as a tumor, these pixels will not serve as an explanation for the radiologist. Moreover, the radiologist can reject these pixels as being a tumor. The major problem is explaining, in the domain terms, why these salient points are the right ones. It fundamentally differs from explaining, in ML terms, how these points were derived. One of the common methods for the latter is backward gradient tracing in the DNN, to find the salient pixels that contributed most to the class prediction. This explanation is completely foreign to the radiology domain. In other words, we try to produce an explanation using a method that is unexplained and unexplainable for the domain expert. This can be a deep explanation for the computer scientist, but not for a radiologist who is the end user. In the boat example, in addition to the unexplained prediction of the class "boat", we produce unexplained salient pixels as an explanation of the boat. Here we attempt to explain one black box using another black box. This is rather a quasi-explanation. This quasi-explanation happened because of the use of model concepts and structures which are foreign to the domain.

Why are explanation models often not explained? Often it is just a reflection of the fact that the ML model explainability domain is in a nascent stage. Explaining one black box using another black box is an acceptable first step to a deep explanation, but it should not be the last one. Can every black-box explanation be expanded to a deep one? This is an open question.

3 Overview of Visual Interpretability

3.1 What is Visual Interpretability?

Visual methods that support interpretability of ML models have several important advantages over non-visual methods, including faster and more attractive communication to the user. There are four types of visual interpretability approaches for ML models and the workflow processes of discovering them:

(1) Visualizing existing ML models to support their interpretation;
(2) Visualizing existing workflow processes of discovering ML models to support their interpretation;
(3) Discovering new interpretations of existing ML models and processes by using visual means;
(4) Discovering new interpretable ML models by using visual means.


The goal of (1) and (2) is better communication of existing models and processes, not discovering new ones using visual means. Visualization of salient points with a heatmap in a DNN is an example of (1). Other works exemplify (2): they visualize specific points within the workflow process (hyperparameter tuning, model selection, the relationships between a model's hyperparameters and performance), provide multigranular visualization, and monitor the process and adjust the search space in real time [16, 63]. In contrast, the AutoAIVis system [65] focuses on multilevel real-time visualization of the entire process, from data ingestion to model evaluation, using Conditional Parallel Coordinates [64]. Types (3) and (4) are potentially much more rewarding, but more challenging, while many current works focus on (1) and (2). This chapter focuses on types (2) and (4). The last one can produce interpretable models, avoiding a separate process of model interpretation.

3.2 Visual Versus Non-Visual Methods for Interpretability and Why Visual Thinking

Figure 3 illustrates the benefits of visual understanding over non-visual understanding. Analysis of images is a parallel process, but analysis of text (formulas and algorithms) is a sequential process, which can be insufficient. The Chinese and Indians knew a visual proof of the Pythagorean Theorem in 600 B.C., before it was known to the Greeks [35]. Figure 4 shows it. This picture was accompanied by a single word, "see", as a textual "explanation". To provide a complete analytical proof, the following inference can be added in modern terms: $(a + b)^2$ (area of the largest square) $-\ 2ab$ (area of the 4 blue triangles) $= a^2 + b^2 = c^2$ (area of the inner green square).

Fig. 3 Visual understanding versus non-visual understanding


Fig. 4 Ancient proof (explanation) of the Pythagorean Theorem. Actual explanation of the theorem presented in a visual form


Thus, we follow this tradition: moving from visualization of a solution to finding a solution visually with modern data science tools. More on historical visual knowledge discovery can be found in [24].

3.3 Visual Interpretation Pre-Dates Formal Interpretation

Figure 5 shows an example of visual model discovery in 2-D, for the data in the table on the left [33]. Here, a single fitted black line cannot discriminate these two

Fig. 5 "Crossing" classes that cannot be discriminated by a single straight line. The data table shown in the figure:

x:     1   1.1  2   2.2  2.8  3   3.5  4    4    4.5  5   5   5.5  6
y:     0.5 6    1.5 5    2.8  4   3.3  3.8  2.6  4.7  1.8 5   5.5  0.8
class: 1   2    1   2    1    2   1    1    2    1    2   1   1    2


Fig. 6 Multidimensional data with difficulty for visual pattern discovery

"crossing" classes. In addition, the visualization clearly shows that no single line can discriminate these classes. However, a common ML modeling practice (without visualizing the data) starts with the simplest model, which is a linear discrimination function (the black line in Fig. 5) to separate the blue and red points. It will fail. In contrast, visualization immediately gives an insight into the correct model class: the "crossing" of two linear functions, with one line going over the blue points and another one going over the red points. How can we reproduce such success in 2-D for n-D data, such as shown in Fig. 6, where we cannot see a visual pattern in the data with the naked eye? The next section presents methods for lossless and interpretable visualization of n-D data in 2-D.

4 Visual Discovery of ML Models

4.1 Lossy and Lossless Approaches to Visual Discovery in n-D Data

Visual discovery in n-D data needs to represent n-D data visually in a form that allows discovering undistorted n-D patterns in 2-D. Unfortunately, in high dimensions one cannot comprehensively see data. Lossless and interpretable visualization of n-D data in 2-D is required to preserve multidimensional properties for discovering undistorted ML models and their explanation. Often multidimensional data are visualized by lossy dimension reduction (e.g., Principal Component Analysis), where each n-D point is mapped to a single 2-D point, or by splitting n-D data into a set of low-dimensional data (pairwise correlation plots). While splitting is useful, it destroys the integrity of the n-D data and leads to a shallow understanding of complex n-D data.


(a) Conversion of n-D data to 2-D with loss of some n-D information and visual discovery of distorted n-D patterns in 2-D data.
(b) n-D data are converted to 2-D without loss of information and abilities for visual discovery of undistorted n-D patterns in 2-D.

Fig. 7 The difference between lossy and lossless approaches

An alternative for deeper understanding of n-D data is the visual representation of n-D data in low dimensions without splitting and loss of information, as graphs rather than 2-D points, e.g., Parallel and Radial coordinates, and the new General Line Coordinates. Figure 7 illustrates the difference between the approaches.

4.2 Theoretical Limitations

Below we summarize the theoretical limitations analyzed in detail in [27]. The source of information loss in the process of dimension reduction from n dimensions to k dimensions (k < n) is in the smaller neighborhoods in k-D when each n-D point is mapped to a k-D point. In particular, the 2-D/3-D visualization space (with k = 2 or k = 3) does not have enough neighbors to represent the n-D distances in 2-D. For instance, the 3-D binary cube has 2^3 nodes, but the 10-D hypercube has 2^10 nodes. Mapping 2^10 10-D points to 2^3 3-D points leads to the distortion of n-D distances, because the variability of distances between 3-D points is much smaller than between 10-D points. This leads to significant corruption of n-D distances in the 2-D visualization. The Johnson-Lindenstrauss lemma states these differences explicitly. It implies that only a small number of arbitrary n-D points can be mapped to k-D points of a smaller dimension k while preserving the n-D distances with relatively small deviations.

Johnson-Lindenstrauss Lemma. Given 0 < ε < 1, a set X of m points in R^n, and a number k > 8 ln(m)/ε², there is a linear map f: R^n → R^k such that for all u, v ∈ X:

$(1 - \varepsilon)\,\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \varepsilon)\,\|u - v\|^2$

In other words, this lemma sets up a relation between n, k and m when the distance can be preserved with some allowable error ε. A version of the lemma defines the possible dimensions k < n such that for any set of m points in R^n there is a mapping f: R^n → R^k with "similar" distances in R^n and R^k between the mapped points. This similarity is expressed in terms of the error 0 < ε < 1.


For ε = 1, the distances in R^k are less than or equal to $\sqrt{2}\,S$, where S is the distance in R^n. This means that the distance in R^k will be in the interval (0, 1.42S). In other words, the distance will not be more than 142% of the original distance, i.e., it will not be much exaggerated. However, it can dramatically diminish to 0. The lemma and this theorem allow deriving three formulas to estimate the number of dimensions (sufficient and insufficient) to support the given distance errors. These formulas show that to keep distance errors within about 30%, for just 10 arbitrary high-dimensional points, the number of dimensions k needs to be over 1900, and over 4500 dimensions for 300 arbitrary points. The point-to-point visualization methods do not meet these requirements for arbitrary datasets. Thus, this lemma sets up the theoretical limits on preserving n-D distances in 2-D. For details, see [27].
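For orientation, the Python sketch below evaluates the dimension bound exactly as stated in the lemma above, k > 8 ln(m)/ε². Note that the three finer formulas from [27] quoted in the text are not reproduced here, so the numbers printed by this sketch follow only the lemma's simple bound and should not be expected to match the figures cited above.

```python
import math

def jl_min_dim(m, eps):
    """Smallest integer k satisfying k > 8*ln(m)/eps^2 from the lemma's statement."""
    return math.floor(8.0 * math.log(m) / eps ** 2) + 1

for m in (10, 300):
    for eps in (0.1, 0.3, 0.5):
        k = jl_min_dim(m, eps)
        print(f"m={m:4d}, eps={eps}: at least {k} dimensions by the lemma's bound")
```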

4.3 Examples of Lossy Versus Lossless Approaches for Visual Model Discovery

4.3.1 GLC-L Algorithms for Lossless Visual Model Discovery

The GLC-L algorithm [26] allows lossless visualization of n-D data and discovery of a classification model. It is illustrated first for a lossless visualization of the 4-D point x = (x1, x2, x3, x4) = (1, 0.8, 1.2, 1) in Fig. 8. The algorithm for this figure consists of the following steps:

• Set up 4 coordinate lines at different angles Q1–Q4
• Locate the values x1–x4 of the 4-D point x as blue lines (vectors) on the respective coordinate lines
• Shift and stack the blue lines


Fig. 8 GLC-L Algorithms for lossless visual model discovery


Fig. 9 WBC classification model (444 benign (blue) cases, 239 malignant (red) cases). Angles in green boxes are most informative

• Project the last point onto the U line
• Do the same for the other 4-D points of the blue class
• Do the same for the 4-D points of the red class
• Optimize the angles Q1–Q4 to separate the classes (yellow line).
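The following Python sketch illustrates the geometry behind these GLC-L steps: each value x_i becomes a vector of length x_i at angle Q_i, the vectors are stacked, and the projection of the final point onto the horizontal line U equals the sum of x_i·cos(Q_i), so optimizing the angles amounts to learning an interpretable linear discriminant. The angles and the decision threshold below are invented for illustration.

```python
import numpy as np

def glc_l_graph(x, angles_deg):
    """Stack the vectors (x_i at angle Q_i) and return the polyline nodes in 2-D."""
    nodes = [(0.0, 0.0)]
    for xi, q in zip(x, np.radians(angles_deg)):
        px, py = nodes[-1]
        nodes.append((px + xi * np.cos(q), py + xi * np.sin(q)))
    return nodes

def glc_l_projection(x, angles_deg):
    """Projection of the last node onto the U line: a linear combination of x."""
    return sum(xi * np.cos(q) for xi, q in zip(x, np.radians(angles_deg)))

angles = [60, 30, 45, 80]            # hypothetical angles Q1-Q4
x = [1.0, 0.8, 1.2, 1.0]             # the 4-D point from Fig. 8

print(glc_l_graph(x, angles)[-1])    # final stacked point in 2-D
print(glc_l_projection(x, angles))   # its projection onto U

# Classification: a case is assigned to class 1 if its U-projection exceeds a
# threshold (the yellow line); optimizing the angles = optimizing cos(Q_i) weights.
threshold = 2.0                      # hypothetical decision threshold on U
print("class 1" if glc_l_projection(x, angles) > threshold else "class 2")
```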

The applicability of the GLC-L algorithm to important tasks is illustrated in Fig. 9 for the 9-D breast cancer diagnostics task, using the Wisconsin Breast Cancer (WBC) data from the UCI ML repository. It allowed explanation of patterns and visual understanding of them, with lossless reversible/restorable visualization. It reached high accuracy, with only one malignant (red) case on the wrong side. The resulting linear classification model can be converted to a set of interpretable logical rules, as was described in Sect. 1.2, by building a respective step function.

Also, this example shows a fundamentally new opportunity for splitting data into training and validation sets. This is a critical step of any ML model justification: checking the accuracy of the model. Traditionally in ML, we split data randomly into training and validation sets, compute accuracy on each of them, and if they are similar and large enough for the given task, we accept the model as a predictive tool. While this is a common approach, which has been used for decades, it is not free from several deficiencies [33]. First, we cannot explore all possible splits of a given set of n-D data points, because the number of splits grows exponentially with the size of the dataset. As a result, we use a randomly selected small fraction of all splits,


such as tenfold Cross-Validation (CV). Not only is this a small fraction of all splits, but the selected splits overlap, as in tenfold CV; respectively, the accuracy estimates are not independent. As a result, we can get a biased estimate of the model accuracy and make a wrong decision to accept or reject the model as a predictive tool. Why do we use such random splits, despite these deficiencies? The reason is that we cannot see multidimensional data with the naked eye, and we are forced to use a random split of the data into training and validation datasets. This is clear from the example in Fig. 3 above, which represents 2-D data. If all training cases were selected as cases below the red line, and all validation cases were cases above that line, we would get a biased accuracy. Visually we can immediately see this bias in 2-D, for 2-D data. In contrast, we cannot see it in n-D space, for n-D data.

4.3.2 Avoiding Occlusion with Deep Learning

This example uses the same WBC data as above. In Fig. 9, the polylines that represent different cases occlude each other, making it challenging to discover visual patterns with the naked eye. In [10] the occlusion is avoided by using a combination of the GLC-L algorithm, described above, and a Convolutional Neural Network (CNN) algorithm. The first step is converting the non-image WBC data to images by GLC-L, and the second one is discovering a classification model on these images with the CNN. Each image represents a single WBC data case as a single polyline (graph), completely avoiding the occlusion. This resulted in 97.22% accuracy on tenfold cross-validation [10]. If the images of the n-D points are not compressed, then this combination of GLC-L and CNN is lossless.

5 General Line Coordinates (GLC)

5.1 General Line Coordinates to Convert n-D Points to Graphs

General Line Coordinates (GLC) [27] break a 400-year-old tradition of using orthogonal Cartesian coordinates, which fit well to modeling the 3-D physical world, but are limited for the lossless visual representation of the diverse and abstract high-dimensional data which we deal with in ML. GLC relax the requirement of orthogonality. In GLC, the points on the coordinates form graphs, where the coordinates can overlap, collocate, be connected or disconnected, be straight or curvy, and go in any direction. Figures 10–12 show examples of several 2-D GLC types, and Fig. 13 shows different ways in which GLC graphs can be formed. The case studies in the next section show the benefits of GLC for ML. Table 1 outlines 3-D GLC types. Several GLCs are described in more detail in the next section in case studies. For a full description of


Fig. 10 Examples of different GLCs: Parallel, Non-parallel, Curved, In-line Coordinates, Triangular, Radial and Pentagon Coordinates

GLCs, see [27]. GLCs are interpretable because they use original attributes from the domain and do not construct artificial attributes that are foreign to the domain, as is done in methods such as PCA and t-SNE.

Traditional Radial Coordinates (Radial Stars) locate n coordinates radially and put n nodes on the respective coordinates Xi to represent each n-D point. Then these points are connected to form a "star". The Paired Radial Coordinates (CPC-Stars) use half of the nodes to get a reversible/lossless representation of an n-D point. This is done by creating n/2 radial coordinate axes and collocating coordinate X2 with X3, X4 with X5, and finally Xn with X1. Each pair of values of coordinates (x_j, x_{j+1}) of an n-D point x is displayed in its own pair of non-orthogonal radial coordinates (Xj, Xj+1) as a 2-D point; then these points are connected to form a directed graph. Figure 12 shows data in these coordinates on the first row for 6-D data and for 192-D data. These coordinates have important advantages over traditional star coordinates for shape perception. Figure 12 illustrates this by showing the same 192-D data in both.

Several mathematical statements have been established for GLC [27] that cover different aspects of the GLC theory and the pattern simplification methodology. These statements are listed below. Figure 14 illustrates some of these statements for the Shifted Paired Coordinates (SPC).

Statement 1. Parallel Coordinates, CPC and SPC preserve Lp distances for p = 1 and p = 2, D(x, y) = D*(x*, y*).


6-D point (5, 4, 0, 6, 4, 10) in Paired Coordinates (Collocated Paired Coordinates and Shifted Paired Coordinates).

State vector x = (x, y, x', y', x'', y'') = (0.2, 0.4, 0.1, 0.6, 0.4, 0.8) in Collocated Paired and Parallel Coordinates.

Fig. 11 Examples of the different GLCs: Collocated, Shifted Paired Coordinates, and Parallel Coordinates

Statement 2 (n points lossless representation). If all coordinates Xi do not overlap, then the GLC-PC algorithm provides a bijective 1:1 mapping of any n-D point x to the 2-D directed graph x*.

Statement 3 (n points lossless representation). If all coordinates Xi do not overlap, then the GLC-PC and GLC-SC1 algorithms provide a bijective 1:1 mapping of any n-D point x to the 2-D directed graph x*.

Statement 4 (n/2 points lossless representation). If coordinates Xi and Xi+1 are not collinear in each pair (Xi, Xi+1), then the GLC-CC1 algorithm provides a bijective 1:1 mapping of any n-D point x to the 2-D directed graph x* with n/2 nodes and n/2 − 1 edges.

Statement 5 (n/2 points lossless representation). If coordinates Xi and Xi+1 are not collinear in each pair (Xi, Xi+1), then the GLC-CC2 algorithm provides a bijective 1:1 mapping of any n-D point x to the 2-D directed graph x* with n/2 nodes and n/2 − 1 edges.


A 6-D point as a closed contour in 2-D, where the 6-D point x = (1, 1, 2, 2, 1, 1) forms a triangle from the edges of the graph in Paired Radial Coordinates with non-orthogonal Cartesian mapping.

(a) (b) (c) n-D points as closed contours in 2-D: (a) 16-D point (1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2) in Partially Collocated Radial Coordinates with Cartesian encoding, (b) CPC star of a 192-D point in Polar encoding, (c) the same 192-D point as a traditional star in Polar encoding.

6-D point (1, 1, 1, 1, 1, 1) in two X1–X6 coordinate systems (left: in Radial Collocated Coordinates, right: in Cartesian Collocated Coordinates).

4-D point P = (0.3, 0.5, 0.5, 0.2) in 4-D Elliptic Paired Coordinates, EPC-H, as a green arrow. Red marks separate coordinates in the coordinate ellipse.

Fig. 12 Examples of different GLCs: Radial, Paired Collocated Radial, Cartesian Collocated and Elliptic Paired Coordinates

Six coordinates and six vectors that represent a 6-D data point (0.75, 0.5, 0.7, 0.6, 0.7, 0.3); the same 6-D data point shown in GLC-PC, GLC-CC1, GLC-CC2, GLC-SC1 and GLC-SC2.

Fig. 13 Different ways to construct graphs of General Line Coordinates

Statement 6 (n points lossless representation). If all coordinates Xi do not overlap, then the GLC-SC2 algorithm provides a bijective 1:1 mapping of any n-D point x to the 2-D directed graph x*.

Statement 7. GLC-CC1 preserves Lp distances for p = 1, D(x, y) = D*(x*, y*).


Table 1 General Line Coordinates (GLC): 3-D visualization

Type | Characteristics
3-D General Line Coordinates (GLC) | Drawing n coordinate axes in 3-D in a variety of ways: curved, parallel, unparalleled, collocated, disconnected, etc.
Collocated Tripled Coordinates (CTC) | Splitting n coordinates into triples, representing each triple as a 3-D point in the same three axes, and linking these points to form a directed graph. If n mod 3 is not 0, then the last coordinate Xn is repeated one or two times to make it 0
Basic Shifted Tripled Coordinates (STC) | Drawing each next triple in a shifted coordinate system by adding (1,1,1) to the second triple, (2,2,2) to the third triple, (i-1, i-1, i-1) to the i-th triple, and so on. More generally, shifts can be a function of some parameters
Anchored Tripled Coordinates (ATC) in 3-D | Drawing each next triple in a shifted coordinate system, i.e., coordinates shifted to the location of a given triple (anchor), e.g., the first triple of a given n-D point; triples are shown relative to the anchor, easing the comparison with it
3-D Partially Collocated Coordinates (PCC) | Drawing some coordinate axes in 3-D collocated and some coordinates not collocated
3-D in-Line Coordinates (ILC) | Drawing all coordinate axes in 3-D located one after another on a single straight line
In-plane Coordinates (IPC) | Drawing all coordinate axes in 3-D located on a single plane (2-D GLC embedded into 3-D)
Spherical and polyhedron coordinates | Drawing all coordinate axes in 3-D located on a sphere or a polyhedron
Ellipsoidal coordinates | Drawing all coordinate axes in 3-D located on ellipsoids
GLC for linear function (GLC-L) | Drawing all coordinates in 3-D dynamically, depending on the coefficients of the linear function and the values of the n attributes
Paired Crown Coordinates (PWC) | Drawing odd coordinates collocated on a closed convex hull in 3-D and even coordinates orthogonal to them as a function of the odd coordinate value

Statement 8. In the coordinate system X1, X2, …, Xn constructed by the Single Point algorithm with the given base n-D point x = (x1, x2, …, xn) and the anchor 2-D point A, the n-D point x is mapped one-to-one to the single 2-D point A by the GLC-CC algorithm.

Statement 9 (locality statement). All graphs that represent nodes N of the n-D hypercube H are within the square S.

6-D points (3,3,2,6,2,4) and (2,4,1,7,3,5) in the X1–X6 coordinate system built using the point (2,4,1,7,3,5) as an anchor.

Data in Parameterized Shifted Paired Coordinates. Blue dots are corners of the square S that contains all graphs of all n-D points of hypercube H for the 6-D base point (2,4,1,7,3,5) with distance 1 from this base point.

Fig. 14 Lossless visual representation of 6-D hypercube in Shifted Paired Coordinates

4-D data: representation of prevalence of undernourished in the population (%) in Collocated Paired Coordinates

4-D data: representation of prevalence of undernourished in the population (%) in traditional time series (equivalent to Parallel Coordinates for time series)

Fig. 15 Visualization of the Global Hunger Index (GHI) in Collocated Paired Coordinates (CPC) versus traditional time series visualization

5.2 Case Studies

5.2.1 World Hunger Data

To represent n-D data in Collocated Paired Coordinates (CPC), we split an n-D point x into pairs of its coordinates (x1, x2), …, (x_{n-1}, x_n); draw each pair as a 2-D point in the collocated axes; and link these points to form a directed graph. For odd n, the coordinate x_n is repeated to make n even. Figure 15 shows the advantages of visualizing the Global Hunger Index (GHI) for several countries in CPC over a traditional time series visualization [17, 26]. The CPC visualization is simpler, without occlusion and overlap of the lines.
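A minimal Python sketch of this CPC construction is given below: it turns an n-D point into the sequence of 2-D nodes of its directed graph, duplicating the last coordinate when n is odd. The example point is the 6-D point used in Fig. 11.

```python
def cpc_graph(point):
    """Collocated Paired Coordinates: n-D point -> ordered 2-D nodes of its graph."""
    values = list(point)
    if len(values) % 2 == 1:          # for odd n, repeat the last coordinate
        values.append(values[-1])
    nodes = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    edges = list(zip(nodes[:-1], nodes[1:]))   # directed edges linking consecutive pairs
    return nodes, edges

nodes, edges = cpc_graph((5, 4, 0, 6, 4, 10))
print(nodes)   # [(5, 4), (0, 6), (4, 10)]
print(edges)   # [((5, 4), (0, 6)), ((0, 6), (4, 10))]
```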

5.2.2 Machine Learning for Investment Strategy with CPC

The goal of this study is learning a trading investment strategy to predict long and short positions [66]. It is done in 4-D and 6-D spaces, which are represented in Collocated


Paired Coordinates (CPC) and Collocated Tripled Coordinates (CTC), respectively. In CPC, each 4-D point is an arrow in 2-D space (see the previous section), and each 6-D point is an arrow in 3-D CTC space. Each 2-D arrow represents two pairs (Vr, Yr) of values (volume Vr and relative main outcome variable Yr) at two consecutive moments. In contrast with a traditional time series, CPC has no time axis. The arrow direction shows time from i to i + 1. The arrow beginning is the point (Vr_i, Yr_i), and its head is the next time point (Vr_{i+1}, Yr_{i+1}) in the collocated space. CPC gives the inspiration for building a trading strategy, in contrast with the time series figure, which does not. It allows finding the areas with clusters of two kinds of arrows. In Fig. 16, the arrows for the long positions are green. The arrows for the short positions are red. Along the Yr axis we can observe the type of change in Y in the current candle: if Yr_{i+1} > Yr_i, then Y_{i+1} > Y_i and the right decision at the i-th point is opening a long position. Otherwise, it is a short position.

Fig. 16 4-D and 6-D trading data in 2-D and 3-D CPC with the maximum asymmetry between long (green) and short (red) positions [66]


Fig. 17 GLC-L algorithm for recognition of digits with dimension reduction

Next, CPC shows the effectiveness of a decision in the positions. Nearly horizontal arrows indicate small profit; more vertical arrows indicate larger profit. In comparison with a traditional time series, CPC brings additional knowledge about the potential profit in the selected area of parameters in the (Vr, Yr) space. The core of the learning process is searching for squares and cubes in the 2-D and 3-D CPC spaces with a prevailing number of long positions (green arrows); see Fig. 16 and the sketch below. It is shown in [66] that this leads to a beneficial trading strategy in simulated trading.
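The core search step can be sketched as a simple grid count over the (Vr, Yr) plane. This is a hedged illustration, not the exact procedure of [66]; the cell size and dominance threshold below are assumptions.

```python
from collections import defaultdict

def dominant_long_cells(arrows, cell_size=0.1, dominance=0.8):
    """Find 2-D CPC cells where long positions (green arrows) prevail.

    `arrows` is a list of (vr, yr, is_long) tuples giving the arrow start
    point in (Vr, Yr) space and whether the correct decision was a long position.
    """
    counts = defaultdict(lambda: [0, 0])          # cell -> [long, short]
    for vr, yr, is_long in arrows:
        cell = (int(vr // cell_size), int(yr // cell_size))
        counts[cell][0 if is_long else 1] += 1
    result = []
    for cell, (longs, shorts) in counts.items():
        total = longs + shorts
        if total > 0 and longs / total >= dominance:
            result.append((cell, longs, shorts))
    return result

# Toy usage: two long arrows and one short arrow starting in different cells.
arrows = [(0.12, 0.34, True), (0.13, 0.36, True), (0.41, 0.09, False)]
print(dominant_long_cells(arrows))
```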

5.2.3 Recognition of Digits with Dimension Reduction

Figure 17 shows the results of the GLC-L algorithm (see Sect. 3.3) on MNIST handwritten digits [25]. Each image contains 22 × 22 = 484 pixels after cropping the edges from the original 784 pixels. The GLC-L algorithm allowed reducing the dimensionality from 484-D to 249-D with a minimal decrease of accuracy, from 95.17% to 94.83%.

5.2.4 Cancer Case Study with Shifted Paired Coordinates

This case study deals with the same 9-D WBC data, using the FSP algorithm [29] and Shifted Paired Coordinates (SPC) [27] for a graph representation of n-D points. The idea of SPC is presented in Fig. 18. The SPC visualization of n-D data requires splitting the n coordinates X1–Xn into n/2 non-overlapping pairs (Xi, Xj), such as (X1, X2), (X3, X4), (X5, X6), …, (Xn-1, Xn). In SPC, each pair (Xi, Xj) is represented as a separate orthogonal Cartesian coordinate system (X, Y), where Xi serves as X and Xj as Y. Each coordinate pair (Xi, Xj) is shifted relative to the other pairs to avoid their overlap, which creates n/2 scatter plots. Next, for each n-D point x = (x1, x2, …, xn), the point (x1, x2) in (X1, X2) is connected to the point (x3, x4) in (X3, X4), and so on, until the point (xn-3, xn-2) in (Xn-3, Xn-2) is connected to the point (xn-1, xn) in (Xn-1, Xn) to form a directed graph x*. Figure 18 shows the same 6-D point visualized in SPC in two different ways, due to different pairings of the coordinates.

Fig. 18 6-D point a = (3,2,1,4,2,6) in Shifted Paired Coordinates: point a in (X1,X2), (X3,X4), (X5,X6) as the sequence of pairs (3,2), (1,4), (2,6), and point a in (X2,X1), (X3,X6), (X5,X4) as the sequence of pairs (2,3), (1,6), (2,4)

The FSP algorithm has three major steps: filtering out the less efficient visualizations from the multiple SPC visualizations, searching for sequences of paired coordinates that are more efficient for classification model discovery, and presenting the model discovered with the best SPC sequence to the analyst [29]. The results of FSP applied to SPC graphs of WBC data are shown in Figs. 19–20. Figure 19 shows the motivation for filtering and searching in FSP: it presents WBC data in SPC, where graphs occlude each other, making it difficult to discover the pattern visually. Figure 20 shows the results of automatic filtering and searching by the FSP algorithm. It displays only the cases that are located outside of a small violet rectangle at the bottom middle and go inside two larger rectangles on the left. These cases are dominantly cases of the blue class. Together these properties provide a rule:

Fig. 19 Benign and malignant WBC data visualized in SPC as 2-D graphs of 10-D points


Fig. 20 SPC visualization of WBC data with areas dominated by the blue class

If (x8, x9) ∈ R1 & (x6, x7) ∉ R2 & (x6, x7) ∉ R3 then x ∈ class Red, else x ∈ class Blue, where R1, R2 and R3 are the three rectangles described above. This rule has an accuracy of 93.60% on all WBC data [29]. This fully interpretable rule is visual and intelligible to domain experts, because it uses only original domain features and relations. This case study shows the benefits of combining analytical and visual means for producing interpretable ML models. The analytical FSP algorithm works on the multiple visual lossless SPC representations of n-D data to find the interpretable patterns. While occlusion blocks discovering these properties by visual means, the analytical FSP algorithm discovers them in SPC, simplifying the pattern discovery, providing explainable visual rules, and decreasing the cognitive load.
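The rectangle-based rule above translates directly into code. The sketch below is only an illustration: the concrete rectangle bounds are placeholders, since the actual R1–R3 found by FSP in [29] are not listed in the text.

```python
def in_rect(point, rect):
    """True if a 2-D point lies inside an axis-aligned rectangle (xmin, xmax, ymin, ymax)."""
    x, y = point
    xmin, xmax, ymin, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

# Placeholder rectangles; the real R1, R2, R3 come from the FSP search in [29].
R1 = (1.0, 10.0, 1.0, 10.0)
R2 = (1.0, 2.5, 1.0, 2.5)
R3 = (1.0, 4.0, 2.5, 6.0)

def classify(x):
    """Apply the interpretable rule from the text to a 9-D WBC case x (attributes 1-indexed in the rule)."""
    pair_89 = (x[7], x[8])   # (x8, x9)
    pair_67 = (x[5], x[6])   # (x6, x7)
    if in_rect(pair_89, R1) and not in_rect(pair_67, R2) and not in_rect(pair_67, R3):
        return "Red"
    return "Blue"

print(classify((5, 1, 1, 1, 2, 1, 3, 1, 1)))
```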

5.2.5 Lossless Visualization via CPC-Stars, Radial Stars and Parallel Coordinates, and Human Abilities to Discover Patterns in High-D Data

The design of CPC-Stars versus traditional Radial Stars was described in Sect. 4.1. Several successful experiments have been conducted to evaluate human experts' abilities to discover visual n-D patterns [17, 26, 28, 31]. Figure 21 shows lossless visualizations of 48-D and 96-D data in CPC-Stars, Radial Stars and Parallel Coordinates. While all of them are lossless, fully preserving the high-dimensional data, the abilities of humans to discover the visual patterns shown on the right in Fig. 21 are higher when using CPC-Stars than when using the two other representations. This is a result of using only n/2 nodes versus n nodes in the alternative representations. Similar advantages have been demonstrated for 160-D, 170-D and 192-D data. Figure 22 shows the musk 170-D data from the UCI ML repository. The examples and case studies demonstrate that ML methods with General Line Coordinates allow: (1) visualizing data of multiple dimensions, from 4-D to 484-D, without loss of information, and (2) discovering interpretable patterns by combining human perceptual capabilities and Machine Learning algorithms for classification of such high-dimensional data. Such a hybrid technique can be developed further in multiple ways to deal with new and challenging ML and data science tasks.

Fig. 21 48-D and 96-D points in CPC-Stars, Radial Stars and Parallel Coordinates: two stars with identical shape fragments on intervals [a,b] and [d,c] of coordinates; samples of some class features on Stars and on PCs for n = 48; examples of corresponding figures, stars (row 1) and PC lines (row 2), for five 48-D points from two tubes with m = 5%, with rows 3 and 4 the same for dimension n = 96; visual patterns are combinations of attributes

Fig. 22 Nine 170-dimensional points of two classes in Parallel Coordinates (row 1), in star coordinates (row 2 class "musk", row 3 class "non-musk chemicals"), and in CPC stars (row 4 class "musk" and row 5 class "non-musk chemicals")

6 Visual Methods for Traditional Machine Learning

6.1 Visualizing Association Rules: Matrix and Parallel Sets Visualization for Association Rules

In [69] association rules are visualized. A general form of an association rule (AR) is A ⇒ B, where A and B are statements. Commonly A consists of several other statements, A = P1 & P2 & … & Pk, e.g., if customers buy both tomatoes (T) and cucumbers (Cu), they likely buy carrots (Ca). Here A = T & Cu and B = Ca. The quality of an AR is measured by its support and confidence, which express, respectively, the frequency of the itemset A in the dataset and the portion of transactions with both A and B relative to the frequency of A. ARs are interpretable, being a class of propositional rules expressed in the original domain terms. The typical questions regarding ARs are as follows: What are the rules with the highest support/confidence? What are the outliers of a rule and their cause? Why is the rule confidence low? What is the cause of the rule? Which rules are non-interesting? Visualization allows answering some of these questions directly. Figure 23a shows a structure-based visualization of association rules with a matrix and heatmap [69]. Here the left-hand sides (LHS) of the rules are on the right and the right-hand sides (RHS) of the rules are on the top. Respectively, each row shows a possible LHS itemset of a rule, and each column shows a possible RHS itemset. The violet cells indicate discovered rules A ⇒ B with the respective LHS and RHS. A darker color of the rule cell shows greater rule confidence; similarly, a darker LHS shows larger rule support. A similar visualization is implemented in the Sklearn package, where each cell is associated with colored circles of different sizes to express the quality of the rule. The major challenges here are scalability and readability for a large number of LHS and RHS [69]. A bird's-eye view solution was implemented in the Sklearn package, where each rule is a colored point in a 2-D plot with support and confidence as coordinates, which allowed showing over 5000 rules. Figure 23b shows the input-based model visualization for a set of ARs. It uses Parallel Sets. Parallel Sets display dimensions as adjacent parallel axes and their values (categories) as segments over the axes. Connections between categories in the parallel axes form ribbons. The segments are like points, and the ribbons are like lines in Parallel Coordinates [18]. The ribbon crossings cause clutter that can be minimized by reordering coordinates and other methods [69]. Both model visualizations shown in Fig. 23 are valid for other rule-based ML models too.
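As a small illustration of the support and confidence measures used above, the sketch below computes them for the tomato/cucumber/carrot example over a toy transaction list; the transactions themselves are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support of LHS union RHS divided by support of LHS."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"tomato", "cucumber", "carrot"},
    {"tomato", "cucumber"},
    {"tomato", "carrot", "onion"},
    {"cucumber", "carrot"},
]
# Rule A => B with A = {tomato, cucumber}, B = {carrot}
print(support(transactions, {"tomato", "cucumber"}))                 # 0.5
print(confidence(transactions, {"tomato", "cucumber"}, {"carrot"}))  # 0.5
```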


Fig. 23 Visualizations of association rules [69] and Sklearn: (a) matrix visualization with LHS rows and RHS columns, (b) Parallel Sets visualization, and (c) a further rule visualization

6.2 Dataflow Tracing in ML Models: Decision Trees

Graph Visualizer, TensorBoard, TensorFlow's dashboard, Olah's interactive essays, ConvNetJS, TensorFlow Playground, and Keras are current tools for DNN dataflow visualization [67]. They allow observing scalar values, distributions of tensors, images, audio and more, to optimize and understand models by visualizing the model structure at different levels of detail. While all these tools are very useful, the major issue is that dataflow visualization itself does not explain or optimize the DNN model; an experienced data scientist should guide the dataflow visualization for this. In contrast, the dataflow for explainable models can bring the explanation itself, as we show below for Decision Trees (DTs). Tracing the movement of a given n-D point in the DT shows all the interpretable decisions made to classify this point. For instance, consider the result of tracing the 4-D point x = (7,2,4,1) in a DT through a sequence of nodes for attributes x3, x2, x4, x1 with the following thresholds: x3 < 5, x2 > 0, x4 < 5, x1 > 6, to a terminal node of class 1. The point x satisfies all these directly interpretable inequalities.
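The tracing idea can be sketched with a hand-coded tree that reproduces the 4-D example above (attributes x3, x2, x4, x1 with thresholds 5, 0, 5, 6). The tree structure here is illustrative, not a tree learned from real data.

```python
# Each internal node: (attribute index, threshold, relation, subtree if true, subtree if false).
# Leaves are class labels. The path below mirrors the example x3 < 5, x2 > 0, x4 < 5, x1 > 6.
tree = (3, 5, "<",
        (2, 0, ">",
         (4, 5, "<",
          (1, 6, ">", "class 1", "class 0"),
          "class 0"),
         "class 0"),
        "class 0")

def trace(node, x):
    """Follow an n-D point through the tree, printing each interpretable decision."""
    while isinstance(node, tuple):
        attr, thr, rel, if_true, if_false = node
        value = x[attr - 1]                      # attributes are 1-indexed in the text
        passed = value < thr if rel == "<" else value > thr
        print(f"x{attr} = {value} {rel} {thr}: {passed}")
        node = if_true if passed else if_false
    print("terminal node:", node)

trace(tree, (7, 2, 4, 1))   # x = (7,2,4,1) reaches class 1
```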


Fig. 24 DT dataflow tracing visualizations for WBC data [33]: (a) traditional visualization of the WBC data decision tree, where green edges and nodes indicate the benign class and red edges and nodes indicate the malignant class; (b) DT with edges as Folded Coordinates in disproportional scales, where the curved lines are cases that reach the DT malignant edge with different certainties due to their different distances from the threshold node

Figure 24a shows a traditional DT visualization for the 9-D Wisconsin Breast Cancer (WBC) data from the UCI Machine Learning repository. It clearly presents the structure of the DT model, but without explicitly tracing individual cases; the trace is added as a dotted polyline in this figure. Figure 24b shows two 5-D points a = (2.8, 5, 2.5, 5.5, 6.5) and b = (5, 8, 3, 4, 6). Both points reach the terminal malignant edge of the DT, but with different certainty. The first point reaches it with a lower certainty, having its values closer to the thresholds on the uc and bn coordinates. In this visualization, called the Folded Coordinate Decision Tree (FC-DT) visualization [32], the edges of the DT not only connect decision nodes, but also serve as Folded Coordinates in disproportional scales for the WBC data. Here, each coordinate is folded at the node threshold point with different lengths of the sides. For instance, with threshold T = 2.5 on the coordinate uc with values between 1 and 10, the left interval is [1, 2.5] and the right interval is [2.5, 10]. In Fig. 24b, these two unequal intervals are visualized with equal lengths, i.e., forming a disproportional scale.

6.3 iForest: Interpreting Random Forests via Visual Analytics

The goal of the iForest system [74] is to assist a user in understanding how random forests make predictions and to observe prediction quality. It is an attempt to open the inner workings of random forests. To be a meaningful explanation for the end user, it should not use terms that are foreign to the domain where the data came from. Otherwise, it is an explanation for another user, the data scientist/ML model designer. The actual usability testing was conducted with this category of users (students and research scientists). iForest uses t-SNE to project data onto a 2-D plane (Fig. 25) for data overview and analysis of the similarity of decision paths.

Fig. 25 iForest to interpret random forests for Titanic data [74]

The t-SNE challenge is illustrated in Fig. 25 on the left, where the yellow section a1 is identified as an outlier from the classification viewpoint (low confidence to belong to the same class as its t-SNE neighbors), although these cases are not outliers in t-SNE. Similarly, in Fig. 25 on the right, multiple cases from different classes and of different confidence are t-SNE neighbors. The reasons for this are that (1) t-SNE, as an unsupervised clustering method, can produce clusters that differ from the given classes, (2) t-SNE dense 2-D areas may not be dense areas in n-D [76], and (3) t-SNE is a point-to-point mapping of n-D points to 2-D points with a loss of n-D information. This issue was discussed in depth in prior sections. The major advantage of using t-SNE and other point-to-point mappings of n-D data to 2-D is that they suffer much less from occlusion than the point-to-graph mappings (General Line Coordinates) discussed in the prior sections. In summary, the general framework of iForest is beneficial for ML model explanation and can be enhanced with point-to-graph methods that preserve n-D information in 2-D.


6.4 TreeExplainer for Tree-Based Models

TreeExplainer is a framework to explain random forests, decision trees, and gradient boosted trees [40]. Its polynomial-time algorithm computes explanations based on game theory and is part of the SHAP (SHapley Additive exPlanations) framework. To produce an explanation, the effects of local feature interactions are measured, and understanding of the global model structure is based on combining the local explanations of each prediction. Figure 26 illustrates its "white box" local explanation. As we see, it deciphers a mortality rate of 4 as the sum 2.5 + 0.5 + 3 − 2 over four named features. In contrast, the black-box model produces only the mortality rate 4 without telling how it was obtained. The question is whether we can call the sum 2.5 + 0.5 + 3 − 2 a "white box" explanation, or whether it is rather a black-box explanation. The situation here is similar to the quasi-explanations of Sect. 2.5. If a user did not get any other information beyond the numbers 2.5, 0.5, 3 and −2 for the 4 attributes, then it is rather a black-box explanation, not a white-box explanation. In other words, the user has got a black-box model prediction of 4 and a black-box explanation of 4, without answers to questions like: why should the numbers 2.5, 0.5, 3 and −2 be accepted, what is their meaning in the domain, and why does their summation make sense in the domain?
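For reference, a hedged sketch of what such an additive explanation looks like in code is given below, assuming the open-source shap package and a scikit-learn tree ensemble; the data are random placeholders, and the only point being illustrated is that the base value plus the attributions sums to the model's prediction, which is exactly the "white box" claim questioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import shap  # assumed available (pip install shap)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # 4 unnamed placeholder features
y = 2.5 * X[:, 0] + 0.5 * X[:, 1] - X[:, 3] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X[:1])                  # per-feature attributions for one case

# Additivity check: base value + sum of attributions equals the prediction.
print(phi[0])
print(explainer.expected_value + phi[0].sum(), model.predict(X[:1])[0])
```

The numbers sum correctly, but, as argued above, by themselves they carry no domain meaning for the end user.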

Fig. 26 TreeExplainer "white box" local explanation [40]

7 Traditional Visual Methods for Model Understanding: PCA, t-SNE and Related Point-to-Point Methods

Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE) [76] and related methods are popular for getting an intuitive understanding of data, ML models and their relations. These point-to-point projection methods convert n-D points to 2-D or 3-D points for visualization. PCA can distort local neighborhoods [12], while t-SNE attempts to preserve local data neighborhoods, often at the expense of distorting the global structure [12]. Below we summarize and analyze the challenges and warnings about t-SNE highlighted in [12, 76], including warnings from the t-SNE author.

One of them is that t-SNE may not help to find outliers or assign meaning to point densities in clusters. Thus, outliers and dense areas visible in t-SNE may not be such in the original n-D space. Despite this warning, we can see statements that users can easily identify outliers in t-SNE [6] and see similarities [74]. This is valid only after showing that the n-D metrics are not distorted in 2-D for the given data. In general, for arbitrary data, any point-to-point dimension reduction method distorts n-D metrics in a lower dimension k, as shown by the Johnson-Lindenstrauss lemma presented above. Figure 27 shows PCA and t-SNE 2-D and 3-D visualizations of 81-D breast lesion ultrasound data [19]. These visualizations differ significantly, creating very different opportunities to interpret data and models. In this example, each 81-D point is compressed about 40 times to get a 2-D point and about 27 times to get a 3-D point, with significant loss of 81-D information. Each of these four visualizations captures different properties of the original 81-D data and loses other properties.
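Projections like those in Fig. 27 are typically produced along the following lines; this is a hedged scikit-learn sketch with random placeholder data standing in for the 81-D ultrasound features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 81))          # placeholder for the 81-D lesion features

pca_2d = PCA(n_components=2).fit_transform(X)    # 81-D -> 2-D, linear projection
pca_3d = PCA(n_components=3).fit_transform(X)    # 81-D -> 3-D, linear projection
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)   # nonlinear, lossy
print(pca_2d.shape, pca_3d.shape, tsne_2d.shape)
```

In both cases the produced axes are "summary" directions with no direct domain meaning, which is exactly the interpretability gap discussed next.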

Fig. 27 PCA and t-SNE visualizations of 81-D breast lesion ultrasound data. Green is benign lesions, red is malignant, and yellow is benign-cystic. Here a 2-D and b 3-D PCA, and c 2-D and d 3-D t-SNE [19]


Moreover, the properties presented in these visualizations are artificial "summary" properties, which differ from the original interpretable attributes of the 81-D points. In fact, the PCA principal components have no direct interpretation in the domain terms for heterogeneous attributes. The same is true for t-SNE. Any attempt at discovering meaningful patterns in these 2-D/3-D visualizations will hit this wall of the lack of direct interpretation of "summary" attributes. In general, point-to-point methods like t-SNE and PCA do not preserve all information of the initial features (they are lossy visualizations of n-D data) and produce a "summary" which has no direct interpretation. Another example with visualizations is explaining why a model misclassified some samples [42]. The idea is to generate the closest sample (from the correct class) to the misclassified sample and to visualize the difference between their attributes as an explanation of the misclassification (see Fig. 28c). This is a simple and attractive way of explanation. Moreover, this new sample can be added to the training data to improve the model. The goal of Fig. 28a, b is to explain visually the method of finding the closest sample from the correct class proposed in [42]. It is done by visualizing these samples using t-SNE to see how close they are. Figure 28a shows the original samples, and Fig. 28b shows them together with the closest samples from the correct class. It is visible that these samples are close (in fact they overlap) in t-SNE.

Fig. 28 Misclassified and modified samples in t-SNE (a, b) with their differences (c) [42]

Fig. 29 GLC-L: misclassified case (a, thick dark red) and the 1st (b) and 2nd (c) nearest cases of the correct class (thick light red), with dotted lines showing the changed attributes

Fig. 30 Relevance heatmaps on an exemplary time series with different relevance/explanation indicators [59]

While the idea of explaining the similarity of samples by visualizing them has merit, the use of lossy point-to-point algorithms like t-SNE is not free from deficiencies, which can be resolved by applying lossless point-to-graph visualization methods such as the GLC-L algorithm described in Sect. 4. In Fig. 28a, b, t-SNE does not show the difference between the samples; they overlap in the t-SNE visualization. While it shows that the closest samples computed by the algorithm from [42] and the t-SNE representation of these samples are quite consistent with each other, the resolution of t-SNE for these data is not sufficient to see the differences. Thus, t-SNE distorted the closeness in the original n-D space computed in [42]. In addition, t-SNE does not show alternative closest samples from the correct classes. In n-D space, the number of closest samples grows exponentially with the dimension. Which one should be picked for the explanation? We may get attribute xi = 5 in one neighbor sample and xi = −5 in another one from the same correct class. Respectively, these neighbors will lead to opposite explanations, and averaging such neighbors to xi = 0 will nullify the contribution of xi to the explanation. In [42], a sample was selected based on the proposed algorithm, without exploring alternatives.


The AtSNE algorithm [14] aims to resolve the difficulties of the t-SNE algorithm in capturing the global n-D data structure by generating 2-D anchor points (a 2-D skeleton) from the original n-D data with hierarchical optimization. This algorithm is only applicable to cases where the global structure of the n-D dataset can be captured by a planar structure using a point-to-point mapping (n-D point to 2-D point). In fact, a 2-D skeleton can corrupt the n-D structure (see the Johnson-Lindenstrauss lemma above). Moreover, a meaningful similarity between n-D points can be non-metric. The fact that t-SNE and AtSNE can distort n-D structures in a low-dimensional map is the most fundamental deficiency of all point-to-point methods, along with the lack of interpretation of the generated dimensions. Therefore, we focus on point-to-graph GLC methods, which open a new opportunity to address these challenges. Figure 29 shows how GLC-L resolves the difficulties exposed in Fig. 28. It shows a 4-D misclassified case and the two nearest 4-D cases of the correct class, where dotted lines show the attributes that changed values relative to the misclassified case. The vertical yellow line is the linear discrimination line of the red and blue classes. This lossless visualization preserves all 4-D information with the interpretable original attributes; it does not use any artificial attributes. Interpreting Time Series. Trends in retrospective time series data are relatively straightforward to understand, but how do we understand predictions of time series data? In [59], existing ML predictive algorithms and their explanation methods, such as LRP, DeepLIFT, LIME and SHAP, are adapted for the specifics of time series. Time points ti are considered as features, and the training and test data are sequences of m such features. Each feature is associated with an importance/relevance indicator ri computed by the respective explanation method. The vector of ri is considered an explanation of the sequence, playing the same role as the salience of pixels, and, respectively, it is also visualized as a heatmap (see Fig. 30). The authors modify test sequences in several ways (e.g., permuting sequences) and explore how the vector of explanations ri changes. Together with domain knowledge, an expert can inspect the produced explanation visualizations. However, the abilities for this are quite limited, in the same way as for the salience of pixels discussed in Sect. 2.5, because the explanations ri are still black boxes for domain experts.
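A hedged sketch of the general idea of per-time-point relevance indicators ri is given below, using a simple occlusion-style perturbation with a generic predict function standing in for the trained model; the actual methods adapted in [59] (LRP, DeepLIFT, LIME, SHAP) are more elaborate than this baseline.

```python
import numpy as np

def time_point_relevance(predict, series, baseline=0.0):
    """Relevance r_i of each time point: change in the model output when t_i is occluded."""
    base_score = predict(series)
    relevance = np.zeros(len(series))
    for i in range(len(series)):
        perturbed = series.copy()
        perturbed[i] = baseline                  # occlude one time point
        relevance[i] = base_score - predict(perturbed)
    return relevance

# Toy model: the "prediction" is a weighted sum emphasizing the later time points.
weights = np.linspace(0, 1, 20)
predict = lambda s: float(np.dot(weights, s))
series = np.sin(np.linspace(0, 3, 20))
print(np.round(time_point_relevance(predict, series), 3))
```

The resulting vector can be rendered as a heatmap over the sequence, in the spirit of Fig. 30, but it inherits the same limitation: the numbers remain opaque to a domain expert.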

8 Interpreting Deep Learning

8.1 Understanding Deep Learning via Generalization Analysis

The most challenging property of DNN models is that the number of their parameters far exceeds the number of training cases. This makes overfitting and memorization of the training data quite likely, with a failure to generalize to test data outside of the training data.


There are empirical observations that DNNs trained with stochastic gradient methods fit a random labeling of the training data, even after replacing the true images with completely unstructured random noise [71]. One could expect that such learning would fail to converge or would slow down substantially, but for multiple standard architectures this did not happen. This is consistent with the authors' theoretical result below, which is stated for a finite sample of size n and complements the NN universal approximation theorems that hold for the entire domain. Theorem. There exists a two-layer neural network with 2n + d weights that can represent any function on a sample of size n in d dimensions [71]. So far, taken together, these results have dampened the expectation of finding cues that distinguish models that generalize well from models that can only memorize the training data by observing the models' behavior during training. If these expectations had materialized, they would also shed light on the interpretability of the models: we would be able to filter out, as unexplainable, the models that behave like non-generalizable models. However, the actual results show a different situation. We can try to explain models that are accurate on the training data using the heatmap activation method that identifies salient pixels. Obviously, we can compute these pixels and "explain" complete noise. To distinguish this from a meaningful explanation, we would need to use a traditional ML approach: analyze errors beyond the training data, on the test data. Even this will not fully resolve the issue. It is commonly assumed that, for the success of an ML model, the training, validation and testing data should come from the same probability population; in the same way, the noise training, validation and testing data can be taken from a single population. How can we distinguish between models trained on the true labels, which are potentially explainable, and models trained on random labels, which should not be meaningfully explainable? This is an open question for black-box ML methods. The conceptual explanation methods based on domain knowledge for glass-box ML models are much better equipped to solve this problem.
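The memorization effect discussed above is easy to reproduce in miniature. The sketch below (random data, random labels, a small scikit-learn MLP) is only an illustration of the phenomenon reported in [71], not a replication of those experiments.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                 # unstructured "noise" inputs
y = rng.integers(0, 2, size=200)               # random labels with no relation to X

model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000, random_state=0)
model.fit(X, y)
print("train accuracy:", model.score(X, y))    # typically close to 1.0: pure memorization

X_test = rng.normal(size=(200, 20))
y_test = rng.integers(0, 2, size=200)
print("test accuracy:", model.score(X_test, y_test))   # near chance, about 0.5
```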

8.2 Visual Explanations for DNN

Visualizing activations for texts. The LSTMVis system [20] is for interactive exploration of the learned behavior of hidden nodes in an LSTM network. A user selects a phrase, e.g., "a little prince," and specifies a threshold. The system shows the hidden nodes with activation values greater than the threshold and finds other phrases for which the same hidden nodes are highly activated. Given a phrase in a document, line graphs and a heatmap visualize the activation patterns of the hidden nodes over the phrase. Several other systems employ activations, heatmaps, and parallel coordinates too. The open questions for all of them, for model explanation, are: (1) why should the activation make sense for the user, (2) how to capture relations between salient elements, and (3) how to measure whether the explanation is right?


Heatmap-based methods for images. Some alternative methods to find salient pixels in DNNs include: (i) sensitivity analysis, using partial derivatives of the activation function to find the maximum of its gradient; (ii) Taylor decomposition of the activation functions, using its first-order components to find scores for the pixels; (iii) Layer-wise Relevance Propagation (LRP), mapping the activation value back to the prior layers; and (iv) blocking (occluding, perturbing) sets of pixels and finding the sets that cause the largest change of the activation value, which can be accompanied by a class change of the image [48]. More approaches are reviewed and compared in [3, 6, 15, 70]. From the visualization viewpoint, the different methods that identify salient pixels in input images as an explanation all belong to the same category, because all of them use the same heatmap visualization method. The variations are that salient pixels can be shown in a separate image or as an overlay/outline on the input image. The ways these methods identify salient pixels are black boxes for the end users. The only differences that users can observe are how well the salient pixels separate the objects of interest from the background (horses and a bird in Fig. 31 [48]) and how specifically they identify these objects (whether each horse is framed or not). The visualization of the features that led to the conclusion typically needs other visualization tools beyond the heatmap capabilities. Therefore, a heatmap explanation is an incomplete explanation. Visualizing intermediate layers. While heatmaps overlaid on the original image to show salient pixels are a common way to explain DNN discoveries, heatmaps are also used to show features at the intermediate layers and to compare internal representations of object-centric and scene-centric networks [75], which can be used for model explanation. However, it is even more difficult to represent these in the domain terms, because, in contrast with the input data/image, the layers are less connected to the domain concepts.
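Method (iv) above, occlusion, is the simplest to sketch in code. Here `model` is any callable returning a vector of class scores for an image; the patch size, stride, and fill value are arbitrary choices, and the toy model at the end is purely for demonstration.

```python
import numpy as np

def occlusion_heatmap(model, image, target_class, patch=8, stride=4, fill=0.0):
    """Salience of image regions: drop in the target-class score when a patch is occluded."""
    h, w = image.shape[:2]
    base = model(image)[target_class]
    heat = np.zeros((h, w))
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            drop = base - model(occluded)[target_class]
            heat[top:top + patch, left:left + patch] = np.maximum(
                heat[top:top + patch, left:left + patch], drop)
    return heat

# Toy "model": class 0 score is the mean intensity of the image centre.
model = lambda img: np.array([img[12:20, 12:20].mean(), 0.0])
print(occlusion_heatmap(model, np.ones((32, 32)), target_class=0).max())
```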

Fig. 31 DNN heatmap for classes “horse” and “bird” [48, 57]


8.3 Rule-Based Methods for Deep Learning

RuleMatrix. In [45], a rule-based explanatory representation is visualized in the form of RuleMatrix, where each row represents a rule and each column is a feature used in the rules. These rules are found by approximating the ML model that was already discovered. As we see in Fig. 32, it is analogous to the visualization of association rules presented above in Fig. 23. In Fig. 32, the rule quality measures are separate columns, while in Fig. 23 they are integrated with the rules by using a heatmap approach. Next, Fig. 32 provides more information about each rule in the matrix form, while Fig. 23 provides more information in the form of parallel sets. These authors point out that current visualization tools focus on addressing the needs of machine learning researchers and developers, without much attention to helping domain experts who have little or no knowledge of machine learning or deep learning [1]. While it is true that domain experts have little ML knowledge, the issue is much deeper. All of us are domain "experts" in recognizing cats, dogs, birds, horses, boats and digits in pictures. Consider the question of how many ML experts would agree that their knowledge of DNN and other ML algorithms allows them to say that they have an explanation of how a DNN recognized a cat versus a dog. The meaningful explanation needs to be in terms of the features of cats and dogs, which are part of our commonsense knowledge. Similarly, for domain experts, the meaningful explanation must be in their domain knowledge terms, not in foreign ML terms. Interpreting Deep Learning Models via Decision Trees. The idea of interpreting neural networks using decision trees can be traced to [62]. Now it is expanded to DNNs to explain the prediction at the semantic level. In [72], a decision tree decomposes feature representations into elementary concepts of object parts. The decision tree shows which object parts activate which filters for the prediction and how much each object part contributes to the prediction score.

Fig. 32 Visualization of rules in RuleMatrix [45]


The DNN is learned for object classification with disentangled representations in the top conv-layer, where each filter represents a specific object part. The decision tree encodes various decision modes hidden inside the fully connected layers of the CNN in a coarse-to-fine manner. Given an input image, the decision tree infers a parse tree to quantitatively analyze the rationale for the CNN prediction, i.e., which object parts (or filters) are used for the prediction and how much each object part (or filter) contributes to it.

8.4 Human in the Loop Explanations

Explanatory Interactive Learning. DNNs can use confounding factors within datasets to achieve high prediction accuracy of the trained models. These factors can be good predictors in a given dataset but be useless in real-world settings [36]. For instance, the model can be right in its prediction, but for the wrong reasons, focusing incorrectly on areas outside of the issue of interest. The available options include: (1) discarding such models and datasets, and (2) correcting such models interactively with the human user [60]. The corrections include penalizing decisions made for the wrong reasons, adding more and better training cases including counterexamples, and adding annotated masks during the learning loop. While these authors report success with this approach, a user cannot review thousands of images for the correctness of heatmaps in the training and validation data; this review process is not scalable to thousands of images. Explanatory graphs. A promising approach to provide human-interpretable graphical representations of DNNs are explanatory graphs [70], which allow representing the semantic hierarchy hidden inside a CNN. The explanatory graph has multiple layers, and each graph layer corresponds to a specific conv-layer of the CNN. Each filter in a conv-layer may represent the appearance of different object parts. Each input image can only trigger a small subset of the part patterns (nodes) in the explanatory graph. The major challenge for explanatory graphs is that they are derived from the DNN: if the DNN is not rich enough to capture semantics, such a graph cannot be derived. This follows from the fact that the explanation cannot be better than its base model.

8.5 Understanding Generative Adversarial Networks (GANs) via Explanations

In a GAN, the generative network generates candidates while the discriminative network evaluates them; however, visualization and understanding of GANs are largely missing. A framework to visualize and understand GANs at the unit, object, and scene level, proposed in [4], is illustrated by the following example. Consider images of buildings without a visible door. A user inserts a door into each of them at the generative stage, and then the discriminative network evaluates them.


The ability of this network to discover the door depends on the local context of the image. In this way, the context can be learned and used for explanation. The opposite common idea is covering some pixels to find the salient pixels.

9 Open Problems and Current Research Frontiers

9.1 Evaluation and Development of New Visual Methods

An open problem for the visual methods intended to explain what a deep neural network has learned is matching the salient features discovered by explanation methods with human expertise and intuition. If, for a given problem, such matching is not feasible, then the explanation method is said to have failed the evaluation test. Existing explanation methods need to go through rigorous evaluation tests to be widely adoptable. Such tests are necessary, as they can help in guiding the discovery of new explainable visualizations of what a deep neural network has learned. The example below illustrates these challenges. The difference in explanation power between three heatmap visualizations of the digit '3' is shown in [56]; see Fig. 33. The heatmap on the left is a randomly generated heatmap that does not deliver interpretable information relevant to '3'. The heatmap in the middle shows the whole digit without the parts relevant, say, for distinguishing '3' from '8' or '9', but it separates '3' from the background well. The heatmap on the right provides relevant information for distinguishing between '3', '8' and '9'. If these salient pixels are provided by an ML classifier that discriminates these three digits, then they are consistent with human intuition about the differences between '3', '8' and '9'. However, for distinguishing '3' and '2', these pixels are not so salient for humans.

Fig. 33 Difference between explainability of heatmaps [58]


9.2 Cross-Domain Pollination: Physics and Domain-Based Methods

A promising area of potentially new insights for explainable methods is the intersection of machine learning with other well-established disciplines like Physics [2], Biology [5], etc., which have a history of explainable visual methods in their domains. For example, the mathematical expressions describing the behavior and interaction of subatomic particles are quite complex, but they can be described via Feynman diagrams, which are a visual device for representing their interactions [13]. Similarly, physics-inspired models are now being used to simplify and inform machine learning models that are readily explainable [2], but they can still be limited by the visualization of a large number of contributing variables. A promising direction would be to combine such physics-based methods with the GLC family of methods described above.

9.3 Cross-Domain Pollination: Heatmap for Non-Image Data

Recent ML progress has been guided by cross-pollination of different subfields of ML and related computer science fields. This chapter illustrates multiple such examples of integrating machine learning, visualization and visual analytics. Deep neural network algorithms have shown remarkable success in solving image recognition problems. Several DNN architectures developed for one type of image have been successful also for other types of images, demonstrating efficient knowledge transfer between image types. Converting non-image data to images by using visualization expands this knowledge transfer opportunity to solve a wide variety of Machine Learning problems [10, 61]. In such methods, a non-image classification problem is converted into an image recognition problem to be solved by powerful DNN algorithms. The example below is a combination of the CPC-R and CNN algorithms: the CPC-R algorithm [31] converts non-image data to images, and the CNN algorithm discovers the classification model in these images. Each image represents a single numeric n-D point as a set of cells with different levels of intensity and color. The CPC-R algorithm first splits the attributes of an n-D point x = (x1, x2, …, xn) into consecutive pairs (x1, x2), (x3, x4), …, (xn-1, xn). If n is an odd number, then the last attribute is repeated to get n + 1 attributes. Then all pairs are shown as 2-D points in the same 2-D Cartesian coordinates. In Fig. 34, the CPC-R algorithm uses grayscale intensity, from black for (x1, x2) to very light grey for (xn-1, xn); alternatively, the intensity of a color is used. This order of intensities allows full restoration of the order of the pairs from the image. In other words, a heatmap is created to represent each n-D point. The size of the cells can vary from a single pixel to dozens of pixels. For instance, if each attribute has 10 different values, then a small image with 10 × 10 pixels can represent a 10-D point by locating five grayscale pixels in this image. This visualization is lossless when the values of all pairs (xi, xi+1) are different and do not repeat.
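A minimal sketch of the CPC-R mapping just described is given below, using a grayscale intensity ramp from dark (first pair) to light (last pair); the exact intensity scheme and the collision handling of [31] are not reproduced here.

```python
import numpy as np

def cpcr_image(point, size=10):
    """Render an n-D point with integer attribute values in 1..size as a CPC-R grayscale image."""
    coords = list(point)
    if len(coords) % 2 == 1:
        coords.append(coords[-1])                      # repeat last attribute for odd n
    pairs = [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]
    img = np.full((size, size), 255, dtype=np.uint8)   # white background
    # Intensity ramp: first pair darkest, last pair lightest, so pair order is recoverable.
    levels = np.linspace(0, 200, num=len(pairs), dtype=np.uint8)
    for (x, y), level in zip(pairs, levels):
        img[size - y, x - 1] = level                   # 1-indexed attributes -> 0-indexed cells
    return img

img = cpcr_image((8, 10, 10, 8, 7, 10, 9, 7, 1, 1))    # the 10-D point from Fig. 34a
print(img.shape, sorted(np.unique(img))[:5])
```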


Fig. 34 CPC-R visualization of non-image 10-D points: (a) the 10-D point (8, 10, 10, 8, 7, 10, 9, 7, 1, 1) in CPC-R; (b) visualization in colored CPC-R of a case superimposed with the mean images of two classes put side by side

An algorithm for the treatment of colliding pairs is presented in [31]. Figure 34a shows the basic CPC-R image design, and Fig. 34b shows a more complex design of images, where a colored CPC-R visualization of a case is superimposed with the mean images of the two classes, which are put side by side, creating double images. The experiments with such images produce accuracy between 97.36% and 97.80% in tenfold cross-validation for different CNN architectures on the Wisconsin Breast Cancer data [31]. The advantage of CPC-R is in the lossless visualization of n-D points. It also opens an opportunity for discovering explanations in the form of salient pixels/features, as is done for DNN algorithms and described in the previous sections. In this way, non-image data and ML models will get visual explanations.

9.4 Future Directions

Interpretable machine learning, or explainable AI, has been an active area of research for the last few years. Outside of a few notable exceptions, generalized visual methods for generating deep explanations in machine learning have not progressed as much as one would expect, given the centrality of visualization in human perceptual understanding. Future directions in interpretability research and applications are diverse and are informed by domain needs, technical challenges, ethical and legal constraints, cognitive limitations, etc. Below is an incomplete list of some prominent challenges facing the field today:


• Creating simplified explainable models with predictions that humans can actually understand.
• "Downgrading" complex deep learning models for humans to understand them.
• Expanding visual and hybrid explanation models.
• Further development of explainable Graph Models.
• Further development of ML models in First Order Logic (FOL) terms of the domain ontology.
• Generating advanced models with the sole purpose of explanation.
• Post-training rule extraction.
• Expert-in-the-loop in the training and testing stages, with auditing of models to check their generalizability to wider real-world data.
• Rich semantic labeling of a model's features that the users can understand.
• Estimating the causal impact of a given feature on model prediction accuracy.
• Using new techniques such as counterfactual probes, generalized additive models, and generative adversarial network techniques for explanations.
• Further developing heatmap visual explanations of CNNs by Gradient-weighted Class Activation Mapping and other methods that highlight the salient image areas.
• Adding explainability to DNN architectures by layer-wise specificity of the targets at each layer.

10 Conclusion

Interpretability of machine learning models is a vast topic that has grown in prominence over the course of the last few years. The importance of visual methods for interpretability is being recognized as more and more limitations of real-world systems come into prominence. The chapter covered the motivations for interpretability, foundations of interpretability, discovering visual interpretable models, limits of visual interpretability in deep learning, a user-centric view of interpretability of visual methods, and open problems and current research frontiers. The chapter demonstrated that the approaches for discovering ML models aided by visual methods are diverse and expanding as new challenges emerge. This chapter surveyed current explainable machine learning approaches and studies toward deep explainable machine learning via visual means. The major current challenge in this field is that many explanations are still rather quasi-explanations and are often geared towards the ML expert rather than the domain user. There are often trade-offs required to make the models explainable, which require loss of information and thus loss of fidelity. This observation is also captured by the theoretical limits on preserving n-D distances in lower dimensions, presented based on the Johnson-Lindenstrauss lemma for point-to-point approaches. The chapter also noted that additional studies, beyond the arbitrary points explored in this lemma, are needed for the point-to-point approaches.


Many of the limitations of the current quasi-explanations and the loss of interpretable information can be contrasted with new methods like the point-to-graph GLC approaches, which do not suffer from the theoretical and practical limitations described in this chapter. The power of the GLC family of approaches was demonstrated via several real-world case studies based on multiple GLC-based algorithms. The advantages of the GLC methods were shown, and suggestions for multiple additional enhancements were also discussed. The dimension reduction, classification and clustering methods described in this chapter support scalability and interpretability in a variety of settings; these methods include the visual PCA interpretation with GLC clustering for cutting the number of points, etc. We also discussed several methods that are used for interpreting traditional machine learning problems, e.g., visualizing association rules via matrix and parallel set visualizations, dataflow tracing for decision trees, and visual analysis of Random Forests. PCA and t-SNE correspond to a class of models that are used for data and model understanding. While these methods are useful for high-level data summarization, they also suffer from simplification, are lossy, and have distortion bias. The GLC-based methods, being interpretable and lossless, do not suffer from many of the limitations of these methods described here. Deep learning models have resisted yielding to methods that not only provide explainability but also have high model fidelity, mainly because of the model complexity inherent in deep learning models. A brief survey of interpretable methods in deep learning is also given in this chapter, along with the strengths and weaknesses of these methods. The need for explanations via heatmaps and for time series data is also covered. It is likely that heatmap-based implicit visual explanations will continue to be in the focus of further studies, while this chapter has shown the need to go beyond heatmaps in the future. What the examples demonstrate is that the landscape of interpretability is uneven, in the sense that some domains have not been explored as much as others. The human element in the machine learning system may prove to be the most crucial element in creating interpretable methods. Lastly, it is noted that, despite the fact that much progress has been made in the interpretability of machine learning methods and the promise offered by visual methods, there is still a lot of work to be done in this field to create systems that are auditable and safe. We hope that the visual methods outlined in this chapter will provide impetus for further development of this area and help towards understanding and developing new methods for multidimensional data across domains.

References

1. Ahmad, M., Eckert, C., Teredesai, A., McKelvey, G.: Interpretable machine learning in healthcare. IEEE Intell. Inform. Bull. 19(1), 1–7 (2018, August) 2. Ahmad, M.A., Özönder, Ş.: Physics inspired models in artificial intelligence. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3535–3536 (2020, August)


3. Ancona, M., Ceolini, E., Oztireli, A., Gross, M.: A unified view of gradient-based attribution methods for deep neural networks, CoRR (2017). https://arxiv.org/abs/1711.06104 4. Bau, D, Zhu, J.Y., Strobelt, H., Zhou, Tenenbaum, J.B., Freeman, W.T., Torralba, A.T.: GAN dissection: visualizing and understanding generative adversarial networks, (2018/11/26). arXiv preprint arXiv:1811.10597 5. Bongard, J.: Biologically Inspired Computing. IEEE Comput. 42(4), 95–98 (2009) 6. Choo, J., Liu, S.: Visual analytics for explainable deep learning. IEEE Comput. Graph. Applic. 38(4), 84–92 (2018, Jul 3) 7. Craik, K.J.: The nature of explanation. Cambridge University Press (1952) 8. Doran, D., Schulz, S., Besold, T.R.: What does explainable AI really mean? A new conceptualization of perspectives. arXiv preprint arXiv:1710.00794 (2017 Oct 2) 9. Doshi-Velez, F., Kim, B.: Towards a rigorous science on f interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017, Feb 28) 10. Dovhalets, D., Kovalerchuk, B., Vajda, S., Andonie, R.: Deep Learning of 2-D Images Representing n-D Data in General Line Coordinates, Intern, Symp. on Affective Science and Engineering, pp. 1–6 (2018). https://doi.org/https://doi.org/10.5057/isase.2018-C000025 11. Druzdzel, M.J.: Explanation in probabilistic systems: Is it feasible? Will it work. In: Proc. of 5th Intern. Workshop on Intelligent Information Systems, pp. 12–24 (1996) 12. Embeddings, Tensorflow guide, https://www.tensorflow.org/guide/embedding (2019) 13. Feynman, Richard P.: The theory of positrons. Phy. Rev. 76(6), 749 (1949) 14. Fu, C., Zhang, Y., Cai, D., Ren, X.: AtSNE: Efficient and Robust Visualization on GPU through Hierarchical Optimization. In: Proc. 25th ACM SIGKDD, pp. 176–186, ACM (2019) 15. Gilpin LH, Bau D, Yuan BZ, Bajwa A, Specter M, Kagal L. Explaining explanations: An overview of interpretability of machine learning. In: 2018 IEEE 5th Intern. Conf. on data science and advanced analytics (DSAA) 2018, 80–89, IEEE. 16. Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D.: Google vizier: a service for black-box optimization. In KDD’17. ACM, pp. 1487–1495 (2017) 17. Grishin, V., Kovalerchuk, B.: Multidimensional collaborative lossless visualization: experimental study, CDVE 2014, Seattle. Luo (ed.) CDVE 2014. LNCS, vol. 8683 (2014 Sept) 18. Inselberg, A.: Parallel Coordinates, Springer (2009) 19. Jamieson, A.R., Giger, M.L., Drukker, K., Lui, H., Yuan, Y., Bhooshan, N.: Exploring nonlinear feature space dimension reduction and data representation in breast CADx with Laplacian Eigenmaps and t-SNE. Med. Phys. 37(1), 339–351 (2010) 20. Kahng, M., Andrews, P.Y., Kalro, A., Chau, D.H.: ActiVis: visual exploration of industry-scale deep neural network models. IEEE Trans. Visualiz. Comput. Graph. 24(1), 88–97 (2018) 21. Kovalerchuk, B., Grishin, V.: Adjustable general line coordinates for visual knowledge discovery in n-D data. Inform. Visualiz. 18(1), 3–32 (2019) 22. Kovalerchuk, B., Vityaev, E.: Data Mining in Finance: Advances in Relational and Hybrid Methods, Kluwer/Springer (2000) 23. Kovalerchuk, B., Vityaev E., Ruiz J.F.: Consistent and complete data and “expert” mining in medicine. In: Medical Data Mining and Knowledge Discovery, Studies in Fuzziness and Soft Computing, 60, Springer, pp. 238–281 (2001) 24. Kovalerchuk, B., Schwing, J., (Eds).: Visual and spatial analysis: advances in data mining, reasoning, and problem solving, Springer (2005) 25. 
Kovalerchuk, B.: Quest for rigorous intelligent tutoring systems under uncertainty: Computing with Words and Images. In: IFSA/NAFIPS, pp. 685–690, IEEE (2013) 26. Kovalerchuk, B., Dovhalets, D., Constructing Interactive Visual Classification, Clustering and Dimension Reduction Models for n-D Data, Informatics, 4(23) (2017) 27. Kovalerchuk, B.: Visual knowledge discovery and machine learning, Springer (2018) 28. Kovalerchuk, B., Neuhaus, N.: Toward efficient automation of interpretable machine learning. In: Intern. Conf. on Big Data, 4933–4940, 978–1–5386–5035–6/18, IEEE (2018) 29. Kovalerchuk, B., Grishin, V.: Reversible data visualization to support machine learning. In: Intern. Conf. on Human Interface and the Management of Information, pp. 45–59, Springer (2018)


30. Kovalerchuk, B., Gharawi, A.: Decreasing occlusion and increasing explanation in interactive visual knowledge discovery. In: Human Interface and the Management of Information. Interaction, Visualization, and Analytics, pp. 505–526, Springer (2018) 31. Kovalerchuk, B., Agarwal, B., Kalla, D.: Solving non-image learning problems by mapping to images, 24th International Conference Information Visualisation IV-2020, Melbourne, Victoria, Australia, 7-11 Sept. 2020, pp. 264–269, IEEE, https://doi.org/10.1109/IV51561. 2020.00050 32. Kovalerchuk, B.: Explainable machine learning and visual knowledge discovery. In: The Handbook of Machine Learning for Data Science, Springer (in print) (2021) 33. Kovalerchuk, B.: Enhancement of cross validation using hybrid visual and analytical means with Shannon function. In: Beyond Traditional Probabilistic Data Processing Techniques: Interval, Fuzzy etc. Methods and Their Applications, pp. 517–554, Springer (2020) 34. Kulesza, T., Burnett, M., Wong, W.K., Stumpf, S.: Principles of explanatory debugging to personalize interactive machine learning. In: Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126–137 (2015, Mar 18) 35. Kulpa, Z.: Diagrammatic representation and reasoning. In: Machine Graphics & Vision 3 (1/2) (1994) 36. Lapuschkin, S., et al.: Unmasking clever hans predictors and assessing what machines really learn. Nat. Commun. 10, 1096 (2019) 37. Liao, Q.V., Gruen, D., Miller, S.: Questioning the AI: Informing Design Practices for Explainable AI User Experiences. arXiv preprint arXiv:2001.02478. (2020, Jan 8) 38. Lipton, Z.: The mythos of model interpretability. Commun. ACM 61, 36–43 (2018) 39. Liu, S., Ram, P., Vijaykeerthy, D., Bouneffouf, D., Bramble, G., Samulowitz, H., Wang, D., Conn, A., Gray, A.: An ADMM Based Framework for AutoML Pipeline Configuration. arXiv preprint cs.LG/1905.00424. (2019, May 1) 40. Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.I.: Explainable AI for trees: from local explanations to global understanding. arXiv preprint arXiv:1905.04610. (2019, May 11) 41. van der Maaten, L.J.P., Hinton, G.E.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008) 42. Marino, D.L., Wickramasinghe, C.S., Manic, M.: An adversarial approach for explainable AI in intrusion detection systems. In IECON 2018–44th Conference of the IEEE Industrial Electronics Society, pp. 3237–3243, IEEE (2018 Oct 21) 43. Michie, D.: Machine learning in the next five years. In: Proceedings of the Third European Working Session on Learning, pp. 107–122. Pitman (1988) 44. Miller, T.: Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1–38 (2019, Feb 1) 45. Ming, Y., Qu, H., Bertini, E.: Rulematrix: visualizing and understanding classifiers with rules. IEEE Trans. Visualiz. Comput. Graph. 25(1), 342–352 (2018, 20) 46. Mitchell, T.M.: Machine learning. McGraw Hill (1997) 47. Molnar, C.: Interpretable Machine Learning (2020). https://christophm.github.io/interpretableml-book/ 48. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digi. Sig. Proc. 73, 1–5 (2018, Feb 1) 49. Muggleton, S., (Ed.).: Inductive logic programming. Morgan Kaufmann (1992) 50. Muggleton, S.: Stochastic logic programs. Adv Induct Logic Program. 32, 254–264 (1996, Jan 3) 51. 
Muggleton, S., Schmid, U., Zeller, C., Tamaddoni-Nezhad, A., Besold, T.: Ultra-strong machine learning: comprehensibility of programs learned with ILP. Mach. Learn. 107(7), 1119–1140 (2018 Jul 1) 52. Neuhaus, N., Kovalerchuk, B., Interpretable machine learning with boosting by Boolean algorithm, joint 2019 Intern. Conf. ICIEV/IVPR, Spokane, WA, pp. 307–311, IEEE (2019) 53. Park, H., Kim, J., Kim, M., Kim, J.H., Choo, J., Ha, J.W., Sung, N.: VisualHyperTuner: visual analytics for user-driven hyperparameter tuning of deep neural networks. InDemo at SysML Conf (2019)


54. Ribeiro, M., Singh, S., Guestrin, C.: Why should I trust you?: Explaining the predictions of any classifier. Proc. the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016) 55. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019) 56. Samek, W., Wiegand, T., Müller, K.R.: Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296. (2017) 57. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Müller, K.R.: Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Networks Learn. Syst. 28(11), 2660–2673 (2016 Aug 25) 58. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Müller, K.R.: Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Networks Learn. Syst. 28(11), 2660–2673 (2017 Nov) 59. Schlegel, U., Arnout, H., El-Assady, M., Oelke, D., Keim, D.A.: Towards a rigorous evaluation of XAI methods on time series. arXiv preprint arXiv:1909.07082 (2019, Sep 16) 60. Schramowski, P., Stammer, W., Teso, S., Brugger, A., Luigs, H.G., Mahlein, A.K., Kersting, K.: Right for the Wrong Scientific Reasons: Revising Deep Networks by Interacting with their Explanations. arXiv:2001.05371. (2020 Jan 15) https://arxiv.org/pdf/2001.05371 61. Sharma, A., Vans, E., Shigemizu, D., Boroevich, K.A., Tsunoda, T.: Deep insight: a methodology to transform a non-image data to an image for convolution neural network architecture. Nat. Sci. Reports 9(1), 1–7 (2019, Aug 6) 62. Shavlik, J.W.: An overview of research at Wisconsin on knowledge-based neural networks. In: Proceedings of the International Conference on Neural Networks, pp. 65–69 (1996 Jun) 63. Wang, Q., Ming, Y., Jin, Z., Shen, Q., Liu, D., Smith, M.J., Veeramachaneni, K., Qu, H.: Atmseer: Increasing transparency and controllability in automated machine learning. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2019) 64. Weidele, D.: Conditional parallel coordinates. IEEE Trans. Visual Comput. Graph. 26(1), 2019 (2019) 65. Weidele, D., Weisz, J.D., Oduor, E., Muller, M., Andres, J., Gray, A., Wang, D.: AutoAIViz: opening the blackbox of automated artificial intelligence with conditional parallel coor-dinates. In: Proc. the 25th International Conference on Intelligent User Interfaces, pp. 308–312 (2020) 66. Wilinski, A., Kovalerchuk, B.: Visual knowledge discovery and machine learning for investment strategy. Cogn. Syst. Res. 44, 100–114 (2017, Aug 1) 67. Wongsuphasawat, K., Smilkov, D., Wexler, J., Wilson, J., Mane, D., Fritz, D., Krishnan, D., Viegas, F.B., Wattenberg, M.: Visualizing dataflow graphs of deep learning models in tensorflow. IEEE Trans. Visualiz. Comput. Graph. 24(1), 1–12 (2017) 68. Xanthopoulos, I., Tsamardinos, I., Christophides, V., Simon, E., Salinger, A.: Putting the human back in the AutoML loop. In: CEUR Workshop Proceedings. https://ceur-ws.org/Vol-2578/ ETMLP5.pdf (2020) 69. Zhang, C., et al.: Association rule-based approach to reducing visual clutter in parallel sets. Visual Informatics 3, 48–57 (2019) 70. Zhang, Q.S., Zhu, S.C.: Visual interpretability for deep learning: a survey, frontiers of information technology & electronic. Engineering 19(1), 27–39 (2018) 71. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. 
arXiv preprint arXiv:1611.03530. (2016, Nov 10) 72. Zhang, Q., Yang, Y., Ma, H., Wu, Y.N.: Interpreting CNNs via decision trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6261–6270 (2019) 73. Zhang, Y., Liao, Q.V., Bellamy, R.K.: Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making. arXiv:2001.02114. (2020) 74. Zhao, X., Wu, Y., Lee, D.L., Cui, W.: iForest: interpreting random forests via visual analytics. IEEE Trans. Visualiz. Comput. Graph. 25(1), 407–416 (2018, Sep 5) 75. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495 (2014)

Survey of Explainable Machine Learning with Visual …

267

76. van der Maaten, L.: Dos and Don’ts of using t-SNE to Understand Vision Models, CVPR 2018 Tutorial on Interpretable Machine Learning for Computer Vision, https://deeplearning.csail. mit.edu/slide_cvpr2018/laurens_cvpr18tutorial.pdf

MiBeX: Malware-Inserted Benign Datasets for Explainable Machine Learning

Wayne Stegner, Tyler Westland, David Kapp, Temesguen Kebede, and Rashmi Jha

Abstract Deep learning has shown its capability for achieving extremely high accuracy for malware detection, but it suffers from an inherent lack of explainability. While methods for explaining these black-box algorithms are being extensively studied, explanations offered by algorithms, such as saliency mapping, are difficult to understand due to the lack of interpretability of many malware datasets. This chapter explores the role of information granularity in malware detection, as well as a scalable method to produce an intelligible malware dataset for machine learning classification. One of the resultant datasets is then used with a Malware as Image classifier to prove the method's validity for use in training deep learning algorithms. The Malware as Image classifier achieves a training accuracy of 98.94% and a validation accuracy of 93.83%, showing that the method can produce valid datasets for use with machine learning. Gradient-based saliency mapping is then applied to the trained classifier to generate heat-map explanations of the network output.

Keywords Explainable AI · Malware dataset · Malware detection · Granular computing · Deep learning · Malware as image · Saliency maps

W. Stegner (B) · T. Westland · R. Jha Department of EECS, University of Cincinnati, Cincinnati, United States e-mail: [email protected] T. Westland e-mail: [email protected] R. Jha e-mail: [email protected] D. Kapp · T. Kebede Resilient and Agile Avionics Branch, AFRL/RYWA, Wright-Patterson AFB, United States e-mail: [email protected] T. Kebede e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_9


1 Introduction We live in a society where malware is becoming an increasing problem, with Kaspersky reporting an increase of 11.5% in daily malware detection in 2017 [1]. To combat the malware, researchers have been using machine learning to distinguish between malicious and benign software [2–10]. While machine learning has yielded high accuracy in malware detection, some machine learning algorithms, such as deep learning, are black-box in nature. In order to improve the trust in these black-box algorithms, there are several important research questions which must be addressed. How do these techniques achieve such high accuracy in malware detection? What features do these models consider malicious? How do we validate that these algorithms are making classifications for the correct reasons? Answering these questions will not only improve our understanding of malicious software features, but will also allow for higher confidence in the classifications made by these networks. Another benefit of explaining the classifier's decisions is to gain an understanding of how the detection methods can be tricked. In 2017, Hu [11] was able to trick black-box malware detection models into classifying malicious software as benign by utilizing a generative adversarial network (GAN) called MalGAN. This alarming study emphasizes the need to understand what is going on in these black-box detection algorithms. To attempt to answer these questions, interpretability techniques have been applied to deep learning models in the context of malware detection [4]. While the explanations show which code the model considers malicious and which it considers benign, determining the correctness of these explanations proves challenging due to insufficient feature labels in the dataset. Manually labeling the malicious features of an existing dataset is a tremendously time-consuming task and leaves room for error, making it an impractical solution for validating explanations in large-scale datasets. In this chapter, we address this problem by presenting a method to generate scalable malware datasets with intelligible malicious features for the purpose of training deep learning algorithms and verifying the correctness of their explanations. The method was designed with explainable artificial intelligence techniques in mind, so that the intelligibility of a model trained on the data can be checked. The remainder of this chapter is outlined as follows. In Sect. 2, we discuss important background information and related works, including the role of information granules in malware detection. Then, we will present our new dataset generation method in Sect. 3. Next, we test our new dataset in a malware classification task using a Malware as Image classifier in Sect. 4. We then apply saliency mapping to the network and examine the results in Sect. 5. Finally, we discuss our conclusions and highlight areas of future work.

AFRL Public Approval case number APRS-RY-20-0240.


2 Background and Related Works 2.1 Malware Analysis Overview There are two encompassing ways to analyze a file for malicious code: dynamic and static analysis [12]. Dynamic analysis refers to running the executable and observing its behavior. This is done in a sandbox environment to avoid having damage done to a real system if the executable is malicious. The analysis system will look for particular actions taken by the executable, such as overwriting system files or the registry in the case of Microsoft Windows. The benefit is that if an attacker has obfuscated the code, but it still performs the same malicious actions, it will still be caught. The downside is that malicious code will often attempt to detect if it is executing within a sandbox environment, and then adjust its behavior accordingly. Static analysis refers to analyzing the file without running the code. A simple example would be to compute the cryptographic hash of the entire file. This quickly detects if a file was edited in any way, but not whether the edits are malicious. A common next step is to disassemble the file into assembly code instructions using a disassembler tool, such as Ghidra [13] or IDA Pro [14]. Disassembling the file into assembly code allows one to look deeper and create signatures. Signatures can detect if a particular malware sample exists in other files. These signatures will include things like code blocks from the malware and associated edits to the file, such as changes to the file header. They typically have to be hand-crafted, as it is difficult to ensure that enough information is captured without creating false positives.
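As a minimal illustration of the hash-based static check mentioned above, the following Python sketch (our own; the chapter does not provide code for this step) computes a file digest without ever executing the file:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute the SHA-256 digest of a file without executing it (static analysis)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large binaries do not have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Comparing the digest against a previously recorded value reveals *that* a file
# changed, but not *whether* the change is malicious -- exactly the limitation
# discussed above.
if __name__ == "__main__":
    print(file_sha256("/bin/ls"))  # example path; any local file works
```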

2.2 Granularity in Malware Analysis Granular computing [15–19] is a computing paradigm concerning the formation and processing of information granules. While there is no formal definition, Pedrycz [15] defines information granules as collections of entities that have common traits, including (but not limited to) functional similarity, closeness of value, or temporal relationship. Essentially, an information granule is an aggregate representation of lower-level data points. Granular computing is well-established in many fields, such as fuzzy sets, rough sets, interval analysis, and probabilistic environments [15, 16], but the underlying concept of forming information granules, known as information granulation, spans well beyond just those fields. An interesting example application of granular computing techniques lies in the task of image classification [17, 18]. For example, Rizzi et al. [17] segments images into regions that share similar color and texture to form information granules. The spatial relationships between these regions are represented as symbolic graphs, which are then classified by their similarity to a set of pre-classified graphs. Another approach, done by Zhao et al. [18], investigates a granular approach to the classification of handwritten digits.


A Convolutional Neural Network (CNN) is trained to classify the dataset, and the convolutional filters learned in this network are used to extract features from the images. These features are broken into feature sets and fed into an ensemble learning framework. While the concept of granular computing is not generally mentioned by name in the field of malware analysis, the concept of information granulation is fundamental to the field. Software itself has many different levels of information granularity. At the finest level, software can exist as binary machine code, that is, the series of 1s and 0s which a computer can interpret to execute a set of instructions. Typically, each eight adjacent bits are grouped into bytes for more space-efficient viewing. While machine code is meaningful to a computer processor, humans will struggle to parse any meaning out of it. For instance, it is difficult to discern whether a given byte is part of an instruction or data. Instead, the machine code can be transformed into assembly code by using a disassembly tool, as discussed in Sect. 2.1. Disassembling machine code into assembly code is a form of information granulation in which adjacent bytes are grouped into a set of assembly code instructions. At this level of granularity, opcodes and operands can be more easily identified by humans, which allows us to better apply domain-specific knowledge of instruction semantics to identify low-level behaviors of the software. An even coarser level of granulation is source code, for example a C or C++ source code file. Source code can be obtained by decompiling assembly code instructions, or it can be manually written. While assembly code is more concerned with the low-level details of executing instructions in a processor, source code describes higher-level behaviors and is an abstraction of assembly code. In other words, a simple action in C++, such as adding numbers, can be composed of many assembly instructions. Source code inherently contains more levels of information granulation, which are described by Han et al. [19]. For example, the paradigm of object-oriented programming (OOP) involves organizing code into higher-level constructs, such as classes and functions. A function can be thought of as a collection of functionally adjacent actions in the source code which work together to achieve higher-order actions. A class is even higher-level, often containing functions that describe the possible actions and behaviors of objects. Commercial off-the-shelf (COTS) software is often distributed in a binary format. Due to this fact, many malware analysis and detection techniques view the code at either machine code [2, 5, 6, 8, 9] or assembly instruction code [3, 4, 10, 20, 21] granularity. As previously discussed, it is difficult for humans to parse meaning out of byte-level machine code. To overcome this obstacle, a technique called Malware as Image transforms the machine code into an image by interpreting each byte as a pixel brightness value and reshaping it into a 2D array. Su et al. [6] utilizes Malware as Image by forming images as described previously, then resizing them into a 64x64 pixel image. These images are then fed into a CNN classifier, which identifies malware with an accuracy of 94% on a dataset containing 365 malware samples. While this method does not involve information granulation, other Malware as Image methodologies incorporate granulation techniques [2, 8, 9]. In these studies, the files are transformed


into images by interpreting bytes as pixel brightness values and shaping them into a 2D array. Instead of directly classifying the images, the studies utilize information granulation to extract features from the images, primarily intensity-based (e.g., average brightness, variance) and texture-based (e.g., wavelet and Gabor transform) features. While the information granules in these studies are formed out of similar features, the classification method and the size of the dataset are the primary differences. Nataraj et al. [9] utilizes a k-nearest neighbors classifier and achieves an accuracy of 98% on a dataset containing 9458 malware files. Kancherla et al. [2] classifies the data using a support vector machine, achieving an accuracy of 95% on a dataset containing 15000 malicious and 12000 benign files. Makandar et al. [8] uses feed-forward neural networks to classify the malware, achieving an accuracy of 96% on a dataset containing 3131 malware samples. While these studies achieve high classification accuracy with information granules extracted from the images, it is worth noting that these information granules are not derived from any semantic or functional relationship. Furthermore, Raff et al. [5] discusses why representing machine code as an image is not a perfect analogy. A 2D convolution of an image can also be viewed as a dilated 1D convolution of sequential data with large gaps in the convolution, which was shown to cause a decrease in classification performance. That being said, Malware as Image has shown the ability to achieve high accuracy in malware classification tasks. At a coarser level of information granularity, assembly code allows for the use of instruction semantics to better understand the behavior of the software. For example, Christodorescu et al. [20] utilizes disassembly in conjunction with the assembly instruction semantics in order to detect malicious behavior. Essentially, semantic meaning is used to simplify obfuscated assembly, which allows for more accurate malware classification. This level of analysis would not have been possible at the machine code level. Greer [21] granulates assembly code by clustering opcodes by semantic usage, allowing instructions to be grouped into families of similar functionality. For example, the commands ADD and SUB are similar actions. Instead of having separate symbols for them, the semantic clustering helps to model them as similar actions. An even coarser granulation is done by Garg et al. [10], where features are manually extracted by forming information granules from aggregated API calls measured through static analysis. These granules are then used with several supervised learning algorithms to detect malicious software, with a highest accuracy of 93% using support vector machines. Santacroce et al. [3] also utilized assembly code to classify malware, which is discussed in Sect. 2.4.
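To make the intensity-based granulation used in [2, 8, 9] concrete, the sketch below is our own simplified Python illustration (the cited works also use texture features such as wavelet and Gabor responses, which are omitted here): raw bytes are reshaped into a grayscale image and then aggregated into a few coarse intensity granules.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 64) -> np.ndarray:
    """Interpret raw bytes as pixel intensities and reshape them into a 2D array."""
    arr = np.frombuffer(data, dtype=np.uint8)
    height = int(np.ceil(arr.size / width))
    padded = np.zeros(height * width, dtype=np.uint8)   # zero-pad the last row
    padded[:arr.size] = arr
    return padded.reshape(height, width)

def intensity_granules(img: np.ndarray) -> dict:
    """Aggregate the pixel-level data into a few coarse intensity-based granules."""
    hist = np.bincount(img.ravel(), minlength=256) / img.size
    entropy = float(-(hist * np.log2(hist + 1e-12)).sum())
    return {"mean": float(img.mean()), "variance": float(img.var()), "entropy": entropy}

# with open("some_binary", "rb") as f:               # placeholder file name
#     print(intensity_granules(bytes_to_image(f.read())))
```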

2.3 Feature Visualization In order to visualize the features learned by deep learning architectures, several algorithms may be used. One algorithm of particular interest is Activation Maximization [22]. The goal of this algorithm is to calculate an input which will maximize the activation at a given unit in a trained deep network. Further building upon this method,


Class Model Visualization [23] focuses on finding the input which maximizes the output class score. This method involves calculating the L2-regularized input, I*, which maximizes the class score, S_c, given the regularization parameter, λ:

$I^* = \operatorname{argmax}_I \; S_c(I) - \lambda \lVert I \rVert_2^2 \qquad (1)$

Because I* represents an input to the classifier, the granularity of this explanation is the same as that of the input space. Furthermore, the coarseness of information granulation may influence whether or not I* represents a realistic example of a feature vector in a given classifier. Consider, for example, a classification model which uses a fine-grained input, such as the raw bytes of a file, and determines whether or not the file is malware. In this case, applying Class Model Visualization will produce an input I* which represents a string of bytes that maximizes a particular output class. However, it is highly unlikely that the value I* will produce a functional binary file. A binary file stores a sequential set of program instructions and data, which leads to sensitive functional and spatial dependencies between the bytes of the file. Because Class Model Visualization does not consider these constraints, the value of I* may bear some similarities to example members of that class, but will be unreadable to the computer. Therefore, Class Model Visualization will not produce a realistic example of a feature vector in this classifier. Now, consider an example which takes a coarser-grained input, such as an aggregation of Application Programming Interface (API) call frequencies [10], to determine whether or not a file is malware. In this case, Class Model Visualization produces I*, which represents a set of API call frequencies that maximizes a particular output class. In contrast to the fine-grained example, I* cannot be immediately invalidated in this instance. Because API calls are aggregated from within the program, the functional and spatial dependencies between them are abstracted away from the feature vector. With the reduced constraints on the input space, it is more likely that I* will be an example of a realistic feature vector.

A different approach to generating visual explanations of important features is Image-Specific Class Saliency Visualization [23], which we will refer to as saliency mapping. The purpose of saliency mapping is to produce a heat map of important features in the input space. To calculate the saliency map, first the gradient, w, of the class score, S_c, with respect to the input, I, evaluated at a given input, I_0, is calculated:

$w = \left. \frac{\partial S_c(I)}{\partial I} \right|_{I_0} \qquad (2)$

The saliency map, M, is then calculated, as shown in (3):

$M = |w| \qquad (3)$


Because the saliency map is gradient-based, it can be thought of as a measure of how much each input node would affect the output class score if it were to be changed, and therefore represents how important each input node is to the output class score. We can modify (2) to use the output of an arbitrary network layer, h_i, instead of S_c, as shown in (4):

$w_i = \left. \frac{\partial h_i(I)}{\partial I} \right|_{I_0} \qquad (4)$

Using (4), we can calculate the saliency map using network layer i, as shown in (5):

$M_i = |w_i| \qquad (5)$

This enhancement allows us to more closely examine which input features are important to an arbitrary layer i of the network, and not just to S_c. Case Study: In order to show a visually intuitive example of saliency mapping, it will be demonstrated on a dataset of dogs and cats from Kaggle [24]. The goal in this dataset is to determine whether the given input image contains a dog or a cat. Given the prevalence of CNNs in image processing tasks [25, 26], we will be using a simple CNN for this task. Our CNN architecture consists of two 2D convolutional layers (Conv0, Conv1), followed by a hidden dense layer (Dense0), and finally the output classification layer (Dense1). After training for 500 epochs, this network achieves 99.7% training accuracy and 86.03% validation accuracy. The exact training parameters and network structure are not the primary focus of this example, and therefore will not be discussed in detail. After training the network, we apply saliency mapping from (5) to each layer of the network, which is shown in Fig. 1. In these saliency maps, we can see very distinct details of both the dogs' and cats' faces represented in various layers, which means that the network considers those groups of pixels important in those layers.
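A minimal sketch of the layer-wise saliency computation in (4)–(5) is given below. The use of TensorFlow/Keras, the layer name, and the reduction of the layer output to a scalar by summation are our assumptions for illustration; the chapter does not prescribe a particular framework.

```python
import numpy as np
import tensorflow as tf

def layer_saliency(model: tf.keras.Model, layer_name: str, x: np.ndarray) -> np.ndarray:
    """Saliency of the input w.r.t. an arbitrary layer's output, as in Eqs. (4)-(5)."""
    # Sub-model that exposes the chosen layer's output h_i(I).
    sub = tf.keras.Model(model.inputs, model.get_layer(layer_name).output)
    inp = tf.convert_to_tensor(x[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(inp)
        h = sub(inp)
        # Reduce the layer output to a scalar so a single gradient map is obtained
        # (a design choice; the output class score S_c is already scalar).
        score = tf.reduce_sum(h)
    grad = tape.gradient(score, inp)          # w_i = dh_i/dI evaluated at I_0
    return tf.abs(grad)[0].numpy()            # M_i = |w_i|

# saliency = layer_saliency(model, "conv2d", image)   # hypothetical layer name
```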

2.4 Malware as Video Based on the Malware as Image concept, Santacroce et al. [3] previously developed a technique called Malware as Video. The motivation behind this approach is to counteract one of the downfalls of Malware as Image using CNNs [6], namely that all images must be the same size to be fed into the neural network. While Malware as Video is a visual representation classification technique, it addresses the size constraint by breaking the image into video frames and feeding them through a time-distributed CNN. By using this technique, the amount of padding applied to an image is drastically reduced, and the network is able to handle files with drastic variations in size. Using this method, we achieve 99.86% training and 98.74% testing accuracy on the 2015 Microsoft Malware Classification Challenge dataset [7].


Fig. 1 Saliency maps of cat images (top two rows) and dog images (bottom two rows). From left to right: original image, Conv0 saliency map, Conv1 saliency map, Dense0 saliency map, and Dense1 saliency map

As a result of our high accuracy, we desired an explanation for what features the network considers malicious. Using the saliency mapping technique in (3), we were able to generate image specific saliency maps to show which features were considered malicious and which were benign by our network. To validate that we had indeed extracted the important features in the inputs, we modified input videos by removing non-salient portions. Using the modified video resulted in a 99.84% training and 99.31% testing accuracy, which confirmed that we identified the truly salient portion of the code. For comparison, when removing the salient code from the input, the resulting accuracy is 66.93% for training and 66.75% for testing. While we know that we have identified the salient code for the network, we still cannot say with certainty that the extracted portions of code are actually malicious. Due to a lack of intelligibility of the dataset, it is difficult to determine which features in the original code are malicious and which are benign. It is entirely possible that the dataset has some features which are common among malicious files, but do not actually perform any malicious action. In order to perform a more in-depth feature analysis, we require a dataset with intelligible features.

Table 1 Summary of bash commands

Command   Description
cp        Copy a file
cut       Remove sections from a file, often with a delimiter
find      Recursively search for files in a directory
grep      Prints lines that match a pattern
mkdir     Make a directory
xargs     Directs standard input to the argument of a command

2.5 MetaSploit A well-known tool for producing malicious executables is Metasploit [27]. It has been cited as a common method of producing payloads for testing and research purposes [28–33]. In our study, we focused on a single payload: the reverse TCP shell. When executed, it contacts the attacker's PC and presents a shell. From there, an attacker could mount more complicated attacks, such as establishing a permanent presence on the user's machine. Doing so requires the use of two tools from Metasploit. The first is MSFvenom, which allows one to create a malicious executable for a specified machine. An additional feature of MSFvenom is that it can insert a specified payload into a given binary to create a Trojan. That particular feature is how we create the Trojans for our dataset which our detection system analyzes. MSFvenom is unable to retain the original functionality of the benign binary, but as our system is a type of static analyzer, this would not be visible to our system. The second tool is MSFconsole, which acts as the "Command & Control" center for malicious executables created by MSFvenom. Once informed of the type of attack being performed, MSFconsole will maintain connections from all connecting malicious executables.

2.6 Bash Commands Throughout this work, we utilize several bash commands in our dataset generation methodology. Table 1 summarizes the important bash commands used in our work.


Fig. 2 Overview of the dataset generation process

3 Dataset Generation Our approach to generating a dataset of malicious files fulfills two major goals. The primary goal is for the dataset to be easily interpretable for use in explainable machine learning for malware detection. In this case, interpretability means knowing which portion of the code is malicious and which is benign. The purpose of this goal is to increase the value of fine-grained interpretability models. The second goal is for this method to be scalable to produce high volume datasets. Deep learning based techniques, such as Malware as Video, require large datasets to properly train and validate models. The use of a scalable method to generate these large datasets can be a valuable tool for model validation. To meet these goals, we developed a dataset generation method consisting of two main stages. The first stage involves gathering a pool of executable files, which serve as the benign files in the dataset. In the second stage, a malicious Trojan is inserted into each benign file, which forms the malicious portion of the dataset. We use this method to create both 32-bit and 64-bit datasets. A high-level diagram of this process is shown in Fig. 2. After creating the dataset, we validate the functionality of the newly generated malicious files by running them in a virtual machine.

3.1 Gathering Benign Files We wanted our method of gathering benign files to be easily repeatable. To achieve this goal, we gathered the benign files from default installations of Kali Linux 2020.2 64-bit and Kali Linux 2020.2 32-bit. We chose to gather files from both 32-bit and 64-bit systems to validate the method on a wider variety of target systems. For convenience, the files were collected from installations running on a virtual machine, but the process can be replicated with a physical machine. In the Kali Linux installation, we are particularly interested in collecting the executable binary files. In the case of Linux, the most popular file format for executable files is the Executable and Linkable Format (ELF) [34]. In the Linux filesystem, many ELF files can be found in the /bin/ and /usr/bin/ directories. Figure 3 shows a bash script we used to scrape these directories for ELF files and copy them to the benign directories. The script calls the find command on both directories, which is a Unix program that searches through a given directory and outputs a list of files. The file command is called on each output, which displays information about the type of the file (i.e., "ASCII text" or "ELF 64-bit"). From this information, we select the ELF files with the grep command, which is used to print out lines of a string or file matching a particular pattern. In our case, we only want lines to display if the output of file contains the string ": ELF", designating it as an ELF file. Each file from the filtered group is copied over to the directory designated by $BEN_DIR, where it is stored for Trojan insertion. After gathering the benign pool, each benign file has the extension .benign appended to distinguish it from its infected counterpart.

Fig. 3 Bash script to scrape ELF files

Fig. 4 Example MSFvenom command call to insert a Trojan into ls.benign
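For readers who prefer Python to the bash pipeline of Fig. 3, the following sketch (ours, not the authors' script) achieves the same effect by checking the ELF magic number directly instead of calling the file command; the search directories follow the text above, while the destination directory is a placeholder.

```python
import shutil
from pathlib import Path

ELF_MAGIC = b"\x7fELF"                      # first four bytes of every ELF file
SEARCH_DIRS = ["/bin", "/usr/bin"]
BEN_DIR = Path("benign")                    # destination directory (our choice)

def is_elf(path: Path) -> bool:
    try:
        with open(path, "rb") as f:
            return f.read(4) == ELF_MAGIC
    except OSError:
        return False

BEN_DIR.mkdir(exist_ok=True)
for d in SEARCH_DIRS:
    for p in Path(d).rglob("*"):
        if p.is_file() and not p.is_symlink() and is_elf(p):
            # Append .benign to distinguish from the infected copies made later.
            shutil.copy(p, BEN_DIR / (p.name + ".benign"))
```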

3.2 Trojan Insertion Once the benign files are collected, the malicious dataset is constructed by inserting a Trojan payload into each file. To accomplish this task, we utilized the Metasploit [35] framework, more specifically a tool called MSFvenom, which we discussed in Sect. 2.5. Figure 4 shows an example command used to insert the payload into a benign file. We specify the command to target the 32-bit Linux platform with a reverse TCP shell payload. The template argument designates a file in which to insert the payload. In this case, we insert the payload into the file clear.benign and output the infected file to clear.infected.


To automate the process of inserting the Trojan payload into every single benign file, we utilized the make tool [36]. It is a versatile tool with the capability to automatically determine which components of a project require recompilation. In typical usage, make is used in conjunction with a build tool, such as gcc. A set of rules in a Makefile define parameters, such as input files, output files, and specific commands needed to generate output files. For our application, we treat MSFvenom as our build tool, files ending in .benign as our source files, and files ending in .infected as our output files. The Makefile rules are configured so that if a .benign file does not have a corresponding .infected file, the .infected file is generated by calling MSFvenom similarly to the command shown in Fig. 4.
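The same "only build what is missing" behavior can also be sketched directly in Python, as shown below. The msfvenom invocation is our guess at a typical reverse-TCP template call and is only illustrative; the exact command used by the authors is the one shown in Fig. 4, and LHOST/LPORT are placeholders.

```python
import subprocess
from pathlib import Path

BEN_DIR, MAL_DIR = Path("benign"), Path("infected")   # directory names are ours
MAL_DIR.mkdir(exist_ok=True)

for benign in BEN_DIR.glob("*.benign"):
    infected = MAL_DIR / benign.name.replace(".benign", ".infected")
    if infected.exists():                 # mimic make: skip targets that are up to date
        continue
    cmd = [
        "msfvenom",
        "-p", "linux/x86/shell_reverse_tcp",   # 32-bit reverse TCP shell payload
        "LHOST=192.0.2.1", "LPORT=4444",       # placeholder attacker address/port
        "--platform", "linux", "-a", "x86",
        "-x", str(benign),                     # benign template to infect
        "-f", "elf", "-o", str(infected),
    ]
    subprocess.run(cmd, check=True)
```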

3.3 Malware Verification After the dataset was completed, we wanted to test that the newly infected malware files actually perform their intended function. In order to do this, we set up a script that automatically runs the malicious file, connects to an instance of MSFconsole (referred to as the remote console), sends communications, and then closes. When the reverse TCP shell connects to the MSFconsole, the remote console creates a new file on the system. If the testing script detects that the file has been created, it means that the Trojan payload properly connected to the remote console. If the connection times out or the file is not created, it is assumed that the malware has not run properly, and the run is counted as a failed test. The testing script was added to the Makefile so it is automatically run for every malicious file, and if a file has already passed it will not be run again. In addition to the automatic functionality testing, we examined a few random files from each dataset with a disassembly tool. We performed a file diff of the infected file against its benign counterpart to see more clearly what changes were made by MSFvenom.

3.4 Dataset Generation Results Using the methodology described in this section, we were able to generate a dataset of 32-bit files consisting of 1685 files in each of the benign and infected classes, and a dataset of 64-bit files consisting of 1687 files in each class. Because we have both benign and infected versions of the same file, we are easily able to determine which features of the file are malicious and which are benign. A simple diff of the infected file against its benign counterpart will reveal this information, and it can be entirely automated. The use of the make command greatly helps to improve the scalability of the method. Because the Makefile rules check which files need to be generated, it is possible to add a few files to the benign partition and have them quickly and automatically infected without having to reinfect every


single file. Additionally, the payload functionality testing is fully automated with the Makefile, which saves time when expanding the dataset. Our dataset generation method produces datasets that are greatly advantageous over the 2015 Microsoft Malware Classification Challenge dataset [7] in terms of intelligibility. As previously discussed, we were unable to verify the benign and malicious components of the Microsoft dataset. However, we can easily identify the malicious components of our dataset. This level of intelligibility holds great promise for the verification of malware detection systems of any type. An interesting behavior we observed from the infected files is that none of the 64-bit files passed the functionality test, while the 32-bit files all passed with the exception of just 13 files. A "payload only" 64-bit file was generated by running the command from Fig. 4 without the template argument. This file was successfully able to connect to the remote console, indicating that the issue was not in the functionality of the payload, but instead in the process of inserting the payload into the benign file. We also verified that the Trojan was inserted into the 64-bit files by performing a file diff. We suspect that this odd behavior is due to a glitch in MSFvenom, because the 32-bit dataset performs as intended. However, since the Trojan is actually inserted into the file, it can be found with some static analysis techniques.
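Because every infected file has a benign twin, the "which bytes are malicious" label can be recovered automatically. Below is a minimal sketch (our own illustration, not the authors' tooling) that reports the differing byte ranges of a benign/infected pair:

```python
def diff_regions(benign_path: str, infected_path: str):
    """Return (start, end) byte offsets where the two files differ."""
    with open(benign_path, "rb") as f:
        a = f.read()
    with open(infected_path, "rb") as f:
        b = f.read()
    n = min(len(a), len(b))
    regions, start = [], None
    for i in range(n):
        if a[i] != b[i] and start is None:
            start = i                      # a differing run begins
        elif a[i] == b[i] and start is not None:
            regions.append((start, i))     # the run ends
            start = None
    if start is not None or len(a) != len(b):
        # Close an open run, or count a length mismatch as a trailing difference.
        regions.append((start if start is not None else n, max(len(a), len(b))))
    return regions

# e.g. diff_regions("clear.benign", "clear.infected") should expose the modified
# program header and the Trojan inserted at the entry point.
```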

4 Malware Classification In order to validate the use of our dataset generation method for creating machine learning datasets, we used the 32-bit dataset in a malware classification task. We decided not to use the 64-bit dataset because the payloads were not properly activating. While the payloads would have been detectable with some static analysis techniques, the 32-bit dataset serves as a more realistic example. In order to select the classification algorithm, we must first consider the implications of the level of granularity of the method. In Santacroce et al. [4], software was analyzed at the assembly code level of granularity. However, the pre-processing overhead associated with machine code disassembly is computationally prohibitive for embedded applications, such as drone flight computers. Therefore, we will form our information granules using the raw bytes of the machine code, providing a fine level of information granularity with minimal pre-processing overhead. Based on this decision, we chose to apply Malware as Image with CNN classification for this example. This approach also lends itself to easy visualization, and it is a straightforward, easy-to-follow implementation. Figure 5 shows an overview of the Malware as Image method we will be using.


Fig. 5 Overview of the Malware as Image classification model

Fig. 6 Histogram showing the distribution of file sizes in the 32-bit dataset

4.1 Pre-processing To prepare the dataset for the network, we must determine the image size, because the network requires all inputs to have the same dimensions. For simplicity, the end of each file is simply padded with zeros until they are all the same size. When file sizes differ drastically, this causes a large amount of padding to be applied to the end of the smaller files. Figure 6 shows the distribution of file sizes in the entire 32-bit dataset. The majority of the files fall on the left side of the histogram, with a very small number of files outside of the first bin. Some of the files are orders of magnitude larger than other files. The reason for this size discrepancy is an artifact of scraping ELF files from Kali Linux; some of the files just happen to be substantially larger than others. Figure 7 shows the distribution of file sizes that are less than 100 KB. While the histogram still skews heavily to the left, the difference between the minimum and maximum file size is greatly reduced, allowing for the use of less padding when inputting files into our network. Discarding any file larger than 100 KB leaves 2591 files to use with our Malware as Image classifier.


Fig. 7 Histogram showing the distribution of file sizes under 100 KB

Next, we set the image width to 32 pixels and the height to 3070 pixels. We chose 32 pixels as the width mainly because the word size of machine code is typically a power of two. Notice how in Fig. 8 the top and bottom of the image feature vertical bars. These bars appear because the width of the image is a multiple of the word size, which causes the alignment to be consistent across the rows. If an arbitrary width were used, the words might not line up between the rows, which may prove problematic due to the nature of using 2D convolution in our CNN. We must convert the binary file into an image file using methodology similar to [6]. We achieve this by first reading in the binary file as a byte array. Each byte is then cast into an 8-bit unsigned integer, making each element of the array an integer in the range [0, 255]. The array is then reshaped to have a width of 32 and a height of 3070, giving us a gray-scale image representation of the original binary file. Note that during the reshaping process, the end of the file is padded with zeros to ensure that all images are the same size. Figure 8 shows examples of image files generated from clear.benign and clear.infected, as well as the differences between the two files. Note that for the sake of space, these images are 64 pixels wide and 220 pixels tall, instead of the aforementioned resolution of 32 pixels by 3070 pixels. There are two distinct differences, which appear in every pair of benign and infected files. First, there is a small dot of difference at the top of the image. Upon further analysis with a disassembly tool, this turned out to be a slight change in one of the program headers, where one of the file segments has the write bit enabled. This change is likely just an artifact of MSFvenom modifying the file. The second change is in the middle of the file, and that is where the Trojan was inserted at the entry point of the file. This is a more important change, because this area is where the malicious code lives in the file. If we were to calculate a saliency map of clear.infected, we would expect a network which has properly learned to identify the malicious features of the dataset to highlight this region of the file.
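A short sketch of the conversion just described is given below (our own NumPy rendering of the procedure, following [6]); the width of 32 and height of 3070 are the values chosen above.

```python
import numpy as np

WIDTH, HEIGHT = 32, 3070            # image dimensions chosen above

def binary_to_image(path: str) -> np.ndarray:
    """Read a binary file and reshape its bytes into a fixed-size grayscale image."""
    raw = np.fromfile(path, dtype=np.uint8)          # each byte -> value in [0, 255]
    img = np.zeros(WIDTH * HEIGHT, dtype=np.uint8)   # zero padding at the end
    n = min(raw.size, WIDTH * HEIGHT)                # files over 100 KB were discarded earlier
    img[:n] = raw[:n]
    return img.reshape(HEIGHT, WIDTH)

# image = binary_to_image("clear.infected")
```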

Fig. 8 Image representations of the clear (top row) and udp_server (bottom row) programs. From left to right: benign file image, infected file image, image differences of benign and infected highlighted in red


Table 2 Malware classification network architecture

Layer type              Layer name   Activation   Notes
2D Convolutional        Conv0        tanh         Shape = 4x4
2D Max pooling          Pool0        N.A.         Shape = 3x3
Dropout                 Drop0        N.A.         Rate = 0.25
2D Convolutional        Conv1        tanh         Shape = 3x3
2D Max pooling          Pool1        N.A.         Shape = 3x3
Dropout                 Drop1        N.A.         Rate = 0.25
Global average pooling  Pool2        N.A.         N.A.
Dense                   Dense0       tanh         Size = 50
Dense                   Dense1       sigmoid      Size = 1

4.2 Network Specifications To classify these datasets, we constructed a two-layer convolutional neural network similar to the one found in [6]. Table 2 shows the specific parameters for constructing our network. We trained the network for 5000 epochs with a batch size of 32, a learning rate of 0.001, and Binary Cross-entropy as the loss function. The datasets were partitioned 80% training data and 20% testing data. Both the 80% and 20% partitions are composed of equal amounts of benign and malicious files. Because this particular dataset only utilizes one Trojan in the malicious class, there are no “zero-day” Trojans (i.e., Trojans that have not been seen before by the system) in this dataset. While we note that utilization of 5-fold cross validation would further validate the results of this network, for this example we are going to use the simple 80%:20% split to simplify our saliency mapping application in Sect. 5.
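For illustration, a Keras rendering of the architecture in Table 2 and the stated training setup is sketched below. The framework, the input shape, the optimizer, and the convolutional filter counts are our assumptions, since the chapter does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(3070, 32, 1)) -> tf.keras.Model:
    model = tf.keras.Sequential([
        layers.Conv2D(16, (4, 4), activation="tanh", input_shape=input_shape),  # Conv0 (filter count assumed)
        layers.MaxPooling2D((3, 3)),                                            # Pool0
        layers.Dropout(0.25),                                                   # Drop0
        layers.Conv2D(32, (3, 3), activation="tanh"),                           # Conv1 (filter count assumed)
        layers.MaxPooling2D((3, 3)),                                            # Pool1
        layers.Dropout(0.25),                                                   # Drop1
        layers.GlobalAveragePooling2D(),                                        # Pool2
        layers.Dense(50, activation="tanh"),                                    # Dense0
        layers.Dense(1, activation="sigmoid"),                                  # Dense1
    ])
    # Optimizer is an assumption; learning rate and loss follow the text.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_model()
# model.fit(x_train, y_train, epochs=5000, batch_size=32, validation_data=(x_val, y_val))
```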

4.3 Classification Results During the training process, we observed the training and validation accuracy over 5000 epochs. These observations are shown in Fig. 9. After training, the accuracy is 98.94% on the training set and 93.83% on the validation set. A breakdown of the classification accuracy is shown in Table 3. We observe that the training and validation accuracies still appear to trend upward after 5000 epochs, meaning our network may benefit from additional training epochs. Regardless, we have demonstrated the ability to use our new dataset for the purpose of training a malware detection machine learning algorithm.


Fig. 9 Training (red) and validation (blue) accuracy over 5000 epochs

Table 3 Malware detection results

Metric           Train value   Test value
True positive    1016          237
True negative    1034          250
False positive   1             9
False negative   21            23
Accuracy         98.94%        93.83%


5 Saliency Mapping Now that we have a trained Malware as Image classifier, we can apply saliency mapping to it in an attempt to add some transparency to this black-box system. From the file difference images in Fig. 8, we can see approximately what our saliency maps should look like; a network which has properly learned the malicious features of this dataset should find the Trojan portion of the file important for classifying it as malware. To test this, we took the saliency map of the following layers: Conv0, Conv1, Dense0, and Dense1. An example set of saliency maps from a malicious file is shown in Fig. 10. Only the first 250 rows are shown for space considerations. Despite the presence of the Trojan in this file, the area of the file where the Trojan resides is not highlighted. Instead, only the black portions of the image are showing up as salient to our network. Because the dataset generation method allows us to know the expected saliency maps, we know that the black portions of the image should not be significant in the explanation. We can use this knowledge to understand that the explanations generated by our saliency maps are not what we expected, and we might not be able to trust that our network has actually learned the malicious features present in the dataset. These saliency mapping results point us to two main possibilities. The first possibility is that our network has learned to classify this dataset accurately, but is using patterns other than the presence of the Trojan in the file. For example, in Fig. 8, we discussed the two major differences between the files: the program header modification and the Trojan insertion. It may be possible that the algorithm is learning to detect the program header modification instead of the Trojan insertion. The second possibility is that our saliency mapping method is not an effective visualization method in this particular example. Comparing this saliency mapping to that done by Santacroce et al. [4], we may note several significant differences. In [3], classification is done on assembly instructions rather than the raw machine code, as we do in this chapter. This difference means that the two studies use two different levels of data granularity. As discussed in Sect. 2.2, processing the raw machine code is a finer level of granularity than using the assembly instructions. Using the coarser granularity allowed additional pre-processing to take place that may have further affected the features learned by the network. For instance, the assembly code is tokenized and vectorized with an embedding layer in the classification network, which further differentiates the two applications conceptually. Furthermore, only a subset of the assembly code is used; classification in [3] is done solely on the .text portion of the file, while the work in this chapter operates on the file in its entirety. It is possible that the data pre-processing done in [3] helped to filter out information that was irrelevant to the classifier, which allowed the explanations in [4] to focus on the actually relevant portions of the code. With these three notable differences in methodology, it will be beneficial to utilize the dataset produced in this chapter, but follow the classification methodology in [3], and then analyze the saliency mapping results.
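Since the ground-truth malicious region is known from the benign/infected diff, the visual comparison above can also be made quantitative. The following sketch (our own idea, not part of the chapter) measures what fraction of the total saliency falls inside the known Trojan region:

```python
import numpy as np

def saliency_inside_ratio(saliency: np.ndarray, regions) -> float:
    """Fraction of total saliency mass that falls inside the known malicious byte regions.

    `saliency` is a (height, width) map aligned with the file bytes; `regions` are
    (start, end) byte offsets obtained by diffing the benign and infected files.
    """
    mask = np.zeros(saliency.size, dtype=bool)
    for start, end in regions:
        mask[start:min(end, saliency.size)] = True
    mask = mask.reshape(saliency.shape)
    total = float(saliency.sum())
    return float(saliency[mask].sum()) / total if total > 0 else 0.0

# A ratio near zero (as observed here, where mainly the black padding lights up)
# signals that the classifier's evidence does not coincide with the inserted Trojan.
```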


Fig. 10 a Original malware image; saliency map for layer: b Conv_0; c Conv_1; d Dense_0; e Dense_1. Brighter pixels indicate higher saliency values. Note that only the first 250 rows are shown for space considerations


6 Conclusion and Future Work Our dataset generation method was able to successfully generate two datasets for two different architectures, which will be important for achieving interpretability in machine learning and granular computing. Additionally, we were able to train a machine learning malware classifier to training and validation accuracies of 98.94% and 93.83% respectively. Our dataset is also intelligible as to specifically which features are malicious and which are benign. Although the saliency mapping results found in the Malware as Image classifier did not look as expected, we were able to compare and contrast the classification methodology used in this chapter against previously used methods, as well as speculate what factors may be beneficial to investigate in the future. We recommend that future research should involve performing the classification using this dataset at different levels of code granularity (i.e. machine code and assembly code), because that may provide some more insight into how the granularity affects the explanations generated by deep learning classifiers. Furthermore, we recommend investigating the use of different information granulation methods, such as the graph-based image representation presented by Rizzi et al. [17] and the CNN feature extraction presented by Zhao et al. [18]. The biggest limitation with the dataset generation method in its current implementation is that only one Trojan payload is used. This issue can be solved in future study, because MSFvenom can be easily configured to insert many different types of Trojan payloads. In fact, MSFvenom has the capability to use custom payloads, as well as payload obfuscation techniques. MSFvenom has the tools available to produce an extremely diverse dataset. This dataset generation method will scale based on the amount of benign files supplied, as well as the number of Trojan payloads selected for use. Increasing the diversity of the types of Trojan payloads is another area of future study, as it will prove beneficial to further testing the performance of various classification techniques. Kali Linux has a vast collection of files in its software repositories which can be freely downloaded and added to the benign set. Overall, our dataset generation method has shown promise to produce highly scalable datasets for machine learning and granular computing. In addition to malware detection, the interpretable nature of the datasets this method produces allows the dataset to be used for other tasks, such as semantic segmentation of executable files. The flexibility of the Metasploit framework allows this method to be extensible to virtually any target platform. This technique provides tremendous opportunity for achieving interpretable granular computing for malware analysis. Acknowledgements The authors would like to thank Daniel Koranek, Bayley King, and Siddharth Barve for constructive technical discussion. We would also like to thank The Design Knowledge Company, Dayton, Ohio and Air Force Research Laboratory, Wright Patterson, Ohio for funding this research under AFRL Award No. FA8650-18-C-1191.


References

1. Kaspersky: Kaspersky Lab Number of the Year: 360,000 Malicious Files Detected Daily in 2017 (2017). https://usa.kaspersky.com/about/press-releases/2017
2. Kancherla, K., Mukkamala, S.: Image visualization based malware detection. In: 2013 IEEE Symposium on Computational Intelligence in Cyber Security (CICS), pp. 40–44. IEEE (2013). ISBN: 978-1-4673-5867-5. https://ieeexplore.ieee.org/document/6597204
3. Santacroce, M.L., Koranek, D., Jha, R.: Detecting Malware code as video with compressed, time-distributed neural networks. IEEE Access 8, 132748–132760 (2020). ISSN: 2169-3536. https://ieeexplore.ieee.org/document/9145735/
4. Santacroce, M., Stegner, W., Koranek, D., Jha, R.: A foray into extracting malicious features from executable code with neural network salience. In: Proceedings of the IEEE National Aerospace Electronics Conference, NAECON 2019-July, pp. 185–191. IEEE. ISBN: 9781728114163. https://ieeexplore.ieee.org/document/9057859/
5. Raff, E., et al.: Malware detection by eating a whole EXE. arXiv:1710.09435. http://arxiv.org/abs/1710.09435 (2017)
6. Su, J., et al.: Lightweight classification of IoT Malware based on image recognition. In: 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 664–669. IEEE (2018). ISBN: 978-1-5386-2666-5. arXiv:1802.03714. https://ieeexplore.ieee.org/document/8377943/
7. Ronen, R., Radu, M., Feuerstein, C., Yom-Tov, E., Ahmadi, M.: Microsoft Malware classification challenge. arXiv:1802.10135. http://arxiv.org/abs/1802.10135 (2018)
8. Makandar, A., Patrot, A.: Malware analysis and classification using artificial neural network. In: 2015 International Conference on Trends in Automation, Communications and Computing Technology (I-TACT-15), pp. 1–6. IEEE (2015). ISBN: 978-1-4673-6667-0. http://ieeexplore.ieee.org/document/7492653/
9. Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.S.: Malware images: visualization and automatic classification. In: Proceedings of the 8th International Symposium on Visualization for Cyber Security—VizSec '11, pp. 1–7. ACM Press, New York, USA (2011). ISBN: 9781450306799. http://dl.acm.org/citation.cfm?doid=2016904.2016908
10. Garg, V., Yadav, R.K.: Malware detection based on API calls frequency. In: 2019 4th International Conference on Information Systems and Computer Networks (ISCON), pp. 400–404. IEEE (2019). ISBN: 978-1-7281-3651-6. https://ieeexplore.ieee.org/document/9036219/
11. Hu, W., Tan, Y.: Generating adversarial Malware examples for black-box attacks based on GAN. arXiv:1702.05983. http://arxiv.org/abs/1702.05983 (2017)
12. Gadhiya, S., Bhavsar, K., Student, P.D.: Techniques for Malware analysis. International Journal of Advanced Research in Computer Science and Software Engineering 3, 2277–128 (2013)
13. NSA: Ghidra. https://ghidra-sre.org/
14. Hex-Rays: IDA Pro. https://www.hex-rays.com/products/ida/
15. Pedrycz, W.: Granular computing: an introduction. In: Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569), vol. 3, pp. 1349–1354. IEEE (2001). ISBN: 0-7803-7078-3. http://ieeexplore.ieee.org/document/943745/
16. Livi, L., Sadeghian, A.: Granular computing, computational intelligence, and the analysis of non-geometric input spaces. Granular Comput. 1, 13–20 (2016)
17. Rizzi, A., Del Vescovo, G.: Automatic image classification by a granular computing approach. In: 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, pp. 33–38. IEEE, Arlington, VA (2006)
18. Zhao, H.-h., Liu, H.: Multiple classifiers fusion and CNN feature extraction for handwritten digits recognition. Granular Comput. 5, 411–418 (2020). ISSN: 2364-4966
19. Han, J., Dong, J.: Perspectives of granular computing in software engineering. In: 2007 IEEE International Conference on Granular Computing (GRC 2007), pp. 66–66. IEEE (2007). ISBN: 0-7695-3032-X. http://ieeexplore.ieee.org/document/4403068/


20. Christodorescu, M., Jha, S., Seshia, S., Song, D., Bryant, R.: Semantics-aware Malware detection. In: 2005 IEEE Symposium on Security and Privacy (S&P'05), pp. 32–46. IEEE (2005). ISBN: 0-7695-2339-0. https://ieeexplore.ieee.org/document/1425057/
21. Greer, J.: Unsupervised interpretable feature extraction for binary executables using LIBCAISE. Master's Thesis, University of Cincinnati (2019), pp. 1–51. https://etd.ohiolink.edu/pg_10?::NO:10:P10_ETD_SUBID:180585
22. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Technical Report (2009), pp. 1–13. https://www.researchgate.net/publication/265022827
23. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps, pp. 1–8 (2013). arXiv:1312.6034. http://arxiv.org/abs/1312.6034
24. Dogs vs. Cats (2013). https://www.kaggle.com/c/dogs-vs-cats/
25. Garg, A., Gupta, D., Saxena, S., Sahadev, P.P.: Validation of random dataset using an efficient CNN model trained on MNIST handwritten dataset. In: 2019 6th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 602–606. IEEE (2019). ISBN: 978-1-7281-1380-7. https://ieeexplore.ieee.org/document/8711703/
26. Kayed, M., Anter, A., Mohamed, H.: Classification of garments from fashion MNIST dataset using CNN LeNet-5 architecture. In: 2020 International Conference on Innovative Trends in Communication and Computer Engineering (ITCE), pp. 238–243. IEEE (2020). ISBN: 978-1-7281-4801-4. https://ieeexplore.ieee.org/document/9047776/
27. Metasploit Unleashed. https://www.offensive-security.com/metasploit-unleashed/
28. Thamsirarak, N., Seethongchuen, T., Ratanaworabhan, P.: A case for Malware that make antivirus irrelevant. In: 2015 12th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 1–6. IEEE (2015). ISBN: 978-1-4799-7961-5. https://ieeexplore.ieee.org/document/7206972/
29. Casey, P., et al.: Applied comparative evaluation of the metasploit evasion module. In: 2019 IEEE Symposium on Computers and Communications (ISCC) 2019-June, pp. 1–6. IEEE (2019). ISBN: 978-1-7281-2999-0. https://ieeexplore.ieee.org/document/8969663/
30. Wang, M., et al.: Automatic polymorphic exploit generation for software vulnerabilities. In: Zia, T., Zomaya, A., Varadharajan, V., Mao, M. (eds.) Security and Privacy in Communication Networks, pp. 216–233. Springer International Publishing, Cham (2013). ISBN: 978-3-319-04283-1. http://link.springer.com/10.1007/978-3-319-04283-1
31. Baggett, M.: Effectiveness of antivirus in detecting metasploit payloads. Technical Report, SANS Institute (2008)
32. Meng, G., Feng, R., Bai, G., Chen, K., Liu, Y.: DroidEcho: an in-depth dissection of malicious behaviors in Android applications. Cybersecurity 1, 4 (2018). ISSN: 2523-3246. https://cybersecurity.springeropen.com/articles/10.1186/s42400-018-0006-7
33. Liao, X., et al.: Cloud repository as a malicious service: challenge, identification and implication. Cybersecurity 1, 14 (2018). ISSN: 2523-3246. https://cybersecurity.springeropen.com/articles/10.1186/s42400-018-0015-6
34. Executable and Linkable Format (ELF) (2001). http://www.skyfree.org/linux/references/ELF
35. Metasploit. https://www.metasploit.com/
36. Stallman, R., McGrath, R., Smith, P.: GNU make. https://www.gnu.org/software/make/manual/make.html

Designing Explainable Text Classification Pipelines: Insights from IT Ticket Complexity Prediction Case Study

Aleksandra Revina, Krisztian Buza, and Vera G. Meister

Abstract Nowadays, enterprises need to handle a continually growing amount of text data generated internally by their employees and externally by current or potential customers. Accordingly, the attention of managers shifts to an efficient usage of this data to address related business challenges. However, it is usually hard to extract the meaning out of unstructured text data in an automatic way. There are multiple discussions and no general opinion in the research and practitioners' community on the design of text classification tasks, specifically the choice of text representation techniques and classification algorithms. One essential point in this discussion is about building solutions that are both accurate and understandable for humans. Being able to evaluate the classification decision is a critical success factor of a text classification task in an enterprise setting, be it legal documents, medical records, or IT tickets. Hence, our study aims to investigate the core design elements of a typical text classification pipeline and their contribution to the overall performance of the system. In particular, we consider text representation techniques and classification algorithms, in the context of their explainability, providing ultimate insights from our IT ticket complexity prediction case study. We compare the performance of a highly explainable text representation technique based on the case study tailored linguistic features with a common TF-IDF approach. We apply interpretable machine learning algorithms such as kNN, its enhanced versions, decision trees, naïve Bayes, logistic regression, as well as semi-supervised techniques to predict the ticket class label of low, medium, or high complexity. As our study shows, simple, explainable algorithms, such as decision trees and naïve Bayes, demonstrate remarkable performance results when applied with our linguistic features-based text representation. Furthermore, we note that text classification is inherently related to Granular Computing.

Keywords Text classification · Explainability · Linguistics · Machine learning · TF-IDF · IT tickets

A. Revina (B) Chair of Information and Communication Management, Faculty of Economics and Management, Technical University of Berlin, 10623 Berlin, Germany e-mail: [email protected] K. Buza Faculty of Informatics, Eötvös Loránd University, 1117 Budapest, Hungary e-mail: [email protected] A. Revina · V. G. Meister Faculty of Economics, Brandenburg University of Applied Sciences, 14770 Brandenburg an der Havel, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_10

1 Introduction

One of the most critical issues facing enterprises today is how to structure, manage, and convert large amounts of unstructured text data coming from inside or outside of the company into valuable information. Getting knowledge from unstructured text, also referred to as text analytics, has always been restricted by the ability of machines to grasp the real semantics of human language [1]. However, especially with the current technological advancements, many organizations are putting considerable effort into developing successful text analytics strategies. From the industry branch perspective [2], these are pharmaceutical, medical, and insurance companies, mainly known for using text analytics to improve product development processes [3], make medical diagnoses [4], and analyze claims or predict frauds [5], respectively. From the organizational perspective [2], these are marketing and customer service departments that typically use text analytics techniques to extract actionable insights from customer-related data such as customer call notes, service requests, or social media. The goal is to understand customer problems better, react to service issues on time, or plan process improvement strategies [6].

In text analytics tasks, such as classification, machines process the input from humans and produce the output for humans. In this regard, one key question remains the quality of the output delivered by machines. This information should build trust to ensure that humans accept the decisions made or suggested by the machine. Hence, understanding how the information was obtained plays a crucial role in creating confidence in all human–machine interaction tasks [7–11]. Lately, the term Explainable Artificial Intelligence (XAI) has been established to refer to those systems whose goal is to provide transparency on how an AI system makes its decisions and predictions and performs actions [12]. The research on XAI has attracted a lot of attention in recent years, and a number of concepts and definitions have been elaborated. For a comprehensive overview, we refer to [11].

As the title suggests, in this chapter, we aim at building explainable text classification pipelines. We show that such pipelines can be advantageous not only in the sense of their explainability but also in prediction quality. In our previous research [13–15], we performed an in-depth linguistic analysis of the IT ticket text descriptions originating from the IT ticket processing department of a big enterprise with more than 200,000 employees worldwide. Using the


elaborated linguistic representations of the ticket texts, we implemented a rule-based approach to predict the process complexity [16]. The precision of this approach reached approximately 62%. Apart from this moderate precision, the process of rule set establishment demanded a lot of analysis and testing effort on our and the process experts' side. Another substantial disadvantage of such an approach, particularly in complex scenarios, is the difficulty of maintenance and the inability to learn and scale, as adding new rules requires revising the existing ones. With the motivation to address this challenge, and especially to develop an understanding of which factors—text representation techniques or text classification algorithms—play an essential role in the prediction quality, we suggest a new analysis in this study. Hereby, we put special emphasis on explainability and consider it using the Granular Computing paradigm [17].

Specifically, in the representation of the ticket text data, we compare the performance of a commonly accepted and widely used standard, TF-IDF (Term Frequency-Inverse Document Frequency) [18], with the explainable linguistic approach described in our previous work [13, 14, 15]. Despite massive research efforts on text classification, the representation based on the linguistic features considered in this study has not been systematically compared with the TF-IDF representation in the ticket classification context. For the classification methods, we focus on various classifiers known for their explainability and widely used for text [19] and ticket [20] classification, including kNN, its enhanced versions, so-called hubness-aware classifiers [21, 22], decision trees [23], naïve Bayes [24], logistic regression [25], as well as semi-supervised techniques. The latter include kNN with self-training [26], the Semi-sUpervised ClassifiCation of TimE SerieS algorithm (SUCCESS) [27], and its improved version QuickSUCCESS [28].

Although state-of-the-art technology allows us to capture a considerable number of instances in many applications (e.g., millions of images may be observed), in our case, it is costly and difficult to obtain class labels for a large number of tickets. On the one hand, to label the tickets, expert knowledge is required. On the other hand, as various aspects have to be taken into account, even experts need much time for this task. Therefore, due to the limited amount of labeled tickets, we also experimented with semi-supervised approaches for ticket classification.

Hence, our work makes some key methodological contributions. Our extensive analysis of linguistic features, TF-IDF, and various machine learning (ML) algorithms confirms the positive influence of linguistic style predictors on the prediction quality in general. Furthermore, simple, explainable algorithms in combination with linguistic features-based text representation demonstrate excellent performance results. The managerial and practical contributions of the research are related to the following points: (i) decision support for managers and ML experts in the design of text classification pipelines for diverse enterprise applications, (ii) addressing the limitations of our rule-based approach with ML.

In the following sections, we give an overview of the acknowledged related work in the field and present in detail the research methodology, followed by the experiments on the case study datasets. Afterward, we discuss the implications of the findings and conclude with a summary and future research directions.


2 Related Work

In general, natural language in business processes has been largely studied. Different problems, such as extracting performance indicators [29] or checking process compliance [30], have been addressed. In the present chapter, we focus on the problem of designing explainable real-world text classification tasks related to an IT ticket processing case study. We structure the related work section as follows: (i) explainability and granularity, (ii) text representation, (iii) text classification, (iv) ticket classification research.

2.1 Explainability and Granularity

Interest in explanation in AI has also emerged in the press due to prominent failures. For example, in 2018, an Uber self-driving car caused the death of a pedestrian in the US [31], and in the same year, IBM Watson recommended unsafe and incorrect cancer treatments [32]. Hence, the ability to understand how such AI systems work is crucial for reliance and trust. According to [33], machines are advantageous only to the degree that their actions can be relied upon to achieve the objectives set by humans. XAI is supposed to clarify the rationale behind the decision-making process of an AI system, highlight the strengths and weaknesses of the process, and give an understanding of how the system will act in the future [12].

On the one hand, the term XAI can be considered straightforward and self-explanatory. On the other hand, confusion is brought by such terms as interpretability, transparency, fairness, explicitness, and faithfulness [11]. As a rule, they are used synonymously. However, some researchers imply a different meaning [7, 8, 34, 35], such as a quasi-mathematical one [35]. In this study, we refer to a commonly accepted definition from the Oxford English Dictionary. The dictionary does not provide definitions for the terms "explainable" or "explainability", but for "explanation": a statement, fact or situation that tells why something happened; a statement or piece of writing that clarifies how something works or makes something easier to understand [36].

To make this definition more feasible, we use the Granular Computing paradigm. In Granular Computing, one operates on the level of so-called "information granules" [17]. Granules of similar size are usually grouped in one layer. The more detailed and computationally intensive processing is required, the smaller the information granules that are created. Afterward, these granules are arranged in another layer, which results in the information processing pyramid [17]. Granular Computing has become popular in different domains. One can point out multiple research projects using Granular Computing concepts on various data types such as time series [37–39] and image data [40], as well as different application areas such as medicine [41, 42], finance [43], and manufacturing [44].


2.2 Text Representation

Text representation is one of the essential building blocks of approaches for text mining and information retrieval. It aims to numerically represent unstructured text documents to make them mathematically computable by transforming them into feature vectors [45, 46]. We studied different techniques of text representation and feature extraction and structured them into three main categories: weighted word techniques, word embedding [18], and linguistic features.

2.2.1 Weighted Words

Weighted words approaches are based on counting the words in the document and computing the document similarity directly from the word-count space [19]. Due to their simplicity, Bag-of-Words (BoW) [47] and TF-IDF [48] can be considered as the most common weighted words approaches. The employed techniques usually rely on simple BoW text representations based on vector spaces rather than in-depth linguistic analysis or parsing [49]. Nonetheless, while these models represent every word in the corpus as a one-hot-encoded vector, they are incapable of capturing the word semantic similarity and can become very large and technically challenging [19]. To address these limitations, we use linguistic features in the proposed approach. As will be described in Sect. 3, the number of our linguistic features is independent of the size of the corpus, and they are capable of capturing relevant aspects of semantics.

2.2.2 Word Embedding

With the progress of research, new methods, such as word embedding, have come up to deal with limitations of weighted words approaches. Word embedding techniques learn from sequences of words by considering their occurrence and co-occurrence information. Each word or phrase from the corpus is mapped to a high dimensional vector of real numbers and trained based on the surrounding words over a huge corpus. Various methods have been proposed, Word2Vec [50], Doc2Vec [51], GloVe [52], FastText [53, 54], contextualized word representations [55, 56] being the most significant ones. These methods do consider the semantic similarity of the words. However, they need a large corpus of text datasets for training. To solve this issue, pre-trained word embedding models have been developed [57]. Unfortunately, the models do not work for the words outside of the corpus text data. Another limitation of the word embedding models is related to the fact that they are trained based on the words appearing in a selected window size parameter. For example, window size three means three words behind and three words ahead, making up six in total. This is an inherent limitation for considering different meanings of the same word occurring in different contexts. While addressing this limitation, contextualized word representations techniques based on the context of the word in a document [55, 56]


were developed. Nonetheless, in real-world applications, new words may appear (in the description of a new ticket, the customer may use phrases and specific words that have not been used before). These new words are not included in the corpus at training time. Therefore, these word embedding techniques will fail to produce a correct mapping for these new words. While this problem may be alleviated by retraining the model, this requires a lot of data and computational resources (CPU, RAM). In contrast, as it is discussed next, it is straightforward to use our approach in case of new words as it doesn’t need training.

2.2.3 Linguistic Features

To address the aforementioned limitations of weighted words and words embeddings, text representations based on linguistic features have been introduced. Below, we list some examples. Lexicon-based sentiment analysis is a well-established text classification technique [58]. Synonymy and hypernymy are known approaches to increase prediction quality [45]. Extensive research on the linguistic analysis of ticket texts has been performed in [59, 60]. The authors use parts-of-speech (PoS) count and specific terms extractions to define the severity of the reported problem. Coussement and Van den Poel extract the following linguistic features: word count, question marks, unique words, the ratio of words longer than six letters, pronouns, negations, assents, articles, prepositions, numbers, time indication [61]. The results of the study indicated a profoundly beneficial impact of combining traditional features, like TF-IDF and singular value decomposition, with linguistic style into one text classification model. However, the authors declare a demand for more experiments and research. This demand is addressed in our work. There is no unified opinion in the research community whether the linguistic approaches are good enough to substitute or enhance the traditional text representations as they are more complex to implement and do not compensate this complexity with the expected performance increase. Both proponents [62–67] and opponents [68] of the linguistic approach provide convincing experimental results. In our work, we amend these research discussions with case study-based findings while comparing the performance of linguistic features with TF-IDF. Word embedding techniques are not applicable in our study due to the following reasons: (i) pre-trained models would not perform well considering the domain-specific vocabulary which contains many new words compared to the training corpus, (ii) at the same time, a limited text corpus would not be enough for training a new model (or retraining an existing one), (iii) industrial applications prevailingly demand explainable models to be able to understand and correct classification mistakes. Table 1 summarizes the strengths and weaknesses of the discussed techniques.


Table 1 Text representation techniques

Weighted words
  Technique: BoW, TF-IDF
  Strengths: • Simple to implement • Established approaches to extract the most descriptive terms in a document • Do not need data to train the mapping
  Weaknesses: • Do not consider syntax and semantics

Word embedding
  Technique: Word2Vec, Doc2Vec, GloVe, contextualized word representations
  Strengths: • Consider syntax and semantics • Contextualized word representations: consider polysemy
  Weaknesses: • Need much data for training • Consider only words that appear in the training data • Computationally expensive to train (CPU, RAM)

Linguistic features
  Technique: Word count, special characters, parts of speech, unique words, long words, context-specific taxonomies, lexicons, sentiment, etc.
  Strengths: • Highly explainable and understandable • Large choice of features • Do not necessarily depend on capturing semantics and context • Depending on the selected features, can capture both syntax and semantics • Do not need data to train the mapping
  Weaknesses: • Expert knowledge is required to define an appropriate set of features

2.3 Text Classification

Text classification, also referred to as text categorization or text tagging, is the task of assigning a text to a set of predefined categories [69]. Traditionally, this task has been done by human experts. Expert text classification, for example, remains widely used in qualitative research in the form of such tasks as coding or indexing [70, 71]. Nevertheless, with the growing amount of text data, the attention of the research community and practitioners shifted to automatic methods. One can differentiate three main groups of automatic text classification approaches: rule-based, ML-based, and a combination of both in a hybrid system. Below, the two main approaches are briefly discussed.

2.3.1 Rule-Based Text Classification

As the name suggests, this kind of text classification system is based on a set of rules determining classification into a set of predefined categories. In general, rulebased classifiers are a popular data mining method applicable to diverse data types. It


became popular due to its transparency and relative simplicity [72]. The researchers distinguish various types of rule development: ML-based such as decision trees [23, 73], association rules [74, 75], handcrafted rules created by the experts, or hybrid approaches [76, 77]. It is essential to mention that rule-based systems work well with small rule sets and become challenging to build, manage, and change with the growing number of rules. In our previous work (shortly mentioned in the introduction part), we conceptualized a recommender system for IT ticket complexity prediction using rule-based classification, i.e., handcrafted rules and rules based on decision trees [16]. Nonetheless, due to the complexity of the domain, the process of rule development consumed much time and effort. The system was difficult to manage and change.

2.3.2 Machine Learning-Based Text Classification

In the context of ML, text classification is a task which can be defined as follows: given a set of classification labels C and a set of training examples E, each of which has been assigned to one of the class labels in C, the system must use E to form a hypothesis to predict the class labels of previously unseen examples of the same type [78]. Hereby, in text classification, E is a set of labeled documents from a corpus. The labels can be extracted topics, writing styles, judgments of the documents’ relevance [45]. Regarding the prediction itself, various techniques such as kNN, its enhanced versions (hubness-aware classifiers), decision trees, naïve Bayes, logistic regression, support vector machine, neural networks have been introduced [19]. ML approaches have been shown to be more accurate and easier to maintain compared to rule-based systems [79]. At the same time, it is challenging to select the best ML technique for a particular application [19]. Table 2 summarizes the strengths and weaknesses of the main approaches for text classification. Most ML techniques, including all the aforementioned approaches, require a large amount of training data. However, as described previously, in our application, it is difficult to obtain labeled training data. In contrast, semi-supervised learning (SSL) allows inducing a model from a large amount of unlabeled data combined with a small set of labeled data. For an overview of major SSL methods, their advantages and disadvantages, we refer to [81]. For the above reasons, we also include semisupervised ML classifiers in our study, in particular, kNN with self-training and SUCCESS [27]. Although SUCCESS showed promising results, it doesn’t scale well to large datasets. In this chapter, we address this limitation by developing a scaling technique for SUCCESS, and we called the resulting approach QuickSUCCESS [28].

Table 2 Text classification techniques

Rule-based classification
  Technique: ML-based rule development with decision trees, association rules, handcrafted rules
  Strengths: • Able to handle a variety of input data • Explainable
  Weaknesses: • With the growing number of rules—difficult to build, manage, maintain, and change

ML-based classification
  Technique: kNN, hubness-aware classifiers
  Strengths: • Non-parametric • Adapts easily to various feature spaces
  Weaknesses: • Finding an appropriate distance function for text data is challenging • Prediction might become computationally expensive

  Technique: SUCCESS
  Strengths: • Learns with few labeled instances
  Weaknesses: • Finding an appropriate distance function for text data is challenging • Computationally expensive training

  Technique: Decision trees
  Strengths: • Fast in learning and prediction • Explainable
  Weaknesses: • Overfitting • Instability even to small variations in the data

  Technique: Naïve Bayes
  Strengths: • Showed promising results on text data [80]
  Weaknesses: • Strong assumption about the conditional independence of features

  Technique: Logistic regression
  Strengths: • Computationally inexpensive
  Weaknesses: • Does not model the interdependence between features • Not appropriate for non-linear problems

  Technique: Support vector machines
  Strengths: • Robust against overfitting • Able to solve non-linear problems
  Weaknesses: • Lack of transparency in results • Choosing an appropriate kernel function may be challenging

  Technique: Neural networks
  Strengths: • Can achieve rather accurate predictions
  Weaknesses: • Requires a large amount of data for training • May be extremely computationally expensive • Finding an efficient architecture and structure is difficult • Not explainable

2.4 Ticket Classification Research

Ticket classification in the context of software development and maintenance has been studied to tackle such challenges as correct ticket assignment and prioritization,


prediction of time and number of potential tickets, avoiding duplicates, and poorly described tickets [20, 82]. One can observe various ticket text classification [83, 84] and clustering approaches [85, 86]. The most common classification algorithms are traditional support vector machines, naïve Bayes, decision trees, logistic regression, as well as algebraic and probability theory-based algorithms. The most popular text representation methods are weighted words techniques, such as TF-IDF [20, 87, 88]. The same as in the general text classification studies [61, 68], one can find strong proponents of the linguistic features also in the IT ticket domain [59, 60]. [59, 60] performed an extensive research on the linguistic analysis of ticket text descriptions. The authors used parts-of-speech (PoS) count and specific terms extractions to define the severity of the reported problem. Their findings evidence high classification accuracy in the range of 85–91% and the characteristic PoS distributions for specific categories.

2.5 Summary

This section provided an overview of acknowledged work featuring those subject areas necessary to understand the selected research set-up and methods as well as its contributions. While many text classification techniques exist, their implementation effort and performance are data and case study dependent, which makes the choice of the best-fit solution even more difficult. This holds true both for general text classification tasks, like email classification [61], and, particularly, for IT tickets [20]. At the same time, in the text representation subject area, there is no consensus among scientists on whether the linguistic representation is more beneficial than a traditional one, for example, the most widespread TF-IDF or BoW [68]. Whereas researchers report a significant performance improvement of the linguistic approach [62–67], the need for further experiments on case study projects to be able to learn about the origin of the data [60], the exploration of more linguistic features [45], or the necessity to experiment with other text classification tasks and algorithms [61] are declared. Hence, the present study aims at shedding more light on the discussion of selected elements in the design of a text classification pipeline, putting emphasis on the importance of their explainability. Accordingly, the main purpose of the study is to improve our understanding of text classification applications in the context of the discussed text representation techniques and classification algorithms.


3 Methods

3.1 Feature Extraction

Overall, texts belong to an unstructured type of data. However, if one would like to use any kind of mathematical modeling and computer-enabled analysis, the unstructured text must be transformed into a structured feature space. The data needs to be pre-processed to remove unnecessary characters and words. Afterward, diverse feature extraction techniques can be applied to get a machine-understandable text representation. In our study, we compare two feature extraction techniques: the common TF-IDF and the linguistic features-based approach elaborated in our previous work.

Term frequency (TF) weights measuring the frequency of occurrence of the terms in the document or text have been applied for many years for automatic indexing in the IR community [89]. TF alone produces low-quality search precision when high-frequency terms are equally spread over the whole text corpus. Hence, a new factor, inverse document frequency (IDF), was introduced to reflect the number of documents n to which a term is assigned in a collection of N documents; it can be computed as log(N/n) [48]. Taking into account both factors, a reasonable measure of text representation can be obtained using the product of TF and IDF [90]. TF-IDF is considered to be one of the most widely used text representation techniques [20] due to its simplicity and the good search precision that can be achieved. Following the popularity of this technique, especially in the IT ticket domain, as well as for the reasons listed in Sect. 2, we perform the feature extraction procedure correspondingly and represent each ticket text in our case study corpus as a TF-IDF numerical vector.

To compare TF-IDF with a more transparent and human-understandable technique, and as encouraged by the promising results of other researchers [59–61] and the growing importance of XAI research [91, 92], especially in high risk and high consequence decision-making domains [93], we implement an explainable linguistic approach based on our own case study specific set of linguistic features (see Table 3). These were designed for the classification task of IT ticket complexity prediction. Three levels of text understanding are commonly distinguished: (1) objective (answering the questions who, what, where, when), measured by semantic technologies; (2) subjective (an emotional component of a text), measured by sentiment analysis; and (3) meta-knowledge (information about the author outside of the text), measured by stylometry or stylistic analysis [94]. Accordingly, we develop a set of features which are aggregated by the respective measures indicating the IT ticket complexity. We proposed these features in our initial works [13, 14, 15].1 A detailed explanation of the linguistic features is provided below in text using an anonymized IT ticket example and summarized in Table 3.

1 https://github.com/IT-Tickets-Text-Analytics.
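As a rough illustration of the TF-IDF baseline described above, the following minimal sketch, assuming scikit-learn and a few hypothetical pre-processed ticket strings (not taken from the case study data), turns a small corpus into TF-IDF vectors. Note that scikit-learn's TfidfVectorizer applies a smoothed variant of the log(N/n) IDF factor by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already pre-processed ticket texts (stop words removed, lowercased, stemmed)
tickets = [
    "refresh servic registri server see attach detail",
    "updat firewal rule test server",
    "plan downtim migrat applic databas cluster",
]

vectorizer = TfidfVectorizer()            # TF counting combined with IDF re-weighting
X_tfidf = vectorizer.fit_transform(tickets)

print(X_tfidf.shape)                      # (number of tickets, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:5]) # a few terms of the learned vocabulary
```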


Table 3 Overview of linguistic features illustrated by a ticket example

Ticket example: "Refresh service registry on the XYZ-ZZ YYY server. See attachment for details."

Aspects                          | Description                                                                                           | Linguistic feature
Objective knowledge aspect [15]  | Relative occurrence of words according to the taxonomy of Routine, semi-cognitive and cognitive terms | Routine = 0.8; Semi-cognitive = 0.2; Cognitive = 0
Subjective knowledge aspect [14] | Relative occurrence of words with positive, neutral and negative sentiment                            | Negative = 0; Neutral = 1; Positive = 0
Meta-knowledge aspect [13]       | Word count                                                                                            | 12
                                 | Occurrence of nouns in all words                                                                      | 0.5
                                 | Occurrence of unique nouns in all nouns                                                               | 1
                                 | Occurrence of verbs in all words                                                                      | 0.17
                                 | Occurrence of unique verbs in all verbs                                                               | 1
                                 | Occurrence of adjectives in all words                                                                 | 0.07
                                 | Occurrence of unique adjectives in all adjectives                                                     | 1
                                 | Occurrence of adverbs in all words                                                                    | 0
                                 | Occurrence of unique adverbs in all adverbs                                                           | 0
                                 | Wording style                                                                                         | 0 (no repeating words)

3.1.1 Objective Knowledge Aspect

Core research in Natural Language Processing (NLP) addresses the extraction of objective knowledge from text, i.e., which concepts, attributes, and relations between concepts can be extracted from text, including specific relations such as causal, spatial, and temporal ones [94]. Among diverse approaches, taxonomies and ontologies in particular are widely used in the business context [95, 96]. Thus, we suggest a specific approach of objective knowledge extraction using the Decision-Making Logic (DML) taxonomy [15], illustratively presented in Appendix I. It aims to discover the decision-making nature of activities, called the DML level. We use the following DML levels: routine, semi-cognitive, and cognitive (corresponding to the columns of the table in Appendix I). Using the Latent Dirichlet Allocation (LDA) algorithm [97], we identify the most important keywords, see [15] for details. Each of the keywords is associated with a DML level. For example, the keywords


user, test, request, etc. are associated with routine, whereas the keyword management belongs to cognitive. We detect these keywords in IT ticket texts. Based on the total number of detected keywords, we calculate the relative occurrence of the words of each category in the ticket text. In the example shown in Table 3, the total number of detected words equals five, out of which four words belong to routine (server, attach, detail, see), one to semi-cognitive (service), and no word to cognitive. Thus, the corresponding features are calculated as follows: routine: 4/5 = 0.8, semi-cognitive: 1/5 = 0.2, and cognitive: 0/5 = 0.
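A minimal sketch of this computation is given below; the keyword-to-level mapping is a hypothetical excerpt standing in for the DML taxonomy of Appendix I, and the token list mimics the pre-processed ticket example of Table 3. The sentiment proportions of Sect. 3.1.2 are computed analogously, with the domain-specific lexicon of Appendix II in place of the taxonomy.

```python
# Hypothetical excerpt of the DML taxonomy (keyword -> DML level); see Appendix I
DML_TAXONOMY = {
    "user": "routine", "test": "routine", "request": "routine", "server": "routine",
    "attach": "routine", "detail": "routine", "see": "routine",
    "service": "semi-cognitive",
    "management": "cognitive",
}

def dml_features(tokens):
    """Relative occurrence of routine, semi-cognitive and cognitive keywords
    among all taxonomy keywords detected in a pre-processed ticket text."""
    detected = [DML_TAXONOMY[t] for t in tokens if t in DML_TAXONOMY]
    levels = ("routine", "semi-cognitive", "cognitive")
    if not detected:
        return {level: 0.0 for level in levels}
    return {level: detected.count(level) / len(detected) for level in levels}

# Ticket example of Table 3 (simplified tokenization)
tokens = ["refresh", "service", "registry", "server", "see", "attach", "detail"]
print(dml_features(tokens))  # {'routine': 0.8, 'semi-cognitive': 0.2, 'cognitive': 0.0}
```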

3.1.2 Subjective Knowledge Aspect

To extract the subjective knowledge aspect, we perform sentiment analysis [98]. In [14], we suggest a specific approach of business sentiment as an instrument for measuring the emotional component of an IT ticket. This latent information is extracted from the unstructured IT ticket texts with the help of a lexicon-based approach. As standard lexicons do not work well in our context of IT ticket classification, using the state-of-the-art VADER [99] and LDA, we developed a domainspecific lexicon, see [14] for details. Our lexicon can also be found in Appendix II. Each of the words in the lexicon is associated with a positive, negative, or neutral sentiment. In particular, words with valence scores greater than 0 are considered to be positive, whereas words with valence score less than 0 are considered to be negative. All other words with 0 valence score are considered to have a neutral sentiment. For each IT ticket text, we determine the proportion of words with negative, neutral, and positive sentiment and use these values as features. In our example, there are no words with positive or negative sentiment. Therefore, the ticket is assigned to be entirely neutral.

3.1.3 Meta-Knowledge Aspect

In our case, meta-knowledge is the knowledge about the author of an IT ticket. The quality of the written text will likely depend on such factors as the author’s professionalism and expertise, level of stress, and various psychological and sociological properties [94]. To extract the meta knowledge aspect, we suggest the following features [13]: (1) IT ticket text length, (2) PoS features, (3) wording style [13] calculated with the Zipf’s law of word frequencies [100]. By length, we mean the number of words in the IT ticket text. This feature is motivated by the following observation: in most cases, IT tickets texts containing a few words, such as update firewalls, refer to simple daily tasks. Therefore, short ticket texts may be an indication of the low complexity of the ticket. In the example shown in Table 3, the length of the ticket is 12 words. As for PoS features, we consider the following PoS tags: nouns, verbs, adjectives, and adverbs. For each of them, we calculate their absolute occurrence, i.e., the total


number of words having that PoS tag (for example, the total number of nouns). Subsequently, we calculate their relative occurrence, called occurrence for simplicity, i.e., the ratio of nouns, verbs, adjectives, and adverbs relative to the length of the text. We use these occurrences as features. In the example shown in Table 3, the occurrence of nouns (registry, XYZ-ZZ, YYY, server, attachment, details) in all words is 6/12 = 0.5. We also calculate the number of unique words having the considered PoS tags (for example, number of unique nouns). Then, we calculate the occurrence of unique nouns, verbs, adjectives, and adverbs relative to the number of all nouns, verbs, adjectives, and adverbs, respectively. We use these occurrences as features as well. In Table 3, the uniqueness of nouns is 6/6 = 1 (no repeating words). According to Zipf’s word frequency law, the distribution of word occurrences is not uniform, i.e., some words occur very frequently. In contrast, others appear with a low frequency, such as only once. Our wording style feature describes how extreme is this phenomenon in the IT ticket text, i.e., whether the distribution of occurrences is close to a uniform or not. For details, we refer to [13].
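The following sketch illustrates the length and PoS-based features using NLTK as a stand-in tagger (the chapter does not prescribe a specific NLP toolkit, and the exact counts of Table 3 depend on the tokenization used); the wording-style feature based on Zipf's law is omitted here and described in [13].

```python
from nltk import pos_tag, word_tokenize
# Requires the NLTK resources for tokenization and PoS tagging,
# e.g. nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").

POS_GROUPS = {"noun": "NN", "verb": "VB", "adjective": "JJ", "adverb": "RB"}

def meta_features(text):
    """Word count, relative PoS occurrences and uniqueness ratios of a ticket text."""
    tokens = [t for t in word_tokenize(text) if any(ch.isalnum() for ch in t)]
    tagged = pos_tag(tokens)
    features = {"word_count": len(tokens)}
    for name, tag_prefix in POS_GROUPS.items():
        words = [w.lower() for w, tag in tagged if tag.startswith(tag_prefix)]
        features[f"{name}_occurrence"] = len(words) / len(tokens) if tokens else 0.0
        features[f"unique_{name}s"] = len(set(words)) / len(words) if words else 0.0
    return features

print(meta_features("Refresh service registry on the XYZ-ZZ YYY server. "
                    "See attachment for details."))
```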

3.2 Machine Learning Classifiers

After the features have been extracted and the texts are represented in the form of numerical vectors, they can be fed into ML classifiers. Choosing the classifier is widely acknowledged to be an essential step of any text classification pipeline. In our study design, we put a high value on the explainability of the classification results. As stated by [101], ML has demonstrated its high potential in improving products, services, and businesses as a whole; however, machines do not explain their predictions, which is a barrier to ML adoption. Hence, from the existing classifiers, we focus on simple algorithms known for their explainability, such as kNN, decision trees, naïve Bayes, and logistic regression [101]. In an attempt to improve the classification quality, we experiment with enhanced versions of kNN (hubness-aware classifiers) as well as its semi-supervised variations (kNN with self-training, SUCCESS, and QuickSUCCESS).

3.2.1 Standard kNN

Standard kNN is a non-parametric algorithm widely used both for classification in general and for text classification in particular [102]. It is simple to implement, works well with multi-class labels, and adapts easily to various feature spaces [19], which also determines the choice of the algorithm in this study.

3.2.2 Hubness-Aware Classifiers

One problem of kNN is related to the presence of so-called bad hubs, instances that are similar to a large number of other instances and may therefore mislead the classification [103]. A hub is considered to be bad if its class label differs from the class labels of many of those instances that have this hub as their nearest neighbor. Thus, bad hubs have proven to be responsible for a surprisingly high fraction of the total classification error. To address this challenge, several hubness-aware classifiers have been proposed recently [21, 104]: kNN with error correction (ECkNN) [103, 105], hubness-weighted kNN (HWkNN) [106], hubness-fuzzy kNN (HFNN) [107], and naive hubness-Bayesian kNN (NHBNN) [108]. We test these approaches to compare their performance with the standard kNN.

In text data, document length can be correlated with hubness, so that shorter or longer documents may tend to become hubs [109]. The length of the tickets in our case study varies from 5–10 words to 120–150. This is another reason for the choice of hubness-aware classifiers. Moreover, the datasets contain prevailingly English texts and a small portion of German-language texts. According to [110], document relevance is preserved across languages, and it is possible to approximate document hubness if given access to data translations. Additionally, based on the results of previous research [111], we believe that hubness-aware classifiers could successfully address the problem of an imbalanced dataset, which is also our case. The mentioned class imbalance in our dataset can be explained by the following reasons: (i) the nature of the case study and the dataset itself, i.e., many tickets of low complexity and considerably fewer of high complexity; (ii) the tickets for the labeling task were selected randomly.
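To make the notion of a bad hub concrete, the sketch below (a naive illustration with Euclidean distance, not the PyHubs implementation used later in the experiments) counts, for every instance, how often it appears among the k nearest neighbors of instances with a different class label; instances with a high count are bad hubs.

```python
import numpy as np

def bad_hub_scores(X, y, k=5):
    """For each instance, count how often it occurs among the k nearest
    neighbors of instances carrying a different class label."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # an instance is not its own neighbor
    bad = np.zeros(n, dtype=int)
    for i in range(n):
        neighbors = np.argsort(dist[i])[:k]  # k nearest neighbors of instance i
        for j in neighbors:
            if y[j] != y[i]:
                bad[j] += 1                  # j may mislead the classification of i
    return bad

# Toy usage: instances with large scores are the ones that hubness-aware
# classifiers such as ECkNN, HWkNN, HFNN, or NHBNN treat specially.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = rng.integers(0, 3, size=30)
print(bad_hub_scores(X_demo, y_demo, k=3))
```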

3.2.3 Semi-supervised Approaches—kNN with Self-Training and SUCCESS

One more challenge we encounter is the very small quantity of labeled data. This may be addressed by semi-supervised ML (SSML). In this study, we implement two SSML approaches, which are discussed below.

First, we use the simple and effective technique of self-training [112] that has been successfully applied in many real-world scenarios. A classifier, in our case kNN, is trained with a small number of labeled data and iteratively retrained with its own most confident predictions [26].

Second, we implement SUCCESS [27], an SSML approach that shows promising results on time series datasets. We believe that its potential has not been exploited in the context of text classification. Therefore, we adapt it to ticket classification and evaluate its performance. Below, we shortly review the SUCCESS approach.

We define the semi-supervised classification problem as follows: given a set of labeled instances L = {(x_i, y_i)}_{i=1..l} and a set of unlabeled instances U = {x_i}_{i=l+1..n}, the task is to train a classifier using both L and U. We use the phrase set of labeled training instances to refer to L; x_i is the i-th instance and y_i is its label, whereas we say that U is the set of unlabeled training instances. The labeled instances (elements of


L) are called seeds. We wish to construct a classifier that can accurately classify any instance, i.e., not only elements of U. For this problem, we proposed the SUCCESS approach, which has the following phases:

1. The labeled and unlabeled training instances (i.e., instances of U and L) are clustered with constrained single-linkage hierarchical agglomerative clustering. While doing so, we include cannot-link constraints for each pair of labeled seeds, even if both seeds have the same class labels. In our previous work [27], we measured the distance of two instances (time series) as their Dynamic Time Warping distance.
2. The resulting top-level clusters are labeled by their corresponding seeds.
3. The final classifier is a 1-nearest neighbor classifier trained on the labeled data resulting at the end of the 2nd phase. This classifier can be applied to unseen test data.
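The following minimal sketch illustrates these three phases under simplifying assumptions: Euclidean distance replaces Dynamic Time Warping, the constrained single-linkage clustering is implemented naively, and all names are illustrative rather than the authors' original code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def success_fit(X_lab, y_lab, X_unlab):
    """Simplified SUCCESS sketch: (1) constrained single-linkage agglomerative
    clustering with cannot-link constraints between all pairs of seeds,
    (2) labeling of the top-level clusters by their seeds, (3) 1-NN trained on
    the propagated labels. Euclidean distance is used here instead of DTW."""
    X_lab, y_lab = np.asarray(X_lab, dtype=float), np.asarray(y_lab)
    X = np.vstack([X_lab, np.asarray(X_unlab, dtype=float)])
    l, n = len(X_lab), len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    clusters = [{i} for i in range(n)]                   # phase 1: start from singletons

    def n_seeds(cluster):
        return sum(1 for i in cluster if i < l)          # seeds occupy the first l indices

    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if n_seeds(clusters[a]) + n_seeds(clusters[b]) > 1:
                    continue                             # cannot-link: never merge two seeds
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:                                 # no admissible merge remains
            break
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]

    y_all = np.empty(n, dtype=y_lab.dtype)               # phase 2: propagate seed labels
    for cluster in clusters:
        seed = next(i for i in cluster if i < l)
        y_all[list(cluster)] = y_lab[seed]

    return KNeighborsClassifier(n_neighbors=1).fit(X, y_all)   # phase 3
```

Calling success_fit(...).predict(...) then classifies unseen tickets; the naive pairwise search makes this sketch suitable only for small datasets, which is exactly the scalability issue addressed in the next subsection.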

3.2.4 Resampling Approaches—kNN with Resampling and QuickSUCCESS

As we noticed the rather slow training speed of the semi-supervised classifiers, we implemented resampling [113], a technique similar to bagging, to accelerate the classification and possibly improve the classification results [114]. In particular, we select a random subset of the data and train the model on the selected instances. This process is repeated several times. When making predictions for new instances, the predictions of all the aforementioned models are aggregated by majority voting. As the sample size is much smaller than the size of the original dataset and the training has a superlinear complexity, the training of several models on small samples is computationally cheaper than the training of the model on the entire dataset. While kNN with resampling has long been established in the research community [115], to the best of our knowledge, ours was the first attempt to speed up the SUCCESS algorithm using resampling. We call the resulting approach QuickSUCCESS [28]. Next, we explain QuickSUCCESS in detail.

The core component of SUCCESS is constrained hierarchical agglomerative clustering. Although various implementations are possible, we may assume that it is necessary to calculate the distance between each pair of unlabeled instances as well as between each unlabeled instance and each labeled instance. This results in a theoretical complexity of at least O(l(n - l) + (n - l)^2) = O(n(n - l)), where l denotes the number of labeled instances (l = |L|), while n denotes the number of all instances (labeled and unlabeled) that are available at training time (n = |L| + |U|). Under the assumption that l is a small constant, the complexity of the distance computations is O(n^2). Considering only the aforementioned distance computations required for SUCCESS, the computational costs in case of a dataset containing n/c instances are c^2-times lower than in case of a dataset containing n instances. Therefore, even if the computations have to be performed on several "small" datasets, the overall computational costs may be an order of magnitude lower. In particular, computing SUCCESS m times on a dataset containing r instances has an overall computational


cost of O(m × r^2). Under the assumption that r = n/c and m ≈ O(c), the resulting complexity is O(n^2/c).

Based on the analysis, we propose to speed up SUCCESS by repeated sampling. In particular, we sample the set of unlabeled instances m times. From now on, we will use U^(j) to denote the j-th sample, 1 ≤ j ≤ m. Each U^(j) is a random subset of U containing r instances (|U^(j)| = r). For simplicity, each instance has the same probability of being selected. To train the j-th classifier, we use all the labeled instances in L together with U^(j). When sampling the data, the size r of the sample should be chosen carefully so that the sampled data is representative in the sense that the structure of the classes can be learned from the sampled data. This is illustrated in Fig. 1.

Fig. 1 When sampling the data, the size r of the sample should be chosen carefully so that the sampled data is representative

As the sampling is repeated m times, we induce m classifiers denoted as C^(1), …, C^(m). Each classifier C^(j) predicts the label for the unlabeled instances in the corresponding sample of the data, i.e., for the instances of U^(j). More importantly, each of these classifiers can be used to predict the label of new instances, i.e., instances that are neither in L nor in U. Labels for the instances in U are predicted as follows. For each x_i ∈ U, we consider all those sets U^(j) for which x_i ∈ U^(j) and the classifiers that were trained using these datasets. The majority vote of these classifiers is the predicted label of x_i. Our approach is summarized in Algorithm 1. The label of a new instance x ∉ L ∪ U is predicted as the majority vote of all the classifiers C^(1), …, C^(m). We note that the computations related to the aforementioned classifiers C^(1), …, C^(m) can be executed in parallel, which may result in additional speed-up in case of systems where multiple CPUs are available, such as high-performance computing systems.
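A minimal sketch of this resampling scheme is shown below; the pseudocode of Algorithm 1 is not reproduced in this extract, so the sketch follows the textual description. It reuses the success_fit sketch from Sect. 3.2.3, and all parameter names are illustrative.

```python
import numpy as np

def quick_success_fit(X_lab, y_lab, X_unlab, m=100, r=100, seed=0):
    """Train m SUCCESS models, each on all labeled instances plus a random
    sample of r unlabeled instances (cf. the r = 100, m = 100 setting in Sect. 4.3)."""
    rng = np.random.default_rng(seed)
    X_unlab = np.asarray(X_unlab, dtype=float)
    models = []
    for _ in range(m):
        idx = rng.choice(len(X_unlab), size=min(r, len(X_unlab)), replace=False)
        models.append(success_fit(X_lab, y_lab, X_unlab[idx]))
    return models

def quick_success_predict(models, X_new):
    """Predict by majority vote over the m classifiers."""
    votes = np.stack([clf.predict(np.asarray(X_new, dtype=float)) for clf in models])
    predictions = []
    for column in votes.T:                        # one column per new instance
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```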


Fig. 2 Overview of text classification pipeline

3.2.5 Decision Trees, Naïve Bayes, Logistic Regression

We also experimented with decision trees, naïve Bayes, and logistic regression, a group of algorithms, commonly used both in general text mining [19] and ticket classification [20] context and recognized as explainable approaches. Decision trees perform effectively with qualitative features, which is the case of the ticket classification case study. They are known for their interpretability, also an important aspect for the case study, and fast in learning and prediction [19]. Naïve Bayes and logistic regression are chosen as traditional classifiers working well with text data. These classifiers can be easily and quickly implemented [18]. Thus, it may be relatively simple to integrate them into industrial systems.
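The explainability argument can be made tangible: with scikit-learn (an assumption of this sketch; the chapter does not prescribe a specific library), each of these models exposes a human-readable artifact, e.g., the learned decision rules or per-feature weights. The dataset below is a stand-in for the linguistic feature vectors of the case study.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))   # explicit if-then rules

nb = GaussianNB().fit(X, y)
print(nb.theta_)            # per-class feature means the prediction is based on

logreg = LogisticRegression(max_iter=1000).fit(X, y)
print(logreg.coef_)         # per-feature weights, usable for feature ranking (Sect. 4.2)
```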

4 Experimental Evaluation

Our experimental design is summarized in Fig. 2 and includes the following steps: (1) collect the case study data; (2) pre-process the data; (3) label part of the data; (4) split the data into training and test sets; (5) extract two sets of features: linguistic features and TF-IDF; (6) apply the classifier; (7) evaluate the results with common metrics. All the experiments were conducted using Python 3.6 and PyHubs2 on an Intel® Core™ i7 machine with 16 GB RAM.

2 https://www.biointelligence.hu/pyhubs/.

4.1 Case Study and Datasets

Being one of the fastest-growing sectors of the service economy, the enterprise IT service subject area has gained in importance both in research and practice. One popular stream of this research is IT service management (ITSM) [116, 117], which focuses on the servitization of the IT function, organization and management of IT

service provision [118]. For a successful establishment of organizational infrastructure for IT services, IT practitioners implement IT service delivery with a reference process—the IT Infrastructure Library (ITIL) [119, 120]. Separate areas of ITIL, such as Incident, Problem, or Change management (CHM), deal with a large amount of unstructured text data, i.e., tickets issued in relation to the incidents, problems, or changes in the IT infrastructure products and services. These tickets serve as a case study in the present work.

We obtained two datasets (see Table 4) in the form of IT ticket texts originating from an ITIL CHM department of a big enterprise (step 1). They were received according to their availability and covered the whole period obtainable at the moment of this study. The first dataset (Data1) comprised 28,243 tickets created in 2015–2018. The data was pre-processed (step 2) by removing stop words and punctuation, turning to lowercase, and stemming, and was converted into a CSV-formatted corpus of ticket texts. The ticket texts were prevailingly in English (more than 80% English words; a small portion of German words was present). The second dataset (Data2) comprised 4,684 entries in prevailingly English language from the period January to May 2019. The length of the IT ticket texts varied from 5–10 words to 120–150.

As labeling of the data is a time-consuming and tedious process and our data comes from an operational unit fully overbooked in the daily work, which is often the case in the industry, we managed to acquire 30 and 60 labels for Data1 and Data2, respectively (step 3). To provide a correct label, the case study workers needed to analyze all the factors influencing the IT ticket complexity: number of tasks; number of Configuration Items (CIs), specifically applications; whether the ticket had to be executed online or offline (planning of downtime and considering affected CIs); involvement of the Change Advisory Board (CAB); etc. Therefore, labeling a single ticket could take up to 30 min. For the complexity class labels, a qualitative scale of low, medium, and high has been selected to simplify the classification task and as a well-known scale of priority ratings [121]. Although the two datasets come from the same case, the distinct time periods and numbers of labels allowed us to test our approaches in two independent settings.

Table 4 Datasets

#      Time period         # of ticket texts   # of acquired labels
Data1  2015–2018           28,243              30
Data2  January–May 2019    4,684               60

4.2 Experimental Settings

To evaluate classifiers, we use leave-one-out cross-validation [122]. We select one of the labeled instances as test instance and use all the remaining instances as training


data (step 4). The process of selecting a test instance and training the classifier using all the other instances is repeated as many times as the number of labeled instances, so that in each round a different instance is selected as the test instance. As evaluation metrics, we use accuracy, average precision, average recall, and F-score (step 7). To compare the differences in the classifiers' performance, we use the binomial test by Salzberg [123] at the significance threshold of 0.05.

When applying semi-supervised classifiers, we consider all the unlabeled instances as available at training time. In the case of kNN with self-training, we set the number of self-training iterations to 100. In the experiments, we tested various distance measures (such as Euclidean distance, Cosine distance, and Manhattan distance) and different k values, i.e., 1, 3, 5, 7, and 9. Hereby, we could not detect major differences. Hence, we use kNN with k = 1 and Euclidean distance, which is also justified by theoretical and empirical studies [124, 125].

In the context of the explainability of text representation techniques, specifically linguistic features, we include feature selection tests in our experiments to identify which features play an important role in prediction quality. Similarly to [126, 127], we use logistic regression to determine the most predictive features based on its weights. Subsequently, we train our classifiers using the selected set of best-performing features to compare the changes in prediction results.
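A condensed sketch of this evaluation protocol is shown below, assuming scikit-learn and a generic labeled feature matrix (the data and classifier configuration are placeholders, and the Salzberg binomial test is omitted); following the chapter, the F-score is derived from the macro-averaged precision and recall.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def evaluate_loocv(classifier, X, y):
    """Leave-one-out cross-validation: every labeled ticket is the test instance once."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    predictions = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = clone(classifier).fit(X[train_idx], y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    accuracy = accuracy_score(y, predictions)
    precision, recall, _, _ = precision_recall_fscore_support(
        y, predictions, average="macro", zero_division=0)
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f_score

# Toy usage with k = 1 and Euclidean distance, as in the experiments
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 14))                      # e.g., 14 linguistic features
y_demo = rng.choice(["low", "medium", "high"], size=30)
print(evaluate_loocv(KNeighborsClassifier(n_neighbors=1), X_demo, y_demo))
```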

4.3 Comparison of SUCCESS and QuickSUCCESS

As an initial experiment, we measured the execution time of SUCCESS and QuickSUCCESS. We used the linguistic features representation. In the case of the QuickSUCCESS approach, we selected r = 100 instances and trained m = 100 models. Table 5 shows our results: the execution time of one round of cross-validation (in seconds) and the accuracy. As one can see, the proposed technique leads to several orders of magnitude speed-up, both in case of Data1 and Data2, with negligible loss of prediction quality (the difference corresponds to the misclassification of one additional instance). Hence, in further experiments, we used the QuickSUCCESS algorithm.

Table 5 Execution time and accuracy of SUCCESS and QuickSUCCESS

                      Time (s)   Accuracy
Data1  SUCCESS        75,831     0.600
       QuickSUCCESS   2          0.523
Data2  SUCCESS        586        0.767
       QuickSUCCESS   3          0.767


4.4 Results

In this section, we present our experimental results regarding text representation and text classification techniques on the two case study datasets. In Table 6, we provide the obtained values of accuracy and the average of precision and recall calculated over the three prediction classes of low, medium, and high complexity. The F-score is calculated based on the average precision and recall. With the help of these values, we analyze how far the choice of text representation techniques and ML algorithms influences the quality of prediction. We compare the performance of classifiers using linguistic features with that of classifiers using TF-IDF on both Data1 and Data2.

As stated in the Experimental Settings subsection, using logistic regression, we identified the most predictive features for Data1 and Data2. According to the weights of logistic regression, in the case of Data1, the five most important features are (1) relative occurrence of cognitive words, (2) occurrence of unique adjectives in all adjectives, (3) wording style, (4) occurrence of unique verbs in all verbs, and (5) relative occurrence of words with positive sentiment. In the case of Data2, the five most important features are (1) relative occurrence of cognitive words, (2) occurrence of unique adjectives in all adjectives, (3) occurrence of unique verbs in all verbs, and the relative occurrence of words with (4) negative and (5) positive sentiment. As can be concluded, the most essential feature appeared to be the relative occurrence of cognitive words, further referred to as the cognitive words feature. After that, we trained the classifiers with the selected linguistic features. The results are summarized in Table 7. Note that in both experimental settings, on Data1 and Data2, we made prevailingly consistent observations, which are discussed below.

Text representation techniques
We compared linguistic features and TF-IDF while applying various classifiers in our text classification pipeline (Fig. 2). We point out a systematic improvement in the prediction quality of the algorithms when using linguistic features. Whenever the only difference between two experiments was the text representation technique, we observed that the classifiers' performance using linguistic features was almost always higher than in the case of using TF-IDF.

ML classifiers
Due to the higher number of labeled tickets, Data2 revealed an expected systematic classification quality increase compared to Data1, independently of the applied algorithm or text representation technique. As can be seen in Tables 6 and 7, the usage of our linguistic features delivers excellent performance with simple algorithms, such as decision trees and naïve Bayes. Both of them are statistically significantly better than the other classifiers. Additionally, we observed that selecting the best set of features improves the performance of the classifiers consistently. Hence, using a smaller set of features also simplifies the process of extraction and reduces computational costs.


Table 6 Accuracy, average precision, average recall and F-score of the examined classifiers using linguistic features and TF-IDF

Algorithm                        Accuracy   Average precision   Average recall   F-score

Data1: linguistic features
standard kNN                     0.533      0.526               0.498            0.511
kNN self-training                0.533      0.526               0.498            0.511
kNN self-training & resampling   0.533      0.526               0.498            0.511
ECkNN                            0.633      0.599               0.580            0.589
HFNN                             0.600      0.411               0.490            0.447
HWkNN                            0.533      0.526               0.498            0.511
NHBNN                            0.433      0.144               0.333            0.202
decision trees                   1.000      1.000               1.000            1.000
naïve Bayes                      0.967      0.972               0.944            0.958
logistic regression              0.500      0.475               0.463            0.469
QuickSUCCESS                     0.533      0.542               0.523            0.532

Data1: TF-IDF
standard kNN                     0.200      0.067               0.333            0.111
kNN self-training                0.200      0.067               0.333            0.111
kNN self-training & resampling   0.200      0.067               0.333            0.111
ECkNN                            0.433      0.144               0.333            0.202
HFNN                             0.433      0.144               0.333            0.202
HWkNN                            0.200      0.067               0.333            0.111
NHBNN                            0.433      0.144               0.333            0.202
decision trees                   0.433      0.144               0.333            0.202
naïve Bayes                      0.267      0.738               0.389            0.510
logistic regression              0.433      0.144               0.333            0.202
QuickSUCCESS                     0.200      0.067               0.333            0.111

Data2: linguistic features
standard kNN                     0.750      0.593               0.576            0.584
kNN self-training                0.750      0.593               0.576            0.584
kNN self-training & resampling   0.750      0.593               0.576            0.584
ECkNN                            0.733      0.586               0.568            0.577
HFNN                             0.733      0.539               0.528            0.534
HWkNN                            0.750      0.593               0.576            0.584
NHBNN                            0.700      0.233               0.333            0.275
decision trees                   1.000      1.000               1.000            1.000
naïve Bayes                      1.000      1.000               1.000            1.000
logistic regression              0.800      0.552               0.582            0.567
QuickSUCCESS                     0.750      0.593               0.576            0.584

Data2: TF-IDF
standard kNN                     0.700      0.233               0.333            0.275
kNN self-training                0.700      0.233               0.333            0.275
kNN self-training & resampling   0.700      0.233               0.333            0.275
ECkNN                            0.700      0.233               0.333            0.275
HFNN                             0.700      0.233               0.333            0.275
HWkNN                            0.700      0.233               0.333            0.275
NHBNN                            0.700      0.233               0.333            0.275
decision trees                   0.700      0.233               0.333            0.275
naïve Bayes                      0.467      0.419               0.468            0.442
logistic regression              0.700      0.233               0.333            0.275
QuickSUCCESS                     0.700      0.233               0.333            0.275

As mentioned above, and according to the weights of logistic regression, the most important feature in the ticket text is the cognitive words feature.

When enhancing kNN with self-training, a semi-supervised technique and a known way to improve learning when many unlabeled but only few labeled instances are available, we expected a noticeable increase in prediction quality. Nonetheless, the evaluation results showed no improvement. Considering hubness-aware classifiers, in some cases the selected hubness-aware classifiers performed better than or equal to standard kNN. Additionally, with the TF-IDF representation on Data2, the performance of almost all algorithms stayed at the same level (0.700, 0.233, 0.333, and 0.275 for accuracy, precision, recall, and F-score, respectively). Due to the small number of labeled instances and the dominance of low complexity tickets in the labeled set, our dataset can be described as imbalanced. This is probably why all classifiers (except naïve Bayes) simply predicted the dominant class of low complexity.
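The plateau of 0.700, 0.233, 0.333, and 0.275 reported for most TF-IDF classifiers on Data2 is exactly what a majority-class predictor produces when roughly 70% of the labeled tickets are of low complexity. The following sketch (assuming scikit-learn and an illustrative 70/20/10 class split, which is not stated in the chapter) reproduces these macro-averaged values:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test set with 70% low-complexity tickets; the exact split of the
# remaining two classes does not change the macro averages computed below.
y_true = ["low"] * 70 + ["medium"] * 20 + ["high"] * 10
y_pred = ["low"] * 100  # a classifier that always predicts the dominant class

acc = accuracy_score(y_true, y_pred)                                      # 0.700
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)  # 0.233
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)      # 0.333
f = 2 * prec * rec / (prec + rec)  # F-score from averaged precision/recall, as in Table 6
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f, 3))          # 0.7 0.233 0.333 0.275
```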

5 Discussion

In this study, our focus was to gain a better understanding of the factors that influence prediction quality in text classification tasks, viewed through the lens of XAI and granularity.


Table 7 Accuracy, average precision, average recall and F-score of the examined classifiers using five best linguistic features

Data1: five best linguistic features

Algorithm | Accuracy | Average precision | Average recall | F-score
standard kNN | 0.667 | 0.589 | 0.585 | 0.587
kNN self-training | 0.667 | 0.589 | 0.585 | 0.587
kNN self-training & resampling | 0.667 | 0.589 | 0.585 | 0.587
ECkNN | 0.667 | 0.578 | 0.580 | 0.579
HFNN | 0.633 | 0.429 | 0.515 | 0.468
HWkNN | 0.667 | 0.589 | 0.585 | 0.587
NHBNN | 0.433 | 0.144 | 0.333 | 0.202
decision trees | 1.000 | 1.000 | 1.000 | 1.000
naïve Bayes | 1.000 | 1.000 | 1.000 | 1.000
logistic regression | 0.633 | 0.611 | 0.580 | 0.595
QuickSUCCESS | 0.733 | 0.814 | 0.636 | 0.714

Data2: five best linguistic features

Algorithm | Accuracy | Average precision | Average recall | F-score
standard kNN | 0.817 | 0.711 | 0.670 | 0.690
kNN self-training | 0.817 | 0.711 | 0.670 | 0.690
kNN self-training & resampling | 0.700 | 0.233 | 0.333 | 0.275
ECkNN | 0.750 | 0.523 | 0.558 | 0.539
HFNN | 0.800 | 0.526 | 0.582 | 0.553
HWkNN | 0.817 | 0.711 | 0.670 | 0.690
NHBNN | 0.700 | 0.233 | 0.333 | 0.275
decision trees | 1.000 | 1.000 | 1.000 | 1.000
naïve Bayes | 1.000 | 1.000 | 1.000 | 1.000
logistic regression | 0.850 | 0.554 | 0.606 | 0.579
QuickSUCCESS | 0.867 | 0.791 | 0.693 | 0.739

At the same time, we addressed (i) the need for further experiments in industrial settings, especially with recent classifiers that have not previously been used for ticket classification, and (ii) the limitations of our rule-based approach. Below, we discuss the explainability and granularity implications of the study results as well as the methodological, managerial, and practical contributions.


5.1 Explainability and Granularity Implications

As AI becomes an increasingly large part of our business and daily lives, we recognize the paramount importance of the explainability of AI-based solutions. Understanding the decisions made by an AI system is crucial for trust and, above all, for efficiently addressing the errors such a system makes. In the present research, we closely studied the ITIL Change Management IT ticket processing of a large telecommunication company. Implementing changes in the IT infrastructure means intervening, every time, in the organization's IT environment, its functions, and experience. Due to the criticality of this infrastructure, the IT ticket classification task can be considered a high risk and high consequence decision-making process. Wrong decisions, and actions based on those decisions, can have severe consequences for the company, for example, a complete connection service outage for city districts or even country regions. Hence, the classification outcomes should be explainable and understandable for a human decision-maker.

Therefore, we emphasize the explainability of the selected approaches, both in text representation and in text classification. In our study, we observed that simple, explainable algorithms, such as decision trees and naïve Bayes, can deliver excellent performance when applied with our linguistic features-based text representation. Furthermore, we note that text classification is inherently related to Granular Computing [17, 128–134]. This can be attributed to the observation that typical text classification pipelines consider various representations of the data (such as raw text, TF-IDF, and domain-specific features) that correspond to different levels of abstraction.
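To illustrate what such explainability looks like in practice (a minimal sketch, not the authors' implementation), a decision tree trained on a few named linguistic features can be exported as human-readable rules that a process worker could inspect:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical linguistic features; stand-in data instead of labeled IT tickets.
feature_names = ["rel_cognitive_words", "unique_adjective_ratio", "unique_verb_ratio"]
rng = np.random.default_rng(1)
X = rng.random((40, 3))
y = np.where(X[:, 0] > 0.6, "high", np.where(X[:, 0] > 0.3, "medium", "low"))

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned model reads as if/else rules over the linguistic features,
# which is the kind of transparency argued for in this section.
print(export_text(tree, feature_names=feature_names))
```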

5.2 Methodological Contributions

Despite the vast research on text classification in general, numerous limitations are reported, and a demand for more experiments and case study-based findings has been voiced. The latter is one of the limitations we addressed in this chapter. Especially when designing a linguistic features representation, it is essential to have some knowledge about the authors of the texts and to involve experts in the design of the features. This is also important with respect to the explainability of the approach. For example, in the given case study, we used the text length (word count) as one of the linguistic features. In the case study interviews, we found that short IT tickets were predominantly written by professionals for professionals. Thus, in the case of simple, explicit, and already familiar requests, the case study process workers usually received short texts composed in a very condensed, telegraphic way, which is also supported by the heuristics of the theory of least effort [135]. Such information was particularly important due to the specificity of the classification task, namely IT ticket complexity prediction.
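To make the nature of such features concrete, the following sketch computes a few of the features named in the chapter (word count, relative occurrence of cognitive words, and the ratios of unique verbs and adjectives) for a single ticket text. The cognitive-word list and the NLTK tagger used here are placeholders, not the expert-built lexicon of the study:

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are installed

# Placeholder lexicon; the study uses an expert-built list of cognitive words.
COGNITIVE_WORDS = {"define", "propose", "approve", "delegate", "evaluate"}

def linguistic_features(ticket_text: str) -> dict:
    tokens = [t.lower() for t in nltk.word_tokenize(ticket_text)]
    tagged = nltk.pos_tag(tokens)
    verbs = [w for w, tag in tagged if tag.startswith("VB")]
    adjectives = [w for w, tag in tagged if tag.startswith("JJ")]
    return {
        "word_count": len(tokens),
        "rel_cognitive_words": sum(t in COGNITIVE_WORDS for t in tokens) / max(len(tokens), 1),
        "unique_verb_ratio": len(set(verbs)) / max(len(verbs), 1),
        "unique_adjective_ratio": len(set(adjectives)) / max(len(adjectives), 1),
    }

print(linguistic_features("Please approve and define the rollout plan for the new server."))
```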


However, based on our experiments with the weights of the linguistic features, we identified that the most predictive linguistic features are those related to the relative distribution of cognitive words, positive and negative business sentiments (the construction of which demanded the involvement of experts), and the relative occurrences of unique verbs and adjectives as well as the wording style feature (the development of which did not demand the involvement of experts). The ticket length turned out to be unimportant for predicting the ticket label. Based on these findings, as well as further discussions with the case study process workers, we can explain this phenomenon by two reasons:

1. In some cases, the original IT ticket texts were missing from the provided datasets. Very often, in real-world settings, the rules are not followed; in our case, process workers found workarounds to simplify and accelerate IT ticket processing. Ticket templates were used, and the original ticket texts were not copied into the template.
2. In certain scenarios, the heuristics of the theory of least effort did not apply, meaning that complex tickets were also written in a condensed way. In these cases, the complexity was determined by other factors, such as the number and criticality of affected configuration items.

As there is a general discussion on the advantages and disadvantages of various text representation techniques and classification algorithms, in this chapter we offer a new comprehensive analysis of both topics. First, we systematically compared the efficiency of linguistic features and the TF-IDF representation for IT ticket complexity prediction. To the best of our knowledge, this is the first attempt at such a research setting in the context of explainability. Second, using different datasets and text representation techniques, we consistently tested eleven ML classifiers and showed that simple algorithms, such as decision trees and naïve Bayes, achieve accurate predictions with the linguistic features representation. Hereby, the five best performing features delivered results of the same quality. Hence, building explainable text classification pipelines, i.e., using linguistic features with simple algorithms, can have great potential when applied to real-world tasks. To sum up, the systematic testing of the aforementioned representation techniques and classification algorithms and their application to the IT ticket complexity prediction task holds significant potential for text data researchers who seek to understand the role of specific factors, such as text representations and classifiers, in prediction quality.

5.3 Managerial and Practical Contributions

From the managerial perspective, our case study relates to the IT ticket processing area. With today's increased digitization, any enterprise maintains a broad application portfolio, often grown historically, which must be supported by large-scale, complex IT service environments [136]. This reveals the fundamental role of IT support systems in any organization's support operations.


Two essential steps of IT ticket processing, correct prioritization and assignment, attract the attention of managers, especially in the context of an ever-increasing number of tickets, errors, long lead times, and a lack of human resources [137–142]. While small companies still tend to perform these steps manually, large organizations dedicate high budgets to the implementation of commercial text classification solutions. Usually, these are complex, monolithic software products focused on accuracy at the cost of explainability and understandability for the process worker. As mentioned above, considering the criticality of the IT infrastructure, the IT helpdesk workers' ability to evaluate an automatic ticket classification decision allows them to address the problem of wrongly classified tickets efficiently, thereby reducing errors, rework, and processing time.

An essential practical contribution of the present study is the improvement of our own rule-based approach, in which we used the described linguistic features together with both handcrafted and decision tree-based rules to predict IT ticket complexity as low, medium, or high [16]. Using the ML classification pipeline discussed in this chapter, we achieved the best prediction quality without the need to define and constantly update the rules with the experts, which is a major hurdle in the management, maintenance, and scalability of such systems. Thus, the managerial and practical contributions of the research can be summarized as follows: (i) providing insights into building explainable IT ticket classification applications for managers and ML experts designing text classification pipelines as well as for the IT subject matter experts using them; (ii) addressing the limitations of our rule-based approach to IT ticket complexity prediction [16], namely its inability to learn and scale and its difficulty of analysis, testing, and maintenance.

6 Conclusion and Future Works

The goal of our work was to develop an understanding and provide a comparative analysis of text representation techniques and classifiers while focusing on the development of an explainable IT ticket classification pipeline. The obtained knowledge can support decisions in the design of text classification tasks for various enterprise applications, specifically in the IT ticket area. Additionally, we addressed the limitations of related work as well as of our own research. Below, we highlight the methodological, managerial, and practical contributions.

The methodological contributions can be summarized as follows: (i) the case study-based extraction of linguistic features and the identification of the best set of features allow us to understand their predictive power; (ii) the comprehensive comparative analysis of linguistic features against TF-IDF across various ML algorithms confirms the positive influence of linguistic style predictors on prediction quality already reported by [61]; (iii) our observation that simple algorithms work well when using appropriate linguistic features contributes to the general discussion on the advantages and disadvantages of various text classification algorithms [19, 20].


The following managerial and practical contributions of the work are presented and discussed: (i) our case study findings can provide decision support for ML experts in the design of text classification pipelines; (ii) we improved our rule-based approach with ML. As part of future work, one can: (i) consider further information that may be related to IT ticket complexity, such as the number of tasks and configuration items per ticket; (ii) test other application cases in the IT ticket area and beyond, i.e., further explore the potential of linguistic features; (iii) since we showed that selecting an appropriate subset of linguistic features can considerably improve the performance of classifiers, conduct further experiments with more advanced feature selection techniques [143].

Acknowledgments A. Revina was supported by the Data Science and Engineering Department, Faculty of Informatics, Eötvös Loránd University. K. Buza was supported by project no. ED_18-1-2019-0030 (Application domain-specific highly reliable IT solutions subprogramme), implemented with support provided from the National Research, Development, and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme.

Appendix I: Taxonomy of Decision-Making Logic Levels

Following [15], we consider four semantic concepts, Resources, Techniques, Capacities, and Choices, which are the elements of the RTCC framework. We designed contextual variables [144], based on which experts categorized words into one of the three DML levels and one of the four semantic concepts.3

Contextual variables

Decision-making logic levels Routine

Semi-cognitive

Cognitive

Team, leader, project, colleague, property

Management, system, CAB, measure, approval

Conceptual aspects RESOURCES Problem Processing Level

User, user request, task, test, check, target, release, contact role, access, interface, cluster, tool, client, file system, partner, node

(continued)

3 https://github.com/IT-Tickets-Text-Analytics.


(continued) Contextual variables

Decision-making logic levels Routine

Semi-cognitive

Cognitive

Conceptual aspects Accuracy

Time, application, product, configuration item, CI, right, instance, machine, minute, hour, day, week, detail, description

Description, environment, requirement, validity, reason, solution, method, problem, rule, modification

Situational Awareness

Name, password, group, directory, number, email, package, phone, ID, IP, attachment

Request for change, RfC, customer, rollout, backout

Information

Server, file, location, dataset, network, data, patch, port, information, type, root, certificate, account, device, cable, parameter, agent, folder, disk, fallback, database, db, backup, version, tool, firewall, system, hotfix, supervisor, reference, instruction, format

Requestor, software, Risk, freeze, impact downtime, production, power-supply, outage, service, case

Experience

Need, see, deploy, Implement, create, document, monitor, support, require, use, follow, note, classify provide, test, contain, accompany, inform, consist, describe

Approve, delegate, propose

Action Choice

Start, finish, monitor, import, export, run, stop, step, end, put, send, switch, install, reject, update, upgrade, include, replace, remove, move, begin, make, get, migrate, open, initialize, revoke

Freeze

Server farm

TECHNIQUES

Deploy, migrate, process, modify, forget, increase, miss

(continued)


(continued) Contextual variables

Decision-making logic levels Routine

Semi-cognitive

Cognitive

Perform, modify, assign, check, need, expect, verify

Define

Conceptual aspects Effort

Cancel, rundown, decommission, restart, delete, set, add, activate, reboot, specify, agree upgrade, mount, execute, transfer, write, find

Specificity

Additional, preapproved, affected, initial, attached, internal, external, reachable, regular, active, scheduled, next, whole, formal, virtual, wrong, individual, administrative, local

Secure, separate, specific, technical, urgent, separate, corrected, minor, normal

Related, multiple, multi-solution, major, high, small, big

Decisions Formulation

New, old, preinstalled, fixed, ready, following, current, valid, primary, necessary

Available, necessary, important, significant, successful, appropriate, relevant, main, further, responsible

Possible, many, desired, different, various

Predictability

Actual, full, online, standard, responsible, administrative, existing, minimum, same, visible

Strong, temporary, offline, previous, last, other, more, much, similar, standard

Random, strong randomized, encrypted, expected

Normally, newly, shortly, urgently, temporarily

Maybe, randomly, likely

CAPACITIES

CHOICES Precision

Automatically, instead, manually, there, where, here, separately, additionally, internally

(continued)


(continued) Contextual variables

Decision-making logic levels Routine

Semi-cognitive

Cognitive

Conceptual aspects Scale

Permanently, currently, still, now, often, never, already, just, always, yet, anymore, firstly, before, together, daily, meanwhile, really, furthermore, afterwards, therefore

Again, later, however, usually, previously, recently

Soon

Ambiguity

Correctly, therefore, accordingly, actually, consequently, completely, simultaneously, anyway, necessarily

Well, enough, immediately, easily, simply

Approximately, properly

Appendix II: Business Sentiment Lexicon with Assigned Valences4

Tickets

ITIL

Valence

Expressions +2

No risk, no outage Be so kind, would be nice

0.5

Disaster recovery, set alarms warnings, poison attack vulnerability, critical security leaks, fan, outstanding windows updates, thank you, kind regards, would like, best regards

Request for change, RfC

0

Big measure

Projected service outage, change advisory −0.5 board, high impact, major change

Single key words Kind, success, correct, like, nice

Well, successful, happy

0.5 (continued)

4 https://github.com/IT-Tickets-Text-Analytics.


(continued) Tickets

ITIL

Disaster, recovery, affected, stop, disable, dump, alarm, warning, poison, attack, vulnerability, error, prevent, drop, cancel, delete, exclude, problem, problems, faulty, failed, destroy, defective, obsolete, lack, security, leak, crash, please, support, optimize, grant, privilege, create, dear, acceptance, clarity, restore, increase, danger, balance, right, deny, wrong, retire, missing, weak, invalid, see, follow, yes, allow, approve, approval, confirm

Problem, failed, information, operational, 0 identify, order, include, adequately, procedure, necessary, assess, criteria, clear, provide, potentially, identification, adequate, initiate, value, KPI, standard, schedule, align, properly, release, accurate, report, organization, continuous, ensure, service, beneficial, stakeholder, requirement, correct, record, essential, clearly, RfC, support, tool, relevant, attempt, subsequently, configuration, different, follow, directly, CI, potential, request, individual, plan, work, evaluate, author, organizational, manage, number, financial, status, low, chronological, recommend, responsible, model, accountable, handle, timescale, business, normal, submit, update, create, manual, consider, backout, accept, item, project, deliver, formal, data, iterative, produce, local, describe, test, improve, result, deployment, deploy, technical, management, repeatable, determine, minimum, develop, appropriate, activate, implement, require, process, evaluation, customer, contractual, authorize, share, acceptable

Valence

Blocked, critical

Cost, PSO, CAB, important, unauthorized, major, significant, undesirable, incomplete, delegate, avoid, coordinate, immediately, significantly

Offline, risk, outage, emergency, downtime

Impact, risk, emergency, incident, outage, −1 downtime

Rejected

Unacceptable

−0.5

−2
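A hedged sketch of how such a valence lexicon might be applied: words and expressions in a ticket are matched against the lexicon and their valences aggregated into a business sentiment score. The entries below are an illustrative excerpt loosely based on the table above, and the averaging rule is an assumption rather than the authors' exact method:

```python
# Illustrative excerpt of a business sentiment lexicon (expression or word -> valence).
VALENCES = {
    "no risk": 2.0, "no outage": 2.0,
    "kind": 0.5, "success": 0.5, "correct": 0.5,
    "blocked": -0.5, "critical": -0.5,
    "outage": -1.0, "downtime": -1.0, "emergency": -1.0,
    "unacceptable": -2.0,
}

def business_sentiment(ticket_text: str) -> float:
    text = ticket_text.lower()
    score, hits = 0.0, 0
    # Match longer (multi-word) expressions first and consume them, so that
    # e.g. "no outage" is not also counted as "outage".
    for term, valence in sorted(VALENCES.items(), key=lambda kv: -len(kv[0])):
        count = text.count(term)
        if count:
            score += count * valence
            hits += count
            text = text.replace(term, " ")
    return score / hits if hits else 0.0  # average valence of the matched terms

print(business_sentiment("Emergency change: critical outage, please approve."))
```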

References 1. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Pearson (2009) 2. Müller, O., Junglas, I., Debortoli, S., vom Brocke, J.: Using text analytics to derive customer service management benefits from unstructured data. MIS Q. Exec. 15, 243–258 (2016) 3. Jiao, J. (Roger), Zhang, L. (Linda), Pokharel, S., He, Z.: Identifying generic routings for product families based on text mining and tree matching. Decis. Support Syst. 43, 866–883 (2007). https://doi.org/10.1016/j.dss.2007.01.001 4. Luque, C., Luna, J.M., Luque, M., Ventura, S.: An advanced review on text mining in medicine. WIREs Data Min. Knowl. Disc. 9(3). Wiley Online Library (2019). https://doi.org/10.1002/ widm.1302


5. Wang, Y., Xu, W.: Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud. Decis. Support Syst. 105, 87–95 (2018). https://doi.org/10.1016/j.dss. 2017.11.001 6. Ibrahim, N.F., Wang, X.: A text analytics approach for online retailing service improvement: evidence from Twitter. Decis. Support Syst. 121, 37–50 (2019). https://doi.org/10.1016/j.dss. 2019.03.002 7. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining Explanations: An Overview of Interpretability of Machine Learning. In: Proc. IEEE 5th Int. Conf. Data Sci. Adv. Anal. DSAA 2018, pp. 80–89 (2018) 8. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Pedreschi, D., Giannotti, F.: A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5), article no. 93 (2018) 9. Lee, J.D., See, K.A.: Trust in automation: designing for appropriate reliance. Human Factor Editor’s Collection 46 (1), 50–80 (2004). https://doi.org/10.1518/hfes.46.1.50_30392 10. Jennings, N.R., Moreau, L., Nicholson, D., Ramchurn, S., Roberts, S., Rodden, T., Rogers, A.: Human-agent collectives. Commun. ACM 57 (12), 80–88 (2014). https://doi.org/10.1145/ 2629559 11. Rosenfeld, A., Richardson, A.: Explainability in human–agent systems. Auton. Agent. Multi. Agent. Syst. 33, 673–705 (2019). https://doi.org/10.1007/s10458-019-09408-y 12. Rai, A.: Explainable AI: from black box to glass box. J. Acad. Mark. Sci. 48, 137– 141(2020). https://doi.org/10.1007/s11747-019-00710-5 13. Rizun, N., Revina, A., Meister, V.: Discovery of stylistic patterns in business process textual descriptions: IT Ticket Case. In: Proc. 33rd International Business Information Management Association Conference (IBIMA), pp. 2103–2113, Granada (2019) 14. Rizun, N., Revina, A.: Business sentiment analysis. Concept and method for perceived anticipated effort identification. In: Proc. Inf. Syst. Develop. Inf. Syst. Beyond (ISD), Toulon (2019) 15. Rizun, N., Revina, A., Meister, V.: Method of Decision-Making Logic Discovery in the Business Process Textual Data. In: Proc. Int. Conf. Bus. Inf. Syst. (BIS), pp. 70–84, Sevilla (2019). https://doi.org/10.1007/978-3-030-20485-3_6 16. Revina, A., Rizun, N.: Multi-criteria knowledge-based recommender system for decision support in complex business processes. In: Koolen, M., Bogers, T., Mobasher, B., and Tuzhilin, A. (eds.) Proceedings of the Workshop on Recommendation in Complex Scenarios co-located with 13th ACM Conference on Recommender Systems (RecSys 2019), pp. 16–22, CEUR, Copenhagen (2019) 17. Pedrycz, W.: Granular computing: an introduction. In: Annual Conference of the North American Fuzzy Information Processing Society (NAFIPS), pp. 1349–1354 (2001). https://doi.org/ 10.1109/nafips.2001.943745 18. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523 (1988). https://doi.org/10.1016/0306-4573(88)90021-0 19. Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L.E., Brown, D.E.: Text classification algorithms: a survey. Information 10 (4), 1–68, MDPI (2019). https://doi.org/ 10.3390/info10040150 20. Cavalcanti, Y.C., da Mota Silveira Neto, P.A., Machado, I. do C., Vale, T.F., de Almeida, E.S., Meira, S.R. de L.: Challenges and opportunities for software change request repositories: a systematic mapping study. J. Softw. Evol. Process. 26, 620–653 (2014). https://doi.org/10. 1002/smr.1639 21. Tomašev, N., Buza, K.: Hubness-aware kNN classification of high-dimensional data in presence of label noise. 
Neurocomputing. 160, 157–172 (2015). https://doi.org/10.1016/J.NEU COM.2014.10.084 22. Tomašev, N., Buza, K., Marussy, K., Kis, P.B.: Hubness-aware classification, instance selection and feature construction: survey and extensions to time-series. In: Sta´nczyk, U. and Jain, L.C. (eds.) Feature Selection for Data and Pattern Recognition. pp. 231–262. Springer, Berlin (2015). https://doi.org/10.1007/978-3-662-45620-0_11


23. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986). https://doi.org/10. 1007/BF00116251 24. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. UK Cambridge University Press, pp. 253–286 (2009) 25. Hosmer, D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression. Hoboken (2013) 26. Triguero, I., García, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42, 245–284 (2015). https://doi. org/10.1007/s10115-013-0706-y 27. Marussy, K., Buza, K.: SUCCESS: A new approach for semi-supervised classification of timeseries. In: International Conference on Artificial Intelligence and Soft Computing, pp. 437– 447, Springer, Zakopane (2013). https://doi.org/10.1007/978-3-642-38658-9_39 28. Buza, K., Revina, A.: Speeding up the SUCCESS Approach for Massive Industrial Datasets. In: Proc. Int. Conf. Innov. Intell. Syst. Appl. (INISTA), pp. 1–6, IEEE Digital Library, Novi Sad (2020) 29. van der Aa, H., Leopold, H., del-Río-Ortega, A., Resinas, M., Reijers, H.A.: Transforming unstructured natural language descriptions into measurable process performance indicators using Hidden Markov Models. Inf. Syst. 71, 27–39 (2017). https://doi.org/10.1016/J.IS.2017. 06.005 30. van der Aa, H., Leopold, H., Reijers, H.A.: Checking process compliance against natural language specifications using behavioral spaces. Inf. Syst. 78, 83–95 (2018). https://doi.org/ 10.1016/J.IS.2018.01.007 31. The Economist explains–Why Uber’s self-driving car killed a pedestrian | The Economist explains | The Economist, https://www.economist.com/the-economist-explains/2018/05/29/ why-ubers-self-driving-car-killed-a-pedestrian, last accessed 2020/08/21 32. IBM’s Watson recommended “unsafe and incorrect” cancer treatments—STAT, https:// www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/, last accessed 2020/08/21 33. Russell, S.: Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Publishing Group (2019) 34. Samek, W., Wiegand, T., Müller, K.-R.: Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. arXiv (2017) 35. Lipton, Z.C.: The mythos of model interpretability. Commun. ACM. 61, 35–43 (2016) 36. Explanation noun–Definition, pictures, pronunciation and usage notes | Oxford Learner’s Dictionary of Academic English at OxfordLearnersDictionaries.com, https://www.oxfordlea rnersdictionaries.com/definition/academic/explanation, last accessed 2020/08/21 37. Singh, P., Dhiman, G.: A hybrid fuzzy time series forecasting model based on granular computing and bio-inspired optimization approaches. J. Comput. Sci. 27, 370–385 (2018). https://doi.org/10.1016/j.jocs.2018.05.008 38. Hryniewicz, O., Kaczmarek, K.: Bayesian analysis of time series using granular computing approach. Appl. Soft Comput. J. 47, 644–652 (2016). https://doi.org/10.1016/j.asoc.2014. 11.024 39. Han, Z., Zhao, J., Wang, W., Liu, Y., Liu, Q.: Granular computing concept based longterm prediction of gas tank levels in steel industry. In: IFAC Proceedings Volumes (IFACPapersOnline), pp. 6105–6110, IFAC Secretariat (2014). https://doi.org/10.3182/201408246-za-1003.00842 40. Hu, H., Pang, L., Tian, D., Shi, Z.: Perception granular computing in visual haze-free task. Expert Syst. Appl. 41, 2729–2741 (2014). https://doi.org/10.1016/j.eswa.2013.11.006 41. 
Ray, S.S., Ganivada, A., Pal, S.K.: A granular self-organizing map for clustering and gene selection in microarray data. IEEE Trans. Neural Networks Learn. Syst. 27, 1890–1906 (2016). https://doi.org/10.1109/TNNLS.2015.2460994 42. Tang, Y., Zhang, Y.Q., Huang, Z., Hu, X., Zhao, Y.: Recursive fuzzy granulation for gene subsets extraction and cancer classification. IEEE Trans. Inf. Technol. Biomed. 12, 723–730 (2008). https://doi.org/10.1109/TITB.2008.920787


43. Saberi, M., Mirtalaie, M.S., Hussain, F.K., Azadeh, A., Hussain, O.K., Ashjari, B.: A granular computing-based approach to credit scoring modeling. Neurocomputing. 122, 100–115 (2013). https://doi.org/10.1016/j.neucom.2013.05.020 44. Leng, J., Chen, Q., Mao, N., Jiang, P.: Combining granular computing technique with deep learning for service planning under social manufacturing contexts. Knowl.-Based Syst. 143, 295–306 (2018). https://doi.org/10.1016/j.knosys.2017.07.023 45. Scott, S., Matwin, S.: Text Classification Using WordNet Hypernyms. In: Usage of WordNet in Natural Language Processing Systems, pp. 45–51, ACL Anthology (1998) 46. Yan, J.: Text Representation. In: Encyclopedia of Database Systems, pp. 3069–3072, Springer US, Boston, MA (2009). https://doi.org/10.1007/978-0-387-39940-9_420 47. Zhang, Y., Jin, R., Zhou, Z.H.: Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybern. 1, 43–52 (2010). https://doi.org/10.1007/s13042-010-0001-0 48. Sparck, K.J.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28, 11–21 (1972). https://doi.org/10.1108/eb026526 49. Radovanovic, M., Ivanovic, M.: Text Mining: Approaches and Applications. Novi Sad J. Math. 38, 227–234 (2008) 50. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv (2013) 51. Le, Q. V., Mikolov, T.: Distributed Representations of Sentences and Documents. arXiv (2014) 52. Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proc. Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Association for Computational Linguistics, Stroudsburg, PA, USA (2014). https://doi.org/10. 3115/v1/D14-1162 53. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: FastText: Compressing text classification models. arXiv (2016) 54. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. arXiv (2016) 55. Melamud, O., Goldberger, J., Dagan, I.: context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In: Proc. 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61, Association for Computational Linguistics, Stroudsburg, PA, USA (2016). https://doi.org/10.18653/v1/K16-1006 56. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv (2018) 57. Rezaeinia, S.M., Ghodsi, A., Rahmani, R.: Improving the accuracy of pre-trained word embeddings for sentiment analysis. arXiv (2017) 58. Yin, F., Wang, Y., Liu, J., Lin, L.: The Construction of Sentiment Lexicon Based on ContextDependent Part-of-Speech Chunks for Semantic Disambiguation. IEEE Access 8 (2020). https://doi.org/10.1109/access.2020.2984284 59. Sureka, A., Indukuri, K.V.: Linguistic analysis of bug report titles with respect to the dimension of bug importance. In: Proc. 3rd Annual ACM Bangalore Conference on COMPUTE ’10, pp. 1–6, ACM Press, Bangalore (2010). https://doi.org/10.1145/1754288.1754297 60. Ko, A.J., Myers, B.A., Chau, D.H.: A Linguistic analysis of how people describe software problems. In: Visual Languages and Human-Centric Computing (VL/HCC’06), pp. 127–134, IEEE, Brighton (2006). https://doi.org/10.1109/VLHCC.2006.3 61. Coussement, K., Van den Poel, D.: Improving customer complaint management by automatic email classification using linguistic style features as predictors. Decis. Support Syst. 44, 870–882 (2008). 
https://doi.org/10.1016/J.DSS.2007.10.010 62. Fürnkranz, J., Mitchell, T., Riloff, E.: Case study in using linguistic phrases for text categorization on the WWW. Workshop Series Technical Reports WS-98-05, Association for the Advancement of Artificial Intelligence (AAAI), Palo Alto, California (1998) 63. Mladenic, D., Grobelnik, M.: Word sequences as features in text-learning. In: Proc. 17th Electrotechnical and Computer Science Conference (ERK98), pp. 145–148, Ljubljana (1998) 64. Raskutti, B., Ferrá, H., Kowalczyk, A.: Second order features for maximising text classification performance. In: Proc. European Conference on Machine Learning, pp. 419–430, Springer (2001). https://doi.org/10.1007/3-540-44795-4_36


65. Bekkerman, R., El-Yaniv, R., Tishby, N., Winter, Y.: On feature distributional clustering for text categorization. In: Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 146–153, ACM Press, New York (2001). https://doi.org/10.1145/383952.383976 66. Tan, C.M., Wang, Y.F., Lee, C.D.: The use of bigrams to enhance text categorization. Inf. Process. Manag. 38, 529–546 (2002). https://doi.org/10.1016/S0306-4573(01)00045-0 67. Scott, S., Matwin, S.: Feature engineering for text classification. In: Proc. ICML-99, 16th International Conference on Machine Learning, pp. 379–388, Morgan Kaufmann Publishers, Bled (1999) 68. Moschitti, A., Basili, R.: Complex linguistic features for text classification: a comprehensive study. In: European Conference on Information Retrieval, pp. 181–196, Springer, Sunderland (2004). https://doi.org/10.1007/978-3-540-24752-4_14 69. Sasaki, M., Kita, K.: Rule-based text categorization using hierarchical categories. In: Proc. IEEE International Conference on Systems, Man, and Cybernetics, pp. 2827–2830, IEEE, San Diego (1998). https://doi.org/10.1109/ICSMC.1998.725090 70. Mason, J.: Qualitative Researching. Sage Publications Ltd (2002) 71. Seidel, J., Kelle, U.: Computer-Aided Qualitative Data Analysis: Theory, Methods and Practice. SAGE Publications Ltd, London (1995) 72. Chua, S., Coenen, F., Malcolm, G.: Classification inductive rule learning with negated features. In: Proc. 6th International Conference Advanced Data Mining and Applications (ADMA), pp. 125–136, Springer, Chongqing (2010). https://doi.org/10.1007/978-3-642-17316-5_12 73. Rokach, L.: Decision forest: twenty years of research. Inf. Fusion. 27, 111–125 (2016). https:// doi.org/10.1016/j.inffus.2015.06.005 74. García, E., Romero, C., Ventura, S., Calders, T.: Drawbacks and solutions of applying association rule mining in learning management systems. In: Proc. International Workshop on Applying Data Mining in e-Learning, pp. 13–22, CEUR (2007) 75. Kaur, G.: Association rule mining: a survey. Int. J. Comput. Sci. Inf. Technol. 5, 2320–2324 (2014) 76. Hu, H., Li, J.: Using association rules to make rule-based classifiers robust. In: 16th Australasian Database Conference, pp. 47–54, Australian Computer Society, Newcastle (2005) 77. Chakravarthy, V., Joshi, S., Ramakrishnan, G., Godbole, S., Balakrishnan, S.: Learning decision lists with known rules for text mining. IBM Research, 835–840 (2008) 78. Mitchell, T.M.: Machine Learning. McGraw Hill (1997) 79. Uzuner, Ö., Zhang, X., Sibanda, T.: Machine learning and rule-based approaches to assertion classification. J. Am. Med. Informatics Assoc. 16, 109–115 (2009). https://doi.org/10.1197/ jamia.M2950 80. Frank, E., Bouckaert, R.R.: Naive Bayes for text classification with unbalanced classes. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 503–510, Springer (2006). https://doi.org/ 10.1007/11871637_49 81. Zhu, X.: Semi-Supervised Learning Literature Survey. Technical report, University of Wisconsin-Madison (2008) 82. Lazarov, A., Shoval, P.: A rule-based system for automatic assignment of technicians to service faults. Decis. Support Syst. 32, 343–360 (2002). https://doi.org/10.1016/S0167-9236(01)001 22-1 83. Ahsan, S.N., Wotawa, F.: Impact analysis of SCRs using single and multi-label machine learning classification. In: Proc. IEEE International Symposium on Empirical Software Engineering and Measurement. 
ACM Press, Bolzano/Bozen (2010). https://doi.org/10.1145/185 2786.1852851 84. Ahsan, S.N., Ferzund, J., Wotawa, F.: Automatic Classification of Software Change Request Using Multi-label Machine Learning Methods. In: Proc. 33rd Annual IEEE Software Engineering Workshop, pp. 79–86, IEEE (2009). https://doi.org/10.1109/SEW.2009.15


85. Rus, V., Nan, X., Shiva, S., Chen, Y.: Clustering of defect reports using graph partitioning algorithms. In: Proc. 21st International Conference on Software Engineering & Knowledge Engineering (SEKE’2009), pp. 442–445, Boston (2009) 86. Santana, A., Silva, J., Muniz, P., Araújo, F., de Souza, R.M.C.R.: Comparative analysis of clustering algorithms applied to the classification of bugs. In: Lecture Notes in Computer Science, pp. 592–598, Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3642-34500-5_70 87. Kanwal, J., Maqbool, O.: Bug prioritization to facilitate bug report triage. J. Comput. Sci. Technol. 27, 397–412 (2012). https://doi.org/10.1007/s11390-012-1230-3 88. Menzies, T., Marcus, A.: Automated severity assessment of software defect reports. In: Proc. IEEE International Conference on Software Maintenance, pp. 346–355, IEEE (2008). https:// doi.org/10.1109/ICSM.2008.4658083 89. Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev. 1, 309–317 (1957). https://doi.org/10.1147/rd.14.0309 90. Salton, G., Yang, C.: On the specification of term values in automatic indexing. J. Doc. 29, 351–372 (1973). https://doi.org/10.1108/eb026562 91. Shin, D., Park, Y.J.: Role of fairness, accountability, and transparency in algorithmic affordance. Comput. Human Behav. 98, 277–284 (2019). https://doi.org/10.1016/j.chb.2019. 04.019 92. Diakopoulos, N., Koliska, M.: Algorithmic transparency in the news media. Digit. Journal. 5, 809–828 (2017). https://doi.org/10.1080/21670811.2016.1208053 93. Hepenstal, S., McNeish, D.: Explainable artificial intelligence: what do you need to know? In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 266–275, Springer (2020). https://doi.org/ 10.1007/978-3-030-50353-6_20 94. Daelemans, W.: Explanation in computational stylometry. In: Proc. Int. Conf. on Intelligent Text Processing and Computational Linguistics, pp. 451–462, Springer, Samos (2013). https:// doi.org/10.1007/978-3-642-37256-8_37 95. Taxonomy Strategies | Bibliography–Taxonomy Strategies, https://taxonomystrategies.com/ library/bibliography/, last accessed 2019/09/02 96. Blumauer, A.: Taxonomies and Ontologies | LinkedIn Blog Article, https://www.linkedin. com/pulse/taxonomies-ontologies-andreas-blumauer/, last accessed 2019/09/02 97. Blei, D.: Probabilistic topic models. Commun. ACM. 55, 77–84 (2012). https://doi.org/10. 1145/2133806.2133826 98. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5, 1–184 (2012). https://doi.org/10.2200/S00416ED1V01Y201204HLT016 99. Hutto, C.J., Gilbert, E.: VADER: A parsimonious rule-based model for sentiment analysis of social media text. In: Proc. 8th Int. Conf. on Weblogs and Social Media (ICWSM-14), Ann Arbor (2014) 100. Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press (1932) 101. Molnar, C.: Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. Creative Commons Attribution-Noncommercial-ShareAlike 4.0 International License (2020) 102. Jiang, S., Pang, G., Wu, M., Kuang, L.: An improved K-nearest-neighbor algorithm for text categorization. Expert Syst. Appl. 39, 1503–1509 (2012). https://doi.org/10.1016/J.ESWA. 2011.08.040 103. Buza, K., Nanopoulos, A., Nagy, G.: Nearest neighbor regression in the presence of bad hubs. Knowledge-Based Syst. 86, 250–260 (2015). 
https://doi.org/10.1016/J.KNOSYS.2015. 06.010 104. Radovanovi´c, M., Nanopoulos, A., Ivanovi´c, M.: Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010) 105. Buza, K., Neubrandt, D.: A new proposal for person identification based on the dynamics of typing: preliminary results. Theor. Appl. Inf. 28, 1–12 (2017). https://doi.org/10.20904/2812001


106. Radovanovi´c, M., Nanopoulos, A., Ivanovi´c, M.: Nearest neighbors in high-dimensional data. In: Proc. 26th Annual International Conference on Machine Learning, pp. 1–8, ACM Press, New York (2009). https://doi.org/10.1145/1553374.1553485 107. Tomašev, N., Radovanovi´c, M., Mladeni´c, D., Ivanovi´c, M.: Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. Int. J. Mach. Learn. Cybern. 5, 16–30 (2011). https://doi.org/10.1007/978-3-642-23199-5_2 108. Tomasev, N., Radovanovi´c, M., Mladeni´c, D., Ivanovi´c, M.: A probabilistic approach to nearest-neighbor classification. In: Proc. 20th ACM international conference on Information and knowledge management, ACM, Glasgow (2011). https://doi.org/10.1145/2063576.206 3919 109. Radovanovi´c, M.: High-Dimensional Data Representations and Metrics for Machine Learning and Data Mining (2011) 110. Tomašev, N., Rupnik, J., Mladeni´c, D.: The role of hubs in cross-lingual supervised document retrieval. In: Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 185–196, Springer, Gold Coast (2013). https://doi.org/10.1007/978-3-642-37456-2_16 111. Tomašev, N., Mladeni´c, D.: Class imbalance and the curse of minority hubs. Knowl.-Based Syst. 53, 157–172 (2013). https://doi.org/10.1016/J.KNOSYS.2013.08.031 112. Ng, V., Cardie, C.: Weakly supervised natural language learning without redundant views. In: Proc. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 173–180 (2003) 113. Shao, J., Tu, D.: The Jackknife and Bootstrap. Springer, New York (1995) 114. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach. Learn. 36, 105–139 (1999). https://doi.org/10.1023/A:100751 5423169 115. Firte, L., Lemnaru, C., Potolea, R.: Spam detection filter using KNN algorithm and resampling. In: Proc. IEEE 6th International Conference on Intelligent Computer Communication and Processing, pp. 27–33 (2010). https://doi.org/10.1109/ICCP.2010.5606466 116. Orta, E., Ruiz, M., Hurtado, N., Gawn, D.: Decision-making in IT service management: a simulation based approach. Decis. Support Syst. 66, 36–51 (2014). https://doi.org/10.1016/j. dss.2014.06.002 117. Ruiz, M., Moreno, J., Dorronsoro, B., Rodriguez, D.: Using simulation-based optimization in the context of IT service management change process. Decis. Support Syst. 112, 35–47 (2018). https://doi.org/10.1016/j.dss.2018.06.004 118. Fielt, E., Böhmann, T., Korthaus, A., Conger, S., Gable, G.: Service management and engineering in information systems research. J. Strateg. Inf. Syst. 22, 46–50 (2013). https://doi. org/10.1016/j.jsis.2013.01.001 119. Eikebrokk, T.R., Iden, J.: Strategising IT service management through ITIL implementation: model and empirical test. Total Qual. Manag. Bus. Excell. 28, 238–265 (2015). https://doi. org/10.1080/14783363.2015.1075872 120. Galliers, R.D.: Towards a flexible information architecture: integrating business strategies, information systems strategies and business process redesign. Inf. Syst. J. 3, 199–213 (1993). https://doi.org/10.1111/j.1365-2575.1993.tb00125.x 121. Saaty, T.L.: The analytic hierarchy and analytic network processes for the measurement of intangible criteria and for decision-making. In: Multiple Criteria Decision Analysis: State of the Art Surveys, pp. 345–405, Springer, New York (2005). https://doi.org/10.1007/0-387-230 81-5_9 122. 
Webb, G.I., Sammut, C., Perlich, C., Horváth, T., Wrobel, S., Korb, K.B., Noble, W.S., Leslie, C., Lagoudakis, M.G., Quadrianto, N., Buntine, W.L., Quadrianto, N., Buntine, W.L., Getoor, L., Namata, G., Getoor, L., Han, Xin Jin, J., Ting, J.-A., Vijayakumar, S., Schaal, S., Raedt, L. De: Leave-One-Out Cross-Validation. In: Encyclopedia of Machine Learning, pp. 600–601, Springer US, Boston (2011). https://doi.org/10.1007/978-0-387-30164-8_469 123. Salzberg, S.L.: On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1, 317–328 (1997). https://doi.org/10.1023/A:1009752403260


124. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.A.: Fast time series classification using numerosity reduction. In: Proc. 23rd Int. Conf. on Machine Learning (ICML), pp. 1033– 1040, ACM Press, New York (2006). https://doi.org/10.1145/1143844.1143974 125. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory. 13, 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964 126. Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proc. 21st Int. Conf. on Machine Learning (ICML), pp. 615–622, ACM Press, New York (2004). https:// doi.org/10.1145/1015330.1015435 127. Cheng, Q., Varshney, P.K., Arora, M.K.: Logistic regression for feature selection and soft classification of remote sensing data. IEEE Geosci. Remote Sens. Lett. 3, 491–494 (2006). https://doi.org/10.1109/LGRS.2006.877949 128. Pedrycz, W.: Granular Computing: Analysis and Design of Intelligent Systems. CRC Press (2018) 129. Pedrycz, W., Skowron, A., Kreinovich, V.: Handbook of Granular Computing. Wiley (2008) 130. Yao, J.T., Vasilakos, A.V., Pedrycz, W.: Granular computing: perspectives and challenges. IEEE Trans. Cybern. 43, 1977–1989 (2013). https://doi.org/10.1109/TSMCC.2012.2236648 131. Bargiela, A., Pedrycz, W.: Toward a theory of granular computing for human-centered information processing. IEEE Trans. Fuzzy Syst. 16, 320–330 (2008). https://doi.org/10.1109/ TFUZZ.2007.905912 132. Pedrycz, W., Homenda, W.: Building the fundamentals of granular computing: a principle of justifiable granularity. Appl. Soft Comput. J. 13, 4209–4218 (2013). https://doi.org/10.1016/ j.asoc.2013.06.017 133. Pedrycz, W.: Allocation of information granularity in optimization and decision-making models: towards building the foundations of granular computing. Eur. J. Oper. Res. 232, 137–145 (2014). https://doi.org/10.1016/j.ejor.2012.03.038 134. Pedrycz, W.: Granular computing for data analytics: a manifesto of human-centric computing. IEEE/CAA J. Autom. Sinica 5 (6), pp. 1025–1034 (2018). https://doi.org/10.1109/JAS.2018. 7511213 135. Zipf, G.K.: Human behavior and the principle of least effort: an introduction to human ecology. Martino Pub (2012) 136. Diao, Y., Bhattacharya, K.: Estimating Business Value of IT Services through Process Complexity Analysis. In: Proc. IEEE Netw. Oper. Manage. Symp. (NOMS) (2008) 137. Paramesh, S.P., Shreedhara, K.S.: Automated IT service desk systems using machine learning techniques. In: Lecture Notes in Networks and Systems 43, 331–346, Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-2514-4_28 138. Paramesh, S.P., Ramya, C., Shreedhara, K.S.: Classifying the unstructured IT service desk tickets using ensemble of classifiers. In: Proc. 3rd Int. Conf. on Computational Systems and Information Technology for Sustainable Solutions (CSITSS), pp. 221–227 (2018). https:// doi.org/10.1109/CSITSS.2018.8768734 139. Roy, S., Muni, D.P., Tack Yan, J.J.Y., Budhiraja, N., Ceiler, F.: Clustering and labeling IT maintenance tickets. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 829–845, Springer (2016). https://doi.org/10.1007/978-3-319-46295-0_58 140. Dasgupta, G.B., Nayak, T.K., Akula, A.R., Agarwal, S., Nadgowda, S.J.: Towards autoremediation in services delivery: context-based classification of noisy and unstructured tickets. 
In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 478–485, Springer (2014). https://doi.org/ 10.1007/978-3-662-45391-9_39 141. Agarwal, S., Sindhgatta, R., Sengupta, B.: SmartDispatch: Enabling efficient ticket dispatch in an IT service environment. In: Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining , pp. 1393–1401 (2012). https://doi.org/10.1145/2339530.2339744 142. Agarwal, S., Aggarwal, V., Akula, A.R., Dasgupta, G.B., Sridhara, G.: Automatic problem extraction and analysis from unstructured text in IT tickets. IBM J. Res. Dev. 61, 41–52 (2017). https://doi.org/10.1147/JRD.2016.2629318


143. Szenkovits, A., Meszlényi, R., Buza, K., Gaskó, N., Ioana Lung, R., Suciu, M.: Feature Selection with a Genetic Algorithm for Classification of Brain Imaging Data. In: Stanczyk, U., Zielosko, B., Jain, L.C. (eds.) Advances in Feature Selection for Data and Pattern Recognition, pp. 185–202, Springer, Cham (2018) 144. Rizun, N., Taranenko, Y.: Simulation Models of Human Decision-Making Processes. Manag. Dyn. Knowl. Econ. 2 (2), pp. 241-264 (2014).

A Granular Computing Approach to Provide Transparency of Intelligent Systems for Criminal Investigations

Sam Hepenstal, Leishi Zhang, Neesha Kodagoda, and B. L. William Wong

Abstract Criminal investigations involve repetitive information retrieval requests in high risk, high consequence, and time pressing situations. Artificial Intelligence (AI) systems can provide significant benefits to analysts, by sharing the burden of reasoning and speeding up information processing. However, for intelligent systems to be used in critical domains, transparency is crucial. We draw from human factors analysis and a granular computing perspective to develop Human-Centered AI (HCAI). Working closely with experts in the domain of criminal investigations we have developed an algorithmic transparency framework for designing AI systems. We demonstrate how our framework has been implemented to model the necessary information granules for contextual interpretability, at different levels of abstraction, in the design of an AI system. The system supports an analyst when they are conducting a criminal investigation, providing (i) a conversational interface to retrieve information through natural language interactions, and (ii) a recommender component for exploring, recommending, and pursuing lines of inquiry. We reflect on studies with operational intelligence analysts, to evaluate our prototype system and our approach to develop HCAI through granular computing.

Keywords Granular computing · Interpretable AI · Intelligence analysis · Conversational agents

1 Introduction

Artificial intelligence (AI) can assist criminal investigators when they retrieve information. Investigations involve complex interactions with large databases, and each query can lead to many other lines of inquiry. It is important that these lines are pursued as far as necessary to ensure that insights are gained. The scale of the task, with limited resources, presents a significant and time-consuming challenge for investigators and can sometimes obstruct the exploration of important lines of inquiry. The impact of information overload on investigations is explained by the Commissioner of the Metropolitan Police (UK): "if police were able to harness data more effectively, a 'very, very large proportion' of crimes could be solved" [1]. These crimes include cases of murder and manslaughter. There are, therefore, opportunities for smarter analytical tools. AI systems that automate reasoning about what information an analyst needs, allowing the analyst to focus their attention on gaining insights, could provide significant benefits. Examples of such intelligent systems include:

(i) A conversational agent (CA) system, which removes the requirement for analysts to translate their questions into restrictive syntax or structures. By recognising an analyst's query, translating it into the necessary information retrieval intention, and responding as desired, the agent provides shared reasoning for a single reasoning activity.
(ii) A recommender system that suggests insightful lines of inquiry by identifying and exploring possible directions for an investigation. The system understands the possibilities desired by an analyst for the overall directions an investigation can take, and it recommends and explores the best approaches, thus incorporating multiple phases of reasoning.

These systems share the burden of reasoning with an analyst and have the potential to speed up information processing. An intelligent system can reduce the cognitive load associated with retrieving and interpreting large amounts of information and allows an analyst to focus their attention on gaining key insights to make better decisions. Furthermore, the potential of AI is not limited to automating the information retrieval process: if the analysts' intentions and reasoning pathways can be captured and modelled appropriately, one can design more advanced and effective solutions for criminal investigations.

Criminal investigations involve high risk, high consequence, and time pressing situations, and a challenge for developing AI systems is modelling system intentions and reasoning pathways with transparency. For intelligent systems to be used in high risk and high consequence domains, the various goals and constraints must be easily interpretable to an end user. If an analyst misinterprets CA system processes and caveats when retrieving information in a live investigation, it could waste resources and cause delays to critical action. Alternatively, if a system unfairly guides an investigation, it could lead to the arrest of an innocent person. There are ethical and trust implications in high risk and high consequence domains. A criminal intelligence analyst must be held accountable for their analysis and be able to inspect and verify system processes. There is also a concern that an over-reliance on technology may lead to poorer analysis and missed insights, where analysts consider only the outputs of machine processes without applying their own reasoning. To mitigate this, a critical decision maker requires the awareness to challenge the processes when necessary and explore key data. We have introduced a framework to describe the transparency of intelligent systems that share reasoning with a user.


Our framework can be applied to design intelligent systems so that they deliver appropriate transparency, in particular the ability to inspect and verify the goals and constraints of the system.

While an analyst needs to interpret system behaviour accurately, it is also important that the provision of transparency does not undo the gains in time saving and in reducing the burden on the analyst. A transparent information retrieval system inevitably requires an analyst to review more information than a system without transparency. This can place an additional burden on the analyst, who needs to assess how the system has behaved in addition to interpreting the returned data and gaining insight into the situation. AI systems are typically complex, with many component parts. For analysts to interpret the important processes quickly and easily, system behaviour must be represented in an abstract and digestible form, where an analyst can delve into lower levels of granularity when they need to. Interpretation is required of the important functional attributes underpinning each system behaviour, and these should be provided with a recognisable structure that reflects the key concerns of the analyst concisely. We propose that a granular computing approach to form, process, and communicate the appropriate information granules to a user can achieve the interpretability required.

Our contribution: In this chapter, we present our approach to developing Human-Centered AI (HCAI) that tackles the problems described, through a combination of human factors analysis and human-centric granular computing [2]. Developing reliable, safe, and trustworthy HCAI requires that humans have high control over high levels of computer automation. This means explainable user interfaces employing designs that reduce the need for explanation [3]. Developing systems that make AI truly explainable requires research that brings together AI system design and an understanding of how people think, including the context and purpose of the application and the specific needs for interpretability. "Explainable AI cannot succeed if the only research foundations brought to bear on it are AI foundations. Likewise, it cannot succeed if the only foundations used are from psychology, education, etc." [4] Our work helps to bridge this gap: we looked to understand the requirements and thought processes of a user in the context of the domain of criminal intelligence analysis. We drew from a human factors study, described in Sect. 3, to understand the nature of information retrieval in a criminal investigation. This understanding underpinned the design of two elements of a prototype system, and we explain how our system transparency framework [5] is implemented and extended to address multiple levels of granularity. Deriving and communicating the appropriate information granules for contextual user understanding is central to our approach to developing Human-Centered AI, where granular computing is about human-centered information processing [6]. In Sect. 4, we describe how we designed the architecture of a conversational agent (CA) called Pan [7], with an original perspective on information granularity, so that a user can recognise and interpret the system processes. In Sect. 5, we describe how the environment for a novel recommender system is constructed and presented with the necessary information granules to derive insight from investigation paths, while also allowing for recognition of system processes at each stage in a selected path. Finally, in Sect. 6, we describe two evaluation studies and assess the needs for transparency and the impact of our granular computing approach to developing HCAI.

2 Supporting Intelligence Analysts with Intelligent Systems In this section, we consider the potential benefits of using AI systems to support criminal investigations and we look at existing research on the needs for transparency.

2.1 Faster Investigations with Intelligent Systems The process of intelligence analysis involves repetitive and intellectually non-trivial information retrieval tasks, as analysts recognise patterns and make connections to form insights [8]. A more natural interaction with data, which removes the requirement for analysts to translate their questions into restrictive syntax or structures, could speed up information retrieval processes significantly. We propose that if an analyst were able to communicate with their data more fluidly, through a conversational agent (CA), then they could achieve significant time savings in investigations. We define typical task-driven CAs as being able to understand users by matching their input pattern to a particular task category (intent), for example through 'Artificial Intelligence Markup Language' (AIML) [9], where the intent triggers a set of functional processes and manages dialog with a user. There are many examples of commercial CA technologies that provide intent classification, for example Microsoft LUIS and NVIDIA Jarvis. If, for example, an analyst wants to find out who is employed by a company then they can ask this directly of a CA. The interaction with the CA is an example of shared reasoning, where the user directs the conversation based upon their own thoughts, and the agent interprets their intentions to select the appropriate processes to extract data and form a response. Analysts can also save time if, rather than having to manually explore each line of inquiry, they deploy AI systems that perform investigations autonomously, for example by reasoning over paths to take and recommending information that may be of interest. Such systems have the potential to both speed up analysis and challenge investigation scope without further burdening analysts, for example by automatically seeking and returning known and unknown 'unknowns' [10]. Even if a system can explore only simple paths and make recommendations, triggered by an initial question from an analyst, it could provide helpful assistance and save time. In a high risk and high consequence environment, such as a criminal investigation, even a modest time saving can be significant.
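To make the intent-matching step concrete, here is a minimal sketch of how a task-driven CA could map an analyst's question to an intent and trigger its functional processes. It is illustrative only and not the Pan implementation: the regular-expression pattern, the find_employees placeholder, and the company name are all invented.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Illustrative intent definition: a recognition pattern plus the functional
# processes (here, a placeholder callable) that the intent triggers.
@dataclass
class Intent:
    name: str
    pattern: re.Pattern
    processes: List[Callable[[Dict[str, str]], str]]

def find_employees(slots: Dict[str, str]) -> str:
    # Stand-in for a database/graph query, e.g. people with a WORKS_FOR link.
    return f"People employed by {slots['company']} (query placeholder)"

INTENTS = [
    Intent(
        name="find_employees",
        pattern=re.compile(r"who (works for|is employed by) (?P<company>.+?)\??$", re.I),
        processes=[find_employees],
    ),
]

def respond(utterance: str) -> Optional[str]:
    """Match the analyst's question to an intent and trigger its processes."""
    for intent in INTENTS:
        match = intent.pattern.search(utterance)
        if match:
            return "; ".join(proc(match.groupdict()) for proc in intent.processes)
    return None  # unrecognised: a real CA would ask a clarifying question

print(respond("Who is employed by Acme Logistics?"))
```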


2.2 The Need for Transparent Systems in Criminal Investigations Intelligence analysis involves high risk and high consequence situations, where analyst decisions can have significant consequences. There are, therefore, serious ethical considerations, including in relation to algorithmic bias or the misinterpretation of system processes. Take a simple example, where an analyst does not appreciate the caveats when asking a CA to search and retrieve data. This may lead to an incorrect assessment that some important information cannot be found. The assessment informs subsequent decisions and this could cause delays, with potential risk to life. Alternatively, consider a recommender system that can prompt and direct inquiries. Even in commercial recommender systems, there are concerns that “malicious actors can manipulate these systems to influence buying habits, change election outcomes, spread hateful messages, and reshape attitudes about climate, vaccinations, gun control, etc. Thoughtful designs that improve user control could increase consumer satisfaction and limit malicious use.” [3] In an investigation scenario malicious influences could mean directing scarce resources towards an innocent person, for example, through discriminatory processes. Algorithmic bias can occur in various ways. “Human error, prejudice, and misjudgment can enter into the innovation lifecycle and create biases at any point in the project delivery process from the preliminary stages of data extraction, collection, and pre-processing to the critical phases of problem formulation, model building, and implementation” [11]. Past work undertaken as part of the Visual Analytics for Sensemaking in Criminal Intelligence Analysis (VALCRI) project developed a solution for policing which includes AI algorithms, and also uncovered important ethical issues such as “accidental discrimination, the Mosaic effect, algorithmic opacity, data aggregation with mixed levels of reliability, data and reasoning provenance, and various biases” [12]. Human rights campaigners have raised concerns over the use of AI systems in the criminal justice system, where “the nature of decision making by machines means there is no option to challenge the process, or hold the system to account” [13]. Police analysts and officers have also raised concerns that an inability to understand and challenge machine reasoning, including any bias that may have been introduced, is a critical barrier to the use of complex systems [14]. Effective shared reasoning between a human and a machine requires trust and accountability. A key aspect enabling trust engineering in systems, and addressing ethical concerns, is transparency so that the analyst can predict, interpret and refute any results, acknowledging caveats where they exist [15]. Therefore, algorithmic transparency is central to the ability to provide interpretable AI systems that can be challenged and critiqued, so that the analyst remains fully accountable for any decisions made. The research area of Explainable Artificial Intelligence (XAI) has received a large amount of attention in recent years and seeks to find solutions for interpretable, explainable, systems. Explainable AI (XAI) can refer to several concepts [16], with no widely agreed definition. In complex systems, particularly those that involve multiple algorithms, there is an important distinction to be made between local and global


Fig. 1 System transparency framework [5]

explanations. Local explanations focus on justifying a single decision or output, whilst a global explanation covers overall system behaviour [17–20]. A global understanding of a system is important in high risk and high consequence domains, where the limitations of system behaviour can be most damaging. Beyond the principles for XAI, the delivery of explanations also requires consideration, for example, that users should be able to "actively explore the states and choices of the AI, especially when the system is operating close to its boundary conditions" [18]. We have developed an algorithmic transparency framework (Fig. 1) [5] that helps us to design intelligent systems for high risk and high consequence domains. The framework echoes other XAI research on the distinction between local and global explanations, or process and output transparency. It shows that to provide transparency, a user requires explanations of how results are derived together with visibility of the functional relationships in the system, with context in which to interpret the system behaviour. Explanations of results should focus upon the underlying data so the user can gain some understanding of the internals of an algorithm, i.e. through the identification of important features. Visibility means allowing a user to inspect and verify the goals and constraints of system behaviours.

2.3 A Granular Computing Perspective to Design Transparent Systems In previous work we have explored how explanations can be tailored based upon a user's role, and upon the system components, and found that a one-size-fits-all approach to XAI is insufficient [21]. Here we also propose that, by understanding the human behaviour involved in completing a task, we can better design the required information granules, abstracted from system behaviours, with an appreciation of the context in which an interpretation is desired. We are not providing explanations for the


sake of it. Whilst analysts require transparency, it is unhelpful to provide "'fishbowl' transparency in which huge amounts of information are provided in indigestible form. Transparency does not necessarily provide explainability—if systems are very complex, even providing code will not be illuminating" [22]. When analysts use an intelligent system they have a specific purpose that should influence, and be aided by, their interpretation of the system behaviour, be it to recognise a situation or to find new insights from lines of inquiry. Furthermore, whilst the purpose of AI is typically to reduce cognitive load and allow users to make sense of larger amounts of data faster, providing transparency of underlying system processes through complicated explanations can actually increase cognitive load. "Reading such explanations can help users to understand model predictions, but this comes at a cognitive cost since users will have to expend effort to cognitively encode and apply these explanations" [23]. A better mapping of transparency to the information granules required, with a structure that mirrors human cognition, could help mitigate increases in load. Granulation is important for human cognition, where elements are drawn together by indistinguishability, equivalence, similarity, proximity or functionality [24, 25]. For example, it has been found to have a "significant role to play in the development of data-driven Machine Learning methods, in particular when human centricity is important" [26]. Granular computing seeks to provide a better understanding of a problem, by splitting the problem into manageable chunks. There is an extensive body of literature related to granular computing approaches, which dwells upon individual formalisms of information granules and unifies methods to form a coherent methodological and development environment. Traditional methods simplify the communication of data through reduction at different levels of granularity, while retaining the essence of the data, for example by constructing sets, including fuzzy sets [27], rough sets [28, 29], or probabilistic environments, such as a probabilistic Petri-net [30]. Granular computing can be beneficial in environments that involve the interpretation of large amounts of complex information by humans, where attributes in the raw data need to be reduced, abstracted or generalised so that they can be represented and understood [6, 31]. Here, granular computing approaches can provide an alternative to complex algorithms, for example, to provide classifications in sentiment analysis. Typical prediction approaches, such as Naïve Bayes and support vector machine models, involve high dimensionality and high complexity, and fuzzy rule-based systems may provide an advancement in terms of interpretability of models for sentiment classification [32]. There are, therefore, many applications for granular computing methods, including medical diagnosis [33], fault diagnosis, image processing, processing large datasets, and intelligent control systems [29]. Fundamentally, granular computing is about understanding how to form, process and communicate information granules. A key aspect of criminal investigations is that an analyst has the appropriate situational awareness to inform their decisions, including their perception of elements within an environment, comprehension of their meaning, and projection in the near future [34].
Past research has considered how situational awareness can be supported by granular computing, where raw information such as sensor data is interpreted into a higher domain-relevant situation concept [35]. We seek to design information granules that


aid the recognition of situations through system transparency, and ultimately improve interpretability, as an analyst performs an investigation. In our case, we seek granules that can describe information retrieval situations as they occur. In each situation, the analyst will be interacting with an intelligent system that interprets and responds to their intention. This is akin to a rule-based system, where a series of functional rules are triggered once an intention is matched. Again, granular computing approaches can be applied to aid interpretability. For example, each of the rules that make up a rule-based system can be seen as a granule, as can each of the specific terms or parts of those rules [36]. With our transparency framework we can identify an appropriate granularity to explain functional processes, or rules, and communicate them through appropriate explanation structures. In our approach, the context for forming granules and structures is drawn from the human cognitive requirements for the tasks at hand.

3 How Analysts Think In this section of the chapter, we introduce a human factors study that helped us to gain an understanding of how analysts think when they perform analysis in a criminal investigation. In Sects. 4 and 5 of this chapter, we demonstrate how this understanding was translated into the design of information granules for functions, intention concepts, and investigation paths, in order to provide contextual interpretability of a conversational agent and recommender system.

3.1 Cognitive Task Analysis (CTA) Interview Study [37] Purpose: To understand how to design interpretable AI systems we need to first explore the context in which they will be used, including the way that the user thinks about and interacts with available data. From this understanding we can model the necessary granules of information that require explanation with an appropriate structure. To inform our modelling we conducted a series of interviews with intelligence analysts. We sought to understand the cognitive strategies applied by each analyst in an investigation. We have used our analysis to help design the architecture of our AI system and identify the appropriate granularity for contextual interpretability. Methodology: We performed Cognitive Task Analysis (CTA) interviews with four analysts, each with more than three years of experience. Three of the analysts had worked in policing, across various police forces, and the other analyst had a background in defence intelligence. Each interview lasted an hour and applied the Critical Decision Method (CDM) [38] to elicit expertise, cues, goals and decision making, for a memorable investigation the analyst was involved with from start to end. The interviewer began with more general questions about the day to day role of each analyst, before asking them to describe a particular investigation. A timeline of


key events was sketched out by the analyst and explored in detail to identify where, why, and how they recognised and responded to the situation. The interviews have been transcribed and we have looked to identify themes across critical investigation stages, including about how analysts recognise situations and make decisions. Results and Analysis: In the investigations described by analysts, each new insight triggered another phase of intense information gathering. The investigations involved uncertain situations, so each analyst looked to explore many lines of inquiry to retrieve as much relevant information as possible. For example, as described by one analyst on responding to a suspected kidnapping, “we had no idea initially what the kidnap was for. We were searching associates, we looked for any previous criminal convictions, we spoke to neighbours, and (gathered) telephone information for his (the kidnap victim) phone.” [Cognitive Task Analysis, Analyst 1, 11:30] Much of the data processing was manual and time consuming and analysts lacked the time to conduct in depth analysis to refute their hypotheses. As one analyst put it, “you don’t always get comfort to do that, you respond and validation comes afterwards.” [CTA, A1, 40:30] Validation is particularly challenging when, as was typically the case, there is a large amount of potential data available for consideration. Furthermore, the information an analyst needs was not always available, either because it simply did not exist, could not be accessed, or there was too much of it to filter and explore in time. Experience was needed to counter these limitations. An experienced analyst can crudely filter large volumes of data based upon the patterns they expect to find. For example, when considering whether records of communications data involved a firearm deal, an analyst stated that “we can rule out text messages, based upon experience that criminals (when purchasing firearms) normally call about this kind of thing.” [CTA, Analyst 4, 16:00] Likewise, they may apply abductive reasoning to predict new lines of inquiry from small amounts of information with gaps. Commonly held assumptions were significant in guiding the direction and boundaries of investigation paths, by influencing the hypothesis which explains the overall investigation scenario. We have termed this the ‘scope’ of the investigation [37]. The investigation scope is crucial to direct intelligence analysis and enable recognition, by creating a basis from which expectancies can be drawn. An analyst does not have the capacity to explore all of the data in time pressing scenarios, so they need to identify what queries will be most fruitful given both the data available and the scope of the investigation. To do this, they apply their experience to recognise aspects of the situation and construct a plausible narrative explanation with supporting evidence. The Recognition-Primed Decision (RPD) model [39] helps us to understand how experienced people make rapid decisions, including how they recognise a situation. 
Klein describes “four important aspects of situation assessment (a) understanding the types of goals that can be reasonably accomplished in the situation, (b) increasing the salience of cues that are important within the context of the situation, (c) forming expectations which can serve as a check on the accuracy of the situation assessment (i.e., if the expectancies are violated, it suggests that the situation has been misunderstood), and (d) identifying the typical actions to take” [39]. These aspects allow people to compare patterns from their experience of past situations with emerging situations, thus enabling quick understanding, predictions and decisions. We have


used the RPD model to describe how each analyst recognised situations in an investigation context. We have analysed the CTA interview transcripts and captured each task undertaken by the analysts throughout their investigations. We have mapped each task against the RPD model and find that this allows us to capture the important considerations and processes involved. Table 1 presents two tasks from the start of a kidnapping incident. At first, on hearing of the kidnapping, the analyst sought to find details about the incident, such as the identity of the victim and possible suspects. After this, they wanted to assess the level of risk to the victim.

Table 1 Snippet of decision analysis table—kidnapping [37]

Hypothesis scope: 1. Who (Victim), When, How
  Cues: Man gone missing. Thought he had been kidnapped due to witness report. Known to be vulnerable
  Goals: Understand what could have happened and more about the victim
  Expectancies: Unknown—scope too broad
  Actions: Searched known associates of victim, looked for previous convictions, spoke to neighbours and witnesses, looked at telephone information
  Why? To reduce scope of investigation and assess level of risk
  What for? To direct next steps of investigation and better use experience to recognise patterns

Hypothesis scope: 2. Who (victim, offender), When, How, Where, Why
  Cues: Identified in police records man had been victim of assaults by fluid group of youths. Not linked to NCA or serious organised crime
  Goals: Understand how dangerous youths are, find out what vehicles they use and telephone numbers and where they live/operate
  Expectancies: That those involved are local known bullies and not OCG. Expectation that this was an incident which had gone too far and offenders had panicked
  Actions: Search databases for offenders looking for vehicles, telephone numbers and associates, looked for victim's name
  Why? To assess risk to victim (danger posed by offenders) and to trace possible locations through vehicles, telephones, addresses
  What for? To locate the victim and provide support

Implications for this chapter: We describe how we have used the findings from this CTA study to develop an HCAI system for retrieving information and exploring lines of inquiry in an investigation. When analysts interact with an intelligent system to perform an investigation, they need to recognise and interpret the behaviour of the system. Granular computing, as introduced by Pedrycz et al. [40, 41], holds that information granules are key components in knowledge representation and processing, where the level of granularity is crucial to both problem description and problem solving, and is problem oriented and user dependent. We propose that information granularity should, therefore, be carefully designed, informed by a detailed understanding of the needs and context. We further propose that the RPD structure provides an appropriate level of granularity to aid recognition and interpretability of system behaviour, derived from our human factors analysis. In Sect. 4, we describe how we have modelled and implemented the RPD structure in a conversational agent system.

Recognition is also important to enable analysts to use their intuition and gain insights on a situation. As presented by Gerber et al. [42] in a model of analyst decision making, experts recognise patterns from experience. This can lead to intuition and enables them to spot information that may be helpful for solving the problem. Analysts make 'leap of faith' assessments and insight "occurs unexpectedly while collecting data towards and beyond the leap of faith by providing a comprehension of the situation" [42]. There is much variability in the analytical reasoning applied, where strategies range from "making guesses and suppositions that enable storytelling when very little is known, to reasoning strategies that lead to rigorous and systematic evaluation of explanations that have been created through the analytic process" [43]. An analyst's intuition is the anchor for lines of inquiry, and reflects the scope of the investigation. The reliability of intuition depends on the analysts' expertise in a given domain. An investigation that is framed by past experiences alone can be limited in scope, particularly if the analyst is unfamiliar with an emerging situation, or has an incomplete experience that leads to bias. In these cases, the scope may not reflect the reality of the situation and important lines of inquiry are missed. This limitation is captured by an analyst describing an attempted murder investigation. The analyst explained that the husband of the victim had been a suspect but was cleared through verification of his statement, via call data and CCTV. They went on to say that, "No other evidence was available, so no lines of inquiry. Expertise, such as the burglary expert, felt it looked like staged burglary. (They) had expected a pattern to the way draws had been pulled out (if burglar), that it was not a burglary pattern. I felt this was too tenuous. It could be a novice burglar." [CTA, A2, 36:00] The priority of an intelligence analyst is to connect relevant pieces of information across various lines of inquiry, so that they can gain insight into an unfolding situation. When we design AI systems that aid analysts in exploring and interpreting lines of inquiry we should, therefore, consider how insights are gained, such as from drawing connections through intuition and leap of faith. We can then deliver information at the appropriate granularity and structure that reflects this. In Sect. 5, we describe how we have designed an HCAI recommender system to explore and recommend lines of inquiry to enable insights.

4 Designing Recognisable Systems When analysts retrieve information to support an investigation, they seek to recognise the situation so that they can make appropriate decisions on how to proceed. The types of decision include which results to query further to gather additional information, whether to apply an alternative information retrieval strategy, or what conclusions


to draw. Analysts use systems to interact with available data to help address any questions they have. As well as recognising the emerging situation, they also need to be able to recognise and interpret the behaviour of the system. Specifically, we propose that analysts need to be aware of the goals and constraints of a system to verify correspondence with the intended behaviour. System interpretability should therefore be provided with a granularity and structure that aids an analyst to recognise the system behaviour, at a level where goals and constraints become significant. As noted previously, the recognition aspects of the Recognition-Primed Decision (RPD) model [39] provide a useful structure to capture the information granules considered by an analyst when they recognise a situation and make a decision [37].
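To show how the RPD aspects can act as the information granules of a single system behaviour, the sketch below groups one intention's functional attributes under the RPD headings and renders them as an inspectable description. The structure and field names are our own illustration rather than the Pan code; the example content is drawn from Table 2.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: an intention whose functional attributes are
# grouped by the recognition aspects of the RPD model, so the system can
# describe its own goals and constraints to the analyst.
@dataclass
class RPDIntention:
    name: str
    goals: List[str]         # what the intention tries to achieve
    cues: List[str]          # inputs it expects from the analyst's question
    expectancies: List[str]  # assumptions/constraints it relies on
    actions: List[str]       # functional processes it triggers
    why: str                 # immediate purpose
    what_for: str            # contribution to the wider investigation

    def describe(self) -> str:
        """Render the goals and constraints so they can be inspected and verified."""
        return "\n".join([
            f"Intention: {self.name}",
            f"  Goals: {'; '.join(self.goals)}",
            f"  Cues: {'; '.join(self.cues)}",
            f"  Expectancies: {'; '.join(self.expectancies)}",
            f"  Actions: {'; '.join(self.actions)}",
            f"  Why: {self.why}",
            f"  What for: {self.what_for}",
        ])

find_associates = RPDIntention(
    name="find_associates",
    goals=["Find associates"],
    cues=["Victim name"],
    expectancies=["The victim knows the offenders"],
    actions=["Search for people connected to the victim's name"],
    why="To find potential suspects",
    what_for="So that inquiries can be made into suspects",
)
print(find_associates.describe())
```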

4.1 Modelling Information Granules for Conversational Agent Intentions In a typical task-driven CA, analyst inputs are matched to ‘intentions’ and these define how the system processes information and manages dialog with a user. We propose that, if intentions are modelled appropriately, the RPD structure can aid an analyst to recognise system behaviour. We analysed the CTA interview data, described in Sect. 3, to identify the thought processes of analysts, including the questions they asked during their investigations and their requirements for responses. By breaking down the investigation scenarios and tasks in detail, we isolated specific questions that the analysts could ask of a CA, together with the system processes required to find an answer from available data and fulfil the task. Table 2 describes, for one analyst statement, how questions were elicited from the interview data and mapped to functional processes. Each of the processes describes a specific attribute that can be created as a function in code and associated with an aspect of the RPD model. For example, “search for people connected to the victim’s name” describes a function that delivers upon the ‘Actions’ aspect. These attribute functions are granules, where their overall requirement is decomposed into several parts, as described by Liu et al. [36]. Each row of attributes defines a complete intention, including all RPD aspects, so that a CA can match and trigger a response to the question posed. In this way, by combining the individual functions we construct higher level information granules that maintain the structure of the RPD model. We extracted questions, such as those presented in Table 2, for all the interview data and identified possible attributes for each RPD aspect. Using these questions and attributes, we dynamically modelled analyst intentions for searching and retrieving information with Formal Concept Analysis (FCA). FCA is an analysis approach which is effective at knowledge discovery and provides intuitive visualisations of hidden meaning in data [44]. FCA represents the subject domain through a formal context made of objects and attributes of the subject domain [45], where in our case objects are questions posed by the analyst and attributes are functional RPD aspects (Example questions and attributes shown in Table 2). In this way, FCA

Table 2 RPD mapping from interview statements (example from Interview 1) [53]

Transcript statement [CTA: Analyst 1, 11:30]: "We had no idea initially what the kidnap was for. We were searching associates, we looked for any previous criminal convictions, we spoke to neighbours, and telephone information for his phone. One of the neighbours had suspected he had been kidnapped, and a witness had seen him being bundled into a car and alerted the police because they knew he was vulnerable."

RPD mapping of the statement
  Goals: Understand the motive, the risk to the victim, and possible suspects
  Cues: Man gone missing. Thought he had been kidnapped due to witness report. Known to be vulnerable
  Expectancies: There is information for victim within existing databases
  Actions: Searched known associates, looked for previous convictions, spoke to neighbours and witnesses, looked at telephone information
  Why? To reduce scope of investigation and assess level of risk
  What for? To direct next steps of investigation and better use experience to recognise patterns

Extracted question: What people are associates of the victim?
  Goals: Find associates
  Cues: Victim name
  Expectancies: The victim knows the offenders
  Actions: Search for people connected to victim's name
  Why? To find potential suspects
  What for? So that inquiries can be made into suspects

Extracted question: Does the victim have any previous convictions?
  Goals: Find convictions
  Cues: Victim name
  Expectancies: The victim has been targeted before
  Actions: Search for convictions directly linked to victim's name
  Why? To understand past victimisation
  What for? To assess risk and inform prioritisation

Extracted question: What calls have involved the victim's phone?
  Goals: Find calls
  Cues: Victim phone number
  Expectancies: The victim has been involved in recent calls
  Actions: Search for calls involving phone number
  Why? To find recent communications
  What for? To identify possible leads or location


provides multi-level granularity, as described by Qi et al. [46], where we organise the lower level functional attribute granules into overall intention concept granules. The use of FCA means that we do not have to build bespoke intentions that the CA can fulfil. Instead, intention concepts can be derived from useful combinations of possible functional attributes. The concepts preserve a consistent modular structure that reflects the way the CA has recognised and responded to the question. By using FCA to identify distinct intention concepts, we can evolve intentions when a new attribute is developed, or when a new combination of attributes is desired. Therefore, the intention architecture is flexible and can evolve over time. Our approach derives distinct intention concepts, through FCA, to communicate system behaviour to users. This has much in common with well-known methods for granular computing, such as Rough Set Theory (RST) [47], and FCA has received attention in past granular computing literature [48–50]. FCA provides a summary of concepts within the data, where it seeks conjunctions, while RST defines concepts by disjunctions of properties [51]. We use FCA to fulfil a key aim of granular computing, to form, process, and communicate information granules that reflect CA intentions when a user interacts with the CA. However, rather than capturing objects by collecting granules that originate at the numeric level, such as by clustering their similarity or proximity, we apply a human factors model to derive distinct functions at the lowest level. These functions mirror the aspects of the RPD model. We then use FCA to combine granules into intention concepts, where each concept preserves the RPD model structure. The concept lattice, as shown in Fig. 2, presents distinct object groupings. The final layer of concept circles are complete intention concepts, where all parts of the RPD are considered. The circles are sized based upon the number of associated questions. We can see that three questions in our set can be answered by combining the highlighted attributes. For example, these attributes can answer the question, ‘how many vehicles are in our database?’, with ‘vehicle’ as a cue. The CA looks for adjacent information i.e. where there are instances of the class ‘vehicles’, it presents a summary count, and

Fig. 2 Concept lattice for RPD model intentions (computed and drawn with Concept Explorer [54]) [53]


outputs a list. To provide transparency, we propose that we can simply present descriptions of the goals and constraints, i.e. which attributes, and therefore functional processes, underpin a concept. Our model-agnostic and modular approach is akin to what Molnar [52] describes as the future of machine learning interpretability.
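As a concrete illustration of deriving intention concepts with FCA, the toy sketch below enumerates the formal concepts of a small question-by-attribute context; each resulting (extent, intent) pair corresponds to a node in a lattice such as Fig. 2. The questions and attribute names are simplified placeholders, not the full context used for Pan.

```python
from itertools import combinations

# Toy formal context: objects are analyst questions, attributes are the
# functional RPD aspects each question needs (illustrative only).
context = {
    "What people are associates of the victim?":
        {"cue:name", "action:find_linked_people", "goal:find_associates"},
    "Does the victim have any previous convictions?":
        {"cue:name", "action:find_linked_events", "goal:find_convictions"},
    "What calls have involved the victim's phone?":
        {"cue:phone", "action:find_linked_events", "goal:find_calls"},
}

all_attrs = set().union(*context.values())

def extent(attrs):
    """Objects that share every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes shared by every object in objs."""
    return all_attrs if not objs else set.intersection(*(context[o] for o in objs))

# Enumerate formal concepts by closing every subset of objects.
concepts = set()
objects = list(context)
for r in range(len(objects) + 1):
    for combo in combinations(objects, r):
        objs = extent(intent(set(combo)))   # closure of the object set
        concepts.add((frozenset(objs), frozenset(intent(objs))))

for ext, intn in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "->", sorted(intn))
```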

4.2 Implementing Interpretable Conversational Agent Intentions We have developed a prototype CA interface, called Pan [7], which responds to analyst questions by searching a graph database and returning relevant data, with descriptions and interactive visualisations. In our prototype, a user can ask a question and see a textual and graphical explanation of the results, as shown in Fig. 3. They can also choose to step into the system processes (Fig. 4). All data shown are fictional. We have used the concept lattice to define the intentions that Pan can trigger, where each intention reflects our explanation structure: the RPD model. Without the ability to break apart and inspect and verify these RPD aspects and the relationships between them, it would be difficult for a user to understand the system. We therefore meet the requirements for transparency outlined in our framework and provide interpretability of intention concepts through granular computing, with the explicit aim

Fig. 3 Pan interface response explanation


Fig. 4 Pan Interface system visibility

Fig. 5 System transparency framework for CA intentions (Adapted from Fig. 1)


to enhance recognition of system behaviours where the RPD aspects are the information granules required. Figure 5 presents how our intention modelling approach reflects the transparency framework.

5 Insightful Investigative Agents When analysts conduct investigations, they explore numerous lines of inquiry. By asking questions of the available data they gain a better comprehension of the situation, which ultimately informs insight. Past research suggests that insights can occur unexpectedly, where analysts construct a narrative that is informed by the results of leap of faith questioning and suppositions. If our objective is to develop a system that can automate this process of leap of faith questioning and lead an analyst to insights, then interpretability must be designed to provide the information granules from which insight can be derived. An analyst performs a single interaction with an intention concept when they ask Pan a question. We, therefore, provide interpretability of system processes with a single layer of abstraction that presents the functional attributes triggered, structured by the RPD model. There is a distinct reasoning stage, where the CA classifies an analyst's intention, then responds with appropriate processes. In our investigative agent there are a number of levels at which reasoning occurs. The system derives an environment for possible lines of inquiry and makes recommendations. In each line, data is passed from one intention to another, triggering the relevant processes. Interpretability therefore requires information at multiple levels of abstraction. There is a requirement to understand, inspect, and verify path recommendations and possibilities within the environment, at a higher level. At a lower level, an analyst needs to delve in to recognise the system processes within individual stages in a selected path, together with data that allows for storytelling and insight.

5.1 Modelling Information Granules for Investigation Pathways As an investigation proceeds, analysts ask questions that draw upon their previous findings. Rather than isolated and distinct queries, often there is a chain of questions and this translates to a line of inquiry. For example, if an analyst finds a new entity that they believe is relevant, such as a suspicious vehicle or phone number, they will naturally want to know more about the new entity. If a further question about the new entity also returns interesting new information, they will ask additional questions related to the new information. In following this approach an analyst creates a question network. We have explored how question networks are formed in previous


Fig. 6 Section of question network (Interview 1)

work [55] and Fig. 6 is an example of a question network for a kidnapping investigation, as described in a CTA interview (Sect. 3) and discussed in Tables 1 and 2. At the start of the investigation the analyst had information about the victim’s name and this was their cue for the inquiry. Initially they searched any available data for people, events, vehicles, and telephones, that were directly associated with the victim (Table 2). The analyst found events linked to the victim and looked for other people involved, learning that the victim had been “assaulted over a period of time by a fluid group of local boys.” [CTA, Analyst 1, 16:23] The analyst continued to question data, as shown by the network, until they identified possible locations to find the victim and prevent threat to life. A system that can automatically pursue lines of inquiry and retrieve potential locations could save an analyst time, by removing the need for manual processes. For example, analysts spend “quite a lot (of time) doing detective work, where a piece of intelligence was nothing on its own, but we needed to trawl data and find links to that.” [CTA, A4: 2.30] However, it was only possible for the analyst to gain crucial insights on the situation, such as to “understand the level of danger the victim was in” [CTA, A1, 17:55], because they were informed by information found from the intermediary steps in the line of inquiry. For example, the analyst “worked out that it wasn’t going to be a serious organised crime group. We knew it was a local kidnapping. Once we understood who he was (the victim), and his vulnerability, his history and bullying. It was most likely a group of local people. It changes the scope. It was understood that it was locals who had gone too far.


That helps you to hone in on the local landscape. If it was that they just disappeared out of the flat then you are scuppered, because where would you start?” [CTA, A1, 45.20] A recommendation from a system that suggests likely locations of the victim, without the reasoning and context that define the suggestions, would miss this insight. Making the investigative process faster or easier, through automation, can therefore reduce the benefits that come with actually performing the investigation. An information retrieval system needs to provide a sufficiently stimulating environment that maintains user engagement, whilst engendering trust and expertise. We want to support investigators by directing their attention to the right information granules at the right time. Each node in the network (Fig. 6) represents a question event, where an analyst includes cues in their question and performs some tasks according to their intention. In each event an analyst has options on how to process the results, represented by edges. We have designed Pan to be able to respond to questions such as those presented in the network (Table 2), to retrieve data from a semantic knowledge graph. Furthermore, Pan can capture a large question network as analysts interact with it. By allowing multiple analysts to interact with Pan, and expand the question network, we help mitigate experiential bias and broaden the scope of possible inquiries. We capture behaviour in a way that is agnostic to the underlying data instances and this mitigates data biases, whilst allowing the same behaviour model to be applied across sensitive environments and protected datasets. Additionally, by creating an environment for possible options from measurable interactions with the system, we avoid some of the pitfalls, including brittleness and ambiguity, that affects tailored models of real world variables. In order to capture the appropriate information granules to deliver a view of the data, it requires that details that are not relevant are ignored and a level of abstraction is provided that is aligned to the nature of the problem at hand [2]. To achieve this, Pan derives an abstract event tree (Fig. 7) from a large question network, where a node (question event) requires three components: an input i.e. the question subject (e.g. a phone number), a query class (e.g. people), and an intention. The intention defines the way in which the question will be processed. Each intention concept is interpretable, making use of the functional architecture structured by the RPD model, as described in Sect. 4.1. This allows us to concisely capture, consolidate, and represent the required system processes at each stage of the tree. We can make our event stages more, or less, domain specific by manipulating class granularity. In this way the possibilities for lines of inquiry are flexible and can evolve over time. Links are formed between questions if the results of one are subsequently queried in another. The event tree in Fig. 7 has been formed initially from analyst question networks for two CTA investigation scenarios, a kidnapping and a firearm dealing. We have demonstrated that, through abstracting analyst behaviour in this way, we can predict paths of inquiry for entirely new investigation scenarios [55]. In this case, the tree is created when a question is posed seeking communication events linked to a phone number. 
From this anchoring position, Pan constructs a tree for possible follow on inquiries, with branches and likelihoods drawn from the question network. There are numerous possible options and directions to take, where the results of each stage

Fig. 7 Section of event tree from a search for calls involving a phone number [55]


in the tree can be passed as inputs to the next. The event tree can be infinite if loops are present. For example, if other phone numbers are found to be involved in call events with the original number, then the analyst may want to find what additional call events involve these new numbers, and the tree is repeated. The entirety of the event tree, including repetitive steps, cannot be represented to a user concisely. This limits an analyst’s ability to interpret the possibilities in the tree, the relationships between different states, and any significant data inputs or results. There are barriers to both transparency and the ability of an analyst to identify key pieces of information with the context from which they can derive insight. From the event tree we can elicit where there are probabilistic symmetries. If the entire probability distribution over the atoms of future events unfolding from two different stages in the tree are the same, we can say these stages are in the same position [56]. This is the case where there are loops in the tree. One purpose of information granulation is to simplify the complexity of data, so that meaningful structures can be communicated more easily. We can achieve this by collapsing the tree to represent the positions, and relationships between them, as a Dynamic Chain Event Graph (DCEG). We capture the topology of an infinite staged tree with a DCEG, as described by Barclay et al. [56]. A DCEG provides a succinct explanation of the various stages available and their influences on one another, which could help us to achieve algorithmic transparency and analyst insight. We have translated the event tree in Fig. 7 to a DCEG, where we can represent the entire infinite tree with just 7 positions [55]. These positions, and the relationships between them, are the information granules required for interpretability of the possibilities and constraints within the environment, and where recognition is required of system processes. By clearly articulating the inputs and results at each stage we can provide the necessary cues and context for insight. There are numerous approaches that can be taken to explore, predict and recommend optimal lines of inquiry from our DCEG environment. For example, in recommender systems for sequential decision problems Markov Decision Processes may be an appropriate model [57]. One example is EventAction, which models sequences of events as a probabilistic suffix tree, based upon historic events, and applies a Markov Decision Process (MDP) and Thompson Sampling to compute and select a recommended action plan [58]. Under certain assumptions, a DCEG “corresponds directly to a semi-Markov process.” [56] Therefore, in a similar fashion to EventAction, it is possible to generate and select interesting lines of inquiry for an investigation from our DCEG.
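To illustrate the idea of collapsing a staged tree into positions, the following sketch merges stages whose labelled, weighted transitions unfold identically, using a simple partition-refinement (bisimulation-style) fixed point. It is a toy approximation of the probabilistic symmetry argument in [56]: the stages, intention labels, and probabilities are invented and are not taken from the chapter's investigation data.

```python
from collections import defaultdict

# stage -> list of (intention label, probability, next stage); an empty list
# means the line of inquiry stops at that stage. Values are illustrative only.
tree = {
    "s0": [("find_calls", 1.0, "s1")],
    "s1": [("find_people_for_phone", 0.6, "s2"), ("find_calls", 0.4, "s1")],
    "s2": [("find_vehicles", 0.5, "s3"), ("find_calls", 0.5, "s1")],
    "s3": [],
    # s4 unfolds exactly like s2, so it should collapse into the same position.
    "s4": [("find_vehicles", 0.5, "s3"), ("find_calls", 0.5, "s1")],
}

def positions(tree):
    """Merge stages whose labelled, weighted transitions lead to the same
    blocks, iterating until the partition stops changing."""
    block = {s: 0 for s in tree}                       # start with one block
    while True:
        signature = {
            s: frozenset((label, p, block[nxt]) for label, p, nxt in moves)
            for s, moves in tree.items()
        }
        groups = defaultdict(list)
        for s, sig in signature.items():
            groups[(block[s], sig)].append(s)
        new_block = {}
        for i, members in enumerate(groups.values()):
            for s in members:
                new_block[s] = i
        if new_block == block:
            return block
        block = new_block

print(positions(tree))   # s2 and s4 share a position; the loop via s1 is kept
```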

5.2 Implementing Interpretable Recommendations for Investigation Paths A system that can independently identify, recommend, and pursue lines of inquiry, performs multiple reasoning steps and actions in sequence and can be greatly beneficial. However, the DCEG environment from which options are chosen is inevitably


constrained. For example, if an investigator expects the agent to also identify other vehicles linked to an address, but there is no relationship in the environment to describe this, then the option will not be considered. An investigator has many requirements for system transparency so they are aware of what has, has not, and will not, be explored. They need visibility of the goals and constraints of the possible paths and any influences between states, explanations of recommended paths, visibility of the functional processes associated with individual states in each path, and explanations for the data fed from one state to another. Importantly, the state space should be visible so an investigator can critique how lines of inquiry are formed and directed. The DCEG provides a concise representation of this. We have developed a prototype system that simulates and recommends lines of inquiry from a DCEG, created through interactions with Pan. In our initial prototype, an analyst can explore possible paths and choose one to run, with recommendations based upon the likelihood that an analyst would select a path and the expected return from a path (Fig. 8). The likelihood is simply found by running simulations in the environment, where we know the probability with which different transitions are made from any given state. In an investigation it is important not just to consider typical inquiries. Exploring alternative paths would help challenge the scope of the investigation, and typical behaviors, as well as save time. For example, there may be limited information available from a preferred line, where pursuing related questions would provide little value, whereas an alternative path may have greater reward and therefore be more likely to gather the information required for insights. There should, therefore, be a consideration of the amount of information available from different lines of inquiry when selecting a path to recommend. Additionally, as we have described previously, a line of inquiry relies upon the information gleaned from previous questions. We should therefore not only consider the immediate reward, or information, returned by a question, but also whether this is likely to lead to more information at a later stage and ultimately deliver an end to the line of inquiry. Concurrently, we do not want to overload the analyst with too much information and it is fair to expect that, as an inquiry moves away from an anchor position, the information returned becomes less directly relevant. In our prototype, we apply a simple Reinforcement Learning (RL) approach. RL involves training an agent to take actions, within a defined environment, where the agent learns to maximise reward. In our case, the agent can take actions to transition between states in the DCEG, and the reward for each action is based upon the information found. We seek to recommend an investigation path based upon the relative ‘quality’, given the considerations described. To make an assessment of the ‘quality’ of a state transition, including both immediate and discounted future rewards, we apply Q-learning [59]. Q-learning is a model-free approach to find the appropriate actions (transitions) to take in different circumstances. It ignores the transition probability distribution underlying the environment and thus helps challenge typical behaviours, with an understanding of information value. 
Q-learning provides us with an action-value for the expected future reward for taking a given transition, from a given state, and helps indicate directions to explore. There are many other approaches to gather and recommend optimal lines of inquiry from the

Fig. 8 Section of recommendation interface (path selection)



environment we have created. In this chapter, we have selected an initial example for the purposes of developing and demonstrating a simple prototype to recommend investigation paths, and we will analyse other approaches further in future work. Key research questions to explore include: how can we define and reward paths that reflect the necessary conditions for insight? To answer this, we first need to understand what the required conditions for insight are. In our prototype, analysts can inspect and verify information granules that describe each path's ranking, including the end state and the steps to get there. They can also view explanations for the ranking, for example, the number and proportion of times a given path is run. When an analyst uses Pan to explore paths in an investigation they are seeking insight, rather than a single answer, where different options can be examined and compared. Therefore, throughout the interface we make use of rankings and we seek to recommend an appropriate leap of faith questioning path with a relative understanding of alternatives. Rankings allow us to translate otherwise abstract values, such as the information reward, into meaningful and comparable information granules, acknowledging that "contrastive reasoning is central to all explanations" [18]. For each path we provide an average for the expected reward of 'High', 'Medium', or 'Low', again to allow an analyst to compare paths. On selection of a path the relevant stages and influences in the DCEG are highlighted. The stages reflect the relative stage ranking for reward and quality, using a grid. The user can therefore easily compare and contrast paths and gain an intuitive understanding about how the reinforcement learning algorithm behaves. We propose that our provision of information reward helps tackle the 'black hole' problem, described by Wong and Varga [60], where missing data needs to be represented. In an inquiry it may be that there is no data available and this can provide important insight, for example that further information gathering is required. Rather than ignore missing data, by modelling analyst behaviour we can represent where a transition could be worthwhile or favourable, but is lacking in data. The analyst can step into each position in a path and inspect and verify the goals and constraints of system processes, explore result data and select input cues for the next position (Fig. 9). In Fig. 10, we present an extension to our earlier transparency framework (Fig. 1). Transparency is delivered through a hierarchy where information granules differ at

Fig. 9 Section of recommendation interface (running a path)


Fig. 10 Extended algorithmic transparency framework

different levels of abstraction. In our prototype recommender system, at the multistage level, our purpose is to identify useful leaps of faith for questioning and a user needs transparency to inspect and verify reasoning paths. These are observable in the state space, which describes the priorities, values, and constraints of different path policies. For each stage in a chosen path the analyst will need to delve deeper, to critique associated functions, as captured by the transparency framework for single stage reasoning. Additionally, for insight they will need to consider the underlying data, including how inputs are selected and passed from one state to another.
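To make the Q-learning step concrete, the sketch below runs tabular Q-learning over a handful of DCEG-style positions, where the reward stands in for the amount of information an intention returns. The environment, rewards, and hyperparameters are invented for illustration and do not reflect the prototype's actual reward design.

```python
import random

random.seed(0)

# state -> {action: (next_state, reward)}; reward stands in for the amount of
# information an intention returns, and "stop" ends the line of inquiry.
env = {
    "calls":    {"find_people": ("people", 2.0), "find_cells": ("cells", 0.5)},
    "people":   {"find_vehicles": ("vehicles", 1.0), "stop": (None, 0.0)},
    "cells":    {"stop": (None, 0.0)},
    "vehicles": {"stop": (None, 0.0)},
}

alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.2, 2000
Q = {s: {a: 0.0 for a in actions} for s, actions in env.items()}

for _ in range(episodes):
    state = "calls"
    while state is not None:
        # epsilon-greedy choice between exploring and exploiting
        if random.random() < epsilon:
            action = random.choice(list(env[state]))
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward = env[state][action]
        future = max(Q[next_state].values()) if next_state else 0.0
        # Q-learning update: immediate reward plus discounted future reward
        Q[state][action] += alpha * (reward + gamma * future - Q[state][action])
        state = next_state

for state, actions in Q.items():
    print(state, {a: round(v, 2) for a, v in actions.items()})
```

The resulting action-values rank transitions by immediate reward plus discounted future reward, which is the sense in which the chapter uses "quality" to compare possible lines of inquiry.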

6 Evaluation Studies We have undertaken two studies to evaluate our transparency framework and our implemented solutions. Initially, we conducted interviews with analysts in relation to a static prototype, where analysts were given examples of predefined responses from a CA to typical investigation questions. Analysts then answered a questionnaire and discussed their responses with a researcher. Following this, we conducted interviews with a different group of analysts who were asked to interact with a prototype CA and drive lines of inquiry themselves, in order to derive and test their own hypotheses. On completion of the interactive exercise, they were asked to consider a number of general questions, giving their responses verbally in discussion with a researcher. This interactive prototype approach allowed us to capture a better understanding of each analyst’s thought processes and appreciation of the information granules provided by the system, as well as a more honest assessment of their understanding and confidence in the system. The two studies are described in more detail in this section.


6.1 Static Prototype Evaluation Study: Interpretability Requirements Depend Upon the System Component [53] Purpose: We sought to understand the user requirements for transparency of a CA by probing participants with example interactions. Our aim was to capture the relative importance of system transparency, beyond explanations of results. We also wanted to understand the different parts of the system that analysts wanted to understand, together with the level of detail required. Methodology: We conducted four interviews with intelligence analysts different from those in our previous CTA study (described in Sect. 3), each with more than 10 years of operational experience. Each interview lasted an hour and we presented interviewees with a series of predefined questions and corresponding CA responses with two explanation conditions, switching the order of presentation. For one condition, responses explained the result data alone and, in the other condition, analysts were given an explanation of the result data together with visibility of the system processes. We were not trying to test the differences between the conditions; rather, we used them as a starting point from which we could explore additional needs through discussion. To analyse the statements we used an approach called Emergent Themes Analysis (ETA), as described by Wong and Blandford [61, 62], where broad themes, which are similar ideas and concepts, are identified, indexed and collated. ETA is useful for giving a feeling of what the data is about, with structure, and is fast and practical [63]. Results and Analysis: There were a total of 114 distinct utterances captured from the interviews, ranging from 24 to 34 per analyst. At first the researcher sought to identify the component part of the system being described by the analyst. For example, one analyst responded to the question, "What might you need to understand about an individual output?", with, "what is the information source and how complete is the database?" [Emergent Themes Analysis, Analyst 1] Here they described a need to understand the data component of the system. By encoding utterances in this way, we found that the concerns raised by analysts can be mapped to the different components in the system. The researcher then looked to encode the reasons why the analysts required understanding of the related components. For example, an understanding of the information source and the completeness of the database is required so that an analyst can clarify their understanding of the data. The coding of the overall framework area came naturally from this, to summarise the subthemes. In this case the overall requirement for understanding the data source and completeness is for clarification. The utterances were encoded by a single researcher, to ensure consistency. Our ETA provided an understanding of the various components of a CA that require explaining, of which system processes are just one. Additionally, analysts sought explanations of different components for different reasons. Table 3 presents the results of our ETA and the core understanding needs for each component of a CA. Where analysts expressed requirements for understanding the system processes, our transparency framework (Fig. 1) seems appropriate to capture their needs, and information granularity should be modelled with this in mind.


Table 3 CA component core understanding needs [53] (framework areas are those common for multiple analysts)

- Extracted Entities. Framework area: Clarification + Verification (3). Summary of sub-theme(s): More information of entities extracted for clarification and verification.
- CA Intention Interaction. Framework area: Clarification (3), Continuation (2). Summary: Clear language to understand classification (i.e. no confusing response metric) and information to support continuation of investigation.
- System Processes. Framework area: Continuation (4), Verification (4), Clarification (3), Exploration (2), Justification (2). Summary: User wants system understanding to support continuation of investigation, to allow them to verify processes are correct and explore them in more or less detail, and to justify their use/approach and constraints.
- Data. Framework area: Clarification (3). Summary: Clarification of data updates and source, and data structure to aid forming questions.
- Response. Framework area: Clarification (4), Justification (4), Exploration (2). Summary: Justification of response with underlying data, clarification of language (not trying to be human) and terminology, ability to explore results in more detail.

information granularity should be modelled with this in mind. Whilst justification of the data underlying a response is important, the analysts also indicated a desire to be able to inspect and verify system processes, with a focus on understanding the associated goals and constraints.
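As an illustration of how such utterance codings can be summarised, the following is a small Python sketch, not the study's analysis code, that tallies hypothetical (component, framework area) codes into per-component counts of the kind reported in Table 3. The coded utterances below are invented examples.

```python
# Illustrative sketch (not the study's analysis code): tallying ETA codes per
# CA component to produce counts like those in Table 3.
# The utterance codings below are hypothetical examples, not study data.

from collections import Counter, defaultdict

# Each coded utterance: (CA component, framework area)
coded_utterances = [
    ("Data", "Clarification"),
    ("Data", "Clarification"),
    ("Response", "Justification"),
    ("Response", "Exploration"),
    ("System Processes", "Verification"),
    ("System Processes", "Continuation"),
]

counts = defaultdict(Counter)
for component, area in coded_utterances:
    counts[component][area] += 1

for component, areas in counts.items():
    summary = ", ".join(f"{area} ({n})" for area, n in areas.most_common())
    print(f"{component}: {summary}")
```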

6.2 Interactive Prototype Evaluation Study

Purpose: Our aim was to further evaluate our transparency framework and to validate the effectiveness of an RPD structure for presenting the functions triggered by a CA, specifically whether it enhanced an analyst's ability to recognise and understand the system processes triggered when they interacted with the CA.

Methodology: 12 operational intelligence analysts were recruited from a range of organisations, with backgrounds that included the police, National Crime Agency (NCA), military, and prison service. The data from two experiments was not included in this study: one because the analyst did not have a comparable level of experience, and the other because the analyst could not properly see the prototype through the video conferencing software. The analysts included in this study had a minimum


of 3 years' experience in a full-time role involving network analysis. The analysts interacted with Pan to apply critical thinking and perform an investigation. The scenario was fictional but realistic, having been designed around one of the situations described in a previous CTA interview. We divided the analysts into two groups with roughly similar experience in terms of length and background. As with the static prototype experiments, for one condition analysts were given the ability to view textual responses from Pan, including details of the data that underpinned a response. They could visualise this data with a network graph. For the other condition, analysts could access and visualise the data in the same way; however, they were also encouraged to step into the triggered intention to inspect and verify the functional processes. All analysts were provided with the same information prior to the exercise and were able to ask the researcher questions throughout. The researcher could give a response if the analyst had seen the information previously in a question to Pan or in the initial brief. The researcher could also prompt an analyst to view system visibility, if they were in the group able to do so. Our prototype system included a range of capabilities for information retrieval and the analysts interacted with all of them. The capabilities covered a range of search complexities, from the simple retrieval of direct relationships, such as vehicles owned by a person, to more complicated and subjective analysis of similarities, for example, the organisations that are most similar to one another. The study was split into three parts:

• Interactive exercise: Analysts were asked to interact with a CA system to complete a network analysis investigation, to identify a potential suspect. A researcher shared their screen with each analyst individually, through virtual conferencing software, and typed the questions asked by the analyst into the system. The investigation scenario was realistic, based upon a real-life attempt to identify the owner of a mobile phone involved in an illegal firearm purchase. The researcher followed a checklist in each interview to ensure the same steps were covered for all analysts. After revisiting the briefing documents, the researcher asked the opening question, "What mobiles have been involved in call events with IDMOB1 (the phone of interest)?" The researcher used this opening question to introduce the different ways that the data could be explored, by expanding the text or visualising the network graph. For those with the transparency condition, the researcher also demonstrated how the intention concept could be displayed. The researcher then asked the analyst to perform their investigation, prompting them to verbalise any thoughts, considerations, or concerns they had after each interaction with Pan.
• Interview related to CA system: An interview followed each exercise in which various questions were asked about the CA system. One question posed to analysts on completion of the exercise was for each to explain what they might need to know about the system. We revisited this later in the interview, when we asked each analyst to reflect upon the interactive scenario, considering the importance of transparency provision from the perspective of the alternative condition, i.e. for those who were provided transparency throughout the exercise, we asked them to consider the impact of not having it, and for those who were not


provided transparency, the researcher gave them a quick demonstration of transparency provision, before asking whether it would have been useful during the scenario. By approaching the discussions in this way, we hoped to first capture the needs that analysts initially felt having been immersed in the context of a realistic investigation scenario. We then drew their attention to consider our provision of system transparency directly, so that changes in their assessment of need could be observed.
• Demonstration and interview related to the recommender system: The researcher gave a demonstration of the prototype recommender system. In doing so, they explained that by following the path with the highest expected reward (Fig. 8) from the opening question in the exercise, the analyst was led straight from 'IDMOB1' to a person, 'IDPaulRichards' (Fig. 9), who was the suspect chosen by analysts in most investigations. The demonstration was followed by interview questions, with focus upon potential uses of a recommender system in intelligence analysis and general concerns that need mitigating.

Results and Analysis: Comparing responses for the two groups to the post-exercise question, 'what might need to be known about the system behaviour?', we find that all those who were given transparency identified the importance of being able to inspect and verify the system goals and constraints. They understood that transparency was a necessity immediately after completing the interactive exercise. Those who were not provided transparency were unaware of the necessity initially, and their answers instead focussed upon what they wanted to understand about the data: for example, the need to have an explanation of the underlying data source, reliability, and relationships between entities. Only after being asked to reflect on the alternative condition, where their attention was drawn to the possibility that they could view the behaviour of the CA and the implications of this, did they recognise the importance. On reflection, all analysts commented that it was useful to inspect system behaviour to help gain an understanding of the goals and constraints. In other research, and in our interviews, there has been concern that system transparency increases cognitive load. Analysts who were given transparency had more information to digest with each interaction. Therefore, if this caused significant increases in cognitive load, we might have expected those who were provided with transparency to take longer to complete our investigation exercise than those who were not. However, this was not the case. We found that analysts with full transparency were faster to complete the exercise than those without. By providing visibility as an optional popup and representing the goals and constraints of system processes, structured as information granules that reflect the RPD model, there were not significant delays between interactions with the system. While both groups understood the scenario and came to an evidence-based conclusion, those with full transparency demonstrated a much better understanding of the system. These results in some way reflect the real-world nature of the experiment, where there are many interrelated aspects that culminate in insight, learning, and confidence to complete a task. Cognitive load is not the only influence on making fast and efficient decisions from data analysis. We propose that our approach to represent system processes with


the RPD structure aids analysts to recognise the system behaviour, and thus better understand the processes with confidence. Nearly all analysts raised critical concerns that would need to be addressed before a recommender system could be used to suggest lines of inquiry. The analysts commented that these issues were mitigated by our system design. An analyst needs to understand the behaviour of the system, to provide them with confidence and accountability through auditing the processes applied. The key enabler of this is an ability to inspect and verify system processes, including the goals and constraints, at every level of reasoning as reflected in our extended transparency framework (Fig. 10). During the interactive exercises, the analysts without transparency could not inspect the system constraints and therefore missed important insights or drew incorrect conclusions. This is an interesting finding when we consider how to design systems that can lead analysts to insight, such as our recommender of paths for inquiry. In intelligence analysis, an awareness of gaps in the data, or what data is not being shown by a given intention, is a driver for insight. It must, therefore, be clear exactly what processes have been triggered and their limitations. At each stage in the path an analyst needs to inspect and verify information granules describing the processes, so that they do not miss key insights.
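As a purely illustrative sketch of the 'highest expected reward' path selection mentioned in the recommender demonstration above, the following Python fragment greedily follows the intention with the largest estimated reward at each step. The intentions, reward values and resulting entities are hypothetical and do not reflect Pan's implementation or the study data.

```python
# Minimal sketch (not the authors' implementation): greedy selection of a line of
# inquiry by expected reward. All intention names, rewards and transitions are
# hypothetical.

from typing import Dict, List, Tuple

# expected_reward[(current_entity, intention)] -> (estimated reward, resulting entity)
expected_reward: Dict[Tuple[str, str], Tuple[float, str]] = {
    ("IDMOB1", "phones_in_call_events"): (0.4, "IDMOB2"),
    ("IDMOB1", "registered_owner"): (0.9, "IDPaulRichards"),
    ("IDMOB2", "registered_owner"): (0.3, "IDUnknown"),
}

def greedy_path(start: str, steps: int) -> List[Tuple[str, str, float]]:
    """Follow the intention with the highest expected reward at each step."""
    path, current = [], start
    for _ in range(steps):
        options = {k: v for k, v in expected_reward.items() if k[0] == current}
        if not options:
            break
        (entity, intention), (reward, nxt) = max(options.items(), key=lambda kv: kv[1][0])
        path.append((entity, intention, reward))
        current = nxt
    return path

print(greedy_path("IDMOB1", steps=3))
```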

7 Conclusion

In this chapter, we have described the necessity for, and our approach to develop, a framework for designing transparent reasoning systems. We have extended the framework to consider multiple levels of reasoning and have presented how it can be implemented in the design of two prototypes: one a CA for retrieving information to support criminal investigations, called Pan, and the other a recommender system to advise on lines of inquiry. We have performed an initial evaluation of the prototypes, with a series of interviews and experiments. Our transparency frameworks are validated by our findings, where an understanding of the goals and constraints of system processes is recognised as being of critical importance, alongside explanation of the results returned by an algorithm. Furthermore, we have found that the very provision of transparency promotes the need for it. Conversely, if it is not provided, the necessity for transparency may not be clearly recognised. This is an interesting consideration for other AI systems: just because users may not request transparency of system processes does not mean it is unimportant. Only once it is provided can users answer whether or not it is needed. Additionally, without an understanding of the system behaviour, analysts missed key insights when performing their investigations. This suggests that transparency is not merely required for auditing systems, although this is certainly important, but is also required so that AI systems can be used effectively and efficiently. Both of our prototypes tailor the representation of information granules to the goals and constraints of system processes within the broader context of the purpose the


system is trying to achieve. We have applied human factors research to understand the appropriate structures with which to provide this information. Our approach has been validated by analyst interviews, where analysts who were provided with transparency demonstrated a much better understanding of the system processes and completed the investigation task more quickly. We suggest that a human-centered granular computing approach is key to provide concise interpretability of important information, if framed by the context of the situation requiring interpretation. There are many areas to explore in future work, not least the specific needs for analysts to derive insight. With a more detailed understanding and modelling of the key pieces of information and connections that create insights, as an analyst pursues a line of inquiry, we can develop agents that reason and prioritise actions to inform more helpful recommendations. We will analyse the data collected during our interactive exercises to try to identify the conditions for insight and hypothesis generation, and consider how these can be reflected in our recommender prototype. The analysts involved in studies to date were highly experienced at performing network analysis tasks, so it would also be interesting to consider what differences arise with inexperienced analysts, or non-expert users. We propose that our transparency framework and approach to design HCAI is applicable to other AI systems, not just conversational agents or recommender systems, and we will look to demonstrate this in the future.

© Crown Copyright (2020), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit https://www.nationalarchives.gov.uk/doc/open-government-licence/ version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: [email protected].

References

1. Shaw, D.: Crime solving rates 'woefully low', Met Police Commissioner says. BBC (2019). https://www.bbc.co.uk/news/uk-48780585. Accessed 3 Sep 2020
2. Pedrycz, W.: Granular computing for data analytics: a manifesto of human-centric computing. IEEE/CAA J. Autom. Sin. 5(6), 1025–1034 (2018)
3. Shneiderman, B.: Human-centered artificial intelligence: reliable, safe & trustworthy. Int. J. Hum.-Comput. Interact. 36(6), 495–504 (2020)
4. Burnett, M.: Explaining AI: fairly? well? In: Proceedings of the 25th International Conference on Intelligent User Interfaces (IUI '20), Cagliari, Italy (2020)
5. Hepenstal, S., Kodagoda, N., Zhang, L., Wong, B.L.W.: Algorithmic transparency of conversational agents. In: IUI Workshops, ATEC, Los Angeles (2019)
6. Chen, Z., Yan, N.: An update and an overview on philosophical foundation of granular computing. In: IEEE International Conference on Granular Computing (GrC-2010), San Jose, CA (2010)
7. Hepenstal, S., Zhang, L., Kodagoda, N., Wong, B.L.W.: Pan: conversational agent for criminal investigations. In: Proceedings of the 25th International Conference on Intelligent User Interfaces Companion (IUI '20), Cagliari, Italy (2020)
8. Wong, B.L.W., Kodagoda, N.: How analysts think: inference making strategies. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Los Angeles (2015)
9. Radziwill, N., Benton, M.: Evaluating quality of chatbots and intelligent conversational agents (2017). arXiv:1704.04579
10. Logan, D.: Known knows, known unknowns, unknown unknowns and the propagation of scientific enquiry. J. Exp. Bot. 60(3), 712–714 (2009)
11. Leslie, D.: Understanding artificial intelligence ethics and safety: a guide for the responsible design and implementation of AI systems in the public sector. The Alan Turing Institute, London (2019)
12. Duquenoy, P., Gotterbarn, D., Patrignani, N., Wong, B.L.W.: Addressing Ethical Challenges of Creating New Technology for Criminal Investigation: The VALCRI Project (2018)
13. Couchman, H.: Policing by Machine: Predictive Policing and The Threat to Our Rights. Liberty, London (2019)
14. Babuta, A., Oswald, M.: Data Analytics and Algorithmic Bias in Policing. RUSI, London (2019)
15. Ezer, N., Bruni, S., Cai, Y., Hepenstal, S., Miller, C., Schmorrow, D.: Trust engineering for human-AI teams. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Seattle (2019)
16. Lipton, Z.: The mythos of model interpretability. In: ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY, USA (2016)
17. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning. https://arxiv.org/abs/1702.08608 (2017)
18. Hoffman, R., Klein, G., Mueller, S.: Explaining explanation for "Explainable AI". In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Philadelphia (2018)
19. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": explaining the predictions of any classifier. CoRR abs/1602.04938 (2016). arXiv:1602.04938
20. Weller, A.: Challenges for transparency. CoRR abs/1708.01870 (2017). arXiv:1708.01870
21. Hepenstal, S., McNeish, D.: Explainable artificial intelligence: what do you need to know? In: Schmorrow, D., Fidopiastis, C. (eds.) Augmented Cognition. Theoretical and Technological Approaches. HCII 2020. Lecture Notes in Computer Science, vol. 12196
22. Spiegelhalter, D.: Should we trust algorithms? Harv. Data Sci. Rev. 2(1) (2020)
23. Abdul, A., von der Weth, C., Kankanhalli, M., Lim, B.Y.: COGAM: measuring and moderating cognitive load in machine learning model explanations. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20), Association for Computing Machinery (2020)
24. Zadeh, L.: Granular computing and rough set theory. In: Rough Sets and Intelligent Systems Paradigms. Springer, Berlin (2007)
25. Pedrycz, W.: Granular computing: an introduction. In: Joint 9th IFSA World Congress and 20th NAFIPS International Conference, Vancouver (2001)
26. Su, R., Panoutsos, G., Yue, X.: Data-driven granular computing systems and applications. Granul. Comput. (2020)
27. Pedrycz, A., Hirota, K., Pedrycz, W., Dong, F.: Granular representation and granular computing with fuzzy sets. Fuzzy Sets Syst. 203, 17–32 (2012)
28. Cheng, Y., Zhao, F., Zhang, Q., Wang, G.: A survey on granular computing and its uncertainty measure from the perspective of rough set theory. Granul. Comput. (2019)
29. Zhang, Q., Xie, Q., Wang, G.: A survey on rough set theory and its applications. CAAI Trans. Intell. Technol. 1(4), 323–333 (2016)
30. Jianfeng, Z., Reniers, G.: Probabilistic Petri-net addition enabling decision making depending on situational change: the case of emergency response to fuel tank farm fire. Reliab. Eng. Syst. Saf. 200 (2020)
31. Zhang, C., Dai, J.: An incremental attribute reduction approach based on knowledge granularity for incomplete decision systems. Granul. Comput. 5, 545–559 (2020)
32. Liu, H., Cocea, M.: Fuzzy information granulation towards interpretable sentiment analysis. Granul. Comput. 2, 289–302 (2017)
33. Ejegwa, P.A.: Improved composite relation for pythagorean fuzzy sets and its application to medical diagnosis. Granul. Comput. 5, 277–286 (2020)
34. Endsley, M.: Toward a theory of situation awareness in dynamic systems. J. Hum. Factors Ergon. Soc. 37(1), 32–64 (1995)
35. Loia, V., D'Aniello, G., Gaeta, A., Orciuoli, F.: Enforcing situation awareness with granular computing: a systematic overview and new perspectives. Granul. Comput. 1, 127–143 (2016)
36. Liu, H., Gegov, A., Cocea, M.: Rule-based systems: a granular computing perspective. Granul. Comput. 1, 259–274 (2016)
37. Hepenstal, S., Wong, B.L.W., Zhang, L., Kodagoda, N.: How analysts think: a preliminary study of human needs and demands for AI-based conversational agents. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Seattle (2019)
38. Klein, G., Calderwood, R., MacGregor, D.: Critical decision method for eliciting knowledge. Trans. Syst., Man, Cybern. 19(3), 462–472 (1989)
39. Klein, G.: A recognition-primed decision (RPD) model of rapid decision making. In: Klein, G.A., Orasanu, J., Calderwood, R., Zsambok, C.E. (eds.) Decision Making in Action: Models and Methods, pp. 138–147 (1993)
40. Pedrycz, W.: Granular computing: an introduction. In: Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569), Vancouver, BC, Canada (2001)
41. Pedrycz, W., Skowron, A., Kreinovich, V.: Handbook of Granular Computing. Wiley (2008)
42. Gerber, M., Wong, B.L.W., Kodagoda, N.: How analysts think: intuition, leap of faith and insight. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Washington DC (2016)
43. Wong, B.W., Seidler, P., Kodagoda, N., Rooney, C.: Supporting variability in criminal intelligence analysis: from expert intuition to critical and rigorous analysis. In: Societal Implications of Community-Oriented Policing and Technology, pp. 1–11 (2018)
44. Andrews, S., Akhgar, B., Yates, S., Stedmon, A., Hirsh, L.: Using formal concept analysis to detect and monitor organised crime. In: Larsen, H.L., Martin-Bautista, M.J., Vila, M.A., Andreasen, T., Christiansen, H. (eds.) Flexible Query Answering Systems. Lecture Notes in Computer Science, vol. 8132, pp. 124–133 (2013)
45. Qazi, N., Wong, B.L.W., Kodagoda, N., Rick, A.: Associative search through formal concept analysis in criminal intelligence analysis. In: Institute of Electrical and Electronics Engineers (IEEE) (2016)
46. Qi, J., Wei, L., Wan, Q.: Multi-level granularity in formal concept analysis. Granul. Comput. 4, 351–362 (2019)
47. Benítez-Caballero, M.J., Medina, J., Ramírez-Poussa, E.: Attribute reduction in rough set theory and formal concept analysis. Lect. Notes Comput. Sci. 10314, 513–525 (2017)
48. Singh, P.K., Aswani Kumar, C.: Concept lattice reduction using different subset of attributes as information granules. Granul. Comput. 2, 159–173 (2017)
49. Dubois, D., Prade, H.: Bridging gaps between several forms of granular computing. Granul. Comput. 1, 115–126 (2016)
50. Priya, M., Aswani Kumar, C.: An approach to merge domain ontologies using granular computing. Granul. Comput. (2019)
51. Yao, Y., Chen, Y.: Rough set approximations in formal concept analysis. Lect. Notes Comput. Sci. 4100, 285–305 (2004)
52. Molnar, C.: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Bookdown (2020)
53. Hepenstal, S., Zhang, L., Kodagoda, N., Wong, B.L.W.: What are you thinking? Explaining conversational agent responses for criminal investigations. In: Proceedings of the IUI Workshop on Explainable Smart Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC'20), Cagliari, Italy (2020)
54. Yevtushenko, S.A.: System of data analysis "Concept Explorer", Russia (2000)
55. Hepenstal, S., Zhang, L., Kodagoda, N., Wong, B.L.W.: Providing a foundation for interpretable autonomous agents through elicitation and modeling of criminal investigation pathways. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Chicago (2020)
56. Barclay, L., Smith, J., Thwaites, P., Nicholson, A.: The dynamic chain event graph. Artif. Intell. (2013)
57. Shani, G., Heckerman, D., Brafman, R.I.: An MDP-based recommender system. J. Mach. Learn. Res., 1265–1295 (2005)
58. Du, F., Plaisant, C., Spring, N., Crowley, K., Shneiderman, B.: Eventaction: a visual analytics approach to explainable recommendation for event sequences. ACM Trans. Interact. Intell. Syst. 9(4) (2019)
59. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction, 2nd edn. The MIT Press, Cambridge, Massachusetts; London, England (2015)
60. Wong, B.L.W., Varga, M.: Black holes, keyholes and brown worms: challenges in sense making. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Boston (2012)
61. Wong, B.L.W., Blandford, A.: Describing situation awareness at an emergency medical dispatch centre. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Los Angeles (2004)
62. Wong, B.L.W., Blandford, A.: Analysing ambulance dispatcher decision making: trialing emergent themes analysis. In: Vetere, F., Johnson, L., Kushinsky, R. (eds.) Ergonomics Society of Australia, Canberra, Australia (2002)
63. Kodagoda, N., Wong, B.L.W., Khan, N.: Cognitive task analysis of low and high literacy users: experiences in using grounded theory and emergent themes analysis. In: Human Factors and Ergonomics Society Annual Meeting Proceedings, San Antonio (2009)

RYEL System: A Novel Method for Capturing and Represent Knowledge in a Legal Domain Using Explainable Artificial Intelligence (XAI) and Granular Computing (GrC)

Luis Raúl Rodríguez Oconitrillo, Juan José Vargas, Arturo Camacho, Alvaro Burgos, and Juan Manuel Corchado

Abstract The need for studies connecting the machine's explainability with granularity is very important, especially for a detailed understanding of how data is fragmented and processed according to the domain of discourse. We develop a system called RYEL based on subject-matter expertise about the legal case process, facts, pieces of evidence, and how to analyze the merits of a case. Through this system, we study the Explainable Artificial Intelligence (XAI) approach using Knowledge Graphs (KG) and enforcement unsupervised algorithms whose results are expressed in an Explanatory Graphical Interface (EGI). The evidence and facts of a legal case are represented as knowledge graphs. Granular Computing (GrC) techniques are applied in the graph when processing nodes and edges using object types, properties, and relations. Through RYEL we propose new definitions for Explainable Artificial Intelligence (XAI) and Interpretable Artificial Intelligence (IAI) in a much better way, which will help us to cover a technological spectrum that has not yet been covered and promises to be a new area of study which we call Interpretation-Assessment/Assessment-Interpretation (IA-AI), consisting not only in explaining machine inferences but also the interpretation and assessment made by a user according to a context. A new focus-centered organization is proposed in which XAI-IAI will be able to work, and it will allow us to explain in more detail the method implemented by RYEL. We believe our system has an explanatory and interpretive nature and could be used in other domains of discourse; some examples are: (1) the interpretation a doctor has about a disease and the assessment of using a certain medicine, (2) the interpretation a psychologist has of a patient and the assessment for applying a psychological treatment, (3) or how a mathematician interprets a real-world problem and makes an assessment about which mathematical formula to use. However, here we focus on the legal domain.

Keywords RYEL · Explanatory Graphical Interface (EGI) · Interpretation-Assessment/Assessment-Interpretation (IA-AI) · Explainable Artificial Intelligence (XAI) · Granular Computing (GrC) · Explainable legal knowledge representation · Case-Based Reasoning (CBR)

L. R. R. Oconitrillo (B) · J. J. Vargas · A. Camacho · A. Burgos: Universidad de Costa Rica, San José, Costa Rica
J. M. Corchado: Universidad de Salamanca, Bisite Research Group, Salamanca, Spain; Air Institute, IoT Digital Innovation Hub, Salamanca, Spain; Department of Electronics, Information and Communication, Faculty of Engineering, Osaka Institute of Technology, Osaka, Japan; Pusat Komputeran dan Informatik, Universiti Malaysia Kelantan, Kelantan, Malaysia

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_12

1 Introduction

There is a need for a computational framework that allows capturing and representing granules of information about what a human interprets and assesses from real-world objects in a legal domain. In [1] it is explained that a natural granulation process is performed by humans when perceiving and conceptualizing real-world objects. In our case, this means the moment when a judge takes the facts and evidence provided in the trial and converts them into little pieces of concepts called granules. According to [2], there are three concepts involved in human cognition when perceiving real-world objects: (1) organization, which means integrating information as a whole, (2) decomposition, which is taking the whole and dividing it into parts called granules, and (3) considering the cause and the effect produced by the information (causation). In this investigation, these points correspond to the cognitive effort a judge must make to (1) organize the facts and evidence of an expedient, (2) divide the case into scenarios and analyze each fact with its evidence granularly, and (3) consider the causes derived from proven facts and the laws which define sanctions. The following research question then arises: how can we capture and represent the interpretation and assessment processes that a judge makes of the annotations of a file, and apply machine learning to said processes to generate recommendations, prior to the resolution of a case, related to jurisprudence, doctrine, and norms in different legal contexts? Our objective is to explore a new granular spectrum that has been little studied and refers to how a human interprets and assesses objects and relationships according to their perception, with a machine able to capture and process that information through an interface that explains it graphically [3]; we call this interface


Explanatory Graphical Interface (EGI) and the new spectrum is called InterpretationAssessment/Assessment-Interpretation (IA-AI) that consists not only in explaining machine inferences but the interpretation and assessment from a user according to a context, in this case, is the law. This promising research area is closely related to Human-centric [4–8] research, therefore it can be used in other scientific areas, for example, mathematics, physics, medicine. IA-AI implies to redefine what Explainable Artificial Intelligence (XAI) and Interpretable Artificial Intelligence (IAI) is. We define the first as the processes and interfaces whose design [8, 9] accessibility [10] and explicability [11] allows to provide a systematic explanation and assessment [12] of data in a certain domain of discourse; is knowledge-oriented [13] and accountable [14]. The second one is defined as those techniques oriented to simulate, represent, express, and manipulate granular data that symbolize an expert’s perspective of the domain of discourse in a given context; where the perspective may be different depending on expert, field, or domain of discourse for a particular purpose. These new definitions allow us to express the essence and characteristics of our system. We present RYEL a novel hybrid system in legal domain [15–18], which means, the use of different machine learning techniques to carry out the processes of explanation and interpretability with EGI about legal cases scenarios using Granular Computing [1, 5, 19–22]. RYEL uses focus-centered organization fundamentals, this means the organization of XAI and IAI should be done and focus according to the perspective and approach an expert has in a domain of discourse, this means, it is human-centric [4, 5]. In this way we communicate novel progress in XAI and Granular Computing [5, 6, 22–25]. This system has the ability to allow the user (judge) to express granularly what they understand from the real world (facts and evidences) through images and relationships that contain legal information about types of objects and properties. The system takes graphical information and translates it into Knowledge Graphs (KG) [26, 27] and then store it in property graph [28–31] format. Each graph represent a case in the case-based library [32, 32, 33]. There are edges that connect one node to another. Both nodes and edges contain labels and properties. Using Case-Based Reasoning (CBR)’s methodology [34–36] the system allow user to retrieve similar scenarios from previous cases by propounding a graphic arrangement of objects that represent facts and pieces of evidence (nodes) and their relationships (edges) according to their interpretation of a given legal scenario. In this way, the judge can make queries at a granular level to obtain similar scenarios form previous cases and reuse applicable law to new ones. Experimentation has been carrying out using representative scenarios of real legal cases on criminal law chosen by subject-matter experts, which helped to reduce the number of cases that would initially have been necessary to carry out experimentation. We have managed 113 legal case scenarios from different countries: 83 from Costa Rica, 25 from Spain, and 5 from Argentina. The case-Based library continues rising as conducting more laboratory testing and experimentation with subject-matter experts. 
Experiments have consisted of letting judges from Costa Rica, Spain, and Argentina to use RYEL in more or less similar circumstances, to capture and represent the interpretation and assessment they make of the facts and evidence contained


in scenarios of a criminal case, in order to analyze interpretation patterns about the laws that were used and related to proven facts (felonies). Then, through an EGI, the system explains to the expert the causes or purposes of using a given law drawn from previous legal case scenarios. This article is organized as follows: (1) related work, (2) XAI and GrC with RYEL, (3) results, and (4) conclusions.

2 Related Work

In this investigation, the granularity processes follow a human-centric approach [4, 5], that is, the human is the focus of development; in this particular case the subject of investigation is the judge. In addition, GrC has become a cross-cutting activity in numerous domains and human activities [5]. Accordingly, below we review the related work with the closest bearing on our XAI-GrC approach. This will allow us to explain trends in how XAI and GrC mix together in our research. The main areas to cover are GrC, XAI, graphs, and application to the law domain. In Table 1 different techniques and investigations using granular computing are compared, and the focus or development tendency is indicated as either H (human) or P (computational process). This helps us to understand that there is research on breaking human knowledge into small pieces in order to process and represent them, as in [2]. Other research tries to obtain profile types from human knowledge, as in [46]. Studies like [37] work on representing basic fragments of knowledge. This indicates that studies are oriented not only to providing the human with a computational solution to a problem but also to studying the cognitive and perceptual aspects of the human. It can also be seen that very few investigations on granular processing in the legal domain currently exist; those in [38] and [39] focus on processing legal texts to create a hierarchy of laws and, to a lesser extent, on the judge's perception of the information in a trial expedient. Table 1 also indicates that most of the research is oriented towards data processing and towards taking advantage of granular processing to improve existing algorithms. However, despite the existence of several human-centered investigations, as shown in this table, there is no investigation about capturing and processing the information a human generates when interpreting and assessing real-world objects in the complexity of the legal domain, and then processing that information granularly to finally explain it to a domain expert. It can be concluded that there are few approaches oriented to granular processing of juridical knowledge from an expert in this domain whose framework has an eventual capacity to be applied to other areas of human knowledge due to its inherent cognitive nature. This panorama, far from being just a justification for this research, indicates the contribution to the state of the art that we intend to make by studying new spectra in the area of GrC and XAI, such as IA-AI, which was previously introduced.

Table 1 Granular computing (GrC) related work

- Zadeh [2] (Technique: TFIG; Focus: H). Key description: Attempts to define how a human granulates information when reasoning, using the concept of a generalized constraint, where a granule means a constraint that defines it.
- Yager and Filev [1] (Technique: OWA; Focus: P). Key description: Objects from text are described, ordered, and handled using values and aggregation operators; this means that words and numbers representing objects are processed by applying granularity.
- Barot and Lin [37] (Technique: PPKBS, PCKBS, GCKBS; Focus: H). Key description: Studies basic knowledge by managing partitions of knowledge structures. It proposes the management of knowledge approximations, theories of learning, and knowledge reduction used to represent the pieces of basic knowledge.
- Toyota and Nobuhara [38] (Technique: DD; Focus: P). Key description: Investigates hierarchical networking of Japanese laws, similar to [39], but, according to the authors, using morphological analysis and an index-like value to calculate degree distribution.
- Toyota and Nobuhara [39] (Technique: DD, CC; Focus: H). Key description: Seeks to find the relationships between Japanese laws and proposes a hierarchical network of laws using granular computing. The user can visually observe the relationships in the networks.
- Keet [40] (Technique: SG; Focus: P). Key description: Studies a semantically enriched granulation hierarchy using characteristics and levels in each hierarchy, and specifies a meaning that is distinguishable between a specific set of information granules.
- Wang et al. [41] (Technique: MCDM; Focus: H). Key description: Granularity is used to obtain values representing weights that express criteria that directly affect the results of a decision. Weights represent the effects, with a multiplier effect, in a fuzzy-type relationship.
- Mani [42] (Technique: RST; Focus: P). Key description: Uses two different domains with distinct granular rough concepts. The investigation seeks to characterize natural concepts in common between each domain.
- Bianchi et al. [43] (Technique: GE; Focus: P). Key description: Makes cell structure classifications using nodes in the form of graphs representing granular elements of a cell. Classification is done using embedding algorithms with synthetic data.
- Skowron et al. [21] (Technique: IGrC; Focus: P). Key description: Granular computing in calculations produced by agent interaction, aiming to obtain better control over the trajectory performed by the agent.
- Miller et al. [44] (Technique: OWA/FL; Focus: H). Key description: The work is about modeling the granularity of words and concepts in the decision making of an expert in cybersecurity, for which weight values are used, with fuzzy foundations to handle data about agreements.
- Parvanov [45] (Technique: HS; Focus: P). Key description: Hybrid systems using fuzzy logic and neural networks for data and model interpretability, used to process the granular information.
- Denzler and Kaufmann [46] (Technique: W2V, CS; Focus: P). Key description: Searches texts, with granularity, for information to create people's knowledge profiles and to detect topics and the people related to them.
- Liu and Cocea [47] (Technique: Bagging; Focus: P). Key description: Use of statistical classification and regression called Bootstrap Aggregating, carrying out small training stages per learning algorithm, which according to the authors are a basis for granular processing.

Technique abbreviations: TFIG = Theory of Fuzzy Information Granulation, OWA = Ordered Weighted Average Operator, Bagging = Bootstrap Aggregating, HS = Hybrid System, MCDM = Decision Making with Multiple Criteria, IGrC = Interactive Granular Computing, FL = Fuzzy Logic, RST = Rough Set Theory, GE = Graph Embedding, PPKBS = Pawlak Partition Knowledge Base, PCKBS = Pawlak Covering Knowledge Base, GCKBS = General Covering Knowledge Base, SG = Semantics of Granularity, DD = Degree Distribution algorithm, CC = Closeness Centrality algorithm, W2V = Word2Vec Algorithm, CS = Cosine Similarity. Focus: H = Human, P = Process.

Table 2 shows the most used explainable artificial intelligence techniques. From this table it can be distilled that most investigations on explainable artificial intelligence are oriented to explaining the predictions of black-box machine learning models and their many variations, for example as in [50] and [52], or preloaded data with linear foundations as in [51]. Some, as in [51], try to make the explanation available to humans through natural language. However, most of them lack a flexible interface with the ability to be sufficiently explanatory to allow an expert to work with the semantics, relationships, types, and data properties belonging to the domain of discourse, and at the same time help experts with decision-making. From Table 2 it can be synthesized that the vast majority of explainability techniques in machine learning emphasize black-box aspects of supervised methods. However, depending on the domain of the expert, the data type and the processes, unsupervised methods [30, 31, 54, 55] also deserve attention for explainability, due to their heuristic nature for problem-oriented self-discovery, which may have a level of complexity equal to or greater than that of supervised or reinforcement techniques.


Table 2 Explainable Artificial Intelligence techniques

- Mayr et al. [48] (Technique: GAM). Algorithm type: A generalized linear model whose predictor depends linearly on specific functions of some variables, and which focuses on inference about these functions. It was originally created in [49] to blend properties of generalized linear models. The idea of the model is to relate a response variable Y with some predictor variables Xi and to explain relations in this way.
- Alvarez-Melis and Jaakkola [50] (Technique: LIME). Algorithm type: Explains small or individual predictions from black-box machine learning models. It uses a surrogate model; this means that, instead of training a global model, LIME focuses on training local surrogate models to explain individual predictions.
- Ehsan et al. [51] (Technique: RZ). Algorithm type: Explains the behavior of an autonomous system; for example, [51] uses a neural network to translate the internal state-action representations of an autonomous agent into natural language.
- Montavon et al. [52] (Technique: LRP). Algorithm type: Uses a set of propagation rules designed to explain how a prediction propagates backward in a neural network.
- Arnaud and Klaise [53] (Technique: CF). Algorithm type: Explains a causal situation, for example, "If A had not occurred, B would not have occurred". It requires using a hypothetical reality contradicting an observed fact.

Abbreviations: LRP = Layer-Wise Relevance Propagation, CF = Counterfactual, LIME = Local Interpretable Model-agnostic Explanations, RZ = Rationalization.

3 RYEL: Explainable Artificial Intelligence and Granular Computing

RYEL works with specific concepts that need a detailed approach in order to understand the type of information about the judge's knowledge that is captured and processed by the system. We then explain in detail how the XAI and GrC processes are performed using the captured knowledge. In this investigation, (1) understanding is defined in terms of (2) the perception, (3) perspective and (4) interpretation a judge makes of the facts and evidence from an expedient. These 4 concepts are described below. Perception, in [56], is a mental process that involves activities such as memory, thought and learning, among others, which are linked to the state of mind and its organization. This definition means that perception is a subjective state, because a person captures and understands in her own way the qualities of facts and objects of the world external to her mind. The judge may have a different perception of the information in one expedient than in another. For example, a judge in the Criminal


Court captures, learns and mentalizes that the “Principle of Innocence” must be used with a defendant, which presumes a state of not being guilty until proven otherwise, however, the same principle will never be applied by a judge of the Domestic Violence Court who has grasped, learned and made aware that a defendant from the beginning is an alleged aggressor given a situation of vulnerability in the victim. In [57] the perspective is the point of view from which an issue is analyzed or considered. This is how perspective can influence the person’s perception or judgment. This means that perspective is the point of view in which one could explain how a judge perceives things. In this way, depending on the field of work, the annotations of a case (information from expedient) can be analyzed using a different perspective from others, for example, a judge of the Domestic Violence Court can see the action of hitting a person is very serious, while a judge in the Tax Criminal Investigation Court may see it not so serious or even minimize it. Interpretation, according to [58], means conceiving or expressing a reality personally, or attributing a certain meaning to something. With this definition we can explain that a judge can conceive an annotation in the expedient according to his personal reality and attribute, then giving a meaning, for example, in a criminal court, the blow to a person can be interpreted by a judge as an act of personal defense, while another judge, from the same court, may interpret it as an act of aggression or guilty injury. Perception and perspective can influence the interpretation of an annotation. For example, a judge may interpret the facts and evidence one way and then another according to the stage of the trial in which he may be working because he learned something new that he did not know, or changed the point of view from which valued an entry. The change in interpretation could have an impact on the decisions that a judge makes at the end and on the way in which annotations are valued. Then judge understanding of something depends on the perception, perspective and interpretation that he makes of what he studies in a case. This allows to clarify the concept of interpretation in order to focus on it. The judge uses RYEL to present, organize, and relate information on facts, evidence, jurisprudence, doctrine, and norms derived from what the parties argue, so that he can express what he interprets from them and it is precisely the type of information the system captures and stores. This will be done using EGI working with Knowledge Graph (KG) [59–61] to manipulate and preserve all of that information. The framework used by the system does not consider a judge like an automatic entity that will always generate the same solution for a specific case. Therefore, it is necessary to clarify the system does not intend to clone in any way the thought of a judge to be applied as a solution indiscriminately to other case sentences, rather, we want to work with the interpretation of facts and evidence to help with the analysis of a legal case when dictating a sentence. The sentences contain judge’s decisions and usually it relates concepts to find meaning to the information provided by the parties, for example, relating evidence to facts to determine whether they are true or not. In this way a network is generated and could be visualized as a Semantic Network [62].


The groupings and relationships of information contained in the scenarios of a case can be seen as a network that expresses a meaning. In [63] a Semantic Network is explained as a directed graph that is made up of nodes, links or arcs, and labels on the links. Nodes describe physical objects, concepts, or situations. Links are used to express relationships between objects. Link labels specify a particular relationship between objects. Relationships are seen as a way of organizing the structure of knowledge. A Knowledge Graph is a data structure in the form of a graph used by the system to capture the judge's knowledge while the interpretation and assessment of the facts and evidence contained in the scenarios of a case are taking place. Graphs include data on the properties, types, and relationships of entities, where an entity can be an object, person, fact, proof or law. According to [26], knowledge graphs have become predominant both in industry and in academic circles in recent years, because they are one of the most effective approaches to efficient knowledge integration. In [27] it is explained that a knowledge graph is a type of semantic network, but the difference lies in the specialization of knowledge and in a specific set of relationships that must be created; that is, the structure of knowledge will depend on the domain of application, and the structure of the graph will change according to the knowledge that is expressed. Having explained the above, we next describe the fundamentals of the RYEL architecture for handling the judge's knowledge and some details of the implementation; a minimal sketch of such a property-graph structure is given below.
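The following is a minimal, illustrative Python sketch, not RYEL's actual code, of a property-graph structure of the kind just described (nodes, edges, labels, properties) and of the simple transformation in which drawn images become nodes and arrows become edges. All class names, labels, relationship types and properties are hypothetical.

```python
# Minimal illustrative sketch (not RYEL's implementation) of a property graph
# holding facts, evidence, and their relationships. All names are hypothetical.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    label: str                      # e.g. "Fact", "Evidence", "Person", "Law"
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str                   # e.g. "PROVEN_BY", "PART_OF_SCENARIO"
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, source: str, target: str, rel_type: str, **props: str) -> None:
        self.edges.append(Edge(source, target, rel_type, dict(props)))


# Example: one fact of a homicide scenario proven by one piece of evidence,
# mirroring how a drawn image (node) and arrow (edge) could be stored.
graph = PropertyGraph()
graph.add_node(Node("fact-1", "Fact", {"text": "Victim was stabbed", "scenario": "S1"}))
graph.add_node(Node("evid-1", "Evidence", {"text": "Knife with fingerprints"}))
graph.relate("fact-1", "evid-1", "PROVEN_BY", importance="high")
```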

3.1 Implementation

Research developed in [15–18] in a legal context allowed us to use the design science research process proposed in [64] to create the CBR life-cycle stages and to create RYEL to manage legal cases. The system handles a legal case as a Knowledge Graph (KG) but presents it to the judge using an EGI. The system is intended to be used by a judge to register and analyze a juridical case and obtain information in the form of recommendations that can be used when issuing a sentence. RYEL captures and processes the interpretation and evaluation made by a judge when he literally draws the facts and evidence through interlaced images using the EGI, as in Fig. 2. In this way, a Knowledge Graph is built using the EGI. The system performs a Granular Transformation (GT), which means the data captured by the EGI in graphical form, for example image types, relationships, properties and image distribution, are decomposed into granules of information [5, 22, 65] to build a knowledge graph in order to store a Knowledge Representation (KR) [27] of the case based on the perception of the judge. The type of graph that the system handles internally is called a Property Graph (PG), which is composed of nodes, relations, properties and labels [31]. EGIs are classified into 3 types: (1) those designed to graphically explain how the data are captured and legally represented, (2) those in charge of explaining how a judge assigns the importance, link and effect of the facts and evidence, and (3) those aimed at explaining the context of the factual picture contained in the legal records


(legal files) and why the use of certain norms or laws is suggested in the cases under analysis. Figure 1 shows the inputs, processes and outputs of the system. An EGI is used to capture and process the images that the judge selects from a menu where each image represents an annotation of a case. The images are placed, distributed, organized and related by the judge in a workspace to draw according to his interpretation as shown in Fig. 2. Then images are processed in the form of KG and stored in an unstructured database in the form of PG. A relatively simple transformation method is used to convert the images and their relationships into a PG. The method consists of taking the images and transforming them into nodes, and the arrows into edges as a form of binding. The nodes, as a product of the transformation process, acquire colors, sizes and properties to explain the attributes of the legal scenarios of a case. Edges acquire properties such as length, thickness, color, orientation to explain how the nodes are linked and distributed. Both nodes and edges contain unique properties resulting from the transformation process. There are 3 main image processes: (1) the use of an EGI that allows the judge to assign levels and ranges to the images by simply dragging and dropping each one into previously designed graphic boxes (order and classification); a brief idea about how it looks like is seen in Fig. 8 and the internal process in Fig. 5, (2) other EGI is used to assess the links between facts and evidence (proof assessment); the detail of this interface is simplified as shown

Fig. 1 Data overview diagram: input, processes, and output


Fig. 2 Factual picture: using an EGI to analyze the merits of a homicide case in real-time according to the graphical interpretation made by a judge. Each image is translated to a node, and each arrow is converted into an edge internally in the system

in Fig. 6, and (3) other EGIs are used to explain the factual picture of the scenarios according to the legal context; Figs. 10 and 11 are examples of how some of these interfaces look. The output of the system consists of information displayed as recommendations on which laws and regulations can be used according to the context and the factual picture. A judge can perform analysis and queries about legal cases using some options of the EGI, as in Fig. 3. In the background, CYPHER queries are run [31, 55]. The queries process granules of information of the graph, that is, they collect node patterns according to the interpretation previously captured through the EGI by the way in which a judge expresses the facts and evidence graphically using the system. Pattern search queries involve distinguishing nodes and the proximity between them. The queries in Fig. 3 allow searching for similarities between facts and evidence using properties, types and relationships between nodes. The disposition and composition of the facts and evidence generate the factual picture, which is the legal functionality. This functionality allows identifying the type of law that must be applied. All of this is about indistinguishability, similarity, proximity or functionality as in [5, 66], but applied to graphs in a granular way. Information is then generated in the form of recommendations about laws and norms that can apply to the case the judge is working with, in order to support decision-making [67]. Therefore, RYEL incorporates granular data by implementing properties, labels, nodes and relationship attributes, together with algorithms and procedures which were designed using a pattern query language [54] and algorithm adaptations, for


Fig. 3 Graphical interface query about stabbing in a murder case

example, Jaccard explained in Eq. 1, Cosine Similarity in Eq. 2 and Pearson in Eq. 3, all of them applied to the knowledge graphs as in Fig. 3. This allows investigating the interpretation [68] made by a human in a given context, a legal one in our case.

J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \quad (1)

In Eq. 1 the Jaccard index J(A, B) is a statistical similarity measure; it is also called the Jaccard Coefficient and is used to measure the similarity between finite sets of data, in this case A and B. It is defined as the size of the intersection divided by the size of the union of the sets. It is used by the system to get the granular similarity between sets of facts or pieces of evidence belonging to different groups of case scenarios.

Similarity(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \quad (2)

In Eq. 2 the Cosine Similarity is a similarity measure between two non-zero vectors of an inner product space; in this case the vectors are obtained from the distribution of the nodes and edges in the graph from the EGI. It is defined as the cosine of the angle between the vectors. The system uses it to process granular data about the bond between a fact and a piece of evidence in a factual picture.

Similarity(A, B) = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2}\,\sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}} \quad (3)

In Eq. 3 the Pearson Similarity is a statistical measure to detect a linear correlation between two variables A and B. It takes a value between +1 and −1. A value of +1 is total positive linear correlation, 0 is no linear correlation, and −1


is total negative linear correlation; the higher the correlation, the higher the similarity. The system uses it when calculating the similarity between small sets of facts or pieces of evidence that belong to the same scenario of a case, or to different cases. Thus, RYEL uses the EGI to capture the interpretation and assessment made by the judge, a granular transformation then converts the granular data from the EGI into a knowledge graph, and Case-Based Reasoning (CBR) [36] guidelines are used to move cases back and forth between the knowledge database and the EGI. The following points summarize how the CBR guidelines were adapted to obtain a methodological framework for working with cases.

1. Case-Base: the semantic network.
2. Problem: the interpretation and assessment of facts and proofs.
3. Retrieve: implemented by script patterns and graph similarity algorithms.
4. Reuse: of norms and laws.
5. Revise: the knowledge graph, using the EGI.
6. Retain: legal features and law characteristics in the knowledge graph.
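To make Eqs. 1–3 concrete, the following is a minimal sketch of the three similarity measures in plain Python; the set and vector values are hypothetical, and in RYEL these computations actually run as graph algorithms and Cypher procedures over the knowledge graph rather than as standalone functions.

```python
from math import sqrt

def jaccard(a: set, b: set) -> float:
    """Eq. 1: |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def cosine(a: list, b: list) -> float:
    """Eq. 2: dot(A, B) divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def pearson(a: list, b: list) -> float:
    """Eq. 3: covariance of A and B divided by the product of their deviations."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sqrt(sum((x - ma) ** 2 for x in a)) *
                  sqrt(sum((y - mb) ** 2 for y in b)))

# Hypothetical example: fact/evidence identifiers from two case scenarios.
scenario_1 = {"fact_stabbing", "proof_knife", "proof_witness"}
scenario_2 = {"fact_stabbing", "proof_knife", "proof_video"}
print(jaccard(scenario_1, scenario_2))  # 0.5
```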

The EGI implements semantic networks graphically through knowledge graphs. The graphical techniques implemented in the EGI are interactive and are used to represent and process information related to the interpretation and assessment made by a judge. These techniques also allow investigating interface accessibility problems related to explanatory computational models [10]. In addition, RYEL uses other types of graphics to display the information resulting from the analyses and consultations the judge performs on the legal factual situation (factual picture).

3.2 Granular Computing

A case can be made up of multiple juridical and legal scenarios, which in turn contain various facts and pieces of evidence related to each other. In Fig. 4 we can observe three scenarios contained in a single case, with the facts and the evidence related to each scenario by a black line; evidence supporting the facts is represented by a dashed gray line connecting them. There may be facts that relate to more than one scenario, such as Fact 7, and evidence that supports several facts belonging to different scenarios, such as Evidence 1. It is up to the judge to make decisions for each scenario, based on the facts and evidence, and issue a resolution for the legal case. This is an example of how a case is organized into scenarios, facts and evidence in our investigation. For the processing of a legal case, an expedient is maintained in which the information of the case is organized. Everything that the parties (litigants) report is recorded on sheets called expedient folios, characterized by a series of annotations that describe properties of the case, such as facts, evidence, jurisprudence, doctrine, norms, principles, and petitions of the litigants. In this way the file stores, historically, all the actions and decisions of those involved in a legal


Fig. 4 Granular representation of a legal case: the result of internally converting the images and arrows to a graph with nodes and edges is shown. Groups and relationships of scenarios, facts and evidence of the case are created

conflict, that is, everything that litigants do, ask or justify about a situation in conflict that they hope will be resolved, and everything that the judges dictate as a resolution. Figure 5 shows how a legal case is processed by the system after being captured by the EGI shown in Fig. 2. The levels of the scenarios, represented by three tiles with grids, are obtained according to how the judge has organized the factual picture using the EGI. The importance levels are represented by the columns of each tile, and the ranges are expressed by the rows of each tile; for each column there is a set of rows that belong to it. The lower the number of a tile level, column and row, the more important the granule (fact or evidence). Keeping a 3D view, Fig. 5 shows a relational organization between the legal granules across multiple tile levels, ranges and importance levels, which expresses the interpretation and assessment made by the judge for the case scenarios. Thus, granular processing by tile level, range and importance level consists of a strategic arrangement and classification of the facts and evidence of the scenarios of a case. The system provides the judge with a special EGI that helps him classify and abstract the importance of facts and evidence by levels of importance (columns), while assigning ranges (rows) of relevance within each classification per tile level. Figure 6 shows the granular processing of the edge properties that link facts to evidence; this is done after the process shown in Fig. 5. Granular edge processing consists of narrowing the distance between a fact and a piece of evidence, that is, making the link shorter. This represents a closer bond between granules of case information. The shorter the

Fig. 5 On the left, the legal representation of the granules of a case. On the right, the granular fragmentation process of the case using tiles representing levels, and rows representing ranges, together with the relationships between levels of granules


Fig. 6 On the left, the granules of a legal case. On the right, the process of granular fragmentation of edges (links) bonding facts and evidence along with the order of each edge



edge, the stronger the bond between granules and, therefore, the greater the legal effect. For example, the value A is closer between Fact5 and Proof1, while the value C is further away between Fact3 and Proof3. This means the bond between Fact5 and Proof1 has more effect than the one between Fact3 and Proof3. To support this, the system also provides the judge with an EGI that helps with this type of analysis. Next, the new granular spectrum detected by RYEL is explained; this spectrum is based on the granular computing described up to this point.
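The arrangement described above, scenarios holding facts and evidence placed by tile level, importance column and range row, with edge length encoding the strength of the bond, can be pictured with a small data-structure sketch. The classes and values below are hypothetical and only illustrate how such granules could be held in memory; they are not RYEL's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Granule:
    """A fact or piece of evidence placed by the judge through the EGI."""
    name: str
    tile_level: int   # scenario level (lower = more important)
    importance: int   # column within the tile
    rank: int         # row (range) within the column

@dataclass
class Bond:
    """Edge linking a fact to a piece of evidence; shorter length = stronger bond."""
    fact: Granule
    evidence: Granule
    length: float

# Hypothetical fragment of a case (cf. Figs. 4-6).
fact5 = Granule("Fact5", tile_level=1, importance=1, rank=1)
proof1 = Granule("Proof1", tile_level=1, importance=1, rank=2)
fact3 = Granule("Fact3", tile_level=2, importance=2, rank=1)
proof3 = Granule("Proof3", tile_level=2, importance=3, rank=2)

bonds = [Bond(fact5, proof1, length=3.0),   # value A: close, greater legal effect
         Bond(fact3, proof3, length=10.0)]  # value C: far, weaker legal effect
strongest = min(bonds, key=lambda b: b.length)
```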

3.3 Interpretation-Assessment/Assessment-Interpretation (IA-AI)

As mentioned earlier, IA-AI is a new area of study that consists not only of explaining machine inferences, but also of capturing the interpretation and assessment a user makes according to a context. In this sense, the activity of interpreting elements of the real world, as well as the ability to evaluate them, is inherent to human activity; however, a computational framework is needed to organize and process them. IA-AI therefore materializes as a computational technique for understanding a person's perception. Using IA-AI, the judge can lay out and organize annotations from the expedient with the objective of discovering and evidencing the existing relationships between the facts and the evidence. Through this, the judge establishes the importance and precedence of facts and evidence, thereby beginning the analysis of the merits of the case.

3.3.1 Interpretation

The interpretation process is made up of three stages that come into operation when the judge interacts with the system through the EGI. These stages are the following.

1. Graph of images: the layout and organization a judge makes of the images and their relationships, according to what is understood in a scenario, and which is registered in the system.
2. Interest level: a classification of the images that a judge makes according to his criteria, experience and type of scenario with the help of the system. At this stage the system internally assigns a value to each image according to the level chosen by the judge, which is used to process the information of each image at that level.
3. Order of precedence: the priority that a judge assigns to each image from his legal point of view, within each level of interest. The system assigns a value to each image according to the position given by the judge. This value, together with the one produced by the level of interest, is used by the system in the inference processes.


Fig. 7 Procedural differences: the concept of weights is not used by the system, instead the system uses the concepts of order of precedence (range) and level of interest (importance) that are implemented by the IA-AI framework within a Euclidean space

Stage 1 allows structuring and studying the factual picture in a better way, helping a judge understand a scenario in more detail. Stages 2 and 3 bring out the importance and precedence of the facts and evidence, allowing the content of the scenarios to be categorized and ordered; a judge can therefore use the information resulting from these stages as support when making a decision. We clarify that stages 2 and 3 do not use the concept of weights, but rather a value that symbolizes the interaction between the level where an image is placed and the order of that image within that level, because a single value would not adequately capture the whole set of criteria the judge may consider when analyzing the merits of a scenario, which the stages above can achieve. We explain this with Fig. 7, which shows three segments containing circles with letters that represent facts or evidence; if weights were used, each element would be symbolized by a circle with a letter inside, where the larger the circle, the greater the weight of the element and vice versa. The figure compares the use of weights with the use of the stages of the interpretation process. Using only the concept of weights generates drawbacks; we list the most relevant ones below.

1. When a judge works on a case, he may end up considering all the facts and evidence equally important. This is represented as circles of equal size in segment 1 of Fig. 7, and makes it difficult for the judge to detect relative differences between them.
2. A judge may have a group of facts and evidence that are different, yet find it difficult to identify the differences between them. This is represented as circles of various sizes in segment 2 of Fig. 7, and could generate bias problems related to the importance and precedence of each one.

The judge can overcome the drawbacks shown in segments 1 and 2 of Fig. 7 by working each fact and piece of evidence within a scenario and applying the 3 stages of the


interpretation process. Stages 2 and 3 are represented in segment 3 of Fig. 7 by dark cylinders that symbolize the levels, while the way the elements are stacked inside each cylinder represents the order. An example of this implemented process can be seen in Fig. 8, where gray boxes show the levels of interest and the arrangement of the elements within each box shows the order.

3.3.2 Assessment

We define the valuation process as an instrument that allows evaluating the relationship between each pair of images. It is used when the judge needs to express his opinion on the relevance of a relationship, for example, when a criminal judge considers the relationship between evidence A and a fact X as more relevant than that of evidence B related to the same fact. That is why it focuses on evaluating the connection that one image has with another. The valuation process runs after the interpretation process, because the output of the latter is the input of the former: properly planned, organized and related images must exist first, and only then can the connection between each pair of them be evaluated. We intend the evaluation process to work with 3 legal considerations, each of which is used by the judge when evaluating a connection. The considerations are listed below.

1. Importance (I): a value assigned by the judge that expresses how important one image is to another, for example, a proof for a fact. It is represented by the size of the image.
2. Link (V): a value assigned by the judge that represents the degree of closeness of one image in relation to another; for example, 3 means closer and 10 means farther. The closer an image is to another, the stronger the legal bond.
3. Effect (E): a value produced automatically when importance (I) and link (V) interact. For example, we could use the formula of a Euclidean space, modifying and adapting the Pythagorean theorem, so that the result is an equation like the following:

E ≈ \sqrt{V^2 + (I/2)^2}    (4)

Equation 4 is an example of how we could obtain the legal effect E: the link V and the importance I/2 are squared, added, and the square root of the result is taken. We clarify that the importance is divided by 2 because it represents the radius produced by the diameter of a node; the diameter is calculated by the system from the interpretation in the EGI. Using Fig. 9 we can explain how Eq. 4 is applied within the assessment process. The importance I and the link V are computed by the system according to the information on the assessment of the facts and evidence made by the judge. The link is the line that connects one image to another from their centers and is represented by the arrow between each pair of images. Importance is the radius obtained from the diameter of the image, which is

Fig. 8 EGI used by the judge to classify the granules of information in a case according to his perspective. The levels represent degree of interest (importance), and the location of each granule within each level represents the order of precedence (range). Again, the granules can represent facts or proofs, as well as other data of the case



Fig. 9 The nodes (circles) represent granules and edges (lines) the link bonding them. The values obtained from the size of nodes and their arrangement, as well as the values extracted on the type and length of the edges between them, are used to feed the formula in the Euclidean space. A series of vectors are generated that uniquely encode fragments of a case at granular level

represented by its size. The effect is the result of the equation calculated from the importance and the link; thus, Eq. 4 produces an effect. If a piece of evidence is very far from a fact, the link representing its veracity is weaker; on the other hand, if the evidence is very close to the fact, the link is tight and its probatory effect is greater. The closer the effect produced by Eq. 4 is to 0, the greater the legal effect. For example, Fig. 9 shows that for a fact of Sexual Violence, the Video evidence would have a greater effect than the Medical Expertise, because the resulting values of the effect are E = 21.52 and E = 30.7, respectively.
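A minimal sketch of Eq. 4 follows; the link lengths and diameters are hypothetical values chosen only to land near the magnitudes reported above, since the real values come from the EGI layout.

```python
from math import sqrt

def legal_effect(link_length: float, importance_diameter: float) -> float:
    """Eq. 4: E ≈ sqrt(V^2 + (I/2)^2); a smaller E means a stronger legal effect."""
    v = link_length          # V: distance between the two images (fact and evidence)
    i = importance_diameter  # I: node diameter assigned through the EGI
    return sqrt(v ** 2 + (i / 2) ** 2)

# Hypothetical values: a close, large piece of evidence vs. a distant, smaller one.
e_video = legal_effect(link_length=20.0, importance_diameter=16.0)      # ~21.5, stronger effect
e_expertise = legal_effect(link_length=30.0, importance_diameter=13.0)  # ~30.7, weaker effect
assert e_video < e_expertise
```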

3.3.3 Biases in Judgment

Biases can occur depending on the judge's experience, education, country, way of thinking, kind of law, legal circumstances, and failures produced by the judge due to human nature. The system manages bias in the judges' opinions by taking advantage of the process called "recursion of judicial resolutions", also called "challenge of judgments" or "impugnment of judgments" [69]. To understand this, we explain the most important types of legal bias and clarify what recursion is. The most common biases concern the procedure, the interpretation, the assessment, or the form or substance of a judicial resolution. Recursion is a request process by which the parties involved in the litigation claim, before the same judge who works on the case or before a higher judge,


any change in the judicial resolution due to the existence of bias, errors, failures, inaccuracies, or problems in the legal process that affect the judicial resolution. The most important types of recursion of judicial resolutions are (1) revocation of sentence, (2) appeal of sentence, (3) cassation appeal, and (4) sentence review. Any of these recursions allows correcting bias in the judges' opinion. To correct a resolution, the judge evaluates the claim; if the claim is valid, he uses an EGI like the one in Fig. 2, where the case is shown, and applies the change he considers correct. If it is necessary to change the assessment of the facts or evidence, any of the other EGIs, such as the one in Fig. 8, can be used to apply the change. Otherwise, the claim is rejected, confirming that there is no bias or error in the judicial resolution. The knowledge base is updated by making the modifications to the case through any of the EGIs.

3.4 Explainable Artificial Intelligence

The RYEL system uses an explanation technique called Fragmented Reasoning (FR), as can be seen in Figs. 10 and 11. It consists of different statistical graphics that granulate the information following a hierarchical order of importance according to the interpretation and assessment made by a human about objects of the real world, in this case the facts and evidence analyzed by the judge. FR considers the data strategically arranged in each observation made by the human, taking IA-AI information as input, revealing how it was interpreted and evaluated, and then showing the calculations made to generate recommendations, in this case the laws that could apply to a legal case. In this way it has been possible to investigate the legal explanations related to the inferences [10] obtained by the system.

4 Results

The RYEL experiments consisted of letting judges from Costa Rica, Spain and Argentina use RYEL, under broadly similar circumstances, to capture and represent the interpretation and assessment they make of the facts and evidence contained in the scenarios of a criminal case, in order to analyze interpretation patterns about the laws that were used in relation to proven facts (felonies). The system then explains to the judge, using an EGI, the causes or purposes of using a given law coming from previous legal case scenarios. The system has been remotely tested in several countries, such as Panama [15] and Argentina, by 10 expert judges in the criminal field. An example of a live experiment conducted remotely in Spain and Argentina can be seen in Fig. 12, where a judge uses RYEL to analyze the merits of a homicide case and has the opportunity to consult information about the interpretation and assessment that other judges have made of a similar case.

Fig. 10 EGI that explains to the user, by means of circles, sizes and colors, which sets of expedients contain the factual picture and scenarios similar to the legal context the judge works with. The machine explains that each circle is a set of expedients with certain characteristics, and that those that best fit the case under analysis are the ones higher and further to the right of the graph. The judge can explore, analyze, and select the ones he considers best for a certain legal context


Fig. 11 EGI that explains to the user, by means of circles, sizes and colors, which sets of laws and regulations are most in line with the factual picture of the case under analysis. The machine expresses, through a graphical distribution of circles, the norms and laws that best describe the legal scenario under analysis: the higher and further to the right in the chart, the better the norms and laws to take. The judge can nevertheless explore, analyze, and select the ones he considers best for the factual picture

392 L. R. R. Oconitrillo et al.


Fig. 12 Remote execution of experiments with judges: analysis of the merits of a homicide case according to the interpretation and assessment made by a judge

RYEL has been tested by 7 judges from Costa Rica and by 5 judges in Spain. In addition, tests with 8 experts in artificial intelligence were included as well. The RYEL method was also validated by other groups of legal experts and computer engineers in Mexico [18], Panama [15], Costa Rica [16], and Spain [17]. During the experimentation with the judges, information to evaluate the system was collected from the opinions they issued about it. The evaluation consisted of determining whether the method used by RYEL could satisfy the stated and implicit [70] needs of the judges when interpreting and evaluating facts and evidence. Some characteristics defined in [70] were adapted and used as reference parameters to explain what was intended to be evaluated during experimentation, namely the "functional suitability", "usability" and "efficiency" of the system. Table 3 shows the evaluation matrix according to the opinion of the judges. The judges were selected randomly and grouped according to the judicial office where they work; in this way, 5 types of offices were obtained and 9 subcategories describing suitability, usability and efficiency were used to evaluate RYEL. The reason for grouping judges by legal office is the hierarchical rank of their work. For example, the basic levels of the hierarchy are judges of criminal courts and provincial courts, who can create and revoke sentences; superior courts decide appeals and correspond to the medium levels; the magistracy is the highest level of the hierarchy and resolves certain legal cases that require major and special reviews. The experiments tried to cover as wide a spectrum of the judges' hierarchy as possible, in order to have the greatest diversity of opinions to evaluate the system. It is possible to conclude from Table 3 that the evaluation of the system by the judges has been satisfactory. In criminal court offices there is a 93.52% average success rate in the operation of the system to analyze the merits of a case, because the execution of some tests was affected by the remote connection used during the experiments. This situation affected the opinion issued by the judges in this


Table 3 RYEL evaluation matrix according to the judges' opinion (see footnote 4)

Criterion                  Criminal courts   Military criminal court   Superior courts   Provincial court   Criminal magistracy
Suitability                90.00             100.00                    100.00            90.00              100.00
Accuracy                   95.00             100.00                    100.00            80.00              100.00
Functionality compliance   90.00             100.00                    100.00            80.00              100.00
Understandability          90.00             100.00                    100.00            80.00              95.00
Learnability               90.00             100.00                    100.00            60.00              100.00
Operability                90.00             100.00                    100.00            60.00              100.00
Attractiveness             100.00            100.00                    100.00            80.00              100.00
Time behaviour             100.00            100.00                    100.00            60.00              100.00
Efficiency compliance      96.67             100.00                    100.00            73.33              100.00
Average                    93.52             100.00                    100.00            73.70              99.44

4 Adaptation and use of software evaluation concepts that are defined in ISO-25010 [70]
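As a quick cross-check of Table 3, the "Average" row is simply the arithmetic mean of the nine criterion scores for each office; a minimal sketch:

```python
# Cross-check of the "Average" row in Table 3: each office's average is the
# mean of its nine criterion scores.
scores = {
    "Criminal courts":         [90, 95, 90, 90, 90, 90, 100, 100, 96.67],
    "Military criminal court": [100] * 9,
    "Superior courts":         [100] * 9,
    "Provincial court":        [90, 80, 80, 80, 60, 60, 80, 60, 73.33],
    "Criminal magistracy":     [100, 100, 100, 95, 100, 100, 100, 100, 100],
}
for office, vals in scores.items():
    print(f"{office}: {sum(vals) / len(vals):.2f}")
# -> 93.52, 100.00, 100.00, 73.70, 99.44, matching the last row of Table 3.
```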

type of office. The provincial court offices gave a 73.70% success rate in the operation of the system because some judges could not complete the experiment: they had to attend trials and it was not possible to reschedule. The radar graph in Fig. 13 compares the results of the judges' assessment according to the characteristic indicated at each vertex. The values at the vertices of functional compliance, accuracy, learnability, understandability, suitability and operability are very close to each other and very high, which means that, regardless of the hierarchical level, the judges evaluate the system with high acceptance rates. The evaluations of attractiveness, time behavior and efficiency compliance are practically the same, with values close to or equal to 100% acceptance of the system for legal analysis. As a result of the experiments, the opinion of the judges can be synthesized in the following 3 points: (1) the judges accept that the EGI is suitable for capturing and representing legal knowledge about a case in granular form, (2) the IA-AI framework is efficient for performing the tasks of analyzing the merits of a case, and (3) the judges stated that the information explained and provided by the system in the form of recommendations is useful.

5 Conclusions

It is possible to conclude that explanatory techniques (XAI) are necessary for unsupervised algorithms in a domain of discourse that involves subjective information from a human. Most of the explanatory techniques developed follow a black-box explanation approach; however, the complexity of some tasks in the expert domain is very


Fig. 13 Scoring radar on judge evaluation: quality characteristics [70] of RYEL according to the hierarchy of judges

high, and such tasks therefore also deserve techniques that explain how a machine performs the entire heuristic and discovery learning process. A new spectrum of applicability in Granular Computing (GrC) has been explained and evidenced, which involves the capture and representation of what a human interprets and values according to their perspective, and the need for an Explanatory Graphical Interface (EGI) to achieve it. Because of this, it was necessary to redefine the concepts of Explainable Artificial Intelligence and Interpretable Artificial Intelligence in order to detail the new spectrum covered by IA-AI. There were limitations that had to be dealt with. The most relevant are: (1) lack of time and budget in the project to travel to other countries and work with a greater number of judges, (2) the Judicial Power of a country has directives and cumbersome processes for communicating officially with judges, which limited and reduced the


number of them available for experimentation, and (3) the global COVID-19 pandemic limited access to some of the judges' offices and also limited movement within some of them, which caused some experiments to be carried out virtually; in some cases remote communication was not possible because some offices did not have internet access or network permissions. As future work, we plan to create a graph embedding model to be processed by supervised algorithms in order to obtain a set of pre-trained legal actions for solving juridical problems in very particular scenarios of criminal cases. We also intend to use RYEL to perform interpretation and assessment experiments with cases other than criminal ones.

Acknowledgements BISITE Research Group and the Faculty of Law, both from the University of Salamanca (USAL); the School of Computer Science and Informatics (ECCI) and the Postgraduate Studies System (SEP), both from the University of Costa Rica (UCR); the Edgar Cervantes Villalta School of the Judicial Branch of Costa Rica; special thanks to all the judges and AI experts who have participated in this investigation from Mexico, Costa Rica, Spain, Panama, and Argentina.

References 1. Yager, R., Filev, D.: Operations for granular computing: mixing words with numbers. In: Proceedings of 1998 IEEE International Conference on Fuzzy Systems, pp. 123–128 (1998) 2. Zadeh, L.: Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst. 19, 111–127 (1997) 3. Teerapong, K.: Graphical ways of researching. Graphical ways of researching p. In: Proceedings of the ACUADS 2014 Conference: The Future of Discipline (2014) 4. Pedrycz, W., Gomide, F.: Fuzzy Systems Engineering: Toward Human-Centric Computing. Wiley-IEEE Press (2007) 5. Bargiela, A., Pedrycz, W.: Granular computing for human-centered systems modelling. In: Human-Centric Information Processing Through Granular Modelling, pp. 320–330 (2008) 6. Yao, Y.: Human-inspired granular computing. In: Novel Developments in Granular Computing: Applications for Advanced Human Reasoning and Soft, pp. 1–15 (2010) 7. Pedrycz, W.: Granular computing for data analytics: a manifesto of human-centric computing. IEEE/CAA J. Automatica Sinica 5, 1025–1034 (2018) 8. Zhu, J., Liapis, A., Risi, S., Bidarra, R., Youngblood, M.: Explainable AI for designers: a humancentered perspective on mixed-initiative co-creation. In: IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8 (2018) 9. Mohseni, S.: Toward design and evaluation framework for interpretable machine learning systems. In: AIES ’19: Proceedings of the 2019 AAAI/ACM, pp. 27–28 (2019) 10. Wolf, C., Ringland, K.: Designing accessible, explainable AI (XAI) experiences. ACM SIGACCESS Accessibility Comput. (6), 1–5 (2020) 11. Khanh, H., Tran, T., Ghose, A.: Explainable software analytics. In: ACM/IEEE 40th International Conference on Software Engineering: New Ideas and Emerging Results (2018) 12. Sokol, K., Flach, P.: Explainability fact sheets: a framework for systematic assessment of explainable approaches. In: FAT 20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 56–67 (2020) 13. Samih, A., Adadi, A., Berrada, M.: Towards a knowledge based explainable recommender systems. In: BDIoT 19: Proceedings of the 4th International Conference on Big Data and Internet of Things, pp. 1–5 (2019)


14. Abdul, A., Vermeulen, J., Wang, D., Lim, B., Kankanhalli, M.: Trends and trajectories for explainable, accountable and intelligible systems: an hci research agenda. In: CHI ’18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–18 (2018) 15. Rodríguez, L.R.: Jurisdictional Normalization of the Administration of Justice for Magistrates, Judges Using Artificial Intelligence Methods for Legal Guidance Systems, pp. 1–10. II Central American and Caribbean Congress on Family Law, Panamá pp (2016) 16. Rodríguez, L., Osegueda, A.: Business intelligence model to support a judge’s decision making about legal situations. In: IEEE 36th Central American and Panama Convention (CONCAPAN XXXVI), pp. 1–5. Costa Rica (2016) 17. Rodríguez, L.R.: Jurisdictional normalization based on artificial intelligence models. In: XX Iberoamerican Congress of Law and Informatics (FIADI), pp. 1–16. Salamanca, Spain (2016) 18. Rodríguez, L.R.: Artificial intelligence applied in procedural law and quality of sentences. In: XXI Iberoamerican Congress of Law and Informatics (FIADI), pp. 1–19. San Luis Potosí, México (2017) 19. Yao, Y.: A triarchic theory of granular computing. Granular Comput. 1, 145–157 (2016) 20. Liu, H., Cocea, M.: Granular computing-based approach for classification towards reduction of bias in ensemble learning. Granular Comput. 2, 131–139 (2017) 21. Skowron, A., Jankowski, A., Dutta, S.: Interactive granular computing. Granular Comput. 1, 95–113 (2016) 22. Su, R., Panoutsos, G., Yue, X.: Data-driven granular computing systems and applications. Granular Comput. 275–283 (2019) 23. Yao, Y.: The art of granular computing. In: International Conference on Rough Sets and Intelligent Systems Paradigms, pp. 101–112 (2007) 24. Pedrycz, W., Skowron, A., Kreinovich, V.: Handbook of Granular Computing. Wiley, The Atrium, Southern Gate, Chichester, West Sussex, England (2008) 25. Pedrycz, W., Chen, S.M.: Granular Computing and Intelligent Systems. Springer, Berlin Heidelberg, Germany (2011) 26. Yan, J., Wang, C., Cheng, W., Gao, M., Aoying, Z.: A retrospective of knowledge graphs. Front. Comput. Sci. 55–74 (2018) 27. Bonatti, P., Cochez, M., Decker, S., Polleres, A., Valentina, P.: Knowledge graphs: new directions for knowledge representation on the semantic web. Report from Dagstuhl Seminar 18371, 2–92 (2018) 28. Robinson, I., Webber, J., Eifrem, E.: Graph Databases New Opportunities for connected data. O’Reilly Media, 1005 Gravenstein Highway North, Sebastopol, CA (2015) 29. McCusker, J., Erickson, J., Chastain, K., Rashid, S., Weerawarana, R., Bax, M., McGuinness, D.: What is a knowledge graph? Semantic Web—Interoperability, Usability, Applicability an IOS Press J. 1–14 (2018) 30. Inc., N.: What is a graph database? https://neo4j.com/developer/graph-database/ (2019), [accedido: 2019-01-05] 31. Robinson, I., Webber, J., Eifrem, E.: Graph Databases New Opportunities for Connected Data. O’Reilly Media, Sebastopol, CA (2015) 32. Yamaguti, T., Kurematsu, M.: Legal knowledge acquisition using case-based reasoning and model inference. In: ICAIL ’93: Proceedings of the 4th International Conference on Artificial Intelligence and Law. pp. 212–217. ACM, New York, NY, USA (1993) 33. Berman, D., Hafner, C.: Representing teleological structure in case-based legal reasoning: the missing link. In: ICAIL ’93 Proceedings of the 4th international conference on Artificial intelligence and law, pp. 50–59 (1993) 34. Kolodner, J.: Case-Based Reasoning. 
Morgan Kaufmann Publishers Inc, San Mateo, California (1993) 35. Aleven, V.: Using background knowledge in case-based legal reasoning: a computational model and an intelligent learning environment. Artificial Intelligence—Special issue on AI and law, pp. 183–237 (2003)


36. Ashley, K.: Case-based models of legal reasoning in a civil law context. In: International Congress of Comparative Cultures and Legal Systems of the Instituto de Investigaciones Jurídicas, Universidad Nacional Autónoma de México (2004) 37. Barot, R., Lin, T.: Granular computing on covering from the aspects of knowledge theory. In: NAFIPS 2008—2008 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1–5 (2008) 38. Toyota, T., Nobuhara, H.: Hierarchical structure analysis and visualization of japanese law networks based on morphological analysis and granular computing. In: IEEE International Conference on Granular Computing, pp. 539–543 (2009) 39. Toyota, T., Nobuhara, H.: Analysis and visualization of japanese law networks based on granular computing -visual law: visualization system of japanese law. J. Adv. Comput. Intell. Intelli. Inf. 14, 150–154 (2010) 40. Keet, M.: The granular perspective as semantically enriched granulation hierarchy. IJGCRSIS 2, 51–70 (2011) 41. Wang, B., Liang, J., Qian, Y.: Information granularity and granular structure in decision making. In: Rough Set and Knowledge Technology, pp. 440–449 (2012) 42. Mani, A.: Axiomatic granular approach to knowledge correspondences. In: 7th International Conference on Rough Sets and Knowledge Technology, pp. 482–486 (2012) 43. Bianchi, F., Livi, L., Rizzi, A., Sadeghian, A.: A granular computing approach to the design of optimized graph classification systems. Soft Comput. 18, 393–412 (2014) 44. Miller, S., Wagner, C., Garibaldi, J.: Applications of computational intelligence to decisionmaking: Modeling human reasoning/agreement. In: Handbook on Computational Intelligence, Volume 1: Fuzzy Logic, Systems, Artificial Neural Networks, and Learning Systems, pp. 807– 832 (2016) 45. Parvanov, P.: Handbook on Computational Intelligence, vol. 1. World Scientific Publishing Co. Pte. Ltd., 5 Toh Tuck Link, Singapore 596224 (2016) 46. Denzler, A., Kaufmann, M.: Toward granular knowledge analytics for data intelligence: Extracting granular entity-relationship graphs for knowledge profiling. In: IEEE International Conference on Big Data, pp. 923–928 (2017) 47. Liu, H., Cocea, M.: Nature-inspired framework of ensemble learning for collaborative classification in granular computing context. Granular Comput. 4, 715–724 (2019) 48. Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M.: Generalized additive models for location, scale and shape for high dimensional data, a flexible approach based on boosting. J. Roy. Stat. Soc. 61, 354–514 (2012) 49. Hastie, T., Tibshirani, R.: Generalized additive models. Stat. Sci. 1, 297–310 (1986) 50. Alvarez-Melis, D., Jaakkola, T.: On the robustness of interpretability methods. arXiv, pp. 1–6 (2018) 51. Ehsan, U., Harrison, B., Chan, L., Riedl, M.: Rationalization: A neural machine translation approach to generating natural language explanations. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 81–87 (2018) 52. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Muller, K.R.: Layer-wise relevance propagation: an overview. In: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 193–209 (2019) 53. Arnaud, V.L., Klaise, J.: Interpretable counterfactual explanations guided by prototypes. arXiv, pp. 1–17 (2019) 54. Inc., N.: What is a graph database? https://neo4j.com/developer/cypher-query-language/ (2019), [accedido: 2019-01-05] 55. Merkl, B., Chao, J., Howard, R.: Graph Databases for Beginners. 
Neo4j Print, Packt Publishing (2018) 56. Leonardo, G.: La definición del concepto de percepción en psicología. Revista de Estudios Sociales 18, 89–96 (2004) 57. Galinsky, A., Maddux, W., Gilin, D., White, J.: Why it pays to get inside the head of your opponent: the differential effects of perspective taking and empathy in negotiations. Psychol. Sci. 19, 378–384 (2008)


58. Carral, M.d.R., Santiago-Delefosse, M.: Interpretation of data in psychology: A false problem, a true issue. Philos. Study 5, 54–62 (2015) 59. Paulheim, H.: Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic Web 0 (2016) 1–0, pp. 1–23 (2016) 60. Zhang, L.: Knowledge Graph Theory and Structural Parsing. Twente University Press, Enschede (2002) 61. Singhal, A.: Introducing the knowledge graph: things, not strings. https://googleblog.blogspot. com/2012/05/introducing-knowledge-graph-things-not.htm (2012), [accedido: 2018-12-03] 62. Florian, J.: Encyclopedia of Cognitive Science: Semantic Networks. Wiley, Hoboken, NJ (2006) 63. Lehmann, F.: Semantic networks. Comput. Math. Appl. 1–50 (1992) 64. Offermann, P., Levina, O., Schönherr, M., Bub, U.: Outline of a design science research process. In: Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, pp. 7–1, 7–11 (2009) 65. Lee, C., Ousterhout, J.: Granular computing. In: HotOS 19: Proceedings of the Workshop on Hot Topics in Operating Systems, pp. 149–154 (2019) 66. Lotfi, Z.: Toward extended fuzzy logic–a first step. Fuzzy Sets Syst.—Science Direct 160, 3175–3181 (2009) 67. Khazaii, J.: Fuzzy logic. In: Advanced Decision Making for HVAC Engineers, pp. 157–166. Springer (2016) 68. Loui, R.: From berman and hafner’s teleological context to baude and sachs’ interpretive defaults: an ontological challenge for the next decades of ai and law. Artif. Intell. Law 371–385 (2016) 69. Assembly, L.: Creation of the Appeal Resource of the Sentence, Other Reforms to The Challenge Regime and Implementation of New Rules of Orality in The Criminal Process, pp. 1–10. Gaceta, Costa Rica pp (2010) 70. Standards, B.: Systems and software engineering—systems and software quality requirements and evaluation (square)—system and software quality models. BS ISO/IEC 25010(2011), 1–34 (2011)

A Generative Model Based Approach for Zero-Shot Breast Cancer Segmentation Explaining Pixels’ Contribution to the Model’s Prediction Preeti Mukherjee, Mainak Pal, Lidia Ghosh, and Amit Konar

Abstract Deep learning has extensively helped us to analyze complicated distributions of data and extract meaningful information from the same. But the fundamental problem still remains—what if there is no or rare occurrence of any instance which needs to be predicted? This is a significant problem in the health-care sector. The primary motivation of the work is to detect the region of anomaly in the context of breast cancer detection. The novelty here lies in designing a zero-shot learning induced Generative Adversarial Network (GAN) based architecture which has the efficacy to detect anomaly and thereby to discriminate the healthy and anomalous images even if the network is trained with the healthy instances only. In addition, this work has also been extended to segmentation of the anomalous region. To be precise, the novelty of this approach lies in the fact that the GAN takes only healthy samples in the training phase, though it is capable of segmenting tumors in the “unseen” anomalous samples explaining each pixel’s contribution to the model’s prediction. Experiments have been conducted on the mini MIAS breast cancer dataset and significantly appreciable results have been obtained. Keywords Breast cancer · Explainable AI · Granular computing · Generative adversarial networks · Anomaly detection

P. Mukherjee · M. Pal · L. Ghosh (B) · A. Konar Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India e-mail: [email protected] P. Mukherjee e-mail: [email protected] M. Pal e-mail: [email protected] A. Konar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. Pedrycz and S. Chen (eds.), Interpretable Artificial Intelligence: A Perspective of Granular Computing, Studies in Computational Intelligence 937, https://doi.org/10.1007/978-3-030-64949-4_13


1 Introduction

With the co-evolution of mankind and Artificial Intelligence (AI), deep learning has inevitably taken up the role of building block for handling datasets and extracting systematic and useful information from them. Undoubtedly, Machine Learning (ML) now has applications in almost every research and industrial field. In some of these fields, the application needs to be correct in the truest sense: in medical disease detection, a slight malfunction in an algorithm can lead to fatality. The functionality of deep learning algorithms is mostly a black box for both users and developers, so the ability to explain is playing a key role in rapid advancements. Can we explain the reason behind the model's prediction? Finding its answer has two-fold merits. First, it can identify the region of interest (ROI) behind the prediction. Second, if the prediction is wrong, the abnormality should be detected and addressed for further improvement. Recent developments in Explainable Artificial Intelligence (XAI) techniques have contributed significantly towards more transparent AI while maintaining high performance. Zadeh postulates that, "there are three concepts underlying human recognition: granulation, organization, and causation. Granulation involves decomposition of the whole into parts; organization involves integration of parts into the whole; and, causation relates to the association of causes with effects" [19]. When mimicking the concepts behind image understanding, granulation plays an important role: information granules extracted from the images can be regarded as the processing units of the image analysis. Additionally, Generative Adversarial Networks (GANs) [1–3] have further made it possible to understand and analyze arbitrarily complex structured data. However, labeling large-scale datasets is extremely tedious, time consuming and inefficient. Looking ahead, a model might be required to identify or predict something that is not present in the training set or is present only in scanty numbers. Such an example can be seen in the medical and health-care sector. To identify a disease from imaging reports, the first thing professionals ask is: "Is there any difference in this image from the normal ones? If so, where is it?" It can be statistically concluded that the availability of normal/healthy samples is far greater than the number of infected or diseased instances. To tackle such a problem, a system must be designed that is efficient enough to confirm the presence of an anomaly, i.e., data patterns that do not resemble a well-defined notion of normal behavior, along with its location. Since a GAN requires training on only one class instead of multi-class data (here, only the healthy samples), the data is sufficient for such unimodal training. Existing state-of-the-art algorithms often use adversarial training for anomaly detection [4]. A GAN framework consists of two parts: a Generator and a Discriminator. The Generator model essentially takes noise as input and generates an image that is close to the required sample. The Discriminator is fed with the original image and the generated image and learns to differentiate between them. This error is then back-propagated to the generator, so that it can make appropriate corrections and generate images as


close to the original sample as possible. Thus, by iteratively going through this closed-loop cycle, the generator becomes trained enough to produce real-like samples to the extent that the discriminator is unable to distinguish between real and generated images. Ideally, in an adversarially learned generator-discriminator structure, a well-trained generator network is efficient enough to generate extremely real-like samples that can fool discriminator networks. GANs are very efficient as there is no need for tedious and complex pre-processing or sampling of the data; however, the major drawback is that it is extremely difficult to train both networks simultaneously. State-of-the-art anomaly detection adversarial networks like AnoGAN [5] use the Bi-directional GAN (BiGAN) [6] or Adversarially Learned Inference (ALI) [7] architecture to learn the features. This is particularly helpful as it establishes an inverse mapping, i.e., it defines a function that takes the image as input and generates its latent representation. But the major problem of identifying the region of anomaly still prevails. To overcome such problems, this chapter proposes a preliminary solution that identifies an image as healthy or sick and then locates the region of anomaly using a method inspired by the Randomized Input Sampling for Explanation of black-box models (RISE) [8]. The novelty of this work lies in the zero-shot learning [74–76] of the GAN for the present application. Although no unhealthy instances are required in the training phase, this approach is able to identify the region of anomaly using a saliency determination scheme over samples generated from masked images. This mimics the layman's approach to identifying an anomaly. The rest of the chapter is structured in the following sections. Section 2 briefly explains the related research work in the existing literature to date. Section 3 portrays the fundamental theoretical basis of the proposed system and Sect. 4 depicts the proposed framework and mathematical models in detail. Section 5 evaluates the investigation outcomes, and Sect. 6 concludes the chapter with a brief explanation of the future scope of the study.

2 Related Works

This section describes the existing research works in the present context.

2.1 Granular Computing in Image Understanding

Granular computing is one of the most powerful paradigms for multi-view data analysis at different levels of granularity. Information granules are conceptually meaningful entities used as the primary building blocks containing information. Plenty of research works [27–37] describe the use of information granules in various applications. A number of papers in the literature [38–44] also describe the efficient use of granular computing in image reduction. For


image understanding, we often look for semantically meaningful constructs. Granular computing approaches rely on constructing information granules, analyzing them and drawing conclusions from that analysis. Rizzi et al. [46] proposed a granular computing approach for automatic image classification; the same work is extended in [45, 48] to image segmentation. Liu et al. [47] gave a quantitative analysis of image segmentation using granular computing. In the proposed approach, we use the concept of granularity to distinguish the region of anomaly from the normal one.

2.2 Generative Adversarial Networks on Anomaly Detection

The application of GANs to anomaly detection has remained relatively unexplored. The concept was first published by Schlegl et al. [5] through their work on AnoGAN. AnoGAN is a simple GAN that has learnt the mapping between a predicted sample and its corresponding inverse in the latent space. It is trained only on positive samples; as a result, the generator learns only how to generate positive samples. So, when a negative sample is passed through the encoder, the network is unable to match it with the trained samples, resulting in a very high "anomaly score". With this score, it is easy to identify whether the image is normal or anomalous. As a contemporary of the AnoGAN method, an inversion method was proposed by Creswell and Bharath [22], which is quite similar to the approach of iterative mapping from image space to latent space. To further improve the iterative mapping scheme, Lipton and Tripathi [23] proposed stochastic clipping. However, with an iterative mapping scheme the time complexity increases many fold, making it unsuitable for real-world applications. The fast unsupervised anomaly detection technique with generative adversarial networks (f-AnoGAN) [24] replaced this iterative procedure by a learned mapping from image to latent space, improving the speed dramatically. Other applications of AnoGAN and contemporary methods have also been observed. A GAN-based telecom fraud detection approach was suggested by Zheng et al. [25], where a deep de-noising auto-encoder (AE) learns the relationship among the inputs and adversarial training is employed to discriminate between positive and negative samples in the data distribution. Ravanbakhsh et al. [26] employed two conditional generators, generating an optical flow image conditioned on a video frame and vice versa; the discriminator takes two images as input and decides whether both images are real data samples. Their training procedure is followed in our proposed framework to the effect that only normal data is used for training, although their approach does not include a separate encoder training procedure as used in the present work. Furthermore, a few more works such as Efficient GAN-Based Anomaly Detection (EGBAD) [9] and GANomaly [10] also employed BiGAN structures. In EGBAD, the encoder is able to map the input samples to their latent representations directly during adversarial training; GANomaly, on the other hand, includes an auto-encoder that learns how to encode the images to their latent space efficiently.
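The "anomaly score" mentioned above can be sketched as a weighted sum of a residual (reconstruction) term and a discriminator feature-matching term; the snippet below is a schematic paraphrase of that idea with placeholder tensors, not the exact formulation of [5].

```python
import torch

def anomaly_score(x, x_rec, feat_real, feat_rec, lam=0.1):
    """Schematic AnoGAN-style score: (1 - lam) * residual loss + lam * discrimination loss.
    x / x_rec: query image and its reconstruction from the learned latent code.
    feat_real / feat_rec: intermediate discriminator features for both images."""
    residual = torch.sum(torch.abs(x - x_rec))                    # how well x can be reproduced
    discrimination = torch.sum(torch.abs(feat_real - feat_rec))   # feature-matching term
    return (1 - lam) * residual + lam * discrimination

# A healthy test image yields a low score; an anomalous one cannot be reconstructed
# by a generator trained only on healthy data, so the score becomes high.
```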


2.3 Explainable Artificial Intelligence (XAI)

With the recent breakthroughs of deep learning in Artificial Intelligence, mankind is moving towards building more sophisticated and reliable neural systems that work not only more efficiently but also very quickly. Explainable AI, or XAI, bridges the gap in knowing the unknowns of black-box neural models. In the medical domain specifically, we also have to consider factors like risks and responsibilities when judging the interpretability of systems [55, 56]. Many works, such as [53, 54], have explored explainability in the medical domain. Saliency-based methods such as LIME [58], Grad-CAM [52], DeepLIFT [59], LRP [60, 61], BiLRP [62], PRM [63] and RISE [8] explain the reasons behind a model's output by assigning weights to the input components and studying their contribution towards the final prediction. On the other hand, DeconvNet [64], DBN [65] and Semantic Dictionary [66] use signal-based approaches to explainability: activation values of neurons are transformed into interpretable forms under the assumption that activation maps in deeper layers contain significant information behind the model's prediction. Several mathematical models have also been used to explore interpretability; in this regard, GAM [67], TCAV [68], Meta-predictors [69], the Representer Theorem [71] and the Explanation Vector [70] are the significant contributions. In the medical domain, a few works have employed the above methods: [72] uses Grad-CAM for visualization of pleural effusion in radiographs, and a saliency-based method has been used in [73] to provide interpretability for a neural network used for EEG sleep stage scoring. The work in [57] provides a summary of the literature useful for understanding the background.

2.4 XAI for Anomalous Region Segmentation

So far, GANs have been successful at deciding whether or not an image is anomalous. However, the next question that arises concerns the location of the anomaly. Yousefi Kamal [11] and Zhang et al. [12] implemented various deep learning approaches for breast tumor segmentation. Recently, in 2019, Singh et al. used deep adversarial training for breast tumor segmentation [13]. Kauffmann et al. [49] applied explainable artificial intelligence to the One-Class Support Vector Machine (OCSVM) for anomaly detection. But there is hardly any algorithm based on the BiGAN structure or related work that promises to identify the location of an anomaly accurately and efficiently. To overcome this, this chapter proposes an architecture based on an XAI approach, the Randomized Input Sampling for Explanation (RISE) model, that can efficiently identify the location of the anomaly. RISE models work in the same way as probing an electrical black box: the different ports are alternately probed and tested to identify the nature or characteristics of the black box. The experiments are conducted on semantically labeled data [16], i.e., every image has an associated label or line describing its contents. Here, in


an image, random masks are generated and the semantic outputs are obtained. The corresponding labels are compared with the ground truth labels and the similarity is measured. The masked image with the least similarity measure is likely to occlude the most significant part of the image. In other words, the mask covers the most essential portion of information of the image. This idea has been extended for identifying the anomalous region in this work.

3 The Fundamentals

Before we jump into the actual algorithm, it is important to walk through its various building blocks. This will help the reader easily understand, comprehend and analyze the entire concept and also grasp the "why" behind each step. A brief introduction to the fundamentals of the proposed method is given below.

3.1 Why Adversarial Training?

The word adversarial essentially means involving or characterized by conflict or opposition. Szegedy et al. provided an adversarial example on ImageNet in 2014 [3]. It shows that by adding an imperceptibly small vector, whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, one can change GoogLeNet's classification of an image. To put this into perspective, consider an image x, as shown in Fig. 1. If an appropriate amount of noise is added to it, state-of-the-art algorithms tend to misclassify the image.

Fig. 1 A VGG network initially predicted the koala correctly, but after adding some noise (which the model identified as a mosquito net), the network produced an adversarial example by predicting the image to feature a fox squirrel. Note that the human eye can still identify the animal as a koala, yet the network has been fooled
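The perturbation just described, the sign of the input gradient scaled by a small factor, is the fast gradient sign method; a minimal PyTorch-style sketch is given below, where the model, loss function and epsilon value are placeholders rather than the exact setup used in [3].

```python
import torch

def fgsm_example(model, x, y, loss_fn, eps=0.007):
    """Craft an adversarial image by moving x along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)      # cost of the current (correct) prediction
    loss.backward()                  # gradient of the cost w.r.t. the input pixels
    x_adv = x + eps * x.grad.sign()  # imperceptibly small, sign-valued perturbation
    return x_adv.clamp(0, 1).detach()
```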


Adversarial training intuitively tries to improve the robustness of a neural network by training it with adversarial examples; it is effectively a defense method against adversarial samples. It can be formulated as a min-max game [20]:

\min_{\theta} \; \max_{x':\, D(x, x') \le \epsilon} Z(\theta, x', y)

where θ denotes the model weights, x' is the adversarial input, y indicates the ground-truth labels, D(x, x') denotes a distance metric between the actual image input x and the adversarial input x', and Z(·) is the adversarial loss function. Deep neural networks are often vulnerable to adversarial examples, so adversarial training is important. This is where generative adversarial networks come into play. The GAN is a successful new framework for generative models [21]. A generative model is usually trained by maximizing the likelihood function, which involves marginal probabilities, partition functions and other computationally difficult quantities. A GAN framework has a major advantage here: it avoids such tedious calculation and instead makes two networks compete with each other, a generative model (the Generator, G) and a discriminative model (the Discriminator, D). The Generator G maps a sample z from a noise distribution to the data distribution, while the Discriminator D discriminates between training data and a sample from the generative model. G aims to maximize the probability that D will make an error. The two competing models play the following min-max competition with a value function V, defined as

V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)

where E denotes the expectation of the respective arguments. This competition continues till the discriminator cannot distinguish a generated sample from a data sample.
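A minimal sketch of how the value function of Eq. (1) translates into per-batch losses is shown below. The names generator and discriminator are assumed to be Keras models mapping noise to images and images to a probability, respectively; the non-saturating generator loss shown in the last line is the variant commonly used in practice and is a deliberate simplification of the exact form in Eq. (1).

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()  # -[t*log(p) + (1-t)*log(1-p)]

def gan_losses(discriminator, generator, x_real, z):
    """Per-batch losses realizing the min-max game of Eq. (1)."""
    x_fake = generator(z)
    d_real = discriminator(x_real)   # D(x)
    d_fake = discriminator(x_fake)   # D(G(z))
    # Discriminator maximizes log D(x) + log(1 - D(G(z))).
    d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    # Generator: non-saturating surrogate for minimizing log(1 - D(G(z))).
    g_loss = bce(tf.ones_like(d_fake), d_fake)
    return d_loss, g_loss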

3.2 What is an Anomaly?

An anomaly is a deviation from a pattern, or the exhibition of a feature that does not belong to the context of normal behavior. For instance, human beings normally have a single nose, so if a person were to develop two noses, that would be considered an anomaly. In the context of pattern recognition, suppose an image has uniform horizontal striations, but a bump appears in one part of the image, as depicted in Fig. 2; this bump neither continues nor conforms to the basic striped pattern, so it may be considered an anomaly. Anything that does not conform to the basic flow of behavior is called an anomaly.


Fig. 2 Schematic view explaining an anomaly: a striped pattern without an anomaly (left) and with an anomaly (right)

3.3 Why GAN in Anomaly Detection?

As mentioned earlier, GANs are extremely versatile: they tend to produce images that are similar to the input images. So, if a GAN is properly trained, the Generator will produce samples similar to the input and the discriminator will easily be able to distinguish between the two. Using this property, it is shown later that the region of anomaly can also be determined.

3.4 Impact of XAI in Anomaly Detection

XAI is based on explaining a model or phenomenon using interpretable rules. The best part of using XAI is that it does not treat the architecture as a pure black box, where the model simply takes in inputs and a result is obtained after processing; instead, it can answer questions, make errors and correct them accordingly. The beauty of this pipeline lies in the fact that every step is definite and determinable, so it naturally provides a detailed explanation of how these architectures work. In the proposed method, finding the reason behind separating anomalous regions from their closest normal sample is inspired by the technique described in the RISE model [8].

3.5 RISE Model

The Randomized Input Sampling for Explanation (RISE) approach can be applied to any off-the-shelf image network, treating it as a complete black box and not assuming access to its parameters, features or gradients. The key idea is to probe the base model by sub-sampling the input image via random masks and recording its response to each masked image. The final importance map is generated as a linear combination of the random binary masks, where the combination weights come from the output probabilities predicted by the base model on the masked images (Fig. 3).


Fig. 3 Result obtained from RISE model; (a) Query image, (b) RISE model explanation for dog

This seemingly simple yet surprisingly powerful approach allows us to peek inside an arbitrary network without accessing any of its internal structure. RISE is thus a true black-box explanation approach, conceptually different from mainstream white-box saliency approaches such as GradCAM [51], and is, in principle, generalizable to base models of any architecture. Let us consider an example (as given in the authors' code, https://github.com/eclique/RISE) to understand the process better. We take an image of a Pomeranian dog, as depicted in Fig. 4. The questions arise: what exactly qualifies the image as containing a dog? Which precise part of the picture makes the model predict that it is a dog: the background, the shadow of the dog, or the white fluffy body of the dog? Put differently, some parts of the image contribute more to object recognition and some parts less. So, what if we hide areas of the image to see which part actually contributes to the primary subject? To that end, the image is first masked with random binary masks, as described in Fig. 4; the portion of the image corresponding to the black portion of the mask is blacked out. If the model can still identify the object in the masked image as a dog, we can conclude that the area occluded by the mask does not substantially contribute to the object recognition process. Conversely, if the object is incorrectly identified, it can be inferred that important features of the image have been occluded.

Fig. 4 Probing the query image with a random mask


However, a single mask cannot reliably identify or accurately mark the region of importance, so a number of masks must be used. A black-box model, say f, identifies the object present in the image. The image of the dog is masked with multiple masks $M_i$, and each masked image is sent to the black-box model f, which returns a score analogous to the probability that the masked image contains a dog, given that the original image contains a dog. A weighted sum of these masks is then evaluated: a mask that caused the probability of the image containing a dog to drop must be occluding an important portion of the image, and hence deserves a greater weight, i.e., a greater contribution. In this way a final mask is obtained that highlights the portions contributing most towards object identification. In this case the final mask is rendered as a heat map and shown over the image. It is interesting to note that for dogs a major focus is put on the face, probably because its distinctive shape and pattern of eyes identify it as a dog.
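A minimal sketch of this mask-probe-and-weight procedure is given below, following the score-weighted linear combination described at the start of this subsection. It is a simplification of the original RISE implementation (no random mask shifts or bilinear smoothing), and all names are illustrative: model(image) is assumed to return the probability of the class of interest, and image is assumed to be an H x W x C array.

import numpy as np

def rise_saliency(model, image, n_masks=2000, grid=7, p_keep=0.5):
    """Score-weighted sum of random binary masks (RISE-style saliency)."""
    H, W = image.shape[:2]
    saliency = np.zeros((H, W), dtype=np.float32)
    for _ in range(n_masks):
        # Coarse binary grid, upsampled to image size so masks form blobs.
        coarse = (np.random.rand(grid, grid) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones((H // grid + 1, W // grid + 1)))[:H, :W]
        score = model(image * mask[..., None])  # probability of the target class
        saliency += score * mask                # weight each mask by its score
    return saliency / (n_masks * p_keep)        # normalization as in RISE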

3.6 Motivation Behind This Approach

In a nutshell, we have the following two capabilities:

1. Generating a healthy image from any image.
2. Analyzing which part of an image actually contributes to its identity.

Using these two capabilities, one can easily determine whether an image is anomalous and then point out the position of the anomaly; this is precisely where our approach is unique. First, the image is randomly masked. As previously described, the generator network has been trained to produce healthy images even from unhealthy ones, so the masked image is sent into the GAN to produce its closest healthy sample. This reconstructed healthy image is then passed through the anomaly detector along with the original image. A greater anomaly score indicates a greater contribution towards the anomaly.

4 Proposed Methodology

This section gives a broader insight into the architecture of the various building blocks of the proposed model. The training pipeline is broadly divided into two phases:

I. Healthy GAN:
   a. Encoder network
   b. Generator network
   c. Discriminator network


II. Anomalous region segmentation.

A detailed description of the various parts follows.

4.1 Healthy GAN

A schematic diagram of the proposed network architecture is shown in Fig. 5. The paradigm of zero-shot learning, i.e., training the GAN only on the healthy instances of the dataset, is implemented here. An image x is first sent into the encoder, which establishes the various feature mappings and maps the image into the latent layer z (Fig. 5). Existing state-of-the-art algorithms follow the BiGAN or ALI approach, in which the generator takes as input random noise generated outside the network. The major difference of this work lies in the Generator: earlier models used external random noise as generator input and later improved generator training with the help of the discriminator, whereas here the input to the generator is taken directly from the latent space z constructed by the encoder. Let the generated image be denoted x'. Once the reconstructed samples from the Generator are ready, the original sample x and the generated sample x' are sent to the discriminator. The discriminator analyses the difference between the two samples and the adversarial loss is back-propagated to both the discriminator and the generator network. Simultaneously, the reconstruction loss is back-propagated through the encoder network in order to achieve a better representation in the latent layer z. The generator retrains, tries to produce samples closer to the original samples, and the process continues until the generated image x' becomes so close to the actual image x that the discriminator is unable to distinguish between the two.

Fig. 5 HealthyGAN architecture


For simplicity, let $p_x(x)$ be the distribution of healthy samples for $x \in S_x$, and let $p_z(z)$ be the latent-layer distribution for $z \in S_z$. The various architectures of this subsection are explained next.

4.1.1 Encoder Network

The encoder consists of an input layer, three convolution layers, two batch normalization layers with leaky ReLU, and flatten and dense layers. Regarding the parametric specifications, the input layer has a shape of 200 × 200 (the input image shape after preprocessing) with a channel size of 1. The first two convolution layers have 32 channels each, with a kernel size of 3 × 3 and a stride of 2 × 2. To normalize the feature maps, a batch normalization layer and a leaky ReLU layer are then added. Another convolution layer follows, this time with 64 channels and the same kernel size and stride, again followed by batch normalization and leaky ReLU. Finally, to assemble the feature mappings obtained, the output of the leaky ReLU layer is flattened and passed through a dense layer to reach the latent space $S_z$. Mathematically, training an encoder E can be interpreted as

$$E : S_x \rightarrow S_z \tag{2}$$

The probability distribution induced by the encoder is

$$p_E(z \mid x) = \delta(z - E(x)) \tag{3}$$

The encoder maps healthy samples x into the latent feature space z.
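A minimal Keras sketch of the encoder just described is given below. The size of the latent vector (latent_dim) is an assumption, as the chapter does not state the dimensionality of z; all other layer choices follow the textual description.

import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(latent_dim=100):
    """Encoder E : S_x -> S_z, a sketch of the layer description above."""
    inp = layers.Input(shape=(200, 200, 1))
    x = layers.Conv2D(32, 3, strides=2, padding="same")(inp)
    x = layers.Conv2D(32, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv2D(64, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Flatten()(x)
    z = layers.Dense(latent_dim)(x)  # latent representation z
    return tf.keras.Model(inp, z, name="encoder")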

4.1.2 Generator Network

The generator network reconstructs the image from the encoded latent space. The usual dense, batch normalization, ReLU and reshaping layers are added, after which the output has 64 channels. To obtain an image, the spatial dimensions of this output must be increased, so a transposed convolution layer is added with a kernel size of 2 × 2 and a stride of 2 × 2, while the number of channels is decreased to 32. To normalize the process, a convolution layer is added with all other parameters fixed except that the kernel size is changed to 3 × 3, followed by a batch normalization layer and a ReLU activation layer. This is followed by another transposed convolution layer with a kernel size of 5 × 5. Finally, a convolution layer and a tanh activation layer reduce the channel size to 1, yielding the required generated image. Mathematically, the Generator G can be expressed as

$$G : S_z \rightarrow S_{x'} \tag{4}$$

with induced probability

$$p_G(x' \mid z) = \delta(x' - G(z)), \tag{5}$$

$$p_G(x') = \mathbb{E}_{z \sim p_z}\big[\,p_G(x' \mid z)\,\big] \tag{6}$$

where $x' \in S_{x'}$ is the reconstructed sample space. The goal is to train a generator such that $p_G(x') \approx p_x(x)$.
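A Keras sketch of the generator is shown below. The latent size, the 25 × 25 reshape target, and the stride of the 5 × 5 transposed convolution are assumptions chosen so that the output matches the 200 × 200 × 1 input; the chapter does not state these values explicitly.

import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=100):
    """Generator G : S_z -> S_x', a sketch of the layer description above."""
    z = layers.Input(shape=(latent_dim,))
    x = layers.Dense(25 * 25 * 64)(z)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Reshape((25, 25, 64))(x)               # 64 channels, as described
    x = layers.Conv2DTranspose(32, 2, strides=2)(x)   # 25x25 -> 50x50
    x = layers.Conv2D(32, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2DTranspose(32, 5, strides=4, padding="same")(x)  # 50x50 -> 200x200 (assumed stride)
    x = layers.Conv2D(1, 3, padding="same")(x)
    out = layers.Activation("tanh")(x)                # reconstructed image x'
    return tf.keras.Model(z, out, name="generator")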

4.1.3 Discriminator Network

The discriminator network is similar to the encoder network. Instead of ending in the latent layer, the output of the dense layer is passed through another dense layer that reduces the dimension to 1, followed by a softmax activation layer. It is trained so that it maps a generated image to 0 and a real image to 1. The discriminator network D can be interpreted as

$$D : S_x \times S_{x'} \rightarrow [0, 1] \tag{7}$$
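A sketch of the discriminator is given below, mirroring the encoder layers. Since the output is a single unit, a sigmoid is used here as a stand-in for the softmax mentioned above; that substitution, the 128-unit intermediate dense layer, and the choice to also return that intermediate representation (useful later as f(·) in the adversarial loss) are assumptions of this sketch.

import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator():
    """Discriminator D: scores an image as real (1) or generated (0)."""
    inp = layers.Input(shape=(200, 200, 1))
    x = layers.Conv2D(32, 3, strides=2, padding="same")(inp)
    x = layers.Conv2D(32, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv2D(64, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    f = layers.Flatten()(x)
    f = layers.Dense(128)(f)                          # intermediate representation f(x)
    out = layers.Dense(1, activation="sigmoid")(f)    # probability of being real
    return tf.keras.Model(inp, [out, f], name="discriminator")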

4.1.4 Training

The training objective is the min–max objective function defined in Donahue et al. [6]:

$$\min_{G, E}\; \max_{D}\; V(D, E, G)$$

where

$$V(D, E, G) = \mathbb{E}_{x \sim p_x}\,\mathbb{E}_{z \sim p_E(z \mid x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\,\mathbb{E}_{x' \sim p_G(x' \mid z)}\big[\log(1 - D(G(z)))\big] \tag{8}$$

In order to obtain a better latent-layer representation, we back-propagate the reconstruction loss to the encoder and generator networks. The reconstruction loss can be interpreted as

$$L_{recon} = \|x - x'\|_1 \tag{9}$$


We also update the generator using the internal representation of the discriminator. Let f(·) be the function that outputs an intermediate discriminator representation for a given input x. The adversarial loss thus obtained is back-propagated to both the generator and the encoder. The adversarial loss can be interpreted as

$$L_{adv} = \|f(x) - f(x')\|_2 \tag{10}$$

where

$$x' = G(E(x)) \tag{11}$$
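A minimal sketch of how the two losses of Eqs. (9) and (10) can be computed per batch is given below. It assumes the discriminator returns a (score, features) pair as in the earlier sketch, and uses a batch-averaged absolute error as a stand-in for the plain L1 norm of Eq. (9); the optimizer calls and gradient updates are omitted.

import tensorflow as tf

def reconstruction_and_adversarial_losses(encoder, generator, discriminator, x):
    """L_recon = ||x - x'||_1 and L_adv = ||f(x) - f(x')||_2 with x' = G(E(x))."""
    x_prime = generator(encoder(x))            # x' = G(E(x)), Eq. (11)
    _, f_real = discriminator(x)
    _, f_fake = discriminator(x_prime)
    l_recon = tf.reduce_mean(tf.abs(x - x_prime))  # batch-averaged L1 reconstruction loss
    l_adv = tf.norm(f_real - f_fake)               # L2 distance between feature maps
    return l_recon, l_adv

In training, l_adv would be back-propagated to the generator and encoder, l_recon to the encoder and generator, and the discriminator would be updated with the adversarial objective of Eq. (8).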

4.1.5 Optimal Generator, Encoder and Discriminator

Our architecture shares many theoretical properties with both Goodfellow et al. [1] and Donahue et al. [6]. It can be proved that the optimal discriminator $D_{EG}$ is the Radon–Nikodym derivative

$$D_{EG} = \frac{dp_X(x)}{d\big(p_X(x) + p_G(E(x))\big)} \;:\; \Omega \rightarrow [0, 1] \tag{12}$$

It can also be proved that if E and G are the optimal encoder and generator respectively, then $E = G^{-1}$ almost everywhere; that is,

$$G(E(x)) = x \tag{13}$$

for $p_X$-almost every $x \in S_x$.

4.2 Anomalous Region Segmentation

This subsection is strongly inspired by the technique adopted in the RISE model, whose authors addressed the problem of XAI by proposing an importance map that indicates how much each pixel contributes to the model's prediction, obtained by probing the model with randomly masked images. In this paper, we introduce a novel way to use RISE in anomaly detection. Let

$$M : \Lambda \rightarrow \{0, 1\}, \qquad m_i \in M, \qquad \Lambda = \{1, \ldots, H\} \times \{1, \ldots, W\} \tag{14}$$

be a set of random binary masks, where H and W are the height and width of the input image x. A detailed explanation of mask generation is given in Sect. 4.2.2.




Let $EG : S_x \rightarrow S_{x'}$ be the network that outputs the generated version of an image encoded by the previously trained encoder; mathematically, $EG(x) = G(E(x))$. Consider $G_i \in \mathcal{G}$, the set of images generated from the masked inputs. Then

$$G_i = EG(I \odot m_i) \tag{15}$$

where $\odot$ denotes element-wise multiplication.

4.2.1 Anomaly Detector

The generated images are now sent to the anomaly detector, which is configured to compare each generated image with the original image X and output an anomaly score; thus $AD : S_x \times S_{x'} \rightarrow (0, 1)$. Let $S_i \in S$ be the set of anomaly scores for the different generated images:

$$S_i = AD(X, G_i) \tag{16}$$

The anomaly detector is inspired by Schlegl et al. [5] and can be expressed mathematically as

$$AD(x, x') = \lambda \cdot R(x, x') + \lambda' \cdot D(x, x') \tag{17}$$

where $R(x, x')$ and $D(x, x')$ give the residual score and the discrimination score respectively, and $\lambda$ and $\lambda'$ are adjusted and normalized so that the anomaly score lies in (0, 1). Specifically,

$$R(x, x') = \|x - x'\|_1 \tag{18}$$

and

$$D(x, x') = \|f(x) - f(x')\|_2 \tag{19}$$
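A minimal sketch of this anomaly score is given below. The weight lam, the averaging of the residual, the scaling of the feature distance, and the final clipping into (0, 1) are assumptions standing in for the "adjusted and normalized" weights mentioned above; f is any callable mapping an image to an intermediate discriminator representation.

import numpy as np

def anomaly_score(x, x_prime, f, lam=0.9):
    """AD(x, x') = lam * R(x, x') + (1 - lam) * D(x, x'), Eqs. (17)-(19)."""
    fx, fxp = f(x), f(x_prime)
    residual = np.abs(x - x_prime).mean()                # R(x, x'): L1 residual (averaged)
    discrimination = np.linalg.norm(fx - fxp) / fx.size  # D(x, x'): feature distance (scaled)
    score = lam * residual + (1.0 - lam) * discrimination
    return float(np.clip(score, 0.0, 1.0))               # keep the score inside (0, 1)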

4.2.2 Random Granular Mask Generation

The method of mask generation is inspired by the technique described in RISE [8]; here we present it in light of the anomaly detector AD. As discussed earlier, let $M : \Lambda \rightarrow \{0, 1\}$ be a set of random binary masks with distribution $\mathcal{D}$. For better understanding, let us define a similar index set $\Lambda'$ for the generated image $x' \in S_{x'}$. Now, for pixels $\lambda \in \Lambda$ and $\lambda' \in \Lambda'$, we can define S for a query image I as the expected anomaly score over all possible masks M, conditioned on $\lambda$ and $\lambda'$ being explicitly anomalous, that is $M(\lambda, \lambda') = 1$. So we can express S as

$$S_{I, AD}(\lambda, \lambda') = \mathbb{E}_M\big[AD(I, EG(I \odot m_i)) \mid M(\lambda, \lambda') = 1\big] \tag{20}$$

For $m_i \in M$ the above equation can be written as

$$S_{I, AD}(\lambda, \lambda') = \sum_i AD(I, EG(I \odot m_i))\, P\big[M = m_i \mid M(\lambda, \lambda') = 1\big] \tag{21}$$

$$S_{I, AD}(\lambda, \lambda') = \frac{1}{P[M(\lambda, \lambda') = 1]} \sum_i AD(I, EG(I \odot m_i))\, P\big[M = m_i,\; M(\lambda, \lambda') = 1\big] \tag{22}$$

Now we know that

$$P\big[M = m_i,\; M(\lambda, \lambda') = 1\big] = \begin{cases} 0, & m_i(\lambda, \lambda') = 0 \\ P[M = m_i], & m_i(\lambda, \lambda') = 1 \end{cases} \tag{23}$$

$$= m_i(\lambda, \lambda')\, P[M = m_i] \tag{24}$$

Substituting this value, we get

$$S_{I, AD}(\lambda, \lambda') = \frac{1}{P[M(\lambda, \lambda') = 1]} \sum_i AD(I, EG(I \odot m_i)) \cdot m_i(\lambda, \lambda') \cdot P[M = m_i] \tag{25}$$

Hence $S_i$ signifies the importance of mask $m_i$ in anomaly detection. In order to obtain the most anomalous region in the test image, we need a mask that properly covers the anomalous region, which can be expressed as

$$M = \sum_i S_i \cdot m_i \tag{26}$$

The segmented image R is obtained by multiplying M with I element-wise:

$$R = M \odot I \tag{27}$$
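A compact sketch of Eqs. (25)-(27) is given below: the image is probed with random granular masks, each mask is weighted by the anomaly score of its reconstruction, and the aggregated mask is multiplied with the image to segment the anomaly. The names eg and anomaly_detector stand for EG(·) = G(E(·)) and AD(·, ·) as sketched earlier, the grid size and keep probability are illustrative, and the normalization by the accumulated score is a simplification of the $1/P[M(\lambda, \lambda') = 1]$ factor in Eq. (25).

import numpy as np

def segment_anomaly(image, eg, anomaly_detector, n_masks=1000, grid=8, p_keep=0.5):
    """Random granular masking and score-weighted aggregation, Eqs. (25)-(27)."""
    H, W = image.shape[:2]
    aggregate = np.zeros((H, W), dtype=np.float32)
    total = 0.0
    for _ in range(n_masks):
        coarse = (np.random.rand(grid, grid) < p_keep).astype(np.float32)
        m = np.kron(coarse, np.ones((H // grid + 1, W // grid + 1)))[:H, :W]
        g = eg(image * m[..., None])         # G_i = EG(I ⊙ m_i), Eq. (15)
        s = anomaly_detector(image, g)       # S_i = AD(I, G_i), Eq. (16)
        aggregate += s * m                   # M = Σ_i S_i · m_i, Eq. (26)
        total += s
    M = aggregate / max(total, 1e-8)         # simplified normalization of Eq. (25)
    return M[..., None] * image              # R = M ⊙ I, Eq. (27)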


5 Evaluation

Experiments were conducted on the benchmark mini-MIAS dataset [14, 15], which consists of 208 healthy mammograms and 114 abnormal samples. A schematic diagram of the procedure is shown in Fig. 6, and pictorial observations on samples from the MIAS dataset are shown in Table 1.

5.1 Evaluation Metric

Taha et al. [50] summarize and compare the different evaluation metrics for medical image segmentation. We use the Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC) for a pixel-level analysis of our proposed method, and also for comparison with other state-of-the-art algorithms. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). For segmentation evaluation, the TPR corresponds to regions that are anomalous in the ground truth and that the model correctly predicts as anomalous, while the FPR corresponds to regions that the model predicts as anomalous although they are actually normal. A higher TPR per FPR denotes better segmentation; the AUC is the area under the ROC curve.

Fig. 6 Schematic diagram of the proposed approach for anomalous region segmentation


Table 1 Experimental results of breast cancer tumor segmentation using the proposed method. The first column shows the input images, the second column the corresponding segmented tumor as a blue contour within the breast region, the third column the ROC curve, and the fourth column the AUC of the ROC

5.2 Experiments and Results

In this section, the various experiments and their results are discussed.

5.2.1 Pixel Level Analysis

As mentioned earlier, we use the ROC curve and AUC for the pixel-level analysis of our proposed algorithm. We flipped the odd-indexed images to maintain homogeneity during training. Segmentations obtained from our method are shown in Tables 1 and 2.


Table 2 Continuation of Table 1

After segmentation, we label each pixel by applying a suitable threshold and compare the result with the ground-truth segmentation mask. If a pixel lies in the anomalous region of the ground truth and, after thresholding, our model also predicts it to lie in an anomalous region, it counts as a true positive; false positives are computed analogously. The area under the curve (AUC) is then calculated from the ROC curve. Results are shown in Tables 1 and 2.
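A minimal sketch of this pixel-level evaluation is shown below, treating every pixel of the continuous segmentation map as a score against the binary ground-truth mask. The function and argument names are illustrative, and the computation assumes both anomalous and normal pixels are present in the ground truth.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def pixel_level_auc(segmentation_map, ground_truth_mask):
    """Pixel-level ROC/AUC: scores vs. binary ground-truth labels."""
    scores = segmentation_map.ravel().astype(np.float32)
    labels = (ground_truth_mask.ravel() > 0).astype(np.int32)
    fpr, tpr, _ = roc_curve(labels, scores)  # sweeps over all thresholds
    return roc_auc_score(labels, scores), fpr, tpr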

5.2.2 Effect of Granularity

According to Eq. (25), the expected anomaly score $S_{I, AD}(\lambda, \lambda')$ for any query pixel depends directly on the number of masks used.


Fig. 7 Variation of average AUC with number of masks (N) used

To verify this, we varied the number of masks used, computed the AUC score for each setting, and averaged it over the whole test set. From Fig. 7 it is clearly visible that the AUC score initially increases with the number of masks but later remains constant. This phenomenon can be explained through the concept of granularity: as the number of masks increases, conceptually we consider finer granules as the information building blocks. Up to a certain level this yields a boost in performance; theoretically, at that level we obtain the most appropriate information granules as building-block elements. Increasing the granularity further only creates sub-granules of that ideal information granule, which is why the performance saturates. Moreover, processing a greater number of masks increases the computational expense, so we maintain a proper trade-off between the number of masks used and performance.
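A sketch of this granularity study is given below. The functions segment_fn and evaluate_fn stand for the segmentation and pixel-level AUC routines sketched earlier (with evaluate_fn assumed to return a scalar AUC), and the particular mask counts are illustrative rather than the values used for Fig. 7.

import numpy as np

def auc_vs_mask_count(test_images, ground_truths, segment_fn, evaluate_fn,
                      counts=(100, 250, 500, 1000, 2000)):
    """Average pixel-level AUC as a function of the number of masks N."""
    results = {}
    for n in counts:
        aucs = [evaluate_fn(segment_fn(img, n_masks=n), gt)
                for img, gt in zip(test_images, ground_truths)]
        results[n] = float(np.mean(aucs))  # one averaged AUC per mask count
    return results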

5.2.3 Algorithmic Comparisons

We compared the performance of our proposed algorithm with other state-of-the-art algorithms on the basis of the average AUC obtained over the whole test data. Results are shown in Table 3.

Table 3 Comparative performance of our method with existing state-of-the-art methods

Method             Average AUC
AnoGAN             0.82
GANomaly           0.84
Proposed method    0.92


6 Conclusion and Future Work

Breast cancer poses a formidable challenge for humanity [17, 18]; however, if the location of the tumor is estimated early, one has a head start in combating the disease. Here, a GAN is first trained on healthy instances (following the zero-shot segmentation paradigm), so that given any anomalous input it can produce the closest healthy instance. The test images are then probed with randomly generated masks. The masked images are sent to the GAN model and the corresponding closest healthy images are generated. The original image and each generated image are then sent to the anomaly detector, where a greater anomaly score indicates a greater extent of anomaly. Thus, a masked image with a greater anomaly score indicates that the region covered by the mask is the sought region of anomaly. The advantages of this method are manifold: first, the patient is better informed; second, fewer resources are required to obtain the results. Another interesting observation made during the experiments is that for some query images the AUC value is abruptly lower than for the others, indicating that masking pixels can itself cause adversarial effects, as mentioned earlier; further exploration in this direction may shed light on the role of granularity in adversarial effects. On the other hand, the model requires that it be trained and tested on images of similar resolution to function properly. Another line of research is to augment the models so that they can identify and locate all sorts of anomalies, rather than being restricted to breast cancer detection. Further work on this kind of model augmentation is in progress.

Acknowledgments This work was conducted at the Artificial Intelligence Laboratory, Jadavpur University.

References

1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014) 2. Goodfellow, I.: NIPS 2016 tutorial: generative adversarial networks (2016). arXiv preprint arXiv:1701.00160 3. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2014). arXiv preprint arXiv:1412.6572 4. Di Mattia, F., Galeone, P., De Simoni, M., Ghelfi, E.: A survey on gans for anomaly detection (2019). arXiv preprint arXiv:1906.11632 5. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging, pp. 146–157. Springer, Cham (2017) 6. Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning (2016). arXiv preprint arXiv:1605.09782


7. Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A.: Adversarially learned inference (2016). arXiv preprint arXiv:1606.00704 8. Petsiuk, V., Das, A., Saenko, K.: Rise: randomized input sampling for explanation of black-box models (2018). arXiv preprint arXiv:1806.07421 9. Zenati, H., Foo, C.S., Lecouat, B., Manek, G., Chandrasekhar, V.R.: Efficient gan-based anomaly detection (2018). arXiv preprint arXiv:1802.06222 10. Akcay, S., Atapour-Abarghouei, A., Breckon, T.P.: December. Ganomaly: Semi-supervised anomaly detection via adversarial training. In: Asian Conference on Computer Vision, pp. 622– 637. Springer, Cham (2018) 11. Yousefikamal, P.: Breast tumor classification and segmentation using convolutional neural networks (2019). arXiv preprint arXiv:1905.04247 12. Zhang, L., Luo, Z., Chai, R., Arefan, D., Sumkin, J., Wu, S.: March. Deep-learning method for tumor segmentation in breast DCE-MRI. In: Medical Imaging 2019: Imaging Informatics for Healthcare, Research, and Applications, vol. 10954, p. 109540F. International Society for Optics and Photonics (2019) 13. Singh, V.K., Rashwan, H.A., Abdel-Nasser, M., Sarker, M., Kamal, M., Akram, F., Pandey, N., Romani, S., Puig, D.: An efficient solution for breast tumor segmentation and classification in ultrasound images using deep adversarial learning (2019). arXiv preprint arXiv:1907.00887 14. Dance, P.J., Astley, D., Hutt, S., Boggis, I., Ricketts Suckling, C.J.: Mammographic image analysis society (mias) database v1.21 (2015) 15. Bowyer, K., Kopans, D., Kegelmeyer, W.P., Moore, R., Sallam, M., Chang, K., Woods, K.: The digital database for screening mammography.In: Third International Workshop on Digital Mammography, vol. 58, p. 27 (1996) 16. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431– 3440 (2015) 17. Azamjah, N., Soltan-Zadeh, Y., Zayeri, F.: Global trend of breast cancer mortality rate: a 25-year study. Asian Pac. J. Cancer Prev.: APJCP 20(7), 2015 (2019) 18. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2020. CA: A Cancer J. Clin. 70(1), 7–30 (2020). https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590 19. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst. 90(2), 111–127 (1997) 20. Ren, K., Zheng, T., Qin, Z., Liu, X.: Adversarial attacks and defenses in deep learning. Engineering (2020) 21. Lee, H., Sungyeob, H., Jungwoo, L.:Generative adversarial trainer: Defense to adversarial perturbations with gan (2017).arXiv preprint arXiv:1705.03387 22. Creswell, A., Bharath, A.A.: Inverting the generator of a generative adversarial network. IEEE Trans. Neural Netw. Learn. Syst. 30(7), 1967–1974 (2018) 23. Lipton, Z.C., Tripathi, S.: Precise recovery of latent vectors from generative adversarial networks (2017). arXiv preprint arXiv:1702.04782 24. Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 54, 30–44 (2019) 25. Zheng, Y.J., Zhou, X.H., Sheng, W.G., Xue, Y., Chen, S.Y.: Generative adversarial network based telecom fraud detection at the receiving bank. Neural Netw. 102, 78–86 (2018) 26. 
Ravanbakhsh, M., Nabi, M., Sangineto, E., Marcenaro, L., Regazzoni, C., Sebe, N.: Abnormal event detection in videos using generative adversarial nets. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 1577–1581. IEEE (2017) 27. Acampora, G., Loia, V., Vasilakos, A.V.: Autonomous composition of fuzzy granules in ambient intelligence scenarios. In: Human-Centric Information Processing Through Granular Modelling (pp. 265–287). Springer, Berlin, Heidelberg (2009) 28. Acampora, G., Gaeta, M., Loia, V., Vasilakos, A.V.: Interoperable and adaptive fuzzy services for ambient intelligence applications. ACM Trans. Auton. Adapt. Syst. (TAAS) 5(2), 1–26 (2010)


29. Yao, Y.: Granular computing: past, present and future. In: 2008 IEEE International Conference on Granular Computing, pp. 80–85. IEEE (2008) 30. Bargiela, A., Pedrycz, W.: A model of granular data: a design problem with the Tchebyschev FCM. Soft. Comput. 9(3), 155–163 (2005) 31. Bargiela, A., Pedrycz, W.: The roots of granular computing. In: 2006 IEEE International Conference on Granular Computing, pp. 806–809. IEEE (2006) 32. Bargiela, A., Pedrycz, W.: Toward a theory of granular computing for human-centered information processing. IEEE Trans. Fuzzy Syst. 16(2), 320–330 (2008) 33. Bargiela, A., Pedrycz, W. (eds.): Human-centric information processing through granular modelling, vol. 182. Springer Science & Business Media (2009) 34. Pedrycz, W., Bargiela, A.: Granular clustering: a granular signature of data. IEEE Trans. Syst. Man Cybern., Part B (Cybern.) 32(2), 212–224 (2002) 35. Pedrycz, W., Vasilakos, A.: Granular models: design insights and development practices. In: Novel Developments in Granular Computing: Applications for Advanced Human Reasoning and Soft Computation, pp. 243–263. IGI Global (2010) 36. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst. 90(2), 111–127 (1997) 37. Zadeh, L.A., 1999. From computing with numbers to computing with words. From manipulation of measurements to manipulation of perceptions. IEEE Transactions on circuits and systems I: fundamental theory and applications, 46(1), pp.105–119. 38. Zadeh, L.A.: Toward a generalized theory of uncertainty (GTU)––an outline. Inf. Sci. 172(1–2), 1–40 (2005) 39. Kirshner, H., Porat, M.: On the role of exponential splines in image interpolation. IEEE Trans. Image Process. 18(10), 2198–2208 (2009) 40. Nobuhara, H., Hirota, K., Sessa, S., Pedrycz, W.: Efficient decomposition methods of fuzzy relation and their application to image decomposition. Appl. Soft Comput. 5(4), 399–408 (2005) 41. Konar, A.: Computational Intelligence: Principles, Techniques and Applications. Springer Science & Business Media (2006) 42. Beliakov, G., Bustince, H., Paternain, D.: Image reduction using means on discrete product lattices. IEEE Trans. Image Process. 21(3), 1070–1083 (2011) 43. Di Martino, F., Loia, V., Perfilieva, I., Sessa, S.: An image coding/decoding method based on direct and inverse fuzzy transforms. Int. J. Approximate Reasoning 48(1), 110–113 (2008) 44. Loia, V., Sessa, S.: Fuzzy relation equations for coding/decoding processes of images and videos. Inf. Sci. 171(1–3), 145–172 (2005) 45. Paternain, D., Fernández, J., Bustince, H., Mesiar, R., Beliakov, G.: Construction of image reduction operators using averaging aggregation functions. Fuzzy Sets Syst. 261, 87–111 (2015) 46. Wang, F., Ruan, J.J., Xie, G.: Medical image segmentation algorithm based on granular computing. In: Advanced Materials Research, vol. 532, pp. 1578–1582. Trans Tech Publications Ltd (2012) 47. Rizzi, A., Del Vescovo, G.: Automatic image classification by a granular computing approach. In: 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing (pp. 33–38). IEEE (2006) 48. Liu, H., Diao, X., Guo, H.: Quantitative analysis for image segmentation by granular computing clustering from the view of set. J. Algorithms Comput. Technol. 13, 1748301819833050 (2019) 49. Kok, V.J., Chan, C.S.: GrCS: granular computing-based crowd segmentation. IEEE Trans. Cybern. 47(5), 1157–1168 (2016) 50. 
Kauffmann, J., Müller, K.R., Montavon, G.: Towards explaining anomalies: a deep Taylor decomposition of one-class models. Pattern Recogn. 101, 107198 (2020) 51. Taha, A.A., Hanbury, A.: Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15(1), 29 (2015) 52. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)


53. Tonekaboni, S, Joshi, S., McCradden, M., Goldenberg, A.: What clinicians want: contextualizing explainable machine learning for clinical end use. In: Proceedings of the 4th Machine Learning for Healthcare Conference, PMLR, vol. 106, pp. 359–380 (2019) 54. Holzinger, A., Langs, G., Denk, H., Zatloukal, K., & Müller, H. (2019). Causability and explainabilty of artificial intelligence in medicine. Wiley Interdisc. Rev.: Data Min. Knowl. Discovery, e1312. https://doi.org/10.1002/widm.1312. 55. Xie, Y., Gao, G., Chen, X.A.: Outlining the design space of explainable intelligent systems for medical diagnosis (2019). arXiv preprint arXiv:1902.06019 56. Croskerry, P., Cosby, K., Graber, M.L., Singh, H.: Diagnosis: Interpreting the shadows. CRC Press (2017) 57. Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (XAI): towards medical XAI (2019). arXiv preprint arXiv:1907.07374 58. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016) 59. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences (2017). arXiv preprint arXiv:1704.02685 60. Samek, W., Montavon, G., Binder, A., Lapuschkin, S., Müller, K.R.: Interpreting the predictions of complex ml models by layer-wise relevance propagation (2016). arXiv preprint arXiv:1611. 08191 61. Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., Müller, K.R.: Unmasking clever hans predictors and assessing what machines really learn. Nat. Commun. 10(1), 1–8 (2019) 62. Eberle, O., Büttner, J., Kräutli, F., Müller, K.R., Valleriani, M., Montavon, G.: Building and interpreting deep similarity models (2020). arXiv preprint arXiv:2003.05431 63. Zhou, Y., Zhu, Y., Ye, Q., Qiu, Q., Jiao, J.: Weakly supervised instance segmentation using class peak response. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3800 (2018) 64. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833. Springer, Cham (2014) 65. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. University of Montreal 1341(3), 1 (2009) 66. Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., Mordvintsev, A.: The building blocks of interpretability. Distill 3(3), e10 (2018) 67. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730 (2015) 68. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F.: Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). In: International Conference on Machine Learning, pp. 2668–2677. PMLR (2018) 69. Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437 (2017) 70. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.R.: How to explain individual classification decisions. The J. Mach. Learn. Res. 11, 1803–1831 (2010) 71. 
Yeh, C.K., Kim, J., Yen, I.E.H., Ravikumar, P.K.: Representer point selection for explaining deep neural networks. In: Advances in Neural Information Processing Systems, pp. 9291–9301 (2018) 72. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597 (2019)


73. Vilamala, A., Madsen, K.H., Hansen, L.K.: Deep convolutional neural networks for interpretable analysis of EEG sleep stage scoring. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE (2017) 74. Bucher, M., Tuan-Hung, V.U., Cord, M., Pérez, P.: Zero-shot semantic segmentation. In: Advances in Neural Information Processing Systems, pp. 468–479 (2019) 75. Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2251–2265 (2018) 76. Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., Elgammal, A.: A generative adversarial approach for zero-shot learning from noisy texts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1004–1013) (2018)
