Interpretability in Deep Learning

Table of contents :
Preface
Field Emergence, Relevance, and Necessity
Acknowledgements
Contents
Acronyms
1 Introduction to Interpretability
1.1 Deep Learning Glossary
1.2 Evolution of Deep Learning
1.2.1 Neural Learning
1.2.2 Fuzzy Learning
1.2.3 Convergence of Fuzzy Logic and Neural Learning
1.2.4 Synergy of Neuroscience and Deep Learning
1.3 Awakening of Interpretability
1.3.1 Relevance
1.3.2 Necessity
1.3.3 The Taxonomy of Interpretability
1.4 The Question of Interpretability
1.4.1 Interpretability—Metaverse
1.4.2 Interpretability—The Right Tool
1.4.3 Interpretability—The Wrong Tool
2 Neural Networks for Deep Learning
2.1 Neural Network Architectures
2.1.1 Perceptron
2.1.2 Artificial Neural Networks
2.1.3 Recurrent Neural Networks
2.1.4 Convolutional Neural Networks
2.1.5 Autoencoder Neural Networks
2.1.6 Generative Adversarial Networks
2.1.7 Graph Neural Networks
2.2 Learning Mechanisms
2.2.1 Activation Function
2.2.2 Forward Propagation
2.2.3 Backpropagation
2.2.4 Gradient Descent
2.2.5 Learning Rate
2.2.6 Optimization
2.2.7 Initialization
2.2.8 Regularization
2.3 Challenges and Limitations of Traditional Techniques
2.3.1 Resource-Demanding Checks
2.3.2 Uncertainty Measure
2.3.3 Network Learning Sanity Check
2.3.4 Gradient Checks
2.3.5 Decision Transparency
3 Knowledge Encoding and Interpretation
3.1 What Is Knowledge?
3.1.1 Image Representation
3.1.2 Word Representation
3.1.3 Graph Representation
3.2 Knowledge Encoding and Architectural Understanding
3.2.1 The Role of Neurons
3.2.2 Role of Layers
3.2.3 Role of Explanation
3.2.4 Semantic Understanding
3.2.5 Network Understanding
3.3 Design and Analysis of Interpretability
3.3.1 Divide and Conquer
3.3.2 Greedy
3.3.3 Back-Tracking
3.3.4 Dynamic
3.3.5 Branch and Bound
3.3.6 Brute-Force
3.4 Knowledge Propagation in Deep Network Optimizers
3.4.1 Knowledge Versus Performance
3.4.2 Deep Versus Shallow Encoding
4 Interpretation in Specific Deep Architectures
4.1 Interpretation in Convolution Networks
4.1.1 Case Study: Image Representation by Unmasking Clever Hans
4.1.2 Variants of CNNs
4.1.3 Interpretation of CNNs
4.1.4 Review: CNN Visualization Techniques
4.1.5 Review: CNN Adversarial Techniques
4.1.6 Inverse Image Representation
4.1.7 Case Study: Superpixels Algorithm
4.1.8 Activation Grid and Activation Map
4.1.9 Convolution Trace
4.2 Interpretation in Autoencoder Networks
4.2.1 Visualization of Latent Space
4.2.2 Sparsity and Interpretation
4.2.3 Case Study: Microscopy Structure-to-Structure Learning
4.3 Interpretation in Adversarial Networks
4.3.1 Interpretation in Generative Networks
4.3.2 Interpretation in Latent Spaces
4.3.3 Evaluation Metrics
4.3.4 Case Study: Digital Staining of Microscopy Images
4.4 Interpretation in Graph Networks
4.4.1 Neural Structured Learning
4.4.2 Graph Embedding and Interpretability
4.4.3 Evaluation Metrics for Interpretation
4.4.4 Disentangled Representation Learning on Graphs
4.4.5 Future Direction
4.5 Self-Interpretable Models
4.5.1 Case-based Reasoning Through Prototypes
4.5.2 ProtoNets
4.5.3 Concept Whitening
4.5.4 Self-Explaining Neural Network
4.6 Pitfalls of Interpretability Methods
4.6.1 Case Study: Feature Visualization and Network Dissection
4.6.2 Gradients as Sensitivity Maps
4.6.3 Multiplying Maps with Input Images
4.6.4 Towards Robust Interpretability
5 Fuzzy Deep Learning
5.1 Fuzzy Theory
5.1.1 Fuzzy Sets and Fuzzy Membership
5.1.2 Fuzzification and Defuzzification
5.1.3 Fuzzy Rules and Inference Systems
5.2 Neuro-Fuzzy Inference Systems
5.2.1 Combinations of Fuzzy Systems and Neural Networks
5.2.2 Architecture of a Neuro-Fuzzy Inference System
5.2.3 Other Design Elements of Neuro-Fuzzy Inference Systems
5.2.4 Learning Mechanisms for Neuro-Fuzzy Inference Systems
5.2.5 Online Learning with Dynamic Streaming Data
5.3 Case Studies
5.3.1 POPFNN Family of NFS—Evolution Towards Sophisticated Brain-Like Learning
5.3.2 Combining Conventional Deep Learning and Fuzzy Learning
5.3.3 Overview of Fuzzy Deep Learning Studies
Appendix A Mathematical Models and Theories
A.1 Choquet Integral
A.1.1 Restricting the Scope of FM/ChI
A.1.2 ChI Understanding from NN
A.2 Deformation Invariance Property
A.3 Distance Metrics
A.4 Grad Weighted Class Activation Mapping
A.5 Guided Saliency
A.6 Jensen-Shannon Divergence
A.7 Kullback-Leibler Divergence
A.8 Projected Gradient Descent
A.9 Pythagorean Fuzzy Number
A.10 Targeted Adversarial Attack
A.11 Translation Invariance Property
A.12 Universal Approximation Theorem
Appendix B List of Digital Resources and Examples
B.1 Open-Source Datasets
B.1.1 Face Recognition Image Dataset
B.1.2 Animal Image Dataset
B.1.3 Satellite Imagery Dataset
B.1.4 Fashion Image Dataset
B.2 Applications in Computer Vision Tasks
B.2.1 Image Classification
B.2.2 Object Detection
B.2.3 Image Segmentation
B.2.4 Face and Person Recognition
B.2.5 Edge Detection
B.2.6 Image Restoration
B.2.7 Feature Matching
B.2.8 Scene Reconstruction
B.2.9 Video Motion Analysis
Appendix References


Interpretability in Deep Learning

Ayush Somani · Alexander Horsch · Dilip K. Prasad

Interpretability in Deep Learning

Ayush Somani Bio-AI Lab Department of Computer Science UiT The Arctic University of Norway Tromsø, Norway

Alexander Horsch Bio-AI Lab Department of Computer Science UiT The Arctic University of Norway Tromsø, Norway

Dilip K. Prasad Bio-AI Lab Department of Computer Science UiT The Arctic University of Norway Tromsø, Norway

ISBN 978-3-031-20638-2 ISBN 978-3-031-20639-9 (eBook) https://doi.org/10.1007/978-3-031-20639-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

You can’t connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future. You have to trust in something—your gut, destiny, life, karma, whatever. Because believing that the dots will connect down the road will give you the confidence to follow your heart even when it leads you off the well worn path; and that will make all the difference. —Steve Jobs (Stanford commencement speech, June 2005)

Preface

This book is motivated by the large gap between the black-box nature of deep learning architectures and the human interpretability of the knowledge models they encode. It is increasingly important that artificial intelligence models are both accurate and understandable, so that artificial and human intelligence can co-exist and collaborate. In certain life-threatening applications, interpretability is vital for root cause analysis and human decision-making. This book focuses on a comprehensive curation, exposition, and illustrative discussion of recent research tools for the interpretability of deep learning models, with a focus on neural network architectures. A significant part of the work complements existing textbooks on deep learning and neural networks, and builds upon works of the past decade in which the focus has been visualization and interpretability of the knowledge encoded in the networks. These works are sourced from the leading conferences and journals in computer vision, pattern recognition, and deep learning. Furthermore, we include several case studies from application-oriented articles in different fields, including computer vision, optics, and natural language processing.

In current graduate courses related to deep learning, machine learning, and neural networks, there is an absence of teaching/learning material which deals with the topic of interpretability/explainability. This is mainly attributed to the fact that the previous focus of the machine learning community was precision, whereas the question of interpretability is an emergent topic. However, it is gaining traction as an increasingly relevant subject, with books [81], [428], lecture notes [532], and new courses as well as perspectives [520] being published. Nonetheless, the focus on general machine learning in these works implies that the question of interpretability in deep learning, which is now ubiquitously used across a large variety of machine learning applications, remains insufficiently addressed. This textbook will therefore be one of the pioneering textbooks dedicated to the topic. It may lead to the creation of specialized graduate courses on the topic, since a need for such courses is perceived but the unavailability of organized material on the topic has been a prominent obstacle.


This book is also intended as a monograph or treatise on this topic, with coverage of the most recent developments. We expect that scientists with research, development, and application responsibilities will benefit from a systematic exposition of the topic, provided here for the first time. The digital supporting content of the book, together with several codes and data from various application domains, will help significantly in this regard and serve as an enabler.

Field Emergence, Relevance, and Necessity

In Chap. 1, we introduce the background and motivation of the book, helping readers set their expectations and understand the scope of the material. We also provide a brief history by summarizing the evolution of deep learning. In doing so, we establish how this evolution has led to increasing abstraction of knowledge, culminating in the well-known black-box paradigm that encodes but does not explain the knowledge. We naturally lead this discussion to the question of interpretability, establishing the necessity as well as the challenges. We also clarify that the focus of the book is to address interpretability in existing deep learning architectures, and delegate the topic of novel designs of inherently interpretable deep learning architectures to a small section in the last chapter (and potentially a volume 2 of this book in the future).

In Chap. 2, we introduce a variety of contemporary topics in deep learning, including the conventional neural network architectures, the learning mechanisms, and the challenges in deep learning. The aim of this chapter is to present background concepts and set the stage for the technical exposition in the following chapters. In particular, we cover convolutional, autoencoder, adversarial, graph, and neuro-fuzzy networks, since the mechanisms of interpretability will be explored for these paradigms in detail in the next chapters. Similarly, specific learning mechanisms are explained so that the loss or opportunities of interpretability can be identified in the later chapters. For reasons of comprehensiveness, we also include a section on other varieties of deep learning approaches, even though their interpretability is not expounded in the other chapters.

In Chap. 3, we start a full-fledged treatment of interpretability. Specifically, we discuss the concepts of interpretability in the context of general traits of deep learning methods. We begin with a discussion of abstract encoding of knowledge at the scale of neurons and features, which is followed by interpretability and visualization of the abstract encoding. Conventional techniques, such as activation maps, saliency, and attention models, are discussed from the perspectives of understanding the concepts, advantages, and disadvantages. This is followed by an analysis of how knowledge propagates during the optimization or learning process, as an insight into the challenges and opportunities of approaching interpretability of the knowledge learned by a deep learning model. Neural networks extract features using successive nonlinear activations, which makes the representation of knowledge difficult to interpret and sensitive to noise and incomplete data.


Knowledge versus performance is discussed using a case study. Lastly, the interpretation of deep versus shallow encoding with competing performance is discussed. Thus, a range of topics in interpretability generally applicable to any deep learning architecture is covered in this chapter.

Chapter 4 is dedicated to approaches of interpretability for specific individual architectures. The architectures selected for this chapter are convolutional neural networks, autoencoder networks, adversarial networks, and graph learning techniques. We include relatively new topics specific to these architectures, for example, the novel concept of "convolution trace" for convolutional neural networks, interpretability of abstract features in the latent space of autoencoder networks, interpretability of the discriminative model in adversarial networks, and graph embedding for interpretability of graph neural networks. We give at least one case study per architecture, including cases from a variety of application fields. We also briefly attend to attention networks, which inherently include some aspect of interpretability in their design.

Chapter 5 is dedicated to fuzzy deep learning. This method family differs slightly from neural network-centric deep learning in the sense that fuzzy logic and rule-based inference lie at the center of the network design. The need for explanation has led to a renewed interest in rule-based systems. It is also a topic which is studied independently, and seldom in the specific context of deep learning and interpretability. This book fills that gap by expounding on the topics of fuzzy deep learning and the related question of interpretability. We cover the basic topics of fuzzy logic and fuzzy learning, eventually leading to fuzzy neural network architectures. Recent topics such as convolutional fuzzy neural networks, filter relations, and tagging are covered. The open problem of end-to-end interpretability, as opposed to simple fuzzy rule layers, is also discussed.

Some features of our book and their corresponding benefits are presented below:
• The first organized content on 'interpretability in deep learning', serving as both a textbook and a monograph. The book fills an important gap which becomes more pressing as deep learning models become more complex and abstract.
• Good coverage of the fundamental concepts pertaining to interpretability, with illustrative case studies that improve the longevity of the book's relevance.
• Coverage of the state of the art on the topic, with the newest concepts and the concepts likely to evolve into the most popular approaches. The relevance of the topics addresses the urgency of a systematically designed learning resource covering a wide range of emerging concepts.
• Exposition of generally applicable concepts of interpretability as well as treatment of specific popular deep learning architectures allows for both breadth and depth. Covering multiple deep learning architectures allows the readers to build connections and larger perspectives about how 'interpretability' can be derived across the wide variety of architectures.
• Inclusion of fuzzy deep learning architectures.


• Codes, datasets, and interactive learning exercises support the assimilation of the concepts in both theory and practice. A wide variety of case studies covered in the digital material provides a broader perspective across different application domains.

Tromsø, Norway
July 2022

Ayush Somani Alexander Horsch Dilip K. Prasad

Acknowledgements

We thank the European Union guidelines on ethics in Artificial Intelligence for igniting the need for a book on interpretability in deep learning. Writing this book was (and continues to be) a lot of fun. It is as difficult as it sounds to take an idea and convert it into a book. Internally, the experience is both demanding and gratifying. None of this would have been possible without the support of everyone we've had the opportunity to lead, be led by, or observe from afar. The amount of effort required is substantial, but we are grateful for the help we have received. People with a passion for growth and leadership make the world a better place. People who are willing to invest their time in mentoring the leaders of tomorrow make it even better. Thanks to everyone who works to improve themselves and the lives of those around them. Without the consistent feedback and suggestions of these people, we would never have finished this book. As a result, we would like to show our gratitude by formally recognizing their efforts.

The book evolved from a collection of notes compiled from survey papers, articles, blogs, and regular brainstorming meetings on neural networks and deep learning. Many thanks to all of the participants who provided comments on the book's content. Special thanks to our loved ones, peers, and the UiT Tromsø Bio-AI Lab (https://www.bioailab.org/) team for creating an intellectual environment in which we could commit a significant amount of time to produce this book and receive comments and assistance from colleagues. We thank Dr. Himanshu Buckchash and Suyog Jadhav for reading and providing input on multiple chapters, as well as Anish Somani for his intriguing questions, interesting views, and ethical counsel, and Gunjan Agarwal for sticking with us and contributing to this project so diligently. Among those who provided feedback on specific chapters were Rohit Agarwal, Nirwan Banerjee, and Mayank Roy. We also thank the individuals who allowed us to utilize illustrations, figures, or data from their publications. Their contributions are acknowledged in the figure captions interspersed throughout the text. Furthermore, we would like to thank Sumit Shekhar and Aditya for assisting Ayush with digitally shaping a few illustrations.

Finally, we would also like to thank our families for their unwavering support.


Poonam Somani, Ayush's mother, patiently supported her son during the book's writing process; Gyri Magee offered her encouragement when writing was difficult; and everyone else gave their unconditional support. Lastly, Springer Nature's willingness to publish this work is greatly appreciated. Springer's Balaganesh Sukumar and Anthony Doyle were always helpful and sympathetic. The team reviewed this book with great patience and helped us communicate exactly what we intended to say, meticulously improving the layout of this book. Their experience and advice were invaluable and enabled us to complete the book successfully.

Tromsø, Norway

Ayush Somani Alexander Horsch Dilip K. Prasad


Acronyms

AAR  Analogical Reasoning Schema
AdaIN  Adaptive Instance Normalization
AE  AutoEncoder
AI  Artificial Intelligence
AM  Activation Maximization
ANFIS  Adaptive Neural Fuzzy Inference System
ANN  Artificial Neural Network
AV  Autonomous Vehicles
BCE  Binary Cross-Entropy
BCM  Bienenstock-Cooper-Munro
BM  Boltzmann Machine
BN  Batch Normalization
BNN  Biological Neural Network
BoW  Bags of Visual Words
BP  Back-propagation
CAE  Convolutional AutoEncoder
CAM  Class Activation Map
CBE  Case-Based Explanation
CBoW  Continuous Bag of Words
CBR  Case Based Reasoning
CCA  Canonical Correlation Analysis
cGAN  Conditional Generative Adversarial Network
ChI  Choquet Integral
CNF  Continuous Normalizing Flows
CNN  Convolutional Neural Network
CoG  Center of Gravity
conv  Convolutional
CRF  Conditional Random Fields
CRI  Compositional Rule of Inferences
CRM  Class Response Map
CSI  Collaborative Semantic Inference
CSL  Context Sensitive Language
CV  Computer Vision
CW  Concept Whitening
DAG  Directed Acyclic Graph
DBN  Deep Belief Network
DCNN  Deep Convolutional Neural Network
DeepRED  Deep neural network Rule Extraction via Decision tree induction
DGNN  Deep Graph Neural Network
DL  Deep Learning
DMOS  Differential Mean Opinion Score
DNN  Deep Neural Network
ELMo  Embeddings from Language Models
EOT  Expectation Over Transformation
FC  Fully Connected
FC-NLP  Fully Connected Natural Language Processing
FC-NN  Fully Connected Neural Network
FGSM  Fast Gradient Sign Method
FID  Fréchet Inception Distance
FIS  Fuzzy Inference System
FISTA  Fast Iterative Shrinkage-Thresholding Algorithm
FL  Fuzzy Logic
FN  False Negative
FNN  Fuzzy Neural Network
FP  False Positive
FSA  Finite State Automation
FSD  Fréchet Segmentation Distance
FV  Fisher Vector
GAM  Graph Agreement Models
GAN  Generative Adversarial Network
GAP  Global Average Pooling
GARIC  Generalized Approximate Reasoning-based Intelligent Control
GAT  Graph Attention Network
GCN  Graph Convolutional Network
GDPR  General Data Protection Regulation
GloVe  Global Vectors
GMM  Gaussian Mixture Model
GNN  Graph Neural Network
GPU  Graphics Processing Unit
Grad-CAM  Gradient Class Activation Map
GRU  Gated Recurrent Units
GuidedBP  Guided Backpropagation
HEFK  Hybrid implementation of Extended Kalman Filter
HMI  Human Machine Interaction
HMM  Hidden Markov Model
HOG  Histograms of Oriented Gradient
IoT  Internet of Things
IoU  Intersection-over-Union
IPM  Integral Probability Metric
IS  Inception Score
JS  Jensen-Shannon
KL  Kullback-Leibler
LDA  Latent Dirichlet Allocation
LIME  Local Interpretable Model-agnostic Explanations
LPIPS  Learned Perceptual Image Patch Similarity
LR  Learning Rate
LRP  Layer-wise Relevance Propagation
LSA  Latent Semantic Analysis
LSTM  Long Short-Term Memory
MAE  Mean Absolute Error
MAPLE  Model Agnostic suPervised Local Explanations
MF  Membership Function
MLP  Multi-Layer Perceptron
MOS  Mean Opinion Score
MRE  Mean Relative Error
MRF  Markov Random Field
MSE  Mean Squared Error
MTT  Multiple Trace Theory
NAG  Nesterov Accelerated Gradient
NFIS  Neural Fuzzy Inference System
NFS  Neuro-Fuzzy System
NLP  Natural Language Processing
NN  Neural Network
OCR  Optical Character Recognition
PCA  Principal Component Analysis
PGD  Projected Gradient Descent
PM  Predictability Minimization
PN  Pertinent Negative
POP  Pseudo Outer Product
PP  Pertinent Positive
PPV  Positive Predictive Value
PRM  Peak Response Map
PSNR  Peak Signal-to-Noise Ratio
RBM  Restricted Boltzmann Machine
ReLU  Rectified Linear Unit
RETAIN  REverse Time AttentIoN
RF  Receptive Field
RL  Reinforcement Learning
RMSE  Root Mean Square Error
RNN  Recurrent Neural Network
R-prop  Resilient Back-propagation
s.t.  such that
SA  Sensitivity Analysis
SeFa  Semantics Factorization
SENN  Self-Explaining Neural Network
SGD  Stochastic Gradient Descent
SIFT  Scale-invariant Feature Transform
SISTA  Sequential Iterative Soft-Thresholding Algorithm
SL  Supervised Learning
SNR  Signal to Noise Ratio
SOTA  State-Of-The-Art
SpRAy  Spectral Relevance Analysis
SSE  Summed Squared Error
SSIM  Structure Similarity Index Measurement
SURF  Speeded Up Robust Features
SVCCA  Singular Vector Canonical Correlation Analysis
SVD  Singular Value Decomposition
SWD  Sliced Wasserstein Discrepancy
TF  TensorFlow
TN  True Negative
TP  True Positive
TSK  Takagi-Sugeno-Kang Fuzzy Model
t-SNE  t-distributed Stochastic Neighbor Embedding
UL  Unsupervised Learning
VAE  Variational Autoencoders
VMA  Video Motion Analysis
XAI  Explainable Artificial Intelligence

Chapter 1

Introduction to Interpretability

Artificial Intelligence (AI) and modern computing captivate a large and growing number of people. It's fascinating to see how they progressed from merely mimicking human-like behavior to delivering beyond human-level performance that fits in one's pocket. The introduction of deep learning (DL) models, with heavily layered architectures that abstract low-level feature extraction and possess rigorous decision-making abilities, was arguably the pivotal moment in shifting the scales. The large gap between the black-box nature of DL architectures and the interpretability of model-encoded knowledge by humans motivates this book. It is becoming increasingly important that AI models are not only accurate, but also comprehensible, so that artificial and human intelligence can co-exist and collaborate to full mutual benefit. Consider, for instance, an intelligent self-driving car equipped with complex DL algorithms that fails to brake or slow down as it approaches a sharp roundabout. This unexpected response may frustrate and perplex researchers, who may wonder why. Or, even worse, if the car is on a highway or a busy street, poor decisions can have catastrophic consequences. Concerns about DL models' black-box nature limit their potential applications in our society. Interpretability is critical for root cause analysis and human decision-making in life-threatening applications.


The models' reliance on correct features must be ensured, particularly in health care and other critical sectors that necessitate safety precautions and testing. The interpretability of DL techniques can help human experts use them more deliberately and intelligently. This book considers the following aspects of interpretability in DL:
1. Introducing the origins, impact, and growth of DL applications in the modern world,
2. Presenting insights into how knowledge is encoded in the architecture as well as in the learning process,
3. Exploring existing approaches to decision explainability,
4. Using popular tools for knowledge encoding verification, interpretation, and visualization in popular DL architectures, and
5. Presenting fuzzy DL paradigms that offer inherent advantages of interpretability.

Work of Imagination

The book begins with an attempt to elicit inquisitiveness about the call for interpretability. Let us go through some admittedly exaggerated short stories meant to set the stage for the book's topic. If you're in a hurry, you can skip the narratives. If your imaginative self is wondering why this effort is being made to entertain the readers, read on.

A Suspended Thread

Let's time travel to the 12th of December, 2026. Raj has been feeling under the weather for quite some time. Upon repeated persuasion from his wife, he grudgingly made his way to the doctor. It is blisteringly cold outside. Fortunately, the clinic is quite close to his residence. On arrival, Raj is questioned by the on-call physician, Andreas, about the nature of his illness. "Ahh, a minor cough, doctor; that's all. I don't think it's too serious of a situation. Even so, I think it would benefit from an expert's perspective, because it's starting to annoy me a bit", said Raj. "Yeah, influenza has spread widely this winter. If you have any other symptoms, please let me know," Andreas inquired. "True, but not in any significant way. I'd say I'm experiencing some chest tightness and shortness of breath", Raj replied, coughing away. "I see", Andreas mumbled before scribbling away on a piece of paper.

After submitting the medical report, Raj found himself again in his doctor's office a few days later. His wife Susan was there with him this time. Raj's X-rays had been sent to Andreas in advance, so he was aware of the diagnosis. As much as Andreas detested it, he had to face the couple.


Fig. 1.1 Visual depiction of a medical diagnostic center where the doctor's verdict is in the hands of predictive learning machines

Breaking bad news to a patient wasn't something he would ever get used to. "I have terrible news to share with you, Raj; you have been diagnosed with lung cancer". A pause. Susan's eyes welled up with tears and Raj could not find the words to comfort her. Not once had he considered the possibility that this might be the case. He stutter-asked, "A-Are you sure, doctor?", hoping this was a joke in poor taste. "I'm afraid that's correct. It was determined by the CH-256 model that you have stage 2 lung cancer" (Fig. 1.1). "The CH what? What model?" Raj, who wasn't involved in the medical field by a long shot, was unaware of the tremendous technological advances made in the medical community. "Hmm, yes, yes, it's a DL model for analyzing respiratory diseases. Significant credibility amongst medical professionals. They call it "state-of-the-art," which sounds impressive. It has been operational for nearly three years. Never have I come across an instance where it was proven incorrect."

Upon hearing this, Susan's silent tears turned into a low sob. Raj went to other doctors for a second opinion. That Andreas had arrived at the correct diagnosis of stage 2 lung cancer was echoed by everyone he spoke to. The CH-256 architecture was universally adopted, after all. It had a reputation for infallibility. It was more widely respected than even the senior doctors of the country. Raj was subjected to rounds of chemotherapy. In a word, it was excruciating. He knew that chemo was painful for others, but he would not wish it on his worst enemy. His health insurance did not cover all of his medical expenses. In addition, his illness had left him unemployed. Their savings had dried up. Susan put in extra hours at work to cover Raj's medical bills. This was a painful experience. Also, Raj wasn't showing any signs of improvement. When he reported this to his doctor, he was told that it would take some time for him to feel better.


A few months passed. Raj's condition continued to deteriorate. Due to the effects of the chemotherapy, he was unable to do much of anything physically demanding. Raj, a healthy 28-year-old adult, had aged by 10 years. He had undergone a radical transformation, making him practically unrecognizable. Not even a shred of the rangy young man remained in him. He bore an even closer resemblance to Gollum than Gollum himself! The medical staff treating him was equally perplexed. They re-diagnosed Raj to make sure they hadn't missed anything. Raj and Susan insisted that they do things the old-fashioned way to arrive at a diagnosis. The couple was uneasy with the idea that an algorithm was determining his prognosis. It was inevitable that Susan and Raj's suspicions would be proven correct. Raj's illness wasn't lung cancer as the model had predicted, but asthma! The medical establishment was understandably shaken by this event. Indeed, the media had a field day as well. Things got pretty heated on social media. Was this the first and only occurrence of its kind? Had there been other patients who, like Raj, received the wrong therapy due to a false diagnosis? Should the CH-256 model be revised? Can we ever again put complete faith in an algorithm? Some blamed the practice of using DL in the medical field, while others argued in its defense: the algorithm made diagnostics quick and easy, and the possibility of a single incorrect diagnosis was outweighed by its huge benefits. If DL models were interpretable, some people argued, this wouldn't have happened. If the models could explain their reasoning, the doctors could have double-checked their work and possibly avoided this tragic end.

Artistic Stroke

I wiped the sweat from my brow and realized that I had been frowning the entire time I was painting. I got off my wooden stool and navigated my way around the wads of paper and open oil paints to get a better look at the painting. "Ahhh, my God!" I frowned disapprovingly, sensing a flaw but being unable to identify it. Perhaps the color scheme doesn't go with the backdrop. Is it just me, or do I not see a seamless progression of skin tones? No matter what the issue is, I have to give a presentation at the exhibition this coming weekend. Out of frustration, I kicked the clutter on the floor. To paraphrase: "WHAT IS WRONG WITH THIS PIECE!?" Taking out my frustration by kicking the floor didn't help. Instead, I kept working on my paintings, now with a swollen leg. After an hour or two, the studio was even more disorganized, and no progress had been made on the piece. I may have even made things worse! I felt like crying out of despair because I was so overwhelmed (Fig. 1.2).
...


Fig. 1.2 A visual story—paints, oils, measuring scales, and crumpled paper littered the floor of the artist’s studio. Mateo sat on a wooden chair in the center of the studio, disgruntled and dissatisfied with his art piece

It is not necessary for something to be explainable or immune to criticism in order for it to be masterfully executed. This is especially true of the arts, where the free-flowing expression of ideas and creativity from many different minds needs no validation. At the end of the day, we have to convince ourselves that not all achievements can be explained and evaluated using the same standard of excellence. Instead of focusing on the potential downsides, we should celebrate the degree of causal inference that a system or strategy provides. In any case, it is important to pay attention to situations in which machines are competing with human decision-making abilities. A system could be a breakthrough with an initially steep climb in accuracy. However, it must identify its limitations and edge cases in order to hone its performance to perfection. To boost system performance, get past saturation, raise work quality, cut costs relative to the time put in, and avoid randomly stumbling into blind spots while seeking answers, it is necessary to explain what the system does, how it does it, and where and why it goes wrong. We will examine the current state of the lightning-fast smart industry and discuss potential solutions to the mostly opaque methods currently in use.


Who Should Read This Book?

This book can benefit a wide range of readers, but it was written with a specific set of readers in mind. The broader groups include university students, both undergraduate and graduate, learning about AI and the shift towards a smart world, as well as novices beginning a career in DL and AI. The more specific groups are data scientists, software engineers, statisticians, and tech-savvy consumers who may not have a computer science or statistics background but want to quickly acquire knowledge and start using DL in their innovation or platform. Furthermore, this book is recommended for AI practitioners who want an overview of techniques for making their models clear and understandable.

Readers are expected to be acquainted with basic DL terminology. It is also desirable that readers have an understanding of entry-level university mathematics in order to follow the theory and formulas in the book. Nonetheless, for the sake of completeness, the book introduces fundamental concepts for the reader's convenience. The intuitive explanations of the techniques at the beginning of each chapter do not use mathematics. They are expected to be readily accessible to readers of all backgrounds.

The book is divided into five chapters. This chapter describes the evolution of DL and the call for interpretability. Chapter 2 introduces readers to the necessary mathematical tools and DL fundamentals. In Chap. 3, the topics of knowledge encoding and knowledge propagation in the learning methodology are discussed, and the categorization of interpretable deep learning (IDL) approaches is introduced from the perspective of algorithmic design and analysis. The specific practices dedicated to interpretability are described in Chap. 4. Chapter 5 examines the incorporation of fuzzy logic into DL and its impact on causal inference.

1.1 Deep Learning Glossary

In the world of data science, buzzwords like AI, DL, quantum computing, and blockchain are ubiquitous. This section explains some of the important terms used in the book to avoid ambiguity. The list of terms is by no means meant to be exhaustive. Moreover, it can be challenging at times to convert implicit neural network (NN) knowledge into simple mathematical equations or another understandable form, so the definitions below are left open to flexible adaptation by readers from different fields to fit their needs. The book uses the words "interpretability" and "explainability" interchangeably to emphasize a common idea. "Explanation", in contrast, is used to indicate why each individual prediction holds. Following Miller (2019) and Molnar (2020), it is good to learn the difference between these terms so that they can be used appropriately.


Accuracy is the quality of being right or accurate, or a way of measuring it. It is often used as a synonym for 'precision', but the two terms are not the same when it comes to evaluation metrics. Accuracy is the proportion of a model's predictions that are correct. The correctness can be discrete, as in classification where there is a single correct answer, or fuzzy, as in object detection where there is a gradual transition between best, good, bad, and worst. 'Top-5 accuracy' is a term often used to describe how accurate the top 5 most likely predictions of a model are. Other standard terms are Top-1 accuracy and Top-3 accuracy.

Annotation is the process of labeling, tagging, or transcribing the data to make it easier for the model to find the feature of interest. It can also mean the label, tag, or transcription itself. In computer vision (CV), an annotation can be a bounding box for classifying objects, a polygon for separating them into groups, a landmark annotation, or an image transcription for mapping. The most common types of data annotations in natural language processing (NLP) are text annotations for sentiment analysis, audio annotations for speech recognition, and sequencing. In all of these situations, the accuracy of the annotation is key to creating a convincing ground truth for the learning process and helping the model learn correctly.

An Algorithm is a set of rules or steps to reach a certain goal. It is a set of instructions that a machine follows to turn data and attributes from the outside world into information that makes sense.

The Architecture of a DL model consists of neuron layers and connections for data flow between the neurons of adjacent layers. To be comprehensible, a specific DL architecture typically requires additional information on how it is intended to provide a solution to a class of tasks.

Artificial Intelligence (AI) is an extensive discipline of computer science which made its advent in 1956 at a summer conference at Dartmouth College, sponsored by DARPA. AI is responsible for simulating intelligent machines to accomplish tasks that typically require human intelligence. AI applications focus on three cognitive skills: learning, reasoning, and self-correction. Expert systems, robotics, NLP, speech recognition, and CV are all examples of such applications.

Black-box Model: A model that hides the details of how it works on the inside is called a black-box model. This makes it hard to explain "why" a certain prediction was made. DL models are often called "black-box models" because it is hard to see what happens inside them and how they make decisions. This is because their architectures are usually complex, multilayered, and non-linear. The book talks about trying to open the black box and make the model easier to understand.

Classification is a way of learning to predict a categorical output under the guidance of an expert. Here, the number of classes and their meaning is determined/known. This could be a two-class classification or a classification with more than two possible outcomes. The goal of most typical CV tasks is to find out whether any example of a certain class is in the image or video, regardless of location.

Computer Vision, often abbreviated as CV, is a subfield of computer science, AI, physics, and mathematics, as illustrated in Fig. 1.3. It focuses on making digital versions of the high-level complexity of the human visual system so that machines can see, understand, and respond to the visible world.


Fig. 1.3 The development of CV, a popular field that will be covered in detail in this book, is influenced by a variety of disciplines

It is a way to give the pixels in an image a clear meaning and to use this information to make a decision.

Convolutional Neural Network (CNN, ConvNet) is the most common type of NN used in CV. It has many layers of weights and biases that need to be trained. The many layers of convolutional blocks make it possible to learn more complicated concepts. A convolution is an operation where a sliding window is used to focus on a small part of a large input matrix. It helps narrow down the network's parameters to a good learning model, with the first layers learning low-level concepts like lines and color blocks, and the deeper layers learning complex high-level features like faces or objects.

A Dataset is a group of related but separate pieces of data that can be used to train a model. The data can be accessed individually, in groups, or as a whole.

Deep Learning (DL) is a subset of ML that can be thought of, in simple terms, as the automation of representation learning. It uses many layers to extract information from raw data in a step-by-step way and teaches the computer to act close to human behavior (Fig. 1.4).

Fig. 1.4 Timeline of the advancement of AI into ML and further into specialized DL domains of model learning over the course of almost a century using human cognitive abilities
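To make the sliding-window idea in the CNN entry above concrete, the following minimal NumPy sketch computes a single feature map from a toy image and a 3 × 3 kernel. The array values and sizes are arbitrary illustrations, not material from the book's digital resources, and, like most DL frameworks, the sketch actually implements cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the kernel with the current image patch, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy 6x6 "image" and a 3x3 vertical-edge kernel (hypothetical example values).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d_valid(image, kernel).shape)  # (4, 4) feature map
```

In a trained CNN the same operation is applied with many kernels in parallel, and the kernel weights themselves are learned rather than hand-specified.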


Domain-specific: Problems, techniques, or solutions that do not apply to all situations but only to those of a special field are called domain-specific. For example, a model that was trained to separate the instruments in a colonoscopic dataset is domain-specific because it can only be used for medical imaging diagnostics and may not work for imaging for autonomous ship navigation.

Features are, in the context of AI, the set of properties that a model learns from the data it receives during learning. A sequence of learning loops, for example, might be able to figure out the texture of the eye as a feature for the facial recognition challenge.

Fuzzy Logic is an approach to learning based on vague and ambiguous forms of information. The approach uses a 'degree of truth' instead of the usual "true or false" Boolean reasoning of a digital computer.

Ground Truth is the best available solution to the problem that a model is supposed to solve. Comparison with the ground truth lets you see how well the model works by looking at the relevant datasets for the use case. Therefore, the quality of the ground truth is crucial both for the training of models and for the evaluation of model performance.

The Loss Function, also called the cost function, is a differentiable mathematical function that tells how far off the current model's predictions are from the real world/the ground truth. At each step in the training stage, the model is steered by a mathematical optimization process that is run on a training dataset. The goal is to reduce the loss. During validation and testing, a low value of the loss function means that the model works well.

Machine Learning (ML) is a branch of AI that improves a machine's performance through experience and examples of data. Instead of traditional programming, in which we code the "program rules" that transform input into output, ML learns from the training dataset to form its own rules through 'smart trial and error' with minimal human intervention.

Metrics are used to validate and test a model's functionality. These are similar to loss functions but do not have to be differentiable. The loss function itself is not used as a metric since the model was trained to minimize it, which introduces a bias that makes the function unsuitable for validation. Instead, the quality of classification is measured using accuracy, precision, recall, area under the curve (AUC), and the F1-score. For regression problems, the metrics mean squared error (MSE) and mean absolute error (MAE) are used. Metrics such as mean reciprocal rank (MRR), mean average precision (MAP), and discounted cumulative gain (DCG) are also used to assess models.

A Model is an implementation of a specific architectural design with a defined input size, a layout for its weight and bias parameters, and output deliverables. The model is a trainable implementation, which means that its weights and biases can be updated during ML. ResNet50, a well-known CNN architecture from the ResNet family of Artificial Neural Networks (ANN), is an example of this kind of model.


• Precision is one way to measure how well a model can make correct predictions. It is the number of true positives predicted by the model divided by the total number of positives predicted by the model.
• Recall is another way to measure the quality of predictions made by the model. It is the number of true positives predicted by the model divided by the sum of the true positives and false negatives (i.e. all real positives).
• Representation Learning is a class of DL methods that allows a system to determine from raw data what representations are needed for detecting features or for classifying them. When a machine learns the features and uses them for a given task, the need for manual feature engineering is eliminated.
• Segmentation is the process of dividing the visual input of a model into so-called segments by grouping pixels that share specific characteristics, for instance objects of interest versus the background. It also refers to the result of this process, i.e. the segments generated during segmentation.
• State-Of-The-Art (SOTA) is a term that is often used in AI. A model called SOTA performs better on a benchmark dataset than other models that have been tried before.
• Transfer Learning is a DL method that uses the features a trained model has learned previously for a specific task to solve a different but related task. It reuses the weights of the trained model and only trains them further to adapt to the new task, which speeds up the learning process.
• Visualization is the process of putting information and data into pictures. In general, it is a powerful way to help the mind understand quickly.
• Meta-definitions of Interpretability: the extent to which an individual can comprehend the cause of a model's outcome (Miller 2019); the degree to which a human can consistently predict a model's outcome (Kim et al. 2016).
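To make the Precision and Recall entries above concrete, the following is a minimal sketch (not taken from the cited works) that computes both metrics from raw labels and predictions; the function name and toy data are illustrative assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # TP / all predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # TP / all real positives
    return precision, recall

# Toy example: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```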

1.2 Evolution of Deep Learning

Today, DL is one of the driving forces behind the global AI revolution that is occurring across industries. People may question whether this is a recent discovery, but its history dates back to the 1940s. In fact, DL did not appear overnight; rather, it evolved gradually over the course of eight decades, as depicted in Fig. 1.5.


Fig. 1.5 A visual history of AI landmarks since 1943. Adapted image from Rashidi (2020) (CC-BY 4.0)


Even though no one acknowledged that NNs had a future, a number of ML practitioners worked tenaciously to bring about this development, and things have kept getting better. Recent advancements in CV, NLP, speech recognition, and audio recognition are attributable to DL. Fundamentally, DL is an evolved subfield of ML that employs algorithms with multiple processing layers (hence "deep") to learn representations from data containing multiple levels of abstraction. In fact, the greater the complexity of the task, the larger the required training dataset. It is based on the concept of ANNs, or computational systems that mimic the functioning of the human brain. Therefore, the brief history of DL should begin with the 1943 paper by W. Pitts and W. McCulloch that introduced the mathematical model of biological neurons (McCulloch and Pitts 1943). Almost a decade later, F. Rosenblatt created a new version of the McCulloch-Pitts neuron called the "Perceptron" (Rosenblatt 1957) that had the ability to learn binary classification. This event marked the beginning of AI. The theory of behavioral changes of the cerebral cortex in response to light projected onto one or both retinas emerged the following year. In 1961, Hubel and Wiesel studied the complexity of receptive field arrangements in the visual cortex of the cat in order to comprehend the response pattern of individual cells (Hubel and Wiesel 1962). Revolutionary was the effort of Werbos (1990) to design a simpler version of backpropagation using the chain rule for ordered derivatives. H. J. Kelley implemented the first continuous backpropagation model within the context of Control Theory in 1960. S. Linnainmaa published his work on computer-coded backpropagation ten years later. However, it took another decade for backpropagation to be implemented in NNs. The ability to directly translate a basic strategy into computer code for use in NNs, with extensive applications in pattern recognition and fault diagnostics, was unprecedented. The era was succeeded by what we refer to as the 'AI Dark Ages'. Fukushima's Neocognitron, Vapnik's Support Vector Machine, and Schmidhuber's Recurrent Neural Network (RNN)/LSTM were instrumental in laying the groundwork for the AI renaissance. 1980 marked the introduction of the Neocognitron, the first CNN architecture capable of recognizing visual patterns such as handwritten characters (Fukushima and Miyake 1982). J. Hopfield (1982) created the Hopfield Network, which was later popularized as an RNN. The Boltzmann Machine, a stochastic RNN, was created by Ackley et al. (1985); it consists of visible and hidden units, with no separate output layer. The following year, Geoffrey Hinton and his team successfully implemented backpropagation in NNs for the first time (Rumelhart et al. 1986). This work marks the beginning of complex DNN training. We anticipate that many readers of this book will be familiar with DL as an exciting new technology and will be surprised to find a reference to "history" in a book about an emerging field. Undoubtedly, a vast number of researchers, either directly or indirectly, are contributing to the imminent growth referred to by a variety of obscure names, and only recently termed "Deep Learning." As depicted in Fig. 1.5, this section only attempts to provide a concise history by highlighting several pivotal moments and events.


In recent years, numerous multilayered feed-forward networks comprised of artificial neurons that loosely mimic the operation of biological neurons have been used to learn to perform complex pattern recognition tasks. The majority of contemporary DL projects begin with LeCun's convolutional network or Hinton's backpropagation (Mhaskar and Poggio 2016). The parameters of these modern networks are real numbers, so system modeling relies entirely on real values. The unpredictability of real-world situations and the linguistic data of human supervisory operators exceed the capacity of a neural network's learning graph. Fuzzy sets and inference systems, on the other hand, have enabled the management of human-interpretable linguistic information and data uncertainty in real-world scenarios. This subsection provides a concise summary of works pertaining to neural networks (NNs), fuzzy networks, and the fusion of fuzzy logic with neural networks for improved decision precision and transparency.

1.2.1 Neural Learning

Unsurprisingly, the concept of neural learning began as a model of how neurons in the brain function, referred to as "connectionism." It utilizes interconnected circuits to simulate intelligent behavior.

Highlight
Two fundamental concepts serve as precursors to neural learning:
• In Threshold Logic, continuous inputs are transformed into discrete outputs.
• Hebbian Learning is a model of learning based on neural plasticity that was proposed by Donald Hebb (Shaw 1986) and is commonly recapitulated by the phrase "Cells that fire together, wire together."
Both concepts were proposed during the 1940s. In 1954, researchers at MIT successfully implemented the first Hebbian network after attempting for successive years to translate these networks onto computational systems.

Broadly speaking, the three waves of DL evolution are cybernetics in the 1940s–1960s, connectionism in the 1980s–1990s, and the current revival initiated in 2006 under the term DL. These are quantitatively represented in Fig. 1.6. Backpropagation, a technique developed by researchers in the 1960s and refined throughout the AI winter, assisted NNs in rising from their premature graves. It was a method based on intuition that assigned decreasing significance to each event as one moved further back in time. Paul Werbos was the first to recognize the potential of NNs and to solve the problem of training MLPs (Werbos 1990). Interestingly, until D.B. Parker published a report on his MIT work in 1985, the work remained unnoticed by the community.

Fig. 1.6 Chronological progress of AI and the advent of eXplainable AI (XAI) with major landmarks in history



The technique took the community by storm only after being rediscovered by Rumelhart, Hinton, and Williams and republished in a clear and detailed framework (Rumelhart et al. 1986). The same work also addressed the specific flaws identified by Minsky and Papert in their 1969 publication (Minsky and Papert 1969). Backpropagation and Gradient Descent (GD) formed the skeleton and engine of the NNs discussed later in the book. Thus, by the 1990s, NNs had returned, capturing the world's imagination and finally meeting, if not exceeding, the world's expectations. During the AI winter of 1987–1993, shortly after the successful implementation of backpropagation in 1986, Yann LeCun used backpropagation to train a CNN to recognize handwritten digits (LeCun et al. 1989). This was the defining moment in the foundation of modern CV using DL. In the same year, G. Cybenko published his work (Cybenko 1989) on the Universal Approximation Theorem using a single hidden layer, thereby enhancing the credibility of DL. Hochreiter et al. (1998), however, identified the issue of the vanishing gradient, which rendered the learning of DNNs slow and nearly impractical. The significant content of this book is inspired by LeCun's 1995 introduction of CNNs for pattern recognition in images, speech, and time-series tasks (LeCun and Bengio 1995). It is a potentially exciting plan to eliminate the traditional hand-designed feature extractor and instead rely on backpropagation to transform layers into meaningful feature extractors on "raw" normalized inputs, typically containing several hundred variables (Fig. 1.7). Support Vector Machines (Cortes and Vapnik 1995), IBM's DeepBlue (Hsu 1999), the LSTM by Hochreiter et al. (1997), and the Deep Belief Network (Hinton et al. 2006) made neural learning more efficient for large amounts of data in the following decade. The subsequent Renaissance period would later be viewed as a time of rapid advancement in AI and DL. Krizhevsky's use of non-saturating neurons, the efficient GPU implementation by Andrew Ng's group to train a large convolutional network (Krizhevsky et al. 2012), and ImageNet (Deng et al. 2009), a large labeled dataset repository created by Fei-Fei Li and her team in 2009, accelerated the learning paradigm (Fig. 1.7). This was supported by the work of Yoshua Bengio and his team, which utilized ReLU to circumvent the vanishing gradient problem (Glorot et al. 2011). The DL community discovered this additional tool, in addition to GPUs, to circumvent the issue of longer and impractical DNN training times.

Fig. 1.7 Two significant developments in the evolution of CNNs. Figure adapted from LeCun et al. (1989) and Krizhevsky et al. (2012) with permission


These were among many contributions by Yoshua Bengio, Geoffrey Hinton, and Yann LeCun in the area of DL and AI that helped the trio win the 2018 Turing Award. It was a defining moment for those who worked tirelessly on NNs during the 1970s, when the entire ML community had moved on. In less than a decade, Facebook's AI Research, DeepMind AlphaGo, Tesla's Autonomous Car, Microsoft's Speech Recognition and Nervana's Movidius became notable industrial developments in the DL domain. While the types of NNs used for ML have occasionally been applied to the study of brain function (Hinton and Shallice 1991), they are not typically intended to serve as biologically accurate models. The neural perspective on DL is inspired by two fundamental concepts:
• Exploit the brain to create intelligent machines. The brain provides an example that intelligent behavior is possible, and a conceptually simple way to create intelligence is to reverse-engineer the brain's computational principles and replicate its functionality.
• Use machines to learn about the brain. It would be extremely interesting to comprehend the brain and the underlying principles of human intelligence; therefore, ML models that shed light on these fundamental scientific questions are valuable regardless of their ability to serve engineering applications.
Neural learning has proven to be highly effective, particularly for tasks involving images and texts, such as image classification and language translation. In 2012, a DL-based approach won the ImageNet classification challenge. Since then, there has been a Cambrian explosion of neural architectures, characterized by a trend toward deeper networks with an increasing number of weight parameters. To make a prediction, the input data is passed through multiple multiplication layers with the learned weights and non-linear transformations. Depending on the architecture of the NN, a single prediction may require millions of mathematical operations. There is no possibility that humans can follow the precise mapping from input data to prediction. To comprehend a NN's prediction, we would have to consider millions of weights that interact in a complex manner. In various domains of text, speech, and vision, these NNs have demonstrated significant improvement over previous SOTA approaches, with backpropagation serving as the fundamental building block. To interpret the behavior and predictions of these networks, which will be covered in later chapters, we require specific interpretation methods.

Highlight
Russell and Norvig (1995) have investigated four practices that have historically defined the field of AI:
1. Thinking humanly
2. Reasoning logically
3. Functioning humanly
4. Acting rationally.


The contemporary concept of DL transcends the neuroscientific perspective of the current generation of ML models. It employs a more general principle of learning multiple levels of composition that can be implemented in ML frameworks that are not necessarily neurally inspired. Nevertheless, some of the earliest learning algorithms we recognize today were designed to be computational models of biological learning, or models of how learning occurs or could occur in the brain. Consequently, one of the names for DL is ANNs. A new era in CV may have begun when NN models inferred the spatial and temporal invariance properties of images. Convolutional architectures rely on non-linear activation functions that transform input data into a non-linear space in order to determine whether a neuron should fire (Nwankpa et al. 2018). They have been successfully applied to the solution of complex problems, particularly in image classification (Krizhevsky et al. 2012; Iandola et al. 2016), object detection (He et al. 2016; Redmon and Farhadi 2017), action recognition (Wang et al. 2017; Lea et al. 2017), and medical applications (Acharya et al. 2017, 2018). Although the majority of research on deep CNNs is focused on classification, some work on regression applications pertains to time series forecasting (Miao et al. 2016; Niu et al. 2016). Ian Goodfellow's introduction of Generative Adversarial Networks (Goodfellow 2016), or GANs for short, with the capacity to synthesize realistic data from low-dimensional latent-space noise, unlocked the door to neural learning in fashion, art, and science, among other fields. Autoencoders are another well-known network type for knowledge extraction and reconstruction that is frequently used to discover compressed representations of datasets by minimizing the reconstruction error (Patterson and Gibson 2017). Knowledge is extracted from the input data, represented in a latent space with fewer dimensions, and then reconstructed at the output. The extraction step is known as the encoder, while the reconstruction step is known as the decoder. Using fewer dimensions in the latent space forces the network to retain only the most significant data, resulting in a compact knowledge base. Autoencoders are frequently used for dimensionality reduction (Hinton and Salakhutdinov 2006; Wang et al. 2014, 2016) and denoising applications (Vincent et al. 2008; Wang et al. 2017). Recent extensions of the autoencoder model include the convolutional autoencoder (CAE), which extracts deep representations of knowledge by combining the advantages of convolutional architectures and autoencoders (Makhzani and Frey 2014). It extends the autoencoder's fundamental structure by substituting convolutional layers (conv-layers) for the fully connected layers: conv-layers replace the encoder, while deconvolutional layers (Chen et al. 2017) replace the decoder. This extension of the basic autoencoder improves the performance in detecting abnormal states based on commonly encountered training data. Autoencoders have also been used for regression (Li et al. 2018; Zhao et al. 2019), but the information encoded in the latent space and convolutional layers remains uninterpreted and opaque.
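To make the encoder-decoder idea described above concrete, here is a minimal sketch of a fully connected autoencoder in PyTorch; the layer sizes, variable names, and toy batch are illustrative assumptions rather than a design prescribed by the works cited here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """Compresses a 784-dimensional input into an 8-dimensional latent code and reconstructs it."""
    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        # Encoder: extracts knowledge into a low-dimensional latent space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)            # compact knowledge representation
        return self.decoder(z), z      # reconstruction and latent code

model = Autoencoder()
x = torch.rand(16, 784)                # a toy batch of flattened images
x_hat, z = model(x)
loss = F.mse_loss(x_hat, x)            # reconstruction error to be minimized during training
```

The small latent dimension is the design choice that forces the network to keep only the most significant information, exactly as described in the text.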


Fun-Facts
Let's discuss some prevalent DL evolution myths.
1. The advent of human-level AI superintelligence is imminent. Even if human-level AI superintelligence is only a few years or a decade away, there is still a long way to go to secure such a system.
2. DL is self-taught. DL is not a magic wand that automatically improves situations without human intervention. Typically, learning is limited to the operational reach of the training data. Ideally, a machine should learn from its errors without human-level fine-tuning. Humans, on the other hand, are adept at efficiently applying knowledge from one discipline to other, diverse domains.
3. AI will render us jobless. Even though predicting the long-term impact of AI on employment is difficult, technological advances have always altered the way we work. Consider, for instance, how the introduction of the Internet transformed society. Indeed, AI will replace humans in a multitude of situations, but many new jobs, industries, and sectors will be created simultaneously.
Lastly, unstructured data has historically been much more difficult for computers to interpret than structured data. In fact, the human species has evolved to be exceptionally competent in comprehending audio and visual cues. Compared to this, text was a more recent development, but people are incredibly adept at interpreting unstructured data. Consequently, one of the most exciting aspects of the rise of neural learning is that computers are now significantly better than they were a few years ago at analyzing unstructured data. This enables the development of many exciting new applications that utilize speech recognition, image recognition, and NLP on text and videos, far beyond what was possible a few years ago.

1.2.2 Fuzzy Learning

According to Freitas (2014), rule-based classification is the most user-friendly classification method. The textual nature of the rules makes them easily readable by users. Individual rules are also highly modular, allowing us to examine a handful of relevant rules concurrently in order to identify "local patterns" as the explanation. In the late 1990s, fuzzy logic (Zadeh 1988) was a common buzzword. It extends Boolean logic from a binary 0-or-1 judgment to a fuzzy approximation of inference in the range [0, 1]. It is further subdivided into fuzzy set theory and fuzzy logic theory. The latter, which emphasizes "IF-THEN" rules, has proven effective in addressing a wide range of complex system modeling and control problems. Despite this, a fuzzy rule-based system is constrained by the tedious and expensive acquisition of a large number of fuzzy rules.
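As a small illustration of reasoning with degrees of truth, the following sketch defines a triangular membership function and fires a single IF-THEN rule with the min operator; the linguistic terms, thresholds, and the room-comfort scenario are invented for the example and are not taken from the cited works.

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: rises from a to a peak at b, then falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Fuzzy sets for a hypothetical room-comfort example
warm = lambda t: triangular(t, 18.0, 24.0, 30.0)      # temperature in degrees Celsius
humid = lambda h: triangular(h, 40.0, 70.0, 100.0)    # relative humidity in percent

# Rule: IF temperature is warm AND humidity is humid THEN turn the fan on.
# The AND is the min operator; the result is a degree of truth in [0, 1].
def fan_activation(temp, humidity):
    return min(warm(temp), humid(humidity))

print(fan_activation(26.0, 65.0))  # approximately 0.67 -> the rule fires strongly
```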


Fig. 1.8 Interpretable decision set versus decision list. The rules use the same data. Rule-based decision sets (top) are more human-friendly. In decision lists (bottom), each rule depends on all rules above it being false, so the order of rules matters for decision lists but not for decision sets. Adapted from Lakkaraju et al. (2016) with permission

Lakkaraju et al. (2016) introduced interpretable decision sets, a rule-based framework that organizes rules as a flat set instead of a hierarchical structure like a decision list. The concept is that each rule, in the form of an 'if-then' clause, can stand on its own without relying on the validity of other rules. The user should be able to read each rule independently and understand how it contributes to a particular classification. The authors assert that this method is preferable to the decision list for achieving interpretability. Figure 1.8 compares (a) the rule types generated by a decision set and (b) the decision list for the same dataset. Each new rule in a decision list (bottom) depends on the previous rules being false in order to be considered; consequently, the rules must be read in sequential order. In contrast, each rule in the decision set (top) is independent of every other rule and can therefore be evaluated in any order. The rules encoded in the rule-base are either fuzzy IF-THEN Mamdani rules (Mamdani 1976) or fuzzy Takagi-Sugeno rules (Takagi and Sugeno 1993). The former is intuitive because the outcomes of the rules are interpretable, which provides a substantial amount of explainability to the system's outputs. However, the loss of information caused by the min-max operator and the time-consuming learning required to update the weights and prune rules are inconvenient. The latter is more computationally efficient, but its outputs lack the linguistic meanings that are necessary for further analysis, such as system identification.
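To make the difference between a decision list and a decision set concrete, here is a small sketch with invented rules and data (not the rules used by Lakkaraju et al.): in the list, a rule is only reached if every earlier rule failed, whereas in the set each rule can be read and evaluated independently.

```python
record = {"exercise": "low", "smoker": True, "bmi": 32}

# Decision list: an ordered if/elif chain; rule k is reached only if rules 1..k-1 are false.
def decision_list(r):
    if r["smoker"] and r["bmi"] > 30:
        return "high risk"
    elif r["exercise"] == "low":
        return "medium risk"
    else:
        return "low risk"

# Decision set: independent if-then rules that can be checked in any order.
rules = [
    (lambda r: r["smoker"] and r["bmi"] > 30, "high risk"),
    (lambda r: r["exercise"] == "low" and not r["smoker"], "medium risk"),
]

def decision_set(r, default="low risk"):
    labels = [label for cond, label in rules if cond(r)]
    return labels or [default]

print(decision_list(record))   # 'high risk'
print(decision_set(record))    # ['high risk']
```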


Finally, some specialized networks, including ANFIS (Jang 1993) and RBF networks (Fritzke 1994), correspond directly to fuzzy logic systems. An RBF network, for instance, is equivalent to the Takagi–Sugeno rule system (Ying 1998), which includes rules such as: if x ∈ A and y ∈ B, then z = f(x, y). Each neuron or filter in a network can be seen as a fuzzy logic gate in a fuzzy logic interpretation (Fan and Wang 2020). To some who hold this view, a neural network is only a sophisticated form of fuzzy logic. In a nutshell, Mamdani-type fuzzy rules are well-suited for human interpretation, whereas Sugeno-type fuzzy rules are more suited for mathematical analysis. Consequently, an essential question arises: is there a simple-to-interpret hybrid fuzzy inference model that can satisfy the demand for mathematical analysis? In other words, there is a demand for a fuzzy model that is both intuitive and computationally accurate.

1.2.3 Convergence of Fuzzy Logic and Neural Learning

We have witnessed a resurgence of DL models as a result of neural learning's success in a variety of tasks. However, the fact that it operates in a 'black-box' mode makes it difficult to explain why it makes certain decisions. XAI is an effort to remove the lid from the black box (Adadi and Berrada 2018). Rule extraction is a natural approach to explaining outcomes. If the rules contain linguistic variables, as in fuzzy rules, then the explanation resembles human thought more closely. Thus, the combination of NNs and fuzzy systems, which results in neural fuzzy inference systems (NFISs), enables the development of AI systems with innate explanation capabilities. Consequently, the fuzzy system is also endowed with the ability to learn. We look at the problem of integrating fuzzy systems with CNN architectures in a data-driven manner so that the overall design is answerable and computationally efficient for classification problems, particularly for object recognition. Figure 1.9 depicts the concept that combining rule-based fuzzy learning with different ML and DL models is a constructive way to improve their interpretability and predictive accuracy. This is a vision for advancing fuzzy DL and its potential ramifications for trustworthy AI. As a data-driven technique for knowledge extraction, training a NN results in a distribution of neurons that each represent a piece of information. In an inadequate data setting, it fails to deliver a satisfactory result and is difficult to follow. A fuzzy logic system, on the other hand, employs expert knowledge and represents a system through IF-THEN rules. Recently, ANNs and fuzzy systems have been implemented to solve problems that are difficult to express using mathematical equations, such as weather forecasting (Purnomo et al. 2017; Ashrafi et al. 2019), medical diagnosis (Nguyen et al. 2015; Shaikhina and Khovanova 2017), and stock market forecasting (Moghaddam et al. 2016; Chang and Liu 2008; Gao and Chai 2018). The primary difficulty of an ANN is its abstract nature, whereas a fuzzy system lacks the capacity to learn (Tung et al. 2011).


Fig. 1.9 A trade-off between accuracy and interpretability for various learning models. The interpretation of black-box models has indeed been propelled by such proposals for fuzzy rule-based learning in DL

Neuro-fuzzy systems (NFSs) combine the advantages of NNs and fuzzy systems by endowing fuzzy systems with adaptive learning intelligence and exposing the black-box nature of NNs through fuzzy linguistic rules (Kar et al. 2014). There is evidence of early efforts to combine fuzzy networks and convolutional architectures. For example, the methodology in Abonyi et al. (2000) proposed a hybrid fuzzy convolution dynamic model for the current state of the application in question. However, prior knowledge of the dynamic system behavior concerning the system's impulse response is necessary. Furthermore, fuzzy CNNs require prior knowledge of how to define fuzzy membership functions in order to recognize human actions (Ijjina and Mohan 2015). In Deng et al. (2016), features are extracted from both fuzzy and neural representations in parallel prior to fusing them by taking a weighted sum and passing it to a fully connected layer. As a result, the outputs are no longer fuzzy degrees with linguistic significance. Consequently, the current body of work either relies on prior knowledge or on a simple fusion with weak properties of both architectures. One suggestion is to build a dense RBF network. Given an input vector x = [x₁, x₂, …, xₙ], an RBF network is expressed as f(x) = Σᵢ wᵢ φᵢ(x − cᵢ), with the sum running over the hidden neurons, where


φᵢ(x − cᵢ) is usually chosen to be exp(−‖x − cᵢ‖²/(2σ²)), and cᵢ represents the cluster center of the i-th neuron. The functional equivalence of an RBF network and a fuzzy inference system has been proven by Jin and Sendhoff (2003) under moderate conditions. It has also been demonstrated that an RBF network is a universal approximator. Therefore, an RBF network is a potentially robust mechanism capable of encoding fuzzy rules without sacrificing precision in its adaptive representation. Conversely, rule generation and fuzzy rule representation are simpler in an adaptable RBF network than in a multilayer perceptron. Although conventional RBF networks only have a single hidden layer, a deep RBF network, which can be thought of as a deep fuzzy rule system, is conceivable. Successful solutions to deep network training problems were found using a greedy layer-wise training approach, and this progress is applicable to the training of deep RBF networks. The relationship between a deep RBF network and a deep fuzzy logic system can then be used to construct a deep fuzzy rule system. We believe that efforts should be made in this direction to combine fuzzy logic and DL techniques with big data. The interpretability and accountability of a fuzzy logic system are commendable, but it is incapable of knowledge acquisition that is both effective and efficient. It appears that a NN and a fuzzy logic system complement each other. Consequently, combining the best of both worlds is essential to enhance interpretability. This roadmap is not entirely novel. It identifies the requirement for a new intuitive and computationally efficient fuzzy operator for NFSs. Numerous combinations have been proposed in this direction, including ANFIS (Jang 1993), the generic fuzzy perceptron (Nauck 1994), RBF networks (Bishop 1991), and a number of others discussed in Chap. 5.
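The RBF formulation above maps directly onto a few lines of NumPy. The following is a minimal sketch of the forward pass with Gaussian basis functions; the centers, widths, and weights are invented for illustration.

```python
import numpy as np

def rbf_forward(x, centers, weights, sigma=1.0):
    """f(x) = sum_i w_i * exp(-||x - c_i||^2 / (2 * sigma^2))"""
    dists_sq = np.sum((centers - x) ** 2, axis=1)    # squared distance to each cluster center c_i
    phi = np.exp(-dists_sq / (2.0 * sigma ** 2))     # Gaussian basis activations, one per hidden neuron
    return float(weights @ phi)                      # weighted sum of basis responses

# Each row of `centers` plays the role of one fuzzy rule's prototype (its cluster center).
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
weights = np.array([0.5, -1.2, 2.0])
x = np.array([0.9, 1.1])
print(rbf_forward(x, centers, weights))
```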

1.2.4 Synergy of Neuroscience and Deep Learning

To date, the only truly knowledgeable systems are humans. ANNs were initially influenced by biological and neurological insights (McCulloch and Pitts 1943). Given the close relationship between biological networks and NNs, advances in neuroscience should be appropriate and even instrumental in designing and interpreting DL techniques. We believe that neuroscience promises a bright future for IDL. In recent years, the effective use of cost functions, such as the adversarial loss in GANs, has been key to the development of DNNs. In later chapters, we will see relevant examples illustrating how a well-defined cost function can help a model acquire an interpretable representation by employing strategies such as enhancing feature disentanglement. In this way, a multitude of cost functions that reflect biologically plausible rationales can be constructed. Indeed, the brain can be modeled as an optimization machine with a robust mechanism for credit assignment that forms a cost function (Fan et al. 2021). Note that, despite significant success, backpropagation is far from ideal from a neuroscience perspective. Backpropagation does not demonstrate how a human neural system tunes the synapses of neurons. Synapses in a biological neural system are updated locally by presynaptic and postsynaptic neurons (Krotov and Hopfield 2019).


Deep networks, in contrast, are tuned using non-local backpropagation. Deep networks also lack global neuromodulators, unlike the human brain, where neuron input-output patterns are controlled by dopamine and serotonin (Soltanolkotabi et al. 2018). Neuromodulators are essential because they can selectively control the on/off states of a DL neuron, altering its cost function (Bargmann 2012). We expect future non-convex optimization algorithms to be unique, stable, and data-dependent. NNs have been developed with many different architectures over the past few decades, from simple feedforward networks to complex convolutional networks and beyond. A specific network architecture governs information flow in ways that are distinct. As a result, specialized architectures address specific issues. There are currently structural distinctions between DL and biological systems. A typical ANN is used and fine-tuned for tasks that require large amounts of data, whereas a biological system learns from small amounts of data and generalizes exceptionally well. Biological Neural Networks (BNNs) clearly require a significant amount of additional research in order to design more desirable and explainable NN architectures. Researchers have recently looked into more inductive biases from neuroscience to improve CNN architectures (Stone et al. 2017). Learning representations via video sequences, depth information, and physical engagement with the environment are all examples of how representations can be biased. In conclusion, DL has proliferated as data for assertive and complex models have grown. There is global consensus on the general impact of AI and its penetration into various sectors of society. People are accustomed to its decision-making abilities in daily life, from software solutions like Netflix recommendations to hardware-integrated solutions like Tesla's autonomous vehicles and maritime vessel navigation. However, even with resource-intensive systems, DL algorithms lack transparency in both their internal mechanism and their decision-making validation. This is important because systems that can't explain themselves pose risks.

1.3 Awakening of Interpretability

According to Merriam-Webster, the verb 'interpret' means to explain something in a fashion that is easily understood. Some may argue that 'interpretability' can't be grammatically awakened. The enticing title above emphasizes the resurgence, or rising demand, for centuries-old interpretability requirements. In 2004, M. Van Lent coined the term "eXplainable AI" (XAI) for the first time to explain AI-controlled military tactical behavior in the simulated environment of a game (Van Lent et al. 2004). Although the term was new, researchers have been working on the development of explanations for expert systems since the mid-1970s (Moore and Swartout 1988). XAI literature has been growing exponentially over time. It extends beyond computer science to include human-computer interaction, law and regulations, and the social sciences.


Fig. 1.10 Authorship report for the past 30 years derived from a Web of Science survey using the key phrases: (a) 'Deep Learning' (blue), (b) 'Interpretable Deep Learning' or 'Explainable Deep Learning' (orange), and (c) 'Deep Learning Ethics' or 'Ethical Deep Learning' (gray)

Figure 1.10 depicts the publication trends for Deep Learning, Interpretable Deep Learning, and Ethical Deep Learning in the Web of Science for the past 30 years, showing an increase in the research demand for interpretable results as systems become more computationally intensive. Not all authors use the terms 'interpretability' or 'explainability' in their works, and therefore the survey isn't exhaustive. The research community does not appear to have adopted interpretability assessment criteria uniformly. There have been attempts to define the notions of 'interpretability' and 'explainability' as well as 'reliability' and 'trustworthiness' without providing clear definitions of how they should be incorporated into DL model implementations (Tjoa and Guan 2020). Later in the book, promising techniques and architectural designs will pave the way for accountable AI environments. At a high level, we discuss the genetic breeding model of how machines learn for its simplicity. Genetic code is older, but we anticipate that genetic models will resurge as computing power reaches new peaks. The current focus, however, is on DL and CNNs, where the amount of linear algebra increases and, in brief, explainability decreases. We have thousands of parameters, each of which is sensitive to the connections between layers and neuronal features. The network has many links, and the deeper the network, the less interpretable its behavior. This is analogous to turning a radio dial, bringing the neuron closer to the answer, say image classification. We don't know the station's exact frequency, but we can tell whether we're getting closer or farther away. Training is similar, but with millions of dials and a lot of mathematics involved. This process is repeated whenever new training data is added to the network. Once completed, a model can recognize new images well, with some limitations.


This is the most elementary introduction to IDL. If this piques your interest and you enjoy math and code, let's delve into the specifics of how the existing literature captures IDL.

"If you can't explain it simply, you don't understand it well enough." —Albert Einstein

We focus on explaining DL systems to humans, which means showing in simple terms what the system does. Although 'explainability' is more intuitive than 'interpretability', we still need to define it. A formal definition remains elusive, so we search the field of psychology for clues. T. Lombrozo said in 2006 that "explanations are central to our understanding and the currency in which we exchange beliefs" (Lombrozo 2006). Questions like what an explanation is, what makes some explanations better than others, how explanations are made, and when people seek explanations are just beginning to be answered. In fact, the definition of 'explanation' in the psychology literature ranges from the deductive-nomological view (Hempel and Oppenheim 1948), where explanations are seen as logical proofs, to a more general sense of how something works. Keil (2006) defines explanations as "implicit explanatory understanding". All activities involved in the processes of providing and receiving explanations are regarded as part of what an explanation entails. Interestingly, different works use different standards, and all of them can be explained in some way. In later sections, we'll take a quick look at how important, necessary, categorized, and justified different interpretability ideas are.

Highlight
In general, IDL research will be linked to the ability to be understood through the following attempts:
• Explaining the decisions made by the model.
• Unveiling the patterns within the inner mechanism of the model.
• Introducing the system with models or math that make sense, including loose attempts to make models and algorithms explicit such that a user has a good reason to trust or distrust a specific model.

1.3.1 Relevance

AI systems are used in many fields, products, and services to improve performance. AI's contribution to society is undeniable, from recommendation systems to financial management to high-impact decision making, medical healthcare, and autonomous logistics (Adadi and Berrada 2018).


Fig. 1.11 A diagram demonstrating how to match learning techniques with underlying methods. It shows that a researcher’s or developer’s confidence (left) in a black-box model may not appeal to end-users (right) seeking clarity in their learning

These contributions drove exponential growth in computational capabilities and heterogeneous data collection, resulting in DL systems with exceptional predictive performance but increased complexity. For instance, the deep residual networks (ResNets) (He et al. 2016) introduced in 2016 have demonstrated superior performance in object recognition tasks, outperforming human-level competence. Facebook's AI Research, DeepMind AlphaGo, Tesla's Autonomous Car, Microsoft's Speech Recognition, and Nervana's Movidius are notable DL developments in less than a decade. A DL spark becomes a forest wildfire with positive and negative consequences over time. We should now feel compelled to step back and reflect on how the machine knows, not just that it knows. There are exceptions, such as shallow decision trees, to the rule that non-linear DL techniques used as predictors to maximize prediction accuracy are black boxes. However, most DL techniques are black boxes with little explanation, which makes non-linear scientific prediction difficult. Due to the black-box nature illustrated by the tangled thread matching in Fig. 1.11, a user may be unable to extract deep insights into what the non-linear system has learned, despite the desire to reveal the underlying natural structures (Lapuschkin et al. 2019). There is a trade-off between ease of interpretation and the need for specific mathematical knowledge, which may result in an unjustified bias toward one method over another. The magical black box, once appealing with its performance, necessitates an analysis of its behavioral pattern. Impressive applications of DL in complex games such as Go (Silver et al. 2016, 2018), Atari games (Mnih et al. 2015), and Poker (Moravčík et al. 2017) have led to speculations about DL systems exemplifying true "intelligence." Lapuschkin et al. (2019) argue against using predictive performance measures, such as high accuracy, to validate and assess machine behavior (in sciences, games, etc.). In order for these models to be used in the real world, they need to be very reliable and accepted by society. So, it's important to be able to understand and explain DL algorithms. If something goes wrong, who is to blame?


Table 1.1 Challenges with interpretability

• Algorithmic complexity: Convolution, pooling, non-linear activation, embeddings, and others increase NN variability. Even though non-linearity isn't always opaque (for example, a decision tree model is non-linear but interpretable), the non-linear recursiveness of DL models prevents us from understanding their internal dynamics.
• Commercial barrier: Businesses profit from black-box models, so they hide them. Model opacity prevents reverse engineering, protecting hard work and intellectual property. Furthermore, the cost of prototyping an interpretable model may be considered too high from a commercial perspective.
• Data wildness: Many domains lack high-quality data. Highly heterogeneous and inconsistent data hinder model accuracy and interpretability. Also, high-dimensional real-world data suppresses reasoning.
• Human limitation: Expertise is often insufficient in many applications. When dealing with rare and complex issues that humans, even experts, struggle to understand, DL is often employed.
• Theoretical assumption: DL theoretical advances include non-convex optimization, representational power, and generalization ability. In order to simplify a theoretical study, it is sometimes necessary to make unrealistic assumptions, which can end up weakening the explanation.

Can we explain why something has happened? Do we understand why things are going well and how to make the most out of them? A lack of explainability may hinder DL use in sectors with strict regulatory compliance standards. For example, U.S. financial institutions must justify credit issuance actions by law. AI tools used to make such choices function by plucking out small connections between hundreds of data points, making it difficult to explain how a credit decision was made. Similarly, in cat versus dog identification, we are not fully aware of the type of features that the model learned at each iteration. Table 1.1 summarizes IDL implementation criteria. This may lead to specialized education in sectors aimed at realizing these algorithms' potential. Another scenario is a self-driving car trained to keep a safe distance of 8 feet from other vehicles on roads. We could test the model to see whether the learned abstraction is plausible. A proper examination may suggest using the distance to surrounding traffic and its relative speed with respect to the lanes as the most crucial features. The same model may fail drastically in certain situations due to the sheer volume of traffic and the inherent assumption that traffic follows every road rule. The concept of lanes might be absent altogether. This interpretation may prompt us to wonder about edge cases involving the road lines used to maintain lanes. Most people don't like automated systems that make decisions for them. People want to understand a choice or, at the very least, know why it was made. This is because people don't trust everything they hear. So, trust is one of the things that makes interpretability important. Other motivations are causality, transferability, informativeness, fair and ethical decision-making, and accountability.


Table 1.2 Examples of a few scenarios from different fields suggesting the relevance of interpretability

Healthcare:
• Third-party AI solutions in challenging medical care
• Confidentiality maintenance in the physician-patient relationship
• Practicality issues with randomizing a patient's treatment

Decision systems:
• Human-aided machine operation affected by prejudice and fatigue
• Poor scalability and bias induced by geo-political data training
• Algorithms prone to discrimination, opacity, and power asymmetry
• High-stakes ethical voting systems

Criminal justice:
• People wrongly denied parole
• Recidivism prediction
• Unfair police dispatch

Finance:
• Credit scoring and insurance approval
• Loan defaults and scholarship waivers for students
• Stock management toolkits

This chapter suggests ways to figure out operational definitions and evaluate explanations based on data. We want to stress that the need for explanation in the context of an application may not require knowing how bits flow through a complex neural architecture. It may be much simpler, like being able to tell which input the model was most sensitive to or whether a restricted class was used when making a decision. DNNs achieve high discrimination power at the expense of their black-box representations’ low interpretability. We believe that high model interpretability can help people overcome several DL bottlenecks, such as learning from a few annotations, learning via semantic human-computer communications, and semantically debugging network representations. Table 1.2 presents popular potential examples from various sectors.

1.3.1.1 Incompleteness

Have you ever questioned the DL model's accuracy? Why is it harder now to ignore a system's decision approximation and trust the result? The dilemma lies in incomplete formalization of the problem, and a single metric such as classification accuracy isn't good enough for most real-world tasks. Incompleteness means that something about the problem can't be modeled well (Doshi-Velez and Kim 2017). A tool for assessing criminal risk, for example, should be fair and ethical, and it should follow human ideas of justice. At the same time, ethics is a broad subject that is hard to formalize and is based on personal opinions. For instance, an airplane crash is a crisis that is well understood and can be described in detail. There is nothing else to worry about if a system avoids collisions well enough. Instead of using data "artifacts," it's important to represent the problem well to get accurate measurements (Soneson et al. 2014; Lapuschkin et al. 2016).


There are numerous reasons why something may be incomplete, including safety concerns. If a system cannot be tested in a full deployment environment, this may indicate that the test environment cannot be used to predict how the system will perform in the real world. The task is also at times kept incomplete because people want to learn more about the science behind it. While people try to figure out how one thing leads to another, models can learn to optimize their goals only through correlation. These things make models harder to understand, but that is sometimes acceptable. DL models must be understandable in order to reveal their tendencies. By default, DL models are biased by the samples used to train them. As a result, a system may reproduce such biases and unfairly target underrepresented groups. For example, suppose we train a recommendation model that can automatically approve or reject college scholarships based on past accomplishments and education status. People often believe that a top college student with good grades has a good chance of succeeding. This could be unfair to a disadvantaged group with limited resources. We want to help as many talented students as possible, and to be ethically sound, we must not make assumptions about them based on their background. Therefore, the problem statement is incomplete. Our problem statement, i.e. awarding scholarships in a way that is both meritocratic and legal, is lacking an additional constraint that should have been included in the cost function used to optimize the DL model.
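A minimal sketch of how such a missing constraint could be expressed follows: the total training objective combines the usual prediction loss with a penalty that discourages differences in average outcome across applicant groups. The penalty form, the weighting factor lam, the group labels, and the toy numbers are all hypothetical illustrations, not a recipe from the text.

```python
import numpy as np

def demographic_parity_penalty(scores, groups):
    """Penalize the gap in average approval score between two applicant groups (0 and 1)."""
    gap = abs(scores[groups == 0].mean() - scores[groups == 1].mean())
    return gap

def total_loss(prediction_loss, scores, groups, lam=0.5):
    # The extra term encodes the otherwise-missing fairness constraint in the cost function.
    return prediction_loss + lam * demographic_parity_penalty(scores, groups)

# Toy example: model scores for six applicants and their (hypothetical) group membership
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.5])
groups = np.array([0, 0, 0, 1, 1, 1])
print(total_loss(prediction_loss=0.21, scores=scores, groups=groups))  # 0.21 + 0.5 * 0.4 = 0.41
```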

1.3.1.2 Accountability

In January 2017, the Association for Computing Machinery (ACM) released an algorithmic transparency and accountability statement. The ACM warns that algorithms can lead to harmful discrimination when used to make decisions automatically, and it issued rules to prevent such problems. In May 2017, DARPA launched the XAI program (Gunning 2017) with the goal of producing explainable and highly accurate models. This is a catch-all term for AI research attempting to solve the AI black-box problem. There is no standard definition of XAI because there are many approaches, each with its own needs and goals. XAI uses the terms 'understanding', 'interpreting', and 'explaining' interchangeably. The goal is to help users understand how models make decisions. In May 2018, the EU's General Data Protection Regulation (GDPR) replaced the 1995 Data Protection Directive (Regulation 2018). The GDPR guarantees every individual a "meaningful explanation of the logic involved" in automated individual decision-making, including profiling. The revised EU GDPR may require AI providers to explain automated decision-making to users. This new requirement affects a large portion of the industry with regulations on personal information collection, storage, and use. It may complicate matters or lead to the ban of opaque models used for applications like personal data recommender systems. For instance, automated credit risk and money laundering decisions must be transparent, interpretable, and accountable. This could affect financial institutions, social networks, and healthcare. According to Goodman and Flaxman (2017), this is the right of explanation for each subject (person).


The European Commission (EC) defines the following accountability aspects (Smuha 2019):
• Auditability includes assessing AI algorithms, data, and design processes while preserving the intellectual property related to the AI systems. Internal and external auditors' reports could increase the technology's trustworthiness. When AI affects fundamental rights, including safety-critical applications, an external third party should audit it.
• Minimizing and reporting negative impacts involves reporting system actions or decisions. It also includes outcome assessment and response. AI system development should consider identifying, assessing, documenting, and minimizing negative impacts. Impact assessments should be performed both before and during the development, deployment, and use of AI systems in order to minimize the potential negative impact. Also, anyone who raises AI concerns should be protected (e.g., whistle-blowers). All evaluations must be proportionate to the risk that AI systems pose.
• Trade-offs. If the above requirements cause tension, ethical trade-offs may be considered. Such trade-offs should be reasoned, acknowledged, documented, and evaluated for ethical risk. The decision maker must be accountable for making the appropriate trade-off, and the decision should be constantly reviewed. If there is no ethically acceptable trade-off, the AI system should not be developed, deployed, or used in that form.
• Compensation includes mechanisms to make sure that people are treated fairly when bad things happen that were not expected. To build trust, it's important to make sure there's a way to fix things that didn't go as planned. Vulnerable people or groups should be given extra care.
These issues addressed by the EC show how XAI is related to accountability in different ways. First, it helps auditability by explaining AI systems for different profiles, including regulatory ones. Also, since fairness and XAI are linked, XAI can help minimize and report negative impacts.

1.3.1.3 Algorithmic Transparency

Algorithmic transparency is necessary for comprehending the dynamics of a model and its training process. This is because the objective function of NNs has a substantially non-convex topology. Due to deep networks' inability to produce truly novel answers, the model's openness is compromised. However, modern SGD-based learning algorithms achieve impressive results. Finding the reasons behind the success of learning algorithms is crucial to the advancement of DL studies and applications. Figure 1.12 illustrates a scenario in which neural learning uses an object classification algorithm based on training data to classify a single object in an image. When a multi-class object is present, the algorithm fails to categorize it, resembling the inefficiency of domain adaptation in a specific model.


Fig. 1.12 Diagrammatic perception of a model's performance: specificity versus generality in the network's behaviour

In the future, improved transparency is expected to decode the learning paradigm and build models that generalize robustly across data. We also come across many algorithm-centric journal articles in the AI community. They often assume that the algorithms are easily interpretable without conducting human subject tests to verify this. It is not always incorrect to assume that a model is obviously interpretable, and human trials may be unnecessary in some cases. For example, predefined models based on commonly accepted knowledge specific to the subject matter may be considered interpretable without human subject tests. Remember that interpretability is important in many domains but not all. There are areas where human intervention is required, such as healthcare, but, for instance, in aviation, aircraft collision avoidance algorithms have been operating without human interaction for years and are self-explanatory. We must understand that explainability, not certainty, is required when there is some incompleteness. Uncertainty is something that can be formalized and dealt with using mathematical models.

1.3.2 Necessity

Let us delve deeper into the reasons why interpretability is so important. In predictive modeling, we frequently find ourselves in a trade-off situation between what is predicted and understanding why such a prediction was made (see Fig. 1.9). At times, knowing the prediction accuracy will suffice. However, in a high-risk environment, we will almost certainly pay for the explainability with a drop in predictive performance.


Accurate prediction only solves a portion of the original problem. The added explainability may help practitioners debug their model (Casillas et al. 2013) and explore induced bias or unintended model learning. To achieve interpretability, we can certainly employ model-independent techniques such as local models or Partial Dependence Plots (PDP). Nonetheless, there are two compelling reasons to consider interpretation methods specifically designed for NNs:
1. NNs learn features and concepts in their hidden layers, which require specialized tools to uncover them.
2. The gradient can be used to implement interpretation methods that are more computationally efficient than model-agnostic methods that look at the model "from the outside" (a minimal sketch of such a gradient-based method appears at the end of this subsection).
Interpretability is required when the goal for which the prediction model was built differs from the way the model is actually used. In other words, there is a difference between what a model can explain and what a decision-maker wants to know. According to DARPA (Gunning 2017), XAI aims to "produce more explainable models" while ensuring a high degree of learning performance (prediction accuracy), and to facilitate users' ability to appropriately understand, trust, and manage the emerging generation of artificially intelligent collaborators. This is why explainability is important. Even models that work well overall but fail on a few data instances can be hard to explain: here, we also want to know why the model didn't do well on these few feature combinations (Kabra et al. 2015). We will now learn a few axioms of necessity.
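As referenced in the second point above, the following is a minimal sketch of a gradient-based sensitivity (saliency) map in PyTorch. The network and random input are placeholders (a recent torchvision version is assumed), and this shows only the plain-gradient variant of such methods.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None)   # placeholder network; any differentiable classifier works
model.eval()

x = torch.rand(1, 3, 224, 224, requires_grad=True)   # a toy input image

score = model(x)[0].max()     # score of the most likely class
score.backward()              # backpropagate the class score to the input pixels

# Sensitivity (saliency) map: how much each input pixel influences the predicted class score
saliency = x.grad.abs().max(dim=1)[0]   # collapse the colour channels
print(saliency.shape)                    # torch.Size([1, 224, 224])
```

Because the gradient is computed inside the network in a single backward pass, this is far cheaper than probing the model from the outside with many perturbed inputs.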

1.3.2.1 Knowledge

Knowledge is the first axis of necessity. DL has enabled researchers all over the world to find connections that are far beyond human cognitive reach. AI and DL techniques have mostly helped fields that deal with large amounts of reliable data. But we are entering an era when the only things that research studies care about are results and performance metrics. Although this might be true for some fields, science and society are not only concerned with performance. The search for knowledge is what enables us to improve the model and use it in the real world. In DL, where a decision is based on a huge number of weights and parameters, interpretability is very important. Here, the parameters are often abstract and bear no direct relation to the real world, which makes it hard to understand and explain the results of deep models (Angelov and Soares 2020). Interpretability will help align algorithms with human values, which will help people make better decisions and give them more control over those decisions. On the other hand, if we really understand a model, we can look closely at its flaws. This is because a model's ability to be interpreted can help us figure out where it might be weak and, based on this knowledge, make it more accurate and reliable. Also, interpretability is a key part of using DL techniques in an ethical way (Geis et al. 2019).


Fig. 1.13 The severity of the demand for IDL is determined by the practicality and specificity of the task objective

Interpretability acts as a latent property that is not directly measurable. It relies on measurable outcomes such as the ability to predict the model's behaviour, to detect the model's shortcomings, and to gauge the degree of influence of the model's predictions on users. Due to the lack of any established benchmark, and the incompleteness in the degree of interpretation, it is essential that we actively seek evaluation techniques for the ethical implementation of interpretability approaches. Doshi-Velez and Kim (2017) attempt to segregate evaluation approaches based on the degree of human involvement and on how costly and demanding the application's need for interpretation is. We demonstrate this visually in the form of an example in Fig. 1.13. In general, the more humans involved and the more severe the task at hand, the greater the demand for interpretability and specificity.

THINK IT OVER
Weld and Bansal (2018) discuss explanatory debugging and verifiability. What's your take on flawed performance metrics, inadequate feature representation, and encoded knowledge drift?

1.3.2.2 Regulated Sectors

The legal operation of DL techniques is the second axis. Several legally binding regulations at the national and international levels already govern the development, deployment, and use of AI systems. In addition to horizontally applicable regulations such as consumer work and safety law, GDPR, and UN Human Rights treaties,

specific implementations such as medical device regulation in healthcare are subject to domain-specific rules. A major barrier is undoubtedly the gap between the research community and the business sectors, which hinders the complete penetration of the most recent DL models into sectors that have historically lagged in the digital transformation of their processes, such as banking, finance, security, and health, among others. This issue typically arises in sectors with strict regulations and a reluctance to implement techniques that could put their assets at risk. Even with the assurance of an ethical purpose, society must be certain that the system will not contribute to unintended harm. With the emergence of the need for interpretability, one must comprehend the positive and negative legal consequences of taking that need seriously and putting it into legislation. This indicates that interpretation should not be sought solely with reference to what could be done, but also with reference to what should be done. This will establish the foundation for reliable DL. Ribeiro et al. (2016) examine the explainability of any classifier's prediction in order to foster trust. Having said that, we must move forward with the obligation of any natural or legal entity to comply with the applicable laws, whether they apply today or will be adopted in the future as a result of AI development.

1.3.2.3 Consensus on Interpretability

Knowing the 'why' can help us gain a deeper understanding of the issue and explain why a model may fail (Doshi-Velez and Kim 2017). By interpreting the model's latent feature interactions, we can learn which features dominate its decision-making policy and check that the behavior of the model is fair. Accountability and dependability of the model are ensured by the ability to verify and justify why particular key features were responsible for driving certain decisions made by the model during prediction. The information does not have to provide an exhaustive explanation of the situation, but it should address the primary cause.

"Intuition is critical for understanding, but unverified intuition can be misleading and lead us astray."
—Ari Morcos (Leavitt and Morcos 2020)

So far, there is no real consensus regarding IDL and its measurement techniques. Researchers initially attempted to develop evaluation strategies for reasoning-based ML models. Nevertheless, DL models, with their more abstract learning mechanisms, are a far cry from offering such justification. The research community has recognized
these interpretability challenges and their implications for ethical standards and fairness. In recent years, they have continuously refined models with explicable reasoning. Despite the fact that the volume of research in IDL is rapidly increasing, a comprehensive survey and systematic classification of these research works is lacking. Furthermore, we humans frequently associate beliefs, desires, goals, and even emotions with lifeless objects. These personality traits would be comparable to the descriptions we give human agents. IoT devices, such as our smart vacuum cleaner, Eufy, are excellent examples. If Eufy gets stuck, we think to ourselves, "Eufy is enthusiastic about cleaning but desperately seeks my assistance when she gets stuck." Later, when Eufy has finished cleaning and is heading back to its base to recharge, we assume that Eufy wants to recharge and is desperate to get home. Personality traits are also assigned, like "Eufy is slightly feeble-minded." Those are our thoughts, especially after we find that Eufy knocked over a plant while vacuuming the apartment. A machine or model that describes its predictions is more likely to gain acceptance. Explanations govern social communications by assuming a shared interest in something; through the explanation, the explainer shapes the beneficiary's actions, sentiments, and beliefs. Allowing a model to communicate with us may necessitate shaping our emotions and expectations. Devices must "influence" the consumer in order to help them achieve their goal. We would not have accepted our smart cleaner if it did not explain its response in some way. It establishes a common understanding of, say, an accident, such as getting stuck on the balcony carpet repeatedly, by signaling that it is stuck rather than simply stopping work without comment. Interestingly, there may be a misalignment between the goal of the system's explanation (building trust) and the intention of the recipient (understanding the prediction or behavior). Perhaps the complete explanation for Eufy getting stuck is a low battery, one of the wheels not working properly, or an algorithmic bug that causes the robot to return to the same location repeatedly. These (and possibly other) factors could keep Eufy stuck. However, it only demonstrated that something was in the way, which was sufficient for us to accept its behavior and form a shared understanding of the accident. The importance of algorithmic accountability has been emphasized numerous times. Google Photos' image-labeling algorithm, which labeled some Black people as gorillas, and Uber's self-driving car, which ran a stop sign, are two notable examples. Because Google was unable to fix the algorithm and remove the algorithmic bias that caused this issue, they solved the problem by removing all references to monkeys from Google Photos' search engine.


Highlight
The assurance that DL models can explain their actions shall enable society to benefit from the following traits (Doshi-Velez and Kim 2017):
• Fairness: The assurance of unbiased prediction, without implicitly or explicitly discriminating against underrepresented groups. Interpretation can reveal why a specific user should be granted a loan, and it is simple for an evaluator to determine whether a decision is discriminatory based on a learned demographic (e.g., ethnicity).
• Privacy: Securing sensitive information in the dataset. The sensitive information about the model's training samples should not be exposed unlawfully.
• Reliability: Ensuring that the model's predictive accuracy incorporates edge cases for safe implementation.
• Robustness: Assuring that minor changes in the data feed do not cause significant fluctuations in the prediction.
• Causality: Monitoring that only causal relationships are picked up.
• Trust: Allowing users to trust a machine that explains its decisions rather than a black box.
• Accuracy: The ability of the model, and of the extracted representations, to accurately predict unseen instances.
• Usability: Interactive and queryable explanations are more usable than fixed or purely textual interpretations.
• Scalability: Interpretability is expected to be maintained for large-scale input data, i.e., the method should scale to opaque models with large input spaces and large numbers of weighted connections.
• Monotonicity: The model's output probability is expected to react monotonically, increasing or decreasing as feature values increase.

Now that we've covered the basics of the need for interpretability, we will, in the following chapters of this book, look into various techniques for implementing it. Table 1.3 provides a basic overview of some of the strategies and questions addressed, as a prerequisite to gain new insights into model learning and discovering causality (Pearl 2009).

THINK IT OVER
Why is it necessary to quantify the interpretability metric for a model explanation? Furthermore, is it possible to string together 'local' explanations in an effort to construct a 'global' interpretable model?

Finally, we'll compare what's expected versus what's being done now to explain the black-box. Figure 1.14 shows what the DL community seeks in model


Table 1.3 Some popular interpretability techniques and the associated insights they provide

Concepts: What are additional abstract concepts the NN learned?
Learned features: What features has the NN learned?
Pixel attribution: How did each pixel contribute to a particular prediction?
Adversarial samples: How can we trick the NN with counterfactuals?
Influential instances: How influential was a training data point for a particular prediction?

Fig. 1.14 A situation in which the DL model knows what to expect and has plausible answers to model decisions (shown in green), against the present-day perplexity of IDL models (in orange)

interpretability and clarity in the top row, while the bottom row, marked in orange, represents the present-day uncertainty and clash of thought that comes with only a partial opening of the black box. Arriving at a true interpretation of the model requires a protocol and consensus.

1.3.3 The Taxonomy of Interpretability

We realized that the lack of genuine agreement on explainability stifles progress in the field. To address this issue, people must first learn the various taxonomies and definitions developed by the AI community for classifying different approaches to interpretation. Figure 1.15 depicts what the term 'interpretability' means, as a starting point for building our knowledge of basic and advanced strategies for opening the black box across various domains. In some ways, this sub-section is better suited for casual readers from various disciplines who want to find useful references for popular categories and develop an interest in this young research field.

1.3.3.1 Definitions of Interpretability and Interpretation

There is no universally accepted mathematical, formal, or technical definition of interpretability. Miller (2019) provided a commonly used (non-mathematical) definition of interpretability.

Definition 1.1 "The degree to which an observer can understand the reason for a decision is referred to as interpretability."

Take note of the three segments in the preceding definition: 'understand', 'reason', and 'decision'. Depending on the design, some elements often need to be re-weighed or even changed. First, in DL, where the role of humans needs to be taken into account, the definition of interpretability is usually adapted to humans, so that the results of interpretation are better at helping humans understand and reason. Second, the word 'reason' in the definition makes it easy to think that interpretation looks at how models capture cause and effect. Although causality is important for some kinds of interpretation methods, it is often the case that interpretations are made outside the framework of causal theories. Third, more and more methods are moving away

Fig. 1.15 We call this a humorous illustration of the true meaning of interpretable AI


from the idea of explaining "a decision" and trying to understand a wider range of things, such as model components and data representations (Olah et al. 2018). An observer can understand a model or its predictions through interpretation. Montavon et al. (2018) provided the following general and widely accepted definition.

Definition 1.2 "An interpretation is the transformation of an abstract concept into a domain that humans can comprehend."

In the preceding definition, two segments stand out: 'concept' and 'comprehend'. Arrays of pixels in images or words in texts are common examples of human-understandable domains. First, the "concept" that needs to be explained could be anything from a predicted class, to how a model component is perceived, to what a latent dimension means. Second, in situations where user experience is critical, it is crucial to convert the raw interpretation to a format that facilitates user comprehension, even if it means sacrificing interpretation accuracy. This brings us to the fundamental umbrellas under which most interpretation strategies found in the literature fall. Because DL interpretability is still a developing field, the proposed taxonomy is neither mutually exclusive nor exhaustive. But this seems like a good way to compare and criticize explanations from different fields.

1.3.3.2 Keywords Related to Interpretation

We frequently come across pieces of literature that attempt to present the model from the perspectives illustrated in Fig. 1.16. We classified popular keywords into three categories: (i) vision based on models or data, (ii) algorithmic or methodic insights, and (iii) ethical inspection. This attempt to categorize the list into non-exclusive but mutually shared knowledge commonly found in the literature will provide a framework for dealing with the concept of future interpretability statements. Some of the keywords are explained below (Vilone and Longo 2020).
• Clarity: Discloses whether the interpretations are understandable or obscure in nature.
• Soundness: Makes sure that interpretations are well founded, in particular that they are scientifically plausible.
• Model Interpretation: Useful to comprehend the functioning of prediction systems.
• Creativity: Examines the inventiveness or originality of the explanations.
• Accountability: Explains and justifies the decisions and actions taken for system interaction.
• Responsibility: The potential to question a decision and identify an error or unexpected results.
• Transparency: Explains the functioning of a system even when it behaves unexpectedly. For example, a recidivism model can be made transparent by informing the accused that a prediction model was used to assess recidivism risk as part of the bail decision.


Fig. 1.16 Different variants of Interpretability

1.3.3.3 Nature of Interpretation

Despite the widely acknowledged importance of interpretability, researchers are still striving to establish universal, objective standards for developing and validating explanations. Numerous ideas that underlie the effectiveness of explanations have been proposed in the literature. These explanations are often sought when there are contrastive and/or counter-intuitive data demanding insights into breaking the biases which might have caused them. Some of the characteristics identified by Miller (2019) look like this:
• Selective: Users usually do not expect an explanation to contain a complete list of the causes of an event, but rather a selection of the few necessary and sufficient reasons to explain it. There is a chance that cognitive biases will influence this selection.
• Social: Explanations are part of a conversation that aims to share knowledge and is largely based on the beliefs of both the person doing the explaining and the person being explained to.
• Likelihood: Probabilities of action or statistical relationships between causes and events don't always give a good explanation that makes sense. When giving an explanation, it's better to talk about the causes and not how likely they are.
Rather than relying on contrastive events to start such discussions, we should aim to proactively remove these biases from our initial observation set.

1.3 Awakening of Interpretability

1.3.3.4

41

Scope of Interpretation

We also need to highlight whether the technique or method describes a single outcome, thereby making it local, or the entire model's behavior, for global interpretation. We generally characterize a method by the scope it is bound to, with models used predominantly in computer vision work as examples.
• Global: The goal here is to make the entire inferential process of a complete model comprehensible. Holistic reasoning over different possible outcomes reports population-level decisions. However, the models are often uniquely structured to preserve interpretability, resulting in limited predictive power.
• Local: The objective here is to reason about an individual inference of a model. It aims to justify the specific decision made by the model for a particular instance and to build trust in model outcomes.
• Instance-level: Intuitively, the goal is to find another representation of an input variable x (concerning the function f associated with the DNN), with the expectation that the representation carries straightforward yet critical information that can help the user understand the decision f(x).

Highlight
There is growing focus on a combined approach, channeling the strengths and benefits of 'Global' and 'Local' interpretation into:
• A global approach that explains standard model prediction.
• A global approach that explains the influence of modular levels in standard model prediction.
• A local approach that explains a specific model response for a group of prediction instances.
• A local approach that explains a specific model response for a single instance.

Thus, in terms of trustworthiness, local explanations are more faithful than global explanations. Interestingly, revisiting pieces of literature unanimously suggests researchers' preference for local explanations to interpret the predictions of DNNs. They also claim that despite the development of techniques aiming to explain NNs, the strategy can be adopted for many models, resembling the model-agnostic behavior discussed next.
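To illustrate the local scope, here is a minimal, LIME-style sketch (the black_box function, the instance x0, and all constants are hypothetical stand-ins): a weighted linear surrogate is fitted only in the neighbourhood of one instance, and its coefficients serve as a local explanation, whereas a global explanation would have to summarize behaviour over the entire input space.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(x):
    # Hypothetical opaque model standing in for a trained NN.
    return np.sin(x[:, 0]) + x[:, 1] ** 2 + 0.5 * x[:, 2]

x0 = np.array([0.3, -1.2, 0.8])  # the single instance to be explained

# Perturb the instance locally and query the black box on the perturbations.
X = x0 + 0.1 * rng.normal(size=(500, 3))
y = black_box(X)

# Weight samples by proximity to x0, then fit an interpretable linear surrogate.
weights = np.exp(-np.sum((X - x0) ** 2, axis=1) / 0.05)
surrogate = Ridge(alpha=1e-3).fit(X - x0, y, sample_weight=weights)

print("local feature attributions:", surrogate.coef_)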

1.3.3.5 Mode of Generalization

There exists another classification of model interpretations, based on whether access to the model's internal mechanism is required or whether the method works model-independently from input-output pairs alone.
• Model-agnostic: Explainability is applied after training a black-box algorithm (a model ensemble or NN) with perturbations of inputs, usually by
analyzing relationships between input-output feature pairs of the trained models. These methods do not depend on the internal structure of the model and are instrumental when we have no theory or other mechanism to interpret what is happening inside the model. This approach is independent of the model's internal architecture and beneficial for evaluating diverse models and comparing their performance. Note that model-agnostic tools can be used on any DL model after the model has been trained (post-hoc). Some model-agnostic tools, like example-based explanations, use the difference or resemblance between examples in the data to understand how the model behaves. This can provide intuitive interpretation for predictions when the attributes in the data are simple and human-interpretable.
• Model-specific: Inspecting or accessing the model internals is limited to specific classes of models. Furthermore, the interpretation of intrinsically interpretable models is invariably model-specific. An approach that exclusively works for understanding a particular NN, say a CNN, is model-specific.
The generalization strategy we may apply will depend on the category of the model we have trained and the kind of data we are using. Some techniques are only suitable for classification tasks, some cannot handle categorical inputs, and others have few restrictions but are often computationally expensive. Unarguably, the most prevailing class of interpretability generalization methods is the model-agnostic class. Because they are model-independent, a single model-agnostic technique can be used to compare the behavior of different models and present a comprehensive study.
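As a minimal sketch of the model-agnostic idea (the dataset and the underlying estimator are arbitrary placeholders), permutation feature importance below never inspects the model's internals: each feature column is shuffled in turn and the resulting drop in test accuracy is read as that feature's relevance, so the very same code works for any trained classifier.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Any trained estimator could be plugged in here; the explanation never looks inside it.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

baseline = accuracy_score(y_test, model.predict(X_test))
rng = np.random.default_rng(0)
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and the target
    drop = baseline - accuracy_score(y_test, model.predict(X_perm))
    print(f"feature {j}: importance ~ {drop:.3f}")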

1.3.3.6 Level of Transparency

Transparency represents a human-level interpretation of the inner working of the model. The three principal stages, namely simulatability, decomposability and algorithmic transparency, are illustrated in Fig. 1.17 and discussed below.
• Simulatability: The ability of a model to be simulated, or to allow its structure and functioning to be entirely understandable by a human. Hence complexity takes a dominant position in this class. Simple but extensive rule-based systems fall outside this trait, whereas a single perceptron NN falls within it. This perspective aligns well with the claim that sparse linear models are more interpretable than dense ones (Tibshirani 1996), and that an interpretable model can be easily presented to a human using text and visualizations (Ribeiro et al. 2016). Again, backing a decomposable model with simulatability requires the model to be self-contained enough for a human to think and reason about it as a whole.
• Decomposability: The degree to which a model can be decomposed into individual elements to explain its functioning. Lou et al. (2012) marked this as an intelligible model, as it empowers the capability to understand, interpret or explain its components (input, parameters, and output). However, as occurs with algorithmic transparency, not every model can fulfill this property. Decomposability


Fig. 1.17 The various degrees of openness in an ML model, for a model F_φ, where φ is the collection of parameters of the model under consideration: the capacity for (a) simulatability, (b) decomposability, and (c) algorithmic transparency of an algorithm implementation. Despite its focus on a specific execution, the example still manages to provide a thorough explanation of the ML paradigm. An example, the output classes, or the dataset can all serve as interpretability goals

demands every input to be readily interpretable. An added constraint for an algorithmically transparent system to become decomposable is that individual parts of the system must be understandable by a user without the need for additional tools.
• Algorithmic Transparency: The degree of confidence that a model will perform sensibly in general, i.e., the ability of the user to understand the process followed by the model to produce any given output from its input data. In this regard, a linear model is deemed transparent because its error surface can be understood and reasoned about, entitling the user to know how the model will act in every situation (James et al. 2013). Contrarily, a deep architecture will be non-transparent, as the loss landscape stays opaque (Kawaguchi 2016; Datta et al. 2016) since it cannot be completely observed, and the solution must be approximated through heuristic optimization (e.g., SGD). The main limitation of this class is the inability to completely explore the model using mathematical analysis and methods.
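The contrast can be made concrete with a small sketch (the data are synthetic placeholders): a sparse linear model in the spirit of Tibshirani (1996) is simulatable and decomposable because each surviving coefficient can be read off directly as the contribution of one input, something the opaque loss landscape of a deep architecture trained by SGD does not offer.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# The synthetic target uses only two of the ten features; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)

# A human can simulate the model directly: prediction = intercept + sum(coef_i * x_i).
for i, coef in enumerate(model.coef_):
    if abs(coef) > 1e-6:
        print(f"feature {i}: weight {coef:+.2f}")
print("intercept:", round(float(model.intercept_), 3))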

1.3.3.7 Mode of Black-box Inspection

Now we refer to the mode in which a method generates explanations. Ante-hoc methods generally aim to consider the explainability of a model from the beginning and during training, to make it naturally explainable while still trying to achieve optimal accuracy or minimal error; post-hoc methods aim to keep a trained model unchanged and to mimic or explain its behavior by using an external explainer at testing time.
• Background Knowledge Exploitation: From childhood, we tend to conform any explanation to previously acquired knowledge. Oftentimes, we are forced to explain in terms of our current understanding. How about leveraging our prior
knowledge to evaluate the model's explanation? Intuitively, it is hard to forge a link between our knowledge and a complex model trained bluntly on the data by reducing a loss. The solution is found in literature similar to inductive programming, where we can use background data in the form of linked data and knowledge graphs. The model's inputs and outputs are linked to background knowledge containing positive and negative examples, and a symbolic learning system is used to generate an explanatory theory. Some of the experimental symbolic systems are ECII (Sarker and Hitzler 2019), DL-Learner (Lehmann and Hitzler 2010), and OWL Miner (Ratcliffe and Taylor 2014).
• Intrinsically interpretable model/Ante-hoc interpretations: Interpretable modeling is achieved by incorporating interpretability into the model structure or the learning procedure itself. Here people tend to leverage and exploit the inherent visual interpretability of DL algorithms, like pre-softmax class visualization, neuron interaction, and rule-based heuristics. The interpretation is purely dependent on the abilities and specialties of each model. It is still a difficult problem to develop models that are both transparent and capable of achieving cutting-edge performance. Many efforts have been made to improve deep models' intrinsic interpretability. Some specifics are provided below.
The use of distillation is a simple strategy. Specifically, we first build a complex model (e.g., a deep model) to achieve good performance. Then, we use an interpretable model, such as a linear, decision tree, or rule-based model, to mimic the complex model's predictions. This strategy is also known as mimic learning (a minimal sketch appears after this list). The interpretable model trained this way often outperforms the same model trained directly on the data and is easier to understand than the complex model. Model agnosticism and specificity don't apply to ante-hoc methods: because they aim to make a model's functioning transparent, they are all model-specific.

Highlight
Attention models, originally developed for machine translation, are now popular due to their interpretability. Attention models can be explained intuitively using human biological systems, where we selectively focus on some input while ignoring irrelevant parts (Xu et al. 2015). By examining attention scores, we can tell which input features were used for prediction. This is like post-hoc interpretation algorithms that prioritize input features. The important distinction is that attention scores are generated during model prediction, whereas post-hoc interpretation occurs after prediction.

Deep models learn effective representations to compress data for downstream tasks. Humans can't interpret the representations because the dimensions' meanings
are unknown. This problem is addressed by disentangled representation learning, which divides characteristics into different interpretations and encodes their meanings as separate dimensions. We could check each dimension to see which input data factors are encoded. After learning disentangled representations on 3D chair images, for example, chair leg style, width, and azimuth are encoded into different dimensions (Higgins et al. 2017).
• Post-hoc interpretations: An alternative approach to interpretability is to develop a highly complex yet accurate black-box model. Such a black-box model can then be used together with a separate set of exploiting techniques and reverse-engineering strategies to provide an explanation of the black box without altering its accuracy or gaining complete insight into its inner workings (Lipton 2018). This approach to interpretation is compared against interpretable design in Fig. 1.18. Techniques used to explain complex black-box models such as DNNs either generate explanations for particular inputs (local) or globally explain the entire model, whereas reverse engineering uses only the black-box inputs and outputs to figure out the model's behavior. This can be achieved either by tuning a particular comprehensible predictor or by controlled auditing of the black box using random perturbations, leveraging specific prior knowledge of the training or validation data.
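Here is the mimic-learning sketch promised in the ante-hoc item above (the teacher and student models and the synthetic data are illustrative choices, not a prescribed recipe): a complex model is trained first, and a shallow decision tree is then fitted to the complex model's predictions so that its decision rules can serve as a readable surrogate.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)

# Step 1: fit the complex, opaque teacher model for raw performance.
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Step 2: fit an interpretable student on the teacher's predictions (mimic learning).
student = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, teacher.predict(X))

print("fidelity to teacher:", accuracy_score(teacher.predict(X), student.predict(X)))
print(export_text(student))  # human-readable rules approximating the teacher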

1.3.3.8 The Big Picture

Many comparable principles, such as interpretability, explainability, and transparency, capture the degree to which humans can understand a model's internals or what features are used in a decision. In the field of DL, these and other terms are not used consistently. Researchers explain or use these terms interchangeably or with different, often contradictory meanings. Figure 1.19 presents the summary by Vilone and Longo (2020) of 196 surveyed articles organized into a defined taxonomy of interpretation. In a rudimentary sense, we can attempt to classify the techniques using the criteria listed in Table 1.4. Once we have a deeper understanding of the underlying mechanism, we can dive deeper into the IDL method and evaluate the other approaches to determine how they work. Such methods may include sensitivity analysis, feature extraction, optimization, inversion, decomposition, and many more (see also Table B.1). We then go through another approach to classifying methods. The purpose of discussing the various forms of classification is to clarify the concept of ambiguous agreement in IDL and to establish a more formal foundation for future research. In this book, we present an easy flow chart to categorize the terminologies in a compact form, as presented in Fig. 1.20. The major split occurs based on whether we interpret the model's decision or attempt to explain its internal mechanism. Analogous to the ML interpretability survey by Gilpin et al. (2018), we also look at the mathematical formulations of popular methods. This shows that there are many

Fig. 1.18 Diagrammatic overview of the steps involved in training a model in IDL. Post-hoc interpretation attempts to justify the behavior of a black-box model by poking the model from the outside for validation, while interpretable design reveals the internal workings of the model with the use of a surrogate model to explain its behavior broadly



Fig. 1.19 Categorization of interpretability methods (left) and distribution of articles across categories (right). Figure reproduced from Vilone and Longo (2020) with permission

Table 1.4 General legends for grouping numerous IDL techniques

Objective: Model Explanation, Prediction Explanation, Model Inspection, White-box Design
Black-box model: NN, SVM, Agnostic Black-box, Non-linear model, Tree Ensemble
Data: Textual, Images, Tabular, Waveform, Time Series, Other (any)
Explanator: Attribution, Decomposition, Examples, Mathematical, Perturbation, Rules, Saliency, Prototype Selection, Adversarial Attacks, Feature Relevance
Scope: Global, Local, Instance-level, Outlier Example

ways to look at interpretability. The book extends the so-called "integrated interpretability" (Došilović et al. 2018) by including considerations for subject-content-dependent models. Some expect that designing methods to explain black-box models will alleviate a part of the concerns. But following this approach instead of making models that can be understood is likely to perpetuate bad practices and cause harm to society. A way forward is to develop models that are inherently interpretable (Rudin 2019).

1.4 The Question of Interpretability

Despite numerous marketing claims about Cognitive Computing, ANNs are more like artificial intuition than intelligence. They can fill gaps creatively and make intuitive leaps to respond appropriately. Many industries are automating with DL. Such systems can take over many human activities and compute the result in moments, such as driving a car, recognizing faces, reading handwriting, and understanding and labeling

Fig. 1.20 An IDL glossary classification scheme is proposed, with a few instances provided in black font


objects in a scene. The beauty of the mathematics that pursues these monumental activities is that it is frequently well-architected but difficult to understand and interpret. Before deploying learning models in the real world, AI must solve the "black-box problem". Researchers are actively working on this, and the number of research publications on the interpretability and explainability of AI has recently increased. XAI aims to add accountability, transparency, fairness, and consistency to AI learning models. The purpose of interpretability is to bridge the gap between the "right to receive an explanation" and the "right to be informed". Rudin et al. (2019) questioned DNN models' accuracy, completeness, and reliability. In addition, a human-based evaluation is required for improved predictability and legal compliance. However, nothing is known about the precise definitions of AI terms or whether they have distinct or overlapping meanings. When these capabilities are fully developed, responsible AI will emerge (Fig. 1.23), which can be used in real-world scenarios in a variety of industries. Overall:
1. We have traditional computers that are great at calculations and processing.
2. ANNs, especially DL, give us artificial intuition and potentially super-human pattern-spotting abilities.
3. This brings us to the issue of "accurate intelligence", that is, true reasoning.
The question "How?" is now crucial. Within the concept of interpretability, the two major categories, namely perceptive interpretability and interpretability by mathematical structures, appear to present different polarities. Here's a problem with perceptive interpretability: when visual "evidence" is given incorrectly, the algorithm or method used to generate it and the underlying math may not offer clues on how to fix it. On the other hand, a mathematical analysis of patterns may provide information in multiple dimensions. These are only easily discernible when the pattern is reduced to lower dimensions, abstracting some fine-grained information that we have yet to prove is not discriminative with measurable certainty. Whether AI can be autonomous and self-sufficient enough to select its own learning domain to improve its accuracy remains an open question. AI in the workplace raises ethical concerns because, for better or worse, an AI system will reinforce what it has previously learned while also providing organizations with a variety of new functions. This is a challenge because DL algorithms, the foundation for most cutting-edge innovative products, are only as smart as their training data. Because humans choose which data is used to train programs, DL bias is a real possibility; therefore, it must be adequately addressed. Furthermore, reinforcement learning can create AI that learns but can't conceptualize situations. We won't get that until another AI revolution, which may take decades. Humans can learn across domains, whereas AI learns, performs, and delivers results that are domain-specific. A geophysicist, for example, would understand the significance of the words 'earthquakes' and 'mitochondria' just as well as a biologist, because, as humans, we have basic knowledge of various domains and are able to place keywords in appropriate contexts. Will an AI model (say, an NLP model)
that has never been trained on datasets from geophysics or biology be able to tell them apart? The significance is to address the need for understanding the methodology and evaluating the ever-increasing computational models. This serves the following purposes: (i) it allows for a greater emphasis on criterion validity, and (ii) it helps in the integration of methodological focus points. In Fig. 1.21, we illustrate the distribution of the "Need of Interpretability" into three latent dimensions: Data-driven, Task-driven, and Model-driven. Sub-categories reflect factors relevant to a latent dimension based on current objectives, for example:
1. Uncertainties, whether epistemic or aleatory, influence the data. Another consideration is the stochasticity of the training data.
2. For task-driven interpretability, we should consider the problem's scope, time constraint, user expertise, and incompleteness severity, as previously discussed.
3. The model-driven need for interpretability has gained popularity due to the NN's deep versus shallow knowledge encoding and knowledge of cognitive chunks.
This proposed schematic flowchart may be useful to beginners interested in working in the XAI industry. Our contribution could potentially bring a fresh perspective that moves forward the ongoing discussion of whether robots are truly "intelligent". This analysis is a first step toward establishing key concepts of AI and DL systems, such as dependability, equity, and accountability in the future, in the context of laws governing AI models and procedures, such as the GDPR.

1.4.1 Interpretability—Metaverse

We now have a better understanding of the importance of interpretability in modern DL. IDL isn't a single concept; it resembles many others. Interpretability and explainability don't mean very different things in this book, and most of the time they are used interchangeably to stress the same point. We begin this section with a convincing introduction stating: "We encounter multiple interpretations from diverse research domains." Interestingly, all the concepts supporting the keywords listed in Fig. 1.16 are difficult to formalize. Demanding interpretability in DL has the goal of ensuring that end-users and other stakeholders can understand algorithmic judgments and any underlying data in non-technical terms (Alameda-Pineda et al. 2019). These terms are intertwined and not exclusive, making an exhaustive list difficult. This mirrors the Google search trends cited in an IEEE Access publication, which notes the interchanging, dominant use of 'Explainable AI/ML' and 'Interpretable AI/ML' in the scientific community and public settings (Adadi and Berrada 2018). Technically, there is no standardized or widely accepted definition of XAI. Figure 1.22 shows that the need and objectives for interpretability, as well as the target audience, vary widely by user domain. User domains may include:

Fig. 1.21 Schematic flowchart for need of interpretability in latent dimension


Fig. 1.22 Interpretable AI system development-to-use timeline. The goals of an interpretable system vary between disciplines and communities of practice

• Can I, as a user, confirm the privacy assurances of a new AI system that I want to use for automatic document translation?
• Can a regulator track what led to an autonomous vehicle accident?
• What benchmarks should be used to evaluate an autonomous car company's safety claims?
• Can I, as an academic, conduct unbiased research on large-scale AI without industry computing power?
• As an AI developer, can I confirm that my rivals will not cut corners to gain an edge?

1.4.1.1 Contrasting Human Psychology and AI

Even if it is outside the book's scope, it's important to note the debate in the field of philosophy about general theories of explanation. Several proposals have been made in this area, suggesting the need for a broad, unifying theory that approximates the framework and purpose of an explanation (Liem et al. 2018). In general, the most frequently accepted theory includes numerous explanatory tactics from diverse disciplines of knowledge. Psychology and AI show similarities and differences. In both domains, prediction can be framed in terms of an input x, a mapping function f(x), and an output prediction y. The central parts of the prediction procedure and the typical conclusions differ, however. The distinction between training and testing in DL is analogous to the distinction between exploratory and confirmatory factor analysis in psychology. DL verifies a trained model, while psychology focuses on data understanding.


In psychology, the human-interpretable meaning of x and y is essential: one must ensure that x only contains psychometrically validated, measurable components that are understandable to a human; select a set of such reasonable features to go into x; understand which aspects of x turn out to be important regarding y; and understand how end-users perceive and accept y and f(x). The input choices must be driven by theory and by explicit hypotheses on significant relationships between the components of x and y. These focal points are not covered by DL. A DL expert is typically interested in understanding and improving the learning procedure: why f(x) is learned the way it is, where the x-to-y transformation sensitivities lie, and how the prediction errors made by f(x) can be avoided. In basic DL, the only thing that matters is this f(x); it does not matter where x and y came from, or how reasonable any human-interpretable relationship between them is, as long as their statistical properties are well-defined. In real-world situations, x and y will have different meanings to a person, but they are often objective measurements of the physical world, with x containing raw data with low-level, noisy sensory information. When ML is used for psychological purposes, it will take into account latent human concepts that can't be measured directly and objectively in the real world. It can be debated whether x should also be expressed at the level of latent human concepts (constructs/meaningful independent variables) when trying to predict these ideas. This would make sense to a psychologist, but a DL expert might not agree with it. On the other hand, one can take an empiricist approach, which only looks at sensory observations and tries to connect them directly to y. This would make sense to an expert in DL, but a psychologist might find it strange. As a possible compromise, if x is made up of observations of raw data, the use of hand-crafted features is like the use of variable dimensions in psychology when it comes to "constructs," even though the extracted features will be much lower in terms of their meaning.

1.4.1.2 Industrial Adoption Standard

"DL development is often opaque to those outside an organization, and barriers make it difficult to verify a developer's claims. Therefore, system attribute claims may be difficult to verify."
AI developers' concerns about disclosing information on commercial secrets, personal information, or AI systems that could be misused are, to a certain extent, legitimate. Problems arise when these concerns encourage evasion. Third-party auditors can be given privileged and secure access to private information to assess the AI developer's safety, security, privacy, and fairness claims. AI developers need processes to surface and address safety and security risks in order to make verifiable safety and security claims. "Red teaming" exercises help organizations discover their own limitations


Fig. 1.23 Characteristics of a trustworthy AI system for industrial adoption

and vulnerabilities, as well as those of the AI systems they develop, and approach them holistically. On the other hand, "bug bounties" offer a compelling and legal way to report bugs directly to affected institutions, rather than publicly exposing or selling them. While red teaming uses internal resources to identify AI system risks, bounty programs give outsiders a formal way to raise concerns. Bounties elevate the scrutiny of AI systems, increasing the likelihood of claims being verified or refuted. Bias and safety bounties would extend the bug bounty concept to AI and could help document datasets' and models' performance limitations and other properties. As a starting point for analysis and experimentation, we focus on bounties for discovering bias and safety issues in AI systems. Other properties, such as security, privacy protection, or interpretability, could also be explored, but benchmarks are still at an early stage. Improved safety metrics could increase the comparability of bounty programs and the robustness of the bounty ecosystem. There should be a way to report issues not captured well by existing metrics. However, parties not affiliated with the AI developer currently have little incentive and no formal process to report bias and safety issues, and AI systems are rarely analyzed for these properties. Talking of discrimination, some biases are easy to spot, but others require extensive research. Obermeyer et al. (2019) found racial bias in a widely used algorithm that affects millions of patients. Consumers without direct access to AI institutions have used social media and the press to highlight AI problems (Telford 2019). In short, we need a trustworthy AI system that protects privacy, is responsible, can be explained, and checks its own validity. This is depicted in Fig. 1.23, which has two sub-properties for each of its properties. This is important from a moral and methodological point of view for a smart system that serves a large number of people.

1.4.1.3 Avenues in Medical Fraternity

Most interpretability analyses in medicine target classification, but radiological practice also includes image segmentation, registration, and reconstruction. Clearly, interpretability is important in these areas too, and it should be emphasized here
as well. On one hand, existing interpretation methods should be extended to unexplored tasks. Experts can design task-specific interpretation methods. Explaining why a voxel receives a specific class label in image segmentation is more difficult than explaining which area in the input image is mainly accountable for a prediction in image classification. Interpretability of image reconstruction can be difficult. In this context, synergistic integration of data-driven priors and compressed sensing modeled priors is possible thanks to a recently established ACID framework (Fan et al. 2021). The weaknesses of preexisting deep reconstruction networks are eliminated by integrating contemporary compressed sensing with deep networks, which also brings the interpretability of model-based techniques into hybrid DNNs. Interpretability relies on medical doctors with valuable professional training despite biases and errors. Active collaboration between medical doctors, technical experts, and theoretical researchers will be essential for developing IDL methods.

1.4.1.4 The Five Ws—One H Theory

A popular technique, the five 'W's and one 'H' theory (also known as '5W1H'), is creatively reintroduced here in order to begin questioning the nature of the DL black-box. It's a great way to get the whole story or learn more about something. Every decision we make is a battle between intuition and logic. The shift from human to computer intelligence is complex. Nowadays, if you torture the data enough, it will confess to anything. Intelligence is questioning everything we think we know, not knowing anything without question. Replacing animal decision-making with technology requires interpretability and consensus. But for now, a full philosophical overview is beyond the book's scope. Interpretability, which is also called human-interpretable interpretation (HII) of a DL model, is the degree to which a person, even one who isn't an expert in DL, can understand why a model made a certain choice (the how, why, and what). Most philosophers, psychologists, and cognitive scientists in the field think that all "why" questions are, in fact, contrastive. In order to give these kinds of explanations, an explanation engine would need to be able to tell the contrasting cases apart. This would make the system easier for humans to understand (Miller et al. 2017). Most models are complicated because the problem is complex, and it's hard to explain what they're doing and why. Yoshua Bengio, a pioneer in DL research, said that with a complicated enough machine, it's hard to explain completely what it does. Uber's Jason Yosinski states that we build great models but we don't understand them, and each year the gap grows. "Why, model, why?" asks Paul Voosen to get to the heart of the matter (Voosen 2017). Figure 1.24 shows Human-Computer Interaction (HCI) with possible questions to better understand the model's workings. Weld and Bansal's challenge uses conversational explanations and user questions (Weld and Bansal 2019). An explanation may not address all user concerns. It does not raise the question "Is the system interpretable?" but more precisely asks "Who can interpret the system?"


Fig. 1.24 A hypothetical interactive explaining system that displays object classification. The graphic shows user Q&A in black and AI machine in blue. An intuitive, suggestive discourse could improve system realization (Weld and Bansal 2019)

In Fig. 1.22, the end-user or consumer seeks an explanation of the decisions and positive outcomes more than an engineer or data scientist, who looks for the system to work as designed. A business owner seeks need and purpose, while the regulator checks impact, reliability, and model compliance. This is why most research is scenario-dependent and tailored to user expertise. To obtain a robust model, we must ask diverse questions and fully formulate the problem. The goal of this section is to pause at the numerous definitions throughout the book in regard to the concept (what?), to argue the importance of interpretability in AI and DL (why?), and to introduce the general classification of IDL approaches that will drive the subsequent literature study (how?). Many questions fuel the papers (Hofman et al. 2017; Guidotti et al. 2018) proposing methods for interpreting black-box systems. These include "What does interpretable/transparent mean?", "How do you explain something?", "What is the best way to explain something?", "What kind of information about decisions is affected?", "Which kind of data record is easier to understand?", and "How much accuracy would we sacrifice for interpretability?" Later in the book, the proposed checks attempt to answer several fundamental questions regarding how to make things easier to understand (Arrieta et al. 2020), such as:
• What does the interpretation address? (or) What is in this interpretation?
• When is interpretation difficult? (or) When is manipulation to persuade users unethical?
• Where has it been, or shall it be, used?
• Who would benefit from this tool?
• Why would one use the interpretation?
• How do you handle adversity? (or) How can we balance ethics and interpretability?
We use the same theory to define the popular NN structures in Chap. 2. Figure 1.25 illustrates an example of the proposed idea.


Fig. 1.25 Questionnaires used frequently in the brainstorming process for new IDL chapters

1.4.2 Interpretability—The Right Tool

"A good explanation method should not reflect what humans attend to, but what task methods attend to."
—Poerner et al. (2018)

Developing a new system that affects the user or environment requires scientific and moral understanding. Incomplete training of DL models is characterized by mismatched objectives, flawed data collection, inscrutable data use, undesirable outcomes, or susceptibility to adversarial attacks. Figure 1.26 illustrates the multi-objective trade-off between model performance and interpretability and addresses user expertise. These are the two eyes of the user's perspective: explanations are part of the law, and one must acknowledge both eyes equally to contribute to making a responsible, robust, and accountable AI for the future.

Fig. 1.26 Responsible AI deployment requires a happy Nash equilibrium between model performance and interpretability


IDL techniques find their place in discrimination-aware data mining methods by identifying implicit correlations between protected and unprotected features. The model designer may uncover hidden correlations between input variables amenable to discrimination by analyzing how the model's output behaves with respect to the input features (Arrieta et al. 2020). Swartout and Moore's (1993) review of 'second-generation explainable expert systems' lists five general desiderata for useful explanations of AI, adding significant perspective to recent work in the field. Important principles can be learned from them:
1. Fidelity: The explanation must be a good representation of what the system actually does.
2. Understandability: This includes a number of usability factors, such as terminology, user skills, levels of abstraction, and interaction.
3. Sufficiency: The AI should be able to explain function and terminology, and it should justify decisions.
4. Low Construction Overhead: The explanation system shouldn't dominate AI design costs.
5. Efficiency: The AI shouldn't be slowed down significantly by the explanation system.

Highlight
A user should possess the following for the accountable building of a system:
1. Moral consideration.
2. Knowledge-related consideration.
3. Methodological soundness.

Benjamins et al. (2019) remind us that fairness includes proposals for bias detection in datasets that affect protected groups. Black-box models can unintentionally create unfair decisions by considering sensitive factors such as race, age, or gender (d'Alessandro et al. 2017). Unfair decisions can lead to discrimination by explicitly considering sensitive attributes or by implicitly using factors that correlate with such attributes. For instance, a credit rating based on postal code implicitly encodes a protected characteristic (Barocas and Selbst 2016). The above proposals focus on fairness, allowing researchers to find correlations between non-sensitive and sensitive variables, detect algorithmic imbalances that penalize a subgroup of people (discrimination), and mitigate bias in model decisions. The proposals address:
1. Individual fairness: Modeling differences between each subject and the rest of the population.
2. Group fairness: Addresses fairness from the standpoint of all individuals.
3. Counterfactual fairness: Attempts to interpret the sources of bias using tools such as causal graphs.
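The group-fairness notions above can be checked with only a few lines of code; the following minimal sketch (the predictions, labels, and binary sensitive attribute are toy placeholders) computes the per-subgroup quantities that correspond to the independence, separation, and sufficiency criteria of Hardt et al. (2016) discussed next.

import numpy as np

# Toy data: model predictions, ground truth, and a binary sensitive attribute.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_true = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0])
group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # e.g., two demographic subgroups

for g in (0, 1):
    m = group == g
    pos_rate = y_pred[m].mean()             # independence (demographic parity)
    tpr = y_pred[m & (y_true == 1)].mean()  # separation: true positive rate
    fpr = y_pred[m & (y_true == 0)].mean()  # separation: false positive rate
    ppv = y_true[m & (y_pred == 1)].mean()  # sufficiency (predictive rate parity)
    print(f"group {g}: P(pos)={pos_rate:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}  PPV={ppv:.2f}")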


Bias can be traced back to the following sources, as indicated in Barocas and Selbst (2016):
1. Skewed data: Bias within the data acquisition process.
2. Tainted data: Errors in data modeling, incorrect feature labeling, and other potential causes.
3. Limited features: Using too few features may result in the inference of false feature relationships, which can lead to bias.
4. Sample size disparities: When using sensitive features, differences in size between subgroups can cause bias.
5. Proxy features: Features correlated with sensitive information may cause bias even if the sensitive features are not present in the dataset.
The next question is what criteria could be used to determine when AI is unbiased. Hardt et al. (2016) present a framework for supervised ML that uses three criteria to evaluate group fairness when a sensitive feature is present in the dataset:
1. Independence: This criterion is met when the model predictions are independent of the sensitive feature. As a result, the model's proportion of positive samples (those belonging to the class of interest) is the same for all subgroups within the sensitive feature.
2. Separation: The criterion is said to be met when the model predictions are independent of the sensitive feature given the target variable. In classification models, for example, the True Positive (TP) and False Positive (FP) rates are the same in all subgroups within the sensitive feature. 'Equalized Odds' is another name for this criterion.
3. Sufficiency: When the target variable is independent of the sensitive feature given the model output, it is said to be sufficient. As a result, the Positive Predictive Value (PPV) for all subgroups within the sensitive feature is the same. 'Predictive Rate Parity' is another name for this criterion.
Although not all criteria can be met at the same time, they can be optimized together to reduce bias in the DL model. This will result in the creation of a responsible AI with the features listed in Fig. 1.27. For a more secure application in high-risk domains, this process adheres to a development cycle that promotes explanation to uncover, justify, control, and enhance the system. Let's look at some examples of how the need for explainability has emerged as a topic in different application domains of DL:
• Military: Different subjects of study can claim the beginning of the difficult field of interpretability at the same time. However, the solicitation of the DARPA project (Gunning 2017) laid a strong foundation for research and visibility. The domain lacked AI explainability, unsurprisingly. A 2017 MIT Technology Review article explains the need for explainability in autonomous military machines, whose decisions can threaten international security and people's lives. DARPA's XAI project exemplifies how AI researchers balance ethical and legal issues. Researchers and industry are expanding interpretability in international affairs, government administration, cybersecurity, and autonomous weapon systems (a.k.a. "killer robots"), which could have a devastating


Fig. 1.27 The cycle of explanation, the right tool for responsible AI

impact due to lack of liability, accountability, and legal regulation compliance (Bhuta et al. 2016).
• Healthcare: The tragic case of Raj (cf. the story 'A Suspended Thread' in this chapter) drastically exhibits the urgent need for explainable AI. Another setback for autonomous medical diagnosis by an ANN came in the mid-1990s, when an artificial neural network was used to decide the admission of pneumonia patients. Initial findings suggested that the model outperformed the classical statistical method, but an extensive study revealed that the model predicted a lower risk of dying for pneumonia patients with asthma, so they would not be hospitalized. We may find it counter-intuitive, but the study revealed the real pattern in the training samples, where asthma patients were actively treated in the ICU because of the severity of their condition, resulting in fewer deaths. Clinical AI systems must justify and account for critical decisions in today's fast-advancing healthcare systems (Che et al. 2016; Ahmad et al. 2018; Holzinger et al. 2017).
• Finance: The sector covers investment strategies, asset management, customer advice, and financial ethics. DL tools raise concerns about fair trade and data security. An overused narrative is that of 'credit-score' audits and attempts to explain borrowers' behavior. Some bigger credit agencies like Equifax and Experian are developing more reliable AI-based credit scores and auditor-friendly decision making (Adadi and Berrada 2018; Zhang et al. 2021).
• Legal: Explaining the legal decision-making model reduces recidivism risks, crime, and incarceration resources. The model must be impartial, non-discriminatory, and fair. The 2016 case study (Lightbourne 2017), State versus Loomis, raises constitutional concerns about the actuarial risk assessment tool used in Mr. Loomis' prison sentence. The case claimed that the proprietary software 'Correctional Offender Management Profiling for Alternative Sanctions' (COMPAS) violated the defendant's right to review how various inputs were weighed and incorporated gender and racial bias in its decisions (Brennan and Dieterich 2018; Tan et al. 2017). Another aspect of legal challenges could be the ever-increasing use of social media platforms and the growing need for robust copyright protection (Wichtowski 2017).


• Logistics: Autonomous vehicles (AVs) promise to reduce traffic casualties, increase mobility, and reduce logistical labor costs. Autonomous vehicle navigation is challenged by a fast-changing interactive environment, computation-intensive input variables, and the emotion-driven uncertainty of human interaction. Consider a self-driving car with classification issues. The impact could be fatal, as highlighted by Uber's shutdown of self-driving operations after an erratic emergency-braking failure in Arizona (Laris 2018). HMI research in AVs (Ekman et al. 2017), public perception of their application (Penmetsa et al. 2021), and their interpretability are all necessary to improve the systems' ability to diagnose problems, make decisions, and keep people safe (O'Sullivan et al. 2022).

Adapted from Arrieta et al. (2020), IDL system developers should take the following aspects and desired qualities into account:
1. Informativeness: Most recent IDL research publications have focused on the inner workings of learning models.
2. Transferability: The explainability of learning models leads to their reuse in various applications. However, not all transferable learning models are explainable. Transferability is the second most-cited reason for IDL research.
3. Accessibility: IDL makes debugging and developing a model more accessible to end-users. Explainability helps non-technical users understand the learning model.
4. Confidence: A confident learning model is needed to create a Responsible AI. The robustness, stability, and reliability requirements of a learning model require a confidence assessment. In finance and medicine, much research is done on evaluating a learning model's confidence.
5. Fairness: Explainability of a learning model highlights bias in the training data. As IDL involves a user in its explainability, it's important that the learning model's predictions are fair so that its decisions are just. Fairness-related IDL research focuses on ethical AI and the use of AI for social good.
6. Trustworthiness: Trustworthiness is difficult to quantify for a learning model; it is the confidence that the model will work as planned when faced with a problem. Interpretability should be a trait of a trustworthy learning model, but not all models meet this criterion.
7. Interactivity: Human-centered, AI-driven interactive systems put the end-user at the center to collaborate with the AI learning model on the intended tasks.
8. Causality: Causality in IDL means finding causal relationships between learning model variables. It leads to determining the correlations and causality of the training data. IDL's causality can be verified using causality-inference techniques.
9. Privacy awareness: The ability of non-authorized third parties to understand a learning model's inner workings may compromise the privacy of the original training data. For example, a learning system implemented in the financial sector may result in a breach of customers' private information due to the interpretability of the learning model. Confidentiality is crucial in IDL. At present, researchers' focus on privacy is weak, creating opportunities for future researchers.
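Returning to the three group-fairness criteria of Hardt et al. (2016) listed earlier in this subsection, the following minimal NumPy sketch (our illustration, not code from the cited works) computes, for a binary classifier and a binary sensitive feature, the per-group quantities behind independence, separation, and sufficiency:

import numpy as np

def fairness_report(y_true, y_pred, sensitive):
    # Per-group quantities behind the three criteria of Hardt et al. (2016):
    # positive rate (independence), TPR/FPR (separation), PPV (sufficiency).
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, sensitive))
    report = {}
    for g in np.unique(s):
        m = s == g
        tp = np.sum((y_pred == 1) & (y_true == 1) & m)
        fp = np.sum((y_pred == 1) & (y_true == 0) & m)
        report[g] = {
            "positive_rate": float(y_pred[m].mean()),
            "tpr": float(tp / max(np.sum((y_true == 1) & m), 1)),
            "fpr": float(fp / max(np.sum((y_true == 0) & m), 1)),
            "ppv": float(tp / max(np.sum((y_pred == 1) & m), 1)),
        }
    return report

# Toy usage with made-up labels, predictions, and a binary sensitive feature
print(fairness_report([1, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1]))

Comparing "positive_rate" across groups probes independence, "tpr"/"fpr" probe separation (equalized odds), and "ppv" probes sufficiency (predictive rate parity).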


1.4.3 Interpretability—The Wrong Tool

We begin by quoting Doshi-Velez and Kim (2017), who raised unanswered questions about rigorous interpretability: "Are all models in all defined-to-be-interpretable model classes equally interpretable? Are all applications interpretable?" It is critical to conduct controlled experiments to evaluate the utility of interpretability methods designed with the intent of providing actionable explanations to humans. DNNs are difficult to comprehend due to their complexity and high dimensionality. Extrapolating from intuitions about simpler, lower-dimensional systems is one approach to understanding such systems. Intuition serves as the foundation for a comprehensive understanding. For example, mapping abstract quantities into concrete sensory forms, e.g. through visualization, can also be an effective cognitive aid. However, untested intuition can be misleading. Appearances are frequently deceptive, and unchecked extrapolation can easily lead one astray (Leavitt and Morcos 2020). Every extrapolation iteration must validate the basis of one's intuition. While intuition can help identify important questions, those questions must be addressed with strong, falsifiable hypotheses. This holds true for all fields of science, including DL.

"Do not always think out of the box, it might not be worth the effort." —Authors of this Book

Thousands of black-box models are used for high-stakes decision-making, but only a handful of interpretable models exist. People prefer to explain black-box models rather than build interpretable white-box models. This attitude may perpetuate wrong practices and potentially harm society (Rudin 2019). Through AI competitions like Kaggle, we have obtained high-performing models, but model accuracy isn't always a measure of enterprise utility. Therefore, the need for interpretable AI must be justified. Transparent models are required for high-stakes applications. Post-hoc explanations can be unreliable; therefore, use a simple white-box model. However, white-box models are often less accurate than complex DNNs, and post-hoc explanations dominate research. Fears about the future effects of AI are diverting researchers' attention away from the real risks of deployed systems. Figure 1.29 depicts a similar case of sacrificing the importance of features in the race for improved model performance. We believe it's unethical to present a simplified description of a complex system to increase trust if users can't understand its limitations, and worse if the explanation is optimized to hide undesirable system attributes. Such explanations can lead to dangerous or unfounded conclusions. Therefore, much previous DL research has demanded a more rigorous concept of interpretability (Doshi-Velez and Kim 2017). Reading this far, we have understood the work that supports the notion of interpretability. Let's discuss the other side of a holistic strategy: challenging the typical approach, refining intuitive beliefs, and speculating on interpretability. It's important to understand the difference between explaining the black-box and creating interpretable models. The Gartner Hype curve for emerging technologies shows XAI


reaching its peak, while artificial general intelligence and inherently interpretable models are in the innovation-trigger phase. In 2012, Bunt et al. (2012) asked: "Are explanations always important?" Their studies showed that black-box intelligent systems are often well received despite the lack of a formal explanation. This is due to the absence of a clear definition of 'model transparency' among users, or to the 'cost versus benefit' of viewing an explanation. Kulesza et al. (2013) presented findings on how 'completeness outweighs soundness' in explanation-led users' mental models when evaluating the cost-versus-benefit trade-off of attending to an explanation. Contrary to the optimism of human-friendly explanations, oversimplification could be a curse in itself, i.e., low soundness may demand more mental exercise from the user, resulting in loss of trust in the explanation and of attention to such explanations. We must understand intuitive notions of interpretability. Works such as Wang et al. (2015) criticizing 'interpretability versus accuracy' show this; they decided to give up some ability to explain in order to remain maximally accurate. Shmueli et al. (2010) debated the same from a statistical perspective, focusing on ML. In 2017, the debate was backed by Yarkoni et al. (2017) from a psychological viewpoint, suggesting that ML models' predictions are more important than their explainability, leading to a better understanding of their behavior. Modern AI interpretability presents technical challenges. In the 1980s, interpretable expert systems leveraged knowledge bases to make assertions and encode subject-matter expertise. Traditional systems were inflexible yet powerful, and, at that time, explainable thanks to strong principles. However, their interpretability was deemed insufficient for intelligent systems (Preece 2018). Modern DL systems exploit the other end of the spectrum, being self-reliant and relying on observations of high-degree interactions between input features to create environment representations similar to human behavior. While traditional systems easily interpreted linear input-to-output transformations, modern models' multi-layer interactions and non-linearity make interpretation even more difficult. Variable hyper-parameters and internal interactions within different models for the same inputs and target accuracy support this idea.

Examples
The method gives the toaster another dimension. In 2017, Athalye et al. (2018) 3D-printed a turtle designed to be classified as a rifle from almost all angles, using TensorFlow's standard pre-trained InceptionV3 classifier. Indeed, computers see turtles as rifles! See Fig. 1.28. Most poses of the adversarially perturbed turtle were classified as a rifle after model training. The authors of the article created a 3D adversarial example for a 2D classifier that remains adversarial over transformations like turtle rotation, zooming in, etc. Other methods, like the fast gradient method, don't work when the image is rotated or the viewing angle changes. They proposed the Expectation Over Transformation (EOT) algorithm, which generates adversarial examples that survive image transformations. EOT optimizes adversarial examples across transformations while keeping the expected distance between the adversarial example and the original image below a certain threshold, given a selected distribution of possible transformations.
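To make the EOT idea concrete, here is a toy NumPy sketch of the core loop: gradients of a target-class loss are averaged over sampled transformations and the perturbation is kept within a small budget. The linear "classifier", the random orthogonal transformations, and all sizes are invented for illustration; this is not Athalye et al.'s implementation.

import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
W = rng.normal(size=(k, d))           # toy linear classifier: logits = W @ x
x = rng.normal(size=d)                # original input
target = 2                            # class the attacker wants

def sample_transforms(n):
    # Stand-ins for rotations/zooms: random orthogonal matrices
    for _ in range(n):
        q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        yield q

x_adv, eps, lr = x.copy(), 0.5, 0.05
for step in range(200):
    grad = np.zeros(d)
    for T in sample_transforms(16):
        logits = W @ (T @ x_adv)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Gradient of the cross-entropy toward the target class, pulled back through T
        grad += T.T @ (W.T @ (p - np.eye(k)[target]))
    x_adv -= lr * grad / 16                    # expectation over transformations
    x_adv = x + np.clip(x_adv - x, -eps, eps)  # keep the perturbation budget small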


Fig. 1.28 Randomly selected poses of a 3D-printed turtle that has been adversarially perturbed to be classified as a rifle from every angle. The unperturbed model input is correctly identified as a turtle almost all of the time. Figure adapted from Athalye et al. (2018) with permission

Another case study used Grad-CAM visualizations of a model's predictions (Selvaraju et al. 2017) to show that the model had learned to look at the person's face/hairstyle to distinguish nurses from doctors, thus learning a gender stereotype. Several female doctors were misclassified as nurses, and male nurses as doctors. This is problematic. It turns out that the image search results were biased toward men (78% of the images for doctors were men, and 93% of the images for nurses were women).

 THINK IT OVER
How would you tackle the systematic instability of automated explanations?
Many accurate DL models have the same input features but slightly different internal interactions. Automated explanations may lack consensus in their details.

When it comes to the importance of interpretability and its acceptance by academics and practitioners, we must admit that not everyone will recognize the critical need for an interpretable AI system. Outside of a few potential hints at something deeper, ANNs do not appear to generate original concepts or ideas or perform abstract reasoning at this time. Few tasks or human roles require this mental function. Most people are 'Cooks, not Chefs', businesspeople rather than entrepreneurs, and they haven't been taught how to reason from first principles either. P. Norvig, Google's research director and author of 'Artificial Intelligence: A Modern Approach,' questioned interpretability in 2018. He expressed that people aren't good at explaining their decisions and that an AI prediction's credibility can be determined by observing its output over time. According to J. Pesenti, Facebook's head of AI, DL can propagate human biases, is difficult to explain, lacks common sense, and is more pattern matching than semantic understanding. Another report (Harwell 2019) confirmed the racial bias of facial recognition software. Uber's self-driving cars didn't recognize a pedestrian outside a crosswalk, with deadly consequences. Yoshua Bengio, an AI researcher who has been critical of DL's limitations, told IEEE Spectrum in an interview that


he’s not against it. “Researchers are looking for areas that aren’t working so we can add and explore.”

 THINK IT OVER
Science of Interpretability Learning: Open Challenges
I. Theoretical:
1. Integrity of problem formulation.
2. Formalization of a standardized shared language.
3. Specificity of the assessment claim.
4. The extent of stochasticity as perceived by humans.
5. Accountability for the unavoidable bias in data.
6. Clarity of explanation.

II. Practical:
1. Measurement of uncertainty in real-world interaction.
2. Evaluation of the interpretability level.
3. A technique to compare explanations.
4. Spectrum of human comprehension.
5. Knowledge and expertise specific to the task.
6. Time constraints for the explanation.

Connecting links: Interpretability taxonomy, quantification, techniques, literature review.

To evaluate the quality and suitability of competing explanation methods, practitioners need principled guidelines. Transparency as a means to an end can be dangerous; it can create a divergence between the audience and the claimant. Figure 1.29 provides a self-explanatory illustration of the context of consensus in AI. Data scientists often misunderstand IDL by treating model interpretation as a placebo. In the quest for trustworthy AI, we expect stakeholders to think like us and provide a single explanation. They feel "what's said will be heard." IDL is often seen as a replacement for aides, undervaluing its explanation-friendly features. Often, developers optimize model performance over enterprise utility, assuming rather than demonstrating IDL generalizability. To avoid this trap, explanations should balance readability and depth. Instead of simple descriptions, systems should allow for more detail and completeness at the cost of readability. Explanation methods should be evaluated on a curve from maximum interpretability to maximum completeness, not at a single point. We must learn that prioritizing performance over quality may end in results similar to the illustration below.


Fig. 1.29 An illustration of unclear consensus in AI where we all chase accuracy improvement analogous to quickly pushing the block to the finishing line. Many often end up sacrificing features that we feel are less relevant and fine-tuning the model for better performance. This may lead to the implementation of unlawful models, incorrect predictive bias, and a lack of accountability for use

 Highlight
We will conclude this section with four key takeaways:
1. Develop hypotheses that are specific, testable, and falsifiable.
2. Recall the word "human" in human interpretability. If a method's goal is to help humans understand DNNs, it should be explicitly tested.
3. Quantify wherever possible. An unquantifiable hypothesis runs the risk of being unfalsifiable.
4. Be cautious of visualization! Skepticism should be proportionate to how intuitive a visualization appears.

Summary
As we proceed, most of what is in the book may not be new to someone working in the domain of interpretability. However, it's important to organize the vast literature on a topic that we believe is urgent and needs to be summarized for a better understanding. Formalism of interpretability is a multifaceted goal that requires the attention of a variety of disciplines. However, the synergetic development of methods must be performed correctly. Our goal isn't to compile a complete list of interpretable methods, but rather to provide a holistic survey of works deemed to be of significant importance in the field of DL, with high citation levels or good coverage of interpretability strategies. This will serve as the foundation for the systematic formalization of definitions. To


advance the field, it is critical to rigorously generalize the interpretability approach, pushing the advancement of SOTA interpretability.

Reading List
1. The 'Deep Learning' (MIT Press) textbook by Goodfellow, Bengio and Courville (2017), covering a wide range of mathematical, technical and conceptual DL techniques with industry and research perspectives to get a keen understanding of the field.
2. A good insight into the basics of DL from 2015 by LeCun et al. (2015) and Schmidhuber (2015).

Self-assessment
1. Describe the basic notion of the interpretability of your work from three different disciplines.
2. What is the exhaustive diagnostic list to check the interpretability level of your newly created model?
3. Do linear models create good explanations? Support your theory with an example.
4. Name some example-based explanation applications that can be useful for your task.
5. Classify the popular interpretability methods based on the taxonomy provided in Sect. 1.3.3.
6. Using an example, show how a recent DL system's explainability operated poorly.

Chapter 2

Neural Networks for Deep Learning

The chapter commences with a prevalent phrase in the modern era: "AI will take over the world!" (Crawford and Calo 2016). There are two major interpretations of the phrase. The first is to acknowledge AI as a developing technology with the ability to cater to larger organizations, automate globally, and streamline inefficient procedures. The other is a tyrant AI that is beyond human control and will destroy the whole human race. We hope that the majority of the class is in the first half. For those who are still perplexed, let us endeavor to change your notion. The word 'AI' has become ubiquitous in recent years. In AI, there are several subfields, some of which overlap depending on the categories we use. An important division is the one between ML and DL. Figure 2.1 presents the preliminary distinction between these interlinked specializations. This will provide a comprehensive understanding of DL and the requirement to study IDL. For consistency and relevance, we will consolidate DL and neural-related architectures by their capabilities and interpretability practices. The non-linear functions or units applied to the incoming data are at the core of neural learning. Linear combinations of these units can be used to model/approximate non-linear curves of any shape. However, the other side of the coin makes things trickier: the effect of each individual non-linear activity is difficult to comprehend. To get insight into the massive field of DL and to argue the interpretable nature of



Fig. 2.1 Evolutionary differences between ML and DL

the models, we will investigate some learning paradigms on these various networks and how they work.

 THINK IT OVER
What is the difference between DL and ML?
Both DL and ML are subsets of the larger field of AI. ML is a subfield of AI that teaches computers to perform tasks with minimal human intervention. DL is a highly sophisticated subfield of ML that uses ANNs with parametric layers to mimic the human brain. ML needs structured data and human input. DL, on the other hand, can interpret more unstructured data with minimal to no human intervention.

2.1 Neural Network Architectures

The vocabulary intuitively resembles how the brain solves large data-driven problems. Layered NNs are the functional blocks of DL. These layered artificial neurons can accept one or many inputs, turning these inputs into meaningful output. Every neuron influences the others in a mesh-like feedforward arrangement. The network tries to uncover relevance in how data are related, deciphering complicated patterns in large volumes. In Fig. 2.2, input, hidden, and output layers separate the network (left to right across the network depth). The nodes or neurons in the layers have associated weights and biases. A non-linear activation transforms inputs to outputs for the next layer. For people from non-computer science backgrounds to stay connected in upcoming


Fig. 2.2 An overview of the computational architecture of neurons. a Perceptron analogy with a human nerve cell's dendrites and axons. The value within a unit defines the input feature, and the value associated with the connection between any two units represents the network weight applied to the input node to produce the output after the appropriate activation. b A schematic NN with two hidden layers and intermediate connections exhibits forward and backward weight propagation


discussions, it is necessary to recall some basic terms and components of a NN architecture.
1. Input Layer: Data from an external source like a CSV file or online service is loaded into the input layer. It's the only visible layer in a neural network, and it transfers external data without computation. For example, for 'object detection' the input can be an array of pixel values representing an image.
2. Hidden Layers: Hidden layers add to the depth of neural learning. They are intermediate layers that perform all computations and extract data features. Searching for different hidden features in the data can be handled by many interconnected hidden layers. In image processing, for example, the initial hidden layers are in charge of low-level features such as edges, shapes, or borders, while the later hidden layers execute more complex tasks such as recognizing whole objects (say, a car, a building, a person). Note that in some parts of the book layers are stacked from bottom to top, while in others they are stacked from left to right; the direction is based on the context of the network under reference.

 Highlight
Typically, all hidden layers use the same activation function. The output layer, on the other hand, will often use an activation function different from that of the hidden layers. The decision is influenced by the model's purpose and type of prediction.

3. Output Layer: The output layer uses the model's learning to provide a final prediction. It's the final, most important layer, because it is where we acquire the ultimate output. The output layer in classification/regression models typically consists of a single node. It is, however, entirely problem-specific and thus dependent on how the model was developed.
4. Weight: Each layer of the network contains intermediate nodes with associated learning parameters called weights. Their sole function is to prioritize those features that contribute the most to learning. A scalar multiplication is applied between the input value and the weight matrix. A negative word, for example, would influence the decision of a sentiment analysis model more than a pair of neutral terms.
5. Bias: Each layer with associated learning nodes has some bias attached to it. The bias shifts the value produced by the activation function. With a linear activation function, the bias b acts as a constant offset.
6. Transfer Function: Each node has a transfer function. The transfer function combines several input values into one output value that serves as input for the activation function of the node. It is achieved by a simple summation of all inputs to the transfer function.
7. Activation Function: The output of each node is produced by passing the transfer function's result through an activation function. It is the activation function which introduces non-linearity in perceptrons (refer to Sect. 2.1.1 for more information). Without the non-linear activation


function, the result would be a linear mapping, unable to introduce non-linearity into the network (Fig. 2.2).

 Highlight
The general guide to training a neural network can be summarized as:
1. Select the model architecture, initial weights, hyperparameters, activations, and loss function.
2. Process the training data and shuffle the input pairs.
3. Run the NN in the configured batch setting for the required epochs.
4. Calculate the loss and the gradients with respect to the corresponding weights.
5. Update the weights via backpropagation for each pass.
6. Repeat learning for the necessary iterations in order to minimize the loss.
7. Test on the testing set after freezing the model weights.

Before attempting to crack the explainability of NNs, it is beneficial to quickly analyze the big picture of DL training. This chapter can be consulted if someone loses sight of the overarching goal. Bear in mind that the goal is always to fine-tune the network settings for incoming inputs so that the end result is more precise.
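As a concrete, deliberately tiny illustration of this recipe, the sketch below trains a single sigmoid neuron with plain gradient descent on a made-up binary-classification set; every value here is an arbitrary choice for demonstration, not a recommended configuration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # step 2: training data
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b, lr, epochs = rng.normal(size=2), 0.0, 0.1, 50    # step 1: weights, hyperparameters

for epoch in range(epochs):                            # step 6: repeat for the epochs
    for i in rng.permutation(len(X)):                  # step 2: shuffle each epoch
        p = 1 / (1 + np.exp(-(X[i] @ w + b)))          # step 3: forward pass
        grad = p - y[i]                                # step 4: cross-entropy gradient
        w -= lr * grad * X[i]                          # step 5: update the weights
        b -= lr * grad
acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print("training accuracy:", acc)                       # step 7: evaluate (here on the training set for brevity)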

2.1.1 Perceptron

In 1957, Frank Rosenblatt, an American psychologist, introduced the world to the oldest type of traditional NN, known as the Perceptron (Rosenblatt 1957). This most basic NN takes numerous inputs, performs a linear mathematical operation with weights assigned to each input, and then passes the result through a non-linear activation function to produce an output. Initially, the perceptron employs a binary step activation function with a threshold parameter θ to determine whether $\sum_i w_i x_i - \theta > 0$ is 'true' or 'false'. This results in the hyperplane equation $w_1 x_1 + w_2 x_2 + \dots + w_n x_n - \theta = 0$. This works well for data that are linearly separable, which is why the perceptron is also known as a linear classifier (Menzies et al. 2014). The network architecture is presented in Fig. 2.3. Mathematically, the operation on a single unit or perceptron can be defined as:

$$y^{(1)}(x) = f\left(\sum_{i=1}^{N} w_i^{(1)} x_i + b^{(1)}\right) = \sigma(W^{\top} X + b) \qquad (2.1)$$

In Python, Eq. 2.1 can be performed in a few lines using the NumPy package, with y being the output, f(.) being the activation function, w.T the transpose of the weight matrix, and b the neuron's bias.

import numpy as np

def f(x):
    # Binary-step activation function
    return np.heaviside(x, 1)

# Illustrative example values for the input vector, weight matrix, and bias
input, w, b = np.array([0.5, -1.2, 0.3]), np.array([[0.4, 0.1, -0.6]]), 0.2
y = f(np.dot(input, w.T) + b)

Despite their simplicity, perceptrons are considered the first NN wave (Gurney 2018). They set the stage for DL's progress. Publications in the 1970s criticized the perceptrons' inadequacies (Minsky and Papert 1969). Later, the perceptron units were combined in a feedforward fashion to form more extensive ANNs.

 THINK IT OVER
How many neural nodes in a single-hidden-layer NN must be used to approximate or represent an arbitrary decision?
People may question how many neural nodes or univariate summation terms are needed to obtain a particular approximation quality. Based on the curse of dimensionality, we believe that most approximation problems will require an astronomical number of terms. Cybenko (1989) presented a mathematical understanding of the problem and the approximation properties of sigmoidal and other potential non-linear functions for approximating any continuous function of n real variables in a unit hypercube. Work on function estimation in sequential learning (Kadirkamanathan and Niranjan 1993) highlights the potential of continuous feedforward NNs with a single hidden layer and continuous sigmoidal non-linearity to approximate an arbitrary decision region arbitrarily well.
Connecting Links: Approximation, completeness, superposition theorem, univariate function.
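To see the flavor of these results empirically, the rough sketch below (randomly chosen hidden weights with output weights fitted by least squares, which is our own illustrative setup and not Cybenko's construction) shows the approximation error of a single sigmoid hidden layer shrinking as units are added:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]
target = np.sin(2 * x[:, 0]) + 0.3 * x[:, 0] ** 2      # an arbitrary 1-D target function

for n_hidden in (2, 8, 32, 128):
    W = rng.normal(scale=3.0, size=(1, n_hidden))      # random input-to-hidden weights
    b = rng.uniform(-3, 3, size=n_hidden)
    H = 1 / (1 + np.exp(-(x @ W + b)))                 # hidden sigmoid features
    v, *_ = np.linalg.lstsq(H, target, rcond=None)     # least-squares output weights
    err = np.max(np.abs(H @ v - target))
    print(f"{n_hidden:4d} hidden units -> max abs error {err:.3f}")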

2.1.2 Artificial Neural Networks

ANNs, alternatively called vanilla NNs, feedforward NNs, or Multi-Layer Perceptrons (MLPs), consist of neurons stacked in rows and coupled in fully-connected feedforward layers. The neurons are programmed to operate in a binary on-off toggle mode, accepting input data and completing simple operations before transferring the output forward through the network from left to right (Leverington 2009). To avoid the tiresome effort of manually establishing the weights of the network, Rosenblatt devised an algorithm for the automatic learning of vanilla networks, which he referred to as the delta learning rule and which was later formalized as gradient descent learning. More on the significance of gradient descent learning, as well as many other useful references, is provided further down in the chapter.

Fig. 2.3 Schematic of a perceptron with a single hidden neuron and weight backpropagation for neural learning. The prediction task determines whether to choose a discrete or a continuous transfer function


Fig. 2.4 Simple ANN architectural design a Information flow across the network in a feedforward fashion with a single hidden layer in between, and b Information flow in a backward pass for a given node in the network

 Highlight
Activation functions drive ANNs. They enable the NN to utilize relevant input while suppressing unnecessary data points. This segregation plays a key role in helping a NN function effectively and guarantees that it learns from useful information and doesn't get stuck analyzing information that isn't useful.

ANNs are frequently employed in a wide range of activities, including image processing, spam detection, and financial analysis. They are most effective with:
1. CSV-formatted tabular datasets with rows and columns.
2. Primarily supervised learning tasks using non-sequential, non-time-dependent data.
3. Real-world classification and regression challenges.
4. Models with high flexibility.

 Highlight
MLP is the most commonly used alternate term for feedforward NNs. This is not strictly right, because the original perceptron is a single-layer network with a discontinuous binary-step non-linear activation, as opposed to the continuous non-linear activation functions utilized here.

Let's review the mathematical definition behind feedforward NNs for supervised and unsupervised learning. Here's a two-layer feedforward NN with O(.) as the final


output prediction function and h(.) as the hidden unit calculation, described in Fig. 2.4. Functions F and f are the respective activation functions for the layers. A summary of the model's equations:

$$O(x) = F\left(W^{(2)} h(x) + b^{(2)}\right), \qquad h(x) = \sigma(x) = f\left(W^{(1)} x + b^{(1)}\right) \qquad (2.2)$$

 Step 1: Forward propagation from input-to-hidden units. The J linear combinations z_j^{(1)} of the d-dimensional inputs in the NN's first layer are propagated to the next layer. Here, the arbitrary range of J is determined by the number of hidden units in the respective layer:

$$z_j^{(1)} = \sum_{i=0}^{d} w_{ji}^{(1)} x_i, \qquad j = 1, 2, \ldots, J \qquad (2.3)$$

where z_j^{(1)} are the activations, w_{ji}^{(1)} are the parameter weights, and w_0^{(1)} is the bias with x_0 = 1. The superscript (1) denotes the network's first-layer parameters. Each of the above summations is then converted by a non-linear activation function to produce the matching hidden unit outputs h_j; these are called hidden because neither the problem input nor the target output for training specifies their values. There may be numerous hidden layers between the input and output layers, resulting in a deep network. The tanh function, with tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}), or the logistic sigmoid function, with sigmoid(x) = 1/(1 + e^{−x}), are common alternatives for f(.). Equation 2.4 shows the choice of activation function σ_1, which is commonly a sigmoid:

$$h_j = \sigma_1(z_j^{(1)}) = \frac{1}{1 + e^{-z_j^{(1)}}} \qquad (2.4)$$

 Step 2: Forward propagation from hidden-to-output units. The outputs h_j of the hidden units are linearly combined in the second layer to give the activations z_k^{(2)} of the K output units:

$$z_k^{(2)} = \sum_{j=0}^{J} w_{kj}^{(2)} h_j, \qquad k = 1, 2, \ldots, K \qquad (2.5)$$

where w_{kj}^{(2)} is the weight parameter and h_0 = 1 is the bias for the second layer, as in Eq. 2.3. A sigmoid is employed again to compute the final outputs y_k as follows:

$$y_k = \sigma_2(z_k^{(2)}) = \frac{1}{1 + e^{-z_k^{(2)}}} \qquad (2.6)$$


However, in the case of a multiclass problem, a softmax activation is utilized (Eq. 2.7) to keep the multiclass output within the range [0, 1].

$$\sigma_2(z_k^{(2)}) = \frac{e^{z_k^{(2)}}}{\sum_{l=1}^{K} e^{z_l^{(2)}}} \qquad (2.7)$$

Equation 2.8 shows the entire feedforward propagation from the input layer to the output layer.

$$y_k = \sigma_2\left(\sum_{j=0}^{J} w_{kj}^{(2)} \, \sigma_1\Big(\sum_{i=0}^{d} w_{ji}^{(1)} x_i\Big)\right), \qquad k = 1, 2, \ldots, K \qquad (2.8)$$
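As a quick illustration, Eq. 2.8 can be transcribed almost directly into NumPy. The sizes below are made up, and the bias is kept explicit rather than folded into w_{j0} with x_0 = 1 as in the text:

import numpy as np

rng = np.random.default_rng(0)
d, J, K = 3, 5, 2                                 # assumed layer sizes
x = rng.normal(size=d)
W1, b1 = rng.normal(size=(J, d)), np.zeros(J)     # input-to-hidden parameters
W2, b2 = rng.normal(size=(K, J)), np.zeros(K)     # hidden-to-output parameters

sigmoid = lambda z: 1 / (1 + np.exp(-z))
h = sigmoid(W1 @ x + b1)        # Eqs. 2.3 and 2.4
y = sigmoid(W2 @ h + b2)        # Eqs. 2.5 and 2.6
print(y)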

 Step 3: Evaluation of the error function for network optimization. The gradient descent algorithm is capable of incorporating the network weights and biases during model training. This entails defining the loss function L and assessing the loss in relation to the model weights. This is achieved via the 'chain rule of differentiation,' often known as backpropagation of error or just Backprop.

 Highlight
For a single-layer network, the loss gradient is the product of the derivatives of the loss for the different input-to-output weight values. This is true for the hidden-to-output weights w_{kj}^{(2)}, with an available target value at the output and hidden unit values that function as a pseudo input. However, the same situation does not apply to training the input-to-hidden weights w_{ji}^{(1)}, since the target value for the hidden unit is missing. This is also known as the credit assignment problem (Richards and Lillicrap 2019), because it makes exact training of the input-to-hidden weights impossible.

We wonder what the error is at the hidden units and how the input-to-hidden weights affect the overall loss. The solution is to deduce the relevant derivatives systematically using the chain rule of differentiation. The loss function L for the given architecture is defined in Fig. 2.4. The sum-of-squares loss function is obtained by summing over the training set of N samples with the target label ŷ_{nk}:

$$L = \sum_{n=1}^{N} e_n, \qquad e_n = \frac{1}{2}\sum_{k=1}^{K}\left(y_{nk} - \hat{y}_{nk}\right)^2 \qquad (2.9)$$

The value of y_{nk} for each pattern must be computed using the ANN feedforward propagation specified in Eq. 2.8. The overall gradient is calculated by summing over the


N training samples as follows:

$$\frac{\delta L}{\delta w_{kj}^{(2)}} = \sum_{n=1}^{N} \frac{\delta e_n}{\delta w_{kj}^{(2)}} \qquad (2.10)$$

$$\frac{\delta L}{\delta w_{ji}^{(1)}} = \sum_{n=1}^{N} \frac{\delta e_n}{\delta w_{ji}^{(1)}} \qquad (2.11)$$

 Step 4: Backpropagation of error from hidden-to-output weights. Begin by computing the loss gradient for the hidden-to-output weights, δL/δw_{kj}^{(2)}, using the chain rule of differentiation and putting Eq. 2.10 in terms of the weights of this layer as follows:

$$\frac{\delta e_n}{\delta w_{kj}^{(2)}} = \frac{\delta e_n}{\delta z_{nk}^{(2)}} \cdot \frac{\delta z_{nk}^{(2)}}{\delta w_{kj}^{(2)}} = \delta_{nk}\, h_{nj} \qquad (2.12)$$

Here, δ_{nk} denotes the gradient of the error e_n with respect to the activation z_{nk}^{(2)}, similar to what a single-layer neural network computes; the second factor is simply the hidden unit activation within the layer.

$$e_n = \frac{1}{2}\sum_{k=1}^{K}\left(\sigma_2(z_{nk}^{(2)}) - \hat{y}_{nk}\right)^2 = \frac{1}{2}\sum_{k=1}^{K}\left(\sigma_2\Big(\sum_{j=0}^{J} w_{kj}^{(2)} h_{nj}\Big) - \hat{y}_{nk}\right)^2 \qquad (2.13)$$

We thus have δ_{nk}, which is similar to that of the single-layer neural network with non-linear activation, in Eq. 2.14:

$$\delta_{nk} = \frac{\delta e_n}{\delta y_{nk}} \cdot \frac{\delta y_{nk}}{\delta z_{nk}^{(2)}} = (y_{nk} - \hat{y}_{nk})\, \sigma_2'(z_{nk}^{(2)}) \qquad (2.14)$$

 Step 5: Backpropagation of error from input-to-hidden weights. The loss gradient δL/δw_{ji}^{(1)} for the input-to-hidden weights is now calculated by Eq. 2.11. This must be done carefully to determine how the hidden unit j may influence the loss. We begin by examining the error signal in the hidden unit j, which is comparable to the one in Eq. 2.14, as follows:

$$\delta_{nj} = \frac{\delta e_n}{\delta z_{nj}^{(1)}} = \sum_{k=1}^{K} \frac{\delta e_n}{\delta z_{nk}^{(2)}} \cdot \frac{\delta z_{nk}^{(2)}}{\delta z_{nj}^{(1)}} = \sum_{k=1}^{K} \delta_{nk}\, \frac{\delta z_{nk}^{(2)}}{\delta z_{nj}^{(1)}} \qquad (2.15)$$


Surprisingly, any hidden unit j has the capacity to influence the loss via all associated output units. To aggregate the contributions of all output units to δ_{nj}, we seek the formula for δz_{nk}^{(2)}/δz_{nj}^{(1)}, derived by differentiating Eq. 2.5 and the hidden unit activations in Eq. 2.4:

$$\frac{\delta z_{nk}^{(2)}}{\delta z_{nj}^{(1)}} = \frac{\delta z_{nk}^{(2)}}{\delta h_{nj}} \cdot \frac{\delta h_{nj}}{\delta z_{nj}^{(1)}} = w_{kj}^{(2)}\, \sigma_1'(z_{nj}^{(1)}) \qquad (2.16)$$

When this is substituted in Eq. 2.15, it yields the well-known backprop equation. Using the chain rule of derivatives, we acquire the delta value for the hidden units by backpropagating the delta value of each output with the appropriate weights from the hidden-to-output weight matrix. Figure 2.4b depicts the visual representation of two-layer weight matrix learning.

$$\delta_{nj} = \sigma_1'(z_{nj}^{(1)}) \sum_{k=1}^{K} \delta_{nk}\, w_{kj}^{(2)} \qquad (2.17)$$

Finally, in Eq. 2.11, the derivative of the input-to-hidden weights is obtained as:

$$\frac{\delta e_n}{\delta w_{ji}^{(1)}} = \frac{\delta e_n}{\delta z_{nj}^{(1)}} \cdot \frac{\delta z_{nj}^{(1)}}{\delta w_{ji}^{(1)}} = \delta_{nj}\, x_i \qquad (2.18)$$

It is worth noting that this strategy can be applied iteratively to an increasing number of hidden layers, resulting in a DNN for sophisticated computing.

Algorithm Summary
From the above derivation, the Backpropagation algorithm is summarized as:
1. Feedforward the input vector x_d from the training set through the network using Eq. 2.8, and retrieve the modified output vector y_n.
2. Using Eq. 2.13, compute the loss L against the target label ŷ_n.
3. Calculate the error signals δ_{nk} for the output units using Eq. 2.14.
4. Backpropagate the error to compute the error signals δ_{nj} for the hidden units with Eq. 2.17.
5. Obtain the overall derivatives of Eqs. 2.10 and 2.11 using Eqs. 2.12 and 2.18, respectively.
Connecting Links: Sect. 2.2.3.

The following is a simple Python implementation of an ANN:

import numpy as np


class NeuralNetwork:
    # Initialize the feedforward network
    def __init__(self, units_per_layer_list, iterations, learning_rate):
        self.weights = []
        self.bias = []
        self.z = []      # (weights * input) + bias for the units in each layer
        self.a = []      # each layer's activation function applied to the summed input
        self.alpha = learning_rate
        self.iterations = iterations
        u = len(units_per_layer_list)
        for i in range(u - 1):
            self.weights.append(np.random.rand(units_per_layer_list[i + 1],
                                               units_per_layer_list[i]))
        for i in range(1, u):
            self.bias.append(np.random.rand(units_per_layer_list[i]))

    def sigmoid(self, x):
        # Activation function, here the sigmoid
        return 1 / (1 + np.exp(-x))

    def derivative_sigmoid(self, x):
        # Derivative of the sigmoid activation
        return self.sigmoid(x) * (1 - self.sigmoid(x))

    def forward(self, input, label):
        input = np.asarray(input, dtype=float)
        if len(input) != self.weights[0].shape[1]:
            raise Exception('Invalid input size')
        self.input = input          # keep the raw input for backpropagation
        output = input
        for w, b in zip(self.weights, self.bias):
            z = np.dot(output, w.T) + b
            self.z.append(z)
            output = self.sigmoid(z)
            self.a.append(output)
        self.error = 0.5 * np.sum(np.power(output - label, 2))
        self.derivative_error = output - label
        return output

    def backpropagation(self):
        # Activations feeding each weight matrix: the raw input, then the hidden outputs
        layer_inputs = [self.input] + self.a[:-1]
        delta = self.derivative_error * self.derivative_sigmoid(self.z[-1])
        for i in reversed(range(len(self.weights))):
            self.weights[i] -= self.alpha * np.outer(delta, layer_inputs[i])
            self.bias[i] -= self.alpha * delta
            if i > 0:
                delta = np.dot(self.weights[i].T, delta) * self.derivative_sigmoid(self.z[i - 1])

    def train(self, input_data, input_labels):
        n = len(input_data)
        for it in range(self.iterations):
            average_error = 0
            for data, label in zip(input_data, input_labels):
                self.forward(data, label)
                average_error = average_error + self.error
                self.backpropagation()
                self.a = []
                self.z = []
            print("iteration #{} Error: {}".format(it, average_error / n))
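A hypothetical usage of the class above follows; the XOR data, layer sizes, and hyperparameters are our own choices and may need tuning to converge:

X = [np.array([0, 0]), np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
Y = [np.array([0.0]), np.array([1.0]), np.array([1.0]), np.array([0.0])]

net = NeuralNetwork(units_per_layer_list=[2, 4, 1], iterations=5000, learning_rate=0.5)
net.train(X, Y)
print([net.forward(x, y).item() for x, y in zip(X, Y)])   # predictions after training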

Challenges with ANNs
It will soon be obvious that the model suffers from the curse of dimensionality, which raises the demand for an optimized neural architecture.


• Take on the challenge of utilizing ANNs to tackle an image categorization problem. Prior to training the model, images (or volumes) must be converted into a 1-D vector. First, as the image size increases, the number of trainable parameters increases dramatically. A 128 × 128 sample image to be processed by an ANN with five hidden units in its first layer yields a whopping 327,680 trainable parameters. That's not all; the ANN also loses the image's spatial pixel relationships. Later in the chapter, we'll look deeper into spatial properties.
• The vanishing or exploding gradient linked to the backpropagation process is the next frequent problem in all NNs. In this situation, raising the ANN's depth (hidden layers) increases the number of trainable weight parameters, which raises the likelihood of the gradient vanishing or exploding.
• ANNs are incapable of capturing sequential information from input data. RNNs will be introduced soon to overcome this inability of ANNs.

2.1.2.1 Radial Basis Function Network

A Radial Basis Function (RBF) network is a single-layer ANN that employs the radial basis activation function defined in Eq. 2.19. These networks differ fundamentally from most DNNs: they include only a non-computational input layer, a single hidden layer, and an output layer. The computation in the hidden layer also differs greatly from that of most NNs. The network thereby reduces the repetitive feedforward application of the non-linearity of an activation function.

$$z_i = e^{-\frac{\|X - \mu_i\|^2}{2\sigma_i^2}}, \qquad y = \sum_{i}^{N} w_i z_i \qquad (2.19)$$

with μ_i as the prototype vector and σ_i as the bandwidth of the i-th neuron, w_i as the connecting weights, z_i as the neuron's output from the hidden layer, and y as the output prediction. The parameters w_i are learned in a supervised manner (e.g., by gradient descent) and can be utilized for classification and regression problems.

 THINK IT OVER
What if the number of hidden neurons is equal to the number of training set samples?
The model in this situation is roughly similar to kernel learners such as kernel SVMs or kernel regression.


Even if an RBF network's output layer is the ultimate output, it can be layered with other NNs, e.g., by replacing the output layer with a multi-layer perceptron and training the network end-to-end. One point of emphasis is the requirement for a greater number of units in the hidden layer than in the input layer in order to arbitrarily approximate a decision region, based on Cover's theorem. The total number of hidden nodes or neurons must be equal to or fewer than the total number of training samples. The output layer employs a linear activation function, or can be considered as having no activation unit.

 Highlight
According to Cover's theorem (Cover 1965), for a complicated pattern-classification task, a pattern cast in a high-dimensional space with a non-linear transformation is more likely to be linearly separable than a pattern cast in a low-dimensional space, provided the space is not highly populated.

The calculation in the hidden units is determined by the similarity between the prototype vector derived from the training samples and the input vector. The bandwidth σ_i and the prototype vector μ_i of each neuron are learned in an unsupervised manner, for example by employing a clustering technique.
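A minimal NumPy sketch of the RBF forward pass in Eq. 2.19 follows, assuming the prototypes μ_i have already been obtained (e.g., by a clustering step) and the bandwidths σ_i chosen heuristically; all values here are illustrative:

import numpy as np

def rbf_forward(X, mu, sigma, w):
    # X: (n, d) inputs, mu: (m, d) prototypes, sigma: (m,) bandwidths, w: (m,) output weights
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # squared distances to prototypes
    z = np.exp(-d2 / (2.0 * sigma ** 2))                    # hidden activations (Eq. 2.19)
    return z @ w                                            # linear output layer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
mu = X[rng.choice(100, 10, replace=False)]   # crude stand-in for clustered prototypes
sigma = np.full(10, 1.0)
w = rng.normal(size=10)
y = rbf_forward(X, mu, sigma, w)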

2.1.3 Recurrent Neural Networks

Extending our understanding of perceptrons, we now know that classic feedforward networks are ineffective for sequential learning. In general, network learning (which is mostly supervised) begins with an input variable X. It learns the mapping over a non-linear optimization function to forecast the dependent target variable Y. To anticipate the desired dependent goal, we may now introduce a new independent variable of the same type. This is the moment to consider whether the order of the dependent variable matters in the learning process. Let us illustrate the concept in Fig. 2.5 using the 'Recurr-Ant Problem' in learning.

 Definition
Sequential learning (Robins 2004) is the study of learning from input information arriving in separate episodes over time. Conceptual drift is the phenomenon where the target concept keeps changing with the passage of time.

Real-world applications where data follow a sequential order with various durations in the DL domain include machine translation, natural language modeling, speech recognition, ECG monitoring, sentiment analysis, sales forecasting, stock market analysis, and streaming services. Such applications need a way to store past data and


Fig. 2.5 Visualization of the “Recurr-Ant” Problem. (I) mimics an ant’s independent variable space, where its movements and direction doesn’t affect others. (II) The assumption that ants walk methodically. (III) shows what happens when one ant misses the direction, which can interrupt (IV) the other ants’ movement

Fig. 2.6 Unfolding of a RNN. a A basic graphical representation of an RNN with a recurrent loop in its hidden layer. b A visualization of the RNN unfolded at time slices t − 1, t, and t + 1. The hues of the nodes show the nodes’ sensitivity to input and changes as time slices progress. The greater the sensitivity, the darker the node. Sensitivity steadily declines over time as new inputs overwrite hidden unit activation

forecast future values. RNNs help here. RNNs are a variant of feedforward NNs that incorporates data dependencies. As demonstrated in Fig. 2.6a, unlike a feedforward network, an RNN's hidden layers have a recurrent looping constraint (a computational cyclic directed graph). This helps the information move in loops and influence the model state, allowing RNNs to capture sequential data in an ordered dataset. The purpose of RNN learning is to forecast a sequence y^{t+τ} that corresponds to the target sequence y^t in Fig. 2.7. In this case, τ is the time lag placed in the network to enable the RNN to obtain the context of the inputs and reach meaningful information y^{τ+1} before predicting parts of the output sequence. It is feasible to put


Fig. 2.7 Representation of a RNN network architecture with a single hidden layer and associated weights

τ = 0 between the first relevant output and the first target output. As seen in Fig. 2.6, the network utilizes the same function and parameters at every timestamp.

 Highlight
Characteristic features of RNNs are:
1. Unlike ANNs, RNNs include a memory that stores sequential information from previous computations, allowing the network to display dynamic temporal behavior.
2. RNNs deal with temporal sequences of arbitrary length, which means that the size of the input and output vectors varies for different input-output sequence pairings, as opposed to ANNs, which have fixed input and output sizes.
3. Backpropagation through time (BPTT) in RNNs is a version of the feedforward network's backprop technique.
4. RNNs leverage parameter sharing across timestamps, resulting in fewer training parameters and a lower calculation cost.
5. Truncation of higher-power matrices on a regular or random basis is required for computational convenience and numerical stability, as well as for dealing with vanishing and exploding gradients.


 Highlight
According to the famous Stability-Plasticity Dilemma (Grossberg 1987), stability and plasticity as desirable features of a system remain in direct conflict. In this case, network learning in a dynamic context addresses several learning paradigm difficulties such as sequential learning and conceptual drifts. The training representation should be stable enough to keep critical information during fresh learning while also being malleable enough to incorporate new information as appropriate. In practice, ANNs' concurrent learning, in which the full training population is presented and learned as a single, complete entry, frequently disturbs or removes the network's previously acquired representation. This 'catastrophic forgetting', as described in the literature (Robins 2004), contrasts with the human brain's true neural network, which is capable of sequential learning and integrating old and freshly learned concepts as needed.

The RNN learning over time step t is formally described in its simplest form by Eq. 2.20.

$$O^t(x) = \sigma_o(h^t; w) = F\left(W_o\, h^t(x) + b_o\right), \qquad h^t(x) = \sigma_h(h^{t-1}; w; x^t) = f\left(W_i\, x^t + W_h\, h^{t-1}(x) + b_h\right) \qquad (2.20)$$

where F represents the non-linear activation function for the output variable, O^t represents the output variable at each timestamp t, f represents the non-linear activation function in the hidden states, and h^t represents the hidden variables. With a batch size of n, the d inputs for a mini-batch sample are x^{(t)} ∈ R^{n×d}. The hidden variable returns h^t ∈ R^{n×j}, where j denotes the number of hidden units. The output variable is O^t ∈ R^{n×k}. Unlike ANNs, we keep the hidden variables from the previous step, h^{t−1}, and introduce a new variable W_h to use the previous timestamp's variable in the current timestamp. We allow the network to capture and maintain previous knowledge for calculation in the current stage by using the relationship between the hidden variables h^t and h^{t−1} from neighboring steps. For the output, hidden variables, and input, the weight parameters are W_o ∈ R^{k×j}, W_h ∈ R^{j×j}, and W_i ∈ R^{j×d}. The bias terms for the output and hidden layers are b_o ∈ R^{1×k} and b_h ∈ R^{1×j}. Now, similar to what we did with ANNs, we want to investigate the forward propagation and backpropagation-through-time techniques for a clear interpretation. There are numerous RNN variants for different tasks. The goal is not to dwell on the complexities of sophisticated and heavy derivations, but rather to comprehend the model's fundamental operation. This will eventually aid in gaining in-depth knowledge while reading other works.

 Step 1: Forward propagation from input-to-hidden units. Forward propagation is similar to that found in the ANN, with a single hidden layer h_j^t in the vanilla network


shown in Fig. 2.6. The only difference is that the activations z_j^t in Eq. 2.21 come from both the present external input and the hidden layer activations one step back in time (Eq. 2.22).

$$z_j^t = \sum_{i=1}^{d} w_{ji} x_i^t + \sum_{h'=1}^{J} w_{h'j} h_{h'}^{t-1}, \qquad j = 1, 2, \ldots, J \qquad (2.21)$$

$$h_j^t = \sigma_h(z_j^t) \qquad (2.22)$$

 Step 2: Forward propagation from hidden-to-output units. In Eq. 2.23, for each timestamp t, the hidden outputs h_j^t are linearly combined with the weights w_{kj}, shared across all hidden layers for that timestamp, to compute the activations z_k^t:

$$z_k^t = \sum_{j=1}^{J} w_{kj} h_j^t, \qquad k = 1, 2, \ldots, K \qquad (2.23)$$

In the last layer, the activations are routed through a non-linear activation function σ_o, commonly softmax, to yield the target output label for the t-th time sequence (Eq. 2.24).

$$y_k^t = \sigma_o(z_k^t) \qquad (2.24)$$

 Step 3: Evaluation of the error function for network optimization. Consider the RNN without bias parameters, whose activation function unit employs the identity mapping σ(x) = x, with input x^t ∈ R^d, hidden units h^t ∈ R^j, and output y^t ∈ R^k with the intended output label ŷ^t for each timestamp t. The loss function L is defined over T timestamps from the beginning of the sequence, similar to how we defined it for the ANN in Eq. 2.9; here, for the defined architecture (cf. Fig. 2.7) with, for example, a cross-entropy loss function over a vocabulary V, it is represented as:

$$L = \frac{1}{T}\sum_{t=1}^{T} l(\hat{y}^t, y^t) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{v=1}^{|V|} \hat{y}_v^t \log y_v^t \qquad (2.25)$$

 Step 4: Backpropagation of error concerning time from hidden-to-output weights. We build a computational graph for the RNN model, as shown in Fig. 2.7, to help comprehend the dependencies between the model variables and parameters employed during model calculation. We begin by calculating the differentiation of the loss function in Eq. 2.25 with respect to the model output for each timestamp t as follows:

$$\frac{\delta L}{\delta y^t} = \frac{\delta l(\hat{y}^t, y^t)}{T \cdot \delta y^t} \in \mathbb{R}^k \qquad (2.26)$$


Now, we use the chain rule in Eq. 2.27 to determine the gradient of the loss function with respect to the weights, δL/δW_o ∈ R^{h×o}:

$$\frac{\delta L}{\delta W_o} = \sum_{t=1}^{T} \frac{\delta L}{\delta y^t} \cdot \frac{\delta y^t}{\delta W_o} = \sum_{t=1}^{T} \frac{\delta L}{\delta y^t}\, h^t \qquad (2.27)$$

At the final timestamp T, the loss function L depends on the hidden state h^T only through y^T. So, calculating the gradient using the chain rule in Eq. 2.28 is simple:

$$\frac{\delta L}{\delta h^T} = \frac{\delta L}{\delta y^T} \cdot \frac{\delta y^T}{\delta h^T} = W_o^{\top} \frac{\delta L}{\delta y^T} \qquad (2.28)$$

This becomes more difficult for any timestamp t < T, where the loss function depends on h^t through h^{t+1} and y^t. Using the chain rule of gradients for the time stamps 1 ≤ t ≤ T, the hidden-state gradient δL/δh^t ∈ R^j can be derived and expanded recurrently as:

$$\frac{\delta L}{\delta h^t} = \sum_{i=t}^{T} \left(W_h^{\top}\right)^{T-i} W_o^{\top}\, \frac{\delta L}{\delta y^{T+t-i}} \qquad (2.29)$$

The basic linear example in Eq. 2.29 raises some serious issues for prolonged sequence models, owing to the potentially huge power of W_h. This is due to the fact that eigenvalues less than one vanish, while values greater than one explode. As a result, the equation becomes numerically unstable, with vanishing/exploding gradients. One method is to detach the gradient and terminate the time steps at a computationally convenient size. Later applications, such as LSTM models, demonstrate a more advanced use.

 Step 5: Backpropagation of error with respect to time from input-to-hidden weights. Finally, the objective loss L is affected by the model weights W_i, W_h in the hidden layer via the hidden states {h^1, h^2, ..., h^T}. The final representation of the derived Eq. 2.30 using the chain rule is as follows:

$$\frac{\delta L}{\delta W_h} = \sum_{t=1}^{T} \frac{\delta L}{\delta h^t} \cdot \frac{\delta h^t}{\delta W_h} = \sum_{t=1}^{T} \frac{\delta L}{\delta h^t}\, h^{t-1}, \qquad \frac{\delta L}{\delta W_i} = \sum_{t=1}^{T} \frac{\delta L}{\delta h^t} \cdot \frac{\delta h^t}{\delta W_i} = \sum_{t=1}^{T} \frac{\delta L}{\delta h^t}\, x^t \qquad (2.30)$$

where δL/δh t is recurrently computed using Eq. 2.28 and the quantity impacting numerical stability is Eq. 2.29. It should be observed that the variables Wi , Wh , and Wo are used to simplify the notation for the input, hidden, and output trainable weights parameters, which are shared within the hidden layers for each time stamp.
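Before summarizing the algorithm, here is a bare-bones NumPy sketch of the forward pass in Eq. 2.20, unrolled over T timestamps; the sizes, the tanh hidden activation, and the linear output layer are illustrative assumptions, not the book's code:

import numpy as np

rng = np.random.default_rng(0)
d, j, k, T, n = 4, 8, 3, 5, 2          # input, hidden, output sizes; time steps; batch
Wi = rng.normal(scale=0.1, size=(j, d))
Wh = rng.normal(scale=0.1, size=(j, j))
Wo = rng.normal(scale=0.1, size=(k, j))
bh, bo = np.zeros(j), np.zeros(k)

x = rng.normal(size=(T, n, d))         # a mini-batch of input sequences
h = np.zeros((n, j))                   # initial hidden state
outputs = []
for t in range(T):
    h = np.tanh(x[t] @ Wi.T + h @ Wh.T + bh)   # h^t depends on x^t and h^(t-1)
    outputs.append(h @ Wo.T + bo)              # O^t from the current hidden state
outputs = np.stack(outputs)            # shape (T, n, k)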


Algorithm Summary
The BPTT algorithm is summarized as follows from the previous derivation:
1. For each time step, feed the d-dimensional input vectors x^t from the training set into the unfolded feedforward network, compute all h^t using Eq. 2.20, and obtain the transformed output vector y^t from Eq. 2.24.
2. Compute the loss L against the target label ŷ^t using Eq. 2.25.
3. Using Eq. 2.26, calculate the BPTT error (δL) for the output units.
4. Using Eqs. 2.28, 2.29 and 2.30, compute the total derivative of δL for each time step in relation to the network's output, hidden, and input weights.

Challenges with RNNs
When compared to the previously learned ANNs, with network capabilities to handle arbitrary-length input and model size independent of growing input size, it appears to be a win-win situation. The model can effectively compute shared weights over time while allowing the network to incorporate historical data into its prediction decisions. However, as the calculation slows with recurrent learning, the issue requires consideration. It proves difficult to acquire information from long ago with big sequence inputs.

 THINK IT OVER
What if we set a maximum input-output size and employ null-character padding on input-output pairs that are shorter than that size?
A simplistic solution would indicate that it should work as a basic feedforward network without varying-length vectors, and padding seems intuitive from CNNs. Unfortunately, sequence learning for a translation task is not a one-to-one mapping, and the network must include past and present linkages.
Connecting Links: Recurrent neural network, temporal sequence relation.

The network with more timestamps (a longer input-output sequence in training) has the vanishing/exploding gradient problem, a typical issue in all NNs. RNNs have trouble modeling long-term dependencies between sequence elements. In the statement "Anish is a wise individual who enjoys teaching everyone to code," the subject 'Anish' and the object 'everyone' are separated by a vast time-span t in the sequence. As a result, the unfolding RNN network must remember the subject for a longer time slice before reaching the object, so as to identify the output pair ("Anish", "everyone") as subject and object. The problem is the chain rule of derivatives in the BPTT algorithm. The holistic proof can be found in many publications. Still, the key principle is the product of partial derivatives, which backpropagates deep in the network over lengthy sequences. In reality, the number of partial-derivative factors in the product for early slices (Eq. 2.29) is proportional to the length of the input-output sequence.


Table 2.1 Summary of common gates used in RNNs

Type of gate     Notation   Purpose                                       Use
Forget gate      f          When to delete a cell                         LSTM
Output gate      o          Measure of cell reveal                        LSTM
Relevance gate   r          Drop of previous information                  LSTM, GRU
Update gate      u          Weightage of influence of past information    LSTM, GRU

Suppose the partial derivatives are not close to 1: in the case of a diminishing gradient (partial derivatives ≪ 1) or an exploding gradient (partial derivatives ≫ 1), learning either stalls or becomes unstable. Another difficulty with the network is that it does not take future inputs into account when making decisions.

 Highlight
Solutions for the vanishing and exploding gradient problems in RNNs:
1. For vanishing gradients: initialization + ReLU
2. For exploding gradients: the clipping trick

DL developers continually attempt to overcome the aforementioned issues by modifying RNNs. To improve accuracy, bidirectional RNNs handle future-step inputs. At the same time, Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM) networks cope with vanishing gradients and with retaining meaningful information. To solve the problem of vanishing gradients, these more elaborate RNNs employ particular gates; Table 2.1 lists some common ones. Gates influence the interactions between the memory cell and its surroundings. Figure 2.8 summarizes the various applications in NLP and time-series recognition. The taxonomy differs in the number of input and output variables that the network maps at time step t. A one-to-one input-to-output mapping corresponds to a classical prediction task such as object classification; one-to-many is commonly used for image captioning, many-to-one for sentiment analysis, many-to-many for named entity recognition or machine translation, and a synchronized many-to-many input-to-output sequence for video classification. At each time step, the loss function L and the BPTT loss δL are computed for any of these RNN variants.
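The "clipping trick" from the highlight above can be illustrated with a minimal NumPy sketch; the threshold value and function name are illustrative, and most DL frameworks ship their own version of this utility.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm.

    This addresses exploding gradients: the update direction is preserved,
    only its magnitude is capped.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```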

2.1.3.1 Long Short-Term Memory Networks

Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997) are a popular, powerful type of RNN designed to retain or flush memory as needed. Using a stacked deep network configuration, LSTMs achieved SOTA results on many


Fig. 2.8 Classification of RNN architecture based on various combinations of inputs and output nodes used for different applications

challenges. In traditional RNNs, the gradient signal is multiplied a large number of times (as many as there are time steps) by the weight matrix of the recurrent hidden layer during backprop, so the magnitude of the weights in the transition units strongly influences the learning process. This was previously described as the RNNs' vanishing/exploding gradient problem. The motivation for introducing the LSTM network is its capacity to preserve relevant information over long horizons while skipping irrelevant short-term inputs in latent space. As shown in Fig. 2.9, an LSTM has memory cells (some researchers consider these memory cells a particular form of hidden state) connected to layers rather than to individual neurons. The network's design is said to have been inspired by the logic gates found in computers. It is important to note that the output of the output gate is the hidden-layer output at time step t, not the eventual target output. The input gate (Eq. 2.31), the forget gate (Eq. 2.32), the output gate (Eq. 2.33), and the self-recurrent memory cell or neuron (Eq. 2.35) are the four basic components of an LSTM. The self-recurrent connection weight of 1.0 ensures that the state of the memory cell remains constant across time steps in the absence of external input. The functions of the gates in LSTM memory cells are:
1. Input gate (i): allows a signal to be blocked or to alter the state of the memory cell.
2. Output gate (o): allows the state of the memory cell either to be blocked or to affect other neurons.


Fig. 2.9 Illustration of an LSTM memory cell

3. Forget gate (f): regulates the self-recurrent connection of the memory cell, allowing the cell to remember or forget its past state as needed.
Consider the mathematical representation of the gates' operation and of the hidden unit's output value generated from the memory cell acting as the hidden unit. The first three equations are comparable to Eq. 2.20, and σ represents the sigmoid activation function squashing the output into the range (0, 1). This distinguishes a gate value of 0, which blocks everything, from a value of 1, which lets everything through. The notation for the input x^t ∈ R^{n×d} and the hidden state h^{t−1} ∈ R^{n×j} is the same as specified previously for j hidden units and batch size n.

$$i^t = \sigma\left(W_{xi}\, x^t + W_{hi}\, h^{t-1} + b_i\right) \qquad (2.31)$$

$$f^t = \sigma\left(W_{xf}\, x^t + W_{hf}\, h^{t-1} + b_f\right) \qquad (2.32)$$

$$o^t = \sigma\left(W_{xo}\, x^t + W_{ho}\, h^{t-1} + b_o\right) \qquad (2.33)$$

i^t represents the input gate in Eq. 2.31, f^t the forget gate in Eq. 2.32, and o^t the output gate in Eq. 2.33. The weight and bias parameters of the input, forget, and output gates carry the subscripts xi, xf, and xo, respectively. Next, we introduce the candidate memory cell c̃^t ∈ R^{n×j} in Eq. 2.34, whose computation is comparable to the equations above; the distinction in Eq. 2.34 is the use of the tanh activation, with values in the range (−1, 1), governing the remember-or-forget mechanism for the memory cell c^t. Here the weight parameters corresponding to the input and the recurrent unit are W_{xc} and W_{hc}, respectively, and the bias parameter is b_c.

$$\tilde{c}^t = \tanh\left(W_{xc}\, x^t + W_{hc}\, h^{t-1} + b_c\right) \qquad (2.34)$$


The LSTM mechanism incorporates two gates in the memory cell c^t, similar to the mechanism in GRUs: the input gate i^t regulates how much new data is included via the candidate memory c̃^t, and the forget gate regulates how much of the old memory cell is retained in the present calculation. Using the element-wise (Hadamard) product ⊙, we obtain the updated equation in Eq. 2.35:

$$c^t = f^t \odot c^{t-1} + i^t \odot \tilde{c}^t \qquad (2.35)$$

Finally, the hidden layer output h^t ∈ R^{n×j} is computed using the tanh activation in Eq. 2.36 as:

$$h^t = o^t \odot \tanh(c^t) \qquad (2.36)$$
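As a small illustration of Eqs. 2.31–2.36, the following is a minimal NumPy sketch of a single LSTM time step for one example. The dictionary-based weight layout and the column-vector convention are our assumptions for brevity, not part of the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (Eqs. 2.31-2.36) for a single example.

    x      : input vector, shape (d,)
    h_prev : previous hidden state, shape (j,)
    c_prev : previous memory cell, shape (j,)
    W      : dict of weight matrices W['xi'], W['hi'], ..., shapes (j, d) or (j, j)
    b      : dict of bias vectors b['i'], b['f'], b['o'], b['c'], each of shape (j,)
    """
    i = sigmoid(W['xi'] @ x + W['hi'] @ h_prev + b['i'])        # input gate   (2.31)
    f = sigmoid(W['xf'] @ x + W['hf'] @ h_prev + b['f'])        # forget gate  (2.32)
    o = sigmoid(W['xo'] @ x + W['ho'] @ h_prev + b['o'])        # output gate  (2.33)
    c_tilde = np.tanh(W['xc'] @ x + W['hc'] @ h_prev + b['c'])  # candidate    (2.34)
    c = f * c_prev + i * c_tilde                                # memory cell  (2.35)
    h = o * np.tanh(c)                                          # hidden output (2.36)
    return h, c
```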

The output of Eq. 2.36 can be passed through a softmax layer to obtain the target output y^t for the current block. From the preceding equations, we can see that at any time step t the memory cell decides what to forget from the prior state (i.e., f^t ⊙ c^{t−1}) and what to take from the current time step (i.e., i^t ⊙ c̃^t). Refer to Fig. 2.9 for a block schematic of the memory cell, which helps clarify the calculations above. We now know that RNNs perform well on length-varying input-output sequences with strong temporal correlation. Even though "vanilla" CNNs are not designed for sequence learning, researchers have found ways to use them for such prediction tasks (Dauphin et al. 2017). This highlights the importance of including CNNs in this chapter and delving deeper into how they acquired popularity.

2.1.4 Convolutional Neural Networks

CNNs, or ConvNets (LeCun et al. 1989), sitting at the crossroads of biology, mathematics, and computer science, are a particular NN topology that has taken the DL world by storm. Inspired by the human visual system, CNNs became popular with the 2012 ImageNet Challenge (Krizhevsky et al. 2012). They were introduced to address data with a grid-like topology, and an impressive range of challenges has been cast in this form: time-series data represented on a 1D grid, or a picture with pixels arranged in a 2D grid. Over almost a decade in the DL field, these variants of neural network architecture have sparked interest in virtually every possible domain, including image analysis, object recognition, document parsing, law enforcement, recommendation systems, smart navigation, and other complex image classification problems. While RNNs laid the foundation for voice-to-DL applications, CNNs have effectively offered vision across multiple applications, particularly in diverse computer vision tasks.


 Highlight
Translation variance suggests that a relevant image feature can appear at different positions. Rotation variance suggests that the feature may have an arbitrary orientation in an image. Scale variance suggests that image features may be present at varying scales and crops. Refer to Fig. 3.3 for the context of these variances.

Suppose you are given two similar images and asked to identify whether they are dissimilar. What steps would you take? Normally, you would divide the images into smaller segments and identify various elements, forms, and edges. You would then compare the attributes of the images and gather information to finally determine the dissimilarity of the given pair. Using ANNs in a similar fashion to handle large, rich structures has been highly disappointing so far: we simply flattened the data into a one-dimensional vector, eliminating the spatial character of each image input because the network is insensitive to the order of the input features, and fed it into a fully connected MLP to obtain a prediction. The problem of input-order invariance might intuitively be solved using RNNs, but prior knowledge of image handling tells us that nearby pixels are frequently interrelated and carry spatial information, which should be exploited to design an efficient network for image-like data. It seems fair that whatever strategy we adopt should not be unduly concerned with the exact location of a feature in the 2D grid, but should instead leverage any a priori structure governing the interactions between features. This section focuses on networks specifically designed for this purpose, which follow a common pattern of knowledge propagation: comparing pixel values in their hidden layers (the convolutional or feature-extractor layers) to find features in a given image, and condensing that information in the fully connected (predictor) layers to produce a decision. CNNs are among the most adaptable models and are effectively customized for image and non-image data (Table 2.2). To understand the network's high potential among DL researchers and why it performs so well, we begin with its fundamental building blocks, namely:
• Features relate to image components such as edges, borders, forms, textures, objects, and circles.
• Kernels are small matrices that use convolution to extract meaningful information or characteristics from incoming data. A popular example is the 3 × 3 Sobel filter for edge detection. Filter matrices are normally square, and their form depends on the type of characteristics to be extracted.
• Convolution is a CV technique that captures features from images in the form of feature maps using learnable kernels. In other words, the kernel is slid over the image pixel by pixel, and at each pixel group an element-wise multiplication is performed between the kernel and the image. It is simply an integral that expresses


the overlap of one function shifted over the other, providing a weighted linear combination of the image while keeping the spatial structure intact.
• Pooling allows the aggregation of information across contiguous spatial regions, which helps reduce the dimension of the derived feature maps by executing operations at the pixel level. Technically, a pooling kernel moves over the image and, using techniques such as max pooling, average pooling, or sum pooling, just one value is chosen from the associated pixel group for further processing, lowering the computing burden, controlling overfitting, and giving spatial invariance.
• Stride is the rate at which the kernel window shifts across the image during the convolution operation.
• Padding is an important notion in image scanning, since it extends the region of the image for a more accurate analysis. It inserts additional pixels with certain values around the image border to ensure that the image dimension is preserved through repeated convolutions. Zero-padding and mean-padding are the most prevalent.
• Flattening is the operation that converts the resulting 2D/3D feature map into a 1D vector.
• Non-linear activations add non-linearity to the NN, allowing numerous convolution and pooling blocks to be stacked to improve model depth.

When adding a conv-layer to a network, we need to specify the number of filters. At a high level, convolutional layers detect patterns in the image data with the help of filters. The first few convolutional layers take care of lower-level details; the deeper the network goes, the more sophisticated the pattern searching becomes. For example, in later layers, rather than edges and simple shapes, filters may detect specific objects like lips, eyes, hair, or noses, and eventually a human, a monkey, and so on. The neurons in a convolution layer are generally not connected to the complete neuron stack of the previous layer but rather to a cluster of neurons that captures the local features of an image as it flows through the network. We also now understand that feeding the image as a whole into the input layer of a feedforward network, or simply using each color value of the three-channel matrices of a colored image as an input feature (for example, 256 pixels × 256 pixels × 3 channels = 196,608 connections for only one neuron in the first dense layer), leads to a huge number of parameters, making the optimization of the network difficult, and will

Table 2.2 Comparison of the different variants of Neural Networks

                                     Artificial NN   Recurrent NN    Convolutional NN
General data type                    Tabular data    Sequence data   Image data
Parameter sharing                    ✗               ✓               ✓
Spatial relationship                 ✗               ✗               ✓
Recurrent relation                   ✗               ✓               ✗
Vanishing and exploding gradients    ✓               ✓               ✓


require a lot of data for training. It is also ill-suited to the many parameters that change across images due to translation and rotation variance. A more intuitive approach emphasizes smaller features in an image to counter the variance problem and drastically reduce the number of trainable parameters in the network, because we no longer need to connect neurons to individual pixels but only to small areas determined by the kernel size.

 THINK IT OVER
What actually happens when we use a fully connected layer versus a convolutional layer?

Fig. Interlinked node connectivity compares a fully connected layer (left) with a convolutional layer (right). The selected green node shows the influence of the given node from the antecedent layer.
Connecting Links: Trainable parameters, spatially local features, memory size, translation variance.
We begin with the fundamental operation of all CNNs, namely the convolution, and then move on to the nitty-gritty details of padding, stride, and the use of multiple channels at each layer. Since CNNs will be referred to extensively throughout the book, it is critical to understand the convolution operation, which is at the heart of any CNN. Before delving into 2D image convolutions, we take a non-traditional yet intuitive approach to mathematically defining 1D convolution using an example.

2.1.4.1 1D Convolution

Consider an arbitrary stock whose price fluctuates over time. The time-sorted data for time steps t_1, ..., t_T with corresponding share prices x_1, x_2, ..., x_T can be represented as {(t_i, x_i); i = 1, 2, ..., T}. This is also the most common case of equally-spaced time points, where the difference between time steps (the temporal resolution) can vary in scale. For equally-spaced data we can drop the absolute time-step information, regardless of temporal resolution, and represent the data as a one-dimensional sequence of length T, {x_1, x_2, ..., x_T}. Figure 2.10a shows the stock


Fig. 2.10 This is how padding works in both 1D and 2D data. a An equally-spaced time series data for stock-market analysis. With the operation of padding, we add L entries to both ends of xi data points. b Padding of 2-D data for the convolution operation

price data, which can be considered to be made up of a signal component s_i and a noise component ξ_i, i.e., {x_i = s_i + ξ_i; i = 1, 2, ..., T}. To find the average of all prices in the N-neighborhood of x_t (spatial information), that is, the average of the 2N + 1 prices {x_{t−N}, x_{t−N+1}, ..., x_t, ..., x_{t+N−1}, x_{t+N}}, Eq. 2.37 is applied.

$$\frac{1}{2N+1}\sum_{n=-N}^{N} x_{t+n} = \frac{1}{2N+1}\sum_{n=-N}^{N} s_{t+n} + \frac{1}{2N+1}\sum_{n=-N}^{N} \xi_{t+n} \qquad (2.37)$$

In order to denoise the dataset and retrieve the signal from the data in Eq. 2.37, two reasonable assumptions can be made:
1. The signal s is relatively smooth,
2. The error ξ has a zero-mean distribution.
For improved approximations, these assumptions have two implications for Eq. 2.37:
• The higher the neighbourhood size N, the more likely the average will be pushed closer to the true population average, i.e.,

$$\frac{1}{2N+1}\sum_{n=-N}^{N} \xi_{t+n} \approx 0 \qquad (2.38)$$

• On the other hand, raising N has a contrasting effect on the signal s_t, which is influenced by neighboring values:

$$\frac{1}{2N+1}\sum_{n=-N}^{N} s_{t+n} \approx s_t \qquad (2.39)$$


The solution to this problem is to take a weighted average over the N-neighbourhood of x_t, with higher weights for the elements in the vicinity of x_t, by introducing non-uniform weights w_n in Eq. 2.40:

$$s_t = \sum_{n=-N}^{N} w_n\, x_{t+n}, \qquad w_n = \frac{N + 1 - |n|}{(N+1)^2}, \quad n = -N, \ldots, N \qquad (2.40)$$
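As a quick illustration, here is a minimal NumPy sketch of the weighted moving average of Eq. 2.40 applied to a padded 1D sequence. The price values, the neighbourhood size, and the edge-padding choice are illustrative assumptions; since the triangular kernel is symmetric, convolution and cross-correlation coincide.

```python
import numpy as np

def triangular_weights(N):
    """Weights w_n = (N + 1 - |n|) / (N + 1)^2 for n = -N, ..., N (Eq. 2.40)."""
    n = np.arange(-N, N + 1)
    return (N + 1 - np.abs(n)) / (N + 1) ** 2

def smooth(x, N):
    """Denoise a 1D sequence by the weighted N-neighbourhood average."""
    w = triangular_weights(N)
    x_padded = np.pad(x, N, mode='edge')       # padding so s_t exists for all t
    return np.convolve(x_padded, w, mode='valid')

prices = np.array([10.0, 10.5, 9.8, 10.2, 11.0, 10.7, 10.9])
print(smooth(prices, N=2))
```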

If you’re still wondering what it has to do with convolution, we’ll solve it soon with the Eq. 2.37 mentioned above. Looking at the range of the defined variables, we see that w is a 2N + 1 length sequence, x is a T -length sequence, and s is a sequence represented by s = w ∗ x, which has the same length and range as x. But how can we perform a convolution at the very start and end of the sequence? To do a convolution there, we need values before and after the time range of the sequence. | | To obtain the answer | | AH: To solve this issue, we must append N data points at both ends of the sequence so that a value of xi exists for all t ∈ [1, T ], i ∈ [−N + 1, T + N ]}, as shown in Fig. 2.10b, which is known as padding. The knowledge can be further extended to a 2D convolution represented in Eq. 2.41. st1 ,t2 =

N1 

N2 

wn 1 ,n 2 xt1 +n 1 ,t2 +n 2 ,

{t1 ∈ T1 , t2 ∈ T2 }

(2.41)

n 1 =−N1 n 2 =−N2

where w is a (2N_1 + 1) × (2N_2 + 1) weight or kernel matrix, x is a T_1 × T_2 matrix transformed into a (T_1 + 2N_1) × (T_2 + 2N_2) matrix after padding, and s_{t_1,t_2} is defined for all t_1 ∈ [−N_1 + 1, T_1 + N_1], t_2 ∈ [−N_2 + 1, T_2 + N_2].

2.1.4.2 2D Convolution

Now that we understand the operations better, let's talk about 2D convolution on images. Images are usually thought of as either a 2D array of [row × column] pixels with values ranging over [0, 255], or a 3D array of [row × column × channel] with an intricate combination of red, green, and blue (RGB) pixel brightness values ranging over [0, 255]. In the case study, a 2D convolution operation will be used on a grayscale image to pull features out of the hidden units of the feature extractors. As already mentioned, there is also a 1D convolution for equally-spaced time-series data (see also Fig. 2.10a), and 3D convolution is often used on RGB images. Equation 2.41 can be seen in the more representational matrix form of Eq. 2.42 with hidden units and bias, which we learned about for MLPs (Sect. 2.1.2). It is critical to remember that we invoke the translation-invariance concept in the input X by constraining the weight W and bias B to be independent of the (i, j)th pixel location.


This is how the convolution operation calculates the hidden state H: by weighting pixels in the neighborhood of (i, j).

$$[H]_{i,j} = B + \sum_{a}\sum_{b} [W]_{a,b}\, [X]_{i+a,\, j+b} \qquad (2.42)$$

Not only did we successfully add translation invariance into the computation, we also drastically lowered the number of weight parameters compared with typical MLP weights, which assume a one-to-one weight correspondence with image pixels, i.e., [W]_{i,j,k,l} with k = i + a, l = j + b. Finally, based on the assumptions given earlier in the section, we apply the second principle, locality, to confine the weighting of important information to the close vicinity of the location (i, j). This amounts to setting the weight matrix to zero outside a specified range Δ, yielding Eq. 2.43 below.

$$[H]_{i,j} = B + \sum_{a=-\Delta}^{\Delta}\sum_{b=-\Delta}^{\Delta} [W]_{a,b}\, [X]_{i+a,\, j+b} \qquad (2.43)$$

This is a mathematical depiction of the overlap/convolution between two functions, say (f, g): R^d → R in discrete space, shown in Eq. 2.44 with a minor change of notation, (i − a, j − b). The notation is easily matched between Eqs. 2.44 and 2.42; note, however, that Eq. 2.42 as defined above accurately describes the cross-correlation operation.

$$(f * g)(i, j) = \sum_{a}\sum_{b} f(a, b)\, g(i - a, j - b) \qquad (2.44)$$
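To make the distinction concrete, here is a minimal NumPy sketch contrasting the cross-correlation of Eq. 2.42 with the true convolution of Eq. 2.44 (kernel flipped by 180°). The array sizes, kernel values, and bias are illustrative.

```python
import numpy as np

def cross_correlate2d(X, W, B=0.0):
    """Valid-mode 2D cross-correlation (Eq. 2.42): H[i, j] = B + sum_ab W[a, b] X[i+a, j+b]."""
    kh, kw = W.shape
    H = np.zeros((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            H[i, j] = B + np.sum(W * X[i:i + kh, j:j + kw])
    return H

def convolve2d(X, W, B=0.0):
    """True 2D convolution (Eq. 2.44) = cross-correlation with the kernel flipped by 180 degrees."""
    return cross_correlate2d(X, W[::-1, ::-1], B)

X = np.arange(16, dtype=float).reshape(4, 4)
W = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(X, W))
print(convolve2d(X, W))
```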

Similarly, for a three-channel color image, the convolution filter can be customized for the multidimensional input at each pixel location [X]_{i,j,k}. For the input formulation, the intuitive technique of simply adding a third-order tensor to the hidden representation is not a good idea: we want a whole vector of hidden representations for each spatial point. Similar to the concept of channels in the input, this can be regarded as a stack of hidden representations on a 2D grid, representing the spatialized set of features learned by the hidden representation in subsequent layers, called feature maps. Lower layers are more intuitively associated with the inputs: some channels may learn to detect edges, while others may learn to distinguish textures. Adding a fourth tensor coordinate to the weights in Eq. 2.45 integrates support for multiple channels in the input and hidden representations. This is the general representation of a convolution layer with numerous channels.

$$[H]_{i,j,k} = B + \sum_{a=-\Delta}^{\Delta}\sum_{b=-\Delta}^{\Delta}\sum_{c} [W]_{a,b,c,k}\, [X]_{i+a,\, j+b,\, c} \qquad (2.45)$$

Finally, as we demystify the learning in CNNs, we must notice that the network’s pipeline does not perform the same neuron operations with weights and bias as ANNs


Fig. 2.11 A CNN network’s convolution process. a A 3 × 3 kernel operation performed on an image sliding with a stride of one to obtain the provided feature map. b An intuitive analogy of the aforementioned kernel process as 9 neurons’ input in an ANN-based outline for a given single output

but rather uses learnable kernels. In Fig. 2.11, we present the process of convolution in a representation similar to a neuron operation in a feedforward network, for the sake of an easy initial understanding. The convolutional layer's output feature maps are typically routed through the ReLU activation function to impart non-linearity to the model, similar to how non-linearity was employed in earlier networks to obtain a universal approximation of a function; otherwise, a weighted combination of linear functions would only yield a linear mapping. The problem at hand dictates the sort of non-linearity chosen.


 Highlight
Characteristic features of CNNs are:
1. The network helps extract spatial features in the form of smaller chunks or a feature map.
2. CNNs capture these spatial features from the input data using learnable filters, without explicitly specifying them.
3. CNNs also incorporate parameter sharing, using the same filter over different regions of the image to generate a feature map.
4. The network also handles vanishing/exploding gradients efficiently.
5. The convolution operation on tensors parallelizes easily across GPU cores.
6. CNNs are considered computationally efficient, as they require fewer trainable parameters than ANNs.
7. CNNs have shown sample efficiency in achieving SOTA models, resulting in effective applications to 1D sequence data, graph-structured data, and recommendation systems.
8. Pooling acts as an essential step to reduce computation and makes the model tolerant of distortions and variations.
9. CNNs have fewer parameters than fully connected networks, which helps reduce overfitting.

To better understand the propagation steps in CNNs, with emphasis on the feedforward and backpropagation knowledge we have acquired earlier, we can give a rough blueprint of the steps we need to follow:
 Step 1: Forward propagation through convolutional layers. We begin by randomly initializing a filter weight matrix w^l, flipping it by 180° horizontally and vertically (as shown in Fig. 2.12b), and sliding it over the preceding layer's input variable h^{l−1} with equal and finite strides (Fig. 2.12a). The procedure is repeated with different kernels to obtain as many feature maps x^l as desired. Figure 2.12c depicts the concept of weight-parameter sharing and sparse connectivity in CNNs; the color codes between the input and convolutional layers represent the distribution of kernel weights of the same color. Equation 2.46 gives the convolution for an input x_{i,j} of original dimension H × W as:

$$x^{l}_{i,j} = \sum_{a}\sum_{b} w^{l}_{a,b}\, h^{l-1}_{i+a,\, j+b} + b^{l}_{i,j} \qquad (2.46)$$

where h^l_{i,j} = σ(x^l_{i,j}) for the lth convolution layer, with kernel weights w_{a,b} of dimension K_1 × K_2 and b^l the bias of layer l; h^l_{i,j} is the output vector of layer l after the input passes through an activation function σ. At this stage, we try to comprehend the specifics of the convolution block of CNN layers. Later, the values retrieved from the convolution layer are transmitted to the dense, fully connected layer to form the final prediction. Since the fully connected


Fig. 2.12 Forward propagation in a convolutional network. a The input features of layer x^{l−1} (in blue) have a receptive field of size 4. They are convolved with the kernel to generate the intermediate convolution feature maps. The idea of sparse connectivity is shown here, with the kernel connected to only 4 adjacent neurons in the input layer, which are then pooled. b The kernel map, flipped both horizontally and vertically for the convolution operation. c The kernel's weight distribution in the various regions of a 3 × 3 sample input feature map

layer and backpropagation learning are nothing more than typical NN operations, the mathematics introduced earlier can be employed to construct the learning equations for the entire network. The output is transmitted in a similar manner, with the chain rule of derivatives used to compute δL/δw with respect to two parameters, the weights and the bias, for the dense layer, followed by propagation through the convolution blocks with the filter matrix as the parameter.
 Step 2: Evaluation of the error function for network optimization. Assuming a total of k predictions, the network's output ŷ_k is compared with its corresponding target output y_k, here using the MSE. Learning is generally achieved by applying the gradient descent algorithm to the loss function L in Eq. 2.47 below:

$$L = \frac{1}{2}\sum_{k} (\hat{y}_k - y_k)^2 \qquad (2.47)$$

 Step 3: Backpropagation of the error through convolutional layers. For the backpropagation, let us simplify the learning into two update parameters. We start with the influence of a change in an individual pixel, say w_{a_0,b_0} (see also Fig. 2.13a),


Fig. 2.13 The influence of the kernel map on the forward and backward propagation operations. a In convolution, forward propagation ensures that the yellow pixel w_{a_0,b_0} in the weight kernel contributes to all products (between each element of the weight kernel and the input feature-map element it overlaps). b A flipped delta matrix illustrating the gradient formed during backpropagation

of the kernel weights on the loss function L, obtained by applying the chain rule to the individual kernel weight δL/δw^l_{a_0,b_0} in Eq. 2.48:

$$\frac{\delta L}{\delta w^{l}_{a_0,b_0}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \frac{\delta L}{\delta x^{l}_{i,j}}\, \frac{\delta x^{l}_{i,j}}{\delta w^{l}_{a_0,b_0}} \qquad (2.48)$$

The forward propagation in Eq. 2.46 ensures that the kernel weight w_{a_0,b_0} contributes to every overlap product between the elements of the kernel, of dimension say k_1 × k_2, and the elements of the input feature map of dimension H × W. This means that the corresponding kernel pixel influences all elements of the output feature map of size (H − k_1 + 1) × (W − k_2 + 1). The second partial derivative term in the equation above can be expanded in Eq. 2.49 using the expression from Eq. 2.46 with the usual notation:

$$\frac{\delta x^{l}_{i,j}}{\delta w^{l}_{a_0,b_0}} = \frac{\delta}{\delta w^{l}_{a_0,b_0}} \left( \sum_{a}\sum_{b} w^{l}_{a,b}\, h^{l-1}_{i+a,\, j+b} + b^{l}_{i,j} \right) \qquad (2.49)$$

Further expansion of the summation over w^{l}_{a,b} h^{l-1}_{i+a,j+b} in Eq. 2.49 using partial derivatives yields zero for all terms except those with a = a_0 and b = b_0,

$$\frac{\delta x^{l}_{i,j}}{\delta w^{l}_{a_0,b_0}} = \frac{\delta}{\delta w^{l}_{a_0,b_0}} \left( w^{l}_{a_0,b_0}\, h^{l-1}_{i+a_0,\, j+b_0} \right) = h^{l-1}_{i+a_0,\, j+b_0} \qquad (2.50)$$


Substituting Eq. 2.50 into Eq. 2.48 gives the double summation in Eq. 2.51, which reflects the weight sharing in the network during convolution. The summation collects all the gradients δ^l_{i,j} corresponding to the output of layer l. The cross-correlation expression of the gradients is transformed into a convolution expression, similar to the flipping of filters in forward propagation shown in Fig. 2.12b. The convolution operation applied to fetch the new set of weights is represented in Fig. 2.14b with 2 × 2 kernel weights for easy visualization.

$$\frac{\delta L}{\delta w^{l}_{a_0,b_0}} = \sum_{i=0}^{H-k_1}\sum_{j=0}^{W-k_2} \delta^{l}_{i,j}\, h^{l-1}_{i+a_0,\, j+b_0} = \mathrm{rot}_{180^\circ}\{\delta^{l}_{i,j}\} * h^{l-1}_{a_0,b_0} \qquad (2.51)$$
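As a sketch of Eq. 2.51, the gradient with respect to each kernel weight can be computed by correlating the layer's input feature map with the delta map. The shapes and names below are illustrative and assume a stride of one with no padding.

```python
import numpy as np

def kernel_grad(h_prev, delta):
    """Gradient of the loss w.r.t. the kernel weights (Eq. 2.51).

    h_prev : input feature map of the layer, shape (H, W)
    delta  : delta map dL/dx^l of the layer output, shape (H - k1 + 1, W - k2 + 1)
    Returns an array of shape (k1, k2) where
    dW[a0, b0] = sum_ij delta[i, j] * h_prev[i + a0, j + b0].
    """
    H, W = h_prev.shape
    oh, ow = delta.shape
    k1, k2 = H - oh + 1, W - ow + 1
    dW = np.zeros((k1, k2))
    for a0 in range(k1):
        for b0 in range(k2):
            dW[a0, b0] = np.sum(delta * h_prev[a0:a0 + oh, b0:b0 + ow])
    return dW
```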

In Fig. 2.14a, the reconstruction process uses the delta δ_{i,j} in the form shown in Fig. 2.13b, mathematically defined in Eq. 2.52:

$$\delta^{l}_{i,j} = \frac{\delta L}{\delta x^{l}_{i,j}} \qquad (2.52)$$

This brings us to the computation of the second quantity, δL/δx^l_{i_0,j_0}, which represents how a change in an individual pixel x^l_{i_0,j_0} of the input feature map impacts the loss function L in Eq. 2.53. In Fig. 2.15, observe that the input pixel x_{i_0,j_0} influences the output region bounded by (i_0 − k_1 + 1, j_0 − k_2 + 1) and (i_0, j_0) at the top and bottom diagonal corner pixels, respectively. The output region bounded

Fig. 2.14 Overview of backpropagation operation. a Backpropagation gradient generation (δ11 , δ12 , δ21 , δ22 ) passed backward through the network. b The convolution operation used to generate the new set of weights marked in pink for the input feature map


Fig. 2.15 An example of how a change in a single pixel x_{i_0,j_0} in the input feature map impacts the loss function. The dashed orange box represents the output region influenced by the matching single pixel from the input feature map, represented in blue

by the dashed lines in Fig. 2.15 contains the pixels affected by the single input pixel x_{i_0,j_0} of the input feature map. This is expressed using the chain rule in the following equation:

$$\frac{\delta L}{\delta x^{l}_{i_0,j_0}} = \sum_{a=0}^{k_1-1}\sum_{b=0}^{k_2-1} \frac{\delta L}{\delta x^{l+1}_{i_0-a,\, j_0-b}}\, \frac{\delta x^{l+1}_{i_0-a,\, j_0-b}}{\delta x^{l}_{i_0,j_0}} = \sum_{a=0}^{k_1-1}\sum_{b=0}^{k_2-1} \delta^{l+1}_{i_0-a,\, j_0-b}\, \frac{\delta x^{l+1}_{i_0-a,\, j_0-b}}{\delta x^{l}_{i_0,j_0}} \qquad (2.53)$$

where the bounded region ranges from i_{a_0} − 0 to i_{a_0} − (k_1 − 1) in height and from j_{b_0} − 0 to j_{b_0} − (k_2 − 1) in width. Since a and b lie in the ranges 0 ≤ a ≤ k_1 − 1 and 0 ≤ b ≤ k_2 − 1, we can write the summation indices simply as i_{a_0} − a and j_{b_0} − b. Expanding Eq. 2.53 similarly to Eq. 2.49 gives:

$$\frac{\delta x^{l+1}_{i_0-a,\, j_0-b}}{\delta x^{l}_{i_0,j_0}} = \frac{\delta}{\delta x^{l}_{i_0,j_0}} \left( \sum_{a_0}\sum_{b_0} w^{l+1}_{a_0,b_0}\, \sigma\!\left( x^{l}_{i_0-a+a_0,\, j_0-b+b_0} \right) + b^{l+1} \right) \qquad (2.54)$$

Again, expanding the summation using partial derivatives makes all terms equal zero except when a_0 = a and b_0 = b, transforming σ(x^l_{i_0−a+a_0, j_0−b+b_0}) into σ(x^l_{i_0,j_0}) and the individual weight w^{l+1}_{a_0,b_0} into w^{l+1}_{a,b}. This gives the simplified Eq. 2.55:

$$\frac{\delta x^{l+1}_{i_0-a,\, j_0-b}}{\delta x^{l}_{i_0,j_0}} = \frac{\delta}{\delta x^{l}_{i_0,j_0}} \left( w^{l+1}_{a,b}\, \sigma\!\left( x^{l}_{i_0,j_0} \right) \right) = w^{l+1}_{a,b}\, \sigma'\!\left( x^{l}_{i_0,j_0} \right) \qquad (2.55)$$


Finally, substituting Eq. 2.55 into Eq. 2.53 gives the expression below, in which the flipped kernel expresses the convolution in backpropagation learning. Note that the derivations of forward and backward propagation vary depending on the layer through which the signal propagates in the network.

$$\frac{\delta L}{\delta x^{l}_{i_0,j_0}} = \sum_{a=0}^{k_1-1}\sum_{b=0}^{k_2-1} \delta^{l+1}_{i_0-a,\, j_0-b}\, w^{l+1}_{a,b}\, \sigma'(x^{l}_{i_0,j_0}) = \delta^{l+1}_{i_0-a,\, j_0-b} * \mathrm{rot}_{180^\circ}\{w^{l+1}_{a,b}\}\, \sigma'(x^{l}_{i_0,j_0}) \qquad (2.56)$$

 Highlight
The pooling layer basically helps reduce the spatial size of the feature-map representation progressively, limiting the number of parameters and computations in the network. This also helps regulate overfitting and can be implemented with max-pooling, average-pooling, or potentially L2-norm pooling. It is important to note that the pooling layer involves no learning (LeCun et al. 1989). A pooling layer combined with the large receptive fields of neurons in a sequence of convolutions results in translation invariance of the network.

Backpropagation through the pooling layer routes the loss to the units indexed during the forward-propagation reduction of each N × N block. For example, average-pooling assigns 1/(N × N) of the error to all units of the pooling block, while max-pooling passes the error only to the single winning unit and assigns zero to the block's non-contributing units.

Challenges with CNNs: CNNs extract spatial features from images. The arrangement of pixels and their relationships in an image are referred to as spatial features; they help us accurately identify an object, its location, and its relationship to other entities in the image. However, one of the most important aspects of CNNs is the model's generalization to previously unseen data (Table 2.2). The chapter does not give guidelines for choosing the perfect CNN for a specific application: conv-net research is advancing at a breakneck pace, and every other week or month a new state-of-the-art architecture for a given benchmark may be introduced, making it impractical to stick to a specific variant of CNN. The best architectures may be composed of the building blocks mentioned in this section, but they frequently face challenges such as overfitting, exploding gradients, and class-imbalance issues. The network frequently requires a large amount of training data, which is typically obtained through data augmentation. To counteract overfitting, control measures such as incorporating regularization, simplifying architecture complexities, and


early stopping are used. Exploding gradients, signalled by overflow or NaN values of the loss or its gradient, are addressed using gradient clipping, network redesign, and different activation functions. For CNNs, as for other NNs, class imbalance in the training data has long been a significant barrier. Dealing with variance in real-world data is another challenge in computer vision: often there are different viewing angles of the object in an image, changing lighting conditions, rotations, scaling, and a non-contrasting background, making it difficult for the model to identify an object. Additionally, CNNs have limitations in terms of adversarial attacks and spatial learning. If the spatial positions of the elements are permuted, a CNN cannot decipher their relationship; this is due to the lack of a coordinate frame, such as the one the human visual cortex provides, for comprehending the entire picture of the various components in the input data. This will be discussed for GNNs. Interestingly, the relationship between CNNs and the primate visual cortex lacks practical completeness. Even a complex state-of-the-art CNN model can be deceived by imperceptibly small, explicitly crafted perturbations, which makes it hard to map network layers to the visual system and understand its shortcomings. While it is hard to accurately predict neural responses in the visual cortex and make sense of the motion activity perceived by the brain (Cichy and Dwivedi 2021), we also lack proper knowledge of the exact mapping between the feature space of CNNs and the representational space of the visual cortex.

2.1.4.3 LeNet

LeNet (LeNet-5) is arguably the first published CNN, introduced by Yann LeCun with the primary purpose of recognizing handwritten digits of the MNIST dataset in order to process deposits at ATMs (LeCun et al. 1998). The network consists of (i) two convolutional encoding blocks (each with a 5 × 5 kernel convolution, a sigmoid activation, and a 2 × 2 average pooling operation with a stride of 2 that downsamples the spatial dimensionality by a factor of four), and (ii) three fully connected dense blocks performing the classification task. It was later discovered that ReLU and max-pooling perform better with this architecture. While the implementation of such an architecture appears remarkably simple with modern DL frameworks, the network received significant attention in the 1990s for its competitive performance against support vector machines. However, between the early 1990s and the watershed year of 2012, ML methods frequently outperformed NNs. This was due to the network's inefficiency in computing loss and accuracy while processing high-resolution images and categorizing a large number of object classes. In terms of computer vision, the comparison was a little unfair to convolutional networks: in the absence of computational accelerators, with small datasets, missing neural-network training tricks, and expensive sensors, learning on raw or lightly processed pixel values was compared to manually engineered, handcrafted feature extraction. End-to-end algorithmic learning was frequently relegated to an afterthought in favor of clever feature-extraction ideas. Until 2012, engineering a new set of feature


representations (a crucial part of the pipeline) was still done mechanically using prominent methods like the scale-invariant feature transform (SIFT) (Lowe 2004a), bags of visual words (BoW) (Sivic and Zisserman 2003; Csurka et al. 2004), histograms of oriented gradients (HOG) (Dalal and Triggs 2005), and speeded-up robust features (SURF) (Bay et al. 2006). Another group of researchers believed in a reasonably complex hierarchical learning of features with multiple connected learnable layers. Indeed, AlexNet was introduced in 2012 as a breakthrough in the ImageNet challenge, attributed to two major factors: (i) the availability of a large dataset, ImageNet, and (ii) the hardware acceleration capacity of graphics processing units (GPUs). We shall take a broader perspective on CNNs later in the chapter and, interestingly, observe that increasing the number of learned parameters does not always lead to better model accuracy.
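For readers who want to see the block structure in code, here is a minimal Keras sketch of a LeNet-5-style network. It is only an illustration of the two conv-pool encoding blocks and three dense blocks described above; the channel counts (6, 16) and unit counts (120, 84) follow the classic LeNet-5, and 28 × 28 grayscale MNIST-style inputs are assumed.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# padding='same' on the first block keeps the 28x28 spatial size before pooling
lenet = models.Sequential([
    layers.Conv2D(6, kernel_size=5, padding='same', activation='sigmoid',
                  input_shape=(28, 28, 1)),
    layers.AveragePooling2D(pool_size=2, strides=2),   # 28x28x6 -> 14x14x6
    layers.Conv2D(16, kernel_size=5, activation='sigmoid'),
    layers.AveragePooling2D(pool_size=2, strides=2),   # 10x10x16 -> 5x5x16
    layers.Flatten(),
    layers.Dense(120, activation='sigmoid'),
    layers.Dense(84, activation='sigmoid'),
    layers.Dense(10),                                  # class logits
])
lenet.summary()
```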

2.1.4.4 Deconvolutional Neural Network

Most of the convolution and pooling layers we have seen so far either reduce the input's spatial dimensions (height and width) or leave them unchanged. When performing a task such as semantic segmentation, we want the result to have the same spatial dimensions as the input. To accomplish this, the spatial dimensions of the intermediate feature maps must be upsampled, which calls for upsampling and transposed convolution. Deconvolutional Neural Networks (Deconv-Nets) are used to achieve this: they are basically CNNs that work in the opposite direction, reversing the downsampling operations of the convolution. This is also called fractionally-strided convolution (Dumoulin and Visin 2016). In deconvolution or transposed convolution, the kernel is convolved over the input tensor to broadcast each input element into an intermediate tensor; these intermediate tensors are then added together, making the final output larger than the input. Figure 2.16 shows how deconvolution works. In a Deconv-Net, the transposed convolution layer performs convolution and upsampling at the same time. Notably, the upsampling itself has no trainable parameters: it works by repeating the rows and columns of the image data according to their sizes. The function is denoted Conv2DTranspose and takes as parameters the number of filters, the filter size, and the stride. In contrast to a traditional convolution network, where padding and stride are applied to the input, here they are specified for the intermediate tensor (hence, the output) rather than the input. For example, the stride must be chosen so as to obtain the desired spatial dimension relative to the input; increasing the stride makes the intermediate tensors, and consequently the output tensor, larger in space. This is an important proposition that led to the development of autoencoder networks (Sect. 2.1.5), which can learn a lower-dimensional representation from higher-dimensional data while keeping track of the spatial structure.
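As a quick illustration of the shape behaviour described above, here is a minimal Keras sketch using Conv2DTranspose; the filter count, kernel size, and stride are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A 4x4 single-channel feature map upsampled to 8x8 by a transposed convolution.
x = tf.random.normal((1, 4, 4, 1))          # (batch, height, width, channels)
upsample = layers.Conv2DTranspose(filters=1, kernel_size=2, strides=2, padding='same')
y = upsample(x)
print(y.shape)                              # expected: (1, 8, 8, 1)
```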


Fig. 2.16 A Deconv-Net representation with the transposed convolution operation using 2 filters

2.1.5 Autoencoder Neural Networks

Autoencoder Neural Networks (AE NNs) are surprisingly simple types of ANNs that leverage unsupervised learning for the task of representation learning. We can think of this encoder-decoder architecture as a feedforward network trying to learn an identity function (i.e., take an input x and predict x itself). To make this non-trivial, we impose a bottleneck layer in the network with a smaller dimension than the input. This forces compression and subsequent reconstruction of the knowledge representation from the original input features, effectively compressing the input into a good representation. If the individual input features were independent, this knowledge encoding would be difficult; however, if there is correlation between the input features, the bottleneck becomes the key attribute, forcing the network to leverage dimensionality reduction of the high-dimensional data and learn the relationships in a low-dimensional space. If you wonder about the need for such an identity-mapping function, consider it a form of compression (i.e., reducing the file size), similar to mp3 compression of an audio file or JPEG compression of an image. It learns abstract features in an unsupervised manner that can then be applied effectively to supervised tasks. In its simplest form, an AE network with linear activations in each layer performs the same dimensionality reduction as principal component analysis (PCA). Due to the non-linear activations invoked in the network, however, AEs are more efficient at learning complex mappings onto a non-linear manifold rather than a subspace; this is a kind of non-linear dimensionality reduction of the input features. Typically, ReLU (Rectified Linear Unit) and sigmoid activation functions are the non-linearities used in AEs. A wide range of AE architectures exists, including sparse AEs, denoising AEs, contractive AEs, and sequence AEs. A Restricted Boltzmann Machine (RBM) is also like an AE with tied weights, except that its units are sampled stochastically, whereas AEs are not probabilistic models. Nevertheless, there is an AE-like probabilistic model known as the Variational Autoencoder (VAE), which requires some more advanced mathematics and will be discussed later in the book. As we demystify the components of AEs, we shall leverage the ideas learned from CNNs to encode the knowledge representation in the encoder block. AEs are primarily built of three parts:


Fig. 2.17 An architectural design of an Autoencoder with 2 units in the bottleneck trying to compress the input in the encoder block gθ and upsample the low level representation in the decoder block f φ for the final output representation

1. The encoder is the set of convolutional modules followed by pooling blocks that compress the input data into a smaller, low-dimensional representation encoded by the so-called 'bottleneck layer'.
2. The bottleneck is the noblest block of this NN and, ironically, the smallest one. It acts as a valve, permitting only the most vital information to pass from encoder to decoder. This is the layer that encapsulates the knowledge representation of the input and helps establish useful correlations between the various input features. Remember that the bottleneck is designed to capture the maximum information possessed by an image; thus, the smaller its dimension, the lower the risk of overfitting. However, a very small bottleneck restricts the amount of information stored, increasing the possibility of critical information slipping away through the pooling layers of the encoder. The layer's size therefore also acts as a regularization term for the network.
3. The decoder is the set of upsampling and convolutional modules that helps reconstruct the encoded bottleneck knowledge. It attempts to rebuild the original image, with reduced noise, from the compressed latent attributes passed as input to the block.


 Highlight
As we can see, Fig. 2.17 represents a symmetric network architecture in which the encoder and decoder are mirror images of each other. This does not necessarily mean that the network weights need to be, too.

It is exciting to learn the 'why' and understand the features learned by the network. Since AEs learn to compress data based on attributes discovered from the input feature vectors during training, they are typically only capable of reconstructing data similar to the class of observations seen during training. This property is utilized in applications of AEs such as dimensionality reduction, data denoising in images and audio, information retrieval, image inpainting, and anomaly detection. While AE variants like undercomplete or sparse autoencoders do not have large-scale applications in computer vision, the parameterized distribution at the bottleneck layer of a VAE can also be used to generate images and time-series data such as music, among other heavy applications, since its proposal in 2013 (Kingma and Welling 2013).

 THINK IT OVER
What is the trade-off balance for an ideal AE?
1. Sufficient sensitivity to the input to reconstruct the mapping accurately.
2. Sufficient insensitivity to the input to restrict the model from memorizing or overfitting the training data.
This balance forces the network to learn only the variation in features needed to reconstruct the input, without retaining the redundancies within the input.

The theoretical background of the network is relatively simple to understand at first glance, so we go through it briefly. The challenging part of the network's success is making it learn a meaningful representation from the input. Essentially, we can take an unlabeled dataset and frame it as a supervised learning task to output 'x̂', the reconstruction of the original input 'x'. This latent-space mapping using generalized non-linear compression is achieved using two functions.
 Step 1: Encoding the original data in a smaller latent space. The encoder block maps the original data x^{(i)}, with the (i)th input features, to a latent-space dimension in the bottleneck layer denoted by z^{(j)}, with the (j)th dimension. The encoder mapping, denoted φ: X → Z, with weight matrix w_φ connecting the encoder block to the middle bottleneck, bias b_φ, and non-linear activation function σ_φ for the block, is given in Eq. 2.57 as:

$$z^{(j)} = \sigma_\phi\!\left( w_\phi\, x^{(i)} + b_\phi \right) \qquad (2.57)$$

 Step 2: Decoding the encoded latent space into the target output. The decoder block maps the latent attributes back to the target output reconstruction x̂^{(i)}, which in this case is the same as the input. We denote the decoder function


by θ: Z → X in a similar fashion, but with a potentially different weight matrix w_θ connecting the middle bottleneck layer to the decoder block, a bias b_θ, and an activation function σ_θ:

$$\hat{x}^{(i)} = \sigma_\theta\!\left( w_\theta\, z^{(j)} + b_\theta \right) \qquad (2.58)$$

Note that the weights, bias, and activation unit for the decoder is usually a mirror of the encoder block but not always tied to the same value.  Step 3: Evaluation of the error function for the network optimization. The AE aims to generate the output x (i) as close as possible to the input features x (i) . This makes the overall mapping of the encoder-decoder network be (θ, φ) : X → X, where the network can be trained by minimizing the reconstruction loss L. This loss measures the difference between the original input and the resulting reconstructed output. Typically, an L-2 error is computed for k-feature maps in the network, as can be seen in Eq. 2.59: L(x (i) , x (i) ) =

2 1   (i) xk − xk(i) 2 i k

(2.59)
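A minimal NumPy sketch of Eqs. 2.57–2.59 for a single sample is given below; the dimensions, random weights, and the use of a sigmoid for both blocks are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, j = 8, 3                                               # input dimension and bottleneck size
W_phi, b_phi = rng.normal(size=(j, d)), np.zeros(j)       # encoder parameters (Eq. 2.57)
W_theta, b_theta = rng.normal(size=(d, j)), np.zeros(d)   # decoder parameters (Eq. 2.58)

x = rng.uniform(size=d)                                   # one input sample
z = sigmoid(W_phi @ x + b_phi)                            # encode: z = sigma_phi(W_phi x + b_phi)
x_hat = sigmoid(W_theta @ z + b_theta)                    # decode: x_hat = sigma_theta(W_theta z + b_theta)
loss = 0.5 * np.sum((x - x_hat) ** 2)                     # reconstruction loss (Eq. 2.59)
print(loss)
```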

 Step 4: Backpropagation of the error function through AEs. The mathematics above provides a good starting point for stacking simple AEs to create a deep AE. The standard backpropagation algorithm of feedforward networks is used in the training algorithm to fine-tune the AE weights and biases; this technique is known as layer-wise pre-training. In a deep AE, the decoder layer is replaced by two or more new layers, while the encoder and bottleneck layer remain unaltered. The network's old z layer becomes the input for the new layers, say z_2, z_3, and the final output of the stacked bottleneck layers becomes the input for the new final output decoder layer. Different stacking methods for the modules provide the variants of AEs. In most cases, this involves the formulation of a reconstruction loss function L that is sensitive to the inputs, plus a second, added regularizer term that discourages memorization or overfitting by the network. The gradient algorithm for backpropagation, optimizing the network toward an optimal trade-off between the two objectives, can therefore be represented as L(x^{(i)}, x̂^{(i)}) + Ω(h). The standard AE variants impose the two objectives and fine-tune the trade-off in various ways. For example, L1 regularization in Eq. 2.60 and the KL-divergence in Eq. 2.61 are two ways of measuring the hidden-layer activations for each training batch and adding a regularization term to the loss function that invokes the sparsity constraint, penalizing excessive activations in sparse AEs.

$$L(x^{(i)}, \hat{x}^{(i)}) + \lambda \sum_{k} \left| a^{l}_k \right| \qquad (2.60)$$


where λ is the tunable scaling parameter and a^l_k is the activation in the lth layer for observation k.

$$L(x^{(i)}, \hat{x}^{(i)}) + \sum_{j} \mathrm{KL}\!\left( \rho\, \|\, \hat{\rho}_j \right) \qquad (2.61)$$

Here, ρ is the sparsity parameter for the average activation of neurons over a collection of samples. The expectation is calculated for all nodes of the hidden layer as ρ̂_j = (1/k) Σ_i [a^l_j(x^{(i)})], where j denotes the specific neuron in layer l and the activations are summed over k training observations. In essence, by constraining the activation of neurons over a collection of samples, we encourage each neuron to be triggered only for a subset of observations; this is enforced by comparing the KL divergence of the ideal distribution ρ with the observed distribution ρ̂_j over all hidden-layer nodes. For two Bernoulli random variables, the KL divergence can be rewritten as Eq. 2.62, shown below.

$$L(x^{(i)}, \hat{x}^{(i)}) + \sum_{j} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right] \qquad (2.62)$$
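The sparsity penalty of Eq. 2.62 in isolation (it would be added to the reconstruction loss) can be sketched in a few lines of NumPy; the activation values and the target sparsity level below are illustrative.

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """KL-divergence sparsity penalty of Eq. 2.62 for Bernoulli distributions.

    rho     : target sparsity level (scalar), e.g. 0.05
    rho_hat : observed average activation of each hidden unit over a batch, shape (j,)
    """
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)   # numerical safety
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# example: hidden activations for a batch of 4 samples and 3 hidden units
activations = np.array([[0.9, 0.1, 0.0],
                        [0.8, 0.0, 0.1],
                        [0.7, 0.2, 0.0],
                        [0.9, 0.1, 0.1]])
rho_hat = activations.mean(axis=0)               # rho_hat_j over the batch
print(kl_sparsity_penalty(0.05, rho_hat))
```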

Challenges with AEs An AE is typically thought of as a combination of two NNs: an encoder and a decoder block with data-cleansing capabilities. It distills inputs into the densest form of representation, but it is not a one-size-fits-all tool and faces numerous application challenges. The network operates in an unsupervised mode, learning from data rather than labels provided by humans. This frequently necessitates the availability of a large and less noisy dataset in order to learn meaningful representations of the data. Aside from a large amount of training data, the data must be relevant to the use case, as the network, like many other algorithms, is data-specific. This means that the AE’s ability to learn and reproduce input features is limited to the training data available. It is a good idea to introduce noise, add specialized application regularization, and segment the input features using different unsupervised techniques before feeding them into an AE to improve the sturdiness of the algorithm. Most of the time, it is difficult for the network to understand the application’s relevant attributes. Creating a good AE is a trial and error process. Furthermore, the network is vulnerable to significant performance loss as a result of system compression degradation. Even when researchers try to aggressively prune the problem space to contain the loss, the decoder process is never perfect. It is critical to determine the tolerable loss in the reconstruction output.


2.1.5.1 Variational Autoencoders

Variational Autoencoders (VAEs) (Kingma and Welling 2013) have emerged prominently since they were first proposed in 2013. While AEs serve as a hidden representation of raw data input compressed into salient dimensions, VAEs are generative algorithms with an additional constraint on the data encoding: the normalized hidden representations allow VAEs to compress data like AEs and to generate synthesized data like a GAN. We shall discuss GANs in detail soon. It is worth noting that in many cases the data generated by VAEs still tends to be blurry, competing with the fine, granular, detailed image generation of GANs.

 THINK IT OVER
For curious readers who try to find a link between the AE and content generation: it may be tempting to believe that once the AE has been trained, we can pick a random point from the encoder's latent space and decode it to generate new content, provided the latent space is regular enough. In this case, the network's decoder would function like the generator model of a GAN (see Sect. 2.1.6). However, it is difficult to guarantee a priori that the encoder will organize the latent space in a regular way without any reconstruction loss. The regularity of the space for AEs is determined by the distribution of the data, the architecture of the encoder, and the dimension of the hidden, latent space. In general, encoding into a high-dimensional latent space without information loss results in significant overfitting and useless content when decoded. AEs are trained solely to encode and decode while minimizing the loss to its extreme, regardless of how the latent space is arranged. As a result, latent-space regularity is a more general problem in AEs that requires consideration. Unless explicitly regularized, the network therefore tends to overfit during training in order to achieve its task.

The basic idea behind the VAE is to regularize the encoded distribution during training to ensure that the latent space has good properties and generates good results, as presented in Fig. 2.18. The encoding-decoding process is only slightly modified from that of standard AEs: here, the input is encoded as a distribution over the latent space rather than as a single point. This latent-space distribution is used to randomly pick a sample that is fed to the decoder model, essentially enforcing a smooth, continuous latent-space representation. This suggests that nearby latent-space values should correspond to similar reconstructions. Equation 2.63 represents the loss function of the network, which consists of two terms: the first term penalizes the reconstruction error (or maximizes the likelihood of reconstruction), while the second term is a regularizer that encourages the learned distribution q_φ(z|x) to be close to the true prior distribution p_θ(z), which is assumed to be Gaussian. The architectural design of the network is shown in Fig. 2.19.

Fig. 2.18 The graphical illustration of the shift of knowledge encoding from discrete latent feature attributes for AE (on the left) to probabilistic latent feature distribution for VAE (on the right)


Fig. 2.19 An overview of the VAE architectural design, with two encoded latent features z_1, z_2 sampled from the means and standard deviations that the network models from the input

$$L(\theta, \phi) = -\,\mathbb{E}_{z \sim q_\phi(z|x)}\!\left[ \log p_\theta(x|z) \right] + D_{KL}\!\left( q_\phi(z|x)\, \|\, p_\theta(z) \right) \qquad (2.63)$$
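A minimal NumPy sketch of Eq. 2.63 is given below, assuming a Gaussian posterior with diagonal covariance and a unit Gaussian prior as stated above; the squared-error reconstruction term is a common stand-in for the negative log-likelihood under a fixed-variance Gaussian decoder, and the sampling line illustrates drawing z from the encoded distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, x_hat, mu, log_var):
    """VAE objective of Eq. 2.63 for q(z|x) = N(mu, diag(exp(log_var))) and p(z) = N(0, I)."""
    recon = 0.5 * np.sum((x - x_hat) ** 2)                       # reconstruction term
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))  # closed-form Gaussian KL
    return recon + kl

mu, log_var = np.array([0.1, -0.2]), np.array([-1.0, -0.5])
z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)       # sample z ~ q(z|x)

x = rng.uniform(size=4)
x_hat = x + 0.05 * rng.normal(size=4)                            # stand-in decoder output
print(vae_loss(x, x_hat, mu, log_var))
```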

 Highlight For VAE, the encoder is sometimes referred to as recognition model, while the decoder is referred to as generative model.

2.1.6 Generative Adversarial Networks

We have now learned innumerable ways to simulate intelligent, human-like behavior with discriminative algorithms. If someone thinks that NN algorithms cannot be creative artists, VAEs prove them wrong by demonstrating the logic-based creativity of AI models. Generative Adversarial Networks (GANs) are an important breakthrough unlike anything the AI community had seen before, introduced by a then-unknown Ph.D. fellow named I. Goodfellow and his team (Goodfellow et al. 2014). At a high level, GANs are algorithmic architectures that use two NNs, a generator and a discriminator, which work against each other in an adversarial, competing fashion. These are


good for generating realistic-looking new data that strikingly resemble the underlying distribution of training data. “Generative Adversarial Networks is the most interesting idea in the last ten years in ML.” — Yann LeCun, Director, Facebook AI

GANs quickly began to feature in the public consciousness due to their wide variety of professional-looking artistic applications. In 2018, a GAN trained on fifteen thousand real portrait paintings was used to generate a canvas masterpiece called Edmond de Belamy (Schneider and Rea 2018). Its sale for the massive sum of $432,500 at auction made headlines across the globe, bringing AI art to the public eye. In 2019, NVIDIA published the source code of its StyleGAN network (Abdal et al. 2019), allowing the public to generate fake faces on demand. In subsequent years, the potential of adversarial networks to create deepfakes gained a lot of attention. A deepfake is a synthetic generation of fake audio and video that is potentially convincing enough to be perceived as true. For instance, hoaxes involving political figures spread like wildfire, raising concerns about the growing problem of media manipulation and fake news.

Highlight: Characteristic features of generative modeling. Generative modeling falls under the umbrella of unsupervised learning, where the synthetic generation of data is based on the distributions and patterns discovered from the training input. Generative models can fall into the following categories:
1. Given a label, they estimate the associated features (say, Naive Bayes).
2. Given some features, they estimate the rest (say, inpainting or imputation).
3. Given a hidden representation, they estimate the associated features (VAE, GAN).
Interestingly, Naive Bayes is an example of a generative model often used as a discriminative model. Given a sample, it computes the possible individual outcome for each variable and combines the independent probabilities to predict the most likely outcome. Here, it summarizes the probability distribution of each input and the output class. As a generative model, it works in reverse to generate new independent feature values from the probability distribution of each variable. Other popular generative models include Latent Dirichlet Allocation (LDA), the Gaussian Mixture Model (GMM), and DL-based methods like the Deep Belief Network (DBN) and the Restricted Boltzmann Machine (RBM).


Different variants of GANs have been developed, delivering on the promise to generate realistic examples in diverse problem domains. They are capable of producing synthetic, high-fidelity audio, which improves voice translation. To augment sophisticated domain-specific datasets, they learn from training samples and generate synthetic images. High-resolution frames can be generated from the finer details captured in low-resolution videos, or the next frame in a video can be predicted. They can also produce fake images from the distribution of the training set (e.g., new human faces) using random noise, or synthesize images from a text description. StyleGANs, CycleGANs, and CartoonGANs are applications that can transfer representations from one image to another in real time. There are numerous other applications, which is why GANs are a hot topic in AI research.

THINK IT OVER: What is the difference between GANs and VAEs? Both are deep generative models, i.e., they model the distribution of the training data to generate examples. However, the models differ primarily in the way they are trained. VAEs learn a lower-dimensional representation of the relevant information in raw data and are trained to minimize a reconstruction loss while reproducing the samples of the training data. GANs, on the other hand, optimize two loss functions in an adversarial manner to produce samples that look more realistic and visually similar to the training data. However, the adversarial training of GANs, with its more complex two-loss design, is much slower than that of VAEs.

Before we dive into the mathematics, let us familiarize ourselves with some notation frequently used in the learning process of GANs. Figure 2.20 presents the overview of a GAN network with the generator and discriminator blocks and the corresponding notation (Table 2.3).

2.1.6.1 The Discriminator

The discriminator is a well-understood binary classification model. It aims to correctly distinguish fake, generated samples from true empirical training data; the overall idea of the discriminator is to clearly recognize the generated data. It returns a probabilistic prediction of whether an image is fake or genuine, in the value range [0, 1]: a value closer to 1 signifies a real image, while a value closer to 0 suggests a fake one. The feedback from the discriminator's learning is iteratively fed to the generator to improve the generator's performance. Mathematically, the discriminator tries to estimate the probability that a sample comes from the training data and not from the generative model. Given inputs x, the model learns the conditional probability distribution p(y|x) (i.e., the probability of y given x) so as to classify the inputs into the corresponding target y as correctly


Fig. 2.20 A general architectural overview of GAN

Table 2.3 Summary of common notations used here in GAN

Notation | Purpose
x | The training samples or real data provided for training
z | Noise or latent vector
p_data(x) | Original data distribution for the entire training sample
p_g(z) | Generated data distribution for the entire training sample
D(x) | Discriminator's probability evaluation for a real input sample x, i.e., p(y|x_real) → {0, 1}
G(z) | Fake data generated by the generator model for a random fake seed z
D(G(z)) | Discriminator's probability evaluation of a fake sample G(z) being real, i.e., p(y|x_fake) → {0, 1}
E_x | The expected value over all real training samples
E_z | The expected value over all random inputs to the generator
V(G, D) | Loss/error value function between G and D

as possible. It does not care much about how the data are generated or distributed—for example, logistic regression and SVMs. During discriminator training, the goal is to correctly classify real data coming from p_data(x) as real (i.e., y = 1). Here, the discriminator seeks a high value of the objective on real data in Eq. 2.64.

L_{D-real} = L(D(x), 1) = E_{x∼p_data(x)}[log(D(x))]        (2.64)


Similarly, the discriminator strives to classify the fake data G(z) from the generator as fake, ideally zero (i.e., y = 0). This gives the loss function for fake data in Eq. 2.65, which the discriminator also strives to maximize.

L_{D-fake} = L(D(G(z)), 0) = E_{z∼p_g(z)}[log(1 − D(G(z)))]        (2.65)

Here, we use a very generic notation for the loss L(·, ·): it denotes the distance or difference between its two arguments, summed over the samples as in Eq. 2.64. If this feels similar to binary cross-entropy or KL divergence, you have gained a fairly good understanding of it. Now, for a given fixed discriminator D(·) and fixed generator G(·) (i.e., fixed parameters or model), the discriminator network wishes to maximize V(G, D) in Eq. 2.66.

V(G, D) = (1/k) Σ_{i=1}^{k} [ L(D(x_i), 1) + L(D(G(z_i)), 0) ]
        = max_D [ log(D(x)) + log(1 − D(G(z))) ]        (2.66)

2.1.6.2 The Generator

The generator model captures the data distribution and aims to mislead the discriminator as much as possible. It takes a random latent noise vector as input and outputs synthetic data after performing transformations based on the features learned during training. This is represented in Eq. 2.67. Mathematically, it tries to fool the discriminator into mislabeling the generated image as real. Given an input x, the model should learn p(x|y) (i.e., the probability of the features x given a label y) from the joint probability distribution P(x, y) = p(x|y)·p(y) in order to generate plausible inputs and their labels from the target y. Here, it does care about how the training data is distributed in order to learn to generate x, for example, Naive Bayes. The generator is penalized if it fails to generate an image close enough to the real ones to fool the discriminator. Ultimately, the generator aims to fool the discriminator, while the discriminator seeks to outdo the generator.

L_G = L(D(G(z)), 1) = E_{z∼p_g(z)}[log(1 − D(G(z)))]        (2.67)

For a fixed discriminator, the generator attempts to minimize the same V(G, D) presented in Eq. 2.66. However, it is worth noting that we define the loss functions separately for the generator and the discriminator. This is due to the steeper gradient of y = log x compared to its counterpart y = log(1 − x) when x is close to zero. This means that attempting to maximize log(D(G(z))) leads to faster and more substantial improvements than minimizing log(1 − D(G(z))).
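The practical consequence of this gradient argument can be written down directly. The snippet below is a small sketch (not the book's code) contrasting the saturating generator loss, which minimizes log(1 − D(G(z))), with the non-saturating variant, which maximizes log(D(G(z))); both assume the discriminator outputs probabilities d_fake in (0, 1).

import torch

def generator_loss_saturating(d_fake):
    # Minimize log(1 - D(G(z))): gradients are weak when D confidently rejects fakes
    return torch.log(1.0 - d_fake).mean()

def generator_loss_non_saturating(d_fake):
    # Maximize log(D(G(z))), i.e., minimize -log(D(G(z))): stronger gradients early in training
    return -torch.log(d_fake).mean()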


V(G, D) = min_G max_D [ log(D(x)) + log(1 − D(G(z))) ]        (2.68)

2.1.6.3 Combined Loss Optimization

Combining the loss functions from both networks gives the final loss for the model, which follows a min-max zero-sum game. Remember, the loss in Eq. 2.68 is valid for a single data point. Therefore, we take the expectation in Eq. 2.69 to consider the entire training data. The resulting expression is the value function V(D, G), the same as described in the original paper by Goodfellow et al. (2014).

V(D, G) = min_G max_D  E_{x∼p_data(x)}[log(D(x))] + E_{z∼p_g(z)}[log(1 − D(G(z)))]        (2.69)

The basic training of this vanilla network to achieve the ideal min_G max_D V(D, G) can be performed approximately using the standard iterative stochastic gradient descent learning rule. Here, the one-line min-max formulation of training both networks jointly presents the adversarial nature of the models. The optimal D can be calculated by taking the partial derivative of Eq. 2.69 with respect to D(x) and setting it to zero, as in Eq. 2.70,

p_data(x)/D(x) − p_g(x)/(1 − D(x)) = 0        (2.70)

Rearranging the terms gives us the condition for the optimal discriminator D_opt(x) in Eq. 2.71.

D_opt(x) = p_data(x) / (p_data(x) + p_g(x))        (2.71)

Note that the above equation is important both mathematically and intuitively. For a highly genuine sample, we expect p_data(x) to be close to 1 and p_g(x) to converge to 0, so the optimal discriminator assigns a value close to 1 to the sample. For a generated fake sample, x = G(z), we expect the discriminator to assign a value close to 0, since p_data(G(z)) converges towards zero. However, it is difficult to compute the optimal D_opt in practice, as p_data(x) is unknown. Next, to train the generator, we assume the discriminator to be fixed, here using the optimal discriminator of Eq. 2.71. Substituting this expression into the value function (Eq. 2.69) and exploiting the properties of the logarithm, we can reduce the equation to an expression in terms of the Kullback-Leibler (KL) divergence:

V(D_opt, G) = −log 4 + D_KL( p_data ∥ (p_data + p_g)/2 ) + D_KL( p_g ∥ (p_data + p_g)/2 )        (2.72)


This is done so that the expression can be written in terms of the Jensen-Shannon (JS) divergence (see Appendix A.6). Expressing the optimal value function of Eq. 2.72 in the form of the JS divergence gives Eq. 2.73,

V(D_opt, G) = −log 4 + 2·D_JS( p_data(x) ∥ p_g(x) )        (2.73)

When the network reaches the optimal condition, i.e., p_g(x) approaches p_data(x), Eq. 2.73 reduces to −log 4, since the divergence becomes zero. As the generator wants to minimize the above equation, the optimal solution is reached when the equation reduces to V(D_opt, G) = −log 4. This is the theoretical solution to optimality. In practice, it might be hard to converge to p_g(x) = p_data(x); this failure to converge is sometimes called mode collapse. It can possibly be caused by:
1. A non-unique equilibrium.
2. The iterative approach failing to converge despite a unique equilibrium.
3. The generator network not having 'enough parameters' to specify any distribution (especially p_data(x)), so that the existence of an equilibrium is not ensured.
4. Non-overlapping or very little joint support between p_data(x) and p_g(x) for the distribution of training data, causing learning to stagnate.
In conclusion, optimal training attempts to minimize the value function V(D, G). This implies that the JS divergence between the probability distribution of the original data and that of the generated data converges to as small a value as possible. It aligns with our intuition that the generator should learn the underlying distribution of the data in the training sample.

Highlight: Linking CNNs and GANs. Typically, a GAN model works with image data, leading to the prevalent use of CNNs as generator and discriminator networks. This is arguably due to the remarkable progress of CNNs in achieving state-of-the-art results on computer vision tasks. Secondly, modeling image data amounts to providing a compressed representation of the set of images in a latent space for model training. The fact that the results can be viewed and assessed visually helped GAN models develop rapidly compared to other generative models.
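Putting Eqs. 2.64-2.69 together, a minimal and deliberately simplified training loop alternates one discriminator step and one generator step per batch. The sketch below assumes a generator G, a discriminator D with sigmoid outputs, a data loader, and a noise dimension z_dim are already defined; binary cross-entropy is used, which matches the log terms of the value function.

import torch
import torch.nn.functional as F

def train_gan_epoch(G, D, loader, opt_G, opt_D, z_dim, device="cpu"):
    for real, _ in loader:
        real = real.to(device)
        b = real.size(0)
        ones = torch.ones(b, 1, device=device)
        zeros = torch.zeros(b, 1, device=device)

        # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
        z = torch.randn(b, z_dim, device=device)
        fake = G(z).detach()                       # do not backpropagate into G here
        loss_D = F.binary_cross_entropy(D(real), ones) + \
                 F.binary_cross_entropy(D(fake), zeros)
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # Generator step: non-saturating loss, maximize log D(G(z))
        z = torch.randn(b, z_dim, device=device)
        loss_G = F.binary_cross_entropy(D(G(z)), ones)
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()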

2.1.6.4 Conditional GANs

The incorporation of (i.e., conditioning on) some additional auxiliary information, fed to both the generator and the discriminator as an additional input layer, is a critical extension to GANs. Conditional Generative Adversarial Nets (cGAN) by Mirza et al. (2014) are a novel way to train generative models that allows conditioning to control the data generation process. The conditioning could be based on class labels, features, or data from other modalities.


The objective function of the two-player mini-max game becomes Eq. 2.74. This is an extension of the original objective with some additional auxiliary information y; to allow conditioning of the network, y is fed into both the generator and the discriminator models as an additional input layer. This was more of a proof of concept in 2014, presented with the hope that further exploration of the hyper-parameter space and the architecture would help it match or surpass non-conditioned results. This hope was quickly fulfilled when cGAN and the deep convolutional generative adversarial network (DCGAN) emerged as two of the most influential early GANs, inspiring a plethora of new research applications. In 2017, pix2pix (Isola et al. 2017) became one of the remarkable early applications based on cGAN and revolutionized image-to-image (I2I) translation problems. Within a year, other GAN variants appeared, such as the Cycle-Consistent Adversarial Network (CycleGAN), which succeeded in outperforming pix2pix and did so without the use of paired training samples.

min_G max_D L(D, G) = E_{x∼p_data(x)}[log(D(x|y))] + E_{z∼p_z(z)}[log(1 − D(G(z|y)))]        (2.74)

In summary, the discriminator verifies the authenticity of the relationship between x and y, whereas the generator looks at understanding how to obtain x and generate new features. In network training, we always try to minimize the loss function. In this case, the generator should try to minimize the difference between 1 and the discriminator's label for the generated data, while the discriminator simultaneously minimizes its evaluation of the generated fake data. This is an iterative process that continues until both the G and D networks reach a Nash equilibrium; the procedure is known as adversarial training. Either side can become more dominant than the other during the iterative training process. A dominant discriminator returns values very close to 0 or 1, making the gradient hard for the generator to use. In contrast, a dominant generator exploits the discriminator's weaknesses, resulting in false negatives. For an effective implementation, it is critical to level the performance of the two NNs using appropriate learning rates.
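A common way to realize this conditioning, sketched below under the assumption of one-hot class labels y, is simply to concatenate the condition to the generator's noise vector and to the discriminator's input; the layer sizes are illustrative and not those of Mirza et al. (2014).

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=100, y_dim=10, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + y_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Tanh())

    def forward(self, z, y):
        # Condition the generator by concatenating the label to the noise vector
        return self.net(torch.cat([z, y], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self, x_dim=784, y_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + y_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x, y):
        # The discriminator judges the pair (sample, condition)
        return self.net(torch.cat([x, y], dim=1))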


Highlight: Characteristic features of Generative Adversarial Networks:
1. The models are trained simultaneously in an adversarial zero-sum game until the Nash equilibrium is achieved.
2. The discriminator D acts as a binary classification neural network that computes the loss for both real data D(x) and fake data D(G(z)) and combines them.
3. The generator G computes a separate loss objective from its noise input.
4. The two competing loss functions can be trained using only backpropagation.
5. With adversarial training in play, the network can generate remarkably sharp images.
6. The unsupervised learning scheme of GANs eliminates the expensive data labeling needed to train the network.
7. Convolutional networks are a more sensible alternative to the fully connected networks of the original GAN.

Challenges with Generative Adversarial Networks. A powerful tool often comes with its own challenges and must be used wisely, so it is worth going through some of the frequently faced problems of GAN training. During training, the generator (G) aims to trick the discriminator (D) into thinking that the output produced by G is real. If D does not push G to learn diverse representations of complex real-world data, there is a high chance of mode collapse: G gets stuck producing a limited variety of samples, while D finds it hard to distinguish them. Repeatedly producing the same samples, adhering to a few learned properties, leads to complete mode collapse, while a limited variety of generated results leads to partial mode collapse.

Highlight: The popular vanishing gradient issue of deep neural networks gets stronger when training GANs. This is because of the double feedback loop in network training: apart from the feedback loop with the ground truth, the gradient of the loss at the discriminator also provides feedback to the generator network. This brings instability into the training. A discriminator that becomes too weak hardly provides any useful feedback to the generator, producing no significant signal in the loss. On the contrary, if the discriminator becomes too strong too quickly (i.e., D(x) → 1, D(G(z)) → 0), it pushes the generator's loss term log(1 − D(G(z))) to zero and consequently the gradient of the loss curve to zero. This brings the learning to a halt.


In the fight for optimal training of the entire network (GAN), the non-cooperative game between G and D makes it hard to achieve the Nash equilibrium. Ideally, the network expects D to be neither losing nor winning irrespective of the performance of G. This is harder to achieve than in other neural network training, and GANs often enter a stable orbit of gradient descent rather than converging to the desired equilibrium state. The loss curve is often deceptive compared to other DL techniques. It is not usual to apply early stopping, since the metric does not provide a proper evaluation of GAN training; therefore, visual inspection is common practice to validate correct training. Lastly, the evaluation of GANs is not obvious either. The main difference lies in the generator and discriminator loss functions: it depends on whether the discriminator outputs a probability or is unbounded, and on how the gradient is penalized and evaluated. The Inception Score (Salimans et al. 2016) and the Fréchet Inception Distance (FID) (Heusel et al. 2017) are some of the methods for evaluating the model.
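As an illustration of the latter, the Fréchet Inception Distance compares the mean and covariance of Inception features extracted from real and generated samples. The sketch below assumes those feature matrices are already available as NumPy arrays; it is an illustrative computation, not the reference implementation of Heusel et al. (2017).

import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_fake):
    # feat_real, feat_fake: (n_samples, n_features) Inception activations
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    # FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical noise can introduce tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))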

2.1.7 Graph Neural Networks

It is reasonable to assume that recent advances in NNs such as RNNs, CNNs, GANs, and AEs have fast-paced research and technological success in pattern recognition and large-scale data handling. We now understand the DL approach to capturing Euclidean data, such as images, text, or categorical data. These NNs expect their input in a uniform, rectangular array format, such as fixed input features for perceptrons, a fixed-size grid-like structure for CNNs, and input-output time sequences for RNNs. Unfortunately, a graph can come with a complex topology, a variable number of unordered nodes, and characteristics like permutation invariance, unlike convolutional networks, where changed pixels are perceived differently by the model. This makes traditional networks unsuitable for this kind of learning paradigm. So it is time to shift our focus to handling non-Euclidean data structures represented as graphs, with far more complex connections and inter-dependencies between the objects (Bronstein et al. 2017). Graphs are ubiquitous in the real world, penetrating diverse DL domains and carrying rich information. Therefore, it is equally important to understand how researchers have found ways to incorporate graph data in neural network learning, resulting in what we call Graph Neural Networks (GNNs) (Scarselli et al. 2008). They are a powerful tool for solving many critical tasks like fake news detection (Benamira et al. 2019), recommendation systems (Eksombatchai et al. 2018), physics simulation (Sanchez-Gonzalez et al. 2020), transport management (Guo et al. 2020), and drug discovery (Atz et al. 2021). It is exciting to comprehend how the encoder function is able to incorporate the locality of the neighborhood, aggregate the knowledge, and make the network computation efficient with the stacking of graph layers. A computational graph captures the locality information, followed by information aggregation using the NNs while preserving the order-invariance property. Finally, the slightly tweaked


feedforward propagation rule in GNNs with message passing encapsulates the input-output information of a NN.

Highlight: Permutation invariance (Maron et al. 2018) states that rearranging the position or order of the nodes should make no difference as long as the connections between the nodes remain the same.

Here is a brief overview of graphs, with the taxonomy defined, to make the process of understanding GNNs less cumbersome. A graph is a natural way to express a set of objects and their relations. Almost anything with linked entities, such as people, objects, and concepts, can be handled using graphs as a powerful visualization tool. Fundamentally, a graph with vertices/nodes (V) and edges/relations (E) is denoted by G = (V, E). The edges are considered 'directed' if there are dependencies e_ij = (v_i, v_j) ∈ E between nodes v_i ∈ V, and 'undirected' otherwise, having pairs of edges with inverse directions for connected nodes. The edge directionality is denoted using an arrow (→) from v_i to v_j as E_{v_i→v_j}. Finally, the connectivity of a graph is expressed using an adjacency matrix A ∈ R^{n×n}, with n being the number of nodes. The elements of this binary matrix represent adjacent nodes by '1' (e_ij ∈ E) and disconnected nodes by '0' (e_ij ∉ E). An adjacency list is another way to represent graph data. A symmetric adjacency matrix means that the graph is undirected. Nodes and edges can have additional properties: node features X_v, where the feature vector of a node v is denoted by x_v ∈ R^d, and edge features X_e, where the attribute of an edge (v, u) is denoted by x_{v,u} ∈ R^c. This is all we need to know for now, assuming we are also familiar with feedforward NNs, before we proceed with GNNs.

Highlight: The core assumption of existing DL and ML algorithms is that the instances are independent of each other. Graph data, in contrast, have interdependent links between instances of various types. This makes graphs a potent tool for visualization and data representation. Counter-intuitively, one can learn more about the symmetries and structure of text and images by viewing them as graphs (refer to Sect. 3.2.5.5). GNNs can potentially perform effectively where CNNs have failed. However, if that is the case, it is worth pondering why we do not use GNNs everywhere.
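To make these definitions concrete, the short sketch below builds the adjacency matrix, node-feature matrix, and adjacency list of a tiny undirected graph; the four-node toy graph and the feature dimension are made up purely for illustration.

import numpy as np

# Undirected toy graph with 4 nodes and edges (0-1), (0-2), (1-2), (2-3)
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix A in R^{n x n}: symmetric because the graph is undirected
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Node features X_v in R^{n x d}, here d = 3 random features per node
d = 3
X = np.random.rand(n, d)

# Adjacency list: an alternative, sparser representation of the same connectivity
adj_list = {v: np.flatnonzero(A[v]).tolist() for v in range(n)}
print(A)
print(adj_list)   # {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}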


THINK IT OVER: How is graph data different?
1. Graph data have arbitrary input shape and size, which is also true of the image datatype. However, images can be resized, padded, and cropped before being fed into the network, operations that are not defined on graph data.
2. Isomorphism is a property of graphs stating that two different-looking graphs can be structurally identical. Here, only the order of the nodes changes, not the properties. This is also why we cannot use the adjacency matrix directly as input to a feedforward network: it is sensitive to changes in node order.
3. The algorithm handling graphs needs to be permutation invariant to cope with the isomorphism property of a graph.
4. Lastly, graph data has a non-Euclidean structure. This is also why the DL field involving graphs is called deep geometric learning.

GNNs are a sub-branch of DL techniques aimed at inferring knowledge directly from the graph representation of the data (representation learning). They include the concept of node embedding, which maps the node features and edge features to a d-dimensional embedding space (usually of lower dimension than the graph itself). The network translates the formatted input data into a vector representation of the relevant information of nodes and their relations, which we call a 'node embedding'. This notion of embedding is similar to the use of word embeddings in NLP, which transform detailed information about terms and their relations into a numerical representation that can be differentiated and learned. These networks incorporate graph knowledge into embeddings to perform graph-level, node-level, and edge-level prediction tasks. The network model has several intermediate message passing layers, the core building blocks of a GNN. These are responsible for combining node and edge features into node embeddings. As the layers of the GNN increase, each subsequent layer repeats the message-passing operation, collecting information from the neighborhood and using an aggregation function to combine it with the values of the previous layers. For instance, in a social network analysis model, the first layer aggregates a user's data and direct followers. The next layer adds the friends of those followers. Finally, the last layer of the GNN produces an embedding that is the vector representation of the user's data and the information of the other users in the graph. Note that the update and aggregation functions used in the layers should be differentiable for effective model learning. Note also that some learning problems (graph, node, or edge prediction) require other variants of GNN architectures. For example, we use the embedding vector of nodes in node-level predictions, while graph-level predictions need to combine the node embeddings in a certain way. Alternatively, we can use a pooling operation to compress the graph to a fixed-size vector, and this representation can be used to run a graph-level prediction. Finally, the edge features can also be processed into these node embeddings. Using a GNN, nodes with similar characteristics or context will result


Fig. 2.21 Types of graph-level prediction task. The orange marker highlights the task associated with the prediction

in similar node embeddings. Consequently, similar graphs result in similar graph embeddings. The embedding size is a hyperparameter and can differ from the initial node input size. The different types of graph prediction tasks are described below (Fig. 2.21).
1. Graph-level tasks aim to predict a single property or characteristic for the entire graph. This can involve graph classification into different categories. For example, given the molecular composition of a drug presented as a graph, we need to predict whether it will bind to a receptor for a disease. This is analogous to image classification for MNIST data or sentiment analysis for an input statement. Other applications may involve graph clustering based on either vertex, edge, and edge-distance similarity or graph-based similarities, and graph visualization.
2. Node-level tasks aim to predict the property or role of individual nodes in a graph. This can involve semi-supervised training on a partly labeled graph for node classification, determining the labeling of the subjects (nodes) from information in their neighborhood. A well-known example is the social network graph of the Zachary (1977) karate club dataset, where the nodes represent the individual karate practitioners and the edges represent the relations between these practitioners outside of the club. In this karate club example, after a political rift, the prediction problem is to classify a given member's loyalty to one of the two leading individuals. This is analogous to image segmentation, attempting to label the role of an individual pixel in an image, or to identifying the part of speech in a sentence.
3. Edge-level tasks aim to predict the property of the relations represented by edges in the graph. Here, the algorithm emphasizes learning the connections between entities for link prediction between specific entities in the graph. For example, Netflix uses link prediction to suggest new video recommendations. Another example is a system for predicting criminal associations. These tasks are analogous to image scene understanding or image caption generation, incorporating relationships beyond object identification or classification.


Highlight: Graph visualization is a commonly appearing term in networking, distributed systems, software engineering, bioinformatics, data structures and algorithms, AI, robotics, and many other disciplines. It is a powerful way to represent structural information as a meaningful network. Many studies suggest that visual representation is an easier and more effective form of communicating and assimilating information, with a higher chance of discovering insights and getting a better context of the presented data. In GNNs, the application sits at the intersection of geometric graph theory (mathematics) and information visualization (computer science), concerned with the visual representation of graphs for the study of structure and anomalies in the data. This is also why many applications of GNNs are termed "geometric neural networks". The field is gaining traction due to advancements in drug discovery, fake news detection, fighting financial crime, IT operations management, and the life sciences.

Theorem: The Banach fixed point theorem (Banach 1922), a.k.a. the contraction mapping theorem, first stated in 1922, can be considered an abstract formulation of Picard's fixed-point iterative approximation. It ensures the existence and uniqueness of an invariant point (a.k.a. fixed point, a value x that belongs to both the domain and co-domain of the function f(x)) for certain self-maps of metric spaces, and it provides a constructive process to find those fixed points. Inequalities bound the speed of convergence.

Let's summarize how graph NNs are mathematically defined and implemented for supervised and unsupervised learning tasks. Here, we typically define a node classification task. These are the equations first proposed by Scarselli et al. (2008) and often considered the original GNN. Note that Eq. 2.75 presented here can be simplified in different ways into several minimal models (i.e., the smallest number of variables needed to define the model while retaining the same computational ability). Nevertheless, the equation closely represents the intuitive notion of the neighborhood, and the expression is consistent with the ones specified for other networks earlier in the chapter, with the usual notations. This is analogous to the behavior observed between the hidden and output states of NNs. An overview of the model equations in compact form, using the Banach fixed point theorem, is:

O = F(h, x_N),    h = f(h, x)        (2.75)

Here, O, h, x, and x_N are the vectors obtained by stacking all the outputs, states, features, and node features respectively from Eq. 2.76. Here, F in Eq. 2.75 is


the global output function that defines how the output is produced, and f is the global transition function; they are the stacked versions of F_l and f_l respectively, defined in Eq. 2.76. However, Zhou et al. (2020) list three primary limitations of the original GNN architecture:
• If the presumption of a "fixed point" is relaxed, it is feasible to leverage MLPs to learn a more stable representation and remove the iterative update process. In the architecture above, every iteration uses the same parameters of the transition function f, whereas distinct parameters in the different layers of an MLP allow hierarchical feature extraction.
• It cannot process edge information (e.g., different edges in a knowledge graph may indicate different relationships between nodes).
• The fixed point can discourage diversification of the node distribution and may therefore be less suitable for learning node representations.
Several variants of GNN have been proposed to address these issues. We will address them in Sect. 4.4, but first, we discuss only the fundamental learning of a vanilla GNN.

Step 1: Forward propagation on the graph. The message passing between layers results in the node embeddings of the graph. Here, x_v, x_{v,i}, h_{ne[v]}, and x_{ne[v]} are the node features, the features of the connecting edges, the embeddings of the neighbour nodes, and the features of the neighbour nodes respectively. The state h_v attached to each node contains the information in the neighborhood of v, and this representation can be used to produce the output o_v, being the decision in Eq. 2.76.

o_v = F_l(h_v, x_v)
h_v = f_l(x_v; x_{v,i}; h_{ne[v]}; x_{ne[v]})        (2.76)

where F_l in Eq. 2.76 is the local output function and f_l is the local transition function dependent on the neighbourhood nodes. We can then follow a learning scheme to adapt F_l and f_l using training examples in an iterative process to compute the state:

h^{(k+1)} = f(h^{(k)}, x)        (2.77)

Equation 2.77, denoting the computation of the kth iteration of h, converges exponentially fast to a solution, as ensured by the Banach fixed point theorem, and can be expanded with the Jacobi iterative method for solving non-linear equations by iteration:

o_v^{(k+1)} = F_l(h_v^{(k)}, x_v)
h_v^{(k+1)} = f_l(x_v; x_{v,i}; h_{ne[v]}^{(k)}; x_{ne[v]}),    v ∈ N        (2.78)
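A toy version of this Jacobi-style iteration can be written in a few lines. The sketch below is an illustrative NumPy toy, not the implementation of Scarselli et al. (2008): it uses a contractive linear-plus-tanh transition f_l that mixes a node's own features with the mean state of its neighbours and iterates Eq. 2.78 until the state change falls below a tolerance.

import numpy as np

def gnn_fixed_point(A, X, W_self, W_neigh, iters=100, tol=1e-6):
    """Iterate h_v <- tanh(W_self x_v + W_neigh * mean of neighbour states)."""
    n, d_state = A.shape[0], W_self.shape[0]
    H = np.zeros((n, d_state))
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)   # avoid division by zero
    for _ in range(iters):
        H_neigh = (A @ H) / deg                          # aggregate neighbour states
        H_new = np.tanh(X @ W_self.T + H_neigh @ W_neigh.T)
        if np.linalg.norm(H_new - H) < tol:              # fixed point reached
            return H_new
        H = H_new
    return H

# Toy usage: 4 nodes, 3 input features, 2-dimensional state
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
X = np.random.rand(4, 3)
rng = np.random.default_rng(0)
W_self = 0.5 * rng.standard_normal((2, 3))
W_neigh = 0.5 * rng.standard_normal((2, 2))   # small weights keep f contractive
H = gnn_fixed_point(A, X, W_self, W_neigh)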

Step 2: Defining the loss function for learning. Learning of the GNN involves estimating the parameters θ such that f_θ approximates the target labels ŷ in the learning dataset. The learning process is based on gradient descent, and the loss optimization can, for example, be cast as an L1 loss L in Eq. 2.79. Here, F_l and f_l can be interpreted as fully connected feedforward NNs.

L = Σ_n l_n = Σ_{i=1}^{k} (y_i − ŷ_i)        (2.79)

We can also define a measurable objective on the graph stating that nodes closer to the target node should be of higher importance in learning, i.e., if distance d_1 < d_2 for a node, the measurable objective can be turned into a loss term by rearranging the inequality as d_1 − d_2 < 0. The loss function can therefore be configured accordingly, accounting for all the measurable objectives from connected nodes and averaging over E:

L = Σ (d_u − d_v) / |E|^2,    where {d_u < d_v} ∈ E        (2.80)

Equation 2.80 is obtained by forward propagating the message passing in each layer of the network to gather the final representation of the node positions and computing the loss. This loss is then backpropagated to make the model understand where it went wrong, and the process is reiterated to meet the objective.

Step 3: Backpropagation of loss using the gradient descent approach. The backpropagation follows the familiar scheme of differentiating the loss w.r.t. the learnable weights using the chain rule of differentiation. Here, it is important for the transition function f and the output function F to be continuously differentiable w.r.t. x and θ. We won't dive into the long, complex derivation of backpropagation here; it can be found in the literature for a more thorough treatment.

Highlight: In summary, the learning strategy of GNNs for node prediction tasks follows these steps:
1. The state h_v^{(k)} is iteratively updated using Eq. 2.78 to approach the fixed point solution of Eq. 2.75 until time T, such that h^{(T)} ≈ h.
2. Compute the gradient ∂L^{(T)}/∂θ for the learning curve.
3. Update the weights θ by backpropagating the gradient.

By now, we are better able to understand terminologies like message passing and permutation invariance, and we have some insight into the variants of aggregation and updating of information required for different tasks. This depends on the desired output target, the training setting, and the loss function for the specified task. The respective pipeline is summarized in Fig. 2.22. Next, we shall learn the application of


Fig. 2.22 General representation of a GNN network as layers. Each layer has the node embeddings and message passing operation until the final layer k is reached to aggregate the messages. The associated prediction is invoked on the target node embedding or the node connections

some popular GNN variants, such as the Graph Convolutional Network (GCN), the Graph Attention Network (GAT), and a few others, to understand the diverse areas of implementation. Later, in Sect. 4.4, we shall look deeper into the knowledge embedded in these networks and how researchers endeavor to interpret such models. There are some excellent surveys (Zhou et al. 2020; Zhang et al. 2020; Wu et al. 2020) that provide an in-depth understanding of GNNs and their ongoing applications. Many researchers believe in the astonishing power of GNNs to model complex structures. With this view, these networks are likely to play an essential role in AI development and interpretability approaches in the near future.

THINK IT OVER: Can a GNN be considered similar to building a CNN? Recall the concept of message passing and aggregation units in GNNs for creating node embeddings. Interestingly, the pipeline is similar to the feature extraction from pixel data using CNNs. Both follow a similar fashion of stacking layers for feature extraction. A stack of five convolutional blocks with pooling resembles the information learned from five neighbor hops using message passing and aggregation of the information of individual nodes to perform graph-level predictions. Therefore, a very popular GNN architecture is the GCN, which uses convolution to build node embeddings. Connecting Links: GCN, Convolution, Node embeddings.

Challenges with GNNs: Let's understand a few common challenges when we attempt to solve different graph tasks in NNs, using the knowledge we have gained in this chapter so far.
• Space-inefficiency: The adjacency matrix is the most intuitive choice to represent the connectivity of the graph, since it is easily tensorisable. However, the node count in the graph can often be humongous, with highly variable edge connectivity. This can lead to a huge and very sparse adjacency matrix.


• Over-smoothing: With an increasing number of layers, the node embeddings in deeper layers come to carry information similar to that of all other nodes through repeated message passing and aggregation. This can lead to oversaturation of the graph network's node embeddings and hinder learning.
• Scalability: Implementing the above-mentioned node embedding in a social network or recommendation system can be computationally complex for any node embedding algorithm across network layers, including GNNs.
• Dynamicity: Graphs with a dynamic structure are often challenging for GNNs to handle.
• Structural challenge: Finding the best-fit graph generation approach for non-structural data input to GNNs is challenging.

Here is a simple implementation of a GNN in Python. The GNN comprises a sequence of graph layers (GCN, GAT, or GraphConv), ReLU as an activation function, and dropout for regularization. It assumes PyTorch Geometric (torch_geometric) is installed; the gnn_layer_by_name dictionary maps a layer name to the corresponding graph layer class.

import torch.nn as nn
import torch_geometric.nn as geom_nn

# Map a layer name to the corresponding PyTorch Geometric layer class
gnn_layer_by_name = {"GCN": geom_nn.GCNConv,
                     "GAT": geom_nn.GATConv,
                     "GraphConv": geom_nn.GraphConv}

class GNNModel(nn.Module):

    def __init__(self, f_in, f_hidden, f_out, layers_num=2,
                 layer_name="GCN", dp_rate=0.1, **kwargs):
        """
        Inputs:
            f_in - Dimension of input features
            f_hidden - Dimension of hidden features
            f_out - Dimension of the output features. Usually number of classes in classification
            layers_num - Number of "hidden" graph layers
            layer_name - String of the graph layer to use
            dp_rate - Dropout rate to apply throughout the network
            kwargs - Additional arguments for the graph layer (e.g. number of heads for GAT)
        """
        super().__init__()
        gnn_layer = gnn_layer_by_name[layer_name]
        in_channels, out_channels = f_in, f_hidden
        layers = []
        for layer_idx in range(layers_num - 1):
            layers += [gnn_layer(in_channels=in_channels,
                                 out_channels=out_channels, **kwargs),
                       nn.ReLU(inplace=True),
                       nn.Dropout(dp_rate)]
            in_channels = f_hidden
        layers += [gnn_layer(in_channels=in_channels,
                             out_channels=f_out, **kwargs)]
        self.layers = nn.ModuleList(layers)

    def forward(self, x, edge_index):
        """
        Inputs:
            x - Input features per node
            edge_index - List of vertex index pairs representing the edges in the graph
        """
        for l in self.layers:
            # Graph layers need the "edge_index" tensor as additional input.
            # All PyTorch Geometric graph layers inherit the class "MessagePassing",
            # hence we can simply check the class type.
            if isinstance(l, geom_nn.MessagePassing):
                x = l(x, edge_index)
            else:
                x = l(x)
        return x
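Assuming PyTorch Geometric is available, the model above can be exercised on a toy graph as follows; the node count, feature sizes, and edges are arbitrary illustrative values.

import torch

# 4 nodes with 16 input features each; edge_index lists edges as (source, target) pairs
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]], dtype=torch.long)

model = GNNModel(f_in=16, f_hidden=32, f_out=3, layers_num=2, layer_name="GCN")
out = model(x, edge_index)     # node-level logits, shape [4, 3]
print(out.shape)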

We have now covered the essential concepts of the simpler variants of the most common DNNs and discussed their popular extensions. We shall now look at some of the commonly used terms in the DL paradigm for training a network.

2.2 Learning Mechanisms

The idea behind learning as the method to train a DNN is the minimization of a loss function. This function often consists of an error term and a regularization term measuring the performance of the NN on a given training sample. While the error term evaluates how well the network fits the dataset, the regularizer controls the practical complexity of the black-box network to prevent overfitting. The complete learning is based on optimal tuning of the adaptive parameters of the neural network, the weights and biases. The flow of information in the learning process is controlled by forward propagation and backpropagation. In the previous section, we encountered these terms in the network architectures. It is essential to learn about them briefly to understand the knowledge encoding in the network and how to interpret the information.

THINK IT OVER: Criteria for choosing the number and size of hidden layers. A naive answer would be: the greater the number of hidden layers, the better the learning process, since more layers enrich the levels of features. But is that so? Very deep NNs are complicated to train due to problems with vanishing and exploding gradients. Unfortunately, there is no single correct answer, and the best configuration is usually found by trial and error. Connecting Links: ResNet, skip-connection.

Ask Yourself: What are some of the most popular CV tasks that we regularly come across in AI jargon? Check your understanding of the classification, detection, and segmentation applications discussed in the book.


2.2.1 Activation Function

Assuming that all of us DL aficionados understand the basic architecture of ANNs and the role of neurons inside a network, we must have developed a clear idea of how each neuron receives input(s) from the previous layer or the input layer, performs some operation to encode knowledge, and passes signals on to the neurons of the next layer or to the output layer of the network. We also understand that activation functions perform some operation on the weighted sum of inputs to produce a value, often between some lower and upper limits. In a feedforward network, the activation function acts as a mathematical 'gate' between the input fed to a neuron and the output passed to the next layer. This helps in learning complex patterns in the sample data. Therefore, we understand what the activation function does. Still, we might wonder why NNs attach such importance to the non-linear nature of an activation function.

An activation function is a deceptively small mathematical expression, with a biological similarity to brain activity, that decides whether a neuron fires up (or is activated). This helps suppress information from neurons of little significance for the network's overall performance. The non-linearity of the activation function serves the following purposes:
1. Using a linear activation function would make the network act as a simple linear regression model, regardless of the number of layers added to make the network deep. This is because the composition of two linear functions is itself a linear function, which makes the network simpler but renders the model ineffective at mapping complex tasks. This can only be overcome by using non-linear activation functions.
2. The activation function helps to limit the output value of the neuron to a certain range depending on the task. The value, if not restricted, can grow very large in magnitude, making it computationally impractical to train deep networks with millions of parameters.

In Fig. 2.23, some of the popular activation functions and their mathematical equations are summarized. It is important to understand the application of each activation function when building a network. For instance, in a multi-class application, the output layer must be activated by the softmax unit, while a linear activation is expected to be used only in the output layer of a simple regression model. If the ReLU activation function does not help attain the best result, changing the unit to leaky ReLU might sometimes help achieve better results and overall performance. Diverse activation functions serve different applications, but all are expected to share some common desirable properties (refer to Table 2.4 for common activation functions), such as being:
1. Differentiable: As most NNs are trained using the gradient descent technique, the layers of the network must be differentiable, or at least differentiable in parts. Therefore, a differentiable activation function is necessary for the learning process.


Fig. 2.23 Graph plot for different NN activation functions used in modern networks

Table 2.4 Summary of desirable properties in the activation functions

Activations | Zero-centered | Computationally inexpensive | Vanishing gradient issue
Sigmoid | ✗ | ✗ | ✗
Softmax | ✗ | ✗ | ✗
Tanh | ✓ | ✗ | ✗
ReLU | ✗ | ✓ | ✓
Swish | ✗ | ✗ | ✓

2. Zero-centered: To avoid the gradients shifting the learning in a particular direction, the activation function is expected to be symmetric about zero.
3. Computationally inexpensive: Activation functions are applied after every layer and must be computed millions of times in a deep network, so they are expected to be computationally efficient and inexpensive.
4. Vanishing gradient issue: The activation function is expected not to push the gradient value towards zero; with this, we get over the vanishing gradient problem of DNNs.

THINK IT OVER: Why is the combination of two ReLU variants still not popular? We start by defining three variants of ReLU mathematically (see the short sketch after these boxes):
• ReLU (Rectified Linear Unit): f(x) = max(0, x)
• Leaky ReLU: f(x) = max(αx, x)
• ReLU-6 (Restricted ReLU): f(x) = min(max(0, x), 6)
In ReLU, the problem of dying ReLU, i.e., a zero gradient for all negative inputs, sometimes leads a node to die and ultimately learn nothing. On the other hand, the absence of an upper limit on the positive side, basically infinity, can occasionally lead to exploding activations and gradients. Leaky ReLU solves the issue of the dying node, while restricted ReLU on the positive side keeps the output, and hence the gradient, from exploding towards infinity. The exciting idea is to combine the leaky ReLU and the restricted ReLU to solve all the known issues of the previous activation functions.

THINK IT OVER: How to choose the right activation function? As a rule of thumb, you can begin with the ReLU activation function and then move on to other activation functions if ReLU does not provide optimum results. Here are a few other guidelines to help you out:
1. The ReLU activation function should only be used in the hidden layers.
2. Sigmoid/Logistic and Tanh functions should not be used in hidden layers, as they make the model more susceptible to problems during training (due to vanishing gradients).
3. The Swish function is used in NNs having a depth greater than 40 layers.
Finally, a few rules for choosing the activation function for your output layer based on the type of prediction problem that you are attempting to solve:
• Regression—Linear Activation Function
• Binary Classification—Sigmoid/Logistic Activation Function
• Multiclass Classification—Softmax
• Multilabel Classification—Sigmoid
The activation function used in hidden layers is typically chosen based on the type of neural network architecture:
• CNN-ReLU activation function.
• RNN-Tanh and/or Sigmoid activation function.
Connecting Links: ReLU, Softmax, Logistic, Sigmoid, Tanh.
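The three ReLU variants from the box above take only a line each to write down; the sketch below is a plain illustration, with α chosen arbitrarily as 0.01.

import torch

def relu(x):
    return torch.clamp(x, min=0.0)                 # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return torch.where(x > 0, x, alpha * x)        # f(x) = max(alpha * x, x) for small alpha

def relu6(x):
    return torch.clamp(x, min=0.0, max=6.0)        # f(x) = min(max(0, x), 6)

x = torch.linspace(-8, 8, 5)
print(relu(x), leaky_relu(x), relu6(x), sep="\n")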

2.2.1.1 Understanding the Influence of Different Activation Functions

There are ways to learn the basic influence of activation functions on network learning, and choosing the right one for your application can be daunting at times. Therefore, we start by defining a base class for a few popular activation functions and analyze the impact of the activations on deep network training. Training a model with a few layers while keeping the other hyperparameters constant, we can visualize the gradient curves to interpret the impact of different activation functions on the learning paradigm. We shall see the effect in the two stages listed below. For the purpose of experimentation, we explore the influence of four popular activation functions (cf. Fig. 2.24): Sigmoid, Tanh, ReLU, and Leaky ReLU. They are used in a simple ResNet model on the CIFAR-10 dataset introduced by Krizhevsky et al. (2009). CIFAR-10 is a small dataset consisting of tiny 32 × 32 colored images, divided into 10 mutually exclusive classes.

Action 1: Interpreting the gradient propagation after network initialization

As previously indicated, a crucial aspect of activation functions is the manner in which they propagate gradients through the network. Suppose we have a DNN with more than fifty layers. The gradients for the input layer, i.e., the very first layer, have passed more than 50 times through the activation function, yet they should still be of sufficient magnitude. If the gradient through the activation function is (in expectation) significantly less than 1, our gradients will vanish before the input layer is reached. If the gradient through the activation function is greater than 1, the gradients will grow rapidly and may even explode. To obtain a sense of how each activation function affects the gradients, we can examine a newly initialized network in Fig. 2.25 and measure the gradients of each parameter for a batch of 256 images:
• The sigmoid activation function exhibits behavior that is plainly unwanted. In comparison to the output layer, which has gradients as high as 0.1, the input layer has the lowest gradient norm of all activation functions, around 1e-5. Due to its low maximum gradient of 1/4, it is unable to find a suitable learning rate across all

Fig. 2.24 Detailed mathematical and graphical understanding of popular activation functions

Fig. 2.25 Histogram distribution of weights at different layers for the different activation functions


layers in this configuration. The gradient norms of all other activation functions are comparable across all layers. Interestingly, the ReLU activation shows a spike around 0 due to its zero part on the left and the resulting dead neurons (we will take a closer look at this later on).
• It should be highlighted that, in addition to the activation, the initialization of the weight parameters can be critical. By default, PyTorch uses the Kaiming initialization (He et al. 2015) for linear layers, which is optimized for ReLU activations. In Action 2 below, we will look more closely at initialization, but for now assume that the Kaiming initialization works reasonably well for all activation functions.

For each activation function, we train a separate model. Not unexpectedly, the model with the sigmoid activation function fails and does not outperform random guessing (10 classes, so 1/10 accuracy for random chance). All the other activation functions achieve improved performance. To arrive at a more accurate conclusion, we would need to train the models for several seeds and average the results. However, the "optimal" activation function depends on several other criteria (hidden layer sizes, number of layers, type of layers, task, dataset, optimizer, learning rate, and so on), so a comprehensive grid search would not be informative in our scenario. The activation functions that have been found to perform well with deep networks in the literature are all forms of the ReLU functions that we experiment with here, with slight gains for certain activation functions in specific networks.

Action 2: Interpreting the activation distribution in a trained network

Once the models have been trained, the activation values found inside the model are examined. For ReLU, for instance, how many neurons are set to zero? Where do most of the Tanh activations concentrate? To find the answers, we may construct a little program that applies a trained model to a set of images and then plots the histogram of the network's activations in Fig. 2.26:
• Since the sigmoid activation model did not train successfully, its activations are less informative and cluster at a value of 0.5, the activation at input 0.
• Tanh shows a more diverse behavior. While a larger number of neurons in the input layer take values close to −1 and 1, where the gradients are near zero, the activations in the two subsequent layers are closer to zero. This is likely the result of successive layers combining the results of earlier layers' searches for features in the input picture. The activations of the final layer tend again to be more skewed toward the extreme points, because the classification layer can be thought of as a weighted average of those values and the gradients push the activations to those extremes.
• As was hypothesized before, there is a notable peak at 0 in the ReLU. Since there are no negative values, the network's output after the linear layers has a larger tail in the positive direction instead of a Gaussian distribution. The Leaky ReLU exhibits a very similar behavior, whereas an ELU reverts to a more Gaussian-like distribution.

It becomes clear that the choice of the "ideal" activation function depends on many criteria and is not the same for all conceivable networks, since all activation functions exhibit somewhat different behavior while achieving identical performance for our small network.
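A small sketch of the kind of probe used for Action 1 is shown below: it runs one batch through a freshly initialized model, backpropagates a classification loss, and records the gradient norm of every weight tensor. The model, batch, and loss are placeholders for whatever configuration is being inspected.

import torch
import torch.nn.functional as F

def gradient_norms_at_init(model, batch):
    """Return {parameter_name: gradient L2 norm} for one backward pass on `batch`."""
    imgs, labels = batch
    model.zero_grad()
    preds = model(imgs)
    loss = F.cross_entropy(preds, labels)
    loss.backward()
    norms = {name: p.grad.norm().item()
             for name, p in model.named_parameters()
             if p.grad is not None and "weight" in name}
    model.zero_grad()
    return norms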

Fig. 2.26 Histogram distribution of activation outputs at different layers for the different activation functions


2.2.2 Forward Propagation

Forward propagation refers to the forward pass (the calculation and storage of intermediate variables) of input data through a NN. Each layer accepts its input, performs some mathematical operations based on the activation function, and passes its output to the next layer. This is performed in a single forward direction so that the network does not get stuck in a cycle and fail to generate an output. The operation in a forward pass consists of two steps: pre-activation and activation. In pre-activation, the weighted sum of the inputs, a linear transformation of the weights, is computed for each neuron. This is followed by the activation, which introduces non-linearity into the network and decides whether the neuron passes information further to the successive layers. We have already seen the simple feedforward NN operation in Sect. 2.1.2.

Highlight: Feedforward propagation is the flow of information that occurs in the forward direction. The input is used to calculate intermediate values in the hidden layers, which are then used to calculate an output.

Highlight: Residual Networks (ResNet). ResNets provide an alternate pathway for data to flow, making the training process faster and easier. This differs from the feedforward approach of earlier NN architectures. The core idea behind ResNet is that a deeper network can be made from a shallow network by copying weights from the shallow counterpart using identity mappings. That means the data from previous layers is fast-forwarded, i.e., copied forward in the NN. This is what we call skip connections. The technique was first introduced in residual networks to resolve vanishing gradients.

During the implementation of DL algorithms, we primarily focus on the calculations in the forward propagation through the network. This is only a part of model learning, though; the real fine-tuning of the network comes from the objective function that computes the error and propagates the knowledge backwards through the implementation of the backpropagation function. We usually leverage a DL framework for automatic differentiation and calculation of gradients in the backpropagation step. This was previously done by hand, with researchers devoting numerous pages to computing complicated derivatives and update rules. Although we continue to rely on automatic differentiation so that we can channel our energy into defining the exciting parts, we should understand how these gradients work under the hood to go beyond a shallow understanding of DL techniques.
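The two-step forward pass (pre-activation, then activation) and the skip connection mentioned in the ResNet highlight can be sketched as follows; the sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        z1 = self.fc1(x)          # pre-activation: weighted sum of inputs
        a1 = self.act(z1)         # activation: non-linear gate
        z2 = self.fc2(a1)
        return self.act(z2 + x)   # skip connection: the input is "fast-forwarded" past the block

h = ResidualBlock()(torch.randn(8, 64))   # forward propagation for a batch of 8 samples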


Fig. 2.27 Mathematical workflow chart for the forward-propagation and backpropagation in a simple node structure

2.2.3 Backpropagation
To put it simply, backpropagation aims to minimize the difference between the desired output and the output produced by the network by adjusting the network's weights and biases. The gradients of the loss function determine the level of adjustment of parameters such as the weights and biases, and depend on the chosen activation functions. Backpropagation uses the chain rule of differentiation to efficiently calculate the gradient of the loss function, starting at the network's output and propagating the gradients backwards through the feedforward computation graph. This gradient is fed to the optimizer, which uses it to update the weights in an attempt to minimize the loss function. It is a technique for quickly computing derivatives that has served in a wide range of numerical computing problems beyond DL; as a robust computational tool, used for everything from analyzing numerical stability to weather forecasting, it has been reintroduced many times in different disciplines under the application-independent name "reverse-mode differentiation" (Griewank 2012). The derivation of the backpropagation algorithm has been introduced multiple times for the different neural network architectures in the previous section. This subsection introduces the general flow chart of forward-mode and reverse-mode differentiation from an input 'I' to an output 'O' with a hidden state 'H', connected by the combinatorial pathways α, β, γ, δ, κ, ω, summing over all paths. Figure 2.27 represents this factored representation of the derivatives as an easy way of understanding both directions of differentiation. The computational flowchart has only 9 (3 × 3) pathways, but their number can grow exponentially as the network grows and becomes more complicated.


The forward-mode differentiation starts from an input I, summing up all the pathways feeding in, and moves towards the output O. Each path represents one way in which the input affects a node, and summing them up gives all the ways in which the node is affected by the input. The reverse-mode differentiation, on the other hand, starts from the output, merging all the paths originating from a node and tracking how every node affects the outcome. By this point, some of us might wonder whether reverse-mode performs exactly the same computation as forward-mode differentiation, just in a strange order. Why do we bother to use it? Backpropagation, or reverse-mode differentiation, gives us the derivative of one output with respect to every input node, whereas forward-mode differentiation only offers the derivative of the output with respect to a single input. For three input variables this might only be a speed-up factor of three, but with millions of parameters in a NN, forward-mode differentiation has to go through the graph a million times, slowing down the computation drastically. This might seem trivial now, but when backpropagation was introduced, it was not apparent to people that training a NN via derivatives was the right approach, nor whether those derivatives could be computed quickly enough (Olah 2015). Getting stuck in local minima and the seemingly expensive computation of all derivatives were real concerns; only the strong evidence of the approach's success pushed those possible obstacles aside. This is the benefit of hindsight: the most arduous task is already accomplished once we frame the question. Finally, despite having available DL frameworks like TensorFlow, which compute the backward pass for us automatically, it is worth understanding how the algorithm works under the hood (Karpathy 2016). Believing that stacking arbitrary layers and relying on backpropagation will magically get the work done makes us quite susceptible to the leaky abstraction of backpropagation. Let us have a look at some of the quite unintuitive ways in which this mindset may lead us astray:
1. Using sigmoid or Tanh non-linearities without proper weight initialization or data pre-processing may saturate the learning or halt it entirely. This is when the training loss remains flat and refuses to go down. For instance, too large a weight initialization will push the resultant output vector z to be almost binary, i.e., towards 0 or 1 with a sigmoid activation function. In turn, the local gradient of the sigmoid non-linearity, z × (1 − z), becomes zero in either case, and the backward pass from this point onwards is zero due to the multiplication in the chain rule. Another non-obvious fact is that the local gradient of the sigmoid reaches its maximum of 0.25 at z = 0.5, so the gradient magnitude shrinks by at least a quarter every time it passes through a sigmoid non-linearity. With basic SGD, this results in the shallow layers training more slowly than the deeper layers.
2. Using the ReLU non-linearity may result in dead ReLUs if a neuron never fires in the forward pass, i.e., z = 0. Neurons that are initialized such that they never fire, or that are knocked off the data manifold by large weight updates during learning, become permanently dead neurons. This is evident when passing the entire training set through the


trained network and observing that a certain percentage of neurons were zero during the entire process.
3. Another possible issue is the vanishing or exploding gradient in RNNs that we discussed earlier. During the backward pass through time, the gradient signal is repeatedly multiplied by the same matrix, interspersed with the non-linearity. The largest eigenvalue of this matrix determines whether the gradient vanishes or explodes, and indicates when gradient clipping should be used.
When hidden behind the abstraction of backpropagation, many such intrinsic problems have non-trivial consequences. A black-box mindset makes it hard to train the network effectively and restricts the ability to debug the NN.
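The dead-ReLU check described in point 2 can be automated. The sketch below passes a dataset through a trained model, tracks which ReLU units ever produce a non-zero output, and reports the fraction that stayed silent. The model and data loader are placeholders, and treating "never greater than zero on this data" as dead is an assumption of this illustration.

import torch
import torch.nn as nn

def dead_relu_fraction(model, data_loader):
    # Fraction of ReLU units that never produced a non-zero output on the given data.
    fired = {}   # layer index -> boolean mask of units that fired at least once

    def make_hook(idx):
        def hook(module, inputs, output):
            active = (output.detach() > 0).flatten(1).any(dim=0)   # per-unit: fired on this batch?
            fired[idx] = active if idx not in fired else (fired[idx] | active)
        return hook

    handles = [m.register_forward_hook(make_hook(i))
               for i, m in enumerate(model.modules()) if isinstance(m, nn.ReLU)]

    with torch.no_grad():
        for x, _ in data_loader:          # assumes the loader yields (inputs, labels) pairs
            model(x)

    for h in handles:
        h.remove()

    total = sum(mask.numel() for mask in fired.values())
    dead = sum((~mask).sum().item() for mask in fired.values())
    return dead / max(total, 1)

# Usage sketch (trained_model and train_loader are assumed to exist):
# print(f"dead ReLU fraction: {dead_relu_fraction(trained_model, train_loader):.2%}")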

2.2.4 Gradient Descent
The gradient is a simple derivative, which we know from differential calculus. Simply put, it measures the rate of change of the error produced during network training with respect to a change in a NN parameter. The process of iteratively minimizing the differentiable loss/error function is called gradient descent, arguably introduced by Augustin-Louis Cauchy in the mid-19th century. In the case of a univariate function, this first-order optimization algorithm computes the derivative of the objective function with respect to the weights of the NN. This derivative is then used to adjust the weights in the direction of steepest descent, i.e., the direction of the most negative gradient, which corresponds to the steepest slope of the objective function. Due to its ease of implementation and its significance, it is the most commonly used optimization algorithm for finding a local minimum, and it is by far not limited to training ML and DL models. In the case of a multivariate function, the gradient is the vector of partial derivatives along each of the variable axes. For an n-dimensional function f, the gradient at a given point x0, ∇f(x0), is defined by Eq. 2.81 as the vector of partial derivatives with respect to each of the n variables.

∇f(x0) = [∂f/∂x1(x0), …, ∂f/∂xn(x0)]ᵀ    (2.81)

The next point xn+1 on the curve, leading towards the function minimum, is then computed iteratively using Eq. 2.82, where the gradient is scaled by the learning rate (α). The learning stops either when the maximum number of iterations is reached or when the step size becomes smaller than the tolerance limit (typically 0.01) during network training.

xn+1 = xn − α ∇f(xn)    (2.82)


Descending a gradient has two aspects: calculating the first-order derivative to choose the direction of the function to step in, and moving along the negative gradient direction by a step size determined by alpha (the learning rate). It is important to note that the gradient descent algorithm requires the function to be differentiable and convex. Differentiability ensures that the derivative exists at each point of the domain and that the function is free of cusps, jumps, or discontinuities. Next, for a univariate function, convexity means that the line segment joining any two points of the function lies on or above the function curve; this guarantees that a local minimum found by the algorithm is also the global minimum. The condition can be represented mathematically as

κ f(x1) + (1 − κ) f(x2) ≥ f(κ x1 + (1 − κ) x2)    (2.83)

where x1, x2 are two points on the function curve and κ ∈ [0, 1] defines the point's location on the line segment between its two ends in Eq. 2.83, i.e., between x1 and x2. Alternatively, the convexity of a univariate function is ensured by the second-order derivative being greater than zero everywhere, as shown in Eq. 2.84. The point where the second derivative attains zero is a point of inflection, i.e., where the curvature changes sign. Note that a saddle point (min-max point) is often reached when both the first-order and second-order derivatives equal zero, yet the global minimum is not attained, as with a quasiconvex function (see Fig. 2.28). For a multivariate function, saddle points can be identified using the Hessian matrix.

d²f(x)/dx² > 0    (2.84)

Fig. 2.28 Graphical representation of a loss function with different modes of gradient descent. The situations of a saddle point and of the optimization getting stuck at a local minimum rather than the global minimum are shown here


Fig. 2.29 The optimization steps for the three approaches—Batch GD, Mini-batch GD and Stochastic GD. Mini-batch GD shows an optimized path taken to reach the minimum

 THINK IT OVER
How to get the most out of the gradient descent algorithm? Here are a few guidelines and tips to help you.
1. Learning rate: Try small real values such as 0.1, 0.001, or 0.0001 and see which suits the problem best.
2. Input normalization: The minimum of the loss function can be reached faster with a non-skewed and non-distorted objective function. This can be achieved by rescaling the input variables to the same range, such as [0, 1] or [−1, 1].
3. Plot loss versus time: A well-behaved gradient descent run should show a decreasing loss value at each iteration. If it does not, try a smaller learning rate.
4. Plot mean loss: With stochastic gradient descent, updates for each training instance can lead to a noisy plot. Averaging over the last 10, 100, or 1,000 updates can help to better see the learning trend.
5. Few passes: Good convergence of the learning curve with stochastic gradient descent often needs no more than 1–10 passes through the training dataset.
Connecting Links: Learning rate, optimization, derivatives.

2.2.4.1 Types of Gradient Descent Algorithms

In model training, the amount of data used to compute the error for each update defines the type of gradient descent algorithm. Popular variants include batch gradient descent, stochastic gradient descent, and mini-batch gradient descent (refer to Fig. 2.29). Let us quickly discuss these variants and summarize their advantages in Table 2.5.


Batch Gradient Descent
Batch Gradient Descent (BGD) is a greedy approach that sums over all the training samples for each model update. One complete pass through the training data is called a training epoch, and batch gradient descent uses the complete training set for each update, which makes the computation of the update itself efficient. Note that this can be slow for large datasets, as one iteration requires a prediction for every instance in the training set.

import numpy as np

# Batch GD: one parameter update per epoch, computed over the full training set
for i in range(epochs):
    parameter_gradient = evaluate_gradient(loss_function, inputs, parameters)
    parameters = parameters - learning_rate * parameter_gradient

Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) processes a single training sample per update, adjusting the parameters one example at a time. Working with one training example at a time requires less memory to store and process the information. However, the frequent updates make the gradient estimate noisier than in batch GD. It also spends more effort on the regular individual updates, but this noise can help escape a local minimum on the way towards the global minimum. In this process, make sure to shuffle the training dataset so that the noisy random jumps explore the loss surface rather than getting stuck or distracted by a fixed ordering. Learning is usually relatively fast for large datasets, where a small number of passes through the dataset can already reach a good learning state.

import numpy as np

# SGD: shuffle the data, then update the parameters once per training example
for i in range(epochs):
    np.random.shuffle(inputs)
    for attributes in inputs:
        parameter_gradient = evaluate_gradient(loss_function, attributes, parameters)
        parameters = parameters - learning_rate * parameter_gradient

Mini-batch Gradient Descent
Mini-batch Gradient Descent combines the above two algorithms by dividing the training samples into small batches. It performs the computation on these batches separately, maintaining a balance between the computational efficiency of batch GD and the speed of stochastic GD, thereby achieving high computational efficiency with less noisy gradients.

import numpy as np

# Mini-batch GD: shuffle the data, then update the parameters once per mini-batch
for i in range(epochs):
    np.random.shuffle(inputs)
    for batch in get_batches(inputs, batch_size=20):
        parameter_gradient = evaluate_gradient(loss_function, batch, parameters)
        parameters = parameters - learning_rate * parameter_gradient
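The snippets above rely on the helpers evaluate_gradient and get_batches without defining them. A minimal sketch of get_batches, assuming inputs is a NumPy array whose first axis indexes the training samples, could look as follows.

import numpy as np

def get_batches(inputs, batch_size=20):
    # Yield consecutive mini-batches along the first axis; the last batch may be smaller.
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size]

# Usage sketch: iterate over mini-batches of 20 samples each
# for batch in get_batches(np.arange(100).reshape(50, 2), batch_size=20):
#     print(batch.shape)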

There are two basic challenges with gradient descent techniques: (a) choosing a learning rate suitable for the optimization, and (b) getting stuck in local minima (Fig. 2.29). We shall first understand the learning rate before we dive into the different optimization techniques, followed by the need for regularization.


Table 2.5 Summary of the properties of the different gradient descent algorithms

Type of gradient descent | Computational efficiency   | Memory use | Convergence stability | Gradient noise | Speed
Batch GD                 | Balanced                   | High       | Balanced              | Low            | Fast
Mini-batch GD            | High                       | Balanced   | High                  | Balanced       | Balanced
Stochastic GD            | Better for large datasets  | Low        | Balanced              | High           | Slow

2.2.5 Learning Rate
The Learning Rate (LR), mostly denoted by the symbol alpha 'α', is the scaling factor for the step size of gradient descent. It is important to observe how varying the learning rate and the starting point of the variable initialization affects the learning steps of gradient descent. Further below, we will write simple code for a quadratic function and for a function with a saddle point to trace the learning path taken while minimizing the objective/loss function during model training. The general mathematical expression for updating the objective parameters θk, for a loss function L, learning rate α, and decay rate β, is given in Eq. 2.85, where α0 is the initial learning rate. The gradient term ∂L/∂θ is the partial derivative of the loss function w.r.t. the update parameter; it defines the rate of change of the loss function with regard to a shift in the update parameter, say the weights or biases, during training.

θk = θk − α ∂L/∂θ,  where α = α0 / (1 + β · epochnum)    (2.85)
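A minimal sketch of the time-based decay in Eq. 2.85 is shown below; the initial learning rate, decay rate, and number of epochs are arbitrary values chosen only for illustration.

# Time-based learning rate decay following Eq. 2.85 (illustrative values)
alpha_0 = 0.1      # initial learning rate
beta = 0.05        # decay rate
epochs = 10

for epoch_num in range(epochs):
    alpha = alpha_0 / (1 + beta * epoch_num)   # decayed learning rate for this epoch
    # ... run one epoch of gradient descent with step size alpha ...
    print(f"epoch {epoch_num}: learning rate = {alpha:.4f}")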

 Note
The learning rate, which controls the step size, has a huge influence on the model's performance.
• If the learning rate is too small, gradient descent may take too long to converge, or it may fail to reach the optimum before the maximum number of iterations is exhausted.
• If the learning rate is too large, the algorithm might struggle to converge and may completely overshoot the global minimum.


 Action 1: Investigate the influence of the learning rate on gradient descent convergence
Here, we take a univariate quadratic function f1(x) = x² − 5x + 12 and compute its derivative f1'(x) = 2x − 5. Let us start the experiment with four gradually increasing learning rates, starting from 0.2, and track the trajectory produced by the gradient descent algorithm. Figure 2.30 presents the trajectories and the number of iterations needed to converge within the tolerance limit for the four different learning rates 0.2, 0.4, 0.6, and 0.8 on this simple quadratic optimization. The comparison of the four learning rates is shown in Fig. 2.31, which also shows the effect of different learning rates on the loss optimization on the right. Contrary to common intuition, choosing a high learning rate is not always beneficial, as the iterates can bounce around too frequently and may fail to converge. On the other hand, choosing a small learning rate may lead to very slow convergence or even to getting stuck at a local minimum. Therefore, it is important to understand learning rate optimization well in order to train an efficient DL model.
 Action 2: Investigate gradient descent for a function with a saddle point
Next, we observe the influence of a semi-convex function with a saddle point on the convergence trajectories of gradient descent learning. Here, we take the semi-convex function f2(x) = x⁴ − 2x³ + 2 and investigate the number of iterations and the resulting trajectories for two different learning rates and two different starting points in Fig. 2.32. Changing the starting point is essential to reflect the impact of the saddle point on convergence and to understand the idea behind the different initialization techniques introduced during deep model training. We observe that the presence of a saddle point poses a real challenge for reaching the global minimum with first-order algorithms like gradient descent; second-order methods like the Newton-Raphson method handle the situation better. The investigation of saddle points and ways to overcome them remains an interesting study. A simple Python implementation of gradient descent on a function with a saddle point is presented below:

# Importing basic libraries
import numpy as np
from matplotlib import pyplot as plt

# In this example, we define an equation and manually provide its derivative
# The derivative of the function can later be computed automatically
def function(x):
    return (x*x*x*x) - 2*x*x*x + 9       # define the function

def gradient(x):
    return 4*x*x*x - 6*x*x               # gradient of the function defined above

def gradient_descent(start, iterations, learning_rate, tolerance=0.01):
    steps = [start]                       # track the path of descent
    theta = start

    for i in range(iterations):
        diff = learning_rate * gradient(theta)   # change in theta
        if np.abs(diff) <= tolerance:     # stop once the step size falls below the tolerance
            break                         # (the original listing is truncated here; this stopping rule is a plausible reconstruction)
        theta = theta - diff
        steps.append(theta)

    return steps, theta

Fig. 2.30 Graph for the four different learning rates on a quadratic equation optimization


Fig. 2.31 The left plot shows the influence of the previous four learning rates on the optimization trajectory. The right plot shows the influence of different learning rates on the loss function optimization over 100 iterations

Fig. 2.32 Influence of number of iterations over learning rate on a function with a saddle point


In Eq. 3.11, the solution is then:

wi · w̃k = log Nik − log Ni    (3.11)

Adding biases bi and b̃k gets rid of log Ni, which is needed to maintain the exchange symmetry. Therefore, the model develops into Eq. 3.12,

wi · w̃k + bi + b̃k = log Nik    (3.12)

which is significantly simpler than Eq. 3.9. Eq. 3.13 defines the loss function:

L = Σ_{i,j=1}^{|V|} f(Nij) (wi · w̃j + bi + b̃j − log Nij)²    (3.13)

where f(·) is a weighting function defined in Eq. 3.14.

f(x) = (x/xmax)^α  if x < xmax,  and  f(x) = 1  otherwise    (3.14)
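A minimal NumPy sketch of the weighting function in Eq. 3.14 and the weighted least-squares loss in Eq. 3.13 is given below; the random co-occurrence matrix, the embedding dimension, and the values xmax = 100 and α = 0.75 are illustrative assumptions (the latter two being commonly cited defaults, not requirements of the equations).

import numpy as np

def weighting(x, x_max=100.0, alpha=0.75):
    # Eq. 3.14: down-weight rare co-occurrences, cap the weight of frequent ones at 1
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(N, W, W_tilde, b, b_tilde):
    # Eq. 3.13: weighted squared error between w_i · w̃_j + b_i + b̃_j and log N_ij
    mask = N > 0                                   # only pairs that actually co-occur contribute
    scores = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = np.where(mask, scores - np.log(np.where(mask, N, 1.0)), 0.0)
    return np.sum(weighting(N) * mask * err ** 2)

# Tiny usage sketch with a random co-occurrence matrix for a 5-word vocabulary
rng = np.random.default_rng(0)
N = rng.integers(0, 50, size=(5, 5)).astype(float)
W, W_tilde = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
b, b_tilde = np.zeros(5), np.zeros(5)
print(glove_loss(N, W, W_tilde, b, b_tilde))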

3.1.2.2 Contextualized Word Representation

In everyday speech, the context of a statement heavily influences how a word is understood. Many popular word embeddings, such as CBOW, Skip-gram, and GloVe, fall short when it comes to comprehending context-specific meanings of words. Because these models learn a single, fixed representation for each word, they are unable to grasp how the meanings of words shift across different settings.


Neumann et al. (2018) suggest Embeddings from Language Models (ELMo), which use a deep, bidirectional LSTM model to construct word representations, to solve this problem. ELMo's representation of a given word depends on the context in which it is used. Specifically, ELMo feeds the word and the surrounding text into a DNN, which then turns the word into a low-dimensional vector on the fly. ELMo performs word representation using a bidirectional language model.
 Highlight
With word representation's success, scholars are exploring its underlying ideas. Some publications strive to verify the reasonableness of existing word representation learning tactics, while others propose new learning methods.
Reasonability: Word2vec and related tools are examples of empirical approaches to learning word representations. Many tricks, such as negative sampling, are offered to efficiently learn a corpus's word representation. Given the efficacy of such approaches, a broader theoretical study is required to justify their reasonableness. Levy and Goldberg (2014) give some theoretical analysis of these tricks: they formalize the Skip-gram model with negative sampling as an implicit matrix factorization approach. Using the SVD matrix factorization technique, they also assess how well the implicit matrix M performs. When the number of negative samples is one and the embedding size is less than 500 dimensions, matrix factorization yields a substantially higher objective value. Skip-gram with negative sampling achieves a better objective with more negative samples and larger embedding dimensions. This occurs because SVD favors factorizing a matrix with small values as the number of zeros in M grows. Using embeddings of 1,000 dimensions and varying the number of negative samples among 1, 5, and 15, SVD improves word analogy and similarity performance by a small margin. In contrast, Skip-gram with negative sampling outperforms it by 2% on the syntactic analogy task.


Interpretability: Existing distributional word representation techniques can often provide a dense real-valued vector for each word. However, the word embeddings these models produce are hard to interpret. To create interpretable models where each dimension represents a separate notion, Li et al. (2016) introduced non-negative and sparse embeddings. Word embedding matrices (E ∈ R^{|V|×m}) and document statistics matrices D ∈ R^{m×|D|} are generated by factoring the corpus statistics matrix X ∈ R^{|V|×|D|}. The training objective is presented in the equation below:

arg min_{E,D}  Σ_{i=1}^{|V|} (1/2) ‖Xi,: − Ei,: D‖² + λ‖Ei,:‖₁,
such that  Di,: Di,:ᵀ ≤ 1, ∀ 1 ≤ i ≤ m,
           Ei,j ≥ 0, 1 ≤ i ≤ |V|, 1 ≤ j ≤ m

This model can train non-negative and sparse word embeddings by iteratively optimizing E and D using gradient descent. Since the embeddings are sparse and non-negative, the words that receive the highest scores in each dimension demonstrate conceptual similarity. This study also suggests phrase-level constraints for the loss function to further enhance the embeddings. By incorporating additional constraints, it could be possible to attain interpretability and compositionality.
Word Representation with Hierarchical Structure
Generally, human knowledge is organized hierarchically. Many recent approaches have therefore included a textual hierarchy in word representation learning, including:
• Dependency-based word representation: Continuous word embeddings encode semantic and syntactic information. However, contemporary word representation models rely only on linear contexts and capture more semantic than syntactic information. The dependency-based word embedding of Levy and Goldberg (2014) utilizes the dependency-based context so that the embeddings can provide more syntactic information. The dependency-based embeddings are less topical and exhibit more functional similarity than the Skip-gram embeddings. Consider the information from the dependency parse tree for learning word representations: the contexts of a target word w are its modifiers, i.e., (m1, r1), . . . , (mk, rk), where ri is the type of dependency relation between the head node and the modifier. During training, the model optimizes the likelihood of dependency-based contexts rather than adjacent contexts. Compared to Skip-gram, this model achieves some improvements on word similarity benchmarks. Experiments also demonstrate that syntactically related words are closer in the vector space.
• Semantic hierarchies: Due to the linear substructure of the vector space, word embeddings can form simple analogies. For instance, the distinction between


‘Norway’ and ‘Oslo’ is comparable to that between ‘India’ and ‘New Delhi’. However, word embeddings have difficulty recognizing hypernym-hyponym relations due to the complexity and non-linear nature of these interactions. To solve this issue, Fu et al. (2014) use word embeddings to discover hypernym-hyponym associations. Instead of relying on the embedding offset to express the connection, it is preferable to learn a linear projection. The model learns the projection in Eq. 3.15 such that

M* = arg min_M (1/N) Σ_{(i,j)} ‖M xi − yj‖²    (3.15)

where xi and yj are the hyponym and hypernym embeddings, respectively. To improve the model's performance, they propose clustering word pairs into multiple groups and then learning a linear projection for each group. The linear projections can assist in identifying several kinds of hypernym-hyponym relationships.
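A minimal sketch of fitting the projection in Eq. 3.15 by least squares is shown below; the toy hyponym/hypernym pairs and the embedding dimension are made up purely for illustration, not taken from Fu et al. (2014).

import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # embedding dimension (illustrative)
X = rng.normal(size=(10, d))            # rows: hyponym embeddings x_i
M_true = rng.normal(size=(d, d))
Y = X @ M_true.T + 0.01 * rng.normal(size=(10, d))   # rows: hypernym embeddings y_j (noisy)

# Solve min_M (1/N) sum ||M x_i - y_j||^2 via linear least squares:
# with row vectors this is X M^T ≈ Y, a standard least-squares problem.
M_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
M = M_T.T

# Predict the hypernym embedding of a new hyponym and check the fit
x_new = rng.normal(size=d)
print("projected hypernym:", M @ x_new)
print("reconstruction error:", np.linalg.norm(X @ M.T - Y))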

3.1.2.3 Evaluation

In recent years, several approaches for embedding words in a vector space have been presented, so it is vital to compare the different strategies. Word similarity and word analogy are the two general measures for evaluating word embeddings. Both aim to determine whether the learned word distribution is appropriate. Sometimes, these two judgments give different results: CBOW performs better in terms of word similarity, but Skip-gram performs better in terms of word analogies. Therefore, the choice of approach depends on the downstream application. Typically, task-specific word embedding algorithms are designed for certain high-level tasks and greatly improve on these tasks compared to baselines such as CBOW and Skip-gram; however, on the two general evaluations they only marginally exceed their respective baselines.
Word Similarity/Relatedness
Language interactions are very rich and complex. All possible interactions between pairs of words cannot be described by a static, finite set of relations, and leveraging the various forms of word connections in downstream tasks is not simple either. Scoring the degree of similarity between two words is a more workable solution; we use the term “word similarity” to describe this metric. The exact meaning of “word similarity” can vary greatly depending on the context in which it is used. In different texts, we may refer to a number of different types of resemblance.
• Morphological similarity: Many languages, including English, have rich morphology. One and the same morpheme may have several different surface forms, each of which serves a different syntactic purpose. For instance, the adjective ‘active’ can be transformed into the noun ‘activeness’, and the verb ‘activate’ has the related noun ‘activation’, referring to the process of activation. When thinking about what words signify and how they are used, morphology is an essential factor to examine. Some syntactic


relationships between words are outlined. Adjective-to-adverb, past tense, and other connections are employed in the Syntactic Word Relationship test set (Mikolov et al. 2013). However, the base form is typically used to standardize the morphology of terms in the contexts where they will be used in more advanced applications (this process is also known as lemmatization). The Porter stemming algorithm (Porter 1980) is one such method. All of the variants ‘active’, ‘activeness’, ‘activate’, and ‘activation’ are reduced to ‘active’ using this technique. By excluding morphemic markers, the semantic meaning of the words is brought into sharper focus.
• Semantic similarity: If two words, such as ‘article’ and ‘document’, may represent the same notion or sense, they are semantically comparable. In the same way that a single word can have many meanings, each synonym of a word will have an associated meaning or meanings. WordNet (Miller 1995) is a database of words organized into sense-based categories. A synset is a collection of words that all have the same meaning and are therefore synonyms. Words belonging to the same synset are regarded as having comparable meanings. For example, ‘bank(river)’ and ‘bank’ are two words in different synsets that are related by a specific connection (here, bank(river) is the hyponym of bank), and therefore are regarded as having some semantic overlap.
• Semantic relatedness: Most of the current work that examines word similarity focuses on how words are connected semantically. The concept of semantic relatedness encompasses a wider range of concepts than semantic similarity. Meronymy (car and wheel) and antonymy (hot and cold) are two examples of word relationships that do not rely on a shared semantic meaning. Co-occurrence results from semantic similarity; however, this does not mean that the two are interchangeable, as co-occurrence may also result from the syntactic structure. Budanitsky and Hirst (2006) argued that distributional similarity is not a good surrogate for semantic relatedness.
In order to evaluate a word representation system on its own merits, the most common method is to compile a collection of word pairs and calculate the correlation between human judgment and system output. Many datasets have been collected and released to the public at this point. RG-65 (Rubenstein and Goodenough 1965) and SimLex-999 (Hill et al. 2015) are two such word-similarity datasets. Additional datasets, such as MTurk (Radinsky et al. 2011), focus on relatedness between words. WordSim-353 (Finkelstein et al. 2001) is often used to evaluate word representations, but its annotation standard does not distinguish between similarity and relatedness. Another round of annotation of WordSim-353 was performed by Agirre et al. (2009), resulting in two subsets: one for similarity and one for relatedness. Cosine similarity is commonly used by researchers as a measure of how similar two distributed word vectors are to one another. The cosine similarity between words w and v is defined in Eq. 3.16 as:

sim(w, v) = (w · v) / (‖w‖ ‖v‖)    (3.16)


The cosine similarity between each pair of words is calculated beforehand and used in the evaluation of a word representation method. Then, we compare the human annotators' ratings to those generated by the word representation model using Spearman's ρ correlation coefficient, defined in Eq. 3.17:

ρ = 1 − (6 Σ di²) / (n³ − n)    (3.17)

where a higher Spearman correlation value suggests that the two rankings are more comparable. WordNet-based similarity evaluation techniques are described by Budanitsky and Hirst (2006). Agirre et al. (2009) highlight that relatedness and similarity are two separate issues, after comparing traditional WordNet-based approaches with distributed word representations. They point out that WordNet-based techniques do better with similarity than with relatedness, whereas distributed word representations do equally well on both. Schnabel et al. (2015) examine many different types of distributed word representations over a wide range of data sources. Distributed representations unquestionably represent the SOTA in both similarity and relatedness. This assessment approach is easy to comprehend and implement. There are, however, a number of issues, as described in Faruqui et al. (2016). One system may produce many different scores on different partitions because of the small size of the datasets (fewer than 1,000 word pairs in each dataset). Statistical significance is difficult to establish, and overfitting is more likely when testing on the entire dataset. Furthermore, there may be little to no correlation between a system's performance on these datasets and its performance on downstream tasks.
The TOEFL Synonyms Test is another option to assess word similarity. When taking this exam, one is presented with a cue word and asked to select the closest synonym from a list of four. The possibility of comparing a system's performance to that of a human makes this task particularly fascinating. Landauer and Dumais (1997) assess how well LSA handles such knowledge queries and representations; the reported score of 64.4% is quite similar to the average rating achieved by the real test participants. The score obtained by Sahlgren (2001) on this test set of 80 queries was 72.0%. The original dataset has been expanded using WordNet to create a new dataset called the WordNet-based synonymy test (Ferret 2010) with thousands of queries.
Word Analogy
The word analogy task is an alternative to the word similarity challenge for evaluating a representation's ability to capture the semantic meanings of words. The goal of this task is to predict a fourth word, w4, such that the relation between w1 and w2 is the same as that between w3 and w4. Since Mikolov et al. (2013), this task has been employed to exploit the structural connections between words. It is possible to distinguish between semantic and syntactic relations in this context. After the publication of the dataset, this unique approach to evaluating word representations


became the de facto standard. In contrast to the TOEFL test, most of the terms in this dataset appear often in all forms of the corpus; nonetheless, the fourth word must be retrieved from the whole vocabulary. Due to its focus on the structure of the word space, this evaluation is more favorable to distributed word representations.
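The two evaluations discussed above reduce to simple vector operations. The sketch below computes the cosine similarity of Eq. 3.16 and answers an analogy query by searching the whole vocabulary; the tiny random embedding matrix and word list are placeholders, not real trained embeddings.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["norway", "oslo", "india", "new_delhi", "car"]   # toy vocabulary (placeholder)
E = rng.normal(size=(len(vocab), 8))                      # toy embedding matrix (placeholder)
index = {w: i for i, w in enumerate(vocab)}

def cosine(u, v):
    # Eq. 3.16: sim(w, v) = (w · v) / (||w|| ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(w1, w2, w3, embeddings, index):
    # Predict w4 such that w1 : w2 :: w3 : w4, searching the whole vocabulary
    query = embeddings[index[w2]] - embeddings[index[w1]] + embeddings[index[w3]]
    scores = [(cosine(query, embeddings[i]), w) for w, i in index.items()
              if w not in (w1, w2, w3)]                   # exclude the query words themselves
    return max(scores)[1]

print(cosine(E[index["norway"]], E[index["oslo"]]))
print(analogy("norway", "oslo", "india", E, index))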

3.1.3 Graph Representation Graphs are used extensively in computer science and related disciplines because of their versatility as a data structure. Graph embeddings are another method for encoding objects based on their distribution. Finding an effective approach to capture or encode network structure that can be readily exploited by AI models is the fundamental problem in this area. In the past, ML methods extracted features encoding structural information about a graph based on user-defined criteria, e.g., degree statistics (Bhagat et al. 2011) or kernel functions (Van der Maaten and Hinton 2008), or carefully engineered features to measure local neighborhood structures (Lee and Verleysen 2007). However, there are restrictions on these methods due to the inflexibility of the hand-engineered features used (they cannot adjust throughout the learning process) and the difficulty and expense of developing such features. Recent years have witnessed a spike in systems that automatically learn to encode network structure into a low-dimensional vector space, Rd , employing techniques based on DL and non-linear dimensionality reduction. The objective is to optimize this mapping such that geometric relationships in this learned space replicate the structure of the original graph. Compared to earlier research, representation learning algorithms diverge most notably in how they handle the challenge of retrieving the underlying structural information of the graph (Hamilton et al. 2017). In the past, this issue was addressed as a preliminary processing phase, with statistics being constructed manually to extract the underlying structure. On the other hand, representation learning methods take on this challenge as a machine learning task in and of itself, employing a data-driven strategy to discover embeddings that encapsulate graph structure. Graphs, which represent interactions (edges) between individual units (nodes), are useful for modeling a wide range of phenomena, including social networks, molecular graph structures, biological protein-protein networks, recommender systems, and many more. As a consequence of their ubiquity, graphs constitute the backbone of many systems, allowing relational knowledge about interacting entities to be effectively stored and retrieved (Angles and Gutierrez 2008). In the previous chapter, we spoke about how NNs struggle to make sense of chaotic, disordered input. Graphs are frequently used to represent and manage unstructured data in various practical scenarios. To this end, it is critical to create automated algorithms for mining graphs for valuable insights. Learning the graph representations, which assign each vertex of the graph a low-dimensional dense vector representation encapsulating significant information provided by the graph, is an efficient way to organize such information connected with a potentially huge and complicated graph.


3.1.3.1 Graph Embedding

Graphs may be used to represent a wide variety of datasets. Consider, for example, the human knowledge graph. Real-world information is represented in encyclopedias like Wikidata and DBpedia through "entities" like individuals, corporations, nations, and movies, and relationships between those entities like marriage, presidency, citizenship, and acting roles. By modeling the data in the form of a graph, we can create embeddings for the nodes and transformations for the edges that allow us to traverse from one node to the other, as illustrated in Fig. 3.6. Additionally, Fig. 3.7 shows the pipeline of a random walk-based graph embedding method, extending the knowledge we have gained from word embedding in the previous section. The goal is to discover a learnable transformation for each edge that can be used in conjunction with the node embeddings to map one node onto another; a translation is one example of such a transformation. Astonishingly accurate edge predictions may be obtained in this way. Knowledge graphs are only one example of the many types of relationships that may be represented by graphs. Graph embeddings are a useful tool for encoding such collections based on their distribution.
Node Embeddings
Methods for node embedding are discussed first; node embedding is the process by which a node's location in a network and the characteristics of its immediate neighbors are reduced to a compact representation in a low-dimensional vector space. These embeddings have low dimensionality and may be thought of as encoding or projecting nodes into a latent space. Geometric relations in this latent space correspond to edges (e.g., interactions) in the original graph. An encoder maps each node to a low-dimensional vector, or embedding, and a decoder decodes structural information about the graph from the learnt embeddings; these two mapping functions serve as the organizing principle for numerous techniques. If we can learn to decode high-dimensional graph information, like the global positions of nodes in the graph and the structure of local graph neighborhoods, from encoded low-dimensional embeddings, then, in principle, these embeddings should

Fig. 3.6 A summary of many strategies for learning the transformation between billions of embeddings. Here, Pytorch Big Graph’s (PBG) contribution is to partition the nodes and edges so that it’s possible to apply this learning for hundreds of millions of nodes and billions of edges


Fig. 3.7 Methods for embedding graphs using random walks, implemented as a pipeline. To learn node embeddings for the original graph G = (V, E) with two types of nodes marked with different colors, we first use random walk techniques to produce a set of node contexts (Wvi) for every node (vi ∈ V); the sampled node contexts (i.e., random walks) are of the same fixed walk length t. Second, a language embedding model acts as an encoder such that each node is represented as a low-dimensional, continuous vector in the latent space based on the produced node contexts. Distances between vectors (or node embeddings) in the latent vector space (such as the dot product, cosine similarity, or Euclidean distance) approximate similarities in the original graph. In addition, dimension reduction methods (e.g., t-SNE, MDS, PCA) make it easy to map the learnt vectors to points in 2D space. Link prediction, node classification, community detection, etc., may all benefit greatly from the learnt node embedding characteristics (φ ∈ R|V|×L) for all nodes

contain all information necessary for downstream ML tasks, as the encoder-decoder idea suggests. The encoder is a function formally represented in Eq. 3.18,

ENC : V → R^d    (3.18)
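The first stage of the pipeline in Fig. 3.7, sampling fixed-length random walks as node contexts, can be sketched in a few lines. The toy edge list, walk length, and number of walks per node below are illustrative choices; the resulting walks would then be fed to a Skip-gram-style encoder as described above.

import random
import networkx as nx

def random_walks(G, walk_length=5, walks_per_node=3, seed=0):
    # Produce fixed-length random-walk node contexts W_vi for every node vi in V
    rng = random.Random(seed)
    walks = []
    for node in G.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:              # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy graph (placeholder edge list) and its sampled walks
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)])
for walk in random_walks(G):
    print(walk)
# Each walk plays the role of a "sentence"; training a Word2vec-style Skip-gram model
# over these walks yields the node embeddings used for link prediction or node classification.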

Composition and Multimodality
Now that we can obtain embeddings of objects from a variety of viewpoints, we may gain useful and complementary information. The appearance, feel, and description of an item of clothing may all be significant. Is there a way to merge all of these embeddings into a single one? The concatenation of embeddings is a simple technique that yields surprising results. By joining the embeddings of text and images, for instance, a search may be conducted using either the text associated with the item or the image alone. Multimodal frameworks such as ImageBERT, ViLBERT, UNITER, and VL-BERT promise to learn from both vision and language and to build cross-modal representations, and combining visual and linguistic data in this way is becoming increasingly popular in DL. Different embeddings can also be combined and the network fine-tuned for a particular goal. Composition is a powerful method to view the broad picture of what must be encoded.


 THINK IT OVER
The popularity of items is a crucial factor in retrieval and recommendation systems: displaying unpopular items typically yields irrelevant outcomes. Adding a popularity term to the embedding is a straightforward solution to this problem. The last dimension, for instance, may be set to the inverse logarithm of the item's total number of pageviews. With this scheme, queries that place a value of 0 in that dimension will rank the most popular results first under the L2 distance. Although this method has the potential to filter less popular items out of the top results, it is not foolproof, since the balance between similarity and popularity still has to be adjusted manually. The best solution to this problem is to train the embeddings for the specific task.

3.2 Knowledge Encoding and Architectural Understanding
This section introduces a learning paradigm that can help harness both labeled and unlabeled data to improve a model's performance, primarily for models with little labeled training data. Zhang et al. (2018) explain how, even with high accuracy on test images, a CNN is not guaranteed to encode proper representations when dataset and representation bias are taken into account. As a result, as shown in Fig. 4.3, a CNN may attempt to recognize the "lipstick" attribute of a facial picture by using an inaccurate context, such as eye features. Therefore, individuals remain skeptical about CNNs until they can understand the reasoning behind them, e.g., what patterns are being employed for a prediction, either semantically or visually. Our method of choice in the following section will be the decomposability principle, a popular way to classify the degrees of interpretability by breaking a model down into its component parts such as neurons, layers, blocks, and so on. This sort of modularized analysis is widely used in the engineering world: the inner workings of a complex system can be factorized into a set of functionalized modules. Numerous technical fields can attest to the efficacy of modularized analysis, including software development and visual system design. Since we may optimize the network architecture by breaking it down into smaller, more manageable pieces, modularizing a NN promises to be helpful.
• Researchers may be tempted by the manageability of low-dimensional signals to look at the qualities of individual units as indicative of network activity.
• Experiments with class selectivity regularization, generative models, and the elimination of individual units have shown that it is not possible to consistently extrapolate single-unit attributes to the population level.
• Methods for elucidating networks should make use of qualities that are functionally, and preferably causally, relevant.


• Features that hold across neurons (distributed, high-dimensional representations) should be the primary focus of research, and tools should be developed to make these properties more intuitively approachable.

3.2.1 The Role of Neurons
Information in a layer can be decomposed even further into individual neurons or convolutional filters. One can get insight into the role of such individual units quantitatively, by assessing a unit's capacity to solve a transfer problem, or qualitatively, by developing visualizations of the input patterns that maximize the response of a single unit. Visualizations may be generated in a number of ways, for instance by optimizing an input image with gradient descent (Simonyan et al. 2014), by sampling images that maximize activation (Zhou et al. 2014), or by training a generative network to generate such images (Nguyen et al. 2016). Quantitative unit characterization is also possible through task-solving assessments. Network dissection (Bau et al. 2017) is one such approach; it evaluates how well individual units can segment an extensive annotated database of visual concepts. Network dissection can be used to characterize the kind of information represented by visual networks at each unit by quantitatively measuring the ability of the units to locate emergent concepts such as objects, parts, textures, and colours that are not explicitly defined in the original training set.
Locating the input that most strongly activates a certain neuron in a NN is one way to see the unit's features. The term 'unit' can refer to a single neuron, a group of neurons, a layer, a set of channels, a set of feature maps, the final class probability, or, preferably, the corresponding pre-softmax neuron. Given that neurons are the network's fundamental building blocks, it makes sense to develop feature visualizations for each neuron. There is, however, a catch: the number of neurons in a NN can reach into the millions, as cited in Gilpin et al. (2018), and it would take too much time to examine a graphical representation of the features of each neuron. Channels, also known as activation maps, are a feasible alternative unit of feature visualization. Furthermore, we can visualize a complete convolutional layer. As a unit, layers are the building blocks of Google's DeepDream, which is iteratively applied to an input image to produce a surreal, dreamlike outcome.
Pruning networks has been demonstrated to be an important step in deciphering the function of individual neurons in networks (Frankle and Carbin 2018). In particular, well-trained massive networks contain several tiny sub-networks with optimization-friendly initializations. This indicates that there are training methodologies that allow the same problems to be solved with much smaller networks, which may be more interpretable.


 Highlight
Consider that uk,l(x) represents the activation of the l-th neuron in the k-th layer for the input x. Erhan et al. (2009) synthesize images that elicit high activations for certain neurons by treating synthesis as an optimization problem, as seen in Eq. 3.19.

x* = arg maxx uk,l(x)    (3.19)

Because the optimization problem is non-convex in general, a gradient ascent-based strategy is used to obtain a local optimum. Starting with some initial input x = x0, the activation uk,l(x) is computed, and then steps are taken in the input space along the gradient direction ∂uk,l(x)/∂x to synthesize inputs that cause higher and higher activations for the neuron nk,l; the process eventually terminates at some x which is deemed to be a preferred input stimulus for the neuron. Visualizing hidden representations in this way offers insight into how the NN processes an input.
Information about a DNN may be collected from each neuron, as shown by works such as Bau et al. (2017); Karpathy et al. (2015); Li et al. (2015); Yu and Principe (2019); Zintgraf et al. (2017). Visualizing the activation of a single neuron/cell can help understand the distribution of the learned model representations. In particular, Karpathy et al. (2015) described the gate of an LSTM (Hochreiter and Schmidhuber 1997) as being left- or right-saturated depending on whether its activation value is less than 0.1 or greater than 0.9, respectively. Neurons that are frequently right-saturated are intriguing in this context, since it suggests that they retain information for a considerable amount of time. Figure 3.8a shows the potential of single-neuron visualization for a text-selective neuron found by Karpathy and team. Radford et al. (2017) examine the features of recurrent language models at the byte level. Given enough capacity, training data, and compute time, the representations learned by these models include disentangled features corresponding to high-level concepts, and they identified a single unit responsible for sentiment analysis. These unsupervised-learned representations are SOTA on a binary subset of the Stanford Sentiment Treebank and also show an impressive data efficiency, as illustrated in Fig. 3.8b. Le (2013) considers the challenge of creating high-level, class-specific feature detectors using just unlabeled data. Can a face detector, for instance, be trained with merely unlabeled images? To address this question, the authors train a 9-layer locally connected sparse autoencoder, with pooling and local contrast normalization, on a large dataset of images. They found that it is feasible to train a face detector without labelling pictures as to whether or not they contain a face, which goes against what seems to be a commonly held belief; refer to Fig. 3.8c. Observations from controlled tests demonstrate that this feature detector is also invariant to changes in magnification and orientation perpendicular to the plane of projection. Additionally, the same network is responsive to human bodies and cat faces as well as other abstract notions.
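A minimal PyTorch sketch of the gradient-ascent activation maximization in Eq. 3.19 is given below; the small untrained network, the choice of unit, the step size, and the number of steps are all illustrative assumptions rather than the setup of Erhan et al. (2009).

import torch
import torch.nn as nn

# Illustrative network; in practice this would be a trained model
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
model.eval()

unit = 3                                             # index l of the neuron to maximize (assumed)
x = torch.randn(1, 3, 64, 64, requires_grad=True)    # start from a random input x0
step_size = 0.1

for _ in range(200):
    activation = model(x)[0, unit]     # u_{k,l}(x): activation of the chosen unit
    model.zero_grad()
    if x.grad is not None:
        x.grad.zero_()
    activation.backward()              # gradient of the activation w.r.t. the input
    with torch.no_grad():
        x += step_size * x.grad        # gradient ascent step in input space

print("final activation:", model(x).detach()[0, unit].item())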


Fig. 3.8 Example of single neuron visualization using different strategies. a Cells with interpretable activations found in LSTM models for the Linux Kernel and the War and Peace datasets. Activations are mapped from red to blue via tanh, where −1 is red and +1 is blue (Karpathy et al. 2015). b Histogram of IMDB review sentiment expressed as a cell activation value (Radford et al. 2017). c Histograms of faces (red) versus no faces (blue); the test set is subsampled so that the proportion of samples with faces to those without is exactly one (Le 2013). Figure adapted from Karpathy et al. (2015); Radford et al. (2017); Le (2013) with permission

Fig. 3.9 Assume we have picture samples generated from a Gaussian distribution. Figure reproduced from Huszar (2018) with permission

By using these learnt characteristics as a foundation, they were able to train a network that achieves 15.8% accuracy in detecting 20,000 object categories from ImageNet, a relative increase of 70% over the prior SOTA.
How Logical Fallacies Can Occur
Examining the mode of a distribution and assuming that it will look like random samples selected from the distribution is one instance of how reasoning may go wrong. First, let us take some image samples drawn from a Gaussian distribution, as in Fig. 3.9. What does this distribution look like in terms of its mean and mode? It is a square of dull gray, obviously different from the samples shown. Worse yet, if you replaced the white noise with correlated noise and then looked at the mode, you would still see the same gray square. Therefore, it is not always helpful to focus on


Fig. 3.10 Examples of neuron visualizations: dataset examples with high activations compared with optimization-based feature visualizations. Figure reproduced from Olah et al. (2017) under creative commons attribution (CC-BY 4.0)

the middle of the distribution. In some cases, the maximum of a distribution may not even resemble the distribution at all. Take a look at the feature visualizations of Olah et al. (2017) in Fig. 3.10. In Fig. 3.10, images from the training dataset with high neuron activations are displayed in the upper panels, while the images at the bottom are obtained by optimizing the input to maximize the neuron's output. Thus, one of the bottom pictures is analogous to the mode of a distribution, while the matching top images are analogous to random samples from the distribution (not exactly, but in essence). It is not unexpected that these do not look alike. If a neuron is completely sensitive to Gaussian noise, then the input that maximizes its activation may be a gray square that has nothing in common with the samples on which it generally activates or with the samples from the Gaussian distribution it has learned to recognize. Therefore, despite the fact that these images are really fascinating and gorgeous, the authors believe their diagnostic usefulness is limited. Since the mode of a distribution is not a particularly useful summary measure in high dimensions, it would be hasty to conclude that "the neuron may not be sensing what you previously assumed". This is further compounded by the fact that when we view a picture, we employ our own visual system. The visual system acquires knowledge through exposure to natural stimuli, which are typically random samples from a distribution. Our retina has never seen the mode of the natural image distribution, so our visual brain may not even know how to react; we may find it absolutely bizarre. Even if someone revealed the mode of their ideal image density model to the world, it would likely be impossible to determine whether or not it is accurate, because no one has ever seen the original.


 THINK IT OVER
When interpolating in the latent space of a VAE or GAN, if the distribution is assumed to be Gaussian, always interpolate in polar coordinates rather than Cartesian coordinates. Two random samples drawn from a standard Normal distribution in a high-dimensional space are highly likely to be orthogonal to one another.
In this section, we examine and make sense of NNs' learned representations. Our inquiry is based on the fundamental question, "How should we characterize the representation of a neuron?" Think of it this way: a neuron at some layer in a network computes a real-valued function over the network's input domain. A comprehensive representation of a neuron's functional form would therefore be a lookup table containing all potential input-to-output mappings for that neuron. While theoretically conceivable, such infinite tables are practically impossible to build and computationally challenging to turn into a set of conclusions. Rather than focusing on the neuron's reaction to arbitrary noise, we are more interested in how it responds to elements of a known dataset (e.g., natural images). As a result, Raghu et al. (2017) define a neuron's representation in their study as the collection of outputs it has generated for a fixed set of inputs taken from a training or validation set.
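The polar (spherical) interpolation advice in the THINK IT OVER box above can be sketched as follows; the latent dimensionality and endpoints are random placeholders, and the slerp formula used here is the standard spherical linear interpolation rather than a specific recipe from this book.

import numpy as np

def slerp(z0, z1, t):
    # Spherical linear interpolation between two latent vectors z0 and z1
    omega = np.arccos(np.clip(np.dot(z0 / np.linalg.norm(z0),
                                     z1 / np.linalg.norm(z1)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z0 + t * z1      # nearly parallel: fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z0, z1 = rng.standard_normal(128), rng.standard_normal(128)   # two latent samples (placeholders)

# Interpolated latents keep a norm typical of Gaussian samples, unlike the linear midpoint
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, np.linalg.norm(slerp(z0, z1, t)))
print("linear midpoint norm:", np.linalg.norm(0.5 * (z0 + z1)))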

3.2.1.1

Interaction Between Neurons

If neurons are not necessarily the apt direction to explain NNs elaborately, what is? In practice, several neurons operate in combination to represent an image in the networks. Let us perceive these combinations geometrically; an ‘activation space’ is all potential combinations of neuron activations (basis vectors). Conversely, a permutation of neuron activations is a vector in space. This raises a question: Are the directions of the basis vectors (activated neurons) more interpretable than the directions of other vectors in this activation space? In 2013, Szegedy et al. (2013) observed that “random directions appear just as meaningful as the directions of the basis vectors”. More recently, Bau et al. (2017) noticed that “the directions of the basis vectors are often more interpretable than random directions”. Our knowledge is broadly consistent with both explanations; we observe that random directions oftentimes seem interpretable but at a slightly lower rate than basis directions. It has led researchers to perform arithmetic on neurons to establish exciting directions. For example, in Fig. 3.11, by adding, subtracting or multiplying ‘black and white’ neurons to ‘mosaic art’ (left side), or ‘colourful brush strokes’ neurons handpicked for interpretability (left side), we get a black & white version of mosaic art or brush strokes, respectively. It is reminiscent of semantic arithmetic of word embeddings seen in Word2vec or generative models’ latent spaces (Olah et al. 2017).

Fig. 3.11 Arithmetic interactions of activated layer neurons using Lucid. Reproduced image from Olah et al. (2017) under creative commons attribution (CC-BY 4.0)

These concepts hardly scratch the surface of neuron interactions. The reality is that selecting meaningful operations, or even determining whether a meaningful direction exists, is an unsolved problem. At the same time, there is reason for scepticism regarding how directions interact; the arithmetic operation above, for instance, only demonstrates how a small number of directions interact. In practice, there are thousands of combinations of interacting directions that are difficult to decode. One type of interpretability approach is known as a signal method (Kindermans et al. 2019), and it involves observing the stimulation of a neuron or a set of neurons. The values of neurons' activations can be modified or transformed into forms that can be understood. For example, it is possible to reconstruct an image similar to the input by using the activations of neurons in a layer. Feature maps in the deeper layers respond more strongly to complex features such as a human face, while feature maps in the shallower layers exhibit basic patterns like lines and curves. This is feasible because neurons store information systematically (Zeiler and Fergus 2014; Bau et al. 2017).

3.2.2 Role of Layers

One way to learn about a layer's structure is to see how well it does at solving problems that differ from those the network was trained to solve. An internal layer of a network trained to classify images of objects in the ImageNet dataset, as demonstrated by the work of Sharif Razavian et al. (2014), generates a feature vector that can be directly reused to solve a variety of other challenging image processing problems, such as fine-grained classification of different species of birds, classification of scene images, attribute detection, and object localization. Each time, a basic model such as an SVM was able to directly apply the deep representation to the target problem, achieving SOTA performance without requiring the training of a new

DNN. Transfer learning refers to the process of reusing a layer from one network to solve a new problem, and it has tremendous practical value because it allows many new problems to be handled without the need to construct new datasets and networks for each of them. The ability to apply knowledge from one setting to another was first described quantitatively by Yosinski et al. (2014). In general, the majority of gradient- and perturbation-based techniques under this topic are referred to as layer attribution variants. Whereas neuron attribution techniques attribute the model's inputs to a hidden internal neuron, layer attribution variants allow us to attribute output predictions to all neurons in a hidden layer. When it comes to attribution, neuron and layer versions are typically minor tweaks of the traditional technique. Yosinski et al. (2015) conducted a small study in which they examined how different visual stimuli affected the activation levels of neurons in different layers. They discovered that viewing activation values in real time, as they changed in response to various inputs, was helpful for comprehending how a model functions. One of the simplest ways to visualize network behaviour is to display the activations (layer activations) produced during the forward pass. Initially, activations for ReLU networks tend to be quite dense and blobby, but this typically changes into a sparse pattern as training continues. Such a representation makes it easy to spot potential problems: for instance, some activation maps being all zero for many different inputs can suggest dead filters and is a symptom of excessive learning rates.
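A minimal sketch of this kind of forward-pass inspection, assuming TensorFlow/Keras, a pre-trained VGG16, and a placeholder batch of inputs; the all-zero heuristic and the threshold are illustrative choices, not the authors' tooling.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")

# Build a feature extractor that returns every convolutional activation map.
conv_layers = [l for l in model.layers if "conv" in l.name]
extractor = tf.keras.Model(inputs=model.input,
                           outputs=[l.output for l in conv_layers])

x = np.random.rand(8, 224, 224, 3).astype("float32")   # stand-in for a real, preprocessed batch
activations = extractor.predict(x)

# Flag channels whose activation map is (almost) all zero across the whole batch:
# a rough heuristic for dead filters in ReLU networks.
for layer, act in zip(conv_layers, activations):
    channel_max = act.max(axis=(0, 1, 2))               # max response per channel
    dead = np.where(channel_max < 1e-6)[0]
    if len(dead):
        print(f"{layer.name}: {len(dead)} possibly dead filters -> {dead[:10]}")
```

Running this repeatedly during training, rather than on a fixed pre-trained model, is what makes it useful for spotting the excessive-learning-rate symptom described above.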

3.2.2.1 Filters and Feature Map Visualization

The intriguing task is to remove the obscurity of a black-box CNN model and to understand what is happening beneath the hood. What is the model actually looking at in the image? The motivations for human interpretation keep growing: building trust in a model's behaviour in practical applications, detecting model bias, and plain scientific curiosity. Two prominent research threads have begun to merge: feature visualization (Erhan et al. 2009; Olah et al. 2017; Simonyan et al. 2014; Nguyen et al. 2015; Mordvintsev et al. 2015; Nguyen et al. 2017) and attribution (Simonyan et al. 2014; Zeiler and Fergus 2014; Springenberg et al. 2014; Fong and Vedaldi 2017; Kindermans et al. 2019, 2017). The former answers what a network, or a part of it, is looking for by generating examples (Fig. 3.18), while the latter studies which parts of a sample are responsible for the network activating in a certain way. Let us first discuss feature visualization and the corresponding optimization objectives for a NN. By nature, NNs are differentiable with respect to their input data, which is what makes such optimization possible in the first place.

We can start from noise and iteratively tweak the image to activate a specific layer, and then interpret the internal neuron firing or the final output behaviour (Fig. 3.39). Feature visualization is a mathematical optimization problem. Regarding the kernel, its input values can be initialized with different strategies:

• Initializing all the values to a constant like 0 or 1.
• Initializing with some predefined values.
• Initializing with values drawn from a distribution such as a normal or uniform distribution.

Given that the NN has been trained, we suppose that its weights are fixed. New images are sought in an effort to maximize the (mean) neuronal activation, as represented in Eq. 3.20 below:

$\text{img}^{*} = \arg\max_{\text{img}} \; h_{n,x,y,z}(\text{img}) \qquad (3.20)$

The function h represents a neuron's activation, img the network's input (an image), x and y the neuron's spatial position, n the layer, and z the channel index. Here, we maximize the mean activation of a whole channel z in layer n using Eq. 3.21:

$\text{img}^{*} = \arg\max_{\text{img}} \; \sum_{x,y} h_{n,x,y,z}(\text{img}) \qquad (3.21)$

All neurons in channel z are weighted equally in this formula. A second option is to maximize along random directions, which would involve multiplying the neurons by a variety of coefficients, some of which may be negative. This allows us to investigate how neurons within the channel interact. The activation level does not have to be maximized; it can instead be minimized, which corresponds to maximizing the negative direction. This optimization problem can be tackled in various ways. Instead of generating new images, we may, for instance, search through our training images and pick the ones that result in the highest activation. Although this strategy has merit, the use of training data presents the issue of correlated elements in the images, making it impossible to know what the NN is actually looking for: we do not know whether the NN is focusing on the dog, the tennis ball, or both in images that produce strong activation of a given channel. An intuitive observation about NNs is that as the number of layers grows, developing model simplification algorithms becomes progressively more difficult. For this reason, feature relevance techniques have gained popularity in recent years. Kindermans et al. (2017) proposed ways to estimate neuron-wise signals in NNs; using these estimators, they present an approach to superimpose neuron-wise explanations in order to produce more comprehensive explanations.
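The dataset-search strategy mentioned above can be sketched as follows, assuming TensorFlow/Keras, a pre-trained VGG16, and a placeholder array standing in for real training images; the layer name and channel index are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
layer_name, channel = "block4_conv3", 17            # arbitrary unit to inspect
feature_extractor = tf.keras.Model(model.input,
                                   model.get_layer(layer_name).output)

dataset = np.random.rand(256, 224, 224, 3).astype("float32")  # stand-in for real data

# Mean activation of the chosen channel for every image in the dataset.
acts = feature_extractor.predict(dataset, batch_size=32)      # (N, H, W, C)
channel_score = acts[..., channel].mean(axis=(1, 2))           # (N,)

top_k = np.argsort(channel_score)[::-1][:9]
print("indices of the most strongly activating images:", top_k)
```

Displaying the selected images side by side is exactly where the dog-versus-tennis-ball ambiguity discussed above becomes visible.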

 THINK IT OVER Why do we generally use a 3 × 3 kernel over 2 × 2 or 4 × 4? Choosing a smaller kernel increases computation time and resource usage, while a larger kernel may trade off the salient features captured from the image and produce a noisier output matrix. A 3 × 3 kernel is therefore the safer, near-optimal choice in DL.
Why do we use symmetrical kernels? As discussed in the previous chapter, convolution demands a flip of the kernel, otherwise the result is a correlation. Using a non-symmetrical matrix would restrict this flipping and we would never obtain a convolution.
What if the size of the kernel equals the size of the image? Choosing a small kernel captures a lot of detail but increases computational cost and the possibility of overfitting. However, choosing a kernel as large as, or larger than, the image returns only a single neuron, which leads to underfitting.

Today, comprehending intelligence is considered one of the biggest problems in science. Arguably, the intricacy of learning is a gateway to understanding the intelligence of human brains and computers: discovering how the human brain works, building intelligent machines that learn from experience, and improving their competence the way children do. It is therefore necessary to question the model about which latent features are essential during training, how the model develops an understanding of those feature points, and why those features are responsible for certain decision-making policies. In the remainder of this section, we examine the theories, techniques, and tools necessary to comprehend DNNs, and CNNs in particular, for CV tasks.

Filter Interpretability
As of 2017, 'objects', 'parts', 'scenes', 'textures', 'materials', and 'colors' were the six forms of semantics established by Bau et al. (2017) for CNN filters. To determine a filter's interpretability, users are expected to annotate test images with these six distinct semantic categories. This method propagates the RF of each active unit in a filter's feature map back to the image plane as the image-resolution RF of the filter, and the evaluation metric measures the fit between this RF and the pixel-level semantic annotations on the image. For instance, we can assume that a filter reflects a particular semantic concept if its RF typically shows high overlap with ground-truth image regions of that concept across a variety of images. This approach computes the feature maps X = {x = f(I) | I ∈ I} on separate testing images for each filter f. Next, the spread of activation scores across all positions on all feature maps is computed. To pick the top activations across all spatial locations [i, j] of all feature maps x ∈ X as valid map areas corresponding to f's semantics, Bau et al. (2017) established an activation threshold $T_f$ such that $p(x_{ij} > T_f) = 0.005$.

The RF of valid activations for each image is then obtained by scaling up the low-resolution valid map areas to the image resolution. In this context, the valid activations of f with respect to image I are denoted by $S^I_f$, which stands for the RF of f. The IoU score in Eq. 3.22 indicates whether or not a given filter corresponds to a certain semantic concept (Zhang and Zhu 2018):

$\mathrm{IoU}^I_{f,k} = \dfrac{\left\| S^I_f \cap S^I_k \right\|}{\left\| S^I_f \cup S^I_k \right\|} \qquad (3.22)$

where $S^I_k$ is the ground-truth mask for the k-th semantic concept on image I. If $\mathrm{IoU}^I_{f,k} > 0.04$, then the k-th concept is associated with filter f for the given image I, and the likelihood that filter f is linked to concept k is given by $P_{f,k} = \mathrm{mean}_{I:\,\text{with }k\text{-th concept}}\; \mathbf{1}\!\left(\mathrm{IoU}^I_{f,k} > 0.04\right)$. Therefore, $P_{f,k}$ can be used to assess the interpretability of filter f.

 Highlight To accommodate the vast differences in data distribution, filters come in a variety of sizes, making it challenging to zero in on the optimal one. When the convolutional layer's filter sizes are very modest, it can only handle data from a relatively limited neighbourhood, whereas a larger filter size collects a greater amount of context.

In essence, Fig. 3.12 represents the fundamental building components of a NN. Depicting each building block individually or collectively provides an idea of what features the network has abstracted in each layer or node. Typically, a feature visualization of each neuron, the atomic unit of a NN, should provide the most detailed information. However, visualizing a single neuron may not always give a complete picture of the knowledge represented in the node, and visualizing millions of neurons meshing in a network in order to provide a relevant interpretation for each input is impractical. As a result, channels (also known as activation maps) are an excellent display choice.

Fig. 3.12 Diagrammatic representation of activation in atoms of NNs

Imaging the spatial, channel, or group activation thus aids in deciphering the encoded information to some extent. However, for the sake of simplicity, we also interpret entire convolutional layers. We will see whether we can make the models a little more 'gray-box'. There has already been significant progress in explaining NN inputs/outputs, such as CAM (Fig. A.2) and showing intermediate layer outputs (Fig. 4.4).

 THINK IT OVER The reasoning in this subsection is based on the assumption that a pattern to which a unit responds most strongly may be an accurate first-order representation of its behaviour. What is your take?

One straightforward approach is to identify the input sample or samples (from the training or test set) that result in the maximum activation of a specific unit. There remains the issue of how many samples to keep for each unit and how to "combine" these samples, both of which are problematic. If possible, it would be valuable to learn what characteristics these samples share. Additionally, it is not always obvious which parts of the input vector are responsible for the high activation. Note also that by limiting our search to the training or test sets, we imposed a restriction that is not actually necessary. Taking a step back, we may frame our idea, which is to maximize a unit's activation, as an optimization problem. In general, this is a hard optimization problem because it is non-convex, but it is one for which a local optimum can be sought. The simplest way to do so is to use gradient ascent, moving x in the direction of the gradient of $h_{ij}(\theta, x)$ in the input space, where θ denotes the NN parameters (weights and biases). Assuming a fixed θ, we look for a solution of the non-convex problem in Eq. 3.23:

$x^{*} = \underset{x \;\text{s.t.}\; \|x\| = \rho}{\arg\max}\; h_{ij}(\theta, x) \qquad (3.23)$

It is possible that two or more local optima are discovered, or that the same (qualitative) optimum is reached when starting from distinct random initializations. In either scenario, the unit can be characterized by one optimum or a group of optima: one can take an average, pick the solution that maximizes the activation, or show all the local optima produced for that unit. This method of optimization, which we will refer to as activation maximization (AM), can be used with any network for which we can calculate the aforementioned gradients. As with any gradient-based method, the learning rate and the termination criterion are hyperparameters that must be set.
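A minimal sketch of AM with a norm constraint, assuming TensorFlow/Keras and a pre-trained VGG16; the target layer, unit index, step size, radius, and iteration count are illustrative choices, not values from the text.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
layer_name, unit = "block5_conv2", 42                      # unit h_ij to maximize
extractor = tf.keras.Model(model.input, model.get_layer(layer_name).output)

rho = 120.0                                                # norm constraint ||x|| = rho
x = tf.random.uniform((1, 224, 224, 3))                    # start from noise
x = x / tf.norm(x) * rho

for _ in range(256):                                       # gradient ascent
    with tf.GradientTape() as tape:
        tape.watch(x)
        act = extractor(x)                                 # (1, H, W, C)
        score = tf.reduce_mean(act[..., unit])             # mean activation of the unit
    grad = tape.gradient(score, x)
    grad /= tf.norm(grad) + 1e-8                           # normalized ascent step
    x = x + 1.0 * grad
    x = x / tf.norm(x) * rho                               # project back onto the sphere

result = x.numpy()[0]                                      # image that (locally) maximizes the unit
```

Restarting from several random initializations and comparing the resulting optima is how one would probe whether the unit has a single coherent preferred stimulus or several.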

Conv/FC Filters
The second popular approach is to visualize the weights. These are usually most interpretable in the first conv-layer, which looks directly at the raw pixel data, but filter weights can also be shown deeper in the network. The weights are useful for visualization because well-trained networks typically produce attractive, smooth filters without noisy patterns. Noisy patterns may indicate that a network has not been trained long enough, or that it has a very low regularization strength, which may have resulted in overfitting. Take a trained AlexNet, for instance, and inspect the filters of its first and second conv-layers. The first-layer weights are very smooth, indicating a well-converged network. Since AlexNet has two independent processing paths, it stands to reason that one path evolves high-frequency grayscale features and the other low-frequency color features; this design accounts for the clustering of the color/grayscale features. Even if the weights of the second conv-layer are not as easily interpreted as the first, it is clear that they are well formed and free of noisy patterns. Furthermore, to find out whether a NN learns a similar representation when randomly initialized, Li et al. (2015) compared the features produced by different initializations.

A neuron's RF is the area of an input volume from which it receives and processes information (Lindeberg 2013). The approach of Zhou et al. (2014) for dissecting networks was to select K images with high activation values for the neurons of interest, build 5,000 occluded variants of each of the K images, and feed them into the NN to observe the changes in activation values for a specific unit; an appreciable change in activation marks an occluded region on which the unit depends. The RF was then generated by re-centering and averaging the resulting discrepancy maps of the occluded images. To further illustrate how this style of network dissection can be applied to generative networks, consider the article published in 2020 (Bau et al. 2020). An interpretability measure was calculated by Bau et al. (2017) by resizing a low-resolution activation map of a given layer to the size of the input, thresholding it into a binary activation map, and calculating the overlap between the binary activation map and the ground-truth binary segmentation map. Zhang et al. (2018) deconstructed feature relations in a network on the premise that the feature map of a filter in each layer can be activated by part patterns in the layer above. To characterize the relationships between the features in the hierarchy, they mined patterns layer by layer, found the activation peaks of patterns in each layer's feature map, and constructed an explanatory graph in which each node represents a pattern and each edge between layers represents a co-activation relation.

Next, turning to visualizing features, we might simply optimize the input data to make neurons fire. Unfortunately, this does not really work on its own, and we end up with an optical illusion for the NN. Even with a carefully tuned learning rate, the image fills with noise and nonsensical high-frequency patterns to which the architecture responds strongly. These patterns find a way to trigger neurons in configurations that do not occur in real life; such an image is, in a sense, cheating.

Fig. 3.13 Checkerboard pattern magnitudes due to strided convolution and max-pooling during gradient backpropagation

With long optimization we do begin to see genuine detections by the neurons, but the image is still dominated by high-frequency patterns. An example with a checkerboard pattern caused by the convolution operation is shown in Fig. 3.13. These patterns correlate with the appearance of adversarial samples (Szegedy et al. 2013). Although the formation of high-frequency patterns is not fully understood, strided convolution and pooling operations appear to be their prominent cause in gradients (Odena et al. 2016). These high-frequency patterns suggest that constraint-free, optimization-based visualization is a double-edged sword despite its appeal: without any constraints on the data, it ends up producing adversarial examples. This is intriguing, but we need to move past such artifacts if we want to learn how these models operate on real data. We will therefore visualize some applications of interpretability techniques that give insight into the behaviour of general CV models, and then take a more rigorous approach through network visualization and the interpretation of neurons, the building blocks of DNNs. Other work, such as "Deep Dream" (GoogleBlog'15) (Mordvintsev et al. 2015), provides fascinating abstract representations which do not greatly aid interpretability but are pleasant to look at. In 2017, Henderson and Rothe introduced a new modular framework called "Picasso" that computes the actual receptive field of filters, making it possible to watch the training of a NN image classifier. Olah et al. (2017) provide an excellent interactive interface (feature visualization) demonstrating activation-maximized images for GoogLeNet.
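A minimal sketch of the filter-weight visualization described at the start of this subsection, assuming TensorFlow/Keras, matplotlib, and a pre-trained VGG16 rather than AlexNet; the choice of layer and grid size is illustrative.

```python
import matplotlib.pyplot as plt
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
weights, _ = model.get_layer("block1_conv1").get_weights()   # (3, 3, 3, 64): h, w, in, out

# Normalize each 3x3x3 filter to [0, 1] so it can be displayed as a tiny RGB patch.
w_min, w_max = weights.min(), weights.max()
filters = (weights - w_min) / (w_max - w_min)

fig, axes = plt.subplots(8, 8, figsize=(6, 6))
for i, ax in enumerate(axes.flat):                            # first 64 filters
    ax.imshow(filters[:, :, :, i])
    ax.axis("off")
plt.suptitle("First conv-layer filters (smooth filters suggest a well-trained net)")
plt.show()
```

Noisy, speckled patches in such a grid would be the visual symptom of undertraining or weak regularization mentioned above.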

 Highlight Attribution visualization investigates which component of a sample input is responsible for the specific mode of network activity. Some strategies for analyzing DL models, particularly for high-dimensional unstructured data such as images, are:

• SHAP gradient explainer: Blends the concepts of Integrated Gradients, SmoothGrad, and SHAP into a single expected value formula.
• Visualizing activation layers: Exhibit the model's activation of feature maps at different network layers in response to a particular input.
• Occlusion sensitivity: Demonstrate the recursive effect that occluding or hiding regions of the image has on the network's confidence in its interpretation of the image.
• Grad-CAM: Display how the model's predictions for various image regions change as a result of gradients back-propagated to the class activation maps (see the sketch after this list).
• Smooth-grad: Pixels of interest can be located in an input image by averaging gradient sensitivity maps for that image.
• DeepEyes: Identification of stable layers, degenerated filters, undetectable patterns, oversized layers, redundant layers, and the necessity for additional layers (see Table B.4).
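A minimal Grad-CAM sketch, assuming TensorFlow/Keras and a pre-trained VGG16; the conv layer name, the placeholder input, and the use of the top predicted class are illustrative assumptions rather than a prescription from the text.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
grad_model = tf.keras.Model(model.input,
                            [model.get_layer("block5_conv3").output, model.output])

img = tf.random.uniform((1, 224, 224, 3))          # placeholder for a preprocessed image

with tf.GradientTape() as tape:
    conv_out, preds = grad_model(img)
    class_idx = int(tf.argmax(preds[0]))            # explain the top predicted class
    class_score = preds[:, class_idx]

grads = tape.gradient(class_score, conv_out)        # d(class score) / d(feature maps)
weights = tf.reduce_mean(grads, axis=(1, 2))        # global-average-pooled gradients (1, C)
cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)  # weighted sum of maps
cam = tf.nn.relu(cam)[0].numpy()                    # keep only positive influence
cam = cam / (cam.max() + 1e-8)                      # normalized low-resolution heatmap
```

The resulting coarse heatmap is usually upsampled to the input size and overlaid on the image, which is what the Grad-CAM panels referenced later in this chapter show.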

3.2.3 Role of Explanation

Extending the taxonomy for classifying existing interpretation methods, the techniques are expanded to include a diverse range of explanations, and these explanations are then examined to see whether users' concerns are warranted. Overall, it is clear that numerous aspects of the input's effect on the result can be explained. However, explanatory techniques and interfaces for ordinary users are still lacking, and we can only hypothesize about the characteristics such tools should have. Finally, it is worth mentioning that explanation approaches have a hard time dealing with two major concerns: the worry that biased datasets may produce biased DNNs, and the suspicion that results will be unfair. Table 3.1 summarizes the advantages and disadvantages of the popular modes of interpretation for DL models.

3.2.3.1 Textual Explanations

Explaining-by-text approaches excel at combined image-language tasks, where the resulting textual explanations are genuinely helpful in comprehending the model's actions.

Table 3.1 Pros and cons of various types of explanations

Explanation | Advantages | Disadvantages
Visual | Simpler to implement and communicate to a non-technical audience | The number of features under inspection is limited, and the plots require expert intervention to produce an explanation
Local | In general, it operates at the instance level, explaining model behavior from a local perspective | Defining locality is difficult, and explanations lack global scope and stability
Example | Helps understand the model's most influential training data space and internal reasoning | Fails to convey influential areas and needs expert inspection
Textual | Rule-based or logical flow that's easy to read and yields interesting insights | The ambiguity of human language, irony and sarcasm can make the explanation meaningless
Adversarial | Small, intentional feature perturbations that cause a model to predict incorrectly and clearly expose its weakness | Designed to cause a model to make incorrect predictions despite looking human-friendly; most adversarial attacks aim to degrade classifier performance on specific tasks to "trick" DL

One line of work takes on the challenge of explaining a model's outputs by training the model to generate explanatory prose (Bennetot et al. 2019). Image-language joint tasks, such as creating a diagnostic report from an X-ray image, can benefit greatly from methods that generate explanatory symbols. Natural language text is one example of a symbol that can be used to create an explainable representation. Propositional symbols, in other words, are used to describe the model's behaviour by defining abstract ideas that represent high-level processes. Every way of creating symbols to depict the model's operation can also be explained in textual form. Through a semantic mapping from model to symbols, these symbols or icons can represent the algorithm's reasoning process. However, explanation by text is effective only if a language module is included in the DL model, so it cannot be considered a universal strategy. While texts may seem explanatory, the techniques needed to produce them are not always clearly specified. In the event of incorrect predictions, for instance, RNNs and LSTMs are still mostly opaque black boxes, and only a handful of studies have attempted to explore their hidden signals. Since a similar issue, namely what to do with the intermediate signals, arises there as well, this is a potential area of study. Moreover, while word embeddings are often optimized by loss minimization, the structure and geometry of the optimized embedding do not appear to be well understood. Correctly analyzing the shape of the embedding could provide insight into the algorithm's inner workings and hints about the optimization.

3.2.3.2 Visual Explanations

How can we create an effective visual explanation? When it comes to image classification, for example, a "good" visual explanation of any target category should (a) localize the category in a class-discriminative way and (b) have a high degree of resolution, i.e. capture fine-grained detail. The purpose of visual explanation is to produce visuals that help with the interpretation of a model. Despite inherent difficulties, such as our limited capacity to see information in more than three dimensions, the established methods can aid in learning more about the decision boundary or the interplay between features. As a result, visualization is typically employed in tandem with other methods, especially when addressing a general audience.

3.2.3.3 Statistical Explanations

Model inspection techniques employ external algorithms to investigate NNs by methodically extracting crucial structural and parametric information about their inner working processes. The methods in this class are more technically grounded than plain feature analysis, since analytical tools such as statistics are directly involved in the analysis; information obtained through a model inspection strategy is therefore more reliable and fruitful. The strategies can be further divided according to the scope of interpretation. The goal of a local explanation is to shed light on the workings of a model in a constrained context; the resulting explanations may not apply on a larger scale or be representative of the model's overall behaviour. As an alternative, users often approximate the model around the instance they wish to explain in order to derive explanations that characterize the model's behaviour in similar situations. The term "explanations by simplification" describes methods that attempt to shed light on a complicated model by replacing its complex components with more straightforward ones. The primary difficulty lies in the need for the simplified model to be flexible enough to provide a close approximation of the complicated model; typically, this is evaluated by comparing the two models' performance on classification problems. Post-hoc explainability methods, on the other hand, attempt to sketch out the workings of an already-trained model. Many of the current visualization approaches in the literature include dimensionality reduction techniques, which enable the creation of a simple representation that can be easily understood by humans. Visualization, possibly combined with other strategies, remains the best way to communicate complicated interactions within the model to people unfamiliar with ML modeling.

3.2.3.4 Explanations by Examples

The procedures discussed here are intriguing and motivating. However, because little is learned about the inner workings of a NN from chosen query scenarios, this method is more of a sanity check than a general interpretation. To illustrate a model's functionality, explanation by example chooses representative examples from the training dataset. This is analogous to the way people often explain things: they use concrete instances to explain abstract ideas. For an example to make sense, the training data must be presented in a human-understandable format, such as an image, because hidden information in random vectors with hundreds of variables would be impossible for humans to comprehend. Case-based reasoning (Li et al. 2018) is an explanation technique built on specific examples. Dry statistics about a product might not grab your attention, but hearing about the product's impact on other people's lives might. Many in the field find solace in this philosophy, and case-based understanding of DL draws on it. The core of a paradigm can be captured by providing illustrative cases, which is what case-based explanations do. The model can be better understood if data examples related to the model's output are extracted. Extracting representative instances that capture the fundamental linkages and correlations identified by the examined model is the primary focus of explanation by example, much as when people attempt to explain a given process.

3.2.3.5 Feature Relevance Explanations

Comparing, evaluating, and displaying properties of neurons and layers is at the heart of feature relevance methods. Once sensitive characteristics, and ways of processing them, have been found through feature analysis, the reasoning behind the model can be articulated to some extent. The qualitative insights into the kinds of features a network has learned can benefit any NN; however, these methods still lack the depth, rigor, and unity needed to employ them for systematically improving a model's interpretability. Such after-the-fact explanation strategies reveal the hidden workings of a model by calculating a relevance score for its input variables. Using these scores, we can measure how much each individual feature contributes to the model's final result, i.e. how sensitive the model is to that feature. The goal of feature relevance explanations is to provide a rationale for a model's conclusion by identifying the relative importance of the various inputs. A higher score indicates that the associated variable was more significant to the model than features with a lower score; the process yields an ordered list of significance scores. When scores are compared across multiple variables, it becomes clear how significant each variable was to the model. Although these scores alone might not amount to a full explanation, they do provide some insight into the model's logic.

Fig. 3.14 Sample optimization with diversity reveals four different facets of a class label. Interestingly, the model learns the material so thoroughly that even the wooden ladle is confidently labeled as a dog (Olah et al. 2017). Figure adapted from Olah et al. (2018) under creative commons attribution (CC-BY 4.0)

3.2.4 Semantic Understanding

3.2.4.1 Diversity of Features

Visualizing a genuine sample can often be deceiving, in the sense that the visualization may represent only a part of the entire feature representation. For instance, Fig. 3.14 shows the optimization of a network at the class level to classify dogs. A classifier is expected to recognize quite different visual profiles of a dog, be it a face close-up or a wider profile. Previous work by Wei et al. (2015) strives to illustrate this 'intra-class' diversity: the authors cluster the activations over the entire dataset and optimize them, showing the different facets of a learned class. Nguyen et al. (2016) take a different approach, searching for diverse samples across the whole dataset and using them as starting points for optimization. More recently, they sampled diverse examples by combining class visualization with a generative model (Nguyen et al. 2017). The leftmost image in Fig. 3.14 is a depiction of simple optimization for layer mixed-4a of the InceptionV1 model. According to the illustration, neurons for the dog class are activated on the upper head region of dogs with eyes and downward-curling edges. The next four images in the row show optimization with diversity, which pushes us to cover more of the class; these evidently include one example with no eyes and another with an upward texture of the dog's fur. The hypothesis holds broadly when tested on a sample dataset. It is worth noting that a wooden spoon with a texture and color similar to the dog activates the neuron as well. Hence, we have a simple technique to achieve diversity: add a 'diversity term' to the optimization objective that pushes distinct examples apart from each other (see the sketch at the end of this subsection). Diversity can take many forms, and we still have a limited understanding of its benefits. One option for penalizing similar cases is the cosine similarity distance metric (Eq. A.4). The use of a contrastive loss in a Siamese network to discriminate classes using PCA is a fascinating application, as detailed in Sect. 3.4.1.1, Fig. 3.44. Another method is to use the style features of style transfer (Gatys et al. 2015) to force the display of features in distinct styles. We will be able to understand the degree of

Fig. 3.15 Semantic understanding of neurons with multi-diverse examples. Figure adapted from Olah et al. (2018) under creative commons attribution (CC-BY 4.0)

neuron activation for an input example using various feature visualization techniques. However, numerous flaws in this technique must be addressed:
– First, the obligation to create different examples can cause artifacts to appear, such as the eyes in our dog-and-cat example.
– Second, optimization can impose an unnatural way of perceiving an example. For instance, one might want examples of a tabby cat separated from other cat classes such as Persian cat or tiger cat, beyond what the model itself distinguishes.
– Additionally, the dataset-based approach of Wei et al. (2015) can separate features more naturally, but proved less useful for interpreting the model's behaviour on other data.
– Finally, diversity, with all its advantages, begins to uncover a more fundamental issue: whereas the example in Fig. 3.14 shows a coherent reflection of a concept (here, animal faces), there are other neurons reflecting a strange mix of concepts. The image in Fig. 3.15 responds to car bodies along with two kinds of animal faces. Examples like these support the prominent impression that "Neurons are not necessarily the correct semantic units to understand neural nets" (Olah et al. 2017).
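A minimal sketch of the diversity term referred to above: a pairwise cosine-similarity penalty added to a channel-maximization objective. It assumes TensorFlow/Keras and a pre-trained VGG16; the layer, channel, batch size, penalty weight, and iteration count are illustrative.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
extractor = tf.keras.Model(model.input, model.get_layer("block5_conv1").output)
channel, lam, n = 7, 0.1, 4                       # unit to visualize, penalty weight, batch

imgs = tf.Variable(tf.random.uniform((n, 224, 224, 3)))   # n images optimized jointly
opt = tf.keras.optimizers.Adam(learning_rate=0.05)

for _ in range(200):
    with tf.GradientTape() as tape:
        acts = extractor(imgs)                                 # (n, H, W, C)
        activation = tf.reduce_mean(acts[..., channel])        # objective to maximize
        flat = tf.math.l2_normalize(tf.reshape(acts, (n, -1)), axis=1)
        sim = tf.matmul(flat, flat, transpose_b=True)          # pairwise cosine similarities
        diversity_penalty = (tf.reduce_sum(sim) - n) / (n * (n - 1))
        loss = -activation + lam * diversity_penalty           # minimize -> maximize activation
    grads = tape.gradient(loss, [imgs])
    opt.apply_gradients(zip(grads, [imgs]))
```

Increasing `lam` pushes the optimized examples apart in activation space, which is exactly how the multi-facet panels of Fig. 3.14 are produced in spirit, though the published results use additional image-space regularizers.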

3.2.4.2 Progressive Spatial Resolution and Attention Shift

This is a brief case study of the progressive growing of GANs (PROGAN), which improves quality, stability, and variation during training; an interesting concept with potential architectural lessons for DL practitioners in the CV domain. Here, Karras et al. (2018) begin training at a resolution of 4 × 4 pixels. Throughout the training process, all existing layers remain trainable. As the generator and discriminator accumulate more layers during training, the quality of the generated images improves. The work of Wang et al. (2018), which employs numerous discriminators operating at varying spatial resolutions, is related to this concept of growing GANs progressively. Wang's work is inspired, in turn, by the efforts of Durugkar et al. (2017), who employed a single generator in conjunction with multiple discriminators, and Ghosh et al. (2018), who employed the inverse strategy of multiple generators in conjunction with a single discriminator. In hierarchical GANs (Zhang et al. 2017), a generator and a discriminator are defined separately for each level of an image pyramid. PROGAN's approach, like all of these others, is based on the insight that learning the complex mapping from latents to

Fig. 3.16 Both the generator and the discriminator commence with a 4 × 4 pixel spatial resolution. We incrementally add layers to the generator and the discriminator as training progresses, enhancing the spatial resolution of the generated images. Throughout the procedure, all existing layers remain trainable. Here, I × I refers to convolutional layers operating at that spatial resolution. This permits stable synthesis at high resolutions and also significantly accelerates training. Figure recreated with inspiration from Karras et al. (2018)

high-resolution images is best accomplished in stages, but with the key difference of using a single GAN rather than a hierarchy. Unlike other work on adaptively growing networks, such as growing neural gas (Fritzke 1994) and neuro-evolution of augmenting topologies (Stanley and Miikkulainen 2002), which grow greedily, Karras et al. simply delay the introduction of pre-configured layers. In this way, their approach is similar to layer-wise training of AEs (Bengio 2006). Their primary contribution is a strategy for training GANs that begins with low-resolution images and gradually increases the resolution by adding layers to the network, as shown in Fig. 3.16. Because of its progressive nature, the model does not have to learn all scales at once; training can instead focus first on a broad overview of the image distribution and subsequently drill down into its finer details. The authors adopt generator and discriminator networks that are mirrored versions of one another, with a gradual transition that introduces new layers to the two networks in synchrony. During the training process, all of the existing layers in both networks can be modified; the gradual transition prevents abrupt shocks to the smaller-resolution layers, which are already trained. We observe that gradual training has numerous advantages:
1. The generation of smaller images is far more stable because there is significantly less class information and there are fewer modes (Odena et al. 2017).

2. By gradually increasing the resolution, we are continually posing questions that are far simpler than the ultimate objective of discovering a mapping from latent vectors to images.
3. By using the WGAN-GP loss (Gulrajani et al. 2017) or even the LSGAN loss (Mao et al. 2017), we can stabilize training to the point where we can dependably synthesize images at the megapixel scale.
4. Less time is spent on training. In the case of gradually growing GANs, the majority of iterations are performed at lower resolutions, and comparable result quality is often attained up to 2–6 times faster, depending on the final target resolution.

Attributes Versus Texture Learning of the Model
It is a well-established observation in the AI industry that DL models are 'data hungry': the more data, the better. This is why we often perform data augmentation and introduce data redundancy to improve a model's performance within algorithmic limitations. This motivated us to start from the basics and visualize the model's response to image augmentations, such as splitting the original data into multiple input channels and rotating the image, in order to understand how a channel or neuron reflects imaging invariance properties and to probe the attributes-versus-texture learning of the model using our own shuffle-crop and mix-features-crop techniques. In Fig. 3.17, column 1 is the original input to the VGG-16 model with pre-trained weights from ImageNet. The prediction below each image shows the top-3 classes mapped from the ImageNet 1000-class indices to human-readable labels. Column 2 shows the weighted RGB-to-grayscale augmentation using channel-specific weights, $I_{\text{weighted-gray}} = 0.2989 \times R + 0.5870 \times G + 0.1140 \times B$, commonly chosen to reflect differences in human perceptual sensitivity to the three colors (Ware 2019). However, the difference is hardly noticeable on a normal computer screen compared to the averaged grayscale image computed as $I_{\text{gray}} = (R + G + B)/3$. The remaining columns are self-explanatory: the Red, Green and Blue channels extracted from the original image. Class A in Fig. 3.17 represents an ideal image for object recognition, with no major changes in top-class detection by the model. Class B shows a sample image captured by a Nikon D3300 camera in rows 1–2 and a mobile-captured image in row 3. The JPEG image captured with the DSLR camera shows significant variation in class label and class probability across the columns for all three samples. This raises the question of how the model handles the channel weights, and whether in specific domains it is critical to evaluate accuracy with the illumination parameters taken into account.
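A minimal sketch of the channel-splitting experiment described above, assuming TensorFlow/Keras, a pre-trained VGG-16, and a hypothetical image file `sample.jpg`; only the weighted-grayscale variant is shown.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions

model = VGG16(weights="imagenet")

def top3(img_rgb):
    """Return the top-3 ImageNet predictions for a (224, 224, 3) RGB array."""
    x = preprocess_input(np.expand_dims(img_rgb.astype("float32"), 0))
    return decode_predictions(model.predict(x), top=3)[0]

img = tf.keras.utils.load_img("sample.jpg", target_size=(224, 224))  # hypothetical file
rgb = np.array(img, dtype="float32")

# Weighted RGB-to-grayscale conversion, replicated to 3 channels so VGG-16 accepts it.
gray = 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]
gray3 = np.stack([gray, gray, gray], axis=-1)

print("original:      ", top3(rgb))
print("weighted gray: ", top3(gray3))
```

Repeating the call for the single R, G, and B channels (each replicated to three channels) reproduces the remaining columns of Fig. 3.17 in spirit.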

3.2.4.3 Pixel Attribution

Pixel attribution techniques reveal the key pixels that contributed to a specific classification of an image by the NN. Such techniques go by many different names: sensitivity map, saliency map, pixel attribution map, gradient-based attribution, feature relevance, feature attribution, and feature contribution. Pixel attribution is essentially the subset of feature attribution that applies specifically

Fig. 3.17 Splitting image channels for top-3 class prediction using transfer learning VGG-16 base model with ImageNet dataset pre-trained weights

to CV tasks. By attributing to each input feature a weight based on how much it altered the prediction, feature attribution explains each prediction in terms of positive or negative contributions. Pixels, tabular data, or even words can all serve as input features. General feature attribution approaches include SHAP, Shapley values, and LIME, among others. These methods can be categorized on the basis of their baseline formulation, following Molnar (2020):
1. Gradient-only methods: Gradient-only strategies reveal whether a pixel change affects the prediction. This can be interpreted as follows: if we were to raise the pixel's color values, the predicted class probability would increase (for a positive gradient) or decrease (for a negative gradient). The larger the absolute value of the gradient at a pixel, the larger that pixel's contribution. Vanilla Gradient and Grad-CAM are examples.

2. Path-attribution methods: Path-attribution strategies compare the current image with a reference (baseline) image, which can be an artificial "zero" image such as a completely grey image. The difference between the real prediction and the baseline prediction is distributed among the pixels. This category contains gradient-based, model-specific approaches such as Deep Taylor and Integrated Gradients, as well as model-agnostic methods such as LIME and SHAP. All path-attribution techniques evaluate the result relative to a baseline: each pixel is responsible for a certain share of the difference between the classification scores of the actual image and of the baseline image. The explanation is heavily influenced by the choice of the reference image (distribution). Some path-attribution approaches are considered "complete", meaning the sum of the relevance scores of all input features equals the difference between the prediction for the image and the prediction for the reference image. SHAP and Integrated Gradients are examples.

At this point, we would normally explain how these methods operate intuitively, but we think we should start with the Vanilla Gradient method (Saliency Map), which illustrates rather beautifully the overall formula that many other methods follow.

 The Saliency map proposed by Simonyan et al. (2014) is a straightforward approach for representing input sensitivity for a given input sample by computing the gradient $(\partial\,\text{output} / \partial\,\text{input})$ with respect to the input image pixels. Among Vanilla (where the default modifier is 'none'), ReLU, and Guided saliency, the latter yielded the most promising results when experimenting with saliency strategies, as shown in Fig. 3.18. Note that the primary principle behind Rectified/Deconv saliency is to trim negative gradients in the backprop stage so that only positive gradient information indicating an increase in output is permitted (Zeiler and Fergus 2014). Guided saliency as depicted

Fig. 3.18 Guided Saliency demonstrates that high-contrast boundaries are spatially significant for the majority of target classes, especially the eyes of animal output classes

in Fig. 3.18 is adapted accordingly so that only positive gradients are propagated for positive activations (Springenberg et al. 2014). The results reveal that high-contrast boundaries are spatially important for most target classes, particularly the eyes for animal output classes. However, this routine alone does not give the model great interpretability; it is difficult to understand how the network decides whether the sample is a 'Japanese spaniel' or a 'tabby cat'. The strategy for the Saliency method is:

• First, run the input image through a forward pass.
• Second, calculate the gradient of the target class score with respect to the input pixels, setting the scores of the remaining classes to zero.
• Third, visualize the gradients. We have the option of displaying the absolute values or of differentiating between negative and positive contributions.

More formally, given an image I, a CNN assigns it a score $S_c(I)$ for class c. The image's score is a highly non-linear function of the image. The idea behind using the gradient is that a first-order Taylor expansion provides a close approximation of that score using the derivative w of the score, as shown in Eq. 3.24:

$S_c(I) \approx w^{T} I + b \qquad (3.24)$
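A minimal sketch of the Vanilla Gradient saliency computation just described, assuming TensorFlow/Keras and a pre-trained VGG16; the input image is a placeholder and the class choice (the top prediction) is an illustrative assumption.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
img = tf.convert_to_tensor(np.random.rand(1, 224, 224, 3).astype("float32"))  # placeholder

with tf.GradientTape() as tape:
    tape.watch(img)
    preds = model(img)
    class_idx = int(tf.argmax(preds[0]))         # target class: the top prediction
    score = preds[:, class_idx]                  # S_c(I); other classes are ignored

grads = tape.gradient(score, img)                # dS_c / dI, same shape as the image
saliency = tf.reduce_max(tf.abs(grads), axis=-1)[0].numpy()   # per-pixel magnitude
saliency /= saliency.max() + 1e-8                # normalize to [0, 1] for display
```

Taking the maximum absolute gradient across the color channels, as done here, is one common convention; keeping the signed values instead is the alternative discussed later under "Absolute value of gradients".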

When using Vanilla Gradient, the gradient is backpropagated up to layer n + 1, and wherever the activation in the layer below is negative, the gradient is simply set to zero.

 Highlight As described by Shrikumar et al. (2017), Vanilla Gradient suffers from a saturation problem. When ReLU is used and the activation falls below zero, the activation is capped at zero and no longer changes; the activation has reached its limit, i.e. it is saturated. For instance, suppose a layer receives input from two neurons, neuron1 and neuron2, each with a weight of 1 and a bias of 1. When passing through the ReLU layer, the activation will be neuron1 + neuron2 as long as the sum of the two neurons is less than 1. If the sum is greater than 1, the activation remains saturated at 1. In addition, the gradient at this point will be 0, and Vanilla Gradient will conclude that this neuron is unimportant.

Here is an experiment involving the rotation and flipping of images, together with the top-3 predictions for ImageNet class targets. Note that the interpretation of guided saliency in the second row, and v-grad-CAM for the user-chosen target class in the third and fourth rows, seems to explain something for Fig. 3.20, which depicts a tabby cat and a goldfish, and for Fig. 3.21, which depicts a real-world image of a pizza pan. Despite the fact that the CNN model is regarded as rotationally invariant, it shows appealing results

Fig. 3.19 Rotational augmentation guided saliency & v-grad visualization of a cat and dog in the garden

in distinguishing a 'German Shepherd' and a 'tabby cat' in Fig. 3.19, dog and cat being predominant target classes in ImageNet. However, the interpretation is not very consistent across the transitions for the other inputs, as seen in the figures. This leads us to ask whether the popular reliance on saliency requires closer inspection. Kindermans et al. (2019) also demonstrated that these pixel attribution techniques can be highly unreliable. They applied a constant shift to the input data, resulting in identical pixel alterations for every image, and considered DeepLift, Vanilla Gradient, and Integrated Gradients. They compared two networks: the original network and a "shifted" network in which the bias of the first layer was modified to compensate for the constant pixel shift. Both networks generate identical predictions, and the gradients are the same as well. However, the explanations differed, which is an undesirable characteristic.

 Fast gradient sign method: Goodfellow et al. (2014) created the fast gradient sign method for generating adversarial images. The approach finds adversarial cases using the gradient of the underlying model: each pixel of the original image x is altered by adding or subtracting a small error of magnitude $\epsilon$. Whether $\epsilon$ is added or subtracted depends on whether the sign of the gradient for that pixel is positive or negative. Adding errors in the direction of the gradient means that the image is intentionally manipulated so that the classification model fails. To implement the method described by Goodfellow and colleagues, it is necessary to modify a large number of individual pixels, even if only slightly.
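A minimal FGSM sketch following the description above, assuming TensorFlow/Keras, a pre-trained VGG16, a placeholder input, and an illustrative $\epsilon$ and label.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

x = tf.convert_to_tensor(np.random.rand(1, 224, 224, 3).astype("float32"))  # placeholder
y_true = tf.constant([281])                      # hypothetical true label index
epsilon = 0.01                                   # perturbation magnitude

with tf.GradientTape() as tape:
    tape.watch(x)
    preds = model(x)
    loss = loss_fn(y_true, preds)                # loss of the true class

grad = tape.gradient(loss, x)
x_adv = x + epsilon * tf.sign(grad)              # move each pixel by +/- epsilon
x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)        # keep a valid image range
```

Because the loss is increased along the sign of the gradient, every pixel moves by exactly $\epsilon$, which keeps the perturbation visually imperceptible while degrading the prediction.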

Fig. 3.20 Feature visualization of a cat & fish in an aquarium

The problem arises, though: what if we only have access to perturb a single pixel? How well do you think you would be able to fool a DL model? Interestingly, image classifiers can be tricked by changing a single pixel, as demonstrated by Su et al. (2019). Figure 3.22 provides an illustration. The 1-pixel attack is a variant of the counterfactual that seeks a modified example x' that resembles the original image x but alters the prediction in an unfavourable way. In order to determine which pixel should be modified and how, the 1-pixel attack employs differential evolution, an algorithm inspired by the gradual diversification of biological species. Generation after generation, a population of candidate solutions recombines until a solution is reached. Each candidate solution is a five-element vector consisting of the pixel's x and y coordinates and its red, green, and blue (RGB) values, encoding a change to one pixel. The search begins with, say, 256 candidate solutions (pixel-modification proposals), and each of these is used to generate a new generation of candidate solutions (children) using the formula in Eq. 3.25:

$x_i(t + 1) = x_{r1}(t) + F \cdot \left( x_{r2}(t) - x_{r3}(t) \right) \qquad (3.25)$

Fig. 3.21 Feature visualization of pan pizza on a table

Fig. 3.22 Illustration to show that a NN trained on ImageNet can be tricked into making erroneous predictions by manipulating just one pixel

where t is the current generation, F is a scaling parameter (set to 0.5), r1, r2, and r3 are independent random indices, and each $x_i$ represents an element of a candidate solution (x-coordinate, y-coordinate, red, green, or blue). Each new child candidate combines three randomly selected parent candidates to create its five properties for position and color. Child generation stops if a candidate solution is an adversarial example, meaning it is incorrectly categorized, or if the user-defined maximum number of iterations is reached.

 Highlight Pixel-based attributions can be difficult to read and interpret at times. Positive and negative attributions may be mixed in with salient pixels scattered around the image, and the accuracy of pixel attribution algorithms varies widely. Small (adversarial) changes to an image can result in quite different pixels being highlighted as explanations, as demonstrated by Ghorbani et al. (2019). In contrast to pixel-level methods, a region-based attribution method, XRAI (Kapishnikov et al. 2019), recognizes salient regions. The technique employs pixel-level attribution methods such as Integrated Gradients or Guided IG, together with segmentation maps, and identifies relevant locations by summarizing attributions over segments. Felzenszwalb's method (Felzenszwalb and Huttenlocher 2004) is used to compute the image segments.

 Layer-wise Relevance Propagation (LRP) presents the contributions of individual pixels to the predictions of kernel-based classifiers over Bag-of-Words features and of multilayered NNs. A human expert is given access to these pixel contributions in the form of heatmaps so that they can intuitively determine whether the classification conclusion is justified and where additional investigation is necessary.

 Smoothgrad obtains an appropriate mapping of the channel values in a pixel to a specific color, which has a significant impact on the overall impression of the visualization.

 Absolute value of gradients: Algorithms for generating sensitivity maps frequently output signed values, and determining the correct color representation of a given signed value is itself a challenge; poorly chosen mappings can leave the map dominated by indistinct gray areas. An important decision is whether to show only the absolute value or to differentiate between positive and negative values. Whether taking the absolute value of the gradients is beneficial depends on the nature of the dataset being analyzed. For instance, in the MNIST digits dataset (LeCun and Cortes 2010), where each digit is white regardless of class, positive gradients suggest a strong signal for that class. On the other hand, for the ImageNet dataset (Russakovsky et al. 2015), using the absolute value of the gradient led to clearer images. Many image recognition tasks are invariant to color and illumination changes, so this may indicate that the gradient's direction depends on context. To identify a ball,

for instance, a black ball against a light background would have a negative gradient, while a white ball against a darker background would have a positive gradient.

 Capping outlying values: Another property of the gradient that we notice is the presence of a few pixels with substantially higher gradients than the average. This is not a recent discovery; it was used to generate adversarial examples that are undetectable by humans (Szegedy et al. 2013). These extreme values can make the color gradations unreliable, and capping the outliers greatly improves the visual coherence of the maps. This idea is described in detail by Sundararajan et al. (2017). If this post-processing step is skipped, the resulting maps can end up almost entirely black.
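A minimal NumPy sketch of the outlier-capping step described above; the 99th-percentile threshold and the synthetic heavy-tailed map are illustrative choices.

```python
import numpy as np

def cap_and_normalize(sensitivity_map, percentile=99):
    """Clip extreme attribution values and rescale the map to [0, 1] for display.

    A handful of pixels with very large gradients would otherwise dominate the
    color scale and render the rest of the map nearly uniform.
    """
    m = np.abs(sensitivity_map)
    vmax = np.percentile(m, percentile)          # cap at the chosen percentile
    m = np.clip(m, 0.0, vmax)
    return m / (vmax + 1e-8)

# Hypothetical usage on a (H, W) gradient map produced by any saliency method.
gradient_map = np.random.randn(224, 224) * np.random.pareto(3.0, size=(224, 224))
display_map = cap_and_normalize(gradient_map)
```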

3.2.5 Network Understanding

Classical ML models emphasized learning the output prediction from key attributes and features. Many interpretability techniques were developed to represent such a model in the form of a feature correlation matrix and attribute mappings in order to understand its behaviour and learning. With the advancement of computational resources, this classical approach now performs far behind current SOTA practice. Unlike classical models such as SVMs, for which raw pixels are not the best input when training an image classifier, CNNs are fed an image in its raw form (pixels). DNNs learn high-level features in their hidden layers, which is one of their greatest strengths, reducing the need for domain knowledge to hand-craft features from raw data. The image or text is passed through multiple transformations and convolutional blocks, and the weights and biases are learned through backpropagation, whereas SVM pipelines design new features based on color representations, edge detectors, and pixel frequency domains. The deeper the CNN layer, the more complex the features encoded in it. This transformed representation of the image passes through fully connected layers and a non-linear activation mapping to the output for classification or prediction.

 Highlight The XAI community has recommended several model-agnostic techniques, such as local or partial dependence plots. Nonetheless, there are three motives why it makes sense to consider explainability methods developed specifically for NNs: 1. NNs learn features and concepts in their hidden layers and demand specialized tools to uncover them. 2. Gradients can be used to achieve interpretation schemes like looking at the model ‘from inside’ that are more computationally effective than model-agnostic methods. 3. Most other known methods are intended to interpret models for tabular data. Image and text data requires distinct methods.

3.2.5.1 Convolutional Network

CNNs are designed to work with image data, and their structure and function suggest that they should be less inscrutable than other types of NNs. Specifically, these models are composed of small linear filters, and the results of applying these filters are called activation maps or, more generally, feature maps. To better understand the abstract encoding of CNNs in feature learning, it is critical to understand the basic blocks of a NN. We will then look at various suggestive measures developed in recent times that interpret the network architecture's behaviour for a given input mapped to its prediction. Figure 3.23 showcases the different ways of interpreting CNN units: (A) the convolutional neuron, the basic atom of the architecture; (B) convolutional channel interpretation; (C) the convolutional layer, giving a sense of the abstract encoding of the features as a whole; (D) a neuron; (E) hidden layers with learned weights and biases; (F) class probability neurons (or the corresponding pre-activation neurons). In brief, the basic blocks of CNNs are divided into four parts, described by LeCun et al. (1989):

• Conv. layer: $x_{l'}^{(l+1)}(\alpha) = \xi\big(\sum_{l=1}^{d(l)} (w_{l,l'}^{(l+1)} \ast x_{l}^{(l)})(\alpha)\big)$
• Activation: $\xi(x) = \max\{x, 0\}$ (rectified linear unit, ReLU)
• Pooling: $x_{l}^{(l+1)}(\alpha) = \big\| \{\, x_{l}^{(l)}(\alpha') : \alpha' \in \mathcal{N}(\alpha) \,\} \big\|_{\rho}$ with $\rho = 1, 2,$ or $\infty$
• Parameters: filters $W^{(1)}, \ldots, W^{(L)}$
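A minimal sketch of one such basic block, written with PyTorch under the assumption that such a framework is available; the layer sizes are arbitrary and only meant to mirror the convolution, ReLU activation, and pooling operators listed above.

```python
import torch
import torch.nn as nn

# One "basic block": convolution filters W, ReLU non-linearity xi(x) = max{x, 0},
# and a pooling operator over the neighbourhood N(alpha) (here max pooling, rho = inf).
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # filters W^(l)
    nn.ReLU(),                                                            # activation xi
    nn.MaxPool2d(kernel_size=2),                                          # pooling, rho = inf
)

x = torch.randn(1, 3, 32, 32)          # a dummy RGB image
feature_maps = block(x)                # activation maps of this layer
print(feature_maps.shape)              # torch.Size([1, 16, 16, 16])
```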

 Highlight The key properties of CNNs are the following:
1. Convolution filters impart translation invariance plus a self-similarity property. Refer to Appendix A.11 for a mathematical description.
2. Multiple layers impart compositionality similar to MLPs.
3. Locality, through filter localization in space.
4. O(1) parameters per filter, independent of the size of the input image.
5. O(n) complexity per layer, since filtering is done in the spatial domain.
6. O(log n) layers in classification tasks.

The modern CNN has found its way beyond computer vision tasks. It has emerged as a credible competitor to RNNs for one-dimensionally structured data such as time-series analysis and audio and text labelling (Fawaz et al. 2019; Cui et al. 2016; Hershey et al. 2017). We can also witness clever adaptations of CNNs to recommendation systems and graph-structured data, such as the recent application of graph R-CNN for scene graph generation (Yang et al. 2018). The CNN's success could be summarized as follows:

• Efficient in achieving more accurate models.
• Easy to parallelize convolutions across GPU cores.
• The pooling layer aggregates spatial information of the data.
• Handles multiple channels at each layer of the architecture.
• I/O channels capture multiple spatial aspects of the image location.

Fig. 3.23 Feature visualization for different units of CNN


Fig. 3.24 Schematic block diagram of a general CNN network

• Performs cross-correlation between the kernel and 2-D inputs, then adds a bias.
• Fewer parameters than a fully connected network make it computation-effective.
• Network depth can be decided based on the feature maps detected by the receptive fields.

Consider the basic network blocks of a CNN in Fig. 3.24. The first line of investigation involves visualizing a new input that is generated based on the current input in order to comprehend the representation learned by DNNs. It is primarily intended as an instance-wise explanation and applies to CNNs that take images as inputs. CNNs typically comprise multiple convolution and pooling layers that help the model extract relevant features from visual data such as images automatically. Due to this multi-layered architecture, CNNs learn a robust hierarchy of features that are spatial-, rotation-, and translation-invariant. Therefore, the interpretability of the knowledge encoded in a CNN can be treated at different levels:
• Kernel level, i.e. the knowledge or features of the knowledge model encoded in the kernels of a given layer.
• Hierarchy of features, as a tree or hierarchy of knowledge features at different levels.
• Before the fully connected layer or at the end of a collection of layers, for example through class activation maps.

The objective of this section is to demystify the structure of the basic NN discussed in Chap. 2 and illustrate the subfields of network learning where interpretability methods can be applied. The CNN blocks in Fig. 3.24, for instance, have undergone a remarkable deal of interpretability work, such as the implementation of an interpretation module in the input block (a) or the output block (d). Some have attempted to dig around in the knowledge representation (c) in an effort to figure out what the model has encoded before making a prediction, while others have tried different means, such as visualizing filters, channels, and layers, or network dissection, to understand how the convolution encoder block (b) operates.
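To make the kernel-level and channel-level views concrete, the following sketch (assuming PyTorch; the toy model and hook names are hypothetical) records activation maps of two convolution layers with forward hooks and reads out the first-layer filters, which are the raw material for filter and channel visualization.

```python
import torch
import torch.nn as nn

# A toy CNN standing in for block (b) of Fig. 3.24; any torchvision model works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()     # knowledge representation at this unit
    return hook

model[0].register_forward_hook(save_activation("conv1"))
model[2].register_forward_hook(save_activation("conv2"))

x = torch.randn(1, 3, 64, 64)
_ = model(x)

# (A)/(B)-level views: individual kernels and per-channel activation maps.
first_layer_filters = model[0].weight.detach()   # shape (8, 3, 3, 3)
print(first_layer_filters.shape)
print({k: v.shape for k, v in activations.items()})
```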

3.2.5.2 Recurrent Network

RNNs, like CNNs in the visual domain, have recently seen widespread use for predicting problems defined over essentially sequential data, with particular emphasis in areas such as NLP and time series analysis. Long-term dependencies in these data are difficult to capture using a machine learning model. By treating the neuron’s ability to remember information as a parameter that can be learned from data, RNNs are able to extract time-dependent associations. Very little research has been reported to provide an explanation of RNN models. This research may be broken down into two categories: (i) attempts to explain RNN models’ learning (often using feature relevance approaches) and (ii) attempts to alter RNN architectures in order to provide insight into the choices they make (local explanations). In the first group, Arras et al. (2017) extend the use of LRP to RNNs. They propose a specific propagation rule that works with multiplicative connections such as those in LSTM units and GRUs. Karpathy et al. (2015) propose a visualization technique based on finite horizon n-grams that discriminates interpretable cells within the LSTM and GRU networks. Following the premise of not altering the architecture, Che et al. (2015) extend the interpretable mimic learning distillation method used for CNN models to LSTM networks, so that interpretable features are learned by fitting Gradient Boosting Trees to the trained LSTM network under focus. Aside from the approaches that do not change the inner workings of the RNNs, Choi et al. (2016) present their RETAIN (REverse Time AttentIoN) model, which detects influential past patterns by means of a two-level neural attention model. To create an interpretable RNN, Wisdom et al. (2016) propose an RNN based on the Sequential Iterative Soft-Thresholding Algorithm (SISTA) that models a sequence of correlated observations with a sequence of sparse latent vectors, making its weights interpretable as the parameters of a principled statistical model. Finally, Krakovna and Doshi-Velez (2016) construct a combination of a Hidden Markov Model (HMM) and an RNN, so that the overall model approach harnesses the interpretability of the HMM and the accuracy of the RNN model. A generalized RNN network block is depicted in Fig. 3.25, where three different strategies trigger distinct interpretation algorithms, namely: 1. A post-hoc strategy of understanding the model’s training by the application of perturbations to the input block (a) and testing their effects on the prediction block (d). 2. Local perturbation of a block of recurrent units (b) using techniques like layer activation, channel visualization, neuron dropout influence, and gradient-based feature mapping. 3. By analyzing and assessing the knowledge representation block (c), we may deduce the hidden information the model has learned after being trained.
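A minimal sketch of the first, post-hoc strategy listed above: occlude one time step of the input at a time and measure the effect on the prediction. The GRU classifier here is an untrained stand-in for any trained RNN, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_features=8, hidden=32, n_classes=2):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, features)
        _, h = self.rnn(x)
        return self.head(h[-1])

model = GRUClassifier().eval()
x = torch.randn(1, 20, 8)                 # one sequence of 20 time steps

with torch.no_grad():
    base = torch.softmax(model(x), dim=-1)
    target = base.argmax(dim=-1)
    importance = []
    for t in range(x.shape[1]):
        occluded = x.clone()
        occluded[:, t, :] = 0.0           # perturb one block of recurrent input
        prob = torch.softmax(model(occluded), dim=-1)[0, target]
        importance.append((base[0, target] - prob).item())

print(importance)                         # a large drop marks an influential time step
```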


Fig. 3.25 Schematic block diagram of general RNN

3.2.5.3 Autoencoder-Decoder Network

The representative large-scale design of AE networks is shown in Fig. 3.26. Both an encoder and a decoder are part of its structure, as the name indicates. There is also a crucial intermediate feature space, the latent features, which is part of this system. Explaining the information in the encoder (b) and the decoder (d) separately might be useful when attempting to explain or interpret knowledge at the network level. A post-hoc interpretability method may be the quickest and easiest answer, even if it misrepresents or omits information; for example, one may use an independent solution for decoder interpretability and a post-hoc one for the encoder. While ad-hoc solutions are not the norm just yet, we may build them at every stage of the AE network, from the unit level (like a recurrent unit), to the layer level (how a layer interprets its learning), to the backbone level (how the network as a whole is interpreted), a backbone being a collection of layers that perform collective learning and are preceded or followed by an architecturally different layer. Furthermore, in architectures that use a symmetrical design of encoder and decoder, interpretability may also include correspondence or even correlation between the knowledge encoded in the symmetric pairs of layers. For such symmetric pairings of learning units, interpretability may be learned collectively by exploiting the symmetry.

One alternative to analyzing the functional blocks of an AE network is to read only the latent feature space (c), without attempting to understand the mapping itself or the features it highlights for encoding. The focus is instead on how the latent feature space displays the encoded information, such as where an input (a) sits in the latent feature space and where another image lies that is very similar to the previous input image but differs in one aspect. This method not only elucidates the mapping from the input space to the latent space but also describes which latent space features stand in for which differences between otherwise identical pictures. The connection between the latent space (c) and the output space (e) can be treated quite similarly. One method that works well for this goal is the use of visual explanations.
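One simple way to "read" the latent feature space (c) is a latent traversal: encode an input, move along one latent coordinate, and decode. The sketch below uses a toy, untrained autoencoder purely for illustration; in practice the model would be trained and the traversed dimension chosen deliberately.

```python
import torch
import torch.nn as nn

# Minimal autoencoder; the encoder/decoder stand in for blocks (b) and (d) of Fig. 3.26.
latent_dim = 8
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

x = torch.rand(1, 1, 28, 28)              # a dummy input image
z = encoder(x)                            # where the input sits in latent space (c)

# Traverse one latent coordinate while keeping the others fixed: the decoded images
# show which visual factor that coordinate stands in for.
traversal = []
for shift in torch.linspace(-3, 3, steps=7):
    z_shifted = z.clone()
    z_shifted[0, 2] = z[0, 2] + shift     # dimension 2 chosen arbitrarily for illustration
    traversal.append(decoder(z_shifted).view(28, 28).detach())

print(len(traversal), traversal[0].shape)
```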


Fig. 3.26 Schematic block diagram of general autoencoder-decoder network

3.2.5.4 Generative Adversarial Network

GANs are two-block networks with a generator block (b) that creates output and a discriminator that analyses the generator’s output for quality. Figure 3.27 depicts their overall architecture. The concept is that the generator is meant to generate synthetic data in the output space that is as realistic as possible, while the discriminator is designed to distinguish between real data and synthetic data, no matter how realistic. As a result, the generator’s goal is to mislead the discriminator, whereas the discriminator’s goal is not to be misled. Once trained, such a system means that the generator has the capacity to create high-quality pseudo-real data, resulting in a very tiny discriminator loss. There is a common denominator between the generator and the discriminator, despite their divergent goals: both learn ‘what comprises real data,’ with the generator attempting to simulate these components and the discriminator distinguishing them from realistic synthetic data. Therefore, the following methods can be used to gain an understanding of GAN at the network level:

Fig. 3.27 Schematic block diagram of general GAN


1. A posteriori analysis at the block level (b). Compared to the AE network, this calls for a derivation of an interpretability model post-hoc, after training, treating each block independently. 2. Ad-hoc decoding at the block level. Incorporating interpretability into the learning process is most typically performed at one or more levels of network architecture inside a functional block (such as learning unit or layers, or collection of layers). 3. Learning of the common denominator, in this case the answer to the question ‘what comprises real data’. In other words, what are the core features that determine the generator’s and the discriminator’s output, (c) and (f) respectively. It is indirectly probing an equivalent of the latent feature space in the AE architecture. However, deriving an interpretability solution for this question is not as straightforward. In fact, it may require that the architecture of the generator and the discriminator have some symmetry in design. For example, the last N learnable layers of the generator are similar to the first N learnable layers that the real image encounters in the discriminator. Then, considering the relationships between the layers in these pairs, interpretability can be inferred. The preceding description refers to the simple design depicted in Fig. 3.27. There are additional modifications of the generative learning architecture, some of which are shown in Fig. 4.18. However, the concepts of network level comprehension remain very similar. In order to interpret the knowledge that supports the functional activity of these blocks, block-wise interpretability can be handled either post-hoc or ante-hoc. Alternatively, the interpretability emphasis can be chosen as the common denominators or places of knowledge sharing among the blocks.
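A common informal probe of what a generator has encoded is latent interpolation between two codes: smooth and semantically meaningful transitions hint at a well-organized latent space shared, in the sense above, between generator and discriminator. The sketch below uses a toy, untrained generator as a stand-in; it is an illustration, not a specific published GAN.

```python
import torch
import torch.nn as nn

# A toy generator standing in for block (b) of Fig. 3.27.
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())

z1, z2 = torch.randn(1, 64), torch.randn(1, 64)

# Walk the line between two latent codes and decode each point; inspecting the
# resulting images is a simple network-level probe of the generator's knowledge.
samples = []
for alpha in torch.linspace(0.0, 1.0, steps=8):
    z = (1 - alpha) * z1 + alpha * z2
    samples.append(generator(z).view(28, 28).detach())

print(len(samples), samples[0].shape)
```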

3.2.5.5 Graph Network

GNNs have several advantages over ordinary NNs, some of which are discussed here. Graph structures allow self-loops, multiple inputs, and several output ports for each node or edge, with the interconnecting edges carrying different weights. There are also GNN variants such as graph convolutional networks, graph attention networks, and graph residual networks. Figure 3.28 depicts their general schematic. GNNs can be trained on any dataset that contains both input data and pairwise item associations. GNNs offer a significant benefit over ordinary DL in that they can capture the graph structure of the data, which is typically highly rich. GNNs can be used to classify data or predict outcomes. However, GNNs do not perform as well in tasks such as regression, where the output is a real-valued number rather than a discrete value (such as cat/dog). GNNs have a much smaller memory footprint than normal DL models, since they only need to store information about node connections rather than all neurons in the graph. Even with tiny datasets, GNNs are simple to train. There are two main types of GNN encoders (b): feed-forward GNNs and graph recurrent networks.


1. In a feed-forward GNN, input data is transmitted through a graph of neurons to produce output by applying the transfer function at each node’s edge weights. The following are the steps involved in this type: • Feed-forward propagation should be used on graph nodes that have input. • To spread the output of graph nodes, use the graph transfer function on the graph edges. • Graph weights based on output gradients are used to backpropagate. 2. Graph recurrent networks are similar in that they also have a graph structure. Unlike feed-forward graphs, which transmit data in only one direction, recurrent network graphs are bi-directional, with data flowing in both directions. This type involves the following steps: • Use standard graph propagation on graph nodes that have input. • Apply the graph transfer function to the graph edges to transport neuron output back to the graph nodes while changing edge weights based on previously applied gradient modifications to that node or edge nodes. • Graph weights based on output gradients are used to backpropagate. In contrast to feed-forward GNNs, the preceding layer has no influence on the present state.
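The "graph transfer function" of a feed-forward GNN can be sketched in a few lines in the spirit of graph convolutional networks: nodes aggregate neighbour features through a normalized adjacency matrix and then apply a shared weight matrix. The small graph and random features below are illustrative only.

```python
import numpy as np

# One graph-convolution step: every node aggregates its neighbours' features through
# the self-loop-augmented, degree-normalised adjacency matrix, then applies a shared
# weight matrix, so that node and edge structure both shape the transfer function.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)       # toy 4-node graph
X = np.random.randn(4, 5)                       # node features
W = np.random.randn(5, 3)                       # learnable weights shared across nodes

A_hat = A + np.eye(4)                           # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt        # symmetric normalisation

H = np.maximum(A_norm @ X @ W, 0.0)             # propagate + ReLU
print(H.shape)                                  # (4, 3): new node embeddings
```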

Fig. 3.28 Schematic block diagram of general GNN

 Highlight How are GNNs different from standard ANNs? GNNs are similar to standard NNs in that the data flows through a graph of neurons in an iterative fashion and each edge weight can be modified based on input examples for that node or neuron. What is different in GNNs is the graph transfer function: weights sit not only between neurons but also on the edges and the nodes of the graph. This is very helpful in cases of overlapping data or missing data points with different associated values, since those values can be filled in from other connected nodes and data points through the graph transfer function, and GNNs learn this graph structure. GNNs also differ in graph execution: a GNN goes through the graph one node at a time, while a standard DNN goes through all neurons before moving on to the next data point. The key difference between GNNs and standard DL models is that a GNN has its own set of parameters for each graph node, graph edge, and data point. GNNs use the graph structure to learn about the graph nodes in the dataset with different sets of parameters, which is very different from standard DNNs, where neurons are just linear functions with real-valued weights on their inputs. This makes the graph transfer function more complex than the transfer function of a regular DL model, since graph neural networks have multiple graph nodes and graph edges for a single data point.

Nevertheless, learnable units of different scales can also be identified for GNNs in the manner we discussed for CNNs and RNNs. For example, each graph convolution kernel or graph recurrent unit is a very small-scale learning unit, and a collection of them in a layer is another functional unit. In this sense, there is a direct analogy between the interpretability approaches used for all the architectures discussed previously. However, graph-based networks provide an additional opportunity for explainability as well as insight into knowledge encoding: the weights of the links in the graph model. After learning, the chain of prominent links can be identified as those that represent the most primary or statistically significant knowledge. Instead of deriving interpretability across learnable units, layers, or the entire network, interpretability may be derived in an end-to-end manner, however with selected links.


 Highlight Yosinski et al. (2015) overlaid a NN onto a relational graph, and then used extensive tests to investigate whether or not there was a correlation between the two features. They found that the clustering coefficient and the average path length were both related to a network’s prediction effectiveness.

3.3 Design and Analysis of Interpretability

This section is an overview that aims to bring together the many diverse lines of IDL research that have attracted substantial attention in recent years across a variety of subfields and settings. Different works use different criteria, all of which are justified in some way. While it is difficult to cover all IDL-related research activity, we make every attempt to present it in a novel way that most individuals in the computer science discipline can understand. As a result, we develop a unified conceptual framework for describing the various approaches and identify important conceptual differences among them. We take a fresh look at IDL through the lens of a well-known topic: Design and Analysis of Algorithms (DAA). This endeavor to draw parallels between the two fields of IDL and DAA is referred to as "Design and Analysis of Interpretability." Interpretable DL techniques are grouped into six major subsections below, each of which briefly defines the terminology and the suitability of approaches to fall into that category. This will aid us in understanding the motivation, intuition, and reasoning underlying the creation of similar procedures in the near future.

3.3.1 Divide and Conquer

In the divide and conquer interpretability approach, the black-box model problem is divided into smaller sub-problems, and each sub-problem is then solved independently. Sometimes this division is only conceptual. The solutions of all sub-problems are finally merged, often through recursive calls, to obtain the solution of the original problem. This is one of the most widely used strategies in programming. Note that in certain scenarios, as computation grows, only the best of the sub-problem solutions are kept, which captures the essence of collaborative filtering.

 Highlight Divide and conquer propagates through the network, breaking the interpretation of the model's performance into local, fundamental or instance-level understanding, and finally returns to a global or general understanding of the model. The approach can be model-specific or model-agnostic, but the general idea is the decomposability of the problem into functional units, analogous to merge sort or Strassen's matrix multiplication.

Case 1: Case-Based Reasoning Generating Useful Explanations
The idea is to look at Case-Based Reasoning (CBR) techniques for generating a sound interpretation of DNNs from the divide and conquer perspective. One may argue that CBR is better suited to be considered a backtracking problem, given its standard process of searching for similar cases among previously solved problems based on a given target problem and its attributes, and then adapting the retrieved solution to the new case, as described by Aamodt and Plaza (1994). If the adapted solution fails to solve the target problem, the process is expected to revise it and attempt a re-evaluation by other means. The floor is open for discussion, and the proposed analogies can be understood from different standpoints.

 Highlight The key hypothesis of the concept is to utilize prior experiences (similar past cases), domain knowledge, and context to bring forth a new dimension to the models, generating more accurate interpretations for the users. The principle is broadly based on how humans crack problems: solving new problems with past experiences gathered under similar circumstances.

In the context of XAI, the interpretations generated by CBR are sometimes referred to as CBE (Case-Based Explanation). The evidence of dividing the black-box problem and generating model interpretability by combining the pieces of knowledge is apparent in most CBR techniques. For example, a deep neural architecture dissects the image by finding prototypical parts related to the image classification and uses previous evidence for the prototypes to make a prediction (Chen et al. 2019). The architecture is built using a sequence of convolution layers, a prototype layer, and a fully connected layer, followed by the output logit layer. Training of this network is done in separate stages that target individual layers: the network as a whole, followed by training of the prototype layer, followed by an optimization of the last layer. This prototype classification approach is not new, and prior attempts have achieved similar results. In 2018, Li et al. (2018) built CBR into the network with a unique encoding layer to find the best prototypes automatically. These prototypes (cases) are compared to new encoded input instances, where the most probable prototype gives the corresponding prediction. Such networks are a form of prototype classifier, where


observations are classified based on their proximity to a prototype observation within the dataset. For instance, in our handwritten digit example, we can determine that an observation was classified as a 3 because the network thinks it looks like a particular prototypical 3 within the training set. If the prediction is uncertain, the network identifies prototypes similar to the observation from different classes; e.g., 4 is often hard to distinguish from 9, so we would expect to see prototypes of classes 4 and 9 identified when the network is asked to classify an image of a 9. Cunningham (2008) points to two possible approaches in CBE:
1. Knowledge-light CBE, which bases its explanations on just the similarity measures performed during retrieval.
2. Knowledge-intensive CBE, which includes rule-based methods that can be used to generate explanations. These leverage similarity measures by expressing interpretations in terms of causal interactions.

The explanation generation part is shown in Fig. 3.29. The reconstructed input, which lives in the same space as the encoded inputs, can be used to visualize the learned prototypes during training and to partially trace the path of a new classification task via the activation weights to each prototype. A related line of work incorporates the generation of a rationale (a set of reasons or a logical basis for a course of action or belief) as an integral part of the overall learning process by combining two modular components: a generator that specifies a distribution over possible rationales, and an encoder that uses these rationales to map to task-specific target values. In that setting, the rationale is simply the specific sequence of words that justifies the classification value.

 Highlight This is an NN interpretation that picks out parts of the input and focuses on them for the respective task. We discuss CBR here instead of typical extractive reasoning: CBR explains a model's predictions based on similarity to prototypical cases, rather than highlighting the most relevant parts of the input. Note that the approach does not provide a full solution to problems of accountability and transparency of black-box decisions, but it does allow us to partially trace the path of classification for a new observation.

Case 2: Pointwise Localization with the Class Peak Response Map
The inefficiency of pixel-level annotation, together with the evidence that strong visual signals reside inside each instance and can be observed as local maxima, i.e., peaks, in a Class Response Map (CRM), inspired Zhou et al. (2018) to train CNNs with image-level weak supervision for instance-level semantic segmentation. We shall explore the procedure of exploiting the Peak Response Map (PRM) from the perspective of the divide and conquer strategy and comprehend the mechanism in this light.
1. Divide: Image to CRMs. During network training, the image is stimulated so that peaks emerge in the CRMs. Zhou et al. (2018) suggested the use of a FC-NN


Fig. 3.29 The architecture of CBR with explanations. a A clay-colored sparrow image and the learnt archetypal parts of a clay-colored sparrow used to identify the bird's species. The explanation is a comparison of prototypes from similar circumstances. b The modified architecture that was used to accomplish these results. Figure reproduced from Chen et al. (2019) with permission

Table 3.2 Notational summary of CRM stimulation from an image representation
• M ∈ R^{C×H×W}: CRMs of the top convolutional layer, with C the number of classes and H × W the spatial size of the input
• s ∈ R^C: class-wise confidence scores output by the network
• M^c: the cth response map, whose local peaks are taken within a region of radius r
• P^c: peak locations {(i_1, j_1), ..., (i_{N^c}, j_{N^c})}, with N^c the number of valid peaks for the cth class
• G^c ∈ R^{H×W}: sampling kernel generated for evaluating the confidence score of the cth class

by simply removing the global pooling layer and replacing the fully connected layers with 1× 1 convolution layers. This is done to output the CRMs with a single forward pass, preserving the spatial information and classification confidence at each image location (Oquab et al. 2015). For a standard network, the following notions in Table 3.2 are considered for the stimulus of peak CRM.


Each kernel value at location (x, y), with 0 ≤ x ≤ H and 0 ≤ y ≤ W, can be computed using Eq. 3.26. The authors used the Dirac delta function as the sampling function f, so as to aggregate only peak features; here (i_k, j_k) is the coordinate of the kth peak.

G^c_{x,y} = \sum_{k=1}^{N^c} f(x - i_k, \, y - j_k)    (3.26)

Therefore, the score s^c is computed as the convolution between the CRM M^c and the sampling kernel G^c, as in Eq. 3.27:

s^c = M^c \ast G^c = \frac{1}{N^c} \sum_{k=1}^{N^c} M^c_{i_k, j_k}    (3.27)

From Eq. 3.27 it is evident that the network uses only the peaks to make the final decision. Therefore, the gradient δ^c for the cth channel of the top convolutional layer, with classification loss L, is apportioned by G^c to all the peak locations:

δ^c = \frac{1}{N^c} \cdot \frac{\partial L}{\partial s^c} \cdot G^c    (3.28)

Equation 3.28 contrasts with the generation of CRMs from dense sampling of receptive fields (RFs), most of which are negative samples that do not contain valid instances. The equation suggests learning on sparse sets of RFs estimated by class peak responses, preventing easy negatives from overwhelming the learned representations during training, in contrast to the unconditional learning of conventional networks, which suffers from the extreme background-foreground imbalance.

2. Conquer: CRMs to PRMs. During inference, shown in Fig. 3.31, the emerged peaks are probabilistically backpropagated and effectively mapped to highly informative regions of each object instance, generating fine-detailed and instance-aware representations, i.e., PRMs. Here, we look for the most relevant neurons (instance-aware visual cues) of an output category at specific spatial locations in order to generate class-aware attention maps. The idea of peak backpropagation is to start from the top layer and randomly walk towards the bottom layer (see Fig. 3.30 for an example), as formulated by the probability of relevance.

 Highlight The peak backpropagation process is used as the conquer step to piece together the predicted instance masks. Emerged peaks are backpropagated to generate maps that highlight informative regions for each object, referred to as Peak Response Maps (PRMs). PRMs provide a fine-detailed, separate representation for each instance, which is further exploited to retrieve instance masks from object segment proposals.
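The "divide" step above boils down to locating local maxima of a class response map within a radius r. A hedged sketch of such peak finding (assuming SciPy is available; the threshold, radius, and map below are arbitrary stand-ins) could look like this.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(crm, radius=3, threshold=0.0):
    """Return (row, col) locations that are local maxima of a class response map
    within a (2*radius+1) window, i.e. the peaks that stimulate the PRM step."""
    local_max = maximum_filter(crm, size=2 * radius + 1, mode="constant")
    mask = (crm == local_max) & (crm > threshold)
    return list(zip(*np.nonzero(mask)))

crm = np.random.rand(14, 14)          # a dummy class response map
peaks = find_peaks(crm, radius=3, threshold=0.5)
print(peaks)
```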


For understanding, consider a convolution layer with a single filter W ∈ R^{k_H × k_W}, with U and V as input and output feature maps, respectively. The visiting probability P(U_{ij}) at location (i, j) can be computed from P(V_{pq}) as in Eq. 3.29:

P(U_{ij}) = \sum_{p=i-k_H/2}^{i+k_H/2} \; \sum_{q=j-k_W/2}^{j+k_W/2} P(U_{ij} \mid V_{pq}) \times P(V_{pq})    (3.29)

where P(U_{ij} \mid V_{pq}) = Z_{pq} \times \hat{U}_{ij} \times W^{+}_{(i-p)(j-q)} is the transition probability, with \hat{U}_{ij} the bottom-up activation, W^{+} = ReLU(W) discarding the negative connections (adopting the ReLU used in most modern CNNs), and Z_{pq} the normalization factor ensuring that the transition probabilities sum to 1. The PRM finally retrieved from the model's CRMs (the sub-problems) conveys the interpretability of the important visual cues in the input image as a whole (the original problem) within the model's learning, as shown in Fig. 3.31. The technique is compatible with any modern network architecture and can be trained using standard classification settings, e.g., image class labels and cross-entropy loss, with negligible computational overhead.

Rule extraction techniques are among the most recognized approaches that operate at the neuron level rather than on the whole model, for example decompositional rule extraction techniques (Özbakır et al. 2010). Continuous/discrete Rule Extractor via Decision tree Induction (CRED) (Sato and Tsukimoto 2001) is an extension of the decompositional rule extraction algorithm to more than one hidden layer and uses decision trees to describe the extracted rules.

Fig. 3.30 Peak back-propagation maps class peak responses to fine visual signals inside each object, i.e., PRMs, allowing instance-level masks to be derived. Figure reproduced from Zhou et al. (2018) with permission


Fig. 3.31 Peak Response Maps (PRMs). A stimulation process activates each object's strong visual cues into class peak responses. Back-propagation collects detailed information from the peaks. Instance masks are predicted using class-aware cues, instance-aware cues, and object priors from proposals. Best viewed in color. Figure reproduced from Zhou et al. (2018) with permission

 Highlight Rule extraction is done step-wise, one layer at a time: one layer is used to explain the next. Consequently, one is left with a rule set that describes each layer of the DNN in terms of its preceding layer. These rule sets are then combined to mimic the whole network, together with input-pruning reverse-engineering strategies (Augasta and Kathirvalavakumar 2012), to achieve more comprehensible and generalized rules.

Deep neural network Rule Extraction via Decision tree induction (DeepRED) (Zilke et al. 2016) is one well-known such approach, extending CRED. It adds decision trees and intermediate rules for every hidden layer. It can be seen as a routine divide and conquer method that aims to describe each layer by the previous one, aggregating all the results to explain the whole network. On the other hand, when the internal structure of an NN is not considered, the corresponding strategies are called pedagogical. These strategies treat the complete network as a black-box function and do not inspect it at the neuron level to explain it. In 2012, Augasta and Kathirvalavakumar (2012) introduced the RxREN algorithm, employing reverse engineering techniques to analyze the output and trace back the components that cause the final result, which can also be viewed as a backtracking scenario. Apart from rule extraction techniques, other approaches have been proposed to interpret the decisions of NNs. In 2016, researchers introduced Interpretable Mimic Learning (Chen et al. 2020), which builds on model distillation ideas to approximate the original NN with a simpler and interpretable model, analogous to dynamic programming. The concept of transferring knowledge from a complex model (the teacher) to a simpler one (the student) has been explored in other works, such as a clustering-based approach to extract rules from MLPs (Hruschka and Ebecken 2006).

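A pedagogical mimic in the spirit of the strategies above can be sketched by fitting a small decision tree to a network's own predictions and reading off its rules; fidelity to the network then becomes an explicit quantity. The network, synthetic data, and tree depth below are illustrative, not a specific published configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Pedagogical strategy: treat the network as a black box and fit a small decision tree
# to its *predictions*, then read the tree's if-then rules as an approximate explanation.
X = np.random.rand(1000, 4)
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)        # synthetic task

black_box = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500).fit(X, y)
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, black_box.predict(X))

fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"fidelity to the network: {fidelity:.2f}")
print(export_text(surrogate, feature_names=["f0", "f1", "f2", "f3"]))
```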

The biggest concern with interpretable models is the limitation of the method used to learn the problem, as each has its own weaknesses in terms of knowledge representation, knowledge reasoning on the problem, and the complexity of the domain. Inherently interpretable decision trees work by following a set of if-then clauses structured as a tree. The problem with this method is that it is not very scalable in terms of interpretability: the larger the tree grows with the complexity of the problem, the less interpretable it becomes. The same can be said for other rule-based methods, which explain in a similar fashion to decision trees. Linear models like the Support Vector Machine (SVM) are usually a bit more interpretable because of their transparency, where the support vectors can be used to explain the decisions on a new problem. Unfortunately, these suffer the same fate as decision trees: as the number of dimensions increases, it becomes very difficult to interpret the prediction into an explanation, and at some point we are likely to lose the bigger picture. CBR, on the other hand, can be considered somewhat interpretable, depending on the complexity of the similarity measure used and on the previous cases, where we do not need to focus on all the dimensions but rather only on the ones that were most important for the retrieval. Whereas k-NN looks at all the features, CBR usually only looks at the most important features. Although many papers discussing CBE bring forth the argument that presenting only the nearest neighbor is a sufficient explanation, there are some related challenges. As mentioned by Nugent and Cunningham (2005), although CBR offers some transparency, there is some knowledge hidden inside the knowledge containers which is not apparent to the user. The presentation of the feature values in the most similar cases may be misleading (McSherry 2003); in some cases, the presence of some feature values may even speak against the prediction, and just presenting these to the user may not always be useful. This speaks to the importance of finding and highlighting features that are essential factors when designing a successful case-based explanation (Nugent and Cunningham 2005).

3.3.2 Greedy

The goal of greedy interpretability is to get the best explanation for a black-box model. Because the approach is greedy, the solution that appears best locally is the one chosen: it tries to find a localized best solution, assuming this leads to a global optimum in the end. Such an algorithm is easy to build and is usually the easiest one to use. But making decisions based on what is best locally does not always work the way it sounds, and in general it does not yield globally optimal solutions. Simply speaking, the greedy algorithm does not always work, but when it does, it works like a charm! This is why the dynamic programming approach, which is a reliable solution, has often taken its place. A greedy algorithm has two basic characteristics:
1. It picks the locally best choice out of greed.
2. It requires the problem to be solvable by finding optimal solutions to its sub-problems (the optimal substructure property).


People typically use such explanations to identify other people or items of interest. "Anish is the intellectual guy with long hair who wears glasses," for example. In this scenario, we can observe that traits like intelligence and long hair help to describe the person, albeit insufficiently. The existence of glasses is necessary to complete the identification and differentiate him from, say, Parth, who is tall, has long hair, and does not wear glasses. When we try to accurately describe something, we often offer such contrastive facts. These contrastive facts are not a comprehensive list of all possible characteristics that should be absent from an input to distinguish it from all other classes to which it does not belong, but rather a minimal set of characteristics/features that help distinguish it from the "closest" class to which it does not belong. This is where the trait of greedy selection comes into play. The downside of this approach is that we do not know how the network solved the problem, because we do not know its underlying knowledge and reasoning, even if the explanations are correct. Xiao and Wang (2021) developed an iterative technique for determining the attributes and dependencies that any classifier uses. They address the optimization problem of discovering groupings of attributes whose interactions affect performance. To tackle this, a greedy approach called the "GoldenEye" algorithm was suggested, which randomizes the features to choose permutations, groups the attributes, and finally prunes superfluous attributes to boost fidelity.

Case 1: Towards Contrastive Explanations with Relevant Negatives
Dhurandhar et al. (2018) wanted to generate explanations for NNs which, in addition to highlighting what is minimally sufficient (e.g. tall and long hair) in an input to justify its classification, also identify contrastive characteristics or features that should be minimally and critically absent (e.g. glasses), in order to maintain the current classification and distinguish it from another input that is "closest" to it (say, Parth). As a result, we want to provide explanations of the following form: an input x is classified as belonging to class y because features f_i, ..., f_k are present and features f_k, ..., f_m are absent. This categorization is based on the idea of searching over all feasible solutions when traversing the explanation while keeping only the relevant ones, looking for a greedy local optimum that strongly confirms the predictive nature of the classification.
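A hedged, model-agnostic sketch of the greedy idea of a minimally sufficient set of present features (a pertinent-positive-like explanation): the toy model, the `predict_proba` function, and the threshold are hypothetical stand-ins for any classifier, and this is not the CEM optimization discussed below.

```python
import numpy as np

def greedy_minimal_features(predict_proba, x, background, target_class, threshold=0.9):
    """Greedily add features (starting from a 'background' instance) until the model
    assigns the target class with high confidence: a rough analogue of a minimally
    sufficient explanation found by local, greedy selection."""
    kept, current = [], background.copy()
    remaining = list(range(len(x)))
    while remaining and predict_proba(current)[target_class] < threshold:
        # pick the single feature whose inclusion helps the target class most
        best = max(remaining,
                   key=lambda i: predict_proba(np.where(np.arange(len(x)) == i, x, current))[target_class])
        current[best] = x[best]
        kept.append(best)
        remaining.remove(best)
    return kept

# Toy model: class 1 becomes likely when features 0 and 2 are both large.
def predict_proba(v):
    score = 1 / (1 + np.exp(-(4 * v[0] + 4 * v[2] - 4)))
    return np.array([1 - score, score])

x = np.array([0.9, 0.1, 0.8, 0.2])
print(greedy_minimal_features(predict_proba, x, background=np.zeros(4), target_class=1))
```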

 Highlight The greedy interpretability strategy is like a sliding window that finds the optimum answer in the region of interest and explains it. The strategy may not reach the global optimum, but the resulting solution is usually well received.

This is backed by a GDPR statement indicating the importance of such basic, unambiguous explanations over needlessly complex and lengthy ones (Yannella and Kagan 2018). In fact, the demand for such explanations in certain human-critical sectors is a powerful motivation to have them. There is the concept of pertinent positives (PP) and pertinent negatives (PN) (Herman 2016) in medicine and criminology, which combined offer a comprehensive explanation. For example, in Fig. 3.32, where we see hand-written digits from the MNIST dataset, the black background represents no signal or the absence of those specific features, which in this case are pixels with a value of zero. Any non-zero value would indicate the presence of those features/pixels. This idea also applies to colored images, where the most prominent pixel value (say, the median/mode of all pixel values) can be considered as no signal, and moving away from this value can be considered as adding signal. One may also argue that there is some information loss in this form of explanation. However, such explanations are lucid and easily understandable by humans, who can always delve further into the details of the generated explanations, such as the precise feature values, which are readily available.

 Highlight
• A Pertinent Positive (PP) is a factor whose minimally adequate presence substantiates the final classification.
• A Pertinent Negative (PN) is a factor whose absence is required to establish the final classification.

Given an input and its categorization by a NN, CEM generates explanations with the following objectives:
1. PNs: it identifies a minimal number of features in the input that should be absent (i.e. remain in the background) to prevent the classification result from altering.
2. PPs: it discovers a minimal number of (object/non-background) features in the input that are adequate to produce the same categorization.
3. It applies a SOTA CAE (Mousavi et al. 2017) to (1) and (2) above, keeping the perturbations "near" the data manifold to obtain more "realistic" interpretations.

The authors suggested improved approaches to accomplish (3), so that the resulting explanations are more likely to be near the underlying data manifold and thus correspond to human intuition, rather than random perturbations that may affect the categorization. In fact, developing a decent representation with an autoencoder


Fig. 3.32 MNIST explanations using CEM with and without a CAE, LIME, and LRP. For CEM, PP/PN are shown in cyan/pink, respectively. For LRP, green indicates neutral relevance, red/yellow positive, and blue negative relevance. For LIME, red is positive and white is neutral. Figure reproduced from Dhurandhar et al. (2018) with permission

may not be achievable in many cases due to constraints such as a lack of data or poor data quality. It may also be unnecessary if all feature value combinations have meaning in the domain or if the data does not sit on a low-dimensional manifold, as is the case with images. It is also worth noting that this technique is connected to methods for generating adversarial examples (Carlini and Wagner 2017; Chen et al. 2018). Let us look at the CEM objective mathematically. Given a sample (x_0, c_0), the technique attempts to determine the PN (ref. objective (1) above) by solving Eq. 3.30, with the notation defined in Table 3.3:

\min_{\delta \in \mathcal{X} \setminus x_0} \; \alpha \, f^{neg}_{\kappa}(x_0, \delta) + \beta \|\delta\|_1 + \|\delta\|_2^2 + \gamma \|x_0 + \delta - AE(x_0 + \delta)\|_2^2    (3.30)

where κ ≥ 0 is the confidence parameter that determines the separation of the predicted probability when the example belongs to class c0 versus when it does not. The corresponding regularization coefficients are denoted by the terms α, β, γ ≥ 0. For PP, a similar optimization approach is used to minimize δ as min δ∈X∩x0 . The whole technique uses Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) Beck and Teboulle (2009) for objectives (2) and (3) and a similar projected FISTA for PN objective (1). The Fig. 3.32 shows that CAE improves CEM results. The highlighted portions better reflect how humans view numbers. The PP findings show which pixels correspond to the number categorization, but the PN results show the smallest modification required to modify the prediction result. Case 2: Singular Vector Canonical Correlation Analysis Singular Vector Canonical Correlation Analysis (SVCCA) analyzes the interpretability by combining Canonical Correlation Analysis (CCA) with Singular Value Decomposition (SVD) (Raghu et al. 2017). CCA has been used for related tasks such as


Table 3.3 CEM notations from the optimization representation
• δ ∈ X∖x_0: negative interpretable perturbation applied to a natural instance (x_0, c_0), with inferred class label c_0, to investigate the difference between the two most likely classes within the feasible data space X
• δ ∈ X ∩ x_0: positive interpretable perturbation on the existing components X ∩ x_0 such that it results in the same top-1 prediction of class c_0
• P(x): prediction probabilities over all classes for a perturbed image x created as x = x_0 + δ
• AE(x): autoencoder used to reconstruct an image from x while evaluating its similarity to the sample
• arg max_i [P(x_0)]_i: the most likely class prediction for the natural image
• arg max_i [P(x_0 + δ)]_i: the most likely class prediction for the perturbed image
• f^{neg}_κ(x_0, δ): loss function motivating the perturbed x to be predicted as a different class than c_0 = arg max_i [P(x_0)]_i
• β‖δ‖_1 + ‖δ‖_2^2: elastic net regularizer used for efficient feature selection in high-dimensional learning
• ‖x_0 + δ − AE(x_0 + δ)‖_2^2: L_2 reconstruction error of the AE

calculating the similarity between modeled and observed brain activity (Sussillo et al. 2015) and training multi-lingual word embedding models (Faruqui and Dyer 2014). However, it had not been used to compare deep representations. What is of major relevance is not the neuron's reaction to random input, but how it encodes aspects of a given dataset (e.g. natural images). As a result, the authors define a neuron's representation as its collection of responses over a finite set of inputs (either from a training or a validation set). Given a data collection X = {x_1, ..., x_m}, where each input x_i may be multidimensional, the activation of neuron i at layer l is represented by the symbol z_i^l. It should be noted that only one such vector of responses is defined over the entire input data. SVCCA determines the relationship between two network layers l_k = {z_i^{l_k} | i = 1, ..., m_k}, k = 1, 2, by accepting l_1 and l_2 as input (in practice, l_k does not have to be the complete layer). SVCCA employs SVD to extract the most informative components l_k', and CCA to transform l_1' and l_2' so that \tilde{l}_1 = W_X l_1' and \tilde{l}_2 = W_Y l_2' have the highest correlations ρ = {ρ_1, ..., ρ_{min(m_1, m_2)}}. A summary measure of the degree of similarity between the two compared layers is \bar{ρ} = (1 / min(m_1, m_2)) \sum_i ρ_i. Finally, it produces pairs of aligned directions (\tilde{z}_i^{l_1}, \tilde{z}_i^{l_2}) that correlate well with ρ_i. As shown by one of the SVCCA tests on CIFAR-10, only 25 of the most significant axes in l_k are required to achieve almost the complete accuracy of a full network with 512 dimensions.

In a nutshell, SVCCA takes two groups of neurons as input and uses SVD on each subspace to pick the subspaces that contain the most essential directions of the original subspace (a greedy approach). As noted by Raghu et al. (2017), the low-variance directions (neurons) are mostly noise, which makes this a crucial step for NNs. Then the correlations between the transformed subspaces are maximized by


computing the canonical correlation similarity (Hardoon et al. 2004) with CCA. At last, a set of outputs with the best possible alignment in terms of singular values and directions is generated. To conclude, Malioutov et al. (2017) approach the problem of generating interpretable models as a sparse signal recovery problem from Boolean algebra known as Boolean compressed sensing; in a non-heuristic approach, threshold group testing is used to develop interpretable rules. Dhurandhar et al. (2018) developed CEM as a model-agnostic method that provides contrastive explanations by taking the PP and PN as independent optimization problems and solving them using FISTA. Mousavi et al. (2017) improved on this by employing the CAE close to the data manifolds.
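A compact sketch of the SVCCA recipe (SVD to keep the informative directions, then CCA to align the two subspaces and average the canonical correlations), assuming scikit-learn and NumPy are available; the layer activations here are random stand-ins and the retained dimensions are arbitrary choices.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_similarity(acts1, acts2, keep_dims=20, n_components=10):
    """acts1, acts2: (n_datapoints, n_neurons) activation matrices of two layers.
    Step 1 (SVD): keep only the most informative directions of each layer.
    Step 2 (CCA): align the reduced subspaces and average the canonical correlations."""
    def reduce(acts, k):
        acts = acts - acts.mean(axis=0, keepdims=True)
        _, _, Vt = np.linalg.svd(acts, full_matrices=False)
        return acts @ Vt[:k].T                    # projection onto top-k singular directions

    r1, r2 = reduce(acts1, keep_dims), reduce(acts2, keep_dims)
    cca = CCA(n_components=n_components, max_iter=2000).fit(r1, r2)
    a, b = cca.transform(r1, r2)
    corrs = [np.corrcoef(a[:, i], b[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

acts1 = np.random.randn(500, 64)                  # stand-ins for two recorded layers
acts2 = acts1 @ np.random.randn(64, 48) + 0.1 * np.random.randn(500, 48)
print(svcca_similarity(acts1, acts2))
```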

3.3.3 Back-Tracking

Backtracking interpretability is a recursive algorithmic strategy for solving problems by iteratively attempting to build a solution, one component at a time, and discarding partial solutions that fail to fulfill the problem's criteria. This works far better than blind trial and error. In this case, we pick one answer out of several possibilities and give it a try; if it works, we report it; if not, we go back and try something else. As with other forms of recursion, this involves going back to a previous choice that did lead towards a solution before moving on to new ones.

 Highlight It is easy to visualize the backtracking process as being stuck in a maze and continuously turning right (backpropagating the losses to the input) until we find an exit (a representation of the influence on the data).

Nonetheless, the majority of post-hoc explanations approach the issue from a different angle: in the first step, a model is trained without giving much thought to the features it will use (forward pass). Then we backpropagate through it to see what properties of the input data it has learned. A prerequisite is nevertheless that the model be able to explain itself, e.g. by identifying which input features it uses to support its prediction. Large datasets, on the other hand, are often afflicted by the presence of spurious correlations between the many variables (Calude and Longo 2017). When deciding which of the few correlated input variables to use to support the prediction, the learning machine is confounded by these spurious correlations. Figure 3.33 shows a simple example: the model accurately classifies the data by utilizing either feature x_1, feature x_2, or both, but only the first choice generalizes to fresh data. Failure to learn the correct input features may result in 'Clever Hans'-type predictors (Ref. Sect. 4.1.1). Selecting only a subset of "excellent" input features and giving them to the learning machine is one approach (Guyon and Elisseeff 2003). However, this strategy is challenging to


Fig. 3.33 Illustration of the common occurrence of spurious correlations in high-dimensional data. In this example, both x1 and x2 accurately anticipate the present data, but only x1 successfully generalizes to the true distribution, shown by Montavon et al. (2019)

implement, for example, in picture identification, where the importance of individual pixels is not established. To begin, Eq. 3.31 demonstrates the interpretation of a prediction f(x) at the neighboring reference \tilde{x} using a straightforward Taylor decomposition:

f(x) = f(\tilde{x}) + \sum_{i=1}^{k} (x_i - \tilde{x}_i) \, [\nabla f(\tilde{x})]_i + \text{higher-order terms}    (3.31)

Each input feature’s significance to the prediction is quantified by first-order terms (components of the sum) in Eq. 3.31, which also provide the explanation. Despite its apparent ease of use, this approach is not stable when used with DNNs.  Highlight The instability of a basic Taylor Decomposition of a prediction can be traced back to various recognized flaws in DNN functions: • Shattered gradients (Balduzzi et al. 2017): While the function value f (x) is often precise, the function gradient is noisy. • Adversarial examples (Szegedy et al. 2013): Small changes in the input x can cause the function value f (x) to alter dramatically. Because of these flaws, selecting a relevant reference point x with a meaningful gradient ∇ f (x) is challenging. This makes it impossible to create a reliable interpretation. Case 1: Explaining Decision by Layer-Wise Relevance Propagation We will now concentrate on LRP, a pixel-wise decomposition technique that uses the DNN’s network structure to compute explanations rapidly and reliably. LRP adds explainability and scale to potentially extremely complicated DNNs by propagating the f (x) prediction backwards in the neural network using specially developed local


Table 3.4 Notations for LRP interpretation
• R_j: relevance score propagated from a layer onto neuron j of the lower layer
• a_j: activation of neuron j
• ρ(w_{jk}): function of the weight w_{jk}, expressing the extent to which neuron j contributes to the relevance of neuron k
• ε: a small positive term that absorbs some relevance when the contributions to the activation of neuron k are weak or inconsistent

propagation rules (backtracking). The LRP propagation technique is subject to a conservation property, which states that what a neuron receives must be redistributed in equal proportion to the layer below. In electrical circuits, this behavior is equivalent to Kirchhoff's conservation laws. Given an input-output mapping (for example a DNN) f : x_i → y_k, LRP propagates the relevance scores (R_k)_k at a given layer k onto the neurons of the previous lower layer, say j, using Eq. 3.32 and the notations in Table 3.4:

R_j = \sum_k \frac{z_{jk}}{\sum_j z_{jk}} R_k    (3.32)

Here, the quantity z_{jk} denotes the extent to which neuron j has contributed to the relevance of neuron k, and the conservation property is enforced by the denominator. Once the input features are reached, the propagation procedure ends. By applying the above approach to all neurons in the network, the layer-wise conservation property \sum_j R_j = \sum_k R_k and, by extension, the global conservation property \sum_i R_i = f(x) may be easily validated. There have been quite a few variations of LRP, such as LRP-0, LRP-ε, LRP-γ, and LRP-αβ (Bach et al. 2015), allowing for simple implementation in DNNs. As a superset of the aforementioned versions, a generic rule is stated in Eq. 3.33:

R_j = \sum_k \frac{a_j \, \rho(w_{jk})}{\epsilon + \sum_{0,j} a_j \, \rho(w_{jk})} R_k    (3.33)

 Highlight This propagation rule's computation can be divided into four steps:
1. Forward pass: ∀k : z_k = ε + \sum_{0,j} a_j \, ρ(w_{jk})
2. Element-wise division: ∀k : s_k = R_k / z_k
3. Backward pass: ∀j : h_j = \sum_k ρ(w_{jk}) \, s_k
4. Element-wise product: ∀j : R_j = a_j h_j


Fig. 3.34 Input image and pixel-wise explanations of the output neuron 'castle' obtained with various LRP procedures. Parameters are ε = 0.25 std and γ = 0.25. Figure reproduced from Montavon et al. (2019) with permission

The sample implementation of LRP for model interpretation is shown in Fig. 3.34 below. LRP assigns importance scores to pixels by propagating backward the prediction probability of the input through DNN and computing relevance scores. The relevance score, R j , intuitively indicates the pixel’s local contribution to the prediction function f (x).
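Following the four propagation steps listed above, a minimal sketch of the LRP-ε rule for fully connected layers (with ρ taken as the identity) might look as follows; the tiny random network is purely illustrative, and relevance conservation holds only approximately because of the ε term and the biases.

```python
import numpy as np

def lrp_dense(a, W, b, R_upper, eps=0.1):
    """Propagate relevance through one dense layer a -> z = a @ W + b.
    a: (d_lower,) activations, W: (d_lower, d_upper), R_upper: (d_upper,) relevances."""
    z = eps + a @ W + b                  # step 1: forward pass (with epsilon stabilizer)
    s = R_upper / z                      # step 2: element-wise division
    c = W @ s                            # step 3: backward pass
    return a * c                         # step 4: element-wise product

# Tiny two-layer example: relevance starts at the winning logit and is pushed back.
rng = np.random.default_rng(0)
a0 = rng.random(4)                                   # input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

a1 = np.maximum(a0 @ W1 + b1, 0)                     # hidden ReLU layer
logits = a1 @ W2 + b2

R2 = np.zeros(2)
R2[logits.argmax()] = logits.max()                   # relevance initialised at the output
R1 = lrp_dense(a1, W2, b2, R2)
R0 = lrp_dense(a0, W1, b1, R1)
print(R0, R0.sum())                                  # per-feature relevance scores
```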

3.3.4 Dynamic

Dynamic programming is employed when we have problems that can be broken down into similar sub-problems so that their results can be reused. These methods are mostly used for optimization. This sort of interpretability is also known as the memoization technique, because the objective is to retain previously calculated interpretations in order to avoid calculating them over and over again. Before attempting to solve the sub-problem at hand, the dynamic algorithm will try to review the results of previously solved sub-problems; the sub-problem solutions are then combined to arrive at the optimal solution. In breaking the problem down into smaller and smaller feasible sub-problems, this method is analogous to divide and conquer. However, unlike divide and conquer, these sub-problems are not solved independently; instead, the outcomes of the smaller sub-problems are memorized and reused for comparable or overlapping sub-problems.


Highlight
The saliency/feature-attribution visualization approach to interpretation, for instance, attempts to establish which aspects of the input are most crucial to a specific classification outcome. Typically, this entails locating a value that may be thought of as the gradient of a certain output with respect to the input.

Case 1: Visual Explanations via Gradient-Based Localization
To create a coarse localization map showing the regions of an image that are crucial for predicting a concept, Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al. 2017) uses the gradients of any target concept flowing into the final convolutional layer. This enables untrained users to distinguish between a 'stronger' and a 'weaker' deep network, even when both produce identical predictions.
As a short refresher, a saliency map is obtained by performing a forward pass on an image of interest, computing the gradient of the class score S_c(I) with respect to the input pixels for the class c while setting the gradients for all other classes to zero, and then displaying the result, either by highlighting negative and positive contributions separately or by showing absolute values. Grad-CAM additionally assigns a relevance score to each neuron for the decision of interest. The gradient, however, is not backpropagated all the way back to the image, but rather (typically) to the last convolutional layer, to build a coarse localization map that highlights significant portions of the image. This decision of interest is usually the class prediction (found in the output layer), but it could also be any other layer in the neural network.
Let us begin with an intuitive examination of Grad-CAM. The goal of Grad-CAM is to figure out where a convolutional layer "looks" for a certain classification in an image. The question of visual interpretation is: how can we "see" from the feature maps how a certain classification was made by the convolutional neural network? As a first attempt, we may visualize the raw values of each feature map, average them, and overlay them on our image. However, we are only interested in one specific class, yet the feature maps encode information for all classes. Grad-CAM must therefore determine the significance of each of the k feature maps to our class c of interest. Before averaging over the feature maps, we weight each pixel of each feature map with the gradient. This yields a heatmap showing which areas have a favorable or adverse impact on the class of interest. A ReLU function is then applied to this heatmap, which is a fancier way of saying that all negative values are zeroed out, with the justification that we are only interested in contributions to the specified class c and not to other classes; this is exactly like locating the optimal solution to a sub-problem (the original image). Finally, we scale the map to the [0, 1] range for display purposes and overlay it on the original image. For this method, the localization map is defined in Eq. 3.34 as follows:


L^c_{\text{Grad-CAM}} \in \mathbb{R}^{H \times W} = \text{ReLU}\Bigl(\sum_k \alpha_k^c A^k\Bigr) \qquad (3.34)

with c representing the class of interest, A^k the kth activation map, and H and W the height and width of the explanation, respectively. The gradient of the class score y^c with respect to the activation maps A^k of the last convolutional layer is \partial y^c / \partial A^k. These gradients are then globally average-pooled over the spatial dimensions to obtain the neuron-importance weights of the respective class in Eq. 3.35,

\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} \qquad (3.35)
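A hedged PyTorch sketch of Eqs. 3.34-3.35 is given below; the hook-based capture of activations and gradients, the bilinear upsampling to input resolution, and the [0, 1] rescaling for display are implementation choices assumed here, not the reference code of Selvaraju et al. (2017).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    """Minimal Grad-CAM sketch for a CNN classifier.

    model        : torch.nn.Module returning class logits
    image        : input tensor of shape (1, 3, H, W)
    target_class : index c of the class of interest
    conv_layer   : the last convolutional module of `model`
    """
    activations, gradients = {}, {}
    h1 = conv_layer.register_forward_hook(
        lambda m, i, o: activations.update(a=o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.update(g=go[0]))

    logits = model(image)                        # forward pass
    logits[0, target_class].backward()           # backpropagate the class score y^c
    h1.remove(); h2.remove()

    A = activations["a"]                         # (1, K, h, w) feature maps A^k
    alpha = gradients["g"].mean(dim=(2, 3))      # Eq. 3.35: global-average-pooled gradients
    cam = F.relu((alpha[:, :, None, None] * A).sum(dim=1))      # Eq. 3.34
    cam = F.interpolate(cam[:, None], size=image.shape[2:],     # upsample for overlay
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # rescale to [0, 1]
    return cam[0, 0]
```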

The localization map obtained this way is very coarse, as the final convolutional feature maps have a much coarser resolution than the input picture. Other attribution methods backpropagate all the way to the input pixels. As a result, they are far more detailed and can show us specific edges or locations that contributed the most to a prediction. Guided Grad-CAM (Tang et al. 2019) is a hybrid of the two approaches, and it is quite simple: we compute the Grad-CAM explanation as well as the explanation from another attribution approach, such as Vanilla Gradient, for an image. The Grad-CAM result is then upsampled using bilinear interpolation, and both maps are multiplied element-wise. Grad-CAM thus functions as a lens, focusing the pixel-by-pixel attribution map on certain areas. Section 4.1.4 describes in detail how to visualize various saliency approaches.

Highlight
The way to understand gradient-only attribution is as follows: if we changed a pixel's color values, the predicted class probability would go up (if the gradient is positive) or down (if the gradient is negative). The greater the absolute value of the gradient, the greater the impact of changing this pixel.

Case 2: Visually Sharpening Gradient-Based Sensitivity Maps
SmoothGrad (Smilkov et al. 2017) reduces the noise in gradient-based explanations by adding noise and averaging over the resulting artificially noisy gradients. This strategy is not an independent explanation method; rather, it is an extension of any gradient-based explanation method. In a nutshell, it operates as follows:
1. Add noise to the image of interest to generate several variants.
2. Generate pixel attribution maps for all of these images.
3. Take the average of the pixel attribution maps.
Indeed, it is that straightforward. Why should this work? The underlying observation is that the derivative fluctuates significantly at tiny scales. During training,


Fig. 3.35 The top row takes a specific image x and an image pixel x_i and plots the gradient ∂S_c/∂x_i(t), as a percentage of the maximum entry in the gradient vector (middle plot), along a short line segment x + tε in the space of pictures, parameterized by t ∈ [0, 1], as one moves away from a baseline image x (left plot) toward a fixed point x + ε (right plot). Here ε is a random sample from N(0, 0.01²). The final image (x + ε) is indistinguishable from the original image x to a human. Five photos from the ImageNet gazelle class are shown, with the effect of increasing the noise level across the columns. After subjecting the input pixels to Gaussian noise N(0, σ²) for 50 iterations, we obtain the sensitivity map. The noise level is quantified as σ/(x_max − x_min). Figure reprinted from Smilkov et al. (2017) with permission

neural networks have no motivation to keep gradients smooth; their aim is to correctly categorize images. These irregularities are "evened out" by averaging many maps, as in Eq. 3.36:

R_{sg}(x) = \frac{1}{N} \sum_{i=1}^{N} R(x + g_i) \qquad (3.36)

where g_i ∼ N(0, σ²) are noise vectors sampled from a Gaussian distribution. An example of strongly fluctuating partial derivatives is shown in Fig. 3.35. The authors present the derivative as a percentage of the maximum entry to demonstrate that the swings are significant. The length of this segment is short enough that the starting image x and the final image x + ε appear identical to a human. Furthermore, the model correctly classifies each image along the path. However, the partial derivatives


of S_c with respect to a single pixel's RGB values (red, green, and blue components) fluctuate dramatically. Another feature of the gradient we notice is the presence of a few pixels with substantially higher gradients than the average. The impact of the noise level on numerous example images from ImageNet (Russakovsky et al. 2015) is also included in the bottom rows of the figure. The second column provides the vanilla (zero percent noise) gradient. Since quantitatively assessing a map is still an open problem, the authors resort to a qualitative assessment. Note that the crispness of the sensitivity map is best preserved when applying noise between 10 and 20% (middle columns). Interestingly, while this noise level is relatively effective for Inception, the optimum noise level varies with the input. In conclusion, gradient-based approaches are typically faster to compute than model-agnostic methods and use fewer computational resources, comparable to the time complexity saved by reusing sub-problem solutions in dynamic programming. The saliency approach describes an algorithm's choice by assigning values that indicate the relevance of input components in their contribution to that decision. These values could take the shape of probabilities and super-pixels, such as heatmaps.
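To make the averaging in Eq. 3.36 concrete, a hedged PyTorch sketch follows; the vanilla-gradient base attribution R(·), the noise level, and the absolute-value aggregation over color channels are assumptions for illustration, not the reference implementation of Smilkov et al. (2017).

```python
import torch

def smoothgrad(model, image, target_class, n_samples=50, sigma_frac=0.15):
    """Hedged sketch of SmoothGrad (Eq. 3.36) with vanilla gradients as R(.).

    model        : torch.nn.Module returning class logits
    image        : tensor of shape (1, 3, H, W)
    target_class : class index c
    sigma_frac   : noise level sigma / (x_max - x_min), e.g. 10-20%
    """
    sigma = sigma_frac * (image.max() - image.min())
    grads = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + sigma * torch.randn_like(image)).requires_grad_(True)
        score = model(noisy)[0, target_class]     # class score S_c
        score.backward()                          # vanilla gradient R(x + g_i)
        grads += noisy.grad.detach()
    return (grads / n_samples).abs().sum(dim=1)   # averaged map, collapsed over RGB
```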

3.3.5 Branch and Bound

Consider computing a lower bound (or, by duality, an upper bound) and claiming that the attributes obtained are sufficient. While Branch and Bound techniques can only provide a restricted estimate of a variable's value, they can operate with bigger models, such as those with up to 10,000 hidden neurons.

Highlight
The simplest approach is to think of it as a tree. We begin by focusing on a small subset of the main problem's solution space, and then we search for a locally optimal explanation that can be generalized to the problem at large. We impose a bounding constraint on the tree's branching and find a solution, which may involve explanations that are only partially complete and hence miss the most optimal or holistic interpretation.

Case 1: Local Interpretable Model-Agnostic Explanations
Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al. 2016) explains the predictions of any classifier in an interpretable and faithful manner by learning a model locally around the prediction. It perturbs the input instances and observes the output predictions to fit a locally faithful interpretable model over the interpretable representation. LIME divides the input image into interpretable components (contiguous superpixels) and runs each perturbed instance through the model to generate a probability. This set of locally weighted data is used to train a simple linear model. At the end of the process, LIME gives an explanation based on the super-


Table 3.5 Notations for LIME interpretation

Notation: Remark
f(x) : the probability (or a binary indicator) that x belongs to a specific class
g ∈ G : an interpretable model g from the set of interpretable models G, with interpretable components in {0, 1}^{d'}
π_x(z) : proximity measure between an instance z and x, defining the locality of x
L(f, g, π_x) : locality-aware fidelity loss that measures how closely g approximates f in the locality specified by π_x
ω(g) : regularizer based on a complexity measure (opposed to interpretability)
ξ(x) : optimal explanation model, with f as the true function to be modeled
c(V, W, I) : non-redundant coverage, the set function c that computes the total importance of the features that appear in at least one instance in a set V, given the explanation matrix W and the global importance I_j of a component in the explanation space

pixels with the most positive weights. This could also be a contender for Divide and Conquer, but let us consider it under Branch and Bound. For example, a binary vector showing the presence or absence of a word could be used to represent text classification, even though the classifier may use more complex (and hard to understand) features like word embeddings. While the classifier may represent the image as a tensor with three color channels per pixel, an interpretable representation may be a binary vector denoting the "presence" or "absence" of a contiguous patch of related pixels (a "super-pixel"). We say that x ∈ R^d is the original

representation of the instance being explained, and x' ∈ {0, 1}^{d'} is a binary vector of its interpretable representation. Let us refer to the model being explained as f : R^d → R. To guarantee both interpretability and local faithfulness, we minimize the objective ξ(x) = arg min_{g∈G} L(f, g, π_x) + ω(g), while keeping the complexity ω(g) low enough for humans to interpret, using the notations from Table 3.5. The explanation yielded by LIME, minimizing the objective ξ(x), can be used with different explanation families G, fidelity loss functions L, and complexity measures ω. The basic idea underlying LIME is to sample instances both near x (with a high weight owing to π_x) and far away from x (low weight from π_x). Even though the original model may be too complex to explain globally, LIME provides a locally faithful interpretation, encapsulated by π_x.
Branching: We define the pick step as the task of selecting G instances for the user to investigate, given a set of instances X. The submodular pick essentially motivates the branching to select a diverse, representative group of interpretations to show to the user, that is, explanations that are non-redundant and represent how the model behaves generally. We generate an n × d' explanation matrix W that captures the local importance of the interpretable components for each instance, given the explanations for a collection of


instances X (|X| = n). When utilizing linear models as the interpretation, we set W_ij = |w_{g_i j}| for an example x_i and its explanation g_i = ξ(x_i). Furthermore, for each component (column) j in matrix W, we let I_j denote the component's global importance in the explanation space.
Bounding: We want I to be such that features that explain many different instances have higher relevance rankings. For instance, in text applications, I can be set as I_j = \sum_{i=1}^{n} W_{ij}. In the case of images, I must measure something that is comparable across super-pixels in different images, such as color histograms or other super-pixel properties; we will explore these ideas further in Sect. 4.1. Remember, while we choose instances that cover the majority of components, the set of explanations should not be redundant. Therefore, we avoid selecting instances with similar explanations in a greedy style. The non-redundant coverage notion can be implemented by defining coverage as the set function c (Ref. Table 3.5) expressed in Eq. 3.37.

c(V, W, I) = \sum_{j=1}^{d'} \mathbb{1}_{[\exists i \in V : W_{ij} > 0]} \, I_j \qquad (3.37)

The pick problem then consists of finding the set V, with |V| ≤ G, that maximizes the weighted coverage function in Eq. 3.38; achieving the highest coverage this way is NP-hard (Feige 1998):

\text{Pick}(W, I) = \underset{V, |V| \le G}{\arg\max}\; c(V, W, I) \qquad (3.38)
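A minimal NumPy sketch of the coverage function in Eq. 3.37 and a greedy approximation to the pick problem in Eq. 3.38 (elaborated in the text that follows) is given below; the dense representation of W and I and the naive re-evaluation of the coverage at every step are assumptions for illustration.

```python
import numpy as np

def coverage(V, W, I):
    """Eq. 3.37: total importance of components covered by the picked set V."""
    covered = (W[list(V)] > 0).any(axis=0)   # components hit by at least one instance
    return float(I[covered].sum())

def submodular_pick(W, I, budget):
    """Greedy approximation to the pick problem in Eq. 3.38.
    Iteratively adds the instance with the largest marginal coverage gain;
    assumes budget <= number of instances n.

    W      : (n, d') matrix of local explanation weights
    I      : (d',)   global importance of each interpretable component
    budget : maximum number of instances G to show the user
    """
    V = set()
    for _ in range(budget):
        gains = {i: coverage(V | {i}, W, I) - coverage(V, W, I)
                 for i in range(W.shape[0]) if i not in V}
        V.add(max(gains, key=gains.get))     # pick the largest marginal gain
    return sorted(V)
```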

Here, the submodular pick relies on the marginal coverage gain c(V ∪ {i}, W, I) − c(V, W, I) of adding an instance i to the set V. Owing to submodularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain achieves a constant-factor approximation of 1 − 1/e.
Case 2: Local Rule-Based Explanations
Guidotti et al. (2018) developed LOcal Rule-based Explanations (LORE), an agnostic method that employs a local interpretable predictor on a synthetic neighborhood built by a genetic algorithm. It works similarly to LIME in that it initially learns a local interpretable predictor on a neighborhood, but instead of randomly perturbing the created instance, it generates the neighborhood using a genetic approach. It produces a meaningful interpretation, consisting of a decision rule that validates the reasoning behind the decisions, as well as a collection of counterfactual rules that indicate changes to the input features that can be made to modify the anticipated outcome. To ensure realism, the rules are created using decision trees on the neighbourhood to replicate the behavior of the black box locally, as shown in Fig. 3.36.
Case 3: Distilling a Neural Network into a Soft Decision Tree
Frosst and Hinton (2017) produced a soft decision tree that generalizes better than one trained directly from training data (Fig. 3.37), by transferring the


Fig. 3.36 Examining the relative merits of a LIME and b LORE. When comparing LORE to LIME, the primary distinction is in the level of detail provided. Using the nearest split in the decision tree, LORE also provides a counterfactual explanation. Figure reproduced from Guidotti et al. (2018) with permission

Fig. 3.37 An MNIST-trained soft decision tree with depth four. Inner node images are learned filters, while leaf images are probability distributions over classes. The final most likely categorization at each leaf is noted, as are the likely classifications at each edge. For instance, observing the rightmost internal node, we see the potential classifications are only 3 or 8, hence the learned filter is learning to distinguish between those two digits. The filter looks for two places that would connect the 3 to produce an 8. Figure reproduced from Frosst and Hinton (2017) with permission

generalization skills of a NN to a soft decision tree. This technique focuses on training the decision tree on hierarchical decisions to simulate the NN's input-output function. Traditional NNs' hierarchical properties allow them to learn resilient and innovative input-space representations, but after one or two levels they become difficult to engage with. In contrast, a soft decision tree is trained by SGD utilizing the NN's predictions, which provide more informative targets. Based on an input example, the branching selects a specific static probability distribution over classes as its output. Instead of attempting to understand how a DNN makes its decisions, the approach employs the DNN to train a decision tree that mimics the NN's input-output function but operates in a completely different manner. If there is a lot of unlabeled data, a NN


can build a larger labelled data set to train a decision tree, overcoming the tree's statistical inefficiency. Consider:
Branching: The method uses a soft binary decision tree trained with mini-batch gradient descent, where each inner node i has a learned filter w_i and a bias b_i, and each leaf node l has a learned distribution P^l. At each inner node, the probability of taking the rightmost branch is given in Eq. 3.39:

p_i(x) = \sigma(x w_i + b_i) \qquad (3.39)

Each expert in the model's hierarchical mixture of experts learns a simple, static distribution over the possible output classes k. The model uses this hierarchical filtering to allocate examples to the appropriate experts, as presented in Eq. 3.40:

P_k^l = \frac{\exp(\rho_k^l)}{\sum_{k'} \exp(\rho_{k'}^l)} \qquad (3.40)

where P^l is the probability distribution at the lth leaf, with learned parameters ρ^l. Both the distribution from the leaf with the highest path probability and an average of the distributions over all leaves, weighted by their individual path probabilities, can be utilized with this model to provide a predictive distribution over classes. An example representation is shown in Fig. 3.37. A list of all the filters along the route to the leaf, coupled with the binary activation decisions, may easily explain the predictive distribution obtained from the leaf with the highest path probability.
Bounding: The soft decision tree employs its learned filters to make hierarchical decisions based on the input and outputs a static probability distribution over classes. To avoid getting trapped in sub-optimal solutions during training, Frosst and Hinton implemented a penalty term encouraging internal nodes to employ both left and right sub-trees. Without this penalty, the tree tended to get stuck on plateaus where one or more internal nodes always allocated practically all the probability to one of its sub-trees and the logistic gradient was close to zero. The penalty is the cross entropy between the desired average distribution (0.5 for each of the two sub-trees) and the actual average distribution (α, 1 − α). Equation 3.41 gives the expression of α for node i, where P^i(x) is the path probability from the root node to node i:

\alpha_i = \frac{\sum_x P^i(x)\, p_i(x)}{\sum_x P^i(x)} \qquad (3.41)

The penalty L, aggregated across internal nodes, is stated in Eq. 3.42 as follows:

L = -\lambda \sum_{i \in \text{InnerNodes}} \bigl[\, 0.5 \log(\alpha_i) + 0.5 \log(1 - \alpha_i) \,\bigr] \qquad (3.42)

where the hyper-parameter λ, set prior to training, determines the severity of the penalty. This penalty was predicated on the notion that a tree that used alternate sub-trees


Fig. 3.38 Sample visualization of the first two layers of a soft decision tree trained on the Connect4 dataset (Lichman et al. 2013). Examining the learnt filters reveals that the game may be divided into two separate subtypes: games in which the players have placed pieces on the board’s edges and games in which the players have placed pieces in the center of the board. Figure adapted from Frosst and Hinton (2017) with permission

fairly equally would be better suited to any particular classification task, and it did boost accuracy in practice.

THINK IT OVER
As one descends the tree, how valid is the assumption behind the above penalty computation?
This assumption loses support as one travels farther down the tree; the last node in the tree might only be responsible for two types of input, in some non-equal proportion, and penalizing the node for a non-equal split in this circumstance may affect the model's accuracy. The expected fraction of data that each node sees in any given training batch declines exponentially as one descends the tree, which means that the computation of the actual odds of employing the two sub-trees becomes less accurate. To counter this, we can maintain an exponentially decaying running average of the actual probabilities, with a time window that is exponentially proportional to the depth of the node. The authors of the paper achieved much greater experimental test accuracy by using both the exponential decline in the strength of the penalty with depth and the exponential growth in the temporal scale of the window used to compute the running average.

Figure 3.38 depicts their experimental outcome, in which a soft decision tree trained on Connect4 data (Lichman et al. 2013) obtained 80.60% accuracy, whereas the non-distilled version scored 76.83%. Likewise, Plumb et al. (2018) developed MAPLE (Model Agnostic suPervised Local Explanations), a model-agnostic explanation method. It can detect global patterns while simultaneously providing example-based and local explanations. MAPLE combines the idea of the Supervised Local Modeling Method (SILO) (Bloniarz et al. 2016), which employs a random forest for supervised neighborhood selection for local linear modeling, with feature selection methods from DStump (Kazemitabar et al.


2017). DStump determines the significance of a feature based on how much it reduces label impurity when split at the root in random forest trees. The given local explanation was deemed to be more faithful than LIME (Ribeiro et al. 2016), and global patterns may be utilized to detect problems in its local explanations.
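Returning to the soft decision tree of Case 3, the inner-node routing probability (Eq. 3.39) and the per-node regularization penalty (Eqs. 3.41-3.42) can be sketched in PyTorch as follows; the module layout, the flattened input, and the numerical stabilizers are assumptions for illustration, not the authors' reference code.

```python
import torch
import torch.nn as nn

class SoftTreeNode(nn.Module):
    """Hedged sketch of one inner node of a soft decision tree."""

    def __init__(self, in_dim):
        super().__init__()
        self.gate = nn.Linear(in_dim, 1)        # learned filter w_i and bias b_i

    def forward(self, x, path_prob):
        # Eq. 3.39: probability of taking the rightmost branch
        p_right = torch.sigmoid(self.gate(x))
        # Eq. 3.41: alpha_i, the path-probability-weighted average branching
        alpha = (path_prob * p_right).sum() / (path_prob.sum() + 1e-8)
        # Eq. 3.42 contribution: cross entropy against the desired (0.5, 0.5) split
        penalty = -0.5 * (torch.log(alpha + 1e-8) + torch.log(1 - alpha + 1e-8))
        return p_right, penalty
```

During training, the penalties of all inner nodes would be summed, scaled by λ, and added to the classification loss.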

3.3.6 Brute-Force

Brute-Force is the most fundamental and straightforward sort of interpretability strategy. Typically, a Brute-Force algorithm is the simplest solution to a problem, or the first solution that comes to mind when we perceive the problem. More technically, it amounts to iterating through every possible solution to the problem. To devise an ideal solution, we must first obtain a solution and then attempt to optimize it. Every issue can in principle be solved by brute force, but for complex problems this is usually not feasible due to the high costs. For example, Monte Carlo Dropout can be considered a brute-force attempt: a NN is trained with the usual dropout, which is then kept turned on during inference. This allows us to produce several forecasts for each instance; in a classification task, we can average the softmax outputs for each class. Nonetheless, this section will not be expanded upon, as the idea is well recognized. We leave it up to the reader to create a strategy for grouping IDL methods that appear to be suitable for Brute-Force interpretability. In addition, it is important to comment on the effect of the strategies along the way on the completeness of model dependability. In the next part, we will address the transmission of knowledge in deep networks and the relationship between layer encoding and its interpretation.
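Before moving on, a minimal PyTorch sketch of the Monte Carlo Dropout procedure described above is given below; the number of stochastic passes and the use of the standard deviation as an uncertainty proxy are assumptions for illustration.

```python
import torch

def mc_dropout_predict(model, x, n_passes=30):
    """Keep dropout active at inference and average the softmax outputs
    over several stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers while the rest stays in eval mode
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_passes)])
    return probs.mean(dim=0), probs.std(dim=0)   # mean prediction and spread
```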

3.4 Knowledge Propagation in Deep Network Optimizers

Figure 3.39, from Olah et al. (2017), shows that the features learned by a CNN range from edges in layer conv2d0 and textures in layer mixed3a in the lower convolutional layers to more complex and abstract features of parts and objects in the higher convolutional layers mixed4d and mixed4e to its right. In general, all feed-forward NNs are differentiable. They can learn through gradient-based backpropagation, which changes the input over time based on how an internal neuron fires or how the output behaves. Conceptually, this seems easy, but we must fine-tune the optimization process iteratively in order to achieve the goal. Numerous possibilities depend on the optimization technique being used, whether it explores the entire dataset or optimizes a single image from scratch. The image below (Ref. Fig. 3.40) is from the paper "Feature Visualization" (Olah et al. 2017), which optimizes GoogLeNet to activate a neuron starting from random

Fig. 3.39 Features learned by GoogLeNet layers, trained on ImageNet data. Image reproduced from Olah et al. (2017) under Creative Commons Attribution (CC-BY 4.0)



Fig. 3.40 The process of enhancing an image so that it stimulates a specific neuron, beginning with random noise. Reproduced image from Olah et al. (2017) under creative commons attribution (CC-BY 4.0)

noise. To visualize a certain feature, we need examples with high neuron activation at a specific position or across the entire channel. If we want to explain the output classes of a classifier, we can optimize the class logits either before or after the softmax activation. Logits serve as evidence for each class, while probabilities express its likelihood. Non-intuitively, the simpler way to increase a class's softmax probability is to make the alternatives less likely, rather than making the class of interest more likely (Simonyan et al. 2014). According to a literature review, optimizing pre-softmax logits may improve image quality for interpretation. One explanation for the above-mentioned statements is that maximization of a class probability cannot push down evidence from other classes, making it ineffective. Another option is to optimize using a softmax function. This is a common occurrence when dealing with adversarial samples. A strong regularization with generative models is always useful, and probabilities can be an important thing to optimize in this case. Style transfer objectives (Gatys et al. 2015) give us great insight into the network's understanding of various styles and contents. Objectives help us understand what a model keeps and discards during optimization-based model inversion (Mahendran and Vedaldi 2015). Flexibility is another good thing about optimization. Let's say we need to investigate how neurons collectively represent knowledge. In that case, we can investigate how a specific example should be altered in order for an additional neuron to activate. This adaptability can also aid in visualizing how features emerge during network training. Limiting the interpretability to fixed samples in a dataset would have made it difficult to thoroughly examine the model. Not all aspects are relevant for the same application, but it is useful to break it down further in light of the adversarial conditions for a model's behavior.
Image visualization is not the only way to comprehend inter-network learning. The use of weight histogram visualization (Fig. 3.41) for each layer is a specific approach for DL models. This applies to any type of data. As weights and biases are the brains of neural networks, determining the overall distribution of weights across the network provides a valuable understanding of its operation. The interpretation of a uniform distribution, a normal distribution, or an ordered structure can be comparable to that of an image's histogram. It provides the distribution of captured tones as well


Fig. 3.41 Weight Histogram technique in tensorboard to interpret layer learning (Stewart 2020)

as the information lost due to blacked-out shadows or overexposed highlights. While the acquired knowledge may not be instrumental to fully decipher the black box, it is the next step in gaining valuable insight from it. In conclusion, different optimization objectives reveal which parts of a network are attending during the model’s learning on a dataset. Whether it’s independent neuron, channel, or layer visualization via optimization, or the most popular class-logits and class probabilities visualization via pre-softmax and post-softmax functions, optimization can be a powerful tool for distinguishing between features that truly influence the model’s behavior and those that simply correlate with the cause. What we believe to be the most prominent feature in the image may not be detected by a neuron.
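As an illustration of the optimization process in Fig. 3.40 (enhancing an image from random noise so that it stimulates a specific neuron or logit), a hedged PyTorch sketch follows; the Adam optimizer, the step count, the logit objective, and the absence of the transformation-robustness regularizers used by Olah et al. (2017) are simplifying assumptions.

```python
import torch

def activation_maximization(model, neuron_idx, steps=256, lr=0.05,
                            size=(1, 3, 224, 224)):
    """Gradient-ascent sketch: optimize an input image so that a chosen
    pre-softmax logit (or any scalar activation) becomes large."""
    model.eval()
    img = torch.randn(size, requires_grad=True)   # start from random noise
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logit = model(img)[0, neuron_idx]          # evidence for the chosen unit
        (-logit).backward()                        # ascend the activation
        opt.step()
    return img.detach().clamp(0, 1)                # clamp for display (assumes [0, 1] inputs)
```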


 Highlight Use the “inversion” interpretability technique to investigate the invariance of recent deep CNNs by sampling possible approximate reconstructions. Mahendran and Vedaldi (2015) relate this to the representation’s depth, demonstrating how the CNN gradually builds an increasing amount of invariance, layer by layer. Reconstructing images from subsets of neurons, either spatially or based on channels, allows us to investigate how locality of information is distributed across these representations.

3.4.1 Knowledge Versus Performance

This section emphasizes the fact that metric accuracy alone is no longer sufficient; algorithmic accountability is the need of the hour. A successful analysis hinges on selecting the right metric, so be sure we understand what those are. The data science community has put a lot of effort into creating cutting-edge instruments with unprecedented precision. Various accuracy metrics are being developed using statistical or logical reasoning for all-purpose tools to select the best model. One thing that can be said to be true is that more complex models are much more flexible than their simpler counterparts. This enables the approximation of more complex functions. Given the premise that the function to be approximated is complex and that there is enough data to harness a complex model, the statement is true. This is where the trade-off between performance and interpretability can be seen. It is important to remember that trying to solve problems that do not follow the above rules amounts to trying to solve a problem that does not have enough variety in its data (variance). In that case, the model's added complexity will only make it harder to solve the problem correctly (Fig. 3.42).
A few recent cases are highlighted below, such as how Google's facial recognition algorithm labeled some men with dark complexions as gorillas. The performance drop of Tesla's Autopilot in cold weather led to fatal decision-making risk. An algorithm

Fig. 3.42 Choosing the right tool for explainability


on YouTube shuts down a popular channel suspecting it to be bot-operated. Apple's credit limit algorithm favors men over women having joint taxation status. Uber's self-driving car skips stop signs. DL academics are aware of the explainability problem, and recent developments focus on visualizing the model. Feature visualization and understanding a model's learning behavior are among them; however, making them work involves several aspects. Simple methods produce high-quality visualizations. Researchers have developed strategies to examine neuron behavior, what a neuron fires for, how neurons interact, and ways to improve optimization techniques.
The debate over "interpretability versus performance" has been going on for a long time, but like any other big statement, it is surrounded by myths and misconceptions. As Rudin (2018) says, it is not always true that models that are more complicated are automatically more accurate. This is not true when the data is well organized and the features we have access to are of high quality and value. This kind of situation happens a lot in some industry settings, because the features being analyzed are constrained within very controlled physical problems, in which all the features are highly correlated and not much of the possible landscape of values can be explored (Diez-Olivan et al. 2019). When the most fundamental process, formal knowledge gathering, is skipped, all the counting on log-loss or charting in the world won't mean a thing. As rational creatures, humans prefer to base their choices on sound reasoning. To better understand the network's learning, we now seek consensus on why something is functioning as intended. A few real-world scenarios in which an algorithm or model is biased can have serious consequences. In our relentless pursuit of the SOTA, we may discard a working model in exchange for a negligible improvement in accuracy just to keep up appearances. We shall look into a few human-interpretable interpretations (HII) for unstructured image data, i.e., how well a non-technical human can perceive a model's decision and the possible crucial features in the decision-making process for higher performance. On this path to performance, when performance goes hand in hand with complexity, interpretability runs into a downhill slope that seemed unavoidable until now. But the appearance of more sophisticated ways to measure explainability could turn that slope around, or at least cancel it out. Figure 1.9 depicts a preliminary representation of how IDL can improve the common trade-off between understanding a model and evaluating its performance based on previous work. The approximation dilemma is also worth mentioning at this point, because it is closely related to how well a model can be understood and how well it works: explanations for ML models must be clear and close enough to meet the needs of the audience for whom they are made, making sure that explanations are accurate and do not leave out the model's most important parts.


THINK IT OVER
Scalar metrics are often used for classification, but are they complete? Accuracy, recall, F1 score, area under the ROC curve (AUC), and precision all lack something.
– 'Accuracy' is irrelevant when there is a class imbalance.
– How often have we tried a manual threshold for binary classification? 'Precision' works well when false-positive costs are high, while 'recall' works well with high false-negative costs.
– Class skewness can affect the 'F1 score/AUC' used in competitions.
Visualizing the confusion matrix and confusion examples, such as those from a CIFAR-10 dataset with a 10 × 10 confusion matrix, provides more insight into the class labels on which the model has been underperforming. Explaining a model's performance with edge-case scenarios, or where it fails, is more useful than an accuracy graph.

Regarding performance on images, Fig. 3.23 shows how the top 3 class indices for two images can be predicted using a simple VGG-16 model pre-trained with ImageNet data weights. The simple experiment of classifying the image with no hard optimization is shown in the figure by changing the image perspective ratio and image resolution. We believe that classification performance has improved significantly over time, and that a cutting-edge approach can produce far superior results. Our goal with this illustration is to instill an idea of interpretation rather than to compete for the most advanced approach to achieve the highest score. The top prediction results in Fig. 3.43 vary significantly when the image perspective is changed, some background with less contextual sense is cropped, and the resolution is changed. When working with similar data that could undergo such a transformation in the future, we need to exercise caution. Nonetheless, we have just reached the tip of the interpretability iceberg. We have primarily discussed the network's input and output layers, as these are typically the most concerned with model performance. The interpretation of every other method demonstrating SOTA strategies has been subject to a chaos theory. We need to gain a better understanding of whether that is the sole interpretation and righteousness of all. For example, pixel-based distances such as the L2 pixel distance can be very unintuitive when dealing with high-dimensional data, particularly images: an image's perceptual or semantic similarity does not correspond well to pixel-wise distance. When discussing image similarity and its applications, it is worth pausing to briefly introduce the reader to the Siamese Network.


Fig. 3.43 Variation in top-3 class prediction for transformed examples

3.4.1.1 Case Study: Siamese Network

Before proceeding to the next section, which provides an overview of knowledge encoding in shallow and deep layers of DNNs and aspects of its representation, we shall discuss an interesting example that will aid in understanding how people approach unconventional techniques, not only in search of peak performance, but also in an attempt to make a machine learn features on its own without much human supervision. The architecture shown in Fig. 3.44 describes a network structure in which two or more identical sub-network components share the same weights: a simple twin CNN network with binary classification and logistic prediction Ŷ. According to Rao et al. (2016), traditional dimensionality reduction using PCA to find a linear projection to a low-dimensional space results in high variance. Multidimensional Scaling (MDS), on the other hand, arranges the data in a low-dimensional space while preserving the pairwise distances between the inputs. The disadvantage is the inability to map new data to the same space, which in a facial emotion recognition example requires a non-linear embedding to handle noise and illumination differences. The Siamese network achieves dimensionality reduction of the data without relying explicitly on the total number of categories or on explicit categorical labels. The model applies a contrastive loss to the final activation layer of each identical network, which pulls similar inputs together while separating them from dissimilar ones. The loss function is derived from Hadsell et al. (2006), with X_1 and X_2 the input images, Y a binary label, α > 0 the margin of separation between positive and negative scores, and D_w the Euclidean distance (refer Sect. A.4) between the vectors, with dependence on the weights w of the network, formulated below as:


Fig. 3.44 Layout of a siamese network. Adapted from Rao et al. (2016)

Y = \begin{cases} 0, & \text{similar images} \\ 1, & \text{different images} \end{cases}

D_w = \lVert G_1 - G_2 \rVert_2

L = (1 - Y)\,\tfrac{1}{2} D_w^2 + Y\,\tfrac{1}{2}\bigl[\max(0,\, \alpha - D_w)\bigr]^2 \qquad (3.43)

It is inspired by the analogy of a spring's pulling and repelling behavior, captured by the first and second halves of Eq. 3.43, respectively. The loss can be represented more simply as a function fed with positive (L+) and negative (L−) sample data at a time, which add up to L = L+ + L−, or as a triplet loss, L = max(d(X, P) − d(X, N) + α, 0), with distance metric d (ref. Appendix A.4), where P and N are the positive and negative data and X the sample data. The idea of shared weight matrices at each layer of the network is another insightful motivation to compare the difference between fuzzy weights and a regular CNN for each hidden convolutional layer, discussed later in the book (refer Chap. 5). Interestingly, the authors demonstrate an excellent concept of similarity structural features derived solely from image data in a restricted training condition, i.e., without providing any similarity information to the network. EMPATH, a NN for categorizing facial expressions by Dailey et al. (2002), derived the interpersonal circumplex from the confusability reflected in the network's softmax output in 2002 (Rao et al. 2016). To the best of our knowledge, the interpretation of the actual similarity structure learn-


ing and the ability of the network to outperform at recognizing facial expressions of emotion has not yet been explored. This emphasizes the point that we cannot rely solely on the model's performance and must work hard to understand how the system can outperform human-level expertise while remaining a reliable source of its behavior.
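A short PyTorch sketch of the contrastive loss in Eq. 3.43 is given below; the batch-mean reduction and the default margin value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(g1, g2, y, margin=1.0):
    """Contrastive loss of Eq. 3.43.

    g1, g2 : embeddings G_1, G_2 of the two twin branches, shape (B, D)
    y      : 0 for similar pairs, 1 for dissimilar pairs, shape (B,)
    margin : separation margin alpha > 0
    """
    d_w = F.pairwise_distance(g1, g2)                          # Euclidean distance D_w
    pull = (1 - y) * 0.5 * d_w.pow(2)                          # spring pull on similar pairs
    push = y * 0.5 * torch.clamp(margin - d_w, min=0).pow(2)   # repulsion up to the margin
    return (pull + push).mean()
```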

3.4.2 Deep Versus Shallow Encoding

The functions of deeper and shallower layers of neurons have also been investigated through ablation studies (Meyes et al. 2020, 2019). In essence, some neurons are ablated (disabled), and the output of the ablated network is compared to the output of the original network. Many papers, such as Alain and Bengio (2016), have proposed that concepts at deeper levels of abstraction can be more easily disentangled. If a network is too shallow, its layers may be incapable of clearly separating concepts. Researchers are increasingly favoring the "bigger is better" strategy, hoping that larger DL models with more layers and parameters trained on larger data sets will result in AI breakthroughs. Microsoft released Turing-NLG, the largest deep neural network to date, with 17 billion parameters in February 2020, surpassing Nvidia's MegatronLM, which had 8.3 billion parameters. That is about 20% of the human brain's total of 85 billion neurons. The network can train on massive amounts of data without becoming saturated, which should make it superior to other models. However, understanding a NN with 17 billion parameters is no easy task (Ba and Caruana 2013). Over the years, most people have come to believe that even though both deep and shallow networks can approximate data-modeling functions, heavily deep-layered architectures are more efficient in terms of computation and parameter count for the same level of accuracy. Modern computing power has rendered shallow approaches largely obsolete. In order to learn a more abstract representation of the input, deep encoding can generate deep representations at each layer. Here, we refer to NNs with only one hidden layer as "shallow." Architectures with many hidden layers and many neurons in each layer have been dubbed "deep." Although a small number of studies have demonstrated that a shallow network can fit any function, the insight is that it requires a much broader layer, which significantly raises the total number of parameters. The vast majority of studies show that a deep NN outperforms a shallow network when it comes to fitting a cost function with a small number of parameters.


Highlight
Three major factors have contributed to recent advances in heavily layered deep data encoding:
1. The availability of a large volume of data and the need to avoid overfitting.
2. Problems with vanishing gradients were solved by field advancements such as the creation of ReLU activation functions.
3. Powerful computing to train DNNs on a large dataset in a few hours.

DNNs, particularly those with convolutional layers, have sparked a revolution dating back to the discovery of complex neural representations in the cortex for visual recognition and processing. The hierarchical arrangement of features in layers has been a major focus for DCNNs' exponential success. Poggio and Smale (2003) discussed that another, possibly related, challenge to hierarchical learning is a comparison with real brains. The paper describes 'learning algorithms' that correspond to one-layer architectures. In terms of learning theory, are hierarchical architectures with more layers justifiable? It appears that the learning theory they describe provides no general argument in favor of hierarchical learning machines for regression or classification. This is puzzling because the organization of the cortex, for example the visual cortex, is highly hierarchical. At the same time, hierarchical learning systems show superior performance in a variety of engineering applications. Many prior investigations (e.g., Bengio et al. (2013); Mahendran and colleagues (2016)) have speculated that deeper CNN representations are better at capturing more abstract visual concepts. Furthermore, convolutional features naturally retain spatial information that is lost in fully-connected layers, so the final convolutional layers should provide the best balance of high-level semantics and detailed spatial information. These layers' neurons search the image for semantic class-specific information (say, object parts). Grad-CAM analyzes the gradient information flowing into the CNN's final convolutional layer to determine the importance of each neuron for a particular decision. Although the technique is very general and can be used to visualize any activation in a deep network, Selvaraju et al. (2017) focus on how the network might make decisions. Ironically, a mathematical theory explaining these qualities, or even merely a validation of the DCNNs' success, is still lacking. There have been attempts to answer the criteria for DNNs to outperform shallow networks (Mhaskar et al. 2016). Mhaskar and Poggio (2016) showed a Directed Acyclic Graph (DAG) for an idealized deep network with three examples: (i) traditional sigmoid networks, (ii) commonly used DCNN ReLU networks, and (iii) Gaussian networks, for a quantitative comparison of the optimal performance of shallow versus deep networks.

3.4.2.1 Application-Specific Suitability of Shallow Learning

The question that arises is: how much smaller can the architecture be made? Are deep neural networks really necessary? In contrast to complicated, well-engineered deep convolutional networks, a shallow network with the same number of parameters can achieve competitive performance, as empirically demonstrated in the paper by Ba and Caruana (2013). We need to learn what makes deep nets superior to shallow ones in terms of the fraction of improvement they offer. A thorough examination is required to determine which of the following outcomes is most likely, or whether a combination of many outcomes is more likely.

Highlight
Contrary to the widespread notion that humans only use 10% of their brain, 90% of the neurons do not lie around idle all day, and the medical community now has devices to look inside people's brains. Even while researchers have yet to pin down the functions of the brain's many regions, they are aware of the significance of every single bit. So, if you are of the opinion that you can disable 90% of your brain and it will still function well, maybe you should just stick to using 10% of your brain power.

Some of the advantages of using DNNs are as follows:
1. More learning options are available in a deep encoding network.
2. DNNs are able to learn the correct hierarchical representation because of their superior inductive bias.
3. DNNs' convolutional layers can pick up information that a simpler network could not.
4. DNNs are better suited to current learning algorithms and regularization than shallow nets.
5. A shallow architecture is incapable of learning complex functions with the same number of parameters as a deep architecture.

Check out Fig. 3.45 for a schematic layer-wise display of MNIST fashion data trained on a five-layered convolutional block followed by a fully connected layer. The encoded features appear to become more complex as we move away from the input. The network learns to categorize feature maps on its own, making the intricate mesh of feature bits, weights, and biases difficult to describe. This motivates us to keep tiny networks that allow us to distinguish discrete characteristics. On the other hand, such a network may lose the ability to precisely map tiny bits of data. If that is the case, let us look at a recent study, 'LightOCT' (Butola et al. 2020), for a performance analysis.


Fig. 3.45 A visual guide of the layers in CNN learning, trained on the MNIST fashion dataset

3.4.2.2 Case Studies in Shallow Learning: LightOCT Network

This is an example from a paper in which a shallow network, "LightOCT", is used for diagnostic decision assistance. A two-layer convolutional architecture is used in the paper (Butola et al. 2020). Attempting to examine the activation map at the 2nd convolution layer made us question the model's knowledge propagation and performance in comparison to heavier-layered models like VGG-19, ResNet101, and Inception-V3. The activation map is generated just before the FC layer from the contributing weights, as a sum over all neurons. Figure 3.46 clearly demonstrates the structure of activation map generation from the last convolutional layer, shortly before the FC layer. As illustrated in this case, the overall notion of interpreting deep CNNs using activations has suffered in the past. The second row of Fig. 3.46 depicts the convolutional activation map of the two-layer LightOCT model: it shows no strong visual activation for the different datasets, yet the model incorporates sufficient features in its layers to reach an accuracy of 99.3% for Normal Breast Tissue (NBT) and 98.6% for Cancer Breast Tissue (CBT). LightOCT learns characteristics in its second layer with a slightly visible activation map overlay, which is lacking in the second layer of deeper models like Inception-V3. The 2nd layer of Inception-V3, which has 96 layers in total, shows a similar feature scheme in the 3rd row of Fig. 3.46; only in the fourth row does its 94th layer give a clear view of the activation map. The authors of the paper convey that people believe activation maps are the sole important way to analyze models, which we believe is only partially true. Only a deeper layer could demonstrate activation based on the distinguishability of weaker and stronger features, which was missing


Fig. 3.46 Feature encoding in layers at different depth. Figure adapted from Butola et al. (2020) with permission

in the shallow network. We see here a shallow model with limited computing capabilities and memory requirements that can achieve excellent accuracy comparable to massively layered models such as Inception-V3.
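For readers who wish to reproduce this kind of inspection, a hedged PyTorch sketch of reading out an activation map from a chosen convolutional layer via a forward hook is given below; the channel-sum aggregation, the bilinear upsampling, and the min-max normalization are assumptions for illustration, not the exact procedure of Butola et al. (2020).

```python
import torch

def conv_activation_map(model, x, conv_layer):
    """Capture the feature maps of `conv_layer` during a forward pass and
    aggregate them into a single map for overlay on the input image."""
    feats = {}
    handle = conv_layer.register_forward_hook(
        lambda m, inp, out: feats.update(a=out.detach()))
    with torch.no_grad():
        model(x)                                   # forward pass only
    handle.remove()
    fmap = feats["a"].sum(dim=1, keepdim=True)     # sum the channel responses
    fmap = torch.nn.functional.interpolate(        # upsample to input resolution
        fmap, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)
```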

Summary
The chapter concluded that current uncertainty approaches observe semantic and perceptual information from image patches with varying degrees of accuracy. Rather than a model-agnostic tool, ideal as a starting point for explainability, we need to delve deeper into specific architectures and attempt to comprehend the essential character of the networks. According to reviews of the literature, the lack of a true consensus, particularly for deep-layered, complicated models, has caused researchers to provide their own interpretations of a model's performance, which are insufficient to describe the complex learnt representations of DNNs. Even though the ideas are incomplete, we are certain that the black-box label is only partially accurate and that we need improved techniques to grasp these models. The reader now finds a new perspective on model understanding thanks to the schematic flowcharts of various networks, their interpretability designs, and their knowledge encoding. The strategies of interpretability are introduced through an analogy with one of the most well-known areas of computer science: the design and analysis of algorithms. The attempt to place IDL strategies in a framework familiar to us is innovative and novel.


The chapter analyzes current SOTA methods for certifying DNNs, and it proposes that there is a missing link between neural learning of texture and attributes and human interpretation of the same. The significance of color and texture is reminiscent of dog vision, in which dogs live in a largely monochromatic universe. Dogs have limited color vision compared to humans. Red, blue, and green are the primary hues that most people can see. Dogs, on the other hand, are largely limited to shades of blue and yellow and are unable to distinguish between other colors. This doesn't bother them, though, until we grab them a red ball, hurl it into the green grass, and act like they're stupid for not finding it. Because our ancestors have spent millennia searching for food against a green background, we have no trouble understanding why those objects would be red. Before accepting or rejecting a model for its accuracy, a fundamental understanding of what a model sees and what it misses is equally crucial. We shall learn more about interpretation in specific DNN architectures in the upcoming chapter.

Reading List
1. "CNN variants for computer vision: History, architecture, application, challenges and future scope" by Bhatt et al. (2021).
2. "A Gentle Introduction to Graph Neural Networks" by Sanchez-Lengeling et al. (2021).
3. "The Building Blocks of Interpretability" by Olah et al. (2018).
4. "Visual interpretability for deep learning: a survey" by Zhang and Zhu (2018) covers methods for visualizing CNN representations in intermediate network layers, diagnosing these representations, disentangling representation units, creating explainable models, and semantic middle-to-end learning via human-computer interaction.
5. "Interpretable Machine Learning", a book by Christoph Molnar (2020).

Self-assessment
1. Write code to show the ludicrousness of pixel-based distance. From an original image, the code should generate three to five equally L2-distant images and examine the image composition. Justify, for example, why evaluating L2-distant images has little to do with perceptual or semantic similarity.
2. Use dimensionality reduction to visualize the hidden activity of an ANN. Show the connections between learned representations of observations and connections between artificial neurons. (Refer: Channel Attribution (Olah et al. 2018)).

Chapter 4

Interpretation in Specific Deep Architectures

Artificial networks of neurons and algorithms mimic human neural systems, intertwined and stacked. Most cannot comprehend a human mind, yet they work flawlessly until they don't. With millions to billions of parameters optimized across hundreds of layers in a short time, the question is not 'if' something will go wrong, but 'when'. Even calculus-based training in active research areas fails to ensure optimality. Figure 1.9 showed that logistic regressions and rule-based learning are well-conceived and interpretable. As the slope rises, MLPs, CNNs, GANs, and GNNs become harder to interpret. The emergence of XAI has opened up more ways to evaluate algorithmic fairness and statistical validity. These bring us to a consensus that there are no purely black-box or white-box models in AI: explainability exists on a spectrum, a 'gray box' of varying greyness. Categorically, it is trickier than merely reading the coefficients of a logistic regression. More effort is needed to formulate better techniques to represent the models. The book's analysis of SOTA techniques for verifying DNNs suggests a missing link between machine and human understanding of texture and attributes. Furthermore, an ablation study of networks such as CNNs, GANs, and GNNs in the previous chapter explains the model's likely edge-case behavior. The investigation


of deep versus shallow networks boils down to the fact that the computation in a single neuron is the same across all architectures. A shallow network's neurons are not significantly different from those of a deep-layered network. It simply has fewer cells across the network and a shorter computation time for fewer parameters, which should suggest a smaller chance of erroneous network behavior or of feature learning that exceeds human attention. This piques people's interest in learning how a network can achieve such accuracy and accountability. Motivation for explainability is plentiful, as are the methods and tools available. The degree of interpretability is determined by the individual's perspective and the precision with which their instruments are calibrated. There is no fundamental basis for it in the same way as there is for accuracy and specificity. To be useful, the prospective explanation should be understandable by everyone, not just domain specialists. This will be covered in detail in this chapter. We begin with segmentation, projection, and metric analysis to learn about the art of interpretation in CNNs. They will be discussed qualitatively in the following sections, along with the proposal of a new theory, the Convolution Trace, and the failure story of cross-structure learning. Next, because the architecture is broken down into small convolutional structures that encapsulate specific tasks, GANs are simple to understand. Following the interpretation of nanoscopic structure learning, a brief introduction to graph learning techniques and a study of the robustness of neural structured learning with code will be presented.

4.1 Interpretation in Convolution Networks

CNNs (LeCun et al. 1998; Krizhevsky et al. 2012; He et al. 2016; Huang et al. 2017) have achieved superior performance in many visual tasks, such as object detection and classification. However, the end-to-end learning strategy makes CNN representations a black box. Except for the final network output, it is difficult to understand the logic of CNN predictions hidden inside the network. In recent years, a growing number of researchers have realized that high model interpretability is of significant value in both theory and practice, and have developed models with interpretable knowledge representations. Potential focus areas include:
1. Understanding the impact of compression or scaling of image or video attributes in CNNs.
2. Estimating the optimal reduction of data size.
3. Calculating the computing time depending on variants of image size and the knowledge it is able to encode.
4. Tackling the situation of image retrieval, transfer learning on different models, and efficiency of cross-domain learning.
5. Encoding images as videos.


 THINK IT OVER
The history of the past decade has taught us that people generally discard functions and model parameters in the CNN pipeline that make processing slow. What if we applied a similar notion to build a model that effectively crops the image at the initial stage and passes the minimum number of features through the pipeline for effective learning?

4.1.1 Case Study: Image Representation by Unmasking Clever Hans

It is critical to understand the decision-making process. Transparency about the what and why of a non-linear machine's choice is vital for determining whether the learned strategy is trustworthy and generalizable, or whether the model based its conclusion on a spurious correlation in the training data. In psychology, misleading correlations are called the Clever Hans phenomenon (Pfungst 1911). A model using a Clever Hans type of decision strategy will likely fail to give correct categorization and usefulness in the real world, where such spurious or artifact correlations may not exist. Recent work (Simonyan et al. 2014; Zeiler and Fergus 2014; Bach et al. 2015; Ribeiro et al. 2016; Montavon et al. 2017) explains non-linear ML predictions in complicated real-world problems (Greydanus et al. 2018; Zahavy et al. 2016; Arras et al. 2017). Individual explanations can vary. An ideal (but not available) explanation would show the entire causal chain from input to output. Most research examines reduced forms of explanation, often a set of scores showing the value of each input pixel/feature for the prediction. These scores can be presented as user-interpretable heatmaps (relevance maps). Note that computing an explanation does not necessitate understanding neurons individually. Lapuschkin et al. (2019) propose semi-automated Spectral Relevance Analysis (SpRAy) as a practical approach for characterizing and validating the behavior of non-linear learning machines. This aids in determining whether a trained model delivers consistently for the challenge for which it was designed. The authors' effort aims to offer a cautionary voice to the ongoing excitement about machine intelligence by pledging to assess and judge some of the recent successes in a more nuanced manner. The authors investigated several cases that show how LRP and SpRAy explain and validate learned model behavior.
• First, the analysis shows how a learning machine exploits a spurious correlation in the data to "cheat." A Fisher Vector (FV) model (Perronnin et al. 2010), trained on the PASCAL VOC 2007 image dataset (Everingham et al. 2010), is compared with its competitor, a pre-trained DNN that was also fine-tuned on PASCAL VOC. The comparison shows excellent SOTA test set accuracy on categories such as 'person', 'train', 'car', or 'horse'.


Inspecting the decisions with LRP, however, reveals substantial divergence for certain images: the heatmaps displaying the classification reasons could not be more different. The DNN's heatmap clearly highlights the horse and rider, whereas the FV model's heatmap focuses on the image's lower left corner, which contains a source tag. A closer inspection of the dataset, which humans never look through exhaustively, shows that source tags appear distinctively on horse images (Lapuschkin et al. 2016). The FV model has "overfit" the PASCAL VOC dataset by relying on the easily identifiable source tag, which correlates with the true features. Cutting the source tag from horse images significantly weakens the FV model's decision, while the DNN's decision remains almost unchanged. If we take a correctly classified image of a Ferrari and add a source tag, the FV model's prediction quickly changes from 'car' to 'horse'.
• The second showcase example trains NN models to play the Atari game Pinball. Mnih et al. (2015) showed that a DNN outperforms humans in the game. Similar to the previous example, LRP heatmaps visualize the DNN's decision behavior in terms of Pinball game pixels. After extensive training, the heatmaps focus on high-scoring switches and lose track of the flippers. A subsequent inspection of the games in which these heatmaps occur reveals that the DNN agent first moves the ball into the vicinity of a high-scoring switch without using the flippers, and then "nudges" the virtual pinball table so that the ball triggers the switch indefinitely by passing over it back and forth, without tilting the table. Here, the model has learned to abuse Atari Pinball's "nudging" threshold. From a game-scoring perspective, it is smart to use any available mechanism; in a real pinball game, the player would likely lose because the machine tilts after a few strong movements.
The examples above demonstrate that a low test set error (or a high game score) may be due to cheating rather than valid problem-solving behavior. It may not correspond to true performance in a real-world environment, or when other criteria, such as social norms that penalize such behavior (Chen et al. 2017), are included in the evaluation metric. LRP explanations helped uncover this fine difference: the objectively measurable progress here reflects strategic behavior. Overall, while in each scenario reward maximization and a certain degree of prior knowledge have induced complex behavior, the analysis has made explicit that:
1. Some of these behaviors incorporate strategy.
2. Some of these behaviors may be human-like, others may not.
3. In some cases, the behaviors could even be considered deficient and unacceptable once deployed. The FV-based image classifier may not detect horses in real-world data, and the Atari Pinball AI may perform well only until the game is updated to prevent excessive nudging.


Fig. 4.1 SpRAy workflow. a Relevance maps are constructed for data samples and object classes of interest, which requires two model passes, a forward pass and an LRP backward pass (here for a FV classifier). Then, an eigenvalue-based spectral cluster analysis identifies prediction strategies in the data. Clustered relevance maps and t-SNE cluster groupings show the validity of prediction strategies, which can be used to improve the model or the dataset. Reproduced image from Lapuschkin et al. (2019) under Creative Commons Attribution (CC-BY 4.0)

All insights about classifier behavior obtained so far require human experts to analyze individual heatmaps, a laborious and expensive process that does not scale well. Lapuschkin et al. (2019) further implemented SpRAy to understand the classifier's predicting behavior on large datasets in a semi-automated manner. Figure 4.1 depicts the SpRAy analysis results when applied to horse images from the PASCAL VOC dataset. For classifying images as 'horse', four different strategies can be identified:
• Detect a horse and rider (Fig. 4.1b).
• Detect a source tag in portrait-oriented images (Fig. 4.1c).
• Detect wooden hurdles and other contextual elements of horseback riding (Fig. 4.1d).
• Detect a source tag in landscape-oriented images (Fig. 4.1e).
Subsequently, the SpRAy analysis also reveals another 'Clever Hans'-like behavior in the fine-tuned DNN model that previously went unnoticed in manual analysis of the relevance maps. The large eigengaps in the DNN heatmaps' eigenvalue spectrum for class 'aeroplane' indicate that the model employs very different strategies for classifying aeroplane images. A t-SNE visualization emphasizes this cluster structure even more.


An unexpected strategy discovered with the help of SpRAy identifies aeroplane images by inspecting the artificial padding pattern at the image borders, which for aeroplane images consists primarily of a uniform, structure-less blue background. Padding is typically introduced for technical reasons (the DNN model only accepts square-shaped inputs), but the padding pattern unexpectedly (and unintentionally) became part of the model's strategy for classifying aeroplane images. As a result, changing the way padding is done has a significant impact on the DNN classifier's output. SpRAy's advantage over previous approaches is thus its ability to ground predictions to input features, where classification behavior can be fine-tuned.

4.1.2 Variants of CNNs

When it comes to information encoding and knowledge interpretation (see Sect. 3.1.1), images and videos cover a wide range of applications. CNNs have not only dominated the domain of images and videos, but have also demonstrated outstanding performance in multi-modal data encoding and decision making. The concept of stacking layers for deep learning is straightforward. The idea behind VGG was that if AlexNet performed better than LeNet because it was larger and deeper, why not push this even further? Adding more dense layers was one option, but this would result in more computation. The next option was to add more convolutional layers; however, defining each convolutional layer separately was exhausting. So, the best solution of all was to group convolutional layers into blocks. The question was whether it was better to use fewer, wider convolutional blocks or more narrow ones. The researchers eventually came to the conclusion that more layers of narrow convolutions were better than fewer layers of wider convolutions (a minimal block-stacking sketch is given after the highlight box below). The four primary strategies for scaling and stacking layers in a CNN architecture are shown in Fig. 4.2. However, remember that the performance can vary greatly depending on the architecture and hyperparameter settings.
 Highlight
Basic CNN architecture engineering terminology is all about scaling.
• A wider network suggests more feature maps (filters) in the convolutional layers.
• A deeper network implies more convolutional layers.
• A network with higher resolution implies processing input images with greater width and height (spatial resolution). As a result, the produced feature maps will have greater spatial dimensions.
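A minimal PyTorch sketch of the block-stacking idea described above; the number of blocks, convolutions per block, and channel widths are illustrative rather than the exact VGG configuration.

```python
import torch
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """Stack several narrow 3x3 convolutions, then halve the resolution."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halve H and W
    return nn.Sequential(*layers)

# n blocks followed by fully connected layers, in the VGG spirit
net = nn.Sequential(
    vgg_block(2, 3, 64),
    vgg_block(2, 64, 128),
    vgg_block(3, 128, 256),
    nn.Flatten(),
    nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
    nn.LazyLinear(10),
)

print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 10])
```

Swapping the per-block number of convolutions and channel counts reproduces the narrow-versus-wide trade-off discussed above without touching the rest of the pipeline.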


Fig. 4.2 A general block of different types of CNN architectural design (Tan and Le 2019)

The networks are mostly the result of instinct, a lack of mathematical rigour, and a lot of trial and error. We will therefore go over some of the significant CNN architectures in Table 4.1 that have served as base models for the advancement of supervised learning in CV since 2010. Let us investigate these variants of convolution networks in chronological order, partly to get a sense of history and to form an opinion about where the domain is heading.
 THINK IT OVER
Fewer parameters are needed for convolutional layers; the number of parameters increases dramatically in the final few layers of fully connected neurons. Does eliminating the fully connected layers provide a potential solution? It is easy in theory but harder to implement. Convolutions and pooling reduce the spatial resolution, but we must still map the resulting feature maps to classes. One strategy is to use 1 × 1 convolutions to adjust the number of channels as we go deeper while the resolution shrinks, and finally pool each channel down to a class score. This provides us with high-quality information from each channel. (A small sketch of this idea follows.)
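A hedged PyTorch sketch of the idea raised in the box above, with illustrative channel sizes: the fully connected head is replaced by a 1 × 1 convolution that maps channels to class scores, followed by global average pooling.

```python
import torch
import torch.nn as nn

num_classes = 10
backbone = nn.Sequential(                         # any stack of conv/pool blocks
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
head = nn.Sequential(
    nn.Conv2d(128, num_classes, kernel_size=1),   # 1x1 conv: channels -> classes
    nn.AdaptiveAvgPool2d(1),                      # global average pooling
    nn.Flatten(),                                 # (N, num_classes) logits
)

x = torch.randn(4, 3, 32, 32)
logits = head(backbone(x))
print(logits.shape)   # torch.Size([4, 10])
```

Compared with flattening into a dense layer, the head's parameter count no longer depends on the input resolution, which is the main saving.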


Table 4.1 Quick summary of popular CNN models (model and key features)

AlexNet (Krizhevsky et al. 2012)
• Trained on ImageNet's 15 M high-resolution 256 × 256 × 3 images
• Dropout, rather than other regularization, prevents overfitting
• Softmax and overlapping pooling prevented information loss
• Instead of sigmoid activation, ReLU was used for the first time

Overfeat (Sermanet et al. 2013)
• Exploration of three well-known vision tasks, classification, localization, and detection, utilizing a single framework trained concurrently to improve accuracy
• For localization, the classification head is replaced by a regression network
• Bounding boxes are predicted at all scales and locations

DeconvNet (Zeiler and Fergus 2014)
• ZFNet, an intriguing multilayer DeconvNet, was introduced
• Built for the purpose of statistically visualizing network performance
• Includes three fully connected layers, a max-pooling layer, a dropout layer, and five shared convolutional layers

Network-in-Network (Lin et al. 2013)
• The model is made lighter by replacing the last fully connected layer with a global average-pooling layer

Visual Geometry Group (VGG) (Simonyan and Zisserman 2014)
• A series of 3 × 3 convolutions padded by one to keep the output size the same as the input size, followed by max-pooling to halve the resolution
• The architecture consists of n VGG blocks followed by three FC-dense layers

GoogLeNet (Szegedy et al. 2015)
• Inception module-based concept
• Nine stacked inception modules with max-pooling layers (to halve the spatial dimensions)

Inception (Szegedy et al. 2015)
• Three convolutional layers with different-sized filters and max-pooling
• Each layer's parallel learning filters are of different sizes

Highway Networks (Srivastava et al. 2015)
• Built on the notion that increasing network depth can improve learning capacity
• Depth was utilized to learn improved feature representations and to give a novel cross-layer connection technique
• Converged far faster than plain networks, even at 900 layers deep, despite the fact that adding hidden units beyond 10 layers affects the performance of a plain network

ResNet (He et al. 2016)
• One of the most popular and successful deep learning models to date
• ResNets use residual blocks
• Skip-connections and batch-normalization let it train hundreds of layers without sacrificing speed
• Based on the idea that deeper layers should not have more training error than shallower ones
• VGG-19 inspired the architecture

SqueezeNet (Iandola et al. 2016)
• Increases channel interdependence at almost no computational cost
• Smaller CNNs reduce server communication during distributed training
• Each squeeze-expand block is connected to form a fire module
• Reduces AlexNet's parameters by replacing 3 × 3 filters with 1 × 1
• Downsamples in the later layers so that the convolutional layers have large activation maps
• The features are compressed with 1 × 1 convolutional layers, then expanded with 1 × 1 and 3 × 3 layers
• To allow the network to adaptively change the weight of each feature map, it adds parameters to the channels of all convolutional blocks

Xception (Chollet 2017)
• Standard convolution combines all input channel filters in one step; the introduction of depthwise separable convolutions splits the traditional convolution process (filtering all input channels and combining the values in a single step) into depthwise and pointwise convolutions
• Xception increases network efficiency by separating spatial and feature-map (channel) correlation

MobileNets (Howard et al. 2017)
• Build lightweight DNNs with depthwise separable convolutions
• Models with minimal latency are utilized in applications such as robots and self-driving cars
• Instead of one huge filter, MobileNets have two: one for filtering and one for combining
• Batchnorm and ReLU non-linearity follow all layers


 THINK IT OVER
MobileNet is a well-liked, lightweight architecture, but why is that? In a plain CNN, a filter is a block that is superimposed on a block of the input image; the dot product between these two blocks feeds the decision, so the relationship between channels is entangled with the details of any given channel. MobileNets use two smaller filters instead of one large one:
• One examines one channel at a time to determine the interconnectedness of its individual pixels.
• The other looks at all the channels at the same position to see how each pixel is connected across channels.
A minimal sketch of this factorization is given below.
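A minimal PyTorch sketch of this depthwise-plus-pointwise factorization (MobileNet-style depthwise separable convolution); the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # depthwise 3x3: each input channel is filtered on its own (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # pointwise 1x1: mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))    # per-channel spatial filtering
        return self.relu(self.bn2(self.pointwise(x))) # cross-channel combination

block = DepthwiseSeparableConv(32, 64)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(block), "vs", count(standard))  # far fewer weights than one standard 3x3 conv
```

The printed parameter counts make the 'lightweight' claim concrete: the factorized block needs a small fraction of the weights of a standard 3 × 3 convolution with the same input and output channels.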

4.1.3 Interpretation of CNNs

This section explains CNN representations and learning with interpretable/disentangled representations. There are five main branches.
1. Intermediate layer visualization, which either synthesizes the image that optimizes a CNN unit's score or inverts conv-layer feature maps back to the input image. The most straightforward technique for investigating visual patterns concealed within a neural unit is to visualize filters in a CNN. Various network visualization strategies were covered in Sect. 3.2.2.1; we shall quickly review a few of them here (a short activation-maximization sketch is given after this list).
• Gradient-based methods: The majority of network visualization methods are gradient-based (Simonyan et al. 2014; Zeiler and Fergus 2014; Springenberg et al. 2014; Mahendran and Vedaldi 2015). These approaches primarily compute gradients of a given CNN unit's score with respect to the input image and use them to estimate the image appearance that maximizes the unit score. Olah et al. (2017) presented a toolbox of existing ways to visualize patterns stored in various conv-layers of a pre-trained CNN.
• Up-convolutional net: Another common technique for visualizing CNN representations is the up-convolutional net (Nguyen et al. 2016). It is a tool that indirectly depicts the appearance of the image corresponding to a feature map. However, unlike gradient-based methods, it cannot mathematically guarantee that the visualization results exactly mirror the CNN representations. Later, Nguyen et al. (2017) added a learned prior, modeled by a GAN, to the synthesis. Note that CNN feature maps can be used as a prior for visualization.


2. Discovery of representations, which can either characterize a CNN's feature space for various types of objects or identify potential flaws in the way conv-layers represent things.
• Examining CNN features from a global perspective. Szegedy et al. (2015) studied filter semantics, and Yosinski et al. (2014) studied filter representations in intermediate conv-layers. In 2015, Aubry and Russell (2015) and Lu (2015) computed CNN feature distributions of several categories/attributes in a pre-trained CNN's feature space.
• The second research direction extracts image regions that directly contribute to a CNN's label/attribute representations, which is comparable to how CNNs are visualized. In 2017, Fong and Vedaldi (2017) and Selvaraju et al. (2017) suggested approaches that propagate feature-map gradients to the image plane to estimate such regions. Ribeiro et al. (2016) proposed the LIME model, which extracts network-sensitive image regions. Zintgraf et al. (2017), Kindermans et al. (2017), and Kumar et al. (2017) devised ways of visualizing areas in the input image that contribute the most to CNN decision-making. Wang et al. (2017) and Goyal et al. (2016) attempted to decipher the logic stored in NNs for visual question answering; this research used noteworthy items (or regions of interest) detected in images and key words in questions to explain output answers.
• Estimating susceptible locations in a CNN's feature space is another useful diagnostic method. Research contributions such as "One pixel attack for fooling DNNs" (Su et al. 2019), influence-function interpretability (Koh and Liang 2017), and the intriguing properties of NNs studied by Szegedy et al. (2013) developed approaches to compute adversarial samples for a CNN; they try to estimate the minimal perturbation of the input image that can influence the final prediction. Koh and Liang's (2017) influence functions can compute adversarial samples; they can also be used to construct training samples, correct the training set, and debug CNN representations.
• The fourth area of research analyzes network feature spaces to refine network representations. Lakkaraju et al. (2017) suggested a weakly-supervised strategy for finding blind spots (unknown patterns) in CNN knowledge. The approach divides the CNN's feature space into thousands of pseudo-categories and assumes a well-trained CNN would use each pseudo-category sub-space to represent a subset of an object class. The study randomly exhibited object samples in each sub-space and used sample purity to find probable problems with the CNN representations. Hu et al. (2016) proposed employing logic rules of natural language (e.g., I-ORG cannot follow B-PER) to design a distillation loss that supervises the knowledge distillation of DNNs, resulting in more meaningful network representations.
• Finally, Zhang et al. (2018) came up with a way to find out whether a CNN encodes biased information. Figure 4.3 from the paper illustrates biased CNN face estimations. A CNN may represent an attribute using visual characteristics that co-appear in the training images; when the co-appearing features are not semantically related to the target attribute, they are biased.


Fig. 4.3 Biased learning of CNN. Figure adapted from (Zhang et al. 2018) with permission

3. Encoded filter disentanglement: To simplify the interpretation of network representations, conv-layer representations are converted to more human-friendly forms, such as graphs and decision trees.
• Explanatory graphs: A more complete explanation of network representations is made possible by decomposing CNN features into human-interpretable graphical representations termed explanatory graphs. Each filter in a high conv-layer of a CNN, for instance, typically stands for a variety of pattern types; both the head and the tail of an object may activate the same filter. As a result, Zhang et al. (2018) proposed using a graphical model to represent the semantic hierarchy hidden inside a pre-trained CNN and thereby disentangle the features of its conv-layers. This is expected to provide a global view of how visual knowledge is organized in a CNN. The analysis focused on the following three questions:
a. How many different kinds of visual patterns does each convolutional filter of a CNN remember? (A visual pattern can be a part of an object or a texture.)
b. Which patterns work together to describe a part of an object?
c. What is the spatial relationship between two patterns that are co-activated?
4. Semantic disentanglement: At the semantic middle-to-end learning level, interactions between people and computers are expected to help build new models. In addition, middle-to-end learning of NNs with low supervision may be made possible through the clear semantic disentanglement of CNN representations. Methods such as active question-answering under light human supervision (Zhang et al. 2017) and middle-to-end learning or debugging of NNs using explainable or disentangled network representations may drastically reduce the need for human annotation.
5. Building interpretable models: Explore the concepts of interpretable CNNs (Shi et al. 2018), interpretable R-CNNs (Wu et al. 2017), and InfoGAN (Chen et al. 2016).
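The following is a minimal activation-maximization sketch of the gradient-based visualization idea from branch 1, assuming a torchvision VGG16 with pretrained weights; the layer index and channel number are arbitrary illustrative choices.

```python
import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").features.eval()
layer_idx, channel = 17, 42          # illustrative layer index and filter number

activation = {}
def hook(module, inp, out):
    activation["feat"] = out
model[layer_idx].register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(img)
    # maximize the mean activation of the chosen channel (minimize its negative)
    loss = -activation["feat"][0, channel].mean()
    loss.backward()
    optimizer.step()

result = img.detach().squeeze()      # visualize after normalizing to [0, 1]
```

In practice the result is usually regularized (jitter, blur, transformation robustness), which is what toolboxes such as the one by Olah et al. (2017) add on top of this bare loop.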


 THINK IT OVER
Do CNNs learn disentangled features? Disentangled features let each network unit pick up on a distinct real-world concept. For example, convolutional channel 394 could detect skyscrapers, channel 121 dog snouts, and channel 12 stripes at a 30° angle. A completely entangled network is the inverse of a disentangled one: in a fully entangled network there would be no individual unit for dog snouts; all channels would contribute to recognizing dog snouts.
The presence of disentangled features indicates that the network is highly interpretable. Assume we have a network with totally disentangled units labeled with well-known concepts. This would enable the network's decision-making process to be tracked. For example, we may look at how the network distinguishes wolves from huskies. First, we locate the 'husky' unit. We can then check whether this unit depends on the 'dog snout', 'fluffy fur', and 'snow' units of the previous layer. If it depends on the 'snow' unit, we know it is likely to misidentify a husky on a white background as a wolf. In a disentangled network we would thus be able to discover problematic non-causal linkages. To explain an individual prediction, we could automatically list all highly activated units and their associated concepts, and the NN's bias would be immediately detectable. Unfortunately, CNNs are not completely disentangled.

Exploring Network Representations with Layerwise Attribution
In contrast to the aforementioned methods, exploring network representations can be done most directly through the visualization of CNN representations. Furthermore, the network representation offers a technical basis for several methods of analyzing CNN representations, such as network diagnosis. Learning explainable network representations and decoupling feature representations from a pre-trained CNN are two additional problems for SOTA techniques. Finally, weakly-supervised middle-to-end learning begins with explainable or disentangled network representations.
Think about the heatmaps and pixel highlights mentioned above for input images, computed with respect to the output classification. The method is not robust to simple visual transformations such as contrast changes, channel splitting, and brightness changes. The meaning of each pixel is intertwined with, yet distant from, the high-level concepts of the target class. Secondly, traditional techniques offer a simple interface and do not delve into individual feature points. They selectively reveal attributes for a single class at a time, allowing us to consider the attribution as a building block for high-level feature categorization, such as cat paws, rather than retrieving whether the image is of a tiger cat. The earlier slicing of the activation cube in Fig. 3.12 into spatial positions, channels, and neurons analyzed individually leaves out some important pieces of the puzzle.


Fig. 4.4 Layer-wise convolution and pooling attribute visualization. The upper image is a schematic network configuration for training MNIST handwritten digit recognition. The lower image shows the neuron activation maps for this network configuration, obtained by passing inputs sequentially through the 1st and 2nd layers

We can certainly drill down to individual neurons, but that is too much detail to convey the entire story; it is difficult enough to convey the picture of channels before they break into neurons. Furthermore, as the number of layers in the network increases, as shown in Fig. 4.4, the feature encoding becomes more complex and harder for us to decipher. This is an excellent argument for linking the most functional neurons rather than overloading ourselves with information. Matrix factorization is a large field of study that looks at the best ways to break up matrices to extract useful information. Here, it flattens the activation cube and recovers the basic spatial positions and channels that are unique to each input image. The grouping must be computed separately for each input image. Unfortunately, each grouping is necessarily a trade-off between limiting the pieces to what humans can interpret and retaining knowledge, because any aggregation is lossy.


Matrix factorization allows us to choose what our groupings are optimized for, providing a more meaningful trade-off than the natural groupings highlighted earlier. A small factorization sketch is given below.
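A minimal sketch of factorizing an activation cube into a few neuron groups, in the spirit of the channel-grouping visualizations of Olah et al. (2018); the activations here are random stand-ins for a real conv-layer output, and non-negative matrix factorization (NMF) is one reasonable choice of factorization.

```python
import numpy as np
from sklearn.decomposition import NMF

H, W, C = 14, 14, 512
activations = np.abs(np.random.randn(H, W, C))     # non-negative, as NMF requires

flat = activations.reshape(H * W, C)               # flatten the spatial positions
n_groups = 6
nmf = NMF(n_components=n_groups, init="nndsvda", max_iter=500, random_state=0)
spatial_factors = nmf.fit_transform(flat)          # (H*W, n_groups): where each group fires
channel_factors = nmf.components_                  # (n_groups, C): which channels form it

heatmaps = spatial_factors.reshape(H, W, n_groups) # one spatial heatmap per group
print(heatmaps.shape, channel_factors.shape)
```

Each group yields a spatial heatmap plus a channel direction; the heatmap can be overlaid on the input, and the channel direction can be fed to feature visualization to name the group.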

4.1.4 Review: CNN Visualization Techniques

Deep CNNs have had a substantial impact on CV performance. The methodology began with image classification techniques and is now utilized for pixel-by-pixel image segmentation. Despite years of progress, classification methods are still used in areas where the problem cannot be expressed as a semantic segmentation task, or where pixel-wise labeling is unavailable or computationally expensive. Another constraint imposed by the computational budget of the architecture is the compression and resizing of the input data. One major problem with classification networks is the lack of visual output, which limits understanding of the dominant image aspects that contribute to the judgment. This has increased the need for approaches and strategies that depict and explain network decisions in a way humans can understand. As many algorithms have evolved over time, we pause here to comprehend and evaluate the prevalent strategies' benefits and drawbacks. Let us take it from the start.

4.1.4.1 Challenges with Signal Saliency Methods

Some of the black-box mechanisms may have been discovered via signal-based approaches. However, there are still many open questions, some of which are listed below.
1. How can we make use of the optimized activation images and the partially rebuilt ones?
2. To what extent may our potential newfound ability to approximately invert signals in order to reconstruct images contribute to an already-improved level of interpretability?
3. Will we be able to make use of the information contained in the intermediate process that reconstructs the approximate images?
4. What makes describing the network in this "inverse space" more helpful than explaining the forward propagation of signals?
5. How can studying the signals that lead to optimizing activation in the intermediate stages help us determine which neurons play which roles?
6. Non-unique solutions are notoriously produced by the optimization of heavily parameterized functions. Can we be certain that the optimization that produces a mix of surreal dog faces will not produce more unusual images with slight changes?
When we answer these questions, we may find hidden clues that will help us get closer to an AI that can be understood.


Rethinking image saliency
Adebayo et al. (2018) evaluated whether saliency approaches are insensitive to the model and the data. Insensitivity is highly undesirable because it implies that the "explanation" has nothing to do with the model or the data. Edge detectors are an example of a method that is insensitive to the model and the training data: they merely identify significant pixel color changes in images; they are not connected to a prediction model or to abstract visual properties, and they require no training. Vanilla Gradient, Gradient × Input, Integrated Gradients, Guided Backpropagation, Guided Grad-CAM, and SmoothGrad were the approaches tested in the paper. Vanilla Gradient and Grad-CAM passed the sanity checks, while Guided Backpropagation and Guided Grad-CAM failed. However, in their article "Sanity checks for saliency metrics," Tomsett et al. (2020) discovered some faults with the sanity-checks paper itself (of course). According to the authors' research, the evaluation metrics are inconsistent. Having said that, we can see more clearly that assessing visual explanations is still a challenge, which is why, for example, using DL applications in a diagnostic procedure with a patient is difficult for a medical professional. The current state of affairs is disappointing, and we need to hold off for a while until further studies are conducted on this matter. And, please, instead of coming up with brand-new saliency methods, let us focus on better ways to analyze the ones already out there. For instance, a DNN trained with adversarial training (Noack et al. 2019), a method that augments the training data with adversarial examples, is more interpretable (with more accurate saliency maps) than the identical model trained without adversarial examples.
What if saliency-based interpretability is called into question?
Saliency approaches have become increasingly popular in recent years as a model-agnostic way of bringing attention to important input qualities, most often in an image. While relying solely on visual assessment can be counterproductive due to the methodological difficulty of reaching the necessary breadth and depth of explanation, Adebayo et al. (2018) argue that this is not necessarily the case. A saliency approach determines which features of the input data are most important for making a prediction or for understanding the model's latent representation. Such a saliency map requires human review to determine its credibility. For example, if polar bears are invariably paired with snow or ice in digital images for whatever reason, the model may have mistakenly relied on this information rather than on actual polar bear characteristics to make the identification. Using a saliency map, we can locate the source of the problem and hence avoid it. Extensive randomization experiments revealed that some saliency techniques can be model- and data-independent (Adebayo et al. 2018), i.e., the saliency maps produced by some methods can be quite comparable to the results obtained by edge detectors.


This is troublesome since it indicates that these saliency approaches are not properly identifying the input features responsible for the model's prediction. In such situations, it is important to build a saliency approach that takes both the model and the data into account.

4.1.4.2 Case Study: Randomization Tests on Saliency

Adebayo et al. (2018) proposed randomized tests to validate the sanity of a saliency approach. The methodology can be thought of as a global scope of explanation that can be used to assess the suitability of any interpretability approach in general. The proposed randomization task can be summarized as follows:
1. The evaluation draws a parallel with edge detection, which does not depend on the training data or the model.
2. The randomization tests are centered on two instantiating frameworks that compare the usual experiment with an artificially randomized setup: a model parameter randomization test and a data randomization test.
3. The randomization tests on a large experimental setup demonstrate that widely used saliency approaches (i.e., Guided Backprop and its variations) are largely insensitive to both the training data and the model parameters employed in training.
 Note
The core concept of the saliency randomization test is based on:
1. Model Parameter Randomization Test: If the saliency approach relies on the learned parameters of the model, the saliency maps for a randomly initialized, untrained network and for a trained network should be considerably different.
2. Data Randomization Test: If the saliency technique depends on the data labels (i.e., p(y|x)), the saliency maps for a model trained on data with randomly permuted labels should differ from the maps created for the uncorrupted dataset.
A minimal sketch of the model parameter randomization test is given below.
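A minimal, illustrative sketch of the model parameter randomization test, assuming a plain gradient saliency map and a torchvision ResNet-18; a low correlation between the maps of the trained and re-initialized models is the behavior a sane saliency method should show.

```python
import copy
import torch
import torchvision.models as models

def gradient_saliency(model, image):
    """Plain gradient saliency: max over channels of |d(top-class score)/d(input)|."""
    image = image.clone().requires_grad_(True)
    score = model(image).max(dim=1).values.sum()
    score.backward()
    return image.grad.abs().max(dim=1).values        # (N, H, W)

trained = models.resnet18(weights="IMAGENET1K_V1").eval()
randomized = copy.deepcopy(trained)
for m in randomized.modules():                        # re-initialize all parameters
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()

x = torch.randn(1, 3, 224, 224)                       # stand-in for a real image
s_trained = gradient_saliency(trained, x)
s_random = gradient_saliency(randomized, x)

# low correlation between the two maps = the method passes this sanity check
corr = torch.corrcoef(torch.stack([s_trained.flatten(), s_random.flatten()]))[0, 1]
print(f"saliency correlation (trained vs. randomized): {corr:.3f}")
```

The data randomization test is analogous: retrain the same architecture on permuted labels and compare the resulting maps against those of the properly trained model.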


Look How Interpretable is LIME
Next, the popular interpretability technique LIME (Sect. 3.3.5) is used to explain the learning of image classification tasks. The basic notion is to understand why a deep learning network predicts that an object (image) belongs to a particular class (Chihuahua, pizza, or Persian cat as a top prediction in Fig. 4.5). The input sample is processed and resized to 299 × 299 to be suitable for the Inception V3 model. The prediction for each of the three mix-cropped shuffled sample images in the first column block is a vector of 1000 probabilities of belonging to each class of Inception V3. The top predictions for each image are shown to the left of the input samples with their confidence probabilities.
The number of influential superpixels used to visualize the positives in the first column of Fig. 4.5 is set to five. The next column contains the ten most influential features with a minimum thresholding weight of 0.05. The corresponding heat map for each target labeled class is presented in columns 2 and 3. For this subsection it is sufficient to consider simple image classification tasks and to interpret the predictions of popular convolutional recognition models such as InceptionV3 or ResNet-50. We have also considered a few self-captured images to stress the complexity of application in real-world scenarios. Further, we equally cropped the images and created shuffled montages of four and nine crops, as seen in Fig. 4.5. The output or top predicted classes are colored inside a decision boundary, with positive weights in green and negative weights in red. Each perturbation is scored by evaluating its distance from the original instance to be explained; these distances are converted to weights by mapping them to a normalized 0–1 jet color scale using a kernel function. To understand this better, the perturbations for an image are generated by turning some of its superpixels (created using the quick-shift algorithm, Table 4.2) on and off. The number of superpixels for a given image is shown in Fig. 4.8. The different combinations of superpixel activations, termed the perturbations of an image, are generated using a random binomial distribution given the input and output specification. Next, the predictions of each perturbed image over the 1000 ImageNet classes of Inception and ResNetV2 are estimated, which is the most computationally expensive step. The distance between each randomly generated perturbation and the image to be explained is measured using the cosine pairwise distance explained in Eq. A.4. The kernel width is decided based on the locality width around the instances and mapped to a zero-one value. The cosine metric yields a fairly stable expected distance value (between 0 and 1), so no further fine-tuning of the kernel is required. A diagrammatic explanation of various kernel functions is exhibited in Fig. 4.6. A rough sketch of this superpixel perturbation scheme is given below.
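A rough sketch of the superpixel on/off perturbation scheme just described, assuming quick-shift superpixels from scikit-image, a hypothetical `predict_fn` that returns class probabilities, and an RBF kernel on cosine distance; parameter values are illustrative.

```python
import numpy as np
from skimage.segmentation import quickshift

def lime_perturbations(image, predict_fn, num_perturb=150, kernel_width=0.25, seed=0):
    """image: RGB float array in [0, 1]; predict_fn: batch of images -> class probabilities."""
    rng = np.random.default_rng(seed)
    segments = quickshift(image, kernel_size=4, max_dist=200, ratio=0.2)
    n_segments = segments.max() + 1

    # each perturbation is a random binary mask over superpixels (1 = keep, 0 = switch off)
    masks = rng.binomial(1, 0.5, size=(num_perturb, n_segments))
    preds, weights = [], []
    for mask in masks:
        perturbed = image.copy()
        off = ~np.isin(segments, np.flatnonzero(mask))   # pixels of switched-off superpixels
        perturbed[off] = 0.0
        preds.append(predict_fn(perturbed[np.newaxis]).ravel())
        # cosine distance between the mask and the unperturbed (all-ones) mask,
        # mapped to a weight with an RBF kernel
        cos_sim = mask.sum() / (np.linalg.norm(mask) * np.sqrt(n_segments) + 1e-12)
        weights.append(np.exp(-((1.0 - cos_sim) ** 2) / kernel_width ** 2))
    return masks, np.asarray(preds), np.asarray(weights)
```

A weighted linear surrogate model fit on the pairs (mask, predicted probability of the target class), using these kernel weights, then scores each superpixel as positive or negative for the class, which is what Fig. 4.5 visualizes.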

Fig. 4.5 Positive and negative weights influencing the output classification


Fig. 4.6 Kernel function width for locality distance weights

4.1.5 Review: CNN Adversarial Techniques

There is a connection between feature visualization and adversarial examples: both techniques maximize the activation of an NN unit. For adversarial examples, we look for the maximum activation of the neuron of the adversarial, i.e. incorrect, class. One difference is the image we start with: for adversarial examples, it is the image for which we want to generate the adversarial counterpart; for feature visualization, it is usually noise, with the details depending on the approach. This section includes techniques for understanding adversarial attacks and for formulating defence mechanisms against such attacks in the future. The network's flaws are easily exposed by the algorithms discussed below, which make it classify an input as an entirely different class with high confidence. Adversarial examples can be either natural and organic in nature, or synthesized by a malicious attacker, where the image remains very similar to the original data but is purposely injected with noise so that it is classified as something completely different with high confidence. Hendrycks et al. (2019) have shown, using two natural adversarial test datasets, ImageNet-A and ImageNet-O, that classifier accuracy degrades significantly in SOTA models. A natural adversarial example may range from the entire image being mapped to a single class, to color and texture being predominant in the classification as opposed to shape as the primary descriptor. A synthetic adversarial example may have noise purposely added to the input sample to fool the model, or minor distortions of the original image that are indistinguishable to the human eye but cause the network to fail. This is not limited to adversarial image attacks; adding this kind of noise to audio recognition can likewise lead to wrong classifications. An example can be seen in Fig. 4.7, with the top-3 ImageNet predictions labeled for each sample image, where a cat with a fish in a tank is confidently classified as goblet (572) or toilet seat (861). There is also evidence from augmenting images by splitting channels (Fig. 3.17) or rotation (Figs. 3.19, 3.20, and 3.21) that the texture learning of the model is easily tweaked into classifying an object wrongly with a high confidence score. The sad part of the story is that adversarial training appears powerless to address the problem of natural adversarial examples (Hendrycks et al. 2019).


Fig. 4.7 Top-3 output class predictions for sample inputs on an ImageNet-pretrained network

Adversarial examples are not confined to images; natural language networks can also be fooled (Jia and Liang 2017). Luckily, training against specific synthetic adversarial attacks, such as Projected Gradient Descent (PGD) (Appendix A.8), Gradient Ascent (Nguyen et al. 2015), and the Fast Gradient Sign Method (FGSM) by Goodfellow et al. (2014), helps in handling such suspicious attacks. It is important to understand this notion and realize that models in high-risk domains will need to be highly robust to adversarial attacks for us to trust their performance. For ease of use, the mean and standard deviation of the ImageNet data have already been used to preprocess the images that these pre-trained models use. A minimal FGSM sketch is given below.
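A minimal FGSM sketch in PyTorch, assuming a torchvision ResNet-18 with pretrained weights; the input here is a random stand-in for a properly preprocessed image, and the label index is only illustrative.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def fgsm_attack(model, image, label, epsilon=0.01):
    """Single gradient-sign step: perturb the input along sign(d loss / d input)."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.detach()

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                 # stand-in for a preprocessed image
y = torch.tensor([281])                        # e.g. 'tabby cat' in ImageNet labels
x_adv = fgsm_attack(model, x, y, epsilon=0.03)

print("clean prediction:      ", model(x).argmax(1).item())
print("adversarial prediction:", model(x_adv).argmax(1).item())
```

PGD is essentially this step applied iteratively, with a projection back into an epsilon-ball around the original image after each step.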

4.1.6 Inverse Image Representation

Research has been conducted on inverting a variety of traditional CV representations, including HOG and dense SIFT (Vondrick et al. 2013), keypoint-based SIFT (Weinzaepfel et al. 2011), Local Binary Descriptors (d'Angelo et al. 2013), and BOW (Kato and Harada 2014). Modern inversion-based methods (Dosovitskiy and Brox 2016; Mahendran and Vedaldi 2015; Stone et al. 2017; Zeiler and Fergus 2014) probe a NN representation ω_i by inverting feature maps into a synthesized image, whereas the aforementioned methods are limited to inverting shallow feature representations. Specifically, ω_i = ω(x_i) is not an invertible model of a NN representation of an input image x_i, as noted by Mahendran and Vedaldi (2015). Subsequently, the inversion problem was defined as finding an image x′


whose NN representation best matches ω_i, i.e., x′ = arg min_x ‖ω(x) − ω_i‖² + R(x), where R(x) is a regularization term that encodes prior knowledge about the input image. The objective was to determine what information is missing by comparing the inverted image with the original. Dosovitskiy and Brox (2016) instead directly trained a new network to invert features of intermediate layers back to images, using features generated by the model of interest as input and the image as the label; they found that even deeper-layer features can be used to reconstruct contours and colors. Pairing the original CNN with a deconvolution network that includes unpooling, rectification, and deconvolution operations enables features to be inverted without training; Zeiler and Fergus (2014) carried out this research. Maximum locations are used for unpooling in the deconvolution network, negative values are set to zero for rectification, and transposed filters are used for the deconvolution layers.
Inverting representations, in particular representations based on CNNs, is connected to the previously well-studied subject of inverting NNs. In addition to backpropagation, further optimization algorithms based on sampling were proposed in the 1990s by Lee and Kil (1994), Linden and Kindermann (1989), and Lu et al. (1999). But these methods were not applied to the current generation of DNNs, and natural image priors were not used. Other papers, such as Jensen et al. (1999), specialized in inverting networks in the context of dynamical systems and will not be examined in depth here. Some academics, such as Bishop (1995), have also proposed learning a second NN to operate as the inverse of the original; this is complicated by the fact that the inverse is typically not unique. Finally, AE designs (Hinton and Salakhutdinov 2006) train networks together with their inverses as a form of supervision. A minimal optimization-based inversion sketch is given below.
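A minimal sketch of inversion by optimization in the spirit of the objective above, assuming a torchvision VGG16 truncated at an intermediate layer and total variation as the regularizer R(x); the layer cut-off and weighting are illustrative.

```python
import torch
import torchvision.models as models

features = models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # up to conv3_3

def total_variation(x):
    # simple image prior penalizing high-frequency noise
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

target_img = torch.rand(1, 3, 224, 224)              # the image whose code we invert
with torch.no_grad():
    target_code = features(target_img)               # w_i = w(x_i)

x = torch.rand_like(target_img, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.02)
for step in range(300):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(features(x), target_code) \
           + 1e-2 * total_variation(x)
    loss.backward()
    opt.step()

reconstruction = x.detach().clamp(0, 1)              # compare against target_img
```

Comparing the reconstruction with the original shows what information the chosen layer retains and what it discards.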

4.1.7 Case Study: Superpixels Algorithm

The preceding image class identification and classification examples in Fig. 4.7 are based on low-level image segmentation. It is therefore worth pausing to explore several prominent superpixel segmentation strategies (that is, over-segmentation). Superpixel generation also acts as the foundation for more complicated constructs such as Conditional Random Fields (CRFs). A CRF is a type of discriminative model, an abstraction based on Markov Random Fields (MRFs), which are best suited for prediction tasks that require contextual information or neighboring states to be weighted. These methods have a wide range of applications, including object detection, noise reduction, speech tagging, and many others. A chronological overview of four superpixel algorithms is given in Table 4.2, and a visualization sample is shown in Fig. 4.8 with function parameter labels; a short scikit-image sketch follows the table.


Table 4.2 Chronological overview of superpixel algorithms for low-level image segmentation

2004, Felzenszwalb's graph-based segmentation (Felzenszwalb and Huttenlocher 2004): A fast 2D image segmentation algorithm with a single 'scale' parameter influencing the segment size. The local contrast of the image decides the actual number and size of segments.

2008, Quickshift image segmentation (Vedaldi and Soatto 2008): Quickshift 2D image segmentation is based on an approximation of kernelized mean-shift that simultaneously computes a hierarchical segmentation on multiple scales. It is applied to a 5D space of color information and image location, making it a local mode-seeking algorithm. The parameter 'σ' controls the scale of the local density approximation, 'max_dist' selects the level in the produced hierarchical segmentation, and 'ratio' manages the trade-off between distance in image space and distance in color space.

2012, SLIC, K-means based image segmentation (Achanta et al. 2012): Closely related to Quickshift; simply computes K-means in the 5D space of image location and color information. The parameter 'compactness' trades color similarity against spatial closeness, similar to Quickshift, while 'n_segments' selects the number of k-means centers. It has gained popularity thanks to its efficiency and simpler clustering method. Note that the algorithm works well in Lab color space.

2014, Compact watershed segmentation of gradient images (Neubert and Protzel 2014): Watershed uses a grayscale gradient image rather than a color image as input. Bright pixels form high peaks and denote boundary regions; flooding the image landscape from the given markers until separate flood basins meet at the peaks creates distinct image segments. The additional parameter 'compactness' restricts the markers from flooding faraway pixels, making the watershed form more regularly shaped segments.
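A short sketch computing the four segmentations from Table 4.2 with scikit-image on one of its bundled sample images; parameter values are illustrative, not tuned.

```python
import numpy as np
import matplotlib.pyplot as plt
from skimage.data import astronaut
from skimage.segmentation import felzenszwalb, quickshift, slic, watershed, mark_boundaries
from skimage.filters import sobel
from skimage.color import rgb2gray
from skimage.util import img_as_float

img = img_as_float(astronaut()[::2, ::2])   # downsample for speed

segments = {
    "Felzenszwalb": felzenszwalb(img, scale=100, sigma=0.5, min_size=50),
    "Quickshift":   quickshift(img, kernel_size=3, max_dist=6, ratio=0.5),
    "SLIC":         slic(img, n_segments=250, compactness=10, start_label=1),
    "Watershed":    watershed(sobel(rgb2gray(img)), markers=250, compactness=0.001),
}

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, (name, seg) in zip(axes.ravel(), segments.items()):
    ax.imshow(mark_boundaries(img, seg))               # overlay segment boundaries
    ax.set_title(f"{name}: {len(np.unique(seg))} segments")
    ax.axis("off")
plt.tight_layout()
plt.show()
```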

4.1.8 Activation Grid and Activation Map

The last experiment in CNN network visualization aims to understand multiple layers together through group factorization and an activation grid. A summary of the factorization of activation groups used to classify an object is shown in Fig. 4.9. The experiment of Fig. 4.5 is reinterpreted in the form of groups of individual layers that indicate whether a particular kernel positively or negatively affects the network's output. The positive and negative effects on learning are shown by the thickness of the green and red lines between the kernel features. This helps interpret what the model weighs in each layer of the network and gives insight into adversarial weights and ways to optimize the dataset for better accuracy.
The next experiment investigates the effect of shuffling images on kernel activation, testing for spatial and texture learning as well as the impact on the top kernels for prediction. Figure 4.10 shows that image shuffling fails, to some extent, to activate similar kernels in each case; the shifting of the image will possibly distort the boundaries.


Fig. 4.8 Four methods are illustrated to compute superpixels

Fig. 4.9 Visual guide to factorized kernel grouping. Reproduced image from Olah et al. (2018) under Creative Commons Attribution (CC-BY 4.0)


Fig. 4.10 The activation grid aids understanding at each spatial position of the input sample. The image is cropped and shuffled to verify texture learning


However, the top-15 kernel activations for the four permutation shuffles of the images labeled A, B, C, and D contain a portion of patches that are not activated predominantly by the same activation grid. This leads us to the judgment that the network does not ultimately learn texture and spatial attributes as claimed: the model, with its local kernel attention, cannot capture the whole context of the image information. Future work will establish and interpret these results for various image and video data, and exploit the texture and attribute learning of convolutional NNs for improved explainability.

4.1.9 Convolution Trace

Now that we have a clear understanding of CNNs: alternating blocks of convolutional and max-pooling layers with piecewise non-linear activations, followed by a number of fully connected layers, form the base of any modern CNN. Any variation of the architecture, be it heterogeneous blocks of multi-scale convolution and pooling at each layer, multiple convolutions between pooling layers, or skip connections, comes with its own parameters and training procedures. Therefore, the intriguing question is which components of a convolutional deep NN are actually learning over time to achieve SOTA results in model testing. After reading Chap. 2, you will have realized that network learning during training means optimizing the weights and biases of the network for a specific task. This learning occurs in the convolutional (and fully connected) layers of the network, not in the max-pooling layers. Springenberg et al. (2014) even verified experimentally that, by substituting all layers of the network with convolutional layers of appropriate kernel size and stride, the model can achieve competitive or SOTA results on various object recognition tasks. As a result, we introduce the term Convolution Trace for CNNs as a parallel to the 'Memory Trace' defined for the Fuzzy Inference System (FIS) in Chap. 5. This is the addition of rules or advanced visualization methods specifically across each convolution layer, in parallel, before the pooling block, and connected via skip-connections. The firing of a set of rules based on specific thresholds shall help to understand the brain of the CNN. These rules can be fuzzy or computed with access to the kernel filters and weights, by training the network formally and retraining with the activation functions restricted to certain grids. The next action plan is to develop the model and produce a mathematical representation of it.

4.2 Interpretation in Autoencoder Networks

Information retrieved from the real world typically has a high degree of redundancy. This not only hinders the modeling of the representation, but also creates obstacles to computational efficiency. For example, look at the Swiss roll in Fig. 4.11.


The information was originally stored in three dimensions, but after being unrolled, only two were needed. This process is known as dimensionality reduction (precisely, dimensionality reduction via manifold learning). In this setting, it is assumed that high-dimensional data can be adequately represented by a lower-dimensional embedding. If we apply this idea to the image representation problem, we see that a space of smaller dimension is enough to adequately characterize the images in most datasets. The term 'latent space' is used to describe this kind of space. All data points should cluster together on this lower-dimensional manifold of the high-dimensional images. We have already learned in Chap. 2 that the goal of training an AE model is to reduce the difference (loss) between the original and the reconstructed image. Upon reaching convergence, a latent space embedding of the image is obtained as a vector. We shall visualize the latent space to examine the data embedded in the AE space. In an AE, we can visualize the latent space by embedding these latent vectors into two dimensions using t-SNE, as shown in Fig. 4.12; a minimal sketch of this embedding step is given below.
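A minimal sketch of the t-SNE embedding step behind Fig. 4.12; the latent vectors and digit labels are random placeholders standing in for the encoder outputs of an AE trained on MNIST.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

n_samples, latent_dim = 2000, 32
latent_vectors = np.random.randn(n_samples, latent_dim)   # replace with encoder(x)
labels = np.random.randint(0, 10, size=n_samples)         # replace with MNIST digit labels

# project the latent vectors to 2-D while trying to preserve pairwise distances
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(latent_vectors)

plt.figure(figsize=(6, 5))
sc = plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
plt.colorbar(sc, label="digit class")
plt.title("t-SNE of autoencoder latent space")
plt.show()
```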

4.2.1 Visualization of Latent Space

Essentially, a vanilla AE is a NN that is fed a picture and then re-creates the original. The image is processed by the network's encoder, which produces a latent vector (a vector that cannot be instantaneously reconfigured into an image) representing the original image in a lower-dimensional space (only a couple of floats instead of an entire image matrix). It is possible to "uncook" this latent vector back into the original image by feeding it into a decoder network. At the same time, we could generate our own latent vectors (draw random values) and feed them into the decoder section, resulting in a strange hybrid image that doesn't look like anything. Given that most latent vectors will be outside the data space we've fed into the network, we restrict the latent vectors generated by the encoder part to be selected from a unit Gaussian distribution in order to improve our odds of selecting acceptable latent vectors. By doing so, we know that the decoder will be able to make sense of the data sampled from the unit Gaussian distribution.

Fig. 4.11 An example of manifold dimensionality reduction. Image reproduced from (Lee and Verleysen 2007) with permission


Fig. 4.12 Embedded representation of MNIST data in lower dimension space using AE model learning. The cluster visualization is presented using t-SNE

There are two objectives, and therefore two losses. One is the latent loss, which penalizes the network if the latent vector deviates from the unit Gaussian distribution and is used to restrict the network's output. Since the AE also has to ensure that the output image is consistent with the input image, there is a second loss, the actual image loss. In combining them, the network must make trade-offs to achieve both the lowest possible latent loss (unit Gaussian distribution of latent vectors) and the lowest possible image loss (high similarity between input and output images). Only when the mean is zero and the standard deviation is one (a unit Gaussian) does the latent loss evaluate to zero (perfect). A minimal sketch of these two losses is given after the highlight box below.
 Highlight
First, we want to visualize a dataset meaningfully. In our example, the image (or pixel) space of the MNIST dataset has 784 dimensions (28 × 28 × 1). The problem is fitting all these dimensions into 2D or 3D for visualization in a plot. Here we have t-SNE, an algorithm that tries to preserve the distances between points while transforming data from a high-dimensional space into a lower-dimensional space of 2-D or 3-D. Keep in mind that t-SNE outperforms PCA and ICA when it comes to visualizing the data.
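A minimal sketch of the two losses described above for a VAE-style AE: a reconstruction (image) loss plus a latent loss that is zero only when the encoder outputs exactly a unit Gaussian. The encoder and decoder themselves are omitted.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    # image loss: how well the decoder reproduces the input
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # latent loss: KL divergence between N(mu, sigma^2) and the unit Gaussian N(0, 1)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    return recon + beta * kl

# e.g. with encoder outputs mu, log_var and decoder output x_recon:
x = torch.rand(16, 1, 28, 28)
mu, log_var = torch.zeros(16, 32), torch.zeros(16, 32)
print(vae_loss(x, x, mu, log_var))   # tensor(0.): perfect reconstruction and exact unit Gaussian
```

The `beta` factor is the usual knob for trading the two objectives off against each other.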


In other tasks, the interpretation of embedded or latent space features of generative unsupervised models can unearth previously hidden patterns in the input. The use of embedded feature interpretation has been primarily focused on image and text-based applications (Mikolov et al. 2013), but it is also increasingly being applied to genomic and biomedical fields (Ching et al. 2018). For example, Way and Greene trained a VAE on TCGA gene expression data (Weinstein et al. 2013) and used latent space arithmetic to quickly isolate and analyze high-grade serous ovarian cancer subtypes (Way et al. 2018); the VAE's latent features were most indicative of subtype-specific biological processes. Other methods interpolate unseen intermediate states using GAN-learned latent space embeddings.
 THINK IT OVER
The impact of MSE as a loss function for AEs! The mean squared error forces the network to pay special attention to pixel values where its estimate is inaccurate: when reconstructing, predicting 127 instead of 128 is unimportant, but confusing 0 with 128 is significantly worse. Unlike VAEs with a per-pixel likelihood, we apply a distance metric rather than predicting a probability per pixel value; this saves a lot of parameters and makes training easier. People normally report the summed squared error (SSE) averaged over the batch dimension to have a better intuition per pixel (any other metric involving mean/sum leads to the same result). However, there are disadvantages. MSE typically results in blurry images where fine noise or frequent patterns are removed because they create only a very low error. To ensure realistic image reconstruction, a GAN could be combined with the AE. Furthermore, comparing two images using MSE may not always reflect their visual resemblance. Say an AE reconstructs an image that has been shifted by one pixel to the right and bottom: despite the fact that the images are nearly identical, we can obtain a bigger loss than by predicting a constant pixel value for half of the image. A possible solution to this problem is to use a separate, pre-trained CNN and take the distance between visual features in its lower layers as the distance measure instead of the original pixel-level comparison.

4.2.2 Sparsity and Interpretation

We briefly discuss three topics of sparsity and interpretation of AEs:

1. Interpolation: Now that we know how much detail the model can extract, we can investigate the topology of the latent space. To do this, we compare the appearance of interpolation in the image space with interpolation in the latent space.
2. Latent space arithmetic: We can also perform mathematical operations in the latent space. That is, instead of interpolating, we can add or subtract latent space representations. In the case of faces, for example, "man with glasses" − "man without glasses" + "woman without glasses" = "woman with glasses". This approach produces astounding results (a minimal code sketch follows the box below).
3. Interpolation in pixel space: The structure of the pixel space is to blame for the sloppy transitions. Smooth transitions between images are impossible in the image space; this is why one cannot create the illusion of a half-full glass by blending a picture of an empty glass with a picture of a full glass.

THINK IT OVER
Why are autoencoders deemed unsuccessful? What options exist?
Autoencoders are important for some tasks, but they are not as necessary as we once imagined. About ten years ago, we anticipated that deep networks would not learn correctly if trained simply by backprop of the supervised cost. We assumed that deep networks, like AEs, would require an unsupervised cost for regularization. It was an AE that Google Brain used to build their first very big NN to recognize objects in photos (and it did not work very well compared to later efforts). Today, we know that we can recognize photos simply by using backprop on the supervised cost if there is enough labeled data. AEs are still used for certain tasks; however, they are not the essential ingredient for training deep NNs that they were once believed to be.
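The latent space arithmetic in item 2 can be sketched in a few lines; the encoder and decoder below are untrained stand-ins (hypothetical placeholders) for a trained AE, and the three random tensors stand in for semantically labeled face images.

import tensorflow as tf

latent_dim = 32

# Untrained stand-ins for a trained AE's encoder and decoder; with a real model
# the arithmetic below is applied to codes of semantically labeled images.
encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(latent_dim),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(28 * 28, activation="sigmoid", input_shape=(latent_dim,)),
    tf.keras.layers.Reshape((28, 28, 1)),
])

# Three images standing in for "man with glasses", "man", and "woman".
img_a, img_b, img_c = (tf.random.uniform((1, 28, 28, 1)) for _ in range(3))

# Latent space arithmetic: z_a - z_b + z_c, decoded back to image space.
z = encoder(img_a) - encoder(img_b) + encoder(img_c)
result = decoder(z)
print(result.shape)  # (1, 28, 28, 1)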

4.2.3 Case Study: Microscopy Structure-to-Structure Learning

It is fascinating to observe through experimentation that cross-microscopy learning has a wider application in the imaging domain. Subcellular specimens differ in characteristics, shape, and mobility across cells, and the plethora of nanoscopic specimens provides a large amount of data for processing. The idea was to investigate whether a DNN trained on one specific structure could transfer its learning to a new structure with a minimal amount of training data. Figure 4.13 shows that it is hard to obtain a fruitful result when attempting to transfer the learning from mitochondria to vesicles. In the later experiment, however, structure-to-structure learning from vesicles to mitochondria was successful (Fig. 4.14).

Fig. 4.13 Cross-structure learning from a network trained on the thread-like mitochondria dataset to the dot-like vesicles dataset testing. Row 1 contains the original ground-truth data, while Row 2 contains the unsuccessful network-generated data

Fig. 4.14 Cross-structure learning from a network trained on the minuscule vesicles dataset to the mitochondria dataset testing. Row 1 contains the original ground-truth data, while Row 2 contains the successful network-generated data

These results are possibly due to the minute shape of the vesicles, which could be learned well enough to generate a convincing translation. The same approach can, in a straightforward way, be combined with other microscopy modalities. Figure 4.13 depicts the failed case, where mitochondria training is applied to label vesicles in nanoscopic imaging, whereas Fig. 4.14 shows that the model trained on vesicles provided significant results when its learning was transferred to mitochondria. A possible interpretation from a physics point of view is that training on the very minute structure of vesicles is what made the cross-structure transfer to mitochondria work. Understanding the model ultimately requires understanding what features the corresponding layers have learned; this would allow us to tweak the system and make it more robust, improving the performance and reliability of the model.

The latter is crucial in cutting-edge transfer learning methods that minimize computational overhead while maximizing temporal efficiency. On the basis of these considerations, we concluded that we needed to dig deeper into the literature to understand the approaches of interpretability and their effect on the models.

4.3 Interpretation in Adversarial Networks

THINK IT OVER
How can we secure our DL systems from adversarial examples? A few proactive techniques are:
1. Adversarial training: Iteratively retraining the classifier with adversarial examples.
2. Regularization: Learning invariant transformations of features, or resilient optimization based on game theory.
3. Ensemble: Using numerous classifiers rather than just one and having them vote on the prediction, although this offers no guarantee of success because they may all suffer from the same adversarial cases.
4. Gradient masking: Producing a model without useful gradients, e.g., by employing a nearest-neighbor classifier instead of the original model; this strategy also does not work well in practice.

The level of familiarity an attacker has with the system is a useful way to categorize different kinds of attack.
• The attackers may have complete knowledge (white-box attack), in which case they have access to the model's parameters, training data, and feature representation;
• they may have partial knowledge (gray-box attack), in which case they have access to the model's feature representation and type but not the training data or parameters; or
• they may have no knowledge (black-box attack), in which case they can access the model only through queries.
Depending on the available information, attackers can mount different kinds of attack against the model. Hiding knowledge about the data and the model is not enough to guard against attacks: as the examples show, adversarial examples can be constructed even in the black-box setting.

Intuition goes wrong
The following is a hypothetical situation: my excellent picture classifier is now available to you via a Web API.

The model predictions are available to you, but you cannot modify the model settings in any way. You can send data from the comfort of your couch, and my service will respond with the corresponding classifications. Since most adversarial attacks rely on having access to the gradient of the underlying DNN in order to find adversarial examples, one might expect them to be ineffective in this setting. In 2017, however, Papernot and colleagues (Papernot et al. 2017) demonstrated the ability to generate adversarial instances without knowledge of the internal model or the training data. Such a black-box assault is a form of (nearly) zero-knowledge attack. It works like this:
1. To attack a classifier, first collect a small set of images from the same domain as the training data; for example, if the classifier recognizes digits, use images of numeric symbols. Domain knowledge is necessary, but access to the training data is not.
2. Get the black-box predictions for the current set of images.
3. Train a surrogate model (say, a NN) using the current images as training data.
4. Generate a fresh collection of synthetic images using a heuristic that analyzes the existing set of images to identify the direction in which to alter the pixels so as to increase the output variability of the model.
5. Repeat steps 2-4 for a predefined number of epochs.
6. Use the fast gradient method to generate adversarial examples for the surrogate model. The goal of the surrogate model is to approximate the decision boundaries of the black-box model.
7. Use the adversarial examples to attack the original model.
The authors tested their method by attacking image classifiers trained with various cloud ML services. These services train image classifiers on the images and labels provided by their users; the model is trained and then automatically deployed, often using an algorithm that is hidden from the user. Predictions can be requested for images supplied to the classifier, but the model itself is not available for review or download. The authors discovered adversarial examples for a wide range of service providers, with up to 84% of those examples being misclassified. Interestingly, this technique is applicable even if the target black-box model is not an NN, including decision trees and other ML models that do not rely on gradients.
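Step 6 above relies on the fast gradient (sign) method. A minimal sketch, assuming an untrained Keras classifier as the surrogate, could look as follows; fgsm and surrogate are illustrative names, not part of any particular library.

import tensorflow as tf

# Untrained surrogate classifier standing in for the model fit in steps 2-5.
surrogate = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def fgsm(model, image, label, eps=0.1):
    # Fast gradient sign method: perturb the input in the direction that
    # increases the loss, then clip back to the valid pixel range.
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        loss = loss_fn(label, model(image))
    grad = tape.gradient(loss, image)
    return tf.clip_by_value(image + eps * tf.sign(grad), 0.0, 1.0)

x = tf.random.uniform((1, 28, 28, 1))
y = tf.constant([3])
x_adv = fgsm(surrogate, x, y)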

4.3.1 Interpretation in Generative Networks

GANs are gaining popularity in DL. Compared to variational AEs, GANs can handle sharp estimated density functions, efficiently produce the required samples, and eliminate deterministic bias and compatibility issues with the internal neural architecture (Goodfellow 2016). And yet GANs are not perfect either: foremost among their limitations is the fact that they are difficult to train and assess.

When it comes to training difficulty, it is common for the generator to fail to learn the whole distribution of the dataset, and it is also non-trivial for the discriminator and generator to reach a Nash equilibrium during training; this is the classic mode collapse problem. Since DL schemes and GAN approaches both eliminate hand-crafted features and objective functions, they remain among the most promising methods in data science. GAN-based I2I research in CV has produced a variety of learning models with a wide range of applications and favorable results, and there has been a lot of research in this area (Qiu et al. 2017; Krizhevsky et al. 2009; Yoshida and Miyato 2017; Karras et al. 2018). The key assessment challenge is developing a metric to determine to what degree the generated distribution p_g deviates from the actual target distribution p_t. Estimating p_t with any degree of precision is currently not feasible; therefore, it is difficult to generate accurate estimates of the relationship between p_t and p_g.
Figure 4.15 presents a well-structured taxonomy of the progress in the overall GAN architecture along with the different variants of the loss function that have evolved over time.
• The network architecture category highlights enhancements or modifications to the overall GAN architecture.
• The latent space category highlights architectures that are modified based on various representations of the latent space.
• Application-focused refers to changes made in response to specific applications.
• Loss types are the types of loss functions that can be optimized for GANs.
• Regularization includes an additional penalization built into the loss function or any type of network normalization process.
To be more specific, the loss function can be transformed using an Integral Probability Metric (IPM), in which the discriminator is restricted to a predetermined set of functions. We will not delve into the history of GANs and the definition of the various loss functions; our goal is to gain an appreciation for how different GAN structures can be interpreted by the DL community as a whole. Figure 4.16 highlights the tremendous improvement in GAN image generation over the past decade. A significant portion of GAN applications are used for artistic creation, but a substantial number of models are also employed to generate synthetic data and simulate adversarial examples. In addition to admiring the model's advancements, it is time to comprehend the underlying bias in data collection and the model's latent space encoding in order to build a reliable system. To begin, Table 4.3 lists a few interpretation approaches in the supervised, unsupervised, and zero-shot learning domains (the latter with fewer data resources) (Zhou 2022).

Fig. 4.15 Taxonomy for recent developments of GANs in CV. Reproduced from Wang et al. (2021) with permission

Fig. 4.16 Timeline of significant advances in image generation using GAN over the last decade. Adapted from Zhou (2022) (CC-BY 4.0)

4.3.2 Interpretation in Latent Spaces

Some approaches facilitate the discovery of interpretable directions in the latent space, i.e., controlling the generation process by iteratively adjusting the latent signal z in a desired direction τ with step α via the vector arithmetic z' = z + ατ. Finding these directions can be done in a variety of ways, with or without human supervision, and some new approaches propose to skip training and optimization altogether by simply computing the interpretable directions in closed form from the pre-trained models. This can be performed as follows:
1. Supervised Setting: Existing supervised learning-based approaches randomly sample latent codes, synthesize the corresponding images, extract statistical image information (e.g., color variations) (Plumerault et al. 2020), and annotate them with predefined labels by introducing a pre-trained classifier (e.g., prediction of face attributes or light directions). Among the methods of the 2020s are Abdal et al.'s (2021) StyleFlow, Goetschalckx et al.'s (2019) Ganalyze, the GAN steerability work by Jahanian et al. (2019), and Shen et al. (2020). Shen et al. use off-the-shelf classifiers to evaluate the face representation learned by GANs and predict semantic scores for synthetic images. Abdal et al. use Continuous Normalizing Flows (CNFs) to learn a semantic Z-to-W mapping. Both methods require attributes (usually from a face classifier network) that may be hard to obtain for fresh datasets and require manual labeling.
2. Unsupervised Setting: Because the sampled codes and synthesized images used as supervision differ in each sampling, the supervised setting introduces bias into the experiment and may result in different discoveries of interpretable directions, as claimed by Shen and Zhou (2021). It also substantially limits the range of directions that present techniques can find, particularly when labels are absent.

Table 4.3 A few interpretation approaches and the associated challenges with GANs (columns: interpretation approach, features, challenges)

Supervised approach: use labels or trained classifiers to probe the representation of the generator
  Features:
  • GAN dissection: aligning semantic segmentation with GAN feature maps (Bau et al. 2018)
  • Probing the latent space with a linear classifier (Yang et al. 2021)
  • InterFaceGAN: probing the latent space of face GANs with a linear classifier (Shen et al. 2020)
  • StyleFlow: StyleGAN + flow-based conditional model (Abdal et al. 2021)
  Challenges:
  • How to expand the annotated dictionary size?
  • How to further disentangle the relevant attributes?
  • How to align the latent space with image region attributes?

Unsupervised approach: identify the controllable dimensions of the generator without labels/classifiers
  Features:
  • SeFa: closed-form factorization of the latent space in GANs (Shen and Zhou 2021)
  • GANspace: PCA applied to the latent space of StyleGAN (Härkönen et al. 2020)
  • Hessian penalty: a weak prior for unsupervised disentanglement (Peebles et al. 2020)
  • EigenGAN: layer-wise eigen-learning for GANs (He et al. 2021)
  Challenges:
  • How to evaluate the results?
  • How to annotate each disentangled dimension?
  • How to improve the disentanglement in GAN training?

Zero-shot approach: align language embeddings with generative representations
  Features:
  • StyleCLIP: CLIP + StyleGAN (Patashnik et al. 2021)
  • Paint by word: CLIP + region-based StyleGAN inversion (Bau et al. 2021)
  • Massive data-driven: OpenAI DALL.E (Ramesh et al. 2021)

Other approaches mentioned in the table: counterfactual explanations; Shapley values

Furthermore, the individual controls identified by these approaches are frequently entangled, affecting numerous attributes at once, and are often non-local. As a result, certain algorithms (Härkönen et al. 2020; Cherepkov et al. 2021; Voynov and Babenko 2020) try to uncover interpretable directions in the latent space in an unsupervised manner, i.e., without the need for paired data. Härkönen et al. (2020), for example, develop interpretable controls for image synthesis by identifying relevant latent directions using PCA in the latent or feature space. The derived principal components correlate with specific properties, and their selective application enables the manipulation of numerous image attributes. This method is deemed unsupervised because the directions can be determined with PCA without the need for labels; however, annotating these directions with the target properties and determining which layers they should be applied to still requires manual intervention and oversight. In contrast, Jahanian et al. (2019) use self-supervised optimization to optimize trajectories (both linear and non-linear). Given an inverted source image G(z), they learn w in Eq. 4.1 as:

w* = arg min_w E_{z,α} [ L(G(z + αw), edit(G(z), α)) ]     (4.1)

where L represents the distance between the generated image G(z + αw) and the target image edit(G(z), α) obtained by taking an α-step in the latent direction. This method is considered "self-supervised" because the target image edit(G(z), α) can be derived from the source image G(z).
3. Closed-form Solution: Obtaining interpretable directions for image synthesis in closed form, without training or optimization, has recently been demonstrated by a number of approaches (Spingarn et al. 2020; Shen and Zhou 2021; Wei et al. 2021). Semantic factorization using singular value decomposition of the first-layer GAN weights is proposed by Shen and Zhou (2021). They notice that the latent direction n, which does not depend on the sampled code z, determines the semantic transformation of an image, typically achieved by shifting the latent code in a given direction, z' = z + αn. To find the directions n that can significantly alter the output image y, they devised a Semantics Factorization (SeFa) technique, defined as Δy = y' − y = (A(z + αn) + b) − (Az + b) = αAn, where A and b denote the weight and bias of certain layers in G. The resulting formula, Δy = αAn, implies that the weight parameter A should contain the essential information about the variations of the image and suggests that the desired edit in direction n can be achieved by adding the term αAn to the projected code. This allows the problem of discovering latent semantics to be formulated as the factorization problem in Eq. 4.2:

n* = arg max_{n ∈ R^d : n^T n = 1} ||An||_2^2     (4.2)

The closed-form factorization of latent semantics in GANs, i.e., the directions n*, corresponds to the eigenvectors of the matrix A^T A. Instead of using a single layer of the generator to decide which directions are interpretable for image synthesis, as SeFa does, a method based on orthogonal Jacobian regularization can be applied to multiple levels of the generator (Wei et al. 2021).
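A minimal numerical sketch of the closed-form idea follows, with a random matrix standing in for the pre-trained generator's first-layer weight A; in a real setting A would be read from the trained model (e.g., a StyleGAN affine layer).

import numpy as np

latent_dim, out_dim = 512, 1024

# Stand-in for the first-layer weight A of a pre-trained generator.
A = np.random.randn(out_dim, latent_dim)

# SeFa-style closed form: the most influential directions n maximize ||A n||^2
# subject to ||n|| = 1, i.e., the top eigenvectors of A^T A.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
directions = eigvecs[:, ::-1][:, :5]  # top-5 directions, each of shape (latent_dim,)

# Edit a sampled latent code along the leading direction: z' = z + alpha * n.
z = np.random.randn(latent_dim)
alpha = 3.0
z_edit = z + alpha * directions[:, 0]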

4.3.2.1 Disentanglement of Latent Spaces

The decision on which latent space to embed the image in is a key design choice, regardless of the GAN inversion technique used. To be effective, a latent space should be disentangled and simple to embed into. A latent code in such a space has two desirable properties: it reconstructs the original image faithfully and photorealistically, and it makes subsequent image manipulation easier. Beginning with the classic Z space and progressing to the SOTA P space, this section provides an overview of latent space analysis and regularization efforts. We take the example of the StyleGAN/StyleGAN2 generation process, which involves a number of latent spaces; image editing is mostly performed in the W+ space. Let us discuss the different latent spaces.
1. Z latent space, also known as the space of latent codes or latent representations, typically consists of normally distributed random noise vectors z ∈ Z. This is applicable in almost all unconditional GAN models. On the other hand, the constraint that Z follows a normal distribution limits its representation capacity and the disentanglement of semantic features.
2. W latent space is the intermediate space generated by transforming the Z space with a succession of fully connected layers. According to Karras et al. (2019), this reflects the disentangled nature of the learned distribution. The latent spaces used in StyleGANs are heavily utilized in recent GAN inversion approaches.
3. W+ latent space feeds a different intermediate latent vector w to each of the generator's layers, and is used in cases like style mixing or image inversion. The expressiveness of the W space remains limited, restricting the spectrum of images that can be faithfully reproduced. As a result, some works (Abdal et al. 2019, 2020) make use of another, layer-wise latent space fed into each of the generator's layers via Adaptive Instance Normalization (AdaIN) (Huang and Belongie 2017). Inverting images in the W+ space reduces distortion but also results in reduced editability.
4. S latent space is the style space spanned by the channel-wise style parameters s, with each layer of the generator applying a different learned affine transformation to w ∈ W. It is proposed to achieve greater spatial disentanglement beyond the semantic level. The intrinsic complexity of style-based generators (Karras et al. 2019) and the spatial invariance of AdaIN normalization (Huang and Belongie 2017) are the primary causes of spatial entanglement.
5. P latent space, proposed by Zhu et al. (2020), where the final leaky ReLU has a slope of 0.2 and the transformation from W to P space is x = LeakyReLU_{5.0}(w),

Fig. 4.17 Latent spaces of a GAN's generator

Table 4.4 Summary of applying a pre-trained GAN model to image processing tasks
Generative prior | Formulation
GAN inversion | x̂ = argmin_x ||G(x) − I||
Colorization | x̂ = argmin_x ||rgb2gray(G(x)) − I_gray||
Super-resolution | x̂ = argmin_x ||down(G(x)) − I_small||
Masked optimization | x̂ = argmin_x ||m ⊙ G(x) − m ⊙ I_context||

where w and x are latent codes in the W and P space, respectively. They begin with the basic assumption that the joint distribution of latent codes is roughly a multivariate Gaussian distribution, and then propose the PN space to minimize dependency and redundancy. PCA whitening achieves the transformation from P to PN space; it normalizes the distribution to have zero mean and unit variance, resulting in an isotropic space in all directions.
6. P+ latent space is an extension of the PN space in which each latent code is used to demodulate the corresponding StyleGAN feature maps at different levels.
Figure 4.17 shows the case of StyleGAN with its several latent spaces. Table 4.4 summarizes the basic modification of the generative priors for various sampling, inversion, and reconstruction tasks (Fig. 4.18).
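The GAN inversion formulation in Table 4.4 can be sketched as a direct latent optimization; the toy generator below is untrained and stands in for a pre-trained G, and in practice a perceptual term is usually added to the pixel loss.

import tensorflow as tf

latent_dim = 64

# Untrained toy generator standing in for a pre-trained G.
G = tf.keras.Sequential([
    tf.keras.layers.Dense(32 * 32 * 3, activation="sigmoid", input_shape=(latent_dim,)),
    tf.keras.layers.Reshape((32, 32, 3)),
])

target = tf.random.uniform((1, 32, 32, 3))     # the image I to invert
z = tf.Variable(tf.random.normal((1, latent_dim)))
opt = tf.keras.optimizers.Adam(learning_rate=0.05)

# Minimize the reconstruction distance ||G(z) - I|| with respect to the code z.
for step in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(G(z) - target))
    opt.apply_gradients([(tape.gradient(loss, z), z)])

The colorization, super-resolution, and masked variants in Table 4.4 simply wrap G(z) and the target with the corresponding operator (rgb2gray, downsampling, or an element-wise mask) before computing the distance.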

4.3.3 Evaluation Metrics

GAN inversion approaches are evaluated on several criteria, including the photorealism and faithfulness of the reconstructed image and the editability of the inverted latent code. This motivates us to discuss some of these criteria below:

Fig. 4.18 The architectures of SGAN, CGAN, BiGAN, InfoGAN, BEGAN, and AC-GAN. Adapted from (Wang et al. 2021) with permission

• Photorealism: Photorealism in GAN-generated images is typically evaluated using the Inception Score (IS), the Fréchet Inception Distance (FID), and the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al. 2018). Metrics such as the Fréchet Segmentation Distance (FSD) (Bau et al. 2019) and the Sliced Wasserstein Discrepancy (SWD) (Rabin et al. 2011) have also been used to assess the perceptual quality of images. IS (Salimans et al. 2016) is a widely used way to evaluate the quality and diversity of GAN-generated images; it computes statistics of synthesized images using the Inception-v3 network (Szegedy et al. 2016) pre-trained on ImageNet (Deng et al. 2009), and a higher score is preferable. FID (Heusel et al. 2017) is the Fréchet distance between feature vectors of real and generated images taken from the Inception-v3 pool3 layer; lower FID values suggest higher perceptual quality. LPIPS (Zhang et al. 2018) evaluates the perceptual quality of an image using a VGG model trained on ImageNet; a lower score indicates higher image patch similarity.
• Faithfulness: Faithfulness is the degree to which the generated image is visually comparable to the original; the similarity between the images serves as a rough approximation. PSNR and SSIM are the two most commonly used measurements. PSNR is a widely used criterion for reconstruction quality and is defined by the maximum possible pixel value and the MSE between the images. SSIM compares images using independent measurements of brightness, contrast, and structure; Wang et al. (2004) defined these terms in the article "Image quality assessment: from error visibility to structural similarity". Other metrics, such as the Mean Absolute Error (MAE), MSE, and RMSE, are examples of pixel-wise reconstruction distances used in some approaches.

• Editability quantifies the degree to which the inverted latent code can be modified with respect to the characteristics of the generator's output image. It is impossible to assess the editability of a latent code directly. Existing techniques evaluate certain qualities between the input x and output x' (i.e., adjusting the target attribute while leaving others unaffected) using cosine or Euclidean distance (Nitzan et al. 2020) or classification accuracy (Voynov and Babenko 2020), and are primarily concerned with the editability of face data and facial features. Nitzan et al. (2020), for example, employ cosine similarity to compare the accuracy of facial expression preservation, computed from the Euclidean distance between the 2-D landmarks of x and x'; the preservation of the pose, on the other hand, is measured as the Euclidean distance between the Euler angles of x and x'. Abdal et al. (2021) establish an edit consistency score (regressed by an attribute classifier) to assess consistency across modified face images, assuming that different permutations of edits should have the same attribute score when categorized by an attribute classifier. These approaches assess the quality of the modified images by measuring the retention of facial identity. We should remark that the methods described above may not be applicable to image classes other than faces.
• Subjective metric: Human raters are often used in subjective image quality assessment, with scores ranging from 1 (very poor) to 5 (excellent). The Mean Opinion Score (MOS) or the Differential Mean Opinion Score (DMOS) are common metrics for summarizing these ratings. A common task in user studies is to have participants select the image that best answers the research question from a set of three options (the original, the baseline result, and the result of the recommended approach). For example, a question may read, "Which of the two modified versions of this image is more realistic?" or "Which of the two edited versions of this image better preserves the identity of the individual in the original image?". The final percentage of votes represents the level of support for the suggested technique compared to the control. The non-linear scale of human judgment, the possibility of bias and variance, and the high human cost are all disadvantages of these measures.
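The pixel-level faithfulness metrics above are easy to compute with TensorFlow's built-in image ops; the tensors below are random stand-ins for original and reconstructed image batches.

import tensorflow as tf

# Two batches of images standing in for originals and reconstructions,
# with pixel values in [0, 1].
original = tf.random.uniform((4, 64, 64, 3))
reconstructed = tf.clip_by_value(original + tf.random.normal((4, 64, 64, 3), stddev=0.05), 0.0, 1.0)

psnr = tf.image.psnr(original, reconstructed, max_val=1.0)   # higher is better
ssim = tf.image.ssim(original, reconstructed, max_val=1.0)   # closer to 1 is better
mae = tf.reduce_mean(tf.abs(original - reconstructed), axis=[1, 2, 3])

print(psnr.numpy(), ssim.numpy(), mae.numpy())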

Highlight
Normalization in Generator and Discriminator
GANs are prone to escalating signal magnitudes because of the unhealthy competition between the two networks. Most, if not all, earlier methods discouraged this by employing batch normalization in the generator and discriminator (Ioffe and Szegedy 2015; Salimans and Kingma 2016). Originally, these normalization strategies were developed to eliminate covariate shift. However, covariate shift has not been observed to be a problem in GANs, which suggests that the actual need in GANs is constraining signal magnitudes and competition. Karras et al. (2018) proposed two ingredients, neither of which has learnable parameters, as a solution.
• Equalized learning rate: In contrast to the prevalent practice of carefully initializing weights, we can start with a trivial N(0, 1) initialization and then scale the weights explicitly during training. To be more specific, we use the Kaiming initialization (He et al. 2015) and set ŵ_i = w_i / c, where w_i are the weights and c is the per-layer normalization constant. The advantage of performing this on-the-fly rather than at initialization is somewhat subtle: popular adaptive SGD methods such as RMSProp (Tieleman and Hinton 2012) and Adam (Kingma and Ba 2014) normalize a gradient update by its estimated standard deviation, making the update independent of the parameter's scale. Consequently, if some parameters have a larger dynamic range than others, they take longer to adjust; because of the way modern initializers work, a learning rate can therefore be simultaneously too large and too small. Explicit weight scaling guarantees that all weights have the same dynamic range and, by extension, the same learning speed. Laarhoven (2017) independently arrived at a similar conclusion.
• Pixelwise feature vector normalization in the generator: After each convolutional layer in the generator, we can normalize the feature vector at each pixel to unit length to prevent the magnitudes from spiraling out of control due to competition.
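A minimal sketch of such pixelwise feature vector normalization (often called PixelNorm), written as a Keras layer, is shown below; it is a generic re-implementation of the idea, not the authors' exact code.

import tensorflow as tf

class PixelNorm(tf.keras.layers.Layer):
    """Normalize the feature vector at every pixel to unit average magnitude."""
    def __init__(self, epsilon=1e-8):
        super().__init__()
        self.epsilon = epsilon

    def call(self, x):
        # x has shape (batch, height, width, channels); divide each pixel's
        # feature vector by its root mean square over the channel axis.
        return x * tf.math.rsqrt(tf.reduce_mean(tf.square(x), axis=-1, keepdims=True) + self.epsilon)

# Example: apply after a convolution in the generator.
features = tf.random.normal((2, 16, 16, 128))
normalized = PixelNorm()(features)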


Fig. 4.19 Illustration of feature weighted learning loss. Figure reproduced from (Somani et al. 2021) with permission

4.3.4 Case Study: Digital Staining of Microscopy Images

Somani et al. (2021) aimed to provide a framework that can be used to build digital fluorescence equivalents from brightfield microscopy images of cell specimens. GANs, a promising subfield of DL, have demonstrated success in a variety of data-driven computational tasks, including solving inverse problems in microscopy imaging such as super-resolution and localization microscopy, as well as cell simulation reconstruction. The use of data-driven image translation to perform digital staining of label-free tissue samples, so that mitochondrial activity in the cell can be observed without staining or disturbing the cell, offers new possibilities. Reconstructing the number (amount), position (geometry), and intensity (photometry) of the generating light sources amounts to solving this image translation problem. In general, based on the relationship between the input and the target images, there are two major groups of methods and applications:
1. Unsupervised/unpaired/unaligned/unregistered approaches, such as style or color changing.
2. Supervised/paired/aligned/registered approaches, such as supervised segmentation and labeling, denoising, and super-resolution.
An aligned input/output image translation is the problem at hand. What stands out about the application is that a black-box model performs complicated feature learning that goes beyond manual human labor. This abstraction can do far more good than harm, but comprehending the black box is essential if it is to be used in vital realms involving life and death. The relationship between microstructure and performance is typically non-linear and costly to evaluate, making microstructure optimization difficult. The paper details the implementation of a deep CNN architecture using a cGAN to translate unstained embryonic heart tissue to its corresponding mitochondrial monomeric Red Fluorescent Protein (mCherry) and enhanced Green Fluorescent Protein (eGFP) stained images. The network learns to match fluorescent-stained microscopy images of label-free tissue slices with their brightfield microscopy counterparts. Quantitative comparison of the generated virtually stained images with really stained tissue slices demonstrates that a DL-based digital staining approach produces results comparable to the manual fluorescent staining procedure.

We might discover that the model's adaptation to the background is often the most difficult barrier in this type of image translation. Stained fluorescence imaging (Fig. 4.19) shows a murky mitochondria-labeled foreground and a significantly larger unlabeled background. The foreground has high intensity values, whereas the background contains pixels with varying intensities, which we can regard as noise. The goal is to optimize the image-to-image (I2I) translation process so that it generates labeled images from label-free brightfield imaging. The obvious strategy would focus the model on the foreground, but in the label-free image the background is also significant. The model learned to produce a stained image by conditioning on the low-intensity zone; learning there is less demanding, although precision is still required in some areas. The cost function of the generator can compute the L1 loss between the generated and actual image using a weighted image map that encodes foreground and background information, with the weights of the map calculated using a loss-modeling technique. The next step in this process is to learn how to use the model in a specific application. Understanding why the proposed loss described in Fig. 4.19e performs better than the plain L1 loss in (c) can itself be seen as a form of interpretation of the model's success. The cutting-edge method is competitive in its application; however, some basic knowledge of how the network evaluates the loss and how it accesses features in the image went a long way toward developing a loss function that is successful in practice and has outperformed the SOTA technique by a wide margin in various discriminator loss categories. It paves the way for future interpretability strategies that assist the model in performing accurately while also gaining trust in its behavior.
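The feature-weighted idea can be sketched as a weighted L1 loss; this is a simplified illustration under our own assumptions (a binary foreground mask and hand-picked weights), not the exact formulation used by Somani et al. (2021).

import tensorflow as tf

def weighted_l1_loss(y_true, y_pred, weight_map):
    # weight_map has the same spatial shape as the images and assigns a larger
    # weight to foreground (labeled structure) pixels than to background pixels.
    return tf.reduce_mean(weight_map * tf.abs(y_true - y_pred))

# Toy example: foreground weight 5, background weight 1, derived from a crude mask.
y_true = tf.random.uniform((1, 128, 128, 1))
y_pred = tf.random.uniform((1, 128, 128, 1))
mask = tf.cast(y_true > 0.7, tf.float32)       # stand-in for a foreground mask
weights = 1.0 + 4.0 * mask
loss = weighted_l1_loss(y_true, y_pred, weights)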

4.4 Interpretation in Graph Networks

In recent years, DL breakthroughs such as CNNs and RNNs have enabled unprecedented performance on a variety of challenges. Despite this progress, most studies using DL techniques have only considered data collected on Euclidean domains (i.e., grids). However, many fields, including biology, physics, computer graphics, network research, and recommender systems, must deal with data living on non-Euclidean domains: graphs and manifolds. Since the non-Euclidean nature of the data makes the specification of basic operations (like convolution) relatively complex, the uptake of DL in these demanding fields lagged behind until recently. Recommender systems, fake news identification, drug repurposing, chemistry, and neutrino detection are just a few examples of where deep geometric learning on graphs is put to use. Manifold knowledge is also useful in fields like computer graphics, medical imaging, drug design, robotics, and autonomous vehicles. The spectral domain, the spatial domain, and the parametric domain are all possible forms that non-Euclidean CNNs can take.

New models are being sought in the hope that those based on GNN architectures may produce more comprehensible patterns of reasoning (Battaglia et al. 2018). This section focuses on the benefits of GNNs and the difficulties in interpreting their results. The GNN architecture is viewed as more interpretable because it learns entities, relations, and compositional rules. To begin with, the entities of GNNs are more easily understood by humans than either image pixels (fine granularity) or word embeddings (high-level concepts represented by a large number of words and latent space vectors). As a second benefit, GNN inference propagates information along links, making it simpler to pinpoint the subgraph or explicit reasoning path responsible for a prediction. As a result, GNN models are increasingly being used to make predictions once image or text data have been converted to graphs. For example, to create a graph from an image, we can make each object (or part of an object) a node and then add edges between them based on their spatial relationships. In a similar vein, lexical parsing can be used to convert a document into a network by identifying concepts (such as nouns or named entities) as nodes and extracting their relationships as connections. The graph data format provides a springboard for interpretable modeling; however, GNN interpretability is still threatened by a number of obstacles. First, a GNN still maps nodes and edges to embeddings; consequently, the opacity of information processing in intermediate layers is an issue for GNNs, as it is for conventional deep models. Second, the various channels or subgraphs of information propagation affect the final prediction in diverse ways. Post-hoc interpretation methods are therefore still required, since a GNN does not immediately reveal the reasoning processes most relevant to its prediction. To improve the explainability and interpretability of GNNs, the following section describes recent developments in addressing these issues. Unless otherwise noted, the graphs we examine in the rest of this chapter are assumed to be homogeneous.
Challenges with graph explanation
The problem of explaining deep graph models is crucial but difficult. The challenges in providing adequate explanations for GNNs are discussed below, beginning with some facts:
• Graphs are not grid-like data, as photos and texts are; instead, they have a more organic structure in which each node has a distinct number of neighbors and there is no location information. Graphs are represented by feature matrices and adjacency matrices, both of which are rich in information about the underlying topology.
• One might try to directly apply approaches developed to explain feature relevance in image data to graph data. However, the topology is represented by discrete values in the adjacency matrices, so such conventional approaches will not work.
• Common approaches to explaining the general characteristics of image classifiers include input optimization methods (Simonyan et al. 2014; Olah et al. 2017), which explain the model by means of abstract images obtained by optimizing the input with backprop, treating the input as a trainable variable. Discrete adjacency matrices, however, are not amenable to this kind of optimization.

Fig. 4.20 Geometric deep learning datasets for non-Euclidean domains of graphs and manifolds on the right compared to traditional speech and images on the left using mixture model CNNs (Monti et al. 2017)

• In addition, various approaches (Dabkowski and Gal 2017; Yuan et al. 2021) train soft masks to capture the crucial parts of images. However, the discreteness of the adjacency matrices is lost when soft masks are applied to them.
Furthermore, while for digital images and written texts it is necessary to analyze the importance of every individual pixel and word, for graph data studying the structural information, as seen in Fig. 4.20, is more vital. First, the labels of entire graphs are determined by graph topology even though individual nodes in the networks may be unlabeled; if the nodes are unlabeled, there is no semantic value in analyzing them individually. Moreover, graph substructures are closely related to their functionalities in fields such as biochemistry, neurobiology, ecology, and engineering (Rudin 2019). Network motifs (Alon 2007) are illustrative; they serve as the building blocks for a wide variety of more complicated networks (Chen et al. 2018). Therefore, such structural details must be taken into account in explanation procedures, yet current methodologies from the image domain are unable to offer insight into the structures themselves. Next, for node classification tasks, the prediction of each node is based on distinct message walks from its neighbors; the study of message passing is significant but difficult, and the lack of approaches in the image domain that can take such walk information into account highlights the need for more research in this area. Moreover, visual and textual data are more intuitive than graph data. Understanding deep models requires expertise in the relevant datasets; semantic understanding is clear and easy for images and texts, and even quite abstract explanations are easily grasped by humans. It is difficult for humans to grasp the meaning of graphs because they can represent such a wide variety of complicated facts, such as molecules, social networks, and citation networks. There are also numerous unanswered questions and a dearth of domain expertise in interdisciplinary fields like chemistry and biology. Therefore, standard datasets and assessment measures for explanation tasks are needed, because it is not easy to obtain explanations that a human can understand for graph models.
Breakthrough with manifold-structured data
At the dawn of the AI domain, two of its founders famously predicted that solving machine vision problems would only take a summer; we now acknowledge that they were off by half a century. Wissner-Gross pondered the puzzle "What took the AI revolution so long?"

By analyzing publications on AI advancement over the past 30 years, the author found evidence suggesting that high-quality datasets, not algorithmic advances, were the decisive factor behind breakthroughs in AI. The summary of breakthroughs in Table 4.5 implies that the striking discoveries in AI came, on average, 3 years after the relevant dataset became available, i.e., six times faster than the average elapsed time of 18 years from the proposal of the underlying algorithm. If this observation is correct, it has foundational implications for the future growth of AI: prioritizing superior data collection may accelerate DL more than the curation of heavy algorithms. The examples of speech recognition, language translation, IBM's Deep Blue in 1997, and DeepMind in 2015 are repeated instances in Table 4.5 that underline the importance of data. The greater the data, the stronger the correlations and the more complex the relationships that can be captured. Here, graph neural learning is one way to move forward, trading the abstract nature of efficient knowledge encoding against rigorous mathematical modelling.

Table 4.5 Datasets over algorithms by Wissner-Gross (Quanto 2021)

Year | Breakthrough in AI | Dataset (first available) | Algorithm (first proposed)
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Grandmaster Garry Kasparov | 700,000 Grandmaster chess games, aka "The Extended Book" (1991) | Negascout planning algorithm (1983)
2005 | Google's Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2010) | Mixture-of-Experts algorithm (1991)
2014 | Google's GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1000 object categories (2010) | Convolution neural network algorithm (1989)
2015 | Google's DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992)
Average number of years to breakthrough | | 3 years (after dataset) | 18 years (after algorithm)

The study of molecule graphs, social networks, or semantic image classification builds on the theoretical ideas of graph theory and differential geometry. Understanding non-Euclidean spaces is important for encoding knowledge in the spatial, spectral, and parametric domains of graphs. This branch of study addresses questions including:
• What might the interpretation of non-Euclidean data look like?
• How do we learn to work with spectral analysis on graphs and manifolds?
• What are the benefits of trusting spatial-domain and spectral-domain geometric deep learning techniques?
The fundamental challenge facing the GNN learning approach today is the extension of NNs to graph- or manifold-structured data. Here, the assumption is made that non-Euclidean data are locally stationary and manifest hierarchical structures; in practical scenarios, this is hard to interpret. Another real difficulty is defining the compositionality of convolution and pooling operations on graphs or manifolds, because filters are basis-dependent and do not generalize well across domains. Compositionality is also one of the common problems in fuzzy CNNs, which will be discussed in the next chapter.

4.4.1 Neural Structured Learning

Graph learning is another broad subject of research, too large for the scope of this book to cover explicitly. Nevertheless, it is worth briefly emphasizing the novel paradigm of training a NN by exploiting structured signals (Fig. 4.21) in addition to the standard feature inputs. In this case, the structure is represented explicitly through a graph or implicitly by leveraging nearest neighbors of the input data. This graph learning technique can be used to investigate training on natural or synthesized graphs, as well as adversarial cases. The goal of neural structured learning (Bui et al. 2018) is to minimize a loss that combines the two terms in Eq. 4.3.

Fig. 4.21 Schematic representation of NSL

Minimizing a supervised loss, such as the L2-norm for regression or the cross-entropy loss for classification, is a common goal in NN design. In this case, we also minimize a neighbor loss to preserve the correlation between inputs that share the same structure. The total loss optimization is defined in Eq. 4.3 as:

L_loss = Σ_{i=1}^{n} L(y_i, ŷ_i) + α Σ_{i=1}^{n} L_N(y_i, x_i, N(x_i))     (4.3)

where x_i → f(·) → ŷ_i is the transformation of the input feature, ŷ_i is the NN output for input x_i, and L and L_N are the supervised loss function and the neighbor loss function, respectively. The latter term computes the neighbor loss, expressed as Σ_{x_j ∈ N(x_i)} w_ij · D(h_θ(x_i), h_θ(x_j)), with h_θ(·) the hidden target layer and D(·) a distance metric, e.g., the L1- or L2-norm. The adversarial examples covered briefly in earlier chapters can also be utilized as a structured learning graph for the network by perturbing the incoming data. Analytical verification of the model's resistance to intentional manipulation was obtained after training with the adversarial perturbation samples of Szegedy et al. (2015). These structured signals are usually used to represent association or similarity among samples. Training a NN is regularized by the use of structured signals, such as negative instances generated by perturbations of the source images. By keeping the structural similarity between inputs and adversarial instances in mind while learning to reduce the supervised loss, the adversarial loss can be kept to a minimum. The following code snippet demonstrates adversarial neural structured learning using the NSL framework implemented in the TensorFlow API.

import tensorflow as tf
import neural_structured_learning as nsl

# Read and process the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create a base model -- Sequential, functional, or subclassed
# Here, a Keras model is used
model = tf.keras.Sequential(...)

# Wrap the model with adversarial regularization
adv_config = nsl.configs.make_adv_reg_config(multiplier=0.2, adv_step_size=0.05)
adv_model = nsl.keras.AdversarialRegularization(model, adv_config=adv_config)

# Compile, train, and evaluate
adv_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
adv_model.fit({'feature': x_train, 'label': y_train}, epochs=50)
adv_model.evaluate({'feature': x_test, 'label': y_test})

Listing 4.1 Sample adversarial neural structured learning code for MNIST data
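Complementing the library call in Listing 4.1, the neighbor-loss term of Eq. 4.3 can also be written out directly; the embedding function h, the neighbor batch, and the edge weights below are illustrative placeholders, not part of the NSL API.

import tensorflow as tf

def neighbor_loss(h, x, neighbors, weights):
    # L2 distance between the embedding of x and those of its graph neighbors:
    #   h         : embedding function (e.g., a hidden layer of the network)
    #   x         : a batch of inputs, shape (batch, ...)
    #   neighbors : matching batch of neighbor inputs, shape (batch, ...)
    #   weights   : edge weights w_ij, shape (batch,)
    d = tf.reduce_sum(tf.square(h(x) - h(neighbors)), axis=-1)
    return tf.reduce_mean(weights * d)

# Toy usage with a random embedding network.
h = tf.keras.Sequential([tf.keras.layers.Flatten(), tf.keras.layers.Dense(16)])
x = tf.random.uniform((8, 28, 28, 1))
xn = tf.random.uniform((8, 28, 28, 1))
w = tf.ones((8,))
total = neighbor_loss(h, x, xn, w)  # added to the supervised loss with multiplier alpha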

The next step could be to investigate low-dimensional knowledge embeddings of graphs for link prediction challenges using Euclidean, hyperbolic, and complex models. For instance, Graph Agreement Models (GAMs) study the noisy nature of graphs in real-world problems together with graph convolutional networks (Kipf and Welling 2016), learning SOTA fixed-embedding entities in multi-relational graphs and extrapolating, via nearest-neighbor graph inference, to relationships unseen at test time. The preceding considerations indicate that any image can be factored as a tensor of pixel values, and convolution layers help to extract features from the image (forming feature maps).

The shallower layers in the network (closer to the input data) learn relatively generic features such as edges and corners, while deeper layers (nearer the output layer) learn highly specific aspects of the input image.

4.4.2 Graph Embedding and Interpretability

Recently, a number of different methods have been proposed to explain the results produced by deep graph models. Each of these approaches analyzes a different facet of graph models and offers a new perspective on them. Questions they typically address include: Which input edges are most important? Which nodes in the input graph are most crucial? Which features of a node are the most crucial? Which graph patterns best predict a class? Our taxonomy of GNN explanation methods, later in this section, should help readers make sense of the various approaches. Figure 4.22 shows how the taxonomy is composed. The techniques are divided into two broad categories, instance-level methods and model-level methods, according to the depth of explanation they offer.

Fig. 4.22 Comprehensive analysis of various graph explanation methods, grouped by explanation target. Methods color-coded in black use a backward computational flow for explanation, while methods color-coded in brown use a forward computational flow. With the exception of the Graph Line and Causal Screening methods, all other forward-flow methods involve learning procedures, whereas text in black indicates no learning process. Except for the ones mentioned, all of the methods serve both the node and graph classification tasks

4.4.2.1 Instance-Level Methods

Instance-level methods provide input-dependent explanations: for each input graph, they explain the deep model by identifying the input features most relevant to its prediction. We classify the methods into four groups that differ in the way importance scores are calculated:
1. Gradient-/feature-based methods (Table 4.6).
2. Perturbation methods (Table 4.7).
3. Decomposition methods (Table 4.8).
4. Surrogate methods (Table 4.9).

In particular, gradient-/feature-based algorithms use the feature values or gradients to estimate the significance of the various input features. Perturbation-based algorithms track the change in prediction under different input perturbations, allowing input importance scores to be inferred. Decomposition methods break prediction scores, such as predicted probabilities, into components that are assigned to the neurons in the last hidden layer; they treat these decomposed scores as importance scores and propagate them backwards layer by layer until they reach the input space. Surrogate-based approaches, in contrast, begin by sampling a dataset from the neighborhood of the input example; they then fit a decision tree or another easily understandable model to the collected data, and the explanations of the surrogate model are used to explain the original predictions.
Model-level methods, on the other hand, explain GNNs independently of any particular input example. These high-level, input-independent explanations account for the general behaviour of the model. This direction has received less attention than instance-level methods; XGNN (Table 4.10), which is based on graph generation, is the only existing model-level technique. It generates graph patterns that maximize the predicted probability for a given class and then uses these patterns to explain that class.
Each of these two categories of approaches offers a unique perspective on understanding deep GNNs. Model-level methods offer a generic understanding of how deep graph models function, while instance-level methods offer explanations tailored to individual examples. To ensure the accuracy and reliability of GNNs, human supervision is necessary to double-check the explanations provided. Experts need additional time to manually supervise instance-level procedures as they investigate the explanations for various input graphs, whereas the explanations of model-level procedures are more abstract and require fewer human supervisors. On the other hand, instance-level techniques use real input instances in their explanations, while the graph patterns produced by model-level techniques may not exist in the real world, so their explanations might not make sense to humans. The two approaches can be combined for a more complete understanding of deep graph models, so it is important to investigate both. A minimal sketch of the gradient-based family on a toy graph model follows.
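The sketch below illustrates the Grad × Input idea on a one-layer GCN-style toy model; the graph, features, weights, and the minimal pooling classifier are random stand-ins chosen only for illustration.

import tensorflow as tf

# Toy graph: 4 nodes, feature dim 8, symmetric adjacency with self-loops.
A = tf.constant([[1., 1., 0., 0.],
                 [1., 1., 1., 0.],
                 [0., 1., 1., 1.],
                 [0., 0., 1., 1.]])
deg = tf.reduce_sum(A, axis=1)
A_hat = A / tf.sqrt(deg[:, None] * deg[None, :])      # symmetric normalization
X = tf.random.normal((4, 8))                          # node features
W = tf.Variable(tf.random.normal((8, 2)))             # one-layer GCN weights

with tf.GradientTape() as tape:
    tape.watch(X)
    logits = tf.reduce_mean(A_hat @ X @ W, axis=0)    # graph-level logits via mean pooling
    score = logits[1]                                 # score of the class to explain

# Grad x Input: element-wise product of gradients and features, summed per node.
grads = tape.gradient(score, X)
node_importance = tf.reduce_sum(grads * X, axis=1)
print(node_importance.numpy())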

Table 4.6 Overview of popular feature/gradient-based methods for interpretable GNN models. The (+) signifies the pros and (−) the cons of each method
Feature/gradient-based method | Key features

Sensitivity Analysis (SA) (Baldassarre and Azizpour 2019)

+ Utilizes the squared gradient values as the importance scores of the input features
+ Calculated by back-propagation, with graph nodes, edges, or node features as input features
+ Assumes that input features with higher absolute gradient values are of greater importance
− SA only reflects the sensitivity between input and output, which is insufficient to demonstrate significance
− Suffers from saturation problems (Shrikumar et al. 2017): where the model output changes minimally with respect to any input change, the gradients can hardly reflect the contributions of the inputs
− Its efficacy is still limited: it assumes input features are mutually independent and does not consider their correlations during the decision-making process
− Typically, the explanation results provided by sensitivity analysis are noisy and difficult to comprehend

Guided BP (Baldassarre and Azizpour 2019)

+ Similar to SA but with modifications to the backpropagation of gradients
− Because negative gradients are difficult to explain, GuidedBP only back-propagates positive gradients while clipping negative gradients to zero
− Shares the same limitations as SA

Grad × Input (Shrikumar et al. 2017)

+ Calculates feature contribution scores as the element-wise product of the input features and the gradients of the decision function with respect to the features
+ Considers both feature sensitivity and feature scale
− Suffers from the saturation problem, where the scope of the local gradients is too limited to reflect the overall contribution of each feature

Integrated Gradients (Sundararajan et al. 2017)

+ Solves the saturation problem by aggregating feature contributions along a designed path in the input space
− The explanations obtained usually contain a lot of noise

SmoothGrad (Smilkov et al. 2017)

+ Aggregate feature contributions along a path in the input space to solve saturation + Aims at sharpening the saliency maps on images − Typically, the explanations obtained by the aforementioned methods contain a great deal of noise

CAM (Pope et al. 2019)

+ Identifies important nodes by mapping final layer node features to input space + CAM combines different feature maps by weighted summations to score input nodes’ importance − Requires a global average pooling (GAP) layer and an fully connected (FC) layer as the final classifier in the GNN structure. Its applicability and generalization are limited − Assumes that the final node embeddings can reflect the input importance, which is a heuristic assumption that may or may not be true − Can only explain graph classification models and cannot be used to classify nodes

Grad-CAM (Pope et al. 2019)

+ Removes the GAP constraint from the CAM to make it general + Instead of using GAP and FC weights to combine feature maps, gradients are used − Based on heuristic assumptions and cannot explain node classification models


Table 4.7 Overview of popular perturbation-based methods for interpretable GNN models

GNNExplainer (Ying et al. 2019)
+ Learns soft masks for edges and node features in order to explain predictions through mask optimization
+ Masks are optimized by maximizing the mutual information between the original and new graph predictions
− Masks are optimized separately for each input graph, so the explanations may be limited to that graph

PGExplainer (Luo et al. 2020)
+ Learns approximated discrete masks for edges to explain the predictions
+ The mask predictor is trained by maximizing the mutual information between the original and new predictions
+ Since all edges in the dataset share the same predictor, the explanations can provide a global understanding of the trained GNNs
− Although reparameterization is used, the obtained masks are not strictly discrete; they can, however, reduce the introduced evidence problem (Dabkowski and Gal 2017)

GraphMask (Schlichtkrull et al. 2020)
+ Trains a classifier to predict whether an edge can be dropped without affecting the prediction
+ Obtains an edge mask for each GNN layer, whereas PGExplainer focuses on the input space
+ To avoid changing the graph structure, dropped edges are replaced by learnable baseline connections, i.e., vectors with the same dimensions as the node embeddings
+ The classifier is trained on the entire dataset by minimizing a divergence term
− Similar to PGExplainer, it can only reduce the introduced evidence problem, and it explains trained GNNs

ZORRO (Funke et al. 2020)
+ Identifies important input nodes and node features with discrete masks
+ No training is required, so the non-differentiability of discrete masks is not a limitation
+ ZORRO's hard masks avoid the introduced evidence problem
− Nodes or features are selected step by step with a greedy algorithm
− Greedy mask selection can lead to locally optimal explanations
− The mask for each graph is generated separately, limiting global understanding

Causal Screening (Wang et al. 2020)
+ Investigates the causal attribution of different edges in the input graph
+ Uses the individual causal effect (ICE), which measures the change in mutual information after adding an edge to the subgraph, to select edges
− Lacks global understanding and may get stuck in locally optimal explanations
− Like ZORRO, Causal Screening generates discrete masks without training

CF-GNNExplainer (Lucic et al. 2022)
+ Generates counterfactual explanations for GNNs
+ Unlike previous methods, which attempt to find a sparse subgraph that preserves the correct prediction, it finds the fewest edges to remove in order to change the prediction
− Uses soft masks like GNNExplainer, so the introduced evidence problem also exists
− Non-zero or non-one mask values can add unnecessary information or noise to an explanation

SubgraphX (Yuan et al. 2021)
+ Employs the Monte Carlo Tree Search (MCTS) algorithm (Silver et al. 2017) to efficiently explore different subgraphs via node pruning and selects the most important subgraph from the leaves of the search tree as the explanation for the prediction
+ Measures the relevance of subgraphs using the Shapley value as the MCTS reward and presents an efficient approximation that only considers interactions within the message-passing range
+ Working with subgraphs improves the readability of explanations on graph data
− Exploring subgraphs with the MCTS algorithm raises the computational cost


Table 4.8 Overview of popular decomposition methods for interpretable GNN models

LRP (Baldassarre and Azizpour 2019; Schwarzenberg et al. 2019)
+ Decomposes the prediction score into node importance scores
+ Intuitively, the neuron contributing most to the target neuron's activation receives a larger score
+ LRP's explanation results are more trustworthy because they are based directly on the model parameters
− Can only study the importance of individual nodes and cannot be applied to graph structures such as subgraphs or graph walks
− Requires a thorough understanding of the model structure, limiting its use by non-experts such as interdisciplinary researchers

Excitation BP (Pope et al. 2019)
+ Based on the law of total probability; similar to the LRP algorithm
+ Score decomposition splits the target probability into conditional probability terms
− Has the same advantages and disadvantages as the LRP algorithm

GNN-LRP (Schnake et al. 2020)
+ Analyzes graph walks, which is more coherent with deep GNNs since graph walks correspond to message flows
+ Its score decomposition is a high-order Taylor decomposition of the model prediction
+ Instead of assigning scores to nodes or edges, GNN-LRP assigns them to different graph walks
− Since high-order Taylor derivatives cannot be computed directly, it uses back-propagation to approximate the T-order terms
− Although it has a solid theoretical background, its approximations may be inexact
− Considering each walk separately increases the computational complexity
− Difficult for non-experts in interdisciplinary domains to use

Gradients/Features-based Methods
The most common approach, especially for image and text tasks, is to employ gradients or features to explain deep models (Table 4.6). The main concept is to approximate the importance of the input using gradients or the values of a hidden feature map. Specifically, gradient-based algorithms (Simonyan et al. 2014; Smilkov et al. 2017) use backpropagation to compute the gradients of the target prediction with respect to the input features, while feature-based approaches (Zhou et al. 2016; Selvaraju et al. 2017) map the hidden feature maps back to the input space to derive relevance scores. In such approaches, larger gradients or feature values typically denote greater significance. When the model parameters are taken into account, gradients and hidden features can be interpreted in terms of the information they carry. These approaches are easily extended to the graph domain because they are simple and general. SA (Baldassarre and Azizpour 2019), GuidedBP (Baldassarre and Azizpour 2019), CAM (Pope et al. 2019), and Grad-CAM (Pope et al. 2019) are only a few of the recent methodologies used to provide explanations for GNNs.
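To make the idea concrete, the following is a minimal PyTorch sketch (not taken from any of the cited papers) of vanilla sensitivity analysis on a toy graph model: the squared gradient of the predicted class score with respect to the node features serves as a per-node importance score. The two-layer GCN-style model, the random graph, and all tensor names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    """Toy 2-layer GCN-style model: H' = A_hat @ H @ W (illustrative only)."""
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, n_classes)

    def forward(self, a_hat, x):
        h = torch.relu(a_hat @ self.w1(x))   # message passing + transform
        h = a_hat @ self.w2(h)               # second propagation layer
        return h.mean(dim=0)                 # mean-pool nodes -> graph logits

# Illustrative random graph: 6 nodes, 4 features, normalized adjacency a_hat.
torch.manual_seed(0)
a = (torch.rand(6, 6) > 0.6).float()
a_hat = (a + a.t() + torch.eye(6)).clamp(max=1.0)
a_hat = a_hat / a_hat.sum(dim=1, keepdim=True)
x = torch.rand(6, 4, requires_grad=True)     # node features

model = TinyGCN(4, 8, 3)
logits = model(a_hat, x)
target_class = logits.argmax()

# Sensitivity Analysis: squared gradient of the class score w.r.t. node features.
logits[target_class].backward()
node_importance = (x.grad ** 2).sum(dim=1)   # one score per node
print(node_importance)
```

The same pattern applies to edges or node features as inputs; only the tensor with respect to which the gradient is taken changes.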


Table 4.9 Overview of popular surrogate methods for interpretable GNN models

GraphLime (Huang et al. 2022)
+ Extends LIME to deep graph models and investigates node features for node classification tasks
+ Considers the N-hop neighboring nodes and their predictions as its local dataset, where N is the number of layers in the trained GNN
+ The local dataset is fit with a non-linear surrogate model, the Hilbert-Schmidt Independence Criterion (HSIC) Lasso (Yamada et al. 2014)
+ The selected features are taken as the explanations for the original GNN prediction
− Can only explain node features and ignores graph structures such as nodes and edges
− Can explain node classification predictions but not graph classification models

RelEx (Zhang et al. 2021)
+ Combines surrogate and perturbation-based methods to explain node classification models
+ Given a target node and its computational graph (its N-hop neighbors), it first builds a local dataset by randomly picking connected subgraphs from the computational graph, selecting neighbors via BFS starting from the target node; it then fits the local dataset with a GCN model
+ After training, it applies perturbation-based approaches, such as soft masks or Gumbel-Softmax masks, to explain the predictions
− RelEx's surrogate model is itself not interpretable, unlike those of LIME and GraphLime
− It uses a surrogate model to approximate local relationships and masks to approximate edge importance, making the explanations less persuasive and trustworthy
− It is also unclear how it could be applied to graph classification tasks

PGM-Explainer (Vu and Thai 2020)
+ Builds a probabilistic graphical model for instance-level explanations of GNNs
+ The local dataset is generated by random node feature perturbation
+ PGM-Explainer's local dataset comprises node variables rather than graph samples; the Grow-Shrink (GS) algorithm is then used to reduce its size
+ A Bayesian network is fit to the local dataset to explain the GNN model's predictions
+ Can explain both node classification and graph classification tasks
+ Shows the dependencies among node features
− Explains graph nodes but not edges, which carry important topology information

Table 4.10 Overview of popular model-level explanations for interpretable GNN models

XGNN (Yuan et al. 2020)
+ Explains GNNs via graph generation
+ Trains a graph generator to maximize a target prediction of the GNN, rather than explaining an individual input graph
+ The graph generator is trained with reinforcement learning
+ Graph rules support human-understandable explanations
+ In principle, model-level explanations can be generated with any suitable graph generation algorithm
− XGNN explains graph classification models well; whether it can explain node classification models remains unclear


The main distinction between these approaches lies in how gradient back-propagation is carried out and in how the various hidden feature maps are combined.

Perturbation-based Methods
Perturbation-based approaches are widely used to explain deep image models (Dabkowski and Gal 2017; Yuan et al. 2020; Chen et al. 2018). The driving idea is to study how sensitive the output is to changes in the input, which can also be viewed from the perspective of dynamic interpretation in Sect. 3.3.4: if the key aspects of the input data are retained, the prediction should remain consistent with the original. To explain DL image models, existing techniques learn a generator that produces a mask selecting the relevant input pixels. However, graph models do not lend themselves directly to such procedures. Unlike images, graphs are discrete structures whose nodes and edges cannot be resized without distorting the graph, and graphs cannot be understood without their structural information. Multiple perturbation-based approaches, such as GNNExplainer (Ying et al. 2019), PGExplainer (Luo et al. 2020), ZORRO (Funke et al. 2020), GraphMask (Schlichtkrull et al. 2020), Causal Screening (Wang et al. 2020), and SubgraphX (Yuan et al. 2021), have been proposed to shed light on GNNs. These methods follow a fairly standard high-level pipeline, sketched in code below:
1. First, masks are generated for the input graph to highlight the crucial input features. Depending on the explanation task, different kinds of masks are produced, such as node masks, edge masks, and node feature masks.
2. Next, the input graph is combined with the generated masks to produce a new graph that preserves the key aspects of the original graph.
3. Finally, the new graph is fed to the trained GNNs to evaluate the masks and adjust the mask-generating procedure accordingly.
Intuitively, the important input features captured by the masks should convey the primary semantic meaning and lead to a prediction comparable to the original one. The primary distinctions between these methods are the mask generation algorithm, the type of mask, and the objective function. Three distinct varieties of masks are notable: (i) soft masks, (ii) discrete masks, and (iii) approximated discrete masks. Figure 4.23 displays discrete masks for edges, soft masks for node features, and approximated discrete masks for nodes. Soft masks consist of continuous values between 0 and 1, so the mask generation process can be updated directly with backpropagation. However, any non-zero or non-one value in the mask may introduce new semantic meaning or new noise into the input graph, which can alter the explanation results, a phenomenon known as the introduced evidence problem (Dabkowski and Gal 2017). Discrete masks sidestep this problem, since no new numerical value is introduced, but the operations involved, such as sampling, are non-differentiable. To address this issue, the policy gradient method (Sutton et al. 1999) has gained a lot of traction.
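The sketch below illustrates the three-step pipeline with a soft edge mask. It is a deliberate simplification (not the GNNExplainer implementation): the mask is optimized so that the masked graph keeps the original prediction while remaining sparse. The `gnn` argument is any trained model with the illustrative (adjacency, features) calling convention used in the earlier TinyGCN sketch; the learning rate and sparsity weight are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def explain_by_edge_mask(gnn, a_hat, x, n_steps=200, lam=0.01):
    """Learn a soft edge mask so the masked graph preserves the prediction.

    gnn: a trained model mapping (adjacency, node features) -> class logits.
    a_hat: dense (N, N) adjacency; x: (N, F) node features. Illustrative API.
    """
    with torch.no_grad():
        target = gnn(a_hat, x).argmax()          # original prediction

    mask_logits = torch.zeros_like(a_hat, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=0.05)

    for _ in range(n_steps):
        edge_mask = torch.sigmoid(mask_logits)   # step 1: soft mask in (0, 1)
        masked_adj = a_hat * edge_mask           # step 2: combine mask and graph
        logits = gnn(masked_adj, x)              # step 3: feed to the trained GNN
        # Keep the original prediction while encouraging a sparse mask.
        loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
        loss = loss + lam * edge_mask.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logits).detach()   # importance of each edge
```

With a model such as the TinyGCN sketched earlier, `explain_by_edge_mask(model, a_hat, x)` returns an (N, N) matrix of soft edge importances that can be thresholded into an explanation subgraph.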


Fig. 4.23 The standard procedure for implementing perturbation-based techniques. Different algorithms are used to generate the masks, which may represent nodes, edges, or node features. Three types of masks are shown (yellow box): a soft mask for node features, a discrete mask for edges, and an approximated discrete mask for nodes. The mask is then combined with the input graph (orange) to capture the important input information. Finally, the trained GNNs determine whether the new prediction (green) is comparable to the original one and provide feedback for improving the mask generation algorithms. Figure adapted from (Yuan et al. 2021) with permission

Recent works (Chen et al. 2018) also propose using reparameterization techniques, such as Gumbel-Softmax estimation (Jang et al. 2016) and sparse relaxations (Louizos et al. 2017), to approximate discrete masks. Keep in mind that the resulting mask is not strictly discrete but gives a decent approximation, which not only allows for backpropagation but also significantly reduces the impact of the introduced evidence problem.

Decomposition Methods
Decomposition methods, which measure the significance of input features by decomposing the original model's prediction into several terms, are another popular way to explain GNNs. These terms are then used as the importance scores of the corresponding input features. Such strategies examine the model parameters directly to expose the relationships between the features in the input space and the predicted outcomes. A necessary condition, known as the conservation property of these approaches, is that the sum of the decomposed terms equals the original prediction score. Since graphs contain nodes, edges, and node features, it is difficult to apply such approaches directly to the graph domain; in particular, it is not straightforward to assign scores to edges, yet doing so is necessary because edges carry crucial structural information. Several recent decomposition approaches have been proposed to explain deep graph neural networks (DGNNs), including Layer-wise Relevance Propagation (LRP) (Baldassarre and Azizpour 2019; Schwarzenberg et al. 2019), Excitation BP (Pope et al. 2019), and GNN-LRP (Schnake et al. 2020). These algorithms distribute the prediction score across the input space according to score decomposition rules: the prediction score is backpropagated layer by layer from the output to the input. Starting at the output layer, the model's prediction is used as the initial target score; the score is then assigned to the neurons of the preceding layer according to predefined decomposition rules. It is possible to represent edge importance, node importance, and walk importance by


combining the results of multiple iterations of these operations until the input space is reached. Notably, all of these techniques ignore the activation functions used by the deep graph models. The approaches differ in their score decomposition rules and explanation targets.

Surrogate Methods
Due to the intricate and non-linear relationships between the input space and the predictions, GNNs are notoriously difficult to explain. Surrogate approaches are often used to explain image models at the instance level. The basic idea is to use a simple and easily interpretable surrogate model to approximate the predictions of the sophisticated deep model in the neighborhood of the input example. Bear in mind that these techniques assume the regions near the input example exhibit simpler relationships that can be captured by a simpler surrogate model. The explanations of the interpretable surrogate model are then used to explain the original prediction. Applying surrogate methods to the graph domain is challenging because graph data are discrete and contain topological information; moreover, it is unclear how to define the neighborhood of an input graph and which interpretable surrogate models are appropriate. GraphLime (Huang et al. 2022), RelEx (Zhang et al. 2021), and PGM-Explainer (Vu and Thai 2020) are recent surrogate approaches proposed to explain GNNs. For a given input graph, they first build a local dataset containing neighboring data objects and their predictions, then fit an interpretable model to this local dataset. Finally, the explanations provided by the interpretable model are regarded as the explanations of the original model for the input graph. Although these approaches are conceptually similar, they differ primarily in two ways (a generic sketch follows this list):
1. How the local dataset is obtained.
2. Which surrogate model is used.
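For illustration, here is a generic LIME-style surrogate sketch, not the GraphLime/HSIC-Lasso or PGM-Explainer procedure: node features are perturbed to build a local dataset, the trained GNN's predicted probability is recorded, and a sparse linear model is fit whose coefficients serve as local feature importances. The function name, perturbation scheme, and hyperparameters are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.linear_model import Lasso

def surrogate_explain(gnn, a_hat, x, target_class, n_samples=500, noise=0.1):
    """Fit a sparse linear surrogate around one input graph (illustrative)."""
    xs, ys = [], []
    for _ in range(n_samples):
        x_pert = x + noise * torch.randn_like(x)          # local perturbation
        with torch.no_grad():
            prob = torch.softmax(gnn(a_hat, x_pert), dim=-1)[target_class]
        xs.append(x_pert.detach().flatten().numpy())      # local dataset inputs
        ys.append(prob.item())                            # local dataset targets
    # Sparse linear surrogate: coefficients approximate local feature importance.
    surrogate = Lasso(alpha=0.001)
    surrogate.fit(np.stack(xs), np.array(ys))
    return surrogate.coef_.reshape(tuple(x.shape))        # per node-feature weights
```

The choice of Lasso here simply enforces sparsity; any interpretable model (decision tree, Bayesian network, HSIC Lasso) could replace it, which is exactly the second axis along which the published methods differ.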

4.4.2.2 Model-level Interpretations

Model-level approaches, in contrast to instance-level ones, seek to explain deep graph models by offering broad, overarching understanding and generalizations (Table 4.10). More precisely, they investigate which input graph patterns lead to a given GNN behaviour, such as maximizing the prediction for a target class. Input optimization (Olah et al. 2017) is a well-traveled route for generating model-level explanations of image classifiers, but it cannot be applied directly to graph models because graph topology information is discrete, which makes model-level explanation of GNNs more difficult. The topic thus remains important but has received very little attention; XGNN (Yuan et al. 2020) is the only known model-level mechanism for interpreting GNNs.


4.4.3 Evaluation Metrics for Interpretation

To effectively compare various approaches, reliable evaluation metrics are required. Heat maps, a form of explanation visualization, have been used extensively to explain both visual and textual data because of their intuitiveness. This benefit is diminished for graph data, however, because graphs are harder to grasp visually, and only those well-versed in the domain can judge such explanations reliably. Assessment measures are therefore fundamental to the study of explanation strategies; good metrics consider, for example, whether the provided explanations are faithful to the model (Jacovi and Goldberg 2020; Wiegreffe and Pinter 2019). Below, several commonly used metrics for evaluating different aspects of explanations are introduced, and the benchmark datasets often employed in developing and explaining GNNs are briefly mentioned.

 Accuracy: Accuracy is applicable to datasets with ground-truth explanations. Synthetic datasets contain ground truth defined by their construction rules; even though it is uncertain whether GNNs make their predictions according to these rules, rules such as network motifs (Alon 2007) can be employed as approximations of the ground truth. The explanations for any input graph can then be compared against these ground-truth facts. Common accuracy measurements include the F1 score and the ROC-AUC. Nevertheless, accuracy metrics have the limitation that it is uncertain whether the GNN model predicts in the same manner that people do, i.e., whether the predefined ground truth is really valid.

 Fidelity: Pope et al. (2019) build on the intuition that removing a model's most crucial input features should hurt its predictions. Fidelity is formally defined in Eq. 4.4 as:

\text{Fidelity} = \frac{1}{N}\sum_{i=1}^{N}\left( f(G_i)_{y_i} - f(G_i \setminus G_i^e)_{y_i} \right)

(4.4)

where f(·) is the model's output function, G_i is the ith graph, G_i^e is its explanation, and G_i \setminus G_i^e denotes the perturbed ith graph from which the identified explanation has been removed. Recently, the Fidelity+ metric has been proposed. It stands to reason that if the key input features (nodes, edges, and node features) are indeed discriminative, the model's predictions should shift significantly when they are removed. Fidelity+ is therefore the difference in accuracy (or predicted probability) between the original predictions and the new predictions obtained after masking out the important input features. Formally, for the ith input graph G_i, the original prediction is \hat{y}_i = \arg\max f(G_i). Its explanation can be regarded as a hard importance map m_i, in which each element is either 0 or 1 to signify whether the corresponding feature is important, and the complementary mask 1 - m_i removes the important input features. It is important to note that the


explanations given by methods like ZORRO and Causal Screening are discrete masks that may be used as the importance map without any additional processing. Since the importance scores in GNNExplainer and GraphLime are continuous values, the importance map m_i can be obtained by normalizing and thresholding them. Finally, the accuracy-based Fidelity+ score is defined as in Eq. 4.5 (Yuan et al. 2022):

\text{Fidelity+}^{acc} = \frac{1}{N}\sum_{i=1}^{N}\left( \mathbb{1}(\hat{y}_i = y_i) - \mathbb{1}\big(\hat{y}_i^{1-m_i} = y_i\big) \right)

(4.5)

where \hat{y}_i is the prediction on the original ith graph, y_i is its label, and N is the number of graphs. Here, 1 - m_i is the complementary mask that removes the important input features, and \hat{y}_i^{1-m_i} is the prediction obtained when the new, masked graph is fed into the trained GNN f(·). The indicator function \mathbb{1}(\hat{y}_i = y_i) returns 1 if \hat{y}_i and y_i are equal and 0 otherwise. Note that the Fidelity+^{acc} metric examines the change in prediction accuracy; a probability-based Fidelity+ can be defined analogously by focusing on the predicted probability. Similarly, Fidelity−^{acc} studies the prediction change when the important features are kept and the less relevant features are removed.

 Contrastivity: Pope et al. (2019) also calculate the Hamming distance between two explanations, corresponding to the model's predictions of a single instance for two different classes. When making predictions for different classes, the model is assumed to highlight different features. The higher the contrastivity, the better the interpreter's performance.

 Sparsity: Next, Pope et al. (2019) compare the size of the explanation graph with the size of the input graph. As a general rule, explanations should be brief, focusing on what is most important and leaving out the irrelevant parts. The sparsity measure quantifies exactly this quality: the fraction of features prioritized by the explanation method. It is formally calculated as shown in Eq. 4.6, using the graph G_i and its hard importance map m_i:

\text{Sparsity} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{|m_i|}{|M_i|}\right)

(4.6)

Here, |m_i| denotes the number of important input features (nodes, edges, and node features) identified by the explanation, and |M_i| is the total number of features in G_i. Note that higher values indicate sparser explanations, which are more likely to capture only the most significant input information.

 Stability: Sanchez-Lengeling et al. (2021) examine the performance gap of the interpreter before and after introducing noise into the input. The underlying assumption is that a solid explanation should not be sensitive to small changes in the input, such


Table 4.11 Summary of some interpretability visualization tools

CSI: Collaborative Semantic Inference (Gehrmann et al. 2019)
Visual interaction with deep learning models through collaborative semantic inference. The user can both understand and control parts of the model's reasoning process; e.g., in a text summarization system, the user can collaboratively write a summary with the machine's suggestions.

Manifold (Zhang et al. 2018)
A model-agnostic framework for interpretation and diagnosis of ML models through inspection (hypothesis), explanation (reasoning), and refinement (verification).

ActiVis (Kahng et al. 2017)
Visual exploration of industry-scale DNN models; compares the activations of different data instances (i.e., examples) to investigate the potential causes of misclassification.

DQNViz (Wang et al. 2018)
A visual analytics approach to understanding deep Q-networks; extracts useful action/reward patterns that help interpret the model and control the training.

Block (Bilal et al. 2017)
Do CNNs learn class hierarchy? Includes a class hierarchy view and a confusion matrix showing only misclassified samples, with bands indicating the selected classes in both dimensions, and a sample viewer.

that the model's prediction would not change even if the input were altered slightly. Intuitively, the explanations should stay the same if one makes minor adjustments to the input that do not change the predictions. The proposed stability metric assesses whether an explanation method satisfies this property. Given an input graph G_i, its explanations m_i are considered the ground truth. The original input graph G_i is then slightly modified, by adding new nodes and edges, to produce a new graph \tilde{G}_i; note that G_i and \tilde{G}_i are required to have the same predictions. The explanations of \tilde{G}_i are then obtained, denoted \tilde{m}_i. By measuring the difference between m_i and \tilde{m}_i, we can compute the stability score; keep in mind that more reliable and noise-resistant explanations have lower values. In addition, since graph representations are sensitive to perturbation, determining an appropriate amount of perturbation may be hard. Finally, to end the section, Table 4.11 presents a summary of popular open-source interpretability visualization tools used for inspecting and explaining deep models.
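The sketch below computes the accuracy-based Fidelity+ of Eq. 4.5 and the Sparsity of Eq. 4.6 from precomputed predictions and hard masks. The array names and toy values are illustrative; the masked predictions are assumed to have been obtained by re-running the GNN on graphs with the important features removed.

```python
import numpy as np

def fidelity_plus_acc(y_true, y_pred, y_pred_masked):
    """Eq. 4.5: drop in accuracy after removing the important features."""
    y_true, y_pred, y_pred_masked = map(np.asarray, (y_true, y_pred, y_pred_masked))
    return np.mean((y_pred == y_true).astype(float)
                   - (y_pred_masked == y_true).astype(float))

def sparsity(hard_masks):
    """Eq. 4.6: average fraction of input features NOT selected as important.

    hard_masks: list of binary arrays, one per graph (1 = important feature).
    """
    return float(np.mean([1.0 - m.sum() / m.size for m in hard_masks]))

# Illustrative usage with made-up labels and masks.
y_true = [0, 1, 1, 2]
y_pred = [0, 1, 1, 2]           # original predictions
y_pred_masked = [0, 2, 0, 2]    # predictions after masking important features
print(fidelity_plus_acc(y_true, y_pred, y_pred_masked))   # 0.5
masks = [np.array([1, 0, 0, 0]), np.array([1, 1, 0, 0])]
print(sparsity(masks))                                     # 0.625
```

Stability can be computed in the same spirit by comparing the mask returned for G_i with the mask returned for the perturbed graph, e.g., via a Hamming distance between the two binary arrays.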

4.4.4 Disentangled Representation Learning on Graphs

Due to the lack of transparency in the representation space, traditional representation learning has severe limitations in terms of interpretability. In contrast to the case of manually engineered features, where the meaning of each dimension of the resulting feature is known in advance, the meaning of each dimension of the representation space is unknown. This constraint affects all forms of representation learning, including


that of graphs. Several methods have been proposed to enhance the interpretability of representation learning on graphs by allowing actual meanings to be assigned to the various representation dimensions. The explanatory graph reveals the semantic knowledge contained in a CNN: by assigning each node in the explanatory graph to a distinct part, it is possible to disentangle, at the filter level, the complex pattern of parts contained in a conv-layer's feature map. In more detail:
1. The explanatory graph contains several levels of explanation; each layer in the graph represents one conv-layer of the CNN.
2. Each node in the explanatory graph consistently represents the same part of the object, which may appear in a variety of forms. The node can be used to localize the matching feature in the given image, and it can withstand some degree of shape distortion and pose change.
3. Edges encode the co-activation and spatial relationships between two nodes in different levels of the network.
4. An explanatory graph may be thought of as a compact representation of the conv-layer feature maps. Each conv-layer may contain hundreds of filters, each of which may generate a feature map with hundreds of neural units. To express the information contained in the tens of millions of neural units in these feature maps, that is, by which patterns the feature maps are activated and where these patterns are located in the input image, we may employ tens of thousands of nodes in the explanatory graph.
5. Much like a dictionary, each input image can only trigger a subset of the part patterns (nodes) in the explanatory graph. Each node in the graph depicts a common part pattern that appears repeatedly across many images of the training set.

 THINK IT OVER
Is a single vector embedding for each node enough? Learning a single embedding for each node is the goal of several currently available representation learning algorithms on graphs. However, is a single vector sufficient to represent a node that has numerous facets? Solving this issue has significant practical benefit for applications like recommendation systems, where users may have a wide range of interests; each user might then be represented by a set of embeddings, where each embedding represents a different interest.


There are two difficulties in learning disentangled representations: determining the K facets and differentiating the updates of the distinct embeddings during training. Facets might be identified in an unsupervised fashion by clustering, where each cluster represents a facet. We now describe several ways of learning disentangled node embeddings on graphs.

4.4.4.1 Prototype-Based Soft-Cluster Assignment

This approach is discussed in the context of recommender system design. As user and item embeddings are learned, facets that reflect item types are discovered. Here we assume that each item has just one facet, whereas each user may have several facets. For a given item t, its embedding is denoted by h_t, whereas for a given user u it is denoted by h_u = [h_{u,1}, h_{u,2}, \ldots, h_{u,K}]. The one-hot vector c_t = [c_{t,1}, c_{t,2}, \ldots, c_{t,K}] associated with item t has c_{t,k} = 1 if t belongs to facet k, and c_{t,k} = 0 otherwise. In addition to the node embeddings, we must also learn a collection of prototype embeddings, denoted by \{m_k\}_{k=1}^{K}. Equation 4.7 shows how the one-hot vector is sampled from the corresponding categorical distribution (Wu et al. 2022):

c_t \sim \text{Categorical}\big(\text{softmax}([s_{t,1}, s_{t,2}, \ldots, s_{t,K}])\big), \qquad s_{t,k} = \cos(h_t, m_k)/\tau

(4.7)

where the cosine similarity is scaled by the hyperparameter τ. Then the probability of observing an edge (u, t), given in Eq. 4.8, satisfies

p(t \mid u, c_t) \propto \sum_{k=1}^{K} c_{t,k} \cdot \text{similarity}(h_t, h_{u,k}) \qquad (4.8)

In addition to the basic learning process described above, the item embeddings and the prototype embeddings are updated concurrently until convergence. The embedding of each user, h_u, is computed from the embeddings of the items with which that user has interacted; more specifically, h_{u,k} aggregates the embeddings of the interacted items that belong to facet k. Cluster discovery, node-cluster assignment, and embedding learning are thus performed concurrently during training.
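The following is a minimal sketch of the facet-assignment step of Eq. 4.7: item embeddings are compared to prototype embeddings via temperature-scaled cosine similarity, a facet is sampled from the resulting categorical distribution, and a per-facet user embedding is built from the assigned items. All tensors, dimensions, and the use of a mean (rather than a sum) aggregation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_items, dim, K, tau = 10, 16, 4, 0.1

item_emb = torch.randn(n_items, dim)       # h_t for each item
prototypes = torch.randn(K, dim)           # m_k, one prototype per facet

# s_{t,k} = cos(h_t, m_k) / tau  (Eq. 4.7)
scores = F.cosine_similarity(item_emb.unsqueeze(1),
                             prototypes.unsqueeze(0), dim=-1) / tau
facet_dist = torch.distributions.Categorical(logits=scores)
facets = facet_dist.sample()               # sampled facet index c_t per item

# A user who interacted with all items: h_{u,k} aggregates items of facet k.
user_emb = torch.zeros(K, dim)
for k in range(K):
    members = item_emb[facets == k]
    if len(members) > 0:
        user_emb[k] = members.mean(dim=0)  # aggregation per facet (here, averaged)
print(facets, user_emb.shape)
```

In the full method, the sampling step would be made differentiable (e.g., with a Gumbel-Softmax relaxation) so that item and prototype embeddings can be updated jointly until convergence.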


Fig. 4.24 User preferences are represented by a set of embeddings; each embedding captures a different facet of the user's interests. Figure adapted from (Li et al. 2019) with permission

4.4.4.2 Dynamic Routing Based Clustering

The Capsule Network (Sabour et al. 2017) inspires the idea of employing dynamic routing to learn disentangled node representations. A distinction is made between low-level and high-level capsules. Let V_u denote the items with which a user u has interacted. Each low-level capsule is the embedding of an interacted item, and the set of low-level capsules is denoted c_i^l with i ∈ V_u. The high-level capsules are denoted {c_k^h}, 1 ≤ k ≤ K, where c_k^h represents the user's kth interest. The routing process is typically repeated several times until it converges. As shown in Fig. 4.24, once routing is complete, the high-level capsules can be used to represent the user u with multiple interests and can be fed into subsequent network modules for inference (Li et al. 2019).
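The sketch below illustrates the dynamic-routing idea in the spirit of Sabour et al. (2017) and Li et al. (2019), not their exact implementations: item embeddings act as low-level capsules that are routed into K interest capsules for a user. The squash non-linearity, the number of iterations, and the tensor shapes are assumptions.

```python
import torch

def squash(v, eps=1e-8):
    """Capsule squashing non-linearity: keeps direction, bounds the norm to (0, 1)."""
    norm_sq = (v ** 2).sum(dim=-1, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + eps)

def dynamic_routing(item_emb, K=4, n_iters=3):
    """Route item (low-level) capsules into K user-interest (high-level) capsules."""
    n_items, dim = item_emb.shape
    logits = torch.zeros(n_items, K)                  # routing logits b_{ik}
    for _ in range(n_iters):
        coupling = torch.softmax(logits, dim=1)       # c_{ik}: how much item i feeds interest k
        s = coupling.t() @ item_emb                   # (K, dim) weighted sums
        interests = squash(s)                         # high-level capsules
        logits = logits + item_emb @ interests.t()    # agreement updates the routing
    return interests

torch.manual_seed(0)
items = torch.randn(12, 16)        # embeddings of items a user interacted with
user_interests = dynamic_routing(items, K=4)
print(user_interests.shape)        # torch.Size([4, 16])
```

Each row of the result is one interest capsule; items that agree with a capsule's direction are routed to it more strongly on the next iteration, which is what gradually disentangles the facets.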

4.4.5 Future Direction

The field of GNN interpretation is still developing, and there are many difficulties that need to be addressed. Our goal in this section is to provide a roadmap for future research that may help improve the interpretability of GNNs.
1. Some online applications need models and algorithms that can respond instantly, which imposes stringent requirements on the efficiency of explanation strategies. However, many GNN explanation approaches require either extensive sampling or highly iterative algorithms to obtain meaningful answers. Finding fast and accurate explanation algorithms is therefore a promising area for future research.
2. The literature rarely discusses how to use interpretation to discover faults in GNN models and to improve model attributes, despite the fact that more and more


approaches have been created for doing so. Will adversarial or backdoor attacks have a significant impact on GNN models? Could interpretation assist us in solving these problems? To what extent can GNN models be improved once they have been shown to be unreliable or biased?
3. Might additional modeling or training paradigms, beyond attention mechanisms and disentangled representation learning, further enhance the interpretability of GNNs? Some researchers in interpretable ML are drawn to presenting causal links between variables, while others favor logic rules for inferring meaning. Exploring how to incorporate logical reasoning into GNN inference, or how to bring causality into GNN learning, might therefore be a fruitful line of inquiry.
4. Existing work on interpretable ML focuses mainly on improving interpretation accuracy, while the human experience is often neglected. When a system is designed with the user's needs in mind, it is easier to earn the user's confidence and improve the overall experience. Involving domain experts who are unfamiliar with ML through a simple interface is a great way to speed up the iterative improvement of the system. Therefore, human-computer interaction (HCI) techniques that present explanations in a more user-friendly style, or improved human-computer interfaces that allow users to interact with the model, might be another line of research to explore.

4.5 Self-Interpretable Models

Only models that are interpretable by design fit in this category. The primary issue with interpretable models is that each learning technique has its own shortcomings when it comes to representing and reasoning about the problem's domain knowledge; according to Freitas (2014), all of these factors affect readability. A decision tree is a logical structure consisting of a series of if-then statements. Its interpretability, however, does not scale well: as the problem's intricacy increases, a growing tree becomes harder to decipher. The same holds for other rule-based systems, which provide explanations analogous to decision trees. Linear models, such as linear SVMs, are somewhat more interpretable due to their transparency, since the learned vectors may be used to explain the decision on a new instance. They suffer from the same problem as decision trees: it becomes increasingly difficult to translate a prediction into an explanation as more dimensions (features) are added. In the same way, k-nearest neighbors (k-NN) can be interpretable at any scale depending on the similarity measure used. Nonetheless, as the number of dimensions increases (compare Sect. 2.3.1.1), this becomes impractical, since we risk losing sight of the forest for the trees.


4.5.1 Case-based Reasoning Through Prototypes

Previous NN designs often focused solely on accuracy, with interpretability analyzed post-hoc: the network architecture was selected first, and only then did one work to make sense of the model's training or of the characteristics it had learned at a high level. Such interpretability studies should be treated as distinct modeling efforts. A problem with generating explanations post-hoc is that the explanations themselves may shift depending on the model used to generate them; for instance, it can be easy to come up with a variety of plausible but incorrect justifications for how the network categorizes a given object. A related problem is that the explanations produced by post-hoc interpretation are often incomprehensible to people, so further modeling is required to guarantee that the explanations can be understood. For instance, in the Activation Maximization (AM) method, the objective is to discover an input pattern that generates the largest possible response from the model for a user-defined quantity of interest (Erhan et al. 2009). Because AM images are typically not readable (they tend to just appear gray), regularized optimization is used to obtain an interpretable, highly activating image (Hinton et al. 2012; Lee et al. 2009; Van Oord et al. 2016; Nguyen et al. 2016).

 THINK IT OVER
What we get when we combine what the network computes with extrinsic regularization is a regularized estimate. After all, a post-hoc interpretation is hard to believe if it was generated by a different modeling approach using strong priors that were not part of the training procedure.

In reality, there is a growing body of work addressing the aforementioned problems in AM, such as Montavon et al.'s (2018) interpretation of images through visual examination of the NN layer structure. Unlike developing explanations for pre-trained black-box models (which we will also explore), designing the architecture to encode its own interpretations is more in line with research on prototype classification and case-based reasoning (CBR). Here, a novel DL network design has been developed that provides intuitive explanations of its prediction processes. The design includes an AE and a special prototype layer (Fig. 4.25), whose units each hold a weight vector resembling an encoded training input. Comparisons in the latent space are made possible by the AE's encoder, and the learned prototypes are visualized by its decoder. Note that a prototype is an example that is substantially similar to, if not an exact match for, an observation in the training set. Similar to previous prototype classification methods in ML (Bien and Tibshirani 2011; Kim et al. 2014; Marchette et al. 2003), this CBR work aligns well with the existing literature. Although prototype categorization is a classic example of


Fig. 4.25 Case-based Reasoning network design with prototype classifier. Reproduced from (Li et al. 2018) with permission

case-based reasoning introduced by Kolodner (1992), the network allows us to evaluate the distance between prototypes and observations in a more general latent space. The impressive efficiency of the latent space stems from its adaptable nature. In the network, observations are assigned to categories based on how similar they are to a prototype observation from the dataset. Using the handwritten digit example, we can see from Fig. 4.26 that a specific observation was labeled '3' because the network determined that it resembled a prototypical '3' from the training set. Likewise, if the network is asked to categorize an image of a '9', it is reasonable to assume that it will also recognize prototypes from classes '4' and '9', as the former is typically difficult to differentiate from the latter. In the figure, each prototype node is represented by a row of the matrix, and each digit class in the MNIST data is represented by a column. The category carrying the most negative weight is shown in darker gray. For each prototype, the most negative weight corresponds to its visual class, with the exception of the row representing the prototype resembling a '2', which is most strongly connected to class '7'.
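A minimal sketch of such a prototype layer, in the spirit of Li et al. (2018) but not their code, is given below: an encoder maps the input to a latent vector, squared L2 distances to learnable prototype vectors are computed, and a linear layer turns the negated distances into class logits. The architecture sizes and names are illustrative, and the decoder used to visualize prototypes is omitted.

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32, n_prototypes=15, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Learnable prototype vectors living in the latent space.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, latent_dim))
        self.classifier = nn.Linear(n_prototypes, n_classes)

    def forward(self, x):
        z = self.encoder(x)                              # (B, latent_dim)
        # Squared L2 distance from each encoded input to each prototype.
        dists = torch.cdist(z, self.prototypes) ** 2     # (B, n_prototypes)
        # Small distance = strong evidence, so feed negated distances to the classifier.
        return self.classifier(-dists), dists

model = PrototypeClassifier()
x = torch.rand(8, 784)                                   # batch of flattened images
logits, dists = model(x)
closest = dists.argmin(dim=1)                            # which prototype explains each input
print(logits.shape, closest)
```

Because every prediction is driven by distances to a handful of prototypes, the explanation for an input is simply "it is close to prototype p", and decoding p shows the user what that prototype looks like.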

4.5.2 ProtoNets

ProtoNets, introduced by Snell et al. (2017), take the average of several embedded "support" samples to serve as a prototype for each class. For zero-shot learning, prototypes are points in the feature space created with the generative probabilistic model of Li and Wang (2017). Each class is only allowed one prototype, and in neither scenario are the prototypes required to be interpretable (thus, their representations will typically not resemble natural images).


Fig. 4.26 Left: distances between the test image '6' and the prototypes, measured in the prototype (latent) space. Right: transposed weight matrix connecting the prototype layer and the softmax layer of the architecture. Adapted from (Li et al. 2018) with permission

4.5.3 Concept Whitening

Another method for making image classifiers interpretable is Concept Whitening (CW) (Chen et al. 2020). To implement CW, a normalizing layer (e.g., a batch normalization layer) is replaced with a CW layer. This makes CW a practical tool for users who wish to make their pre-trained image classifiers more interpretable without sacrificing model performance. CW draws extensively on work on the whitening transformation, so those interested in learning more about CW would do well to familiarize themselves with its mathematics.

 Highlight
In CW, a latent low-dimensional space is generated by an AE and the distances to prototypes are calculated there. Using a latent space for the distance calculation allows a more appropriate dissimilarity measure to be identified than the L2 distance in pixel space.

4.5.4 Self-Explaining Neural Network

Self-Explaining Neural Networks (SENNs) were developed by Alvarez-Melis et al. (2018) with the purpose of providing a simple and locally interpretable model. This is achieved by employing regularization to enforce diversity through sparsity, to provide conceptual interpretation via prototyping, and to ensure that the model acts locally as a linear model. Figure 4.27 gives an overview of the design. A SENN has three parts:


Fig. 4.27 The network architecture of SENN. A SENN includes three components: a concept encoder (green) that turns the input into a small set of interpretable basis features; an input-dependent parametrizer (orange) that provides relevance scores; and an aggregation function that combines the two to yield a prediction. The robustness loss on the parametrizer encourages the whole model to behave locally as a linear function on h(x) with parameters θ(x), yielding a straightforward interpretation of both concepts and relevances. Reproduced from (Alvarez-Melis and Jaakkola 2018) with permission

1. An input-dependent parameterizer that provides relevance scores.
2. A concept encoder that turns the input into a limited collection of interpretable basic characteristics.
3. An aggregation function that aggregates the scores.
By imposing a robustness loss with respect to the input x, the complete model is encouraged to exhibit linear behavior locally on h(x), which in turn facilitates a more easily interpretable (linear) explanation for any given prediction. A parameterizer, a conceptizer, and an aggregator thus make up the skeleton of the SENN model: an ANN realizes the parameterizer θ, and an AE realizes the conceptizer h. The actual implementations of these networks may differ; fully connected networks are used for tabular data, while CNNs are used for image data. In the first part of their discussion of SENNs, the authors assume that a linear model f(x) = \sum_{i}^{n} \theta_i x_i + \theta_0 with parameters \theta_0, \theta_1, \ldots, \theta_n \in \mathbb{R} is interpretable for a given set of input features x_1, x_2, \ldots, x_n \in \mathbb{R}. They then expand the scope of the linear model to make it more expressive while keeping the interpretable properties typical of linear models. We can design a NN f that can explain itself, as in Eq. 4.9:

f(x) = g\big(\theta(x)_1 h(x)_1, \ldots, \theta(x)_k h(x)_k\big),

(4.9)


To be more precise, θ is a NN that converts the input features into relevance scores (or parameters), and h : X → \mathbb{R}^k computes k interpretable feature representations of x, known as basis concepts; g is a monotonically increasing, additively separable aggregation function. Keep in mind that while feature values change, the parameters of a linear model remain stable, which is a crucial aspect for interpreting the model. This property is lost when the parameters θ(x) are complicated functions of the input features. For the parameterizer θ to serve as coefficients of a linear model on the basis concepts h(x), it is proposed that θ be locally difference-bounded by the conceptizer h. Intuitively, this means that θ is resistant to modest changes in the concept values, at least within a localized region centered on some input value x_i. The locally difference-bounded property is enforced by minimizing the robustness loss in Eq. 4.10:

\mathcal{L}_\theta = \big\| \nabla_x f(x) - \theta(x)^T J_x^h(x) \big\|

(4.10)

Here, J_x^h denotes the Jacobian of h with respect to x. During training, the authors minimize the loss function in Eq. 4.11:

\mathcal{L} = \mathcal{L}_y(f(x), y) + \lambda \mathcal{L}_\theta(f(x)) + \xi \mathcal{L}_h(x),

(4.11)

where \mathcal{L}_y(f(x), y) is the classification loss, measuring how well the model predicts the true label; \mathcal{L}_\theta(f(x)) is the robustness loss of Eq. 4.10; λ is a regularization parameter that determines how strongly robustness is imposed, trading off performance against stability and hence the interpretability of θ(x); and \mathcal{L}_h(x) is the concept loss, composed of a reconstruction loss and a sparsity loss, with ξ its regularization parameter.

 THINK IT OVER
Note that Alvarez-Melis et al. do not specify the particular loss functions used for classification, reconstruction, or sparsity, nor do they define which matrix norm should be used for the robustness loss; the Frobenius norm, the square root of the sum of the squared matrix elements, is a reasonable choice. In Eq. 4.11, the authors refer to the concept-loss regularization hyperparameter ξ as the "sparsity strength parameter", which is unclear. Lastly, although a public implementation of SENN exists, the authors did not release their own code alongside the article.
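Below is a minimal sketch of a SENN-style forward pass and training loss under the assumptions flagged above: cross-entropy for classification, a Frobenius-style norm for the robustness term of Eq. 4.10, and an autoencoder reconstruction term plus an L1 penalty for the concept loss. The tiny networks and every name are illustrative choices, not the authors' implementation; for brevity the robustness term is computed on the summed logits rather than per class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySENN(nn.Module):
    def __init__(self, in_dim=20, k=5, n_classes=3):
        super().__init__()
        self.conceptizer = nn.Linear(in_dim, k)            # h(x): basis concepts
        self.decoder = nn.Linear(k, in_dim)                # for the reconstruction loss
        self.parametrizer = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                          nn.Linear(32, k * n_classes))
        self.k, self.n_classes = k, n_classes

    def forward(self, x):
        h = self.conceptizer(x)                                    # (B, k)
        theta = self.parametrizer(x).view(-1, self.k, self.n_classes)
        logits = torch.einsum('bk,bkc->bc', h, theta)              # g = sum_k theta_k h_k
        return logits, h, theta

def senn_loss(model, x, y, lam=0.1, xi=0.01):
    x = x.requires_grad_(True)
    logits, h, theta = model(x)
    cls = F.cross_entropy(logits, y)
    # Robustness term (Eq. 4.10): compare grad_x f with theta(x)^T J_x h(x).
    grad_f = torch.autograd.grad(logits.sum(), x, create_graph=True)[0]
    jac_h = model.conceptizer.weight                               # J_x h for a linear conceptizer
    approx = torch.einsum('bkc,kd->bd', theta, jac_h)              # classes summed, matching logits.sum()
    robust = ((grad_f - approx) ** 2).sum().sqrt()                 # Frobenius norm over the batch
    concept = F.mse_loss(model.decoder(h), x) + h.abs().mean()     # reconstruction + sparsity
    return cls + lam * robust + xi * concept

model = TinySENN()
x, y = torch.randn(16, 20), torch.randint(0, 3, (16,))
loss = senn_loss(model, x, y)
loss.backward()
print(loss.item())
```

At prediction time, the per-concept products θ(x)_k h(x)_k can be read off directly as the linear explanation of the output, which is exactly the interpretability property the robustness term is meant to preserve.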


Lastly, an issue raised by Leake (1995) is that many explanation systems do not take past instances into account when developing an explanation for the input prediction. If fresh evidence emerges that refutes our existing assumptions, we (and the system) would need to revise our fundamental knowledge of the subject, and by definition this cannot be done without some prior knowledge or understanding. Time-based learning is at play here, since the quality of the explanations provided grows with experience. User feedback might be used to fine-tune this learning: the system could keep track of the ways in which its reasoning has strayed from the user's expectations or from the opinions of domain experts, since the system's internal knowledge may be "deficient" in certain respects. The CBR system of Craw et al. (2018) is meta-cognitive (knowing about knowing); it may therefore learn to discover its own flaws via inquisitive study of its domain. Its ability to explain phenomena could thus be enhanced in the future compared to the standard CBR model.

4.6 Pitfalls of Interpretability Methods

Rudin (2018) argues that it is futile to try to describe a black-box system. Approximating network behavior with more transparent or interpretable models is criticized for being unfaithful to the original system: since absolute fidelity to the original model is impossible when the black box is used as an oracle and a transparent model is trained as an explainer, the faithfulness of the explanation may drop. The main point of criticism is that incomprehensible models should not be used, and the assumption that improved accuracy must come at the expense of interpretability is a fallacy (Gosiewska et al. 2021), despite its prevalence in XAI work. High accuracy does not require complex models, but complex models are often simpler to deal with because they are more developed than interpretable models. Alvarez-Melis et al. (2018) applied a robustness metric to interpretability methodologies (LIME, SHAP, etc.), where robustness measures how much the explanation (attribution) varies with the input: similar input-output pairs should yield similar explanations. Experiments revealed that model-independent perturbation-based interpretability approaches were more prone to instability than gradient-based interpretability methods; however, both families performed poorly on the robustness metric for the most part. According to Alvarez-Melis, the interpretability technique faces the same fate as the underlying model, since it is not resilient. However, it is unclear whether robustness is a necessary property to maintain, because it only considers the sensitivity of the explanations. Gradient-based methods are also potentially vulnerable to adversarial attacks, as demonstrated by Ghorbani et al. (2019), where a small random perturbation of the input can change the feature importance, leading to drastically different interpretations without changing the prediction. Yeh et al. (2019) developed a measure of sensitivity, which describes how an explanation varies in response to changes in the input. The results suggest that the less sensitive the interpretation technique, the more faithful the explanation. Yeh et al. also show how adversarial training can be used to improve the black-box model with respect to this property.


Fig. 4.28 A sticker that makes a VGG16 classifier trained on ImageNet categorize an image of a banana as a toaster. Figure reproduced from (Brown et al. 2017) with permission

 Highlight
One of our favorite techniques shows that adversarial examples exist in the physical world. Brown et al. (2017) created a printed sticker (see Fig. 4.28) that, when placed next to an object, makes an image classifier see a toaster. In contrast to previous approaches to adversarial examples, this one does not require the adversarial image to be highly similar to the original; instead, it completely replaces a section of the image with a patch that can take any shape. To make the patch effective in a wide variety of contexts, its appearance is optimized across many background images, with the patch occasionally relocated, enlarged, or rotated. The resulting image can then be printed and used to fool image classifiers in the wild.

Methods for understanding DNN decisions and procedures often focus on developing intuition by highlighting sensory or semantic features of specific examples. Such methods, for example, try to show the input components that are "essential" to a network's decision or to quantify the semantic qualities of particular neurons. Leavitt and Morcos (2020) claim that interpretability research suffers from an over-reliance on intuition-based techniques, which can lead to illusory progress and erroneous findings, and that under some circumstances it already has. They outline a set of constraints that, they argue, impede significant progress in interpretability research, and urge researchers to start from their intuitions to design and test unambiguous, falsifiable hypotheses. We believe this provides the basis for strong, evidence-based interpretability methodologies that will result in more impactful research and applications of DNNs.


 Highlight
We underline that we do not propose that intuition should be abandoned entirely; intuition is necessary to develop understanding, and an intuitive technique is always preferable to an unintuitive one. Instead, we contend that unsubstantiated intuition in interpretability research can contribute to misunderstanding. As a result, the purpose of this work is to minimize misdirected effort, better realize the potential of effective ideas, and promote more meaningful research by employing scientific rigor.

Nevertheless, it is also possible that the ubiquity and attractiveness of visualizations work against efforts to conduct rigorous studies on interpretability. Many interpretability approaches produce rich and engaging visuals (Alexander Mordvintsev and Tyka 2015a, b; Gatys et al. 2015, 2016; Olah et al. 2017, 2018, 2020; Zeiler and Fergus 2014), which is crucial for building understanding (Tufte 1983; Victor 2014). However, the ability to visualize something is both a blessing and a curse: a graphic may be very effective at conveying meaning even if it is not an accurate depiction of the phenomenon it tries to explain. This highlights the necessity of accurate visualization in the field of interpretability, where it is easy to be misled by untested intuitions.

 THINK IT OVER
Interpretability illusion!! Feature visualizations can give the illusion that we comprehend what the NN is doing. But do we truly comprehend what is going on inside the NN? Even after inspecting hundreds, if not thousands, of feature visualizations, we still may not understand the network. Several neurons may learn the same or similar features, and many of the learned qualities may have no corresponding human concepts because of the complicated interplay between channels. We must not fall into the trap of assuming we completely comprehend DNNs simply because we believe we noticed that neuron 264 in layer 14 is stimulated by tigers.

4.6.1 Case Study: Feature Visualization and Network Dissection

 Network Dissection showed that architectures like ResNet or Inception include units that react to particular concepts. However, metrics like IoU are inadequate because


several units react to the same concept, while others show no response at all. The channels are not completely disentangled, so we cannot analyze them in isolation. Performing Network Dissection requires a dataset with pixel-level concept labels. Such datasets take a lot of effort to acquire, as each pixel needs to be tagged, which normally works by drawing segments around objects in the image. Network Dissection only aligns human concepts with positive activations of channels, not with negative activations. Yet negative activations also seem to be relevant to the concept, as shown by the feature visualizations (see Sect. 4.3.1). This may be rectified by looking at the lowest quantile of activations.

 Highlight
The visualization tool for interpretability is a double-edged sword.
1. Understanding is best gained by exploration, where visual aids such as maps and diagrams can be invaluable. Even if a visualization is inaccurate, viewers may nevertheless walk away with a profound sense of understanding if it effectively conveys the subject at hand.
2. One has to deal with the lack of quantification and the potential drawbacks of visual representation. This is particularly true in DL, where seemingly identical models can exhibit wildly different behavior depending on a few of their hyperparameters. Without proper quantification, visualization can be as unreliable as a Rorschach test for researchers.

 CAM is a low-maintenance, high-yield method of inspection. It gives a good idea of the most important features a model is looking for, and it can be implemented quickly and easily. It is a great way to show how a CNN works, even to people without a technical background. Although CAM is useful for getting a general sense of the most prominent aspects, it is limited in its ability to offer finer insights. Numerous examples in the literature show how the activation heatmap can extend from the neck to the face in some circumstances, or from the Persian cat to a goldfish in the bowl. Is it reasonable to conclude that the neck, rather than the face, is the discriminative feature for that class? Although CAM provides some insight into distinguishing traits, it cannot capture the complex semantic relationships occurring at deeper levels.

4.6.2 Gradients as Sensitivity Maps

Several authors, including Baehrens et al. (2010), Simonyan et al. (2014), and Erhan et al. (2009), have suggested the need for mathematically sound methods to identify "essential" pixels in the input image.


In practice, it appears that there is a correlation between the sensitivity map of a label and the locations where that label is present (Baehrens et al. 2010; Simonyan et al. 2014). Sensitivity maps computed using raw gradients, on the other hand, tend to look cluttered. The formulas for CAM and LRP are given on a heuristic basis, as discussed in earlier sections; they assume that the models will eventually produce interpretable information through a certain set of interactions between weights and the activation strengths of some units. The intermediate steps are not observable. Adjusting the value of a single weight, for instance, does not readily reveal any meaningful patterns.

The apparent noise in raw gradient visualizations can have various causes. One possibility is that the maps are true descriptions of the network's behavior: perhaps certain seemingly unrelated pixels dispersed throughout the image are crucial to the network's decision-making process. Several attempts have been made to show the regions of images that most strongly activate a certain feature map, but this does not yet explain how the network arrived at its conclusion (Zeiler and Fergus 2014). Alternatively, the raw gradient might not be the best indicator of feature value. Several previous publications have proposed improvements to the basic concept of gradient sensitivity maps in an effort to provide better explanations of network decisions. LRP (Bach et al. 2015), DeepLIFT (Shrikumar et al. 2017), and more recently Integrated Gradients (Sundararajan et al. 2017) all attempt to address this possible issue by evaluating the global importance of each pixel rather than the local sensitivity. "Saliency" or "pixel attribution" maps are the terms used to describe the maps produced with these methods.

Modifying or extending the backpropagation algorithm is another tactic to improve sensitivity maps, with the intention of giving more weight to informative inputs. DeconvNet (Zeiler and Fergus 2014) and Guided Backprop (Springenberg et al. 2014) are two methods that alter the gradients of ReLU functions by ignoring negative values during the backpropagation calculation. The goal of this "deconvolution" is to better reveal the specific input elements that caused the activation of higher-level units. In a similar vein, Selvaraju et al. (2016) and Zhou et al. (2016) provide methods for combining gradients of units at multiple levels. The apparent noise in a sensitivity map may, therefore, be the result of small, locally variable partial derivatives that have little global significance. After all, there is no reason to anticipate smooth variations in derivatives in light of typical training methods. The relevant networks are often based on ReLU activation functions, so the class score Sc is not even continuously differentiable in most cases.
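To make the raw-gradient baseline concrete, here is a minimal sketch of a vanilla gradient sensitivity map in PyTorch; the model, input preprocessing, and target class are placeholders and are not tied to any specific network from the text.

import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient sensitivity map: d(score of target class) / d(input pixels)."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)   # shape (1, C, H, W)
    score = model(x)[0, target_class]                     # scalar class score S_c
    score.backward()                                      # populate x.grad
    # Aggregate over color channels to obtain one value per pixel.
    saliency = x.grad.detach().abs().max(dim=1).values.squeeze(0)
    return saliency                                       # (H, W); typically cluttered, as noted above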

4.6.3 Multiplying Maps with Input Images

Some algorithms generate a final sensitivity map by multiplying gradient-based values by the actual pixel values, as reported by Shrikumar et al. (2017) and Sundararajan et al. (2017).


Visually, this multiplication tends to produce simpler and sharper images; however, it is not always clear how much of the improvement is due to the clarity already present in the original image. For instance, if the input contains a black/white edge but the underlying sensitivity map does not, the output display may nevertheless have a structure that resembles an edge. There is also a chance that the multiplication introduces unwanted artifacts: pixels with a value of 0 will never appear in the sensitivity map. If we encode black as 0, for instance, a classifier that correctly predicts a black ball on a white background will never bring attention to the black ball in the image. However, if we consider the significance of the features in terms of their contribution to the overall score y, then multiplying the features' gradients with the input image makes intuitive sense. When y = w^T x, it is logical to consider x_i w_i as x_i's contribution to the overall score y.

"In so far as a scientific statement speaks about reality, it must be falsifiable: and in so far as it is not falsifiable, it does not speak about reality." —Karl Popper (2005)
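A hedged sketch of the gradient-times-input idea follows; it reuses the same placeholder model assumptions as the saliency sketch in the previous subsection.

import torch

def gradient_times_input(model, image, target_class):
    """Attribution as input * gradient, analogous to x_i * w_i when y = w^T x."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    attribution = (x * x.grad).detach().squeeze(0).sum(dim=0)  # sum over channels
    return attribution  # note: pixels encoded as exactly 0 always receive 0 attribution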

4.6.4 Towards Robust Interpretability

Researchers such as Bansal et al. (2014), Lakkaraju et al. (2017), and Zhang et al. (2018) have all contributed to the study of identifying NN flaws and biases. For example, Bansal et al. (2014) created a model-agnostic approach to determine when a NN is unlikely to make a correct prediction. Instead of an alarm, the model would issue a cautionary message along the lines of "Do not trust these predictions." They did this by adding a set of binary labels to all failed images and then grouping them together in the attribute space, so that each group represents a potential source of failure.

Lakkaraju et al. (2017) proposed two fundamental hypotheses to efficiently identify mislabeled instances with strong prediction scores in the dataset:
1. Each unsuccessful case is representative and informative enough.
2. High-confidence mislabeling results from systematic biases rather than random variation.
The images were then clustered into various groups, and a multi-armed bandit search approach was developed by treating each cluster as a bandit and choosing which clusters to query and sample at each stage. Zhang et al. (2018) used ground-truth relationships between attributes based on human common knowledge (fire-hot vs. ice-cold) to find representation biases.


 THINK IT OVER
Is attention interpretable? Attention is an important way to describe how neural models work. The assumption is that inputs (e.g., words) with high attention weights are responsible for the model output. However, attention does not correlate strongly with other well-founded feature importance metrics, and there exist alternative attention distributions for which the model produces nearly identical prediction scores. Attention scores are thus used to construct an explanation rather than providing an explanation directly. Probing the attention scores of parts of the model also degrades the model itself, and a reliable adversarial attention distribution must be trained to test the claim (Wiegreffe and Pinter 2019; Serrano and Smith 2019; Jain and Wallace 2019).

In summary, to verify the robustness of a claim, we must investigate the falsifiability of the interpretations (Leavitt and Morcos 2020) using a valid hypothesis. Table 4.12 illustrates examples of a very weak, weak, average, and strong hypothesis, which change how we try to open the black box. These can be combined with the two IDL axioms:
1. Sensitivity: the model must be sensitive. A non-zero saliency for a feature is needed if two inputs differ only in that feature yet yield different predictions. Because ReLUs can lead to zero gradient despite differing inputs (for instance, if the pre-ReLU activation is smaller than 0), traditional gradient-based saliency violates this axiom.
2. Implementation invariance: if two networks are functionally equal (they generate the same output in response to all inputs), their saliency maps should be identical.

Doran et al. (2017) propose pairing an understandable model with a reasoning engine to realize XAI completely. In other words, an easily understandable model is one that provides symbolic representations of its outputs so that the user can readily draw connections between the input and output characteristics. This suggests building a black box that reveals more than just the output, where symbolic local interpretation and a knowledge base are used to shed light on a given prediction. A case-based, knowledge-intensive reasoner might fulfill this role, with feedback and explanation customization. The understandable component would need to be obtained by other means, but we expect that any model induction technique, specifically local interpretation, will suffice as a first step.
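Integrated Gradients, cited earlier, was designed with these two axioms in mind. The following is a minimal sketch, assuming a differentiable PyTorch model; the all-zero baseline and the number of interpolation steps are illustrative choices, not prescriptions from the text.

import torch

def integrated_gradients(model, image, target_class, steps=50):
    """Approximate IG: average the gradients along the straight path from a baseline
    (here, an all-zero image) to the input, then scale by (input - baseline)."""
    model.eval()
    baseline = torch.zeros_like(image)
    total_grad = torch.zeros_like(image)
    for k in range(1, steps + 1):
        alpha = k / steps
        x = (baseline + alpha * (image - baseline)).unsqueeze(0).requires_grad_(True)
        score = model(x)[0, target_class]
        total_grad += torch.autograd.grad(score, x)[0].squeeze(0)
    avg_grad = total_grad / steps
    return (image - baseline) * avg_grad  # attribution per input element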


Table 4.12 An example of hypothesis testing, adapted from Leavitt and Morcos (2020)

Very weak hypothesis: The foundation of DNN function is feature-selective neurons.
Remarks:
– Not falsifiable.
– The term "foundation" is ambiguous; how do you determine whether something is the "foundation of DNN function"?
– There is no sense of a baseline for comparison.

Weak hypothesis: If feature selectivity is critical for DNN function, we should look for neurons that are feature-selective.
Remarks:
+ Falsifiable! This hypothesis is proven untrue if there are no feature-selective neurons.
– It is frequently difficult to demonstrate the absence of anything. What if you simply did not look in the correct place?
– We do not know what to expect by chance: how many feature-selective neurons should arise by chance?
– Just because there are feature-selective neurons does not mean that they are important.

Average hypothesis: If feature selectivity is required to maximize test accuracy, ablating feature-selective single neurons should result in a loss in test accuracy.
Remarks:
+ Falsifiable.
+ Addresses the issue of causality.
+ "Necessary" is a much more precise statement than "important," and it leads to a specific experiment to test the notion.
– No alternatives discussed.
– No discussion of a baseline. Ablating all neurons could reduce test accuracy. Does this satisfy the hypothesis?

Strong hypothesis: If feature selectivity in single neurons is needed to maximize test accuracy, ablation should reduce accuracy in proportion to the neuron's feature selectivity. Alternatively, if networks rely more on feature selectivity distributed across neurons than in individual neurons, then zeroing activity in feature-selective directions (i.e., a linear combination of units that represents curves) should cause a decrease in test accuracy proportional to the strength of feature selectivity and exceed the decrease from ablating only single units.
Remarks:
+ Falsifiable.
+ Makes specific, testable predictions.
+ Presents a number of opposing hypotheses.
+ Provides a benchmark for comparison (single units vs. non-axis-aligned linear combinations).
– Verbose.

Summary
Though serious development is underway to extend understandability for series data and numerical inputs, handling image and video data, with their spatial-correlation and transformation-invariance eccentricities, requires a unique and creative outlook on the problem statement. In this book, we have investigated existing visualization techniques and mathematical modelling to understand the decision-making policies of DL models. However, an in-depth study is still needed on developing regularization and operations on image pixels with adaptive rules. This would help present an end-to-end architecture that contributes to explainable AI for image classification tasks.


In the future, we seek to develop a scalable interpretable model that applies to a vast range of scenarios, owing to the non-scalable nature of existing interpretable practices, which limits the scope of application. Having said that, we also need to understand that we should not force-fit interpretability into every method if the application does not require it. The take-home message is that interpretability is the major need of the hour, given the increasing application of AI to practical scenarios where certain biases in the model can be a matter of life and death. In 2018, the European Union also introduced the Right to Explanation under the GDPR. Despite its admitted lack of clarity, the right entitles a subject to receive meaningful information about the logic involved in a decision. The need for interpretability of deep learning models lies in the fact that the so-called black-box models are not biased by themselves; the root lies in biases in the data, adversarial examples, and human notions, which makes it important to validate the model's accuracy. We believe that with fairness and transparency fueled by explainability in the DL field, we can make the world a better place.

 Reading List
1. "Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy" by Wang et al. (2021) for an overview and critical analysis of progress in different variants of state-of-the-art GAN architectures and loss functions.
2. "GAN Inversion: A Survey" by Xia et al. (2022) is a curated list of GAN inversion methods, datasets, and related information.
3. "The (Un)reliability of Saliency Methods" by Kindermans et al. (2019) discusses how saliency methods fail to provide reliable explanations when two networks process images in identical manners. A new evaluation criterion, input invariance, requires that the saliency method mirror the sensitivity of the model with respect to transformations of the input: input transformations that do not change the network's prediction should not change the attribution either.
4. "Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods" by Slack et al. (2020) proposes a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation.
5. "Interpretation of NNs Is Fragile" by Ghorbani et al. (2019) demonstrates how to generate adversarial perturbations that produce perceptively indistinguishable inputs that are assigned the same predicted label, yet have very different interpretations.

Chapter 5

Fuzzy Deep Learning

Previous chapters have discussed the learning, performance and explainability of NNs based on crisp inputs, weights, parameters, training samples and other pieces of information. It was demonstrated how a deep network propagates information through its layers. Any classifier that employs a deep-layer architecture, on the other hand, optimizes millions of parameters in order to reach a conclusion. This renders DNNs non-transparent, with a high level of abstraction. The current research direction emphasizes the interpretability and explainability of black-box models in order to gain trust for their practical implementation. Moreover, in real-life applications, we do not necessarily get the same inputs and ideal decision-making abilities. This chapter examines model explainability using fuzzy metrics and its impact on performance as opposed to the model's knowledge discovery. It is impossible to introduce all fuzzy interpretability applications here, nor is it relevant for this book. The core focus is to introduce the impact of fusing fuzzy logic with neural networks, improve prediction accuracy and, most importantly, understand the reliability of the black-box behaviour of heavy neural network models.


The chapter will go over the learning techniques of a fuzzy neural model as well as use-cases in various domains. Finally, it will present a fuzzy model architecture inspired by a mixed system of interpretability. This is to provide a broader framework for competently interpreting an existing technique. Introducing fuzziness to NNs results in networks with fuzzy signals, fuzzy weights, fuzzy rules, membership functions, and other information that paves the way for dealing with the model's ambiguity and imprecision in learning data representations. However, at the time of writing this book, advanced deep learning architectures, such as the ones used in ResNet, were not amenable to the incorporation of fuzzy logic; the authors expect that the field will have evolved significantly by the time of reading, and that readers will be able to connect the dots of interpretability through fuzzy logic, from the basic concepts presented here to the new systems that will have emerged.

5.1 Fuzzy Theory

Real-world experiments are always full of uncertainty, which must be considered when modeling a sensitive application. We are already familiar with one type of ambiguity, which serves as the foundation of probability theory. The ambiguity in probability theory, however, is related to the uncertainty of certain events or their outcomes. Fuzziness, on the other hand, is associated with ambiguity or uncertainty in explaining or interpreting a specific unambiguous crisp entity, often linguistically. Let us consider an example to better understand the concept of fuzziness. Imagine that a car of a particular brand and model costs USD 40,000 in the year 2020. This is certain information with no ambiguity, and it can therefore be considered a crisp value. A relevant question for a potential buyer, however, is whether the car's price is considered budget, mid-range, or luxury (Fig. 5.1). This question could be about both affordability and social status. The answer is somewhat ambiguous, as it cannot be said crisply whether this car falls in the mid-range, or whether a car costing USD 42,000 falls in the luxury range simply because it costs USD 2,000 more. Such situations in everyday life necessitate a different space than the uncertainty of probability theory and the crispness of nominal values.

Fig. 5.1 Example of crisp values and corresponding fuzzy labels of the price of cars


Such a space was formulated by Zadeh (1965) and is known as the fuzzy set. Probability and fuzzy logic address different forms of uncertainty. While both fuzzy inference and probability theory can express degrees of certain varieties of subjective knowledge, fuzzy set theory uses the concept of fuzzy set membership, i.e., how much an observation is within a vaguely defined set. The formulation of the concept of a fuzzy set resulted in the development of fuzzy logic, a mathematical framework for working with fuzzy concepts using fuzzy sets. In the following subsections, we present the concepts of fuzzy logic which will serve as a foundation for the subsequent sections of this chapter.

5.1.1 Fuzzy Sets and Fuzzy Membership

We first present the crisp set, which is more familiar to most of us, and then move on to the fuzzy set. A crisp set is defined as a collection of elements, for example, S = {x | x ∈ N and x > 5}, that establishes a general universe of discourse. Here, N is the set of natural numbers, and each natural number is either part of our set or not. Crisp sets may also include continuous domains, but in any case the following basic rule applies: each crisp set has a crisp definition, in the sense that an entity either belongs to a specific crisp set or not. Fuzzy sets are similar, but significantly more generalized. Fuzzy sets are designed to provide an answer to the following questions: Can an element be a part of the set partially? What about the values that lie very close to the boundary of a crisp set? How do we consider the membership of an element that is slightly outside the boundary of a crisp set but that human inference suggests 'more-or-less' belongs to the set? The definition of fuzzy sets is derived through the concept of the Membership Function (MF), a function whose value lies in the interval [0, 1] for all values of x in the universe of discourse. A MF is used to check whether an element is a member of a set or not. For a crisp set S, the MF μ_S(x) is defined such that

\[
\mu_S(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{otherwise} \end{cases}
\]

A fuzzy set is one in which the MF can take values in the range [0, 1], rather than just the discrete values 0 and 1 shown above. In this manner, fuzzy sets differ from crisp ones in that the membership of an element in a fuzzy set is not rigidly determined, but can be any real value in the interval [0, 1]. Fuzzy sets allow us to define uncertain and subjective terms like 'young', 'many', and 'few' mathematically: we can treat a particular value in our universe of discourse as part of a set with partial participation, or partial membership.


Fig. 5.2 Four common types of membership functions are illustrated here

To clarify further, consider the following example. Suppose we define a fuzzy set as the set of all 'bright' students. Then, for the grades of all students, which lie in the range [0, 10], we can define the grade of a student, say x, to have a membership value μ_bright(x), denoting the grade x's membership in the fuzzy set of bright students. As a result, a fuzzy set S is defined as an ordered pair of a crisp value and its MF value, as shown below:

\[
S = \{(x, \mu_S(x))\} \tag{5.1}
\]

where x is in the universe of discourse. The four most widely used MFs are presented below, illustrated graphically in Fig. 5.2 and in Listing 5.1.1.

Triangular: It has three defining points: low (a), mid (b), and high (c). The low and high points limit the MF and serve as the two extrema. The mid point is an arbitrary value in between, and the trend is linear on either side of it. Mathematically, this membership function is defined by Eq. 5.2:

\[
\mu_S(x) =
\begin{cases}
0 & x \le a \ \text{or}\ x \ge c \\
\frac{x-a}{b-a} & a < x \le b \\
\frac{x-c}{b-c} & b < x < c
\end{cases}
\tag{5.2}
\]
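Since the listing referred to above is not reproduced here, the following is a minimal NumPy sketch of four commonly used membership functions. The triangular form follows Eq. 5.2; the trapezoidal, Gaussian, and sigmoidal forms are standard choices included for illustration and are not necessarily the exact four of Fig. 5.2.

import numpy as np

def triangular(x, a, b, c):
    """Triangular MF as in Eq. 5.2: 0 outside [a, c], peak of 1 at b (assumes a < b < c)."""
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF: rises on [a, b], flat at 1 on [b, c], falls on [c, d] (assumes a < b, c < d)."""
    x = np.asarray(x, dtype=float)
    rise = (x - a) / (b - a)
    fall = (d - x) / (d - c)
    return np.clip(np.minimum(np.minimum(rise, 1.0), fall), 0.0, 1.0)

def gaussian(x, mean, sigma):
    """Gaussian MF centred at `mean` with spread `sigma`."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

def sigmoidal(x, slope, midpoint):
    """Sigmoidal MF: smooth transition from 0 to 1 around `midpoint`."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))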
1.5 Err_{q_0−1}), where q_0 < p, and a flag Drift is set to true when this occurs. A reduction in the error trend count observed at p > q_0 is taken as evidence that drift is no longer observed and that it was short-term erroneous behaviour, and thus the flag Drift is reset to false. The decision to make a new network is made if any of the following is true (see the sketch at the end of this discussion):
• The flag Drift is set to true and C_err,up ≥ 0.1W. This implies that a concept drift has been identified over a sustained period of time.
• C_err,up ≥ 0.2W. This indicates that the existing networks have become plastic and cannot adapt to the data trends.
• W = W_max. This implies that the networks have become optimal for the current trend. No further learning is required for the existing networks; new networks will be needed for newer trends in the future, introducing diversity for future agility without deteriorating the currently optimal networks.
There are several approaches to dealing with concept drift once it has been identified.

Typically, it is investigated whether the network is capable of accommodating the revised trends or has become sufficiently plastic that a change is difficult to induce. Drastic measures, such as reducing the weights of some strong rules, pruning some links, or removing rules, may be required to nudge the network out of stability. This is followed by retraining the network on a new batch of data. It may even be prudent to create a new network and relegate the now-plastic network to the status of a knowledge model that is relatively obsolete for the time being but may become relevant again in the future. Alternatively, if recent trends show a mixed bag of patterns, with intermittent data instances indicating concept drift and the rest fitting well to the existing concept, we may resort to using multiple networks concurrently in the form of an ensemble of networks. In any of these cases, a memory (potentially both long and short term) is required. Furthermore, memory may be required for either the data or the previous knowledge models. However, it is also beneficial to include mechanisms that prevent the stability-plasticity dilemma from occurring in the first place; among these is the forgetting mechanism. The mechanisms for dealing with concept drift are discussed in the following subsections.
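A minimal sketch of the new-network decision described above follows; the window length W, its maximum W_max, and the rising-error counter C_err,up come from the text, while all surrounding bookkeeping (how the counter and flag are maintained) is only assumed here.

def needs_new_network(drift_flag, c_err_up, window, window_max):
    """Decide whether a new network should be created, following the three
    conditions in the text: sustained drift, a plastic network, or a full window."""
    sustained_drift = drift_flag and c_err_up >= 0.1 * window
    network_too_plastic = c_err_up >= 0.2 * window
    window_saturated = window == window_max
    return sustained_drift or network_too_plastic or window_saturated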

5.2.5.2 Data Memory and Data Base

As is evident from the description of concept drift, some monitoring of at least the loss function over time is needed. Often, even the data instances in the temporal window used for monitoring are stored. This corresponds to a short-term data memory. However, a slightly deeper perspective is needed on the utility of data storage. When concept drift is detected, the new concept must be learned, and this requires a collection of data instances from the recent past. Such a short-term memory thus serves two purposes: monitoring the concept drift and serving as the training dataset for learning the new concept. Sometimes, this dataset is maintained as the past W data instances. On other occasions, data instances that indicate concept drift and the remaining data instances are kept in some proportion. Sometimes the relevance of data is judged in terms of age, the firing strength generated, or the amount of weight update. It is also relevant to maintain a long-term memory. Either representative data from the window of concept drift are stored, or a computational representative is created, for example, by finding centroids of the data used for training after concept drift is detected. In some designs, data instances at regular intervals are submitted to the long-term data memory.

5.2.5.3 Knowledge Memory and Knowledge Base

The knowledge model of the NFS is based on what each network has learned, specifically the rule base. When a concept drift is detected, the question is whether we should completely discard the obsolete knowledge model(s) or keep them. In general, keeping the old knowledge model is preferred because it may become relevant again. However, how, and how much, of the obsolete knowledge should be preserved must be considered.


When concept drift is detected, the simplest solution is to copy the network, or at least the rule base, into a knowledge memory and then retrain the current network to compensate for the concept drift. However, this can quickly become memory intensive and redundant. Another approach is to use a network ensemble. When a concept drift is detected in this approach, a new network is created in the ensemble and trained from scratch. The advantage of this approach is that it avoids the stability-plasticity problem. It also supports the possibility of having one knowledge base per trend (representing the data distribution associated with it). Furthermore, it enables parallel evaluation of multiple concepts (i.e., multiple knowledge models) by making predictions using all of the networks in the ensemble and combining their predictions using some weighted combination; the weights of the networks, for example, could be based on how old they are. As a result, the possibility of an old network becoming relevant again is accommodated, and the need to re-learn an old concept by adapting the network is avoided. However, this is a memory-intensive approach, and the number of networks can continue to grow indefinitely.

The episodic memory concept is another interesting approach to storing a knowledge base while retaining accessibility. The hippocampus is well known to solve the stability-plasticity problem in the human brain, recognize novel information, and perform sequential learning (Nadel et al. 2000). It is also known to chunk concepts in the brain to improve the ability to store more information. Each network is represented using a few simple semantic indicators, which are collectively called its memory trace. Although the network might have been learned much earlier, it is easily recalled when a match is detected with its memory trace, similar to episodic memory. For example, humans may fail to recall an individual word but are able to remember its starting letters or rhyming words; sometimes, a fragrance can trigger the recall of an elaborate experience. In this sense, the entire network becomes interpretable through the memory trace that represents it. The mathematical formalism of Multiple Trace Theory (MTT) (Hintzman and Block 1971) is presented as follows. A list of memories m̄_n is collected in a matrix M given in Eq. (5.10):

\[
M = [\bar{m}_1\; \bar{m}_2\; \ldots\; \bar{m}_n\; \ldots\; \bar{m}_N] \tag{5.10}
\]

When a probe item p̄ with attributes equivalent to a memory trace arrives, the Euclidean distance between p̄ and each m̄_n in M is determined as d_n = ‖p̄ − m̄_n‖. A similarity metric is then computed using Eq. 5.11:

\[
s(\bar{p}, \bar{m}_n) = \exp(-d_n) \tag{5.11}
\]

The memory that provides the highest similarity and passes a certain criterion is recalled. In the context of MTT, an entire network may be memorized and cached out. The memory vector of a network, m, contains N_I elements m_i, where N_I is the number of attributes of the input variable x and

\[
m_i = \frac{\max_{j_i}\{\nu^{\text{right}}\} - \min_{j_i}\{\nu^{\text{left}}\}}{2} \tag{5.12}
\]

is the center of the region spanned by all MFs of the ith attribute in that particular network. Thus, while an entire network may be cached, only its memory vector, with a relatively much smaller footprint, must be retained in active memory. When new data arrive, in order to decide whether a cached memory should be recalled, the Euclidean distance d = ‖x_p − m‖ is computed for each cached network. For the cached network with the lowest value of d, if exp(−d) > t_0, where t_0 is a threshold value, then that network is recalled into active memory. In exchange, the network in active memory with the least weight is cached away. There are other methods for memory-efficient knowledge preservation. For example, instead of storing entire networks, we may build a global knowledge base in which the rules and some meta-information about them are stored. In this manner, the rules entering the global knowledge base can be checked for redundancy, and instead of storing two similar rules, the weight of an already existing rule may be boosted. It also accommodates the possibility that the global database can be used to initiate a new network when needed. Additionally, the global knowledge base of rules is a simple and compact way to collect interpretable concepts on the time-varying data distribution.
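A minimal sketch of the recall rule in Eqs. (5.10)-(5.11) follows; the memory vectors, probe construction, and threshold t_0 are placeholders for whatever a particular NFS implementation uses.

import numpy as np

def recall_cached_network(probe, memory_vectors, threshold):
    """Return the index of the cached network whose memory trace best matches
    the probe (Eq. 5.11), or None if no similarity exceeds the threshold."""
    probe = np.asarray(probe, dtype=float)
    best_index, best_similarity = None, -np.inf
    for idx, trace in enumerate(memory_vectors):
        distance = np.linalg.norm(probe - np.asarray(trace, dtype=float))
        similarity = np.exp(-distance)          # Eq. 5.11
        if similarity > best_similarity:
            best_index, best_similarity = idx, similarity
    if best_similarity > threshold:             # recall only if the criterion is passed
        return best_index
    return None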

5.2.5.4 Forgetting Mechanisms

Rules that are not frequently invoked are likely to become obsolete. The weights of such rules are susceptible to short-term forgetting. According to decay theory, as time passes, information becomes less available for later retrieval; a memory trace is created whenever a memory is formed, but it slowly degenerates over time. Following this theory, the weights of rules that are not invoked can be subjected to short-term forgetting using a multiplicative forgetting factor λ ∈ (0, 1), as shown below:

\[
w_k = \lambda\, w_k \tag{5.13}
\]

Despite its simplicity, this poses a problem: how do we know when a rule should be decayed? If a decay is introduced every single time the rule is not invoked, the rules can decay very quickly, become obsolete, over-compensate for the concept drift, and even render the network unstable. The Brown–Peterson theory hypothesized that the forgetting rate is a function of displacement, i.e., the time since the last invocation of the rule (Brown 1958). Altmann proposed a forgetting model that integrates decay and interference (Altmann and Gray 2002). Accordingly, the forgetting coefficient for memories not recalled immediately should be an exponentially decaying function that includes the time lapse since the last recall and the number of memories that can be stored in the short-term memory. This is used later in updating the weights of the rules not immediately recalled.


This concept is implemented by making the forgetting factor λ a function of displacement, as follows:

\[
\lambda = \exp\left(-\frac{p - p^{in} + 1}{\sqrt{s}\; n_{in} N}\right) \tag{5.14}
\]

where n_in is the number of times the rule was invoked and N is the total number of rules in a network. The difference p − p^in gives the time elapsed since the rule was last learned, and the division by s provides the idea of displacement: the first s rules stored in the memory will have a higher λ than the next s rules, and so on. It is useful to keep the value of λ close to 1 so that the rules do not become obsolete too quickly and the network does not become unstable.

 Short-term forgetting during weight update. Short-term forgetting can be incorporated into the weight update as well, so that even when a rule is invoked, some element of stability can be reduced, thereby improving the agility to learn new trends. Furthermore, both functional decay theory and displacement theory can be included. Consequently, a time decay factor φ can be included in the weight update, as shown below:

\[
w = \phi\, y^p x^p + w^p \tag{5.15}
\]

Furthermore, the decay can also be incorporated into time-varying updates, for example the sliding threshold θ(p) used in associative-dissociative learning. Accordingly, the sliding threshold θ with the incorporation of the time decay factor is calculated using Eqs. (5.16)-(5.18):

\[
\theta = \frac{\theta^{raw}}{\theta^{scale}} \tag{5.16}
\]

with

\[
\theta^{raw} = \theta^{raw,in}\, \lambda^{(p - p^{in})} + (o^p)^2 \left(1 - \lambda^{(p - p^{in})}\right) \tag{5.17}
\]

\[
\theta^{scale} = \theta^{scale,in}\, \lambda^{(p - p^{in})} + 1 - \lambda^{(p - p^{in})} \tag{5.18}
\]

where p and p^in indicate the index of the current data and of the last time the rule was invoked, and o^p indicates the output strength for the current data. The superscript 'in' indicates the values of the corresponding variables after the last update. In terms of interpretation, θ^raw is the non-linear function of the postsynaptic activity, i.e., the output, and θ^scale is the scaling factor for the time lapse between the last activation p^in and the current activation p. The term λ^(p−p^in) essentially captures the forgetting of the values over time.

 Normalization after forgetting. The weights of the rule base are normalized by a (0, 1) mapping so that the weight of the strongest rule is 1.


As a consequence of decay, the weights of rules can diminish even if they are relatively stronger than the other rules. At the same time, a relatively strong rule with a numerically low weight may lead to a significant weight increase during the next weight update. However, if such a rule is already normalized to be close to one, then the weight update is small and saturation is avoided.
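A minimal sketch of the short-term forgetting step of Eqs. (5.13)-(5.14), followed by the normalization just described; the per-rule bookkeeping (last-invocation indices and invocation counts) is assumed to be tracked elsewhere, and the guard against zero invocation counts is an added assumption.

import numpy as np

def forget_and_normalize(weights, last_learned, invocations, p, s, n_rules):
    """Apply the displacement-based forgetting factor (Eq. 5.14) to each rule
    weight (Eq. 5.13), then rescale so the strongest rule has weight 1."""
    weights = np.asarray(weights, dtype=float).copy()
    for k in range(len(weights)):
        lam = np.exp(-(p - last_learned[k] + 1) /
                     (np.sqrt(s) * max(invocations[k], 1) * n_rules))
        weights[k] *= lam                      # Eq. 5.13: w_k <- lambda * w_k
    max_w = weights.max()
    if max_w > 0:
        weights /= max_w                       # normalization after forgetting
    return weights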

5.3 Case Studies

In this section, we present two case studies. The first case study is about the evolution of a special family of NFS algorithms. The goal is to outline the scope of NFS's flexible design and how a variety of concepts relatable to humans can be incorporated into NFS, which, in addition to NFS's inherent interpretability, supports further interpretability of NFS's inner workings. The second case study describes an approach for integrating deep learning and fuzzy learning.

5.3.1 POPFNN Family of NFS—Evolution Towards Sophisticated Brain-Like Learning

As a case study, we will look at a family of neural networks that began with a basic implementation of NFS with a pseudo-outer-product (POP) based rule generation concept and evolved into incremental and pseudo-incremental NFS replete with several of the concepts described in the previous section. The members of the POP-based fuzzy neural network (FNN) family are presented chronologically, including their main features and an indication of the family's evolution.

 POPFNN (Zhou and Quek 1996). Gaussian membership functions were used along with the backpropagation algorithm for learning the fuzzy labels. This was followed by the pseudo-outer product for learning the rules and the truth value restriction method for making inferences with the rules. This approach was later shown to be amenable to a reinforcement learning concept as a clustering approach for identifying fuzzy labels (Wong et al. 2009).

 POPFNN-AAR(S) (Quek and Zhou 1999). Compared to POPFNN, it finds fuzzy labels with Kohonen's partitioning using modified learning vector quantization. Further, instead of using the truth value restriction method of inference, it uses the approximate analogical reasoning schema (AAR(S)). In AAR(S), a similarity measure is used to judge the closeness of an observation A′ to an antecedent A of a rule, and if the closeness is deemed below a threshold then the output of the rule B′ is computed from the consequent B based on the closeness between A and A′. Furthermore, although POP is used for identifying the rules first, backpropagation is used in the last stage to fine-tune the entire network.

 LazyPOP (Quek and Zhou 2001). While this method is similar to the original POPFNN in many ways, it reduces the rule base created by POP.


POP generates an exhaustive set of rules that includes all possible combinations. LazyPOP, like POP, is a one-pass algorithm, but it considers only the rules that are fired by the inputs in the training dataset. This is accomplished by defining invalid input fuzzy labels, which are labels that are unrelated to any output (and thus do not fire any rule), and irrelevant output fuzzy labels, which are output labels that do not get fired and are thus unrelated to any input fuzzy label (which consequently means that the rules corresponding to them are irrelevant). LazyPOP was the first sign that the POP family was shifting toward more efficient and sparse NFS solutions.

 POPFNN-CRI(S) (Ang et al. 2003). This is the first time the trapezoidal membership function was used in the POPFNN family. Fuzzy labels were learned using Kohonen partitioning; however, it was reported that POPFNN could flexibly use other partitioning techniques for identifying fuzzy labels. POP was used to learn the rules. However, the inference mechanism employed here is the compositional rule of inference. This allows for the combination of multiple rules which, when triggered, support a more powerful and organic use of NFS.

 POP-Yager (Quek and Singh 2005). Gaussian membership functions are used and the fuzzy labels are learnt using Kohonen's partitioning. POP is used to learn the rules, but the Yager inference mechanism is used for inference.

 Rough-sets based POPFNN (RSPOP) (Ang and Quek 2005). RSPOP incorporated the theory of rough sets to identify a high-quality and efficient set of fuzzy labels. It uses two main concepts, namely the reduct and the core. A knowledge reduct is its essential part, which is sufficient to define all basic concepts occurring in the considered knowledge, whereas the core is the most important part of the knowledge. These concepts are used to identify good fuzzy labels that are distinguished by their indispensability and consistency. Furthermore, it is demonstrated that the concepts of reduct and core can be applied to the rules layer, resulting in a smaller but more effective rule base. This is accomplished by first employing POP and then identifying the necessary and consistent rules. Because it can derive the necessary concepts, the resulting NFS is more efficient, compact, and interpretable. However, it is an iterative approach that is used for both attribute reduction and rule reduction. Similar concepts of rule reduction by ambiguity correction were also explored (Tan and Quek 2009; Cheu et al. 2012). However, RSPOP has stood the test of time not only because of its sound theoretical concepts, but also because of its suitability for incremental and online learning.

 Incremental ensemble RSPOP (ieRSPOP) (Das et al. 2016). As the POP family moved to incremental learning through ieRSPOP, it incorporated several advanced features. For example, discrete incremental clustering (DIC) with a trapezoidal membership function was employed (Tung and Quek 2002), in conjunction with rough-sets-based attribute reduction. The compositional rule of inference is employed. In addition, LazyPOP is used in the initial phase to create a rule base. During the incremental learning phase, Hebbian learning is used as more data streams in.


Further rough-sets-based rule and attribute reduction is used during incremental learning, but only after a certain amount of historical data has been accumulated. Every time a rough-sets-based rule and attribute reduction is performed, the historical data are dynamically pruned. Simple concept drift mechanisms are used; however, whenever concept drift is deemed significant, a new network is created in an ensemble. While all networks in the ensemble are used to predict the output and a weighted sum is used to compute the final output, the weights are computed as a function of age so that the most recent networks contribute the most to the prediction. As a result, the use of dynamically pruned historic databases and age-integrated network weighting incorporates forgetting mechanisms.

 Pseudo ieRSPOP (PIE-RSPOP) (Iyer et al. 2018). At the time of writing this book, this was the latest development in the POP family. It incorporated several further advanced concepts, such as associative-dissociative learning, multiple-trace-theory-based long-term knowledge memory, sophisticated drift mechanisms, both long- and short-term forgetting that incorporates both decay and interference theories, a global and highly interpretable knowledge base, and a finite data memory to facilitate offline learning when needed.

Summary
The evolution tree of POPFNN is presented in Fig. 5.12. NFS architectures enable the flexible incorporation of several interpretable learning and inference concepts. Humans can easily interpret not only the knowledge models, but also the architectures and theory mechanisms.

Fig. 5.12 The evolution tree of the POPFNN family is presented here, together with the most important highlights that led to improved sophistication and the inclusion of brain-like mechanisms in order to strike a balance between interpretability and accuracy and reduce the stability-plasticity problem. The chronological order is from top to bottom, the top being the earliest


This enables better NFS design for real-world applications, with little to no dependence on the heuristics that currently rely heavily on human knowledge or decision making.

5.3.2 Combining Conventional Deep Learning and Fuzzy Learning

Out of the several approaches explored for combining conventional deep learning and fuzzy learning, we present one example (Xie et al. 2021) as a case study. The architecture of this method is presented in Fig. 5.13. The linear system in this case can represent any conventional deep learning architecture. Its function in this architecture, however, is to map the fuzzy space of the input variables to the fuzzy space of the output variables. This differs from traditional deep learning, which maps crisp input variables to crisp output variables. Using the different membership functions of the fuzzy labels, the input variable is mapped to its fuzzy space. The linear system then generates output that is located in the output fuzzy space and represents the firing strengths of the various membership functions of each output variable. This enables the linear system's output to be converted into the final output through defuzzification. Therefore, deep learning can operate in the fuzzy space. We note that when dealing with numeric values in a fuzzy space, the Takagi-Sugeno model of fuzzification/defuzzification is better suited. It is also clear that the linear system performs the layer-III NFS task of representing the rule base and inference mechanism without the use of a fuzzy approach.

Now let us look at how the linear system's knowledge is supported in terms of interpretability. In this approach, a parallel block is included to generate fuzzy rules. This block takes the fuzzifier output (i.e., the fuzzified input variables) and the linear system output (i.e., the fuzzified output variables) and generates a fuzzy rule base. Clustering in the joint space, the pseudo outer product, or any other fuzzy rule learning mechanism can be used to form the rule base.
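The data flow of Fig. 5.13 can be sketched as follows; the Gaussian fuzzy labels, the small random network standing in for the 'linear system', and the centroid-style defuzzification are illustrative assumptions rather than details taken from Xie et al. (2021).

import numpy as np

rng = np.random.default_rng(0)

def fuzzify(x, centers, sigma=1.0):
    """Map a crisp input vector to firing strengths of Gaussian fuzzy labels."""
    return np.exp(-0.5 * ((x[:, None] - centers) / sigma) ** 2).ravel()

def defuzzify(strengths, label_centers):
    """Weighted-average (centroid-style) defuzzification of output label strengths."""
    return float(strengths @ label_centers / (strengths.sum() + 1e-9))

# Illustrative dimensions: 2 crisp inputs x 3 labels each -> 6 fuzzy inputs,
# mapped to firing strengths of 3 fuzzy labels of one output variable.
in_centers = np.array([0.0, 0.5, 1.0])          # label centers shared by both inputs
out_centers = np.array([0.0, 0.5, 1.0])         # centers of the output fuzzy labels
W1, W2 = rng.normal(size=(6, 8)), rng.normal(size=(8, 3))

def predict(x_crisp):
    mu_in = fuzzify(x_crisp, in_centers)         # crisp -> input fuzzy space
    hidden = np.maximum(W1.T @ mu_in, 0.0)       # stand-in for the learned system
    mu_out = 1 / (1 + np.exp(-(W2.T @ hidden)))  # firing strengths in the output fuzzy space
    return defuzzify(mu_out, out_centers)        # output fuzzy space -> crisp output

print(predict(np.array([0.2, 0.9])))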

Fig. 5.13 Architecture proposed in Xie et al. (2021) that combines fuzzy learning and conventional deep learning


 THINK IT OVER
Does the architecture and approach shown in Fig. 5.13 represent ante-hoc or post-hoc interpretability?
The fuzzy labels of the input and output variables provide interpretability in terms of the data. However, they do not by themselves form the entire model that maps the input to the output; the task of mapping is accomplished by the learning system. In this sense, this particular approach embodies the principle of inherent, or ante-hoc, interpretability. Indeed, a fuzzy rule base is also learned. However, the learning of the linear system is independent and standalone with respect to the fuzzy rule base, and the fuzzy rule base simply presents an explanation of what the linear system does based on the observations. In this sense, the approach incorporates post-hoc interpretability. But as an overall architecture, would you consider it ante-hoc or post-hoc?

5.3.3 Overview of Fuzzy Deep Learning Studies

While we have witnessed an explosion in DNN learning in the last few decades, the FNN has not experienced the same revolution. Most fusions in NNs are ad hoc, poorly understood, and distributed rather than localized, and the explainability of these methods is very low, even when present. A popular way to handle both labeled and unlabeled data with incomplete features is with a FNN. In modeling, fuzzy sets help to deal with uncertainty caused by vagueness, imprecision, and ambiguity in the data (Chen et al., 2017). Multiple approaches have been proposed to address the drawbacks of DL, such as the implementation of fuzzy RBMs (Chen et al., 2015) for airline passenger profile data that are often incomplete and vague, and an early warning system for industrial accidents (Zheng et al., 2017). Zheng's model (Zheng et al., 2017) incorporates the concepts of deep BMs, Pythagorean fuzzy numbers (Ref. Appendix A.9) and sets (Yager, 2013). This book reflects several perspectives on the current SOTA, including the shortcomings of rule pruning for image recognition tasks. The idea of pruning rules in architectures with fuzzy kernels and fuzzy max-pooling in CNNs (Yazdanbakhsh and Dick, 2019) is intuitively appealing for dealing with the spatial-invariance and transformation-invariance properties of an image classification task. However, handling each pixel of an image with a set of adaptive fuzzy rules, while keeping the firing strength of each unit in check for interpretability, is at present too big an idea for the length of this chapter. DL models that incorporate fuzzy logic have been implemented in various studies and practical applications.


Table 5.2 Chronological overview of a few fuzzy DL models

Integrated models: fuzzy system as part of the learning mechanism

2020: Explaining Deep Learning Models Through Rule-Based Approximation and Visualization (Soares et al., 2020)
- Maps the multidimensional state space of vehicles and velocities into discrete sets.
- Approximation of Double Deep Q-Network learning with zero-order IF-THEN fuzzy rules to provide an alternative explainable model.
- The concept of 'MegaClouds' groups similar actions, reducing the dimensionality of the fuzzy rules.

2019: Enabling explainable fusion in deep learning with fuzzy integral neural networks (Islam et al., 2019)
- Applied in a fusion of sets of heterogeneous architecture models, enabling XAI.
- Fuzzy Choquet integral, a non-linear aggregation function, for improved SGD-based optimization.

2018: Stacked Auto Encoder trained using Fuzzy Logic (El Hatri and Boumhidi, 2018)
- Fuzzy system to train the network parameters of DNNs, aiming to prevent overshooting, minimize error and increase convergence speed during learning.
- Stacked auto-encoder fuzzy deep network for spatial and temporal correlations of traffic flow detection, trained using general backpropagation.
- The fuzzy logic system decides only the learning rate and momentum of the network.

2017: Developing a deep fuzzy network with Takagi Sugeno fuzzy inference system (Rajurkar and Verma, 2017)
- Identification of a three-layered TS fuzzy system using backpropagation.
- t-norm operator to evaluate Gaussian fuzzy MFs for its easy differentiability.
- Experimental analysis on non-linear system identification and the truck backer-upper problem shows the TS FIS outperforms a three-layered feed-forward ANN.

2016: Intra- and Inter-Fractional Variation Prediction of Lung Tumors Using Fuzzy Deep Learning (Park et al., 2016)
- Experimental overshoot analysis of fuzzy logic dropped drastically compared to CNNs and HEKFs.
- Fuzzy DL to cluster breathing parameters in vector form with reduced unnecessary feature metrics improves computational time and accuracy.

Ensemble models: fuzzy logic and deep learning in sequential or parallel fashion

2020: Transfer Learning for Toxoplasma gondii Recognition (Li et al., 2020)
- Fuzzy cycle GAN using transfer learning with the c-means clustering algorithm for the pre-defined structural shape of toxoplasma.
- Uses two CNNs with a fuzzy clustering algorithm and an adversarial loss architecture improved over the VGG network.

2020: A deep learning based neuro-fuzzy approach for solving classification problems (Talpur et al., 2020)
- Use of a Gaussian membership function and graph partitioning for rule generation with backpropagation learning.
- MATLAB toolbox for classification on three benchmark datasets: red wine, thyroid diseases and statlog.

2018: Automatic kidney segmentation in 3D pediatric ultrasound images using deep neural networks and weighted fuzzy active shape model (Tabrizi et al., 2018)
- Gabor filter to fuzzify input data and reduce noise.
- Makes use of a traditional deep CNN for detection and fuzzy shape weights for segmentation.
- Automatic initialization-based segmentation allows reproducibility with no a priori user knowledge and experience.

2018: A fuzzy convolutional neural network for text sentiment analysis (Nguyen et al., 2019)
- Lower standard deviation on 10-fold cross-validation showcases stability along with accuracy.
- Experimental data reflect superior performance on noisy datasets compared to a CNN model.
- Feature visualization of the input data has better interpretability on the t-SNE metric.

2018: A novel fuzzy deep-learning approach to traffic flow prediction with uncertain spatial-temporal data features (Chen et al., 2017)
- Tensor in fuzzy deep convolution to investigate temporal and spatial properties.
- The ensemble of the fuzzy system reduces the impact of data uncertainty and results in faster convergence with accuracy.

2017: A Hierarchical Fused Fuzzy Deep Neural Network for Data Classification (Deng et al., 2016)
- Multiple fuzzy rules reduce ambiguity while the deep hierarchical NN reduces noise for a clean fuzzy logic representation.
- Dropout strategy to improve FDNN training performance.
- Performance comparison with an alternative fusion of information-theoretic learning, BM and AE with fuzzy representation.

2016: Damaged fingerprint classification by Deep Learning with fuzzy feature points (Wang et al., 2016)
- Uses the Kriging method of radial basis functions to fuzzify input data.
- Uses a fuzzy graph of feature points, positions and relationships between points to improve the recognition rate.
- The Kriging method helps to highlight the relation between feature points and weakens their locations, creating higher rotation invariance.

2009: Mamdani Model-based Adaptive Neural Fuzzy Inference System and its Application (Chai et al., 2009)
- Proposed weight-updating formula based on backpropagation for Mamdani fuzzy rules in the ANFIS model.
- Experiments show the superiority of M-ANFIS over the general ANFIS model on traffic management data.

healthcare & image processing (Wang et al., 2016; Park et al., 2016), text processing (Nguyen et al., 2019; Tabrizi et al., 2018), and time-series prediction (Rajurkar and Verma, 2017; Zheng et al., 2017). A critical analysis of the chronological application of the Fuzzy Deep Learning model is summarised in Table 5.2. The review of models is categorised in two segments depending on the fusion of fuzzy logic either as Integrated system or Ensemble models of fuzzy inference system arranged in a sequential or parallel fashion.

Summary
In this chapter, we have explored and described current fuzzy deep learning approaches (Table 5.2), defined core concepts, and provided a proof-of-concept based on a layered ensemble of fuzzy logic systems that tests the firing strength of rules for improved interpretability.


Fig. 5.14 Illustration for self-assessment 1

The benefits of combining fuzzy inference with the performance of NNs can be realized in the long run. This chapter has gone into greater depth on the rationale behind current research efforts to address the fundamental limitations of DNNs by employing fuzzy models. On the one hand, the performance of models on noisy, incomplete, uncertain, or imprecise data has been greatly enhanced by including fuzzy theory as an intrinsic element of DL systems, through the use of fuzzy logic for training-parameter selection or adaptive fuzzy parameters. As a result, we have been able to avoid getting mired in local optima and instead search across a wider space. While deep fuzzy learning has strengthened explainability and transparency, it has also increased computational complexity. Present architectures are laborious to train due to the lack of suitable software acceleration tools for working with fuzzy parameters and weights. The nature of the data used in fuzzification also makes the development of a general model difficult.

 Reading List
1. "Techniques for learning and tuning fuzzy rule-based systems for linguistic modeling and their application" by Alcalá et al. (1999) presents a mathematical framework in which fuzzy rule-based systems are reviewed.
2. "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering" by Kasabov (1996) is a classical book on the concepts of fuzzy logic and fuzzy systems.
3. "Fuzzy neural networks and neuro-fuzzy networks: A review of the main techniques and applications used in the literature" by de Campos Souza (2020) presents a review of NFS techniques in a very compact footprint.


 Self-assessment
1. Consider the illustration of the membership functions of two fuzzy labels 'A' and 'B' of a crisp variable x (Fig. 5.14). Do the following:
• Write the expressions for the membership functions of A and B.
• Draw the membership function of a fuzzy label C such that it is the intersection of A and B, the fuzzy label D such that it is the union of A and B, and the fuzzy label E such that it is equal to NOT C.
• Draw the membership function of a fuzzy label F such that it is the bounded difference of A and B, and a fuzzy label G such that it is the bounded sum of A and B.
• Check whether C is a subset of D and of F.
2. Draw a diagram of a POPFNN architecture with two inputs, two fuzzy labels per input, one output and two fuzzy labels per output.
3. A restaurant management team is trying to understand the trend of tipping by its customers. It defines three variables, namely food quality (fq), service quality (sq) and tip quality (tq). Each of these variables has crisp values in the range [0, 10]. The team defines the following fuzzy labels, represented by trapezoidal membership functions with the parameters a, b, c, and d defined below:

Variable: Food quality
  Hopeless (μfq1): [a, b, c, d] = [0, 0, 2, 3]
  Delicious (μfq2): [a, b, c, d] = [5, 8, 10, 10]
Variable: Service quality
  Poor (μsq1): [a, b, c, d] = [0, 0, 4, 5]
  Good (μsq2): [a, b, c, d] = [4, 5, 5, 7]
  Excellent (μsq3): [a, b, c, d] = [6, 7, 10, 10]
Variable: Tip quality
  Lousy (μtq1): [a, b, c, d] = [0, 1, 1, 3]
  Average (μtq2): [a, b, c, d] = [2, 4, 4, 5]
  Excellent (μtq3): [a, b, c, d] = [4, 5, 9, 10]

• Draw the fuzzy membership functions for each of the variables.
• The above fuzzy terms are used in the formulation of a fuzzy expert rule system for the payment of tips. The quantum of the tip, in the range [0, 10], is derived from the fuzzy rules based on fq and sq. Here are the four rules used:

R1: If service is poor then tip is lousy
R2: If service is excellent and food is delicious then tip is excellent
R3: If food is hopeless then tip is lousy
R4: If service is good and food is delicious then tip is lousy

Use the above rule base to draw an NFS network architecture that represents the four rules R1–R4.
• Discuss how the inference operation can be selected so as to resolve the rule conflict when more than one rule is activated.

Appendix A
Mathematical Models and Theories

A.1 Choquet Integral

The fuzzy Choquet integral (ChI) of an observation h on X is

$$\int h \circ g = C_g(h) = \sum_{j=1}^{N} h_{\pi(j)} \left( g(A_{\pi(j)}) - g(A_{\pi(j-1)}) \right) \qquad \text{(A.1)}$$

for $A_{\pi(j)} = \{x_{\pi(1)}, \ldots, x_{\pi(j)}\}$, $g(A_{\pi(0)}) = 0$, and a permutation $\pi$ such that $h_{\pi(1)} \geq h_{\pi(2)} \geq \cdots \geq h_{\pi(N)}$.


A.1.1 Restricting the Scope of FM/ChI

In the paper by Islam et al. (2019), the authors state that the ChI, being a parametric function, turns into a specific operator once the fuzzy measure is determined. For instance: if $g(A) = 1$ for all $A \in 2^X \setminus \emptyset$, the ChI becomes the maximum operator; if $g(A) = 0$ for all $A \in 2^X \setminus X$, we recover the minimum; if $g(A) = |A|/N$, we recover the mean; and $g(A) = g(B)$ whenever $|A| = |B|$, for all $A, B \subseteq X$, gives a linear order statistic. Additionally, the authors note that the most familiar operators in practice, namely the mean, max, min, and their trimmed versions, are all linear order statistics and hence special cases of the ChI. In conclusion, the ChI covers a broad spectrum of aggregation needs.
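As a minimal illustrative sketch (not from the cited paper), the discrete ChI of Eq. (A.1) can be computed directly from a fuzzy measure stored as a dictionary over subsets; the measures g_max, g_min, and g_mean below reproduce the special cases listed above. All names are illustrative.

```python
import itertools
import numpy as np

def choquet(h, g):
    """Discrete Choquet integral (Eq. A.1) of observations h (length-N array)
    w.r.t. a fuzzy measure g given as {frozenset of indices: measure value}."""
    order = np.argsort(h)[::-1]          # permutation pi: h sorted in decreasing order
    total, prev, subset = 0.0, 0.0, frozenset()
    for idx in order:
        subset = subset | {idx}          # A_pi(j) = {x_pi(1), ..., x_pi(j)}
        total += h[idx] * (g[subset] - prev)
        prev = g[subset]
    return total

# Fuzzy measures for the special cases discussed above, on N = 3 inputs.
N = 3
subsets = [frozenset(c) for r in range(N + 1) for c in itertools.combinations(range(N), r)]
g_max  = {s: (1.0 if len(s) > 0 else 0.0) for s in subsets}   # g(A)=1 for A != empty set
g_min  = {s: (1.0 if len(s) == N else 0.0) for s in subsets}  # g(A)=0 for A != X
g_mean = {s: len(s) / N for s in subsets}                     # g(A)=|A|/N

h = np.array([0.2, 0.9, 0.5])
print(choquet(h, g_max), choquet(h, g_min), choquet(h, g_mean))  # 0.9, 0.2, 0.5333...
```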

A.1.2 ChI Understanding from NN

The authors attempt to use an NN to compute the fuzzy Choquet integral, which is useful as an aggregation operator. This matters for many of the claims made about what NNs such as CNNs can do. Mathematically, a CNN can encode filters (linear time-invariant filters such as a matched filter, a low-pass filter, or a Gabor filter), random projections, and combinations thereof, among many other operations. Far less attention, however, is given to interpreting the fusion performed inside the network: the choice of aggregator, be it union, average, intersection, or something more elaborate, and the aggregation functions a neural network can potentially realize deserve consideration. Understanding the fuzzy Choquet integral through neural networks gives a better sense of what is possible.

Example Consider the case of n = 2 and the NN outlined in Fig. A.1. The network output is

Fig. A.1 Network computing the ChI for n = 2. The blue neurons on the left are the n! linear convex sums (LCSs). The red neurons (nonlinearities) select a linear convex sum based on the input sort order, and the rightmost blue neuron sums the results. Multiple inputs to a node are summed


$$o = u(h_1 - h_2)\,(h_1 w_1 + h_2 w_2) + u(h_2 - h_1)\,(h_2 w_3 + h_1 w_4)$$

where $u$ is the unit (Heaviside) step function, which gives us

$$o = u(h_1 - h_2)\,[h_1 g(x_1) + h_2 (1 - g(x_1))] + u(h_2 - h_1)\,[h_2 g(x_2) + h_1 (1 - g(x_2))]$$

Thus, we have

$$o = \begin{cases} h_1 g(x_1) + h_2\,(1 - g(x_1)), & h_1 > h_2 \\ h_2 g(x_2) + h_1\,(1 - g(x_2)), & h_2 > h_1 \\ 0.5\,[h_1 g(x_1) + h_2\,(1 - g(x_1))] + 0.5\,[h_2 g(x_2) + h_1\,(1 - g(x_2))], & h_1 = h_2 \end{cases} \qquad \text{(A.2)}$$

This can be extended to any n without loss of generality. The authors use this simple network to convey the fact that neural nets can represent the ChI.
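A quick numerical check of Eq. (A.2), assuming that g1 = g({x1}) and g2 = g({x2}) denote the singleton measures of the fuzzy measure g; the function name is illustrative.

```python
import numpy as np

def chi_n2(h1, h2, g1, g2):
    """Network output o of Eq. (A.2); u(0) = 0.5 averages the two branches when h1 = h2."""
    u = lambda z: np.heaviside(z, 0.5)
    return (u(h1 - h2) * (h1 * g1 + h2 * (1.0 - g1))
            + u(h2 - h1) * (h2 * g2 + h1 * (1.0 - g2)))

print(chi_n2(0.8, 0.3, g1=0.6, g2=0.4))   # 0.8*0.6 + 0.3*0.4 = 0.60
```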

A.2 Deformation Invariance Property

An image is modeled as a function $x \in L^2([0,1]^2)$. Let $L_\tau x(u) = x(u - \tau(u))$ be a warping operator defined by a smooth deformation field $\tau: [0,1]^2 \to [0,1]^2$, and let $f: L^2([0,1]^2) \to \{1, 2, \ldots, L\}$ be an image classification function. Image deformation invariance is then defined as

$$|f(L_\tau x) - f(x)| \approx \|\nabla \tau\| \qquad \forall f, \tau \qquad \text{(A.3)}$$

A.3 Distance Metrics

Distance metrics are functions $d(x, y)$ such that $d(x, y) < d(x, z)$ if objects x and y are considered more similar than objects x and z. A distance of zero therefore corresponds to two objects that are exactly alike. The Euclidean distance $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ in n-dimensional space is a popular example. A ‘true’ metric must obey the following criteria:

1. $d(x, y) \geq 0$ for all x and y (non-negativity)
2. $d(x, y) = 0$ iff $x = y$ (positive definiteness)
3. $d(x, y) = d(y, x)$ (symmetry)
4. $d(x, z) \leq d(x, y) + d(y, z)$ (the triangle inequality)

In our case, the cosine similarity (Manning et al. 2008) computes the L2-normalised dot product between the row vectors of the images $I_x$ and $I_y$:

$$\mathrm{cosine\_similarity}(I_x, I_y) = \frac{I_x I_y^T}{\|I_x\|_2 \|I_y\|_2}$$

Therefore the pairwise cosine distance between the image matrices $I_x$ and $I_y$ is formulated as

$$\mathrm{cosine\_distance}(I_x, I_y) = 1 - \frac{I_x I_y^T}{\|I_x\|_2 \|I_y\|_2} \qquad \text{(A.4)}$$
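A minimal sketch of Eq. (A.4), flattening each image matrix into a row vector before the L2-normalised dot product; function and variable names are illustrative.

```python
import numpy as np

def cosine_distance(ix, iy):
    """Pairwise cosine distance between two images (Eq. A.4)."""
    vx, vy = ix.ravel().astype(float), iy.ravel().astype(float)
    cos_sim = vx @ vy / (np.linalg.norm(vx) * np.linalg.norm(vy))
    return 1.0 - cos_sim

ix = np.random.rand(8, 8)
print(cosine_distance(ix, ix))        # ~0: identical images
print(cosine_distance(ix, 1.0 - ix))  # larger for a dissimilar image
```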

A.4 Grad Weighted Class Activation Mapping

Class activation mapping (CAM) (Zhou et al. 2016) is a visualization algorithm that overlays a heatmap on the image, using the weights of the output layer on the convolutional feature maps to highlight the image regions most important to the learned decision. Here, Global Average Pooling (GAP) replaces the last dense layer: it averages the activations of each feature map into a vector that is fed to the final softmax layer. Grad-CAM is a more refined approach, in which the gradient of the score for a class of interest c is computed with respect to the activation maps $A^k$ of the final convolutional layer and averaged over the spatial dimensions of each feature map to yield an importance score $\alpha_k^c$:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$

To keep only the pixels that positively influence the class score of interest, a ReLU non-linearity is applied to the linear combination:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right) \qquad \text{(A.5)}$$

A sample Grad-CAM visualization is shown in Fig. A.2, with the ‘picket fence’ features highlighted in the top row and the ‘ladle’ grad-saliency in the bottom row.
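A minimal PyTorch sketch of the two equations above, assuming a recent torchvision. It captures the activations of the last convolutional block of a ResNet-18 (loaded with weights=None to keep the sketch self-contained; a pretrained model would be used in practice) and takes the top prediction as the class of interest.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))  # capture A^k

x = torch.randn(1, 3, 224, 224)                   # stand-in for a preprocessed image
score = model(x)[0].max()                         # y^c for the class of interest c
grads = torch.autograd.grad(score, feats["a"])[0] # d y^c / d A^k

alpha = grads.mean(dim=(2, 3), keepdim=True)                  # alpha_k^c: spatially averaged gradients
cam = F.relu((alpha * feats["a"]).sum(dim=1, keepdim=True))   # Eq. (A.5)
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear")  # upsample for overlaying on the image
```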

A.5 Guided Saliency

Figure A.3 briefly presents the fused approach of backpropagation and deconvolution used to obtain a more precise and comprehensive saliency map. Figure A.3a depicts the selective activation of features after the forward pass, propagated backwards for reconstruction. Backpropagation variations using a ReLU non-linearity are depicted in Fig. A.3b, while Fig. A.3c gives a mathematical formulation of propagating an output activation back through a ReLU unit in a specific layer l. The point to remember is that the latter two approaches, DeconvNet and guided backpropagation, compute an imputed version of the gradient and not the true one.


Guided saliency masks both signals: the negative signal in the forward pass (as in backpropagation) and the negative reconstruction signal flowing from top to bottom (as in deconvolution), to generate a saliency feature map.
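A minimal PyTorch sketch of guided backpropagation implemented with ReLU backward hooks, assuming a recent torchvision (VGG-16 loaded with weights=None keeps the sketch download-free; a pretrained model would be used in practice).

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None).eval()

def guided_relu_hook(module, grad_in, grad_out):
    # Standard ReLU backward already zeroes gradients where the forward input was <= 0;
    # additionally zeroing negative gradients (the DeconvNet rule) gives guided backprop.
    return (torch.clamp(grad_in[0], min=0.0),)

for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.inplace = False                          # full backward hooks dislike in-place ops
        m.register_full_backward_hook(guided_relu_hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in preprocessed image
model(x)[0].max().backward()
saliency = x.grad.abs().max(dim=1)[0]                  # guided saliency map (H x W)
```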

A.6 Jensen-Shannon Divergence

The Jensen-Shannon (JS) divergence measures the similarity between two probability distributions $p_1$ and $p_2$. It is defined in Eq. (A.6) and has the following properties:

1. Equation (A.6) is symmetric in $p_1$ and $p_2$.
2. It is bounded: $0 \leq \mathrm{JS}(p_1 \| p_2) \leq \log 2$.
3. Its square root is a proper metric.
4. It is sometimes also called the information radius or the total divergence to the average.

$$\mathrm{JS}(p_1 \| p_2) = \frac{1}{2}\Bigl( \mathrm{KL}(p_1(x) \| p_0(x)) + \mathrm{KL}(p_2(x) \| p_0(x)) \Bigr) \qquad \text{(A.6)}$$

where $p_0 = 0.5\,(p_1 + p_2)$. Note that using log base 2 bounds the value to $[0, 1]$.

Fig. A.2 Sample grad-saliency visualizations generated on input images for ImageNet class labels


Fig. A.3 Schematic visualization of deep layers using guided saliency (Springenberg et al. 2014)

A.7 Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence considers two probability distributions $p_1$ and $p_2$ and estimates how one diverges from the other. Equation (A.7) gives the KL divergence for a continuous distribution:

$$\mathrm{KL}(p_1 \| p_2) = \int_x p_1(x) \log \frac{p_1(x)}{p_2(x)} \, dx \qquad \text{(A.7)}$$

Observe that $\mathrm{KL}(p_1 \| p_2)$ is zero if $p_1 = p_2$. In general, the KL divergence is asymmetric, i.e., $\mathrm{KL}(p_1 \| p_2) \neq \mathrm{KL}(p_2 \| p_1)$.
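A small numerical sketch of Eqs. (A.6) and (A.7) for discrete distributions (natural logarithm, so the JS value is bounded by log 2 rather than 1); all names are illustrative.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x))  (discrete form of Eq. A.7)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js(p1, p2):
    """JS(p1 || p2) = 0.5*(KL(p1 || p0) + KL(p2 || p0)) with p0 = 0.5*(p1 + p2)  (Eq. A.6)."""
    p0 = 0.5 * (np.asarray(p1, float) + np.asarray(p2, float))
    return 0.5 * (kl(p1, p0) + kl(p2, p0))

p1, p2 = [0.1, 0.4, 0.5], [0.3, 0.3, 0.4]
print(kl(p1, p2), kl(p2, p1))   # asymmetric in general
print(js(p1, p2), js(p2, p1))   # symmetric and bounded by log 2
```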

A.8 Projected Gradient Descent

Projected gradient descent (PGD) perturbs an original image with a perturbation δ and, over iterations, produces images that maximize the network loss. After each iteration, δ is projected back onto a norm ball, here either the $L_2$ or the $L_\infty$ ball. δ is initialized with zeros, and random initialization is avoided to keep the optimization simple. δ is clipped to $[-\epsilon, +\epsilon]$ to preserve visual semantics. The ultimate aim of the algorithm is to fool an already-trained model: gradient-descent-based optimization of δ drives up the loss of the true class, pushing the prediction away from it.

$$\underset{\delta \in \Delta}{\mathrm{maximize}}\ \ \ell(h_\theta(x + \delta), y) \qquad \text{(A.8)}$$


where $x'$ ($x' = x + \delta$) denotes the adversarial example, $h_\theta$ is the model or hypothesis function, $x \in X$ the input, $y \in \mathbb{Z}$ the true class, $\ell(h_\theta(x), y)$ the loss, and $\delta \in \Delta = \{\delta: \|\delta\| \leq \epsilon\}$ the allowable set of perturbations.
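A minimal PyTorch sketch of the $L_\infty$ PGD loop of Eq. (A.8), with zero initialization and clipping of δ to $[-\epsilon, +\epsilon]$ as described; the optional y_target argument anticipates the targeted variant of Eq. (A.12) in Sect. A.10. The model, inputs, and labels are assumed to be supplied by the caller, and the step sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, y_target=None, eps=8/255, alpha=2/255, steps=10):
    """Untargeted PGD (Eq. A.8); pass y_target for the targeted objective (Eq. A.12)."""
    delta = torch.zeros_like(x, requires_grad=True)        # zero initialization
    for _ in range(steps):
        logits = model(x + delta)
        loss = F.cross_entropy(logits, y)                  # raise the true-class loss
        if y_target is not None:
            loss = loss - F.cross_entropy(logits, y_target)  # ...and lower the target-class loss
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()             # gradient-ascent step
            delta.clamp_(-eps, eps)                        # project onto the L-infinity ball
        delta.grad.zero_()
    return (x + delta).detach()                            # adversarial example x' = x + delta
```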

A.9 Pythagorean Fuzzy Number

Pythagorean fuzzy sets (Yager 2013) are an extension of IFS in which the sum of the squares of the membership and non-membership degrees lies between 0 and 1. They are defined mathematically as

$$P = \{ \langle x, P(\mu_P(x), \nu_P(x)) \rangle \mid x \in S \} \qquad \text{(A.9)}$$

where $\mu_P(x): S \to [0, 1]$ and $\nu_P(x): S \to [0, 1]$ are respectively the membership and non-membership degrees, satisfying $\mu_P(x)^2 + \nu_P(x)^2 \leq 1$ for every element $x \in S$. The hesitancy degree of $x \in S$ is defined as

$$\pi_P(x) = \sqrt{1 - \mu_P^2(x) - \nu_P^2(x)} \qquad \text{(A.10)}$$

If $\beta = P(\mu_\beta, \nu_\beta)$ is a PFN, it satisfies $\mu_\beta, \nu_\beta \in [0, 1]$ and $\mu_\beta^2 + \nu_\beta^2 \leq 1$. Using these properties, two metrics used to rank PFNs, the score function and the accuracy function, are defined respectively as

$$s(\beta) = \mu_\beta^2 - \nu_\beta^2, \qquad h(\beta) = \mu_\beta^2 + \nu_\beta^2 \qquad \text{(A.11)}$$

Based on these metrics, the ranking of two PFNs $\beta_1 = P(\mu_{\beta_1}, \nu_{\beta_1})$ and $\beta_2 = P(\mu_{\beta_2}, \nu_{\beta_2})$ is performed as follows (Ren et al. 2016):

1. if $s(\beta_1) < s(\beta_2)$, then $\beta_1 < \beta_2$;
2. if $s(\beta_1) = s(\beta_2)$, then
   a. if $h(\beta_1) < h(\beta_2)$, then $\beta_1 < \beta_2$;
   b. if $h(\beta_1) = h(\beta_2)$, then $\beta_1 = \beta_2$.
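A small sketch of Eqs. (A.10) and (A.11) together with the ranking rule above; a PFN is represented simply as a (μ, ν) pair and the helper names are illustrative.

```python
import math

def score(mu, nu):        # s(beta) = mu^2 - nu^2      (Eq. A.11)
    return mu**2 - nu**2

def accuracy(mu, nu):     # h(beta) = mu^2 + nu^2      (Eq. A.11)
    return mu**2 + nu**2

def hesitancy(mu, nu):    # pi_P(x)                    (Eq. A.10)
    return math.sqrt(1.0 - mu**2 - nu**2)

def rank(b1, b2):
    """Return -1 if b1 < b2, 0 if b1 = b2, +1 if b1 > b2 (Ren et al. 2016)."""
    s1, s2 = score(*b1), score(*b2)
    if s1 != s2:
        return -1 if s1 < s2 else 1
    h1, h2 = accuracy(*b1), accuracy(*b2)
    return 0 if h1 == h2 else (-1 if h1 < h2 else 1)

b1, b2 = (0.8, 0.4), (0.7, 0.3)          # valid PFNs: mu^2 + nu^2 <= 1
print(score(*b1), hesitancy(*b1), rank(b1, b2))   # 0.48, ~0.447, 1
```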

A.10 Targeted Adversarial Attack

Similar to projected gradient descent, a targeted attack perturbs the original image with a perturbation δ, but with an additional loss term in the optimization. Over the iterations it attempts to maximize the loss of the true class and minimize the loss of the targeted class. This form of targeted gradient ascent therefore raises the true-class loss while steering the prediction towards the target class.

$$\underset{\delta \in \Delta}{\mathrm{maximize}}\ \ \bigl( \ell(h_\theta(x + \delta), y) - \ell(h_\theta(x + \delta), y_{\mathrm{target}}) \bigr) \qquad \text{(A.12)}$$

where the notation follows Eq. (A.8), with $y_{\mathrm{target}}$ denoting the target class.

A.11 Translation Invariance Property

An image is modeled as a function $x \in L^2([0,1]^2)$. Let $T_v x(u) = x(u - v)$ be a translation operator defined by a translation vector $v \in [0,1]^2$, and let $f: L^2([0,1]^2) \to \{1, 2, \ldots, L\}$ be an image classification function. Image translation invariance is then defined as

$$f(T_v x) = f(x) \qquad \forall x, v \qquad \text{(A.13)}$$

A.12 Universal Approximation Theorem

Theorem A.1 Let $\xi$ be a non-constant, bounded, and monotonically-increasing continuous activation function, $f: [0,1]^d \to \mathbb{R}$ a continuous function, and $\epsilon > 0$. Then there exist $n$ and parameters $a, b \in \mathbb{R}^n$, $W \in \mathbb{R}^{n \times d}$ such that

$$\left| \sum_{i=1}^{n} a_i\, \xi(w_i^T x + b_i) - f(x) \right| < \epsilon \qquad \forall x \in [0,1]^d \qquad \text{(A.14)}$$
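A small numerical illustration of Theorem A.1 (not a proof): random sigmoidal hidden units with least-squares output weights already drive the sup-norm error down as n grows. The target function, grid, and scales below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400)[:, None]
f = np.sin(2 * np.pi * x).ravel()                    # a continuous target on [0, 1]

def sup_error(n):
    W = rng.normal(0.0, 10.0, size=(n, 1))           # random hidden weights
    b = rng.normal(0.0, 10.0, size=n)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(x @ W.T + b)))         # xi: bounded, monotone sigmoid
    a, *_ = np.linalg.lstsq(H, f, rcond=None)        # fit the output weights a only
    return np.max(np.abs(H @ a - f))                 # sup-norm error on the grid

for n in (5, 20, 100):
    print(n, sup_error(n))                           # error shrinks as n increases
```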

Appendix B

List of Digital Resources and Examples

See Tables B.1, B.2, B.3, B.4, B.5 and B.6.

Table B.1 Summary of popular perspective interpretability methods in recent years

Decomposition saliency:
• CAM (Class Activation Map) (Zhou et al. 2018)
• Grad-CAM (Selvaraju et al. 2017)
• Guided Grad-CAM and feature occlusion (Tang et al. 2019)
• Score-weighted class activation mapping (Wang et al. 2019)
• Smoothgrad (Smilkov et al. 2017)
• Multi-layer CAM (Bahdanau et al. 2014)
• LRP (Bach et al. 2015; Samek et al. 2016)
• LRP on CNN and on BoW (bag of words)/SVM (Arras et al. 2017)
• BiLRP (Eberle et al. 2020)
• DeepLIFT: learning important features (Shrikumar et al. 2017)
• Slot activation vectors (Jacovi et al. 2018)
• PRM (Peak Response Mapping) (Zhou et al. 2018)

Sensitivity saliency:
• Saliency maps (Simonyan et al. 2014)
• LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro et al. 2016)
• Guideline-based Additive eXplanation optimizes complexity (Zhu and Ogino 2019)

Other saliency:
• Attention map with autofocus convolutional layer (Qin et al. 2018)

Signal inversion:
• Deconvolutional network (Noh et al. 2015; Zeiler and Fergus 2014)
• Inverted image representations (Mahendran and Vedaldi 2015)
• Inversion using CNN (Dosovitskiy et al. 2015)
• Guided backpropagation (Springenberg et al. 2014; Izadyyazdanabadi et al. 2018)

Signal optimization:
• Activation maximization (Olah et al. 2017)
• Semantic dictionary (Olah et al. 2018)

Other signals:
• Network dissection (Bau et al. 2017; Zhou et al. 2018)

Verbal interpretability:
• Decision trees (Kotsiantis 2013)
• Propositional logic, rule-based (Caruana et al. 2015)
• Sparse decision list (Letham et al. 2015)
• Rationalizing neural predictions (Lei et al. 2016)
• MUSE (Model Understanding through Subspace Explanations) (Lakkaraju et al. 2019)

Table B.2 Summary of mathematical interpretability methods in recent years

Pre-defined model:
• Linear Probe (Alain and Bengio 2016)
• CNN Regression (Hatami et al. 2018)
• GDM (Generative Discriminative Models): ridge regression + least squares (Varol et al. 2018)
• GAM, GA2M (Generalized Additive Model) (Hastie 2017)
• ProtoAttend (Arik and Pfister 2019)
• Group-driven RL (reinforcement learning) (Zhu et al. 2018)
• Model-based RL (reinforcement learning) (Kaiser et al. 2019)
• TS Approximation (Bede 2019)
• Deep Tensor NN (Schütt et al. 2017)

Mathematical Feature Extraction:
• PCA (Principal Components Analysis) (Dunteman 1989)
• CCA (Canonical Correlation Analysis) (Hardoon et al. 2004)
• Principal Feature Visualization (Bakken et al. 2020)
• Eigen-CAM using principal components (Muhammad and Yeasin 2020)
• GAN-based multi-stage PCA (Goodfellow et al. 2014)
• Estimating probability density with deep feature embedding (Krusinga et al. 2019)
• t-SNE (t-Distributed Stochastic Neighbour Embedding) (Nguyen et al. 2016; Karpathy 2014)
• Laplacian Eigenmaps visualization for deep generative models (Biffi et al. 2018)
• Group-based interpretable NN with RW-based graph embedding (Yan et al. 2019)
• ST-DBSCAN (Birant and Kut 2007)

Mathematical Sensitivity:
• TCAV (Testing with Concept Activation Vectors) (Kim et al. 2018)
• ACE (Automatic Concept-based Explanations), built on TCAV (Ghorbani et al. 2019)
• Influence functions (Koh and Liang 2017)
• SocRat (Structured-output Causal Rationalizer) (Alvarez-Melis and Jaakkola 2017)
• Meta-predictors (Fong and Vedaldi 2017)

Table B.3 Summary of popular ML interpretability methods over past years

Interpretable ML models:
• Linear regression (Weisberg 2005; Montgomery et al. 2021)
• Logistic regression (Kleinbaum et al. 2002)
• Decision trees (Kotsiantis 2013)
• GLMs (Venables and Dichmont 2004)
• GAMs (Venables and Dichmont 2004)
• Naive Bayes (Rish et al. 2001)
• RuleFit (Friedman and Popescu 2008)
• KNNs (Papernot et al. 2016)

Global model-agnostic methods:
• Partial dependence plot (PDP) (Goldstein et al. 2015)
• Permuted feature importance (Altmann et al. 2010)
• Accumulated local effect plot (Apley and Zhu 2020)
• Prototypes and criticisms (Kim et al. 2016)
• Global surrogate models (Tenne and Armfield 2009)

Local model-agnostic methods:
• ICE (Individual Conditional Expectation) (Goldstein et al. 2015)
• LIME (Ribeiro et al. 2016)
• SHAP (Slack et al. 2020)
• Anchors (Ribeiro et al. 2018)
• Counterfactual explanations (Wachter et al. 2017)
• Shapley values (Messalas et al. 2019)

Others:
• ELI5 (Korobov and Lopuhin 2020)
• Yellowbrick (Bengfort and Bilbro 2019)
• MLXTEND (Raschka 2018)

Table B.4 Summary of some software resources

• SHAP – SHapley Additive exPlanations – github.com/slundberg/shap
• DeepExplain – Perturbation and gradient-based attribution methods for Deep Neural Network interpretability – github.com/marcoancona/DeepExplain
• iNNvestigate – A toolbox to iNNvestigate neural networks’ predictions – github.com/albermax/innvestigate
• ELI5 – A library for debugging/inspecting machine learning classifiers and explaining their predictions – github.com/TeamHGMemex/eli5
• Skater – Python library for model interpretation/explanations – github.com/datascienceinc/Skater
• Yellowbrick – Visual analysis and diagnostic tools to facilitate machine learning model selection – github.com/DistrictDataLabs/yellowbrick
• Lucid – A collection of infrastructure and tools for research in neural network interpretability – github.com/tensorflow/lucid
• SeFa – Closed-form factorization of latent space in GANs – genforce.github.io/sefa
• PAIR SALIENCY – Framework-agnostic implementation for state-of-the-art saliency methods – pair-code.github.io/saliency/home
• Deep Graph Library – A framework-agnostic, scalable Python package for DL on graphs – github.com/dmlc/dgl/
• Probing ViTs – Self-attention visualization tool for different families of vision transformers – github.com/sayakpaul/probing-vits
• DeepEyes – Progressive Visual Analytics for Designing DNNs by Pezzotti et al. (2017)
• GANViz – Understanding the Adversarial Game Through Visual Analytics by Wang et al. (2018)
• DGTracker – Analyzing the training processes of deep generative models by Liu et al. (2017)
• GAN Lab – Using Interactive Visual Experimentation to Understand Complex Deep Generative Models by Kahng et al. (2018)


Table B.5 Summary of some interpretability survey papers in recent years

• Saad et al. (2007) – Provides NN explanation using inversion and partially addresses the commercial barriers and data wildness obstructing practitioners from obtaining interpretability
• Chakraborty et al. (2017) – Structured to offer in-depth perspectives on varying degrees of interpretability, but with only 49 references for support
• Adadi and Berrada (2018) – Does not solely concentrate on NNs but instead covers existing black-box ML models
• Ching et al. (2018) – Blends the application of DL methods in biomedical problems, discusses their potential to transform several areas of medicine and biology, and examines the essence of interpretability
• Gilpin et al. (2018) – Classifies approaches to understanding the workflow and representations of NNs
• Guidotti et al. (2018) – Covers existing black-box ML models instead of focusing on NNs
• Lipton et al. (2018) – Discusses the myths around, and the concept of, interpretability in ML
• Melis et al. (2018) – Short survey of the shortcomings of popular methods and a direction towards self-explaining models based on explicitness, faithfulness, and stability
• Zhang et al. (2018) – Mainly on the visual interpretability of DL and evaluation metrics for network interpretability
• Du et al. (2019) – Addresses 40 studies, broken down into categories such as global and local explanations, and coarse-grained post-hoc and ad-hoc explanations
• Mittelstadt et al. (2019) – Provides a short summary on explaining AI from the perspective of philosophy, sociology, and human-computer interaction
• Tjoa et al. (2020) – A collection of journal articles under perspective and mathematical interpretability, applying the same categorization to medical research
• Arrieta et al. (2020) – Provides detailed explanations of key ideas and taxonomies and highlights existing barriers to the development of explainable artificial intelligence (XAI)
• Huang et al. (2020) – Discusses extensively verification, testing, adversarial attacks on DL, and interpretability techniques across 202 papers, most published after 2017
• Samek et al. (2021) – An extensive and timely review of an actively emerging field, putting interpretability algorithms to the test theoretically and with extensive simulations
• Vilone et al. (2020) – Extensively clusters the XAI concept into different taxonomies, theories, and evaluation approaches; a collection of 361 papers with elaborate tables classifying explainability in AI
• Fan et al. (2021) – Classifies NNs by their interpretability, discusses medical uses, and draws connections to fuzzy logic and neuroscience
• Mishra et al. (2021) – Surveys works that analysed the robustness of two classes of local explanations (feature importance and counterfactual explanations) and discusses some interesting results
• Mohseni et al. (2021) – Presents a categorization mapping the design goals of different XAI user groups to their evaluation methods; further provides summarized ready-to-use tables of evaluation methods and recommendations for different goals in XAI research
• Vilone and Longo (2021) – Clusters the scientific studies via a hierarchical system that classifies theories and notions related to the concept of explainability and the evaluation approaches for XAI methods, concluding with a critical discussion of gaps and limitations
• Zhang et al. (2021) – An interpretability taxonomy for NNs in three dimensions: passive versus active approaches, type of explanation, and local versus global interpretability


Table B.6 Remarks on some of the most cited datasets, filtered by task

Question Answering:
• GLUE (Wang et al. 2018) – The GLUE benchmark is a collection of 9 natural language understanding tasks built on established datasets. However, the task formats are limited to sentence and sentence-pair classification.
• SQuAD (Rajpurkar et al. 2016) – SQuAD has over 100K questions. A weak learner alone has low performance on both in-distribution and adversarial sets; PoE training improves the adversarial performance while sacrificing some in-distribution performance, and a multi-loss optimization closes the gap and even boosts adversarial robustness (Sanh et al. 2020).
• ConceptNet (Speer et al. 2017) – Represents the general knowledge involved in understanding language, improving NLP applications. However, it contains harmful overgeneralizations of both negative and positive perceptions of many target groups, with severe disparities across targets in demographic categories such as professions and genders (Mehrabi et al. 2021).

Semantic Segmentation:
• Cityscapes (Cordts et al. 2016) – Comprises a large, diverse set of stereo video sequences recorded in the streets of 50 different cities. It involves dense inner-city traffic with wide roads and large intersections.
• KITTI (Geiger et al. 2012) – Most of KITTI's images are taken during daytime on the road. Apart from the low coverage of ground truth at interesting image locations, another major flaw is the imagery's low dynamic range and luminance resolution (Haeusler and Klette 2012).
• ShapeNet (Chang et al. 2015) – A large-scale repository of 3D CAD models containing over 3M models, about 220K of which are classified into 3,135 classes arranged using WordNet hypernym-hyponym relationships. The limited number of classes is an impediment to few-shot learning (Stojanov et al. 2021).

Object detection:
• COCO (Lin et al. 2014) – A widely used image captioning benchmark dataset, heavily skewed towards lighter-skinned and male individuals, with racial terms in the manual captions (Zhao et al. 2021).
• Visual Genome (Krishna et al. 2017) – Contains over 108K images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. It exhibits gender and racial stereotypes as it has no filtering processes (Hirota et al. 2022).
• Nuscene (Caesar et al. 2020) – Provides 3D bounding boxes for 1,000 scenes collected in Boston and Singapore, with 28,130 training and 6,019 validation images.

Speech recognition:
• LibriSpeech (Panayotov et al. 2015) – A collection of nearly 1,000 hours of audiobooks. There are no statistics on the accents in LibriSpeech, but it was explicitly designed to skew towards US English and is very slightly biased towards men (Meyer et al. 2020).
• Speech Commands (Warden 2018) – An attempt to create a standard training and evaluation dataset for a class of simple speech recognition tasks.
• MuST-C (Di Gangi et al. 2019) – The largest freely available multilingual corpus for speech translation, comprising (audio, transcript, translation) triplets extracted from TED talks. It has a male/female speaker ratio of approximately 2:1.

Image classification:
• CIFAR-10 (Krizhevsky et al. 2009) – Used in over 210 models, including Vision Transformers; contains images of animals and vehicles, with four classes belonging to vehicles and six to fauna.
• ImageNet (Deng et al. 2009) – The most popular dataset in CV tends to be biased towards fauna rather than flora; it is unbalanced, biased towards dogs, birds, and fish. A well-critiqued limitation is that objects tend to be centered within the images, which does not reflect how “natural” images appear (Barbu et al. 2019).
• MNIST (LeCun et al. 1998) – One of the most popular deep learning image classification datasets; some images in the test set are barely readable and may prevent reaching a test error rate of 0%. Colored MNIST (Li and Vasconcelos 2019) exploits the intuition of color processing of digits and possible dataset biases in color representations.

Text classification:
• IMDb Movie Reviews (Maas et al. 2011) – Contains 50k highly polar movie reviews, evenly split into 25k positive and 25k negative, with additional unlabeled data available as well.
• DBpedia (Auer et al. 2007) – Currently provides information about more than 1.95 million “things”. It aims to extract structured content from the information created in the Wikipedia project.
• AG News (Zhang et al. 2015) – A subdataset of AG's corpus of news articles. Each of the 4 classes contains 30,000 training samples and 1,900 testing samples.

Domain adaptation:
• SVHN (Netzer et al. 2011) – Contains over 600,000 32 × 32 RGB images of printed digits centered on the digit of interest. It offers more natural scene images; however, there is an extreme overlap between the train and test sets across every class, and the dataset is skewed, with some classes over- and under-sampled (Mani et al. 2019).
• HMDB51 (Kuehne et al. 2011) – Composed of 6,849 video clips from 51 action categories. Representation bias can be seen in this dataset, and the background scene features are less helpful (Sultani and Saleemi 2014).
• Office-Home (Venkateswara et al. 2017) – Contains 15,588 images from four domains (Art, Clipart, Product, and Real-World) divided into 65 classes.

Information retrieval:
• MS MARCO (Nguyen et al. 2016) – A collection of datasets focused on DL in search. It shows a tendency towards survivorship bias and is a poor estimator of the absolute performance of systems (Gupta and MacAvaney 2022).
• CORD-19 (Wang et al. 2020) – A free resource comprising tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. There are sampling and observation errors in the creation of CORD-19, but these average out in the larger dataset (Kanakia et al. 2020).
• BioASQ (Tsatsaronis et al. 2015) – A question answering dataset whose instances are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C), also called snippets.

Face recognition:
• LFW (Huang et al. 2008) – A gold-standard benchmark for face recognition, estimated to be 77.5% male and 83.5% white (Han and Jain 2014). Many groups are not well represented: there are no babies, few children, very few people over the age of 80, and a relatively small proportion of women and ethnic minorities. Conditions such as poor lighting, extreme pose, strong occlusions, and low resolution do not constitute a major part of LFW.
• CASIA-WebFace (Yi et al. 2014) – Annotated with 10,575 unique people and 494,414 images; requires some filtering to enhance quality.
• MS-Celeb-1M (Guo et al. 2016) – Contains 1M celebrity names and about 10M labeled face images; the dataset has been retracted from use.


B.1 Open-Source Datasets

B.1.1 Face Recognition Image Dataset

Labeled Faces in the Wild (Huang et al. 2008)
• Contains 13,000 labeled human face images, detected by the Viola-Jones face detector.
• Many groups are not well represented in LFW.
• Additional conditions, such as poor lighting, extreme pose, strong occlusions, low resolution, and other important factors, do not constitute a major part of LFW.

UMD Faces (Bansal et al. 2017)
• Still images: 367,888 face annotations for 8,277 subjects.
• Video frames: over 3.7 million annotated video frames from over 22,000 videos of 3,100 subjects.
• Provides the estimated pose (yaw, pitch, and roll), the locations of twenty-one keypoints, and gender information generated by a pre-trained neural network.

CASIA WebFace (Yi et al. 2014)
• The CASIA dataset is annotated with 10,575 unique people and 494,414 images in total.
• It is the second-largest public dataset available for face verification and recognition problems.
• It requires some filtering to enhance quality.

FERET (Phillips et al. 1998)
• The database contains 1,564 sets of images for a total of 14,126 images, including 1,199 individuals and 365 duplicate sets of images.
• The database was collected in a highly controlled environment with controlled illumination; all images had the eyes in a registered location.

MS-Celeb-1M (Guo et al. 2016)
• Has 1M celebrity names and about 10M face images with labels.
• The dataset is intended to facilitate celebrity recognition tasks, so it needs to cover as many popular celebrities as possible.
• The occurrence frequency for a given entity is obtained by counting how many documents contain the entity in a large corpus of billions of web documents.

100,000 Faces (Cour et al. 2009)
• Created over the course of two years by taking 29K images of 69 models.
• The faces offer a wide range of ages, facial shapes, and ethnicities, but are all consistently sized and lit.


• Some portraits look glitchy (and therefore fake).
• The 100,000 faces are AI-generated: a realistic set of 100,000 faces created using an original machine learning dataset.

Flickr-Faces-HQ Dataset (Karras et al. 2019)
• Consists of 70k high-quality human faces, used as a benchmark for GANs.
• The dataset has considerable variation in age, glasses, background, and ethnicity.
• The dataset is curated from Flickr with automated filters and inherits the biases of the website.

B.1.2 Animal Image Dataset

Stanford Dogs Dataset (Khosla et al. 2011)
• Built using images and annotations from ImageNet for the task of fine-grained image categorization.
• Number of categories: 120 dog breeds; number of images: 20,580; annotations: class labels and bounding boxes.
• Dogs within a class can have different ages (e.g. Beagle), poses (e.g. Blenheim Spaniel), occlusion/self-occlusion, and even color (e.g. Shih-Tzu).

Fishnet.AI (Kay and Merrifield 2021)
• A fishing dataset for AI training.
• Consists of 86,029 images containing 34 object classes, making it the largest and most diverse public dataset of fisheries EM imagery to date.
• Sourced from real-world fishing trips, where the distribution of species encountered is skewed.

The Oxford-IIIT Pet Dataset (Parkhi et al. 2012)
• A 37-category pet dataset with approximately 200 photos for each group.
• The images have large variations in scale, pose, and lighting.
• All images have an associated ground-truth annotation of breed, head ROI, and pixel-level trimap segmentation.

B.1.3 Satellite Imagery Dataset

OpenStreetMap (Bennett 2010)
• Planet.osm is the OpenStreetMap data in one file: all the nodes, ways, and relations that make up the map.


• The core of OSM is a spatial database, which contains geographic data and information from all over the world.
• Contributors can make edits to the OSM global database without any real controls or moderation at the point of contribution.

NEXRAD (Heiss et al. 1990)
• The Next Generation Weather Radar (NEXRAD) system is a network of 160 high-resolution S-band Doppler weather radars jointly operated by the National Weather Service (NWS), the Federal Aviation Administration (FAA), and the U.S. Air Force.
• The NEXRAD system detects precipitation and wind, and its data can be processed to map precipitation patterns and movement.

xBD (Gupta et al. 2019)
• With over 850,000 building polygons from six different types of natural disaster around the world, covering a total area of over 45,000 square kilometers, it is one of the largest and highest-quality public datasets of annotated high-resolution satellite imagery.
• The xBD dataset includes pre- and post-disaster imagery for 6 different types of disasters and 15 countries.
• The dataset contains bounding boxes and labels for environmental factors such as fire, water, and smoke.

SpaceNet (Van Etten et al. 2018)
• The SpaceNet partners (CosmiQ Works, Radiant Solutions, and NVIDIA) released a large corpus of labeled satellite imagery on Amazon Web Services (AWS) called SpaceNet.
• The SpaceNet dataset provides a large corpus of high-resolution multi-band imagery, with attendant validated building footprint and road network labels.

Radiant Earth Foundation (Alemohammad 2019)
• Radiant MLHub allows anyone to access, store, register, and share open training datasets and models for high-quality Earth observations, and it is designed to encourage widespread collaboration and development of trustworthy applications.
• Radiant MLHub facilitates an open community commons for geospatial training data and machine learning models.

B.1.4 Fashion Image Dataset

iMaterialist-Fashion (Guo et al. 2019)
• Apparel instance segmentations include 27 main apparel objects (jackets, dresses, skirts, etc.) and 19 apparel parts (sleeves, collars, etc.).


• A total of 294 fine-grained attributes, covering over 50K clothing images, were annotated by experts for the main apparel objects.

B.2 Applications in Computer Vision Tasks

B.2.1 Image Classification

Since the ImageNet dataset was published in 2010, image classification has been one of the most investigated areas of computer science. Image classification is the most common CV task because the problem formulation is straightforward: the goal is to categorize a collection of images into a predefined set of categories using only labeled examples. Image classification analyzes an image as a whole to assign a single label, unlike harder problems such as object detection and image segmentation, which must also localize (give locations for) the characteristics they discover.

B.2.2 Object Detection

Object detection means the detection and localization of objects using bounding boxes, as illustrated in Fig. B.1. Object detection analyzes an image or video for occurrences of class-specific characteristics and labels them. The classes include whatever the detection model has been trained on, from automobiles to animals to humans. Previously, Haar-like features, SIFT, and HOG features were used to detect and categorize features in an image with traditional ML algorithms.

Fig. B.1 Approach to object detection and localization in CV


In addition to being time-consuming and prone to error, this approach places serious restrictions on the total number of objects that can be identified. Therefore, DL models such as YOLO, R-CNN, and SSD, which use millions of parameters to overcome these constraints, are often used for this purpose. Object recognition, also known as object classification, is commonly performed in conjunction with object detection.

B.2.3 Image Segmentation

Image segmentation is the partition of an image into subparts or sub-objects to show that the computer can distinguish an object from the background and/or from another object in the same image. A “segment” of an image is a certain type of entity that the neural network has detected in the image, represented by a pixel mask that may be used to extract it. This fascinating subject of CV has been extensively studied, both with classical image processing techniques like watershed algorithms and clustering-based segmentation, and with popular modern-day DL architectures like PSPNet, FPN, U-Net, SegNet, and so on.

B.2.4 Face and Person Recognition

Facial recognition is a subset of object detection in which the predominant subject detected is a human face. While object detection is a task in which features are identified and localized, facial recognition not only detects but also recognizes the identified face. Facial recognition systems look for common features and landmarks, such as eyes, lips, or a nose, and categorize a face based on these characteristics and their placement. Some robust strategies based on DL algorithms may be found in articles such as FaceNet (Schroff et al. 2015).

B.2.5 Edge Detection

Edge detection is the process of identifying object boundaries. It is carried out algorithmically, using mathematical operators that detect abrupt changes or discontinuities in the image’s luminosity. Traditional image processing-based techniques, such as Canny edge detection and convolution with specialized edge-detection filters, are largely utilized for edge detection, which is frequently employed as a pre-processing step for several tasks. In addition, edges in an image provide crucial information about the image’s composition, which is why DL algorithms conduct internal edge detection to collect global low-level features leveraging learnable kernels.
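A small OpenCV sketch of the two traditional routes mentioned above, Canny edge detection and convolution with an explicit edge-detection kernel; the file name and threshold values are illustrative placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)   # assumed input image

edges_canny = cv2.Canny(img, 100, 200)                 # Canny edge detection

laplacian_kernel = np.array([[0,  1, 0],
                             [1, -4, 1],
                             [0,  1, 0]], dtype=np.float32)   # specialised edge filter
edges_conv = cv2.filter2D(img, -1, laplacian_kernel)          # convolution-based edges

cv2.imwrite("edges_canny.png", edges_canny)
cv2.imwrite("edges_conv.png", edges_conv)
```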


B.2.6 Image Restoration

Image restoration refers to the restoration or reconstruction of old hard copies of deteriorated and faded images that were acquired and kept improperly, resulting in a loss of image quality. Typical image restoration strategies involve reducing additive noise with mathematical tools, but reconstruction sometimes necessitates substantial alterations, demanding further analysis and image inpainting. Image inpainting fills in damaged areas of an image using generative models that estimate what the image intends to portray. Frequently, the restoration step is followed by a colorization procedure that colors the subject of the image (if it is black and white) as realistically as feasible.

B.2.7 Feature Matching

In computer vision, features are the portions of an image that provide the most information about it. Edges are powerful markers of object structure, and better-localized, crisper details, such as corners, are also features. Feature matching allows us to compare the features of a region in one image to those of a similar region in another image. Feature matching has applications in CV tasks such as object detection and camera calibration. In general, feature matching is accomplished in the following sequence (a minimal code sketch follows the list):

1. Feature identification: image processing methods such as Harris corner detection, SIFT, and SURF are typically utilized to detect regions of interest.
2. Formation of local descriptors: once features are detected, the region surrounding each keypoint is extracted and local descriptors for these regions of interest are computed. A local descriptor is a representation of a point’s immediate neighborhood and is therefore useful for matching features.
3. Feature matching: to complete the stage, the features and their local descriptors are matched across the relevant images.
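A minimal OpenCV sketch of the three steps above, using ORB keypoints as a freely available stand-in for SIFT/SURF and brute-force Hamming matching; the image file names are placeholders.

```python
import cv2

img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)   # placeholder images
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)          # steps 1-2: keypoints and local descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)   # step 3: matching

vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)
cv2.imwrite("matches.png", vis)
```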

B.2.8 Scene Reconstruction

Scene reconstruction, one of the most challenging tasks in CV, is the digital 3D reconstruction of an object from an image. The majority of scene reconstruction algorithms build a point cloud at the object’s surface and reconstruct a mesh from this point cloud.


B.2.9 Video Motion Analysis

In CV, Video Motion Analysis (VMA) refers to the study of moving objects or animals and the trajectories of their bodies. Object identification, tracking, segmentation, and pose estimation are only a few of the subtasks that motion analysis encompasses as a whole. In addition to sports, VMA is utilized in healthcare, smart surveillance, physical therapy, and smart production units, and to count and monitor microorganisms such as germs and viruses.

References

Aamodt, A., Plaza, E.: Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun. 7(1), 39–59 (1994) Abbott, L.F., Nelson, S.B.: Synaptic plasticity: taming the beast. Nat. Neurosci. 3(11), 1178–1183 (2000) Abdal, R., Qin, Y., Wonka, P.: Image2stylegan: how to embed images into the stylegan latent space? In: IEEE International Conference on Computer Vision, pp. 4432–4441 (2019) Abdal, R., Qin, Y., Wonka, P.: Image2stylegan++: how to edit the embedded images? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305 (2020) Abdal, R., Zhu, P., Mitra, N.J., Wonka, P.: Styleflow: attribute-conditioned exploration of stylegangenerated images using conditional continuous normalizing flows. ACM Trans. Graph. (ToG) 40(3), 1–21 (2021) Abonyi, J., Nagy, L., Szeifert, F.: Hybrid fuzzy convolution modelling and identification of chemical process systems. Int. J. Syst. Sci. 31(4), 457–466 (2000) Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: Slic superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012) Acharya, U.R., Oh, S.L., Hagiwara, Y., Tan, J.H., Adam, M., Gertych, A., San Tan, R.: A deep convolutional neural network model to classify heartbeats. Comput. Biol. Med. 89, 389–396 (2017) Acharya, U.R., Oh, S.L., Hagiwara, Y., Tan, J.H., Adeli, H.: Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Comput. Biol. Med. 100, 270–278 (2018) Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for boltzmann machines. Cogn. Sci. 9(1), 147–169 (1985) Adadi, A., Berrada, M.: Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim, B.: Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 31 (2018) Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., Soroa, A.: A study on similarity and relatedness using distributional and wordnet-based approaches (2009) Ahmad, M.A., Eckert, C., Teredesai, A.: Interpretable machine learning in healthcare. In: ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 559–560 (2018) © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Somani et al., Interpretability in Deep Learning, https://doi.org/10.1007/978-3-031-20639-9


Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes (2016). arXiv:1610.01644 Alameda-Pineda, X., Redi, M., Celis, E., Sebe, N., Chang, S.F.: Fat/mm’19: 1st international workshop on fairness, accountability, and transparency in multimedia. In: ACM International Conference on Multimedia, pp. 2728–2729 (2019) Alcal’a, R., Casillas, J., Cord’on, O., Herrera, F., Zwiry, S.: Techniques for learning and tuning fuzzy rule-based systems for learning and tuning fuzzy rule-based systems for linguistic modeling and their application. ETS de Ingenier’ia Inform’atica. University of Granda, Granada, Spain (1999) Alemohammad, H.: Radiant mlhub: a repository for machine learning ready geospatial training data. In: AGU Fall Meeting Abstracts, vol. 2019, pp. IN11A–05 (2019) Alexander Mordvintsev, C.O., Tyka, M.: Deepdream-a code example for visualizing neural networks (2015a). https://ai.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html Alexander Mordvintsev, C.O., Tyka, M.: Inceptionism: going deeper into neural networks (2015b). https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html Alon, U.: Network motifs: theory and experimental approaches. Nat. Rev. Genet. 8(6), 450–461 (2007) Altmann, A., Tolo¸si, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340–1347 (2010) Altmann, E.M., Gray, W.D.: Forgetting to remember: the functional relationship of decay and interference. Psychol. Sci. 13(1), 27–33 (2002) Alvarez Melis, D., Jaakkola, T.: Towards robust interpretability with self-explaining neural networks. Adv. Neural Inf. Process. Syst. 31 (2018) Alvarez-Melis, D., Jaakkola, T.S.: A causal framework for explaining the predictions of black-box sequence-to-sequence models (2017). arXiv:1707.01943 Alvarez-Melis, D., Jaakkola, T.S.: Self-explaining neural networks (2018) Ang, K.K., Quek, C.: Rspop: rough set-based pseudo outer-product fuzzy rule identification algorithm. Neural Comput. 17(1), 205–243 (2005) Ang, K.K., Quek, C., Pasquier, M.: Popfnn-cri (s): pseudo outer product based fuzzy neural network using the compositional rule of inference and singleton fuzzifier. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 33(6), 838–849 (2003) Angelov, P., Soares, E.: Towards explainable deep neural networks (XDNN). Neural Netw. 130, 185–194 (2020) Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. (CSUR) 40(1), 1–39 (2008) Apley, D.W., Zhu, J.: Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc.: Ser. B (Statistical Methodology) 82(4), 1059–1086 (2020) Arik, S.O., Pfister, T.: Protoattend: attention-based prototypical learning (2019). arXiv:1902.06292 Arras, L., Horn, F., Montavon, G., Müller, K.R., Samek, W.: “What is relevant in a text document?”: An interpretable machine learning approach. PloS One 12(8), e0181142 (2017) Arras, L., Montavon, G., Müller, K.R., Samek, W.: Explaining recurrent neural network predictions in sentiment analysis (2017). arXiv:1706.07206 Arrieta, A.B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., GilLópez, S., Molina, D., Benjamins, R., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. 
Fusion 58, 82–115 (2020) Ashrafi, M., Chua, L.H.C., Quek, C.: The applicability of generic self-evolving takagi-sugeno-kang neuro-fuzzy model in modeling rainfall–runoff and river routing. Hydrol. Res. (2019) Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: International Conference on Machine Learning, pp. 284–293. PMLR (2018) Atz, K., Grisoni, F., Schneider, G.: Geometric deep learning on molecular representations. Nat. Mach. Intell. 1–10 (2021) Aubry, M., Russell, B.C.: Understanding deep features with computer-generated imagery. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2875–2883 (2015)


Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: Dbpedia: a nucleus for a web of open data. In: The Semantic Web, pp. 722–735. Springer (2007) Augasta, M.G., Kathirvalavakumar, T.: Reverse engineering the neural networks for rule extraction in classification problems. Neural Process. Lett. 35(2), 131–150 (2012) Aull, C.E.: The first symmetric derivative. Am. Math. Mon. 74(6), 708–711 (1967) Ba, L.J., Caruana, R.: Do deep nets really need to be deep? (2013). arXiv:1312.6184 Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One 10(7), e0130140 (2015) Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.R.: How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010) Baena-Garcıa, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavalda, R., Morales-Bueno, R.: Early drift detection method. In: Fourth International Workshop on Knowledge Discovery from Data Streams, vol. 6, pp. 77–86 (2006) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2014). arXiv:1409.0473 Bakken, M., Kvam, J., Stepanov, A.A., Berge, A.: Principal feature visualisation in convolutional neural networks. In: European Conference on Computer Vision, pp. 18–31. Springer (2020) Baldassarre, F., Azizpour, H.: Explainability techniques for graph convolutional networks. In: International Conference on Machine Learning (ICML) Workshops, 2019 Workshop on Learning and Reasoning with Graph-Structured Representations (2019) Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K.W.D., McWilliams, B.: The shattered gradients problem: If resnets are the answer, then what is the question? In: International Conference on Machine Learning, pp. 342–350. PMLR (2017) Banach, S.: Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fund. Math. 3(1), 133–181 (1922) Bansal, A., Farhadi, A., Parikh, D.: Towards transparent systems: Semantic characterization of failure modes. In: European Conference on Computer Vision, pp. 366–381. Springer (2014) Bansal, A., Nanduri, A., Castillo, C.D., Ranjan, R., Chellappa, R.: Umdfaces: an annotated face dataset for training deep networks. In: 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 464–473. IEEE (2017) Bansal, G., Weld, D.: A coverage-based utility model for identifying unknown unknowns. In: AAAI Conference on Artificial Intelligence, vol. 32 (2018) Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., Katz, B.: Objectnet: a large-scale bias-controlled dataset for pushing the limits of object recognition models. Adv. Neural Inf. Process. Syst. 32 (2019) Bargmann, C.I.: Beyond the connectome: how neuromodulators shape neural circuits. Bioessays 34(6), 458–465 (2012) Barocas, S., Selbst, A.D.: Big data’s disparate impact. Calif. Law Rev. 104, 671 (2016) Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al.: Relational inductive biases, deep learning, and graph networks (2018). arXiv:1806.01261 (2018) Bau, D., Andonian, A., Cui, A., Park, Y., Jahanian, A., Oliva, A., Torralba, A.: Paint by word (2021). arXiv:2103.10951 Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: quantifying interpretability of deep visual representations. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549 (2017) Bau, D., Zhu, J.Y., Strobelt, H., Lapedriza, A., Zhou, B., Torralba, A.: Understanding the role of individual units in a deep neural network. Proc. Nat. Acad. Sci. 117(48), 30071–30078 (2020) Bau, D., Zhu, J.Y., Strobelt, H., Zhou, B., Tenenbaum, J.B., Freeman, W.T., Torralba, A.: Gan dissection: visualizing and understanding generative adversarial networks (2018). arXiv:1811.10597


Bau, D., Zhu, J.Y., Wulff, J., Peebles, W., Strobelt, H., Zhou, B., Torralba, A.: Seeing what a gan cannot generate. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4502–4511 (2019) Bay, H., Tuytelaars, T., Gool, L.V.: Surf: speeded up robust features. In: European Conference on Computer Vision, pp. 404–417. Springer (2006) Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009) Bede, B.: Fuzzy systems with sigmoid-based membership functions as interpretable neural networks. In: International Fuzzy Systems Association World Congress, pp. 157–166. Springer (2019) Bello, I., Fedus, W., Du, X., Cubuk, E.D., Srinivas, A., Lin, T.Y., Shlens, J., Zoph, B.: Revisiting resnets: improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 34 (2021) Benamira, A., Devillers, B., Lesot, E., Ray, A.K., Saadi, M., Malliaros, F.D.: Semi-supervised learning and graph neural networks for fake news detection. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 568–569. IEEE (2019) Bengfort, B., Bilbro, R.: Yellowbrick: visualizing the scikit-learn model selection process. J. Open Source Softw. 4(35), 1075 (2019) Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013) Bengio, Y., Goodfellow, I., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge, MA, USA (2017) Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 19 (2006) Bengio, Y., Paiement, J.f., Vincent, P., Delalleau, O., Roux, N., Ouimet, M.: Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. Adv. Neural Inf. Process. Syst. 16 (2003) Benjamins, R., Barbado, A., Sierra, D.: Responsible AI by design in practice (2019). arXiv:1909.12838 Bennetot, A., Laurent, J.L., Chatila, R., Díaz-Rodríguez, N.: Towards explainable neural-symbolic visual reasoning (2019). arXiv:1909.09065 Bennett, J.: OpenStreetMap. Packt Publishing Ltd (2010) Berenji, H.R., Khedkar, P.: Learning and tuning fuzzy logic controllers through reinforcements. IEEE Trans. Neural Netw. 3(5), 724–740 (1992) Bhagat, S., Cormode, G., Muthukrishnan, S.: Node classification in social networks. In: Social Network Data Analytics, pp. 115–148. Springer (2011) Bhatt, D., Patel, C., Talsania, H., Patel, J., Vaghela, R., Pandya, S., Modi, K., Ghayvat, H.: CNN variants for computer vision: history, architecture, application, challenges and future scope. Electronics 10(20), 2470 (2021) Bhuta, N., Beck, S., Geiβ, R., Liu, H.Y., Kreβ, C.: Autonomous Weapons Systems: Law, Ethics, Policy. Cambridge University Press (2016) Biecek, P., Burzykowski, T.: Explanatory Model Analysis: Explore. Chapman and Hall/CRC, Explain and Examine Predictive Models (2021) Bien, J., Tibshirani, R.: Prototype selection for interpretable classification. Ann. Appl. Stat. 5(4), 2403–2424 (2011) Bienenstock, E.L., Cooper, L.N., Munro, P.W.: Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2(1), 32–48 (1982) Biffi, C., Oktay, O., Tarroni, G., Bai, W., Marvao, A.D., Doumou, G., Rajchl, M., Bedair, R., Prasad, S., Cook, S., et al.: Learning interpretable anatomical features through deep generative models: application to cardiac remodeling. 
In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 464–471. Springer (2018) Bilal, A., Jourabloo, A., Ye, M., Liu, X., Ren, L.: Do convolutional neural networks learn class hierarchy? IEEE Trans. Vis. Comput. Graph. 24(1), 152–162 (2017)


Birant, D., Kut, A.: St-dbscan: an algorithm for clustering spatial-temporal data. Data Knowl. Eng. 60(1), 208–221 (2007)
Bishop, C.: Improving the generalization properties of radial basis function neural networks. Neural Comput. 3(4), 579–588 (1991)
Bishop, C.M., et al.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
Bloniarz, A., Talwalkar, A., Yu, B., Wu, C.: Supervised neighborhoods for distributed nonparametric regression. In: Artificial Intelligence and Statistics, pp. 1450–1459. PMLR (2016)
Brennan, T., Dieterich, W.: Correctional offender management profiles for alternative sanctions (compas). Handb. Recidiv. Risk/Needs Assess. Tools 49 (2018)
Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond euclidean data. IEEE Signal Process. Mag. 34(4), 18–42 (2017)
Brown, J.: Some tests of the decay theory of immediate memory. Q. J. Exp. Psychol. 10(1), 12–21 (1958)
Brown, P.F., Della Pietra, V.J., Desouza, P.V., Lai, J.C., Mercer, R.L.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–480 (1992)
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch (2017). arXiv:1712.09665
Brownlee, J.: Clever algorithms: nature-inspired programming recipes. Jason Brownlee (2011)
Buckley, J.J., Hayashi, Y.: Fuzzy neural networks: a survey. Fuzzy Sets Syst. 66(1), 1–13 (1994)
Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018)
Budanitsky, A., Hirst, G.: Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist. 32(1), 13–47 (2006)
Bui, T.D., Ravi, S., Ramavajjala, V.: Neural graph learning: Training neural networks using graphs. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 64–71 (2018)
Bunt, A., Lount, M., Lauzon, C.: Are explanations always important? A study of deployed, low-cost intelligent interactive systems. In: ACM International Conference on Intelligent User Interfaces, pp. 169–178 (2012)
Butola, A., Prasad, D.K., Ahmad, A., Dubey, V., Qaiser, D., Srivastava, A., Senthilkumaran, P., Ahluwalia, B.S., Mehta, D.S.: Deep learning architecture “lightoct” for diagnostic decision support using optical coherence tomography images of biological samples. Biomed. Opt. Express 11(9), 5017–5031 (2020)
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: Nuscenes: a multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11621–11631 (2020)
Calude, C.S., Longo, G.: The deluge of spurious correlations in big data. Found. Sci. 22(3), 595–612 (2017)
Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017)
Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730 (2015)
Casillas, J., Cordón, O., Triguero, F.H., Magdalena, L.: Interpretability Issues in Fuzzy Modeling, vol. 128. Springer (2013)
Chai, Y., Jia, L., Zhang, Z.: Mamdani model based adaptive neural fuzzy inference system and its application. Int. J. Comput. Intell. 5(1), 22–29 (2009)
Chakraborty, S., Tomsett, R., Raghavendra, R., Harborne, D., Alzantot, M., Cerutti, F., Srivastava, M., Preece, A., Julier, S., Rao, R.M., et al.: Interpretability of deep learning models: a survey of results. In: 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1–6. IEEE (2017)

Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository (2015). arXiv:1512.03012
Chang, P.C., Liu, C.H.: A tsk type fuzzy rule based system for stock price prediction. Expert Syst. Appl. 34(1), 135–144 (2008)
Che, Z., Purushotham, S., Khemani, R., Liu, Y.: Distilling knowledge from deep networks with applications to healthcare domain (2015). arXiv:1512.03542
Che, Z., Purushotham, S., Khemani, R., Liu, Y.: Interpretable deep models for ICU outcome prediction. In: AMIA Annual Symposium Proceedings, vol. 2016, p. 371. American Medical Informatics Association (2016)
Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. Adv. Neural Inf. Process. Syst. 32 (2019)
Chen, J., Song, L., Wainwright, M., Jordan, M.: Learning to explain: An information-theoretic perspective on model interpretation. In: International Conference on Machine Learning, pp. 883–892. PMLR (2018)
Chen, K., Hwu, T., Kashyap, H.J., Krichmar, J.L., Stewart, K., Xing, J., Zou, X.: Neurorobots as a means toward neuroethology and explainable AI. Front. Neurorobotics 14, 570308 (2020)
Chen, M., Shi, X., Zhang, Y., Wu, D., Guizani, M.: Deep features learning for medical image analysis with convolutional autoencoder neural network. IEEE Trans. Big Data (2017)
Chen, P.Y., Sharma, Y., Zhang, H., Yi, J., Hsieh, C.J.: Ead: elastic-net attacks to deep neural networks via adversarial examples. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Chen, W., Grangier, D., Auli, M.: Strategies for training large vocabulary neural language models (2015). arXiv:1512.04906
Chen, C.L.P., Zhang, C.Y., Chen, L., Gan, M.: Fuzzy restricted Boltzmann machine for the enhancement of deep learning. IEEE Trans. Fuzzy Syst. 23(6), 2163–2173 (2015)
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Adv. Neural Inf. Process. Syst. 29 (2016)
Chen, Y.F., Everett, M., Liu, M., How, J.P.: Socially aware motion planning with deep reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350. IEEE (2017)
Chen, W., An, J., Li, R., Fu, L., Xie, G., Bhuiyan, M.Z.A., Li, K.: A novel fuzzy deep-learning approach to traffic flow prediction with uncertain spatial–temporal data features. Future Gener. Comput. Syst. 89, 78–88 (2018)
Chen, Z., Bei, Y., Rudin, C.: Concept whitening for interpretable image recognition. Nat. Mach. Intell. 2(12), 772–782 (2020)
Cherepkov, A., Voynov, A., Babenko, A.: Navigating the gan parameter space for semantic image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3671–3680 (2021)
Cheu, E.Y., Quek, C., Ng, S.K.: Arpop: an appetitive reward-based pseudo-outer-product neural fuzzy inference system inspired from the operant conditioning of feeding behavior in aplysia. IEEE Trans. Neural Netw. Learn. Syst. 23(2), 317–329 (2012)
Ching, T., Himmelstein, D.S., Beaulieu-Jones, B.K., Kalinin, A.A., Do, B.T., Way, G.P., Ferrero, E., Agapow, P.M., Zietz, M., Hoffman, M.M., et al.: Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15(141), 20170387 (2018)
Choi, E., Bahadori, M.T., Sun, J., Kulas, J., Schuetz, A., Stewart, W.: Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. Adv. Neural Inf. Process. Syst. 29 (2016)
Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
Cichy, R., Dwivedi, K.: The algonauts project 2021 challenge: how the human brain makes sense of a world in motion (2021). http://algonauts.csail.mit.edu

Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: International Conference on Machine Learning, pp. 160–167 (2008)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)
Cortes, C., Vapnik, V.: Support vector machine. Mach. Learn. 20(3), 273–297 (1995)
Cour, T., Sapp, B., Jordan, C., Taskar, B.: Learning from ambiguously labeled images. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 919–926. IEEE (2009)
Cover, T.M.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 3, 326–334 (1965)
Craw, S., Aamodt, A.: Case based reasoning as a model for cognitive artificial intelligence. In: International Conference on Case-Based Reasoning, pp. 62–77. Springer (2018)
Crawford, K., Calo, R.: There is a blind spot in AI research. Nature 538(7625), 311–313 (2016)
Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1, pp. 1–2. Prague (2004)
Cui, Z., Chen, W., Chen, Y.: Multi-scale convolutional neural networks for time series classification (2016). arXiv:1603.06995
Cunningham, P.: A taxonomy of similarity mechanisms for case-based reasoning. IEEE Trans. Knowl. Data Eng. 21(11), 1532–1543 (2008)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
Dabkowski, P., Gal, Y.: Real time image saliency for black box classifiers. Adv. Neural Inf. Process. Syst. 30 (2017)
Dailey, M.N., Cottrell, G.W., Padgett, C., Adolphs, R.: Empath: a neural network that categorizes facial expressions. J. Cogn. Neurosci. 14(8), 1158–1173 (2002)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893. IEEE (2005)
d’Alessandro, B., O’Neil, C., LaGatta, T.: Conscientious classification: a data scientist’s guide to discrimination-aware classification. Big Data 5(2), 120–134 (2017)
d’Angelo, E., Jacques, L., Alahi, A., Vandergheynst, P.: From bits to images: inversion of local binary descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 874–887 (2013)
Das, R.T., Ang, K.K., Quek, C.: ierspop: a novel incremental rough set-based pseudo outer-product with ensemble learning. Appl. Soft Comput. 46, 170–186 (2016)
Datta, A., Sen, S., Zick, Y.: Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In: IEEE Symposium on Security and Privacy (SP), pp. 598–617. IEEE (2016)
Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: International Conference on Machine Learning, pp. 933–941. PMLR (2017)
de Campos Souza, P.V.: Fuzzy neural networks and neuro-fuzzy networks: a review the main techniques and applications used in the literature. Appl. Soft Comput. 92, 106275 (2020)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Deng, Y., Ren, Z., Kong, Y., Bao, F., Dai, Q.: A hierarchical fused fuzzy deep neural network for data classification. IEEE Trans. Fuzzy Syst. 25(4), 1006–1012 (2016)
DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout (2017). arXiv:1708.04552
Dhurandhar, A., Chen, P.Y., Luss, R., Tu, C.C., Ting, P., Shanmugam, K., Das, P.: Explanations based on the missing: Towards contrastive explanations with pertinent negatives. Adv. Neural Inf. Process. Syst. 31 (2018)

Di Gangi, M.A., Cattoni, R., Bentivogli, L., Negri, M., Turchi, M.: Must-c: a multilingual speech translation corpus. In: 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2012–2017. Association for Computational Linguistics (2019)
Diez-Olivan, A., Del Ser, J., Galar, D., Sierra, B.: Data fusion and machine learning for industrial prognosis: trends and perspectives towards industry 4.0. Inf. Fusion 50, 92–111 (2019)
Doran, D., Schulz, S., Besold, T.R.: What does explainable AI really mean? A new conceptualization of perspectives (2017). arXiv:1710.00794
Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning (2017). arXiv:1702.08608
Došilović, F.K., Brčić, M., Hlupić, N.: Explainable artificial intelligence: A survey. In: International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215. IEEE (2018)
Dosovitskiy, A., Brox, T.: Inverting visual representations with convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4829–4837 (2016)
Dosovitskiy, A., Brox, T., et al.: Inverting convolutional networks with convolutional networks, vol. 4 (2015). arXiv:1506.02753
Du, M., Liu, N., Hu, X.: Techniques for interpretable machine learning. Commun. ACM 63(1), 68–77 (2019)
Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning (2016). arXiv:1603.07285
Dunteman, G.H.: Principal Components Analysis, vol. 69. SAGE (1989)
Durugkar, I., Gemp, I., Mahadevan, S.: Generative multi-adversarial networks. In: International Conference on Learning Representations (2017)
Eberle, O., Büttner, J., Kräutli, F., Müller, K.R., Valleriani, M., Montavon, G.: Building and interpreting deep similarity models (2020). arXiv:2003.05431
El Hatri, C., Boumhidi, J.: Fuzzy deep learning based urban traffic incident detection. Cogn. Syst. Res. 50, 206–213 (2018)
Ekman, F., Johansson, M., Sochor, J.: Creating appropriate trust in automated vehicle systems: a framework for HMI design. IEEE Trans. Hum.-Mach. Syst. 48(1), 95–101 (2017)
Eksombatchai, C., Jindal, P., Liu, J.Z., Liu, Y., Sharma, R., Sugnet, C., Ulrich, M., Leskovec, J.: Pixie: a system for recommending 3+ billion items to 200+ million users in real-time. In: World Wide Web Conference, pp. 1775–1784 (2018)
Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Univ. Montr. 1341(3), 1 (2009)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Fan, F., Wang, G.: Fuzzy logic interpretation of quadratic networks. Neurocomputing 374, 10–21 (2020)
Fan, F.L., Xiong, J., Li, M., Wang, G.: On interpretability of artificial neural networks: a survey. IEEE Trans. Radiat. Plasma Med. Sci. 5(6), 741–760 (2021)
Faruqui, M., Dyer, C.: Improving vector space word representations using multilingual correlation. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 462–471 (2014)
Faruqui, M., Tsvetkov, Y., Rastogi, P., Dyer, C.: Problems with evaluation of word embeddings using word similarity tasks (2016). arXiv:1605.02276
Fawaz, H.I., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Deep learning for time series classification: a review. Data Min. Knowl. Discov. 33(4), 917–963 (2019)
Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)
Feige, U.: A threshold of ln n for approximating set cover. J. ACM (JACM) 45(4), 634–652 (1998)

Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vis. 59(2), 167–181 (2004)
Ferret, O.: Testing semantic similarity measures for extracting synonyms from a corpus. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10) (2010)
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. In: International Conference on World Wide Web, pp. 406–414 (2001)
Firth, J.R.: A synopsis of linguistic theory, 1930–1955. Studies in Linguistic Analysis (1957)
Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437 (2017)
Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks (2018). arXiv:1803.03635
Freitas, A.A.: Comprehensible classification models: a position paper. ACM SIGKDD Explor. Newsl. 15(1), 1–10 (2014)
Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. Ann. Appl. Stat. 916–954 (2008)
Fritzke, B.: Fast learning with incremental RBF networks. Neural Process. Lett. 1(1), 2–5 (1994)
Fritzke, B.: A growing neural gas network learns topologies. Adv. Neural Inf. Process. Syst. 7 (1994)
Frosst, N., Hinton, G.: Distilling a neural network into a soft decision tree (2017). arXiv:1711.09784
Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word embeddings. In: Annual Meeting of the Association for Computational Linguistics, pp. 1199–1209 (2014)
Fukushima, K., Miyake, S.: Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and Cooperation in Neural Nets, pp. 267–285. Springer (1982)
Fullér, R., Werners, B.: The compositional rule of inference: introduction, theoretical considerations, and exact calculation formulas (1991)
Funke, T., Khosla, M., Anand, A.: Hard masking for explaining graph neural networks (2020)
Gao, T., Chai, Y.: Improving stock closing price prediction using recurrent neural network and technical indicators. Neural Comput. 30(10), 2833–2854 (2018)
Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style (2015). arXiv:1508.06576
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423 (2016)
Gehrmann, S., Strobelt, H., Krüger, R., Pfister, H., Rush, A.M.: Visual interaction with deep learning models through collaborative semantic inference. IEEE Trans. Vis. Comput. Graph. 26(1), 884–894 (2019)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
Geis, J.R., Brady, A.P., Wu, C.C., Spencer, J., Ranschaert, E., Jaremko, J.L., Langer, S.G., Kitts, A.B., Birch, J., Shields, W.F., et al.: Ethics of artificial intelligence in radiology: summary of the joint European and North American multisociety statement. Can. Assoc. Radiol. J. 70(4), 329–334 (2019)
Ghorbani, A., Abid, A., Zou, J.: Interpretation of neural networks is fragile. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3681–3688 (2019)
Ghorbani, A., Wexler, J., Zou, J.Y., Kim, B.: Towards automatic concept-based explanations. Adv. Neural Inf. Process. Syst. 32 (2019)
Ghosh, A., Kulharia, V., Namboodiri, V.P., Torr, P.H., Dokania, P.K.: Multi-agent diverse generative adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 8513–8521 (2018)

Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an overview of interpretability of machine learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89. IEEE (2018)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. JMLR Workshop and Conference Proceedings (2011)
Goetschalckx, L., Andonian, A., Oliva, A., Isola, P.: Ganalyze: toward visual definitions of cognitive image properties. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5744–5753 (2019)
Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. (CSUR) 23(1), 5–48 (1991)
Goldstein, A., Kapelner, A., Bleich, J., Pitkin, E.: Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat. 24(1), 44–65 (2015)
Goodfellow, I.: Nips 2016 tutorial: Generative adversarial networks (2016). arXiv:1701.00160
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27 (2014)
Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2014). arXiv:1412.6572
Goodman, B., Flaxman, S.: European union regulations on algorithmic decision-making and a “right to explanation”. AI Mag. 38(3), 50–57 (2017)
Gosiewska, A., Kozak, A., Biecek, P.: Simpler is better: lifting interpretability-performance trade-off via automated feature engineering. Decis. Support Syst. 150, 113556 (2021)
Goyal, Y., Mohapatra, A., Parikh, D., Batra, D.: Towards transparent AI systems: interpreting visual question answering models (2016). arXiv:1608.08974
Greydanus, S., Koul, A., Dodge, J., Fern, A.: Visualizing and understanding atari agents. In: International Conference on Machine Learning, pp. 1792–1801. PMLR (2018)
Griewank, A.: Who invented the reverse mode of differentiation. Documenta Mathematica, Extra Volume ISMP, pp. 389–400 (2012)
Grossberg, S.: Competitive learning: from interactive activation to adaptive resonance. Cogn. Sci. 11(1), 23–63 (1987)
Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., Giannotti, F.: Local rule-based explanations of black box decision systems (2018). arXiv:1805.10820
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 51(5), 1–42 (2018)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. Adv. Neural Inf. Process. Syst. 30 (2017)
Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA) (2017)
Guo, K., Hu, Y., Qian, Z., Liu, H., Zhang, K., Sun, Y., Gao, J., Yin, B.: Optimized graph convolution recurrent neural network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 22(2), 1138–1149 (2020)
Guo, S., Huang, W., Zhang, X., Srikhanta, P., Cui, Y., Li, Y., Adam, H., Scott, M.R., Belongie, S.: The imaterialist fashion attribute dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0 (2019)
Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision, pp. 87–102. Springer (2016)
Gupta, P., MacAvaney, S.: On survivorship bias in MS marco (2022). arXiv:2204.12852
Gupta, R., Goodman, B., Patel, N., Hosfelt, R., Sajeev, S., Heim, E., Doshi, J., Lucas, K., Choset, H., Gaston, M.: Creating XBD: a dataset for assessing building damage from satellite imagery. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 10–17 (2019)

Gurney, K.: An Introduction to Neural Networks. CRC Press (2018)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3(Mar), 1157–1182 (2003)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, pp. 1735–1742. IEEE (2006)
Haeusler, R., Klette, R.: Analysis of kitti data for stereo analysis with stereo confidence measures. In: European Conference on Computer Vision, pp. 158–167. Springer (2012)
Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications (2017). arXiv:1709.05584
Han, H., Jain, A.K.: Age, gender and race estimation from unconstrained face images. Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA, MSU Technical Report (MSU-CSE-14-5), vol. 87, p. 27 (2014)
Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Adv. Neural Inf. Process. Syst. 29 (2016)
Härkönen, E., Hertzmann, A., Lehtinen, J., Paris, S.: Ganspace: discovering interpretable gan controls. Adv. Neural Inf. Process. Syst. 33, 9841–9850 (2020)
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Harwell, D.: Federal study confirms racial bias of many facial-recognition systems, casts doubt on their expanding use. Washington Post 19 (2019)
Hastie, T.J.: Generalized additive models. In: Statistical Models in S, pp. 249–307. Routledge (2017)
Hatami, N., Sdika, M., Ratiney, H.: Magnetic resonance spectroscopy quantification using deep learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 467–475. Springer (2018)
Hayashi, I., Nomura, H., Yamasaki, H., Wakami, N.: Construction of fuzzy inference rules by NDF and NDFL. Int. J. Approx. Reas. 6(2), 241–266 (1992)
Hayashi, Y., Buckley, J.J.: Implementation of fuzzy max-min neural controller. In: Proceedings of ICNN’95-International Conference on Neural Networks, vol. 6, pp. 3113–3117. IEEE (1995)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, Z., Kan, M., Shan, S.: Eigengan: layer-wise eigen-learning for gans. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14408–14417 (2021)
Hebb, D.O.: The Organization of Behavior: A Psychological Theory. Wiley, New York (1949)
Heiss, W.H., McGrew, D.L., Sirmans, D.: Nexrad: next generation weather radar (wsr-88d). Microw. J. 33(1), 79–89 (1990)
Hempel, C.G., Oppenheim, P.: Studies in the logic of explanation. Philos. Sci. 15(2), 135–175 (1948)
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples (2019). arXiv:1907.07174
Herman, A.: Are you visually intelligent? What you don’t see is as important as what you do see. Medical Daily (2016)
Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE (2017)

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 30 (2017)
Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., Lerchner, A.: Darla: Improving zero-shot transfer in reinforcement learning. In: International Conference on Machine Learning, pp. 1480–1490. PMLR (2017)
Hill, F., Reichart, R., Korhonen, A.: Simlex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
Hinton, G., Srivastava, N., Swersky, K.: Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent. Lecture notes (2012)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Hinton, G.E., Shallice, T.: Lesioning an attractor network: investigations of acquired dyslexia. Psychol. Rev. 98(1), 74 (1991)
Hintzman, D.L., Block, R.A.: Repetition and memory: evidence for a multiple-trace hypothesis. J. Exp. Psychol. 88(3), 297 (1971)
Hirota, Y., Nakashima, Y., Garcia, N.: Gender and racial bias in visual question answering datasets (2022). arXiv:2205.08148
Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 6(02), 107–116 (1998)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Hofman, J.M., Sharma, A., Watts, D.J.: Prediction and explanation in social systems. Science 355(6324), 486–488 (2017)
Holzinger, A., Biemann, C., Pattichis, C.S., Kell, D.B.: What do we need to build explainable AI systems for the medical domain? (2017). arXiv:1712.09923
Hooker, S., Erhan, D., Kindermans, P.J., Kim, B.: A benchmark for interpretability methods in deep neural networks. Adv. Neural Inf. Process. Syst. 32 (2019)
Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79(8), 2554–2558 (1982)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017). arXiv:1704.04861
Hruschka, E.R., Ebecken, N.F.: Extracting rules from multilayer perceptrons in classification problems: a clustering-based approach. Neurocomputing 70(1–3), 384–397 (2006)
Hsu, F.H.: Ibm’s deep blue chess grandmaster chips. IEEE Micro 19(2), 70–81 (1999)
Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing deep neural networks with logic rules (2016). arXiv:1603.06318
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In: Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition (2008)
Huang, Q., Yamada, M., Tian, Y., Singh, D., Chang, Y.: Graphlime: local interpretable model explanations for graph neural networks. IEEE Trans. Knowl. Data Eng. (2022)
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)
Huang, X., Kroening, D., Ruan, W., Sharp, J., Sun, Y., Thamo, E., Wu, M., Yi, X.: A survey of safety and trustworthiness of deep neural networks: verification, testing, adversarial attack and defence, and interpretability. Comput. Sci. Rev. 37, 100270 (2020)
Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160(1), 106–154 (1962)

Huszar, F.: Gaussian distributions are soap bubbles. inFERENCe (2018). https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
Hutchinson, P., Stopp, E.: Johann Wolfgang von Goethe: Maxims and Reflections. Penguin (1998)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: alexnet level accuracy with 50x fewer parameters and