Computational Models for Cognitive Vision [1 ed.] 1119527864, 9781119527862

Learn how to apply cognitive principles to the problems of computer vision.


English · Pages 240 [233] · Year 2020


Table of contents:
Cover
Computational Models for Cognitive Vision
Copyright
Contents
About the Author
Acknowledgments
Preface
Acronyms
1 Introduction
2 Early Vision
3 Bayesian Reasoning for Perception and Cognition
4 Late Vision
5 Visual Attention
6 Cognitive Architectures
7 Knowledge Representation for Cognitive Vision
8 Deep Learning for Visual Cognition
9 Applications of Visual Cognition
10 Conclusion
References
Index


Computational Models for Cognitive Vision

IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Ekram Hossain, Editor in Chief

Jón Atli Benediktsson, David Alan Grier, Elya B. Joffe, Xiaoou Li, Peter Lian, Andreas Molisch, Saeid Nahavandi, Jeffrey Reed, Diomidis Spinellis, Sarah Spurgeon, Ahmet Murat Tekalp

About IEEE Computer Society

IEEE Computer Society is the world’s leading computing membership organization and the trusted information and career-development source for a global workforce of technology leaders including: professors, researchers, software engineers, IT professionals, employers, and students. The unmatched source for technology information, inspiration, and collaboration, the IEEE Computer Society is the source that computing professionals trust to provide high-quality, state-of-the-art information on an on-demand basis. The Computer Society provides a wide range of forums for top minds to come together, including technical conferences, publications, and a comprehensive digital library, unique training webinars, professional training, and the Tech Leader Training Partner Program to help organizations increase their staff’s technical knowledge and expertise, as well as the personalized information tool my Computer. To find out more about the community for technology leaders, visit http://www.computer.org.

IEEE/Wiley Partnership

The IEEE Computer Society and Wiley partnership allows the CS Press authored book program to produce a number of exciting new titles in areas of computer science, computing, and networking with a special focus on software engineering. IEEE Computer Society members receive a 35% discount on Wiley titles by using their member discount code. Please contact IEEE Press for details. To submit questions about the program or send proposals, please contact Mary Hatcher, Editor, Wiley-IEEE Press: Email: [email protected], John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774.

Computational Models for Cognitive Vision
Hiranmay Ghosh

Ex-Advisor, TCS Research

Copyright © 2020 The IEEE Computer Society, Inc. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:
Names: Ghosh, Hiranmay, author.
Title: Computational models for cognitive vision / Hiranmay Ghosh.
Description: Hoboken, New Jersey : Wiley-IEEE Computer Society Press, [2020] | Includes bibliographical references and index.
Identifiers: LCCN 2020003784 (print) | LCCN 2020003785 (ebook) | ISBN 9781119527862 (paperback) | ISBN 9781119527855 (adobe pdf) | ISBN 9781119527893 (epub)
Subjects: LCSH: Computer vision. | Cognitive science. | Visual perception. | Bayesian statistical decision theory.
Classification: LCC TA1634 .G483 2020 (print) | LCC TA1634 (ebook) | DDC 006.3/7–dc23
LC record available at https://lccn.loc.gov/2020003784
LC ebook record available at https://lccn.loc.gov/2020003785

Cover Design: Wiley
Cover Image: © Andriy Onufriyenko/Getty Images

Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India. Printed in the United States of America.

Contents

About the Author
Acknowledgments
Preface
Acronyms

1 Introduction
1.1 What Is Cognitive Vision
1.2 Computational Approaches for Cognitive Vision
1.3 A Brief Review of Human Vision System
1.4 Perception and Cognition
1.5 Organization of the Book

2 Early Vision
2.1 Feature Integration Theory
2.2 Structure of Human Eye
2.3 Lateral Inhibition
2.4 Convolution: Detection of Edges and Orientations
2.5 Color and Texture Perception
2.6 Motion Perception
2.6.1 Intensity-Based Approach
2.6.2 Token-Based Approach
2.7 Peripheral Vision
2.8 Conclusion

3 Bayesian Reasoning for Perception and Cognition
3.1 Reasoning Paradigms
3.2 Natural Scene Statistics
3.3 Bayesian Framework of Reasoning
3.4 Bayesian Networks
3.5 Dynamic Bayesian Networks
3.6 Parameter Estimation
3.7 On Complexity of Models and Bayesian Inference
3.8 Hierarchical Bayesian Models
3.9 Inductive Reasoning with Bayesian Framework
3.9.1 Inductive Generalization
3.9.2 Taxonomy Learning
3.9.3 Feature Selection
3.10 Conclusion

4 Late Vision
4.1 Stereopsis and Depth Perception
4.2 Perception of Visual Quality
4.3 Perceptual Grouping
4.4 Foreground–Background Separation
4.5 Multi-stability
4.6 Object Recognition
4.6.1 In-Context Object Recognition
4.6.2 Synthesis of Bottom-Up and Top-Down Knowledge
4.6.3 Hierarchical Modeling
4.6.4 One-Shot Learning
4.7 Visual Aesthetics
4.8 Conclusion

5 Visual Attention
5.1 Modeling of Visual Attention
5.2 Models for Visual Attention
5.2.1 Cognitive Models
5.2.2 Information-Theoretic Models
5.2.3 Bayesian Models
5.2.4 Context-Based Models
5.2.5 Object-Based Models
5.3 Evaluation
5.4 Conclusion

6 Cognitive Architectures
6.1 Cognitive Modeling
6.1.1 Paradigms for Modeling Cognition
6.1.2 Levels of Abstraction
6.2 Desiderata for Cognitive Architectures
6.3 Memory Architecture
6.4 Taxonomies of Cognitive Architectures
6.5 Review of Cognitive Architectures
6.5.1 STAR: Selective Tuning Attentive Reference
6.5.2 LIDA: Learning Intelligent Distribution Agent
6.6 Biologically Inspired Cognitive Architectures
6.7 Conclusions

7 Knowledge Representation for Cognitive Vision
7.1 Classicist Approach to Knowledge Representation
7.1.1 First Order Logic
7.1.2 Semantic Networks
7.1.3 Frame-Based Representation
7.2 Symbol Grounding Problem
7.3 Perceptual Knowledge
7.3.1 Representing Perceptual Knowledge
7.3.2 Structural Description of Scenes
7.3.3 Qualitative Spatial and Temporal Relations
7.3.4 Inexact Spatiotemporal Relations
7.4 Unifying Conceptual and Perceptual Knowledge
7.5 Knowledge-Based Visual Data Processing
7.6 Conclusion

8 Deep Learning for Visual Cognition
8.1 A Brief Introduction to Deep Neural Networks
8.1.1 Fully Connected Networks
8.1.2 Convolutional Neural Networks
8.1.3 Recurrent Neural Networks
8.1.4 Siamese Networks
8.1.5 Graph Neural Networks
8.2 Modes of Learning with DNN
8.2.1 Supervised Learning
8.2.1.1 Image Segmentation
8.2.1.2 Object Detection
8.2.2 Unsupervised Learning with Generative Networks
8.2.3 Meta-Learning: Learning to Learn
8.2.3.1 Reinforcement Learning
8.2.3.2 One-Shot and Few-Shot Learning
8.2.3.3 Zero-Shot Learning
8.2.3.4 Incremental Learning
8.2.4 Multi-task Learning
8.3 Visual Attention
8.3.1 Recurrent Attention Models
8.3.2 Recurrent Attention Model for Video
8.4 Bayesian Inferencing with Neural Networks
8.5 Conclusion

9 Applications of Visual Cognition
9.1 Computational Photography
9.1.1 Color Enhancement
9.1.2 Intelligent Cropping
9.1.3 Face Beautification
9.2 Digital Heritage
9.2.1 Digital Restoration of Images
9.2.2 Curating Dance Archives
9.3 Social Robots
9.3.1 Dynamic and Shared Spaces
9.3.2 Recognition of Visual Cues
9.3.3 Attention to Socially Relevant Signals
9.4 Content Re-purposing
9.5 Conclusion

10 Conclusion
10.1 “What Is Cognitive Vision” Revisited
10.2 Divergence of Approaches
10.3 Convergence on the Anvil?

References
Index


About the Author

Hiranmay Ghosh is a researcher in Computer Vision, Artificial Intelligence, Machine Learning, and Cognitive Computing. He received his Ph.D. degree from the Electrical Engineering Department of IIT Delhi and his B.Tech. degree in Radiophysics and Electronics from Calcutta University. Hiranmay was a research adviser with Tata Consultancy Services, and has been associated with R&D and engineering activities for more than 40 years in industry and autonomous research laboratories. He has been invited to teach at the Indian Institute of Technology Delhi and the National Institute of Technology Karnataka as Adjunct Faculty. He is also a co-author of the book Multimedia Ontology: Representation & Applications. He is a Senior Member of IEEE, a Life Member of IUPRAI, and a Member of ACM.


Acknowledgments

This book is an outcome of my studies in cognitive vision, and presents an overview of the subject. At the outset, I am indebted to all the researchers who have toiled to make the subject grow. While I have cited much of this research, I could not do justice to all of it within the finite bounds of this book. Further, my thanks go to the management of arXiv, ACM, IEEE, ResearchGate, and other digital repositories, without which access to the research papers would not have been possible. Writing the book provided me with an opportunity to interact with several researchers in the field. My special thanks go to the researchers who have engaged in discussions, forwarded their research material, and happily agreed to my requests to reproduce it in the book. The names include Aude Oliva, Calden Wloka, Charles Kemp, David Vernon, Guanbin Li, Guilin Liu, Harald Haarmann, J.K. Aggarwal, Jason Yosinski, John Tsotsos, Marcin Andrychowicz, Nicholas Wade, Pritty Patel-Grosz, Roberto Cipolla, Rudra Poudel, S.M. Reza Soroushmehr, Sangho Park, Stan Franklin, Sumit Chopra, Ulrich Engelke, and V. Badrinarayan. Further, many illustrations in the book have been borrowed from the Wikimedia Commons library, and I thank their authors for creating those diagrams and licensing them for general use, and the Wikimedia management for providing a platform for sharing them. I thank the management of the TCS Research team and my colleagues, especially K. Ananth Krishnan, P. Balamurali, and Hrishikesh Sharma, for providing me with an environment and support for conducting my studies. I thank Prof. V.S. Subrahmanian, Prof. Ramesh Jain, and Prof. Santanu Chaudhury for encouraging me to write the book. My thanks go to the management and the production team of IEEE Press and John Wiley & Sons for taking up the publication of the book. Special mentions go


to Mary Hatcher, Victoria Bradshaw, Louis V. Manoharan, and Gayathree Sekar, for providing me with the necessary support at various stages of authoring and production. Finally, I thank my spouse Sharmila for encouraging and supporting me at every stage of my research and authoring this book and for selecting an appropriate cover for the book. – Hiranmay Ghosh


Preface

As I started on my research on cognitive vision a few years back, it became apparent that there is no up-to-date textbook on the subject. There has been tremendous research on cognitive vision in recent years, and rich material lies scattered in myriad articles and other scientific publications. This has been my prime motivation for compiling the material as a coherent narrative in the form of this book, which may initiate a reader into the various aspects of the subject. As I proceeded with my research, I realized that cognitive vision is still an immature technology. It is struggling its way toward its ambitious goal of achieving the versatility and capabilities of the human vision system. It was also evident that the scope of cognitive vision is ill-defined. There is not one single way to emulate human vision, and researchers have trodden diverse paths. The gamut of research appears like islands in an ocean, a good part of which is yet to be traversed. This posed a formidable difficulty in organizing the book in a linear and cohesive manner. The sequence that I finally settled on is one of the many possible alternatives. This book does not represent my contribution to the subject, but collates the work of many researchers to create a coherent narrative with a wide coverage. It is primarily intended for academic as well as industry researchers who want to explore the arena of cognitive vision and apply it to real-life problems. The goal of the book is to demystify many mysteries that the human visual system holds, and to provide computational models for them that can be realized in artificial vision systems. Since cognitive vision is a vast subject, it has not been possible to cover it exhaustively in the finite expanse of this book. To overcome this shortcoming, I have tried to provide as many references as possible for the readers to explore the subject further. I have consciously given preference to surveys and reviews that provide many more pointers to the rich research on the individual topics. Further, since cognitive vision is a fast-growing research area, I have tried to cover as much recent research as possible, without compromising on the classical


texts on the subject. Nevertheless, these citations are not exhaustive, but provide just a sample of the major research directions. Use of this book as course material on the subject is also envisaged. However, my suggestion would be to restrict the number of topics and to expand on them. In particular, the realization of cognitive vision through deep learning in an emergent architecture, which is briefly reviewed in Chapter 8, can be a subject in itself and be dealt with independently. I hope that the book will be beneficial to the academic as well as the industrial community for a significant period of time to come. Hiranmay Ghosh

Acronyms

AB-RNN  attention-based recurrent neural network
AGC  automatic gain control
AGI  artificial general intelligence
AI  artificial intelligence
AIM  attention-based information maximization
ANN  artificial neural network
AUC-ROC  area under the curve – receiver operating characteristics
AVC  advanced video coding
BN  Bayesian network
BNN  Bayesian neural network
BRNN  bi-directional recurrent neural network
CAM  content addressable memory
CIE  international commission on illumination
CNN  convolutional neural network
CP  Cognitive Program
CR  consequential region
CRF  conditional random field
CSM  current situational model
DBN  dynamic Bayesian network
DL  deep learning
DL  description logics
DNN  deep neural network
DoG  difference of Gaussians
FCNN  fully convolutional neural network
FDM  fixation density map
FoL  first order logic
FR  full reference (visual quality assessment)
GAN  generative adversarial network
GNN  graph neural network
GPU  graphics processing unit
GW  global workspace
HBM  hierarchical Bayesian model
HBN  hierarchical Bayesian network
HDR  high dynamic range
HMM  hidden Markov model
HRI  human–robot interaction
HSE  human social environment
HSV  hue-saturation-value
HVS  human vision system
ICD  Indian classical dance
ICH  intangible cultural heritage
KLD  Kullback–Leibler divergence
LDA  latent Dirichlet allocation
LDR  low dynamic range
LIDA  Learning Intelligent Distribution Agent
LoG  Laplacian of Gaussian
LOTH  language of thought hypothesis
LSTM  long short term memory
LTM  long-term memory
MCMC  Markov chain Monte Carlo
MEBN  multi-entity Bayesian network
MOWL  Multimedia Web Ontology Language
MSE  mean square error
MTL  multitask learning
NLP  natural language processing
NR  no reference (visual quality assessment)
NSS  natural scene statistics
NTM  neural turing machine
OWL  web ontology language
PAM  perceptual associative memory
PLCC  Pearson linear correlation coefficient
PSNR  peak signal to noise ratio
RAM  recurrent attention model
RBS  rule based systems
RGB  red–green–blue
RNN  recurrent neural network
RR  reduced reference (visual quality assessment)
RTM  representational theory of mind
SALICON  SALIency in CONtext
SGP  symbol grounding problem
SLAM  simultaneous localization and mapping
SMC  sensori-motor contingencies
SSIM  structural similarity index measure
ST  selective tuning (attention model)
STAR  selective tuning attentive reference
STM  short term memory
SURF  speeded up robust features
SWRL  semantic web rule language
TCS  TATA Consultancy Services
VQA  visual query answering
W3C  World-Wide Web Consortium
WTA  winner take all


1 Introduction

The human vision system (HVS) has a remarkable capability of building three-dimensional models of the environment from the visual signals received through the eyes. The goal of computer vision research is to emulate this capability on man-made apparatus, such as computers. The twentieth century saw a tremendous growth in the field of computer vision. Starting with signal processing techniques for demarcating objects in the space-time continuum of visual signals, the field has embraced several other disciplines, like artificial intelligence and machine learning, for interpreting visual contents. As research in computer vision matured, it was pushed to address several real-life problems toward the turn of the century. Examples of such challenging applications include visual surveillance, medical image analysis, computational photography, digital heritage, robotic navigation, and so on. Though computer vision has shown extremely promising results in many applications in restricted domains, its performance lags that of the HVS by a large margin. While the HVS can effortlessly interpret complex scenes, e.g. those shown in Figure 1.1, artificial vision fails to do so. It is “intuitive” for humans to comprehend the semantics of the scenes at multiple levels of abstraction, and to predict the next movements with some degree of certainty. Derivation of such semantics remains a formidable challenge for artificial vision systems. Further, many real-life applications demand analysis of imperfect imagery, for example with poor lighting, blur, occlusions, noise, background clutter, and so forth. While human vision is robust to such imperfections, computer vision systems often fail to perform in such cases. These revelations have motivated deeper study of the HVS and the application of its principles to computer vision applications.



Figure 1.1 Hard challenges for computer vision. (a) “The offensive player … is about to shoot the ball at the goal …” Source: File shared by Rick Dikeman through Wikimedia Commons, file name: Football_iu_1996.jpg. (b) A facial expression in Bharatnatyam dance. Source: File shared by Suyash Dwivedi through Wikimedia Commons, file name: Bharatnatyam_different_facial_expressions_(9).jpg.

1.1 What Is Cognitive Vision

Though there is a broad agreement in the scientific community that cognitive vision pertains to the application of principles of biological (especially, human) vision systems to computer vision applications, the space of cognitive vision studies is not well defined (Vernon 2006). The boundary between vision and cognition is thin, and cognitive vision operates in that gray area. Broadly speaking, cognitive vision involves the abilities to survey a visual scene, recognize and locate objects of interest, act on visual stimuli, learn and generate new knowledge, dynamically update a visual map that represents reality, and so on. Perception and reasoning are two important pillars on which cognitive vision stands. A crucial point is that the entire gamut of activities must run in real time to enable an agent to engage with the real world. It is an emerging area of research integrating methodologies from various disciplines like artificial intelligence, computer vision, machine learning, cognitive science, and psychology. There is no single approach to cognitive vision, and the proposed solutions to the different problems appear like islands in an ocean. In this book, we have attempted to put together computational theories for a set of cognitive vision problems, organized in an attempt to develop a coherent narrative for the subject. We shall gain more insight into what cognitive vision is as we proceed through the book, and shall characterize it in clearer terms in Chapter 10.


1.2 Computational Approaches for Cognitive Vision

Two branches of science have significantly contributed to the understanding of the processes for cognition from visual as well as other sensory signals. One of them is psychophysics, which is defined as the “study of quantitative relations between psychological events and physical events or, more specifically, between sensations and the stimuli that produce them” (Encyclopedia Britannica). The subject was established by Gustav Fechner and is a marriage between the study of sensory processes and that of physical stimuli. The other branch of science that has facilitated our understanding of perception and cognition is neurophysiology, which combines physiology and the neural sciences for an understanding of the functions of the nervous system. The two approaches are complementary to each other. While psychophysics answers what happens during cognition, neurophysiology explains how it is realized in the biological nervous system. Researchers on cognitive vision have long recognized vision as an information-processing activity of the biological neural system. However, a formal computational approach to understanding cognition has been a fundamental contribution of David Marr (1976). Marr abstracted vision into three separable layers, namely (i) hardware, (ii) representation and algorithms, and (iii) computational theory. This abstraction enables computational theories of cognitive vision to be formulated independently of their implementations in the biological vision system. It also provides a theory for realizing cognitive functions in artificial systems made up of altogether different hardware, possibly using different representations and algorithms. Further, Marr’s model of vision assumes modularity and a pipelined architecture, two important properties of information processing systems that allow independent formulation of the different cognitive processes with defined interfaces. Marr identifies three stages of processing for vision. The first involves finding the basic contours that mark the object boundaries. The second discovers the surfaces and their orientations, resulting in an observer-centric 2½-dimensional (2½D) model. The third involves knowledge-based interpretation of the model into an observer-neutral set of objects that constitute the 3D environment. These three stages roughly correspond to the early vision, perception, and cognition stages of vision, as recognized in the modern literature, and which we shall describe shortly. As suggested by David Marr, it is possible to study computational theories of cognitive vision in isolation from the biological systems, and we propose to do exactly that in this book. However, such computational models need to explain the what part of cognition. For that purpose, we shall refer to the results of psychophysical experiments, wherever relevant, without going into details


of the experimental setups. Further, though the goal of computational modeling is to support alternate (artificial) implementations of cognition that need not be based on biological implementation models, analysis of the latter often provides clues to plausible implementation schemes. We shall discuss the results of some relevant neurophysiological studies in the book. We shall consciously keep such discussions at a superficial level, so that the text can be followed without a deep knowledge of either psychology or neurosciences.

1.3 A Brief Review of Human Vision System

We briefly look into how human vision works in this section, in order to put the rest of the text in this book in context. A broad overview of the HVS is presented in Figure 1.2. It comprises a pair of eyes connected to the brain via the optic nerves. When one looks at a scene, the light rays enter the eyes to form a pair of inverted images on screens at the back of the eyes, known as the retinas. This corresponds to a mapping of the external 3D world to a pair of 2D images with slightly different perspectives. Internal representations of the images are transmitted to the visual cortex at the rear end of the brain by a bundle of optic nerves, where the images are correlated and interpreted to reconstruct a symbolic description of the 3D world. In this simple model of biological vision, the eyes primarily act as the image capture devices in the system, and the brain as the interpreter. In reality, things are much more complex. The output from the eyes is not a faithful reproduction of the images received. Significant transformations take place on the retina, which enable efficient identification of object contours and their movements. These transformations are collectively referred to as early vision. Further processing in the neural circuits of the brain that results in interpretation of the signals received from the eye is known as late vision. The goal of late vision is to establish the what and where of the objects located in the scene. It is believed that there are two distinct pathways in the human brain, ventral and dorsal, through which visual information is processed to answer these two aspects of vision (Milner and Goodale 1995). This has been emulated in several artificial vision systems, as we shall see in the following chapters of this book. One of the initial tasks in the late vision system is to correlate the images received from the two eyes, which is facilitated by the criss-cross connection of the optic nerves connecting the eyes with the brain. Further, the late vision system achieves progressive abstraction of the correlated images and leads to perception and cognition, which we discuss in some detail in Section 1.4.

Figure 1.2 An overview of human vision system. Source: Derivative work from file shared by Wiley through Wikimedia Commons, file name: Wiley_Human_Visual_System.gif.


1.4 Perception and Cognition

The first step in interpreting retinal images involves organization of visual data, the isolated patterns on the retina, to create a coherent interpretation of the environment. This stage is known as perception. Though we shall focus on visual perception in this book, biological perception generally results in a coordinated organization of inputs from all sensory organs. For example, human beings create a coordinated interpretation of visual and haptic signals while grabbing an object. For an artificial agent, for example a driver-less car, perception involves all the sensors that it is equipped with. In philosophical terms, perception is about asserting a truth about the environment by processing sensory data. However, the “assertion” by an agent can be different from the reality, e.g. a vehicle seen through the convex side-view mirrors of a car may be perceived to be farther than it actually is. Such “erroneous” perceptions often lead to illusions, some of which we shall discuss in Chapters 2 and 4 of this book. Some authors prefer to include a capability to respond to the percepts in the connotation of perception. Cognition refers to an experiential interpretation of the percepts. It involves reasoning about the properties of percepts with the background knowledge and experience that an agent possesses. Depending on the knowledge level of the agent, there can be many different levels of interpretation of the percepts. For example, Figure 1.1b can be interpreted in many ways with progressive levels of abstraction, such as a human face, a classical dance form, or an emotion expressed. Cognition may also result in “correcting” erroneous perceptions, using specific domain knowledge. For example, the knowledge of the properties of a convex mirror results in a more realistic estimate of the distance of an object seen through a side-view mirror of a car. Cognition involves the intentional state of an agent as well. For example, while driving an automobile, a driver analyzes the visual (and audio) percepts with the objective of reaching the destination while ensuring safety and complying with the traffic rules. In the process, the driver may focus on the road in front and the traffic lights, ignoring other percepts, such as the signage on the shop-fronts bordering the street. Such selective filtering of sensory data is known as attention. It is necessary to prevent the cognitive agent from being swamped with a huge volume of information that it cannot process. Thus, we find that cognition involves not only interpretation of the sensory signals but also many other factors, such as the intention, knowledge, attention, and memory of an agent. Moreover, the knowledge of a cognitive agent needs to be continuously updated for it to adapt to a new environment and to respond to yet unforeseen situations. For example, while driving on a hilly road, a city-driver needs to quickly learn the specific skills for hill driving to ensure a safe journey. The process through which the knowledge is updated is called learning, and is a critical requirement for a real-life agent.


Figure 1.3 A simple process model in a cognitive system.

The fundamental difference between perception and cognition is that the former results in acquisition of new information through sensory organs, while the latter is the process of experiential analysis of the acquired information with some intention. There is, however, a strong interaction between the two processes. Percepts, filtered through the attention mechanism, enter the cognitive process. On the other hand, cognitive interpretation of percepts results in signals to control further data acquisition and perception. This ensures need-based, just-in-time visual data collection based on the intention of a cognitive agent, which is also known as active vision. Moreover, discovery of new semantic patterns through the process of cognition leads to updates in the knowledge store of an agent. A simplified process model of a cognitive system is shown in Figure 1.3.

1.5 Organization of the Book

The characterization of cognitive vision and its various stages presented above sets the background for the rest of this book. We begin with the early vision system in Chapter 2, where we describe the transformations that an image goes through by the actions of the neural cells on the retina. In Chapter 3, we introduce the Bayesian


reasoning framework, which will be used to explain many of the perceptual and cognitive processes in the later chapters. We explain several perceptual and cognitive processes in Chapter 4. Chapter 5 deals with visual attention, the gateway between the world of perception and the world of cognition. While the earlier chapters describe the individual processes of perception and cognition, these need to be integrated in an architectural framework for the realization of cognitive systems. We characterize cognitive architectures, discuss their generic properties, and review a few popular and contemporary architectures as examples in Chapter 6. While the architectures provide generic cognitive capabilities and interaction with the environment, we focus on the functions for cognitive vision in these architectures. Knowledge is a crucial ingredient of a cognitive system, and we introduce classical approaches to its representation in Chapter 7. There is a huge corpus of recent research that attempts to emulate the biological vision system with artificial neural networks and aims to learn the cognitive processes with deep learning techniques. A discourse on cognitive vision cannot be complete without them. We present a cross-section of this research in Chapter 8. In this chapter, we elaborate on the various modes of learning capability that a real-life agent needs to possess, and that have been realized with deep learning techniques. We discuss a few real-life applications of visual cognition in Chapter 9 and illustrate the use of the principles of cognitive vision. In Chapter 10, we take a look through a rear-view mirror to review what we have studied, which enables us to characterize cognitive vision in more concrete terms. Further, we compare the two complementary paradigms of cognition, namely the classicist and connectionist approaches, and discuss a possible synergy between the two that may be on the anvil. Finally, a few words about the content of the book. Computational theories of cognitive vision form a vast subject, and it is not possible to cover all of it in the extent of one book. I have tried to condense as much information as possible in this book, without sacrificing understandability, and have provided an ample number of references for interested readers to explore the subject further. While providing the citations, I have given preference to authentic reviews and tutorials that should enable a reader to get an overview of the subject, and which may lead an inquisitive reader to many other relevant research publications. Also, cognitive vision being a rapidly evolving subject, I have tried to cover as much recent material as possible, without ignoring the classic texts on the subject. Though I focus on cognitive vision, many of the principles of perception and cognition discussed in the book are not exclusive to the visual system alone, but hold good for other sensory mechanisms as well.


2 Early Vision

Early vision refers to the functions of vision performed in the eye. It typically involves detection of edges, contours, and color blobs, while integration and semantic interpretation of these isolated features are left to the later stages of vision. We discuss this early stage of vision in this chapter. Early vision is generally characterized as pre-attentive vision, i.e. the visual functions that take place before attention comes into play. We begin this chapter with a psychological experiment that clearly demonstrates the distinction between the pre-attentive and post-attentive stages of vision. We follow this with a brief discussion of the structure of the human eye and the functionality of its important components. Further on, we analyze the complex transformations that an image undergoes in the human eye and the computational models for these transformations.

2.1 Feature Integration Theory

Psychological experiments by several researchers (Treisman and Gelade 1980; Julesz 1981) show that the early stage of the human vision system (HVS) is sensitive to some basic image features, such as color, orientation, and shape. This is evidenced by the fact that humans can identify visual targets that are distinguished by one of these features from the background or distractors almost instantaneously, e.g. the red “X” in Figure 2.1a, which is distinguished by its color from the distractors. It has been observed that the time taken to search for the target in such cases is independent of the number of distractors, which is explained with the hypothesis that human vision can index the patterns by these features and retrieve them very fast. When a target is distinguished from the distractors by a combination of features, searching takes extra effort, e.g. the red “X” in Figure 2.1b.


Figure 2.1 Feature integration theory. (a) It is easy to identify a red “X” amidst green ones. (b) It is not so easy when two features (color and shape) need to be combined to identify the target. See color plate section for color representation of this figure.

It has been found that the search time varies linearly with the number of distractors in such cases, leading to the hypothesis of a sequential scan at a later stage of vision. (The theory of a sequential scan throughout the search space has been challenged in attention-based visual search, as we shall see in Section 2.7 and in Chapter 5.) Thus, Treisman’s experiments clearly distinguish between an early and a late stage of vision.

2.2 Structure of Human Eye

Figure 2.2a provides a simplified schematic diagram of the human eye. The anterior segment of the eye comprises the cornea, the iris, and the lens. These components control the intensity of the input light rays and focus them on the retina located at the posterior of the eye. The retina embeds photo-sensors, which get excited when the light rays fall on them. The corresponding signals are carried by a bundle of optic nerves to the brain. The location where the optic nerves leave the eye is devoid of any photo-sensors and forms a blind spot. Figure 2.2b depicts a cross-section view of the retina. There are two types of sensors (photoreceptors) on the human retina, the rods and the cones. The rods are more sensitive to light, but cannot distinguish colors. They are responsible for black and white vision in poor lighting conditions. The cones are of three types, which have different sensitivities to different colors (wavelengths) of light. The S-, M-, and L-cones have higher sensitivity toward short, medium, and long wavelengths, respectively. They are responsible for color vision. Further, the rods and the cones are not evenly distributed on the retina. The cones are concentrated on the central portion of the retina, covering about 15% of the visual field, called the fovea, while the rods spread to cover about 60–80%. The sensitivity and distribution of the rods and cones are shown in Figure 2.3a and b, respectively.


Figure 2.2 Structure of human eye and retina. (a) A human eye. Source: File shared by Holly Fischer through Wikimedia Commons, file name: Three_Main_Layers_of_the_Eye.png. (b) Cross-section of human retina. Source: File shared by Cajal through Wikimedia Commons, file name: Retina-diagram.svg.

The uneven distribution of the cones on the retina results in a high-acuity image being formed only in the foveal region, which covers about 1.5–2° of the visual field. Best acuity occurs at the fovea centralis, which is about one-tenth of the fovea. When a person looks at an object, it is brought to the center of the visual field with eyeball movement.


Figure 2.3 Distribution and sensitivity of photoreceptors in human eye. (a) Sensitivity of the photoreceptors. Source: File shared by Maxim Rasin through Wikimedia Commons, file name: Cone-response-en.svg. (b) Photo-receptor distribution on retina. Source: File shared by Cmglee through Wikimedia Commons, file name: Human_photoreceptor_distribution.svg.

The process that controls the eyeball movement is known as visual attention, which we shall discuss in Chapter 5. The image formed in the rest of the visual field has low acuity and contributes to peripheral vision. It is estimated that there are some 120 million rods and 6 million cones on the retina. But there are fewer than one million optic nerve fibers that carry the information to the visual cortex of the brain. The ratio of the number of optic nerves


terminating in a locality to the number of photo-sensors in that locality varies from approximately 1:1 in the central region of the eye to as low as 1:600 in the outer peripheral region. Thus, an optic nerve, in most parts of the retina, carries information from several photoreceptors. The bipolar cells connect to either a set of rods or a set of cones to multiplex and transmit the received signals further in the network. The transmission is controlled by the horizontal cells, which connect to the photoreceptors laterally. The signals are transmitted to the ganglion cells, where the optic nerves terminate, via the amacrine cells. There are several types of amacrine cells, but their broad function is to control the inputs from the bipolar cells in specialized ways. There are at least 11 types of ganglion cells on the mammalian retina, as reported in Rockhill et al. (2002). Some of the ganglions, such as the parasol cells, gather data from many photoreceptors and exhibit fast response times; some others, like the midget cells, connect to a few photoreceptors and have slow responses. It is believed that each type of ganglion captures information about a specific feature of a visual scene, e.g. color, edges, motion, etc. Thus, the visual signals are transformed into several parallel representations (also called channels) in the retina before they are transmitted to the brain (Baden and Euler 2013). This complex interconnection of cells on the retina and their functions result in various effects that we shall review in Sections 2.3–2.7.

2.3 Lateral Inhibition

Figure 2.4 illustrates an important property of the early vision system.

Figure 2.4 Mach band illusion.


Figure 2.5 A computational model for lateral inhibition. Source: Redrawn after Ihaka (2003).

Though each vertical band in the image has a uniform color, it appears lighter to the left and darker to the right. This illusion is known as the Mach band effect. It is explained by the hypothesis that the response of a neuron is attenuated by the responses of the adjacent neurons in the network. The phenomenon is known as lateral inhibition. (It is not a property of the visual system alone; it has been observed in the context of all sensory organs and is believed to happen in abstract conceptual layers in the brain as well.) A computational model of lateral inhibition is schematically shown in Figure 2.5. In the figure, we depict the neurons as organized in a 1D array for the sake of simplicity. The output from a photoreceptor (rod), via a ganglion, is determined not only by its own excitation level but also by the negative inhibitory signals transmitted by the horizontal cells associated with the adjacent photoreceptors. We assume that some of the photoreceptors to the left are excited to level 10 (high) and the rest at the right are excited to level 2 (low), and that the strength of the inhibition signal sent by a horizontal cell to its neighbors is one-tenth of its excitation level. The resultant output signal that is transmitted by a bipolar cell is the sum of the excitation value of that neuron and the inhibition signals received from its two neighbors, as depicted in the graph at the bottom of Figure 2.5.
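This toy model is easy to reproduce numerically. The following minimal Python sketch follows the numeric example of Figure 2.5; the edge-padding assumption (the pattern continues uniformly beyond the array boundaries) is ours:

    import numpy as np

    # Excitation levels: three bright (10) and three dark (2) photoreceptors,
    # as in the numeric example of Figure 2.5.
    excitation = np.array([10.0, 10.0, 10.0, 2.0, 2.0, 2.0])

    # Each horizontal cell inhibits its two neighbors by 1/10th of its own
    # excitation level. Edge padding assumes a uniform surround beyond the array.
    padded = np.pad(excitation, 1, mode="edge")
    inhibition = 0.1 * padded
    output = excitation - inhibition[:-2] - inhibition[2:]
    print(output)  # [8.  8.  8.8 0.8 1.6 1.6]

The units adjacent to the boundary respond with 8.8 and 0.8, overshooting the interior responses of 8 and 1.6 — exactly the illusory brightening and darkening perceived at the Mach bands.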

2.4 Convolution: Detection of Edges and Orientations


Figure 2.6 Schematic representation of center-surround operations in the eye. (a) On-center and off-center configurations. (b) Differently oriented filters with automatic gain control.

The bipolar cells in the retina are connected to the ganglions with a complex structure. Some of the bipolar cells, when excited, send positive responses to the ganglion, and some send negative signals. However, this is not in a random order. The two-dimensional organization of the bipolar cells is schematically shown in a honeycomb structure in Figure 2.6a. Receptive fields of ganglions that cover localities of bipolar cells are shown as dotted circles. The central and the surrounding parts of the receptive field of a ganglion connect to bipolar cells of opposite polarities. There can be two alternate configurations, as shown in the diagram: (i) on-center, where the light falling on the center increases the stimulation of the ganglions, while the light falling on the surround decreases it; and (ii) off-center, where it is the other way round. The two organizations provide complementary responses. In the on-center configuration, the ganglion has a strong response if there is strong light on the center and weak light on the surround, and vice versa in the off-center configuration. A superposition of outputs from the two configurations detects both dark and bright discontinuities in the visual field. In both of the configurations, the ganglion response is weak when there is uniform light on both the center and surround regions. Thus, the response from a retinal region of the eye does not depend only on the brightness of light falling on that region but also on the difference of firing rates of photoreceptors in a locality. The biological center-surround vision can be approximated as a Difference-of-Gaussians (DoG) (or a Laplacian-of-Gaussian [LoG]) operator (Blackburn 1993). In this model, the weight with which a bipolar cell contributes to the connected ganglion is assumed to have a Gaussian distribution depending on its distance from the ganglion. The operation results in smoothing (elimination of noise) followed by detection of the regions of highest intensity variation, which generally mark the contours of the objects. This step is deployed as the first pre-processing step in many computer vision applications.
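A minimal Python sketch of the DoG operator follows; the standard deviations of the two Gaussians are illustrative choices of ours, not values from the book:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def difference_of_gaussians(image, sigma_center=1.0, sigma_surround=2.0):
        # A narrow Gaussian models the excitatory center, a wider one the
        # inhibitory surround; their difference is near zero on uniform
        # regions and strongest near intensity discontinuities (contours).
        image = image.astype(float)
        center = gaussian_filter(image, sigma_center)
        surround = gaussian_filter(image, sigma_surround)
        return center - surround

    # Toy image with a bright square: the response peaks along the square's edges.
    img = np.zeros((32, 32))
    img[8:24, 8:24] = 1.0
    response = difference_of_gaussians(img)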


Adjacent center-surround organizations of bipolar cells result in differently oriented “filter banks.” This is schematically shown in Figure 2.6b, where the white elliptical areas represent positive, and the dark ones negative, contributions from the bipolar cells. These filters enable detection of oriented edges. The filters can be at different scales (covering different spatial extents). Further, they can either be symmetric (the two middle ones in Figure 2.6b) or anti-symmetric (the ones on the sides), which enables them to operate at different (spatial) phases. The strongest of the filter outputs are passed on to further stages of processing. Mathematically, they are modeled as wavelet filters and are widely used in digital signal processing and computer vision (Teolis 1998). Moreover, the output is normalized by dividing by the sum of all filter outputs, which attenuates stronger signals, thereby implementing automatic gain control (AGC). The AGC mechanism results in a sublinear response of the ganglions: the output is scaled by the average stimulus in the region. This is modeled by Weber’s law (also known as the Weber–Fechner law) as

Δp = k ⋅ ΔS/S    (2.1)

where p represents the percept and S represents the stimulus. This implies that the response of the visual center-surround operation is amplified more at low illumination levels than at higher ones, which explains the illumination-invariance characteristics of human vision. A nonlinear model of the center-surround operation has been proposed in Vonikakis and Winkler (2016) based on this assumption. A generic model for integrating information from multiple pixels with linear operators is known as convolution. It involves application of a linear filter to every image pixel. Let us denote the pixels of an image I of width W and height H by {I_xy}_{x=1:W, y=1:H}. Let F = {w_xy}_{x,y=−m:+m} represent a square mask of dimension M = 2m + 1 used for the filtering. The convolution of the image with the mask results in an output

I′ = F ∗ I = {I′_xy}_{x=1:W, y=1:H}    (2.2)

where

I′_xy = Σ_{i=−m}^{+m} Σ_{j=−m}^{+m} w_{i,j} ⋅ I_{x−i,y−j}    (2.3)

The width M of the filter is also known as its receptive field, and is generally much smaller than the image dimensions. Convolution of a two-dimensional image with a mask of size M × M integrates information from M² input locations centered at the mask. Linear filtering with convolution filters is extensively used in the initial stages of many common image-processing tasks. The convolutional neural networks (CNNs) that are ubiquitously used for image processing in recent times implement several layers of convolution. We shall discuss more about them in Chapter 8.
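As a concrete illustration, the following Python sketch implements equations (2.2) and (2.3) directly; the function and variable names are ours, and border pixels, where the mask would fall outside the image, are simply left at zero:

    import numpy as np

    def convolve2d(image, mask):
        # Direct implementation of equation (2.3): each output pixel is a
        # weighted sum over an M x M neighborhood, with M = 2m + 1.
        H, W = image.shape
        m = mask.shape[0] // 2
        out = np.zeros_like(image, dtype=float)
        for y in range(m, H - m):
            for x in range(m, W - m):
                acc = 0.0
                for i in range(-m, m + 1):
                    for j in range(-m, m + 1):
                        acc += mask[i + m, j + m] * image[y - j, x - i]
                out[y, x] = acc
        return out

    # A 3 x 3 averaging mask (m = 1) smooths the image.
    img = np.random.rand(8, 8)
    box = np.ones((3, 3)) / 9.0
    smoothed = convolve2d(img, box)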


2.5 Color and Texture Perception

We have already described that there are three types of cones, S, M, and L, which respond differently to light of different wavelengths. The excitation levels of these three types of cones in any locality of the retina, in response to the incident light, give rise to color perception. The perceived color depends on the spectral composition of the light. Different spectral compositions may lead to the perception of the same color if they result in the same excitation levels of the three types of cones. Psychological experiments have shown that human perception of any color can be matched by a combination of three primary colors. This is known as the trichromatic theory of color vision, which was developed by the researchers Thomas Young and Hermann von Helmholtz in the early nineteenth century. Electronic devices, such as computer screens, exploit this theory. They mix three primary colors, red, green, and blue, in various proportions to create the perception of different colors in the human eye. This has led to the development of the red–green–blue (RGB) color model, where any reproducible color is expressed as a linear additive combination of the three primary colors. This model is device-dependent because of differences in the hardware used to produce the three primary colors. To overcome this limitation, a color was defined in a device-independent way by Albert H. Munsell in the early twentieth century, in terms of its three perceptual properties, namely hue (shade), value (lightness), and chroma (color purity). It has been observed that the distance between two colors measured in this three-dimensional space approximately corresponds to the human perception of the difference between the colors. It has later been adapted to other color models like hue–saturation–value (HSV) and CIELAB (Stone 1992) for different applications. While many of the observations regarding color perception can be explained by the trichromatic color theory, it cannot explain certain other phenomena, for example, the perception of complementary “after-images” experienced by a person after a prolonged exposure to a color. An alternative theory of color vision, the opponent process theory, proposed by Ewald Hering in 1892, explains such phenomena better. According to the theory, the neural network connects the photoreceptors in a certain way to distinguish between three opponent color pairs, namely dark versus bright, green versus red, and blue versus yellow. Psychological experiments confirm the existence of such opponent color pairs. For example, it is impossible for a human being to conceive colors composed of opponent colors, like a greenish red or a bluish yellow. Models of the neural interconnections for perceiving such color contrasts are shown in Figure 2.7. Visual texture is the periodic variation of the illumination pattern in a neighborhood of an image. Textures appearing on man-made artifacts are generally deterministic in nature, such as a tiled pavement, a brick-work, hatched lines, and so on.
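Before turning to texture, the opponent color pairs admit a simple computational rendering, of the kind commonly used in computational saliency models; the exact linear combinations below are one common approximation, not the book’s own formulation:

    import numpy as np

    def opponent_channels(rgb):
        # rgb: H x W x 3 array of floats in [0, 1].
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        intensity = (r + g + b) / 3.0      # dark vs. bright
        red_green = r - g                  # green vs. red
        blue_yellow = b - (r + g) / 2.0    # blue vs. yellow
        return intensity, red_green, blue_yellow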


Figure 2.7 Models for opponent color contrasts. (a) Dark–bright contrast. (b) Green–red contrast. (c) Blue–yellow contrast.

On the contrary, textures appearing on natural objects, such as foliage, sand on a beach, etc., are generally stochastic in nature. Mathematical modeling of texture essentially involves computation of the different orders of derivatives of the illumination pattern and their statistical properties. It has been found that the HVS can discriminate between certain textures at the early stage of vision, but some other texture-pairs need more detailed examination. This observation led to the belief that there are some atomic components of texture, like elongated blobs, bars, etc., which the early vision system can detect and discriminate. Early research on texture perception focused on characterizing texture as a structural composition of such atomic elements, called textons (Julesz 1981). While studies in this direction have mostly focused on artificial images, there has been some attention on textons learned from natural images too (Zhu et al. 2005). The texton-based approach to modeling textures has been found to be inconsistent with the results of some psychophysical experiments, leading to the more recent approach of modeling texture in terms of the responses to a set of linear wavelet filters (Malik and Perona 1990; Livens et al. 1997; Sebe and Lew 2000). The texture patterns can be coarse or fine, and can be differently oriented. Generally, a bank of wavelet filters, differently oriented and at different scales, is used to characterize texture (a minimal example follows at the end of this section). Color and texture play important roles in the HVS. Humans can discriminate materials based on their color and surface texture. Further, changes in color and texture provide cues for identification of object boundaries in an image, enabling


image segmentation. The modulation of color and texture over a surface helps in estimating the local surface normals and hence in the construction of 2½D surface models, which in turn gives rise to a perception of depth.
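One simple way to realize such an oriented, multi-scale filter bank is with Gabor filters — sinusoids windowed by a Gaussian envelope. The construction below is a generic illustration; the sizes, wavelengths, and orientations are arbitrary choices of ours:

    import numpy as np

    def gabor_kernel(size, wavelength, theta, sigma):
        # An oriented Gabor filter: a cosine carrier of the given wavelength,
        # rotated by theta, windowed by an isotropic Gaussian envelope.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        x_rot = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
        return envelope * np.cos(2.0 * np.pi * x_rot / wavelength)

    # A small bank: four orientations at each of two scales. The vector of
    # local filter-response energies then serves as a texture descriptor.
    bank = [gabor_kernel(15, wavelength=w, theta=t, sigma=w / 2.0)
            for w in (4.0, 8.0)
            for t in np.arange(4) * np.pi / 4.0]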

2.6 Motion Perception

Motion in a scene is caused either by the movement of one or more objects in the scene, or by the movement of the observer itself (ego-motion). In general, both factors contribute to the perceived motion in a scene. Motion detection and measurement is an important requirement for biological vision systems to model an animated world. The motion in a scene is not directly presented to the eyes, but needs to be computed from temporal variations in the images that are formed on the retina. A time-varying image on the retina can be described by an array I(x, y, t) of intensities. The motion in an image can similarly be described by V(x, y, t). The problem of motion estimation thus boils down to computing V(x, y, t) from I(x, y, t).

Further, motion may either be continuous or discrete. Continuous motion refers to changes in I(x, y, t) that occur smoothly over time. The HVS often perceives discrete changes also as continuous motion, with intermediate location values interpolated; the perception of smooth movement from discrete video frames is an example of such discrete motion. The perception of continuous motion in the HVS has been explained by the intensity-based approach, where motion is computed from local intensity variations in an image. On the other hand, the token-based approach, where motion is estimated from the movement of some à priori detected “features,” such as corner points, has been used to explain the perception of discrete motion (Ullman 1981).

2.6.1 Intensity-Based Approach

The gradient model of motion estimation (Ullman 1981; Hildreth 1984) is an intensity-based approach, where the motion is estimated from the local gradients of the image intensity. At any given point z in an image, the local velocity in the direction of the spatial intensity gradient is given by

$$ v_{\nabla I}(z, t) = - \frac{I_t(z, t)}{\mid \nabla I(z, t) \mid} \tag{2.4} $$

where $I_t(z, t)$ represents the temporal gradient of the local illumination change, and $\mid \nabla I(z, t) \mid$ represents the magnitude of the spatial gradient of the local illumination change. The two values may be estimated by measuring local intensities around the location z over a period of time. This method provides reliable velocity estimates when both the numerator and the denominator are large enough to be reliably measured, and is best applied at object boundaries.
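The computation of Eq. (2.4) can be sketched with finite differences. In the snippet below (Python/NumPy; the two synthetic frames and the smoothing constant eps are illustrative assumptions), the normal flow is estimated at every pixel:

```python
# A sketch of the gradient model of Eq. (2.4): normal flow from
# finite-difference estimates of spatial and temporal gradients.
import numpy as np

def normal_flow(frame0, frame1, eps=1e-6):
    """Velocity along the spatial gradient direction at every pixel."""
    I_t = frame1 - frame0                      # temporal gradient
    I_y, I_x = np.gradient(frame0)             # spatial gradients
    grad_mag = np.sqrt(I_x ** 2 + I_y ** 2)
    # v = -I_t / |grad I|; eps avoids division by zero in flat regions.
    return -I_t / (grad_mag + eps)

# Example: a bright vertical edge shifted right by one pixel.
f0 = np.zeros((8, 8)); f0[:, 4:] = 1.0
f1 = np.zeros((8, 8)); f1[:, 5:] = 1.0
v = normal_flow(f0, f1)
print(v[4, 3:6])  # non-zero response at the moving edge
```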


We have already seen earlier in this chapter that the center-surround organization of the eye, modeled as a LoG operator, provides a mechanism to discover object boundaries in a retinal image. The gradient method cannot be used to determine the velocity in the direction perpendicular to the brightness gradient, i.e. along the contour of an object, for an obvious reason. For example, consider a diamond-like object moving in the horizontal direction as shown by the arrow in the center of the object, and suppose that we are measuring the velocity at a point on one of its edges, using observations in a small circular window (aperture) in its vicinity, as shown in Figure 2.8a. While the motion causes a temporal change in the intensity values in the direction perpendicular to the edge (shown by the solid arrow), there is no such change in the direction along the edge (shown by the dotted arrow), and the velocity component in that direction remains undetected. This is known as the aperture problem. The overall motion of an object in 2D can be estimated from the perceived motion at various points on the contour, as shown in Figure 2.8b, which requires a higher level integration process. Results of such integration can sometimes be ambiguous. While one of the reasons for the ambiguity is the loss of information arising out of projecting a 3D world on a 2D retinal image, there can be other reasons too. For example, consider a deformable string moving in 2D space as shown in Figure 2.8c, where the motion of the marked point on the string is ambiguous when measured with intensity cues. Additional domain-specific constraints are needed to resolve such ambiguities.

2.6.2 Token-Based Approach

The token-based method assumes that some tokens, or distinguishable points, in a scene have been identified, and the motion is measured by the changes in the locations of the tokens over time. For example, the movement (with rotation and distortion) of some text across a time interval is shown in Figure 2.9; a few of the tokens are depicted with markers.

Figure 2.8 Illustrating the aperture problem, based on Hildreth (1984). (a) Motion at a point along the edge cannot be estimated. (b) Overall motion can be estimated by integrating motion at different points on the object contour. (c) Illustrating ambiguity in motion estimation.


Figure 2.9 Token-based motion estimation. Movement of three points shown in the diagram.

Motion of the text can be uniquely determined from such identified tokens across the image. In general, different parts of an image at different times may have different motion parameters; modeling these is known as optical flow (Horn and Schunck 1981). Uniquely identifying the tokens on the retinal image, and establishing their correspondence across movements, is not a trivial issue. The visual properties of a point may change significantly during movement for several reasons, such as the reflectance properties of the surfaces, distortions, and so on. Computer vision algorithms generally use robust feature points (e.g. speeded up robust features (SURF)) for such identification, but confusion over the identity of the points may still prevail. Additional domain-specific constraints, such as those of epipolar geometry, need to be imposed to estimate the motion in such cases. Such constraints may occasionally lead to incorrect motion perception and hence to illusions. For example, the illusory rotation of a spoked wheel in the reverse direction (especially under stroboscopic light) is explained by an implicit constraint of minimum angular movement of the spokes between two glimpses.
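As an illustration of the token-based approach, the sketch below uses OpenCV; ORB keypoints stand in for the "tokens" (ORB is used here in place of SURF, which is patent-encumbered and unavailable in many OpenCV builds), and the frame file names are placeholders:

```python
# A sketch of token-based motion estimation: match keypoints across
# two frames and fit a global 2D similarity transform with RANSAC.
import cv2
import numpy as np

def estimate_motion(frame0, frame1):
    """Estimate a 2D similarity transform from matched keypoints."""
    orb = cv2.ORB_create()
    kp0, des0 = orb.detectAndCompute(frame0, None)
    kp1, des1 = orb.detectAndCompute(frame1, None)
    # Cross-checked Hamming matching reduces identity confusion
    # between tokens, but cannot eliminate it entirely.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des0, des1)
    pts0 = np.float32([kp0[m.queryIdx].pt for m in matches])
    pts1 = np.float32([kp1[m.trainIdx].pt for m in matches])
    # RANSAC imposes a global consistency constraint on the tokens.
    M, inliers = cv2.estimateAffinePartial2D(pts0, pts1)
    return M  # 2x3 matrix: rotation, scale, and translation

f0 = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
f1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
print(estimate_motion(f0, f1))
```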

2.7 Peripheral Vision

Peripheral vision refers to processing of the visual signals received through the peripheral region of the eyes. While there is no hard and fast boundary between


foveal and peripheral vision, the latter generally refers to the visual field beyond 2–2.5∘ from the center of the eye (Strasburger et al. 2011). Peripheral vision gathers a wide-angle overview of the scene, though with a lower resolution than foveal vision. At the far corners of the eye, where the photoreceptors are mostly rods, peripheral vision is in black and white. Further, the overlap between the peripheral regions (up to about 60∘) of the two eyes gives rise to stereoscopic vision, which we shall discuss in Chapter 4. While foveal vision has been extensively studied, research on peripheral vision has been relatively sparse, and many of its mysteries remain unexplored.

Early psychophysical experiments in the mid-nineteenth century showed that the acuity of visual images decreases linearly with the distance from the central visual region, a fact that is explained by the progressively decreasing density of rods in the peripheral region (see Figure 2.3b). This means that an equal volume of neurons covers more and more visual area as we move toward the periphery. Conversely, equal visual areas, of say 1∘ in diameter, are represented by progressively fewer neurons as the eccentricity increases, and hence carry less information. It has been established that the minimum size of an object that can be discerned with peripheral vision increases linearly with the eccentricity. This phenomenon is known as cortical magnification. However, the loss of acuity does not simply result in a blur, as happens with pooling³ or with the retention of only the lower coefficients in the frequency domain. Instead, it results in a complex synthesis of the texture of a region. It has been observed that some distinctive textures of the objects on the retinal image are retained, while there may be some disparity regarding their locations. This is consistent with the purpose of peripheral vision, since the discovery of these textures with peripheral vision guides foveal vision to be redirected to candidate locations in a visual search problem.

It has been hypothesized that a local patch in the peripheral area is processed by a group of neurons, and is represented by some summary statistic of a fixed size (Rosenholtz et al. 2012). The size of a local patch increases with the eccentricity. If more than one object crowds within a local patch, the statistic represents a synthesis of some texture properties of all the objects, in which case an individual object in that patch may not be discernible. However, if the same object is presented alone in the same patch, it can still be recognized. For example, the left half of Figure 2.10 depicts an Indian miniature painting showing the marriage ceremony of an Indian prince. It is impossible to discern the human figure with a halo on his head (the prince) in the image while focusing on the cross-hair at the center, because of crowding in the peripheral region. When the figure is isolated, as in the right half of the diagram, there is no difficulty in discerning it, though it is presented with the same size and at the same distance from the center of foveal vision (the cross-hair).

3 Pooling refers to the illumination values of some adjacent pixels being represented by a single number, usually the average or the maximum.


Figure 2.10 Illustrating the effect of crowding. Source: File shared by Nomu420 through Wikimedia Commons, file name: Miniature_painting_showing_the_marriage_procession_of_Dara_Shikoh_National_Museum_India.jpg.

Bouma's law suggests that there needs to be a critical gap between objects (rather, object parts) for them to be recognized through peripheral vision, and that gap is approximately 0.4–0.5 times the eccentricity (Rosen et al. 2014). The phenomenon of crowding also explains the asymmetric difficulty levels encountered in the visual search problem, e.g. when searching for an "O" amidst distractors "Q", and vice versa, which cannot be explained by feature integration theory. An alternate hypothesis for visual search postulates that peripheral vision guides the eye movement during the search. A local patch in the peripheral region is likely to contain multiple objects: either distractors only, or the target with some distractors. The difference between the texture representations of "distractor-only" and "target+distractor" patches helps in guiding the foveal vision to the target location. The distinction between the textural patterns of these patches depends on the target and the distractor patterns, which explains the asymmetry in the visual search process. We shall discuss more on the role of peripheral vision in guiding visual attention in Chapter 5.

There have been some attempts to simulate the texture representation of peripheral vision. A method for computing summary statistics that can explain the crowding effect has been presented in Balas et al. (2009). A local patch in peripheral vision is modeled as a "texture tile" in Rosenholtz et al. (2012).


A generative neural network based online simulation of peripheral vision following the texture tile model has been reported in Fridman et al. (2017).

Peripheral vision has less acuity and is less efficient than foveal vision. It is therefore not surprising to find central vision to be more efficient than peripheral vision for object recognition, in particular for face recognition (Wang and Cottrell 2016). But peripheral vision has been found to be more accurate than foveal vision in scene recognition (Loschky et al. 2019). Integration of foveal and peripheral vision is crucial for artificial vision systems. Robots often use side-looking or omni-directional cameras for navigation, movement detection, etc., using low-cost image processing.

2.8 Conclusion

In summary, the analysis of the images by the neural cells in the eyes results in the perception of boundaries between regions of homogeneous illumination, or edges. The temporal variations in the sequence of images received by the eye give rise to the perception of motion. The segmentation of images in space and time results in the detection of object contours from the visual signals. Many image and video processing techniques also use edge and motion detection through the analysis of local contrasts as their first steps. The patterns discovered at this stage, in both biological and artificial vision, are fragments, which need to be stitched together and holistically interpreted for the perception of the environment and cognition. We shall discuss those processes in Chapter 4. Peripheral vision remains a relatively unexplored topic with many unsolved mysteries. Nevertheless, it is considered to be extremely important for a vision system, as it enables an "overview" of the environment with low computational investment.


3 Bayesian Reasoning for Perception and Cognition

The goal of perceptual and cognitive processes is to discover the semantic state of the world from the received sensory signals. For example, an autonomous driving agent needs to perceive a threat to safety when its visual sensors capture a moving object, perhaps a pedestrian, on the road. The process involves discovering unknown facts (the world state) from a set of known facts (sensory data) with some background knowledge, which is called reasoning. It is a crucial capability for a cognitive agent to interact successfully with a dynamic world. Many theories have been proposed to explain the reasoning involved in the perceptual and cognitive processes; Bayesian theory has proved to be the most satisfactory one (Knill and Richards 1996). The goal of this chapter is to introduce the Bayesian framework of reasoning and the methodology built around it.

We begin the chapter with a comparative study of the various reasoning paradigms. Subsequently, we illustrate the statistical properties exhibited in nature, which justify a statistical formulation of the computational theory of reasoning. This is followed by a detailed presentation of the Bayesian reasoning theory. Subsequently, we introduce Bayesian networks (BN) and dynamic Bayesian networks (DBN), which are powerful tools to support Bayesian reasoning in static and dynamic environments. Then, we introduce a method for estimating the parameters of a Bayesian network, which serves as its prior world model. This is followed by a discussion on the complexity of the models and prior probabilities. Going further, we introduce hierarchical Bayesian models (HBM), which provide the capability for a cognitive agent to generalize the reasoning process and adapt it to new situations. This is followed by several examples of such generalization that can be accomplished with hierarchical Bayesian models. Finally, we conclude the chapter with a critical review of the Bayesian reasoning paradigm.



Explanations of specific perceptual and cognitive capabilities with the Bayesian reasoning framework are deferred to Chapter 4.

3.1 Reasoning Paradigms

There are three distinct paradigms of human reasoning. Deductive reasoning results in the discovery of new facts that are implied by a set of known facts. In this paradigm, the model of the world is represented by a set of facts, or assertions, such as "all birds can fly" and "all parrots are birds." Logical reasoning with the assertions can lead to the discovery of a hitherto unknown fact, for example "all parrots can fly." This form of reasoning is said to be valid, meaning that if the premises (given facts) are true, the consequence (inferred fact) must necessarily be true. Though powerful in ascertaining facts, this mode of reasoning cannot cope with noisy data received from visual and other sensory organs. The reasoning system breaks down if a premise variable is unknown, or is incorrectly specified.

On the contrary, abductive reasoning (also known as evidential reasoning) assumes a model of the world where several hypotheses compete with each other to explain the presented facts (observations). The hypothesis that best explains the observations is chosen as the inference. An important point to note is that the hypothesis need not be consistent with every piece of presented data. For example, doctors diagnose a disease based on what explains most of the reported symptoms and test reports of a patient. This form of reasoning is not valid, but it is robust against incomplete and noisy information, such as some unreported or misreported symptoms.

Yet another form of reasoning results in the generation of new knowledge at a higher level of abstraction. For example, after seeing a few instances of birds of different kinds, say parrots, crows, and peacocks, an observer creates a notion of a general concept of a "bird," and attributes some common properties, like "has beak," "can fly," etc., to the concept. The concept of bird and its properties constitute new knowledge. For a new type of bird, say a sparrow, which is yet unseen, one tends to believe that it too has a beak and that it can fly, facts that are not implied by the earlier observations. This process is known as inductive generalization, and has been shown to be a special case of abduction (Josephson 2000). Inductive generalization helps an intelligent agent adapt to a new, unforeseen environment.

From the above discussion, it follows that abduction and induction are the preferred modes of reasoning with visual (and other sensory) data. The Bayesian framework provides a formal mechanism for the same in a statistical setup.


3.2 Natural Scene Statistics

Natural scenes¹ are found to be characterized by strong statistical regularities. For example, when many images of a class of objects, like the human face, are superimposed on each other, the average does not melt into a homogeneous background, but shows distinctive and recognizable contour patterns (Torralba and Oliva 2003). It is believed that the human eyes have adapted to the statistics of natural scenes during the course of evolution, and that this has been the key to robust vision despite noisy image data (Geisler 2008). The statistical regularity exhibited in natural scenes has been exploited to model vision as a process of statistical interpretation. It essentially involves probabilistic matching of statistical patterns for anticipated outcomes with observed visual data.

1 Natural scenes refer to images of visual scenes captured with devices operating in the range of the visual spectrum. They do not include, for instance, text images, computer graphics, animations, paintings, cartoons, X-ray images, and so on.

We illustrate the role of natural scene statistics in human vision with a simple example, shown in Figure 3.1. The flowers in the image can be characterized by a probabilistic distribution of features like luminosity, texture, and contour descriptors, which is à priori known to the human vision system. This prior knowledge enables mental completion of the image of the flower that is partly occluded by a dark spot.

Figure 3.1 Illustrating natural scene statistics.

There are a few consequences of the statistical regularity found in natural images. An image is a two-dimensional array of illumination values, where the illumination value at a spatial location can assume any value within some permissible range. Thus, a digital gray-scale image of width w and height h (measured in pixels) can be represented by a point in a (w × h)-dimensional space, where each of the dimensions can have 256 permissible values.


28

3 Bayesian Reasoning for Perception and Cognition

where each of the dimensions can have 256 permissible values. If we consider populating the space with all possible natural images of that size, much of the space will remain empty. The property of statistical regularity results in an extremely sparse distribution of the natural images in the permissible space, much like the liquid part in a froth that spreads the entire volume (Ruderman 1994). The sparse distribution is an indicator of high degree of redundancy in the pixel-based representation of an image. This justifies computer vision tasks to use lower dimensional feature-based representations to characterize an image which results in a drastic reduction in data-volume. Typically, an image is characterized by continuous homogeneous areas with interspersed singularities that mark the object contours. As we have seen in Chapter 2, the center-surround organization of retinal neurons have evolved to detect such singularities. The detected contour fragments form a compact description of the image, which is conveyed to the brain as the primary stimulus for further analysis. This motivates analysis of images with multi-scale wavelet filters, and characterization of image contents as energy spectra, in different spatial frequency sub-bands, and for different spatial orientations. The local statistics of these descriptors represents objects, and the global statistics characterizes the scenes. The histograms of normalized mean-subtracted energy distributions in natural images have been found to be distinct from Gaussian distributions, which disproves natural images to be random phenomena. It has also been seen that the power-spectrum in a natural image depends on the orientation of wavelet filters. In general, there are more vertical and horizontal edges than the oblique ones in natural scenes, making the human eye more adapted to the former. This adaptation is reflected in human preferences in aesthetic assessments. It is demonstrated in more vertical and horizontal edges in scenes containing man-made objects (say, buildings) than their natural counterparts (say, forests). It has been demonstrated that it is possible to characterize and distinguish between different classes of man-made and natural scenes, such as cityscapes and natural landscapes, with the global statistics of oriented energy spectrum (Torralba and Oliva 2003).

3.3 Bayesian Framework of Reasoning

Traditionally, there have been two competing approaches to interpreting observations. The model-based top-down approach emphasizes inherent structured knowledge in the human mind, also known as intuitive theories specific to domains. It comprises a taxonomy of concepts, causal relations between them, structural constraints, and so on. The model is considered to be rigid, and interpretation involves matching the models with the observed data.


On the contrary, the bottom-up data-driven approach emphasizes learning the concepts with statistical mechanisms of pattern recognition and inference, such as similarity, association, and correlation. The model-based theories are generally too complex to formalize for making quantitative predictions in nontrivial domains. On the other hand, the theories of statistical learning cannot explain the human capability of inferencing from sparse and often noisy sensory data, and the process of inductive generalization. The Bayesian framework of reasoning blends statistical mechanisms with background knowledge and provides a robust computational model of cognition (Tenenbaum et al. 2006; Perfors et al. 2011). It assumes prior models, but they are not rigid. In the Bayesian framework, the prior models are updated with observed data in a probabilistic setting. It also supports the acquisition of high-level knowledge from observed data (meta-learning), which enables inductive generalization.

In the model-based approach, an agent that needs to interpret some sensory signals d forms a model of the world in terms of a set of alternate hypotheses $\mathcal{H} = \{h_1, h_2, \ldots, h_n\}$. The goal of the agent is to establish the world's configuration h based on the observed data d. The process of interpretation assumes a mapping between a configuration of the world h and the expected sensory data, $h \rightarrow d^*$. In an ideal situation, when a hypothesis is true, the observed data should match its expectation, i.e. $d = d^*$. But in the real world, it seldom happens! The received sensory data can be different from their expected values for two primary reasons: (i) inaccuracy or errors in the sensory measurements, and (ii) the inherent variations in the environmental configuration. Thus, the interpretation process needs to work with uncertainties and can, at best, produce a belief about the environmental configuration.

In the Bayesian theory of perception, the hypotheses about the configurations (models) of the world are characterized with some prior beliefs. The beliefs get revised based on the received sensory data. The configuration that assumes the maximal posterior belief is accepted as the interpretation, i.e. the perceived configuration of the world. Formally, we can express it as a likelihood maximization problem as

$$ h^* = \arg\max_{h_i \in \mathcal{H}} P(h_i \mid d) \tag{3.1} $$

where $h^*$ represents the inference. Using the Bayes formula, the probability of a hypothesis $h_i$ based on some observed data d is given by

$$ P(h_i \mid d) = \frac{P(h_i) \cdot P(d \mid h_i)}{P(d)} \tag{3.2} $$

where the denominator $P(d) = \sum_k P(d \mid h_k) \cdot P(h_k)$ represents the marginal probability of the data d, which is independent of any hypothesis. It can be viewed as a normalizing constant that makes $\sum_i P(h_i \mid d) = 1$. Thus, the posterior probability $P(h_i \mid d)$ is proportional to the product of the two terms in the numerator. This equation shows reasoning as a process of belief revision.


The prior belief in the hypothesis $P(h_i)$ gets updated by the factor $P(d \mid h_i)$ to generate the revised posterior belief $P(h_i \mid d)$ on observing the data d. This signifies perception as a process of integration of top-down conceptual knowledge (prior belief) and bottom-up sensory (observed) data, a view that has been subscribed to by some psychologists for a long time.

At this juncture, it is interesting to examine the relationship between percepts and visual signals in the Bayesian framework. The percepts (hypotheses) are not aggregates of the visual signals (observed data), but are holistic emergent entities. Similar relations hold good at higher level cognitive processes as well. Further, it is interesting to note that perception is a subjective process in the Bayesian theory, as the inference depends on the prior beliefs $P(h_i)$ and $P(d \mid h_i)$ of the observer. Since the theory essentially involves hypothesis testing, it follows that a configuration of the world can be perceived only if it is included in the prior model. Simply stated, unless one knows about tomatoes and their visual properties, one cannot perceive a tomato in an image. This subjectivity in inferencing explains the individual differences in perception. The outcome of Bayesian reasoning can be reliable only if the priors and the conditionals are veridical.

An agent is generally not interested in the absolute posterior probability of a hypothesis. Instead, when confronted with a situation (data), it needs to weigh the relative likelihoods of two or more possible hypotheses and choose the most likely one. The posterior odds of a hypothesis $h_i$ over another $h_j$ ($h_i, h_j \in \mathcal{H}$) are given by

$$ \text{odds}(h_i, h_j) = \frac{P(h_i \mid d)}{P(h_j \mid d)} = \frac{P(h_i)}{P(h_j)} \cdot \frac{P(d \mid h_i)}{P(d \mid h_j)} \tag{3.3} $$

or, in terms of logarithms,

$$ \text{log-odds}(h_i, h_j) = \log \frac{P(h_i \mid d)}{P(h_j \mid d)} = \log \frac{P(h_i)}{P(h_j)} + \log \frac{P(d \mid h_i)}{P(d \mid h_j)} \tag{3.4} $$

A hypothesis $h_i$ is more likely to be true than $h_j$ if $\text{odds}(h_i, h_j) > 1$ (or, if $\text{log-odds}(h_i, h_j) > 0$), and vice versa.

Let us illustrate the significance of prior knowledge in this Bayesian model of inferencing with a simple example. Assume that a patient approaches a doctor with a symptom of headache, which can be caused by either an inflammation of the brain tissue or a brain tumor. The doctor needs to weigh the likelihoods of the possible causes and prescribe a suitable course of treatment. The problem may be modeled using Eq. (3.4), where $h_i$ and $h_j$ represent the two alternative hypotheses of tumor and inflammation, respectively, and d represents the observation of headache. Assume that past clinical records suggest that 80% of patients with tumor have reported the symptom of headache and 60% of the patients with inflammation have done so, i.e. $P(d \mid h_i) = 0.8$ and $P(d \mid h_j) = 0.6$.


A naïve doctor may assume that the two hypotheses, inflammation and tumor, are equiprobable, i.e. $P(h_i) = P(h_j) = 0.5$. In this case, the log odds of tumor against inflammation will be $\log\left(\frac{0.5}{0.5}\right) + \log\left(\frac{0.8}{0.6}\right) \approx 0.125$, which leads to a stronger posterior belief in tumor than in inflammation. On the other hand, a doctor may know from experience that a brain tumor is very rare, while an inflammation may be quite common, and assign the prior probabilities as $P(h_i) = 0.1$ and $P(h_j) = 0.9$. With these assumptions, the log odds of tumor against inflammation are given by $\log\left(\frac{0.1}{0.9}\right) + \log\left(\frac{0.8}{0.6}\right) \approx -0.83$, which makes the doctor believe that inflammation is the more probable cause of the observed headache. The numbers used in this example are fictitious and may be far from reality, but they illustrate how the prior beliefs may influence the inference in a Bayesian reasoning system for the same observed data.

Further, the data d is often an aggregate of elementary and mutually independent data items, e.g. different symptoms of the diseases, like fever, headache, nausea, giddiness, etc. In such cases, we can express d as $d = \{d_1, d_2, \ldots, d_n\}$, when we can write $P(d \mid h) = \prod_{k=1}^{n} P(d_k \mid h)$. The posterior odds of $h_i$ over $h_j$ can now be written as

$$ \frac{P(h_i \mid d)}{P(h_j \mid d)} = \frac{P(h_i)}{P(h_j)} \cdot \prod_{k=1}^{n} \frac{P(d_k \mid h_i)}{P(d_k \mid h_j)} \tag{3.5} $$

or, in logarithmic form, as

$$ \log \frac{P(h_i \mid d)}{P(h_j \mid d)} = \log \frac{P(h_i)}{P(h_j)} + \sum_{k=1}^{n} \log \frac{P(d_k \mid h_i)}{P(d_k \mid h_j)} \tag{3.6} $$

The equation proves to be useful, since it is generally easier to model the statistical dependency of an elementary data item $d_k$ on a hypothesis rather than that of a cohort of data d. Further, a robust belief in a hypothesis can be formed with partial data, i.e. with a subset of the data items $\{d_k\}_{k=1:n}$, and the impact of a few erroneous data items on the inference is also likely to be minimal. These properties of Bayesian reasoning are particularly useful for inferencing from visual and other sensory data, where the inputs are generally redundant and inherently noisy (Ghosh and Chaudhury 2004). Moreover, when $(d_1, \ldots, d_n)$ represent a temporal sequence of data, it is possible to incrementally revise the belief in the hypotheses by sequentially combining the evidences (Bonawitz et al. 2014). In such cases, the posterior from the earlier observations $(d_1, d_2, \ldots, d_{k-1})$ forms the prior for the current observation $d_k$. If the observation matches the prediction by the prior belief, the latter is reinforced; otherwise, the prior beliefs are revised and the world model is updated to enable better predictions for the future. This is known as predictive coding, and is the fundamental approach to adaptation by learning agents.
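The doctor's computation above, and its multi-symptom generalization in Eqs. (3.5) and (3.6), are easy to sketch. In the snippet below (Python; the helper name log_odds is ours, not from the text), the same fictitious numbers reproduce the two log-odds values:

```python
# A sketch of the log-odds computation of Eqs. (3.4)-(3.6), using the
# fictitious tumor/inflammation numbers from the text.
import math

def log_odds(prior_i, prior_j, likelihoods):
    """Posterior log-odds of h_i over h_j (base-10 logarithms).

    likelihoods is a list of (P(d_k | h_i), P(d_k | h_j)) pairs for
    mutually independent data items d_k.
    """
    total = math.log10(prior_i / prior_j)
    for p_di, p_dj in likelihoods:
        total += math.log10(p_di / p_dj)
    return total

# Equal priors: the headache evidence slightly favors tumor (~0.125).
print(log_odds(0.5, 0.5, [(0.8, 0.6)]))
# Realistic priors: inflammation wins despite the evidence (~-0.83).
print(log_odds(0.1, 0.9, [(0.8, 0.6)]))
```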


3.4 Bayesian Networks

Bayesian theory of reasoning models real-world concepts (an object or an event) as stochastic variables, whose probabilities of occurrence depend on each other. Inferencing involves updating the probabilities of the variables based on the observed data. Let us assume that the real-world concepts are represented with a set of stochastic variables $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$. In general, the specification of a joint probability distribution across n stochastic variables requires $\prod_{i=1}^{n} k_i - 1$ parameters, where $k_i$ represents the number of possible states of the variable $X_i$.² In most real-life problems, we deal with a large number of variables, which results in a large set of parameters, making the computations intractable. Besides, many of the combinations may be rare, making the corresponding joint probability values very small. It is impossible for experts to specify the overwhelmingly large number of small probability values, especially when some rare combinations are never observed even in a fairly large dataset.

Fortunately, in most real-life problems, only a handful of variables depend on each other, and it is sufficient to model those dependencies. This results in a drastic reduction in the number of parameters to deal with. For example, in a medical diagnosis problem, while the symptoms depend on the diseases, they can generally be considered to be independent of each other.³ Graphical models provide a convenient framework for working with the joint probability distributions in such cases. In a graphical model, the stochastic variables are represented as the nodes of a graph. The pairs of variables that are dependent on each other are connected with edges. The values associated with the edges quantify the dependencies. Generally, the graph is sparse, reflecting the uncorrelated nature of many of the variables. The edges in the graph can be either undirected or directed, representing the symmetric or asymmetric nature of the relations between the nodes. A comprehensive treatise on probabilistic graphical models is available in Koller and Friedman (2009).

In this section, we introduce one of the graphical tools, Bayesian networks (BN) (Pearl 2014), which provides a convenient way of modeling the Bayesian reasoning framework. It assumes asymmetric causal relations between the variables; for instance, a disease causes a symptom, but not the other way round. A BN is a directed acyclic graph, where the directionality of an edge signifies a causal dependency between the two entities represented by its source and destination nodes. Figure 3.2 depicts a Bayesian network that captures the causal relations between a couple of diseases, some of their symptoms, and some external factors that are possible causes of the diseases.

2 For binary concepts, i.e. when each of the events can assume one of two possible states, say present and absent, the number becomes $2^n - 1$.

3 There can be exceptions, like fever inducing nausea, etc.


Figure 3.2 A Bayesian network depicting causal relations between some external factors, diseases, and symptoms.

The figure is by no means complete; we use it to illustrate the properties and the principles of reasoning with a Bayesian network.

The Bayesian network in Figure 3.2 is organized in three layers: the middle layer consists of a set of variables representing diseases, the lower layer consists of variables that represent some symptoms, and the uppermost layer comprises variables that represent some external factors that are possible causes for the diseases.⁴ The causal parent–child relations in a Bayesian network lead to a canonical factorization of the full joint probability distribution. In a Bayesian network, the probabilities of a set of child nodes $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$ depend only on the conditional probabilities of the child nodes given the parent nodes, and the probabilities of the parent nodes. If $Pa(\mathcal{X})$ represents the set of parent nodes of $\mathcal{X}$, we can write

$$ P(X_1, X_2, \ldots, X_n) = P(X_1, X_2, \ldots, X_n \mid Pa(\mathcal{X})) \cdot P(Pa(\mathcal{X})) \tag{3.7} $$

This factorization essentially implies conditional independence. This means that, given the states of the disease nodes, the symptom nodes are conditionally independent of the nodes representing external factors. In other words, if the states of the disease variables are known, the probabilities of occurrence of the symptoms are solely dependent on them and not on the external factors.

4 In general, a Bayesian network need not be organized in such layers and can depict more complex dependencies.


However, if the states of the diseases are not known, the probabilities of occurrence of the symptoms are guided by the external factors that a patient may have been subjected to. There are a few other configurations in a BN that lead to conditional independence of variables. A node is said to be d-separated from another if an update in the belief of one does not affect the belief of the other.

In this network, the disease nodes represent latent variables, i.e. they cannot be observed and need to be inferred. On the other hand, the nodes representing the external factors and the symptoms can be observed. The reasoning in a Bayesian network may take one of two forms. A doctor may assess the risks of the diseases in a patient from the external factors that he has been subjected to. The probabilities of the diseases can be computed from the external factors with Eq. (3.7), when the set $\mathcal{X}$ represents the diseases and $Pa(\mathcal{X})$ the external factors. This mode of reasoning is called causal reasoning. On the other hand, the doctor may diagnose the disease from the observed symptoms. In this case, $\mathcal{X}$ represents the symptoms and $Pa(\mathcal{X})$ the diseases, and the Bayes formula can be used to express the posterior probabilities $P(Pa(\mathcal{X}) \mid \mathcal{X})$ as

$$ P(Pa(\mathcal{X}) \mid \mathcal{X}) = \frac{P(\mathcal{X} \mid Pa(\mathcal{X})) \cdot P(Pa(\mathcal{X}))}{P(\mathcal{X})} \tag{3.8} $$

which is a product of the conditional probabilities $P(\mathcal{X} \mid Pa(\mathcal{X}))$ and the prior probabilities $P(Pa(\mathcal{X}))$, normalized with the constant $P(\mathcal{X})$. This mode of reasoning represents evidential reasoning or abductive reasoning. Often, the two modes of reasoning are combined in practical scenarios, e.g. when a doctor considers the symptoms as well as the prevailing external factors in his diagnosis. Several techniques for belief revision (computing posterior probabilities) have been proposed for nontrivial Bayesian networks, of which methods based on local message passing (Pearl 2014) are quite popular. Partitioning of the network into independent modules often simplifies probability computations.
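The evidential mode of Eq. (3.8) can be sketched for a miniature version of the network of Figure 3.2. In the snippet below (Python; the structure is reduced to one symptom and three mutually exclusive disease states, and all probability values are invented for illustration), the posterior over diseases given a headache is computed by brute-force enumeration:

```python
# A sketch of evidential (abductive) reasoning in a miniature
# disease -> symptom network. All numbers are illustrative.
priors = {"tumor": 0.01, "inflammation": 0.20, "healthy": 0.79}
p_headache_given = {"tumor": 0.8, "inflammation": 0.6, "healthy": 0.1}

def posterior_given_headache():
    """P(disease | headache) via Bayes' rule with enumeration."""
    joint = {d: priors[d] * p_headache_given[d] for d in priors}
    evidence = sum(joint.values())          # P(headache), the normalizer
    return {d: joint[d] / evidence for d in joint}

for disease, p in posterior_given_headache().items():
    print(f"P({disease} | headache) = {p:.3f}")
```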

3.5 Dynamic Bayesian Networks

A Bayesian network represents the state of a system at a certain point of time. The representation has been extended to dynamic Bayesian networks (DBN) for dynamic systems (Mihajlovic and Petkovic 2001; Murphy 2002). A DBN is motivated by, and is a generalization of, the hidden Markov model (HMM), which provides a mechanism to model time-series data. While an HMM models the state of the world with one hidden state variable, a DBN models the state of the world at time t with a set of n variables $\{X_t^1, X_t^2, \ldots, X_t^n\}$, where each $X_t^i$ can assume some discrete values. As in a BN, some of the variables are observable, and some are hidden.


The variables at any given point of time may have causal dependencies on each other, which can be represented with a Bayesian network. Besides, the state of the network at any point in time depends on its earlier states. The term "dynamic" refers to the system states and not to the structure of the network, which remains unchanged over time. Like an HMM, a DBN also assumes the first-order Markov property: the state of the network at time t directly depends only on the immediate past, i.e. its state at time t − 1, and not on the earlier states. Figure 3.3 depicts three consecutive time slices of a DBN. The arrows connecting nodes across time points (dashed lines in the diagram) represent the causal dependency of a variable $X_t^i$ at time t on another $X_{t-1}^j$ at the previous time point.

We assume the system to be initialized at t = 0. In order to explore the system behavior at time t = T, we need to consider the sequence of the states of its variables $X_{0:T}^{1:n}$ during the interval t = (0:T). Let us denote the parents of a variable $X_t^i$ at time t by $Pa(X_t^i)$, which may lie either in the current or in the previous time-slice. The joint probability distribution of the states of the variables during the period is given by a product of the prior and the conditional probabilities as

$$ P(X_{0:T}^{1:n}) = \prod_{i=1}^{n} P(X_0^i \mid Pa(X_0^i)) \cdot \prod_{i=1}^{n} \prod_{t=1}^{T} P(X_t^i \mid Pa(X_t^i)) \tag{3.9} $$

where the first term represents the prior probability distribution of the variables (at t = 0), and the second represents the conditional probability distributions within and across time-points. The reasoning in a DBN involves estimation of the probabilities of the hidden variables from the priors at t = 0 and the states of the observed variables during the period t = 0:T, based on this equation. In general, it involves a combination of causal and evidential reasoning (forward and backward propagation of beliefs, respectively). Further, the conditional probabilities that span across time-points in a DBN may either be constants or functions of time. Belief propagation may prove to be intractable for a DBN from the generic expression provided above, and the constraints imposed by specific network topologies are generally exploited. A small forward-filtering sketch in code follows Figure 3.3.

Figure 3.3 A dynamic Bayesian network.
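As a toy illustration, consider the simplest DBN topology, a single hidden chain (i.e. an HMM). The following sketch (Python/NumPy; all probability tables are illustrative assumptions) performs the forward pass implied by Eq. (3.9), alternating a causal prediction step with an evidential update:

```python
# A sketch of belief propagation over time in a single-chain DBN:
# forward filtering computes P(state_t | observations so far).
import numpy as np

prior = np.array([0.5, 0.5])            # P(X_0) over two hidden states
transition = np.array([[0.9, 0.1],      # P(X_t | X_{t-1})
                       [0.2, 0.8]])
emission = np.array([[0.7, 0.3],        # P(obs | X_t), two observation symbols
                     [0.1, 0.9]])

def forward_filter(observations):
    """Sequentially revise beliefs: predict with the transition model,
    then weight by the likelihood of each new observation."""
    belief = prior.copy()
    for obs in observations:
        belief = transition.T @ belief          # causal (forward) step
        belief = belief * emission[:, obs]      # evidential update
        belief = belief / belief.sum()          # normalize
    return belief

print(forward_filter([0, 0, 1]))  # posterior over the hidden states
```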


3.6 Parameter Estimation

The parameters of a Bayesian network represent a set of prior beliefs. These prior beliefs can either be based on some domain knowledge or computed from observed statistics. Statistical likelihood estimation models rely solely on the latter. For example, if the clinical records reveal that h patients have shown the symptom of headache in a population of t patients with brain tumors, the prior probability of a tumor patient having headache can be specified as $\frac{h}{t}$. This is known as maximum likelihood estimation. But this estimate may not be reliable when the population size t is small, or is not a representative one. On the other hand, specification of prior beliefs based on a domain model is generally extremely difficult. For example, quantification of the probability of headache for brain-tumor patients may be impossible from the knowledge of medical sciences.

Bayesian theory provides a reliable method to estimate the model parameters by combining a knowledge-based prior distribution model and observed data. In this approach, the probability of headache given tumor is treated as a random variable $\theta$ in the range [0, 1], and the probability distribution function of $\theta$, given some data d, is given by

$$ p(\theta \mid d) = \frac{P(d \mid \theta) \cdot p(\theta)}{P(d)} \tag{3.10} $$

where $P(d) = \int_0^1 P(d \mid \theta) \cdot p(\theta) \cdot d\theta$.⁵ In the absence of any prior information on $\theta$, we may assume it to follow a uniform distribution, i.e. $p(\theta) = 1$ for all values of $\theta$, though other assumptions are also possible. Let us assume that the data d refers to the observation of h patients with headache in a sampled population of t patients with brain tumor. According to Bernoulli's theorem, $P(d \mid \theta) = \theta^h \cdot (1 - \theta)^{t-h}$. With these assumptions, we can write

$$ P(d) = \int_0^1 \theta^h \cdot (1 - \theta)^{t-h} \cdot d\theta = \frac{h! \, (t - h)!}{(t + 1)!} \tag{3.11} $$

Substituting in Eq. (3.10), we get

$$ p(\theta \mid d) = \frac{(t + 1)!}{h! \, (t - h)!} \, \theta^h (1 - \theta)^{t-h} \tag{3.12} $$

Thus, Bayesian theory does not predict a single value of the probability, but a probability distribution. We have plotted a few such distributions for different values of t and h in Figure 3.4. The modes of the posterior probability density functions gradually lean toward higher values as larger proportions of patients are observed with the symptom.

5 We have used the notation p(∘) to indicate probability density functions and P(∘) to represent probability values.

Figure 3.4 Parameter estimation from observations (h = number of cases of headache reported, out of t = patients with brain tumor observed). Panels: (a) t = 10, h = 2; (b) t = 10, h = 5; (c) t = 10, h = 7; (d) t = 50, h = 10; (e) t = 50, h = 25; (f) t = 50, h = 35.


Also, a higher number of observations leads to sharper peaks than a lower number of observations, signifying greater confidence in the estimates as the observations accumulate. The most likely (expected) posterior value of the probability can be found from the probability distribution as

$$ \hat{\theta} \mid d = \int_0^1 \theta \cdot p(\theta \mid d) \cdot d\theta \tag{3.13} $$

which can be simplified to $\frac{h+1}{t+2}$ with the substitution of Eq. (3.12). This expression is robust for a small number of observations. For example, assuming that only one tumor patient is examined and that he has been found to have headache, we get $\hat{\theta} = \frac{2}{3}$, and not a value as extreme as 1 (one), as suggested by maximum likelihood estimation. As the number of observations increases, $\hat{\theta}$ tends to approach $\frac{h}{t}$. Thus, we find that the à priori model dominates when the number of observations is small, and the data dominate when the number of observations is large. This result intuitively matches human cognitive principles.

The specification of the prior distribution can be guided by domain knowledge. In the above example, we have assumed a uniform prior probability distribution for $\theta$. We could start with a different distribution function as well. For example, if there is a weak belief in a positive association of headache with tumor, the prior can be a Gaussian distribution with mean greater than $\frac{1}{2}$. The posterior distribution computed with this prior will reflect this bias, till it is either neutralized by overwhelming evidence of data to the contrary or further reinforced with supporting data.

We have used analytic methods for parameter estimation in this example. But for most complex real-world models, such analytical methods may not work, and we need to use numerical methods. Markov chain Monte Carlo (MCMC) (Geyer 2011) is an approximation tool that is often used for estimating the model parameters of many probabilistic models.
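A minimal sketch of this estimator follows (Python; the sample counts are illustrative). With a uniform prior, the posterior in Eq. (3.12) is a Beta distribution, and Eq. (3.13) reduces to the closed form (h + 1)/(t + 2):

```python
# A sketch of the posterior mean of Eq. (3.13): with a uniform prior
# and h "successes" out of t trials, the posterior is Beta(h+1, t-h+1)
# whose mean is (h+1)/(t+2).
def posterior_mean(h, t):
    """Expected value of theta under the Beta posterior."""
    return (h + 1) / (t + 2)

for h, t in [(1, 1), (2, 10), (10, 50), (200, 1000)]:
    mle = h / t
    print(f"h={h:4d}, t={t:4d}: Bayes={posterior_mean(h, t):.3f}, "
          f"MLE={mle:.3f}")
# As t grows, the Bayes estimate converges to the MLE h/t.
```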

3.7 On Complexity of Models and Bayesian Inference

The starting point in the Bayesian framework of reasoning is a set of hypotheses (or models), each of which is associated with a prior belief (probability). A model is said to be simpler than another if it can be described by a smaller amount of information. Psychological experiments have shown that humans have an inherent preference for a simpler model over a more complex one, a principle that is popularly known as Ockham's razor. For example, a square (represented with one parameter) is simpler than a rectangle (specified with two parameters), and hence the former will be the preferred explanation for an observed visual pattern.


If we denote the complexity of a model M (prior or conditional) by c(M), one way to specify the probability of a model can be $P(M) = 2^{-c(M)}$. Let $c(h_i)$ denote the complexity of the hypothesis $h_i$, and $c(d \mid h_i)$ denote the complexity of explaining the data d with the hypothesis $h_i$. Thus, we have $P(h_i) = 2^{-c(h_i)}$ and $P(d \mid h_i) = 2^{-c(d \mid h_i)}$. Substituting in Eq. (3.2) and taking logarithms of both sides, we get

$$ c(h_i \mid d) = -k + c(h_i) + c(d \mid h_i) \tag{3.14} $$

where $k = -\log_2 P(d)$ is a constant. With this formulation, the likelihood maximization problem in Eq. (3.1) becomes equivalent to a minimization problem for the system complexity, as represented in Eq. (3.14).⁶ Sometimes, the definition of "complexity" of a model M assumes an abstract domain-specific notion. We shall provide examples of such domain-specific complexity in Chapters 4 and 6 of the book.

6 Though we have assumed an exponential decay in the probability assignment for a model with its complexity, the conclusion remains valid with any monotonic decay function.

Moreover, there may be infinitely many configurations of the world, and hence infinitely many hypotheses. It is not possible to test each and every hypothesis within a finite time with the constrained resources of the human brain (or a computer). Thus, it is imperative that a subset of hypotheses is chosen for evaluation. It is a usual practice to choose the hypotheses with higher prior probabilities, which is generally guided by the world model inferred by an agent at earlier points in time (von Gioi 2009).

3.8 Hierarchical Bayesian Models

In Sections 3.3–3.7, we have introduced the Bayesian reasoning framework and methods for estimating its parameters. We have seen that reliable estimation of the Bayesian parameters requires a large volume of data. In contrast, the human mind is generally capable of learning task parameters from very few examples. It has been argued that in the process of learning individual tasks, we learn how to learn, so that learning becomes easier for later tasks. This can be illustrated with the simple example described below.

Consider several bags containing marbles of different colors, where the color distribution in each of the bags may be different from the others. Suppose we sample several marbles from three of the bags, and it turns out that all the marbles sampled from bag 1 are white, those from bag 2 are green, and those from bag 3 are black. Sampling of the three bags provides us with specific knowledge about the contents of those bags: they are most likely to contain white, green, and black marbles, respectively. Moreover, sampling of the three bags provides us with some generic knowledge about all the bags: each of them is likely to contain marbles of a uniform color, though we cannot predict the color without sampling a bag.


Now, if we sample a single marble from a fourth bag, and it turns out to be blue, there is good reason to believe that all the other marbles in that bag are also blue. In this example, the inference about the fourth bag is based on just one example, which is plausible by virtue of the generic knowledge learned about the contents of the bags from the earlier sampling experiments.

This example illustrates how the generic knowledge acquired from earlier tasks can be reused to simplify learning a later related task, which is known as transfer learning. Learning from one, or a few, training samples is known as one-shot learning, or more appropriately few-shot learning, and is necessary in many real-life situations where training samples are rare. The acquisition of generic knowledge from specific tasks is referred to as meta-learning. Hypothesizing, and generalizing properties from one task to another, is known as property induction or inductive generalization and is one of the modes of transfer learning.

We may formulate the problem of sampling the marbles in an individual bag in a Bayesian framework, much in the same way as the example presented in Sections 3.3–3.4. The model parameters for bag i can be expressed as a vector $\theta_i = \{\theta_{ij}\}, j = 1:n$, where n represents the cardinality of all possible colors of the marbles, and $\theta_{ij}$ represents the probability of a marble sampled from bag i being of color j, with the obvious constraints $\forall i, j: \theta_{ij} \geq 0$ and $\sum_{j=1}^{n} \theta_{ij} = 1$. Assuming some prior distribution, the probability distribution function of the parameter $\theta_i$ for each bag can be learned individually from the sampling data $d_i$ from that bag. However, this approach requires the sample size for each bag to be large enough for reliable prediction, and cannot predict the contents of a bag that has not yet been sampled.

The goal of a hierarchical Bayesian model (HBM) is to learn some generic properties about all the bags from the sampling data of a few bags. In an HBM, the parameters of a probability distribution are modeled as functions of some hyper-parameters, which represent a higher (more generic) level of knowledge. For example, the distribution of the parameter $\theta_i$ for a bag i can be modeled as a Dirichlet distribution, parameterized by (i) a scalar $\alpha$ that represents the heterogeneity of the colors of the marbles in the individual bags, and (ii) a vector $\beta$ that captures the average color distribution across all the bags (Kemp et al. 2007), as depicted in Figure 3.5. The parameters $\alpha$ and $\beta$ characterize a space of over-hypotheses. They are learned together with the model parameters $\{\theta_i\}$ of the sampled bags, assuming some prior distribution models, in the same way as the parameter estimation for Bayesian networks discussed in Section 3.6. The linkage of the models for the individual tasks through the meta-model effectively increases the sample size for each of the tasks and helps each of them to learn faster. The generative model for $\theta_i$ enables estimation of its prior distribution from the learned distribution of $\alpha$ and $\beta$, which is likely to be more realistic than a random guess. The distribution of the parameter $\theta_i$ for a bag is parameterized by some specific values $(\alpha_i, \beta_i)$ drawn from the distribution of $\alpha$ and $\beta$.


Figure 3.5 Hierarchical Bayesian model for estimating the color distribution of marbles in multiple bags. Source: Redrawn after Kemp et al. (2007) with permission from the authors.

Thus, the distribution of the hyper-parameters imposes some constraints on the model parameters, which makes it possible to learn the latter with fewer examples. This is also an example of multitask learning (MTL), where multiple tasks (sampling of different bags) are learned together with some common task parameters. It is possible to generalize knowledge to arbitrary levels of abstraction, by modeling the parameters $(\alpha, \beta)$ as generated by a set of even higher level parameters $\Lambda$ (hyper-hyper-parameters), and so on. Progressively higher levels of knowledge establish linkages of models across more and more diverse sets of tasks, resulting in faster learning for each.
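The marble-bag example can also be sketched numerically. The recipe below (Python/NumPy) is a deliberately crude, moment-based illustration of the over-hypothesis idea, not the inference procedure of Kemp et al. (2007); all counts and the value of alpha are invented:

```python
# A sketch of the marble-bag over-hypothesis: crudely estimate the
# hyper-parameters (alpha, beta) of a Dirichlet prior from fully
# sampled bags, then make a one-shot prediction for a new bag using
# the Dirichlet-multinomial posterior mean.
import numpy as np

colors = ["white", "green", "black", "blue"]
# Counts of sampled marbles per color for three training bags.
bags = np.array([[20, 0, 0, 0],    # bag 1: all white
                 [0, 20, 0, 0],    # bag 2: all green
                 [0, 0, 20, 0]])   # bag 3: all black

theta = bags / bags.sum(axis=1, keepdims=True)   # per-bag proportions
beta = theta.mean(axis=0)                        # average color mix
# The sampled bags are internally uniform in color, so we posit a
# small alpha: a concentrated Dirichlet favors single-color bags.
alpha = 0.1

# One blue marble from a fourth bag: the posterior mean immediately
# puts most of its mass on blue, the one-shot inference of the text.
counts = np.array([0, 0, 0, 1])
posterior = (counts + alpha * beta) / (counts.sum() + alpha)
for c, p in zip(colors, posterior):
    print(f"P(next marble is {c}) = {p:.3f}")
```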

3.9 Inductive Reasoning with Bayesian Framework

In Section 3.8, we have seen that abstraction of knowledge is the key to inductive reasoning and generalization. In this section, we present a few concrete examples of inductive reasoning that can be achieved with the Bayesian framework.

3.9.1 Inductive Generalization

The human mind is capable of building rich and abstract models of the world with sparse and noisy data. For example, after encountering one, or at best a few, images of cats, it is possible for even a child to recognize a cat in a new image. This is in contrast to classical machine learning models, where many examples of an object are required in order to recognize another of the same kind.


object is required in order to recognize another of the same kind. The capability of generalizing the properties of an object is known as inductive generalization. The problem can be formally stated as follows: given that a set of object instances of class C maps to the points X = {x1 , x2 , … , xn ∣ n ⩾ 1} in a metric feature space, and given a new object that maps to the point y, what is the probability of that object to belong to the class C? We assume that C covers a contiguous region, called the consequential region (CR), in the feature space, when the problem boils down to estimate the extent of the CR, and to ascertain if y falls within the CR or not (Tenenbaum and Griffiths 2001). Let us illustrate the use of Bayesian framework for such inductive generalization with the simple example of identifying a cat through its visual features. For simplicity, we shall assume that the visual features (x) of a cat is represented by a scalar quantity and can assume a fixed set of discrete values, say from 1 to 20. Suppose, we have experienced a cat with a feature value of 11. We assume that the property of an object is deterministic with respect to its feature value. In this example, all objects with feature value 11 must be a cat. The constraints on the CR are that (i) it is a continuous region, (ii) it must contain the feature value 11, and therefore (iii) it has a minimum size of 1. Now, there can be several hypothesis for the CR of different sizes, each covering a different region of feature space, subject to the constraints stated above. The hypotheses lie between the two extremes: (i) the CR includes only one feature value, namely 11 (most conservative) and (ii) the CR covers the entire range of feature values [1, 20] (most liberal). Let  = {hi } represent the hypothesis space. The posterior probability of a hypothesis hi given the data X is given by P(hi ∣ X) =

P(hi ) ⋅ P(X ∣ hi ) P(X)

(3.15)

Since the denominator is a normalizing constant, the posterior probability of a hypothesis depends on the two terms in the numerator. The term P(hi) represents the prior belief in a hypothesis hi. In the absence of any other information, we assume P(h) to have a uniform distribution.7 The term P(X ∣ hi) denotes the probability of observing the data X, given a hypothesis hi. The probability distribution for the CR over the feature values can be computed as a weighted average over all possible hypotheses, i.e.

$$P(CR \mid X) = \frac{1}{|\mathcal{H}|} \sum_{h_i \in \mathcal{H}} P(h_i \mid X) = \kappa \cdot \sum_{h_i \in \mathcal{H}} P(X \mid h_i) \cdot P(h_i) \tag{3.16}$$

where 𝜅 is a normalizing constant.

7 Other alternatives, such as assuming higher probabilities for simpler (more conservative) CRs, are also possible. See discussions in Section 3.7.


Under the strong sampling assumption (Navarro et al. 2012), the probability of observing a single sample x given a hypothesis hi is given by

$$p(x \mid h_i) = \begin{cases} \dfrac{1}{|h_i|} & \text{if } x \in h_i \\[1ex] 0 & \text{otherwise} \end{cases} \tag{3.17}$$

where ∣hi∣ represents the size of the CR for hypothesis hi. In the present example, there is only one sample, i.e. X = {x} = {11}. Substituting the result of Eq. (3.17) in Eq. (3.16) and simplifying, we get the probability distribution for the CR to be a curve with an exponential and symmetric decay on both sides of the value 11, as sketched in Figure 3.6a, which coincides with our intuitive understanding.

Now, let us assume that we have observed three cats, with feature values 7, 10, and 11, respectively. In this case, the constraint on a hypothesis for the CR is that, being contiguous, it should cover the range from 7 to 11, i.e. its minimum size is 5.8 Working with the possible hypotheses conforming to these constraints, as in the earlier case, we get the probability distribution function for the CR as shown in Figure 3.6b. Compared to the earlier result, the peak of the probability function has flattened out over a larger part of the feature space, but it has similar decays on either side of the plateau. Now let us consider a third case, where we encounter n cats, all with feature value 11. In this case, the probability of the observation X, given a hypothesis hi, is given by

$$P(X \mid h_i) = \prod_{x_j \in X} p(x_j \mid h_i) = \begin{cases} \dfrac{1}{|h_i|^n} & \text{if all samples are in } h_i \\[1ex] 0 & \text{otherwise} \end{cases} \tag{3.18}$$

With n = 3, and using this equation, we get the probability distribution function for the CR depicted in Figure 3.6c. Compared to the case of a single sample, the probability values show a sharper fall on either side of the observed value. This is intuitively explained by the fact that all the observed samples having the same feature value may be more than just a coincidence. Figure 3.6d depicts the generalization in a two-dimensional feature space. The dots represent observed samples, and the elliptical boundaries represent iso-probabilistic contours. The range of generalization is larger, and has a slower decay, along the axis on which the observed samples are spread out than along the axis on which they are packed more densely.

8 This constraint is stronger than the constraint in the previous case, and as a result some of the hypotheses (with CR size < 5) are ruled out.
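The computation of Eqs. (3.15)–(3.18) is easy to reproduce numerically. The following sketch (our own code, written for this formulation, not taken from the literature) enumerates all contiguous interval hypotheses over [1, 20] with a uniform prior and prints the generalization curves for the three cases discussed.

```python
# Consequential-region posterior under the strong sampling assumption:
# hypotheses are contiguous integer intervals in [1, 20].
import numpy as np

def cr_posterior(X, lo_val=1, hi_val=20):
    """Return P(y in CR | X) for every feature value y."""
    values = range(lo_val, hi_val + 1)
    scores = np.zeros(hi_val + 1)
    total = 0.0
    for lo in values:
        for hi in values:
            if lo > hi:
                continue
            if not all(lo <= x <= hi for x in X):
                continue                          # Eq. (3.17): zero likelihood
            w = (1.0 / (hi - lo + 1)) ** len(X)   # Eq. (3.18); uniform prior cancels
            total += w
            scores[lo:hi + 1] += w
    return scores[lo_val:] / total                # Eq. (3.16), normalized

print(np.round(cr_posterior([11]), 2))            # symmetric decay around 11
print(np.round(cr_posterior([7, 10, 11]), 2))     # plateau over [7, 11]
print(np.round(cr_posterior([11, 11, 11]), 2))    # sharper fall-off
```

The three printed curves reproduce the qualitative shapes of Figure 3.6a–c.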

Figure 3.6 Inferring the consequential region from observations. (a) Generalization through induction. (b) Increase in range of generalization. (c) Decrease in range of generalization. (d) Generalization in two dimensions.


Figure 3.7 Taxonomy learning from data. (a) Fewer data points: simpler hypothesis favored. (b) More data points: goodness of fit favored.

3.9.2 Taxonomy Learning

Categorization of objects refers to the organization of objects into taxonomic categories. We present a Bayesian formulation for taxonomy learning in this section. Let us assume some data (round dots) depicted in a two-dimensional metric space, as shown in Figure 3.7a. Our goal is to organize the data into categories. There are several possible categorizations of this data set: for example, a single category, two categories, or three categories, as shown by the rectangles with different line-styles in the diagram.9 The Bayesian formulation provides us a way to select the most plausible categorization from such alternatives by analyzing the data. In the Bayesian formulation, each of the possible categorization schemes forms a hypothesis. If ℋ = {hi, i = 1, …, n} represents the hypothesis space (i.e. the set of all hypotheses), and d denotes the observed data, the posterior probability of a hypothesis hi, given the data d, is

$$P(h_i \mid d) = \frac{P(h_i) \cdot P(d \mid h_i)}{P(d)} \tag{3.19}$$

P(d) being a normalizing constant, the probability of a hypothesis is determined by the product of two terms in the numerator: (i) P(hi), which represents the prior probability of a hypothesis, and (ii) P(d ∣ hi), which represents the probability of the data d being explained by the hypothesis hi, also known as the goodness of fit. In Section 3.7, we have seen that simpler hypotheses are preferred to more complex ones, and that a Bayesian inferencing problem may be modeled as a complexity minimization problem.

9 Hierarchical categorization is also possible, but we presently keep that possibility out of our discussions.


This implies that the prior probability of a hypothesis with a smaller number of categories will be higher than that of a hypothesis with a larger number of categories. If there are n data points, i.e. d = {dj : j = 1, …, n}, and their occurrences are assumed to be independent of each other, the goodness of fit can be expressed as

$$P(d \mid h_i) = \prod_{j=1}^{n} P(d_j \mid h_i) \tag{3.20}$$

where the probability of each individual term P(dj ∣ hi) is inversely proportional to the area Aik of the region k representing the category to which the point dj belongs according to hypothesis hi. Under the strong sampling assumption (Navarro et al. 2012), the goodness of fit for a hypothesis hi will be higher when it postulates that the data points are categorized into a larger number of tightly fitting categories, rather than into fewer and larger generic categories. Thus, taxonomy learning in the Bayesian formulation involves a trade-off between the parsimony (simplicity) of hypotheses and the goodness of fit to the data (Perfors et al. 2011). It may be seen that when there are few data points, as in the example of Figure 3.7a, the simplicity of the hypothesis dominates, and a hypothesis with fewer categories may be preferred. However, if more data points are observed, as shown in Figure 3.7b, the goodness of fit dominates and a more complex hypothesis is preferred. The Bayesian formulation of data-based category learning explains how a child (or an artificial agent) may learn to organize objects into hierarchical taxonomies. After seeing a few examples of different animals, the child may learn to distinguish animals from other objects. After encountering more examples, he tends to distinguish between cats and dogs. After many more examples, he may be able to distinguish between different breeds of dogs. The category learning problem illustrates top-down knowledge acquisition, where more abstract knowledge is acquired before concrete knowledge. The same principle has been proposed in Griffiths (2015) for color category learning.
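The trade-off can be illustrated with a toy numerical example. In the following sketch (the boxes, point coordinates, and prior penalty are our own illustrative assumptions), a one-box hypothesis competes against a two-box hypothesis: with two data points the simpler hypothesis wins, while with six points the tighter categorization wins.

```python
# Parsimony vs. goodness of fit, in the spirit of Eqs. (3.19)-(3.20):
# the prior penalizes the number of categories, and the likelihood of each
# point is 1/area of its assigned box (strong sampling).
import numpy as np

def log_posterior(points, boxes, prior_decay=1e-3):
    lp = len(boxes) * np.log(prior_decay)            # fewer boxes => higher prior
    for p in points:
        box = next(b for b in boxes
                   if b[0] <= p[0] <= b[1] and b[2] <= p[1] <= b[3])
        area = (box[1] - box[0]) * (box[3] - box[2])
        lp += np.log(1.0 / area)                     # Eq. (3.20): 1/area per point
    return lp

cluster_a = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
cluster_b = [(4.0, 0.0), (4.5, 0.5), (5.0, 1.0)]
one_box = [(-0.1, 5.1, -0.1, 1.1)]                   # (xmin, xmax, ymin, ymax)
two_boxes = [(-0.1, 1.1, -0.1, 1.1), (3.9, 5.1, -0.1, 1.1)]

for n in (1, 3):
    pts = cluster_a[:n] + cluster_b[:n]
    print(f"{2 * n} points: one box {log_posterior(pts, one_box):.2f}, "
          f"two boxes {log_posterior(pts, two_boxes):.2f}")
# With 2 points the single box scores higher; with 6 the two tight boxes win.
```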

3.9.3 Feature Selection

Consider two objects of categories A and B, which are represented by the points a and b in a two-dimensional metric feature space (X, Y), as in Figure 3.8a. For example, the dimensions of the feature space may represent the color and shape properties of the objects, respectively, each somehow compressed to a single dimension. Let us assume that an object P of an unknown kind maps to the point p in the feature space, and that p is equidistant from a and b. In this case, it is difficult to ascertain whether the object represented by the point p belongs to category A or B. By observing samples of objects of other categories, such as P, Q, R, etc. (not necessarily the classes to which a, b, or p belong), for which more examples


Figure 3.8 Category determination with learned over-hypothesis. (a) Ambiguity in categorization. (b) Meta-learning for categorization.

may exist (see Figure 3.8b), it is possible to learn the over-hypothesis that the features of an object category have more dispersion along the X-axis than along Y. The meta-knowledge so learned leads to a stronger belief that object P belongs to the same category as object B, rather than to that of object C (Perfors et al. 2011), even from single instances of those objects. This example illustrates the learning of relevant features for classification. For example, most of the common object categories have more dispersion in color than in shape. This model of meta-learning explains the human bias toward using some image properties and ignoring others for object classification, e.g. the "shape bias" observed in children (Diesendruck and Bloom 2003). This is also the basis of the automatic extraction of relevant features in artificial neural networks, which will be discussed in Chapter 8.
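A minimal numerical sketch of such an over-hypothesis is given below (the data and the variance-weighted distance are illustrative choices of ours): the per-axis within-category variance learned from well-sampled categories is used to weight distances, so that a probe point is grouped with the exemplar that matches it along the low-variance (diagnostic) axis.

```python
# Learning which feature dimension matters: well-sampled categories reveal
# large within-category spread along X and little along Y, so Y becomes the
# diagnostic dimension for one-shot categorization.
import numpy as np

rng = np.random.default_rng(1)
known = [rng.normal([cx, cy], [2.0, 0.1], size=(50, 2))   # wide in X, tight in Y
         for cx, cy in [(0, 0), (5, 3), (-4, 6)]]

var = np.mean([c.var(axis=0) for c in known], axis=0)     # per-axis variance

a, b = np.array([0.0, 10.0]), np.array([3.0, 13.0])       # single exemplars
p = np.array([1.5, 11.6])                                  # probe point

def dist(u, v):
    return np.sum((u - v) ** 2 / var)                      # variance-weighted

print("d(p,a) =", dist(p, a), " d(p,b) =", dist(p, b))
# p is nearly equidistant in Euclidean terms, but the low-variance Y axis
# dominates the weighted distance, so p is grouped with exemplar b.
```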

3.10 Conclusion

We have reviewed the Bayesian framework of reasoning and its use in inductive inferencing in this chapter. The main strength of the Bayesian reasoning framework comes from its use of a generative model, which enables it to combine prior beliefs with observations. The use of a generative model helps it to generalize from specific observations, and makes it robust against noisy data. In hierarchical Bayesian models, the knowledge acquisition problem is pushed to higher levels of abstraction. There are several advantages to this approach (Perfors et al. 2011). A higher level of knowledge is more abstract than a lower level, which results in a smaller number of prespecified parameters. Further, it imposes weak constraints on the beliefs at the lower levels, making learning possible with less data. The progressive levels of abstraction allow knowledge to be shared across contexts that are related but distinct, with little training data.


Despite its advantages, the Bayesian framework of reasoning has its own limitations. One criticism of the Bayesian learning model is that it assumes the sampling to be from a representative population. According to the strong sampling assumption, the data is considered to be highly informative, which may not necessarily be true. It has been recognized by psychologists that people tend to adapt their beliefs at a slower pace than the Bayesian formulation would imply. This is known as conservatism (Navarro et al. 2012). Another criticism of the Bayesian framework is that it is all about the selection of a hypothesis, as inference, from an à-priori known hypothesis space. Thus, if a hypothesis was not included in the hypothesis space, Bayesian reasoning will fail to discover it. This can be interpreted as meaning that nothing new is really learned through Bayesian reasoning. But the fact remains that nothing can be learned in a void, and there must always be some prior beliefs. If a new hypothesis could be generated, there must have been some reasoning mechanism to generate it, which needs to be à-priori specified. Generalization is necessarily based on some prior theory, which provides a link between the observed data and the unobserved data. Thus, it is quite apparent that every cognitive learning system needs to be pre-equipped with a hypothesis space that consists of all possibilities within its realm of representation and computation (Perfors et al. 2011). Bayesian reasoning provides an elegant framework for representing and reasoning with them. A more precise definition of Bayesian inferencing, then, is selecting the hypothesis that best explains the data from a set of known alternatives. Yet another criticism of the Bayesian framework concerns the selection of priors (Gelman 2008). While the priors play a significant role in Bayesian estimation, especially when one needs to be satisfied with a small volume of data, their selection is often a subjective judgment of an agent. There seems to be no principled approach to the selection of priors for the Bayesian reasoning framework, though some guidelines have been proposed in Zondervan-Zwijnenburg et al. (2017). However, even statistical models of estimation are not free from such assumptions. For example, curve-fitting with a polynomial (regression) essentially involves a strong prior assumption about the degree of the polynomial. Similarly, an assumption of Gaussian noise implies a strong commitment to the model of the noise. Indeed, the Bayesian framework provides a mechanism to make the assumptions explicit, and a principled way to deal with them. For example, when there is uncertainty about the degree of the polynomial (model complexity) in a regression problem, it can be treated as a parameter with uncertainty, and the result can be estimated as an average weighted by the posterior probabilities. The strength of the prior (the degree of belief in it) also calls for some criticism. When there is a weak


belief in the prior, even a small volume of noisy data can lead to instability in the reasoning system. On the other hand, if a strong prior is selected, a large volume of observations is required to offset the strong (and possibly wrong) belief of the agent (Gelman 2008). A further abstraction of the knowledge level, with hyper-parameters assumed to constrain the prior, alleviates the problem. Notwithstanding these criticisms, the Bayesian framework of reasoning is used extensively in explaining perception and cognition, and has become an important tool for computer vision. We shall see many applications of the Bayesian reasoning framework in subsequent chapters of this book.


4 Late Vision

In Chapter 2, we have seen that the retinal neurons detect local contrasts in the visual signals. The fragmented contour information so extracted is transmitted to the brain for further processing. These pieces of information are stitched together in the brain and interpreted through the processes of perception and cognition to create a model of the environment. These processes are collectively known as late vision. In this chapter, we shall review some of those processes and seek their explanation. We shall also present some examples of artificial implementations of these processes.

4.1 Stereopsis and Depth Perception

One of the first tasks in late vision is to integrate the slightly disparate visual inputs from the two eyes and create a 3D model of the world. The process is known as stereopsis. It involves (i) identification of pairs of corresponding points in the images formed on the two retinas, and (ii) estimation of the distance of the real-world points from the difference in the positions where they are projected on the retinas. The process depends on the relative locations and orientations of the two eyes. Establishing the correspondence between points across two views of a scene is nontrivial, and we have discussed the issues involved in Chapter 2. Once the correspondence of a pair of image points has been established, the depth (distance from the eyes) of the corresponding real-world point can be computed from the disparity in the angles of projection of the point in the two eyes using simple trigonometry. Intuitively, the disparity is larger when the object is closer to the eyes, and smaller when it is farther. We illustrate the depth calculation in Figure 4.1, where O1 and O2 represent the optical centers of the two eyes, and P represents a point whose depth needs to be estimated. Let s represent the distance between the eyes, and let 𝛼 and 𝛽 denote the

Figure 4.1 Distance estimation using stereopsis.

angles subtended by the point P at the two eyes with the line joining the optical centers, O1O2. The disparity in the angles is given by 𝛿 = 𝛽 − 𝛼. The distance d of the point P from the left eye can be computed with simple trigonometry as1

$$d = s\,(\cos\alpha + \sin\alpha \cdot \cot\delta) \tag{4.1}$$
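A quick numerical check of Eq. (4.1) can be made as follows; the inter-ocular distance and the angles below are illustrative values of ours, not from the text.

```python
# Depth from the disparity between the angles subtended at the two eyes.
import math

s = 0.065                       # distance between the eyes, in metres
alpha = math.radians(80.0)      # angle at the left eye
beta = math.radians(85.0)       # angle at the right eye
delta = beta - alpha            # disparity: larger for closer objects

d = s * (math.cos(alpha) + math.sin(alpha) / math.tan(delta))
print(f"estimated distance: {d:.2f} m")   # about 0.74 m for these angles
```

Shrinking the disparity 𝛿 in this snippet (i.e. moving the object away) makes the cotangent term, and hence the estimated distance, grow, matching the intuition stated above.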

The same principle is often deployed in artificial vision systems using two or more closely placed cameras. Besides stereopsis, depth perception also exploits some high-level visual cues, such as perspective, occlusion, relative sizes, and shadows, as shown in Figure 4.2. These cues help in creating an illusion of depth in 2D images. One way to explain depth perception with such high-level cues is the exploitation of natural scene statistics. For example, the depth illusion in Figure 4.2d is explained by the fact that human eyes are adapted to the statistics of the texture patterns produced by illumination from the top, which is common

Figure 4.2 Depth perception. (a) Perspective: corner A appears to be at the rear, and corner B at the front. (b) Occlusion: the triangle occludes the square and hence is nearer. (c) Relative size and position: the man appears to be farther away than the woman. (d) Texture and shadow: circles on the left appear to be convex, those on the right appear to be concave.

1 Assuming that the eyes are close together, this will be the distance from the right eye also.


in a natural lighting environment. As a simple exercise, the reader may turn the book upside down and look at the diagram: the convex surfaces will appear to be concave, and vice-versa.

4.2 Perception of Visual Quality

The quest for perceptual quality assessment of images and videos is primarily motivated by broadcast and streaming services, where the goal is to provide the viewers with the best possible viewing experience within the constraints of limited bandwidth and processing resources. A digital image or video undergoes degradation from the point of its capture to its presentation, often resulting in unsatisfactory viewing. Quality measures based on signal processing theory, such as mean square error (MSE) or peak signal to noise ratio (PSNR), differ widely from the human perception of visual quality. This motivates research on perceptual quality assessment of images and videos, which attempts to formulate an alternative metric. Broadly speaking, such a metric needs to establish a distance measure between a degraded image or video and the original one on a perceptual scale (Bovik 2013). It has been observed that the human eye is particularly sensitive to structural information in an image. For instance, structural degradations, like blur and noise, have far more perceptual effect than nonstructural degradations, such as changes in luminance and contrast, as shown in Figure 4.3. This has been explained by the hypothesis that perception of image quality happens in the pre-attentive stage of vision. The center-surround organization of the retinal neurons, which makes the ganglion cells respond to the local structures in a scene, is by and large robust against nonstructural distortions owing to a normalization mechanism (see the discussion of automatic gain control (AGC) in Chapter 2). This theory prompts measures of perceptual image fidelity to be based on various models of structural description of an image, such as spectral analysis, the structural similarity index measure (SSIM) (Wang and Bovik 2009), and natural scene statistics.
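The difference between a signal-level metric and a structural one can be demonstrated with scikit-image, which provides reference implementations of PSNR and SSIM; the test image and the degradation levels in this sketch are illustrative assumptions of ours.

```python
# Comparing PSNR (signal-level) with SSIM (structural) on two degradations.
import numpy as np
from skimage import data
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

img = data.camera().astype(np.float64)
rng = np.random.default_rng(0)

noisy = img + rng.normal(0, 15, img.shape)     # structural degradation
brighter = np.clip(img + 25, 0, 255)           # non-structural (luminance)

for name, deg in [("noise", noisy), ("luminance", brighter)]:
    psnr = peak_signal_noise_ratio(img, deg, data_range=255)
    ssim = structural_similarity(img, deg, data_range=255)
    print(f"{name:9s} PSNR={psnr:5.2f} dB  SSIM={ssim:.3f}")
# PSNR rates the brightened image as the worse of the two, whereas SSIM
# rates it far better than the noisy one, in line with human judgment.
```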

Figure 4.3 Perceptual degradation of image. (a) Original image. (b, c) Non-structural degradations (luminance and contrast). (d, e) Structural degradations (blur and noise).


There are three modalities for perceptual quality assessment of images and videos. The simplest is the full reference (FR) method, where the original artifact (image or video) is assumed to be available. In this case, perceptual models of both the original and the degraded artifact are created and compared. For example, Sheikh et al. (2005) propose a model for visual information fidelity based on a distance between the statistical patterns in the source and the degraded images. A method to learn the changes in statistical patterns in a degraded image has been proposed in Tang et al. (2011). While most of the earlier work has been based on analysis of gray-scale image features, Temel and AlRegib (2019) introduce a multi-scale and multichannel spectral analysis method to account for color distortions in a degraded image. There are instances, such as at the reception point of a TV broadcast, where the original signal may not be available.2 In such cases, the quality of an image or video needs to be assessed without reference to the original. This is known as no reference (NR) or blind quality assessment. In the NR approach, meta-knowledge of the statistical properties of natural scenes is used for assessing the quality of a visual artifact. Between the two extremes of the FR and NR methods lies the reduced reference (RR) approach, where some features extracted from the original image are transmitted from the source through a narrow-band channel (assumed to be loss-less), and quality assessment is based on deviation from those features. Selection of the features is based on a trade-off between the bandwidth required for their transmission and the accuracy of quality assessment, and is a critical issue for the RR approach. The features need to be efficient, sensitive to image distortions, and of strong perceptual relevance. Different approaches for RR and NR image quality assessment have been summarized in Wang and Bovik (2011). Though statistical regularities exist in videos, modeling these properties for quality assessment poses some challenges. Frame-by-frame image quality assessment leaves out the dynamic aspect of a video, namely the optical flow, which can be due to camera motion (ego-motion), motion of the real-world objects, or both. Modeling the statistical properties of optical flow has so far been done in some restricted environments only. Since it is difficult to obtain the original video in most cases, much of the research on video quality assessment has been blind (Mittal et al. 2016). More recent work has focused on the quality of "in-the-wild" commodity videos captured with personal cameras (e.g. on smart-phones), rather than professionally created ones (Li et al. 2019). A database of commodity videos, annotated with quality parameters created by crowd-sourcing, has been reported in Sinno and Bovik (2019). Significantly, none of the existing video quality assessment methods have been found to produce satisfactory results on this database.

2 Strictly speaking, the original natural scene is never available for analysis. The closest to the original is available at the output of the camera, where some degradation has already been caused by the camera hardware and compression algorithm.
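As a flavor of the natural-scene-statistics features used in blind assessment, the following sketch computes BRISQUE-style mean-subtracted contrast-normalized (MSCN) coefficients. The filter parameters and the stand-in image are our own assumptions; production NR models fit parametric distributions to such coefficients and map them to quality scores.

```python
# MSCN coefficients: a divisive normalization whose statistics are stable for
# pristine natural images and shift under distortions such as blur.
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(img, sigma=7 / 6, c=1.0):
    mu = gaussian_filter(img, sigma)                       # local mean
    var = gaussian_filter(img * img, sigma) - mu * mu      # local variance
    return (img - mu) / (np.sqrt(np.abs(var)) + c)         # divisive normalization

rng = np.random.default_rng(0)
img = rng.random((128, 128)) * 255          # stand-in for a natural image
blurred = gaussian_filter(img, 3)

for name, im in [("original", img), ("blurred", blurred)]:
    m = mscn(im)
    print(f"{name:9s} MSCN variance = {m.var():.3f}")
# Blur suppresses local contrast and shrinks the MSCN variance, one of the
# statistical deviations a blind quality model can detect.
```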


4.3 Perceptual Grouping

The object contours detected in a natural image at the early vision stage form a set of discontinuous segments, including many "noise" edges, as shown in Figure 4.4b for the small area enclosed in the white square in Figure 4.4a. Despite the discontinuities of the edges and occlusions hiding parts of the objects, the human vision system (HVS) can perceive the semantic entities in the scene and their contours in a holistic manner. Perceptual grouping in human vision was studied by a group of psychologists who established the school of Gestalt psychology.3 They observed that the holistic representations in an image are emergent features, i.e. new knowledge created by the HVS, and not just an aggregate of the fragmented information transmitted by the eyes (Wagemans et al. 2012a,b). Perceptual grouping for still images is believed to happen in two stages. The first stage involves a bottom-up process of contour integration, where the fragments of detected edges are combined based on continuity, proximity, and other such considerations, to form extended, but still incomplete, contours. These extended contours may have large gaps due to various reasons, such as occlusion. The next stage involves contour completion, a model-based integration of these extended contours to identify closed region boundaries. Gestalt scientists have discovered several principles of perceptual grouping for contour integration and completion. These principles, in general, can be explained

Figure 4.4 Imperfection in object contour detection. (a) A natural image. (b) Edges detected for the inset (white) box.

3 The word Gestalt, according to the Collins online dictionary, "is something that has particular qualities when you consider it as a whole which are not obvious when you consider only the separate parts of it."


Figure 4.5 Some principles of perceptual grouping. (a) Grouping by proximity gives a perception that the blocks are arranged in columns. (b) Grouping by similarity (of colors) gives rise to the perception of the blocks to be arranged in rows. (c) Grouping by closure gives rise to the perception of a circle from the fragments of edges.

with the Bayesian reasoning framework. The problem can be modeled in a way similar to taxonomy learning (see Chapter 3), where objects are grouped together into learned categories. Figure 4.5 illustrates some of the principles of perceptual grouping, namely grouping by proximity, by similarity, and by closure. The grouping principles can often conflict with each other. For example, grouping is ambiguous in Figure 4.6, where the dots are similar in color along the rows, but closer together along the columns. This represents a trade-off between parsimony and goodness of fit, where the simpler hypothesis, that a group contains uniform colors, competes with the compactness of the grouped regions. For most observers, the dots in this example are perceived to be organized in rows, rather than in columns. Empirically, it has been found that if dh and d𝑣 denote the horizontal and the vertical distances between the dots, and th and t𝑣 denote the likely number of times the dots are perceived to be horizontally or vertically grouped, then

$$\frac{t_h}{t_v} = \left(\frac{d_h}{d_v}\right)^{-\alpha} \tag{4.2}$$

where 𝛼 ≈ 2.89. This shows a stronger bias for color similarity than for proximity in human vision.

Figure 4.6 Conflict in the laws of grouping by proximity and similarity. The circles are grouped by similarity (color), though the distance across the columns is greater than that between the rows.
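A quick evaluation of Eq. (4.2) shows how strongly spacing biases the perceived organization; the spacing ratios in this sketch are illustrative values of ours.

```python
# Evaluating the empirical grouping law of Eq. (4.2) with alpha = 2.89.
alpha = 2.89
for ratio in (0.5, 0.8, 1.0, 1.25, 2.0):       # d_h / d_v
    odds = ratio ** (-alpha)                   # t_h / t_v
    p_rows = odds / (1 + odds)
    print(f"dh/dv = {ratio:4.2f} -> P(grouped in rows) = {p_rows:.2f}")
# Equal spacing gives 50/50; dots twice as close horizontally are seen as
# rows about 88% of the time.
```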


Figure 4.7 Amodal and modal contour completion. (a) Amodal completion: The contour of the triangle is completed by extrapolating the visible edges into the occluded region. (b) Modal completion: We perceive a white illusory triangle with complete contour, without occlusion (Kanizsa’s triangle).

We see two more examples of grouping by closure in Figure 4.7. The contour of the occluded triangle in Figure 4.7a can be completed mentally. This is known as amodal completion. On the other hand, we perceive an illusory white triangle (known as Kanizsa's triangle) occluding the black circles in Figure 4.7b. This form of contour completion is known as modal completion. The perception of occlusion through contour completion provides a high-level cue for depth perception: in both figures, the occluding triangle appears to be closer than the occluded geometric shapes. Another important principle of perceptual grouping pertains to good continuity. For example, it is possible to perceive the helical structure of the barbed wire in Figure 4.8a despite intersections and overlaps. The principle of good continuity is explained by the choice of a simpler hypothesis. This is further illustrated

Figure 4.8 Grouping by good continuity. (a) Perception of helical structure of barbed wire. Source: File shared by Darin Marshall through Wikimedia Commons, file name: Concertina_wire_on_the_FAA_airplane_station.jpg. (b) The occluded segment of the line AB is generally reconstructed with a straight line, out of infinitely many possibilities. (c) The line segment A is perceived to extend to B, rather than C.


in Figure 4.8b,c. While the occluded portion of the straight line AB in Figure 4.8b can be reconstructed in infinitely many ways (some of which are shown with white dotted lines in the diagram), human perception will generally consider a straight edge to connect the two ends. Similarly, to extrapolate the continuation of line segment A beyond the occluded portion in Figure 4.8c, human perception tends to choose the line segment B rather than C. The "simplicity" of a contour often has a domain-specific, abstract notion, and is guided by natural scene statistics. For example, a "simpler" hypothesis for the missing contour of an elephant's back is perceived to be flat, while that of a camel's back includes a hump, as shown in Figure 4.9. Note that prior knowledge plays a key role here: one who has never seen a camel will tend to complete the missing contour of the camel's back with a flat surface, as for the other animals. Gestalt scientists of a later age have postulated quite a few other "rules" of perceptual grouping, some of which are depicted in Figure 4.10. A particularly interesting one is grouping by generalized common fate, shown in Figure 4.10a, which groups elements by similar (not necessarily identical) movement patterns, for example birds flying in a flock. The different principles described above work together, and sometimes conflict, to give a holistic perception of an image. As in other cases of Bayesian reasoning, several hypotheses compete, and the winner, on the basis of all the evidence, determines the inference. The principles of Gestalt grouping discussed in this section are not exhaustive. More such "rules" have been discussed in Wagemans et al. (2012a).

Figure 4.9 Role of prior experience in reconstruction with continuity. (a) The missing contour of an elephant’s back is perceived to be flat. Source: Derivative work from file shared by Felix Andrews through Wikimedia Commons, file name: Elephant_side-view_Kruger.jpg. (b) The missing contour of a camel’s back is completed with a hump. [Derivative work from file shared by Jjron through Wikimedia Commons, file name: 07._Camel_Profile,_near_Silverton,_NSW,_07.07.2007.jpg.]


Figure 4.10 Some other principles of perceptual grouping. (a) Grouping by (generalized) common fate (flock of birds flying). (b) Grouping by region (the objects in the boxes are grouped together). (c) Grouping by parallelism (parallel lines are grouped together).

4.4 Foreground–Background Separation

One of the direct consequences of perceptual grouping is the separation of the foreground objects in an image from the background. This is known as foreground–background separation, or figure–ground separation. The shared border between two regions is perceived as the occluding contour of a foreground object. The other region, the background, is perceived to continue behind the foreground. Given a boundary between two regions, an issue to resolve is which side of the boundary represents the foreground object. The principles that work behind selecting the foreground are guided by the statistics observed in natural scenes. Some such principles are shown in Figure 4.11. Further, it has been observed that a region eliciting attention (to be discussed in Chapter 5) and moving objects are more likely to be treated as part of the foreground. Moreover,

Figure 4.11 Foreground–background separation. (a) Closed regions (black) are considered to be the foreground. (b) Convex areas (black) are considered to be the foreground. (c) Symmetric areas (white) are considered as the foreground. (d) Lower region (black) of an image is considered as the foreground. (e) Regions wide at the bottom and narrow at the top (white) are considered as the foreground. (f) Familiar shapes (monkey face) are considered to be the foreground.


the foreground object is generally at the center of the visual field when a person looks at it, as well as in photographs that are intended to depict salient objects.4 Foreground–background separation can also be explained with Bayesian theory, where the hypotheses regarding foreground and background are tested against the visual data. The priors for the various hypotheses about foreground objects, as presented in the earlier paragraph, are assigned higher values, based on our experience with real-world objects. As mentioned in earlier sections, foreground–background separation provides high-level cues for depth perception in 2D images.

4.5 Multi-stability

The perception of visual signals is not always unique, which often allows an image to be interpreted in different ways. A classic example is the ambiguous cartoon depicted in Figure 4.12,5 which can be perceived either as a young woman or as an old one. The phenomenon of multi-stability is explained by two or

Figure 4.12 An example of visual bistability. "My wife and my mother-in-law." Source: File shared by William Ely Hill through Wikimedia Commons, file name: My_Wife_and_My_Mother-In-Law_(Hill).svg.

4 Professional photographers may follow a different placement of the foreground object to improve the aesthetic value of the photograph.
5 The cartoon originally appeared on a German postcard in 1888, and is attributed to an unknown artist.


more hypotheses being found to be equiprobable à-posteriori, of which any one may be accepted. The subjective experience of an observer often tilts the balance and guides the interpretation (what one sees first). It is important to note that it is not possible to perceive multiple interpretations of an ambiguous image simultaneously; the human perception system accepts only one interpretation (hypothesis) at a time. Sometimes it is hard for an observer to see one of the possibilities unless prompted. This is explained by the fact that prompting results in the assignment of a higher prior to the alternative. It has also been observed that the perception switches to an alternate interpretation when one looks at an ambiguous image for a long time. This has been explained by fatigue of the neurons under long exposure to the same stimulus, which effectively decreases the prior.

4.6 Object Recognition

Vision (as well as any sensory perception) is essentially an inverse problem. Visual patterns on the human retina (or a camera sensor) are generated by the objects in the real world through a complex optical process. The state of the real world is estimated by reverse interpretation of the received visual signals. Thus, the vision problem naturally lends itself to abduction, where a hypothesis regarding the state of the real world that best explains the observed visual data forms its interpretation. The problem of abduction in vision can be effectively modeled with a Bayesian formulation. In the simplest case, the different possible objects form the hypothesis space 𝒪. Given some visual data 𝑣, the object is identified as the hypothesis O* that has the maximal posterior probability given the observed data. Formally,

$$O^{*} = \operatorname*{argmax}_{O \in \mathcal{O}} P(O \mid v) \tag{4.3}$$

where P(O ∣ 𝑣) is computed as

$$P(O \mid v) = \frac{P(O) \cdot P(v \mid O)}{P(v)} \tag{4.4}$$

The denominator P(𝑣) represents the prior probability of occurrence of the visual pattern 𝑣 irrespective of the hypotheses; it is a normalizing constant in a given observation scenario. Thus, in Bayesian theory, object recognition becomes a trade-off between the prior probability P(O) of the object being present in an image and the evidence P(𝑣 ∣ O) of the visual data received. As in other Bayesian processes, both the prior and the conditional are formed out of the personal experiences of the observer, making vision a subjective process.
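This trade-off can be demonstrated with a toy MAP classifier in the spirit of Eqs. (4.3)–(4.4); the object classes and the probability values below are hand-picked stand-ins of ours for learned distributions.

```python
# MAP object recognition: the recognized object maximizes prior x likelihood.
priors = {"cat": 0.30, "dog": 0.30, "fox": 0.02}       # P(O)
likelihood = {"cat": 0.20, "dog": 0.05, "fox": 0.40}   # P(v | O) for observed v

posterior = {o: priors[o] * likelihood[o] for o in priors}
z = sum(posterior.values())                            # P(v), the normalizer
posterior = {o: p / z for o, p in posterior.items()}

print(max(posterior, key=posterior.get), posterior)
# Although the evidence favors "fox" (0.40), its low prior loses out to "cat":
# recognition trades off prior expectation against visual evidence.
```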


Figure 4.13 Utility of context in perception. (a,b) Objects without context. (c,d) Objects in context.

4.6.1 In-Context Object Recognition

Human perception of a visual signal from an image region, however, does not depend on the signal alone, but also on the context in which it is situated. This is illustrated in Figure 4.13, where it is difficult to recognize the objects on the top row from their visual features in isolation. However, they can be readily recognized when placed in context, as shown in the bottom row. Context plays an important role in recognition in resolving ambiguities, especially in degraded images. Though we restrict the term "context" to mean the visual signals received from the rest of the image in this section, it is indeed a much broader term.6 For the process of visual cognition, the context, in general, includes many other factors, such as the intentional state of the observer, the situation preceding an observation (memory), the geographical location, etc. Equation (4.4), which is the basic Bayesian formulation for vision, has been extended for in-context object recognition in Torralba (2003a). In general, the observed visual features 𝑣 used in the equation comprise the visual

6 Collins Online Dictionary defines the context of an event to be "the general situation that relates to it, and which helps it to be understood."


features of the object O and those of the context in which it is situated in an image. Thus, we can write 𝑣 = (𝑣l, 𝑣c), where 𝑣l and 𝑣c represent the features of the object and those of the context, respectively. For object recognition without consideration of context, 𝑣 is approximated with 𝑣l in Eq. (4.4), which may then be read as

$$P(O \mid v_l) = \frac{P(v_l \mid O) \cdot P(O)}{P(v_l)} \tag{4.5}$$

which means that only the visual features of the region occupied by the object, and not those of the background, are used for object detection. Classical object detection algorithms choose the region by trial and error, for example by employing sliding windows at different scales. Since an object generally occupies a small region in an image, the use of object features alone has the advantage of smaller feature vectors, which often makes the interpretation tractable. However, this simplification fails in many cases, as illustrated in Figure 4.13. In order to account for the context in object detection, we may substitute (𝑣l, 𝑣c) for 𝑣 in Eq. (4.4), which can then be expanded as

$$P(O \mid v) = P(O \mid v_l, v_c) = \frac{P(v_l \mid O, v_c) \cdot P(O \mid v_c)}{P(v_l \mid v_c)} \tag{4.6}$$

This equation is very similar to Eq. (4.5), except that all the entities are conditioned on the context features 𝑣c. As usual, the denominator remains a normalizing constant for a given observation. Thus, the probability of an object O given visual features 𝑣 depends on the product of the two terms in the numerator. The first term, P(𝑣l ∣ O, 𝑣c), represents the probability of finding the object features when the object appears in a given context. The inclusion of the context in this term accounts for the change in the object's appearance in different contexts. The second term, P(O ∣ 𝑣c), specifies the prior belief for the object to appear in a given context, which we explore in more detail below. An object instance O is characterized by a set of object properties, such as the class it belongs to (o), its location in the image (x), and its appearance (𝜎), comprising its scale, orientation, etc. Thus, we can write O = (o, x, 𝜎). With this substitution, the second term P(O ∣ 𝑣c) in Eq. (4.6) can be expanded as

$$P(O \mid v_c) = P(o, x, \sigma \mid v_c) = P(\sigma \mid o, x, v_c) \cdot P(x \mid o, v_c) \cdot P(o \mid v_c) \tag{4.7}$$

The last term in the equation expresses the prior probability of finding an object (of a given class) in a given context. The middle term represents the probability of finding it at different image locations, when it is present in a given context. The first term sets the expectation for the appearance (scale, orientation, etc.) of the object when it is present at a given location in a given context. To appreciate the significance of the terms, consider a simple example: (i) it is more likely to find a car in an outdoor scene than in a drawing room, (ii) in an outdoor scene, a car


is more likely to be on the road surface than on the pavement, and (iii) the appearance and the size of the car depend on where it is located in the scene. The contextual modulation of the priors, as well as of the object features, as illustrated above, enables reliable in-context object recognition. The representation of context calls for some discussion. We have defined context as the visual features of the entire image-space, excluding the object region. Generally, an object occupies a small region in an image, leaving a large area for other objects. Thus, 𝑣c has a very large dimensionality as compared to 𝑣l. The context could be represented at a higher scale of abstraction to make it compact. For example, a symbolic representation of the cutlery in Figure 4.13b could be a compact cue for detection of the plate. But this raises a chicken-and-egg problem: which object is recognized first? A solution to this problem comes from the Gestalt scientists' observation that we perceive the whole of a scene first and then the parts, rather than the other way round, which is aptly paraphrased as "forest before the trees" (Navon 1977). A mathematical formulation for representing and distinguishing several types of scene contexts, using a few compact statistical representations called the gist, has been demonstrated in Oliva and Torralba (2001). Such features can be computed inexpensively to establish the context and can be used to facilitate object detection.
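To see how the context prior modulates a decision, consider the following toy sketch; the classes, contexts, and probability values are illustrative assumptions of ours, and the sketch uses the simplified product form P(𝑣l ∣ O) · P(O ∣ 𝑣c), which is derived as Eq. (4.8) in the next subsection.

```python
# Contextual modulation of recognition: identical local evidence yields
# different decisions under different contexts.
local_likelihood = {"car": 0.30, "sofa": 0.25}     # P(v_l | O)
context_prior = {                                  # P(O | v_c)
    "street": {"car": 0.40, "sofa": 0.01},
    "indoor": {"car": 0.01, "sofa": 0.35},
}

for ctx, prior in context_prior.items():
    score = {o: local_likelihood[o] * prior[o] for o in local_likelihood}
    print(ctx, "->", max(score, key=score.get), score)
# The same ambiguous blob is read as a car on the street and as a sofa indoors.
```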

4.6.2 Synthesis of Bottom-Up and Top-Down Knowledge

We focus on yet another important aspect of context in vision, namely the need for a synthesis of top-down and bottom-up reasoning (Yuille and Kersten 2006). Making a simplifying Markovian assumption that P(𝑣l ∣ O, 𝑣c) does not depend on 𝑣c, so that it can be simplified to P(𝑣l ∣ O), we can rewrite Eq. (4.6) as

$$P(O \mid v) \propto P(v_l \mid O) \cdot P(O \mid v_c) \tag{4.8}$$

The term P(𝑣l ∣ O) represents generic bottom-up knowledge about the probable features of an object, and the term P(O ∣ 𝑣c) represents top-down, task-specific knowledge about the probability of occurrence of an object in a context. This shows that the interpretation of visual signals is a result of the synthesis of bottom-up and top-down information, which makes a visual system robust. As a simple example, consider the image of a human face on the top-left of Figure 4.14. Detection of low-level image features, such as lines and corners, is extremely difficult in this image, especially toward the left, because of shadows. Still, we do not find any difficulty in detecting a human face. In this example, detection is supported by a hypothesis (context) of a human face that has a well-defined structure, as shown on the right of the figure. The interpretation of the visual signal follows a two-way interaction. While the observed visual patterns provide evidence for the hypothesis of the human face, the hypothesis, in turn, reinforces


Figure 4.14 Vision as a synthesis of top-down and bottom-up processes.

the interpretation of the visual signals and confirms the mid-level features, like the location of the nose and the eyes.

4.6.3 Hierarchical Modeling

The basic approach of Bayesian modeling for the interpretation of visual signals, presented in Sections 4.6.1 and 4.6.2, works well for simpler image classification tasks. But it becomes extremely difficult to model complex scenes, which may comprise a large number of objects and their interactions. The interpretation becomes unreliable because of large intrinsic variations in the visual manifestations of the higher-level concepts. To cope with such situations, a Bayesian network is often constructed in a hierarchical fashion, where a higher-level concept is recursively modeled as a probabilistic manifestation of some lower-level concepts. The lowest level of "observable" concepts is modeled as a probabilistic manifestation of visual features. For example, Park and Aggarwal (2004) use a hierarchical Bayesian network (HBN), as shown in Figure 4.15, to estimate the action of a person, and the interaction between two persons, from the poses of human body parts. At the lowest level of the hierarchy, a set of feature detectors is employed to detect the sizes and orientations of atomic body parts, such as the forearm and the facial region, which are used to estimate the poses of composite body parts, e.g. head, torso, upper and lower limbs, at the intermediate level. At the highest level of the hierarchy, the poses of the composite body parts are combined to estimate the overall pose of a person. The prior probability distribution tables for the network are estimated from training data using a generative model with two parameters representing geometric transformations (body and camera poses) and environmental conditions (e.g. illumination). Going further, the authors use a dynamic Bayesian network to model the human interactions, where an interaction is modeled through a sequence of changes in body-part poses. The hierarchical structure of the Bayesian network


Figure 4.15 A hierarchical Bayesian network for human body pose estimation. Source: Redrawn after Park and Aggarwal (2004) with permission from the authors.

and the restriction of temporal dependencies to the body parts make the probability computations in the network tractable. The approach followed by Park and Aggarwal (2004) requires labeled data for the intermediate concepts (body-part poses), which may be difficult to create in several problem instances. For example, a visual scene may consist of many different observable low-level themes, such as rocks, trees, clouds, and buildings, and annotating each of them in a large number of training images can be extremely tedious. Following the latent Dirichlet allocation (LDA) based approach to the document classification problem (Blei et al. 2003), the intermediate themes (rocks, trees, etc.) have been modeled as latent topics, with a visual codebook obtained from local image features (also called a bag-of-words) as the observable words (Fei-Fei and Perona 2005). A scene is considered to be a probabilistic distribution over the latent themes, which are in turn considered to be probabilistic distributions over the visual code-words. The parameters of the probabilistic distributions at either of the two levels of hierarchy cannot be learned directly in the absence of labeled data for the intermediate nodes. Generative models are therefore defined for the parameters of the distributions, with an appropriate set of hyper-parameters, in a hierarchical Bayesian model. The model is learned from training data in which only the high-level concepts (scene categories) are labeled. The approach has been extended in Wang et al. (2009), where an even higher level of knowledge has been assumed and learned, to categorize human activities in crowded videos in a completely unsupervised manner.
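The flavor of inference in such a hierarchy can be conveyed with a much smaller, hand-built network; the structure and the probability tables below are illustrative assumptions of ours, far simpler than the cited models. A latent overall pose generates body-part states, which in turn generate noisy detector outputs, and the posterior over the pose is obtained by enumeration.

```python
# A miniature two-level Bayesian network: pose -> part states -> detections.
import itertools

P_H = {"standing": 0.6, "sitting": 0.4}
P_part = {                           # P(part is "upright" | H)
    "standing": {"head": 0.9, "torso": 0.95},
    "sitting": {"head": 0.8, "torso": 0.20},
}
P_detect = {True: 0.85, False: 0.10}  # P(detector says "upright" | part state)
obs = {"head": True, "torso": False}  # detector outputs for one frame

posterior = {}
for h in P_H:
    total = 0.0
    for head, torso in itertools.product([True, False], repeat=2):
        p = P_H[h]
        p *= P_part[h]["head"] if head else 1 - P_part[h]["head"]
        p *= P_part[h]["torso"] if torso else 1 - P_part[h]["torso"]
        p *= P_detect[head] if obs["head"] else 1 - P_detect[head]
        p *= P_detect[torso] if obs["torso"] else 1 - P_detect[torso]
        total += p
    posterior[h] = total

z = sum(posterior.values())
print({h: round(p / z, 3) for h, p in posterior.items()})  # "sitting" wins
```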

4.6.4 One-Shot Learning

Modeling a scene with hierarchical Bayesian models (Fei-Fei and Perona 2005) results in the acquisition of higher-level knowledge, which can be reused to quickly


train a classifier with few training examples.7 This principle has been used in Fei-Fei et al. (2006) to realize one-shot learning of object categories. The authors created a parts-based model for the objects to be classified, and used a simple local appearance feature to detect the parts. The parts were considered to be latent variables manifesting in appearance features, and mixture models were used at both levels of the hierarchy. A few object classes with sufficient training data were used to train the model parameters. The meta-model was then used as the prior for the model for classification of a much larger number of object classes. It was observed that the model stabilized faster, and with less training data, when the meta-model learned from a few classes was used as the prior.

4.7 Visual Aesthetics

Visual aesthetics (distinct from visual quality) is the attribute of images that is responsible for evoking positive emotion in the human mind. Some sights, be it a person's face, the colors of a dress, a natural scene, or a piece of artwork, give us more visual pleasure than others. Much of the research on visual aesthetics has focused on man-made artwork, like photographs, paintings, and videos. Though aesthetics is by and large a subjective matter,8 user ratings and comments on photographs posted on websites amply demonstrate that some of them are more appreciated than others.9 This points to the possibility of establishing some objective criteria for predicting the aesthetic quality of an image. Since there are no known rules that define beauty, researchers have taken recourse either to empirical models, or to statistical machine learning methods. The goal has been to create either a classifier that distinguishes images with high aesthetic value from the others, or a regression model that assigns an aesthetic score to an image. The key theme of the research has been to discover an appropriate feature set for modeling aesthetics. Intuitively, color appears to be a major contributor to the aesthetics of an image. Psychological studies have revealed that, despite differences across individuals, ages, genders, and cultural backgrounds, there are some colors that are generally preferred. A plausible theory to explain color preferences, the ecological valence theory (EVT) (Palmer and Schloss 2010), suggests that the liking for a color depends on its correlation with objects that are desired or hated. For example, blue and cyan, the colors of clear sky and clean water, are universally liked, whereas brown

7 See discussions on one-shot learning in Chapter 3.
8 It motivates Kairanbay et al. (2019) to predict a photographer's demography, and even identity, by analyzing the aesthetic attributes of a photograph.
9 The user ratings may not relate exclusively to the aesthetic appeal of the photo, but may reflect other considerations, such as familiarity with the subject.


or yellowish brown, the colors of feces and of rotting flesh, are generally disliked. The theory explains personal variations by associating special values with colors for various groups; for example, people with a strong bonding to a community show preference for colors associated with that community. The theory also explains preferences for some other aesthetic parameters, for example the preference for rounded contours over sharp corners, since the latter provide an unpleasant tactile experience. It has also been observed that the preferences for colors often depend on the objects on which they are displayed, depending on contextual "appropriateness" (Schloss et al. 2013), which is often culture-specific. Color combinations also play a big role in the aesthetic assessment of an image. Several models for color harmony have been proposed by researchers (Weingrel and Javoršek 2018). The empirical models suggest a "balance" of the perceptual properties of colors (hue and lightness) and of the areas covered by the individual colors to improve color harmony. The statistical models are based on studies of human preferences. A statistical model of color harmony has been proposed in Nishiyama et al. (2011), where an image is treated as a bag of colors.10 Images are divided into small patches and quantized, and the quantized patches are designated as the color words. Color harmony is modeled as a probability distribution of the color words across the image. Lu et al. (2015b) extend the model with an LDA-based approach, where color harmony is assumed to be a distribution over a set of color topics, which are in turn distributions over observable color words in an image. The parameters of the distributions are learned with a hierarchical Bayesian model, where the priors have been derived from empirical color harmony models. The color harmony models find many practical applications, such as in web-page design (Bonnardel et al. 2011), garment recommendation (Goel et al. 2017a), and so on. Further studies on image aesthetics show that the spatial structure of the image, orientation of edges, symmetry, complexity, blur estimates, balancing, and many other such parameters influence image aesthetics. Researchers have also experimented with common photographic techniques, such as aspect ratio, the "rule of thirds," placement of foreground objects, etc. (Datta et al. 2006; Ke et al. 2006; Bhattacharya et al. 2010). Many of the features are motivated by statistics observed in natural scenes. It is reasonable to expect that different categories of images, such as close-ups and landscapes, should have different parameters for aesthetics, which has been confirmed in Tang et al. (2013). Temel and AlRegib (2014) highlight the importance of the spatial distribution of the features for image aesthetics. Controlled data sets, annotated with aesthetic, semantic, and photographic style, for conducting experiments with statistical image aesthetics models have been reported in Murray et al. (2012).

10 This is similar to the bag of words model for image feature representation.
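Returning to the bag-of-colors representation described above, it is straightforward to sketch; the patch size, quantization levels, and stand-in image in the following code are our own assumptions, not those of Nishiyama et al.

```python
# Bag of colors: mean patch colors are quantized into "color words", and the
# image is summarized by their distribution.
import numpy as np

def color_word_histogram(img, patch=8, levels=4):
    h, w, _ = img.shape
    words = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            mean = img[i:i + patch, j:j + patch].reshape(-1, 3).mean(axis=0)
            q = np.minimum((mean / 256 * levels).astype(int), levels - 1)
            words.append(q[0] * levels ** 2 + q[1] * levels + q[2])
    hist = np.bincount(words, minlength=levels ** 3).astype(float)
    return hist / hist.sum()                    # distribution over color words

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64, 3))         # stand-in image
print(color_word_histogram(img).round(3))
```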


While it has been recognized that image aesthetics is determined by several facets, it is not clear whether the overall aesthetics is an aggregate of the aesthetics determined by the individual facets, or a holistic (gestalt) entity that emerges from these components (Palmer et al. 2013). A few researchers have explored the aesthetics of videos as well. Static image features pooled over multiple frames have been used in Moorthy et al. (2010) to assess video aesthetics. Bhattacharya et al. (2013) follow a hierarchical approach to define video aesthetics at cell, frame, and shot levels, and fuse them to assess overall video aesthetics. Moreover, the authors introduce the human affects present in the video as a parameter. Bettadapura et al. (2016) use color vibrancy, composition, and symmetry to discover "picturesque" shots from personal vacation videos.

4.8 Conclusion

In this chapter, we have analyzed several processes of perception and cognition. The processes discussed can broadly be categorized into two parts. The contents of Sections 4.1–4.4 refer to lower-level perceptual processes, while higher-level cognitive functions are dealt with in Sections 4.6 and 4.7. A common thread that binds these sections is that most of the processes can be explained by the Bayesian framework of reasoning. Natural scene statistics play an important role in the Bayesian formulation by defining the priors and the conditionals. We conclude the chapter with the remark that Bayesian theory has been successfully employed to explain the perceptual and cognitive processes for sensory organs other than the eyes as well, and also the higher-level processes that integrate them.


5 Visual Attention

In order that a situated agent can effectively interact with its environment, the data received through its sensors need to be analyzed and responded to in real-time. The processing power on an agent may not be sufficient to process the large amount of data that it receives through its sensors. In particular, the volume of visual data that an agent receives far exceeds the data received through its other sensory channels. To appreciate the magnitude of visual data, note that a single high-definition video-stream with a resolution of 1280 × 720 pixels at a frame-rate of 30 fps results in a raw information rate of more than 650 Mbps! Thus, an agent needs to selectively focus on a narrow window of visual stimuli that conveys the most pertinent information. In the human visual system, the foveal vision, where images are treated with maximum acuity, is restricted to only about 2° of the visual field. A human being selectively fixates his foveal vision on the most informative region of a scene at a time. The process of selecting the visual region to fixate on is known as visual attention. Moreover, the focus of attention is not static, but shifts over time. The rapid movement of gaze from one region of fixation to another is known as a saccade. Humans can acquire the important information in an image by scanning the informative regions of a scene through alternate fixations and saccades, without having to analyze every pixel in the image. For example, when presented with an image as shown in Figure 5.1a, most people will scan only a small part of the image, as shown with the fixation density map (heat map of eye tracking data) in Figure 5.1b, to interpret its contents. The process of attention significantly reduces the cognitive load on the biological vision system. It is essentially regarded as a preprocessing step for higher-level visual cognitive tasks. Taking a cue from the biological world, emulation of visual attention has gained significant interest in computer vision research. The goal of this chapter is to provide an overview of the diverse approaches and techniques in modeling visual attention, and of the methods for their evaluation. In this chapter, we restrict ourselves to the classical attention models, which were developed before the advent

72

5 Visual Attention

(a)

(b)

Figure 5.1 Where people look at. (a) Example Image. (b) Attention heat map. Source: Reproduced from Judd et al. (2009).

of deep neural networks in the computer vision tasks. Some of the deep learning based models will be discussed in Chapter 8.
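For concreteness, the raw data rate quoted above follows from a simple calculation, assuming 24-bit RGB pixels:

$$ 1280 \times 720 \ \tfrac{\text{pixels}}{\text{frame}} \times 24 \ \tfrac{\text{bits}}{\text{pixel}} \times 30 \ \tfrac{\text{frames}}{\text{s}} \approx 6.64 \times 10^{8} \ \text{bits/s} \approx 663.6 \ \text{Mbps} $$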

5.1 Modeling of Visual Attention

We find several approaches to modeling attention in the literature. We present an overview of the approaches in this section before taking a plunge into the specific attention models. At the outset, visual attention has been modeled in two distinct contexts. In a free-viewing situation, when one gazes at a scene without any specific goal, the attention is naturally guided by some inherent properties of image regions. For example, a bright red rooftop pops out amidst a scene with predominant vegetation. This is known as bottom-up attention, and is involuntary. It is guided by the conspicuity of the regions in an image, and not by the will of the viewer. In contrast to the free-viewing situation, when a person has a specific task at hand, the attention is deliberately modulated by the requirements of the task. For example, while searching for a red apple in a fruit basket, attention focuses on the objects with red color and a shape like that of an apple. A classical experiment by Yarbus (1967) shows that the eye movements of the same subject on encountering the same scene follow different patterns while trying to answer different questions regarding the scene (see Figure 5.2). Such task-driven voluntary selection of visual stimuli is known as top-down attention. While the studies of Yarbus (1967) and some other researchers are based on still images, Hadnett-Hunter et al. (2019) confirm the task-dependency of human attention in dynamic virtual reality environments (e.g. games) with free-viewing, navigation, and search tasks.


Figure 5.2 Examining a picture “A room with a family with an unexpected visitor” with different questions to answer. (a) Free examination. (b) Estimate the material circumstances of the family in the picture. (c) Give the ages of the people. (d) Surmise what the family had been doing before the arrival of the ‘unexpected visitor’. (e) Remember the clothes worn by the people. (f) Remember the position of the people and objects in the room. (g) Estimate how long the unexpected visitor had been away from the family. Source: Reproduced from Yarbus (1967).


Early models for bottom-up visual attention have been based on local contrasts that are detected in the early stage of biological vision. The models postulated that the image regions with higher contrast attract more visual attention, and predict graded saliency values at different locations in an image. These models are known as pixel-based models, and are said to implement soft attention. Later research has found that certain holistic objects, like human faces or man-made artifacts, rather than local contrasts, call for human attention (Einhäuser et al. 2008). Attention models based on this theory predict the saliency of “super-pixels”, each representing an object or its part. They are known as region-based or object-based saliency models. These models generally predict binary saliency values for the image regions, i.e. an image region is classified as salient or not. This is known as hard attention. While most of the attention models make use of intrinsic cues derived from within an image (or a video) only, some of them use extrinsic cues as well. Common extrinsic cues are statistics collected from other images, depth maps, and image annotations. When the extrinsic cues are derived from past observations, a model is said to be an experiential model. Besides visual features, the context of an object often plays an important role in determining its conspicuity. A car on a street (where it is naturally expected) may not draw much attention, but a car on a rooftop (where it is unexpected) will certainly be conspicuous. In task-driven top-down attention, human eye fixations are guided by the requirement of obtaining specific information at the right time, and are often guided by the experience of an agent. With the knowledge that the target may appear in a specific region of an image, attention is directed toward that location. For example, while looking for pedestrians in an urban scene, one restricts the attention to the pavements. Psychological experiments by several researchers, e.g. Posner (1980), confirm that repeated spatial cues help a person in locating a target pattern faster. In task-driven top-down attention, mapping the task requirements to the image features often proves to be a significant challenge (Borji et al. 2014). Different tasks may demand different models of attention. For complex tasks that extend over time, top-down attention modeling requires analysis of the spatiotemporal aspects of the task. For example, in a game of cricket, a batsman fixates on the wrist of the bowler during the delivery and then on the bounce point just ahead of the impact, to estimate the trajectory of the ball. Peripheral vision often plays an important role in determining eye movements. For example, while driving a car, the foveal vision overtly fixates on the road in the front, while the peripheral vision covertly scans the traffic signals and any indication of pedestrians about to step onto the road. The covert (peripheral) attention is a quick way to scan the field of vision and to locate the potential regions of interest, where overt (foveal) attention may subsequently fixate. This is known as the recurrent attention model. It has been generally agreed on by behavioral scientists that top-down and bottom-up attentions act together to guide the eye movements in the biological vision system. The inherent features of the scene and the task-specific goals together determine the eye fixations. Top-down and bottom-up attentions may often be conflicting. For example, in a visual search task, while the top-down attention tries to focus on the object features, involuntary attention may direct the gaze to a distractor with a high local contrast, impairing the search (Treisman and Gelade 1980). From the previous discussions, we find that visual attention in the biological world is guided by a complex interplay of several factors involving the goal of the agent and the singularities in the environment. Image features, scene changes, target properties, task-related knowledge, experience, and many such factors are responsible for guiding gaze toward the most informative parts of a scene. Accounting for such factors makes modeling visual attention a complex scientific endeavor (Yantis 2000; Borji et al. 2014).

5.2 Models for Visual Attention

The goal of research on classical models of attention has been to find an appropriate feature set that can model visual attention. Task-specific adaptation of the feature set for top-down attention has also been explored together with models of bottom-up attention. While the area is vast, we present a representative cross-section of the research in the following text.

5.2.1 Cognitive Models

One of the earliest successful models of bottom-up visual attention, proposed in Itti et al. (1998), has been motivated by the theory that early vision in biological systems can discern local contrasts. In this model, the saliency of an image location is computed by combining multi-scale local contrasts computed over a few elementary image features. The choice of image features has been guided by neurophysiological theories (see Chapter 2) and includes intensity, color contrasts (red-green and blue-yellow), and orientation (response to differently oriented wavelets). The model is schematically shown in Figure 5.3. While computing the local contrasts, the authors have used a center-surround model, where the central region of a locality is represented with higher resolution than the surroundings, thereby emulating biological foveal and peripheral vision. This principle has been commonly followed in many later implementations of attention models. The local contrasts computed from the different features are normalized before being combined, to provide equal weights to the different features and to reduce the effect of continuous texture.


Figure 5.3 Computation model for attention with feature integration theory. Source: Reproduced from Itti et al. (1998) with permission from the authors.
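The pipeline of Figure 5.3 can be prototyped compactly. The sketch below (in Python with NumPy/SciPy) is a minimal single-scale rendering of the idea: it omits the multi-scale image pyramids and the orientation (Gabor) channel of the original model, its normalization step is only a crude stand-in for the N(·) operator of Itti et al. (1998), and all parameter values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(channel, center_sigma=2.0, surround_sigma=8.0):
    """Center-surround contrast: difference between a fine-scale (center)
    and a coarse-scale (surround) Gaussian-blurred version of a channel."""
    return np.abs(gaussian_filter(channel, center_sigma) -
                  gaussian_filter(channel, surround_sigma))

def normalize(fmap):
    """Rescale to [0, 1] and promote maps with a few strong peaks,
    loosely mimicking the spirit of the N(.) normalization operator."""
    fmap = (fmap - fmap.min()) / (np.ptp(fmap) + 1e-8)
    return fmap * (fmap.max() - fmap.mean()) ** 2

def saliency_map(rgb):
    """Combine intensity and color-opponency conspicuity maps."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    intensity = (r + g + b) / 3.0
    rg = r - g                    # red-green opponency
    by = b - (r + g) / 2.0        # blue-yellow opponency
    maps = [center_surround(ch) for ch in (intensity, rg, by)]
    return sum(normalize(m) for m in maps) / len(maps)
```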

A saliency map for an image depicts the saliency values computed at its different locations. It is regarded as a model for the activation of neurons representing the corresponding visual field. Two policies (Koch and Ullman 1985) guide the temporal fixations over the image.

1. The winner-take-all (WTA) policy suggests that only the output of the neurons with the strongest activation is forwarded for further processing in the visual cortex, filtering out all other activations. This policy guides the attention to the most salient location of the image.
2. The return inhibition policy postulates that the activations of the neurons that have already been attended to are inhibited from being forwarded at a later time. This policy prevents attention from being perpetually fixed at the most salient location on the saliency map, and enables the gaze to traverse the image regions in order of decreasing saliency values.

Many of the later attention models use different methods for computing saliency, but they commonly assume these policies to guide the foveal attention.
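Together, the two policies yield a simple scan-path generator over any saliency map; a minimal sketch follows (the fixation count and inhibition radius are arbitrary illustrative parameters):

```python
import numpy as np

def scan_path(saliency, n_fixations=5, inhibition_radius=20):
    """Generate a fixation sequence from a saliency map by repeatedly
    applying winner-take-all and inhibition of return."""
    s = saliency.copy()
    h, w = s.shape
    yy, xx = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)  # WTA: strongest activation wins
        fixations.append((y, x))
        # Inhibition of return: suppress the attended neighborhood.
        s[(yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2] = 0.0
    return fixations
```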


The model presented by Itti and colleagues has been quite successful in explaining the observed eye-movement patterns in the free-viewing condition, and continues to be a reference model for comparison and benchmarking by many later researchers. Neurological experiments show that the involuntary bottom-up attention is modulated by goal-directed influences (Corbetta and Shulman 2002). Based on this observation, the bottom-up model presented in Itti et al. (1998) has been adapted to a top-down attention model in Navalpakkam and Itti (2006). The weights assigned to the different features are biased by using learned statistics of the local features of the target and the distracting clutter. The goal is to maximize the overall saliency of the target relative to the surrounding clutter in the task-specific context. While the feature set used in Itti et al. (1998) remains the baseline, later researchers have extended it with many new features, e.g. local motion features (for videos) (Itti et al. 2003), and gestalt and semantic features like text, faces, symmetry, convexity, etc., which are believed to draw biological attention (Xu et al. 2014). Addition of these features has been found to produce better results for top-down attention in different application contexts (Borji et al. 2014). An important issue in attention modeling using multiple features is fusing the conspicuity maps for the individual features to compute the overall saliency of a location. Since the features are diverse, the feature maps need to be aligned. This is generally achieved through normalization (Itti et al. 1998) or relative weighting (Frintrop et al. 2005). There is also some debate on when the conspicuity maps are integrated. Many of the attention models, like Itti et al. (1998), use an early fusion strategy, where the saliency is computed after integrating the individual feature maps. Some authors question the validity of this approach and propose late fusion (Khan et al. 2009). They argue that filtering the image locations progressively with independent feature maps, one at a time, can significantly reduce the cognitive load, and should be a natural choice. For example, while looking for a red ball amidst distractors, attention may first select all red objects in a scene, and then look for a circular shape among them. The precedence of the features for the late fusion strategy has also been debated. Cognitive scientists generally believe that color features dominate the others, but there is no agreement on the precedence of the other features.
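The feature-biasing idea admits a compact sketch. The function below is a schematic rendering of the weighting principle, not the exact formulation of Navalpakkam and Itti (2006): each feature map is weighted by the ratio of its mean response over annotated target regions to that over distractor regions.

```python
import numpy as np

def biased_saliency(feature_maps, target_mask, distractor_mask):
    """Weight each bottom-up feature map by the signal-to-noise ratio of
    target vs. distractor responses, learned from annotated examples,
    and fuse the weighted maps into a top-down biased saliency map."""
    weights = np.array([fmap[target_mask].mean() /
                        (fmap[distractor_mask].mean() + 1e-8)
                        for fmap in feature_maps])
    weights /= weights.sum()                  # normalize the feature weights
    return sum(w * f for w, f in zip(weights, feature_maps))
```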

5.2.2 Information-Theoretic Models

In information-theoretic models, the researchers have based their saliency computation on Shannon’s information theory (see Chen (2016) for a short introduction). The basic assumption behind the models is that the targets (of attention) are small and the image features representing them are sparse in a scene. According to these models, image regions that contain the least probable image features have the highest information content and are the most salient locations. A model for Attention based on Information Maximization (AIM), built on this theory, has been proposed by Bruce and Tsotsos (2005). Departing from the model-based cognitive approach, the authors use an orthogonal set of image features, learned from training data. The computation of saliency is based on the sparsity of these learned features, called the self-information of an image location. The authors demonstrate that the model of saliency can be computed using a neural network that emulates the early stage of a biological vision system. Some authors have used the local complexity, or entropy, of an image location, rather than global sparsity, as a measure of its saliency (Gilles 1998; Kadir and Brady 2001). A wavelet-based measure of information content and saliency has been proposed in Sebe and Lew (2003). The use of the local complexity of an image location as its measure of saliency, however, has an inherent drawback. For example, while human vision generally regards an egg in a nest to be more salient than the nest, the models based on local entropy predict the opposite, because of the inherent texture of the latter.
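The self-information principle is easy to demonstrate. The sketch below scores each location by −log p(feature), with the feature distribution estimated from the image itself; the single quantized scalar feature is a stand-in for the learned sparse basis used in AIM.

```python
import numpy as np

def self_information_map(features, n_bins=64):
    """Saliency as self-information: locations whose feature values are
    rare in the image carry more information. `features` is an (H, W)
    array of scalar feature responses."""
    hist, edges = np.histogram(features, bins=n_bins)
    p = hist / hist.sum()                        # empirical feature distribution
    idx = np.digitize(features, edges[1:-1])     # bin index of each location
    return -np.log(p[idx] + 1e-12)               # -log p(feature) per location
```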

5.2.3 Bayesian Models

Information-theoretic models use intrinsic cues from an image to determine the “unexpectedness,” or the information value, of an image location. The Bayesian models generalize this by bringing in an experiential factor. The expectations about the image (or video) features are derived from the experience, or the earlier observations, of an agent. The proponents of the Bayesian models argue that the notion of unexpectedness of a visual feature is guided by the experience of an agent, and computing spatial saliency alone is not sufficient. For example, an object that suddenly disappears from a scene catches attention, though it may not have been very conspicuous in the earlier frames of the video, where it was present. Further, a similar movement pattern over a prolonged period of time, e.g. the movement of leaves in a gentle breeze, ceases to draw attention after a time. In order to model this experiential aspect of attention, some researchers have introduced novelty, i.e. the newness of the information presented (Friedman and Russell 1997; Marsland et al. 2000; Gaborski et al. 2004). The two aspects of attention, conspicuity and novelty, are effectively blended to compute saliency in Bayesian models. In the Bayesian model for bottom-up attention developed in Itti and Baldi (2005), “surprise” is measured in terms of the deviation of the observed data from the expectation based on the past experience of an agent. If $M \in \mathcal{M}$ represents a model of the world and $D$ represents some observed data, the posterior probability for the model on observing the data, according to Bayes’ theorem, is given by

$$ p(M \mid D) = \frac{P(D \mid M) \, p(M)}{P(D)} \qquad (5.1) $$

where $p(M)$ and $P(D)$ represent the priors for the model and the observed data, respectively, and $P(D \mid M)$ is the conditional probability for the data $D$ when the model $M$ is valid. The surprise factor of the data $D$ for the model class $\mathcal{M}$ of the world is measured by the change of the posterior beliefs in the models from the prior beliefs, aggregated over all possible models $M \in \mathcal{M}$. The Kullback–Leibler divergence (KLD) between the probability distributions $p(M)$ and $p(M \mid D)$ can be used as a measure of such deviation. Thus, the surprise factor of an observed data $D$ is given by

$$ \mathcal{S}(D) = \mathrm{KLD}\bigl(p(M), \, p(M \mid D)\bigr) = \int_{\mathcal{M}} p(M) \log \frac{p(M)}{p(M \mid D)} \, dM \qquad (5.2) $$

Using Eq. (5.1), the above equation can be simplified as

$$ \mathcal{S}(D) = \log P(D) - \int_{\mathcal{M}} p(M) \log P(D \mid M) \, dM \qquad (5.3) $$

When an agent is situated in the real world, it keeps on observing image frames in succession, and the prior probability of the models builds over time. Thus, the posterior probability $P(M)_{n-1}$ after observing the data in frame $D_{n-1}$, which represents the accumulation of experience from frame 1 to frame $n-1$, is used as the prior while observing frame $n$. The decision-theoretic model of attention (Gao et al. 2009), where the features are selected on the basis of their discriminatory power to distinguish between the different visual concepts, is shown to be a further generalization of this model. In this model, the discriminatory power of a feature is determined by the maximization of mutual information between the features and the visual concepts.
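The recursive prior-to-posterior update lends itself to a per-pixel implementation. The sketch below assumes a Gaussian belief over each pixel’s intensity with a fixed, known observation noise, which is an illustrative simplification of the model families used by Itti and Baldi (2005); the surprise at each frame is the closed-form KL divergence between the prior and posterior Gaussians.

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    """KL divergence KL(N0 || N1) between univariate Gaussians (elementwise)."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

class SurpriseDetector:
    """Per-pixel Bayesian surprise: a Gaussian belief over each pixel's
    intensity is updated frame by frame, and the surprise is the KL
    divergence between the prior and posterior beliefs (cf. Eq. (5.2))."""
    def __init__(self, shape, obs_var=25.0, prior_var=100.0):
        self.mu = np.zeros(shape)              # prior mean per pixel
        self.var = np.full(shape, prior_var)   # prior variance per pixel
        self.obs_var = obs_var                 # assumed observation noise

    def update(self, frame):
        # Conjugate Gaussian update: precisions add, means combine.
        post_var = 1.0 / (1.0 / self.var + 1.0 / self.obs_var)
        post_mu = post_var * (self.mu / self.var + frame / self.obs_var)
        surprise = kl_gauss(self.mu, self.var, post_mu, post_var)
        self.mu, self.var = post_mu, post_var  # posterior becomes the next prior
        return surprise
```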

5.2.4 Context-Based Models

A comprehensive model for predicting location-based saliency that incorporates bottom-up conspicuity, top-down influence, as well as contextual reasoning has been presented in Oliva et al. (2003). It uses a Bayesian framework to combine sensory inputs with some prior constraints. Equation (4.6) (see Chapter 4) provides a formulation for the probability of finding an object (target) at a certain location in an image with a certain appearance. The observed image feature $v$ at an image location is an overlay of two distributions: (i) $v_l$, which represents the probability distribution of the image feature due to a target, and (ii) $v_c$, which represents the distribution of the image feature for the context (background). With minor reorganization, we can rewrite the equation as

$$ P(O \mid v_l, v_c) = \frac{1}{P(v_l \mid v_c)} \cdot P(v_l \mid O, v_c) \cdot P(O \mid v_c) \qquad (5.4) $$

The right-hand side of Eq. (5.4) is a product of three terms, each of which represents an aspect of attention. The first term, $\frac{1}{P(v_l \mid v_c)}$, is independent of the target characteristics. It represents the unlikeliness of the feature $v_l$, given the context feature $v_c$, i.e. the information value of the location. It defines the bottom-up saliency of a location in the image. In the equation, $O = (o, x, \sigma)$ represents the complete information about an object instance in an image, including its class, location in the image, and its appearance. Substituting $(o, x, \sigma)$ for $O$, we can expand the third term in the equation as

$$ P(O \mid v_c) = P(\sigma \mid x, o, v_c) \cdot P(x \mid o, v_c) \cdot P(o \mid v_c) \qquad (5.5) $$

We have discussed the significance of the terms in this expansion in Chapter 4. Recall that the third and the second terms in the equation define the probability for an object class to appear in a scene context, and the probability of the object to appear at a location in a scene, assuming that it appears in the scene. The product of these two terms determines the task-specific context-driven saliency of a location. Thus, the overall saliency of an image location in the context of an object detection task has two contributing factors: (i) the conspicuity of the location, given by $\frac{1}{P(v_l \mid v_c)}$, and (ii) the task-specific context-driven saliency, given by $P(x \mid o, v_c) \cdot P(o \mid v_c)$. The contextual model of attention, depicted in Figure 5.4, suggests that the eye fixates at the location $x$ where the probability of finding the target is the highest, i.e. the location where $P(O \mid v)$ attains a maximum.

Figure 5.4 Contextual attention model. Source: Reproduced from Oliva et al. (2003) with permission from the authors.


We have not yet discussed the significance of some of the terms in Eqs. (5.4) and (5.5). The first term in Eq. (5.5), $P(\sigma \mid x, o, v_c)$, predicts the appearance of the object at a certain location in a scene. The second term in Eq. (5.4), $P(v_l \mid O, v_c)$, predicts the image feature $v_l$ when an object with a certain appearance is present at an image location. Generally, the term can be approximated with $P(v_l \mid O)$, since the context may not have a role in defining the object features. The two terms represent task knowledge and guide the target search process (Torralba 2003b). The first term is helpful in constraining the search space (scale and aspect ratio) for a target around any given location of a scene, while the second term guides the feature selection. Details of the computational methods used for in-context target search are presented in Oliva et al. (2006). It has been postulated that contextual saliency (comprising bottom-up saliency and the contextual prior) is evaluated in the pre-attentive stage of vision, before the first saccade is deployed. Recognition of target appearance generally needs a longer time, particularly when it involves a complex combination of image features from a local image region (Oliva et al. 2006). Further, the prior probability values used in the model are learned from a large number of images and not from the current image alone. Thus, these probabilities represent the experiential knowledge of an agent.
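Schematically, the two contributing factors can be combined by a pointwise product. The sketch below assumes the bottom-up saliency map and a learned location prior are supplied as arrays; the appearance term $P(\sigma \mid x, o, v_c)$ is omitted for simplicity, so this is an illustration of the structure of Eq. (5.4) rather than Oliva et al.’s full computation.

```python
import numpy as np

def contextual_saliency(bottom_up, location_prior, object_prior=1.0):
    """Pointwise combination of bottom-up conspicuity, 1/P(v_l | v_c),
    with the context-driven terms P(x | o, v_c) and P(o | v_c)."""
    s = bottom_up * location_prior * object_prior
    return s / (s.max() + 1e-8)

# For a pedestrian search, location_prior could be a horizontal band over
# the pavements, learned from annotated street scenes.
```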

5.2.5 Object-Based Models

The models described in Sections 5.2.1–5.2.4 compute pixel-wise saliency maps based on some local image features. Many psychological experiments have indicated that “interesting” objects and man-made artifacts, rather than local image features, draw attention in biological vision (Einhäuser et al. 2008) (though Borji et al. (2013) refute this claim). It has been observed that combining object detection with saliency based on low-level image features can predict human gaze better than saliency alone (Cerf et al. 2008; Judd et al. 2009). Object saliency determination has been modeled as a binary classification problem in Liu et al. (2011) using energy minimization in a conditional random field (CRF). Each node of the CRF network represents an image patch. The network has been trained with a large corpus of images, each labeled with its most salient object. The network encourages all the patches in a super-pixel to be uniformly labeled by imposing a penalty on neighboring patches that are similar in appearance but labeled differently. The saliency computation is based on a few local, regional, and global features. The network is trained with varieties of salient objects, so that it may learn the generic properties that make an object salient. A method to compute the saliency of a-priori defined nonoverlapping image regions has been proposed in Cheng et al. (2015). The most salient region, in this approach, is determined based on the weighted sum of the color contrasts of a region with all other regions (a schematic sketch appears at the end of this section). The weights incorporate a function of the spatial distance between the regions to encourage closer regions to have uniform saliency values. The authors have also proposed a saliency-cut method, similar to grabcut (Rother et al. 2004) but using saliency values for the image features, to determine the accurate contour of a salient object. While the above methods use low-level image features, the research of gestalt scientists suggests that the saliency of an image region is more likely to be guided by emergent higher level features. This suggestion prompts Xu et al. (2014) to present a generic model of object-based saliency that combines pixel-level, object-level, and semantic-level features. The pixel-level features chosen by the authors are the same as those in Itti et al. (1998). The object-level features include some generic properties of (foreground) objects, such as shape, complexity, convexity, etc. (see Chapter 4). Semantic-level features include properties of a few semantic categories of objects that are believed to draw human attention. The authors identify four such semantic categories, namely

1. those relating directly to humans, e.g. face, emotion, gaze, etc.,
2. objects with implied motion, such as cars and airplanes,
3. those related to other (nonvisual) senses, like flowers (smell) and musical instruments (audio), and
4. those designed to attract attention or for interaction with humans, for example, text and other man-made objects like computers and mobile handsets.
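The region-contrast computation referenced above can be sketched as follows. The inputs (per-region mean Lab colors, normalized centroids, and pixel counts) and the Gaussian spatial weighting are our simplifying assumptions, not the exact published formulation of Cheng et al. (2015).

```python
import numpy as np

def region_contrast_saliency(mean_colors, centroids, sizes, sigma2=0.4):
    """Saliency of each pre-segmented region as the size-weighted color
    contrast with every other region, attenuated by spatial distance so
    that contrast with nearby regions dominates."""
    n = len(mean_colors)
    saliency = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_color = np.linalg.norm(np.asarray(mean_colors[i]) -
                                     np.asarray(mean_colors[j]))
            d_space = np.linalg.norm(np.asarray(centroids[i]) -
                                     np.asarray(centroids[j]))
            w = sizes[j] * np.exp(-d_space ** 2 / sigma2)  # spatial weighting
            saliency[i] += w * d_color
    return saliency / (saliency.max() + 1e-8)
```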

5.3 Evaluation

Computational models of attention are an outcome of reverse engineering of the biological attention system. Thus, a natural way of validating them is to compare their predictions with human eye movements in response to a visual stimulus in a given context (free-viewing or task-specific situation). This requires the creation of human eye-movement databases and evaluation metrics to measure the closeness of the predictions of the computational models to the actual eye movements of human subjects. Human eye-movement data have been collected in several laboratories, and processed to create fixation density map (FDM) databases that provide normalized amplitudes for the fixation points in a set of images, averaged over a number of subjects. The available FDM databases differ in the hardware used for data collection, the demography of the subjects, the environmental conditions of the experimental setups (including presentation time), and the test images (Engelke et al. 2013). Despite such differences, it has been observed that the FDM tends to become more stable with longer presentation times. While task-specific situations are more definitive, free-viewing introduces uncertainties, since the subject’s mental state is not known. The FDMs created for a few sample images by different labs are shown in Figure 5.5.

Figure 5.5 Fixation density maps for a few images created by different laboratories for a presentation time of 10 s, from left to right: original image, University of the West of Scotland (UWS), Technische Universität Darmstadt (TUD), and University of Nantes (UN). Source: Reproduced from Engelke et al. (2013) with permission from the authors.

A class of metrics for validating attention models compares the attention heat maps generated by a model with the FDMs for the images in a database. There are several metrics for such comparison, such as the Pearson linear correlation coefficient (PLCC) and the area under the receiver operating characteristic curve (AUC-ROC). Validation with FDMs does not account for the temporal sequence and duration of fixations. A more general metric for the evaluation of an attention model can be based on the deviation between two spatiotemporal sequences (scan-paths): the sequence predicted by a model and an actual sequence followed by a human subject, averaged over a number of subjects. There are several ways to measure the deviation. One approach is to model the temporal sequences as vectors and to measure the Euclidean distance or the Fréchet distance (Agarwal et al. 2012) between the sequences; the Fréchet distance is also called the dog walking distance, and is the maximum of the spatial distances between the two sequences at different points in time. An alternative, widely adopted approach is to consider the two sequences as probability distributions and use the KLD or the Percentile Metric as a measure of the deviation. Yet another approach is to treat the predictions and the observations as random variables and to find the correlation coefficient between them. A distinct approach is to model the predictor function as a classifier and compute the convergence between the predictions and the observations. It may be noted that these methods of evaluation are applicable to overt (foveal) attention models. Standard methods for evaluating covert (peripheral) attention are still lacking.
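For concreteness, the sketch below implements three of the metrics named above: PLCC against an FDM, a simplified AUC variant against discrete fixation points, and the discrete Fréchet distance between two scan-paths. Published benchmarks use more refined variants of these metrics.

```python
import numpy as np
from scipy.stats import pearsonr

def plcc(saliency_map, fdm):
    """Pearson linear correlation coefficient between a predicted saliency
    map and a measured fixation density map of the same shape."""
    r, _ = pearsonr(saliency_map.ravel(), fdm.ravel())
    return r

def fixation_auc(saliency_map, fixation_points):
    """Simplified AUC: how well saliency values at fixated locations
    separate from saliency values over the whole image."""
    pts = np.asarray(fixation_points)
    fixated = saliency_map[pts[:, 0], pts[:, 1]]
    everything = saliency_map.ravel()
    thresholds = np.sort(fixated)[::-1]
    tpr = [0.0] + [(fixated >= t).mean() for t in thresholds] + [1.0]
    fpr = [0.0] + [(everything >= t).mean() for t in thresholds] + [1.0]
    return np.trapz(tpr, fpr)

def discrete_frechet(path_p, path_q):
    """Discrete Fréchet ("dog walking") distance between two scan-paths,
    each an (N, 2) array of fixation coordinates in temporal order."""
    p, q = np.asarray(path_p, float), np.asarray(path_q, float)
    n, m = len(p), len(q)
    ca = np.full((n, m), -1.0)            # memo table
    def c(i, j):
        if ca[i, j] >= 0:
            return ca[i, j]
        d = np.linalg.norm(p[i] - q[j])
        if i == 0 and j == 0:
            ca[i, j] = d
        elif i == 0:
            ca[i, j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i, j] = max(c(i - 1, 0), d)
        else:
            ca[i, j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i, j]
    return c(n - 1, m - 1)
```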

5.4 Conclusion

Attention assumes an extremely significant role in cognitive vision. Reduction in the volume of visual data to be analyzed is of utmost importance to improve the performance of robots and other situated agents that need to interact with the environment in real time. Moreover, an understanding of the human attention model can help in the optimal delivery of visual contents by modulating the acuity of the image (or video) regions, thereby reducing the bandwidth requirements for their transmission. Emulating the human attention model allows an artificial vision system to focus on the most important parts of a scene as perceived by humans, which helps in generating the most appropriate annotations for the scene for visual question answering. In this chapter, we have presented various approaches to classical models of visual attention. More detailed reviews are available in Frintrop et al. (2010), Borji and Itti (2013), and Gide and Karam (2017). In summary, there are five factors that guide attention (Wolfe and Horowitz 2017):

1. Intrinsic visual properties of a scene (bottom-up attention),
2. Task-driven guidance (top-down),
3. Scene-driven guidance – the likely location of the target (context),
4. Perceived utility of some visual patterns (objects), and
5. Deviation from prior experience (surprise).

Perceptual grouping and attention are two important aspects of cognitive vision. Several researchers have tried to explore the relation between the two. Early saliency models have used local contrast features, based on the assumption that attention precedes the perceptual processes. Later research indicates that perceptual grouping may not be exclusively post-attentive. Psychophysical experiments indicate that while the attentive process is generally necessary to achieve perceptual grouping, coherent perceptual grouping of patterns (e.g. to assume a familiar object form) is often responsible for attracting attention. Thus, the two processes appear to have complex synergic relations (Kimchi 2009). Based on this theory, later research on attention models integrates higher level object and semantic features as well. Despite more than 30 years of research on the subject, many aspects of visual attention are still elusive, and it remains an active research topic. It may also be worth mentioning that the attention mechanism is not specific to vision alone, but works for all sensory inputs. Of late, there has been some research on multi-modal attention models; multi-modal attention occurs naturally in the biological domain. Visual attention has been complemented with audio or natural language-based attention in applications such as image captioning (Cheng et al. 2017), video annotation (Hori et al. 2017), machine translation (Caglayan et al. 2016), and so on. Studies on the interaction between visual and haptic attention (Tripathi et al. 2007; List et al. 2014; Graven 2016) show that there is a positive synergy between the two.


6 Cognitive Architectures

The goal of research in artificial general intelligence (AGI) is the creation of autonomous intelligent systems with capabilities that compare with or exceed human intelligence. Such systems need to autonomously interact with the environment to achieve some goals, while being constrained by some value system. For example, an autonomous vehicle needs to reach its destination in the shortest possible time, obeying the applicable traffic rules and safety requirements. The functions involved in the process include perception and cognition (of traffic lights, road signs, and pedestrians, for instance), dynamic planning (for the route), and selection and execution of appropriate actions (stopping, slowing down, turning left or right, etc.) at appropriate times. The functions need to be performed in reaction to the stimuli received from the environment, in the context of the goal of the agent, in real time. Achievement of such intelligent behavior requires emulation of human cognitive capabilities for each of the tasks, which is known as cognitive modeling. A system comprising a coordinated collection of such cognitive tasks that enables its consistent autonomous behavior is known as a cognitive system. A system architecture that effectively supports creation of cognitive systems is known as a cognitive architecture. It specifies the components that are required for implementing a cognitive system and their interactions. In a broader sense, “a cognitive architecture represents any attempt to create a unified theory of cognition” (Vernon et al. 2016). Cognitive modeling techniques and cognitive architectures draw their inspiration from cognitive sciences and artificial intelligence. In this chapter, we start with the paradigms for modeling cognition, followed by the desired properties of cognitive agents. We devote a section to the memory architecture that enables an agent to exploit its experience and to improve its performance over time. We present taxonomies of agent architectures before reviewing two of the classical cognitive architectures that are mature, current, and used in computer vision problems. Subsequently, we introduce a rough sketch of the biologically motivated emergent architectures. We conclude the chapter with some critical observations on the subject.


6.1 Cognitive Modeling

Implementation of artificial cognitive systems needs formal and implementable models. We find two kinds of formal modeling techniques for cognition. Computational modeling of cognition broadly refers to the formal specification of the cognitive processes through computational models using algorithmic descriptions. The precise conceptual clarity provided by computational modeling makes it a useful tool for implementing artificial cognitive systems. In contrast, product modeling results in a black-box view, where cognitive behavior is specified through input–output relations without considering the processes that are responsible for the transformation. Such models of cognition can be used to impose constraints on the process model. Computational modeling subsumes mathematical modeling, since a mathematical model can be effectively implemented with a computational algorithm (Sun 2008). Given these general characteristics, the computational models proposed by different researchers vary widely in the paradigm of modeling and the levels of abstraction. We compare the different approaches in the following text.

6.1.1 Paradigms for Modeling Cognition

There is no unique way for computational modeling of cognition. The approaches followed by various researchers can be broadly classified into two distinct categories (Vernon et al. 2007; Duch et al. 2008). The symbolic approach is based on the classical view that cognition results from manipulation of, and reasoning with, a symbolic representation of the world. In contrast, the emergent system approach is essentially an amalgamation of connectionist, dynamic, and enactive views of a system. An emergent system dynamically reorganizes itself while interacting with the environment, and cognition is the result of this self-organization. Traditionally, the symbolic approach is also referred to as the cognitive approach, but we shall reserve the term cognitive to connote any approach for modeling cognitive functions, and the term cognitive system to mean any system capable of cognitive functions. The fundamental difference between the symbolic and emergent approaches is the representation of the world model. Motivated by the development of symbolic logic in artificial intelligence, the symbolic approach represents the various real-world entities (concepts) as a set of symbols. The concepts are connected with various relations to represent a model of the world. The resulting declarative knowledge representation is separated from its embodiment, is human accessible, and can be explicitly shared between a number of systems. On the contrary, the emergent approach has been inspired by the emulation of physiological neural networks. Here, the knowledge is represented as the dynamic embodiment of the system itself, distributed across the system components at a sub-symbolic level. As a result, it is strictly private to the system and cannot be easily shared. Computation in the symbolic approach involves manipulation of symbols. Different flavors of first order logic, e.g. description logics (DL), are used for reasoning. In the emergent systems approach, computation involves continuous self-organization and self-development of the system through concurrent interactions of the components of the system with the external world and with each other. Thus, an emergent system keeps itself updated through its interaction with the environment in real time. The representations in an emergent system are automatically and continuously induced through its interaction with the environment. A symbolic system uses some “in-built,” generally manually encoded, knowledge. Essentially, it represents a top-down knowledge-driven process. On the other hand, an emergent system uses activation signals from the environment to reorganize its internal states, which is essentially a bottom-up data-driven process. The knowledge in emergent systems is learned through interaction with the environment. In other words, the symbolic approach involves the use of knowledge, which can be generalized, while the emergent systems focus on the development of task-specific skills to achieve a specific system goal. A key ingredient of a cognitive architecture is memory, which helps a system to remember the past and improve its performance with accumulated experience. In a symbolic system, the knowledge is generally represented as graphs representing the relations between the concepts, and a set of rules to operate upon them. Learning is a process through which the contents of the memory are updated with time. In an emergent system, the memory is embodied in the network configuration, and learning is effected through its change resulting from interaction with the environment. Another difference between the two approaches lies in the anatomical modularity of the system. In the top-down symbolic approach, the knowledge is partitioned (by design) and the different cognitive functions are independently implemented. Thus, a symbolic system is functionally as well as anatomically modular. In emergent systems, functional modularity need not necessarily imply anatomical modularity. Several functions may map to the same component or overlapping components of the system. Vernon (2016) characterizes the symbolic approach as a principled approach with a goal “to create a model of cognition and gain an understanding of cognitive processes,” and the emergent approach to be practical with an aim “to build useful systems that have a cognitive ability and thereby provide robust adaptive behavior that can anticipate events and the need for action.” The key differences between the two approaches are summarized in Table 6.1.

Table 6.1  Key differences between symbolic and emergent architectures.

                         Symbolic architecture                   Emergent architecture
  Goal                   Create model of cognition               Build useful system
  Processing             Top-down; knowledge centric             Bottom-up; skill centric
  Representation         Symbolic; declarative, sharable,        Sub-symbolic; implicit, strictly
                         and accessible to humans                private, and inaccessible to humans
  Computation            Symbol manipulation; deterministic      Self-organization; necessarily
                         or probabilistic                        probabilistic
  Knowledge acquisition  Through manual encoding and learning    Through learning alone
  Memory                 Graph-based, rule-based                 Networked
  Modularity             Functional and anatomical               Functional modularity does not
                         modularity                              necessarily imply modular architecture

The two approaches have their own advantages and disadvantages. A strength of the symbolic approach is that the representation of declarative knowledge is human accessible and sharable. Thus, semantic knowledge can be directly

implanted by humans in a symbolic system, so that the system may perform since its inception. Further, the modularity of the knowledge structure makes it easier for human programmers to understand the system and analyze any errors. A declarative knowledge representation can be shared across systems, resulting in a shared ontological view of the world. In an emergent system, the knowledge is to be learned by the system, and therefore, the system needs training before it can perform. The knowledge is embodied in the network configuration itself, and is in a form that is difficult for humans to understand. It is strictly private to the system, and can be shared only through explicit knowledge exchange across agents. The only way to impart prior knowledge in an emergent architecture is to use a pre-trained network, or through the use of meta-learning (these will be discussed in Chapter 8). Unfortunately, what is an advantage for a symbolic system also leads to its inherent weakness. It fails to perform when the environment differs significantly from that represented by its prior knowledge. The problem is particularly severe for monotonic knowledge representations, where a system cannot unlearn the already encoded knowledge to adapt to a novel situation. An emergent system architecture, on the other hand, is more amenable to adaptation to a new environment through self-organization. An advantage of emergent systems is their flexibility toward finding the generic properties from a large volume of data (inductive generalization). Symbolic systems generally follow deductive logic and are incapable of inductive generalization. Another advantage of the emergent approach is exploiting the principles of dynamic systems, which readily support several desirable properties of a cognitive system, such as multi-stability, adaptability, intentionality, and learning. Most of the artificial autonomous systems built to date are built around the symbolic approach, but emergent systems are fast catching up. Though several modeling frameworks for emergent systems have been proposed, comprehensive architectures are yet to be defined. Symbolic systems are generally good for reasoning, while the emergent systems are more suitable for perceptual tasks. Thus, combining the two approaches into a hybrid system is necessary for building cognitive systems that can interact with the environment and can perform complex tasks. The hybrid systems are motivated by dual process theory, which suggests that human cognition results from the interaction of two types of processes, intuition and reasoning (Kahneman 2003), operating at two distinct levels. Intuition happens at a subconscious level, is based on associative reasoning, is holistic, and is generally extremely fast. It is best modeled through the emergent approach. Reasoning uses conscious decision-making. It is a slower process and can be modeled as a symbolic system. The interface between the symbolic and sub-symbolic representations poses a significant challenge to the development of intelligent autonomous systems (Goertzel 2012).

6.1.2 Levels of Abstraction

A computational model of cognition is an abstraction of the human cognition process. In this context, it is important to consider the level of abstraction to be followed in the modeling. David Marr has proposed hardware, algorithmic, and computational layers as three progressive levels of abstraction for vision (Marr 1982). A similar three-layer abstraction for knowledge-based systems has been defined in Newell and Simon (1976). The highest layer of abstraction in the model is the knowledge level, which justifies the behavior of a system with respect to its goals and knowledge through rationality. The next lower level is the symbolic level, where the goals and the beliefs are expressed through symbols, and the rules for their manipulation (logic) are established. The lowest level is the physical level, which deals with the means to implement the representation and the rules for their manipulation. Though these stratification schemes, based on computational abstractions, lead to a strong theoretical foundation for cognitive modeling, they have been criticized on the ground that the layers cannot be fully isolated in practice (Sun 2008). For example, the processes for the algorithmic layer cannot be formulated without clues for their plausible implementation at the physical hardware level. In contrast, Sun et al. (2005) focus on a different aspect of abstraction by considering the behavioral nature of the processes, which they consider to be more important than the computational stratification. The authors propose a four-layer model of abstraction:

1. The highest level corresponds to the sociological level that guides the behavior of social interaction between intelligent agents. The models at this level guide the collective behavior of autonomous agents, and their interactions with each other and with the environment.
2. The next level is the psychological level that relates to the individual behavior of an agent with respect to its goals, knowledge, and skills. These entities are closely related to perception, motivation, emotion, and similar other factors of the agent.
3. The third level, the componential level, pertains to intra-agent processes and defines the cognitive mechanisms. An internal architecture of an autonomous agent, in terms of its components and their interactions, also emerges at this level.
4. The lowest layer is the physiological level that deals with the implementation, and is motivated by physiological, neuroscientific, and similar other studies. The knowledge of the physical implementation imposes constraints on the architecture and the processes at the componential level.

The abstraction hierarchy is summarized in Table 6.2. As we shall see later in this chapter, these layers of abstraction are often reflected in the proposed cognitive architectures.

Table 6.2  Layers of abstraction in a cognitive architecture (Sun et al. 2005).

  Object of analysis     Type of analysis               Computational model
  Inter-agent processes  Social and cultural            Group of agents
  Agents                 Psychological                  Individual agents
  Intra-agent processes  Componential                   Modular construction of agents
  Substrates             Physiological and biological   Implementation of components

6.2 Desiderata for Cognitive Architectures

In order to guide the development of cognitive architectures, a set of 10 desired properties has been articulated in Vernon et al. (2016). The authors focus on the cognitive development of an agent while developing the desiderata. These properties are not independent, but are often related to each other. The essential elements of these properties are summarized in the following text.


A prime property of a cognitive agent is its autonomy. We may distinguish between two types of autonomy, behavioral and constitutive. Behavioral autonomy manifests in the external characteristics of an agent, such as the setting of its goals, flexibility in dealing with a new environment, etc. It requires that an agent should have its independent value system, i.e. a set of rules or principles that it should abide by, and motives that drive it to take actions to achieve them. Learning from such actions guides the self-development of an agent. Further, it is possible to distinguish between two types of motives, namely exploratory and social. The exploratory motive drives an agent to discover the environment as well as to explore its own capabilities. The social motive of an agent pertains to its endeavor to discover the protocols of engaging with the environment and other agents. It enables an agent to learn new skills and acquire knowledge about the world from the experience of others. On the other hand, constitutive autonomy results in the capability of an agent to define its internal structure and in the process of (re)structuring itself during its lifetime. Constitutive autonomy is particularly important for the emergent systems, which continuously reorganize themselves as they interact with the environment throughout their lifetime. The physical embodiment of an agent is also extremely important for its exploratory capabilities. A larger number of sensors and actuators in the embodiment of an agent augments its capability to explore the world. The exploratory capability of an agent is often determined by its “non-brain” components. For example, in the absence of visual inputs, the shape and size of an object can be explored through tactile exploration, provided the “limbs” of the agent allow effective manipulation of the object, or of itself. The interpretation of the sensor data needs perception capabilities. Perception of motion introduces a temporal dimension and is particularly important for an agent to make predictions. Since the volume of sensory data received by an agent may be too large to process, its attention mechanism needs to focus on its goals. Further, in order to acquire social skills, attention should fixate on biological entities, such as the human face, and especially on the movement of the articulatory organs, such as the eyes and the lips. Further, a cognitive agent should be rational in selecting an action from a prospective set, i.e. the action of a cognitive agent should be aligned to its goal (there may be a conflict between emulating the human mind and rationality; human behavior may not always be rational!). In a dynamically changing world, the sensory data received by an agent keeps on changing continuously. The changes may be due to either the changes in the environment or the action taken by the agent. For example, when a guided missile follows a zig-zag path (to avoid enemy fire), the image of the target keeps on changing continuously. A cognitive agent should be able to relate such changes in the sensed data with its own actions. This property is known as sensorimotor contingencies (SMC); the term sensorimotor is defined as “having or involving both sensory and motor functions or pathways” (dictionary.com). Further, a cognitive agent needs to rehearse a scenario through internal simulation, in order to assess the possible impact of its action, like a chess-player anticipating the future moves of the opponent. This involves a representation of the anticipated view of the environment. Besides cognition, internal simulation is also necessary for the development of an agent with respect to sensorimotor contingencies and prospective actions. The most important property of a cognitive agent for its self-development is to learn from its past experiences and progressively improve its performance in adapting to new situations and in predicting the future. The top-down and bottom-up processes of dual process theory require that a cognitive architecture should support multiple modes of learning. In particular, a cognitive agent must minimally possess the capabilities for supervised, unsupervised, and reinforcement learning. It naturally follows that an agent must possess some dynamic long-term memory (LTM), which gets updated with learning. Since the memory system holds an important position in a cognitive architecture, we shall discuss it in more detail in Section 6.3. Having provided these desiderata, Vernon (2016) cautions the readers not to blindly follow them while building a practical cognitive system. The authors advise creating an architecture following the requirements of the system, which may demand different priorities on the different properties of the architecture. However, if the goal is to create a generic platform supporting a unified theory of cognition, it is desirable to focus on the desiderata, since the use-cases of a specific system may lead to missing out on some of the key considerations.

6.3 Memory Architecture Memory systems play a key enabler role in any cognitive architecture. This prompts us to discuss the memory system in some details. Physiological studies of memory of human beings and other animals has thrown important insights into the structure of memory in cognitive systems. For example, it has been observed that patients may exhibit forgetfulness while retaining remarkable intellectual capabilities. Such observations indicate that human memory is partitioned, and it is possible for one part of the memory to get affected without affecting the functionality in others. As the researchers have discovered, the various memory systems in the brain act independently and in parallel. Memory systems have been classified based on the type of information that they process and the way they function. 3 The term sensorimotor is defined as “the nerves or their activities having or involving both sensory and motor functions or pathways contingencies – (dictionary.com)”.

6.3 Memory Architecture

First and foremost, we distinguish between two types of memory: long-term and short-term. LTM holds the knowledge and experience of an agent for its lifetime and is considered to be virtually infinite in capacity. Short-term memory (STM) has a finite and small capacity, is transient, and is used by a cognitive system to control its activities. Both long-term and short-term memories are further organized into several subclasses. Figure 6.1 shows a taxonomy of the LTM. Declarative memory represents the capability of an agent to consciously recall some facts or events after a prolonged period of time; it is believed that such knowledge is stored in a declarative (or representational) form, such as “the sky is blue.” It provides a capability to explicitly model the external world and to assert the truth value of a model. Declarative memory is used for the top-down cognitive processes. It is further classified into two classes: semantic memory and episodic memory. Semantic memory holds generic knowledge about things and may keep on updating through the life of the agent. Episodic memory holds specific knowledge about an event that is confined to a specific location and time. Episodic memory results in the capability to (mentally) re-experience an event, even long after it has occurred. A particular form of episodic memory that is particularly useful for sensory perceptions is associative memory (also called content addressable memory, or CAM), which helps in remembering associations between entities, such as an object and its shape. Experiencing the shape results in recalling the object. Episodic memory is generally associated with a decay, i.e. one tends to forget events after a certain point of time. But modeling the decay is indeed tricky; some of the events may be forgotten very soon, but some (like a dog bite in childhood) may be remembered for a very long

Memory (long-term)

Declarative memory

Semantic (facts)

Non-declarative memory

Episodic (events)

Associative

Figure 6.1

Skills and habits

Priming and perceptual learning

Emotional responses

Classical conditioning

Non-associative learning

Skeletal responses

Taxonomy of long-term memory. Source: Redrawn after Squire (2004).

95

96

6 Cognitive Architectures

time and may even induce a behavioral trait (fear of dog) in the non-declarative memory. Non-declarative memory (or, what is often called procedural memory), on the other hand, is dispositional and determines the performance of an agent. There is nothing true or false about it. Different kinds of procedural memory relate to specialized functions of an agent (the entries in Figure 6.1 are self-explanatory). They evolve with the interactions of an agent with the environment. Non declarative memory has the capability to generalize from the experience of a number of events, and to retain the generalization. They are responsible for the bottom-up cognitive processes. STM is used by a cognitive agent to organize its perception and actions. The sensory information first accumulates in a sensory memory. There are distinct sensory memories for each of the sensory organs. The capacity and the retention period of a sensory memory depend on the sensory modality. Sensory memory acts as a buffer and allows an agent to perceive the world as a continuum, rather than in discrete pieces. Visual sensory memory is also known as iconic memory since it is believed that the mental representation of the visual forms is iconic in nature. Psychological experiments show that it has a large capacity but is associated with a very fast decay. The information received in the sensory memory is filtered through an attention mechanism. Only a small fraction of the information reaches the next stage of processing at the active memory4 ; the rest of the information is forgotten. Information from different sensory modalities are integrated in the active memory. It holds a small volume of information and retains it for a period of a few seconds to less than a minute. It has been observed that rehearsing helps in refreshing the active memory and remembering its contents for a longer period of time.5 Active memory can recall information from LTM and can update its contents. Figure 6.2 depicts the interaction of the STM modules. However, recent observations suggest that it may not be always necessary for information to pass through STM to reach the LTM. Another memory unit that is at the core of a cognitive system, is called the working memory, which draws information from STM and LTM, and acts as a temporary scratch-pad for manipulation of information. The contents of the working memory at any given point of time represents the percepts of an agent and relevant background knowledge, manipulation of which results in cognition, reasoning, learning, and action selection. Researchers have often proposed specialization of these memory elements in context of specific requirements of an architecture.

4 Active memory is loosely called the STM by some authors.
5 An alternate hypothesis is that rehearsing results in movement of information from active memory to LTM.


Figure 6.2 Short-term memory and interactions.

In emergent systems, the distinction between these memory units gets blurred, and they often map to the same physical components.

6.4 Taxonomies of Cognitive Architectures

Several cognitive architectures have been proposed by different research groups over the years. Figure 6.3 depicts a taxonomy for cognitive architectures based on the modeling paradigm. The cognitive architectures can broadly be classified as symbolic, emergent, or hybrid, depending on their knowledge representation schemes. While some of the older architectures used the symbolic modeling paradigm, the newer ones, particularly those supporting perceptual capabilities, have mostly embraced hybrid paradigms.

Figure 6.3 A taxonomy of cognitive architectures based on their modeling paradigms, based on Kotseruba and Tsotsos (2020).


Few researchers have used the emergent modeling paradigm exclusively to realize cognitive capabilities, and then without implementing a full-fledged architecture (see discussions in Section 6.1.1). The emergent systems can further be classified into either (i) connectionist logic systems, which deploy artificial neural networks to compute complex symbolic processes in parallel (Aleksander 1989), or (ii) neuronal modeling, where the behavior of a neuron is modeled through a signal-processing approach by slicing it into smaller compartments (Koch and Segev 1998). The hybrid architectures, in turn, can further be classified based on the mode of integration between the symbolic and sub-symbolic representations: they are fully integrated in some of the architectures, but operate in isolation in others. The classification of a system based on this taxonomy is often ambiguous. For example, a system following a symbolic architecture can loosely integrate deep neural networks for some specific visual function, say object detection, the output of which is used for further processing with symbolic logic (Kennedy and Trafton 2006). It may be controversial to classify such a system as either a symbolic or a hybrid architecture.

Further, the cognitive architectures can be distinguished based on their perceptual capabilities. In this context, perceptual capability may be defined as the capability of a system to transform raw sensory signals into internal representations appropriate for their utilization in cognitive tasks, such as reasoning. It is apparent that architectures based on the symbolic paradigm alone cannot have perceptual capability; perceptual capabilities exist at different levels in hybrid and emergent systems. Since this book concentrates on visual cognition, we distinguish the architectures depending on whether or not vision is supported as a perception modality. Traditionally, vision has been viewed as the most prevalent perception modality, and most of the architectures that support perception have focused on visual perception. Most of the architectures that support other sensory modes, such as audio or tactile inputs, also support vision. Some architectures are designed to handle raw domain-specific sensor data, such as weather data, and do not support natural sensory data. Figure 6.4 depicts a taxonomy of the cognitive architectures based on their perceptual capabilities, with emphasis on visual perception.

As described in the earlier Chapters 2 and 4, perceptual processing of visual data involves several stages, namely (i) early vision: detecting and locating intensity discontinuities, (ii) perceptual grouping: extraction of edges and contours, (iii) identification of objects and motions, and creation of an object-centric description of the environment, and (iv) task-specific labeling and reasoning with the objects and their spatiotemporal aspects. Some of the architectures supporting visual perception start at the early stage of vision; they accept visual data in its raw form, i.e. as raw pixel values. These architectures are known as the real-vision


Figure 6.4 A taxonomy of cognitive architectures based on their perceptual capabilities, with emphasis on vision.

architectures. Other architectures accept processed visual data and start at a later stage of visual processing; they are called the simulated-vision architectures.

6.5 Review of Cognitive Architectures

Several cognitive architectures have been proposed by various researchers since the 1980s, some of which have stood the test of time and are more popular than others. These architectures have been continuously enriched with new capabilities over the years. A reader may refer to Kotseruba and Tsotsos (2020) for a detailed survey of contemporary cognitive architectures. In this section, we present brief overviews of two representative cognitive architectures, Selective Tuning Attentive Reference (STAR) and Learning Intelligent Distribution Agent (LIDA), which have sufficiently long histories and are in current use for computer vision systems. Both of the architectures reviewed in this section represent single-agent architectures and address the psychological and componential levels of abstraction; they do not address the physiological level of implementation or the sociological level of inter-agent interactions. Interestingly, both architectures center around an attention mechanism, which is crucial for alleviating the information overload of an agent. Both architectures support real vision and are hybrid in nature. They use sub-symbolic representation at the percept level and symbolic representation at the higher levels of cognition.


The main distinction between the two architectures is that STAR specifically focuses on attention-based vision, while LIDA is a generic architecture for AGI.

6.5.1 STAR: Selective Tuning Attentive Reference

STAR is an architectural framework to emulate the attention mechanism of the human vision system (HVS) in the context of a visual task, based on the selective tuning (ST) model of attention (Tsotsos et al. 1995; Tsotsos 2011). The ST attention model suggests a top-down control mechanism on the feed-forward neural pathway that carries visual signals from the retina to the visual cortex. The forward pathway is modeled as a hierarchical neural network, called the visual hierarchy (VH), where the visual information undergoes progressive abstraction as it climbs up the hierarchy levels. The control mechanism involves top-down priming of visual signals, right from the early stages of vision, and feedback processing. It results in suppression of the clutter surrounding the attended objects and regulates the excitation levels of the neurons during the entire process of early vision and perception. The control mechanism is achieved through a set of Cognitive Programs (CPs) (Tsotsos and Kruijne 2014). The functions of the CPs include imposition of top-down bias signals, control of parameters, coordination of top-down (control) and bottom-up (visual) signals, fixation changes, parameter tuning of methods, and so on. Vision is a continuous process that cycles through fixation changes and analysis of the visual contents of both foveal and peripheral vision in every cycle. The STAR architecture provides a general framework to realize such functions. We provide a brief description of the architectural components in the following paragraphs.

The interaction and the functionality of the modules in the STAR architecture are shown in Figure 6.5. The VH module implements the core perception function, while the role of all the other modules is to control its functions. The methods long-term memory (mLTM) module stores a set of visual methods (obtained from external sources) suitable for different visual tasks. These methods are somewhat generic and need to be adapted for specific tasks; for example, there can be a method to detect a blob, the color parameter of which needs to be tuned to the value “green” to detect a green blob. The visual task executive (vTE) module receives the visual task specification together with relevant world knowledge (from some external source), selects the appropriate methods from mLTM, and adapts them with appropriate parameters to create task-specific scripts. vTE is also responsible for executing the scripts and monitoring their execution. The execution of a script triggers a visual attention cycle that is controlled by the visual attention executive (vAE). Among other things, vAE primes the VH with task-specific top-down signals and instructions to search for specific items. Besides, it also enables VH to disengage attention from earlier cycles

Figure 6.5 A detailed view of the STAR architecture. Source: Reproduced from Tsotsos and Kruijne (2014) with permission from the authors.


and to implement return-inhibition. It enables VH to select and localize the focus of attention following a Winner-Take-All (WTA) strategy and to interpret its contents. The fixation controller (FC) module determines the change in fixation based on the current peripheral priority map and a history-based priority map, details of which are available in Tsotsos et al. (2016). The process is supported by two distinct memory modules. The task working memory (tWM) holds the temporary scripts and intermediate execution results to facilitate monitoring of the execution of the scripts. The visual working memory (vWM) contains a blackboard to hold the current attentional sample (the location and the features, at progressive abstraction levels, of the image region being attended to). Besides, it stores a history of the past several fixations to enable implementation of return-inhibition.

A caricature of a typical visual cycle is shown toward the bottom-left of Figure 6.5, with the VH module. The left-most hierarchy (represented by the pyramidal structure of rectangles) represents the initial stage, where the VH is primed by the top-down task-specific requirement (the target is expected at the center of the image), but the stimulus is yet to appear. In the next one, the stimulus has appeared (at an off-center location), feed-forward visual signals have been sent up the hierarchy, and a WTA strategy has been used to localize the target. These steps are good enough for common visual tasks like discrimination (of the desired signal from others), object identification, and categorization. For more complex tasks, such as fine-grained classification, further attentive cycles are generally required. The third and the fourth (from left) pyramids depict the recurrent top-down localization and the consequent feed-forward pass for re-analysis. There may be several cycles of these steps, depending on the task complexity.
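To make the WTA selection step concrete, the following is a minimal sketch of selecting a winner from a saliency map and inhibiting the attended location for the next cycle. The array values and the data structures are illustrative toy choices, not STAR's actual internal representations.

```python
# A minimal illustration of Winner-Take-All (WTA) selection over a
# saliency map, in the spirit of the localization step described above.
import numpy as np

saliency = np.array([
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.2, 0.5, 0.1],
])

# The winner is the location with the maximal activation.
winner = np.unravel_index(np.argmax(saliency), saliency.shape)

# Return-inhibition: suppress the attended location so that the next
# attentive cycle selects a different focus of attention.
inhibited = saliency.copy()
inhibited[winner] = 0.0

print(winner)   # -> (1, 1)
```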

6.5.2 LIDA: Learning Intelligent Distribution Agent

LIDA has been developed as a general cognitive architecture designed to support AGI (Faghihi and Franklin 2012; Snaider et al. 2011). The goal of LIDA has been to address both the scientific challenge of modeling cognition and the engineering challenge of building a workable artificial agent. It has been claimed to support all the desired properties for cognitive systems discussed in Section 6.2. The guiding principle of LIDA is that every autonomous agent needs to continuously sense its environment, interpret the sensory inputs to assess the environmental situation, and then act to change the world in its favor. Accordingly, the architectural framework of LIDA models a cognitive cycle with three successive stages, namely understand, attend, and respond (Franklin et al. 2014). Figure 6.6 depicts the cognitive cycle of LIDA and the architectural components that realize it. The core of the LIDA architecture centers around an attention model based on the global workspace (GW) theory (Baars 2005).

Figure 6.6 LIDA cognitive cycle. Source: Reproduced from Franklin et al. (2014) with permission from the authors.


The human nervous system is viewed as a large number of autonomous, distributed, and parallel processing units. GW theory is a model for organizing and coordinating such processors. Each of the processing units (implemented as codelets in LIDA) is believed to specialize in a specific cognitive task. Coordination of several coalitions of such processing units results in the interpretation of sensory data. The coalitions independently work with the input sensory data, and the one that achieves the highest activation level attracts attention; its outcome enters a global workspace. In general, the activation levels for the sensory signals are determined not only by the properties of the current signal but also by the properties of the signals received earlier and by the context, which includes internal (the intentions and the expectations of an agent) as well as external (perceptual context) factors. The contents of the GW attain the level of consciousness by being broadcast throughout the nervous system, which results in a response to the recognized world situation.

During its lifetime, an agent cycles through the understand, attend, and respond phases. The cycle starts with the understand phase, when a LIDA agent updates its representation of both the external (perception of sensory signals) and the internal features. It starts with low-level feature detectors working in the sensory memory to detect basic sensory features, which get consolidated into higher-level features, such as objects, categories, relations, and events. They are represented as nodes in the perceptual associative memory (PAM). The spatial memory encodes spatial information of recognized features and is linked to the PAM nodes; it is useful for recognizing visual concepts with spatial structures. The node structure of PAM represents the current percept, which is passed on to the workspace, where the current situational model (CSM) for the agent is computed. A history of the recent percepts and the models created from them (which have not yet decayed in the workspace) are also used in computing the CSM. The most salient part of this representation enters the GW in the attend phase by the competitive mechanism described earlier and is broadcast throughout the system to bring it to consciousness and to generate an appropriate response. Further, the workspace retains a list of recent conscious contents, called the Conscious Contents Queue, to enable cognition of temporal concepts.

In LIDA, the response takes two forms: (i) action to influence the external world and (ii) learning to update the knowledge of the agent. An action is selected from the procedural memory based on its contextual similarity with the current situation. The selected action is forwarded to the sensory motor memory, which relates actions to motor plans (algorithms), for execution. In parallel with action selection, the conscious contents are used to update the knowledge (or the internal representations) of the agent held in the various memory units, in particular the PAM, which helps in recognition, and the transient episodic memory (TEM), which holds the experience about the current episode. The knowledge from the TEM is used to update the contents of the declarative memory that holds the agent's knowledge about the world, learned and unlearned through consolidation over several episodes during its lifetime.


As an architecture, LIDA is a generic framework and demands limited commitment to the underlying architectural elements. In particular, the various memory modules that are the key enablers for a LIDA agent to make autonomous decisions can be implemented in different ways. For example, Snaider and Franklin (2014) propose a new representation scheme, where the nodes in PAM are represented with vectors, rather than having a symbolic description. This extension results in several improvements, such as a homogeneous representation of different levels of abstraction in the various memory modules, grounding of PAM nodes in the percepts, interoperability with the TEM data structure, faster computational comparisons for PAM structures, and seamless integration with deep learning methods. An implementation of spatial memory that can hold probabilistic spatial information computed from noisy multi-modal sensor data is presented in Madl et al. (2018); this is particularly useful for computer vision applications like simultaneous localization and mapping (SLAM).

6.6 Biologically Inspired Cognitive Architectures

The human nervous system (including the brain) is recognized to be a large and complex network of neurons. This has motivated modeling cognitive architectures with neural networks. Inspired by the breakthrough in the image classification problem (Krizhevsky et al. 2012), deep neural networks are increasingly being used for modeling more complex visual cognitive functions (Storrs and Kriegeskorte 2019). The progress has been fueled by extremely fast processors, such as GPUs, for large number-crunching jobs. The basic premise of a feed-forward neural network in cognitive modeling is that the knowledge required to translate any arbitrary set of inputs to a set of outputs can be effectively learned in a sufficiently large neural network with a sufficiently large set of training data. The knowledge, so learned, is distributed over a large number of network nodes (typically in the order of millions in contemporary deep neural networks) and lends itself to inductive generalization (emulating the principle of the nearest neighbor). Unlike in symbolic representation, the knowledge in a deep neural network is represented as tensors, and inferencing takes place through their “intuitive” manipulation in response to some input data. A drawback of the feed-forward network is that it implements a static transformation function that lacks the experiential factor. Recurrent neural networks, where a part of the output is fed back to the input in the next time cycle, carry forward the past experience


and provide for dynamic behavior of the system. This leads to lifelong learning in the system and a capability for prediction. While it is not necessary that engineered systems emulate the design of biological systems for best performance,6 biologically inspired cognitive models have been found to perform better than the classical approaches in many complex cognitive tasks in different domains, such as computer vision, natural language and speech processing, drug discovery, and so on. But a complete cognitive cycle is yet to be modeled using this approach. Nevertheless, the recent advances in deep neural networks have revolutionized the realization of cognitive systems and hold a great promise for their future (Kriegeskorte 2015; LeCun et al. 2015). We shall elaborate on the capabilities of deep neural networks in Chapter 8.
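To illustrate the feedback idea just described, the following is a minimal sketch of a single recurrent step: part of the state computed at one time cycle is fed back as input at the next, so the transformation is no longer static. The weights and dimensions are random toy values, not a trained model.

```python
# A minimal sketch of recurrence: the hidden state h carries forward
# past experience across time cycles.
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 3))    # input -> hidden weights
W_rec = rng.normal(size=(4, 4))   # hidden -> hidden (the feedback path)

def rnn_step(x, h):
    """One time cycle: the new hidden state depends on the past state."""
    return np.tanh(W_in @ x + W_rec @ h)

h = np.zeros(4)                   # initial state: no experience yet
for t in range(5):                # a short input sequence
    x = rng.normal(size=3)
    h = rnn_step(x, h)            # h accumulates the sequence history
print(h)
```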

6.7 Conclusions

We have dealt with various aspects of cognitive architectures in this chapter. In this concluding section, we highlight a couple of them. The first and foremost is attention, which plays a crucial role in the process of cognition by restricting the flow of sensory and internally generated information; thus, not all that is perceived through the sensory organs attains a level of consciousness. Another important aspect of a cognitive architecture is its recurrence, i.e. the perpetual cycle of perceive, interpret, and respond throughout the lifetime of the agent. Here, memory plays an important role by remembering past experience and enabling an agent to improve its performance over time.

Though the field of cognitive modeling and architecture has received attention for quite some time, some aspects of human intelligence may not have received the attention they deserved, and there are a number of open research issues (Langley 2017). For instance, the focus of cognitive architectures has always been on action; as a result, deep understanding of the contents has been neglected. Another shortcoming is that cognitive architectures focus on analysis of sensory signals and not on creativity. Recent research on deep learning addresses these topics, but it is yet to be integrated in a full-fledged cognitive architecture.

Another important aspect of a cognitive architecture is its representation scheme. Most of the contemporary architectures use a hybrid model with sub-symbolic representation at the sensory signal and percept levels, and a symbolic representation for the higher levels of cognition and reasoning. The transition from one representation scheme to another is theoretically inelegant (Tacca 2011),

6 For instance, airplanes do not flap their wings to fly, and automobiles do not walk!


and poses a significant challenge to the designers of cognitive systems in practice. This has motivated research on a seamless representation scheme that can be used system-wide in a cognitive system. The recent advances in deep neural networks show a promise for cognitive architectures to be realized through a uniform sub-symbolic representation and a uniform computing paradigm. At the current time, the scientific community is divided in its opinion on whether deep learning approaches can pragmatically be used for a complete implementation of a cognitive system.


7 Knowledge Representation for Cognitive Vision

Knowledge is a key ingredient for reasoning in cognitive systems. As a motivating example, the interpretation of the brain MRI image in Figure 7.1, such as to infer whether or not there is a brain tumor, requires deep background knowledge. Maillot et al. (2004) observe that visual cognition requires three forms of knowledge: (i) knowledge about the domain, also called ontology,1 (ii) knowledge about the imaging process, and (iii) a mapping between the two. A knowledge representation scheme for visual cognition should be able to integrate the three forms of knowledge. Though both paradigms of artificial intelligence, namely classicism and connectionism, agree that knowledge should be formally represented, they differ in their approach to its representation. The classicists view knowledge as having an explicit, declarative, and symbolic representation, which may exist independent of the processing structure and can be shared. On the contrary, the connectionist view of knowledge is implicit, a distributed pattern over numerous processing nodes of a cognitive system, strictly private to the processing scheme. In this chapter, we present the classicist approach to knowledge representation. In particular, we focus on representation of knowledge in perceptual form, which is essential for perceptual and cognitive processing of visual and other sensory data.

7.1 Classicist Approach to Knowledge Representation

Though we have an intuitive notion of knowledge, there is no common definition for it. Researchers from various disciplines, e.g. philosophers, psychologists, and AI scientists, have defined “knowledge” from their own perspectives.

1 “An ontology defines a set of representational primitives with which to model a domain of knowledge or discourse.” (Gruber 2009)


Figure 7.1 MRI image of brain. Source: File shared by Bobjgalindo through Wikimedia Commons, file name: MRI_brain_tumor.jpg.

Some of these diverse definitions are complementary, and some are more useful than others for practical use (Ramirez and Valdes 2012). The philosophical view of knowledge is founded in the Aristotelian Representational Theory of the Mind (RTM), which characterizes knowledge as composed of named concepts, such that the names can be used to create propositions, from which conclusions can be drawn. This implies that knowledge can be formally represented and reasoned with. The mental representation is separated from what it represents, even from the material of which the mind of the agent may be composed.2 A concept is a mental model (also called a mental image) of “something” with attributes; the “something” can either be a real-world thing or an internal mental state of the agent, such as its emotional state. Thinking is a process for manipulating the knowledge. The language of thought hypothesis (LOTH), which complements RTM at a higher cognitive level, suggests that thoughts are also mental models, arising out of manipulation of the knowledge and represented in a

2 This leads to the well-known philosophical debate on “mind-body problem” (Westphal 2016).


language following the principles of symbolic logic. The combinatorial syntax and semantics of LOTH result in the capability of generalization (Fodor and Pylyshyn 1988). For example, experiencing a blue bird and a red ball allows one to imagine a red bird or a blue ball, without having experienced one. Psychological experiments have shown that knowledge is granular. These lead to the connectionist theories3 that characterize knowledge as a network of concepts, where the concepts are associated through relations. While the traditional connectionist approaches focus on the presence or absence of the relations, constructivist theories complement them with complex reasoning drivers such as causality, probability, and context.

The computational model of knowledge is derived from the above theories. In summary, the knowledge of a cognitive agent can be considered an internal representation of the agent's world,4 as perceived by the agent. An important characteristic of the knowledge representation is that it is symbolic: it comprises a set of symbols, which are different from what they represent. For example, a symbol represented by the sequence of letters “c-a-t” in the agent's memory may represent a four-legged animal that exists in the environment and meows. A second characteristic of classicist knowledge representation is that it is structured: it is composed of named elementary units called concepts, which are associated with other concepts. The concepts and their associations constitute knowledge structures. In principle, a knowledge structure can be dynamic, with updates to concepts and their associations; in practice, it tends to be stable over significant periods of time.

At this juncture, we may also distinguish between two forms of knowledge that agents generally deal with. Factual or declarative knowledge states what relations exist between concepts, e.g. “a cat has whiskers.” On the other hand, procedural knowledge affirms how things work or behave. Such a specification often takes the form of an action, which is, in general, a conditional sequence of tasks, e.g. “if hungry, then {go to restaurant, order food, wait for food, etc.}”. Both forms of knowledge have the common characteristics described above, though the semantics of the associations are different in the two cases. We present a brief account of classicist knowledge representation schemes in this section.

7.1.1 First Order Logic

Propositional logic is the simplest form of formal declarative knowledge representation, where an object or an event is represented as a proposition. A proposition can take either of the two values true and false, which are considered to be constants.

3 The connectionist theory of knowledge should not be confused with the connectionist architecture for cognition.
4 An agent's world comprises all the entities it deals with. It includes external elements, i.e. the entities in the environment it interacts with, as well as internal elements, namely the mental and physical states of the agent itself.


The propositions can be combined using conjunction (∧), disjunction (∨), and negation (¬) to form a sentence. A sentence that consists of a disjunction of propositions, where at most one proposition is unnegated, is called a Horn clause and is particularly useful for reasoning. A Horn clause, with exactly one or no unnegated proposition, can also be expressed in the implicative forms

¬p ∨ q ≡ p → q (7.1)
¬p ∨ ¬q ≡ p ∧ q → false (7.2)

The two implicative forms of Horn clauses can be used in deductive reasoning in the following ways:

(p → q) ∧ p ⊢ q  Modus Ponens (mode that affirms) (7.3)
(p → q) ∧ ¬q ⊢ ¬p  Modus Tollens (mode that denies) (7.4)
(p ∧ q → false) ∧ p ⊢ ¬q  Contradiction (7.5)

While propositional logic constitutes the basic tool for deductive reasoning, it lacks expressiveness. A proposition can represent an individual but cannot represent a class of individuals, thus lacking generality. Predicate calculus, also called first order logic (FoL), overcomes this difficulty by qualifying an individual with a predicate. For example, cat(x) is true for all individuals that have the properties of a cat. Further, the universal and existential quantifiers (∀ and ∃) facilitate reasoning with the classes. For example, the statement ∀x : cat(x) → meows(x) means all cats meow, and ∃x : cat(x) ∧ ¬whiskers(x) means there is at least one cat without whiskers. Like in propositional logic, reasoning in FoL is also based on Horn clauses. FoL has been extensively used in rule-based systems (RBS) (Hayes-Roth 1985) in many application domains, such as decision support, medical diagnostics, and fault analysis. An RBS typically consists of a static set of statements that represent the domain knowledge, often called the “rules,” and an inferencing engine that can reason with such rules. The “percepts” (observations by humans, or acquired through some sensory units), which represent the dynamic state of the world, are presented symbolically as facts to the system. These facts may trigger some rules, leading to the discovery of new facts. For example, a medical knowledge-base may contain rules like

∀x : headache(x) ∧ fever(x) → influenza(x)
∀x : influenza(x) → bed-rest(x) (7.6)

When a patient Vinod reports the symptoms headache and fever, encoded as headache(Vinod) and fever(Vinod), the


premises of the first rule become true, and it triggers, leading to the inference influenza(Vinod), i.e. Vinod is diagnosed with influenza (a newly discovered fact). This, in turn, fires the second rule, and Vinod is recommended bed rest (an action to be performed). Conceptually, the RBSs are modeled as a set of stimulus-response associations, which are fundamental to all living beings. A particular weakness of FoL, and consequently of RBS, is its lack of expressivity beyond Horn clauses.
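The forward-chaining behavior just described can be made concrete with a minimal sketch. The encoding below propositionalizes rules (7.6) for a single patient; the tuple-based rule format is an illustrative choice, not a standard RBS library.

```python
# A minimal forward-chaining sketch of the influenza example in rules (7.6).
# Each rule maps a set of premise predicates to a conclusion predicate.
RULES = [
    ({"headache", "fever"}, "influenza"),
    ({"influenza"}, "bed-rest"),
]

def forward_chain(facts):
    """Repeatedly fire rules (Modus Ponens) until no new facts emerge."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)   # a newly discovered fact
                changed = True
    return facts

# The patient reports headache and fever.
print(forward_chain({"headache", "fever"}))
# -> {'headache', 'fever', 'influenza', 'bed-rest'}
```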

7.1.2 Semantic Networks

According to RTM, the concepts in a domain of knowledge are related to other concepts. A binary relation between two concepts can be formally expressed as a 3-tuple, predicate(subject, object), e.g. has(cat, whiskers). The subject in one sentence can be the object in another, and vice versa. Many such sentences, in the context of a domain, define a body of knowledge and can be represented as a graph. The nodes in the graph represent concepts (subjects and objects), and the directed and labeled edges represent the asymmetric relations between them. Such graphical knowledge representations are known as semantic networks (or semantic nets). Figure 7.2 depicts a simple semantic network with a handful of concepts and relations. It is possible to do inferencing from the network by assuming some semantics for the relations; for example, assuming property inheritance via the “is-a” relations, we may conclude that a cat eats food. A semantic network also allows reification, where a relation is represented as a concept (a node in the graph). It enables

Figure 7.2 A simple semantic network.


defining relations across relations and the mapping of any arbitrary n-ary relation to a set of binary relations. In general, a semantic network can have arbitrary entities (classes or individuals) as concepts, and arbitrary relations between them. A semantic net does not commit to the semantics of the relations. This provides tremendous flexibility to the representation, but is also a hindrance to formal reasoning. As a consequence, there are many variants of semantic network representations, some informal and some highly formal. As far as the representation goes, semantic networks do not commit to any reasoning scheme; different reasoning schemes can be devised depending on the semantics attributed to the relations. For example, in definitional networks, where the emphasis is on the class–subclass relationships, first-order logic can be used. Alternatively, in implication networks, where the relations primarily represent causality, a Bayesian network may be constructed and probabilistic reasoning may be performed. A semantic network with a formal semantics can be used to represent an ontology. Properties of some other variants of semantic networks are discussed in Sowa (1992).

Any graph-structured knowledge representation that focuses on object instances, rather than the schema used for their organization, is also called a knowledge graph.5 Large-scale knowledge graphs (loosely called ontologies) with millions of nodes are either curated by hand, e.g. Cyc6 (Lenat 1995), or constructed from crowd-sourced data, e.g. DBPedia (Lehmann et al. 2015), created from the structured contents (tables) in Wikipedia. These knowledge graphs generally relate common entities of everyday use (not specialized for specific domains) and represent common-sense knowledge. A review of knowledge graphs and methods for their creation, evaluation, and refinement is presented in Paulheim (2017).
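The kind of inference described above, property inheritance along “is-a” edges, can be sketched in a few lines. The triple encoding below is an illustrative choice over the concepts of Figure 7.2, not a standard graph library API.

```python
# A minimal sketch of a semantic network with property inheritance.
# (subject, relation, object) triples drawn from Figure 7.2.
triples = [
    ("cat", "is-a", "mammal"),
    ("mammal", "is-a", "animal"),
    ("animal", "eats", "food"),
    ("cat", "has", "whiskers"),
]

def holds(subject, relation, obj):
    """Check a relation directly, or inherit it via the is-a hierarchy."""
    if (subject, relation, obj) in triples:
        return True
    # climb the is-a links and retry the query at each ancestor
    for s, r, parent in triples:
        if s == subject and r == "is-a" and holds(parent, relation, obj):
            return True
    return False

print(holds("cat", "eats", "food"))   # -> True (cat is-a mammal is-a animal)
```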

7.1.3 Frame-Based Representation

Psychological experiments suggest that, in order to interpret a new situation, one recalls an already known prototype structure that best matches the situation. The prototype, which is called a frame, encodes the known (remembered) properties of a situation, the details of which are amended suitably to fit the current reality. A framework of knowledge representation has been based on this theory (Minsky 1974). Frame-based knowledge representation comprises a set of concepts and definitions of the concepts in terms of their typical properties (Steels 1978). Figure 7.3 depicts an example of a frame-based knowledge representation. In this representation, an entity is represented by a frame (the big boxes in the

5 The term knowledge graph is sometimes narrowly used to connote Google's implementation of the same.
6 While Cyc is a commercial version, a smaller version called OpenCyc is publicly available.

Figure 7.3 An example of frame-based knowledge representation.

diagram). A frame that represents a class, such as a book, is called a generic frame. They are organized in a concept hierarchy with “is-a” (subclass-Of) relation, as shown above the dotted line in the diagram. All classes descend from a common ancestor (root) class, designated as “thing”. The frames below the line represent instances of some classes, e.g. this book, and are called specific frames. They constitute the leaf nodes in the knowledge graph. Structurally, a frame consists of some slots (elementary boxes in the diagram), which are populated with some fillers (text in those boxes). The fillers in the generic frame indicate the properties that instance of that class should have, e.g. a book should have an author and a publisher. The generic description of the properties are inherited by the child classes. For example, since a person should have a name, and since an author is a person, an author also should have a name (shown in italics in the diagram). The fillers of a generic frame may also contain some specific values (not shown in the diagram), which represent default values for a property, and can either be inherited by or overridden (with explicit specifications) in a child frame, either generic or specific. The fillers of a specific class possess concrete values. They can be of three types (i) a pointer to another specific frame, e.g. Author-1 for the author of Book-1, (ii) literals like “TCS Research” for Author-1’s institution, and (iii) a literal inherited from a generic frame (not shown in the diagram). Thus, each slot of a specific frame represents an attribute-value pair for an entity, which is similar to the binary relations expressed in a semantic network. Thus, the network of specific frames in a frame-based knowledge representation is equivalent to a knowledge-graph. In addition, the schema represented by the generic frames and their relations

115

116

7 Knowledge Representation for Cognitive Vision

imposes a discipline on the knowledge representation. The constraints and the semantics imposed by the schema enable formal reasoning with the knowledge graph. The concept hierarchy and property inheritance in a frame-based system result in a compact knowledge representation. Overriding of inherited property values leads to nonmonotonic reasoning and enables exception handling. This has interesting consequences for visual (as well as other types of) cognition. Many objects have similar descriptions, but with some differences. The visual appearance of such objects can be modeled with a network of frames, with appropriate exceptions added to each frame. Such a similarity network for some pieces of furniture is depicted in Figure 7.4. In order to recognize an object, a machine tries to match the observed visual properties with the prototype of an object. Visual-similarity-based organization helps in efficiently discovering the best matching prototype. If no prototype “sufficiently” matches the observed properties, a new prototype is created and linked to the nearest matching prototype with appropriate exceptions (Minsky 1974). This mechanism appears to closely follow the recognition and learning mechanism of the human mind.

With the widespread availability of the Internet, distributed intelligent systems and knowledge sharing over the web have become a reality. This has prompted the development of standardized languages for web-based knowledge representation. Web Ontology Language (OWL) (McGuinness and van Harmelen 2004), which has been standardized by W3C as a language for knowledge representation and exchange for the web, is based on the frame-based knowledge representation scheme.

Figure 7.4 Similarity network for some pieces of furniture, based on Minsky (1974).


Description logics (DL), a variant of FoL, has been standardized as the reasoning tool for OWL. Probabilistic and fuzzy variants of OWL have been proposed in Ding and Peng (2004) and Bobillo and Straccia (2011), respectively, to deal with the uncertainty of the properties in several domains, such as medical diagnostics and machine inspection.

7.2 Symbol Grounding Problem

Offline RBSs work entirely with internal symbolic representations, without caring for what the symbols represent in the real world. The translation between the real-world entities and their symbolic representations, which is necessary for the application of knowledge-based systems to real-world problems (such as medical diagnosis (Buchanan and Shortliffe 1984)), was left to the human beings operating the system. However, situated agents, i.e. autonomous knowledge-based systems like robots, need to interact directly with the environment without human intervention. It is necessary for these agents to correlate the symbols with the real-world objects to which they refer, in order to successfully interact with their environment. This is known as the Symbol Grounding Problem (SGP). The problem of symbol grounding is abstracted in semiotics,7 which distinguishes between the symbols, the objects that they refer to, and the concepts that they connote. The three entities are related in the form of a semiotic triangle, as shown in Figure 7.5. For example, we have used the symbol “cat” to represent the concept cat (an entity having the properties of a cat), which materializes in the real world as tangible instances of the concept class. Since the internal representation of an agent includes its own mental states, e.g. the epistemic knowledge about what it knows or not, it is not always necessary for all concepts to have a materialization as an object. But whenever such materialization happens, an agent needs to correlate the materialized objects

Figure 7.5 Semiotic triangle.

7 Semiotics is “the study of signs and symbols and their use or interpretation.” – Dictionary.com


with the concepts. An object is “sensed” by an agent using signals captured on some physical device, e.g. visual patterns on a camera. Thus, SGP demands that an agent have methods to correlate the sensor data with a concept; symbol grounding involves anchoring between the symbols in the knowledge-base and the percepts that result from the signal processing associated with the sensors. The frame-based knowledge representation scheme presented in Section 7.1.3 is essentially a declarative form of knowledge. To enable symbol grounding, a frame may contain a slot whose filler contains the specification of the visual properties of a concept. Some “hybrid” cognitive systems (Kennedy and Trafton 2006) use procedural knowledge (e.g. a computer vision program for object detection) instead of a declarative specification. Some other systems (Bertini et al. 2009) incorporate visual prototypes (examples), which are processed by an externally defined routine. In these systems, the SGP is resolved with external procedural knowledge, not seamlessly integrated with the knowledge representation framework.
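The procedural-grounding idea can be sketched as a frame whose slot holds a callable detector. The toy detector below is a stand-in for a real computer-vision routine; all names here are illustrative assumptions.

```python
# A minimal sketch of grounding a symbol with a procedural filler: the
# frame's "detector" slot maps sensor data to a truth value.
def toy_cat_detector(percept):
    # placeholder for an object-detection method over processed sensor data
    return "cat" in percept

cat_frame = {
    "symbol": "cat",
    "properties": {"legs": 4, "sound": "meow"},
    "detector": toy_cat_detector,   # procedural knowledge for grounding
}

percept = ["sky", "cat", "grass"]   # stand-in for a processed camera frame
if cat_frame["detector"](percept):
    print(f"grounded symbol: {cat_frame['symbol']}")
```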

7.3 Perceptual Knowledge

Human expressions and communications have exploited two independent cognitive processes, naturalistic images and abstract signs, since prehistoric times. Though it had earlier been believed that abstract symbolism originated relatively late in human history, recent research shows that abstract symbols coexisted with naturalistic images in many of the oldest Paleolithic caves (Haarmann 2005). The symbolism has matured through the ages into the modern system of text and natural language. Even today, we find the two modes of expression complementing each other, as text complemented with illustrations (like in this book) and images annotated with text. Modern on-line communication often relies on multimedia, where naturalistic expression in audio-visual form is complemented with text and speech. Figure 7.6 depicts the evolution of human naturalistic and symbolic expressions through the ages. These examples amply demonstrate the need for a unified knowledge representation scheme to seamlessly interact with naturalistic (perceptual) and symbolic forms of expression, which generally coexist. This is necessary to integrate the layers of sensory perception with the higher-level cognitive tasks as well. Traditionally, conceptual (symbolic) and perceptual (sub-symbolic) representations have been considered disparate. The gap between them is often referred to as the semantic gap, bridging which has always posed a formidable challenge to the computer vision community. In Sections 7.3.1–7.3.4 and 7.4, we proceed to characterize the perceptual knowledge and discuss approaches to express it together with symbolic knowledge in a unified ontological framework.


Figure 7.6 Knowledge in perceptual and symbolic forms. (a) Coexistence of symbols with naturalistic paintings in early cave paintings (Bhimbetka caves, India). (b) Maturing of symbolic expressions (Rock art, Val Camonica, Italy). Source: Reproduced from Haarmann (2005) with permission from the authors. (c) Text with illustration (a page from a draft version of Chapter 5). (d) Communication in multimedia format. Source: File shared by ETC@USC through Wikimedia Commons, file name: Motorola_Atrix_4G_HD_Multimedia_Dock.jpg.


7.3.1 Representing Perceptual Knowledge

We look at the very nature of the evolution of concepts to establish a relation between the conceptual and perceptual worlds. In real life, we observe many instances of


Figure 7.7 Concepts and their perceptual representations.

the objects, and we tend to organize them by their functional similarities. We have discussed a Bayesian model for taxonomy learning, based on object properties, in Chapter 3. Functional and perceptual similarities are often interlinked; e.g. all vehicles that carry passengers have a certain commonality in their looks. We associate a concept with each group of similar objects. Concepts that are functionally or perceptually close to each other get interconnected. For example, the objects in Figure 7.7 can be organized in a class “automobiles”, with subclasses “cars” and “trucks,” with the common visual properties of the classes attributed to them. Thus, the concepts and their visual properties are inherently interlinked. It is also apparent from the above discussion that a frame-like knowledge representation scheme that supports property inheritance is useful for making the representation of perceptual knowledge compact and for enabling it to handle exceptions. However, abductive reasoning is more conducive to the interpretation of visual signals than the deductive logic8 that rules the frame-based systems. We shall see the synthesis of abductive reasoning with frame-based representation for integrating perceptual and symbolic knowledge later in this chapter.

7.3.2 Structural Description of Scenes

A scene, e.g. the one depicted in Figure 7.8a, generally consists of several semantically interconnected objects. The organization of the objects in a scene provides a semantic specification of the latter. For example, Li et al. (2010) report a scene classifier that uses objects as its semantic features.

8 See discussions in Chapter 3.


Figure 7.8 Representations of a scene in terms of contained objects. (a) A study room as a scene. (b) Scene hierarchy representation. (c) Scene graph representation.

A declarative way to describe a scene is to organize its contents in a hierarchical fashion (Vasudevan et al. 2007; Liu et al. 2014), as depicted in Figure 7.8b. The semantics of the scene can be formally represented with a probabilistic specification of the hierarchical structure, as S = ⟨L, R, P⟩, where L represents a list of objects, R a set of production rules (defining the hierarchy at each level, e.g. study-area → {desk, chair, …}), and P a set of probabilities for an object to be present in a given context. The probability values can be learned over several observations of scenes of the same semantic class. Another way to describe a scene in terms of its constituent objects is by using a graph representing the interactions between the objects, called a scene graph (Johnson et al. 2015). A scene graph provides an informal description of the relations (generally spatial) that exist between any pair of objects in the scene. A small section of the scene graph for the same scene is shown in Figure 7.8c. The two representations can be combined to form a complete description of the arrangement of the visual objects in a scene.
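The two representations can be sketched with plain data structures. The identifiers and probability values below are illustrative choices for the study-room scene, not taken from the cited systems.

```python
# A minimal sketch of the S = <L, R, P> scene hierarchy and a scene graph.
# L: objects; R: production rules; P: presence probabilities in context.
scene_model = {
    "L": ["study", "study-area", "storage-area", "desk", "chair", "laptop"],
    "R": {
        "study": ["study-area", "storage-area"],
        "study-area": ["desk", "chair", "laptop"],
    },
    "P": {("study-area", "desk"): 0.95,
          ("study-area", "laptop"): 0.7},
}

# A scene graph: directed, labeled spatial relations between object pairs.
scene_graph = [
    ("laptop", "top-of", "desk"),
    ("chair", "front-of", "desk"),
]

for subj, rel, obj in scene_graph:
    print(f"{subj} --{rel}--> {obj}")
```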


7.3.3 Qualitative Spatial and Temporal Relations

Knowledge about the spatial composition of objects and the temporal composition of events is an important aspect of perceptual knowledge. For example, the relative positions of the legs, the seat, and the backrest are essential ingredients of the visual description of a chair. A qualitative representation of the spatial and temporal composition of an event is generally preferred over a quantitative specification, since the former is generic in nature while the quantitative values may vary across instances. Explicit representations of qualitative spatial and temporal relations can be reasoned with (Cohn et al. 2006). For example, given the statements “A ← B” and “B ← C” (where ← represents the spatial relation “to the left of”), it is possible to infer “A ← C”. The human mind is believed to reason with such abstract qualitative relations. There are many approaches to represent and reason with such qualitative abstractions of spatial and temporal relations (Cohn and Renz 2008; Dylla et al. 2017); we present a few representative ones.

Most of the qualitative temporal and spatial relations are based on an interval algebra proposed by Allen (1983), which describes the temporal relations between two events in terms of formal equality and inequality relations between their start and end times. The temporal relations between two finite events A and B, with their start and end-points designated by (A_s, A_e) and (B_s, B_e), respectively, can be established from the different outcomes of the comparison of the end-points, subject to the obvious constraints A_s ≤ A_e and B_s ≤ B_e. The valid combinations lead to 13 distinct temporal relations between two events, designated by names (and mnemonics) like before (<), meets (m), etc., as shown in Figure 7.9. Note that relations 8–13, shown in the right column of the figure, represent the inverses of relations 1–6, shown on the left. These relations have been adopted in the time ontology in OWL (Hobbs and Pan 2017).

A real-life event is a multidimensional entity bounded in the spatial and temporal dimensions. While Allen originally defined the relations with respect to time, they are readily applicable to any spatial dimension as well. Thus, one may be tempted to express the spatiotemporal relations between two events as a conjunction of the relations of their projections on each of the space and time axes. However, such expressions often lead to ambiguities. For example, Figure 7.10a,b represent two distinct spatial relations, with respect to their intersection behavior, for two 2D regions A and B. But the projections of the two regions on the X and Y dimensions share the same spatial relations in the two diagrams, making the two cases indistinguishable. One way to resolve such ambiguity in spaces of two or more dimensions is by introducing an additional set of six containment relations, as shown in Figure 7.11. These containment relations, together with Allen's relations computed for the projections of the events on each of the space and time dimensions, can uniquely describe the qualitative spatiotemporal relations between two events.

7.3 Perceptual Knowledge

A

A

1

A before B (A < B)

B before A (A > B) A

A A meets B (A m B)

B meets A (A mi B) A

A

3

B overlaps A (A oi B) A B

A B A starts B (A s B)

A

9

B

A during B (A d B)

B during A (A di B) A

A

6

10

B starts A (A si B)

A B

5

11

B

B A overlaps B (A o B)

4

12

B

B

2

13

B

B

8

B

B

B finishes A (A fi B)

A finishes B (A f B) A B

7

A equals B (A = B)

Figure 7.9

Allen’s temporal relations.
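The classification of Allen's relations from interval end-points is a straightforward case analysis; the following is a minimal sketch for proper intervals (start strictly before end), with an illustrative function name.

```python
# A minimal sketch of classifying Allen's 13 temporal relations between
# two intervals A = (a_s, a_e) and B = (b_s, b_e).
def allen_relation(a_s, a_e, b_s, b_e):
    """Return the Allen mnemonic relating interval A to interval B."""
    if a_e < b_s:  return "<"    # A before B
    if a_e == b_s: return "m"    # A meets B
    if b_e < a_s:  return ">"    # B before A
    if b_e == a_s: return "mi"   # B meets A
    if a_s == b_s and a_e == b_e: return "="  # A equals B
    if a_s == b_s: return "s" if a_e < b_e else "si"   # starts
    if a_e == b_e: return "f" if a_s > b_s else "fi"   # finishes
    if b_s < a_s and a_e < b_e: return "d"    # A during B
    if a_s < b_s and b_e < a_e: return "di"   # B during A
    return "o" if a_s < b_s else "oi"         # partial overlaps

print(allen_relation(0, 5, 3, 8))   # -> 'o' (A overlaps B)
```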

Figure 7.10 Ambiguity with Allen's relations in 2D space. (a) Regions A and B do not intersect. (b) Regions A and B intersect.


Figure 7.11 Containment relations (outside, contains, inside, overlaps, touches, skirts) shown in two-dimensional space.

7.3.4 Inexact Spatiotemporal Relations

Allen has defined an interval algebra for discovering the relation between two events A and C, when the relation between A and B and that between B and C are known.9 It may be noted that some relation-pairs cannot be unambiguously resolved using this algebra; for example, multiple possibilities exist for the relation between A and C, given the relations “A o B” and “B o C”. The ambiguities in resolving the spatiotemporal relations between events motivate the representation of inexact knowledge for the relations between two events. Such inexact relations are also justified by the fact that the end-points of the events are often imprecisely perceived, either because of inherent ambiguity in the event boundaries or because of limitations in the perception mechanism.

The inexactness of the relations has been analyzed in Freksa (1992) in light of the semantics of Allen's relations. Figure 7.12 places Allen's relations on a two-dimensional grid marked with the comparisons between the start and end-points of the two events A and B. Note that only one of the comparison parameters changes between any two adjacent relations, vertically, horizontally, or diagonally. The adjacent cells in the diagram (e.g. < and m) represent a neighborhood relation between Allen's relations. It is possible to move from one relation to another in its neighborhood by continuously deforming (i.e. by moving the end-points of) either or both of the events, without satisfying any other relation at an intermediate stage. Thus, an error in perceiving the end-points of the events is likely to result in misclassification of a relation as another in its neighborhood, and the adjacent relations in Figure 7.12 are therefore called conceptual neighbors. A set of relations forms a conceptual neighborhood if its elements are connected through conceptual neighbor relations (e.g. {<, m, o}).

Figure 7.12 Semantics of Allen's relations, based on Freksa (1992).
the end-points of two events are not perceived with certainty, it is still possible to represent their relation coarsely as disjunction between a set of conceptual neighbors. Ambiguous outcomes from Allen’s algebra can also be expressed through such incomplete specifications involving conceptual neighbors. For example, the ambiguous outcome from the relations “A o B” and “B o C” can be expressed as “A r C”, where r ≡ ∪(