The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition


Table of contents:
Contents
Preface
Figure Credits
Introduction: Trends in Intelligent Multimodal-Multisensorial Interfaces: Cognition, Emotion, Social Signals, Deep Learning, and More
PART I. MULTIMODAL SIGNAL PROCESSING AND ARCHITECTURES
1. Challenges and Applications in Multimodal Machine Learning
2. Classifying Multimodal Data
3. Learning for Multimodal and Affect-Sensitive Interfaces
4. Deep Learning for Multisensorial and Multimodal Interaction
PART II. MULTIMODAL PROCESSING OF SOCIAL AND EMOTIONAL STATES
5. Multimodal User State and Trait Recognition: An Overview
6. Multimodal-Multisensor Affect Detection
7. Multimodal Analysis of Social Signals
8. Real-Time Sensing of Affect and Social Signals in a Multimodal Framework: A Practical Approach
9. How Do Users Perceive Multimodal Expressions of Affects?
PART III. MULTIMODAL PROCESSING OF COGNITIVE STATES
10. Multimodal Behavioral and Physiological Signals as Indicators of Cognitive Load
11. Multimodal Learning Analytics: Assessing Learners’ Mental State During the Process of Learning
12. Multimodal Assessment of Depression from Behavioral Signals
13. Multimodal Deception Detection
PART IV. MULTIDISCIPLINARY CHALLENGE TOPIC
14. Perspectives on Predictive Power of Multimodal Deep Learning: Surprises and Future Directions
Index
Biographies
Volume 2 Glossary


The Handbook of Multimodal-Multisensor Interfaces, Volume 2

ACM Books
Editor in Chief: M. Tamer Özsu, University of Waterloo
ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition Editors: Sharon Oviatt, Monash University; Björn Schuller, University of Augsburg and Imperial College London; Philip R. Cohen, Monash University; Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI); Gerasimos Potamianos, University of Thessaly; Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI) 2018

Declarative Logic Programming: Theory, Systems, and Applications Editors: Michael Kifer, Stony Brook University Yanhong Annie Liu, Stony Brook University 2018

The Sparse Fourier Transform: Theory and Practice Haitham Hassanieh, University of Illinois at Urbana-Champaign 2018

The Continuing Arms Race: Code-Reuse Attacks and Defenses Editors: Per Larsen, Immunant, Inc. Ahmad-Reza Sadeghi, Technische Universität Darmstadt 2018

Frontiers of Multimedia Research Editor: Shih-Fu Chang, Columbia University 2018

Shared-Memory Parallelism Can Be Simple, Fast, and Scalable Julian Shun, University of California, Berkeley 2017

Computational Prediction of Protein Complexes from Protein Interaction Networks Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience Chern Han Yong, Duke-National University of Singapore Medical School Limsoon Wong, National University of Singapore 2017

The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations Editors: Sharon Oviatt, Incaa Designs; Björn Schuller, University of Passau and Imperial College London; Philip R. Cohen, Voicebox Technologies; Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI); Gerasimos Potamianos, University of Thessaly; Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI) 2017

Communities of Computing: Computer Science and Society in the ACM Thomas J. Misa, Editor, University of Minnesota 2017

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining ChengXiang Zhai, University of Illinois at Urbana–Champaign Sean Massung, University of Illinois at Urbana–Champaign 2016

An Architecture for Fast and General Data Processing on Large Clusters Matei Zaharia, Stanford University 2016

Reactive Internet Programming: State Chart XML in Action Franck Barbier, University of Pau, France 2016

Verified Functional Programming in Agda Aaron Stump, The University of Iowa 2016

The VR Book: Human-Centered Design for Virtual Reality Jason Jerald, NextGen Interactions 2016

Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age Robin Hammerman, Stevens Institute of Technology Andrew L. Russell, Stevens Institute of Technology 2016

Edmund Berkeley and the Social Responsibility of Computer Professionals Bernadette Longo, New Jersey Institute of Technology 2015

Candidate Multilinear Maps Sanjam Garg, University of California, Berkeley 2015

Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University 2015

A Framework for Scientific Discovery through Video Games Seth Cooper, University of Washington 2014

Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers Bryan Jeffrey Parno, Microsoft Research 2014

Embracing Interference in Wireless Systems Shyamnath Gollakota, University of Washington 2014

The Handbook of Multimodal-Multisensor Interfaces, Volume 2 Signal Processing, Architectures, and Detection of Emotion and Cognition Sharon Oviatt Monash University

Björn Schuller University of Augsburg and Imperial College London

Philip R. Cohen Monash University

Daniel Sonntag German Research Center for Artificial Intelligence (DFKI)

Gerasimos Potamianos University of Thessaly

Antonio Krüger Saarland University and German Research Center for Artificial Intelligence (DFKI)

ACM Books #21

Copyright © 2019 by the Association for Computing Machinery and Morgan & Claypool Publishers

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

The Handbook of Multimodal-Multisensor Interfaces, Volume 2
Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger, editors
books.acm.org
www.morganclaypoolpublishers.com

ISBN: 978-1-97000-171-6 hardcover
ISBN: 978-1-97000-168-6 paperback
ISBN: 978-1-97000-169-3 eBook
ISBN: 978-1-97000-170-9 ePub

Series ISSN: 2374-6769 print, 2374-6777 electronic

DOIs:
10.1145/3107990 Book
10.1145/3107990.3107991 Preface
10.1145/3107990.3107992 Introduction
10.1145/3107990.3107993 Chapter 1
10.1145/3107990.3107994 Chapter 2
10.1145/3107990.3107995 Chapter 3
10.1145/3107990.3107996 Chapter 4
10.1145/3107990.3107997 Chapter 5
10.1145/3107990.3107998 Chapter 6
10.1145/3107990.3107999 Chapter 7
10.1145/3107990.3108000 Chapter 8
10.1145/3107990.3108001 Chapter 9
10.1145/3107990.3108002 Chapter 10
10.1145/3107990.3108003 Chapter 11
10.1145/3107990.3108004 Chapter 12
10.1145/3107990.3108005 Chapter 13
10.1145/3107990.3108006 Chapter 14
10.1145/3107990.3108007 Index/Bios/Glossary

A publication in the ACM Books series, #21
Editor in Chief: M. Tamer Özsu, University of Waterloo
Area Editor: Michel Beaudouin-Lafon, Université Paris-Sud

This book was typeset in Arnhem Pro 10/14 and Flama using ZzTEX.

First Edition
10 9 8 7 6 5 4 3 2 1

This book is dedicated to our families, whose patience and support sustained the year-long effort required to organize, write, and manage different stages of this multi-volume project.

Contents

Preface xvii
Figure Credits xxi

Introduction: Trends in Intelligent Multimodal-Multisensorial Interfaces: Cognition, Emotion, Social Signals, Deep Learning, and More 1
    A Very Brief History of HCI and AI—and Their Relationship in Time 1
    Increasingly Robust AI as a Game-Changer for HCI 2
    Multimodal Signal Processing, Architectures and Deep Learning 3
    The Advent of Artificial Emotional and Social Intelligence 5
    Insights in the Chapters Ahead 6
    References 15

PART I MULTIMODAL SIGNAL PROCESSING AND ARCHITECTURES 17

Chapter 1 Challenges and Applications in Multimodal Machine Learning 19
Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
    1.1 Introduction 19
    1.2 Multimodal Applications 21
    1.3 Multimodal Representations 23
    1.4 Co-learning 33
    1.5 Conclusion 38
    Focus Questions 38
    References 39

Chapter 2 Classifying Multimodal Data 49
Ethem Alpaydin
    2.1 Introduction 49
    2.2 Classifying Multimodal Data 49
    2.3 Early, Late, and Intermediate Integration 57
    2.4 Multiple Kernel Learning 60
    2.5 Multimodal Deep Learning 62
    2.6 Conclusions and Future Work 64
    Acknowledgments 66
    Focus Questions 66
    References 67

Chapter 3 Learning for Multimodal and Affect-Sensitive Interfaces 71
Yannis Panagakis, Ognjen Rudovic, Maja Pantic
    3.1 Introduction 71
    3.2 Correlation Analysis Methods 75
    3.3 Temporal Modeling of Facial Expressions 83
    3.4 Context Dependency 87
    3.5 Model Adaptation 88
    3.6 Conclusion 90
    Focus Questions 91
    References 91

Chapter 4 Deep Learning for Multisensorial and Multimodal Interaction 99
Gil Keren, Amr El-Desoky Mousa, Olivier Pietquin, Stefanos Zafeiriou, Björn Schuller
    4.1 Introduction 99
    4.2 Fusion Models 100
    4.3 Encoder-Decoder Models 105
    4.4 Multimodal Embedding Models 111
    4.5 Perspectives 122
    Focus Questions 123
    References 123

PART II MULTIMODAL PROCESSING OF SOCIAL AND EMOTIONAL STATES 129

Chapter 5 Multimodal User State and Trait Recognition: An Overview 131
Björn Schuller
    5.1 Introduction 131
    5.2 Modeling 132
    5.3 An Overview on Attempted Multimodal State and Trait Recognition 132
    5.4 Architectures 135
    5.5 A Modern Architecture Perspective 144
    5.6 Modalities 144
    5.7 Walk-through of an Example State 150
    5.8 Emerging Trends and Future Directions 151
    Focus Questions 152
    References 153

Chapter 6 Multimodal-Multisensor Affect Detection 167
Sidney K. D’Mello, Nigel Bosch, Huili Chen
    6.1 Introduction 167
    6.2 Background from Affective Sciences 169
    6.3 Modality Fusion for Multimodal-Multisensor Affect Detection 173
    6.4 Walk-throughs of Sample Multisensor-Multimodal Affect Detection Systems 180
    6.5 General Trends and State of the Art in Multisensor-Multimodal Affect Detection 185
    6.6 Discussion 189
    Acknowledgments 191
    Focus Questions 191
    References 192

Chapter 7 Multimodal Analysis of Social Signals 203
Alessandro Vinciarelli, Anna Esposito
    7.1 Introduction 203
    7.2 Multimodal Communication in Life and Human Sciences 205
    7.3 Multimodal Analysis of Social Signals 208
    7.4 Next Steps 218
    7.5 Conclusions 220
    Focus Questions 222
    References 222

Chapter 8 Real-Time Sensing of Affect and Social Signals in a Multimodal Framework: A Practical Approach 227
Johannes Wagner, Elisabeth André
    8.1 Introduction 227
    8.2 Database Collection 228
    8.3 Multimodal Fusion 233
    8.4 Online Recognition 237
    8.5 Requirements for a Multimodal Framework 240
    8.6 The Social Signal Interpretation Framework 242
    8.7 Conclusion 250
    Focus Questions 253
    References 254

Chapter 9 How Do Users Perceive Multimodal Expressions of Affects? 263
Jean-Claude Martin, Céline Clavel, Matthieu Courgeon, Mehdi Ammi, Michel-Ange Amorim, Yacine Tsalamlal, Yoren Gaffary
    9.1 Introduction 263
    9.2 Emotions and Their Expressions 266
    9.3 How Humans Perceive Combinations of Expressions of Affects in Several Modalities 269
    9.4 Impact of Context on the Perception of Expressions of Affects 276
    9.5 Conclusion 278
    Focus Questions 279
    References 280

PART III MULTIMODAL PROCESSING OF COGNITIVE STATES 287

Chapter 10 Multimodal Behavioral and Physiological Signals as Indicators of Cognitive Load 289
Jianlong Zhou, Kun Yu, Fang Chen, Yang Wang, Syed Z. Arshad
    10.1 Introduction 289
    10.2 State-of-the-Art 292
    10.3 Behavioral Measures for Cognitive Load 301
    10.4 Physiological Measures for Cognitive Load 304
    10.5 Multimodal Signals and Data Fusion 309
    10.6 Conclusion 318
    Funding 320
    Focus Questions 320
    References 321

Chapter 11 Multimodal Learning Analytics: Assessing Learners’ Mental State During the Process of Learning 331
Sharon Oviatt, Joseph Grafsgaard, Lei Chen, Xavier Ochoa
    11.1 Introduction 331
    11.2 What is Multimodal Learning Analytics? 332
    11.3 What Data Resources are Available on Multimodal Learning Analytics? 339
    11.4 What are the Main Themes from Research Findings on Multimodal Learning Analytics? 349
    11.5 What is the Theoretical Basis of Multimodal Learning Analytics? 356
    11.6 What are the Main Challenges and Limitations of Multimodal Learning Analytics? 361
    11.7 Conclusions and Future Directions 363
    Focus Questions 365
    References 366

Chapter 12 Multimodal Assessment of Depression from Behavioral Signals 375
Jeffrey F. Cohn, Nicholas Cummins, Julien Epps, Roland Goecke, Jyoti Joshi, Stefan Scherer
    12.1 Introduction 375
    12.2 Depression 376
    12.3 Multimodal Behavioral Signal Processing Systems 380
    12.4 Facial Analysis 382
    12.5 Speech Analysis 385
    12.6 Body Movement and Other Behavior Analysis 392
    12.7 Analysis using Other Sensor Signals 394
    12.8 Multimodal Fusion 395
    12.9 Implementation-Related Considerations and Elicitation Approaches 398
    12.10 Conclusion and Current Challenges 401
    Acknowledgments 404
    Focus Questions 404
    References 405

Chapter 13 Multimodal Deception Detection 419
Mihai Burzo, Mohamed Abouelenien, Veronica Perez-Rosas, Rada Mihalcea
    13.1 Introduction and Motivation 419
    13.2 Deception Detection with Individual Modalities 422
    13.3 Deception Detection with Multiple Modalities 433
    13.4 The Way Forward 444
    Acknowledgments 445
    Focus Questions 445
    References 446

PART IV MULTIDISCIPLINARY CHALLENGE TOPIC 455

Chapter 14 Perspectives on Predictive Power of Multimodal Deep Learning: Surprises and Future Directions 457
Samy Bengio, Li Deng, Louis-Philippe Morency, Björn Schuller
    14.1 Deep Learning as Catalyst for Scientific Discovery 458
    14.2 Deep Learning in Relation to Conventional Machine Learning 460
    14.3 Expected Surprises of Deep Learning 464
    14.4 The Future of Deep Learning 465
    14.5 Responsibility in Deep Learning 467
    14.6 Conclusion 468
    References 470

Index 473
Biographies 499
Volume 2 Glossary 517

Preface

The content of this handbook is most appropriate for graduate students and of primary interest to students studying computer science and information technology, human-computer interfaces, mobile and ubiquitous interfaces, affective and behavioral computing, machine learning, and related multidisciplinary majors. When teaching graduate classes with this book, whether in a quarter- or semester-long course, we recommend initially requiring that students spend two weeks reading the introductory textbook, The Paradigm Shift to Multimodality in Contemporary Interfaces (Morgan & Claypool Publishers, Human-Centered Interfaces Synthesis Series, 2015). With this orientation, a graduate class providing an overview of multimodal-multisensor interfaces could then select chapters from the current handbook, distributed across topics in the different sections. As an example, in a 10-week course the remaining 8 weeks might be allocated to reading select chapters on: (1) theory, user modeling, and common modality combinations (2 weeks); (2) prototyping and software tools, signal processing, and architectures (2 weeks); (3) language and dialogue processing (1 week); (4) detection of emotional and cognitive state (2 weeks); and (5) commercialization, future trends, and societal issues (1 week). In a more extended 16-week course, we recommend spending an additional week reading and discussing chapters on each of these five topic areas, as well as an additional week on the introductory textbook, The Paradigm Shift to Multimodality in Contemporary Interfaces. As an alternative, in a semester-long course in which students will be conducting a project in one target area (e.g., designing multimodal dialogue systems for in-vehicle use), some or all of the additional time in the semester course could be spent: (1) reading a more in-depth collection of handbook chapters on language and dialogue processing (e.g., 2 weeks) and (2) conducting the hands-on project (e.g., 4 weeks).


For more tailored versions of a course on multimodal-multisensor interfaces, another approach is to have students read the handbook chapters in relevant sections and then follow up with more targeted and in-depth technical papers. For example, a course intended for a cognitive science audience might start by reading The Paradigm Shift to Multimodality in Contemporary Interfaces, followed by assigning chapters from the handbook sections on: (1) theory, user modeling, and common modality combinations; (2) multimodal processing of social and emotional information; and (3) multimodal processing of cognition and mental health status. Afterward, the course could teach students different computational and statistical analysis techniques related to these chapters, ideally through demonstration. Students might then be asked to conduct a hands-on project in which they apply one or more analysis methods to multimodal data to build user models or predict mental states. As a second example, a course intended for a computer science audience might also start by reading The Paradigm Shift to Multimodality in Contemporary Interfaces, followed by assigning chapters on: (1) prototyping and software tools; (2) multimodal signal processing and architectures; and (3) language and dialogue processing. Afterward, students might engage in a hands-on project in which they design, build, and evaluate the performance of a multimodal system. In all of these teaching scenarios, we anticipate that professors will find this handbook to be a particularly comprehensive and valuable current resource for teaching about multimodal-multisensor interfaces.

Acknowledgments

In the present age, reviewers are one of the most precious commodities on earth. First and foremost, we’d like to thank our dedicated expert reviewers, who provided insightful comments on the chapters and their revisions, sometimes on short notice. This select group included Antonis Argyros (University of Crete, Greece), Vassilis Athitsos (University of Texas at Arlington, USA), Nicholas Cummins (University of Augsburg, Germany), Randall Davis (MIT, USA), Jun Deng (audEERING, Germany), Jing Han (University of Passau, Germany), Anthony Jameson (DFKI, Germany), Michael Johnston (Interactions Corp., USA), Thomas Kehrenberg (University of Sussex, UK), Gil Keren (ZD.B, Germany), Elsa Andrea Kirchner (DFKI, Germany), Stefan Kopp (Bielefeld University, Germany), Marieke Longchamp (Laboratoire de Neurosciences Cognitive, France), Vedhas Pandit (University of Passau, Germany), Diane Pawluk (Virginia Commonwealth University, USA), Jouni Pohjalainen (Jabra), Hesam Sagha (audEERING, Germany), Maximillian Schmitt (University of Augsburg, Germany),


Gabriel Skantze (KTH Royal Institute of Technology, Sweden), Zixing Zhang (Imperial College London, UK), and the handbook’s main editors. We’d also like to thank the handbook’s eminent advisory board, 12 people who provided valuable guidance throughout the project, including suggestions for chapter topics, assistance with expert reviewing, participation on the panel of experts in our challenge topic discussions, and valuable advice. Advisory board members included Samy Bengio (Google, USA), James Crowley (INRIA, France), Marc Ernst (Bielefeld University, Germany), Anthony Jameson (DFKI, Germany), Stefan Kopp (Bielefeld University, Germany), András Lőrincz (ELTE, Hungary), Kenji Mase (Nagoya University, Japan), Fabio Pianesi (FBK, Italy), Steve Renals (University of Edinburgh, UK), Arun Ross (Michigan State University, USA), David Traum (USC, USA), Wolfgang Wahlster (DFKI, Germany), and Alex Waibel (CMU, USA). We all know that publishing has been a rapidly changing field, and in many cases authors and editors no longer receive the generous support they once did. We’d like to warmly thank Diane Cerra, our Morgan & Claypool Executive Editor, for her amazing skillfulness, flexibility, and delightful good nature throughout all stages of this project. It’s hard to imagine having a more experienced publications advisor and friend, and for a large project like this one her experience was invaluable. Thanks also to Mike Morgan, President of Morgan & Claypool, for his support on all aspects of this project. Finally, thanks to Tamer Özsu and Michel Beaudouin-Lafon of ACM Books for their advice and support. Many colleagues around the world graciously provided assistance in large and small ways—content insights, copies of graphics, critical references, and other valuable information used to document and illustrate this book. Thanks to all who offered their assistance, which greatly enriched this multi-volume handbook. For financial and professional support, we’d like to thank DFKI in Germany and Incaa Designs, an independent 501(c)(3) nonprofit organization in the US. In addition, Björn Schuller would like to acknowledge support from the European Horizon 2020 Research & Innovation Action SEWA (agreement no. 645094).

Figure Credits

Figure 4.3 Based on: O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015c. Show and tell: A neural image caption generator. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. IEEE.
Figure 4.4 Based on: O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015c. Show and tell: A neural image caption generator. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. IEEE.
Figure 4.5 From: E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. 2015. Generating images from captions with attention. In Proceedings of the International Conference on Learning Representations. Courtesy of the authors. Used with permission.
Figure 4.6 From: D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations. Courtesy of the authors. Used with permission.
Figure 4.7 From: K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pp. 2048–2057. Courtesy of the authors. Used with permission.
Figure 4.8 From: A. E. Mousa. 2014. Sub-Word Based Language Modeling of Morphologically Rich Languages for LVCSR. Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Courtesy of Amr Ibrahim El-Desoky Mousa. Used with permission.
Figure 4.12 From: A. Karpathy and L. Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3128–3137. Copyright © 2015 IEEE. Used with permission.
Figure 4.13 From: S. Reed, Z. Akata, H. Lee, and B. Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. Copyright © 2016 IEEE. Used with permission.
Figure 9.1 Based on: M. Y. Tsalamlal, M-A. Amorim, J-C. Martin, and M. Ammi. 2017. Combining facial expression and touch for perceiving emotional valence. IEEE Transactions on Affective Computing, 99. IEEE.
Figure 10.4 From: NASA-TLX APP on iOS. http://itunes.apple.com/us/app/nasa-tlx/id1168110608. Copyright © 2018 NASA. Used with permission.
Figure 10.5 From: Y. Shi, E. Choi, R. Taib, and F. Chen. 2010. Designing cognition-adaptive human–computer interface for mission-critical systems. In G. A. Papadopoulos, W. Wojtkowski, G. Wojtkowski, S. Wrycza, and J. Zupančič, editors, Information Systems Development, pp. 111–119. Copyright © 2010 Springer. Used with permission.
Figure 10.8 From: N. Nourbakhsh, Y. Wang, F. Chen, and R. A. Calvo. 2012. Using galvanic skin response for cognitive load measurement in arithmetic and reading tasks. In Proceedings of the 24th Australian Computer-Human Interaction Conference, OzCHI ’12, pp. 420–423. Copyright © 2012 ACM. Used with permission.
Figure 10.9 From: W. Wang, Z. Li, Y. Wang, and F. Chen. 2013. Indexing cognitive workload based on pupillary response under luminance and emotional changes. In Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI ’13, pp. 247–256. Copyright © 2013 ACM. Used with permission.
Figure 10.10 From: F. Chen et al. 2012. Multimodal behavior and interaction as indicators of cognitive load. ACM Transactions on Interactive Intelligent Systems, 2(4):22:1–22:36. Copyright © 2012 ACM. Used with permission.
Figure 10.12 From: F. Chen et al. 2012. Multimodal behavior and interaction as indicators of cognitive load. ACM Transactions on Interactive Intelligent Systems, 2(4):22:1–22:36. Copyright © 2012 ACM. Used with permission.
Figure 10.13 From: F. Chen et al. 2012. Multimodal behavior and interaction as indicators of cognitive load. ACM Transactions on Interactive Intelligent Systems, 2(4):22:1–22:36. Copyright © 2012 ACM. Used with permission.
Figure 11.1 From: S. Oviatt and A. Cohen. 2013. Written and multimodal representations as predictors of expertise and problem-solving success in mathematics. In Proceedings of the 15th ACM International Conference on Multimodal Interaction, pp. 599–606. Copyright © 2013 ACM. Used with permission.
Figure 11.2 From: S. Oviatt and A. Cohen. 2013. Written and multimodal representations as predictors of expertise and problem-solving success in mathematics. In Proceedings of the 15th ACM International Conference on Multimodal Interaction, pp. 599–606. Copyright © 2013 ACM. Used with permission.
Figure 11.3 From: C. Leong, L. Chen, G. Feng, C. Lee, and M. Mulholland. 2015. Utilizing depth sensors for analyzing multimodal presentations: Hardware, software and toolkits. In Proceedings of the ACM International Conference on Multimodal Interaction, pp. 547–556. Copyright © 2015 ACM. Used with permission.
Figure 11.4 From: M. Raca and P. Dillenbourg. 2014. Holistic analysis of the classroom. In Proceedings of the ACM International Data-Driven Grand Challenge Workshop on Multimodal Learning Analytics, pp. 13–20. Copyright © 2014 ACM. Used with permission.
Figure 11.5 From: F. Dominguez, V. Echeverria, K. Chiluiza, and X. Ochoa. 2015. Multimodal selfies: Designing a multimodal recording device for students in traditional classrooms. In Proceedings of the ACM International Conference on Multimodal Interaction, pp. 567–574. Copyright © 2015 ACM. Used with permission.
Figure 11.6 From: A. Ezen-Can, J. F. Grafsgaard, J. C. Lester, and K. E. Boyer. 2015. Classifying student dialogue acts with multimodal learning analytics. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, pp. 280–289. Copyright © 2015 ACM. Used with permission.
Figure 12.2 Based on: H. Dibeklioglu, Z. Hammal, Y. Yang, and J. F. Cohn. 2015. Multimodal detection of depression in clinical interviews. In Proceedings of the ACM International Conference on Multimodal Interaction, Seattle, WA. ACM.
Figure 12.3 From: J. Joshi, A. Dhall, R. Goecke, M. Breakspear, and G. Parker. 2012. Neural-net classification for spatio-temporal descriptor based depression analysis. In Proceedings of the International Conference on Pattern Recognition, Tsukuba, Japan, pp. 2634–2638. Copyright © 2012 IEEE. Used with permission.
Figure 12.4 From: N. Cummins. 2016. Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning. Ph.D. Thesis, UNSW Australia. Courtesy of Nicholas Peter Cummins. Used with permission.
Figure 12.5 From: J. Joshi, A. Dhall, R. Goecke, and J. F. Cohn. 2013c. Relative body parts movement for automatic depression analysis. In Proceedings of the Conference on Affective Computing and Intelligent Interaction, pp. 492–497. Copyright © 2013 Springer. Used with permission.
Figure 13.3 From: P. Tsiamyrtzis, J. Dowdall, D. Shastri, I. T. Pavlidis, M. G. Frank, and P. Ekman. 2007. Imaging facial physiology for the detection of deceit. International Journal of Computer Vision, 71(2): 197–214. Copyright © 2007 Springer. Used with permission.

Introduction: Trends in Intelligent Multimodal-Multisensorial Interfaces: Cognition, Emotion, Social Signals, Deep Learning, and More

Close your eyes and imagine the beginning of Human-Computer Interaction (HCI). What do you see? In most minds, these early days of computer interfaces are marked by a direct mapping of user input via buttons, keyboards, joysticks, and the like. In this picture of early-day HCI, Artificial Intelligence (AI)—let alone artificial emotional and social intelligence—would hardly play a role. Since then, things have changed dramatically: AI is now involved whenever we interact with computing systems by speech or gestures, for example. The picture will soon change once more to include the emotional and social capabilities of computer interfaces, as will be introduced in detail in this volume. However, let us start with the history of HCI and AI. After that, the volume will be introduced chapter by chapter.

A Very Brief History of HCI and AI—and Their Relationship in Time

Looking at the history of HCI, what followed keyboards such as those used in teleprinters as an interaction-changing paradigm were arguably the first two-axis joysticks, as early as 1926. The first trackballs for (at the time analog) computers are said to have appeared after the end of World War II; the first patent dates to 1947, when secret military implementations prevailed. Light pens came next, appearing in the mid-1950s (during the Whirlwind project at MIT). The first computer mouse arrived only in 1964 [English et al. 1965]. On the other hand, as early as 1952, Bell Labs realized a first prototype of a speaker-dependent speech recognizer [Juang and Rabiner 2005]. Even though the vocabularies at the time hardly surpassed ten words, it seems worth setting these developments into temporal relation: the roots of Voice User Interfaces (VUIs)—clearly demanding AI for their realization—followed surprisingly quickly after the roots of the Graphical User Interface (GUI). Accordingly, one could argue that AI was applied to improving computer interaction surprisingly early on. Noteworthy in this context is that the term Artificial Intelligence itself was only born in the 1955 Dartmouth summer research project proposal [McCarthy et al. 2006].

Increasingly Robust AI as a Game-Changer for HCI

Over the last two decades, AI's usage in computer interfaces has increased dramatically. Think, for example, of the rise of increasingly naturalistic interaction via handwriting, speech, and video-based input, to name three frequently used "AI-heavy" interaction examples. To stay with VUIs as an example, in 2015 65% of smartphone owners in the United States are said to have used such technology, representing a more-than-twofold usage increase over two years.1 In fact, AI itself has changed dramatically in the last decade. One could argue that it was this change that increased user acceptance of "smarter" interaction technology, as the improvements in AI performance made AI-empowered interaction more usable. Using speech once more as an example, by 2018 Microsoft and IBM had reached a word error rate slightly above 5% on the popular Switchboard2 telephone-conversation speech recognition task, i.e., roughly 1 out of 20 words erroneously transcribed. This is comparable to human parity and exceeds a single human expert's transcription abilities. To put this into perspective, on the same task the word error rate was around 43% in 1995, around 15% in 2004, and 8% in 2016. Clearly, this improvement in robustness is a decisive factor when it comes to one's choice of modality for interaction.

This change in AI's reliability in interaction use cases is usually attributed largely to the recent achievements in deep learning. While the term appeared in the 1980s [Dechter 1986] and was used in the context of artificial neural networks (ANNs), the underlying machine learning paradigm of today's deep learning appeared later. Important early breakthroughs came largely from solutions provided by Hochreiter and Schmidhuber [1997] and Hinton et al. [2006]. These contributions support efficiently training "deeper" ANNs with more than three or even up to hundreds of information-processing layers. Furthermore, the networks can also be trained efficiently in recurrent topologies, which allows for better modeling of time-series data such as typical interaction data. Overall, these deeper networks can thus unleash their potential by learning efficiently from larger amounts of data, including largely unlabeled data. Up to millions or even billions of learning parameters can be trained in these networks; in fact, the number of neurons in the human brain has already been surpassed. Such high numbers of learning parameters allow for the formation of the highly complex decision boundaries that occur in demanding real-world interaction tasks, e.g., robust speech recognition or natural language understanding.

1. Business Insider, June 10, 2016. Data basis: Kleiner Perkins.
2. Linguistic Data Consortium, 1993/1997.
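To make the idea of deep, recurrent models for interaction time series a bit more tangible, here is a minimal sketch in PyTorch. It is not drawn from any chapter of this volume; the feature dimension, layer sizes, and number of user-state classes are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentStateClassifier(nn.Module):
    """Toy stacked-GRU classifier for an interaction time series
    (e.g., a sequence of per-frame feature vectors). Sizes are illustrative."""
    def __init__(self, num_features=40, hidden_size=64, num_layers=3, num_states=4):
        super().__init__()
        # Several recurrent layers stacked on top of each other: a "deeper" network.
        self.rnn = nn.GRU(num_features, hidden_size, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, num_states)

    def forward(self, x):              # x: (batch, time, num_features)
        _, h = self.rnn(x)             # h: (num_layers, batch, hidden_size)
        return self.head(h[-1])        # classify from the last layer's final state

model = RecurrentStateClassifier()
dummy = torch.randn(8, 100, 40)        # 8 sequences, 100 time steps, 40 features each
logits = model(dummy)                  # (8, 4) unnormalized class scores
```

Stacking several recurrent layers in this way gives depth both over layers and over time; the chapters ahead discuss far richer architectures built on the same principle.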

Multimodal Signal Processing, Architectures and Deep Learning

Multimodal processing of human (input) signals and their synergistic combination holds several promises for making interaction more robust against failure. For example, if one or several modalities suffer from adverse conditions, information from the other modalities can still be exploited. Further, multimodal interaction has repeatedly been shown to contribute to more efficient, natural, and enjoyable interaction. It is therefore not surprising that the idea to go multimodal has been "out there" for quite some time, such as in a mid-1980s work that combined the keyboard with speech input [Mitchell and Forren 1987]. Multimodal interaction does not, however, always live up to expectations [Oviatt 1999]. A crucial factor is the optimal integration of the information sources. Accordingly, a plethora of machine learning strategies has been introduced, usually ranging from early fusion of the signals to late fusion of the decisions made per signal source. Perhaps the first attempt to model multimodal human interaction—albeit between humans—by exploiting deep learning appeared in the early days of this branch of AI [Reiter et al. 2006]. Since then, deep learning has broadly found its way into multimodal signal processing for human communication and interaction analysis. In fact, deep learning has increasingly replaced conventional expert-crafted signal processing of human input: unsupervised representation learning and learning "end-to-end" from raw data have replaced traditional pre-processing and feature extraction. Deep neural networks not only learn to make a final decision based on extracted features, but also learn the feature extraction and pre-processing directly from (user) data. Thanks to all this progress, it is becoming an everyday reality that interaction is multimodal: for example, one can speak to a smartphone to enter a navigation target and use finger gestures to zoom or move the map, or use gestures and speech to control a smart TV or video console.
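As a concrete illustration of the early-to-late fusion range mentioned above, the following is a minimal sketch in Python with scikit-learn. The two "modalities" are random placeholder arrays, and the use of logistic regression is an assumption made purely for illustration, not a method prescribed by the chapters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 20))    # placeholder acoustic features
X_video = rng.normal(size=(200, 30))    # placeholder visual features
y = rng.integers(0, 2, size=200)        # placeholder binary labels

# Early fusion: concatenate the per-modality feature vectors,
# then train a single classifier on the joint representation.
early_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: train one classifier per modality and combine their
# decisions afterwards, here by averaging posterior probabilities.
audio_clf = LogisticRegression(max_iter=1000).fit(X_audio, y)
video_clf = LogisticRegression(max_iter=1000).fit(X_video, y)
late_proba = (audio_clf.predict_proba(X_audio) + video_clf.predict_proba(X_video)) / 2
late_pred = late_proba.argmax(axis=1)   # fused decision per sample
```

Everything in between these two extremes, i.e., intermediate fusion, merges the modalities at some internal level of abstraction, as the glossary entry on shared hidden layers below illustrates.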


Glossary

Canonical Correlation Analysis (CCA) is a tool to infer information based on cross-covariance matrices. It can be used to identify correlation across heterogeneous modalities or sensor signals. Let each modality be represented by a feature vector and let us assume there is correlation across these, such as in audiovisual speech recognition. Then, CCA will identify linear combinations of the individual features with maximum correlation amongst each other.

Confidence measure is the information on the assumed certainty of a decision made by a machine learning algorithm.

Cooperative learning in machine learning is a combination of active learning and semi-supervised learning. In the semi-supervised learning part, the machine learning algorithm labels unlabeled data based on its own previously learned model. In the active learning part, it identifies which unlabeled data is most important to be labeled by humans. Usually, cooperative learning tries to minimize the amount of data to be labeled by humans while maximizing the gain in accuracy of a learning algorithm. This can be based on confidence measures, such that the machine labels unlabeled data itself as long as it is sufficiently confident in its decisions. It asks humans for help only where its confidence is insufficient but the data seem to be highly informative.

Dynamic Time Warping (DTW) is a machine learning algorithm to align two time series, such as feature vectors extracted over time, based on similarity measurement. This similarity is often measured by distance measures such as Euclidean distance, or based on correlation, such as when aligning heterogeneous modalities. A classical application example is speech recognition, where words spoken at different speeds are aligned in time to measure their similarity. DTW aims at a maximized match between the two observation sequences, usually based on local and global alignment path search restrictions.

Encoder-decoder architectures in deep learning start with an encoder neural network which, based on its input, usually outputs a feature map or vector. The second part, the decoder, is a further network that, based on the feature vector from the encoder, provides the closest match either to the input or to an intended output. In most cases, the decoder employs the same network structure but in opposite orientation. Usually, the training is carried out on unsupervised data, i.e., without labels; the learning target is to minimize the reconstruction error, i.e., the delta between the input to the encoder and the output of the decoder. A typical application is to use encoder-decoder architectures for sequence-to-sequence mapping, such as in machine translation, where the encoder is trained on sequences (phrases) in one language and the decoder is trained to map its representation to a sequence (phrase) in another language.
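To make the DTW entry above concrete, here is a minimal pure-NumPy dynamic-programming sketch that computes the accumulated alignment cost between two feature sequences of different lengths. The Euclidean local distance and the absence of path constraints are simplifying assumptions; this is an illustration, not the formulation used in any particular chapter.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping between two sequences of feature vectors.

    seq_a: (n, d) array, seq_b: (m, d) array. Returns the accumulated
    alignment cost; smaller means the sequences are more similar."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # Euclidean local cost
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i, j] = local + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

# Example: the same "word" produced at two different speeds (toy 1-D features).
fast = np.linspace(0, 1, 20).reshape(-1, 1)
slow = np.linspace(0, 1, 35).reshape(-1, 1)
print(dtw_distance(fast, slow))   # small cost despite the different lengths
```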


Glossary (continued)

Shared hidden layer is a layer within a neural network which is shared within the topology. For example, different modalities, different output classes, or even different databases could be trained mostly within separate parts of the network; in the shared hidden layer, however, they share neurons through corresponding connections. This can be an important approach to model diverse information types largely independently while providing mutual information exchange at some point in the topology of a neural network.

Transfer learning helps to reuse knowledge gained in one task in another task in machine learning. It can be executed on different levels, such as the feature or model level. For example, a neural network can first be trained on a task related to the task of interest; then, the actual task of interest is trained "on top" of this pre-training of the network. Likewise, rather than starting to train the target task of interest from a random initialization of a network, related data could be used to provide a better starting point.

Zero-shot learning is a method in machine learning to learn a new task without any training examples for this task. An example could be recognizing a new type of object without any visual example, based instead on a semantic description such as specific features that describe the object.
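The shared hidden layer entry above can be sketched in a few lines of PyTorch: each modality first passes through its own early layers, and a shared layer then merges the two streams before classification. The two-modality setup, layer sizes, and class count are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SharedLayerFusion(nn.Module):
    """Toy intermediate-fusion network: modality-specific early layers,
    then a shared hidden layer that merges both streams."""
    def __init__(self, dim_audio=20, dim_video=30, hidden=32, num_classes=2):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(dim_audio, hidden), nn.ReLU())
        self.video_branch = nn.Sequential(nn.Linear(dim_video, hidden), nn.ReLU())
        # Shared layer: both modalities contribute to the same neurons.
        self.shared = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, x_audio, x_video):
        merged = torch.cat([self.audio_branch(x_audio), self.video_branch(x_video)], dim=-1)
        return self.out(self.shared(merged))

net = SharedLayerFusion()
scores = net(torch.randn(4, 20), torch.randn(4, 30))   # (4, 2) class scores
```

Fusing after a few modality-specific layers, rather than at the raw-feature or final-decision level, is exactly the kind of intermediate integration discussed in Chapters 2 and 4.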


The Advent of Artificial Emotional and Social Intelligence

Increasingly "intelligent" interaction in more and more multimodal ways is, however, still lacking a major factor observed in human-to-human interaction and communication: emotional and social intelligence. With the rise of Affective Computing [Picard 1995], the technical means of integrating user emotion recognition, or computer emotion simulation as in dialogs, have become available. In fact, the first patents and research implementations date even earlier, such as a 1978 patent on a speech analyzer to "determine the emotional state of a person" [Williamson 1978]. The first computer-based implementations appeared in the 1990s for emotional speech analysis [Chiu et al. 1994] and emotional speech synthesis [Cahn 1990]. Implementations of facial emotion recognition appeared at a similar time [Mase 1991, Kobayashi and Hara 1992], as did ground-laying work on facial expression synthesis [Parke 1972]. Today, the field has widened to modeling a large variety of human cognitive states as well, such as cognitive load, attention, or interest. For recognizing such information, the exploitation of multimodal information now looks back on more than two decades of experience: early ideas were formulated as blueprints, e.g., [De Silva et al. 1997], and implemented [Chen et al. 1998]. It is by now largely acknowledged that a multimodal approach is highly synergistic; for instance, voice acoustics are well suited to reveal the arousal of a user, while facial expression or verbal content indicate the valence.

In addition, artificial social intelligence that recognizes and interprets behavioral cues of users, such as social signals [Pentland 2007], is becoming a further aid to better model and understand a user and her behavior. The difference lies largely in a focus on dyadic or multi-party communication and interaction. Instead of analyzing the speech of a user in isolation, the communication flow or the adaptation between communication partners is of interest. This could include turn-taking behavior or a change in mean pitch to adapt to the conversational partner. An example from the visual domain would be the presence or absence of eye contact with a robot during interaction. Spatial relations are also of interest, such as head orientation and the distance or orientation to the conversational partner. Further, rather than focusing on states such as emotion, social signal processing often deals with short events such as eye blinks or laughter.

Insights in the Chapters Ahead

This handbook presents chapters that summarize basic research and development of multimodal-multisensor systems, including their status today and rapidly growing future directions. The initial volume [Oviatt et al. 2017] introduced relevant theory and neuroscience foundations, approaches to design and user modeling, and an in-depth look at some common modality combinations. The present second volume summarizes multimodal-multisensor system signal processing, architectures, and the emerging use of these systems for detecting emotional and cognitive states. The third volume [Oviatt et al. 2018] presents multimodal language and dialogue processing, software tools and platforms, commercialization of applications, and emerging technology trends and societal implications. Collectively, these handbook chapters address a comprehensive range of central issues in this rapidly changing field. In addition, each volume includes selected challenge topics, in which an international panel of experts exchanges views on an especially consequential, timely, and controversial problem in the field that is in need of insightful resolution. We hope these challenge topics will stimulate talented students to tackle these important societal problems and motivate the rest of us to envision and plan for our technology future.

Information presented in these three volumes is intended to provide a comprehensive state-of-the-art resource for professionals, business strategists and technology funders, interested lay readers, and advanced undergraduate and graduate students in this multidisciplinary computational field. To enhance its pedagogical value to readers, many chapters include valuable digital resources, such as pointers to open-source tools, databases, video demonstrations, and case-study walk-throughs, to assist in designing, building, and evaluating multimodal-multisensor systems. Each handbook chapter defines the basic technical terms required to understand its topic. Educational resources, such as focus questions, are included to support readers in mastering newly presented materials.

Multimodal Signal Processing and Architectures

The first of the three parts of this volume is dedicated to multimodal signal processing and to architectures for multimodal and multisensorial information fusion. In Chapter 1, Baltrušaitis et al. introduce a taxonomy of multimodal machine learning divided into five aspects, namely: (1) representation, i.e., how to represent and summarize multimodal data "in a way that exploits the complementarity and redundancy of multiple modalities"; (2) alignment ("identify the direct relations between (sub)elements from two or more different modalities"); (3) translation ("how to map data from one modality to another"); (4) fusion ("join information from two or more modalities"); and (5) co-learning ("transfer knowledge between modalities, their representation, and their predictive models"). The authors highlight that there might not be a single optimal machine learning solution for all of these aspects; the best choice depends on the task. They further argue that the current increase in research on representation and translation benefits novel solutions and applications for multimodality. According to the authors, the taxonomy should help catalog research contributions along this scheme. Illustrative examples featured in the chapter include speech recognition and synthesis, event detection, emotion and affect, media description, and multimedia retrieval. Each of the five aspects is used to identify core challenges for these tasks, as in the case of multimodal representation, which is broken down into two variants: (1) joint multimodal representation, projecting the multimodal data into one space—mostly when all modalities are present—and (2) coordinated multimodal representation, projecting individual modalities into separate but coordinated spaces—usually when only one modality is present. A final major challenge is co-learning, where the modeling of a resource-poor modality is aided by a resource-rich modality.

In Chapter 2, Alpaydin explains how multimodal data from different sensors and sources can be fused intelligently so as to best exploit complementary information and reach higher accuracy in classification tasks. The prevailing approaches of early fusion at the feature level and late fusion at the semantic decision level are introduced, alongside intermediate solutions. In this compromise between early and late fusion, a single classifier is learned, yet it processes the input of each modality in a rather abstract form. To this end, multiple kernel learning, which unites a predefined set of kernels, e.g., for support vector learning, is presented as a first paradigm; here, a kernel tailored to each modality is applied (see the short sketch at the end of this Part I overview). In addition, a corresponding intermediate fusion strategy based on deep learning is introduced. It relies on shared hidden layers within a neural network, i.e., the early layers of the network are kept separate per modality, whereas later layers combine the information from the different modalities. The chapter provides an extensive discussion of the advantages and shortcomings of each way of integrating the information from the different sources; the outcome, however, is that an optimal solution usually needs to be tailored to the needs of the use case. The chapter stresses the importance of the layer of abstraction at which the correlation between features is expected or can best be exploited across modalities. At the chapter's end, perspectives on potential future endeavors toward improved fusion are given.

Chapter 3 is largely dedicated to the same topic, i.e., optimal fusion of information sources, yet it highlights further aspects. Panagakis et al. stress the importance of temporal information, context, and adaptability in multimodal solutions. Similar to the previous chapter, the focus is on the best ways to fuse multimodal and multisensorial data in a synergistic manner. The chapter goes beyond an abstract description of fusion strategies and focuses on a series of suitable approaches for the frequently encountered case of heterogeneous, but correlated, time-series data. This could, for example, be the case in audiovisual speech recognition. Such data is not only noisy but often misaligned in time across the involved modalities or sensor signals. To solve this, the authors present correlation-based methods for heterogeneous, yet complementary, modalities or multisensorial information, such as signals coming from different types of video cameras, or audio and video stemming from the same scene. Starting from canonical correlation analysis, further suitable solutions and their shortcomings and advantages are discussed. In combination with dynamic time warping (DTW), alignment of different, yet correlated, information streams becomes feasible. Next, robust and individual component analysis is introduced and further featured in combination with time warping. For illustration purposes, one example is facial expression analysis, including its evolution over time; this includes segmentation and estimation of the intensity of the expression. The authors further argue for the importance of integrating context into the fusion and discuss machine learning methods to incorporate it into the learning process. In closing, they explore avenues for domain adaptation across various contextual factors and modalities.

Chapter 4, the fourth and final chapter of Part I, is entirely dedicated to deep learning methods. Whereas, for example, Chapter 2 presents shared hidden layers in a deep neural network as one solution for multimodal fusion, this chapter by Keren et al. takes a broad view of the multimodal and multisensorial fusion architectures that present-day deep learning methods offer. The authors promote the usage of deep learning not only because of its significant empirical success across a range of AI tasks, but also because it is often "simpler and easier to design", able to learn "end-to-end", and provides reusable building blocks across tasks, which supports easy transfer of knowledge. The well-supported and maintained software landscape further favors the use of deep learning. The authors assume basic knowledge of deep neural networks and address specific topologies and methods in this chapter. As in Chapter 2, early, late, and intermediate fusion are discussed, but from a slightly different, neural network-focused perspective. The chapter introduces encoder-decoder topologies such as sequence-to-sequence approaches, including attention mechanisms. It also highlights embeddings for multimodal modeling, as illustrated for image and text analyses. In an outlook, a number of future avenues are given. Concluding the chapter, the authors stress the bottleneck of scarce labeled data: such interaction data, together with learning targets, is usually needed in large quantities for deep learning architectures to succeed in multimodal fusion.
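To give a flavor of the multiple kernel learning idea discussed for Chapter 2 above, the sketch below builds one RBF kernel per modality and feeds their unweighted sum to a support vector classifier. Full multiple kernel learning would additionally learn the combination weights, so this is only a simplified illustration on placeholder data, not code from the chapter.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(100, 20))   # placeholder per-modality feature matrices
X_video = rng.normal(size=(100, 30))
y = rng.integers(0, 2, size=100)       # placeholder labels

# One kernel per modality, each tailored to its feature space
# (here simply via different RBF widths).
K_audio = rbf_kernel(X_audio, gamma=0.05)
K_video = rbf_kernel(X_video, gamma=0.02)

# Combine the kernels; proper MKL would learn these weights instead of fixing them.
K_fused = 0.5 * K_audio + 0.5 * K_video

clf = SVC(kernel="precomputed").fit(K_fused, y)
train_pred = clf.predict(K_fused)      # prediction on the training kernel matrix
```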

Multimodal Processing of Social and Emotional States

Part II bundles five chapters on affective computing and social signal processing for the realization of artificial socioemotionally intelligent interaction. Chapter 5 introduces the broader picture of "profiling" the user in terms of her states and traits via various modalities. The author motivates the need for such user modeling with economist Peter Drucker's quote, "The most important thing in communication is hearing what isn't said". Potential characteristics of a user to recognize are shown in an overview, featuring extensive examples that have been targeted with a multimodal approach in the literature. For traits, these include a person's age, attractiveness, ethnicity, gender, height, identity, leader traits, likability, (degree of) nativeness, or personality. As to states, alertness, cognitive load, deception, depression, distraction, drowsiness, emotion, engagement, interest, laughter, physical activity, sentiment, stress, swallowing, and violence are listed. The reader is then guided in detail along the processing path of a typical user state and trait assessment engine. This includes all typical, and also less typical yet beneficial, steps from pre-processing to decision making and integration in an application context. Modern aspects such as cooperative learning and transfer learning, or (multimodal) confidence measure estimation, are further discussed. A focus is thereby placed on the fusion of different modalities for user state and trait modeling. The author advocates a modern view and provides guidance on how to simplify, unify, and cluster these steps to allow for a more seamless integration of information in the processing. Spoken and written language, video information—such as facial expression, body posture, and movement—and physiological signals as well as tactile interaction data are then compared regarding their strengths and weaknesses for state and trait recognition. After introducing tools in the field, the reader is also walked through a didactic example: the (fictional) case of arrogance recognition. Concluding the chapter, recent trends and potential for future improvement are given: in-the-wild processing, diversity of culture and language, multi-subject processing, and linking analysis with synthesis.

In Chapter 6, D'Mello et al. focus on affect as a state to recognize. The concept of affect is first viewed from an affective sciences perspective. Similar to the chapters in Part I, the authors then deal with ways to fuse information, here, however, for the particular use case of affect detection. Alongside basic approaches such as data, feature, decision, and hybrid fusion, model fusion is also introduced. The authors present statistical approaches such as dynamic Bayesian networks, besides further variants of deep networks, complementing the kernel- and deep network-based approaches shown in Part I. Three walk-through examples of multisensor-multimodal affect detection systems exemplify the usage: (1) feature-level fusion targeting basic emotions; (2) decision-level fusion for learning-centered affective states; and (3) model-based fusion for continuous affective dimensions in a learning context. The authors also discuss current trends in the field and give a detailed overview of modality combinations. A highlight, giving an impression of current performance, are the results from the field's leading challenge. To conclude, the authors compare the state of the field with that of ten years ago and speculate about ten years into the future, expecting "advances in theoretical sophistication, data sources, and computational techniques", while calling for advances in the science of validation so the field can advance to broad everyday usage.


In Chapter 7, Vinciarelli and Esposito lead the reader into the related and complementary field of multimodal analysis of social signals. Rather than dealing with affective states as in the previous chapter, this chapter deals with multimodal communication between living beings. It takes both the perspective of the life sciences and that of computer science regarding technical implications and the optimal realization of system solutions. On the life science side, multimodal communication patterns as they occur in nature are organized into a coherent taxonomy. Taking the technical stance, the authors consider early and late fusion as in previous chapters in this volume, but with a focus on social signals in communication. A meta-analysis of the last five years of literature provides insight into the current best practices in this field. In this analysis, the authors observe that early fusion is the dominant approach, and that combining synergistic heterogeneous modalities can lead to performances exceeding the best single modality.

In Chapter 8, Wagner and André introduce their practical view on the two tasks introduced in detail in the previous two chapters: the recognition of affect and of social signals. The real-time aspect is crucial for natural communication between humans and technical systems that make use of such information. The chapter first deals with a basic necessity which is often seen as the main limiting factor in the two fields: data for training and its collection. The chapter also lists a number of popular databases for these tasks before turning to fusion in a multimodal context, including asynchronous fusion approaches that respect timing constraints for online processing and cope with potentially missing information. The chapter then formulates a set of requirements for a practical multimodal framework implementation, namely, multisensory abilities (catering also for potentially novel and "exotic" sensors), synchronization for modalities operating on different time scales, automated transcription (of supplementary descriptions), multimodal fusion, missing data handling, continuous classification, and real-time ability. The authors then introduce the Social Signal Interpretation Framework as a tool and guide the reader through very detailed examples before drawing final conclusions.

Chapter 9 by Martin et al. rounds off Part II by taking the opposite view on affect in a technical context as compared to the previous Chapters 5, 6, and 8. Rather than targeting the analysis of human affect, this chapter highlights the synthesis of affect and its perception by humans. Following a general introduction, the authors deal with emotions and their expressions. Related to the focus of these three volumes on multimodality, the authors then turn to human perception of combinations of such affective expressions across modalities. This includes facial and bodily as well as vocal expressions, which are, for example, frequently met in artificial conversational agents. In addition to discussing audio and video, combinations


with haptic expression of affect are included. Further, the impact of context on the perception of multimodal expressions of affect by technical systems is discussed. The authors support the discussion by presenting results from diverse experiments. In their conclusion, the authors stress the challenge of "generating and controlling synchronized expressions of affects across several modalities" and summarize, among other findings, that different modalities are suited to displaying different affects, a conclusion that previous chapters in this volume reached similarly from the affect analysis perspective.

Multimodal Processing of Cognitive States

Part III of this volume emphasizes cognitive states in four chapters, each of which targets a different example. In Chapter 10, Zhou et al. first discuss cognitive load. They introduce the principle of cognitive load measurement and the four current primary methods: subjective self-reported measures, performance measures, physiological measures, and behavioral measures. The chapter then presents the state of the art in terms of theories and prevailing approaches for cognitive load measurement, together with applications. Measurement based on behavioral cues follows, including pen as well as speech and linguistic features. Thereafter, a range of physiological measures is presented, with emphasis on pupil dilation and galvanic skin response. The authors also introduce a feedback loop to the user in applications where automatic cognitive load measurement is included with the intention of supporting maximization of cognitive capacity. The chapter concludes with final considerations, including future needs and efforts to be made.

Chapter 11 by Oviatt et al. focuses on students' mental state during the process of learning, a complex activity to be evaluated over time. As in previous chapters in this volume, analysis based on a multimodal approach is presented to overcome the limitations of click-stream and textual analyses. The chapter thoroughly introduces the concept of multimodal learning analytics, placing its recent emergence and the advocated multimodal approach in temporal context and laying out the main objectives. Of particular practical relevance is the description of the major multimodal corpora available in this field, including a discussion of their limitations. Subsequently, the main findings in the field are presented, distilling five major advantages of using multiple modalities and concluding that linguistic content analysis is not necessarily required for successful prediction of domain expertise. The chapter also highlights the theoretical basis of multimodal learning analytics and discusses future directions for the field.


In Chapter 12, Cohn et al. introduce the mental disorder of depression from a multimodal perspective, aiming at its automated recognition. The chapter first thoroughly introduces depression as a phenomenon, and then turns to multimodal techniques for identifying behavioral and physiological indices of it in users. The chapter details the extraction of feature information sensitive to depression from facial cues, the speech signal, body movement, and other sensor signals. This discussion also presents the key machine learning methods applied in this context so far, which are largely identical to those used for the recognition of other cognitive states. As in the other chapters, emphasis is placed particularly on the fusion of information from the various modalities; such fusion is handled in the context of classification but also of prediction. Toward the chapter's end, implementation aspects, such as contextual embedding in an application, are considered, and a number of broadly used corpora are briefly reviewed.

Burzo et al. introduce a final cognitive state example, deception, in Chapter 13. After introducing and motivating the topic, individual modalities are discussed: the perspective of psychological experiments comes first, followed by language, vision, and physiology as examples of information carriers suited for automated deception detection. The chapter then presents selected combination examples and results from the corresponding studies, namely thermal imaging in combination with physiological sensors and language analysis, followed by language and acoustics, and finally vision and language in combination. The authors close by discussing deficits in the field, including the need for larger datasets with more, and more cross-cultural, subjects, ideally in out-of-the-lab conditions. Furthermore, the authors argue for an increased number of modalities. They also see deep learning methods, as introduced in this volume (in particular in Chapter 4), as a promising avenue.

Multidisciplinary Challenge Topic: Perspectives on Predictive Power of Multimodal Deep Learning: Surprises and Future Directions

The Challenge Topic addressed in Chapter 14 of this volume features a discussion between Samy Bengio (an expert in deep architectures for sequences and other structured objects, understanding training and generalization of deep architectures, and adversarial training with application in image captioning), Li Deng (expertise in AI, mathematical modeling, deep learning, big data analytics, speech recognition, natural language processing, and image captioning), Louis-Philippe Morency (expertise in multimodal machine learning for modeling of acoustic, visual, and verbal modalities, with an interest in human communication dynamics and health behavior informatics), and Björn Schuller (expertise in deep learning,


recurrent networks, and end-to-end learning with application in multimodal and multisensorial affective computing and mobile health). The discussion was initiated around Chapter 4, which deals with deep learning in the context of multimodal and multisensorial interaction and introduces a range of topologies and solutions from the corresponding deep network inventory. Five questions were discussed, starting with: How have deep learning techniques shown that they can be a catalyst for scientific discovery? Deng names machine reading comprehension as a starting point for even greater discovery once machines can interpret "greater and greater scientific literature", also with cross-disciplinary linkage.

Discussing next the benefits but also potential downsides of multimodal deep learning as compared to conventional methods of machine learning for multimodal fusion, Bengio and Morency highlight the ability to learn better representations as a particular strength. Morency also adds transfer learning on the positive side. In a similar vein, Bengio remarks that deep learning has shown "promises into finding common representations of concepts, irrespective of how they are sensed", thus bearing crucial relevance for multimodal and multisensorial use cases. Along these lines, Deng states that "multimodal signals can be effectively exploited to enhance the predictive power by enriching the supervision signals across different modalities", which he calls distant-supervised learning and perceives as closer to human learning, such as a child learning spoken language. Bengio also expects cross-modal generation of concepts, such as by generative adversarial networks. He and Schuller also expect the ability of deep models to exploit large amounts of data to be a key advantage at a time when multimodal and multisensorial data can easily be collected at such scale. Deng agrees, stating that conventional methods lack sufficient learning capacity and representation power to enable multimodal fusion.

Asked about surprises multimodal deep learning might bring along, Deng names a drastic reduction of labeling cost in large-scale prediction tasks. Bengio adds that deep learning may surprise us, as simultaneous access to many modalities should benefit the robustness of learned representations. He adds that he would be "really surprised the day a deep learning model can generate things like humor!" Schuller hopes for emergent behavior once several task-specific networks start to work together in a large multimodal context.

As to future directions for multimodal deep learning, Morency highlights co-learning, exploiting knowledge from one modality to help perform a task in a second modality, such as by zero-shot learning. Bengio thinks deep networks can "learn to generate one modality from another", and Deng wants to see further exploration of effective architectures for multimodal information fusion.


Finally, dealing with the transparency of multimodal deep learning and its potential ethical and societal implications, Morency first highlights privacy and ownership of one's (multimodal) data. Bengio sees an additional problem in the fact that such deep models can be fooled, and largely blames "bad data", such as data with erroneous labels or data that is biased in some way, for "bad decisions". As to transparency, Schuller argues that one can already explain, for example, learned time-profiles, or compare learned features to traditional ones. From an algorithmic point of view, Deng offers further hope in this respect by proposing to design "special neural cell architectures which explicitly represent some known properties of the input signals". All agree that transparency is a crucial step.

References

J. E. Cahn. 1990. The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8: 1–19. DOI: 10.1.1.52.5802.
L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu. 1998. Multimodal human emotion/expression recognition. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 366–371. IEEE. DOI: 10.1109/AFGR.1998.670976.
C. Chiu, Y. Chang, and Y. Lai. 1994. The analysis and recognition of human vocal emotions. In Proceedings of the International Computer Symposium, pp. 83–88.
R. Dechter. 1986. Learning while searching in constraint-satisfaction problems, pp. 178–183. University of California, Computer Science Department, Cognitive Systems Laboratory.
L. C. De Silva, T. Miyasato, and R. Nakatsu. 1997. Facial emotion recognition using multimodal information. In Proceedings of the 1997 International Conference on Information, Communications and Signal Processing, vol. 1, pp. 397–401. IEEE. DOI: 10.1109/ICICS.1997.647126.
W. K. English, D. C. Engelbart, and B. Huddart. 1965. Computer-aided display control. Final Report, Contract NAS1-3988, SRI Project 5061.
G. E. Hinton, S. Osindero, and Y. W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): 1527–1554. DOI: 10.1162/neco.2006.18.7.1527.
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8): 1735–1780.
B. H. Juang and L. R. Rabiner. 2005. Automatic speech recognition–a brief history of the technology development. Georgia Institute of Technology, Rutgers University, and the University of California, Santa Barbara, 1: 67. DOI: 10.1.1.90.5614.
H. Kobayashi and F. Hara. 1992. Recognition of six basic facial expressions and their strength by neural network. In Proceedings of the IEEE International Workshop on Robot and Human Communication, pp. 381–386. IEEE. DOI: 10.1109/ROMAN.1992.253857.
K. Mase. 1991. Recognition of facial expression from optical flow. IEICE Transactions (E), 74: 3474–3483.
J. McCarthy, M. L. Minsky, N. Rochester, and C. E. Shannon. 2006. A proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955. AI Magazine, 27(4): 12–14. AAAI.
C. B. Mirick. 1926. Electrical distant-control system. U.S. Patent no. 1,597,416. Washington, DC: U.S. Patent and Trademark Office.
C. M. Mitchell and M. G. Forren. 1987. Multimodal user input to supervisory control systems: voice-augmented keyboard. IEEE Transactions on Systems, Man, and Cybernetics, 17(4): 594–607. DOI: 10.1109/TSMC.1987.289349.
S. Oviatt. 1999. Ten myths of multimodal interaction. Communications of the ACM, 42(11): 74–81. DOI: 10.1145/319382.319398.
S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors. 2017. The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling and Common Modality Combinations. San Rafael, CA: Morgan & Claypool Publishers. DOI: 10.1145/3015783.
S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors. 2018. The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions. San Rafael, CA: Morgan & Claypool Publishers.
F. I. Parke. 1972. Computer generated animation of faces. In Proceedings of the ACM Annual Conference, 1: 451–457. ACM. DOI: 10.1145/800193.569955.
A. Pentland. 2007. Social signal processing [exploratory DSP]. IEEE Signal Processing Magazine, 24(4): 108–111. DOI: 10.1109/MSP.2007.4286569.
R. W. Picard. 1995. Affective Computing. MIT Press, Cambridge, MA.
S. Reiter, B. Schuller, and G. Rigoll. 2006. A combined LSTM-RNN-HMM-approach for meeting event segmentation and recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 2. IEEE. DOI: 10.1109/ICASSP.2006.1660362.
J. D. Williamson. 1978. Speech analyzer for analyzing pitch or frequency perturbations in individual speech pattern to determine the emotional state of the person. U.S. Patent No. 4,093,821. U.S. Patent and Trademark Office, Washington, DC.

PART I

MULTIMODAL SIGNAL PROCESSING AND ARCHITECTURES

1

Challenges and Applications in Multimodal Machine Learning

Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency

1.1 Introduction

The world surrounding us involves multiple modalities. We see objects, hear sounds, feel texture, smell odors, and so on. In general terms, a modality refers to the way in which something happens or is experienced. Most people associate the word modality with the sensory modalities which represent our primary channels of communication and sensation, such as vision or touch. In this chapter we focus primarily, but not exclusively, on three such modalities: linguistic modality which can be both written or spoken; visual modality which is often represented with images or videos; and vocal modality which encodes sounds and para-verbal information such as prosody and vocal expressions. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret and reason about multimodal messages. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. From early research on audio-visual speech recognition to the recent explosion of interest in language and vision models, multimodal machine learning is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.


Glossary

Representation. Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals.

Translation. Addressing how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image and one perfect translation may not exist.

Alignment. Identifying the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities.

Fusion. Joining information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities.

Co-learning. Transferring knowledge between modalities, their representations, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., few annotated data).
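To make the fusion entry above concrete, here is a minimal sketch, assuming scikit-learn and purely synthetic data, of the two simplest fusion strategies: feature-level (early) fusion, which concatenates modality features before a single classifier, and decision-level (late) fusion, which averages per-modality classifier scores. The variable names (audio_feats, visual_feats) are invented for the illustration and are not from the chapter.

```python
# Illustrative sketch (not from the chapter): feature-level vs. decision-level
# fusion of two synthetic "modalities" with scikit-learn classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)                            # binary target
audio_feats = labels[:, None] + rng.normal(0, 1.0, (n, 10))    # noisier modality A
visual_feats = labels[:, None] + rng.normal(0, 2.0, (n, 20))   # even noisier modality B

# Early (feature-level) fusion: concatenate features, train one classifier.
early = LogisticRegression(max_iter=1000).fit(
    np.hstack([audio_feats, visual_feats]), labels)

# Late (decision-level) fusion: one classifier per modality, average their scores.
clf_a = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
clf_v = LogisticRegression(max_iter=1000).fit(visual_feats, labels)
late_scores = 0.5 * (clf_a.predict_proba(audio_feats)[:, 1]
                     + clf_v.predict_proba(visual_feats)[:, 1])

print("early-fusion accuracy:",
      early.score(np.hstack([audio_feats, visual_feats]), labels))
print("late-fusion accuracy:", ((late_scores > 0.5) == labels).mean())
```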

The research field of multimodal machine learning brings some unique challenges for computational researchers given the heterogeneity of the data. Learning from multimodal sources offers the possibility of capturing correspondences between modalities and gaining an in-depth understanding of natural phenomena. In a recent survey paper, Baltrušaitis et al. [2017] identify five core technical challenges (and related sub-challenges) surrounding multimodal machine learning. They are (a) representation, (b) translation, (c) alignment, (d) fusion, and (e) co-learning (for definitions, see the Glossary). They are central to the multimodal setting and need to be tackled in order to advance the field.


We start with a discussion of the main applications of multimodal machine learning (Section 1.2). In this chapter we focus on two out of the five core technical challenges facing multimodal machine learning: representation (Section 1.3) and co-learning (Section 1.4). The fusion challenge is addressed in Chapters 2 and 4 and in Jameson and Kristensson [2017], and part of the translation challenge is discussed in Chapter 4. More details about all five challenges are also available in the survey paper [Baltrušaitis et al. 2017].

1.2 Multimodal Applications

Multimodal machine learning enables a wide range of applications: from audio-visual speech recognition to image captioning. In this section we present a brief history of multimodal applications, from its beginnings in audio-visual speech recognition to a recently renewed interest in language and vision applications.

One of the earliest examples of multimodal research is audio-visual speech recognition (AVSR) [Yuhas et al. 1989]. It was motivated by the McGurk effect [McGurk and MacDonald 1976], an interaction between hearing and vision during speech perception. When human subjects heard the syllable /ba-ba/ while watching the lips of a person saying /ga-ga/, they perceived a third sound: /da-da/. These results motivated many researchers from the speech community to extend their approaches with visual information. Given the prominence of hidden Markov models (HMMs) in the speech community at the time [Juang and Rabiner 1991], it is no surprise that many of the early models for AVSR were based on various HMM extensions [Bourlard and Dupont 1996, Brand et al. 1997]. While research into AVSR is not as common these days, it has seen renewed interest from the deep learning community [Ngiam et al. 2011].

While the original vision of AVSR was to improve speech recognition performance (e.g., word error rate) in all contexts, the experimental results showed that the main advantage of visual information was when the speech signal was noisy (i.e., low signal-to-noise ratio) [Yuhas et al. 1989, Gurban et al. 2008, Ngiam et al. 2011]. In other words, the captured interactions between modalities were supplementary rather than complementary. The same information was captured in both, improving the robustness of the multimodal models but not improving the speech recognition performance in noiseless scenarios.

A second important category of multimodal applications comes from the field of multimedia content indexing and retrieval [Snoek and Worring 2005, Atrey et al. 2010]. With the advance of personal computers and the internet, the quantity


of digitized multimedia content has increased dramatically.1 While earlier approaches for indexing and searching these multimedia videos were keyword-based [Snoek and Worring 2005], new research problems emerged when trying to search the visual and multimodal content directly. This led to new research topics in multimedia content analysis such as automatic shot-boundary detection [Lienhart 1999] and video summarization [Evangelopoulos et al. 2013]. These research projects were supported by the TrecVid initiative from the National Institute of Standards and Technology, which introduced many high-quality datasets, including the multimedia event detection (MED) tasks started in 2011.2

A third category of applications was established in the early 2000s around the emerging field of multimodal interaction, with the goal of understanding human multimodal behaviors during social interactions. One of the first landmark datasets collected in this field is the AMI Meeting Corpus, which contains more than 100 hours of video recordings of meetings, all fully transcribed and annotated [Carletta et al. 2005]. Another important dataset is the SEMAINE corpus, which allowed the study of interpersonal dynamics between speakers and listeners [McKeown et al. 2010]. This dataset formed the basis of the first audio-visual emotion challenge (AVEC), organized in 2011 [Schuller et al. 2011]. The fields of emotion recognition and affective computing bloomed in the early 2010s thanks to strong technical advances in automatic face detection, facial landmark detection, and facial expression recognition [De la Torre and Cohn 2011]. The AVEC challenge continued annually afterward, with later instantiations including healthcare applications such as automatic assessment of depression and anxiety [Valstar et al. 2013]. A great summary of recent progress in multimodal affect recognition was published by D'Mello and Kory [2015]. Their meta-analysis revealed that a majority of recent work on multimodal affect recognition shows improvement when using more than one modality, but this improvement is reduced when recognizing naturally occurring emotions.

Most recently, a new category of multimodal applications emerged with an emphasis on language and vision: media description. One of the most representative applications is image captioning, where the task is to generate a text description of the input image [Hodosh et al. 2013]. This is motivated by the ability of such systems to help the visually impaired in their daily tasks [Bigham et al. 2010]. The main challenge in media description is evaluation: how to evaluate the quality of the predicted descriptions. The task of visual question-answering (VQA) was recently proposed to address some of the evaluation challenges [Antol et al. 2015], where the goal is to answer a specific question about the image.

In order to bring some of the mentioned applications to the real world we need to address a number of technical challenges facing multimodal machine learning. We summarize the relevant technical challenges for the above-mentioned application areas in Table 1.1. One of the most important challenges is multimodal representation, the focus of our next section.

1. http://www.youtube.com/intl/en-US/yt/about/press/ (accessed May 2018)
2. http://www.nist.gov/multimodal-information-group/trecvid-multimedia-event-detection2011-evaluation (accessed May 2018)

1.3 Multimodal Representations

Representing raw data in a format that a computational model can work with has always been a big challenge in machine learning. Following the work of Bengio et al. [2013], we use the terms feature and representation interchangeably, with each referring to a vector or tensor representation of an entity, be it an image, audio sample, individual word, or a sentence. A multimodal representation is a representation of data using information from multiple such entities. Representing multiple modalities poses many difficulties: how to combine the data from heterogeneous sources; how to deal with different levels of noise; and how to deal with missing data. The ability to represent data in a meaningful way is crucial to multimodal problems, and forms the backbone of any model.

Good representations are important for the performance of machine learning models, as evidenced by the recent leaps in performance of speech recognition [Amodei et al. 2016, Hinton et al. 2012] and visual object classification [Krizhevsky et al. 2012] systems. Bengio et al. [2013] identify a number of properties of good representations: smoothness, temporal and spatial coherence, sparsity, and natural clustering, among others. Srivastava and Salakhutdinov [2012b] identify additional desirable properties for multimodal representations: similarity in the representation space should reflect the similarity of the corresponding concepts, the representation should be easy to obtain even in the absence of some modalities, and, finally, it should be possible to fill in missing modalities given the observed ones.

The development of unimodal representations has been extensively studied [Anagnostopoulos et al. 2015, Bengio et al. 2013, Li et al. 2015]. In the past decade there has been a shift from hand-designed representations to data-driven ones. For example, one of the most famous image descriptors in the early 2000s, the scale invariant feature transform (SIFT), was hand designed [Lowe 2004], but currently most visual descriptions are learned from data using neural architectures

Table 1.1  A summary of applications enabled by multimodal machine learning. For each application area we identify which of the core technical challenges (representation, translation, fusion, alignment, co-learning) need to be addressed in order to tackle it. The application areas covered are: Speech Recognition and Synthesis (AVSR, speech synthesis); Event Detection (action classification, multimedia event detection); Emotion and Affect (recognition, synthesis); Media Description (image description, video description, visual question-answering); and Multimedia Retrieval (media summarization, cross-modal retrieval, cross-modal hashing).

such as convolutional neural networks (CNN) [Krizhevsky et al. 2012]. Similarly, in the audio domain, acoustic features such as Mel-frequency cepstral coefficients (MFCC) have been superseded by data-driven deep neural networks in speech recognition [Amodei et al. 2016, Hinton et al. 2012] and recurrent neural networks for para-linguistic analysis [Trigeorgis et al. 2016]. In natural language processing, textual features initially relied on counting word occurrences in documents, but have been replaced by data-driven word embeddings that exploit the word context [Mikolov et al. 2013]. While there has been a huge amount of work on unimodal representation, up until recently most multimodal representations involved simple concatenation of unimodal ones [D'Mello and Kory 2015], but this has been rapidly changing.

To help understand the breadth of work, we propose two categories of multimodal representation: joint and coordinated. Joint representations combine the unimodal signals into the same representation space, while coordinated representations process unimodal signals separately, but enforce certain similarity constraints on them to bring them to what we term a coordinated space. An illustration of the different multimodal representation types can be seen in Figure 1.1.

Mathematically, the joint representation is expressed as:

    x_m = f(x_1, ..., x_n),    (1.1)

where the multimodal representation x_m is computed using a function f (e.g., a deep neural network, restricted Boltzmann machine, or recurrent neural network) that relies on the unimodal representations x_1, ..., x_n. The coordinated representation, on the other hand, is as follows:

    f(x_1) ∼ g(x_2),    (1.2)

where each modality has a corresponding projection function (f and g above) that maps it into a coordinated multimodal space. While the projection into the multimodal space is independent for each modality, the resulting space is coordinated between them (indicated by ∼). Examples of such coordination include minimizing cosine distance [Frome et al. 2013], maximizing correlation [Andrew et al. 2013], and enforcing a partial order [Vendrov et al. 2016] between the resulting spaces.
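As a purely numerical reading of Equations (1.1) and (1.2), the sketch below instantiates f, g, and the joint function with random linear maps in NumPy; all dimensions and matrices are arbitrary placeholders rather than anything from the chapter, and no training is performed.

```python
# Toy reading of Equations (1.1) and (1.2) with random, untrained projections.
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=300), rng.normal(size=100)   # two unimodal feature vectors

# Equation (1.1): joint representation x_m = f(x_1, ..., x_n),
# here f = a nonlinearity applied to a linear map of the concatenated modalities.
W_joint = rng.normal(size=(64, 400))
x_m = np.tanh(W_joint @ np.concatenate([x1, x2]))

# Equation (1.2): coordinated representations f(x_1) ~ g(x_2),
# separate projections whose outputs are compared, e.g., by cosine similarity.
W_f, W_g = rng.normal(size=(64, 300)), rng.normal(size=(64, 100))
f_x1, g_x2 = W_f @ x1, W_g @ x2
cosine = f_x1 @ g_x2 / (np.linalg.norm(f_x1) * np.linalg.norm(g_x2))
print(x_m.shape, cosine)   # training would push this similarity up for matching pairs
```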

1.3.1 Joint Representations

Figure 1.1  Structure of joint and coordinated representations. Joint representations are projected to the same space using all of the modalities as input. Coordinated representations, on the other hand, exist in their own space, but are coordinated through a similarity (e.g., Euclidean distance) or structure constraint (e.g., partial order). (a) Joint representation: x_m = f(x_1, x_2, ..., x_n), built from unimodal layers with optional intermediate layers. (b) Coordinated representations: f(x_1) and g(x_2), each built from unimodal layers with optional intermediate layers.

We start our discussion with joint representations that project unimodal representations together into a multimodal space (Equation 1.1). Joint representations are

mostly (but not exclusively) used in tasks where multimodal data is present both during training and inference steps. The simplest example of a joint representation is a concatenation of individual modality features (also referred to as early fusion [D'Mello and Kory 2015]). In this section we discuss more advanced methods for creating joint representations starting with neural networks, followed by graphical models and recurrent neural networks (representative works can be seen in Table 1.2).

Table 1.2  A summary of multimodal representation techniques. We identify three subtypes of joint representations (Section 1.3.1) and two subtypes of coordinated ones (Section 1.3.2). For modalities, + indicates the modalities combined.

    Joint
        Neural networks:     Images + Audio        [Ngiam et al. 2011, Mroueh et al. 2015]
                             Images + Text         [Silberer and Lapata 2014]
        Graphical models:    Images + Text         [Srivastava and Salakhutdinov 2012b]
                             Images + Audio        [Kim et al. 2013]
        Sequential:          Audio + Video         [Kahou et al. 2016, Nicolaou et al. 2011]
                             Images + Text         [Rajagopalan et al. 2016]
    Coordinated
        Similarity:          Images + Text         [Frome et al. 2013, Kiros et al. 2015]
                             Video + Text          [Xu et al. 2015, Pan et al. 2016]
        Structured:          Images + Text         [Cao et al. 2016, Vendrov et al. 2016]
                             Audio + Articulatory  [Wang et al. 2015b]

Neural networks have become a very popular method for unimodal data representation [Bengio et al. 2013]. They are used to represent visual, acoustic, and textual data, and are increasingly used in the multimodal domain [Ngiam et al. 2011, Wang et al. 2015a, Ouyang et al. 2014]. In this section we describe how neural

networks can be used to construct a joint multimodal representation, and what advantages they offer.

In general, neural networks are made up of successive layers of inner products followed by nonlinear activation functions. In order to use a neural network as a way to represent data, it is first trained to perform a specific task (e.g., recognizing objects in images). Due to the multilayer nature of deep neural networks, each successive layer is hypothesized to represent the data in a more abstract way [Bengio et al. 2013], hence it is common to use the final or penultimate neural layers as a form of data representation. To construct a multimodal representation using neural networks, each modality starts with several individual neural layers followed by a hidden layer that projects the modalities into a joint space [Wu et al. 2014, Mroueh et al. 2015, Antol et al. 2015, Ouyang et al. 2014]. The joint multimodal representation is then passed through multiple hidden layers or used directly for prediction. Such models can be trained end-to-end, learning both to represent the data and to perform a particular task. This results in a close relationship between multimodal representation learning and multimodal fusion when using neural networks.
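The following sketch, assuming PyTorch, shows one possible form of this construction: per-modality layers, a shared hidden layer that forms the joint representation, and a task head, all trained end-to-end. The layer sizes, modality names, and class count are illustrative assumptions only, not a specific published architecture.

```python
# Sketch of a neural joint representation (assuming PyTorch); sizes are toy values.
import torch
import torch.nn as nn

class JointMultimodalNet(nn.Module):
    def __init__(self, dim_a=40, dim_v=512, dim_joint=128, n_classes=6):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, 64), nn.ReLU())       # audio branch
        self.enc_v = nn.Sequential(nn.Linear(dim_v, 64), nn.ReLU())       # visual branch
        self.joint = nn.Sequential(nn.Linear(128, dim_joint), nn.ReLU())  # joint space
        self.head = nn.Linear(dim_joint, n_classes)                       # task-specific head

    def forward(self, x_audio, x_visual):
        h = torch.cat([self.enc_a(x_audio), self.enc_v(x_visual)], dim=-1)
        joint = self.joint(h)              # the joint multimodal representation
        return self.head(joint), joint

model = JointMultimodalNet()
logits, joint = model(torch.randn(8, 40), torch.randn(8, 512))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 6, (8,)))
loss.backward()                            # representation and task are learned together
```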


As neural networks require a lot of labeled training data, one approach is to pretrain such representations using an autoencoder on unlabeled data [Hinton and Zemel 1994]. The model proposed by Ngiam et al. [2011] extended the idea of using autoencoders to the multimodal domain. They used stacked denoising autoencoders to represent each modality individually and then fused them into a multimodal representation using another autoencoder layer. Similarly, Silberer and Lapata [2014] proposed using a multimodal autoencoder for the task of semantic concept grounding (see Section 1.4.2). In addition to using a reconstruction loss to train the representation, they introduce a term into the loss function that uses the representation to predict object labels. It is also common to fine-tune the resulting representation on the particular task at hand, as the representation constructed using an autoencoder is generic and not necessarily optimal for a specific task [Wang et al. 2015a].

The major advantage of neural network-based joint representations comes from their often superior performance and the ability to pre-train the representations in an unsupervised manner. The performance gain is, however, dependent on the amount of data available for training. One of the disadvantages comes from the model not being able to handle missing data naturally, although there are ways to alleviate this issue [Ngiam et al. 2011, Wang et al. 2015a]. Finally, deep networks are often difficult to train [Glorot and Bengio 2010], but the field is making progress in better training techniques [Srivastava et al. 2014].

Probabilistic graphical models are another popular way to construct representations through the use of latent random variables [Bengio et al. 2013]. In this section we describe how probabilistic graphical models are used to represent unimodal and multimodal data. One approach for graphical model-based representation is deep Boltzmann machines (DBM) [Salakhutdinov and Hinton 2009], which stack restricted Boltzmann machines (RBM) [Hinton et al. 2006] as building blocks. Similar to neural networks, each successive layer of a DBM is expected to represent the data at a higher level of abstraction. The appeal of DBMs comes from the fact that they do not need supervised data for training [Salakhutdinov and Hinton 2009]. As they are graphical models, the representation of data is probabilistic. It is possible, however, to convert them to a deterministic neural network, although this loses the generative aspect of the model [Salakhutdinov and Hinton 2009]. Work by Srivastava and Salakhutdinov [2012a] introduced multimodal deep belief networks as a multimodal representation. Kim et al. [2013] used a deep belief network for each modality and then combined them into a joint representation for audiovisual emotion recognition. Huang and Kingsbury [2013] used a similar


model for AVSR, and Wu and Shao [2014] for audio and skeleton joint-based gesture recognition. Multimodal deep belief networks have been extended to multimodal DBMs by Srivastava and Salakhutdinov [2012b]. Multimodal DBMs are capable of learning joint representations from multiple modalities by merging two or more undirected graphs using a binary layer of hidden units on top of them. They allow the low-level representations of each modality to influence each other after the joint training due to the undirected nature of the model. Ouyang et al. [2014] explore the use of multimodal DBMs for the task of human pose estimation from multi-view data. They demonstrate that integrating the data at a later stage, after the unimodal data underwent nonlinear transformations, was beneficial for the model. Similarly, Suk et al. [2014] use a multimodal DBM representation to perform Alzheimer's disease classification from positron emission tomography and magnetic resonance imaging data.

One of the big advantages of using multimodal DBMs for learning multimodal representations is their generative nature, which allows for an easy way to deal with missing data: even if a whole modality is missing, the model has a natural way to cope. It can also be used to generate samples of one modality in the presence of the other one, or both modalities from the representation. Similar to autoencoders, the representation can be trained in an unsupervised manner, enabling the use of unlabeled data. The major disadvantage of DBMs is the difficulty of training them, the high computational cost, and the need to use approximate variational training methods [Srivastava and Salakhutdinov 2012b].

Sequential Representation. Sequential representations are designed to represent sequences of varying lengths. This is in contrast with the approaches previously described, which are for static data or datasets with fixed length. In this section we describe models that can be used to represent such sequences.

Recurrent neural networks (RNNs), and their variants such as long short-term memory (LSTM) networks [Hochreiter and Schmidhuber 1997], have recently gained popularity due to their success in sequence modeling across various tasks [Bahdanau et al. 2014, Venugopalan et al. 2015]. So far, RNNs have mostly been used to represent unimodal sequences of words, audio, or images, with most success in the language domain. Similar to traditional neural networks, the hidden state of an RNN can be seen as a representation of the data, i.e., the hidden state of an RNN at timestep t can be seen as the summarization of the sequence up to that timestep. This is especially apparent in RNN encoder-decoder frameworks where the task of an encoder is to represent a sequence in the hidden state of an RNN in such a way that a decoder could reconstruct it [Bahdanau et al. 2014].
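As a small illustration of the hidden-state-as-representation idea, the sketch below (assuming PyTorch) encodes a batch of fixed-length feature sequences with an LSTM and keeps the final hidden state as a fixed-size summary; feature and hidden sizes are arbitrary placeholders.

```python
# Sketch (assuming PyTorch): the LSTM's final hidden state summarizes a sequence.
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=74, hidden_size=256, batch_first=True)  # e.g., per-frame features
frames = torch.randn(4, 120, 74)             # batch of 4 sequences, 120 timesteps each
outputs, (h_n, c_n) = encoder(frames)
sequence_repr = h_n[-1]                      # shape (4, 256): one summary vector per sequence
print(sequence_repr.shape)
```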


The use of RNN representations has not been limited to the unimodal domain. An early use of constructing a multimodal representation using RNNs comes from work by Cosi et al. [1994] on AVSR. They have also been used for representing audiovisual data for affect recognition [Nicolaou et al. 2011, Chen et al. 2015] and to represent multi-view data such as different visual cues for human behavior analysis [Rajagopalan et al. 2016].

1.3.2 Coordinated Representations

An alternative to a joint multimodal representation is a coordinated representation. Instead of projecting the modalities together into a joint space, we learn separate representations for each modality but coordinate them through a constraint. We start our discussion with coordinated representations that enforce similarity between representations, moving on to coordinated representations that enforce more structure on the resulting space (representative works of different coordinated representations can be seen in Table 1.2).

Similarity models minimize the distance between modalities in the coordinated space. For example, such models encourage the representation of the word dog and an image of a dog to have a smaller distance between them than the distance between the word dog and an image of a car [Frome et al. 2013]. One of the earliest examples of such a representation comes from the work by Weston et al. [2011] on the WSABIE (web scale annotation by image embedding) model, where a coordinated space was constructed for images and their annotations. WSABIE constructs a simple linear map from image and textual features such that a corresponding annotation and image representation would have a higher inner product (smaller cosine distance) between them than non-corresponding ones.

More recently, neural networks have become a popular way to construct coordinated representations, due to their ability to learn representations. Their advantage lies in the fact that they can jointly learn coordinated representations in an end-to-end manner. An example of such a coordinated representation is DeViSE, a deep visual-semantic embedding [Frome et al. 2013]. DeViSE uses a similar inner product and ranking loss function to WSABIE but uses more complex image and word embeddings. Kiros et al. [2015] extended this to sentence and image coordinated representation by using an LSTM model and a pairwise ranking loss to coordinate the feature space. Socher et al. [2014] tackle the same task, but extend the language model to a dependency tree RNN to incorporate compositional semantics. A similar model was also proposed by Pan et al. [2016], but using videos instead of images. Xu et al. [2015] also constructed a coordinated space between videos


and sentences using a subject, verb, object compositional language model and a deep video model. This representation was then used for the task of cross-modal retrieval and video description.

While the above models enforced similarity between representations, structured coordinated space models go beyond that and enforce additional constraints between the modality representations. The type of structure enforced is often based on the application, with different constraints for hashing, cross-modal retrieval, and image captioning. Structured coordinated spaces are commonly used in cross-modal hashing, the compression of high-dimensional data into compact binary codes such that similar objects receive similar binary codes [Wang et al. 2014]. The idea of cross-modal hashing is to create such codes for cross-modal retrieval [Kumar and Udupa 2011, Bronstein et al. 2010, Jiang et al. 2015]. Hashing enforces certain constraints on the resulting multimodal space: (1) it has to be an N-dimensional Hamming space, a binary representation with a controllable number of bits; (2) the same object from different modalities has to have a similar hash code; and (3) the space has to be similarity-preserving. Learning how to represent the data as a hash function attempts to enforce all three of these requirements [Kumar and Udupa 2011, Bronstein et al. 2010]. For example, Jiang and Li [2017] introduced a method to learn such a common binary space between sentence descriptions and corresponding images using end-to-end trainable deep learning techniques, while Cao et al. [2016] extended the approach with a more complex LSTM sentence representation and introduced an outlier-insensitive bit-wise margin loss and a relevance-feedback-based semantic similarity constraint. Similarly, Wang et al. [2016] constructed a coordinated space in which images (and sentences) with similar meanings are closer to each other.

Another example of a structured coordinated representation comes from order-embeddings of images and language [Vendrov et al. 2016, Zhang et al. 2016]. The model proposed by Vendrov et al. [2016] enforces a dissimilarity metric that is asymmetric and implements the notion of partial order in the multimodal space. The idea is to capture a partial order of the language and image representations, enforcing a hierarchy on the space; for example, an image of "a woman walking her dog" → text "woman walking her dog" → text "woman walking." A similar model using denotation graphs was also proposed by Young et al. [2014], in which denotation graphs are used to induce a partial ordering. Lastly, Zhang et al. [2016] present how exploiting structured representations of text and images can create concept taxonomies in an unsupervised manner.
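A minimal sketch of the similarity-based coordination described above, assuming PyTorch: two projections map precomputed image and text features into a shared space, and a WSABIE/DeViSE-style pairwise ranking (hinge) loss pushes matching pairs above mismatched ones. The feature dimensions and margin are illustrative assumptions, not the published models' settings.

```python
# Sketch (assuming PyTorch) of a coordinated space trained with a ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_proj = nn.Linear(2048, 300)   # e.g., CNN image features -> coordinated space
txt_proj = nn.Linear(300, 300)    # e.g., word/sentence embeddings -> same space

def ranking_loss(img_feats, txt_feats, margin=0.2):
    img = F.normalize(img_proj(img_feats), dim=-1)
    txt = F.normalize(txt_proj(txt_feats), dim=-1)
    scores = img @ txt.T                        # cosine similarity of every image-text pair
    pos = scores.diag().unsqueeze(1)            # scores of the matching (diagonal) pairs
    cost = (margin + scores - pos).clamp(min=0) # hinge: mismatched pairs should score lower
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost.masked_fill(mask, 0.0).mean()   # ignore the matching pairs themselves

loss = ranking_loss(torch.randn(16, 2048), torch.randn(16, 300))
loss.backward()
```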


A special case of a structured coordinated space is one based on canonical correlation analysis (CCA) [Hotelling 1936]. CCA computes a linear projection which maximizes the correlation between two random variables (in our case, modalities) and enforces orthogonality of the new space. CCA models have been used extensively for cross-modal retrieval [Hardoon et al. 2004, Rasiwasia et al. 2010, Klein et al. 2015] and audiovisual signal analysis [Sargin et al. 2007, Slaney and Covell 2001]. Extensions to CCA attempt to construct a correlation-maximizing nonlinear projection [Lai and Fyfe 2000, Andrew et al. 2013]. Kernel canonical correlation analysis (KCCA) [Lai and Fyfe 2000] uses reproducing kernel Hilbert spaces for projection. However, as the approach is nonparametric, it scales poorly with the size of the training set and has issues with very large real-world datasets. Deep canonical correlation analysis (DCCA) [Andrew et al. 2013] was introduced as an alternative to KCCA and addresses the scalability issue; it was also shown to lead to a better correlated representation space. A similar correspondence autoencoder [Feng et al. 2014] and deep correspondence RBMs [Feng et al. 2015] have also been proposed for cross-modal retrieval.

CCA, KCCA, and DCCA are unsupervised techniques and only optimize the correlation over the representations, thus mostly capturing what is shared across the modalities. Deep canonically correlated autoencoders [Wang et al. 2015b] also include an autoencoder-based data reconstruction term. This encourages the representation to also capture modality-specific information. The semantic correlation maximization method [Zhang and Li 2014] also encourages semantic relevance, while retaining correlation maximization and orthogonality of the resulting space, leading to a combination of CCA and cross-modal hashing techniques.
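For a concrete feel for CCA as a coordinated space, the snippet below fits scikit-learn's linear CCA on two synthetic views that share a latent factor; it illustrates plain CCA only, not the KCCA or DCCA extensions, and the data is invented for the example.

```python
# Minimal linear CCA example on synthetic two-view data (scikit-learn).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 2))                                         # latent factors common to both views
X = shared @ rng.normal(size=(2, 20)) + 0.5 * rng.normal(size=(500, 20))   # modality 1
Y = shared @ rng.normal(size=(2, 30)) + 0.5 * rng.normal(size=(500, 30))   # modality 2

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)                                             # projections into the coordinated space
corr = [np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1] for k in range(2)]
print("canonical correlations:", np.round(corr, 3))
```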

1.3.3 Discussion

In this section we identified two major types of multimodal representations: joint and coordinated. Joint representations project multimodal data into a common space and are best suited for situations when all of the modalities are present during inference. They have been extensively used for AVSR, affect recognition, and multimodal gesture recognition. Coordinated representations, on the other hand, project each modality into a separate but coordinated space, making them suitable for applications where only one modality is present at test time, such as multimodal retrieval and translation, conceptual grounding (Section 1.4.2), and zero shot learning (Section 1.4.2). Finally, while joint representations have been used to construct representations of more than two modalities, coordinated spaces have, so far, been mostly limited to two modalities.

1.4 Co-learning

The final multimodal challenge in our taxonomy is co-learning, which aids the modeling of a (resource-poor) modality by exploiting knowledge from another (resource-rich) modality. It is particularly relevant when one of the modalities has limited resources, a lack of annotated data, noisy input, or unreliable labels. We call this challenge co-learning as most often the helper modality is used only during model training and is not used during test time. We identify three types of co-learning approaches based on their training resources: parallel, non-parallel, and hybrid. Parallel-data approaches require training datasets where the observations from one modality are directly linked to the observations from other modalities; in other words, the multimodal observations are from the same instances, such as in an audio-visual speech dataset where the video and speech samples are from the same speaker. In contrast, non-parallel data approaches do not require direct links between observations from different modalities. These approaches usually achieve co-learning by using overlap in terms of categories, e.g., in zero shot learning the conventional visual object recognition dataset is expanded with a second text-only dataset from Wikipedia to improve the generalization of visual object recognition. In the hybrid data setting the modalities are bridged through a shared modality or a dataset. An overview of the co-learning taxonomy can be seen in Table 1.3 and a summary of data parallelism in Figure 1.2.

1.4.1 Parallel Data

In parallel data co-learning both modalities share a set of instances: audio recordings with the corresponding videos, or images and their sentence descriptions. This allows for two types of algorithms to exploit that data to better model the modalities: co-training and representation learning.

Co-training is the process of creating more labeled training samples when we have few labeled samples in a multimodal problem [Blum and Mitchell 1998]. The basic algorithm builds weak classifiers in each modality to bootstrap each other with labels for the unlabeled data. It has been shown in the seminal work of Blum and Mitchell [1998] that more training samples for web page classification can be discovered based on the web page text itself and the hyper-links leading to it. By definition, this task requires parallel data as it relies on the overlap of multimodal samples. Co-training has been used for statistical parsing [Sarkar 2001], to build better visual detectors [Levin et al. 2003], and for audio-visual speech recognition [Christoudias et al. 2006]. It has also been extended to deal with disagreement between modalities by filtering out unreliable samples [Christoudias et al. 2008].
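The following is a compact sketch of the co-training loop under simplifying assumptions (synthetic two-view data, a fixed number of rounds, and five pseudo-labels per view per round); it is meant to illustrate the bootstrapping idea, not to reproduce the original algorithm's exact selection rules.

```python
# Co-training sketch: two views label the shared unlabeled pool for each other.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 600)
view1 = y_true[:, None] + rng.normal(0, 1.5, (600, 10))  # e.g., page-text features
view2 = y_true[:, None] + rng.normal(0, 1.5, (600, 10))  # e.g., hyperlink features
labeled = set(range(20))                                  # small seed of labeled instances
pseudo = {}                                               # index -> pseudo-label added during co-training

for _ in range(10):                                       # co-training rounds
    idx = sorted(labeled) + sorted(pseudo)
    lbl = [y_true[i] if i in labeled else pseudo[i] for i in idx]
    clf1 = LogisticRegression().fit(view1[idx], lbl)
    clf2 = LogisticRegression().fit(view2[idx], lbl)
    for clf, view in ((clf1, view1), (clf2, view2)):      # each view's weak classifier
        pool = [i for i in range(600) if i not in labeled and i not in pseudo]
        if not pool:
            break
        proba = clf.predict_proba(view[pool])
        for j in np.argsort(proba.max(axis=1))[-5:]:      # its 5 most confident predictions
            pseudo[pool[j]] = int(proba[j].argmax())      # become training labels for both views

print("pseudo-labeled:", len(pseudo),
      "pseudo-label accuracy:", np.mean([y_true[i] == l for i, l in pseudo.items()]))
```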


Table 1.3  A summary of the co-learning taxonomy, based on data parallelism. Parallel data: multiple modalities can see the same instance. Non-parallel data: unimodal instances are independent of each other. Hybrid data: the modalities are pivoted through a shared modality or dataset.

    Parallel
        Co-training:           Mixture                 [Blum and Mitchell 1998]
        Transfer learning:     AVSR                    [Ngiam et al. 2011]
                               Lip reading             [Moon et al. 2015]
    Non-parallel
        Transfer learning:     Visual classification   [Frome et al. 2013]
                               Action recognition      [Mahasseni and Todorovic 2016]
        Concept grounding:     Metaphor class.         [Shutova et al. 2016]
                               Word similarity         [Kiela and Clark 2015]
        Zero shot learning:    Image class.            [Socher et al. 2013]
                               Thought class.          [Palatucci et al. 2009]
    Hybrid
        Bridging:              MT and image ret.       [Rajendran et al. 2015]
                               Transliteration         [Nakov and Ng 2012]

While co-training is a powerful method for generating more labeled data, it can also lead to biased training samples, resulting in overfitting.

Transfer learning is another way to exploit co-learning with parallel data. Multimodal representation learning approaches (Section 1.3.1), such as multimodal deep Boltzmann machines [Srivastava and Salakhutdinov 2012b] and multimodal autoencoders [Ngiam et al. 2011], transfer information from the representation of one modality to that of another. This not only leads to multimodal representations, but also to better unimodal ones, with only one modality being used during test time. Moon et al. [2015] show how to transfer information from a speech recognition neural network (based on audio) to a lip-reading one (based on images), leading to a better visual representation and a model that can be used for lip-reading without the need for audio information during test time. Similarly, Arora and Livescu [2013] build better acoustic features using CCA on acoustic and articulatory (location of lips, tongue, and jaw) data. They use articulatory data only during CCA construction and use only the resulting acoustic (unimodal) representation during test time.
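As one possible shape of such parallel-data transfer, the sketch below (assuming PyTorch, with toy-sized networks and random tensors standing in for parallel audio-video pairs) trains a visual "student" against the soft outputs of an audio "teacher"; at test time only the visual network is needed. It is a loose illustration of the idea attributed to Moon et al. [2015] above, not their actual architecture or training recipe.

```python
# Sketch: transfer from an audio network to a visual one using parallel pairs.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))   # audio model, assumed already trained
student = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 10))  # lip-reading model to be trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

audio, video = torch.randn(32, 40), torch.randn(32, 512)                   # parallel training pairs
for _ in range(5):
    with torch.no_grad():
        soft_targets = teacher(audio).softmax(dim=-1)                      # supervision from the audio modality
    loss = nn.functional.kl_div(student(video).log_softmax(dim=-1),
                                soft_targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

predictions = student(torch.randn(8, 512)).argmax(dim=-1)                  # test time: video only
```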

Figure 1.2  Types of data parallelism used in co-learning: (a) parallel: modalities are from the same dataset and there is a direct correspondence between instances; (b) non-parallel: modalities are from different datasets and do not have overlapping instances, but overlap in general categories or concepts; and (c) hybrid: the instances or concepts are bridged by a third modality or a dataset.

1.4.2 Non-parallel Data

Methods that rely on non-parallel data do not require the modalities to have shared instances, but only shared categories or concepts. Non-parallel co-learning approaches can help when learning representations, as they allow for better semantic concept understanding and even enable unseen object recognition. Transfer learning is also possible on non-parallel data and allows for the learning of better representations through transferring information from a representation built using a data-rich or clean modality to a data-scarce or noisy modality. This type of transfer learning is often achieved by using coordinated multimodal representations (see Section 1.3.2). For example, Frome et al. [2013] used text to improve visual representations for image classification by coordinating CNN visual features with word2vec textual ones [Mikolov et al. 2013] trained on separate large datasets. Visual representations trained in such a way result in more meaningful errors, mistaking objects for ones of a similar category [Frome et al. 2013]. Mahasseni and Todorovic [2016] demonstrated how to regularize a color video-based LSTM using an autoencoder LSTM trained on 3D skeleton data by enforcing similarities between their hidden states. Such an approach is able to improve the original LSTM and leads to state-of-the-art performance in action recognition.

Conceptual grounding refers to learning semantic meanings or concepts not purely based on language but also on additional modalities such as vision, sound, or even smell. While the majority of concept learning approaches are purely language-based, representations of meaning in humans are not merely a product


of our linguistic exposure but are also grounded through our sensorimotor experience and perceptual system [Barsalou 2008, Louwerse 2011]. Human semantic knowledge relies heavily on perceptual information [Louwerse 2011], and many concepts are grounded in the perceptual system rather than being purely symbolic [Barsalou 2008]. This implies that learning semantic meaning purely from textual information might not be optimal, and it motivates the use of visual or acoustic cues to ground our linguistic representations. Starting from the work of Feng and Lapata [2010], grounding is usually performed by finding a common latent space between the representations [Feng and Lapata 2010, Silberer and Lapata 2012] (in the case of parallel datasets) or by learning unimodal representations separately and then concatenating them into a multimodal one [Regneri et al. 2013, Shutova et al. 2016, Kiela and Bottou 2014, Bruni et al. 2012] (in the case of non-parallel data). Once a multimodal representation is constructed, it can be used on purely linguistic tasks. Shutova et al. [2016] and Bruni et al. [2012] used grounded representations for better classification of metaphors and literal language. Such representations have also been useful for measuring conceptual similarity and relatedness, that is, identifying how semantically or conceptually related two words [Kiela and Bottou 2014, Bruni et al. 2014, Silberer and Lapata 2012] or actions [Regneri et al. 2013] are. Furthermore, concepts can be grounded not only with visual signals but also with acoustic ones, leading to better performance especially on words with auditory associations [Kiela and Clark 2015], or even with olfactory signals [Kiela et al. 2015] for words with smell associations. Finally, there is a lot of overlap between multimodal alignment and conceptual grounding, as aligning visual scenes to their descriptions leads to better textual or visual representations [Regneri et al. 2013, Plummer et al. 2015, Kong et al. 2014, Yu and Siskind 2013].

Conceptual grounding has been found to be an effective way to improve performance on a number of tasks. It also shows that language and vision (or audio) are complementary sources of information, and combining them in multimodal models often improves performance. However, one has to be careful, as grounding does not always lead to better performance [Kiela and Clark 2015, Kiela et al. 2015]; it only makes sense when grounding has relevance for the task, such as grounding with images for visually related concepts.

Zero shot learning (ZSL) refers to recognizing a concept without having explicitly seen any examples of it, for example, classifying a cat in an image without ever having seen (labeled) images of cats. This is an important problem to address because, in a number of tasks such as visual object classification, it is prohibitively expensive to provide training examples for every imaginable object of interest.


There are two main types of ZSL: unimodal and multimodal. Unimodal ZSL looks at the component parts or attributes of the object, such as phonemes to recognize an unheard word, or visual attributes such as color, size, and shape to predict an unseen visual class [Farhadi et al. 2009]. Multimodal ZSL recognizes objects in the primary modality with the help of a secondary modality in which the object has been seen. The multimodal version of ZSL is by definition a non-parallel data problem, as the sets of seen classes differ between the modalities. Socher et al. [2013] map image features to a conceptual word space and are able to classify both seen and unseen concepts. The unseen concepts can then be assigned to a word that is close to the visual representation; this is enabled by the semantic space being trained on a separate dataset that has seen more concepts. Instead of learning a mapping from the visual to the concept space, Frome et al. [2013] learn a coordinated multimodal representation between concepts and images that allows for ZSL. Palatucci et al. [2009] predict the words people are thinking of based on functional magnetic resonance images; they show how it is possible to predict unseen words through the use of an intermediate semantic space. Lazaridou et al. [2014] present a fast mapping method for ZSL that maps extracted visual feature vectors to text-based vectors through a neural network.
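The multimodal ZSL recipe sketched above (map visual features into a semantic word space trained on more concepts, then label by the nearest class embedding) can be illustrated as follows. This is a hedged sketch, not the exact model of Socher et al. [2013] or Lazaridou et al. [2014]; the PyTorch code, feature dimensions, class names, and random stand-in data are all assumptions.

```python
# A minimal illustrative sketch of multimodal zero-shot learning: a linear map
# from visual features into a word-embedding space, trained on seen classes
# only, then used to label unseen classes by nearest neighbor among class-name
# embeddings. All data here is synthetic.
import torch
import torch.nn.functional as F

d_visual, d_word = 512, 300
seen_classes, unseen_classes = ["dog", "horse"], ["cat"]
all_classes = seen_classes + unseen_classes

# Stand-ins for pretrained word embeddings (e.g., word2vec) of class names.
word_emb = {c: F.normalize(torch.randn(d_word), dim=0) for c in all_classes}

# Stand-ins for CNN features of training images of *seen* classes only.
train_feats = torch.randn(200, d_visual)
train_labels = [seen_classes[i % 2] for i in range(200)]

W = torch.zeros(d_visual, d_word, requires_grad=True)  # visual -> semantic map
opt = torch.optim.Adam([W], lr=1e-2)

for _ in range(100):
    proj = F.normalize(train_feats @ W, dim=1)
    targets = torch.stack([word_emb[c] for c in train_labels])
    loss = (1 - (proj * targets).sum(dim=1)).mean()  # cosine regression loss
    opt.zero_grad(); loss.backward(); opt.step()

def zero_shot_predict(visual_feature):
    # Project the image into the semantic space and pick the nearest class
    # embedding, which may belong to a class never seen during training.
    z = F.normalize(visual_feature @ W, dim=0)
    sims = {c: float(z @ word_emb[c]) for c in all_classes}
    return max(sims, key=sims.get)

print(zero_shot_predict(torch.randn(d_visual)))
```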

1.4.3 Hybrid Data

In the hybrid data setting, two non-parallel modalities are bridged by a shared modality or dataset (see Figure 1.2c). The most notable example is the Bridge Correlational Neural Network [Rajendran et al. 2015], which uses a pivot modality to learn coordinated multimodal representations in the presence of non-parallel data. For example, in the case of multilingual image captioning, the image modality would always be paired with at least one caption in some language. Such methods have also been used to bridge languages that might not have parallel corpora but have access to a shared pivot language, such as for machine translation [Rajendran et al. 2015, Nakov and Ng 2012] and document transliteration [Khapra et al. 2010]. Instead of using a separate modality for bridging, some methods rely on the existence of large datasets from a similar or related task to improve performance on a task that only has limited annotated data. Socher and Fei-Fei [2010] use large text corpora to guide image segmentation, while Anne Hendricks et al. [2016] use a separately trained visual model and a language model to build a better image and video description system for which only limited data is available.
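The pivot idea can be sketched as follows: captions in two languages are never paired with each other, but each is paired with images, so coordinating both with the image space makes the two languages comparable. This is a loose illustration of bridging rather than the Bridge Correlational Neural Network itself, and the encoders, dimensions, and random data are invented for the example.

```python
# An illustrative sketch of hybrid-data co-learning with a pivot modality:
# align each language encoder with a shared image encoder so that the two
# languages, never observed together, end up in a comparable space.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_img, d_l1, d_l2, d = 512, 300, 300, 128
enc_img = nn.Linear(d_img, d)
enc_lang1 = nn.Linear(d_l1, d)   # e.g., English caption features
enc_lang2 = nn.Linear(d_l2, d)   # e.g., German caption features

params = list(enc_img.parameters()) + list(enc_lang1.parameters()) + list(enc_lang2.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def align_loss(a, b):
    # Pull paired items together in the coordinated space (cosine distance).
    return (1 - F.cosine_similarity(a, b)).mean()

for _ in range(5):
    # Two separate parallel sets: (image, lang1 caption) and (image, lang2 caption).
    img1, cap1 = torch.randn(16, d_img), torch.randn(16, d_l1)
    img2, cap2 = torch.randn(16, d_img), torch.randn(16, d_l2)
    loss = align_loss(enc_img(img1), enc_lang1(cap1)) + \
           align_loss(enc_img(img2), enc_lang2(cap2))
    opt.zero_grad(); loss.backward(); opt.step()

# After training, lang1 and lang2 captions can be compared directly through
# the shared space, even though they were never paired in the data.
sim = F.cosine_similarity(enc_lang1(torch.randn(1, d_l1)), enc_lang2(torch.randn(1, d_l2)))
```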


1.4.4 Discussion

Multimodal co-learning allows for one modality to influence the training of another, exploiting the complementary information across modalities. It is important to note that co-learning is task independent and could be used to create better fusion, translation, and alignment models. This challenge is exemplified by algorithms such as co-training, multimodal representation learning, conceptual grounding, and zero shot learning (ZSL) and has found many applications in visual classification, action recognition, audio-visual speech recognition, and semantic similarity estimation.

1.5 Conclusion

Multimodal machine learning is a vibrant multi-disciplinary field which aims to build models that can process and relate information from multiple modalities. As part of this chapter, we presented a taxonomy of two challenges in multimodal machine learning: representation and co-learning [Baltrušaitis et al. 2017]. Some related challenges, such as fusion, have been studied for a long time, but more recent interest in alignment and translation has led to a large number of new multimodal algorithms and exciting multimodal applications. Although the focus of this chapter was primarily on the last decade of multimodal research, it is important to address future challenges with a knowledge of past achievements. Moving forward, the proposed taxonomy gives researchers a framework to understand current research and identify understudied challenges for future research. We believe that all these aspects of multimodal research are needed if we want to build computers able to perceive, model, and generate multimodal signals. One specific area of multimodal machine learning which seems to be understudied is co-learning, where knowledge from one modality helps with modeling in another modality. This challenge is related to the concept of coordinated representations, where each modality keeps its own representation but finds a way to exchange and coordinate knowledge. We see these lines of research as promising directions for future work.

Focus Questions

1.1. Describe the different applications that have emerged in the multimodal domain. Also think of and list a few applications not listed in the chapter. Do these applications suggest that models using multiple modalities are essential when contrasted with unimodal models?


1.2. What are the two categories of Representation Learning that have been discussed in this chapter? Which of the two categories seems to be more explicit in estimating similarities between two modalities, and how did you come to that conclusion?

1.3. Coordinated Representation methods have only been used for two modalities at a time. Can you come up with ideas to extend the current methods to multiple modalities?

1.4. Compare and contrast probabilistic graphical models and neural networks for representation learning. Can you think of a way to combine the best of both worlds for multimodal representations?

1.5. Limited and noisy data from a modality is a common problem in a host of real-world tasks. Co-learning exploits knowledge from a resource-rich modality to aid the resource-poor modality. Can you think of conditions where the described methods could do more harm than good?

1.6. Transfer Learning could be done on both Parallel and Non-parallel data. What are the key differences between the approaches followed on both these kinds of data?

1.7. Although the chapter focuses more on visual, acoustic, and textual modalities, there are other modalities (like olfactory signals) that can act as a bridge to ground associations made by humans. What are some ways that can help one decide which modalities are complementary (help each other and boost performance on the task)?

1.8. Zero shot learning recognizes a concept without ever having explicitly seen it before. Which kind of representation is a good choice for this task, and why?

1.9. Hybrid data is similar to Non-Parallel data apart from one key difference. What is that difference, and how are the models modified to take advantage of this kind of data?

References

D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182. 23, 25

C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos. 2015. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artificial Intelligence Review, 43(2): 155–177. DOI: 10.1007/s10462-012-9368-5. 23


G. Andrew, R. Arora, J. Bilmes, and K. Livescu. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning, pp. 1247–1255. 25, 32 L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10. 37 S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015. 23, 27 R. Arora and K. Livescu. 2013. Multi-view cca-based acoustic features for phonetic recognition across speakers and domains. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7135–7139. IEEE. 10.1109/ ICASSP.2013.6639047. 34 P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia systems, 16(6):345–379, 2010. DOI: 10.1007/s00530-010-0182-0. 21 D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural Machine Translation By Jointly Learning To Align and Translate. ICLR. 29 T. Baltruˇsaitis, C. Ahuja, and L.-P. Morency. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406. 20, 21, 38 L. W. Barsalou. 2008. Grounded cognition. Annu. Rev. Psychol., 59:617–645. DOI: 10.1146/ annurev.psych.59.103006.093639. 36 Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828. DOI: 10.1109/TPAMI.2013.50. 23, 26, 27, 28 J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. 2010. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pp. 333–342. ACM. DOI: 10.1145/1866029.1866080. 22 A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. ACM. DOI: 10.1145/279943.279962. 33, 34 H. Bourlard and S. Dupont. 1996. A mew asr approach based on independent processing and recombination of partial frequency bands. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 1, pp. 426–429. IEEE, 1996. DOI: 10.1109/ICSLP.1996.607145. 21 M. Brand, N. Oliver, and A. Pentland. 1997. Coupled hidden markov models for complex action recognition. In Computer vision and pattern recognition, 1997. proceedings., 1997 ieee computer society conference on, pp. 994–999. IEEE. DOI: 10.1109/CVPR.1997 .609450. 21


M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. 2010. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3594–3601. IEEE. DOI: 10.1109/CVPR.2010.5539928. 31 E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran. Distributional semantics in technicolor. 2012. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 136–145. Association for Computational Linguistics. 36 E. Bruni, N.-K. Tran, and M. Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res.(JAIR), 49(2014): 1–47. 36 Y. Cao, M. Long, J. Wang, Q. Yang, and S. Y. Philip. 2016. Deep visual-semantic hashing for cross-modal retrieval. In KDD, pp. 1445–1454. DOI: 10.1145/2939672.2939812. 27, 31 J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al. 2005. The ami meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pp. 28–39. Springer. DOI: 10.1007/11677482_3. 22 X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Doll´ ar, and C. L. Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325. 30 C. M. Christoudias, K. Saenko, L.-P. Morency, and T. Darrell. 2006. Co-adaptation of audiovisual speech and gesture classifiers. In Proceedings of the 8th international conference on Multimodal interfaces, pp. 84–91. ACM. DOI: 10.1145/1180995.1181013. 33 C. M. Christoudias, R. Urtasun, and T. Darrell. 2008. Multi-view learning in the presence of view disagreement. In UAI. 33 P. Cosi, E. M. Caldognetto, K. Vagges, G. A. Mian, and M. Contolini. 1994. Bimodal recognition experiments with recurrent neural networks. In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, volume 2, pp. II–553. IEEE, 1994. DOI: 10.1109/ICASSP.1994.389596. 30 F. De la Torre and J. F. Cohn. 2011. Facial expression analysis. In Visual analysis of humans, pp. 377–409. Springer. DOI: 10.1007/978-0-85729-997-0_19. 22 S. K. D’Mello and J. Kory. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR), 47(3): 43. DOI: 10.1145/2682899. 22, 25, 26 G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis. 2013. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7): 1553–1568. DOI: 10.1109/TMM.2013.2267205. 22 A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. 2009. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009., pp. 1778–1785. IEEE. DOI: 10.1109/CVPR.2009.5206772. 37


F. Feng, X. Wang, and R. Li. 2014. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 7–16. ACM. DOI: 10.1145/2647868.2654902. 32 F. Feng, R. Li, and X. Wang. 2015. Deep correspondence restricted boltzmann machine for cross-modal retrieval. Neurocomputing, 154: 50–60. DOI: 10.1145/2808205. 32 Y. Feng and M. Lapata. 2010. Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 91–99. Association for Computational Linguistics. 36 A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. 2013. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. 25, 27, 30, 34, 35, 37 X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. 28 M. Gurban, J.-P. Thiran, T. Drugman, and T. Dutoit. 2008. Dynamic modality weighting for multi-stream hmms inaudio-visual speech recognition. In Proceedings of the 10th international conference on Multimodal interfaces, pp. 237–240. ACM. DOI: 10.1145/ 1452392.1452442. 21 D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12): 2639–2664. DOI: 10.1162/0899766042321814. 32 G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine. DOI: 10.1109/MSP .2012.2205597. 23, 25 G. E. Hinton and R. S. Zemel. 1994. Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3–10. 28 G. E. Hinton, S. Osindero, and Y.-W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation, 18(7): 1527–1554. DOI: 10.1162/neco.2006.18.7.1527. 28 S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8): 1735–1780. 29 M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47: 853–899, 2013. 22 H. Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377. 32 J. Huang and B. Kingsbury. 2013. Audio-visual deep learning for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7596–7599. IEEE. DOI: 10.1109/ICASSP.2013.6639140. 28


A. Jameson and P. O. Kristensson. 2017. Understanding and supporting modality choices. In The Handbook of Multimodal-Multisensor Interfaces, pp. 201–238. Association for Computing Machinery and Morgan & Claypool. DOI 10.1145/3015783.3015790. 21 Q.-y. Jiang and W.-j. Li. 2017. Deep Cross-Modal Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. DOI: 10.1109/CVPR.2017.348. 31 X. Jiang, F. Wu, Y. Zhang, S. Tang, W. Lu, and Y. Zhuang. 2015. The classification of multimodal data with hidden conditional random field. Pattern Recognition Letters, 51: 63–69. DOI: 10.1016/j.patrec.2014.08.005. 31 B. H. Juang and L. R. Rabiner. 1991. Hidden markov models for speech recognition. Technometrics, 33(3): 251–272. 21 S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski, et al. 2016. Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2): 99–111. DOI: 10.1007/s12193-015-0195-2. 27 M. M. Khapra, A. Kumaran, and P. Bhattacharyya. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 420–428. Association for Computational Linguistics. 37 D. Kiela and L. Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pp. 36–45. DOI: 10.3115/ v1/D14-1005. 36 D. Kiela and S. Clark. 2015. Multi-and cross-modal semantics beyond vision: Grounding in auditory perception. Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2461–2470. DOI: 10.18653/v1/D15-1293. 34, 36 D. Kiela, L. Bulat, and S. Clark. 2015. Grounding semantics in olfactory perception. In ACL (2), pp. 231–236. DOI: 10.3115/v1/P15-2038. 36 Y. Kim, H. Lee, and E. M. Provost. 2013. Deep learning for robust feature generation in audiovisual emotion recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 3687–3691. IEEE. DOI: 10.1109/ICASSP .2013.6638346. 27, 28 R. Kiros, R. Salakhutdinov, and R. S. Zemel. 2015. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. TACL. 27, 30 B. Klein, G. Lev, G. Sadeh, and L. Wolf. 2015. Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation. In CVPR. DOI: 10.1109/ CVPR.2015.7299073. 32 C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. 2014. What are you talking about? textto-image coreference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3558–3565. DOI: 10.1.1.889.207&rep=rep1&type. 36


A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. 23, 25 S. Kumar and R. Udupa. 2011. Learning hash functions for cross-view similarity search. In IJCAI proceedings-international joint conference on artificial intelligence, volume 22, p. 1360. DOI: 10.5591/978-1-57735-516-8/IJCAI11-230. 31 P. L. Lai and C. Fyfe. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05): 365–377. DOI: 10.1142/S012906570000034X. 32 A. Lazaridou, E. Bruni, and M. Baroni. 2014. Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world. In ACL (1), pp. 1403–1414. DOI: 10.3115/v1/P14-1132. 37 A. Levin, P. Viola, and Y. Freund. 2003. Unsupervised improvement of visual detectors using cotraining. In ICCV. 33 Y. Li, S. Wang, Q. Tian, and X. Ding. 2015. A survey of recent advances in visual feature detection. Neurocomputing, 149: 736–751. DOI: 10.1016/j.neucom.2014.08.003. 23 R. Lienhart. 1999. Comparison of automatic shot boundary detection algorithms. In Storage and Retrieval for Image and Video Databases (SPIE), pp. 290–301. 22 M. M. Louwerse. 2011. Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3(2): 273–302. DOI: 10.1111/j.1756-8765.2010.01106.x. 36 D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2): 91–110. DOI: 10.1023/B:VISI.0000029664.99615.94. 23 B. Mahasseni and S. Todorovic. 2016. Regularizing long short term memory with 3d humanskeleton sequences for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3054–3062. DOI: 10.1109/CVPR.2016 .333. 34, 35 H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature, 264(5588): 746–748. 21 G. McKeown, M. F. Valstar, R. Cowie, and M. Pantic. 2010. The semaine corpus of emotionally coloured character interactions. In Multimedia and Expo (ICME), 2010 IEEE International Conference on, pp. 1079–1084. IEEE. DOI: 10.1109/ICME.2010 .5583006. 22 T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. 25, 35 S. Moon, S. Kim, and H. Wang. 2015. Multimodal Transfer Deep Learning for Audio-Visual Recognition. NIPS Workshops. 34 Y. Mroueh, E. Marcheret, and V. Goel. 2015. Deep multimodal learning for audio-visual speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015


IEEE International Conference on, pp. 2130–2134. IEEE. DOI: 10.1109/ICASSP.2015 .7178347. 27 P. Nakov and H. T. Ng. 2012. Improving statistical machine translation for a resourcepoor language using related resource-rich languages. Journal of Artificial Intelligence Research, 44: 179–222. 34, 37 J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696. 21, 26, 27, 28, 34 M. A. Nicolaou, H. Gunes, and M. Pantic. 2011. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2): 92–105. DOI: 10.1109/T-AFFC.2011.9. 27, 30 W. Ouyang, X. Chu, and X. Wang. 2014. Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2336. DOI: 10.1109/CVPR.2014.299. 26, 27, 29 M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. 2009. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pp. 1410–1418. 34, 37 Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4594–4602. DOI: 10.1109/CVPR.2016.497. 27, 30 B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. DOI: 10.1109/ICCV.2015.303. 36 S. S. Rajagopalan, L.-P. Morency, T. Baltruˇsaitis, and R. Goecke. 2016. Extending long shortterm memory for multi-view structured learning. In European Conference on Computer Vision, pp. 338–353. Springer. DOI: 10.1007/978-3-319-46478-7_21. 27, 30 J. Rajendran, M. M. Khapra, S. Chandar, and B. Ravindran. 2015. Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning. In NAACL. 34, 37 N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM international conference on Multimedia, pp. 251–260. ACM. DOI: 10.1145/1873951.1873987. 32 M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. 2013. Grounding Action Descriptions in Videos. TACL. ISSN 2307-387X. 36 R. Salakhutdinov and G. Hinton. 2009. Deep boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455. 28 M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp. 2007. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7): 1396–1403. DOI: 10.1109/TMM.2007.906583. 32


A. Sarkar. 2001. Applying co-training methods to statistical parsing. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pp. 1–8. Association for Computational Linguistics. DOI: 10.3115/1073336.1073359. 33 B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic. 2011. Avec 2011–the first international audio/visual emotion challenge. Affective Computing and Intelligent Interaction, pp. 415–424. 22 E. Shutova, D. Kiela, and J. Maillard. 2016. Black holes and white rabbits: Metaphor identification with visual features. In HLT-NAACL, pp. 160–170. DOI: 10.18653/v1/ N16-1020. 34, 36 C. Silberer and M. Lapata. 2012. Grounded models of semantic representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1423–1433. Association for Computational Linguistics. 36 C. Silberer and M. Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL (1), pp. 721–732. DOI: 10.3115/v1/P14-1068. 27, 28 M. Slaney and M. Covell. 2001. Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in Neural Information Processing Systems, pp. 814–820. 32 C. G. Snoek and M. Worring. 2005. Multimodal video indexing: A review of the state-of-theart. volume 25, pp. 5–35. Springer. DOI: 10.1023/B:MTAP.0000046380.27575.a5. 21, 22 R. Socher and L. Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 966–973. IEEE. DOI: 10.1109/CVPR .2010.5540112. 37 R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. 2013. Zero-shot learning through crossmodal transfer. In Advances in neural information processing systems, pp. 935–943. 34, 37 R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2: 207–218. 30 N. Srivastava and R. Salakhutdinov. 2012a. Learning representations for multimodal data with deep belief nets. In International conference on machine learning workshop. 28 N. Srivastava and R. R. Salakhutdinov. 2012b. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pp. 2222–2230. 23, 27, 29, 34 N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1): 1929–1958. 28


H.-I. Suk, S.-W. Lee, D. Shen, A. D. N. Initiative, et al. 2014. Hierarchical feature representation and multimodal fusion with deep learning for ad/mci diagnosis. NeuroImage, 101: 569–582. DOI: 10.1016/j.neuroimage.2014.06.077. 29 G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. 2016. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5200–5204. IEEE. DOI: 10.1109/ ICASSP.2016.7472669. 25 M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic. 2013. Avec 2013: the continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge, pp. 3–10. ACM. DOI: 10.1145/2512530.2512533. 22 I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. 2016. Order-Embeddings of Images and Language. In ICLR. 25, 27, 31 S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. NAACL. 29 D. Wang, P. Cui, M. Ou, and W. Zhu. 2015a. Deep multimodal hashing with orthogonal regularization. In IJCAI, pp. 2291–2297. 26, 28 J. Wang, H. T. Shen, J. Song, and J. Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927. 31 L. Wang, Y. Li, and S. Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013. 31 W. Wang, R. Arora, K. Livescu, and J. Bilmes. 2015b. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1083–1092. 27, 32 J. Weston, S. Bengio, and N. Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pp. 2764–2770. DOI: 10.5591/978-1-57735-516-8/ IJCAI11-460. 30 D. Wu and L. Shao. 2014. Multimodal dynamic networks for gesture recognition. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 945–948. ACM. DOI: 10.1145/2647868.2654969. 29 Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. 2014. Exploring inter-feature and interclass relationships with deep neural networks for video classification. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 167–176. ACM. DOI: 10.1145/2647868.2654931. 27 R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, volume 5, p. 6. 27, 30


P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2: 67–78. 31 H. Yu and J. M. Siskind. 2013. Grounded language learning from video described with sentences. In ACL (1), pp. 53–63. 36 B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski. 1989. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11): 65–71. DOI: 10.1109/35.41402. 21 D. Zhang and W.-J. Li. 2014. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, volume 1, p. 7. 32 H. Zhang, Z. Hu, Y. Deng, M. Sachan, Z. Yan, and E. P. Xing. 2016. Learning concept taxonomies from multi-modal data. arXiv preprint arXiv:1606.09239. 31

2

Classifying Multimodal Data

Ethem Alpaydin

2.1 Introduction

Multimodal data contains information from different sources/sensors that may carry complementary information and, as such, combining these different modalities intelligently improves accuracy in classification tasks. We are going to discuss three approaches: (1) in early integration, features from all modalities are concatenated as one long input and a single classifier is trained; (2) in late integration, a separate classifier is trained with each modality, each independently makes a decision, and then the decisions are combined; and (3) intermediate integration is between these two extremes: there is a single classifier, but it is trained with a suitably processed, more abstract version of the input from each modality. We consider two possibilities for this: one uses the multiple kernel learning framework, where each modality has its own specific kernel, and the other is multimodal deep neural networks, where the processing of different modalities is separate in early layers but combined later on. For each approach, we will discuss the pros and cons in a comparative manner. We conclude that in building classifiers that combine modalities, there is no single best method, and one needs to think about the level of abstraction at which correlation is expected to occur between features and choose a combiner accordingly.

2.2 Classifying Multimodal Data

The idea of having an ensemble of learners and combining their predictions is not new in pattern recognition and machine learning, and there is a variety of methods; see Kuncheva [2004] for a comprehensive review. See the Glossary


for key terminology related to machine learning and ensemble models. For such a combination to be useful, two questions become important [Alpaydın 2014]:

1. Since it does not make sense to have multiple learners that make the same mistakes, how can we generate learners that are diverse/independent/uncorrelated, so that they make errors on different cases and complement each other?

2. When we have these multiple learners, each making a prediction on a test instance, how can we combine their predictions to calculate the overall output for the highest possible accuracy?

These two questions are not necessarily orthogonal, but it may be useful to answer them one by one. For the first question of how to generate different learners, there have been a number of alternatives.


The most popular approach is to use different learning algorithms or models. Each learning algorithm or model makes a different set of assumptions about the data and picking one algorithm corresponds to one set of assumptions. Learning is an ill-posed problem, that is, the data by itself is not sufficient to get a unique model. The set of assumptions we need to make to get a unique solution is called the inductive bias for the model or the algorithm. For example, the linear discriminant assumes that the classes are linearly separable in the input space and the k-nearest neighbor assumes that nearby instances are likely to have the same label. Hence, to make a safer bet and not “put all our eggs in the same basket,” we choose a number of algorithms/models that we believe are likely to perform well and combine their predictions. From the perspective of experiment design, we can view the learning algorithm as a (controllable) factor that affects the accuracy of the final classifier and using many algorithms corresponds to averaging over the many levels of this factor. Averaging over different models also averages out the effect of random (uncontrollable) factors; for example, neural networks are trained with gradient-descent that is randomly initialized and averaging over multiple neural networks decreases dependence on initialization. One can use the same learning algorithm or model, but with different hyperparameter settings. Each model has a hyper-parameter that allows its complexity to be adjusted to the task; for example, with multi-layer perceptrons, it is the structure of the network, i.e., the number of hidden layers and the number of hidden units; with the k-nearest neighbor classifier, it is k, that is,


Glossary

In machine learning, the learner is a model that takes an input x and learns to give out the correct output y. In pattern recognition, typically we have a classification task where y is a class code; for example, in face recognition, x is the face image and y is the index of the person whose face we are classifying. In building a learner, we start from a data set X = {x^t, r^t}, t = 1, . . . , N that contains training pairs of instances x^t and the desired output values r^t (e.g., class labels) for them. We assume that there is a dependency between x and r but that it is unknown; if it were known, there would be no need to do any learning and we would just write down the code for the mapping. Typically, x^t is not enough to uniquely identify r^t; we call x^t the observables, and there may also be unobservables that affect r^t, whose effect we model as noise. This implies that each training pair gives us only a limited amount of information. Another related problem is that in most applications x has a very high dimensionality, and our training set samples this high-dimensional space very sparsely. Our prediction is given by our predictor g(x^t|θ), where g() is the model and θ is its set of parameters. Learning corresponds to finding the best θ* that makes our predictions as close as possible to the desired values on the training set:

$$\theta^* = \arg\min_{\theta} \sum_{t=1}^{N} L\big(r^t, g(x^t \mid \theta)\big)$$

L() is the loss function that measures how far the prediction g(x^t|θ) is from the desired value r^t. The complexity of this optimization problem depends on the particular g() and L(). Different learning algorithms in the machine learning literature differ either in the model they use, the loss function they employ, or how the optimization problem is solved. The step above optimizes the parameters given a model. Each model has an inductive bias; that is, it comes with a set of assumptions about the data, and the model is accurate if its assumptions match the characteristics of the data. This implies that we also need a process of model selection where we optimize the model structure. This model structure depends on dimensions such as (i) the learning algorithm, (ii) the hyper-parameters of the model (that define model complexity), and (iii) the input features and representation, or modality. Each model corresponds to one particular combination of these dimensions. An ensemble is a set of models, and we want the models in the set to differ in their predictions so that they make different errors. If we consider the space defined by the three dimensions that define a model as listed above, the idea is to sample smartly from that space of learners. We want the individual models to be as accurate as possible and, at the same time, to complement each other. How these two criteria affect the accuracy of the ensemble depends on the way we do the combination.


Glossary

(continued)

From another perspective, we can view each particular model as a noisy estimate of the real (unknown) underlying problem. For example, in a classification task, each base classifier, depending on its model, hyper-parameters, and input features, learns a noisy estimator of the real discriminant. In such a perspective, the ensemble approach corresponds to constructing a final estimator from these noisy estimators; for example, voting corresponds to averaging them. When the different models use inputs in different modalities, there are three ways in which the predictions of models can be combined, namely, early, late, and intermediate combination/integration/fusion. In early combination, the inputs from all the different modalities are concatenated and fed to a single model. In late combination, for each modality there is a separate model that makes a prediction based on its modality, and these model predictions are later fused by a combining model. In intermediate combination, each modality is first processed to get a more abstract representation, and then all such representations from the different modalities are fed together to a single model. This processing can be in the form of a kernel function, which is a measure of similarity, and such an approach is called multiple kernel learning. Or the intermediate processing may be done by one or more layers of a neural network, and such an approach corresponds to a deep neural network. The level of combination depends on the level at which we expect to see a dependency between the inputs in different modalities. Early combination assumes a dependency at the lowest level of input features; intermediate combination assumes a dependency at a more abstract or semantic level that is extracted after some processing of the raw input; late combination assumes no dependency in the input but only at the level of decisions.

the number of nearest neighbors taken into account, and so on. Using multiple copies of the same model but with different hyper-parameter values—for example, combining three multi-layer perceptrons one with 20, one with 30, and one with 40 hidden units, again corresponds to averaging over this factor of the hyper-parameter. .

Another approach to generate learners that differ in their decisions is to train them on different training data. This can be done by sampling from the same training set: in bagging [Breiman 1996] we use the bootstrap, which samples at random with replacement, whereas in adaboost [Freund and Schapire 1996] sampling is done sequentially and is driven by error where the next


learner is trained on instances misclassified by the one before. In the mixture of experts model [Jacobs et al. 1991], there is a gating network that divides the input space into regions and each learner (expert) is trained only with the data falling in its region of expertise. .


Yet another approach is to train different models on different random feature subsets [Ho 1998]. In the random forest, for example, each decision tree sees only a random subset of the original set of features—the different subsets can overlap. Different learners see slightly different versions of the same problem; some of the features may be noisy and some may be redundant, and combining over multiple learners averages out the effect of this factor of the feature set. But the most promising approach, and it is the topic of this chapter, seems to be training the different learners/models using data coming from different modalities. Such data from different sensor sources provide different representations of the same object or event to be classified, and hence can carry information that has the highest chance of being diverse or complementary. In machine learning literature, this is also known as multi-modal, multi-view, multi-representation, or multi-source learning. The earliest example is in speech recognition, where the first modality is the acoustic signal captured by a microphone and the second modality is the visual image of the speaker’s lips as they speak. Two utterances of the same word may differ a lot in terms of the acoustics (e.g., when the speaker is a man or a woman), but we expect their lips to move similarly; so accuracy can be improved by taking this second visual modality into account as well. Incorporating this new type of sensor, here a camera for visual input, provides a completely new source of information, and this is the power of multimodality—adding the visual source to acoustics can improve the accuracy much more than what we would get if we combined multiple learners all looking at slightly different versions of the same acoustic data; see Chapter 1 of this volume [Baltrusaitis et al. 2018] for various examples of multimodal settings. When there are multiple modalities, the immediate approach would be to concatenate features of different modalities to form one long vector and use a single learner to classify it (early integration) but, as we will see shortly, feeding different modalities to different learners (late integration) or feeding them to a single learner after some preprocessing (intermediate integration) may work better.
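As a concrete illustration of two of the diversity mechanisms discussed above, bootstrap resampling and random feature subsets, the following is a small sketch with a deliberately simple base learner. The NumPy implementation, the nearest-class-mean classifier, and the synthetic data are assumptions made for illustration only.

```python
# A small illustrative sketch of generating diverse learners via bootstrap
# samples (bagging) and random feature subsets, then combining them by a
# majority vote. The base learner is a simple nearest-class-mean classifier.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 5] > 0).astype(int)          # synthetic binary labels

def fit_nearest_mean(X, y, features):
    # "Model" = per-class mean restricted to a feature subset.
    means = {c: X[y == c][:, features].mean(axis=0) for c in np.unique(y)}
    return features, means

def predict(model, X):
    features, means = model
    d = {c: np.linalg.norm(X[:, features] - m, axis=1) for c, m in means.items()}
    classes = np.array(sorted(d))
    return classes[np.argmin(np.stack([d[c] for c in classes]), axis=0)]

ensemble = []
for _ in range(10):
    boot = rng.integers(0, len(X), size=len(X))      # bootstrap sample
    feats = rng.choice(20, size=8, replace=False)    # random feature subset
    ensemble.append(fit_nearest_mean(X[boot], y[boot], feats))

# Majority vote over the diverse learners.
votes = np.stack([predict(m, X) for m in ensemble])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the ensemble:", (y_hat == y).mean())
```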


When it comes to the second question of how to combine/integrate/fuse the predictions of multiple learners, again, there are different possibilities. .

The most intuitive, and the most frequently used, approach is voting: each learner, looking at its input, votes for one class; we sum up the votes for all classes and choose the class that receives the highest vote. In the simplest case, the votes are binary 0/1: a classifier votes 1 for the class that it believes is most likely, and majority voting chooses the class that gets the highest number of votes. Let us say gij(x) ∈ {0, 1} is the output of model i = 1, . . . , m for class j = 1, . . . , k: gij(x) is 1 if model i votes for class j and is 0 otherwise. The total vote for class j is

$$y_j = \sum_{i=1}^{m} g_{ij}(x) \qquad (2.1)$$

and we choose class l if

$$y_l = \max_{j=1}^{k} y_j \qquad (2.2)$$

This is known as majority voting. Certain classifiers can generate outputs that indicate their belief in classes; for example, some classifiers estimate class posterior probabilities, and in such a case, a classifier gives soft votes (e.g., in [0, 1]) for all classes, indicating the strength of the vote. Frequently, these soft votes are nonnegative and sum up to 1 (as we have with posterior probabilities) or are normalized to be so before any combination. In such a case, using Equations (2.1) and (2.2) implies the sum rule where we choose the class that gets the maximum total soft vote:

$$l = \arg\max_{j} \sum_{i} g_{ij}(x) \qquad (2.3)$$

This is the most straightforward rule for combination, and an equivalent is the average rule where we choose the class that gets the highest average vote. Other possibilities are the median, minimum, maximum, and product rules, each of which has its use in particular circumstances [Kittler et al. 1998]. For example, the median rule

$$l = \arg\max_{j} \ \mathrm{median}_{i}\ g_{ij}(x) \qquad (2.4)$$


makes more sense than the average rule when we have a large number of possibly unreliable voters whose votes may be noisy outliers. Regardless of whether the votes themselves are binary or continuous, one can use simple or weighted voting:

$$l = \arg\max_{j} \sum_{i} w_i\, g_{ij}(x) \qquad (2.5)$$

In the simplest case, when we have no a priori reason to favor one learner over another, we use simple voting where all learners have the same weight, wi = 1/m. We use weighted voting when, for some reason or another, “some are more equal than others.” For example, one learner may have higher accuracy on a left-out validation set and hence we may want to give it more weight. In this case, with the sum rule, we calculate the weighted sum and then choose the class with the maximum total weighted vote. We generally require these weights to be non-negative and to sum up to 1: wi ≥ 0, Σi wi = 1. We can also interpret weighted voting from a Bayesian perspective. We can write

$$P(C_j \mid x) = \sum_{i=1}^{m} P(M_i)\, P(C_j \mid M_i, x) \qquad (2.6)$$

Here, P(Cj | Mi, x) is the estimate of the posterior probability for class Cj by model Mi given input x. We cannot integrate over the whole space of possible models, so we sample m likely models and take the average of their predictions weighted by the model probabilities P(Mi).

In stacking, this task of combination is viewed as another learning task [Wolpert 1992]. The learners to be combined are named the L0 learners, and we have the L1 learner whose task is to predict the correct class given the predictions of the L0 learners as its input:

$$y_j = f\big(g_{1j}(x), \ldots, g_{mj}(x) \mid \psi\big) \qquad (2.7)$$

Here, gij (x) are the L0 base learners and f () denotes the combining L1 learner with its own parameters ψ. Note that L1 does not see the original input x, it only learns to correctly combine the predictions of L0 learners. Typically, L0 learners and the L1 learner are trained on different data sets because L1 needs to learn when, and how, L0 learners succeed or fail in predicting the correct class.


When L1 is a linear model, stacking works just like weighted voting except that weights are learned; they need not be positive nor sum up to 1 (although we can constrain them to do so if we want). But the nice property of stacking is that L1 can be any model—for example, L1 can be a multi-layer perceptron or a decision tree—thereby allowing a nonlinear combination of learner outputs implying a much more powerful combination than voting. .

The mixture of experts, which we mentioned above, can also be seen as a variant of weighted voting where the weight of a learner is dynamic. There is a gating model that also sees the input, and its outputs are the combination weights; the gating model works like a probabilistic classifier whose task is to assign the input to the expert that it believes is the right model to make the decision for it [Jacobs et al. 1991]. Learners are given different weights by the gating model depending on the input; a learner is given the highest weight in its region of expertise. We generalize Equation (2.5) as

$$l = \arg\max_{j} \sum_{i} w_i(x)\, g_{ij}(x) \qquad (2.8)$$

where wi (x), i = 1, . . . , m are the outputs of the gating model calculated for input x. In the hierarchical mixture of experts, this partitioning of the input space among experts is done hierarchically [Jordan and Jacobs 1994]. One can view this as a soft tree where gating models act as decision nodes and experts are the leaves, so the final output is again calculated as a weighted sum but propagated from the leaves to the root level by level. .

Cascading differs from the approaches above in the sense that the learners do not work in parallel but in series [Alpaydın and Kaynak 1998, Kaynak and Alpaydın 2000]. They are ordered and the input is first given to the first classifier, it makes a prediction for a class, and if it is confident in its output (e.g., if the highest posterior is greater than a certain threshold) we use that output, otherwise the input is fed to the second classifier, which in turn makes a prediction and we check if it is confident, and so on. The classifiers are ordered in terms of complexity so stopping early decreases the overall complexity. In a multimodal setting, the classifiers may be ordered according to the cost of sensing these different modalities, so we do not pay for a costly modality if the earlier cheaper ones suffice for confident classification.

$$l = \begin{cases} \arg\max_j g_{1j}(x) & \text{if } g_1(x) \text{ is confident} \\ \arg\max_j g_{2j}(x) & \text{if } g_2(x) \text{ is confident} \\ \;\;\vdots & \\ \arg\max_j g_{mj}(x) & \text{otherwise} \end{cases} \qquad (2.9)$$

These approaches for model combination, namely voting, stacking, and so on, are typically used for combining learners that use different algorithms, hyperparameters, and/or training data. In the rest of this chapter, we will discuss how these model combination approaches are applied when we have multimodal data.
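Before turning to the multimodal setting, the combination rules above can be made concrete with a short sketch. The NumPy code, the made-up soft votes, and the least-squares stacker standing in for a general L1 learner are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch (with made-up numbers) of the combination rules above:
# majority voting, the sum rule over soft votes, and weighted voting, plus a
# simple linear stacker fitted by least squares as the L1 combiner.
import numpy as np

# g[i, j] = soft vote (e.g., estimated posterior) of model i for class j,
# here for a single test instance with m = 3 models and k = 3 classes.
g = np.array([[0.70, 0.20, 0.10],
              [0.10, 0.60, 0.30],
              [0.40, 0.45, 0.15]])

# Majority voting, Eqs. (2.1) and (2.2): each model casts a hard 0/1 vote.
hard = (g == g.max(axis=1, keepdims=True)).astype(float)
print("majority vote:", hard.sum(axis=0).argmax())

# Sum rule, Eq. (2.3): add up the soft votes.
print("sum rule:", g.sum(axis=0).argmax())

# Weighted voting, Eq. (2.5): weights could come from validation accuracy.
w = np.array([0.5, 0.3, 0.2])
print("weighted vote:", (w @ g).argmax())

# Stacking, Eq. (2.7), with a linear L1 learner: learn how to combine the
# L0 outputs from a held-out set (synthetic here) by least squares.
rng = np.random.default_rng(1)
G = rng.random(size=(100, 3, 3))                 # L0 soft votes on 100 instances
labels = rng.integers(0, 3, size=100)
targets = np.eye(3)[labels]                      # one-hot desired outputs
A = G.reshape(100, 9)                            # flatten model-by-class votes
W, *_ = np.linalg.lstsq(A, targets, rcond=None)  # L1 parameters psi
print("stacked prediction:", (g.reshape(1, 9) @ W).argmax())
```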

2.3 Early, Late, and Intermediate Integration

Let us say we have input from different modalities and we want to make a decision using information coming from all the modalities. We denote the input in modality i as the d^(i)-dimensional vector x^(i), where i = 1, . . . , m and m is the number of modalities. Here, we assume that each x^(i) is available at once; there are also integration approaches where the inputs are sequences; see Chapter 3 of this volume [Panagakis et al. 2018]. The most straightforward approach is early integration, where we concatenate all these vectors to form a single x = (x^(1), x^(2), . . . , x^(m)), which is a d = Σ_{i=1}^{m} d^(i) dimensional vector. We train a single classifier with this concatenated x (see Figure 2.1(a)):

$$y = g\big(x^{(1)}, x^{(2)}, \ldots, x^{(m)} \mid \theta\big) \qquad (2.10)$$

Here, y is the output and g() is the model defined up to a set of parameters θ. The advantages are that we train and use a single learner, and after concatenation we can use any learning algorithm to learn the classifier. But there are also two risks here. First, the concatenated input dimensionality may be very high, and this causes two types of problems. (i) With more inputs, the classifier becomes more complex, in terms of space (more memory) and time (more computation), and hence higher input dimensionality implies higher implementation costs. (ii) Because the model gets more complex with more parameters, we also need more training data, otherwise there is a higher risk of overfitting; this is called the curse of dimensionality. The second and more important risk associated with early integration is that these features from different modalities come from different sources; their units and scales are different; they are like apples and oranges. Hence the joint space defined by their concatenation can be a very strange space and the class distributions

Figure 2.1   (a) Early, (b) late, and (c) intermediate (deep) integration drawn as multi-layer networks. Squares are inputs in different modalities (each of which may be a vector), ovals are extracted (hidden) features, and shaded rectangles are predicted outputs. Each oval can be one or more hidden layers.

in there can be very difficult to model. For example, consider a scenario where we have image and speech; in such a case, some of the dimensions are pixels and some are frequency coefficients; using a dot product or Euclidean distance to compare their concatenations does not make much sense. Early integration may also lead to a problem of alignment: when we have a sequence of observations in each modality, it may be tricky to know which ones should be used together; see Chapter 3 of this volume [Baltrusaitis et al. 2018]. One clear application of early integration is when for each modality we have a representation with very few features that provide only limited information. For example, when we are doing credit scoring, we have age, gender, profession, salary, and so on; those are actually different modalities, but each by itself is not sufficient to make a prediction with, so we concatenate them and consider the whole as a four-dimensional vector. Concatenating and feeding them together may also allow finding correlations between them; typically, age and salary are positively correlated. But if for each modality we have a representation that is long and detailed enough, giving us enough information for prediction, and we do not expect to see much correlation between features of different modalities, we prefer to use late integration, where we have a separate learner for each modality that is trained with the input in its corresponding representation. For example, for user authentication, given the face image and speech, we have one classifier that looks at the image to


For example, for user authentication, given the face image and speech, we have one classifier that looks at the image to make a decision and another that looks at the speech to make a decision; the outputs of both classifiers are on the same scale and mean the same thing, e.g., both may be class posteriors, and hence fusing them makes sense. Given a test instance in $m$ modalities, each learner, independently and in parallel, makes its prediction using its own modality, and then we combine their decisions (see Figure 2.1(b)):

$$y = f\big(g_1(x^{(1)} \mid \theta_1),\, g_2(x^{(2)} \mid \theta_2),\, \ldots,\, g_m(x^{(m)} \mid \theta_m) \,\big|\, \psi\big), \qquad (2.11)$$

where $g_i(x^{(i)} \mid \theta_i)$ is model $i$ with its parameters $\theta_i$, taking input in modality $i$. As the combining model $f(\cdot)$, one can use any of the combining methods discussed before, i.e., voting, stacking, mixture of experts, or cascading, and $\psi$ are its parameters, e.g., the weights in voting, the parameters of the L1 model in stacking, and so on, also trained on some data. In combining such separately trained models, we see in practice that no matter how much we play with the learning algorithms, hyper-parameters, or any other factor that affects the trained model, classifiers turn out to be positively correlated: they tend to make similar decisions. Depending on how the modalities are defined, such a correlation may also occur when we have multimodal combination. This has two consequences. First, if we have two models that are highly correlated, we do not need both; we can just keep one and discard the other. The second consequence is that when we have correlated models, having more models actually decreases accuracy; a larger ensemble is more accurate only if the voting models are independent. Both of these consequences indicate the need for post-processing an ensemble to reduce correlations. One approach is subset selection [Ulaş et al. 2009]: if we have $m$ models, we want to choose a smaller subset of size $k < m$ without losing accuracy. The algorithms proposed for this are basically the same as the ones we use for feature selection: if $m$ is small, we can do an exhaustive search of all possible subsets; otherwise we perform a greedy search with hill-climbing, where we start with the empty set and add one learner at a time, adding the one that increases accuracy the most, until no further addition improves accuracy. We can also do a backward search, where we start with all models and remove one at a time until one more removal drastically worsens performance, or we can do a floating search that allows both additions and removals. When we use a subset instead of all models, we save the space/time complexity of the pruned learners, and in cases where they use inputs from different modalities with associated costs, we also save the cost of sensing the modalities that turn out to be unnecessary.
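To make the greedy forward search concrete, the following is a minimal sketch of ensemble subset selection over already-trained classifiers. The function name, the majority-vote combiner, the scikit-learn-style `.predict()` interface, and the assumption of nonnegative integer class labels are illustrative choices, not part of the original text.

```python
# Greedy forward subset selection of an ensemble on a held-out validation set.
import numpy as np

def forward_select(classifiers, val_inputs, y_val, max_size=None):
    """classifiers: list of fitted models; val_inputs[i] is the validation
    input for classifier i (e.g., the features of its own modality)."""
    preds = [clf.predict(X) for clf, X in zip(classifiers, val_inputs)]
    selected, best_acc = [], 0.0
    max_size = max_size or len(classifiers)

    def vote_accuracy(idx):
        stacked = np.stack([preds[i] for i in idx]).astype(int)   # (k, n_val)
        # majority vote over the selected members (labels assumed to be >= 0)
        votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                    0, stacked)
        return np.mean(votes == y_val)

    while len(selected) < max_size:
        candidates = [i for i in range(len(classifiers)) if i not in selected]
        acc, best_i = max((vote_accuracy(selected + [i]), i) for i in candidates)
        if acc <= best_acc:      # stop once no candidate improves accuracy
            break
        selected.append(best_i)
        best_acc = acc
    return selected, best_acc
```

A backward or floating search would follow the same pattern, removing (or alternately adding and removing) members while monitoring validation accuracy.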


The other approach is to post-process the outputs to remove correlations. This is reminiscent of feature extraction algorithms, such as principal components analysis, where we define new features in terms of the original features, e.g., by linear projection. Indeed, in the approach of eigenclassifiers, we define new features taking into account the correlations between classifiers [Ulaş et al. 2012]. Early integration combines at the lowest level of input features and late integration combines at the highest level of output predictions. Intermediate integration, as its name suggests, is between these two extremes. First, for each modality, some processing is done to convert the raw input to a more abstract representation, and then these are fed together to a classifier. That is, there is a single learner, but it is trained with some abstract version of the input from each modality (see Figure 2.1(c)):

$$y = g(z^{(1)}, z^{(2)}, \ldots, z^{(m)} \mid \theta), \qquad (2.12)$$

where $z^{(i)}$ is a processed version of $x^{(i)}$. We discuss below two variants, one using multiple kernels and the other using deep neural networks.
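Before turning to these two variants, a minimal sketch may help contrast the two extremes of Equations (2.10) and (2.11). The use of scikit-learn, logistic regression, and fixed voting weights are illustrative assumptions; any classifier with class-posterior outputs would do.

```python
# Early integration: concatenate per-modality features, train one classifier.
# Late integration: one classifier per modality, combine class posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def early_fusion_fit(X1, X2, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.hstack([X1, X2]), y)          # single learner on concatenated x
    return clf

def late_fusion_predict(clf1, clf2, X1, X2, w1=0.5, w2=0.5):
    # weighted vote over per-modality posteriors (weights fixed here;
    # they could also be trained, e.g., by stacking)
    p = w1 * clf1.predict_proba(X1) + w2 * clf2.predict_proba(X2)
    return p.argmax(axis=1)
```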

2.4 Multiple Kernel Learning

In a kernel machine, such as the support vector machine, we write the class discriminant in terms of kernel functions [Cortes and Vapnik 1995]. A kernel function is a measure of similarity between two vectors, one of which is the input and the other a training instance (on or inside the margin, or on the wrong side of the discriminant), named a support vector in the support vector machine algorithm. The kernel function implicitly defines a basis space that these vectors are mapped to and compared in: $K(x, y) = \phi(x)^T \phi(y)$. That is, $K(\cdot)$ returns a value equal to the dot product of the images of the two vectors in the space of the basis function $\phi(\cdot)$. Every valid kernel corresponds to a different basis space. In kernel machines, the most critical model selection problem is the choice of the appropriate kernel. A good kernel calculates a good similarity measure between instances so that, for example in classification, the similarity between two instances of the same class is larger than the similarity between instances of different classes. In the typical case where instances are represented as vectors, kernels typically use the dot product or its variants, such as the polynomial kernel, or the Euclidean distance or its variants, such as the Gaussian kernel. But one of the attractive properties of the kernel approach is that we do not need to have our inputs represented vectorially. We can define kernels starting directly with similarities.


That is, if we have some application-specific similarity measure that we can apply to pairs of instances, we can define a kernel in terms of it. So if we have some complex data structure, such as a graph or a document, we do not need to worry about how to represent it as a vector, as long as we can come up with some similarity measure to compare two graphs or documents. For documents, for example, the need for a vectorial representation led to the bag-of-words representation, which has various disadvantages; it may be easier to directly define a function to compare two documents for similarity. This advantage also holds for the multimodality case: it may be easier to define a similarity measure for a modality than to generate a vectorial representation and then use a kernel in terms of such vectors. The analog of multiple learners in kernel machines is multiple kernels: just as we have different learning algorithms to choose from, in kernel machines we have different kernels available. Typically, we do not know beforehand which one is the most suitable, and the typical approach is to try different kernel functions and choose the best (e.g., by checking accuracy on a left-out validation data set), treating kernel selection as a model selection problem. The other possibility is to combine those kernels; this is called multiple kernel learning [Gönen and Alpaydın 2011]. The idea is that each kernel is a different measure of similarity, and we use a set of candidate measures and write the discriminant as a combination of such similarities, again averaging out the effect of this factor. This multiple kernel learning framework can easily be applied to the multimodal setting, where we have one kernel for each modality. The simplest and most frequently used approach is a linear combination:

$$K(x, y) = \sum_{i=1}^{m} w_i K_i(x^{(i)}, y^{(i)}), \qquad (2.13)$$

where $x^{(i)}, y^{(i)}$ are the representations of the two instances $x, y$ in modality $i$, and $K_i(x^{(i)}, y^{(i)})$ is the kernel measuring their similarity according to that modality. The weights $w_i$, $i = 1, \ldots, m$, are trained on labeled data [Lanckriet et al. 2004, Sonnenburg et al. 2006]. Frequently, they are constrained to be nonnegative, and sometimes also to sum to 1. This helps interpretation: a higher $w_i$ implies a more important kernel and hence a more important modality. If the kernel weights are nonnegative, such a combination corresponds to scaling and concatenating the underlying feature representations $\phi_i(x)$; this implies a combination similar to early integration, but in the space of the basis functions.
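A minimal sketch of the combination in Equation (2.13) is given below, assuming one RBF kernel per modality and fixed, nonnegative weights; in a full multiple kernel learning method the weights themselves would be learned from the labeled data, which is not implemented here.

```python
# Weighted sum of per-modality kernels fed to a precomputed-kernel SVM.
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def combined_kernel(views_a, views_b, weights, gammas):
    # views_a[i], views_b[i]: sample matrices of modality i (rows = samples);
    # returns the weighted Gram matrix between the two sample sets.
    return sum(w * rbf_kernel(Xa, Xb, gamma=g)
               for w, Xa, Xb, g in zip(weights, views_a, views_b, gammas))

# Usage sketch:
# K_train = combined_kernel(train_views, train_views, w, gammas)
# svm = SVC(kernel="precomputed").fit(K_train, y_train)
# K_test = combined_kernel(test_views, train_views, w, gammas)
# y_pred = svm.predict(K_test)
```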


Various variants of the multiple kernel learning framework have been proposed (Gönen and Alpaydın [2011] is a survey), including nonlinear combinations. One variant is local combination, where the $w_i$ are not constant but are a function of the input, effectively working as a gating model that chooses between kernels depending on the input [Gönen and Alpaydın 2008]; such an approach can also be viewed as the kernelized version of the mixture of experts.

2.5 Multimodal Deep Learning

Recently, deep neural networks have become highly popular in a variety of applications. A neural network is composed of layers of processing units, where each unit takes input from units in the preceding layer through weighted connections. The unit then calculates its value after this weighted sum is passed through a nonlinear activation function. Given an input, the processing proceeds as the units calculate their values layer by layer until we get to the final output layer. The network structure, i.e., the number of layers, the number of units in each layer, and the way the units are interconnected, defines the model, and the weights of the connections between the units are the parameters. Given a training set of pairs of inputs and desired outputs, the weights are updated iteratively to make the actual outputs in the output layer as close as possible to the desired outputs. If there is no a priori information, successive layers are fully connected. In applications where the input has locality, connectivity is restricted to reflect dependencies; for example, in vision applications, the input is a two-dimensional image, and in a convolutional layer a hidden unit sees only a small local patch of the inputs. With data where there is temporal dependency, a recurrent connection allows a hidden unit to take into account not only the current input but also its value at the previous time step. In certain network structures there are gating units, as we have in the mixture of experts, that either allow or block the propagation of a unit's value. A judicious use of all these types of units and connectivity makes neural networks quite powerful in a variety of applications. The idea in a neural network is that each hidden layer after the input layer learns to be a feature detector by responding to a certain combination of values in its preceding layer, and when we have a network with many such layers, i.e., a deep neural network, successive layers learn feature detectors of increasing abstraction. A deep learning model, in its many layers, extracts increasingly higher-level and more abstract sets of features, and this allows a better representation of the task and hence improves accuracy [Bengio 2009, Goodfellow et al. 2016].


The most important advantage of neural networks is that such a model makes very simple assumptions about the data, and because it is a universal approximator, it can learn any mapping. The disadvantage is that because the model is so general, we need a large amount of data to constrain it and make sure that it learns the correct task and generalizes well to data outside of the training set. Another advantage of neural networks is that calculations are local to the hidden units, and parallel architectures such as GPUs can be efficiently programmed to handle the computation in a neural network with significant speed-up. We can view each hidden layer of a neural network as learning a kernel, and when we have many such hidden layers in a deep network, it is as if we are learning increasingly abstract kernels calculated in terms of simpler kernels. The advantage is that the kernels are not defined a priori but are learned from data; the disadvantage is that the optimization problem is non-convex and we need to resort to stochastic gradient descent with all its concomitant problems. In learning the weights of the feature-detecting hidden units, the earlier approach was to use the autoencoder model [Cottrell et al. 1987], where the output is set to be equal to the input and the hidden layer in between has fewer hidden units. The hidden layer hence acts as a bottleneck and learns a compressed and abstract representation with minimum reconstruction error. The autoencoder model can also be trained to be robust to missing inputs [Vincent et al. 2008]. Roughly speaking, we can view the hidden representation learned in the autoencoder as the $\phi_i(\cdot)$ basis of kernel $K_i(\cdot)$ in kernel machines. We can then stack such autoencoders to generate a deep neural network with multiple hidden layers. The autoencoder model has the advantages that, first, it can be trained with unlabeled data, and, second, learning is fast because we train one layer at a time. Deep architectures have also been used to combine multiple modalities. The idea is to first train separate autoencoders for each modality and then learn to combine them across modalities by training a supervised layer on top (see Figure 2.1(c)). This is an example of the intermediate combination defined in Equation (2.12), where the earlier modality-specific layers learn to generate the $z^{(i)}$, which are then fused in the later layer(s) (denoted by $g(\cdot)$ with its weights $\psi$). Nowadays, with large labeled data sets and processing power available, end-to-end deep neural networks are trained directly in a supervised manner, bypassing the training of autoencoders altogether. Because the whole training is supervised and all the parameters are trained together, we can achieve higher accuracy, but the disadvantage is that training multiple layers using stochastic gradient descent is slow and one needs to use regularization methods such as dropout to make sure that the large network does not overfit.
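The following is a minimal undercomplete autoencoder sketch; the chapter does not prescribe a particular architecture or framework, so the layer sizes, the single hidden layer, and the use of PyTorch are illustrative assumptions. Corrupting the input (e.g., with dropout or noise) before encoding, while still reconstructing the clean input, would give the denoising variant mentioned above.

```python
# A bottleneck autoencoder trained to minimize reconstruction error.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = self.encoder(x)               # compressed, abstract representation
        return self.decoder(z), z

def train_autoencoder(model, data, epochs=50, lr=1e-3):
    # data: a float tensor of shape (n_samples, d_in); full-batch for brevity
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(data)
        loss = loss_fn(recon, data)       # reconstruction error
        loss.backward()
        opt.step()
    return model
```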


If we use early combination and just concatenate features from different modalities and feed them to a single network, feature-extracting (hidden) units have strong connections to a single modality with few units connecting across modalities; it is not possible to correlate basic features from one modality with basic features of another modality. But if both are separately processed using separate hidden units (trained either as an autoencoder or end-to-end) to get a higher-level and more abstract representation, these extracted features can be combined to learn a shared representation and a classifier that uses such a cross-modality representation has higher accuracy. This has been shown to be true in combining audio and lip image features for speech recognition [Ngiam et al. 2011]. A similar approach and result also holds for image retrieval where in addition to image data, there are also text tags [Srivastava and Salakhutdinov 2012, Srivastava and Salakhutdinov 2014]. For each modality, there is a separate deep network whose hidden units learn the abstract features that are modality specific; then the two such abstract representations can be combined in a set of features and we can for example use such a network to map one modality into another, so that, for example, given an input image, the network can generate a set of candidate tags, or given a set of tags, the network can find the best matching image. In training the shared features that we mention above, that combine raw features from different modalities, different unsupervised criteria can also be used. Additional to minimization of the reconstruction error, one can also use variation of information [Sohn et al. 2014] or canonical correlation analysis [Andrew et al. 2013, Wang et al. 2015]. Multimodal deep networks can also be trained to combine similarities. Separate autoencoders learn modality-specific representations and instead of using them as vectors, a similarity measure is applied to each and their weighted sum is calculated to get an overall similarity [Wu et al. 2013]. This approach is very similar to multiple kernel learning where we take a weighted sum of kernels (which are also measures of similarity); see also McFee and Lanckriet [2011]. See Keren et al. [2018] for an extensive survey of deep learning methods for multi-sensorial and multimodal interaction.
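A hedged sketch of such an intermediate-integration architecture is given below: modality-specific encoders produce the abstract representations $z^{(i)}$ of Equation (2.12), and shared layers fuse them (cf. Figure 2.1(c)). The two-modality setup, layer sizes, and the use of PyTorch are assumptions made for illustration only.

```python
# Intermediate fusion: per-modality encoders plus a shared classification head.
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, d_audio, d_visual, d_latent, n_classes):
        super().__init__()
        # modality-specific encoders learn the abstract representations z(i)
        self.audio_enc = nn.Sequential(nn.Linear(d_audio, d_latent), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(d_visual, d_latent), nn.ReLU())
        # shared layers fuse the abstract features across modalities
        self.fusion = nn.Sequential(
            nn.Linear(2 * d_latent, d_latent), nn.ReLU(),
            nn.Linear(d_latent, n_classes))

    def forward(self, x_audio, x_visual):
        z = torch.cat([self.audio_enc(x_audio), self.visual_enc(x_visual)], dim=1)
        return self.fusion(z)             # class scores from the shared features

# Training end-to-end with nn.CrossEntropyLoss() optimizes encoders and fusion
# layers jointly; alternatively, the encoders can first be pretrained as
# (modality-specific) autoencoders and then fine-tuned with the supervised head.
```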

2.6 Conclusions and Future Work

Combining multiple models to improve accuracy is an approach frequently used in pattern recognition and machine learning. Mostly it is used to average out the effect of factors such as the learning algorithm, hyper-parameters, or randomness in the data or in the initialization.


Combining multiple modalities promises to bring a significant improvement in accuracy, because data from different modalities have the highest chance of providing complementary information about the object or event to be classified. To combine multiple modalities, there are the different possibilities that we have outlined above, and the right one, that is, the level at which combination is to be done, depends on the level of abstraction at which correlation is expected to occur between the different modalities. If features in different modalities are correlated at the feature level, one can use early combination by just concatenating all the modalities and feeding the result to a single classifier. This is the easiest and fastest approach and works well if its assumptions hold and if there are not many features in total. But if the data in different modalities are correlated at a more abstract, semantic level, one can use intermediate integration. For example, if we have an image and a set of tags, no individual image pixel is correlated with any tag, but the existence of many blue patches in the upper half of the image may be correlated with the tag "sky." To find such an abstract representation in each modality, one can use modality-specific hidden layers to extract it from data. By suitably combining and stacking such representations, a deep neural network can be trained. In some applications, we may know a good measure of similarity in advance for each modality, which we can write down as a kernel, and a smart combination of kernels is another way to combine modalities. One big advantage of kernels is that one can define a similarity between instances without necessarily generating a vectorial representation and using a vectorial kernel. Late combination is used when there is no correlation at any level in the input, neither at the lowest level of input features nor after any amount of feature extraction. For example, in biometrics, if we have a face image and a fingerprint image, there is no correlation between the pixels of the two images, nor can we extract any correlation between higher-level features extracted separately from these images. The only correlation is at the level of the labels: whether they belong to the same person or not. In such a case, we can only do late integration, where we classify the two images separately and combine their decisions. The take-away message of this chapter should be that, in building classifiers that combine modalities, there is no single best method, and one needs to think about the level at which correlation is expected to occur and choose a combiner accordingly. Multimodal classification and learning is a relatively recent idea, but we expect to see it applied more in the future with the wider availability of diverse sensors in many modalities. Mobile devices, smart objects, and wearable sensors detect and record data in different modalities [Neff and Nafus 2016].


Each such device or sensor provides a partial clue from a different modality, but by combining them, higher precision may be attained. With advances in digital technology and all types of smart online objects with their sensors, namely the Internet of Things [Greengard 2015], appearing on the market, multimodal combination will only become more important in the future.

Acknowledgments

This work is partially supported by Boğaziçi University Research Funds with Grant Number 14A01P4.

Focus Questions

2.1. Consider the level of dependency between modalities in example applications, and discuss, for each, which combination (early, intermediate, or late) is appropriate.

2.2. Consider the average of identically distributed $g_i$:

$$y = \frac{1}{m}\sum_{i=1}^{m} g_i.$$

Show that (a) the variance of $y$ is minimized when the $g_i$ are independent, and (b) the variance of $y$ increases when the $g_i$ are positively correlated.

2.3. In some studies on multimodal deep learning, researchers split each digit image into two, as left or right halves, or top and bottom halves, and process them as if they are two different modalities. Discuss the suitability of this approach for testing intermediate integration.

2.4. A kernel machine uses fixed kernels but defines a convex problem, which we can solve optimally. A multi-layer perceptron is trained using stochastic gradient descent, which converges to the nearest local minimum, but its hidden units can be trained. Discuss the pros and cons of the two in the context of multimodal classification.

2.5. Some researchers have proposed methods for learning good kernel functions from data. Discuss how such a method can be used in multiple kernel learning in the context of multimodal classification.

2.6. In a multimodal deep learner, some layers learn features that are specific to a modality, and some learn features across modalities. How can we decide how many layers to have for each in a deep neural network?


2.7. In training a deep neural network with many hidden layers, the earlier approach was to train autoencoders one layer at a time and then stack them; nowadays, however, researchers prefer to train the whole network end-to-end. Discuss the advantages and disadvantages of the two approaches.

2.8. In this chapter, we discussed methods for multimodal classification. Discuss how these can be adapted for multimodal regression and multimodal clustering.

References

E. Alpaydın. 2014. Introduction to Machine Learning. 3rd edition. The MIT Press.
E. Alpaydın and C. Kaynak. 1998. Cascading classifiers. Kybernetika, 34(3): 369–374.
G. Andrew, R. Arora, J. Bilmes, and K. Livescu. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning, pp. 1247–1255.
T. Baltrusaitis, C. Ahuja, and L.-Ph. Morency. 2018. Multimodal machine learning. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Chapter 1, Morgan & Claypool Publishers, San Rafael, CA.
Y. Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2: 1–127. DOI: 10.1561/2200000006.
L. Breiman. 1996. Bagging predictors. Machine Learning, 26: 123–140. DOI: 10.1023/A:1018054314350.
C. Cortes and V. Vapnik. 1995. Support vector networks. Machine Learning, 20: 273–297. DOI: 10.1023/A:1022627411411.
G. W. Cottrell, P. Munro, and D. Zipser. 1987. Learning internal representations from grayscale images: An example of extensional programming. In Ninth Annual Conference of the Cognitive Science Society, pp. 462–473.
Y. Freund and R. E. Schapire. 1996. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pp. 148–156.
M. Gönen and E. Alpaydın. 2008. Localized multiple kernel learning. In International Conference on Machine Learning, pp. 352–359. DOI: 10.1145/1390156.1390201.
M. Gönen and E. Alpaydın. 2011. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12: 2211–2268.
I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
S. Greengard. 2015. The Internet of Things. Essential Knowledge Series. MIT Press.
T. K. Ho. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20: 832–844. DOI: 10.1109/34.709601.


R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3: 79–87. DOI: 10.1162/neco.1991.3.1.79.
M. I. Jordan and R. A. Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2): 181–214. DOI: 10.1162/neco.1994.6.2.181.
C. Kaynak and E. Alpaydın. 2000. Multistage cascading of multiple classifiers: One man's noise is another man's data. In International Conference on Machine Learning, pp. 455–462.
G. Keren, A. E. Mousa, and B. Schuller. 2018. Deep learning for multi-sensorial and multimodal interaction. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Chapter 4, Morgan & Claypool Publishers, San Rafael, CA.
J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. 1998. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20: 226–239. DOI: 10.1109/34.667881.
L. I. Kuncheva. 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley.
G. R. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. 2004. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5: 27–72.
B. McFee and G. Lanckriet. 2011. Learning multi-modal similarity. Journal of Machine Learning Research, 12: 491–523.
G. Neff and D. Nafus. 2016. Self-Tracking. Essential Knowledge Series. MIT Press.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In International Conference on Machine Learning, pp. 689–696.
Y. Panagakis, O. Rudovic, and M. Pantic. 2018. Learning for multi-modal and context-sensitive interfaces. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Chapter 3, Morgan & Claypool Publishers, San Rafael, CA.
K. Sohn, W. Shang, and H. Lee. 2014. Improved multimodal deep learning with variation of information. In Advances in Neural Information Processing Systems 27, pp. 2141–2149.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. 2006. Large scale multiple kernel learning. Journal of Machine Learning Research, 7: 1531–1565.
N. Srivastava and R. Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems 25, pp. 2222–2230.
N. Srivastava and R. Salakhutdinov. 2014. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15: 2949–2980.


A. Ulaş, M. Semerci, O. T. Yıldız, and E. Alpaydın. 2009. Incremental construction of classifier and discriminant ensembles. Information Sciences, 179: 1298–1318. DOI: 10.1016/j.ins.2008.12.024.
A. Ulaş, O. T. Yıldız, and E. Alpaydın. 2012. Eigenclassifiers for combining correlated classifiers. Information Sciences, 187: 109–120. DOI: 10.1016/j.ins.2011.10.024.
P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096–1103. DOI: 10.1145/1390156.1390294.
W. Wang, R. Arora, K. Livescu, and J. Bilmes. 2015. On deep multi-view representation learning. In International Conference on Machine Learning, pp. 1083–1092.
D. H. Wolpert. 1992. Stacked generalization. Neural Networks, 5: 241–259. DOI: 10.1016/S0893-6080(05)80023-1.
P. Wu, S. C. H. Hoi, H. Xia, P. Zhao, D. Wang, and C. Miao. 2013. Online multimodal deep similarity learning with application to image retrieval. In ACM International Conference on Multimedia, pp. 153–162. DOI: 10.1145/2502081.2502112.

3 Learning for Multimodal and Affect-Sensitive Interfaces

Yannis Panagakis, Ognjen Rudovic, Maja Pantic

3.1 Introduction

Humans perceive the world by combining acoustic, visual, and tactile stimuli from their senses and interact with each other using different modalities of communication, such as speech, gaze, and gestures. Therefore, it is beneficial to equip human-computer interaction interfaces with multisensory and multimodal signal processing and analysis abilities. For instance, among other applications, personal assistants such as Amazon Echo and Google Home, and browsers such as Opera and NetFront, are currently offering a multimodal interaction experience to users. Multimodal or multiview signals are sets of heterogeneous data, captured by different sensors (including various types of cameras, microphones, and tactile sensors, to mention but a few) and in different contexts (where the context is defined in terms of who the subject is, where he/she is, what their task is, and so on). Such signals are expected to exhibit some mutual dependencies or correlations, since they usually represent the same physical phenomenon, and, thus, their simultaneous processing reveals information that is unavailable when considering the modalities independently. Therefore, a critical question here is how the different information sources should be modeled jointly and hence combined to achieve optimally responsive multimodal user interfaces. Indeed, the performance of multimodal interfaces is influenced not only by the different types of modalities to be integrated but also by the abstraction level at which these modalities are integrated and fused, as well as by the machine learning method used for such multisensory data fusion.


Glossary

A correlation is a single number that describes the degree of relationship between two variables (signals). It most often refers to how close two variables are to having a linear relationship with each other.

Domain adaptation refers to machine learning methods that learn, from a source data distribution, a model that performs well on a different (but related) target data distribution.

The Facial Action Coding System (FACS) refers to a set of facial muscle movements that correspond to a displayed emotion.

Gross errors refer to non-Gaussian noise of large magnitude. Gross errors are often in abundance in audio-visual data due to incorrect localisation and tracking, the presence of partial occlusions, environmental noise, etc.

Multimodal or multiview signals are sets of heterogeneous data, captured by different sensors, such as various types of cameras, microphones, and tactile sensors, and in different contexts.

Temporal dynamics of facial expression: rather than being like a single snapshot, facial appearance changes as a function of time. Two main factors affecting the temporal dynamics of facial expression are the speed with which expressions unfold and the changes of their intensity over time.

Regarding such machine learning methods, Section 3.2 provides an overview of generic machine learning models, such as Canonical Correlation Analysis, its variants, and extensions, for extracting correlations from multimodal signals. A comprehensive review of multimodal machine learning methods can be found in Chapter 1 of this volume. Besides the multimodal nature of human interaction, human–human interaction is severely influenced by the affective states and responses of the interactants. Consequently, in the design of affect-sensitive multimodal interfaces, the goal is to ensure that they achieve the naturalness of human–human interaction by combining multimodal signals such as visual (sight), audio (sound), and tactile (touch), and reasoning about users' affective state from them. In this context, the following challenges are identified.

- How is the temporal dimension of multimodal data taken into account in order, for example, to capture the dynamics of human affect (e.g., the temporal evolution of facial expression intensity)?

- What is the role of context within and between different modalities, and can this be used to improve the performance of reasoning systems to be embedded in target user interfaces?

To address the important challenges mentioned above, various models and machine learning methods have been proposed in the literature; please refer to Chapters 6 and 7 in this volume for an overview. In this chapter, we mainly focus on providing an overview and discussion of recent trends in machine learning models for multimodal analysis of facial behavior, which is a cornerstone in the design of intelligent, affect-sensitive user interfaces. Specifically, we identify and discuss the following modeling directions:

- temporal modeling,
- context dependency, and
- domain adaptation.

3.1.1 The Importance of Temporal Information

As mentioned earlier, a key factor in learning for affect-sensitive user interfaces is the modeling of facial behavior dynamics. In particular, temporal modeling of facial expressions is critical for several reasons. Different temporal segments and intensity levels of expressions never occur in isolation but vary smoothly in time. Furthermore, temporal segments and intensity levels of facial expressions differ in their duration (e.g., the higher intensity levels occur less frequently than the lower levels). Moreover, temporal segments of emotion expression occur in a specific temporal order, i.e., the onset of an emotion expression is followed by its apex or offset segment. Accounting for this temporal structure of facial expressions is important for the models to be able, for instance, to discriminate between onset and offset temporal segments of facial expressions. In the context of affect-sensitive user interfaces, the beginning and apex of an expression of happiness may signal the interest level of the user, while the expression offset may signal disinterest or the time of user disengagement. Apart from a few works that attempted multimodal learning of the dynamics of facial affect, most of the existing approaches focus on learning from a single modality (i.e., face images). Even when learning from the visual modality, different fusion strategies are needed to address multimodality within the visual source of information due to different camera views, feature types, and so on.


Thus, the same approaches that are used to address modeling challenges within a modality (e.g., multiple views of facial expressions) can easily be adapted to handle the fusion of multimodal data collected by different sensors (e.g., audio and visual). Therefore, we review the learning approaches applicable to the modeling of facial affect from single and multiple modalities. We also point out the challenges typically encountered when performing fusion of multimodal data in temporal domains, as the dynamics of facial affect can exhibit different patterns depending on the modality used. It is therefore critical that these potential discordances be handled efficiently in order to take full advantage of the underlying dynamics within each modality and of their relationships across modalities.

3.1.2 The Importance of Context

Context plays a crucial role in understanding human behavioral signals that can otherwise be easily misinterpreted. For instance, a smile can be a display of politeness, contentedness, joy, irony, empathy, or a greeting, depending on the context. Yet, most existing methods to date focus on the simpler problem of detecting a smile as a prototypical and self-contained signal. To identify the smile as a social signal, one must simultaneously know where the subject is located (outside, at a reception, etc.), what his or her current task is, when the signal is displayed (timing), and who the expresser is (the expresser's identity, age, and expressiveness). Vinciarelli et al. [2009] identify this as the W4 quadruplet (where, what, when, who) but quickly point out that comprehensive human behavior understanding requires the W5+ sextuplet (where, what, when, who, why, how), where the why and how factors identify both the stimulus that caused the social signal (e.g., a funny video) and how the information is passed on (e.g., by means of facial expression intensity). However, most current methods used in the design of intelligent interfaces are unable to provide a satisfactory answer to W4, let alone W5+. At the same time, answering the W5+ is a key challenge of data-driven design of affect-sensitive intelligent user interfaces [Pantic et al. 2008].

3.1.3 The Importance of Adaptability

The ability to adapt existing models across different contextual factors as well as modalities is the ultimate goal of any learning system. To this end, we review existing trends in domain adaptation and, in particular, adaptation models applicable to facial behavior modeling. While there is a close relationship between context modeling and model adaptation, most of the existing works on the latter have focused on one specific contextual dimension: subject adaptation. This is understandable, as the ultimate goal of affect-sensitive user interfaces is to use both external and (the subject's) internal signals so as to be able to fully adapt to the user.


Yet, since this modeling direction is still in its infancy, existing model adaptation techniques have placed a particular focus on modeling the variations within subjects, including their face physiognomy (i.e., facial geometry), facial appearance, as well as subject-specific changes in expression intensity. This subject-specific adaptation is important because, even in the same context, the reactions of subjects to the same stimuli may differ; for example, when modeling facial expressions of pain, long-term exposure to pain may result in facial expressions that are not as intense as in the case of a sudden and short-lived painful experience. Therefore, the ability of the machine learning models, and thus of the user interface, to successfully adapt to its user is one of the ultimate goals of its design. While a few existing models focus on adaptation within a single modality, there is no study that attempts model adaptation across different modalities (cross-modal adaptation). However, this is an important challenge in multimodal fusion for target user interfaces, as it is expected to result in more robust and reliable interaction systems. Finally, note that the modeling topics mentioned above cannot easily be seen in isolation from one another, as they largely intersect in terms of what type of learning they account for. For example, model adaptation techniques can be seen as a way of adjusting different contextual factors from W5+ within the model (subject as the context question "who"), so that the resulting user interfaces can achieve optimal performance. Furthermore, these modeling challenges are universal across different signal modalities (e.g., visual, auditory, or tactile). As we mentioned above, the rest of this chapter focuses on one signal domain, that of facial signals, which most ubiquitously illustrates the new data-driven modeling directions. Specifically, we consider the problems of facial behavior modeling and describe the state-of-the-art in machine learning methods proposed for the challenges above, as they relate to multimodal context-sensitive modeling of user interfaces.

3.2 Correlation Analysis Methods

Finding correlations between two or more sets of data from different modalities is inherent to many tasks and applications pertaining to multimodal and multisensor computing. For instance, an image can be captured in both the visible and the infrared spectrum and may be represented via a variety of visual descriptors such as SIFT, HoG, or IGO features [Lowe 2004, Tzimiropoulos et al. 2012, Simonyan et al. 2014], which can be seen as distinct feature sets corresponding to the same object. Another prominent example of such a scenario lies in the task of face recognition: a face can be recognized by employing a normal image captured in the visible spectrum, as well as infrared captures or even forensic sketches [Li et al. 2009, Wang and Tang 2009].


Similarly, a particular human behavior can be identified by certain vocal, gestural, and facial features extracted from both the audio and visual modalities [Shan et al. 2007, Zhihong et al. 2009]. Since such sets of multimodal data (comprising distinct feature sets) refer to the same object or behavior, it is anticipated that part of the conveyed information is shared among all observation sets (i.e., correlated components), while the remaining information consists of individual components that are particular to a specific observation set. The correlation among different modalities provides useful information for tasks such as feature fusion [Correa et al. 2009, Atrey et al. 2010], multiview learning [Sun 2013], multilabel prediction [Sun et al. 2011], and multimodal behavior analysis [Shan et al. 2007, Zhihong et al. 2009, Nicolaou et al. 2014]. On the other hand, the individual components are deemed important for tasks such as clustering and signal separation [Zhou et al. 2012]. These individual features may interfere with finding the correlated components, just as the correlated components are likely to obscure the individual ones. Consequently, it is very important to accurately extract the correlated and the individual components among the multiple data sets. The problem becomes rather challenging when dealing with data that are contaminated by gross errors and also temporally misaligned, i.e., when temporal discrepancies manifest among the observation sequences. In practice, gross errors [Huber 1981] arise from either device artifacts (e.g., pixel corruptions, sonic artifacts), missing and incomplete data (e.g., partial image texture occlusions), or feature extraction failure (e.g., incorrect object localization, tracking errors). These errors rarely follow a Gaussian distribution [Candes et al. 2011]. Furthermore, asynchronous sensor measurements (e.g., lag between audio and visual sensors), viewpoint changes, network lags, speech rate differences, and the speed of an action, behavior, or event result in temporally misaligned sets of data. Clearly, the accurate temporal alignment of noisy, temporally misaligned sets of data is a cornerstone in many computer vision [Junejo et al. 2011, Zhou and la Torre 2009], behavior analysis [Panagakis et al. 2013, Nicolaou et al. 2014], and speech processing [Sakoe and Chiba 1978] problems, to name a few. Several methods have been proposed for the analysis of two sets of data. Canonical Correlation Analysis (CCA) [Hotelling 1936] is a widely used method for finding linear correlated components between two data sets. Notable extensions of the CCA are the sparse CCA [Sun et al. 2011, Chu et al. 2013], the kernel- [Akaho 2011] and deep-CCA [Andrew et al. 2013], as well as its probabilistic [Bach and Jordan 2005, Nicolaou et al. 2014] and Bayesian variants [Klami et al. 2013].


The Canonical Time Warping (CTW) [Zhou and la Torre 2009] and related methods [Trigeorgis et al. 2016, Wöllmer et al. 2009] extend the CCA to handle time warping in the data. In order to extract correlated components among multiple data sets, generalizations of the CCA can be employed [Kettenring 1971, Li et al. 2009]. However, the aforementioned methods ignore the individual components of the data sets, a drawback which is alleviated by the Joint and Individual Variation Explained (JIVE) [Lock et al. 2013] and Common Orthogonal Basis Extraction (COBE) [Zhou et al. 2012]. Since most of the methods mentioned above rely on least-squares error minimization, they are prone to gross errors and outliers [Huber 1981], causing the estimated components to be arbitrarily far from the true ones. This drawback is alleviated to some extent by the robust methods in Panagakis et al. [2013], Nicolaou et al. [2014], and Panagakis et al. [2016]. Next, a brief review of the CCA [Hotelling 1936], JIVE [Lock et al. 2013], DTW [Sakoe and Chiba 1978], CTW [Zhou and la Torre 2009], as well as the Robust Correlated and Individual Components Analysis (RCICA) [Panagakis et al. 2016] and RCICA with Time Warpings (RCITW) [Panagakis et al. 2016], is provided.

3.2.1 Canonical Correlation Analysis

The CCA extracts correlated features from a pair of multivariate data. In particular, given two data sets $\{X^{(n)} = [x_1^{(n)} \mid x_2^{(n)} \mid \cdots \mid x_J^{(n)}] \in \mathbb{R}^{I_n \times J}\}_{n=1}^2$, the CCA finds two matrices $V^{(1)} \in \mathbb{R}^{I_1 \times K}$ and $V^{(2)} \in \mathbb{R}^{I_2 \times K}$, with $K \le \min(I_1, I_2)$. These matrices define a common, low-dimensional latent subspace such that the linear combination of the variables in $X^{(1)}$, i.e., $V^{(1)T} X^{(1)}$, is highly correlated with a linear combination of the variables in $X^{(2)}$, i.e., $V^{(2)T} X^{(2)}$. The CCA corresponds to the solution of the constrained least-squares minimization problem [Sun et al. 2011, la Torre 2012]:

$$\underset{\{V^{(n)}\}_{n=1}^2}{\arg\min}\;\; \frac{1}{2}\, \| V^{(1)T} X^{(1)} - V^{(2)T} X^{(2)} \|_F^2 \quad \text{s.t.}\;\; V^{(n)T} X^{(n)} X^{(n)T} V^{(n)} = I, \;\; n = 1, 2. \qquad (3.1)$$
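A minimal usage sketch of CCA on synthetic two-view data is given below. Note that scikit-learn expects samples in rows, i.e., the transpose of the features-by-samples convention of Equation (3.1); the synthetic data generation is purely illustrative.

```python
# Extracting K correlated components from two views with scikit-learn's CCA.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
J, I1, I2, K = 200, 30, 20, 5            # samples, feature dims, components
shared = rng.standard_normal((J, K))     # hidden source shared by both views
X1 = shared @ rng.standard_normal((K, I1)) + 0.1 * rng.standard_normal((J, I1))
X2 = shared @ rng.standard_normal((K, I2)) + 0.1 * rng.standard_normal((J, I2))

cca = CCA(n_components=K)
Z1, Z2 = cca.fit_transform(X1, X2)       # projections onto the latent subspace
# corresponding columns of Z1 and Z2 are (approximately) maximally correlated:
corr_first = np.corrcoef(Z1[:, 0], Z2[:, 0])[0, 1]
```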

3.2.2 Joint and Individual Variation Explained

The JIVE recovers the joint and individual components among $N \ge 2$ data sets $\{X^{(n)} \in \mathbb{R}^{I_n \times J}, n = 1, 2, \ldots, N\}$. In particular, each matrix is decomposed into three terms: a low-rank matrix $J^{(n)} \in \mathbb{R}^{I_n \times J}$ capturing joint structure between data sets, a low-rank matrix $A^{(n)} \in \mathbb{R}^{I_n \times J}$ capturing structure individual to each data set, and a matrix $R^{(n)} \in \mathbb{R}^{I_n \times J}$ accounting for i.i.d. residual noise. That is,

$$X^{(n)} = J^{(n)} + A^{(n)} + R^{(n)}, \quad n = 1, 2, \ldots, N. \qquad (3.2)$$


Let $X \doteq [X^{(1)T}, X^{(2)T}, \ldots, X^{(N)T}]^T$, $J \doteq [J^{(1)T}, J^{(2)T}, \ldots, J^{(N)T}]^T$, and $R \doteq [R^{(1)T}, R^{(2)T}, \ldots, R^{(N)T}]^T$ be the $\left(\sum_{n=1}^{N} I_n\right) \times J$ matrices constructed by concatenating the corresponding matrices. The JIVE solves the rank-constrained least-squares problem [Lock et al. 2013]:

$$\begin{aligned}
\underset{\{J,\,\{A^{(n)}\}_{n=1}^N,\,R\}}{\arg\min}\;\; & \frac{1}{2}\,\|R\|_F^2 \\
\text{s.t.}\;\; & R = X - J - [A^{(1)T}, A^{(2)T}, \ldots, A^{(N)T}]^T, \\
& \operatorname{rank}(J) = K, \quad \operatorname{rank}(A^{(n)}) = K^{(n)}, \\
& J A^{(n)T} = 0, \quad n = 1, 2, \ldots, N.
\end{aligned} \qquad (3.3)$$

Problem (3.3) imposes rank constraints on the joint and individual components and requires the rows of $J$ and $\{A^{(n)}\}_{n=1}^N$ to be orthogonal. The intuition behind the orthogonality constraint is that sample patterns responsible for the joint structure between data types are unrelated to sample patterns responsible for the individual structure [Lock et al. 2013]. A closely related method to the JIVE is the COBE, which extracts the common and the individual components among $N$ data sets of the same dimensions by solving a set of least-squares minimization problems [Zhou et al. 2012].
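To make the decomposition in Equations (3.2)–(3.3) concrete, the following is a rough, single-pass sketch in the spirit of JIVE: the joint structure is estimated by a rank-$K$ SVD of the concatenated data, and each block's individual structure by a truncated SVD of the residual projected onto the orthogonal complement of the joint row space. The actual JIVE algorithm iterates these steps to convergence, so this is only an illustrative approximation, not the published method.

```python
# Single-pass JIVE-style estimate of joint and individual components.
import numpy as np

def jive_single_pass(blocks, K, K_individual):
    X = np.vstack(blocks)                              # (sum_n I_n) x J
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    J_hat = (U[:, :K] * s[:K]) @ Vt[:K]                # rank-K joint estimate
    P_orth = np.eye(X.shape[1]) - Vt[:K].T @ Vt[:K]    # remove joint row space

    joints, individuals, row = [], [], 0
    for Xn, Kn in zip(blocks, K_individual):
        In = Xn.shape[0]
        Jn = J_hat[row:row + In]
        Rn = (Xn - Jn) @ P_orth                        # residual, rows orthogonal to J
        Un, sn, Vnt = np.linalg.svd(Rn, full_matrices=False)
        individuals.append((Un[:, :Kn] * sn[:Kn]) @ Vnt[:Kn])
        joints.append(Jn)
        row += In
    return joints, individuals
```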

3.2.3 Dynamic and Canonical Time Warping

Given two temporally misaligned data sets with the same dimensionality $I$, namely $\{X^{(n)} \in \mathbb{R}^{I \times J_n}, n = 1, 2\}$, the DTW aligns them along the time axis by solving [Sakoe and Chiba 1978]:

$$\underset{\{\Delta^{(n)}\}_{n=1}^2}{\arg\min}\;\; \frac{1}{2}\, \| X^{(1)}\Delta^{(1)} - X^{(2)}\Delta^{(2)} \|_F^2, \quad \text{s.t.}\;\; \Delta^{(n)} \in \{0,1\}^{J_n \times J}, \;\; n = 1, 2, \qquad (3.4)$$

where $\Delta^{(n)}$, $n = 1, 2$, are binary selection matrices encoding the alignment path. Although the number of possible alignments is exponential in $J_1 \cdot J_2$, the DTW recovers the optimal alignment path in $O(J_1 J_2)$ by employing dynamic programming. Clearly, the DTW can handle only data of the same dimensions.
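A minimal dynamic-programming sketch of the DTW cost behind Equation (3.4) is given below, assuming Euclidean distances between frames; backtracking the argmin choices (not shown) would recover the binary selection matrices $\Delta^{(n)}$.

```python
# Classic O(J1*J2) DTW cost matrix between two sequences with frames in columns.
import numpy as np

def dtw_cost(X1, X2):
    """X1: (I, J1), X2: (I, J2); returns the accumulated DTW cost matrix."""
    J1, J2 = X1.shape[1], X2.shape[1]
    dist = np.linalg.norm(X1[:, :, None] - X2[:, None, :], axis=0)   # (J1, J2)
    D = np.full((J1 + 1, J2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, J1 + 1):
        for j in range(1, J2 + 1):
            # allowed steps: diagonal match, or advancing only one sequence
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j], D[i, j - 1])
    return D[1:, 1:]          # bottom-right entry is the optimal alignment cost
```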


The CTW [Zhou and la Torre 2009] incorporates the CCA into the DTW, allowing the alignment of data sequences of different dimensions by projecting them into a common latent subspace found by the CCA [Hotelling 1936]. Furthermore, the CCA-based projections perform feature selection by reducing the dimensionality of the data to that of the common latent subspace, handling irrelevant or possibly noisy attributes. More formally, let $\{X^{(n)} \in \mathbb{R}^{I_n \times J_n}\}_{n=1}^2$ be a set of temporally misaligned data of different dimensionality (i.e., $I_1 \ne I_2$); the CCA is incorporated into the DTW by solving [Zhou and la Torre 2009]:

$$\begin{aligned}
\underset{\{V^{(n)},\,\Delta^{(n)}\}_{n=1}^2}{\arg\min}\;\; & \frac{1}{2}\, \| V^{(1)T} X^{(1)}\Delta^{(1)} - V^{(2)T} X^{(2)}\Delta^{(2)} \|_F^2, \\
\text{s.t.}\;\; & V^{(n)T} X^{(n)} X^{(n)T} V^{(n)} = I, \\
& V^{(1)T} X^{(1)}\Delta^{(1)}\Delta^{(2)T} X^{(2)T} V^{(2)} = D, \\
& X^{(n)}\Delta^{(n)}\mathbf{1} = 0, \quad \Delta^{(n)} \in \{0,1\}^{J_n \times J}, \quad n = 1, 2.
\end{aligned} \qquad (3.5)$$

$V^{(1)} \in \mathbb{R}^{I_1 \times K}$ and $V^{(2)} \in \mathbb{R}^{I_2 \times K}$ project $X^{(1)}$ and $X^{(2)}$, respectively, onto a common latent subspace of $K \le \min(I_1, I_2)$ dimensions, where the correlation between the data sequences is maximized. $D$ is a diagonal matrix of compatible dimensions. The set of constraints in (3.5) is imposed in order to make the CTW translation, rotation, and scaling invariant.

Remark 3.1

By adopting the least-squares error, the aforementioned methods assume Gaussian distributions with small variance [Huber 1981]. Such an assumption rarely holds in real-world multimodal data, where gross non-Gaussian corruptions are in abundance. Consequently, the components obtained by employing the CCA, JIVE, DTW, and CTW in the analysis of grossly corrupted data may be arbitrarily far from the true ones, degrading their performance. A general framework that alleviates the aforementioned limitation and recovers both the correlated and the individual components is detailed next.

3.2.4 Robust Correlated and Individual Components Analysis

Consider two data sets from different modalities or feature sets, possibly contaminated by gross but sparse errors. Without loss of generality, these data sets are represented by two zero-mean matrices, namely $\{X^{(n)} \in \mathbb{R}^{I_n \times J}\}_{n=1}^2$, of different dimensions, i.e., $I_1 \ne I_2$. The RCICA recovers the correlated and individual components of the data sets as well as the sparse corruptions by seeking a decomposition of each matrix into three terms:

$$X^{(n)} = C^{(n)} + A^{(n)} + E^{(n)}, \quad n = 1, 2. \qquad (3.6)$$

$C^{(n)} \in \mathbb{R}^{I_n \times J}$ and $A^{(n)} \in \mathbb{R}^{I_n \times J}$ are low-rank matrices with mutually independent column spaces, capturing the correlated and individual components, respectively, and $E^{(n)} \in \mathbb{R}^{I_n \times J}$ is a sparse matrix accounting for sparse non-Gaussian errors. To guarantee the fundamental identifiability of the recovered components, the column spaces of $\{A^{(n)}\}_{n=1}^2$ must be orthogonal to those of $\{C^{(n)}\}_{n=1}^2$. To facilitate this, the components are decomposed as:

$$C^{(n)} = U^{(n)} V^{(n)T} X^{(n)}, \qquad (3.7)$$

$$A^{(n)} = Q^{(n)} H^{(n)}, \qquad (3.8)$$

where $\{U^{(n)} \in \mathbb{R}^{I_n \times K}\}_{n=1}^2$ and $\{Q^{(n)} \in \mathbb{R}^{I_n \times K^{(n)}}\}_{n=1}^2$ are column-orthonormal matrices spanning the columns of $\{C^{(n)}\}_{n=1}^2$ and $\{A^{(n)}\}_{n=1}^2$, respectively. $K$ denotes the upper bound on the unknown rank of $\{C^{(n)}\}_{n=1}^2$, and $\{K^{(n)}\}_{n=1}^2$ are the upper bounds on the unknown ranks of $\{A^{(n)}\}_{n=1}^2$. The mutual orthogonality of the column spaces is established by requiring $\{Q^{(n)T} U^{(n)} = 0\}_{n=1}^2$. In analogy to the CCA, $\{V^{(n)T} X^{(n)} \in \mathbb{R}^{K \times J}\}_{n=1}^2$ are required to be maximally correlated.

A natural estimator accounting for the low-rank components and the sparsity of $\{E^{(n)}\}_{n=1}^2$ is to minimize the objective function of the CCA, i.e., $\frac{1}{2}\|V^{(1)T} X^{(1)} - V^{(2)T} X^{(2)}\|_F^2$, as well as the rank of $\{C^{(n)}, A^{(n)}\}_{n=1}^2$ and the number of nonzero entries of $\{E^{(n)}\}_{n=1}^2$ measured by the $\ell_0$-(quasi-)norm; see, e.g., Candes et al. [2011], Liu and Yan [2012], Huang et al. [2012], and Panagakis et al. [2013]. Unfortunately, both rank and $\ell_0$-norm minimization are NP-hard [Vandenberghe and Boyd 1996, Natarajan 1995]. The nuclear- and the $\ell_1$-norms are typically adopted as convex surrogates of the rank and the $\ell_0$-norm, respectively [Fazel 2002, Donoho 2006]. Accordingly, the objective function for the RCICA is defined as:

$$F(\mathcal{V}) \doteq \sum_{n=1}^{2}\Big[ \|U^{(n)} V^{(n)T}\|_* + \lambda_*^{(n)} \|Q^{(n)} H^{(n)}\|_* + \lambda_1^{(n)} \|E^{(n)}\|_1 \Big] + \frac{\lambda_c}{2}\, \|V^{(1)T} X^{(1)} - V^{(2)T} X^{(2)}\|_F^2, \qquad (3.9)$$

where the unknown variables are collected in $\mathcal{V} \doteq \{U^{(n)}, V^{(n)}, Q^{(n)}, H^{(n)}, E^{(n)}\}_{n=1}^2$, and $\lambda_c$, $\{\lambda_*^{(n)}\}_{n=1}^2$, $\{\lambda_1^{(n)}\}_{n=1}^2$ are positive parameters controlling the correlation, rank, and sparsity of the derived spaces. Due to the unitary invariance of the nuclear norm (e.g., $\|U^{(n)} V^{(n)T}\|_* = \|V^{(n)}\|_*$), (3.9) is simplified, and thus the RCICA solves the constrained nonlinear optimization problem:

$$\begin{aligned}
\underset{\mathcal{V}}{\arg\min}\;\; & \sum_{n=1}^{2}\Big[ \|V^{(n)}\|_* + \lambda_*^{(n)} \|H^{(n)}\|_* + \lambda_1^{(n)} \|E^{(n)}\|_1 \Big] + \frac{\lambda_c}{2}\, \|V^{(1)T} X^{(1)} - V^{(2)T} X^{(2)}\|_F^2, \\
\text{s.t.}\;\; \text{(i)}\;\; & X^{(n)} = U^{(n)} V^{(n)T} X^{(n)} + Q^{(n)} H^{(n)} + E^{(n)}, \\
\text{(ii)}\;\; & V^{(n)T} X^{(n)} X^{(n)T} V^{(n)} = I, \\
\text{(iii)}\;\; & U^{(n)T} U^{(n)} = I, \quad Q^{(n)T} Q^{(n)} = I, \\
\text{(iv)}\;\; & Q^{(n)T} U^{(n)} = 0, \quad n = 1, 2.
\end{aligned} \qquad (3.10)$$

Recall that the constraints (i) decompose each matrix into three terms capturing the correlated and the individual components as well as the sparse corruptions. The constraints (ii) are inherited from the CCA (cf. (3.1)) and are imposed in order to normalize the variance of the correlated components, thus making them invariant to translation, rotation, and scaling (i.e., since data may have large-scale differences, this constraint normalizes them in order to facilitate the identification of correlated/shared components). The third set of constraints (iii) makes the RCICA a projective method, a point which will be further clarified in what follows. The constraints (iv) are imposed in order to ensure the identifiability of the model. That is, in order to perfectly disentangle the low-rank correlated and individual components, their column spaces should be mutually orthogonal; otherwise, it would be impossible to guarantee the feasibility of the decomposition. If we assume that there are no individual components (i.e., by setting $\{\lambda_*^{(n)} \to \infty\}_{n=1}^2$), the dimensionality of the data is the same, i.e., $I_1 = I_2$, and we set $\bar{C}^{(n)} = U^{(n)} V^{(n)T}$, then the RCICA reduces to the Robust CCA [Panagakis et al. 2013]:

2



¯ (n) , E(n) }2 n=1 {C n=1

+

(n) ¯ (n) ∗ + λ(n)

C 1 E 1

λc ¯ (n) (1) ¯ (n) (2) 2

C X − C X F , 2

¯ (n)X (n) + E(n) , s.t. X (n) = C

(3.11)

n = 1, 2,

where $\{\bar{C}^{(n)} \in \mathbb{R}^{I_n \times I_n}\}_{n=1}^2$ are low-rank matrices reconstructing the correlated components and $\{\lambda_1^{(n)}\}_{n=1}^2$ are positive parameters controlling the sparsity of the error matrices.


Clearly, the RCICA has several appealing properties, which make the technique advantageous in comparison to related methods. They are listed in what follows.

1. The RCICA is a more general approach, in the sense that the CCA is a special case of the RCICA. Indeed, if we assume that there are no gross errors in the data (i.e., $\{E^{(n)} = 0\}_{n=1}^2$) and let $\{\lambda_*^{(n)} \to \infty\}_{n=1}^2$, i.e., there are no individual components, it is easy to verify that the solution of (3.10) is identical to that of (3.1), with $\{U^{(n)} = V^{(n)}\}$.

2. The RCICA can inherently handle data sets of different dimensionality.

3. The RCICA is projective, in the sense that the correlated and individual features of unseen (test) vectors can be extracted via the projection matrices $\{U^{(n)}\}_{n=1}^2$ and $\{Q^{(n)}\}_{n=1}^2$, respectively. Obviously, this is not the case for the RCCA in (3.11), where only a reconstruction of the correlated components is recovered.

4. The exact numbers of correlated and individual components need not be known in advance; an upper bound on the number of components is sufficient. The minimization of the nuclear norms in (3.10) and (3.11) enables the actual number (i.e., rank) of the components to be determined automatically. Clearly, this is not the case in the CCA and the JIVE, where the number of components must be determined exactly.

We finally note that the RCICA and the RCCA can handle data contaminated by Gaussian noise by letting the error term vanish, that is, by setting $\{\lambda_1^{(n)} \to \infty\}_{n=1}^2$.
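To illustrate the projective property in item 3, the following sketch assumes that the bases $U^{(n)}$ and $Q^{(n)}$ have already been estimated by some solver for problem (3.10); no solver is implemented here, and the function name is purely illustrative.

```python
# Projecting an unseen (test) column vector of modality n onto the learned
# correlated and individual subspaces of the RCICA.
import numpy as np

def rcica_project(x_new, U_n, Q_n):
    correlated = U_n.T @ x_new    # coordinates in the correlated subspace
    individual = Q_n.T @ x_new    # coordinates in the individual subspace
    return correlated, individual
```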

3.2.5 RCICA with Time Warpings (RCITW)

Accurate temporal alignment of noisy data sequences is essential in several problems, such as the alignment and temporal segmentation of human motion [Zhou and De la Torre Frade 2012], the alignment of facial and motion capture data [Zhou and la Torre 2009, Panagakis et al. 2013], and the alignment of multiple continuous annotations [Nicolaou et al. 2014], among others. The problem is defined as finding the temporal coordinate transformation that brings two given data sequences into alignment in time. To handle temporally misaligned, grossly corrupted data, the DTW is incorporated into the RCICA. Formally, given two sets $\{X^{(n)} \in \mathbb{R}^{I_n \times J_n}\}_{n=1}^2$ of different dimensionality and length, i.e., $I_1 \ne I_2$, $J_1 \ne J_2$, the RCITW enables their temporal alignment onto the subspace spanned by the robustly estimated correlated components. To this end, the RCITW solves:


$$\begin{aligned}
\underset{\{\mathcal{V},\,\{\Delta^{(n)}\}_{n=1}^2\}}{\arg\min}\;\; & \sum_{n=1}^{2}\Big[ \|V^{(n)}\|_* + \lambda_*^{(n)} \|H^{(n)}\|_* + \lambda_1^{(n)} \|E^{(n)}\|_1 \Big] + \frac{\lambda_c}{2}\, \|V^{(1)T} X^{(1)}\Delta^{(1)} - V^{(2)T} X^{(2)}\Delta^{(2)}\|_F^2, \\
\text{s.t.}\;\; & X^{(n)} = U^{(n)} V^{(n)T} X^{(n)} + Q^{(n)} H^{(n)} + E^{(n)}, \\
& X^{(n)T} V^{(n)} = P^{(n)}, \quad P^{(n)T} P^{(n)} = I, \\
& U^{(n)T} U^{(n)} = I, \quad Q^{(n)T} Q^{(n)} = I, \quad Q^{(n)T} U^{(n)} = 0, \\
& X^{(n)}\Delta^{(n)}\mathbf{1} = 0, \quad \Delta^{(n)} \in \{0,1\}^{J_n \times J}, \quad n = 1, 2,
\end{aligned} \qquad (3.12)$$

where Δ(n) ∈ {0, 1}Jn×J , n = 1, 2 are binary selection matrices encoding the warping path as in the CTW. The constraint X (n)Δ(n)1 = 0, n = 1, 2 ensures that the temporally aligned data are zero-mean. By solving (3.12), the temporally aligned correlated T components of reduced dimensions are given by {V (n) X (n)Δ(n) ∈ RK×J }2n=1. Moreover, one can obtain a reconstruction of the temporally aligned data in the original T space by {U(n)V (n) X (n)Δ(n) ∈ RIn×J }2n=1.
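The binary matrices $\Delta^{(n)}$ in (3.12) encode a DTW warping path. As a point of reference (this is plain DTW on fixed feature sequences, not the joint RCITW optimization; function and variable names are ours), the following sketch computes such a path and converts it into selection matrices so that the warped sequences obtain a common length $J$:

```python
import numpy as np

def dtw_path(A, B):
    """Plain dynamic time warping between two feature sequences.

    A : array of shape (d, J1), B : array of shape (d, J2)
    Returns the warping path as a list of (i, j) index pairs.
    """
    J1, J2 = A.shape[1], B.shape[1]
    cost = np.full((J1 + 1, J2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, J1 + 1):
        for j in range(1, J2 + 1):
            d = np.linalg.norm(A[:, i - 1] - B[:, j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from (J1, J2) towards (1, 1)
    path, i, j = [], J1, J2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def selection_matrices(path, J1, J2):
    """Turn a warping path of length J into binary selection matrices
    Delta1 (J1 x J) and Delta2 (J2 x J), in the spirit of Eq. (3.12)."""
    J = len(path)
    D1, D2 = np.zeros((J1, J)), np.zeros((J2, J))
    for k, (i, j) in enumerate(path):
        D1[i, k] = 1.0
        D2[j, k] = 1.0
    # X1 @ D1 and X2 @ D2 then have the same number J of columns and are frame-aligned
    return D1, D2
```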

3.3 Temporal Modeling of Facial Expressions

There are two main streams in the current research on automatic analysis of facial expressions. The first considers holistic facial expressions, such as facial expressions of the six basic emotions (fear, sadness, happiness, anger, disgust, surprise) proposed by Ekman et al. [2002], or facial expressions of pain, for instance. The second considers local muscle activations producing facial expressions. These are described with a set of facial muscle actions named Action Units (AUs), as defined by the Facial Action Coding System (FACS) [Ekman et al. 2002]. In what follows, we review the existing approaches for temporal learning of facial expression dynamics.

Different methods have been proposed for classification of facial expressions from image sequences. Despite the inherent dynamic information present in target image sequences of facial expressions, the majority of existing works still rely on static methods for classification of facial expressions, such as those for recognition of the six basic emotion categories [Ekman et al. 2002]. These methods employ classifiers such as rule-based classifiers [Pantic and Rothkrantz 2004, Black and Yacoob 1997], Neural Networks (NN) [Padgett and Cottrell 1996, Tian 2004], Support Vector Machines (SVM) [Bartlett et al. 2005, Shan et al. 2009], and Bayesian Networks (BN) [Cohen et al. 2003]. Bi-directional Long Short-Term Memory Neural Networks have also been applied to emotion recognition [Wöllmer et al. 2013, Trigeorgis et al. 2016].


For the static classification of AUs (i.e., employing still images), where the goal is to assign to each AU a binary label indicating its presence, classifiers based on NN [Bazzo and Lamar 2004, Fasel and Luettin 2000], Ensemble Learning techniques (such as AdaBoost [Yang et al. 2009] and GentleBoost [Hamm et al. 2011]), and SVM [Chew et al. 2012, Bartlett et al. 2006, Kapoor et al. n.d.] are commonly employed.

The common weakness of the frame-based classification methods is that they ignore the dynamics of the target facial expressions or AUs. Although some of the frame-based methods use features extracted from several time frames in order to encode the dynamics of facial expressions, models for dynamic classification provide a more principled way of doing so. With a few exceptions, most of the dynamic approaches to classification of facial expressions are based on variants of Dynamic Bayesian Networks (DBN) (e.g., Hidden Markov Models (HMM) and Conditional Random Fields (CRF)). Discriminative models based on CRFs have been proposed in der Maaten and Hendriks [2012], Jain et al. [2011], and Chang et al. [2009]. For instance, der Maaten and Hendriks [2012] trained a linear-chain CRF per AU; the models' states are binary variables indicating the AU activations. Jain et al. [2011] proposed a generalization of this model, a Hidden Conditional Random Field (HCRF) [Wang et al. 2006], where an additional layer of hidden variables is used to model the temporal dynamics of facial expressions. The training of the model was performed using image sequences, but classification of the expressions was done by selecting the most likely class (i.e., emotion category) at each time instance. Another modification of the HCRF, named partially observed HCRF, was proposed in Chang et al. [2009]. In this method, classification of the emotion categories (sequence-based) and of the AU combinations (frame-based) is accomplished simultaneously. This method outperformed the standard HCRF, which does not use prior information about the AU combinations. Recently, Walecki et al. [2017] proposed a Variable-state Latent CRF (VSL-CRF) model for expression and AU segmentation that also imposes ordinal relationships between the temporal states in the model, implicitly accounting for the development of the temporal phases of an expression (onset, apex, offset). Temporal consistency of AUs was also modeled in Simon et al. [2010] using the structured-output SVM framework for detecting the starting and ending frames of each AU.

More complex graph structures within the DBN framework have been proposed in Zhang and Ji [2005] and Tong et al. [2007] for dynamic classification of facial expressions. In Zhang and Ji [2005], the DBN was constructed from interconnected time slices of static Bayesian networks, where each static network was used to link the geometric features (i.e., locations of characteristic facial points) to the target emotion categories via a set of related AUs. Tong et al. [2007] modeled relationships between different AUs using another variant of a DBN.

3.3.1 Temporal Segmentation of Facial Expressions

Most of the works on facial expression analysis from image sequences focus only on classification of target expressions and/or AUs. Yet, these do not explicitly encode the dynamics (i.e., they do not perform classification of the temporal segments: neutral, onset, apex, offset of an expression). Both the configuration, in terms of the AUs constituting the observed expressions, and their dynamics are important for the categorization of, e.g., complex psychological states, such as various types of pain and mood [Pantic and Bartlett 2007]. They also represent a critical factor in the interpretation of social behaviors like social inhibition, embarrassment, amusement, and shame, and are a key parameter in the differentiation between posed and spontaneous facial displays [Ekman et al. 2002].

A class of models that performs segmentation of the expression sequences into different temporal segments is presented in Pantic and Patras [2005, 2006]. These are static rule-based classifiers based on geometric features (i.e., facial points) that encode temporal segments of AUs in near-frontal and profile-view faces, respectively. The works in Koelstra et al. [2010] and Valstar and Pantic [2012] proposed modifications of standard HMMs to encode the temporal evolution of the AU segments. Specifically, Koelstra et al. [2010] proposed a combination of discriminative, frame-based GentleBoost ensemble learners and HMMs for classification and temporal segmentation of AUs. Similarly, Valstar and Pantic [2012] combined SVMs and HMMs in a hybrid SVM-HMM model based on geometric features for the same task. A variant of the linear-chain CRF, named the Conditional Ordinal Random Field (CORF), was proposed in Kim and Pavlovic [2010] for temporal segmentation of six emotion categories. In this model, the node features of the linear-chain CRF model are set using the modeling strategy of standard ordinal regression models, e.g., Chu and Ghahramani [2005], in order to enforce the ordering of the temporal segments (neutral, onset, apex, offset).

4.4 Multimodal Embedding Models

In this setting, a sentence w1 w2 w3 is represented as the sequence <s> w1 w2 w3 </s>, where <s> and </s> are the sentence start and end symbols. The traditional RNN neurons can also be replaced by more advanced units, like the memory cells proposed in Hochreiter and Schmidhuber [1997a]; in this case, the network is called a Long Short-Term Memory (LSTM) RNN. Deeper models can also be constructed by stacking as many recurrent hidden layers as required. The word embeddings are preserved in the weight matrix between the input and hidden layer. To get a sentence-level embedding, the hidden activation vector of the topmost hidden layer at the last word position can be used, since it depends on the whole sentence and represents the most abstract features of the input word sequence.

4.4.3 Multimodal Joint Representation

In the previous sections, we described the transformations that map every image and sentence into embedding vectors in a common vector space, so that both modalities can be represented in the same vector space with the same embedding dimension. The main issue now is how to correlate them. Since the supervision is at the level of entire images and sentences, we need to formulate an image-sentence score as a function of the representing vectors [Karpathy and Fei-Fei 2015]. A sentence-image pair should have a high matching score if the words of the sentence have confident support in the image. The model of Karpathy et al. [2014] uses the dot product $v_i^{T}s_t$ between the $i$-th image region and the $t$-th word as a measure of similarity and uses it to define the score between image $k$ and sentence $l$ as:

$$
S_{kl}=\sum_{t\in g_l}\sum_{i\in g_k}\max\big(0,\,v_i^{T}s_t\big),
\tag{4.7}
$$

where $g_k$ is the set of image regions in image $k$ and $g_l$ is the set of sentence fragments in sentence $l$. The indices $k,l$ range over the images and sentences in the training set. This score carries the interpretation that a sentence fragment aligns to a subset of image regions whenever the dot product is positive. The following reformulation in Karpathy and Fei-Fei [2015] simplifies the model:

$$
S_{kl}=\sum_{t\in g_l}\max_{i\in g_k} v_i^{T}s_t,
\tag{4.8}
$$

where every word $s_t$ aligns to the single best image region. Assuming that $k=l$ denotes the correspondence between image and sentence, the final max-margin structured loss function can be formulated as:

$$
C(\theta)=\sum_{k}\Big[\sum_{l}\max\big(0,\,S_{kl}-S_{kk}+1\big)+\sum_{l}\max\big(0,\,S_{lk}-S_{kk}+1\big)\Big].
\tag{4.9}
$$
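As a toy illustration (our own NumPy sketch with illustrative names, not the reference implementation), the simplified score (4.8) and the structured margin loss (4.9) can be computed for a small batch of already embedded regions and words as follows; note that the $l=k$ terms in (4.9) only contribute a constant margin.

```python
import numpy as np

def pair_score(regions, words):
    """Eq. (4.8): every word aligns to its single best image region.

    regions : (num_regions, d) embedded image regions v_i
    words   : (num_words, d) embedded sentence fragments s_t
    """
    dots = regions @ words.T            # (num_regions, num_words) of v_i^T s_t
    return dots.max(axis=0).sum()       # sum over words of the best region score

def margin_loss(images, sentences):
    """Eq. (4.9) for a batch where image k corresponds to sentence k."""
    K = len(images)
    S = np.array([[pair_score(images[k], sentences[l]) for l in range(K)]
                  for k in range(K)])
    loss = 0.0
    for k in range(K):
        for l in range(K):
            loss += max(0.0, S[k, l] - S[k, k] + 1.0)   # rank sentences per image
            loss += max(0.0, S[l, k] - S[k, k] + 1.0)   # rank images per sentence
    return loss
```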

This objective encourages aligned image-sentence pairs to have a higher score than misaligned pairs, by a margin [Karpathy and Fei-Fei 2015]. The model is shown in Figure 4.12. In this model, word embeddings are generated via a bidirectional RNN which scans the sentence in both directions rather than only from left to right. Since the cost function is differentiable, the model can be learned end-to-end using gradient descent, optimizing all parameters of the model simultaneously to find the relevant shared representations.

Figure 4.12 Learning a scoring function between image regions and text descriptions (from Karpathy and Fei-Fei [2015]).

Another multimodal embedding model is proposed in Reed et al. [2016] for solving the visual recognition problem, where images are classified using both visual features and natural language descriptions. The proposed model learns a compatibility function using the inner product of features generated by deep neural encoders. An illustration of the model, using a CNN for processing images and a word-level RNN for processing text, is given in Figure 4.13. The objective is to maximize the compatibility between a description and its matching image, and to minimize the compatibility with images from other classes. Thus, given data $\mathcal{S}=\{(v_n,t_n,y_n),\,n=1,\ldots,N\}$ containing visual information $v\in\mathcal{V}$, text descriptions $t\in\mathcal{T}$, and class labels $y\in\mathcal{Y}$, the model seeks to learn functions $f_v:\mathcal{V}\rightarrow\mathcal{Y}$ and $f_t:\mathcal{T}\rightarrow\mathcal{Y}$ that minimize the empirical risk:

$$
C(\theta)=\frac{1}{N}\sum_{n=1}^{N}\Delta\big(y_n,f_v(v_n)\big)+\Delta\big(y_n,f_t(t_n)\big),
\tag{4.10}
$$

where $\Delta:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$ is the 0-1 loss and $N$ is the number of image and text pairs in the training set. This objective is called deep structured joint embedding and is symmetric with respect to images and text. As described by Reed et al. [2016], it is possible to use just one of the two terms of this objective. For example, the first term can be used to train only the image classifier, i.e., only the image encoder $f_v$ is trained; in this case the objective is called deep asymmetric structured joint embedding. It is also possible to build an asymmetric model in the opposite direction, i.e., to train only $f_t$ in order to perform image retrieval [Reed et al. 2016]. A compatibility function $F:\mathcal{V}\times\mathcal{T}\rightarrow\mathbb{R}$ is defined that uses features from encoder functions $\phi(v)$ for images and $\varphi(t)$ for text:

$$
F(v,t)=\phi(v)^{T}\varphi(t).
\tag{4.11}
$$

The image and text classifiers are formulated as follows:

$$
f_v(v)=\underset{y\in\mathcal{Y}}{\arg\max}\;\mathbb{E}_{t\sim\mathcal{T}(y)}\big[F(v,t)\big],
\tag{4.12}
$$

$$
f_t(t)=\underset{y\in\mathcal{Y}}{\arg\max}\;\mathbb{E}_{v\sim\mathcal{V}(y)}\big[F(v,t)\big],
\tag{4.13}
$$

where $\mathcal{T}(y)$ is the subset of $\mathcal{T}$ from class $y$, $\mathcal{V}(y)$ is the subset of $\mathcal{V}$ from class $y$, and the expectation is over text descriptions (respectively images) sampled uniformly from these subsets. Since the compatibility function is shared by $f_t$ and $f_v$ in the symmetric objective, it must learn to yield accurate predictions for both classifiers. From the perspective of the text encoder, this means that text features must produce a higher compatibility score with a matching image than both the score of that image with any mismatched text and the score of that text with any mismatched image. Since the 0-1 loss is discontinuous, a surrogate objective function that is continuous and convex is optimized instead:

$$
C(\theta)=\frac{1}{N}\sum_{n=1}^{N}\ell_{v}(v_n,t_n,y_n)+\ell_{t}(v_n,t_n,y_n),
\tag{4.14}
$$

where the misclassification losses are written as:

Figure 4.13 Learning a scoring function between full images and text descriptions (from Reed et al. [2016]).

$$
\ell_{v}(v_n,t_n,y_n)=\max_{y\in\mathcal{Y}}\Big(0,\,\Delta(y_n,y)+\mathbb{E}_{t\sim\mathcal{T}(y)}\big[F(v_n,t)-F(v_n,t_n)\big]\Big),
\tag{4.15}
$$

$$
\ell_{t}(v_n,t_n,y_n)=\max_{y\in\mathcal{Y}}\Big(0,\,\Delta(y_n,y)+\mathbb{E}_{v\sim\mathcal{V}(y)}\big[F(v,t_n)-F(v_n,t_n)\big]\Big).
\tag{4.16}
$$

Since now all encoders are differentiable, the network parameters can be trained end-to-end using back-propagation.
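A small NumPy sketch of the compatibility function (4.11) and the surrogate losses (4.15)-(4.16), with the expectations over $\mathcal{T}(y)$ and $\mathcal{V}(y)$ replaced by empirical means and with illustrative names of our own (in a real system, $\phi$ and $\varphi$ would be the CNN and RNN encoders and the same computation would be expressed in an autodiff framework):

```python
import numpy as np

def compatibility(phi_v, varphi_t):
    """Eq. (4.11): F(v, t) = phi(v)^T varphi(t) for already-encoded features."""
    return float(phi_v @ varphi_t)

def image_loss(phi_v, varphi_t, y_n, text_by_class):
    """Eq. (4.15), with the expectation over T(y) replaced by an empirical mean.
    text_by_class maps each class label y to a list of encoded descriptions."""
    f_match = compatibility(phi_v, varphi_t)
    worst = 0.0                                         # the 0 in the outer max
    for y, texts in text_by_class.items():
        delta = 0.0 if y == y_n else 1.0                # 0-1 loss Delta(y_n, y)
        mismatch = np.mean([compatibility(phi_v, t) for t in texts])
        worst = max(worst, delta + mismatch - f_match)
    return worst

def text_loss(varphi_t, phi_v, y_n, image_by_class):
    """Eq. (4.16), symmetric to image_loss with the roles of v and t swapped."""
    f_match = compatibility(phi_v, varphi_t)
    worst = 0.0
    for y, images in image_by_class.items():
        delta = 0.0 if y == y_n else 1.0
        mismatch = np.mean([compatibility(v, varphi_t) for v in images])
        worst = max(worst, delta + mismatch - f_match)
    return worst

# The objective (4.14) averages image_loss + text_loss over all (v_n, t_n, y_n)
# triples; with differentiable encoders the same computation is minimized by
# back-propagation.
```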

4.5 Perspectives

As described earlier in this chapter, deep learning has been successfully applied to learning representations for solving supervised tasks in complex domains such as speech, image, or language processing. Yet, multimodal interactive systems deal with even more complex tasks, requiring the integration of interactive contexts over several time steps and consistent behavior over multiple turns of interaction. Errors might accumulate, and major deviations in the course of an interaction can be observed if the time dependency between successive decisions is not explicitly accounted for in the learning process. For this reason, Reinforcement Learning (RL) [Sutton and Barto 1998] was introduced in interactive systems two decades ago [Levin et al. 1998]. Until very recently, the combination of Reinforcement Learning and Deep Learning was considered a hard problem because of the incompatibility of theoretical assumptions (essentially, the i.i.d. hypothesis is of course not met when dealing with sequential decision making). Nevertheless, Deep Reinforcement Learning has succeeded in solving major AI challenges such as reaching superhuman performance at playing Atari games from raw pixels [Mnih et al. 2015] or defeating the Go world champion from basic low-level descriptions of the board [Silver et al. 2016]. It is therefore clear that Deep RL will have to play a key role in training end-to-end multimodal interactive systems: not only to embed sequential decision-making algorithms into the learning process, but also to drive the representation learning process so as to extract meaningful features from low-level signals in terms of their ability to ensure goal achievement. A major obstacle remains the difficulty of generating enough in-domain data. Unlike for games, where simulation and self-play can artificially produce as much data as required for learning, human-machine interaction involves a costly data collection process, which is still a bottleneck for applying Deep RL.

Focus Questions

4.1. What is the difference between early, intermediate, and late fusion models?

4.2. Name three different modality combinations and the different possibilities for fusion of these combinations.

4.3. Name one advantage of intermediate fusion models over early fusion models and one over late fusion models.

4.4. How do sequence-to-sequence models encode data from one modality?

4.5. How do sequence-to-sequence models decode the encoder representation into data of another modality?

4.6. What are the advantages of incorporating the attention mechanism into encoder-decoder models?

References M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. 99 R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Br´ ebisson, O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Cˆ ot´ e, M. Cˆ ot´ e, A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow,


M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. L´ eonard, Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merri¨ enboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schl¨ uter, J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688. 99 D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, A. Y. Hannun, B. Jun, T. Han, P. LeGresley, X. Li, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, S. Qian, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, C. Wang, Y. Wang, Z. Wang, B. Xiao, Y. Xie, D. Yogatama, J. Zhan, and Z. Zhu. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proc. Int. Conf. on Machine Learning, pp. 173–182. New York, NY. 99 E. Arisoy, T. Sainath, B. Kingsbury, and B. Ramabhadran. 2012. Deep neural network language models. In Proc. NAACL-HLT Workshop, pp. 20–28. Montreal, Canada. 115 P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6): 345–379. DOI: 10.1007/ s00530-010-0182-0. 100, 104 D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations. Banff, Canada. 99, 110, 111, 112 Y. Bengio and R. Ducharme. 2001. A neural probabilistic language model. In Proc. Advances in Neural Information Processing Systems, vol. 13, pp. 932–938. Denver, CO. 114 L. Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proc. COMPSTAT’2010, pp. 177–186. Springer, Paris, France. DOI: 10.1007/978-3-79082604-3_16. 105 Y. Cheng, X. Zhao, R. Cai, Z. Li, K. Huang, and Y. Rui. 2016. Semi-supervised multimodal deep learning for RGB-D object recognition. In Proc. Int. Joint Conf. on AI, pp. 3345–3351. New York, NY. 105 K. Cho, B. van Merrienboer, C. ¸ G¨ ulcehre, ¸ D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1724–1734. Doha, Qatar. 105


K. Cho, A. Courville, and Y. Bengio. 2015. Describing multimedia content using attentionbased encoder-decoder networks. IEEE Transactions on Multimedia, 17(11): 1875– 1886. DOI: 10.1109/TMM.2015.2477044. 110 J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. 2015. Attention-based models for speech recognition. In Proc. Advances in Neural Information Processing Systems, pp. 577–585. Montreal, Canada. 111 R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. Int. Conf. on Machine Learning, pp. 160–167. Helsinki, Finland. DOI: 10.1145/1390156.1390177. 116 L. Deng and D. Yu. 2014. Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3–4): 197–387. 99 A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard. 2015. Multimodal deep learning for robust rgb-d object recognition. In Proc. Intelligent Robots and Systems (IROS), pp. 681–687. IEEE, Hamburg, Germany. 105 G. Erdogan, I. Yildirim, and R. A. Jacobs. 2014. Transfer of object shape knowledge across visual and haptic modalities. In Proc. 36th Annual Conference of the Cognitive Science Society. Quebec City, Canada. 112 F. Eyben, M. W¨ ollmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie. 2010. Online emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues. Journal on Multimodal User Interfaces, 3(1–2): 7–19. DOI: 10.1007/ s12193-009-0032-6. 103 R. Girshick, J. Donahue, T. Darrell, and J. Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 580–587. Columbus, OH. DOI: 10.1109/CVPR.2014 .81. 113 I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press. http://www .deeplearningbook.org. 100 G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6): 82–97. DOI: 10.1109/MSP.2012.2205597. 99 S. Hochreiter and J. Schmidhuber. 1997a. Long short-term memory. Neural Computation, 9(8): 1735–1780. DOI: 10.1162/neco.1997.9.8.1735. 118 S. Hochreiter and J. Schmidhuber. 1997b. Long short-term memory. Neural Computation, 9(8): 1735–1780. 105 E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proc. 50th Annual Meeting Assoc. for Computational Linguistics, pp. 873–882. Jeju Island, Korea. 116 S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski, et al. 2016. Emonets:


Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2): 99–111. DOI: 10.1007/s12193-015-0195-2. 103 A. Karpathy and L. Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3128–3137. Boston, MA. 113, 119, 120 A. Karpathy, A. Joulin, and F. Li. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Proc. Advances in Neural Information Processing Systems, pp. 1889–1897. Montreal, Canada. 119 G. Keren, J. Deng, J. Pohjalainen, and B. Schuller. 2016. Convolutional neural networks with data augmentation for classifying speakers native language. In Proc. INTERSPEECH, Annual Conference of the International Speech Communication Association. San Francisco, CA. DOI: 10.21437/Interspeech.2016-261. 103 G. Keren, S. Sabato, and B. W. Schuller. 2017a. Tunable sensitivity to large errors in neural network training. In Proc. Conference on Artificial Intelligence (AAAI), pp. 2087–2093. San Francisco, CA. 105 G. Keren, S. Sabato, and B. W. Schuller. 2017b. Fast single-class classification and the principle of logit separation. arXiv preprint arXiv:1705.10246. 105 G. Keren and B. W. Schuller. 2016. Convolutional RNN: an enhanced model for extracting features from sequential data. In Proc. International Joint Conference on Neural Networks, IJCNN, pp. 3412–3419. Vancouver, Canada. DOI: 10.1109/IJCNN.2016 .7727636. 105 D. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In Proc. International Conference on Learning Representations. Banff, Canada. 105 A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems, pp. 1097–1105. Lake Tahoe, NV. 99 Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4): 541–551. DOI: 10.1162/neco.1989.1.4.541. 108 E. Levin, R. Pieraccini, and W. Eckert. 1998. Using markov decision process for learning dialogue strategies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 201–204. IEEE, Seattle, WA. DOI: 10.1109/ ICASSP.1998.674402. 122 E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. 2015. Generating images from captions with attention. In Proc. International Conference on Learning Representations. Banff, Canada. 108, 109, 111 H. Mei, M. Bansal, and M. R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proc. Conference on Artificial Intelligence (AAAI), pp. 2772–2778. Phoenix, AZ. 107, 111


T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 114, 116, 117 T. Mikolov, W. Yih, and G. Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proc. NAACL-HLT, pp. 746–751. Atlanta, Georgia. 114 ˇ ernock´y, and S. Khudanpur. 2010. Recurrent T. Mikolov, M. Karafi´ at, L. Burget, J. H. C neural network based language model. In Proceedings of the INTERSPEECH, Annual Conference of the International Speech Communication Association, pp. 1045–1048. Makuhari, Chiba, Japan. 117 T. Mikolov, J. Kopeck´y, L. Burget, O. Glembek, and J. Cernock´y. 2009. Neural network based language models for highly inflective languages. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4725–4728. Taipei, Taiwan. 114 V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533. DOI: 10.1038/nature14236 .. 123 F. Morin and Y. Bengio. 2005. Hierarchical probabilistic neural network language model. In Proc. International Workshop on Artificial Intelligence and Statistics, pp. 246–252. Barbados. DOI: 10.1.1.88.9794. 116 A. E. Mousa. 2014. Sub-Word Based Language Modeling of Morphologically Rich Languages for LVCSR. Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. 115 A. E. Mousa, H.-K. J. Kuo, L. Mangu, and H. Soltau. 2013. Morpheme-based featurerich language models using deep neural networks for LVCSR of Egyptian Arabic. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada. DOI: 10.1109/ICASSP.2013.6639311. 115 M. Paleari and B. Huet. 2008. Toward emotion indexing of multimedia excerpts. In Proc. 2008 International Workshop on Content-Based Multimedia Indexing, pp. 425–432. IEEE, London, UK. DOI: 10.1109/CBMI.2008.4564978. 102 E. Park, X. Han, T. L. Berg, and A. C. Berg. 2016. Combining multiple sources of knowledge in deep CNNs for action recognition. In Proc. 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. IEEE, Lake Placid, NY. DOI: 10.1109/WACV.2016 .7477589. 105 B. T. Polyak. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5): 1–17. 105 S. Reed, Z. Akata, H. Lee, and B. Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. Las Vegas, NV. DOI: 10.1109/CVPR.2016.13. 120, 121, 122 M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11): 2673–2681. DOI: 10.1109/78.650093. 111


D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484–489. 123 J. Sung, I. Lenz, and A. Saxena. 2015. Deep multimodal embedding: Manipulating novel objects with point-clouds, language and trajectories. arXiv preprint arXiv:1509.07831. 113 I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems, pp. 3104–3112. Montreal, Canada. 105 R. S. Sutton and A. G. Barto. 1998. Introduction to Reinforcement Learning. MIT Press. 122 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–9. Boston, MA. DOI: 10.1109/CVPR.2015.7298594. 99 G. Trigeorgis, M. Nicolaou, S. Zafeiriou, and B. W. Schuller. 2016. Deep canonical time warping. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5110–5118. Las Vegas, NV. DOI: 10.1109/CVPR.2016.552. 100 S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. 2015a. Sequence to sequence-video to text. In Proc. IEEE Conference Computer Vision, pp. 4534–4542. Santiago, Chile. DOI: 10.1109/ICCV.2015.515. 108 S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In Proc. NAACL-HLT, pp. 1494–1504. Denver, CO. 108 O. Vinyals, M. Fortunato, and N. Jaitly. 2015a. Pointer networks. In Proc. Advances in Neural Information Processing Systems, pp. 2692–2700. Montreal, Canada. 111 O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. 2015b. Grammar as a foreign language. In Proc. Advances in Neural Information Processing Systems, pp. 2773–2781. Montreal, Canada. 107, 108, 111 O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015c. Show and tell: A neural image caption generator. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Boston, MA. 107, 109 K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. International Conference on Machine Learning, pp. 2048–2057. Lille, France. 110, 111, 112 L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. 2015. Describing videos by exploiting temporal structure. In Proc. IEEE Conference Computer Vision, pp. 4507–4515. Santiago, Chile. DOI: 10.1109/ICCV.2015.512. 111

PART II

MULTIMODAL PROCESSING OF SOCIAL AND EMOTIONAL STATES

5 Multimodal User State and Trait Recognition: An Overview

Björn Schuller

5.1 Introduction

It seems intuitive, if not obvious, that for intelligent interaction and communication between technical systems and human users, knowledge of the user's states and traits (for definitions see the Glossary) is beneficial, if not required, on the system's end. Economist Peter Drucker's words seem quite inspiring in this context:

"The most important thing in communication is hearing what isn't said."

Thus, acquiring information on the emotional, cognitive, or physical load level (see also Chapters 10 and 11), the degree of sleepiness or intoxication, or the health state, alongside the age, gender, personality, or ethnicity, etc., from the sound of a user's speech and other available information streams may help to increase the flow of the interaction and allow an interface to adapt on all sorts of levels. While one might ask the user directly for such information, e.g., age or gender, this may often be inefficient, cumbersome, or inappropriate, as time is precious, and modern intelligent systems are increasingly expected to show similar emotional and social intelligence (see Chapter 7) capabilities as one would expect from a human. Indeed, in comparison to a human, they often have access to a richer amount of information in these days of increasingly "big" data that may be collected comparably effortlessly from an ever-growing number of ambient, body-worn, or contact-less sensors. Traditionally, these include the more "natural" sensors, in the sense that the human has these available as well: audio, video, and tactile interaction. In addition, bio-parameters such as heart rate, skin conductance, thermal images, accurate movement data, or fingerprints fall less within what a human would usually have available, but systems such as "simple" smartphones do have astonishing amounts of such information available. Interestingly, on the other hand, the human relies on some information, such as the olfactory channel, that technical systems do not (yet?) employ on a broader basis. Nevertheless, technical systems still mostly fall behind human abilities when it comes to assessing users' states and traits. The general opinion and experience is that, given a multimodal approach, such recognition can be rendered more robust.

In this chapter, an overview of multimodal user state and trait recognition is given, starting off with modeling and the definition of states and traits. Next, states and traits already considered for multimodal automatic assessment are shown. Then, architectures for a synergistic technical processing and fusion of these are discussed, including a modern view on how such systems could be realized. In the sections that follow, an overview of modalities and their peculiarities, requirements, and strengths is provided, namely: spoken and written language; video including facial expression, body posture, and movement; fingerprints; physiological signals such as brain waves; and tactile interaction data. Recent trends and potential for future improvement are provided at the end of this chapter.

5.2 Modeling

Here, we first consider how states and traits, or, more generally speaking, person or user attributes, can be classified. In Schuller and Batliner [2013], a dozen exemplary "taxonomies" are given, including the time aspect used here to group the tasks of interest into long-term traits, longer-term states, and short-term states (see Glossary). The tasks further need to be represented in a way that can best be automatically assessed. Usually, one decides between discrete classes suited for classification and a (pseudo-)continuous representation suited for regression (see Glossary).

5.3 An Overview of Attempted Multimodal State and Trait Recognition

A broad variety of states and traits of users, or of human individuals in general, has already been considered for automatic assessment. Table 5.1 provides a selection of typical ones, including exemplary literature references and the modality combinations considered in those publications. The table is grouped into states and traits.

Glossary

Continuous or discrete (i.e., categorical) representation refers to the modeling of a user state or trait. As an example, the age of a user can be modeled as a continuum, such as the age in years. As opposed to this, a discretized representation would use broader age classes such as "young," "adult," and "elderly." In addition, the time can be discretized or continuous (in fact, it is always discretized in some respect, at least by the sample rate of the digitized sampling of the sensor signals). However, one would speak of continuous measurement if processing delivers a continuous output stream on a (short) frame-by-frame basis rather than an asynchronous processing of (larger) segments or chunks of the signal, such as per spoken word or per body gesture.

The user (long-term) traits include biological trait primitives (e.g., age, gender, height, weight), cultural trait primitives in the sense of group/ethnicity membership (e.g., culture, race, social class, or linguistic concepts such as dialect or first language), personality traits (e.g., the "OCEAN big five" dimensions openness, conscientiousness, extraversion, agreeableness, and neuroticism, or likability), and traits that constitute subject idiosyncrasy, i.e., ID.

A longer-term state can subsume (partly self-induced) non-permanent, yet longer-term states (e.g., sleepiness, intoxication, mood such as depression (see also Chapter 12), or the health state such as having a flu), structural (behavioral, interactional, social) signals (e.g., role in dyads and groups, friendship and identity, positive/negative attitude, intimacy, interest, politeness), (non-verbal) social signals (see Chapters 7 and 8), and discrepant signals (e.g., deception (see also Chapter 13), irony, sarcasm, sincerity).

A pseudo-multimodal approach exploits a modality not only by itself, but in addition to estimate another modality's behavior to replace it. An example is estimating the heart rate from speech parameters and using it alongside (other) speech parameters.

A short-term state includes the mode (e.g., speaking style and voice quality), emotions, and affects (e.g., confidence, stress, frustration, pain, uncertainty; see also Chapters 6 and 8).

A well-defined benchmark for obtainable results is given by the challenge events organized in this field, mostly focused on affect. The first audiovisual challenge was the 2011 Audio/Visual Emotion Challenge and Workshop (AVEC) [Schuller et al. 2011b]. Binary above/below-average decisions for the activity (arousal), expectation, power, and valence dimensions were made on the frame- or word-level. In 2012, the same data was re-used as a fully continuous task. The same data was labeled in time- and value-continuous perceived personality dimensions, attractiveness, likability, and engagement for the first audiovisual personality challenge (MAPTRAITS) [Gunes et al. 2014].

Table 5.1  Examples of user or subject states and traits attempted for automatic recognition in a multimodal way in the literature. Several combinations of modalities are contained in these examples, such as face and fingerprint, face and gait, speech and visual cues, visual cues and driving data in the car, or physiology in combination with some of the above. The engines partially assess single users, sometimes also groups.

Trait              Reference
Age                [Hofmann et al. 2013]
Attractiveness     [Gunes et al. 2014]
Ethnicity          [Lu et al. 2005]
Gender             [Li et al. 2010b], [Shan et al. 2007, Shan et al. 2008], [Huang and Wang 2007], [Matta et al. 2008], [Hofmann et al. 2013]
Height             [Hofmann et al. 2013]
ID                 [Ko 2005, Lu et al. 2005, Çetingül et al. 2006, Sargin et al. 2006, Farrús et al. 2007]
Leader             [Sanchez-Cortes et al. 2013]
Likability         [Gunes et al. 2014]
Nativeness         [Georgakis et al. 2014]
Personality        [Pianesi et al. 2008, Batrinca et al. 2011, Batrinca et al. 2012]

State              Reference
Alertness          [Abouelenien et al. 2015]
Cognitive Load     [Putze et al. 2010]
Deception          [Qin et al. 2005]
Depression         [Cohn et al. 2009]
Distraction        [Wöllmer et al. 2011]
Drowsiness         [Andreeva et al. 2004]
Emotion            [Schuller et al. 2011b]
Engagement         [Gunes et al. 2014]
Interest           [Schuller et al. 2009]
Laughter           [Melder et al. 2007]
Physical Activity  [Maurer et al. 2006, Li et al. 2010a, McCowan et al. 2005]
Sentiment          [Wöllmer et al. 2013]
Stress             [Bořil et al. 2012, Sharma and Gedeon 2012]
Swallowing         [Amft and Tröster 2006]
Violence           (MediaEval)

The AVEC 2013 and 2014 follow-ups introduced audiovisual depression recognition. The 2015 edition was the first to introduce physiology (AV+EC 2015) alongside audio and video for affect acquisition in a challenge event; since then, further tasks have been run in the series, including depression and sentiment from audiovisual data. Similar challenges exist, such as EmotiW or MediaEval, which are, however, based on multimedia such as TV material rather than user interaction data.

5.4 Architectures

The typical flow of processing in user classification with respect to her or his states and traits is shown in Figure 5.1. In the sections that follow, a short step-by-step description following the typical sequence of processing is given in accordance with this figure (for a more detailed description, see Schuller [2013]). However, several blocks are optional, and the order of steps may (slightly) vary in some parts of this chain, such as whether the features of several modalities are fused first and then jointly enhanced, or the other way around.

Figure 5.1 Workflow of a state-of-the-art user state and trait analyser for arbitrary modalities. Dark blue boxes indicate mandatory units, light blue ones typical further units, and lighter blue optional ones. Red indicates external units and light red units interface to other modalities and contextual information and knowledge bases. External connections are indicated by arrows from/to the outside.


5.4.1 Capture

The first block, "capture," in Figure 5.1 is connected to a sensor such as a microphone, camera, or bio-sensor. At this stage, the data is made available as a digitized, time-quantized "raw" signal. The major parameters of interest are usually the sample rate and the word length of the quantization, as well as potential encoding schemes.

5.4.2 Signal-level Fusion

A rather unusual option for fusion of multiple modalities is already given at this early stage. This could, for example, be the case for several bio-sensors. In practice, one would also speak of signal-level fusion if the merging of signals took place after the next step, namely the pre-processing.

5.4.3 Pre-processing

During pre-processing, the signal of interest is enhanced, e.g., in the presence of noise. A range of methods can be employed depending on the modality, such as independent component analysis if several independent sensors are available for the same modality (e.g., a microphone array for audio sensing), or blind source separation, e.g., by non-negative matrix factorization or by deep learning such as autoencoders (see also Chapter 4). Simpler efforts include filtering.

5.4.4 Frame-level Features

The pre-processing is usually followed by feature extraction. In user state and trait recognition, one often finds three sampling rates: the first is that of the sampling of the analogue sensor signal, e.g., at 8 kHz or 16 kHz for speech. The second sampling, with a larger window of analysis, is found at this level, e.g., at 100 Hz for speech or 20–30 Hz (or "frames per second") in the case of video. The third one follows in the next subsection on the segment level. Based on this windowing, features are extracted per frame (thus the name "frame-level" features), such as energy, fundamental frequency, zero-crossing rates, spectral characteristics, and many others tailored to the characteristics of the signal. In fact, the frame length may vary with the frame-level feature of interest. This process is often carried out hierarchically, such as by also calculating first-order derivatives (the "delta" (regression) coefficients) or correlations across frame-level feature contours. Also, further (potentially data-driven) quantization could take place on this level.
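A minimal sketch of frame-level feature extraction for an audio signal, under illustrative parameter choices (25 ms windows with a 10 ms hop, i.e., roughly a 100 Hz frame rate), computing two simple low-level descriptors plus their first-order "delta" contours:

```python
import numpy as np

def frame_level_features(signal, sample_rate, win_s=0.025, hop_s=0.010):
    """Compute per-frame energy and zero-crossing rate plus delta coefficients."""
    win, hop = int(win_s * sample_rate), int(hop_s * sample_rate)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        energy = float(np.sum(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append([energy, zcr])
    feats = np.array(feats)                  # shape: (num_frames, 2)
    deltas = np.gradient(feats, axis=0)      # first-order "delta" contours
    return np.hstack([feats, deltas])        # shape: (num_frames, 4)

# Example: one second of random noise at 16 kHz yields roughly 100 feature frames.
llds = frame_level_features(np.random.randn(16000), 16000)
```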


5.4.5 Segmentation

In off-line, non-continuous signal analysis, the data is (pre-)segmented, e.g., by an event such as a spoken word or a facial or body action unit (see also Chapter 3). In real-life use cases, however, the data stream usually has to be segmented automatically [Gunes and Pantic 2010]. Such "chunking" clusters a mostly varying number of frames into segments (or chunks) of analysis, such as the named words or action units. In the case of states or perceived traits (i.e., where the impression of the trait varies over time), one would want the chunk to start and end with the state or perception. Yet, this is usually not feasible, as one does not know the beginning and end at this stage. Thus, segmentation is often tied to the named words (e.g., based on voice activity detection) or action units (e.g., based on the Bayesian information criterion), etc., depending on the modality. In addition, the desire for quick on-line reactions asks for short segment lengths (e.g., one second or slightly below [Chanel et al. 2009]), in contrast to the desire for longer ones to ensure higher robustness [Berntson et al. 1997, Salahuddin et al. 2007]. That is, just as in Heisenberg's uncertainty relation, one cannot ensure both at a time: maximum resolution in time and maximum accuracy of the estimation of the target.
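As a toy illustration of such chunking (a simple energy threshold, not a full voice activity detector; the threshold and minimum length are arbitrary), consecutive above-threshold frames can be grouped into segments:

```python
def energy_chunks(energies, threshold, min_len=10):
    """Group consecutive above-threshold frames into (start, end) segments."""
    segments, start = [], None
    for i, e in enumerate(energies):
        if e >= threshold and start is None:
            start = i                                  # a segment opens
        elif e < threshold and start is not None:
            if i - start >= min_len:                   # keep only long-enough chunks
                segments.append((start, i))
            start = None
    if start is not None and len(energies) - start >= min_len:
        segments.append((start, len(energies)))
    return segments
```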

5.4.6 Segment-level Features

On the segment level, higher-level features are calculated by applying functionals to the frame-level contours [Schuller 2013]. Likewise, the time series of (usually) varying length is projected onto a scalar per segment-level feature, and overall onto a single feature vector independent of the length of the segment. Often, these are also called "supra-segmental" features (the frame-level features then accordingly being segmental features). Such functionals usually include frequencies of occurrence (e.g., of words), histograms, moments, extremes, peaks, segments, and many others. This is often done in a "brute force" manner, producing up to several thousands of features, in order to next reduce this feature variety to those being most salient. This often includes "hierarchical functionals" such as the mean of extremes of peak points, etc. Again, on this level (potentially data-driven) quantization could take place in the sense of a hierarchical functional: after calculation of the functionals, a vector quantization functional could be applied to parts of, or the entire, so far calculated feature vector [Pokorny et al. 2015]. Note that different segment lengths could be used for different segment-level features. Finally, consider that even feature-free approaches have recently proven successful in this field by "end-to-end" learning (see also Chapter 4) from the signal directly through to the user state or trait [Trigeorgis et al. 2016].
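A sketch of the functional principle: a handful of typical functionals applied to each frame-level contour of a segment yields a feature vector whose length is independent of the segment duration (the selection of functionals here is illustrative):

```python
import numpy as np

FUNCTIONALS = {
    "mean":  np.mean,
    "std":   np.std,
    "min":   np.min,
    "max":   np.max,
    "range": lambda x: np.max(x) - np.min(x),
    "skew":  lambda x: float(np.mean(((x - x.mean()) / (x.std() + 1e-12)) ** 3)),
}

def segment_features(frame_feats):
    """Apply every functional to every frame-level contour of one segment.

    frame_feats : (num_frames, num_llds) array; returns a flat vector of
                  num_llds * len(FUNCTIONALS) values, independent of length."""
    return np.array([f(frame_feats[:, d])
                     for d in range(frame_feats.shape[1])
                     for f in FUNCTIONALS.values()], dtype=float)
```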


5.4.7 Reduction

The reduction of features helps to reduce the complexity of the following machine learning algorithm. A (usually lower-dimensional) new feature space is computed based on a suited transformation such as principal component analysis or linear discriminant analysis and variants. Note that, even if one projects into a lower-dimensional space or does not keep all components of the new space (see Section 5.4.8), computing the individual components usually still requires all original features, i.e., the extraction effort is increased (not reduced), as one needs to calculate the original space and apply the transformation. Thus, the aim in fact is to reduce the complexity (by reducing the number of free parameters to be trained) of the learning algorithm (cf. below). Strictly speaking, this step could thus be called transformation. As such, it relates to the hierarchical-functional principle, as, for example, principal components can be considered as a "weighted sum of functionals" functional. Yet, like the quantization functional named previously, it would be data-based, as the weights need to be learned first.
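A sketch of the transformation step using principal component analysis (plain NumPy, illustrative only); note that, as stressed above, projecting a new instance still requires the full original feature vector:

```python
import numpy as np

def fit_pca(X, num_components):
    """Fit a PCA projection on training features X of shape (num_samples, dim)."""
    mean = X.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:num_components]        # projection basis (num_components, dim)

def apply_pca(x, mean, basis):
    """Project a (full!) original feature vector into the reduced space."""
    return basis @ (x - mean)
```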

5.4.8 Selection and Generation

The selection of features can take place in the original space, a transformed space, or across combinations of these. In addition, a randomized generation of new features can take place during an "exploratory feature selection", such as by genetic algorithms with a (randomized) feature generation step. Usually, one cannot explore all possible feature combinations when searching for the optimal feature space. Thus, one requires (a) a search function and (b) some quality measure of the value of a certain feature sub-ensemble. Popular search functions comprise greedy and floating searches, e.g., in the forward direction (i.e., starting with no features and enlarging the set gradually), in the backward direction (i.e., starting with all features and reducing the set gradually), or bidirectionally. Other search functions include random and genetic searches. Popular measures of the value of feature combinations comprise the target learning algorithm's accuracy (the so-called "wrapper search") or other measures such as the correlation between the feature and the target, information-theoretic measures, and the like. It is important to also consider the inter-correlation of features within a selected feature set, aiming at keeping it low in dimension (i.e., one should not select only slight variations of a highly predictive feature, but aim at a synergistic feature compound; as a metaphor, you cannot play soccer with top defenders only: someone also has to be able to shoot a goal).
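A sketch of a greedy forward "wrapper" search: starting from the empty set, the feature whose addition most improves a user-supplied evaluation function (e.g., the cross-validated accuracy of the target learner) is added until no candidate helps. The stopping rule and interface are illustrative choices of ours.

```python
def forward_selection(num_features, evaluate):
    """Greedy forward wrapper search.

    evaluate : callable taking a list of feature indices and returning a score
               (e.g., cross-validated accuracy of the target classifier).
    """
    selected, best_score = [], float("-inf")
    remaining = set(range(num_features))
    while remaining:
        # try adding each remaining feature and keep the best single addition
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:          # no candidate improves the set: stop
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```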


5.4.9 Adaptation and Enhancement

Just as on the signal level, one can also enhance on (any of) the feature levels. Popular methods include training a model to predict "clean" features from noisy ones, such that features extracted from a corrupted signal serve as the training input and those extracted from the non-corrupted signal as the learning target. In Figure 5.1, the corresponding box is shown only once, but each extractor can write to the data storage from which this module can read. In addition, any form of adaptation, such as to the subject or to the current noise level, can take place, e.g., by subtracting a mean typical for the current subject or noise condition.

5.4.10 Feature-level Fusion

The fusion of multiple modalities can also take place on the feature level. This can concern any of the feature types, i.e., the frame or the segment level. While unusual, it could also be on both levels. The fusion aspect is also described in more detail, e.g., in Chapters 1 and 2.

5.4.11 Crowd Sourcing

An ever-present bottleneck is the amount of labeled data for training an automatic user state and trait acquisition system. This is even more true for multimodal data, in particular if "less usual" modalities are involved. Crowd sourcing is a popular approach these days to quickly reach a reasonable amount of such data. It can be used both to collect the actual multimodal data and to collect the labels, such as the emotion or personality perceived by humans in the data. A popular early "pay per click" crowd sourcing platform was Amazon's Mechanical Turk; recently, however, a rich variety of alternatives has emerged, including gamified versions and ones without payment of annotators. An elegant way is to source information directly from the users. The reason for this box being integrated in Figure 5.1 is that, ideally, the crowd sourcing should be embedded in an ever-ongoing, efficient learning of the system, keeping humans in the loop.

5.4.12 Decision

At this point, the actual decision on a user state or trait is made. The previously sketched data scarcity for some (multimodal) user state and trait acquisition tasks (e.g., for less usual states and traits or for atypical populations) makes "zero"-resource approaches interesting. In such a model, rules are employed, such as that female users should have a higher pitch on average. However, data-trained models usually exceed the performance reached by such approaches; they base their decision on a comparison with training instances or on a model trained on the training instances.

5.4.13 Confidence Estimation

In addition to the result of the acquisition, such as "the user is interested," an interface can benefit from the additional provision of a degree of confidence, such as "I am certain that the user is interested." Ideally, such confidence measures are determined independently of the actual decision, for example by training an independent system to recognize misclassifications of the system that classifies the user, potentially also in a semi-supervized way (see below) [Deng and Schuller 2012]. In the best case, the system is provided with a confidence per class, such that, based on context, it may prefer the second-best result: the delta in confidence might be small, but the second-best result may simply better fit the current situation.
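One way to obtain such a measure is sketched below (using scikit-learn for brevity; the meta-classifier idea follows the text, while the concrete estimators, split, and names are illustrative choices of ours): a second model is trained to predict whether the first model's decision on held-out data was correct.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def fit_with_confidence_model(X, y):
    """Train a state classifier plus an independent error detector."""
    X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_a, y_a)
    # the meta-model learns, from the features, when the main model errs
    correct = (clf.predict(X_b) == y_b).astype(int)
    meta = LogisticRegression(max_iter=1000).fit(X_b, correct)
    return clf, meta

def predict_with_confidence(clf, meta, X):
    """Return the decision and a confidence per sample.
    (Assumes both correct and incorrect cases occurred in meta training.)"""
    return clf.predict(X), meta.predict_proba(X)[:, 1]
```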

5.4.14 Decision-level Fusion

After the decision (per modality), one can also fuse the different modalities based on their individual results. This is referred to as late fusion, as opposed to the early fusion options along the chain shown in Figure 5.1. It is best carried out in a "soft" fashion, including the confidence levels per class and modality. For a final decision, a voting scheme, potentially weighted by the confidences, may be employed. Alternatively, a learning algorithm may be trained with data to allow for more complex and potentially non-linear decision strategies in case of "disagreement" between modalities. As pointed out above, more information on fusion is found, e.g., in Chapter 1.
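A sketch of confidence-weighted soft voting over per-modality class posteriors (illustrative; as noted above, a trained fusion model can replace the fixed rule when modalities disagree in more complex ways):

```python
import numpy as np

def late_fusion(posteriors, reliabilities=None):
    """Fuse per-modality class posteriors by a (weighted) soft vote.

    posteriors    : (num_modalities, num_classes) array of per-class confidences
    reliabilities : optional per-modality weights, e.g., validation accuracies
    """
    P = np.asarray(posteriors, dtype=float)
    w = np.ones(len(P)) if reliabilities is None else np.asarray(reliabilities, float)
    fused = (w[:, None] * P).sum(axis=0) / w.sum()   # weighted soft vote
    return int(np.argmax(fused)), fused

# Example: audio is fairly sure of class 0, video mildly prefers class 1.
decision, fused = late_fusion([[0.8, 0.2], [0.4, 0.6]], reliabilities=[0.7, 0.6])
```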

5.4.15 Encoding

The result has to be made readable to the interface with the application in some way. Ideally, standards are followed to allow easy re-use of a user state and trait recognition engine with various human-machine interfaces. Existing standards comprise those for the annotation of emotion, e.g., EARL [Schröder et al. 2006] or the follow-up W3C EmotionML [Burkhardt et al. 2016] recommendation. Owing to the option that individual categories and dimensions can be specified, the standard can be used to encode a broad range of user states and traits. A more broadly defined standard is the W3C Extensible MultiModal Annotation (EMMA) markup language [Baggia et al. 2007]. However, manifold alternatives exist.

5.4.16 Interface

The interface is not an actual part of the engine for the recognition of user states and traits, but is shown in Figure 5.1 to indicate where the communication with it takes place. However, it can be seen as part of the chain of processing, given that it should ideally feed back knowledge, such as its current state and the reaction of the user, for the integration of contextual knowledge, and receive demands from the engine, such as when to ask for a certain label or for information concerning data of interest to the engine (see "Off- and Online Learning" below), enabling reinforcement or (active and) supervized learning from user feedback.

5.4.17 Off- and Online Learning

If one does not rely on a mere rule set for decisions, as described previously, learning of a suited machine learning algorithm will be needed. There is a broad selection of such popular algorithms, including (deep) neural networks, support vector machines, hidden Markov models, and more general graphical models; a broad body of literature exists on these (for an entry point within this domain, see Schuller and Batliner [2013]). Today, machine learning in the field is almost exclusively carried out in a supervized manner. This means that human-labeled data are presented offline only once to the recognition engine, which learns from them in one pass. However, William Osler's advice to humans can also be given to current and future multimodal person state and trait recognition systems:

"Observe, record, tabulate, communicate. Use your five senses. Learn to see, learn to hear, learn to feel, learn to smell, and know that by practice alone you can become expert."

This alludes to the idea of keeping the recognition engine learning throughout its usage time, potentially only by "practice." This makes sense in many ways, as training data from exactly the usage domain are usually particularly scarce. Also, many user interfaces are characterized by longitudinal interaction, i.e., the interface would miss quite a chance to "get to know the user" if it did not exploit the opportunity to keep learning about her or him during interactions. To design this learning process as efficiently as possible, the engine has to learn either partially by itself or with help from the user or the crowd in a targeted way. Learning by the engine itself includes semi-supervized learning that uses machine labels [Davidov et al. 2010], ideally only of data instances with a high confidence and potentially considering multiple modalities [Zhang et al. 2016b]. Semi-supervized learning can also be used to cross-label data if several states and traits are targeted and some target labels are missing for some data instances [Zhang et al. 2016a]. Alternatively, reinforcement learning can be used [Sutton and Barto 1998], where, in contrast to supervized learning, no correct data instance/label pairs are given and sub-optimal actions are not explicitly corrected. Instead, some indirect form of feedback is exploited, such as the reaction of the user. Imagine the interface assumes the user to belong to an elderly user group and accordingly addresses the user rather formally. If the user's reaction gives the impression that this was expected, the user state and trait recognition engine would learn from this observation, in a reinforced manner, that the decision on the user's age group was correct for future reference. While this type of learning has not yet been exploited in user state and trait recognition, it seems an obvious and efficient avenue in the context of interaction. A key challenge in reinforcement learning is usually designing the reward function [Berridge 2000], which is decisive for success given the high level of uncertainty involved; luckily, it can itself be learned from data [El Asri et al. 2012]. Already used for state and trait classification [Zhang et al. 2015a], active learning asks directly for human help, but only in pre-selected cases. That is, rather than asking for feedback on every user response regarding, e.g., the user's emotion, the engine only asks when it deems the data to be of particular interest, e.g., if the expected change in its model parameters is significant [Settles et al. 2008], or if it assumes that the current data instance could be new or rare in its type (e.g., not "again" (just) of neutral emotion, but of some other rarely seen emotion). Thereby, it can additionally learn how much to trust a user, to make the process even more efficient [Zhang et al. 2015a]. Active and semi-supervized learning combined increase the efficiency, as the part of the data the computer can label by itself based on high confidence does not require aid from the user, while the less certain, yet interesting, cases can be solved with the user's aid. This is known as cooperative learning [Zhang et al. 2015b]. Without human labeling or reinforcement, unsupervized learning allows the data instances to be clustered "blindly" by distances in the feature space and the like. To avoid clustering mainly by the largest variation in the data, hierarchical clustering approaches can cluster without supervision into different states and traits such as identity and emotion [Trigeorgis et al. 2014]. Further, the currently popular deep learning often uses unsupervized initialization of hidden layers of neurons [Stuhlsatz et al. 2011]. In fact, features can also be based upon unsupervized clustering, such as by vector quantization (see Figure 5.1), to produce "bag-of-words" frequency-of-occurrence-type features for arbitrary modalities [Pokorny et al. 2015]. Finally, learning can also be used to best synchronize modalities, e.g., by some advanced form of time warping such as the recent deep canonical time warping approach [Trigeorgis et al. 2015].
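The cooperative-learning idea sketched above (self-labeling confident instances, querying the user or crowd for uncertain ones) can be outlined as follows; this is a hedged toy sketch with synthetic data, assumed confidence thresholds, and a placeholder crowd oracle, not the algorithm of Zhang et al. [2015b].

```python
# Toy sketch of cooperative learning: semi-supervized self-labeling of
# high-confidence instances plus active crowd queries for uncertain ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(50, 10))        # small initial labeled set
y_lab = rng.integers(0, 2, size=50)
X_pool = rng.normal(size=(500, 10))      # unlabeled usage data

def ask_crowd(x):                        # placeholder for a crowd/user query
    return int(x.sum() > 0)

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
for _ in range(3):                       # a few learning rounds
    proba = clf.predict_proba(X_pool)
    conf = proba.max(axis=1)
    sure = conf > 0.90                   # self-label these (semi-supervized)
    unsure = conf < 0.55                 # ask the crowd about these (active)
    y_self = proba[sure].argmax(axis=1)
    y_crowd = np.array([ask_crowd(x) for x in X_pool[unsure]])
    X_lab = np.vstack([X_lab, X_pool[sure], X_pool[unsure]])
    y_lab = np.concatenate([y_lab, y_self, y_crowd])
    X_pool = X_pool[~(sure | unsure)]    # keep only undecided instances in the pool
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
```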

5.4.18 Optimization

To improve performance, the learning algorithm's parameters need to be optimized. This can be done, e.g., by some search function and measure of value, just as in the previously sketched feature selection process, with the main difference that at this stage free parameters such as the number of hidden layers or the learning rate of a neural network are optimized. A typical search function is, for example, grid search.
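A minimal sketch of such a grid search over free parameters of a small neural network is shown below; the data and the parameter grid are placeholders.

```python
# Minimal sketch of grid search over a model's free parameters (here a small
# neural network's hidden-layer size and learning rate); data is synthetic.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 30))
y = rng.integers(0, 2, size=300)

grid = {
    "hidden_layer_sizes": [(16,), (32,), (32, 16)],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```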

5.4.19 Transfer Learning

Transfer learning can be a useful approach if little labeled data is available for a specific user state or trait of interest. Consider, e.g., the desire to build an interface for children that recognizes children's emotion (the "target domain"), while only emotionally labeled data from adults (the "source domain") is available. Transfer learning allows such data from a similar domain to be exploited for training by "transferring" between the source and the target domain. Transfer learning in general is a rather loosely formulated paradigm and can take place on different levels. For this reason, the corresponding box for transfer learning is found between the features and the model in Figure 5.1; in principle, one can also transfer on other levels, such as regarding the optimization or the type of enhancement, etc. Putting links to every possible box, however, would have rendered the figure hard to interpret. In user state and trait recognition, so far mostly feature transfer learning has been applied [Deng et al. 2014], where unsupervized auto-encoder neural networks learn to map input features of one domain onto themselves, usually with a sparse hidden layer (in terms of neurons) to reach a compact representation of the data (a so-called "compression auto-encoder"). Then, data from the other domain is run through this network to achieve the domain adaptation or transfer, respectively. This can also be carried out bidirectionally between the domains to make them more similar. On the other hand, model transfer learning aims at re-using learned models, such as using data from a similar task to pre-train a neural network. Besides, one can use transfer learning not only across different user populations or tasks, such as using data from sentiment recognition to train a valence recognizer, but even across modalities [Socher et al. 2013]. In addition to these building blocks of an automatic user state and trait acquisition system, a number of databases are needed, as also shown in Figure 5.1, where they are summarized as three in total.
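The following is a hedged sketch of the compression auto-encoder idea for feature transfer: a network learns to reconstruct source-domain features through a narrow hidden layer, and target-domain features are then passed through the same encoder to obtain a shared representation. The data is synthetic and this is not the exact setup of Deng et al. [2014].

```python
# Sketch of auto-encoder-based feature transfer between a source and a target
# domain, using a narrow hidden layer as the shared representation.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X_source = rng.normal(size=(400, 64))           # e.g., adult speech features
X_target = rng.normal(loc=0.5, size=(100, 64))  # e.g., child speech features

ae = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                  max_iter=1000, random_state=0)
ae.fit(X_source, X_source)                      # learn to reconstruct the input itself

def encode(X):
    # Forward pass through the first (encoding) layer of the trained network.
    return np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])

Z_source, Z_target = encode(X_source), encode(X_target)
print(Z_source.shape, Z_target.shape)           # (400, 16) (100, 16)
```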

5.4.20 Data

Data summarizes the original user data such as audio, video, or sensor data recordings alongside labels (including, e.g., also individual ratings of different annotators and their personal confidences, meta-data, and other additional information of use) and potentially also other representations such as enhanced or noisy versions of the data, or features extracted on different levels.


5.4.21 Model

The model is the actual model used by the machine-learning algorithm(s) inside the engine. It might be a set of rules, probabilities, weights, etc.

5.4.22 Knowledge

Knowledge potentially includes information on the annotators, the state of the application or interface, the general situational context, or diverse knowledge databases used during decision making and confidence estimation including, e.g., dictionaries.

5.5 A Modern Architecture Perspective

Figure 5.2 gives a modern view of the architecture for multimodal subject state and trait acquisition. There, the "learning analyzer" includes the various aspects along the chain of signal processing up to the decision-making process and confidence calculation. It communicates via potentially more elaborate protocols with the sensors and also the actuators, the interface, as well as the crowd, and has access to the same types of databases. As such, this architecture is very generic, as it may be based on feature extraction or simply on end-to-end learning. At the same time, the link to the sensors and actuators is shown as bidirectional, indicating that the learning analyzer can control parameters of these in a smart way, e.g., to move sensors. The "crowd" interacts via these sensors and actuators, and can be "sourced" via the interface. Ideally, this analyzer recognizes several states and traits in parallel to better see the "whole picture." For example, knowing whether a user is male or female, the personality of the user, and the age group will make it easier to estimate, say, the emotion correctly. Accordingly, such states and traits should not be targeted in isolation, but jointly.

Figure 5.2 A modern view on the workflow of an automatic user state and trait analyzer. The arrows indicate transfer of information. The learning analyzer as core piece unites abilities to learn from data, transfer learn, self-learn, actively learn by dynamically sourcing labels from the crowd via an interface, and learn reinforced based on feedback from the application via the interface. Further, it includes abilities to hierarchically, explicitly or implicitly, extract/model features and self-select these, including unsupervized feature learning such as bag-of-words. Features are understood as attributes just as target (affect) attributes are in a multi-target learning paradigm. The analyzer may incorporate self-adaptation and self-optimization abilities. Actuators are needed to close the analysis-synthesis loop and allow the analyzer to carry out "imitation learning" by re-producing state and trait data to better understand the underlying concepts. (The figure shows the Sensors/actuators, the Crowd, the Learning analyzer, and the Interface, together with the Data, Model, and Knowledge databases.)

5.6 Modalities

In the following, let us have a closer look at the different modalities. There exist a plethora of overviews on these, mostly for multimodal affect recognition (see, e.g., Gunes and Schuller [2013]). There is no definitive gold standard on which modality is best suited for which state or trait; in fact, the use case is often the decisive factor, as it may not always be feasible to use invasive body-contact sensors, and cameras may not always be able to be mounted in a position where they "see" the user, e.g., when considering mobile user interfaces. In addition, privacy issues may come into play when, e.g., considering "open" sensor usage that records environmental data such as audio or video continuously, even when other individuals may be present. Each modality thus has strengths and weaknesses, and multiple modalities can be congruent or incongruent for the same state or trait targeted, e.g., when a user is hiding an emotion. While this is known from human perception [Meeren et al. 2005, Van den Stock et al. 2007], automatic recognition has yet to exploit such effects.

5.6.1 Audio and Spoken and Written Language

As to audio analysis, the information usually comes from the acoustic speech signal, the spoken words via linguistic analysis, and "non-verbal" sounds such as laughter or hesitations. Beyond these, other forms of audio generated by a subject may be of interest, such as the step sounds of walking [Geiger et al. 2014].


The acoustic feature spaces in recent works are often large, reaching up to some thousands of supra-segmental features stemming from some hundred frame-level features. A typical example could be the average speaking rate or the pitch range. As to the acquisition of states and traits from words, be they spoken or written, one generally distinguishes between domain-trained and knowledge-based approaches [Cambria et al. 2013]. Frequently seen domain-trained approaches consider posterior probabilities of words or word (or character) sequences ("N-grams", where N is the sequence length, i.e., the number of words per sequence) representing the state or trait of interest, or modeling in a vector space, where each word or word/character sequence represents a feature by its frequency of occurrence in the material to be assessed. Different normalization and representation forms of the frequencies are known, such as log-frequency; however, with sufficiently large corpora the choice may play a minor role [Schuller et al. 2015a]. Often, the words are stemmed, i.e., clustered by their morphological stem, or re-tagged, e.g., by part-of-speech classes such as adjective, noun, or verb [Matsumoto and Ren 2011], by combinations of these such as adjective-adverb [Benamara et al. 2007] or adjective-verb-adverb [Subrahmanian and Reforgiato 2008], or by semantic classes [Batliner et al. 2011] such as "politics," "sports," etc. In addition, manifold knowledge databases exist that can be exploited to map word sequences onto states or traits of individuals [Strapparava and Mihalcea 2010], such as ConceptNet, General Inquirer, or WordNet(-Affect) [Cambria et al. 2013]. Prior to this, different handling respects the origin of the data: for spoken language, one can integrate acoustic confidence measures or the several best hypotheses coming from the automatic speech recognizer, albeit "perfect recognition" seems less crucial if one only aims at assessing the states and traits of a speaker [Metze et al. 2010]. For written text, one may want to remove extra characters and punctuation, de-capitalize all characters, or handle certain sequences such as "smileys" separately. Non-verbal sound events can be included in the word string after their recognition (e.g., inline with speech recognition), such as in "oh no <laughter>" [Schuller et al. 2009, Eyben et al. 2011]. Accordingly, "<laughter>" would be handled as a word/character or feature in the probabilistic or vector space modeling, respectively.
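A minimal sketch of the vector-space modeling of word N-grams described above is given below; the texts are made-up stand-ins, and keeping markers such as a laughter token intact would require a custom tokenizer in practice.

```python
# Minimal sketch of vector-space text features for state/trait assessment:
# word uni- and bi-gram counts with log-scaled ("sublinear") frequencies.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["oh no that is just great",
         "this is really wonderful thank you",
         "i am not sure this works at all"]

vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
counts = vectorizer.fit_transform(texts)                 # sparse term-frequency matrix
features = TfidfTransformer(sublinear_tf=True).fit_transform(counts)  # log-scaled

print(len(vectorizer.vocabulary_), features.shape)
```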

5.6.2 Images and Video

Using vision-based approaches allows one to capture the fingerprint, the face, facial actions such as raising the cheeks, and facial expressions such as producing a smile [Pantic and Bartlett 2007], as well as the less pursued body postures and gestures such as head tilts and raised arms [Dael et al. 2012]. Also, the gait pattern may be of interest [Hofmann et al. 2013].


Sensors used to capture this information nowadays often include depth information, as in the broadly used consumer-level Microsoft Kinect device, or are simply the standard cameras included in many consumer devices such as smart phones or tablets. Motion capture systems, albeit often not practical in everyday user interfaces, offer highly robust tracking of feature points and are an option for body posture [Kleinsmith et al. 2005, Kleinsmith and Bianchi-Berthouze 2007], body language [Metallinou et al. 2011], and even facial expression [Wöllmer et al. 2010] analysis. In video analysis, such as of the face, one often categorizes the approaches into appearance-based vs. feature-based approaches and combinations thereof [Pantic and Bartlett 2007]. While appearance-based modeling is based on the texture and motion of regions, feature-based modeling is based on tracking, e.g., facial features such as the corners of the eyes or mouth, exploiting knowledge of the anatomy. Features derived from their coordinates can be, for example, distances between these points to indicate, e.g., the degree of widening of the eyes. In a similar fashion, hand or body gesture recognition and human motion analysis can be categorized into appearance-based consideration of color or grey-scale images or of edges and silhouettes, motion-based information without modeling of the structure of the body, and model-based modeling or recovering of 3D configurations of the body parts [Poppe 2007, Poppe 2010]. As for motion capture, geometrical features prevail. These require registration, i.e., the definition of a coordinate system, followed by the calculation of relative positions, Euclidean or other suitable distances, and velocities of and between captured points. Further, orientation, such as of the shoulder axes, can be of interest depending on the trait or state of interest such as affect [Kleinsmith and Bianchi-Berthouze 2007, Kleinsmith et al. 2005]. When employing thermal infrared imagery, blobs or shape contours [Tsiamyrtzis et al. 2007] are frequently applied among several alternatives. The images can also be divided into grids of squares; the highest temperature per square is then identified as a reference [Khan et al. 2006]. Alternatively, differential images with a frequency transformation between the body or face per class are an option [Yoshitomi et al. 2000]. Finally, templates of thermal variation are often used per class as reference. Tracking is fulfilled in the same manner as in the visible spectrum by condensation algorithms, particle filtering, and further methods. The problems thereby often remain the same [Tsiamyrtzis et al. 2007].
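To make the geometric-feature idea concrete, the following sketch derives simple distance features from tracked facial landmark coordinates; the landmark positions are an invented example frame.

```python
# Minimal sketch of geometry-based features from tracked facial landmarks:
# distances between points (e.g., eye opening, mouth width) from 2D coordinates.
import numpy as np

landmarks = {                      # (x, y) pixel coordinates, illustrative only
    "left_eye_top": (120, 80), "left_eye_bottom": (120, 92),
    "mouth_left": (110, 150), "mouth_right": (160, 150),
}

def dist(a, b):
    return float(np.linalg.norm(np.subtract(landmarks[a], landmarks[b])))

features = {
    "eye_opening": dist("left_eye_top", "left_eye_bottom"),
    "mouth_width": dist("mouth_left", "mouth_right"),
}
# Normalizing by a reference distance (e.g., inter-ocular distance) would add
# robustness to scale, which registration addresses explicitly in motion capture.
print(features)
```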

5.6.3 Physiology

Physiological signal analysis often deals with multichannel biosignals as recorded from the central and autonomic nervous systems. Examples include the heart rate, galvanic skin response (GSR) [Chanel et al. 2007], and electromyography for the measurement of the electrical potential that correlates well with muscular cell activity [Haag et al. 2004]. Further, for example, dedicated chest belts allow for the measurement of the respiration rate, leading to features such as the depth, speed, and regularity of breathing [Chanel et al. 2007, Haag et al. 2004]. Brain waves measured over the amygdala have also been successfully used for user state modeling [Pun et al. 2006, Chanel et al. 2006, Jenke et al. 2014]. Features include the activation of the left or right frontal cortex indicating, e.g., asymmetrical brain activity [Davidson and Fox 1982]. The deep location of the amygdala in the brain, however, complicates such EEG sensing. Finally, the degree of blood perfusion in the orbital muscles can be seen by thermal imagery [Tsiamyrtzis et al. 2007]. The aim of avoiding the usual requirement of "invasive" body contact of the sensors drives the recent efforts put into the development of wearable devices. In fact, a rich choice of such devices has already reached the mass consumer market. These can continuously sense the heart rate at the wrist and even brain waves, e.g., the "muse" device, the BodyANT sensor [Kusserow et al. 2009], or Emotiv's Epoc neuroheadset. As physical activity and electrical fields often interfere with the measurement, noise removal plays a key role. Simple mean filters are often employed directly on signals such as GSR, blood volume pressure, or respiration signals. In the spectral domain, bandpass filters are often applied. As in the above-described audio analysis, delta coefficients and supra-segmental features are often derived, e.g., by applying thresholds or detecting peaks [Liu et al. 2005] and projecting these onto scalar features by functionals, e.g., moments such as mean or standard deviation [Chanel et al. 2009, Picard et al. 2001].
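The filtering-plus-functionals chain just described can be sketched as follows on a simulated GSR-like signal; sampling rate, filter length, and peak-detection settings are assumptions for illustration only.

```python
# Minimal sketch of physiological-signal processing: moving-average (mean)
# filtering, peak detection, delta coefficients, and supra-segmental functionals.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(4)
fs = 32                                            # assumed sampling rate in Hz
t = np.arange(0, 60, 1 / fs)
signal = np.sin(2 * np.pi * 0.05 * t) + 0.2 * rng.normal(size=t.size)

kernel = np.ones(fs) / fs                          # 1-second mean filter
smoothed = np.convolve(signal, kernel, mode="same")

peaks, _ = find_peaks(smoothed, distance=5 * fs)   # at most one peak per 5 s
delta = np.diff(smoothed)                          # delta coefficients

functionals = {
    "mean": float(smoothed.mean()),
    "std": float(smoothed.std()),
    "peak_rate_per_min": len(peaks) / (t[-1] / 60.0),
    "mean_delta": float(delta.mean()),
}
print(functionals)
```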

5.6.4 Tactile Signals

Furthermore, manual interaction can be of interest for the recognition of human states and traits. For example, humans can recognize human emotion expressed via two-degrees-of-freedom force-feedback joysticks [Bailenson et al. 2007]. Automatic approaches have successfully used mouse movements and touch-screen interaction, e.g., [Schuller et al. 2002] and [Gao et al. 2012]. Pressure sensors (e.g., arranged in sensor arrays or fields) and accelerometers are an alternative option [Altun 2014, van Wingerden et al. 2014]. Special drawing boards can also come into play when using handwriting or drawing as a source of information, e.g., to identify the person behind it or the emotion conveyed in the writing or painting.


5.6.5 Pseudo-Multimodality

Besides further potential modalities covering other human senses, such as the olfactory or gustatory ones, it seems noteworthy to introduce the ability to replace modalities to a certain extent by others. While this does not make up for an actually added modality, one may reach a more targeted representation of the information in the sense of a "pseudo-modality" (see the Glossary). Note, however, that the term pseudo-modality has already been used in other ways in the literature, such as to allude to the combination of modalities [Kreilinger et al. 2015]. The principle considered here is best illustrated by examples. Recently, it has been shown that the human heart rate can be determined from facial video [Pursche et al. 2012] or speech [Schuller et al. 2013] analysis. Adding heart rate derived from the speech signal to other speech features would make up a "pseudo-multimodality" using physiology and speech information, albeit coming from one modality only. Yet, this can be a more useful representation form for a machine learning algorithm than letting it learn by itself that heart rate is reflected in the speech signal and may be of use, e.g., to determine the stress level, which would usually require massively more data. Another example is the recognition of eye contact [Eyben et al. 2013] or of selected facial action units [Ringeval et al. 2015] from the voice. Thereby, one could avoid the need for cameras in front of the face to enrich the speech features with facial action cues. Yet, again, the information would only be of a pseudo-multimodal nature.
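A rough sketch of the idea is to append such a derived attribute to the original feature vector; here, a regressor stands in for a heart-rate-from-speech estimator that would in reality be trained elsewhere on paired speech and physiology data.

```python
# Hedged, illustrative sketch of "pseudo-multimodality": a (mock) regressor
# estimates heart rate from speech features, and its output is appended as an
# additional, physiology-like attribute. All data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
speech_features = rng.normal(size=(10, 40))               # stand-in acoustic features

hr_from_speech = Ridge().fit(rng.normal(size=(100, 40)),
                             60 + 20 * rng.random(100))   # mock training, sketch only
pseudo_hr = hr_from_speech.predict(speech_features).reshape(-1, 1)

augmented = np.hstack([speech_features, pseudo_hr])       # speech + pseudo-physiology
print(augmented.shape)                                    # (10, 41)
```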

5.6.6 Tools

A larger selection of toolkits, free for research purposes and often open source, can be used to build up a user state and trait recognition system. Here, only a small selection of frequently applied and recent toolkits is given to enable the quick construction of a system including many of the above-named aspects. For the annotation of data, such tools include ELAN for categorical and continuous labels in separated annotation layers [Brugman and Russel 2004]. Further, ANVIL [Kipp 2001] provides customized coding schemes and data storage in XML format for diverse modalities. Value- and time-continuous annotation can be executed via the FEELtrace toolkit [Cowie et al. 2000] and its successor Gtrace [Cowie et al. 2012]. A gamified crowd-sourcing platform tailored for user state and trait data annotation and collection is the iHEARu-PLAY platform [Hantke et al. 2015]. For the fusion of sensor signals, the SSI toolkit [Wagner et al. 2013] provides an open-source environment well suited for multimodal user state and trait analysis in real time. While the enhancement of signals can be executed in manifold ways, a somewhat generic and modality-independent tool is the blind signal source separation "openBlissART" toolkit [Weninger and Schuller 2012]. It is based on non-negative matrix factorization and additionally provides independent component analysis and related methods. For audiovisual and other types of feature extraction, the openSMILE C++ open-source library enables one to extract large feature spaces in real time [Eyben et al. 2010]. Many pre-defined feature sets make it easy to use and to compare with others' work. A closed-source alternative specializing more in speech is given by the EmoVoice toolkit [Vogt et al. 2008]. The Computer Expression Recognition Toolbox (CERT) [Littlewort et al. 2011] is a broadly used real-time-enabled facial expression recognition engine. For body movement and gestures, the EyesWeb XMI Expressive Gesture Processing Library [Glowinski et al. 2011] is frequently employed. This leaves machine learning to add to the list of tools needed to put up a running multimodal user state and trait recognizer. The number of available toolboxes is overwhelming; therefore, only a reference to the frequently used WEKA 3 data mining toolkit [Hall et al. 2009] is given here, as well as to the state-of-the-art CURRENNT tool, a GPU-enabled fast deep-learning library with long short-term memory ability that has led to manifold benchmark results in the field [Weninger et al. 2015]. If one has no data at hand but wants to set up a first system, luckily manifold standardized sets are available. Again, only three freely usable examples are given here that have recently been featured in research competitions, namely the RECOLA database [Ringeval et al. 2013], featured in the AV+EC 2015 challenge, providing audiovisual and physiological data for different affective dimensions alongside other subject information; the SEMAINE database [McKeown et al. 2012], which provides audiovisual user data labeled in five affect dimensions used in earlier AVEC challenges alongside several trait dimensions added during the MAPTRAITS challenge [Gunes et al. 2014]; and finally the audiovisual iHEARu-EAT corpus [Schuller et al. 2015b], providing audiovisual data on eating condition and several other subject meta-data.

5.7 Walk-through of an Example State

Let us now consider that we need a new user state in an interface. Say that, for some reason, the interface should be able to recognize whether a user appears to be arrogant. First, we would thus need to collect data from voices, faces, body postures, gestures, etc. and ensure that according arrogance labels are provided, e.g., by crowd-sourcing. Before crowd-sourcing, however, we would need to decide on the representation, such as two classes ("not arrogant" and "arrogant") or a scale, say from one to ten, in terms of the "degree of arrogance." Next, we set up a system as shown in Figure 5.1 with at least the ability to capture data and extract meaningful features (unless we have collected so much data that these can be learned from the data). We could then select the optimal feature ensemble, prior to learning and optimizing a classifier whose output should be encoded in a way readable by the interface that uses the information. To enrich our initial database, we could use transfer learning from a similar task, population, or situation of which we have observations of arrogance and its absence. Further, to keep our system learning, we would employ active and semi-supervized learning "24-7" by sourcing the crowd whenever our engine decides it needs help of importance. To this end, however, we would also need to integrate a module that estimates confidence levels related to the assumptions on arrogance or its absence made by the system. To fuse different modalities, we could choose one of the three entry points, namely at the signal level, the feature level, or as late as after the individual decisions. In our mark-four edition, we would add a knowledge database to integrate context and make sure we enhance information that could be corrupted, as our system has to work "in the wild" (see also Section 5.8), on the different levels such as signal or feature. Obviously, we would then want to test it extensively.
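The core of this walk-through (feature selection, a two-class "arrogance" classifier, and per-class confidences for the interface) could be sketched as a single pipeline as follows; the data is synthetic, the feature count and model choices are assumptions, and the output dictionary merely stands in for the encoding step discussed earlier.

```python
# Hedged sketch of the walk-through as one pipeline: feature selection,
# a classifier for "arrogant" vs. "not arrogant", and per-class posteriors
# that the interface could consume after encoding.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 100))              # fused multimodal features per instance
y = rng.integers(0, 2, size=200)             # crowd-sourced arrogance labels (0/1)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=20)),       # keep the 20 "best" features
    ("clf", SVC(probability=True, random_state=0)), # classifier with posteriors
])
pipeline.fit(X, y)

posterior = pipeline.predict_proba(X[:1])[0]
result = {"state": "arrogance", "label": int(posterior.argmax()),
          "confidence": float(posterior.max())}     # ready to encode, e.g., as XML
print(result)
```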

5.8 Emerging Trends and Future Directions

To conclude this chapter, let us have a look at selected emerging trends and future directions. Even though these topics are emerging now, they will likely remain directions for quite some more time owing to their challenging nature.

5.8.1 In-the-Wild Processing

To apply user state and trait recognition to real-world tasks, acquisition "in the wild" is needed. This requires systems to be able to deal with non-prototypical instances [Schuller et al. 2011a], including ambiguous cases selected without "cherry picking," under often severe conditions such as noisy, reverberated, or partially missing data. This calls for particularly robust assessment approaches and, accordingly, for data collected under such "wild" conditions.

5.8.2 Diversity of Culture and Language

While humans are able to recognize states such as emotion across cultures [Scherer et al. 2001, Sauter et al. 2010] despite differences in expressivity [Scherer and Brosch 2009], effort still has to go into making machines more independent in this sense. In addition, the language of a user impacts not only the linguistic analysis [Banea et al. 2011], but potentially also the acoustic cues [Polzehl et al. 2010, Feraru et al. 2015, Sagha et al. 2016] or even facial cues, as the lip movement will differ. Accordingly, novel multilingually enabled approaches and tests are needed.

5.8.3 Multi-Subject Processing

If applications have to cope with several users, they need to be able to infer the states and traits of all of them simultaneously. In image processing, this poses fewer problems as long as there are no occlusions by overlapping individuals. In audio processing, the signal requires separation prior to processing. Speaker diarization can help to identify the number of speakers at a time and their potential overlap [Nwe et al. 2010].

5.8.3.1 Linking Analysis with Synthesis

The methods of analysis and synthesis often differ significantly. In addition, many states and traits can be recognized from a user but not synthesized. For a balanced and symmetrical user-machine communication, however, it seems desirable that the machine can feed back the same range of states and traits to the user. Overall, it seems promising to couple analysis and synthesis more tightly if both exist in one interface (e.g., Schröder et al. [2012]). This also allows a system to learn in a reinforced or cooperative fashion together with the user in a synergistic manner, just like a young child learns language not only by analysis but also by synthesis and by getting feedback from the world around him or her. The major challenge lies, however, in embedding user state and trait analysis in more real-life products that operate "in the wild" and testing them in longitudinal studies.

Focus Questions

5.1. Name different attributes that may be used to automatically classify a user and group these by taxonomies. Also think of, and list, a few attributes that have not been listed in the chapter, but could be meaningful in a user interface context.

5.2. Which modalities are the "usual suspects" in the classification of user states and traits? Briefly describe in which ways they are exploited and discuss their individual strengths and weaknesses as well as the potential synergies of selected combinations of modalities.

5.3. What is understood by "pseudo-multimodal" as discussed here?


5.4. Describe the flow of processing in an automatic user classification system starting from the input sensors and name mandatory and optional building blocks along the chain of processing. Briefly describe the principle of each such building block.

5.5. In question 5.4, at which points can multimodal fusion take place along this chain of processing?

5.6. In which way does a modern architecture of processing as described here differ from the "traditional" one?

5.7. Differentiate between supervized, active, reinforced, semi-supervized, unsupervized, and transfer learning.

5.8. What information needs to be provided to and read from the interface to classify the user states and traits?

5.9. Select an exemplary state or trait of your choice and sketch how you would realize a multimodal automatic system that recognizes it and interfaces with an application.

5.10. Name open issues in today's automatic multimodal user classification. Give details on how different modalities are affected.

References

M. Abouelenien, M. Burzo, and R. Mihalcea. 2015. Cascaded multimodal analysis of alertness related features for drivers safety applications. In Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments, p. 59. ACM. DOI: 10.1145/2769493.2769505. 134

E. Alpaydin. 2018. Classifying multimodal data. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch. 2. Morgan & Claypool Publishers, San Rafael, CA.

K. Altun and K. E. MacLean. 2015. Recognizing affect in human touch of a robot. Pattern Recognition Letters, vol. 66, pp. 31–40. DOI: 10.1016/j.patrec.2014.10.016. 148

O. Amft and G. Tröster. 2006. Methods for detection and classification of normal swallowing from muscle activation and sound. In Pervasive Health Conference and Workshops, 2006, pp. 1–10. IEEE. DOI: 10.1109/PCTHEALTH.2006.361624. 134

E. O. Andreeva, P. Aarabi, M. G. Philiastides, K. Mohajer, and M. Emami. 2004. Driver drowsiness detection using multimodal sensor fusion. In Defense and Security, pp. 380–390. International Society for Optics and Photonics. DOI: 10.1117/12.541296. 134


P. Baggia, D. C. Burnett, J. Carter, D. A. Dahl, G. McCobb, and D. Raggett. 2009. EMMA: Extensible MultiModal Annotation Markup Language. W3C Recommendation. 140 J. N. Bailenson, N. Yee, S. Brave, D. Merget, and D. Koslow. 2007. Virtual interpersonal touch: expressing and recognizing emotions through haptic devices. Human-Computer Interaction, 22(3): 325–353. 148 T. Baltrusaitis, C. Ahuja, and L.-Ph. Morency. 2018. Multimodal machine learning. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 1. Morgan & Claypool Publishers San Rafael, CA. C. Banea, R. Mihalcea, and J. Wiebe. 2011. Multilingual sentiment and subjectivity. In I. Zitouni and D. Bikel, editors, Multilingual Natural Language Processing. Prentice Hall. DOI: 10.1.1.221.4090. 152 A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir. 2011. Whodunnit—Searching for the most important feature types signalling emotion-related user states in speech. Computer Speech & Language, 25(1): 4–28.e DOI: 10.1016/j.csl.2009.12.003. 146 L. Batrinca, B. Lepri, and F. Pianesi.2011. Multimodal recognition of personality during short self-presentations. In Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior Understanding, pp. 27–28. ACM. 134 L. Batrinca, B. Lepri, N. Mana, and F. Pianesi. 2012. Multimodal recognition of personality traits in human-computer collaborative tasks. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, pp. 39–46. ACM. DOI: 10.1145/ 2388676.2388687. 134 F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato, and V.˜ S. Subrahmanian. 2007. Sentiment analysis: Adjectives and adverbs are better than adjectives alone. In Proceedings International Conference on Weblogs and Social Media, pp. 1–7, Boulder, CO. 146 G. G. Berntson, J. T. Bigger, D. L. Eckberg, P. Grossman, P. G. Kaufmann, M. Malik, H. N. Nagaraja, S. W. Porges, J. P. Saul, P. H. Stone, and M. W. VanderMolen. 1997. Heart rate variability:origins, methods, and interpretive caveats. Psychophysiology, 34(6): 623–648. DOI: 10.1111/j.1469-8986.1997.tb02140.x/abstract. 137 K. C. Berridge. 2000. Reward learning: Reinforcement, incentives, and expectations. Psycholology of Learning Motiva, 40: 223–278. DOI: 10.1016/S0079-7421(00)80022-5. 142 H. Boˇril, P. Boyraz, and J. H. L. Hansen. 2012. Towards multimodal driver’s stress detection. In Digital Signal Processing for In-vehicle Systems and Safety, pp. 3–19. Springer. 134 H. Brugman and A. Russel. 2004. Annotating Multi-media / Multi-modal resources with ELAN. In Proceedings of LREC, pp. 2065–2068, Lisbon, Portugal. 149


F. Burkhardt, C. Pelachaud, B. Schuller, and E. Zovato. 2017. Emotion ML. In D. Dahl, editor, Multimodal Interaction with W3C Standards: Towards Natural User Interfaces to Everything, pp. 65–80. Springer, Berlin/Heidelberg. 140 M. Burzo, M. Abouelenien, V. Perez-Rosas, and R. Mihalcea. 2018. Multimodal deception detection. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 13. Morgan & Claypool Publishers San Rafael, CA. E. Cambria, B. Schuller, Y. Xia, and C. Havasi. 2013. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems Magazine, 28(2): 15–21. DOI: 10.1109/ MIS.2013.30. 146 H. E. Ceting¨ ¸ ul, E. Erzin, Y. Yemez, and A. M. Tekalp. 2006. Multimodal speaker/speech recognition using lip motion, lip texture and audio. Signal Processing, 86(12): 3549– 3558. DOI: 10.1016/j.sigpro.2006.02.045. 134 G. Chanel, J. Kronegg, D. Grandjean, and T. Pun. 2006. Emotion assessment: Arousal evaluation using eeg’s and peripheral physiological signals. In LNCS vol. 4105, pp. 530–537. 148 G. Chanel, K. Ansari-Asl, and T. Pun. 2007. Valence-arousal evaluation using physiological signals in an emotion recall paradigm. In Proceedings of SMC, pp. 2662–2667, Montreal, QC. IEEE. DOI: 10.1109/ICSMC.2007.4413638. 148 G. Chanel, J. J. M. Kierkels, M. Soleymani, and T. Pun. 2009. Short-term emotion assessment in a recall paradigm. International Journal of Human-Computer Studies, 67(8): 607–627. DOI: 10.1016/j.ijhcs.2009.03.005. 137, 148 J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen, M. T. Padilla, F. Zhou, and F. De La Torre. 2009. Detecting depression from facial actions and vocal prosody. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, pp. 1–7. IEEE. DOI: 10.1109/ACII.2009.5349358. 134 J. F. Cohn, N. Cummins, J. Epps, R. Goecke, J. Joshi, and S. Scherer. 2018. Multimodal assessment of depression and related disorders based on behavioural signals. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 12. Morgan & Claypool Publishers San Rafael, CA. R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schr¨ oder. 2000. Feeltrace: An instrument for recording perceived emotion in real time. In Proceedings of ISCA Workshop on Speech and Emotion, pp. 19–24, Newcastle, UK. DOI: 10.1.1.384 .9385. 149 R. Cowie, G. McKeown, and E. Douglas-Cowie. 2012. Tracing emotion: an overview. International Journal of Synthetic Emotions, 3(1): 1–17. DOI: 10.4018/jse.2012010101. 149


N. Dael, M. Mortillaro, and K. R. Scherer. 2012. The body action and posture coding system (bap): Development and reliability. Journal of Nonverbal Behavior, 36(2): 97–121. DOI: 10.1007/s10919-012-0130-0. 146 D. Davidov, O. Tsur, and A. Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings of CoNNL, pp. 107–116, Uppsala, Sweden. DOI: 10.1.1.182.4112. 141 R. J. Davidson and N. A. Fox. 1982. Asymmetrical brain activity discriminates between positive and negative affective stimuli in human infants. Science, 218: 1235–1237. DOI: 10.1126/science.7146906. 148 J. Deng and B. Schuller. 2012. Confidence measures in speech emotion recognition based on semi-supervised learning. In Proceedings of INTERSPEECH, Portland, OR. ISCA. 140 J. Deng, Z. Zhang, F. Eyben, and B. Schuller. 2014. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Processing Letters, 21(9): 1068–1072. DOI: 10.1109/LSP.2014.2324759. 143 S. K. D’Mello, N. Bosch, and H. Chen. 2018. Multimodal-multisensor affect detection. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 6. Morgan & Claypool Publishers San Rafael, CA. L. El Asri, R. Laroche, and O. Pietquin. 2012. Reward function learning for dialogue management. In Proceedings Sixth Starting AI Researchers’ Symposium – STAIRS, pp. 95–106. 142 F. Eyben, M. W¨ ollmer, and B. Schuller. 2010. openSMILE – The Munich versatile and fast open-source audio feature extractor. In Proceedings of MM, pp. 1459–1462, Florence, Italy. ACM. DOI: 10.1145/1873951.1874246. 150 F. Eyben, M. W¨ ollmer, M. Valstar, H. Gunes, B. Schuller, and M. Pantic. 2011. String-based audiovisual fusion of behavioural events for the assessment of dimensional affect. In Proceedings of FG, pp. 322–329, Santa Barbara, CA. IEEE. DOI: 10.1109/FG.2011 .5771417. 146 F. Eyben, F. Weninger, L. Paletta, and B. Schuller. 2013. The acoustics of eye contact— Detecting visual attention from conversational audio cues. In Proceedings 6th Workshop on Eye Gaze in Intelligent Human Machine Interaction: Gaze in Multimodal Interaction (GAZEIN 2013), held in conjunction with ICMI 2013, pp. 7–12, Sydney, Australia. ACM. DOI: 10.1145/2535948.2535949. 149 M. Farr´ us, P. Ejarque, A. Temko, and J. Hernando. 2007. Histogram equalization in svm multimodal person verification. In Advances in Biometrics, pp. 819–827. Springer. DOI: 10.1007/978-3-540-74549-5_86. 134 S. M. Feraru, D. Schuller, and B. Schuller. 2015. Cross-language acoustic emotion recognition: an overview and some tendencies. In Proceedings of ACII, pp. 125–131, Xi’an, P.R. China. IEEE. DOI: 10.1109/ACII.2015.7344561. 152


Y. Gao, N. Bianchi-Berthouze, and H. Meng. 2012. What does touch tell us about emotions in touchscreen-based gameplay? ACM Transactions on Computer-Human Interaction, 19(4/31). DOI: 10.1145/2395131.2395138. 148 J. T. Geiger, M. Kneissl, B. Schuller, and G. Rigoll. 2014. Acoustic Gait-based Person Identification using Hidden Markov Models. In Proceedings of the Personality Mapping Challenge & Workshop (MAPTRAITS 2014), Satellite of ICMI), pp. 25–30, Istanbul, Turkey. ACM. DOI: 10.1145/2668024.2668027. 145 C. Georgakis, S. Petridis, and M. Pantic. 2014. Discriminating native from non-native speech using fusion of visual cues. In Proceedings of the ACM International Conference on Multimedia, pp. 1177–1180. ACM. DOI: 10.1145/2647868.2655026. 134 D. Glowinski, N. Dael, A. Camurri, G. Volpe, M. Mortillaro, and K. Scherer. 2011. Towards a minimal representation of affective gestures. IEEE Transactions on Affective Computing, 2(2): 106–118. DOI: 10.1109/T-AFFC.2011.7. 150 H. Gunes and M. Pantic. 2010. Automatic, dimensional and continuous emotion recognition. International Journal of Synthetic Emototions, 1(1): 68–99. DOI: 10.4018/ jse.2010101605. 137 H. Gunes and B. Schuller. 2013. Categorical and dimensional affect analysis in continuous input: current trends and future directions. Image and Vision Compututing Journal Special Issue, 31(2): 120–136. DOI: 10.1016/j.imavis.2012.06.016. 144 H. Gunes, B. Schuller, O. Celiktutan, E. Sariyanidi, and F. Eyben, editors. 2014. Proceedings of the Personality Mapping Challenge & Workshop (MAPTRAITS 2014), Istanbul, Turkey. ACM. Satellite of the 16th ACM International Conference on Multimodal Interaction (ICMI). 133, 134, 150 A. Haag, S. Goronzy, P. Schaich, and J. Williams. 2004. Emotion recognition using biosensors: First steps towards an automatic system. In LNCS 3068, pp. 36–48. DOI: 10.1007/978-3-540-24842-2_4. 148 M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. 2009. The weka data mining software: an update. ACM SIGKDD Explorations Newsletters, 11(1): 10–18. DOI: 10.1145/1656274.1656278. 150 S. Hantke, T. Appel, F. Eyben, and B. Schuller. 2015. iHEARu-PLAY: Introducing a game for crowdsourced data collection for affective computing. In Proceedings of the 1st International Workshop on Automatic Sentiment Analysis in the Wild (WASA 2015) held in conjunction with ACII, pp. 891–897, Xi’an, P.R. China. IEEE. DOI: 10.1109/ACII .2015.7344680. 149 M. Hofmann, J. Geiger, S. Bachmann, B. Schuller, and G. Rigoll. 2013. The TUM Gait from Audio, Image and Depth (GAID) Database: Multimodal Recognition of Subjects and Traits. Journal of Visual Communication and Image Representation Special Issue on Visual Understanding Application with RGB-D Cameras, 25(1): 195–206. 134, 146 G. Huang and Y. Wang. 2007. Gender classification based on fusion of multi-view gait sequences. In Computer Vision–ACCV 2007, pp. 462–471. Springer. DOI: 10.1007/9783-540-76386-4_43. 134


R. Jenke, A. Peer, and M. Buss. 2014. Feature extraction and selection for emotion recognition from eeg. IEEE Transactions on Affective Computing, 5(3): 327–339. DOI: 10.1109/ TAFFC.2014.2339834. 148 G. Keren, A. E.-D. Mousa, O. Pietquin, S. Zafeiriou, and B. Schuller. 2018. Deep learning for multisensorial and multimodal interaction. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of MultimodalMultisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 4. Morgan & Claypool Publishers San Rafael, CA. M. M. Khan, R. D. Ward, and M. Ingleby. 2006. Infrared thermal sensing of positive and negative affective states. In Proceedings of the International Conference on Robotics, Automation and Mechatronics, pp. 1–6. IEEE. DOI: 10.1109/RAMECH.2006.252608. 147 M. Kipp. 2001. Anvil - a generic annotation tool for multimodal dialogue. In Proceedings of the 7th European Conference on Speech Communication and Technology, pp. 1367–1370. 149 A. Kleinsmith and N. Bianchi-Berthouze. 2007. Recognizing affective dimensions from body posture. In Proceedings of ACII, pp. 48–58, Lisbon, Portugal. DOI: 10.1007/978-3-54074889-2_5. 147 A. Kleinsmith, P. R. De Silva, and N. Bianchi-Berthouze. 2005. Recognizing emotion from postures: Cross–cultural differences in user modeling. In Proceedings of the Conference on User Modeling, pp. 50–59, Edinburgh, UK. 147 T. Ko. 2005. Multimodal biometric identification for large user population using fingerprint, face and iris recognition. In Applied Imagery and Pattern Recognition Workshop, 2005. Proceedings 34th, p. 6. IEEE. DOI: 10.1109/AIPR.2005.35. 134 A. Kreilinger, H. Hiebel, and G. Muller-Putz. 2015. Single versus multiple events error potential detection in a BCI-controlled car game with continuous and discrete feedback. IEEE Transactions on Biomedical Engineering, (3): 519–29. DOI: 10.1109/ TBME.2015.2465866. 149 M. Kusserow, O. Amft, and G. Troster. 2009. Bodyant: Miniature wireless sensors for naturalistic monitoring of daily activity. In Proceedings of the International Conference on Body Area Networks, pp. 1–8, Sydney, Australia. DOI: 10.4108/ICST.BODYNETS2009 .5899. 148 M. Li, V. Rozgi´ c, G. Thatte, S. Lee, A. Emken, M. Annavaram, U. Mitra, D. Spruijt-Metz, and S. Narayanan. 2010a. Multimodal physical activity recognition by fusing temporal and cepstral information. IEEE Transactions on Neural Systems Rehabilitation Engineering, 18(4): 369–380. DOI: 10.1109/TNSRE.2010.2053217. 134 X. Li, X. Zhao, Y. Fu, and Y. Liu. 2010b. Bimodal gender recognition from face and fingerprint. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2590– 2597. IEEE. DOI: 10.1109/CVPR.2010.5539969. 134


G. Littlewort, J. Whitehill, T. Wu, I. R. Fasel, M. G. Frank, J. R. Movellan, and M. S. Bartlett. 2011. The computer expression recognition toolbox (cert). In Proceedings of FG, pp. 298–305, Santa Barbara, CA. IEEE. DOI: 10.1109/FG.2011.5771414. 150 C. Liu, P. Rani, and N. Sarkar. 2005. An empirical study of machine learning techniques for affect recognition in human-robot interaction. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2662–2667. DOI: 10.1109/IROS.2005.1545344. 148 X. Lu, H. Chen, and A. K. Jain. 2005. Multimodal facial gender and ethnicity identification. In Advances in Biometrics, pp. 554–561. Springer. DOI: 10.1007/11608288_74. 134 K. Matsumoto and F. Ren. 2011. Estimation of word emotions based on part of speech and positional information. Computters in Human Behavior, 27(5): 1553–1564. DOI: 10.1016/j.chb.2010.10.028. 146 F. Matta, U. Saeed, C. Mallauran, and J.-L. Dugelay. 2008. Facial gender recognition using multiple sources of visual information. In Multimedia Signal Processing, 2008 IEEE 10th Workshop on, pp. 785–790. IEEE. DOI: 10.1109/MMSP.2008.4665181. 134 U. Maurer, A. Smailagic, D. P. Siewiorek, and M. Deisher. 2006. Activity recognition and monitoring using multiple sensors on different body positions. In Wearable and Implantable Body Sensor Networks, 2006. BSN 2006. International Workshop on, pp. 4–pp. IEEE. DOI: 10.1109/BSN.2006.6. 134 I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. 2005. Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3): 305–317. DOI: 10.1109/TPAMI.2005 .49. 134 G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schr¨ oder. 2012. The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1): 5–17. DOI: 10.1109/T-AFFC.2011.20. 150 H. K. Meeren, C. C. Van Heijnsbergen, and B. De Gelder. 2005. Rapid perceptual integration of facial expression and emotional body language. Proceedings of the National Academy of Sciences of the USA, 102: 16518–16523. DOI: 10.1073/pnas.0507650102. 145 W. A. Melder, K. P. Truong, M. D. Uyl, D. A. Van Leeuwen, M. A. Neerincx, L. R. Loos, and B. Plum. 2007. Affective multimodal mirror: sensing and eliciting laughter. In Proceed. International Workshop on Human-centered Multimedia, pp. 31–40. ACM. DOI: 10.1145/1290128.1290134. 134 A. Metallinou, A. Katsamanis, Y. Wang, and S. Narayanan. 2011. Tracking changes in continuous emotion states using body language and prosodic cues. In Proceedings of ICASSP, pp. 2288–2291, Prague, Czech Republic. IEEE. DOI: 10.1109/ICASSP.2011 .5946939. 147 F. Metze, A. Batliner, F. Eyben, T. Polzehl, B. Schuller, and S. Steidl. 2010. Emotion recognition using imperfect speech recognition. In Proceedings INTERSPEECH, pp. 478–481, Makuhari, Japan. ISCA. 146


T. L. Nwe, H. Sun, N. Ma, and H. Li. 2010. Speaker diarization in meeting audio for single distant microphone. In Proceedings of INTERSPEECH, pp. 1505–1508, Makuhari, Japan. ISCA. 152 S. Oviatt, J. F. Grafsgaard, L. Chen, and X. Ochoa. 2018. Multimodal learning analytics: Assessing learners’ mental state during the process of learning. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 11. Morgan & Claypool Publishers San Rafael, CA. Y. Panagakis, O. Rudovic, and M. Pantic. 2018. Learning for multi-modal and contextsensitive interfaces. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 3. Morgan & Claypool Publishers San Rafael, CA. M. Pantic and M.S. Bartlett. 2007. Machine analysis of facial expressions. In K. Delac and M. Grgic, editors, Face Recognition, pp. 377–416. I-Tech Education and Publishing, Vienna, Austria. 146, 147 F. Pianesi, N. Mana, A. Cappelletti, B. Lepri, and M. Zancanaro. 2008. Multimodal recognition of personality traits in social interactions. In Proceedings of the 10th International Conference on Multimodal Interfaces, pp. 53–60. ACM. DOI: 10.1145/1452392.1452404. 134 R. W. Picard, E. Vyzas, and J. Healey. 2001. Toward machine emotional intelligence: analysis of affective physiological state. IEEE Transactions on Pattern Analysis Machine Intelligence, 23(10): 1175–1191. DOI: 10.1109/34.954607. 148 F. Pokorny, F. Graf, F. Pernkopf, and B. Schuller. 2015. Detection of negative emotions in speech signals using bags-of-audio-words. In Proceedings of the 1st International Workshop on Automatic Sentiment Analysis in the Wild (WASA 2015) held in conjunction with ACII, pp. 879–884, Xi’an, P.R. China. IEEE. DOI: 10.1109/ACII.2015.7344678. 137, 142 T. Polzehl, A. Schmitt, and F. Metze. 2010. Approaching multi-lingual emotion recognition from speech—on language dependency of acoustic/prosodic features for anger detection. In Proceedings of Speech Prosody. ISCA. 152 R. Poppe. 2007. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1–2): 4–18. DOI: 10.1016/j.cviu.2006.10.016. 147 R. Poppe. 20101. A survey on vision-based human action recognition. Image and Vision Computing, 28(6): 976–990. DOI: 10.1016/j.imavis.2009.11.014. 147 T. Pun, T. I. Alecu, G. Chanel, J. Kronegg, and S. Voloshynovskiy. 2006. Brain–computer interaction research at the computer vision and multimedia laboratory, University of Geneva. IEEE Transactions on Neural Systems Rehabilitation Engineering, 14(2): 210–213. DOI: 10.1109/TNSRE.2006.875544. 148


T. Pursche, J. Krajewski, and R. Moeller. 2012. Video-based heart rate measurement from human faces. In Consumer Electronics (ICCE), 2012 IEEE International Conference on, pp. 544–545. IEEE. DOI: 10.1109/ICCE.2012.6161965. 149 F. Putze, J.-P. Jarvis, and T. Schultz. 2010. Multimodal recognition of cognitive workload for multitasking in the car. In Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 3748–3751. IEEE. DOI: 10.1109/ICPR.2010.913. 134 T. Qin, J. K. Burgoon, J. P. Blair, and J. F. Nunamaker Jr. 2005. Modality effects in deception detection and applications in automatic-deception-detection. In System Sciences, 2005. HICSS’05. Proceedings of the 38th Annual Hawaii International Conference on, pp. 23b–23b. IEEE. DOI: 10.1109/HICSS.2005.436. 134 F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne. 2013. Introducing the recola multimodal corpus of remote collaborative and affective interactions. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pp. 1–8. IEEE. DOI: 10.1109/FG.2013.6553805. 150 F. Ringeval, E. Marchi, M. M´ ehu, K. Scherer, and B. Schuller. 2015. Face reading from speech—predicting facial action units from audio cues. In Proceedings INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, pp. 1977–1981, Dresden, Germany. ISCA. 149 H. Sagha, J. Deng, M. Gavryukova, J. Han, and B. Schuller. 2016. Cross lingual speech emotion recognition using canonical correlation analysis on principal component subspace. In Proceedings 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2016, Shanghai, P.R. China. IEEE. DOI: 10.1109/ICASSP .2016.7472789. 152 L. Salahuddin, J. Cho, M. G. Jeong, and D. Kim. 2007. Ultra short term analysis of heart rate variability for monitoring mental stress in mobile settings. In Proceedings of the IEEE International Conference of Engineering in Medicine and Biology Society, pp. 39–48. DOI: 10.1109/IEMBS.2007.4353378. 137 D. Sanchez-Cortes, O. Aran, D. B. Jayagopi, M. Mast, and D. Gatica-Perez. 2013. Emergent leaders through looking and speaking: from audio-visual data to multimodal recognition. Journal of Multimodal User Interface, 7(1–2): 39–53. DOI: 10.1007/s12193012-0101-0. 134 M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp. 2006. Multimodal speaker identification using canonical correlation analysis. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1, pp. I–I. IEEE. DOI: 10.1109/ICASSP.2006.1660095. 134 D. A. Sauter, F. Eisner, P. Ekman, and S. K. Scott. 2010. Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proceedings of the National Academy of Sciences of the U.S.A., 107(6): 2408–2412. DOI: 10.1073/pnas.0908239106. 151


K. R. Scherer and T. Brosch. 2009. Culture-specific appraisal biases contribute to emotion dispositions. European Journal of Personality, 23: 265–288. DOI: 10.1002/per.714/ abstract. 151, 152 K. R. Scherer, R. Banse, and H. G. Wallbott. 2001. Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology, 32(1): 76–92. DOI: 10.1177/0022022101032001009. 151 M. Schr¨ oder, H. Pirker, and M. Lamolle. 2006. First suggestions for an emotion annotation and representation language. In Proceedings LREC, vol. 6, pp. 88–92, Genoa, Italy. ELRA. 140 M. Schr¨ oder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, and M. W¨ ollmer. 2012. Building autonomous sensitive artificial listeners. IEEE Transactions on Affectective Computing, pp. 1–20. DOI: 10.1109/T-AFFC.2011.34. 152 B. Schuller. 2013. Intelligent Audio Analysis. Signals and Communication Technology. Springer. 135, 137 B. Schuller and A. Batliner. 2013. Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Wiley. 132, 141 B. Schuller, Manfred Lang, and G. Rigoll. 2002. Multimodal emotion recognition in audiovisual communication. In Proceedings of ICME, vol. 1, pp. 745–748, Lausanne, Switzerland. IEEE. DOI: 0.1109/ICME.2002.1035889. 148 B. Schuller, R. M¨ uller, F. Eyben, J. Gast, B. H¨ ornler, M. W¨ ollmer, G. Rigoll, A. H¨ othker, and H. Konosu. 2009. Being Bored? Recognising natural interest by extensive audiovisual integration for real-life application. Image and Vision Computing Journal, 27(12): 1760–1774. DOI: 10.1016/j.imavis.2009.02.013. 134, 146 B. Schuller, A. Batliner, S. Steidl, and D. Seppi. 2011a. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communications, 53(9/10): 1062–1087. DOI: 10.1016/j.specom.2011.01.011. 151 B. Schuller, M. Valstar, R. Cowie, and M. Pantic. 2011b. Avec 2011—the first audio/visual emotion challenge and workshop - an introduction. In Proceedings of the 1st International Audio/Visual Emotion Challenge and Workshop, pp. 415–424, Memphis, TN. 133, 134 B. Schuller, F. Friedmann, and F. Eyben. 2013. Automatic recognition of physiological parameters in the human voice: heart rate and skin conductance. In Proceedings 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, pp. 7219–7223, Vancouver, Canada. IEEE. DOI: 10.1109/ICASSP.2013.6639064. 149 B. Schuller, A. El-Desoky Mousa, and V. Vasileios. 2015a. Sentiment analysis and opinion mining: on optimal parameters and performances. WIREs Data Mining and Knowledge Discovery, 5: 255–263. DOI: 10.1002/widm.1159/abstract. 146 ˜ . Orozco-Arroyave, E. N¨ B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. H¨ onig, J.R oth, Y. Zhang, and F. Weninger. 2015b. The INTERSPEECH 2015 computational paralinguistics challenge: degree of nativeness, Parkinson’s & eating condition.

References

163

In Proceedings INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, pp. 478–482, Dresden, Germany. ISCA. 150 B. Settles, M. Craven, and S. Ray. 2008. Multiple-instance active learning. In Proceedings of NIPS, pp. 1289–1296, Vancouver, BC, Canada. 142 C. Shan, S. Gong, and P. W. McOwan. 2007. Learning gender from human gaits and faces. In Advanced Video and Signal Based Surveillance, 2007. AVSS 2007. IEEE Conference on, pp. 505–510. IEEE. DOI: 10.1109/AVSS.2007.4425362. 134 C. Shan, S. Gong, and P. W. McOwan. 2008. Fusing gait and face cues for human gender recognition. Neurocomputing, 71(10):1931–1938. DOI: 10.1016/j.neucom.2007.09 .023. 134 N. Sharma and T. Gedeon. 2012. Objective measures, sensors and computational techniques for stress recognition and classification: A survey. Computer Methods and Programs in Biomedicine, 108(3): 1287–1301. DOI: 10.1016/j.cmpb.2012.07.003. 134 R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. 2013. Zero-shot learning through crossmodal transfer. In NIPS’13 Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1, pp. 935–943. 143 C. Strapparava and R. Mihalcea. 2010. Annotating and identifying emotions in text. In G. Armano, M. de Gemmis, G. Semeraro, and E. Vargiu, editors, Intelligent Information Access, Studies in Computational Intelligence, vol. 301, pp. 21–38. Springer Berlin/Heidelberg. ISBN 978-3-642-13999-4. DOI: 10.1007/978-3-642-14000-6_2. 146 A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller. 2011. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of ICASSP, pp. 5688–5691, Prague, Czech Republic. IEEE. DOI: 10.1109/ICASSP.2011 .5947651. 142 V. S. Subrahmanian and D. Reforgiato. 2008. AVA: adjective-verb-adverb combinations for sentiment analysis. Intelligent Systems, 23(4):43–50. DOI: 10.1109/MIS.2008.57. 146 R. S. Sutton and A. G. Barto. 1998. Reinforcement Learning: An Introduction, volume 1. MIT Press Cambridge. 141 G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. Schuller. 2014. A deep semi-NMF model for learning hidden representations. In Proceedings of ICML, vol. 32, pp. 1692–1700, Beijing, China. IMLS. 142 ˜. Nicolaou, S. Zafeiriou, and B. Schuller. 2015. Towards deep alignment of G. Trigeorgis, M.A multimodal data. In Proceedings 2015 Multimodal Machine Learning Workshop held in conjunction with NIPS 2015 (MMML@NIPS), Montr´ eal, QC. NIPS. 142 G. Trigeorgis, F. Ringeval, R. Br¨ uckner, E. Marchi, M. Nicolaou, B. Schuller, and S. Zafeiriou. 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2016, Shanghai, P.R. China. IEEE. DOI: 10.1109/ICASSP.2016.7472669. 137

164

Chapter 5 Multimodal User State and Trait Recognition: An Overview

P. Tsiamyrtzis, J. Dowdall, D. Shastri, I. T. Pavlidis, M. G. Frank, and P. Ekman. 2007. Imaging facial physiology for the detection of deceit. Intelligent Journal of Computer Vision, 71(2): 197–214. DOI: 10.1007/s11263-006-6106-y. 147, 148 J. Van den Stock, R. Righart, and B. De Gelder. 2007. Body expressions influence recognition of emotions in the face and voice. Emotion, 7(3):487–494. DOI: 10.1037/1528-3542.7 .3.487. 145 S. van Wingerden, T. J. Uebbing, M. M. Jung, and M. Poel. 2014. A neural network based approach to social touch classification. In Proceedings of the 2nd International Workshop on Emotion representations and modelling in Human-Computer Interaction systems, ERM4HCI, pp. 7–12, Istanbul, Turkey. ACM. DOI: 10.1145/2668056.2668060. 148 A. Vinciarelli and A. Esposito. 2018. Multimodal analysis of social signals. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 7. Morgan & Claypool Publishers San Rafael, CA. T. Vogt, E. Andr´ e, and N. Bee. 2008. Emovoice – a framework for online recognition of emotions from voice. In Proceedings of IEEE PIT, volume 5078 of LNCS, pp. 188–199. Springer, Kloster Irsee. DOI: 10.1007/978-3-540-69369-7_21. 150 J. Wagner and E. Andr´ e. 2018. Real-time sensing of affect and social signals in a multimodal framework: a practical approach. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 8. Morgan & Claypool Publishers San Rafael, CA. J. Wagner, F. Lingenfelser, T. Baur, I. Damian, F. Kistler, and E. Andr´ e. 2013. The social signal interpretation (ssi) framework: multimodal signal processing and recognition in real-time. In Proceedings of the 21st ACM International Conference on Multimedia, pp. 831–834. ACM. DOI: 10.1145/2502081.2502223. 149 F. Weninger and B. Schuller. 2012. Optimization and parallelization of monaural source separation algorithms in the openBliSSART toolkit. Journal of Signal Processing Systems, 69(3): 267–277. DOI: 10.1007/s11265-012-0673-7. 150 F. Weninger, J. Bergmann, and B. Schuller. 2015. Introducing CURRENNT: the Munich open-source CUDA RecurREnt neural network toolkit. Journal of Machine Learning Research, 16: 547–551. 150 M. W¨ ollmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan. 2010. Contextsensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling. In Proceedings of INTERSPEECH, pp. 2362–2365, Makuhari, Japan. ISCA. 147 M. W¨ ollmer, C. Blaschke, T. Schindl, B. Schuller, B. F¨ arber, S. Mayer, and B. Trefflich. 2011. On-line driver distraction detection using long short-term memory. IEEE

References

165

Transactions on Intelligent Transportation Systems, 12(2): 574–582. DOI: 10.1109/TITS .2011.2119483. 134 M. W¨ ollmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L. P. Morency. 2013. YouTube movie reviews: Sentiment analysis in an audiovisual context. IEEE Intelligent Systems, 28(2): 2–8. DOI: 10.1109/MIS.2013.34. 134 Y. Yoshitomi, S. I. Kim, T. Kawano, and T. Kitazoe. 2000. Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face. In IEEE International Workshop on Robot and Human Interactive Communication, pp. 178–183. DOI: 10.1109/ROMAN.2000.892491. 147 Y. Zhang, E. Coutinho, Z. Zhang, M. Adam, and B. Schuller. 2015a. Introducing rater reliability and correlation based dynamic active learning. In Proceedings of ACII, pp. 70–76, Xi’an, P.R. China. IEEE. DOI: 10.1109/ACII.2015.7344553. 142 Z. Zhang, E. Coutinho, J. Deng, and B. Schuller. 2015b. Cooperative Learning and its Application to Emotion Recognition from Speech. IEEE ACM Transactions on Audio, Speech and Language Processing, 23(1): 115–126. DOI: 10.1109/TASLP.2014.2375558. 142 Y. Zhang, Y. Zhou, J. Shen, and B. Schuller. 2016a. Semi-autonomous data enrichment based on cross-task labelling of missing targets for holistic speech analysis. In Proceedings 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2016, Shanghai, P.R. China. IEEE. 141 Z. Zhang, F. Ringeval, B. Dong, E. Coutinho, E. Marchi, and B. Schuller. 2016b. Enhanced semi-supervised learning for multimodal emotion recognition. In Proceedings 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2016, Shanghai, P.R. China. IEEE. DOI: 10.1109/ICASSP.2016.7472666. 141 J. Zhou, K. Yu, F. Chen, Y. Wang, and S. Z. Arshad. 2018. Multimodal behavioural and physiological signals as indicators of cognitive load. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of MultimodalMultisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition, Ch 10. Morgan & Claypool Publishers San Rafael, CA.

6 Multimodal-Multisensor Affect Detection

Sidney K. D'Mello, Nigel Bosch, Huili Chen

6.1 Introduction

Imagine you are interested in analyzing the emotional responses of a person in some interaction context (i.e., with computer software, a robot, in a classroom, on the subway). You could simply ask the person to self-report his or her felt emotion using a questionnaire, a valence-arousal grid [Russell et al. 1989], a self-assessment manikin [Bradley and Lang 1994], or some such measurement instrument. Or you could ask trained humans to observe the person and provide emotion judgments [Ocumpaugh et al. 2015]. You could also record audio/video of the interaction and have trained coders annotate the videos for visible emotion at some later time. You can even use computer vision techniques to obtain automatic estimates of facial expressions in the videos [Girard et al. 2015]. Or you may be interested in the person's physiological responses and can use a variety of sensors to collect these data.

These examples capture some (but not all) of the contemporary approaches to measure emotional responses [Coan and Allen 2007]. The approaches can be categorized as subjective vs. objective, each with different affordances. The subjective approaches (self and observers) are best suited for emotion-level representations (e.g., discrete emotions like anger and fear or dimensional representations like valence or dominance) at coarse-grained temporal resolutions (tens of seconds to minutes). The objective approaches (sensors and software) are ideal for measurement of behavioral/physiological responses (e.g., facial expressions, electrodermal activity) at fine-grained temporal resolutions (milliseconds to seconds). The two approaches have complementary strengths and weaknesses. The subjective approaches capitalize on humans' knowledge and reasoning capabilities, resulting in more nuanced and contextualized emotion assessments. However, they are limited by fatigue, biases (e.g., social desirability bias), errors (e.g., memory reconstruction for self-reports), and are difficult to scale. The objective approaches are not affected by fatigue or biases and are more scalable, but have limited inference and reasoning capabilities, thereby mainly providing readouts of behavioral/physiological responses.

Are there ways to reconcile the two approaches? One strategy is to combine both, for example, collecting subjective self-reports of frustration in tandem with computerized estimates of facial action units (AUs) [Girard et al. 2015]. The two are taken as complementary perspectives of the person's emotional response and associations are analyzed offline, i.e., by correlating self-reports of frustration with AUs. But what if there was a way to combine both perspectives on the fly so that the measurement jointly reflects both subjective emotion perception by humans and objective behavioral/physiological signals recorded by sensors? And what if the measurement could occur in a fully automated fashion, thereby providing measurement at fine-grained temporal resolutions and at scale? And further, what if the measurement engine was sufficiently sophisticated to model multiple expressive channels and the nonlinear temporal dependencies among them? This is the affective computing (AC) approach to emotion measurement and is the focus of this chapter.

Affective computing [Calvo et al. 2015, Picard 1997], broadly defined as computing involving or arising from human emotion, is an interdisciplinary field that integrates the affective and computational sciences. Affect detection (or affect recognition) is one of the key subfields of affective computing (see [Calvo and D'Mello 2010, D'Mello and Kory 2015, Zeng et al. 2009]). The goal of affect detection is to automatically provide estimates of latent higher-level affective representations (e.g., fear) from machine-readable lower-level response signals (e.g., video, audio, physiology). Multimodal-multisensor affect detection (MMAD) utilizes multiple modalities (e.g., video, cardiac activity) and/or multiple sensors (e.g., video, electromyography) as an alternative to unimodal affect detection (UMAD). In this chapter, we provide a conceptual and technical overview of the field of MMAD, ground the abstract ideas via walk-throughs of three MMAD systems, and provide a summative review of the state-of-the-art in the field. We begin with a background discussion from the affective sciences, starting with a very basic question: "What is affect?"

Glossary

Affect. Broad term encompassing constructs such as emotions, moods, and feelings. Is not the same as personality, motivation, and other related terms.
Affect annotation. The process of assigning affective labels (e.g., bored, confused, aroused) or values (e.g., arousal = 5) to data (e.g., video, audio, text).
Affective computing. Computing techniques and applications involving emotion or affect.
Affective experience-expression link. The relationship between experiencing an affective state (e.g., feeling confused) and expressing it (e.g., displaying a furrowed brow).
Affective ground truth. Objective reality involving the "true" affective state. A misleading term for psychological constructs like affect.
Construct. A conceptual variable that cannot be directly observed (e.g., intelligence, personality).
Multimodal fusion. The process of combining information from multiple modalities.
User-independent model. A model that generalizes to a different set of users beyond those used to develop the model.

6.2 Background from Affective Sciences

6.2.1 Affect

What is affect? The simple answer is that affect has something to do with feeling. Perhaps a more satisfactory answer is that affect is a broad label for a range of psychological phenomena involving feelings, ranging from primitive feelings like hunger pangs to more complex social emotions like jealousy and pride. A more technical answer is that affect is a multicomponential construct (i.e., conceptual entity) that operates across neurobiological, physiological, behavioral, cognitive, metacognitive, and phenomenological levels [Barrett 2014, Lewis 2005, Mesquita and Boiger 2014, Scherer 2009].

It is with good reason that none of these answers seem particularly satisfactory. The term affect (or emotion) has resisted attempts at crisp definition despite a century of concentrated effort [Izard 2007, 2010]. Understanding what emotions are and how they arise has been a contentious issue in the affective sciences and is sometimes referred to as the "hundred year emotion war" [Lench et al. 2013, Lindquist et al. 2013]. For example, there has been an ongoing debate as to whether affect is best represented via discrete categories (e.g., angry, fearful) [Lerner and Keltner 2000, Loewenstein and Lerner 2003] or by fundamental dimensions (e.g., valence, arousal, power) [Cowie et al. 2012, Russell 2003] (and on how many dimensions are needed [Fontaine et al. 2007]). Other open issues pertain to whether emotions are innate or learned, whether they arise via appraisals/reappraisals or are products of socio-constructivism, and whether emotions are universally expressed or if context and culture shape emotion expression [Barrett 2006, 2007, Ekman 1992, 1994, Gross and Barrett 2011, Izard 1994, 2010].

Does the fact that we cannot precisely define affect imply that we cannot detect it? In our view, one does not need to precisely define a phenomenon in order to study it. However, researchers need to be mindful of the implicit assumptions in their operationalizations of affect as these are transferred to the affect detectors. For example, if one operationalizes anger as short-term emotional changes recorded while people view anger-eliciting films in isolation and builds an automated anger detector from these recordings, then the detector's estimates of anger are inherently coupled to this precise operationalization and not much else (e.g., felt anger, anger in a road-rage scenario, anger in a social context). Thus, it is important to be mindful that measurement is informed by assumptions of reality (operationalizations), which, in turn, are informed by insights gleaned by measurement.

6.2.2 The Affective Experience-Expression Link

Affect detection assumes a link between experienced (or felt) and expressed affect. Thus, it should be theoretically possible to "decode" latent affect (e.g., confusion) from visible behaviors (e.g., a furrowed brow). This suggests that there exist "mappings" between a set of behaviors (e.g., facial features, gestures, speech patterns) and a set of affective states. This does not mean that one simply needs to learn the mappings to perfectly solve the affect detection problem because the mappings are imprecise. For example, although facial expressions are considered to be strongly associated with affective states, meta-analyses on correlations between facial expressions and affect have yielded small to medium effects under naturalistic conditions [Camras and Shutter 2010, Fridlund et al. 1987, Ruch 1995, Russell 2003]. In the interest of maximizing adaptability to new situations and environments, the mappings have evolved to be loose and variable, not fixed and rigid [Coan 2010, Roseman 2011, Tracy 2014]. Thus, rather than being predefined, the affect-expression links emerge from dynamic interactions between internal processes and the environmental context. Some of these influences include the internal state of the individual, contextual and social factors [Parkinson et al. 2004], and individual and group (or cultural) differences [Elfenbein and Ambady 2002a, 2002b].


At first blush, the lack of a precise experience-expression link seems to threaten the entire affect detection endeavor. But this is not the case. In our view, it is sufficient to assume that there is some link between experience and expression. The link need not be particularly strong. The link need not even be consistent across individuals, situations, and cultures. The only assumption is that there is a “beyondchance probabilistic” [Roseman 2011 (p. 440)] link between affect expression and experience. Most affect detection systems rely on supervised learning methods to learn this link. Supervised learning needs supervision in the form of “ground truth” (annotations) which bring us to the question of “What is affective ground truth?”

6.2.3 Affective Ground Truth

Consider speech recognition, where the task is to translate an acoustic representation into a linguistic representation of speech. There is usually little dispute about the desired output (i.e., the words being spoken). But this is rarely the case with affect detection as affect is a psychological construct (see above). One exception is when the affective states are portrayed by actors or are experimentally induced [Kory and D'Mello 2015]. Here, the acted/induced affect can be taken as ground truth, but the resultant expressions more closely resemble the acting/eliciting micro-context and might not generalize more broadly (see also Chapter 8).

There is no objective ground truth in the case of naturally occurring affective states. Instead, the truth lies in the eyes of the beholder. The beholder, in the case of humans, is the person experiencing the emotion (the self) or an external observer. Each has access to different sources of information and is subject to different biases, thereby arriving at different approximations of "ground truth." As noted above, affective states are multicomponential in that they encompass conscious feelings ("I feel afraid"), overt actions ("I freeze"), physiological/behavioral responses ("My muscles clench"), and meta-cognitive reflections ("I am a coward"). Access to these components varies by source (self vs. observer). The self has access to some conscious feelings, some overt actions, memories of the experience, and meta-cognitive reflections, but usually not to some of the unconscious affective components. They are also more likely to distort or misrepresent their affective states due to biases, such as reference bias [Heine et al. 2002] or social desirability bias [Krosnick 1999]. In contrast, observers only have access to overt actions and behaviors that can be visibly perceived (e.g., facial features, postures, gestures) and must rely more heavily on inference [Mehu and Scherer 2012]. Observers are less likely to succumb to the same biases that befall self-reports, but they introduce biases of their own, such as the halo effect [Podsakoff et al. 2003]. There are strengths and pitfalls of reliance on either the self or external observers to establish affective "ground truth" [D'Mello 2016]. Therefore, perhaps the most defensible position is to consider a combination of perspectives, thereby capitalizing on their merits while minimizing their flaws.

6.2.4 Multimodal Coordination of Affective Responses

Consider the following quote from William James in his classic 1884 treatise, "What is an emotion?"

    "Can one fancy the state of rage and picture no ebullition of it in the chest, no flushing of the face, no dilatation of the nostrils, no clenching of the teeth, no impulse to vigorous action, but in their stead limp muscles, calm breathing, and a placid face?" [James 1884 (p. 452)]

Quotes such as the one above by James [1884] and similar ones by Darwin [1872], Tomkins [1962], Ekman [1992], Damasio [2003], and others depict affective responses as being inherently multimodal. According to the classical model of emotion (called basic emotion theory), there is a specialized circuit for each (basic) emotion in the brain. Upon activation, this circuit triggers a host of coordinated responses encompassing peripheral physiology, facial expression, speech, modulations of posture, affective speech, instrumental action, cognitions, and subjective experience [Ekman 1992, Izard 2007]. According to this view, MMAD should be substantially more accurate than UMAD because MMAD approaches model this coordinated emotional response.

In contrast to this highly integrated, tightly coupled, central executive view of emotion, researchers have recently argued in favor of a disparate, loosely coupled, distributed perspective [Coan 2010, Lewis 2005]. Here, there is no central affect neural circuit [Lindquist et al. 2011, 2016] that coordinates the various components of an emotional episode. Instead, these components are loosely coupled, and the situational context and appraisals determine which bodily systems are activated and the dynamics of activation over time. These theories would accommodate the prediction that a combination of modalities might conceivably yield only small improvements in classification accuracies, suggesting that the merits of MMAD over UMAD approaches might not necessarily lie in improved classification accuracy, but in other factors (e.g., increased reliability due to redundancy). We consider the extent to which the data support each of these views later in the chapter. The reader is also directed to Chapter 7 for a discussion of the conditions under which multimodal communication should be expected to yield benefits over unimodal signaling.

There is also a parallel line of work focused on human perception of affect from unimodal and multimodal cues expressed by both humans [D'Mello et al. 2013] and virtual agents (see Chapter 9), which could establish baselines for what machines might be capable of achieving.

6.3 Modality Fusion for Multimodal-Multisensor Affect Detection

Figure 6.1 highlights our theoretical position on affective states (see previous section), which informs the steps involved in building an affect detector. Affective states are assumed to emerge from person-environment interactions and are reflected in changes at multiple levels (i.e., neurobiological changes, physiological responses, bodily expressions, action tendencies, and cognitive, metacognitive, and phenomenological states) in a manner that is modulated by individual differences (e.g., affective traits, culture). Researchers typically adopt a machine learning approach for affect detection, which requires the collection of training and validation data. Accordingly, in Step 1a, raw signals (video, physiology, event log files, etc.) are recorded as participants engage in some interaction of interest (including experimental elicitation). Features are then computed from the raw signals (Step 1b). Affect annotations (Steps 2a and 2b) are obtained from the participants themselves or from external observers, either online (e.g., live observations) or offline (e.g., video coding). If affect is experimentally induced, then the elicited condition serves as the annotation. Next, machine learning methods (typically supervised learning) are used to computationally model the relationship between the features and the affect annotations (Step 3). The models can also include contextual information, including both external context (e.g., situational aspects, task constraints, social environment) and internal context (e.g., previous affect predictions). The resulting machine-learned model yields computer-generated annotations, which are compared to the human-provided annotations in a validation step (Step 4). Once validated, the computational model can now produce computer-generated affect annotations from a new set of raw signals without corresponding human-provided annotations.

The basic affect detection approach needs an update when multiple modalities and/or sensors are involved. The key issue pertains to how to synchronize and combine (fuse) the different information channels (modalities). In the remainder of this section, we explore a variety of methods for this task. Alternate fusion methods, specifically for online affect detection, are discussed in Chapter 8.

[Figure 6.1 Theoretical foundation and steps involved in affect detection.]

[Figure 6.2 Illustration of feature-level fusion.]
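To make the four steps concrete, the sketch below strings them together as a minimal scikit-learn pipeline. It is illustrative only: the random arrays stand in for real per-segment features and annotations, the two `extract_*` helpers named in the comments are hypothetical, and the only substantive point is that validation (Step 4) is carried out person-independently via grouped cross-validation.

```python
# Minimal sketch of the four-step affect detection pipeline (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Step 1a/1b: raw signals -> feature vectors, one row per annotated segment.
# X_face = extract_face_features(video_files)    # hypothetical helper (e.g., action unit likelihoods)
# X_audio = extract_audio_features(audio_files)  # hypothetical helper (e.g., prosodic statistics)
rng = np.random.default_rng(0)
X_face, X_audio = rng.normal(size=(200, 20)), rng.normal(size=(200, 12))  # stand-in features

# Step 2: affect annotations (self-report, observers, or elicitation condition), one label per segment.
y = rng.integers(0, 2, size=200)               # stand-in binary labels (e.g., frustrated vs. not)
participants = rng.integers(0, 25, size=200)   # participant id for each segment

# Step 3: machine learning on the (here, feature-level fused) features.
X = np.hstack([X_face, X_audio])
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Step 4: validation; GroupKFold keeps each participant's data within a single fold (person-independent).
scores = cross_val_score(model, X, y, groups=participants,
                         cv=GroupKFold(n_splits=5), scoring="roc_auc")
print("person-independent AUC: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```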

6.3.1 Basic Methods (Data, Feature, Decision, and Hybrid Fusion)

The most basic method for fusing modalities is data-level or stream-level fusion. Here, raw signals are first fused before computing features. For example, one might record electrodermal activity (EDA) from multiple sensors to compensate for left-right EDA asymmetry [Picard et al. 2015] and then fuse the two signals (e.g., via convolution) prior to computing features. The next basic method is feature-level fusion (or early fusion), where features from different modalities are concatenated prior to machine learning (see Figure 6.2). The primary advantage of feature-level fusion is its simplicity, and it can be effective when features from individual modalities are independent and the temporal dependencies among modalities are minimal.
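The difference between data-level and feature-level fusion is easiest to see side by side. The sketch below is a toy illustration under simplifying assumptions: elementwise averaging stands in for whatever signal-level combination (e.g., convolution) is appropriate, and the two-number EDA feature set is invented purely for brevity.

```python
import numpy as np

def eda_features(signal):
    """Toy feature set for an electrodermal activity stream: mean level and a crude peak count."""
    interior = signal[1:-1]
    peaks = np.sum((interior > signal[:-2]) & (interior > signal[2:]) & (interior > signal.mean()))
    return np.array([signal.mean(), float(peaks)])

rng = np.random.default_rng(1)
eda_left, eda_right = rng.normal(size=640), rng.normal(size=640)   # two EDA sensors (stand-in data)
audio_features = rng.normal(size=8)                                # features from another modality

# Data-level (stream-level) fusion: combine the raw signals first, then compute features.
fused_stream = (eda_left + eda_right) / 2.0
data_level = eda_features(fused_stream)

# Feature-level (early) fusion: compute features per modality, then concatenate into one vector.
feature_level = np.concatenate([eda_features(eda_left), eda_features(eda_right), audio_features])
print(data_level.shape, feature_level.shape)   # (2,) and (12,)
```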

[Figure 6.3 Decision-level fusion with two modalities.]

[Figure 6.4 Hybrid fusion with two modalities.]

An alternative is decision-level fusion (or late fusion; Figure 6.3), where separate models are trained for each modality. The final decision is made by fusing the outputs of the models corresponding to each modality via majority voting, by weighting votes according to the accuracy of each model, or by training a new classifier that uses the outputs of each model as features (stacking). It is also possible to combine feature- and decision-level fusion, as illustrated in Figure 6.4. The resultant method, called hybrid fusion, is expected to capitalize on the merits of each approach.
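A minimal sketch of both late-fusion variants and of hybrid fusion follows, assuming two hypothetical feature sets (face and audio) and synthetic labels. For clarity, the stacking meta-classifier is fit on in-sample predictions; in practice one would train it on held-out predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_face, X_audio = rng.normal(size=(300, 10)), rng.normal(size=(300, 6))   # stand-in per-modality features
y = rng.integers(0, 2, size=300)

Xf_tr, Xf_te, Xa_tr, Xa_te, y_tr, y_te = train_test_split(X_face, X_audio, y,
                                                          test_size=0.3, random_state=0)

# One classifier per modality.
clf_face = SVC(probability=True).fit(Xf_tr, y_tr)
clf_audio = SVC(probability=True).fit(Xa_tr, y_tr)
p_face = clf_face.predict_proba(Xf_te)[:, 1]
p_audio = clf_audio.predict_proba(Xa_te)[:, 1]

# Decision-level fusion, variant 1: (weighted) averaging of the per-modality posteriors.
late_avg = (0.5 * p_face + 0.5 * p_audio) > 0.5

# Decision-level fusion, variant 2: stacking -- a meta-classifier learns how to combine the outputs.
meta_inputs = np.column_stack([clf_face.predict_proba(Xf_tr)[:, 1],
                               clf_audio.predict_proba(Xa_tr)[:, 1]])
meta = LogisticRegression().fit(meta_inputs, y_tr)          # simplification: fit on in-sample outputs
late_stacked = meta.predict(np.column_stack([p_face, p_audio]))

# Hybrid fusion: a feature-level model's output is combined with the per-modality decisions.
clf_early = SVC(probability=True).fit(np.hstack([Xf_tr, Xa_tr]), y_tr)
p_early = clf_early.predict_proba(np.hstack([Xf_te, Xa_te]))[:, 1]
hybrid = (p_early + p_face + p_audio) / 3.0 > 0.5
```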


6.3.2 Model-based Fusion with Dynamic Bayesian Networks (DBNs) and Hidden Markov Models (HMMs)

The aforementioned basic fusion methods are limited in that they do not account for temporal relationships among modalities. There are more sophisticated fusion methods, but these also ignore temporal dependencies. For example, in a support vector machine classifier, a kernel function is used to map input data to a higher-dimensional space. Multimodal fusion can be achieved by tuning a different kernel for each modality (feature space) and mapping them all into the same higher-dimensional feature space [Liu et al. 2014]. A limitation, however, is that these methods do not afford modeling of temporal dependencies, which is critical for MMAD. Model-based fusion methods model temporal dependencies as well as other relationships, as illustrated below with two widely used graphical models: Dynamic Bayesian Networks and Hidden Markov Models.

Dynamic Bayesian Networks (DBNs) are a common graphical model used for modality fusion in affect detection. Links between variables in DBNs represent conditional dependencies between features as well as relationships across time. Figure 6.5 shows a DBN that fuses two modalities along with contextual (top-down) features, with Affect being the output variable. Top-down features (e.g., age, context factors) influence affect, but do not change from one timestep to the next. Bottom-up features, such as facial expressions and bodily movements, are linked across time. Affect also evolves across time [D'Mello and Graesser 2011], so the Affect variable is linked across timesteps. Bayesian inference is used to compute the probability of the output Affect variable given the top-down (predictive) and bottom-up (diagnostic) features [Conati and Maclaren 2009].

DBNs have successfully been used in several MMAD systems. Li and Ji [2005] fused a variety of modalities including facial expressions, eye gaze, and top-down features (physical condition and time in circadian rhythm) to detect fatigue, nervousness, and confusion. Chen et al. [2009] detected anger, happiness, meanness, sadness, and neutral using a DBN to fuse audio and visual features. Jiang et al. [2011] expanded that work to detect a larger set of affective states including anger, disgust, fear, happiness, sadness, and surprise, using a similar DBN. In general, DBNs are quite flexible, allowing any structure of relationships between variables and across time. However, more complex DBN structures require considerably more training data to estimate the various parameters, so in practice relatively simple structures like Figure 6.5 are used.

One such structure is a Hidden Markov Model (HMM), which models affect as a hidden variable that influences observable variables (e.g., anger influencing skin conductance and heart rate). Coupled Hidden Markov Models (CHMMs) combine two or more HMMs (one per modality), such that the hidden states (representing affect) of the individual HMMs interact across time (see Figure 6.6). These cross-modal links in a CHMM are chosen to model temporal relationships between modalities that might operate at different time scales (e.g., heart rate vs. facial expressions). As an example, Lu and Jia [2012] used a CHMM to combine audio and video HMMs to detect affect represented in an evaluation-activation (valence-arousal) space.

CHMMs capture the temporal relationships between modalities, but consider each modality as a whole. Semi-coupled Hidden Markov Models (SCHMMs) extend the structure of CHMMs by coupling modalities at the feature level. Component models are created for each pair of features, resulting in a large number of small models which are subsequently combined by late fusion. The main advantage of the SCHMM approach is that it allows the temporal relationships to vary per feature pair. Lin et al. [2012] demonstrated that SCHMMs were effective for recognizing affect on two audio-visual datasets, one with evaluation-activation dimensions and one with anger, happiness, sadness, and neutral. They found that SCHMMs outperformed standard CHMMs on both datasets.

[Figure 6.5 Dynamic Bayesian network model fusing two modalities and top-down features.]

[Figure 6.6 Coupled hidden Markov model for two modalities.]
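The temporal bookkeeping that these graphical models add is easiest to see in a stripped-down form. The sketch below is not a full DBN or coupled HMM; it is a single-chain forward filter in which one hidden affect variable evolves under a transition matrix and is updated at each step by two conditionally independent observation modalities. All probabilities are invented for illustration.

```python
import numpy as np

states = ["neutral", "frustrated"]
T = np.array([[0.9, 0.1],       # P(affect_t | affect_{t-1}); rows: previous state
              [0.2, 0.8]])
# One emission table per modality: P(observation | affect).
E_face = np.array([[0.7, 0.3],  # rows: affect state; columns: {no frown, frown}
                   [0.3, 0.7]])
E_eda = np.array([[0.8, 0.2],   # columns: {low skin conductance, high skin conductance}
                  [0.4, 0.6]])

def forward_filter(face_obs, eda_obs, prior=np.array([0.5, 0.5])):
    """Return P(affect_t | observations up to t) for each time step (forward algorithm)."""
    belief, history = prior, []
    for f, e in zip(face_obs, eda_obs):
        belief = T.T @ belief                         # temporal prediction
        belief = belief * E_face[:, f] * E_eda[:, e]  # multimodal evidence (independent given affect)
        belief = belief / belief.sum()                # normalize
        history.append(belief)
    return np.array(history)

posteriors = forward_filter(face_obs=[0, 1, 1, 1], eda_obs=[0, 0, 1, 1])
for t, p in enumerate(posteriors):
    print(f"t={t}: P(frustrated)={p[1]:.2f}")
```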

6.3.3 Modality Fusion with Neural Networks and Deep Learning

Neural networks have emerged as another popular approach for modality fusion. One particularly prominent type of network is the long short-term memory (LSTM) neural network [Hochreiter and Schmidhuber 1997]. In LSTMs, the artificial neurons in the hidden layers are replaced by memory cells, which allow the network to maintain longer temporal sequences. Thus, they improve on feed-forward neural networks by incorporating temporal information while avoiding the vanishing gradient problem of recurrent neural networks. Bi-directional LSTMs or BLSTMs are a further extension that model both past and future information. Figure 6.7 shows a BLSTM network in which hidden layers are connected both forwards and backwards. Features from individual modalities are concatenated in the input layer in LSTMs or BLSTMs. However, we do not consider this to be feature-level fusion as the hidden layers maintain a sophisticated internal model of the incoming data and the network's internal context. LSTMs and BLSTMs have been successful with modalities such as speech where longer context can provide significant discriminative power. For example, Eyben et al. [2010] fused acoustic and linguistic features in a BLSTM to classify affect in an evaluation-activation space, finding that it outperformed a basic recurrent neural network. Ringeval et al. [2015a] fused video, audio, and physiology and showed advantages of LSTMs and BLSTMs compared to feed-forward neural networks (this study is discussed in more detail below).

More recently, deep neural networks are being increasingly used for modality fusion in MMAD systems [Le Cun et al. 2015]. Deep networks contain multiple hidden layers and are capable of learning feature representations from raw data. For example, Kahou et al. [2013] used deep neural networks to classify affect from several modalities including video and audio. They first trained separate deep networks for each modality, then fused the networks together by weighting each network in a final prediction. The extremely large amounts of data required for deep learning are difficult to acquire in affect detection applications. However, a two-step approach can be employed to decrease the need for large affect databases (although this is more common for video than for other modalities). First, deep networks that have been trained for more general classification tasks (e.g., object recognition) are obtained (presumably one for each modality). Second, affect detectors are developed by combining the last few hidden layers from each deep network into a new final layer and training that final layer using affect databases (example network in Figure 6.8). This method utilizes the sparse feature representations that have been learned by the deep networks in their deeper hidden layers without requiring prohibitively large affect databases, and can be considered a form of transfer learning. For example, Ng et al. [2015] found a 16% improvement by fine-tuning an object recognition deep network using multiple affect databases versus training on only one affect database.

[Figure 6.7 BLSTM with memory cells in each hidden layer.]
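As a concrete (if minimal) rendition of this kind of fusion, the PyTorch sketch below concatenates per-modality feature sequences frame by frame and passes them through a single bidirectional LSTM layer with a linear output head. The feature dimensions and random input tensors are arbitrary placeholders rather than values taken from any of the studies cited above.

```python
import torch
import torch.nn as nn

class BLSTMFusion(nn.Module):
    """Minimal bidirectional LSTM that fuses concatenated per-modality feature sequences."""
    def __init__(self, audio_dim=40, video_dim=30, physio_dim=10, hidden=64, n_outputs=2):
        super().__init__()
        self.blstm = nn.LSTM(input_size=audio_dim + video_dim + physio_dim,
                             hidden_size=hidden, num_layers=1,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_outputs)   # forward + backward hidden states

    def forward(self, audio, video, physio):
        # Each input: (batch, time, features); modalities are concatenated frame by frame.
        x = torch.cat([audio, video, physio], dim=-1)
        out, _ = self.blstm(x)
        return self.head(out)      # one prediction per time step (e.g., valence and arousal)

model = BLSTMFusion()
batch, time = 4, 100
pred = model(torch.randn(batch, time, 40), torch.randn(batch, time, 30), torch.randn(batch, time, 10))
print(pred.shape)   # torch.Size([4, 100, 2])
```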

[Figure 6.8 Fusion of deep neural networks by re-training final layers from networks representing each modality.]

6.4 Walk-throughs of Sample Multisensor-Multimodal Affect Detection Systems

We present three walk-throughs to serve as concrete renditions of MMAD systems. The walk-throughs were selected to emphasize the wide variability of research in the area and to highlight the various challenges and design decisions facing MMAD systems.

6.4.1 Walk-through 1—Feature-level Fusion for Detection of Basic Emotions

Our first walk-through was concerned with detection of emotions elicited through an affect elicitation procedure. Janssen et al. [2013] compared automatic detection vs. human perception of three basic emotions (happy, sad, angry), relaxed, and neutral, which were induced via an autobiographical recall procedure [Baker and Guttfreund 1993]. According to this procedure, 17 stimulus subjects were asked to write about two events in their life associated with experiences of these emotions. They were then asked to recall a subset of those events in a way that made them relive the emotions experienced. They then verbally described each event (in Dutch) in 2–3 minute trials. Audio, video, and physiological signals (electrodermal activity, skin temperature, respiration, and electrocardiography) were recorded while the stimulus subjects recalled and described the events. Each recording was associated with the label of the corresponding emotion being recalled, which was taken to be the "ground truth."

The authors extracted a variety of features from the signals. Facial features included movement of automatically tracked facial landmarks around the mouth and the eyes, as well as head position. Standard acoustic-prosodic features (e.g., fundamental frequency (pitch), energy, jitter, shimmer, formants) were extracted from the speech signal. Example physiological features included respiration rate, interbeat intervals, mean skin temperature, and number of skin conductance responses. A support vector machine classifier was trained to discriminate among the elicited emotions (five-way classification) using features from the individual modalities as well as from feature-level modality fusion and best-first search (see Figure 6.9). The multimodal model obtained a classification accuracy of 82%, which was greater than the individual modalities: 39% for audio, 59% for video, and 76% for physiology.

The authors compared computer vs. human affect detection accuracy. This was done by asking a set of human judges to classify the elicited emotions based on various stimuli combinations (audio-only, video-only, audio-video). Both U.S. and Dutch judges were used, but we only report results from the Dutch judges since they match the stimulus subjects. The Dutch judges were the most accurate (63%) when provided with audio (which was also in Dutch), compared to video (36%) and combined audio-video (48%). However, their accuracy was considerably lower than that of the automated detector (82%), although this result should be interpreted with caution as the testing protocols may have been biased in favor of the computer because strict person-level independence between training and testing sets was not enforced. Nevertheless, this remains one of the few studies that has contrasted human vs. machine classification on a multimodal dataset.

[Figure 6.9 Schematic for walk-through 1.]
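A rough sketch in the spirit of this walk-through is shown below, using synthetic stand-ins for the fused facial, acoustic-prosodic, and physiological features. scikit-learn's greedy forward selection is used as a stand-in for the best-first search employed by the authors, and (for brevity) feature selection is not nested inside the outer, speaker-independent cross-validation as it should be in a rigorous evaluation.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 170                                       # e.g., 17 speakers x 10 recall trials (stand-in data)
X = np.hstack([rng.normal(size=(n, 12)),      # facial landmark movement features
               rng.normal(size=(n, 10)),      # acoustic-prosodic features (pitch, energy, jitter, ...)
               rng.normal(size=(n, 8))])      # physiological features (IBI, EDA, respiration, temp.)
y = rng.integers(0, 5, size=n)                # five classes: happy / sad / angry / relaxed / neutral
speaker = np.repeat(np.arange(17), 10)        # speaker id per trial

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# Greedy forward selection over the fused feature vector (a stand-in for best-first search).
selector = SequentialFeatureSelector(svm, n_features_to_select=10, direction="forward", cv=3)
X_sel = selector.fit_transform(X, y)

# Speaker-independent accuracy of the five-way classifier on the selected multimodal features.
acc = cross_val_score(svm, X_sel, y, groups=speaker, cv=GroupKFold(n_splits=5), scoring="accuracy")
print("accuracy: %.2f" % acc.mean())
```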

6.4.2 Walk-through 2—Decision-level Fusion for Detection of Learning-centered Affective States

Our second walk-through focuses on multimodal affect detection in a computer-enabled classroom [Bosch et al. 2015b]. The researchers collected training data from 137 (8th and 9th grade) U.S. students who learned from a conceptual physics educational game called Physics Playground [Shute et al. 2013]. Students played the game in two 55-min sessions across two days. Trained observers performed live annotations of boredom, engaged concentration, confusion, frustration, and delight using the Baker-Rodrigo Observation Method Protocol (BROMP) [Ocumpaugh et al. 2012]. According to BROMP, the live annotations were based on observable behavior, including explicit actions towards the interface, interactions with peers and teachers, body movements, gestures, and facial expressions. The observers had to achieve a kappa of 0.6 (inter-rater reliability) with an expert to be certified as BROMP coders. Videos of students' faces and upper bodies and log files from the game were recorded and synchronized with the affect annotations. The videos were processed using FACET—a computer-vision program [FACET 2014] that estimates the likelihood of 19 facial action units along with head pose and position. Body movement was also estimated from the videos using motion filtering algorithms [Kory and D'Mello 2015]. Supervised learning methods were used to discriminate each affective state from the other states (e.g., boredom vs. confusion, frustration, engaged concentration, and delight) and were validated by randomly assigning students into training and testing sets across multiple iterations. The models yielded an average accuracy of 0.69 (measured with the area under the receiver operating characteristic curve (AUROC or AUC), where a chance model would yield a value of 0.5). Follow-up validation analyses confirmed that the models generalized across multiple days (i.e., training on a subset of students from day 1 and testing on different students on day 2), class periods, genders (i.e., training on males, testing on females and vice versa), and ethnicity as perceived by human coders [Bosch et al. 2016].

A limitation of video-based measures is that they are only applicable when the face can be detected in the video. This is not always the case outside of the lab, where there are occlusions, poor lighting, and other complicating factors. In fact, the face could only be detected about 65% of the time in this study. To address this, Bosch et al. [2015a] developed an additional computational model based on interaction/contextual features stored in the game log files (e.g., difficulty of the current game level, the student's actions, the feedback received, response times). The log-based models were less accurate (mean AUC of .57) than the video-based models (mean AUC of .67 after retraining), but could be applied in almost all of the cases. Separate logistic regression models were trained to adjudicate among the face- and log-based models, essentially weighting their relative influence on the final outcome via stacking (see Figure 6.10). The resultant multimodal model was almost as accurate as the video-based model (mean AUC of .64 for multimodal vs. .67 for face only), but was applicable almost all of the time (98% for multimodal vs. 65% for face only). These results are notable given the noisy nature of the real-world environment, with students incessantly fidgeting, talking with one another, asking questions, and even occasionally using their cellphones. They also illustrate how an MMAD approach addressed a substantial missing data problem despite not improving detection accuracy.

[Figure 6.10 Schematic for walk-through 2.]
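The adjudication step can be sketched as follows, with made-up posterior scores standing in for the face- and log-based detectors: a logistic regression combines both channels when the face is visible, and the log-based posterior is used on its own otherwise, which is what lets the fused detector stay applicable when the video channel drops out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 500
p_face = rng.uniform(size=n)                   # posterior from a face-based detector (stand-in values)
p_log = rng.uniform(size=n)                    # posterior from an interaction/log-file detector
face_ok = rng.uniform(size=n) < 0.65           # face detected in roughly 65% of observations
y = (rng.uniform(size=n) < 0.3).astype(int)    # affect annotation (e.g., confused vs. not)

# Stacked adjudication: where the face is visible, a logistic regression weighs both detectors.
meta = LogisticRegression().fit(np.column_stack([p_face[face_ok], p_log[face_ok]]), y[face_ok])
p_both = meta.predict_proba(np.column_stack([p_face, p_log]))[:, 1]
fused = np.where(face_ok, p_both, p_log)       # fall back to the log-based model when the face is missing

print("applicable: face-only %.0f%%, fused %.0f%%" % (100 * face_ok.mean(), 100.0))
print("AUC (fused, all cases): %.2f" % roc_auc_score(y, fused))
```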

6.4.3 Walk-through 3—Model-based Fusion for Modeling of Affective Dimensions

The previous two case studies focused on detecting discrete affective states with feature- or decision-level fusion. Our third walk-through used a neural network for modality fusion in the course of modeling time-continuous annotations of valence (unpleasant to pleasant) and arousal (sleepy to active) [Ringeval et al. 2015a]. The authors recorded audio, facial video, electrocardiogram (ECG), and electrodermal activity (EDA) as dyads completed a "winter survival" collaborative task. A total of 46 participants completed the task, of whom 34 provided permission for their data to be used. Data from a further 7 participants had recording errors, yielding a final data set of 27 participants. Six observers annotated the first 5 min of each participant's data by providing time-continuous ratings of valence and arousal. The recordings and annotations are distributed as part of the RECOLA dataset [Ringeval et al. 2013], which has been used in recent MMAD challenges [Ringeval et al. 2015b].

A variety of features were extracted from each of the modalities (audio, video, ECG, and EDA). Audio features captured spectral, prosodic, and voice quality metrics. Video features included 15 automatically extracted facial action units (AUs) and head pose. ECG features primarily consisted of heart rate, heart rate variability, and spectral features. EDA features mainly emphasized changes in skin conductance. LSTM and BLSTM networks (as discussed above) were trained to estimate continuous valence and arousal annotations by fusing features from the various modalities (Figure 6.11). The networks were validated in a person-independent fashion. The concordance correlation rc (combining Pearson's r and mean squared error) was used to measure model accuracy.

The authors performed several experiments, including both early and late fusion and various combinations of modalities; here we focus on each feature (from any modality or combination) being an input node in the network. The best model-level fusion achieved an rc of .769 for arousal and an rc of .492 for valence. These best results were obtained using a combination of audio and video features. Further, when compared to standard feed-forward neural networks, the BLSTM models were more accurate across shorter windows of time (2–3 secs) but accuracy was comparable across longer windows (4–5 secs). Finally, when compared to individual modalities, there was a multimodal advantage for valence (rc = .492 vs. .431), but not for arousal (rc = .769 vs. .788), once again highlighting the selective conditions under which MMAD led to improvements over UMAD.

[Figure 6.11 Schematic for walk-through 3.]
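The concordance correlation coefficient used here (and in the AV+EC challenges discussed below) can be computed directly from its definition: a Pearson-style covariance in the numerator, with mismatches in mean and variance penalized in the denominator. A small reference implementation:

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient: Pearson correlation penalized for bias in mean and scale."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()                 # population variances
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

gold = np.sin(np.linspace(0, 6, 200))               # stand-in time-continuous arousal trace
print(concordance_cc(gold, gold))                    # 1.0: perfect agreement
print(round(concordance_cc(0.5 * gold + 0.2, gold), 3))  # below Pearson's r due to scale/offset bias
```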

6.5 General Trends and State of the Art in Multisensor-Multimodal Affect Detection

D'Mello and Kory [2015] recently performed a review and meta-analysis of 90 MMAD systems. We highlight some of their key findings, both in terms of trends in MMAD system design as well as classification accuracy of MMAD vs. UMAD. Table 6.1 lists a subset (about 1/3) of the more recent studies (2011 to 2013) reviewed in D'Mello and Kory [2015] along with a few more recent studies (2014–) published since their review.

6.5.1 Trends in MMAD systems

D'Mello and Kory [2015] coded each MMAD system across a number of dimensions, such as whether the training data consisted of acted, induced, or naturalistic affective expressions, the specific modality combinations used, the most successful fusion method, and so on. Below are some of the highlights of MMAD as of 2013.

Table 6.1 Selective sample of recent MMAD systems in the D'Mello and Kory [2015] review (2011 to 2013), further extended to include more recent systems

Reference | Modalities | Fusion
[Chanel et al. 2011] | EEG + Physiology | Decision
[Datcu and Rothkrantz 2011] | Face + Voice | Feature
[Jiang et al. 2011] | Face + Voice | Model
[Lingenfelser et al. 2011] | Face + Voice | Decision
[Nicolaou et al. 2011] | Face + Voice + Body | Model
[Schuller 2011] | Voice + Text | Feature
[Vu et al. 2011] | Voice + Body | Decision
[Wagner et al. 2011] | Face + Voice + Body | Decision
[Walter et al. 2011] | Voice + Physiology | Decision
[Wu and Liang 2011] | Voice + Text | Decision
[Hussain et al. 2012] | Face + Physiology | Decision
[Koelstra et al. 2012] | EEG + Physiology + Content | Decision
[Lin et al. 2012] | Face + Voice | Model
[Lu and Jia 2012] | Face + Voice | Model
[Metallinou et al. 2012] | Face + Voice | Model
[Monkaresi et al. 2012] | Face + Physiology | Feature
[Park et al. 2012] | Face + Voice | Decision
[Rozgic et al. 2012] | Face + Voice + Text | Feature
[Savran et al. 2012] | Face + Voice + Text | Model
[Soleymani et al. 2012] | EEG + Gaze | Decision
[Baltrušaitis et al. 2013] | Face + Voice | Model
[Dobrišek et al. 2013] | Face + Voice | Decision
[Glodek et al. 2013] | Face + Voice | Decision
[Hommel et al. 2013] | Face + Voice | Decision
[Krell et al. 2013] | Face + Voice | Decision
[Rosas et al. 2013] | Face + Voice + Text | Feature
[Wang et al. 2013] | EEG + Content | Feature
[Wöllmer et al. 2013a] | Face + Voice | Model
[Wöllmer et al. 2013b] | Face + Voice + Text | Hybrid
[Williamson et al. 2014] | Face + Voice | Decision
[Grafsgaard et al. 2014] | Face + Posture + Interaction | Feature
[Soleymani et al. 2014] | Face + EEG | Model
[Bosch et al. 2015a] | Face + Interaction | Decision
[Zhou et al. 2015] | Face + Interaction + Content | Feature
[Barros et al. 2015] | Face + Body | Model
[Monkaresi et al. 2017] | Face + Remote Physiology | Decision

Note. Physiology refers to one or more peripheral physiological channels such as electrodermal activity, heart rate variability, etc.

- MMAD systems were trained on small samples. The studies had on average 21 participants and 97% of the studies had fewer than 50 participants.
- Training data for about half the studies were obtained by actors portraying affective expressions. Affective states were induced in 28% of the studies using validated elicitation methods [Coan and Allen 2007].
- Very few studies (20%) used naturalistic affective states (i.e., affective states that spontaneously arise as part of an interaction).
- In terms of MMAD, bimodal systems were far more common (87%) than trimodal systems (13%). The face and voice (paralinguistics) were the two most frequent modalities, each occurring in over 75% of the studies. By comparison, peripheral physiology was only used in 11% of the systems and other modalities (e.g., eye tracking) were much rarer.
- About a third of the studies (37%) focused on detecting the basic emotions of anger, fear, happiness, sadness, disgust, and surprise [Ekman 1992] or core affective dimensions of valence and arousal (28%). Very few studies focused on detecting additional affect dimensions, such as dominance or certainty [Fontaine et al. 2007], or nonbasic affective states like confusion and curiosity [D'Mello and Calvo 2013].
- Feature-level (39%) and decision-level (35%) fusion were much more common than hybrid (6%) and model-level fusion (20%).
- A vast majority of studies employed instance-level validation (62%), where different instances from the same person were in both training and test sets, essentially limiting generalizability to new individuals.

6.5.2 Accuracy of MMAD Systems

How accurate are MMAD systems compared to their unimodal affect detection (UMAD) counterparts? D'Mello and Kory [2015] addressed this question by computing the percent improvement in classification accuracy of each MMAD system compared to the best UMAD system (called MM1 effects). They also investigated factors that moderated MM1 effects. Their key findings indicated that:

- On average, MMAD yielded a 10% improvement in affect detection accuracy over the best UMAD counterpart. There were negative or negligible MM1 effects for some systems and large (> 10%) improvements for others.
- The median MM1 effect of 7% might be a more accurate estimate given the spread of the distribution.
- There was a very robust correlation (Pearson's r = .87) between best UMAD and MMAD accuracies, suggesting a high degree of redundancy; see Chapter 7.
- The mean MM1 effect for detectors trained on naturalistic data (4.6%) was three times lower compared to detectors trained on acted data (12.7%) and about half compared to detectors trained on experimentally induced affective states (8.2%).
- Model-based fusion methods resulted in roughly twice the mean MM1 effect (15.3%) compared to feature-level (7.7%) and decision-level (6.7%) fusion. However, this result should be taken with a modicum of caution because it involves between-study comparisons where additional factors could have varied.

Importantly, the authors were able to predict MMAD accuracy from best UMAD accuracy using data type (1 for acted data; 0 for induced or naturalistic data) and fusion method (1 for model-level fusion; 0 for feature- or decision-level fusion). The regression model shown below (using standardized coefficients) explained an impressive 83.3% of the variance based on 10-fold study-level cross-validation.

MMAD accuracy = .900 × Best UMAD accuracy + .273 × Data Type Acted [1 or 0] + .312 × Model Level Fusion [1 or 0] − .253
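Taken at face value, the reported equation (and the MM1 statistic it summarizes) can be applied mechanically, as in the sketch below. Note, however, that the coefficients are reported as standardized, so plugging in raw accuracy proportions is only a rough reading, and the example numbers are invented.

```python
def mm1_effect(multimodal_acc, best_unimodal_acc):
    """Percent improvement of a multimodal detector over its best unimodal counterpart."""
    return 100.0 * (multimodal_acc - best_unimodal_acc) / best_unimodal_acc

def predicted_mmad_accuracy(best_umad, acted_data, model_level_fusion):
    """Evaluate the printed regression at face value (coefficients as reported above)."""
    return 0.900 * best_umad + 0.273 * int(acted_data) + 0.312 * int(model_level_fusion) - 0.253

print(mm1_effect(0.77, 0.70))                                               # 10.0 (a 10% MM1 effect)
print(predicted_mmad_accuracy(0.70, acted_data=False, model_level_fusion=True))
```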

6.5.3 MMAD Systems from the 2015 Audio-Video Emotion Recognition Challenge (AV+EC 2015)

The Audio-Video Emotion Recognition Challenge (AVEC) series is an annual affect detection competition that was first organized as part of the 2011 Affective Computing and Intelligent Interaction (ACII) conference series [Schuller 2011]. The earlier challenges emphasized audio-visual detection of time-continuous annotations of affective dimensions [Schuller et al. 2012] based on data from the SEMAINE corpus [McKeown et al. 2012], which was designed to collect naturalistic data of humans interacting with artificial agents. The most recent challenge (at the time of writing) was the Audio-Visual+ Emotion recognition Challenge and workshop (AV+EC 2015), where the goal was to model time-continuous annotations of valence and arousal from audio, video, and physiology (electrocardiogram and electrodermal activity) signals collected as part of the RECOLA data set [Ringeval et al. 2013] (see walk-through 3 above).

Table 6.2 presents the seven MMAD systems featured in the AV+EC 2015 challenge. Two systems adopted a UMAD approach and are not included here. We note the popularity of model-based fusion techniques, especially those using LSTMs and their variants, although feature- and decision-level fusion methods still feature quite prominently. The best result was obtained by He et al. [2015], who adopted a deep (i.e., multilayer) BLSTM for modality fusion. They achieved a concordance correlation (rc; see walk-through 3) of .747 for arousal and .609 for valence, both reflecting substantial improvements over the challenge baselines (rc = .444 for arousal and .382 for valence).

Table 6.2 MMAD systems featured in the AV+EC 2015 challenge

Reference | Fusion Method
[Cardinal et al. 2015] | Feature, Decision (random forest, linear regression)
[Milchevski et al. 2015] | Feature, Decision (linear regression)
[Huang et al. 2015] | Feature, Decision (linear regression), Hybrid
[Chen and Jin 2015] | Model (BLSTM)
[Chao et al. 2015] | Model (LSTM)
[He et al. 2015] | Model (Deep BLSTM)
[Kächele et al. 2015] | Feature, Decision (averaging), Model (multilayer perceptron)

6.6 Discussion

At the time of this writing, affective computing is approaching its 20th anniversary [Picard 1997] (see Picard [2010] for a brief history of the field). In D'Mello and Kory [2015], we summarized the state of the field of affect detection in 2003 as: "the use of basic signal processing and machine learning techniques, independently applied to still frames (but occasionally to sequences) of facial or vocal data, to detect exaggerated context-free expressions of a few basic affective states that are acted by a small number of individuals with no emphasis on generalizability."

It is clear that much progress was made over the subsequent 10 years, as noted by our summary of the field as of 2013. The italicized items highlight key changes from 2003 to 2013. Most notable is the shift in emphasis from facial or vocal signals to
facial and vocal signals, suggesting that we are finally in the age of MMAD, despite sustained progress in UMAD. “the use of basic and advanced signal processing and machine learning techniques, independently and jointly applied to sequences of primarily facial and vocal data, to detect exaggerated and naturalistic context-free and context-sensitive expressions of a modest number of basic affective states and simple dimensions that are acted or experienced by a modest number of individuals with some emphasis on generalizability.”

What would be a prospective summary of the field a decade from now—say in 2027? We anticipate progress in data collection methods (sensors used, modalities considered, data collection contexts, size of data sets), the computational methods (signal processing, machine learning, fusion techniques), and the affective phenomenon itself (affective states modeled, affect representations, how "ground truth" is established). But what about the metrics of success? The metrics we utilize embody what we (as a community) value in affect detection systems. It is fair to say that detection (or prediction) accuracy on unseen data is the key metric of success in the field (e.g., the AV+EC challenge selects winners based on prediction accuracy on a held-out test set). Does accuracy, then, embody our values? If so, then one must ask "accurate for what purpose and in what context?" Is a highly accurate system trained on a handful of participants in a lab setting of more value than a less accurate one trained on noisy data, but from thousands of individuals in the wild? Similarly, is a highly accurate system that cannot function in the presence of missing data of more value than its less accurate counterpart that is robust to data loss? If accuracy is not the only metric that embodies our values, then what might be some alternative metrics? The answer might lie in the very nature of affect itself. Recall that affect is a construct, not a physical entity. It cannot be precisely defined or directly measured, but only approximated. This level of imprecision might be discomforting to some who might rightly ask: "How can we measure what we cannot even define?" This question has plagued researchers in the psychological sciences for several decades, who have proposed a host of metrics, each based on a different criterion of success. These include different forms of reliability, convergent validity (closely related to accuracy), discriminant validity, ecological validity (related to generalizability), predictive validity, criterion validity, and so on [Rosenthal and Rosnow 1984]. Herein lies the rub. Many of these criteria are in a state of tension. A system (or measure) that achieves impressive gains along one criterion likely does so at the expense
of another. Want a highly accurate (but not very generalizable) system? Just lock a few participants in the lab and ask them to act out a couple of emotions. Want a generalizable (but not very accurate) system? Try to capture affective expressions as people go about their daily routines in the world. By considering a range of metrics, we are forced to identify the inherent weaknesses in our systems and confront our assumptions about the nature of affect and "affective ground truth." Thus, in addition to anticipated advances in theoretical sophistication, data sources, and computational techniques, we advocate for an equitable advance in the science of validation over the next decade of multisensor-multimodal affect detection research. Only then will we have a chance of developing affect detection systems that will break through the confines of the lab and live up to their fullest potential in the real world.

Acknowledgments

This research was supported by the National Science Foundation (NSF) (DRL 1235958, IIS 1523091, and 1660877). Any opinions, findings and conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.

Focus Questions

6.1. What do we mean when we say that affect is a multicomponential conceptual phenomenon?

6.2. Why is the affective experience-expression link weak and how is this related to loosely coupled uncoordinated affective responses?

6.3. Popular TV shows like "Lie to Me" assume that humans can be trained to be highly accurate emotion and deception detectors. Do you agree or disagree? Why?

6.4. Assume you want to develop a detector of surprise. What are three unique ways by which you could obtain affective ground truth to train your detector?

6.5. Assume you have three modalities: video, audio, and electrodermal activity. How would you combine them to achieve "hybrid fusion"?

6.6. Sketch four different model-level fusion designs that combine facial expressions, heart rate, eye movements, keystrokes, and user personality traits.

6.7. How would you estimate bimodal classification accuracy from corresponding unimodal classification accuracies without even building the multimodal model?

6.8. How would you go about building a multisensor-multimodal detector of interest while people read news articles on www.cnn.com? What about curiosity?

6.9. How would you build a robust multimodal-multisensor detector of confusion? Robust implies that the detector should operate even when some of the modalities do not provide any data.

6.10. The concluding section lists several metrics of success in addition to detection accuracy. Which of these metrics do you think the affect detection community should prioritize in the near- (next 5 years) and long- (next 15 years) term?

References R. C. Baker and D. O. Guttfreund. 1993. The effects of written autobiographical recollection induction procedures on mood. Journal of Clinical Psychology 49:563–568. 180 T. Baltru˘saitis, N. Banda, and P. Robinson. 2013. Dimensional Affect Recognition using Continuous Conditional Random Fields. In Proceedings of the International Conference on Multimedia and Expo (Workshop on Affective Analysis in Multimedia). DOI: 10.1109/ FG.2013.6553785. 186 L. Barrett. 2006. Are emotions natural kinds? Perspectives on Psychological Science 1:28–58. DOI: 10.1111/j.1745-6916.2006.00003.x. 170 L. Barrett, B. Mesquita, K. Ochsner, and J. Gross. 2007. The experience of emotion. Annual Review of Psychology, 58:373–403. DOI: 10.1146/annurev.psych.58.110405.085709. 170 L. F. Barrett. 2014. The conceptual act theory: A pr´ ecis. Emotion Review, 6:292–297. DOI: 10.1177/1754073914534479. 169 P. Barros, D. Jirak, C. Weber, and S. Wermter. 2015. Multimodal emotional state recognition using sequence-dependent deep hierarchical features. Neural Networks, 72:140–151. DOI: 10.1016/j.neunet.2015.09.009. 186 N. Bosch, H. Chen, R. Baker, V. Shute, and S. K. D’Mello. 2015a. Accuracy vs. Availability Heuristic in Multimodal Affect Detection in the Wild. In Proceedings of the 17th ACM International Conference on Multimodal Interaction (ICMI 2015) ACM, New York. DOI: 10.1145/2818346.2820739. 183, 186 N. Bosch, S. K. D’Mello, R. Baker, J. Ocumpaugh, V. Shute, M. Ventura, and L. Wang. 2015b. Automatic Detection of Learning-Centered Affective States in the Wild. In Proceedings of the 2015 International Conference on Intelligent User Interfaces (IUI 2015) ACM, New York, pp. 379–388. DOI: 10.1145/2678025.2701397. 182 N. Bosch, S. D’Mello, R. Baker, J. Ocumpaugh, and V. Shute. 2016. Using video to automatically detect learner affect in computer-enabled classrooms. ACM Transactions on Interactive Intelligent Systems, 6:17.11–17.31. DOI: 10.1145/2946837. 183

M. M. Bradley and P. J. Lang. 1994 Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25:49–59. DOI: 10.1016/0005-7916(94)90063-9. 167 R. Calvo, S. K. D’Mello, J. Gratch, and A. Kappas. 2015. The Oxford Handbook of Affective Computing Oxford University Press, New York. 169 R. A. Calvo and S. K. D’Mello. 2010. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1:18–37. DOI: 10.1109/T-AFFC.2010.1. 169 L. Camras and J. Shutter. 2010. Emotional facial expressions in infancy. Emotion Review, 2(2):120–129. DOI: 10.1177/1754073909352529. 170 P. Cardinal, N. Dehak, A.L. Koerich, J. Alam, and P. Boucher. 2015. ETS system for AV+EC 2015 challenge. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 17–23. DOI: 10.1145/2808196.2811639. 189 G. Chanel, C. Rebetez, M. B´ etrancourt, and T. Pun. 2011. Emotion assessment from physiological signals for adaptation of game difficulty. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 41:1052–1063. DOI: 10.1109/ TSMCA.2011.2116000. 186 L. Chao, J. Tao, M. Yang, Y. Li, and Z. Wen. 2015. Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 65–72. DOI: 10.1145/2808196.2811634. 189 D. Chen, D. Jiang, I. Ravyse, and H. Sahli. 2009. Audio-visual emotion recognition based on a DBN model with constrained asynchrony. In Proceedings of the Fifth International Conference on Image and Graphics (ICIG 09) IEEE, Washington, DC, pp. 912–916. DOI: 10.1109/ICIG.2009.120. 176 S. Chen and Q. Jin. 2015. Multi-modal dimensional emotion recognition using recurrent neural networks. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 49–56. DOI: 10.1145/2808196.2811638. 189 J. Coan and J. Allen. Handbook of emotion elicitation and assessment Oxford University Press, New York. 167, 187 J. A. Coan. 2010. Emergent ghosts of the emotion machine. Emotion Review, 2:274–285. DOI: 10.1177/1754073910361978. 170, 172 C. Conati and H. Maclaren. 2009. Empirically building and evaluating a probabilistic model of user affect. User Modeling and User-Adapted Interaction, 19:267–303. DOI: 10.1007/s11257-009-9062-8. 176 R. Cowie, G. McKeown, and E. Douglas-Cowie. 2012. Tracing emotion: an overview. International Journal of Synthetic Emotions (IJSE), 3:1–17. DOI: 10.4018/jse.2012010101. 169 S. D’Mello and R. Calvo. 2013. Beyond the Basic Emotions: What Should Affective Computing Compute? In S. Brewster, S. Bødker and W. Mackay editors, Extended Abstracts of the

ACM SIGCHI Conference on Human Factors in Computing Systems (CHI 2013), ACM, New York. DOI: 10.1145/2468356.2468751. 187 S. K. D’Mello. 2016. On the influence of an iterative affect annotation approach on interobserver and self-observer reliability. IEEE Transactions on Affective Computing, 7:136–149. DOI: 10.1109/TAFFC.2015.2457413. 172 S. K. D’Mello and J. Kory. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys, 47:43:41–43:46. DOI: 10.1145/2682899. 169, 185, 186, 187, 189 S. D’Mello and A. Graesser. 2011. The half-life of cognitive-affective states during complex learning. Cognition & Emotion, 25:1299–1308. DOI: 10.1080/02699931.2011.613668. 176 S. K. D’Mello, N. Dowell, and A. C. Graesser. 2013. Unimodal and multimodal human perception of naturalistic non-basic affective states during Human-Computer interactions. IEEE Transactions on Affective Computing, 4:452–465. DOI: 10.1109/ T-AFFC.2013.19. 172 A. Damasio. 2003. Looking for Spinoza: Joy, sorrow, and the feeling brain. Harcourt Inc., Orlando, FL. 172 C. Darwin. 1872. The expression of the emotions in man and animals. John Murray, London. 172 D. Datcu and L. Rothkrantz. 2011. Emotion recognition using bimodal data fusion. In Proceedings of the 12th International Conference on Computer Systems and Technologies ACM, New York, pp. 122–128. DOI: 10.1145/2023607.2023629. 186 S. Dobriˇsek, R. Gajˇsek, F. Miheliˇ c, N. Paveˇsi´ c, and V. ˇ Struc. 2013. Towards Efficient MultiModal Emotion Recognition. International Journal of Advanced Robotic Systems, 10:1–10. DOI: 10.5772/54002. 186 P. Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6:169–200. 170, 172, 187 P. Ekman. 1994. Strong Evidence for Universals in Facial Expressions - a Reply to Russells Mistaken Critique. Psychological Bulletin, 115:268–287. 170 H. Elfenbein and N. Ambady. 2002a. Is there an ingroup advantage in emotion recognition? Psychological Bulletin, 128:243–249. DOI: 10.1037/0033-2909.128.2.243. 170 H. Elfenbein and N. Ambady. 2002b. On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychological Bulletin, 128:203–235. 170 F. Eyben, M. W¨ ollmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie. 2010. On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. Journal on Multimodal User Interfaces, 3:7–19. DOI: 10.1007/s12193 -009-0032-6. 178 FACET. 2014. Facial Expression Recognition Software Emotient, Boston, MA. 182

J. Fontaine, K. Scherer, E. Roesch, and P. Ellsworth. 2007. The world of emotions is not twodimensional. Psychological Science, 18. DOI: 10.1111/j.1467-9280.2007.02024.x. 170, 187 A. J. Fridlund, P. Ekman, and H. Oster. 1987. Facial expressions of emotion. In A. W. Siegman and S. Feldstein, editors, Nonverbal behavior and communication, pp. 143– 223. Erlbaum, Hillsdale, NJ. 170 J. M. Girard, J. F. Cohn, L. A. Jeni, M. A. Sayette, and F. De la Torre. 2015. Spontaneous facial expression in unscripted social interactions can be measured automatically. Behavior Research Methods, 47:1136–1147. DOI: 10.3758/s13428-014-0536-1. 167, 168 M. Glodek, S. Reuter, M. Schels, K. Dietmayer, and F. Schwenker. 2013. Kalman Filter Based Classifier Fusion for Affective State Recognition. In Z.-H. Zhou, F. Roli and J. Kittler, editors, Proceedings of the 11th International Workshop on Multiple Classifier Systems, Springer, Berlin Heidelberg, pp. 85–94. DOI: 10.1007/978-3-642-38067-9_8. 186 J. F. Grafsgaard, J. B. Wiggins, K. E. Boyer, E. N. Wiebe, and J. C. Lester. 2014. Predicting learning and affect from multimodal data streams in task-oriented tutorial dialogue. In J. Stamper, Z. Pardos, M. Mavrikis and B. M. McLaren, editors, Proceedings of the 7th International Conference on Educational Data Mining, International Educational Data Mining Society, pp. 122–129. 186 J. J. Gross and L. F. Barrett. 2011. Emotion generation and emotion regulation: One or two depends on your point of view. Emotion Review, 3:8–16. DOI: 10.1177/ 1754073910380974. 170 L. He, D. Jiang, L. Yang, E. Pei, P. Wu, and H. Sahli. 2015. Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 73–80. DOI: 10.1145/2808196.2811641. 189 S. J. Heine, D. R. Lehman, K. Peng, and J. Greenholtz. 2002. What’s wrong with crosscultural comparisons of subjective Likert scales?: The reference-group effect. Journal of Personality and Social Psychology, 82:903–918. 171 S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation, 9:1735–1780. DOI: 10.1162/neco.1997.9.8.1735. 178 S. Hommel, A. Rabie and U. Handmann. 2013. Attention and Emotion Based Adaption of Dialog Systems. In E. Pap editor, Intelligent Systems: Models and Applications, pp. 215–235. Springer Verlag, Berlin Heidelberg. 186 Z. Huang, T. Dang, N. Cummins, B. Stasak, P. Le, V. Sethu, and J. Epps. 2015. An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 41–48. DOI: 10.1145/2808196 .2811640. 189 M. Hussain, H. Monkaresi, and R. Calvo. 2012. Combining Classifiers in Multimodal Affect Detection. In Proceedings of the Australasian Data Mining Conference. 186

C. Izard. Innate and universal facial expressions: Evidence from developmental and crosscultural research. Psychological Bulletin 115. DOI: 10.1037//0033-2909.115.2.288. 170 C. Izard. 2010. The many meanings/aspects of emotion: Definitions, functions, activation, and regulation. Emotion Review, 2:363–370. DOI: 10.1177/1754073910374661. 169, 170 C. E. Izard. 2007. Basic emotions, natural kinds, emotion schemas, and a new paradigm. Perspectives on Psychological Science, 2:260–280. DOI: 10.1111/j.1745-6916.2007 .00044.x. 169, 172 W. James. 1884. What is an emotion? Mind, 9:188–205. 172 J. H. Janssen, P. Tacken, J. de Vries, E. L. van den Broek, J. H. Westerink, P. Haselager, and W. A. IJsselsteijn. 2013. Machines outperform laypersons in recognizing emotions elicited by autobiographical recollection. Human–Computer Interaction, 28:479–517. DOI: 10.1080/07370024.2012.755421. 180 D. Jiang, Y. Cui, X. Zhang, P. Fan, I. Ganzalez, and H. Sahli. 2011. Audio visual emotion recognition based on triple-stream dynamic bayesian network models. In S. D’Mello, A. Graesser, S. B and J. Martin, editors, Proceedings of the Fourth International Conference on Affective Computing and Intelligent Interaction, Springer-Verlag, Berlin Heidelberg, pp. 609–618. DOI: 10.1007/978-3-642-24600-5_64. 176, 186 M. K¨ achele, P. Thiam, G. Palm, F. Schwenker, and M. Schels. 2015. Ensemble methods for continuous affect recognition: Multi-modality, temporality, and challenges. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 9–16. DOI: 10.1145/2808196.2811637. 189 S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, C. ¸ G¨ ulcehre, ¸ R. Memisevic, P. Vincent, A. Courville, Y. Bengio, and R. C. Ferrari. 2013. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM International Conference on Multimodal Interaction ACM, New York, pp. 543–550. DOI: 10.1145/2522848.2531745. 179 S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. 2012. Deap: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing 3:18–31. DOI: 10.1109/T-AFFC.2011.15. 186 J. Kory and S. K. D’Mello. 2015. Affect elicitation for affective computing. In R. Calvo, S. D’Mello, J. Gratch and A. Kappas, editors, The Oxford Handbook of Affective Computing, pp. 371–383. Oxford University Press, New York. DOI: 10.1093/oxfordhb/ 9780199942237.013.001. 171, 182 J. Kory, S. K. D’Mello, and A. Olney. 2015. Motion Tracker: Camera-based Monitoring of Bodily Movements using Motion Silhouettes. Plos One 10, 10.1371/journal.pone.0130293. DOI: 10.1371/journal.pone.0130293. G. Krell, M. Glodek, A. Panning, I. Siegert, B. Michaelis, A. Wendemuth, and F. Schwenker. 2013. Fusion of Fragmentary Classifier Decisions for Affective State Recognition. In F. Schwenker, S. Scherer and L.-P. Morency, editors, Proceedings of the The 1st

International Workshop on Multimodal Pattern Recognition of Social Signals in HumanComputer-Interaction, Springer-Verlag, Berlin Heidelberg, pp. 116–130. 186 J. A. Krosnick. 1999. Survey research. Annual Review of Psychology, 50:537–567. 171 Y. Le Cun, Y. Bengio, and G. E. Hinton. 2015. Deep learning. Nature, 521:436–444. 178 H. C. Lench, S. W. Bench, and S. A. Flores. 2013. Searching for evidence, not a war: Reply to Lindquist, Siegel, Quigley, and Barrett (2013). Psychological Bulletin, 113:264–268. DOI: 10.1037/a0029296.. 169 J. S. Lerner and D. Keltner. 2000. Beyond valence: Toward a model of emotion-specific influences on judgement and choice. Cognition & Emotion, 14:473–493. DOI: 10 .1080/026999300402763. 169 M. D. Lewis. 2005. Bridging emotion theory and neurobiology through dynamic systems modeling. Behavioral and Brain Sciences, 28:169–245. DOI: 10.1017/ S0140525X0500004X. 169, 172 X. Li and Q. Ji. 2005. Active affective state detection and user assistance with dynamic bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 35:93–105. DOI: 10.1109/TSMCA.2004.838454. 176 J. Lin, C. Wu, and W. Wei. 2012. Error Weighted Semi-Coupled Hidden Markov Model for Audio-Visual Emotion Recognition. IEEE Transactions on Multimedia, 14:142–156. DOI: 10.1109/TMM.2011.2171334. 177, 186 K. A. Lindquist, A. B. Satpute, T. D. Wager, J. Weber, and L. F. Barrett. 2016. The brain basis of positive and negative affect: evidence from a meta-analysis of the human neuroimaging literature. Cerebral Cortex, 26:1910–1922. DOI: 10.1093/cercor/bhv001. 172 K. A. Lindquist, E. H. Siegel, K. S. Quigley and L. F. Barrett. 2013. The Hundred-Year Emotion War: Are Emotions Natural Kinds or Psychological Constructions? Comment on Lench, Flores, and Bench (2011). Psychological Bulletin, 139:264–268. DOI: 10.1037/ a0029038. 169 K. A. Lindquist, T. Wager, D. H. Kober, E. Bliss-Moreau, and L. F. Barrett. 2011. The brain basis of emotion: A meta-analytic review. Behavioral and Brain Sciences, 173:1–86. DOI: 10.1017/S0140525X11000446. 172 F. Lingenfelser, J. Wagner and, E. Andr´ e. 2011. A systematic discussion of fusion techniques for multi-modal affect recognition tasks. In Proceedings of the 13th International Conference on Multimodal Interfaces ACM, New York, pp. 19–26. DOI: 10.1145/ 2070481.2070487. 186 M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen. 2014. Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In Proceedings of the 16th ACM International Conference on Multimodal Interaction ACM, New York, pp. 494–501. DOI: 10.1145/2663204.2666274. 176 G. Loewenstein and J. S. Lerner. 2003. The role of affect in decision making. In Handbook of Affective Science, 619:3. DOI: 10.1016/B978-0-444-62604-2.00003-4. 169

K. Lu and Y. Jia. 2012. Audio-visual emotion recognition with boosted coupled HMM. In Proceedings of the 21st International Conference on Pattern Recognition IEEE, Washington, DC, pp. 1148–1151. 177, 186 J.-C. Martin, C. Clavel, M. Courgeon, M. Ammi, M.-A. Amorim, Y. Tsalamlal, and Y. Gaffary. 2018. How Do Users Perceive Multimodal Expressions of Affects? In In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA. G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder. 2012. The SEMAINE database: Annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3:5–17. DOI: 10.1109/T-AFFC.2011.20. 188 M. Mehu and K. Scherer. 2012. A psycho-ethological approach to social signal processing. Cognitive Processing, 13:397–414. DOI: 10.1007/s10339-012-0435-2. 171 B. Mesquita and M. Boiger. 2014. Emotions in context: A sociodynamic model of emotions. Emotion Review, 6:298–302. 169 A. Metallinou, M. Wollmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan. 2012. Context-Sensitive Learning for Enhanced Audiovisual Emotion Classification. IEEE Transactions on Affective Computing, 3:184–198. DOI: 10.1109/T-AFFC.2011.40. 186 A. Milchevski, A. Rozza, and D. Taskovski. 2015. Multimodal affective analysis combining regularized linear regression and boosted regression trees. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, 33–39. DOI: 10.1145/2808196.2811636. 189 H. Monkaresi, N. Bosch, R. A. Calvo, and S. K. D’Mello. 2017. Automated detection of engagement using video-based estimation of facial expressions and heart rate. IEEE Transactions on Affective Computing, 8:15–28. DOI: 10.1109/TAFFC.2016.2515084. 186 H. Monkaresi, M. S. Hussain and R. Calvo. 2012. Classification of affects using head movement, skin color features and physiological signals. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics IEEE, Washington, DC, pp. 2664–2669. DOI: 10.1109/ICSMC.2012.6378149. 186 H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler. 2015. Deep learning for emotion recognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI 2015) ACM, New York, pp. 443–449. DOI: 10.1145/2818346.2830593. 179 M. Nicolaou, H. Gunes, and M. Pantic. 2011. Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence& Arousal Space. IEEE Transactions on Affective Computing, 2:92–105. DOI: 10.1109/T-AFFC.2011.9. 186 J. Ocumpaugh, R. S. Baker, and M. M. T. Rodrigo. 2012. Baker-Rodrigo Observation Method Protocol (BROMP) 1.0. Training Manual Version 1.0. Worcester Polytechnic Institute,

Teachers College Columbia University, & Ateneo de Manila University, New York and Manila, Philippines. DOI: 10.1007/978-3-642-39112-5_74. 182 J. Ocumpaugh, R. S. Baker, and M. M. T. Rodrigo. 2015. Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP) 2.0 Technical and Training Manual Teachers College, Columbia University, and Ateneo Laboratory for the Learning Sciences, New York, and Manila, Philippines. 167 J. Park, G. Jang, and Y. Seo. 2012. Music-aided affective interaction between human and service robot. EURASIP Journal on Audio, Speech, and Music Processing 2012, 1–13. DOI: 10.1186/1687-4722-2012-5. 186 B. Parkinson, A. H. Fischer, and A. S. Manstead. Emotion in social relations: Cultural, group, and interpersonal processes. Psychology Press. 170 R. Picard. 1997. Affective Computing. MIT Press, Cambridge, MA. 169, 189 R. Picard. 2010. Affective Computing: From Laughter to IEEE. IEEE Transactions on Affective Computing, 1:11–17. DOI: 10.1109/T-AFFC.2010.10. 189 R. W. Picard, S. Fedor, and Y. Ayzenberg. 2015. Multiple arousal theory and daily-life electrodermal activity asymmetry. Emotion Review, 8 (1), 62–75. DOI: 10.1177/ 1754073914565517. 174 P. M. Podsakoff, S. B. MacKenzie, J. Y. Lee, and N. P. Podsakoff. 2003. Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88:879–903. DOI: 10.1037/0021-9010.88.5.879. 171 F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P. Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller. 2015a. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognition Letters, 66:22–30. DOI: 10 .1016/j.patrec.2014.11.007. 178, 183 F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic. 2015b. AV+ EC 2015: The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 3–8. 184 F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne. 2013. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In Proceedings of the 2nd International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE) in conjunction with the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition IEEE, Washington, DC. DOI: 10.1109/FG.2013.6553805. 184, 189 V. Rosas, R. Mihalcea and L. Morency. 2013. Multimodal Sentiment Analysis of Spanish Online Videos. IEEE Intelligent Systems, 28:38–45. DOI: 10.1109/MIS.2013.9. 186 I. J. Roseman. 2011. Emotional behaviors, emotivational goals, emotion strategies: Multiple levels of organization integrate variable and consistent responses. Emotion Review, 3:434–443. 170, 171

R. Rosenthal and R. Rosnow. 1984. Essentials of behavioral research: Methods and data analysis. McGraw-Hill, New York. 190 V. Rozgic, S. Ananthakrishnan, S. Saleem, R. Kumar, and R. Prasad. 2012. Ensemble of SVM trees for multimodal emotion recognition. In Proceedings of the Signal & Information Processing Association Annual Summit and Conference IEEE, Washington, DC, pp. 1–4. 186 W. Ruch. 1995. Will the real relationship between facial expression and affective experience please stand up: The case of exhilaration. Cognition & Emotion, 9:33–58. DOI: 10 .1080/02699939508408964. 170 J. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110:145–172. 169, 170 J. A. Russell, J. A. Bachorowski, and J. M. Fernandez-Dols. 2003. Facial and vocal expressions of emotion. Annual Review of Psychology 54, 329–349. DOI: 10.1146/annurev.psych.54 .101601.145102. J. A. Russell, A. Weiss, and G. A. Mendelsohn. 1989. Affect Grid - A single-item scale of pleasure and arousal. Journal of Personality and Social Psychology, 57:493–502. DOI: 10.1037/0022-3514.57.3.493. 167 A. Savran, H. Cao, M. Shah, A. Nenkova, and R. Verma. 2012. Combining video, audio and lexical indicators of affect in spontaneous conversation via particle filtering. In Proceedings of the 14th ACM International Conference on Multimodal Interaction ACM, New York, pp. 485–492. DOI: 10.1145/2388676.2388781. 186 K. R. Scherer. 2009. The dynamic architecture of emotion: Evidence for the component process model. Cognition & Emotion, 23:1307–1351. DOI: 10.1080/02699930902928969. 169 B. Schuller. 2011. Recognizing Affect from Linguistic Information in 3D Continuous Space. IEEE Transactions on Affective Computing, 2:192–205. DOI: 10.1109/T-AFFC.2011.17. 186, 188 B. Schuller, M. Valster, R. Cowie, and M. Pantic. 2011. AVEC 2011: Audio/Visual Emotion Challenge and Workshop. In S. D’Mello, A. Graesser, B. Schuller and J.-C. Martin, editors, Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction (ACII 2011), Springer, Berlin. B. Schuller, M. Valster, F. Eyben, R. Cowie, and M. Pantic. 2012. AVEC 2012: The continuous audio/visual emotion challenge. In Proceedings of the 14th ACM international conference on Multimodal interaction ACM, New York, pp. 449–456. DOI: 10.1145/ 2388676.2388758. 188 V. J. Shute, M. Ventura, and Y. J. Kim. 2013. Assessment and learning of qualitative physics in Newton’s playground. The Journal of Educational Research, 106:423–430. 182 M. Soleymani, S. Asghari-Esfeden, M. Pantic, and Y. Fu. 2014. Continuous emotion detection using eeg signals and facial expressions. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) IEEE, Washington DC, pp. 1–6. DOI: 10.1109/ICME.2014.6890301. 186

M. Soleymani, M. Pantic, and T. Pun. 2012. Multi-Modal Emotion Recognition in Response to Videos. IEEE Transactions on Affective Computing, 3:211–223. DOI: 10.1109/T-AFFC .2011.37. 186 S.S. Tomkins. 1962. Affect Imagery Consciousness: Volume I, The Positive Affects. Tavistock, London. 172 J. L. Tracy. 2014. An evolutionary approach to understanding distinct emotions. Emotion Review, 6:308–312. DOI: 10.1177/1754073914534478. 170 A. Vinciarelli and A. Esposito. 2018. Multimodal Analysis of Social Signals. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA. H. Vu, Y. Yamazaki, F. Dong, and K. Hirota. 2011. Emotion recognition based on human gesture and speech information using RT middleware. In IEEE International Conference on Fuzzy Systems IEEE, Washington, DC, pp. 787–791. DOI: 10.1109/ FUZZY.2011.6007557. 186 J. Wagner and E. Andr´ e. 2018. Real-time sensing of affect and social signals in a multimodal context. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, and A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition. Morgan & Claypool Publishers, San Rafael, CA. J. Wagner, E. Andre, F. Lingenfelser, J. Kim, and T. Vogt. 2011. Exploring Fusion Methods for Multimodal Emotion Recognition with Missing Data. IEEE Transactions on Affective Computing, 2:206–218. DOI: 10.1109/T-AFFC.2011.12. 186 S. Walter, S. Scherer, M. Schels, M. Glodek, D. Hrabal, M. Schmidt, R. B¨ ock, K. Limbrecht, H. Traue, and F. Schwenker. 2011. Multimodal emotion classification in naturalistic user behavior. In J. Jacko, editor, Proceedings of the International Conference on Human-Computer Interaction. Springer, Berlin, pp. 603–611. DOI: 10.1007/978-3-64221616-9_68. 186 S. Wang, Y. Zhu, G. Wu, and Q. Ji. 2013. Hybrid video emotional tagging using users’ EEG and video content. Multimedia Tools and Applications, 1–27. DOI: 10.1007/s11042 -013-1450-8. 186 J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and D. D. Mehta. 2014. Vocal and Facial Biomarkers of Depression Based on Motor Incoordination and Timing. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge ACM, New York, pp. 65–72. DOI: 10.1145/2661806.2661809. 186 M. W¨ ollmer, M. Kaiser, F. Eyben, and B. Schuller. 2013a. LSTM modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 31. DOI: 10.1016/j.imavis.2012.03.001. 186 M. W¨ ollmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L. Morency. 2013b. YouTube Movie Reviews:Sentiment Analysis in an Audiovisual Context. IEEE Intelligent Systems, 28:46–53. DOI: 10.1109/MIS.2013.34. 186

C. Wu and W. Liang. 2011. Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Transactions on Affective Computing, 2:10–21. DOI: 10.1109/T-AFFC.2010.16. 186 Z. Zeng, M. Pantic, G. Roisman, and T. Huang. 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:39–58. DOI: 10.1109/TPAMI.2008.52. 169, 527 D. Zhou, J. Luo, V. M. Silenzio, Y. Zhou, J. Hu, G. Currier, and H. A. Kautz. 2015. Tackling Mental Health by Integrating Unobtrusive Multimodal Sensing. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI-2015) ACM, New York, pp. 1401–1409. 186

7 Multimodal Analysis of Social Signals
Alessandro Vinciarelli, Anna Esposito

7.1 Introduction

One of the earliest books dedicated to the communication between living beings is The Expression of the Emotions in Man and Animals by Charles Darwin. The text includes a large number of accurate and vivid descriptions of the way living beings express emotions: "many kinds of monkeys, when pleased, utter a reiterated sound, clearly analogous to our laughter, often accompanied by vibratory movements of their jaws or lips, with the corners of the mouth drawn backwards and upwards, by the wrinkling of the cheeks, and even by the brightening of the eyes" [Darwin 1872]. Darwin never uses the word multimodal, but the example above—and the many similar others the book contains—makes it clear that the expression of emotions often involves the simultaneous use of multiple communication channels and, correspondingly, the simultaneous stimulation of different senses. Approximately 100 years after the seminal insights by Darwin, research in life and human sciences started to adopt the expression multimodal communication to denote the phenomenon above and to investigate its underlying principles and laws [Partan and Marler 1999, Scheffer et al. 1996, Rowe and Guilford 1996]. In parallel, it was observed that multimodality is not peculiar to the expression or perception of emotions, but concerns any interaction between living beings (human-human, animal-animal, or human-animal) [Poggi 2007]. In other words, multimodality plays a major role in the exchange of social signals, i.e., "acts or structures that influence the behavior or internal state of other individuals" [Mehu and Scherer 2012], "communicative or informative signals which . . . provide information about social facts" [Poggi and D'Errico 2012], or "actions whose function is to bring about some reaction or to engage in some process" [Brunet and Cowie 2012].

Glossary

Classifier: in pattern recognition and machine learning, it is a function that maps an object of interest (represented through a set of physical measurements called features) into one of the classes or categories such an object can belong to (the number of classes or categories is finite).

Classifier Combination: in pattern recognition and machine learning, it is a body of methodologies aimed at jointly using multiple classifiers to achieve a collective performance higher—to a statistically significant extent—than the individual performance of any individual classifier.

Classifier Diversity: in a set of classifiers that are being combined, the diversity is the tendency of different classifiers to have different performance in different regions of the input space.

Communication: process between two or more agents aimed at the exchange of information or at the mutual modification of beliefs, shared or individual.

Redundancy: tendency of multiple signals or communication channels to carry the same or widely overlapping information.

Social Signals: constellations of nonverbal behavioural cues aimed at conveying socially relevant information such as attitudes, personality, intentions, etc.

Social Signal Processing: computing domain aimed at modeling, analysis, and synthesis of social signals in human-human and human-machine interactions.

Now that machines are powerful enough to deal with human behavior and its subtleties, interest in multimodal communication has reached the computing community [Vinciarelli et al. 2009, 2012]. Figure 7.1 shows the distribution of the number of computing-oriented papers containing the word multimodal in the ACM Digital Library, one of the most important repositories of computing literature. The chart shows that interest in the topic has been increasing continuously for the last 15 years. According to the latest technology forecasts,1 socially intelligent technologies—in particular humanoid robots—will become a ubiquitous feature of everyday life in the next 20 years. This suggests that the trend of Figure 7.1 will continue in the foreseeable future.

1. According to Tractica, "annual robot unit shipments will increase from 8.8 million in 2015 to 61.4 million by 2020, with more than half the volume in that year coming from consumer robots," i.e., robots that integrate into the everyday life of their users (http://www.tractica.com/newsroom/pressreleases/global-robotics-industry-to-surpass-151-billion-by-2020/).

Figure 7.1   [Chart omitted: number of articles containing the word "multimodal" per year, 2000–2015; vertical axis is the number of articles.] The chart shows the number of papers that the ACM Digital Library returns after submitting the query multimodal. Overall, the number grows continuously since 2000.

Overall, the brief outline above shows that multimodality and multimodal communication are concepts that attract attention in communities as diverse as life sciences, computing, and human sciences. However, the exact meaning of the term multimodal is not necessarily the same in all fields. The goal of this chapter is to address, at least in part, such an issue and to show differences and commonalities (if any) behind the use of the word multimodal in the various areas. In particular, this chapter tries to show if concepts originally elaborated in life sciences can be “translated” into computational methodologies and, if yes, how. The rest of the chapter is organized as follows. Section 7.2 provides a brief introduction to the concept of multimodality in life and human sciences. Section 7.3 describes the key methodological issues of multimodal approaches for the analysis of social signals. Section 7.4 highlights some future perspectives and, finally, Section 7.5 draws some conclusions.

7.2 Multimodal Communication in Life and Human Sciences

According to life sciences, "Animals communicate with their entire bodies and perceive signals with all available faculties (vision, audition, chemoreception, etc.). To best understand communication, therefore, we must consider the whole animal and all of its sensory emissions and percepts" [Partan and Marler 2005]. However, what makes communication truly multimodal is not the joint use of multiple modalities, but their integration to achieve communicative effects that cannot be

Figure 7.2   [Diagram omitted: redundant signals lead to equivalence or enhancement; non-redundant signals lead to independence, dominance, modulation, or emergence.] A reproduction of the scheme proposed in [Partan and Marler 2005, Partan and Marler 1999], showing the multimodal communication patterns observed in nature.

achieved individually by the various modalities involved: “[the use of multimodal signals] produces unexpected psychological responses . . . which remain hidden when the components are presented alone, with the clear implication that the full significance of multicomponent animal signals cannot be understood by investigating components independently” [Rowe and Guilford 1996]. In other words, communication is multimodal when the effect of multiple modalities is not just the sum of the individual effects achieved with individual modalities. Figure 7.2 shows a taxonomy of multimodal communication patterns observed in nature. The first important distinction is between patterns based on redundant and non-redundant signals, corresponding to the upper and lower half of the figure, respectively [Partan and Marler 1999]. In the case of redundant signals, the different modalities carry the same information and the main function of multimodality is to ensure that the message reaches the agent who is supposed to perceive it. A typical case in nature is the combination of acoustic signals and movements. The latter ensures that the message can be received even when there is loud noise in the environment, a condition that is frequent in natural settings. Conversely, the acoustic signals ensure that the message can be received even in the absence

of light or in the case of visual occlusion. In the case of non-redundant signals, the main function of multimodality is to transmit a larger amount of information per unit of time. For example, the appropriate combination of pigmentation and chemical signals can discourage predators or attract sexual mates, two functions that require the relevant information to be communicated quickly in order to survive and transmit genes, respectively [Scheffer et al. 1996]. The second important criterion that informs the taxonomy of Figure 7.2 is the response of the subject that perceives the multimodal communication pattern [Partan and Marler 2005], whether the response corresponds to a behavioral display or to a correct understanding of the message conveyed by the pattern. In the case of redundant signals, two scenarios are observed, namely equivalence and enhancement. The former corresponds to the case in which the response to the multimodal signal is the same as the responses observed when the unimodal signals are perceived individually. The latter corresponds to the case in which the response is the enhanced version of the one observed when the unimodal signals are perceived individually. When the signals are non-redundant, the scenarios observed in nature are four, i.e., independence, dominance, modulation, and emergence (see Figure 7.2). Independence means that the response to a multimodal pattern is the mere sum of the responses to the individual signals. According to the principle outlined at the beginning of this section, such a scenario does not even qualify as an example of multimodal communication. The dominance scenario covers those cases where the response is the same as the one that would be observed for one of the unimodal signals in the multimodal pattern. The modulation scenario is similar, but the magnitude of the response changes with respect to the unimodal case. Finally, the emergence scenario corresponds to those cases in which the response is different from those that can be observed when each of the unimodal signals is perceived individually. It is not surprising to observe that the first observations on multimodal communication were done by life scientists studying interactions between animals. The reason is that these use a much wider variety of sensory channels than how humans do, including hearing, sight, radar-like receptors of acoustic waves, olfaction, chemoreceptors, etc. [Rowe and Guilford 1996, Scheffer et al. 1996]. When it comes to human-human communication, interaction takes place mainly via speech and visual signals (facial expressions, gestures, posture, mutual distances, etc.). The other channels—touch, smell and taste—are used only rarely and only in very specific contexts (e.g., sexual intercourse). For this reason, in the case of humans, multimodality typically means bimodality. In particular, it is possible to distinguish

between micro-bimodality (the combination of speech and movements like those of the lips that are necessary for the very emission of voice and articulation of phonemes) and macro-bimodality (the combination of speech and movements like facial expressions that are not strictly necessary for the emission of speech) [Poggi 2007]. To the best of our knowledge, no systematic attempts have been made to verify whether the taxonomy of Figure 7.2 applies to human-human communication. However, a few examples indicate that this is actually the case. A person who attracts the attention of others by shouting and waving their arms is a case of equivalence (the visual signal tries to reach distances that the voice cannot). Similarly, people who say "No" while shaking their head enhance their message through the use of two redundant signals. In the case of non-redundant signals (lower half of Figure 7.2), a person can manifest aggressiveness through the tone of her voice while showing fear through a defensive body posture, thus fitting the independence scenario. The dominance case applies, e.g., to someone who claims to be comfortable while blushing and, hence, is perceived as someone who is actually not comfortable. Concerning the modulation case, prosody helps to stress and emphasize certain parts of a verbal message, thus achieving a modulation effect. Finally, irony can be considered a case of emergence where, e.g., the verbal and non-verbal components of a message are opposite to one another.

7.3 Multimodal Analysis of Social Signals

The taxonomy of Figure 7.2 does not explain how multimodal communication patterns result in a given response; it simply provides criteria and terminology to describe multimodal communication in rigorous terms. Approaches for multimodal analysis of social signals do not explain the way multimodal signals produce a response either [Vinciarelli et al. 2009, Vinciarelli et al. 2012]. However, computer analysis requires one to express the process in operational terms. If X = (x_1, x_2, ..., x_R) is a multimodal pattern, x_i is a vector of physical measurements extracted from modality i (meaning from the data captured with sensor i), and R is the total number of sensors adopted (every modality corresponds to a sensor), there are two approaches to deal with it. The first approach is called early fusion and it consists of concatenating the unimodal x_i vectors to obtain a multimodal vector x. The resulting vector can then be fed to any pattern recognition approach to perform classification or regression, depending on the problem. In this case, the response to x will be the output of the particular approach being used (e.g., a probability, a score, a distance, etc.). The second is called late fusion and consists of
adopting a Multiple Classifiers System (MCS), i.e., a combination of several classifiers that deal separately with the various modalities. The output of an MCS is a probability distribution P(ω_k | X), where ω_k ∈ Ω = {ω_1, ..., ω_L} and Ω is the set of all possible responses. The probability distribution P(ω_k | X) allows one to make a decision about the response to X according to the following rule:

ω̂ = arg max_{k=1,...,L} P(ω_k | X) = arg max_{k=1,...,L} P(ω_k | x_1, ..., x_R).    (7.1)
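To make the two strategies concrete, here is a minimal sketch contrasting early fusion (feature concatenation) with late fusion (one classifier per modality, posteriors combined and the decision taken with the argmax rule of Equation 7.1). The two-modality synthetic data, classifier choice, and product-style combination are illustrative assumptions, not part of the chapter.

```python
# Illustrative early vs. late fusion on synthetic two-modality data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, labels = 200, np.array([0, 1])
y = rng.integers(0, 2, n)
x_audio = rng.normal(loc=y[:, None], scale=1.0, size=(n, 10))   # modality 1 features
x_video = rng.normal(loc=y[:, None], scale=1.5, size=(n, 20))   # modality 2 features

# Early fusion: concatenate the per-modality vectors, train a single classifier.
x_early = np.hstack([x_audio, x_video])
early_clf = LogisticRegression(max_iter=1000).fit(x_early, y)

# Late fusion: one classifier per modality, combine the posteriors
# (here a simple product) and decide with the argmax rule of Equation 7.1.
clf_a = LogisticRegression(max_iter=1000).fit(x_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(x_video, y)
posteriors = clf_a.predict_proba(x_audio) * clf_v.predict_proba(x_video)
late_pred = labels[np.argmax(posteriors, axis=1)]

# Resubstitution accuracy, only to show that both pipelines run end to end.
print("early fusion:", early_clf.score(x_early, y))
print("late fusion:", (late_pred == y).mean())
```

In practice the two strategies trade off differently: early fusion can exploit cross-modal feature interactions, while late fusion degrades more gracefully when one modality is missing or unreliable.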

Early and late fusion are not expected to explain the way living beings respond to multimodal communication patterns. They are simply methodologies that allow a machine to map a multimodal input pattern X into a suitable response ω̂. In other words, early and late fusion can perhaps reproduce the observations summarized in Figure 7.2, but they cannot be considered an explanation or a model of the processes that lead from stimulus to response in living beings. The rest of this section focuses on the probabilistic framework underlying MCS and, in particular, it closely follows the approach proposed in Kittler et al. [1998] to estimate P(ω_j | x_1, ..., x_R) (the literature does not provide similar frameworks for early fusion). Such a distribution is difficult to estimate because this requires knowing the distribution p(x_1, ..., x_R | ω_j), which is typically difficult to infer:

P(ω_j | x_1, ..., x_R) = p(x_1, ..., x_R | ω_j) P(ω_j) / p(x_1, ..., x_R)
                       = p(x_1, ..., x_R | ω_j) P(ω_j) / Σ_{k=1}^{L} p(x_1, ..., x_R | ω_k) P(ω_k).    (7.2)

For this reason, it is common to make independence assumptions that, while making the problem tractable, lead to intuitive combination rules. If the vectors x_k are assumed to be statistically independent given ω_j, then p(x_1, ..., x_R | ω_j) boils down to the following:

p(x_1, ..., x_R | ω_j) = Π_{k=1}^{R} p(x_k | ω_j).    (7.3)
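As the next paragraph points out, this factorization is fragile: a single near-zero term p(x_k | ω_j) pulls the whole product toward zero regardless of the other modalities. A tiny numeric illustration with made-up likelihood values:

```python
import numpy as np

# Hypothetical per-modality likelihoods p(x_k | w_j) for one class and three modalities.
confident = np.array([0.8, 0.7, 0.9])      # all modalities agree
one_outlier = np.array([0.8, 0.7, 0.01])   # one modality nearly vetoes the class

# Product rule (Equation 7.3): a single small term dominates the result.
print(np.prod(confident))     # 0.504
print(np.prod(one_outlier))   # 0.0056, despite two confident modalities

# Averaging the same terms is far less sensitive to the single outlier.
print(one_outlier.mean())     # about 0.50
```

This sensitivity is part of the motivation for the assumption introduced next, which is what leads to sum-style (averaging) combination rules in Kittler et al. [1998].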

The main problem with such an approximation is that the posterior probability becomes low even if just one of the terms p(x_k | ω_j) is low. For this reason, the adoption of MCS requires the assumption that the posteriors P(ω_k | x_i) are similar to the a-priori probabilities P(ω_k): P(ω_j | x_i) = P(ω_j)(1 + δ_ji), with |δ_ji|