Academic Press Library in Signal Processing, Volume 6: Image and Video Processing and Analysis and Computer Vision [6] 012811889X, 9780128118894

Academic Press Library in Signal Processing, Volume 6: Image and Video Processing and Analysis and Computer Vision is ai


English Pages 458 [437] Year 2017




Table of contents :
Academic Press Library in Signal Processing, Volume 6
About the Editors
Section Editors
Multiview video: Acquisition, processing, compression, and virtual view rendering
Multiview Video
Multiview Video and 3D Graphic Representation Formats for VR
Super-Multiview Video for 3D Light Field Displays
DIBR Smooth View Interpolation
Basic Principles of DIBR
DIBR vs. Point Clouds
DIBR, Multiview Video, and MPEG Standardization
Multiview Video Acquisition
Multiview Fundamentals
Depth in Stereo and Multiview Video
Multicamera System
Acquisition System Examples
Nagoya University multiview camera system
Fraunhofer HHI camera system
Poznań University of Technology multiview camera system (linear rig)
Poznań University of Technology multiview camera system (modular)
Hasselt University multiview camera system
Multiview Video Preprocessing
Geometrical Parameters
Intrinsic parameters
Extrinsic parameters
Lens distortion
Estimation of camera parameters
Camera parameters file format
Video Correction
Color correction
Lens distortion removal
Depth Estimation
Local Stereo Matching
Global Stereo Matching
Graph Cut
Belief propagation
Multicamera Depth Estimation
Plane sweeping
Epipolar plane images
View Synthesis and Virtual Navigation
View Blending
View Synthesis Reference Software
Monoscopic Video Coding and Simulcast Coding of Multiview Video
Multiview Video Coding
3D Video Coding
Future Trends
Further Reading
Plenoptic imaging: Representation and processing
Light Representation: The Plenoptic Function Paradigm
Empowering the Plenoptic Function: Example Use Cases
Light Field Communication
Use case 1.1: Super-multiview home television
Use case 1.2: Immersive bidirectional communication
Light Field Editing
Use case 2.1: Photographic light field editing
Use case 2.2: Cinematic, mixed reality light field editing
Free Navigation
Use case 3.1: Omnidirectional 360 degree viewing of the surrounding environment
Use case 3.2: Free viewpoint sports event
Use case 3.3: Free viewpoint home television
Interactive All-Reality
Use case 4.1: Surveillance with depth recovery
Use case 4.2: Remote surgery with glasses-free 3D display
Use case 4.3: Interactive VR training
Use case 4.4: Augmented reality surveillance with light field editing
Plenoptic Acquisition and Representation Models
Plenoptic Data Coding
Plenoptic Data Rendering
Rendering Textured Meshes and Point Clouds
Interpolating a Light Field in a Microlens and/or Discrete Camera Array
View Synthesis in MVV Plus Depth
Refocusing With Microlens Light Field
Plenoptic Representations Relationships
Related Standardization Initiatives
Future Trends and Challenges
Further Reading
Visual attention, visual salience, and perceived interest in multimedia applications
Visual Attention in the Field of Multimedia: A Rising Story
From Vision Science to Engineering: Concepts Mash Up and Confusion
Classification of Attention Mechanisms
Overt and Covert Attention
Types of Overt Visual Attention Mechanisms
Endogenous and exogenous visual attention
Top-down and bottom-up attention
Interaction between the top-down and bottom-up attention mechanisms
Concept of perceived importance: Top-down attention is not equal to object of interest
Importance maps and salience maps: Which ground truth for perceived interest?
Computational Models of Visual Attention
Top-Down Computational Attention Models
Visual search task
Object recognition task
Driving, gaming, and sports
Information-Theory and Decision-Theory Models
Entropy/information maximization
Action-reward based
Spatio-Temporal Computational Models
Center-surround in the temporal domain
Detection of irregular actions/behavior
Graph-Based Methods
Graph flow techniques
Foreground-background segmentation
Random walk based
Salient boundary and object identification
Graph spectral methods
Scan-Path (Saccadic) Models
Memory-based modeling
Semantic region based
Residual information based
Oculomotor bias and memory based
Acquiring Ground Truth Visual Attention Data for Model Verification
Conducting an eye-tracking experiment
Existing eye and video tracking datasets for model validation
Processing the Eye-Tracking Data
Saccades and fixations
Saliency maps for images and videos
Scan-path generation
Analysis of disruptions
Testing the Computational Models
Statistical analysis of fixation and saccades
Similarity in saliency maps
Scan-path similarity metrics
Hybrid approaches
Applications of Visual Attention
Quality Assessment
Using saliency as a weighting factor of local distortions
Purely attention-based image quality measures: Visual attention deployment as a proxy for quality
Visual Attention in Multimedia Delivery
Interactive streaming
Dealing with packet loss
Image re-targeting
Applications in Medicine
Eye-tracking in disease detection
Eye-tracking in the training of medical personnel
Visual Attention and Immersive Media: A Rising Love Story
Stereoscopy and 3D displays
Virtual reality (VR)
Emerging science of QoE in multimedia applications: Concepts, experimental guidelines, and validation of models
QoE Definition and Influencing Factors
Factors Influencing QoE
System influence factors
Context influence factors
Human influence factors
QoE Measurement
Including System Influence Factors in QoE Measurement
Including Context Influence Factors in QoE Measurement
Including Human Influence Factors in QoE Measurement
Multidimensional Perceptual Scales for QoE Measurement
Scales and scaling methods
Direct Scaling Methods
Single Stimulus/Absolute Category Rating
ACR with hidden reference
Double Stimulus Impairment Scale/Degradation Category Rating
Double Stimulus Continuous Quality Scale
Processing of Results of Direct Scaling Methods
Mean scores calculation
Confidence intervals calculation
Screening of the subjects
Indirect Scaling Methods
Paired Comparison
Square design PC
Adaptive square design PC
Processing of Results of Indirect Scaling Methods
Thurstone-Mosteller model
Bradley-Terry-Luce model
Direct processing of pair comparison matrix
Influence Factors Significance Calculation
Calculating significance ratio in direct scaling experiments
Calculating significance ratio in indirect scaling experiments
Calculating SR in indirect scaling partial design experiments
Performance Evaluation of Objective QoE Estimators
Pearson's Linear Correlation Coefficient
Root-Mean-Squared Error
Epsilon-Insensitive Root-Mean-Squared Error
Outlier Ratio
Spearman's Rank Order Correlation Coefficient
Kendall's Rank Order Correlation Coefficient
Resolving Power Measures
RP accuracy
Classification plots
ROC-Based Performance Evaluation
Different vs. similar analysis
Better vs. worse analysis
Statistical comparison of objective algorithms
Compensation for Multiple Comparisons
Bonferroni correction procedure
Holm-Bonferroni correction procedure
Benjamini-Hochberg correction procedure
Computational photography
Breaking Precepts Underlying Photography
Sensor Resolution ≠ Image Resolution
Spatial multiplexing
Spatial multiplexing designs
Space-Time Bandwidth Product Can Be Greater Than the ADC Rate
Image models and coded spatiotemporal imaging
Depth of Field Can Be Changed Independent of Exposure Time
Coded apertures
Extended DoF via depth-invariant defocus blur
Light field cameras
Cameras With Novel Form Factors and Capabilities
Lensless Imaging
Fourier ptychography
Subdiffraction Limited Microscopy
Solving Inverse Problems
Time-of-Flight-Based Range Imaging
Principles of CWAM ToF
Four-bucket technique
Principles of discrete ToF
Applications of ToF cameras
Direct-Global Separation
Face detection with a 3D model
Related Work
Multiview models
3D view-based models
3D models
Cascade approaches
Face alignment
Parameter sensitive classifiers
Face detection with pose estimation
Face Detection Using a 3D Model
Face representation
Face 3D model
Energy Model
Inference Algorithm
Detecting Face Keypoints
Generating 3D Pose Candidates
Image-based regression
Ground truth 3D pose
Training details
Generating Face Candidates
Keypoint support
Scoring the Face Candidates
Local difference features
Modified LBF features
Local selected features (LSF)
Special features
Score function
Nonmaximal Suppression
Parameter Sensitive Model
Parameter sensitive linear model
Nonlinear model
Training the Parameter Sensitive Model
Training cost function
Fitting 3D Models
Fitting a Rigid Projection Transformation
Learning a 3D Model From 2D Annotations
Training dataset
Method nomenclature
Evaluation of Face Candidates
Face Detection Results
Evaluation of design decisions
Failure modes
Detection time
Conclusions and Future Trends
A survey on nonrigid 3D shape analysis
General Formulation
Invariance Requirements
Problem Statement and Taxonomy
Shape Spaces and Metrics
Kendall's Shape Space
Morphable models
The nonlinear nature of Kendall's shape space
Metrics That Capture Physical Deformations
The shape space of thin shells
The shape space of square-root representations
Transformation-Based Representations
Choice of the template T
Deformation models
Metrics on the space of deformations
Registration and Geodesics
Landmark-based elastic registration
Elastic registration as a re-parameterization problem
Geodesics using pullback metrics
Geodesics in the space of SRNFs
Comparison and discussion
Statistical Analysis Under Elastic Metrics
Statistical Analysis Using Non-Euclidean Metrics
Statistical Analysis by SRNF Inversion
Examples and Applications
Registration and Geodesic Deformations
Elastic Coregistration of 3D Shapes
Random 3D Model Synthesis
Other Applications
Summary and Perspectives
Topological and structural variabilities
Multiply interacting shapes
Markov models and MCMC algorithms in image processing
Introduction: The Probabilistic Approach in Image Analysis
Lattice-based Models and the Bayesian Paradigm
Parameter Estimation
Some Inverse Problems
Denoising and Deconvolution: The Restoration Problem
Segmentation Problem
Texture Modeling
Spatial Point Processes
Multiple Objects Detection
Population Evaluation
Road Network Detection
Further Reading
Image and video-based analytics
Scalable image informatics
Core Requirements
Core Concepts
Metadata Graph
Versioning, Provenance, and Queries
Basic micro-services
Uniform Metadata Representation and Query Orchestration: Data Service
Scalability of Micro-Services and Analysis
Analysis Extensions: Module Service
Uniform Representation of Heterogeneous Storage Subsystems: Blob Service
Uniform Access and Operations Over Data Files: Image Service and Table Service
Image service
Table service
Analysis Modules
Python and Matlab Scripting
Pipeline Support
Complex Module Execution Descriptors
Building on the Concepts: Sparse Images
Feature Services and Machine Learning
Feature Service
Connoisseur Service for Deep Learning
Connoisseur Module for Domain Experts
Application Example: Annotation and Classification of Underwater Images
Person re-identification
The re-identification Problem: Scenarios, Taxonomies, and Related Work
The Scenarios and Taxonomy
Related Work
Feature Extraction
Model Learning
Experimental Evaluation of re-id Datasets and Their Characteristics
The SDALF Approach
Object Segmentation
Symmetry-Based Silhouette Partition
Symmetry-Driven Accumulation of Local Features
Weighted color histograms
Recurrent high structured patches (RHSPs)
The Matching Phase
Metric Learning
Mahalanobis Metric Learning
Large Margin Nearest Neighbor
Efficient Impostor-Based Metric Learning
Conclusions and New Challenges
Social network inference in videos
Related Work
Video Shot Segmentation
Actor Recognition
Learning to Group Actors
Visual Features
Auditory Features
Grouping Criteria
Inferring Social Communities
Social Network Graph
Actor Interaction Model
Social Network Analysis
Assignment to Communities
Estimating Community Leader
The Dataset
Audiovisual Alignment
Social Affinity
Community Assignment
Actor Affinity
Community Leaders
Latent Features
Further Reading

Citation preview

Academic Press Library in Signal Processing, Volume 6 Image and Video Processing and Analysis and Computer Vision

Edited by Rama Chellappa Department of Electrical and Computer Engineering and Center for Automation Research, University of Maryland, College Park, MD, USA

Sergios Theodoridis Department of Informatics & Telecommunications, University of Athens, Greece

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1800, San Diego, CA 92101-4495, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2018 Elsevier Ltd. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website:

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN 978-0-12-811889-4

For information on all Academic Press publications visit our website at

Publisher: Mara Conner
Acquisition Editor: Tim Pitts
Editorial Project Manager: Charlotte Kent
Production Project Manager: Sujatha Thirugnana Sambandam
Cover Designer: Mark Rogers
Typeset by SPi Global, India

Contributors

Adrian Barbu, Statistics Department, Florida State University, Tallahassee, FL 32306
Patrick Le Callet, University of Nantes, Nantes, France
Marco Cristani, University of Verona, Verona, Italy
Eduardo A.B. da Silva, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
Xavier Descombes, Université Côte d’Azur, INRIA, I3S, IBV, I3S, Sophia-Antipolis cedex, France
Marek Domański, Poznań University of Technology, Poznań, Poland
Dmitry Fedorov, University of California, Santa Barbara, CA, United States
Gary Gramajo, New Ventures Analyst, Chick-fil-A Corporation, Atlanta, GA 30349
Ashish Gupta, Ohio State University, Columbus, OH, United States
Lukáš Krasula, University of Nantes, Nantes, France
Kristian Kvilekval, University of California, Santa Barbara, CA, United States
Gauthier Lafruit, Brussels University (French wing), Brussels, Belgium
Hamid Laga, Murdoch University, Perth, WA, Australia; Phenomics and Bioinformatics Research Centre, University of South Australia
Christian A. Lang, University of California, Santa Barbara, CA, United States
Nathan Lay, Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center, National Institutes of Health, Bethesda, MD 20892
B.S. Manjunath, University of California, Santa Barbara, CA, United States
Vittorio Murino, Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Genova, Italy
Adithya K. Pediredla, Rice University, Houston, TX
Fernando Pereira, Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisboa, Portugal
Yashas Rai, University of Nantes, Nantes, France
Aswin C. Sankaranarayanan, Carnegie Mellon University, Pittsburgh, PA
Olgierd Stankiewicz, Poznań University of Technology, Poznań, Poland
Ashok Veeraraghavan, Rice University, Houston, TX
Alper Yilmaz, Ohio State University, Columbus, OH, United States

About the Editors

Prof. Rama Chellappa is a distinguished university professor, a Minta Martin Professor in Engineering and chair of the Department of Electrical and Computer Engineering at the University of Maryland, College Park, MD. He received his BE (Hons.) degree in Electronics and Communication Engineering from the University of Madras, India and the ME (with distinction) degree from the Indian Institute of Science, Bangalore, India. He received his MSEE and PhD degrees in Electrical Engineering from Purdue University, West Lafayette, IN. At UMD, he was an affiliate professor of the Computer Science Department, Applied Mathematics, and Scientific Computing Program, a member of the Center for Automation Research and a permanent member of the Institute for Advanced Computer Studies. His current research interests span many areas in image processing, computer vision, and machine learning. Prof. Chellappa is a recipient of an NSF Presidential Young Investigator Award and four IBM Faculty Development Awards. He received the K.S. Fu Prize from the International Association for Pattern Recognition (IAPR). He is a recipient of the Society, Technical Achievement, and Meritorious Service Awards from the IEEE Signal Processing Society. He also received the Technical Achievement and Meritorious Service Awards from the IEEE Computer Society. Recently, he received the Inaugural Leadership Award from the IEEE Biometrics Council. At UMD, he received numerous college- and university-level recognitions for research, teaching, innovation, and mentoring of undergraduate students. In 2010, he was recognized as an Outstanding ECE by Purdue University. He received the Distinguished Alumni Award from the Indian Institute of Science in 2016. He is a fellow of IEEE, IAPR, OSA, AAAS, ACM, and AAAI and holds six patents to his credit. Prof. Chellappa served as the EIC of IEEE Transactions on Pattern Analysis and Machine Intelligence, as the co-EIC of Graphical Models and Image Processing, as an associate editor of four IEEE Transactions, and as a co-guest editor of many special issues, and is currently on the Editorial Board of SIAM Journal of Imaging Science and Image and Vision Computing. He has also served as the general and technical program chair/co-chair for several IEEE International and National Conferences and Workshops. He is a golden core member of the IEEE Computer Society, and has served as a distinguished lecturer of the IEEE Signal Processing Society and as the president of the IEEE Biometrics Council.




Sergios Theodoridis is currently professor of Signal Processing and Machine Learning in the Department of Informatics and Telecommunications of the University of Athens. His research interests lie in the areas of Adaptive Algorithms, Distributed and Sparsity-Aware Learning, Machine Learning and Pattern Recognition, Signal Processing for Audio Processing, and Retrieval. He is the author of the book “Machine Learning: A Bayesian and Optimization Perspective,” Academic Press, 2015, the co-author of the bestselling book “Pattern Recognition,” Academic Press, 4th ed., 2009, the co-author of the book “Introduction to Pattern Recognition: A MATLAB Approach,” Academic Press, 2010, the co-editor of the book “Efficient Algorithms for Signal Processing and System Identification”, Prentice-Hall 1993, and the co-author of three books in Greek, two of them for the Greek Open University. He currently serves as editor-in-chief for the IEEE Transactions on Signal Processing. He is editor-in-chief for the Signal Processing Book Series, Academic Press and co-editor-in-chief for the E-Reference Signal Processing, Elsevier. He is the co-author of seven papers that have received Best Paper Awards including the 2014 IEEE Signal Processing Magazine best paper award and the 2009 IEEE Computational Intelligence Society Transactions on Neural Networks Outstanding Paper Award. He is the recipient of the 2017 EURASIP Athanasios Papoulis Award, the 2014 IEEE Signal Processing Society Education Award and the 2014 EURASIP Meritorious Service Award. He has served as a Distinguished Lecturer for the IEEE Signal Processing as well as the Circuits and Systems Societies. He was Otto Mønsted Guest Professor, Technical University of Denmark, 2012, and holder of the Excellence Chair, Department of Signal Processing and Communications, University Carlos III, Madrid, Spain, 2011.
He has served as president of the European Association for Signal Processing (EURASIP), as a member of the Board of Governors for the IEEE CAS Society, as a member of the Board of Governors (Member-at-Large) of the IEEE SP Society and as a Chair of the Signal Processing Theory and Methods (SPTM) technical committee of IEEE SPS. He is a fellow of IET, a corresponding fellow of the Royal Society of Edinburgh (RSE), a fellow of EURASIP and a fellow of IEEE.

Section Editors

Section 1
Dr Frederic Dufaux is a CNRS Research Director at Laboratoire des Signaux et Systèmes (L2S, UMR 8506), CNRS-CentraleSupélec-Université Paris-Sud, where he is head of the Telecom and Networking division. He is also editor-in-chief of Signal Processing: Image Communication. Frederic received his MSc in Physics and PhD in Electrical Engineering from EPFL in 1990 and 1994, respectively. He has over 20 years of experience in research, previously holding positions at EPFL, Emitall Surveillance, Genimedia, Compaq, Digital Equipment, and MIT. Frederic is a fellow of IEEE. He was vice general chair of ICIP 2014. He is vice-chair of the IEEE SPS Multimedia Signal Processing (MMSP) Technical Committee, and will continue to be a chair in 2018 and 2019. He is the chair of the EURASIP Special Area Team on Visual Information Processing. He has been involved in the standardization of digital video and imaging technologies, participating in both the MPEG and JPEG Committees. He is the recipient of two ISO awards for his contributions. His research interests include image and video coding, 3D video, high dynamic range imaging, visual quality assessment, and video transmission over wireless networks. He is author or co-author of three books (“High Dynamic Range Video,” “Digital Holographic Data Representation and Compression,” and “Emerging Technologies for 3D Video”), more than 120 research publications, and 17 patents issued or pending.

Section 2
Anuj Srivastava is a professor of Statistics and a distinguished research professor at the Florida State University. He obtained his PhD degree in Electrical Engineering from Washington University in St. Louis in 1996 and was a visiting research associate at Division of Applied Mathematics at Brown University during 1996–97. He joined the Department of Statistics at the Florida State University in 1997 as an assistant professor. He was promoted to the associate professor position in 2003 and to the full professor position in 2007. He has held several visiting positions in Europe, including one as a Fulbright Scholar to the University of Lille, France. His areas of research include statistics on nonlinear manifolds, statistical image understanding, functional analysis, and statistical shape theory. He has published more than 200 papers in refereed journals and proceedings of refereed international conferences. He has been an associate editor for several statistics and engineering journals, including IEEE Transactions on PAMI, IP, and SP. He is a fellow of IEEE, IAPR, and ASA.

Section 3
Dr Sudeep Sarkar is a professor and chair of Computer Science and Engineering and the associate vice president for I-Corps Programs at the University of South Florida in Tampa. He received his MS and PhD degrees in Electrical Engineering, on a University Presidential Fellowship, from The Ohio State University. He is the recipient of the National Science Foundation CAREER Award in 1994, the USF Teaching Incentive Program Award for Undergraduate Teaching Excellence in 1997, the Outstanding Undergraduate Teaching Award in 1998, and the Ashford Distinguished Scholar Award in 2004. He is a fellow of the American Association for the Advancement of Science (AAAS), the Institute of Electrical and Electronics Engineers (IEEE), the International Association for Pattern Recognition (IAPR), and the American Institute for Medical and Biological Engineering (AIMBE), and a fellow and member of the Board of Directors of the National Academy of Inventors (NAI). He has served on many journal boards and is currently the editor-in-chief for Pattern Recognition Letters. He has 25 years of expertise in computer vision and pattern recognition algorithms and systems, holds four US patents, has licensed technologies, and has published high-impact journal and conference papers.


Amit Roy-Chowdhury received his PhD from the University of Maryland, College Park (UMCP) in Electrical and Computer Engineering in 2002 and joined the University of California, Riverside (UCR) in 2003 where he is a professor of Electrical and Computer Engineering and a cooperating faculty member in the Department of Computer Science and Engineering. He leads the Video Computing Group at UCR, with research interests in computer vision, image processing, pattern recognition, and statistical signal processing. His group is involved in research projects related to camera networks, human behavior modeling, face recognition, and bioimage analysis. Prof. Roy-Chowdhury’s research has been supported by various US government agencies including the National Science Foundation, and private industries such as Google, NVIDIA, CISCO, and Lockheed-Martin. His research group has published close to 200 papers in peer-reviewed journals and top conferences, including approximately 50 journal papers and another approximately 40 in highly competitive computer vision conferences. He is the first author of the book Camera Networks: The Acquisition and Analysis of Videos Over Wide Areas, the first monograph on the topic. His work on face recognition in art was featured widely in the news media, including a PBS/National Geographic documentary and in The Economist. He is on the editorial boards of major journals and program committees of the main conferences in his area. He is a fellow of the IAPR.


Introduction

Following the success of the first edition of the Signal Processing e-reference project, which was very well received by the signal processing community, we are pleased to present the second edition. Our effort in this second phase of the project was to fill in some remaining gaps from the first edition, but mainly to take into account recent advancements in the general areas of signal, image and video processing, and analytics. The last 5 years, although a short period from a historical perspective, were very dense in terms of results and ideas in the context of science, engineering, and technology. The availability of massive data, which we refer to as Big Data, together with advances in Machine Learning and affordable GPUs, has opened up new areas and opportunities. In particular, the application of deep learning networks to problems such as face/object detection, object recognition, and face verification/recognition has demonstrated superior performance that was not thought possible just a few years back. We are at a time when “caffe,” the software for implementing deep networks, is probably used more than FFT! We take comfort that the basic module in caffe is convolution, the basic building block of signal and image processing. While one cannot argue against the impressive performance of deep learning methods for a wide variety of problems, time-tested concepts such as statistical models and inference and the role of geometry and physics will continue to be relevant and may even enhance the performance and generalizability of deep learning networks. Likewise, the power of nonlinear devices, hierarchical computing structures, and optimization strategies that are central to deep learning methodologies will inspire new solutions to many problems in signal and image processing. The new chapters that appear in these volumes offer the readers the means to keep track of some of the changes that take place in the respective areas.
We would like to thank the associate editors for their hard work in attracting top scientists to contribute chapters in very hot and timely research areas and, above all, the authors who contributed their time to write the chapters.



Multiview video: Acquisition, processing, compression, and virtual view rendering


Olgierd Stankiewicz*, Gauthier Lafruit†, Marek Domański*
Poznań University of Technology, Poznań, Poland*
Brussels University (French wing), Brussels, Belgium†


2D  two-dimensional
3D  three-dimensional
AVC  advanced video coding
CAVE  Cave Automatic Virtual Environment
DERS  Depth Estimation Reference Software
DIBR  depth image-based rendering
FTV  Free viewpoint TV
HEVC  high efficiency video coding
3D-HEVC  high efficiency video coding with depth maps
HMD  Head Mounted Device
IEC  International Electrotechnical Commission
ISO  International Organization for Standardization
ITU-T  International Telecommunications Union-Telecommunication sector
JPEG  Joint Photographic Experts Group
MPEG  Moving Picture Experts Group
MVD  multiview video plus depth
VCEG  Video Coding Experts Group
VR  Virtual Reality
VSRS  View Synthesis Reference Software

1.1 MULTIVIEW VIDEO

Ever since the success of three-dimensional (3D) games, where the user selects his/her own viewpoint of a premodeled 3D scene, there has been growing interest in mimicking this functionality with real content. In 1999, the science-fiction movie The Matrix popularized the bullet time effect, showing a continuously changing camera viewpoint around the action scene. This was obtained by cleverly triggering and interpolating hundreds of camera views, cf. Fig. 1.1, which we refer to as multiview video in the remainder of the chapter. Think of the possibilities if we could create such a Free Navigation effect on the fly at low production cost, e.g., in sports events, where each individual TV viewer can choose his/her own viewpoint of the scene. Moreover, synthesizing two adjacent viewpoints at any moment in time supports stereoscopic viewing, and closing the loop with a stereoscopic Head Mounted Device (HMD) that continuously renders the stereo viewpoints corresponding to the actual HMD position brings Virtual Reality (VR) one step closer to a “Teleported Reality,” also called Cinematic VR. Such topics are actively explored in many audiovisual standardization committees. Joint Photographic Experts Group (JPEG) and Moving Picture Experts Group (MPEG), for instance, are heavily involved in getting new multimedia functionalities off the ground, far beyond simple two-dimensional (2D) pictures and video. More specifically, 3D data representations and coding formats are intensely studied to target a wide variety of applications [1], including VR.

FIG. 1.1 Hundreds of cameras to capture The Matrix bullet effect. © Wikipedia.

1.1.1 MULTIVIEW VIDEO AND 3D GRAPHIC REPRESENTATION FORMATS FOR VR

What distinguishes VR from single-view video is that VR renders content from many user-requested viewpoints. This poses many challenges in modeling the data, especially when aiming for photorealistic Cinematic VR. A striking example of this challenge is the multiformat data representation shown in Fig. 1.2 [2], which captures actors (e.g., Haka dancers) with close to a hundred multimodal cameras (RGB, IR, depth) and successively transforms this data into Point Clouds and 3D Meshes with 2D Textures. In this way, rendering under any perspective viewpoint becomes possible at relatively low cost at the client side, through the OpenGL/DirectX 3D rendering pipeline.


FIG. 1.2 From multimodal camera input (left) to Point Clouds and 3D meshes (middle) for any perspective viewpoint visualization (right). © Microsoft.

These successive transformations, however, involve heavy preprocessing steps, and if insufficient care is taken in cleaning and modeling the data, the rendering might look synthetic, with overly sharp object silhouettes and unnatural shadings, as if they came out of a discolored old movie. Solutions that do not explicitly involve geometry information are a good alternative for synthesizing virtual views. Because they merely use the input RGB images and depth images, they are referred to as depth image-based rendering (DIBR) and are the main subject of this chapter.

1.1.2 SUPER-MULTIVIEW VIDEO FOR 3D LIGHT FIELD DISPLAYS

The feeling of immersion can also be obtained with large super-multiview (SMV) or Light Field displays instead of HMDs, cf. Fig. 1.3, which project slightly varying images in hundreds of adjacent directions, thereby creating a 3D Light Field with perceptually correct views. Ref. [3] interpolates a couple of dozen camera feeds to create these hundreds of views. Anybody in front of the display will then capture with his/her eyes the correct stereo pair of images, without wearing stereo glasses. In a sense, such a Light Field display is the glasses-free Cave Automatic Virtual Environment (CAVE) equivalent of Cinematic VR with HMD.

1.1.3 DIBR SMOOTH VIEW INTERPOLATION

Image-based rendering (IBR) solutions, which solely use the input views and some image processing, are more appealing for photorealistic rendering, especially in live transmission applications. Unfortunately, the solution is less simple than Fig. 1.1 suggests, since it involves much more than switching between input camera views. Particularly challenging are the discontinuous jumps between adjacent views, which have to be resolved with DIBR techniques, where depth plays a critical role in nonlinearly interpolating adjacent views. Figs. 1.4 [6] and 1.5 [7] (the reader is invited to use




FIG. 1.3 Holografika Super-MultiView 3D display at Brussels University: (left) clear perspective changes, and (right) an MPEG test sequence visualized. © Brussels University, VUB.

FIG. 1.4 Fencing view interpolation. © Poznań University of Technology; M. Domański, A. Dziembowski, A. Grzelka, D. Mieloch, O. Stankiewicz, K. Wegner, Multiview Test Video Sequences for Free Navigation Exploration Obtained Using Pairs of Cameras, ISO/IEC JTC 1/SC 29/WG 11 Doc. M38247, Geneva, Switzerland, 2016, experiencing-new-points-view-free-navigation.


FIG. 1.5 Soccer view interpolation. © Hasselt University; P. Goorts, S. Maesen, M. Dumont, S. Rogmans, P. Bekaert, Free viewpoint video for soccer using histogram-based validity maps in plane sweeping, in: Proceedings of The International Conference on Computer Vision Theory and Applications (VISAPP), 2014, pp. 378–386, v=6MzeXeavE1s.

the links in the references for a video animation) show that very good results can be obtained with as few as a dozen cameras cleverly positioned around the scene.

1.1.4 BASIC PRINCIPLES OF DIBR

View interpolation creates a continuous visual transition between adjacent discrete camera views by synthesizing intermediate views between them with DIBR. To better understand this depth-based view interpolation, let us explain it with a simple experiment: the view interpolation of fingers from both hands. Let us place one finger of our left hand at arm's length in front of our nose, and one finger of our right hand touching the tip of our nose, as shown in Fig. 1.6. Now let us blink our eyes, alternating between left and right eye, keeping one eye open while the other is closed. What do we observe? When our left eye is open, we see our front right-hand finger a couple of centimeters to the right of our rear left-hand finger. Conversely, when our right eye is open, we see our front right-hand finger a couple of centimeters to the left of our rear left-hand finger. Our front finger, which is very close to our eyes, clearly undergoes a large displacement when alternating between our eyes, while our rear finger hardly moves. This phenomenon is called disparity, which by definition is the displacement of a 3D point in space from the left to the right camera view (our eyes). The disparity of any 3D point in space is inversely proportional to its depth: our rear left-hand finger, with a large depth, will have a much smaller disparity than our front




FIG. 1.6 DIBR view interpolation basic principles.

right-hand finger. This phenomenon is exactly what is exploited for depth-based view interpolation. Indeed, suppose we would like to look at the scene from a third eye, virtually positioned more or less halfway between our left and right eye. Let us say that this virtual third eye is positioned, starting from the left eye, at 3/8 of the distance between the left and right eye, cf. Fig. 1.7, which shows the left and right eye views of Fig. 1.6 combined.

FIG. 1.7 DIBR principle in synthesizing a new virtual view at fractional position α from two camera views.


If we want to synthesize the image that this virtual third eye would see, all we have to do is displace every pixel we see in our right eye to the right by 5/8 (= 1 − 3/8) of its corresponding disparity δp, where the subscript p indicates that the disparity depends on the pixel p. In general, for the virtual third eye at fractional position α (α = 0 at the left eye, and α = 1 at the right eye), the pixels we see in our right eye should be displaced to the right by (1 − α) times their disparity δp in order to create the corresponding virtual view. For α = 0 (the virtual view corresponds to the left view), all pixels of the right view will be displaced to the right by their full disparity δp, recovering the left view almost perfectly.

Notice the subtle “almost perfectly” in the previous sentence. Most often, adjacent pixels on either side of an object silhouette will not be displaced by the same amount from left to right, since their depths, and hence their disparities, are likely to be very different. This results in unfilled, empty spaces around the objects' silhouettes in the virtual view, which is solved by:

• Displacing not only the pixels from the right image to the virtual position α, but also the pixels from the left image to the virtual position α
• Blending the so-obtained images, which drastically reduces the empty spaces
• Filling the remaining empty cracks with inpainting techniques, cf. Section 1.5.3, which may be compared to ink that is dropped at one border of the crack and diffuses its color to the other side of the crack (note that the diffusion process propagates in both directions, in a similar way as the aforementioned left-to-virtual and right-to-virtual image transformations are done)

More details about the DIBR steps are provided in Section 1.5.
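The warping and blending steps listed above can be sketched for a single rectified grayscale scanline. This is only a minimal NumPy illustration under stated assumptions, not the View Synthesis Reference Software: the function names `warp_row` and `interpolate_row`, the rounding of target positions to whole pixels, and the rule that the pixel with the larger disparity wins a collision are all choices made for this example. Disparity is taken as the nonnegative left-to-right displacement δp described in the text.

```python
import numpy as np

def warp_row(colors, disparity, shift_factor):
    """Forward-warp one scanline: pixel x moves to x + shift_factor * disparity[x].

    Returns the warped row and a boolean mask of the filled positions.
    If several pixels land on the same target, the one with the larger
    disparity (the closer object) is kept.
    """
    width = colors.shape[0]
    out = np.zeros(width, dtype=float)
    best_disp = np.full(width, -np.inf)
    filled = np.zeros(width, dtype=bool)
    for x in range(width):
        target = int(round(x + shift_factor * disparity[x]))
        if 0 <= target < width and disparity[x] > best_disp[target]:
            out[target] = colors[x]
            best_disp[target] = disparity[x]
            filled[target] = True
    return out, filled

def interpolate_row(left, right, disp_left, disp_right, alpha):
    """Synthesize the row seen at fractional position alpha (0 = left, 1 = right)."""
    # Left-view pixels are displaced to the left by alpha times their disparity.
    from_left, mask_l = warp_row(left, disp_left, -alpha)
    # Right-view pixels are displaced to the right by (1 - alpha) times their disparity.
    from_right, mask_r = warp_row(right, disp_right, 1.0 - alpha)
    # Take whichever view is available, and blend where both contribute.
    virtual = np.where(mask_l, from_left, from_right)
    both = mask_l & mask_r
    virtual[both] = (1.0 - alpha) * from_left[both] + alpha * from_right[both]
    holes = ~(mask_l | mask_r)  # positions left over for inpainting
    return virtual, holes

# Toy scene: a bright foreground object (disparity 4) on a dark background (disparity 0).
left = np.full(16, 10.0); left[8:10] = 200.0
right = np.full(16, 10.0); right[4:6] = 200.0
disp_left = np.zeros(16); disp_left[8:10] = 4.0
disp_right = np.zeros(16); disp_right[4:6] = 4.0
virtual, holes = interpolate_row(left, right, disp_left, disp_right, alpha=0.5)
```

In this toy scene the object lands halfway between its two observed positions, and the background uncovered in one warped view is supplied by the other, which is the reason for warping from both sides before inpainting.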

1.1.5 DIBR VS. POINT CLOUDS

Is DIBR on multiview video sequences the best approach to obtain nice Free Navigation effects? Not necessarily. As already shown in Fig. 1.2, many different data formats can serve the purpose of Free Navigation. In particular, the Point Cloud data representation of Fig. 1.8 [8] provides all means to freely navigate through the scene. At large distances, cf. Fig. 1.8(A), the effect is very nice. However, at shorter viewing distances, cf. Fig. 1.8(C)-(E), blocky splatting effects with increased point sizes start to occur. Fig. 1.9 gives a glimpse of the preprocessing and rendering quality differences between DIBR (left; use the link of the reference to see the animation [9]) and Point Clouds (right). Moreover, to obtain this high-density Point Cloud, many different acquisition positions of the Time-of-Flight Lidar scanner had to be covered, followed by intensive data preparation and preprocessing steps [10]. DIBR techniques, on the contrary, restrict themselves to depth extraction and image warping operations to obtain an acceptable rendering without too many noticeable artifacts. Nevertheless, Fig. 1.10 shows that DIBR and Point Cloud representations are mathematically very similar. In essence, in a Point Cloud representation, each 3D




FIG. 1.8 London Science Museum Point Cloud (A) decomposed into individual clusters (B) taken from the position indicated by the empty circles. Splat rendering in (C) to (E). ScanLab projects Point Cloud data sets,

point in space captured by a Time-of-Flight active depth sensing device is given by its position (x,y,z) and the light colors it emits in all directions (referred to as a BRDF). Its complementary image-based representation will capture these light rays into RGB images, e.g., pixel (u,v) captures the light ray under angles (θ, φ) emanating from the point (x,y,z). Capturing multiple images from different viewpoints gives the possibility to estimate depth for each point (e.g., by stereo matching), adding depth and hence resulting in a fully corresponding DIBR representation of the Point Cloud.


FIG. 1.9 DIBR perspective view synthesis (left) vs. Point Cloud rendering (right).

FIG. 1.10 Point Cloud data format (left) and its equivalence with DIBR (right).

The similarity between DIBR and Point Clouds will become more apparent in Section 1.5. More details about this discussion can also be found in the JPEG/MPEG JAhG (Joint Ad-Hoc Group) Light/Sound Fields technical output document [11], as well as in Ref. [1]. Though the underlying data formats are known to be mathematically equivalent between DIBR and Point Clouds, the compression and rendering qualities may differ and heavily depend on the preprocessing level of the acquired data. DIBR is the




FIG. 1.11 MPEG test sequences used in the DIBR exploration activities.

preferred technique proposed in the MPEG-FTV group for Free Navigation and SMV applications. We have therefore selected the DIBR technique to be presented in more detail in the remainder of the chapter.

1.1.6 DIBR, MULTIVIEW VIDEO, AND MPEG STANDARDIZATION

MPEG, whose mission is to develop audiovisual compression standards for the industry, is currently actively involved in all aforementioned techniques: DIBR, Point Cloud, and 3D mesh coding. The interested reader is referred to Ref. [1] for an in-depth overview of use cases and 3D-related data representation and coding formats. In particular, in MPEG terminology, the multicamera DIBR technology to which this chapter is devoted is referred to as multiview video. Multiview test sequences are made available to the community to help develop improved depth estimation, view synthesis, and compression techniques, cf., respectively, Sections 1.4–1.6. Some of the test sequences are presented in Fig. 1.11; they have been acquired with the acquisition systems presented in Section 1.2.4.

1.2 MULTIVIEW VIDEO ACQUISITION

1.2.1 MULTIVIEW FUNDAMENTALS

The term “multiview” is widely used to describe various systems. In this section we are dealing exclusively with the systems where the outputs from all cameras are


processed jointly in order to produce some general representation of the scene. Therefore, for example, video surveillance systems in which video is registered and viewed from each camera independently are outside the scope of our considerations. The goal of the multiview video acquisition considered here is to produce a 3D representation of a scene. Such a 3D representation is needed to enable functionalities related to VR free navigation, e.g., to synthesize virtual views as seen from arbitrary positions in the scene. Currently, the most commonly used format of such a representation is multiview video plus depth (MVD). In MVD, the 2D (flat) images from the cameras are augmented with depth maps, which carry additional information about the third dimension of the scene. Depth maps are matrices of values reflecting distances between the camera and points in the scene. Typically, depth maps are presented as gray-scale video, cf. Fig. 1.12, where the closer objects are marked in high intensity (light) and the far objects are marked in low intensity (dark).

The particular meaning of the term “depth” varies in the literature, and it is often used interchangeably with other terms such as disparity, normalized disparity, and z-value. In fact, in Fig. 1.12, the “depth” represented by the gray-scale value of the pixels corresponds to disparity: the closer pixels have a higher value, which is in line with a higher disparity value. In order to understand the exact meanings of the terms “depth” and “disparity,” let us consider a pinhole model of a camera. A pinhole camera is a simplified Camera obscura in which the lenses have been replaced with a single pinhole, which acts as an infinitely small aperture. Light rays from the 3D scene pass through this single point and project an inverted image on the opposite side of the box, cf. Fig. 1.13A. In some formulations of the model of the pinhole camera, the fact that the projected image is inverted is often omitted by

FIG. 1.12 The original view (A) and corresponding depth map (B) of a single frame of “Poznan Blocks” [104, 105] 3D video test sequence. In the depth map, the closer objects are marked in high intensity (light) and the far objects are marked in low intensity (dark). © Poznań University of Technology.



FIG. 1.13 A visualization of a pinhole camera (A), its simplified model (B), and the corresponding coordinate system (C).


assuming that the image plane (virtual) is placed on the same side of the pinhole as the scene, cf. Fig. 1.13B. Therefore the aperture equation is:

\[ \frac{f}{z} = \frac{u}{x} = \frac{v}{y}, \]

where f is the focal length (distance of the image plane along the z axis), u and v are coordinates on the projected image plane of some point with x, y, z coordinates in 3D space. In the mentioned equation the projected coordinate system UV is shifted so that it is centered on the principal point P, cf. Fig. 1.13C. In such a case the perspective projection is as follows:

\[ u = \frac{x}{z} f, \qquad v = \frac{y}{z} f. \]
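As a small numerical illustration of the perspective projection above, the following Python sketch projects a 3D point given in camera coordinates onto the image plane. The function name `project_pinhole` and the sample values are hypothetical, and the focal length is assumed to be expressed in the same units as u and v (for instance, pixels).

```python
def project_pinhole(x, y, z, f):
    """Project the 3D point (x, y, z), given in camera coordinates, to (u, v).

    Implements u = f * x / z and v = f * y / z, with (u, v) measured
    relative to the principal point P.
    """
    if z <= 0:
        raise ValueError("the point must lie in front of the camera (z > 0)")
    return f * x / z, f * y / z

# The same lateral offset projects closer to the principal point as z grows.
near = project_pinhole(0.5, 0.25, 2.0, f=1000.0)   # (250.0, 125.0)
far = project_pinhole(0.5, 0.25, 10.0, f=1000.0)   # (50.0, 25.0)
```

The second call shows the projective effect used throughout the rest of the section: moving the point five times farther away shrinks its image coordinates by the same factor.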


Using such a pinhole camera model, we can consider an object F, cf. Fig. 1.14, observed from two cameras (left and right). In each of the views, the object is seen from a different angle and position, cf. Fig. 1.14, and therefore its observed positions on the image planes are different (FL in the left image and FR in the right image). With only a single view (e.g., the left one) we cannot tell the distance of the object observed at position FL. Instead of the real position F, the object could equally well be at position G, I, or J; the positions F, G, I, and J would be observed in the right view at positions FR,

FIG. 1.14 Epipolar line as a projection to the right view of all potential positions of an object residing on ray ρ, seen in the left view as a single point.



GR, IR, and JR, respectively. The set of all potential 3D positions of the object observed at FL, projected onto the other image plane, is geometrically constrained to lie along a so-called epipolar line. An epipolar line is the projection of a ray, pointing from the optical center of one camera to a 3D point, onto the image plane of another view. For example, in Fig. 1.14, a ray ρ (starting in OL and passing through F) is seen by the left camera as a point because it is directly in line with that camera’s projection. However, the right camera sees this ray ρ in its image plane as a line. Such a projected line in the view of the right camera is called an epipolar line. An epipolar line marks the potential corresponding points in the right view for pixel FL in the left view. The positions and orientations of epipolar lines are determined by the locations of the cameras, their orientations, and other parameters, such as focal length, angle of view, etc. Typically, all of those parameters are gathered in the form of intrinsic and extrinsic camera parameter matrices, cf. Section 1.3.1. In the general case, epipolar lines may lie along arbitrary angles, cf. Fig. 1.16 (top).

A common and important case for considering depth is the linear arrangement of the cameras with parallel axes of the viewpoints, cf. Fig. 1.15. Such a setup can be obtained either by precise physical positioning of the cameras or by postprocessing, called rectification, cf. Fig. 1.16 (bottom). In such a linear arrangement, the image planes of all views coincide, cf. Fig. 1.16 (bottom), and the epipolar lines are all aligned with the horizontal axes of the images. Therefore the differences in observed positions of objects become disparities along horizontal rows. Due to the projective nature of such a video system with a linear camera arrangement, the further the object is from the camera, the closer its projection is to the center of the image plane.

1.2.2 DEPTH IN STEREO AND MULTIVIEW VIDEO

Let us now define various terms related to the depth, using the camera geometry of Fig. 1.17.

FIG. 1.15 Linear arrangement of the cameras with parallel axes of the viewpoints.


FIG. 1.16 Original (top) and rectified (bottom) images from “Poznan Blocks” [104, 105] multiview video sequence for some selected pair of views. Exemplary epipolar line for ray ρ has been marked with a solid line. © Poznań University of Technology.

FIG. 1.17 Exemplary objects E, F, and G—projected onto image planes of two cameras.



For a given camera (e.g., the left one in Fig. 1.14) the distance between the optical center (OL) and the object (e.g., represented by a single pixel) along the z axis is called the z-value. The unit of the z-value is the same as the distance unit of the scene. This may be, e.g., meters or inches, and depends on the scale of the represented scene. The distance between the observed positions of object F, with FL in the left image and FR in the right image, is called disparity:

\[ \Delta_F = F_L - F_R. \]


In a general case, the disparity is a vector lying along the respective epipolar lines. In case of linear rectified camera arrangements, disparity vectors lie along horizontal lines. The z-value and the disparity between two views are mathematically related and can be derived if the positions and models of the cameras are known. For a stereoscopic pair of cameras, horizontally aligned (or rectified), distant by the length of baseline B, the z-value ZF of point F can be calculated as follows:

\[ Z_F = f \cdot B \cdot \frac{1}{\Delta_F}, \]

where ZF is the distance along the z-axis of point F from the camera’s optical center, B is the baseline distance between the pair of cameras, and ΔF is the disparity of point F. A similar but more complex relation between the z-value and the disparity can be found for other arrangements of camera pairs.

The term “depth,” depending on the context, may have different meanings: it is often used interchangeably with disparity and with z-value distance, or sometimes as a generalized term describing both. Usually, in the literature, the term “depth” simply refers to the data stored in depth maps, e.g., in the form of images or files on disk. For example, in OpenEXR Blender files [12] the depth is stored in the form of z-values represented as 32-bit floating point numbers (IEEE-754 [13]). The most common format of depth value representation is an 8-bit integer representing the so-called normalized disparity. The disparity is then normalized so that the represented range (0–255) covers the range of disparities from Δmin to Δmax, corresponding to z-values from zfar to znear, respectively. Such a format combines the advantages of representing depth as a disparity and as a z-value. The representation of depth as disparity better suits the human visual system (the foreground is represented at a finer quantization than the background); moreover, disparity can be used almost directly for view synthesis and is actually a direct product of depth estimation algorithms. However, the representation of depth as disparity depends on the considered camera arrangement (e.g., disparity is proportional to the baseline B). The representation of depth as z-value does not have this disadvantage. Neither does normalized disparity, because the disparity values are normalized with respect to the zfar to znear range.
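The relation between z-value and disparity for a rectified, horizontally aligned pair can be checked numerically with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the focal length and disparity are assumed to be in pixels, and the baseline and z-value in meters.

```python
def z_from_disparity(disparity, focal_length, baseline):
    """Z = f * B / disparity for a rectified, horizontally aligned camera pair."""
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point at finite distance")
    return focal_length * baseline / disparity

def disparity_from_z(z, focal_length, baseline):
    """Inverse relation: disparity = f * B / Z."""
    return focal_length * baseline / z

f_px = 1000.0       # assumed focal length in pixels
baseline_m = 0.125  # assumed baseline in meters

z_close = z_from_disparity(50.0, f_px, baseline_m)  # 2.5 m
z_far = z_from_disparity(12.5, f_px, baseline_m)    # 10.0 m

# Disparity is proportional to the baseline: doubling B doubles the disparity.
d_single = disparity_from_z(5.0, f_px, baseline_m)
d_double = disparity_from_z(5.0, f_px, 2 * baseline_m)
```

The last two lines illustrate why a disparity value is only meaningful together with the camera arrangement, whereas the z-value is not tied to a particular baseline.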


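The depth defined as normalized disparity can be calculated from the following equation:

\[ d = d_{\max} \cdot \frac{\Delta - \Delta_{\min}}{\Delta_{\max} - \Delta_{\min}}, \qquad d = d_{\max} \cdot \frac{\dfrac{1}{Z} - \dfrac{1}{z_{far}}}{\dfrac{1}{z_{near}} - \dfrac{1}{z_{far}}}, \]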

where dmax is the maximal value for a given integer binary representation, e.g., 255 for 8-bit integers, or 65,535 for 16-bit integers. The selected depth range (e.g., Δmin to Δmax, zfar to znear), together with the bit width of the integers used, determines the precision with which the depth is represented. For a given bit width, e.g., 8-bit, the broader the selected depth range, the worse its precision. On the other hand, enforcing sufficient precision narrows the depth range, which at some point may make it impossible to represent some objects in the scene. For sophisticated scenes with nonlinear camera arrangements, it is often impossible to meet both requirements at the same time, and the use of a wider integer range (e.g., 16-bit) becomes mandatory.

Apart from the depth precision, which is an attribute of the representation format, another important factor is the depth accuracy, which results from the method of depth acquisition. The depth can be acquired with specialized equipment, like depth sensors, or by means of algorithmic estimation, as explained in Section 1.4. In the case of depth sensors, the achievable depth accuracy depends on the technology used. For example, for Microsoft Kinect [14], cf. Fig. 1.18 (top), the depth error ranges from 2 mm for close objects to 7 cm for far objects [14]. For the Mesa Imaging SR4000 [15] depth sensor, cf. Fig. 1.18 (bottom), the accuracy is about 15 mm. It must be noted, however, that the accuracy of depth sensors depends on factors like the reflectivity of objects in the scene, the distance of the scene, etc.

In the case of depth estimation algorithms, like the ones in Section 1.4, the accuracy is strongly influenced by the resolution of the analyzed images. The depth is estimated by finding image correspondences between the views, and thus the direct outcome of such estimation is disparity. The measurement of distances in the images is limited by their resolution, and thus the higher the resolution, the more accurate the depth is [16]. It is also crucial that the image frames from different cameras are taken at the same moment, and thus the cameras must be synchronized in time. The desired degree of synchronization is at the subframe level, e.g., per scan line. This is one of the factors influencing the choice of equipment for building a 3D camera system, which is the topic of the next section.
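The trade-off between depth range and precision in the normalized disparity format can be made concrete with a short NumPy sketch. The function names and the chosen range (znear = 1 m, zfar = 100 m) are assumptions for illustration only; the conversion follows the second form of the normalized disparity equation, with rounding to the nearest integer code.

```python
import numpy as np

def z_to_normalized_disparity(z, z_near, z_far, bits=8):
    """Quantize z-values to integer normalized disparity codes in [0, d_max]."""
    d_max = 2**bits - 1
    z = np.asarray(z, dtype=float)
    d = d_max * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return np.clip(np.rint(d), 0, d_max).astype(np.uint16)

def normalized_disparity_to_z(d, z_near, z_far, bits=8):
    """Recover the z-value represented by a normalized disparity code."""
    d_max = 2**bits - 1
    d = np.asarray(d, dtype=float)
    inv_z = (d / d_max) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z

z_near, z_far = 1.0, 100.0
codes = z_to_normalized_disparity([1.0, 2.0, 100.0], z_near, z_far)

# Size of one quantization step, expressed in z, at both ends of the range.
step_near = normalized_disparity_to_z(254, z_near, z_far) - normalized_disparity_to_z(255, z_near, z_far)
step_far = normalized_disparity_to_z(0, z_near, z_far) - normalized_disparity_to_z(1, z_near, z_far)
step_far_16 = (normalized_disparity_to_z(0, z_near, z_far, bits=16)
               - normalized_disparity_to_z(1, z_near, z_far, bits=16))
```

Comparing `step_near` with `step_far` shows the finer quantization of the foreground mentioned earlier, and comparing `step_far` with `step_far_16` shows how a wider integer range restores precision for a broad depth range.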




FIG. 1.18 Photographs of the Microsoft Kinect depth sensor (left) and the Mesa Imaging SR4000 depth sensor (right).

1.2.3 MULTICAMERA SYSTEM

The features of a multiview camera system are largely dictated by the specifics of current multiview technology, which is not yet mature. On the one hand, there are no comprehensive industrial solutions yet, and on the other hand, the construction of an experimental multiview system is a technically very demanding task. Effectively, a universal system that would satisfy all needs cannot currently be built. In order to develop a useful multiview camera system, many compromises have to be made. As will be shown, some of the constraints are interconnected and cannot be considered separately. Therefore it is practical to start with some set of initial requirements and then refine them according to the technical possibilities. The issues shown in Fig. 1.19 should be considered, together with the practical setup choices of Table 1.1.

Number of cameras: perhaps the central parameter of the whole multiview camera system definition. We consider three or more cameras. Fewer than three cameras do not allow the production of real multiview material, useful for SMV or depth estimation. To some extent, the more cameras we have the better, but typically there are three main limitations: budget, size, and throughput (of processing and transmission/storage). The first two are very loose, while the latter can be a serious problem.

Processing throughput: depending on the target usage (real time or off-line), some of the processing has to be done on the fly. Depending on our needs, we should anticipate computational power for stream synchronization, rectification, color


FIG. 1.19 Mutual relations among features and between parameters of a multiview camera system.

Table 1.1 Comparison of Camera Arrangements

Feature | Linear Arrangement | Angular Arrangement (Convergent) | Angular Arrangement (Divergent)
Depth estimation | Simple, many cameras can be used. Disparity estimation by horizontal search only | Difficult, simple disparity estimation cannot be used | Difficult, simple disparity estimation cannot be used
System calibration (rectification) | Simpler | More difficult | More difficult
Camera setup (arrangement) | Relatively easy to set the cameras in a line | More difficult to set the cameras in an arc | More difficult to set the cameras in an arc
Camera rig | Compact, can be mobile | Depends on the arc size; can be large and difficult to move | Depends on the arc size; can be large and difficult to move
Parallax/3D effect | Limited, especially in the case of distant objects | Great inside the arc. Not good at looking away | Good for looking-around feature, but bad at close distances
Target display system | Stereoscopic display, auto-stereoscopic display | Stereoscopic; a wide arc can be used for SMV displays. Data for SMV displays is more difficult to prepare. VR glasses | Panoramic displays and VR glasses



correction, and depth map estimation. We should also estimate the required bitrate for our medium. For uncompressed raw bitstreams we can use the following equation:

\[ B = N \cdot W \cdot H \cdot C \cdot D \cdot F, \]

where B is the required bitrate (in bits/s) of a medium (this may be the transmission bandwidth of a connection or the bitrate of a file); N is the number of cameras; W and H are, respectively, the width and height (in pixels) of the picture; C is the number of full-scale color components (3.0 for 4:4:4 format, 1.5 for 4:2:0 format); D is the bit-depth (8 for typical images); and F is the frame rate (in Hz). For example, for an uncompressed 8-bit 4:2:0 raw data stream from nine HD (1920 × 1080) resolution cameras at a frame rate of 25 frames/s, we get a required bitrate of about 5.6 Gbit/s.

Indoor/outdoor usage: the previous parameters can be greatly influenced by the space available to install the system. Outdoor systems cannot use long-range cable connections to any workstations, while wireless connections offer a much lower bitrate, so some video compression might be required. Outdoor systems are typically desired to be mobile, which is very difficult to achieve if uncompressed video is acquired. In such systems, the maximum cable length is an important factor that influences mobility and the capability for outdoor video acquisition.

Mobility: this means not only the ability of the system to be easily rearranged in a different place, but also the ability to shoot moving scenes. Depending on the camera arrangement and mounting, wheels, caterpillar tracks, or rails may be required.

Camera arrangement: the most commonly used camera setups are linear and circular. The linear setup offers good visual sensations for distant scenes and is easier to handle from the software point of view (rectification and depth map algorithms are commonly investigated for the linear setup). On the other hand, circular setups give a better 3D depth sensation in the case of short-range, indoor scenes. They also more closely resemble the human visual system. Depending on the mobility and on the number and spacing of the cameras, the use of a rig should be considered. Although a rig can be bulky and heavy, it offers much better stiffness and rigidity than separate camera tripods.

Camera spacing: this also depends on the type of target scenes. Short-range scenes require a short intercamera base distance, which is often chosen as close as possible to the human interocular distance (about 6.5 cm). Shooting distant scenes requires longer intercamera distances, so that the differences between the views acquired by the cameras are big enough.

Camera type: probably the most sensitive feature of the system. The cameras must be selected with special care, since the selection of the camera type implicitly impacts all parameters of the whole system. The size of the camera optical system influences the minimal camera spacing and thus the camera arrangement and mounting possibilities. The resolution of the camera (SD, HD, or even 4K) influences the required processing throughput, which can enforce off-line processing


with the use of intermediate video compression for transmission, or can allow real-time processing of uncompressed data streams. Also, depth estimation requires exact temporal synchronization, which may or may not be supported by a given camera. Tools for reaching temporal synchronization include synchronization input/output ports (genlock), time-stamp signatures [17], and control interfaces like LANC [18]. Another factor is the type of shutter. In order to minimize the influence of motion in the scene (which is observed in parallel by all of the cameras) on the construction of the 3D model of the scene, the preferred shutter type is the global one. Unfortunately, many cameras, especially cheap ones, provide rolling shutters, which make shooting fast-moving scenes more difficult. Finally, camera prices vary widely, so the available budget limits the number of cameras that can be afforded. Table 1.2 summarizes the most important properties of different types of cameras with respect to available features and interfaces. The features listed in the table are the most popular ones.
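The raw bitrate equation of this section is easy to evaluate for a planned setup. The following Python sketch, with the hypothetical function name `raw_bitrate`, reproduces the nine-camera HD example given with the equation.

```python
def raw_bitrate(num_cameras, width, height, color_components, bit_depth, frame_rate):
    """Required bitrate B = N * W * H * C * D * F of uncompressed video, in bits/s."""
    return num_cameras * width * height * color_components * bit_depth * frame_rate

# Nine HD cameras, 8-bit 4:2:0 (C = 1.5), 25 frames/s.
bitrate = raw_bitrate(9, 1920, 1080, 1.5, 8, 25)
print(f"{bitrate / 1e9:.2f} Gbit/s")  # prints "5.60 Gbit/s"
```

Evaluating the same function for 4:4:4 sampling (C = 3.0) doubles the result, which is the kind of quick check that helps decide whether intermediate compression or off-line processing is needed.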

1.2.4 ACQUISITION SYSTEM EXAMPLES

As mentioned in the previous section, it is impossible to build a universal system that would satisfy all possible needs. Nonetheless, we can consider some successfully deployed systems that were built experimentally and are currently used for studies on the development of multiview technology, cf. Table 1.3. Of course, the known multicamera systems are not limited to those examples.

Nagoya University multiview camera system

The system [19] has been built at Nagoya University, Japan, cf. Fig. 1.20. It is composed of 80 cameras, which are fixed on a steel frame at 50-mm intervals. Depending on the setup, the cameras can be placed linearly or on an arc that converges at about 8 m from the camera array. The cameras used are JAI Pulnix TMC1400CL. They enable the acquisition of SXGA images at a frame rate of almost 30 FPS. The shooting is temporally synchronized to within about 1 μs. The system also consists of one host-server PC and 100 client PCs called nodes. The interface between camera and PC is CameraLink.

Fraunhofer HHI camera system

The system [20] has been developed at the Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, Germany, cf. Fig. 1.21. It consists of 16 machine vision cameras arranged linearly on a rig. The cameras are Hitachi 3-chip CCD progressive scan RGB cameras (HVF31CL-S1) with XGA resolution. They are equipped with high-quality 6-mm lenses (Fujinon DF6HA-1B) with a horizontal Field of View (FoV) of 44 degrees and a vertical FoV of 33 degrees. The trigger period is 60 ms, which results in a frame rate of 16.67 FPS. The data is captured in raw RGB format using the CameraLink interface via a raid-like PC cluster.



Table 1.2 Comparison of General Characteristics of the Most Common Classes of Cameras
Camera classes compared: Industrial Cameras; Surveillance (Megapixel) Cameras; Consumer (Handy Cams); Professional Cameras; Cinematic Cameras.
Properties compared:
- Video capture: resolution, frame rate (per second), shutter, dynamic range
- Video interface: CameraLink, SD/HD-SDI, Gigabit Ethernet, USB, FireWire
- Control interface: Ethernet, FireWire, LANC, Tri-Level Sync, Time Code, dedicated
- Market and technology: accuracy/repeatability, price
Video capture characteristics per class:
- Industrial Cameras: resolution up to several megapixels; frame rate 10–1000, wide range (trade-off with resolution); shutter rolling/global
- Surveillance (Megapixel) Cameras: resolution up to several megapixels; frame rate 5–60, wide range (trade-off with resolution); shutter typically rolling; dynamic range low but often with night-mode
- Consumer (Handy Cams): shutter typically rolling
- Professional Cameras: frame rate 24–60 and even up to 120 in slow motion mode; shutter global or tiled
- Cinematic Cameras: resolution 2K/4K/8K and more; frame rate 24–60 and even up to 120 in slow motion mode; shutter global or tiled

Table 1.3 A Survey of Some Experimental Multiview Systems

| | Nagoya University | Fraunhofer HHI | Poznan University of Technology Linear Rig | Poznan University of Technology Modular System | Hasselt University Linear Rig | Hasselt University Modular Curvilinear System |
|---|---|---|---|---|---|---|
| Camera spacing | 50 mm | 65 mm | 137.5 mm | Mobile modules, minimally about 50 cm | About 1 m | About 10 m |
| Camera type | JAI Pulnix TMC1400CL | Hitachi 3-chip CCD progressive scan RGB camera HVF31CL-S1 | Canon XH-G1 CCD | Canon XH-G1 CCD | Basler avA1600-50gc | Prosilica GC |
| Resolution | 1392 × 1040 | 1024 × 748 | 1920 × 1080 | 1920 × 1080 | 1600 × 1200 | 1920 × 1080 |
| Frame rate | 29.4114 FPS | 16.67 FPS | 24 FPS | 24 FPS | 60 FPS | 25 FPS |
| Interface | CameraLink | CameraLink | SDI | SDI | Gigabit | Gigabit |
| Recording | PC computers | Raid-like PC cluster | Raid-like PC cluster | Internal | PC cluster | PC cluster |
| Processing | PC computers | Raid-like PC cluster | Raid-like PC cluster | Raid-like PC cluster | PC cluster | PC cluster |
| Mobility | Immobile | | Mobile rig | Mobile individual modules | Mobile individual modules, linear | Mobile individual modules, any |




FIG. 1.20 Nagoya University camera system [19]. © Nagoya University.

FIG. 1.21 Fraunhofer HHI camera system [20,21]. © Fraunhofer HHI.

Poznań University of Technology multiview camera system (linear rig)
The system has been built at the Department of Multimedia Electronics and Telecommunications, Poznań University of Technology, Poland, as an experimental framework for studies on future 3D television [22,23]. The system of Fig. 1.22 consists of nine cinematic Canon XH-G1 cameras placed on a mobile (wheeled) metal rig. The rig has been manufactured specifically to provide special mounting pads that allow precise alignment of the cameras. The output video signal is HDTV (1920 × 1080) and is provided via an SDI interface. All streams are temporally synchronized with the use of a genlock and captured by a raid-like PC cluster. All processing is done offline.

Poznań University of Technology multiview camera system (modular)
This is the second multicamera system built at Poznań University of Technology. It employs the same Canon XH-G1 cameras, but this time they are mounted on special wireless mobile units, cf. Fig. 1.23. This allows for placing the cameras anywhere in the scene. There are 10 such mobile camera units, cf. Fig. 1.24. Apart from the camera, each of them is equipped with a power supply (battery), a wireless synchronization receiver, a remote control receiver, and an HDD recorder. Thanks to that, each camera module is able to record about 30 minutes of high-resolution video. All cameras are precisely synchronized with the use of a dedicated wireless 869-MHz link. Each captured frame is signed with a time code for error resilience. This also allows for the detection of synchronization loss. All cameras can be controlled by a dedicated system that uses a separate WiFi wireless link.

FIG. 1.22 Experimental multiview camera system built at Poznań University of Technology (left: camera rig, right: part of recording system). © Poznań University of Technology.




FIG. 1.23 Wireless mobile camera module (side and rear views). © Poznań University of Technology.

FIG. 1.24 Multicamera setup on an exemplary arc. © Poznań University of Technology.

Hasselt University multiview camera system
Hasselt University has provided soccer test sequences to the MPEG community, one with a linear camera arrangement and a second one with an arc arrangement, cf. Fig. 1.25. For the linear setup, eight Basler avA1600-50gc cameras with a resolution of 1600 × 1200 were placed on a line approximately 1 m apart from each other. The first half of the cameras uses 25-mm lenses, the second half 12.5-mm lenses. The recordings were made at 60 Hz with perfect synchronization. The data was transferred through a 10-Gigabit fiber Ethernet switch to a centralized computer. The curvilinear data set was captured with 16 Prosilica GC cameras, 10 m apart from each other. 16-mm lenses were used, and the recordings were done at full-HD 1920 × 1080, 25 FPS.

FIG. 1.25 Linear (top) and curvilinear (bottom) multicamera system of Hasselt University. © Hasselt University.

1.3 MULTIVIEW VIDEO PREPROCESSING
1.3.1 GEOMETRICAL PARAMETERS
The previous section presented the pinhole camera model. Such a model explains the nature of the projection of light rays onto the image plane and gives some understanding of various terms related to depth. The described pinhole camera model was defined with the use of x, y, z coordinates in 3D space and u, v coordinates in the image plane. In this section we will consider a system of multiple pinhole cameras, each having an individual coordinate system placed at the respective camera position, cf. Fig. 1.26. The placement of these coordinate systems for each camera will be defined by camera parameters: intrinsic, extrinsic, and lens distortion parameters.




FIG. 1.26 A system of multiple pinhole cameras, each having individual coordinate systems: x, y, z in 3D space and u, v in the image plane.

Intrinsic parameters
We will use the described model with some slight modifications compared to Fig. 1.26: the u, v coordinates are shifted and scaled so that they are expressed in units of pixels of the given image and have their origin at the upper-left corner pixel (instead of at the principal point P). For example, for Full-HD 1920 × 1080 resolution, u, v coordinates (0,0) address the top-left pixel. The next pixel to the right has coordinates (1,0) and the bottom-right corner pixel has coordinates (1919,1079). The projection equations (1.2) then become:

$$u = \frac{x}{z} \cdot f \cdot u_{scale} + P_u, \qquad v = \frac{y}{z} \cdot f \cdot v_{scale} + P_v,$$

where (Pu,Pv) is the position of the principal point P in the u, v image plane space and is thus expressed in units of pixels. The terms f · uscale and f · vscale are often denoted as fu and fv, respectively, and they express the focal length in units of pixel size (horizontal or vertical). Because pixels are not necessarily square (the pixel aspect ratio can differ from one), fu can be different from fv. Eq. (1.2) can be rewritten with the use of homogeneous coordinates [24]. In contrast to Cartesian coordinates, which define a point, homogeneous coordinates define a line, which is parameterized with respect to the last coordinate. The last coordinate is customarily normalized to 1, because a common factor can be pulled out in front of the vector. For example, the following homogeneous coordinates express the same line in 3D space:

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = 2 \cdot \begin{bmatrix} x/2 \\ y/2 \\ z/2 \end{bmatrix} = z \cdot \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}.$$
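The pixel-unit projection and the scale invariance of homogeneous coordinates can be illustrated with a short sketch. It is only a minimal example in plain Python: the function names and the numeric values of fu, fv, Pu, Pv are assumptions chosen for illustration and do not come from any particular camera.

```python
def project_to_pixels(x, y, z, fu, fv, pu, pv):
    """Project a camera-space point (x, y, z) to pixel coordinates (u, v)."""
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = fu * x / z + pu
    v = fv * y / z + pv
    return u, v


def normalize_homogeneous(p):
    """Scale homogeneous coordinates so that the last coordinate equals 1."""
    return tuple(c / p[-1] for c in p)


# Illustrative values (assumptions): Full-HD image, principal point near the centre.
fu, fv, pu, pv = 1000.0, 1000.0, 959.5, 539.5

# A point on the optical axis lands on the principal point.
print(project_to_pixels(0.0, 0.0, 2.0, fu, fv, pu, pv))
# A point to the right of and above the axis (v grows downwards in the image).
print(project_to_pixels(0.5, -0.2, 2.0, fu, fv, pu, pv))
# (x, y, z) and any scaled copy of it normalize to the same coordinates.
print(normalize_homogeneous((1.0, 2.0, 4.0)) == normalize_homogeneous((0.5, 1.0, 2.0)))
```

Dividing by z in the first function and by the last coordinate in the second is the same operation, which is why the projection fits naturally into homogeneous notation.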



FIG. 1.27 The role of intrinsic and extrinsic matrices in transformation of coordinate systems.

As can be seen, homogeneous coordinates are useful for describing perspective projection along the last coordinate (e.g., z). Thus Eq. (1.2) can be rewritten as:

$$s \cdot \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \cdot \begin{bmatrix} x \\ y \\ z \end{bmatrix}, \quad \text{where } A = \begin{bmatrix} f_u & 0 & P_u \\ 0 & f_v & P_v \\ 0 & 0 & 1 \end{bmatrix}, \tag{1.10}$$

where s is a scaling factor chosen to make the third output coordinate equal to one (here s = z). Matrix A is called the intrinsic matrix. Its role is to transform points in the coordinate system of the given camera to the homogeneous projection coordinates, cf. Fig. 1.27. Thus the name "intrinsic" comes from the fact that it describes a transformation of coordinates inside the camera (internal).

Extrinsic parameters
The second set of camera parameters is called extrinsic. Its role is to transform points in the global world coordinate system to the coordinate system of the given camera. Therefore the name "extrinsic" comes from the fact that it describes a transformation of coordinates outside the camera (external). Extrinsic parameters typically consist of rotation and translation, as shown in Fig. 1.27. Rotation in 3D space can be described with an orthonormal 3 × 3 matrix, here denoted as R:

$$c = R \cdot w, \tag{1.11}$$

where c = [x_c y_c z_c]^T is a vector of coordinates in the camera coordinate system and w = [x_w y_w z_w]^T is a vector in the world coordinate system. For example, R may be a rotation around the z-axis by angle α:

$$R = \begin{bmatrix} \cos(\alpha) & -\sin(\alpha) & 0 \\ \sin(\alpha) & \cos(\alpha) & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
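As a quick check of the rotation example, the sketch below builds a rotation around the z-axis and verifies that it is orthonormal, i.e., that R^T · R is the identity. The helper names are hypothetical, and the placement of the minus sign follows the usual counterclockwise convention, which is an assumption here.

```python
import math


def rotation_z(alpha):
    """Rotation around the z-axis by angle alpha (radians), counterclockwise convention."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(m):
    return [list(row) for row in zip(*m)]


R = rotation_z(math.radians(90))
c = mat_vec(R, [1.0, 0.0, 0.0])   # world x-axis expressed in camera coordinates
RtR = mat_mul(transpose(R), R)    # close to the identity for an orthonormal R
print(c)
print(RtR)
```

Because R is orthonormal, its inverse is simply its transpose, which makes converting back from camera to world coordinates inexpensive.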




Obviously, translation of the camera can be described with a vector t in the world coordinate system:

$$c = w + t, \quad \text{where } t = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}.$$

Often, the translation of the camera is described by means of the inverse translation vector t′ = R · t. Although vectors t and t′ both describe the position of the camera (and they are often confused in the literature), the difference between them is crucial. Vector t describes the position of the camera in the world coordinate system. Vector t′ describes the position of the origin of the world coordinate system in the given camera coordinate system. The extrinsic set of parameters, i.e., rotation and translation, can be expressed as a single transform matrix, called the extrinsic matrix E, again with the use of homogeneous coordinates:

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = E \cdot \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \quad \text{where } E = [R \mid t']. \tag{1.14}$$
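The following sketch shows how the 3 × 4 extrinsic matrix E = [R | t′] can be assembled and applied to a world point extended with a trailing 1. It is a minimal illustration in plain Python; the function names and the example values of R and t′ are assumptions.

```python
import math


def extrinsic_matrix(R, t_prime):
    """Build the 3x4 matrix E = [R | t'] from a 3x3 rotation and a 3-vector."""
    return [list(R[i]) + [t_prime[i]] for i in range(3)]


def world_to_camera(E, w):
    """Apply E to the world point w = (xw, yw, zw), extended with a trailing 1."""
    wh = list(w) + [1.0]
    return [sum(E[i][k] * wh[k] for k in range(4)) for i in range(3)]


# Illustrative values (assumptions): 90-degree rotation around z, t' along the z-axis.
alpha = math.radians(90)
R = [[math.cos(alpha), -math.sin(alpha), 0.0],
     [math.sin(alpha), math.cos(alpha), 0.0],
     [0.0, 0.0, 1.0]]
t_prime = [0.0, 0.0, 5.0]

E = extrinsic_matrix(R, t_prime)
origin_in_camera = world_to_camera(E, (0.0, 0.0, 0.0))  # the world origin maps to t'
point_in_camera = world_to_camera(E, (1.0, 0.0, 0.0))
print(origin_in_camera)
print(point_in_camera)
```

The first printed vector equals t′, which matches the statement that t′ is the position of the world origin expressed in the camera coordinate system.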


Such an equation can be chained with the intrinsic transformation in order to obtain a combined equation, allowing the calculation of the image plane coordinates u, v from coordinates w in world space:

$$s \cdot \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \cdot E \cdot \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \tag{1.15}$$

where s is a scaling factor chosen to make the third output coordinate equal to one.

Lens distortion
Lens distortion is a deviation from the ideal projection considered in the pinhole camera model. It is a form of optical aberration in which straight lines in the scene do not remain straight in an image. Examples of lens distortions are barrel distortion and pincushion distortion, cf. Fig. 1.28. Most lens distortions can be corrected with the use of the Brown-Conrady model [25]. The model is defined in the form of a transformation of a point from undistorted coordinates to distorted coordinates:

$$\begin{aligned} u_d &= u_n \left(1 + K_1 r^2 + K_2 r^4 + \cdots\right) + \left(P_2 \left(r^2 + 2u_n^2\right) + 2P_1 u_n v_n\right)\left(1 + P_3 r^2 + P_4 r^4 + \cdots\right), \\ v_d &= v_n \left(1 + K_1 r^2 + K_2 r^4 + \cdots\right) + \left(P_1 \left(r^2 + 2v_n^2\right) + 2P_2 u_n v_n\right)\left(1 + P_3 r^2 + P_4 r^4 + \cdots\right), \end{aligned} \tag{1.16}$$

where $r = \sqrt{(u_n - u_c)^2 + (v_n - v_c)^2}$, ud, vd are coordinates in the distorted image, un, vn are coordinates in the undistorted image, uc, vc are coordinates of the distortion center (e.g., the principal point Pu, Pv), Kn are the radial distortion coefficients, and Pn are the tangential distortion coefficients.

FIG. 1.28 Lens distortion shown on the example of a rectangular grid (left): barrel distortion (center) and pincushion distortion (right).

Very often, only radial distortions of second order are considered. In such a case:

$$\begin{aligned} u_d &= u_n \left(1 + K_1 r^2 + K_2 r^4\right), \\ v_d &= v_n \left(1 + K_1 r^2 + K_2 r^4\right). \end{aligned} \tag{1.17}$$
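A minimal sketch of the radial-only model is given below. It assumes, for simplicity, that un, vn are already expressed relative to the distortion center (uc = vc = 0) and are normalized so that the coefficients are of moderate magnitude. The function name and the coefficient values are illustrative assumptions.

```python
def distort_radial(un, vn, k1, k2=0.0):
    """Map undistorted coordinates (un, vn) to distorted ones (ud, vd).

    Coordinates are assumed to be measured from the distortion centre,
    so the squared radius is simply un^2 + vn^2.
    """
    r2 = un * un + vn * vn
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return un * factor, vn * factor


# With a negative K1 the point moves towards the centre,
# with a positive K1 it moves away from it.
print(distort_radial(0.5, 0.5, k1=-0.2))
print(distort_radial(0.5, 0.5, k1=0.2))
# The distortion centre itself is never displaced.
print(distort_radial(0.0, 0.0, k1=-0.2))
```

Since the displacement grows with the radius, the effect is strongest at the image corners and vanishes at the distortion center.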


Barrel distortion is typically modeled with a negative K1 value, whereas pincushion distortion has a positive K1 value.

Estimation of camera parameters
Practically any processing of data coming from a multicamera system, e.g., depth estimation, view synthesis, etc., requires knowledge of the camera parameters described in this section. Therefore it is crucial to calibrate the system before use. The most common way of calibrating geometrical parameters is to use calibration patterns, like checkerboards, cf. Fig. 1.29. The calibration algorithms assume that the shape of the pattern in the real world is known, e.g., a checkerboard composed of 13 × 9 black/white boxes, each of size 10 cm × 10 cm. The location of the pattern in the image is found by feature search, e.g., edge/corner detection. Good examples of calibration algorithms are described in [26,27].

FIG. 1.29 Checkerboard pattern (left) used in calibration of the camera system for the "Poznan Hall" sequence [23] with marked corners which are used as features for calibration (right). © Poznań University of Technology.

Typically, calibration of some set of parameters (e.g., the intrinsic matrix) requires that more primal parameters are known. Therefore it is best to start the calibration with lens distortion, then continue with intrinsic parameters, and finish with extrinsic parameters. Estimation of lens distortion parameters consists in solving a system of nonlinear equations (Eq. 1.16) for a set of points describing the calibration pattern. For the mentioned example we would assume that un, vn lie on a perfectly rectangular grid (although seen from a slight angle). The observed ud, vd allow for estimation of the Kn and Pn lens distortion parameters. In order to achieve good results, the calibration checkerboard should be entirely visible and should cover most of the view of the camera being calibrated.

As soon as the lens distortion is calibrated (and preferably corrected), it is possible to calibrate the intrinsic camera parameters. The process is very similar to the previous one. We try to find intrinsic parameters that minimize the error of Eq. (1.10) for a set of features describing the calibration pattern. In our example of the checkerboard, we know that the points lie on a plane, on a perfectly rectangular grid (we assume that lens distortions have already been removed). We know the observed u, v coordinates of each feature and the assumed x, y, z coordinates.
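To make the idea of minimizing the projection error concrete, the sketch below fits fu, Pu and fv, Pv by ordinary least squares, using the fact that u is a linear function of x/z and v is a linear function of y/z. This is a deliberately simplified illustration and not the procedure of [26,27]: it assumes that the camera-space coordinates of the features are known, whereas practical calibration also has to estimate the pose of the pattern. All names and values are hypothetical.

```python
def fit_line(a, b):
    """Least-squares fit b ~ slope * a + offset; returns (slope, offset)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var = sum((ai - mean_a) ** 2 for ai in a)
    slope = cov / var
    return slope, mean_b - slope * mean_a


def estimate_intrinsics(points_xyz, pixels_uv):
    """Estimate (fu, fv, pu, pv) from camera-space points and observed pixels."""
    ax = [x / z for x, y, z in points_xyz]
    ay = [y / z for x, y, z in points_xyz]
    fu, pu = fit_line(ax, [u for u, v in pixels_uv])
    fv, pv = fit_line(ay, [v for u, v in pixels_uv])
    return fu, fv, pu, pv


# Synthetic features (assumed values) projected with known parameters.
points = [(-0.4, -0.3, 2.0), (0.4, -0.3, 2.0), (-0.4, 0.3, 2.5),
          (0.4, 0.3, 2.5), (0.0, 0.1, 3.0)]
pixels = [(1000.0 * x / z + 960.0, 1010.0 * y / z + 540.0) for x, y, z in points]
print(estimate_intrinsics(points, pixels))  # close to (1000, 1010, 960, 540)
```

Note that only the ratios x/z and y/z enter the fit, so the result is expressed in pixel units and carries no information about the metric scale of the scene.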
From proportions in our perspective projection, we can determine the coordinates Pu, Pv of the principal point in the image plane and the values of fu, fv expressing the focal length in units of pixel size (horizontal or vertical). It is worth noticing that it is impossible to measure the focal length f in world units (e.g., centimeters) without additional knowledge about the positioning of the pattern (e.g., its z-value).

The final step, calibration of the extrinsic parameters of the cameras, requires a slightly different arrangement of the calibration pattern than before. Because we want to find the relative positioning of the cameras in the scene, the calibration pattern should be visible in all of the views simultaneously. Of course, this is not always possible in practice. One solution is to capture the checkerboard at multiple positions, so that it is visible in at least a few views at the same time, cf. Fig. 1.30A. Another solution is the use of a light-emitting diode moving in the fields of view of all cameras, cf. Fig. 1.30B. Regardless of the exact method of finding features which are common between the cameras, the calibration of the extrinsic parameters employs them to construct a system of equations like Eq. (1.15). It is assumed that the given feature, although observed at a different projected position u, v, resulting from different camera-space coordinates xc, yc, zc (both individual for each camera), resides at a particular point xw, yw, zw in the world coordinate system, common for all cameras. It is interesting to notice that even when a vast number of features is used, the equation system will not become overdetermined. In fact, it is underdetermined and additional assumptions have to be made. The most commonly used ones are as follows:
- the global world coordinate system is identical with the local coordinate system of one of the cameras,
- the extrinsic parameters matrix E is composed of the translation vector t′ and a proper rotation matrix R (Eq. 1.14), which is orthonormal,
- the units in the world coordinate system are scaled with respect to the real world units, e.g., meters.

FIG. 1.30 Extrinsic camera parameters calibration: (A) checkerboard pattern placed in various places and at various angles in the scene in order to maximize visibility in the cameras, and (B) light-emitting diode moving in the fields of view of all cameras. © Poznań University of Technology.

Camera parameters file format
An example of a format for storage of camera parameters is the one used within the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) MPEG Depth Estimation Reference Software (DERS) and View Synthesis Reference Software (VSRS) [28]. The format is ASCII based and is composed of multiple sections (one per camera) based on the template shown in Fig. 1.31. The dot sign (.) is used as a decimal point separator. "camera_name" can be any string of Latin characters without spaces and end-of-line characters. The consecutive fields should be separated with spaces or tabulations. It can be noted that in most software packages, the values of radial distortion coefficients are ignored.

FIG. 1.31 Camera parameter file format used by MPEG [28].

1.3.2 VIDEO CORRECTION
Multiview video processing algorithms, like depth estimation or view synthesis, are very sensitive to imperfections of the acquisition path. For example, algorithms assume that objects have exactly the same color in all of the analyzed views or that there are no lens distortions. Therefore, to keep such processing algorithms simple, it is beneficial to preprocess the sequences in order to apply the required corrections, e.g., color correction or lens distortion correction.

Color correction
In multiview systems composed of high-class cameras, the mismatch between colors observed in different views can be invisible to a nonexpert eye. However, even in such an optimistic case, there might be some slight inconsistencies in the color profiles of the cameras. Therefore it is recommended to perform color calibration of the cameras, especially since the problem grows with the number of cameras. The camera calibration needs a known calibration reference to be captured and the resulting output from the camera to be converted to color values. A correction profile (like ICC [29]) can then be estimated using the difference between the values produced by the camera and the known reference values. The easiest way to perform this is to employ color calibration patterns, cf. Fig. 1.32, which are the counterparts of those used in geometrical calibration. An alternative technique is to calibrate two or more cameras relative to each other, so that they reproduce the same color values rather than some reference color. In such a case [30], the target color profile may be, e.g., the average of the color profiles of all cameras. The most common technique for correcting the colors is to use Look-Up Tables (LUTs), which for given uncorrected color components (e.g., Rn, Gn, Bn in RGB color space) return corrected color components (Rc, Gc, Bc). Due to implementation limitations, such a lookup is typically performed independently for each component, e.g.:

$$R_c = LUT_R[R_n], \quad G_c = LUT_G[G_n], \quad B_c = LUT_B[B_n].$$
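The per-component lookup can be sketched as follows. The example assumes 8-bit components and builds the tables from a hypothetical linear gain/offset correction; in practice the tables would be derived from a measured calibration pattern. Function names and parameter values are illustrative only.

```python
def build_gain_lut(gain, offset=0.0, levels=256):
    """LUT for an assumed linear correction, clipped to the valid value range."""
    return [min(levels - 1, max(0, round(gain * i + offset))) for i in range(levels)]


def correct_colors(pixels, lut_r, lut_g, lut_b):
    """Apply one LUT per color component to a list of (R, G, B) pixels."""
    return [(lut_r[r], lut_g[g], lut_b[b]) for r, g, b in pixels]


# Illustrative tables (assumptions): boost red, keep green, attenuate and lift blue.
lut_r = build_gain_lut(1.1)
lut_g = build_gain_lut(1.0)
lut_b = build_gain_lut(0.9, offset=5.0)

pixels = [(100, 100, 100), (250, 0, 200)]
print(correct_colors(pixels, lut_r, lut_g, lut_b))
```

Three one-dimensional tables of 256 entries are far smaller than a joint three-dimensional table, which is one practical reason for treating the components independently, at the cost of not modeling cross-talk between them.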


Finally, it has to be noted that color correction does not solve all problems related to color mismatches between the views. For example, as a consequence of non-Lambertian reflections, surfaces seen from different angles exhibit different colors. Such issues have to be handled inside the respective algorithms, e.g., the depth estimation itself.

FIG. 1.32 Exemplary color calibration patterns.

Lens distortion removal
Techniques that remove lens distortion aim to imitate a perfect pinhole camera, so that further processing algorithms can be unaware of the actual acquisition imperfections. For this purpose, the distortion model has to be known, as presented in Section 1.3.1 (Lens distortion). In the simplest technique for lens distortion removal, cf. Fig. 1.33 (left), the model of the distortion is used directly for all pixels in the output undistorted image. For each pixel with coordinates un, vn in the output undistorted image, the coordinates ud, vd in the input distorted image are calculated. The pixel targeted by those coordinates is copied to the output image. For the Brown-Conrady model, Eq. (1.16) or (1.17) can be used. Of course, if the process has to be performed in real time for several frames with the same distortion model, the ud, vd values can be stored in LUTs so that:

$$u_d = LUT_u[u_n, v_n], \quad v_d = LUT_v[u_n, v_n].$$
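The direct pixel copy with a precomputed table can be sketched as below. The sketch makes several assumptions that are not part of the description above: only the radial coefficients K1, K2 are used, offsets from the distortion center are divided by a normalization constant so that the coefficients stay moderate, the source pixel is chosen by nearest-neighbor rounding, and a single table stores (ud, vd) pairs instead of two separate tables. All names are hypothetical.

```python
def build_undistort_lut(width, height, uc, vc, k1, k2=0.0, norm=1.0):
    """For every output pixel (un, vn), precompute the source pixel (ud, vd).

    Entries are None when the source position falls outside the image.
    """
    lut = []
    for vn in range(height):
        row = []
        for un in range(width):
            dx, dy = (un - uc) / norm, (vn - vc) / norm
            r2 = dx * dx + dy * dy
            factor = 1.0 + k1 * r2 + k2 * r2 * r2
            ud = round(uc + dx * norm * factor)
            vd = round(vc + dy * norm * factor)
            inside = 0 <= ud < width and 0 <= vd < height
            row.append((ud, vd) if inside else None)
        lut.append(row)
    return lut


def undistort(image, lut, fill=0):
    """Copy pixels from the distorted image according to the precomputed table."""
    return [[image[src[1]][src[0]] if src is not None else fill for src in row]
            for row in lut]


# With zero coefficients the table is the identity mapping.
image = [[1, 2, 3], [4, 5, 6]]
lut = build_undistort_lut(3, 2, uc=1.0, vc=0.5, k1=0.0)
print(undistort(image, lut) == image)
```

The table depends only on the distortion model and the image size, so it is built once and reused for every frame. Bilinear interpolation instead of rounding would give smoother results at a modest extra cost.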


Another solution for lens distortion removal in real time is to use a texture-mapping functionality supported by modern GPUs (graphics processing units). As shown in Fig. 1.33 (right), instead of processing single pixels, we use a triangle mesh covering the whole undistorted image. Content of each triangle is filled by means of texture mapping from the distorted image.




FIG. 1.33 Lens distortion removal by direct pixel copy (left) and by triangle-mesh-based texture mapping (right).

1.4 DEPTH ESTIMATION
As already explained in Section 1.1, depth plays an important role in synthesizing virtual views that do not correspond to existing camera views. Section 1.2 already briefly touched on the subject of active versus passive depth estimation: the former uses active illumination techniques and the responses thereof to estimate depth with Time-of-Flight sensors, while the latter is solely based on matching images acquired from multiple viewpoints to estimate depth. For a pair of input images, this is referred to as stereo matching. Of course, hybrid solutions between active depth sensing and stereo matching are also described in the literature. This section will give an overview of the passive depth estimation techniques that are related to the disparity phenomenon explained in Section 1.1: the closer a 3D point is to the cameras, the larger the disparity observed across the camera views, or, said differently, the more the 2D projection of that 3D point moves when switching from one camera view to the next.

The following sections will survey a couple of depth estimation techniques, ranging from local stereo matching, where only the local neighborhood of each pixel is matched to the companion input image, to global methods involving all pixels of the input images at once for estimating the depth map. The number of input camera views is also an important parameter: some depth estimation techniques work well with only a pair of input images (stereo matching); others require at least a dozen input images for depth estimation, which is the price to pay for very reliable depth maps. Since depth estimation has been an active domain of research for many decades, an abundance of papers exists in the field. Good starting points are the KITTI [31] and Middlebury [32] data sets, which provide test material as well as indicative comparisons of the most prominent stereo matching algorithms against a small set of objective criteria. Recent comparisons between a multitude of stereo matching algorithms can be found in Refs. [33,34]. To simplify our discussion, we start from a set of perfectly parallel camera views, where the cameras have rigorously the same parameters (focal length, center of the optical axis, etc.). Referring to Section 1.3, the epipolar lines are then strictly horizontal. In practice, nonperfectly parallel input camera views can always be rectified in order to obtain these horizontal lines. Each feature in an image is then guaranteed to lie on the same horizontal line in the other rectified input image, simplifying the stereo matching implementation. We will often follow this rectification convention in the remainder of the section, largely simplifying our discussion. Of course, it is always possible to bypass the rectification step and directly do the stereo matching over the nonhorizontal epipolar lines, as shown in Fig. 1.16 (top).

1.4.1 LOCAL STEREO MATCHING
Local stereo matching compares the surroundings of a pixel p in the left image to slightly translated positions q in the right image to estimate the disparity of pixel p. Each pixel is processed separately, without taking the full image context into account, which often results in noisy disparity images, especially in nontextured image regions influenced by any source of input noise (e.g., light flickering, slightly varying colors over adjacent camera views, etc.). Ref. [35] provides a detailed overview of the different steps of local stereo matching in its first figure. To give the reader a bird's eye view of the different methods, we will restrict ourselves to the core disparity estimation kernel, which is followed by many refinements kept out of the scope of the current section. To estimate the disparity in local stereo matching methods, the surroundings of a pixel p in the left image are compared to the surroundings of the pixel q in the right image, where q has been translated over a candidate disparity δ_p compared to p, cf. Fig. 1.34. For each pixel p, N candidate disparities (δ_p^1, δ_p^2, …, δ_p^N) are tested (N = 256 and 65,536 for 8- and 16-bit depth maps, respectively), and the candidate disparity resulting in the lowest matching cost is assigned to pixel p. Refs. [36,37] give a short overview of the most used matching costs.

FIG. 1.34 Matching windows in stereo matching.

Besides choosing a good matching cost metric, it is also important to define the shape of the matching window around pixel p well. Indeed, it is always implicitly assumed that all pixels within the window have more or less the same disparity δ_p^i, so that only one disparity can be reliably assigned to pixel p. As a counterexample, if the matching window lies half over one object and covers for the other half another object lying at a very different depth (e.g., the rectangular window at the object silhouette in Fig. 1.34), pixels of different disparities and/or partially disoccluded pixels (pixels not visible in the other image) will be matched together, yielding a best matching cost at a possibly wrong disparity value, very different from the real disparity that should be assigned to pixel p. Reliable local stereo matching approaches therefore take good care of the matching window shape, making its borders coincide with object borders. Ref. [38] includes gradient calculations to respect object borders, and Refs. [35,39] describe a method where, starting from pixel p and moving outward, the window shape is fixed at a position where neighboring pixel color differences start to be too large. This creates matching windows that never cross an object border, cf. Fig. 1.35, hence providing more reliable depth estimations. The papers [35,38] describe these local stereo matching approaches in a tutorial step-by-step style, very useful to beginners. Ref. [33] provides many visual results of different stereo matching techniques (including global ones, cf. next section) on the Middlebury, KITTI, and MPEG-FTV data sets.
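The core disparity estimation kernel can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: rectified grayscale images stored as lists of rows, a fixed square window (not the border-aware shapes discussed above), the sum of absolute differences (SAD) as matching cost, and a pixel at column x in the left image matched against column x − d in the right image. Function names are hypothetical.

```python
def sad_cost(left, right, x, y, d, half):
    """SAD between the window around (x, y) in the left image and the
    window around (x - d, y) in the right image."""
    height, width = len(left), len(left[0])
    cost = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            yy = min(max(y + dy, 0), height - 1)        # clamp at image borders
            xl = min(max(x + dx, 0), width - 1)
            xr = min(max(x + dx - d, 0), width - 1)
            cost += abs(left[yy][xl] - right[yy][xr])
    return cost


def local_stereo(left, right, max_disparity, half=1):
    """Assign to every pixel the candidate disparity with the lowest cost."""
    height, width = len(left), len(left[0])
    disparity = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            costs = [sad_cost(left, right, x, y, d, half)
                     for d in range(max_disparity + 1)]
            disparity[y][x] = costs.index(min(costs))   # winner-takes-all
    return disparity


# Synthetic test: the left image is the right image shifted by 2 pixels.
right = [[x * x + 3 * y for x in range(12)] for y in range(5)]
left = [[right[y][max(x - 2, 0)] for x in range(12)] for y in range(5)]
disp = local_stereo(left, right, max_disparity=4)
print(disp[2])  # interior pixels are expected to report a disparity of 2
```

Every pixel is decided independently here, which is exactly why such a kernel is fast but produces noisy results in textureless regions, where many candidate disparities have nearly identical costs.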


FIG. 1.35 Border-aware window shapes (the cross-shaded regions) for stereo matching (each middle dot is the pixel for which the depth is estimated) [33]. © Brussels University, VUB.

1.4.2 GLOBAL STEREO MATCHING
In contrast to the local methods described in the previous section, global stereo matching methods assign disparities to all pixels in a coherent way over the full image, trading off competing local costs. These competing costs are as follows:
• the disparity matching cost, similar to the ones in the previous section,
• a pixel coherency cost between neighboring pixels, preserving local smoothness in the disparity values assigned to them,
• an occlusion/disocclusion cost to cope with pixels that are visible in one view, but not in another.

The latter two costs propagate from one pixel to the next throughout the full image, therefore leading to a globally optimal decision over the image. Two global methods are presented in the next sections. They offer very good performance and are often top ranked in the Middlebury and KITTI tests.

Graph Cut
As the name reveals, the Graph Cut technique [40,41] is based on creating a graph that will be cut into N subgraphs, one for each candidate disparity δ_p^i. All pixels in a single subgraph will be assigned the same disparity label, cf. Fig. 1.36.

FIG. 1.36 Max-flow/min-cut approach for assigning disparities to pixels (right) from the original flow network (left).

To simplify the discussion, let us first take the example of the two fingers experiment of Section 1.1, where all pixels of each finger are associated with one disparity value, either 0 (the rear finger) or 255 (the large-disparity front finger in 8-bit disparity format coding). All pixels in the image correspond to nodes that are locally interconnected in a graph, along so-called n-links. The pixels/nodes are also linked, through the so-called t-links, to the aforementioned two disparity values 0 and 255, called disparity labels. These are thought of as a source and a sink, from/to which water conceptually flows throughout the graph network. Both the n-links between pixels and the t-links between the pixels and the disparity labels are assigned a weighting/cost factor, which influences the water flow in the graph.

For the n-links, if two neighboring pixels p and q are eventually assigned the same disparity label, their n-link cost is zero. However, for different disparity labels, the n-link cost is a positive value. Such a cost value effectively expresses the penalty of having two neighboring, n-link interconnected pixels that end up with different disparity labels. This is one of the costs that the system will try to minimize, effectively favoring local smoothness in the assigned disparity labels over neighboring pixels.

For the t-links, the cost factor corresponds to the local disparity matching cost as in the previous section, between the pixel p attached to the t-link and the pixel p + δ_p^i for the associated disparity label δ_p^i. The cost is hence high when p is associated with a disparity label that is different from the real disparity of pixel p. This is a second cost that globally (in balance with the previous one over the n-links) should be minimized.
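The interplay of the two costs can be made tangible with a toy example. The sketch below evaluates the total cost of a labeling on a single row of pixels and finds the cheapest labeling by exhaustive search. This is only an illustration of the cost being minimized, not the Graph Cut algorithm itself: exhaustive search is feasible only for a handful of pixels. The data costs, the constant n-link penalty, and all names are assumptions.

```python
from itertools import product


def labelling_energy(labels, data_cost, smooth_penalty):
    """Total cost of a disparity labelling on a 1-D row of pixels.

    data_cost[p][l] plays the role of the t-link cost of giving pixel p label l;
    smooth_penalty is the n-link cost paid when neighbours get different labels.
    """
    energy = sum(data_cost[p][l] for p, l in enumerate(labels))
    energy += sum(smooth_penalty for a, b in zip(labels, labels[1:]) if a != b)
    return energy


def best_labelling(data_cost, smooth_penalty):
    """Exhaustive search over all labellings (only feasible for tiny inputs)."""
    n_pixels, n_labels = len(data_cost), len(data_cost[0])
    return min(product(range(n_labels), repeat=n_pixels),
               key=lambda labels: labelling_energy(labels, data_cost, smooth_penalty))


# Five pixels, two labels; the middle pixel has a slightly misleading data cost.
data_cost = [[0, 5], [0, 5], [3, 2], [0, 5], [0, 5]]
print(best_labelling(data_cost, smooth_penalty=0))  # follows the data cost only
print(best_labelling(data_cost, smooth_penalty=2))  # smoothness overrules the outlier
```

Without the n-link penalty the middle pixel takes the label favored by its own matching cost. With the penalty, switching labels twice costs more than the small gain in matching cost, so the labeling becomes smooth.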


The global cost minimization in the Graph Cut technique is obtained by cutting the graph in two along a minimal cut (min-cut), i.e., a cut along which the sum of costs is minimal. In doing so, one subset of pixels will remain attached to the source and its associated disparity label, while the other subset of pixels is attached to the sink. These latter pixels get the sink disparity label assigned. There is clearly a tradeoff for each pixel between getting the correct disparity label (the one that corresponds to reality) and keeping disparity smoothness with its neighbors. There are actually no efficient algorithms that easily solve this min-cut problem directly, but its complementary problem, finding the maximum flow (max-flow) through the graph, has good implementations. As well explained in Ref. [40], when gradually increasing the water flow from the source to the sink, some links get saturated w.r.t. the capacity associated with their weighting costs, preventing a further increase of the water flow through the network. Consequently, these are the links that should first be cut away in the equivalent min-cut problem statement, to enforce a higher water flow through the rest of the network. This method can be generalized for N disparity labels, with a multi-min-cut and equivalent multi-max-flow approach, reaching N subsets of pixels, each assigned to the corresponding disparity label δ_p^i. Further extensions to (i) inclusion of occlusion cost factors and (ii) more than two input images are presented in Ref. [41]. The DERS used in MPEG [42] for sustaining Free viewpoint TV (FTV) and 6-DoF VR applications uses such a Graph Cut depth estimation approach.

Belief propagation

Belief propagation propagates evidence from a pixel to its neighbors by statistical inference. Similar to graph cuts, two competing costs, rather called potentials here, are introduced: a pixel intensity difference φ between a pixel p in the left image and the pixel q that is translated over an assumed disparity, as well as a smoothness ψ for neighboring pixels with unequal disparities. The difference, however, is that these functions are thought of as a statistical process. For instance, in the function φ, the likelihood that pixels p and q under disparity δ have a different intensity I is expressed as a Gaussian distribution [43]. Each pixel will update its potential functions for a given disparity based on the observations/beliefs from adjacent pixels through iterative message passing, until a maximum a posteriori global potential encapsulating φ and ψ is reached under the disparity values eventually assigned to all pixels. Fig. 1.37 [37] compares the results from the DERS of MPEG [42] applied on some test sequences of Fig. 1.11 with the belief propagation method described in Ref. [33]. Clearly, belief propagation obtains results that respect the object boundaries better than DERS. This was actually a requirement in all previously described methods. Finally, Ref. [44] shows the similarities, both in concept and results, between Graph Cut and Belief propagation, which are the methods that in general rank best on the Middlebury data set [32].
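The message passing idea can be sketched on a single scanline. The example below is a simplified illustration and not the method of Ref. [33]: it assumes that the potentials φ and ψ are given as costs (negative log-probabilities), so that maximizing the posterior becomes a min-sum problem, and it uses a constant penalty for ψ. On a chain of pixels, one left-to-right and one right-to-left sweep are sufficient; on a full image grid the messages would have to be iterated. All names and values are hypothetical.

```python
def chain_min_sum_bp(data_cost, smooth_penalty):
    """Min-sum belief propagation on a 1D chain of pixels.

    data_cost[p][l] -- cost (negative log-likelihood) of label l at pixel p
    smooth_penalty  -- cost paid when two neighbors take different labels
    Returns the label with the lowest belief (total cost) for each pixel.
    """
    n_pixels = len(data_cost)
    n_labels = len(data_cost[0])

    def psi(a, b):
        # Smoothness potential: zero for equal labels, constant otherwise.
        return 0 if a == b else smooth_penalty

    def sweep(order):
        # msgs[p][l]: message arriving at pixel p from the previous pixel in
        # the sweep, i.e., the cheapest cost of everything behind p given
        # that p takes label l.
        msgs = [[0] * n_labels for _ in range(n_pixels)]
        prev = None
        for p in order:
            if prev is not None:
                for l in range(n_labels):
                    msgs[p][l] = min(
                        data_cost[prev][m] + msgs[prev][m] + psi(m, l)
                        for m in range(n_labels))
            prev = p
        return msgs

    from_left = sweep(range(n_pixels))
    from_right = sweep(range(n_pixels - 1, -1, -1))

    # Belief of each pixel: own data cost plus both incoming messages.
    beliefs = [[data_cost[p][l] + from_left[p][l] + from_right[p][l]
                for l in range(n_labels)] for p in range(n_pixels)]
    return [min(range(n_labels), key=lambda l: b[l]) for b in beliefs]

# Hypothetical toy costs: pixel 2 is noisy and slightly prefers label 1.
toy_cost = [[0, 5], [0, 5], [3, 2], [0, 5], [0, 5]]
print(chain_min_sum_bp(toy_cost, smooth_penalty=2))
```

The evidence of the four confident neighbors outweighs the weak local preference of the noisy pixel, which is the behavior the smoothness potential is meant to produce.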



CHAPTER 1 Acquisition, processing, compression, and virtual view rendering

FIG. 1.37 DERS from MPEG (left) vs. Belief Propagation (right). © Brussels University, VUB, ULB.

1.4.3 MULTICAMERA DEPTH ESTIMATION

Having only two input views, as in the stereo matching techniques presented in the previous sections, is a highly limiting factor in reaching reliable depth estimation. Obviously, the more input images can be included in the depth estimation, the better the results are expected to be. One straightforward extension of stereo matching consists in doing multiple, pairwise stereo matching estimations over the set of cameras, and combining the results. The Fraunhofer HHI camera system presented earlier in this chapter follows this approach and iteratively improves the estimated depth maps. Other methods, however, exist for combining the information from all input cameras simultaneously in reaching a globally consistent depth estimation. Two methods are further detailed in the next sections.


FIG. 1.38 Deprojections from camera views C_i to a depth plane D_p^j. Courtesy of Hasselt University.

Plane sweeping

As the name suggests, plane sweeping [45] is a method where a plane sweeps over successive depth candidates (D_p^1, D_p^2, …, D_p^N) for each pixel p (or object P constituted of many pixels p). These depths are equivalent to the candidate disparities (δ_p^1, δ_p^2, …, δ_p^N) presented in the stereo matching sections before. All the camera input images are deprojected, i.e., projected from the 2D camera views into 3D space, onto a candidate depth plane D_p^i, as shown in Fig. 1.38. Note that the relative positions of the cameras (extrinsics), as well as their field of view and optical axis (intrinsics), should be known in this deprojection operation. For objects that really lie on the candidate depth plane D_p^i, the camera views will be deprojected toward coinciding images on the depth plane, cf. the brown/blue-clothed player in the middle of Fig. 1.39 (bottom). The deprojections of an object that is not positioned at the candidate depth plane D_p^i will be distinct for each camera view, yielding hardly overlapping, faded copies of the object, cf. each yellow player in Fig. 1.39. Calculating the color histogram of the so-obtained depth plane image (i.e., all deprojections blended over each other into one image) will reveal whether the candidate depth plane D_p^i is the right depth plane or not: a very sharp histogram suggests that all deprojected images coincide at the right depth plane; otherwise the candidate depth plane D_p^i is not the right one for that object. In the latter case, the next candidate depth plane D_p^{i+1} is tested (i.e., deprojection and histogram tests), and all depth




FIG. 1.39 Plane sweeping at different candidate depth planes. © Hasselt University.

planes are iteratively traversed, seeking the most sharply peaked histogram distribution to settle the depth plane for each object (or to be precise, for each pixel p over all objects). This plane sweeping method has been shown to yield very stable depth estimation results [7,46], even in wide baseline camera array configurations with large perspective view changes, cf. Fig. 1.40, at a quality level competitive with the Graph Cut depth map of Fig. 1.12. To be precise, Fig. 1.40 is a disparity map (front objects have higher gray-scale pixel values), corresponding to the depth planes in the plane sweeping method.

Epipolar plane images

The concept of epipolar plane images (EPIs) originates from the late 1980s [47], when having multiple cameras was still expensive, but those who could afford them started thinking about their possibilities.


FIG. 1.40 Soccer depth map estimation results with plane sweeping. © Hasselt University.
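The plane sweeping principle can be sketched in a strongly simplified form. The example below assumes a rectified, linear array of equally spaced cameras, so that sweeping over depth planes reduces to sweeping over integer candidate disparities along one scanline. Instead of the histogram sharpness test described in the text, it uses the variance of the samples gathered from all views as a stand-in agreement measure. The function names, the sign convention of the displacement, and the synthetic texture are assumptions made for illustration.

```python
from statistics import pvariance

def render_views(texture, disparity, n_cameras):
    """Synthetic rectified views of a flat, textured surface.

    A point seen at column x in camera 0 is assumed to appear at column
    x - k * disparity in camera k (hypothetical sign convention).
    """
    width = len(texture)
    views = []
    for k in range(n_cameras):
        view = [0] * width
        for x, value in enumerate(texture):
            xk = x - k * disparity
            if 0 <= xk < width:
                view[xk] = value
        views.append(view)
    return views

def plane_sweep_scanline(views, candidates):
    """For each column of camera 0, pick the candidate disparity for which
    the samples collected from all views agree best (lowest variance)."""
    width = len(views[0])
    result = []
    for x in range(width):
        best_d, best_score = None, float("inf")
        for d in candidates:
            cols = [x - k * d for k in range(len(views))]
            if min(cols) < 0 or max(cols) >= width:
                continue  # this candidate falls outside at least one view
            samples = [views[k][c] for k, c in enumerate(cols)]
            score = pvariance(samples)  # 0 when all views coincide
            if score < best_score:
                best_d, best_score = d, score
        result.append(best_d)
    return result

texture = [10, 200, 50, 120, 30, 250, 90, 160, 20, 220, 70, 140]
views = render_views(texture, disparity=2, n_cameras=3)
estimated = plane_sweep_scanline(views, candidates=[0, 1, 2, 3])
print(estimated[4:])  # columns visible in all three views
```

For the columns that are visible in all three views, only the correct candidate makes the samples coincide, which mirrors the sharp versus faded deprojections of Fig. 1.39.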

To explain the concept, let us start by assuming a long, linear camera array with many small cameras put side by side, and registering all camera input images so that any feature point in one image always lies on the same scanline in the other images, as discussed earlier. These registered images are then stacked one behind the other, cf. Fig. 1.41. One horizontal section thereof is by definition an EPI, i.e., an image that provides information about feature points in the same horizontal scanline (the same epipolar line after registration) over all registered input images. The physical interpretation of such an EPI follows the disparity concepts presented in Section 1.1. Each horizontal scanline in the EPI originates from a specific input camera view, and when switching from one camera view to the next (thus from one scanline to the next in the EPI), a 3D point in space will be displaced as a function of its disparity. More specifically, a 3D point far away in space (large depth) will have almost zero disparity, hence it will follow an almost vertical line in the EPI. A 3D point close to the cameras, however, will have a large disparity and hence will make large translations from one horizontal scanline to the next in the EPI. Consequently, such a 3D point will follow a clearly diagonal line in the EPI, cf. Fig. 1.41, which we will call an EPI-D-trackline for simplicity of our discussion: the slope of such a line in the EPI corresponds to the disparity δ or depth D of the associated 3D point in space. Interestingly, an EPI also provides information about occlusions. Following two 3D points along their EPI-D-tracklines with associated disparities/depths D, we see how these points move from one camera view to the next. Since the point with the smallest depth moves more rapidly than the point with the largest depth, the former might catch up with the latter, i.e., cause an occlusion, thereby creating a crossing in their respective EPI-D-tracklines. Ref. [48] obtains very detailed depth maps using such techniques, even when reducing the number of cameras from around a hundred to a dozen, which corresponds




FIG. 1.41 Epipolar Plane Image is a horizontal section of a registered Multiview image stack.

to a heavily subsampled/discontinuous EPI, from which it is very difficult to extract the EPI-D-tracklines. Refs. [49,50] have further developed these techniques to create depth maps from any viewpoint, correctly synthesizing occlusions and disocclusions in the depth maps, which are used for further DIBR-based virtual view synthesis.
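The relation between the slope of an EPI-D-trackline and the disparity can be made concrete with a small sketch. It assumes an idealized, densely sampled EPI in which a point with disparity δ is displaced by exactly δ columns from one camera to the next; the sign convention, the function names, and the numbers are hypothetical. The disparity is recovered as the least-squares slope of the trackline, and the crossing of two tracklines gives the camera index at which the nearer point starts occluding the farther one.

```python
def trackline(x0, disparity, n_cameras):
    """Column position of a 3D point in each camera view (one EPI row per
    camera), assuming a shift of `disparity` columns per camera step."""
    return [x0 - k * disparity for k in range(n_cameras)]

def fit_disparity(track):
    """Least-squares slope of column position versus camera index.
    The disparity is the negated slope under the convention above."""
    n = len(track)
    mean_k = (n - 1) / 2
    mean_x = sum(track) / n
    num = sum((k - mean_k) * (x - mean_x) for k, x in enumerate(track))
    den = sum((k - mean_k) ** 2 for k in range(n))
    return -num / den

def crossing_camera(x0_a, d_a, x0_b, d_b):
    """Camera index (possibly fractional) where two tracklines intersect,
    or None when they are parallel (equal disparity, hence equal depth)."""
    if d_a == d_b:
        return None
    return (x0_a - x0_b) / (d_a - d_b)

near = trackline(x0=20, disparity=3, n_cameras=6)  # close point, steep line
far = trackline(x0=12, disparity=1, n_cameras=6)   # distant point

print(fit_disparity(near), fit_disparity(far))
print(crossing_camera(20, 3, 12, 1))  # both points share a column here
```

In a heavily subsampled EPI, only a few samples of each trackline are available, so a line fit of this kind becomes much less reliable, which is the difficulty mentioned above.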

1.5 VIEW SYNTHESIS AND VIRTUAL NAVIGATION

Virtual navigation in 3D scenes is a key feature of FTV, VR, and augmented reality systems. In order to allow the user to select virtual viewpoints, placed in positions different from the real camera locations, a virtual view synthesis has to be performed. The concept of virtual view synthesis is also exploited in storage and transmission, e.g., in the 3D-HEVC extension of high efficiency video coding (HEVC), which uses depth maps to increase the coding efficiency of multiview video, cf. Section 1.6, as well as in error concealment techniques. The fundamental goal of view synthesis is to generate any intermediate view in a position of a virtual camera placed in between the real captured (or transmitted) views, cf. Fig. 1.42. The content of those views is used within this process, which is hence often called view interpolation. The most common view synthesis employs DIBR, in which two components of a view are used to render the new view: the texture image and its associated depth map. Many input views can be used together, which constitutes an MVD scene representation format.


FIG. 1.42 Generation of virtual view in an intermediate position between real views.

FIG. 1.43 General scheme of modern view synthesis algorithms.

Most modern view synthesis algorithms share a common scheme, cf. Fig. 1.43, consisting of warping the input views, view blending, inpainting (hole filling), and postprocessing. These steps are described in the following sections.

1.5.1 WARPING

The basic principle for generating new views based on images and depth is the projective geometry presented in Section 1.2. The 3D information of the scene included in MVD data, as well as the camera arrangement (e.g., in the form of extrinsic and intrinsic camera parameters), is used to project pixels from the image plane of the input view to 3D space and then reproject them back to an image plane, but this time related to the virtual view, cf. Fig. 1.44. Thus implicitly, in this process in 3D space,




FIG. 1.44 Projection of pixels from the input image plane to 3D space and reprojection to the image plane of the output virtual view, which is equivalent to warping of pixels.

the scene is represented by means of a point cloud. Looked at from an end-to-end point of view, this step consists of warping pixels from an input view to a different location determined by the associated depth. Of course, after projection some of the pixels may occlude others. Therefore care must be taken to use the corresponding depth information in order to cover farther pixels by the closer ones. This resembles the z-buffer occlusion techniques known from computer graphics [51]. There are two main approaches to perform this, called forward and backward projection (or synthesis).

Forward warping directly follows the scheme presented in Fig. 1.44. First, the output view is cleared with the background color (e.g., black) and an infinite depth value. Then, pixels from the input view are shifted to new positions according to their depth value. The pixels are written to the output virtual view, but only if their depth value is closer than the previously written depth value. A major disadvantage of forward warping is that the generated depth (and thus the image) contains cracks on discontinuities between the projected pixels, cf. Fig. 1.45 (top). Even if two pixels are adjacent in the input image, after the warping they must be written into discrete positions, which, due to rounding, may yield a crack between them. This problem can be solved by projecting primitives like triangles instead of single pixels [54]. Another drawback of forward warping is that the generated depth cannot be postprocessed in order to improve the quality of the generated image.

Backward warping starts with processing of depth only, which is warped in the forward direction, as described before. Of course, this includes an occlusion test with the z-buffer technique. Then, the generated depth is postprocessed in order to fill the cracks. For example, this can be done with simple median filtering, cf. Fig. 1.45 (bottom-left). Finally, the depth is used to warp back color content from the input image. Thanks to that, the resultant output image, cf. Fig. 1.45 (bottom-right), is almost free from cracks. The only holes that are generated are disocclusions, which are the parts of the scene that are not visible from the input view.
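Both warping variants can be sketched on a single scanline. The example below is a simplified illustration and not the VSRS implementation: it assumes a linear camera arrangement, so that warping reduces to a horizontal shift, and a hypothetical depth-to-shift mapping of 8/z pixels. Holes are marked with None, and the crack filling uses a three-tap median applied only at hole positions.

```python
import math

def forward_warp(colors, depths, shift_of):
    """Forward warping of one scanline with a z-buffer test."""
    width = len(colors)
    out_color = [None] * width       # None marks a hole
    out_depth = [math.inf] * width   # output cleared to infinite depth
    for x, (c, z) in enumerate(zip(colors, depths)):
        xt = x + round(shift_of(z))  # rounding to a discrete position
        # Write only inside the view and only if closer than what is there.
        if 0 <= xt < width and z < out_depth[xt]:
            out_color[xt] = c
            out_depth[xt] = z
    return out_color, out_depth

def median_fill(depth):
    """Fill one-pixel cracks in a warped depth line with a 3-tap median."""
    filled = list(depth)
    for x in range(1, len(depth) - 1):
        if math.isinf(depth[x]):
            filled[x] = sorted(depth[x - 1:x + 2])[1]
    return filled

def backward_warp(colors, depths, shift_of):
    """Warp the depth forward, repair it, then fetch colors from the input."""
    _, warped_depth = forward_warp(colors, depths, shift_of)
    warped_depth = median_fill(warped_depth)
    width = len(colors)
    out = [None] * width
    for xt, z in enumerate(warped_depth):
        if math.isinf(z):
            continue                  # remaining hole (disocclusion)
        xs = xt - round(shift_of(z))  # position in the input view
        if 0 <= xs < width:
            out[xt] = colors[xs]
    return out

shift = lambda z: 8.0 / z  # hypothetical depth-to-shift mapping (pixels)
colors = list("ABCDEFGHIJ")
depths = [8.0, 7.5, 7.0, 6.5, 6.0, 5.5, 5.0, 4.7, 4.4, 4.2]

print(forward_warp(colors, depths, shift)[0])  # crack where the shift jumps
print(backward_warp(colors, depths, shift))    # crack repaired, border hole left
```

In this toy surface the rounded shift changes from 1 to 2 pixels between two adjacent input pixels, which leaves one output position unwritten in forward warping. Backward warping fills that position because the repaired depth map provides a depth value from which a source pixel can be fetched.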


FIG. 1.45 View warping on the example of the “Ballet” sequence [52,53]. Forward warping leads to discontinuity artifacts in the depth (top-left) and consequently in the output image (top-right). Median filtering of the depth map (bottom-left) used in backward warping improves the results (bottom-right). Holes are marked in black. Ballet sequence © Microsoft Research, rendering courtesy of Poznań University of Technology.

Backward warping techniques are used more commonly than forward warping, e.g., in the MPEG VSRS [55] and in the view synthesis software from Poznań University of Technology [54].

1.5.2 VIEW BLENDING

The warping described in the previous section exploits information only from a single view. Therefore in the output view, some of the regions cannot be synthesized because they are occluded in the input view. The occluded regions are different in each of the views, which a view synthesis algorithm can use to its advantage to fill in some holes. In the most common approach, multiple versions of the requested virtual view are synthesized, each based upon a different input view, cf. Fig. 1.46. Those multiple versions of the virtual view are then merged together with various blending methods in order to reach a single output view without disocclusions. For example, the colors from input




FIG. 1.46 Virtual view synthesized from left (A) and right (B) input view. Disoccluded regions are marked in black. Ballet sequence © Microsoft Research, rendering courtesy of Poznań University of Technology.

FIG. 1.47 Blending of pixels from two input views with weights calculated based upon distances in camera positions.

views can be linearly blended with weights w_i, calculated based on the distances d_i between the position of a given input view i and the requested virtual view:

$$w_i = 1 - \frac{d_i}{\sum_{k=1}^{N} d_k}$$


where N is the number of input cameras used for the generation of variants of the requested output virtual view. Fig. 1.47 presents an example with two input cameras and weights w1 and w2. The result of such formulated view blending is presented in Fig. 1.48 (left).
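The weighting rule can be checked with a few lines of code. This is a minimal sketch with hypothetical function names and sample values. Note that the weights defined above sum to 1 only for N = 2; for a larger N they sum to N − 1, so the sketch divides by the sum of the weights, which is an added assumption and not part of the formula in the text.

```python
def blending_weights(distances):
    """Weights w_i = 1 - d_i / sum_k d_k for the given camera distances."""
    total = sum(distances)
    return [1 - d / total for d in distances]

def blend(colors, distances):
    """Distance-weighted blend of co-located pixel values from N views.
    The division by sum(weights) is a normalization added for N > 2."""
    weights = blending_weights(distances)
    return sum(w * c for w, c in zip(weights, colors)) / sum(weights)

# Two input cameras: the virtual view is three times closer to camera 1.
print(blending_weights([1, 3]))   # the closer view gets the larger weight
print(blend([100, 200], [1, 3]))  # blended value of one pixel
```

With distances 1 and 3, the closer view contributes three quarters of the blended value, which matches the intuition behind Fig. 1.47.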

1.5.3 INPAINTING

Though blending of multiple views substantially reduces the problem of disocclusion in the final synthesized image, it may not solve it entirely. For example, the viewer


FIG. 1.48 Virtual view (left) obtained from blending of two input views presented in Fig. 1.47, and the same image with inpainting of holes (right).

can select a viewpoint pointing at something which was occluded in all of the views. In such cases the missing pixels have to be colored with a generated value. This process of filling such regions with content generated from the pixel’s neighborhood is called inpainting. Depending on the size and the number of holes, various inpainting algorithms can be used. The simplest one finds the nearest available pixel and uses its color to fill the entire hole region. For example, such a technique is used in the MPEG VSRS [55]. As can be seen in Fig. 1.48 (right), thanks to inpainting the holes vanish, providing satisfactory results in the “Ballet” sequence [52,53]. More advanced inpainting methods include interpolation [54,56,57] and texture synthesis [58]. In the latter technique, the holes are filled with content which tries to resemble not only the color of the nearest neighboring pixels, but also higher-order texture features such as edges and gradients. Fig. 1.49 presents results obtained with texture synthesis inpainting [58] (left) compared to the nearest color inpainting in VSRS [55] (right).
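The simplest, nearest-pixel variant can be sketched on one scanline. This is an illustration under stated assumptions and not the VSRS code: holes are marked with None, the search is restricted to the same row, each hole pixel takes the color of its own nearest available pixel, and ties are resolved toward the left neighbor.

```python
def inpaint_nearest(row):
    """Fill every hole (None) with the value of the nearest known pixel
    in the same row; on a tie the left candidate is used."""
    known = [x for x, value in enumerate(row) if value is not None]
    if not known:
        return list(row)  # nothing to copy from
    out = list(row)
    for x, value in enumerate(row):
        if value is None:
            nearest = min(known, key=lambda k: abs(k - x))
            out[x] = row[nearest]
    return out

row = [None, 10, None, None, None, 50, None]
print(inpaint_nearest(row))
```

Such a fill only repeats existing colors, which explains why larger holes benefit from the interpolation and texture synthesis methods mentioned above.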

1.5.4 VIEW SYNTHESIS REFERENCE SOFTWARE

The View Synthesis Reference Software (VSRS) was originally developed in 2008 by Nagoya University [59] during MPEG studies on new 3D video coding standards. Later, in 2013, it was enhanced to better address the needs arising in applications of FTV and Virtual Navigation [55]. Over these years, VSRS has been continuously used in research studies as a reference technique as well as a starting point for the implementation of new scientific ideas. As a consequence, it is the most commonly used view synthesis algorithm in papers in the area. Currently, the development of VSRS is coordinated in the FTV Ad-hoc Group in MPEG [60].




FIG. 1.49 Texture synthesis inpainting [58] (left) compared to nearest color inpainting in VSRS [55] (right). Cited from P. Ndjiki-Nya, M. Köppel, D. Doshkov, H. Lakshman, P. Merkle, K. Müller, T. Wiegand, Depth image-based rendering with advanced texture synthesis for 3-D video, IEEE Trans. Multimedia 13(3) (2011) 453–465.

For completeness, Table 1.4 presents a description of the VSRS configuration file format used in the MPEG exploration studies. The features of VSRS include:
– Backward warping with hole filling.
– Support for full 3D space warping as well as fast 1D warping (for linear camera arrangements).
– View blending with linear distance-based weights.
– Inpainting based on nearest-pixel blending.
– Subpixel accuracy thanks to half-pixel and quarter-pixel precision of synthesis.
– Postprocessing tools for improvement of boundaries of synthesized objects.

VSRS can be used freely for scientific and standardization purposes within MPEG. It can be downloaded from the MPEG SVN repository at the following address: .

1.6 COMPRESSION

1.6.1 INTRODUCTION

As described in Section 1.2, a multiview video is acquired with the use of multiple cameras. The video is transmitted from the cameras, and somewhere, in an outside broadcasting van, in a preprocessing computer cluster or elsewhere, is the starting point from which the video from all the cameras is transmitted together. Such a video from multiple cameras is called “multiview video” under the assumption that the video sequences from all the cameras are synchronized as discussed in Section 1.2.


Table 1.4 VSRS Configuration File Description

Parameter                     Description
DepthType                     0: Camera space, 1: world space
SourceWidth                   Input frame width
SourceHeight                  Input frame height
StartFrame                    Start frame index in image YUV files
StartFrameD                   Start frame index in depth YUV files
TotalNumberOfFrames           Total number of input frames
LeftNearestDepthValue         znear for the left input view
LeftFarthestDepthValue        zfar for the left input view
RightNearestDepthValue        znear for the right input view
RightFarthestDepthValue       zfar for the right input view
CameraParameterFile           Name of file with camera parameters as described in Fig. 1.31 in Section 1.3
LeftCameraName                Name of real left camera, as specified in previous file
VirtualCameraName             Name of virtual camera, as specified in previous file
RightCameraName               Name of real right camera, as specified in previous file
LeftViewImageName             Name of left input image YUV file
RightViewImageName            Name of right input image YUV file
LeftDepthMapName              Name of left input depth map YUV file
RightDepthMapName             Name of right input depth map YUV file
OutputVirtualViewImageName    Name of output virtual view YUV file
SynthesisMode                 0: General, 1: linear arrangement of cameras
ColorSpace                    Internal color space: 0: YUV, 1: RGB
Precision                     Upsampling precision: 1: integer-pixel, 2: half-pixel, 4: quarter-pixel
Filter                        Upsampling filter: 0: Bi-linear, 1: Bi-Cubic, 2: AVC
BoundaryNoiseRemoval          Postprocessing tool for removal of noise from boundaries of synthesized objects: 0 or 1
ViewBlending                  View blending disable (only in general mode): 0: weighted interpolation, 1: use nearest
SplattingOption               0: disable; 1: enable for all pixels; 2: enable only for boundary pixels. Default: 2
BoundaryGrowth                A parameter to enlarge the boundary area with SplattingOption = 2. Default: 40
MergingOption                 0: Z-buffer only; 1: averaging only; 2: adaptive merging using Z-buffer and averaging. Default: 2
DepthThreshold                A threshold that is only used with MergingOption = 2. Range: 0–255. Default: 75
HoleCountThreshold            A threshold for number of holes



It would be a truism to say that uncompressed multiview video needs extremely high bitrates for transmission. For example, only 10 streams of consumer high-definition (HD) video (1920 × 1080 pixels in a frame, 25 frames/s) need a total bitrate of 6.22 Gbps, even excluding all necessary auxiliary data, e.g., synchronization data. Therefore efficient compression of multiview video is of paramount importance. Multiview video compression is obviously related to general video compression technology developed mostly for video taken from a single camera, i.e., monoscopic video. The state-of-the-art technology of monoscopic video compression is very sophisticated, and therefore development of a new generation of video compression technology is extremely expensive. On the other hand, compression of multiview video is needed for various multiview video applications (cf. Section 1.1), i.e., for quickly growing but still niche markets. Therefore it would be too expensive to develop from scratch a compression technology just for multiview video. As a consequence, multiview video compression exploits the technology of monoscopic video compression. A deeper description of the monoscopic video compression technology may be found elsewhere [61–64] and will be omitted here. For compression of multiview video, the commonly used technology of monoscopic video compression may be used in two different approaches.

1. Simulcast coding. This term means that for each view, the video is encoded independently, i.e., the similarities among the views are not exploited by compression. Each video is compressed using the standard compression techniques developed for monoscopic video. Therefore the commonly used codecs, both hardware and software, can be used. This approach is straightforward, easy to implement, and therefore cheap and widely used. A specific version of simulcast coding is widely used for stereoscopic video broadcasting and multicasting by television and IP networks. Nowadays, these “3D video” services use so-called frame-compatible transmission of stereoscopic video, where the left and the right views are appropriately subsampled and merged into single frames. In such a frame, the left half corresponds to the left view, while the right half corresponds to the right view (“side-by-side” format). Alternatively, the left and the right decimated views may be packed into the top and bottom parts of frames (“top-and-bottom” format). Some other ways of packing two views into single frames have been proposed and even standardized. The advantage of the “frame-compatible” formats is that the corresponding video may be compressed by standard mono video codecs, and the whole television transmission infrastructure needs no modification for stereoscopic video. The only specific information usually added to the bitstream is the metadata that signals the presence and the format of multiview video packed into frames. As the considerations of this chapter are focused on video with a higher number of views, the compression of stereoscopic video in the “frame-compatible” formats will not be further considered here.

2. Multiview and 3D video coding. These approaches exploit the similarities between views, therefore stronger compression is possible as compared to the


simulcast coding. Multiview video coding exploits strong similarities between horizontally aligned blocks of samples in the neighboring views. 3D video coding additionally takes advantage of more sophisticated relations between the views and depth maps, thus achieving even more bitrate reduction than the purely multiview techniques. Both multiview and 3D video coding techniques are developed on top of the standard monoscopic video coding techniques. They use the coding tools from the respective monoscopic video compression techniques, but they also use some additional coding tools that exploit the similarities between views. The 3D video coding techniques may additionally use tools capable of exploiting the spatial information about the scene in order to further reduce the bitrate. Application of these additional tools implies that the respective codecs are not just the popular monoscopic codecs but their specific modifications. Such extended multiview or 3D video codecs are still rarely used in practice, but the prospective applications in VR, virtual navigation, SMV and light field technology (see Section 1.1) are likely to stimulate increasing interest in their usage.

Video coding, either monoscopic, multiview, or 3D, may be lossless or lossy. The term “lossless coding” refers to such compression where the samples of the decoded video exhibit exactly the same values as the respective samples in the original video. Lossless compression of video is relatively rarely used and considered, so multiview or 3D lossless compression is still practically beyond the scope of research, and will not be considered here. Therefore the further considerations of this section are limited to lossy coding, which always results in some distortions of the decoded video. For lossy coding, there exists a trade-off between the quality of the decoded video and the bitrate of the compressed bitstream, cf. Fig. 1.50. For commercial broadcasting, in

FIG. 1.50 The “rate-distortion” lines in video coding.



television, as well as for over-the-top video, so-called broadcast quality is required. It means that the coding artifacts should be either invisible or hardly visible. The quality and bitrate are controlled by quantization steps of the quantizers deployed for the transformed samples of the prediction errors. Changing the value of the quantization step moves the operational point of the codec along a “rate-distortion” curve that corresponds to the current video. Such a curve strongly depends on video content as depicted in Fig. 1.50.

1.6.2 MONOSCOPIC VIDEO CODING AND SIMULCAST CODING OF MULTIVIEW VIDEO

During the last quarter of a century, consecutive generations of video coding technology have been developed and promoted to international standards, such as MPEG-2 [65], AVC (advanced video coding) [66], and HEVC [67]. These consecutive video coding generations have been developed thanks to huge research efforts. For example, the development, optimization, standardization, and implementation of HEVC needed an effort measured in thousands of man-years. This estimation implies that the development of a multiview video codec from scratch would be definitely too expensive, as already mentioned. Obviously, the previously mentioned three milestone standards of video coding do not represent the only existing video coding technologies. There are many others [68], but they are less important for multiview video coding, at least hitherto. When considering the three previously mentioned representative video coding standards, the following “rule” may be formulated [69,70], cf. Fig. 1.51:

(1) For each next generation, for a given quality level, a given video format and content complexity, the bitrate is halved.
(2) Temporal intervals of about 9 years were observed between consecutive technology generations of video coding.

During a 9-year interval the available computational power is increased by a factor of about 20, according to Moore’s law. After each such interval, this increase may be consumed by more sophisticated codecs of the next generation. The previously mentioned “rule of thumb” was observed during two cycles only. It is probably too short a time to establish a rule that may be used to forecast the future developments. Nevertheless, the ISO/IEC expert group on audio and video (MPEG) has recently started the exploration activities aimed at creating a new compression technology that should be capable of reducing the bitrates again, possibly by a factor of about two. Its name is not fixed yet, but the experts use the tentative name “Future Video Coding.” The expectations are to have this technology standardized around 2020–21, which would be roughly compliant with the previously mentioned rule of thumb. Each video coding standard is related to some reference software that provides nearly the highest compression possible for this standard defining the compressed bitstream semantics and syntax. Therefore the rate-distortion performance of the reference software is a good model for a given compression technology. It should be


FIG. 1.51 The milestone generations of video coding. B stands for the least possible bitrate needed to achieve the assumed quality for the assumed complexity of video content and an assumed video format.

noted that for a given standard, the minimum bitrate required for a given quality level may change dramatically from the first encoders on the market until mature, sophisticated designs become available after some years (e.g., [71]). The latter ones mostly provide compression similar to that provided by the reference software. For a given content type, and for a given video format, the bitrate of the compressed bitstream may be roughly assessed assuming the required quality level and a mature codec implementation that reaches nearly the same compression as the standard reference software. For demanding complex dynamic content and assuming broadcast quality levels, for monoscopic video codecs the bitrate B may be very roughly estimated using the following equation [69,72,73]:

$$B \approx A \cdot V \ \text{(Mbps)}, \qquad (1.21)$$

where A is a technology factor, with A = 4 for MPEG-2, A = 2 for AVC, A = 1 for HEVC, and A = 0.5 for the prospective technology expected around 2020–21, and V is a video-format factor, with V = 1 for the Standard Definition (SD) format (either 720 × 576, 25 fps or 720 × 480, 30 fps, chroma subsampling 4:2:0, i.e., one chroma sample from each chroma component CR and CB per 4 luma samples), V = 4 for the High Definition (HD) format (1920 × 1080, 25/30 fps, chroma subsampling 4:2:0), and V = 16 for the Ultra High Definition (UHD) format (3840 × 2160, 50/60 fps, chroma subsampling 4:2:0). Eq. (1.21) gives a very rough estimate of the bitrate of a compressed bitstream from which broadcast-quality video may be retrieved. Some examples of these estimations are presented in Table 1.5. The last column of Table 1.5 shows that state-of-the-art video codecs represent video using a surprisingly small number of bits. Even for demanding content,



Table 1.5 Video Compression for Main Broadcasting Formats Assuming Complex Video Content (fo Denotes the Frame Rate)

Columns: Standard; Video Format; Approximate Bitrate (Mbps); Approximate Average Number of Bits per Pixel of Color Video
Video formats: SD (720 × 576, fo = 25 Hz); HD (1920 × 1080, fo = 25 Hz); UHD (3840 × 2160, fo = 50 Hz)

the HEVC bitstreams can reach 1 bit on average for 25 color pixels! This result is even expected to be improved to 1 bit on average for 50 color pixels by the forthcoming new standard expected by 2021. It is worth emphasizing that these numbers refer to all bits in the bitstream including all control data, motion vectors, etc. This means that for most blocks of samples no data is transmitted. The content of such blocks is indeed well predicted using sophisticated interframe and intraframe predictions. The description of the relevant algorithms and tools may be found in Refs. [74,75] for MPEG-2, [61,76] for AVC, and [62–64] for HEVC. Each consecutive video compression standard defines an increased number of coding tools. As it would not be economically efficient to implement all these tools in all codecs, the standards define profiles that correspond to sets of the coding tools. In fact, a profile is a specific subset of the bitstream syntax defined by the standard. Obviously, codecs of different profiles accept different sets of input video formats, provide different compression efficiency, are characterized by different complexity, etc. The previous considerations of this section assume the codec profiles widely used in video broadcasting, i.e., Main Profile for MPEG-2, High Profile for AVC, and Main Profile for HEVC. Somewhat higher bitrates of the HEVC bitstreams are necessary to handle 10-bit video samples using the Main10 Profile of HEVC, which is required by Wide Dynamic Range and Wide Color Gamut video as defined by the recently introduced Recommendation BT.2020 of the International Telecommunication Union (ITU) [77]. The issues of Wide Dynamic Range and Wide Color Gamut also need to be explored for future multiview video systems. The conceptually simplest way to implement the simulcast coding of multiview video is to encode each view as an independent video stream with the time stamps included. 
In that way, the commonly used relatively cheap video codecs may be efficiently applied. The total bitrate Bm of the bitstreams is Bm ¼ N  B,


where N is the number of views, B is the bitrate for a single view from Eq. (1.21). This equation is similar to Eq. (1.7) in Section 1.2.3, except that here B stands for the bitrate of the compressed video from one view. Eq. (1.22) is written under the

FIG. 1.52 “Frame-compatible” packing of 16 HD views (View 1 to View 16) into an “8K” UHD frame.

assumption that the bitrates B are similar for all the views, as is often the case for multiview video. For example, according to the previously mentioned formulas, the bitrate for the HEVC simulcast of 10 views in the HD (1920 × 1080) format would be around 40 Mbps for very dynamic and demanding content.

Another option for simulcast compression of multiview video is “frame-compatible” coding, where several views are packed into a frame, similar to the transmission of stereoscopic video (cf. the previous section). Such a frame consists of the views taken at the same time instant, cf. Fig. 1.52. In that way, multiview video may be compressed by a standard monoscopic encoder. A further advantage is that no side synchronization information is needed. The limitation of this approach stems from the picture sizes supported by the encoders available on the market. They mostly allow for encoding of HD video or sometimes UHD video, although the current HEVC specification (3rd ed.) foresees frame sizes (picture sizes) of up to 35.65 million samples. That frame size is sufficient for the “8K” format of UHDTV (7680 × 4320) and can accommodate up to 16 views in the HD (1920 × 1080) format. The “8K” codecs together with other equipment for “8K” ultrahigh definition television systems are currently under intensive development, and they may be an interesting option for compression of multiview video already around the year 2020. Unfortunately, the metadata for “frame-compatible” coding is not yet standardized for such applications.

It should also be emphasized that the bitrates discussed before are sufficient only for video delivery to a final viewer, and are far from sufficient for the production environment, where either high-fidelity versions of the video codecs or still image codecs may be used.
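The simulcast and frame-compatible figures quoted above can be checked with a few lines of arithmetic. The following is only a sketch: the function names are hypothetical, the frame rate of 50 frames per second is an assumption (it is not stated in this passage), and the raster-order placement of views is assumed from the layout suggested by Fig. 1.52.

```python
# Illustrative arithmetic only; names, the 50 fps frame rate, and the
# raster-order grid are assumptions, not taken from any standard.

HD = (1920, 1080)                  # width, height of one view
UHD_8K = (7680, 4320)              # "8K" UHDTV frame
MAX_PICTURE_SAMPLES = 35_650_000   # "up to 35.65 million samples" quoted in the text


def bitrate_from_pixels_per_bit(width, height, fps, pixels_per_bit):
    """Bitrate in bit/s when one bit is spent, on average, per `pixels_per_bit` pixels."""
    return width * height * fps / pixels_per_bit


def simulcast_bitrate(num_views, bitrate_per_view):
    """Eq. (1.22): total bitrate B_m = N * B, assuming similar bitrates for all views."""
    return num_views * bitrate_per_view


def pack_views(num_views, view_size, frame_size):
    """Return the top-left (x, y) offset of each view in a raster-order grid."""
    view_w, view_h = view_size
    frame_w, frame_h = frame_size
    cols, rows = frame_w // view_w, frame_h // view_h
    if num_views > cols * rows:
        raise ValueError("views do not fit into the frame")
    return [((i % cols) * view_w, (i // cols) * view_h) for i in range(num_views)]


per_view = bitrate_from_pixels_per_bit(*HD, fps=50, pixels_per_bit=25)
total = simulcast_bitrate(10, per_view)
offsets = pack_views(16, HD, UHD_8K)
frame_samples = UHD_8K[0] * UHD_8K[1]

print(f"per view: {per_view / 1e6:.2f} Mbps, 10 views: {total / 1e6:.1f} Mbps")
print("8K frame samples:", frame_samples, "within limit:", frame_samples <= MAX_PICTURE_SAMPLES)
print("offset of the last view:", offsets[-1])
```

Under these assumptions the per-view bitrate comes out slightly above 4 Mbps, which agrees with the order of magnitude of 40 Mbps quoted for 10 HD views, and the 16 HD views tile the “8K” frame exactly as a 4 × 4 grid.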
Currently, the frequently used solutions exploit the JPEG2000 standard, or more precisely the Motion JPEG2000 standard [78–82], where all the frames of video are encoded independently like still images. The bitrates of compressed HD video are mostly in the range between 50 and 100 Mbps. Applications of such a compression technology to multiview video would lead to very high total bitrates, and high-fidelity versions of the video codecs like HEVC High Tier or even All Intra are advantageous options.

Hitherto, we have considered simulcast coding of multiview video. Another issue is simulcast coding of multiview video with depth maps. For such purposes the depth representations are standardized in MPEG-C Part 3 [83]. A depth map may be treated as monochrome video, and thus may be compressed using monoscopic video codecs (like AVC or HEVC) configured to compress monochrome video, or using still image codecs (e.g., JPEG or JPEG2000) that are capable of compressing monochrome images. For the scenario where both views and depth maps are encoded using the AVC codecs, the bit allocation between views and depth has been studied in detail in order to achieve the best quality of virtual views, e.g., in Ref. [84]. Such studies imply that the depth maps mostly consume 10%–30% of the total bitrate, assuming that for each view the respective depth map is encoded. Video and image codecs like AVC and HEVC are developed for natural color pictures, while depth is monochrome and characterized by sharp edges and large flat areas. Therefore a more efficient approach is to use special depth compression techniques instead of standard video coding techniques. These special depth coding techniques include platelet-based [85], wedgelet-based [86], contour-based [87], synthesis-based [88], and other techniques.
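Treating a depth map as monochrome video presupposes that metric depth has been mapped to integer samples. The sketch below uses an inverse-depth quantization between a near and a far plane, a convention widely used for multiview test material; it is assumed here for illustration and is not quoted from this section or from MPEG-C Part 3. The function names and the parameters `z_near`, `z_far`, and `levels` are hypothetical.

```python
def depth_to_sample(z, z_near, z_far, levels=256):
    """Map metric depth z in [z_near, z_far] to an integer sample; nearer is brighter."""
    if not z_near <= z <= z_far:
        raise ValueError("depth outside [z_near, z_far]")
    # Quantize uniformly in 1/z, so that near objects get finer depth resolution.
    normalized = (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return round((levels - 1) * normalized)


def sample_to_depth(v, z_near, z_far, levels=256):
    """Inverse mapping: recover (quantized) metric depth from an integer sample."""
    normalized = v / (levels - 1)
    return 1.0 / (normalized * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)


samples = [depth_to_sample(z, 1.0, 100.0) for z in (1.0, 2.0, 10.0, 100.0)]
print(samples)
```

With such a mapping the depth map becomes an ordinary 8-bit monochrome picture with large flat areas and sharp edges at object boundaries, which is the kind of content the special depth coding techniques mentioned above are designed for.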

1.6.3 MULTIVIEW VIDEO CODING

The main idea of multiview video coding is to exploit the similarities between neighboring views. One view, called the base view, is encoded like a monoscopic video. During the encoding, the standard intraframe and temporal interframe predictions are used. The respective bitstream constitutes the base layer of the multiview video representation. The base view may be decoded from the base-layer bitstream using a standard monoscopic decoder. For the other views, called the dependent views, the interview prediction with disparity compensation may be used in addition to intraframe and interframe predictions, cf. Fig. 1.53. In such prediction, a block in a dependent view is predicted using a block of samples from a frame of another view at the same time instant. The location of this reference block is indicated by the disparity vector. This interview prediction is dual to the interframe prediction, but the motion vectors are replaced by the disparity vectors, and the temporal reference frames are replaced by the reference frames from other views. Moreover, other coding tools have also been proposed, e.g., compensation for illumination differences, which further increases compression efficiency [89].

FIG. 1.53 Interview prediction with disparity compensation. (Figure labels: base view, dependent view, disparity vector.)

In multiview video coding, the pictures are predicted not only from temporal interframe references but also from interview references. An example of a prediction structure is shown in Fig. 1.54.

FIG. 1.54 Example of the frame structure in multiview video coding using interview prediction with disparity compensation: solid line arrows denote interframe predictions, while dashed line arrows correspond to temporal predictions. The letters I, P, and B denote I-frames (intraframe coded), P-frames (compressed using intra- and interframe coding), and B-frames (compressed with the additional use of a second interframe prediction or interview prediction). (Figure labels: View 0, View 1, View 2; base view; dependent views.)

Multiview video coding has been standardized as extensions to the MPEG-2 standard [74], the AVC standard [90], and the HEVC standard [91]. The multiview extension of AVC is denoted as MVC and that of HEVC as MV-HEVC. The respective profiles of the standards are summarized in Table 1.6. These multiview extensions have been standardized in such a way that only minor modifications are needed to the monoscopic codec implementations. Therefore some more advanced techniques for multiview coding are not included in the standards. The extensions of the AVC and HEVC standards also provide the ability to encode depth maps, where nonlinear representations are allowed [92].

Multiview coding provides a bitrate reduction on the order of 15%–30%, sometimes reaching even 50%, as compared to simulcast coding. These high bitrate reductions are achievable for video that is obtained from cameras densely located on a line, and then rectified in order to virtually set all the optical axes parallel and on the same plane. For sparse and arbitrary camera locations, the gain with respect to simulcast coding reduces significantly.
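The interview prediction with disparity compensation described in this section can be sketched as a block search in a frame of another view taken at the same time instant. The code below is a toy illustration, not the procedure of MVC or MV-HEVC: it assumes rectified views, so that only horizontal disparities are searched, uses the sum of absolute differences (SAD) as the matching cost, and all function names are hypothetical.

```python
def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def find_disparity(dep_frame, ref_frame, top, left, size, max_disp):
    """Return (disparity, cost) of the best horizontal match in the reference view."""
    target = get_block(dep_frame, top, left, size)
    width = len(ref_frame[0])
    best = None
    for disp in range(-max_disp, max_disp + 1):
        x = left + disp
        if x < 0 or x + size > width:
            continue  # candidate block would leave the picture
        cost = sad(target, get_block(ref_frame, top, x, size))
        if best is None or cost < best[1]:
            best = (disp, cost)
    return best


def prediction_residual(dep_frame, ref_frame, top, left, size, disp):
    """Residual left after predicting the block from the disparity-shifted reference block."""
    target = get_block(dep_frame, top, left, size)
    pred = get_block(ref_frame, top, left + disp, size)
    return [[t - p for t, p in zip(t_row, p_row)] for t_row, p_row in zip(target, pred)]


# Toy data: the dependent view sees the reference view shifted by 3 samples.
WIDTH, HEIGHT, SHIFT = 16, 4, 3
ref_view = [[c * c + 17 * r for c in range(WIDTH)] for r in range(HEIGHT)]
dep_view = [[ref_view[r][min(c + SHIFT, WIDTH - 1)] for c in range(WIDTH)]
            for r in range(HEIGHT)]

disp, cost = find_disparity(dep_view, ref_view, top=0, left=4, size=4, max_disp=6)
residual = prediction_residual(dep_view, ref_view, 0, 4, 4, disp)
print("disparity:", disp, "cost:", cost)
```

The structure mirrors motion-compensated interframe prediction: only the reference picture changes (another view instead of another time instant), which is why the monoscopic codec cores can be reused with minor modifications.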

Table 1.6 Multiview Coding Profiles in the Video Compression Standards
Standard | Profile | Remarks
MPEG-2 | Multiview | Not in use
AVC | Stereo High | Base layer compatible with High Profile; limited to stereo (two views only); tools for interlaced video coding; adopted in consumer stereoscopic video recording (3D Blu-ray); some software video players provide support for this profile
AVC | Multiview High | Base layer compatible with High Profile; arbitrary number of views; no tool for interlaced video coding
AVC | Multiview Depth High | Support for coding of depth maps together with multiview video; multiview video coding compatible with either Stereo High Profile (two views) or Multiview High Profile (more views allowed)
HEVC | Multiview Main | Base layer compatible with Main Profile; support for coding of multiview video together with depth maps

1.6.4 3D VIDEO CODING

The distinction between multiview video coding and 3D video coding is not precise. The latter refers to compression of the multiview plus depth representations using more sophisticated techniques of interview prediction. A great diversity of 3D video coding tools has already been proposed, including prediction based on view synthesis, interview prediction by 3D mapping defined by depth, advanced inpainting, coding of disoccluded regions, depth coding using platelets and wedgelets, etc. [86,93–97]. Some of these tools have already been included in the standards of 3D video coding: 3D High Profile of AVC [66,98] and 3D Main Profile of HEVC [67,91]. The latter defines the state-of-the-art technology for compression of multiview video with accompanying depth.

3D-HEVC is an extension of the multiview coding in HEVC, i.e., MV-HEVC. As in MV-HEVC, the standardization requirement was to reuse the monoscopic decoding cores for implementations. MV-HEVC, 3D-HEVC, and the scalable extension of HEVC share nearly the same high-level syntax of the bitstreams. Therefore it was decided that view encoding cannot depend on the corresponding depth. As compared to MV-HEVC, 3D-HEVC provides additional prediction types:
(1) Combined temporal and interview prediction of views that refers to pictures from another view and another time instant;
(2) View prediction that refers to a depth map corresponding to the previously encoded view;
(3) Prediction of depth maps using the respective view or a depth map corresponding to another view.
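Prediction types that refer to a depth map rely on the fact that depth determines where a sample of one view appears in another view. The following sketch shows this idea for the simplest case only, assuming rectified, parallel cameras, for which the horizontal disparity is d = f · b / z (focal length f in pixels, baseline b, depth z). It is not the normative 3D-HEVC process; the function names, the sign convention (target camera to the right of the reference camera), and the hole marker are assumptions made for illustration.

```python
def disparity_from_depth(z, focal_px, baseline):
    """Horizontal disparity in pixels of a point at depth z (rectified, parallel cameras)."""
    return focal_px * baseline / z


def warp_row(ref_row, depth_row, focal_px, baseline, hole=None):
    """Forward-warp one row of the reference view into a neighboring view.

    Samples move left by their disparity; when two samples land on the same
    position, the one closer to the camera wins. Unfilled positions stay `hole`.
    """
    width = len(ref_row)
    out = [hole] * width
    out_depth = [float("inf")] * width
    for x in range(width):
        d = disparity_from_depth(depth_row[x], focal_px, baseline)
        xt = x - int(round(d))
        if 0 <= xt < width and depth_row[x] < out_depth[xt]:
            out[xt] = ref_row[x]
            out_depth[xt] = depth_row[x]
    return out


ref_row = [10, 20, 30, 40, 50, 60]
flat_depth = [4.0] * 6                      # fronto-parallel plane at depth 4
predicted = warp_row(ref_row, flat_depth, focal_px=8.0, baseline=1.0)
print(predicted)
```

The positions left as holes correspond to disoccluded regions, which is why coding of disoccluded regions and inpainting appear among the proposed 3D video coding tools.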

The compression gain of 3D-HEVC over MV-HEVC is around 2%–12% bitrate reduction. Nevertheless, the compression gains of both 3D-HEVC and MV-HEVC are smaller when the cameras are not aligned on a line. For circular camera arrangements, in particular with the angles between the camera axes exceeding 10 degrees, the gain over simulcast falls below 15%, often being around 5%. This observation stimulated research on extensions of 3D-HEVC that use true 3D mapping for more efficient interview prediction [99,100]. Such an extension of 3D-HEVC has been proposed in the context of transmission of the multiview plus depth representations of dynamic scenes in future free-viewpoint television systems [73]. The compression efficiency of simulcast HEVC, MV-HEVC, 3D-HEVC, and extended HEVC for arbitrary camera locations has been discussed in Ref. [100] using the coding results for 10 well-known test multiview sequences:
• with linear camera arrangements: Poznan_Street, Poznan_Hall2 [23], Dancer [101], Balloons, Kendo [102], Newspaper [103],
• with circular camera arrangements: Poznan_Blocks [104,105], Big Buck Bunny Flowers [106], Ballet, Breakdancers [53].

The results obtained using the PSNR index for luma quality assessment for coding of three selected views are presented in Fig. 1.55, where the bitrate reductions are calculated using the Bjøntegaard metric [107]. These results clearly illustrate the conclusions drawn previously for both MV-HEVC and 3D-HEVC. In particular, the results show that for cameras sparsely located on an arc (as for the test sequences Poznan_Blocks and Big Buck Bunny/BBB_Flowers) the compression gains of MV-HEVC and 3D-HEVC over simulcast HEVC are negligible. Fig. 1.55 also shows some gains for more advanced codecs that were recently proposed but are not standardized [100].
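For readers unfamiliar with the Bjøntegaard metric [107], the sketch below computes an average bitrate difference between two rate-distortion curves. It assumes exactly four rate-PSNR points per codec with distinct PSNR values, in which case the third-order polynomial fit of log-rate over PSNR reduces to exact interpolation and Simpson's rule integrates it exactly. The function names and the sample numbers are hypothetical; this is not the reference implementation used for Fig. 1.55.

```python
import math


def lagrange(xs, ys, x):
    """Evaluate at x the polynomial that interpolates the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def simpson(f, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to third order."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bitrate difference in percent of the test codec against the reference.

    Negative values mean that the test codec needs less bitrate for the same PSNR.
    """
    log_ref = [math.log10(r) for r in rates_ref]
    log_test = [math.log10(r) for r in rates_test]
    lo = max(min(psnr_ref), min(psnr_test))   # overlapping PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = simpson(lambda q: lagrange(psnr_ref, log_ref, q), lo, hi)
    int_test = simpson(lambda q: lagrange(psnr_test, log_test, q), lo, hi)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


# Hypothetical numbers: the test codec needs 20% less bitrate at every PSNR.
psnr = [32.0, 35.0, 38.0, 41.0]
simulcast_rates = [1000.0, 2000.0, 4000.0, 8000.0]
multiview_rates = [0.8 * r for r in simulcast_rates]
reduction = bd_rate(simulcast_rates, psnr, multiview_rates, psnr)
print(f"BD-rate: {reduction:.1f}%")
```

Because the difference is averaged in the logarithmic rate domain over the common PSNR range, a codec that saves a constant fraction of the bitrate at every quality level yields exactly that fraction as the result.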

FIG. 1.55 Bitrate reduction against HEVC simulcast for various multiview test sequences with linear and circular camera arrangements along or around a scene. The data are provided for MV-HEVC and 3D-HEVC reference software and for the extension proposed in Ref. [100]. (Vertical axis: bitrate reduction, 0% to 60%.)

1.7 FUTURE TRENDS

This chapter has provided an overview of DIBR techniques using Multiview plus Depth, developed within MPEG throughout the last decade in the context of FTV. The 3D scene is captured with multiple cameras from which depth is estimated, relying on algorithms that have been an active field of research over the last couple of decades. 3D video coding is currently a research topic for several groups around the world, and future standardization activities are also expected. Recently, MPEG-FTV, the body within MPEG (Moving Picture Experts Group of ISO and IEC) exploring FTV, has been studying possible 3D-HEVC extensions for efficient coding of multiview video taken from arbitrary camera positions. Current industrial deployment of 3D-HEVC is rare, but growing interest in the applications mentioned in this chapter will stimulate applications of this compression technology as well as, probably, standardization of its more efficient extensions. It is also expected that the coding tools of 3D-HEVC together with possible improvements will be included, probably with some delay, into the forthcoming video coding standard, whose first version is expected to be ready around 2020–21.

With the advent of VR, new challenges also arise on the horizon:
• Propose market-friendly solutions to capture the 3D scene at low cost with only a handful of cameras
• At least support real-time decoding and rendering at low latency
• Overcome cyber sickness, among others, by enabling 6-DoF user motion

Especially the latter will require virtual view synthesis techniques, going far beyond today's 3-DoF VR and 360-degree video streaming, which essentially select a viewport from a high-resolution texture. At this point of the exploration and standardization discussions in JPEG and MPEG, it is believed that Light Fields embedding directional light information might be a solution to this 6-DoF Photorealistic/Cinematic VR. Other candidates like Point Clouds are also considered a viable solution.

Interestingly, Refs. [1,11] suggest many similarities between the aforementioned data representations. Even though Light Fields, DIBR/Multiview plus Depth, and Point Clouds might, to the layperson, look very different from each other, they actually are not. The authors believe that there is an urgent need to bring experts from these different fields together, looking for a common denominator between these technologies, with the aim of reusing existing coding solutions as much as possible, slightly modified to meet the needs of next-generation VR applications.

ACKNOWLEDGMENTS

This work was partially supported by the National Science Centre, Poland, within project OPUS according to the decision DEC-2012/05/B/ST7/01279, and within project PRELUDIUM according to the decision DEC-2012/07/N/ST6/02267, as well as project 3DLICORNEA funded by the Brussels Institute for Research and Innovation, Belgium, under Grant No. 2015-DS/R-39a/b/c/d. The authors would also like to acknowledge the fruitful discussions with members of the JPEG and MPEG standardization committees, and with experienced colleagues who participated in the project DREAMSPACE funded by the European Commission.

GLOSSARY

3D mesh model: a 3D data representation model where the 3D shape of the object and its 2D texture are used in the rendering process
Compression: the process by which the redundancies in a data representation model are exploited to reduce the amount of information (bitstream) to represent the data
Depth image-based rendering: the rendering process using input images and depth maps to synthesize the virtual viewpoint
Point cloud: set of 3D data points in space, which essentially correspond to the vertices of a 3D mesh model, but without including any connectivity. In practice, point clouds should be dense to obtain good rendering quality
Rendering: the process by which some visual data is processed for visualization on a display
Super-multiview video: high-density array of video views captured by a linear or curvilinear arrangement of conventional cameras




REFERENCES

[1] F. Pereira, E.A.B. da Silva, G. Lafruit, Plenoptic imaging: representation and processing, in: R. Chellappa, S. Theodoridis (Eds.), Academic Press Library in Signal Processing, vol. 6, Elsevier, Academic Press, New York, 2018, pp. 75–111.
[2] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, S. Sullivan, High-quality streamable free-viewpoint video, in: ACM Transactions on Graphics (TOG) – Proceedings of ACM SIGGRAPH 2015, vol. 34(4), August 2015 (Article No. 69).
[3] A. Jones, J. Unger, K. Nagano, J. Busch, X. Yu, H.-Y. Peng, O. Alexander, M. Bolas, P. Debevec, An automultiscopic projector array for interactive digital humans, in: Proceeding SIGGRAPH’15 ACM SIGGRAPH 2015 Emerging Technologies, 2015, https:// (Article No. 6).
[4] TimeSlice Films, Tim Macmillan Early Work 1980–1994, 2009, 6165108.
[5] TimeSlice Films, Sky High Jump, 2012.
[6] M. Domanski, A. Dziembowski, A. Grzelka, D. Mieloch, O. Stankiewicz, K. Wegner, Multiview Test Video Sequences for Free Navigation Exploration Obtained Using Pairs of Cameras, ISO/IEC JTC 1/SC 29/WG 11 Doc. M38247, Geneva, Switzerland, 2016.
[7] P. Goorts, S. Maesen, M. Dumont, S. Rogmans, P. Bekaert, Free viewpoint video for soccer using histogram-based validity maps in plane sweeping, in: Proceedings of The International Conference on Computer Vision Theory and Applications (VISAPP), 2014, pp. 378–386, =6MzeXeavE1s.
[8] ScanLab projects Point Cloud data sets.
[9] T. Gurdan, M.R. Oswald, D. Gurdan, D. Cremers, Spatial and temporal interpolation of multi-view image sequences, in: 36th German Conference on Pattern Recognition, Münster, September 2–5, 2014.
[10] A. Schenkel, Corrections géométriques et colorimétriques automatisées de modèles tridimensionnels de grande taille (Ph.D. Thesis), LISA Department, Université Libre de Bruxelles, 2016.
[11] Technical Report of the Joint Ad Hoc Group for the Digital Representations of Light/Sound Fields for Immersive Media Applications, ISO/IEC JTC1/SC29/WG1/N72033 & ISO/IEC JTC1/SC29/WG11/N16352, Geneva, Switzerland, June 2016, http://mpeg. Report_JAhG_light-sound_fields.pdf.
[12]
[13] 754-1985 – IEEE Standard for Binary Floating-Point Arithmetic, 754-2008 – IEEE Standard for Floating-Point Arithmetic.
[14]
[15] Mesa Imaging, SR4000 Data Sheet, ipcglab/docs/vision/SR4000_Data_Sheet.pdf.
[16] T. Grajek, R. Ratajczak, K. Wegner, M. Domański, Limitations of vehicle length estimation using stereoscopic video analysis, in: 20th International Conference on Systems, Signals and Image Processing, IWSSIP 2013, Bucharest, Romania, July 7–9, 2013, pp. 27–30.
[17] ST 259:2008: For Television – SDTV1 Digital Signal/Data – Serial Digital Interface, ISBN 978-1-61482-407-7, 2008.


[18]
[19] M. Tanimoto, Ray Capture Systems for FTV, Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), Asia-Pacific, 2012, ISBN: 978-14673-4863-8.
[20] F. Zilly, Method for the Automated Analysis, Control and Correction of Stereoscopic Distortions and Parameters for 3D-TV Applications (Doctoral Thesis), 2015.
[21] I. Feldmann, M. Mueller, F. Zilly, R. Tanger, K. Mueller, A. Smolic, P. Kauff, T. Wiegand, HHI Test Material for 3D Video, ISO/IEC JTC1/SC29/WG11, Doc. m15413, France, Archamps, 2008.
[22] M. Domanski, K. Klimaszewski, J. Konieczny, M. Kurc, A. Łuczak, O. Stankiewicz, K. Wegner, An experimental free-view television system, in: 1st International Conference on Image Processing & Communications (IPC), Bydgoszcz, Poland, September 2009, pp. 175–184.
[23] M. Domanski, T. Grajek, K. Klimaszewski, M. Kurc, O. Stankiewicz, J. Stankowski, K. Wegner, Poznan Multiview Video Test Sequences and Camera Parameters, ISO/IEC JTC1/SC29/WG11, MPEG/M17050, m17050, Xian, China, October 26–30, 2009.
[24] J.J. McConnell, Computer Graphics: Theory into Practice, Jones & Bartlett Learning, Jones and Bartlett Publishers, Inc, 2006, p. 120. ISBN: 0-7637-2250-2.
[25] D.C. Brown, Decentering distortion of lenses, Photogramm. Eng. 32 (3) (1966) 444–462.
[26] tion.html.
[27] Z. Zhang, A Flexible New Technique for Camera Calibration, MSR-TR-98-71, Microsoft Research, 1998.
[28] G. Lafruit, K. Wegner, M. Tanimoto, FTV Software Framework, ISO/IEC JTC1/SC29/WG11 MPEG2015/N15349, Poland, Warsaw, 2015.
[29]
[30] J. Stankowski, K. Klimaszewski, O. Stankiewicz, K. Wegner, M. Domański, Preprocessing Methods Used for Poznan 3D/FTV Test Sequences, ISO/IEC JTC1/SC29/WG11 MPEG 2010/M17174, m17174, Kyoto, Japan, 2010.
[31] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset, Int. J. Rob. Res. 32 (11) (2013) 1231–1237.
[32] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vis. 47 (1) (2002) 7–42.
[33] M.D. Nguyen, Stereo Depth Estimation With Inter-View Consistencies (Master Thesis), Faculty of Science and Bio-Engineering Sciences, Department of Computer Science, Vrije Universiteit Brussel, 2015.
[34] R.A. Hamzah, H. Ibrahim, Literature survey on stereo vision disparity map algorithms, J. Sens. 2016 (2016) 23 (Article ID 8742920).
[35] M. Dumont, P. Goorts, S. Maesen, G. Lafruit, P. Bekaert, Real-time edge-sensitive local stereo matching with iterative disparity refinement, in: 11th International Joint Conference, ICETE 2014, Vienna, Austria, August 28–30, 2014, pp. 435–456.
[36] D.M. Nguyen, J. Hanca, S.-P. Lu, P. Schelkens, A. Munteanu, Accuracy and robustness evaluation in stereo matching, in: SPIE Optical Engineering and Applications, Proceedings vol. 9971, Applications of Digital Image Processing XXXIX, 2016, 10.1117/12.2236509.
[37] B. Ceulemans, M.D. Nguyen, G. Lafruit, A. Munteanu, New Depth Estimation and View Synthesis, ISO/IEC JTC1/SC29/WG11, MPEG2016/M38062, San Diego, 2016.



[38] P. Tan, P. Monasse, Stereo disparity through cost aggregation with guided filter, Image Process. On Line 4 (2014) 252–275.
[39] K. Zhang, J. Lu, G. Lafruit, Cross-based local stereo matching using orthogonal integral images, IEEE Trans. Circ. Syst. Video Technol. 19 (2009) 1073–1079.
[40] S.N. Sinha, Graph Cut Algorithms in Vision, Graphics and Machine Learning: An Integrative Paper, Microsoft Technical Report MSR-TR-2004-152, 2004, research/publication/248342/.
[41] V. Kolmogorov, R. Zabih, Graph cut algorithms for binocular stereo with occlusions, in: N. Paragios, Y. Chen, O. Faugeras (Eds.), Handbook of Mathematical Models in Computer Vision, Springer, New York, 2006.
[42] K. Wegner, O. Stankiewicz, M. Tanimoto, M. Domanski, Enhanced Depth Estimation Reference Software (DERS) for Free-Viewpoint Television, ISO/IEC JTC1/SC29/WG11, Doc. M31518, 2013.
[43] E.B. Sudderth, W.T. Freeman, Signal and image processing with belief propagation, IEEE Signal Process. Mag. 25 (2) (2008) 114–141.
[44] M.F. Tappen, W.T. Freeman, Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters, in: Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 2, October 2003, pp. 900–906.
[45] R. Collins, A space-sweep approach to true multi-image matching, in: Proceedings of the CVPR, 1996, pp. 358–363.
[46] P. Goorts, Real-Time, Adaptive Plane Sweeping for Free Viewpoint Navigation in Soccer Scenes (Ph.D. Thesis), Hasselt University, 2014.
[47] R.C. Bolles, H.H. Baker, D.H. Marimont, Epipolar-plane image analysis: an approach to determining structure from motion, Int. J. Comput. Vis. 1 (1987) 7–55.
[48] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, M. Gross, Scene reconstruction from high spatio-angular resolution light fields, in: ACM Transactions on Graphics (TOG) Conference Proceedings of SIGGRAPH, vol. 32(4), July 2013 (Article No. 73).
[49] L. Jorissen, P. Goorts, S. Rogmans, G. Lafruit, P. Bekaert, Multi-camera epipolar plane image feature detection for robust view synthesis, in: Proceedings of the 3DTV-Conference: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON), 2015.
[50] L. Jorissen, P. Goorts, G. Lafruit, P. Bekaert, Multi-view wide baseline depth estimation robust to sparse input sampling, in: Proceedings of the 3DTV-Conference: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON), 2016.
[51] I.E. Sutherland, R. Sproull, R.A. Schumacker, A characterization of ten hidden-surface algorithms, ACM Comput. Surv. 6 (1) (1974) 1–55.
[52] Break-Dancers and Ballet Sequence: sbkang/3dvideodownload/.
[53] C.L. Zitnick, S.B. Kang, M. Uyttendaele, S. Winder, R. Szeliski, High-quality video view interpolation using a layered representation, ACM Trans. Graph. 23 (3) (2004) 600–608.
[54] A. Dziembowski, A. Grzelka, D. Mieloch, O. Stankiewicz, K. Wegner, M. Domański, Multiview synthesis – improved view synthesis for virtual navigation, in: Picture Coding Symposium, PCS 2016, Nuremberg, Germany, 2016.
[55] K. Wegner, O. Stankiewicz, M. Tanimoto, M. Domanski, Enhanced View Synthesis Reference Software (VSRS) for Free-Viewpoint Television, ISO/IEC JTC1/SC29/WG11, Doc. M31520, 2013.


[56] K. Wegner, O. Stankiewicz, M. Domanski, Depth Based View Blending in View Synthesis Reference Software (VSRS), ISO/IEC JTC1/SC29/WG11 MPEG2015, M37232, Geneva, Switzerland, 2015.
[57] K. Wegner, O. Stankiewicz, M. Domanski, Novel depth-based blending technique for improved virtual view synthesis, in: IEEE International Conference on Signals and Electronic Systems, Krakow, Poland, 2016.
[58] P. Ndjiki-Nya, M. Köppel, D. Doshkov, H. Lakshman, P. Merkle, K. Müller, T. Wiegand, Depth image-based rendering with advanced texture synthesis for 3-D video, IEEE Trans. Multimedia 13 (3) (2011) 453–465, TMM.2011.2128862.
[59] M. Tanimoto, T. Fujii, K. Suzuki, Reference Software of Depth Estimation and View Synthesis for FTV/3DV, ISO/IEC JTC1/SC29/WG11, MPEG2008/M15836, Busan, Korea, 2008.
[60] G. Lafruit, K. Wegner, T. Grajek, T. Senoh, K.P. Tamás, P. Goorts, L. Jorissen, B. Ceulemans, P.C. Lopez, Sergio García Lobo, Qing Wang, Joël Jung, Masayuki Tanimoto, FTV Software Framework, ISO/IEC JTC1/SC29/WG11 MPEG2014/N15349, Poland, Warsaw, 2015.
[61] I.E. Richardson, The H.264 Advanced Video Compression Standard, second ed., Wiley, Chichester, West Sussex, UK, 2010.
[62] V. Sze, M. Budagavi, G.J. Sullivan, High Efficiency Video Coding (HEVC), Algorithms and Architectures, Springer, Heidelberg New York Dordrecht London, 2014.
[63] M. Wien, High Efficiency Video Coding, Coding Tools and Specification, Springer, Cham Heidelberg New York Dordrecht London, 2015.
[64] I.E. Richardson, Coding Video: A Practical Guide to HEVC and Beyond, Wiley, Chichester, West Sussex, UK, 2016.
[65] Generic Coding of Moving Pictures and Associated Audio Information: Video, ISO/IEC Int. Standard 13818-2: 2013 and ITU-T Rec. H.262 (V3.1), 2012.
[66] Coding of audio-visual objects, Part 10: Advanced Video Coding, ISO/IEC Int. Standard 14496-10: 2014 and Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec. H.264 (V9) 2014.
[67] High Efficiency Coding and Media Delivery in Heterogeneous Environment: High Efficiency Video Coding, ISO/IEC Int. Standard 23008-2: 2015 and High Efficiency Video Coding, ITU-T Rec. H.265 (V3), 2015.
[68] K.R. Rao, Do Nyeon Kim, Jae Jeong Hwang, Video Coding Standards: AVS China, H.264/MPEG-4 Part10, HEVC, VP6, DIRAC and VC-1, in: Series Signals and Communication Technology, Springer, Netherlands, 2014, ISBN: 9789400767416/9789400767423 (online).
[69] M. Domanski, T. Grajek, D. Karwowski, J. Konieczny, M. Kurc, A. Łuczak, R. Ratajczak, J. Siast, J. Stankowski, K. Wegner, Coding of multiple video + depth using HEVC technology and reduced representations of side views and depth maps, in: 29th Picture Coding Symposium, PCS 2012, Kraków, May 2012, pp. 5–8.
[70] M. Domanski, A. Dziembowski, D. Mieloch, A. Łuczak, O. Stankiewicz, K. Wegner, A practical approach to acquisition and processing of free viewpoint video, in: 31st Picture Coding Symposium PCS 2015, Cairns, Australia, 31 May–3 June 2015, pp. 10–14.
[71] G. Davidson, M. Isnardi, L. Fielder, M. Goldman, C. Todd, ATSC video and audio coding, Proc. IEEE 94 (2006) 60–76.
[72] M. Domanski, Approximate Video Bitrate Estimation for Television Services, ISO/IEC JTC1/SC29/WG11 Doc. MPEG M3671, Warsaw, June 2015.



[73] M. Domanski, A. Dziembowski, T. Grajek, A. Grzelka, Ł. Kowalski, M. Kurc, A. Łuczak, D. Mieloch, R. Ratajczak, J. Samelak, O. Stankiewicz, J. Stankowski, K. Wegner, Methods of high efficiency compression for transmission of spatial representation of motion scenes, in: IEEE Int. Conf. Multimedia and Expo Workshops, Torino, 2015.
[74] B.G. Haskell, A. Puri, A.N. Netravali, Digital Video: An Introduction to MPEG-2, Chapman & Hall, New York, 1996.
[75] L. Torres, M. Kunt (Eds.), Video Coding, The Second Generation Approach, Kluwer, Boston, MA, 1996.
[76] J.-B. Lee, H. Kalva, The VC-1 and H.264 Video Compression Standards for Broadband Video Services, Springer, New York, 2008.
[77] Parameter values for ultra-high definition television systems for production and international programme exchange, Rec. ITU-R BT.2020-1, 2014.
[78] JPEG 2000 Image Coding System, Part 1: Core Coding System, second ed., ISO/IEC International Standard 15444-1, 2004 and ITU-T Rec. T.800, 2002.
[79] JPEG 2000 Image Coding System, Part 3: Motion JPEG 2000, ISO/IEC International Standard 15444-3, 2007 and ITU-T Rec. T.802, 2005.
[80] T. Acharya, P.-S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, Wiley-Interscience, Hoboken, 2004.
[81] P. Schelkens, A. Skodras, T. Ebrahimi (Eds.), The JPEG 2000 Suite, Wiley, Chichester, 2009.
[82] D.S. Taubman, M.W. Marcelin, JPEG 2000 Image Compression Fundamentals, Standards and Practice, Kluwer, Boston, 2002.
[83] MPEG Video Technologies – Part 3: Representation of Auxiliary Video Streams and Supplemental Information, ISO/IEC Int. Standard 23002-3, 2007.
[84] K. Klimaszewski, O. Stankiewicz, K. Wegner, M. Domański, Quantization optimization in multiview plus depth video coding, in: IEEE International Conference on Image Processing ICIP 2014, Paris, France, October 27–30, 2014.
[85] Y. Morvan, D. Farin, P.H.N. de With, Platelet-based coding of depth maps for the transmission of multiview images, in: Proc. SPIE, Stereoscopic Displays and Applications, vol. 6055, San Jose, January 2006.
[86] P. Merkle, C. Bartnik, K. Müller, D. Marpe, T. Wiegand, 3D video: depth coding based on inter-component prediction of block partitions, in: 29th Picture Coding Symposium, PCS 2012, Kraków, May 2012, pp. 149–152.
[87] F. Jäger, Contour-based segmentation and coding for depth map compression, in: Proc. IEEE Visual Communications and Image Processing, Tainan, Taiwan, 2011.
[88] J.Y. Lee, H.W. Park, Efficient synthesis-based depth map coding in AVC-compatible 3D video coding, IEEE Trans. Circ. Syst. Video Technol. 26 (2016) 1107–1116.
[89] J.H. Hur, S. Cho, Y.L. Lee, Adaptive local illumination change compensation method for H.264/AVC-based multiview video coding, IEEE Trans. Circ. Syst. Video Technol. 17 (11) (2007) 1496–1505.
[90] A. Vetro, T. Wiegand, G.J. Sullivan, Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard, Proc. IEEE 99 (2011) 626–642.
[91] G. Tech, Y. Chen, K. Müller, J.-R. Ohm, A. Vetro, Y.-K. Wang, Overview of the multiview and 3D extensions of high efficiency video coding, IEEE Trans. Circ. Syst. Video Technol. 26 (1) (2016) 35–49.


[92] O. Stankiewicz, K. Wegner, M. Domanski, Nonlinear depth representation for 3D video coding, in: IEEE International Conference on Image Processing ICIP 2013, Melbourne, Australia, September 15–18, 2013, pp. 1752–1756. [93] Y. Chen, X. Zhao, L. Zhang, J.-W. Kang, Multiview and 3D video compression using neighboring block based disparity vector, IEEE Trans. Multimedia 18 (4) (2016) 576–589. [94] M. Domanski, O. Stankiewicz, K. Wegner, M. Kurc, J. Konieczny, J. Siast, J. Stankowski, R. Ratajczak, T. Grajek, High efficiency 3D video coding using new tools based on view synthesis, IEEE Trans. Image Process. 22 (9) (2013) 3517–3527. [95] Y. Gao, G. Cheung, T. Maugey, P. Frossard, J. Liang, Encoder-driven inpainting strategy in multiview video compression, IEEE Trans. Image Process. 25 (2016) 134–149. [96] K. M€uller, H. Schwarz, D. Marpe, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, P. Merkle, F.H. Rhee, G. Tech, M. Winken, T. Wiegand, 3D high-efficiency video coding for multi-view video and depth data, IEEE Trans. Image Process. 22 (9) (2013) 3366–3378. [97] F. Shao, W. Lin, G. Jiang, M. Yu, Low-complexity depth coding by depth sensitivity aware rate-distortion optimization, IEEE Trans. Broadcast. 62 (1) (2016) 94–102. [98] M.M. Hannuksela, D. Rusanovskyy, W. Su, L. Chen, R. Li, P. Aflaki, D. Lan, M. Joachimiak, H. Li, M. Gabbouj, Multiview-video-plus-depth coding based on the advanced video coding standard, IEEE Trans. Image Process. 22 (9) (2013) 3449–3458. [99] J. Stankowski, Ł. Kowalski, J. Samelak, M. Doma nski, T. Grajek, K. Wegner, 3D-HEVC extension for circular camera arrangements, in: 3DTV Conference: The True VisionCapture, Transmission and Display of 3D Video, 3DTV- Con 2015, Lisbon, Portugal, July 8–10, 2015. [100] J. Samelak, J. Stankowski, M. Domanski, Adaptation of the 3D-HEVC coding tools to arbitrary locations of cameras, in: International Conference on Signals and Electronic Systems, Krako´w, 2016. [101] D. Rusanovskyy, P. Aflaki, M.M. 
Hannuksela, UndoDancer 3DV Sequence for Purposes of 3DV Standardization, ISO/IEC JTC1/SC29/WG11, Doc. MPEG M20028, Geneva, Switzerland, 2011. [102] M. Tanimoto, T. Fujii, N. Fukushima, 1D Parallel Test Sequences for MPEG-FTV, ISO/ IEC JTC1/SC29/WG11, Doc. MPEG M15378, Archamps, France, 2008. [103] Y.S. Ho, E.K. Lee, C. Lee, Multiview Video Test Sequence and Camera Parameters, ISO/IECJTC1/SC29/WG11 Doc. MPEG M15419, Archamps, France, 2008. [104] M. Domanski, A. Dziembowski, A. Kuehn, M. Kurc, A. Łuczak, D. Mieloch, J. Siast, O. Stankiewicz, K. Wegner, Poznan Blocks – A Multiview Video Test Sequence and Camera Parameters for Free Viewpoint Television, ISO/IECJTC1/SC29/WG11, Doc. MPEG M32243, San Jose, USA, 2014. [105] M. Domanski, A. Dziembowski, A. Kuehn, M. Kurc, A. Łuczak, D. Mieloch, J. Siast, O. Stankiewicz, K. Wegner, Experiments on acquisition and processing of video for free-viewpoint television, in: 3DTV Conference 2014, Budapest, Hungary, July 2–4, 2014. [106] P. Kovacs, [FTV AHG] Big Buck Bunny light-field test sequences, ISO/IEC JTC1/ SC29/WG11, Doc. MPEG M3571, Geneva, 2015. [107] G. Bjøntegaard, Calculation of average PSNR differences between RD-curves, ITU-T Study Group 16 Question 6 (VCEG), Document VCEG-M33, Austin TX, 2001.




FURTHER READING

M. Domanski, M. Gotfryd, K. Wegner, View synthesis for multiview video transmission, in: The 2009 International Conference on Image Processing, Computer Vision, and Pattern Recognition, Las Vegas, USA, July 13–16, 2009, pp. 1–4.

K. Khoshelham, S.O. Elberink, Accuracy and resolution of Kinect depth data for indoor mapping applications, Sensors (Basel) 12 (2) (2012) 1437–1454.


Plenoptic imaging: Representation and processing


Fernando Pereira*, Eduardo A.B. da Silva†, Gauthier Lafruit‡
Instituto Superior Técnico, Universidade de Lisboa – Instituto de Telecomunicações, Lisboa, Portugal*
Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil†
Brussels University (French wing), Brussels, Belgium‡

2.1 INTRODUCTION

The importance of vision for human (and also nonhuman) beings hardly needs stressing. Vision is at the core of human communication capabilities and deeply impacts everyday life in all its dimensions. Since there is no vision without light, light plays a key role in the way humans communicate among themselves and with the world around them. In fact, one of the main ways the world communicates with humans is through the pattern of emitted/reflected light rays which fill the space and are “acquired and processed” by the human visual system. While humans directly capture the light around them to experience the world in which they are immersed, technology makes it possible to create/recreate (virtual) visual experiences by acquiring and displaying light information using sensors/cameras and displays. Over the years, sensors and displays, as well as the processing, transmission, and storage of the visual data itself, have been evolving to provide users with improved experiences and functionalities in a multitude of applications and services in almost all areas of human activity, ranging from industry to entertainment, by way of education and shopping. Naturally, to make these experiences possible, visual data has to be acquired and represented so as to allow recreating some replica of the initial light information while considering all the relevant constraints, e.g., those related to the acquisition, transmission, storage, and display processes, and also to the characteristics of the human visual system. For each of the target applications and services, different requirements may be relevant, notably in terms of efficiency, immersion, interaction, delay, complexity, error robustness, random access, etc. To address all these issues, appropriate acquisition, representation, coding, and display models have to be selected. In past decades, the acquisition, representation, and display models have been mainly defined by the available camera and display technologies.
Academic Press Library in Signal Processing, Volume 6. © 2018 Elsevier Ltd. All rights reserved.

The adopted model for the light/visual information corresponds to rectangular windows of the real world, which in the digital paradigm correspond to rectangular sets of samples, well known as images or frames, for some wavelength components. To improve the user experience, the image/frame spatial resolution has been steadily increasing up to ultra high definition, and more recently the frame rate and sample depth have also been increasing (high dynamic range (HDR) will come soon). If adopted for coding, the frame-based model often leads to a huge coding bitrate, thus calling for the development of coding models which go beyond frames and samples. In this context, the development of more efficient coding models that take into account the resource constraints and required functionalities has been a major target in the past three decades. Because most applications have critical interoperability requirements, image and video coding standardization has played a fundamental role in the evolution of this technology. In this domain, the main players are ISO/IEC/ITU-T JPEG (Joint Photographic Experts Group), ISO/IEC MPEG (Moving Picture Experts Group), and ITU-T VCEG (Video Coding Experts Group), which separately or jointly have specified several coding solutions over the years that have completely changed the world landscape in terms of image- and video-based applications and services. The most recent video coding standards, H.264/AVC (advanced video coding) [1] and HEVC (high efficiency video coding) [2], dominate the market with hundreds of millions of devices, which have changed everyday life. While the frame-based (also known as 2D) acquisition and display model has allowed great achievements with wide deployment, it obviously does not offer a perfect replica of the real world, as looking at a scene on a 2D display is very different from experiencing a real scene.
This drives the continuous struggle to offer more realistic and immersive user experiences, which naturally require more light/visual information and thus more sophisticated acquisition and display models and devices. As a consequence, 3D experiences have enjoyed growing popularity in recent years in several application domains, thus calling for appropriate solutions for 3D video acquisition, representation, coding, and display. The most common solutions have been considering two views as stereo pairs to provide stereo parallax (and better depth experiences) and multiple views to also provide motion parallax. Naturally, increasing the number of video views makes the compression efficiency even more critical, and thus several multiview video (MVV) coding standards have emerged over the years. The first MVV coding standards were based on a pure texture-based approach, e.g., the multiview video coding (MVC) standard [3], which is a backward compatible extension of the H.264/AVC standard, and the MV-HEVC standard, which is a backward compatible extension of the HEVC standard [4,5]. The more recent 3D-HEVC standard [4,5] has adopted a more powerful representation model where both texture and depth are coded. The 3D-HEVC standard is clearly the state of the art, not only in terms of pure coding, as it includes several additional coding tools with respect to HEVC, notably exploiting for the first time the intercomponent (that is, texture and depth) redundancy, but also in terms of representation power, as it allows additional views to be synthesized for display beyond those decoded, using a depth-image-based rendering (DIBR) approach. This more sophisticated representation model, which may require a more sophisticated acquisition process, depending on whether the depth is acquired or obtained from the texture, offers more immersive user experiences, notably in terms of motion parallax smoothness. Unfortunately, the 3D-HEVC standard was developed mainly considering linear and horizontal-only parallax camera arrangements, narrow baselines, i.e., short camera distances, and reduced viewing ranges, which limits its range of application.

Following the continuous pressure to better acquire and replicate the surrounding visual world, some major developments in sensors/cameras and displays have happened recently, notably the Lytro [6], Raytrix [7], and Light [8] cameras and the Holografika [9] and Ostendo [10] displays. These new devices depart from the usual frame-based model to adopt richer acquisition and display models, notably also acquiring and displaying the so-called directional information, ultimately targeting more immersive and natural visual experiences with multiple views with correct perspective cues. With some of these devices reaching the general user and offering new capabilities associated with the adopted richer representation, it is unavoidable to revisit the foundations of light representation and its impact on the vision process to better understand and manage the current trends. In this context, a key role is played by the so-called plenoptic function, which represents the intensity of light seen from any viewpoint or 3D spatial position, in any angular viewing direction, over time, and for each wavelength [11,12].
This powerful model opens new dimensions to visual representation, and thus to new cameras and displays, toward providing more complete, realistic, and immersive visual experiences, while also encompassing the current, simpler representation models in a rather scalable way. Naturally, this increased representation power comes at the cost of an increased amount of data, notably associated with new sampling dimensions, and thus appropriate representation models and efficient coding solutions are needed. Moreover, rendering this type of visual data becomes a more challenging problem, notably if these new visual modalities have to be displayed on conventional 2D or stereo displays. This is especially critical considering that new cameras seem to be emerging faster than new light field displays, whose technology is less mature and does not yet offer the same quality-price ratio. The emergence of these new devices and the need for novel representation and coding models have already been recognized by the major standardization bodies, notably JPEG and MPEG, which have started studying the upcoming needs so that standard coding solutions may be provided in due time [13–15].

In this exciting context, this chapter will review and discuss the basics and recent developments related to visual acquisition, representation, and display from a plenoptic function perspective, notably plenoptic imaging representation and processing, with particular emphasis on coding and rendering. Finally, it is important to mention that not all visual data has to be acquired from the real world, as computer-generated visual data is common these days; naturally, combining real and synthetic visual data is also common and increasingly popular, leading to the various flavors of augmented and mixed reality.

2.2 LIGHT REPRESENTATION: THE PLENOPTIC FUNCTION PARADIGM

As stated before, more immersive visual experiences demand richer and more complete representation paradigms for the visual information. This quest is not new and dates back as far as the 16th century, when Leonardo da Vinci envisioned an "imaging device capturing every optical aspect of a scene." In his own words [16]:

The air is full of an infinite number of radiant pyramids caused by the objects located in it. These pyramids intersect and interweave without interfering with each other. … The semblance of a body is carried by them as a whole into all parts of the air, and each smallest part receives into itself the image that has been caused.

Ultimately, all the visual information present in the world is given by the full description of the electromagnetic field in the visible spectrum for every point in space. At first, providing such a complete description of this electromagnetic field may seem too complicated. However, useful models may be obtained with some simplifying assumptions:

• By restricting the description to noncoherent light, it is possible to use the Fourier representation to describe a propagating wave as an infinite sum of random-phase sinusoids within the visible spectrum, each with a different energy. Mathematically, this is represented as a power density for each wavelength, λ.

• The electromagnetic wave at a point (x,y,z) in space can be decomposed into a sum of wavefronts coming from all directions. Each direction can be described by an azimuth and orientation pair, (θ,ϕ). Therefore, for a given wavelength, the power spectral density for each point and each propagation direction is given by a function P(x,y,z,θ,ϕ,λ).

• If the energy of each electromagnetic wave varies in time (at a rate much smaller than the frequency of the sinusoidal electromagnetic waves), this variation can be represented by adding a temporal dimension, finally yielding a function P(x,y,z,θ,ϕ,λ,t).
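To get a feel for what a sampled version of P(x,y,z,θ,ϕ,λ,t) involves, the following minimal Python sketch counts the samples on a regular 7D grid. The grid sizes are purely hypothetical values chosen for illustration, and the names `grid` and `plenoptic_sample_count` are introduced here and do not come from the literature.

```python
from math import prod

# Hypothetical number of samples along each of the seven plenoptic
# dimensions. These values are illustrative assumptions, not measurements.
grid = {
    "x": 100, "y": 100, "z": 100,   # viewpoint position
    "theta": 360, "phi": 180,       # propagation direction (1 degree steps)
    "lambda": 3,                    # wavelength samples
    "t": 30,                        # time samples
}

def plenoptic_sample_count(sizes):
    """Number of scalar samples of P on a regular grid with the given sizes."""
    return prod(sizes.values())

total = plenoptic_sample_count(grid)
print(f"{total:.3e} samples")
```

Even with these modest assumed resolutions the grid holds about 5.8 × 10^12 samples, which is why reducing the dimensionality and choosing the sampling carefully matter in practice.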

Thus, provided that the light is noncoherent, all the visual information in the world can be represented by the 7D scalar function P(x,y,z,θ,ϕ,λ,t), usually referred to as the plenoptic function [11], which is illustrated in Fig. 2.1.

FIG. 2.1 Visualizing the plenoptic function.

However, although the plenoptic function is conceptually simple, its seven dimensions imply that an enormous amount of data is associated with its representation. Therefore, in practical situations it is essential to reduce its dimensionality and sample it appropriately. In order to do so, one should know the bandwidth and spectrum of the plenoptic function. Many works study the spectrum and bandwidth of the plenoptic function theoretically; good examples are Refs. [17,18].

One important simplification of the plenoptic function has to do with a fundamental characteristic of the human visual system. Its response to a power spectral density P(λ) is just three-dimensional (3D): the retina transforms it into the intensity responses of three types of cone cells, one of the two types of photoreceptor cells in the retina, which are responsible for color vision and function best in relatively bright light. Therefore, as far as the human visual system is concerned, it suffices to represent P(λ) by its projection onto a 3D system of primary colors, the R (red), G (green), and B (blue) colors, yielding the three plenoptic function components PR, PG, and PB. One can thus remove the wavelength dependency of the plenoptic function and replace it by three 6D plenoptic functions, PR(x,y,z,θ,ϕ,t), PG(x,y,z,θ,ϕ,t), and PB(x,y,z,θ,ϕ,t). In the remainder of this text, PC(x,y,z,θ,ϕ,t) will be used to refer to the plenoptic function, where C can be R, G, or B. Note that, if one is interested in the infrared spectrum, one can add channels corresponding to the infrared band, as is often done, for example, in remote sensing applications.

Another common example of a simplified plenoptic function arises when video is recorded with a pinhole camera whose opening is located at a fixed position (x0,y0,z0). In this case, the plenoptic function for color component C depends only on the direction (θ,ϕ) of the light ray entering the camera and on the time t, becoming PC(θ,ϕ,t). As illustrated in Fig. 2.2, in this case the orientation (θ,ϕ) is replaced by the position (u,v) where the light ray strikes the image sensor, yielding the 3D function PC(u,v,t). This function corresponds to a 2D video. Another interesting case is the so-called MVV, where an array of video cameras is used. The array can be 1D, either linear or on a circular arc, or 2D.
In the case of a 1D array, the camera position (x,y,z) is usually replaced by the index k of the camera in the array, yielding the 4D function PC(k,u,v,t). If the array is 2D, the camera position corresponds to a position (k,l) in the array, yielding the 5D function PC(k,l,u,v,t). An example of a time sample of such a function is given in Fig. 2.3.

FIG. 2.2 Simplification of the plenoptic function corresponding to a pinhole 2D camera at a position (x0,y0,z0).

FIG. 2.3 Example of a time sample of the 5D simplification of the plenoptic function PC(k,l,u,v,t) (multiview video) [19].

A different case is that of recording the plenoptic function inside a camera. In other words, one wants to record all the light rays that are inside a wide-aperture camera, which is usually referred to as a light field [20,21]. Since a light ray suffers no obstruction between entering the camera opening and reaching the image sensor, one can assume that its intensity is constant inside the camera. Therefore, one can specify a light ray by the coordinates (u,v) of the point where it enters the camera and the coordinates (x,y) of the point where it strikes the sensor. This yields a simplification of the plenoptic function given as PC(x,y,u,v,t), as exemplified in Fig. 2.4.

FIG. 2.4 Simplification of the plenoptic function corresponding to a light field inside a camera.

Often one is interested in representing 3D objects or surfaces rather than light rays. One important case is given by the so-called point clouds. If a surface is Lambertian [22], that is, its properties do not depend on the direction of the light rays, one can associate with each point (x,y,z) on a surface or object its color value, yielding three functions PC(x,y,z) for C = R, G, or B [23]. Examples are given in Fig. 2.5. Such a function can be further simplified by replacing the color value PC(x,y,z) by a binary value B(x,y,z) that indicates the presence of the object or surface, thus defining only its 3D structure. One can also make this model more sophisticated by dropping the Lambertian constraint, associating with each point on the surface a property that depends on the direction (θ,ϕ) of the light rays emanating from it, leading to a function PC(x,y,z,θ,ϕ). If the object or surface moves or its properties vary in time, a 6D function PC(x,y,z,θ,ϕ,t) is recovered.

FIG. 2.5 Point cloud examples [24,25].

An alternative way of representing surfaces or objects, this time based upon MVV, is to add, to each R, G, B intensity value recorded at position (u,v) inside a camera located at the position (k,l), the distance from the point on the surface emanating the light ray to the camera plane, which is referred to as the depth d and represented as d(k,l,u,v,t). These four 5D functions, PR(k,l,u,v,t), PG(k,l,u,v,t), PB(k,l,u,v,t), and d(k,l,u,v,t), are referred to as the MVV plus depth representation.

From the above, one can see that the 7D plenoptic function is a very rich representation paradigm, and all the previous relevant cases can be derived from it. In addition, such a high-dimensional representation enables the plenoptic function to provide many interesting and useful functionalities. Once sampled and recorded, this rich information can be processed to render images under multiple desired viewing conditions and for multiple displays. For example, light rays can be combined a posteriori to generate an image focused at any depth [20]. It is possible to select objects by their depth using 4D digital filtering [21]. Virtual camera movement can be easily produced, and the rich information in the plenoptic representation can be used to perform tasks such as highly realistic relighting [26]. Also, by dropping all but the spatial position (x,y,z), a binarized plenoptic function B(x,y,z) may yield the 3D structure of an object in the form of a point cloud. The next section will shift from the abstract mathematical representation given by the plenoptic function to practical use cases where plenoptic functions naturally appear.
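The link between the MVV plus depth representation and point clouds can be illustrated with a small sketch: each sample (u,v) of one camera, together with its depth d, is back-projected to a 3D point that carries the sample's color. The ideal pinhole model with focal length `f` and principal point `(cu, cv)` in pixel units, the optical axis along +z, and a camera center given directly in world coordinates are assumptions made here for illustration; real systems use the full intrinsic and extrinsic camera parameters. All function and parameter names are hypothetical.

```python
def backproject(u, v, d, f, cu, cv, center=(0.0, 0.0, 0.0)):
    """Map sensor position (u, v) with depth d to a 3D point (x, y, z).

    Assumes an ideal pinhole camera whose optical axis is the +z axis,
    with focal length f and principal point (cu, cv) in pixel units.
    d is the distance from the scene point to the camera plane.
    """
    x = (u - cu) * d / f + center[0]
    y = (v - cv) * d / f + center[1]
    z = d + center[2]
    return (x, y, z)

def view_to_point_cloud(color, depth, f, cu, cv, center=(0.0, 0.0, 0.0)):
    """Turn one texture-plus-depth view into a list of (x, y, z, (R, G, B))."""
    cloud = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            x, y, z = backproject(u, v, d, f, cu, cv, center)
            cloud.append((x, y, z, color[v][u]))
    return cloud

# Tiny 2x2 example view with made-up values.
color = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
depth = [[2.0, 2.0],
         [4.0, 4.0]]
cloud = view_to_point_cloud(color, depth, f=2.0, cu=0.0, cv=0.0)
```

Attaching a single color to each 3D point implicitly relies on the Lambertian assumption discussed above; for non-Lambertian surfaces, the colors observed by different cameras for the same point would generally differ.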

2.3 EMPOWERING THE PLENOPTIC FUNCTION: EXAMPLE USE CASES

In this section, some relevant example use cases where plenoptic imaging plays an important role are presented. All of them involve the capture, transmission, processing, or display of the plenoptic function, sampled and approximated to various degrees. The use cases are grouped into four functional categories (not necessarily disjoint), where the term “light field” refers to the complete “field of light” (i.e., the plenoptic function) that emanates from the scene objects:

1. Light field communication: The light field is captured and transmitted to the receiver for viewing, essentially without alterations and/or changes of viewpoint, e.g., 3D video teleconferencing.

2. Light field editing: The acquired light field is edited/processed to add a special effect along an existing viewpoint, e.g., a posteriori focusing on another object in the scene, mixing natural and synthetic content with correct illumination/shading, etc.

3. Free navigation: On top of the former category, the viewer requests a viewpoint that is drastically different from the originally acquired ones, freely navigating through the scene, e.g., the frozen bullet time effect in “The Matrix.”

4. Interactive all-reality: On top of the former category, the user also interacts with the scene objects in a virtual/mixed/augmented reality application, e.g., collision detection in gaming, object removal/displacement, depth measurements, etc.


In essence, each category gradually adds functionalities to the previous one, eventually reaching higher degrees of composition, perspective view changes, and interactivity.

2.3.1 LIGHT FIELD COMMUNICATION

Use case 1.1: Super-multiview home television
In this case, the studio has a 1D or 2D array of cameras recording a different video from each position, see Fig. 2.6 (top). These videos are jointly compressed and transmitted to a TV set that has a super-multiview (SMV) display, i.e., a display with a high density array of projected video views, see Fig. 2.6 (bottom). This display decodes the video streams and recovers the light field. The user sees a different perspective from each different position in front of the display, since the light field is spread all over the space around the TV, simultaneously projecting all possible viewpoints that the user might take. In the terminology of Section 2.2, the corresponding approximation of the plenoptic function employed is PC(x,y,z,u,v,t), providing a different view for each viewing direction (u,v).






FIG. 2.6 (Top left) Single parallax, circular camera array [27]; (top right) linear, full parallax Stanford camera array [28]; (bottom) super-multiview display image projections [9]. (Bottom left) Typical multiview system, showing the overlapped, invalid viewing zones versus the correct viewing positions, which illustrates the drawbacks and limitations inherent to the principle. (Bottom right) Schematic illustration of the 3D light field display (HoloVizio) principle, in which, as opposed to multiview technologies, those limitations do not exist.



Use case 1.2: Immersive bidirectional communication
Two users in different locations capture light fields using an array of light field cameras (see Fig. 2.6) and perform bidirectional communication of the light fields, which are used to render a 3D representation of the persons who are communicating, giving the impression of a 3D face-to-face communication. In this case, one uses the approximation of the plenoptic function PC(x,y,u,v,t), providing a different video for each position of the receiver.

2.3.2 LIGHT FIELD EDITING

Use case 2.1: Photographic light field editing
The user takes photographs with a camera that records light fields as PC(x,y,u,v) and sends the light fields to his/her friends, e.g., via Facebook. These photographs can then be rendered a posteriori under different focal points, using light field rendering plugin software, see Fig. 2.7 and Section 2.6. In effect, the scene is shot only once in a light field data format, and axial viewpoint changes can then be applied in postprocessing as many times as desired without any additional shooting.

Use case 2.2: Cinematic, mixed reality light field editing
A dense array of video cameras (see Fig. 2.6, top right) captures the light field in the form of PC(x,y,u,v,t), building cinematic virtual reality (VR) models that dynamically vary in time. These models can be rendered with various focuses, illuminations, and poses (within a limited range), thanks to the information contained in the plenoptic function PC(x,y,u,v,t). A computer graphics-generated object may even be blended into the live action, with correct illumination cues. The film producer renders all the frames with a predefined perspective view, creating a movie presented in cinema theaters. Since it is also possible to ask for a perspective view slightly to the side of those of the cameras [29], this use case contains the seeds of the free navigation use cases 3.2 and 3.3, presented in the next section.

FIG. 2.7 Image rendering with different focal distances (left and center) and all in focus (right) [20].
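One simple way to picture rendering at different focal distances, as in Fig. 2.7, is a shift-and-add scheme: each angular view is shifted in proportion to its offset from the central view, and the shifted views are averaged. The sketch below applies this idea to a 2D slice `lf[u][x]` of a light field (one angular and one spatial coordinate) to stay short. The integer shifts, the parameter name `slope`, and the toy data are assumptions made for illustration; this is not the method of Ref. [20].

```python
def refocus(lf, slope):
    """Shift-and-add refocusing of a 2D light field slice lf[u][x].

    View u is shifted by slope * (u - u_center) samples (rounded to an
    integer) and all views are averaged. Shifted positions that fall
    outside a view are skipped.
    """
    n_views = len(lf)
    width = len(lf[0])
    u_center = (n_views - 1) / 2
    image = []
    for x in range(width):
        total, count = 0.0, 0
        for u in range(n_views):
            xs = x + round(slope * (u - u_center))
            if 0 <= xs < width:
                total += lf[u][xs]
                count += 1
        image.append(total / count if count else 0.0)
    return image

def point_source(n_views, width, x0, disparity):
    """Toy light field: one bright point whose position moves by
    `disparity` samples between neighboring views."""
    u_center = (n_views - 1) / 2
    lf = [[0.0] * width for _ in range(n_views)]
    for u in range(n_views):
        lf[u][x0 + round(disparity * (u - u_center))] = 1.0
    return lf

lf = point_source(n_views=5, width=11, x0=5, disparity=1)
in_focus = refocus(lf, slope=1)      # shifts match the point's disparity
out_of_focus = refocus(lf, slope=0)  # plain average of the views
```

When `slope` matches the disparity of the point, all views align and the point is rendered sharply; with any other value its energy is spread over several samples, which is the blur seen on out-of-focus objects.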


2.3.3 FREE NAVIGATION

Use case 3.1: Omnidirectional 360 degree viewing of the surrounding environment
A 360 degree camera is used to capture an omnidirectional view of a scene. The plenoptic representation captures the information all around the camera. In its simplest form, a viewport (part of the full panorama) of this panoramic view is extracted for visualization on VR glasses. More elaborate techniques allow viewpoints to be extracted with an optical center different from that of the sphere of capturing cameras, supporting slight motion parallax, similar to use case 2.2 [29].

Use case 3.2: Free viewpoint sports event
A sports event is recorded from several fixed viewpoints, multiplexed, and transmitted to the receiver, see Fig. 2.8. This use case is technically very similar to the former, with the difference that the cameras point inward toward (a part of) the enclosed scene, rather than outward from a central sphere to the surrounding scene. The user chooses the perspective viewpoint from which to watch the event with the help of a joystick, and specialized view rendering software synthesizes the corresponding views. Depending on how the plenoptic function is sampled, the view synthesis might be obtained through the 3D graphics pipeline rendering techniques typically used in gaming engines or by image-based interpolation techniques.

FIG. 2.8 Free viewpoint sports recording [30,31].

Use case 3.3: Free viewpoint home television
This use case is similar to the previous one, but often puts more stringent constraints on the rendering quality for closely viewed objects and “frozen time” walk-around effects, as in Fig. 2.9 for the bullet time effect in the movie “The Matrix.” An array of conventional cameras is placed along a circular arc in a TV studio (see Figs. 2.6 and 2.9), each one recording a video from a different horizontal position. These videos are compressed taking into account the redundancy among the different views in order to be transmitted to the TV set at home. At the TV set, the videos are decoded and a joystick or eye tracking device in the TV set requests a viewpoint to be shown to the user, with correct perspective parallax. Intermediate views not transmitted through the network can be synthesized by using interpolation techniques, so that the user has the impression of a continuous change of perspective as he/she moves horizontally in front of the TV set. Using the terminology of Section 2.2, the plenoptic function approximation used here is PC(k,l,u,v,t).

FIG. 2.9 Hundreds of cameras around the scene for free navigation in “The Matrix” [32,33].
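The synthesis of intermediate views mentioned in use case 3.3 can be sketched, under strong simplifying assumptions, as a disparity-based forward warp of one scanline. The sketch assumes two rectified, horizontally aligned cameras, a known per-sample disparity for the left view, and a virtual camera located at a fraction `alpha` of the baseline. The names `warp_scanline` and `alpha` and the sign convention (samples move left as the camera moves right) are choices made here for illustration and do not describe any particular view synthesis software.

```python
def warp_scanline(colors, disparities, alpha):
    """Forward-warp one scanline of the left view toward the right view.

    colors[x]      : sample value at position x in the left view
    disparities[x] : shift (in samples) of that sample between the left
                     and right views; larger means closer to the cameras
    alpha          : 0.0 reproduces the left view, 1.0 the right view

    Returns the virtual scanline; positions that receive no sample
    (disocclusions) are left as None.
    """
    width = len(colors)
    out = [None] * width
    out_disp = [None] * width   # disparity of the sample stored in out
    for x in range(width):
        xt = x - round(alpha * disparities[x])
        if 0 <= xt < width:
            # When two samples collide, keep the closer one (larger disparity).
            if out_disp[xt] is None or disparities[x] > out_disp[xt]:
                out[xt] = colors[x]
                out_disp[xt] = disparities[x]
    return out

# Toy scanline: background 'b' (disparity 0) with a foreground object 'F'
# (disparity 2) at positions 3 and 4.
colors = ['b', 'b', 'b', 'F', 'F', 'b', 'b', 'b']
disparities = [0, 0, 0, 2, 2, 0, 0, 0]
half_way = warp_scanline(colors, disparities, alpha=0.5)
```

The `None` entries mark disoccluded positions that were hidden behind the foreground object in the left view. A practical system would typically fill them by blending a second warp from the right view or by inpainting, both of which are omitted here.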

2.3.4 INTERACTIVE ALL-REALITY Use case 4.1: Surveillance with depth recovery Surveillance of a cluttered environment is carried out using a video camera capable of capturing light fields PC(x,y,u,v,t) as in Fig. 2.6, mounted on top of a moving robot, see Fig. 2.10. With the light field images, one can perform 4D spatial filtering of the plenoptic function PC(x,y,u,v,t) to recover the depth of each point and use the 3D information to detect abandoned objects [21,34]. This use case extends previous categories with explicit depth measurements for applicative purposes (previous use cases often have implicit depth information for proper rendering but this is not visible to the user as an end service). Use case 4.2: Remote surgery with glasses-free 3D display A light field camera is used to record the surgical light field that is transmitted from the operating room where the patient is treated, to a doctor behind a surgery control device at a remote location, see Fig. 2.11. The surgeon looks to the organs of the patient through a stereoscopic display (see Fig. 2.11), a super-multiview display (as in use cases 1.1 and 1.2) or equivalently a holographic display that generates an image of the surgical field in front of the doctor (see Fig. 2.11). In the latter case, the surgeon typically interacts with his/her free hands in space, which poses and positions should be recorded and transmitted to the robotic arms in the operating room. A

2.3 Empowering the plenoptic function: example use cases

FIG. 2.10 Video camera capturing light fields on top of a moving robot performing remote surveillance.

FIG. 2.11 Remote surgery. From # [2017] Intuitive Surgical, Inc.,

ToF (time-of-flight) camera therefore records the surgeon’s hands and builds their 3D model with all necessary details for the surgical action to be remotely transmitted. Note that the plenoptic function is used both for capturing the dynamic surgical field and for modeling the surgeon’s hands. As in the previous use case, depth might be recovered explicitly as side information to the surgeon. Also notice the fundamental importance of the interactivity capability, i.e., the possibility of the user to really




“touch” the object with simulated force feedback. This most probably requires a data format that is able to describe the position of the objects and their constituents (e.g., each point of a point cloud). This feature is not necessarily present in all preceding use cases, where creating a high-quality, parallax-correct viewpoint with high immersion is more prominent than the need for interactivity.

Use case 4.3: Interactive VR training

An offshore oil exploitation plant is mapped using a LIDAR or ToF laser scanner that builds a point cloud model, from which a textured 3D mesh model is extracted. This computer graphics model will be part of a VR environment for training personnel in maintenance operations.

Use case 4.4: Augmented reality surveillance with light field editing

A subsea facility is monitored with an array of cameras, each able to capture light fields PC(x,y,u,v,t). The recorded light fields are used to build a 3D model of the facility and to detect its structural problems. However, since the waters are turbulent due to the many particles suspended by the operation of the facility, one uses 5D processing of the light fields to remove the interference caused by these suspended particles, obtaining a better 3D model and a more effective detection of the structural problems. Pertinent 3D information is sent to the diver’s helmet and/or a tablet that he/she is pointing toward the structure, displaying augmented reality (AR) information. This use case is a combination of use cases 2.1 and 2.2 on the one hand (advanced light field editing with depth-based characteristics) and the previous use cases in this fourth category on the other.

Having exemplified some use cases with rich plenoptic function representations, we address in the next section the plenoptic function acquisition and representation models that may enable them.

2.4 PLENOPTIC ACQUISITION AND REPRESENTATION MODELS

The processing chain used in plenoptic imaging applications can be summarized as in Fig. 2.12 (inspired by Ref. [35]). First, plenoptic data is either captured with some sensor or computer generated using some authoring tool. Both the natural


FIG. 2.12 Plenoptic imaging processing flow. Diagram labels: Acquisition/creation; Data + metadata; Coded (data + metadata); Decoding; Rendered data.

acquisition and the artificial creation lead to some sampling of the plenoptic function with reduced dimensionality, using some acquisition/creation model closely related to the selected sensors (naturally, some metadata may also be acquired or inserted). After acquisition, this information may be converted to a more convenient (still uncompressed) representation format, as the acquisition and representation formats do not have to be the same. In fact, while the number of acquisition models may be large, as it is closely related to the variety of sensors, the number of representation models should ideally be as small as possible to increase interoperability and reduce processing complexity, e.g., that associated with mutual conversions. While having a single representation model would be ideal, the variety of application scenarios with associated functionalities and display types may justify the need for more than one representation model. This will very likely lead to multiple plenoptic imaging coding solutions, as they are closely dependent on the representation model. In all cases, since the amount of data generated is huge, one must compress the raw plenoptic data for storage or transmission. In this section, the plenoptic imaging acquisition, representation, and display stages will be addressed, while the next sections will be specifically dedicated to coding and rendering.

While the 7D plenoptic function can fully express the light information available in a real scene, practical constraints have led to strongly reducing its sampling dimensionality. However, with the trend toward increased visual immersion, less drastic dimensionality reductions than in the past become relevant. Naturally, the sampling of the plenoptic function will involve measuring radiance/color information for specific wavelengths or bandwidths, for some positions in a 5D space, with or without the temporal dimension.

2.4.1 ACQUISITION

To perform the acquisition, several types of sensors may be used, naturally in different ways and contexts:

1. Texture
– Traditional cameras, which measure color information represented as RGB or luminance and chrominance data, the so-called texture, for each view.
– Arrays of traditional cameras, which include a more or less dense array of cameras, with regular or irregular, linear or nonlinear arrangements, with single (horizontal or vertical only) or double/full parallax, see Fig. 2.6.
– Microlens array (aka light field) cameras, which insert a microlens array into the optical path of a monocular camera, thus providing directional data for each sample (x,y) position. In practice, these devices behave like multiple small cameras in a single box and may or may not consider the temporal dimension.
– 360 degree (static or video) cameras, which increase the field of view up to 360 degrees in the horizontal plane, or even cover (approximately) the entire sphere, see Fig. 2.13.




FIG. 2.13 360 Degree camera examples: Bubl [36], Panono Panoramic Ball [37], Nokia Ozo [38], and Samsung Project Beyond [39].

2. Depth
– Infrared cameras, which capture the object distance using infrared radiation, typically in wavelengths as long as 14 μm. Infrared cameras may be used as part of structured light projectors and ToF cameras.
– Structured light projectors, which measure depth by solving a correspondence problem between a reference structured light pattern (grids, horizontal bars, or random dots) and the captured light pattern, after projecting the reference pattern onto a scene.
– Time-of-flight (ToF) range imaging cameras, which measure the distance by means of the time of flight of a continuous light signal between the camera and the subject for each point of the image, resulting in depth maps. Some devices work by modulating the outgoing beam with a radio frequency (RF) carrier and measuring the phase shift of that carrier at the receiver side.
– LIDAR cameras, which measure the distance to a target by illuminating it with laser pulses and analyzing the reflected light, resulting in depth maps. ToF and LIDAR are typically appropriate for shorter ( δ | X = x_i) p(X = x_i)

x_i = δ

In the work of [125], for example, it was found that these gaze disruptions are much better predictors of visual quality than traditional objective quality metrics.

3.4.3 TESTING THE COMPUTATIONAL MODELS

After deriving several parameters from eye-tracking data (either from a custom experiment or an existing dataset), we compare these data with those generated by the designed computational model in order to judge how well the model performs. Several types of computational models, such as fixation duration predictors [126], salience detectors [19], and scan-path generators [86], can be compared using the methods described in the following sections.



Statistical analysis of fixation and saccades

Simple measures derived from first-order data, such as the amplitude of saccades and the duration of fixations, can play a very important role in gaze analysis [32, 127, 128]. Fixations of shorter duration, for example, mean that less information is collected per fixation, either due to the nature of the stimuli or that of the task [128]. A statistical analysis of the fixation durations from the actual eye-tracking data and the ones predicted by the computational model is necessary to evaluate the goodness of the model. After first checking the data for normality, either a z-test (parametric case) or the Kolmogorov-Smirnov (KS) test (nonparametric case) can be used. Significant differences between the two groups are tested against the null hypothesis that the fixation duration samples from the computational model and from the actual experiment come from identical distributions, i.e., that there is no significant difference between them.

Similarity in saliency maps

Common methods used to evaluate the degree of similarity between two saliency maps (for instance, between one generated by a computational model and another obtained from actual eye-tracking data) are the correlation-based measure, the Kullback-Leibler divergence, and ROC analysis [129]. The Pearson correlation coefficient r between two maps H and P is defined as

$$r_{H,P} = \frac{\mathrm{cov}(H,P)}{\sigma_H \sigma_P}$$


where cov(H, P) is the covariance between H and P, and σ_H and σ_P are the standard deviations of maps H and P, respectively. The linear correlation coefficient has a value between −1 and 1. A value of 0 indicates that there is no linear correlation between the two maps, and values close to 0 indicate a poor correlation between the two sets. A value of 1 indicates a perfect correlation. The sign of r is helpful in determining whether the data share the same structure: a value of −1 also indicates a perfect correlation, but the saliencies are exactly opposite. This indicator is very simple to compute and is invariant to linear transformation. The Kullback-Leibler (KL) divergence can also be used to estimate the overall dissimilarity between two probability density functions. Considering two discrete distributions R and P with probability density functions r_k and p_k, the KL divergence between R and P is given by the relative entropy of P with respect to R:

$$KL(R,P) = \sum_k p_k \log \frac{p_k}{r_k}$$


The KL-divergence is defined only if r_k and p_k both sum to 1 and if r_k > 0 for any k such that p_k > 0. The KL-divergence is not a distance, since it is not symmetric and does not satisfy the triangle inequality. It is nonlinear as well and varies in the range of zero to infinity. A zero KL-divergence indicates that the two probability density functions are strictly equal. The fact that the KL divergence does not have a well-defined upper bound is a major drawback.
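As an illustration of the two measures above, the following minimal sketch computes the Pearson correlation coefficient and the KL divergence for saliency maps flattened into plain Python lists. The function names are hypothetical, the maps are assumed to be non-negative and non-constant, and the KL divergence is written in the non-negative form sum_k p_k log(p_k / r_k), which matches the stated range of zero to infinity and the stated condition on r_k.

```python
import math


def pearson_cc(h, p):
    """Linear correlation coefficient between two flattened saliency maps."""
    n = len(h)
    mean_h = sum(h) / n
    mean_p = sum(p) / n
    cov = sum((a - mean_h) * (b - mean_p) for a, b in zip(h, p)) / n
    std_h = math.sqrt(sum((a - mean_h) ** 2 for a in h) / n)
    std_p = math.sqrt(sum((b - mean_p) ** 2 for b in p) / n)
    return cov / (std_h * std_p)


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(r, p):
    """KL(R, P) = sum_k p_k * log(p_k / r_k); terms with p_k == 0 add nothing."""
    r, p = normalize(r), normalize(p)
    total = 0.0
    for rk, pk in zip(r, p):
        if pk > 0:
            if rk <= 0:
                raise ValueError("undefined: r_k = 0 where p_k > 0")
            total += pk * math.log(pk / rk)
    return total
```

Swapping the two arguments of `kl_divergence` generally changes the result, which is the asymmetry noted in the text, whereas `pearson_cc` is symmetric in its arguments.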


A third method, ROC analysis, is perhaps the most widely used in the community for assessing the degree of similarity between two saliency maps (a ground truth map and a predicted one). Although ROC analysis usually involves the ground truth saliency data (from eye-tracking experiments) and the saliency map obtained from a computational model, it is also common to encounter a second variant in the literature that involves fixation points and a saliency map (described further under Hybrid approaches). In the former, the continuous saliency maps are processed with a binary classifier applied to every pixel: the image pixels of the ground truth, as well as those of the prediction, are classified as fixated (or salient) or as not fixated (or not salient). A simple threshold operation is used for this purpose. The continuous ground truth saliency map is thresholded with a constant threshold in order to retain a given percentage of image pixels; for instance, we can keep the top 2%, 5%, 10%, or 20% most salient pixels of the map. This threshold is called TGx (G for the ground truth and x indicating the percentage of the image considered as being fixated). For the prediction, the threshold is systematically moved between the minimum and the maximum values of the map. A low threshold value corresponds to an overdetection, whereas a high threshold retains only the most salient areas of the map. This threshold is denoted TPx (P for the prediction and x indicating the threshold). For each pair of thresholds, four numbers representing the quality of the classification are computed: the true positives (TPs), the false positives (FPs), the false negatives (FNs), and the true negatives (TNs). The true positive number is the number of fixated pixels in the ground truth that are also labeled as fixated in the prediction.
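The thresholding and counting procedure can be sketched as follows. This is only an illustration under stated assumptions: maps are flattened into lists, both maps are thresholded by keeping a top percentage of pixels, the ground truth percentage lies strictly between 0 and 100, and the rates use the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN). All function names are hypothetical.

```python
def top_percent_mask(values, percent):
    """Mark the `percent` % highest-valued pixels of a flattened map as salient."""
    keep = max(1, round(len(values) * percent / 100.0))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    mask = [False] * len(values)
    for i in order[:keep]:
        mask[i] = True
    return mask


def confusion_counts(gt_mask, pred_mask):
    """Return (TP, FP, FN, TN) for two binary masks of equal length."""
    pairs = list(zip(gt_mask, pred_mask))
    tp = sum(g and p for g, p in pairs)
    fp = sum((not g) and p for g, p in pairs)
    fn = sum(g and (not p) for g, p in pairs)
    tn = sum((not g) and (not p) for g, p in pairs)
    return tp, fp, fn, tn


def roc_points(gt_map, pred_map, gt_percent, pred_percents):
    """One (FPR, TPR) point per prediction threshold, for a fixed ground truth threshold."""
    gt_mask = top_percent_mask(gt_map, gt_percent)
    points = []
    for x in pred_percents:
        tp, fp, fn, tn = confusion_counts(gt_mask, top_percent_mask(pred_map, x))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule, anchored at (0,0) and (1,1)."""
    pts = sorted([(0.0, 0.0)] + list(points) + [(1.0, 1.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A prediction identical to the ground truth map yields an area of 1, while a prediction unrelated to the ground truth is expected to give an area near 0.5.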
An ROC curve that plots the TP rate (TPR) as a function of the FP rate (FPR) is usually used to display the classification result for the set of thresholds used. The TPR, also called sensitivity or recall, is defined as TPR = TP/(TP + FN), whereas the FPR is given by FPR = FP/(FP + TN) (Fig. 3.9). The ROC area, or the area under the curve (AUC), provides a measure of the overall performance of the classification. A value of 1 indicates a perfect classification by the model, whereas the chance level is 0.5. A similar ROC measure, as illustrated in Fig. 3.12, was used for observer rejection in the work of [124] to obtain the results shown in Fig. 3.9.

Scan-path similarity metrics

Two metrics are typically used for comparing two scan-paths: a distance-based string edit technique or the vector-based scan-path similarity method. Vector-based approaches are generally preferred because they perform the comparison in different dimensions like temporal, spatial, and frequency (temporal) [129]. Scan-paths predicted by a computational model and those obtained from actual eye-tracking experiments are expected to produce a significant similarity score. A common way to compare two scan-paths is to compute the Levenshtein distance, which indicates the number of string replace/insert/delete operations needed to transform the tested string into the reference string [130]. As the same number of edits in a shorter sequence means more dissimilarity than in a long sequence, the Levenshtein score normalized to the maximum length (among the



FIG. 3.9 The true positive (TP) rate at the equal error rate (EER) point averaged over all sequences. Those observers having a TP rate of close to 50% (below the indicated horizontal line) are rejected. The current analysis therefore rejects three inconsistent observers. Axes: True positive rate (TPR) versus User number.

two compared strings) was, for instance, used to compute the similarity of the scan patterns in the work of [121]. To test for statistical significance in the Levenshtein similarity scores of the model as compared to the ground truth, a KS test or a z-test can be performed. Alternatively, a dynamic programming approach can be used in which each of the m individual saccades in the model-generated scan-path is compared to each of the n saccades in the ground truth scan-path, as illustrated in Fig. 3.10. Three separate aspects of similarity can be considered: (1) spatial proximity of the saccades, i.e., the 2D Euclidean separation between their starting points, (2) difference in direction and magnitude (as indicated by the vector difference of the saccades), and (3) the temporal proximity of the two saccades. However, there is often a reaction-time latency between the onset of an event and the time the user executes a saccade. Therefore, a latency of approximately 219 ms [131] must also be accounted for in the third measure. Each of the three measures is typically normalized by its maximum possible value, and the three are then averaged to produce an overall similarity score ranging from 0 to 1. The scores obtained for each pair of saccades are then tabulated in a table of size m × n, through which we compute a least-cost path from the top-left node to the bottom-right node using a shortest path algorithm (like Dijkstra's algorithm [123]), as shown in Fig. 3.11. The path traversal cost indicates the overall similarity of the two scan-paths. This cost has to be normalized by the total path length for further comparison. However, to compute a test statistic, we need to accumulate such scores over several cases and then perform a statistical test.
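The string-edit comparison can be sketched as below, assuming each scan-path has already been encoded as a string with one character per fixated region (the encoding itself is not shown). Turning the normalized distance into a similarity as 1 − distance / max length is an assumption made for this illustration, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of insert/delete/replace operations needed to turn string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep on a match)
        prev = curr
    return prev[-1]


def scanpath_similarity(tested, reference):
    """Edit distance normalized by the longer string, mapped to a [0, 1] similarity."""
    longest = max(len(tested), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(tested, reference) / longest
```

For example, `scanpath_similarity("ABCD", "ABDD")` needs one replacement over four symbols and therefore evaluates to 0.75.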


FIG. 3.10 Two scan-paths corresponding to two different test conditions (HRCs) from the work of [121]. While the solid lines indicate the fixated locations, the thinner lines indicate the saccades. While the central dot indicates the median value of the saccade amplitude, the ends of the box indicate the 25th and 75th percentile value. The ends of the line indicate the extreme value after removing the outliers. Axes: X coordinate (in pixels), Y coordinate (in pixels), Time (s).
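The least-cost traversal of the m × n table of saccade-pair similarity scores can be sketched with Dijkstra's algorithm as follows. Several choices here are assumptions made for illustration only: the cost of entering a cell is taken as 1 − similarity, the allowed moves are right, down, and diagonal, and the total cost is normalized by the number of cells on the path. The function name is hypothetical.

```python
import heapq


def least_cost_path(similarity):
    """Dijkstra over an m x n table of similarity scores in [0, 1].

    Returns the path from the top-left to the bottom-right cell and its
    cost divided by the number of cells visited.
    """
    m, n = len(similarity), len(similarity[0])
    cost = [[1.0 - s for s in row] for row in similarity]
    best = {(0, 0): cost[0][0]}
    parent = {}
    heap = [(cost[0][0], (0, 0))]
    while heap:
        dist, (i, j) = heapq.heappop(heap)
        if (i, j) == (m - 1, n - 1):
            break
        if dist > best[(i, j)]:
            continue  # stale queue entry, a cheaper route was already found
        for di, dj in ((0, 1), (1, 0), (1, 1)):
            ni, nj = i + di, j + dj
            if ni < m and nj < n:
                new_dist = dist + cost[ni][nj]
                if new_dist < best.get((ni, nj), float("inf")):
                    best[(ni, nj)] = new_dist
                    parent[(ni, nj)] = (i, j)
                    heapq.heappush(heap, (new_dist, (ni, nj)))
    # Walk back from the bottom-right cell to recover the path.
    path = [(m - 1, n - 1)]
    while path[-1] != (0, 0):
        path.append(parent[path[-1]])
    path.reverse()
    return path, best[(m - 1, n - 1)] / len(path)
```

A normalized cost close to 0 then indicates two very similar scan-paths, and a value close to 1 indicates very dissimilar ones.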

FIG. 3.11 Similarity values between each pair of saccades of the m × n matrix, shown as a surface. The shortest path is traced from the top-left corner of the matrix to the bottom-right. Axes: Arms of red path, Arms of green path, Similarity score.

Hybrid approaches

We can also measure the similarity between two different types of gaze descriptors. For example, the similarity between a saliency map and a set of fixation points can be computed using methods often referred to as hybrid approaches [129].



In addition to the previous application, ROC analysis can also be performed between a continuous saliency map and a set of fixation points. The method tests how the saliency predicted by the model at the points of human fixation compares with the predicted saliency at nonfixated points. As in the previous section, the continuous saliency map is thresholded to retain only a given percentage of the pixels of the map. Each pixel is then labeled as either fixated or not fixated. For each threshold, the observer’s fixations are laid down on the thresholded map. The TPs (fixations that fall on fixated areas) and the FNs (fixations that fall on nonfixated areas) are determined. A curve that shows the TPR (or hit rate) as a function of the threshold can then be plotted. Although interesting, this method is not sensitive to the false alarm rate. To deal with this limitation, a set of control points (corresponding to nonfixated points) also has to be used in the analysis. In a fashion similar to the map-based ROC analysis described earlier, the control points and the fixation points are then used to plot an ROC curve, as illustrated in Fig. 3.12. For each threshold, the observer’s fixations and the control points are laid down on the thresholded map. The TPR (fraction of fixations that fall on fixated areas) and the FPR (fraction of control points that fall on fixated areas) are determined. From this ROC curve, the AUC is computed. The quality of the classification can also be judged from the equal error rate (EER) obtained. The EER is the location on an ROC curve where the false alarm rate equals the miss rate, i.e., FPR = 1 − TPR. As with the AUC, the EER is used to compare the accuracy of the prediction. In general, the system with the lowest EER is the most accurate [129]. Yet another idea is to measure the saliency values at fixation locations along a subject’s scan-path. The first step here is to standardize the saliency values in order to have zero mean and unit standard deviation:

$$Z_{SM}(x) = \frac{SM(x) - \mu}{\sigma}$$


FIG. 3.12 Receiver operating characteristics for User 1 for the sequence construction field [121]. Axes: True positive rate (TPR) versus False positive rate (FPR).


where Z_SM is the standardized saliency map, and μ and σ are the mean and the standard deviation of the saliency map SM, respectively. To take account of the fact that we do not focus accurately on a particular point, the normalized scanpath saliency (NSS) value for a given fixation location is computed on a small neighborhood (say one degree) centered on that location. A third metric, known as the percentile metric, computes a percentile value P(xf(k)) for each fixation location xf(k). This score is the ratio between the number of locations in the saliency map with values smaller than the saliency value at xf(k) and the total number of locations in the map. The percentile value is defined as follows:

$$P(x_f(k)) = 100 \times \frac{\left|\{x \in X : SM(x) < SM(x_f(k))\}\right|}{|SM|}$$


where X is the set of locations in the saliency map SM, and xf(k) is the location of the kth fixation in the ground truth data. The final score is the average of P(xf(k)) for all the fixations in the list [129].
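The two fixation-based scores can be sketched as follows for a saliency map stored as a list of rows and fixations given as (row, column) pairs. As a simplification that is an assumption of this sketch, the standardized saliency is read at the fixated pixel only rather than averaged over a one-degree neighborhood; the function names are hypothetical and the map is assumed to be non-constant.

```python
import math


def nss(saliency, fixations):
    """Mean standardized saliency (zero mean, unit std) at the fixated pixels."""
    values = [v for row in saliency for v in row]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    scores = [(saliency[r][c] - mu) / sigma for r, c in fixations]
    return sum(scores) / len(scores)


def percentile_score(saliency, fixations):
    """Average, over fixations, of the percentage of locations with lower saliency."""
    values = [v for row in saliency for v in row]
    scores = []
    for r, c in fixations:
        smaller = sum(1 for v in values if v < saliency[r][c])
        scores.append(100.0 * smaller / len(values))
    return sum(scores) / len(scores)
```

With the 2 × 2 map `[[0, 0], [0, 4]]` and a single fixation on the bottom-right pixel, three of the four locations have a lower value, so `percentile_score` gives 75.0.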

3.5 APPLICATIONS OF VISUAL ATTENTION

As mentioned earlier, visual attention has been studied and used for various use cases by both the multimedia coding and the computer vision communities. The applications are diverse, ranging from video compression to medicine to marketing, as shown in Fig. 3.13. In this section, without being exhaustive, we provide some elementary details on some of these applications.

3.5.1 QUALITY ASSESSMENT

Quality evaluation of media content as perceived by humans is of great importance in the domain of video compression. As modern encoders such as H.264 and H.265 are based on a local optimization of the rate-distortion function, incorporating elements of visual attention and saliency is very useful to achieve higher compression efficiency at the same overall perceived quality.

FIG. 3.13 Applications of visual attention. Diagram labels: Business and commercial; Web-page design; Multimedia retargeting; Disease detection; Source end; Multimedia delivery; quality evaluation; Training surgeons; Stereoscopy; Virtual-reality.


Using saliency as a weighting factor of local distortions

Intuitively, one can expect that a distortion occurring in an area that attracts the viewer’s attention is more annoying than one in any other area, as confirmed by a study with human observers [68] (see Fig. 3.14). A plethora of works have used visual attention as one of the factors to improve and aid objective quality metrics, as discussed in [12, 132, 133]. In the work of [133, 134], for instance, integrating saliency models brought an added benefit to objective metrics like PSNR and SSIM. Such techniques were also used for guiding the video encoder adaptively in [12]. The eye-tracking-weighted PSNR (EWPSNR) used in that work helped in achieving a subjectively optimum bit-stream in more than 90% of the cases.

Purely attention-based image quality measures: Visual attention deployment as a proxy for quality

In vision science, it is known that attentional modulation can strongly vary the response to visual tasks, especially quality evaluation. Several experiments performed with simultaneous foveal and peripheral tasks [135, 136] have indicated that the response of the extra-striate cortical areas (especially V4) strongly depends on whether the effective stimulus was directly attended to or not. Such an observation challenges the validity of the usual subjective media quality assessment protocols, which assign an explicit task to the observer to provide a quality judgment score, usually by filling in a questionnaire. In order to obtain a more realistic measure of video quality, a free-viewing-based approach is more suitable and is a more naturalistic
















FIG. 3.14 The difference opinion score of users (as compared to the pristine sequence) who observed a similar distortion at various viewing eccentricities. Axes: Delta MOS value and Cortical magnification (in mm/deg) versus Eccentricity from Foveola (in degrees of viewing angle). Legend: Drop in MOS; Cortical magnification.


representation of media consumption. Observation of user behavior, and in particular of visual attention deployment in various media conditions, could be related to the quality of the viewing experience. In contrast to using attention as an added factor in quality evaluation, a purely attention-based metric may be used to quantify the quality of a video by either measuring video attention deviation (VAD) [137] or quantifying disruptions [124, 138] in viewing patterns. While VAD measures quality by comparing the steady-state saccadic probabilities, several other metrics make use of viewing disruptions as a naturalistic measure of perceived quality [124, 138]. In the work of [137], the gaze of a subject is modeled as a Markovian process that can rest in one of two states: fixation or saccade. The probability of a user making a saccade varies throughout the length of the video before reaching a steady-state value, which in turn depends directly on the content, a factor also known as the busyness of the video. This steady-state saccadic probability, also known as VAD, holds important clues regarding the quality of a video. In other instances, the amount of disruption that localized artifacts cause to the gaze patterns is used as an indicator of the video quality [124, 138]. These metrics are based on the property that attention is a stochastic process affected by a number of bottom-up and top-down factors [139] (quality impairments being one of them), as seen in the experiments from [140].
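The notion of a saccadic probability that settles to a steady-state value can be illustrated with a generic two-state Markov chain. The sketch below is not the VAD metric of [137]; it assumes time-invariant transition probabilities estimated from a sequence of per-sample labels ('F' for fixation, 'S' for saccade) in which both states occur, and all names are hypothetical.

```python
def transition_probabilities(states):
    """Estimate P(F -> S) and P(S -> F) from a sequence of 'F'/'S' labels."""
    counts = {("F", "F"): 0, ("F", "S"): 0, ("S", "F"): 0, ("S", "S"): 0}
    for current, following in zip(states, states[1:]):
        counts[(current, following)] += 1
    from_f = counts[("F", "F")] + counts[("F", "S")]
    from_s = counts[("S", "F")] + counts[("S", "S")]
    return counts[("F", "S")] / from_f, counts[("S", "F")] / from_s


def steady_state_saccade_probability(p_fs, p_sf):
    """Stationary probability of the saccade state of the two-state chain."""
    return p_fs / (p_fs + p_sf)


def saccade_probability_over_time(p_fs, p_sf, steps, start=0.0):
    """Probability of being in the saccade state after each step."""
    prob = start
    history = [prob]
    for _ in range(steps):
        # Either enter a saccade from a fixation, or remain in a saccade.
        prob = (1.0 - prob) * p_fs + prob * (1.0 - p_sf)
        history.append(prob)
    return history
```

Iterating `saccade_probability_over_time` shows the probability moving from its starting value toward the stationary value returned by `steady_state_saccade_probability`.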

3.5.2 VISUAL ATTENTION IN MULTIMEDIA DELIVERY

Interactive streaming

With the availability of low-cost, consumer-grade eye trackers, visual-attention-based bit allocation techniques for network video streaming have also been introduced [141]. To improve the efficacy of such gaze-based networked systems, gaze prediction strategies can be used to predict future gaze locations and thereby lower the end-to-end reaction delay caused by the finite round trip time (RTT) of transmission networks. It was demonstrated in [142] that the bit rate can be reduced by slightly more than 20%, without noticeable visual quality degradation, even when end-to-end network delays are as high as 200 ms.

Dealing with packet loss

Packets in an image or video bitstream often contain data with varying levels of importance from the visual information point of view. This results in disproportionate amounts of perceived quality loss when these packets are lost. Quality assessment experiments with observers have demonstrated that the effect of a lost packet depends on the visual importance and salience of the information contained in the packet [68, 143]. Visual attention-based error resilience or ROI-based channel coding methods are consequently good candidates to attenuate the perceptual quality reduction resulting from packet loss [144]. In highly prediction-based coding technologies such as the H.26x family, good compression performance comes from a high dependency between many parts of the coded video sequence.




However, this dependency comes with the drawback of allowing a spatio-temporal propagation of the error resulting from a packet loss. ROI-based coding should also consider attenuating the effect of this spatio-temporal dependency when important parts of the bitstream are lost. As part of the H.264/AVC video coding standard, error resilience features such as flexible macroblock ordering (FMO) and data partitioning (DP) can be exploited to improve the resilience of salient regions of video content. DP partitions each coded slice into three separate network abstraction layer (NAL) units, each containing a different part of the slice. These concepts may already be used in very early stages of transmission, where physical layer parameters like packet loss ratio, energy savings, and spectrum utilization can be tuned to achieve the communication profile that matches the content properties [145].

Image re-targeting

With the recent explosion of consumer devices like tablets and smart phones, formats [3-D, high definition (HD), ultra HD], and services (video streaming, broadcast television, image database browsing), the visual dimension of multimedia content viewed by a human observer can vary enormously, resulting in the stimulation of very different regions of his or her visual field. Depending on display capacity and the purpose of the application, content often needs to be reduced to smaller versions with respect to spatial size, bitstream quality, frame rate, and so on [144]. A common way to achieve this goal is to drastically downsample the picture rather uniformly, as in thumbnail modes. This often yields poorly rendered pictures, since important objects of the scene may no longer be recognizable. Alternatively, content re-targeting techniques perform content-aware image resizing, for example, by seam carving [146]. Saliency-based image retargeting identifies the important ROIs and computes the reduced picture centered on these parts [147].
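One simple way to picture saliency-based retargeting is to pick the crop window of a given size that captures the most saliency. The sketch below is an illustration under that assumption and is not the method of [147]; it uses a summed-area table so that every candidate window is evaluated in constant time, and the function names are hypothetical.

```python
def integral_image(saliency):
    """Summed-area table with one extra leading row and column of zeros."""
    rows, cols = len(saliency), len(saliency[0])
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            table[r + 1][c + 1] = (saliency[r][c] + table[r][c + 1]
                                   + table[r + 1][c] - table[r][c])
    return table


def best_crop(saliency, crop_h, crop_w):
    """Return (top, left, total) of the crop window with the largest saliency sum."""
    table = integral_image(saliency)
    rows, cols = len(saliency), len(saliency[0])
    best = None
    for top in range(rows - crop_h + 1):
        for left in range(cols - crop_w + 1):
            bottom, right = top + crop_h, left + crop_w
            total = (table[bottom][right] - table[top][right]
                     - table[bottom][left] + table[top][left])
            if best is None or total > best[2]:
                best = (top, left, total)
    return best
```

When several windows tie, this sketch keeps the first one found in raster order; a practical system would also need to handle multiple separated regions of interest.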

3.5.3 APPLICATIONS IN MEDICINE

One of the fields where eye-tracking has been most beneficial is medicine, where it may be used for the training of medical personnel (especially surgeons), for diagnosing the presence of several psychological disorders, or in cutting-edge applications like tele-surgery.

Eye-tracking in disease detection

Attention and eye-tracking have been used to detect several disorders such as Attention Deficit Hyperactivity Disorder (ADHD) [148], Obsessive Compulsive Disorder (OCD) [149], schizophrenia [150], Parkinson’s [151, 152], and Alzheimer’s disease [153]. A strong and statistically significant difference between the control and test groups was found in several gaze parameters, such as fixations on and saccades toward the ROI [148, 149], saccadic velocity and amplitude [151], average fixation duration [152], and imperfect pursuit abilities [150, 153]. Eye-tracking tests are in many cases a window onto several mental and physical disorders of a patient.

Eye-tracking in the training of medical personnel

In other scenarios, the practical analysis of medical images is normally performed by surgeons and other experts by first visually inspecting the image (involving visual perception processes, including detection and localization tasks), and later performing an interpretation (requiring cognitive processes). Unfortunately, interpretation is not error free and can be affected by the observer’s level of expertise. Understanding how clinicians read images and develop expertise throughout their careers, or investigating why some are better at interpreting medical images than others, helps develop better training programs and create new tools that could enhance and speed up the learning process for trainee surgeons [144]. Several studies have linked the eye-movement strategies of surgeons to their level of expertise [154, 155]. While [155] showed that the search patterns of surgeons and radiologists changed with each successive year of experience, Tsang et al. [154] observed that novice surgeons, for example, switched their attention between the two displays present in the room more frequently than expert surgeons. Eye-movement-related training has in fact demonstrated significant benefits for the surgical skills of doctors and other medical personnel [156].

Tele-surgery

Visionary technology in the area of medical electronics has made telesurgical patient care possible, and has helped realize transcontinental surgery and perform medical procedures in space missions [157]. In existing tele-surgical systems like the Telelap Alf-x, for example, built-in eye-tracking systems center the video on the point regarded by the surgeon [158]. Along similar lines, exploiting concepts like foveation may help reduce the bandwidth necessary for the transcontinental transmission of high-quality surgical video, thus bringing the technology closer to reality.

3.5.4 VISUAL ATTENTION AND IMMERSIVE MEDIA: A RISING LOVE STORY

Exploiting aspects of attention is becoming key in modern technologies like stereoscopy and virtual reality. The following sections indicate some ways in which visual attention has been used and exploited in these applications.

Stereoscopy and 3D displays

A key factor required for the widespread adoption of services based on stereoscopic images will be the creation of a compelling visual experience for the end user. Perceptual issues and the importance of considering 3D visual attention to improve the overall 3D viewing experience in 3DTV broadcasting have been discussed extensively [144, 159]. Comfortable viewing conditions for stereoscopic content, such as the zone of comfortable viewing, are linked to several factors like the accommodation-vergence conflict, the range of depth of focus, and the range of fusion [160]. A seminal study by Wöpking




[161] suggests that visual discomfort increases with high spatial frequencies and disparities, partially because the limits of stereoscopic fusion increase as a result of the decreased spatial frequency. More generally, it appears that blurring can have a positive impact on visual comfort because it reduces the accommodation-vergence conflict, limiting both the need for accommodation and the effort to fuse [162]. Blurring effects can also be used in 3D content to direct the viewer’s attention toward a specific area of the image that lies within a comfortable viewing zone. Additionally, three-dimensional visual attention models can be employed to provide the area of interest and the convergence plane that drive the repurposing of stereoscopic content. Such visual-attention-based adaptive rendering of 3D stereoscopic video has been proposed using a 2D visual attention model [163].

Virtual reality (VR)

With the advent of novel virtual reality systems such as immersive VR headsets, CAVE systems, and augmented reality glasses, providing a realistic and immersive experience to the user has become very important. Eye tracking and concepts of attention have been exploited in VR systems in two different ways: for realistic and easy navigation in virtual environments, or for faster rendering of content by exploiting the redundancies in the content that falls on our peripheral vision. VR scenes often contain enormous amounts of information, many orders of magnitude greater than the actual processing capacity of the brain. For fast rendering of this data within the headset, we can exploit a property of the human visual system: sensitivity to spatial texture [164], color [165], motion [166], and flicker [167] is significantly reduced across the retinal periphery.
It has already been observed in [68] that the overall perceived quality in such instances is driven mainly by the foveal content and that peripheral content contributes very little to the overall experience, especially beyond six degrees of viewing angle. Fig. 3.14 clearly showed that the drop in quality across the periphery follows the drop in resolution of our visual field, an aspect also known as cortical magnification. This can be readily exploited in virtual reality to reduce processing complexity and speed up the rendering of content within the headset. In other instances [168], it has been observed that exploiting the user’s focus or point of attention may help render the content so as to give an optimal sensation when navigating in first person in VR. Dynamic adaptation of the VR content in this manner was found to have a positive effect on the subjective preference for the technology.
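The peripheral falloff described above can be illustrated with a gaze-contingent rendering budget. What follows is only a minimal sketch under stated assumptions, not the method of [68] or [168]: the function names, the `viewing_distance_px` calibration value, the `falloff_deg` and `min_scale` parameters, and the inverse-linear falloff (a simplified form often used to approximate cortical magnification) are all hypothetical choices made for illustration. Only the six-degree foveal radius is taken from the text.

```python
import math

# From the text: peripheral content matters little beyond about six degrees.
FOVEAL_RADIUS_DEG = 6.0


def eccentricity_deg(px, py, gaze_x, gaze_y, viewing_distance_px):
    """Approximate visual angle (degrees) between a screen point and the gaze point.

    viewing_distance_px is the eye-to-screen distance expressed in pixels
    (an assumed calibration value). The formula treats the gaze point as the
    foot of the perpendicular from the eye, a flat-screen approximation.
    """
    offset_px = math.hypot(px - gaze_x, py - gaze_y)
    return math.degrees(math.atan2(offset_px, viewing_distance_px))


def render_scale(ecc_deg, falloff_deg=20.0, min_scale=0.25):
    """Fraction of full resolution to spend at a given eccentricity.

    Full resolution inside the foveal radius, then an inverse-linear decay
    (hypothetical parameters), never dropping below min_scale.
    """
    if ecc_deg <= FOVEAL_RADIUS_DEG:
        return 1.0
    scale = 1.0 / (1.0 + (ecc_deg - FOVEAL_RADIUS_DEG) / falloff_deg)
    return max(min_scale, scale)


def tile_scales(width, height, tile, gaze_x, gaze_y, viewing_distance_px):
    """Map each tile's top-left corner to a resolution scale based on its center."""
    scales = {}
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            cx, cy = left + tile / 2, top + tile / 2
            ecc = eccentricity_deg(cx, cy, gaze_x, gaze_y, viewing_distance_px)
            scales[(left, top)] = render_scale(ecc)
    return scales
```

In such a scheme, a renderer would query `tile_scales` once per frame with the latest gaze sample and shade each tile at the returned fraction of full resolution; the parameter values would need to be tuned against subjective tests rather than taken from this sketch.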

REFERENCES [1] J. Lubin, Sarnoff JND Vision Model, 1997. [2] A.B. Watson, J.A. Solomon, Model of visual contrast gain control and pattern masking, J. Opt. Soc. Am. A Opt. Image Sci. Vis. 14 (9) (1997) 2379–2391.


[3] K. Ferguson, An adaptable human vision model for subjective video quality rating prediction among CIF, SD, HD and E-cinema, in: Proc. 3rd Int. Workshop Video Process. Quality Metrics for Consum. Electron. (VPQM), 2007. [4] D. Chandler, S. Hemami, VSNR: a wavelet-based visual signal-to-noise ratio for natural images, IEEE Trans. Image Process. 16 (9) (2007) 2284–2298. [5] E.C. Larson, D.M. Chandler, Most apparent distortion: full-reference image quality assessment and the role of strategy, J. Electron. Imaging 19 (1) (2010) 011006. [6] L.K. Choi, L.K. Cormack, A.C. Bovik, On the visibility of flicker distortions in naturalistic videos, in: 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, New York, 2013, pp. 164–169. [7] M.P. Eckstein, A.J. Ahumada, A.B. Watson, Visual signal detection in structured backgrounds. II. Effects of contrast gain control, background variations, and white noise, J. Opt. Soc. Am. A Opt. Image Sci. Vis. 14 (9) (1997) 2406–2419. [8] R. Mantiuk, K.J. Kim, A.G. Rempel, W. Heidrich, HDR-VDP-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions, ACM Trans. Graph, 30 (2011) 40. [9] D.M. Chandler, M.D. Gaubatz, S.S. Hemami, A patch-based structural masking model with an application to compression, EURASIP J. Image Video Process. 2009 (1) (2009) 649316, [10] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612. [11] Y. Rai, P. Le Callet, G. Cheung, Role of HEVC coding artifacts on gaze prediction in interactive video streaming systems, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3344–3348, [12] Z. Li, S. Qin, L. Itti, Visual attention guided bit allocation in video compression, Image Vis. Comput. 29 (1) (2011) 1–14. [13] H. Hadizadeh, I.V. Bajic, Saliency-aware video compression, IEEE Trans. Image Process. 
23 (1) (2014) 19–33. [14] R. Gupta, S. Chaudhury, A scheme for attentional video compression, in: International Conference on Pattern Recognition and Machine Intelligence, Springer, Berlin, Heidelberg, 2011, pp. 458–465. [15] S. Wulf, U. Zolzer, Visual saliency guided mode decision in video compression based on Laplace distribution of DCT coefficients, in: Visual Communications and Image Processing Conference, 2014 IEEE, IEEE, New York, 2014, pp. 490–493. [16] O. Le Meur, P.L. Callet, What we see is most likely to be what matters: visual attention and applications, in: 2009 16th IEEE International Conference on Image Processing (ICIP), 2009, pp. 3085–3088. [17] Y. Rai, G. Cheung, P. Le Callet, Quantifying the relation between perceived interest and visual salience during free viewing using trellis based optimization, in: 2016 International Conference on Image, Video, and Multidimensional Signal Processing, vol. 9394, 2016. [18] U. Engelke, H. Kaprykowsky, H. Zepernick, P. Ndjiki-Nya, Visual attention in quality assessment, IEEE Signal Process. Mag. 28 (6) (2011) 50–59. [19] L. Itti, C. Koch, Computational modelling of visual attention, Nat. Rev. Neurosci. 2 (3) (2001) 194–203. [20] A. Mazaheri, N.E. DiQuattro, J. Bengson, J.J. Geng, Pre-stimulus activity predicts the winner of top-down vs. bottom-up attentional selection, PLoS One 6 (2) (2011) e16243.




[21] T. Foulsham, Saliency and Eye Movements in the Perception of Natural Scenes, Ph.D. dissertation, University of Nottingham, 2008. [22] W. James, The Principles of Psychology, Read Books Ltd, New York, NY, 2013. [23] M. Carrasco, Visual attention: the past 25 years, Vis. Res. 51 (13) (2011) 1484–1525. [24] M. Cheal, D.R. Lyon, D.C. Hubbard, Does attention have different effects on line orientation and line arrangement discrimination? Q. J. Exp. Psychol. 43 (4) (1991) 825–857. [25] E. Hein, B. Rolke, R. Ulrich, Visual attention and temporal discrimination: differential effects of automatic and voluntary cueing, Vis. Cognit. 13 (1) (2006) 29–50. [26] O. Hikosaka, S. Miyauchi, S. Shimojo, Focal visual attention produces illusory temporal order and motion sensation, Vis. Res. 33 (9) (1993) 1219–1240. [27] S. Suzuki, P. Cavanagh, Focused attention distorts visual space: an attentional repulsion effect, J. Exp. Psychol. Hum. Percept. Perform. 23 (2) (1997) 443. [28] K.A. Briand, Feature integration and spatial attention: more evidence of a dissociation between endogenous and exogenous orienting, J. Exp. Psychol. Hum. Percept. Perform. 24 (4) (1998) 1243. [29] Z.-L. Lu, B.A. Dosher, Spatial attention: different mechanisms for central and peripheral temporal precues? J. Exp. Psychol. Hum. Percept. Perform. 26 (5) (2000) 1534. [30] S. Ling, M. Carrasco, Sustained and transient covert attention enhance the signal via different contrast response functions, Vis. Res. 46 (8) (2006) 1210–1220. [31] A.K. Moorthy, A.C. Bovik, Visual quality assessment algorithms: what does the future hold? Multimed. Tools Appl. 51 (2) (2011) 675–696. [32] O. Le Meur, A. Ninassi, P. Le Callet, D. Barba, Do video coding impairments disturb the visual attention deployment? Signal Process. Image Commun. 25 (8) (2010) 597–609. [33] M.A. Peterson, Vision: top-down effects, in: Encyclopedia of Cognitive Science, 2003. [34] W. Einhäuser, U. Rutishauser, C.
Koch, Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli, J. Vis. 8 (2) (2008), [35] T. Ogawa, H. Komatsu, Neuronal dynamics of bottom-up and top-down processes in area V4 of macaque monkeys performing a visual search, Exp. Brain Res. 173 (1) (2006) 1–13. [36] K. Schill, E. Umkehrer, S. Beinlich, G. Krieger, C. Zetzsche, Scene analysis with saccadic eye movements: top-down and bottom-up modeling. J. Electron. Imaging 10 (1) (2001) 152–160, [37] U. Engelke, W. Zhang, P. Le Callet, H. Liu, Perceived interest versus overt visual attention in image quality assessment, in: IS&T/SPIE Electronic Imaging, vol. 9394, 2016. [38] J. Wang, D.M. Chandler, P. Le Callet, Quantifying the relationship between visual salience and visual importance, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2010, p. 75270K. [39] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 185–207. [40] C. Bundesen, S. Vangkilde, A. Petersen, Recent developments in a computational theory of visual attention (TVA), Vis. Res. 116 (2015) 210–218. [41] M. Pomplun, Saccadic selectivity in complex visual search displays, Vis. Res. 46 (12) (2006) 1886–1900. [42] K.A. Ehinger, B. Hidalgo-Sotelo, A. Torralba, A. Oliva, Modelling search for people in 900 scenes: a combined source model of eye guidance, Vis. Cognit. 17 (6-7) (2009) 945–978. [43] G.J. Zelinsky, A theory of eye movements during target acquisition, Psychol. Rev. 115 (4) (2008) 787.


[44] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, G.W. Cottrell, SUN: a Bayesian framework for saliency using natural statistics, J. Vis. 8 (7) (2008) 32. [45] A. Torralba, A. Oliva, M.S. Castelhano, J.M. Henderson, Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search, Psychol. Rev. 113 (4) (2006) 766. [46] V. Navalpakkam, L. Itti, Modeling the influence of task on attention, Vis. Res. 45 (2) (2005) 205–231. [47] N.J. Butko, J.R. Movellan, Optimal scanning for faster object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009, IEEE, New York, 2009, pp. 2751–2758. [48] S. Han, N. Vasconcelos, Biologically plausible saliency mechanisms improve feedforward object recognition, Vis. Res. 50 (22) (2010) 2295–2307. [49] M. Sodhi, B. Reimer, J. Cohen, E. Vastenburg, R. Kaars, S. Kirschenbaum, On-road driver eye movement tracking using head-mounted devices, in: Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, ACM, New York, 2002, pp. 61–68. [50] E. Chen, H. Guan, H. Yan, Z. Xu, Drivers’ Visual Behavior Under Various Traffic Conditions, pp. 1854–1864. [51] R.J. Peters, L. Itti, Beyond bottom-up: incorporating task-dependent influences into a computational model of spatial attention, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, New York, 2007, pp. 1–8. [52] Q. Cheng, D. Agrafiotis, A.M. Achim, D.R. Bull, Gaze location prediction for broadcast football video, IEEE Trans. Image Process. 22 (12) (2013) 4918–4929. [53] X. Hou, L. Zhang, Dynamic visual attention: searching for coding length increments, in: Advances in Neural Information Processing Systems, 2009, pp. 681–688. [54] Y. Li, Y. Zhou, J. Yan, Z. Niu, J. Yang, Visual saliency based on conditional entropy, in: Asian Conference on Computer Vision, Springer International Publishing, Switzerland, 2009, pp. 246–257. [55] N. Bruce, J. 
Tsotsos, Saliency based on information maximization, in: Advances in Neural Information Processing Systems, 2005, pp. 155–162. [56] W. Wang, C. Chen, Y. Wang, T. Jiang, F. Fang, Y. Yao, Simulating human saccadic scanpaths on natural images, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, 2011, pp. 441–448. [57] M. Jiang, X. Boix, G. Roig, J. Xu, L. Van Gool, Q. Zhao, Learning to predict sequences of human visual fixations, IEEE Trans. Neural Netw. Learn. Syst. 27 (6) (2016) 1241–1252. [58] E. Gu, J. Wang, N.I. Badler, Generating sequence of eye fixations using decision-theoretic attention model, in: International Workshop on Attention in Cognitive Systems, Springer, Berlin, Heidelberg, 2007, pp. 277–292. [59] L. Itti, P. Baldi, A principled approach to detecting surprising events in video, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, IEEE, New York, 2005, pp. 631–637. [60] T.A. Sørensen, S. Vangkilde, C. Bundesen, Components of attention modulated by temporal expectation, J. Exp. Psychol. Learn. Memory Cognit. 41 (1) (2015) 178. [61] S. Vangkilde, J.T. Coull, C. Bundesen, Great expectations: temporal expectation modulates perceptual processing speed, J. Exp. Psychol. Hum. Percept. Perform. 38 (5) (2012) 1183. [62] A.C. Kia Nobre, S. Kastner, Time for the fourth dimension in attention, in: A.C. Kia Nobre, G. Rohenkohl (Eds.), The Oxford Handbook of Attention, ISBN 9780199675111, Available from: 001.0001/oxfordhb-9780199675111-e-036.




[63] S. Vangkilde, A. Petersen, C. Bundesen, Temporal expectancy in the context of a theory of visual attention, Philos. Trans. R. Soc. Lond. B: Biol. Sci. 368 (1628) (2013) 20130054. [64] V. Mahadevan, N. Vasconcelos, Spatiotemporal saliency in dynamic scenes, IEEE Trans. Pattern Anal. Mach. Intell. 32 (1) (2010) 171–177. [65] S. Marat, T.H. Phuoc, L. Granjon, N. Guyader, D. Pellerin, A. Guerin-Dugue, Modelling spatio-temporal saliency to predict gaze direction for short videos, Int. J. Comput. Vis. 82 (3) (2009) 231–243. [66] C. Guo, L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Trans. Image Process. 19 (1) (2010) 185–198. [67] L. Itti, N. Dhavale, F. Pighin, Realistic avatar eye and head animation using a neurobiological model of visual attention, in: Optical Science and Technology, SPIE’s 48th annual Meeting, International Society for Optics and Photonics, 2004, pp. 64–78. [68] Y. Rai, A. Aldahdooh, S. Ling, M. Barkowsky, P. Le Callet, Effect of content features on short-term video quality in the visual periphery, in: IEEE International Workshop on Multimedia Signal Processing, 2016. MMSP ’16, 2016, pp. 1–6. [69] O. Boiman, M. Irani, Detecting irregularities in images and in video, Int. J. Comput. Vis. 74 (1) (2007) 17–31. [70] J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: Advances in Neural Information Processing Systems, 2006, pp. 545–552. [71] C. Yang, L. Zhang, H. Lu, X. Ruan, M.-H. Yang, Saliency detection via graph-based manifold ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173. [72] I. Hwang, S.H. Lee, J.S. Park, N.I. Cho, Saliency detection based on seed propagation in a multilayer graph, Multimed. Tools Appl. (2016) 1–19. [73] J. Zhang, K.A. Ehinger, J. Ding, J. 
Yang, A prior-based graph for salient object detection, in: 2014 IEEE International Conference on Image Processing (ICIP), IEEE, New York, 2014, pp. 1175–1178. [74] L. Lovász, Random walks on graphs: a survey, 1993. [75] J.-G. Yu, J. Zhao, J. Tian, Y. Tan, Maximal entropy random walk for region-based visual saliency, IEEE Trans. Cybern. 44 (9) (2014) 1661–1672. [76] W. Wang, Y. Wang, Q. Huang, W. Gao, Measuring visual saliency by site entropy rate, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, 2010, pp. 2368–2375. [77] V. Gopalakrishnan, Y. Hu, D. Rajan, Random walks on graphs for salient object detection in images, IEEE Trans. Image Process. 19 (12) (2010) 3232–3242. [78] R. Kennedy, J. Gallier, J. Shi, Contour cut: identifying salient contours in images by solving a Hermitian eigenvalue problem, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, 2011, pp. 2065–2072. [79] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, Automatic salient object segmentation based on context and shape prior, in: J. Hoey, S. McKenna, E. Trucco (Eds.), Proceedings of the British Machine Vision Conference, BMVA Press, UK, 2011, pp. 110.1–110.12. ISBN 1-901725-43-X. [80] Q. Zhu, G. Song, J. Shi, Untangling cycles for contour grouping, in: 2007 IEEE 11th International Conference on Computer Vision, IEEE, New York, 2007, pp. 1–8.


[81] T. Chuk, A.B. Chan, J.H. Hsiao, Understanding eye movements in face recognition using hidden Markov models, J. Vis. 14 (11) (2014) 8. [82] E. Coviello, G.R. Lanckriet, A.B. Chan, The variational hierarchical EM algorithm for clustering hidden Markov models, in: Advances in Neural Information Processing Systems, 2012, pp. 404–412. [83] S. Roy, S. Das, Saliency detection in images using graph-based rarity, spatial compactness and background prior, in: 2014 International Conference on Computer Vision Theory and Applications (VISAPP), vol. 1, IEEE, New York, 2014, pp. 523–530. [84] H.R. Tavakoli, E. Rahtu, J. Heikkilä, Stochastic bottom-up fixation prediction and saccade generation, Image Vis. Comput. 31 (9) (2013) 686–693. [85] H. Liu, D. Xu, Q. Huang, W. Li, M. Xu, S. Lin, Semantically-based human scanpath estimation with HMMs, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3232–3239. [86] O. Le Meur, Z. Liu, Saccadic model of eye movements for free-viewing condition, Vis. Res. 116, Part B (2015) 152–164 Computational Models of Visual Attention. [87] C.A. Curcio, K.R. Sloan, R.E. Kalina, A.E. Hendrickson, Human photoreceptor topography, J. Comp. Neurol. 292 (4) (1990) 497–523. [88] N. Drasdo, C. Fowler, Non-linear projection of the retinal image in a wide-angle schematic eye, Br. J. Ophthalmol. 58 (8) (1974) 709. [89] T.J. Smith, P.K. Mital, Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes, J. Vis. 13 (8) (2013) 16. [90] H. Alers, L. Bos, I. Heynderickx, How the task of evaluating image quality influences viewing behavior, in: 2011 Third International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, New York, 2011, pp. 167–172. [91] A. Ninassi, O. Le Meur, P. Le Callet, D. Barba, A. Tirel, Task impact on the visual attention in subjective image quality assessment, in: 2006 14th European Signal Processing Conference, IEEE, New York, 2006, pp. 1–5. [92] T. Judd, F.
Durand, A. Torralba, Fixations on low-resolution images, J. Vis. 11 (4) (2011) 14. [93] J. Yuen, B. Russell, C. Liu, A. Torralba, Labelme video: building a video database with human annotations, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, New York, 2009, pp. 1451–1458. [94] M. Cerf, J. Harel, W. Einhäuser, C. Koch, Predicting human gaze using low-level saliency combined with face detection, in: Advances in Neural Information Processing Systems, 2008, pp. 241–248. [95] S.O. Gilani, R. Subramanian, Y. Yan, D. Melcher, N. Sebe, S. Winkler, PET: an eye-tracking dataset for animal-centric Pascal object classes, in: 2015 IEEE International Conference on Multimedia and Expo (ICME), IEEE, New York, 2015, pp. 1–6. [96] Z. Bylinskii, P. Isola, C. Bainbridge, A. Torralba, A. Oliva, Intrinsic and extrinsic effects on image memorability, Vis. Res. 116 (2015) 165–178. [97] H. Alers, J. Redi, H. Liu, I. Heynderickx, Studying the effect of optimizing image quality in salient regions at the expense of background content, J. Electron. Imaging 22 (4) (2013) 043012. [98] T. Judd, F. Durand, A. Torralba, A Benchmark of Computational Models of Saliency to Predict Human Fixations, 2012. [99] A. Borji, L. Itti, Cat2000: A Large Scale Fixation Dataset for Boosting Saliency Research, 2015, arXiv preprint arXiv:1505.03581.




[100] A. Coutrot, N. Guyader, How saliency, faces, and sound influence gaze in dynamic social scenes, J. Vis. 14 (8) (2014) 5. [101] C. Shen, Q. Zhao, Webpage saliency, in: European Conference on Computer Vision, Springer, Cham, Switzerland, 2014, pp. 33–46. [102] K. Koehler, F. Guo, S. Zhang, M.P. Eckstein, What do saliency models predict? J. Vis. 14 (3) (2014) 14. [103] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, H.-Y. Shum, Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 353–367. [104] L. Itti, R. Carmi, Eye-tracking data from human volunteers watching complex video stimuli. 2009, [105] Y. Fang, J. Wang, J. Li, R. Pepion, P. Le Callet, An eye tracking database for stereoscopic video, in: 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX), IEEE, New York, 2014, pp. 51–52. [106] J. Wang, M.P. Da Silva, P. Le Callet, V. Ricordel, Computational model of stereoscopic 3D visual saliency, IEEE Trans. Image Process. 22 (6) (2013) 2151–2165. [107] P.K. Mital, T.J. Smith, R.L. Hill, J.M. Henderson, Clustering of gaze during dynamic scene viewing is predicted by motion, Cognit. Comput. 3 (1) (2011) 5–24. [108] M. Dorr, T. Martinetz, K.R. Gegenfurtner, E. Barth, Variability of eye movements when viewing dynamic natural scenes, J. Vis. 10 (10) (2010) 28. [109] N. Riche, M. Mancas, D. Culibrk, V. Crnojevic, B. Gosselin, T. Dutoit, Dynamic saliency models and human attention: a comparative study on videos, in: Asian Conference on Computer Vision, Springer International Publishing, Switzerland, 2012, pp. 586–598. [110] U. Engelke, A. Maeder, H.-J. Zepernick, Visual attention modelling for subjective image quality databases, in: IEEE International Workshop on Multimedia Signal Processing, 2009. MMSP’09, IEEE, New York, 2009, pp. 1–6. [111] Y. Gitman, M. Erofeev, D. Vatolin, B. 
Andrey, Semiautomatic visual-attention modeling and its application to video compression, in: 2014 IEEE International Conference on Image Processing (ICIP), IEEE, New York, 2014, pp. 1105–1109. [112] J. Xu, M. Jiang, S. Wang, M.S. Kankanhalli, Q. Zhao, Predicting human gaze beyond pixels, J. Vis. 14 (1) (2014) 28. [113] U. Engelke, M. Barkowsky, P. Le Callet, H.-J. Zepernick, Modelling saliency awareness for objective video quality assessment, in: International Workshop on Quality of Multimedia Experience (QoMEX), 2010. [114] M. Narwaria, M.P. Da Silva, P. Le Callet, R. Pepion, Effect of tone mapping operators on visual attention deployment, in: SPIE Optical Engineering + Applications, International Society for Optics and Photonics, 2012, p. 84990G. [115] T. Vigier, J. Rousseau, M.P.D. Silva, P.L. Callet, A new HD and UHD video eye tracking dataset, in: MMSYS, 2016. [116] I. Van Der Linde, U. Rajashekar, A.C. Bovik, L.K. Cormack, DOVES: a database of visual eye movements, Spat. Vis. 22 (2) (2009) 161–177. [117] T. Falck-Ytter, S. Bölte, G. Gredebäck, Eye tracking in early autism research, J. Neurodev. Disord. 5 (1) (2013) 1. [118] R.J. Peters, A. Iyer, L. Itti, C. Koch, Components of bottom-up gaze allocation in natural images, Vis. Res. 45 (18) (2005) 2397–2416. [119] R. Ramloll, C. Trepagnier, M. Sebrechts, J. Beedasy, Gaze data visualization tools: opportunities and challenges, in: Eighth International Conference on Information Visualisation, 2004. IV 2004. Proceedings, IEEE, New York, 2004, pp. 173–180.


[120] D.D. Salvucci, J.H. Goldberg, Identifying fixations and saccades in eye-tracking protocols, in: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, ACM, New York, 2000, pp. 71–78. [121] Y. Rai, M. Barkowsky, P. Le Callet, Does H.265 based peri and para-foveal quality flicker disrupt natural viewing patterns? in: 2015 International Conference on Systems, Signals and Image Processing (IWSSIP), 2015, pp. 133–136. [122] F. Shic, K. Chawarska, B. Scassellati, The amorphous fixation measure revisited: with applications to autism, in: 30th Annual Meeting of the Cognitive Science Society, 2008. [123] H. Jarodzka, K. Holmqvist, M. Nyström, A vector-based, multidimensional scanpath similarity measure, in: Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, ACM, New York, 2010, pp. 211–218. [124] Y. Rai, M. Barkowsky, P. Le Callet, Role of spatio-temporal distortions in the visual periphery in disrupting natural attention deployment, in: IS&T/SPIE Electronic Imaging, vol. 9394, 2015, p. 93941H. [125] Y. Rai, P. Le Callet, Do gaze disruptions indicate the perceived quality of nonuniformly coded natural scenes? in: IS&T/SPIE Electronic Imaging, 2017. [126] H.A. Trukenbrod, R. Engbert, ICAT: A computational model for the adaptive control of fixation durations, Psychon. Bull. Rev. 21 (4) (2014) 907–934. [127] P. Le Callet, S. Pechard, S. Tourancheau, A. Ninassi, D. Barba, Towards the next generation of video and image quality metrics: impact of display, resolution, contents and visual attention in subjective assessment, in: Second International Workshop on Image Media Quality and its Applications, IMQA2007, 2007, p. A2. [128] J. Radun, T. Leisti, T. Virtanen, G. Nyman, J. Hakkinen, Why is quality estimation judgment fast? Comparison of gaze control strategies in quality and difference estimation tasks, J. Electron. Imaging 23 (6) (2014) 061103. [129] O. Le Meur, T.
Baccino, Methods for comparing scanpaths and saliency maps: strengths and weaknesses, Behav. Res. Methods 45 (1) (2013) 251–266. [130] T. Foulsham, Saliency and eye movements in the perception of natural scenes, Ph.D. dissertation, University of Nottingham, 2008. [131] Q. Yang, M. Bucci, Z. Kapoula, The latency of saccades, vergence, and combined eye movements in children and in adults, Invest. Ophthalmol. Vis. Sci. 43 (9) (2002) 2939–2949. [132] M.C. Farias, W.Y. Akamine, On performance of image quality metrics enhanced with visual attention computational models, Electron. Lett. 48 (11) (2012) 631–633. [133] H. Liu, I. Heynderickx, Studying the added value of visual attention in objective image quality metrics based on eye movement data, in: 2009 16th IEEE International Conference on Image Processing (ICIP), IEEE, New York, 2009, pp. 3097–3100. [134] W.Y. Akamine, M.C. Farias, Incorporating visual attention models into video quality metrics, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2014, p. 90160O. [135] B. Khurana, E. Kowler, Shared attentional control of smooth eye movement and perception, Vis. Res. 27 (9) (1987) 1603–1618. [136] H. Daimoto, T. Takahashi, K. Fujimoto, H. Takahashi, M. Kurosu, A. Yagi, Effects of a dual-task tracking on eye fixation related potentials (EFRP), in: Human-Computer Interaction. HCI Intelligent Multimodal Interaction Environments, Springer, Berlin, Heidelberg, 2007, pp. 599–604.




[137] Y. Feng, G. Cheung, P. Le Callet, Y. Ji, Video attention deviation estimation using inter-frame visual saliency map analysis, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2012, p. 83050H. [138] M.G. Albanesi, R. Amadeo, A new algorithm for objective video quality assessment on Eye Tracking data, in: 2014 International Conference on Computer Vision Theory and Applications (VISAPP), vol. 1, IEEE, New York, 2014, pp. 462–469. [139] N. Mackworth, A. Morandi, The gaze selects informative details within pictures. Percept. Psychophys. 2 (11) (1967) 547–552, [140] M.S. Gide, S.F. Dodge, L.J. Karam, The Effect of Distortions on the Prediction of Visual Attention, 2016, arXiv preprint arXiv:1604.03882. [141] W.S. Geisler, J.S. Perry, Real-time foveated multiresolution system for low-bandwidth video communication, in: Photonics West’98 Electronic Imaging, International Society for Optics and Photonics, 1998, pp. 294–305. [142] Y. Feng, G. Cheung, W.-t. Tan, P. Le Callet, Y. Ji, Low-cost eye gaze prediction system for interactive networked video streaming, IEEE Trans. Multimedia 15 (8) (2013) 1865–1879. [143] F. Boulos, W. Chen, B. Parrein, P.L. Callet, A new H.264/AVC error resilience model based on regions of interest, in: 2009 17th International Packet Video Workshop, 2009, pp. 1–9. [144] P. Le Callet, E. Niebur, Visual attention and applications in multimedia technologies, Proc. IEEE 101 (9) (2013) 2058–2067. [145] V. Pejovic, E.M. Belding, A context-aware approach to wireless transmission adaptation, in: 2011 8th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), IEEE, New York, 2011, pp. 592–600. [146] S. Avidan, A. Shamir, Seam carving for content-aware image resizing, in: ACM Transactions on Graphics (TOG), vol. 26, ACM, New York, 2007, p. 10. [147] O. Le Meur, X. Castellan, P. Le Callet, D. 
Barba, Efficient saliency-based repurposing method, in: 2006 International Conference on Image Processing, IEEE, New York, 2006, pp. 421–424. [148] M. Fried, E. Tsitsiashvili, Y.S. Bonneh, A. Sterkin, T. Wygnanski-Jaffe, T. Epstein, U. Polat, ADHD subjects fail to suppress eye blinks and microsaccades while anticipating visual stimuli but recover with medication, Vis. Res. 101 (2014) 62–72. [149] M.C. Bradley, D. Hanna, P. Wilson, G. Scott, P. Quinn, K.F. Dyer, Obsessive-compulsive symptoms and attentional bias: an eye-tracking methodology, J. Behav. Ther. Exp. Psychiatry 50 (2016) 303–308. [150] D.L. Levy, A.B. Sereno, D.C. Gooding, G.A. O’Driscoll, Eye tracking dysfunction in schizophrenia: characterization and pathophysiology, in: Behavioral Neurobiology of Schizophrenia and Its Treatment, Springer, Berlin, Heidelberg, 2010, pp. 311–347. [151] S. Marx, G. Respondek, M. Stamelou, S. Dowiasch, J. Stoll, F. Bremmer, W.H. Oertel, G.U. Höglinger, W. Einhäuser, Validation of mobile eye-tracking as novel and efficient means for differentiating progressive supranuclear palsy from Parkinson’s disease, Front. Behav. Neurosci. 6 (2012) 88. [152] N.K. Archibald, S.B. Hutton, M.P. Clarke, U.P. Mosimann, D.J. Burn, Visual exploration in Parkinson’s disease and Parkinson’s disease dementia, Brain 136 (3) (2013) 739–750.


[153] J.T. Hutton, J. Nagel, R.B. Loewenson, Eye tracking dysfunction in Alzheimer-type dementia, Neurology 34 (1) (1984) 99. [154] H.Y. Tsang, M. Tory, C. Swindells, eSeeTrack: visualizing sequential fixation patterns, IEEE Trans. Vis. Comput. Graph. 16 (6) (2010) 953–962. [155] E.A. Krupinsky, On the development of expertise in interpreting medical images, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2012, p. 82910R. [156] F. Hermens, R. Flin, I. Ahmed, Eye movements in surgery: a literature review, J. Eye Movement Res. 6 (4) (2013). [157] T. Haidegger, J. Sándor, Z. Benyó, Surgery in space: the future of robotic telesurgery, Surg. Endosc. 25 (3) (2011) 681–690. [158] M. Stark, E.R. Morales, S. Gidaro, Telesurgery is promising but still need proof through prospective comparative studies, J. Gynecol. Oncol. 23 (2) (2012) 134–135. [159] Q. Huynh-Thu, M. Barkowsky, P. Le Callet, The importance of visual attention in improving the 3D-TV viewing experience: overview and new perspectives, IEEE Trans. Broadcast. 57 (2) (2011) 421–431. [160] S. Pastoor, Human factors of 3D displays in advanced image communications, Displays 14 (3) (1993) 150–157. [161] M. Wöpking, Viewing comfort with stereoscopic pictures: an experimental study on the subjective effects of disparity magnitude and depth of focus, J. Soc. Inform. Display 3 (3) (1995) 101–103. [162] J.L. Semmlow, D. Heerema, The role of accommodative convergence at the limits of fusional vergence, Invest. Ophthalmol. Vis. Sci. 18 (9) (1979) 970–976. [163] C. Chamaret, S. Godeffroy, P. Lopez, O. Le Meur, Adaptive 3D rendering based on region-of-interest, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2010, p. 75240V. [164] R. Scobey, J. Horowitz, Detection of image displacement by phasic cells in peripheral visual fields of the monkey, Vis. Res. 16 (1) (1976) 15–24. [165] T. Hansen, L. Pracejus, K.R.
Gegenfurtner, Color perception in the intermediate periphery of the visual field, J. Vis. 9 (4) (2009) 26. [166] V. Virsu, J. Rovamo, P. Laurinen, R. N€as€anen, Temporal contrast sensitivity and cortical magnification, Vis. Res. 22 (9) (1982) 1211–1217. [167] R. Snowden, R. Hess, Temporal frequency filters in the human peripheral visual field, Vis. Res. 32 (1) (1992) 61–72. [168] S. Hillaire, A. Lecuyer, R. Cozot, G. Casiez, Using an eye-tracking system to improve camera motions and depth-of-field blur effects in virtual environments, in: 2008 IEEE Virtual Reality Conference, IEEE, New York, 2008, pp. 47–50.



Emerging science of QoE in multimedia applications: Concepts, experimental guidelines, and validation of models


Lukáš Krasula, Patrick Le Callet
University of Nantes, Nantes, France

This chapter addresses the concept of Quality of Experience (QoE) in multimedia applications. Throughout the chapter, QoE should mostly be understood in this applied context, even where the term multimedia is not mentioned explicitly. After recalling the concept of QoE, the chapter provides guidelines on (1) how to design, conduct, and interpret QoE experiments with humans and (2) how to use the results of such experiments to evaluate the performance of objective QoE estimators. While QoE can be measured in many ways (questionnaires, electrophysiology, attention deployment, etc.), this chapter focuses mainly on “survey based” experiments. In such a test procedure, users provide their task-related feedback in the form of votes on a predefined series of stimuli, presented to them according to a specific procedure. This type of test can be used in laboratory conditions as well as in the users’ home environment. In addition to the definition of QoE in the context of multimedia applications, the main factors influencing QoE are first summarized. Furthermore, the possibilities of involving particular aspects of QoE in the subjective tests are discussed. The most popular subjective procedures are introduced in detail, together with their advantages and weaknesses. Thorough instructions for processing the outcomes of the particular procedures are provided, as well as advanced methodologies for determining the significance of the influence of specific dimensions on the measured overall QoE. Finally, the possibilities of utilizing subjective data for evaluating the performance of objective QoE estimators are introduced in detail. Such methods are useful for comparison, training, and validation of the estimators. All concepts are described mathematically, given as pseudocode, or provided in independent scripts.

Academic Press Library in Signal Processing, Volume 6. © 2018 Elsevier Ltd. All rights reserved.



CHAPTER 4 Emerging science of QoE in multimedia applications

4.1 QoE DEFINITION AND INFLUENCING FACTORS
The notion of quality in communications has, for many years, been connected to the term Quality of Service (QoS). ITU-T Recommendation E.800 from 2008 [1] defines QoS as “the totality of characteristics of a telecommunications service that bear on its ability to satisfy stated and implied needs of the user of the service.” The quality is therefore expressed by the physical properties of the service itself. The same recommendation [1] also introduces the term QoS experienced (QoSE) or QoS perceived (QoSP) by customer/user as “a statement expressing the level of quality that customers/users believe they have experienced,” which indicates a shift of attention toward the user of the service. This “user centered” orientation is emphasized by the concept commonly known as QoE. Even though the term has been increasingly used since the late 1990s, the commonly accepted definition covering all aspects of QoE came from the joint effort under the framework of COST Action IC1003 Qualinet and its White Paper on Definitions of QoE in 2012 [2]. The working definition states that QoE “is the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user’s personality and current state.” The definition clearly extends the scope compared to QoS and QoSE/QoSP, since the concept is no longer tied to telecommunication services only and covers a much broader spectrum of applications.

The overall QoE can be regarded as an interaction of multiple perceptual dimensions. As a simple example, consider the audiovisual quality of a video: the overall quality is defined by the interaction of the audio and video components. Emerging technologies bring even more dimensions into the overall experience. For 3DTV, Chen et al. [3] developed a pyramidal model of visual experience, depicted in Fig. 4.1. The visual experience is mainly composed of the perceived naturalness and the quality of depth rendering. These complex aspects can be further broken down into lower-level components, namely 2D image quality, depth quantity, and visual comfort.

Čadík et al. [4] studied the perception of quality in tone-mapping of high dynamic range (HDR) images. They identified perceived brightness, contrast, details, colorfulness, and artifacts as the main attributes influencing the resulting quality of a stimulus. The relevant perceptual dimensions or attributes therefore vary according to the specific application scenario. In addition, these perceptual attributes of QoE are affected by a number of factors, known as influencing factors, which are described below.

4.1.1 FACTORS INFLUENCING QoE
The above-mentioned Qualinet white paper [2] defines the QoE influence factor (IF) as “any characteristic of a user, system, service, application, or context whose actual state or setting may have influence on the Quality of Experience for the user.” IFs can be classified into three main categories:


FIG. 4.1 Model of visual experience in 3DTV [3]. Primary factors are on the bottom while more complex aspects are on the higher levels. Levels shown, from top to bottom: visual experience; naturalness and depth rendering; 2D image quality, depth quantity, and visual comfort.

• System IFs,
• Context IFs,
• Human IFs,

although it should be noted that these groups are not mutually exclusive and do interact with each other.

System influence factors
System IFs include the characteristics that have an impact on the technically produced quality. Note that in the QoS and QoSE concepts, only this group of factors is considered. They are closely related to (a) content, i.e., its spatial and temporal characteristics, color depth, dynamic range, texture, semantics, etc., (b) media configuration such as resolution, encoding, frame rate, etc., (c) properties connected to the transmission over a network, i.e., packet loss, bandwidth, delay, error rate, etc., and (d) properties of devices involved in the end-to-end delivery chain, including system and equipment specifications, device capabilities, and service provider specifications and capabilities [2]. The specific set of System IFs depends on the considered application scenario.

Context influence factors
Context IFs consider the user’s environment on multiple levels. The physical environment describes the user’s whereabouts, such as location and occupied space. Transitions between locations or movement within one location are also physical environmental aspects. The time of day when the service is used, as well as the frequency and duration of usage, are considered temporal factors. Another important aspect of the context is economic, since the cost of the service, type of subscription, etc. have a strong impact on the user’s expectations. Aspects related to the task and social context (e.g., presence or involvement of other people in the experience) belong to this group as well.

Human influence factors
The last class of IFs concerns the idiosyncrasies of particular users. It involves more stable factors, such as demographic and socio-economic background or physical and mental dispositions, as well as short-term factors, e.g., the emotional state of the user. Jumisko-Pyykkö et al. [5] argued that the impact of human IFs on perception is twofold: low level and high level. The low-level aspect consists of basic, sensory-based processing. Here, the stable IFs include perceptual acuity, age, or gender, while the momentary IFs may be represented by the user’s current mood, degree of attention, motivation, etc. High-level perception includes higher cognitive processes, judgments, and interpretations, which are determined by socio-economic situation, education, personal values, etc., as well as previous experience, current needs, expectations, or emotions.

As mentioned earlier, all of the above IFs have an impact on the overall QoE. In comparison to QoS, where all the aspects are technically oriented, measuring QoE requires much broader multidisciplinary and multimethodological procedures [2]. Approaches applicable toward this goal will be discussed in the following sections.

4.2 QoE MEASUREMENT
There are two main approaches toward measuring QoE: subjective experiments and objective estimation. The first approach employs a panel of human participants who are given a specific QoE-related task. The majority of such tests are “survey based,” i.e., the users evaluate a series of stimuli on a predefined scale with respect to a given research question. The results therefore directly represent their opinions, and the resulting data, when correctly processed, are very reliable. However, it has been discovered that, when given a task, e.g., to evaluate the quality of a stimulus, the subjects’ perceptual mechanisms differ from those in task-free exposure [6]. Such an effect may introduce a bias between the measured opinions and the hypothetical opinions that would be obtained in the absence of the task. This motivates the idea of replacing the “survey based” procedures by exposing the subjects to the stimuli without any particular task and measuring a different kind of response (e.g., physiological measures, gaze patterns, etc.). The measurement needs to be noninvasive, in order to avoid introducing another bias into the users’ typical conditions, and there needs to be a clear relationship between the measured entity and the QoE. An example of a potentially good proxy for QoE in the area of nonuniformly distorted video is the disruption of gaze patterns [7]. For more information about the concept, together with an exhaustive introduction to visual attention measurement and modeling, the reader is referred to Chapter 3: Visual attention, visual salience, perceived interest in multimedia applications. Nevertheless, despite the above-mentioned limitation, the “survey based” experiments remain the most popular and universal way to subjectively measure QoE and will, therefore, be the principal topic of this chapter.

Drawbacks of subjective QoE measurement include time and cost requirements, limitations on experiment size and duration, and the unsuitability for use in a feedback loop to control or optimize QoE. These issues are resolved by objective QoE estimators, which are algorithms (metrics, models, measures, indexes, etc.) evaluating QoE, or some of its aspects, automatically. Even though objective measurement overcomes the previously stated disadvantages of the subjective methods, it suffers from other problems. The objective criteria can mostly focus only on a particular aspect, or set of aspects, of QoE. The major issue is the limited reliability of the estimators, which need to be validated with respect to subjective data relevant to the respective application scenario. Guidelines on how to use subjective results for evaluating the performance of objective criteria will be provided in Section 4.3. An introduction to the objective measurement of quality and QoE exceeds the scope of this chapter, and interested readers are referred to, e.g., Ref. [8] for visual and Ref. [9] for speech quality criteria.

Measuring QoE subjectively requires designing experiments that cover the whole spectrum of the IFs described in Section 4.1. This then enables determining the degree of influence of particular IFs on the overall experience.
The dimensionality of QoE can be reduced by focusing on certain application scenarios only, i.e., by keeping certain IFs constant. However, when interpreting the results of such experiments, we should be aware of which IFs have been fixed, so as not to overgeneralize the conclusions. Two main approaches toward QoE measurement can be identified:

• In-lab experiments,
• Field experiments.

In-lab experiments take place in a controlled laboratory environment. The researcher can properly explain the experiment to the user and monitor his/her behavior during the test. In-lab experiments generally allow more conditions to be tested and can last much longer, since users come for the sole purpose of the testing and are thus more focused and more motivated to provide good results. It has been shown (e.g., in [10]) that subjects tend to be more critical in laboratory conditions. Furthermore, these tests allow for measuring additional variables such as visual attention or physiological responses (galvanic skin response, electroencephalogram, etc.), providing important insight into the user’s emotional state and other human IFs.




Field (or crowdsourcing) experiments, on the other hand, are mostly run in the user’s home environment or in another place relevant to the application scenario (e.g., cafe, bus stop, etc. [10]). The advantage is that a much broader spectrum of users can be reached. However, given the lack of control over the user’s behavior while performing the task, the experiments need to be much shorter and must be carefully prepared and interpreted. It is advisable to collect as much information as possible about the test environment, such as the type and properties of the device the user is using. It is also desirable to regularly check the attention of the user with so-called “honey pots,” trick questions that make sure he/she is sufficiently concentrated on the task and is not just mechanically providing random data. Thorough and robust data processing, including outlier detection, is absolutely necessary. Thorough guidelines for performing crowdsourcing experiments can be found in [11].

The following sections summarize how to consider the particular IFs in QoE experiments and describe commonly used methodologies for QoE measurement, including the procedures for data processing. The goal is to provide compact instructions for conducting experiments that lead to reliable and repeatable results.

4.2.1 INCLUDING SYSTEM INFLUENCE FACTORS IN QoE MEASUREMENT
System IFs of QoE are connected to the content, media, network, and devices (see the discussion of system influence factors above). Given that these factors are also included in QoSE, their measurement has been thoroughly studied, especially throughout the last two decades. A large portion of system IFs is fixed by the application scenario itself. The scenario may refer to the technology (e.g., 3DTV [12], HDR [13], etc.) as well as to the viewing conditions. In the “in lab experiments,” mostly one, exceptionally a few, viewing conditions (i.e., displaying device, viewing distance, ambient lighting, etc.) are considered. Field experiments are conducted in much less stable conditions, which can partially be determined (e.g., type and properties of the reproduction device, ability to reproduce a certain type of stimuli, etc. [11]) but cannot be fully monitored. The impact of viewing conditions is therefore not fully determinable in field experiments. Moreover, tests using emerging technologies are virtually impossible “in the field.”

The rest of the system IFs are defined by the dataset used for the testing. It is desirable to choose the source content so that it covers a wide variety of conditions, in terms of spatial and temporal characteristics, semantics, etc., depending on the target application. Another important dimension in dataset creation is selecting various media (different degrees of simulated distortions, resolutions, frame rates, etc.) and network (clean transmission scenario, different levels of packet loss, etc.) configurations according to the application. Considering the extent of the factor space, numerous datasets are being produced in order to provide substantial coverage of the possible conditions.


4.2.2 INCLUDING CONTEXT INFLUENCE FACTORS IN QoE MEASUREMENT
Context IFs bring more complications to the measurement of QoE. The earlier discussion of context influence factors divided them into physical, temporal, task-related, social, and economic. The first four classes can only be simulated in laboratory conditions. The standard test room can be modified to resemble, e.g., a typical living room or another application-related place. Field experiments concluded by a survey asking the subjects about their whereabouts are another option. Jumisko-Pyykkö and Hannuksela [10] chose a hybrid approach. They provided subjects with a special wearable device enabling them to perform the task themselves in three specific scenarios: cafe, bus, and train station. However, this approach is not accessible in all application scenarios.

Sackl and Schatz [14] attempted to include the economic aspect of QoE by treating the user not only as a passive observer but as a decision-making entity. They allowed users to select the content they wanted to watch from a larger pool and provided them with an amount of money they could spend to enhance the quality of the videos they were watching. In this way, the willingness to pay can be determined. Moreover, the subsequent quality assessment provides another interesting insight into the QoE.

Another possible approach is to provide the context to the user artificially as part of the test instructions, e.g., to introduce him/her to a scenario where he/she is subscribed to a certain level of service for a certain fee. This method is, however, dependent on the user’s ability to imagine himself/herself in the particular role. The hypothetical scenario should therefore not be very different from the subject’s actual situation. The ideal case is when the subjects can be prescreened and invited according to their situation, similarly to the use of personas in the User Experience (UX) community [15]. An alternative approach is to determine the user’s archetype after the test by a survey. This, however, does not ensure having the necessary number of different user types to draw any conclusions.

4.2.3 INCLUDING HUMAN INFLUENCE FACTORS IN QoE MEASUREMENT
Including certain human IFs in the measurement of QoE uses principles very similar to those discussed for context IFs. In order to incorporate demographic and socio-economic factors, subjects need to be screened and divided into groups. Considering the limited availability of group representatives in most of the panels participating in “in lab” experiments, “field” experiments provide much more flexibility and are able to reach a much broader spectrum of users. On the other hand, laboratory conditions open new possibilities, such as measuring subconscious responses that may provide valuable insight into the subject’s current emotional state, mood, level of attention, comfort, health situation, etc. Moreover, such measurements, when carefully performed, are very objective since the subject cannot willingly influence the results or make an error in evaluation. These measurements include, e.g., galvanic skin response (GSR), electroencephalography (EEG), electrocardiography (ECG), or electromyography (EMG). More information on performing physiological measurements for QoE purposes can be found, e.g., in [12].

A more direct, but also less objective, way of determining the emotional response is a questionnaire. A lot of effort has been put into finding the most reliable subject self-assessment method. An overview can be found, e.g., in [16]. The most popular solutions among researchers are probably the Positive and Negative Affect Schedule (PANAS) [17] and, especially, the Self-Assessment Manikin (SAM) [18, 19]. SAM offers a graphical explanation of the emotions, enabling simple evaluation of the particular aspects on a scale. An example of the SAM scale for valence and arousal is depicted in Fig. 4.2. A significant advantage of the self-assessment procedures is that they can be used both in the field and in the laboratory.

Some important aspects of QoE related to personal idiosyncrasies, in areas involving visual signals, can also be determined using eye-tracking experiments. The resulting visual attention information may provide insight into the user’s perceptual mechanisms, level of concentration and interest, etc. It should be noted that, similarly to physiological measurements, eye-tracking is only possible in the laboratory and puts subjects into slightly different conditions than they are used to, so the results should be interpreted with this in mind. Generally speaking, such experiments are immensely helpful in obtaining qualitative data and forming a hypothesis, while the quantitative hypothesis verification should be done in more typical viewing scenarios.

4.2.4 MULTIDIMENSIONAL PERCEPTUAL SCALES FOR QoE MEASUREMENT
The previous sections discussed the possibilities of including all the IFs of QoE in the experiments. Here, the specific procedures to subjectively measure the overall QoE, as well as its particular perceptual dimensions, and to determine the influences of individual factors will be described.

FIG. 4.2 Self-Assessment Manikin for valence (top row) and arousal (bottom row).


In any QoE experiment, each subject, influenced by the context and human IFs, provides his/her opinions regarding the quality of presented stimuli with various system IFs. However, as already discussed at the beginning of Section 4.1, the notion of experienced quality is mostly not single-dimensional. It is therefore common to divide the task of overall experience evaluation into the assessment of application-specific partial aspects (see Fig. 4.1 for an example of the QoE aspects in 3DTV). This can be beneficial for multiple reasons. First of all, it provides the subject with more concretely defined tasks, simplifying the otherwise considerably complex decision-making process. Moreover, information regarding subentities offers an interesting insight into the user’s perception and enables deeper studies of the factors’ interaction.

Most of the procedures described below are designed for measuring only one dimension (i.e., one entity) at a time. Nevertheless, they can be used multiple times to get the subjective opinions regarding all the desired entities. There are two ways to proceed. The first way is to let the user evaluate the stimulus on multiple scales at the same time. For example, after showing the 3D stimulus, ask for the subject’s opinion about visual comfort, depth quantity, and image quality, and only after all these aspects are evaluated proceed to another stimulus. This method is efficient with respect to the experiment duration; however, it is more demanding for the observer since he/she needs to focus on all the aspects at the same time, keep track of all the scales, etc. This is much simpler in the second way, where the experiment is repeated with a different measured entity each time. The user is able to fully focus on the aspect in question and thus provides more reliable results. Nevertheless, the time requirements are several times higher and, especially in the case of longer stimuli, experiments become too long and tiring due to the frequent exposure to the same stimuli. All of the evaluation methodologies that will be introduced can be used in both configurations.

Scales and scaling methods
Before introducing the different scaling procedures, it is necessary to define the different types of scales. Ferwerda [20] divides them into four categories: nominal, ordinal, interval, and ratio (see Table 4.1). The nominal scale contains categories that are represented by their names and are not necessarily ordered. An example of a nominal scale is sorting food into sweet, sour, bitter, and salty. The ordinal scale only provides the ranks of the particular stimuli. It is possible to say which of the stimuli ranks higher, but not what the distance between them is. In the interval scale, the distance can be determined, however only in relative terms. We know how far the stimuli are from each other, but we do not have a reference point to determine their absolute magnitudes. If such a point is available, the scale is a ratio scale. In that case, the magnitudes are absolutely comparable with each other.




Table 4.1 Different Types of Scales According to Ferwerda [20].

Scale Type    Properties
Nominal       Discrete categories, not necessarily ordered
Ordinal       Scores are ordered but not numerical
Interval      Numerical (the distance between scores is known) but there is no absolute reference point, i.e., relative scale
Ratio         Numerical with respect to the reference point, i.e., absolute scale

The procedures for subjective tests can be divided into direct and indirect scaling methods [20]. The advantage of direct scaling procedures is that the results from individual users are directly on the ratio scale (see Table 4.1), which is clearly defined by the particular methodology; this simplifies the subsequent processing and interpretation. On the other hand, indirect procedures typically provide higher discriminatory power and can be less complicated and tiring for the subjects. Several works have shown that indirect scaling methods need a lower number of subjects to provide the same reliability as direct scaling procedures [21, 22].

4.2.5 DIRECT SCALING METHODS
As mentioned previously, direct scaling methods collect the opinions of subjects regarding each particular stimulus directly on the ratio scale. They can be related to the magnitude estimation method, as introduced by Stevens [23]. The scale values depend on the selected procedure. Once results from all users are collected, processing including outlier detection and averaging is applied directly to the raw data. The final outcome of the experiments has the form of mean opinion scores (MOS) or differential mean opinion scores (DMOS) with respective confidence intervals calculated from the variance of the raw scores. MOS values represent the magnitude of each stimulus with respect to the measured entity, i.e., stimuli of high magnitude reach higher MOS values. In the case of DMOS, the magnitude of the reference stimulus is considered to be the highest possible, and the difference of the scores with respect to the reference scores is taken. Note that DMOS values can mostly be calculated from MOS and vice versa.

Given the higher demands on participants, it is recommended to conduct a training session (with content that does not appear in the test) where extreme cases are shown prior to the test itself. This helps participants calibrate their mental scale to the range within the test, thus providing the reference point of the absolute scale. The most popular opinion collection procedures, together with their respective MOS/DMOS scales, are described below.
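As a minimal sketch of the averaging step, the following Python fragment computes a MOS and a confidence interval from the variance of the raw scores. The normal-approximation interval (z = 1.96 for roughly 95% coverage), the function name `mos_with_ci`, and the vote values are assumptions made for illustration; the chapter's own processing may differ, and outlier detection is not covered here.

```python
import math
from statistics import mean, stdev

def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of the confidence interval) for one stimulus.

    Assumption: normal-approximation interval z * s / sqrt(N), where s is
    the sample standard deviation of the raw scores.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the variance")
    mos = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(n)
    return mos, half_width

# Hypothetical raw votes from eight participants on a five-point scale.
votes = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mos_with_ci(votes)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few participants, a Student t quantile in place of the fixed z value would give a wider and more cautious interval.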

Single Stimulus/Absolute Category Rating
Single Stimulus (SS) methodology [24], also known as Absolute Category Rating (ACR) [25], is the simplest subjective procedure. The stimuli are presented to the participants in a random order one at a time, followed by a mid-gray screen during the voting period. Depending on the research question, the reference can be explicitly shown to the subjects or not used at all. The participants are mostly asked to evaluate the measured entity on the five-grade scale:

5 Excellent
4 Good
3 Fair
2 Poor
1 Bad

The outcome of an experiment using this scale (after the results processing described in Section 4.2.6) is a set of MOS values ranging from 1 to 5. Since the method is the least time consuming, stimuli can also be displayed repeatedly to increase the discriminatory power and reliability. The timeline of the procedure can be seen in Fig. 4.3.

ACR with hidden reference
A special case of the procedure, with a so-called hidden reference (ACR-HR), can be used as well. Here, the reference stimuli are present within the set, but participants are not aware of this fact. This enables calculating the DMOS score for each stimulus with respect to the score of the reference as

$$\mathrm{DMOS}(A_i) = \mathrm{MOS}(A_i) - \mathrm{MOS}(A_{HR}) + 5,$$

where $\mathrm{DMOS}(A_i)$ and $\mathrm{MOS}(A_i)$ are the DMOS and MOS scores for a content $A$ under a condition $i$, respectively, and $\mathrm{MOS}(A_{HR})$ is the MOS score for the hidden reference of the content $A$. The higher the DMOS, the closer the quality of the stimulus is to the reference.
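The ACR-HR formula can be illustrated with a short Python sketch. The function name `dmos_acr_hr` and the vote values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

def dmos_acr_hr(votes_condition, votes_hidden_ref):
    """DMOS(A_i) = MOS(A_i) - MOS(A_HR) + 5 on the five-grade scale."""
    return mean(votes_condition) - mean(votes_hidden_ref) + 5

# Hypothetical votes for one content A.
votes_hidden_ref = [5, 4, 5, 4]  # hidden reference, MOS = 4.5
votes_condition = [3, 2, 3, 4]   # one processed condition i, MOS = 3.0
dmos = dmos_acr_hr(votes_condition, votes_hidden_ref)
print(dmos)  # 3.5
```

Note that the result can exceed 5 when a processed stimulus happens to be rated above its hidden reference.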







FIG. 4.9 PDFs for two stimuli on the entity scale (horizontal axis: entity scale; vertical axis: pdf).





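Since both variables are Gaussian, $\xi_i - \xi_j$ is also a Gaussian random variable with mean $\mu_{\xi_i-\xi_j}$ and standard deviation $\sigma_{\xi_i-\xi_j}$. Eq. (4.15) can, therefore, be expressed as

$$\Pr(\xi_i > \xi_j) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2_{\xi_i-\xi_j}}} \exp\left(-\frac{(x-\mu_{\xi_i-\xi_j})^2}{2\sigma^2_{\xi_i-\xi_j}}\right) dx = \int_{-\mu_{\xi_i-\xi_j}}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2_{\xi_i-\xi_j}}} \exp\left(-\frac{x^2}{2\sigma^2_{\xi_i-\xi_j}}\right) dx.$$

The symmetry of the Gaussian allows us to write

$$\Pr(\xi_i > \xi_j) = \int_{-\infty}^{\mu_{\xi_i-\xi_j}} \frac{1}{\sqrt{2\pi\sigma^2_{\xi_i-\xi_j}}} \exp\left(-\frac{x^2}{2\sigma^2_{\xi_i-\xi_j}}\right) dx = \int_{-\infty}^{\mu_{\xi_i-\xi_j}} \frac{1}{\sigma_{\xi_i-\xi_j}}\, \phi\left(\frac{x}{\sigma_{\xi_i-\xi_j}}\right) dx,$$

where $\phi$ denotes the standard normal PDF.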

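Therefore, the probability of selecting $A_i$ over $A_j$ can be expressed in terms of the standard normal CDF $\Phi$ as

$$\Pr(\xi_i > \xi_j) = \Phi\left(\frac{\mu_{\xi_i-\xi_j}}{\sigma_{\xi_i-\xi_j}}\right), \tag{4.18}$$

where $\Phi$ is defined as

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp\left(-\frac{t^2}{2}\right) dt.$$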
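As a quick numerical sanity check of Eq. (4.18), the following Python sketch compares the closed form $\Phi(\mu_{\xi_i-\xi_j}/\sigma_{\xi_i-\xi_j})$ with a Monte Carlo estimate. The means and standard deviations are hypothetical, and the two variables are assumed to be independent so that the variances simply add.

```python
import random
from statistics import NormalDist

# Hypothetical parameters of two independent Gaussian stimuli.
mu_i, sigma_i = 3.6, 0.8
mu_j, sigma_j = 3.0, 0.6

# Difference xi_i - xi_j: mean and standard deviation (independence assumed).
mu_d = mu_i - mu_j
sigma_d = (sigma_i**2 + sigma_j**2) ** 0.5

# Closed form: Pr(xi_i > xi_j) = Phi(mu_d / sigma_d).
p_closed = NormalDist().cdf(mu_d / sigma_d)

# Monte Carlo estimate with a fixed seed for repeatability.
rng = random.Random(0)
n = 200_000
hits = sum(rng.gauss(mu_i, sigma_i) > rng.gauss(mu_j, sigma_j) for _ in range(n))
p_mc = hits / n

print(round(p_closed, 4), round(p_mc, 4))
```

The two printed values should agree to within the sampling error of the simulation.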





Eq. (4.18) can be inverted in order to get the mean difference

$$\mu_{\xi_i-\xi_j} = \sigma_{\xi_i-\xi_j}\, \Phi^{-1}\left(\Pr(\xi_i > \xi_j)\right),$$

where $\Phi^{-1}$ is the inverse standard normal CDF, also known as the probit. According to Thurstone, the probability $\Pr(\xi_i > \xi_j)$ can be estimated from the number of times the stimulus $A_i$ was chosen over $A_j$ during the test ($C_{\xi_i>\xi_j}$), thus

$$\Pr(\xi_i > \xi_j) = \frac{C_{\xi_i>\xi_j}}{C_{\xi_i>\xi_j} + C_{\xi_j>\xi_i}},$$

where $C_{\xi_j>\xi_i}$ represents the number of cases where participants selected stimulus $A_j$ over $A_i$. The estimate of the mean difference can therefore be obtained as

$$\hat{\mu}_{\xi_i-\xi_j} = \sigma_{\xi_i-\xi_j}\, \Phi^{-1}\left(\frac{C_{\xi_i>\xi_j}}{C_{\xi_i>\xi_j} + C_{\xi_j>\xi_i}}\right),$$

which is known as Thurstone’s Law of Comparative Judgment. It can be seen that the value of $\sigma_{\xi_i-\xi_j}$ cannot be obtained from the PC test. To enable practical computations, Thurstone proposed several simplifications. In the vast majority of applications, the Case V model is used. It assumes that the variables $\xi_i$ and $\xi_j$ are uncorrelated ($\rho_{\xi_i\xi_j} = 0$) and that their standard deviations are equal, i.e., $\sigma_{\xi_i} = \sigma_{\xi_j}$. Since the variance $\sigma^2_{\xi_i-\xi_j} = \sigma^2_{\xi_i} + \sigma^2_{\xi_j} - 2\rho_{\xi_i\xi_j}\sigma_{\xi_i}\sigma_{\xi_j}$, it is convenient to assume (without loss of generality) that $\sigma^2_{\xi_i} = \sigma^2_{\xi_j} = \frac{1}{2}$, so that $\sigma_{\xi_i-\xi_j} = 1$. The estimate is then computed as

$$\hat{\mu}_{\xi_i-\xi_j} = \Phi^{-1}\left(\frac{C_{\xi_i>\xi_j}}{C_{\xi_i>\xi_j} + C_{\xi_j>\xi_i}}\right).$$
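The Case V estimate can be sketched in Python as follows. The function name `case_v_differences`, the count matrix, and in particular the clipping of proportions equal to 0 or 1 are assumptions added for illustration: the probit is infinite at those values, and the text does not prescribe how to handle unanimous votes.

```python
from statistics import NormalDist

def case_v_differences(counts):
    """Estimate mean differences D[i][j] = Phi^{-1}(C_ij / (C_ij + C_ji)).

    counts[i][j] is the number of times stimulus i was chosen over j.
    Assumption: proportions of exactly 0 or 1 are clipped to avoid infinite
    probit values; this is a pragmatic choice, not part of the model.
    """
    probit = NormalDist().inv_cdf
    a = len(counts)
    d = [[0.0] * a for _ in range(a)]
    for i in range(a):
        for j in range(a):
            total = counts[i][j] + counts[j][i]
            if i == j or total == 0:
                continue  # no comparison data: leave the entry at 0
            p = counts[i][j] / total
            p = min(max(p, 0.5 / total), 1 - 0.5 / total)
            d[i][j] = probit(p)
    return d

# Hypothetical counts for three stimuli, 20 comparisons per pair.
counts = [[0, 14, 18],
          [6, 0, 13],
          [2, 7, 0]]
d = case_v_differences(counts)
print([[round(v, 3) for v in row] for row in d])
```

Because the probit is an odd function around 0.5, the resulting matrix is antisymmetric, i.e., `d[i][j] == -d[j][i]` up to rounding.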



To obtain the final values on the interval scale, the Least Square Method is used [39]. When a being the number of stimuli, a vector of the scores μ ¼ ½μξ1 ,μξ2 , …,μξa  can be defined. An a  a matrix of the estimates of the mean differences D with elements Dði, jÞ ¼ μ ^ ξi ξj can then be used for optimization: μ ^ ¼ arg min μ2a

X i, j

ðDi, j  ðμξi  μξj ÞÞ2 :


This problem has a simple closed-form solution. When $\mu_{\xi_1}$ is set to 0, the individual scores are obtained as

$$\hat{\mu}_{\xi_j} = \sum_{i=1}^{a} \frac{D_{i,1}}{a} - \sum_{i=1}^{a} \frac{D_{i,j}}{a}. \tag{4.25}$$
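The closed-form solution amounts to a difference of column means of $D$. A minimal sketch, assuming $D$ is given as a list of rows with $D[i][j]$ estimating $\mu_{\xi_i} - \mu_{\xi_j}$ (the function name `scale_scores` is hypothetical):

```python
def scale_scores(D):
    """Least-squares scale values from a matrix of pairwise mean differences.

    D[i][j] estimates mu_i - mu_j. The first stimulus is the reference,
    so the returned score of stimulus 0 is 0.
    """
    a = len(D)
    # Mean of each column j over the rows i.
    col_mean = [sum(D[i][j] for i in range(a)) / a for j in range(a)]
    # Score of stimulus j: mean of column 0 minus mean of column j.
    return [col_mean[0] - col_mean[j] for j in range(a)]
```

With noise-free differences built from the scores (0, 1, 3), the function returns those scores again, which is a quick sanity check of the sign convention.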



Bradley-Terry-Luce model

A similar comparative judgment model has been developed by Bradley and Terry [40–42]. It is usually called the Bradley-Terry-Luce (BTL) model after Luce, who extended the model to a multiple-choice scenario [43]. Here, the probability of choosing stimulus $A_i$ over $A_j$ is expressed as

$$\Pr(\xi_i > \xi_j) = \Pr(\xi_i - \xi_j > 0) = \frac{\pi_{\xi_i}}{\pi_{\xi_i} + \pi_{\xi_j}},$$


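where $\pi_{\xi_i} = \exp(\mu_{\xi_i}/s)$ with $s$ being a scale parameter. Therefore, the previous equation can be expressed as

$$\Pr(\xi_i > \xi_j) = \frac{\exp(\mu_{\xi_i}/s)}{\exp(\mu_{\xi_i}/s) + \exp(\mu_{\xi_j}/s)} = \frac{1}{2} + \frac{1}{2}\tanh\!\left(\frac{\mu_{\xi_i} - \mu_{\xi_j}}{2s}\right).$$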


From this expression, it follows that the variable $\xi_i - \xi_j$ is a logistic variable with mean $\mu_{\xi_i} - \mu_{\xi_j}$ and scale parameter $s$. This is the main difference from the TM model, where this variable is considered to be Gaussian. It has been proven by Block and Marschak [44] that, in order to fulfill the model's requirements, the variables $\xi_i$ and $\xi_j$ must follow a Gumbel distribution. Following the same steps as for the TM model, the estimate of the magnitude differences can be obtained as

$$\hat{\mu}_{\xi_i\xi_j} = s\left(\ln\frac{C_{\xi_i>\xi_j}}{C_{\xi_i>\xi_j} + C_{\xi_j>\xi_i}} - \ln\!\left(1 - \frac{C_{\xi_i>\xi_j}}{C_{\xi_i>\xi_j} + C_{\xi_j>\xi_i}}\right)\right).$$
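The BTL estimate is the logit of the observed proportion, and the tanh form of the choice probability is its inverse. The sketch below illustrates both directions; the function names and the default scale `s = 1.0` are assumptions, and no correction is applied for unanimous votes (where the logit diverges).

```python
import math


def btl_difference(c_ij, c_ji, s=1.0):
    """Logit estimate of mu_i - mu_j from paired-comparison counts."""
    total = c_ij + c_ji
    if c_ij <= 0 or c_ji <= 0:
        raise ValueError("the logit is undefined for proportions 0 and 1")
    p = c_ij / total
    return s * (math.log(p) - math.log(1.0 - p))


def btl_probability(mu_i, mu_j, s=1.0):
    """Probability of choosing stimulus i over j under the BTL model."""
    return 0.5 + 0.5 * math.tanh((mu_i - mu_j) / (2.0 * s))
```

Feeding the estimated difference back into `btl_probability` recovers the observed proportion, e.g. 0.3 for counts (30, 70).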


Since it is not necessary to compute the error function, the BTL model is computationally less demanding. Nevertheless, this advantage is irrelevant with modern computers. The estimation of individual scores can be obtained by least square method as well. However, it is more common to perform maximum likelihood estimation (MLE) [42]. The likelihood function has a form of



CHAPTER 4 Emerging science of QoE in multimedia applications


Cξi i π ξi

[Figure 8.3: edge weights shown for the order-1 case V1(xi = 0) − V1(xi = 1) > 0 and for order-2 terms, with A = V2(1,0) − V2(0,0), B = V2(1,0) − V2(1,1), C = V2(1,0) + V2(0,1) − V2(1,1) − V2(0,0).]

FIG. 8.3 Edges construction for order 1 and order 2 interactions.


8.2 Lattice-based models and the bayesian paradigm

Let us first consider the complete case for which we have a realization of X. The log-likelihood is then written as follows:

$$\log L(\Theta) = -\sum_i \theta_i S_i(X) - \log Z(\Theta),$$



where $Z(\Theta)$ is the normalizing constant. To estimate $\Theta$ we can compute the maximum of the likelihood (or of the log-likelihood) using gradient-based algorithms. However, these schemes require the computation of the derivatives, which are given by:

$$\frac{\partial \log L(\Theta)}{\partial \theta_i} = -S_i(X) + \frac{1}{Z(\Theta)} \int S_i(X) \exp\!\left(-\sum_j \theta_j S_j(X)\right) dX.$$



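This implies the evaluation of the moments of the $S_i(X)$, denoted by $\langle S_i(X) \rangle_\Theta$. As they are analytically intractable, their evaluation requires sampling the model at each iteration, which is not feasible for computation time reasons. To avoid numerous sampling steps, the importance sampling trick can be used. Let us consider two different values for the parameters, $\Theta$ and $\Psi$. The ratio of the corresponding partition functions is written as follows:

$$\frac{Z(\Theta)}{Z(\Psi)} = \frac{\displaystyle\int \exp\!\left(-\sum_i (\theta_i - \psi_i) S_i(X)\right) \exp\!\left(-\sum_i \psi_i S_i(X)\right) dX}{\displaystyle\int \exp\!\left(-\sum_i \psi_i S_i(X)\right) dX} = \left\langle \exp\!\left(-\sum_i (\theta_i - \psi_i) S_i(X)\right) \right\rangle_\Psi.$$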
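The identity behind the importance sampling trick can be checked exactly on a model small enough to enumerate. The sketch below is only a toy under stated assumptions: configurations are binary vectors of length `n`, and the two statistics (number of ones, number of unequal adjacent pairs) are hypothetical choices. In practice the expectation under $\Psi$ would be replaced by an average over MCMC samples.

```python
import itertools
import math


def stats(x):
    """Toy sufficient statistics: (number of ones, number of unequal neighbors)."""
    return (sum(x), sum(1 for a, b in zip(x, x[1:]) if a != b))


def partition(theta, n):
    """Exact partition function Z(theta) by enumeration of {0,1}^n."""
    return sum(
        math.exp(-sum(t * s for t, s in zip(theta, stats(x))))
        for x in itertools.product((0, 1), repeat=n)
    )


def ratio_by_reweighting(theta, psi, n):
    """Z(theta)/Z(psi) as an expectation under the model with parameters psi."""
    configs = list(itertools.product((0, 1), repeat=n))
    weights = [math.exp(-sum(p * s for p, s in zip(psi, stats(x)))) for x in configs]
    z_psi = sum(weights)
    return sum(
        (w / z_psi) * math.exp(-sum((t - p) * s for t, p, s in zip(theta, psi, stats(x))))
        for w, x in zip(weights, configs)
    )
```

Both routes give the same number up to rounding, which is what allows derivatives at any $\Theta$ to be estimated from samples drawn once at $\Psi$.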



Therefore, if we slightly modify the log-likelihood definition, without changing the maximizer, as follows:

$$\log L(\Theta) = -\sum_i \theta_i S_i(X) - \log \frac{Z(\Theta)}{Z(\Psi)}, \tag{8.17}$$


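the derivatives become:

$$\frac{\partial \log L(\Theta)}{\partial \theta_i} = -S_i(X) + \frac{\left\langle S_i(X) \exp\!\left(-\sum_j (\theta_j - \psi_j) S_j(X)\right) \right\rangle_\Psi}{\left\langle \exp\!\left(-\sum_j (\theta_j - \psi_j) S_j(X)\right) \right\rangle_\Psi}.$$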


Therefore, we can sample the model with a given set of parameters $\Psi$ and then compute the derivatives by evaluating the proper moments on these samples for any $\Theta$. In practice, if the distance between $\Theta$ and $\Psi$ is too large, the respective distributions do not overlap enough to provide an accurate estimation. Therefore, a compromise has to be found, as in Algorithm 8.4. Some alternative methods have been proposed to solve the parameter estimation problem in specific cases or by considering some approximations of the likelihood. We can cite the coding method [1], which consists in considering a subset of pixels that are conditionally independent of each other. Consider a subset S such that:

$$\forall s \in S, \quad N_s \cap (S \setminus \{s\}) = \emptyset,$$




CHAPTER 8 Markov models and MCMC algorithms in image processing

ALGORITHM 8.4 MAXIMIZATION OF THE LOG-LIKELIHOOD
Compute the statistics $S_i(X)$ and initialize $\Theta = \Theta_0$ and $n = 0$
while convergence not reached do
• Sample the law with parameter $\Theta_n$
• Define a confidence interval $[\Theta_n - \delta_n, \Theta_n + \delta_n]$
• Maximize the log-likelihood defined in Eq. (8.17) in the confidence interval (using, e.g., a Newton-Raphson scheme) to get $\hat{\Theta}$
• If $\hat{\Theta}$ is on the border of the confidence interval, set $\Theta_{n+1} = \hat{\Theta}$; if not, we have converged.
end

where $N_s$ refers to the neighborhood of $s$. The likelihood relative to the coding set $S$ is then written as follows:

$$L_S(\Theta) = \prod_{s \in S} p(x_s \mid x_t, t \in N_s). \tag{8.20}$$



The probabilities in Eq. (8.20) only involve local partition functions. Similarly, some approximations can be considered, such as the maximization of the pseudolikelihood:

$$PL(\Theta) = \prod_{s \in \Lambda} p(x_s \mid x_t, t \in N_s).$$
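To make the role of the local partition functions concrete, the following sketch evaluates the log-pseudolikelihood of a label image. It assumes, for illustration only, a Potts-type model with a single parameter `beta`, a four-neighbor system and free boundary conditions; none of these choices is imposed by the text, and the function names are hypothetical.

```python
import math


def neighbors(i, j, h, w):
    """Four-connected neighbors of site (i, j) inside an h x w lattice."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            yield ni, nj


def log_pseudolikelihood(labels, beta, n_classes):
    """Sum over sites of log p(x_s | neighbors) for a Potts-type model."""
    h, w = len(labels), len(labels[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            nb = [labels[a][b] for a, b in neighbors(i, j, h, w)]
            # Negative local energy of each candidate label k at this site.
            local = [-beta * sum(1 for v in nb if v != k) for k in range(n_classes)]
            # Local partition function: a sum over n_classes terms only.
            log_z = math.log(sum(math.exp(e) for e in local))
            total += local[labels[i][j]] - log_z
    return total
```

Maximizing this function over `beta` (for instance by a one-dimensional search) gives a pseudolikelihood estimate without ever evaluating the global constant $Z(\Theta)$.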



When considering the MRF model as a prior in a Bayesian framework, the variable X is not known. The parameter estimation problem, which in this case is called an incomplete data estimation problem, becomes even trickier. The likelihood $P(Y;\Theta)$ cannot be obtained even up to a normalizing constant. The only equation we have is the following:

$$P(Y;\Theta) = \int P(Y, X;\Theta)\, dX. \tag{8.22}$$



The integral in Eq. (8.22) cannot be evaluated. Therefore, we have to address the problem through the general Expectation Maximization (EM) scheme. Consider the following two quantities:

$$Q(\Theta, \Theta'; Y) = \int P(X \mid Y;\Theta') \log P(Y, X;\Theta)\, dX,$$

$$D(\Theta \,\|\, \Theta') = \int P(X \mid Y;\Theta') \log \frac{P(X \mid Y;\Theta')}{P(X \mid Y;\Theta)}\, dX.$$

$Q$ is interpreted as an expectation, whereas $D$ is equal to the Kullback-Leibler distance between $P(X \mid Y;\Theta)$ and $P(X \mid Y;\Theta')$. As $D(\Theta \,\|\, \Theta')$ is nonnegative, the following iterative scheme does not decrease the likelihood at any iteration:


ALGORITHM 8.5 EM ALGORITHM
while convergence not reached do
• Expectation (E-step): compute $Q(\Theta, \Theta_k; Y)$
• Maximization (M-step): compute $\Theta_{k+1} = \arg\max_\Theta Q(\Theta, \Theta_k; Y)$
end
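The E-step/M-step alternation can be illustrated on a case where Q is tractable. The sketch below is not the MRF setting of this chapter: it assumes independent pixels, two Gaussian classes with a known common standard deviation `sigma`, and hypothetical starting values, so that the E-step reduces to computing posterior class probabilities and the M-step to weighted means.

```python
import math


def em_two_gaussians(y, mu=(0.0, 1.0), sigma=1.0, n_iter=50):
    """EM for a two-class Gaussian mixture with known common sigma.

    Returns the estimated class means and the proportion of class 1.
    """
    mu0, mu1 = mu
    pi1 = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs to class 1.
        resp = []
        for v in y:
            a = (1.0 - pi1) * math.exp(-((v - mu0) ** 2) / (2.0 * sigma ** 2))
            b = pi1 * math.exp(-((v - mu1) ** 2) / (2.0 * sigma ** 2))
            resp.append(b / (a + b))
        # M-step: maximize Q, which here gives responsibility-weighted means.
        n1 = sum(resp)
        n0 = len(y) - n1
        mu0 = sum((1.0 - r) * v for r, v in zip(resp, y)) / n0
        mu1 = sum(r * v for r, v in zip(resp, y)) / n1
        pi1 = n1 / len(y)
    return mu0, mu1, pi1
```

With an MRF prior the posterior no longer factorizes over pixels, which is why the expectation is approximated by sampling in that setting.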

Different adaptations of the EM algorithm, depending on the way the expectation is computed and optimized, have been proposed and applied in the case of MRFs [12, 13]. In practice, the expectation Q is usually intractable. Therefore, it is estimated by sampling the model using an MCMC sampler and approximated by an empirical estimator.

8.3 SOME INVERSE PROBLEMS

In this section, we consider the most classical applications of MRF modeling in image processing. More detailed descriptions of this methodology and more complete reviews of applications can be found in books dedicated to MRFs [14–16].

8.3.1 DENOISING AND DECONVOLUTION: THE RESTORATION PROBLEM

Image restoration is often the first step before analyzing the information content of an image. It mainly consists of improving the image quality, and it is one of the first problems to have been addressed by MRF modeling [3, 16, 17]. We consider an image Y corrupted by some distortion due to the optics of the sensor (for example, a convolution modeling the blurring effect due to defocusing) and by noise due to the sensor itself or to the transmission process. If we assume the distortion to be linear, we can write:

$$Y = K(X) \circ \eta,$$

where $\eta$ represents the noise, combined with the signal through the operation $\circ$ (an addition in the additive case), and $K$ is a linear operator, namely a convolution in the case of a blurring effect. The restoration problem consists in recovering X from Y. Embedding the problem into a Bayesian framework, we maximize the following posterior:

$$P(X \mid Y) \propto P(Y \mid X)\, P(X).$$

Assuming, for example, that we have an additive independent Gaussian noise, in the case of image denoising (K is the identity), we then have:

$$P(Y \mid X) = \exp\left\{ -\sum_{s \in \Lambda} \left[ \frac{(y_s - x_s)^2}{2\sigma^2} + \log\sqrt{2\pi\sigma^2} \right] \right\},$$





where $\sigma^2$ is the noise variance. The prior P(X) aims at regularizing the solution, i.e., smoothing the resulting image, and is usually modeled by a pairwise interaction Gibbs field:

$$P(X) = \frac{1}{Z} \exp\left\{ -\sum_{c = \langle s, s' \rangle} V_c(x_s, x_{s'}) \right\}.$$


A Gaussian Gibbs field with $V_c(x_s, x_{s'}) = (x_s - x_{s'})^2$ could achieve this goal. However, a Gaussian prior field blurs the edges. To avoid blurring, more sophisticated priors that preserve edges are considered [18, 19], for example the Φ-model:

$$V_c(x_s, x_{s'}) = \frac{-\beta}{1 + \dfrac{(x_s - x_{s'})^2}{\delta^2}},$$



where $\beta$ is a parameter representing the strength of the prior and $\delta$ is the minimum gray level gap that defines an edge. To summarize, the energy to be minimized in the case of image denoising is written as follows:

$$U_{den}(X \mid Y) = \sum_{c = \langle s, s' \rangle} \frac{-\beta}{1 + \dfrac{(x_s - x_{s'})^2}{\delta^2}} + \sum_{s \in \Lambda} \frac{(y_s - x_s)^2}{2\sigma^2}, \tag{8.30}$$



where $c$ are sets of two neighboring pixels (sites). We usually consider for each pixel the four or eight closest pixels. Another classical example is image deconvolution, when the image is blurred due to movements of the camera or defocusing. In this case the operator K is a convolution by a kernel K and the energy is given by:

$$U_{dec}(X \mid Y) = \sum_{c = \langle s, s' \rangle} \frac{-\beta}{1 + \dfrac{(x_s - x_{s'})^2}{\delta^2}} + \sum_{s \in \Lambda} \frac{(y_s - (K * X)_s)^2}{2\sigma^2}.$$
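As a concrete reading of the denoising energy, the sketch below evaluates a Φ-model prior plus a quadratic data term on a small image stored as a list of rows. It assumes the four-neighbor system (each horizontal and vertical pair counted once) and a saturating potential of the form $-\beta / (1 + (\Delta/\delta)^2)$; the function names are hypothetical and no optimization (e.g., Langevin dynamics) is attempted here.

```python
def phi_potential(diff, beta, delta):
    """Edge-preserving pair potential: close to -beta when flat, 0 across strong edges."""
    return -beta / (1.0 + (diff / delta) ** 2)


def denoising_energy(x, y, beta, delta, sigma):
    """Energy of a candidate image x given the observation y (denoising case)."""
    h, w = len(x), len(x[0])
    prior = 0.0
    for i in range(h):
        for j in range(w):
            # Each vertical and horizontal clique is visited exactly once.
            if i + 1 < h:
                prior += phi_potential(x[i][j] - x[i + 1][j], beta, delta)
            if j + 1 < w:
                prior += phi_potential(x[i][j] - x[i][j + 1], beta, delta)
    data = sum(
        (y[i][j] - x[i][j]) ** 2 for i in range(h) for j in range(w)
    ) / (2.0 * sigma ** 2)
    return prior + data
```

Because the potential saturates, a sharp step between two flat regions costs at most `beta` per clique, whereas a quadratic potential would penalize it in proportion to the squared jump.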


A denoising result obtained with the energy described in Eq. (8.30) using the Langevin dynamics is shown in Fig. 8.4 for different levels of noise [20]. As stated before, one key point is to smooth the noise without smoothing the edges. The first solution is to consider adapted potentials, such as the Φ-functions. Another possibility is to consider an additional process on the dual lattice, a so-called line process [18]. A binary random variable is associated with each clique. The potential between the two sites is then inhibited when the line variable is equal to one. A prior on the line process models the edge continuity of the configuration. We then have to optimize two random fields.

FIG. 8.4 Image denoising using a Φ-model. Panels: initial image; noisy image (σ = 20) and restored image; noisy image (σ = 50) and restored image.

8.3.2 SEGMENTATION PROBLEM

A segmentation map is a partition of the plane. Each region represents an object or a specific area in the image. Consider a random field $Y = (y_s)_{s \in \Lambda}$, where $y_s \in S$. The likelihood term $P(Y \mid X)$ models the gray level distribution of the pixels belonging to a given class or region. For example, we may consider that each class represents a given feature (sea, sand, crops in remote sensing data, or gray matter, white matter, and CSF for brain images, as in the example given in Fig. 8.5) and exhibits a Gaussian distribution, characterized by its mean and variance.

FIG. 8.5 Magnetic Resonance Image segmentation using a Potts model. Panels: initial image; maximum likelihood; Potts model.

The likelihood is then written as follows:

$$P(Y \mid X) = \exp\left\{ -\sum_{s \in \Lambda} \sum_{i \in I} \left[ \frac{(y_s - \mu_i)^2}{2\sigma_i^2} + \log\sqrt{2\pi\sigma_i^2} \right] \delta(x_s = i) \right\},$$





where $\delta(a)$ is equal to one if $a$ is true and 0 otherwise. $\mu_i$ (resp. $\sigma_i^2$) is the mean (resp. variance) of class $i$, $I$ being the set of classes. By maximizing the likelihood function alone, we obtain a noisy segmentation (see Fig. 8.5). To regularize the solution, i.e., to obtain smooth regions without holes, we consider a Gibbs field P(X) as a prior to impose spatial homogeneity in the solution. Derived from statistical physics, the most famous prior in image segmentation is the Potts model [1, 3], which is written as follows:

$$P(X) = \frac{1}{Z} \exp\left\{ -\sum_{c = \{s, s'\}} \beta\, \delta(x_s \neq x_{s'}) \right\},$$


where $\beta > 0$ represents the strength of the prior. The total energy is then written as follows:

$$U_{seg}(X \mid Y) = \sum_{c = \{s, s'\}} \beta\, \delta(x_s \neq x_{s'}) + \sum_{s \in \Lambda} \sum_{i \in I} \left[ \frac{(y_s - \mu_i)^2}{2\sigma_i^2} + \log\sqrt{2\pi\sigma_i^2} \right] \delta(x_s = i). \tag{8.34}$$
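The effect of the Potts prior on a maximum likelihood labeling can be illustrated with a simple deterministic descent. The sketch below uses Iterated Conditional Modes (ICM), a greedy site-by-site minimization that is swapped in here for brevity and is not the optimizer used for the results in this chapter; the four-neighbor system and the function names are assumptions.

```python
import math


def local_energy(labels, y, i, j, k, beta, means, sigmas):
    """Terms of the segmentation energy that involve site (i, j) if it takes label k."""
    h, w = len(labels), len(labels[0])
    data = (y[i][j] - means[k]) ** 2 / (2.0 * sigmas[k] ** 2) \
        + math.log(math.sqrt(2.0 * math.pi) * sigmas[k])
    prior = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w and labels[ni][nj] != k:
            prior += beta  # Potts penalty for each disagreeing neighbor
    return data + prior


def icm_sweep(labels, y, beta, means, sigmas):
    """One raster-scan ICM sweep; returns a new label image."""
    out = [row[:] for row in labels]
    for i in range(len(y)):
        for j in range(len(y[0])):
            out[i][j] = min(
                range(len(means)),
                key=lambda k: local_energy(out, y, i, j, k, beta, means, sigmas),
            )
    return out
```

On a 3 × 3 example where only the center pixel is mislabeled by maximum likelihood, one sweep with `beta = 1.0` removes the isolated label, while `beta = 0.0` leaves the maximum likelihood labeling unchanged.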


Fig. 8.5 shows the segmentation results with and without the Potts model. We can see that the prior removes the local errors due to the data. Although the Potts model succeeds in regularizing the solution, it is not always well suited for image segmentation [21]. Let us consider a connected component and its energy given by Eq. (8.34). The energy cost to keep this connected component is proportional to its surface if we consider the data term and to its boundary length if we consider the prior. Therefore, the longer the boundary, the more penalized the object, and the Potts model tends to remove fine structures in the solution. To overcome this problem, several models have been proposed [22, 23]. The main idea of these models is to differentiate local noise in configurations from edges and lines. To define edges, higher range interactions are needed. An example of such a model is given in [24, 25] and referred to as the Chien model. This model preserves fine structures and linear shapes in images while regularizing the solution. In this model, the set of cliques is composed of 3 × 3 squares. Three parameters (n, l, and e) are associated with these patterns. Before constructing the model, the different configurations induced by a 3 × 3 square are classified using the symmetries (black-white symmetry, rotations, etc.). This classification and the number of elements in each class are described in Fig. 8.6. A parameter C(i) is associated with each class and refers to the value of the potential function for the considered configuration. So, under the hypothesis of isotropy of the model, which induces some symmetries, plus the black/white symmetry, we have for such a topology (cliques of 3 × 3) fifty-one degrees of freedom. The construction of the model consists in imposing constraints that provide relations between its parameters. Two energy functions that differ only by a constant are equivalent, so we suppose that the minimum of the energy is equal to zero.
FIG. 8.6 The different classes induced by a binary 3 × 3 model and their number of elements.

FIG. 8.7 Equations associated with edge constraints.

We suppose that uniform realizations define the minimum of the energy, so that the first equation for the parameters is given by C(1) = 0. We then define the different constraints with respect to those two uniform realizations. The first class of constraints concerns the energy of edges per unit of length, which is denoted e. Due to symmetries and rotations, we only have to define three orientations of edges, corresponding to the eight induced by the size of the cliques. These constraints and the derived equations are represented in Fig. 8.7. Similar constraints define the energy associated with lines, denoted l, and undesirable local configurations are assigned the energy n, representing the noise. The potential associated with each configuration is then a linear combination of the three parameters e, l, and n:

$$\forall i = 1, \ldots, 51, \quad C(i) = E(i)\, e + \lambda(i)\, l + \eta(i)\, n,$$





and the coefficients E(i), λ(i), η(i) are defined through the relations between the potentials C(i). The resulting distribution is then written:

$$P_{e,l,n}(X) = \frac{1}{Z(e,l,n)} \exp\left[ -e N_0(X) - l N_1(X) - n N_2(X) \right],$$

where:

$$N_0(X) = \sum_{i=1,\ldots,51} E(i)\, \#_i(X), \qquad N_1(X) = \sum_{i=1,\ldots,51} \lambda(i)\, \#_i(X), \qquad N_2(X) = \sum_{i=1,\ldots,51} \eta(i)\, \#_i(X),$$

$\#_i(X)$ being the number of configurations of type $i$ in the realization X. A comparison of the Potts and Chien models for the segmentation of fine structures is shown in Fig. 8.8, where 15% of the pixels of the binary image have been reversed. The Chien model appears to be much better suited to image modeling than the Potts model.

FIG. 8.8 Ising model vs. Chien model. Panels: initial image; noisy image; segmentation using the Ising model; segmentation using the Chien model.

8.3.3 TEXTURE MODELING

Another important domain of image processing where Gibbs fields play a leading role is texture modeling [2, 26–28]. To characterize objects or specific land cover in an image, the pixel scale is not always the most appropriate. In Fig. 8.9, the radiometry (gray level information) is adequate for distinguishing the different fields. But within the urban area, the gray levels are almost uniformly distributed. Therefore, to decide whether a pixel belongs to an urban area or not, the gray level information is not sufficient. To distinguish urban areas, we then have to consider the local distribution of gray levels, or a texture parameter that characterizes it. In the MRF approach, we assume that, locally, the gray levels are distributed according to a Gibbs distribution, and we estimate the parameters associated with this Gibbs distribution. The relevant feature to discriminate the different areas in an image is given by these parameters, instead of the gray level values.

FIG. 8.9 Urban areas detection using a GMRF. Panels: initial image; β map; urban area.

To analyze textures, for example to delineate urban areas in remote sensing images, simple models leading to fast and robust estimation techniques are preferred. When the goal is to model the textures themselves in order to synthesize them, generic models are used. In this context, the relevance of Gibbs modeling is shown in [27] for Brodatz textures. High range pairwise interactions are considered to model the macro and micro properties of the texture. Herein, we consider a simpler model, namely the four-connected isotropic Gaussian MRF, to extract urban areas from satellite images [26]. Let us consider an image $X = (x_s)$, $s \in \Lambda$, where $x_s \in S$. The gray level space is typically S = {0, 1, …, 255}. We assume that locally the considered image is a realization of a Gibbs field with the following energy:

$$U_\theta(X) = \beta \left( \sum_{c = \{s, s'\}} (x_s - x_{s'})^2 + \lambda \sum_{s \in \Lambda} (x_s - \mu)^2 \right),$$





where $\theta = (\beta, \lambda, \mu)$ are the model parameters. $\mu$ is the mean, $\lambda$ weights the external field against the interactions, and $\beta$ is an inverse temperature parameter. Different estimators can be used to obtain the parameter values. For instance, the Maximum Likelihood (ML) estimator is given by:

$$\hat{\theta}_{ML} = \arg\max_\theta P_\theta(X).$$

The normalization constant being numerically intractable, we prefer to consider the Pseudo Likelihood, which only depends on local normalization constants. In the case of MRFs, the Maximum of the Pseudo Likelihood (MPL) is given by:

$$\hat{\theta}_{MPL} = \arg\max_\theta \prod_{s \in \Lambda} P(x_s \mid x_t, t \in N_s).$$



Parameter β is a local conditional variance and can be interpreted as an urban indicator: the higher β, the higher the probability of being in an urban area. After estimating β at each pixel by considering a local window, the β map is segmented to delineate urban areas, as shown in Fig. 8.9 [28].

8.4 SPATIAL POINT PROCESSES

When considering high resolution images, the geometrical information is crucial. Indeed, objects are often better characterized and recognized by their shape than by their radiometry. Shape properties are hard to model by local interactions. Besides, interactions between objects, such as repulsion or alignment, are even trickier to model at the pixel scale. One could consider an MRF on a graph, where each node represents an object and edges are set between interacting objects [29]. Node attributes can model the object geometry. However, this scheme requires constructing the graph, which implies knowing the number of objects to define the nodes and their relations to define the edges, knowledge that we do not have in practice. To overcome this limitation, we describe in this section a second modeling framework, which can be seen as an extension of the MRF approach. We consider configurations of an unknown number of objects described by parametric shapes and local interactions between objects. We consider the marked point process framework that was previously developed in the spatial statistics community [30, 31]. A review of models proposed in the context of image analysis can be found in [4]. They have been employed for detecting different items in remote sensing imaging such as roads [32, 33], buildings [34, 35], or vehicles [36]. One particular application is the problem of counting a population (such as a crowd [37], trees [38], flamingos, or cells [39]).


8.4.1 MODELING

Consider K, a compact subset of $\mathbb{R}^n$. K represents the image support, i.e., the image coordinates are embedded in a continuous space. A configuration of points, denoted by $\mathbf{x}$, is a finite unordered set of points in K, such as $\{x_1, \ldots, x_n\}$. The configuration space, denoted by Ω, is therefore written as:

$$\Omega = \bigcup_{n=0}^{\infty} \Omega_n,$$



where $\Omega_0 = \{\emptyset\}$ and $\Omega_n = \{\{x_1, \ldots, x_n\}, x_i \in K, \forall i\}$ is the set of the configurations of $n$ unordered points for $n \neq 0$. For every Borel set A in K, let $N_X(A)$ be the number of points of X that fall in the set A. A point process is then defined as follows:

Definition 8.3 X is a point process on K if and only if, for every Borel set A in K, $N_X(A)$ is a random variable that is almost surely finite.

Definition 8.4 A point process X on K is called a Poisson process with intensity measure $\nu(\cdot)$ if and only if:
• $N_X(A)$ follows a discrete Poisson distribution with expectation $\nu(A)$ for every bounded Borel set A in K,
• for $k$ nonintersecting Borel sets $A_1, A_2, \ldots, A_k$, the corresponding random variables $N_X(A_1), N_X(A_2), \ldots, N_X(A_k)$ are independent.
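Definition 8.4 suggests a direct way to simulate a homogeneous Poisson process on a rectangle: draw the number of points from a Poisson distribution with mean ν(K), then place the points independently and uniformly. The sketch below assumes a constant intensity per unit area and uses Knuth's multiplication method for the Poisson draw, which is only adequate for small means; the function names are hypothetical.

```python
import math
import random


def poisson_count(mean, rng):
    """Draw a Poisson variate by Knuth's method (suitable for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_poisson_process(intensity, width, height, seed=None):
    """Homogeneous Poisson process on [0, width] x [0, height]."""
    rng = random.Random(seed)
    # nu(K) = intensity * area for a constant intensity.
    n = poisson_count(intensity * width * height, rng)
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height)) for _ in range(n)]
```

Calling `sample_poisson_process(0.05, 10, 20, seed=3)` returns a list of about ten points on average; counts in disjoint sub-rectangles are independent by construction.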

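For every Borel set B, the probability measure $\pi_\nu(B)$ associated with the Poisson process is given by [30]:

$$\pi_\nu(B) = e^{-\nu(K)} \left( \mathbb{1}[\emptyset \in B] + \sum_{n=1}^{\infty} \frac{\pi_{\nu_n}(B)}{n!} \right),$$

with

$$\pi_{\nu_n}(B) = \int_K \cdots \int_K \mathbb{1}[\{x_1, \ldots, x_n\} \in B_n]\, \nu(dx_1) \cdots \nu(dx_n),$$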



where $B_n$ is the subset of configurations in B which contain exactly $n$ points. We now define more general processes by considering a density with respect to the Poisson measure. Let $f$ be a probability density with respect to the law $\pi_\nu(\cdot)$ of the Poisson process, such that:

$$f : \Omega \to [0, \infty), \qquad \int_\Omega f(\mathbf{x})\, d\pi_\nu(\mathbf{x}) = 1.$$

The measure defined by $P(A) = \int_A f(\mathbf{x})\, d\pi_\nu(\mathbf{x})$, for every Borel set A in Ω, is a probability measure on Ω that defines a point process. Such a model can favor or penalize geometric properties such as clustering or the alignment of points, which leads to interesting possibilities for modeling the scene under study. Similarly to MRFs, we introduce the following Markov property:




Definition 8.5 Let X be a point process with density $f$. X is a Markov process under the symmetric and reflexive relation $\sim$ if and only if, for every configuration $\mathbf{x}$ in Ω such that $f(\mathbf{x}) > 0$, X satisfies:
• $f(\mathbf{y}) > 0$ for every $\mathbf{y}$ included in $\mathbf{x}$ (heredity),
• for every point $u$ from K, $f(\mathbf{x} \cup \{u\}) / f(\mathbf{x})$ only depends on $u$ and its neighborhood $\partial(\{u\}) \cap \mathbf{x} = \{x \in \mathbf{x} : u \sim x\}$ (Markov property).

A result similar to the Hammersley-Clifford theorem allows the density of a Markov point process to be decomposed as the product of local functions defined on cliques:

Theorem 8.1 A density $f : \Omega \to [0, \infty[$ associated with a point process is Markovian under the neighborhood relation $\sim$ if and only if there exists a measurable function $\phi : \Omega \to [0, \infty[$ such that:

$$\forall \mathbf{x} \in \Omega, \quad f(\mathbf{x}) = \alpha \prod_{\mathbf{y} \subseteq \mathbf{x},\ \mathbf{y} \in C_{\mathbf{x}}} \phi(\mathbf{y}),$$

where the set of cliques is given by $C_{\mathbf{x}} = \{\mathbf{y} \subseteq \mathbf{x} : \forall \{u, v\} \subseteq \mathbf{y}, u \sim v\}$. Just as for the random field case, we can then write the density in the form of a Gibbs density:

$$f(\mathbf{x}) = \frac{1}{c} \exp\left[ -\sum_{\mathbf{y} \subseteq \mathbf{x},\ \mathbf{y} \in C_{\mathbf{x}}} V(\mathbf{y}) \right],$$



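where $U(\mathbf{x}) = \sum_{\mathbf{y} \subseteq \mathbf{x},\ \mathbf{y} \in C_{\mathbf{x}}} V(\mathbf{y})$ is called the energy and $V(\mathbf{y})$ is a potential. A typical example of a Markov/Gibbs process, which is often used, is the pairwise interaction process. In this case, a neighborhood relation is defined between pairs of points. For example, $x_i \sim x_j$ if and only if $d(x_i, x_j) < r$, where $d(\cdot, \cdot)$ is the Euclidean distance and $r$ is a radius of interaction. In this case, the unnormalized density is written as:

$$h(\mathbf{x}) = \prod_{i=1}^{n(\mathbf{x})} b(x_i) \prod_{1 \le i < j \le n(\mathbf{x})} g(x_i, x_j),$$

$$p_{x,\delta} = \begin{cases} \dfrac{\exp[\beta(U(\mathbf{x}) - U(\mathbf{x} \setminus x))]\,\delta}{1 + \exp[\beta(U(\mathbf{x}) - U(\mathbf{x} \setminus x))]\,\delta} = \dfrac{a_x \delta}{1 + a_x \delta} & \mathbf{x} \to \mathbf{x} \setminus x, \\[2ex] \dfrac{1}{1 + a_x \delta} & \mathbf{x} \to \mathbf{x} \ (x \text{ survives}). \end{cases}$$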





The convergence of this discrete procedure toward the continuous procedure is presented in [42], and the resulting algorithm is called the multiple births and deaths algorithm:

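ALGORITHM 8.7 MULTIPLE BIRTHS AND DEATHS ALGORITHM
Initialize $\mathbf{x}^{(0)}$, $n = 0$, $\delta = \delta_0$, $T = T_0$;
while convergence has not been attained do
  Draw a realization of a Poisson process with intensity $\delta$, call it $\mathbf{y}$, and update the configuration: $\mathbf{x} \leftarrow \mathbf{x} \cup \mathbf{y}$
  For every object $x$ in $\mathbf{x}$ calculate $a_x = \exp\left[\dfrac{U(\mathbf{x}) - U(\mathbf{x} \setminus x)}{T}\right]$, draw $p$ from a uniform distribution over [0, 1];
  If $p < \dfrac{a_x \delta}{1 + a_x \delta}$ then remove $x$: $\mathbf{x} \leftarrow \mathbf{x} \setminus x$ end
  $n \leftarrow n + 1$, $\delta \leftarrow \delta \times \alpha_\delta$, $T \leftarrow T \times \alpha_T$
end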
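The loop structure of Algorithm 8.7 can be sketched on a deliberately simple problem. Everything specific below is an assumption made for illustration: objects are integer positions on a line of `n_sites` sites, the data energy is −1 on a set of hypothetical `targets` and +1 elsewhere, duplicates at the same site are penalized by `overlap_penalty`, births are discretized as one Bernoulli draw per site, and the loop runs for a fixed number of iterations instead of testing convergence.

```python
import math
import random


def death_probability(delta_u, delta, temperature):
    """a*delta / (1 + a*delta), with a = exp(delta_u / T), evaluated stably."""
    z = delta_u / temperature + math.log(delta)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def energy_gain(config, idx, targets, overlap_penalty):
    """U(x) - U(x without object idx) for the toy energy described above."""
    pos = config[idx]
    data = -1.0 if pos in targets else 1.0
    clashes = sum(1 for k, q in enumerate(config) if k != idx and q == pos)
    return data + overlap_penalty * clashes


def multiple_births_and_deaths(n_sites, targets, delta0=0.5, t0=1.0,
                               alpha_delta=0.95, alpha_t=0.95,
                               n_iter=200, overlap_penalty=2.0, seed=0):
    rng = random.Random(seed)
    config, delta, temp = [], delta0, t0
    for _ in range(n_iter):
        # Birth phase: add objects without any rejection.
        config += [s for s in range(n_sites) if rng.random() < delta]
        # Death phase: visit every object once and remove it with p = a*delta/(1+a*delta).
        survivors = list(config)
        for obj in list(config):
            idx = survivors.index(obj)
            du = energy_gain(survivors, idx, targets, overlap_penalty)
            if rng.random() < death_probability(du, delta, temp):
                survivors.pop(idx)
        config = survivors
        # Cooling of both the birth intensity and the temperature.
        delta *= alpha_delta
        temp *= alpha_t
    return sorted(config)
```

At low temperature, objects that increase the energy are removed almost surely and objects that decrease it are kept, so the final configuration can only contain well-placed, non-duplicated objects; whether every target is found depends on the cooling schedule.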

The parameters $\alpha_\delta$ and $\alpha_T$ are coefficients smaller than 1. In practice, convergence is obtained when all the objects added during the birth phase, and only those, are removed during the death phase. Note that too great a decrease in δ, and therefore in the birth rate, can freeze the configuration, since objects will no longer be added. Compared to the RJMCMC scheme, the birth phase adds a collection of objects to the current configuration without any rejection. Therefore, fluctuations between two consecutive configurations are larger, especially at low temperature. This makes the algorithm much more robust with respect to the temperature cooling scheme. To accelerate the convergence of the multiple births and deaths algorithm, the birth choice can be made according to the data instead of using a uniform Poisson law. Suppose that we have a birth map, $B(s)$, $s \in S$, on the image lattice that favors certain positions during the birth phase. Without losing the convergence properties, the multiple births and deaths algorithm can then be modified as follows:

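ALGORITHM 8.8 ADDITION OF THE BIRTHS MAP
For every pixel $s$ calculate the value of the birth map $B(s)$
Normalize the birth map: $\forall s \in S$, $b(s) = \dfrac{B(s)}{\sum_{s \in S} B(s)}$
Initialize $\mathbf{x}^{(0)}$, $n = 0$, $\delta = \delta_0$, $T = T_0$;
while convergence has not been attained do
  For every pixel $s \in S$, if no object from $\mathbf{x}$ is centered on $s$, add an object centered on $s$ with probability $b(s)\delta$.
  For every object $x$ of $\mathbf{x}$ calculate $a_x = \exp\left[\dfrac{U(\mathbf{x}) - U(\mathbf{x} \setminus x)}{T}\right]$, draw $p$ according to the uniform distribution on [0, 1];
  If $p < \dfrac{a_x \delta}{1 + a_x \delta}$ then remove $x$: $\mathbf{x} \leftarrow \mathbf{x} \setminus x$ end
  $n \leftarrow n + 1$, $\delta \leftarrow \delta \times \alpha_\delta$, $T \leftarrow T \times \alpha_T$
end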


The death step can also be improved by ordering the objects in the configuration. For example, the objects can be visited in order of decreasing data energy, which means that we first propose to remove the objects that are badly localized on the data. Finally, similarly to MRFs, graph techniques have been investigated to optimize some specific Marked Point Processes (MPPs). Although a purely deterministic approach is not reachable, as the number of objects is not known, we can replace the death step in the multiple births and deaths algorithm by a minimal cut in a proper graph [43]. The main advantage of the births and cut algorithm is that it avoids the simulated annealing scheme and thus the setting of the temperature cooling. The regularity constraint, necessary to use the graph cut algorithm [11], imposes some attractive interactions. However, when considering MPPs for object detection, a repulsive interaction is necessary to control object overlaps. To overcome this difficulty, the cut step is defined on two configurations, P and Q, each of them containing noninteracting objects. Some repulsive interactions are considered between objects in P and objects in Q. We define a binary model. The binary variable associated with an object is equal to 1 if the object is preserved and to 0 if the object is removed. The cut step selects the optimal configuration of objects included in P ∪ Q. As for the MRF case, two nodes (s and t) are added to the nodes defined by the objects in P ∪ Q. The node s means that the binary variable associated with an object in P (resp. in Q) is equal to 1 (resp. 0), and vice versa for the node t. The edge construction is summarized in Fig. 8.10.

[Figure 8.10: terminal edges to s and t weighted by ±U(n(p_i) = 1) and ±U(n(q_i) = 0) according to the sign of these energies; edges between objects weighted by ±V(x, y) for x ∈ P, y ∈ P with V(x, y) < 0, for x ∈ Q, y ∈ Q with V(x, y) < 0, and for x ∈ P, y ∈ Q with V(x, y) > 0.]

FIG. 8.10 Graph construction for the multiple births and cut algorithm.

The multiple births and cut algorithm is then given as follows:




ALGORITHM 8.9 MULTIPLE BIRTHS AND CUT ALGORITHM
Generate randomly a configuration of noninteracting (nonoverlapping, for example) objects P
while convergence has not been attained do
  Generate randomly a configuration of noninteracting objects Q
  Construct the graph containing the objects of P and Q (plus the nodes s and t, as described in Fig. 8.10).
  Compute the minimal cut and redefine P as the set of the objects of P linked with s and the objects of Q linked with t
end


8.5 MULTIPLE OBJECTS DETECTION

Aerial and satellite images play an increasingly important role in the field of natural resource management, and in particular for forests. High resolution data provide information accurate enough to count the individuals in a population. In this context, the objective of modeling by marked point processes, as presented here, is to extract the tree crowns from very high resolution aerial images of forests (see [38] for details). In other fields, such as biological imagery, counting objects is an important problem for evaluating cell populations or vesicles within cells [39]. We consider ellipses as objects. The position space associated with the image is $K = [0, X_M] \times [0, Y_M]$, and the mark space associated with the ellipse configurations is $M = [a_m, a_M] \times [b_m, b_M] \times [0, \pi[$, where $X_M$ and $Y_M$ are the height and width of the image I respectively, and where $a \in [a_m, a_M]$ is the major axis, $b \in [b_m, b_M]$ is the minor axis, and $\theta \in [0, \pi[$ is the orientation of the ellipses. For images of plantations, we model three main properties in the prior:

• Trees are not superposed. To avoid detecting the same tree crown with two objects (cf. Fig. 8.11 left), we consider a repulsion term between two objects $x_i \sim_r x_j$ that intersect each other. A coefficient $Q_r(x_1, x_2) \in [0, 1]$ penalizes two objects more or less strongly as a function of their area of intersection:

$$U_r(\mathbf{x}) = \gamma_r \sum_{x_i \sim_r x_j} Q_r(x_i, x_j), \quad \gamma_r \in \mathbb{R}^+.$$

• Plantations are characterized by a regular arrangement of trees. An attraction term is defined to favor regular alignments (see Fig. 8.11 right). A quality function $Q_a(x_1, x_2) \in [0, 1]$ quantifies the alignment between two objects $x_1 \sim_a x_2$ by comparing the vector $\overrightarrow{x_1 x_2}$ to the two vectors that represent the principal directions of the plantation:

$$U_a(\mathbf{x}) = \gamma_a \sum_{x_i \sim_a x_j} Q_a(x_i, x_j), \quad \gamma_a \in \mathbb{R}^-.$$

• For process stability reasons, we forbid two objects closer than the minimum admissible distance, taken to be equal to 1 pixel:

$$U_h(\mathbf{x}) = \begin{cases} +\infty & \text{if } \exists (x_i, x_j) \in \mathbf{x} \mid d(x_i, x_j) < 1, \\ 0 & \text{otherwise.} \end{cases}$$

FIG. 8.11 Left: object intersections and the associated coefficient of intersection. Right: object alignments favored by the prior.

Fig. 8.12 shows a plantation and a simulation of the prior. The prior provides dense configurations of nonintersecting ellipses.

FIG. 8.12 Left: image of a poplar plantation. Right: simulation of the prior.




The data on which the tests are carried out are scanned aerial photographs from infrared film, which highlights the zones containing chlorophyll. In [38], a Bayesian model, in which the data term is the likelihood of the observations, is presented: $U_d(\mathbf{x}) = -\log(L(I \mid \mathbf{x}))$. It assumes that we can construct a statistical model of the image, inside and outside the tree crowns, given a configuration of objects $\mathbf{x}$. The approach consists of considering the pixels of the image as belonging to one of two Gaussian classes: either the background (low gray level) or the trees (high gray level), whose parameters can be estimated using a K-means classification algorithm. This Bayesian data term gives good results for images that only contain trees and ground, such as in Fig. 8.13.


FIG. 8.13 Result of extraction on the image in Fig. 8.12 (200 × 140 pixels) with a Bayesian model, 10 million iterations (15 minutes).

8.5 Multiple objects detection

For more complex scenes, the energy of the data term is defined as the sum of local terms for each of the objects [44]:

$$U_d(x) = \gamma_d \sum_{x_i \in x} U_d(x_i).$$


The objects are favored by a negative data energy and penalized by a positive data energy. The local data energy can be interpreted as the output of a local filter. In the images, the trees are distinguished thanks to their shadows, which create a dark zone around the tree. The foliage and branches of a tree can thus be considered as a light shape surrounded by a dark zone. This property is reflected by a distance quantifying the contrast between the ellipse $x$ and its neighborhood $\mathcal{F}_\rho(x)$, as defined in Fig. 8.14. For example, writing $\mathcal{F}(x)$ for $\mathcal{F}_\rho(x)$, we can consider the following distance:

$$d_B(x, \mathcal{F}(x)) = \frac{\left(\mu_x - \mu_{\mathcal{F}(x)}\right)^2}{4\sqrt{\sigma_x^2 + \sigma_{\mathcal{F}(x)}^2}} - \frac{1}{2}\log\!\left(\frac{2\,\sigma_x\,\sigma_{\mathcal{F}(x)}}{\sigma_x^2 + \sigma_{\mathcal{F}(x)}^2}\right).$$


A sigmoid function maps the considered distance to the interval $[-1, 1]$ as follows:

$$Q_d(d) = \begin{cases} 1 - \dfrac{d}{d_0} & \text{if } d < d_0, \\[2mm] \exp\!\left(-\dfrac{d - d_0}{d_0}\right) - 1 & \text{otherwise,} \end{cases} \tag{8.64}$$

where $d_0$ is the distance threshold above which the data term favors the object in the configuration. The optimization is performed using an RJMCMC algorithm embedded in a simulated annealing scheme. The proposition kernel contains birth and death, simple perturbations of objects (dilation, translation, rotation), and a specific move consisting of splitting and merging ellipses.

FIG. 8.14 Ellipse $x$ and its border $\mathcal{F}_\rho(x)$ (labels: light pixels inside the ellipse, dark pixels inside the border).

As shown in Fig. 8.15, the Bayesian model tends to fit ellipses on the chlorophyll areas. The model whose data term is a sum of "local filters" avoids false alarms on meadows and thus provides a more precise count. The framework is general, provided that the objects can be approximated by some parametric shapes. Another example is shown in Fig. 8.16, where a disk process is used to count cells in confocal microscopy images [39]. For more complex shapes, we can predefine a dictionary of shapes. In Fig. 8.17, intracellular vesicles are detected using a dictionary composed of connected shapes included in a square of 5 × 5 pixels (using the SPADE software freely available at pypi/small-particle-detection).
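The mapping of Eq. (8.64) from a contrast distance to a local data energy can be sketched as below. The sketch assumes that the local term of an object is exactly this mapped value and that the contrast distances have already been computed; the function names are hypothetical.

```python
import math


def data_quality(d, d0):
    """Map a contrast distance d >= 0 to [-1, 1] following Eq. (8.64).

    Positive values penalize the object, negative values favor it,
    and the sign changes at the threshold d0.
    """
    if d < d0:
        return 1.0 - d / d0
    return math.exp(-(d - d0) / d0) - 1.0


def total_data_energy(distances, d0, gamma_d):
    """Sum of local terms, one per object, weighted by gamma_d."""
    return gamma_d * sum(data_quality(d, d0) for d in distances)
```

An object with no contrast against its border (`d = 0`) receives the maximal penalty 1, an object exactly at the threshold is neutral, and the reward saturates at -1 for very contrasted objects.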

8.5.2 ROAD NETWORK DETECTION

With this second example, we tackle a traditional problem of remote sensing image analysis: extracting line networks, and especially road networks. Two main types of methods are traditionally distinguished. The first type consists of semiautomatic methods, in which the user provides some seeds from which a tracking algorithm extracts the network [45]. The second type makes this approach unsupervised by detecting the seeds automatically [46]. When using the marked point process framework, we address the problem at the scale of segments [32, 33]. We introduce a marked point process for road network extraction (see [32, 33] for details). The line network $S$ is defined by a set of segments that is the realization of a Markov object process. We specify the process by its density $h$ relative to a homogeneous Poisson process. The density contains a data term $g$ and a prior term $f$ that carries the constraints on the structure, the connectivity, and the average curvature of the network:

$$h(S) \propto g(S)\, f(S),$$


where $S = \{s_i,\ i = 1, \dots, n\}$ is a set of segments. Two segments are said to be connected if the closest distance between their endpoints is less than a constant $\varepsilon$. This relation makes it possible to define three types of segments, as shown in Fig. 8.18. Free segments are unconnected segments, simple segments are connected at only one endpoint, and double segments are connected at both endpoints. The connectivity of the network is modeled in the density by penalizing free segments and, to a lesser extent, simple segments. To avoid superposition of segments, or intersections with acute angles, while still permitting intersections with (approximately) right angles, we define the connectivity relation $R_c$ by stating that two segments are connected if the angle between them is large. For example, in Fig. 8.20, $s_1$ and $s_2$ are considered as connected but $s_1$ and $s_3$ are not. Moreover, in order to favor segment pairs whose endpoints and orientations are close, as for the $(s_1, s_2)$ pair of Fig. 8.20, a potential function $V_{R_c}$ is defined as follows:


FIG. 8.15 Top: test image (166 × 224 pixels). Middle: result of extraction with the Bayesian model. Bottom: result of extraction with the non-Bayesian model, in 10 million iterations (20 minutes).




FIG. 8.16 A synthetic confocal microscopy image of cells (left) and the detection obtained using an MPP of disks (right).

FIG. 8.17 A synthetic confocal microscopy image of intracellular vesicles (left) and the detection obtained using an MPP based on a dictionary of small shapes (right; green (light): good detection, blue (dark): false negative, red (medium): false positive).

FIG. 8.18 The different types of segments (labels: disk of radius $\varepsilon$, free segment, simple segment, double segment).


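for $s_i \sim_c s_j$,

$$V_{R_c}(s_i, s_j) = \frac{V_\tau(\tau_{ij}) + V_\varepsilon(d_{ij})}{2},$$

with

$$V_\tau(\tau_{ij}) = \begin{cases} -\sigma(\tau_{ij}, \tau_{\max}) & \text{if } |\tau_{ij}| < \tau_{\max}, \\ 1 & \text{otherwise,} \end{cases} \qquad V_\varepsilon(d_{ij}) = -\sigma(d_{ij}, \varepsilon).$$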

$V_\tau$ favors segment pairs $(s_i, s_j)$ whose difference in orientation $\tau_{ij}$ is less than the threshold $\tau_{\max}$ and penalizes other cases. $V_\varepsilon$ relates to the distance $d_{ij}$ between the endpoints and tends to connect close segments. The attractive terms of these functions are given by a quality function $\sigma$, which is described by Eq. (8.67):

$$\sigma(\cdot, M) : [-M, M] \to [0, 1], \qquad x \mapsto \sigma(x, M) = \frac{1}{M^2}\left(\frac{1 + M^2}{1 + x^2} - 1\right). \tag{8.67}$$


FIG. 8.19 Quality function (curves $\sigma(\cdot, \pi/4)$ and $\sigma(\cdot, \pi/2)$, plotted over $[-\pi/2, \pi/2]$ with maximum value 1).

FIG. 8.20 Different types of connection: $(s_1, s_2)$, attractive connection; $(s_1, s_3)$, not considered as a connection; $(s_1, s_4)$, repulsive connection due to orientation.

The quality function $\sigma$ is positive over $[-M, M]$ and is maximal at 0, as shown in Fig. 8.19. Segment pairs that form an angle smaller than a (small) constant, denoted $c$, are prohibited. For other configurations, we consider the same quality function $\sigma$, used this time in a repulsive way, according to the difference between the orientations of the two segments. Thus, for each pair $(s_i, s_j)$ such that $s_i \sim_{io} s_j$, we consider:

$$V_{R_{io}}(s_i, s_j) = \begin{cases} \infty & \text{if } \tau_{ij} < c, \\ 1 - \sigma(\tau_{ij}, \pi/2 - \delta_{\min}) & \text{otherwise,} \end{cases}$$

where $\tau_{ij}$ is the orientation difference between $s_i$ and $s_j$, $\delta_{\min}$ is the minimum deviation from a right angle for which two segments are considered badly oriented, and $c$ is the minimum difference between orientations that can exist in the configuration (Fig. 8.20). To summarize, the prior density is defined as follows:

$$f(S) \propto \exp\left[-n_f\,\omega_f - n_s\,\omega_s - \sum_{\langle s_i, s_j \rangle_r} V_r(s_i, s_j)\right],$$

where $n_f$ and $n_s$ are the numbers of free and simple segments, $\langle \cdot, \cdot \rangle_r$ denotes a pair of segments interacting via the relation $r$, and $V_r(\cdot, \cdot)$ is a potential function.
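The pairwise potentials that enter this prior can be sketched directly from Eq. (8.67) and the definitions of $V_{R_c}$ and $V_{R_{io}}$. The function names are hypothetical, angles are assumed to be in radians, and the caller is assumed to have already decided which relation a pair of segments belongs to.

```python
import math


def sigma(x, m):
    """Quality function of Eq. (8.67): 1 at x = 0, 0 at |x| = m."""
    return ((1 + m * m) / (1 + x * x) - 1) / (m * m)


def v_tau(tau, tau_max):
    """Orientation part of the connection potential."""
    return -sigma(tau, tau_max) if abs(tau) < tau_max else 1.0


def v_eps(d, eps):
    """Endpoint-distance part of the connection potential (d < eps assumed)."""
    return -sigma(d, eps)


def v_connection(tau, d, tau_max, eps):
    """V_Rc: average of the orientation and distance terms."""
    return 0.5 * (v_tau(tau, tau_max) + v_eps(d, eps))


def v_bad_orientation(tau, c, delta_min):
    """V_Rio: forbidden below c, repulsive up to pi/2 - delta_min."""
    if tau < c:
        return math.inf
    return 1.0 - sigma(tau, math.pi / 2 - delta_min)
```

A perfectly aligned pair with touching endpoints receives the most attractive value, -1, whereas a pair whose orientation difference reaches $\pi/2 - \delta_{\min}$ receives the full repulsive value of 1.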

As for the trees, we consider the data term as a sum of local terms:

$$g(S) \propto \exp\left[-\gamma_d \sum_{s_i \in S} \delta_i\right],$$

where $\delta_i$ is a statistical value calculated over the region formed by the pixels of segment $s_i$ and of its neighborhood, and $\gamma_d$ is a positive weighting coefficient.


FIG. 8.21 H1: significant difference with adjacent zones (left), H2: homogeneity of the segment (right).

The potential $\delta_i$ is based on two assumptions concerning the segment and its neighborhood. We assume, first of all, that adjacent regions are different, at least on average, from the region of the segment (see Fig. 8.21, top). Then, the set of pixels of a segment must be homogeneous, to avoid contours being detected as roads (see Fig. 8.21, bottom). To check these assumptions for a given segment, we divide it into several bands, $b_1, \dots, b_{n_b}$ (note that by the term segment we mean an elongated rectangle in the discrete space). In addition, we consider two bands, one on each side, $R_i^1$ and $R_i^2$, at a distance $d$ from the segment, as illustrated in Fig. 8.22. The values of the pixels in each band constitute a population. A Student t-test is calculated to determine whether the averages of two populations are significantly different. For two populations $x$ and $y$, we thus calculate the following:

$$\operatorname{ttest}(x, y) = \frac{|\bar{x} - \bar{y}|}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}},$$

where $\bar{x}$, $\sigma^2$, and $n$ refer to the mean, the variance, and the number of observations, respectively. We would like a significant difference between the segment and its two sides, so we consider the statistical test ($H_1$) defined by the minimum of the two tests:











FIG. 8.22 Division of a segment into several bands.

$$T_{H_1}(s_i) = \min_{l \in \{1, 2\}} \operatorname{ttest}(R_i^l, s_i).$$

The assumption of homogeneity $H_2$ is given by the maximum value of the tests between two adjacent bands within the segment:

$$T_{H_2}(s_i) = \max_{j \in \{1, \dots, n_b - 1\}} \left[\operatorname{ttest}(b_j, b_{j+1})\right].$$

Finally, the selected potential is the ratio of these two quantities, with the precaution that the denominator $T_{H_2}(s_i)$ is floored at 1:

$$T_i = \frac{T_{H_1}(s_i)}{\max\left[1, T_{H_2}(s_i)\right]}.$$

Moreover, a sigmoid function makes it possible to convert values from $[0, \infty)$ into $[-1, 1]$. Thus, we have:

$$\delta_i = \begin{cases} 1 & \text{if } T_i < t_1, \\[1mm] 1 - 2\,\dfrac{T_i - t_1}{t_2 - t_1} & \text{if } t_1 \le T_i \le t_2, \\[1mm] -1 & \text{if } T_i > t_2, \end{cases}$$

where $t_1$ and $t_2$ ($t_1 < t_2$) are two thresholds that parameterize the data term.
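The chain from pixel populations to the local potential $\delta_i$ can be sketched as follows. Bands are assumed to be plain lists of gray levels, the population variance is used for $\sigma^2$, and a small floor protects against a zero denominator for perfectly constant bands; these choices and the function names are assumptions made for the illustration.

```python
import math
from statistics import fmean, pvariance


def ttest(x, y):
    """|mean(x) - mean(y)| / sqrt(var(x)/n_x + var(y)/n_y)."""
    spread = math.sqrt(pvariance(x) / len(x) + pvariance(y) / len(y))
    return abs(fmean(x) - fmean(y)) / max(spread, 1e-9)


def t_h1(segment, side1, side2):
    """Contrast test: the weaker of the two segment/side comparisons."""
    return min(ttest(side1, segment), ttest(side2, segment))


def t_h2(bands):
    """Homogeneity test: the strongest difference between adjacent bands."""
    return max(ttest(bands[j], bands[j + 1]) for j in range(len(bands) - 1))


def delta(t_i, t1, t2):
    """Piecewise-linear mapping of T_i to [-1, 1]."""
    if t_i < t1:
        return 1.0
    if t_i > t2:
        return -1.0
    return 1.0 - 2.0 * (t_i - t1) / (t2 - t1)


def segment_potential(bands, side1, side2, t1, t2):
    """delta_i for a segment split into at least two inner bands."""
    segment = [p for band in bands for p in band]
    t_i = t_h1(segment, side1, side2) / max(1.0, t_h2(bands))
    return delta(t_i, t1, t2)
```

A bright, homogeneous segment flanked by darker bands yields $\delta_i = -1$ and is therefore favored by $g(S)$, while a segment that is statistically indistinguishable from its sides yields $\delta_i = 1$.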



The optimization is performed using an RJMCMC algorithm. The proposition kernel includes birth and death, as well as segment translation and rotation. A specific move, which speeds up the process, consists of a birth kernel in which the new segment is proposed so as to be connected to a segment of the current configuration. This move intuitively induces a tracking of connected chains of segments. The reverse move, which removes a connected segment, must also be included in the proposition kernel. The RJMCMC algorithm is embedded in a simulated annealing framework for the tests, which are carried out on real data. Figs. 8.23 and 8.24 show that the marked point process prior allows the network to be detected even on the shadowed part of the road, where there is a lack of contrast between the road and the border.

FIG. 8.23 Aerial image (892 × 652 pixels).

FIG. 8.24 Results of extraction of the road network in the image from Fig. 8.23 for the continuous potentials Candy model.
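To make the optimization scheme more concrete, here is a minimal sketch of a birth-and-death sampler embedded in simulated annealing. It is deliberately simplified and is not the kernel used in the chapter: objects are points in a rectangle, only uniform birth and death moves are proposed (no translation, rotation, or connected birth), the cooling schedule is geometric, and `toy_energy` is a made-up energy with two hypothetical targets. The acceptance ratios are the standard ones for a density defined relative to a Poisson process of intensity `lam`, raised to the power `1 / temperature`.

```python
import math
import random


def log_birth_ratio(n, delta_u, temperature, lam_area):
    """Log acceptance ratio for adding a point to a configuration of size n."""
    return math.log(lam_area / (n + 1)) - delta_u / temperature


def log_death_ratio(n, delta_u, temperature, lam_area):
    """Log acceptance ratio for removing one of the n current points."""
    return math.log(n / lam_area) - delta_u / temperature


def anneal_birth_death(energy, width, height, lam=0.01, t0=1.0,
                       cooling=0.999, n_iter=5000, seed=0):
    rng = random.Random(seed)
    lam_area = lam * width * height
    config, u, temperature = [], energy([]), t0
    for _ in range(n_iter):
        n = len(config)
        candidate = None
        if rng.random() < 0.5:                       # birth proposal
            point = (rng.uniform(0, width), rng.uniform(0, height))
            candidate, ratio = config + [point], log_birth_ratio
        elif n > 0:                                  # death proposal
            k = rng.randrange(n)
            candidate, ratio = config[:k] + config[k + 1:], log_death_ratio
        if candidate is not None:
            u_new = energy(candidate)
            log_r = ratio(n, u_new - u, temperature, lam_area)
            # 1 - random() lies in (0, 1], so the logarithm is defined.
            if math.log(1.0 - rng.random()) < log_r:
                config, u = candidate, u_new
        temperature *= cooling                       # geometric cooling
    return config, u


def toy_energy(config, targets=((10.0, 10.0), (30.0, 20.0)),
               radius=3.0, min_dist=1.0):
    """Hypothetical energy: reward points near a target, penalize crowding."""
    u = 0.0
    for i, p in enumerate(config):
        for q in config[:i]:
            d = math.dist(p, q)
            if d < min_dist:
                return math.inf                      # hard-core constraint
            if d < 2 * radius:
                u += 1.0                             # soft repulsion
        u += -1.0 if any(math.dist(p, t) < radius for t in targets) else 1.0
    return u
```

Working with log ratios avoids overflow at low temperature, and a proposal that violates the hard-core constraint has an infinite energy increase and is therefore never accepted.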

8.6 CONCLUSION

Stochastic modeling is part of the history of image processing and computer vision. The major advantage of these models is their ability to embed prior information on the solution. They are therefore a very powerful tool for solving inverse problems, which are often ill posed and for which the data themselves may not be sufficient to provide a robust solution. Another key point in favor of probabilistic modeling is the corpus of algorithms available for simulation, optimization, and estimation. These models sometimes suffer from a bad reputation concerning their computational cost. However, recent algorithmic developments and the growth in computer performance usually answer these criticisms. Initially dedicated to pixelwise modeling, stochastic models have followed the recent development of sensors: new models have been developed to take into account the increasing resolution of data by embedding geometrical information. Stochastic models still have a bright future in image analysis.

REFERENCES

[1] J. Besag, Spatial interaction and the statistical analysis of lattice systems, J. R. Stat. Soc. Ser. B 36 (2) (1974) 192–236.
[2] G. Cross, A. Jain, Markov random field texture models, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1) (1983) 25–39.
[3] S. Geman, D. Geman, Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721–741.
[4] X. Descombes, Stochastic Geometry for Image Analysis, Wiley (Hoboken, USA)/Iste (London, UK), 2011.
[5] R.L. Dobrushin, The description of a random field by means of conditional probabilities and conditions of its regularity, Theory Probab. Appl. 13 (2) (1968) 197–224.
[6] C.P. Robert, Monte Carlo Methods, John Wiley & Sons, Ltd, USA, 2004.
[7] W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1) (1970) 97–109.
[8] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[9] B. Hajek, Cooling schedules for optimal annealing, Preprints Math. Op. Research, 1987.
[10] Y. Boykov, O. Veksler, Graph cuts in vision and graphics: theories and applications, in: Handbook of Mathematical Models in Computer Vision, Springer (Berlin, Heidelberg), 2006.
[11] V. Kolmogorov, R. Zabih, What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26 (2) (2004) 147–159.
[12] B. Chalmond, An iterative Gibbsian technique for reconstruction of m-ary images, Pattern Recognit. 22 (6) (1989) 747–761.


[13] P. Masson, W. Pieczynski, SEM algorithm and unsupervised statistical segmentation of satellite images, IEEE Trans. Geosci. Remote Sens. 31 (3) (1993) 618–633.
[14] R. Chellappa, A. Jain, Markov Random Fields: Theory and Applications, Academic Press, Boston, 1993.
[15] S. Li, Markov Random Field Modeling in Computer Vision, second ed., Springer-Verlag, Berlin, Heidelberg, 2001.
[16] G. Winkler, Image Analysis, Random Fields and Markov Chain Monte Carlo Methods, A Mathematical Introduction, Springer, New York, 2003.
[17] D. Geman, Random fields and inverse problems in imaging, in: Lecture Notes in Mathematics, vol. 1427, Springer-Verlag, 1991, pp. 113–193.
[18] S. Geman, G. Reynolds, Constrained restoration and recovery of discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 14 (3) (1992) 367–383.
[19] M. Nikolova, Analysis of the recovery of edges in images and signals by minimizing non convex regularized least-squares, SIAM J. Multiscale Model. Simul. 4 (3) (2005) 960–991.
[20] X. Descombes, M. Lebellego, E. Zhizhina, Image deconvolution using a stochastic differential equation approach, in: Proc. International Conference on Computer Vision Theory and Applications, Barcelona, Spain, 2007.
[21] R.D. Morris, X. Descombes, J. Zerubia, The Ising/Potts model is not well suited to segmentation tasks, in: Proc. Digital Signal Processing Workshop, IEEE, Loen, Norway, 1996.
[22] G. Wolberg, T. Pavlidis, Restoration of binary images using stochastic relaxation with annealing, Pattern Recognit. Lett. 3 (1985) 375–388.
[23] H. Tjelmeland, J. Besag, Markov random fields with higher-order interactions, Scand. J. Stat. 25 (3) (1998) 415–433.
[24] X. Descombes, J.-F. Mangin, E. Pechersky, M. Sigelle, Fine structures preserving Markov model for image processing, in: Proc. 9th Scandinavian Conference on Image Analysis, 1995.
[25] X. Descombes, R.D. Morris, J. Zerubia, M. Berthod, Estimation of Markov random field prior parameters using Markov chain Monte Carlo maximum likelihood, IEEE Trans. Image Process. 8 (7) (1999) 954–963.
[26] X. Descombes, M. Sigelle, F. Preteux, GMRF parameter estimation in a non-stationary framework by a renormalization technique: application to remote sensing imaging, IEEE Trans. Image Process. 8 (4) (1999) 490–503.
[27] G.L. Gimel'farb, Image Textures and Gibbs Random Fields, Springer, Berlin, Heidelberg, 1999.
[28] A. Lorette, X. Descombes, J. Zerubia, Texture analysis through a Markovian modeling and fuzzy classification: application to urban area extraction from satellite images, Int. J. Comput. Vis. 36 (3) (2000) 221–236.
[29] F. Tupin, H. Maitre, J.-F. Mangin, J.-M. Nicolas, E. Pechersky, Detection of linear features in SAR images: application to road network extraction, IEEE Trans. Geosci. Remote Sens. 36 (2) (1998) 434–453.
[30] M.N.M. Van Lieshout, Markov Point Processes and Their Applications, Imperial College Press, London, 2000.
[31] S.N. Chiu, D. Stoyan, W.S. Kendall, J. Mecke, Stochastic Geometry and its Applications, John Wiley & Sons, USA, 2013.
[32] R. Stoica, X. Descombes, J. Zerubia, A Gibbs process for road extraction in remotely sensed images, Int. J. Comput. Vis. 57 (2) (2004) 121–136.



[33] C. Lacoste, X. Descombes, J. Zerubia, Point processes for unsupervised line network extraction in remote sensing, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1568–1579.
[34] M. Ortner, X. Descombes, J. Zerubia, A marked point process of rectangles and segments for automatic analysis of digital elevation models, IEEE Trans. Pattern Anal. Mach. Intell. 30 (1) (2008) 105–119.
[35] C. Benedek, X. Descombes, J. Zerubia, Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (2012) 33–50.
[36] A. Borcs, C. Benedek, Extraction of vehicle groups in airborne lidar point clouds with two-level point processes, IEEE Trans. Geosci. Remote Sens. 53 (3) (2015) 1475–1489.
[37] W. Ge, R.T. Collins, Marked point processes for crowd counting, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2913–2920.
[38] G. Perrin, X. Descombes, J. Zerubia, A marked point process for tree crown extraction in plantations, in: Proc. International Conference on Image Processing, Genoa, Italy, 2005.
[39] X. Descombes, Multiple objects detection in biological images using a marked point process framework, Methods 115 (2017) 2–8.
[40] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711–732.
[41] C.J. Geyer, J. Møller, Simulation and likelihood inference for spatial point process, Scand. J. Stat. B 21 (1994) 359–373.
[42] X. Descombes, R. Minlos, E. Zhizhina, Object extraction using a stochastic birth-and-death dynamics in continuum, J. Math. Imaging Vis. 33 (3) (2009) 347–359.
[43] E.A. Gamal, X. Descombes, G. Charpiat, J. Zerubia, Multiple birth and cut algorithm for multiple object detection, in: Proc. IEEE International Conference on Signal-Image Technology and Internet-based Systems (SITIS), Kuala Lumpur, Malaysia, December 2010.
[44] X. Descombes, Interacting adaptive filters for multiple objects detection, in: L. Florack, R. Duits, G. Jongbloed, M.N.M. Van Lieshout (Eds.), Mathematical Methods for Signal and Image Analysis and Representation, Springer-Verlag, London, 2012.
[45] D. Geman, B. Jedynak, An active testing model for tracking roads from satellite images, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1) (1996) 1–14.
[46] M. Barzohar, D.B. Cooper, Automatic finding of main roads in aerial images by using geometric-stochastic models and estimation, IEEE Trans. Pattern Anal. Mach. Intell. 18 (7) (1996) 707–721.

FURTHER READING

Y. Verdie, F. Lafarge, Detecting parametric objects in large scenes by Monte Carlo sampling, Int. J. Comput. Vis. 106 (1) (2014) 57–75.


Scalable image informatics


Dmitry Fedorov, B.S. Manjunath, Christian A. Lang, Kristian Kvilekval University of California, Santa Barbara, CA, United States

9.1 INTRODUCTION

Images and video play a major role in scientific discoveries. Significant advances in imaging science over the past two decades have resulted in new devices and technologies that are able to probe the world from nanoscales to planetary scales. These instruments generate massive amounts of multimodal imaging data. In addition to the raw imaging data, these instruments capture additional critical information, the metadata, which includes the imaging context. Further, experimental conditions that are not implicit in the instrumentation metadata are often added manually. Despite these technological advances in imaging sciences, resources for curation, distribution, sharing, and analysis of such data at scale are still lacking. Robust image analysis workflows have the potential to transform image-based sciences such as biology, ecology, remote sensing, materials science, and medical imaging. In this context, this chapter presents BisQue, a novel ecosystem where scientific image analysis methods can be discovered, tested, verified, refined, and shared among users on a shared, cloud-based infrastructure. The vision of BisQue is to enable large-scale, data-driven scientific explorations. The following sections discuss the core requirements of such an architecture and the challenges in developing and deploying the methods, and conclude with an application to image recognition using deep learning.

9.1.1 CORE REQUIREMENTS

The development of BisQue is driven by the requirements of the scientific imaging community. Fig. 9.1 summarizes the core requirements that support ubiquitous access to multidimensional (5D) images and other binary data types.

• Heterogeneous data: Typical scientific datasets are heterogeneous in nature (e.g., multidimensional 3D volumes together with 1D sequence information). A comprehensive solution to managing such data is critical for discovery and innovation.

Academic Press Library in Signal Processing, Volume 6. # 2018 Elsevier Ltd. All rights reserved.


CHAPTER 9 Scalable image informatics







FIG. 9.1 The development of BisQue is driven by the requirements of the scientific imaging community. At the core is the data/feature infrastructure that supports metadata resources and ubiquitous access to multidimensional (5D) images and other binary data types. (Panel labels: multimodal: textual, 5D graphical, images, graph, geometry; semantic meaning; large scale and extensible; automated UI; Docker deployments; Python, ITK, Matlab, CellProfiler, DREAM.3D, etc.; 250+ image formats; everything on the web; searchable and organizable; 5D images of any size; WebGL volume viewer; many types: images, molecules, PDFs, etc.; plots and summaries; annotations; computations.)

• Metadata integration: Contextual information in most scientific imaging experiments is embedded in the associated metadata that describe instrument-specific and experiment-specific parameters.
• Large-scale data management: Modern experiments are typically statistical in nature and thus require analyzing large datasets.
• Integrated data analysis: Further, most experiments carry carefully curated manual annotations and manipulations that are critical for further computations.
• Integration with automated analytics: Equally critical is that the analysis modules work closely with the data and metadata, and that the results of such analysis are integrated into the overall system for data mining.
• Provenance tracking: Integration of the data and methods will allow provenance tracking toward enabling reproducible computations.
• Support for large-scale data analytics: This is critical for high-throughput imaging applications that routinely generate large amounts of data.
• Support for easy integration of analysis modules: Most new experiments are designed with new computational analysis in mind and thus require support for easy integration of new and complex analysis routines.
• Support for indexing, searching, and querying high-dimensional and graph data: This is important because many commonly used multimodal features are high-dimensional vectors, and organizing such data is a major challenge for pattern recognition and data-mining methods.
• Developer support: In addition to the scientific users, an extensible system should provide support for researchers outside of a chosen scientific domain who would be able to contribute novel analytics and benefit from testing and validating their methods on diverse data. This is critical for scientific knowledge discovery.

9.2 Core concepts

To address these challenges, the BisQue image informatics system has pioneered extensible image analysis on the web with full web standards compliance and large data support. BisQue introduced a schema-less, hierarchical, and flexible data model for annotations in order to address the needs of diverse labs and users; this model has proven most useful for storing heterogeneous data and metadata. The flexible annotations are a key element allowing rapid integration of image analysis and its results into the system. BisQue is designed from the ground up to be scalable and deployable on cloud computing infrastructure, and it can be easily deployed using Docker virtualization [1]. Deep learning and data analytics architectures are being integrated with BisQue, thus harnessing the collective power of data and methods in order to derive new insights.

9.2 CORE CONCEPTS

At its core, the BisQue system describes data objects (e.g., images, tables, experiments) via trees of user-defined metadata tuples, where each element in the tree may be described with a user-given name, value, type, and units. Additional attributes controlled by the system handle ownership and access control, time of creation and access, ontological reference, etc. Annotation tuples can be simple textual annotations or more complex graphical annotations described via multidimensional graphical primitives, such as points, polylines, polygons, surfaces, and more complex shapes. The metadata trees are called documents or resources and may describe zero, one, or many binary files accessible to the system. Each resource, and each tuple contained within it, is addressable via a Uniform Resource Identifier (URI), allowing the creation of links to subelements of each document. The types of these resources and tuples indicate which micro-services can operate on the referenced binary resources and provide domain-specific operations.

All metadata documents are handled by the data service, which is used to orchestrate the query system and any other binary services. Other resource types include the image resource, served by an image service. Images can be composed of one or multiple files, and the metadata elements of the image resource may describe the geometry and physical characteristics of the files composing a multidimensional image. Furthermore, dense table data (such as HDF5) can be served by a table service that allows slicing and dicing. Chemical data (such as molecule SD files) can be served by a chemistry service providing the typical visualizations and queries performed over molecular data. The metadata resources are used not only to describe binary files but also every other concept used in the system, such as users or module executions.
Internal and external APIs, micro-service communication, and historical differences are also described with resource documents. The canonical representation of the resource document is XML, because the well-described XPath/XQuery interfaces allow slicing and dicing of complex hierarchical documents. Services can also represent metadata trees in other formats, such as JSON or CSV, when exchanging uniform data in a more efficient manner. Dense data elements can be described in the canonical XML representation (for exchange or indexing), but presenting many dense elements in such a form may be unwieldy. Thus, specialized micro-services also exchange binary data formats, such as images or HDF5.

The BisQue system orchestrates multiple micro-services, providing access control for asynchronous operations, users, and user-contributed analyses (Fig. 9.2). Services are tightly integrated with the system and have access to its internal data structures and scaling mechanisms. User-contributed analyses, called modules, execute in a sandboxed environment and only possess temporary user-level credentials while executing. The module system, in turn, automatically scales module executions based on available resources and module requirements.

Sharing and collaboration are extremely important for any scientific endeavor, and the web nature of the BisQue system makes them easy. Flexible data format support by the micro-services optimizes data transfers for editing and visualization. Proximity to computational resources enables large-scale computations. In addition, the system keeps a complete record of all resultant data, of the analysis code that produced it, and of the users who ran it, guaranteeing strict provenance and reproducibility of all data computed collaboratively.

FIG. 9.2 BisQue is implemented as a scalable and modular web service. Image servers store and manipulate images. Data servers provide flexible metadata storage. Extension servers house executable modules. The client server seamlessly integrates services across storage and processing hardware. Communication between the various components is performed in a RESTful manner through HTTP requests carrying XML and JSON. Backend servers seamlessly provide access to local resources as well as large scalable resources like Amazon S3 or iRODS for data storage or Condor grids for computation. (Diagram labels: query services, with query dialects and extensions: XPath, BICKER, etc.; index services, with content indexing: full text, graph, geometry, image content; feature/classification services, with extractors: shape, color, location, texture, etc.; image similarity; blob/image services; data services; analysis services; XML + binary.)
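The metadata-tree model described in this section can be illustrated with a small XML document built and queried using Python's standard library. The element names (`image`, `tag`, `gobject`, `vertex`), the attribute names, and the URIs below are assumptions chosen for the illustration and should not be read as BisQue's actual schema or API.

```python
import xml.etree.ElementTree as ET


def build_image_resource():
    """Build a hypothetical resource document describing one image."""
    image = ET.Element("image", name="embryo_01.tif", uri="/data/image/1")
    # Nested name/value/type/unit tuples form the metadata tree.
    experiment = ET.SubElement(image, "tag", name="experiment",
                               value="heat shock", type="string")
    ET.SubElement(experiment, "tag", name="temperature",
                  value="37", type="number", unit="celsius")
    # A link-typed tuple points to another resource by its URI.
    ET.SubElement(image, "tag", name="microscope",
                  value="/data/resource/42", type="link")
    # A graphical annotation described by a polygon primitive.
    nucleus = ET.SubElement(image, "gobject", name="nucleus", type="polygon")
    for x, y in [(10, 12), (25, 14), (18, 30)]:
        ET.SubElement(nucleus, "vertex", x=str(x), y=str(y))
    return image


def tag_values(resource, name):
    """Values of all tuples with the given name, at any depth."""
    return [t.get("value") for t in resource.findall(f".//tag[@name='{name}']")]


def linked_resources(resource):
    """URIs referenced by link-typed tuples (edges of the metadata graph)."""
    return [t.get("value") for t in resource.findall(".//tag[@type='link']")]
```

The path expressions use the XPath subset supported by `ElementTree`; a full XPath or XQuery engine would allow richer slicing of the same document.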

9.2.1 METADATA GRAPH

The hierarchical and flexible annotations enabled by BisQue allow users and developers to rapidly contribute data, add annotations, and integrate analysis modules. Metadata trees can represent virtually any user-defined schema. Semantic meaning for elements in the tree can be conveyed by a simple user-defined type or, more rigorously, by pointing to an ontology definition provided by some service. Pointers (URIs) within documents are used extensively in order to connect different resources. For example, an image may point to a microscope definition document as well as a sample preparation document. Such links form metadata graphs, allowing aggregated graph queries and also providing access to specific binary micro-services. This rich metadata representation is too expensive for very large amounts of uniform (dense) data (millions or billions of elements); in that case, pointers can be used to link a branch of the metadata tree to a dense data element. Such elements are typically served by micro-services representing table data (e.g., detected cells with computed features) or pixel data (e.g., image data or segmented masks).

9.2.2 VERSIONING, PROVENANCE, AND QUERIES

The provenance of scientific data is also important in multiuser environments (such as laboratories and companies) in order to ensure the validity and reproducibility of scientific discoveries. At the same time, the scientific process is based on experimentation and thus involves a lot of trial and error. The BisQue system addresses both concerns with versioned metadata documents and strict, system-controlled analysis provenance, which together encourage experimentation and guarantee the provenance of analysis results. The metadata system preserves all changes to the metadata by storing validity time intervals for each data item. This information allows the recreation of any document at any point in time, historical queries, and the representation of changes via delta documents. All analysis executions are described by a system-controlled module execution document (MEX) that contains pointers to the exact module used for the computation and to all input elements, such as input resources (e.g., images, tables), together with the explicit module parameters. MEX documents also contain all the produced outputs, either explicitly or by linking. In addition, module documents identify the exact state of the source code or binary algorithm at module execution time by storing a source code repository reference.
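The idea of storing a validity time interval with each data item can be sketched with a small in-memory class. This is only an illustration of the principle: the class and method names are hypothetical, documents are flattened to name/value pairs, and timestamps are assumed to be supplied in non-decreasing order.

```python
import math


class VersionedDocument:
    """Keeps every value ever set, each with a [valid_from, valid_to) interval."""

    def __init__(self):
        self._rows = []  # each row: [name, value, valid_from, valid_to]

    def _close(self, name, t):
        # End the validity of the currently open row for this name, if any.
        for row in self._rows:
            if row[0] == name and row[3] == math.inf:
                row[3] = t

    def set(self, name, value, t):
        """Record a new value at time t without discarding the old one."""
        self._close(name, t)
        self._rows.append([name, value, t, math.inf])

    def delete(self, name, t):
        """Mark the item as no longer valid from time t onward."""
        self._close(name, t)

    def as_of(self, t):
        """Recreate the document as it was at time t."""
        return {n: v for n, v, start, end in self._rows if start <= t < end}

    def delta(self, t_old, t_new):
        """Items that differ between two points in time, as (old, new) pairs."""
        old, new = self.as_of(t_old), self.as_of(t_new)
        return {k: (old.get(k), new.get(k))
                for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
```

Because nothing is overwritten, a historical query is a simple filter on the intervals, and a delta document is the difference between two such snapshots.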



CHAPTER 9 Scalable image informatics

Document versioning enables user correction of automatically produced results and records the exact change made, which could be useful for improving automated methods. This system thus ensures strict provenance of every produced data element and encourages experimentation. In addition, provenance paired with strict versioning of analysis code guarantees reproducibility of all the computed results. The BisQue system encourages storing the different types of data needed to describe an experiment. To derive scientific insights from these heterogeneous data, a query system able to handle multimodal data is required. BisQue’s unified query system uses a SPARQL-like query language that provides an abstraction from the underlying data stores and indices and allows expression of complex queries over graphs of metadata documents. Depending on the availability of indices and the amount of data queried, the query system can decide to execute a query by pushing it down into an underlying relational database (e.g., PostgreSQL) or it may decide to run it via brute-force computation in a distributed fashion (e.g., aggregating millions of data items via Apache Spark). These two approaches require very different computational infrastructure and algorithmic support. The BisQue system allows each micro-service responsible for its data type to contribute one or more indices/summarizers for the unified query system based on expandable functions. The query system will in turn dispatch queries to specific subindices and later aggregate their results into one single answer. The data service keeps track of all the available indices and sends them updates whenever a document changes. Each index decides whether the update is relevant to its indexable data type and whether to update itself instantaneously or queue the update for a later batch update.
Such indices/summaries cover as varied data types as full text, metadata graph, ontological terms, graphical annotations geometry, image-based similarity, molecular scaffold similarity, and statistical summaries of numerical dense data. For more complex data analytics requirements that cannot be easily expressed with BisQue’s query language or that are provided as program code (e.g., in Matlab, Python, or C), BisQue’s analysis module system can be utilized. BisQue modules can be parallelized on a computational cluster in a Map-Reduce or in a more complex Directed Acyclic Graph (DAG) fashion in order to maximize parallelism. Each execution can in turn utilize any scalable micro-service and their indices. The orchestration of the distributed module execution, data/code shipment, and result collection is handled transparently by the module system.

9.3 BASIC MICRO-SERVICES BisQue is designed following common web techniques and benefits from the available hardware and software infrastructure. Cloud infrastructure, RESTful APIs, light-weight virtual machines, HTTP scalability, and caching are all used by BisQue. Another design motif utilized throughout the system is lazy evaluation paired with extensive caching as it offers fine-grained control of dispatch and reduces overall computations by skipping unneeded work.

9.3.1 UNIFORM METADATA REPRESENTATION AND QUERY ORCHESTRATION: DATA SERVICE The main service within the BisQue system is the data service, which is responsible for storage, access, and query orchestration over the metadata documents. The data service may restrict access to certain elements of system-defined documents such as user descriptions or MEXs. These schema restrictions and enforcements are provided by type-specific micro-services and are orchestrated by the data service. User-defined or system-unknown types will not provoke any specialized behavior or restrictions and thus allow natural system extensions. Extensible, type-driven user interface elements are described later in the appropriate section. The data service is also responsible for converting metadata documents to/from XML, JSON, and other formats. The most important orchestration function of the data service is in multimodal queries. The data service provides XPath-like and SPARQL-like (“BQR”) query languages supporting extended functionality by accessing multiple indexers via plug-in functions. While both languages allow slicing and dicing of large hierarchical documents, XPath is more expressive for tree-structured queries (e.g., “get all descendant nodes of this node”) while BQR is more expressive for graph-structured queries (e.g., “find nodes in documents linked by this document”). In fact, the XPath query processor builds on the BQR query processor, thereby reducing code complexity. The data service notifies all the known indexers about any changes to a metadata document. Each indexer decides whether to update itself immediately or to queue the update for asynchronous processing. Some indexers may require resource-intensive computations of feature descriptors, and for these asynchronous updates are preferred. Each indexer maintains its own data structures needed to efficiently answer specific queries (e.g., an R-tree index structure for low-dimensional nearest neighbor queries). The BQR language provides primitives to query specific indexes and to link the results with the other parts of the query (e.g., “find documents linking cells that are very similar in shape to this cell”).
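The notification scheme described above, in which the data service announces every change and each indexer decides for itself how to react, can be sketched as follows. The `Indexer` and `DataService` classes, the dictionary-shaped documents, and the `expensive` flag are hypothetical names chosen for this illustration and do not reflect BisQue's actual interfaces.

```python
from collections import deque


class Indexer:
    """Hypothetical indexer: handles one document type, eagerly or lazily."""

    def __init__(self, doc_type, expensive=False):
        self.doc_type = doc_type
        self.expensive = expensive
        self.index = {}          # document id -> indexed payload
        self.pending = deque()   # updates deferred to a later batch

    def notify(self, doc):
        if doc["type"] != self.doc_type:
            return "ignored"             # change is irrelevant to this index
        if self.expensive:
            self.pending.append(doc)     # e.g., costly feature descriptors
            return "queued"
        self.index[doc["id"]] = doc["value"]
        return "indexed"

    def flush(self):
        """Apply all queued updates in one batch."""
        while self.pending:
            doc = self.pending.popleft()
            self.index[doc["id"]] = doc["value"]


class DataService:
    def __init__(self, indexers):
        self.indexers = indexers

    def save(self, doc):
        # Every known indexer is told about every change.
        return [indexer.notify(doc) for indexer in self.indexers]


text_index = Indexer("text")
shape_index = Indexer("image", expensive=True)
service = DataService([text_index, shape_index])

outcome = service.save({"id": "doc1", "type": "image", "value": "cell.tif"})
queued_before_flush = len(shape_index.pending)
shape_index.flush()  # stands in for the asynchronous batch update
```

The design point is that the data service needs no knowledge of what any index stores; it only broadcasts, which is what lets new indexers be added as plug-ins.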

9.3.2 SCALABILITY OF MICRO-SERVICES AND ANALYSIS The BisQue components require multiple scalability approaches. Services utilize three basic scalability mechanisms. The first one scales micro-services themselves in order to support a large number of concurrent users. This technique utilizes standard web technologies for load detection and request distribution. Cloud technologies allow automated scale-up and scale-down of service machines based on current user load. The second mechanism allows a service to off-load a slow operation to the background asynchronous processing system when the response time-out expires. It is required because some operations are computationally expensive (e.g., complex data analytics queries). In this case a service will respond with a document indicating the operation is still in progress and provide a return URL where a response can eventually be obtained. The requester can then periodically poll to check if the result is ready. The third mechanism is specifically designed for user-contributed analysis and utilizes a slower but massively scalable cluster dispatch mechanism. Compute jobs (i.e., modules) are scheduled to one or more cluster nodes based on their hardware and software requirements. This is discussed in more detail in Section 9.4.3.
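The second mechanism (answer immediately if possible, otherwise hand back a URL to poll) can be sketched in-process without any web framework. The `Service` class, the `/jobs/<id>` URL layout, and the response dictionaries are assumptions made for this illustration; a thread pool stands in for the background asynchronous processing system.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError


class Service:
    """Hypothetical service that off-loads slow operations after a time-out."""

    def __init__(self, response_timeout):
        self.response_timeout = response_timeout
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}  # job id -> future still running in the background

    def request(self, operation, *args):
        future = self._pool.submit(operation, *args)
        try:
            result = future.result(timeout=self.response_timeout)
            return {"status": "done", "result": result}
        except TimeoutError:
            job_id = uuid.uuid4().hex
            self._jobs[job_id] = future  # keeps running in the background
            return {"status": "in_progress", "poll_url": f"/jobs/{job_id}"}

    def poll(self, poll_url):
        future = self._jobs[poll_url.rsplit("/", 1)[-1]]
        if not future.done():
            return {"status": "in_progress", "poll_url": poll_url}
        return {"status": "done", "result": future.result()}


def slow_query(n):
    time.sleep(0.2)  # stands in for an expensive analytics query
    return sum(range(n))


service = Service(response_timeout=0.05)
reply = service.request(slow_query, 1000)
first_status = reply["status"]

# The requester polls the returned URL until the result is ready.
while reply["status"] != "done":
    time.sleep(0.05)
    reply = service.poll(reply["poll_url"])
```

A fast operation would be answered directly by `request`; only operations that outlast the time-out pay the cost of polling.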

9.3.3 ANALYSIS EXTENSIONS: MODULE SERVICE The module service enables users to easily contribute their own analyses (e.g., image classification and recognition methods) and allows dispatching these analyses to the available computational infrastructure (see Section 9.4 for more details). The module service operates on two system-defined resource types: 1. Module description documents that define formal inputs and outputs of the algorithm, user interfaces, source code location, and version along with other descriptors. 2. MEXs that describe a particular execution, values of its inputs and outputs, execution status, and initial and final date and time. Because modules are described by metadata documents, they can be shared with collaborators as well as published for community participation. Published modules can later be curated by the system administrator, as shown in Fig. 9.3. Module execution is initiated by passing a document template derived from the module definition and containing the required input values. A developer can permit automatic data-parallel execution by simply defining an iterable input parameter. A more complex execution can be achieved by passing a DAG document composed of multiple module execution templates. Automatically parallelized execution is only allowed in batch mode, where all the required inputs are defined a priori. Interactive analysis, on the other hand, has only partial inputs initially and will request additional user inputs via the user interface while running, until the end of the execution is reached. Such a module may remain executing and waiting for input parameters for a prolonged period of time. In addition, the module service provides mechanisms to monitor currently executing modules, and it facilitates communication between the user interface and its running module code.

FIG. 9.3 The BisQue module development life cycle. Stages shown: private module (development, collaboration), published module (community rating), and curated module (access to large compute).

9.3.4 UNIFORM REPRESENTATION OF HETEROGENEOUS STORAGE SUBSYSTEMS: BLOB SERVICE Modern institutions already utilize large storage systems. Often they manage multiple different systems, and often those systems are read-only. Moreover, individual users also use several storage services. To further complicate matters, data storage is available at different speeds and price points. It may be advantageous to store old data in a cheaper but slower system while keeping the most current data elements in a much faster but more expensive system. BisQue brings disparate storage mechanisms together and enables annotation and analysis of data stored within those systems. The blob service handles multiple storage systems and their authentication via extensible drivers that can handle large local file systems and enterprise solutions like iRODS, Amazon S3, Box, Google Drive, and others. The blob service also handles local caching of these resources to improve performance on repeated access. Metadata documents describing files located in remote storage systems simply store URIs that define a driver, a specific attached store, and a path to the file. This allows annotating large amounts of resources without ever moving any bytes until those bytes are required for visualization or analysis. Another benefit of this descriptive mechanism concerns cold storage. Elements located in or moved to cold storage can be rapidly found using metadata descriptors without access to their bytes. Moreover, caching of derivative results may enable fast partial visualization; for example, the image service may provide a pixel preview without accessing the original data by caching derivative thumbnails.
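The URI scheme described above (driver, attached store, path) and the "no bytes move until needed" behavior can be sketched as follows. The mapping of URI scheme to driver and of network location to store is an assumption made for this illustration, as are the `InMemoryDriver` and `BlobService` classes; real drivers would talk to iRODS, S3, and so on.

```python
from urllib.parse import urlparse


class InMemoryDriver:
    """Hypothetical stand-in for a remote storage driver; counts byte fetches."""

    def __init__(self, files):
        self.files = files  # (store, path) -> bytes
        self.reads = 0

    def fetch(self, store, path):
        self.reads += 1
        return self.files[(store, path)]


class BlobService:
    def __init__(self, drivers):
        self.drivers = drivers  # URI scheme -> driver
        self.cache = {}         # URI -> locally cached bytes

    def register(self, uri, **tags):
        # Annotation only: the metadata document stores the URI, no bytes move.
        return {"uri": uri, "tags": tags}

    def read(self, resource):
        uri = resource["uri"]
        if uri not in self.cache:
            parts = urlparse(uri)  # scheme=driver, netloc=store, path=file path
            driver = self.drivers[parts.scheme]
            self.cache[uri] = driver.fetch(parts.netloc, parts.path)
        return self.cache[uri]


s3 = InMemoryDriver({("lab-bucket", "/2016/img.tif"): b"pixels"})
blobs = BlobService({"s3": s3})

resource = blobs.register("s3://lab-bucket/2016/img.tif", organism="kelp")
reads_after_annotation = s3.reads   # still zero: annotating moved no bytes

data = blobs.read(resource)         # first access goes to the driver
data_again = blobs.read(resource)   # second access is served from the cache
```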

9.3.5 UNIFORM ACCESS AND OPERATIONS OVER DATA FILES: IMAGE SERVICE AND TABLE SERVICE Accessing bytes does not yet allow accessing functional information because scientific data are typically stored in a multitude of proprietary formats. Providing uniform access to those formats is another goal of a truly interoperable system. Because of the size of the data, an important goal is to keep them as close as possible to the computational infrastructure used to analyze and visualize them. For example, an image viewer can only show as many pixels as are available on the screen, which is usually a tiny fraction of the pixels available in the scientific image itself. At the same time, the visualization process typically needs to access a much larger number of pixels in order to compute the required view of the data. Similarly, visualization of a numeric table with a billion rows only needs to show a few hundred rows at a time. Thus the bandwidth required to prepare a view is much larger than that needed to deliver the view itself, which makes remote viewers practical. Analysis of these data, on the other hand, typically requires access to most of the bytes and thus faster access is desired. BisQue utilizes micro-services, each responsible for a specific logical data type. These services reside in data centers on fast hardware with very good network connectivity and typically on the same local network as the institutional data storage (e.g., CyVerse Atmosphere+iRODS or Amazon EC2+S3). Each service is able to handle multiple formats while providing a single uniform API offering all the functionality needed for visualization and analysis. In the following, we describe in more detail services that provide access to the most basic data types used in image informatics workflows.

Image service The image service provides access to the most important data type in image informatics: images. It offers support for more than 250 image formats by combining the most widely used decoding and encoding libraries: our own C++ libbioimage, OpenSlide, Imaris Convert, BioFormats, FFmpeg, GDCM, and others. This allows BisQue to support a wide gamut of life sciences imaging modalities, from 5D fluorescence, large EM connectomics data, behavioral video data, underwater imagery, GIS aerial and satellite imagery, and medical imaging CT/MRI/ultrasound to histopathology whole slide imaging. A large number of typical image processing operations can be requested in any sequence on input data. They include slicing and dicing, extracting tiles and resolutions, color, bit-depth, geometric, and spatial transformations, intensity projections, interpolations, fusions, histogram operations, and many others. All these operations are considered views of the original image data and never modify the original pixels. Another very important function of the image service is to extract and present in a uniform manner the metadata embedded in image files. Most modern microscopes embed a large number of acquisition and instrument parameters that can be useful for data interpretation and processing. Many other services, such as the feature service, image content indexing, and classification, use the image service for image data.

Table service Another common data type used in bio-imaging is the dense numeric table. Such tables are commonly used to store features extracted from images, like cells with many measured parameters, numeric descriptors, and other data. The table service provides uniform access to the most common formats like CSV, HDF-5, and Excel as well as to other services like Paradigm4 SciDB. Typical operations provided via the RESTful API are slicing and dicing and various transformations. Large numbers of graphical annotations can also be stored in these dense formats and later used for further computations or visualized at varying scale levels.

9.4 ANALYSIS MODULES Facilitating analysis within a database framework was the original BisQue motivation, and BisQue is unique in terms of providing such an integrated storage, analysis, and visualization environment. There are many disparate analysis packages widely used in life sciences. Bringing them all together in a simple integrated manner along with custom analysis routines proves very useful for creating custom workflows. The BisQue system makes it easy to rapidly script custom analyses and run them at a large scale. It also allows creating complex and custom user interfaces shareable with collaborators. In the following we describe ways of bringing custom and already available analysis modules to the system.

9.4.1 PYTHON AND MATLAB SCRIPTING Python has gained wide popularity in the scientific community and offers a large number of high-quality libraries, from native SciPy to easy-to-use Python bindings for libraries such as ITK [2]. Matlab has long been a popular language and many libraries are available to its users. Bringing these under the same umbrella may save a lot of development and validation time. BisQue offers an elegant way to bring these analyses from experimentation to large-scale execution. Any user can request a special authentication token for local analysis, which can then be used for an interactive remote session in either programming language. BisQue offers APIs for both Python and Matlab that simplify most aspects of communication with the system. An interactive session permits a developer to explore the system while using small data portions. Finalized scripts can easily be converted to BisQue modules and used for large-scale processing.

9.4.2 PIPELINE SUPPORT Instead of using low-level programming language logic, many scientific analysis tools rely on higher-level “execution pipelines” to describe the specific logic in which analysis steps are performed and what data flows from one step to the next. BisQue allows importing pipelines as first-order resources that can be annotated, searched, and viewed. Like with other supported resources, BisQue preserves the original pipeline files. Examples of supported pipelines include Dream.3D for materials science [3] and CellProfiler for biology [4]. Imported pipelines can contain placeholders for parameters to be filled in by the user when starting the module execution. The module user interface renders each of the placeholders as a run parameter and instantiates the pipeline accordingly before execution starts. As with other BisQue modules, these input parameters are preserved for later inspection as part of the MEXs.

9.4.3 COMPLEX MODULE EXECUTION DESCRIPTORS The BisQue MEX allows the specification of directed acyclic execution graphs where each step is run only when the nodes of all incoming edges have completed execution. A simple example is a graph with a prerun phase to perform some preprocessing, followed by five parallel runs of the main analysis phase, followed by a postrun phase to collect and summarize results. Computer-based analysis of scientific data oftentimes requires analysis steps to be run with many input parameter combinations in order to understand the effect on the generated output. For this purpose, BisQue’s module execution system allows specifying parameter ranges and automatically executes the module instances in parallel, one for each parameter combination. Depending on the underlying compute cluster system and the module execution graph, BisQue may either orchestrate the execution itself or push it down to the cluster scheduler. An example of the latter is the execution of a DAG of tasks by the Condor task scheduler. In either case, the system must ensure that tasks are scheduled only to nodes that satisfy the task’s requirements (e.g., sufficient main memory, availability of GPUs). The use of Docker containerization technology allows the utilization of heterogeneous clusters without the need to configure or preinstall software on each node. Because the parallel instances may update the BisQue metadata store at any time, care has to be taken to ensure that the data remain consistent without sacrificing parallelism. For example, the XML document describing the module run may contain many sections (one for each parameter combination). Since they are independent of each other, they can be updated in parallel without locking the entire document. For this purpose, the module execution relies heavily on the proper concurrency control of the underlying data service.
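The prerun/main/postrun example and the parameter-range expansion described above can be sketched with the Python standard library. The step names, the parameter names `threshold` and `sigma`, and the idea of collecting "waves" of ready steps are assumptions made for illustration; a real system would dispatch each wave to cluster nodes rather than record it in a list.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from itertools import product


def expand_parameters(ranges):
    """One parameter dictionary per combination of the given ranges."""
    names = sorted(ranges)
    return [dict(zip(names, combo))
            for combo in product(*(ranges[name] for name in names))]


def build_graph(combos):
    """Map each step to the set of steps that must finish before it."""
    main_steps = [f"main[{i}]" for i in range(len(combos))]
    graph = {"prerun": set(), "postrun": set(main_steps)}
    for step in main_steps:
        graph[step] = {"prerun"}
    return graph


def run_in_waves(graph):
    """Release a step only when all of its incoming edges are satisfied."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)   # steps in one wave could run concurrently
        sorter.done(*ready)
    return waves


combos = expand_parameters({"threshold": [0.1, 0.5], "sigma": [1, 2, 3]})
waves = run_in_waves(build_graph(combos))
```

With two thresholds and three sigmas the sweep yields six main instances, all released together in the second wave once the prerun step has completed, and the postrun step runs alone in the third.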

9.5 BUILDING ON THE CONCEPTS: SPARSE IMAGES Image mosaics (montages) are very popular with microscopists as a way to extend the instrument’s field of view. They have become a requirement for studies in connectomics, which generate images at very large scales. Microscopes equipped with automated stages and embedded vibratomes can produce volumetric images by iteratively slicing the tissue and providing an approximate location of each image in the physical volume. Light-sheet microscopes paired with modern clearing techniques enable scientists to image and study whole organs. In all of these cases, microscopes produce a large number of images, possibly with their approximate locations in 3D space. The typical workflow involves heavy processing: the location of each block within the volume is refined using automatic image registration, and a large dense image is then produced by geometrically transforming each input block into the output discrete volume before the whole image can be visualized and analyzed. The desire to keep the original images typically means doubling the storage requirement, and it becomes difficult to immediately identify which portion of the final volume a given image corresponds to. By contrast, BisQue utilizes a more dynamic and lazy approach by describing a “sparse” image composed of references to original images along with associated geometric transformations. Such a structure can represent a very complex multidimensional image. It can be rapidly imported and visualized in the system. Each image block in this construct can be independently transformed by scalable image services while the client-side image viewer decides what portions to present on the screen. Moreover, “sparse” images can grow and change over time as, for example, additional data blocks are acquired by the microscope. This allows large mosaics to be visualized as they are being acquired and/or geometrically refined. BisQue supports this by using a “mosaic” metadata document and a specialized service called Montager, which operates on a mosaic type to refine transformations, generate fused overlapping pixels for visualization, and eventually generate a large densified image in a background asynchronous process. Montager uses the image service extensively to create transformed derivatives of input image blocks and the feature service to compute point descriptors for geometric transformation refinements.
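A minimal sketch of the sparse-image idea follows. It assumes, purely for illustration, 2D blocks whose geometric transformation is a plain translation; the `Block` and `SparseImage` classes and the tile names are hypothetical and are not the structure of BisQue's mosaic document. The point is that the structure holds only references and transformations, so it can grow and be refined without touching pixels, and a viewer can ask which blocks intersect its viewport.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    uri: str      # reference to the original image; pixels are never copied
    width: int
    height: int
    dx: float     # translation placing the block in mosaic coordinates
    dy: float

    def bounds(self):
        return (self.dx, self.dy, self.dx + self.width, self.dy + self.height)


@dataclass
class SparseImage:
    blocks: list = field(default_factory=list)

    def add(self, block):
        self.blocks.append(block)   # the mosaic can grow during acquisition

    def refine(self, uri, dx, dy):
        for block in self.blocks:   # e.g., after image registration
            if block.uri == uri:
                block.dx, block.dy = dx, dy

    def visible(self, x0, y0, x1, y1):
        """References of blocks whose bounds overlap the viewport."""
        hits = []
        for block in self.blocks:
            bx0, by0, bx1, by1 = block.bounds()
            if bx0 < x1 and bx1 > x0 and by0 < y1 and by1 > y0:
                hits.append(block.uri)
        return hits


mosaic = SparseImage()
mosaic.add(Block("tile_a", 512, 512, dx=0, dy=0))
mosaic.add(Block("tile_b", 512, 512, dx=500, dy=0))

corner = mosaic.visible(0, 0, 256, 256)
overlap = mosaic.visible(480, 0, 600, 100)

mosaic.add(Block("tile_c", 512, 512, dx=0, dy=500))   # acquired later
mosaic.refine("tile_b", dx=520, dy=0)                 # registration update
after_refine = mosaic.visible(505, 0, 515, 10)
```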

9.6 FEATURE SERVICES AND MACHINE LEARNING Image recognition often requires basic feature extraction. More recently, deep learning-based pattern recognition techniques have demonstrated highly promising results in various computer vision applications. BisQue supports a diversity of feature computations through its feature service, and the next release of BisQue will include integration with deep learning architectures for scalable computations.

9.6.1 FEATURE SERVICE The BisQue feature service is responsible for the computation of numeric feature descriptors in a scalable and validated manner. It offers more than 100 commonly used numerical descriptors (HTD, EHD, SCD, DCD, SIFT, SURF, HAR, Wavelet Histogram, Radon Coefficients, Chebyshev Statistics, etc.) [5] computed on image derivatives (provided by the image service) as well as graphical annotations. In order to guarantee correctness we have integrated, validated, corrected, and rewritten code from multiple well-known libraries: OpenCV [6], Mahotas [7], MPEG7 [8], and WndChrm [9]. This provides a basic building block for any classifier by taking care of the creation of training and testing datasets. The same service can later be used at classification time to compute descriptors for data to be classified. A demonstration of this approach is available as the “Botanicam” classification and training modules [10].

9.6.2 CONNOISSEUR SERVICE FOR DEEP LEARNING The Connoisseur service is an integrated training and classification solution for image recognition based on deep learning (Fig. 9.4). Connoisseur uses convolutional neural networks (CNNs) in order to create a model directly from training data without feature selection. This enables any domain scientist to create a specialized classifier model directly from annotated still or video imagery without the need to know the engineering intricacies of classifier design. Any of BisQue’s organizational and filtering functions can be used to choose the training dataset, which could possibly include all of the available data. Connoisseur presents all classes found within the selected data and then shows how well the model performed against the automatically chosen testing part of the dataset (never seen during training). At this point a scientist can choose to discard certain classes if their performance was below the required level. During classification, each sample is given a confidence score computed from multiple measurements performed in the vicinity of the location of interest. The user can adjust a threshold on this score to skip low-confidence samples. Automated annotations can be validated and modified by a human expert, creating more data. These new samples can then be used to improve the model over time; a model is thus a dynamic object that is constantly updated. Not surprisingly, one of the most time-consuming parts of the training process is the preparation of the training, testing, and validation datasets. All image data must be in a specific format and size as defined by the training model, which usually means extracting small patches around annotated regions, possibly resizing those patches, and ensuring the same color space and profile as well as pixel depth in bytes. With hundreds of annotations per image, this operation is repeated millions of times for a typical dataset. This embarrassingly parallel processing is ideally handled by the scalable BisQue image service. Here, multiple asynchronous background Connoisseur processes request the image derivatives following the metadata found in the identified set and write the training database in parallel. Once the training database is created, a single multi-GPU server is used to effectively train the model. A typical dataset could be trained in a matter of a few hours on a modern 4-GPU (NVIDIA Titan X) server using the BisQue asynchronous background processing facility.

FIG. 9.4 Deep learning training and classification pipelines. Panel labels: learning pipeline, recognition pipeline, images, annotations, multimodal query selector, BisQue services, deep learner, GPU cluster.

A model metadata document is created to describe model files. This includes the classes detected in the identified dataset, numbers of samples per class, and their accuracies and errors once the training is done. This model document will be updated with every consecutive training session. Once the model is trained, data classification is quite efficient and is embarrassingly parallel. The slowest process here is the initialization of the GPU library and loading of the CNN model into the GPU memory. A typical CNN model’s size is about 500 MB (depending on the network topology) and therefore multiple models can be loaded at the same time.

9.6.3 CONNOISSEUR MODULE FOR DOMAIN EXPERTS The Connoisseur classification module offers a user-friendly interface, parallel execution over datasets, and permanent storage and provenance of resulting annotations. A user can choose a dataset to classify, a pretrained model, and a classification mode. There are three classification modes offered by Connoisseur. The first one creates uniformly or randomly distributed point annotations. It is designed to automate the widely used percent cover technique. Each automated point is marked with an accuracy measure and allows visual selection of the desired level. The second mode is a fast Voronoi partitioning of the image. It is useful for object-environment co-occurrence questions. For example, brittle stars over mud versus rock in underwater images or cancer cells near fat or blood vessels in microscopy data. While the previous two methods produce graphical annotations (vector data), the third one produces a mask image with a higher quality but slower semantic segmentation where the image is partitioned in classified regions. Each region has an associated accuracy measure and can also be pruned to a desired level.
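The first classification mode, automated percent cover, reduces to counting classified points after pruning those below the chosen level. The sketch below assumes each automated point carries a class label and a score in the range 0 to 1; the function name, the tuple layout, and the sample values are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def percent_cover(points, min_confidence):
    """Share of each class among points that pass the confidence threshold.

    `points` is a list of (label, confidence) pairs; this layout is an
    assumption for illustration, not Connoisseur's output format.
    """
    kept = [label for label, confidence in points if confidence >= min_confidence]
    if not kept:
        return {}
    counts = Counter(kept)
    return {label: 100.0 * n / len(kept) for label, n in counts.items()}


points = [
    ("kelp", 0.99),
    ("kelp", 0.97),
    ("rock", 0.96),
    ("mud", 0.60),   # pruned at the 0.95 level
    ("kelp", 0.98),
]
cover = percent_cover(points, min_confidence=0.95)
```

Note that the percentages are computed over the retained points only, so raising the threshold changes the denominator as well as the counts; whether that is the desired convention depends on the study design.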

9.7 APPLICATION EXAMPLE: ANNOTATION AND CLASSIFICATION OF UNDERWATER IMAGES Here we briefly describe an example application to marine sciences. Researchers at the Marine Science Institute, UCSB are using BisQue to store, manage, annotate, and develop automated analysis techniques for the Marine Biodiversity Observation Network project. Fig. 9.5 shows the BisQue annotation interface configured for percent cover with 100 sample points. The types (species) of annotations (visible in the right side text widget) are user-defined and can grow and change as needed over time. This enables continuous evolution of the annotations to fit the evolving needs of the project.



FIG. 9.5 Manual annotation interface configured for percent cover with 100 points. The types (species) of annotations (visible in the right side widget) are user-defined and can grow and change as needed over time. This enables continuous evolution of the annotations to fit the evolving needs of the project.

Many annotators are desirable because of the number of training images required. A typical dataset may contain about 300 different classes of interest and thus would need hundreds of thousands of training annotations. The BisQue UI allows splitting a dataset among a number of annotators. It also facilitates validation and acceptance of the annotations by an independent expert. One particular study included a dataset of over 2000 underwater images manually annotated for percent coverage of sessile species. Each image contained 100 annotated locations, amounting to >200 K data points with over 30 species. Over 80% of these data points are covered by the 11 most abundant classes. We obtained 85% classification accuracy on these 11 classes using two different feature aggregation techniques, one using CRF-based models and the other using a K-NN classification with dropout regularization [11]. The Connoisseur-based deep learning technique demonstrated an even higher average accuracy of 94.73% with an error of 3.65% on the same dataset [12], demonstrating the power of state-of-the-art CNN approaches. An example of the uniform percent cover classification automation at 95% confidence is presented in Fig. 9.6. Different classes are shown in different colors. Fig. 9.7 shows an example result for the third mode of classification, which produces segmented regions at the same 95% confidence.

FIG. 9.6 Automated point annotations imitating percent cover annotations at 95% confidence.

FIG. 9.7 Semantic segmentation at 95% confidence.

9.8 SUMMARY We presented an extensible image informatics platform, BisQue, for reproducible multimodal data analytics. While the initial motivation for BisQue came from the life sciences, these requirements cut across most scientific imaging applications. Some of the recent applications include marine sciences, materials science, medical imaging, and health care. BisQue is unique in its integration of multimodal databases with data analytics, making it possible to track the data and its processing, including provenance on the methods themselves. BisQue adopts the state of the art in web-based analytics and cloud computing, making it easy for end users to immediately take advantage of the latest methods. At the same time it enables researchers in computer vision, pattern recognition, and machine learning to work with diverse types of data at scale. BisQue is available as a core service through the CyVerse cyberinfrastructure as well as open source for download.

ACKNOWLEDGMENTS BisQue research and development are supported in part by the following grants/awards: NSF DBI#1356750, NSF DBI#1265383 through U. Arizona, and NSF ACI#1650972.

REFERENCES
[1] D. Merkel, Docker: lightweight Linux containers for consistent development and deployment, Linux J. 2014 (239) (2014).
[2] B. Lowekamp, D. Chen, L. Ibanez, D. Blezek, The design of SimpleITK, Front. Neuroinform. 7 (2013) 45.
[3] M.A. Groeber, M.A. Jackson, DREAM.3D: a digital representation environment for the analysis of microstructure in 3D, Integr. Mater. Manuf. Innovation 3 (1) (2014) 1–17.
[4] A. Carpenter, T. Jones, M. Lamprecht, C. Clarke, I. Kang, O. Friman, D. Guertin, J. Chang, R. Lindquist, J. Moffat, P. Golland, D. Sabatini, CellProfiler: image analysis software for identifying and quantifying cell phenotypes, Genome Biol. 7 (2006) R100.
[5] C. Wheat, D. Fedorov, K. Kvilekval, Feature service, (2010). https://biodev.ece.ucsb.edu/projects/bisque/wiki/Developer/FeatureService.
[6] G. Bradski, OpenCV library, Dr. Dobb’s Journal of Software Tools (2000).
[7] L. Coelho, Mahotas: open source software for scriptable computer vision, CoRR (2012).
[8] B.S. Manjunath, P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, Inc., New York, NY, 2002. ISBN 0471486787.
[9] L. Shamir, N. Orlov, D.M. Eckley, T.J. Macura, J. Johnston, I.G. Goldberg, Wndchrm: an open source utility for biological image analysis, Source Code Biol. Med. 3 (2008).
[10] C. Wheat, D. Fedorov, G. Abdollahian, K. Kvilekval, Botanicam: The Plant Recognizer, (2008).
[11] A.M. Rahimi, D.V. Fedorov, S. Sunderrajan, B.S. Manjunath, R.J. Miller, B.M. Doheny, H.M. Page, Marine biodiversity classification using dropout regularization, in: Workshop on Computer Vision for Analysis of Underwater Imagery: International Conference on Pattern Recognition, Stockholm, Sweden, 2014.
[12] D.V. Fedorov, K.G. Kvilekval, B.S. Manjunath, R.J. Miller, BisQue: cloud-based system for management, annotation, visualization, analysis and data mining of underwater and remote sensing imagery, in: Poster at Ocean Sciences Meeting, New Orleans, LA, 2016.


CHAPTER 10
Person re-identification

Marco Cristani*, Vittorio Murino†
*University of Verona, Verona, Italy
†Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Genova, Italy

Academic Press Library in Signal Processing, Volume 6. © 2018 Elsevier Ltd. All rights reserved.

10.1 INTRODUCTION

Re-identification (re-id) is a fundamental operation for the management of distributed surveillance systems, in which multiple cameras with nonoverlapping fields of view are deployed in large environments. It aims at associating people across camera views at different locations and times, and is tightly connected with long-term multicamera tracking and forensic search.

A typical re-identification system consists of three main processing stages. For the sake of clarity, we consider only two cameras, A and B, with nonoverlapping fields of view, although the concept can easily be extended to multiple cameras. We also assume that the system starts from scratch (no people have been processed beforehand).

In the first stage, dubbed collection, people $P_1, \ldots, P_M$ enter the scene monitored by camera A. During this period, the re-identification system collects several views of each $P_i$ and builds an ID signature; this stage ends with a repository of ID signatures called the gallery set. In other words, we collect the ID signatures $P_{(1,A)}, \ldots, P_{(M,A)}$, where the subscript A indicates that they are built from the appearance captured by camera A. The signature does not typically take biometric cues into account, but only the appearance of the subject derived from their clothes.

In the second stage, named model design, the system sets up the chosen approach to recognize the identity of every individual $P_1, \ldots, P_M$. In this stage, ID signatures can be manipulated to highlight their differences, either by extracting discriminative features or by projecting them onto discriminative subspaces. In short, a model architecture is selected and configured through operations that prepare a classification system for the actual recognition.

In the third stage, called matching, we suppose that subjects $P_1, \ldots, P_M$ enter the scene captured by camera B. The conditions, including the pose of the persons in this field of view, can be completely different from those of camera A. The goal is then to create another pool of signatures $P_{(1,B)}, \ldots, P_{(M,B)}$, the so-called probe set. For each probe image (ID signature), we look for a match with the (manipulated) signatures in the gallery $P_{(1,A)}, \ldots, P_{(M,A)}$. Ideally, we would have $P_{(i,A)} = P_{(i,B)} \rightarrow P_i$ for each $i$, that is, every $P_i$ captured by camera A is correctly identified in camera B.

This three-stage pipeline of collection, model design, and matching is present in almost every re-identification approach, although in the simplest cases the model design stage may be absent and direct matching is performed between the signatures. These stages presuppose a reliable person detection algorithm able to locate the person in the scene, and possibly also a tracking algorithm that can record a few consecutive frames of the same person. The overall re-id process clearly benefits from the availability of more instances of the same person in the gallery and/or probe sets, since this redundancy can be exploited to improve performance.

Two main operations can be identified in this pipeline: feature extraction and model training. Feature extraction is performed during the collection and matching stages, and serves to obtain the patterns related to each individual. Model training occurs during the model design stage, and uses training data to prepare the system for matching by setting up the space in which the gallery patterns are compared with the probe ones.

In the next section, a taxonomy of re-identification approaches is illustrated, taking into account the pipeline introduced above. In the remainder of the chapter, two strategies for feature extraction and model training are detailed. For feature extraction, we consider the Symmetry-based Descriptor of Accumulated Local Features (SDALF) approach [1], one of the best-known techniques for extracting robust features from pedestrians; the metric learning paradigm is described as an example of the model training operation. The idea of metric learning is to search for the optimal metric under which instances belonging to the same person are more similar, whereas instances belonging to different people are more different. These two strategies have been chosen because they are widely cited in the literature, and because they represent general approaches upon which many variations have been designed to date.

Two figures of merit are most relevant for evaluating a re-identification pipeline: the rank-n accuracy, in particular rank-1, and the CMC curve. Rank-n accuracy measures how often the right person in the gallery appears within a ranked list of n subjects. The list length n is typically 5, 10, or more, reflecting a realistic scenario in which the model reports a list of matches ordered from the most to the least likely, which a human operator visually inspects to confirm the true match, in the hope that the true match occurs high in the ranking. Rank-1 is the standard classification accuracy, and counts how many probe images are correctly matched with the corresponding gallery ID. As expected, high rank-1 scores are very hard to obtain: re-id treats each person as a class, interclass variation is typically rather low since many people wear similar clothes, and the severe changes of conditions between the views (illumination, camera calibration, etc.) make the task even more complex. The Cumulative Match Characteristic (CMC) curve is a plot of the recognition performance versus the ranking score and represents the expectation of finding the correct match in the top n matches. Other metrics that can be derived from the CMC curve include the scalar area under the curve (AUC) and the expected rank (on average, how far down the list the true match is).

10.2 THE RE-IDENTIFICATION PROBLEM


Considering the re-id pipeline discussed earlier, a first distinction concerns how the collection and matching phases are carried out. Almost all of the approaches in the literature assume that a prestored gallery dataset is already available, and that the probe set (the images collected by camera B) contains exactly the M people seen by camera A. This is the so-called closed world scenario, which essentially focuses on the many ways in which the model can be set up by adopting different model training methodologies. In terms of applicability to real situations, the closed world scenario models those settings in which, sooner or later, all the people seen by camera A are expected to be seen by camera B, as happens in a long one-way passage monitored by two cameras at its entrance and exit. This is a very constrained scenario, and it has been the most used so far because all re-id datasets contain the same number of persons in both the gallery and probe sets.

On the other hand, the more general and harder open world scenario admits that the people seen by camera B may not be the same as those seen by camera A. This happens when there is no unique path leading from the area observed by camera A to the zone covered by camera B. In practice, whereas in the closed world scenario a probe image is expected to find a match with a gallery ID, in the open world setting a subject seen by camera B might not be present in the gallery, in which case the re-id system has to produce a negative output. Needless to say, this is the most common scenario in any real multicamera surveillance system.

Another possible taxonomy comes from the number of (person) instances considered in the gallery and probe sets, and from how they are used in the processing pipeline. Two instances per person is the minimum number of images that can be considered (one for the gallery and one as probe), but if tracking is available, multiple instances may be used in the ID signature construction, in the model design, and in the matching phase as well. When single images are used as probes in the matching phase we speak of single-image re-identification; when there are multiple probe images per person, we speak of multi-image re-identification. A further distinction can be made by considering how many instances per person are in the gallery: here, too, we can have a single image or multiple images per person. Combining these possibilities for the probe and gallery sets, we obtain single vs. single, single vs. multiple, multiple vs. single, and multiple vs. multiple. These cases correspond to the single-shot and multishot modalities quoted in many works in the current literature. As mentioned before, the more instances per person there are in both the gallery and probe sets, the better the accuracy is likely to be.

Besides the scenarios described above, a taxonomy of re-id techniques can be given more classically from a methodological standpoint, as reported in the following.
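The scenarios above can be made concrete with a small sketch. It assumes that each signature is a set of feature vectors, that two sets are compared by the minimum Euclidean distance over all instance pairs, and that an open world system rejects a probe when its best distance exceeds a threshold. The function names, the set distance, and the threshold are illustrative assumptions, not choices prescribed by this chapter.

```python
import numpy as np

def set_distance(probe_feats, gallery_feats):
    """Minimum Euclidean distance over all probe/gallery instance pairs (assumed aggregation)."""
    diff = probe_feats[:, None, :] - gallery_feats[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def match(probe_feats, gallery, reject_threshold=None):
    """Return (best_id, ranked_ids); best_id is None when the probe is rejected.

    gallery maps an identity label to an array of shape (n_instances, dim).
    reject_threshold=None gives closed world behaviour (a match is always returned).
    """
    dists = {pid: set_distance(probe_feats, feats) for pid, feats in gallery.items()}
    ranked = sorted(dists, key=dists.get)
    best = ranked[0]
    if reject_threshold is not None and dists[best] > reject_threshold:
        return None, ranked          # open world: negative output
    return best, ranked

# Toy gallery: P1 has multiple instances, P2 a single one (hypothetical 2-D features).
gallery = {
    "P1": np.array([[0.0, 0.0], [0.1, 0.0]]),
    "P2": np.array([[1.0, 1.0]]),
}
known = np.array([[0.9, 1.1]])     # a probe close to P2
unknown = np.array([[5.0, 5.0]])   # a probe far from every gallery identity

print(match(known, gallery))                           # closed world, single vs. multiple/single
print(match(unknown, gallery))                         # closed world still forces a match
print(match(unknown, gallery, reject_threshold=1.0))   # open world rejects the probe
```

With `reject_threshold=None` the unknown probe is still assigned to its nearest gallery identity, which is exactly the failure mode that the open world setting has to handle.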

10.2.2 RELATED WORK

Re-identification techniques can coarsely be organized into two main classes: direct methods and learning-based methods. In the former, the algorithms look for the most discriminant features to compose a powerful descriptor for each individual (e.g., see [1–5]). In contrast, learning-based methods learn metric spaces in which to compare pedestrians, in order to guarantee high re-id rates (e.g., see [6–12]).

The class of direct approaches is the more standard and straightforward one, and has been in use since the earliest work. These methods mainly focus on the design of novel features for capturing the most distinguishing aspects of individuals. In [13], a descriptor is proposed that subdivides the person into horizontal stripes and keeps the median color of each stripe accumulated over different frames. A spatio-temporal local feature grouping and matching was proposed in [14], where a decomposable triangular graph is built that captures the spatial distribution of the local descriptors over time. The method proposed in [15] segments a pedestrian image into regions and registers their color spatial relationships in a co-occurrence matrix; this technique proved to work well when pedestrians are seen under small variations of the viewpoint. Hamdoun et al. [16] employed SURF interest points collected over short video sequences. Symmetry- and asymmetry-driven features are explored in [1, 17], based on the idea that features closer to the body's axes of symmetry are more robust against scene clutter. Features similar to the ones proposed there (i.e., maximally stable color regions (MSCR) and color histograms) have also been employed in [18]. This approach was then extended in [2] by matching signature features coming from a number of well-localized body parts, and manually weighting those parts on the basis of their saliency. Following the same idea, Bhuiyan et al. [4] devised a method that automatically assigns weights to the body parts on the basis of their discriminative power. In [19], a novel illumination-invariant feature descriptor was proposed based on the log-chromaticity (log) color space, demonstrating that color as a single cue performs relatively well in identifying persons under greatly varying imaging conditions.

In addition to color-based features, other features have proved promising, such as textures [8, 11, 20], edges [20], Haar-like features [21], interest points [16], image patches [8], and segmented regions [15]. These features can be extracted from horizontal strips [13], triangular graphs [14], concentric rings [25], symmetry-driven structures [1, 17], and horizontal patches [21]. An unconventional application of re-identification considers Pan-Tilt-Zoom cameras, where distances between signatures are also computed across different scales [22].


As learning-based examples, in [23], local and global features are accumulated over time for each subject and fed into a multiclass Support Vector Machine (SVM) for recognition and pose estimation, employing different learning schemes. In [24], pairwise dissimilarity profiles between individuals are learned and adapted for nearest neighbor classification. The combination of spatial and color information for viewpoint invariance, using an ensemble of discriminant localized features and classifiers selected by boosting, was the main idea proposed in [8]. Similarly, in [20], a high-dimensional signature composed of multiple features is projected into a low-dimensional discriminant latent space by Partial Least Squares (PLS) reduction. In [25], contextual visual knowledge is exploited by enriching a bag-of-words-based descriptor with features derived from neighboring people, assuming that people stay together across different cameras. Re-identification was also addressed as a binary classification problem (one vs. all) by Bak et al. [21], using Haar-like features and a part-based MPEG7 dominant color descriptor, and in [11] re-id is seen as a relative ranking problem in a higher dimensional feature space where true and wrong matches become more separable.

Recent approaches [12, 26–29] designed classifiers to learn low-dimensional projection spaces that force features from the same individual to be closer than features from different individuals. In practice, the classic metric learning paradigm has been developed in many forms. In [28], an adaptive boundary approach that jointly learns the metric and an adaptive thresholding rule was proposed for person re-identification. An alternative approach was to use a logistic function to approximate the hinge loss, so that the global optimum can still be reached by iterative gradient search along the projection matrix [12, 29]. Unfortunately, these methods proved to be prone to overfitting. In [27], local Fisher discriminant analysis (LFDA) was proposed, which gives a closed-form solution for the Mahalanobis matrix but requires PCA for dimensionality reduction. However, PCA can eliminate discriminant features, defeating the benefits of LFDA. By introducing a regularization term in the methods of [12, 28, 29], and by using a series of kernel-based techniques to learn nonlinear feature transformation functions [27] that preserve the discriminative features, Xiong et al. [26] reported better performance than the respective original methods.

Dictionaries learned from data have recently achieved impressive results in several classification and recognition problems [6, 9, 30], and they have also been proposed in the re-id context. Lisanti et al. [6] proposed a sparse representation ranking with iterative re-weighting. The approach uses soft and hard re-weighting to redistribute energy among the most relevant contributing elements and to ensure that the best candidates are ranked at each iteration. In [9], two semisupervised coupled dictionaries are learned for the probe and gallery sets. In the testing phase, the sparse representation of a test image is calculated using the "probe" dictionary, while it is recovered using the "gallery" dictionary. Liu et al. [31] considered multiple person instances for dictionary learning, but the algorithm did not explicitly require training a separate dictionary for each class (i.e., each person), instead exploiting the whole set. Jing et al. [30] proposed a semicoupled low-rank discriminant dictionary learning approach for super-resolution person re-id. The aim is to convert the features from low-resolution probe images into discriminant high-resolution gallery features. To this end, a pair of high- and low-resolution dictionaries is learned together with a mapping from the features of the high-resolution gallery to those of the low-resolution probe images. Finally, person re-identification is considered as a partial matching problem in [32], where a novel Ambiguity Sensitive Matching Classifier (AMC) based on sparse representation classification is proposed, which computes an ambiguity score at the patch level between a probe and each gallery patch.

More recently, direct and metric learning methods have been treated as a joint learning problem to further increase performance. For describing a person's appearance, many existing approaches extract features based on different color spaces and textures without considering which features are salient for a particular individual. Taking saliency into account was deemed to require both the direct and the learning-based approach. In this context, Liu et al. [9] presented an unsupervised method to weigh the importance of different features; the associated experimental results showed that the importance of features, including different color spaces and textures, differs under different conditions. Similarly, Figueira et al. [33] proposed a semisupervised multifeature learning approach that exploits the general framework of multifeature learning with manifold regularization in vector-valued Reproducing Kernel Hilbert Spaces (RKHS) in order to fuse different features together, thereby enforcing their mutual coherence. Traditional color information was noted not to be the optimal way of describing color. Thus, Kuo et al. [34] employed semantic color names, which are learned in [35], to describe color in a more meaningful way. Following the same pipeline, Yang et al. [36] proposed a novel salient color name-based descriptor for person re-id, where each color in RGB color space is represented over its salient color names. This guarantees that a higher probability is assigned to the color name closest to the intrinsic color. In [37], the maximization of the horizontal occurrences of local features is used as the feature representation, and a cross-view quadratic discriminant analysis (XQDA) metric learning technique, an extended version of Keep It Simple and Straightforward MEtric (KISSME) [38], is employed for matching. Recently, discriminative feature learning and metric learning have also been treated as a joint learning problem using deep convolutional architectures, with competitive performance [39]. However, as with many other deep learning architectures, generating a huge amount of labeled training data is an issue that needs to be addressed carefully.

In the following, we give a brief sketch of the main classes of techniques concerning the feature extraction and model training operations, which are detailed in the subsequent sections.

10.2.3 FEATURE EXTRACTION

A large number of feature types have been proposed for re-id, e.g., color, textures, edges, shape, global features, regional features, and patch-based features. In order to cope with the sparsity of data and the challenging viewing conditions, most person re-id methods benefit from integrating several types of features of complementary nature [1, 8, 11, 15, 17, 18, 20, 40–43]. Often, each type of visual feature is represented by a bag-of-words scheme in the form of a histogram. Feature histograms are then concatenated, with a weighting between the different feature types according to their perceived importance, i.e., based on some empirical or assumed discriminant power of certain types of features in distinguishing the visual appearance of individuals.

Spatial information about the layout of these features is also an important cue. However, there is a tradeoff between the granularity of the spatial decomposition, which provides more detailed information, and the increasing risk of misalignment between regions in the image pairs, which makes the match brittle. To integrate spatial information into the feature representation, images are typically partitioned into different segments or regions, from which features are extracted. Existing partitioning schemes include horizontal stripes [9, 11, 41, 43], triangulated graphs [14], concentric rings [25], localized patches [21, 44], and body parts [18].
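As an illustration of the stripe-based partitioning and weighted histogram concatenation described above, the following minimal sketch splits an image into horizontal stripes and concatenates normalized per-channel histograms. The function name, the number of stripes and bins, and the use of color channels as stand-ins for "feature types" are assumptions made for the example; the cited works use richer features.

```python
import numpy as np

def stripe_histograms(image, n_stripes=6, n_bins=8, weights=None):
    """Concatenate per-stripe, per-channel histograms of an H x W x C uint8 image.

    weights holds one (assumed) importance weight per channel.
    """
    n_channels = image.shape[2]
    if weights is None:
        weights = np.ones(n_channels)
    stripes = np.array_split(image, n_stripes, axis=0)   # horizontal bands
    parts = []
    for stripe in stripes:
        for c in range(n_channels):
            hist, _ = np.histogram(stripe[:, :, c], bins=n_bins, range=(0, 256))
            hist = hist / max(hist.sum(), 1)   # normalize so stripe size does not matter
            parts.append(weights[c] * hist)
    return np.concatenate(parts)

# Random stand-in for a 128 x 48 pedestrian crop (hypothetical data).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(128, 48, 3), dtype=np.uint8)
desc = stripe_histograms(img, n_stripes=6, n_bins=8, weights=np.array([1.0, 1.0, 0.5]))
print(desc.shape)   # 6 stripes x 3 channels x 8 bins
```

Increasing `n_stripes` gives a finer spatial layout but makes the descriptor more sensitive to vertical misalignment between two views of the same person, which is the tradeoff noted above.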

10.2.4 MODEL LEARNING

If camera pair correspondences are known, one can learn a transfer function that models camera-dependent photometric or geometric transformations. In particular, a photometric function captures the changes in the color distribution of objects moving from one camera view to another. The changes are mainly caused by different lighting and viewing conditions. Geometric transfer functions can also be learned from the correspondences of interest points. Following the work of Porikli [45], a number of studies have proposed different ways of estimating the brightness transfer function (BTF) [11, 46–50]. The BTF can be learned either separately on different color channels, or by taking into account the dependencies between channels [49]. Some BTFs are defined for each individual, while other studies learn a cumulative function on the full available training set [11].

A popular alternative to color transformation learning is distance metric learning. The idea of distance metric learning is to search for the optimal metric under which instances belonging to the same person are more similar, and instances belonging to different people are more different. It can be considered a data-driven feature importance mining technique [9] that suppresses cross-view variations. Existing distance metric learning methods for re-identification include Large Margin Nearest Neighbor (LMNN) [51], Information Theoretic Metric Learning (ITML) [52], Logistic Discriminant Metric Learning (LDML) [53], KISSME [38], RankSVM [11], and Probabilistic Relative Distance Comparison (PRDC) [41]. Early metric learning methods [51, 52] are relatively slow and data hungry. More recently, re-identification research has driven the development of faster and lighter methods [38, 54].
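To make the idea of distance metric learning concrete, here is a minimal sketch in the spirit of KISSME [38]: the Mahalanobis matrix is taken as the difference between the inverse covariance of same-person pair differences and that of different-person pair differences, then projected onto the positive semidefinite cone. The function names, the ridge term `eps`, and the synthetic data are assumptions for illustration; this is not the reference implementation of [38].

```python
import numpy as np

def kissme(X, labels, eps=1e-6):
    """Learn a Mahalanobis matrix from all pairs of labelled feature vectors (sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(X), k=1)           # every unordered pair once
    diffs = X[i] - X[j]
    same = labels[i] == labels[j]
    dim = X.shape[1]

    def scatter(D):
        # Covariance of pairwise differences; the small ridge (assumption) keeps it invertible.
        return D.T @ D / len(D) + eps * np.eye(dim)

    M = np.linalg.inv(scatter(diffs[same])) - np.linalg.inv(scatter(diffs[~same]))
    # Clip negative eigenvalues so that M defines a valid (pseudo-)metric.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def mahalanobis_sq(M, a, b):
    """Squared distance (a - b)^T M (a - b)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff @ M @ diff)

# Synthetic data: dimension 0 carries identity, dimension 1 is a large nuisance term.
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(20), 4)                                  # 20 people, 4 instances each
X = np.column_stack([ids + 0.1 * rng.standard_normal(ids.size),    # identity cue
                     5.0 * rng.standard_normal(ids.size)])         # view-dependent nuisance
M = kissme(X, ids)

i, j = np.triu_indices(len(X), k=1)
d = np.array([mahalanobis_sq(M, X[a], X[b]) for a, b in zip(i, j)])
same = ids[i] == ids[j]
print("mean same-person distance:     ", d[same].mean())
print("mean different-person distance:", d[~same].mean())
```

In this toy setting the learned metric downweights the nuisance dimension, so same-person pairs end up much closer than different-person pairs even though their raw Euclidean distances are dominated by the nuisance term. The distances here are computed on the training pairs only, which is enough to show the mechanism but says nothing about generalization.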

10.3 EXPERIMENTAL EVALUATION OF RE-ID DATASETS AND THEIR CHARACTERISTICS

There are a number of publicly available re-id datasets that are considered both challenging and realistic for evaluating the performance of re-id methods. The characteristics that define the uniqueness of a particular dataset are: the number of subjects considered, the number of instances per subject, pose and illumination variations, severity of occlusions, number of different camera views, and the resolution of the captured images. As mentioned earlier, based on the number of instances per subject in the dataset, the re-id problem can be approached in two modalities, single-shot and multishot. In the former case, there are only two instances, one for the probe and one for the gallery. In the latter case, there are multiple instances per subject, which may be used to extract more robust information about the subject. Among all the datasets proposed in the literature, only a few are widely used; their characteristics are illustrated in the following.

The VIPeR dataset [8] is specifically made for viewpoint-invariant pedestrian re-identification. The significant amount of viewpoint and illumination variation makes it extremely interesting for evaluating the performance of an appearance-based model in a realistic surveillance system. It contains 632 pedestrian image pairs taken from two cameras, captured from arbitrary viewpoints under varying illumination conditions. The data were collected in an academic setting over the course of several months. Each image is scaled to 128 × 48 pixels. This dataset has been widely used and is considered one of the reference benchmarks for pedestrian re-identification.

Modeling pedestrian appearance by direct methods does not work as well on VIPeR as on other benchmark datasets. One of the main problems of this dataset is that it contains only one image per person in each camera view, so that only the single-shot modality can be adopted. To model appearance robustly, most of the state of the art [1, 2, 4, 18] relies on multiple instances per person, which allows a multishot approach. The results reported in Table 10.1 support this claim: direct methods score lower than the metric learning-based methods reported in Table 10.2. For the same reason, appearance transfer-based methods such as [55, 56] still do not perform well. Evidence of this has been explored experimentally in [57], where exploiting multiple detections for transferring brightness appearance performs significantly better than adopting only a single detection.

Notable characteristics of metric learning-based techniques have been explored in [26]. The drawback of most metric learning-based methods [27, 38, 58] is their high computational complexity, mainly due to the large dimensionality of the feature vector. To overcome this problem on VIPeR (and on CAVIAR4REID too), the kernel trick is adopted to handle high-dimensional feature vectors while maximizing a Fisher optimization criterion; the resulting kernel Local Fisher Discriminant Analysis (kLFDA) [26] performed relatively well. Moreover, direct methods and metric learning were combined and formulated as a joint learning problem in [37] to increase performance further. In [37], the Local Maximal Occurrence (LOMO) feature is estimated by maximizing the horizontal occurrence of local features, and an XQDA metric learning technique is employed for matching. The combination of LOMO and XQDA performed well, setting a new state of the art for the VIPeR dataset, as reported in Table 10.2.
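The matching rates at rank r reported in the tables of this section are points on the CMC curve. The following sketch shows one way to compute the CMC curve, rank-n accuracy, a normalized area under the curve, and the expected rank from a probe-gallery distance matrix. It assumes the closed world setting with exactly one gallery entry per probe identity, and it takes the mean of the CMC values as the nAUC, which is one common discretization and not necessarily the one used by the cited works; the function name and toy distances are hypothetical.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids):
    """dist[p, g] is the distance between probe p and gallery entry g.

    Assumes every probe identity occurs exactly once in the gallery (closed world).
    Returns (cmc, ranks), where cmc[n - 1] is the rank-n matching rate in [0, 1].
    """
    dist = np.asarray(dist, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)                     # best match first
    hits = gallery_ids[order] == probe_ids[:, None]
    ranks = hits.argmax(axis=1) + 1                      # 1-based rank of the true match
    cmc = np.array([(ranks <= n).mean() for n in range(1, len(gallery_ids) + 1)])
    return cmc, ranks

dist = np.array([[0.1, 0.5, 0.9],    # probe "a": true match ranked 1st
                 [0.4, 0.6, 0.2],    # probe "b": true match ranked 3rd
                 [0.7, 0.8, 0.3]])   # probe "c": true match ranked 1st
cmc, ranks = cmc_curve(dist, ["a", "b", "c"], ["a", "b", "c"])
print("rank-1 accuracy:", cmc[0])
print("CMC:", cmc)
print("nAUC:", cmc.mean())
print("expected rank:", ranks.mean())
```

By construction the CMC curve is nondecreasing and reaches 1 at rank M in the closed world setting, so methods are usually compared on its first few values.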

Table 10.1 Experiment on VIPeR Dataset Reported by Top-Ranked Matching Rate (%) for Direct Approaches

| Methods   | r = 1 | r = 5 | r = 10 | r = 20 |
|-----------|-------|-------|--------|--------|
| SDALF [1] | 20.00 | 38.00 | 48.5   | 65     |
| CPS [2]   | 23.89 | 46.08 | 59.02  | 73.16  |
| SCA [4]   | 25.92 | 49.91 | 61.74  | 73.99  |
| WBTF [55] | 21.99 | 46.84 | 59.97  | 75.95  |
| ICT [56]  | 14.4  | 41.61 | 59.70  | 71.2   |

Table 10.2 Experiment on VIPeR Dataset Reported by Top-Ranked Matching Rate (%) for Metric Learning-Based Methods

| Methods        | r = 1 | r = 5 | r = 10 | r = 20 |
|----------------|-------|-------|--------|--------|
| ELF [8]        | 12.00 | 31.50 | 44.00  | 61     |
| KISSME [38]    | 22.94 | 55.21 | 62.20  | 77.00  |
| LF [27]        | 24.18 | 51.2  | 67.12  | 82.00  |
| kLFDA [26]     | 32.2  | 65.80 | 79.70  | 90.90  |
| KCCA [58]      | 37.3  | 71.4  | 84.60  | 92.30  |
| LOMO+XQDA [37] | 40.00 | 68.90 | 80.5   | 91.1   |

The CAVIAR4REID dataset [18] contains images of pedestrians extracted from the CAVIAR repository [18], providing a challenging real-world setup. This is the only publicly available dataset where the intra-person images vary a lot in terms of resolution, light, pose, and so on. The main challenge of this dataset is its broad range of resolutions: the minimum and maximum image sizes are 17 × 39 and 72 × 144 pixels, respectively. Severe pose variations make it particularly challenging as well. Among the 72 identified individuals, 50 are captured by two cameras with 20 images each, and 22 by one camera only with 10 images each. In practice, the dataset selects a set of images for each camera view and each pedestrian so as to maximize the variance with respect to resolution changes, light conditions, occlusions, and pose changes, thereby increasing the complexity of the re-id task.

Since CAVIAR4REID is one of the most challenging datasets for re-id, few processing pipelines can deal with it robustly. Direct approaches to modeling appearance on this dataset gained considerable attention due to their robust appearance representations. This line of appearance-based methods started mainly with SDALF [1], a symmetry-driven method that automatically segments salient body parts and accumulates features, making the descriptor more robust to appearance variations. Following the same idea, Cheng and Cristani [2], Farenzena et al. [1], Bazzani et al. [3], Bhuiyan et al. [4], and Kviatkovsky et al. [5] introduced different appearance-based methods to increase the performance on this dataset. Appearance transfer-based methods have less effect on performance on this dataset since there is not much illumination variation. The appearance transfer method considered in [56] (named ICT) works well when the number of training samples used to learn the transfer function is large compared with the corresponding test set. Nevertheless, the brightness appearance transfer methods proposed in [55, 57, 59–61] work better than ICT [56] with fewer training samples, which is more realistic for re-id. The experimental results reported in Tables 10.3 and 10.4 support these claims.

Table 10.3 Experiment on CAVIAR4REID Dataset Reported by Top-Ranked Matching Rate (%) and Normalized Area Under Curve (nAUC) (%) for Direct Methods

| Methods             | r = 1 | r = 5 | r = 10 | r = 20 | nAUC  |
|---------------------|-------|-------|--------|--------|-------|
| SDALF [1]           | 11.5  | 38.5  | 62     | 83     | 73.80 |
| CPS [2]             | 20.25 | 53    | 71     | 89.25  | 82.01 |
| SCA [4]             | 22.75 | 59.25 | 71.5   | 89.25  | 82.63 |
| WBTF [55]           | 16.25 | 45.5  | 65.5   | 83.75  | 77.79 |
| ICT [56]            | 8     | 32.75 | 52.25  | 77     | 71.19 |
| CWBTF with CPS [57] | 21.75 | 55.10 | 74     | 91.63  | 83.28 |
| CWBTF with SCA [57] | 23.95 | 59.05 | 75.13  | 90.52  | 83.82 |

Table 10.4 Experiment on CAVIAR4REID Dataset Reported by Top-Ranked Matching Rate (%) for Metric Learning-Based Methods

| Methods     | r = 1 | r = 5 | r = 10 | r = 20 |
|-------------|-------|-------|--------|--------|
| PCCA [29]   | 33.00 | 67.2  | 83.10  | 95.7   |
| LFDA [27]   | 31.7  | 56.1  | 70.40  | 86.90  |
| SVMML [28]  | 25.8  | 61.40 | 78.6   | 93.60  |
| KISSME [38] | 31.4  | 61.90 | 77.80  | 92.90  |
| rPCCA [26]  | 34.00 | 67.5  | 83.40  | 95.80  |
| kLFDA [26]  | 35.90 | 63.60 | 77.90  | 91.2   |

The PRID2011 dataset [62] consists of images extracted from trajectories recorded by two static outdoor cameras. Images from these cameras are characterized by viewpoint changes and sharp differences in illumination, background, and camera characteristics. The images also have relatively clean backgrounds and few occlusions. Camera view A shows 385 persons and camera view B shows 749 persons; the first 200 persons appear in both camera views. Each person appears in multiple instances per camera. Direct approaches work relatively well on PRID2011 compared with VIPeR and CAVIAR4REID. One possible explanation is that the images in PRID2011 have higher resolution, cleaner backgrounds, and fewer occlusions than those in VIPeR and CAVIAR4REID. The presence of multiple instances per person further helps to model the appearance robustly. The experimental results reported in Fig. 10.1 and Table 10.5 provide evidence for these claims.

[Figure 10.1: CMC plot. x-axis: Rank Score; y-axis: Recognition Rate (%). Legend: CWBTF with SCA (nAUC 90.27), CWBTF with CPS (nAUC 89.05), SCA (nAUC 85.54), CPS (nAUC 83.01), WBTF (nAUC 82.20), SDALF (nAUC 72.65), ICT (train with 5) (nAUC 51.84).]

FIG. 10.1 CMC and nAUC of direct approaches on the PRID2011 dataset.

Table 10.5 Experiment on PRID2011 Dataset Reported by Top-Ranked Matching Rate (%) for Metric Learning-Based Methods

| Methods       | r = 1 | r = 5 | r = 10 | r = 20 |
|---------------|-------|-------|--------|--------|
| Salience [44] | 25.80 | 43.60 | 52.60  | 62.00  |
| DVR [63]      | 28.90 | 55.30 | 65.50  | 82.8   |
| RankSVM [64]  | 22.40 | 51.9  | 62.00  | 80.70  |
| LFDA [27]     | 22.30 | 41.70 | 51.6   | 62.00  |
| AFDA [65]     | 43.00 | 72.70 | 84.60  | 91.90  |

Since there is a sharp difference in illumination between the images from the two cameras, applying the BTF to map the illumination from one camera to the other is expected to improve the re-id performance.



CHAPTER 10 Person re-identification

Experimental findings reported in Fig. 10.1 support this claim: the robust brightness transfer method proposed in [57] performs well, outperforming all the related works in the literature. Since the ICT [56] method does not depend on illumination variations but depends heavily on a large number of training samples, its performance on PRID2011 is quite poor when few training samples are used, a setting chosen to keep the comparison with CWBTF [57] fair and to reflect realistic re-id scenarios (see Fig. 10.1).

The SAIVT-SoftBio dataset [66] includes annotated sequences (704 × 576 pixels, 25 frames per second) of 150 people, each of whom is captured by a subset of eight different cameras placed inside an institute, providing various viewing angles and varying illumination conditions; images in this dataset are rarely occluded. A coarse bounding box indicating the location of the annotated person in each frame is provided. Most state-of-the-art methods are evaluated on this dataset for two camera pairs: cameras 3–8, which have similar viewpoints, and cameras 5–8, which have very different viewpoints with a large viewing angle difference. For the similar-view case (camera pair 3–8), it is easy to model the appearance since there is not much viewpoint variation between the images in this camera pair. In addition, the images rarely suffer from occlusions. It is therefore reasonable to expect that direct approaches perform better on this camera pair than on the other datasets or camera pairs considered. The experimental evidence in Fig. 10.2 and Table 10.6 verifies this argument: on the similar-view camera pair, direct approaches even outperform the metric learning-based methods, which is highly unlikely for dissimilar views.

As with PRID2011, images in the SAIVT-SoftBio dataset are characterized by varying illumination across cameras. Consequently, adopting brightness transfer to map the illumination from one camera to another directly influences the re-id performance (as shown in Fig. 10.1). However, ICT [56] does not rely much upon illumination variations but, rather, depends strongly on a large number of training samples. For a fair comparison with CWBTF [57], when fewer training samples are used for both the similar and dissimilar views of the SAIVT-SoftBio dataset, ICT performs worst among all the related state-of-the-art methods, as shown in Fig. 10.2.

Most of the metric learning-based techniques [27, 38, 58] are designed to learn a feature space where images belonging to the same person stay close while images belonging to different people are far apart. These methods suffer either from computational complexity or from the difficulty of selecting representative samples that cover the diversity of a person's appearance. To overcome these limitations, a multishot-based Adaptive Fisher Discriminant Analysis (AFDA) method is proposed in [65], where LFDA is adapted to maximize interclass distance and minimize intraclass difference while preserving local structures. It also integrates the Fisher Guided Hierarchical Clustering (FGHC) algorithm to select representative samples from each class and maintain diversity based on the Fisher criterion. The superiority of AFDA is reported in Tables 10.5 and 10.6 for the PRID2011 and SAIVT-SoftBio datasets, respectively. Therefore, it is

FIG. 10.2 CMC and nAUC of direct method-based approaches on the SAIVT-SoftBio dataset. [Two plots of recognition rate (%) versus rank score. SAIVT3-8 (Similar View) legend: CWBTF with SCA (nAUC 96.01), CWBTF with CPS (nAUC 95.43), SCA (nAUC 93.44), CPS (nAUC 92.64), WBTF (nAUC 80.18), SDALF (nAUC 83.20), ICT (train with 5) (nAUC 60.90). SAIVT5-8 (Dissimilar View) legend: CWBTF with SCA (nAUC 87.57), CWBTF with CPS (nAUC 87.07), SCA (nAUC 82.44), CPS (nAUC 82.34), WBTF (nAUC 78.99), SDALF (nAUC 81.54), ICT (train with 5) (nAUC 52.24).]

Table 10.6 Experiment on SAIVT-SoftBio Dataset Reported by Top-Ranked Matching Rate (%) of Metric Learning-Based Methods

                 SAIVT 3-8 (Similar View)              SAIVT 5-8 (Dissimilar View)
Methods          r = 1    r = 5    r = 10   r = 20     r = 1    r = 5    r = 10   r = 20
Fused [66]       36.40    60.30    76.00    87.60      20.00    33.00    50.40    67.80
PFDS [67]        33.20    60.50    74.00    87.20      18.60    32.90    53.00    85.3
RankSVM [64]     32.40    68.40    82.00    92.90      14.90    40.5     57.9     75.00
LFDA [27]        12.20    36.80    57.76    74.90      9.30     27.10    41.20    60.06
AFDA [65]        44.40    77.40    89.40    95.90      30.09    61.60    77.30    91.10

worth asserting that multishot datasets such as SAIVT-SoftBio and PRID2011 are well-studied datasets that are suitable both for metric learning and for direct approaches that model the appearance robustly.

10.4 THE SDALF APPROACH

Once an individual has been detected and isolated within a bounding box in one or more frames, a characteristic descriptor (i.e., a signature) can be estimated. The SDALF descriptor is one of the standard and most widely used descriptors, and it remains effective. The process for building the signature differs slightly depending on the modality considered, i.e., single- or multiple-shot. It consists of three phases:

1. Object segmentation separates the pixels of the individual (foreground) from the rest of the image (background);
2. Symmetry-based silhouette partition identifies perceptually salient body portions;
3. Symmetry-driven accumulation of local features composes the signature as an ensemble of features extracted from the body parts.

In the following, each step is described and analyzed, focusing on the differences between the single-shot and multishot modalities.

10.4.1 OBJECT SEGMENTATION

The aim of this phase is to separate the genuine body appearance from the rest of the scene. This allows the descriptor to focus solely on the individual, disregarding the surrounding context. We suppose that in a real scenario a person can be captured at completely different locations, such as the arrival hall of an airport or a parking lot. In the case of a sequence of consecutive images, the object/scene classification may be performed by any background subtraction strategy. In the case of a single image, the separation is performed by Stel Component Analysis (SCA) [68].

SCA relies on the notion of "structure element" (stel), which can be understood as an image portion (often discontinuous) whose topology is consistent over an image class. This means that in a set of given objects (faces or pedestrian images), a stel identifies the same part over all the instances (e.g., the eyes in a set of faces, the body in a set of images each containing a single pedestrian). In other words, an image can be seen as a segmentation, where each segment is a stel. SCA enriches the stel concept as it captures the common structure of an image class by blending together multiple stels: it assumes that each pixel measurement $x_i$, with its 2D coordinate $i$, has an associated discrete variable $s_i$, which takes a label from the set $\{1, \ldots, S\}$. Such a labeling is generated from $K$ stel priors $p_k(s_i)$, which capture the common structure of the set of images. The model detects the image self-similarity within a segment: the pixels with the same label $s$ are expected to follow a tight distribution over the image measurements. Instead of the local appearance similarity, the model insists on consistent segmentation via the stel prior. Each component $k$ represents a characteristic (pose or spatial configuration) of the object class at hand, and other poses are obtained through blending these components.

We set $S = 2$ (i.e., foreground/background) and $K = 2$ (e.g., upper/lower body parts), modeling the distribution over the image measurements as a mixture of Gaussians because we want to capture segments with multiple color modes within them. SCA is learnt beforehand on a generic person database that does not include the experimental data, and the segmentation of new samples consists in a fast inference. Each Expectation-Maximization iteration of the inference algorithm takes on average 18 ms¹ when dealing with images of size 48 × 128. In our experiments, we set the number of iterations to 100 to ensure that the learning process reached a local optimum of the likelihood function. In practice, we saw that 10–20 iterations are enough in most cases.

¹ We used the authors' MATLAB code [68] on a quad-core Intel Xeon E5440, 2.83 GHz with 4 GB of RAM.

10.4.2 SYMMETRY-BASED SILHOUETTE PARTITION

Background/foreground segmentation is used to eliminate the background information and also to subdivide the human body into salient parts, exploiting asymmetry and symmetry principles. Considering a pedestrian acquired at very low resolution (see Fig. 10.4), it is easy to note that the three most distinguishable parts are head, torso, and legs. Focusing on such parts is thus reasonable, and they can be detected by observing natural symmetry/asymmetry properties in the human body appearance. In addition, the relevance of head, torso, and legs as salient regions for human characterization also emerged from the boosting approach proposed by Gray and Tao [69].

Let us define the chromatic bilateral operator as:

$$C(i, \delta) \propto \sum_{B_{[i-\delta,\, i+\delta]}} d^2(p_i, \hat{p}_i),$$

where $d(\cdot, \cdot)$ is the Euclidean distance, evaluated between HSV pixel values $p_i$, $\hat{p}_i$ located symmetrically with respect to the horizontal axis at height $i$. The sum is over $B_{[i-\delta,\, i+\delta]}$, that is, the foreground region lying in the box of width $J$ and vertical extension $2\delta + 1$ around $i$, as depicted in Fig. 10.3. The value of $\delta$ is experimentally set to $I/4$, where $I$ is the image height.

Let us also introduce the spatial covering operator, which calculates the difference of foreground areas for two regions:

$$S(i, \delta) = \frac{1}{J\delta} \left| A\left(B_{[i-\delta,\, i]}\right) - A\left(B_{[i,\, i+\delta]}\right) \right|,$$

where $A(b)$ is a function that computes the foreground area in a given box $b$ and $J$ is the image width.

FIG. 10.3 Silhouette partition: first, the asymmetrical axis $i_{TL}$ is extracted, followed by $i_{HT}$; afterwards, for each region $R_k$, $k = \{1, 2\}$, the symmetrical axes $j_{LR_k}$ are computed.




FIG. 10.4 Images of individuals at different resolutions (from 64 × 128 to 11 × 22) and examples of foreground segmentation and symmetry-based partitions.

Suitably combining the two operators $C$ and $S$ enables us to find the axes of symmetry and asymmetry. To locate the horizontal asymmetry axes, we want to maximize the difference in appearance and the similarity between foreground areas. Therefore, the main x-axis of asymmetry (usually the torso-legs axis) is located at height $i_{TL}$ by solving the following problem:

$$i_{TL} = \arg\min_i \left( (1 - C(i, \delta)) + S(i, \delta) \right).$$

The values of $C$ are normalized by the number of pixels in the region $B_{[i-\delta,\, i+\delta]}$. The search for $i_{TL}$ is carried out in the interval $[\delta, I - \delta]$: $i_{TL}$ usually separates the two biggest body portions characterized by different colors (corresponding to t-shirt/pants or suit/legs, for example).

The other x-axis of asymmetry (usually the shoulders-head axis) is positioned at height $i_{HT}$. The goal is to find a local gradient variation in the foreground area:

$$i_{HT} = \arg\min_i \left( -S(i, \delta) \right).$$

The search for $i_{HT}$ is limited to the interval $[\delta, i_{TL} - \delta]$. Once $i_{HT}$ and $i_{TL}$ have been computed, three regions of interest $R_k$, $k = \{0, 1, 2\}$, are isolated, approximately corresponding to head, body, and legs, respectively. For re-identification purposes, it is common to discard the information of the head/face region because standard biometric algorithms usually fail at low resolution.


Therefore, here $R_0$ is discarded. For each part $R_k$, $k = \{1, 2\}$, a (vertical) symmetry axis is estimated in order to localize the areas that most probably belong to the human body, i.e., pixels near the symmetry axis. In this way, the risk of considering background clutter is minimized. To this end, both the chromatic and the spatial covering operators are used on $R_1$ and $R_2$. The y-axes of symmetry $j_{LR_k}$ ($k = 1, 2$) are obtained as follows:

$$j_{LR_k} = \arg\min_j \left( C(j, \delta) + S(j, \delta) \right).$$

$C$ is evaluated on the foreground region whose height is that of $R_k$ and whose width is $\delta$ (see Fig. 10.3). The goal is to search for regions with similar appearance and area. In this case, $\delta$ is proportional to the image width, and it is heuristically fixed to $J/4$. Some results of the optimization process applied to images at different resolutions are shown in Fig. 10.4. As one can observe, the subdivision isolates corresponding portions independently of the assumed pose and the adopted resolution.
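The two operators and the axis searches described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the discretization of the boxes (δ rows on each side of the candidate axis), and the division by 3 that keeps $C$ in $[0, 1]$ for HSV values in $[0, 1]$ are all assumptions made for illustration. The head-torso axis $i_{HT}$ would be searched analogously by minimizing $-S$ and is omitted here.

```python
import numpy as np

def chromatic(hsv, mask, i, delta):
    """C(i, delta): mean squared HSV distance between foreground pixels mirrored about row i."""
    total, count = 0.0, 0
    for k in range(1, delta + 1):
        both = mask[i - k] & mask[i + k]            # foreground in both mirrored rows
        diff = hsv[i - k][both] - hsv[i + k][both]
        total += float((diff ** 2).sum())
        count += int(both.sum())
    # Assumption: dividing by 3 (three channels in [0, 1]) keeps C in [0, 1].
    return total / (3.0 * count) if count else 0.0

def covering(mask, i, delta):
    """S(i, delta): normalized difference of the foreground areas above and below row i."""
    width = mask.shape[1]
    above = int(mask[i - delta:i].sum())
    below = int(mask[i:i + delta].sum())
    return abs(above - below) / (width * delta)

def torso_legs_axis(hsv, mask, delta):
    """i_TL: row minimizing (1 - C) + S over the interval [delta, I - delta)."""
    rows = range(delta, mask.shape[0] - delta)
    return min(rows, key=lambda i: (1.0 - chromatic(hsv, mask, i, delta))
                                   + covering(mask, i, delta))

def symmetry_axis(hsv, mask, delta):
    """j_LR: column minimizing C + S, obtained by transposing the body part."""
    hsv_t, mask_t = hsv.transpose(1, 0, 2), mask.T
    cols = range(delta, mask_t.shape[0] - delta)
    return min(cols, key=lambda j: chromatic(hsv_t, mask_t, j, delta)
                                   + covering(mask_t, j, delta))
```

Here `hsv` is an array of shape (I, J, 3) and `mask` is the boolean foreground mask of shape (I, J) produced by the segmentation step; in the setting of the text, `delta` would be I/4 for the horizontal axes and J/4 for the vertical ones.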

10.4.3 SYMMETRY-DRIVEN ACCUMULATION OF LOCAL FEATURES

The SDALF descriptor is computed by extracting features from each of the parts $R_1$ and $R_2$. The goal is to distill as many complementary aspects as possible in order to encode heterogeneous information, hence capturing distinctive characteristics of the individuals. Each feature is extracted by taking into account its distance with respect to the $j_{LR_k}$ axes. The basic idea is that locations far from the symmetry axis belong to the background with higher probability. Therefore, features coming from those areas have to be (a) weighted accordingly, or (b) discarded. Depending on the considered features, one of these two mechanisms is applied.

There are many possible cues useful for a fine visual characterization. Considering the previous literature in human appearance modeling, features may be grouped by the kind of information they focus on, that is, chromatic (histograms), region-based (blobs), and edge-based (contours, textures) information. SDALF considers one feature for each aspect.

Weighted color histograms. The chromatic content of each part of the pedestrian is encoded by color histograms. We evaluated different color spaces, namely, HSV, RGB, normalized RGB (where each channel is normalized by the sum of all the channels), per-channel normalized RGB [70], and CIELAB. Among these, HSV proved superior and also allows an intuitive quantization against different environmental illumination conditions and camera acquisition settings. We therefore build weighted histograms that take into consideration the distance to the $j_{LR_k}$ axes. In particular, each pixel is weighted by a one-dimensional Gaussian kernel $\mathcal{N}(\mu, \sigma)$, where $\mu$ is the y-coordinate of $j_{LR_k}$, and $\sigma$ is a priori set to $J/4$. The nearer a pixel is to $j_{LR_k}$, the more important it is. In the single-shot case, a single histogram for each part is built. In the multiple-shot case, instead, $N$ histograms for each part are considered, where $N$ is the number of images for each pedestrian. The matching policy then handles these multiple histograms properly (see Section 10.4.4).

MSCRs. The MSCR operator² [71] detects a set of blob regions by looking at successive steps of an agglomerative clustering of image pixels. At each step, neighboring pixels with similar color are clustered, considering a threshold that represents the maximal chromatic distance between colors. Those regions that are stable over a range of steps constitute the MSCRs of the image. The descriptor of each region is a 9-dimensional vector containing area, centroid, second moment matrix, and average RGB color. MSCRs exhibit properties that are desirable for matching and also useful in re-identification: covariance to adjacency-preserving transformations and invariance to scale changes and affine transformations of image color intensities. Moreover, they show high repeatability, i.e., given two views of an object, MSCRs are likely to occur in the same corresponding locations. In the single-shot case, MSCRs are extracted separately from each part of the pedestrian. To discard outliers, MSCRs that do not lie inside the foreground regions are ruled out. In the multiple-shot case, MSCRs from multiple images have to be suitably accumulated. To this end, a mixture of Gaussians clustering procedure [72] that automatically selects the number of components is used. Clustering is carried out using the 5-dimensional MSCR subpattern composed of the centroid and the average RGB color of each blob. We cluster the blobs that are similar in appearance and position, since they yield redundant information. This phase helps in discarding redundant information and in keeping the computational cost of matching low, because only the representatives of each cluster are used. The descriptor is then a set of 4-dimensional MSCR subpatterns, consisting of the y coordinate and the average RGB color of each blob, while x coordinates are discarded because they are strongly dependent on the pose and viewpoint variation.

Recurrent high structured patches (RHSPs). We design this feature taking inspiration from the image epitome [73]. The idea is to extract image patches that are highly recurrent in the human body figure (see Fig. 10.5). Differently from the epitome, we want to take into account patches (1) that are informative (in an information theoretic sense, i.e., carrying high entropy values), and (2) that can be affected by rigid transformations. The first constraint selects only those patches with strong edgeness, such as textures. The second requirement takes into account that the human body is a 3D entity whose parts may be captured with distortions, depending on the pose. Since the images have low resolution, we can approximate the human body with a vertical cylinder. In these conditions, the RHSP generation consists of three phases.

² We used the author's implementation, downloadable at


FIG. 10.5 Recurrent high structured patches’ extraction. The final result of this process is a set of patches (in this case only one) characterizing each body part of the pedestrian.

The first step consists in the random extraction of patches $p$ of size $J/6 \times I/6$, independently on each foreground body part of the pedestrian. In order to take the vertical symmetry into consideration, we mainly sample the patches around the $j_{LR_k}$ axes. Thus, a Gaussian kernel centered in $j_{LR_k}$ is used, similarly to the color histogram computation. The patches that do not show any structure (e.g., uniformly colored ones) are removed by thresholding the entropy values of the patches. The patch entropy is computed as the sum $H_p$ of the pixel entropy of each RGB channel. We choose those patches with $H_p$ higher than a fixed threshold $\tau_H$ (fixed to 13 in all our experiments).

In the second step, a set of transformations $T_i$, $i = 1, 2, \ldots, N_T$, is applied to the generic patch $p$, for all the sampled $p$'s, in order to check their invariance to (small) body rotations. We thus generate a set of $N_T$ simulated patches $p_i$, gathering an enlarged set $\hat{p} = \{p_1, \ldots, p_{N_T}, p\}$.

In the third and final phase, only the most recurrent patches are kept. We evaluate the Local Normalized Cross-Correlation (LNCC) of each patch in $\hat{p}$ with respect to the original image. All the $N_T + 1$ LNCC maps are then summed together, forming an average map. Averaging again over the elements of the map indicates how much a patch, and its transformed versions, is present in the image. Thresholding this value selects the RHSP patches. As threshold, we fix $\tau_\mu = 0.4$. The RHSPs are computed for each region $R_1$ and $R_2$, and the descriptor consists again of an HSV histogram of them. The multishot case differs from the single-shot case in that the candidate RHSP descriptors are accumulated over different frames.

Please note that, even though several thresholds regulate the feature extraction, they have been fixed once and left unchanged in all the experiments. The best values have been selected by qualitatively analyzing the results on the VIPeR dataset.
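Two of the ingredients of this section lend themselves to a short illustration: the weighted color histogram and the entropy test used to filter RHSP candidates. The following NumPy sketch is only indicative. The bin counts, the assumption that HSV values lie in $[0, 1]$, the use of base-2 logarithms, and the function names are hypothetical choices, not taken from the original implementation.

```python
import numpy as np

def weighted_histogram(hsv_part, mask_part, j_lr, bins=(16, 16, 4)):
    """Per-channel HSV histogram of one body part, weighted by the distance to the axis j_lr."""
    height, width = mask_part.shape
    sigma = width / 4.0                                     # sigma = J/4, as in the text
    cols = np.arange(width)
    kernel = np.exp(-0.5 * ((cols - j_lr) / sigma) ** 2)    # 1-D Gaussian centred on j_lr
    weights = np.tile(kernel, (height, 1)) * mask_part      # background pixels get weight 0
    parts = []
    for channel, n_bins in enumerate(bins):                 # bin counts are an assumption
        h, _ = np.histogram(hsv_part[..., channel], bins=n_bins,
                            range=(0.0, 1.0), weights=weights)
        parts.append(h)
    hist = np.concatenate(parts)                            # channels concatenated
    total = hist.sum()
    return hist / total if total > 0 else hist

def patch_entropy(patch_rgb):
    """H_p: sum over the RGB channels of the Shannon entropy (bits) of 8-bit intensities."""
    total = 0.0
    for channel in range(3):
        counts = np.bincount(patch_rgb[..., channel].ravel(), minlength=256)
        p = counts[counts > 0] / counts.sum()
        total -= float((p * np.log2(p)).sum())
    return total
```

A candidate patch would then be kept when `patch_entropy(patch) > 13`, mirroring the threshold $\tau_H$ quoted above, provided the entropy is measured in the same units as in the original experiments.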

10.4.4 THE MATCHING PHASE

In re-identification, two sets of pedestrian images are available: the gallery set A (the database of signatures whose label is known) and the probe set B (the set of tracked pedestrians without label). Re-identification consists in matching each signature $I_B$ in the set B to the corresponding signature $I_A$ of the set A. The association mechanism depends on how the two sets are organized, more specifically, on how many pictures are present for each individual. This gives rise to three matching strategies: (1) single-shot vs. single-shot (SvsS), if each image in a set represents a different individual; (2) multiple-shot vs. single-shot (MvsS), if each image in B represents a different individual, while in A each person is portrayed in different images, or instances; (3) multiple-shot vs. multiple-shot (MvsM), if both A and B contain multiple instances per individual. The MvsM strategy is preferred when trajectories of people are available, because one can exploit the redundancy and diversity of the data to make the signature more robust.

Re-identification can be seen as a maximum log-likelihood estimation problem [74]. In more detail, given a probe B the matching is carried out by:

$$A^* = \arg\max_A \left( \log P(I_A \mid I_B) \right) = \arg\min_A \left( d(I_A, I_B) \right).$$

During testing, we want to match the given probe signature $I_B$ against the gallery set signatures $I_A$. The goal is to maximize the likelihood of $I_A$ given the probe $I_B$. The right-hand term of the formula above follows from the fact that, in this work, we define the matching probability $P(I_A \mid I_B)$ in a Gibbs form $P(I_A \mid I_B) = e^{-d(I_A, I_B)}$, where $d(I_A, I_B)$ measures the distance between two descriptors. The SDALF matching distance $d$ is defined as a convex combination of the local features:

$$d(I_A, I_B) = \beta_{WH} \cdot d_{WH}(WH(I_A), WH(I_B)) + \beta_{MSCR} \cdot d_{MSCR}(MSCR(I_A), MSCR(I_B)) + \beta_{RHSP} \cdot d_{RHSP}(RHSP(I_A), RHSP(I_B)),$$

where $WH(\cdot)$, $MSCR(\cdot)$, and $RHSP(\cdot)$ are the weighted histogram, MSCR, and Recurrent High Structured Patch descriptors, respectively, and the $\beta$'s are normalizing weights.

The distance $d_{WH}$ considers the weighted color histograms. In the SvsS case, the HSV histograms of each part are concatenated channel by channel, normalized, and finally compared via the Bhattacharyya distance [75]. Under the MvsM and MvsS policies, we compare each possible pair of histograms contained in the different signatures, keeping the lowest distance.


For $d_{MSCR}$, in the SvsS case, we estimate the minimum distance of each MSCR element $b$ in $I_B$ to each element $a$ in $I_A$. This distance is defined by two components. The first, $d_y^{ab}$, compares the y component of the MSCR centroids, while the x component is ignored in order to be invariant with respect to body rotations. The second, $d_c^{ab}$, compares the MSCR color. In both cases, the comparison is carried out using the Euclidean distance. The two components are combined as:

$$d_{MSCR} = \sum_{b \in I_B} \min_{a \in I_A} \left( \gamma \cdot d_y^{ab} + (1 - \gamma) \cdot d_c^{ab} \right), \qquad (10.9)$$

where $\gamma$ takes values between 0 and 1. In the multishot cases, the set $I_A$ of Eq. (10.9) becomes the subset of blobs contained in the cluster most similar to the MSCR element $b$.

The distance $d_{RHSP}$ is obtained by selecting the best pair of RHSPs, one in $I_A$ and one in $I_B$, and evaluating the minimum Bhattacharyya distance among the RHSPs' HSV histograms. This is done independently for each body part (excluding the head), summing up all the distances obtained and then normalizing by the number of pairs.

In our experiments, we fix the values of the parameters as follows: $\beta_{WH} = 0.4$, $\beta_{MSCR} = 0.4$, $\beta_{RHSP} = 0.2$, and $\gamma = 0.4$. These parameters have been estimated only once in a validation step using a subset of 100 image pairs of the VIPeR dataset, and remain unchanged for all the experiments.
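As a rough illustration of the single-shot matching distance, the sketch below combines the three terms with the weights quoted above. It makes several assumptions that are not stated in the text: the Bhattacharyya distance is taken in the form $\sqrt{1 - \sum_u \sqrt{h_1(u) h_2(u)}}$ (other definitions exist), each MSCR blob is stored as a row `[y, R, G, B]`, and all function names are hypothetical.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two normalized histograms (sqrt(1 - BC) form, assumed)."""
    bc = float(np.sum(np.sqrt(h1 * h2)))
    return float(np.sqrt(max(0.0, 1.0 - bc)))

def d_mscr(blobs_a, blobs_b, gamma=0.4):
    """Eq. (10.9): for each blob b in I_B, the closest blob a in I_A; rows are [y, R, G, B]."""
    dy = np.abs(blobs_b[:, None, 0] - blobs_a[None, :, 0])                    # d_y^{ab}
    dc = np.linalg.norm(blobs_b[:, None, 1:] - blobs_a[None, :, 1:], axis=2)  # d_c^{ab}
    return float((gamma * dy + (1.0 - gamma) * dc).min(axis=1).sum())

def sdalf_distance(d_wh, d_mscr_value, d_rhsp, betas=(0.4, 0.4, 0.2)):
    """Convex combination of the three partial distances."""
    return betas[0] * d_wh + betas[1] * d_mscr_value + betas[2] * d_rhsp
```

Given a probe, the gallery identity with the smallest `sdalf_distance` would be returned as the match; in practice the three partial distances may need to be rescaled to comparable ranges before being combined, which is the role the text assigns to the normalizing weights.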

10.5 METRIC LEARNING

The metric learning methodology casts the re-identification problem as a ranking problem instead of a classification one. The main idea is to learn a subspace where the potential true match is given the highest ranking, rather than relying on any direct distance measure. The person re-identification problem is thus changed from an absolute scoring problem to a relative ranking problem.

Consider a repository $X = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^m$ is a multidimensional feature vector representing the appearance of a person captured in one view, $y_i$ is its label, and $N$ is the number of training samples (images of persons). Each vector $x_i$ can be associated with a set of relevant observation feature vectors $d_i^+ = \{x_{i,1}^+, x_{i,2}^+, \ldots, x_{i,m^+(x_i)}^+\}$ and a set of related irrelevant observation feature vectors $d_i^- = \{x_{i,1}^-, x_{i,2}^-, \ldots, x_{i,m^-(x_i)}^-\}$, corresponding to correct and incorrect matches from another camera view. Here, $m^+(x_i)$ ($m^-(x_i)$) is the number of relevant (related irrelevant) observations for query $x_i$, and $m^-(x_i) = N - m^+(x_i) - 1$. It is worth noting that $m^+(x_i) \ll m^-(x_i)$, since usually there are only a few correct matches and many more wrong matches. The goal of ranking any paired image relevance vectors is to learn a ranking function $\delta$ for all pairs $(x_i, x_{i,j}^+)$ and $(x_i, x_{i,j'}^-)$ such that the relevance ranking score $\delta(x_i, x_{i,j}^+)$ is larger than $\delta(x_i, x_{i,j'}^-)$.




One of the more obvious metric learning approaches is discussed in [76], and is based on ranking SVMs. In particular, the goal is to find a score $\delta$ given by a linear function of the pairwise sample $(x_i, x_{i,j})$ as follows:

$$\delta(x_i, x_{i,j}) = w^\top |x_i - x_{i,j}|, \qquad (10.10)$$

where $|x_i - x_{i,j}|$ gives, for each of its $m$ components, the absolute value of the difference between the corresponding components of $x_i$ and $x_{i,j}$. The vector $|x_i - x_{i,j}|$ is called the absolute difference vector. Note that for a query feature vector $x_i$ the following rank relationship is desired for a relevant feature vector $x_{i,j}^+$ and a related irrelevant feature vector $x_{i,j'}^-$:

$$w^\top \left( |x_i - x_{i,j}^+| - |x_i - x_{i,j'}^-| \right) > 0.$$

Let $\hat{x}_s^+ = |x_i - x_{i,j}^+|$ and $\hat{x}_s^- = |x_i - x_{i,j'}^-|$. Then, by going through all samples $x_i$ as well as the $x_{i,j}^+$ and $x_{i,j'}^-$ in the dataset $X$, we obtain a corresponding set of all pairwise relevant difference vectors for which $w^\top (\hat{x}_s^+ - \hat{x}_s^-) > 0$ is expected. This vector set is denoted by $P = \{(\hat{x}_s^+, \hat{x}_s^-)\}$. A ranking SVM model, briefly rankSVM, is then defined as the minimization of the following objective function:

$$\frac{1}{2} \|w\|^2 + C \sum_{s=1}^{|P|} \xi_s \quad \text{s.t.} \quad w^\top (\hat{x}_s^+ - \hat{x}_s^-) \ge 1 - \xi_s, \quad \xi_s \ge 0, \quad s = 1, \ldots, |P|,$$

where $C$ is a positive number that controls the trade-off between margin and training error. The problem with this formulation is the potentially large size of $P$, which leads to computational issues (both temporal and spatial) [76]. More importantly, looking at Eq. (10.10), one can see that rankSVM solely learns an independent weight for each feature, discarding the interrelations that could exist between features. To address this, Roth et al. [77] introduced Mahalanobis matrix metric learning, which instead optimizes a full distance matrix and is therefore more expressive. For the sake of clarity, we first give the general idea of Mahalanobis metric learning and then present a set of methods that implement this principle from different points of view. All the approaches have publicly available implementations.
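The rankSVM objective can be minimized in many ways. As a minimal sketch, the code below rewrites the constrained problem as the equivalent unconstrained hinge-loss form $\frac{1}{2}\|w\|^2 + C \sum_s \max(0, 1 - w^\top(\hat{x}_s^+ - \hat{x}_s^-))$ and applies plain subgradient descent. The solver, the learning rate, and the function name are assumptions chosen for brevity; they are not the optimization scheme used in [76].

```python
import numpy as np

def train_ranksvm(pos_diff, neg_diff, C=1.0, lr=0.01, epochs=200):
    """pos_diff[s], neg_diff[s]: absolute difference vectors of the s-th pair in P."""
    z = pos_diff - neg_diff                        # x_hat_s^+ - x_hat_s^-
    w = np.zeros(z.shape[1])
    for _ in range(epochs):
        violated = z @ w < 1.0                     # pairs whose slack xi_s is positive
        grad = w - C * z[violated].sum(axis=0)     # subgradient of the hinge-loss objective
        w -= lr * grad
    return w
```

After training, a gallery candidate is scored with `w @ np.abs(x_query - x_candidate)` as in Eq. (10.10), and candidates are ranked by decreasing score. Because each component of `w` weights one feature difference independently, the sketch also makes visible the limitation noted above: no interaction between features is modeled.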

10.5.1 MAHALANOBIS METRIC LEARNING

The distance metric is crucial in many computer vision and pattern recognition tasks besides re-identification, such as classification and clustering. The best-known distance metric is the Euclidean one, which treats the input space as isotropic (all the data dimensions are treated equally). However, this assumption may not hold in many applications, where the underlying relationships between input instances are not isotropic. The Mahalanobis metric takes this fact into account by considering the correlation among different data dimensions. In simple words, the Mahalanobis metric can be viewed as a Euclidean metric on an input space that has been globally transformed (through a linear transformation). Mahalanobis metric learning estimates this linear transformation. Given $N$ data points $x_i \in \mathbb{R}^m$, the goal is to estimate a matrix $M$ such that

$$d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) \qquad (10.13)$$

describes a pseudo-metric (a generalized metric in which the distance between two distinct points can be zero). This condition is met if $M$ is positive semidefinite. Eq. (10.13) becomes the standard Mahalanobis distance if $M = \Sigma^{-1}$ (i.e., the inverse of the sample covariance matrix). An equivalent formulation gives $M$ in a factorized form and is

$$d_L(x_i, x_j) = \| L (x_i - x_j) \|^2, \qquad (10.14)$$

which follows from

$$(x_i - x_j)^\top M (x_i - x_j) = (x_i - x_j)^\top \underbrace{L^\top L}_{M} (x_i - x_j) = \| L (x_i - x_j) \|^2.$$
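The equivalence between Eqs. (10.13) and (10.14) can be checked numerically. In the following sketch the factor $L$ is obtained from an eigendecomposition of a symmetric positive semidefinite $M$; this is one possible choice among many (a Cholesky factor would work equally well for positive definite $M$), and the function names are illustrative.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), Eq. (10.13)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

def factor(M):
    """Return L with L.T @ L == M for a symmetric positive semidefinite M."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)).T          # L = diag(sqrt(vals)) @ vecs.T

M = np.array([[2.0, 1.0], [1.0, 2.0]])
x_i, x_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
L = factor(M)
d_M = mahalanobis_sq(x_i, x_j, M)                    # Eq. (10.13)
d_L = float(np.sum((L @ (x_i - x_j)) ** 2))          # Eq. (10.14)
```

Both `d_M` and `d_L` evaluate to 2 for this example (up to floating-point error), whereas the squared Euclidean distance between the two points, obtained with $M$ equal to the identity, is also 2 only by coincidence of the chosen values; changing the off-diagonal entries of `M` changes `d_M` but not the Euclidean distance.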



In re-identification, the label information can be injected, as in the case of rankSVM, by defining the set of similar pairs $S$ and the set of dissimilar pairs $D$, and then applying the subsequent optimization:

$$S = \{x_i, d_i^+\}$$

$$D = \{x_i, d_i^-\}. \qquad (10.17)$$

To increase readability, the notation $C_{ij} = (x_i - x_j)(x_i - x_j)^\top$ can be used, as well as the similarity variable

$$y_{ij} = \begin{cases} 1 & x_j \in d_i^+ \\ 0 & x_j \in d_i^-. \end{cases}$$

10.5.2 LARGE MARGIN NEAREST NEIGHBOR

LMNN metric learning [78] looks for a local separation between samples; this is its main difference from the general Mahalanobis framework, where the transformation applied to the data is global. It operates in two steps. The first identifies a set of $k$ similarly labeled target neighbors for each input $x_i$. Formally, $j \rightsquigarrow i$ indicates that $x_j$ belongs to the $k$ nearest neighbors of $x_i$ with the same label. The second step adapts the Mahalanobis distance metric so that these target neighbors are closer to $x_i$ than all other differently labeled inputs. The Mahalanobis distance metric is estimated by solving a semidefinite programming problem. Samples with a different label that fall into this local region (impostors) are penalized. More technically, for a target pair $(x_i, x_j) \in S$, i.e., $y_{ij} = 1$, any sample $x_l$ with $y_{il} = 0$ is an impostor if

$$\| L (x_i - x_l) \|^2 \le \| L (x_i - x_j) \|^2 + 1, \qquad (10.19)$$

where $L$ comes from Eq. (10.14) and defines a given Mahalanobis distance metric. The goal of this metric learning strategy is to pull target pairs together and to penalize the presence of impostors. The penalization is obtained through the following objective function:

$$\mathcal{L}(M) = \sum_{j \rightsquigarrow i} \left[ d_M(x_i, x_j) + \beta \sum_{l} (1 - y_{il}) \, \xi_{ijl}(M) \right] \qquad (10.20)$$

with

$$\xi_{ijl}(M) = 1 + d_M(x_i, x_j) - d_M(x_i, x_l).$$

The first term of the right-hand side of Eq. (10.20) minimizes the distance between target neighbors $x_i$ and $x_j$, indicated by $j \rightsquigarrow i$, while the second denotes the amount by which a differently labeled input $x_l$ can invade the perimeter around input $x_i$ defined by its target neighbor $x_j$. The $\beta$ term, which is usually set to 1, gives the trade-off between the two terms in the objective function. To estimate the metric $M$, gradient descent is performed on the objective function (Eq. 10.20):

$$\frac{\partial \mathcal{L}(M)}{\partial M} = \sum_{j \rightsquigarrow i} C_{ij} + \beta \sum_{(i, j, l) \in N} (C_{ij} - C_{il}),$$

where $N$ denotes the set of triplet indices corresponding to a positive slack variable $\xi$. A further version of LMNN for re-identification was designed in [7].
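The objective of Eq. (10.20) and its gradient can be evaluated directly on a small dataset. The sketch below is a naive, loop-based illustration rather than an efficient LMNN solver: target neighbors are chosen with the Euclidean distance, only triplets with positive slack contribute, and the projection onto the positive semidefinite cone after each gradient step is an assumption about how the constraint on $M$ would be maintained. All function names are hypothetical.

```python
import numpy as np

def target_neighbors(X, y, k=1):
    """Indices of the k nearest same-label neighbors (Euclidean) of every sample."""
    X = np.asarray(X, dtype=float)
    d = ((X[:, None] - X[None]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                   # a sample is not its own neighbor
    d[y[:, None] != y[None]] = np.inf             # ignore differently labeled samples
    return np.argsort(d, axis=1)[:, :k]

def lmnn_loss_grad(X, y, M, targets, beta=1.0):
    """Value and gradient of the LMNN objective, Eq. (10.20), for a given M."""
    loss, grad = 0.0, np.zeros_like(M)
    for i in range(len(X)):
        for j in targets[i]:
            dij = X[i] - X[j]
            d_ij = dij @ M @ dij
            loss += d_ij
            grad += np.outer(dij, dij)                         # C_ij
            for l in np.flatnonzero(y != y[i]):                # differently labeled inputs
                dil = X[i] - X[l]
                xi = 1.0 + d_ij - dil @ M @ dil                # slack xi_ijl(M)
                if xi > 0:                                     # triplet (i, j, l) is in N
                    loss += beta * xi
                    grad += beta * (np.outer(dij, dij) - np.outer(dil, dil))
    return loss, grad

def project_psd(M):
    """Clip negative eigenvalues so that M stays positive semidefinite."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T
```

One descent iteration would then read `M = project_psd(M - step * grad)`, with `step` a small positive learning rate, repeated until the loss stops decreasing.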

10.5.3 EFFICIENT IMPOSTOR-BASED METRIC LEARNING Efficient Impostor-based Metric Learning (EIML) [79] is a simple yet effective strategy which takes from LMNN the idea of the impostors. In particular, Eq. (10.19) is relaxed to the original difference space. Thus, given a target pair (xi, xj), a sample xl is an impostor if jjðxi  xl Þjj2 jjðxi  xj Þjj2 :


To estimate the metric M ¼L>L the following objective function has to be minimized: LðLÞ ¼



jjLðxi  xj Þjj2 

ðxi , xj Þ2S

jjL wil ðxi  xl Þjj2 ,

ðxi , xl Þ2I


where I is the set of all impostor pairs and wil ¼ e

jjxi xl jj  jjx x jj i



10.5 Metric learning

is a weighting factor also taking into account how much an impostor invades the perimeter of a target pair. By adding the orthogonality constraint LL> ¼ I, Eq. (10.24) can be re-formulated to an eigenvalue problem: ðΣS  ΣI ÞL ¼ ΛL,


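where

$$\Sigma_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{(x_i, x_j) \in \mathcal{S}} C_{ij} \quad \text{and} \quad \Sigma_{\mathcal{I}} = \frac{1}{|\mathcal{I}|} \sum_{(x_i, x_j) \in \mathcal{I}} C_{ij} \tag{10.27}$$

are the covariance matrices for $\mathcal{S}$ and $\mathcal{I}$, respectively. The problem is now much simpler and can be solved efficiently.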
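The eigenvalue formulation can be illustrated with the following NumPy sketch. It is only one possible reading of the text and rests on several assumptions: impostors are found with the relaxed test in the original feature space, the weight $w_{il}$ is applied to the difference vector before the outer product is formed (as in Eq. 10.24), and the rows of $L$ are the `k` eigenvectors with the smallest eigenvalues, since those minimize the objective under the orthogonality constraint. The function name and the parameter `k` are hypothetical.

```python
import numpy as np

def eiml_projection(X, y, k):
    """Illustrative EIML-style solver returning (L, M) with M = L^T L."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    sum_S = np.zeros((d, d))
    sum_I = np.zeros((d, d))
    n_S = n_I = 0
    for i in range(n):
        for j in range(n):
            if i == j or y[i] != y[j]:
                continue                          # (i, j) must be a target pair
            d_ij = X[i] - X[j]
            r_ij = np.linalg.norm(d_ij)
            if r_ij == 0.0:
                continue                          # duplicate samples define no perimeter
            sum_S += np.outer(d_ij, d_ij)
            n_S += 1
            for l in range(n):
                if y[l] == y[i]:
                    continue
                d_il = X[i] - X[l]
                r_il = np.linalg.norm(d_il)
                if r_il <= r_ij:                  # x_l invades the perimeter of (x_i, x_j)
                    w = np.exp(-r_il / r_ij)      # weighting factor w_il
                    sum_I += np.outer(w * d_il, w * d_il)
                    n_I += 1
    Sigma_S = sum_S / max(n_S, 1)
    Sigma_I = sum_I / max(n_I, 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, vecs = np.linalg.eigh(Sigma_S - Sigma_I)
    L = vecs[:, :k].T                             # rows are orthonormal: L @ L.T = I_k
    return L, L.T @ L
```

Choosing `k` smaller than the feature dimension matters here: with a square orthogonal `L`, the product `L.T @ L` would collapse to the identity.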

10.5.4 KISSME

The goal of KISSME [80] is to address metric learning from a statistical inference point of view. It therefore tests the hypothesis $H_0$ that a pair $(x_i, x_j)$ is dissimilar against the hypothesis $H_1$ that it is similar, using a likelihood ratio test:

$$\delta(x_i, x_j) = \log \frac{p(x_i, x_j \mid H_0)}{p(x_i, x_j \mid H_1)} = \log \frac{f(x_i, x_j, \theta_0)}{f(x_i, x_j, \theta_1)}, \tag{10.28}$$


where $\delta$ is the log-likelihood ratio and $f(x_i, x_j, \theta)$ is a PDF with the parameter set $\theta$. Assuming a zero-mean Gaussian structure of the difference space, Eq. (10.28) can be rewritten as

$$\delta(x_i, x_j) = \log \left( \frac{\dfrac{1}{\sqrt{2\pi |\Sigma_{\mathcal{D}}|}} \exp\!\left(-\tfrac{1}{2} (x_i - x_j)^\top \Sigma_{\mathcal{D}}^{-1} (x_i - x_j)\right)}{\dfrac{1}{\sqrt{2\pi |\Sigma_{\mathcal{S}}|}} \exp\!\left(-\tfrac{1}{2} (x_i - x_j)^\top \Sigma_{\mathcal{S}}^{-1} (x_i - x_j)\right)} \right), \tag{10.29}$$


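where $\Sigma_{\mathcal{S}}$ and $\Sigma_{\mathcal{D}}$ are the covariance matrices of $\mathcal{S}$ and $\mathcal{D}$ according to Eq. (10.27). The maximum likelihood estimate of the Gaussian is equivalent to minimizing the distances from the mean in a least squares manner. This allows KISSME to find the relevant directions for $\mathcal{S}$ and $\mathcal{D}$, respectively. By taking the log and discarding the constant terms, we can simplify Eq. (10.29) to

$$\delta(x_i, x_j) = (x_i - x_j)^\top \Sigma_{\mathcal{S}}^{-1} (x_i - x_j) - (x_i - x_j)^\top \Sigma_{\mathcal{D}}^{-1} (x_i - x_j) = (x_i - x_j)^\top \left(\Sigma_{\mathcal{S}}^{-1} - \Sigma_{\mathcal{D}}^{-1}\right) (x_i - x_j). \tag{10.30}$$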


Hence, the Mahalanobis distance matrix $M$ is defined by

$$M = \left(\Sigma_{\mathcal{S}}^{-1} - \Sigma_{\mathcal{D}}^{-1}\right). \tag{10.31}$$
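Because the KISSME metric needs only two covariance matrices and their inverses, it is easy to sketch. The NumPy code below is a minimal illustration under stated assumptions: labels define the similar and dissimilar pairs, $C_{ij} = (x_i - x_j)(x_i - x_j)^\top$, and both covariance matrices are invertible (in practice the features are usually reduced in dimension or regularized first). The function name is hypothetical.

```python
import numpy as np

def kissme_metric(X, y):
    """Return M = inv(Sigma_S) - inv(Sigma_D) from labeled samples (sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    sum_S = np.zeros((d, d))
    sum_D = np.zeros((d, d))
    n_S = n_D = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            C = np.outer(diff, diff)
            if y[i] == y[j]:          # similar pair, contributes to Sigma_S
                sum_S += C
                n_S += 1
            else:                     # dissimilar pair, contributes to Sigma_D
                sum_D += C
                n_D += 1
    Sigma_S = sum_S / n_S
    Sigma_D = sum_D / n_D
    return np.linalg.inv(Sigma_S) - np.linalg.inv(Sigma_D)

def kissme_score(M, a, b):
    """delta(a, b) = (a - b)^T M (a - b); larger values suggest a dissimilar pair."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff @ M @ diff)
```

No iterative optimization is involved, which is why the method scales well with the number of training pairs.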


All of these metric learning approaches are designed for the single-shot re-identification scheme, where each image is a sample. However, when a set of images of the same individual can be condensed into a single pattern, the whole metric learning framework still applies. Results of some of the metric learning approaches have been reported in Section 10.3.




10.6 CONCLUSIONS AND NEW CHALLENGES

Re-identification aims at keeping track of the identity of individuals in multicamera scenarios by encoding discriminative blueprints for each subject and matching them across the camera network. In the large majority of approaches, re-id focuses on modeling the body appearance of people, since finer details (faces, gait styles) cannot be captured by the sensors. Other soft biometric features (behavioral patterns, social signals) are also too difficult to capture robustly to date. This leads to the crucial assumption of most re-id approaches that people do not change their clothes. While this assumption lets systems work properly, it limits the scope of re-id to very short periods of time. The very few approaches that cover longer periods do so by exploiting RGB-D data and extracting anthropometric features from 3D meshes [81].

REFERENCES

[1] M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by symmetry-driven accumulation of local features, in: IEEE Conference Computer Vision and Pattern Recognition, 2010, pp. 2360–2367.
[2] D.S. Cheng, M. Cristani, Person re-identification by articulated appearance matching, in: Person Re-Identification, Springer, London, 2014, pp. 139–160.
[3] L. Bazzani, M. Cristani, A. Perina, M. Farenzena, V. Murino, Multiple-shot person reidentification by HPE signature, in: 20th International Conference on Pattern Recognition (ICPR), 2010, IEEE, New York, 2010, pp. 1413–1416.
[4] A. Bhuiyan, A. Perina, V. Murino, Person re-identification by discriminatively selecting parts and features, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 147–161.
[5] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person reidentification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (7) (2013) 1622–1634.
[6] G. Lisanti, I. Masi, A.D. Bagdanov, A. Del Bimbo, Person re-identification by iterative re-weighted sparse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 37 (8) (2015) 1629–1642.
[7] M. Dikmen, E. Akbas, T.S. Huang, N. Ahuja, Pedestrian recognition with a learned metric, in: Proc. Asian Conf. on Computer Vision, 2010.
[8] D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: European Conference on Computer Vision, 2008, pp. 262–275.
[9] C. Liu, S. Gong, C.C. Loy, X. Lin, Person re-identification: what features are important? in: European Conference on Computer Vision, First International Workshop on Re-identification, 2012, pp. 391–401.
[10] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by fisher vectors for person re-identification, in: European Conference on Computer Vision, Springer, Cham, 2012, pp. 413–422.
[11] B. Prosser, W. Zheng, S. Gong, T. Xiang, Person re-identification by support vector ranking, in: British Machine Vision Conference, 2010, pp. 21.1–21.11.


[12] W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by probabilistic relative distance comparison, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, 2011, pp. 649–656.
[13] N.D. Bird, O. Masoud, N.P. Papanikolopoulos, A. Isaacs, Detection of loitering individuals in public transportation areas, IEEE Trans. Intell. Transport. Syst. 6 (2) (2005) 167–177.
[14] N. Gheissari, T. Sebastian, R. Hartley, Person reidentification using spatiotemporal appearance, in: IEEE Conference Computer Vision and Pattern Recognition, 2006, pp. 1528–1535.
[15] X.G. Wang, G. Doretto, T. Sebastian, J. Rittscher, P. Tu, Shape and appearance context modeling, in: International Conference on Computer Vision, 2007, pp. 1–8.
[16] O. Hamdoun, F. Moutarde, B. Stanciulescu, B. Steux, Person re-identification in multicamera system by signature based on interest point descriptors collected on short video sequences, in: Second ACM/IEEE International Conference on Distributed Smart Cameras, 2008. ICDSC 2008, IEEE, New York, 2008, pp. 1–6.
[17] L. Bazzani, M. Cristani, V. Murino, Symmetry-driven accumulation of local features for human characterization and re-identification, Comput. Vis. Image Underst. 117 (2) (2013) 130–144.
[18] D. Cheng, M. Cristani, M. Stoppa, L. Bazzani, V. Murino, Custom pictorial structures for re-identification, in: British Machine Vision Conference, 2011, pp. 68.1–68.11.
[19] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person reidentification, IEEE Trans. Pattern Anal. Mach. Intell. 35 (7) (2013) 1622–1634.
[20] W.R. Schwartz, L.S. Davis, Learning discriminative appearance-based models using partial least squares, in: Brazilian Symposium on Computer Graphics and Image Processing, 2009, pp. 322–329.
[21] S. Bak, E. Corvee, F. Bremond, M. Thonnat, Person re-identification using spatial covariance regions of human body parts, in: IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 435–440.
[22] P. Salvagnini, L. Bazzani, M. Cristani, V. Murino, Person re-identification with a PTZ camera: an introductory study, in: 2013 IEEE International Conference on Image Processing, IEEE, New York, 2013, pp. 3552–3556.
[23] C. Nakajima, M. Pontil, B. Heisele, T. Poggio, Full-body person recognition system, Pattern Recognit. 36 (9) (2003) 1997–2006.
[24] Z. Lin, L.S. Davis, Learning pairwise dissimilarity profiles for appearance recognition in visual surveillance, in: International Symposium on Visual Computing, Springer, Berlin, Heidelberg, 2008, pp. 23–34.
[25] W. Zheng, S. Gong, T. Xiang, Associating groups of people, in: British Machine Vision Conference, 2009, pp. 23.1–23.11.
[26] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 1–16.
[27] S. Pedagadi, J. Orwell, S. Velastin, B. Boghossian, Local fisher discriminant analysis for pedestrian re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3318–3325.
[28] Z. Li, S. Chang, F. Liang, T.S. Huang, L. Cao, J.R. Smith, Learning locally-adaptive decision functions for person verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3610–3617.




[29] A. Mignon, F. Jurie, PCCA: a new approach for distance learning from sparse pairwise constraints, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, 2012, pp. 2666–2672.
[30] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, B. Xu, Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 695–704.
[31] X. Liu, M. Song, D. Tao, X. Zhou, C. Chen, J. Bu, Semi-supervised coupled dictionary learning for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3550–3557.
[32] W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, S. Gong, Partial person re-identification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4678–4686.
[33] D. Figueira, L. Bazzani, H.Q. Minh, M. Cristani, A. Bernardino, V. Murino, Semisupervised multi-feature learning for person re-identification, in: 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), IEEE, New York, 2013, pp. 111–116.
[34] C.-H. Kuo, S. Khamis, V. Shet, Person re-identification using semantic color names and rankboost, in: 2013 IEEE Workshop on Applications of Computer Vision (WACV), IEEE, New York, 2013, pp. 281–287.
[35] J. Van De Weijer, C. Schmid, J. Verbeek, D. Larlus, Learning color names for real-world applications, IEEE Trans. Image Process. 18 (7) (2009) 1512–1523.
[36] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S.Z. Li, Salient color names for person reidentification, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 536–551.
[37] S. Liao, Y. Hu, X. Zhu, S.Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.
[38] M. Köstinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288–2295.
[39] E. Ahmed, M. Jones, T.K. Marks, An improved deep learning architecture for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3908–3916.
[40] L. Bazzani, M. Cristani, A. Perina, V. Murino, Multiple-shot person re-identification by chromatic and epitomic analyses, Pattern Recognit. Lett. 33 (7) (2012) 898–903.
[41] W. Zheng, S. Gong, T. Xiang, Re-identification by relative distance comparison, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 653–668.
[42] A. Alahi, P. Vandergheynst, M. Bierlaire, M. Kunt, Cascade of descriptors to detect and track objects across any network of cameras, Comput. Vis. Image Underst. 114 (6) (2010) 624–640.
[43] C.C. Loy, C. Liu, S. Gong, Person re-identification by manifold ranking, in: IEEE International Conference on Image Processing, 2013.
[44] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[45] F. Porikli, Inter-camera color calibration by correlation model function, in: IEEE International Conference on Image Processing, 2003.


[46] K.-W. Chen, C.-C. Lai, Y.-P. Hung, C.-S. Chen, An adaptive learning method for target tracking across multiple cameras, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[47] T. D’Orazio, P.L. Mazzeo, P. Spagnolo, Color brightness transfer function evaluation for non overlapping multi camera tracking, in: ACM/IEEE International Conference on Distributed Smart Cameras, 2009, pp. 1–6.
[48] O. Javed, K. Shafique, Z. Rasheed, M. Shah, Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views, Comput. Vis. Image Underst. 109 (2) (2008) 146–162.
[49] K. Jeong, C. Jaynes, Object matching in disjoint cameras using a color transfer approach, Mach. Vis. Appl. 19 (5–6) (2008) 443–455.
[50] G. Lian, J.-H. Lai, C.Y. Suen, P. Chen, Matching of tracked pedestrians across disjoint camera views using CI-DLBP, IEEE Trans. Circuits Syst. Video Technol. 22 (7) (2012) 1087–1099.
[51] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[52] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: International Conference on Machine Learning, 2007, pp. 209–216.
[53] M. Guillaumin, J. Verbeek, C. Schmid, Is that you? Metric learning approaches for face identification, in: International Conference on Computer Vision, 2009, pp. 498–505.
[54] M. Hirzer, P. Roth, M. Köstinger, H. Bischof, Relaxed pairwise learned metric for person re-identification, in: European Conference on Computer Vision, 2012, pp. 780–793.
[55] A. Datta, L.M. Brown, R. Feris, S. Pankanti, Appearance modeling for person reidentification using weighted brightness transfer functions, in: 21st International Conference on Pattern Recognition (ICPR), 2012, IEEE, New York, 2012, pp. 2367–2370.
[56] T. Avraham, I. Gurvich, M. Lindenbaum, S. Markovitch, Learning implicit transfer for person re-identification, in: European Conference on Computer Vision, Springer, Cham, 2012, pp. 381–390.
[57] A. Bhuiyan, A. Perina, V. Murino, Exploiting multiple detections to learn robust brightness transfer functions in re-identification systems, in: 2015 IEEE International Conference on Image Processing (ICIP), IEEE, New York, 2015, pp. 2329–2333.
[58] G. Lisanti, I. Masi, A. Del Bimbo, Matching people across camera views using kernel canonical correlation analysis, in: Proceedings of the International Conference on Distributed Smart Cameras, ACM, New York, NY, USA, 2014, p. 10.
[59] B. Prosser, S. Gong, T. Xiang, Multi-camera matching using bi-directional cumulative brightness transfer functions, in: BMVC, vol. 8, Citeseer, 2008, p. 164.
[60] O. Javed, K. Shafique, M. Shah, Appearance modeling for tracking in multiple nonoverlapping cameras, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, IEEE, New York, 2005, pp. 26–33.
[61] A. Bhuiyan, B. Mirmahboub, A. Perina, V. Murino, Person re-identification using robust brightness transfer functions based on multiple detections, in: International Conference on Image Analysis and Processing, Springer, Cham, 2015, pp. 449–459.
[62] M. Hirzer, C. Beleznai, P.M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Scandinavian Conference on Image Analysis, Springer, Berlin, Heidelberg, 2011, pp. 91–102.
[63] T. Wang, S. Gong, X. Zhu, S. Wang, Person re-identification by video ranking, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 688–703.




[64] T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2002, pp. 133–142.
[65] Y. Li, Z. Wu, S. Karanam, R.J. Radke, Multi-shot human re-identification using adaptive fisher discriminant analysis, in: Proceedings of the British Machine Vision Conference, 2015, pp. 1–12.
[66] A. Bialkowski, S. Denman, S. Sridharan, C. Fookes, P. Lucey, A database for person reidentification in multi-camera surveillance networks, in: 2012 International Conference on Digital Image Computing Techniques and Applications (DICTA), IEEE, New York, 2012, pp. 1–8.
[67] J. García, N. Martinel, G.L. Foresti, A. Gardel, C. Micheloni, Person orientation and feature distances boost re-identification, in: 2014 22nd International Conference on Pattern Recognition (ICPR), IEEE, New York, 2014, pp. 4618–4623.
[68] N. Jojic, A. Perina, M. Cristani, V. Murino, B. Frey, Stel component analysis: modeling spatial correlations in image class structure, IEEE Conf. Comput. Vis. Pattern Recognit. (2009) 2044–2051.
[69] D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: European Conference on Computer Vision, 2008, pp. 262–275.
[70] S. Bak, E. Corvee, F. Bremond, M. Thonnat, Person re-identification using spatial covariance regions of human body parts, in: AVSS, 2010.
[71] P.-E. Forssen, Maximally stable colour regions for recognition and matching, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[72] M. Figueiredo, A. Jain, Unsupervised learning of finite mixture models, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 381–396.
[73] N. Jojic, B. Frey, A. Kannan, Epitomic analysis of appearance and shape, in: Proc. of International Conference on Computer Vision, 2003.
[74] L. Bazzani, M. Farenzena, A. Perina, V. Murino, M. Cristani, Multiple-shot person reidentification by HPE signature, in: IEEE International Conference on Pattern Recognition, 2010.
[75] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Trans. Commun. 15 (1) (1967) 52–60.
[76] B. Prosser, W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by support vector ranking, in: Proc. British Machine Vision Conf., 2010.
[77] P.M. Roth, M. Hirzer, M. Köstinger, C. Beleznai, H. Bischof, Mahalanobis Distance Learning for Person Re-identification, Springer London, Cambridge, London, 2014, pp. 247–267.
[78] K.Q. Weinberger, L.K. Saul, Fast solvers and efficient implementations for distance metric learning, in: Proc. Int’l Conf. on Machine Learning, 2008.
[79] M. Hirzer, P.M. Roth, H. Bischof, Person re-identification by efficient imposter-based metric learning, in: Proc. IEEE Int’l Conf. on Advanced Video and Signal-Based Surveillance, 2012.
[80] M. Köstinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2012.
[81] I.B. Barbosa, M. Cristani, A. Del Bue, L. Bazzani, V. Murino, Re-identification with RGB-D sensors, in: European Conference on Computer Vision Workshop on Re-Identification, Springer, Cham, 2012, pp. 433–442.


Social network inference in videos


Ashish Gupta, Alper Yilmaz Ohio State University, Columbus, OH, United States

11.1 INTRODUCTION

Video analytics has made considerable progress in the past decade. Nevertheless, progress in video understanding, that is, in extracting high-level semantics in the form of relations between people in a video, has been comparatively limited. The explosive growth of social media makes this an interesting task. In this chapter, we discuss an effective approach toward learning interactions between people, building social networks, inferring social groups, and discovering the leader of each of these groups in a video from a sociological perspective. Our principal contributions include inferring the associations between low-level visual and auditory features and social relations by utilizing machine learning algorithms such as support vector regression (SVR) and Gaussian processes. The inferred social network is subsequently analyzed to discover communities of people and to identify the leader of each community. These are arguably two of the most relevant objectives in social network analysis. In addition, as an extension to the basic framework, we discuss the relationship between visual concepts and social relations explored in [1]. Here, the visual concepts can be considered a mid-level visual representation for inferring social relations, and they are compared with the features used in the basic framework.

The analysis of scenes from video has been a subject of intense research. Researchers seek to understand a scene by analyzing patterns in motion, which is estimated by computing the trajectories of various objects in the scene [2–7]. However, the majority of these research efforts are limited in their ability to infer semantic information from video. They conduct elementary analysis, such as clustering motion trajectories or understanding the actions of tracked objects [8–12], where each object is tracked separately from the others. They do not attempt to understand group behavior in the pattern of multiple trajectories. Broadly speaking, computer vision researchers have not investigated video content from the perspective of sociological relations between the entities in the video, which would provide an understanding of the actions of people in terms of their mutual relations.

Academic Press Library in Signal Processing, Volume 6. © 2018 Elsevier Ltd. All rights reserved.



CHAPTER 11 Social network inference in videos

Within the purview of existing research on action recognition and analysis, inferred social relations, used in conjunction with other visual features, can potentially aid in the disambiguation of complex events in video [13, 14] by providing relevant contextual information. Social networks have evolved to become the data structure of choice to represent and aid in the analysis of social relations between various entities in a sociological context [15]. Social relations are modeled by a network or graph structure consisting of nodes and edges. The nodes represent individual entities within the network, and the edges denote the relations between these entities. When analyzing a video to infer social relations between the people in it, the entities are the actors or characters. The graph built from analyzing a video then serves as a basis for discovering social groups in that network. A social group is also referred to as a community, whose members share a common mutual social relationship. These communities are typically discovered from the connectivity between the nodes in the graph, which are the actors in the video, using social network analysis algorithms. The modularity algorithm [16] is a popular example. Social network analytics has also drawn attention from research communities in data mining [17, 18] and content analysis of surveillance videos [19].

In order to understand and model complex social relations between people in video, we turned to feature films. Such films provide plenty of scenarios for social interaction between various characters. In addition, visual and auditory features are always available. Since a film is a continuous video spanning different contextual situations, we must first segment the film to minimize contextual mixing. The film is segmented into shots and scenes [20]. Then the actors' occurrences in scenes are detected using face recognition [21].

What makes our framework unique is that it can be applied to a social network that has adversarial relations in addition to regular associations between actors. This is a significant topic in sociology that has not received adequate attention [15], perhaps in part because of the inherent challenge of modeling adversarial relations simultaneously with friendly relations. We consider an adversarial social network to constitute two distinct communities. The actors in each community have friendly social relations with other actors in the same community and adversarial relations with actors in the other community. We employ visual and auditory features in our framework for a quantitative analysis of the potential grouping of actors in each scene. This provides a soft constraint between the actors. These soft constraints are subsequently combined to compute an affinity between pairs of actors. In this social network, communities are computed using a generalized modularity principle on the inter-actor affinity matrix [22]. Next we explore the concept of leadership. Social communities tend to have a central figure, a leader or a hub: the person who has the strongest connectivity to most of the other people in that community and who is, consequently, also the most influential. We consider this influential person the leader of that community and identify the leader using eigenvector centrality [23]. Consider the feature film Troy (2004); the narrative has two communities, the Greeks and the Trojans.


FIG. 11.1 Social network graph of the actors in the film Troy (2004), constructed from the affinity matrix of film scenes and actors.

Each actor in each community contributes to the group, but certain actors, Achilles for the Greeks and Paris for the Trojans, are the most influential, and their social interactions are the most important in the film, making them community leaders from a social network perspective. Our learning-based framework for computing social networks is shown in Fig. 11.1. In this example, we consider the film Troy. There are ten actors whom we use to build the social network. As we process the frames of the movie sequentially, we analyze each frame for the occurrence of one or more of the ten actors. Actors who co-occur share a social relation by virtue of their presence in the same scene. This social relation can be friendly or adversarial. We record it by scoring the appropriate entry of the scene-actor matrix. After the entire movie is processed, we use the scene-actor matrix to compute an affinity matrix between actors. Actors who occur in the same scene naturally share a stronger social affinity than two actors who rarely co-occur. The graph between actors, whose edge attributes are based on the affinity matrix, is later analyzed to infer social relationships.

The rest of this chapter is organized as follows. We begin with a review of related work in Section 11.2. Our approach to video shot segmentation is explained in Section 11.3. The computation of the scene-actor occurrence matrix using face recognition is described in Section 11.4. Next, we describe how low-level visual and auditory features are used to compute grouping information in Section 11.5. In Section 11.6, we detail how social networks are learned from videos. In Section 11.7,




we describe our methodology used to analyze these social networks. We evaluate our approach on a set of films in Section 11.8. The association between visual concepts and social relations is explored in Section 11.9. We summarize this chapter in Section 11.10.
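The pipeline outlined in the introduction (scene-actor matrix, actor affinity matrix, leader by eigenvector centrality) can be sketched with NumPy as follows. This is a deliberately simplified illustration and not the framework of this chapter: it assumes a binary scene-actor matrix, uses plain co-occurrence counts as affinity, and ignores both the audiovisual features and the friendly/adversarial distinction. All function names and the toy data are hypothetical.

```python
import numpy as np

def cooccurrence_affinity(scene_actor):
    """Actor-by-actor affinity from a (scenes x actors) binary occurrence matrix."""
    A = np.asarray(scene_actor, dtype=float)
    W = A.T @ A                  # W[a, b] = number of scenes shared by actors a and b
    np.fill_diagonal(W, 0.0)     # drop self-affinity
    return W

def eigenvector_centrality(W, iters=200, tol=1e-10):
    """Dominant eigenvector of a nonnegative symmetric W by power iteration.

    Iterating with W + I keeps the eigenvectors of W but avoids the
    oscillation that plain power iteration shows on bipartite graphs.
    """
    n = W.shape[0]
    c = np.full(n, 1.0 / np.sqrt(n))
    for _ in range(iters):
        nxt = W @ c + c
        nxt /= np.linalg.norm(nxt)
        if np.linalg.norm(nxt - c) < tol:
            return nxt
        c = nxt
    return c

# Toy example: 4 scenes, 4 actors; actor 0 appears in every scene.
scene_actor = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
W = cooccurrence_affinity(scene_actor)
centrality = eigenvector_centrality(W)
leader = int(np.argmax(centrality))   # index of the most central actor
```

In the toy data the actor who shares scenes with everyone else receives the highest centrality, which matches the intuition of a community hub described above.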

11.2 RELATED WORK

The ideas behind social networks are neither new nor confined to the field of sociology. In one form or another, they have been used in other fields to solve problems in their respective domains, including data mining, computer vision, and multimedia analysis [17, 19, 24]. In this section, we elaborate on various related approaches and where they sit in relation to our problem domain.

Understanding relations between people in surveillance video is important for predicting behavior. To detect possible groups of people, the work in [19] employs traditional methods of social network analysis. That paper conjectures that the interactions between people are based on their physical proximity. The conjecture that physical proximity is correlated with social proximity is not guaranteed to hold. Nevertheless, it provides a quantitative measure of social relations between actors in a video. It uses traditional modularity [16] to detect potential groups. Ge et al. [25] extend this distance-based measure of social proximity to include the relative velocity between tracked objects. They then use clustering to discover groups in a crowd, on the premise that people belonging to the same group, even while intermixed with other people, tend to have correlated velocity vectors. The work in [24] generates a social network from co-occurrences of people in the video, without attributing them to low-level video features. In that work, relations are limited to friendly ones. This means that there can only be one social group, with various degrees of affinity between the actors. Such a single group can be easily detected using regular clustering on the graph structure. This co-occurrence concept is related to our framework; however, we go further and relate co-occurrences to visual and auditory features. Importantly, we utilize these features to quantify the types of social interactions between various actors and communities.

Some recent work in computer vision, for example [26], has tried to identify interactions between objects in terms of categories from video using Markov random fields. However, its objective is not to estimate social relationships for a group of individuals. In a similar vein, the work in [27] addresses the task of identifying the social roles played by actors in a scene using weakly supervised conditional random fields. However, these authors did not leverage social network structure in their analysis. Some research in the field of data mining has recently explored approaches to mining relations in a network using log entries as raw data. In [17, 31], the research groups analyzed social networks and their dynamics using Bayesian modeling of social networks and of the interactions among actors. Similar to the examples in the data-mining field, other researchers have also


utilized data on cell phone usage to estimate social networks [28, 29]. We note that this set of methods does not make use of visual data. Videos are a very rich source of information and are being increasingly adopted for communication in society. The use of videos to infer social relations is both viable and valuable for social network analysis. By employing audiovisual features in video, our basic framework creates a new avenue for understanding social interactions in videos. It is comparatively simple to extract social information from network log entries and mobile phone usage. Video data, however, pose a significant challenge for inferring social information using pattern recognition techniques.

The principal novelty of our framework, and its difference from the research reviewed here and from other work in the domain of sociology, lies in the approach we adopt to infer a social network. In particular, existing techniques define social interaction heuristically and construct a social network using these heuristics. By contrast, our approach is analogous to the human perceptual approach to inferring social relations in an observed scene. Imagine a scene in which numerous actors are involved in some activities. A human observer with no prior knowledge of the context of that scene would assume all actors to be equivalent and similarly related to the others. In other words, there is no bias toward any pair of actors. As the scene plays out, the observer begins conjecturing the affinity between pairs of actors and uses it to roughly construct a social network. Analogous to human intuition, we observe interactions and learn communities in the network from observations in a scene without prior bias. It is our conviction that this approach will benefit other application domains in computer vision, for example, the automated analysis of video of a professional meeting, where modeling the actions of individuals and inferring high-level semantic relations between the attendees is important [32, 33]. We have summarized some of these works in Table 11.1.

Table 11.1 Our Framework in Context of Related Work on Inferring Social Network Relations

| Data Source | Observed Features | Construction Method | | References |
|---|---|---|---|---|
| From interaction logs | Social interactions | Simple connections | On collected social data, e.g., emails | [17, 31] |
| From cell phone usage | Call data, etc. | Simple connections | On collections of mobile devices | [28, 29] |
| From videos (existing) | Tracked people | Proximity heuristics | On surveillance videos | [19, 25] |
| From videos (this chapter) | Audiovisual cues | Learning approaches | On videos with training labels | [1, 30] |



CHAPTER 11 Social network inference in videos

11.3 VIDEO SHOT SEGMENTATION

In order to record occurrences of actors and compute audiovisual features, we must first divide the film into shorter segments. Ideally, a video shot corresponds to the view from one camera of the same set of actors. When there is a change of camera, even though there is no change of scene, we consider it a change in shot. For example, consider a dialog between two actors. In the first shot, the camera is behind actor A and shows his back along with the face of actor B. The camera may move a little, zoom in or out, or pan slightly, but so long as the view does not introduce a new actor and actor B is continuously visible, there is no change in video shot. If the video next shows the view from a camera behind actor B and the face of actor A, then there has been a shot transition at this point, since the actor whose face is visible has changed. This will be reflected in the scene-actor two-mode social network. A film scene typically consists of several shots, wherein the camera view alternates between two actors in a dialog. It also includes the scenario where the camera slowly pans or zooms out and thereby introduces new actors to the view. The principal factor that defines a shot transition is a significant change in scene content. We measure this using dense optical flow. However, motion of actors within a shot, or small motion of the camera itself, also amounts to a perceptible change in optical flow. Therefore, we empirically determine a threshold on cumulative motion across frames of the video. Fig. 11.2 shows the cumulative motion value for frames in a video of the film Troy. Note that a low threshold allows for large motion




FIG. 11.2 Video segmentation into shots based on a cumulative measure of activity in the video. At the point of transition between shots, there is a huge variation in the visual content of frames, since different video shots have different foreground and background. Because there is activity within a single shot, and sometimes the difference in background between shots may be very small (e.g., the background is sky), we empirically determined an activity threshold (shown by the threshold bar across the graph). The time at which video activity exceeds this threshold is marked as a shot transition point.


FIG. 11.3 Video segmentation into shots. Figure shows sequence of shots from a scene in film Troy. The images in the top row are the first frame in each segmented shot and the images in the bottom row are the last frame in that shot.

within a single shot to be segmented into multiple shots. This is undesirable because large motion sometimes occurs in films with a busy scene where multiple actors are simultaneously moving, such as a dance floor scene or a battle scene. The hue-saturation-value (HSV) color-based score for each frame, shown in the top of Fig. 11.2, can also be used to reject false-positive shot transitions, since the HSV values would typically change when the shot changes. The results of our shot segmentation approach are shown in Fig. 11.3. We show the first frame of each shot in the top row and the last frame of that shot in the bottom row. We are successful at segmenting the video into shots where each shot has an unaltered set of actors engaged in some activity and the transition frame is properly identified.
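The thresholding step can be illustrated with a minimal sketch. It assumes a per-frame activity score (for example, summed optical-flow magnitude) has already been computed; the function name `shot_boundaries` and the `min_gap` parameter are hypothetical and not part of the chapter's implementation. The HSV-based check is not included.

```python
import numpy as np

def shot_boundaries(activity, threshold, min_gap=1):
    """Return frame indices at which a shot transition is assumed.

    activity[t] is a precomputed per-frame motion score (assumption).
    min_gap is a hypothetical parameter: candidate transitions closer
    than min_gap frames to the last accepted one are merged into it.
    """
    activity = np.asarray(activity, dtype=float)
    candidates = np.flatnonzero(activity > threshold)
    boundaries = []
    for t in candidates:
        # Accept a candidate only if it is far enough from the last boundary.
        if not boundaries or t - boundaries[-1] >= min_gap:
            boundaries.append(int(t))
    return boundaries

# Toy activity signal with two bursts of high motion.
activity = [0.1, 0.2, 0.1, 5.0, 4.8, 0.3, 0.2, 6.1, 0.1]
print(shot_boundaries(activity, threshold=3.0, min_gap=2))
```

Lowering `threshold` in this toy example yields many more boundaries, which mirrors the over-segmentation described above for busy scenes.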

11.4 ACTOR RECOGNITION

In order to compute a scene-actor two-mode social network, we detected and recognized faces in the video frames. Instead of recognizing the entire body of an actor, we chose to restrict recognition to the actor’s face only. This choice was based on the intuition that a typical actor appears in multiple different costumes, so the only consistent visual feature is the actor’s face. We used the Local Binary Pattern (LBP) cascade descriptor to detect candidate face regions in a frame. We also used LBP for modeling a face. Typically, face recognition works well when we have a full frontal image of the face. Changes in pose toward a profile pose make accurate face recognition challenging. The use of costumes that partially occlude the face also adds to the challenges. We acquired the training images for each actor’s face from the Google Image Search engine. We queried the search engine with three types of queries for each actor: the name of the movie and the name of the actor; the name of the movie and the name of the character played by the actor; and the name of the actor alone. These sets of downloaded images were collated into one raw image set. We used the LBP detector to identify candidate face bounding boxes in the raw image set. A single image can contain one or more instances of a face. The cropped bounding boxes in all images were resized for consistency. Since the LBP cascade detector returns several false-positive face images, we introduce a further pruning step to remove images that do not contain a face. We converted the resized images to the YCrCb color space, which works well for skin




FIG. 11.4 Detecting actor occurrence in video frames. Figure shows examples of face recognition of the actors in film Troy. Each image shows one or more bounding boxes centered on a detected face in that video frame. The name of the most likely actor and associated confidence score is shown above the box.

color-based operation. We detected contours in the bounding box. A contour corresponding to a human face can be expected to have about three-fourths of the area of the bounding box itself. We used this heuristic to prune those images that were false positives. We trained our face recognition classifier, where each actor was considered as a class. The face recognizer associates a confidence score with each actor. The actor with the highest score is the predicted label. We show examples of our face recognition implementation for frames in the film Troy in Fig. 11.4. The figure shows that our face recognition can handle small changes in pose, partial occlusion, large changes in background, and various ambient lighting conditions. Since large changes in pose lead to misclassification, we pursue an aggressive pruning strategy where only those bounding boxes with a confidence score higher than an empirically determined threshold value are retained. As we are building a scene-actor occurrence matrix, we only require one reliable detection anywhere in a scene. As a scene is composed of multiple shots and a shot is composed of multiple video frames, we can afford to commit Type I statistical error but not Type II error.¹
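The two pruning rules (contour area relative to the bounding box, and a confidence threshold) can be sketched as follows. This is only an illustration: the function names and all three numeric thresholds are hypothetical, since the chapter states only that a face contour covers about three-fourths of the box and that the confidence threshold was determined empirically.

```python
def keep_face(box_area, contour_area, confidence,
              min_ratio=0.6, max_ratio=0.9, min_confidence=0.8):
    """Decide whether a candidate face detection is retained.

    min_ratio, max_ratio and min_confidence are hypothetical values
    bracketing the "about three-fourths" area heuristic and standing in
    for the empirically determined confidence threshold.
    """
    ratio = contour_area / box_area
    return min_ratio <= ratio <= max_ratio and confidence >= min_confidence

def actors_in_scene(detections):
    """Actors with at least one retained detection anywhere in the scene.

    detections: iterable of (actor, box_area, contour_area, confidence).
    """
    return {actor for actor, box, contour, conf in detections
            if keep_face(box, contour, conf)}

detections = [
    ("A", 100.0, 75.0, 0.90),  # plausible face, confident -> kept
    ("B", 100.0, 20.0, 0.95),  # contour too small -> pruned
    ("C", 100.0, 75.0, 0.50),  # low confidence -> pruned
    ("A", 100.0, 74.0, 0.30),  # pruned, but A already has one reliable detection
]
print(actors_in_scene(detections))
```

Because a single retained detection marks the actor as present in the scene, discarding many uncertain boxes costs little as long as one reliable box survives.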

¹ In statistical hypothesis testing, a Type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a Type II error is incorrectly retaining a false null hypothesis (a “false negative”).

11.5 LEARNING TO GROUP ACTORS

We begin with the condition where there is no known relation between any actors in the video. We only have low-level visual features available to us. As there are no discernible social communities, we cannot explicitly label any community. We extract low-level visual features from videos. The kernels on features at scene level


provide us criteria for grouping actors using regression we have learned from other videos. Unlike several other approaches, our mapping strategy is data driven and provides a flexible and extensible approach. This approach can incrementally use new features as they are observed in the video. Say the video we consider is composed of scenes $s_1, s_2, \ldots, s_M$, each of which contains a set of actors and has an associated grouping criterion $\Gamma_i \in [-1, +1]$. The grouping criteria are used to decide the affinity between actors who occur in the same scene. Two actors belong to the same community if $\Gamma_i > 0$, or to different communities if $\Gamma_i < 0$. The absolute value of $\Gamma_i$ indicates how definitive the association of actors and community is. Next, we detail our approach to estimating such grouping criteria from low-level features observed in a video. We proceed with the assumption that there is a difference in the nature of interactions that occur between a pair of actors belonging to the same community as compared with the interaction between a pair of actors from different communities. This conjecture facilitates an anonymous grouping of actors into their respective communities, because the difference in interaction is a function of their community membership rather than any property unique to an actor. This serves as a weak grouping criterion. As a consequence, any inferred community labels from one set of training videos can be propagated to a novel video. It follows that actions of actors in a video described by low-level audiovisual features are positively correlated with the type of social relationship among the actors. In this context, we define an activity as an expression of social relationship. Stated another way, the association between activities and actors provides a distinct feature set that can be used to infer whether members of a single community or of different communities cooccur in the same scene. For example, tutors and students in a video in a school can be observed to interact in a distinct manner within and across the two communities [34, 35]. Put simply, low-level audiovisual features in a scene are correlated with the predominant social expression in that scene. While the gamut of social interactions is very large, the underlying conjecture in this work is that the nature of activities between two antipodal communities is sufficiently distinct to allow for effective separation of actors into two distinct groups. The friend and foe interactions are encoded using low-level features into friendly and adversarial relations. We define a scene as the smallest segment of a video such that one scene contains one event, where an event is an expression of one social relation. In other words, an event can be considered as associated with the expression of either a friendly or an adversarial relationship. Consequently, low-level features generated from the video and audio of each scene can be used to quantify adversarial and friendly scene features. To help us disambiguate, we note that film directors typically follow a certain structure for conveying a story and dramatic content. This is referred to as cinematic principles in film literature, and serves to emphasize the adversarial content in scenes for dramatic effect. Accordingly, adversarial scenes have sudden and surprising changes in visual and auditory content, which are effective in conveying an atmosphere of conflict and tension. In contrast, friendly scenes have gradual changes in visual and auditory




content to convey a sense of calm and cooperation. Therefore, the visual and auditory features, which quantify friendly/adversarial scene content, can be extracted by analyzing the audiovisual disturbances in a video [36].

11.5.1 VISUAL FEATURES

To measure visual disturbance, we formulate our interpretation of cinematic dramatic principles in terms of the motion field from all actors in a scene. Regardless of actor identity and the specific social interaction between various actors, in an adversarial scene the motion field is distributed in multiple directions. We illustrate this with an example in Fig. 11.5. We compute the motion field using the optical flow distribution. We use a dense flow field generated by estimating optical flow at each pixel [37] in the frames extracted from the video. The other option is to use specific points that have relevant visual information in their local neighborhood. It should be noted that typical tracking implementations use the “good features to track” approach. This approach works well for tracking a few objects against a relatively uniform background. However, the degree of background activity in films is high because the camera viewpoint keeps changing. Therefore, we use a dense flow field with lower memory to reduce the erroneous effects of camera movements and changes in viewpoint. The example on the left in Fig. 11.5 shows an adversarial scene, where Hector and Ajax in Troy are fighting. In the orientation histogram and the quiver plot of motion flow fields, several bins of the orientation histogram have high values. In contrast, in the example on the right, where Achilles and Odysseus are talking, the motion flow field is in harmony and the orientation histogram is unimodal. Next, we need to quantify the flow field, specifically, the disturbance in the flow field. The entropy of the orientations of the flow field vectors provides a good aggregate measure of the degree of disturbance in the scene. As we stated previously, an adversarial scene will have a more isometric distribution of the flow field, which translates to a higher entropy of the histogram of orientations. In a friendly scene, the actors will typically be moving together, which means the flow field will be unidirectional and the histogram of orientations will be sparse with a unimodal distribution. The entropy of this histogram will be lower than that for the isometric orientation distribution of adversarial scenes. This simple but highly effective idea is illustrated using examples in Figs. 11.6 and 11.7. The computed histograms of optical flow vectors are weighted by the magnitude of motion. The lower-valued bins of the entropy histogram correspond to scenes with unidirectional motion and the higher-valued bins correspond to scenes with isometric directional motion. Our results corroborate our conjecture, as is evident in Fig. 11.8. The flow distributions in adversarial scenes tend to be isometrically distributed and thereby consistently have a bias toward higher entropy bins as compared to friendly scenes. We now have an established criterion for distinguishing between friendly and adversarial scenes.
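The per-frame entropy measure can be sketched as below, assuming a dense flow field is already available as an array of per-pixel displacements. The function name, the choice of 18 bins of 10 degrees over [0, 180), and the use of the natural logarithm are assumptions made for illustration; the chapter does not state these settings.

```python
import numpy as np

def flow_orientation_entropy(flow, n_bins=18):
    """Entropy of the magnitude-weighted orientation histogram of a flow field.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    n_bins=18 over [0, 180) degrees is an assumed binning.
    """
    flow = np.asarray(flow, dtype=float)
    dx = flow[..., 0].ravel()
    dy = flow[..., 1].ravel()
    magnitude = np.hypot(dx, dy)
    # Fold orientations into [0, 180) degrees.
    angle = np.degrees(np.arctan2(dy, dx)) % 180.0
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0),
                           weights=magnitude)
    total = hist.sum()
    if total == 0:
        return 0.0  # no motion at all
    p = hist / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Unidirectional motion: every pixel moves to the right.
uniform_flow = np.zeros((4, 4, 2))
uniform_flow[..., 0] = 1.0

# Motion spread evenly over all orientation bins.
angles = np.radians(np.arange(5, 180, 10))
spread_flow = np.stack([np.cos(angles), np.sin(angles)], axis=-1)[:, None, :]

print(flow_orientation_entropy(uniform_flow), flow_orientation_entropy(spread_flow))
```

Collecting this value for every frame and histogramming the results gives an entropy histogram of the kind described for a film or scene.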

[Figure 11.5 panels: "Histogram of optical flow for Troy" (two panels); horizontal axis: orientation angle, 0 to 170 degrees.]

FIG. 11.5 Social interaction is correlated to optical flow pattern in video. Figure shows examples of scenes from film Troy with dense optical flow in a sample frame and its associated histogram of orientations of motion vectors.


FIG. 11.6 Computation of histogram of entropies of frames from video. Figure shows example of film Troy. Based on social interaction, each frame has its own entropy. The histogram of these entropies for all frames in the film Troy is shown on the right. For comparisons we utilize normalized histograms.

[Figure 11.7 panels: "Entropy histogram for film Troy"; "Entropy histogram for film YearOne"; "Entropy histogram for film" (overlay of Troy and YearOne).]

FIG. 11.7 Entropy histogram of films Troy and Year One. The aggregate entropy histograms of these two different kinds of films are fairly similar. Consequently, a normalized entropy histogram is a descriptor that is unbiased by the film itself. It can therefore be used consistently for different videos without the need to modify the descriptor based on the film.

11.5.2 AUDITORY FEATURES

We extract the audio track from the film, so that for each scene we have both the video and its corresponding audio track. Typically, there is a correlation between the nature of the audio and the visual disturbance in a scene: adversarial scenes will typically contain more dramatic sounds. In keeping with the atmosphere of tension in such scenes, the voices of actors will be at a higher pitch. Consequently, auditory features extracted from the audio track can be used in conjunction with the visual features to improve the performance of our framework. To encode relevant properties of the audio track, we use features in both the time and frequency domains. We utilize the types of auditory features discussed in [36, 38]. The features we consider in this chapter are

1. Energy peak ratio $EPR = p/S$, where $p$ is the number of energy peaks and $S$ is the length of an audio frame.
2. Energy entropy $EE = -\sum_{i=1}^{K} e_i \log e_i$, where an audio frame is divided into $K$ subwindows. For subwindow $i$, the energy $e_i$ is computed.
3. Short-time energy $SE = \sum_{i=1}^{S} x_i^2$, where $S$ is the length of an audio frame.



[Figure 11.8 panels: "Friendly scene in Troy"; "Adversarial scene in Troy"; "Histogram of entropy of scenes in the film Troy" (legend: Friendly scene, Adversarial scene). Horizontal axis: entropy; vertical axis: frequency.]

FIG. 11.8 Entropy histogram-based description of friendly and adversarial social relationship in video. The top-left histograms in green show the entropy histogram of scenes in Troy with friendly social interaction. The bottom-left histograms in red pertain to scenes with adversarial social interaction. It is fairly evident that the histogram shape is correlated to the social interaction. For comparison, the histogram on the right shows overlay of histograms from three friendly and three adversarial scenes in Troy. The friendly scene histograms have a bias toward lower-entropy bins, whereas the adversarial scenes have a bias toward high-entropy bins.


4. Spectral flux $SF = \frac{1}{KF} \sum_{i=2}^{K} \sum_{j=1}^{F} (\varepsilon_{i,j} - \varepsilon_{i-1,j})^2$, where $\varepsilon_{i,j}$ is the spectral energy at subwindow $i$ and frequency channel $j$.
5. Zero crossing rate $ZCR = \frac{1}{2S} \sum_{i=1}^{S} |\mathrm{sgn}(x_i) - \mathrm{sgn}(x_{i-1})|$, where sgn stands for the sign function.

We compute these features for a sliding window that is 400 ms in length. The value of each feature used to populate the feature vector is the average of the corresponding feature over the duration of the scene. An example of these auditory features is shown in Fig. 11.5 for both friendly and adversarial scenes. The adversarial scenes have more peaks as compared to friendly scenes.
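The five features can be sketched for a single audio frame as follows. Several details are assumptions made for illustration, since the chapter does not specify them: energy peaks are counted as strict local maxima of the subwindow energies, subwindow energies are normalized to sum to one before computing the entropy, and frequency channels are formed by grouping FFT bins. The function name and default parameters are hypothetical.

```python
import numpy as np

def audio_frame_features(x, n_sub=8, n_channels=4):
    """Compute EPR, EE, SE, SF and ZCR for one audio frame x (1-D samples)."""
    x = np.asarray(x, dtype=float)
    S = x.size
    L = S // n_sub
    sub = x[:L * n_sub].reshape(n_sub, L)      # K equal-length subwindows
    e = (sub ** 2).sum(axis=1)                 # subwindow energies e_i

    # Energy peak ratio: peaks taken as strict local maxima of e (assumption).
    peaks = int(np.sum((e[1:-1] > e[:-2]) & (e[1:-1] > e[2:])))
    epr = peaks / S

    # Energy entropy on normalized energies (normalization is an assumption).
    total = e.sum()
    p = e[e > 0] / total if total > 0 else np.array([])
    ee = float(-(p * np.log(p)).sum())

    # Short-time energy over the whole frame.
    se = float((x ** 2).sum())

    # Spectral flux: squared change of per-channel spectral energy
    # between consecutive subwindows, averaged over K * F.
    power = np.abs(np.fft.rfft(sub, axis=1)) ** 2
    bands = np.array_split(power, n_channels, axis=1)
    eps = np.stack([b.sum(axis=1) for b in bands], axis=1)   # shape (K, F)
    sf = float((np.diff(eps, axis=0) ** 2).sum() / (n_sub * n_channels))

    # Zero crossing rate; the first sample has no predecessor, so S-1 terms.
    zcr = float(np.abs(np.diff(np.sign(x))).sum() / (2 * S))

    return {"EPR": epr, "EE": ee, "SE": se, "SF": sf, "ZCR": zcr}

flat = audio_frame_features(np.ones(64))
alternating = audio_frame_features(np.tile([1.0, -1.0], 32))
print(flat)
print(alternating["ZCR"])
```

Averaging each returned value over all frames of a scene would give the scene-level auditory descriptor described above.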

11.5.3 GROUPING CRITERIA

We compute visual and auditory features for every scene. We quantize the features to five bins, and so we have two vectors per scene: a five-dimensional visual feature vector and a five-dimensional auditory feature vector. We use both of these for each scene to compute a grouping criterion $\Gamma_i \in [-1, +1]$ for that scene $s_i$. To this end we use support vector regression (SVR). Specifically, we use a radial basis function (RBF) kernel for both the visual and auditory feature vectors, which leads to two kernel matrices $K_v$ and $K_a$, respectively. The bandwidths of these two kernels are determined using cross-validation. The joint kernel is the product of these two kernels, given as

$$K(u, v) = K_v(u, v)\, K_a(u, v).$$

The dual criteria in our SVR implementation are to find a function $g(\cdot)$ that has a maximum deviation of $\epsilon$ from the labeled targets for the training data and is also as flat as possible. The decision function can be written as

$$\Gamma_i = g(s_i) = \sum_{j=1}^{L} (\alpha_j - \alpha_j^{*})\, K_{l_j, i} + b,$$

where the coefficient $b$ is the offset, $\alpha_j$ and $\alpha_j^{*}$ are the Lagrange multipliers for the labeling constraints, $L$ is the number of labeled examples, and $l_j$ is the index of the $j$th labeled example. In our problem domain, the joint kernel together with training video scenes and their grouping criteria $\Gamma_i = +1$ (scene with members of only one community) and $\Gamma_i = -1$ (scene with members from different communities) leads to grouping constraints for a novel video. This is achieved by estimating the corresponding grouping criteria $\Gamma_i$ using the regression learned from labeled video scene examples from other videos in the training set.
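The joint kernel and the decision function can be sketched with NumPy alone. The RBF parameterization `exp(-d^2 / (2 * bandwidth^2))`, the function names, and the placeholder coefficients are assumptions for illustration; in practice the coefficients would come from a trained SVR solver that accepts a precomputed kernel.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth):
    """RBF kernel matrix between rows of X and rows of Y (assumed form)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def joint_kernel(Vx, Ax, Vy, Ay, bw_visual, bw_audio):
    """Product of the visual and auditory RBF kernels."""
    return rbf_kernel(Vx, Vy, bw_visual) * rbf_kernel(Ax, Ay, bw_audio)

def grouping_criteria(K_new_vs_train, coef, b):
    """Evaluate g(s_i) = sum_j coef[j] * K[i, l_j] + b.

    coef[j] stands for (alpha_j - alpha_j^*) of an already trained SVR;
    here it is a placeholder, not a learned value.
    """
    return K_new_vs_train @ np.asarray(coef, dtype=float) + b

rng = np.random.default_rng(0)
V = rng.random((4, 5))   # four scenes, five-bin visual vectors
A = rng.random((4, 5))   # four scenes, five-bin auditory vectors
K = joint_kernel(V, A, V, A, bw_visual=1.0, bw_audio=2.0)
gamma = grouping_criteria(K, coef=np.zeros(4), b=0.5)
print(K.shape, gamma)
```

Because the product of two valid kernels is itself a valid kernel, the joint matrix can be used wherever a single kernel matrix is expected.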




11.6 INFERRING SOCIAL COMMUNITIES

Now that we have formulated grouping criteria using low-level audiovisual features in film scenes, we turn next to inferring communities in the social network graph data structure. With regard to actors cooccurring in a scene, we hold the view that frequency of cooccurrence is correlated with shared membership. That is, actors of the same community cooccur more frequently. The reasoning behind our conjecture is that a pair of actors with a friendly social relation will cooccur in friendly scenes, but they will also cooccur in adversarial scenes, which involve conflict between multiple people. On the other hand, a pair of actors that have an adversarial social relation will cooccur exclusively in adversarial scenes. This is reflected in the grouping criteria: the higher the number of cooccurrences for community members, the higher the value of the positive grouping criteria compared to the negative grouping criteria.

11.6.1 SOCIAL NETWORK GRAPH

We denote the occurrence of an actor $c_i$ in a video by a boolean appearance function $\Lambda_i : T \to \{0, 1\}$ based on time, where the duration of the video is $T \subset \mathbb{R}^{+}$. Naturally, in implementation we only have access to its sampled version. Let the sampling period be of length $t$ seconds. We assume the sampling satisfies the Nyquist sampling theorem: as long as $t \le \min_i \{1/(2B_i)\}$, where $B_i$ is the highest frequency of actor $i$’s appearance function, information regarding both the continuous appearance and the actor’s cooccurrences can be determined from those discrete samples. In our formulation, a video is considered to consist of $M$ nonoverlapping scenes, where each scene $s_i$ contains social interactions among the actors occurring in that scene. We approximate the appearance functions of actors by a scene-actor relation matrix denoted by $A = \{A_{i,j}\}$, where $A_{i,j} = 1$ if there exists $t \in L_i$, where $L_i$ is the time interval of $s_i$, such that $\Lambda_j(t) = 1$. For a feature film this can be obtained by searching for mentions of the corresponding actor names in the film script. This representation is reminiscent of the actor-event graph in social network analysis [15]. Although the actor relations in $A$ can directly be applied to construct a social network, we shall quantitatively show that utilization of audiovisual features leads to a better social network representation. This should also be intuitively obvious, because audiovisual features give us a real-valued measure of the degree of affinity between different actors, whereas simple cooccurrence is a binary-valued feature. The actor social network is represented as an undirected graph $G(V, E)$ with cardinality $|V|$. In this graph, the nodes represent the actors

$$V = \{v_i : \text{node } v_i \text{ corresponds to actor } c_i\}$$

and the edges define the interactions between the actors

$$E = \{(v_i, v_j) \mid v_i, v_j \in V\}.$$

The graph $G$ is fully connected with an affinity matrix $K$ of size $|V| \times |V|$, because any two actors can potentially cooccur in any scene. The element in the affinity


matrix K(ci, cj) for two actors ci and cj is a real-valued score, which is decided by an affinity learning method. The values in the affinity matrix serve as the basis for social network analysis. This includes estimating social communities and also the leader in each of these estimated communities.
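The scene-actor relation matrix and the simple cooccurrence counts derived from it can be sketched as below. The interval-based input format and the function name are assumptions for illustration; the chapter obtains appearances from face recognition or from the film script.

```python
import numpy as np

def scene_actor_matrix(scene_intervals, appearances):
    """Build A with A[i, j] = 1 if actor j is visible at some time in scene i.

    scene_intervals: list of (start, end) times, one per scene.
    appearances: for each actor, a list of (start, end) intervals during
    which the actor's appearance function equals 1 (assumed input format).
    """
    A = np.zeros((len(scene_intervals), len(appearances)), dtype=int)
    for i, (s0, s1) in enumerate(scene_intervals):
        for j, spans in enumerate(appearances):
            # Open-interval overlap test between the scene and an appearance.
            if any(a0 < s1 and a1 > s0 for a0, a1 in spans):
                A[i, j] = 1
    return A

scenes = [(0, 10), (10, 25), (25, 40)]
appearances = [
    [(2, 5), (30, 35)],   # actor 0
    [(3, 8), (12, 20)],   # actor 1
    [(26, 39)],           # actor 2
]
A = scene_actor_matrix(scenes, appearances)
cooccurrence = A.T @ A    # entry (i, j): number of scenes shared by actors i and j
print(A)
print(cooccurrence)
```

The integer counts in `cooccurrence` carry no notion of whether a shared scene was friendly or adversarial, which is the information the real-valued affinity matrix is meant to add.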

11.6.2 ACTOR INTERACTION MODEL

Let $c_i$ be the actor indexed $i$, and let $\mathbf{f} = (f_1, \ldots, f_N)^T$ be the vector of community memberships containing $\pm 1$ values, where $f_i$ refers to the membership of $c_i$. Let $\mathbf{f}$ be distributed according to a zero-mean identity-covariance Gaussian process

$$P(\mathbf{f}) = (2\pi)^{-N/2} \exp\left(-\tfrac{1}{2}\mathbf{f}^T \mathbf{f}\right).$$

In order to model the information contained in the scene-actor relation matrix $A$ and the aforementioned grouping criteria of each scene $\Gamma_i$, we assume the following distributions:

1. If actors $c_i$ and $c_j$ cooccur in a friendly scene $k$ ($\Gamma_k \ge 0$), then $f_i - f_j \sim \mathcal{N}(0, 1/\Gamma_k^2)$.
2. If actors $c_i$ and $c_j$ cooccur in an adversarial scene $k$ ($\Gamma_k < 0$), then $f_i + f_j \sim \mathcal{N}(0, 1/\Gamma_k^2)$.

So, if $\Gamma_i = 0$, the constraint imposed by a scene is rendered inconsequential. This corresponds to the least confidence in the constraint. On the other hand, if $\Gamma_i = \pm 1$, the corresponding constraint becomes the strongest. Due to the nature of the distributions we use, none of the constraints is hard, which makes our model robust to prediction errors. Applying Bayes’ rule, the posterior probability of $\mathbf{f}$ given the constraints is defined in a continuous formulation as the following:

$$P(\mathbf{f} \mid A, \Gamma) \propto \exp\Bigg\{ -\tfrac{1}{2}\mathbf{f}^T\mathbf{f} - \sum_{i,j} \int_{t \in \{t:\, \Gamma(t) \ge 0\}} \Lambda_i(t)\Lambda_j(t)\, \frac{(f_i - f_j)^2\, \Gamma(t)^2}{2}\, dt - \sum_{i,j} \int_{t \in \{t:\, \Gamma(t) < 0\}} \Lambda_i(t)\Lambda_j(t)\, \frac{(f_i + f_j)^2\, \Gamma(t)^2}{2}\, dt \Bigg\}.$$

[...]

0, and $K'_{i,j} = 0$ for other entries. Next, we compute a complementary affinity matrix $K''$ with the condition: $K''_{i,j} = K_{i,j}$ for $K_{i,j} < 0$, and $K''_{i,j} = 0$ for other entries. The matrix $K''$ represents the degree of lack of relation between actors in the graph in terms of community memberships. Adopting the strategy in [22] and using $K'$ and $K''$, we formulate the max-min modularity criterion as $Q_{MM} = Q_{\max} - Q_{\min}$ for:

$$Q_{\max} = \frac{1}{2m'} \sum_{i,j} \left( K'_{i,j} - \frac{k'_i k'_j}{2m'} \right) (f_i f_j + 1) \triangleq \frac{1}{2m'} \sum_{i,j} B'_{i,j} (f_i f_j + 1),$$

$$Q_{\min} = \frac{1}{2m''} \sum_{i,j} \left( K''_{i,j} - \frac{k''_i k''_j}{2m''} \right) (f_i f_j + 1) \triangleq \frac{1}{2m''} \sum_{i,j} B''_{i,j} (f_i f_j + 1),$$

where $m' = \frac{1}{2}\sum_{i,j} K'_{i,j}$, $k'_i = \sum_j K'_{i,j}$, $m'' = \frac{1}{2}\sum_{i,j} K''_{i,j}$, and $k''_i = \sum_j K''_{i,j}$, and the term $\frac{k'_i k'_j}{2m'}$ represents the expected edge strength between the actors $c_i$ and $c_j$ [16]. Based on this observation, we note that $K'_{i,j} - \frac{k'_i k'_j}{2m'}$ measures how much stronger the connection between two actors is than what would be expected between them, and serves as the basis for keeping the two actors in the same community. In our formulation, the max-min modularity $Q_{MM}$ arises from the conditions for a good network division: the connectivity weight between communities should be smaller than expected, and the assignment of actors not friendly to each other to the same community should be minimized. These conditions can be realized by maximizing $Q_{MM}$. Using standard eigenvector analysis, the eigenvector $u$ of $\frac{1}{2m'}B' - \frac{1}{2m''}B''$ with the largest eigenvalue maximizes a relaxed version of $Q_{MM}$. The resulting eigenvector solution contains real values, and we threshold them at the 0 level to obtain the desired community assignments for the actors. We let $f_i = +1$ if $u_i \ge 0$, and $f_i = -1$ otherwise.
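The spectral relaxation of the max-min modularity can be sketched as follows for a given affinity matrix. Two points are assumptions made for this illustration: $K'$ is taken as the positive part of $K$, and the magnitudes of the negative entries are used for $K''$ (the normalized matrix $\frac{1}{2m''}B''$ is the same whichever sign convention is used, since the sign cancels in every term). The function name is hypothetical.

```python
import numpy as np

def max_min_partition(K):
    """Two-community assignment from the leading eigenvector of
    B'/(2m') - B''/(2m''), thresholded at zero."""
    K = np.asarray(K, dtype=float)
    K_pos = np.where(K > 0, K, 0.0)     # assumed K': positive affinities
    K_neg = np.where(K < 0, -K, 0.0)    # magnitudes of negative affinities

    def normalized_modularity(W):
        # Returns B / (2m) with B = W - k k^T / (2m) and 2m = sum of all weights.
        two_m = W.sum()
        if two_m == 0:
            return np.zeros_like(W)
        k = W.sum(axis=1)
        return W / two_m - np.outer(k, k) / two_m ** 2

    M = normalized_modularity(K_pos) - normalized_modularity(K_neg)
    M = (M + M.T) / 2.0                 # guard against asymmetric input
    eigenvalues, eigenvectors = np.linalg.eigh(M)   # ascending eigenvalues
    u = eigenvectors[:, -1]             # eigenvector of the largest eigenvalue
    f = np.where(u >= 0, 1, -1)
    return f, u

# Actors 0 and 1 are friendly, actors 2 and 3 are friendly,
# and all cross-group affinities are adversarial.
K = np.array([
    [0.0, 1.0, -1.0, -1.0],
    [1.0, 0.0, -1.0, -1.0],
    [-1.0, -1.0, 0.0, 1.0],
    [-1.0, -1.0, 1.0, 0.0],
])
f, u = max_min_partition(K)
print(f)
```

The overall sign of an eigenvector is arbitrary, so only the grouping (which actors share a label) is meaningful, not which group receives +1.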

11.7.2 ESTIMATING COMMUNITY LEADER

Subsequent to the assignment of actors to communities, we next estimate the leader of each community. We do this by analyzing the centrality of each actor in the community. The literature in sociology defines the centrality of a member in a social network by its degree of betweenness [40]. Instead of this definition of centrality, we adopt a new measure which we refer to as the Eigen centrality [23]. We do this as it aligns well with our approach to community estimation. Let the centrality score $x_i$ for the $i$th actor be proportional to the sum of the scores of all actors that are connected to it: $x_i = \frac{1}{\lambda} \sum_{j=1}^{N} K'_{i,j} x_j$, where $N$ is the total number of actors in the video and $\lambda$ is a constant. It follows from this notation that the centralities of actors satisfy $K' x = \lambda x$ in vector form. It can be shown that the eigenvector with the largest eigenvalue provides the desired centrality measure [23]. Therefore, if we let the eigenvector of $K'$ with the largest eigenvalue be $v$, the leaders of the two communities are given by $\arg\max_{i:\, u_i \ge 0} v_i$ and arg max i: ui