Multi-Level Bayesian Models for Environment Perception [1st ed. 2022]


Table of Contents
Acknowledgements
Contents
Acronyms and Notations
Abbreviations and Concepts
General Notations Used in the Book
Specific Notations Used in MRF/CXM Models
Specific Notations Used in MPP Models
1 Introduction
2 Fundamentals
2.1 Measurement Representation and Problem Formulations
2.2 Markovian Classification Models
2.2.1 Markov Random Fields, Gibbs Potentials, and Observation Processes
2.2.2 Bayesian Labeling Approach and the Potts Model
2.2.3 MRF-Based Image Segmentation
2.2.4 MRF Optimization
2.2.5 Mixed Markov Models
2.3 Object Population Extraction with Marked Point Processes
2.3.1 Definition of Marked Point Processes
2.3.2 MPP Energy Functions
2.3.3 MPP Optimization
2.4 Methodological Contributions of the Book
3 Bayesian Models for Dynamic Scene Analysis
3.1 Dynamic Scene Perception
3.2 Foreground Extraction in Video Sequences
3.2.1 Related Work in Video-Based Foreground Detection
3.2.2 MRF Model for Foreground Extraction
3.2.3 Probabilistic Model of the Background and Shadow Processes
3.2.4 Microstructural Features
3.2.5 Foreground Probabilities
3.2.6 Parameter Settings
3.2.7 MRF Optimization
3.2.8 Results
3.2.9 Summary and Applications of Foreground Segmentation
3.3 People Localization in Multi-camera Systems
3.3.1 A New Approach on Multi-view People Localization
3.3.2 Silhouette-Based Feature Extraction
3.3.3 3D Marked Point Process Model
3.3.4 Evaluation of Multi-camera People Localization
3.3.5 Applications and Alternative Ways of 3D Person Localization
3.4 Foreground Extraction in Lidar Point Cloud Sequences
3.4.1 Problem Formulation and Data Mapping
3.4.2 Background Model
3.4.3 DMRF Approach on Foreground Segmentation
3.4.4 Evaluation of DMRF-Based Foreground-Background Separation
3.4.5 Application of the DMRF Method for Person and Activity Recognition
3.5 Conclusions
4 Multi-layer Label Fusion Models
4.1 Markovian Fusion Models in Computer Vision
4.2 A Label Fusion Model for Object Motion Detection
4.2.1 2D Image Registration
4.2.2 Change Detection with 3D Approach
4.2.3 Feature Selection
4.2.4 Multi-layer Segmentation Model
4.2.5 L3Mrf Optimization
4.2.6 Experiments on Object Motion Detection
4.3 Long-Term Change Detection in Aerial Photos
4.3.1 Image Model and Feature Extraction
4.3.2 A Conditional Mixed Markov Image Segmentation Model
4.3.3 Experiments on Long-Term Change Detection
4.4 Parameter Settings in Multi-layer Segmentation Models
4.5 Conclusions
5 Multitemporal Data Analysis with Marked Point Processes
5.1 Introducing the Time Dimension in MPP Models
5.2 Object-Level Change Detection
5.2.1 Building Development Monitoring—Problem Definition
5.2.2 Feature Selection
5.2.3 Multitemporal MPP Configuration Model and Optimization
5.2.4 Experimental Study of the mMPP Model
5.3 A Point Process Model for Target Sequence Analysis
5.3.1 Application on Moving Target Analysis in ISAR Image Sequences
5.3.2 Problem Definition and Notations
5.3.3 Data Preprocessing in a Bottom-Up Approach
5.3.4 Multiframe Marked Point Process Model
5.3.5 Multiframe MPP Optimization
5.3.6 Experimental Results on Target Sequence Analysis
5.4 Parameter Settings in Dynamic MPP Models
5.5 Conclusions
6 Multi-level Object Population Analysis with an Embedded MPP Model
6.1 A Hierarchical MPP Approach
6.2 Problem Formulation and Notations
6.3 EMPP Energy Model
6.4 Multi-level MPP Optimization
6.5 Applications of the EMPP Model
6.5.1 Built-in Area Analysis in Aerial and Satellite Images
6.5.2 Traffic Monitoring Based on Lidar Data
6.5.3 Automatic Optical Inspection of Printed Circuit Boards
6.6 Implementation Details
6.7 Quantitative Evaluation Framework
6.7.1 EMPP Benchmark Database
6.7.2 Quantitative Evaluation Methodology
6.8 Experimental Results
6.8.1 EMPP Versus an Ensemble of Single Layer MPPs
6.8.2 Application Level Comparison to Non-MPP-Based Techniques
6.8.3 Effects on Data Term Parameter Settings
6.8.4 Computational Time
6.8.5 Experiment Repeatability
6.9 Conclusion
7 Concluding Remarks
Appendix References
Index


Csaba Benedek

Multi-Level Bayesian Models for Environment Perception

Institute for Computer Science and Control (SZTAKI), Budapest, Hungary

ISBN 978-3-030-83653-5    ISBN 978-3-030-83654-2 (eBook)
https://doi.org/10.1007/978-3-030-83654-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Acknowledgements

The author would like to thank the Institute for Computer Science and Control (SZTAKI), Eötvös Loránd Research Network, Hungary, and the Pázmány Péter Catholic University (PPCU) for funding his research work and providing excellent research infrastructure. The author gratefully acknowledges the continuous help of his closest colleagues at the Machine Perception Research Laboratory (MPLab) of SZTAKI, especially Prof. Tamás Szirányi, head of the laboratory and the author's former supervisor and mentor. The author thanks the following colleagues for their contributions to various joint research papers on which this book is based: Tamás Szirányi, Ákos Utasi, Balázs Nagy, Attila Börcs, Maha Shadaydeh, Bence Gálai, and Zsolt Jankó from SZTAKI; Josiane Zerubia and Xavier Descombes from INRIA Sophia Antipolis, France; Marco Martorella from the University of Pisa, Italy; Zoltán Kató from the University of Szeged; and László Jakab and Olivér Krammer from the Budapest University of Technology and Economics.

The presented results rely on a large variety of particular test data. Special sensors, such as various high-speed cameras and real-time Lidar laser scanners, were available in the author's research laboratory, SZTAKI MPLab. Remotely sensed satellite images were provided by the Lechner Knowledge Center in Budapest and by INRIA; aerial and terrestrial Lidar point clouds by Airbus Defense and Space, Hungary, and by Budapest Közút Zrt.; and radar images by the University of Pisa.

The research work introduced in this book was supported by various projects and grants: by the János Bolyai Research Fellowship of the Hungarian Academy of Sciences; by the National Research, Development and Innovation (NRDI) Office of Hungary within the frameworks of the National Laboratory for Autonomous Systems (NLAS) and the Artificial Intelligence National Laboratory (MILAB); by the NRDI Fund (OTKA) grants K-120233, KH-125681 and PD-101598; by the Hungarian R&D grants EFOP-3.6.2-16-2017-00013, EFOP-3.6.2-16-2017-00015 and TKP2021-NVA-01; by the DUSIREF project of the European Space Agency under the PECS-HU framework; by the APIS European Defence Agency project; and by the Michelberger Master Prize of the Hungarian Academy of Engineering.



Acronyms and Notations

The list below collects the acronyms and the most frequent symbols and notations used in this book.

Abbreviations and Concepts

AOI — Automated Optical Inspection
CXM — Conditional Mixed Markov Model
DMRF — Dynamic Markov Random Field
e.g. — for example (in Latin: 'exempli gratia')
EMPP — Embedded Marked Point Process (multi-level MPP)
Fm MPP — Multiframe Marked Point Process (object sequence analysis)
FBB — Feature-Based Birth Process
FoV — Field of View
GEI — Gait Energy Image
GODH — Gradient Orientation Density Histogram
i.e. — that is; in other words
ISAR — Inverse Synthetic Aperture Radar
Lidar — Light detection and ranging
MAP — Maximum a posteriori
MBD — Multiple Birth and Death (MPP SA optimization technique)
MLS/TLS — Mobile/terrestrial laser scanning
MMD — Modified Metropolis Dynamic (MRF SA relaxation technique)
mMPP — Multitemporal Marked Point Process (used for change detection)
MPP — Marked Point Process
MRF — Markov Random Field
PCA — Principal Component Analysis
PCB — Printed Circuit Board
PCC/PDC — Post Classification/Detection Comparison
pdf — Probability density function
RANSAC — Random sample consensus
RJMCMC — Reversible-Jump Markov Chain Monte Carlo (MPP optimization)
RMB Lidar — Rotating Multi-beam Lidar (sensor)
SA — Simulated Annealing (optimization method)
ToF — Time-of-Flight

General Notations Used in the Book

i, j, k, m — Arbitrary index (number or enumeration)
n — Dimension parameter, index
S — Pixel lattice
s, r — Pixel of the image lattice (s, r ∈ S)
L = {p_1, ..., p_n} — Point cloud of n points (p_i is the ith point)
x, y, z — Cartesian point coordinates
G, G_i — Image (over the S lattice), the ith image
g(s), g_i(s) — Gray value/image sensor value at pixel s (in the ith image)
N(μ, σ) — Normal distribution with mean μ and standard deviation σ
η(x, μ, σ) — Gaussian (normal) pdf with parameters μ, σ
B(x, α, β) — Beta pdf with parameters α, β
ζ(x, τ, m) — Sigmoid function with parameters τ and m
κ_s, κ_s^i — Weight (ith term) in a mixture pdf corresponding to pixel s
t, ·^t, ·_t — Time (upper or lower) index (for any quantities)
T — Temperature (for simulated annealing)

Specific Notations Used in MRF/CXM Models

G(V, ε) — MRF graph with set of nodes V and edges ε
υ — Abstract node of a graph G, υ ∈ V
e — Edge of a graph, e ∈ ε
s, r — Node of G in case of a single-layer MRF model
s_i — Node for pixel s ∈ S in the ith layer (in multi-layer MRF models)
Λ — Label set (#Λ = J)
l, l_i — Abstract label or class identifier
N — Neighborhood system of G
N_υ — Neighborhood of node υ in G (N_υ ∈ N)
ς(υ), ς_υ — Label of node υ in G (ς(υ) ∈ Λ)
ω — Global labeling: {[υ, ς(υ)] | υ ∈ V}
ω̂ — MAP estimation of the optimal global labeling
ς∗(a) — The label of the regular node addressed by a in CXM
ϒ — Set of all the possible global labelings (ω ∈ ϒ)
ω_X — Label subconfiguration corresponding to set X ⊆ V (ω_X ⊆ ω)
V_X(ω) — Potential of the subconfiguration ω_X
f(s), f̄(s) — Observation vector (∈ R^n) at pixel s ∈ S
f(υ), f̄(υ) — Observation vector (∈ R^n) assigned to node υ ∈ V
f_i(υ) — ith component of vector f(υ)
F — Global observation on G: {f(υ) | υ ∈ V}
C — Clique of G
C — Set of cliques in G
V_C — Potential of clique C
V_{υ_1,...,υ_n} — Potential of a clique containing nodes υ_1, ..., υ_n
·(ς_1, ς_2) — Potts smoothing term
δ, δ^i — Parameter of the smoothing term (in the ith layer)
ρ — Parameter of the inter-layer potential term
1{E}, 1_ς — Indicator function of an event E, or class ς
p_ς(s) — Pdf value corresponding to pixel s and class ς
ε_ς(s) — −log p_ς(s)

Specific Notations Used in MPP Models

u, v — MPP objects represented by geometric figures
H — Parameter space of the objects
ω = {u_1, ..., u_n} — Configuration (or population) of n objects
Ω — Configuration space
Φ(ω): Ω → R — MPP configuration energy function
f(u) (f_i(u)) — Data feature associated with object u (f_i(u): data feature named i)
F — Global observation data over the input image(s)/point cloud(s)
M(f, d_0, D) — Feature-mapping function (d_0: acceptance threshold, D: normalization)
R_u ⊂ S — Set of image pixels covered by the geometric figure of object u
ϕ_f(u) — Data term of object u considering feature f
A(u) — Unary potential of object u in Φ(ω)
I(u, v) — Interaction potential between (parent) objects u and v
ψ — Object group (in the EMPP model)
q (q_u) — Child object (of parent u) in the EMPP model
Q_u — Set of child objects of parent u (in the EMPP model)
u ∼ v — Neighborhood relation
N_u(ω) — Neighborhood of object u within population ω: {v ∈ ω | u ∼ v}

Chapter 1

Introduction

Abstract This chapter presents the key problem statements and solutions discussed in the book. An overview is provided on timely challenges in environment perception, by introducing recent sensor technologies and methodologies which are efficient candidates to handle the tasks ahead. A survey of various image segmentation and image-based object population modeling approaches is given, emphasizing the limitations of existing techniques and the points of progress where the methods of the book provide contributions. Key issues in change detection, information fusion, and hierarchical image content modeling are addressed based on the joint utilization of various data models and geometric shape descriptors, embedded into a strict probabilistic inference framework. A short overview of the remaining parts of the book closes the chapter.

This book deals with selected problems of machine perception, targeting the automated interpretation of the observed static or dynamic environment based on various image-like measurements. Scene understanding nowadays reaches far beyond conventional image processing approaches dealing with standard grayscale or RGB photos. Multi-camera systems, high-speed cameras, radar systems, depth and thermal sensors, or laser scanners may be used concurrently to support a given application; therefore, proposing a competitive solution to a problem means not only constructing the best pattern recognition or artificial intelligence algorithm, but also choosing the best hardware–software configuration. Examples of data modalities used in the book are demonstrated in Fig. 1.1. Besides the wide choice of sensing technologies, we can witness today a quick development of the available sensors in terms of spatial and temporal resolution, number of information channels, level of noise, etc. For this reason, when implementing efficient environment perception systems, we should answer various challenges of automatic feature extraction, object and event recognition, machine learning, indexing, and content-based retrieval. First, the developed methodologies should be able to deal with various data sources, and they should be highly scalable. This property enables flexible sensor fusion and the replacement of outdated sensors with novel ones providing improved data quality, without complete re-structuring of existing software systems. Second, the increased spatial resolution and dimension of the


Fig. 1.1 Examples for different data modalities used in the book

observed data implies that in a single measurement segment one may detect multiple effects on different scales, demanding recognizer algorithms which perform hierarchical interpretation of the content. As an example, in a very high-resolution aerial photo, we can jointly analyze macro-level urban or forest regions, separate different districts and roads of the cities, extract and cluster buildings, or focus on smaller objects such as vehicles or street furniture [146, 160]. Third, we should also efficiently utilize the multiple available scales of the time dimension. While object motion information can be directly extracted through pixel-by-pixel comparison of the consecutive frames in an image sequence with video frame rate, comparing measurements with several months or years of time differences captured from the same area needs a high-level modeling approach. The accomplished research work should point therefore toward obtaining a complex system, where the provided information of various data sources is organized into a unified hierarchical scene model, enabling multi-modal representation, recognition, and comparison of entities, by combining object-level analysis with low-level feature extraction. From a functional point of view, the methods proposed in the book present either general preprocessing steps of different early vision applications, or contribute to higher level object-based scene analysis modules. In the first case, the introduced models rely on low-level local features extracted from the sensor measurements, such as the pixel color values in images, or texture descriptors calculated over small rectangular image parts. The output is a classification (or segmentation) of the observation, which can be interpreted as a semantic labeling of the raw data. For example, in a video frame we can separate the foreground and background pixels, or in an


aerial Lidar point cloud roof and terrain regions can be distinguished. Although the classification is primarily based on the extracted local features, which provide posterior (observation-dependent) information for the process, additional prior constraints are also exploited to decrease the artifacts due to noise and ambiguities of the input data. One of the simplest, but often used prior conditions is the connectivity: we can assume in several problems that the classification should result in homogeneous regions, e.g. in images, neighboring pixels correspond usually to the same class. Markov Random Fields (MRFs) [103] are widely used classification tools since the early eighties, since they are able to simultaneously embed a data model, reflecting the knowledge on the measurements, and prior constraints, such as spatial smoothness of the solution through a graph-based image representation. Since conventional MRFs show some limitations regarding context-dependent class modeling, different modified schemes have been recently proposed to increase their flexibility. Triplet Markov fields [7] contain an auxiliary latent process which can be used to describe various subclasses of each class in different manners. Mixed Markov models [62] extend MRFs by admitting data-dependent links between the processing nodes, which fact enables introducing configurable structures in feature integration. Conditional Random Fields (CRFs) directly model the data-driven posterior distributions of the segmentation classes [113]. On their positive side, the above Markovian segmentation approaches are robust and well established for many problems [120]. However, as it will be explained in Chap. 2 in detail, the MRF concept offers only a general framework, which has a high degree of freedom. In particular, a couple of key issues should be efficiently addressed for a given real-world problem. The first one is extracting appropriate features and building a proper probabilistic model of each semantic class. The second key point is developing an appropriate model structure, which consists of simple interactive elements. The arrangement and dialogue of these units are responsible for smoothing the segmentation map or integrating the effects of different features. Choosing the right dimension of the field is also a critical step. MRFs can be either defined on 2D lattices, or on 3D voxel models, but projecting a high-dimensional problem to a lower dimensional domain is also a frequently used option. For example, for segmenting a point cloud, a straightforward approach is to construct the MRF in the 3D Euclidean space of the measurements. However, if the point cloud data was recorded by a 2.5D sensor moving on a fixed trajectory, range image representation may provide more efficient results, which is less affected by artifacts of sensor noise and occlusion. We propose in this book novel solutions regarding many of the above-mentioned aspects following the demands of real applications. On the one hand, we combine various statistical features to solve different change detection problems, and explore the connections between different 2D image and 3D point cloud-based descriptors. On the other hand, we investigate the efficiency of various possible model structures both in terms of scalability and in practical problem-solving performance. We will propose new complex and flexible low-level inference algorithms between various measured features and prior knowledge. 
Dealing with higher dimensional data, we pay particular attention to reduce the complexity of the model structures, save computational time, and keep the modeling process tractable.


One of the frequent applications of MRF and CRF models is implementing low-level change detection between co-registered measurements captured in a dynamic environment. Formally, we have to implement here a semantic pixel-labeling process with change and background classes in case of 2D images, and voxel or point labeling when working with 3D data. Change detection is an important early processing task of several machine perception problems, since shape, size, number, and position parameters of the relevant scene objects can be derived from an accurate MRF-based change map and used among others in video surveillance [25, 117], aerial exploitation [30, 186], traffic monitoring [188] or land-cover change detection in remote sensing images [91, 171], autonomous driving [190], and smart city-related applications [125]. As the large variety of applications shows, change detection is a wide concept: different classes of algorithms should be separated depending on the environmental conditions and the exact goals of the systems. While conventional electro-optical cameras are still important visual information sources, recently released Lidar range sensors offer alternative approaches for scene analysis, by directly measuring 3D geometric information from the environment. Using the Lidar technology, the most important limitation is currently a necessary trade-off between the spatial and the temporal resolution of the available sensors, which makes it difficult to observe and analyze small details of the scenes in real time. Important research issues are therefore the exploration of new tasks which can be handled by these new sorts of measurements, and the adaptation of conventional image processing algorithms, structures for voxel-based scene representation, and vision-related machine learning methodologies to Lidar data. This book deals with four selected tasks from the low-level change detection problem family. Although the abstract aim (indicating some kind of changes between consecutive images in an image sequence) and the applied mathematical tools (statistical modeling, feature differencing, and Markov Random Fields) are similar for the introduced four problems, the further inspections will show that the solutions must be significantly different. First, we propose a Bayesian approach for foreground and shadow estimation in low frame rate surveillance videos. Second, we construct a novel motion detection and segmentation method based on the measurements of a single Rotating Multi-beam (RMB) Lidar sensor. Third, we introduce a new model for object motion detection in image pairs captured from a moving aerial platform, by attempting to remove registration and parallax errors. Fourth, we propose a novel mixed Markovian structure for relevant change extraction in airborne photos taken with several years of time differences. A higher level of visual data interpretation can be based on object-level analysis of the scene. Object extraction is a crucial step in several perception applications, ranging from remotely sensed data analysis, through optical fabric inspection systems, to video surveillance. Object detection techniques in the literature follow either a bottom-up or an inverse (top-down) approach. The straightforward bottom-up techniques [136] construct the objects from primitives, like blobs, edge parts, or corners in images. Although these methods can be fast, they may fail if the primitives cannot be reliably detected.
We can mention here Hough transform or mathematical morphology-based methods [168] as examples; however, these approaches show limitations in cases of dense


populations with several adjacent objects. To increase robustness, it is common to follow the Hypothesis Generation-Acceptance (HGA) scheme [165, 169]. Here, the accuracy of object proposition is not crucial, as false candidates can be eliminated in the verification step. However, objects missed by the generation process cannot be recovered later, which may result in several false negatives. On the other hand, generating too many object hypotheses (e.g. applying exhaustive search) slows down the detection process significantly. Finally, conventional HGA techniques search for separate objects instead of global object configurations, disregarding populationlevel features such as overlapping, relative alignment, color similarity, and spatial distance of the neighboring objects [114]. To overcome the above drawbacks, recent inverse methods [55] assign a fitness value to each possible object configuration, and an optimization process attempts to find the configuration with the highest confidence. This way, flexible object appearance models can be adopted, and it is also straightforward to incorporate prior shape information and object interactions. However, this inverse approach needs to perform a computationally expensive search in a high-dimensional population space, where local maxima of the fitness function can mislead the optimization. Using the above terminology, MRFs can also be considered as inverse techniques. However, staying at pixel level in the graph nodes, we find only very limited options to consider geometrical information [35, 187]. Marked Point Processes (MPP) [53, 55] offer an efficient extension of MRFs, as they work with objects as variables instead of with pixels, considering that the number of variables (i.e. number of objects) is also unknown. MPPs embed prior constraints and data models within the same density, therefore similarly to MRFs, efficient algorithms for model optimization [54, 66, 182] and parameter estimation [6, 47] are available. Recent MPP applications range from 2D [46, 121, 137] and 3D object extraction [67, 180] in various environments to 1D signal modeling [129] or target tracking [184]. Marked Point Processes have previously been used for various population counting problems, dealing with a large number of objects which have low varieties in shape. MPP models can efficiently handle these situations, by jointly describing individual objects by various data terms, and using information from entity interactions by prescribing the (soft) fulfillment of prior geometric constraints [53]. In this way, one can extract configurations which are composed of similarly shaped and sized entities such as buildings [41], trees [202, 209], birds [54, 66, 67], and boats [6] from remotely sensed data, cell nuclei from medical images [181], galaxies in space applications [46], or people in video surveillance scenarios [70]. While the computational complexity of MPP optimization may mean a bottleneck for some applications, various efficient techniques have been proposed to speed up the energy minimization process, such as the Multiple Birth and Death (MBD) [54] algorithm or the parallel Reversible-Jump Markov Chain Monte Carlo (RJMCMC) sampling process [182]. Although the above applications show clear practical advantages of conventional MPP-based solutions, neither the time dimension of the measurements nor the spatial


hierarchical decomposition of the scene are addressed in the referred previous works of the literature. Therefore, this book presents contributions focusing on temporal and spatial extensions of the original MPP framework, by expansively analyzing the needs and alternative directions for the solutions, and demonstrating the advantages of the improvements in real problem environments. The temporal dimension appears in two different aspects. The first problem is object-level change detection in image pairs, where low-level approaches are combined with geometric object extraction by a multitemporal MPP (mMPP) model. The result is an object population, where each object is marked as unchanged, changed, new, or disappeared between the selected two time instances, typically based on measurements with several months or years of time differences. A second task is tracking a moving target across several frames in time sequences of very low-quality measurements, such as radar images. For this purpose, a novel Multiframe MPP (Fm MPP) framework is proposed, which simultaneously considers the consistency of the observed data and the fitted objects in the individual images, and also exploits interaction constraints between the object parameters in the consecutive frames of the sequence. Following the Markovian approach, here each target sample may only affect objects in its neighboring frames directly, limiting the number of interactions for efficient sequence analysis. Another major targeted issue is spatial hierarchical content modeling. Classical MPP-based image analysis models [54, 55] focus purely on the object level of the scene. Simple prior interaction constraints such as non-overlapping or parallel alignment are often utilized to refine the accuracy of detection, but in this way only very limited amount of high-level structural information can be exploited from the global scenario. In various applications, however, investigation of object grouping patterns and the decomposition of objects to smaller parts (i.e. subobjects) are relevant issues. We propose therefore a hierarchical MPP extension, called the Embedded Marked Point Process (EMPP) model, which encapsulates on the one hand a hierarchical description between objects and object parts as a parent–child relationship, and on the other hand it allows corresponding objects to form coherent object groups, by a Bayesian segmentation of the population. This book uses the basic concepts and results of probability theory (e.g. random variables, probability density functions, and Bayes’ rule), and machine learning (neural networks, supervised training strategies) which are supposed to be familiar for the Readers. The outline of the book partially follows [15]. Chapter 2 presents a short introduction to stochastic image segmentation, object population extraction, and machine learning approaches, by introducing the data types, general notations, and basic mathematical tools used in the following parts of the book. In Chap. 3, we introduce new MRF and MPP approaches connected to dynamic environment analysis. In Chap. 4, novel multi-layer Markovian label fusion models are proposed for the above-introduced two different change detection applications. Chapter 5 deals with multitemporal object-level scene analysis for tasks of building change detection in remotely sensed optical image pairs, and moving target tracking in radar image sequences. Finally, in Chap. 6 we give a complex multi-level stochastic model for


spatial scene decomposition, and demonstrate its usability in three very different application fields. This chapter and Chap. 2 of the book are based on Sects. 1 and 2 of [15]. Chapters 3 and 4 partially rely on selected parts of [8, 29].

Chapter 2

Fundamentals

Abstract This chapter presents the main mathematical foundations of the problems, concepts, and methods covered by the book. First, a formal description is given for 2D image and 3D point cloud-based measurement representation, then various Markovian data analysis frameworks are discussed, which implement image segmentation, and geometric object population extraction tasks. The chapter covers state-of-the-art methodologies of graph-based scene representation, probabilistic modeling of prior knowledge-based and image data-based information, Bayesian inference, parameter estimation, and various energy optimization approaches. Special focus is devoted to established techniques such as Markov Random Fields, mixed Markov models, and Marked Point Process frameworks. Finally, based on the presented fundamentals, the methodological contributions of the book are summarized.

2.1 Measurement Representation and Problem Formulations

In this book, the various sensor measurements at given time instances are represented either as 2D digital images or as 3D point clouds. Both cases can be completed with a temporal dimension, obtaining image or point cloud sequences. A digital image is defined over a 2D pixel lattice S having a finite size S_W × S_H, where s ∈ S denotes a single pixel. The pixels' observation values represent grayscale or RGB color information, depth values, etc., or any descriptors calculated from the raw sensor measurements by spatio-temporal filtering or feature fusion. A point cloud L is by definition an unordered set of l points: L = {p_1, ..., p_l}, where each point p ∈ L is described by its (x, y, z) position coordinates in a 3D Euclidean world coordinate system. Additional parameters, such as intensity, color values, and further sensor-specific parameters may also be associated with the points. Although several different techniques are discussed in the book with various goals and model structures, they are strongly connected from the point of view of theoretical foundations and methodologies: they can be formulated either as low-level segmentation (or classification) problems or as object population extraction tasks (see examples in Fig. 2.1).
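To make the two representations concrete, the short sketch below stores an image as an observation array over the pixel lattice S and a point cloud as an unordered array of 3D points. The array sizes, the helper function, and the use of NumPy are illustrative assumptions rather than requirements of the models discussed later.

```python
import numpy as np

# Image over the pixel lattice S (size S_W x S_H), with a D-dimensional
# observation vector (e.g. an RGB color, D = 3) stored at every pixel s.
S_W, S_H, D = 640, 480, 3
image = np.zeros((S_H, S_W, D), dtype=np.float32)    # g(s) for every s in S

# Point cloud L = {p_1, ..., p_l}: an unordered set of l points with
# (x, y, z) coordinates; extra channels (e.g. intensity) can be appended.
n_points = 100000
point_cloud = np.zeros((n_points, 3), dtype=np.float32)   # rows are points p_i

def observation(s_row: int, s_col: int) -> np.ndarray:
    """Return the observation vector assigned to pixel s = (row, col)."""
    return image[s_row, s_col]
```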


Fig. 2.1 Demonstration of a segmentation and an object population extraction task for aerial images

Segmentation (or classification) can be formally considered as a labeling task where each local element (pixel of the image or point of the point cloud) gets a label from a J-element label set corresponding to J different segmentation classes. In other words, a J-colored image or point cloud is generated for a given input. Following statistical inverse approaches, we should be able to assign a fitness (or probability) value to all the J^#el possible segmentations (where #el marks the number of pixels, or points), based on the current measurements (called observation), domain-specific knowledge about the classes, and prior constraints, in a way that higher fitness values correspond to semantically better solutions. By object population extraction, we mean the detection of an unknown number of entities from a preliminarily defined object library. Here, the fitness function needs to characterize any of the possible entity configurations. The objects are described by geometric shapes such as ellipses or rectangles, while the fitness function evaluates how the independent objects fit the image data, and it may also consider predefined interaction constraints. To overcome the curse of dimensionality, the fitness functions are usually modularly defined: they can be decomposed into individual subterms, and the domain of each subterm consists only of a few nearby pixels or objects. In this way, if we change the segmentation map or the population locally, we should not re-calculate the whole fitness function, only those subterms which are affected by the selected entities. This property significantly decreases the computational complexity of iterative fitness optimization techniques [2, 71]. An efficient Bayesian approach can be based on a graph representation, where each node of the graph corresponds to a structural model element, such as a pixel of the image, or an object of the population. We define edges between two nodes if the corresponding entities influence each other directly, i.e. there is a subterm of the fitness function which depends on both elements. For example, to ensure the spatial

smoothness of the segmented images, one can prescribe that the neighboring pixels should have the same labels in the vast majority of cases [147]. Since the seminal work of Geman and Geman [71], Markov Random Fields (MRFs) and their variants such as Mixed Markov models offer powerful tools to ensure contextual classification in image or point set segmentation tasks. Marked Point Process (MPP) models have been introduced in computer vision more recently, as a natural object-level extension of MRFs. In the following part of the chapter, we give the formal definitions and algorithmic steps regarding MRF-based data segmentation and MPP-based object population extraction. The concepts and notations introduced here will be used in the following parts of the book.

2.2 Markovian Classification Models

2.2.1 Markov Random Fields, Gibbs Potentials, and Observation Processes

A Markov Random Field (MRF) can be defined over an undirected graph G = (V, ε), where V = {υ_i | i = 1, ..., N} marks the set of nodes, and ε is the set of edges. Two nodes υ_i and υ_k are neighbors, if there is an edge e_ik ∈ ε connecting them. The set of points which are neighbors of a node υ (i.e. the neighborhood of υ) is denoted by N_υ, while we mark with N = {N_υ | υ ∈ V} the neighborhood system of the graph. A classification problem can be interpreted as a labeling task over the nodes. Using a finite label set Λ = {l_1, l_2, ..., l_J}, we assign a unique label ς(υ) ∈ Λ to each node υ ∈ V. We mean by a global labeling ω the enumeration of the nodes with their corresponding labels:

ω = { [υ, ς(υ)] | ∀υ ∈ V }.    (2.1)

Let us denote by ϒ the (finite) set of all the possible global labelings (ω ∈ ϒ). In some cases, instead of a global labeling, we need to deal with the labeling of a given subgraph. The subconfiguration of ω with respect to a subset X ⊆ V is denoted by ω_X = { [υ, ς(υ)] | ∀υ ∈ X }. In the next step, we define Markov Random Fields. As usual, the Markov property means here that the label of a given node depends directly only on its neighbors.

Definition (Markov Random Field) X is a Markov Random Field (MRF) with respect to a graph G, if the following two conditions hold:
• for all ω ∈ ϒ: P(X = ω) > 0;
• for every υ ∈ V and ω ∈ ϒ: P(ς(υ) | ω_{V\{υ}}) = P(ς(υ) | ω_{N_υ}).



The discussion of MRFs is most convenient if the neighborhood system is defined via the cliques of the graph. A subset C ⊆ V is a clique if every pair of distinct nodes in C are neighbors. The set of cliques of the graph is denoted by C.


To characterize the fitness of the different global labelings, a Gibbs measure is defined on ϒ. Let V be a potential function which assigns a real number V_X(ω) to the subconfiguration ω_X. V defines an energy U(ω) on ϒ by

U(ω) = Σ_{X ∈ 2^V} V_X(ω),    (2.2)

where 2^V denotes the set of the subsets of V.

Definition (Gibbs distribution) A Gibbs distribution is a probability measure π on ϒ with the following representation:

π(ω) = (1/Z) · exp( −U(ω) ),    (2.3)

where Z is a normalizing constant or partition function:

Z = Σ_{ω ∈ ϒ} exp( −U(ω) ).    (2.4)

If V_X(ω) = 0 whenever X ∉ C, then V is called a nearest neighbor potential. The following theorem is the principle of most MRF applications in computer vision [71]:

Theorem 2.1 (Hammersley–Clifford) X is an MRF with respect to the neighborhood system N if and only if π(ω) = P(X = ω) is a Gibbs distribution with nearest neighbor Gibbs potential V, that is,

π(ω) = (1/Z) · exp( − Σ_{C ∈ C} V_C(ω) ).    (2.5)
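Before turning to the observation model, a toy example may help to make the Gibbs distribution tangible. The sketch below enumerates all labelings of a two-node graph with a single Potts-type doubleton clique and normalizes exp(−U) into probabilities; the label set, the potential, and the value of δ are assumptions chosen only for demonstration. For a real image lattice, such an enumeration over ϒ is clearly infeasible, which motivates the optimization techniques of Sect. 2.2.4.

```python
import itertools
import math

# Toy Gibbs distribution on a graph with two neighboring nodes and two
# labels, using a Potts-type doubleton potential: -delta if the labels
# agree, +delta otherwise (this tiny setup is illustrative only).
LABELS = (0, 1)
DELTA = 1.0

def energy(labeling):
    a, b = labeling
    return -DELTA if a == b else DELTA      # single doubleton clique

# Partition function Z sums exp(-U) over all |labels|^|nodes| labelings.
Z = sum(math.exp(-energy(w)) for w in itertools.product(LABELS, repeat=2))

for w in itertools.product(LABELS, repeat=2):
    pi = math.exp(-energy(w)) / Z
    print(w, round(pi, 3))    # equal-label configurations get higher mass
```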

We mean by observation arbitrary measurements from real-world processes (such as image sources) assigned to the nodes of the graph. In image processing, usually the pixels' color values or simple textural responses are used, but any other local features can also be calculated. In general, we only prescribe that the observation process assigns a D-dimensional real vector f(υ) ∈ R^D to selected graph nodes. The global observation over the graph is marked by

F = { [υ, f(υ)] | ∀υ ∈ O },  where O ⊆ V.    (2.6)

MRF-based classification models use two assumptions. First, each class label l_i ∈ Λ corresponds to a random process, which generates the observation value f(υ) at υ according to a locally specified probability density function (pdf), p_{υ,i}(λ) = P(f(υ) = λ | ς(υ) = l_i). Second, local observations are conditionally independent, given the global labeling:

P(F | ω) = Π_{υ ∈ O} P( f(υ) | ς(υ) ).    (2.7)

2.2.2 Bayesian Labeling Approach and the Potts Model

Let X be an MRF on graph G = (V, ε), with (a priori) clique potentials {V_C(ω) | C ∈ C}. Consider an observation process F on G. The goal is to find the labeling ω̂, which is the maximum a posteriori (MAP) estimate, i.e. the labeling with the highest probability given F:

ω̂ = argmax_{ω ∈ ϒ} P(ω | F).    (2.8)

Following Bayes' rule and Eq. (2.7),

P(ω | F) = P(F | ω) P(ω) / P(F) = (1 / P(F)) · Π_{υ ∈ O} P( f(υ) | ς(υ) ) · P(ω).    (2.9)

Based on the Hammersley–Clifford theorem, P(ω) follows a Gibbs distribution:

P(ω) = π(ω) = (1/Z) · exp( − Σ_{C ∈ C} V_C(ω) ),    (2.10)

while P(F) and Z (in the Gibbs distribution) are independent of the current value of ω. Using also the monotonicity of the logarithm function and Eqs. (2.8), (2.9), and (2.10), the optimal global labeling can be written into the following form:

ω̂ = argmin_{ω ∈ ϒ} [ Σ_{υ ∈ O} − log P( f(υ) | ς(υ) ) + Σ_{C ∈ C} V_C(ω) ].    (2.11)

Note that due to the conditional independence of the observations at the different nodes, the fact that the prior field π(ω) is an MRF implies that the posterior field π(ω | F) is also an MRF. In this case, the − log P( f(υ) | ς(υ) ) quantity can be considered as the potential of a singleton clique {υ}.

2.2.3 MRF-Based Image Segmentation

A widely used implementation of the above Bayesian labeling framework for image segmentation is based on the Potts model [147]. Assume that the problem is defined over the 2D lattice S and we have a measurement vector f(s) ∈ R^D at each pixel s.


Fig. 2.2 Illustration of simple connections in MRFs: a first-ordered neighborhood of a selected node on the lattice, b ‘singleton’ clique, c doubleton clique

Fig. 2.3 Demonstration of MRF-based supervised image segmentation with three classes: a input image with the training regions, b pixel-by-pixel segmentation without using node interactions, c result of the Potts model with MMD optimization

The goal is to segment the input lattice into J pixel clusters corresponding to J random processes (l_1, ..., l_J), where the clusters of the pixels are consistent with the local measurements, and the segmentation is smooth, i.e. pixels having the same cluster form connected regions. Here, by the definition of G, we assign to each pixel of the input lattice a unique node of the graph. One can simply use a first-ordered neighborhood, where each pixel has four neighbors. In this case, the cliques of the graph are singletons or doubletons as shown in Fig. 2.2. As a consequence, the prior term π(ω) = P(ω) of the MRF energy function is defined by the doubleton clique potentials. According to the Potts model, the prior probability term is responsible for getting smooth connected components in the segmented image, so we give penalty terms to each neighboring pair of nodes whose labels are different. For any node pair r, υ ∈ V which fulfills υ ∈ N_r, {r, υ} ∈ C is a clique of the graph, with the potential

V_{r,υ}(ω) = −δ if ς(r) = ς(υ),  +δ if ς(r) ≠ ς(υ),    (2.12)

where δ ≥ 0 is a constant. A sample MRF-based segmentation result, with the demonstration of the role of the Potts smoothing term, is shown in Fig. 2.3.
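The following sketch assembles the posterior energy of Eq. (2.11) for this single-layer Potts model: a pixel-wise data term −log P(f(s) | ς(s)) and the doubleton potentials of Eq. (2.12) over the four-neighborhood of Fig. 2.2. The Gaussian class likelihoods and the numeric parameters are assumptions made only for illustration; the book derives application-specific data models in the later chapters.

```python
import numpy as np

def data_term(image, means, sigmas):
    """-log of assumed Gaussian class likelihoods, result shape (H, W, J)."""
    diff = image[..., None] - means[None, None, :]
    return 0.5 * (diff / sigmas) ** 2 + np.log(sigmas * np.sqrt(2 * np.pi))

def potts_energy(labels, delta):
    """Sum of doubleton potentials (+/- delta) over horizontal/vertical pairs."""
    cost = 0.0
    for differs in (labels[1:, :] != labels[:-1, :], labels[:, 1:] != labels[:, :-1]):
        cost += np.sum(np.where(differs, delta, -delta))
    return cost

def total_energy(labels, image, means, sigmas, delta):
    """Posterior energy of Eq. (2.11): unary data costs plus Potts smoothing."""
    dt = data_term(image, means, sigmas)
    h, w = labels.shape
    unary = dt[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    return unary + potts_energy(labels, delta)

# Example: grayscale image, three classes with assumed means and deviations.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
means, sigmas = np.array([0.2, 0.5, 0.8]), np.array([0.1, 0.1, 0.1])
init = np.argmin(data_term(img, means, sigmas), axis=-1)   # pixel-by-pixel labeling
print(total_energy(init, img, means, sigmas, delta=0.5))
```

The pixel-by-pixel initialization corresponds conceptually to the middle result of Fig. 2.3, where node interactions are ignored; the Potts term is what the optimization methods of the next subsection exploit to smooth such a labeling.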


2.2.4 MRF Optimization

In applications using the MRF models, the quality of the classification depends both on the appropriate probabilistic model of the classes, and on the optimization technique which finds a good global labeling with respect to Eq. (2.11). The latter factor is a key issue, since finding the global optimum is NP-hard [40]. On the other hand, stochastic optimizers using simulated annealing (SA) [2, 71] and graph-cut techniques [40] have proved to be practically efficient, offering a ground to validate different energy models. Detailed overviews on the various optimization approaches, and tutorials on MRF-based image segmentation, can be found in several books and monographs dealing with the topic [103, 120]. The results shown in the following chapters have been generated by either the fast graph-cut-based optimization technique [40] or by the deterministic Modified Metropolis (MMD) [100, 104] relaxation algorithm. While the graph-cut technique is adopted without modification [40] in Chap. 3 and Sect. 5.3.3, the MMD algorithms will be tailored to specific multi-layer model structures discussed in Chap. 4. For a deeper understanding of this step, we introduce here the basic MMD algorithm first [104]:

1. Initialization: Pick randomly an initial labeling ω, set k := 0 and choose a sufficiently high initial temperature T := T_0.
2. Main program:
   a. Perturbation: Construct a trial perturbation ω̆ from the current configuration ω such that ω̆ differs only in one element (i.e. in the label of a single pixel) from ω.
   b. Metropolis criteria: Compute ΔU = U(ω̆) − U(ω) and accept ω̆ if ΔU ≤ 0, else accept it with probability exp(−ΔU/T), using an analogy with thermodynamics:
      ω := ω̆ if ΔU ≤ 0; ω := ω̆ if ΔU > 0 and τ < exp(−ΔU/T); ω := ω otherwise,
      where τ is a uniform random number in [0, 1).
   c. Convergence test: If the process has not converged, decrease the temperature T = T_{k+1}, set k = k + 1, and go back to the Perturbation step.
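A compact sketch of the relaxation loop listed above is given below. The energy_fn argument is any function returning U(labeling), for instance the total_energy() sketch of the previous code block; the exponential cooling schedule, the sweep count, and the random trial-label choice are illustrative assumptions rather than the tuned settings used in the book.

```python
import numpy as np

def mmd_relaxation(labels, n_classes, energy_fn,
                   T0=4.0, cooling=0.97, sweeps=50, seed=1):
    """Simulated-annealing style relaxation following the MMD steps above."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    energy = energy_fn(labels)
    T = T0
    h, w = labels.shape
    for _ in range(sweeps):
        for _ in range(h * w):                       # one sweep of single-pixel trials
            r, c = rng.integers(h), rng.integers(w)
            old = labels[r, c]
            labels[r, c] = (old + rng.integers(1, n_classes)) % n_classes
            dU = energy_fn(labels) - energy          # Metropolis criterion
            if dU <= 0 or rng.random() < np.exp(-dU / T):
                energy += dU                         # accept the trial labeling
            else:
                labels[r, c] = old                   # reject: restore previous label
        T *= cooling                                 # decrease the temperature
    return labels
```

For the Potts example above one could call mmd_relaxation(init, 3, lambda lab: total_energy(lab, img, means, sigmas, 0.5)). Re-evaluating the full energy after every perturbation is done here only for brevity: in practice, only the data term of the modified pixel and the doubleton potentials of its neighbors change, which is exactly the modularity property discussed in Sect. 2.1.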

2.2.5 Mixed Markov Models

Mixed Markov models were originally proposed for gene regulatory network analysis [62], and extend the modeling capabilities of Markov random fields: besides prior static connections, they enable using observation-dependent dynamic links


Fig. 2.4 Possible interactions in mixed Markov models. Four different configurations, where A and B regular nodes may directly interact. Empty circles mark address nodes, continuous lines are edges, and dotted arrows denote address pointers

between the processing nodes. This property allows encoding interactions that occur only in a certain context and are absent in all others. A mixed Markov model—similar to a conventional MRF—is defined over a graph G = (V, ε), where V and ε denote again the sets of nodes and edges, respectively. A label, i.e. a random variable ς(υ), is assigned to each node υ ∈ V as well, and the node labels over the graph determine a global labeling ω as defined by Eq. (2.1). However, in mixed Markov models, two types of nodes are discriminated: V^R contains regular nodes and V^A is the set of address nodes (V = V^R ∪ V^A, V^R ∩ V^A = ∅). Regular nodes r ∈ V^R have the same roles as nodes in MRFs: the corresponding variable ς(r) will encode a segmentation label getting values from a finite, application-dependent label set. On the other hand, address nodes provide configurable links in the graph by creating pointers to other (regular) nodes. Thus, for a given address node a ∈ V^A, the domain of its 'label' ς(a) is the set V^R ∪ {nil}. In the case of ς(a) ≠ nil, let us denote by ς∗(a) the label of the regular node addressed by a:

ς∗(a) := ς(ς(a)).    (2.13)

There is no restriction on the graph topology: edges can link any two nodes. The edges define the set of cliques of G, which is denoted again by C. In a given configuration, two regular nodes may interact directly if they are connected by a static edge or by a chain of a static edge and dynamic address pointers: four typical configurations of connection are demonstrated in Fig. 2.4. More specifically, with the notations ς_C = {ς(υ) | υ ∈ C} and ς_C^A = {ς∗(a) | a ∈ V^A ∩ C, ς(a) ≠ nil} for each clique C ∈ C, the prior probability of a given global labeling ω is given by

P(ω) = (1/Z) · exp( − Σ_{C ∈ C} V_C(ς_C, ς_C^A) ),    (2.14)

where V_C is a clique potential function taking real values, which has a 'low' value if the labels within the set ς_C ∪ ς_C^A are semantically consistent, while V_C is 'high' otherwise. The scalar Z is again a normalizing constant, which could be calculated by summing the unnormalized measure over all the possible global labelings. Note that a detailed analysis of the analytical and computational properties of mixed Markov models can be found in [62], which confirms the efficiency of the approach in probabilistic inference.
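The sketch below illustrates the two node types with plain dictionaries: regular nodes carry segmentation labels, address nodes carry pointers that may be nil, and a clique potential becomes active only when the pointer addresses a regular node. The node names, labels, and the ±1 potential are assumptions for demonstration only.

```python
# Regular nodes store segmentation labels; address nodes store a pointer
# to a regular node, or None to represent the 'nil' label.
regular_labels = {"r1": "foreground", "r2": "background", "r3": "foreground"}
address_labels = {"a1": "r3", "a2": None}          # a2 currently points to nil

def label_via_address(a):
    """Label of the regular node addressed by a, in the spirit of Eq. (2.13)."""
    target = address_labels[a]
    return None if target is None else regular_labels[target]

def clique_potential(r, a, penalty=1.0):
    """Low potential if the regular node agrees with the addressed node."""
    addressed = label_via_address(a)
    if addressed is None:
        return 0.0                                  # inactive link adds no constraint
    return -penalty if regular_labels[r] == addressed else penalty

print(clique_potential("r1", "a1"))                 # r1 and r3 agree -> -1.0
```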


2.3 Object Population Extraction with Marked Point Processes Similar to Markov Random Fields, Marked Point Process (MPP) methods use a graph-based representation for semantic content modeling. However, in MPPs the graph nodes are associated with geometric objects instead of low-level pixels or point cloud elements. In this way, an MPP model enables characterizing whole populations instead of individual objects, by exploiting information from entity interactions. Following the classical Markovian approach, each object may directly affect only its neighbors. This property limits the number of interactions in the population and results in a compact description of the global scene, which can be analyzed efficiently. For easier discussion, in this chapter we introduce MPP models purely over 2D pixel lattices, dealing with 2D objects. While most of the object detection tasks discussed in this book are indeed handled as 2D pattern recognition problems, we note that the model extension to 2.5D or 3D (spatial) scenes is quite straightforward.

2.3.1 Definition of Marked Point Processes In statistics, a random process is called a point process if it can generate a set of isolated points in space, in time, or in even more general spaces. In this book, we will mainly use a discrete 2D point process, whose realization is a set of an arbitrary number of points over a pixel lattice S:

o = {o_1, o_2, . . . , o_n}, n ∈ {0, 1, 2, . . .}, ∀i : o_i ∈ S.   (2.15)

A sample task for using point processes in image processing is detecting buildings in aerial images, as shown in Fig. 2.5a, where each point corresponds to a building center. However, modeling our objects with point-wise entities is often an insufficient abstraction. For example, in high-resolution aerial photos, building shapes can often be efficiently approximated by rectangles (Fig. 2.5b). To include object geometry

Fig. 2.5 Marked Point Process example: a building population as a realization of a point process, b rectangle model of a selected building, c parameters of the marked object [17]


in the model, we assign markers to the points. As shown in Fig. 2.5c, a rectangle can be defined by the center point o ∈ S, the orientation θ ∈ [−90°, +90°], and the perpendicular side lengths e_L and e_l. In this case, the marker is a 3D parameter vector (θ, e_L, e_l). In the general case, let us denote by u an object candidate of the scene whose imaged shape over lattice S is represented by a plane figure from a preliminarily fixed shape library. In this book, mainly ellipses or rectangles are used. We will model each marked object by its reference point o, the global orientation θ, and further shape-dependent parameters such as major and minor axes for ellipses, and perpendicular side lengths for rectangles. Denoting by P the domain of the markers, the parameter space H of the individual objects (i.e. u ∈ H) is obtained as H = S × P. A configuration of an MPP model, denoted by ω, is a population of marked objects:

ω = {u_1, . . . , u_n}, ∀i : u_i ∈ H,   (2.16)

where the number of objects, n, is an arbitrary integer, which is initially unknown in population extraction tasks. Consequently, the object configuration space Ω has the following form:

Ω = ⋃_{n=0}^{∞} Ω_n,   Ω_n = { {u_1, . . . , u_n} ⊂ H^n }.   (2.17)

Next, we define a neighborhood relation ∼ between the objects of a given configuration ω. For example, we can prescribe for objects u, v ∈ ω that u ∼ v iff the distance between the object centers is lower than a predefined threshold. The neighborhood of object u in ω is N_u(ω) = {v ∈ ω | u ∼ v}.   (2.18)

2.3.2 MPP Energy Functions Object populations in MPP models are evaluated by simultaneously considering the input measurements (e.g. images) and prior application-specific constraints about object geometry and interactions. Let us denote by F the union of all image features derived from the input data. For characterizing a given configuration ω based on F, we introduce a non-homogeneous data-dependent Gibbs distribution (see Eq. (2.3)) on the population space:

P_F(ω) = P(ω | F) = (1/Z) · exp(−Φ_F(ω)),   (2.19)

with a Z normalizing constant:


Fig. 2.6 Calculation of the I (u, v) interaction potentials: intersections of rectangles are denoted by striped areas

Z = Σ_{ω∈Ω} exp(−Φ_F(ω)),   (2.20)

where Φ_F(ω) is called the configuration energy. Following the energy decomposition approach discussed earlier for MRFs (Eq. (2.5)), we obtain Φ_F(ω) as the sum of simple components, which can be calculated by considering small subconfigurations only. More specifically, we distinguish unary (or singleton) terms A(u) defined on individual objects, and interaction terms I(u, v) concerning neighboring objects:

Φ_F(ω) = Σ_{u∈ω} A(u) + γ · Σ_{u,v∈ω, u∼v} I(u, v),   (2.21)

where γ > 0 is a weighting factor between the unary and interaction terms, and it should be calibrated in each application on a case-by-case basis. In general, both the A(u) and I(u, v) terms may depend on the observation F. However, a frequent strategy is that only the unary terms depend on F, so that they evaluate the object candidates as a function of the local image data. On the other hand, the I(u, v) components may implement prior geometric constraints, such as that neighboring objects should not overlap, or that they should have similar orientations. Denoting by R_u ⊂ S the set of image pixels covered by the geometric figure of object u, a simple interaction term penalizing object intersection can be calculated as

I(u, v) = #(R_u ∩ R_v) / #(R_u ∪ R_v),   (2.22)

where # denotes set cardinality (see also Fig. 2.6). In the following, we will only use the subscript F when we want to emphasize that a given MPP energy term depends on the measurement data (e.g. A_F(u), Φ_F(ω)). Where the context is clear, the subscript will be omitted to keep the formalism simple.
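As a concrete illustration of Eqs. (2.21)–(2.22), the following Python sketch evaluates the configuration energy of a rectangle population with an overlap-penalizing interaction term; the data term `unary_term`, the distance-based neighborhood threshold, and the weighting factor are placeholders standing in for the application-specific choices.

```python
import numpy as np

def rect_mask(obj, grid_shape):
    """Pixels R_u covered by a rotated rectangle u = (cx, cy, theta_deg, eL, el)."""
    cx, cy, theta, eL, el = obj
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    c, s = np.cos(np.radians(theta)), np.sin(np.radians(theta))
    dx, dy = xs - cx, ys - cy
    return (np.abs(c * dx + s * dy) <= eL / 2) & (np.abs(-s * dx + c * dy) <= el / 2)

def interaction(u, v, grid_shape):
    """I(u, v): normalized overlap area (Eq. 2.22)."""
    Ru, Rv = rect_mask(u, grid_shape), rect_mask(v, grid_shape)
    union = np.count_nonzero(Ru | Rv)
    return np.count_nonzero(Ru & Rv) / union if union else 0.0

def configuration_energy(population, unary_term, grid_shape, gamma=1.0, dist_thresh=30.0):
    """Phi_F(omega) = sum_u A(u) + gamma * sum_{u~v} I(u, v) (Eq. 2.21)."""
    energy = sum(unary_term(u) for u in population)
    for i, u in enumerate(population):
        for v in population[i + 1:]:
            # neighborhood relation ~ : object centers closer than a threshold
            if np.hypot(u[0] - v[0], u[1] - v[1]) < dist_thresh:
                energy += gamma * interaction(u, v, grid_shape)
    return energy

# Example with a dummy, image-independent data term (stand-in for A_F(u)):
pop = [(40.0, 50.0, 10.0, 30.0, 12.0), (60.0, 55.0, 15.0, 28.0, 10.0)]
print(configuration_energy(pop, unary_term=lambda u: -1.0, grid_shape=(120, 120)))
```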


2.3.3 MPP Optimization The optimal object population ω̂ in an MPP model can be taken as the MAP configuration estimate:

ω̂ = argmax_{ω∈Ω} P_F(ω) = argmin_{ω∈Ω} Φ_F(ω).   (2.23)

However, finding ω̂ requires an efficient search in the high-dimensional population space with a non-convex energy function. Ensuring high-quality object configurations with algorithms of feasible computational complexity is crucial in several applications, therefore an extensive bibliography of MPP energy minimization techniques exists. Most previous approaches use the iterative Reversible-Jump Markov Chain Monte Carlo (RJMCMC) scheme [115, 140], where each iteration consists of perturbing one or a few objects using various kernels such as birth, death, translation, rotation, and dilation. Here, experiments show that the rejection rate, especially for the birth move, may induce heavy computation times. Besides, one should be very careful when decreasing the temperature, because at low temperatures it is difficult to add objects to the population. A more recent alternative approach, the Multiple Birth and Death Dynamics technique (MBD) [54], attempts to overcome several of the above-mentioned limitations. Instead of following a discrete jump-diffusion scheme as in RJMCMC, the MBD optimization method defines a continuous-time stochastic evolution of the object population, which aims to converge to the optimal configuration. The evolution under consideration is a birth-and-death equilibrium dynamics on the configuration space, embedded into a Simulated Annealing (SA) process, where the temperature of the system tends to zero in time. The final step is the discretization of this non-stationary dynamics: the resulting discrete process is a non-homogeneous Markov chain with transition probabilities depending on the temperature, the energy function, and the discretization step. In practice, the MBD algorithm evolves the population of objects by alternating purely stochastic object generation (birth) and removal (death) steps in an SA framework. In contrast to the above RJMCMC implementations, each birth step of MBD consists of adding several random objects to the current configuration, which is allowed due to the discretization trick. Using MBD, there is no rejection during the birth step, therefore objects can still be added independently of the temperature parameter, even if they increase the energy. Thus, the final result is much less sensitive to the tuning of the SA temperature decrease schedule, which can therefore be completed faster. Due to these properties, in selected remote sensing tasks (bird and tree detection) [54] the optimization with MBD proved to be around ten times faster than RJMCMC with results of similar quality. On the other hand, we note that parallel sampling in MBD implementations is less straightforward than for the RJMCMC relaxation [182]. In this book, we will propose different structural modifications of the Multiple Birth and Death (MBD) dynamics, adapted to the addressed problems. For this reason, we introduce here the steps of the basic MBD algorithm [54]:


Fig. 2.7 Selected examples of population extraction with MPP models: a flamingo detection in aerial images [54], b building extraction in satellite photos [17], c vehicle detection from Lidar data [37]

1. Initialization: calculate a birth map P_b(·) : S → R using the input data F, which assigns to each pixel s a pseudo-probability value P_b(s) estimating how likely s is to be an object center.
2. Main program: initialize the inverse temperature parameter β = β_0 and the discretization step δ = δ_0, and alternate birth and death steps.
   a. Birth step: for each pixel s ∈ S, if there is no object with center s in the current configuration ω, choose birth with probability δ·P_b(s). If birth is chosen at s: generate a new object u with center s; set the object parameters (marks of u) randomly based on prior knowledge; add u to the current configuration ω.
   b. Death step: consider the configuration of objects ω = {u_1, . . . , u_n} and sort it from the highest to the lowest value of the unary (data) term. For each object u taken in this order, compute the death rate

   d_ω(u) = δ·a_ω(u) / (1 + δ·a_ω(u)),   where a_ω(u) = exp( −β·[Φ_F(ω \ {u}) − Φ_F(ω)] ),

   and kill u with probability d_ω(u).
3. Convergence test: if the process has not converged, increase the inverse temperature β and decrease the discretization step δ according to a geometric scheme, and go back to the birth step. Convergence is reached when all the objects added during the birth step, and only these, have been killed during the death step.

Selected state-of-the-art results for MPP-based object population extraction in different applications using different input sources are shown in Fig. 2.7. Examples (b) and (c) are results of particular methods detailed later in this book.
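The following Python sketch follows the multiple birth and death steps listed above for an abstract object population; the birth map, the prior mark sampler, and the cooling constants are illustrative placeholders rather than values from any experiment in the book.

```python
import numpy as np

def multiple_birth_death(energy, data_term, birth_map, sample_marks,
                         beta0=50.0, delta0=1.0, growth=1.05, n_iter=200, rng=None):
    """Basic Multiple Birth and Death loop, following the steps listed above.

    energy(population)  -> configuration energy Phi_F(omega)
    data_term(u)        -> unary (data) term of a single object
    birth_map[y, x]     -> pseudo-probability P_b(s) that pixel s is an object center
    sample_marks(x, y)  -> new object u (tuple starting with its center (x, y)),
                           marks drawn from a prior
    """
    rng = np.random.default_rng() if rng is None else rng
    beta, delta = beta0, delta0
    population = []
    for _ in range(n_iter):
        # Birth step: every free pixel may spawn an object with probability delta*P_b(s)
        occupied = {(u[0], u[1]) for u in population}
        for (y, x), pb in np.ndenumerate(birth_map):
            if (float(x), float(y)) not in occupied and rng.random() < delta * pb:
                population.append(sample_marks(float(x), float(y)))

        # Death step: visit objects by decreasing data term, kill with rate d_omega(u)
        for u in sorted(list(population), key=data_term, reverse=True):
            without_u = [v for v in population if v is not u]
            a = np.exp(-beta * (energy(without_u) - energy(population)))
            d = delta * a / (1.0 + delta * a)
            if rng.random() < d:
                population = without_u

        beta *= growth      # increase the inverse temperature
        delta /= growth     # decrease the discretization step
    return population
```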


2.4 Methodological Contributions of the Book Although Markov Random Field (MRF) and Marked Point Process (MPP) models provide established tools for classification and population modeling tasks, they face a couple of limitations, which are disadvantageous in various real-world tasks. Chapter 3 presents results from the field of dynamic environment perception, based on both conventional video cameras and a new imaging sensor called the rotating multi-beam Lidar. In this field, several recently published techniques rely on ad hoc and heuristic methodological approaches. In this book, we take advantage of established Bayesian modeling concepts to improve the automatic detection performance under realistic outdoor circumstances. In MRF-based segmentation models, the integration of multiple information sources is a key issue. Earlier proposed feature fusion approaches, such as observation modeling by multinomial feature distributions, or using simple pixel-by-pixel operations on various label maps, often yield insufficient performance. In Chap. 4, we propose novel Markovian label fusion models, which enable flexible integration of various observation-based and prior knowledge-based descriptors in a modular framework. We also introduce a multi-layer Mixed Markov model, which exploits the probabilistic connection modeling capabilities of Mixed Markov models in the multi-layer segmentation process. The conventional MPP models are extended in this book regarding both the temporal and the spatial dimensions. In Chap. 5, we introduce multitemporal MPP frameworks dealing with object-level change detection and moving target tracking tasks. From a technical point of view, this extension needs the definition of various data-based or prior interaction terms between object instances from different time layers, apart from the usual intra-layer constraints of Eq. (2.22). Regarding spatial scene content decomposition, in Chap. 6 we propose an Embedded MPP model consisting of three hierarchical levels, namely object groups, super objects, and object parts. The super (or parent) objects play a similar role as regular objects in MPP models, while the object parts (or child objects) are also marked objects with a predefined set of possible geometric attributes, and they are connected to the parents through additional markers. On the other hand, the object groups are interpreted as subpopulations, which may contain any number of (parent) objects, and various local geometric constraints can be prescribed for the included members. Another key point in MPP models is the probabilistic approach to object proposal. In several previous MPP applications [140], the generation of object candidates followed prior (e.g. Poisson) distributions. In contrast, we apply a data-driven birth process to accelerate the convergence of MBD, proposing relevant objects with higher probability based on various image features. In addition, we calculate not only a probability map for the object centers but also estimate the expected object appearances through low-level descriptors. This approach uses a similar idea to the Data-Driven MCMC scheme of image segmentation [177]. However, while in [177] the importance proposal probabilities of the moves are used by a jump-


diffusion process, we should embed the data-driven exploration steps into the MBD framework. Note that the Bayesian techniques discussed in the book can be efficiently applied if either a color/texture-based statistical description can specify the semantically corresponding regions (see MRFs), or strong geometric constraints can be adapted for object shape description and object population modeling (MPPs). In various situations, for example in semantic urban scene segmentation or in the detection of objects with diverse elastic shapes, such assumptions cannot be made, and neural network (NN)-based solutions are often the first options considered. While in MRF/MPP models we directly involve our prior knowledge (such as geometric features) in the modeling process, in NN-based methods the information used for classification must be entirely extracted from the training data, thus the qualitative and quantitative parameters of the training dataset are critical factors. The success of deep neural networks (DNNs) has focused computer vision research intensively on machine learning approaches in recent years [75]. Apart from their ability to implement automatic feature learning instead of handcrafted feature selection, DNNs can learn strong contextual dependencies from training samples, leading us close to a human-like holistic scene interpretation. On the other hand, while some DNN-based attempts at population counting [206] or remote sensing image segmentation [122] have already been proposed, their superiority over probabilistic or geometric approaches has not yet been thoroughly demonstrated in these domains.

Chapter 3

Bayesian Models for Dynamic Scene Analysis

Abstract In this chapter, we discuss Bayesian approaches for foreground object detection and localization in video surveillance applications. Two different sensors are used for these tasks: conventional electro-optical video cameras, and Rotating Multi-Beam (RMB) Lidar sensors. For the camera image sequences, we first propose a Markov Random Field (MRF)-based foreground extraction technique which addresses cast shadow detection and exploits the spatial coherence of the color and texture values observed in the foreground regions. Thereafter, based on the extracted foreground masks, we present a new Marked Point Process (MPP)-based method for pedestrian localization and height estimation in multi-camera systems, and give a detailed comparative evaluation of the proposed method versus a state-of-the-art technique. The last part of the chapter deals with Lidar point cloud processing, where the key challenges are compensating for the low and inhomogeneous spatial resolution of the measurements and for various artifacts in point cloud formation caused by the rotating sensor technology. We also present here application examples including motion detection, gait-based pedestrian re-identification, and activity recognition using a single RMB Lidar sensor which monitors the scene from a fixed position.

3.1 Dynamic Scene Perception Automated perception and interpretation of the surrounding environment are key issues in intelligent city management, traffic monitoring and control, security surveillance, and autonomous driving. Critical tasks involve detection, recognition, localization and tracking of various moving and static objects, as well as environmental change detection and change classification. Apart from using image-based 2D detection techniques, several surveillance systems provide solutions for localization and tracking in the real 3D world coordinate system of the observed environment, a requirement which, considering the temporal dimension of the measurements, leads to 4D perception problems [19]. A significant part of the existing environment monitoring systems use electro-optical cameras as perception sensors (see Fig. 3.1a), due to their established technologies, the wide choice of available properties, and scalable prices. Nevertheless,


Fig. 3.1 Measurements from the same surveillance scene by an optical camera and a Lidar sensor

despite the well-explored literature of the topic, event analysis in optical image sequences may still be challenging in crowded outdoor scenes due to uncontrolled illumination conditions, irrelevant background motion, and occlusions caused by various moving and static scene objects [70, 199]. In such situations multi-camera configurations can provide better solutions, since they monitor a dynamic scene from multiple viewpoints, taking advantage of stereo vision to exploit depth information for 3D localization and tracking [59, 69]. However, both mono and multi-camera systems suffer from a number of basic problems, such as artifacts due to moving shadows and low contrast between different objects in the color domain [132, 191], issues which still raise open research challenges in the topic. As alternatives to conventional optical video cameras, range sensors offer significant advantages for scene analysis, since they provide direct geometrical information [38]. Using infrared light-based Time-of-Flight (ToF) cameras [161] or laser-based Light Detection and Ranging (Lidar) sensors enables recording directly measured range images, where we can avoid the artifacts of stereo vision-based depth map calculation. From the point of view of data analysis, ToF cameras record depth image sequences over a regular 2D pixel lattice, where established image processing approaches, such as Markov Random Fields (MRFs), can be adopted for smooth and observation-consistent segmentation and recognition [25]. However, such cameras can only be reliably used indoors, due to limitations of current infrared-based sensing technologies, and they usually have a limited Field of View (FoV), which can be a drawback for surveillance and monitoring applications. Rotating Multi-beam (RMB) Lidar systems provide a 360° FoV of the scene, with a vertical resolution equal to the number of the laser sensors, while the horizontal angular resolution depends on the speed of rotation (see Fig. 3.1b). Each laser point of the output point cloud is associated with 3D spatial coordinates, and possibly with auxiliary channels such as the reflection number or an intensity value of the laser reflection. RMB Lidars can produce high frame rate point cloud videos enabling dynamic event analysis in 3D space. On the other hand, the measurements have a low spatial density, which quickly decreases as a function of the distance from the sensor, and the point clouds may exhibit particular patterns typical of the sensor characteristics (such as the ring patterns in Fig. 3.1b).


Foreground object segmentation is a key preliminary step in automated video surveillance applications. Foreground areas usually contain the regions of interest, moreover, an accurate object-silhouette mask can directly provide useful information for several applications, for example, people [51, 85, 86] or vehicle detection [99], tracking [52, 212], biometrical identification through gait recognition [87, 189] or activity analysis [170].

3.2 Foreground Extraction in Video Sequences The first method we introduce here is foreground detection in the video frames of a single camera. For this purpose, we propose a Markov Random Field (MRF)-based approach [25], which considers the task as a three-class segmentation problem with foreground, background and moving shadow classes. The applied technique has three key features [25]:
• We present a new parametric shadow model [28], where local feature vectors are derived at the individual pixels, and the shadow's domain is represented by a global probability density function in that feature space. The parameter adaptation algorithm is based on following the changes in the shadow's feature domain.
• The model encapsulates a novel multi-modal color model for the foreground class, which exploits spatial color statistics instead of high frame rate temporal information to describe the regions of moving objects. Using the assumption that any object consists of spatially connected parts which have typical color/texture patterns, the distribution of the likely foreground colors is locally estimated in each pixel neighborhood.
• We introduce a probabilistic description of the microstructural responses observed in the background and in shadows, where the features can be defined by arbitrary 3 × 3 kernels. Different kernels can be used at different pixel positions, and an adaptive kernel selection strategy is proposed which considers the local textural properties of the background regions.

3.2.1 Related Work in Video-Based Foreground Detection Although background removal in video sequences has been a well-examined problem over the last two decades (see, e.g. [63, 83, 134, 162, 170, 191, 192, 210, 212]), it still raises challenging problems and remains an active research field today. The automatic detection of objects that are abandoned or removed in a video scene has key applications in video surveillance [127]. Another approach focuses on decreasing the false positive hits of the motion detector due to misalignment problems, ensuring the reliable detection of both static and moving objects in surveillance videos [64]. For camera motion compensation, a flow-process-based foreground extraction method


has been proposed in [205], which can be utilized both in video codecs and in monitoring systems. A codebook-based background subtraction model is proposed in [76], which combines a multi-layer block-based strategy and adaptive feature extraction from blocks of various sizes, for removing most parts of the nonstationary (dynamic) background and significantly increasing the processing efficiency. A novel background subtraction approach using an RGB-D camera and an adaptive blind updating policy is introduced in [56]. This method allows the scene model to adapt to the changes in the background, by detecting the stationary moving objects and reducing the ghost phenomenon. Another approach [156] presents an algorithm for background modeling and foreground detection that uses scaling coefficients, which are defined with a new color model called lightness-red-green-blue. After the background model is computed, foreground objects are detected by using the scaling coefficients and various additional criteria. In this chapter, we address two specific problems of video segmentation: shadow detection and foreground modeling. To enhance the results, a novel microstructure model is used as well.

3.2.1.1 Shadow Detection: An Overview

The presence of moving cast shadows on the background makes it difficult to estimate shape [211] or behavior [85] of moving objects, because they can be erroneously classified as part of the foreground mask. Even very recent studies [139, 157] emphasize the difficulties of automatic shadow detection due to various environmental problems, such as chromatic shadows, non-textured and dark surfaces, and foreground-background camouflage. Since under some illumination conditions 40−50% of the non-background points may belong to shadows, methods which do not use explicit shadow models become less robust against complex illumination issues in real environment [162, 170, 210]. Hence, we deal here with an image segmentation problem with three classes: foreground objects, background, and shadows of the foreground objects being cast on the background. Note that we should not detect self-shadows (i.e. shadows appearing on the foreground objects), which are part of the foreground, and static shadows (i.e. cast shadows of the static objects), because they correspond to the background. In the literature, different approaches are available regarding shadow detection. Apart from a few geometry-based techniques suited to specific conditions [84, 201], shadow detection is usually done by color filtering. Still image-based methods [58, 61] attempt to find and remove shadows in the single frames independently. However, these models have been evaluated only on high-quality images where the background has a uniform color or texture pattern, while in video surveillance, we must expect images with poor quality and resolution. The authors in [58] note that their algorithm is robust when the shadow edges are clear, but artifacts may appear in cases of images with complex shadows or diffuse shadows with poorly defined edges. For practical use, the computational complexity of these algorithms should be decreased [61].


Some other methods focus on discriminating shadow edges from edges due to object boundaries [73, 106]. However, it may be difficult to extract connected foreground regions from the resulting edge map, which is often ragged [73]. Complex scenarios containing several small objects or shadow parts may also be disadvantageous for these methods. For the above reasons, we focus on video-based (instead of still image-based) and region-based (instead of edge-based) shadow modeling techniques in the following. Here, an important point of view regarding the categorization of the algorithms [148] is the distinction between non-parametric and parametric cases. In non-parametric or 'shadow invariant' methods, the pixel values are converted into an illuminant invariant feature space: these approaches attempt to remove shadows instead of developing a detection algorithm. In many cases, a color space transformation is suitable for this task. The normalized rgb [45, 141] and C1 C2 C3 spaces [158] are supposed to fulfill color constancy by using only chrominance color components. The method of Porikli and Thornton [145] is based on the assumption of hue constancy under illumination changes, which is used to train a weak classifier as a key step of a more complex shadow detector. An extensive review of early illumination invariant techniques can be found in [158], which confirms that strict constraints must be prescribed regarding the reflecting surfaces and lighting conditions when using this approach. Also [106] emphasizes the limits of the illumination invariant methods: outdoors, shadows will have a blue color cast (due to the sky), while lit regions have a yellow cast (sunlight), hence the chrominance color values corresponding to the same surface point may be significantly different in shadow and in sunlight. We have also found in our experiments that the shadow invariant methods often fail outdoors, and they are rather usable indoors (Fig. 3.11). Moreover, since they ignore the luminance components of the color, these models become sensitive to noise. Consequently, we develop a parametric model: first, we estimate the mean background values of the individual pixels through a statistical background model [170], then we extract feature vectors from the actual and the estimated background values of the pixels and model the feature domain of shadows in a probabilistic way. Parametric shadow models may be local or global. A local shadow model uses a distinct shadow process for each pixel, whose parameter values must be set independently of other pixels. In the method of Martel-Brisson and Zaccarin [131], the local shadow parameters are trained using a second mixture model similar to the background model in [170]. In this way, one can account for the fact that on an inhomogeneous surface the light absorption-reflection properties of the different scene points are notably different. However, to ensure convergence of the unsupervised training process, every single pixel must be covered by shadows several times during the observation period, while the illumination conditions stay unchanged. This hypothesis cannot be ensured in most outdoor surveillance environments, therefore this local modeling approach is very risky to use in our case. Following a different approach, we characterize shadows with global parameters in an image (or in each relevant subregion of the image), which describe the dependencies between the illuminated and shadowed values of a given background surface point. Using a probabilistic model, this relationship is described by a random transform, therefore several illumination artifacts can be taken into consideration. On the other hand, the shadow parameters are derived from global statistical image descriptors, therefore the model can also estimate the expected shadow color values at pixel positions where motion is rare.

3.2.1.2 Modeling the Foreground

Another important issue is related to foreground modeling. Some approaches [134, 170] consider background subtraction as a one-class classification problem, where foreground image points are simply recognized as pixels that do not match the background model. Similarly, [132, 192] build adaptive models for the background and shadow classes and detect foreground as outlier regions with respect to both models. However, background- and shadow-colored object parts cannot be detected in this way. To overcome this problem, the foreground must also be modeled in a more sophisticated manner. Before going into the details, we make a remark on an important property of the examined video flows. Many video surveillance applications require high-resolution images. However, since the bandwidth of video signal transmission is limited, the sequences are often captured at a low [48] or unsteady frame rate. The bandwidth issues are particularly crucial if the video sources are connected to the processing system through narrow-band radio channels or congested networks. Consider also another, off-line application: recorded surveillance videos of a given scene should be quickly evaluated after a criminal event. Since usually many video streams must be continuously recorded in parallel, these videos may have a frame rate of less than 1 fps to save storage resources. In the above-mentioned applications, widely used temporal features cannot be efficiently utilized, such as probabilistic pixel state transition models [99, 155, 191], periodicity analysis [51, 85], temporal foreground description [162], or tracking [52, 200], which need a high and permanent frame rate. Therefore, we have chosen frame rate independent features to ensure a graceful degradation of the system if the frame rate is low or unsteady. For the above reasons, the new model discussed here uses spatial color information instead of temporal statistics to describe the foreground. It assumes that foreground objects consist of spatially connected parts and these parts can be characterized by typical color distributions. Since these distributions can be multi-modal, the object parts need not be homogeneous in color or texture, and we exploit the spatial information without segmenting the foreground components. Note that spatial object description has already been applied both in interactive [34] and in unsupervised image segmentation [102]. However, the unsupervised approach can only detect large objects with typical color or texture, since small segmentation classes are penalized [102]. The authors in [162] have characterized the foreground by assuming temporal persistence of the color and smooth changes in the objects' positions. Nevertheless, working with low frame rate video sequences


containing quickly moving and frequently overlapping objects, appropriate temporal information is often not available.

3.2.1.3 Texture Analysis and Color Space Selection

Apart from the color values of the pixels, microstructure information is exploited to make the segmentation more accurate. In several earlier methods [88, 207], background subtraction is performed based purely on texture descriptors, an approach which is effective and justified in the case of a dynamic background (like a rippling lake), but it is usually less efficient than pixel value comparison in a static environment. A solution for integrating intensity and texture features in frame differencing is proposed in [119]; however, that method does not focus on accurate foreground segmentation. Regarding the background class, our proposed color-texture fusion process is similar to the technique of [191], which uses intensity level and local gradient features in parallel. As a novel contribution of our model, different and adaptively chosen microstructural kernels are used, so that the local scene properties can be better taken into account. In addition, we also show how this probabilistic approach can be used to improve the introduced shadow model. Another key issue of color-based shadow segmentation is selecting the most appropriate color space. In our technique the CIE L*u*v* space is adopted, exploiting its two well-known properties. First, the perceptual distance between colors can be measured by the Euclidean distance [77]. Second, the three color components are approximately uncorrelated with respect to camera noise and changes in illumination [174]. Since the model parameters are derived in a statistical manner, we do not need accurate color calibration for each scene, but we can rely on the common CIE D65 standard. It is also not critical to explore the exact physical meaning of the different color components, which is usually environment-dependent [72, 158]; we use only an approximate interpretation of the L, u, v color values, and the validity of the model is demonstrated via experiments.

3.2.1.4 Summary of Contributions of the Proposed Model

The main contributions of the proposed foreground and shadow segmentation approach can be divided into three groups. First, a novel statistical shadow model is introduced which is robust against various issues in real-world surveillance scenes (Sect. 3.2.3.2), and an automatic update procedure is presented for the shadow parameters, which is an open question in many previous similar methods (Sect. 3.2.6.2). Second, an object-independent spatial description is proposed for the foreground, which can improve the segmentation output even in low frame rate videos (Sect. 3.2.5). Third, we demonstrate that microstructure analysis can enhance the segmentation in the discussed MRF framework (Sect. 3.2.4). The new method is validated using realistic surveillance videos and also on test sequences from a well-


known benchmark set [148]. A detailed comparison against competing methods is presented in Sect. 3.2.8. We also make a few assumptions in this section: (i) the camera is static without significant ego-motion, (ii) static background objects are present in the scene (e.g. there is no waving river in the background), and (iii) regarding the external illumination, we assume that there is only one dominant emissive light source in the scene; however, we consider the presence of further reflected or diffuse light components.

3.2.2 MRF Model for Foreground Extraction The segmentation model follows the MRF-based Bayesian image labeling approach introduced in Sect. 2.2.3. Denote by S the 2D pixel grid; we henceforth use a first-order neighborhood system on the lattice. As defined earlier, a unique node of the MRF graph G is assigned to each pixel. Thus, for simplicity, s will denote in this chapter both a pixel of the image and the corresponding node of G. The proposed algorithm assigns a label ς(s) to every pixel s ∈ S from a three-element label set Λ = {fg, bg, sh} corresponding to the three considered classes: foreground (fg), background (bg), and shadow (sh). Following the MRF approach, the segmentation of a video frame is equivalent to a global labeling ω = {[s, ς(s)] | s ∈ S}, and the probability of a given ω ∈ ϒ in the label field follows a Gibbs distribution. The image data (observation) at pixel s is characterized by a 4D feature vector:

f(s) = [f_L(s), f_u(s), f_v(s), f_χ(s)]^T,   (3.1)

where the first three coordinates represent the color components in the CIE L*u*v* space, and f_χ(s) is a texture term (more specifically a microstructural response), which will be introduced in detail in Sect. 3.2.4. The set F = {f(s) | s ∈ S} denotes the global observation. To adjust the MRF framework to the targeted foreground segmentation problem, the main task is to define the conditional density functions p_ς(s) = P(f(s) | ς(s) = ς), for all ς ∈ Λ and s ∈ S. For example, the notation p_bg(s) refers to the probability that the background process generates the observed feature value f(s) at pixel s. For easier discussion, we will also treat f(s) under the background class as a random variable with probability density function p_bg(s). The above-mentioned conditional density functions are defined in Sects. 3.2.3–3.2.6, and the segmentation procedure is presented in detail in Sect. 3.2.8. Note that since in fact we minimize the minus-logarithm of the global probability term (see Eq. (2.11)), we will also use the notation ε_ς(s) = − log p_ς(s) for the local energy terms.


3.2.3 Probabilistic Model of the Background and Shadow Processes

3.2.3.1 General Model

The color-texture feature distributions in the background and in the shadow are modeled by Gaussian density functions, similar to various earlier approaches [148, 155, 191]. Since in the CIE L*u*v* color space the different color components are highly uncorrelated [174], the joint distribution of the features can be efficiently approximated by a 4D Gaussian density function with a diagonal covariance matrix:

Σ_φ(s) = diag{ σ²_{φ,L}(s), σ²_{φ,u}(s), σ²_{φ,v}(s), σ²_{φ,χ}(s) },   (3.2)

for φ ∈ {bg, sh}. Consequently, the distribution parameters are the mean vector μ_φ(s) = [μ_{φ,L}(s), . . . , μ_{φ,χ}(s)]^T and the standard deviation vector σ_φ(s) = [σ_{φ,L}(s), . . . , σ_{φ,χ}(s)]^T. Using this diagonal model we do not need to perform matrix inversion and determinant computation during the calculation of the probabilities, and the ε_φ(s) = − log p_φ(s) energy values can be calculated directly from the 1D marginal probabilities:

ε_φ(s) = 2 log 2π + Σ_{i∈{L,u,v,χ}} [ log σ_{φ,i}(s) + (1/2) · ( (f_i(s) − μ_{φ,i}(s)) / σ_{φ,i}(s) )² ].   (3.3)

Based on Eq. (3.3), each feature contributes a distinct additive term to the energy calculus. Thus, we obtain a modular model structure where the 1D marginal distribution parameters [μ_{φ,i}(s), σ²_{φ,i}(s)] can be estimated independently of each other.
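The modular energy of Eq. (3.3) is straightforward to vectorize; the following Python sketch computes ε_φ(s) for a whole frame at once, with invented array names and random placeholders standing in for the fitted background or shadow parameters.

```python
import numpy as np

def class_energy(features, mu, sigma):
    """Per-pixel energy eps_phi(s) = -log p_phi(s) for a diagonal 4D Gaussian (Eq. 3.3).

    features, mu, sigma: arrays of shape (H, W, 4) holding the [L, u, v, chi]
    observation and the class-wise mean / standard deviation at every pixel.
    """
    z = (features - mu) / sigma
    return 2.0 * np.log(2.0 * np.pi) + np.sum(np.log(sigma) + 0.5 * z**2, axis=-1)

# Illustrative usage with random placeholders for one 480x640 frame:
H, W = 480, 640
f = np.random.rand(H, W, 4)
eps_bg = class_energy(f, mu=np.full((H, W, 4), 0.5), sigma=np.full((H, W, 4), 0.1))
```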

3.2.3.2 Color Features in the Background Model

Using a Gaussian probability density function (pdf) to model the observed color values of a single background pixel is widely adopted in the literature, and various corresponding parameter estimation procedures exist [118, 170]. In our model, following one of the most popular approaches [170], we train the color components of the background parameters [μ_bg(s), σ_bg(s)] in a manner similar to the conventional online k-means algorithm. Although this algorithm is widely known, it is important for understanding the following parts of this section, thus we briefly introduce it. We consider each pixel s as a separate process, which generates an observed pixel value sequence over time:

{ f^[1](s), f^[2](s), . . . , f^[t](s) }.   (3.4)


To model the recent history of the pixels, [170] suggested a mixture of K Gaussians distribution:

P(f^[t](s)) = Σ_{k=1}^{K} κ_k^[t](s) · η( f^[t](s), μ_k^[t](s), σ_k^[t](s) ),   (3.5)

where η(·) is a Gaussian density function with a diagonal covariance matrix. We ignore here multi-modal background processes [170], and consider the background Gaussian term to be equivalent to the Gaussian component in the mixture which has the highest weight. Thus, at time t:

μ_bg(s) = μ_{k_max}^[t](s),   σ_bg(s) = σ_{k_max}^[t](s),   (3.6)

where

k_max = arg max_k κ_k^[t](s).   (3.7)
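A compact way to realize Eqs. (3.5)–(3.7) is to keep per-pixel arrays of the mixture weights, means, and deviations and to read out the dominant component; the sketch below assumes K = 3 components and invented array shapes purely for illustration.

```python
import numpy as np

def background_gaussian(weights, means, sigmas):
    """Select the highest-weight mixture component per pixel (Eqs. 3.6-3.7).

    weights: (H, W, K) mixture weights kappa_k(s)
    means, sigmas: (H, W, K, 4) per-component mean / std of the 4D feature
    Returns (mu_bg, sigma_bg), each of shape (H, W, 4).
    """
    k_max = np.argmax(weights, axis=-1)                 # (H, W)
    rows, cols = np.indices(k_max.shape)
    mu_bg = means[rows, cols, k_max]                    # (H, W, 4)
    sigma_bg = sigmas[rows, cols, k_max]
    return mu_bg, sigma_bg

# Placeholder usage with K = 3 components on a small frame:
H, W, K = 120, 160, 3
w = np.random.dirichlet(np.ones(K), size=(H, W))
mu, sd = np.random.rand(H, W, K, 4), np.full((H, W, K, 4), 0.1)
mu_bg, sigma_bg = background_gaussian(w, mu, sd)
```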

The parameters of the above distribution are estimated and updated without user interaction. First, we introduce a D matching operator between a pixel value and a local Gaussian component, D_s(f(s), k), based on the quadratic form (f^[t](s) − μ_k^[t](s))^T (Σ_k^[t](s))^{−1} (f^[t](s) − μ_k^[t](s)). Based on the class energies, an initial label is assigned to each pixel s:

ς_s^0 = fg if (ε_bg(s) > τ_u) AND (ε_sh(s) > τ_u), and ς_s^0 = bg otherwise,   (3.30)

where ς_s^0 is the preliminary segmentation label of s, and τ_u is a threshold, which is analogous to the uniform foreground value used in prior models [192], corresponding to the choice ε_fg(s) = τ_u. In the next step, we estimate around each pixel s the local foreground color distribution, using the certainly foreground pixels in the neighborhood of s. The process is visualized in Fig. 3.5 with 1D grayscale feature vectors. A few new notations are used here. H is the set of pixels marked as certainly foreground in the preliminary mask:

H = {r | r ∈ S, ς_r^0 = fg}.   (3.31)

Note that H may be only a coarse estimate of the foreground, as shown in Fig. 3.5b.

Fig. 3.5 Determination of the foreground conditional probability term for a given pixel s (for simpler representation in grayscale). a Video image, with marking s and its neighborhood Us (window side z = 45 is used.), b Noisy preliminary foreground mask, c Set Hs : preliminary detected foreground pixels in Us . (Pixels of Us \Hs are marked with white.), d Histogram of Hs , marking f (s), and its τ f neighborhood, e Result of fitting a weighted Gaussian term for the [ f (s) − τ f , f (s) + τ f ] part of the histogram. Here, τu = 2.71 is used (it would be the foreground probability density value for each pixel according to the uniform model), but the procedure increases the foreground probability to 4.03, f Segmentation result of the model optimization with the uniform foreground calculus, g Segmentation result by the proposed model


Denote by U_s the set of pixels near s, using a rectangular neighborhood with window size z × z (Fig. 3.5a). Thereafter, we define H_s with respect to s as the set of neighboring pixels pre-classified as foreground (see Fig. 3.5c): H_s = H ∩ U_s. The color distribution of foreground pixels in the neighborhood of s can be described by a normalized histogram h_s over H_s (Fig. 3.5d). In our algorithm, instead of using the noisy h_s directly, we approximate it by a smooth parametric probability density function ĥ_s(f) (e.g. a mixture of Gaussians), and calculate the foreground probability term as p_fg(s) = ĥ_s(f(s)). To handle multicolor or textured foreground segments, we should use a multi-modal ĥ_s(·) function in the model (a bi-modal case is shown in Fig. 3.5d). Since we only use ĥ_s(·) to calculate the foreground probability value of s as ĥ_s(f(s)), we only need to estimate the parameters of the mode of ĥ_s(·) which covers f(s) (see Fig. 3.5e). For this reason, we model ĥ_s(·) as a mixture of a weighted Gaussian term η(·) and a residual term ϑ_s(·), for which we only prescribe that ϑ_s(·) is a probability density function, ϑ_s(f) = 0 if |f(s) − f| < τ_f, and κ(s) is a weighting factor with 0 < κ(s) < 1. Hence,

ĥ_s(f) = κ(s) · η(f | μ_s, Σ_s) + (1 − κ(s)) · ϑ_s(f).   (3.32)

Consequently, the foreground probability value of pixel s is characterized by the statistical distribution of the color values observed in its neighborhood:

ε_fg(s) = − log ĥ_s(f(s)) = − log κ(s) − log η(f(s) | μ_s, Σ_s).   (3.33)

The steps of the foreground energy term calculation algorithm are presented in Fig. 3.6. We can make the algorithm quicker, if we calculate the Gaussian parameters by considering only some randomly selected pixels in Hs . We describe the parameter settings in Sect. 3.2.6.1.
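As a rough illustration of Eqs. (3.32)–(3.33), the sketch below fits the single Gaussian mode around the observed value from the pre-classified foreground samples in the window; the subsampling size, the mode-weight estimate, and the fallback to the uniform energy τ_u are illustrative choices, not the exact procedure of Fig. 3.6.

```python
import numpy as np

def foreground_energy(f_s, neighbor_fg_values, tau_f, tau_u, max_samples=200, rng=None):
    """eps_fg(s) = -log kappa(s) - log eta(f(s) | mu_s, sigma_s), 1D grayscale case.

    f_s                 -- observed feature value at pixel s
    neighbor_fg_values  -- values of pre-classified foreground pixels in the window U_s
    tau_f               -- half-width of the mode around f(s)
    tau_u               -- uniform foreground energy used when no evidence is available
    """
    rng = np.random.default_rng() if rng is None else rng
    vals = np.asarray(neighbor_fg_values, dtype=float)
    if vals.size > max_samples:                      # optional random subsampling
        vals = rng.choice(vals, size=max_samples, replace=False)
    mode = vals[np.abs(vals - f_s) < tau_f]          # samples of the mode covering f(s)
    if vals.size == 0 or mode.size < 2:
        return tau_u                                 # fall back to the uniform model
    kappa = mode.size / vals.size                    # weight of the Gaussian mode
    mu, sigma = mode.mean(), max(mode.std(), 1e-3)
    log_eta = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((f_s - mu) / sigma) ** 2
    return -np.log(kappa) - log_eta

print(foreground_energy(0.42, np.random.rand(500), tau_f=0.1, tau_u=2.71))
```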

3.2.6 Parameter Settings The parameters of the proposed approach are in part scene-dependent and in part condition-dependent. Scene-dependent parameters are influenced by the scene geometry, object materials, and (static) camera settings, thus they do not change in time and they can be set on a case-by-case basis for a given scene. To support


Fig. 3.6 Algorithm for determination of the foreground probability term. Notations are defined in Sect. 3.2.5

the system configuration process, we provide here some general parameter setting strategies. Condition-dependent parameters are related to time-varying phenomena, such as external illumination, therefore, it is necessary to develop adaptive algorithms which can automatically re-estimate their values during the operation. Exploiting some key properties of the proposed approach (see Sect. 3.2.3.1), we only need to estimate the 1D marginal distribution parameters of the background and shadow classes. Moreover, we should only focus on parameters of the color components, since the mean-deviation values of the microstructural feature can be analytically calculated from the distributions of the color descriptors (see Sect. 3.2.4.1).

3.2.6.1 Background and Foreground Model Parameters

The implemented parameter estimation and parameter update process of the background class is based on the widely used Mixture of Gaussians technique (detailed in Sect. 3.2.3.2), which presents efficient and stable results both in indoor and outdoor scenes. The foreground model parameters (Sect. 3.2.5) depend on prior information about the scene, like the size range of the moving objects and average image contrast. These


features represent quite general, low-level scene information, thus the method can deal with a large variety of moving objects without needing object-specific or shape-specific parameter calibration. In our experiments, we set these parameters by trial-and-error using the following strategies:
• z: size of the neighborhood window U_s in pixels considered in the process. This parameter depends on the expected object sizes in the scene: we used z = (1/3)·√T_B, where T_B is the estimated average area of the occurring objects' bounding boxes.
• κ_min: control parameter for the minimum required number of pre-classified foreground pixels in the neighborhood. If the ratio of these pixels to the size of the neighborhood is smaller than κ_min, the sigmoid function of Eq. (3.35) keeps the foreground probability low. A low value of κ_min increases the number of extracted foreground pixels and can be used if the objects have compact shapes, as in the Highway sequence; otherwise a small κ_min results in a large number of false positive foreground pixels. Applying κ_min = 0.1 for vehicle detection and κ_min = 0.25 for people (including cyclists, baby carriages, etc.) proved to be efficient.
• τ_f: threshold parameter which prescribes the maximal distance in the feature space between pixels generated by the same Gaussian process. Outdoors, in high contrast, we use τ_f = 0.2 · d_max, and indoors τ_f = 0.1 · d_max, where d_max is the maximum occurring distance in the feature space.
Notes on parameter τ_u are given in Sect. 3.2.8.2.
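The window-size rule and the context-dependent defaults above can be wrapped into a small helper; the numbers simply restate the values quoted in the list, while the function name and interface are ours.

```python
import numpy as np

def default_foreground_params(avg_bbox_area, scene="outdoor_high_contrast",
                              target="people", d_max=1.0):
    """Heuristic defaults for the foreground model parameters described above."""
    z = int(round(np.sqrt(avg_bbox_area) / 3.0))          # window side: z = sqrt(T_B) / 3
    kappa_min = 0.1 if target == "vehicles" else 0.25     # compact shapes tolerate lower values
    tau_f = (0.2 if scene == "outdoor_high_contrast" else 0.1) * d_max
    return {"z": z, "kappa_min": kappa_min, "tau_f": tau_f}

print(default_foreground_params(avg_bbox_area=45 * 45))   # e.g. {'z': 15, ...}
```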

3.2.6.2 Shadow Parameters

The color and sharpness of shadows can change considerably and rapidly in a given scene, following the changes in global illumination (Fig. 3.7), for many reasons. For example, indoors one can turn the lights on or off, while outdoors the sun may quickly hide behind the clouds.

Fig. 3.7 Different parts of the day on Entrance sequence, segmentation results. Above left: in the morning (am), right: at noon, below left: in the afternoon (pm), right: wet weather


In our system, we use different algorithms for the initialization and for the update of the shadow parameters. From a practical point of view, initialization can be implemented in a supervised way by annotating the shadow areas in a few video frames by hand, immediately after turning on the system. Based on the training data, the maximum likelihood estimate of the initial shadow parameters can be calculated. On the other hand, in an automated surveillance environment, after the initialization phase the system should operate continuously without user interaction. Therefore, for adoption to the illumination changes we need to implement an automated re-estimation procedure. Since as stated above, we use an established supervised initialization step, we only need to focus on the parameter updating algorithm in the following. Note that our proposed method has been validated in real life, since it was built into a 24-h operating surveillance system of our university campus. In this book, we validate our algorithm on four sequences captured by the same camera under different illumination conditions (Fig. 3.7). For all sequences, ground truth annotation has been performed manually by operators. As detailed in Sect. 3.2.3.2, the shadow process has six scalar parameters: the 3 plus 3 components of the μψ and σ ψ vectors, respectively. We can examine in Fig. 3.8 1D histograms for the observed ψ L , ψu and ψv values of shadowed points for each video sequence. As it is shown here, parameters σ ψ , μψ,u and μψ,v are

Fig. 3.8 Shadow ψ statistics on four sequences recorded by the Entrance camera of our University campus. Histograms of the occurring ψ L , ψu , and ψv values of shadowed points. Rows correspond to video shots from different parts of the day. We can observe that the peak of the ψ L histogram strongly depends on the illumination conditions, while the change in the other two shadow parameters is much smaller


nearly constant or change quite slowly, but μ_{ψ,L} varies considerably over time. Therefore, we update the parameters in two different ways. The parameter update process for the chrominance-related Gaussian components (i.e. the [μ_{ψ,u}, σ_{ψ,u}] and [μ_{ψ,v}, σ_{ψ,v}] parameters) is based on a classical approach [192]. We show it here for the u-parameter only, but the v component is updated similarly. The parameters are re-estimated at constant time intervals T. Let us denote by μ_{ψ,u}[t], σ_{ψ,u}[t] the parameters at time t. W_{t2} is the set of the ψ_u values of the detected shadow pixels between times t_1 = t_2 − T and t_2:

W_{t2} = { ψ_u^[t](s) | t = t_1, . . . , t_2 − 1, ς^[t](s) = sh, s ∈ S },   (3.37)

where the upper index [t] refers to time, #W_{t2} is the set cardinality, and M_{t2} and D_{t2} are the arithmetic mean and the standard deviation of W_{t2}. We update the parameters as follows:

μ_{ψ,u}[t_2] = (1 − ξ[t_2]) · μ_{ψ,u}[t_1] + ξ[t_2] · M_{t2},   (3.38)

σ²_{ψ,u}[t_2] = (1 − ξ[t_2]) · σ²_{ψ,u}[t_1] + ξ[t_2] · D²_{t2}.   (3.39)

48

3 Bayesian Models for Dynamic Scene Analysis

Fig. 3.9 ψ statistics for all non-background pixels Histograms of the occurring ψ L , ψu and ψv values of all the non-background pixels in the same sequences as in Fig. 3.8

Fig. 3.10 Updating algorithm for parameter μψ,L

The key issue of the process is the management of set Q. We define MAX and MIN parameters which give boundaries for the size of Q. The steps of the queue management algorithm are detailed in Fig. 3.10. Consequently, Q contains always the latest available ψ L values, while the algorithm keeps the size of Q between prescribed bounds MAX and MIN. Practically, the actual size of Q is around MAX in cluttered scenarios. On the other hand, if there is no significant motion in the scene for a time period, the size of Q decreases until MIN, which property increases the influence of the forthcoming elements, and causes quicker adaptation, since it is faster to modify the shape of a smaller histogram.

3.2 Foreground Extraction in Video Sequences

49

Parameter σψ,L is updated similar to σψ,u but only in the time periods when μψ,L does not change significantly.

3.2.7 MRF Optimization The MAP estimator (Eq. (2.11)) is implemented by combining a conditionally independent random field of signals and a Potts model realizing interactions between different pixels [147]. The optimal global labeling   is defined by

  = arg min

 ∈ϒ

⎧ ⎪ ⎨ ⎪ ⎩ s∈S

⎫ ⎪ ⎬   − log P f (s) | ς (s) +  (ς (r ), ς (s)) , ⎪    r,s∈S ⎭ 

(3.40)

ς(s) (s)

where we aim to find a good approximation of the global minimum over all possible segmentations (ϒ) of a given input image. The first energy term in Eq. (3.40) contains the sum of the local class-energy values for all image pixels (see also Eqs. (3.3) and (3.36)). The second term ensures to obtain a smooth segmentation:  (ς (r ), ς (s)) = 0 if s and r are not neighboring pixels, otherwise:

 (ς (r ), ς (s)) =

−δ if ς (r ) = ς (s) +δ if ς (r ) = ς (s).

(3.41)

The above energy function can be minimized by standard MRF optimization algorithms such as the deterministic Modified Metropolis (MMD) [104] relaxation method (detailed in Sect. 2.2.4), or the quick graph-cut-based technique [40].
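As a minimal illustration, the following sketch evaluates the energy of Eqs. (3.40)–(3.41) for a candidate labeling. The 4-neighborhood, the array layout of the class energies and the value of δ are assumptions made for the example; the optimizers themselves (MMD, graph cut) are not reproduced.

```python
import numpy as np

def potts_energy(labels, class_energy, delta=1.0):
    """labels: (H, W) int array; class_energy[y, x, k] holds -log P(f(s) | class k)."""
    h, w = labels.shape
    # first term of Eq. (3.40): sum of the local class energies of the chosen labels
    data_term = class_energy[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # second term, Eq. (3.41): -delta for equal, +delta for different neighboring labels
    smooth = 0.0
    for axis in (0, 1):
        same = labels == np.roll(labels, -1, axis=axis)
        same = same.take(range(labels.shape[axis] - 1), axis=axis)  # drop wrap-around pairs
        smooth += np.where(same, -delta, delta).sum()
    return data_term + smooth
```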

3.2.8 Results

The proposed MRF technique has been tested on real surveillance video sequences, where the contributions of the method have been evaluated both qualitatively (see Figs. 3.11 and 3.12) and quantitatively, and compared to state-of-the-art results.

3.2.8.1 Test Sequences

For detailed evaluation, the following seven test videos have been used:

• Laboratory test sequence from the ATON benchmark set [148], which contains a fairly simple office environment.
• Highway video, an outdoor sequence also from the ATON benchmark. In this sequence, dark shadows are observable on a nearly homogeneous background road surface. Note that for this video earlier methods always used specific post-processing operations [132], which are not required by the proposed model.
• Corridor sequence, an indoor surveillance video in which bright objects and background segments often saturate the image sensors. As a main challenge, it is hard to accurately separate the white shirts of the walking people from the white walls in the background.
• Four different surveillance video sequences captured by the Entrance (outdoor) camera of our university campus under different lighting conditions (see Fig. 3.7: Entrance am, Entrance noon, Entrance pm, and Entrance overcast). These sequences suffer from difficult illumination and reflection effects and sensor saturation (dark objects and shadows). Here, the proposed method yields significantly better segmentation results than the previous approaches.

3.2.8.2 Qualitative Demonstration of the Improvements

In this section, we qualitatively demonstrate the improvements of the proposed method in terms of (i) shadow detection, (ii) foreground modeling, and (iii) textural analysis, by showing representative segmented image outputs.

Results of different shadow detection algorithms are shown in Fig. 3.11. For comparison, an illumination invariant (II) method based on [158] and a constant ratio technique (CR, based on [132]) have been implemented. By examining the output images, we can observe that the results of the different approaches are quite similar in simple environments, but the improvements of the proposed method become significant in the more complex real surveillance scenes:

• In the Laboratory sequence, the II approach yields minor segmentation errors, while both the CR and the proposed method are highly accurate.
• For the Highway video, although both II and CR remove a significant part of the observable shadow regions, our model produces a notably better result.
• In the Entrance am surveillance video, the II method cannot detect the shadowed image parts, and the foreground region is also largely noisy. The CR model also produces a weak segmentation output: due to the long shadows and inhomogeneous surfaces, the constant ratio model becomes inaccurate. On the other hand, we can confirm that the proposed model handles the above artifacts quite robustly.

The notable improvements of the proposed method versus the CR model can also be observed in another surveillance scene in Fig. 3.14 (2nd and 5th rows).

Regarding foreground modeling, a fundamentally novel approach has been presented, which needs neither high frame rate video inputs (unlike [155, 162, 191]) nor the availability of high-level object descriptors (see [200]). Earlier competing techniques [132, 192] have used the uniform calculus, expressing that the foreground may generate any color in a given scene with the same probability. As shown in Figs. 3.13 and 3.14 (3rd and 5th rows), the uniform model is often too weak an approach for accurate region separation, while the proposed technique is able to improve the results significantly.


Fig. 3.11 Shadow model validation: comparison of different shadow models in three video sequences (from top: Laboratory, Highway, Entrance am). Col. 1: video image; Col. 2: C1C2C3 space-based illumination invariants [158]; Col. 3: constant ratio model by [132] (without object-based post-processing); Col. 4: proposed model

We continue with the evaluation of microstructure analysis. The microstructural component of the feature vector can enhance the segmentation output in textured background or foreground regions. To demonstrate the efficiency of this additional feature component, Fig. 3.12 shows an example with a synthetic image. Let us consider Fig. 3.12a as a frame of a video sequence where the central bright rectangle represents the foreground (image (v) shows a zoomed part of it). The background is divided into four equal rectangular regions, each with a particular texture, which are magnified in images (i)–(iv). To simulate real-world conditions, the observed pixel values are also affected by Gaussian white noise. At the bottom, results of background subtraction are shown with three different parameter settings. In the left image (subfigure (b)), the used feature vector only consists of the gray value of the pixel. In the middle (image (c)), the feature vector is completed with the responses of horizontal and vertical edge detector kernels (similar to [191]). Finally, in the right image (subfigure (d)), our proposed kernel set of Fig. 3.4 is used with adaptive kernel selection, which provides the best results. For validation of the texture components on real-world data, some examples can be found in Fig. 3.14: the 4th and 5th rows show the segmentation results without and with the textural components; improvements are observable in the fine details, especially near the legs of the people in the magnified regions.

Fig. 3.12 Synthetic example to demonstrate the benefits of the microstructural features. a Input frame, (i–v) enlarged parts of the input, b–d results of foreground detection based on: b gray levels, c gray levels with vertical and horizontal edge features [191], d proposed model with adaptive kernel selection

3.2.8.3 Numerical Evaluation

Quantitative evaluation has been performed using manually generated ground truth foreground masks. Since the goal is to ensure accurate foreground extraction, confusing shadow and background pixels is not counted as an error. Let us denote the number of correctly detected foreground pixels of the test frames by TP (true positives). Similarly, we introduce FP for misclassified background points and FN for misclassified foreground points.


Fig. 3.13 Foreground model validation: segmentation results on the Highway sequence. Row 1: video image; Row 2: results by the uniform foreground model; Row 3: results by the proposed model

The evaluation metric is composed of the Recall rate (Rc) and the Precision (Pr) of the detection:

$$Rc = \frac{TP}{TP + FN}, \qquad Pr = \frac{TP}{TP + FP}. \tag{3.42}$$

To combine Recall and Precision in a single efficiency measure, we also use the F-score [154], which is obtained as the harmonic mean of Rc and Pr:

$$F\text{-}score = \frac{2 \cdot Rc \cdot Pr}{Rc + Pr}. \tag{3.43}$$

Note that while Rc and Pr characterize a given algorithm only together, e.g. by plotting Pr–Rc curves to visualize the effects of different parameter settings, the F-score is in itself an efficient scalar-valued evaluation metric.

For quantitative validation, we used altogether 861 annotated images chosen from the Laboratory, Highway, Entrance am, Entrance noon and Entrance pm video sequences. Further information about the test sets is provided in Table 3.1. As for the competitor methods used in the verification procedure, we focus on state-of-the-art MRF models, since the advantages of using Markov Random Fields over morphology-based approaches were examined previously [191]. The evaluation of the improvements is done by experimentally comparing our new model elements one by one to similar recent solutions from the literature. For evaluating our shadow detection approach, we use the CR model as the reference technique, and we compare the foreground model again to the straightforward uniform calculus.
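As a simple numerical illustration of Eqs. (3.42)–(3.43), the metrics can be computed directly from the pixel counts; the numbers in the example below are made up for illustration only.

```python
def detection_scores(tp, fp, fn):
    """Recall, Precision and F-score of Eqs. (3.42)-(3.43) from pixel-level counts."""
    rc = tp / (tp + fn)
    pr = tp / (tp + fp)
    return rc, pr, 2 * rc * pr / (rc + pr)

# e.g. 9000 correctly detected foreground pixels, 1500 false alarms, 1000 misses:
rc, pr, f = detection_scores(9000, 1500, 1000)   # Rc = 0.90, Pr ~ 0.857, F-score ~ 0.878
```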


Fig. 3.14 Validation of all improvements in the segmentation of the Entrance pm video sequence. Row 1: video frames; Row 2: ground truth; Row 3: segmentation with the constant ratio shadow model [132]; Row 4: our shadow model with uniform foreground calculus [192]; Row 5: the proposed model without microstructural features; Row 6: segmentation results with our final model

We provide in Fig. 3.15 extensive comparison results between the performance of our shadow and foreground models and the reference methods. The presented results confirm that our probabilistic shadow modeling approach significantly improves the precision rate, since it largely decreases the number of false negative shadow pixels. Exploiting the advantages of the new foreground model, the recall rate increases due to the detection of several foreground parts with colors similar to the local background or shadow. If we ignore both improvements, both Rc and Pr decrease. Figure 3.15c shows that regarding the F-score the proposed model outperforms the former ones in all cases.

Table 3.1 Overview of the evaluation parameters of the five sequences

Video           Frames*   fre**      Duration (min)***
Laboratory      205       2–4 fre†   1:28
Entrance am     160       2 fre      1:20
Entrance pm     75        1 fre      1:15
Entrance noon   251       1 fre      4:21
Highway         170       5–8 fre†   0:29

Notes: * Number of frames in the ground truth set. ** Frame rate of evaluation (fre): number of frames with ground truth within one second of the video. *** Length of the evaluated video part. † fre was higher in crowded scenarios

3.2.9 Summary and Applications of Foreground Segmentation

This section has introduced a Bayesian model for foreground segmentation in video sequences recorded by static cameras, without restrictions on scene properties or image quality. The frame rate of the source videos may also be low or unstable, and the method is able to adapt to changes in the lighting conditions. The section contributed to the state-of-the-art in three main issues: (1) an accurate, adaptive shadow model has been introduced; (2) a novel description has been developed for the foreground based on spatial statistics of the neighboring pixel values; (3) it has been shown how different microstructure responses can be used in the proposed framework as additional feature components improving the results. To demonstrate the practical need for and usability of efficient foreground detection, we introduce in the next section a Marked Point Process-based 3D person localization and height estimation technique, which expects as input high-quality foreground masks extracted in parallel from multiple optical cameras.

3.3 People Localization in Multi-camera Systems

Person localization is a crucial step in people surveillance applications, since it is an important precursor of tracking and activity analysis. At each time frame, the 3D ground positions of the observed pedestrians should be automatically extracted in the world coordinate system. A possible approach to the problem is using multi-camera systems, which are able to monitor the scene from multiple viewpoints simultaneously, providing the advantage that people partially occluded from certain viewpoints might be clearly observable from other ones.


Fig. 3.15 Comparing the proposed model (red columns) to previous approaches. The total gain due to the introduced improvements can be obtained by comparing the corresponding CRS+UF and SS+SF columns: regarding the F-score, the benefit is more than 12% for three out of the five sequences, and 3–5% for the remaining two

Here, people detection and localization require 3D information retrieval from multi-view inputs, with efficient approximation strategies for the missing information resulting from camera noise, artifacts of image matching (especially in featureless regions) and occlusion [4].

In this section, we introduce a Bayesian approach to multiple people localization in multi-camera systems [178]. First, pixel-level features are extracted, which are based on physical properties of the 2D image formation process and provide information about the head and leg positions of the pedestrians, distinguishing standing and walking people, respectively. Then, features from the multiple camera views are fused, leading to a joint descriptor which can efficiently evaluate a person hypothesis described by a given ground plane location and height. Based on this descriptor, we formulate person detection as an inverse problem in the 3D world coordinate system. We create a 3D object configuration model, which also utilizes prior geometrical constraints describing the possible interactions between two pedestrians. To approximate the positions of the people, we use a population of 3D cylinder objects, which is realized by a 3D Marked Point Process (3DMPP). The final configuration results are obtained by an iterative stochastic energy optimization algorithm. The proposed approach is evaluated on two publicly available datasets, and compared to a recent state-of-the-art technique. To obtain relevant quantitative test results, a 3D Ground Truth annotation of the real pedestrian locations has been prepared, while two different error metrics and various parameter settings are proposed and evaluated, showing the advantages of the proposed model.

3.3.1 A New Approach on Multi-view People Localization

The input of the proposed multi-view person localization method consists of the foreground masks extracted from multiple calibrated camera views monitoring the same scene, using the approach introduced in Sect. 3.2. The main idea of our method is to project the extracted foreground pixels both onto the ground plane and onto the horizontal plane shifted to the height of the person (see Fig. 3.16). This projection [175] creates a distinct visual feature, observable from a virtual bird's-eye viewpoint above the ground plane. However, the person's height is unknown a priori, and the heights of different people in the scene may also differ. Therefore, we project the silhouette masks onto multiple parallel planes at heights in the range of typical human height. In crowded scenes, the overlap rate between person silhouettes in the individual foreground masks is usually high, which could corrupt our hypothesis. We solve this problem by fusing the projected results of multiple camera views on the same planes. Finally, we search for the optimal configuration through stochastic optimization using the extracted features and geometrical constraints.

Fig. 3.16 Side view sketch of a person’s silhouette projected to the ground plane (blue) and to the horizontal plane intersecting the top of the head (red)


3.3.2 Silhouette-Based Feature Extraction

Let us denote by P0 the ground plane, and by Pz the parallel plane above P0 at elevation z. In the first step of the proposed method, we project the detected silhouettes onto P0 and onto different Pz planes (with different z > 0 offsets), using the projection model of the calibrated cameras. Consider the person with height h presented in Fig. 3.16, where we projected the silhouette onto the P0 ground plane (marked with blue) and onto the Pz plane at the height of the person (i.e. z = h, marked with red). Also consider the vertical axis v of the person, which is perpendicular to the P0 plane. We can observe that, seen from this axis, the silhouette points projected to the Pz|z=h plane lie in the direction of the camera, while the silhouette print on P0 is on the opposite side of v. For a more precise investigation, in Fig. 3.17 the scene is visualized from a viewpoint above Pz, looking down perpendicularly to the ground. Here, the silhouette prints from Pz and P0 are projected to a common x–y plane and are jointly shown in red and blue, respectively (overlapping areas are purple). We can observe in Fig. 3.17a that if the height estimation is correct (z = h), the two prints just touch each other at the point p = (x, y), which corresponds to the ground position of the person. However, if the height of Pz is underestimated (i.e. z < h), the two silhouette prints overlap, as shown in Fig. 3.17b. When the height is overestimated (i.e. z > h), the silhouettes move apart, see Fig. 3.17c.

Next, we derive a fitness function which evaluates the hypothesis of a proposed scene object with ground position p = (x, y) and height h, using the multiple camera information. As shown in Fig. 3.17d, for a given camera projection we can define for each position candidate p a head search region (HSR(p)) and a leg search region (LSR(p)), denoted by green circle sectors.

Fig. 3.17 Feature definition

If both the p ground position and the h height estimate are accurate, we expect several Ph-based (red) silhouette points (Sil(h)) in HSR but not in LSR. Regarding the P0-based (blue) silhouette points (Sil(0)), our expectation is the opposite: low coverage in HSR and high coverage in LSR. This observation leads to the following fitness features for the ith camera view:

$$f^i_{hd}(\mathbf{p}, h) = \frac{\#\big(\mathrm{Sil}_i(h) \cap \mathrm{HSR}_i(\mathbf{p})\big) - \#\big(\mathrm{Sil}_i(h) \cap \mathrm{LSR}_i(\mathbf{p})\big)}{\#\,\mathrm{HSR}_i(\mathbf{p})}, \tag{3.44}$$

$$f^i_{lg}(\mathbf{p}) = \frac{\#\big(\mathrm{Sil}_i(0) \cap \mathrm{LSR}_i(\mathbf{p})\big) - \#\big(\mathrm{Sil}_i(0) \cap \mathrm{HSR}_i(\mathbf{p})\big)}{\#\,\mathrm{LSR}_i(\mathbf{p})}. \tag{3.45}$$

If the object defined by the [p, h] parameters is completely visible to the ith camera, both the f^i_hd(p, h) and f^i_lg(p) features should have high values. However, in the available views, some of the legs or heads may be partially or completely occluded by other pedestrians or static scene objects, which can strongly corrupt the feature values. Although the descriptors may be weak in the individual cameras, we can construct a stronger feature by averaging the responses of the N available cameras, i.e.

$$\bar{f}_{hd}(\mathbf{p}, h) = \frac{1}{N} \sum_{i=1}^{N} f^i_{hd}(\mathbf{p}, h), \qquad \bar{f}_{lg}(\mathbf{p}) = \frac{1}{N} \sum_{i=1}^{N} f^i_{lg}(\mathbf{p}). \tag{3.46}$$

Finally, the joint data feature f(p, h) is derived as

$$f(\mathbf{p}, h) = \sqrt{\bar{f}_{hd}(\mathbf{p}, h) \cdot \bar{f}_{lg}(\mathbf{p})}. \tag{3.47}$$
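A compact sketch of how Eqs. (3.44)–(3.47) can be evaluated is given below. The silhouette prints and search regions are assumed to be available as boolean masks over the discretized ground-plane grid (their computation via the camera models is not shown), and the clamping before the square root is an assumption to keep the result defined for negative responses.

```python
import numpy as np

def camera_fitness(sil_h, sil_0, hsr, lsr):
    """Per-camera head/leg features of one (p, h) hypothesis, Eqs. (3.44)-(3.45)."""
    f_hd = ((sil_h & hsr).sum() - (sil_h & lsr).sum()) / max(hsr.sum(), 1)
    f_lg = ((sil_0 & lsr).sum() - (sil_0 & hsr).sum()) / max(lsr.sum(), 1)
    return f_hd, f_lg

def joint_feature(per_camera_features):
    """Average the per-camera responses and combine them, Eqs. (3.46)-(3.47)."""
    f_hd_bar = np.mean([f[0] for f in per_camera_features])
    f_lg_bar = np.mean([f[1] for f in per_camera_features])
    return np.sqrt(max(f_hd_bar * f_lg_bar, 0.0))
```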

3.3.3 3D Marked Point Process Model

The f(p, h) feature introduced in the previous section evaluates the hypothesis that a person of height h stands at the ground position p, based on the multiple camera measurements. Our goal is to recognize a configuration of an unknown number of people in the scene, where each person is characterized by the (x, y, h) parameter triplet with p = (x, y). Since this problem can be formulated as a population extraction task, we have embedded the features into a Marked Point Process (MPP) model, introduced in Sect. 2.3.1 of Chap. 2, where a 3D cylinder models a given person (see Fig. 3.18a). For simplicity, we use cylinders with a fixed radius R corresponding to the minimal expected half-distance between two ground positions. We monitor a rectangular Region of Interest (RoI) on P0 discretized into SW × SH locations corresponding to a regular grid, and we also round the person heights to integers measured in cm. Therefore, the object space H can be obtained as: H = [1, . . . , SW] × [1, . . . , SH] × [hmin, . . . , hmax].


Fig. 3.18 Cylinder objects modeling people in the 3D scene coordinate system. Their ground plane position and height will be estimated. Intersection of cylinders in the 3D space is used as geometrical constraint in the object model

The remaining part of the MPP model construction follows the description in Sect. 2.3.1. We need to extract a population of an unknown number of objects ω = {u1, . . . , un}, where an object u is described by the [p, h] parameters. Next, we define a global energy function for the whole population:

$$F(\omega) = \sum_{u \in \omega} A(u) + \gamma \cdot \sum_{\substack{u,v \in \omega \\ u \sim v}} I(u, v). \tag{3.48}$$

The A(u) data terms are derived from the f(p, h)|p=(x,y) feature values using a monotonically decreasing nonlinear function [180], which maps the feature domain to [−1, 1]. (A possible nonlinear mapping function M(.) is defined later in Sect. 5.2.2.2 of Chap. 5 by Eq. (5.2)). The I(u, v) interaction terms (weighted by a constant γ) prescribe a non-overlapping constraint between neighboring cylinders, as demonstrated in Fig. 3.18b:

$$I(u, v) = \frac{\mathrm{Volume}(u \cap v)}{\mathrm{Volume}(u \cup v)}. \tag{3.49}$$

Finally, the Multiple Birth and Death optimization technique (introduced in Sect. 2.3.3) is utilized to obtain the targeted population by minimizing F(ω).
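The configuration energy of Eqs. (3.48)–(3.49) can be sketched as follows. The closed-form disc intersection used for the cylinder overlap and the simple all-pairs loop are assumptions made for the example, and the Multiple Birth and Death optimizer itself is not reproduced; data_term is assumed to map an object to its A(u) value.

```python
import math

def cylinder_overlap(u, v, R):
    """I(u, v) of Eq. (3.49) for two cylinders u = (x, y, h) of common radius R."""
    d = math.hypot(u[0] - v[0], u[1] - v[1])
    if d >= 2 * R:
        return 0.0                                   # non-neighboring objects: no interaction
    # intersection area of two discs of equal radius R at center distance d
    a_int = 2 * R * R * math.acos(d / (2 * R)) - 0.5 * d * math.sqrt(4 * R * R - d * d)
    vol_int = a_int * min(u[2], v[2])                # both cylinders stand on the ground plane
    vol_union = math.pi * R * R * (u[2] + v[2]) - vol_int
    return vol_int / vol_union

def config_energy(objects, data_term, R, gamma=1.0):
    """F(omega) = sum_u A(u) + gamma * sum_{u~v} I(u, v), Eq. (3.48)."""
    energy = sum(data_term(u) for u in objects)
    for i, u in enumerate(objects):
        for v in objects[i + 1:]:
            energy += gamma * cylinder_overlap(u, v, R)
    return energy
```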

3.3.4 Evaluation of Multi-camera People Localization

We have compared our approach to the Probabilistic Occupancy Map (POM) technique [59], which has been a state-of-the-art method with similar purposes.¹

¹ An executable application of the POM reference technique is freely available at http://cvlab.epfl.ch/software/pom/.


For the evaluation of the two methods, we used two public sequences. First, from the PETS 2009 dataset [143] we selected the City center images, containing approximately 1 minute of recordings (400 frames in total) in an outdoor environment. From the available views we selected the cameras with large fields of view (View_001, View_002, and View_003), and we used a RoI of size 12.2 m × 14.9 m, which is visible from all three cameras. The maximum number of pedestrians simultaneously inside the RoI is 8. The second dataset used in our experiments is the EPFL Terrace dataset, which is 3 min and 20 s long (5000 frames in total). The scene is semi-outdoor, since it was recorded in a controlled outdoor environment, and it lacks some important properties of a typical outdoor scene (e.g. no background motion caused by moving vegetation is present, and no static background objects occlude the scene). We selected three cameras having small fields of view, and defined the RoI as a 5.3 m × 5.0 m rectangle. The scene is severely cluttered in some periods.

For numerical evaluation, we created complete ground position annotations for both the City center and Terrace multi-camera sequences, using a newly developed 3D Ground Truth annotation tool introduced in [179]. For the City center sequence we annotated all 400 frames, while the Terrace sequence has been annotated with 1 Hz frequency, resulting in 200 annotated frames. We considered various error rates: Missed Detections (MD) counts the cases where no detection could be assigned to a Ground Truth target. The False Detections (FD) value corresponds to cases where no Ground Truth position could be assigned to a detected sample. Multiple Instances (MI) measures the number of cases where multiple detections were assigned to a single Ground Truth position. Finally, the Total Error (TE) is taken as TE = MD + FD + MI. After counting all the false localization results (MD, FD, MI) on all annotated frames, we express them as ratios of the number of all objects, and we denote these ratios by MDR, FDR, MIR, and TER. Note that while MDR ≤ 1 and MIR ≤ 1 always hold, in case of many false alarms FDR (and thus also TER) may exceed 1. For accuracy evaluation of the position estimation, we also measured the distance between the ground positions in the Ground Truth annotation and in the detection results, yielding the Ground Position Error (GPE) metric.

Detection examples by the proposed model in two sample frames of the City center sequence are displayed in Fig. 3.19. We can numerically compare POM to the proposed 3DMPP method in Table 3.2, considering both test sequences and the GPE error metrics. Here, in all cases the parameters have been set to minimize TER, while the corresponding FDR, MDR and MIR values are also listed. The results confirm the superiority of the proposed 3DMPP model over POM. A detailed study on the parameter sensitivity of the proposed model has also been provided in [180].
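For illustration, a per-frame computation of the MD, FD and MI counts may look like the sketch below. The greedy nearest-neighbour assignment and the 0.5 m matching threshold are assumptions made here; the actual evaluation protocol of [179, 180] may differ in its details.

```python
import numpy as np

def frame_errors(gt, det, max_dist=0.5):
    """gt, det: lists of (x, y) ground positions (in metres) for one annotated frame."""
    gt = np.asarray(gt, dtype=float).reshape(-1, 2)
    det = np.asarray(det, dtype=float).reshape(-1, 2)
    matches = np.zeros(len(gt), dtype=int)      # detections assigned to each GT target
    unmatched_det = 0
    for d in det:
        if len(gt) == 0:
            unmatched_det += 1
            continue
        dists = np.linalg.norm(gt - d, axis=1)
        i = int(np.argmin(dists))
        if dists[i] <= max_dist:
            matches[i] += 1                     # greedy nearest-neighbour assignment
        else:
            unmatched_det += 1
    md = int(np.sum(matches == 0))                 # GT targets without any detection
    fd = unmatched_det                             # detections without a GT target
    mi = int(np.sum(np.maximum(matches - 1, 0)))   # surplus detections on one target
    return md, fd, mi   # TE = md + fd + mi; the rates follow by dividing totals by the object count
```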


Fig. 3.19 Detection examples by the proposed 3DMPP model in the City center sequence with multiple pedestrians and occlusions, projected to one of the camera images (note: as discussed, the detection is based on multiple camera views)

Table 3.2 Comparison of the POM and the proposed 3DMPP models with optimized parameter sets (so that the total error rate TER is minimized); all three cameras are used

Sequence      Method         TER     FDR     MDR     MIR
City center   POM            0.252   0.179   0.073   0.000
City center   Prop. 3DMPP    0.122   0.020   0.096   0.006
Terrace       POM            0.686   0.354   0.331   0.001
Terrace       Prop. 3DMPP    0.131   0.043   0.083   0.005

3.3.5 Applications and Alternative Ways of 3D Person Localization

This section presented a novel method to localize people with multiple calibrated cameras. For this task, novel pixel-level features have been extracted based on the physical properties of the 2D image formation, which produce a high response (evidence) at the real position and height of a person. To get a robust tool for cluttered scenes with a high occlusion rate, the introduced approach fuses evidence from multi-plane projections of each camera. Finally, the positions and heights are estimated by a constrained optimization process based on the Multiple Birth and Death Dynamics. For evaluation, images of public semi-outdoor and outdoor datasets have been used. According to the presented experiments, the proposed method produces accurate estimates even in a cluttered environment, where partial or even complete occlusion is often present. The output of the proposed 3DMPP method has been integrated into a tracking system, which is able to eliminate many false detections [4]. Another possible improvement might be the use of a robust body part detector (e.g. [196]) for creating evidence, a modification that can be easily integrated into the proposed algorithm.

Let us observe that the main challenge of the proposed multi-camera-based technique is that the people's 3D position and height parameters must be reconstructed from 2D images. A possible way to facilitate the problem is using range sensors such as Lidars, which directly measure the distance of the moving objects from the sensor. However, as discussed in the next section, when using Lidar technology, besides exploiting its clear advantages one should also solve a number of additional issues originating from the physics and mechanics of the measurement process.

3.4 Foreground Extraction in Lidar Point Cloud Sequences

A Rotating Multi-beam (RMB) Lidar sensor provides a time sequence of 3D point clouds capturing a 360° FoV of the scene. For efficient data processing, the 3D RMB Lidar points are often projected onto a cylinder-shaped range image [96, 97], as shown in Fig. 3.20. However, this mapping is usually ambiguous: several laser beams with slight orientation differences are assigned to the same pixel, although they may return from different surfaces; as a consequence, a given pixel of the range image may represent different background objects at consecutive time steps.

In this section, we introduce a hybrid 2D–3D approach [12, 22] for dense foreground-background segmentation of RMB Lidar point cloud sequences obtained from a fixed sensor position (Figs. 3.20 and 3.21). Our technique solves the computationally critical spatial filtering steps in the 2D range image domain with an MRF model, while the ambiguities of discretization are handled by the joint consideration of the true 3D positions and the back projection of the 2D labels. By developing a spatial foreground model, we significantly decrease the spurious effects of irrelevant background motion, which is principally caused by moving tree crowns and bushes.

Fig. 3.20 Point cloud recording and range image formation with a Velodyne HDL-64E RMB Lidar sensor


Fig. 3.21 Foreground segmentation in a range image part with three different methods

3.4.1 Problem Formulation and Data Mapping

Assume that the RMB Lidar system contains R vertically aligned sensors, and rotates around a fixed axis with a possibly varying speed.² The output of the Lidar within a time frame t is a point cloud of lt = R · ct points: Lt = {p1t, . . . , pltt}. Here ct is the number of point columns obtained at t, where a given column contains R concurrent measurements of the R sensors, thus ct depends on the rotation speed. Each point p ∈ Lt is associated with sensor distance d(p) ∈ [0, Dmax], pitch index ϑ̂(p) ∈ {1, . . . , R} and yaw angle ϕ(p) ∈ [0, 360°] parameters. d(p) and ϑ̂(p) are directly obtained from the Lidar's data flow, by taking the measured distance and sensor index values corresponding to p. The yaw angle ϕ(p) is calculated from the Euclidean coordinates of p projected to the ground plane, since the R sensors have different horizontal view angles, and the angle correction of the calibration may also be significant [135].

For efficient data manipulation, we also introduce a range image mapping of the obtained 3D data. We project the point cloud to a cylinder, whose central basis point is the ground position of the RMB Lidar and whose axis is perpendicular to the ground plane. Note that, slightly differently from [97], this mapping is also efficiently suited to configurations where the Lidar axis is tilted to increase the vertical Field of View. Then we stretch an SH × SW sized 2D pixel lattice S on the cylinder surface, whose height SH is equal to the sensor number R, and whose width SW determines the fineness of discretization of the yaw angle. Let us denote by s a given pixel of S, with [ys, xs] coordinates. Finally, we define the point mapping operator P : Lt → S, so that ys is equal to the pitch index of the point and xs is set by dividing the [0, 360°] domain of the yaw angle into SW bins:

$$s = \mathcal{P}(p) \ \stackrel{\mathrm{def}}{\iff} \ y_s = \hat{\vartheta}(p), \quad x_s = \mathrm{round}\Big(\varphi(p) \cdot \frac{S_W}{360°}\Big). \tag{3.50}$$

The goal of the foreground detector module is, at a given time frame t, to assign each point p ∈ Lt a label ς(p) ∈ {fg, bg} corresponding to the moving object (i.e. foreground, fg) or background (bg) classes, respectively.

² The speed of rotation can often be controlled by software, but even in the case of a constant control signal, we must expect minor fluctuations in the measured angular velocity, which may result in a different number of points for different 360° scans over time.
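A minimal sketch of the mapping of Eq. (3.50) is given below; 0-based pitch indices, degree-valued yaw angles and the keep-the-closest-point rule (used later in Sect. 3.4.3) are assumptions made for this example.

```python
import numpy as np

def project_to_range_image(pitch_idx, yaw_deg, dist, R, S_W):
    """Build an R x S_W range image from one Lidar scan, keeping the closest point per pixel."""
    ys = np.asarray(pitch_idx, dtype=int)                                # y_s: pitch (ring) index
    xs = np.round(np.asarray(yaw_deg) * S_W / 360.0).astype(int) % S_W   # x_s: discretized yaw
    rng = np.full((R, S_W), np.inf)
    for y, x, d in zip(ys, xs, dist):
        if d < rng[y, x]:
            rng[y, x] = d                      # distance of the closest projected point
    rng[np.isinf(rng)] = np.nan                # undefined pixels (no point projected there)
    return rng
```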


3.4.2 Background Model

The background modeling step assigns a fitness term fbg(p) to each point p ∈ Lt of the cloud, which evaluates the hypothesis that p belongs to the background. The process starts with a cylinder mapping of the points based on Eq. (3.50), where we use an R × SWbg pixel lattice Sbg (R is the sensor number). For each cell s of Sbg, we maintain a Mixture of Gaussians (MoG) approximation of the d(p) distance histogram of the points p projected to s. Following the approach of [170], we use a fixed number K of components (here K = 5) with weight κsi, mean μsi and standard deviation σsi parameters, i = 1 . . . K. Then we sort the weights in decreasing order, and determine the minimal integer ks which satisfies

$$\sum_{i=1}^{k_s} \kappa_s^i > T_{bg},$$

where we used Tbg = 0.89. We consider the components with the ks largest weights as the background components. Thereafter, denoting by η(·) a Gaussian density function, and by Pbg the projection transform onto Sbg, the fbg(p) background evidence term is obtained as

$$f_{bg}(p) = \sum_{i=1}^{k_s} \kappa_s^i \cdot \eta\big(d(p), \mu_s^i, \sigma_s^i\big), \quad \text{where } s = \mathcal{P}_{bg}(p). \tag{3.51}$$

The Gaussian mixture parameters are set and updated based on [170], while we used an SWbg = 2000 angle resolution, which provided the most efficient detection rates in our experiments. By thresholding fbg(p), we can get a dense foreground/background labeling of the point cloud [96, 170] (referred to later as the Basic MoG method), but as shown in Fig. 3.21b, this classification is notably noisy in scenarios recorded in large outdoor scenes.
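The per-point background evidence of Eq. (3.51) can be sketched as below; the on-line parameter update of [170] that maintains the per-cell mixtures is assumed to run separately and is not reproduced.

```python
import numpy as np

def background_evidence(d, kappa, mu, sigma, T_bg=0.89):
    """f_bg(p) for a point with distance d, given the K mixture components of its cell s."""
    kappa, mu, sigma = map(np.asarray, (kappa, mu, sigma))
    order = np.argsort(kappa)[::-1]                       # sort weights in decreasing order
    k_s = int(np.searchsorted(np.cumsum(kappa[order]), T_bg) + 1)
    f_bg = 0.0
    for i in order[:k_s]:                                 # the k_s "background" components
        f_bg += kappa[i] * np.exp(-0.5 * ((d - mu[i]) / sigma[i]) ** 2) \
                / (sigma[i] * np.sqrt(2.0 * np.pi))
    return f_bg
```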

3.4.3 DMRF Approach on Foreground Segmentation

In this section, we propose a Dynamic Markov Random Field (DMRF) model to obtain a smooth and observation-consistent segmentation of the point cloud sequence (Fig. 3.22). Since MRF optimization in 3D is computationally expensive, we define the DMRF model in the range image space, and the 2D image segmentation is followed by a point classification step to handle the ambiguities of the mapping. As defined by Eq. (3.50), we use a cylinder projection transform P to obtain the range image, with a grid width SW = ĉ < SWbg,


Fig. 3.22 Components of the dynamic MRF model. a Structure of the multi-layer MRF model b Demonstrating the different local range value distributions in the neighborhood of a given foreground and background pixel, respectively, c Plot of the used sigmoid function

where ĉ denotes the expected number of point columns of the point sequence in a time frame. Assuming that the rotation speed fluctuates only slightly, this selected resolution provides a dense range image, where the average number of points projected to a given pixel is around 1. Let us denote by Ps ⊂ Lt the set of points projected to pixel s. For a given direction, foreground points are expected to be closer to the sensor than the estimated mean background range value. Thus, for each pixel s we select the closest projected point,

$$p_s^t = \mathop{\arg\min}_{p \in P_s} d(p),$$

and assign to pixel s of the range image the distance value d_s^t = d(p_s^t). For undefined pixels (Ps = ∅), we interpolate the distance from the neighborhood. For spatial filtering, we use an eight-neighborhood system in S, and denote by Ns ⊂ S the neighbors of s. Next, we assign to each s ∈ S foreground and background energy (i.e. negative fitness) terms, which describe the class memberships based on the observed distance values. The background energies are directly derived from the parametric MoG probabilities using Eq. (3.51):

$$\varepsilon_{bg}^t(s) = -\log\big(f_{bg}(p_s^t)\big).$$

For the description of the foreground, using a constant ε_fg could be a straightforward choice [191] (we call this approach uniMRF), but this uniform model results in several false alarms due to background motion and quantization artifacts. Instead of temporal statistics, we use spatial distance similarity information to overcome this problem, based on the following assumption: whenever s is a foreground pixel, we should find foreground pixels with similar range values in its neighborhood (Fig. 3.22b). For this reason, we use a non-parametric kernel density model for the foreground:

$$\varepsilon_{fg}^t(s) = \sum_{r \in N_s} \zeta\big(\varepsilon_{bg}^t(r), \tau_{fg}, m^*\big) \cdot k\left(\frac{d_s^t - d_r^t}{h}\right),$$


where h is the kernel bandwidth and ζ : R → [0, 1] is a sigmoid function (see Fig. 3.22c):

$$\zeta(x, \tau, m) = \frac{1}{1 + \exp\big(-m \cdot (x - \tau)\big)}. \tag{3.52}$$

We use here a uniform kernel: k(x) = 1{|x| ≤ 1}, where 1{.} ∈ {0, 1} is the binary indicator function of a given event.

To formally define the range image segmentation task, we assign to each pixel s ∈ S a class label ς_s^t ∈ {fg, bg}, so that we aim to minimize the following energy function:

$$E = \sum_{s \in S} V_D\big(d_s^t \mid \varsigma_s^t\big) + \sum_{s \in S} \sum_{r \in N_s} \underbrace{\alpha \cdot \mathbf{1}\{\varsigma_s^t \neq \varsigma_r^{t-1}\}}_{\xi_s^t} + \sum_{s \in S} \sum_{r \in N_s} \underbrace{\beta \cdot \mathbf{1}\{\varsigma_s^t \neq \varsigma_r^t\}}_{\chi_s^t}, \tag{3.53}$$

where V_D(d_s^t | ς_s^t) denotes the data term, while ξ_s^t and χ_s^t are the temporal and spatial smoothness terms, respectively, with constants α > 0 and β > 0. Let us observe that although the model is dynamic due to the dependencies between different time frames (see the ξ_s^t term), to enable real-time operation we develop a causal system, i.e. labels from the past are not updated based on labels from the future. The data terms are derived from the data energies by sigmoid mapping:

$$V_D\big(d_s^t \mid \varsigma_s^t = \mathrm{bg}\big) = \zeta\big(\varepsilon_{bg}^t(s), \tau_{bg}, m_{bg}\big),$$

$$V_D\big(d_s^t \mid \varsigma_s^t = \mathrm{fg}\big) = \begin{cases} 1, & \text{if } d_s^t > \max_{i=1\ldots k_s} \mu_s^{i,t} + d_0 \\ \zeta\big(\varepsilon_{fg}^t(s), \tau_{fg}, m_{fg}\big), & \text{otherwise.} \end{cases}$$

The sigmoid parameters τ_fg, τ_bg, m_fg, m_bg and m* can be estimated by Maximum Likelihood strategies based on a few manually annotated training images. As for the smoothing factors, we use α = 0.2 and β = 1.0 (i.e. the spatial constraint is much stronger), while the kernel bandwidth is set to h = 30 cm. The MRF energy of Eq. (3.53) is minimized via the fast graph-cut-based optimization algorithm [40].

The result of the DMRF optimization is a binary foreground mask on the discrete lattice S. As shown in Fig. 3.23, the final step of the method is the classification of the points of the original L cloud, considering that the projection may be ambiguous, i.e. multiple points with different true class labels can be projected to the same pixel of the segmented range image. Denoting by s = P(p) the pixel of point p at time frame t, we use the following strategy:

• ς(p) = fg, iff one of the following two conditions holds:
  (†) ς_s^t = fg and d(p) < d_s^t + 2 · h
  (‡) ς_s^t = bg and ∃r ∈ Ns : {ς_r^t = fg, |d_r^t − d(p)| < h}
• ς(p) = bg otherwise.


Fig. 3.23 Backprojection of the range image labels to the point cloud. a Simple backprojection with assigning the same label to s and p, whenever s = P( p). b Result of the proposed backprojection scheme

The above constraints eliminate several false positive (†) and false negative (‡) foreground points projected to pixels of the range image near the object edges; this improvement can be seen by comparing the examples of Fig. 3.23a and b.
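A sketch of these backprojection rules is shown below; the 3×3 neighborhood indexing and the array-based representation of the segmented range image are assumptions made for the example.

```python
import numpy as np

FG, BG = 1, 0

def classify_point(d_p, y, x, labels, d_img, h=0.3):
    """Backproject the range image labels to one Lidar point mapped to pixel (y, x)."""
    if labels[y, x] == FG and d_p < d_img[y, x] + 2 * h:            # condition (dagger)
        return FG
    if labels[y, x] == BG:                                          # condition (double dagger)
        y0, y1 = max(y - 1, 0), min(y + 2, labels.shape[0])
        x0, x1 = max(x - 1, 0), min(x + 2, labels.shape[1])
        nb_fg = labels[y0:y1, x0:x1] == FG
        if np.any(nb_fg & (np.abs(d_img[y0:y1, x0:x1] - d_p) < h)):
            return FG
    return BG
```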

3.4.4 Evaluation of DMRF-Based Foreground-Background Separation

We have tested the foreground detection algorithm on various sequences of the SZTAKI-LGA database and also on a traffic monitoring scenario (see Fig. 3.24). The Traffic sequence was recorded at 5 Hz from the top of a car waiting at a traffic light in a crowded crossroad. We have compared our proposed DMRF model for foreground-background separation to three reference solutions. First, we implemented the Basic MoG approach (already introduced in Sect. 3.4.2), which is based on [96], using the on-line K-means parameter update of [170]. Second, we tested uniMRF (detailed in Sect. 3.4.3), which partially adopts the uniform foreground model of [191] for range image segmentation in the DMRF framework. Third, we also tested an MRF model in 3D, called 3D-MRF, based on [116]. In 3D-MRF we define point neighborhoods in the original Lt clouds based on Euclidean distance, and use the background fitness values of Eq. (3.51) in the data model; the graph-cut algorithm [40] is adopted again for the MRF energy optimization.


Fig. 3.24 Foreground detection results on sample time frames with the Basic MoG, uniMRF and the proposed DMRF models: foreground points are displayed in blue (dark in gray print)

Qualitative segmentation results on sample frames from three sequences are shown in Fig. 3.24, concerning the Basic MoG, uniMRF and the proposed DMRF models. For quantitative (numerical) evaluation, we manually generated Ground Truth (GT) by annotating around 100 relevant frames of each test sequence. As the quantitative evaluation metric, we have chosen the point-level F-score of the foreground detection. We have also measured the processing speed in frames per second (fps). The numerical performance analysis is given in Table 3.3. The results confirm that the proposed model surpasses the reference techniques in F-score in all surveillance sequences, while its processing speed is 15–16 fps, which enables real-time operation. In the Traffic sequence with large and dense point clouds, the 3D-MRF approach is able to slightly outperform our approach in detection rate, but the proposed DMRF method is significantly quicker: we measured 2 fps processing speed with 3D-MRF and 16 fps with the proposed DMRF model. We can also observe that, differently from 3D-MRF, our range image-based technique is less influenced by the size of the point cloud.

Table 3.3 Point-level evaluation of the foreground detection accuracy (F-score in %, based on 100 annotated frames per sequence) and of the processing speed

Sequence     Point cloud size (K pts/fr.)   Bas. MoG   uniMRF   3D-MRF   DMRF
Summer1      65                             55.7       81.0     88.1     95.1
Summer2      86                             59.2       86.9     89.7     93.2
Summer3      86                             38.4       83.3     78.7     89.0
Winter1      86                             55.0       86.6     84.1     91.9
Winter2      86                             54.9       86.6     84.1     91.9
Spring1      86                             49.9       84.8     82.7     88.9
Spring2      86                             56.8       89.1     86.9     94.4
Traffic      260                            70.4       68.3     76.2     74.0
Processing speed (fps)                      120        17–18    2–7      15–16



3.4.5 Application of the DMRF Method for Person and Activity Recognition

An application of the DMRF approach is presented in [18], where the proposed Lidar-based foreground detection algorithm is embedded into an integrated 4D (i4D) vision and visualization system, which is able to analyze and interactively display real scenarios in natural outdoor environments with walking pedestrians. The main focus of the investigations in [18] is gait-based person re-identification during tracking, and the recognition of specific activity patterns such as bending, waving, making phone calls, and checking the time on a wristwatch. The descriptors for training and recognition are observed and extracted from realistic outdoor surveillance scenarios, where multiple pedestrians walk in the field of interest following possibly intersecting trajectories, thus the observations might often be affected by occlusions or background noise. After extracting the people trajectories, a free-viewpoint video is synthesized, where moving avatar models follow the trajectories of the observed pedestrians in real time, ensuring that the leg movements of the animated avatars are synchronized with the real gait cycles observed in the Lidar stream.

In the i4D system the RMB Lidar sensor is used as a surveillance camera, which stands at a fixed position in the scene. The first step of the workflow is foreground-background separation with the DMRF algorithm; thereafter the point cloud regions classified as foreground are clustered to obtain separate blobs for each moving person candidate. The center of each extracted blob is considered as a candidate foot position on the ground. Next, person tracking is implemented using a Kalman filter-based finite-state-machine algorithm [12], which yields 2D trajectories of the tracked people on the ground plane (see Fig. 3.26a). Apart from the above-described person detection and short-term tracking functions, the i4D system implements two further key tasks: person re-identification and activity recognition. As we demonstrate in the following, both issues are solved by silhouette-based approaches, which require a very high accuracy of the DMRF-based foreground segmentation filter.

A critical issue in the surveillance of people is the assignment of broken trajectory segments during the tracking process, which are usually produced by frequent occlusions between the people in the scene, or simply by the fact that the pedestrians may temporarily leave the Field of View (FoV). People re-identification [4] requires the extraction of biometric descriptors, which in our case may be weak features, since we are focusing on a relatively small number of people (i.e. we are not trying to identify specific people from large databases).


Fig. 3.25 Silhouette projection: a a tracked person and its projection plane in the point cloud from a bird's-eye view; b top view of the projection plane, which is the tangent of the trajectory (side-view silhouettes)

In the proposed approach, biometric gait descriptors have been utilized, which proved to be robust against low resolution and partial occlusion artifacts, as they capture information from the whole body. In addition, gait can be analyzed in real-world scenarios, where the people are non-cooperative and have to be recognized during their natural behavior in real time. To capture the person silhouettes at each time frame, we project the point cloud segment of each person to the plane which intersects the actual ground position, is perpendicular to the local ground plane, and is parallel to the local tangent vector of the smoothed trajectory seen from the top view (see Fig. 3.25). The projected point cloud consists of a number of separated points in the image plane, which can be transformed into a connected 2D foreground region by morphological operations, as shown in Fig. 3.26b. Thereafter, the idea of Gait Energy Image (GEI)-based person recognition [79] is adopted to the Lidar surveillance environment, yielding a new descriptor called Lidar-based GEI (LGEI) [18]. One can exploit here a main advantage of the Lidar technology: since the laser measurement is directly available in the 3D Euclidean world coordinate system, without perspective distortion and scaling effects, the projected silhouettes may also be compared without re-scaling, unlike in optical video sequences. LGEIs are derived by averaging the normalized binary person silhouettes over N consecutive frames:

$$G(x, y) = \frac{1}{N} \sum_{t=1}^{N} B_t(x, y),$$

where B_t(x, y) ∈ {0, 1} is the (binary) projected silhouette value of pixel (x, y) at time frame t, and G(x, y) ∈ [0, 1] is the (continuous) LGEI value.

Person recognition is performed in a supervised approach. To represent a person who has not yet been registered in the database, k = 100 different LGEIs are generated, starting from k randomly selected seed frames of the recorded Lidar sequence of his/her walk. This LGEI set is considered as the training data for the given person, which is used to train a committee of a Multi-Layer Perceptron (MLP) and a convolutional neural network (CNN) [18].
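A minimal sketch of the LGEI computation is shown below; the silhouette extraction itself (projection and morphology) is assumed to be done beforehand, and the 60-frame window used in lgei_set is an illustrative assumption.

```python
import numpy as np

def lgei(silhouettes):
    """Average N consecutive binary silhouettes B_t into a continuous LGEI G."""
    return np.asarray(silhouettes, dtype=float).mean(axis=0)

def lgei_set(silhouettes, window=60, k=100, rng=np.random.default_rng(0)):
    """Generate k training LGEIs from randomly selected seed frames of one walk sequence."""
    n = len(silhouettes)
    seeds = rng.integers(0, max(n - window, 1), size=k)
    return np.stack([lgei(silhouettes[s:s + window]) for s in seeds])
```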


Fig. 3.26 LGEI generation process: a Output of the multi-pedestrian tracker for a sample Lidar frame (person point clouds+trajectories) b projected pedestrian silhouettes on the selected Lidar frame c Lidar-based Gait Energy Images extracted for the people of (b)


Fig. 3.27 a–b: a sample frame from an outdoor test sequence used for activity recognition; c–d: demonstration of the frontal projection and depth map calculation for activity recognition. The projection plane is perpendicular to the trajectory

These neural networks are utilized later in the person recognition phase, when we need to identify a newly appearing person in the scene. In the recognition step, we generate probe LGEIs for the detected subjects, and based on the stored LGEIs of the database, we decide whether a new person has entered the scene, or a previously disappeared one has returned.

Besides biometric person identification, the recognition of various actions can provide valuable information in surveillance systems. For this reason, a new algorithm has been proposed for recognizing selected, usually rarely occurring, activities in the Lidar surveillance framework, which can be used for generating automatic warnings in case of specific events, and for excluding various non-walk segments from the training/test data of the gait recognition module. Apart from normal walk, we have selected five events for recognition: bend, check watch, phone call, wave and wave two-handed (wave2). A sample outdoor frame with four people is shown in Fig. 3.27. The proposed approach for action recognition is motivated by the LGEI-based gait analysis technique; however, various key differences have been implemented here. First, while gait could be efficiently analyzed from side-view point cloud projections, the actions listed above are better observable from a frontal point of view. For this reason, we have chosen a projection plane for action recognition which is perpendicular to the local trajectory tangent, as demonstrated in Fig. 3.27c.


Fig. 3.28 ADM (left) and AXOR (right) maps for the different actions

Second, various actions such as waving or making phone calls produce characteristic local depth texture patterns (e.g. the hand moves forward while waving). Therefore, instead of deriving binarized silhouettes, we create depth maps by calculating the point distances from the projection plane according to Fig. 3.27c, a step which yields a depth image as shown in Fig. 3.27d. Then, we introduce the averaged depth map (ADM) feature as a straightforward adaptation of the LGEI concept, so that we average the depth maps over the last τ frames, where τ is a fixed time window related to the expected duration of the activities (we used τ = 40 frames uniformly). ADM sample images for each activity are shown in Fig. 3.28 (left samples). Third, while gait is considered a low-frequency periodic motion of the whole body, where we do not lose a significant amount of information by averaging the consecutive images, the above actions are aperiodic and only locally specific to given body parts. For example, waving contains sudden movements, which yield large differences in the upper body regions of the consecutive frames. Thus, apart from ADM we introduce a second feature, called the averaged XOR image (AXOR), which aims to encode information about the motion dynamics. An exclusive-OR (XOR) operation is applied to two consecutive binarized frontal silhouettes, and the AXOR map is calculated by averaging these binary XOR images and taking the squares of the average values. The AXOR map displays high values in the regions of sudden movements, as shown in Fig. 3.28 (right image of each pair), especially regarding the waving actions in images (e) and (f). For each action from the set bend, watch, phone, wave, and wave2, two separate convolutional neural networks (CNN) were trained, one for the ADM and one for the AXOR features, respectively.
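The two features can be sketched as follows; the array shapes and the binarization of the frontal silhouettes are assumptions made for the example.

```python
import numpy as np

def adm(depth_maps):
    """Averaged depth map (ADM) over the last tau frontal depth images."""
    return np.asarray(depth_maps, dtype=float).mean(axis=0)

def axor(masks):
    """Averaged XOR (AXOR) map: mean of consecutive silhouette XORs, squared."""
    m = np.asarray(masks, dtype=bool)
    xor_seq = np.logical_xor(m[1:], m[:-1])
    return np.mean(xor_seq.astype(float), axis=0) ** 2
```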


If no activity is detected, we assume that the observed person is in the walking state. If multiple CNN outputs surpass the decision threshold, we select the action with the highest confidence. The proposed gait-based biometric identification and activity recognition algorithms have been evaluated on a new database called SZTAKI-LGA [18]. The presented results confirm that, based on a high-quality MRF-based foreground mask, both tasks are efficiently achievable in the sparse point clouds of a single RMB Lidar sensor.

Finally, we demonstrate the visualization module, which takes as input the trajectories of the identified walking people, and the timestamps and locations of the recognized actions. As output, a free-viewpoint video is synthesized, where moving animated avatars follow the motions of the observed people. The moving avatars are properly detailed, textured dynamic models, which are created in a 4D reconstruction studio whose hardware and software components are described in [80].

Fig. 3.29 Sample consecutive frames from the recorded a Lidar and b video sequences, and the synthesized 4D scene with leg movements synchronized with the observation


Fig. 3.30 Demonstration of the dynamic scene reconstruction process, with gait step synchronization

The 4D person models can be placed into an arbitrary 3D background (point cloud, mesh, or textured mesh), which can be created either manually with a CAD system, or by automatic environment mapping of the Lidar measurements [19] (see Fig. 3.29). The last step of the workflow is the integration of the system components and the visualization of the integrated model. The walking pedestrian models are placed into the reconstructed environment so that the center point of the feet follows the trajectory extracted from the Lidar point cloud sequence. The temporal synchronization of the observed and animated leg movements is implemented using the gait analysis. This step requires an approximation of the gait cycles from the Lidar measurement sequence; however, the accuracy is not critical here, as the viewer only has to be visually convinced that the leg movements are correct. The cycle estimation is implemented by examining the time sequence of the 2D bounding boxes, so that the box is only fitted to the lower third segment of the silhouette. After a median filter-based noise reduction, the local maxima of the bounding box width sequence are extracted, and the gait phases between the key frames are interpolated during the animation. A summarizing figure of the complete recognition and visualization process is displayed in Fig. 3.30.


3.5 Conclusions

This chapter presented new Bayesian methods for three different problems in dynamic environment perception. First, a Markov Random Field (MRF) model has been proposed for the efficient separation of foreground, background, and cast shadow regions in videos recorded by real surveillance applications. The method assumes that the sequences have been captured by static electro-optical cameras, which may have low quality and a low or uncertain frame rate. The introduced model is able to compensate for camera noise, temporal changes in the external illumination, and the presence of reflecting scene surfaces with inhomogeneous albedo and geometry. The proposed shadow model can be used under varying illumination conditions, and stochastically models the deviations of real scenes from an ideal Lambertian environment. Based on local feature vectors extracted at the individual pixel positions, the shadow domain is represented by a global probability density function in that feature space, while a parameter re-estimation algorithm has also been introduced to adapt to the changes of the shadow's feature domain due to daily illumination changes. Test results confirm that in real scenes the accuracy of shadow detection is significantly higher than with a purely Lambertian model. In addition, a novel foreground description has been given based on spatial statistics of the nearby pixel values. It has been shown that the introduced approach enhances the detection of background- or shadow-colored object parts, even in low and/or unsteady frame rate videos. Thereafter, we have given a probabilistic model of the microstructural responses in the background and in the shadow, and completed the MRF segmentation model with microstructure analysis. The proposed adaptive kernel selection strategy considers the local background properties. We have shown via synthetic and real-world examples that the improved framework outperforms the purely color-based model, as well as methods using a single kernel.

In the second part of the chapter, we formulated 3D object detection in multi-camera systems as an inverse problem, and proposed a 3D Marked Point Process (3DMPP) of cylinders for modeling groups of multiple (possibly overlapping) pedestrians in 3D environments. Using features extracted from the different calibrated views of a multi-camera system, it has been shown that the introduced approach can be efficiently applied for the accurate 3D localization and height estimation of people, surpassing a state-of-the-art solution for the same problem. The proposed approach has been evaluated on two publicly available datasets. To obtain relevant quantitative test results, a 3D Ground Truth annotation of the real pedestrian locations has been prepared, while two different error metrics and various parameter settings were evaluated, showing the advantages of the proposed 3DMPP model.

In the third and final part of the chapter, we have focused on the utilization of the recently released Rotating Multi-beam (RMB) Lidar technology in advanced surveillance systems. While conventional optical or range sensors have a limited Field of View (FoV), RMB Lidars provide a full 360° FoV of the scene, with a vertical resolution equal to the number of sensors, while the horizontal angle resolution depends on the speed of rotation.


However, inhomogeneous point cloud density, sensor calibration noise and the consequences of the high-speed sequential scanning process introduce various artifacts for the data processing modules, demanding novel solutions in data filtering and pattern recognition. We have proposed a new Dynamic MRF (DMRF) approach for foreground segmentation in RMB Lidar measurement sequences, which can be efficiently implemented on a range image representation of the recorded point cloud data. We have also demonstrated the efficiency of the Lidar-based approach for people tracking, biometric re-identification and activity recognition.

Chapter 4

Multi-layer Label Fusion Models

Abstract In this chapter, new multi-layer Bayesian label fusion models are proposed for two different change detection problems in remotely sensed images. First, a probabilistic model is proposed for automatic change detection from airborne images captured by moving cameras. To ensure robustness, an unsupervised coarse matching is used instead of a precise image registration. The main challenge for the proposed model is to eliminate from the difference image the registration errors, noise, and the parallax artifacts caused by static objects of considerable height (buildings, trees, walls, etc.). The background membership of a given image point is described through two different features, and a novel three-layer Markov Random Field (MRF) model is introduced to ensure connected homogeneous regions in the segmented image. Second, we introduce a Bayesian approach, called the Conditional Mixed Markov model (CXM), for extracting regions of relevant changes from registered aerial image pairs taken with large time differences, possibly under different illumination and seasonal conditions. The CXM model is derived as a combination of a mixed Markov model and a conditionally independent random field of signals. The new approach fuses global intensity histograms with local block-based correlation and contrast features. A global energy optimization process is developed, which can simultaneously ensure efficient local feature selection and smooth, observation-consistent image segmentation. Experiments are shown using real aerial image sets provided by the Lechner Knowledge Center of Budapest.

4.1 Markovian Fusion Models in Computer Vision

As emphasized in Chap. 1, selecting an appropriate model structure is a critical issue for Bayesian image segmentation. Although extensive research has been conducted on image processing applications of Markov Random Fields (MRF) since Geman and Geman's paper from 1984 [71], new focus areas emerged in the 2000s and 2010s due to the evolution of imaging sensor technologies, which provided new measurement modalities and enhanced image qualities. This technological progression demanded the development of various new feature fusion approaches within the MRF framework.


Techniques following the conventional multivariate vector approach stack the individual feature components into an n-dimensional feature vector, where the feature distribution in each class is approximated by an n-dimensional multivariate density function [50, 101]. Note that we have already discussed an example following this approach in Chap. 3, where different color components and microstructural responses have been integrated into a 4D feature vector. However, this straightforward fusion strategy may cause several practical issues: although the feature vector's 1D marginal distributions can often be modeled by well-known densities (e.g. Gaussian, Beta, uniform, or a finite mixture of them), their joint distribution may be hard to express. Moreover, efficient methods for probability calculation and parameter estimation are only available for certain distributions, and modeling the correspondences between the feature components may also increase the number of free parameters (e.g. the Gaussian correlation matrix should be non-diagonal). For example, [101] fit a Gaussian mixture to the multivariate feature histogram of the training images, assuming that a given class can be well described by a finite number of prototype models, a constraint that is often not fulfilled.

For the above reasons, multi-layer models have become popular in recent decades [94, 102, 153]. In this case, individual layers are assigned to the different feature components (or to groups of components). Each layer's segmentation is directly influenced by its corresponding measurement component(s) and indirectly by the features of the other layers. The inter-layer connections may implement either data-driven interactions [102], where the inter-layer interactions simultaneously use the feature values and the segmentation labels, or label fusion [94], where the interactions only use the node labels from the different layers. Usually, the right choice between these two approaches depends on the domain which we model. We show later that regarding the problems investigated in this chapter, label fusion is the more natural model.

A classification approach is referred to as decision fusion if a pixel-by-pixel fusion step operates on the results of independent segmentation processes, which are applied earlier for the individual layers [153]. On the contrary, the integration processes of the label fusion-reaction framework [94], or the seminal multi-layer MRF model [102], consider in parallel spatial smoothing constraints over neighboring pixels and semantic rules for individual pixel labels, which are jointly optimized via an energy minimization process. Detailing the differences, in sequential fusion models [94, 153] the different features vote independently for label candidates, and the fusion process integrates these labels afterward in a separate step. On the contrary, multi-layer MRFs also weigh the label votes with their reliability terms [102]. The Fusion Markov Random Field (fMRF) segmentation model [171] is a Post-Classification Comparison (PCC) technique, which simultaneously implements an adaptive segmentation and change detection model for remote sensing image sequences. In fMRF, each layer represents a given input image. While it is also possible to use fMRF with an input of an image pair, the use of three or more images is generally preferred to enhance cluster definitions, depending on the quality of the images and the degree of similarity between the inputs. The fMRF method applies clustering on a fused image series by using the Cluster Reward Algorithm (CRA) as a cross-layer similarity measure, which step is followed by a multi-layer MRF segmentation.


The resulting label map is utilized for the automatic training of the individual layers. After independently segmenting each layer, changes are detected between the label maps of the different layers.

While various feature fusion strategies have a vast bibliography, issues of incorporating novel types of prior information and inference rules in MRFs have been less widely explored in the literature. For this reason, we accomplished research in this direction, focusing on two different pixel-level change detection problems:

• Task 1: Moving object detection in image pairs captured by moving aerial vehicles with a few seconds of time difference. The task needs an efficient combination of image registration for camera motion compensation and frame differencing (see Fig. 4.1a). Registration errors and parallax effects caused by 3D scene structures are modeled as noise components, and a statistical approach is developed to eliminate the undesired distortions from the change mask.

• Task 2: Detecting relevant changes in registered aerial images captured with time differences of several months or years. Even staying at a low-level (region-based) model, this task needs a more sophisticated approach than simple pixel value differencing, since due to seasonal changes or altered illumination, the appearance of the corresponding unchanged areas may also be significantly different. A new region-based change detection model is presented, which locally estimates the reliability of two different discriminating features between 'changed' and 'unchanged' image regions (Fig. 4.1b).

Fig. 4.1 Demonstration of the addressed object motion detection and long-term change detection problems

From a methodological point of view, we present in this chapter four main contributions. First, we construct new multi-layer label fusion model structures, which implement flexible integration of various (sub-)segmentation results, while keeping the advantages of the established MRF modeling approach. Second, using the Mixed Markovian concept, we introduce dynamic graph structures into the multi-layer framework to extend its modeling capabilities. Third, we work out efficient optimization methods for the new multi-layer models. Fourth, we give an extensive review and quantitative comparison results of multi-layer models.

In this book, we mainly focus on introducing the novel methodological issues of the discussed models, while the presentation of the application-specific model components serves to demonstrate the main motivation of the developments and to help the Reader better understand the methods. Note that further publications of the Author [27, 30] present a detailed introduction and state-of-the-art review regarding the application environments of Task 1 and Task 2, respectively, with various additional qualitative and quantitative experiments comparing the proposed models to concurrent change detection approaches.

4.2 A Label Fusion Model for Object Motion Detection

As the first example, we focus on the object motion detection problem, having two partially overlapping images which were taken by moving airborne vehicles above urban roads with a few seconds of time difference. In Chap. 3, we have considered change or motion detection as a purely 2D image segmentation problem, marking the pixels with foreground, background or (in some cases) shadow labels. In fact, classifying a pixel s as 'background' means that the 3D scene point which is projected to pixel s of the image plane corresponds to the background in the 3D environment. Using a static camera, there is no need to model the relationship between the 2D image plane and the 3D world, since in a given pixel position the same background surface point (with the same color) is permanently observable, unless it is occluded by a foreground object. On the other hand, in the case of camera motion the static 'voxels' of the scene are projected to different pixel positions in the consecutive frames (see Fig. 4.2a). Finding the corresponding pixels in the images which represent the same 3D scene points is called image registration. Although registration is one of the fundamental problems of image processing, we still find challenges in the context of the current application. Here, we demonstrate a few of them following the study in [29].

Fig. 4.2 a Illustrating the stereo problem in 3D. E_1 and E_2 are the optical centers of the cameras taking G_1 and G_2, respectively. P is a point in the 3D scene, and s and r are its projections in the image planes. b A possible arrangement of pixels r, r̃, and s; the 2D search region H_r̃; e_r is the error of the projective estimation r̃ for s

An important existing approach is based on feature correspondence, where the goal is to look for corresponding pixels or other primitives such as edges, corners, contours, and shapes in the compared images [5, 49, 133, 193, 204]. Unfortunately, these procedures may fail at occlusion boundaries and within regions where the chosen primitives or features cannot be reliably detected. Although we can find methods focusing on the reduction of errors at object boundaries caused by occlusion [89, 90], these approaches only work with slightly different image inputs, like image pairs recorded by stereo cameras. On the other hand, taking the photos from a rapidly moving airborne vehicle may cause significant global offset and rotation between the consecutive frames. As for the synthesis of wide-baseline composite views, [172] presented a motion-based method for automatic registration of images in multi-camera systems. However, the latter method needs video flows recorded by static cameras, while in the present application we have only one image in each camera position.

In summary, using the existing techniques we must expect that feature matching provides correct pixel correspondences only for sparsely distributed feature points instead of matching the two frames completely. A possible way to handle this problem is searching for a global projective transform T between the images. Thus, for a given pixel r = [r_x, r_y] of the second frame, the corresponding pixel position s = [s_x, s_y] in the first frame is approximated as s ≈ r̃ = T(r). Using that an arbitrary projective transform can be represented by a linear transform of homogeneous coordinates [82, p. 3], T can be written in matrix form:

$$[p_x,\ p_y,\ p_w]^T = \mathbf{T} \cdot [r_x,\ r_y,\ 1]^T \qquad (4.1)$$

where r̃_x = p_x/p_w, r̃_y = p_y/p_w, r̃ = [r̃_x, r̃_y], and **T** is the 3 × 3 homography matrix of transform T. Here, e_r = s − r̃ is the error of approximation at pixel r. In the following, we denote by G̃_2 the warped second image, which is obtained by applying T to G_2; thus its pixel values are g̃_2(r) = g_2(T⁻¹(r)).

The above-defined procedure is called 2D image matching [93], and two main approaches are available for the unsupervised estimation of T. Pixel correspondence-based techniques estimate the optimal coordinate transform (e.g. homography) which maps the extracted feature points of the first image to the corresponding pixels identified by the feature tracker module in the second frame [204]. In global correlation methods, the goal is to find the parameters of a similarity [152] or affine transform [124] for which the correlation between the original first and the transformed second image is maximal.
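As a minimal illustration of Eq. (4.1), the following NumPy sketch maps a single pixel of the second frame through a 3 × 3 homography; the matrix entries are arbitrary example values, not parameters estimated from our test data.

```python
import numpy as np

def warp_point(T, r):
    """Apply the homography of Eq. (4.1): map pixel r = (r_x, r_y) of the
    second frame to its projective estimate r~ in the first frame."""
    p = T @ np.array([r[0], r[1], 1.0])          # homogeneous coordinates [p_x, p_y, p_w]
    return np.array([p[0] / p[2], p[1] / p[2]])  # r~ = (p_x / p_w, p_y / p_w)

# Example with an arbitrary homography combining a small rotation and a translation
theta = np.deg2rad(3.0)
T = np.array([[np.cos(theta), -np.sin(theta), 12.0],
              [np.sin(theta),  np.cos(theta), -7.0],
              [0.0,            0.0,            1.0]])
r_tilde = warp_point(T, (100.0, 50.0))
```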


Fig. 4.3 Illustration of the parallax effect when a tall rectangular object stands on the ground plane. We mark different sections with different colors on the ground and on the object, and plot their projections on the image plane with the same colors. We can observe that the appearance of the corresponding sections is significantly different

For computational purposes, global correlation methods work in the Fourier domain. Although we find sophisticated ways to enhance the accuracy of the linear 2D mappings [111] (up to subpixel accuracy: [163]), these approaches only result in reasonable registration if the scene can be approximated by a flat surface [82, p. 8], the camera is very far from the ground plane, or the camera motion is slight [111]. Otherwise, scene points out of the dominant plane (e.g. the plane of the roadway in a street scene) cause significantly different 2D displacements than those calculated by the global projective transform. This effect is called parallax distortion (see Fig. 4.3).

To overcome this problem, plane+parallax (P+P) models have become widely used; we also follow this way in this chapter. Here, the images are registered up to a global 2D projective transform, thereafter the parallax is locally handled. As pointed out in [93], different environmental conditions and circumstances may raise essentially different challenges, thus 'P+P' methods can be further divided into subcategories. An example for sparse parallax is given in [159], which deals with very low-altitude aerial videos captured from sparsely cultural scenes, where shape constancy constraints can be used together with global motion estimation. In that case, the 3Dness of the scene is sparsely distributed, containing a few moving objects, while the algorithm needs at least three frames from a video sequence. On the other hand, for the scenarios investigated in the current application, a dense parallax method should be developed, since both 3D static objects and object motions may occur densely in the scene. Here, compared to [159], the frames are captured from a higher altitude, and the parallax distortions after 2D registration usually cause errors of a few pixels. Consequently, if s and r are the corresponding pixels in G_1 and G_2, respectively, we assume that the magnitude of the 2D estimation error, ||e_r|| = ||s − r̃||, is lower than a threshold parameter. In other words, for a given r, the corresponding pixel s should be searched in a given neighborhood of r̃, denoted by H_r̃. We will use rectangular neighborhoods with a fixed size (see Fig. 4.2b). Note that using H_r̃ is symmetric: for a given s in G_1, the corresponding pixel r̃ in the transformed image G̃_2 is in the rectangular neighborhood H_s of s.

Since the length and orientation of the parallax error vectors e_r are different at different r pixel positions, the above approach does not solve the exact pixel matching problem, which may still remain difficult. It can only be stated that s lies in the search region H_r̃ assigned to r, unless it corresponds to an object displacement. A key point in our approach is that the proposed model will not aim at finding the corresponding point pairs. We get around this problem in a statistical way, via a probabilistic description of the local search regions. Note that the model does not exploit the well-known epipolar constraint [82, p. 240]. As emphasized in [93], the performance of such approaches is very sensitive to finding the accurate epipoles, which may fail if, besides camera motion, many independent object displacements are present in the scene.

As a further related issue, in this chapter we search for object displacements in image pairs taken with a time difference of approximately 1–2 s. It should be emphasized here that this is a different task from processing high frame rate aerial videos [138, 144], where the camera motion can be predicted based on previously processed frames. As noted earlier, we will introduce a two-stage algorithm which consists of a coarse 2D image registration step for camera motion compensation, and a parallax error-eliminating step. From this point of view, this approach is similar to [57], where the authors assume that 2D registration errors mainly appear near sharp edges. Therefore, at locations where the magnitude of the gradient is large in both images, they consider that the differences of the corresponding pixel values are caused with higher probability by registration errors than by object displacements. However, this method is less effective if there are several small objects (containing several edges) in the scene, because the post-processing may also remove some real objects, while it leaves errors in smooth textured areas (e.g. groups of trees; corresponding test results are shown in Sect. 4.2.6).

4.2.1 2D Image Registration

In this section, we briefly introduce two approaches to coarse 2D image registration. Thereafter, we compare the methods on the images of our datasets, and we choose the more appropriate one to be the preprocessing step of our Bayesian labeling model.

4.2.1.1 Pixel-Correspondence-Based Homography Matching (PCH)

This approach consists of two consecutive steps. First, corresponding pixels are collected in the images (for example, see [39, 123]), thereafter, the optimal coordinate transform is estimated between the elements of the extracted point pairs [204]. Therefore, only the first step is influenced directly by the observed image data, and the method may fail if the feature tracker produces poor results. On the other hand, we can approximate an arbitrary projective transform in this way. The set of the resulting point pairs usually contains several outliers, which are filtered out by the RANSAC algorithm [82, p. 290], while the optimal homography is estimated so that the back-projection error is minimized [1].
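A possible realization of the PCH pipeline is sketched below using OpenCV (assumed to be available): corner features are tracked with the Lucas–Kanade method, RANSAC filters the outlier pairs, and the resulting homography warps the second frame onto the lattice of the first. The parameter values are illustrative choices rather than the settings used in our experiments.

```python
import cv2

def pch_register(g1, g2):
    """Sketch of pixel-correspondence-based homography matching (PCH) for two
    grayscale (uint8) frames g1 and g2."""
    # Step 1: corner detection in the first frame and Lucas-Kanade tracking to the second
    p1 = cv2.goodFeaturesToTrack(g1, maxCorners=500, qualityLevel=0.01, minDistance=7)
    p2, status, _ = cv2.calcOpticalFlowPyrLK(g1, g2, p1, None)
    good1 = p1[status.ravel() == 1]
    good2 = p2[status.ravel() == 1]
    # Step 2: homography estimation with RANSAC-based outlier filtering
    T, inliers = cv2.findHomography(good2, good1, cv2.RANSAC, 3.0)
    # Warp the second frame so that it is (coarsely) registered to the first one
    h, w = g1.shape[:2]
    g2_registered = cv2.warpPerspective(g2, T, (w, h))
    return g2_registered, T
```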

4.2.1.2 FFT-Correlation-Based Similarity Transform (FCS)

Reddy and Chatterji [152] proposed an automatic and robust method for registering images which are related via a similarity transform (translation, rotation, and scaling). In this approach, the goal is to find the parameters of the similarity transform T for which the correlation between G_1 and G̃_2 = T(G_2) is maximal.

The method is based on the Fourier shift theorem. In the first step, we assume that the G_1 and G_2 images differ only in displacement, namely there exists an offset vector d* for which g_1(s) = g_2(s + d*) : ∀s, s + d* ∈ S. Let us denote by G_2^d the image we get by shifting G_2 with offset d. In this case, d* = argmax_d R(d), where R is the correlation map: R(d) = Corr{G_1, G_2^d}. R can be determined efficiently in the Fourier domain. Let F_1 and F_2 be the Fourier transforms of the images G_1 and G_2. We define the Cross Power Spectrum (CPS) by

$$\mathrm{CPS}(i,k) = \frac{F_1(i,k)\cdot \bar{F}_2(i,k)}{\left|F_1(i,k)\cdot \bar{F}_2(i,k)\right|} = e^{\,j2\pi(d_x i + d_y k)}, \qquad (4.2)$$

where $\bar{F}_2$ means the complex conjugate of F_2. Finally, the inverse Fourier transform of the CPS is equal to the correlation map R [152].

The Fourier shift theorem also offers a way to determine the rotation angle. Assume that G_2 is a translated and rotated replica of G_1, where the translation vector is o and the angle of rotation is α_0. It can be shown that, considering |F_1| and |F_2| as images, |F_2| is the purely rotated replica of |F_1| with angle α_0. On the other hand, rotation in the Cartesian coordinate system is equivalent to a translational displacement in the polar representation [152], which can be calculated similarly to the determination of d*. The scaling factor of the optimal similarity transform may be retrieved in an analogous way [152]. In summary, we can determine the optimal similarity transform T between the two images based on [152], and derive the (coarsely) registered second image, G̃_2.
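The translation part of the FCS method can be summarized in a few lines of NumPy; the sketch below only recovers the offset d* of Eq. (4.2), while the rotation and scale factors, obtained from the log-polar representation of the magnitude spectra, are omitted for brevity.

```python
import numpy as np

def fcs_translation(g1, g2):
    """Estimate the offset d* for which g1(s) ~= g2(s + d*) via the normalized
    cross power spectrum (translation step of the FCS method)."""
    F1, F2 = np.fft.fft2(g1), np.fft.fft2(g2)
    # With NumPy's FFT sign conventions the conjugate is placed on F1 so that the
    # correlation peak appears at +d*; writing F1 * conj(F2) as in Eq. (4.2)
    # would move the peak to -d* (modulo the image size).
    prod = np.conj(F1) * F2
    cps = prod / (np.abs(prod) + 1e-12)
    R = np.real(np.fft.ifft2(cps))               # correlation map
    dy, dx = np.unravel_index(np.argmax(R), R.shape)
    # peaks beyond half the image size correspond to negative offsets
    if dy > g1.shape[0] // 2: dy -= g1.shape[0]
    if dx > g1.shape[1] // 2: dx -= g1.shape[1]
    return dx, dy
```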

4.2.1.3 Experimental Comparison of PCH and FCS

The PCH and FCS algorithms have been tested on our test image pairs. Obviously, both approaches give coarse registrations only, which are inaccurate and affected by parallax artifacts. In fact, FCS is less effective if the projective distortion between the images is significant. The weak point of PCH appears if the object motion is dense: many point pairs may be detected on moving objects, so that the automatic outlier filtering may fail, or at least the homography estimation becomes inaccurate. In our test database, the latter artifacts are more significant, since the corners of the several moving cars present dominant features for the Lucas–Kanade tracker. Some corresponding results are presented in Fig. 4.4. We can observe that using FCS, the errors are limited to the static object boundaries, while regarding two out of the four frames, the PCH registration is highly erroneous. We note that using the Bayesian post-processing approach, presented later in this chapter, we can remove the FCS errors, but we cannot deal with the large gaps caused by PCH. For the above-mentioned reasons, we will use the FCS method for preliminary registration in the remaining part of this chapter; however, in other test scenes it can be replaced with PCH in a straightforward way.

Fig. 4.4 Qualitative illustration of the coarse registration results presented by the FFT-correlation-based similarity transform (FCS), and the pixel-correspondence-based homography matching (PCH). In columns 3 and 4, we find the thresholded difference of the registered images. Both results are quite noisy, but using FCS, the errors are limited to the static object boundaries, while regarding P#25 and P#52 the PCH registration is erroneous. Our Bayesian post-processing is able to remove the FCS errors, but it cannot deal with the demonstrated PCH gaps

4.2.2 Change Detection with 3D Approach

Let us denote by G_1 and G_2 two input images defined over the same pixel lattice S. The intensity value of a selected pixel s ∈ S is marked by g_1(s) in the first image and by g_2(s) in the second one. We interpret the change detection problem between the frames as a pixel-labeling task with two segmentation classes: foreground (fg) and background (bg). Pixel s corresponds to the foreground if, in the 3D real world, a scene point which is projected to pixel s in the first image (G_1) changes its position in the world coordinate system or becomes covered by a moving object at the time when the second image (G_2) is taken. Otherwise, pixel s is assigned to the background regions. Assuming that the observed scene consists of an approximately planar ground region with various static and dynamic 3D urban objects (such as vehicles, walls, trees, and short building segments), a 2D similarity transform provided by the Fourier shift theorem-based method [152] can be used for a coarse estimation of the global transform between the images due to camera motion [29]. In the following, the registered second frame is marked by G̃_2, with pixel values {g̃_2(s) | s ∈ S}.

4.2.3 Feature Selection

In the next step, we define local image descriptors at each pixel s ∈ S which inform us regarding the classification of s either as a foreground or as a background point. Following a probabilistic approach, the classes are considered as random processes, which generate the selected descriptors according to various distributions. The feature selection process is demonstrated in Fig. 4.5.

Fig. 4.5 Feature selection in the multi-layer MRF model. Notations are given in the text of Sect. 4.2.3

The first feature is the intensity difference between the corresponding pixels in G̃_2 and G_1, respectively: f_d(s) = g̃_2(s) − g_1(s). As shown in Fig. 4.5c, the observed f_d(.) descriptor values in the background regions can be statistically characterized by a Gaussian distribution with an empirically calculated mean value μ (i.e. a constant intensity shift between the images) and standard deviation σ (uncertainty due to camera noise and registration errors):

$$P\big(f_d(s)\mid \mathrm{bg}\big) = \mathcal{N}\big(f_d(s), \mu, \sigma\big) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\big(f_d(s)-\mu\big)^2}{2\sigma^2}\right). \qquad (4.3)$$

On the other hand, we can observe any f_d(s) value in the foreground with the same prior probability; therefore, the foreground class can be modeled by a uniform density function:

$$P\big(f_d(s)\mid \mathrm{fg}\big) = \begin{cases} \dfrac{1}{b_d-a_d}, & \text{if } f_d(s) \in [a_d, b_d] \\ 0 & \text{otherwise.} \end{cases} \qquad (4.4)$$

Next, we explore the limitations of this feature. We first set the distribution parameters optimally by a supervised strategy, and we calculate the D image in Fig. 4.5d as the maximum likelihood estimate, where the label of s is taken as $\arg\max_{\psi\in\{\mathrm{fg},\mathrm{bg}\}} P(f_d(s)\mid\psi)$.
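The per-pixel maximum likelihood decision behind the D map can be written compactly; the sketch below assumes that the parameters μ, σ, a_d, and b_d have already been obtained by the supervised strategy described above.

```python
import numpy as np

def fd_ml_foreground(fd, mu, sigma, a_d, b_d):
    """Maximum likelihood labeling from the difference feature f_d: Gaussian
    background density (Eq. 4.3) vs. uniform foreground density (Eq. 4.4).
    Returns a boolean mask, True where 'fg' is the more likely label."""
    p_bg = np.exp(-((fd - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    p_fg = np.where((fd >= a_d) & (fd <= b_d), 1.0 / (b_d - a_d), 0.0)
    return p_fg > p_bg
```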


We can observe here several false positive foreground points, mainly near the boundaries of static 3D field objects. To compensate for the limitations of the f_d(s) feature, we introduce a second descriptor f_c(s), which is obtained by calculating the normalized cross correlation between the rectangular pixel neighborhoods W_1(s) in G_1 and W_2(s + o_s) in G̃_2 for different o_s offset values within an l-sided search window. Then we take

$$f_c(s) = \max_{o_s} \operatorname{Corr}\{W_1(s), W_2(s + o_s)\}.$$

As shown in Fig. 4.5e, the f_c(s) values in the background can be approximated by a Beta density function [78]: P(f_c(s)|bg) = B(f_c(s), α, β), where

$$B(c, \alpha, \beta) = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, c^{\alpha-1}(1-c)^{\beta-1}, & \text{if } c \in (0,1) \\ 0 & \text{otherwise} \end{cases} \qquad (4.5)$$

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\, dt.$$

Here again, the foreground class is modeled by a uniform probability density function P(f_c(s)|fg) with parameters a_c and b_c. As demonstrated in Fig. 4.5f (see the C image), the f_c(.) descriptor is in itself also inappropriate to support efficient motion segmentation. However, unlike with the f_d(.) descriptor, the false alarms appear mainly in homogeneous areas, where the variance of the pixel values in the blocks to be compared may be very low, thus the normalized correlation coefficient is highly sensitive to noise. On the other hand, if we consider D and C as two binary maps, a logical AND operation applied on them yields a significantly better result, as shown in Fig. 4.5h. Let us observe that this classification is still quite noisy, although in the output motion mask we expect connected regions representing the object silhouettes. Applying Markov Random Fields (MRFs) could be a straightforward idea here; however, the above-introduced label-based 'AND' fusion rule requires a novel three-layer MRF (L³Mrf) model structure that will be introduced in the next section.

4.2.4 Multi-layer Segmentation Model

In the proposed approach, an MRF model is constructed on a graph G whose structure is shown in Fig. 4.6. Let us recall that in the previous section, we applied two independent segmentations, thereafter the final motion mask has been derived using pixel-by-pixel operations on the obtained labels. Using this concept, we arrange the graph nodes of G into three disjoint layers S^d, S^c, and S^∗, where the size of each layer is the same as the size of the input images. We map each pixel s ∈ S to a unique site in each layer: e.g. s^d is the node corresponding to pixel s on the layer S^d. We similarly mark s^c ∈ S^c and s^∗ ∈ S^∗. Our labeling process assigns a label ς(.) to each node of G using the label set {fg, bg}. The label of S^d (resp. S^c) depends directly on the segmentation based on the f_d(.) (resp. f_c(.)) descriptor, while the labels of the S^∗ layer realize the final motion mask. A global labeling of G is

$$\omega = \big\{\varsigma(s^i) \mid s \in S,\ i \in \{d, c, *\}\big\}. \qquad (4.6)$$

Fig. 4.6 Structure of the proposed three-layer MRF (L³Mrf) model

Following the MRF concept, the labeling of an arbitrary node depends directly on the labels of its neighbors, defined by the neighborhood relations within G. To obtain smooth segmentations, connections are used within each layer between node pairs corresponding to neighboring pixels of the image lattice S (using 4-neighborhoods). On the other hand, the nodes from different layers assigned to the same pixel should also interact, so that the fusion of the two different segmentation labels is implemented in the S^∗ layer. Therefore, we use 'inter-layer' connections between sites s^i and s^j: ∀s ∈ S; i, j ∈ {d, c, ∗}, i ≠ j. In summary, the graph has doubleton 'intra-layer' cliques (their set is C_2) which consist of pairs of sites, and 'inter-layer' cliques (C_3) containing site triples. We also define singleton cliques (C_1), which are single-element sets of individual sites, connecting the model to local observations. The set of cliques is C = C_1 ∪ C_2 ∪ C_3.


The observation process is defined by F = {f(s) | s ∈ S}, where f(s) = [f_d(s), f_c(s)]. We need to find the optimal labeling ω̂, which maximizes the posterior probability P(ω|F), i.e. the maximum a posteriori (MAP) estimate [71]:

$$\hat{\omega} = \arg\max_{\omega \in \Upsilon} P(\omega \mid F),$$

where, following the notations from Chap. 2, Υ denotes the set of all possible global labelings. Based on the Hammersley–Clifford Theorem (Eq. 2.5), the a posteriori probability of a given labeling follows a Gibbs distribution:

$$P(\omega \mid F) = \frac{1}{Z}\exp\!\left(-\sum_{C \in \mathcal{C}} V_C(\omega_C)\right). \qquad (4.7)$$

Our remaining task is to define the V_C clique potentials, which have small values if ω_C (the label sub-configuration corresponding to C) is semantically correct, and large values otherwise. The observations influence the model through the singleton potentials. Since the labels in the S^d and S^c layers are directly affected by the f_d(.) and f_c(.) values, respectively, ∀s ∈ S:

$$V_{\{s^d\}}\big(\varsigma(s^d)\big) = -\log P\big(f_d(s)\mid\varsigma(s^d)\big), \qquad V_{\{s^c\}}\big(\varsigma(s^c)\big) = -\log P\big(f_c(s)\mid\varsigma(s^c)\big), \qquad (4.8)$$

where the same foreground and background probabilities are used as defined in Sect. 4.2.3. Since the labels at S^∗ have no direct connection with the above image features, constant zero potentials are used there:

$$V_{\{s^*\}}\big(\varsigma(s^*)\big) = 0. \qquad (4.9)$$

For ensuring smooth segmentation in each layer, the potential of an intra-layer clique C_2 = {s^i, r^i} ∈ C_2, i ∈ {d, c, ∗}, favors equal labels:

$$V_{C_2}\big(\varsigma(s^i), \varsigma(r^i)\big) = \begin{cases} -\delta^i & \text{if } \varsigma(s^i) = \varsigma(r^i) \\ +\delta^i & \text{if } \varsigma(s^i) \neq \varsigma(r^i) \end{cases} \qquad (4.10)$$

with a constant δ^i > 0. As the experiments in Sect. 4.2.3 confirmed, a pixel likely corresponds to the background regions if at least one connected site has the label 'bg' in the S^d and S^c layers. We define an indicator function 1_bg : S^d ∪ S^c ∪ S^∗ → {0, 1}:

$$\mathbb{1}_{\mathrm{bg}}(\upsilon) = \begin{cases} 1 & \text{if } \varsigma(\upsilon) = \mathrm{bg} \\ 0 & \text{if } \varsigma(\upsilon) \neq \mathrm{bg} \end{cases} \qquad (4.11)$$


Using this indicator and a positive constant parameter ρ > 0, the potential of an inter-layer clique C_3 = {s^d, s^c, s^∗} has the following form:

$$V_{C_3}(\omega_{C_3}) = V_{C_3}\big(\varsigma(s^d), \varsigma(s^c), \varsigma(s^*)\big) = \begin{cases} -\rho & \text{if } \mathbb{1}_{\mathrm{bg}}(s^*) = \max\big(\mathbb{1}_{\mathrm{bg}}(s^d),\ \mathbb{1}_{\mathrm{bg}}(s^c)\big) \\ +\rho & \text{otherwise.} \end{cases} \qquad (4.12)$$

The optimal MAP labeling ω̂ that maximizes P(ω|F) (hence minimizes −log P(ω|F)) can be calculated using (4.8)–(4.12), with i ∈ {d, c, ∗}, as

$$\hat{\omega} = \arg\min_{\omega \in \Upsilon}\ \sum_{s \in S}\Big(-\log P\big(f_d(s)\mid\varsigma(s^d)\big) - \log P\big(f_c(s)\mid\varsigma(s^c)\big)\Big) + \sum_{i}\ \sum_{\{s,r\} \in C_2} V_{C_2}\big(\varsigma(s^i), \varsigma(r^i)\big) + \sum_{s \in S} V_{C_3}\big(\varsigma(s^d), \varsigma(s^c), \varsigma(s^*)\big). \qquad (4.13)$$
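To make the energy of Eq. (4.13) concrete, the following NumPy sketch evaluates it for a given labeling of the three layers. The array conventions (boolean label maps with True = fg, −log-likelihood maps stacked along the first axis, and a single smoothing weight δ shared by all layers) are assumptions made for the illustration only.

```python
import numpy as np

def l3mrf_energy(lab_d, lab_c, lab_star, nlogp_d, nlogp_c, delta, rho):
    """Energy of a three-layer labeling (Eq. 4.13). nlogp_d / nlogp_c are
    (2, H, W) arrays holding -log P(f|bg) at index 0 and -log P(f|fg) at 1."""
    # singleton potentials of the two observation layers (Eq. 4.8)
    E = np.sum(np.where(lab_d, nlogp_d[1], nlogp_d[0]))
    E += np.sum(np.where(lab_c, nlogp_c[1], nlogp_c[0]))
    # intra-layer smoothness (Eq. 4.10); a single delta stands in for
    # the layer-specific delta^d, delta^c and delta^* of the model
    for lab in (lab_d, lab_c, lab_star):
        for ax in (0, 1):
            same = lab == np.roll(lab, 1, axis=ax)
            same = np.delete(same, 0, axis=ax)   # drop the wrap-around comparisons
            E += np.sum(np.where(same, -delta, delta))
    # inter-layer potential (Eq. 4.12): s* should be 'bg' iff s^d or s^c is 'bg'
    consistent = (~lab_star) == (~lab_d | ~lab_c)
    E += np.sum(np.where(consistent, -rho, rho))
    return E
```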

4.2.5 L³Mrf Optimization

The energy term of (4.13) can be optimized by various iterative techniques, like ICM [32] or simulated annealing [71], where the three layers of the model are simultaneously optimized. The interactions between the various nodes and the observation affect the final segmentation, which is taken at the end as the labeling of the S^∗ layer. To obtain a good suboptimal solution, we have developed a three-layer modification of the deterministic Modified Metropolis (MMD) algorithm, which provides an efficient trade-off between segmentation speed and quality in many applications [104]. The detailed pseudo code of our extended MMD algorithm adapted to the L³Mrf segmentation model is given in Fig. 4.7.

Fig. 4.7 Pseudo code of the Modified Metropolis algorithm used for the three-layer model. Corresponding notations are given in Sects. 4.2.3, 4.2.4, and 4.2.5. In the tests, we used τ = 0.3, T_0 = 4, and an exponential heating strategy: T_{k+1} = 0.96 · T_k
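The sketch below is a simplified, slow reference interpretation of the three-layer Modified Metropolis relaxation: it mirrors the acceptance rule and the exponential temperature schedule quoted in the caption of Fig. 4.7, but the initialization, site visiting order, and stopping criterion are simplifications, not a reproduction of the actual pseudo code.

```python
import numpy as np

def mmd_l3mrf(nlogp_d, nlogp_c, delta, rho, tau=0.3, T0=4.0, cooling=0.96, n_iter=50, rng=None):
    """Modified-Metropolis-style relaxation of the three-layer model: each site
    proposes a label flip, accepted if dU <= 0 or exp(-dU/T) >= tau.
    nlogp_* are (2, H, W) arrays of -log-likelihoods (0 = bg, 1 = fg)."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = nlogp_d.shape[1:]
    labels = {k: rng.random((H, W)) < 0.5 for k in ("d", "c", "*")}   # True = fg

    def local_delta(layer, y, x):
        lab = labels[layer]
        old, new = lab[y, x], not lab[y, x]
        dU = 0.0
        if layer == "d":                                 # singleton term (Eq. 4.8)
            dU += nlogp_d[int(new), y, x] - nlogp_d[int(old), y, x]
        elif layer == "c":
            dU += nlogp_c[int(new), y, x] - nlogp_c[int(old), y, x]
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < H and 0 <= nx < W:              # intra-layer pairs (Eq. 4.10)
                dU += 2.0 * delta if lab[ny, nx] == old else -2.0 * delta
        def inter(sd, sc, ss):                           # inter-layer clique (Eq. 4.12)
            return -rho if (not ss) == ((not sd) or (not sc)) else rho
        trip = {k: labels[k][y, x] for k in ("d", "c", "*")}
        before = inter(trip["d"], trip["c"], trip["*"])
        trip[layer] = new
        return dU + inter(trip["d"], trip["c"], trip["*"]) - before

    T = T0
    for _ in range(n_iter):                              # pure-Python loops: slow, for clarity only
        for layer in ("d", "c", "*"):
            for y in range(H):
                for x in range(W):
                    dU = local_delta(layer, y, x)
                    if dU <= 0 or np.exp(-dU / T) >= tau:
                        labels[layer][y, x] = not labels[layer][y, x]
        T *= cooling
    return labels["*"]                                   # final motion mask: S* layer
```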

4.2.6 Experiments on Object Motion Detection

In this section, we validate our method via image pairs from different test sets. We compare the results of the three-layer model with various reference methods, first qualitatively, then using different quantitative measures. Thereafter, we test the significance of the inter-layer connections in the joint segmentation model. Finally, we comment on the complexity of the algorithm. The evaluation tests are performed based on manually generated Ground Truth masks for various aerial image pairs. We use three test sets which consist of 83 (= 52 + 22 + 9) pairs of images, where for each image pair around 1–3 s elapsed between the capturing times of the compared two frames. The ‘Balloon1’ and ‘Balloon2’ test sets contain image pairs from a video sequence captured by a flying balloon, while in the set ‘Budapest’, we find different image pairs taken from a plane. For each test set, the model parameters are estimated in a supervised way, using a limited number of training image pairs (less than 10% of the whole database), and the quality figures were solely generated for the remaining test images which were excluded from the training. For enabling public usage, we published our new dataset called SZTAKI AirMotion Benchmark on the website of our research institute.¹

¹ http://mplab.sztaki.hu/remotesensing/airmotion_benchmark.html

4.2.6.1 Evaluation Versus Reference Methods for This Task

We have compared the results of the proposed three-layer model to five other solutions, using the same training strategies and evaluation metrics. The first reference method is constructed from our model by ignoring the segmentation layer and the second observation layer, i.e. it fully relies on the 2D registration approach of Reddy & Chatterji [152], which is followed by frame differencing. This comparison emphasizes the importance of using the correlation-peak features, since this reference approach attempts to perform a good segmentation purely based on the pixel-by-pixel gray-level difference values. The second reference technique is the method of Farin & With [57], which combines a difference map and a risk map w.r.t. registration errors in a single-layer MRF framework. The third comparison aims to demonstrate the limits of [111]: an optimal affine transform between the frames (which was automatically estimated in [111]) is approximated in a semi-manual way, using appropriately selected corresponding corner point pairs in the two frames. Thereafter, we derive the output change map based on the pixel-by-pixel gray-level difference values between the registered images. Fourth, we implemented a sequential approach based on [203], which classifies pixels as background iff they fulfill either the widely considered epipolar or the homography constraints [93], whereas the resulting foreground masks are enhanced by morphological operators in post-processing. Fifth, we also tested the K-Nearest-Neighbor-based label fusion framework (Knnbf) of [94], tailored to the motion segmentation task.

Figure 4.8 shows four selected image pairs from the SZTAKI AirMotion Benchmark with the change mask results obtained by the proposed L³Mrf model. For qualitative analysis, Fig. 4.9 displays our results again, which can be compared to the outputs of the above-described five reference techniques and the manually prepared and verified Ground Truth (GT) change masks. As for quantitative validation, the different methods' change output masks have been compared to the GT images, and the pixel-level F-measure (F-score) values have been calculated using Eq. (3.43). The obtained numerical F-measure rates are shown in Fig. 4.10.

Fig. 4.8 Four selected test image pairs for qualitative comparison (see also Fig. 4.9)

By examining the figures, we can notice a large number of false foreground pixels by using both the Reddy and the semi-manual Affine models, since they do not deal with the elimination of parallax-based errors. The parallax issues are partially handled by the Farin technique; however, its overall F-score is still notably weaker than the results of the proposed L³Mrf model, in particular due to poor performance in highly textured or low-contrasted image regions. The main weak point of the Epipolar approach proved to be the sensitivity of extracting point correspondences in featureless regions, and confusing parallax effects with objects moving along the calculated epipolar lines [203]. These issues decreased in parallel both the Recall and the Precision rates (defined by Eq. (3.42)) of the Epipolar method, an effect that is visually observable in Fig. 4.11. Based on our tests, the performance of the Knnbf label fusion approach critically depends on the optical flow-based preliminary object separation step, which can become a weak point as the elapsed time between the compared images increases. As shown in Fig. 4.12, the Knnbf method can efficiently deal with processing consecutive frames of a video sequence; however, the quality figures quickly decline when selecting frames with a time difference larger than 0.2 s. On the contrary, the proposed L³Mrf approach may be confused by small object displacements (classifying them as parallax errors), but it clearly outperforms Knnbf on non-consecutive video frames. We can conclude the superiority of the proposed L³Mrf model over earlier methods for image pairs taken with significant viewpoint differences or reflecting large object displacements, assuming that an upper bound can be given for the scale of parallax-based differences.


Fig. 4.9 Comparative segmentations with different test methods and ground truth using the image pairs of Fig. 4.8. Reference methods are described in Sect. 4.2.6.1. In the right column, the ellipses demonstrate a limitation: a high standing lamp is detected as a false moving object by all methods


Fig. 4.10 Numerical comparison of the proposed model (L³Mrf) to five reference methods, using three test sets: ‘Balloon1’ (52 image pairs), ‘Balloon2’ (22), and ‘Budapest’ (9)

Fig. 4.11 Segmentation example with the Epipolar method and the proposed L³Mrf model. The circle in the middle marks a motion region which erroneously disappears using the Epipolar approach

4.2.6.2 Significance of the Joint Segmentation Model

In the proposed model, the segmentations based on the f_d(.) and f_c(.) features are not performed independently: they interact through the inter-layer cliques. Although similar information fusion approaches have already been used for different image segmentation problems [102], the significance of the inter-layer connections should be justified with respect to the current task. We can examine in Fig. 4.13 segmentation results of various fusion models, which are compared to the proposed approach (Fig. 4.13h). As the first reference, we implemented a multivariate vector-based model (referred to hereafter as observation fusion), where a 2D f(s) = [f_d(s), f_c(s)] feature vector is constructed and its distribution is described by different 2D density functions: a mixture of Gaussians P(f(s)|bg) for the background class and a uniform density P(f(s)|fg) for the foreground. During the tests with this approach, we have obtained low-quality segmentation results (see Fig. 4.13d) regardless of the number of Gaussian mixture components.

Fig. 4.12 Comparison of the proposed L³Mrf model to the Knnbf method [94]: a qualitative results using image pairs from the Karlsruhe sequence (# denotes the frame number); b quantitative measurements, showing the F-measure of different frame pairs as a function of the time difference between the images

We demonstrate the role of the inter-layer cliques by comparing the proposed scheme with a sequential decision fusion model [153], where first, we perform two independent segmentations based on f_d(.) and f_c(.) (i.e. we segment the S^d and S^c layers ignoring the inter-layer cliques); thereafter, we get the segmentation of S^∗ by a per-pixel AND operation on the f_d(.)-based and f_c(.)-based segmented images. In Fig. 4.13e, we can observe that the separate segmentation gives noisy results, since in this case the intra-layer smoothing terms are not taken into account in the S^∗ layer. Thus, we can conclude that the proposed label fusion process enhances the quality of segmentation versus the sequential model.


Fig. 4.13 Evaluation of the proposed L³Mrf model versus different fusion approaches

Next, we have also examined whether it is required to include intra-layer smoothing terms both in the feature processing layers (δ^d, δ^c) and in the combined layer (δ^∗). For this reason, we have tested two further modifications of the proposed model: L³Mrf-δ₀* uses the smoothing parameter values δ^∗ = 0, δ^d > 0, and δ^c > 0, while in L³Mrf-δ₀^{d,c} we set δ^∗ > 0, δ^d = 0, and δ^c = 0. As shown in Fig. 4.13f, g, both modifications of the L³Mrf technique yield degraded motion masks. Similar to the above-discussed qualitative examples, the numerical test results provided in Fig. 4.14 confirm the superiority of the proposed L³Mrf method versus the various competing approaches using alternative information fusion strategies.


Fig. 4.14 Numerical comparison of the proposed model (L³Mrf) to different information fusion techniques with the same feature selection

4.3 Long-Term Change Detection in Aerial Photos

This section focuses on change detection in optical aerial images which were taken with several years of time difference, partially in different seasons and lighting conditions (Fig. 4.15). In this case, straightforward techniques, such as applying a simple threshold operation to the difference image [30, 150], cannot provide efficient results, since even the unchanged image regions may appear with significantly different pixel intensity values, due to different illumination conditions or seasonal changes in vegetation. In addition, by relying on optical image sensors only, we receive limited information compared to many approaches, such as [142], where healthy vegetation can be easily detected in the near-infrared wavelength band, or [68], which exploits multitemporal SAR imagery that provides stable measurements independently of atmospheric conditions. In this section, we purely work with geometrically corrected and registered grayscale orthophotos.

We propose a robust multi-layer Conditional MiXed Markov (CXM) model to tackle the change detection problem in remote sensing optical images. In our approach, similar to Sect. 4.2, changes are identified through two complementary descriptors; however, the label fusion part will be more complex than in L³Mrf. Instead of combining the two feature-based label maps via logical operators, we utilize a third feature, which is responsible for locally choosing the more reliable change descriptor in the different image regions. This modification requires involving dynamic connections between the nodes of the multi-layer graph structure, which will be implemented using the Mixed Markov model concept [26, 62].

4.3.1 Image Model and Feature Extraction

First, we focus on the issues of feature selection and modeling. For simplicity, we use here various notations from Sect. 4.2. Let G_1 and G_2 again be the two input images converted to grayscale, but we assume now that G_1 and G_2 have an identical pixel lattice S and that they are already registered by the image providers. The latter assumption is reasonable since, in contrast to object motion detection, long-term change detection is an offline task. The gray values are henceforward denoted by g_1(s) and g_2(s) for a pixel s ∈ S of G_1 and G_2, respectively.

First, we need to define local features at each s ∈ S which provide us information about the class of s, which can be either change (ch) or background (bg), where bg denotes an unchanged surface point. Following again a Bayesian approach, we model the ch and bg classes by stochastic processes which generate the observed features according to different distributions. Feature selection is validated experimentally, which we demonstrate in this chapter for the sample image pair of Fig. 4.15a, b. We implement a supervised method, where we assume that each test set contains a sufficient number of training images with manually verified Ground Truth change masks. We also prescribe that the image pairs of the same test set are taken from the same geographical area, and that within a time layer, the camera properties and settings, as well as the illumination conditions, are similar.

Fig. 4.15 Feature selection for long-term change detection: a image 1 (G_1), b image 2 (G_2), c intensity-based change detection (φ_g(.), changes are marked with red), d correlation-based change detection (φ_c(.)), e local variance-based segmentation, red if φ_ν(s) = c, f change detection results obtained by per-pixel integration of the φ_g(.), φ_c(.), and φ_ν(.) maps, g Ground Truth

4.3.1.1 Segmentation Based on Global Intensity Statistics

The first feature is defined in the joint intensity domain of the two images. Let us build a 2D histogram from the f_g(s) = [g_1(s), g_2(s)]^T gray-level pairs extracted over the background regions of the training images, as shown in Fig. 4.16. In our model, we approximate this histogram by a mixture of K Gaussian distributions, where K is a parameter which is fixed in advance. This approach measures which intensity values occur frequently together in the two input photos. In this way, the conditional probability term of the f_g(s) observation regarding the background class is calculated as

$$P\big(f_g(s)\mid \mathrm{bg}\big) = \sum_{i=1}^{K} \kappa_i \cdot \eta\big(f_g(s), \mu_i, \Sigma_i\big), \qquad (4.14)$$

where η(.) marks a 2D multivariate Gaussian probability density function (pdf) with expected value μ_i and covariance matrix Σ_i, while the κ_i terms are positive weighting factors (the κ_i values sum to one). Figure 4.16 shows the Expectation Maximization (EM) estimate [33] of the pdf using K = 5 mixture components, where two Gaussian terms have large weights.

Fig. 4.16 a f g -histogram of background pixels, b Mixture of Gaussians approximation of P( f g (s)|bg) obtained by the EM algorithm [33], c f g -histogram of the changed pixels, d Uniform density estimation [31] for P( f g (s)|ch)
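For illustration, the background mixture of Eq. (4.14) can be fitted with the EM algorithm of scikit-learn (assumed to be available); the training samples below are random placeholders that stand in for the gray-level pairs collected from the background pixels of the training images.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
bg_pairs = rng.integers(0, 256, size=(5000, 2)).astype(float)   # placeholder f_g(s) samples

gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(bg_pairs)                                                # EM estimation with K = 5

# log P(f_g(s) | bg) for two example gray-level pairs
log_p_bg = gmm.score_samples(np.array([[120.0, 131.0], [40.0, 210.0]]))
```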


While the intensity model of the background class expects a number of frequently co-occurring gray-level couples in the two images (e.g. the average intensity of a homogeneous meadow area), the f_g(s) histogram of the changed regions usually contains many smaller peaks and valleys distributed over a larger part of the 2D intensity domain. Expressing that any f_g(s) gray value pair may be observed in the changed areas with the same prior probability, the 'ch' class is modeled by a uniform density function [31] (Fig. 4.16d):

$$P\big(f_g(s)\mid \mathrm{ch}\big) = \begin{cases} \dfrac{1}{(b_1-a_1)\cdot(b_2-a_2)}, & \text{if } a_1 \le g_1(s) \le b_1 \text{ and } a_2 \le g_2(s) \le b_2 \\ 0 & \text{otherwise.} \end{cases} \qquad (4.15)$$

To sum it up, the f_g(s) feature-based model component for ch/bg separation is described by the following parameters:

$$O_g = \{\kappa_i, \mu_i, \Sigma_i \mid i = 1 \ldots K\} \cup \{a_1, b_1, a_2, b_2\}. \qquad (4.16)$$

Next, we demonstrate the limitations of using the above intensity-based approach. After supervised estimation of the O_g distribution parameters, we derive the φ_g change map as the pixel-by-pixel maximum likelihood (ML) estimate, where the label of s is

$$\varphi_g(s) = \arg\max_{\psi\in\{\mathrm{ch},\mathrm{bg}\}} P\big(f_g(s)\mid\psi\big). \qquad (4.17)$$

As Fig. 4.15c shows, the obtained f_g(s)-based maximum likelihood label map φ_g : S → {ch, bg} contains several false changes in unaltered territories, mainly in highly textured regions (e.g. areas of buildings and roads), where the occurring f_g(s) gray value pairs are less frequent in the global image statistics. Since these artifacts cannot be handled in the above approach, we introduce a second feature in the following section.

4.3.1.2 Segmentation Based on Local Correlation

Similar to the object motion detection application in Sect. 4.2, the second feature f_c(s) is calculated as the normalized cross correlation between corresponding rectangular regions in the two images. Let us denote the z × z rectangular neighborhood of s by N_{s,z} ⊂ S (we used z = 17). We mark by λ_i(s) and ν_i(s) the mean and variance values of the intensity levels within the N_{s,z} part of G_i, i ∈ {1, 2}. Next, we calculate f_c(s) as the normalized cross correlation coefficient between the corresponding N_{s,z} regions in the two images:

$$f_c(s) = \frac{\sum_{r \in N_{s,z}} \big(g_1(r) - \lambda_1(s)\big)\cdot\big(g_2(r) - \lambda_2(s)\big)}{z^2\sqrt{\nu_1(s)\cdot\nu_2(s)}}. \qquad (4.18)$$
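Equation (4.18) can be evaluated densely for every pixel with running box sums; the sketch below uses SciPy's uniform_filter for the local means and variances, which plays the same role as the integral image speed-up mentioned below in the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def correlation_feature(g1, g2, z=17, eps=1e-6):
    """Per-pixel normalized cross correlation f_c(s) of Eq. (4.18) computed
    with z x z box filters (the default neighborhood size follows the text)."""
    g1 = g1.astype(float); g2 = g2.astype(float)
    m1, m2 = uniform_filter(g1, z), uniform_filter(g2, z)          # lambda_1, lambda_2
    v1 = uniform_filter(g1 * g1, z) - m1 ** 2                      # nu_1
    v2 = uniform_filter(g2 * g2, z) - m2 ** 2                      # nu_2
    cov = uniform_filter(g1 * g2, z) - m1 * m2                     # mean of the centered products
    return cov / (np.sqrt(np.maximum(v1 * v2, 0.0)) + eps)
```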


Fig. 4.17 f_c histogram and Beta density approximation [78] of the P(f_c(s)|ch) and P(f_c(s)|bg) probabilities. a and b: initial estimation; c and d: optimized estimation

Note that the above calculation can be significantly sped up using the integral image trick [183]. To investigate the statistical properties of the correlation descriptor, we plot in Fig. 4.17 the histogram of the f_c(s) values obtained in the changed and background areas of the training images. Considering that the observed empirical distributions are asymmetric, we used Beta density approximations [78] for both the change and the background classes (Fig. 4.17b):

$$P\big(f_c(s)\mid \mathrm{ch}\big) = B\big([f_c(s)+1]/2,\ \alpha_{\mathrm{ch}}, \beta_{\mathrm{ch}}\big). \qquad (4.19)$$

We should note here that the scaling [f_c(s)+1]/2 was necessary to ensure that the correlation values fit the [0, 1] interval, which is the domain of the Beta density function. A similar model has been built for the background class:

$$P\big(f_c(s)\mid \mathrm{bg}\big) = B\big([f_c(s)+1]/2,\ \alpha_{\mathrm{bg}}, \beta_{\mathrm{bg}}\big). \qquad (4.20)$$

The corresponding parameter set is here:

$$O_c = \{\alpha_{\mathrm{ch}}, \beta_{\mathrm{ch}}, \alpha_{\mathrm{bg}}, \beta_{\mathrm{bg}}\}. \qquad (4.21)$$


Let us observe that the f_c(s) values in the changed image regions are approximately generated by a zero-mean distribution, while the values extracted from the background are in general significantly higher within the [−1, 1] interval. Next, we create a maximum likelihood (ML) segmentation estimate based on the f_c(.) feature:

$$\varphi_c(s) = \arg\max_{\psi\in\{\mathrm{ch},\mathrm{bg}\}} P\big(f_c(s)\mid\psi\big). \qquad (4.22)$$

As the projected change mask in Fig. 4.15d confirms, the correlation descriptor is also inefficient in itself. However, one can observe that f_g(s) and f_c(s) are efficient complementary features. In homogeneous image regions with low contrast, the f_c(s) descriptor proved to be very noisy, but at the same time, the f_g(s) feature seems to be fairly reliable for extracting the changes. On the other hand, in the classification of textured areas, f_c(s) definitely outperforms f_g(s). In the following section, we introduce a solid probabilistic interpretation of the contrast-based feature selection step.

4.3.1.3 Contrast-Based Feature Selection

In this section, we present a statistical model for local feature selection, considering our previous experiences. We have already seen that calculating local contrast information in the images can help us to measure the reliability of the f_g(s) intensity and f_c(s) correlation features in terms of the change–background classification of pixel s. For image G_i (i ∈ {1, 2}), we characterize the contrast by the ν_i(s) variance of the intensity values in the neighborhood of s. Let ν(s) = [ν_1(s), ν_2(s)]^T. We also use the notation T for the Ground Truth mask with labels t(s) ∈ {ch, bg} ∀s ∈ S, and δ marks the Kronecker delta.

Next, we attempt to statistically describe how the classification performance based on the f_g(s), respectively f_c(s), features depends on the extracted ν(s) values. We divide the domain of the observed ν_1(s) and ν_2(s) values into L equally spaced bins: b_1, ..., b_L (note that each b_n is a line segment in R). We say that ν(s) ∈ b_{m,n} if ν_1(s) ∈ b_m and ν_2(s) ∈ b_n (b_{m,n} is a rectangle in R²). We define next a ratio histogram h_g, which measures for each b_{m,n} bin the ratio of the number of correctly and falsely labeled pixels based on φ_g(.), such that the ν(s) values lie in b_{m,n}. Using the notation S_{m,n} = {s | s ∈ S, ν(s) ∈ b_{m,n}}, the h_g histogram is defined as

$$h_g[m,n] = \frac{\sum_{s \in S_{m,n}} \delta\big(t(s), \varphi_g(s)\big)}{\sum_{s \in S_{m,n}} \Big(1 - \delta\big(t(s), \varphi_g(s)\big)\Big)}. \qquad (4.23)$$
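A direct way to build the ratio histogram of Eq. (4.23) is sketched below; the bin count L and the guard against empty 'miss' bins are practical choices made for the illustration.

```python
import numpy as np

def ratio_histogram(nu1, nu2, gt, phi_g, L=16):
    """2D ratio histogram h_g: per contrast bin (m, n), the number of pixels
    correctly labeled by phi_g divided by the number of mislabeled ones."""
    edges1 = np.linspace(nu1.min(), nu1.max(), L + 1)
    edges2 = np.linspace(nu2.min(), nu2.max(), L + 1)
    m = np.clip(np.digitize(nu1.ravel(), edges1) - 1, 0, L - 1)
    n = np.clip(np.digitize(nu2.ravel(), edges2) - 1, 0, L - 1)
    correct = (gt.ravel() == phi_g.ravel())
    hits = np.zeros((L, L)); misses = np.zeros((L, L))
    np.add.at(hits, (m, n), correct)
    np.add.at(misses, (m, n), ~correct)
    return hits / np.maximum(misses, 1.0)       # guard against empty 'miss' bins
```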


Fig. 4.18 Illustration of the 2D h g and h c histograms as function of the corresponding ν1 (s) and ν2 (s) values

h_c is similarly calculated for the f_c(.) feature. The h_g and h_c 2D ratio histograms are visually demonstrated in Fig. 4.18a, c. Notable peaks of h_g (resp. h_c) correspond to domains of ν(s) where the decision based on the f_g(.) [resp. f_c(.)] feature is reliable. After applying a normalization step, we consider the histograms as probability distributions, which are approximated again with parametric density functions. Here, based on the ν(s) descriptor values, we have to separate the image pixels where the f_g(s), respectively f_c(s), features can be more reliably used for classifying pixel s as change (ch) or background (bg). We have observed in these experiments that the two domains can be well separated with 2D Gaussian density approximations of the h_g and h_c histograms, a phenomenon shown in Fig. 4.18b, d (the histograms are unimodal and they overlap only slightly). Thus, the following distributions are used:

$$P\big(\nu(s)\mid h_g\big) = \eta\big(\nu(s), \mu_g, \Sigma_g\big), \qquad (4.24)$$

$$P\big(\nu(s)\mid h_c\big) = \eta\big(\nu(s), \mu_c, \Sigma_c\big). \qquad (4.25)$$

The parameter set assigned to the contrast feature is

$$O_\nu = \{\mu_g, \Sigma_g, \mu_c, \Sigma_c\}. \qquad (4.26)$$

Then we can take the maximum likelihood contrast map (Fig. 4.15e) as

$$\varphi_\nu(s) = \arg\max_{\chi\in\{g,c\}} P\big(\nu(s)\mid h_\chi\big). \qquad (4.27)$$

For generating the final change mask φ_∗, the following pixel-by-pixel classification process is performed:

$$\varphi_*(s) = \begin{cases} \varphi_g(s) & \text{if } \varphi_\nu(s) = g \\ \varphi_c(s) & \text{if } \varphi_\nu(s) = c. \end{cases} \qquad (4.28)$$

Let us observe that in the above classification approach, we can also refine the distribution parameters relying on the ch/bg labeled training data. Here, we recall that in Sects. 4.3.1.1 and 4.3.1.2, the parameters in the O_g and O_c sets were estimated over all pixels of the training images. For example, the background f_c-histogram in Fig. 4.17a also includes f_c(s) features obtained from nearly homogeneous areas, where the correlation may be unreliable and irrelevant w.r.t. creating the final change map. Based on (4.28), we should estimate the f_c(s) statistics only over the highly textured (i.e. contrasted) image regions, and the f_g(s) distributions only over the homogeneous ones. Thus, we can re-estimate the O_g parameters considering only the pixels of the training images with φ_ν(s) = g, and O_c for the training pixels with φ_ν(s) = c. Note that using such a parameter estimation method, we can observe a mutual dependency between the parameter sets O_ν and O_g ∪ O_c. Thus, we can use an iterative algorithm to refine the parameters, as detailed in Fig. 4.19. In our experiments, the algorithm needed 3–5 iterations until convergence, and it caused a notable improvement especially regarding the O_c parameter set. We can compare in Fig. 4.17b, d the initial and final density functions corresponding to O_c.

Next, we provide an evaluation of the pixel-by-pixel segmentation algorithm defined by (4.28). Figure 4.15f, g shows the φ_∗ segmentation output and the proposed Ground Truth, respectively. Although we can observe significant improvement due to the feature integration step compared to the results of the individual descriptors (see also Fig. 4.15c, d), the joint segmentation output in Fig. 4.15f is still notably noisy. Principally, the segmented image of Fig. 4.15f contains many fragmented regions, instead of consisting of smooth and connected blobs representing the changed regions. To enhance the output quality, neighborhood interaction should be involved in the process besides the pixel-level descriptors.

Previously in this chapter, MRFs have been used for obtaining observation-consistent and smooth image segmentation. However, converting the above-described segmentation algorithm (see also Eq. (4.28)) to a Markovian schema requires following a different approach from single-layer Potts-like MRF models [147] for two reasons. First, the results of different segmentations should be efficiently integrated, which will be solved by a multi-layer technique similar to our model from Sect. 4.2. Second, let us observe that the ν(s) feature plays a unique role: it is used to locally switch the f_g(s) and f_c(s) features ON and OFF in the feature integration formula. Since conventional MRFs only use static interactions between the processing nodes, we will use here an extended structure called the mixed Markov model [62], which we already introduced in Sect. 2.2.5 of Chap. 2.
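Before turning to the Markovian extension, the pixel-wise rules of Eqs. (4.27) and (4.28) can be summarized in a few lines of NumPy; the Gaussian evaluation below is a straightforward implementation, and all parameters are assumed to come from the training procedure described above.

```python
import numpy as np

def contrast_map(nu, mu_g, cov_g, mu_c, cov_c):
    """ML contrast decision of Eq. (4.27): True where the 'intensity feature is
    reliable' Gaussian (h_g) gives a higher density for nu(s) = [nu_1, nu_2]."""
    def log_gauss(x, mu, cov):
        d = x - mu
        quad = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)
        return -0.5 * (quad + np.log(np.linalg.det(cov)) + 2.0 * np.log(2.0 * np.pi))
    return log_gauss(nu, mu_g, cov_g) > log_gauss(nu, mu_c, cov_c)

def fuse_change_masks(phi_g, phi_c, phi_nu):
    """Pixel-by-pixel integration of Eq. (4.28): keep the intensity-based label
    phi_g where phi_nu selects 'g', and the correlation-based label elsewhere."""
    return np.where(phi_nu, phi_g, phi_c)
```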


Fig. 4.19 Iterative algorithm for estimating the O_g, O_c, and O_ν parameter sets. O_g resp. O_c model the f_g(s) intensity resp. f_c(s) correlation features generated by the change and background classes, while O_ν describes the ν(s) contrast observation, on condition that the intensity (h_g) resp. correlation (h_c) factors are reliable (used k_max = 5)

4.3.2 A Conditional Mixed Markov Image Segmentation Model

In this section, we present a new probabilistic method, the Conditional MiXed Markov Model (CXM), which is constructed as a combination of a mixed Markov model [62] and a conditionally independent random field of signals. Let us recall here that in Sect. 4.3.1, we combined segmentation results based on three different features in a pixel-wise manner. To adopt this approach, the problem is first mapped to a graph G shown in Fig. 4.20c. The vertices of G are arranged into four layers: S^g, S^c, S^ν, and S^∗, each having the same size as the original image lattice S. Henceforward, we call S^g, S^c, and S^ν feature layers, while S^∗ is the combined segmentation layer. Each pixel s ∈ S corresponds to a unique node in every layer: for example, s^g is the graph node mapped to pixel s in the S^g layer. The nodes s^c ∈ S^c, s^ν ∈ S^ν, and s^∗ ∈ S^∗ are defined similarly. In the CXM approach, instead of performing independent segmentations of each layer, we derive the result by stochastic optimization of a single energy function which encapsulates all constraints of the


Fig. 4.20 Structure of the proposed model and overview of the segmentation process

model: observation-consistent classification, optimal local feature selection, and spatial smoothness. Next, we define a labeling random process, which assigns a label ς(q) to each node q of G. Following the mixed Markov approach [62], dependencies between the corresponding node labels are ensured either by static graph edges or by dynamic address pointers. The S^g, S^c, and S^∗ layers of the model contain regular nodes, whose labels mark the possible change or background (ch/bg) classes:

ς(s^i) ∈ {ch, bg}, ∀s ∈ S, i ∈ {g, c, ∗}.    (4.29)

More specifically, for a given pixel s, ς(s^g) resp. ς(s^c) is the label of a segmentation process based on the f_g(s) resp. f_c(s) feature, while the final change mask is represented by the labels of the S^∗ layer. On the other hand, we use the S^ν layer to connect the regions of the final change map S^∗ to the correctly segmented image areas either in the S^g or in the S^c layer. For this reason, we solely use address nodes in layer S^ν, which have node-pointer labels: {ς(s^ν) | ∀s ∈ S}. Next, we describe how the model utilizes the information extracted from the input images. We define an operator f(.) which connects the nodes of the feature layers S^g, S^c, and S^ν to the corresponding local observations in the following way: ∀s ∈ S, f(s^g) = f_g(s), f(s^c) = f_c(s), and f(s^ν) = ν(s).


Fig. 4.21 Demonstration of (I) intra- and (II.a,II.b) inter-layer connections regarding nodes associated with pixel s. Continuous line is an edge of G , dotted arrows denote the two possible destinations of the address node s ν (in I: i ∈ {g, c, ν, ∗})

We denote the global observation process by F = {f(q) | q ∈ S^g ∪ S^c ∪ S^ν}. Similar to conventional MRFs, our proposed CXM segmentation technique follows a Maximum a Posteriori (MAP) approach [30, 71], searching for an optimal global labeling that maximizes the posterior P(ω|F) ∝ P(F|ω) · P(ω) over all labelings ω. Assuming conditionally independent observations, P(F|ω) can be calculated as a product of the P(f(q) | ς(q)) singleton probability terms assigned to the nodes of the feature layers. In the S^g and S^c layers, the node-by-node singletons are derived from the same probability density functions which we defined in Sect. 4.3.1. Thus, ∀s ∈ S and ψ ∈ {ch, bg}:

P(f(s^g) | ς(s^g) = ψ) = P(f_g(s) | ψ),    (4.30)

P(f(s^c) | ς(s^c) = ψ) = P(f_c(s) | ψ).    (4.31)

Here, we note that the singletons of the S^ν layer will be defined later. On the other hand, in CXM the prior probability P(ω) is derived from a mixed Markov model, thus it follows Eq. (2.14). To calculate P(ω), we have to appropriately define the edges (or cliques) of G and the corresponding V_C clique potential functions. Considering the various desired constraints, we use two types of cliques in the model, representing intra- and inter-layer interactions (see Fig. 4.21). For ensuring smooth segmentations, within each layer we connect the node pairs corresponding to (4-)neighboring pixels of the image lattice S. We mark the set of the resulting intra-layer cliques by C_2, and prescribe that the potential function of a clique in C_2 penalizes neighboring nodes having different labels. Thus, for neighboring pixels r and s on S, the potential of a doubleton clique {r^i, s^i} ∈ C_2 is taken for each i ∈ {g, c, ν, ∗} as


V_C2(ς(s^i), ς(r^i)) = −δ^i if ς(s^i) = ς(r^i),  and  +δ^i if ς(s^i) ≠ ς(r^i),    (4.32)

with a constant factor δ^i > 0. Next, we describe the inter-layer interactions. Based on our previous experiments (see Eq. (4.28)), ς(s^∗) should be equal either to ς(s^g) or to ς(s^c), depending on the value of the ν(s) feature. Hence, we connect s^∗ and s^ν with an edge, and prescribe that the address node s^ν should point either to s^g or to s^c:

ς(s^ν) ∈ {s^g, s^c}, ∀s ∈ S.    (4.33)

The directions of the address pointers are influenced by the singletons of S^ν, where we use the distributions defined in Sect. 4.3.1:

P(f(s^ν) | ς(s^ν) = s^χ) = P(ν(s) | h_χ), χ ∈ {g, c}.    (4.34)

Finally, we get the potential function of the inter-layer clique C_3 = {s^∗, s^ν} as

V_C3(ς(s^∗), ς̃(s^ν)) = −ρ if ς(s^∗) = ς̃(s^ν),  and  +ρ otherwise,    (4.35)

where ρ > 0, and using (2.13): ς̃(s^ν) = ς(ς(s^ν)). Using the above-introduced energy terms, the optimal labeling ω̂ can be calculated as

ω̂ = argmin_{ω∈ϒ} [ Σ_{s∈S} −log P(f_g(s) | ς(s^g)) + Σ_{s∈S} −log P(f_c(s) | ς(s^c)) + Σ_{s∈S} −log P(ν(s) | ς(s^ν)) + Σ_{i; {s,r}∈C_2} V_C2(ς(s^i), ς(r^i)) + Σ_{s∈S} V_C3(ς(s^∗), ς̃(s^ν)) ],    (4.36)

where i ∈ {g, c, ν, ∗} and ϒ denotes the set of all possible global labelings. We minimize the energy term of Eq. (4.36) again with the deterministic Modified Metropolis relaxation process, in a similar manner to the L3Mrf optimization algorithm presented in Fig. 4.7. Note that due to its fully modular structure, the introduced model could be extended in a straightforward way with additional sensor information (e.g. color or infrared sensors) or task-specific features, depending on availability.
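As an illustration of how the terms of Eq. (4.36) fit together, the following sketch evaluates the global CXM energy for one candidate labeling on regular pixel arrays; the array encodings, the vectorized form, and the use of a single δ for all layers are our simplifying assumptions, not the book's implementation.

```python
import numpy as np

def cxm_energy(lab_g, lab_c, lab_star, ptr,
               logp_g, logp_c, logp_nu_g, logp_nu_c,
               delta, rho):
    """Global energy of Eq. (4.36) for one candidate labeling (illustrative).

    lab_g, lab_c, lab_star : HxW int {0,1} ch/bg labels of layers S^g, S^c, S^*.
    ptr                    : HxW int {0,1} address labels of S^nu (0 -> s^g, 1 -> s^c).
    logp_g, logp_c         : HxWx2 log-likelihood maps, last axis indexed by the label.
    logp_nu_g, logp_nu_c   : HxW maps of log P(nu(s)|h_g) and log P(nu(s)|h_c).
    delta, rho             : potential weights (one delta shared by all layers here,
                             while the model allows layer-specific values).
    """
    # singleton (data) terms of the three feature layers
    energy = -(np.take_along_axis(logp_g, lab_g[..., None], -1).sum()
               + np.take_along_axis(logp_c, lab_c[..., None], -1).sum()
               + np.where(ptr == 0, logp_nu_g, logp_nu_c).sum())

    # intra-layer doubleton potentials V_C2 over 4-neighborhoods, Eq. (4.32)
    for lab in (lab_g, lab_c, lab_star, ptr):
        for a, b in ((lab[1:, :], lab[:-1, :]), (lab[:, 1:], lab[:, :-1])):
            energy += np.where(a == b, -delta, delta).sum()

    # inter-layer potentials V_C3, Eq. (4.35): compare s^* with the pointed label
    pointed = np.where(ptr == 0, lab_g, lab_c)
    energy += np.where(lab_star == pointed, -rho, rho).sum()
    return energy
```

A Modified Metropolis-type optimizer then proposes node-wise label changes and accepts them based on the induced energy difference, in the spirit of the scheme in Fig. 4.7.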


Fig. 4.22 Qualitative comparison of the change detection results with the different test methods and the proposed CXM model, for datasets: Szada (S), Tiszadob (T), and Archive (A). Red regions mark the detected/Ground Truth changes

4.3.3 Experiments on Long-Term Change Detection

4.3.3.1 Test Databases and Ground Truth Generation

For the evaluation of CXM, we collected three sets of optical aerial image pairs provided by the Lechner Knowledge Center (LKC) in Budapest (earlier FÖMI) (see Fig. 4.22). We published the labeled test data as the SZTAKI AirChange Benchmark Set (http://mplab.sztaki.hu/remotesensing/airchange_benchmark.html). Dataset Szada contains photos taken by LKC in 2000 and in 2005, respectively. This test set consists of seven image pairs with manually prepared Ground Truth (GT) change masks, taken from a 9.5 km2 area with a resolution of 1.5 m/pixel. One of the image pairs has been used here for parameter settings of the CXM model and the remaining six ones for validation. The second test set, called Tiszadob, includes five photo pairs from 2000 resp. 2007 (6.8 km2) with similar size and quality parameters to Szada. Finally, using the test set Archive, we can compare an aerial photo taken by LKC in 1984 to a satellite image from 2007. Note that the Archive test scenario proved to be highly challenging, since the photo from 1984 has a much lower quality, and several major differences appear due to the 23 years of time difference between the two shots. During Ground Truth generation, the following changes have been considered: new built-up regions, building operations, planting of forests or individual trees (trees only at high resolution), fresh plow-lands, and groundworks before building over.

4.3.3.2 Evaluation

This section presents a thorough quantitative and qualitative comparison between the proposed approach and prior results from the literature. It should be noted that the different approaches use significantly different assumptions: they can be either supervised or unsupervised, and their goals may also differ slightly, such as change detection (direct methods) or joint image segmentation (PCC). Such methodological differences must also be considered in a comparative evaluation. Since our proposed CXM method is supervised, we compare it to other supervised techniques, relying on the same training data samples for our CXM model and for the selected reference methods. Therefore, for the sake of fair comparison, we implemented supervised modifications of three previously published unsupervised methods [42, 74, 195] for evaluation. We note here that in PCC models (e.g. [43, 44, 208]) the Ground Truth used for training and evaluation marks the land-cover classes in each frame, while changes in our proposed approach cannot be interpreted as such class transition occurrences. For the above reason, we investigate direct methods first in this section, and we compare the CXM method to four previous solutions, which are briefly introduced in the following. For easier notation, we mark the pixels of the difference image (DI) by d(s) = g_1(s) − g_2(s).

• PCA: a supervised modification of the technique proposed by [195]. The f_g(s) = [g_1(s), g_2(s)] joint intensity vectors extracted from the two images are projected into the space of the principal components estimated over some training regions of the background class. The descriptor of pixel s is obtained as the magnitude of the second largest principal component normalized by the local contrast. Finally, the change/background classification is derived by a Potts-MRF [147] model.

• Hopfield: a Hopfield-type neural network is used [74], which is initialized by a pixel-by-pixel segmentation algorithm based on the difference image (DI). The combined change mask is generated by an iterative process which minimizes the global network energy, while interactions between the nodes ensure smooth change and background areas.

• Parzen: a non-parametric supervised method is implemented to segment the DI following the approach of [42]. The P(d(s)|ς(s) = bg) and P(d(s)|ς(s) = ch) conditional probability density functions are approximated by Parzen kernel density estimators [65], and a smoothed change map is provided by an MRF model [42].


Fig. 4.23 Quantitative comparison of the proposed CXM technique to four previous methods on the three sets of the SZTAKI AirChange Benchmark: Szada, Tiszadob, and Archive. False alarm, missed alarm, and overall error rates are given in percent of the checked pixels

• MLP: this approach directly estimates the P(ς|d) probability densities instead of approximating P(d|ς). The DI is segmented again by an MRF, but here the P(ch|d(s)) and P(bg|d(s)) distributions are estimated by a multi-layer perceptron network trained with GT data. Note that highly similar approaches have already been adopted by various PCC methods [43, 44] to segment the different image layers, which justifies our interest in this 'direct' MLP technique.

During the quantitative survey, we relied on the same evaluation metrics as [42, 74]: we compared the segmentation results provided by the different approaches to the manually prepared Ground Truth (GT) masks, and counted the number of false alarms (unchanged pixels which were detected as changes) and missed alarms (erroneously ignored changed pixels). Finally, the overall error was taken as the sum of the previous two quantities. Comparative results on the three subsets of the SZTAKI AirChange Benchmark Set are given in Fig. 4.23, in percent of the number of processed image pixels. As this figure shows, the overall error of the proposed CXM model was notably smaller than the error of the reference methods (the gap was about 2–5%). Note that the generally weaker results in the Archive tests were also related to the lower image quality. For visual demonstration, we show the comparative change detection results of some relevant image parts (of 45 m2 area) in Fig. 4.22. We can observe that the CXM model is able to identify changes smoothly and in a more accurate way than the reference techniques.
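For reference, the pixel-level rates of Fig. 4.23 follow directly from the binary masks; the short function below sketches this bookkeeping (function name and array conventions are ours).

```python
import numpy as np

def change_detection_errors(pred, gt):
    """False alarm, missed alarm, and overall error rates, in percent of all pixels.

    pred, gt : HxW boolean change masks (True = change).
    """
    n = pred.size
    false_alarm = np.logical_and(pred, ~gt).sum() / n * 100.0
    missed_alarm = np.logical_and(~pred, gt).sum() / n * 100.0
    return false_alarm, missed_alarm, false_alarm + missed_alarm
```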

4.3.3.3 Comparison to Other Multi-layer MRF Models

Since the publication of the CXM model [27], various novel multi-layer MRF models have appeared. For this reason, in 2015 an up-to-date survey article was prepared on the existing techniques [23], comparing CXM in detail to two newer approaches [164, 171]. The first reference is a Multicue MRF model [164], which integrates modified Histogram of Oriented Gradients and gray-level difference features into the Multi-MRF structure framework originally proposed by [102], where two layers correspond to the two feature maps and the third one is the final segmentation layer. The class models and the inter-layer interaction terms are both affected by observation-dependent and prior constraints, differently from CXM where the feature maps only affect the singleton terms, while the interaction terms implement purely prior label fusion (soft) constraints. The second reference, called Fusion-MRF [171] (mentioned already in Sect. 4.1), simultaneously realizes adaptive segmentation and change detection for optical remote sensing images, where each layer represents a given input image; thus 'multi-layer' refers here to multitemporal images.

In the study [23], the quantitative comparison was again based on the SZTAKI AirChange Benchmark Set; however, owing to the different approaches to change modeling, further considerations had to be taken. Multicue MRF is a direct change detection technique, similar to our CXM [27] and to all reference methods of Fig. 4.23, which obtain changes by segmenting similarity maps between the input images. On the other hand, the Fusion-MRF follows a Post-Classification Comparison (PCC) approach, which first segments the input images into various land-cover classes, and changes are obtained indirectly as regions with different class labels in the different time layers. Therefore, when testing the Fusion-MRF, we could not rely on the available Ground Truth (GT) change masks (referred to as AirChange GT). Instead, we generated new GT (called Region PCC GT), where various land-cover classes have been considered for the different image pairs of the test set, such as urban and non-urban; or meadow, planted meadow, and forest. Note that as concluded in [23], the two GT generation approaches correspond to two different use-cases, and both may be relevant. Comparative experiments between CXM and the Fusion-MRF with the two different GT types gave obviously different results, showing the superiority of CXM by 4–58% in F-score using the original AirChange GT masks, and the advantages of the Fusion-MRF (by 14–31%) with the Region PCC GT. Calculating the overall error for five selected image pairs (Fig. 4.23) showed a minor superiority of the newer Multicue MRF (with a margin of 0.5–2%), while calculating the traditional Precision, Recall, and F-score values indicated notable advantages of CXM due to a significantly higher Recall rate. In summary, the experiments of our survey demonstrated that the proposed CXM technique also proved to be competitive versus more recent multi-layer change detection models.

4.3.3.4 Discussion of the Results

Qualitative and quantitative results presented in Sect. 4.3.3.2 demonstrate that regarding the airborne change detection problem, the proposed CXM technique can significantly outperform various earlier reference methods. The observed improvements mainly originate from the efficient fusion strategy of multiple features and the novel probabilistic feature integration approach. Regarding the role of feature selection, the various limitations of the Hopfield, Parzen, and MLP techniques confirm that considering only pixel-level intensity differences is not efficient enough for separating changes from the unchanged background regions in optical images, since there is too large an overlap between the ch and bg classes in the d(s) descriptor's domain. The weak point of the PCA approach is that it uses


a simplified linear physical model for the irrelevant global illumination changes, which is often invalid in real optical photos. In particular, the linear model is unable to efficiently describe seasonal vegetation changes, while noise and sensor saturation cause additional false alarms. For example, in Fig. 4.22 the PCA image of the 3rd row shows a false building change in the left image region. In addition, both the intensity differences and the PCA features are very sensitive to parallax, mirroring effects, and shadow artifacts caused by dense 3D structures in the scenes. On the contrary, the CXM method considers different features in homogeneous and highly textured territories, leading to more accurate change regions (Fig. 4.22).

Another important factor has been the novel CXM segmentation model. To eliminate the noise caused by pixel-wise classification strategies (see Figs. 4.15g and 4.24d), Markovian neighborhood interactions have been utilized to produce a smoothed change map [71]. However, the implementation of the introduced label-based feature fusion approach required a novel multi-layer model structure, differently from the standard single-layer techniques used by the reference methods. As a key feature, in our proposed multi-layer model the segmentation steps of the different layers are synchronized via inter-layer interactions, which property also yields a measurable improvement, as shown in Fig. 4.24. Here, the CXM results are compared to an ensemble of independent MRFs [image (e)]. As shown, while the ensemble is able to significantly enhance the results of pixel-wise classification [compare Fig. 4.24d, e], several artifacts remain in the change mask [24, 94], which are appropriately corrected by the proposed CXM method [image (f)].

In summary, real optical aerial images have been used for validation, which differed in content, resolution, and various quality factors. The most efficient results have been obtained by using images of 1.5 m/pixel resolution, taken from sparsely populated areas (Fig. 4.22). The photos contained both natural regions with generally homogeneous (low-contrast) image segments and densely built-up territories appearing as textured areas, which justified the need for using multiple descriptors, clearly outperforming single-feature-based techniques. Note that at a relatively low resolution (e.g. several meters [42]), high-frequency texture components are usually not observable in the images, which reduces the role of the correlation descriptor. On the other hand, at very large spatial resolution (≤0.5 m), misregistration and parallax

Fig. 4.24 Impacts of the multi-layer CXM structure for the quality of the change mask. We compare the results of b the pixel-by-pixel classification without spatial smoothing, c the ensemble of three independent, single-layer MRFs, and d the proposed multi-layer model


artifacts may spoil the process, whose errors can be reduced by choosing large regions for the f_c(s) and ν(s) calculation (see parameter z in Sect. 4.4) or by applying a moving correlation window [30]. False alarms may also appear due to fine rarefaction of vegetation, which changes the texture and color of the area at the same time in an unusual way (Fig. 4.22, 1st row). When the image quality deteriorates, the CXM method shows graceful degradation, as seen in Fig. 4.23 regarding the Archive test set. Nevertheless, dense urban areas and large occlusions should be handled at a higher processing level, including geometric or appearance-based object modeling, as we will investigate in Chaps. 5 and 6.

4.4 Parameter Settings in Multi-layer Segmentation Models

The parameters of the introduced multi-layer segmentation models can be divided into three groups: (i) preliminary parameters of feature calculation, (ii) parameters of the probability density functions in the data terms, and (iii) parameters of the prior intra- and inter-layer potential functions.

First, the size of the block matching windows (z) used for correlation calculation and, in L3Mrf, the size of the search window (l) are related to a priori knowledge about the image resolution and textureness, the object size, and the magnitude of the parallax distortion. The correlation window should not be significantly larger than the expected objects, to ensure low correlation between an image part which contains an object and one from the same 'empty' area.

Second, the feature distribution (pdf) parameters can be obtained by conventional Maximum Likelihood estimation algorithms from background and foreground training areas. If manually labeled training data is not available, the foreground training regions must be extracted through outlier detection [92] in the feature spaces.

Third, while the data-based pdf parameters strongly depend on the input image data, the interaction potential factors are largely independent of it. Experimental evidence suggests that the model is not sensitive to the particular setting of ρ or δ^i within a wide range, which can be estimated a priori. The parameters of the intra-layer potential functions, δ^i, influence the size of the connected blobs in the segmented images. Although automatic estimation methods exist for similar smoothing terms [105], δ^i is rather a hyper-parameter, which can be fixed by trial and error. Higher δ^i values result in more compact foreground regions; however, fine details of the silhouettes may be distorted that way. We have used ρ = δ^∗: this choice gives the same importance to the intra-layer smoothness and the inter-layer label fusion constraints.
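As an example for parameter group (ii), the Gaussian data-term parameters can be obtained from labeled training regions with a standard ML estimate; the snippet below is a minimal sketch under the assumption of a (multivariate) Gaussian feature model, with names of our choosing.

```python
import numpy as np

def ml_gaussian_params(feature_map, class_mask):
    """ML estimate of a Gaussian feature model from labeled training pixels.

    feature_map : HxWxD array of feature vectors.
    class_mask  : HxW boolean mask selecting the training pixels of one class.
    """
    samples = feature_map[class_mask]          # N x D matrix of training features
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mu, cov
```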


4.5 Conclusions

In this chapter, we have introduced novel Markovian label fusion models for two different change detection problems from the remote sensing domain.

First, we proposed a three-layer MRF model (L3Mrf) for extracting the regions of object motion from image pairs taken by an airborne moving platform. The output of the proposed method is a change map, which can be used, e.g., for estimating the dominant motion tracks (i.e. roads) in traffic monitoring tasks, or for outlier region detection in mosaicking and in stereo reconstruction. Moreover, it can also provide an efficient preliminary mask for higher level object detectors and trackers in aerial surveillance applications. We have shown that even if the preliminary image registration was relatively coarse, the false motion alarms could be largely eliminated by integrating frame differencing with local cross correlation, which provided complementary features for detecting static scene regions. The efficiency of the method has been validated on three different sets of real-world aerial images, and its behavior versus five reference methods and four different information fusion models has been quantitatively and qualitatively evaluated. The experiments have shown that the proposed model outperformed the reference methods when dealing with image pairs with large camera and object motions and significant but bounded parallax.

In the second part of the chapter, we addressed the detection of statistically unusual changes in optical aerial image pairs taken with significant time differences. A novel Conditional Mixed Markov (CXM) model has been proposed, which integrates the robustness of MRF-based segmentation techniques [71], the modularity of multi-layer approaches [95], and the semantic flexibility of mixed Markov models [62]. The introduced method utilizes information from three different observations: global intensity statistics, local correlation, and contrast. The performance of the method has been validated using real-world aerial images. The superiority of CXM versus four earlier reference methods, and its competitiveness against two newer methods, has been shown quantitatively and qualitatively.

Both models of this chapter may be used as efficient and scalable change detection filters in several remote sensing applications. The methods are purely based on low-level features, working without object extraction or identification of land-cover classes. Therefore, they can be adopted for a large variety of scenes and purposes, even in situations where the concept of 'interesting changes' is not well defined. The methods can support the manual evaluation of large datasets by focusing the operator's attention on targets or changed areas, and also in automated systems by restricting the field of interest and presenting shape- or region-based descriptors for higher level image interpretation modules.

Chapter 5

Multitemporal Data Analysis with Marked Point Processes

Abstract In this chapter, we introduce new approaches for object-level dynamic scene modeling based on multitemporal measurements, by extending the conventional Marked Point Process framework with modules focusing on the time dimension. First, a new probabilistic method is proposed for simultaneously extracting building footprints and performing change detection in pairs of remotely sensed images captured with several years of time differences. The output of the method is a population of 2D building footprint segments, where status information is provided for each segment highlighting changes between the two time layers. In the second part, we propose a Multiframe Marked Point Process model of line segments and point groups for automatic target structure extraction and tracking in Inverse Synthetic Aperture Radar (ISAR) image sequences. For the purpose of dealing with scatterer scintillations, and ensuring robustness despite the high level of speckle noise in the ISAR frames, we derive the output target sequence of the detector by an iterative optimization process, which takes into account in parallel the captured ISAR image data and different prior geometric interaction constraints between the fitted target samples of the consecutive frames. For both models, detailed quantitative evaluation is performed on real remotely sensed measurements.

5.1 Introducing the Time Dimension in MPP Models

Conventional Marked Point Process (MPP) techniques are applicable for the analysis of static scenarios; however, several applications require object-level investigations on multitemporal measurements. Our key contribution in this chapter is to propose methodologies for incorporating the time dimension into the MPP framework. We will address two different challenges: object-level change detection and moving target analysis. Although both issues are quite general, for easier discussion and validation we introduce the new model structures for selected applications: building development monitoring and moving object analysis in radar (ISAR) image sequences.



5.2 Object-Level Change Detection

In this section, we introduce a novel Multitemporal Marked Point Process (mMPP) model, which is able to detect objects and mark the object-level changes in remotely sensed image pairs taken at different time instances. We present methodological contributions in three key issues:

• We implement a novel object-change modeling approach, which simultaneously exploits low-level change information between the time layers and object-level descriptions to recognize and separate changed and unaltered objects.

• Answering the challenges of data heterogeneity in aerial and satellite image repositories, we construct a flexible hierarchical framework which can create various object appearance models from different elementary feature-based modules.

• To simultaneously satisfy the convergence, optimality, and computational complexity constraints raised by the increased data quantity in remote sensing applications, we adopt the quick Multiple Birth and Death optimization technique for change detection purposes, and propose a novel non-uniform stochastic object birth process, which generates relevant objects with higher probability based on low-level image features.

5.2.1 Building Development Monitoring—Problem Definition

Change detection in urban areas based on aerial and satellite images is a key problem in many remote sensing applications. Although several techniques have been proposed in the literature over the last 40 years, various challenges are yet to be solved, such as dealing with the quick evolution of data quality and quantity, the large variety of building appearances, heterogeneous data in the available image repositories, and the various demands of new application areas [107]. As discussed in Chap. 4 (Sect. 4.3), pixel-level change detection approaches, such as the conditional mixed Markov model (CXM) [27], can be efficiently used for region-based comparison of two remotely sensed images. However, as demonstrated in Fig. 5.1, in the case of high-resolution images with large connected change regions, a low-level change mask cannot efficiently highlight the interesting image content. Stepping up to the object level, we develop here a Marked Point Process approach, which models the building population as an optimal configuration of simple geometric objects [16], obtained through an iterative process of stochastic birth and death steps (see the definitions in Sect. 2.3 of Chap. 2).

Formally, the input of the proposed method consists of two co-registered aerial or satellite images which were taken of the same area with several months or years of time difference. We consider each building to be constructed from one or more rectangular building segments, and as output we provide the size, position,


Fig. 5.1 Low-level change detection: a and b input images, c change mask ch

and orientation parameters of the detected building segments, together with information on which objects are new, demolished, modified/re-built, or unchanged [9, 17]. Let us mark the common pixel lattice of the input images by S, and a single pixel by s ∈ S. A building segment candidate u in the input image pair is jointly characterized by geometric attributes and a time stamp. The center of a given building segment, c = [c_x, c_y], is assigned to a pixel position in [0, S_W] × [0, S_H] ⊂ R^2. Let us denote by R_u ⊂ S the set of pixels corresponding to the rectangle of u. Apart from the center position, R_u is described by the e_L and e_l side lengths and the θ ∈ [−90°, +90°] orientation parameter, as shown in Fig. 2.5c of the Fundamentals chapter. To handle multiple time layers, we assign to each u an index flag ξ(u) ∈ {1, 2, ∗}, where '∗' marks an unchanged object (one that is present in both images), while flags '1' and '2' denote building segments which are only visible in the first or second image, respectively. An object record is thus u = (c_x, c_y, e_L, e_l, θ, ξ), and the set of all possible object records is denoted by H. The output of the method is a configuration of n building segments, ω ∈ H^n, where n, the number of objects, is also unknown.
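For illustration, an object record u could be stored in a simple structure such as the sketch below; the class and field names are our choices, not the book's code.

```python
from dataclasses import dataclass

@dataclass
class BuildingSegment:
    """Object record u = (cx, cy, eL, el, theta, xi) of Sect. 5.2.1 (illustrative)."""
    cx: float     # center column in [0, S_W]
    cy: float     # center row in [0, S_H]
    e_L: float    # longer rectangle side length
    e_l: float    # shorter rectangle side length
    theta: float  # orientation in degrees, within [-90, +90]
    xi: str       # time-layer flag: '1', '2', or '*' (unchanged)
```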


5.2.2 Feature Selection

In the proposed image analysis pipeline, the first task is to define appropriate image features which facilitate building extraction and change detection. Since in our approach the optimal building segment configuration is obtained by iterating stochastic birth and death steps, we use the image data in two different manners: (i) we should frequently generate relevant objects during the birth step based on low-level image features, and (ii) we should ensure that mostly adequate objects survive the death step, relying on object-level descriptors.

Low-level features characterize local image properties extracted in rectangular segments of the pixel lattice S, such as the expected color distribution, texture, and the local correlation calculated between corresponding image parts in the two time layers. These features are mainly utilized by the birth step, to estimate the possible building positions and various appearance parameters, and to perform a preliminary exploration of the areas likely to have changed. As a result of this approach, high-quality object candidates are generated with higher frequency in the estimated built-in regions. On the other hand, we use object-level features to evaluate a given building hypothesis, so that we assign a fitness value to each proposed oriented rectangle. The object descriptors contribute to deciding whether an object candidate should be preserved or killed in the death step of the algorithm, thus their accuracy is critical.

5.2.2.1 Low-Level Features for Building Detection

First, we define different low-level features, which are extracted from the individual images. To initially estimate the built-in regions, we calculate for each pixel s two birth probability values, P_b^(1)(s) and P_b^(2)(s), which give the likelihood that s is an object center in the 1st and 2nd image, respectively. Our terminology expresses that in the birth step the frequency of proposing an object at s will be proportional to the local birth probabilities. The first feature exploits the fact that regions of buildings should contain edges in perpendicular directions, which can be robustly characterized by Local Gradient Orientation Density Histograms (GODH) [112] calculated around each pixel s in an l × l rectangular region W_l(s). If this region covers a building, the orientation histogram has two peaks located at 90° distance, which can be measured by correlating the histogram with an appropriately matched bi-modal density function. Thereafter, based on the maximal correlation values, we can assign to each pixel s a likelihood P_b^gr(s) that s is the center of a building. For the sample image in Fig. 5.2a, a thresholded P_b^gr map is shown in Fig. 5.2b, marking an estimate of the building regions. Furthermore, the location of the peak value of the GODH around s estimates the dominant gradient directions in the local neighborhood. Thus, for a building with center s, we expect its θ parameter around a mean orientation value μ_θ(s), which is equal to the peak location of the local GODH.
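The following sketch illustrates how a GODH-based birth map and a dominant orientation map could be computed; the window handling, the coarse sampling grid, and the simple bi-modal template are our simplifying assumptions, not the book's implementation.

```python
import numpy as np

def godh_birth_features(gray, l=32, nbins=36):
    """Sketch of the GODH-based birth features of Sect. 5.2.2.1 (illustrative).

    Builds a magnitude-weighted gradient orientation histogram over an l x l
    window and circularly correlates it with a bi-modal template whose peaks
    lie 90 degrees apart.  Returns an unnormalized birth likelihood map and
    the dominant orientation map mu_theta (degrees in [-90, 90)).
    """
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.degrees(np.arctan2(gy, gx)), 180.0)
    bins = (ori / (180.0 / nbins)).astype(int) % nbins

    H, W = gray.shape
    p_gr = np.zeros((H, W))
    mu_theta = np.zeros((H, W))
    half = l // 2
    template = np.zeros(nbins)
    template[0] = template[nbins // 2] = 1.0          # two peaks, 90 deg apart

    for y in range(half, H - half, 4):                 # coarse grid for speed
        for x in range(half, W - half, 4):
            b = bins[y - half:y + half, x - half:x + half].ravel()
            w = mag[y - half:y + half, x - half:x + half].ravel()
            hist = np.bincount(b, weights=w, minlength=nbins)
            corr = [np.dot(np.roll(template, k), hist) for k in range(nbins)]
            k_best = int(np.argmax(corr))
            p_gr[y, x] = corr[k_best]
            mu_theta[y, x] = k_best * (180.0 / nbins) - 90.0
    return p_gr, mu_theta
```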


Fig. 5.2 Building candidate regions obtained by the low-level b gradient, c color, and d shadow descriptors

We continue our discussion with the analysis of roof color filtering. Various roof types appear with typical color distributions in optical images [165], therefore using a roof color hypothesis we can often extract an approximate roof mask co(s) ∈ {0, 1}, where co(s) = 1 denotes that the color of pixel s matches our roof model. Since a given roof is usually represented by a connected image region, we can assume that around a roof center we find several further roof-colored pixels; thus for each s we determine a co-filling factor Λ_s in its neighborhood:

Λ_s = Σ_{r∈W_l(s)} co(r).

The color-based birth map value is calculated as

P_b^co(s) = Λ_s / Σ_{r∈S} Λ_r.

Let us keep in mind that due to the possibly overlapping color domains of roofs and background regions [165], the co(s) mask may only highlight a subset of the building segments; for example, in Fig. 5.2c we can observe that only red roofs are detected. Another possible feature for building location estimation can be extracted from the analysis of cast shadows [98, 165]. In several aerial images, we can observe that the values of shadow pixels belong to the dark-blue color domain [176], thus a binary shadow mask sh(s) can be derived by color thresholding. The geometric alignment of a building and its shadow is determined by the global Sun direction, which is usually known a priori, or can be calculated automatically [165]. Thereafter, we can estimate the building locations as image regions which lie near the shadow blobs on the side opposing the Sun direction. Using this observation, the shadow-based birth map contains a constant birth rate value P_b^sh(s) = p_0^sh within the obtained building candidate regions and a smaller constant outside them (see Fig. 5.2d and later Fig. 5.4b).

Up to this point, we have used various descriptors to estimate the location and appearance of the buildings in the individual images. However, the low-level change mask ch demonstrated in Fig. 5.1c can also be directly involved in the model, since it efficiently separates the image regions which may contain the possibly changed


and unchanged buildings, respectively. The probability of change around pixel s is derived as

P_ch(s) = Σ_{r∈W_l(s)} ch(r) / #W_l(s),    (5.1)

where #W_l(s) denotes the area of W_l(s). Considering the change feature, we can exploit an additional information source, which is independent of the object recognizer module. During the birth step, we will propose an unchanged object at s with a probability proportional to

(1 − P_ch(s)) · max_{i∈{1,2}} P_b^(i)(s),

while at the same location, the likelihood of generating a changed building segment in image i ∈ {1, 2} will be proportional to P_ch(s) · P_b^(i)(s).
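In code, the combination of the change map with the per-layer birth maps could look like the following sketch (function and variable names are ours):

```python
import numpy as np

def object_birth_rates(p_b1, p_b2, p_ch):
    """Per-pixel (unnormalized) proposal rates of the non-uniform birth step.

    p_b1, p_b2 : HxW birth probability maps of the two time layers.
    p_ch       : HxW low-level change probability map of Eq. (5.1).
    """
    unchanged = (1.0 - p_ch) * np.maximum(p_b1, p_b2)   # propose an unchanged object
    changed_1 = p_ch * p_b1                             # changed object in image 1
    changed_2 = p_ch * p_b2                             # changed object in image 2
    return unchanged, changed_1, changed_2
```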

5.2.2.2 Object-Level Features

While low-level features are primarily used to support efficient object generation, the second key issue in the proposed birth–death dynamics-based approach is the evaluation of the generated building segment candidates. The object-level features are built into an energy function ϕ^(i)(u) : H → [−1, 1], which assigns a negative-likelihood-like building score to object u in the ith image (for simpler notation, we ignore the i superscript in the remaining part of this section). We call an object attractive if ϕ(u) < 0, and the ϕ(u) function is constructed so that attractive objects correspond to real buildings. The object modeling process is composed of three steps: feature extraction, energy calculation, and feature integration. In the first step, we define different f(u) : H → R fitness features which describe a building hypothesis for u in the image, so that high-quality building candidates exhibit large f(u) values. Second, for every feature f we define one or more energy subterms, which should satisfy ϕ_f(u) < 0 for true objects and ϕ_f(u) > 0 for false candidates. Here, we map the feature domain to [−1, 1] with the monotonically decreasing function shown in Fig. 5.3:

Fig. 5.3 Plot of the nonlinear feature domain mapping function M(x, d0 , D)


ϕ_f(u) = M(f(u), d_0, D) =
    1 − f(u)/d_0,                  if f(u) < d_0,
    exp(−(f(u) − d_0)/D) − 1,      if f(u) ≥ d_0.    (5.2)
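Equation (5.2) translates directly into a short function; the sketch below simply mirrors its two branches.

```python
import numpy as np

def M(x, d0, D):
    """Nonlinear feature-to-energy mapping of Eq. (5.2), plotted in Fig. 5.3."""
    x = np.asarray(x, dtype=float)
    return np.where(x < d0, 1.0 - x / d0, np.exp(-(x - d0) / D) - 1.0)
```

Note that M(d0, d0, D) evaluates to 0, values of f(u) well above d0 approach −1, and values below d0 grow above 0, so a candidate is attractive for a feature exactly when f(u) > d0.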

As can be observed, the M function has two parameters: d_0 is an object acceptance threshold for a given feature f, so that u is attractive according to the ϕ_f(u) term if and only if f(u) > d_0, while D performs a simple data normalization. The third step is dedicated to feature integration, which is necessary since a decision based on a single feature f may often cause poor classification, as the building and background classes usually have overlapping f-domains. For this reason, the joint energy term ϕ(u) should appropriately integrate the beneficial effects of the different ϕ_f(u) feature modules.

We begin the introduction of the different feature components with gradient analysis. Near the boundary of a high-quality building rectangle candidate R_u, we can observe that the magnitudes of the local gradient vectors (∇g_s) are large and the gradient orientations are similar to the normal vector (n_s) of the nearest rectangle side (Fig. 5.5c, d). Based on this observation, the f_gr(u) feature is defined by the following equation:

f_gr(u) = (1 / #∂̃R_u) Σ_{s∈∂̃R_u} ∇g_s · n_s,    (5.3)

where operator '·' marks the scalar product, ∂̃R_u is the dilated edge map of the R_u rectangle, and #∂̃R_u is the number of pixels in ∂̃R_u. The corresponding data energy term is calculated as ϕ_gr(u) = M(f_gr(u), d^gr, D^gr).

The definition of the roof color feature is demonstrated in Fig. 5.4a. We assume that inside the building footprint R_u a large number of image pixels are included in the roofs' color domain, while the object's T_u neighborhood (see Fig. 5.4a) should mainly contain non-roof pixels. Therefore, we take the f_in^co(u) internal and f_ex^co(u) external filling factors, which are calculated as

f_in^co(u) = (1 / #R_u) Σ_{s∈R_u} co(s),    (5.4)

Fig. 5.4 Utility of the color roof and shadow features


Fig. 5.5 Illustration of the feature maps in the Budapest 2008 image. Gradient and shadow features are relevant in the left-bottom regions, while the color descriptor is efficient in the top-right image parts. In image (d), the gradient feature is shown under the GT object borders

f_ex^co(u) = (1 / #T_u) Σ_{s∈T_u} (1 − co(s)),    (5.5)

here #Y denotes the area of a subimage Y in pixels and co(s) is the color mask value of pixel s. Following our definition, object u is attractive according to the color term if it is attractive both regarding the internal and external subterms, thus

ϕ_co(u) = max{ M(f_in^co(u), d_in^co, D_in^co), M(f_ex^co(u), d_ex^co, D_ex^co) }.    (5.6)

We continue with the description of the shadow term. This step is based on the binary shadow mask sh(s), extracted in Sect. 5.2.2.1. Using the shadow direction vector v_sh (opposite of the Sun direction vector), we identify a shadow candidate area T_u^sh next to the R_u object region, as demonstrated in Fig. 5.4b. Thereafter, similar to


the color feature, we expect low shadow presence in the R_u internal region and a high shadow covering ratio in the T_u^sh external region, which constraints are represented by the following features:

f_in^sh(u) = (1 / #R_u) Σ_{s∈R_u} (1 − sh(s)),

and

f_ex^sh(u) = (1 / #T_u^sh) Σ_{s∈T_u^sh} sh(s).

The shadow energy term is obtained as

ϕ_sh(u) = max{ M(f_in^sh(u), d_in^sh, D_in^sh), M(f_ex^sh(u), d_ex^sh, D_ex^sh) }.    (5.7)

Note that the above approach does not rely on accurate building height information, as there is no penalty if the shadow blobs of long buildings exceed the T_u^sh regions.

The next step is feature integration. Since, as shown in Fig. 5.5, individual features cannot support the detection of all objects in large complex scenes, the proposed framework enables flexible feature integration. First, we construct building prototypes based on the above-introduced feature primitives. Each prototype may prescribe the fulfillment of one or more feature-based constraints, so that the corresponding ϕ_f energy subterms are joined with the 'max' operator (logical AND) in the prototype's complete energy function. Besides, various types of buildings may be present at the same time in a given image pair. To detect multiple prototypes, the individual prototype-energy terms are connected with the 'min' operator (logical OR). Finally, the complete object energy term is calculated using a logical function which also considers prior (task-specific) knowledge about the scene and the imaging circumstances, therefore it should be chosen on a case-by-case basis. As an example, in the Budapest photo pair (see Fig. 5.8, top) two prototypes are used: the first one is purely based on the roof color feature, while the second one simultaneously prescribes the edge and shadow constraints. Then, the joint energy term is calculated as

ϕ(u) = min{ max{ϕ_gr(u), ϕ_sh(u)}, ϕ_co(u) }.

On the other hand, for processing the satellite images from Beijing (see Fig. 5.8, bottom) we use first a gradient (ϕ_gr) plus shadow (ϕ_sh) based prototype, and second a homogeneity-based prototype (see details in [17]).
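The prototype logic for the Budapest pair, for example, can be written down in a few lines; this sketch only restates the min/max combination rule, with names of our choosing.

```python
def joint_object_energy(phi_gr, phi_co, phi_sh):
    """Prototype-based feature integration: phi(u) = min(max(phi_gr, phi_sh), phi_co).

    Each prototype joins its feature energies with max (logical AND), and the
    prototypes are merged with min (logical OR), so an object is attractive
    if it satisfies at least one complete prototype.
    """
    edge_and_shadow_prototype = max(phi_gr, phi_sh)   # must satisfy both constraints
    roof_color_prototype = phi_co                     # purely color-based prototype
    return min(edge_and_shadow_prototype, roof_color_prototype)
```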


5.2.3 Multitemporal MPP Configuration Model and Optimization

In this section, we represent the building change detection task as an energy minimization problem. Following the definitions from Sect. 2.3.1 of Chap. 2, u denotes a given building segment, and the goal of the proposed approach is to extract an object configuration ω = {u_1, . . . , u_n} ∈ Ω, where n, the number of building segments, is initially unknown. The object neighborhood is defined here in a straightforward way: we say that u ∼ v if their rectangles R_u and R_v intersect. By denoting with F the union of all image features derived from the input data, the goal is to minimize a classical MPP energy function of (2.21):

Φ_F(ω) = Σ_{u∈ω} A_F(u) + γ · Σ_{u,v∈ω, u∼v} I(u, v).    (5.8)

In the above equation, A_F(u) ∈ [−1, 1] is called the unary potential, which is data-dependent, while I(u, v) ∈ [0, 1] is the prior interaction potential, and γ > 0 is a weighting factor between the two energy terms. The Maximum Likelihood (ML) configuration estimate can be taken as

ω_ML = argmin_{ω∈Ω} Φ_F(ω).

The unary potentials are assigned to individual building segment candidates u = (c_x, c_y, e_L, e_l, θ, ξ), depending on local image features in both images, but independently of the other objects of the configuration. This data term joins the building energy values ϕ^(1)(u) and ϕ^(2)(u) extracted from the two input images (see Sect. 5.2.2.2) and the low-level similarity map between the two time layers, which is represented by the ch(.) change mask (see Sect. 5.2.2.1). Next, we describe the role of the earlier defined image index flags, which take values from the set {1, 2, ∗} depending on whether u is observable in only one [ξ(u) ∈ {1, 2}] or in both [ξ(u) = ∗] input images. Based on this flag, we can choose a straightforward approach for the classification of building segment u:

• unchanged if and only if (iff) ξ(u) = ∗,

• new iff ξ(u) = 2 and there is no v ∈ ω such that ξ(v) = 1 and u and v overlap,

• demolished iff ξ(u) = 1 and there is no v ∈ ω such that ξ(v) = 2 and u and v overlap.

Re-built buildings are handled as two different objects u_1 and u_2, so that ξ(u_1) = 1 and ξ(u_2) = 2.


The potential terms can be based on three soft constraints in the different cases. First, considering an unchanged building u, we prescribe small object energy values in both images, but we penalize large low-level dissimilarity in the footprint region (i.e. a high number of pixels with ch(s) = 1 within the rectangle R_u). Second, for a demolished or re-built building of the first image, we expect a low ϕ^(1)(u), we regard the ϕ^(2)(u) value as indifferent, but we penalize high similarity of the image regions under the footprint. Third, for a new or modified building in the second image, we need a low ϕ^(2)(u) and an indifferent ϕ^(1)(u), while high local similarity is penalized again. Following the above observations and modeling decisions, and marking by 1{E} ∈ {0, 1} an indicator function of a given event E, the A_F(u) potential value is determined as

A_F(u) = 1{ξ(u) ∈ {1, ∗}} · ϕ^(1)(u) + 1{ξ(u) ∈ {2, ∗}} · ϕ^(2)(u) + 1{ξ(u) = ∗} · (1/#R_u) Σ_{s∈R_u} ch(s) + 1{ξ(u) ∈ {1, 2}} · (1/#R_u) Σ_{s∈R_u} (1 − ch(s)).    (5.9)

Unlike the data-dependent unary terms, the interaction potentials implement prior geometrical constraints, by penalizing large intersections between different object rectangles which share a time layer (see earlier Fig. 2.6):

I(u, v) = 1{u and v share a time layer} · #(R_u ∩ R_v) / #(R_u ∪ R_v).    (5.10)

Here, u and v share a time layer iff ξ(u) = ξ(v), or ξ(u) = ∗, or ξ(v) = ∗. Since ∀u, v: I(u, v) ≥ 0, the optimal population contains only objects with negative data term values (i.e. attractive objects), for the following reason. Let us consider an object u with a positive data term, A_F(u) > 0. It is easy to see that removing u from the configuration results in a lower Φ_F(ω) global energy according to Eq. (5.8), therefore it will not be kept in the final configuration. Note also that considering Eq. (5.8), the interaction term also plays a non-maximum suppression role, by penalizing multiple attractive objects in the same or strongly overlapping positions.

By defining the A_F(u) and I(u, v) potential functions, the Φ_F(ω) configuration energy is completely determined, and the optimal ω_ML building configuration can be obtained by minimizing Eq. (5.8). For this purpose, we have developed a new relaxation algorithm called the bi-layer Multiple Birth and Death (bMBD) algorithm, which is presented in Fig. 5.6. The bMBD method extends the conventional MBD technique by dealing with two different time layers, thus it handles change and geometric object information in parallel. Consecutive birth and death processes are iterated until convergence is obtained in the global configuration. In the birth step, we apply a strategy called Feature-Based Birth Process (FBB), where multiple object candidates are generated randomly following the birth maps P_b^(i)(s) in the two time layers i ∈ {1, 2}, by also considering the low-level change map P_ch(s) and the local dominant gradient orientation


Fig. 5.6 Pseudo code of the bi-layer Multiple Birth and Death (bMBD) algorithm

maps μ_θ(s) [17]. Meanwhile, weak object candidates are eliminated by the death process based on the global configuration energy.
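To connect Eqs. (5.8)–(5.10) with the death step of the bMBD algorithm, the following sketch computes the configuration energy for a candidate population; the helper functions, the pair bookkeeping, and counting each unordered pair once are illustrative assumptions of ours.

```python
def configuration_energy(objects, unary, overlap, gamma):
    """Configuration energy of Eq. (5.8) for a candidate object population (sketch).

    objects : list of building segment records with a .xi time flag ('1', '2', '*').
    unary   : function u -> A_F(u) in [-1, 1], Eq. (5.9).
    overlap : function (u, v) -> #(R_u & R_v) / #(R_u | R_v), 0 for disjoint rectangles.
    gamma   : weight of the interaction term.
    """
    def share_time_layer(u, v):
        return u.xi == v.xi or u.xi == '*' or v.xi == '*'

    energy = sum(unary(u) for u in objects)
    for i, u in enumerate(objects):
        for v in objects[i + 1:]:                   # each unordered pair once
            if share_time_layer(u, v):
                energy += gamma * overlap(u, v)     # interaction term, Eq. (5.10)
    return energy
```

In the death step, candidates whose removal decreases this energy (typically those with a positive data term or strong overlaps) are killed with high probability.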

5.2.4 Experimental Study of the mMPP Model

During the evaluation, we validated the three key developments of the proposed model, with a comparison to the state of the art: (i) the proposed multiple feature-based building appearance model, (ii) the joint object-change modeling framework, and (iii) the non-homogeneous object birth process based on low-level features.


Fig. 5.7 Evaluation of the single-view building model. Comparing the proposed MPP model to the SIFT [166], Gabor [167], MRF [98], Edge Verification (EV) [165], Segment-Merge (SM) [136] methods, and to the Ground Truth. Circles denote completely missing or false objects. SIFT and Gabor only extract building centers

We have evaluated our method using eight significantly different sets of aerial and satellite image pairs, published as the SZTAKI-INRIA Building Detection Benchmark Set (http://mplab.sztaki.hu/remotesensing/building_benchmark.html). For parameter settings, we have chosen in each dataset 2–8 buildings (≈5%) as training data, while the remaining Ground Truth labels have only been used to validate the detection results. Qualitative results are shown in Figs. 5.7 and 5.8.


Fig. 5.8 Results on Budapest (top, image part—provider: András Görög) and Beijing (bottom, provider: Liama Laboratory CAS, China) image pairs, marking the unchanged (solid rectangles) and changed (dashed) objects

We perform quantitative evaluation both at object and pixel level. At the object level, we first need to establish a non-ambiguous assignment between the detected objects and the GT object samples. As a similarity feature, we use the normalized intersection area between the object figures, and we find the optimal match between the configuration elements with the Hungarian Algorithm (HA) [110, 180]. A detected object is labeled as True Positive (TP) if the HA matches it to a GT object with an overlapping rate of more than r_h (we used r_h = 10%). Unpaired detection samples are marked as False Positive (FP) hits, unpaired GT objects as False Negative (FN) hits. For the change detection evaluation, we also count missing and false change alarms (MC, FC). At the pixel level, we investigate the accuracy of the extracted object outlines. We compare the binary masks of the building footprints provided by the proposed method's output to the manually verified Ground Truth masks, and calculate the Precision (Pr) and Recall (Rc) values of pixel-level detection. Finally, the F-score (F-s, defined as the harmonic mean of Pr and Rc) is calculated both at object and pixel level.

For the evaluation of the building detection component, we present numerical (Table 5.1) and qualitative (Fig. 5.7) comparison results versus four single-view building detection techniques, called SIFT [166], Gabor [167], Edge Verification (EV) [165], and Segment-Merge (SM) [136]. Since SIFT and Gabor extract the building centers instead of estimating the outline, they are only involved in the object-level comparison. The numerical results confirm that the proposed model surpasses all references by 10–26% at object level and by 5–18% at pixel level. According to our analysis, the improvements are particularly related to two key properties: the


Table 5.1 Numerical object-level and pixel-level comparison of the SIFT, Gabor, EV, SM, and the proposed method (MPP) on each test dataset (best results in each row are typeset in bold)

                 Object-level performance (FN / FP)                  Pixel-level performance (Pr / Rc)
Dataset  #o.     SIFT      Gabor     EV        SM        MPP        EV        SM        MPP
Bp       41      20 / 10   8 / 17    11 / 5    9 / 1     2 / 4      73 / 46   84 / 61   82 / 71
An       21      8 / 5     0 / 1     2 / 0     2 / 1     1 / 0      91 / 73   84 / 79   83 / 74
Bg       17      7 / 2     9 / 8     2 / 3     4 / 2     1 / 0      59 / 26   71 / 72   93 / 71
Sz       57      17 / 26   17 / 23   10 / 18   11 / 5    4 / 1      61 / 62   79 / 71   93 / 75
Cd       123     55 / 9    12 / 24   14 / 20   20 / 25   5 / 4      73 / 51   75 / 61   83 / 69
Bs       80      34 / 9    32 / 8    11 / 13   18 / 15   7 / 6      56 / 30   59 / 41   73 / 51
Nm       152     69 / 14   24 / 14   18 / 32   30 / 58   18 / 1     60 / 32   62 / 55   78 / 60
Mc       171     NA / NA   53 / 85   46 / 17   53 / 42   19 / 6     64 / 38   60 / 56   86 / 63
F-s%             66.3      79.9      84.2      79.8      94.4       53.7      66.8      74.3

Datasets: Budapest (Bp), Abidjan (An), Beijing (Bg), Szada (Sz), Cot d'Azur (Cd), Bodensee (Bs), Normandy (Nm), and Manchester (Mc); #o is the number of objects

Table 5.2 Quantitative evaluation results of building change detection. #CH and #UC denote the total number of changed resp. unchanged buildings in the set. PDC denotes the Post Detection Classification reference method and mMP refers to the proposed multitemporal Marked Point Process model. Evaluation rates FN, FP, MC, and FC are introduced in Sect. 5.2.4

Set   #CH   #UC   FN (PDC/mMP)   FP (PDC/mMP)   MC (PDC/mMP)   FC (PDC/mMP)   Pix. F-sc. (PDC/mMP)
Bp    20    21    3 / 0          7 / 2          1 / 0          9 / 2          0.72 / 0.78
Bg    13    4     1 / 0          2 / 1          0 / 0          3 / 0          0.77 / 0.85
Sz    50    7     4 / 2          0 / 1          3 / 4          3 / 0          0.76 / 0.82
An    0     21    2 / 0          2 / 0          0 / 0          4 / 0          0.78 / 0.91

stochastic object generation process and the parallel utilization of multiple features in the building description module. In terms of computational complexity, processing 1 MPixel images with the MPP model takes on average less than 1 min. The proposed approach is competitive with most reference techniques regarding running time, as the detailed experiments in [17] confirm.

After testing the introduced building detector module on single images, we continue with the validation of the proposed joint object-change classification framework. We compared the mMPP method to a conventional Post Detection Comparison (PDC) [173] technique, where the buildings are separately extracted from the two image layers, and the change information is estimated a posteriori by comparing the location, geometry, and spectral characteristics of the detected objects. Experiments on the Budapest, Abidjan, Beijing, and Szada datasets have confirmed that 90% of the false change alarms of PDC were eliminated by mMPP, as shown in Table 5.2.
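The object-level scoring protocol described above can be sketched with SciPy's Hungarian solver; here the normalized intersection area is interpreted as intersection-over-union, and the mask-based interface is our assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def object_level_scores(det_masks, gt_masks, r_h=0.10):
    """Object-level TP/FP/FN counting of Sect. 5.2.4 (illustrative sketch).

    det_masks, gt_masks : lists of HxW boolean footprint masks.
    """
    iou = np.zeros((len(det_masks), len(gt_masks)))
    for i, d in enumerate(det_masks):
        for j, g in enumerate(gt_masks):
            union = np.logical_or(d, g).sum()
            iou[i, j] = np.logical_and(d, g).sum() / union if union else 0.0

    rows, cols = linear_sum_assignment(-iou)     # optimal one-to-one matching (HA)
    tp = int(np.sum(iou[rows, cols] > r_h))      # matches above the overlap threshold
    fp = len(det_masks) - tp
    fn = len(gt_masks) - tp
    return tp, fp, fn
```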


To demonstrate the advantages of the Feature-Based Birth Process (FBB), we compared the convergence speed of the bMBD optimization using the proposed FBB and the conventional Uniform Birth (UB) process. In the UB case, the P_b^(i)(s) and P_ch(s) maps follow uniform distributions, and the orientation parameters are also set as uniform random values. We experienced that the FBB approach reaches the final error rate with three times fewer birth calls than the UB. Moreover, using the UB process the pixel-level accuracy rates converge much more slowly than the object-level errors; as shown in [17], to reach a 75% pixel-level F-score, we need to generate 400,000 objects with the UB map, but only 24,000 building candidates with the proposed FBB map.

5.3 A Point Process Model for Target Sequence Analysis

While the mMPP model introduced in the previous section provides a solution for object-level change detection between two remotely sensed images, a significantly different problem family corresponds to scenarios where a moving target must be followed across several frames of a measurement sequence. In this section, we propose a novel framework for object-level time sequence analysis, which we call henceforward the Multiframe MPP (Fm MPP). The Fm MPP framework simultaneously considers the consistency of the observed data and the fitted objects in the individual time instances of the measurement, and also exploits interaction constraints between the object parameters in the consecutive frames of the sequence. We should point out that the optimization step is a particularly critical issue in the multiframe scenario: the dimension of the target sequence's parameter space may be very large, as it is proportional to the number of frames. For this reason, in the proposed model we merge the advantages of both the bottom-up and inverse approaches (see the definitions in Chap. 1). First, we apply a bottom-up detector for initial target extraction, which processes the sequence frame-by-frame. This step is quick, but we must expect notably poor results in low-quality frames. The output of the bottom-up detector provides the initial state of the Fm MPP optimization process, which yields the final output, ensuring a consistent target structure and smooth motion over the sequence via inter-frame constraints.

5.3.1 Application on Moving Target Analysis in ISAR Image Sequences

We introduce the Fm MPP approach in the application context of moving target analysis in airborne Inverse Synthetic Aperture Radar (ISAR) image sequences. Remotely sensed ISAR images can provide valuable information for target classification and recognition in several difficult situations where optical [27] or SAR imaging techniques


fail [3, 185]. However, robust feature extraction and feature tracking in ISAR images are usually difficult tasks due to strong image noise and the lack of available details about the structure of the imaged targets, artifacts that can lead to significant detection errors in several low-quality frames [21]. Some previous studies have proposed frame selection strategies to exclude low-quality frames from the analysis. However, as demonstrated in [128], extracting reliable features for frame selection may often be unsuccessful. On the other hand, if we assume that the target has a rigid structure with a fixed size, and that only small movement can occur between consecutive time frames, information from frame interactions can be exploited to improve the detection process. For this reason, our proposed system does not delete any frames from the input sequence, but realizes an approach where the detection result on the actual frame jointly depends on the current image data and the target parameters in the neighboring frames. Besides extracting the target's length and axis line, another issue is to detect characteristic feature points on the objects which support the identification process. For this reason, we also search for permanent bright points on the imaged targets, which are produced by stronger scatterer responses from the illuminated objects (see Fig. 5.10a). However, as a consequence of various image formation artifacts, including speckle, image defocus, and scatterer scintillation, several missing and false scatterer-like structures may appear in the individual frames; therefore, we also need to focus on their elimination, which is implemented with spatio-temporal filtering constraints.

The contributions of the section are twofold. First, we propose the general Multiframe Marked Point Process (Fm MPP) framework, which provides a novel Bayesian tool for time sequence analysis in remotely sensed spatio-temporal measurements. Second, we introduce a task-specific implementation of the Fm MPP method, which is used for the analysis of large carrier ships and airplanes from ISAR data. Here, we also perform a detailed quantitative validation on a real dataset, which contains eight ISAR image sequences with 545 manually annotated frames.

5.3.2 Problem Definition and Notations

The proposed method expects as input an n-frame long sequence of 2D ISAR images in the Range-Doppler domain, which contains a single ship (or airplane) target. The joint pixel lattice of the images is denoted by S, and s ∈ S refers to a single pixel. The normalized log-amplitude of pixel s in frame t ∈ {1, 2, ..., n} is marked with g_t(s). The logarithmic image representation suits well the widely adopted log-normal statistical models of ISAR target segmentation [194]. A given target candidate (e.g. a ship) in frame t is denoted by u_t. The axis line segment of the target is described by the center pixel c(u) = [x(u), y(u)], the length l(u), and the orientation θ(u) (see Fig. 5.9c). Additionally, an initially unknown number K(u) (≤ K_max) of scatterers may be assigned to the targets, where each scatterer q_i is described in the target line segment's coordinate system by the relative


Fig. 5.9 Target representation in an ISAR image: a input image with a single ship object, b binarized image, c duplicated image and target fitting parameters. Original image border is shown by the green rectangle

Fig. 5.10 Dominant scatterer detection problem: a highlighted true scatterers, i.e. Ground Truth (GT), b LocMax filter result, c parameterization

line directional position τ_u(q_i), and the signed distance d_u(q_i) from the center line of the parent object u (see Fig. 5.10c). The goal is to extract a target sequence ω = {u_1, u_2, ..., u_n}, also called a configuration in the following.
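As an illustration of this notation, the following sketch (a representation of our own with hypothetical names, not taken from the book's implementation) encodes one target candidate u_t together with its scatterers q_i, each storing the normalized axis-directional position τ_u(q_i) and the signed distance d_u(q_i).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Scatterer:
    tau: float   # normalized position along the target axis, tau_u(q) in [0, 1]
    d: float     # signed distance d_u(q) from the axis line, in pixels

@dataclass
class TargetCandidate:
    x: float                 # axis center, range coordinate x(u)
    y: float                 # axis center, cross-range coordinate y(u)
    length: float            # axis length l(u), in pixels
    theta: float             # axis orientation theta(u), in degrees
    scatterers: List[Scatterer] = field(default_factory=list)

# A configuration omega is simply one target candidate per frame t = 1..n
omega: List[TargetCandidate] = []
```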

5.3.3 Data Preprocessing in a Bottom-Up Approach

Data preprocessing consists of four consecutive steps: foreground–background segmentation, initial center alignment and line segment estimation, scatterer candidate set extraction, and scatterer filtering. First, we separate foreground and background regions in the ISAR images by a Markov Random Field (MRF)-based segmentation model [21], which aims to decrease the effects of speckle noise. The output of this step is a binary label map:

$$\mathcal{B} = \{\varsigma(s) \mid s \in S\},$$


where the labels ς ∈ {fg, bg} are assigned to the foreground and background classes, respectively. Using the assumption that the ISAR amplitude values are generated by log-normal distributions for both classes [194], we approximate the p_bg(s) = P(g_t(s) | ς(s) = bg) and p_fg(s) = P(g_t(s) | ς(s) = fg) log-amplitude class-conditional probabilities by Gaussian density functions. To estimate the Gaussian distribution parameters, we use a semi-supervised approach. We assume that, based on prior knowledge about target shape, image formation, and the cropping process, a prior estimate is available about the ratio of the foreground area compared to the image size. Based on this ratio, the upper part of the image histogram is used to select the training regions for the foreground, and the lower one for the background.

We denote by 1_s^fg ∈ {0, 1} an indicator function of the foreground class in a given segmentation, where 1_s^fg = 1 iff ς(s) = fg. We write s ∼ r if pixel s is in the 4-neighborhood of pixel r in the S lattice. The optimal foreground mask is generated by minimizing the following MRF energy function [162]:

$$\mathcal{B}^{\mathrm{opt}} = \operatorname*{arg\,min}_{\mathcal{B} \in 2^S} \; \sum_{s \in S} \left( -\log\frac{p_{\mathrm{fg}}(s)}{p_{\mathrm{bg}}(s)} \cdot \mathbb{1}^{\mathrm{fg}}_s \;-\; \beta \sum_{r \sim s} \left[ \mathbb{1}^{\mathrm{fg}}_s \cdot \mathbb{1}^{\mathrm{fg}}_r + \left(1 - \mathbb{1}^{\mathrm{fg}}_s\right)\cdot\left(1 - \mathbb{1}^{\mathrm{fg}}_r\right) \right] \right). \qquad (5.11)$$
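As an illustration only (the book solves Eq. (5.11) exactly with the graph-cut algorithm of [40]; the following is a simplified stand-in under the stated sign convention, with hypothetical names), a few ICM-style sweeps over the label map make the roles of the data term and the smoothness term tangible.

```python
import numpy as np

def icm_foreground(p_fg, p_bg, beta=1.0, n_sweeps=5):
    """Approximate minimization of the Eq. (5.11)-type energy by ICM sweeps.

    p_fg, p_bg: per-pixel class-conditional probability maps (same shape).
    Returns a boolean foreground mask (True = fg).
    """
    eps = 1e-12
    llr = np.log((p_fg + eps) / (p_bg + eps))   # data term: log-likelihood ratio
    mask = llr > 0                              # pixel-wise ML initialization
    for _ in range(n_sweeps):
        # count 4-neighbors currently labeled foreground
        nb_fg = np.zeros(mask.shape)
        nb_fg[1:, :] += mask[:-1, :]
        nb_fg[:-1, :] += mask[1:, :]
        nb_fg[:, 1:] += mask[:, :-1]
        nb_fg[:, :-1] += mask[:, 1:]
        nb_tot = np.full(mask.shape, 4.0)       # number of valid 4-neighbors
        nb_tot[[0, -1], :] -= 1
        nb_tot[:, [0, -1]] -= 1
        # relabel a pixel as foreground if data and smoothness support it
        mask = llr + beta * (2 * nb_fg - nb_tot) > 0
    return mask
```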

We used the efficient graph-cut-based optimization algorithm [40] to derive the optimal mask B (Fig. 5.11). By also considering the time stamp, we denote hereafter by

B_t(s) ∈ {0, 1} the foreground mask value of pixel s in frame t. Thereafter, to get an initial estimate of the target axis segment, we extract an approximate axis line from the foreground mask by using the Hough transform. To implement this step, we also have to deal with a problem which is a consequence of the ISAR image synthesis technique. The image formation process models the images as spatially periodic signals both in the horizontal and vertical directions. The imaging algorithm estimates the target center and attempts to crop the appropriate Region of Interest (RoI) from this periodic image (a correctly cropped frame is demonstrated in Fig. 5.10a). However, if the center of the RoI is incorrectly estimated, the target line segment may 'break' into two (or four) distinct pieces, an issue also illustrated in Fig. 5.9a. For this reason, our proposed image processing approach searches for the longest foreground segment of the axis line in a duplicated mosaic image. Let us observe that this step also re-estimates the center of the input frame (see Fig. 5.9c).

The scatterer candidate extraction step is based on the observation that permanent scatterers cause high amplitudes in the ISAR frames, although these amplitudes may fluctuate significantly over the consecutive images. In addition, we must expect a large variety of scatterer amplitudes within a given ISAR image. Consequently,


Fig. 5.11 Demonstration of foreground–background segmentation in the ISAR frames. Top left: background and foreground probability maps (high probabilities indicated with greater intensities); bottom left: foreground mask through pixel-by-pixel maximum likelihood classification (only for reference); top right: sketch of graph-cut-based MRF optimization [40]; bottom right: foreground mask (B) by the proposed MRF model

simple threshold-based strategies for scatterer extraction cannot perform efficiently. For this reason, we implemented a two-step algorithm: first, we detect a large preliminary set of possible scatterer candidates, among which several may be false target points. Second, we distinguish false candidates from real scatterers in an iterative process, by enforcing various geometric constraints, such as the line structure and temporal persistence of the target object. For this purpose, we first apply the Local Maxima (LocMax) filter to detect Preliminary Scatterer Candidates (PSC). As shown in Fig. 5.10b, the vast majority of real scatterers are extracted by this step; nevertheless, a high false alarm rate is present.

The scatterer filtering step applies various local kernels within the object configuration space. Next, two kernels are introduced and their effects demonstrated. Then, we present our proposed complex spatio-temporal model and a corresponding iterative energy optimization process in Sects. 5.3.4 and 5.3.5.

The Scatterer Filtering (SF) kernel takes as input an estimate of the axis line segment and the PSC set. Assuming that the targets are large carrier ships, the kernel exploits two facts:

• For each target candidate, we assume that the scatterer candidates are near its axis line.
• The projections of two different scatterers onto the axis line should be far enough from each other, since too close bright feature points are usually yielded by multiple echoes from the same scatterer. The minimum distance of two real scatterer


Fig. 5.12 Pseudo code of the implemented Scatterer Filtering (SF) kernel

projections in the τ_u(q) domain is defined by a threshold parameter Tτ, which takes values between 0.05 and 0.07 (see also Fig. 5.12).

Based on the above expectations, we select a set of filtered scatterers with a straightforward algorithm, which is presented in Fig. 5.12. Let us note that this filter is very sensitive to the accuracy of the preceding axis estimation step. Therefore, if we directly apply the SF kernel to the output of the Hough-based axis detector, we may observe rather weak classification performance (see Fig. 5.15a).

To handle the limitations of the above SF kernel, we have implemented a second move, called the RANSAC kernel, which can be applied for targets like large ships, where a strict line-arrangement constraint is satisfied by the scatterer set. More specifically, if a subset of the extracted LocMax scatterer candidates fits a given line e_l, we strongly assume that e_l is the axis line of the object. The widely used RANSAC algorithm has been adopted here for re-estimating the optimal line from the set of preliminary scatterer candidates. Once the re-estimated axis is obtained, the Scatterer Filtering kernel is applied again. The result of RANSAC-based re-estimation for the previously discussed sequence part is demonstrated in Fig. 5.15b. Significant improvement is shown in multiple frames (see #19, #21, and #22); however, we can still find a false scatterer (#19) and a missing one (#22), while the detection result on frame #20 is faulty. Note as well that the RANSAC kernel also has a few drawbacks. First, we cannot use it if there are only a few permanent scatterers in the target image. Second, the RANSAC-based estimation may give errors if there are several duplicated scatterers in the PSC set, which can form parallel lines (an artifact


which can be a consequence of echoes in the imaging step). However, since we can assume that the target is rigid, and the position and orientation differences cannot be large between appearances in consecutive frames, temporal constraints can be utilized to refine the detector output. Based on these observations, in the following sections we integrate the previously discussed deterministic kernels into a stochastic iterative energy optimization framework, which improves the quality of detection by considering a frame sequence instead of single images.
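As a companion to the pseudo code of Fig. 5.12, the following sketch (our own simplified reading of the SF kernel, with hypothetical names and parameter values; not the exact implementation of the book) keeps only those scatterer candidates that lie close to the estimated axis and whose axis projections are separated by at least Tτ.

```python
import numpy as np

def sf_kernel(candidates, axis_p0, axis_p1, max_dist=3.0, t_tau=0.06):
    """Filter Preliminary Scatterer Candidates (PSC) against an axis segment.

    candidates: (N, 2) array of candidate pixel coordinates.
    axis_p0, axis_p1: endpoints of the estimated target axis line segment.
    max_dist: maximum allowed distance (in pixels) from the axis line.
    t_tau: minimum separation of two accepted projections in the tau domain.
    Returns the indices of the accepted candidates.
    """
    p0, p1 = np.asarray(axis_p0, float), np.asarray(axis_p1, float)
    axis = p1 - p0
    length = np.linalg.norm(axis)
    direction = axis / length
    normal = np.array([-direction[1], direction[0]])

    rel = np.asarray(candidates, float) - p0
    tau = rel @ direction / length      # normalized axis position tau_u(q)
    dist = np.abs(rel @ normal)         # distance |d_u(q)| from the axis line

    accepted, kept_taus = [], []
    for i in np.argsort(dist):          # visit candidates closest to the axis first
        if dist[i] > max_dist or not (0.0 <= tau[i] <= 1.0):
            continue
        if all(abs(tau[i] - t) >= t_tau for t in kept_taus):
            kept_taus.append(tau[i])
            accepted.append(int(i))
    return accepted
```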

5.3.4 Multiframe Marked Point Process Model

In this section, we present a novel Multiframe Marked Point Process model, which can handle target sequences from multiple time frames instead of focusing on individual objects, by exploiting information from entity interactions. Following the classical Markovian approach, each target sample directly affects only the objects in its neighboring frames. Due to this feature, the number of interactions in the population is limited, which yields a compact global sequence description that can be efficiently analyzed. In our model, we use a t_max-radius frame neighborhood. Using henceforward the F notation for the union of all image features derived from the input data, we characterize a given target sequence ω with a data-dependent Gibbs energy term Φ_F(ω):

$$\Phi_F(\omega) = \sum_{t=1}^{n} A_F(u_t) \;+\; \gamma \cdot \sum_{t=1}^{n} I(u_t, \omega_t). \qquad (5.12)$$

As shown in Eq. (5.12), Φ_F(ω) consists of a data-driven term A_F(u_t) ∈ [−1, 1], called the unary potential, and a prior term I(u_t, ω_t) ∈ [0, 1], referred to as the interaction potential, where ω_t = {u_{t−t_max}, ..., u_t, ..., u_{t+t_max}} is the sub-sequence of u_t's 2·t_max nearest neighbors. The γ parameter is a positive weighting factor between the two components of the potential function. With the following definitions of the energy terms (Sects. 5.3.4.1 and 5.3.4.2), we attempt to ensure that the optimal sequence candidate exhibits maximal likelihood, and thus minimal Φ_F(ω) energy. The optimal sequence can therefore be obtained by minimizing Φ_F(ω).

5.3.4.1 Construction of the Unary Potentials

The A_F(u_t) unary potential consists of two subterms:

$$A_F(u_t) = \frac{1}{2}\left( A^{B}_F(u_t) + A^{Sc}_F(u_t) \right),$$


where A^B_F(u_t) is the body-term and A^Sc_F(u_t) is the scatterer-term. First, we describe the derivation of the body-term. Denote by L_u ⊂ S the set of pixels lying under the dilated line of u in the duplicated image. We mark by R_u ⊂ L_u the set of pixels under the line segment of u (see Fig. 5.9c):

$$R_u = \left\{ s \in L_u \;\middle|\; d\big(s, [x(u), y(u)]\big) < l(u)/2 \right\},$$

and by T_u ⊂ L_u \ R_u the pixels of L_u which are located outside the u segment, but quite close to its endpoints. Using the body fitting feature f_b(u), we favor object candidates where the line segment region R_u mostly covers foreground-classified pixels of the B-mask of the actual ISAR image, while the external area T_u contains background regions:

$$f_b(u) = \frac{1}{\#\{R_u \cup T_u\}} \cdot \left( \sum_{s \in R_u} \mathbb{1}^{\mathrm{fg}}_s \;+\; \sum_{s \in T_u} \left(1 - \mathbb{1}^{\mathrm{fg}}_s\right) \right),$$

where #{.} marks the area measured in pixels. Then, we derive the body-term of the unary potential of u as A^B_F(u) = M(f_b(u), d_0, 10), where we utilize again the monotonously decreasing M function defined by Eq. (5.2). Parameter d_0 is used as an acceptance threshold for valid objects. In addition, we also use a scatterer-term, which penalizes scatterers that are not located at local maxima of the ISAR frame:

$$A^{Sc}_F(u) = M\!\left( \frac{1}{K(u)} \sum_{i=1}^{K(u)} \psi(i, u),\; d_\psi,\; 10 \right), \qquad \text{where} \quad \psi(i, u) = \begin{cases} 0 & \text{if } q_i \text{ lies at a local maximum of the input ISAR frame,} \\ 1 & \text{otherwise.} \end{cases}$$

Parameters d_0 and d_ψ are set using annotated training images.

5.3.4.2 Construction of the Interaction Potentials

Interaction potentials are used to incorporate multiframe temporal information and prior geometric (soft) constraints in the model. Since we assume that the observed object has a rigid structure, we can expect strong dependence between the appearance parameters of the target in consecutive frames. Since, due to the ISAR image formation process, the c(u) center coordinates do not carry information about the real position of the target, we only penalize large differences between the l(u) length and θ(u) angle parameters, and significant differences in the normalized positions and numbers of scatterers within close-in-time images of the sequence.


We construct the prior interaction term I(u_t, ω_t) as a weighted sum of four subterms: the median angle difference I_θ(u_t, ω_t), the median length difference I_l(u_t, ω_t), the median scatterer number difference I_#s(u_t, ω_t), and the median scatterer alignment difference I_sd(u_t, ω_t):

$$I(u_t, \omega_t) = \delta_l \cdot I_l(u_t, \omega_t) + \delta_\theta \cdot I_\theta(u_t, \omega_t) + \delta_{\#s} \cdot I_{\#s}(u_t, \omega_t) + \delta_{sd} \cdot I_{sd}(u_t, \omega_t), \qquad (5.13)$$

where δ_l, δ_θ, δ_#s, δ_sd are positive and δ_l + δ_θ + δ_#s + δ_sd = 1. The first three subterms are calculated as the median values of the parameter differences between the actual and the nearby frames:

$$I_l(u_t, \omega_t) = \min\!\left( \mathrm{med}_l(t)/d^{\,l}_{\max},\; 1 \right), \quad I_\theta(u_t, \omega_t) = \min\!\left( \mathrm{med}_\theta(t)/d^{\,\theta}_{\max},\; 1 \right), \quad I_{\#s}(u_t, \omega_t) = \min\!\left( \mathrm{med}_K(t)/d^{\,K}_{\max},\; 1 \right),$$

where for the target parameters f ∈ {l, θ, K}:

$$\mathrm{med}_f(t) = \operatorname*{median}_{t - t_{\max} \le i \le t + t_{\max}} \big| f(u_t) - f(u_i) \big|, \qquad (5.14)$$

while d^l_max, d^θ_max, and d^K_max are normalizing constants. We note here that applying median filtering proved to be more robust than averaging the difference values, due to outlier images with erroneously estimated targets.

The scatterer alignment difference term I_sd(u_t, ω_t) evaluates the correlation between the relative scatterer positions of targets from close frames. First, we define a scatterer alignment vector for each object in the following way:

$$\tau(u) = \big[ \tau_u(q_1), \tau_u(q_2), \ldots, \tau_u(q_{K(u)}) \big],$$

where—as defined in Sect. 5.3.2—τ_u(q) is the normalized line directional component of the q scatterer's projection onto the axis of u. Let us consider two objects, u and v, in two different frames, which may even have different numbers of scatterers. The difference between the τ(u) and τ(v) descriptors is calculated as follows:

$$\Delta\big(\tau(u), \tau(v)\big) = \frac{1}{2}\left( \frac{1}{K(u)} \sum_{i=1}^{K(u)} \min_{j \le K(v)} \big|\tau_u(q_i) - \tau_v(q_j)\big| \;+\; \frac{1}{K(v)} \sum_{j=1}^{K(v)} \min_{i \le K(u)} \big|\tau_u(q_i) - \tau_v(q_j)\big| \right).$$

Thereafter, by using (5.14), the scatterer alignment difference term is obtained as

$$I_{sd}(u_t, \omega_t) = \min\!\left( \mathrm{med}_{sd}(t)/d^{\,sd}_{\max},\; 1 \right), \quad \text{where} \quad \mathrm{med}_{sd}(t) = \operatorname*{median}_{t - t_{\max} \le i \le t + t_{\max}} \Delta\big(\tau(u_t), \tau(u_i)\big).$$

For the purpose of quick computation, we approximate the Δ(τ(u_t), τ(u_i)) feature by determining the 1D distance transform map on the discretized [0, 1] interval.
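As a small worked example, the following sketch (a direct implementation of the definition above with hypothetical names; the book instead uses a 1D distance transform for speed) computes the symmetric alignment difference between two scatterer position vectors.

```python
import numpy as np

def alignment_difference(tau_u, tau_v):
    """Symmetric scatterer alignment difference between two targets.

    tau_u, tau_v: 1D arrays of normalized axis positions (values in [0, 1]).
    """
    tau_u, tau_v = np.asarray(tau_u, float), np.asarray(tau_v, float)
    d = np.abs(tau_u[:, None] - tau_v[None, :])    # pairwise |tau_u(i) - tau_v(j)|
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# two frames with slightly shifted scatterer patterns and different counts
print(alignment_difference([0.1, 0.4, 0.8], [0.12, 0.42, 0.79, 0.95]))
```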

5.3.5 Multiframe MPP Optimization

As mentioned at the beginning of this section, our proposed optimizer is initialized with the output of the preliminary detector of Sect. 5.3.3, which provides an initial configuration that is consistent with the input data in most of the frames. Thereafter, we apply an iterative refinement algorithm, which inspects the complete image sequence in each step, and attempts to propose higher quality candidates in place of the actual objects, taking into account both the data-driven and the prior constraints. This procedure has two key steps: the generation of new object candidates, and the verification of the proposed objects based on the current configuration and features extracted from the input data. Two types of moves are applied during object generation, a Perturbation kernel and a RANSAC-based birth kernel, which are randomly chosen at each iteration step:

• The Perturbation kernel creates a clone of the actual object from either the current, the previous, or the next frame, and perturbs its center position, length, and orientation parameters with zero-mean Gaussian random values. Finally, the scatterer positions are simply copied from the object of the current frame, while new scatterers are optionally added or some scatterers are removed.
• The RANSAC-based birth kernel performs a re-estimation of the optimal line based on the preliminary scatterer candidates (PSC) using the RANSAC algorithm.

The pseudo codes of the above kernel functions can be found in Fig. 5.13. The proposed optimization algorithm iterates object proposal and evaluation steps, which are followed by the possible replacement of the original objects with newly generated ones. Let us assume that we are currently in the kth iteration of the process. To decide whether we accept or decline the replacement of the object on the tth frame by the newly proposed object u, we first calculate the energy difference ΔΦ_ω(u, t) between ω[k], the configuration before the kth iteration, and the configuration ω* we would get from ω[k] by exchanging u_t[k] with u. Exploiting the Markovian property of the energy function, to calculate the energy difference we only have to check the objects in the t_max-neighborhood of frame t and inspect the corresponding unary and interaction potential terms. We have ΔΦ_ω(u, t) < 0 if the potential move decreases the global energy level. However, to prevent the process from getting stuck in a low-quality local energy minimum, we implement the iterative process within a simulated annealing framework. In this way, based on the ΔΦ_ω(u, t) energy difference value, we determine a probability of accepting the object replacement move, which is used


Fig. 5.13 Pseudo code of the Object Generation Kernels

by a randomized decision process. For the cooling scheme, we have followed the suggestion of [54]. Details of optimization are provided in Fig. 5.14. Final detection results on the previously discussed sample frames are shown in Fig. 5.15c.
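The randomized decision can be illustrated by the following sketch (a generic Metropolis-style acceptance rule under an assumed exponential form with hypothetical names; the exact acceptance probability and cooling scheme of [54] may differ), where delta_energy stands for the energy difference of the proposed replacement and the temperature follows a geometric cooling schedule.

```python
import math
import random

def accept_replacement(delta_energy, temperature):
    """Randomized acceptance of an object replacement move.

    Energy-decreasing moves are always accepted; energy-increasing moves are
    accepted with a probability that shrinks as the temperature decreases.
    """
    if delta_energy < 0:
        return True
    return random.random() < math.exp(-delta_energy / temperature)

# geometric cooling: the temperature is multiplied by a factor < 1 per iteration
temperature, cooling = 50.0, 0.96
for iteration in range(100):
    # ... propose candidate objects, compute delta_energy, call accept_replacement ...
    temperature *= cooling
```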

5.3.6 Experimental Results on Target Sequence Analysis

Our test dataset consists of seven aerial ISAR image sequences which contain carrier ship targets. The key properties of the used image sets are presented in Table 5.3.


Fig. 5.14 Pseudo code of the multiframe energy optimization algorithm

The annotated dataset includes 520 ISAR frames (40–90 frames in each sequence) and 4250 true scatterer locations (8 or 9 scatterers in each frame). For quantitative validation, we have manually prepared Ground Truth (GT) information for both the axes and the scatterer positions (for longer sequences, we only considered the first 90 frames). To take into account various aspects of evaluation, we have considered three different types of error metrics. The Normalized Axis Parameter Error (E_AX) is defined by the sum of the x–y center position and axis length errors normalized with the length of the GT target, and the angle error normalized by 90°:


Fig. 5.15 Center alignment and target line extraction results on Frames #19–22 of the SHIP1 ISAR image sequence. Top: initial detection; Middle: RANSAC re-estimation; Bottom: proposed Fm MPP model

Table 5.3 Main properties of the eight ISAR image test sequences

Sequence  Number of frames  Frame size (pix)  Tot. num. of scatterers  Avg axis length  Scat. per frame
SHIP1     45                256 × 128         360                      153.9            8
SHIP2     90                256 × 96          720                      195.3            8
SHIP3     40                256 × 96          320                      133.9            8
SHIP4     90                256 × 96          720                      179.8            8
SHIP5     90                256 × 96          720                      172.2            8
SHIP6     90                256 × 96          720                      133.7            9
SHIP7     75                256 × 96          600                      169.8            8
AIRPLN    25                128 × 128         NA                       75.2             NA
Overall   520 + 25          –                 4250                     151.7            7.79

$$E_{AX} = \frac{E^{x}_{AX} + E^{y}_{AX} + E^{l}_{AX}}{\frac{1}{n}\sum_{t=1}^{n} l\big(u^{gt}_t\big)} \;+\; \frac{E^{\theta}_{AX}}{90^\circ},$$

where the following subterms are calculated:

$$E^{x}_{AX} = \frac{1}{n}\sum_{t=1}^{n} \big\| x(u_t) - x\big(u^{gt}_t\big) \big\|, \qquad E^{y}_{AX} = \frac{1}{n}\sum_{t=1}^{n} \big\| y(u_t) - y\big(u^{gt}_t\big) \big\|,$$

$$E^{l}_{AX} = \frac{1}{n}\sum_{t=1}^{n} \big\| l(u_t) - l\big(u^{gt}_t\big) \big\|, \qquad E^{\theta}_{AX} = \frac{1}{n}\sum_{t=1}^{n} \Delta\theta\big(u_t, u^{gt}_t\big).$$

Regarding E^θ_AX, let us assume that θ(u_t), θ(u_t^gt) ∈ [0, 180°], and we use

$$\Delta\theta\big(u_t, u^{gt}_t\big) = \min\Big( \big|\theta(u_t) - \theta\big(u^{gt}_t\big)\big|,\; 180^\circ - \big|\theta(u_t) - \theta\big(u^{gt}_t\big)\big| \Big).$$

Table 5.4 shows the axis-level detection quality rates (smaller errors are considered better) for each of the three consecutive steps of the workflow shown in Fig. 5.15: (a) Initial detection, (b) RANSAC-based structure re-estimation, and (c) the final Fm MPP result of the iterative optimization. We can observe decreasing errors after each consecutive step, and according to the final result the aggregated E_AX rate stays between 3% and 7% for all sequences.

Table 5.4 Axis Detection—Evaluation Results for the test sequences. The E^y_AX, E^x_AX, and E^l_AX mean errors are measured in pixels, E^θ_AX in degrees, while the normalized E_AX error rate is expressed in percent (%)

Sequence  Step      E^y_AX  E^x_AX  E^l_AX  E^θ_AX  E_AX (%)
SHIP1     Initial   6.31    9.89    10.6    5.64    23.6
          RANSAC    5.11    5.69    9.11    2.18    15.3
          Fm MPP    0.44    0.27    3.73    0.8     3.8
SHIP2     Initial   5.85    1.72    13.11   1.51    12.3
          RANSAC    2.99    1.02    6.56    0.71    6.2
          Fm MPP    0.47    0.17    4.29    0.58    3.2
SHIP3     Initial   2.80    2.15    5.70    2.15    10.3
          RANSAC    1.65    1.33    4.92    1.52    7.5
          Fm MPP    0.33    0.30    2.65    0.90    3.4
SHIP4     Initial   2.37    0.83    5.96    0.58    5.7
          RANSAC    2.70    0.82    5.69    0.79    6.0
          Fm MPP    0.64    0.06    4.37    0.38    3.2
SHIP5     Initial   2.07    0.96    5.86    1.10    6.4
          RANSAC    1.43    0.47    3.50    0.86    4.1
          Fm MPP    0.19    0.09    4.01    0.80    3.3
SHIP6     Initial   2.33    1.54    3.71    1.96    7.8
          RANSAC    1.46    0.70    4.09    1.11    5.9
          Fm MPP    0.01    0.07    3.20    0.50    3.0
SHIP7     Initial   4.53    0.87    9.27    1.12    9.9
          RANSAC    3.32    0.72    9.21    0.75    8.6
          Fm MPP    2.13    0.13    8.13    0.56    6.7
AIRPL     Initial   1.68    6.16    16.32   2.56    34.9
          Fm MPP    0.24    0.80    3.28    0.76    6.6

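As a worked illustration of the axis error metric defined above, the following sketch (hypothetical helper names) computes E_AX for one sequence from detected and GT axis parameters.

```python
import numpy as np

def axis_error(det, gt):
    """Normalized Axis Parameter Error E_AX for one sequence.

    det, gt: dicts of 1D arrays with keys 'x', 'y', 'l', 'theta' (theta in degrees).
    Returns E_AX as a fraction (multiply by 100 to express it in percent).
    """
    e_x = np.mean(np.abs(det['x'] - gt['x']))
    e_y = np.mean(np.abs(det['y'] - gt['y']))
    e_l = np.mean(np.abs(det['l'] - gt['l']))
    d_theta = np.abs(det['theta'] - gt['theta'])
    d_theta = np.minimum(d_theta, 180.0 - d_theta)   # orientation is modulo 180 degrees
    return (e_x + e_y + e_l) / np.mean(gt['l']) + np.mean(d_theta) / 90.0
```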

The Scatterer Detection Rates evaluate the quality of permanent scatterer identification. Here, each detected scatterer is automatically matched to its most likely corresponding GT scatterer by the Hungarian algorithm [110], which uses the scatterer parameter τ(q) for the assignment. A proposed match is only validated if the distance of the assigned feature points is smaller than a given threshold. Then, the numbers of true positive, false negative, and false positive scatterers are counted, as shown in columns 3–5 of Table 5.5. The third quality rate is called the Average Scatterer Position Error (E_SP), which we measure in pixels:

$$E_{SP} = \frac{1}{n}\sum_{t=1}^{n} \frac{1}{\#\{i : m^t_i \neq 0\}} \sum_{i=1}^{K(u_t)} \mathbb{1}_{\{m^t_i \neq 0\}} \cdot \big\| q_i(u_t) - q^{gt}_{m^t_i}(u_t) \big\|,$$

Table 5.5 Evaluation of Scatterer Detection. The numbers of False Positive and False Negative scatterers determine the precision and recall factors of the process, while the E_SP rate shows the scatterer positioning accuracy (low values are preferred)

Sequence  Step      True Pos  False Pos  False Neg  E_SP (pixel)
SHIP1     Initial   249       117        111        7.4
          RANSAC    339       49         21         2.2
          Fm MPP    349       10         11         0.5
SHIP2     Initial   680       49         40         4.8
          RANSAC    703       23         17         1.6
          Fm MPP    718       2          2          0.4
SHIP3     Initial   301       33         19         1.5
          RANSAC    306       30         14         1.6
          Fm MPP    311       22         9          1.0
SHIP4     Initial   696       66         24         1.1
          RANSAC    699       64         21         1.0
          Fm MPP    705       22         15         0.7
SHIP5     Initial   691       69         29         0.9
          RANSAC    695       71         25         0.6
          Fm MPP    707       29         13         0.3
SHIP6     Initial   763       48         47         0.9
          RANSAC    763       49         47         0.8
          Fm MPP    764       18         46         0.7
SHIP7     Initial   562       61         38         3.5
          RANSAC    567       58         33         2.9
          Fm MPP    559       37         41         2.5


Fig. 5.16 Sample frames from the SHIP2–SHIP7 datasets, and the corresponding detection results of the Fm MPP approach obtained by the optimization of the proposed ISAR sequence-based model

where m_i^t is the index of the GT scatterer matched to the ith detected scatterer in frame t, and m_i^t = 0 marks the unmatched scatterers. 1{.} denotes an indicator function and #{.} marks the set cardinality. The obtained E_SP rates are listed in the 6th column of Table 5.5.

By examining the results shown in Tables 5.4 and 5.5, we can conclude that the proposed method can efficiently handle all seven ship test cases (SHIP1–SHIP7). The improvement between the outputs of the Initial and the optimized Fm MPP phases of the process is particularly significant in the SHIP1 (shown in Fig. 5.15), SHIP2, and SHIP5 sequences, which contain complex test cases. The improvements are also remarkable in the SHIP3, SHIP4, and SHIP6 cases (see sample frames in Fig. 5.16), while the SHIP7 sequence contains noisier images with several blurred frames, where the final error rates remain larger (see also the last row of Fig. 5.16).

Besides the surveillance of large carrier ships, the proposed model can be used to detect further types of targets in ISAR image sequences. Airplanes appear as cross-like structures in the ISAR frames, where usually one of the two wings can be clearly observed. In addition to the length and orientation of the body axis, the lengths of the wings and their connection points to the airplane body are also relevant shape parameters. Consequently, for analyzing airplanes, one can use a cross-shaped target geometry model, as demonstrated in Fig. 5.17. Like the ship detector, the airplane extraction algorithm consists of a preliminary detection step and the Fm MPP-based iterative refinement. The preliminary detection includes the Hough transform-based extraction of the body line, while the initial wing root position is determined with an exhaustive search process: histograms are created from the silhouette pixels that project perpendicularly onto the same points of the body line. In the Fm MPP-based refinement stage, the data term A_F(u_t) is calculated in a similar way to the carrier ship model; however, here the filling factors for the left and right wings are calculated separately, and their minimum (i.e. the better match) counts towards the data term of the model. This latter trick is necessary, since in most cases only one of the wings is largely visible in the ISAR images. Airplane detection results on four selected frames are shown in Fig. 5.17, demonstrating the outputs of the preliminary stage and the final Fm MPP model, respectively. The improvement between the two algorithmic


Fig. 5.17 Airplane detection example: a Airplane silhouette and the cross-shaped fitted model b–d. Comparing the results of the initial and the optimized Fm MPP detection in four sample frames

stages is similar to the one experienced in the ship detection scenarios. Note as well that it may also be possible to observe permanent scatterers in the airplanes' ISAR images. However, since airplane scatterers can appear both in the wings and in the body, their geometric alignment constraints may be more complex than in the case of linear vessels.

The processing speed of the proposed algorithm varies over the different test sets between 2 frames per second (fps) and 5 fps. Note that the total computational cost depends on various factors, such as the image size, the length of the sequence, the target length, the scatterer number, and the quality of the initial detection. Based on our experience, the additional computational cost of the iterative optimization is not significantly higher than the cost of the initialization and the RANSAC steps.

5.4 Parameter Settings in Dynamic MPP Models

We can divide the parameters of the proposed dynamic MPP methods into three groups, corresponding to the prior model, the data model, and the optimization. The prior model parameters of the mMPP model, such as the l window size for GODF calculation (see Sect. 5.2.2.1) and the maximal/minimal rectangle side lengths at the different scales, depend on the image resolution and the expected object dimensions, thus they are set based on sample objects.


As for the Fm MPP approach, the r_fg, Tτ, d^l_max, d^θ_max, d^K_max, and d^sd_max factors are also calibrated in a supervised way. Further relevant prior parameters are the weighting factors within the I(u_t, ω_t) interaction term; here we used uniform weights:

δl = δθ = δ#s = δsd = 0.25. We used a constant γ = 2 weight between the data term and the overlapping coefficient in Eqs. (5.8) and (5.12). The parameters of the data model in mMPP are estimated based on training image gt gt gt regions containing Ground Truth objects {u 1 , u 2 , . . . , u n }. Consider an arbitrary gr f (u) feature from the feature library (e.g. f (u) gradient descriptor for building detection). We remind the Reader that each f (u) of our model is a noisy quality measure, and the corresponding energy term is obtained as (see Sect. 5.2.2.2): f

ϕ f (u) = M( f (u), d0 , D f ). Here, we set the normalizing constant as gt

gt

D f = max f (u j ) − min f (u j ). j

j

Exploiting that the M transfer function is monotonously decreasing with a sole root at f(u) = d_0^f, object u is attractive in image i (i.e. φ_f^(i)(u) < 0) if and only if f(u) > d_0^f.

Consequently, increasing d_0^f may decrease the false alarm rate and increase the missing alarms corresponding to the selected feature. Since in the proposed model we can simultaneously utilize several object prototypes, our strategy for setting d_0^f is to minimize the false alarms for each prototype, and to eliminate the missing objects using further feature configurations.

For the building detection task, we have also tested the sensitivity of the proposed model against the parameters of various feature extraction steps. Figure 5.18 shows the pixel-level F-scores of detection on the Budapest image, where we perturbed the chrominance threshold (τcr) of roof color filtering, the shadow darkness threshold (τsh), and the gradient acceptance threshold (d^gr) by at most ±30% around the optimal value. The results show that the performance varies by around 10% in these parameter domains, the dependence on τcr being the most significant. When setting the Fm MPP data model parameters (d_0 and d_ψ), we exploited that under similar ISAR imaging conditions the contrast parameters of the images are very similar. Finally, to set the optimization parameters, we followed the guidelines provided in [54] and used δ_0 between 10,000 and 20,000, β_0 between 20 and 50, and geometric cooling factors of 1/0.96.

Fig. 5.18 Building detections' pixel-level performance (F-score) in case of different parameter settings: the three plots show the F-score (approximately in the 0.6–0.8 range) as a function of the perturbed τcr, τsh, and d^gr thresholds, respectively, while the remaining two parameters are kept at their optimal values (τcr = 16, τsh = 30, d^gr = 22)

5.5 Conclusions

In this chapter, we proposed two different solutions for dynamic Marked Point Processes. In the first part, we proposed a multitemporal Marked Point Process (mMPP) framework for building extraction and change monitoring in remotely sensed image pairs in a joint probabilistic approach. A global optimization process attempted to find the optimal configuration of buildings, considering the observed data, prior knowledge, and interactions between the neighboring building parts. The computational cost has been significantly decreased by a non-uniform stochastic object birth process, which proposes relevant objects with higher probability based on low-level image features. The second part has addressed the detection and characterization of large ship and airplane targets in ISAR image sequences using an energy minimization approach. We have proposed a robust joint model for axis extraction, feature point detection, and tracking. We have shown that in the case of noisy sequences, the introduced Multiframe Marked Point Process scheme can significantly improve the results of frame-by-frame detection.

Chapter 6

Multi-level Object Population Analysis with an Embedded MPP Model

Abstract In this chapter, we introduce a probabilistic approach for extracting complex hierarchical object structures from digital images used by various vision applications. The proposed framework extends conventional Marked Point Process (MPP) models by including object–subobject ensembles in parent–child relationships, and by creating coherent object groups from corresponding objects through a Bayesian partitioning of the parent entity population. Unlike former, largely domain-specific attempts at MPP generalization, the proposed method is defined at an abstract level, while it provides simple and clear interfaces for the possible applications. We also introduce a global optimization process for the multi-layer framework, which attempts to find the optimal configuration of entities, considering the image data (observation), prior knowledge-based constraints, and interactions between the neighboring and the hierarchically related objects. The efficiency of the proposed method is demonstrated qualitatively and quantitatively in three different application areas: built-in area analysis in remotely sensed images, traffic monitoring in airborne Lidar data, and optical circuit inspection.

6.1 A Hierarchical MPP Approach

In recent years, one of the main evolving characteristics of commercial image sensors has been the spatial resolution. In the remote sensing domain, several very high-resolution satellites have been launched, including the Pleiades system, which provides submetric resolution data incorporating stereo facilities. On the other hand, Full High Definition (HD) video cameras are available at affordable prices for many surveillance applications, and these systems can be supported by arrays of multiple thermal or Time-of-Flight (ToF) sensors. The spatial resolution of airborne or terrestrial laser scanning is also increasing due to the focused development of aerial Lidar devices and mobile mapping systems. In industrial optical inspection systems, research on designing proper illumination sources, decreasing lens aberrations, and improving the limited depth of field results in sharp images with resolutions down to a few µm.


From the processing side, the above hardware developments imply a notable shift in computer vision methodologies. While several earlier technologies focused on compensating for low image resolution, e.g. by mosaicking or super resolution techniques, nowadays fine details are observable in high-resolution images, demanding hierarchical content parsing algorithms [146], which can interpret the observed information at multiple levels. While the conventional MPP-based image analysis models (see Sect. 2.3 of Chap. 2 and also our earlier introduced multitemporal models in Chap. 5) focus purely on object extraction with direct (bilateral) object interaction modeling within populations, dealing with higher level object grouping and object decomposition issues is also a critical part of complex scene understanding processes.

A few attempts at multi-level entity modeling with point processes have already been conducted in the literature. The Multi-MPP framework proposed by [115] extends MPP models regarding two issues. First, for the simultaneous detection of entities of different shapes, it can jointly sample various prototypes of geometric objects. Second, through a statistical analysis of the locally occurring shape types and the relative alignment of the extracted nearby entities, a local texture representation of the different image regions is realized. Although this approach may fit traditional bottom-up image content exploration tasks well, it is not straightforward in this framework how one can efficiently segment and model an object population based on domain-specific top-down knowledge, as required by many machine vision applications. On the other hand, several hierarchical phenomena can be more naturally described by object–subobject relationships (parent–child objects) rather than by applying object grouping constraints. As notable examples, we can think of Circuit Elements (CE) of Printed Circuit Boards (PCB) and recognizable patterns of defective parts within the CEs [11, 20] in µm resolution images, ships and containers in radar images (see Sect. 5.3.1 of Chap. 5), or building roofs and chimneys in aerial or satellite photos.

For the above reasons, we introduce in this chapter a new three-level Embedded Marked Point Process (EMPP) framework [14], which has the following two key properties:

• The hierarchy between objects and object parts is described by a parent–child relationship embedded into the MPP framework. The appearance model of a child object depends on its parent entity, taking into account geometrical and spectral constraints, such as that the geometric figure of a parent object should encapsulate its child objects, or that the color/texture distribution of the parent object may influence the expected appearance characteristics of the child entity.
• To avoid the limitations of using purely pairwise object interactions, we propose here a multi-level MPP model, which divides the global parent entity configuration into disjoint object groups, called configuration segments, and extracts the objects and the optimal segments in parallel by an integrated energy minimization algorithm. Object interactions are handled differently within a given configuration segment and between two different segments, allowing us to model adaptive object neighborhoods. In this way, we can simultaneously use strong alignment or spectral


similarity constraints within a group, while the coherent segments may even have irregular, thin, or elongated shapes.

As will be shown in the following sections, the EMPP model has a complex structure, with several general and task-specific components mixed together in a unified framework. Practical experience shows that for such composite, application-dependent models, the adaptation to another application domain is rarely straightforward, and usually a significant amount of modeling work and code (re-)implementation is needed to transform or modify the framework for a different field. As an important novelty of our present method, after collecting similar, connected tasks appearing in different areas, we address them with a joint methodological approach. We provide here a formal problem statement and introduce a novel general three-level MPP framework which enables us to handle a wide family of applications. The structure elements and the energy optimization algorithm of the complex model are defined and implemented at the abstract level, while we focus on ensuring very simple interfaces to the different applications, providing flexible options of domain adaptation for end-users.

The development of the EMPP model comprised three phases. First, we proposed an Automatic Optical Inspection (AOI) method for PCB validation, introducing a new parent–child object relationship in the conventional MPP framework [10, 11, 20]. Second, we designed a two-level MPP model focusing on the joint extraction of vehicles and groups of corresponding vehicles within a traffic scenario from aerial Lidar data [36, 37]. Both models have been thoroughly validated on real measurements, and the methodological improvements have been demonstrated versus earlier approaches from the literature. In the third phase, we have connected the two models and defined the general three-layer EMPP framework containing the object group—object—object part levels [13, 14]. Thereafter, we gave proof-of-concept examples of how the new three-layer model can be adapted to the targeted AOI, traffic monitoring, and built-in area estimation applications. For a compact and less redundant presentation, we start in this chapter with the introduction of the general EMPP model, and detail the application-specific contributions regarding the AOI and traffic monitoring tasks thereafter.

6.2 Problem Formulation and Notations

Since our primary aim is to model hierarchical scene content, the proposed Embedded Marked Point Process (EMPP) framework has a multi-layer structure, which is demonstrated in Fig. 6.1. We have a super node at the top, called the population or the configuration, which represents a high-level model of the scene. A population may include an arbitrary number of object groups, where each group is composed of one or several parent objects. Finally, the parent objects may be connected to any number of child objects.


Fig. 6.1 Structure elements of the EMPP model. Left: a sample population with three object groups, and various object shapes both at parent and child layers. Right: The multi-layer structure of the model featuring the encapsulation relation

Following the notations introduced in Chaps. 2 and 5, the input of the EMPP method is an image over a pixel lattice S, and s ∈ S denotes a single pixel. As the first extension of conventional MPP models, each parent object u ∈ H may contain a set of child objects Q_u = {q_u^1, ..., q_u^{m(u)}}, where m(u) ≤ m_max and q_u^i ∈ H. Q_u = ∅ marks that u has no child. Let us denote by H_Q the parameter space of all possible Q_u vectors. Both the parent and child objects are represented by plane figures from a priori defined shape libraries.

As for the second level of the proposed object hierarchy, we introduce the object grouping process. A given population, denoted by ω, is a set of k object groups (also referred to later as configuration segments), ω = {ψ_1, ..., ψ_k}, where each group ψ_i (i = 1 ... k) is a configuration of n_i objects:

$$\psi_i = \{u_i^1, \ldots, u_i^{n_i}\} \in (\mathcal{H} \times \mathcal{H}_Q)^{n_i}. \qquad (6.1)$$

Here we prescribe that ψ_i ∩ ψ_j = ∅ for i ≠ j, while the number of sets k and the set cardinalities n_1, ..., n_k may be arbitrary (and initially unknown) integers. We mark with u ≺ ω that u belongs to some ψ in ω, i.e. ∃ψ_i ∈ ω : u ∈ ψ_i. Let us denote by N_u(ω) the proximity-based neighborhood of u ≺ ω, which is independent of the group level:


$$N_u(\omega) = \{ v \prec \omega : u \sim v \}.$$

Finally, we denote by Ω the space of all possible global configurations, which is constructed as:

$$\Omega = \bigcup_{k=0}^{\infty} \left\{ \{\psi_1, \ldots, \psi_k\} \in \Big( \bigcup_{n=1}^{\infty} \Omega_n \Big)^{k} \right\}, \qquad (6.2)$$

where Ω_n = { {u_1, ..., u_n} ∈ (H × H_Q)^n }. In this way, we consider that each population ω ∈ Ω may include any number of groups composed of any number of objects and child objects.
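To make the hierarchy concrete, the following sketch (hypothetical class names; a simplification of the formal definition above, not the book's implementation) represents a population as a list of groups, each group holding parent objects that in turn hold their child objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ChildObject:                 # q in Q_u, e.g. a chimney rectangle
    center: Tuple[float, float]
    size: Tuple[float, float]
    theta: float

@dataclass
class ParentObject:                # u in H, e.g. a building footprint rectangle
    center: Tuple[float, float]
    size: Tuple[float, float]
    theta: float
    children: List[ChildObject] = field(default_factory=list)   # Q_u

@dataclass
class Group:                       # psi, a configuration segment
    members: List[ParentObject] = field(default_factory=list)

Population = List[Group]           # omega = {psi_1, ..., psi_k}
```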

6.3 EMPP Energy Model

The EMPP energy function is derived with minor modifications of the basic formula (2.21):

$$\Phi(\omega) = \Phi_d(\omega) + \gamma \cdot \Phi_p(\omega), \qquad (6.3)$$

where Φ_d(ω) = Σ_{u∈ω} A(u) is the unary term and Φ_p(ω) is the prior interaction term. The unary term construction process follows the same approach as introduced in Sect. 5.2.2.2 of Chap. 5, with the only difference that in the EMPP model, the A(u) data energy function is decomposed into a parent term φ_d^p(u) and child terms φ_d^c(u, q_u). As indicated by the notation, a child term may depend both on the local image data and on the geometry of the parent object (e.g. an intensity histogram within the parent region). Both the parent and the child level energy components are derived according to the earlier introduced schema: first, fitness features are derived to characterize the efficiency of the generated super-/sub-object candidates, respectively. Second, the features are mapped with the nonlinear M(f, d_0, D) function to obtain the energy subterms corresponding to the f feature. Third, the joint data energy of object u is derived by combining averaging, max and min operators, using the multiple prototype definition strategies presented in Sect. 5.2.2.2. The complete unary term of u is the sum of the parent level terms and the child level terms:

$$A(u) = \varphi^{p}_d(u) + \sum_{q_u \in Q_u} \varphi^{c}_d(u, q_u). \qquad (6.4)$$

The interaction terms implement geometric or feature-based interaction constraints between different objects, child objects, and object groups of ω:

$$\Phi_p(\omega) = \underbrace{\sum_{u \sim v} I_p(u, v)}_{\text{parent--parent interaction}} \;+\; \underbrace{\sum_{u \prec \omega} I_c(u, Q_u)}_{\text{parent--child interaction}} \;+\; \underbrace{\sum_{u, \psi} I_g(u, \psi)}_{\text{parent--group interaction}}. \qquad (6.5)$$


Fig. 6.2 Examples for the I_c(u, Q_u) parent–child prior interaction terms: a non-overlapping siblings, b child encapsulation

First, the I_p(u, v) terms provide classical pairwise interaction constraints; for example, we can use the common intersection term defined by Eq. (2.22), which penalizes overlapping objects:

$$I_p(u, v) = \frac{\#\{R_u \cap R_v\}}{\#\{R_u \cup R_v\}}.$$

Second, the I_c(u, Q_u) terms define interaction constraints for corresponding parent and child objects, and interactions between different child objects of the same parent. As examples, the model can prescribe that siblings (i.e. children of a given parent) shall not overlap with each other (see Fig. 6.2a), that the figure of the child should be included in the figure of the parent (Fig. 6.2b), or that the sibling objects must have the same shape type, similar color/intensity histograms, sizes, orientation parameters, etc.

Third, with the I_g(u, ψ) energy components, one can define various constraints between the object groups and the (parent) objects in the scene. To measure whether an object u appropriately matches a population segment ψ, we construct a distance function d_ψ(u) ∈ [0, 1], where d_ψ(u) = 0 means a high quality match. Usually, we expect that the segments are spatially connected; therefore, we use a constant high distance factor if object u does not have any neighbors within ψ considering the ∼ relation. Thus, we define a modified distance metric as follows:

$$\hat{d}_\psi(u) = \begin{cases} 1 & \text{if } \nexists\, v \in \psi \setminus \{u\} : u \sim v, \\ d_\psi(u) & \text{otherwise.} \end{cases} \qquad (6.6)$$

if u ∈ ψ dˆψ (u) ˆ 1 − dψ (u) if u ∈ / ψ.

(6.8)


6.4 Multi-level MPP Optimization

For optimizing the energy function of Eq. (6.3), we extend again the Multiple Birth and Death (MBD) [54] algorithm. To accommodate the requirements of the EMPP energy function, the main task is to insert the group assignment, object re-grouping, and child maintenance processes into the original MBD framework. On one hand, after each birth step, the generated object should be assigned to a new or an existing group. Then, following the death procedure, we execute a new step, called Group Rearrangement, which may re-direct some objects to neighboring object groups based on data-driven and prior soft-constraints. On the other hand, in the last step of an iteration, called Child Maintenance, we may add, remove, or replace child objects for each parent.

As already discussed in Chap. 5, efficient object proposal strategies can significantly speed up MPP energy optimization algorithms. While the Feature-Based Birth Process (FBB) introduced in Sect. 5.2.4 of Chap. 5 proved to be efficient for the building detection application, we have also introduced an extended schema, called the Bottom-Up Stochastic Entity Proposal (BUSEP) method, which has been successfully applied first to Printed Circuit Board (PCB) inspection [20], then to Lidar-based vehicle detection applications. The BUSEP algorithm is executed as a preprocessing step of the iterative optimization, where we assign to the different image pixels the following attributes:

• pseudo probability values that the pixel is an object reference point (e.g. the center of an ellipse), and
• narrow distributions for all object parameters (including orientation and side/axis length parameters) expected around the given pixel, based on a deterministic object candidate extraction procedure.

During the generation of a new object in the birth step, we follow the distributions of the expected parameters, a strategy that results in significantly faster generation of efficient candidates. On the other hand, similar to the birth maps [54] and FBB strategies, the entity proposal maintains the reversibility of the iterative evolution process of the object population [177], instead of implementing a suboptimal greedy algorithm. One can use as input of BUSEP a binary foreground mask obtained from the input image by a task-specific deterministic segmentation algorithm, e.g. thresholding or deep learning-based semantic segmentation, which realizes a coarse separation of the parent or child objects from the background. The pseudo code of the proposed new Multi-level Multiple Birth-Death-Maintenance (MM BDM) algorithm is detailed in Sect. 6.6 (Fig. 6.17).


6.5 Applications of the EMPP Model For adopting the proposed EMPP model to different applications, it is necessary to implement its clearly defined interfaces, which process consists of specifying the following issues: Model elements: parent/child objects and object groups are semantically defined. The shape libraries for parent/child objects are fixed a priori, and possibly further domain specific constraints need to be defined such as a maximum number for siblings having the same parent. Unary terms: the application dependent f features and feature integration rules are p defined, which are used to calculate the parent level ϕY (u) and child level ϕYc (u, qu ) unary terms. Parent–parent interactions: the calculation formula is elaborated for the I p (u, v) interaction term, which is defined between (spatially) neighboring parent objects. Parent–child interactions: between the corresponding parent a child objects the Ic (u, Q u ) interaction constraints are defined. Parent-group interactions: grouping constraints are defined using the calculation formula of the dψ (u) object-segment distance term. Object candidates for BUSEP: the BUSEP process (Sect. 6.4) is based on a deterministic object candidate extraction procedure, which is application specific. However, as demonstrated in Sect. 6.6, several steps of this process can be generalized to various problems. We emphasize hereby that all further model elements and algorithmic steps introduced in Sects. 6.2–6.4 are independent of the concrete application, which property is ensured during model implementation by a clear separation of the general and tasks specific source code components.

6.5.1 Built-in Area Analysis in Aerial and Satellite Images

As we introduced in Chap. 5, semantic analysis of built-in areas based on aerial and satellite images is an important task in several remote sensing applications, including cartography, GIS data indexing and searchable storage, and disaster preparedness and response. Many of the earlier published approaches focus on the extraction of separate buildings from the images [17]; however, as emphasized in [108], detecting regions of corresponding buildings (for example, a residential housing district) is also a highly relevant task in urban environment management or official surveillance (e.g. detecting illegally built objects). Moreover, telecommunication companies and city protection or infrastructure planning authorities also have to register and verify the presence of various objects on roofs, such as chimneys or parabolic antenna dishes, either to help market research by statistical analysis or to predict air pollution. For security reasons, the detection of illegal or irregular chimneys is also a critical task in city surveillance.


Fig. 6.3 Results of built-in area analysis, displayed at three different scales. Building groups are distinguished with different colors (purple: red roofs’ district, others: orientation-based groups); red markers denote the detected chimneys

Utilization of the EMPP model for urban area analysis requires high-resolution aerial or satellite images, where the above-detailed object and region-level information is available. For this reason, we demonstrate the approach on an aerial photo pair of 12 cm/pixel resolution, which was captured above a suburban area of Budapest, Hungary (see a representative sample region in Fig. 6.3). Next we specify the task-dependent model elements one after another.

Model components in built-in area monitoring: Parent objects are building footprint segments, which are modeled by 2D rectangles. Here we assume that from the bird's-eye view the shape of every building can be approximated either by a rectangle or by a few slightly overlapping rectangles. Children of the buildings in the model are columnar prominent objects of the roof, such as chimneys or satellite dishes, which are also described by rectangles for simplicity. For the sake of fluent wording, we simply refer to all child objects as chimneys in the remaining part of this section. Finally, we assume that the configuration segments are composed of corresponding buildings, like houses belonging to the same residential housing district (see Fig. 6.3a).

Unary terms of buildings and chimneys: Parent level unary terms are derived in the same way as introduced in Chap. 5: the energy function integrates feature information about roof color, roof edge, and shadow. The feature modeling process


Fig. 6.4 Built-in area analysis—features for chimney extraction

related to the child unary terms (i.e. chimneys or other column-shaped roof elements) is displayed in Fig. 6.4. Two main assumptions are used. First, the observed color values in chimney areas usually have smaller saturation components than the surrounding roof parts, a fact that can easily be demonstrated by applying a threshold operation on the saturation channel of the image in the HSV color space (Fig. 6.4c). Second, we can often observe the cast shadows of chimneys on the roofs, a phenomenon that can be handled similarly to shadow-based building localization at the parent object level. However, for gable or mansard roofs [114], illuminated and self-shadowed roof segments should be handled in different ways. Therefore, the parent object (i.e. roof) regions are segmented first by applying a region growing step (Fig. 6.4d), and a locally adaptive color model, based on the region histograms, is used in each roof segment. Figure 6.4e shows by blue and red overlays two chimneys and their shadow regions extracted in the above way. Through the child object's data term, the model favors the presence of chimney pixels within the object mask, and it also expects shadow regions in the neighboring roof segments considering the actual shadow direction. Various examples of detected chimneys are displayed in Figs. 6.4 and 6.18.

Parent–child terms J(u, Q_u): In a given roof, we expect non-overlapping chimney silhouettes from the top view, with similar orientation angles. Child objects should be completely included in the parent rectangles (Fig. 6.3c).

Object-segment distance dψ(u): By examining various urban regions in aerial images, one can conclude that very different alignment patterns may occur within the roofs of different sorts of housing districts. For this reason, in this application the


Fig. 6.5 Built-in area analysis—types of used building groups

definition of the object-segment distance term should be fixed either with machine learning, or on a case-by-case basis. As a first example, in several urban areas we can find many distinct building groups which are formed by regularly aligned houses with parallel or perpendicular footprints. On the other hand, we may also see large building groups (e.g. the purple group in the center of Fig. 6.3a), where the roof orientations are irregular, but the roof colors are nearly uniform. Finally, some smaller family houses and large condominiums can be present in the same region, which can also be used as a basis for grouping. In our implementation, we distinguished three building group prototypes:

• If ψ is an alignment-based group (Fig. 6.5a), dψ(u) is proportional to the angle difference between u and the mean angle within ψ.
• If ψ is a color group (Fig. 6.5b), dψ(u) evaluates the match between the color histogram of u and the ψ group's estimated color distribution, a parameter which is set during the system configuration.
• For separating individual houses from larger condominiums, the roof size and the side length ratios are efficient discriminative descriptors (Fig. 6.5c).
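A possible way to turn the three group prototypes above into a single dψ(u) value is sketched below; the histogram comparison, the normalization constants, and the data structures are illustrative assumptions rather than the exact terms used in the model.

import numpy as np

def d_psi_building(u, group):
    """Object-group distance for the three building group prototypes.
    u: dict with 'angle' (deg), 'hist' (normalized color histogram),
       'area' and 'side_ratio'; group: dict with 'type' and reference statistics."""
    if group["type"] == "alignment":
        # Angle difference to the group mean, normalized to [0, 1]
        # (rectangle orientations are taken modulo 90 degrees).
        diff = abs(u["angle"] - group["mean_angle"]) % 90.0
        return min(diff, 90.0 - diff) / 45.0
    if group["type"] == "color":
        # Bhattacharyya-like mismatch between the roof color histograms.
        bc = np.sum(np.sqrt(u["hist"] * group["ref_hist"]))
        return 1.0 - float(bc)
    if group["type"] == "size":
        # Separate family houses from condominiums by area and side ratio.
        return 0.5 * (abs(u["area"] - group["mean_area"]) / group["mean_area"]
                      + abs(u["side_ratio"] - group["mean_ratio"]))
    return 1.0

# Example: a roughly axis-aligned house tested against an alignment-based group
house = {"angle": 12.0, "hist": np.ones(16) / 16, "area": 90.0, "side_ratio": 1.5}
print(d_psi_building(house, {"type": "alignment", "mean_angle": 10.0}))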

6.5.2 Traffic Monitoring Based on Lidar Data

Traffic surveillance in smart cities requires an automatic and hierarchical approach: apart from detecting individual vehicles, we need to identify groups of corresponding vehicles, like cars parked on a given side of the street, or a vehicle queue waiting in front of a traffic light. (Henceforward we call these coherent groups of vehicles traffic segments.) On the other hand, detecting some relevant parts of the vehicles—such as windshields—provides valuable information for their categorization or maneuver analysis. In this section, we use 3D remote sensing data collected by either an airborne Lidar laser scanner or a car-mounted mobile laser scanning (MLS) system, which can provide sparse or dense point clouds complemented with intensity or RGB information. Working with the aerial data, the low resolution of the considered point cloud measurements (max. 8 points/m²) allows only a coarse extraction of the vehicle shapes. Even in these cases, as shown in Fig. 6.6, the windshields are


Fig. 6.6 a Vehicle appearances in raw triangulated Lidar data (intensity-based coloring was used), b calculation of the data model features

quite clearly observable, and we can separate them from the vehicle body by a joint consideration of the geometry and intensity measurements. Windshield extraction may have various practical applications, such as estimation of the vehicle orientation or driving direction, but it can also contribute to the classification of different vehicle types. The very high-resolution MLS data (Fig. 6.9) preserves several fine details compared to aerial scans; however, ghost objects, occlusions, and invisible object parts may cause notable challenges for the scene analyzing algorithms.

The EMPP implementation for multi-level traffic surveillance is based on a two-step method for Lidar-based vehicle detection [37]. That approach takes as input a 3D point cloud, which is segmented first into vehicle and background classes, and this classification is used as a coarse input for the object detector. Thereafter, points with the corresponding class labels and intensity values are projected to an adaptively estimated ground plane, yielding a two-channel 2D image where the optimal vehicle and traffic segment configuration is modeled by a rectangle population. A sample class label map extracted from aerial data is demonstrated in Fig. 6.7a, while an intensity map projected from an MLS data segment is displayed in Fig. 6.9d.

Model elements: parent objects model vehicles, child objects describe windshields, where both object shapes are rectangles. Configuration segments are vehicle groups in different traffic situations (Fig. 6.7a).

Parent unary terms (ϕY^p): following the approach of [37], the following descriptors are calculated for vehicle detection (see Fig. 6.8):

• The vehicle evidence (f_ve) and intensity (f_it) descriptors are taken as the number of vehicle classified pixels within the proposed object rectangle in the label and intensity maps, normalized by the area of the rectangle.
• The external background (f_eb) descriptor is the relative occurrence of pixels classified as background within the proposed u object's neighboring regions.

The φ_ve, φ_it and φ_eb energy terms are calculated following Eq. (5.2), similarly to the built-in area analysis application. The final data energy term of object u is taken as

ϕY^p(u) = max( min(φ_it(u), φ_ve(u)), φ_eb(u) )                  (6.9)

where we consider that, due to the usage of various polishing materials, not all vehicles necessarily appear as bright blobs in the intensity map.
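As an illustration of Eq. (6.9), the sketch below fuses the three descriptors of a candidate rectangle. The M(·) mapping is only a schematic stand-in for the feature-to-energy function of Eq. (5.2) (negative, i.e. attractive, above the acceptance threshold d0 and positive below it), the threshold values are arbitrary, and we assume that a high external background ratio indicates a well-separated candidate.

def M(f, d0, D):
    # Schematic stand-in for the feature-to-energy mapping of Eq. (5.2):
    # negative (attractive) for feature values above the acceptance threshold
    # d0, positive (repulsive) below it, clipped to [-1, 1] by D.
    if f >= d0:
        return max(-1.0, (d0 - f) / D)
    return min(1.0, (d0 - f) / D)

def parent_unary_vehicle(f_ve, f_it, f_eb, d0=(0.6, 0.5, 0.7), D=(0.2, 0.2, 0.2)):
    """phi_Y^p(u) of Eq. (6.9): a candidate is attractive if the vehicle-evidence
    OR the intensity descriptor supports it (inner min), provided that its
    surroundings also look like background (outer max)."""
    phi_ve = M(f_ve, d0[0], D[0])
    phi_it = M(f_it, d0[1], D[1])
    phi_eb = M(f_eb, d0[2], D[2])
    return max(min(phi_it, phi_ve), phi_eb)

# Example: strong vehicle evidence, dark (unpolished) car body, mostly background around
print(parent_unary_vehicle(f_ve=0.9, f_it=0.3, f_eb=0.85))   # negative -> accepted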


Fig. 6.7 Sample results on traffic analysis. Super rectangles mark the detected vehicles, different colors correspond to the different groups. In the background, gray levels refer to the input label map: white—vehicle candidates, light gray—road, dark gray—roof. a cars and traffic segments b selected region with the detected windshields c intensity map of a selected car and the detection result

Fig. 6.8 Vehicle detection from airborne Lidar data: calculation of the data model features


Fig. 6.9 Processing workflow for Mobile Laser Scanning (MLS) data. a Used scanner ©Budapest Közút Zrt., b input point cloud scene, c estimated vehicle regions by point cloud classification—two selected segments are highlighted from different viewpoints, d EMPP detection results

Child unary terms (ϕY^c): since windshields are made of glass, their top-view bounding rectangles cover regions either without any laser returns, or containing only a few Lidar points with low associated intensity values (Figs. 6.6 and 6.7c). Here the descriptive features are defined by coverage ratios, similarly to the parent level descriptors.

Parent–child terms J(u, Q_u): the windshield rectangle is completely included in the car's top-view bounding box, and its orientation is perpendicular to the car's main symmetry axis (Fig. 6.7c).

Object-segment distance dψ(u): the distance value depends on the modeled traffic situation. In our implementation, we prescribe that the vehicles of the same segment have similar orientations, and that they form regular rows. We calculate dψ(u) as the average of two energy terms. The first one is the normalized angle difference between u and the mean angle among the vehicles of group ψ (see Fig. 6.10a-left). To derive the second term, we fit one or a few parallel lines to the object centers within ψ using RANSAC, and then calculate a normalized distance between the center of u and the closest line from the extracted line set (Fig. 6.10a-right). A more general solution for this feature, which also deals with curved road segments, can be found in [37].
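The following sketch illustrates the dψ(u) calculation for a traffic segment with a simple RANSAC line fit to the member centers; for brevity a single line is fitted instead of a set of parallel lines, and the normalization constants are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

def point_line_dist(pts, p, d):
    # Perpendicular distance of 2D point(s) from the line through p with unit direction d
    rel = np.atleast_2d(pts) - p
    return np.abs(rel[:, 0] * d[1] - rel[:, 1] * d[0])

def ransac_line(points, n_iter=200, tol=0.5):
    """Fit a single 2D line to the group members' centers (simplification of
    the multi-line fitting mentioned above)."""
    best, best_inliers = None, -1
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        d = points[j] - points[i]
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue
        d = d / norm
        inliers = int(np.sum(point_line_dist(points, points[i], d) < tol))
        if inliers > best_inliers:
            best, best_inliers = (points[i], d), inliers
    return best

def d_psi_vehicle(u, group_objs, max_dist=3.0):
    """Average of the normalized angle difference (Fig. 6.10a-left) and the
    normalized distance of u's center from the fitted line (Fig. 6.10a-right)."""
    angles = np.array([o["angle"] for o in group_objs])
    centers = np.array([o["center"] for o in group_objs], dtype=float)
    diff = abs(u["angle"] - angles.mean()) % 180.0
    angle_term = min(diff, 180.0 - diff) / 90.0
    p, d = ransac_line(centers)
    dist = point_line_dist(np.array(u["center"], dtype=float), p, d)[0]
    return 0.5 * (angle_term + min(dist / max_dist, 1.0))

# Example: a row of parked cars and a candidate slightly off the row
row = [{"center": (float(x), 0.1 * (x % 2)), "angle": 90.0} for x in range(5)]
print(d_psi_vehicle({"center": (2.5, 0.3), "angle": 85.0}, row))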


Fig. 6.10 Grouping energies for a traffic monitoring and b printed circuit analysis applications. Favored (✓) and penalized (×) sub-configurations within an object group

6.5.3 Automatic Optical Inspection of Printed Circuit Boards

Reliable quality analysis of Printed Circuit Boards (PCBs) is a key task in the manufacture of electronic devices, an issue which is often addressed with automatic optical inspection (AOI) techniques. Template-free methods are frequently adopted for examining uniquely designed circuits. Circuit Elements (CEs) of similar shape and orientation often form groups in PCBs which implement a given functionality; therefore, understanding the operation of the board requires clustering the CE population. In addition, filtering out faulty PCBs by AOI is another critical problem. Nowadays, the most widespread assembly technology of electronic circuit modules uses reflow soldering [109]. Here a frequent problem, called scooping, can occur during manufacturing, which influences the strength of solder joints in stencil prints [20]: a board should be discarded if the overall volume of such artifacts surpasses a given threshold. In an AOI image, a scoop can be recognized as a bright patch surrounded


Fig. 6.11 PCB inspection: Feature demonstration for unary term calculation

by a darker ring within the solder paste, as displayed in Fig. 6.11a. Automatic scoop detection is quite challenging due to the varying contrast of AOI images. Next, we detail the steps of the EMPP implementation for the AOI-based PCB inspection task.

Model elements: Parent objects are variously shaped CEs. Child objects are the scoops, which are modeled by pairs of concentric ellipses. Groups are formed by CEs which belong to a given functional unit.

Parent unary terms (ϕY^p): CEs are modeled [10, 11] as bright rectangles, ellipses or triangles surrounded by dark background regions. To calculate the contrast between the CEs and the background, we approximate the Bhattacharyya [54] distance d_B(u) between the pixel intensity histograms of the internal CE regions and their boundaries (see Fig. 6.11a). Then the ϕY^p(u) unary term is obtained by mapping d_B(u) with the M function (see Eq. (5.2)).

Child unary terms (ϕY^c): as introduced in [20], we distinguish three different regions of each scoop: a central bright spot, a darker median ring and a bright external ring, as demonstrated in Fig. 6.11b. Based on our experience, the gray level histogram λ_q^c(x) of a real scoop's (q) central region can be modeled by a skewed


Fig. 6.12 Results of PCB analysis. CEs are grouped by shape and orientation, scoops are extracted within the CEs

distribution, while the histograms of the median and external regions (λ_q^m(x) and λ_q^e(x)) follow Gaussian densities. Next we denote by μ_q^c, μ_q^m and μ_q^e the peak locations of the smoothed λ_q^c(x), λ_q^m(x) and λ_q^e(x) histogram curves. We characterize an efficient scoop candidate by three constraints:

• a high μ_qu^c value,
• an intensity ratio μ_qu^c / μ_qu^m greater than a prescribed contrast threshold d^cm,
• an intensity ratio μ_qu^e / μ_qu^m greater than a prescribed contrast threshold d^em.

To ensure the simultaneous fulfillment of the above constraints, the child's data energy term is derived by applying the maximum operator (logical AND) to the energy subterms corresponding to the three constraints. To calculate the final energy term, the M function is used here again:

ϕY^c(u, q_u) = max( M(μ_qu^c, d^c, D^c), M(μ_qu^c / μ_qu^m, d^cm, D^cm), M(μ_qu^e / μ_qu^m, d^em, D^em) )                  (6.10)
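A minimal sketch of the scoop data term of Eq. (6.10) is given below: the histogram peaks of the three concentric regions are estimated and the three contrast constraints are combined with the maximum operator. The histogram smoothing, the M(·) stand-in, and all threshold values are illustrative assumptions.

import numpy as np

def hist_peak(values, bins=64, smooth=3):
    """Location of the peak of a smoothed gray-level histogram."""
    hist, edges = np.histogram(values, bins=bins, range=(0, 256))
    hist = np.convolve(hist, np.ones(smooth) / smooth, mode="same")
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])

def M(f, d0, D):
    # Same schematic feature-to-energy mapping as used for the parent terms.
    return max(-1.0, (d0 - f) / D) if f >= d0 else min(1.0, (d0 - f) / D)

def scoop_child_unary(center_px, median_px, external_px,
                      d_c=140.0, D_c=40.0, d_cm=1.2, D_cm=0.3, d_em=1.15, D_em=0.3):
    """phi_Y^c of Eq. (6.10): all three contrast constraints must hold,
    hence the maximum (logical AND in the energy domain) of the subterms."""
    mu_c = hist_peak(center_px)
    mu_m = hist_peak(median_px)
    mu_e = hist_peak(external_px)
    return max(M(mu_c, d_c, D_c),
               M(mu_c / mu_m, d_cm, D_cm),
               M(mu_e / mu_m, d_em, D_em))

# Example with synthetic pixel samples: bright center, darker ring, bright exterior
rng = np.random.default_rng(2)
c = rng.normal(190, 10, 400); m = rng.normal(120, 10, 800); e = rng.normal(160, 10, 800)
print(scoop_child_unary(c, m, e))   # negative value -> attractive scoop candidate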

Parent–child terms J(u, Q_u): as a consequence of the technological constraints of the manufacturing process, no more than a single scoop can appear in a solder paste,


thus every parent CE may have a maximum of one child, whose region must be fully encapsulated by its parent solder paste.

Object-segment distance dψ(u): in a given CE group, the circuit elements have the same shape type, and they follow strongly regular alignment patterns (Fig. 6.10b). For this reason, we use

• dψ(u) = 1 if the type of u, tp(u), is not equal to the type of the group ψ,
• otherwise dψ(u) is the maximum of the angle difference and symmetric distance terms, which were already defined in Sect. 6.5.2 (see the traffic monitoring application).

6.6 Implementation Details

This section presents some implementation details of Embedded Marked Point Processes (EMPP). First, we deal with the Bottom-Up Stochastic Entity Proposal (BUSEP), which is a highly application dependent part of the EMPP workflow. As introduced in Sect. 6.4, instead of applying fully random sampling, we construct a data-driven stochastic entity generation scheme, which proposes relevant parent objects with higher probability based on various image features. A concrete implementation of the BUSEP algorithm, developed for the Printed Circuit Board (PCB) analysis application, can be followed in Figs. 6.13, 6.14, 6.15 and 6.16. In this case, we have to deal with variously shaped and scaled circuit elements (CEs): rectangles, ellipses, and triangles, while the size of the CEs can be notably different (see Fig. 6.13a); these factors significantly increase the size of the parameter space.

In the preprocessing step a binary foreground mask B is used, which is derived by Otsu's thresholding method from the input image. This step implements a coarse

Fig. 6.13 Steps of the bottom-up entity proposal process


Fig. 6.14 Pseudo code of the Circuit Element (CE) Candidate Generation algorithm, used in the Bottom-Up Stochastic Entity Proposal (BUSEP) process during PCB analysis

Fig. 6.15 Pseudo code of BUSEP parameter map estimation for PCB analysis


Fig. 6.16 Pseudo code of Parent Object (solder paste) generation used in the PCB analysis application

separation of the circuit elements (i.e. foreground areas) from the background regions of the board. However, due to noise and contrast issues, such a mask often cannot be used for reliable CE separation and for the estimation of their shapes. Moreover, some nearby CEs can also be merged into a single blob in the foreground mask. Our proposed preprocessing algorithm starts with CE candidate generation based on the foreground mask, as described in Algorithm 6.1 (see Fig. 6.14). Thereafter, based on these CE candidates, probabilistic parameter maps (i.e. extended birth maps) are calculated for the BUSEP process, a step which is detailed in Algorithm 6.2 (Fig. 6.15). Finally, parent object generation—which uses the above parameter maps—is realized by Algorithm 6.3 (Fig. 6.16). For the building detection and traffic analysis applications, the object candidate extraction step obviously needs different features. Alternatively, deep learning-based semantic image segmentation techniques [122] may also be adopted to robustly extract the above-mentioned foreground mask B. The remaining steps can be highly similar to the ones presented in Algorithms 6.2 and 6.3, with the simplification that for buildings and vehicles only rectangular object candidates should be considered.

Finally, we give the pseudo code of the Multi-level Multiple Birth-Death-Maintenance optimization in Algorithm 6.4 (Fig. 6.17). Notice that this algorithm is largely general, although the parent object generation step is naturally application dependent; for example, in the PCB analysis task one can use Algorithm 6.3.
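The preprocessing idea can be illustrated with a short OpenCV-based sketch that derives the Otsu foreground mask and turns each sufficiently large blob into a coarse object candidate with narrow parameter ranges, in the spirit of Algorithms 6.1–6.2; it is a simplified illustration with assumed parameter values, not the algorithms of Figs. 6.14–6.15.

import cv2
import numpy as np

def ce_candidates(gray, min_area=30):
    """Coarse circuit-element candidates from a grayscale AOI image.
    For each blob, a reference point and narrow parameter ranges are returned,
    which a BUSEP-style birth step could sample from (illustrative only)."""
    # Otsu's method gives the binary foreground mask B
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    candidates = []
    for cnt in contours:
        if cv2.contourArea(cnt) < min_area:
            continue                        # ignore tiny noise blobs
        (cx, cy), (w, h), angle = cv2.minAreaRect(cnt)
        candidates.append({
            "center": (cx, cy),             # pseudo birth position
            "size_range": ((0.8 * w, 1.2 * w), (0.8 * h, 1.2 * h)),
            "angle_range": (angle - 10.0, angle + 10.0),
        })
    return candidates

if __name__ == "__main__":
    # Synthetic test image with two bright rectangles on a dark board
    img = np.zeros((200, 200), np.uint8)
    cv2.rectangle(img, (20, 30), (80, 60), 200, -1)
    cv2.rectangle(img, (120, 100), (170, 160), 220, -1)
    print(len(ce_candidates(img)), "candidate blobs")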

6.7 Quantitative Evaluation Framework

For the experimental validation of the EMPP framework, it is essential to use relevant test data and to construct an efficient quantitative evaluation metric. Since none of the available public datasets proved to be appropriate for analyzing the discussed complex three-layer model, we have constructed a new database, called the EMPP


Fig. 6.17 Pseudo code of Multi-level Multiple Birth-Death-Maintenance (MM BDM) algorithm


Benchmark, which enables the numerical evaluation of multi-level object population analysis techniques in high-resolution images. By exploiting the manually verified Ground Truth (GT) data of the new benchmark, we also proposed an automatic validation methodology, which evaluates a given EMPP output configuration by comparing it to the GT, and calculates various fitness scores on the three different hierarchical levels.

The new EMPP Benchmark¹ is composed of several data sets. For each test case, Ground Truth (GT) data has been prepared, which allows the validation of the proposed hierarchical embedded model. The GT data records in the annotation files describe the relationships of objects, object groups and child objects within a population, using the same syntax for all applications. (Of course, different application fields require different semantic interpretations of the abstract models, as introduced in Sect. 6.5.) For recording and saving the GT information, a computer program with a graphical user interface has been created, which makes it easy to manually draw or edit GT configurations composed of various geometric objects, while also marking the parent–child relationships between the entities of the population. Using the application, one can also create new object groups, or assign a given parent object to an already existing group.

6.7.1 EMPP Benchmark Database

The EMPP Benchmark database includes the following input images with annotated Ground Truth information (see also Table 6.1):

1. Building detection: Budapest aerial image with 12 cm resolution (69 buildings, 79 chimneys), Manchester satellite image (50 cm res., 155 buildings) from the SZTAKI-INRIA Benchmark [17], and two Quickbird images (#2 and #11, 60–80 cm res., 218 buildings) from the dataset by A.O. Ok [130].
2. Traffic analysis: the dataset contains aerial Lidar point clouds, and, from a smaller region, mobile laser scanning (MLS) data samples (for proof-of-concept evaluation):
   • Aerial data: 6 point cloud segments from Budapest, Hungary, dense urban regions, 792 vehicles (scanner: Optech ALTM Gemini 167, point density: 8 pts/m²) [37].
   • MLS data: 2 point cloud segments from Budapest, Hungary, dense urban regions, 42 vehicles (scanner: Riegl VMX-450 mobile mapping system).
3. Optical circuit board analysis: 44 printed circuit board images of 6 µm resolution, containing 4439 CEs and 664 scooping errors [20].

¹ Website: http://mplab.sztaki.hu/%7Ebcsaba/EMPPBenchmark.html.


Table 6.1 Dataset parameters

Application               | Input                     | Resolution         | Covered area | Parent objects   | Child objects    | Child/parent    | Groups/image
Building analysis         | Rem. sens. RGB image      | 0.12–0.8 m/pixel   | 1.0 km²      | 442 buildings    | 79* chimneys     | {0, 1, 2, ...}  | 5–16
Traffic analysis (aerial) | Aerial Lidar point cloud  | 8 pts/m²           | 0.3 km²      | 817 vehicles     | 817 windshields  | 1               | 7–9
Traffic analysis (MLS)    | Mobile laser scan. data   | up to 7000 pts/m²  | 5700 m²      | 42 vehicles      | 42 windshields   | 1               | 3–5
PCB inspection            | Grayscale AOI image       | 6 µm/pix           | 1232 mm²     | 4439 circ. elem. | 664 scoops       | {0, 1}          | 3–7

* chimneys can only be reliably analyzed in the 12 cm resolution sample

6.7.2 Quantitative Evaluation Methodology

Comprehensive evaluation of the EMPP approach requires the analysis of the proposed model at multiple levels. For the different layers of the hierarchical approach, we defined different quality measures, ensuring that the values of all metrics are clear and can be calculated automatically from the output of the EMPP method and the recorded GT. In the parent object layer, we use both object-based and pixel-based accuracy rates in the same way as defined in Sect. 5.2.4. The Hungarian Algorithm (HA) is used again to make an optimal assignment between the detected object candidates and the GT objects. Next, the true positive (TP), false positive (FP), and false negative (FN) hits are counted, since the relative values of the TP, FP, and FN numbers well characterize the quality of object recognition. Apart from the object-based metrics, we perform a pixel level comparison between the binary silhouette masks of the detected parent objects and their matched GT objects, and calculate the Parent level Precision (PPr), Recall (PRc), and F-score (PFs) values. The child object extraction step is evaluated using object level metrics similar to those of the parent layer. As a minor difference, during the calculation of the Child level Precision (CPr), Recall (CRc), and F-score (CFs) values, only those matches between the detected and GT child objects are accepted whose parents are also correctly paired at the upper layer. In the last step of the validation, to evaluate the (parent) object grouping layer of the EMPP model, the correct Group Classification Rate (GR, %) is derived among the true positive parent object samples, based on the group classification information from the GT. The GR value calculation requires counting the number of correctly grouped objects (TG) and the number of falsely grouped objects (FG); thereafter GR is taken as the following ratio:

GR = TG / (TG + FG).
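For illustration, the object-level part of this evaluation can be sketched as follows, using the Hungarian algorithm of SciPy for the detection–GT assignment; the center-distance matching criterion and the toy data are simplifying assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def evaluate_parents(detections, gt, max_dist=2.0):
    """Object-level TP/FP/FN via Hungarian matching of object centers,
    plus the group classification rate GR = TG / (TG + FG) over the TPs.
    Each object is a dict with 'center' and 'group'."""
    if not detections or not gt:
        return {"TP": 0, "FP": len(detections), "FN": len(gt), "GR": 0.0}
    cost = np.array([[np.hypot(d["center"][0] - g["center"][0],
                               d["center"][1] - g["center"][1])
                      for g in gt] for d in detections])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    tp = len(matches)
    tg = sum(detections[r]["group"] == gt[c]["group"] for r, c in matches)
    gr = tg / tp if tp else 0.0
    return {"TP": tp, "FP": len(detections) - tp, "FN": len(gt) - tp, "GR": gr}

# Toy example: two detections, one with the wrong group label
gt = [{"center": (0, 0), "group": "A"}, {"center": (5, 0), "group": "B"}]
det = [{"center": (0.3, 0.1), "group": "A"}, {"center": (5.2, 0.2), "group": "A"}]
print(evaluate_parents(det, gt))   # TP=2, FP=0, FN=0, GR=0.5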


6.8 Experimental Results

We evaluated our method on the new EMPP Benchmark database. Various qualitative sample results of three-level object population detection are shown in Figs. 6.3, 6.7, 6.9 and 6.12, while Fig. 6.18 shows several examples of (child level) chimney extraction from the built-in area monitoring task. During the quantitative analysis, the obtained EMPP results were compared to the GT configurations of the benchmark, and the performance rates defined in Sect. 6.7.2 were calculated in each case. Table 6.2 shows the quantitative results at the parent and group levels, while Table 6.3 presents the child level performance of the proposed technique.

6.8.1 EMPP Versus an Ensemble of Single Layer MPPs

A key issue of the validation process is the comparative evaluation of the EMPP framework's methodological innovations versus earlier solutions used by conventional MPP techniques. As the baseline of the comparison, a classical sequential approach is considered, where in a first step the object population is extracted by a single layer MPP model (sMPP), using exactly the same unary terms and child detection algorithm as presented in the above described EMPP implementation. However, the prior term of the configuration energy is simplified: it only encapsulates the I(u, v) non-overlapping term and the J(u, Q_u) parent–child interaction component, while the parent-group energy term is set to zero (A(u, ψ) = 0). In a second (post-processing) step, parent objects are grouped by a recursive region-growing-like clustering process. This grouping step is initialized with a randomly selected unassigned object, then we assign all of its spatial neighbors to the same group if and only if

Fig. 6.18 Building analysis—sample results for chimney detection. True hits are marked by yellow circles, a false negative is highlighted in the third image of the upper row by a yellow rectangle. In the corners of the samples, the raw images of the chimney regions are displayed separately for visual verification


Table 6.2 Object and group level evaluation of the proposed EMPP model, and comparison to a conventional sMPP approach
(Parent level analysis: number of objects TP/FP/FN and pixel level accuracy % PRc/PPr/PFs; Group level: FG#, GR%)

App.                          | Method | TP   | FP | FN | PRc | PPr | PFs | FG# | GR%
Building analysis             | sMPP   | 406  | 24 | 36 | 80  | 75  | 78  | 58  | 14
                              | EMPP   | 417  | 14 | 25 | 84  | 88  | 86  | 28  | 7
Aerial traffic monitoring     | sMPP   | 792  | 30 | 25 | 79  | 77  | 78  | 202 | 25
                              | EMPP   | 793  | 30 | 24 | 82  | 85  | 83  | 43  | 5
Ground-based traffic analysis | sMPP   | 42   | 0  | 0  | 92  | 86  | 89  | 2   | 5
                              | EMPP   | 42   | 0  | 0  | 96  | 89  | 92  | 0   | 0
PCB inspection                | sMPP   | 4408 | 39 | 31 | 87  | 86  | 87  | 448 | 10
                              | EMPP   | 4415 | 9  | 24 | 92  | 97  | 94  | 137 | 3

Table 6.3 Child level evaluation of the proposed EMPP model

Application                   | CRc% | CPr% | CFs%
Building analysis             | 80   | 71   | 75
Aerial traffic monitoring     | 92   | 92   | 92
Ground-based traffic analysis | 93   | 93   | 93
PCB inspection                | 91   | 95   | 93

the difference between their orientations is smaller than a selected τ threshold value. This process is recursively repeated until all objects get a group label. As our qualitative and quantitative tests (presented in the following paragraphs) show, the bottleneck of the sequential approach is its critical dependence on the τ threshold, which cannot be set generally for a whole configuration containing many noisy objects.

Figure 6.19 and Table 6.2 demonstrate that the proposed EMPP method outperforms the classical sMPP approach in two important quality factors. Firstly, EMPP is significantly better in terms of the pixel-based rates (PRc, PPr, and PFs), which confirms that the extracted object shapes become much more accurate. Secondly, the number of falsely grouped objects (FG, GR) is largely reduced by using the EMPP technique. In particular, the single layer model often suffers from the inaccuracy of estimating object orientations based on the input feature maps: in the building analysis task the edge map might be noisy and unreliable, the vehicle monitoring approach is affected by the low resolution of the aerial Lidar point clouds, and in the PCB verification challenge the unexpected shape deformations of the solder pastes may make the estimation inaccurate. On the contrary, by using our proposed EMPP model, the object orientations are efficiently adjusted by considering the higher (group) level


Fig. 6.19 Qualitative validation of the sMPP and the EMPP configurations versus the Ground Truth (only the parent and group levels are displayed). Yellow ellipses mark grouping errors and purple ones false objects. In the building analysis case (row 1), groups of houses and condos are separated

alignment constraints. As Table 6.2 shows, the differences between the sMPP and EMPP performance are less significant when working with the high resolution and highly accurate Mobile Laser Scanning (MLS) data, where more reliable feature extraction can be performed in the initial phase from the input measurements. We note that in some specific cases, the sMPP output could also be enhanced by using pairwise orientation smoothing terms implemented by classical MPP models [115]. However, the proposed EMPP model offers a higher degree of freedom for the application layer than the earlier approaches: one can simultaneously consider various group level features and exploit interactions between corresponding, but not necessarily closely located objects. In our case, the only prescribed constraint is a regular alignment within the estimated object groups, while outlier labels can even be used to indicate unusual object behavior.
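To make the sMPP baseline's post-processing step concrete, the following sketch implements the τ-threshold region growing described at the beginning of this section on a toy neighborhood graph; it also illustrates why a single global τ is fragile when some orientation estimates are noisy.

def threshold_grouping(objects, neighbors, tau=10.0):
    """Baseline grouping of the sMPP post-processing step: region growing over
    spatial neighbors whose orientation differs by less than tau degrees.
    'neighbors' maps an object index to the indices of its spatial neighbors."""
    labels = {}
    next_label = 0
    for seed in range(len(objects)):
        if seed in labels:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:                       # recursive growth, done iteratively
            i = stack.pop()
            for j in neighbors.get(i, []):
                if j in labels:
                    continue
                diff = abs(objects[i]["angle"] - objects[j]["angle"]) % 180.0
                if min(diff, 180.0 - diff) < tau:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# A noisy orientation estimate (index 2) breaks the chain for a strict tau
objs = [{"angle": a} for a in (0.0, 4.0, 21.0, 6.0)]
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(threshold_grouping(objs, nbrs, tau=10.0))  # one real group is split into three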


6.8.2 Application Level Comparison to Non-MPP-Based Techniques

Another relevant point of the evaluation is the justification of using an MPP approach versus various alternative non-MPP-based techniques for the selected application domains. Regarding the building detection problem, a detailed state-of-the-art comparison has already been provided in Chap. 5, which demonstrated the advantages of the point process-based solution. Vehicle detection from airborne Lidar also has a broad literature. In our task specific paper [37], we compared our solution to the digital elevation map-based principal component analysis (DEM-PCA) technique [151], to an h-maxima suppression approach (h-max) [198] and to a Floodfill (Floodf)-based method [36]. Although the reference methods were chosen so that they provide complex and valid solutions for the vehicle detection task in general urban environments, we have also observed a number of limitations in each case. The main bottleneck of the DEM-PCA method is that it cannot robustly deal with noise and quantization artifacts of the estimated elevation maps. Moreover, low vegetation (bushes, tall grass) or different street objects may mislead the method, since the range of their elevation values often overlaps with the typical domain of the vehicle height parameters. During the test of the h-max method, issues similar to the problems mentioned in [198] have been observed: in crowded environments and parking lots, the h-max technique results in a quite noisy contour map, therefore many nearby objects are merged, while additional false alarms appear in vegetation areas. As for the Floodfill algorithm, we experienced that the 3D connected component search is sensitive to noisy data caused by occlusions, and here again the neighboring vehicles are often merged together. On the contrary, by using the proposed EMPP implementation, the 2D projection step proved to be an efficient noise filter, while the energy minimization-based approach of MPP does not require fully connected components for detecting a given vehicle. The quantitative comparison results shown in Table 6.4 confirm again the superiority of using MPP at the parent object level. The scooping detection problem investigated in the third application example is a strongly technology specific issue, which understandably does not have a wide bibliography. However, the advantages of the stochastic parent–child relationship approach in the EMPP can be convincingly illustrated in connection with the PCB quality analysis application. As a reference approach for scooping detection, a morphology-based technique—called Morph—has been implemented, which is based on two threshold operators applied to the input image: in the first step a lower threshold value is used, which results in a probably undersegmented binary solder paste candidate mask. Using the second threshold, the brightest image parts are extracted, which are supposed to encapsulate the scoop center areas. In the final step a verification procedure deletes the falsely detected scoop candidates (see further details in [11]). Table 6.5 shows the results of scooping detection by the deterministic Morph technique and the proposed stochastic EMPP approach, respectively. We can observe that the child level F-score of EMPP exceeds that of Morph by 20%.


Table 6.4 Traffic analysis evaluation vs. state of the art. Parent level F-scores (in %) by the PCA [151], h-max [198], Floodfill (Floodf) and the proposed EMPP methods

    |      | Object level F-score %        | Pixel level F-score %
Set | NV*  | PCA | h-max | Floodf | EMPP   | PCA | h-max | Floodf | EMPP
#1  | 191  | 78  | 78    | 88     | 97     | 63  | 63    | 66     | 82
#2  | 94   | 89  | 81    | 80     | 97     | 80  | 38    | 60     | 73
#3  | 170  | 85  | 87    | 91     | 96     | 77  | 76    | 85     | 74
#4  | 160  | 68  | 77    | 88     | 97     | 61  | 68    | 75     | 89
#5  | 110  | 48  | 79    | 92     | 98     | 37  | 61    | 82     | 84
#6  | 131  | 89  | 81    | 73     | 98     | 80  | 70    | 48     | 88
#7  | 153  | 80  | 90    | 88     | 93     | 60  | 76    | 65     | 88
All | 1009 | 77  | 82    | 86     | 97     | 66  | 65    | 71     | 83

*NV = Number of real Vehicles in the test set

Table 6.5 PCB inspection task: Comparison of the child level performance on scooping detection between the Morph technique and the proposed EMPP model

PCB insp. method     | TP  | FP  | FN  | F-score (%)
Morph technique [11] | 514 | 228 | 150 | 73
Proposed EMPP        | 629 | 65  | 35  | 93

6.8.3 Effects of Data Term Parameter Settings

As discussed in Chap. 5 and in Sect. 6.3, the most important application-dependent parameter, which significantly affects the performance of the method, is the d0^f object acceptance threshold value associated with the different features in the ϕ_d^p(u) unary term (and the similar thresholds of the child data terms). While in Sect. 5.4 we already investigated some parametrization issues for the building detection application, Fig. 6.20 demonstrates how critical the selection of an appropriate d0^f parameter value is in terms of discriminating true objects from false candidates in the Lidar-based vehicle detection task. In another experiment, we measured the dependence of the EMPP method's performance on the data term threshold parameter. Figure 6.21 displays, in the context of the traffic monitoring tasks, the obtained object level precision, recall and F-rate values as a function of the d0^f parameters corresponding to four selected features. As shown, the precision and recall curves indicate nearly monotonously increasing and decreasing characteristics, respectively, due to the adoption of fitness-like f(u) descriptors, where 'high' f values correspond to high quality object candidates. On the other hand, the F-rate plots are gentle curves with a single global maximum value, which in fact ensures a graceful degradation in case of minor deviations from the hypothetically optimal (but usually unknown) d0^f values.


Fig. 6.20 Histograms of the vehicle evidence and external background features for true and false training objects in the traffic monitoring task. Panels: vehicle evidence feature histograms of true objects and of random (false) rectangles, with the d0^ve acceptance threshold marked; external background feature histograms of true objects and of random (false) rectangles, with the d0^eb acceptance threshold marked

The interaction term I(u, v) of Eq. (6.5) has a non-maxima suppression effect by removing object candidates strongly overlapping with objects having low ϕ_d^p(u) unary energies; therefore, several suboptimal attractive objects will not appear as false detections.
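The suppression effect can be illustrated with a simple greedy filter over candidate boxes: a candidate is discarded if it strongly overlaps an already kept candidate with lower unary energy. Axis-aligned overlap is used here as a simplification of the rotated-rectangle intersection handled by the model, and the threshold is an illustrative assumption.

def overlap_ratio(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2);
    a simplification of the rotated-rectangle overlap used by the model."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def suppress(candidates, max_overlap=0.3):
    """Keep only candidates that do not strongly overlap a better (lower
    unary energy) candidate -- the effect the I(u, v) interaction term has
    during the energy minimization."""
    kept = []
    for c in sorted(candidates, key=lambda c: c["energy"]):
        if all(overlap_ratio(c["box"], k["box"]) <= max_overlap for k in kept):
            kept.append(c)
    return kept

cands = [{"box": (0, 0, 4, 2), "energy": -0.9},
         {"box": (0.5, 0.2, 4.5, 2.2), "energy": -0.4},   # redundant, weaker
         {"box": (10, 0, 14, 2), "energy": -0.8}]
print(len(suppress(cands)))   # 2 candidates survive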

6.8.4 Computational Time

The computational complexity of the proposed approach is largely determined by the implementation of the iterative Multi-level Multiple Birth-Death-Maintenance (MM BDM) optimization algorithm (Fig. 6.17). To keep the running time acceptable, an exponential temperature cooling strategy has been adopted, and we also exploited the Bottom-Up Stochastic Entity Proposal (BUSEP) process introduced in Sect.


Fig. 6.21 Effects of the change in the data term acceptance threshold values on the object level performance for the aerial traffic monitoring task (precision, recall and F-rate curves as a function of the d0^ve and d0^eb parameters)

Table 6.6 Average computational time and parent object number for sample images of the different application fields

               | Built-in | Aerial Traffic | PCB insp.
Avg. EMPP time | 17.8 s   | 11.1 s         | 21.7 s
Avg. sMPP time | 13.9 s   | 9.1 s          | 20.1 s
Avg. obj. num. | 110      | 136            | 100

6.4, using various application-dependent image descriptors [17, 20, 37]. This strategy ensures a quick convergence to a suboptimal solution, which proved to be efficient in the three investigated applications. For analyzing the running speed quantitatively on a standard desktop computer, we executed our proposed EMPP method and the sMPP reference technique multiple times, and we measured the mean computational time on one test image. The results provided in Table 6.6 show that the average computational time of EMPP is between 11 and 22 s in the different applications. For the built-in area analysis and aerial traffic surveillance tasks EMPP is only 20–30% slower than the less accurate sMPP. Moreover, the running time of the two methods is nearly identical for PCB analysis. The experiments also showed that the computational time is nearly independent of the number of objects, but it is correlated with the average pixel-based area of the parent objects, which parameter is larger for the building detection and PCB inspection tasks.


6.8.5 Experiment Repeatability

Due to the stochastic nature of the MM BDM algorithm, it is important to validate the stability and repeatability of the proposed iterative optimization method. Let us recall that in every main step of the process probabilistic operators modify the actual configuration, such as random birth, death, parameter perturbation, or redirection of objects between groups. Our experiments confirmed that the proposed framework produces notably stable output, as the resulting object populations are largely similar for each run. To support our statement about the method's repeatability, we have performed a quantitative experiment using an aerial Lidar point cloud segment, which contains 169 vehicles classified into 10 object groups. The experiment was performed 200 times in an independent manner with the same parameter settings and the same input data; the resulting multi-level object population of the stochastic process was compared to the Ground Truth each time. Table 6.7 shows the mean and standard deviation values of the measured error rates during this experiment. The survey confirms that at the parent object level the standard deviations of the measured TP/FN/FP values are less than 1 object, while the deviation of the pixel-based rates is less than 0.01 over the 200 test runs. We can observe that the object grouping module is also notably reliable. Although the scene considered here is particularly challenging due to the low resolution of the aerial Lidar (e.g. the real object sizes and orientations can only be inaccurately extracted from the sparse local point cloud data), the introduced object level grouping constraints can greatly improve the output result. Table 6.8 displays statistics about falsely grouped objects (FG) during the 200 trial runs: in almost all cases only 0–5 grouping errors occurred among the 169 (parent) objects, we experienced an FG error larger than 6 only in three out of the 200 experiments, and this error value was never greater than 20.

Table 6.7 Experiment repeatability for the vehicle detection task: Mean values and standard deviations of the measured error rates for 200 independent runs in the same aerial Lidar segment

     | TP    | FP   | FN   | PFR    | TG    | FG
Mean | 161.4 | 4.27 | 7.56 | 0.78   | 158.5 | 2.89
Dev  | 0.81  | 0.45 | 0.81 | 0.0077 | 2.37  | 2.24

Table 6.8 Distribution of the number of falsely grouped objects (out of 169 vehicles) in the 200-run experiment of Table 6.7

FG val. | 0  | 1  | 2  | 3  | 4  | 5  | 6 | 7–20 | 21+
Freq.   | 26 | 36 | 20 | 41 | 41 | 25 | 8 | 3    | 0


6.9 Conclusion

This chapter introduced a novel Embedded Marked Point Process (EMPP) model for the joint extraction of objects, object groups, and specific object parts from high-resolution digital images. The efficiency of the approach has been tested in three different application domains, and Ground Truth data has been prepared and published to enable quantitative evaluation. Based on the obtained results, we can confirm that the proposed EMPP model is able to handle real-world tasks from significantly different application areas, providing a Bayesian framework for multi-level image content interpretation.

Chapter 7

Concluding Remarks

This book has focused on various region level and object-based pattern recognition problems, which nowadays raise important challenges for experts in computer vision and machine perception. Stochastic Bayesian energy minimization techniques have been chosen as the basis for the introduced new methods, and improvements versus state-of-the-art approaches have been proposed in various aspects, including observation processing, combination of different model structures, and new spatial and temporal interpretation of up-to-date sensor measurements. While in several real-time applications the high computational cost of energy minimization methods may be a bottleneck for applying complex models, we have shown that by using appropriate dimension reduction techniques, combining stochastic and deterministic relaxation approaches, and utilizing prior knowledge-based rules, we can often obtain high-quality solutions in a computationally efficient manner. Although the application examples of the book cover a broad field, we have mainly focused on general problems appearing concurrently in various domains, exploring the possible applicability of the presented models under varying circumstances. The book paid particular attention to the connections between the theoretical results of established mathematical models and the applicability of the implemented methods using real-life measurement data collected from realistic scenarios. For this reason, the validation of the proposed models has mostly been experimental, and significant efforts have been devoted to test data collection, Ground Truth generation, and relevant quantitative comparison to state-of-the-art approaches targeting the same or similar problems.


References

1. OpenCV documentation. https://opencv.org/. 2. E. Aarts and J. Korst. Simulated Annealing and Boltzman Machines. John Wiley & Sons, New York, 1990. 3. D. A. Ausherman, A. Kozma, J. L. Walker, H. M. Jones, and E. C. Poggio. Developments in radar imaging. IEEE Trans. Aerosp. Electron. Syst., 20:363–400, 1984. 4. D. Baltieri, R. Vezzani, R. Cucchiara, Á. Utasi, C. Benedek, and T. Szirányi. Multi-view people surveillance using 3D information. In Proc. International Workshop on Visual Surveillance at ICCV, pages 1817–1824, Barcelona, Spain, 2011. 5. S. T. Barnard and W. B. Thompson. Disparity analysis of images. IEEE Trans. Pattern Anal. Mach. Intell., 2:333–340, 1980. 6. S. Ben Hadj, F. Chatelain, X. Descombes, and J. Zerubia. Parameter estimation for a marked point process within a framework of multidimensional shape extraction from remote sensing images. In ISPRS Technical Commission III Symposium on Photogrammetry Computer Vision and Image Analysis (PCV), Paris, France, 2010. 7. D. Benboudjema and W. Pieczynski. Unsupervised statistical segmentation of nonstationary images using triplet Markov fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(8):1367–1378, 2007. 8. C. Benedek. Novel Markovian Change Detection Models in Computer Vision. PhD thesis, Péter Pázmány Catholic University, Faculty of Information Technology, 2008. 9. C. Benedek. Efficient building change detection in sparsely populated areas using coupled marked point processes. In Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Honolulu, Hawaii, USA, 2010. 10. C. Benedek. Analysis of solder paste scooping with hierarchical point processes. In Proc. IEEE Int. Conf. Image Process. (ICIP), pages 2121–2124, Brussels, Belgium, 2011. 11. C. Benedek. Detection of soldering defects in printed circuit boards with hierarchical Marked Point Processes. Pattern Recognit. Lett., 32(13):1535–1543, 2011. 12. C. Benedek. 3D people surveillance on range data sequences of a rotating Lidar. Pattern Recognit. Lett., 50:149–158, 2014. Special Issue on Depth Image Analysis. 13. C. Benedek. Hierarchical image content analysis with an embedded marked point process framework. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pages 5147–5151, Florence, Italy, 2014. 14. C. Benedek. An embedded marked point process framework for three-level object population analysis. IEEE Trans. Image Process., 26(9):4430–4445, 2017.


15. C. Benedek. Képi alapú többszint˝u környezetelemzés (Image based multi-level environment analysis). Thesis for the Doctoral Degree of the Hungarian Academy of Sciences (D.Sc.), 2019. 16. C. Benedek, X. Descombes, and J. Zerubia. Building detection in a single remotely sensed image with a point process of rectangles. In Proc. Int. Conf. Pattern Recognit. (ICPR), Istanbul, Turkey, 2010. 17. C. Benedek, X. Descombes, and J. Zerubia. Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics. IEEE Trans. Pattern Anal. Mach. Intell., 34(1):33–50, 2012. 18. C. Benedek, B. Gálai, B. Nagy, and Z. Jankó. Lidar-based gait analysis and activity recognition in a 4D surveillance system. IEEE Trans. Circuits Syst. Video Technol., 28(1):101–113, 2018. 19. C. Benedek, Z. Jankó, C. Horváth, D. Molnár, D. Chetverikov, and T. Szirányi. An integrated 4D vision and visualisation system. In Proc. International Conference on Computer Vision Systems (ICVS), volume 7963 of Lecture Notes in Computer Science, pages 21–30. Springer, St. Petersburg, Russia, 2013. 20. C. Benedek, O. Krammer, M. Janóczki, and L. Jakab. Solder paste scooping detection by multi-level visual inspection of printed circuit boards. IEEE Trans. Ind. Electron., 60(6), 2013. 21. C. Benedek and M. Martorella. Moving target analysis in ISAR image sequences with a multiframe Marked Point Process model. IEEE Trans. Geosci. Remote Sens., 52(4):2234– 2246, 2014. 22. C. Benedek, D. Molnár, and T. Szirányi. A dynamic MRF model for foreground detection on range data sequences of rotating multi-beam Lidar. In Proc. International Workshop on Depth Image Analysis (WDIA), volume 7854 of Lecture Notes in Computer Science, pages 87–96. Springer, Tsukuba City, Japan, 2012. 23. C. Benedek, M. Shadaydeh, Z. Kato, T. Szirányi, and J. Zerubia. Multilayer Markov random field models for change detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens., 107:22–37, 2015. Special Issue on Multitemporal Remote Sensing Data Analysis. 24. C. Benedek and T. Szirányi. Markovian framework for structural change detection with application on detecting built-in changes in airborne images. In Proc. IASTED International Conference on Signal Processing, Pattern Recognition and Applications (SPPRA), pages 68–73, Innsbruck, Austria, 2007. 25. C. Benedek and T. Szirányi. Bayesian foreground and shadow detection in uncertain frame rate surveillance videos. IEEE Trans. Image Process., 17(4):608–621, 2008. 26. C. Benedek and T. Szirányi. A mixed Markov model for change detection in aerial photos with large time differences. In Proc. Int. Conf. Pattern Recognit. (ICPR), Tampa, FL, USA, 2008. 27. C. Benedek and T. Szirányi. Change detection in optical aerial images by a multi-layer conditional mixed Markov model. IEEE Trans. Geosci. Remote Sens., 47(10):3416–3430, 2009. 28. C. Benedek and T. Szirányi. Shadow detection in digital images and video. In Computational Photography: Methods and Applications, Digital Imaging and Computer Vision, pages 283– 312. CRC Press, Taylor & Francis, 2010. 29. C. Benedek, T. Szirányi, Z. Kato, and J. Zerubia. A three-layer MRF model for object motion detection in airborne images. Research Report 6208, INRIA Sophia Antipolis, France, 2007. 30. C. Benedek, T. Szirányi, Z. Kato, and J. Zerubia. Detection of object motion regions in aerial image pairs with a multi-layer Markovian model. IEEE Trans. Image Process., 18(10):2303– 2315, 2009. 31. M. Benši´c and K. Sabo. 
Estimating the width of a uniform distribution when data are measured with additive normal errors with known variance. Computational Statistics & Data Analysis, 51(9):4731–4741, 2007. 32. J. Besag. On the statistical analysis of dirty images. Journal of Royal Statistics Society, 48:259–302, 1986.


33. J. A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute and Computer Science Division, University of California at Berkley, Berkley, CA, 1998. 34. A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In Proc. European Conference on Computer Vision (ECCV), pages 456–468, Prague, Czech Republic, 2004. Springer. 35. T. Blaskovics, Z. Kato, and I. Jermyn. A Markov random field model for extracting nearcircular shapes. In Proc. IEEE Int. Conf. Image Process. (ICIP), pages 1073–1076, Cairo, Egypt, 2009. 36. A. Börcs and C. Benedek. Urban traffic monitoring from aerial LIDAR data with a two-level marked point process model. In Proc. Int. Conf. Pattern Recognit. (ICPR), pages 1379–1382, Tsukuba City, Japan, 2012. 37. A. Börcs and C. Benedek. Extraction of vehicle groups in airborne lidar point clouds with two-level point processes. IEEE Trans. Geosci. Remote Sens., 53(3):1475–1489, 2015. 38. A. Börcs, B. Nagy, and C. Benedek. Dynamic environment perception and 4D reconstruction using a mobile rotating multi-beam Lidar sensor. In Handling Uncertainty and Networked Structure in Robot Control, Studies in Systems, Decision and Control, pages 153–180. Springer, 2016. 39. J-Y. Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker: Description of the algorithm. Technical report, Intel Corporation, 1999. 40. Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell., 26(9):1124–1137, 2004. 41. M. Bredif, O. Tournaire, B. Vallet, and N. Champion. Extracting polygonal building footprints from digital surface models: A fully-automatic global optimization framework. ISPRS J. Photogramm. Remote Sens., 77(1):57–65, 2013. 42. L. Bruzzone and D. Fernández-Prieto. An adaptive semiparametric and context-based approach to unsupervised change detection in multitemporal remote-sensing images. IEEE Trans. Image Process., 11(4):452–466, 2002. 43. L. Bruzzone, D. Fernández-Prieto, and S.B. Serpico. A neural-statistical approach to multitemporal and multisource remote-sensing image classification. IEEE Trans. Geosci. Remote Sens., 37(3):1350–1359, 1999. 44. L. Castellana, A. D’Addabbo, and G. Pasquariello. A composed supervised/unsupervised approach to improve change detection from remote sensing. Pattern Recogn. Lett., 28(4):405– 413, 2007. 45. A. Cavallaro, E. Salvador, and T. Ebrahimi. Detecting shadows in image sequences. In Proc. ACM SIGGR. Eur. Conf. Vis. Media Prod. (CVMP), pages 167–174, London, UK, 2004. 46. F. Chatelain, A. Costard, and O.J.J. Michel. A Bayesian marked point process for object detection. Application to muse hyperspectral data. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process.(ICASSP), pages 3628–3631, Prague, Czech Republic, 2011. 47. F. Chatelain, X. Descombes, and J. Zerubia. Parameter estimation for marked point processes. application to object extraction from remote sensing images. In Energy Minimization Methods in Comp. Vision and Pattern Recogn., volume 5681 of Lecture Notes in Computer Science, pages 221–234. Springer, Bonn, Germany, 2009. 48. S. Chaudhuri and D.R. Taur. High-resolution slow-motion sequencing: How to generate a slow-motion sequence from a bit stream. IEEE Signal Processing Magazine, 22(2):16–24, 2005. 49. J. K. Cheng and T. S. 
Huang. Image registration by matching relational structures. Pattern Recognit., 17:149–159, 1984. 50. D.A. Clausi and H. Deng. Design-based texture feature fusion using Gabor filters and cooccurrence probabilities. IEEE Trans. Image Process., 14(7):925–936, 2005. 51. R. Cutler and L.S. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):781–796, 2000.


52. L. Czúni and T. Szirányi. Motion segmentation and tracking with edge relaxation and optimization using fully parallel methods in the cellular nonlinear network architecture. Real-Time Imaging, 7(1):77–95, 2001.
53. X. Descombes, editor. Stochastic Geometry for Image Analysis. Digital Signal and Image Processing. Wiley-ISTE, 2011.
54. X. Descombes, R. Minlos, and E. Zhizhina. Object extraction using a stochastic birth-and-death dynamics in continuum. J. Math. Imaging Vis., 33:347–359, 2009.
55. X. Descombes and J. Zerubia. Marked point processes in image analysis. IEEE Signal Process. Mag., 19(5):77–84, 2002.
56. N. Dorudian, S. Lauria, and S. Swift. Moving object detection using adaptive blind update and RGB-D camera. IEEE Sensors Journal, 19(18):8191–8201, 2019.
57. D. Farin and P. H. N. de With. Misregistration errors in change detection algorithms and how to avoid them. In Proc. IEEE Int. Conf. Image Process. (ICIP), pages 438–441, Genova, Italy, 2005.
58. G. D. Finlayson, S.D. Hordley, C. Lu, and M. S. Drew. On the removal of shadows from images. IEEE Trans. Pattern Anal. Mach. Intell., 28(1):59–68, 2006.
59. F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multicamera people tracking with a probabilistic occupancy map. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):267–282, 2008.
60. D. A. Forsyth. A novel algorithm for color constancy. Int. J. Comput. Vis., 5(1):5–36, 1990.
61. C. Fredembach and G. D. Finlayson. Hamiltonian path based shadow removal. In Proc. British Machine Vision Conference (BMVC), volume 2, pages 502–511, Oxford, UK, 2005.
62. A. Fridman. Mixed Markov models. Proc. National Academy of Sciences of the USA, 100(14):8092–8096, 2003.
63. N. Friedman and S. Russell. Image segmentation in video sequences: A probabilistic approach. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 175–181, 1997.
64. Z. Fu, Y. Chen, H. Yong, R. Jiang, L. Zhang, and X. Hua. Foreground gating and background refining network for surveillance object detection. IEEE Trans. Image Process., 28(12):6077–6090, 2019.
65. K. Fukunaga and R.R. Hayes. The reduced Parzen classifier. IEEE Trans. Pattern Anal. Mach. Intell., 11(4):423–425, 1989.
66. A. Gamal-Eldin, X. Descombes, and J. Zerubia. Multiple birth and cut algorithm for point process optimization. In Proc. International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), pages 35–42, Kuala Lumpur, Malaysia, 2010.
67. A. Gamal-Eldin, X. Descombes, and J. Zerubia. A novel algorithm for occlusions and perspective effects using a 3D object process. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pages 1569–1572, Prague, Czech Republic, 2011.
68. P. Gamba, F. Dell’Acqua, and G. Lisini. Change detection of multitemporal SAR data in urban areas combining feature-based and pixel-based techniques. IEEE Trans. Geosci. Remote Sens., 44(10):2820–2827, 2006.
69. W. Ge and R. T. Collins. Crowd detection with a multiview sampler. In Proc. European Conf. Comput. Vis. (ECCV), volume 6315 of Lecture Notes in Computer Science, pages 324–337. Springer, Heraklion, Crete, Greece, 2010.
70. W. Ge and R.T. Collins. Marked point processes for crowd counting. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2913–2920, Miami, FL, USA, 2009.
71. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, 1984.
72. T. Gevers and A.W.M. Smeulders. Color based object recognition. Pattern Recognit., 32:453–464, 1999.
73. T. Gevers and H. Stokman. Classifying color edges in video into shadow-geometry, highlight, or material transitions. IEEE Trans. Multimedia, 5(2):237–243, 2003.
74. S. Ghosh, L. Bruzzone, S. Patra, F. Bovolo, and A. Ghosh. A context-sensitive technique for unsupervised change detection based on Hopfield-type neural networks. IEEE Trans. Geosci. Remote Sens., 45(3):778–789, 2007.
75. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
76. J. Guo, C. Hsia, Y. Liu, M. Shih, C. Chang, and J. Wu. Fast background subtraction based on a multilayer codebook model for moving object detection. IEEE Trans. Circuits Syst. Video Technol., 23(10):1809–1821, 2013.
77. Y.V. Haeghen, J.M.A.D. Naeyaert, I. Lemahieu, and W. Philips. An imaging system with calibrated color image acquisition for use in dermatology. IEEE Trans. Med. Imag., 19(7):722–730, 2000.
78. G. J. Hahn and S. S. Shapiro. Statistical Models in Engineering. John Wiley & Sons, New York, 1994, p. 95.
79. J. Han and B. Bhanu. Individual recognition using gait energy image. IEEE Trans. Pattern Anal. Mach. Intell., 28(2):316–322, 2006.
80. J. Hapák, Z. Jankó, and D. Chetverikov. Real-time 4D reconstruction of human motion. In Proc. International Conference on Articulated Motion and Deformable Objects (AMDO), volume 7378 of Lecture Notes in Computer Science, pages 250–259. Springer, Port d’Andratx, Mallorca, Spain, 2012.
81. R.M. Haralick. Digital step edges from zero crossing of second directional derivatives. IEEE Trans. Pattern Anal. Mach. Intell., 6(1):58–68, 1984.
82. R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2000.
83. M. Harville, G.G. Gordon, and J. Woodfill. Foreground segmentation using adaptive mixture models in color and depth. In IEEE Workshop on Detection and Recognition of Events in Video, pages 3–11, Vancouver, BC, Canada, 2001.
84. L. Havasi and T. Szirányi. Estimation of vanishing point in camera-mirror scenes using video. Optics Letters, 31(10):1411–1413, 2006.
85. L. Havasi, Z. Szlávik, and T. Szirányi. Higher order symmetry for non-linear classification of human walk detection. Pattern Recognit. Lett., 27:822–829, 2006.
86. L. Havasi, Z. Szlávik, and T. Szirányi. Detection of gait characteristics for scene registration in video surveillance system. IEEE Trans. Image Process., 16(2):503–510, 2007.
87. J.B. Hayfron-Acquah, M.S. Nixon, and J.N. Carter. Human identification by spatio-temporal symmetry. In Proc. Int. Conf. Pattern Recognit. (ICPR), volume 1, page 10632, Washington, DC, USA, 2002.
88. M. Heikkila and M. Pietikainen. A texture-based method for modeling the background and detecting moving objects. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):657–662, 2006.
89. H. Hirschmüller, F. Scholten, and G. Hirzinger. Stereo vision based reconstruction of huge urban areas from an airborne pushbroom camera (HRSC). In Proc. Joint Pattern Recognition Symposium (DAGM), volume 3663 of Lecture Notes in Computer Science, pages 58–66. Springer, Vienna, Austria, 2005.
90. H. Hirschmüller, P. R. Innocent, and J. Garibaldi. Real-time correlation-based stereo vision with reduced border errors. Int. J. Comput. Vis., 47(1/2/3):229–246, 2002.
91. T. Hoberg, F. Rottensteiner, R. Q. Feitosa, and C. Heipke. Conditional random fields for multitemporal and multiscale classification of optical satellite imagery. IEEE Trans. Geosci. Remote Sens., 53(2):659–673, 2015.
92. V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 22(2):85–126, 2004.
93. M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell., 20(6):577–589, 1998.
94. P.M. Jodoin and M. Mignotte. Motion segmentation using a K-nearest-neighbor-based fusion procedure of spatial and temporal label cues. In Proc. Int. Conf. on Image Analysis and Recognition (ICIAR), volume 3656 of Lecture Notes in Computer Science, pages 778–788. Springer, Toronto, Canada, 2005.
95. P.M. Jodoin, M. Mignotte, and C. Rosenberger. Segmentation framework based on label field fusion. IEEE Trans. Image Process., 16(10):2535–2550, 2007.
96. R. Kaestner, N. Engelhard, R. Triebel, and R. Siegwart. A Bayesian approach to learning 3D representations of dynamic environments. In Proc. Int. Symposium on Experimental Robotics (ISER), pages 461–475, Berlin, Germany, 2010. Springer.
97. B. Kalyan, K. W. Lee, W. S. Wijesoma, D. Moratuwage, and N. M. Patrikalakis. A random finite set based detection and tracking using 3D LIDAR in dynamic environments. In Proc. IEEE Int. Conf. Syst. Man Cybern. (SMC), pages 2288–2292, Istanbul, Turkey, 2010.
98. A. Katartzis and H. Sahli. A stochastic framework for the identification of building rooftops using a single remote sensing image. IEEE Trans. Geosci. Remote Sens., 46(1):259–271, 2008.
99. J. Kato, T. Watanabe, S. Joga, L. Ying, and H. Hase. An HMM/MRF-based stochastic framework for robust vehicle tracking. IEEE Trans. Intell. Transp. Syst., 5(3):142–154, 2004.
100. Z. Kato. Multiresolution Markovian models in computer vision. Application on segmentation of SPOT images. PhD thesis, University of Nice, INRIA, Sophia Antipolis, France, 1994. Available in French and English.
101. Z. Kato and T. C. Pong. A Markov random field image segmentation model for color textured images. Image and Vision Computing, 24(10):1103–1114, 2006.
102. Z. Kato, T. C. Pong, and G. Q. Song. Multicue MRF image segmentation: Combining texture and color. In Proc. Int. Conf. Pattern Recognit. (ICPR), pages 660–663, Quebec, Canada, 2002.
103. Z. Kato and J. Zerubia. Markov random fields in image segmentation. Foundations and Trends in Signal Processing. Now Publishers, 2012.
104. Z. Kato, J. Zerubia, and M. Berthod. Satellite image classification using a modified Metropolis dynamics. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pages 573–576, San Francisco, California, USA, 1992.
105. Z. Kato, J. Zerubia, and M. Berthod. Unsupervised parallel image classification using Markovian models. Pattern Recognit., 32(4):591–604, 1999.
106. E. A. Khan and E. Reinhard. Evaluation of color spaces for edge classification in outdoor scenes. In Proc. IEEE Int. Conf. Image Process. (ICIP), volume 3, pages 952–955, Genoa, Italy, 2005.
107. A. Kovács, C. Benedek, and T. Szirányi. A joint approach of building localization and outline extraction. In Proc. IASTED International Conference on Signal Processing, Pattern Recognition and Applications (SPPRA), Innsbruck, Austria, 2011.
108. A. Kovács and T. Szirányi. Orientation based building outline extraction in aerial images. In Proc. XXII. ISPRS Congress, volume I-7 of ISPRS Annals Photogram. Rem. Sens. and Spat. Inf. Sci., pages 141–146. Melbourne, Australia, 2012.
109. O. Krammer and B. Sinkovics. Improved method for determining the shear strength of chip component solder joints. Microelectronics Reliability, 50(2):235–241, 2010.
110. H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
111. S. Kumar, M. Biswas, and T. Nguyen. Global motion estimation in spatial and frequency domain. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pages 333–336, Montreal, Canada, 2004.
112. S. Kumar and M. Hebert. Detection in natural images using a causal multiscale random field. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), volume 1, pages 119–126, Madison, USA, 2003.
113. S. Kumar and M. Hebert. Discriminative random fields. Int. J. Comput. Vis., 68(2):179–202, 2006.
114. F. Lafarge, X. Descombes, J. Zerubia, and M. Pierrot-Deseilligny. Structural approach for building reconstruction from a single DSM. IEEE Trans. Pattern Anal. Mach. Intell., 32(1):135–147, 2010.
115. F. Lafarge, G. Gimel’farb, and X. Descombes. Geometric feature extraction by a multimarked point process. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1597–1609, 2010.
116. F. Lafarge and C. Mallet. Creating large-scale city models from 3D-point clouds: A robust approach with hybrid representation. Int. J. Comput. Vis., 2012.
117. M. Lahraichi, K. Housni, and S. Mbarki. Bayesian detection of moving object based on graph cut. In Proc. International Conference on Intelligent Systems: Theories and Applications (SITA), pages 1–5, Mohammedia, Morocco, 2016.
118. D. S. Lee. Effective Gaussian mixture learning for video background subtraction. IEEE Trans. Pattern Anal. Mach. Intell., 27(5):827–832, 2005.
119. L. Li and M.K.H. Leung. Integrating intensity and texture differences for robust change detection. IEEE Trans. Image Process., 11(2):105–112, 2002.
120. S. Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, London, UK, 1995.
121. Y. Li and J. Li. Oil spill detection from SAR intensity imagery using a marked point process. Remote Sensing of Environment, 114(7):1590–1601, 2010.
122. Y. Liu, Z. Zhang, R. Zhong, D. Chen, Y. Ke, J. Peethambaran, C. Chen, and L. Sun. Multilevel building detection framework in remote sensing images based on convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 11(10):3688–3700, 2018.
123. B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Int. Jt. Conf. Artif. Intell. (IJCAI), pages 674–679, Vancouver, BC, Canada, 1981.
124. L. Lucchese. Estimating affine transformations in the frequency domain. In Proc. IEEE Int. Conf. Image Process. (ICIP), volume II, pages 909–912, Thessaloniki, Greece, 2001.
125. H. Luo, C. Wang, C. Wen, Z. Chen, D. Zai, Y. Yu, and J. Li. Semantic labeling of mobile LiDAR point clouds via active learning and higher order MRF. IEEE Trans. Geosci. Remote Sens., 56(7):3631–3644, 2018.
126. D. K. Lynch and W. Livingstone. Color and Light in Nature. Cambridge University Press, 1955.
127. L. Maddalena and A. Petrosino. Stopped object detection by learning foreground model in videos. IEEE Trans. Neural Netw. Learn. Syst., 24(5):723–735, 2013.
128. A. Maki and K. Fukui. Ship identification in sequential ISAR imagery. Mach. Vision Appl., 15:149–155, 2004.
129. C. Mallet, F. Lafarge, M. Roux, U. Soergel, F. Bretar, and C. Heipke. A marked point process for modeling Lidar waveforms. IEEE Trans. Image Process., 19(12):3204–3221, 2010.
130. A. Manno-Kovács and A.O. Ok. Building detection from monocular VHR images by integrated urban area knowledge. IEEE Geosci. Remote Sens. Lett., 12(10):2140–2144, 2015.
131. N. Martel-Brisson and A. Zaccarin. Moving cast shadow detection from a Gaussian mixture shadow model. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), volume 2, pages 643–648, 2005.
132. I. Mikic, P. Cosman, G. Kogut, and M. M. Trivedi. Moving shadow and object detection in traffic scenes. In Proc. Int. Conf. Pattern Recognit. (ICPR), volume 1, pages 321–324, Barcelona, Spain, 2000.
133. I. Miyagawa and K. Arakawa. Motion and shape recovery based on iterative stabilization for modest deviation from planar motion. IEEE Trans. Pattern Anal. Mach. Intell., 28(7):1176–1181, 2006.
134. A. Monnet, A. Mittal, N. Paragios, and V. Ramesh. Background modeling and subtraction of dynamic scenes. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), volume 2, pages 1305–1312, Washington, DC, USA, 2003.
135. N. Muhammad and S. Lacroix. Calibration of a rotating multi-beam Lidar. In Proc. IEEE Int. Conf. Intell. Robots Syst. (IROS), pages 5648–5653, Taipei, Taiwan, 2010.
136. S. Müller and D.W. Zaum. Robust building detection in aerial images. In Proc. ISPRS Object Extraction for 3D City Models, Road Databases and Traffic Monitoring - Concepts, Algorithms and Evaluation (CMRT05), pages 143–148, Vienna, Austria, 2005.
137. Z. Németh and C. Benedek. Automatic tumuli detection in Lidar based digital elevation maps. In Proc. XXIV. ISPRS Congress, volume XLIII-B2-2020 of Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., pages 879–884. Nice, France, 2020.
138. J. M. Odobez and P. Bouthemy. Detection of multiple moving objects using multiscale MRF with camera motion compensation. In Proc. IEEE Int. Conf. Image Process. (ICIP), volume II, pages 257–261, Austin, Texas, USA, 1994.
139. D. Ortego, J. C. Sanmiguel, and J. M. Martínez. Hierarchical improvement of foreground segmentation masks in background subtraction. IEEE Trans. Circuits Syst. Video Technol., 29(6):1645–1658, 2019.
140. M. Ortner, X. Descombes, and J. Zerubia. A marked point process of rectangles and segments for automatic analysis of digital elevation models. IEEE Trans. Pattern Anal. Mach. Intell., 30(1):105–119, 2008.
141. N. Paragios and V. Ramesh. A MRF-based real-time approach for subway monitoring. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), volume 1, pages 1034–1040, Kauai, HI, USA, 2001.
142. G. Perrin, X. Descombes, and J. Zerubia. 2D and 3D vegetation resource parameters assessment using marked point processes. In Proc. Int. Conf. Pattern Recognit. (ICPR), volume 1, pages 1–4, Hong Kong, 2006.
143. PETS. Dataset - Performance Evaluation of Tracking and Surveillance, 2009.
144. R. Pless, T. Brodsky, and Y. Aloimonos. Detecting independent motion: The statistics of temporal continuity. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):68–73, 2000.
145. F. Porikli and J. Thornton. Shadow flow: a recursive method to learn moving cast shadows. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), volume 1, pages 891–898, Beijing, China, 2005.
146. J. Porway, Q. Wang, and S. C. Zhu. A hierarchical and contextual model for aerial image parsing. Int. J. Comput. Vis., 88(2):254–283, 2010.
147. R. Potts. Some generalized order-disorder transformations. Mathematical Proceedings of the Cambridge Philosophical Society, 48(1):106–109, 1952.
148. A. Prati, I. Mikic, M. M. Trivedi, and R. Cucchiara. Detecting moving shadows: algorithms and evaluation. IEEE Trans. Pattern Anal. Mach. Intell., 25(7):918–923, 2003.
149. W. K. Pratt. Digital Image Processing. John Wiley & Sons, 2nd edition, 1991.
150. R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam. Image change detection algorithms: A systematic survey. IEEE Trans. Image Process., 14(3):294–307, 2005.
151. Á. Rakusz, T. Lovas, and Á. Barsi. Lidar-based vehicle segmentation. In Proc. XX. ISPRS Congress, volume XXXV-2 of ISPRS Archives Photogram. Rem. Sens. and Spat. Inf. Sci., pages 156–159. Istanbul, Turkey, 2004.
152. B. Reddy and B. Chatterji. An FFT-based technique for translation, rotation and scale-invariant image registration. IEEE Trans. Image Process., 5(8):1266–1271, 1996.
153. S. Reed, I. T. Ruiz, C. Capus, and Y. Petillot. The fusion of large scale classified side-scan sonar image mosaics. IEEE Trans. Image Process., 15(7):2049–2060, 2006.
154. C. J. Van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
155. J. Rittscher, J. Kato, S. Joga, and A. Blake. An HMM-based segmentation method for traffic monitoring. IEEE Trans. Pattern Anal. Mach. Intell., 24(9):1291–1296, 2002.
156. J.D. Romero, M.J. Lado, and A. J. Méndez. A background modeling and foreground detection algorithm using scaling coefficients defined with a color model called lightness-red-green-blue. IEEE Trans. Image Process., 27(3):1243–1258, 2018.
157. M. Russell, J.J. Zou, G. Fang, and W. Cai. Feature-based image patch classification for moving shadow detection. IEEE Trans. Circuits Syst. Video Technol., 29(9):2652–2666, 2019.
158. E. Salvador, A. Cavallaro, and T. Ebrahimi. Cast shadow segmentation using invariant color features. Computer Vision and Image Understanding, 95(2):238–259, 2004.
159. H.S. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell., 22(10):1191–1199, 2000.
160. G. Scarpa, R. Gaetano, M. Haindl, and J. Zerubia. Hierarchical multiple Markov chain model for unsupervised texture segmentation. IEEE Trans. Image Process., 18(8):1830–1843, 2009.
161. I. Schiller and R. Koch. Improved video segmentation by adaptive combination of depth keying and Mixture-of-Gaussians. In Proc. Scandinavian Conference on Image Analysis, volume 6688 of Lecture Notes in Computer Science, pages 59–68. Springer, 2011.
162. Y. Sheikh and M. Shah. Bayesian modeling of dynamic scenes for object detection. IEEE Trans. Pattern Anal. Mach. Intell., 27(11):1778–1792, 2005.
163. H. Shekarforoush, M. Berthod, and J. Zerubia. Subpixel image registration by estimating the polyphase decomposition of cross power spectrum. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 532–537, Washington, DC, USA, 1996.
164. P. Singh, Z. Kato, and J. Zerubia. A multilayer Markovian model for change detection in aerial image pairs with large time differences. In Proc. Int. Conf. Pattern Recognit. (ICPR), pages 924–929, Stockholm, Sweden, 2014.
165. B. Sirmaçek and C. Ünsalan. Building detection from aerial imagery using invariant color features and shadow information. In Proc. Int. Symp. on Computer and Information Sciences (ISCIS), Istanbul, Turkey, 2008.
166. B. Sirmaçek and C. Ünsalan. Urban-area and building detection using SIFT keypoints and graph theory. IEEE Trans. Geosci. Remote Sens., 47(4):1156–1167, 2009.
167. B. Sirmaçek and C. Ünsalan. A probabilistic framework to detect buildings in aerial and satellite images. IEEE Trans. Geosci. Remote Sens., 49:211–221, 2011.
168. P. Soille. Morphological Image Analysis: Principles and Applications. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2nd edition, 2003.
169. Z.Y. Song, C.H. Pan, Q. Yang, F.X. Li, and W. Li. Building roof detection from a single high-resolution satellite image in dense urban area. In Proc. XXI. ISPRS Congress, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., pages 271–277. Beijing, China, 2008.
170. C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):747–757, 2000.
171. T. Szirányi and M. Shadaydeh. Segmentation of remote sensing images using similarity-measure-based fusion-MRF model. IEEE Geosci. Remote Sens. Lett., 11(9):1544–1548, 2014.
172. Z. Szlávik, T. Szirányi, and L. Havasi. Stochastic view registration of overlapping cameras based on arbitrary motion. IEEE Trans. Image Process., 16(3):710–720, 2007.
173. S. Tanathong, K.T. Rudahl, and S.E. Goldin. Object oriented change detection of buildings after the Indian Ocean tsunami disaster. In Proc. IEEE Int. Conf. Electr. Eng./Electron. Comput. Telecommun. Inf. Technol. (ECTI-CON), pages 65–68, Krabi, Thailand, 2008.
174. M. G. A. Thomson, R. J. Paltridge, T. Yates, and S. Westland. Color spaces for discrimination and categorization in natural scenes. In Proc. Congress of the International Colour Association, pages 877–880, 2002.
175. R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3(4):323–344, 1987.
176. V.J.D. Tsai. A comparative study on shadow compensation of color aerial images in invariant color models. IEEE Trans. Geosci. Remote Sens., 44(6):1661–1671, 2006.
177. Z. Tu and S-C. Zhu. Image segmentation by Data-Driven Markov Chain Monte Carlo. IEEE Trans. Pattern Anal. Mach. Intell., 24:657–673, 2002.
178. Á. Utasi and C. Benedek. A 3-D marked point process model for multi-view people detection. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3385–3392, Colorado Springs, USA, 2011.
179. Á. Utasi and C. Benedek. A multi-view annotation tool for people detection evaluation. In Proc. Int. Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications (VIGTA), Capri, Italy, 2012.
180. Á. Utasi and C. Benedek. A Bayesian approach on people localization in multi-camera systems. IEEE Trans. Circuits Syst. Video Technol., 23(1):105–115, 2013.
181. A. Veillard, S. Bressan, and D. Racoceanu. SVM-based framework for the robust extraction of objects from histopathological images using color, texture, scale and geometry. In Proc. IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), pages 70–75, Boca Raton, FL, USA, 2012.
182. Y. Verdie and F. Lafarge. Detecting parametric objects in large scenes by Monte Carlo sampling. Int. J. Comput. Vis., 106:57–75, 2014.
183. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), volume 1, pages 511–518, Kauai, HI, USA, 2001.
184. B-T. Vo and B-N. Vo. Labeled random finite sets and multi-object conjugate priors. IEEE Trans. Signal Process., 61(13):3460–3475, 2013.
185. J. L. Walker. Range-Doppler imaging of rotating objects. IEEE Trans. Aerosp. Electron. Syst., 16:23–52, 1980.
186. M. Wan, G. Gu, W. Qian, K. Ren, X. Maldague, and Q. Chen. Unmanned aerial vehicle video-based target tracking algorithm using sparse representation. IEEE Internet Things J., 6(6):9689–9706, 2019.
187. C. Wang, F. Liao, and C. Ma. Detection of pedestrian crossing from focus to spread. In Proc. World Congress Intell. Control and Automat. (WCICA), pages 4897–4901, Beijing, China, 2012.
188. K. Wang, Y. Liu, C. Gou, and F. Wang. A multi-view learning approach to foreground detection for traffic surveillance applications. IEEE Trans. Veh. Technol., 65(6):4144–4158, 2016.
189. L. Wang, T. Tan, H. Ning, and W. Hu. Silhouette analysis-based gait recognition for human identification. IEEE Trans. Pattern Anal. Mach. Intell., 25(12):1505–1518, 2003.
190. Q. Wang, J. Gao, and Y. Yuan. A joint convolutional neural networks and context transfer for street scenes labeling. IEEE Trans. Intell. Transp. Syst., 19(5):1457–1470, 2018.
191. Y. Wang, K-F. Loe, and J-K. Wu. A dynamic conditional random field model for foreground and shadow segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 28(2):279–289, 2006.
192. Y. Wang and T. Tan. Adaptive foreground and shadow detection in image sequences. In Proc. Int. Conf. Pattern Recognit. (ICPR), pages 983–986, Quebec, Canada, 2002.
193. J. Weng, N. Ahuja, and T. S. Huang. Matching two perspective views. IEEE Trans. Pattern Anal. Mach. Intell., 14:806–825, 1992.
194. R.G. White and M.L. Williams. Processing ISAR and spotlight SAR data to very high resolution. In Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), pages 32–34, Hamburg, Germany, 1999.
195. R. Wiemker. An iterative spectral-spatial Bayesian labeling approach for unsupervised robust change detection on remotely sensed multispectral imagery. In Int. Conf. on Computer Analysis of Images and Patterns (CAIP), volume 1296 of Lecture Notes in Computer Science, pages 263–270. Springer, Kiel, Germany, 1997.
196. B. Wu and R. Nevatia. Detection and segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses. Int. J. Comput. Vis., 82(2):185–204, 2009.
197. G. Wyszecki and W. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulas. John Wiley & Sons, 2nd edition, 1982.
198. W. Yao, S. Hinz, and U. Stilla. Automatic vehicle extraction from airborne LiDAR data of urban areas aided by geodesic morphology. Pattern Recognit. Lett., 31(10):1100–1108, 2010.
199. A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4), 2006.
200. A. Yilmaz, X. Li, and M. Shah. Object contour tracking using level sets. In Proc. Asian Conference on Computer Vision (ACCV), Jeju Island, Korea, 2004.
201. A. Yoneyama, Chia H. Yeh, and C-C. Jay Kuo. Moving cast shadow elimination for robust vehicle extraction based on 2D joint vehicle/shadow models. In Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), pages 229–236, Miami, FL, USA, 2003.
202. Y. Yu, J. Li, H. Guan, C. Wang, and M. Cheng. A marked point process for automated tree detection from mobile laser scanning point cloud data. In Proc. IEEE Int. Conf. Computer Vision in Remote Sensing (CVRS), pages 140–145, Xiamen, China, 2012.
203. C. Yuan, G. Medioni, J. Kang, and I. Cohen. Detecting motion regions in the presence of a strong parallax from a moving camera by multiview geometric constraints. IEEE Trans. Pattern Anal. Mach. Intell., 29(9):1627–1641, 2007.
204. Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 78:87–119, 1995.
205. Z. Zhang, T. Jing, J. Han, Y. Xu, and X. Li. Flow-process foreground region of interest detection method for video codecs. IEEE Access, 5:16263–16276, 2017.
206. Z. Zhao, H. Li, R. Zhao, and X. Wang. Crossing-line crowd counting with two-phase deep neural networks. In Proc. European Conf. Comput. Vis. (ECCV), volume 9912 of Lecture Notes in Computer Science, pages 712–726. Springer, Amsterdam, The Netherlands, 2016.
207. J. Zhong and S. Sclaroff. Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 44–50, Nice, France, 2003.
208. P. Zhong and R. Wang. A multiple conditional random fields ensemble model for urban area detection in remote sensing optical images. IEEE Trans. Geosci. Remote Sens., 45(12):3978–3988, 2007.
209. J. Zhou, C. Proisy, P. Couteron, X. Descombes, J. Zerubia, G. le Maire, and Y. Nouvellon. Tree crown detection in high resolution optical images during the early growth stages of eucalyptus plantations in Brazil. In Proc. Asian Conf. on Pattern Rec. (ACPR), pages 623–627, 2011.
210. Y. Zhou, Y. Gong, and H. Tao. Background segmentation using spatial-temporal multiresolution MRF. In Proc. IEEE Workshops Appl. Comp. Vis. (WACV/MOTION’05), pages 8–13, Breckenridge, CO, USA, 2005.
211. S.C. Zhu and A. L. Yuille. A flexible object recognition and modeling system. Int. J. Comput. Vis., 20(3), 1996.
212. Z. Zivkovic. Motion Detection and Object Tracking in Image Sequences. PhD thesis, University of Twente, 2003.

Index

A
Aerial image / aerial photo, 2, 10, 17, 21, 79, 81, 93, 101, 113, 117, 119, 125, 163, 164, 176
Aerial Lidar, 2, 3, 155, 157, 176, 177, 179, 185
Automated Optical Inspection (AOI), 157, 169, 170, 177

B
Benchmark, 32, 49, 95, 113, 115, 116, 133, 176, 178
Building detection, 21, 124, 133, 134, 153, 154, 161, 174, 176, 181, 182, 184

C
Change detection, 1, 3, 4, 6, 22, 25, 79–82, 88, 101, 102, 113–116, 119, 121–124, 130, 134–136
Color space, 29, 31, 33, 164
Conditional Mixed Markov (CXM) model, 79, 101, 109, 111, 113–119, 122

D
Dynamic Markov Random Field (DMRF), 65–70, 78

E
Embedded Marked Point Process (EMPP), 6, 156–159, 161–163, 166, 168, 170, 172, 174, 176–182, 184, 186
Expectation Maximization (EM), 103

F
Foreground detection, 27, 28, 52, 55, 68–70

G
Gait Energy Image (GEI), 71, 72
Gaussian distribution / Gaussian density, 33, 34, 37, 39, 65, 88, 103, 107, 108, 132, 139, 171
Gibbs distribution, 12, 13, 18, 32, 92

H
Hammersley–Clifford, 12, 13, 92

I
Image registration, 79, 81, 82, 85, 119
Image segmentation, 1, 6, 9, 13–15, 22, 23, 28, 30, 65, 67, 68, 79, 82, 98, 108, 109, 114, 174
Interaction potential, 19, 118, 130, 142, 143, 145
Inverse Synthetic Aperture Radar (ISAR), 2, 121, 136–140, 143, 146–148, 151–154

L
Label fusion, 6, 22, 79–82, 95, 96, 99, 101, 116, 118, 119
Light Detection and Ranging (Lidar), 4, 21, 25, 26, 62–64, 70–73, 75–78, 155, 161, 165–168, 181, 182

M
Marked Point Process (MPP), 5, 6, 9, 11, 17–23, 25, 55, 57, 59, 60, 77, 121, 122, 130, 133, 135, 152, 154–158, 161, 178–181, 184
Markov Random Field (MRF), 3–6, 9, 11–17, 19, 22, 23, 25–27, 31, 32, 35, 49, 53, 63, 65, 67–69, 75, 77, 79–81, 89–91, 108, 110, 111, 114–117, 119, 133, 138–140
Maximum a Posteriori (MAP), 13, 92, 111
Microstructure analysis, 31, 38, 51, 77
Mixture of Gaussians (MoG), 43, 44, 65, 66, 68, 69, 98, 103
Mobile Laser Scanning (MLS), 165, 166, 168, 176, 180
Modified Metropolis Dynamic (MMD), 14, 15, 49, 93
Motion detection, 4, 25, 81, 82, 93, 101, 104
Multiframe Marked Point Process (FmMPP), 6, 121, 136, 137, 142, 148–154
Multi-layer MRF, 66, 80, 89, 115
Multiple Birth and Death (MBD), 5, 20, 22, 23, 60, 122, 131, 161
Multitemporal Marked Point Process (mMPP), 6, 122, 132, 135, 136, 152–154

P
Person detection, 57, 70
Person localization / people localization, 55–57, 60, 62
Post-Classification Comparison (PCC), 80, 114–116
Post Detection Comparison (PDC), 135
Potts model, 13, 14, 49
Principal Component Analysis (PCA), 114, 116, 117, 181, 182
Printed Circuit Board (PCB), 156, 157, 161, 169–174, 176, 179, 181, 182, 184

R
Random Sample Consensus (RANSAC), 86, 141, 145, 147–150, 152, 168
Re-identification, 25, 70, 78
Reversible Jump Markov Chain Monte Carlo (RJMCMC), 5, 20
Rotating Multi-beam Lidar (RMB Lidar), 22, 25, 26, 63

S
Shadow detection, 25, 28, 35, 50, 53, 77
Ship detection, 151, 152
Simulated Annealing (SA), 15, 20, 93, 145
Singleton, 13, 14, 19, 91, 92, 111, 112, 116

T
Terrestrial Laser Scanning (TLS), 155
Texture analysis, 31
Tracking, 5, 6, 22, 25–27, 30, 55, 62, 70, 78, 121, 137, 154
Traffic monitoring, 4, 25, 119, 155, 157, 165, 169, 172, 179, 182–184

V
Vehicle detection, 21, 27, 45, 161, 166, 167, 181, 182, 185