Handbook of convex optimization methods in imaging science
 9783319616087, 9783319616094

Table of contents:
Preface
Acknowledgments
Contents
1.1 Convex Optimization in Imaging Science
1.2.1 Convex Sets
1.2.2 Convex Functions
1.2.4 Convex Optimization Problems: Examples
1.2.5.1 Lagrange Multipliers and Duality
1.2.5.2 Descent Methods
1.2.5.4 Alternative Direction Method of Multipliers
1.2.6 Handling Non-convex Problems
1.3 Book Outline
References
2.1.1 Image Quality Assessment Measures
2.1.2 Perceptual Optimization Framework
2.2.1 Structural and Non-Structural Distortions
2.2.2 Convexity and Quasi-Convexity
2.2.3 Combination Rule
2.3.2 Equalization Problem
2.3.3.1 Equalization Problem Redefined
2.3.3.2 StatSSIM-Optimal Linear Equalization
2.3.3.3 Problem Reformulation
2.3.3.4 Quasi-Convex Optimization
2.3.4 SSIM-Optimal Soft-Thresholding
2.3.4.1 SSIM Index in the Wavelet Domain
2.3.4.2 Problem Formulation
2.4 Local SSIM-Optimal Approximation
2.4.1.1 Orthogonal Basis
2.4.1.3 Non-Linear Approximation
2.4.2.1 Linear Approximation
2.4.3 Non-Linear Approximation
2.5 Image-Wide Variational SSIM Optimization
References
3.1 Introduction
3.2.2 Color Correction
3.2.2.1 Linear Models
3.2.2.2 Linear Color Correction
3.2.2.4 Root-Polynomial Color Correction
3.2.2.6 Other Methods
3.2.3.1 The RGB Model of Image Formation
3.2.3.3 Experiments
3.2.3.4 Extending Moment-Based Illuminant Estimation
3.3.2 Display Characterization
3.3.3.1 Color Distortion Model
3.3.3.2 Power Consumption Model of Emissive Displays
3.3.3.3 Constrained Optimization Problem
3.3.3.4 Simulation
3.4.1 Introduction
3.4.2 Lattice-Based Printer Characterization
3.4.2.2 Optimization of Node Values
3.4.2.3 Optimization of Node Locations
3.4.2.5 Experimental Results
References
4.1 Introduction
4.2 SAR Forward Models
4.2.1 Generalized Radon Transforms
4.2.2 Classical Radon Transform
4.3.1.1 Minimum Norm Solution: Filtered Back-Projection Formula
4.3.2 Linear Statistical Reconstruction Methods
4.3.2.1 Best Linear Unbiased Estimation
4.3.2.2 Minimum Mean Square Error Estimation
4.3.3 Nonlinear Reconstruction Methods
4.3.3.1 Iterative Re-weighted Least-Squares Algorithm
4.3.3.2 Iterative Shrinkage Thresholding Approach
4.4 Numerical Optimization Methods for SAR Imaging
4.4.1 Compressed Sensing in SAR Imaging
4.4.1.1 Total Variation Regularization in SAR Imaging
4.4.2 Low-Rank Matrix Recovery Methods
4.4.3 Bayesian Compressed Sensing
4.5 Optimization in SAR Related Problems
4.6 Conclusion
References
5.1 Introduction
5.2.1 General Forward Problem Formulation
5.2.2 Image Reconstruction with Nonsmooth Regularization
5.3.1 Introduction
5.3.2 Overview of Computational Spectral Imaging Approaches
5.3.3 Computational Photon-Sieve Spectral Imaging (PSSI)
5.4.2 Overview of Compressed Ultrafast Imaging Techniques
5.4.3.1 Operating Principle and Convex Optimization
5.4.3.3 Dual-Channel CUP and Convex Optimization with Complementary Encoding
5.5 Conclusions
References
6.2 Sparse Representation
6.2.1 Sparse Representation-based Classification
6.2.2 Robust Biometrics Recognition via Sparse Representation
6.3.1 Dictionary Learning Algorithms
6.3.2.1 Information Theoretic Dictionary Learning
6.3.2.2 Discriminative KSVD
6.3.2.4 Label Consistent KSVD
6.3.3 Non-Linear Kernel Dictionary Learning
6.3.4 Joint Dimensionality Reduction and Dictionary Learning
6.3.6 Dictionary Learning from Partially Labeled Data
6.3.7 Dictionary Learning from Ambiguously Labeled Data
6.3.8 Multiple Instance Dictionary Learning
6.3.9 Domain Adaptive Dictionary Learning
6.4 Analysis Dictionary Learning
6.5 Convolutional Sparse Coding
References
7.1 Introduction
7.2.1 Variational Formulations
7.3.1 Evolution of Image Denoising: From Wiener Filtering to GSM and BM3D
7.3.2 Simultaneous Sparse Coding with Gaussian Scalar Mixture (GSM): An Optimization-based Formulation
7.3.3 Solving Simultaneous Sparse Coding via Alternating Minimization
7.3.4 Application into Image Denoising: From Patch-based to Whole Image
7.4.1 Compressed Sensing via l1-optimization: Ten Years After
7.4.2 Compressed Sensing via Nonlocal Low-Rank (NLR) Regularization
7.4.3 Optimization Algorithm for CS Image Recovery
7.4.4 Experimental Results and Discussions
7.5 Conclusions
References
8.1.1 Estimation Problems in Image Processing and Computer Vision
8.1.2 Sparsity Constrained Optimization: Overview
8.1.3 Chapter Overview
8.2.1 The Spike-and-Slab Prior
8.2.2 Bayesian Sparse Recovery Framework
8.2.3 Iterative Convex Refinement (ICR)
8.3.1 Sparse Signal Recovery
8.3.2.1 Sparse Representation-based Classification: Overview
8.3.2.2 Structured Sparse Priors for Image Classification (SSPIC)
8.3.2.3 Experimental Validation
8.3.2.5 Recognition Under Random Pixel Corruption
8.3.2.7 Outlier Rejection
8.3.2.8 Effect of Training Size on Performance
8.3.2.10 Effect of Training Size on Performance
8.4 Concluding Remarks
References
9.1 Introduction and Motivation
9.1.1 Related Work
9.2.2 Product Space of the Special Euclidean Group
9.2.3 Grassmann Manifold as a Shape Space
9.2.4 A Preliminary Example: Dictionary Learning for Human Actions Using Extrinsic Representations
9.2.4.1 Dictionary Learning and Sparse Coding Problem
9.2.4.2 Limitations
9.3 Elastic Representations of Riemannian Trajectories
9.3.1 Riemannian Functional Coding
9.3.2.1 Action Recognition
9.3.2.2 Visual Speech Recognition
9.4 Matching Riemannian Trajectories Efficiently
9.4.3 Conscience based Competitive Learning on Manifolds
9.5 Conclusion
References


Vishal Monga Editor

Handbook of Convex Optimization Methods in Imaging Science


Editor
Vishal Monga
The Pennsylvania State University
University Park, PA, USA

ISBN 978-3-319-61608-7
ISBN 978-3-319-61609-4 (eBook)
DOI 10.1007/978-3-319-61609-4
Library of Congress Control Number: 2017952608

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

This book is dedicated to Mr. Vijay Chaudhary, who remains a constant source of support and encouragement.

Preface

The value of mathematical optimization to engineering cannot possibly be overstated. Engineering and applied math problems invariably involve minimization of an error or cost function, or alternatively the maximization of the likelihood of success. The optimal choice of real-world physical variables or design parameters that enable the best minimization/maximization is hence crucial. Often these parameters are constrained by physical laws, the available memory and/or computation, the requirements of fast or real-time operation, etc.

This book studies optimization techniques, in particular convex optimization methods and algorithms, for the high-impact engineering domain of imaging science. Imaging science concerns itself with all aspects of image acquisition, processing, rendering, and transmission. The economic footprint of imaging and vision systems is quite significant and continues to grow. Technology fueled by optimization algorithms in image processing and vision can be found in smartphones, cameras, displays, medical imaging devices and software, remote-sensing machinery and satellites, robotics, and defense and aerospace products, to name a few.

This book is an outgrowth of my individual research and teaching at the intersection of convex optimization and imaging science. Truthfully, I have been cajoled into writing it by several distinguished colleagues engaged in computational aspects of image processing and vision, whose work I admire greatly. The areas of image processing and computer vision benefitted tremendously from the applied math subsets of linear algebra and probability from the 1980s to the early 2000s. I am firmly of the view that an in-depth understanding of optimization and the opportunities it presents in solving real-world imaging problems is imperative for the modern researcher in this area. The book is hence aimed at graduate and undergraduate students in Electrical Engineering, Computer Science, Applied Physics, and Mathematics. Because the book is an edited volume with state-of-the-art coverage of many imaging science problems from a computational viewpoint, it should also be of value to industrial researchers and practitioners.

University Park, PA, USA

Vishal Monga


Acknowledgments

I am grateful to all my colleagues who have kindly contributed high-quality and thorough chapters to this book. In terms of writing and organization, this was uncharted territory, which makes their efforts even more commendable. Other than the contributors, many notable senior researchers in the signal and image processing area have been immensely supportive and encouraging and have made this book possible; these include (but are not limited to) Charlie Bouman at Purdue, Trac Tran at Johns Hopkins, P.P. Vaidyanathan at Caltech, and Muralidhar Rangaswamy at the US Air Force Research Lab. The book would not have come about if not for the excellent research and educational efforts of my doctoral students and alumni of the Information Processing and Algorithms Laboratory (iPAL) (http://signal.ee.psu.edu) at Penn State. Umamahesh Srinivas and Hojjat Mousavi have helped co-write a chapter with me on sparsity-constrained optimization and estimation and have also helped me organize, plan, and ultimately execute the vision for this book; my special thanks go to them. I would greatly appreciate hearing of any errors that may be found in the book for reasons beyond my control and that of the publisher. I take responsibility for any limitations that the book may have and invite constructive feedback, which may be sent to [email protected].

University Park, PA, USA

Vishal Monga


Contents

1 Introduction
Vishal Monga
1.1 Convex Optimization in Imaging Science
1.2 Convex Optimization Review
1.2.1 Convex Sets
1.2.2 Convex Functions
1.2.3 Convex Optimization
1.2.4 Convex Optimization Problems: Examples
1.2.5 Optimization Methods
1.2.5.1 Lagrange Multipliers and Duality
1.2.5.2 Descent Methods
1.2.5.3 Iterative Thresholding Methods
1.2.5.4 Alternative Direction Method of Multipliers
1.2.6 Handling Non-convex Problems
1.3 Book Outline
References

2 Optimizing Image Quality
Dominique Brunet, Sumohana S. Channappayya, Zhou Wang, Edward R. Vrscay, and Alan C. Bovik
2.1 Introduction
2.1.1 Image Quality Assessment Measures
2.1.2 Perceptual Optimization Framework
2.1.3 Chapter Overview
2.2 Mathematical Properties of the SSIM Index
2.2.1 Structural and Non-Structural Distortions
2.2.2 Convexity and Quasi-Convexity
2.2.3 Combination Rule
2.2.4 Spatial Aggregation
2.3 Perceptually Optimal Algorithm Design
2.3.1 SSIM-Optimal Equalizer Design
2.3.2 Equalization Problem
2.3.3 Solution
2.3.3.1 Equalization Problem Redefined
2.3.3.2 StatSSIM-Optimal Linear Equalization
2.3.3.3 Problem Reformulation
2.3.3.4 Quasi-Convex Optimization
2.3.3.5 Search for α
2.3.3.6 Application to Image Denoising and Restoration
2.3.4 SSIM-Optimal Soft-Thresholding
2.3.4.1 SSIM Index in the Wavelet Domain
2.3.4.2 Problem Formulation
2.3.4.3 Solution
2.4 Local SSIM-Optimal Approximation
2.4.1 L2-Based Approximation
2.4.1.1 Orthogonal Basis
2.4.1.2 Linear Redundant Basis
2.4.1.3 Non-Linear Approximation
2.4.2 SSIM-Based Approximation
2.4.2.1 Linear Approximation
2.4.3 Non-Linear Approximation
2.4.4 Variational SSIM
2.5 Image-Wide Variational SSIM Optimization
2.6 Conclusions
References

3 Computational Color Imaging
Raja Bala, Graham Finlayson, and Chul Lee
3.1 Introduction
3.2 Color Capture
3.2.1 Introduction
3.2.2 Color Correction
3.2.2.1 Linear Models
3.2.2.2 Linear Color Correction
3.2.2.3 Polynomial Color Correction
3.2.2.4 Root-Polynomial Color Correction
3.2.2.5 Experimental Results
3.2.2.6 Other Methods
3.2.3 Illuminant Estimation
3.2.3.1 The RGB Model of Image Formation
3.2.3.2 Moment-Based Illuminant Estimation
3.2.3.3 Experiments
3.2.3.4 Extending Moment-Based Illuminant Estimation
3.2.3.5 Other Methods
3.3 Color Display
3.3.1 Introduction
3.3.2 Display Characterization
3.3.3 RGB-to-RGBW Conversion
3.3.3.1 Color Distortion Model
3.3.3.2 Power Consumption Model of Emissive Displays
3.3.3.3 Constrained Optimization Problem
3.3.3.4 Simulation
3.4 Color Printing
3.4.1 Introduction
3.4.2 Lattice-Based Printer Characterization
3.4.2.1 Problem Setup
3.4.2.2 Optimization of Node Values
3.4.2.3 Optimization of Node Locations
3.4.2.4 Joint Optimization of Node Values and Locations
3.4.2.5 Experimental Results
3.5 Conclusions
References

4 Optimization Methods for Synthetic Aperture Radar Imaging
Eric Mason, Ilker Bayram, and Birsen Yazici
4.1 Introduction
4.2 SAR Forward Models
4.2.1 Generalized Radon Transforms
4.2.2 Classical Radon Transform
4.3 Analytical Optimization Methods for SAR Imaging
4.3.1 Linear Deterministic Reconstruction Methods
4.3.1.1 Minimum Norm Solution: Filtered Back-Projection Formula
4.3.1.2 Regularized Minimum Norm Solution
4.3.1.3 Minimum Error Solution: Backprojection Filtering Formula
4.3.2 Linear Statistical Reconstruction Methods
4.3.2.1 Best Linear Unbiased Estimation
4.3.2.2 Minimum Mean Square Error Estimation
4.3.3 Nonlinear Reconstruction Methods
4.3.3.1 Iterative Re-weighted Least-Squares Algorithm
4.3.3.2 Iterative Shrinkage Thresholding Approach
4.4 Numerical Optimization Methods for SAR Imaging
4.4.1 Compressed Sensing in SAR Imaging
4.4.1.1 Total Variation Regularization in SAR Imaging
4.4.2 Low-Rank Matrix Recovery Methods
4.4.3 Bayesian Compressed Sensing
4.5 Optimization in SAR Related Problems
4.6 Conclusion
References

5 Computational Spectral and Ultrafast Imaging via Convex Optimization
Figen S. Oktem, Liang Gao, and Farzad Kamalabadi
5.1 Introduction
5.2 The General Image Formation Model and Image Reconstruction Approach
5.2.1 General Forward Problem Formulation
5.2.2 Image Reconstruction with Nonsmooth Regularization
5.3 Computational Spectral Imaging
5.3.1 Introduction
5.3.2 Overview of Computational Spectral Imaging Approaches
5.3.3 Computational Photon-Sieve Spectral Imaging (PSSI)
5.4 Computational Ultrafast (Temporal) Imaging
5.4.1 Introduction
5.4.2 Overview of Compressed Ultrafast Imaging Techniques
5.4.3 Compressed Ultrafast Photography (CUP)
5.4.3.1 Operating Principle and Convex Optimization
5.4.3.2 Time-of-Flight CUP and Convex Optimization with a Spatial Constraint
5.4.3.3 Dual-Channel CUP and Convex Optimization with Complementary Encoding
5.5 Conclusions
References

6 Discriminative Sparse Representations
He Zhang and Vishal M. Patel
6.1 Introduction
6.2 Sparse Representation
6.2.1 Sparse Representation-based Classification
6.2.2 Robust Biometrics Recognition via Sparse Representation
6.3 Dictionary Learning
6.3.1 Dictionary Learning Algorithms
6.3.2 Discriminative Dictionary Learning
6.3.2.1 Information Theoretic Dictionary Learning
6.3.2.2 Discriminative KSVD
6.3.2.3 Fisher Discrimination Dictionary Learning
6.3.2.4 Label Consistent KSVD
6.3.3 Non-Linear Kernel Dictionary Learning
6.3.4 Joint Dimensionality Reduction and Dictionary Learning
6.3.5 Unsupervised Dictionary Learning
6.3.6 Dictionary Learning from Partially Labeled Data
6.3.7 Dictionary Learning from Ambiguously Labeled Data
6.3.8 Multiple Instance Dictionary Learning
6.3.9 Domain Adaptive Dictionary Learning
6.4 Analysis Dictionary Learning
6.5 Convolutional Sparse Coding
6.6 Concluding Remarks
References

7 Sparsity Based Nonlocal Image Restoration: An Alternating Optimization Approach
Xin Li, Weisheng Dong, and Guangming Shi
7.1 Introduction
7.2 Optimization-based Image Restoration
7.2.1 Variational Formulations
7.2.2 Sparse Representations
7.3 Nonlocal Image Denoising via Simultaneous Sparse Coding
7.3.1 Evolution of Image Denoising: From Wiener Filtering to GSM and BM3D
7.3.2 Simultaneous Sparse Coding with Gaussian Scalar Mixture (GSM): An Optimization-based Formulation
7.3.3 Solving Simultaneous Sparse Coding via Alternating Minimization
7.3.4 Application into Image Denoising: From Patch-based to Whole Image
7.4 Nonlocal Compressed Sensing via Low-rank Methods
7.4.1 Compressed Sensing via l1-optimization: Ten Years After
7.4.2 Compressed Sensing via Nonlocal Low-Rank (NLR) Regularization
7.4.3 Optimization Algorithm for CS Image Recovery
7.4.4 Experimental Results and Discussions
7.5 Conclusions
References

8 Sparsity Constrained Estimation in Image Processing and Computer Vision
Vishal Monga, Hojjat Seyed Mousavi, and Umamahesh Srinivas
8.1 Introduction
8.1.1 Estimation Problems in Image Processing and Computer Vision
8.1.2 Sparsity Constrained Optimization: Overview
8.1.3 Chapter Overview
8.2 Sparsity Constrained Estimation Using Spike-and-Slab Priors
8.2.1 The Spike-and-Slab Prior
8.2.2 Bayesian Sparse Recovery Framework
8.2.3 Iterative Convex Refinement (ICR)
8.3 Applications
8.3.1 Sparse Signal Recovery
8.3.2 Image Classification
8.3.2.1 Sparse Representation-based Classification: Overview
8.3.2.2 Structured Sparse Priors for Image Classification (SSPIC)
8.3.2.3 Experimental Validation
8.3.2.4 Test Images with Geometric Alignment
8.3.2.5 Recognition Under Random Pixel Corruption
8.3.2.6 Recognition Under Disguise
8.3.2.7 Outlier Rejection
8.3.2.8 Effect of Training Size on Performance
8.3.2.9 Average Classification Accuracy
8.3.2.10 Effect of Training Size on Performance
8.4 Concluding Remarks
References

9 Optimization Problems Associated with Manifold-Valued Curves with Applications in Computer Vision
Rushil Anirudh, Pavan Turaga, and Anuj Srivastava
9.1 Introduction and Motivation
9.1.1 Related Work
9.2 Mathematical Preliminaries
9.2.1 Preliminaries in Differential Geometry
9.2.2 Product Space of the Special Euclidean Group
9.2.3 Grassmann Manifold as a Shape Space
9.2.4 A Preliminary Example: Dictionary Learning for Human Actions Using Extrinsic Representations
9.2.4.1 Dictionary Learning and Sparse Coding Problem
9.2.4.2 Limitations
9.3 Elastic Representations of Riemannian Trajectories
9.3.1 Riemannian Functional Coding
9.3.2 Experiments
9.3.2.1 Action Recognition
9.3.2.2 Visual Speech Recognition
9.3.3 Visualizing the Action Space
9.4 Matching Riemannian Trajectories Efficiently
9.4.1 Piece-wise Aggregation
9.4.2 Symbolic Approximation
9.4.3 Conscience based Competitive Learning on Manifolds
9.4.4 Applications in Search, Compression and Recognition
9.5 Conclusion
References

1 Introduction

Vishal Monga

1.1 Convex Optimization in Imaging Science

Images and videos are ubiquitous in our multimedia-rich environment today. We can capture photographs on a variety of devices, from inexpensive mobile phones to highly sophisticated cameras, spanning an impressive range in sensor pixel count and picture resolution. We watch video content on displays ranging in size from handheld devices to gigantic movie screens. The resolution of these videos ranges from the grainy CIF (352 × 288) used in CCTV systems all the way up to 8K ultra high-definition (7680 × 4320) and beyond. In addition, devices routinely support high-quality streaming video content while preserving the richness of spectral or color information. By some estimates, the global consumer electronics market is poised to be worth a trillion US dollars by 2020. This technological advancement has enhanced more than the entertainment experience. Applications in medical imaging have contributed immensely to improving the quality of life and healthcare. Defense and remote sensing applications rely heavily on sensing modalities beyond the visible spectrum, such as radar, infra-red, multispectral and hyperspectral imaging.

(Author affiliation: V. Monga, The Pennsylvania State University, University Park, PA 16802, USA; e-mail: [email protected])

Imaging science concerns itself with all aspects of image acquisition, processing, rendering, and transmission. It has charted a fascinating path of growth over the past few centuries, fueled initially by innovations in the print industry and then rapidly accelerating in the slipstream of remarkable modern advances in device physics, systems engineering, computing power, and algorithms. In particular, the sub-area of computational imaging, which deals with mathematical techniques for image formation and processing, has emerged as a fertile ground for research and innovation.

From the toolbox of mathematical techniques that have found success in imaging science, of particular interest is convex optimization [1], which refers to a sub-class of optimization problems, the most popular of which are perhaps the least-squares and linear programming problems. Convex optimization has a long history of association with signal processing, dating back to the Wiener filter [2]. Why is convex optimization important? The primary reason is that the resulting formulation can be solved in practice via fast, efficient, and often elegant methods. The key breakthrough was the discovery of interior-point methods to solve linear programs [3]. Subsequently it was shown that interior-point methods could be utilized to solve other convex optimization problems too [4]. The interested reader can find a comprehensive discussion on

© Springer International Publishing AG 2017 V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_1


this topic in [5, 6] and elsewhere. Giant strides in computing power have been a significant driver in the adoption of convex optimization solutions in real-world signal processing tasks. Some common convex optimization problems are reviewed in the following section.

Convex optimization can be useful even in the context of non-convex problems. Such problems can be approximated by convex problems, an approach known as convex relaxation, and the convex approximation can then be solved exactly. Problems previously thought to be intractable have been resolved in this manner. Convex approximations also offer valuable insights into bounds on the optimal solution.

In virtually every image processing application, the design goal is to minimize some appropriately chosen error. In other words, the notion of optimizing the choice of parameters, or at least making the best possible choice under the circumstances, to meet a design goal is deeply intertwined with image processing algorithm development. Unsurprisingly, several classical image processing problems have been formulated as constrained optimization problems. The constraints arise from practical considerations such as device physics, power, and computational costs. Vision science has inspired the genesis of many image processing solutions, and perception-based constraints are commonly enforced in such approaches.

The following chapters of this book introduce the reader to several interesting formulations of convex optimization problems. The keen reader will notice that identical formulations, such as the quadratic program and the $\ell_1$ sparse recovery problem, pop up repeatedly in very different scenarios. The upside is that any algorithmic improvement in solving a particular problem benefits every application that shares the same formulation. As just one example, analytical progress in solving the sparse recovery problem has led to its ubiquity in all aspects of imaging science within a short span of time.

Consider the classical least-squares problem:

$$\min_x \; \|y - Ax\|_2^2, \tag{1.1}$$

where $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, and $A \in \mathbb{R}^{m \times n}$. Practical considerations often dictate that the solution of (1.1) obeys additional constraints, as mentioned earlier. Over the past few decades, arguably the most influential constrained setting in which the above problem has been formulated is that of sparsity. Specifically, sparsity implies that the number of nonzero elements of the decision variable $x \in \mathbb{R}^n$ should be limited; in other words, a small subset of the decision variables captures all information about the signal $y$. Analytically, this translates into minimizing the $\ell_0$ pseudo-norm (see footnote 1), $\|x\|_0$, of $x$. The domain of $\ell_0$ is $\mathbb{R}^n$, but it takes only non-negative integer values. Since a convex function defined on all of $\mathbb{R}^n$ is continuous, an integer-valued function that is not constant cannot be convex; as a result, $\ell_0$ is not a convex function (more about this in Sect. 1.2.2).

Viewed through the perspective of compressive sensing (CS) [7], the sparsity-constrained problem is interpreted as the ability to recover certain signals from far fewer samples or measurements than required by traditional methods. Here, $A$ takes on the role of a sensing/sampling matrix. In image processing, this perspective allows us to capture simpler structure hidden in visual data, and it reduces the undesirable effects of noise in practical settings. Formally, the CS problem is stated as follows:

$$\min_x \; \|x\|_0 \quad \text{subject to} \quad y = Ax. \tag{1.2}$$

Footnote 1: The $\ell_0$ operator is not technically a norm, but it is occasionally referred to as a norm in the literature, admitting a mild abuse of convention.

We will discuss techniques to solve this analytically hard problem later in this book. Here, we highlight how the same analytical formulation lends itself to very different real-world scenarios presented in the subsequent chapters. In Chap. 3, $y$ and $x$ are colorimetric representations in different color spaces and $A$ captures the mapping from one space to another. This has implications for display characterization. In the same chapter, printer characterization can be modeled as a joint optimization on both $A$ and $x$. Chapters 4 and 5 arrive at (1.2) by modeling the image reconstruction problem (in modalities different from common photography, such as radar and ultrafast imaging) with constraints


such as Laplacian priors and image smoothness. Chapter 6 addresses the important vision problem of object recognition via dictionary learning. This approach belongs to a family of techniques spawned out of a seminal recent contribution [8] that interprets A as a dictionary of training images and the resulting problem as one of constructing an image from a small subset of the dictionary of similar training images. Rich algorithmic contributions have been made in this area towards both finding the optimal sparse x for a fixed dictionary A as well as in adaptively selecting the dictionary A itself. Interestingly, this notion of optimizing both A and x that we encountered in Chaps. 3 and 6 shows up yet again in Chap. 7, in the context of the classical image processing problem of denoising. Chapter 8 invests the CS problem with a Bayesian perspective via structured priors, with applications to image recovery as well as classification.

1.2 Convex Optimization Review

In recent years, convex optimization has become a powerful computational tool in signal and image processing, with the capability to solve large-scale problems reliably and efficiently. The goal of this section is to provide an overview of the basic concepts in convex optimization, in order to familiarize the reader with the convex optimization formulations occurring throughout this book. We now outline this section. First, we describe convex sets and convex functions, followed by a description of convex optimization problems in their general form. Next, we present some standard optimization problems used throughout this book, which have also been found to be extremely useful in practice. We conclude by briefly commenting on some particular tricks for convex and non-convex problems.

1.2.1 Convex Sets

Throughout this chapter, we will be concerned only with problems in the real space $\mathbb{R}^n$. Given a set of points $x_i \in \mathbb{R}^n$ and coefficients $\theta_i \in \mathbb{R}$, a point $y = \theta_1 x_1 + \dots + \theta_k x_k$ is said to be:

• a linear combination of the $x_i$'s for arbitrary real $\theta_i$,
• an affine combination of the $x_i$'s if the $\theta_i$'s sum to one, i.e. $\sum_i \theta_i = 1$,
• a convex combination of the $x_i$'s if the $\theta_i$'s are nonnegative and sum to one, i.e. $\sum_i \theta_i = 1$, $\theta_i \ge 0$, and
• a conic combination of the $x_i$'s if the $\theta_i$'s are nonnegative, i.e. $\theta_i \ge 0$.

Definition 1. A set $S \subseteq \mathbb{R}^n$ is affine if it contains the line through any two distinct points in the set:

$$x, y \in S,\ \theta \in \mathbb{R} \;\Rightarrow\; \theta x + (1-\theta) y \in S, \tag{1.3}$$

i.e., every affine combination of points in the set is also in the set. Two common examples of affine sets are the range of an affine function:

$$S = \{Ax + b \mid x \in \mathbb{R}^q\}, \tag{1.4}$$

and the solution set of a system of linear equations:

$$S = \{x \mid Ax = b\}. \tag{1.5}$$

Definition 2. A set $S \subseteq \mathbb{R}^n$ is a convex set if it contains the line segment between any two points in the set:

$$x, y \in S,\ 0 \le \theta \le 1 \;\Rightarrow\; \theta x + (1-\theta) y \in S. \tag{1.6}$$

In other words, a set is convex if it contains every convex combination of its points.

Definition 3. A set $S \subseteq \mathbb{R}^n$ is a convex cone if it is convex and contains all the rays passing through any of its points and the origin, i.e.,

$$x, y \in S,\ \theta_1, \theta_2 \ge 0 \;\Rightarrow\; \theta_1 x + \theta_2 y \in S. \tag{1.7}$$

In other words, a set is a convex cone if it contains all conic combinations of points in the set. Geometrically speaking, a convex cone contains an entire "pie slice" in $\mathbb{R}^n$. Examples of convex cones are the non-negative orthant $\mathbb{R}^n_+$ and the set of symmetric positive semidefinite (PSD) matrices $\mathbb{S}^n_+ = \{X \in \mathbb{S}^n \mid X \succeq 0\}$.
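As a small numerical illustration of Definition 3, the following sketch checks that conic combinations of PSD matrices stay in the PSD cone, while a negative coefficient can leave it. The helper names `is_psd` and `random_psd` and the random test matrices are assumptions made for this example only.

```python
import numpy as np

def is_psd(X, tol=1e-10):
    """Return True if the symmetric matrix X is positive semidefinite."""
    # eigvalsh assumes a symmetric input and returns real eigenvalues.
    return bool(np.all(np.linalg.eigvalsh(X) >= -tol))

def random_psd(n, rng):
    """Draw a random PSD matrix as B B^T (illustrative helper)."""
    B = rng.standard_normal((n, n))
    return B @ B.T

rng = np.random.default_rng(0)
X, Y = random_psd(4, rng), random_psd(4, rng)

# Conic combinations (theta1, theta2 >= 0) remain inside the PSD cone.
conic_ok = all(
    is_psd(t1 * X + t2 * Y)
    for t1, t2 in rng.uniform(0.0, 5.0, size=(100, 2))
)

# A negative coefficient leaves the cone: -X is not PSD when X is nonzero PSD.
leaves_cone = not is_psd(-X)
```

The second check shows why the PSD cone is a cone but not a subspace: it is closed under nonnegative scaling and addition only.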


Definition 4. A convex cone $K \subseteq \mathbb{R}^n$ is said to be a proper cone if it is closed (contains its boundary), solid (has nonempty interior), and pointed (contains no line). The nonnegative orthant $\mathbb{R}^n_+$ and the positive semidefinite cone $\mathbb{S}^n_+$ are also proper cones.

Definition 5. A proper cone $K$ defines a generalized inequality $\preceq_K$ in $\mathbb{R}^n$:

$$x \preceq_K y \iff y - x \in K,$$

and the strict version

$$x \prec_K y \iff y - x \in \operatorname{int} K. \tag{1.8}$$

The properties of affinity and convexity of sets are preserved under some operations. The most important and common one is intersection, i.e. the intersection of any number of convex sets (affine sets, convex cones) is also convex (affine, a convex cone). Other operations that preserve convexity are transformation under affine functions (such as scaling, translation and projection) and under linear-fractional functions.

1.2.2 Convex Functions

Definition 6. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if $\operatorname{dom} f$ is convex and for all $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$,

$$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y). \tag{1.9}$$

Note that $f$ is concave if $-f$ is convex. Here are some simple examples on $\mathbb{R}$: $x^2$ is a convex function and $\log x$ is concave. One can check the convexity of a function directly from the definition. The convexity of a differentiable function can also be tested using its gradient and Hessian.

First-Order Condition A differentiable function $f$ with convex domain is convex if and only if for all $x, y \in \operatorname{dom} f$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x). \tag{1.10}$$

In other words, the first order Taylor approximation of $f$ is a global under-estimator, as illustrated in Fig. 1.1.

Fig. 1.1 The first order Taylor expansion at any point is a global under estimator of the function

Second-Order Condition A twice differentiable function $f$ with convex domain is convex if and only if for all $x \in \operatorname{dom} f$, $\nabla^2 f(x) \succeq 0$, i.e. the Hessian matrix of $f$ is positive semidefinite on its domain.

The convexity of the following functions can be easily verified using the first and second order conditions of convexity:

• Power function: $f(x) = x^{\alpha}$ is convex on the positive axis for $\alpha \ge 1$.
• $f(x) = x \log x$ is also convex on the positive axis.
• Quadratic functions: $f(x) = x^T P x + 2 q^T x + r$ with a symmetric positive semidefinite matrix $P$ is convex, since $\nabla^2 f = 2P$.
• Least-squares objective: $f(x) = \|Ax - b\|_2^2$ is convex for any matrix $A$ and vector $b$, since $\nabla^2 f = 2A^T A$.

A restatement of the classical definition of convexity results in Jensen's inequality:

• if $f : \mathbb{R}^n \to \mathbb{R}$ is convex, then for $\theta_1, \theta_2 \ge 0$ and $\theta_1 + \theta_2 = 1$,

$$f(\theta_1 x_1 + \theta_2 x_2) \le \theta_1 f(x_1) + \theta_2 f(x_2). \tag{1.11}$$

• if $f : \mathbb{R}^n \to \mathbb{R}$ is convex, then for $\theta_i \ge 0$ and $\sum_i \theta_i = 1$,

$$f\Big(\sum_i \theta_i x_i\Big) \le \sum_i \theta_i f(x_i). \tag{1.12}$$

• if $f : \mathbb{R}^n \to \mathbb{R}$ is convex, then for any random variable $z$,

$$f(\mathbb{E}[z]) \le \mathbb{E}[f(z)], \tag{1.13}$$

where $\mathbb{E}$ denotes the expected value of a random variable.

The concept of convexity can be generalized to vector or matrix-valued functions using convex cones. Recall that a proper cone $K \subseteq \mathbb{R}^m$ defines the generalized inequality $\preceq_K$. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is $K$-convex if $\operatorname{dom} f$ is convex and for all $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$,

$$f(\theta x + (1-\theta) y) \preceq_K \theta f(x) + (1-\theta) f(y). \tag{1.14}$$
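The conditions of this subsection can be spot-checked numerically. The sketch below tests the definition (1.9), the first-order condition (1.10) and the second-order condition for the least-squares objective listed among the examples above. The random data, dimensions and tolerances are assumptions chosen for illustration, and a numerical check of this kind is only a sanity test, not a proof of convexity.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

def f(x):
    """Least-squares objective ||Ax - b||_2^2."""
    r = A @ x - b
    return float(r @ r)

def grad_f(x):
    """Gradient 2 A^T (Ax - b)."""
    return 2.0 * A.T @ (A @ x - b)

# Second-order condition: the Hessian 2 A^T A must be PSD.
hessian = 2.0 * A.T @ A
second_order_ok = bool(np.all(np.linalg.eigvalsh(hessian) >= -1e-10))

# Definition (1.9) and first-order condition (1.10) at random points.
first_order_ok = True
definition_ok = True
for _ in range(200):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    theta = rng.uniform()
    first_order_ok &= bool(f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9)
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    definition_ok &= bool(lhs <= rhs + 1e-9)
```

For this objective the gap in (1.10) equals $\|A(y - x)\|_2^2$, which is why the first-order test can never fail up to rounding.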

1.2.3 Convex Optimization

A vast number of problems in image and signal processing can be posed as constrained optimization problems of the standard form

$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \le 0, \quad i = 1, \dots, m \\ & h_i(x) = 0, \quad i = 1, \dots, p, \end{array} \tag{1.15}$$

where $x \in \mathbb{R}^n$ is the vector of optimization (decision) variables, and $f_0, f_i, h_i : \mathbb{R}^n \to \mathbb{R}$ are, respectively, the objective or cost function, the inequality constraint functions and the equality constraint functions. A point $x$ is feasible if it satisfies all the constraints. The set of all feasible points forms the feasible set $C$. The optimization problem is said to be feasible if the set $C$ is non-empty, and is said to be unconstrained if $m = p = 0$. The optimal value of the problem is denoted by $f^*$, and a point $x^* \in C$ is an optimal point if $f_0(x^*) = f^*$. In other words, $f^* = \inf_{x \in C} f_0(x)$ and $x^* = \arg\min_{x \in C} f_0(x)$.

In general, such problems can be very difficult to solve, especially when the number of optimization variables is large. This is due to many factors. First, the problem may have many local optima as opposed to a single global optimum. Second, it is not always straightforward to find a feasible point. Third, stopping criteria used in general optimization algorithms are often arbitrary and not generalizable. Finally, these algorithms might have very poor convergence rates. However, it is well known that under certain conditions, the first three problems mentioned above can be mitigated. In fact, it then follows that: (1) any local optimum is a global optimum, (2) feasibility of the problem can be determined, and hence algorithms are easy to initialize, and (3) efficient numerical methods that can handle large-scale problems, together with accurate stopping criteria, can be obtained using duality theory. The conditions that lead to these guarantees lead us to convex optimization.

An optimization problem in standard form is a convex optimization problem if $f_0$ and the $f_i$ are all convex functions and the $h_i$ are all affine functions, $h_i(x) = a_i^T x - b_i$. This is often written in the following form:

$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \le 0, \quad i = 1, \dots, m \\ & Ax = b, \end{array} \tag{1.16}$$

where $A \in \mathbb{R}^{p \times n}$ and $b \in \mathbb{R}^p$.

1.2.4 Convex Optimization Problems: Examples

In this section, we present a few canonical convex optimization problems which are extremely useful in practice and have efficient solutions. Essentially, if an optimization problem can be converted into one of these canonical forms, it can be considered solved.


Linear Program (LP) is a convex optimization problem of the form

$$\begin{array}{ll} \text{minimize} & c^T x + d \\ \text{subject to} & Gx \preceq h \\ & Ax = b, \end{array} \tag{1.17}$$

where $G \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{p \times n}$, $c \in \mathbb{R}^n$ and $d \in \mathbb{R}$. In these problems the feasible set is a polyhedron.

Quadratic Program (QP) is a convex optimization problem of the form

$$\begin{array}{ll} \text{minimize} & (1/2)\, x^T P x + q^T x + r \\ \text{subject to} & Gx \preceq h \\ & Ax = b, \end{array} \tag{1.18}$$

where $P$ is symmetric positive semidefinite, i.e. $P \in \mathbb{S}^n_+$, $G \in \mathbb{R}^{m \times n}$ and $A \in \mathbb{R}^{p \times n}$. Arguably the most common QP is the least-squares problem $\min_x \|Ax - b\|_2^2$, which has the analytical solution $x^* = A^{\dagger} b$, where $A^{\dagger}$ is the pseudo-inverse of the matrix $A$.

Quadratically Constrained Quadratic Program (QCQP) is a convex optimization problem of the form

$$\begin{array}{ll} \text{minimize} & (1/2)\, x^T P_0 x + q_0^T x + r_0 \\ \text{subject to} & (1/2)\, x^T P_i x + q_i^T x + r_i \le 0, \quad i = 1, \dots, m \\ & Ax = b, \end{array} \tag{1.19}$$

where $P_i \in \mathbb{S}^n_+$ and $A \in \mathbb{R}^{p \times n}$. This form of QCQP optimization problem is widely used throughout this book. Chapter 3 especially benefits from this category of optimization problems in computational color imaging.

Second Order Cone Program (SOCP) is a convex optimization problem that subsumes both LP and QP and has the form

$$\begin{array}{ll} \text{minimize}_x & f^T x \\ \text{subject to} & \|A_i x + b_i\|_2 \le c_i^T x + d_i, \quad i = 1, \dots, m \\ & Fx = g, \end{array} \tag{1.20}$$

where $A_i \in \mathbb{R}^{n_i \times n}$ and $F \in \mathbb{R}^{p \times n}$. Note that if $c_i = 0$, the SOCP reduces to a QCQP; if $A_i = 0$, it reduces to an LP.

Semi-Definite Program (SDP) is a convex optimization problem with a linear matrix inequality (LMI) constraint of the form

$$\begin{array}{ll} \text{minimize}_x & c^T x \\ \text{subject to} & x_1 F_1 + \dots + x_n F_n + G \preceq 0 \\ & Ax = b, \end{array} \tag{1.21}$$

where $G, F_i \in \mathbb{S}^k$ and $A \in \mathbb{R}^{p \times n}$.

Sometimes, it is possible to transform a given problem in non-standard form into a canonical representation. As an example, here we take an $\ell_\infty$ optimization problem and convert it to a canonical form using slack variables. Consider the following optimization problem, which is not straightforward to solve at first glance:

$$\begin{array}{ll} \text{minimize}_x & \|Ax - b\|_\infty \\ \text{subject to} & Fx \preceq g. \end{array} \tag{1.22}$$

By introducing a slack variable $t \in \mathbb{R}$ we can easily convert the above problem to an LP in the variables $x \in \mathbb{R}^n$ and $t$ and rewrite it as follows, where $\mathbf{1}$ denotes the all-ones vector of the same length as $Ax - b$:

$$\begin{array}{ll} \text{minimize}_{x,t} & t \\ \text{subject to} & Ax - b \preceq t\mathbf{1} \\ & Ax - b \succeq -t\mathbf{1} \\ & Fx \preceq g. \end{array} \tag{1.23}$$
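The slack-variable conversion of the $\ell_\infty$ problem can be made concrete by stacking the constraints into the data of an LP in the joint variable $v = (x, t)$. The sketch below is one way to do this; the function name `linf_as_lp`, the random data, and the inequality form $A_{ub} v \le b_{ub}$ are assumptions for illustration. The resulting arrays could then be passed to any LP solver that accepts this form.

```python
import numpy as np

def linf_as_lp(A, b, F, g):
    """Build (c, A_ub, b_ub) for: minimize c^T v s.t. A_ub v <= b_ub, v = (x, t)."""
    m, n = A.shape
    ones = np.ones((m, 1))
    c = np.concatenate([np.zeros(n), [1.0]])          # objective: t
    A_ub = np.vstack([
        np.hstack([A, -ones]),                        #  Ax - b <= t 1
        np.hstack([-A, -ones]),                       # -(Ax - b) <= t 1
        np.hstack([F, np.zeros((F.shape[0], 1))]),    #  Fx <= g
    ])
    b_ub = np.concatenate([b, -b, g])
    return c, A_ub, b_ub

rng = np.random.default_rng(2)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
F = rng.standard_normal((2, 3))
x = rng.standard_normal(3)
g = F @ x + 1.0                                       # chosen so this x is feasible

c, A_ub, b_ub = linf_as_lp(A, b, F, g)
t_star = np.max(np.abs(A @ x - b))                    # ||Ax - b||_inf

# For a fixed feasible x, the smallest admissible t is exactly ||Ax - b||_inf.
feasible = bool(np.all(A_ub @ np.append(x, t_star) <= b_ub + 1e-12))
too_small = bool(np.all(A_ub @ np.append(x, t_star - 1e-3) <= b_ub + 1e-12))
```

The check confirms the equivalence argument: for any fixed $x$, minimizing over $t$ alone returns $\|Ax - b\|_\infty$, so minimizing jointly over $(x, t)$ solves the original problem.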

1.2.5 Optimization Methods

1.2.5.1 Lagrange Multipliers and Duality

Consider a constrained optimization problem in standard form which is not necessarily convex. The basic idea of the method of Lagrange multipliers is to augment the objective function with a weighted sum of the constraint functions of the optimization problem. The Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ for an optimization problem in standard form (1.15) is defined as:

$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x), \tag{1.25}$$

where $\lambda_i$ and $\nu_i$ are, respectively, referred to as the Lagrange multipliers associated with the inequality and equality constraints.

The Lagrange dual function, or simply dual function, $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ is defined as the minimum value of the Lagrangian over the optimization variable $x$, where $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$ are vectors containing all $\lambda_i$ and $\nu_i$ respectively:

$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) = \inf_{x \in D} \Big( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \Big), \tag{1.26}$$

where $D$ is the domain of the problem, i.e. the intersection of the domains of the objective and all constraint functions. Note that the dual function is always concave even if the original problem is not convex, since it is the pointwise infimum of a family of functions that are affine in $(\lambda, \nu)$.

Lagrange multiplier methods have been widely used in image processing to solve optimization problems. For example in Chap. 4, in synthetic aperture radar (SAR) imaging, Lagrange multipliers are used within the framework of calculus of variations to reconstruct radar imagery.

The dual function provides a lower bound on the optimal value $f^*$ of the optimization problem. For any $\lambda \succeq 0$ and any $\nu$ we have

$$g(\lambda, \nu) \le f^*. \tag{1.27}$$

The best lower bound that can be found from the dual function is obtained from the Lagrange dual problem:

$$\begin{array}{ll} \text{maximize} & g(\lambda, \nu) \\ \text{subject to} & \lambda \succeq 0. \end{array} \tag{1.28}$$

The Lagrange dual problem is always a convex optimization problem. Its optimal value, denoted $d^*$, is the best lower bound on the primal optimal value $p^* = f^*$. In particular the following important inequality always holds:

$$\text{Weak duality:} \quad d^* \le p^*. \tag{1.29}$$

If the equality $d^* = p^*$ holds, i.e. the optimal duality gap is zero, then strong duality holds. Strong duality does not hold in general but it does for many convex problems. The conditions under which strong duality holds are called constraint qualifications. One simple example of a constraint qualification is Slater's condition, which states that strong duality holds if there exists a strictly feasible point for the convex optimization problem.
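To make weak and strong duality tangible, consider the least-norm problem of minimizing $x^T x$ subject to $Ax = b$. This example is an assumption added for illustration and is not taken from the chapter. Minimizing the Lagrangian $x^T x + \nu^T (Ax - b)$ over $x$ gives $x = -A^T \nu / 2$, hence the dual function $g(\nu) = -\tfrac{1}{4} \nu^T A A^T \nu - b^T \nu$. The sketch below checks numerically that $g(\nu) \le p^*$ for random $\nu$, and that the gap closes at the dual maximizer, as expected since the constraints are affine and feasible.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 5))    # p = 2 equality constraints, n = 5 variables
b = rng.standard_normal(2)
AAt = A @ A.T

def dual(nu):
    """Dual function g(nu) = -(1/4) nu^T A A^T nu - b^T nu."""
    return float(-0.25 * nu @ AAt @ nu - b @ nu)

# Primal optimum: least-norm solution x* = A^T (A A^T)^{-1} b.
x_star = A.T @ np.linalg.solve(AAt, b)
p_star = float(x_star @ x_star)

# Dual optimum: setting the gradient of g to zero gives nu* = -2 (A A^T)^{-1} b.
nu_star = -2.0 * np.linalg.solve(AAt, b)
d_star = dual(nu_star)

# Weak duality (1.29): every dual value is a lower bound on p*.
weak_duality_ok = all(
    dual(3.0 * rng.standard_normal(2)) <= p_star + 1e-9 for _ in range(1000)
)
duality_gap = p_star - d_star      # approximately zero: strong duality holds here
```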

1.2.5.2 Descent Methods

In this section we present a family of methods called descent methods. These algorithms produce a minimizing sequence $x^{(k)}$, $k = 1, 2, \dots$, where

$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}. \tag{1.30}$$

Here $\Delta x^{(k)} \in \mathbb{R}^n$ is called the step or search direction, $k$ denotes the iteration number and $t^{(k)} > 0$ is the step size at iteration $k$. All descent methods have the property that the cost function decreases at each iteration (unless $x^{(k)}$ is already optimal), i.e.

$$f(x^{(k+1)}) < f(x^{(k)}). \tag{1.31}$$

Descent algorithms differ in how they choose the search direction (for example gradient descent or Newton's method) and the step size (for example exact line search or backtracking line search). A natural choice for the search direction is the negative gradient, $\Delta x = -\nabla f(x)$, which results in a descent sequence for a suitably chosen step size.
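A minimal sketch of the update (1.30) with the negative gradient as search direction is given below, applied to the least-squares objective. To keep the example short it uses a constant step size $1/L$, where $L$ is a Lipschitz constant of the gradient, in place of the line searches mentioned above. The function name `gradient_descent`, the stopping rule and the random data are assumptions for illustration.

```python
import numpy as np

def gradient_descent(f, grad, x0, step, tol=1e-9, max_iter=10_000):
    """Iterate x <- x + t * dx with dx = -grad(x) and a constant step t."""
    x = np.asarray(x0, dtype=float)
    values = [f(x)]
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:    # (near-)stationary point reached
            break
        x = x - step * g                # update (1.30) with dx = -g
        values.append(f(x))
    return x, values

rng = np.random.default_rng(4)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

f = lambda x: float(np.sum((A @ x - b) ** 2))
grad = lambda x: 2.0 * A.T @ (A @ x - b)
L = 2.0 * np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the gradient

x_gd, values = gradient_descent(f, grad, np.zeros(5), step=1.0 / L)
x_ls = np.linalg.pinv(A) @ b            # closed-form solution via the pseudo-inverse
```

On this problem the iterates approach the pseudo-inverse solution and the recorded objective values are non-increasing, in line with (1.31).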


1.2.5.3 Iterative Thresholding Methods

Iterative thresholding methods have recently gained much attention in solving hard convex and non-convex optimization problems. Arguably one of the best-known examples of such methods is the Iterative Shrinkage Thresholding Algorithm (ISTA) [9]. The general step of ISTA is of the form:

$$x^{(k+1)} = T(G(x^{(k)})), \tag{1.32}$$

where $G(\cdot)$ stands for a gradient step to fit to the data and $T$ is the thresholding step. ISTA is well-suited to solve the more general non-smooth convex optimization problem of the form:

$$\min_x \; f(x) + g(x), \tag{1.33}$$

where $g$ is convex but possibly non-smooth and $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth convex function, which is continuously differentiable with Lipschitz continuous gradient with constant $L(f)$:

$$\|\nabla f(x) - \nabla f(y)\| \le L(f)\, \|x - y\| \quad \forall\, x, y \in \mathbb{R}^n. \tag{1.34}$$

In summary, the ISTA algorithm with constant step size is as follows:

1. Initialize $x^{(0)} \in \mathbb{R}^n$ and $L$ as a Lipschitz constant of $\nabla f$
2. Repeat the following iterative step until convergence:

$$x^{(k+1)} = \arg\min_x \left\{ g(x) + \frac{L}{2} \left\| x - \left( x^{(k)} - \frac{1}{L} \nabla f(x^{(k)}) \right) \right\|_2^2 \right\}. \tag{1.35}$$

The objective values produced by this method are guaranteed to converge to the optimal value of the cost function. There are other variations of ISTA that can guarantee faster convergence or less expensive iterations, which are beyond the scope of this chapter. Iterative shrinkage thresholding methods are very useful in many sparse recovery problems in image and signal processing. For example, the faster version of ISTA (FISTA) is extensively used in Chap. 8 for solving sparsity-constrained optimization problems in image processing.

1.2.5.4 Alternating Direction Method of Multipliers

The Alternating Direction Method of Multipliers (ADMM) is an algorithm designed to solve problems of the form:

$$\begin{array}{ll} \text{minimize} & f(x) + g(z) \\ \text{subject to} & Ax + Bz = c, \end{array} \tag{1.36}$$

with variables $x \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$, where $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$ and $c \in \mathbb{R}^p$. The assumption here is that both $f$ and $g$ are convex functions. The only difference from the general linear equality-constrained problem is that the optimization variable is split into two parts, $x$ and $z$. The augmented Lagrangian of the optimization problem is formed as follows:

$$L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (\rho/2)\, \|Ax + Bz - c\|_2^2. \tag{1.37}$$

The ADMM algorithm in its general form comprises the following steps at each iteration ($\rho > 0$):

$$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k) \tag{1.38}$$

$$z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k) \tag{1.39}$$

$$y^{k+1} = y^k + \rho\, (Ax^{k+1} + Bz^{k+1} - c). \tag{1.40}$$

It consists of an $x$-minimization step (1.38), a $z$-minimization step (1.39), and a dual variable $y$ update (1.40). For convenience, with the scaled dual variable $u^k = y^k / \rho$, ADMM can be written as:

$$x^{k+1} = \arg\min_x \Big( f(x) + (\rho/2)\, \|Ax + Bz^k - c + u^k\|_2^2 \Big) \tag{1.41}$$

$$z^{k+1} = \arg\min_z \Big( g(z) + (\rho/2)\, \|Ax^{k+1} + Bz - c + u^k\|_2^2 \Big) \tag{1.42}$$

$$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c. \tag{1.43}$$

These two forms are equivalent. The former is called the unscaled form, while the latter is the scaled form.

Defining the residual at iteration $k$ as $r^k = Ax^k + Bz^k - c$, it can be seen that $u^k$ is the running sum of all the residuals:

$$u^k = u^0 + \sum_{j=1}^{k} r^j. \tag{1.44}$$

Under mild assumptions ADMM satisfies some important convergence properties:

1. The residual converges to zero as the number of iterations grows, i.e. $r^k \to 0$ as $k \to \infty$.
2. The objective function converges to the optimal value ($p^*$) as $k \to \infty$.
3. The dual variable converges, $y^k \to y^*$, as $k \to \infty$.

ADMM is a very powerful optimization tool in image and signal processing. Later in this book, we shall see its application in image recovery problems (Chap. 8) and in dictionary learning problems for classification (Chap. 6). In its simpler form, which is an alternating optimization, it is applicable in computational color imaging (Chap. 3), image restoration (Chap. 7), dictionary learning, and joint and bi-convex optimizations.
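As an illustration of the scaled form (1.41)-(1.43), the sketch below applies ADMM to an $\ell_1$-regularized least-squares problem, written in the form (1.36) with $f(x) = \tfrac{1}{2}\|Dx - y\|_2^2$, $g(z) = \lambda \|z\|_1$ and the constraint $x - z = 0$. The problem instance, the names `admm_lasso` and `soft_threshold`, and the values of $\lambda$ and $\rho$ are assumptions for this example. The data matrix is called `D` to avoid a clash with the constraint matrices of (1.36), and the multiplier is kept in the scaled variable `u`.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Entrywise soft thresholding: the minimizer of kappa*|z| + 0.5*(z - v)^2."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(D, y, lam, rho=1.0, n_iter=5000):
    """Scaled-form ADMM for: minimize 0.5*||Dx - y||^2 + lam*||z||_1 s.t. x - z = 0."""
    n = D.shape[1]
    H = D.T @ D + rho * np.eye(n)       # system matrix of the x-update
    Dty = D.T @ y
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(n_iter):
        x = np.linalg.solve(H, Dty + rho * (z - u))   # x-minimization (1.41)
        z = soft_threshold(x + u, lam / rho)          # z-minimization (1.42)
        u = u + x - z                                 # scaled dual update (1.43)
    return x, z, u

rng = np.random.default_rng(5)
D = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[[1, 4, 7]] = [2.0, -1.0, 1.5]
y = D @ x_true + 0.01 * rng.standard_normal(30)
lam = 1.0

x_hat, z_hat, u_hat = admm_lasso(D, y, lam)
primal_residual = np.linalg.norm(x_hat - z_hat)       # r^k from (1.44)
```

Both subproblems have closed forms here: the $x$-update is a linear solve and the $z$-update is a soft thresholding, which is what makes the splitting attractive.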

1.2.6 Handling Non-convex Problems

In this section we present some methods and analytical tricks to handle non-convex optimization problems. While these methods cannot be applied to every non-convex problem, they can come in handy in many situations. Non-convex problems are more common in practice than convex problems. However, convex optimization methods can still be very helpful in solving these, even if approximately. Well-known examples of non-convex problems are mixed integer programming, programs with non-convex constraints or cost functions and problems with non-convex feasible set. We first illustrate how to convert a non-convex problem into convex form with a few simple examples. Subsequently, we will proceed to more complicated examples that demonstrate how convex optimization methods can help in solving non-convex problems.

Log-Convexity In many problems, especially in statistics, log-concave or log-convex functions are encountered often. By definition, a function $f : \mathbb{R}^n \to \mathbb{R}_+$ is log-concave (log-convex) if $\log f$ is concave (convex). Consider a Gaussian distribution as a common example. With a symmetric positive definite matrix $\Sigma$, the function $f(x) = \exp(-0.5\, (x - x_0)^T \Sigma\, (x - x_0))$ is non-convex but log-concave, i.e. $\log f(x)$ is a concave function. Since $\log$ is monotonically increasing, we can maximize $\log f(x)$ instead of the non-convex $f(x)$.

Geometric Programs Here we discuss a family of problems that are not generally convex. However, they can be transformed into convex optimization problems via a change of variables or a transformation of the cost function or constraints. We first present the geometric programming problem in its general form. A function $f : \mathbb{R}^n_{++} \to \mathbb{R}$ of the form

$$f(x) = c\, x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}, \tag{1.45}$$

where $c > 0$ and $a_i \in \mathbb{R}$, is a monomial function. A sum of monomial functions, with $c_k > 0$, is called a posynomial function, which has the following form:

$$f(x) = \sum_{k=1}^{K} c_k\, x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}}. \tag{1.46}$$

An optimization problem of the form

$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \le 1, \quad i = 1, \dots, m \\ & h_i(x) = 1, \quad i = 1, \dots, p \\ & x \succ 0, \end{array} \tag{1.47}$$

where $f_0, f_1, \dots, f_m$ are posynomial functions and $h_1, \dots, h_p$ are monomial functions, is called a geometric program. We first introduce a change of variables, $y_i = \log x_i$. Then we can rewrite the monomial function in (1.45) as an exponential of an affine function,


$$f(x) = c\, (e^{y_1})^{a_1} (e^{y_2})^{a_2} \cdots (e^{y_n})^{a_n} = c\, e^{a^T y} = e^{a^T y + b}, \tag{1.48}$$

where $b = \log c$. With the same change of variables, the posynomial function (1.46) is transformed into the following form:

$$f(x) = \sum_{k=1}^{K} e^{a_k^T y + b_k}, \tag{1.49}$$

which is a sum of exponentials of affine functions. Substituting the above into (1.47) results in:

$$\begin{array}{ll} \text{minimize} & \sum_{k=1}^{K_0} e^{a_{0k}^T y + b_{0k}} \\ \text{subject to} & \sum_{k=1}^{K_i} e^{a_{ik}^T y + b_{ik}} \le 1, \quad i = 1, \dots, m \\ & e^{g_i^T y + h_i} = 1, \quad i = 1, \dots, p. \end{array} \tag{1.50}$$

Equivalently, taking logarithms, this can be written as:

$$\begin{array}{ll} \text{minimize} & \log \Big( \sum_{k=1}^{K_0} e^{a_{0k}^T y + b_{0k}} \Big) \\ \text{subject to} & \log \Big( \sum_{k=1}^{K_i} e^{a_{ik}^T y + b_{ik}} \Big) \le 0, \quad i = 1, \dots, m \\ & g_i^T y + h_i = 0, \quad i = 1, \dots, p. \end{array} \tag{1.51}$$

In this form, the cost function and the inequality constraint functions are log-sum-exp functions of affine expressions, which are convex, and the equality constraints are affine. As a consequence, we have managed to convert a non-convex geometric programming problem into convex form through a transformation of variables. It must be emphasized that the problems in the original non-convex form and in the transformed convex form are in fact equivalent.

Convex Relaxations Directly solving a non-convex problem is often analytically challenging, and may even be NP-hard. It is sometimes difficult or even impossible to convert a non-convex problem into convex form using a change of variables or a transformation of the cost function. One way to handle the non-convexity of the cost function or constraints is convex relaxation. Convex relaxation methods approximate the non-convex cost function or constraint by convex ones. There are several applications of convex relaxation methods in image and signal processing, including but not limited to classification, recognition, deblurring, inpainting and superresolution, denoising, segmentation, 3D reconstruction, and optical flow estimation. One of the most important optimization problems in image processing is sparse recovery (1.2). The NP-hard $\ell_0$ problem is relaxed [10] in practice to $\ell_1$, which lends itself to a convex formulation:

$$\min_x \; \|x\|_1 \quad \text{subject to} \quad y = Ax. \tag{1.52}$$
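For intuition about how such an $\ell_1$ formulation is solved in practice, the sketch below applies ISTA from Sect. 1.2.5.3 to the penalized problem $\min_x \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$. This is a closely related surrogate of (1.52), not the equality-constrained problem itself; the value of $\lambda$, the sensing matrix, the sparse test signal and the function name `ista` are assumptions for illustration. With $g = \lambda \|\cdot\|_1$, the step (1.35) reduces to a gradient step followed by soft thresholding with threshold $\lambda / L$.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Entrywise soft thresholding, the thresholding step T in (1.32)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def ista(A, y, lam, n_iter=5000):
    """ISTA with constant step 1/L for 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    objective = []
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)             # gradient step G in (1.32)
        x = soft_threshold(x - grad / L, lam / L)
        objective.append(0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x)))
    return x, objective

rng = np.random.default_rng(6)
m, n = 60, 100                               # fewer measurements than unknowns
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[5, 37, 80]] = [2.0, -1.5, 1.0]       # 3-sparse signal
y = A @ x_true                               # noiseless measurements

x_hat, objective = ista(A, y, lam=0.05)
rel_error = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

The estimate is slightly shrunk toward zero relative to the true signal, which is the usual bias introduced by the $\ell_1$ penalty; smaller values of $\lambda$ reduce this bias at the cost of slower convergence.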

Note that in the relaxed version of the problem, `0 -norm Pn is substituted with the `1 -norm (kxk1 D iD1 jxi j). As discussed at the end of Sect. 1.1, optimization problems in the form of (1.2) and (1.52) are widely used in many image processing applications, ranging from inverse problems such as image reconstruction, denoising, inpainting, and super resolution [11–15], to classification tasks [8]. We will exploit these types of optimization problems in Chap. 6 for dictionary learning and classification, in Chap. 7 for image restoration, and in Chap. 8 for recovery using sparse priors. Another example of convex relaxation deals with minimizing the rank of a matrix. A matrix X of rank r has exactly r nonzero singular values. A convex relaxation of the hard rank-minimization problem minimizes the sum of singular values, or the nuclear norm, instead of counting nonzero ones: kXk D

n X

k .X/;

(1.53)

kD1

where $\sigma_k(X)$ denotes the $k$th largest singular value of the matrix $X$. This relaxation has been useful in many areas of image and video processing, including matrix completion [16], background detection [17], denoising [18], etc. Later in this book, we will encounter applications of optimization problems with rank/nuclear norm terms for radar imagery in Chap. 4 and image restoration in Chap. 7.

ADMM in Non-convex Problems. We now explore the use of ADMM for non-convex problems. Consider a constrained optimization problem:

$$\text{minimize} \ \ f(x) \quad \text{subject to} \ \ x \in S, \tag{1.54}$$

where $f$ is a convex function but $S$ is non-convex. In such a scenario, ADMM steps of the following form lead to a suboptimal solution of the main optimization problem:

$$x^{k+1} = \arg\min_{x} \left( f(x) + (\rho/2)\,\|x - z^{k} + u^{k}\|_2^2 \right) \tag{1.55}$$

$$z^{k+1} = \Pi_S\!\left(x^{k+1} + u^{k}\right) \tag{1.56}$$

$$u^{k+1} = u^{k} + x^{k+1} - z^{k+1}, \tag{1.57}$$

where $\Pi_S$ is the projection onto $S$. The $x$-update step is convex since $f$ is a convex function, but the $z$-update step is a projection onto a non-convex set. In some special cases it can be computed exactly. As an example, if

• $S$ represents a cardinality set, i.e. $S = \{x \mid \|x\|_0 \le K\}$, then $\Pi_S$ keeps the $K$ largest-magnitude elements and zeroes out all the others. This is just like ADMM for the $\ell_1$ problem, except that soft thresholding is replaced with hard thresholding.
• $S$ is the set of matrices with rank $r$, then $\Pi_S$ is computed using the singular value decomposition and keeping the $r$ largest singular values.
• $S$ represents a boolean set, then $\Pi_S$ rounds entries to zero or one.

Another scenario where ADMM is useful is the bi-convex problem,

$$\text{minimize} \ \ f(x, z) \quad \text{subject to} \ \ g(x, z) = 0, \tag{1.58}$$

where $f$ is bi-convex and $g$ is bi-affine. That is, $f$ ($g$) is convex (affine) in $x$ for each fixed $z$ and convex (affine) in $z$ for each fixed $x$. Now, if $f$ is separable in $x$ and $z$, we can rewrite the ADMM procedure in the following form:

$$x^{k+1} = \arg\min_{x} \left( f(x, z^{k}) + (\rho/2)\,\|g(x, z^{k}) + u^{k}\|_2^2 \right) \tag{1.59}$$

$$z^{k+1} = \arg\min_{z} \left( f(x^{k+1}, z) + (\rho/2)\,\|g(x^{k+1}, z) + u^{k}\|_2^2 \right) \tag{1.60}$$

$$u^{k+1} = u^{k} + g(x^{k+1}, z^{k+1}), \tag{1.61}$$

where both the $x$- and $z$-updates are tractable convex optimization problems. Note that in the absence of constraints, this ADMM procedure reduces to simple alternating minimization.

Sequence of Convex Problems. Another trick that may come in handy in dealing with non-convexity or non-linearity is to consider a sequence of related convex optimization problems instead of a non-convex problem. The idea is to develop, and solve, a sequence of individually tractable convex optimization problems whose solutions converge to a (possibly sub-optimal) solution of the original non-convex optimization problem. Sequential Quadratic Programming (SQP) [19, 20] is a well-known iterative method that solves a sequence of quadratic programs and is applied to both convex and non-convex optimization problems. Under certain assumptions on the cost function and constraints, it has been shown that SQP converges to a KKT point. More generally, researchers have developed other sequential approaches that involve solving customized convex subproblems [21, 22] in an iterative manner. Many of these have found significant application in image and signal processing [21–24]. An illustration of these techniques can be seen in Chap. 8, where they are used to solve a non-convex, mixed-integer programming problem that arises in the context of image recovery and classification.
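Returning to the non-convex ADMM iteration (1.55)–(1.57), the following is a minimal sketch for the cardinality set, where the projection is hard thresholding. The choice $f(x) = \tfrac{1}{2}\|x - b\|_2^2$, the function names, and the parameter values are assumptions made for illustration only (they do not come from the text); this particular $f$ is picked because it gives the $x$-update a closed form.

```python
def project_cardinality(v, k):
    """Projection onto {x : ||x||_0 <= k}: keep the k largest-magnitude entries."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def project_boolean(v):
    """Projection onto {0, 1}^n: round each entry to the nearer of 0 and 1."""
    return [1.0 if vi >= 0.5 else 0.0 for vi in v]


def admm_cardinality(b, k, rho=1.0, iters=100):
    """Steps (1.55)-(1.57) for the assumed cost f(x) = 0.5 * ||x - b||^2."""
    n = len(b)
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update (1.55): the quadratic objective has a closed-form minimizer.
        x = [(b[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update (1.56): projection onto the non-convex set (hard thresholding).
        z = project_cardinality([x[i] + u[i] for i in range(n)], k)
        # u-update (1.57): accumulate the residual x - z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


b = [3.0, -0.5, 2.0, 0.1]          # hypothetical data
print(admm_cardinality(b, k=2))    # keeps the two dominant entries of b
```

For this simple cost the iterates settle on the hard-thresholded version of `b`; for a general convex $f$ the $x$-update would need an inner solver, and, as the text notes, only a suboptimal solution can be expected.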

1.3 Book Outline

Through the remaining chapters in this book, it is our intention to introduce the reader to recent advances in different aspects of imaging science that are based on the formulation and solution of novel optimization problems.

Image quality is a cardinal measure of design performance in several imaging applications. Accordingly, we begin with the development of a systematic framework for optimizing perceptual image quality in Chap. 2, authored by Brunet et al. The Structural SIMilarity (SSIM) index is arguably one of the most widely used full-reference image quality assessment methods. Its key premise is to exploit the higher sensitivity of human perception to structural distortions in images, compared to non-structural (e.g. luminance, contrast, translation and rotation) distortions. The authors leverage the quasi-convexity property of the SSIM index to formulate a set of SSIM-optimal solutions that address popular image processing problems such as denoising, restoration, and dictionary construction.

In Chap. 3, Bala et al. present different flavors of color transform optimization that have been developed for color correction and illuminant estimation, display characterization, and printer characterization. In each case, an appropriate notion of perceptual error is minimized in a constrained setting. The constraints are motivated by practical considerations such as computational cost, power, and device physics. It is interesting to note that the analytical development of sparse lattice optimization in the context of printer characterization has wider applicability in other domains warranting multidimensional function transformations.

Chapters 4 and 5 both view the process of image formation or reconstruction through an optimization perspective. Image reconstruction is invariably an inverse problem, where the original scene is reconstructed from a set of noisy sensor measurements. It is also typically ill-posed, and the recovery process involves solving a regularized least-squares problem. A recurring theme in such solutions is the use of non-Gaussian priors to preserve specific image characteristics. Another common thread emerging in optimization problems is to identify sparsity of the signal of interest in an appropriately chosen basis and then leverage the algorithmic tools of compressive sensing. In Chap. 4, the authors discuss analytical and numerical optimization methods for synthetic aperture radar (SAR) image reconstruction. They also explore low-rank matrix recovery methods and Bayesian compressive sensing, specifically the relevance vector machine. In Chap. 5, recent multidimensional imaging techniques for spectral and ultrafast imaging are discussed that overcome the temporal, spectral, and spatial resolution limitations of conventional scanning-based systems.

In Chaps. 6–8, classical image processing problems such as restoration and classification are revisited in light of recent developments in sparse signal representation theory. The central problem in compressive sensing is to recover a signal given a collection of linear measurements with respect to a measurement matrix, or dictionary, often in the presence of acquisition noise. Accordingly, two flavors of problems arise: (1) how to recover the signal given a fixed dictionary, and (2) how to learn the optimal model dictionary for a specific task. While true sparse recovery is NP-hard, convex relaxations to this problem have been shown to work very well in practice. Chapter 6 focuses on the aspect of dictionary learning for object recognition, in a variety of learning paradigms, viz. supervised, semi-supervised, unsupervised and weakly supervised. Chapter 7 presents an alternative minimization strategy to approximately solve a non-convex optimization formulation, with applications in non-local image denoising and compressive sensing. The equivalence of Bayesian and sparsity-enforcing regularization formulations of such estimation problems, identified in Chap. 7, is formalized in the next chapter. The authors in Chap. 8 introduce formulations of, and solutions to, novel flavors of estimation problems, which enforce sparsity-inducing priors in an optimization framework built on the family of spike-and-slab priors. The framework is developed in all generality for signal estimation, with customized applications to image recovery and classification.

Chapter 9 presents different optimization problems on manifold-valued curves that can benefit applications in computer vision. The nonlinear geometry of manifolds and variable time parameterizations of manifold-valued curves pose significant challenges. The optimization problems encountered in this context include: learning a sparse representation for human action recognition using extrinsic and intrinsic features, solving the registration problem between two Riemannian trajectories, and learning an optimal clustering scheme for symbolic approximation.

References

1. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
2. Wiener N (1949) Extrapolation, interpolation, and smoothing of stationary time series: With engineering applications. J Am Stat Assoc 47(258):489–509
3. Karmarkar N (1984) A new polynomial time algorithm for linear programming. Combinatorica 4(4):373–395
4. Nesterov Y, Nemirovskii A (1994) Interior-point polynomial methods in convex programming. Society for Industrial and Applied Mathematics, Philadelphia, PA
5. Wright SJ (1997) Primal-dual interior-point methods. Society for Industrial and Applied Mathematics, Philadelphia, PA
6. Forsgren A, Gill PE, Wright MH (2002) Interior methods for nonlinear optimization. SIAM Rev 44(4):525–597
7. Candes EJ, Romberg JK, Tao T (2006) Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math 59(8):1207–1223
8. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
9. Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202
10. Donoho DL (2006) For most large underdetermined systems of linear equations the minimal 1-norm solution is also the sparsest solution. Commun Pure Appl Math 59(6):797–829
11. Baraniuk RG (2007) Compressive sensing. IEEE Signal Process Mag 24(4):118–121
12. Yang J, Wright J, Huang T, Ma Y (2008) Image super-resolution as sparse representation of raw image patches. In: Proc. IEEE Conf Comput Vision Pattern Recogn, pp 1–8
13. Bertalmio M, Sapiro G, Caselles V, Ballester C (2000) Image inpainting. In: Proc. 27th Annual Conf Computer Graph Interactive Tech, pp 417–424
14. Deledalle C-A, Denis L, Tupin F (2009) Iterative weighted maximum likelihood denoising with probabilistic patch-based weights. IEEE Trans Image Process 18(12):2661–2672
15. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 15(12):3736–3745
16. Candes EJ, Tao T (2010) The power of convex relaxation: near-optimal matrix completion. IEEE Trans Inf Theory 56(5):2053–2080
17. Cui X, Huang J, Zhang S, Metaxas DN (2012) Background subtraction using low rank and group sparsity constraints. In: Proc. European Conf Comput Vision. Springer, Berlin, pp 612–625
18. Liang X, Ren X, Zhang Z, Ma Y (2012) Repairing sparse low-rank texture. In: Proc. European Conf Comput Vision. Springer, Berlin, pp 482–495
19. Nocedal J, Wright SJ (2006) Sequential quadratic programming. Springer, Berlin
20. Boggs PT, Tolle JW (1995) Sequential quadratic programming. Acta Numer 4:1–51
21. Candes EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted l1 minimization. J Fourier Anal Appl 14(5–6):877–905
22. Mousavi HS, Monga V, Tran TD (2015) Iterative convex refinement for sparse recovery. IEEE Signal Process Lett 22(11):1903–1907
23. Aldayel O, Monga V, Rangaswamy M (2016) Successive QCQP refinement for MIMO radar waveform design under practical constraints. IEEE Trans Signal Process 64(14):3760–3774
24. Srinivas U, Suo Y, Dao M, Monga V, Tran TD (2015) Structured sparse priors for image classification. IEEE Trans Image Process 24(6):1763–1776

2 Optimizing Image Quality

Dominique Brunet, Sumohana S. Channappayya, Zhou Wang, Edward R. Vrscay, and Alan C. Bovik

2.1 Introduction

Optimization of perceptual image quality has traditionally been associated with image compression and is aimed at achieving the highest perceptual quality at the lowest possible encoding rate. The earliest known design of perceptually optimal algorithms can be traced to Mannos and Sakrison's seminal work [21] on image coding with respect to a visual fidelity criterion. The primary challenge with perceptual optimization is the fact that a majority of the state-of-the-art perceptual quality algorithms do not enjoy convenient mathematical properties such as differentiability, convexity and metricity. Further, several of these algorithms employ parameters and thresholds that are again not easy to deal with in an optimization framework. While these challenges appear to make the problem intractable, we present a systematic framework in this chapter to address the problem of perceptual optimization. As a particular and particularly practical example to illustrate the framework, we work with the Structural SIMilarity (SSIM) index.

D. Brunet • Z. Wang • E.R. Vrscay: University of Waterloo, ON N2L 3G1, Canada
S.S. Channappayya: Indian Institute of Technology Hyderabad, Sangareddy, Khandi, Telangana, India
A.C. Bovik: The University of Texas at Austin, Austin, TX 78712, USA

2.1.1 Image Quality Assessment Measures

The goal of image quality assessment (IQA) is to predict the quality of images in a manner that is consistent with human subjective evaluation. IQA can be divided into full-reference, reduced-reference and no-reference, depending on the full, reduced and non-availability of the original (ground truth) image. Full-reference IQA algorithms are better described as image fidelity predictors, since the goal is to measure the similarity between two images. They are often used as quality predictors when one of the images being compared is considered to have pristine quality. By contrast, reduced- and no-reference IQA algorithms rely more on prior knowledge about high-quality natural images in the statistical sense [36]. The focus in this chapter will be limited to full-reference algorithms, and specifically to those methods that are based on the notion of structural similarity. A brief chronological evolution of the Structural SIMilarity (SSIM) index is presented next.

© Springer International Publishing AG 2017. V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_2

The Universal Image Quality Index (UIQI) [35] was the precursor to the successful SSIM index [37] and introduced the idea of measuring local luminance, contrast and structural similarity between a reference image and its test version. For an original image signal $x = \{x_i \mid i = 1, \dots, N\}$ and its distorted version $y = \{y_i \mid i = 1, \dots, N\}$, UIQI is defined as

$$Q(x, y) = \left(\frac{2\mu_x\mu_y}{\mu_x^2 + \mu_y^2}\right)\left(\frac{2\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}\right)\left(\frac{\sigma_{xy}}{\sigma_x\sigma_y}\right), \tag{2.1}$$

where $\mu_x$ is the mean of the image signal, $\sigma_x^2$ its variance and $\sigma_{xy}$ the covariance between the original and distorted versions of the image signal. The three terms in this formulation are the foundations of structural similarity based image quality assessment. The first term measures luminance similarity between the images, the second measures contrast similarity and the third measures correlation, or structural similarity, between the images. If the quality metric is applied on a patch-wise basis to the images, the overall image quality is estimated as

$$Q = \frac{1}{M}\sum_{j=1}^{M} Q_j, \tag{2.2}$$

where $Q_j$ is the UIQI of the $j$th image patch and $M$ is the number of patches.

The Structural SIMilarity (SSIM) index builds on the ideas introduced by UIQI by using weighted means, variances and covariances, in addition to introducing stabilizing constants to $Q$ so as to avoid numerical problems when the local statistics are close to zero. The local SSIM index is defined as [37]

$$\text{SSIM}(x, y) = \left(\frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\right)\left(\frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\right)\left(\frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}\right), \tag{2.3}$$

where $C_1$, $C_2$ and $C_3$ are stabilizing constants. Also, the mean, variance and covariance are estimated locally via

$$\mu_x = w^T x, \tag{2.4}$$

$$\sigma_x^2 = (x - \mu_x e)^T \operatorname{diag}(w)\,(x - \mu_x e), \tag{2.5}$$

$$\sigma_{xy} = (x - \mu_x e)^T \operatorname{diag}(w)\,(y - \mu_y e), \tag{2.6}$$

where $w$ is a normalized weight vector and $e$ is a vector of ones. The SSIM index is also expressed as

$$\text{SSIM}(x, y) = l(x, y)\,c(x, y)\,s(x, y), \tag{2.7}$$

where $l(x, y)$ is the luminance term, $c(x, y)$ is the contrast term and $s(x, y)$ corresponds to the structure term. As shown in [37], SSIM embodies important masking mechanisms in the terms $l(x, y)$ and $c(x, y)$, specifically luminance masking (Weber's law) and contrast masking, both of which are key determinants of image quality. The structure term $s(x, y)$ captures the notion that image distortion destroys perceptually relevant image structure. In its most general form, the SSIM index is defined as

$$\text{SSIM}(x, y) = [l(x, y)]^{\alpha}\,[c(x, y)]^{\beta}\,[s(x, y)]^{\gamma}, \tag{2.8}$$

where the positive exponents $\alpha$, $\beta$, $\gamma$ determine the importance assigned to each of the components. Since the structure term can be negative, $\gamma$ should be normalized to 1 to avoid complex values and so as not to favor anticorrelation. Alternatively, $s(x, y)$ may be replaced by $\max(0, s(x, y))$ to avoid negative values. The SSIM index is applied block-wise to arrive at the image-level SSIM index. Again, as with UIQI,

$$\text{SSIM}(x, y) = \frac{1}{M}\sum_{j=1}^{M} \text{SSIM}_j(x, y), \tag{2.9}$$

where $\text{SSIM}_j$ is the SSIM index of the $j$th image patch and $M$ is the number of patches.

The Multi-Scale-SSIM (MS-SSIM) index [41] is an improvement over the SSIM index in that it measures structural similarity over multiple spatial scales. The MS-SSIM index is defined as

$$\text{MS-SSIM}(x, y) = [l_J(x, y)]^{\alpha_J} \prod_{j=1}^{J} [c_j(x, y)]^{\beta_j}\,[s_j(x, y)]^{\gamma_j}, \tag{2.10}$$

corresponding to $J$ spatial scales. Starting from the highest spatial scale corresponding to the original image resolution, successive spatial scales (at lower resolution) are obtained by decimation, i.e., low-pass filtering of the current scale followed by downsampling by a factor of 2. Following the notation established earlier, $l_J(x, y)$ corresponds to luminance similarity after $(J - 1)$ stages of decimation. In similar fashion, $c_j(x, y)$ and $s_j(x, y)$ correspond to contrast and structural similarity after $j - 1$ decimation stages, respectively. Further, $\alpha_J$ is the exponent applied to the luminance term, while $\beta_j$ and $\gamma_j$ correspond to exponents applied to the contrast and structure terms at the $(j - 1)$st decimation stage, respectively.

Many extensions of the SSIM index have been proposed in the literature. To name a few, we mention the Complex-Wavelet-SSIM (CW-SSIM) [34], Information content Weighted SSIM (IW-SSIM) [38], and SSIM extensions for color images [19, 22] and for videos [39]. A comprehensive comparison of IQA measures was performed in the TID-2008 experiment [29]. MS-SSIM was the clear winner, followed by SSIM in second place. Several other metrics with better performance on this database have been introduced since then, many of them inspired by SSIM. According to the TID-2013 experiment [28], Feature SIMilarity (FSIM) [43], Sparse Feature Fidelity (SFF) [8], PSNR-HA/HMA [27], Block-Based MultiMetric (BMMF) [17] and Spectral Residual SIMilarity (SR-SIM) [42] correlate better with subjective quality assessment than MS-SSIM on the TID database. Yet, SSIM and MS-SSIM have found vast commercial and academic acceptance, owing to their high efficiency in terms of computation (they are very fast) and performance (they correlate with human judgments nearly as well as the top models). Indeed, SSIM and MS-SSIM are marketed and used throughout the global broadcast, cable and satellite television industries. The SSIM team members each received a Primetime Emmy Award in October 2015 for their work. Moreover, because of the simplicity of the SSIM index, it is a good prototype of a perceptual optimization criterion. It is hoped that a detailed study of optimization techniques for the SSIM index will inspire similar studies for future IQA models and for the design of improved IQA methods that will better serve the multimedia industry.
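As a concrete illustration of the local SSIM index (2.3) with the weighted statistics (2.4)–(2.6), here is a minimal sketch in plain Python. The function names are hypothetical, the default weights are uniform, and the default constants $C_1 = (0.01 \cdot 255)^2$, $C_2 = (0.03 \cdot 255)^2$ and $C_3 = C_2/2$ are a common choice for 8-bit images that is assumed here rather than prescribed by this section.

```python
import math


def weighted_stats(x, y, w):
    """Weighted means, variances and covariance, following (2.4)-(2.6)."""
    mu_x = sum(wi * xi for wi, xi in zip(w, x))
    mu_y = sum(wi * yi for wi, yi in zip(w, y))
    var_x = sum(wi * (xi - mu_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mu_y) ** 2 for wi, yi in zip(w, y))
    cov_xy = sum(wi * (xi - mu_x) * (yi - mu_y) for wi, xi, yi in zip(w, x, y))
    return mu_x, mu_y, var_x, var_y, cov_xy


def local_ssim(x, y, w=None, c1=6.5025, c2=58.5225, c3=None):
    """Local SSIM (2.3) of two patches given as flat lists of pixel values."""
    n = len(x)
    if w is None:
        w = [1.0 / n] * n          # assumed: uniform normalized weights
    if c3 is None:
        c3 = c2 / 2.0              # assumed: the usual simplifying choice
    mu_x, mu_y, var_x, var_y, cov_xy = weighted_stats(x, y, w)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    luminance = (2.0 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2.0 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure


# Hypothetical 6-pixel patches.
x = [52.0, 60.0, 71.0, 49.0, 55.0, 64.0]
y = [50.0, 63.0, 69.0, 51.0, 58.0, 62.0]
print(local_ssim(x, x), local_ssim(x, y))
```

Comparing a patch with itself gives a score of one (up to rounding), and any distortion lowers it. In practice a Gaussian window is often used for `w` instead of uniform weights.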

2.1.2 Perceptual Optimization Framework

Perceptual optimization is a particular case of the general optimization framework where the objective measure models the perceptual quality of an image. Before studying the specific case of SSIM-based optimization, we lay the groundwork for perceptual optimization from both the Bayesian and variational perspectives.

We first consider the no-reference IQA case. Let $x$ be an image and let $Q(x)$ be a quality measure of $x$ with $Q(x) \ge 0$ and $Q(x) = 0$ for perfect quality. Given a distorted image $y$ obtained from an unknown image $x$ by a model of distortion $y = D(x)$, we want to restore the original image $x$. To do so, we seek an image $\hat{x}$ that will optimize the quality criterion $Q(\hat{x})$ with the constraint $y = D(\hat{x})$.

Often, the distortion model will include a stochastic component such as additive noise. In that case, the distortion model can be expressed as $y = D(x) + \xi$, where $D$ is the deterministic part and $\xi$ is the noise component. For example, for additive white Gaussian noise, the probability distribution function of $\xi$ will be

$$P(\Xi = \xi) \propto \exp\left(-\tfrac{1}{2}\|\xi\|_2^2/\sigma^2\right),$$

where $\sigma^2$ is the variance of the noise. We can also assume a probabilistic model for the unknown image $x$ such as

$$P(X = x) \propto \exp\left(-Q(x)^{\alpha}/\beta\right).$$

Note that other monotonic transformations of $Q(x)$ could be used instead of a power transformation. Since $\xi = y - D(x)$, the probability distribution function of $\xi$ can also be seen as the conditional probability of $y$ given $x$:

$$P(Y = y \mid X = x) \propto \exp\left(-\tfrac{1}{2}\|y - D(x)\|_2^2/\sigma^2\right).$$

We can now use Bayes' rule to find the probability distribution of the clean image $x$ given the distorted image $y$:

$$P(X = x \mid Y = y) \propto P(Y = y \mid X = x)\,P(X = x) \tag{2.11}$$

$$\propto \exp\left(-\tfrac{1}{2}\|y - D(x)\|_2^2/\sigma^2\right)\exp\left(-Q(x)^{\alpha}/\beta\right). \tag{2.12}$$

The maximum a posteriori estimator is then obtained by finding the image $x$ that maximizes this posterior. Taking the negative logarithm yields, up to an additive constant,

$$-\log P(X = x \mid Y = y) = \tfrac{1}{2}\|y - D(x)\|_2^2/\sigma^2 + Q(x)^{\alpha}/\beta + \text{const}.$$

This is the variational form of the image restoration problem. Notice that the mean squared error in the first term does not come from the image quality model, but rather from the model of the noise. The second term, usually called the regularization term, often adopts simple forms such as the $L^2$-norm (for Tikhonov regularization), the total variation, or the norm of the gradient or of the second-order partial derivatives (for thin-plate splines). From a Bayesian perspective, these represent different models of image quality.

For the full-reference case, we can use a distance (or dissimilarity) measure $d(x, z)$. The goal of the optimization is then to bring the estimated clean image $\hat{x}$ close to a prior image $z$. For example, we could minimize $d(x, z)$ over all $x$ satisfying the constraint $y = D(x)$. Instead of a single prior image, we could also have a dictionary of images $z_1, z_2, \dots, z_P$. In this case, we solve the problem

$$\hat{x} = \arg\min_{x \,:\, y = D(x)} \ \min_{p} \ d(x, z_p). \tag{2.13}$$

If instead we have a (possibly empirical) prior probability distribution $P(z)$, with a stochastic model of distortion $P(Y = y \mid Z = z)$, then Bayes' formula leads to

$$P(Z = z \mid Y = y) \propto P(Y = y \mid Z = z)\,P(Z = z).$$

Finding the maximum a posteriori estimator does not require any optimization of an image quality measure. (Or we could say that the image quality measure was empirically found to be $Q(z)^{\alpha} \propto -\log P(Z = z)$.) However, we could instead look to maximize the expected perceptual quality, that is, to minimize the expected dissimilarity:

$$\hat{x} = \arg\min_{x} \ E_Z\left[d(z, x) \mid Y = y\right] \tag{2.14}$$

$$= \arg\min_{x} \int d(z, x)\,P(Z = z \mid Y = y)\,dz. \tag{2.15}$$

The optimization problems presented in this chapter will roughly follow one of the forms presented above. However, since we sample from the SSIM-optimization literature, we will not present an approach as methodical and exhaustive as the one outlined here.

2.1.3 Chapter Overview

We first review the mathematical properties of SSIM that make it a suitable criterion for optimization. We then study several optimization problems, starting with a local (block-based) perspective. Specifically, we discuss SSIM-optimal equalizer design, SSIM-optimal soft-thresholding algorithms, and SSIM best basis approximation. We then present a variational formulation with an SSIM term that allows passing from a local (block-based) to a global (image-wide) solution. We illustrate the techniques with several examples and compare them with their mean-square-error equivalents.

2.2 Mathematical Properties of the SSIM Index

2.2.1 Structural and Non-Structural Distortions

The main insight behind the Structural Similarity index and the family of related quality measures is a decomposition of images/signals into structural and non-structural components.


Indeed, it is generally observed that for the same mean-square error, structural image distortions are perceived more strongly than non-structural ones. We shall refer to distortions that do not alter the general shape of an image as being "non-structural". Change in luminance, change in contrast, translation and rotation are some examples of non-structural distortions. We want perceptual metrics to be quasi-invariant to these types of distortions. On the other hand, we shall refer to distortions that strongly affect the perceptual quality of an image as being "structural". As we will see, all remaining distortions will be lumped together once the non-structural distortions have been accounted for.

A commonly used simplification of the local SSIM (2.3) is obtained by setting $C_3 = C_2/2$. The formula then reduces to

$$\text{SSIM}(x, y) = \left(\frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\right)\left(\frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\right) \tag{2.16}$$

$$= S_1(x, y)\,S_2(x, y). \tag{2.17}$$

In this simplified form of the SSIM index, only one type of non-structural distortion is considered explicitly: the luminance shift. Given a grayscale image patch $x$, its luminance is a function of the average value $\mu_x$ of the patch. The decomposition of an image patch into structural and non-structural parts is then given by:

$$x = \mu_x e + (x - \mu_x e), \tag{2.18}$$

where $e = (1, 1, \dots, 1)$ is in the direction of the mean and $x - \mu_x e$ is the zero-mean component. The luminance term $S_1$ of the SSIM index acts on the mean direction of the signal, whereas the combined contrast-correlation term $S_2$ involves the zero-mean component of the signal. We can then rewrite the components $S_1$ and $S_2$ using the two following projections:

$$P_1(x) = \mu_x e, \tag{2.19}$$

and

$$P_2(x) = (\mathrm{Id} - P_1)x = x - \mu_x e, \tag{2.20}$$

where $\mathrm{Id}(x) = x$ is the identity operator (matrix). For $x, y \in \mathbb{R}^N$, we define

$$d(x, y) = \left(\frac{\|x - y\|_2^2}{\|x\|_2^2 + \|y\|_2^2 + C}\right)^{1/2}$$

for some positive constant $C$. This is a valid normalized distance metric [7]. Note that

$$1 - S_1(x, y) = \frac{|\mu_x - \mu_y|^2}{\mu_x^2 + \mu_y^2 + C_1} = [d(P_1 x, P_1 y)]^2 \tag{2.21}$$

and

$$1 - S_2(x, y) = \frac{\|(x - \mu_x e) - (y - \mu_y e)\|_2^2}{\|x - \mu_x e\|_2^2 + \|y - \mu_y e\|_2^2 + (N - 1)C_2} = [d(P_2 x, P_2 y)]^2. \tag{2.22}$$

This implies that $1 - S_2(x, y)$ can be interpreted as an inverse-variance-weighted normalized mean square error. The fundamental difference between $1 - S_2(x, y)$ and the normalized mean square error $\|P_2 x - P_2 y\|_2^2$ is that the former models the masking effect by weighting distortions on flatter patches more heavily. By convention, optimization problems are cast as minimization problems. It is of course always possible to pass from maximization to minimization, since maximizing $f(x)$ is equivalent to minimizing $a f(x) + b$ for any $a < 0$. We thus consider the minimization of $f_1(x, y) := 1 - S_1(x, y)$ and $f_2(x, y) := 1 - S_2(x, y)$.
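The identities (2.21) and (2.22) can be checked numerically. The sketch below assumes unweighted sample statistics with the $1/(N-1)$ normalization (which is what makes the $(N-1)C_2$ term appear) and takes $C = N C_1$ in $d$ for the luminance part; that value of $C$ follows from $\|P_1 x\|_2^2 = N\mu_x^2$ and is worked out here rather than stated in the text. All function names and data are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)


def p1(v):
    """Projection (2.19): the constant patch mu_v * e."""
    m = mean(v)
    return [m] * len(v)


def p2(v):
    """Projection (2.20): the zero-mean component v - mu_v * e."""
    m = mean(v)
    return [vi - m for vi in v]


def d_squared(a, b, c):
    """Square of the normalized distance d(a, b) with stabilizing constant c."""
    num = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    den = sum(ai ** 2 for ai in a) + sum(bi ** 2 for bi in b) + c
    return num / den


def s1(x, y, c1):
    mx, my = mean(x), mean(y)
    return (2.0 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)


def s2(x, y, c2):
    n = len(x)
    zx, zy = p2(x), p2(y)
    var_x = sum(v ** 2 for v in zx) / (n - 1)
    var_y = sum(v ** 2 for v in zy) / (n - 1)
    cov = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
    return (2.0 * cov + c2) / (var_x + var_y + c2)


x = [52.0, 60.0, 71.0, 49.0, 55.0, 64.0]
y = [50.0, 63.0, 69.0, 51.0, 58.0, 62.0]
n, c1, c2 = len(x), 6.5025, 58.5225

# Each pair should agree up to rounding error.
print(1.0 - s1(x, y, c1), d_squared(p1(x), p1(y), n * c1))
print(1.0 - s2(x, y, c2), d_squared(p2(x), p2(y), (n - 1) * c2))
```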

2.2.2 Convexity and Quasi-Convexity

The functions $f_1$ and $f_2$ that were derived from the components of the simplified SSIM are not convex everywhere. As studied in detail in [7], for fixed $y$, $f_1$ is convex for $\{x : 0 \le P_1 x \le \sqrt{3}\,P_1 y\}$ and $f_2$ is convex for all points in $\{x : \|P_2 x - P_2 y\|^2 \le \|P_2 y\|^2 (\sqrt{3} - 1)^2\}$. The exact limit of the region of convexity is actually a tear-shaped region which is rotated around the direction of $P_2 y$. Although the functions $f_1$ and $f_2$ are not convex everywhere, they possess a weaker form of convexity called quasi-convexity.
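The $\sqrt{3}$ bound for $f_1$ can be observed numerically in one dimension. The sketch below treats $f_1$ as a function of the patch mean $t$ for a fixed reference mean $m > 0$ and sets the stabilizing constant to zero; both simplifications, and the function names, are assumptions made for illustration. A central second difference stands in for the second derivative.

```python
import math


def f1(t, m=1.0, c=0.0):
    """1 - S1 as a function of the patch mean t, for a fixed reference mean m."""
    return (t - m) ** 2 / (t ** 2 + m ** 2 + c)


def second_difference(f, t, h=1e-3):
    """Central finite-difference estimate of the second derivative of f at t."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)


edge = math.sqrt(3.0)                 # claimed boundary of convexity for m = 1
inside = [0.25, 1.0, edge - 0.05]
outside = [edge + 0.05, 2.5]

# Positive curvature inside (0, sqrt(3) m), negative curvature beyond it.
print([second_difference(f1, t) > 0 for t in inside])
print([second_difference(f1, t) < 0 for t in outside])
```

With $c = 0$ and $m = 1$ the second derivative is $4t(3 - t^2)/(1 + t^2)^3$, which changes sign exactly at $t = \sqrt{3}$. The function keeps increasing beyond that point, which is consistent with it being quasi-convex without being convex.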


Definition 1. Given a convex set $X$, a function $f : X \to \mathbb{R}$ is said to be quasi-convex if its $h$-sublevel set, defined as

$$X_h = \{x \in X \mid f(x) \le h\}, \tag{2.23}$$

is a convex set for all $h \in \operatorname{Range}(f)$.

Any convex function is necessarily quasi-convex. An example of a function that is quasi-convex but not convex is $f(x) = |x|^{1/2}$. Quasi-convexity is a useful property for nonlinear optimization [4]:

Theorem 2.1. Let $X$ be a convex vector space and let $f : X \to \mathbb{R}$ be a quasi-convex function. If $f$ has a minimum, then it is either unique or the function is constant in a neighborhood of the minimum.

It thus means that a function $f$ will have a unique minimum if it is monotonically increasing away from the minimum. The quasi-convexity of $f_1$ for $P_1 x \ge 0$, $P_1 y \ge 0$ and of $f_2$ on the half-plane $(P_2 y)^T P_2 x \ge 0$ is easy to prove. Writing $f_i$ in the form $[d(P_i x, P_i y)]^2$ of (2.21) and (2.22), with $C$ the constant in the denominator, the set of points such that $f_i \le h < 1$ for $i = 1, 2$ is, respectively, the interval (hyperball) centered at $P_i y/(1 - h)$ of mid-length (radius) $\left(\|P_i y\|_2^2\left[(1 - h)^{-2} - 1\right] + C\,h/(1 - h)\right)^{1/2}$, which is a convex set.

2.2.3 Combination Rule

In order to optimize perceptual quality, we need either to optimize the vector-valued function $(f_1, f_2)$ or to devise some way to collapse the vector to a scalar function. We then examine each option in terms of the (quasi-)convexity of the resulting function. In the simplified form of SSIM, the product between $S_1$ and $S_2$ is taken. Other combination rules could also be devised, and the choice we make will affect not only how good our perceptual image quality model is, but also the mathematical properties of the created function. Writing the local SSIM in terms of $f_1 = 1 - S_1$ and $f_2 = 1 - S_2$, we get

$$\text{SSIM}(x, y) = (1 - f_1(x, y))(1 - f_2(x, y)) \tag{2.24}$$

$$= 1 - f_1(x, y) - f_2(x, y) + f_1(x, y)\,f_2(x, y). \tag{2.25}$$

Maximizing SSIM is equivalent to minimizing the following function,

$$1 - \text{SSIM}(x, y) = f_1(x, y) + f_2(x, y) - f_1(x, y)\,f_2(x, y). \tag{2.26}$$

Another simple combination rule would be to take the sum of $f_1$ and $f_2$:

$$F_1(x, y) = f_1(x, y) + f_2(x, y). \tag{2.27}$$

This can be seen as a linear approximation of 1-SSIM. Surprisingly, this approximation performed slightly better in a psycho-visual experiment [5]. Moreover, this alternative form preserves the property of a metric [7]. If $x$ is in the region of convexity for $f_1$ and $f_2$, then it will automatically be in the region of convexity of $F_1$. However, contrary to the convex function case, the sum of quasi-convex functions is not necessarily quasi-convex. So $F_1$ does not inherit the quasi-convexity property of $f_1$ and $f_2$.

Even if the scalarized function to be optimized is neither convex nor quasi-convex, it is still sometimes possible to optimize each component independently. Indeed, the decomposition given in (2.18) is an orthogonal decomposition. The Orthogonal Decomposition Theorem goes as follows:

Theorem 1. Let $X$ be a space and $\{P_k\}_{k=1}^{K-1}$ be orthogonal projections. Define $P_K = \mathrm{Id} - \sum_{k=1}^{K-1} P_k$, where $\mathrm{Id}(x) = x$ is the identity function. Then each point $x \in X$ has a unique decomposition as

$$x = \sum_{k=1}^{K} P_k(x).$$

In particular, $x = y \iff P_k(x) = P_k(y)$ for $1 \le k \le K$.


For the particular case of SSIM, the Orthogonal Decomposition Theorem says that if $\mu_x = \mu_y$ and $x - \mu_x e = y - \mu_y e$, then $x = y$, which is quite trivial. However, this kind of result will also be valid for more complex types of orthogonal decomposition. The next proposition follows immediately from the Orthogonal Decomposition Theorem:

Theorem 2. Let $X = X_1 + X_2$ be an orthogonal decomposition of $X$ and let $f_1$ on $X_1$ and $f_2$ on $X_2$ be two functions. Then

$$\min_{z_1 : z_1 = P_1 z_1} f_1(z_1, P_1 y) + \min_{z_2 : z_2 = P_2 z_2} f_2(z_2, P_2 y) = \min_{z} \ f_1(P_1 z, P_1 y) + f_2(P_2 z, P_2 y). \tag{2.28}$$

Moreover, if $f_1, f_2 > 0$, then

$$\min_{z_1 : z_1 = P_1 z_1} f_1(z_1, P_1 y) \ \min_{z_2 : z_2 = P_2 z_2} f_2(z_2, P_2 y) = \min_{z} \ f_1(P_1 z, P_1 y)\,f_2(P_2 z, P_2 y). \tag{2.29}$$

This theorem can also be used for a subdomain $A \subset X$, but with the limitation that $A = P_1 A + P_2 A$.
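To make the two combination rules of this section concrete, the sketch below evaluates the product rule (2.26) and the sum rule (2.27) on a pair of hypothetical patches. It assumes unweighted sample statistics with the $1/(N-1)$ normalization and commonly used 8-bit constants; the function names are illustrative only.

```python
def mean(v):
    return sum(v) / len(v)


def s1(x, y, c1):
    """Luminance component S1 of the simplified SSIM (2.16)."""
    mx, my = mean(x), mean(y)
    return (2.0 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)


def s2(x, y, c2):
    """Contrast-correlation component S2 of the simplified SSIM (2.16)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    var_x = sum((v - mx) ** 2 for v in x) / (n - 1)
    var_y = sum((v - my) ** 2 for v in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return (2.0 * cov + c2) / (var_x + var_y + c2)


def combination_rules(x, y, c1=6.5025, c2=58.5225):
    """Return (f1, f2, 1 - SSIM, F1) for the product and sum rules."""
    f1 = 1.0 - s1(x, y, c1)
    f2 = 1.0 - s2(x, y, c2)
    one_minus_ssim = 1.0 - s1(x, y, c1) * s2(x, y, c2)   # product rule, (2.26)
    sum_rule = f1 + f2                                    # sum rule, (2.27)
    return f1, f2, one_minus_ssim, sum_rule


x = [52.0, 60.0, 71.0, 49.0, 55.0, 64.0]
y = [50.0, 63.0, 69.0, 51.0, 58.0, 62.0]
f1, f2, one_minus_ssim, sum_rule = combination_rules(x, y)

# The two rules differ exactly by the cross term f1 * f2.
print(one_minus_ssim, sum_rule, sum_rule - one_minus_ssim, f1 * f2)
```

Since both $f_1$ and $f_2$ are non-negative, the sum rule never falls below $1 - \text{SSIM}$, and the gap is small whenever either component is small.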

2.2.4 Spatial Aggregation

The simplest way to spatially aggregate local SSIM scores is by averaging, as in (2.9). Other pooling options are possible, such as a power mean or the maximum error [40]. Visual saliency models can also be used to guide the pooling strategy [44]. Finally, multi-resolution schemes such as MS-SSIM can be devised. In order to preserve the quasi-convexity property of the components of SSIM, an interesting option would be to take the maximum value of the scores on local patches:

$$\text{SSIM}_{\max} = \max_{1 \le j \le M} \text{SSIM}_j(x, y). \tag{2.30}$$

This strategy can be justified empirically by the fact that the eye will be attracted to the most salient feature, which could be said to be the one with the greatest dissimilarity. The advantage of this choice is that the aggregated score will be quasi-convex, since it is the maximum of quasi-convex functions.

2.3 Perceptually Optimal Algorithm Design

2.3.1 SSIM-Optimal Equalizer Design

One of the oldest and most widely researched problems in digital image processing is image restoration, a.k.a. equalization [1], which has its origins in the early days of America’s space program. Given the problem’s rich history, there exist several excellent solutions that form a part of most image acquisition systems [18]. In this section, we discuss a relatively recent image restoration solution that explicitly optimizes the SSIM index. A precursor to this solution is the SSIM-optimal linear estimator that addresses the image denoising problem [10].

2.3.2 Equalization Problem

The equalization problem is as follows: Design an equalizer $g[n]$ of length N (for any N) that optimizes the SSIM index between the reference and restored wide sense stationary (WSS) processes $x[n]$ and $\hat{x}[n]$ respectively, given the observed process $y[n]$ that is a blurred and noisy version of $x[n]$. The blurring filter $h[n]$ of length M is assumed to be linear and time-invariant (LTI) and the noise process $\eta[n]$ is assumed to be additive and white. Furthermore, it is assumed that the blurring filter $h[n]$ and the power spectral density (PSD) of $\eta[n]$ are known at the receiver. This system is summarized in Fig. 2.1 and the optimization problem is set up as follows. Given

$$y[n] = h[n] * x[n] + \eta[n], \tag{2.31}$$

design a filter $g[n]$ of length N such that

$$g^*[n] = \arg\max_{g[n] \in \mathbb{R}^N} \mathrm{SSIM}(x[n], \hat{x}[n]), \tag{2.32}$$

where

$$\hat{x}[n] = g[n] * y[n]. \tag{2.33}$$

Fig. 2.1 Block diagram of a general linear time invariant equalizer system. The goal is to design a linear equalizer block G that maximizes the SSIM index between the source process $x[n]$ and the restored process $\hat{x}[n]$. To solve this problem it is assumed that the LTI filter H and the power spectral density of the noise process $\eta[n]$ are known

2.3.3 Solution

For completeness, the standard SSIM index [37] as defined for deterministic signals is reviewed first. However, the equalizer design problem in (2.32) is defined on WSS random processes, thereby rendering a direct application of the deterministic SSIM index infeasible. To circumvent this issue, the definition of the SSIM index is extended to handle WSS processes. The optimization problem is then restated in terms of the extended definition of the SSIM index.

2.3.3.1 Equalization Problem Redefined As noted previously, the definition of the SSIM index in (2.3) needs to be modified to measure similarity between WSS processes. This is accomplished via the definition of the statistical SSIM index (StatSSIM index).

Definition 2. Given two WSS random processes $x[n]$ and $y[n]$ with means $\mu_x$ and $\mu_y$ respectively, the statistical SSIM index is defined as

$$\mathrm{StatSSIM}(x[n], y[n]) = \left( \frac{2E[x[n]]E[y[n]] + C_1}{E[x[n]]^2 + E[y[n]]^2 + C_1} \right) \left( \frac{2E[(x[n] - E[x[n]])(y[n] - E[y[n]])] + C_2}{E[(x[n] - E[x[n]])^2] + E[(y[n] - E[y[n]])^2] + C_2} \right), \tag{2.34}$$

where $E[\cdot]$ is the expectation operator. This is a straightforward extension of the pixel domain definition of the SSIM index by replacing sample means and variances with their statistical equivalents. This could be seen as a plug-in estimator of the expectation of SSIM (2.14). The problem in (2.32) is redefined as

$$g^*[n] = \arg\max_{g[n] \in \mathbb{R}^N} \mathrm{StatSSIM}(x[n], \hat{x}[n]). \tag{2.35}$$

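To solve (2.35), the StatSSIM index is first expressed in terms of the equalizer filter coefficients $g[n]$ using the definition in (2.34):

$$\begin{aligned}
\mathrm{StatSSIM}(x[n], \hat{x}[n]) &= f(g) \\
&= \left( \frac{2\mu_x \mu_{\hat{x}} + C_1}{\mu_x^2 + \mu_{\hat{x}}^2 + C_1} \right) \left( \frac{2E[(x[n] - \mu_x)(\hat{x}[n] - \mu_{\hat{x}})] + C_2}{E[(x[n] - \mu_x)^2] + E[(\hat{x}[n] - \mu_{\hat{x}})^2] + C_2} \right) \\
&= \left( \frac{2\mu_x E\!\left[\sum_{i=0}^{N-1} g[i] y[n-i]\right] + C_1}{\mu_x^2 + \left(E\!\left[\sum_{i=0}^{N-1} g[i] y[n-i]\right]\right)^2 + C_1} \right) \\
&\quad \times \left( \frac{2E\!\left[(x[n] - \mu_x)\left(\sum_{i=0}^{N-1} g[i] y[n-i] - E\!\left[\sum_{i=0}^{N-1} g[i] y[n-i]\right]\right)\right] + C_2}{E[(x[n] - \mu_x)^2] + E\!\left[\left(\sum_{i=0}^{N-1} g[i] y[n-i] - E\!\left[\sum_{i=0}^{N-1} g[i] y[n-i]\right]\right)^2\right] + C_2} \right) \\
&\qquad \text{(which follows from the definition of convolution)} \\
&= \left( \frac{2\mu_x \sum_{i=0}^{N-1} g[i] \mu_y + C_1}{\mu_x^2 + \left(\sum_{i=0}^{N-1} g[i] \mu_y\right)^2 + C_1} \right) \\
&\quad \times \left( \frac{2E\!\left[(x[n] - \mu_x)\left(\sum_{i=0}^{N-1} g[i] y[n-i] - \sum_{i=0}^{N-1} g[i] \mu_y\right)\right] + C_2}{E[(x[n] - \mu_x)^2] + E\!\left[\left(\sum_{i=0}^{N-1} g[i] y[n-i] - \sum_{i=0}^{N-1} g[i] \mu_y\right)^2\right] + C_2} \right) \\
&\qquad \text{(since } x[n] \text{ is WSS and } h[n] \text{ is LTI, } y[n] \text{ is also WSS)} \\
&= \left( \frac{2\mu_x g^T e \mu_y + C_1}{\mu_x^2 + g^T e e^T g \mu_y^2 + C_1} \right) \left( \frac{2E\!\left[(x[n] - \mu_x)\sum_{i=0}^{N-1} g[i](y[n-i] - \mu_y)\right] + C_2}{E[(x[n] - \mu_x)^2] + E\!\left[\left(\sum_{i=0}^{N-1} g[i](y[n-i] - \mu_y)\right)^2\right] + C_2} \right) \\
&= \left( \frac{2\mu_x g^T e \mu_y + C_1}{\mu_x^2 + g^T e e^T g \mu_y^2 + C_1} \right) \left( \frac{2g^T c_{xy} + C_2}{\sigma_x^2 + g^T K_{yy} g + C_2} \right),
\end{aligned} \tag{2.36}$$

where $g = [g[0], g[1], \ldots, g[N-1]]^T$ and $e = [1, 1, \ldots, 1]^T$ are both length N vectors, $\mu_x, \mu_y$ are the means of the source and observed processes respectively, $c_{xy} = E[(x[n] - \mu_x)(\mathbf{y} - e\mu_y)]$ is the cross-covariance between the source ($x[n]$) and the observed process ($\mathbf{y} = (y[n], y[n-1], \ldots, y[n-(N-1)])^T$), $\sigma_x^2$ is the variance of the source process at zero delay, $K_{yy} = E[(\mathbf{y} - e\mu_y)(\mathbf{y} - e\mu_y)^T]$ is the covariance matrix of size $N \times N$ of the observed process $y[n]$, and $C_1, C_2$ are stabilizing constants. From (2.36), it can be seen that the StatSSIM index is the ratio of a second degree polynomial to a fourth degree polynomial in g. It is now demonstrated that problem (2.36) admits a tractable, and in particular, near closed-form solution, with a complexity that is comparable to that of the minimum MSE solution.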
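As an illustration of the final line of (2.36), the following minimal sketch evaluates $f(g)$ from second-order statistics. It assumes the statistics are already available as NumPy arrays; the function name `stat_ssim` and its argument names are hypothetical and chosen only for this example.

```python
import numpy as np

def stat_ssim(g, mu_x, mu_y, var_x, c_xy, K_yy, C1, C2):
    """Evaluate the last line of (2.36) for a filter g (illustrative helper)."""
    g = np.asarray(g, dtype=float)
    s = g.sum()  # g^T e, the sum of the filter taps
    # Mean term: depends on g only through the tap sum.
    mean_term = (2.0 * mu_x * mu_y * s + C1) / (mu_x**2 + (mu_y * s)**2 + C1)
    # Structure term: ratio of a linear to a quadratic form in g.
    struct_term = (2.0 * g @ c_xy + C2) / (var_x + g @ K_yy @ g + C2)
    return mean_term * struct_term
```

For instance, with a single tap $g = [1]$ and an observation that is statistically identical to the source, both factors equal one, so the index is one.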



2.3.3.2 StatSSIM-Optimal Linear Equalization The StatSSIM index is a non-convex function of g and local optimality conditions such as Karush-Kuhn-Tucker (KKT) cannot guarantee global optimality. The non-convex nature of the StatSSIM index is demonstrated in Fig. 2.2 and contrasted with the convex nature of MSE. In particular, any approach based on descent-type algorithms is likely to get stuck in local optima. To address this issue, the problem is transformed from its non-convex form into a quasi-convex formulation. Convex optimization problems are efficiently solvable using widely available optimization techniques and software [2, 4]. Moreover, we show below that a near-closed form solution can be achieved. In particular, the N-tap filter optimization is transformed into an optimization problem over only two variables for any N. Exploiting convexity properties, one can quickly search over one parameter by means of a bisection technique, thus reducing the problem to a univariate optimization problem. This last step can be efficiently performed using an analytic solution of a simplified problem.

Fig. 2.2 (a) MSE and (b) StatSSIM indices as a function of equalizer taps g[0], g[1]

2.3.3.3 Problem Reformulation The problem is reformulated by noting that the first term of (2.36) (corresponding to the mean) is a function of only the sum of the filter coefficients $g^T e$. A typical constraint in filter design problems is that the filter coefficients add up to unity. The optimization problem in (2.36) is simplified by constraining $g^T e = \alpha$ and takes the form

$$g^*(\alpha) = \arg\max_{g \in \mathbb{R}^N} \left[ \frac{2g^T c_{xy} + C_2}{\sigma_x^2 + g^T K_{yy} g + C_2} \right] \quad \text{subject to: } g^T e = \alpha. \tag{2.37}$$

The solution is now a function of $\alpha$. The problem now changes to finding the highest StatSSIM index by searching over a range of $\alpha$ (typically in the interval $[1 - \delta, 1 + \delta]$, for a small $\delta$). The solution of this problem is discussed in the next section.


2.3.3.4 Quasi-Convex Optimization The optimization problem in (2.37) is still nonconvex. It is first converted into a quasi-convex form as follows,

$$\begin{aligned}
& g^*(\alpha) = \arg\max_{g \in \mathbb{R}^N} \left[ \frac{2g^T c_{xy} + C_2}{\sigma_x^2 + g^T K_{yy} g + C_2} \right] \quad \text{subject to: } g^T e = \alpha \\
\iff\ & \min: \gamma \quad \text{subject to: } \left[ \max_{g}: \frac{2g^T c_{xy} + C_2}{\sigma_x^2 + g^T K_{yy} g + C_2} \le \gamma, \ \text{subject to: } g^T e = \alpha \right] \\
\iff\ & \min: \gamma \quad \text{subject to: } \left[ \min_{g}: \left[ \gamma(\sigma_x^2 + g^T K_{yy} g + C_2) - (2g^T c_{xy} + C_2) \right] \ge 0, \ \text{subject to: } g^T e = \alpha \right].
\end{aligned} \tag{2.38}$$

The first step involves the introduction of the auxiliary variable $\gamma$ as an upper bound on (2.37). The first equivalence relation is true since minimizing $\gamma$ is the same as finding the least upper bound of the function in (2.37). This is equal to the maximum value of the function, which exists, as seen by straightforward continuity arguments. The second equivalence relation holds since the denominator in (2.37) is strictly positive, allowing for the rearrangement of terms. $\gamma$ then becomes a true upper bound if the problem,

$$\min_{g \in \mathbb{R}^N}: \ \gamma(\sigma_x^2 + g^T K_{yy} g + C_2) - (2g^T c_{xy} + C_2) \quad \text{subject to: } g^T e = \alpha, \tag{2.39}$$

has a non-negative optimal value. For $\gamma > 0$ the objective function is a convex quadratic minus a linear term and is therefore convex. The constraint is affine, and thus convex. Therefore, the overall problem is convex, and can be solved by introducing a Lagrange multiplier $\lambda$ and applying the first order sufficiency conditions,

$$\nabla_g \left\{ \gamma(\sigma_x^2 + g^T K_{yy} g + C_2) - (2g^T c_{xy} + C_2) + \lambda(g^T e - \alpha) \right\} = 0. \tag{2.40}$$

The solutions for g and $\lambda$, denoted by $g(\alpha), \lambda(\alpha)$ to emphasize their dependence on $\alpha$, are given by

$$g(\alpha) = \frac{1}{2\gamma} K_{yy}^{-1} \left( 2c_{xy} - \lambda(\alpha) e \right), \qquad \lambda(\alpha) = \frac{1}{e^T K_{yy}^{-1} e} \left( 2c_{xy}^T K_{yy}^{-1} e - 2\gamma\alpha \right). \tag{2.41}$$

The optimal $\gamma$ can then be computed in $O(\log(1/\epsilon))$ iterations using a standard bisection procedure. The algorithm is summarized in Fig. 2.3. The tolerance specified by $\epsilon$ determines the tightness of the bound $\gamma$.

Fig. 2.3 An algorithm to search for the optimal $\gamma$
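The bisection over $\gamma$ can be sketched as follows. This is only an illustrative reading of the procedure, not the algorithm of Fig. 2.3 itself: the function names, the initial bracket `(lo, hi] = (1e-8, 1]` and the tolerance `eps` are assumptions made for the example. The bracket relies on the structure term being at most one, which holds when $\sigma_x^2$, $c_{xy}$ and $K_{yy}$ come from a valid joint covariance.

```python
import numpy as np

def solve_g(gamma, c_xy, K_yy, alpha):
    """Closed-form minimizer of (2.39) for a fixed gamma, following (2.41)."""
    e = np.ones_like(c_xy)
    Kinv_c = np.linalg.solve(K_yy, c_xy)
    Kinv_e = np.linalg.solve(K_yy, e)
    lam = (2.0 * (c_xy @ Kinv_e) - 2.0 * gamma * alpha) / (e @ Kinv_e)
    return (2.0 * Kinv_c - lam * Kinv_e) / (2.0 * gamma)

def structure_term(g, c_xy, K_yy, var_x, C2):
    """Objective of (2.37)."""
    return (2.0 * g @ c_xy + C2) / (var_x + g @ K_yy @ g + C2)

def bisect_gamma(c_xy, K_yy, var_x, C2, alpha, lo=1e-8, hi=1.0, eps=1e-10):
    """Find the smallest gamma for which (2.39) has a non-negative optimum."""
    while hi - lo > eps:
        gamma = 0.5 * (lo + hi)
        g = solve_g(gamma, c_xy, K_yy, alpha)
        slack = gamma * (var_x + g @ K_yy @ g + C2) - (2.0 * g @ c_xy + C2)
        if slack >= 0.0:
            hi = gamma   # gamma is a valid upper bound: tighten from above
        else:
            lo = gamma   # gamma is too small to bound the ratio
    return hi, solve_g(hi, c_xy, K_yy, alpha)
```

At convergence the returned filter satisfies $g^T e = \alpha$ and attains a structure term equal to the returned $\gamma$ up to the tolerance.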


2.3.3.5 Search for $\alpha$ It should be noted that the solution in (2.41) is still a function of $\alpha$. The overall solution to (2.36) is found by searching over $\alpha$. One method is to simply initialize $\alpha$ to be the sum of the filter coefficients of the MSE-optimal filter, i.e., $\alpha_{init} = g_{mse}^T e$. Another heuristic method is to initialize $\alpha$ to be the sum of the filter coefficients of a structure-optimal filter. By structure-optimal filter is meant a filter that optimizes only the structure term in the StatSSIM index without any constraints on the mean. In other words, the goal is to find a filter $g_{struct}$ such that

$$\begin{aligned}
g^*_{struct} &= \arg\max_{g \in \mathbb{R}^N} \mathrm{Structure}(x[n], \hat{x}[n]) \\
&= \arg\max_{g \in \mathbb{R}^N} \frac{2E[(x[n] - \mu_x)(\hat{x}[n] - \mu_{\hat{x}})] + C_2}{E[(x[n] - \mu_x)^2] + E[(\hat{x}[n] - \mu_{\hat{x}})^2] + C_2} \\
&= \arg\max_{g \in \mathbb{R}^N} \frac{2g^T c_{xy} + C_2}{\sigma_x^2 + g^T K_{yy} g + C_2}.
\end{aligned} \tag{2.42}$$

This problem is very similar to (2.37) and can be solved using the technique described above. The solution is

$$g_{struct} = \frac{1}{\gamma_{struct}} (K_{yy})^{-1} c_{xy}, \tag{2.43}$$

and so the initial value of $\alpha$ is

$$\alpha_{init} = e^T g_{struct}. \tag{2.44}$$

The value of $\gamma_{struct}$ is computed using the algorithm described in Sect. 2.3.3.4.

2.3.3.6 Application to Image Denoising and Restoration The StatSSIM-optimal equalizer can be applied to restore images as outlined in the following steps.
• At each pixel in the distorted image, estimate the values of $r_{xy}, c_{xy}, R_{yy}, K_{yy}$ from a neighborhood of size $L \times L$. The value of L is chosen so as to compute stable correlation values.
• The minimum MSE solution is computed as $g_{mse} = R_{yy}^{-1} r_{xy}$ (after removing the mean from the blocks).
• The StatSSIM-optimal solution is computed as follows,
  – $\alpha$ is initialized using (2.43).
  – In a small range around $\alpha$ (chosen above), compute the solution in (2.41) and choose the one with the maximum StatSSIM index.

As before, the blurring filter and the power spectral density of the additive noise component are assumed to be known at the receiver. The procedure used to estimate the correlation and covariance values for denoising and restoration can be found in [30]. To find the estimates, the neighborhood of size $L \times L$ is unwrapped into a vector of size $L^2 \times 1$. The results of the image restoration procedure are illustrated in Figs. 2.4, 2.5, 2.6, 2.7 and 2.8. The effectiveness of the StatSSIM-optimal solution is very clear from these illustrations.

2.3.4 SSIM-Optimal Soft-Thresholding

While the previous section explored the optimization of a linear system with respect to the SSIM index, this section explores SSIM-optimal soft-thresholding, a non-linear image denoising solution. A soft-thresholding operator with threshold $\lambda$ is defined as

$$g(y) = \mathrm{sgn}(y)(|y| - \lambda)_+, \tag{2.45}$$

where $\mathrm{sgn}(y)$ is the signum function and $(\cdot)_+$ is the rectifier function. Soft-thresholding for signal denoising was first proposed by Donoho [14–16] and has been extended to image denoising most notably by Chang et al. [9]. These are risk/cost minimizing solutions where the risk or cost function is the mean squared error between the estimated source and the noise-free source. In other words, these solutions find the $\lambda$ that is MSE-optimal. While these solutions were proposed over two decades ago, they continue to be relevant for image denoising. The soft-thresholding problem has been solved for SSIM optimality [11] and is discussed next.
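The operator in (2.45) translates directly into a few lines of NumPy. The sketch below is only an illustration; the function name `soft_threshold` is chosen for this example.

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft-thresholding operator of (2.45): sgn(y) * max(|y| - lam, 0)."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
```

Coefficients with magnitude below the threshold are set to zero, while the remaining ones are shrunk toward zero by the threshold value.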

Fig. 2.4 Denoising example 1: Img0039.bmp from the ‘City of Austin’ database. (a) Reference: Original image. (b) Distorted image with noise σ = 35, MSE = 1226.3729, SSIM index = 0.5511. (c) MSE-optimal filter (length 7): Image denoised with a 7-tap MSE-optimal filter, MSE = 436.6929, SSIM index = 0.6225. (d) SSIM-optimal filter (length 7): Image denoised with a 7-tap SSIM-optimal filter, MSE = 528.0777, SSIM index = 0.6444

2.3.4.1 SSIM Index in the Wavelet Domain The soft-thresholding problem is defined and solved in the wavelet domain while the SSIM index is defined in the space domain. To bridge this gap, the SSIM index is first expressed in the wavelet domain. This is achieved by expressing the space domain mean, variance, and cross-covariance terms in terms of wavelet coefficients. Of the several classes of wavelet transforms, only orthonormal wavelets are energy preserving. This property allows for the space domain variance and covariance terms to be expressed in terms of the wavelet coefficients in a straightforward manner. The approximation subband (low-low (LL) subband) of the wavelet decomposition contains all the information required to calculate the mean of the space domain signal. A scaling factor k is applied to the mean of the LL subband to find the mean. Let x denote an image patch of size $N \times N$ and X denote the L level wavelet transform of the patch (also of size $N \times N$). Then the mean of x is given by

$$\mu_x = k^L \mu_{X,LL}, \tag{2.46}$$

where k is the known scaling factor, and $\mu_{X,LL}$ is the mean of the LL subband of X. The fact that orthonormal wavelets obey Parseval’s Theorem is used to calculate the variance and covariance terms in the SSIM index. Let x, y represent image patches of size $N \times N$ and X, Y be their respective orthonormal wavelet transforms. From Parseval’s Theorem, it follows that

$$\sigma_x^2 = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{i,j}^2 - (k^L \mu_{X,LL})^2, \tag{2.47}$$

$$\sigma_{xy} = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{i,j} Y_{i,j} - (k^L \mu_{X,LL})(k^L \mu_{Y,LL}). \tag{2.48}$$

The SSIM index can now be written in terms of the wavelet coefficients as

Fig. 2.5 Denoising example 2: (a) Reference: Original image. (b) Distorted image with noise σ = 40, MSE = 1639.3132, SSIM index = 0.541485. (c) MSE-optimal filter (length 7): Image denoised with a 7-tap MSE-optimal filter, MSE = 383.3375, SSIM index = 0.734963. (d) SSIM-optimal filter (length 7): Image denoised with a 7-tap SSIM-optimal filter, MSE = 455.2577, SSIM index = 0.753917

$$\mathrm{SSIM}(x, y) = \left( \frac{2(k^L \mu_{X,LL})(k^L \mu_{Y,LL}) + C_1}{(k^L \mu_{X,LL})^2 + (k^L \mu_{Y,LL})^2 + C_1} \right) \left( \frac{2\left[ \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{i,j} Y_{i,j} - (k^L \mu_{X,LL})(k^L \mu_{Y,LL}) \right] + C_2}{\frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( X_{i,j}^2 + Y_{i,j}^2 \right) - k^{2L} \left( \mu_{X,LL}^2 + \mu_{Y,LL}^2 \right) + C_2} \right). \tag{2.49}$$

Fig. 2.6 Restoration example 1: Image Img0073.bmp of the ‘City of Austin’ database. (a) Reference: Original image. (b) Distorted image with blur σ = 15, noise σ = 40, MSE = 2264.4425, SSIM index = 0.3250. (c) MSE-optimal filter: Image restored with an 11-tap MSE-optimal filter, MSE = 955.6455, SSIM index = 0.3728. (d) StatSSIM-optimal filter: Image restored with an 11-tap SSIM-optimal filter, MSE = 1035.0551, SSIM index = 0.4215
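The wavelet-domain expressions for the mean, variance and covariance can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a single level (L = 1) of a hand-written orthonormal 2-D Haar transform, for which the scaling factor works out to k = 1/2 because each LL coefficient is twice the local 2 × 2 mean. The helper names are hypothetical, and other orthonormal wavelets would need their own k.

```python
import numpy as np

def haar2d_level(x):
    """One level of the orthonormal 2-D Haar transform (even-sized input).

    Returns the LL subband and the full coefficient array [[LL, LH], [HL, HH]].
    """
    s = np.sqrt(2.0)
    lo = (x[0::2, :] + x[1::2, :]) / s   # vertical low-pass
    hi = (x[0::2, :] - x[1::2, :]) / s   # vertical high-pass
    ll = (lo[:, 0::2] + lo[:, 1::2]) / s
    lh = (lo[:, 0::2] - lo[:, 1::2]) / s
    hl = (hi[:, 0::2] + hi[:, 1::2]) / s
    hh = (hi[:, 0::2] - hi[:, 1::2]) / s
    return ll, np.block([[ll, lh], [hl, hh]])

def wavelet_domain_stats(x, y, k=0.5):
    """Mean, variance and covariance from Haar coefficients, as in (2.46)-(2.48) with L = 1."""
    ll_x, X = haar2d_level(x)
    ll_y, Y = haar2d_level(y)
    mu_x = k * ll_x.mean()
    mu_y = k * ll_y.mean()
    var_x = np.mean(X**2) - mu_x**2          # (2.47)
    cov_xy = np.mean(X * Y) - mu_x * mu_y    # (2.48)
    return mu_x, var_x, cov_xy
```

The returned values agree with the pixel-domain mean, the population variance (normalized by $N^2$) and the population covariance of the patches.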

Fig. 2.7 Restoration example 3: A 128 × 128 block of Barbara image. (a) Reference: Original image. (b) Distorted image with blur σ = 1, noise σ = 40, MSE = 1781.9058, SSIM index = 0.5044. (c) MSE-optimal filter: Image restored with an 11-tap MSE-optimal filter, MSE = 520.1322, SSIM index = 0.6302. (d) StatSSIM-optimal filter: Image restored with an 11-tap SSIM-optimal filter, MSE = 584.9232, SSIM index = 0.6568

2.3.4.2 Problem Formulation The soft-thresholding problem is now formulated in terms of the wavelet domain SSIM index defined in (2.49). Let x denote a pristine image patch of size $N \times N$, n be zero mean Gaussian noise, and $y = x + n$ be the noisy observation of x. Let X, Y represent an L level orthonormal wavelet transform of x, y respectively. Since an L level orthogonal transform consists of 3L detail subbands, it leads to the design of 3L thresholds (one per subband). Let the threshold vector be denoted by $\Lambda = [\lambda_1, \lambda_2, \ldots, \lambda_{3L}]$ where each element corresponds to the threshold of one subband. It should be noted that the approximation band is not thresholded. Let $\hat{X}$ be the soft thresholded output, and let $\hat{x}$ denote the space domain version of $\hat{X}$. As with the StatSSIM-optimal equalizer design, it is assumed that the noise variance is known to the receiver. The receiver has only the observation y and therefore a direct evaluation of the SSIM index between x and $\hat{x}$ is not possible. In order to estimate the SSIM index, a Gaussian source model is applied to the pristine wavelet coefficients. Since the noise is assumed to be zero mean, the mean of the pristine image patch and the thresholded estimate are identical (since the approximation subband is not thresholded). This results in the mean term of the SSIM index becoming unity. Further, since the noise is additive, the source variance can be estimated to be the difference between the variance of the observation $\sigma_y^2$ and the noise variance $\sigma_n^2$ as

$$\sigma_x^2 \approx \sigma_y^2 - \sigma_n^2 \tag{2.50}$$

$$= \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} Y_{i,j}^2 - (k^L \mu_{Y,LL})^2 - \sigma_n^2. \tag{2.51}$$

It should be noted that since the noise is additive, $\frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{i,j} \hat{X}_{i,j} \approx (\sigma_Y^2 - \sigma_n^2 + \mu_y^2)$. The SSIM index is now expressed as

$$\mathrm{SSIM}(x, \hat{x}) = \left( \frac{2\left[ \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{i,j} \hat{X}_{i,j} - (k^L \mu_{Y,LL})^2 \right] + C_2}{\frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( Y_{i,j}^2 + \hat{X}_{i,j}^2 \right) - 2(k^L \mu_{Y,LL})^2 - \sigma_n^2 + C_2} \right). \tag{2.52}$$

The SSIM-optimal soft-thresholding problem is formulated as:

$$\Lambda^* = \arg\max_{\Lambda \in \mathbb{R}_+^{3L}} \mathrm{SSIM}(x, \hat{x}). \tag{2.53}$$

Fig. 2.8 Restoration example 4: A 512 × 512 block of Mandrill image. (a) Reference: Original image. (b) Distorted image with blur σ = 5, noise σ = 50, MSE = 3065.31, SSIM index = 0.1955. (c) MSE-optimal filter: Image restored with a 7-tap MSE-optimal filter, MSE = 863.85, SSIM index = 0.3356. (d) StatSSIM-optimal filter: Image restored with a 7-tap SSIM-optimal filter, MSE = 908.47, SSIM index = 0.3446

2.3.4.3 Solution The objective function is nonlinear in $\Lambda$ and maps a 3L-dimensional vector to a scalar. The optimization is constrained by the requirement for $\Lambda$ to be non-negative. The quasi-Newton optimization method provides a good tradeoff between complexity and performance in finding local optima. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [3] is one such method that was employed here to find local optima. As a consequence, this solution is locally optimal and no guarantees on global optimality can be made. The application of this solution to the image data is presented in the following steps.
• Divide the noisy image into non-overlapping blocks of size 32 × 32.
• Apply an L level orthonormal wavelet transform to each block.
• For each wavelet transformed block, compute its subband statistics.
• Find a locally optimal $\Lambda^*$ using the BFGS algorithm. The search can be initialized to the MSE-optimal soft thresholding solution from Chang et al. [9].
• Use $\Lambda^*$ to soft-threshold the corresponding wavelet subband coefficients.
• Transform the denoised coefficients back to the pixel domain by applying the inverse wavelet transform on a block by block basis.

To perform a qualitative and quantitative comparison, the locally SSIM-optimal solution is compared with the MSE-optimal solution by Chang et al. [9]. The denoising results are presented in Fig. 2.9. It can be observed that the SSIM-based solution retains more image detail and has a better perceptual quality compared to the MSE-optimal solution. This method provides conclusive evidence that optimization for perceptual quality is a worthwhile endeavor indeed.

2.4 Local SSIM-Optimal Approximation

We now study whether a perceptual criterion such as SSIM can be used for best-basis approximation. This section follows the treatment in [5]. The Minimal Mean Square Error approximation problem is

$$\hat{x} = \arg\min_{z \in A} \|x - z\|_2^2, \tag{2.54}$$

where x is the image patch to be approximated, $\hat{x}$ is the best approximation and A is a constraint that represents some prior knowledge on x. A very successful and popular image prior is based on sparse representation. Sparsity is the image model that assumes that any image patch can be represented by a linear combination of a few atomic elements. If the matrix $\Psi$ is a dictionary of basis elements that is multiplied by coefficients c, then a sparsity constraint is $\|c\|_0 \le K$, which means that there are K or fewer nonzero elements of c. We enumerate the columns of $\Psi$ as $(\psi_1, \psi_2, \ldots, \psi_P)$. The set A is then the span of $\Psi$:

$$A = \mathrm{span}\{\psi_1, \psi_2, \ldots, \psi_P\}. \tag{2.55}$$

The optimization of the mean square error can be interpreted in two different ways: either (1) a perfect approximation is assumed, except for a stochastic additive white Gaussian noise part, or (2) an imperfect approximation is assumed, but with the mean squared error as a model of dissimilarity. As it is empirically observed that images are not exactly sparse, but more accurately compressible, that is, almost sparse, the first assumption cannot be valid. Moreover, as the mean squared error is a poor model of perception, the second assumption should also be dismissed. The best approximation problem should instead be posed in terms of a perceptual measure. We solve this problem below for the SSIM index case. But before embarking on SSIM-based approximation, it will be useful to recall L2-based approximation results. As we will see later on, the two problems share striking characteristics.

Fig. 2.9 (a) Original: Undistorted Mandrill image. (b) Noisy image with σ_n = 50, MSE = 2492.49, SSIM index = 0.2766. (c) MSE-optimal soft thresholding, MSE = 509.16, SSIM index = 0.4835. (d) SSIM-based soft thresholding, MSE = 577.54, SSIM index = 0.4954

2.4.1 L2-Based Approximation

The solution of the L2-based sparse approximation problem can be divided into three cases: orthogonal basis, linear redundant basis and non-linear approximation.

2.4.1.1 Orthogonal Basis In the orthogonal case, the approximation spaces A in (2.54) will be the span of subsets of the set of basis functions $\{\psi_k\}_{k=1}^N$. The L2-based expansion of x in this basis is, of course,

$$x = \sum_{k=1}^{N} a_k \psi_k, \qquad a_k = \psi_k^T x, \quad 1 \le k \le N. \tag{2.56}$$

The expansions of the approximation $\hat{x}$ will be denoted as

$$\hat{x} = \sum_{k=1}^{N} c_k \psi_k, \tag{2.57}$$

where $(c_1, c_2, \ldots, c_N)$ are unknown coefficients. At this point, we do not exactly specify which $\psi_k$ basis functions will be used but consider all possible subsets of M < N basis functions:

$$A = \mathrm{span}\{\psi_{\gamma(1)}, \psi_{\gamma(2)}, \ldots, \psi_{\gamma(M)}\}, \tag{2.58}$$

where $\gamma(i) \in \{1, 2, \ldots, N\}$ and $c_{\gamma(M+1)} = \cdots = c_{\gamma(N)} = 0$. The well-known L2-based optimal approximation is summarized in the following theorem:

Theorem 4.1. For a given $x \in \mathbb{R}^N$, the M coefficients $c_k$ of the optimal L2-based approximation $y \in A$ to x are given by the M Fourier coefficients $a_k = \langle x, \psi_k \rangle$ of greatest magnitude, i.e.

$$c_{\gamma(k)} = \begin{cases} a_{\gamma(k)} = \psi_{\gamma(k)}^T x, & 1 \le k \le M, \\ 0, & M + 1 \le k \le N, \end{cases} \tag{2.59}$$

where $|a_{\gamma(1)}| \ge |a_{\gamma(2)}| \ge \ldots \ge |a_{\gamma(M)}| \ge |a_l|$ with $l \in \{1, 2, \ldots, N\} \setminus \{\gamma(1), \ldots, \gamma(M)\}$.
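Theorem 4.1 can be illustrated with a short sketch that keeps the M largest-magnitude coefficients in an orthonormal basis. The function name `best_m_term` is hypothetical, and the columns of `Psi` are assumed to be orthonormal.

```python
import numpy as np

def best_m_term(Psi, x, M):
    """Best M-term L2 approximation of x in an orthonormal basis (columns of Psi)."""
    a = Psi.T @ x                           # Fourier coefficients a_k = psi_k^T x
    keep = np.argsort(-np.abs(a))[:M]       # indices of the M largest |a_k|
    c = np.zeros_like(a)
    c[keep] = a[keep]                       # c_k = a_k on the kept set, 0 elsewhere
    return c, Psi @ c
```

By orthonormality, the squared error of this approximation equals the sum of the squared discarded coefficients, which is why discarding the smallest ones is optimal.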

2.4.1.2 Linear Redundant Basis A redundant or over-complete basis of $\mathbb{R}^N$ consists of $P \ge N$ column vectors $\Psi = \{\psi_k\}_{k=1}^P$ such that N of them are linearly independent. Given x in $\mathbb{R}^N$, we search for an approximation $\hat{x}$ of x with the help of the first M < N linearly independent vectors of a redundant basis:

$$\hat{x} = \sum_{k=1}^{M} c_k \psi_k. \tag{2.60}$$

We seek the coefficients $c = [c_1, c_2, \ldots, c_M]$ that will minimize the L2-error:

$$\|x - \hat{x}\|_2^2 = \left( x - \sum_{j=1}^{M} c_j \psi_j \right)^T \left( x - \sum_{k=1}^{M} c_k \psi_k \right). \tag{2.61}$$

The solution is the well-known normal equation,

$$\sum_{j=1}^{M} c_j \psi_j^T \psi_k = \psi_k^T x, \quad 1 \le k \le M. \tag{2.62}$$

Thus, to find the optimal coefficients c, we need to solve an $M \times M$ linear system of equations $\Phi c = a$ with $\Phi_{j,k} := \psi_j^T \psi_k$ and $a = [a_1, a_2, \ldots, a_M]$ where $a_k = \psi_k^T x$. In practice, c is found by multiplying the pseudo-inverse $\Psi^+$ of the dictionary matrix $\Psi$ with x:

$$c = (\Psi^T \Psi)^{-1} \Psi^T x \tag{2.63}$$

$$= \Psi^+ x. \tag{2.64}$$

This pseudo-inverse can be computed from the singular value decomposition $U \Sigma V^T$ of $\Psi$:

$$\Psi^+ = V \Sigma^+ U^T, \tag{2.65}$$

where $\Sigma^+$ is the diagonal matrix whose positive elements are the reciprocal of the non-zero elements of $\Sigma$.

2.4.1.3 Non-Linear Approximation The problem can be written as

$$\arg\min \|x - \hat{x}\|_2^2 \quad \text{subject to } \|c\|_0 \le K, \tag{2.66}$$

where x is the signal to be approximated,

$$\hat{x} := \Psi c = \sum_{k=1}^{P} c_k \psi_k \tag{2.67}$$

is the approximation and

$$\|c\|_0 := \sum_{k=1}^{P} c_k^0 = \#\{k : |c_k| > 0\} \tag{2.68}$$

is the 0-pseudonorm. We follow the convention that $0^0 = 0$. Thus for a non-linear approximation, we choose the M best vectors from the $P \ge N$ vectors $\Psi = \{\psi_k\}_{k=1}^P$. We solve a linear system in a manner similar to the linear case (see (2.62)) in order to determine the coefficients c. There are $\frac{P!}{(P-M)!\,M!}$ possibilities, which grow exponentially with P. In fact, it has been shown that finding the sparse approximation that minimizes $\|x - \hat{x}\|_2^2$ is an NP-hard problem [13]. Two approaches have been adopted to avoid the NP-hard problem: Matching Pursuit (MP) and Basis Pursuit.

Matching Pursuit The MP algorithm from Mallat and Zhang [20] greedily adds vectors one at a time until an M-vector approximation is found. For the first vector, we minimize

$$\|x - c_{\gamma(1)} \psi_{\gamma(1)}\|_2^2 = \|x\|_2^2 - 2c_{\gamma(1)} a_{\gamma(1)} + c_{\gamma(1)}^2. \tag{2.69}$$

By taking partial derivatives and setting them to zero, we find that the solution is exactly the same as the orthogonal case. We choose the index that maximizes $|a_m| = |\psi_m^T x|$. For the K-th vector, the L2-error is

$$\left\| x - \sum_{k=1}^{K} c_{\gamma(k)} \psi_{\gamma(k)} \right\|_2^2 = \left\| x - \sum_{k=1}^{K-1} c_{\gamma(k)} \psi_{\gamma(k)} \right\|_2^2 + c_{\gamma(K)}^2 \psi_{\gamma(K)}^T \psi_{\gamma(K)} - 2c_{\gamma(K)} \psi_{\gamma(K)}^T \left( x - \sum_{k=1}^{K-1} c_{\gamma(k)} \psi_{\gamma(k)} \right). \tag{2.70}$$

The error will be minimized when $\left| \psi_m^T \left( x - \sum_{k=1}^{K-1} c_{\gamma(k)} \psi_{\gamma(k)} \right) \right|$ is maximized. Note that in the orthogonal basis case, the matching pursuit algorithm coincides with the optimal algorithm which chooses the M largest basis coefficients.

Orthogonal Matching Pursuit The Orthogonal Matching Pursuit (OMP) [26] combines the MP with a Gram-Schmidt procedure in order to obtain an orthogonal basis. Given a linearly independent basis $\{\psi_1, \psi_2, \ldots, \psi_N\}$ of $\mathbb{R}^N$, the Gram-Schmidt procedure successively projects the basis to an orthogonal subspace

$$G_1 = \psi_1 \tag{2.71}$$

$$G_2 = \psi_2 - \frac{G_1^T \psi_2}{G_1^T G_1} G_1 \tag{2.72}$$

$$\vdots$$

$$G_N = \psi_N - \sum_{j=1}^{N-1} \frac{G_j^T \psi_N}{G_j^T G_j} G_j. \tag{2.73}$$

The basis is then normalized with $g_k = G_k / \|G_k\|_2$. The OMP algorithm thus alternates between finding the best matching vector and the orthonormalization process. The trade-off is a convergence with a finite number of iterations against an extra computational cost for the orthonormalization.

For the numerical implementation, the naive approach of computing an orthonormal basis is well known to be unstable. A more reliable way to perform the OMP is with the following algorithm [33]:

1. Initialize: $I = \emptyset$, $r := x$ and $a := 0$.
2. While $\|r\|_2 > T$, do
3. $k := \arg\max_k |\psi_k^T r|$;
4. Add k to the set of indices I;
5. $a_I := \Psi_I^+ x$;
6. $r := x - \Psi_I a_I$;
7. end while.

Here, $a_I$ and $\Psi_I$ represent the restriction of, respectively, a and $\Psi$ to the elements or columns of indices I. Note that since the pseudo-inverse of $\Psi$ has to be computed for incrementally larger matrices, there are ways to make computations more efficient with the help of Cholesky factorization (see [33]). It is not immediately clear whether this algorithm really performs the OMP. To see this, note that since $g = [g_1, g_2, \ldots, g_K]$ is an orthonormal basis, the change of basis,

$$\sum_{j=1}^{K} a_j \psi_j = \sum_{j=1}^{K} g_j^T x\, g_j, \tag{2.74}$$

is performed via

$$a = (\Psi^T \Psi)^{-1} \Psi^T g g^T x \tag{2.75}$$

$$= \Psi^+ x. \tag{2.76}$$

Thus, the residual r can be computed from the original basis $\Psi$ and the orthonormalization is hidden in the computation of the coefficients a.
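The seven-step OMP loop above can be sketched in NumPy as follows. This is a minimal illustration rather than the efficient Cholesky-based implementation of [33]: the pseudo-inverse step is carried out with `np.linalg.lstsq`, and two safeguards that are not part of the listed steps are added as assumptions (a cap `max_atoms` on the number of selected vectors, and exclusion of already selected indices).

```python
import numpy as np

def omp(Psi, x, tol=1e-8, max_atoms=None):
    """Orthogonal Matching Pursuit on dictionary Psi (columns are atoms)."""
    n, p = Psi.shape
    max_atoms = min(n, p) if max_atoms is None else max_atoms
    idx = []                                  # step 1: I is empty
    coef = np.zeros(0)
    r = np.asarray(x, dtype=float).copy()     # step 1: r := x
    while np.linalg.norm(r) > tol and len(idx) < max_atoms:   # step 2
        scores = np.abs(Psi.T @ r)            # step 3: correlations with residual
        if idx:
            scores[idx] = 0.0                 # safeguard: never reselect an atom
        idx.append(int(np.argmax(scores)))    # step 4
        # step 5: a_I := pinv(Psi_I) x, computed as a least-squares solve
        coef, *_ = np.linalg.lstsq(Psi[:, idx], x, rcond=None)
        r = x - Psi[:, idx] @ coef            # step 6
    a = np.zeros(p)                           # step 1: a := 0
    if idx:
        a[idx] = coef
    return a, idx
```

Because the coefficients are refit on all selected atoms at each pass, the residual stays orthogonal to every selected column, which is the property that distinguishes OMP from plain MP.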

Basis Pursuit In Basis Pursuit, the non-linear approximation problem is replaced by an L1-regularization problem:

$$\arg\min \|x - \hat{x}\|_2^2 \quad \text{subject to } \|c\|_1 \le T, \tag{2.77}$$

for some constant T. The choice of the L1-norm makes the problem convex. Details on techniques to solve this problem are given in [12].

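2.4.2 SSIM-Based Approximation

The problem now can be stated as follows: Given x and $\Psi$, find the $z = \Psi c$ that maximizes $S := \mathrm{SSIM}(x, z)$.

This problem was first solved for the orthogonal case in [6] before being generalized to the redundant case in [32] and [31].

2.4.2.1 Linear Approximation In a linear approximation, the choice of dictionary vectors $\Psi = (\psi_1, \psi_2, \ldots, \psi_M)$ is already fixed and we need only to find the coefficients $c = (c_1, c_2, \ldots, c_M)$ that maximize the SSIM. To do that, we search for the stationary points of the partial derivatives of SSIM with respect to $c_k$. First, we write the mean, the variance and the covariance of z in terms of c, where $\bar{\psi}_k$ denotes the mean of $\psi_k$:

$$\mu_z = \sum_{k=1}^{M} c_k \bar{\psi}_k, \tag{2.78}$$

$$(N-1)\sigma_z^2 = \sum_{j=1}^{M} \sum_{k=1}^{M} c_j c_k \psi_j^T \psi_k - N\mu_z^2, \qquad (N-1)\sigma_{x,z} = \sum_{k=1}^{M} c_k \psi_k^T x - N\mu_x \mu_z. \tag{2.79}$$

Next, we find the partial derivatives:

$$\frac{\partial \mu_z}{\partial c_k} = \bar{\psi}_k; \tag{2.80}$$

$$(N-1)\frac{\partial \sigma_z^2}{\partial c_k} = 2\sum_{j=1}^{M} c_j \psi_j^T \psi_k - 2N\mu_z \bar{\psi}_k; \tag{2.81}$$

$$(N-1)\frac{\partial \sigma_{x,z}}{\partial c_k} = \psi_k^T x - N\mu_x \bar{\psi}_k. \tag{2.82}$$

The logarithm of SSIM can be written as

$$\log S = \log(2\mu_x \mu_z + C_1) - \log(\mu_x^2 + \mu_z^2 + C_1) + \log(2\sigma_{x,z} + C_2) - \log(\sigma_x^2 + \sigma_z^2 + C_2). \tag{2.83}$$

So for all $1 \le k \le M$,

$$\frac{1}{S}\frac{\partial S}{\partial c_k} = \frac{2\mu_x \bar{\psi}_k}{2\mu_x \mu_z + C_1} - \frac{2\mu_z \bar{\psi}_k}{\mu_x^2 + \mu_z^2 + C_1} + \frac{2\psi_k^T x - 2N\mu_x \bar{\psi}_k}{(N-1)(2\sigma_{x,z} + C_2)} - \frac{2\sum_{j=1}^{M} c_j \psi_j^T \psi_k - 2N\mu_z \bar{\psi}_k}{(N-1)(\sigma_x^2 + \sigma_z^2 + C_2)}. \tag{2.84}$$

Solution for Oscillatory Basis In the particular case where the basis is comprised of normalized oscillatory functions, we have

$$\bar{\psi}_k = 0 \quad \text{and} \quad \|\psi_k\| = 1 \quad \text{for } 1 \le k \le M. \tag{2.85}$$

This leads to

$$\mu_z = 0, \tag{2.86}$$

$$(N-1)\sigma_z^2 = \sum_{j=1}^{M} \sum_{k=1}^{M} c_j c_k \psi_j^T \psi_k, \tag{2.87}$$

$$(N-1)\sigma_{x,z} = \sum_{k=1}^{M} c_k \psi_k^T x. \tag{2.88}$$

Therefore the partial derivative in (2.84) becomes

$$\frac{\partial S}{\partial c_k} = S \left[ \frac{2\psi_k^T x}{(N-1)(2\sigma_{x,z} + C_2)} - \frac{2\sum_{j=1}^{M} c_j \psi_j^T \psi_k}{(N-1)(\sigma_x^2 + \sigma_z^2 + C_2)} \right]. \tag{2.89}$$

We now search for stationary points:

$$\frac{\partial S}{\partial c_k} = 0 \implies \frac{\psi_k^T x}{\sum_{j=1}^{M} c_j \psi_j^T \psi_k} = \frac{2\sigma_{x,z} + C_2}{\sigma_x^2 + \sigma_z^2 + C_2} =: \frac{1}{\alpha}, \quad \text{for } 1 \le k \le M. \tag{2.90}$$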

2 Optimizing Image Quality

We can rewrite this equation as

$$\sum_{j=1}^{M} c_j \psi_j^T \psi_k = \alpha\, \psi_k^T x, \quad \text{for } 1 \le k \le M. \tag{2.91}$$

This equation is very similar to (2.62) for the optimal coefficients for the L2-based approximation. In fact, since the equations (2.62) and (2.91) are identical up to a scaling factor and since the solution of the linear system is unique, we have

$$c_k = \alpha a_k. \tag{2.92}$$

Now, we seek to find an expression for $\alpha$. Starting from the right hand side of (2.90), we replace $s_z^2$ and $s_{x,z}$ by their basis expansion (2.86) and then employ (2.92) for the $c_k$ to obtain

$$\alpha = \frac{\alpha^2 A + B}{\alpha C + D}, \tag{2.93}$$

where

$$A = \frac{1}{N-1} \sum_{j=1}^{M} \sum_{k=1}^{M} a_j a_k \psi_j^T \psi_k, \tag{2.94}$$

$$B = s_x^2 + C_2, \tag{2.95}$$

$$C = \frac{2}{N-1} \sum_{k=1}^{M} a_k \psi_k^T x, \tag{2.96}$$

$$D = C_2. \tag{2.97}$$

Equation (2.93) is a quadratic equation in $\alpha$ with solutions

$$\alpha = \frac{-D \pm \sqrt{D^2 + 4(C - A)B}}{2(C - A)} \tag{2.98}$$

$$\phantom{\alpha} = \frac{-C_2 \pm \sqrt{C_2^2 + 2C(s_x^2 + C_2)}}{C}. \tag{2.99}$$

Note that $C - A = C/2 = A$, since the $a_k$'s are found by solving the linear system (2.62).

Flat Approximation Case. We now consider the case in which a flat basis function $\psi_0 \equiv 1$ is added to the oscillatory basis. In this case,

$$\bar{z} = c_0. \tag{2.100}$$

The coefficient $c_0$ is the stationary point of (2.84):

$$\frac{\partial S}{\partial c_0} = S \left[ \frac{2\bar{x}\bar{\psi}_0}{2\bar{x}c_0 + C_1} - \frac{2 c_0 \bar{\psi}_0}{\bar{x}^2 + c_0^2 + C_1} + \frac{2\psi_0^T x - 2N\bar{x}\bar{\psi}_0}{(N-1)(2 s_{x,z} + C_2)} - \frac{2\sum_{j=0}^{M} c_j \psi_j^T \psi_0 - 2N c_0 \bar{\psi}_0}{(N-1)(s_x^2 + s_z^2 + C_2)} \right] = S \left[ \frac{2\bar{x}}{2\bar{x}c_0 + C_1} - \frac{2c_0}{\bar{x}^2 + c_0^2 + C_1} \right]. \tag{2.101}$$

Solving for the stationary point leads to the following quadratic equation in $c_0$:

$$\bar{x} c_0^2 + C_1 c_0 - \bar{x}(\bar{x}^2 + C_1) = 0. \tag{2.102}$$

Its solution is

$$c_0 = \frac{-C_1 \pm \sqrt{C_1^2 + 4\bar{x}^2(\bar{x}^2 + C_1)}}{2\bar{x}}. \tag{2.103}$$

We choose the positive branch to maximize the SSIM index, which is simply $c_0 = \bar{x}$, as expected. The other coefficients are found as in the oscillatory basis case.

Orthogonal Basis. In the orthogonal case, the constants in the equation for $\alpha$ (2.99) simplify to

$$\alpha = \frac{-C_2 \pm \sqrt{C_2^2 + \left(\frac{4}{N-1}\sum_{k=1}^{M} (\psi_k^T x)^2\right)(s_x^2 + C_2)}}{\frac{2}{N-1}\sum_{k=1}^{M} (\psi_k^T x)^2}. \tag{2.104}$$
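The closed form (2.104) can be checked numerically. The sketch below is illustrative only: the random signal, the basis built by a QR factorization, the value chosen for C2, and the helper names `ssim_scale` and `s2` are assumptions made for this example, not part of the chapter. It assumes an orthonormal, zero-mean (oscillatory) basis, takes the positive branch of (2.104), and compares it with a brute-force scan over rescalings of the L2 coefficients.

```python
import numpy as np

def ssim_scale(a, s_x2, n, c2):
    """Positive branch of (2.104) for L2 coefficients a_k = psi_k^T x."""
    p = np.sum(a ** 2) / (n - 1)
    return (-c2 + np.sqrt(c2 ** 2 + 4.0 * p * (s_x2 + c2))) / (2.0 * p)

def s2(x, z, c2):
    """Second (variance/covariance) component of the SSIM index."""
    n = x.size
    xc, zc = x - x.mean(), z - z.mean()
    s_xz = xc @ zc / (n - 1)
    s_x2 = xc @ xc / (n - 1)
    s_z2 = zc @ zc / (n - 1)
    return (2.0 * s_xz + c2) / (s_x2 + s_z2 + c2)

rng = np.random.default_rng(0)
n, m, c2 = 16, 4, 0.03          # hypothetical sizes and stability constant

# Orthonormal zero-mean basis: QR of [e, random columns], then drop the
# first column (which spans the flat vector e).
q, _ = np.linalg.qr(np.column_stack([np.ones(n), rng.standard_normal((n, m))]))
psi = q[:, 1:]

x = rng.standard_normal(n)
a = psi.T @ x                   # L2-optimal coefficients in the orthonormal case
z_l2 = psi @ a                  # L2-optimal approximation
alpha = ssim_scale(a, x.var(ddof=1), n, c2)

# Brute-force scan over rescalings t * z_l2 for comparison.
scales = alpha * np.linspace(0.25, 4.0, 301)
scores = np.array([s2(x, t * z_l2, c2) for t in scales])
best = scales[np.argmax(scores)]
print(alpha, best)
```

In this setting the scan should peak at the closed-form value, and `alpha` should be at least 1 because the retained coefficients carry no more energy than the full signal.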


D. Brunet et al.

The SSIM index is maximized with the positive branch. If $C_2 = 0$, then

$$\alpha = \frac{s_x}{\sqrt{\frac{1}{N-1}\sum_{k=1}^{M} (\psi_k^T x)^2}} \tag{2.105}$$

$$\phantom{\alpha} = \left( \frac{\sum_{k=1}^{N} a_k^2}{\sum_{k=1}^{M} a_k^2} \right)^{1/2}, \tag{2.106}$$

where the second equality follows from Parseval's Theorem. Thus the coefficients are adjusted in order to preserve the variance of the original signal.

2.4.3 Non-Linear Approximation

Similar to the L2-case, the problem is to find, given a dictionary $\Psi \in \mathbb{R}^{P \times N}$ and a signal $x \in \mathbb{R}^N$, the coefficients $c \in \mathbb{R}^P$ with $\|c\|_0 = M < N$ such that

$$\mathrm{SSIM}(x, \Psi c) \tag{2.107}$$

is maximized.

SSIM-Based Matching Pursuit. From Sect. 2.4.2.1, we know how to find the best coefficients given a set of vectors. It remains to determine which vectors to choose. We assume an oscillatory dictionary which contains a flat element $\psi_0 = e$. This flat element is always included in the approximation so that, by (2.103), $c_0 = \bar{x}$ and $S_1(x, z) = 1$. It remains to optimize $S_2$. First, we want to find $\gamma_0$ and $c_{\gamma_0}$ that will maximize $S_2(x, c_{\gamma_0}\psi_{\gamma_0})$. The second component of the simplified SSIM index is written as

$$S_2(x, c_{\gamma_0}\psi_{\gamma_0}) = \frac{2 c_{\gamma_0} \psi_{\gamma_0}^T (x - \bar{x}e) + C_2(N-1)}{\|x - \bar{x}e\|^2 + c_{\gamma_0}^2 + C_2(N-1)}. \tag{2.108}$$

For any fixed $c_k$, the SSIM will be maximized when $|\psi_k^T (x - \bar{x}e)| = |\psi_k^T x|$ is maximized. We thus choose

$$\gamma_0 = \arg\max_{1 \le k \le P} |\psi_k^T x| \tag{2.109}$$

and

$$c_{\gamma_0} = \alpha\, \psi_{\gamma_0}^T x. \tag{2.110}$$

In general, we want to find $\gamma_K$ and $c_{\gamma_K}$ that will maximize

$$S_2\Big(x, \sum_{k=0}^{K-1} c_{\gamma_k}\psi_{\gamma_k} + c_{\gamma_K}\psi_{\gamma_K}\Big). \tag{2.111}$$

For every choice of $\gamma_K$, we would need to find $\{a_{\gamma_k}\}_{0 \le k \le K}$, i.e. we have to solve a $K \times K$ linear system of equations and compute the SSIM with $c_{\gamma_k} = \alpha a_{\gamma_k}$, then pick the basis $\psi_{\gamma_K}$ that yields the maximum value. In practice this procedure may be intractable given that a potentially large linear system has to be solved for every possible basis of the dictionary and at every iteration of the greedy algorithm.

SSIM-Based Orthogonal Matching Pursuit. According to the MP algorithm, the choice of the first basis that maximizes the SSIM index is the same as that of the optimal L2-basis. Indeed, (2.108) is maximized when $|\psi_k^T x|$ is maximized. For the choice of the K-th basis, we seek to maximize

$$S_2\Big(x, \sum_{j=1}^{K-1} c_{\gamma_j}\psi_{\gamma_j} + c_{\gamma_K}\psi_{\gamma_K}\Big) = S_2\Big(x, \sum_{j=1}^{K-1} c_{\gamma_j} g_{\gamma_j} + c_{\gamma_K} g_{\gamma_K}\Big) \tag{2.112}$$

$$= \frac{2\Big(\sum_{j=1}^{K-1} (g_{\gamma_j}^T x)\, c_{\gamma_j} + (g_{\gamma_K}^T x)\, c_{\gamma_K}\Big) + (N-1)C_2}{\|x - \bar{x}e\|^2 + \sum_{j=1}^{K} c_{\gamma_j}^2 + (N-1)C_2}. \tag{2.113}$$

The choice of basis that will maximize the SSIM index is

$$\gamma_K = \arg\max_{1 \le k \le P} |g_k^T x|. \tag{2.114}$$

Note that

$$g_k^T x = \Big(\psi_k - \sum_{j=1}^{K-1} (\psi_k^T g_j)\, g_j\Big)^T x \tag{2.115}$$

$$= \psi_k^T \Big(x - \sum_{j=1}^{K-1} (g_j^T x)\, g_j\Big) \tag{2.116}$$

$$= \psi_k^T r. \tag{2.117}$$

Thus, the optimal basis for the SSIM-based and the L2-based algorithms are exactly the same. Indeed, the SSIM-based coefficients will be simply a scaling of the L2-based coefficients. The difference will be in the stopping criterion: the SSIM-OMP stopping criterion will depend on the SSIM index instead of the L2-error.

2.4.4 Variational SSIM

Otero et al. [23–25] have explored the direction of SSIM-optimal algorithm design directly in the pixel domain. They exploit the quasi-convexity properties for zero-mean signals (see Sect. 2.2). Based on these observations, several optimization problems are formulated and solved using a simple bisection method. Because of space limitations, we omit a detailed discussion of these methods in this chapter and simply refer readers to [23–25].

2.5 Image-Wide Variational SSIM Optimization

Block-based schemes find a perceptually optimal solution locally, but blockiness artifacts might appear when combining these local solutions to form an image. Several ad hoc methods such as taking the central pixel or averaging overlapping blocks are possible, but they might not be perceptually optimal. We propose a solution based on convex or quasi-convex optimization. Let $R_i$ be the block extraction operator, which from an image $x$ gets the i-th block. Let $z_i$ be the local perceptually optimal estimator corresponding to the i-th image block. Finally, let $d$ be a quasi-convex dissimilarity measure and let $A$ be a convex set. Then the solution of the following optimization problem is the desired image-wide perceptually optimal estimator:

$$\hat{x} = \arg\min_{x \in A} \max_i\, d(R_i(x), z_i). \tag{2.118}$$

Since $R_i$ is a linear function and $d$ is quasi-convex, $d(R_i(x), z_i)$ is also quasi-convex. As summarized in Sect. 2.2, the maximum of quasi-convex functions is also quasi-convex. Since $A$ is a convex set, the optimization problem can thus be solved using the bisection method. In the special case of image denoising, $A$ takes the form $\{x : \|x - y\|_2^2 \le \alpha\}$, where $\alpha$ is a parameter controlling how close to the noisy image $y$ the restored image $x$ will be. Alternatively, the estimator could be written in a variational form as

$$\hat{x} = \arg\min_{x} \Big\{ \max_i\, d(R_i(x), z_i) + \beta \|x - y\|_2^2 \Big\}, \tag{2.119}$$

where $\beta$ is a parameter controlling the balance between the fit to the local perceptual estimator and the global information from the noisy input. These optimization problems can be compared to the one defined in [31]:

$$\arg\max_{x} \{\mathrm{SSIM}(w, x) + \mathrm{SSIM}(x, y)\}. \tag{2.120}$$

There are two major differences. In [31], (1) the mean SSIM score is optimized instead of the minimal SSIM score, and (2) the Structural Similarity between the noisy data and the global solution is taken instead of the Mean Squared Error. We argue that the new formulation (2.119) is not only more convenient mathematically, but also conceptually preferable. Indeed, it is mathematically convenient since we can prove the existence and uniqueness of a solution (assuming that $d$ is not flat) and we have a (bisection) method to compute the optimal solution that will always converge to the global minimum. It is also justifiable conceptually, since the maximum dissimilarity corresponds to the most salient feature, the one that will be the most perceptually annoying. (However, if the image quality is non-uniform, the maximum might put too much weight on a single location.) Moreover, taking the mean-squared error between the noisy image and the restored image is justified by the model of distortion (additive noise), not the model of perception.
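The bisection idea behind (2.118) can be sketched on a toy problem. Everything specific here is an assumption chosen so that the feasibility test has a closed form: the image is a single scalar, each $R_i$ is the identity, the dissimilarity is $d(u, z) = \sqrt{|u - z|}$ (quasi-convex but not convex), and the convex set $A$ is an interval. The function name `bisection_minmax` is hypothetical. At each level $t$, the sub-level sets are intervals, so feasibility reduces to checking that their intersection with $A$ is non-empty.

```python
import numpy as np

def bisection_minmax(z, lo, hi, tol=1e-8):
    """Minimise max_i sqrt(|x - z_i|) over x in A = [lo, hi] by bisection on the level t."""
    z = np.asarray(z, dtype=float)

    def feasible(t):
        # {x : sqrt(|x - z_i|) <= t} is the interval [z_i - t^2, z_i + t^2];
        # intersect all of them with A and report a point in the intersection.
        left = max(lo, np.max(z - t ** 2))
        right = min(hi, np.min(z + t ** 2))
        return left <= right, 0.5 * (left + right)

    # Upper level at which every point of A is feasible.
    t_lo = 0.0
    t_hi = np.sqrt(max(np.max(np.abs(z - lo)), np.max(np.abs(z - hi))))
    x_best = feasible(t_hi)[1]

    while t_hi - t_lo > tol:
        t_mid = 0.5 * (t_lo + t_hi)
        ok, x_mid = feasible(t_mid)
        if ok:
            t_hi, x_best = t_mid, x_mid   # level attainable: tighten from above
        else:
            t_lo = t_mid                  # level not attainable: raise the lower bound
    return x_best, t_hi

x_star, t_star = bisection_minmax([0.0, 4.0], lo=-10.0, hi=10.0)
print(x_star, t_star)
```

For real images the feasibility step would be a convex feasibility problem rather than an interval intersection; the outer bisection loop keeps the same shape.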

2.6 Conclusions

In this chapter, we have discussed the problem of optimizing the perceptual quality of images by employing perceptual quality measures in the optimization problem as opposed to traditional distance measures, e.g. MSE/RMSE. The SSIM index was considered to be the cost function in this discussion. The mathematical properties of the SSIM index were first presented, most notably its quasi-convexity. Subsequently, the classical problem of image restoration was reformulated with the statistical version of the SSIM index as the cost function. The solution to this problem demonstrated the gains to be had by explicitly optimizing for perceptual quality metrics. Corroborative evidence for this claim was shown in the form of a soft-thresholding solution, again optimized with respect to the SSIM index. Subsequently, a methodology for constructing SSIM-optimal basis functions was discussed and variations of popular pursuit algorithms such as matching pursuit and basis pursuit were presented. Through these discussions, a set of SSIM-optimal solutions that address popular image processing problems ranging from denoising and restoration to dictionary construction was presented. These solutions provide a platform to address a larger set of problems that could be reduced to one of these forms. Furthermore, these solutions demonstrate that optimization of perceptual quality is indeed the way forward in building next generation multimedia systems.

References

1. Andrews HC, Hunt BR (1977) Digital image restoration. Signal processing series. Prentice-Hall, Englewood Cliffs, NJ
2. Bertsekas DP (1995) Dynamic programming and optimal control. Athena Scientific, Belmont, MA
3. Bertsekas DP (1999) Nonlinear programming. Athena Scientific, Belmont, MA
4. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York
5. Brunet D (2012) A study of the structural similarity image quality measure with applications to image processing. PhD thesis, University of Waterloo
6. Brunet D, Vrscay ER, Wang Z (2010) Structural similarity-based approximation of signals and images using orthogonal bases. In: Kamel M, Campilho A (eds) Proceedings on an international conference on image analysis and recognition. Lecture notes in computer science, vol 6111. Springer, Heidelberg, pp 11–22
7. Brunet D, Vrscay ER, Wang Z (2012) On the mathematical properties of the structural similarity index. IEEE Trans Image Process 21(4):1488–1499
8. Chang H-W, Yang H, Gan Y, Wang M-H (2013) Sparse feature fidelity for perceptual image quality assessment. IEEE Trans Image Process 22(10):4007–4018
9. Chang SG, Yu B, Vetterli M (2000) Adaptive wavelet thresholding for image denoising and compression. IEEE Trans Image Process 9(9):1532–1546
10. Channappayya SS, Bovik AC, Heath RW (2006) A linear estimator optimized for the structural similarity index and its application to image denoising. In: Proceedings on IEEE international conference on image processing, IEEE, pp 2637–2640
11. Channappayya SS, Bovik AC, Heath RW (2008) Perceptual soft thresholding using the structural similarity index. In: Proceedings on IEEE international conference on image processing, IEEE, pp 569–572
12. Chen SS, Donoho DL, Saunders MA (2001) Atomic decomposition by basis pursuit. SIAM Rev 43:129–159
13. Davis G, Mallat S, Avellaneda M (1997) Greedy adaptive approximation. J Constr Approx 13:57–98
14. Donoho DL (1995) De-noising by soft-thresholding. IEEE Trans Inf Theory 41(3):613–627
15. Donoho DL, Johnstone IM (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3):425–455
16. Donoho DL, Johnstone IM (1995) Adapting to unknown smoothness via wavelet shrinkage. J Amer Stat Assoc 90(432):1200–1224
17. Jin L, Egiazarian K, Kuo CCJ (2012) Perceptual image quality assessment using block-based multimetric fusion (BMMF). In: Proceedings on IEEE international conference on acoustics, speech, and signal processing, IEEE, pp 1145–1148
18. Katsaggelos AK (2012) Digital image restoration. Springer, Heidelberg
19. Kolaman A, Yadid-Pecht O (2012) Quaternion structural similarity: a new quality index for color images. IEEE Trans Image Process 21(4):1526–1536
20. Mallat S, Zhang Z (1993) Matching pursuit with time-frequency dictionaries. IEEE Trans Signal Process 41:3397–3415
21. Mannos J, Sakrison D (1974) The effects of a visual fidelity criterion of the encoding of images. IEEE Trans Inf Theory 20(4):525–536
22. Okarma K (2009) Colour image quality assessment using structural similarity index and singular value decomposition. In: Kamel M, Campilho A (eds) Proceedings on international conference on image analysis and recognition. Lecture notes in computer science, vol 5337. Springer, Heidelberg, pp 55–65
23. Otero D, La Torre D, Vrscay ER (2015) Structural similarity-based optimization problems with L1-regularization: smoothing using mollifiers. In: Proceedings on international conference on image analysis. Springer, pp 33–42
24. Otero D, Vrscay ER (2014) Solving optimization problems that employ structural similarity as the fidelity measure. In: Proceedings on international conference on image processing, computer vision and pattern recognition, CSREA Press, pp 474–479
25. Otero D, Vrscay ER (2014) Unconstrained structural similarity-based optimization. In: Proceedings on international conference on image analysis and recognition. Springer, Heidelberg, pp 167–176
26. Pati YC, Rezaiifar R, Krishnaprasad PS (1993) Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Proceedings on IEEE Asilomar conference on signals, systems, and computers, pp 40–44
27. Ponomarenko N, Ieremeiev O, Lukin V, Egiazarian K, Carli M (2011) Modified image visual quality metrics for contrast change and mean shift accounting. In: Proceedings on international conference on the experience of designing and application of CAD systems in microelectronics, pp 305–311
28. Ponomarenko N, Jin L, Ieremeiev O, Lukin V, Egiazarian K, Astola J, Vozel B, Chehdi K, Carli M, Battisti F et al (2015) Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication 30:57–77
29. Ponomarenko N, Lukin V, Zelensky A, Egiazarian K, Carli M, Battisti F (2009) TID2008-a database for evaluation of full-reference visual quality assessment metrics. Adv Mod Radioelectronics 10(4):30–45
30. Portilla J, Simoncelli E (2003) Image restoration using gaussian scale mixtures in the wavelet domain. In: Proceedings on IEEE international conference on image processing, vol 2, IEEE, pp 965–968
31. Rehman A, Rostami M, Wang Z, Brunet D, Vrscay ER (2012) SSIM-inspired image restoration using sparse representation, EURASIP J Adv Signal Processing. Special Issue on Image and Video Quality Improvement Techniques for Emerging Applications 16(1):1–12
32. Rehman A, Wang Z, Brunet D, Vrscay ER (2011) SSIM-inspired image denoising using sparse representations. In: Proceedings on IEEE international conference on acoustics, speech, and signal processing, Prague, Czech Republic, pp 1121–1124
33. Rubinstein R, Zibulevsky M, Elad M (2008) Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit, Technical Report, Department of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel
34. Sampat MP, Wang Z, Gupta S, Bovik AC, Markey MK (2009) Complex wavelet structural similarity: A new image similarity index. IEEE Trans Image Process 18(11):2385–2401
35. Wang Z, Bovik AC (2002) A universal image quality index. IEEE Signal Process Lett 9(3):81–84
36. Wang Z, Bovik AC (2011) Reduced- and no-reference image quality assessment. IEEE Signal Process Mag 28(6):29–40
37. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
38. Wang Z, Li Q (2011) Information content weighting for perceptual image quality assessment. IEEE Trans Image Process 20(5):1185–1198
39. Wang Z, Lu L, Bovik AC (2004) Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication 19(2):121–132
40. Wang Z, Shang X (2006) Spatial pooling strategies for perceptual image quality assessment. In: Proceedings on IEEE international conference on image processing, IEEE, pp 2945–2948
41. Wang Z, Simoncelli EP, Bovik AC (2003) Multiscale structural similarity for image quality assessment. In: Proceedings on IEEE Asilomar conference on signals, systems, and computers, vol 2, IEEE, pp 1398–1402
42. Zhang L, Li H (2012) SR-SIM: A fast and high performance IQA index based on spectral residual. In: Proceedings on IEEE international conference on image processing, pp 1473–1476
43. Zhang L, Zhang L, Mou X, Zhang D (2011) FSIM: A feature similarity index for image quality assessment. IEEE Trans Image Process 20(8):2378–2386
44. Zhang W, Borji A, Wang Z, Le Callet P, Liu H (2016) The application of visual saliency models in objective image quality assessment: A statistical evaluation. IEEE Trans Neur Netw Learn Sys 27(6):1266–1278

3 Computational Color Imaging

Raja Bala, Graham Finlayson, and Chul Lee

R. Bala: Palo Alto Research Center, 3333, Coyote Hill Road, Palo Alto, CA 94304, USA; e-mail: [email protected]
G. Finlayson: University of East Anglia, Norwich NR4 7TJ, UK
C. Lee: Pukyong National University, 45 Yongso-ro, Daeyeon 3(sam)-dong, Nam-gu, Busan, South Korea

© Springer International Publishing AG 2017. V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_3

3.1 Introduction

A primary goal of a color imaging system is to obtain images of consistent high quality from a variety of digital color devices such as cameras, displays, and printers. The design and manufacturing of these devices is a nontrivial endeavor involving many practical constraints pertaining to hardware, device physics, cost, throughput, power consumption, form factor, and scalability. In this chapter, we present techniques to optimize color transformations associated with color devices that account for these constraints.

Figure 3.1 is a diagram of a color imaging system showing the various types of devices and associated processing. Input color devices such as cameras and scanners transform a physical or analog representation of a scene or printed image into digital form. The image captured by such devices is commonly in a red-green-blue (RGB) color coordinate system. Conversely, output devices such as displays and printers take as input digital image pixel values and produce an analog representation that reaches the observer's eye via either light emission (in the case of displays) or reflection (in the case of printers). Displays produce color by mixing additive RGB primaries, while printers produce color by mixing subtractive cyan-magenta-yellow (CMY) primaries.

Fig. 3.1 Color imaging system diagram

As is evident in Fig. 3.1, there are numerous digital processing functions that an image undergoes as it moves through the system from capture to presentation. In particular, this chapter focuses on the optimization of device color transformations mapping images from one color space to another. The objective function for the optimization is a measure of error introduced by the transformation. Since the human observer is the ultimate judge of image quality, the error measure often incorporates perceptual properties of the human visual system.

Color transformations come in two flavors. The first maps device-dependent color signals d in an m-dimensional color space (e.g. RGB, CMY, or CMYK) to device-independent color signals c belonging to an n-dimensional color space (e.g. CIEXYZ or CIELAB, [1]). These mappings, often referred to as color correction, are derived via the process of device characterization, and enable the exchange of color images among different devices via a common intermediate device-independent color representation. Device characterization can be based on physical models [2, 4, 5], on mathematical approximation techniques [4, 6], or on hybrid methods that draw from both principles. This chapter will present a variety of techniques to optimize device characterization and color correction transforms.

In the case of displays, the physics of additive light mixing enables a relatively simple 3 × 3 matrix characterization [2]. In contrast, printers exhibit complex color characteristics due to the nonlinear nature of subtractive color mixing. The evaluation of complex printer characterization functions can be too computationally intensive for real-time execution, and hence the typical practice is to approximate the function on a structured multidimensional lattice that facilitates fast interpolation [6]. Lattice optimization will be described in detail in Sect. 3.4. For digital capture, color correction maps the raw captured RGB either to a device-independent color space or directly to the color space of a display. An essential element in camera color correction is estimating the scene illuminant from the captured image. Color correction and illuminant estimation are posed as optimization problems in Sect. 3.2.

The second type of color transformation has both its domain and range within a device-dependent color coordinate system. An example is device calibration, which maintains a device's 1-D channel-wise tone response in a known reference state [2]. A variant that captures inter-colorant interactions via two-dimensional calibration is proposed in [3]. Another type of device transform applies for the case where there are m > 3 device signals. An example for emissive displays is the addition of white (W) light to the additive RGB primaries. The motivation is to achieve a greater dynamic range and light efficiency. In such cases it is common practice to derive a mapping from the m-dimensional device color space to a canonical 3-dimensional RGB representation. There are several purposes for doing this. One is to incorporate differentiating (and often proprietary) technology into a device, embedded in the device's driver firmware or operating system. Another is to hide the complexity of m > 3 device color representations, and expose a standard RGB interface to external applications. In Sect. 3.3 a technique will be presented for optimizing the RGB-to-RGBW transform for mobile displays that formally accounts for constraints in power consumption.

3.2 Color Capture

3.2.1 Introduction

Suppose we capture an image with a camera and then we wish to display it on a color monitor (or phone or tablet). The raw acquired RGBs cannot be used to drive a display (since the camera samples light differently from the way humans do). The process of mapping acquired RGBs to display RGBs so that the image 'looks right' is called color correction. Color correction is the most important part of camera characterization and one of the two foci of this subsection. For a review of methods which attempt to estimate the spectral characteristics of a camera, see [7]. Here, in regard to color correction, 'looks right' means that we attempt to find the display RGBs that make a correct physical reproduction (i.e. a colorimetric rendering intent, e.g. see [8]). However, we point out that the color correction step also usually implements aspects of preference with, for example, color corrected images having slightly more saturated colors compared to those observed in the real world [9]. Not only is the nature of preference highly subjective, different manufacturers implement their own proprietary solutions.

The physical signal a camera sees is biased by the color of the prevailing illuminant under which an image is captured. At the sensor level a white piece of paper viewed in shadow or in direct sunlight is respectively bluish and yellowish. Our own visual system automatically processes what we see to discount, to a large measure, the color of the light. This color constancy is equally important in digital reproduction. After all, we want to see the images we remember seeing. So, we need to emulate our own color constancy processing in a camera processing pipeline.

Color correction is 'one half' of color constancy. That is, we need to understand [10] how colors might be mapped to remove (color correct) the bias due to the illuminant color. Illuminant estimation (the second half of the problem and the second topic covered in this section) is the process whereby the color bias due to illumination is inferred from the image. Illuminant estimation has been studied for at least 50 years and although good progress has been made, there is still a gap between the performance delivered and the performance needed for accurate image reproduction.

3.2.2 Color Correction

How then do we transform the RGBs a camera measures to those that will correctly drive a display to make an accurate physical reproduction? Let us begin by assuming, completely wrongly (we will recant later in this section), that the camera has spectral sensitivities which are the same as the cone sensors in our eye. The advantage of making this assumption is that we will uncover the linear mathematics that underpins color correction and, for this simplified world, will see that not only is color correction possible but that there exists a simple closed-form solution. Let us assume that a monitor has 3 color outputs with spectral power distributions in the short (or blue), medium (green) and long (red) parts of the visible spectrum. The camera/eye response to each display color is written as:

$$p_k^c = \int_\omega R_k(\lambda) P_c(\lambda)\, d\lambda, \quad k \in \{R, G, B\}, \quad c \in \{l, m, s\}. \tag{3.1}$$

In Eq. (3.1), $P_c(\lambda)$ denotes the spectral output of each of the three channels of a color display, $R_k(\lambda)$ denotes the spectral sensitivity of the sensor, and $\omega$ represents the visible spectrum. Our visual system and digital cameras, at the raw encoding level, are linear, i.e. if we double the light we double the measured response. The linearity is a consequence of the response of the eye and camera being well modeled by linear integration. Linearity is useful. Suppose we capture an image of the monitor where the monitor outputs $P_l$ and $P_m$ are at 50% and 75% of their respective maximum intensities (and $P_s$ is at 0%), then

$$p_k^{0.5l + 0.75m} = \int_\omega R_k(\lambda)\,[0.5 P_l(\lambda) + 0.75 P_m(\lambda)]\, d\lambda = 0.5 \int_\omega R_k(\lambda) P_l(\lambda)\, d\lambda + 0.75 \int_\omega R_k(\lambda) P_m(\lambda)\, d\lambda. \tag{3.2}$$

It follows from Eqs. (3.1) and (3.2) that the camera response to an arbitrary intensity weighting of the display outputs can be written as a matrix equation:

$$\begin{bmatrix} p_r \\ p_g \\ p_b \end{bmatrix} = \begin{bmatrix} p_r^l & p_r^m & p_r^s \\ p_g^l & p_g^m & p_g^s \\ p_b^l & p_b^m & p_b^s \end{bmatrix} \begin{bmatrix} q_r \\ q_g \\ q_b \end{bmatrix} \;\Rightarrow\; p = Mq, \tag{3.3}$$

where $q_r$, $q_g$ and $q_b$ vary the intensity of the color channels (from 0 to 100% or minimum to maximum power). We are now in a position to ask the central question: how do we solve for the display weights, the correct RGBs, to drive the display? Denoting the 3-vector of responses in Eq. (3.3) as $p$, the correct image display weights $q$ (the values recorded in an image pixel) are calculated as:

$$q = M^{-1} p. \tag{3.4}$$
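Equations (3.1) to (3.4) can be illustrated with a small numerical sketch. The Gaussian sensor sensitivities and display primaries below are hypothetical stand-ins (not measured data), and the integrals are approximated by Riemann sums on a 5 nm grid; the point is only to show the matrix $M$ being assembled, the linearity of the response, and the recovery of the display weights.

```python
import numpy as np

wl = np.arange(400.0, 701.0, 5.0)   # wavelength samples in nm (assumed range)
dl = 5.0                            # sampling step used for the Riemann sums

def gauss(mu, sigma):
    """Hypothetical bell-shaped spectral curve centred at mu nm."""
    return np.exp(-0.5 * ((wl - mu) / sigma) ** 2)

# Hypothetical sensor sensitivities R_k (rows: R, G, B) and
# display primaries P_c (rows: l, m, s).
R = np.stack([gauss(600, 40), gauss(540, 40), gauss(450, 30)])
P = np.stack([gauss(610, 20), gauss(545, 25), gauss(460, 20)])

# Eq. (3.1): M[k, c] approximates the integral of R_k(lambda) * P_c(lambda).
M = R @ P.T * dl

# Eq. (3.2): drive P_l at 50 %, P_m at 75 %, P_s at 0 %.
q_true = np.array([0.5, 0.75, 0.0])
mixed_light = q_true @ P            # spectrum emitted by the display
p = R @ mixed_light * dl            # sensor response to the mixture

# Eq. (3.3): the same response follows from the matrix form p = M q.
assert np.allclose(p, M @ q_true)

# Eq. (3.4): recover the display weights from the response.
q = np.linalg.solve(M, p)
print(q)
```

Using `np.linalg.solve` rather than forming $M^{-1}$ explicitly is a numerical convenience; it computes the same quantity as Eq. (3.4).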

Equation (3.4) teaches color correction [11] in closed form. Equation (3.4) is also the exact solution to the classical color matching problem (i.e. how we mix three primary lights to match an arbitrary test light). Collecting the three camera sensor responses into a matrix $R(\lambda)$, let us calculate:

$$Q(\lambda) = M^{-1} R(\lambda). \tag{3.5}$$

We call $Q(\lambda)$ color matching functions. Color matching functions depend on the primaries of the display device. All color matching functions are a linear transform from the cone spectral sensitivities [12]. If the camera had sensitivities equal to $Q(\lambda)$ then the camera outputs would drive the display directly (in this special case $M^{-1}$ in Eq. (3.4) is the identity matrix). There has been considerable work in standardizing the target display RGBs (and the target color matching functions). One example is the sRGB [13] standard, which is widely used. Henceforth in this chapter we will assume that the target of color correction is sRGB. That is, in color correction we map the camera RGB, $p$, to the display sRGB, $q$.

In practice, because the camera sensitivities are not a linear combination of the cones, the color correction problem involves estimating sRGB color coordinates from measured camera responses. In panels a and b in Fig. 3.2 we show a raw camera image and the sRGB counterpart (all images in Fig. 3.2 also have the sRGB gamma applied, i.e. they are raised to the power 1/2.2). The image shown in panel a is created by numerically integrating a hyperspectral image from [14] using Nikon camera sensitivities measured in [15]. The image in panel b integrates the same hyperspectral image but sRGB matching curves are used for the sensors.

3.2.2.1 Linear Models

To understand how to transform camera RGBs to sRGB counterparts, we need to model how a camera and the eye respond to the spectral power distribution $E(\lambda)$ striking a surface with a spectral reflectance $S(\lambda)$. This is written as:

$$p = \int_\omega R(\lambda) E(\lambda) S(\lambda)\, d\lambda. \tag{3.6}$$

It is well established [16–18] that surface reflectance can be written to a tolerable approximation as the weighted sum of three basis functions:

$$S(\lambda) \approx \sum_{j=1}^{3} S_j(\lambda)\, \sigma_j. \tag{3.7}$$

It follows that under a given illuminant $E(\lambda)$ we can write the camera response as

$$p = \Lambda^{E(\lambda),R(\lambda)} \sigma, \qquad \Lambda^{E(\lambda),R(\lambda)}_{ij} = \int_\omega E(\lambda) S_j(\lambda) R_i(\lambda)\, d\lambda. \tag{3.8}$$

Equation (3.8) teaches that image formation is a 3 × 3 linear matrix $\Lambda$ multiplying a 3 × 1 weight vector $\sigma$. It follows that a 3 × 3 matrix transform relates colors recorded by a camera to corresponding sRGB responses:

$$q = \Lambda^{E(\lambda),Q(\lambda)} \left[\Lambda^{E(\lambda),R(\lambda)}\right]^{-1} p. \tag{3.9}$$
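A short numerical sketch may help make Eqs. (3.6) to (3.9) concrete. All spectra here are hypothetical: the illuminant is a gentle ramp, the three reflectance basis functions are low-order polynomials, and the camera and target sensitivities are Gaussians. Under the assumption that a reflectance lies exactly in the span of the three basis functions, the 3 × 3 transform of Eq. (3.9) maps the camera response to the target response without error.

```python
import numpy as np

wl = np.arange(400.0, 701.0, 5.0)       # wavelength samples in nm (assumed)
dl = 5.0
u = (wl - 550.0) / 150.0                # wavelength rescaled to [-1, 1]

def gauss(mu, sigma):
    return np.exp(-0.5 * ((wl - mu) / sigma) ** 2)

E = 1.0 + 0.3 * u                                   # hypothetical illuminant E(lambda)
S_basis = np.stack([np.ones_like(u), u, u ** 2])    # hypothetical basis S_j(lambda)
R = np.stack([gauss(600, 40), gauss(540, 40), gauss(450, 30)])  # camera sensors
Q = np.stack([gauss(595, 35), gauss(550, 35), gauss(455, 25)])  # target sensors

def lighting_matrix(sensors):
    """Eq. (3.8): Lambda[i, j] approximates the integral of E * S_j * sensor_i."""
    return (sensors * E) @ S_basis.T * dl

lam_R = lighting_matrix(R)
lam_Q = lighting_matrix(Q)
T = lam_Q @ np.linalg.inv(lam_R)        # the 3 x 3 transform of Eq. (3.9)

sigma = np.array([0.5, 0.2, -0.1])      # weights of a reflectance in the basis
S = sigma @ S_basis                     # Eq. (3.7), exact by construction
p = (R * E) @ S * dl                    # Eq. (3.6) for the camera
q = (Q * E) @ S * dl                    # Eq. (3.6) for the target sensors
print(T @ p, q)
```

For real reflectances, which are only approximately three-dimensional, the same transform would be approximate rather than exact.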


Fig. 3.2 In (a) a raw image (with a gamma of 0.5 applied to make it appear brighter) is shown next to the sRGB counterpart (b). Images (d) and (f) respectively show the output of a least-squares and a root-polynomial regression. Panels (c) and (e) show a pseudo color representation of CIELAB Delta E fitting error for images (d) and (f) compared with the ground truth (b)

Of course it is well known that Eq. (3.9) is only approximate. Indeed, sensor metamerism, the phenomenon that two surfaces look the same to one set of sensors but different to another, cannot be accounted for by a simple 3 × 3 matrix of illuminant change. Yet, practically, metamerism is rare. Indeed, Marimont and Wandell [19] extended the linear basis model formalism to incorporate image formation into the derivation of the optimal linear basis and found that 3 × 3 matrices could account for sensor (and illuminant) change. A similar result was reported by Funt et al. [20, 21].

3.2.2.2 Linear Color Correction

Although Eq. (3.9) is for an idealized and simplified world, linear color correction works well in practice and is implemented in almost all camera processing pipelines. Further, a useful property of linear color correction for photography applications is that it is independent of exposure. That is, if the 3 × 3 matrix $M$ optimally maps $p$ to $q$ then $M$ also optimally maps $kp$ to $kq$, where $k$ is a scalar modelling brightness change. Let us now consider how we can estimate the 'best' 3 × 3 matrix for color correction. Given N measured camera responses denoted $p_i$ in correspondence with the desired display RGBs $q_i$ ($i = 1, 2, \ldots, N$) we can find the 3 × 3 matrix $M$ by minimizing:

$$\min_{M} \sum_{i=1}^{N} \| f(M p_i) - f(q_i) \|^2. \tag{3.10}$$

Here if $f(\cdot)$ is the identity function (mapping the argument vector to itself) or if it is a linear operator (3 × 3 matrix) then Eq. (3.10) denotes the simple least-squares solution (for which there is a closed form solution [22]). The reader might wonder where the ground-truth $q_i$ comes from. Often, color correction is carried out for a reflective target where the reflectances have been previously measured. Then numerical integration can be employed to calculate the sRGB responses for these reflectances viewed under a measured reference illuminant. Often in color we wish to minimize a perceptual error. In this case $f(\cdot)$ denotes a mapping to a perceptual color space such as CIELAB or CIELUV [12]. In this case, the minimization is non-linear and the matrix $M$ is found through a gradient descent type search (local minimum) or by approximately linearizing the problem [23].
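For the case where $f(\cdot)$ is the identity, Eq. (3.10) is an ordinary least-squares problem. The sketch below uses synthetic data (a made-up ground-truth matrix `M_true` plus noise, both assumptions for illustration) and NumPy's `np.linalg.lstsq`; the helper name `fit_linear` is hypothetical. It also checks the exposure-independence property stated above: scaling both the camera and target RGBs by the same factor leaves the fitted matrix unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic camera RGBs (one per row) and noisy target RGBs; M_true is a
# made-up matrix used only to generate plausible data.
P = rng.uniform(0.05, 1.0, size=(200, 3))
M_true = np.array([[1.6, -0.4, -0.2],
                   [-0.3, 1.5, -0.2],
                   [0.0, -0.5, 1.5]])
Q = P @ M_true.T + 0.01 * rng.standard_normal((200, 3))

def fit_linear(P, Q):
    """Least-squares 3 x 3 matrix M minimising sum_i ||M p_i - q_i||^2.

    With responses stacked as rows, the problem reads P @ M.T ~= Q.
    """
    return np.linalg.lstsq(P, Q, rcond=None)[0].T

M = fit_linear(P, Q)

# Exposure independence: fitting on (k p_i, k q_i) gives the same matrix.
k = 4.0
M_k = fit_linear(k * P, k * Q)
print(np.max(np.abs(M - M_k)))
```

A perceptual $f(\cdot)$ such as a CIELAB mapping would make the problem non-linear and is not covered by this sketch.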

3.2.2.3 Polynomial Color Correction

Naively, it turns out, we might assume we can improve the accuracy of color correction by adopting a simple polynomial regression model. Let the vector function $\mathcal{P}_o(\cdot)$ take as argument a camera RGB vector and return all the terms in an order $o$ polynomial expansion. As an example the second order polynomial expansion $\mathcal{P}_2([R\; G\; B]^t)$ equals $[R\; G\; B\; R^2\; G^2\; B^2\; RG\; RB\; GB]^t$. With respect to this example we modify (3.10) and solve for the 3 × 9 matrix $M_2$ that minimizes:

$$\min_{M_2} \sum_{i=1}^{N} \| f(M_2 \mathcal{P}_2(p_i)) - f(q_i) \|^2. \tag{3.11}$$

In general, $M_o$ is a 3 × m matrix where m is the number of terms in an order $o$ polynomial expansion. The method of optimization of $M_o$ is identical to that in solving for the linear case (3.10).

Perhaps unsurprisingly, if we train and test on the same dataset then minimizing (3.11) results in a lower fitting error than minimizing (3.10). After all, a solution to $M_o$ could be the same 3 × 3 matrix found in (3.10) (the higher order terms could be ignored). But polynomial regression can work poorly when we train and test on different datasets. Indeed, if higher order terms are used then it is possible that the RGB vectors $p_i$ and $p_j = k p_i$ regress to $q_i$ and $q_j$ where $q_j \neq k' q_i$ (where $k$ and $k'$ are scalars). That is, the camera measures the same color at different brightnesses but after color correction the colors are different (the estimated display RGBs are not a scalar apart). In the worst case the unnatural color shift that results from using polynomial regression can be quite large [24]. We point out that for certain fixed exposure viewing conditions (so, not for digital camera pipelines) the polynomial model can be used [25].
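Both observations above (lower training error, but loss of exposure invariance) can be reproduced on synthetic data. In this sketch the ground truth is a made-up, mildly non-linear mapping, $f(\cdot)$ is taken to be the identity, and the helper name `poly2` is hypothetical. The fit is an ordinary least-squares solve on the expanded terms.

```python
import numpy as np

def poly2(P):
    """Second-order expansion [R G B R^2 G^2 B^2 RG RB GB] for each row of P."""
    R, G, B = P[:, 0], P[:, 1], P[:, 2]
    return np.column_stack([R, G, B, R * R, G * G, B * B, R * G, R * B, G * B])

rng = np.random.default_rng(2)
P = rng.uniform(0.05, 1.0, size=(300, 3))           # synthetic camera RGBs
A = np.array([[1.5, -0.3, -0.2],
              [-0.2, 1.4, -0.2],
              [0.0, -0.4, 1.4]])
Q = P @ A.T + 0.2 * P ** 2                          # made-up non-linear ground truth

M1 = np.linalg.lstsq(P, Q, rcond=None)[0].T         # 3 x 3 linear fit, Eq. (3.10)
M2 = np.linalg.lstsq(poly2(P), Q, rcond=None)[0].T  # 3 x 9 polynomial fit, Eq. (3.11)

err1 = np.sum((P @ M1.T - Q) ** 2)                  # training error, linear
err2 = np.sum((poly2(P) @ M2.T - Q) ** 2)           # training error, polynomial

# Exposure test: correct a colour at brightness k, and compare with k times
# the corrected colour at unit brightness.
p0 = np.array([[0.2, 0.4, 0.6]])
k = 2.0
q_scaled_input = poly2(k * p0) @ M2.T
q_scaled_output = k * (poly2(p0) @ M2.T)
print(err1, err2, q_scaled_input - q_scaled_output)
```

The two corrected colours differ because the squared terms scale with $k^2$ rather than $k$, which is the exposure dependence described in the text.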

3.2.2.4 Root-Polynomial Color Correction

Is there a way we can use the undoubted power of polynomial data fitting in a way that does not depend on exposure? In [24] it is observed that the terms in any polynomial fit each have a degree, e.g. $R$, $RG$ and $R^2B$ have degree 1, 2 and 3 respectively. Multiplying each of $R$, $G$ and $B$ by a scalar $k$ results in the terms $kR$, $k^2RG$ and $k^3R^2B$. That is, the degree of the term is reflected in the power to which the exposure scaling is raised. It follows that taking the root that matches each term's degree results in terms that all scale by the same factor $k$: $(kR) = kR$, $(k^2RG)^{1/2} = k(RG)^{1/2}$, $(k^3R^2B)^{1/3} = k(R^2B)^{1/3}$. Taking these roots also leads to the root-polynomial expansion having fewer terms, because some terms coincide (for example, $(R^2)^{1/2} = R$). For example, the 2nd-order root-polynomial expansion has the 6 terms $R$, $G$, $B$, $\sqrt{RG}$, $\sqrt{RB}$ and $\sqrt{GB}$. Details of the general formulae for generating a root-polynomial expansion of a given order can be found in [24]. The optimization for the 2nd-order root-polynomial color correction is written as:

\[ \min_{M_2} \sum_{i=1}^{N} \| f(M_2 \mathcal{R}_2(p_i)) - f(q_i) \|^2 \tag{3.12} \]

3 Computational Color Imaging

where the vector function $\mathcal{R}_o$ takes as argument a camera RGB vector and returns all the terms in an order-$o$ root-polynomial expansion. In Eq. (3.12) the second-order root-polynomial regression matrix $M_2$ is a $3 \times 6$ matrix.
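As an illustration of why the root-polynomial model is exposure invariant, here is a minimal sketch of the 2nd-order expansion and its application. The matrix `M` below is a made-up $3 \times 6$ example, not a fitted correction, and the function names are hypothetical.

```python
import math

def root_poly2(rgb):
    """2nd-order root-polynomial expansion [R, G, B, sqrt(RG), sqrt(RB), sqrt(GB)]."""
    r, g, b = rgb
    return [r, g, b, math.sqrt(r * g), math.sqrt(r * b), math.sqrt(g * b)]

def apply_correction(M, rgb):
    """Apply a 3 x 6 correction matrix M to the expanded camera RGB."""
    terms = root_poly2(rgb)
    return [sum(m * t for m, t in zip(row, terms)) for row in M]

# Hypothetical 3 x 6 matrix, for illustration only.
M = [
    [1.6, -0.4, -0.1, 0.05, -0.10, 0.02],
    [-0.2, 1.4, -0.3, 0.10, 0.03, -0.04],
    [0.0, -0.5, 1.7, -0.06, 0.08, 0.01],
]

p = (0.2, 0.4, 0.1)   # hypothetical camera RGB
k = 3.0               # hypothetical exposure change

q = apply_correction(M, p)
q_scaled = apply_correction(M, tuple(k * c for c in p))
# Every expanded term scales by k, so the corrected output scales by k as well.
```

Under these assumptions `q_scaled` equals `k` times `q` up to rounding, whatever values `M` takes.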

3.2.2.5 Experimental Results

In Fig. 3.2 we show a comparison of color correction methods. Panels a and b show a camera raw image and the sRGB ground-truth image. Panel d shows the output of a least-squares regression calibration procedure. Clearly, the correction works quite well but the red of the jersey is visibly incorrect. The CIELAB Delta E error is visualized in panel c. In color difference evaluation, a ‘Delta E’ of 1 corresponds to a just noticeable visual difference. For the least-squares regression there are image regions with an error of 10 or more, which indicates a visually noticeable fitting error. Panels f and e show, respectively, the output of root-polynomial regression and its Delta E error map. Clearly, the error is significantly reduced when root-polynomial regression is used.

We now carry out a simple experimental test using synthetic data and numerical integration. For surface reflectances we take the composite set of 1995 measurements from [26]. From the same database we use the 102 measured illuminant spectra. Finally we use a dataset of previously measured spectral sensitivities for 28 cameras [27]. For the ith light and the jth camera we numerically integrate the 1995 RGBs and also the corresponding 1995 sRGBs for this light. We now wish to evaluate, in terms of CIELAB Delta E error, how well we can color correct the data. We test 3 methods. LsqCC uses simple least-squares regression to find a $3 \times 3$ regression matrix. Since we are optimising for CIELAB error, we also find the best $3 \times 3$ matrix, LabCC, that minimises this error. Finally, we calculate the error for the 2nd-order root-polynomial method, RpCC. We evaluate each method on a 3-fold cross-validation basis. That is to say, we divide the RGBs, randomly, into 3 folds of roughly equal size. For each fold we train the regression method on the complement of the fold (the RGBs not in the fold) and test on the RGBs in the fold itself. For each fold we calculate 4 summary statistics: the mean, median, 95% quantile and max Delta E errors. Over all 28 cameras and all 102 illuminants we then average these summary statistics. The results are reported in Table 3.1.

Table 3.1 Mean, median, 95% quantile and max CIELAB ΔE color correction performance averaged over 28 cameras and 102 lights; an asterisk marks the leading algorithm per column

Algorithm   Mean    Median   95% quantile   Max
LsqCC       1.93    1.18     5.96           22.63
LabCC       1.85    1.36     5.00           14.31*
RpCC        1.49*   0.96*    4.35*          17.15

We draw three conclusions from Table 3.1. First, simple least-squares works surprisingly well. Second, if we are minimizing a perceptual error like CIELAB, the non-linear minimization LabCC can deliver a benefit, and this is quite significant for the 95% quantile and max error. Lastly, additional terms (here in RpCC) can further and significantly drive down the error.

3.2.2.6 Other Methods

Extensions to linear color correction which do scale with intensity have been a recent focus of development in color correction research. Andersen [28] divided chromaticity space, centered at the white point, into k unbounded triangular regions. Per region, the chromaticities at the boundary and the white point uniquely defined a $3 \times 3$ correction matrix. Further, by construction, the method implements a continuous color mapping. The method is extended in [29] to allow the bounding chromaticities to be chosen through optimization. The idea of optimizing chromaticity and intensity separately is also studied in [30], where the chromaticity mapping is implemented as a general lookup table.

R. Bala et al.

3.2.3 Illuminant Estimation

In Eq. (3.6), image formation is shown to be a function of light, surface and sensor. Interestingly, the roles played by the spectral power distribution of the light $E(\lambda)$ and the spectral reflectance function $S(\lambda)$ are equally important. If we change the light from being whitish to bluish then all the recorded sensor responses become more blue. In color constancy we would like to estimate the color of the light and then map the RGBs to corresponding values for a known reference (typically white) light. An example of color constancy processing is shown in Fig. 3.3. The flower on the left looks too bluish. After color constancy processing (estimating and then removing the color bias), the image looks much more natural.

3.2.3.1 The RGB Model of Image Formation

Equation (3.6) is quite complex. It would, for example, be impossible for an illuminant estimation algorithm to recover the full spectrum $E(\lambda)$. Thankfully, the complexity of Eq. (3.6) can, for most practical purposes, be simplified and the RGB model of image formation used. Let

\[ p_k^S = \int_\omega S(\lambda) R_k(\lambda)\, d\lambda, \qquad p_k^E = \int_\omega E(\lambda) R_k(\lambda)\, d\lambda \tag{3.13} \]

Then, according to the RGB model of image formation:

\[ p_k^{E,S} = p_k^E\, p_k^S \tag{3.14} \]

That is, we can use Eqs. (3.13) and (3.14) instead of the more complex Eq. (3.6). Remarkably, Eq. (3.14), with certain caveats, generally holds [31]. An important interpretation of $p_k^S$ is that it is the color of the surface viewed under a white uniform light $E(\lambda) = 1$. Subject to this observation, color constancy can be thought of as mapping the RGBs measured in an image back to a reference white lighting condition. That is, the color constancy problem involves solving for $p_k^S$.

Fig. 3.3 Example of color constancy processing. The image on the left is bluish (biased by the color of the prevailing illuminant); the right image, after color constancy processing, has the color bias removed. Images taken from http://en.wikipedia.org/wiki/ Colorbalance


We rewrite Eq. (3.14) in vector form:

\[ \mathbf{p}^{E,S} = \mathbf{p}^E\, \mathbf{p}^S \tag{3.15} \]

Note that Eq. (3.15) means that we multiply the RGB vectors $\mathbf{p}^E$ and $\mathbf{p}^S$ component-wise (even though this operation is not part of conventional linear algebra). If $\hat{\mathbf{p}}^E$ denotes the estimate of the illuminant made by some algorithm, then we recover the color of the surface (remove the color bias due to the color of the prevailing light) by calculating:

\[ \frac{\mathbf{p}^{E,S}}{\hat{\mathbf{p}}^E} \approx \mathbf{p}^S \tag{3.16} \]

where again the division of vectors is defined component-wise. In Eq. (3.16), how good the approximation is depends on how well the illuminant is estimated.
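The component-wise product and division of Eqs. (3.15) and (3.16) can be sketched in a few lines. The surface and illuminant values below are made up for illustration, and `remove_color_cast` is a hypothetical helper name.

```python
def remove_color_cast(pixels, illuminant_estimate):
    """Divide every pixel component-wise by the illuminant estimate, as in Eq. (3.16)."""
    return [tuple(c / e for c, e in zip(px, illuminant_estimate)) for px in pixels]

# Hypothetical surface colors under white light and a bluish illuminant.
surfaces = [(0.5, 0.5, 0.5), (0.2, 0.6, 0.4)]
light = (0.6, 0.8, 1.0)

# RGB model of image formation, Eq. (3.15): component-wise product.
observed = [tuple(s * e for s, e in zip(px, light)) for px in surfaces]

# With a perfect illuminant estimate the surface colors are recovered.
recovered = remove_color_cast(observed, light)
```

In practice the estimate is only known up to scale and is imperfect, so the recovered colors are approximate.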

3.2.3.2 Moment-Based Illuminant Estimation

Under the RGB model of image formation, the goal of illuminant estimation algorithms is to ‘solve for’ $\mathbf{p}^E$. The majority of illuminant estimation algorithms treat the estimation problem statistically. That is, we attempt to estimate the RGB of the illuminant by calculating a statistical moment. For an N-pixel image, we might calculate

\[ \hat{\mathbf{p}}^E = \mathrm{moment}(\{\mathbf{p}^{E,S_1}, \mathbf{p}^{E,S_2}, \ldots, \mathbf{p}^{E,S_N}\}) \tag{3.17} \]

Moments that appear in the literature include the mean [32], the per-segment mean [33] and the per-channel maximum. Recently, it has been proposed that the input to the estimation process might not be the RGBs themselves but rather the RGBs after linear filtering, where the filtering typically calculates edge-type information. The majority of the statistical moments previously proposed as illuminant estimates can be summarised in a single equation [34]:

\[ \left( \int \left| \frac{\partial^n \mathbf{p}^{\sigma}(x)}{\partial x^n} \right|^{p} dx \right)^{1/p} = k\, \hat{\mathbf{p}}^E_{n,p,\sigma} \tag{3.18} \]

Here, we have an RGB color image where the camera response at image location $x$ is smoothed with a Gaussian averaging filter with standard deviation $\sigma$ pixels to produce the 3-vector $\mathbf{p}^{\sigma}(x)$. The smoothed image is differentiated with an order-$n$ differential operator (order 0 means no differentiation). We then take the absolute Minkowski $p$-norm average [35] over the whole image. This results in a single illuminant estimate $\hat{\mathbf{p}}^E$ (we use the hat to denote an estimate). Note that $k$ is an unknown scalar, drawing attention to the fact that it is not possible to recover the true magnitude of the prevailing illuminant. A simple extension to Eq. (3.18) is to allow arbitrary linear filters to be used (rather than differential operators) [36].
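A minimal sketch of the simplest case of Eq. (3.18) follows, assuming no smoothing and no differentiation ($n = 0$) and a discrete average over pixels in place of the integral. The function name and pixel values are hypothetical. Since the magnitude $k$ cannot be recovered, the estimate is returned as a unit vector.

```python
def minkowski_estimate(pixels, p=6.0):
    """Per-channel Minkowski p-norm average of the pixels, normalized to unit length."""
    n = len(pixels)
    est = []
    for ch in range(3):
        mean_p = sum(abs(px[ch]) ** p for px in pixels) / n
        est.append(mean_p ** (1.0 / p))
    norm = sum(e * e for e in est) ** 0.5
    return [e / norm for e in est]

# Hypothetical two-pixel image.
image = [(0.2, 0.4, 0.4), (0.6, 0.2, 0.4)]

mean_based = minkowski_estimate(image, p=1.0)    # p = 1: per-channel mean
max_like = minkowski_estimate(image, p=200.0)    # large p: approaches per-channel maximum
```

With `p = 1` the moment is the per-channel mean, and as `p` grows it approaches the per-channel maximum; intermediate values interpolate between these two moments.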

3.2.3.3 Experiments

A common way of evaluating the performance of an illuminant estimation algorithm is to calculate the angle between the estimated illuminant RGB and the RGB of the actual light (the RGB measured for a white surface in the same scene). Then over a dataset of images we can calculate an average performance statistic. In the first 5 rows of Table 3.2 we report the mean and median angular recovery error for simple moment-type algorithms (expressible by Eq. (3.18)) running on the 2,000-image NUS image set [37] (see [38] for details of how these statistics were derived).

Table 3.2 Mean and median angular error for the NUS [37] data set

Algorithm              Mean    Median
White-Patch            10.6    10.6
Gray-World             4.14    3.20
Shades-of-Gray         3.40    2.57
2nd-order Gray-Edge    3.20    2.26
1st-order Gray-Edge    3.20    2.22
Deep-Learning-1        –       1.77
Deep-Learning-2        2.24    1.46
Corrected-Moment       2.53    1.78
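The angular recovery error can be sketched as follows. This is an illustrative helper, assuming the error is reported in degrees and that both RGB vectors are non-zero; the function name is hypothetical.

```python
import math

def angular_error_deg(estimate, ground_truth):
    """Angle in degrees between an estimated and a measured illuminant RGB."""
    dot = sum(a * b for a, b in zip(estimate, ground_truth))
    norm_e = math.sqrt(sum(a * a for a in estimate))
    norm_g = math.sqrt(sum(b * b for b in ground_truth))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_e * norm_g)))
    return math.degrees(math.acos(cos_angle))

# The error ignores magnitude: a scaled copy of the true light has (near) zero error.
same_direction = angular_error_deg((0.3, 0.4, 0.5), (0.6, 0.8, 1.0))
orthogonal = angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Because the measure depends only on direction, it is consistent with the unknown scalar in the illuminant estimate.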


The best statistic currently reported in the literature is for a deep learning approach. However, it is interesting to compare the algorithms. In the 1st-order Gray-Edge approach we differentiate the image and then calculate a simple statistical moment (a Minkowski average of the absolute value of the derivatives). An oversimplified interpretation of the deep learning approach [39] is that the features (the linear convolutions) are learned as part of an overall optimization. In fact, features are learnt to make local (patch-based) illuminant estimates using a convolutional neural network. Then a second ‘regressor’ network is trained to ‘pool’ the local estimates. The approach of [40] is a yet more complex network and demonstrates that modest improvements can be made, albeit at the cost of greater complexity.

3.2.3.4 Extending Moment-Based Illuminant Estimation

Computer vision is in thrall to deep learning just now. This is understandable. First, deep learning for many tasks simply delivers the best results. Second, computer vision tasks often have a simple bottom-up approach to problem solving, where image features are extracted and grouped, and then hypotheses and decisions are made. Deep learning architectures are designed to mimic and deliver optimal processing for problems posed in this way. Here, we take a slightly heretical stance and wonder whether, for illuminant estimation, all this complexity is necessary. What if the simple estimates made by moment-based illuminant estimation algorithms were themselves biased? And what if these biases could be corrected in a simple way? In Eq. (3.19) below we propose ‘correcting’ an illuminant estimate using a simple $3 \times 3$ matrix:

\[ \hat{\mathbf{p}}^{E,\mathrm{corrected}} = C \hat{\mathbf{p}}^E \tag{3.19} \]

Interestingly, the matrix $C$ is found via quite a novel (non-linear) optimisation. Let us assume we have the correct answer, the RGB of the light, for a set of $N$ images. We place each RGB in a row of an $N \times 3$ light matrix $L$. For each of the corresponding images we calculate the moment-type estimate and place it in the corresponding row of an $N \times 3$ matrix $P$. In [41] we solve for $C$ by minimizing:

\[ \min_{C,\, d_i} \sum_{i=1}^{N} \| d_i P_i C - L_i \|^2 \tag{3.20} \]

where $d_i$ is a scalar, $P_i$ is the $i$th row of $P$ and $L_i$ is the $i$th row of $L$. Notice that extra degrees of freedom are allowed in that the rows of $P$ are scaled before applying the correction matrix. These scalars are used because the magnitude of the estimate may not be the same as the magnitude of the ground truth. In [41] it is shown that incorporating per-estimate scalars impacts significantly on the matrix $C$ that is recovered (and on the illuminant estimation performance achievable by Eq. (3.19)). In [41] a simple alternating least-squares solution strategy is developed to solve for $C$ in Eq. (3.19). Below, $D$ is the $N \times N$ diagonal matrix whose $i$th diagonal component is $d_i$.

1. Initialise: $D^0 = I$ (the identity matrix) and $i = 1$.
2. $C^i = [D^{i-1} P]^{+} L$ (where $^{+}$ denotes the ‘pre-multiplying’ pseudo-inverse $A^{+} = [A^t A]^{-1} A^t$).
3. $D^i_{jj} = L_j [P_j C^i]^{+}$ (here we find the best scalar using the per-row ‘post-multiplying’ pseudo-inverse $[v^t]^{+} = v / (v^t v)$).
4. $i = i + 1$; go to Step 2 until convergence.

While this solution strategy is simple, it does not guarantee a globally optimal solution. However, it empirically appears to work well (and convergence is guaranteed). The overall best results that can be obtained by correcting a moment-based estimate are shown in the last row of Table 3.2. To obtain this result, the first derivative in Eq. (3.18) is used, the images are not smoothed, and $p = 5$. The results are surprisingly compelling. They are as good as the 1st and competitive with the 2nd deep learning method. In contrast, the complexity of the corrected moment approach is orders of magnitude less than that of the deep learning approaches. Significantly, the corrected moment approach could easily be integrated into a camera processing pipeline.
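The alternating least-squares steps above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code of [41]: the function names, the stopping tolerance and the small Gauss-Jordan solver are choices made here, and rows of `P` are assumed never to map to the zero vector.

```python
def solve3(A, B):
    """Solve A X = B for a 3x3 matrix A by Gauss-Jordan elimination with pivoting."""
    M = [list(A[r]) + list(B[r]) for r in range(3)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(3):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[3:] for row in M]

def row_times(v, C):
    """Row vector v (length 3) times the 3x3 matrix C."""
    return [sum(v[k] * C[k][j] for k in range(3)) for j in range(3)]

def residual(P, L, C, d):
    """Objective of Eq. (3.20): sum_i ||d_i P_i C - L_i||^2."""
    return sum((d[i] * x - y) ** 2
               for i in range(len(P))
               for x, y in zip(row_times(P[i], C), L[i]))

def fit_correction(P, L, n_iter=100, tol=1e-12):
    """Alternate between solving for C (Step 2) and the scalars d_i (Step 3)."""
    N = len(P)
    d = [1.0] * N                      # Step 1: D = I
    prev = float("inf")
    for _ in range(n_iter):
        # Step 2: least squares for C given D, via the normal equations.
        A = [[d[i] * x for x in P[i]] for i in range(N)]
        AtA = [[sum(A[i][r] * A[i][c] for i in range(N)) for c in range(3)] for r in range(3)]
        AtL = [[sum(A[i][r] * L[i][c] for i in range(N)) for c in range(3)] for r in range(3)]
        C = solve3(AtA, AtL)
        # Step 3: best per-row scalar given C.
        for i in range(N):
            v = row_times(P[i], C)
            d[i] = sum(l * x for l, x in zip(L[i], v)) / sum(x * x for x in v)
        err = residual(P, L, C, d)
        if prev - err < tol:           # Step 4: stop when the objective stalls
            break
        prev = err
    return C, d, err
```

Each half-step minimizes the objective over one block of variables with the other held fixed, so the objective never increases; this is why the iteration converges, although not necessarily to a global optimum.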


3.2.3.5 Other Methods

Wider surveys of color constancy, which include reviews of the physics-based approaches [42, 43] not considered here, can be found in [26, 44, 45]. Equally, our algorithm is developed for ‘general’ scenes and so does not consider approaches predicated on identifying known objects (e.g. faces) in images [46, 47]. However, despite the large size of the field and the many and diverse insights applied to solving the illuminant estimation problem, the results shown in Table 3.2 are state of the art (especially the last 3 rows).

3.3 Color Display

3.3.1 Introduction

Modern displays can be classified broadly into transmissive displays and emissive displays [48] based on how color light is produced. A transmissive display filters light from a white light source. The best known example of a transmissive display is a liquid crystal display (LCD), which produces an image by passing light from the backlight source through a liquid-crystal material that can be aligned to either block or transmit the light. On the other hand, in an emissive display, phosphors convert electron beams or ultraviolet light at each pixel into visible light, thus directly producing an image on the screen. Cathode-ray tubes (CRTs), plasma display panels (PDPs), organic light-emitting diode (OLED) displays, and field emissive displays (FEDs) are examples of emissive displays [48]. Both transmissive and emissive displays conform to an additive (and for most practical purposes linear) color mixing model. This permits a simple ($3 \times 3$ matrix) form for the display characterization transform that was utilized for camera color correction in the previous section. An important issue that arises particularly for mobile displays is power consumption. Indeed, the display is the most power-consuming component of a mobile device [49], and its impact on battery lifespan is a major concern in the mobile consumer market. Therefore power reduction has become a focal issue in deriving display color processing algorithms [48, 50, 51]. For example, in a transmissive display, based on the observation that power consumption is mainly affected by the backlight intensity, algorithms aim to dim the backlight while increasing pixel values to preserve the same level of perceived quality. In Sect. 3.3.3, a constrained optimization-based RGB-to-RGBW conversion is presented for emissive displays that explicitly accounts for power consumption.

3.3.2 Display Characterization

As previously stated, characterization is a process to derive transformations between device-dependent and device-independent (colorimetric) representations. Display characterization assumes that the individual R, G, B channels are calibrated so that digital values are linearly related to emitted luminance. Different calibration methods are required for the different display technologies. For example, for CRTs, the relationship between the R, G, B input digital values and the output displayed luminance is usually modeled by the power-law function [52–54]. The parameters for the model are obtained by measuring a series of different primary color values and then fitting the measurements to the model. While the response of an LCD is often better modeled as a sigmoidal function [55–57], many LCD manufacturers provide correction tables so that the LCD response mimics that of a CRT, facilitating a power-law calibration. Another way to calibrate the nonlinear response curve of an LCD is to directly interpolate from luminance measurements for each primary. OLED calibration can be derived with a model-based approach or an empirical approach, similar to CRTs and LCDs [58, 59].

In most color displays, the RGB subpixels are independently driven and controlled, and the resulting emitted light is combined to produce the final color light. This property of the display is referred to as channel independence, which allows a complete characterization of the display from separate characterization of each color channel. Furthermore, if the spectral radiance for each channel is spectrally nonselective, i.e., the spectral radiance is scaled by the driving device signal (e.g., voltage) equally across all wavelengths, the chromaticity constancy assumption is also said to apply. Assuming channel independence and chromaticity constancy, the relationship between input device RGB values and device-independent tristimulus XYZ values is given by a linear model. Specifically, let $D'_R$, $D'_G$, and $D'_B$ denote the linearized R, G, and B values, respectively, and $\mathbf{A}_{RGB}$ the matrix to convert RGB colors into the resulting tristimulus values in CIEXYZ. Then, the conversion of the RGB color vector $\mathbf{d}'_{RGB} = [D'_R, D'_G, D'_B]^T$ can be written as a linear equation, given by

\[ \underbrace{\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}}_{\mathbf{c}} = \underbrace{\begin{bmatrix} X_R & X_G & X_B \\ Y_R & Y_G & Y_B \\ Z_R & Z_G & Z_B \end{bmatrix}}_{\mathbf{A}_{RGB}} \underbrace{\begin{bmatrix} D'_R \\ D'_G \\ D'_B \end{bmatrix}}_{\mathbf{d}'_{RGB}}, \tag{3.21} \]

where $X_R$, $Y_R$, and $Z_R$ are the tristimulus values of the red channel at its maximum intensity, and corresponding definitions apply for the green and blue channels. We can rewrite the conversion in (3.21) and its inversion in vector notation as

\[ \mathbf{c} = \mathbf{A}_{RGB}\, \mathbf{d}'_{RGB}, \tag{3.22} \]

\[ \mathbf{d}'_{RGB} = \mathbf{A}_{RGB}^{-1}\, \mathbf{c}, \tag{3.23} \]

respectively. The matrix $\mathbf{A}_{RGB} \in \mathbb{R}^{3 \times 3}$ in (3.21) can be obtained via data fitting or interpolation on the characterization samples. An effective and widely used approach is linear least-squares regression. Specifically, given a training set of $N$ samples $(\mathbf{d}'_{RGB,i}, \mathbf{c}_i)$, $i = 1, \ldots, N$, the optimal $\mathbf{A}_{RGB}$ can be obtained by minimizing the mean squared error of the linear approximation in (3.21) over the training samples, i.e.,

\[ \mathbf{A}_{RGB} = \arg\min_{\mathbf{A}_{RGB}} \left\{ \frac{1}{N} \sum_{i=1}^{N} \| \mathbf{c}_i - \mathbf{A}_{RGB}\, \mathbf{d}'_{RGB,i} \|^2 \right\}. \tag{3.24} \]
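The least-squares fit in (3.24) has a closed form: setting the gradient to zero gives $\mathbf{A}_{RGB} = (\sum_i \mathbf{c}_i \mathbf{d}'^T_{RGB,i})(\sum_i \mathbf{d}'_{RGB,i} \mathbf{d}'^T_{RGB,i})^{-1}$. The sketch below applies this to synthetic, noise-free samples. As a stand-in for measured data it uses the standard linear sRGB-to-XYZ (D65) matrix; the function names and sample values are hypothetical.

```python
def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]

def fit_display_matrix(d_samples, c_samples):
    """Least-squares A minimizing sum_i ||c_i - A d_i||^2 (normal equations)."""
    CD = [[sum(c[r] * d[k] for c, d in zip(c_samples, d_samples)) for k in range(3)]
          for r in range(3)]
    DD = [[sum(d[r] * d[k] for d in d_samples) for k in range(3)] for r in range(3)]
    DD_inv = inv3(DD)
    return [[sum(CD[r][k] * DD_inv[k][j] for k in range(3)) for j in range(3)]
            for r in range(3)]

# Stand-in "true" display matrix: linear sRGB to CIEXYZ (D65).
A_true = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]

# Hypothetical linearized RGB training samples and their simulated XYZ measurements.
d_train = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0.5), (0.2, 0.7, 0.1)]
c_train = [[sum(a * x for a, x in zip(row, d)) for row in A_true] for d in d_train]

A_fit = fit_display_matrix(d_train, c_train)
```

With real measurements the samples contain noise, so `A_fit` would only approximate the underlying matrix; here the data are exact and the fit recovers `A_true` up to rounding.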

For further details on display characterization models and procedures, the reader is referred to previous work applicable to CRTs [52–54], LCDs [55–57], and OLEDs [58, 59].

Recent advancements in display technology have significantly increased the resolution of display devices, thus increasing display density in terms of pixels per unit area. However, this reduces the light efficiency of the display, since the aperture ratio is decreased. A recent effort to improve light efficiency is to add white (W) subpixels to the conventional RGB display, resulting in an RGBW display [51, 60, 61]. The improved light efficiency of RGBW increases the overall effective luminance output of a display. Characterization techniques for the conventional RGB display can be extended to the RGBW display in a straightforward manner. Specifically, the relationship between the linearized RGBW color vector $\mathbf{d}'_{RGBW} = [D'_R, D'_G, D'_B, D'_W]^T$ and the corresponding tristimulus values is given by

\[ \underbrace{\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}}_{\mathbf{c}} = \underbrace{\beta \begin{bmatrix} X_R & X_G & X_B & X_W \\ Y_R & Y_G & Y_B & Y_W \\ Z_R & Z_G & Z_B & Z_W \end{bmatrix}}_{\mathbf{A}_{RGBW}} \underbrace{\begin{bmatrix} D'_R \\ D'_G \\ D'_B \\ D'_W \end{bmatrix}}_{\mathbf{d}'_{RGBW}}, \tag{3.25} \]

where $X_W$, $Y_W$, and $Z_W$ are the tristimulus coordinates of W at maximum intensity, and $\beta$ is a scaling parameter that accounts for the ratio of subpixel sizes. In matrix-vector notation, the conversion in (3.25) becomes

\[ \mathbf{c} = \mathbf{A}_{RGBW}\, \mathbf{d}'_{RGBW}, \tag{3.26} \]

where $\mathbf{A}_{RGBW} \in \mathbb{R}^{3 \times 4}$. The optimal $\mathbf{A}_{RGBW}$ that minimizes the mean squared error can be obtained with least-squares regression as in (3.24). Note that, in the system image path of Fig. 3.1, we desire the inverse of (3.26), i.e., a transformation from images in colorimetric coordinates $\mathbf{c}$ to device coordinates $\mathbf{d}'_{RGBW}$. Unfortunately, the system in (3.26) is not invertible, in that many different $\mathbf{d}'_{RGBW}$ can potentially give rise to the same colorimetric value. One way to address the problem is to derive a separate RGB-to-RGBW transformation that employs certain rules to ensure that distinct RGB values produce distinct colors $\mathbf{c}$. This produces a new pseudo-display that internally converts RGB to RGBW for light efficiency, but that externally exposes to the system a standard RGB interface, and can thus be characterized and color-corrected using the standard method of (3.21). The derivation of the RGB-to-RGBW transform is addressed next.
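The non-invertibility of (3.26) is easy to see numerically. The sketch below builds a hypothetical $3 \times 4$ matrix by taking the standard linear sRGB-to-XYZ (D65) matrix and appending a white column equal to the sum of the R, G and B columns (that is, assuming the W subpixel emits the display white and $\beta = 1$). Three different RGBW drive vectors then yield the same tristimulus value.

```python
# Standard linear sRGB to CIEXYZ (D65) matrix, used only as a stand-in.
A_RGB = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]

# Hypothetical RGBW matrix: white column = R + G + B columns (beta = 1 assumed).
A_RGBW = [row + [sum(row)] for row in A_RGB]

def to_xyz(A, d):
    """Tristimulus values c = A d for a drive vector d."""
    return [sum(a * x for a, x in zip(row, d)) for row in A]

gray_rgb_only = to_xyz(A_RGBW, [0.5, 0.5, 0.5, 0.0])
gray_white_only = to_xyz(A_RGBW, [0.0, 0.0, 0.0, 0.5])
gray_mixed = to_xyz(A_RGBW, [0.25, 0.25, 0.25, 0.25])
```

All three results coincide, so an additional rule (or an optimization criterion) is needed to pick one RGBW vector per color.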

3.3.3 RGB-to-RGBW Conversion

The image quality of an RGBW display depends strongly on the RGB-to-RGBW conversion. Thus, this topic has elicited much interest in the display community, and several algorithms have been proposed [51, 60–65]. As Wang et al. studied in [63], most conventional RGB-to-RGBW conversion algorithms [60, 62] are heuristically derived, and consist of three steps. First, the white pixel value is extracted as a function of the minimum and maximum intensities among the input R, G, and B values. Specifically, let $M = \max(D'_{R,i}, D'_{G,i}, D'_{B,i})$ and $m = \min(D'_{R,i}, D'_{G,i}, D'_{B,i})$, where $D'_{R,i}$, $D'_{G,i}$, and $D'_{B,i}$ are the normalized input light intensities of the RGB colors. Conventional algorithms can all be expressed in one of the forms below, according to how they obtain the output light intensity of the white channel $D'_{W,o}$:

\[ D'_{W,o} = m, \tag{3.27} \]

\[ D'_{W,o} = m^2, \tag{3.28} \]

\[ D'_{W,o} = -m^3 + m^2 + m, \tag{3.29} \]

\[ D'_{W,o} = \begin{cases} \dfrac{mM}{M - m}, & \text{if } \dfrac{m}{M} < 0.5, \\ M, & \text{otherwise.} \end{cases} \tag{3.30} \]

Next, pixel gains are obtained by $K = \dfrac{D'_{W,o} + M}{M}$, so as to increase the display intensities. Finally, the pixel gains are applied to the input values, and then the output RGBW values are generated by subtracting the white value as

\[ \begin{bmatrix} D'_{R,o} \\ D'_{G,o} \\ D'_{B,o} \end{bmatrix} = K \cdot \begin{bmatrix} D'_{R,i} \\ D'_{G,i} \\ D'_{B,i} \end{bmatrix} - \begin{bmatrix} D'_{W,o} \\ D'_{W,o} \\ D'_{W,o} \end{bmatrix}, \tag{3.31} \]

where $D'_{R,o}$, $D'_{G,o}$, and $D'_{B,o}$ denote the output light intensities of the RGBW color. Recent approaches have extended the procedure to develop a scene-adaptive RGB-to-RGBW conversion [61] or a color-distortion-minimization-based conversion [65].

The primary goal in the aforementioned conversion methods [60–63] is to increase displayed brightness while preserving the hue and saturation. Whereas the functions are simple to implement, they neither consider perceptual color distortion nor explicitly consider the power consumed in displaying the image. We describe an optimization-based power-constrained RGB-to-RGBW conversion algorithm for emissive displays recently developed by Lee and Monga [51]. This method measures the perceived color distortion using a color difference model in a perceptually uniform color space, and computes the power consumption for displaying an RGBW pixel on an emissive display. The optimization problem is formulated to minimize the color distortion subject to a constraint on power consumption, and an efficient solution is provided.
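The three-step heuristic conversion can be sketched as below. This is an illustration under two stated assumptions about formulas that are hard to read in the source: the cubic white extraction is taken as $-m^3 + m^2 + m$, and the gain is taken as $K = (D'_{W,o} + M)/M$, which keeps the largest output channel equal to $M$. The function names and the handling of a black pixel are choices made here.

```python
def white_min(m, M):
    """White extraction of Eq. (3.27)."""
    return m

def white_square(m, M):
    """White extraction of Eq. (3.28)."""
    return m * m

def white_cubic(m, M):
    """White extraction of Eq. (3.29), assuming a leading minus sign."""
    return -m ** 3 + m ** 2 + m

def white_ratio(m, M):
    """White extraction of Eq. (3.30)."""
    if M > 0 and m / M < 0.5:
        return m * M / (M - m)
    return M

def rgb_to_rgbw(rgb, white_fn=white_min):
    """Heuristic conversion of normalized linear RGB to RGBW, following Eq. (3.31)."""
    M, m = max(rgb), min(rgb)
    if M == 0:
        return (0.0, 0.0, 0.0, 0.0)      # black stays black (gain undefined)
    w = white_fn(m, M)                    # step 1: extract white
    K = (w + M) / M                       # step 2: pixel gain (assumed form)
    r, g, b = (K * c - w for c in rgb)    # step 3: apply gain, subtract white
    return (r, g, b, w)

pixel = (0.8, 0.4, 0.2)                   # hypothetical input
out_min = rgb_to_rgbw(pixel, white_min)
out_ratio = rgb_to_rgbw(pixel, white_ratio)
```

With the assumed gain, the white extraction of Eq. (3.30) drives the smallest RGB output exactly to zero for this pixel, moving as much light as possible to the W subpixel.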

3.3.3.1 Color Distortion Model

RGB and RGBW displays have different gamuts [61, 66]. Therefore, an RGB value and its corresponding RGBW value may show perceptually different colors. A color distortion model is proposed that quantifies the perceptual color difference caused by an RGB-to-RGBW conversion, assuming that the W subpixel emits the same color as the display white point, e.g., $D_{65}$, and that an sRGB display is used. RGB and RGBW colors are converted into a perceptually linear space, so that the perceptual color difference between the input RGB and converted RGBW colors can be measured. To this end, the linearized CIELAB space [67] is employed to approximate the CIELAB uniform color space. First, the RGB color $\mathbf{d}'_{RGB}$ and the RGBW color $\mathbf{d}'_{RGBW}$ are converted to CIEXYZ via (3.22) and (3.26), respectively. Then, the two colors are converted into the linearized CIELAB color space ($Y_y C_x C_z$) about the $D_{65}$ white point [67], given by

\[ Y_y = 116 \frac{Y}{Y_n} - 16, \tag{3.32} \]

\[ C_x = 500 \left( \frac{X}{X_n} - \frac{Y}{Y_n} \right), \tag{3.33} \]

\[ C_z = 200 \left( \frac{Y}{Y_n} - \frac{Z}{Z_n} \right), \tag{3.34} \]

where $(X_n, Y_n, Z_n)$ is the $D_{65}$ white point in the CIEXYZ color space. Since only the perceptual color difference rather than absolute values is considered, the subtraction in (3.32) can be omitted, and the transformation in (3.32)–(3.34) becomes a set of linear equations, denoted by the matrix $\mathbf{T}_{Lab} \in \mathbb{R}^{3 \times 3}$. Then, the conversions from RGB and RGBW colors, respectively, into the linearized CIELAB space are rewritten compactly in vector notation as

\[ \mathbf{T}_{Lab} \mathbf{A}_{RGB}\, \mathbf{d}'_{RGB} = \mathbf{b}'_{RGB}, \tag{3.35} \]

\[ \mathbf{T}_{Lab} \mathbf{A}_{RGBW}\, \mathbf{d}'_{RGBW} = \mathbf{T} \mathbf{d}'_{RGBW}, \tag{3.36} \]

where $\mathbf{b}'_{RGB}$ denotes the color vector in the linearized CIELAB space for the input RGB value $\mathbf{d}'_{RGB}$, and $\mathbf{T}$ is the color space transformation matrix from the RGBW space to the linearized CIELAB space about the $D_{65}$ white point.

Finally, the perceptual color distortion $D_{\mathbf{d}'_{RGB}}(\mathbf{d}'_{RGBW})$ due to the RGB-to-RGBW conversion is defined as the squared Euclidean distance between the input RGB color $\mathbf{d}'_{RGB}$ and the converted RGBW color $\mathbf{d}'_{RGBW}$ in the linearized CIELAB space, given by

\[ D_{\mathbf{d}'_{RGB}}(\mathbf{d}'_{RGBW}) = \| \mathbf{T} \mathbf{d}'_{RGBW} - \mathbf{b}'_{RGB} \|^2. \tag{3.37} \]

Although the linearized transformation in (3.32)–(3.34) expresses the CIELAB color space only approximately, it greatly facilitates the formulation of the color distortion as a quadratic form in (3.37) and subsequently enables a tractable solution via quadratic programming.

3.3.3.2 Power Consumption Model of Emissive Displays

An experimental model of the power consumption $P_{RGB}$ to display a single RGB pixel on an emissive display is given by [68]

\[ P_{RGB} = \omega_0 + \omega_r D'_R + \omega_g D'_G + \omega_b D'_B, \tag{3.38} \]

where $\omega_r$, $\omega_g$, $\omega_b$ are weighting coefficients that express the different characteristics of the R, G, and B subpixels, and the constant $\omega_0$ accounts for static power consumption, which is independent of pixel values. Without loss of generality, the power model in (3.38) can be extended to emissive RGBW displays as

\[ P_{RGBW} = \omega_0 + \omega_r D'_R + \omega_g D'_G + \omega_b D'_B + \omega_w D'_W, \tag{3.39} \]

where $\omega_w$ is the coefficient for the linear white value. Since pixel values are modified to find the optimal conversion, the parameter $\omega_0$ for static power consumption can be ignored. Then, the power consumption for displaying the RGBW color $\mathbf{d}'_{RGBW}$ can be written in vector notation as

\[ P_{RGBW}(\mathbf{d}'_{RGBW}) = \boldsymbol{\omega}^T \mathbf{d}'_{RGBW}, \tag{3.40} \]

where $\boldsymbol{\omega} = [\omega_r, \omega_g, \omega_b, \omega_w]^T$. The weighting coefficient vector $\boldsymbol{\omega}$ has different values for different display technologies. For example, in [69], the weighting ratios are measured as $\omega_r : \omega_g : \omega_b : \omega_w = 1.58 : 1.07 : 5.32 : 1.00$ for a particular emissive RGBW display. It was also experimentally shown with an OLED display that relative power consumption by colors varies significantly from device to device [70]. The linear form of the power consumption in (3.40) arises from the fact that the power consumption of a pixel is a linear function of the current, which is in turn a linear function of the pixel value [71]. This relationship has been experimentally verified in [68, 70].

3.3.3.3 Constrained Optimization Problem

The RGB-to-RGBW conversion involves two competing goals: one is to achieve the optimal conversion by minimizing the color distortion $D_{\mathbf{d}'_{RGB}}(\mathbf{d}'_{RGBW})$ in (3.37), and the other is to reduce the power consumption $P_{RGBW}(\mathbf{d}'_{RGBW})$ in (3.40) for displaying the color $\mathbf{d}'_{RGBW}$. One way to frame the optimization statement is that, given the input pixel value $\mathbf{d}'_{RGB}$, the color distortion $D_{\mathbf{d}'_{RGB}}(\mathbf{d}'_{RGBW})$ should be minimized subject to a constraint on the power consumption $P_{RGBW}(\mathbf{d}'_{RGBW})$. The cost is given by:

\[ J_{\mathbf{d}'_{RGB}}(\mathbf{d}'_{RGBW}) = \| \mathbf{T} \mathbf{d}'_{RGBW} - \mathbf{b}'_{RGB} \|^2 + \lambda\, \boldsymbol{\omega}^T \mathbf{d}'_{RGBW}, \tag{3.41} \]

where $\lambda$ controls the trade-off between the color distortion and the power consumption. When $\lambda = 0$, the perceived color is perfectly maintained by the conversion within the common gamut of RGB and RGBW, with no constraint on power consumption. As $\lambda$ increases, more weight is given to power consumption reduction at the possible cost of introducing perceptual error. As alluded to earlier, the RGB-to-RGBW conversion can be a one-to-many mapping, and so in such cases the second term in (3.41) serves not only as a constraint, but as a regularizer that conditions the problem towards a unique solution. That is, if multiple RGBW solutions exist that minimize perceptual color distortion, the one with the minimum power cost will be selected.

In addition to minimizing $J_{\mathbf{d}'_{RGB}}(\mathbf{d}'_{RGBW})$ in (3.41), the RGBW pixel value $\mathbf{d}'_{RGBW}$ should satisfy a further constraint. The minimum and maximum intensities in $\mathbf{d}'_{RGBW}$ should be bounded by the minimum and maximum values of the displayable range, i.e., 0 and 1, respectively. Therefore, the optimization problem is reformulated as

\[ \begin{aligned} \underset{\mathbf{d}'_{RGBW}}{\text{minimize}} \quad & \| \mathbf{T} \mathbf{d}'_{RGBW} - \mathbf{b}'_{RGB} \|^2 + \lambda\, \boldsymbol{\omega}^T \mathbf{d}'_{RGBW} \\ \text{subject to} \quad & \mathbf{0} \preceq \mathbf{d}'_{RGBW} \preceq \mathbf{1}, \end{aligned} \tag{3.42} \]

where $\mathbf{0}$ and $\mathbf{1}$ are the column vectors all elements of which are 0 and 1, respectively, and $\preceq$ denotes the element-wise inequality between two vectors.

The optimization problem in (3.42) is a well-known quadratic program with inequality constraints, and solvers are readily available that apply numerical algorithms such as the interior-point method [72]. While quadratic programs enjoy the benefit of fast solutions, obtaining the solution to (3.42) for each pixel in an image is unrealistic for real-time conversion and display. Instead, a color lattice or lookup table (LUT) that maps an input RGB color to the output RGBW color is built offline. More specifically, the optimization problem in (3.42) is solved for all possible color entries $\mathbf{d}'_{RGB}$ of the LUT, and the corresponding outputs $\mathbf{d}'_{RGBW}$ are saved. In practice, sparse LUTs with real-time interpolation can be used to facilitate a desirable memory vs computation trade-off. Section 3.4 discusses at length the topic of efficient LUT construction. Using color LUTs, color conversion can be executed in real time with computational complexity $O(N)$, where $N$ is the number of pixels in an image. In addition, since only the LUT needs to be stored in memory, the memory complexity does not scale with the image size and is therefore $O(1)$. Note that, as experimentally shown in [49], since the power consumption in memory is significantly lower than that in display during normal operations, the memory access due to LUTs would not cause additional power overhead compared with the power reduction in display.

Fig. 3.4 The arrangement of virtual pixel configurations for the (a) RGB and (b) RGBW displays
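To make the box-constrained quadratic program (3.42) concrete, the following sketch solves a toy instance with projected gradient descent. This is a deliberately simple substitute for the interior-point solvers mentioned above, not the method of [51]. The matrix `T` and target `b` are hypothetical (an idealized space in which white is the sum of R, G and B), the power weights are the ratios quoted from [69], and the trade-off value is arbitrary.

```python
def solve_box_qp(T, b, omega, lam, n_iter=5000):
    """Minimize ||T d - b||^2 + lam * omega^T d subject to 0 <= d <= 1
    by projected gradient descent (a simple stand-in for a QP solver)."""
    n = len(omega)
    # Step size 1 / (2 ||T||_F^2) is no larger than 1 / Lipschitz constant.
    step = 1.0 / (2.0 * sum(t * t for row in T for t in row))
    d = [0.5] * n
    for _ in range(n_iter):
        resid = [sum(row[j] * d[j] for j in range(n)) - bk for row, bk in zip(T, b)]
        grad = [2.0 * sum(T[k][j] * resid[k] for k in range(len(T))) + lam * omega[j]
                for j in range(n)]
        # Gradient step followed by projection onto the box [0, 1]^n.
        d = [min(1.0, max(0.0, dj - step * gj)) for dj, gj in zip(d, grad)]
    return d

# Hypothetical 3 x 4 transform: identity for R, G, B and white = R + G + B.
T = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
b = [0.8, 0.4, 0.2]                  # hypothetical target color
omega = [1.58, 1.07, 5.32, 1.00]     # power weight ratios quoted from [69]

d_opt = solve_box_qp(T, b, omega, lam=0.01)
```

In this toy instance the solver shifts light from the power-hungry blue subpixel to the white subpixel until blue reaches zero, and the small power penalty moves the remaining channels slightly away from the exact color match.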

3.3.3.4 Simulation We evaluate the performance of RGB-to-RGBW conversion by emulating both RGB and RGBW displays as shown in Fig. 3.4. Specifically, each pixel in the emulated RGB and RGBW displays is a virtual pixel formed by four (2 × 2) real RGB pixels. For the virtual RGB display, the resolution of an input image is doubled using nearest-neighbor interpolation. For the virtual RGBW display, the W subpixel is expressed by a real pixel whose R, G, and B subpixels share the same value. We evaluate the performance on a test set consisting of 24 images from the Kodak Lossless True Color Image Suite [73]. The parameter $\beta$ in (3.25) is fixed to 0.75 in all tests, assuming that the pixel sizes of the RGB and RGBW displays are the same. We also evaluate the performance with different numbers of RGB nodes in the color LUTs: $256^3$, $64^3$, and $16^3$. In the case of a sparse LUT, we employ a uniform lattice with trilinear interpolation. We compare the algorithms using two objective metrics: the color distortion against the ground-truth (input) CIELAB color in terms of the CIE94 $\Delta E$ ($\Delta E_{94}$) metric [54], and the power consumption $P_{RGBW}$ in (3.40).¹

Table 3.3 lists the average performance over the 24 test images. Lee and Monga's optimization-based algorithm [51] significantly outperforms the competing algorithms in terms of $\Delta E_{94}$ at $\lambda = 0$ and 50. Even when $\lambda = 100$, Lee and Monga's algorithm provides comparable or even lower color distortion. Among the competing algorithms, Can and Underwood's algorithm [65] consumes the least power, but it produces higher color distortion because it converts colors in groups. With the exception of $\lambda = 0$, which does not consider power consumption, Lee and Monga's algorithm consumes less power than the algorithms of Wang et al. and Kwon and Kim, and at $\lambda = 100$ its consumption (1.06 to 1.08) is on par with that of Can and Underwood's algorithm (1.07). Moreover, this algorithm achieves greater power savings as $\lambda$ increases. For example, when $\lambda = 100$, more than 34% of power is saved compared with the case $\lambda = 0$. In addition, we note that the average power consumption of the RGB display, $P_{RGB}$, is 1.91 under the same test conditions, which is significantly higher than that of the RGBW display. When the number of nodes in the LUT is reduced to $64^3$ and $16^3$, the color distortion increases at $\lambda = 0$, whereas at $\lambda = 50$ and 100 the distortion decreases at the cost of a moderate increase in power consumption. Therefore, the number of nodes in the LUT can be selected according to the available memory and the desired trade-off between distortion and power consumption.

Figure 3.5 compares the emulation results of the RGB-to-RGBW conversion algorithms on an sRGB test image. The LUT size is $256^3$, and the parameter $\lambda$ in (3.42) is set to 0 to obtain the results of Lee and Monga's algorithm in Fig. 3.5h. Figure 3.5 also shows the corresponding color distortion $\Delta E_{94}$ maps. Because it takes the color distortion into account in the optimization, Lee and Monga's algorithm reproduces the input colors more faithfully, with less distortion, than the competing algorithms.

¹ $P_{RGBW}$ is a unit-less measurement for comparing the relative power consumption of different RGBW pixel values. According to an experiment in [49], an OLED display in a mobile device consumes approximately 30–70% of the total power during a web browsing operation. Therefore, if we save 30% of the power $P_{RGBW}$ for the device, at most 9–21% of the total power is saved (with equivalently prolonged battery life).
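The two distortion summaries used in this section (average and 95th percentile $\Delta E_{94}$) can be computed as in the following sketch. It is offered under stated assumptions rather than as the evaluation code behind Table 3.3: the function names are hypothetical, the weighting constants are the common graphic-arts values of the CIE94 formula (the chapter does not say which application constants were used), and the percentile uses the nearest-rank convention.

```python
import math

def delta_e94(lab_ref, lab_test, k_L=1.0, K1=0.045, K2=0.015):
    """CIE94 color difference between two CIELAB triples (L*, a*, b*).

    k_L, K1 and K2 default to the graphic-arts constants (an assumption).
    """
    L1, a1, b1 = lab_ref
    L2, a2, b2 = lab_test
    C1, C2 = math.hypot(a1, b1), math.hypot(a2, b2)   # chroma of each color
    dL, dC = L1 - L2, C1 - C2
    # Squared hue difference; clamped at zero to absorb rounding error.
    dH_sq = max((a1 - a2) ** 2 + (b1 - b2) ** 2 - dC ** 2, 0.0)
    S_C, S_H = 1.0 + K1 * C1, 1.0 + K2 * C1
    return math.sqrt((dL / k_L) ** 2 + (dC / S_C) ** 2 + dH_sq / S_H ** 2)

def summarize(errors, percentile=95):
    """Return (mean, nearest-rank percentile) of a list of color differences."""
    ordered = sorted(errors)
    rank = (percentile * len(ordered) + 99) // 100    # integer ceiling
    return sum(ordered) / len(ordered), ordered[rank - 1]

pairs = [((50.0, 0.0, 0.0), (52.0, 0.0, 0.0)),
         ((50.0, 0.0, 0.0), (50.0, 3.0, 4.0))]
mean_error, p95_error = summarize([delta_e94(ref, test) for ref, test in pairs])
```

For neutral reference colors the chroma weights reduce to one, so the two example pairs give differences of 2 and 5, which makes the sketch easy to check by hand.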

Table 3.3 Comparison of Wang et al.'s algorithms [63] in (3.27)–(3.30), Kwon and Kim's algorithm [61], Can and Underwood's algorithm [65], and Lee and Monga's algorithm [51] in terms of the average color distortion $\Delta E_{94}$, the 95th percentile $\Delta E_{94}$, and the average power consumption $P_{RGBW}$ on the test images. The results of Lee and Monga's algorithm are obtained with three different LUT sizes, i.e., $256^3$, $64^3$, and $16^3$

Algorithm                                   Avg. $\Delta E_{94}$   95% $\Delta E_{94}$   Avg. $P_{RGBW}$
Wang et al. in (3.27)                       4.32                   7.46                  1.34
Wang et al. in (3.28)                       5.40                   7.95                  1.36
Wang et al. in (3.29)                       3.83                   6.71                  1.37
Wang et al. in (3.30)                       3.18                   5.65                  1.37
Kwon and Kim                                5.30                   8.54                  1.28
Can and Underwood                           6.28                   11.84                 1.07
Lee and Monga ($256^3$), $\lambda = 0$      0.43                   0.57                  1.61
Lee and Monga ($256^3$), $\lambda = 50$     1.56                   3.89                  1.13
Lee and Monga ($256^3$), $\lambda = 100$    3.26                   7.49                  1.06
Lee and Monga ($64^3$), $\lambda = 0$       0.44                   0.58                  1.61
Lee and Monga ($64^3$), $\lambda = 50$      1.55                   3.86                  1.13
Lee and Monga ($64^3$), $\lambda = 100$     3.20                   7.38                  1.07
Lee and Monga ($16^3$), $\lambda = 0$       0.61                   1.15                  1.61
Lee and Monga ($16^3$), $\lambda = 50$      1.39                   3.39                  1.14
Lee and Monga ($16^3$), $\lambda = 100$     3.07                   7.20                  1.08

Lee and Monga's optimized RGB-to-RGBW conversion algorithm can control the trade-off between power consumption and color distortion. Figure 3.6 shows the result images and the corresponding color distortion $\Delta E_{94}$ maps of Lee and Monga's algorithm for three different values of $\lambda$. When $\lambda = 0$, the color distortion is minimized and the most faithful results are obtained. As $\lambda$ increases, more power is saved at the expense of increased color distortion. However, we note that the loss of color accuracy between the case without the power constraint ($\lambda = 0$) and the case $\lambda = 100$ is tolerable for many applications. Specifically, the average $\Delta E_{94}$ values for the sRGB test image in Fig. 3.6 are 1.24, 1.93, and 2.71, respectively, while the average power consumptions are 3.51, 3.43, and 3.33, respectively.

3.4 Color Printing

3.4.1 Introduction

Color printing involves a complex optical light-substrate interaction, and thus the linear models just described for cameras and displays unfortunately do not apply here. Referring to Fig. 3.7, as light travels into the colorant layers and the underlying paper substrate, it undergoes one or more of absorption, reflection, and scattering by colorant and substrate particles. These optical mechanisms are further complicated by the fact that most printers use halftone dot patterns to provide multiple gray levels from binary colorant placement [2]. The variety of colorants (e.g., toner, ink, dye), substrates (e.g., coated paper, uncoated paper, card, transparency), and technologies for transferring and fastening the colorant to the substrate makes it challenging to develop simple unified models for the printing process. The goal of printer characterization is to quantify the nonlinear relationship between the colorant amounts $\mathbf{d}$ placed on the substrate and the color $\mathbf{c}$ of the resulting print, as measured in a device-independent coordinate system. This relationship takes the form of a forward transform $f: \mathbf{d} \to \mathbf{c}$ mapping colorant to color coordinates, and its inverse $g = f^{-1}: \mathbf{c} \to \mathbf{d}$ mapping color to colorant space. The latter is sometimes termed color correction, as it is the operation most frequently applied in the image path of Fig. 3.1. To derive these transforms, a target of color patches with known device colorant amounts is printed, and each patch is measured using a colorimeter or spectrophotometer to acquire device-independent color coordinates. Methods for target generation and measurement are closely related to the underlying model assumed for the printing process. The reader is referred to [2] for details. The result is a training dataset of sample pairs $T = [\mathbf{c}_i \in \mathbb{R}^n, \mathbf{d}_i \in \mathbb{R}^m],\ i = 1, \ldots, N_T$, wherein $\mathbf{d}_i$ denotes device-dependent coordinates in $m$-dimensional space and $\mathbf{c}_i$ belongs to an $n$-dimensional device-independent color space.

Fig. 3.5 RGB-to-RGBW conversion results on the sRGB test image. The input RGB image in (a) is converted by Wang et al.'s algorithms [63] in (b)–(e), Kwon and Kim's algorithm [61] in (f), Can and Underwood's algorithm [65] in (g), and Lee and Monga's algorithm in (h). The top rows compare the emulation results on the RGBW display, and the bottom rows show the corresponding color distortion $\Delta E_{94}$ maps. For Lee and Monga's algorithm in (h), $\lambda = 0$ and the number of nodes in the LUT is $256^3$


Fig. 3.6 The result images of Lee and Monga's algorithm for (a) $\lambda = 0$, (b) $\lambda = 50$, and (c) $\lambda = 100$. The first row corresponds to the emulation results, and the second row shows the corresponding color distortion $\Delta E_{94}$ maps

Fig. 3.7 Light-paper interaction in color printing

Commonly $m = 4$ for CMYK printing and $n = 3$ for colorimetric measurements such as CIEXYZ and CIELAB. Note that $T$ can be a direct outcome of target measurements, or an indirect output from a printer model that is derived from the target measurements [2]. The next step is to derive the transforms $f$ and $g$ from $T$. In general, these functions take on complex nonlinear mathematical forms in order to adequately model the printing process, and can therefore be computationally prohibitive for real-time evaluation on large images. The common solution to this problem is to pre-compute the transformations on a sparse lattice structure, and to employ a fast interpolation technique for real-time evaluation on large volumes of image data. We next formally introduce the multidimensional lattice structure for printer characterization.


Without loss of generality, we describe lattice optimization for the inverse (color correction) function $g$. Examples will be given in 2 and 3 dimensions, bearing in mind that generalization to $n$ dimensions is conceptually straightforward.

3.4.2 Lattice-Based Printer Characterization

A lattice comprises a set of node vertices $V = [\mathbf{a}_i \in \mathbb{R}^n, \mathbf{b}_i \in \mathbb{R}^m],\ i = 1, \ldots, N_V$. The node locations $\mathbf{a}_i$ partition the $n$-dimensional domain of $g$ into small subvolumes, and the corresponding node values $\mathbf{b}_i$ are approximations of $g$ evaluated at the $\mathbf{a}_i$. Figure 3.8 illustrates two lattice structures for the case of $n = 2$, with training samples superimposed. The figure on the left is an example of a regular lattice structure wherein the $\mathbf{a}_i$ are formed by Cartesian products of scalar node levels along each dimension, and the subvolumes are rectangles [74]. The inter-node spacing need not be constant along any given dimension or across dimensions. The figure on the right shows an irregular lattice wherein the subvolumes in $V$ are triangles. Note that many triangulations are possible, the most widely adopted method being Delaunay triangulation [75]. In $n$ dimensions, the corresponding subvolumes for the two lattice structures are, respectively, hyper-rectangles and simplices.

The problem of lattice design boils down to selecting $(\mathbf{a}_i, \mathbf{b}_i)$ to optimize the approximation at the training samples. Figure 3.9 shows a simple 1-D illustration of the effect of node locations and values on approximation error (the area of the shaded region in each graph) for a linear interpolation geometry. We note from the top row that approximation error is reduced when the $\mathbf{a}_i$ are more densely packed in regions of higher transform curvature. This intuition has been exploited in [76, 77] and will be revisited via a formal optimization method in Sect. 3.4.2.3. The left column of graphs illustrates the effect of node values $\mathbf{b}_i$ on transform accuracy. In the top left figure the $\mathbf{b}_i$ coincide exactly with the function value at the node locations. In the bottom left figure, the $\mathbf{b}_i$ depart from the function at locations $\mathbf{a}_i$ in order to produce a lower overall average error. This has been quantitatively demonstrated in [79]. The lower right figure shows how joint optimization of node location and value can further reduce approximation error. True joint optimization is an NP-hard problem; an iterative convex approximation will be presented later in this section. Note finally that the accuracy of any lattice approximation technique can be trivially improved by increasing the number of nodes throughout the domain of the function $g(\cdot)$. However, in practice, limitations in available memory and storage place a constraint on lattice size.

Fig. 3.8 Illustration of (a) regular and (b) simplex 2D lattices. Nodes are shown as circles, and training data are shown as crosses

Fig. 3.9 Effect of node location on transform accuracy: (a) uniform node sampling, with node values coinciding with the true function; (b) node location optimization; (c) node value optimization; (d) joint optimization of node locations and values

Once the lattice has been designed, execution of the lattice transform on $n$-dimensional input color data comprises two steps:

1. Given an input pixel color $\mathbf{c}_i$, the appropriate enclosing $n$-dimensional lattice subvolume is identified for interpolation. In the regular lattice in Fig. 3.8, this is a simple matter of finding the node levels surrounding the input point along each dimension. For the case of the irregular lattice comprising simplex subvolumes, the search for the enclosing simplex involves computing the barycentric coordinates of the input point with respect to the vertices of each simplex in the lattice [78]. A point is within a simplex if all its barycentric coordinates with respect to that simplex lie between 0 and 1.

2. Interpolation is performed among a set of node vertices $\mathbf{a}_j$ from the enclosing subvolume to generate an output value. For the common case of linear interpolation, a set of weights $w_{ij}$ is determined satisfying:

$$\sum_{j=1}^{N_V} w_{ij}\,\mathbf{a}_j = \mathbf{c}_i, \qquad \sum_{j=1}^{N_V} w_{ij} = 1 \tag{3.43}$$

These weights are then used to compute an estimate $\hat{\mathbf{d}}_i$ of the function $g$ at $\mathbf{c}_i$:

$$\hat{\mathbf{d}}_i = \sum_{j=1}^{N_V} w_{ij}\,\mathbf{b}_j \tag{3.44}$$
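The two steps above can be sketched for a 2-D simplex lattice as follows. This is an illustrative sketch only: the function names, the four-node example lattice, and the tolerance `tol` are assumptions made for the example, and the linear search over simplices is the simplest possible strategy rather than an efficient one.

```python
def barycentric_weights(point, triangle):
    """Weights (w1, w2, w3) that sum to 1 and reproduce `point` from the vertices."""
    (x1, y1), (x2, y2), (x3, y3) = triangle
    x, y = point
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return (w1, w2, 1.0 - w1 - w2)

def lattice_transform(point, nodes, values, simplices, tol=1e-12):
    """Step 1: find the enclosing simplex. Step 2: interpolate as in (3.44)."""
    for simplex in simplices:
        weights = barycentric_weights(point, [nodes[j] for j in simplex])
        # Inside the simplex when every barycentric coordinate lies in [0, 1].
        if all(-tol <= w <= 1.0 + tol for w in weights):
            return sum(w * values[j] for w, j in zip(weights, simplex))
    raise ValueError("point lies outside the lattice")

# Hypothetical lattice: unit square split into two triangles.
nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
simplices = [(0, 1, 2), (1, 3, 2)]
# Node values sampled from an affine function, which linear interpolation
# reproduces exactly.
values = [2.0 * x + 3.0 * y + 1.0 for x, y in nodes]
estimate = lattice_transform((0.25, 0.5), nodes, values, simplices)
```

Only the three weights of the enclosing triangle are nonzero, so each row of the full weight set $w_{ij}$ in (3.43) is sparse.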

In general (3.43) is under-determined, so there are many strategies for choosing $w_{ij}$, the particular choice being determined by the interpolation geometry. One popular geometry is $n$-linear interpolation, which is the generalization of 2D bilinear and 3D trilinear interpolation to $n$ dimensions. Another common method is simplex interpolation, a generalization of 2D triangular and 3D tetrahedral interpolation. Other schemes, such as prism and pyramidal interpolation, have also been proposed for 3-D inputs. Weight calculations for common interpolation schemes are described in [6].

In practice, lattice design must meet a certain level of accuracy on specified test datasets. As stated earlier, greater accuracy can be achieved with a larger lattice or with nonlinear interpolation methods. However, practical constraints on storage, memory, and computational complexity translate to trade-offs and constraints in lattice design. For example, regular lattices offer a simpler subvolume search, but the lattice size $N_V$ increases exponentially as a function of data dimensionality. Irregular lattices, when designed properly, can achieve an equivalent accuracy with smaller $N_V$, but at the cost of a more complex subvolume search.

3.4.2.1 Problem Setup We will assume that practical considerations for the given application restrict the lattice to a fixed size of $N_V$ nodes. For notational convenience, and without loss of generality, we assume that the lattice node outputs are scalars (i.e., $m = 1$) collected into an $N_V \times 1$ vector $\mathbf{b}$. We define the error between the training outputs and the lattice approximations thereof:

$$E = \sum_{i=1}^{N_T} (\hat{d}_i - d_i)^2 = \sum_{i=1}^{N_T} \Big( \Big( \sum_{j=1}^{N_V} w_{ij} b_j \Big) - d_i \Big)^2 \tag{3.45}$$

where the second expression is obtained from (3.44). Collect the training outputs into an $N_T \times 1$ vector $\mathbf{d} = [d_1, \ldots, d_{N_T}]$, and define the matrix $\mathbf{W}_a \in [0, 1]^{N_T \times N_V}$ with entries $w_{ij}$. $\mathbf{W}_a$ is specified uniquely by the lattice node locations $\mathbf{a}_j,\ j = 1, \ldots, N_V$, the training inputs $\mathbf{c}_i,\ i = 1, \ldots, N_T$, and the interpolation geometry. The lattice transform output is given by $\hat{\mathbf{d}} = \mathbf{W}_a \mathbf{b}$. Rewriting (3.45) in matrix-vector form yields the following optimization problem:

$$(\mathbf{a}_{opt}, \mathbf{b}_{opt}) = \arg\min_{\mathbf{a}, \mathbf{b}} \big\| \hat{\mathbf{d}} - \mathbf{d} \big\|^2 = \arg\min_{\mathbf{a}, \mathbf{b}} \big\{ \| \mathbf{W}_a \mathbf{b} - \mathbf{d} \|^2 \big\} \tag{3.46}$$

Equation (3.46) is not jointly convex in $\mathbf{a}$ and $\mathbf{b}$; fortunately, however, the problem is separably convex with respect to the two arguments. That is, minimization of $E$ with respect to $\mathbf{b}$ for a fixed $\mathbf{W}_a$ is convex, and vice versa. We next present solutions to the two convex subproblems, followed by a framework for joint optimization that iteratively alternates between the two solutions. While in principle any interpolation geometry can be used, the following treatment chooses simplex interpolation for two reasons: (i) unlike most other methods, it is not restricted to regular lattices; and (ii) a simple linear relationship exists between the node locations $\mathbf{a}$ and the induced weights $\mathbf{W}_a$ [6].
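As a small worked illustration of (3.44)–(3.46), consider a hypothetical 1-D lattice (not taken from the chapter) with $N_V = 3$ nodes at $a = (0, 1, 2)$, node values sampled from $g(c) = c^2$, and $N_T = 2$ training inputs $c = (0.5, 1.25)$ with linear interpolation:

```latex
\mathbf{W}_a =
\begin{bmatrix}
0.5 & 0.5  & 0    \\
0   & 0.75 & 0.25
\end{bmatrix},
\quad
\mathbf{b} = \begin{bmatrix} 0 \\ 1 \\ 4 \end{bmatrix},
\quad
\mathbf{d} = \begin{bmatrix} 0.25 \\ 1.5625 \end{bmatrix},
\qquad
\hat{\mathbf{d}} = \mathbf{W}_a \mathbf{b} = \begin{bmatrix} 0.5 \\ 1.75 \end{bmatrix},
\qquad
E = \| \mathbf{W}_a \mathbf{b} - \mathbf{d} \|^2 = 0.25^2 + 0.1875^2 \approx 0.0977 .
```

Each row of $\mathbf{W}_a$ satisfies (3.43): the weights sum to one and reproduce the training input from the two enclosing nodes. Changing $\mathbf{b}$ alters $E$ through a quadratic, whereas moving a node changes the entries of $\mathbf{W}_a$ themselves, which is why the two subproblems are treated separately.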

3.4.2.2 Optimization of Node Values We begin with a set of lattice node locations $\mathbf{a}_i$ determined by a technique such as that in [76]. The traditional approach to node value calculation has two steps: (1) an analytical function is fitted to the training data using global or locally weighted least-squares regression; (2) the fitted function is evaluated at the $\mathbf{a}_i$ to obtain $b_i$ [79]. A shortcoming of this approach is that lattice interpolation is not accounted for, and therefore there is no guarantee that the error at the training samples is minimized. We adopt instead the approach of directly minimizing the training error in (3.46), while explicitly accounting for the interpolation operation expressed within $\mathbf{W}_a$. It is readily seen that for fixed $\mathbf{W}_a$, (3.46) yields the classic least-squares solution $(\mathbf{W}_a^T \mathbf{W}_a)^{-1} \mathbf{W}_a^T \mathbf{d}$. Unfortunately, in the case of printer characterization, the nonuniform and potentially sparse distribution of training samples often makes (3.46) an underdetermined system of equations, so that $\mathbf{W}_a^T \mathbf{W}_a$ is of insufficient rank and non-invertible. If, for example, a given subvolume does not contain any training samples, there are an infinite number of $b_j$ from which to choose for that subvolume. To avoid this condition, and to reduce over-fitting of noise, a regularization term can be added to (3.46) to ensure that the lattice transform meets a smoothness property. In [80], a graph Hessian regularizer is proposed that ensures smoothness of the lattice transform by penalizing second-order differences in each lattice dimension, summed over the $n$ dimensions:

$$\sum_{k=1}^{n} \sum_{\mathbf{a}_h, \mathbf{a}_i, \mathbf{a}_j \in N_i^k} \big( (b_h - b_i) - (b_i - b_j) \big)^2 = \sum_{k=1}^{n} \sum_{\mathbf{a}_h, \mathbf{a}_i, \mathbf{a}_j \in N_i^k} (b_h - 2 b_i + b_j)^2 = \mathbf{b}^T \mathbf{K}_H \mathbf{b} \tag{3.47}$$

where $\mathbf{K}_H$ is an $N_V \times N_V$ matrix capturing second-order differences of neighboring nodes, and $N_i^k$ denotes the set of node levels immediately neighboring the $i$th node along the $k$th dimension. In order to ensure that the matrix is positive definite, the authors in [80] add a small correction: $\tilde{\mathbf{K}}_H = \mathbf{K}_H + 10^{-6}\,\mathbf{I}$. The new cost function is:

$$E = \| \mathbf{W}_a \mathbf{b} - \mathbf{d} \|^2 + \lambda\, \mathbf{b}^T \tilde{\mathbf{K}}_H \mathbf{b} \tag{3.48}$$

where $\lambda > 0$ is a parameter that trades off training accuracy and smoothness. Problem (3.48) is convex with a unique global minimizer:

$$\hat{\mathbf{b}} = (\mathbf{W}_a^T \mathbf{W}_a + \lambda \tilde{\mathbf{K}}_H)^{-1} \mathbf{W}_a^T \mathbf{d} \tag{3.49}$$

The dimensionality of $\mathbf{W}_a^T \mathbf{W}_a$ is determined by the lattice size ($N_V \times N_V$). For a typical printer characterization application, $N_V$ may be as large as $33^3 = 35937$, leading to inversion of a large matrix. Fortunately, since $\mathbf{W}_a^T \mathbf{W}_a$ is positive semi-definite and $\tilde{\mathbf{K}}_H$ is positive definite, their weighted sum in (3.49) is positive definite, and since both matrices are sparse, the system can be solved efficiently via sparse Cholesky factorization.
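A minimal 1-D sketch of the regularized solution (3.49) follows. It rests on several assumptions that are not in the chapter: a 1-D lattice with linear interpolation, a second-difference matrix as the 1-D analogue of the regularizer in (3.47), hypothetical function names, and a dense solver in place of the sparse Cholesky factorization that a full-size lattice would call for. The training inputs deliberately leave one node without any neighboring samples, so the unregularized normal equations are singular.

```python
import numpy as np

def interp_matrix(nodes, inputs):
    """Rows hold the linear-interpolation weights w_ij of (3.43) on a 1-D lattice."""
    W = np.zeros((len(inputs), len(nodes)))
    for i, c in enumerate(inputs):
        j = int(np.clip(np.searchsorted(nodes, c, side="right") - 1,
                        0, len(nodes) - 2))
        t = (c - nodes[j]) / (nodes[j + 1] - nodes[j])
        W[i, j], W[i, j + 1] = 1.0 - t, t
    return W

def second_difference_penalty(n_nodes):
    """K_H = D^T D, so that b^T K_H b sums (b_h - 2 b_i + b_j)^2 over interior nodes."""
    D = np.diff(np.eye(n_nodes), n=2, axis=0)   # rows of the form [1, -2, 1]
    return D.T @ D

def solve_node_values(W, d, lam):
    """Regularized least squares of (3.49), with the small diagonal correction."""
    n_nodes = W.shape[1]
    K = second_difference_penalty(n_nodes) + 1e-6 * np.eye(n_nodes)
    return np.linalg.solve(W.T @ W + lam * K, W.T @ d)

nodes = np.linspace(0.0, 1.0, 6)
# No samples fall in the two cells touching the node at 0.6.
inputs = np.array([0.05, 0.10, 0.15, 0.25, 0.30, 0.35, 0.85, 0.90, 0.95])
targets = inputs ** 2
W = interp_matrix(nodes, inputs)
b_hat = solve_node_values(W, targets, lam=1e-3)
```

Without the penalty the column of `W` belonging to the isolated node is all zeros and its value is undetermined; with the penalty, that node value is set by smoothness with respect to its neighbors.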

3.4.2.3 Optimization of Node Locations Next we develop an optimization statement to solve (3.46) for $\mathbf{a}_i$ assuming fixed $b_i$ on an initial lattice structure $V_{init} = [\mathbf{a}_i, b_i]$. The formulation is adopted from [78]. Note that the objective in (3.46) is not convex with respect to the $\mathbf{a}_i$ themselves, but only with respect to the weight matrix $\mathbf{W}_a$ induced by the $\mathbf{a}_i$. Thus we tackle the problem in two steps. First we seek the optimal $\mathbf{W}_a$ by solving the following constrained optimization problem:

$$\mathbf{W}_{opt} = \arg\min_{\mathbf{W}_a} \big\{ \| \mathbf{W}_a \mathbf{b} - \mathbf{d} \|^2 \big\} \tag{3.50}$$

subject to 3 constraints:

(i) $\mathbf{w}_i^T \mathbf{e}_i = 0,\ i = 1, \ldots, N_T$, where $\mathbf{w}_i^T$ is the $i$th row of $\mathbf{W}_a$ (simplex membership constraint)
(ii) $w_{ij} \geq 0$ (non-negativity constraint)
(iii) $\mathbf{W}_a \mathbf{1} = \mathbf{1}$ (interpolation constraint)

In the first constraint, the vector $\mathbf{e}_i$, $i = 1, \ldots, N_T$, is constructed such that its $j$th element takes on the value zero if $b_j$ is used in the interpolation calculation to determine output $d_i$, and one otherwise. For 3D tetrahedral interpolation, $\mathbf{e}_i$ has at most 4 zeros. The $\mathbf{e}_i$ are determined based on the membership of training points in the initial lattice structure $V_{init}$. In the third constraint, $\mathbf{1}$ denotes the unity vector ($N_V$-dimensional on the left-hand side and $N_T$-dimensional on the right-hand side) and is used to ensure that the interpolation weights for each training sample sum to unity. While (3.50) is a convex problem with 3 convex constraints, it can be computationally burdensome for large $N_T$ and $N_V$. In order to arrive at an efficient solution, the cost function can be rewritten as a quadratic program. Define the following quantities:

• $\mathbf{z} = \mathrm{vec}(\mathbf{W}_a^T)$, where $\mathrm{vec}(\cdot)$ denotes vectorization
• $\tilde{\mathbf{B}} = \mathbf{I}_{N_T} \otimes \mathbf{b}\mathbf{b}^T$, where $\otimes$ denotes the Kronecker product
• $\mathbf{u} := -2\,\mathrm{vec}(\mathbf{b}\mathbf{d}^T)$
• $\mathbf{E} \in \mathbb{R}^{N_T \times N_T N_V}$, defined below
• $\mathbf{F} \in \mathbb{R}^{N_T \times N_T N_V}$, defined below

$\mathbf{E}$ is the membership constraint matrix given by:

$$\mathbf{E} = \begin{bmatrix} \mathbf{e}_1^T & \mathbf{0}^T & \cdots & \mathbf{0}^T \\ \mathbf{0}^T & \mathbf{e}_2^T & \cdots & \mathbf{0}^T \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}^T & \mathbf{0}^T & \cdots & \mathbf{e}_{N_T}^T \end{bmatrix} \tag{3.51}$$

with $\mathbf{e}_i$ being the membership vector for the $i$th training sample. $\mathbf{F}$ is the interpolation constraint matrix given by:

$$\mathbf{F} = \begin{bmatrix} \mathbf{1}^T & \mathbf{0}^T & \cdots & \mathbf{0}^T \\ \mathbf{0}^T & \mathbf{1}^T & \cdots & \mathbf{0}^T \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}^T & \mathbf{0}^T & \cdots & \mathbf{1}^T \end{bmatrix} \tag{3.52}$$

with $\mathbf{1}$ being the $N_V$-dimensional unity vector. The optimization problem (3.50) can be written as:

$$\begin{aligned} \text{minimize (over } \mathbf{z}\text{)} \quad & \mathbf{z}^T \tilde{\mathbf{B}} \mathbf{z} + \mathbf{u}^T \mathbf{z} \\ \text{subject to} \quad & \mathbf{E}\mathbf{z} = \mathbf{0} \quad \text{(membership)} \\ & \mathbf{z} \geq \mathbf{0} \quad \text{(non-negativity)} \\ & \mathbf{F}\mathbf{z} = \mathbf{1} \quad \text{(interpolation)} \end{aligned} \tag{3.53}$$

Equation (3.53) is a well-known quadratic programming problem. The convex constraints can be formulated in terms of Karush-Kuhn-Tucker (KKT) conditions, and the sparsity of $\mathbf{W}_a$ can be exploited to greatly reduce complexity. Note that $\tilde{\mathbf{B}}$ is positive semi-definite (i.e., the quadratic cost function is convex but not strictly convex), so that multiple minimizers may exist. It is common in such cases to select the minimum-norm solution. The reader is referred to [78] for details of the derivation. The solution to (3.50) does not automatically provide a set of node locations $\mathbf{a}$. This is because the mapping from $\mathbf{a}$ to $\mathbf{W}_a$ is a many-to-one mapping with a non-unique inverse. It is even possible that no simplex topology exists that exactly yields $\mathbf{W}_{opt}$. To this end, in a second step, we search for the $\mathbf{a}$ that induces a weight matrix most similar to $\mathbf{W}_{opt}$ as follows:
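The rewrite of (3.50) as the quadratic program (3.53) can be checked numerically on random data. The sketch below is an illustration under stated assumptions: the problem sizes and random values are arbitrary, and it only verifies that the quadratic form equals the squared error up to the constant $\mathbf{d}^T\mathbf{d}$ and that $\mathbf{F}\mathbf{z}$ returns the row sums of $\mathbf{W}_a$; it does not solve the constrained problem.

```python
import numpy as np

rng = np.random.default_rng(0)
N_T, N_V = 4, 3                      # arbitrary small sizes for the check
W = rng.random((N_T, N_V))           # stand-in weight matrix W_a
b = rng.random(N_V)                  # node values
d = rng.random(N_T)                  # training outputs

# z = vec(W_a^T): the rows of W_a stacked into one long vector.
z = W.reshape(-1)
# B~ = I kron (b b^T), block diagonal with N_T copies of b b^T.
B = np.kron(np.eye(N_T), np.outer(b, b))
# u = -2 vec(b d^T), using column-major (Fortran-order) vectorization.
u = -2.0 * np.outer(b, d).reshape(-1, order="F")
# F = I kron 1^T, so that F z gives the sum of the weights in each row.
F = np.kron(np.eye(N_T), np.ones((1, N_V)))

qp_cost = z @ B @ z + u @ z
direct_cost = np.sum((W @ b - d) ** 2)
constant = d @ d                     # dropped from (3.53): it does not depend on z
```

Since the two costs differ only by a constant, they share the same minimizers, which is what justifies replacing (3.50) with (3.53).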

$$\mathbf{a}_{opt} = \arg\min_{\mathbf{a}} \; \cdots \tag{3.54}$$

where the weighting factors $s_{ij}$ depend on two parameters: a large constant $K > 0$ and a small positive constant. Equation (3.54) minimizes a weighted Frobenius norm between the optimal weight matrix and the one induced by the node locations. The weighting factors $s_{ij}$ and their associated parameters act as soft constraints, chosen to encourage a solution that is close in structure and content to $\mathbf{W}_{opt}$. Minimization is performed by varying the node locations $\mathbf{a}_i$ one at a time until the reduction in the Frobenius norm falls below a convergence threshold.

3.4.2.4 Joint Optimization of Node Values and Locations Algorithm 1 combines the two subproblems just described to iteratively optimize node locations and values. This algorithm is reminiscent of the Lloyd-Max and Linde-Buzo-Gray iterative algorithms used in scalar and vector quantization, respectively [81], and of the k-means algorithm used for data clustering [82].

3.4.2.5 Experimental Results The results in this section are reported from [78]. We use data generated on a Xerox DocuColor 8000 printer to test various methods of generating inverse lattice color transforms $g(\cdot)$ from CIELAB to CMYK. A set of 5000 randomly distributed CMYK samples were printed and measured to generate training sets $T$. Another random set of 1400 CMYK combinations were printed and measured to serve as an independent evaluation of lattice approximation error. The resulting CIELAB data were then processed through various inverse lattice transforms to generate corresponding output CMYK values. The objective function being minimized is the training error in the output (i.e., CMYK) space of the lattice transform. However, in order to assess transform accuracy in a visually relevant color space, the output CMYK values from the test set were printed and measured to obtain CIELAB values, and compared with the original CIELAB inputs to the lattice. Errors were measured in terms of the CIE94 $\Delta E$ metric [2].


The span of realizable CMYK combinations produces a volume of attainable colors in CIELAB space referred to as the printer gamut. In our experiments, all lattices were constructed within the gamut (see [78] for details). In the first step of the proposed iterative Algorithm 1, the initial simplex topology in CIELAB space is obtained as follows. A regular 3-D lattice is defined in device CMY space, wherein each rectangular subvolume is partitioned into 6 simplices (tetrahedra) sharing a common diagonal edge. These 3-D simplices are then mapped through a constrained black (K) generation function and through the forward printer transformation $f(\cdot)$ to generate the initial simplex lattice structure in CIELAB space. Following that, alternating optimization of the output CMYK node values $\mathbf{b}_i$ and the CIELAB node locations $\mathbf{a}_i$ was performed.

Algorithm 1: Iterative optimization of node locations and values
1: Initialize lattice locations $\mathbf{a}_i$ using a method such as that in [76], or from a priori knowledge of transform curvature.
2: Obtain optimum node values $\mathbf{b}_i$ as a solution to the regularized least squares problem (3.49).
3: Obtain node locations $\mathbf{a}_i$ by first deriving the optimal weight matrix $\mathbf{W}_a$ using (3.53), and then searching for the node locations that induce the most similar weight matrix using (3.54).
4: Repeat steps 2-3 until no further minimization of the cost function is possible.

Algorithm 1 was compared against several standard methods, and the results are reported in Table 3.4 for a training set of size 2300. In Row 1, a regular lattice with equally spaced node levels along each dimension was used, and true function values $g(\cdot)$ were used for $\mathbf{b}_i$. This is a simple baseline technique with no use of optimization. Row 2 is obtained with the same uniformly spaced lattice, but with node value optimization, as reported in [80] and described in Sect. 3.4.2.2. In Row 3, a greedy algorithm [76] is used to select node levels along each dimension to maximally reduce approximation error, while maintaining a regular lattice structure. In this case node spacings along each dimension are not uniform, but rather adapt to local transform curvature. The last two rows show results for irregular simplex lattice structures. Row 4 reports the accuracy obtained from node value optimization alone, i.e., when performing steps 1 and 2 of Algorithm 1 once. During node value optimization, both training sets are selected such that every in-gamut subvolume contains at least one training sample, so that (3.46) is not underdetermined and therefore does not necessitate a regularization function. Row 5 shows the result from joint optimization of node values and locations by executing the full iterative form of Algorithm 1.

Table 3.4 Performance of various LUT-based techniques in approximating an inverse printer color transform. T = 2,300 training samples were used

Row number   Node location         Node value                # of nodes   Avg. $\Delta E_{94}$   95% $\Delta E_{94}$
1            Uniform lattice       True function value       1296         3.441                  7.012
2            Uniform lattice       Lattice regression [83]   1296         3.262                  6.872
3            Non-uniform lattice   True function value       1296         3.014                  6.500
4            Non-uniform lattice   Optimized                 1296         2.428                  4.888
5            Optimized             Optimized                 1296         1.813                  3.901

To place the error magnitudes in context, a stability experiment was conducted wherein the same test target was printed twice, and the color difference from one run to the next was measured. The average and 95th percentile CIE94 $\Delta E$ errors were 1.2 and 2.8, respectively, and can be interpreted as a lower bound on the color transformation error introduced by noise inherent in the printing system.

Several conclusions can be drawn from these results. First, when comparing Rows 2 and 3 with Row 1, the benefit of optimizing, respectively, the node values and locations is apparent. This is a quantitative confirmation of the intuition conveyed by Fig. 3.9. Second, when comparing the first 3 against the last 2 rows, we note that for a given lattice size $N_V = 1296$, the simplex structure, by virtue of its more flexible topology, offers a more accurate approximation of $g(\cdot)$ than the more restrictive regular lattice. This gain comes at the price of a more expensive subvolume search. Finally, and most significantly, joint optimization of node location and value (Row 5) offers significant gain over optimization of only one or the other parameter.
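The alternating structure of Algorithm 1 can be illustrated on a toy 1-D problem. The following is a sketch under explicit assumptions, not the method evaluated in Table 3.4: the lattice is 1-D with linear interpolation, all function names are hypothetical, and step 3 is replaced by a plain coordinate search that moves one interior node at a time and keeps a move only if the cost drops, in place of the quadratic program (3.53) followed by the search in (3.54). The same regularized cost is used in both steps, so it cannot increase from one iteration to the next.

```python
import numpy as np

def interp_matrix(nodes, inputs):
    """Linear-interpolation weight matrix W_a for a 1-D lattice."""
    W = np.zeros((len(inputs), len(nodes)))
    for i, c in enumerate(inputs):
        j = int(np.clip(np.searchsorted(nodes, c, side="right") - 1,
                        0, len(nodes) - 2))
        t = (c - nodes[j]) / (nodes[j + 1] - nodes[j])
        W[i, j], W[i, j + 1] = 1.0 - t, t
    return W

def objective(nodes, b, c, d, K):
    """Regularized cost in the spirit of (3.48); K already includes lambda."""
    r = interp_matrix(nodes, c) @ b - d
    return r @ r + b @ K @ b

def fit_values(nodes, c, d, K):
    """Step 2: node values for fixed node locations."""
    W = interp_matrix(nodes, c)
    return np.linalg.solve(W.T @ W + K, W.T @ d)

def move_nodes(nodes, b, c, d, K, step):
    """Stand-in for step 3: greedy coordinate search over interior nodes."""
    nodes = nodes.copy()
    for k in range(1, len(nodes) - 1):
        best = objective(nodes, b, c, d, K)
        for delta in (-step, step):
            trial = nodes.copy()
            trial[k] += delta
            # Keep the node strictly between its neighbors.
            if not (nodes[k - 1] + 1e-3 < trial[k] < nodes[k + 1] - 1e-3):
                continue
            cost = objective(trial, b, c, d, K)
            if cost < best:
                best, nodes = cost, trial
    return nodes

def alternate(c, d, n_nodes=6, lam=1e-4, step=0.02, max_iter=50):
    nodes = np.linspace(c.min(), c.max(), n_nodes)        # step 1: uniform start
    D = np.diff(np.eye(n_nodes), n=2, axis=0)             # second differences
    K = lam * (D.T @ D + 1e-6 * np.eye(n_nodes))
    b = fit_values(nodes, c, d, K)
    history = [objective(nodes, b, c, d, K)]
    for _ in range(max_iter):                             # step 4: repeat
        nodes = move_nodes(nodes, b, c, d, K, step)
        b = fit_values(nodes, c, d, K)
        history.append(objective(nodes, b, c, d, K))
        if history[-2] - history[-1] < 1e-12:
            break
    return nodes, b, history

c = np.linspace(0.0, 1.0, 200)
d = c ** 4                     # curvature concentrated near c = 1
nodes, b, history = alternate(c, d)
```

As with the Lloyd-Max and k-means iterations mentioned in Sect. 3.4.2.4, each half-step can only lower the cost or leave it unchanged, so the procedure settles at a local rather than a guaranteed global optimum.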

3.5 Conclusions

Several methods have been presented for posing and optimizing color transformations for different device genres. Common to all techniques is the notion of minimizing a perceptually relevant fitting error over a set of device characterization data, subject to a set of real-world constraints induced by physics, power, or computational cost. When suitably constructed, the optimization statement boils down to a form of linear regression or a constrained quadratic program. For more complex or intractable objectives, a successful strategy is an iterative one whereby, at each iteration, certain variables are fixed and the optimization over the remaining variables becomes tractable. By alternating between the fixed and variable sets, we converge to a solution that is close to optimal. We note that while the topic of sparse lattice optimization in Sect. 3.4 has been described in the context of printer color correction, the problem has broad applicability in many domains warranting multidimensional function transformation (e.g., geometrical registration, bioinformatics).

Many interesting challenges pave the way for future exploration in computational color imaging. Transformation methods must continually keep abreast of rapidly evolving image processing technologies such as high dynamic range imaging, subpixel capture and rendering for $N > 3$ primaries, and computational photography. Also, with the ever-expanding variety of device form factors (e.g., mobile, wearable, head-mounted, drone), we will undoubtedly encounter a larger variety in color image content, as well as applications that drive new goals and criteria for consuming these images. At the same time, economy in power consumption will become even more crucial with miniaturization. It will thus be increasingly important to conceive optimization techniques that adapt dynamically and in real time to the image and task, while simultaneously being computationally efficient and energy conscious. Such advances will call for an intimate interplay between image processing research, software engineering, and hardware acceleration architectures.

References

1. Hunt R (2004) The reproduction of color. Wiley, New York
2. Bala R (2003) Device characterization. Digital color imaging handbook: Chap. 5, 687–725
3. Bala R, Sharma G, Monga V, Van de Capelle JP (2005) Two-dimensional transforms for device color correction and calibration. IEEE Trans Image Proc 14(8):1172–1186
4. Kang HR (1997) Color technology for electronic devices. SPIE Press
5. Emmel P (2003) Chapter 3: Physical models for color prediction in Digital Color Imaging Handbook. CRC Press
6. Bala R, Klassen V (2003) Efficient color transformation implementation. Digital Color Imaging Handbook: Chap. 11, 687–725
7. Finlayson G, Darrodi MM, Mackiewicz M (2016) Rank-based camera spectral sensitivity estimation. J Opt Soc Am A 33(4):589–599
8. Green F (2010) Color Management: Understanding and Using ICC Profiles. Wiley, New York
9. Ramanath R, Snyder W, Yoo Y, Drew M (2005) Color image processing pipeline. Signal Process Mag IEEE 22(1):34–43
10. West G, Brill M (1982) Necessary and sufficient conditions for von Kries chromatic adaption to give colour constancy. J Math Biol 15:249–258
11. Vrhel MJ, Trussell HJ (1998) The mathematics of color calibration. ICIP (1), pp 181–185
12. Wyszecki G, Stiles W (1982) Color science: concepts and methods, quantitative data and formulas, 2nd ed. Wiley, New York
13. Multimedia Systems and Equipment–Colour Measurement and Management–Part 2-1: Colour Management–Default RGB Colour Space sRGB (1999) IEC 61966 2-1:1999
14. Foster DH, Amano K, Nascimento SMC, Foster MJ (2006) Frequency of metamerism in natural scenes. J Opt Soc Am A 23(10):2359–2372
15. Darrodi MM, Finlayson G, Goodman T, Mackiewicz M (2015) Reference data set for camera spectral sensitivity estimation. J Opt Soc Am A 32(3):381–391
16. Maloney L (1986) Evaluation of linear models of surface spectral reflectance with small numbers of parameters. J Opt Soc Am A 3:1673–1683
17. Parkkinen J, Hallikanen J, Jaaskelainen T (1989) Characteristic spectra of munsell colors. J Opt Soc Am A 6:318–322
18. Vrhel M, Trussel H (1992) Color correction using principal components. Color Res Appl 17:328–338
19. Marimont D, Wandell B (1992) Linear models of surface and illuminant spectra. J Opt Soc Am A 9(11):1905–1913
20. Funt B, Jiang H (2003) Nondiagonal color correction. In: Proceedings on IEEE international conference on image processing, vol 1, pp I–481–4
21. Finlayson GD, Funt BV, Jiang H (2003) Predicting cone quantum catches under illumination change. In: The 11th color imaging conference, pp 170–174
22. Fraleigh J, Beauregard A (1995) Linear algebra, 3rd ed. Addison Wesley
23. Burns PD, Berns RS (1998) Image noise and colorimetric precision in multispectral image capture. In: Proc 6th IS&T/SID color imaging conference, IS&T, pp 83–85
24. Finlayson GD, Mackiewicz M, Hurlbert A (2015) Color correction using root-polynomial regression. IEEE Trans Image Process 24(5):1460–1470
25. Hong G, Luo MR, Rhodes PA (2001) A study of digital camera colorimetric characterization based on polynomial modeling. Color Res Appl 26(1):76–84
26. Barnard K, Cardei VC, Funt BV (2002) A comparison of computational color constancy algorithms. I: Methodology and experiments with synthesized data. IEEE Trans Image Process 11(9):972–984
27. Jiang J, Liu D, Gu J, Süsstrunk S (2013) What is the space of spectral sensitivity functions for digital color cameras. In: Workshop on the applications of computer vision. IEEE Computer Society, pp 168–179
28. Andersen CF, Hardeberg JY (2005) Colorimetric characterization of digital cameras preserving hue planes. In: 13th color and imaging conference, pp 141–146
29. Mackiewicz M, Andersen CF, Finlayson G (2015) Hue plane preserving colour correction using constrained least squares regression. In: 23rd color and imaging conference, pp 18–23
30. McElvain JS, Gish W (2013) Camera color correction using two-dimensional transforms. In: 21st color and imaging conference, pp 250–256
31. Finlayson G, Drew M, Funt B (1994) Spectral sharpening: Sensor transformations for improved color constancy. J Opt Soc Am A 11(5):1553–1563
32. Buchsbaum G, Gottschalk A (1983) Trichromacy, opponent colours coding and optimum colour information transmission in the retina. Proc R Soc Lond B 220:89–113
33. Gershon R, Jepson A, Tsotsos J (1998) From [r, g, b] to surface reflectance: Computing color constant descriptors in images. Perception, 755–758
34. van de Weijer J, Gevers T, Gijsenij A (2007) Edge-based color constancy. IEEE Trans Image Process 16(9):2207–2214
35. Finlayson GD, Trezzi E (2004) Shades of gray and colour constancy. In: 12th color imaging conference, pp 37–41
36. Chakrabarti A, Hirakawa K, Zickler T (2008) Color constancy beyond bags of pixels. In: IEEE conference on computer vision and pattern recognition
37. Cheng D, Prasad DK, Brown MS (2014) Illuminant estimation for color constancy: Why spatial-domain methods work and the role of the color distribution. J Opt Soc Am A 31(5):1049–1058. [Online]. Available: http://josaa.osa.org/abstract.cfm?URI=josaa-31-5-1049
38. Barron JT (2015) Convolutional color constancy. In: 2015 IEEE international conference on computer vision, pp 379–387
39. Bianco S, Cusano C, Schettini R (2015) Color constancy using CNNs. In: 2015 IEEE conference on computer vision and pattern recognition workshops, Boston, MA, USA, June 7–12, 2015, pp 81–89. [Online]. Available: http://dx.doi.org/10.1109/CVPRW.2015.7301275
40. Shi W, Loy CC, Tang X (2016) Deep specialized network for illuminant estimation. In: Computer vision – ECCV 2016 – 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pp 371–387
41. Finlayson GD (2013) Corrected-moment illuminant estimation. In: IEEE International conference on computer vision, ICCV 2013, Sydney, Australia, December 1–8, 2013, pp 1904–1911
42. Finlayson GD, Schaefer G (2001) Solving for colour constancy using a constrained dichromatic reflection model. Int J Comput Vis 42(3):127–144
43. Tan RT, Nishino K, Ikeuchi K (2003) Illumination chromaticity estimation using inverse-intensity chromaticity space. In: IEEE conference on computer vision and pattern recognition, pp 673–682
44.
Barnard K, Martin L, Coath A, Funt BV (2002) A comparison of computational color constancy algorithms. II. Experiments with image data. IEEE Trans Image Process 11(9):985–996 45. Gijsenij A, Gevers T, van de Weijer J (2011) Computational color constancy: Survey and experiments. IEEE Trans Image Process 20(9):2475–2489 46. Montojo J (2009) Face-based chromatic adaptation for tagged photo collections. University of Toronto 47. Bianco S, Schettini R (2012) Color constancy using faces. In: IEEE IEEE conference on computer vision and pattern recognition, pp 65–72 48. Lee C, Lee C, Lee Y-Y, Kim C-S (2012) Powerconstrained contrast enhancement for emissive displays based on histogram equalization. IEEE Trans Image Process 21(1):80–93 49. Carroll A, Heiser G (2010) An analysis of power consumption in a smartphone. In: Proc USENIX Ann Technical Conf, Jun 2010 pp. 1–14

70 50. Lee C, Kim J-H, Lee C, Kim C-S (2014) Optimized brightness compensation and contrast enhancement for transmissive liquid crystal displays. IEEE Trans Circuits Syst Video Technol 24(4): 576–590 51. Lee C, Monga V (2016) Power-constrained RGBto-RGBW conversion for emissive displays: Optimization-based approaches. IEEE Trans Circuits Syst Video Technol 26(10):1821–1834 52. Cowan WB (1983) An inexpensive scheme for calibration of a colour monitor in terms of CIE standard coordinates. SIGGRAPH Comput Graph 17(3):315–321 53. Berns RS, Motta RJ, Gorzynski ME (1993) CRT colorimetry. Part I: Theory and practice. Color Res Appl 18(5):299–314 54. Sharma G (ed) (2002) Digital color imaging handbook. CRC Press, Boca Raton, FL 55. Marcu GG, Chen W, Chen K, Graffagnino P, Andrade O (2001) Color characterization issues for TFTLCD displays. In: Proc. SPIE color imaging: device-independent color, color hardcopy, and applications VII, vol 4663, pp 187–198 56. Sharma G (2001) Comparative evaluation of color characterization and gamut of LCD versus CRTs. In: Proc SPIE color imaging: device-independent color, color hardcopy, and applications VII, vol 4663, pp 177–186 57. Kwak Y, Li C, MacDonald L (2003) Controling color of liquid-crystal displays. J Soc Inf Display 11(2):341–348 58. Gong R, Xu H, Tong Q (2012) Colorimetric characterization models based on colorimetric characteristics evaluation for active matrix organic light emitting diode panels. Appl Opt 51(30): 7255–7261 59. Sun P-L, Luo RM (2013) Color characterization models for OLED displays. In: SID symp dig tech papers, pp 1453–1456 60. Lai C-C, Tsai C-C (2007) A modified stripe-RGBW TFT-LCD with image-processing engine for mobile phone displays. IEEE Trans Consum Electron 53(4):1628–1633 61. Kwon KJ, Kim YH (2012) Scene-adaptive RGB-toRGBW conversion using retinex theory-based color preservation. J Display Technol 8(12):684–694 62. 
Lee S, Kim C, Seo Y, Hong C (2002) Color conversion from RGB to RGB+White while preserving hue and saturation. In: Proc. IS&T/SID color imag Conf, pp 287–291 63. Wang L, Tu Y, Chen L, Teunissen K, Heynderickx I (2007) Trade-off between luminance and color in RGBW displays for mobile-phone usage. In: SID symp dig tech papers, pp 1142–1145 64. Miller ME, Murdoch MJ (2009) RGB-to-RGBW conversion with current limiting for OLED displays. J Soc Inf Display 17(3):195–202 65. Can C, Underwood I (2013) Compact and efficient RGB to RGBW data conversion method and its application in OLED microdisplays. J Soc Inf Display 21(3):109–119

R. Bala et al. 66. Kwak Y, Park J, Park D-S, Park JB (2008) Generating vivid colors on red-green-blue-white electronic-paper display. Appl Opt 47(25):4491–4500 67. Flohr TJ, Kolpatzik BW, Balasubramanian R, Carrara DA, Bouman CA, Allebach JP (1993) Model based color image quantization. In: Proc SPIE human vision, visual processing, and digital display IV, vol 1913, pp 270–281 68. Dong M, Zhong L (2012) Power modeling and optimization for OLED displays. IEEE Trans Mobile Comput 11(9):1587–1599 69. Xiong Y, Wang L, Xu W, Zou J, Wu H, Xu Y, Peng J, Wang J, Cao Y, Yu G (2009) Performance analysis of PLED based flat panel display with RGBW sub-pixel layout. Org Electron 10(5):857–862 70. Dong M, Zhong L (2012) Chameleon: A color-adaptive web browser for mobile OLED displays. IEEE Trans Mobile Comput 11(5): 724–738 71. Wyszecki G, Stiles WS (2000) Color science: concepts and methods, quantitative data and formulae, 2nd ed. Wiley, New York 72. Nocedal J, Wright SJ (2006) Numerical optimization, 2nd ed. Springer, New York 73. Kodak Lossless True Color Image Suite. [Online]. Available: http://r0k.us/graphics/kodak/ 74. Amidror I (2002) Scattered data interpolation methods for electronic imaging systems: A survey. J Electron Imaging 11(2):157–176 75. Lee D, Schachter B (1980) Two algorithms for constructing a delaunay triangulation. Int J Comput Inf Sci 9(3):219–242 76. Monga V, Bala R (2008) Sort-select-damp: An efficient strategy for color look-up-table lattice design. SID/IS&T color imaging conference, Society for Imaging Science and Technology, pp 247–253 77. Chang JZ, Allebach JP, Bouman CA (1997) Sequential linear interpolation of multidimensional functions. IEEE Trans Image Process, 1231–1245 78. Monga V, Bala R, Mo X (2012) Design and optimization of color lookup tables on a simplex topology. IEEE Trans Image Process 21(4): 1981–1996 79. Balasubramanian R, Maltz M (1996) Refinement of printer transformations using weighted regression. In: Proc. 
SPIE color imaging: processing, hardcopy and applications, Mar 1996 80. Garcia E, Gupta M (2012) Optimized regression for efficient function evaluation. IEEE Trans Image Process 21(9):4128–4140 81. Gersho A, Gray R (1990) Vector quantization and signal compression. Springer 82. Jin X, Han J (2010) K-means clustering. Springer, Boston, MA, pp 563–564 83. Garcia EK, Gupta MR (2009) Building accurate and smooth ICC profiles by lattice regression. SID/IS&T color imaging conference

4 Optimization Methods for Synthetic Aperture Radar Imaging

Eric Mason, Ilker Bayram, and Birsen Yazici

E. Mason, B. Yazici: Department of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Institute, 110, 8th Street, Troy, NY 12180, USA
I. Bayram: Department of Electronics and Communications Engineering, Istanbul Technical University, Maslak 34469, Istanbul, Turkey

4.1 Introduction

Synthetic Aperture Radar (SAR) mathematically synthesizes a long aperture by employing moving antennas and short pulses to form high-resolution images. SAR image formation methods reconstruct the image of an illuminated scene from scattered field measurements based on a linear forward model. The large amounts of data collected by SAR systems and the computational requirements of processing them have been the key factors in the development of practical SAR image reconstruction algorithms. Many of the SAR image reconstruction techniques used in practice rely on the Fast Fourier Transform (FFT) and on simplifying imaging geometries and assumptions. These methods include the range-Doppler and polar formatting algorithms [55, 71, 72, 120] and the more sophisticated and geometrically accurate $\omega$-$k$ and chirp scaling algorithms and their variants [15, 80, 96, 97, 104]. These algorithms and the associated linear models rely on simplifying assumptions including linear flight trajectories, constant antenna velocity, short aperture, small scene, linear wavefronts and far-field assumptions.

In the last decade, new classes of SAR image reconstruction methods and forward models have been developed. These models and reconstruction methods avoid some of the limitations of the existing ones, can incorporate prior models and utilize optimization-based approaches. In this chapter, we present Generalized Radon Transforms (GRT) as a unifying approach to modeling the SAR received signal, together with optimization-based approaches to SAR image reconstruction. A GRT is a generalization of the Radon transform that projects a weighted and filtered function onto smooth manifolds such as circles, ellipses and hyperbolas. GRTs are suitable for modeling data in different SAR configurations and modalities, including bistatic and monostatic SAR as well as Doppler SAR [74, 87, 125, 143] and passive SAR [124, 141, 142]. Furthermore, they can accommodate arbitrary imaging geometries, wide apertures, large scenes and complex wave propagation models arising from multiple scattering and moving targets [118, 119, 122, 126–129]. A major advantage of GRT-based models is the availability of analytic inversion methods that can be implemented computationally efficiently [38, 86, 114, 146].

© Springer International Publishing AG 2017 V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_4


We review recent SAR image reconstruction methods from an optimization perspective. The majority of these methods can be viewed as constrained least-squares problems exploiting sparsity. These methods offer substantial improvements in image quality and new degrees of freedom in imaging. We broadly categorize these methods into two classes. The first class includes those that can be implemented analytically, at least at each iteration. The second class includes those that utilize large-scale numerical optimization methods.

In the first class of methods, the optimization criterion uses a GRT as a forward model and the optimization problem is addressed analytically. The solution is then discretized and implemented. We first consider optimization criteria that result in linear reconstruction methods and next consider those that result in nonlinear, iterative reconstruction algorithms. The deterministic linear reconstruction methods are derived from minimum norm, regularized minimum norm and minimum error criteria, and the statistical methods are derived from Best Linear Unbiased Estimation and Minimum Mean Square Error criteria. All linear reconstruction formulae take the form of filtered-backprojection, regularized filtered-backprojection or backprojection filtering operators. The nonlinear reconstruction methods are formulated in a Bayesian framework and can be viewed as constrained least-squares optimization problems. We present approximate analytic implementations of the solutions at each iteration using reweighted least-squares and shrinkage-type iterative approaches.

In the second class of methods, SAR forward models are discretized upfront and the optimization problems are posed in finite-dimensional vector spaces. This approach has gained popularity in recent years, particularly in the context of compressed sensing and sparse signal recovery methods. We review radar imaging methods developed in the $\ell_1$ minimization framework, including total variation and low-rank recovery methods, and in the Bayesian compressive sensing framework. Finally, we summarize the application of sparse signal recovery based optimization methods to other radar imaging problems, including recovery of missing data, autofocus and moving target imaging.

4.2 SAR Forward Models

SAR uses antenna motion to form high-resolution images. A conventional SAR system transmits short electromagnetic pulses and measures the backscattered signal at different spatial positions. A scene reflectivity image is formed by inverting an underlying forward model which relates the scene reflectivity to the received signal. The forward model is typically derived from the scalar wave equation under the Born approximation. Historically, the large amounts of SAR data and the computational requirements of processing them have been the key factors in forward modeling and in the development of associated inversion techniques. Due to computational requirements, many SAR image reconstruction techniques rely on the Fast Fourier Transform (FFT) and on simplifying imaging geometries and assumptions. These assumptions include a linear flight trajectory at constant height, constant velocity, short aperture, small scene, linear wavefronts and far-field assumptions.

An alternative to the conventional SAR forward modeling and image formation techniques is the Radon transform based tomographic approach. This approach views the received signal as the integral of the scene reflectivity along lines under some simplifying assumptions. Image reconstruction then involves inversion of the Radon transform. In this chapter, we model SAR data as a Generalized Radon Transform (GRT) of the scene reflectivity or radiance. GRTs generalize the Radon transform in two ways: (i) the received signal is no longer the projection of the scene onto lines, but onto other smooth manifolds, such as circles, ellipses or hyperbolas; (ii) the received signal is a weighted or filtered projection of the scene onto these manifolds. The weighting is determined by the underlying system or imaging parameters.

The GRT-based SAR data models are versatile and more accurate than the alternatives. The received signal in several different SAR configurations and modalities can be modeled as a GRT. These include monostatic, bistatic and multistatic configurations [74, 87, 143] and novel modalities such as hitchhiker SAR and Doppler SAR [118, 119, 123–125, 141, 142]. Additionally, arbitrary imaging geometries, waveforms, system-related parameters and physics-based complex wave propagation and target models can be built into GRT-based models. A major advantage of GRT-based models is the availability of analytic inversion methods that can be implemented computationally efficiently. These analytic inversion methods are derived from, or incorporated into, the solutions of a wide range of optimization problems including least-squares optimization, $L^p$-constrained least-squares optimization and optimization problems resulting from statistical estimation methods. In this section, we introduce the GRT-based forward model, and in the subsequent sections we present analytic image reconstruction methods resulting from a range of optimization problems.

4.2.1 Generalized Radon Transforms

Let $x = [x_1, x_2] \in \mathbb{R}^2$ denote the two-dimensional position of a scatterer and $\rho(x)$ denote the reflectivity of the scatterer. Note that in three-dimensional space, the scatterer is located at $\mathbf{x} = [x, \psi(x)] \in \mathbb{R}^3$, where $\psi: \mathbb{R}^2 \to \mathbb{R}$ denotes the topography of the ground. Let $s \in [s_1, s_2]$ be the slow-time variable parameterizing the location of a moving antenna and $\gamma(s) \in \mathbb{R}^3$ be the trajectory of the antenna. Let $\omega \in [\omega_1, \omega_2]$ denote the fast-time temporal frequency. Then, using the scalar wave equation and under the Born approximation, we can model the received signal $d(\omega, s)$ as follows:

$$d(\omega, s) := \mathcal{F}[\rho](\omega, s) \approx \int e^{-i\phi(\omega, s, x)} A(\omega, s, x)\, \rho(x)\, dx \tag{4.1}$$

where

$$\phi(\omega, s, x) = \frac{2\omega}{c_0} R(s, x), \tag{4.2}$$

$c_0$ is the speed of light in free space and $A(\omega, s, x)$ is a slowly varying function of $\omega$ that depends on the transmitted waveforms, antenna beam patterns and geometric spreading factors.
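As a rough illustration of (4.1) and (4.2), the integral can be approximated by a sum over a few point scatterers. The sketch below is not taken from the chapter: the function names, the default $A \equiv 1$, the flat ground, the sign of the exponent, and the toy geometry in `toy_range` are all assumptions made only for this example.

```python
import cmath
import math

C0 = 299_792_458.0  # speed of light in free space (m/s)

def phase(omega, rng):
    """Phase of (4.2): phi = 2 * omega * R / c0 for a given range value R."""
    return 2.0 * omega * rng / C0

def forward_model(scatterers, omegas, slow_times, range_fn, amp_fn=None):
    """Finite-sum stand-in for the integral in (4.1).

    scatterers : list of ((x1, x2), rho) pairs, one per point scatterer
    range_fn   : callable (s, x) -> R(s, x), supplied by the caller
    amp_fn     : callable (omega, s, x) -> A; None means A = 1 (assumption)
    Returns d[i][j] for omegas[i] and slow_times[j].
    """
    data = []
    for omega in omegas:
        row = []
        for s in slow_times:
            total = 0j
            for x, rho in scatterers:
                amp = 1.0 if amp_fn is None else amp_fn(omega, s, x)
                total += cmath.exp(-1j * phase(omega, range_fn(s, x))) * amp * rho
            row.append(total)
        data.append(row)
    return data

def toy_range(s, x, height=1000.0, speed=100.0):
    """Hypothetical geometry: antenna at (speed * s, 0, height) above flat ground."""
    return math.dist((speed * s, 0.0, height), (x[0], x[1], 0.0))

omegas = [2.0 * math.pi * f for f in (9.0e9, 9.1e9, 9.2e9)]
slow_times = [0.0, 0.5, 1.0]
d = forward_model([((10.0, 20.0), 0.7)], omegas, slow_times, toy_range)
```

With a single scatterer and $A \equiv 1$, every sample of `d` has magnitude equal to the reflectivity, and only the phase changes with $\omega$ and $s$. The model is linear in the reflectivity, so the data for several scatterers are the sum of the individual responses.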


In (4.2), $R(s, x)$ represents the range, which involves the distances travelled by the electromagnetic wave. Equation (4.1) represents a general forward model for SAR using wideband signals. Depending on the SAR modality, the range and amplitude functions take different forms. SAR modalities can vary based on the receiver/transmitter configuration, processing type and transmitted waveforms. For wideband monostatic SAR, in which the transmitter and receiver are colocated and traverse the trajectory $\gamma(s)$, the range function in (4.2) has the form [87]

$$R_m(s, x) = 2\,|\gamma(s) - \mathbf{x}|, \tag{4.3}$$

where $\mathbf{x} = [x, \psi(x)]$ is the three-dimensional position of the scatterer. This function defines a sphere centered at the antenna location. Setting $A \equiv 1$, we see that the received signal in (4.1) is the projection of the ground reflectivity onto manifolds defined by the intersection of the sphere and the ground topography. For a flat topography, these manifolds are simply circles. Since in general $A \neq 1$, (4.1) defines filtered or weighted projections of the ground reflectivity. An alternative to the monostatic configuration is bistatic SAR, in which the transmitter and receiver are sufficiently far apart [143]. Let $\gamma_R(s)$ and $\gamma_T(s)$ denote the receiver and transmitter trajectories. Then the range function for wideband bistatic SAR becomes

$$R_b(s, x) = |\gamma_T(s) - \mathbf{x}| + |\mathbf{x} - \gamma_R(s)|. \tag{4.4}$$

This function defines an ellipsoid with foci located at the two antenna locations. The underlying manifolds onto which the scene is projected are defined by the intersection of these ellipsoids with the ground topography. The multistatic SAR configuration also involves projections onto the intersection of ellipsoids and the ground topography [74]. The last modality we consider here is hitchhiker SAR, developed for passive SAR imaging [142]. In this modality, two sets of bistatic measurements collected by receivers traversing the trajectories $\gamma_i(s)$ and $\gamma_j(s)$ are first correlated. For a stationary transmitter and under the incoherent field assumption, the range function becomes [142]

$$R_h(s, x) = |\gamma_i(s) - \mathbf{x}| - |\gamma_j(s) - \mathbf{x}|. \tag{4.5}$$

This function defines a two-sheet hyperboloid with foci located at the receiver locations. The GRT projects the scene onto the intersection of this hyperboloid and the ground topography, as in the previous cases. Examples of the imaging geometries and of the manifolds, which we refer to as isorange contours, are shown in Fig. 4.1 and Fig. 4.2, respectively. The forward model in (4.1) is general enough to include novel SAR modalities and dynamic scenes. For wideband transmission, the phase term involves a range function that depends on the geometric configuration. For narrowband transmission, the range in the phase is replaced by a temporal Doppler function and $\omega$ is replaced by the fast-time $t$ [117, 123, 125]. For a dynamic scene, the phase term involving either range or Doppler includes the velocities of the scatterers [117, 119, 126, 128, 129].

Fig. 4.1 Illustrations of the imaging configurations for (a) monostatic, (b) bistatic, (c) hitchhiker SAR modalities

Fig. 4.2 Isorange contours for a fixed pulse, where the receiver is located at the red circle, the transmitter at the green square, and the trajectory is the dashed line, for (a) monostatic, (b) bistatic, (c) hitchhiker SAR modalities. Adapted from [138]

4.2.2 Classical Radon Transform

The forward model in (4.1) can be reduced to the classical Radon transform under the far-field and small-scene assumptions. Making a Taylor series expansion of the phase (4.2) around $x = 0$ and truncating after the linear term, we obtain

$$\phi(\omega, s, x) \approx \phi(\omega, s, 0) + \nabla_x \phi(\omega, s, 0) \cdot x. \tag{4.6}$$

Substituting this term into (4.1), we get

$$d(\omega, s) = \alpha(\omega, s) \int e^{-i x \cdot \xi}\, \rho(x)\, dx, \tag{4.7}$$

where

$$\xi = \nabla_x \phi(\omega, s, 0), \tag{4.8}$$

and

$$\alpha(\omega, s) = A(\omega, s, 0)\, e^{-i\phi(\omega, s, 0)}. \tag{4.9}$$

We see that the forward model is now a two-dimensional Fourier transform multiplied by a scaling function. This is the statement of the classical Radon transform via the Fourier slice theorem [95]. To perform imaging, the complex function $\alpha(\omega, s)$ should be compensated for by multiplying the data by $\alpha^*(\omega, s)/|\alpha(\omega, s)|^2$. Next, the data are interpolated from the support region defined by $\xi$ onto a uniform grid, and a two-dimensional FFT is performed. This approach was developed primarily for monostatic SAR. It is widely used in SAR image reconstruction due to its computational efficiency and is known as the polar format algorithm [39, 100]. Equation (4.7) is the generalization of the monostatic model to any range-based SAR modality. The reconstruction method can be applied to any of the SAR modalities discussed previously by changing the interpolation from one modality to another. While this approach is efficient, the size of the scene is limited and errors can result from the interpolation and from the approximation of the phase function [55].
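The range functions (4.3) to (4.5) and the linearization (4.6) can be checked numerically. The following sketch assumes flat ground, a phase of the form $\phi = kR$ with a caller-supplied constant `k` standing in for the factor that multiplies $R$ in (4.2), and a single fixed slow-time sample; the helper names and the antenna positions are hypothetical.

```python
import math

def lift(x):
    """Flat-ground assumption: psi(x) = 0, so (x1, x2) maps to (x1, x2, 0)."""
    return (x[0], x[1], 0.0)

def range_monostatic(gamma, x):
    """(4.3): R_m = 2 |gamma - x| for an antenna at 3-D position gamma."""
    return 2.0 * math.dist(gamma, lift(x))

def range_bistatic(gamma_t, gamma_r, x):
    """(4.4): R_b = |gamma_T - x| + |x - gamma_R|."""
    return math.dist(gamma_t, lift(x)) + math.dist(lift(x), gamma_r)

def range_hitchhiker(gamma_i, gamma_j, x):
    """(4.5): R_h = |gamma_i - x| - |gamma_j - x|."""
    return math.dist(gamma_i, lift(x)) - math.dist(gamma_j, lift(x))

def fourier_vector(k, range_fn, h=1e-3):
    """xi of (4.8) for phi = k * R, using central differences at x = 0."""
    d1 = (range_fn((h, 0.0)) - range_fn((-h, 0.0))) / (2.0 * h)
    d2 = (range_fn((0.0, h)) - range_fn((0.0, -h))) / (2.0 * h)
    return (k * d1, k * d2)

def linearization_error(k, range_fn, x):
    """|phi(x) - phi(0) - xi . x|: the phase error committed by (4.6)."""
    xi = fourier_vector(k, range_fn)
    exact = k * (range_fn(x) - range_fn((0.0, 0.0)))
    return abs(exact - (xi[0] * x[0] + xi[1] * x[1]))

k = 200.0                       # assumed factor multiplying R (rad/m)
tx = (8000.0, 0.0, 6000.0)      # assumed transmitter position (m)
rx = (0.0, 8000.0, 6000.0)      # assumed receiver position (m)
modalities = {
    "monostatic": lambda x: range_monostatic(tx, x),
    "bistatic": lambda x: range_bistatic(tx, rx, x),
}
# Phase error (rad) for a scatterer near the scene centre and one far from it.
errors = {
    name: (linearization_error(k, fn, (5.0, 5.0)),
           linearization_error(k, fn, (500.0, 500.0)))
    for name, fn in modalities.items()
}
```

In this toy geometry the phase error grows roughly quadratically with the distance from the scene centre, which is one way to see why the polar format approach is restricted to small scenes. Because `fourier_vector` differentiates whatever range function it is given, the same check applies to any range-based modality.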

4.3 Analytical Optimization Methods for SAR Imaging

Equation (4.7) shows that the received signal is approximately the Fourier transform of the scene, where the set of Fourier vectors $\xi$ is determined by the bandwidth as well as the spatial aperture of the system. Since all SAR systems are band-limited and have finite apertures, SAR image reconstruction is inherently an ill-posed inverse problem. In this section, we consider SAR image reconstruction as an optimization problem and address the ill-posedness of the inverse problem by means of different optimization criteria. We first consider optimization criteria that result in linear reconstruction methods and next consider those that result in nonlinear, iterative reconstruction methods. The linear reconstruction methods can be derived in either deterministic or statistical settings. Deterministic reconstruction methods are derived from minimum norm, regularized minimum norm and minimum error criteria, and statistical methods are derived from Best Linear Unbiased Estimation (BLUE) and Minimum Mean Square Error (MMSE) criteria. All linear reconstruction formulae take the form of filtered-backprojection, regularized filtered-backprojection or backprojection filtering operators.

4.3.1 Linear Deterministic Reconstruction Methods

4.3.1.1 Minimum Norm Solution: Filtered Back-Projection Formula

The simplest approach to image reconstruction is to consider the solution of the following optimization problem:

$$P_1: \quad \hat{\rho} = \operatorname*{Argmin}_{\rho} J(\rho) = \int |\rho(x)|^2\, dx \quad \text{such that } \mathcal{F}[\rho] = d. \tag{4.10}$$

The solution to $P_1$ is given by the following reconstruction formula:

$$\hat{\rho} = \mathcal{F}^* (\mathcal{F}\mathcal{F}^*)^{-1}[d]. \tag{4.11}$$

The formula in (4.11) results in a filtered-backprojection (FBP) type reconstruction which can be implemented computationally efficiently [38, 74, 86, 87, 114, 129, 142, 143, 146]. Let $\mathcal{K} = \mathcal{F}^* \mathcal{Q}$ denote the FBP operator, where $\mathcal{F}^*$ is the $L^2$ adjoint of $\mathcal{F}$ and $\mathcal{Q}$ is the filter in (4.11). Then,

$$\hat{\rho}(z) := \mathcal{K}[d](z) \approx \int e^{i\phi(\omega, s, z)}\, Q(\omega, s, z)\, d(\omega, s)\, d\omega\, ds \tag{4.12}$$

where $Q$ is the kernel of $\mathcal{Q}$. To approximate the filter $Q$, we substitute (4.1) into (4.12) and obtain an expression for the composition of the two operators:

$$\hat{\rho}(z) := \mathcal{K}\mathcal{F}[\rho](z) \approx \int e^{-i\Phi(\omega, s, x, z)}\, Q(\omega, s, z)\, A(\omega, s, x)\, \rho(x)\, dx\, d\omega\, ds, \tag{4.13}$$

where

$$\Phi(\omega, s, x, z) = \frac{\omega}{c_0}\,\bigl(R(s, x) - R(s, z)\bigr). \tag{4.14}$$

We make a Taylor series expansion of $R(s, x)$ around $x = z$ and keep only the first-order term to approximate $\Phi$ as follows:

$$\Phi(\omega, s, x, z) \approx \frac{\omega}{c_0}\, \nabla_x R(s, x)\big|_{x=z} \cdot (x - z). \tag{4.15}$$

Next, we make the following change of variables

$$(\omega, s) \to \xi = \frac{\omega}{c_0}\, \nabla_x R(s, x)\big|_{x=z} \tag{4.16}$$

and write (4.13) as follows:

$$\hat{\rho}(z) = \int e^{-i(x - z)\cdot\xi}\, Q(\xi, z)\, A(\xi, x)\, \eta(\xi, z, x)\, \rho(x)\, d\xi\, dx \tag{4.17}$$

where $\eta$ is the determinant of the Jacobian that comes from the change of variables in (4.16) and $\xi \in \Omega_z$, the set of Fourier vectors contributing to the reconstruction of the reflectivity at $z$. $\eta(\xi, z, z)$ is also known as the Beylkin determinant. We now use the facts that $\mathcal{K}$ is a pseudo-inverse of $\mathcal{F}$ and that, since $\mathcal{F}^*\mathcal{F}$ is a pseudo-differential operator, the main contributions to (4.17) come from $x = z$, to determine the filter as follows:

$$Q(\xi, z) = \frac{A^*(\xi, z)}{|A(\xi, z)|^2\, \eta(\xi, z)}, \tag{4.18}$$

where we set $\eta(\xi, z, z) = \eta(\xi, z)$ for notational simplicity. With this choice of filter, (4.17) becomes

$$\hat{\rho}(z) = \int_{\Omega_z} e^{-i(x - z)\cdot\xi}\, \rho(x)\, d\xi\, dx. \tag{4.19}$$

Equation (4.19) shows that the point spread function of the minimum norm solution is a sinc function, i.e., the kernel of a band-limited identity operator, as expected. The resolution of the reconstructed image is determined by the data collection manifold $\Omega_z$, i.e., the support of the Fourier vectors that shapes the sinc function. The FBP reconstruction formula can also be analyzed from the perspective of Fourier Integral Operators [41, 62]. Since $\mathcal{F}^*\mathcal{F}$ is a pseudo-differential operator, $\mathcal{F}^*$ reconstructs all the edges of the scene at the correct position and orientation in the image. Since $\mathcal{K}$ is a pseudo-inverse of $\mathcal{F}$ with the choice of filter as in (4.18), it also reconstructs the edges at the correct strength.
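A one-dimensional analogue makes the band-limited identity in (4.19) concrete. The sketch below is an illustration under stated assumptions rather than the chapter's implementation: $\Omega_z$ is replaced by a symmetric interval $[-B, B]$, the integral over $\xi$ is evaluated with a midpoint rule, and all function names are hypothetical. It also checks that the filter (4.18) cancels the amplitude and the Beylkin determinant at a single sample.

```python
import cmath
import math

def fbp_filter(a, eta):
    """Filter (4.18) at one (xi, z) sample: conj(A) / (|A|^2 * eta)."""
    return a.conjugate() / (abs(a) ** 2 * eta)

def psf_numeric(delta, half_width, n=4000):
    """Midpoint-rule value of the integral of exp(-i * delta * xi) over [-B, B]."""
    step = 2.0 * half_width / n
    total = 0j
    for m in range(n):
        xi = -half_width + (m + 0.5) * step
        total += cmath.exp(-1j * delta * xi)
    return total * step

def psf_closed_form(delta, half_width):
    """2 sin(B * delta) / delta, with the limit 2B at delta = 0."""
    if delta == 0.0:
        return 2.0 * half_width
    return 2.0 * math.sin(half_width * delta) / delta

B = 10.0  # assumed half-width of the 1-D Fourier support
samples = [(dlt, psf_numeric(dlt, B), psf_closed_form(dlt, B))
           for dlt in (0.0, 0.1, 0.5, 2.0)]
first_null = math.pi / B  # offset x - z at which the main lobe first vanishes

# With the filter (4.18), the product Q * A * eta in (4.17) reduces to one.
a, eta = complex(0.3, -1.2), 2.5
identity_check = fbp_filter(a, eta) * a * eta
```

The first null sits at $\pi/B$, so a wider Fourier support gives a narrower main lobe, in line with the statement that resolution is set by the data collection manifold.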

4.3.1.2 Regularized Minimum Norm Solution

Taking into account uncertainty in the received signals, we consider the following regularized optimization problem for SAR image reconstruction:

$$P_2: \quad \hat{\rho} = \operatorname*{Argmin}_{\rho} J(\rho) = \|d - \mathcal{F}[\rho]\|_2^2 + \lambda \|\rho\|_2^2 \tag{4.20}$$

where $\|\cdot\|_2$ stands for the $L^2$-norm and $\lambda > 0$ is a regularization constant. The solution to problem $P_2$ is also given in the form of an FBP operator in which the filter $Q$ is now regularized as follows:

$$Q(\xi, z) = \frac{A^*(\xi, z)}{|A(\xi, z)|^2\, |\eta(\xi, z)| + \lambda}. \tag{4.21}$$

4.3.1.3 Minimum Error Solution: Backprojection Filtering Formula

Yet another SAR image reconstruction formula can be obtained by considering the following optimization problem:

$$P_3: \quad \hat{\rho} = \operatorname*{Argmin}_{\rho} J(\rho) = \|d - \mathcal{F}[\rho]\|_2^2 \tag{4.22}$$

where $\|\cdot\|_2^2$ again stands for the squared $L^2$-norm. The solution to this problem is given by the left pseudo-inverse of $\mathcal{F}$, which can be implemented in the form of a Backprojection Filtering (BPF) formula in which the first-order approximation to the BPF filter $Q_{BPF}$ is given by [138]

$$Q_{BPF}(z', z) = \int e^{i(z' - z)\cdot\xi}\, Q(\xi, z)\, d\xi, \tag{4.23}$$

where $Q$ is the filter in (4.18). A desirable aspect of backprojection-based image reconstruction methods is the availability of fast backprojection algorithms [86, 114, 116, 146]. However, these algorithms rely on specific SAR geometries and simplifying assumptions. An alternative class of fast algorithms applicable to arbitrary GRTs is developed in [20, 38, 64]. These algorithms take advantage of the low-rank separability and oscillatory nature of the kernel of the GRTs.

4.3.2 Linear Statistical Reconstruction Methods

One of the fundamental problems in radar is the ubiquitous presence of noise and clutter in sensor measurements. Therefore, we extend the model in (4.1) to include clutter and noise:

$$d(\omega, s) = \mathcal{F}[\rho](\omega, s) + c(\omega, s) + n(\omega, s), \tag{4.24}$$


where the additional terms $c$ and $n$ are clutter (unwanted radar return) and additive noise, respectively. Both quantities are modeled statistically; typical assumptions are Gaussian statistics for the noise and Rayleigh statistics for the clutter. In the following two sections, we present two FBP-type reconstruction formulae in a statistical framework that takes noise and clutter into account. The first is developed using the best linear unbiased estimation (BLUE) criterion and the second is based on the minimum mean square error (MMSE) criterion. The reconstruction formulae are developed using a high-frequency analysis. Detailed descriptions of the derivations can be found in [138, 140, 145]. Since both reconstruction formulae are restricted to be linear, only the first- and second-order statistics of the underlying random processes need to be specified. For simplicity, we assume that the noise and clutter processes are zero-mean and that the clutter is statistically stationary. The noise process is statistically uncorrelated in the slow-time $s$ and the fast-time frequency variable $\omega$. Both noise and clutter processes admit spectral density functions. In addition to these assumptions, for the MMSE reconstruction, we also assume that the scene of interest is statistically stationary. For both the BLUE and MMSE criteria, the assumption of statistical stationarity can be relaxed to include a large class of statistically nonstationary processes called pseudo-stationary processes, as shown in [138, 140].
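The noise term in (4.24) can be simulated in a few lines. The sketch below is only an illustration: it assumes zero-mean circular complex Gaussian noise that is independent from sample to sample (so uncorrelated in $s$ and $\omega$), it defines the SNR as the ratio of average signal power to average noise power, and it omits the clutter term. The function names and the test signal are hypothetical.

```python
import math
import random

def mean_power(values):
    """Average of |v|^2 over a list of complex samples."""
    return sum(abs(v) ** 2 for v in values) / len(values)

def complex_gaussian(count, power, rng):
    """Zero-mean circular complex Gaussian samples with E|n|^2 = power."""
    sigma = math.sqrt(power / 2.0)  # standard deviation of each real component
    return [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(count)]

def add_noise(signal, snr_db, rng):
    """Return (signal + noise, noise) with the noise power set by snr_db."""
    noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    noise = complex_gaussian(len(signal), noise_power, rng)
    return [s + n for s, n in zip(signal, noise)], noise

rng = random.Random(0)  # fixed seed so the run is repeatable
clean = [complex(math.cos(0.3 * m), math.sin(0.3 * m)) for m in range(20000)]
noisy, noise = add_noise(clean, 10.0, rng)
measured_snr_db = 10.0 * math.log10(mean_power(clean) / mean_power(noise))
```

Only the first- and second-order statistics of the noise enter the linear estimators discussed next, so a sample mean and a sample power are the natural quantities to check in such a simulation.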

4.3.2.1 Best Linear Unbiased Estimation

In [138], an FBP-type reconstruction is presented based on the best linear unbiased estimation (BLUE) criterion. The reconstructed image is modeled as

$$\hat{\rho}(z) = \mathcal{K}[d](z) \tag{4.25}$$

where the imaging operator $\mathcal{K}$ has the following linear structure. The operator $\mathcal{K}$ is designed to be the composition of two GRTs, $\mathcal{K} = \mathcal{K}_2 \mathcal{K}_1$, where

$$\rho_1(z) = \mathcal{K}_1[d](z) := \int e^{i\phi(\omega, s, z)}\, Q_1(\omega, s, z)\, d(\omega, s)\, d\omega\, ds, \tag{4.26}$$

and $\mathcal{K}_2$ is a pseudo-differential operator with kernel

$$Q_2(z', z) = \int e^{i\frac{\omega}{c_0}(z' - z)\cdot\xi}\, \hat{Q}_2(z, \xi)\, d\xi. \tag{4.27}$$

The FBP-type reconstruction under the BLUE criterion is given by the solution of the following optimization problem:

$$P_4: \quad \hat{\rho} = \operatorname*{Argmin}_{\tilde{\rho}} J(\tilde{\rho}) = \int E\bigl[\,|\tilde{\rho}(z) - E[\tilde{\rho}(z)]|^2\,\bigr]\, dz \quad \text{such that } E[\tilde{\rho}(z)] = \rho(z), \quad \tilde{\rho} = \mathcal{K}_2 \mathcal{K}_1[d], \tag{4.28}$$

where $E[\cdot]$ denotes the expectation operator. This optimization problem reduces to determining the filters $Q_1$ and $Q_2$ satisfying (4.28). To obtain the filters, the variance of the estimated image is minimized subject to the constraint that the bias of the image estimate be zero. The optimization is solved using Lagrange multipliers within the framework of the calculus of variations. The resulting forms of the filters used in (4.26) and (4.27) are

$$Q_1(\xi, z') = \frac{A^*(\xi, z')\, \eta(\xi, z')}{|A(\xi, z')\, \eta(\xi, z')|^2\, |S_c(\xi)|^2 + \eta(\xi, z')\, |S_n(\xi)|^2}, \tag{4.29}$$

and

$$Q_2(z, z') = \int \frac{|A(\xi, z)\, \eta(\xi, z)|^2\, |S_c(\xi)|^2 + \eta(\xi, z)\, |S_n(\xi)|^2}{|A(\xi, z)\, \eta(\xi, z)|^2}\, e^{i2(z' - z)\cdot\xi}\, d\xi, \tag{4.30}$$

where $\xi$ is defined in (4.16) and $S_c$ and $S_n$ denote the power spectral density functions of clutter and noise, respectively.


Fig. 4.3 (a) A simple airplane phantom used for experiments in this section. (b) The airplane phantom with clutter included. Adapted from [138]

With this choice of filters, the leading-order singularities of the expectation of the estimated image under the BLUE criterion are equal to the leading-order singularities of the true scene. Furthermore, the total variance of the BLUE reconstruction is minimized at its leading-order singularities, and the variance has a closed form given in [138]. Figures 4.4 and 4.5 show reconstructed images at different noise and clutter levels using BLUE, and the original phantom and clutter mask are shown in Fig. 4.3.
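To see how the BLUE filter (4.29) relates to the deterministic filter (4.18), both can be evaluated at a single $(\xi, z')$ sample. The sketch below is a pointwise illustration only: the function and argument names are hypothetical, $\eta$ is taken to be real and positive, and the spectral densities are passed in as plain numbers.

```python
def fbp_filter(a, eta):
    """Deterministic FBP filter (4.18): conj(A) / (|A|^2 * eta)."""
    return a.conjugate() / (abs(a) ** 2 * eta)

def blue_filter_q1(a, eta, s_c, s_n):
    """BLUE filter (4.29) at one (xi, z') sample.

    a   : complex amplitude A(xi, z')
    eta : Beylkin determinant eta(xi, z'), assumed real and positive here
    s_c : clutter spectral density S_c(xi)
    s_n : noise spectral density S_n(xi)
    """
    denom = abs(a * eta) ** 2 * abs(s_c) ** 2 + eta * abs(s_n) ** 2
    return a.conjugate() * eta / denom

a, eta = complex(0.6, -0.8), 2.0
q_deterministic = fbp_filter(a, eta)
q_clean = blue_filter_q1(a, eta, s_c=1.0, s_n=0.0)  # no noise, unit clutter density
q_noisy = blue_filter_q1(a, eta, s_c=1.0, s_n=3.0)  # strong noise at this sample
```

With no noise and a unit clutter density, (4.29) coincides with (4.18) at this sample. As the noise density grows, the magnitude of the filter shrinks, so Fourier samples dominated by noise contribute less to the backprojected image.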

4.3.2.2 Minimum Mean Square Error Estimation

In BLUE we assume that the unknown scene of interest is deterministic. An alternative reconstruction formula can be designed by assuming that the scene is random and incorporating a prior model into the estimation within a Bayesian approach. In [145], an FBP-type inversion is developed using the minimum mean square error (MMSE) criterion. As in the case of BLUE, estimator design reduces to the design of an appropriate filter satisfying the MMSE criterion. To constrain the reconstruction to be a linear one, we use only the first and second-order statistics. As before, we assume that the clutter and noise are zero-mean stationary processes. We assume that the scene has a spatially varying mean value

$$E[\rho(x)] = \mu(x). \tag{4.31}$$

The scene of interest $\rho$ is a stationary process with spectral density function $S_\rho$. We restrict the estimator $K$ to be linear, in particular to be of the following FBP form:

$$K[d](z) = \int e^{i\varphi(\omega,s,z)}\,Q(\omega,s,z)\,d(\omega,s)\,d\omega\,ds. \tag{4.32}$$

We design the operator $K$ based on the solution of the following optimization problem:

$$P_5:\quad \hat\rho = \operatorname{Argmin}_{\tilde\rho}\, J(\tilde\rho) = \int E\big[\,|\tilde\rho(z) - \rho(z)|^2\,\big]\,dz \quad\text{such that}\quad \hat\rho(z) = K[d](z). \tag{4.33}$$

As in the case of BLUE, design of K reduces to determining the filter Q satisfying (4.33). The mean square error shows that the choice of filter is a trade-off between the bias and the total variance of the reconstructed image. The filter can be chosen such that the estimate is unbiased up to its


E. Mason et al.

Fig. 4.4 The reconstructed average of the image over 10 realizations with SCR 10 dB. (a) BLUE after applying filter Q1 and (b) BLUE after applying filter Q2 with SNR 0 dB. (c) BLUE after applying filter Q1 and (d) BLUE after applying filter Q2 with SNR 32 dB. Adapted from [138]

leading order (in terms of the singularities which form edges). This means that the bias of the estimator is one degree smoother than the mean of the image estimate. Such a filter preserves the location, orientation, and strength of the edges of the image. However, with this choice there is no suppression of the singularities of the second order statistics of the clutter and noise processes. The filter is given as

$$Q(\xi, z) = \frac{1}{A(\xi,z)\,\eta(\xi,z)} \tag{4.34}$$

where the two terms in the denominator are the amplitude function (4.34) and the Beylkin determinant. This is similar to the choice of filter in the noise-free case developed in the previous section. The second choice for the filter $Q$ is designed by minimizing the total variance of the scene estimate $\hat\rho_{\mathrm{mse}}(z) = K[d](z)$. The resulting optimization problem is addressed by computing the stationary points of the variational derivative of the objective functional with respect to $Q$. This leads to the following filter

$$Q(\xi, z) = \frac{A^*(\xi,z)\,S_\rho(\xi)}{|A(\xi,z)|^2\,\eta(\xi,z)\,[S_\rho(\xi) + S_c(\xi)] + \tilde S_n(\xi)}, \tag{4.35}$$

where $\eta$ is defined in (4.16). $S_\rho(\xi)$ and $S_c(\xi)$ are the power spectral density functions of $\rho(x) - \mu(x)$ and $c(x)$, respectively, and $\tilde S_n$ is the power spectral density function of the noise $n$ scaled by $(2\pi)^4$.

Fig. 4.5 The reconstructed average of the image over 10 realizations with SNR 10 dB. (a) BLUE after applying filter Q1 and (b) BLUE after applying filter Q2 with SCR 0 dB. (c) BLUE after applying filter Q1 and (d) BLUE after applying filter Q2 with SCR 32 dB. Adapted from [138]

In [49] a similar approach is taken in determining the weights of the underlying filter based on the MMSE criteria. In Fig. 4.3 we display the phantom used and the clutter mask. Figure 4.6 displays the results of using the deterministic FBP method described in Sect. 4.3.1.1 for a variety of noise and clutter levels. Reconstructed images using the MMSE method described in Sect. 4.3.2.2 are displayed in Fig. 4.7. Using the pseudo stationary model


Fig. 4.6 Images reconstructed using deterministic FBP assuming stationary noise and clutter. Reconstructed images with constant clutter at SCR 10 dB and noise SNR 20 dB are (a) and SNR 40 dB (b). Reconstructed images with constant noise at SNR 10 dB and clutter SCR 20 dB are (c) and SCR 40 dB (d). Adapted from [138]

described in [138] with the MMSE reconstruction method the resulting images for the cases when the power spectral density is known a priori and estimated are displayed in Figs. 4.8 and 4.9, respectively. As expected deterministic FBP fails to suppress noise, while the statistical methods clearly overcome that problem. Furthermore, the pseudo stationarity assumption on the noise and clutter leads to superior results over the MMSE approach using the ordinary stationarity assumption on the noise.

4.3.3 Nonlinear Reconstruction Methods

In the previous sections we considered linear reconstruction methods in both deterministic and statistical settings. The linearity of the reconstruction methods can be viewed as a direct consequence of the inherent Gaussianity assumption on the additive noise and the unknown scene of interest. In fact, all other FBP formulae can be obtained via Maximum A Posteriori (MAP) estimation within a Bayesian framework with appropriate Gaussian prior models. However, Gaussian prior models are often not sufficient to capture edges in SAR images. As a result, Gaussian priors typically lead to over-smoothing of edges. A broader class of reconstruction methods can be developed by the MAP estimator coupled with non-Gaussian prior models:

$$P_{MAP}:\quad \hat\rho_{MAP} = \operatorname{Argmax}_{\rho}\, J(\rho) = L(d|\rho) + \ln p(\rho) \tag{4.36}$$

where $L(d|\rho)$ is the data log-likelihood function designed based on the forward model governing the physics of the problem and uncertainties in the measurements, and $p(\rho)$ is the prior model quantifying what we know about the likely properties of the solution even before we have made any measurements. Intuitively, the MAP estimate seeks the solution that both fits the data accurately and remains consistent with our prior knowledge of how the solution should behave.

Fig. 4.7 Images reconstructed using MMSE criteria assuming stationary noise and clutter. Reconstructed images with constant clutter at SCR 10 dB and noise SNR 20 dB are (a) and SNR 40 dB (b). Reconstructed images with constant noise at SNR 10 dB and clutter SCR 20 dB are (c) and SCR 20 dB (d). Adapted from [138]

Fig. 4.8 Images reconstructed assuming pseudo stationary noise and clutter and the power spectral density of the noise and clutter are known a priori. Reconstructed images with constant clutter at SCR 10 dB and noise SNR 20 dB are (a) and SNR 40 dB (b). Reconstructed images with constant noise at SNR 10 dB and clutter SCR 20 dB are (c) and SCR 40 dB (d). Adapted from [138]

Fig. 4.9 Images reconstructed assuming pseudo stationary noise and clutter and the power spectral density are estimated from the data. Reconstructed images with constant clutter at SCR 10 dB and noise SNR 20 dB are (a) and SNR 40 dB (b). Reconstructed images with constant noise at SNR 10 dB and clutter SCR 20 dB are (c) and SCR 40 dB (d). Adapted from [138]

The linear reconstruction methods, including the deterministic ones and BLUE, can be considered as MAP solutions with Gaussian data likelihood and appropriate Gaussian prior models². One major limitation of the Gaussian models is that they do not accurately model the edges and other discontinuities that often occur in real images. In order to overcome this limitation, we extend our approach to non-Gaussian models; in particular, we consider the following distributions:

$$p(\rho) \propto \exp\left(-\int \alpha(z)\,V[\rho(z)]\,dz\right), \qquad p(\rho) \propto \exp\left(-\int \alpha(z)\,V[\nabla\rho(z)]\,dz\right) \tag{4.37}$$

where $V$ is a potential functional. There are many potential functionals studied in the literature, some of which include the following [6, 8, 9, 53, 54, 57, 59, 105, 108]:

$$V[\rho]:\ \text{Nonconvex:}\quad \frac{\rho^2}{1+\rho^2},\quad \log(1+\rho^2),\quad \min\{\rho^2, 1\},\quad \frac{|\rho|}{1+|\rho|} \tag{4.38}$$

$$V[\rho]:\ \text{Convex:}\quad |\rho|,\quad \frac{|\rho|^p}{1+|\rho|^{q-p}},\quad \log\cosh\{\rho\},\quad \min\{|\rho|^2,\ 2|\rho|-1\},\quad |\rho|^p \tag{4.39}$$

² For the deterministic case, we can consider a Gaussian model with large variance to obtain pseudo-inverse solutions.

Under the assumption of statistically uncorrelated Gaussian noise, we formulate the image reconstruction as the following optimization problems:

$$P_6:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \|d - F[\rho]\|^2_{2,S_n} + \lambda\int V[\rho(z)]\,dz, \tag{4.40}$$

$$P_7:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \|d - F[\rho]\|^2_{2,S_n} + \lambda\int V[\nabla\rho(z)]\,dz \tag{4.41}$$

where $E[n(\omega,s)\,n^*(\omega,s)] = S_n(\omega,s)$ and $S_n$ is a positive-definite trace-class operator whose kernel is $S_n(\omega,s)$. Typically, $S_n(\omega,s) = S_n(\omega)\,\sigma^2$ for some constant $\sigma^2 > 0$. The weighted norm is

$$\|d\|^2_{2,S_n} = \int \frac{|d(\omega,s)|^2}{S_n(\omega,s)}\,d\omega\,ds \tag{4.42}$$

and $\lambda > 0$ can be viewed as a term related to the signal-to-noise ratio, trading off between the data likelihood and prior information.

If we consider the discrete version of (4.40) and (4.41), the minimization problems can be addressed using various classical iterative optimization algorithms ranging from steepest-descent and conjugate-gradient to more involved interior-point algorithms. However, even in the discrete setting, all these general purpose methods are often inefficient, requiring too many iterations and too many computations. This is especially the case for high-dimensional problems, as often encountered in SAR image reconstruction. In the last decade, the optimization problems P6 and P7 together with the potential functional $V[\rho] = |\rho|^p$ have been rediscovered in the context of sparse signal recovery and compressive sensing theory [17, 30]. Several new families of numerical algorithms with performance guarantees were developed in the context of sparse signal recovery, compressive sensing and robust statistics. Without claiming to be comprehensive, we categorize these algorithms as greedy pursuit type algorithms, approximate solutions such as iterative reweighted least squares, and iterative-shrinkage type algorithms [45]. Our objective is to combine the ideas and algorithms developed in the context of sparse signal recovery with analytic techniques to develop nonlinear, iterative image reconstruction techniques that are computationally efficient, robust and quantifiable in terms of image quality.
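To make the distinction between the nonconvex and convex potentials in (4.38) and (4.39) concrete, the following sketch evaluates a few of them for a real scalar argument and applies a crude midpoint-convexity test on a grid. The function names and the grid-based check are illustrative assumptions, not part of the chapter; the check can only detect violations on the sampled points and is not a proof of convexity.

```python
import math

def rational_quadratic(r):
    """rho^2 / (1 + rho^2), listed among the nonconvex potentials in (4.38)."""
    return r * r / (1.0 + r * r)

def truncated_quadratic(r):
    """min{rho^2, 1}, listed among the nonconvex potentials in (4.38)."""
    return min(r * r, 1.0)

def abs_power(r, p):
    """|rho|^p, listed among the convex potentials in (4.39) (convex for p >= 1)."""
    return abs(r) ** p

def log_cosh(r):
    """log cosh(rho), listed among the convex potentials in (4.39)."""
    return math.log(math.cosh(r))

def midpoint_convex_on(V, points, tol=1e-12):
    """Check V((a+b)/2) <= (V(a)+V(b))/2 for all pairs of sample points."""
    return all(
        V(0.5 * (a + b)) <= 0.5 * (V(a) + V(b)) + tol
        for a in points
        for b in points
    )

# Hypothetical sample grid on [-4, 4] with spacing 0.25.
grid = [0.25 * i for i in range(-16, 17)]
```

On this grid the check passes for `log_cosh` and for `abs_power` with `p = 1`, and fails for `rational_quadratic` and `truncated_quadratic`, in line with the grouping in (4.38) and (4.39).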

4.3.3.1 Iterative Re-weighted Least-Squares Algorithm

An interesting example of sparse signal processing techniques is the Focal Underdetermined System Solver (FOCUSS) algorithm [56]. This is a reweighted norm minimization algorithm that reduces the $\ell_p$-norm constrained problem to a sequence of reweighted $\ell_2$-constrained problems. We adapt this methodology to design a sequence of analytic $\ell_2$-norm constrained FBP-type inversion methods for the V-constrained image reconstruction problems in (4.40) and (4.41). We refer to this method as the Iterative Re-weighted Type Algorithm (IRTA). Below we outline our method in detail for the $\ell_p$-norm constrained case and comment on its extension to other potential functionals and (4.41). The method recasts the $\ell_p$-norm constrained problem as a sequence of $\ell_2$ minimizations using appropriate filters designed using the solution in the previous iteration.

1. We assume that $\rho \sim \exp(-\|\rho\|^p_{p,\alpha})$ where

$$\|\rho\|^p_{p,\alpha} = \int \frac{|\rho(z)|^p}{\alpha(z)}\,dz < +\infty$$

with $\alpha(z) > 0$. Then, the MAP estimation of $\rho$ is equivalent to the following optimization problem:

$$P_8:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \|d - F[\rho]\|^2_{2,S_n} + \|\rho\|^p_{p,\alpha}. \tag{4.43}$$

2. We convert the problem P8 defined above to a sequence of quadratic problems, P5, as defined in (4.33) and solve each problem analytically as in our work [145]. This amounts to a fast, analytic, iterative solution of problem P8. Let $\rho_k$, $k = 0, 1, \ldots$, be the solution of problem P5 at the $k$th iteration. We define

$$I^q_{\alpha,k-1}\,\rho(z) = \frac{\rho(z)}{\alpha^{1/2}(z)\,|\rho_{k-1}(z)|^q}. \tag{4.44}$$

Then, for $p = 2 - 2q$,

$$\|I^q_{\alpha,k-1} f(z)\|^2_2 \simeq \|f\|^p_{p,\alpha}. \tag{4.45}$$

Thus, P8 becomes

$$P_8^a:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \|d - F[\rho]\|^2_{2,S_n} + \|I^q_{\alpha,k-1}\,\rho(z)\|^2_2. \tag{4.46}$$

3. We now have an MMSE equivalent of the MAP estimation problem defined above. We address this problem by designing an FBP operator as follows: Let $K_k$ be the inverse map at the $k$th iteration step. We write $K_k = BQ_k$ where $Q_k$ is the filter that solves Problem P5. We design the kernel of $Q_k$ as in our work [145]. Specifically, let

$$\beta_k(z) = \frac{1}{\alpha^{1/2}(z)\,|\rho_k(z)|^q} \tag{4.47}$$

and $S_{\beta,k} = \left|\int e^{iz\cdot\xi}\,\beta_k(z)\,dz\right|$. Then the filter at the $k$th iteration is given by (4.35) with $S_n \to S_{\beta,k-1}S_n$.

4. The procedure described above solves the $\ell_1$ constrained problem efficiently by using analytic FBP operators at each iteration. This procedure is summarized as follows:
• Initialize $\rho_0 \equiv 1$ on its support. Increment $k$ by 1 and do the following steps:
• Use $\rho_{k-1}$ and define $I^q_{\alpha,k-1}$ for $q = 1/2$.
• Derive an analytic filter using $\beta_{k-1}(z)$ and minimizing P5 as in [145].
• Form an estimate $\rho_k = BQ_k[d]$ where $B$ is the backprojection operator.
• Update $I^{1/2}_{\alpha,k} \to I^{1/2}_{\alpha,k+1}$ and $\beta_{k-1}(z) \to \beta_k(z)$ using $\rho_k$.
• Terminate if the stopping criteria are met, otherwise set $k = k + 1$ and iterate.

5. • For the general potential function $V$, a linear transformation that can be considered for Step 2 is

$$I_{\alpha,k-1}\,\rho(z) = \frac{\rho(z)\,|V^{1/2}(\rho_{k-1}(z))|}{|\rho_{k-1}(z)|}. \tag{4.48}$$

• For Problem P7, in which $V(\nabla f) = |\nabla f|^p$, we consider the following modifications in Step 2 above:

$$|\nabla\rho(z)| \approx P[\rho(z)] := \int_{\Omega_z} e^{i(z-x)\cdot\xi}\,|\xi|\,\rho(x)\,dx, \qquad I^q_{TV,k-1} f(z) = \frac{P[f(z)]}{|P[\rho_{k-1}(z)]|^q} \ \Rightarrow\ |I^q_{TV,k-1}\,\rho(z)|^2 \approx |\nabla\rho(z)|^{2-2q}. \tag{4.49}$$

Figure 4.10 displays results using the IRTA algorithm and conventional filtered backprojection described in Sect. 4.3.1.1.
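The reweighting idea behind Steps 1 to 4 can be illustrated on a deliberately simplified discrete problem. The sketch below is not the FBP-based filter design of [145]: it assumes a real-valued scene, an identity forward operator ($F = I$), $\alpha \equiv 1$, and the objective $\tfrac12\|d - \rho\|_2^2 + \lambda\|\rho\|_1$, so that each reweighted $\ell_2$ problem has a closed-form, pixel-wise solution. The function names, the floor `eps` on the weights, and the iteration count are illustrative assumptions.

```python
def soft_threshold(x, t):
    """Closed-form minimizer of 0.5*(x - r)^2 + t*|r| over r."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def irls_l1_denoise(d, lam, n_iter=200, eps=1e-12):
    """Reweighted l2 iterations for 0.5*||d - rho||^2 + lam*||rho||_1 with F = I.

    Each pass replaces |rho_i| by the quadratic surrogate
    0.5 * rho_i^2 / |rho_i^{(k-1)}| (the q = 1/2 weighting), whose minimizer
    is rho_i = d_i / (1 + lam * w_i) with w_i = 1 / |rho_i^{(k-1)}|.
    """
    rho = [1.0] * len(d)  # rho_0 = 1 on the support, as in the summary above
    for _ in range(n_iter):
        # Weights from the previous iterate, floored to avoid division by zero.
        w = [1.0 / max(abs(r), eps) for r in rho]
        # Pixel-wise solution of the reweighted quadratic problem.
        rho = [di / (1.0 + lam * wi) for di, wi in zip(d, w)]
    return rho

# Hypothetical data: entries above the threshold shrink by lam, the small one vanishes.
d = [3.0, 0.5, -2.0]
rho_irls = irls_l1_denoise(d, lam=1.0)
rho_soft = [soft_threshold(di, 1.0) for di in d]
```

In this toy setting the reweighted iterates approach the soft-thresholded data, which is the exact minimizer of the $\ell_1$ problem when $F = I$; entries that shrink to zero are held at the `eps` floor rather than at exactly zero.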

4.3.3.2 Iterative Shrinkage Thresholding Approach

The family of Iterative Shrinkage Thresholding Algorithms (ISTA) provides an alternative approach to the optimization problem in (4.43). Roughly speaking, in these iterative methods, each iteration comprises a multiplication by $F$ and its adjoint, along with a scalar shrinkage step on the obtained result. Despite their simple structure, these algorithms are shown to be very effective in addressing the optimization problems in (4.40) and (4.41). There are various iterative-shrinkage methods with variations between them. These algorithms are shown to emerge from different considerations, such as the Expectation-Maximization algorithm in statistical estimation theory, majorization-minimization, forward-backward splitting, variations on greedy methods and more [33, 34, 50, 51, 65]. Iterative shrinkage type of optimization involves the following steps:

$$\rho^{k+1} = \Psi_\alpha\big(\rho^k + \tau F^\dagger[d - F[\rho^k]]\big) \tag{4.50}$$

where $\Psi_\alpha$ is a shrinkage operator given as

$$\Psi_\alpha(\tilde\rho) = \operatorname{Argmin}_{\rho}\, J(\rho) = \tfrac12\|\rho - \tilde\rho\|^2 + \alpha\int V(\rho)(z)\,dz. \tag{4.51}$$

Challenges addressed by iterative-shrinkage type approaches include non-smooth objective functionals and large scale data, especially in SAR imaging.
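As a rough illustration of the iteration in (4.50), the sketch below runs iterative shrinkage on a small real-valued matrix problem with $V(\rho) = |\rho|$, for which the shrinkage operator in (4.51) reduces to soft-thresholding. It assumes a dense real matrix `F` stored as a list of rows, a zero initial guess, and a step size set to the inverse of the largest eigenvalue of $F^TF$ estimated by power iteration; all function names are hypothetical, and a practical SAR implementation would use complex data and fast operators instead of explicit matrices.

```python
import math

def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y (the adjoint of a real matrix applied to y)."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def soft_threshold(x, t):
    """Shrinkage operator for V = |.|: minimizer of 0.5*(x - r)^2 + t*|r|."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def largest_eigenvalue(F, n_iter=100):
    """Power iteration on F^T F (assumes F is nonzero and the all-ones
    start vector is not orthogonal to the leading eigenvector)."""
    v = [1.0] * len(F[0])
    lam_max = 0.0
    for _ in range(n_iter):
        w = rmatvec(F, matvec(F, v))
        lam_max = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / lam_max for wi in w]
    return lam_max

def ista(F, d, lam, n_iter=200):
    """Iterative shrinkage for 0.5*||F rho - d||^2 + lam*||rho||_1."""
    tau = 1.0 / largest_eigenvalue(F)  # step size tau = 1/L
    rho = [0.0] * len(F[0])
    for _ in range(n_iter):
        residual = [di - fi for di, fi in zip(d, matvec(F, rho))]
        step = rmatvec(F, residual)  # F^T (d - F rho)
        rho = [soft_threshold(r + tau * g, tau * lam) for r, g in zip(rho, step)]
    return rho

# Hypothetical 2x2 example with a diagonal forward matrix.
rho_hat = ista([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0], lam=1.0)
```

For this diagonal example the minimizer can be worked out by hand as $(2, 1.75)$, which gives a simple check that the iteration behaves as expected.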


Fig. 4.10 Phantom and reconstructed images at SNR 0 using IRTA with $V(\rho) = \rho^2/(1+\rho^2)$. (a) Rectangular phantom used in reconstruction. (b) Deterministic FBP reconstruction of the rectangular phantom. (c) The result after 1 iteration of IRTA. (d) The result after 20 iterations of IRTA. Adapted from [138]

To obtain a computationally efficient variation of iterative shrinkage type optimization, we consider the following approach. Let $K = BQ$ where $B$ is the backprojection operator and $Q$ is the operator whose kernel is the filter $Q$ defined in (4.18). We modify (4.50) as follows:

$$\rho_{k+1} = \Psi_\alpha\big(\rho_k + \tau K[d - F[\rho_k]]\big). \tag{4.52}$$

Solving (4.51) for a critical point, we have the expression

$$\tilde\rho_k = \rho_{k+1} + \alpha\nabla V(\rho_{k+1}), \tag{4.53}$$

where $\tilde\rho_k = \rho_k + \tau K[d - F[\rho_k]]$ is the argument of the shrinkage operator in (4.52). Linearizing (4.53) around $\rho_k$ using a Taylor expansion, we obtain the following iteration:

$$\rho_{k+1} \approx \rho_k + \tau\,[I + \alpha\nabla^2 V(\rho_k)]^{-1}\,r_k \tag{4.54}$$

where $r_k = K[d - F[\rho_k]] - \frac{\alpha}{\tau}\nabla V(\rho_k)$. We can rewrite (4.54) in the following alternative form:

$$\rho_{k+1} \approx \rho_k + \tau\Big(K_k[d - F[\rho_k]] - \frac{\alpha}{\tau}\,C_k[\rho_k]\Big) \tag{4.55}$$

where $K_k$ is an FBP operator in which the filter $Q_k$ now includes the term $[I + \alpha\nabla^2 V(\rho_k)]^{-1} \approx [I - \alpha\nabla^2 V(\rho_k)]$³, and $C_k[\rho_k] \approx [I - \alpha\nabla^2 V(\rho_k)]\nabla V(\rho_k)$.

While there is no formal analysis of this algorithm, we provide a brief and non-rigorous discussion of the expected performance. The iteration in (4.55) requires that the potential function $V$ be strictly convex to be well defined. Specifically, when $V$ is strictly convex the objective functional in (4.36) is also strictly convex for any positive regularization parameter, which means there is a single stationary point, which is a global minimum. In the case that $V$ is nonconvex and lower bounded, iteration (4.53) will lead to a stationary point that is not necessarily optimal. The iteration (4.50) has been studied in the finite dimensional setting. In this case (4.50) can be considered a proximal gradient descent and provably converges with an appropriate choice of step size $\tau > 0$. Specifically, the gradient descent is applied to the $\ell_2$-norm term, which has a Lipschitz continuous derivative, with Lipschitz constant equal to the largest eigenvalue of the discretized $F^\dagger F$ matrix. Thus, the step size $\tau$ should be upper bounded by two times the inverse of the Lipschitz constant to ensure a decrease in objective value on each iteration.

Figure 4.11 displays the results of the filtered backprojection, iterative re-weighted and iterative shrinkage algorithms, with the potential function $V(|\rho|) = \|\nabla\rho\|_1$ used for the latter two methods. These simulations are performed using the AFRL civilian vehicle data set [43]. For these simulations only 10% of the available pulses are used, which results in artefacts in the image. Gaussian noise is added at SNR 20 dB [138].

³ $[I - \alpha\nabla^2 V(\rho_k)]$ can be expressed as a time-varying convolution and combined with $Q$, the kernel of the filter $Q$ in $K$.

4.4 Numerical Optimization Methods for SAR Imaging

We now discretize the forward model upfront, formulate the imaging problem based on the discrete forward model and utilize numerical optimization for image reconstruction. This has been a popular approach in SAR imaging and is useful for incorporating prior knowledge. Using an orthonormal basis, the scene $\rho$ can be discretized to obtain a finite-dimensional vector $\rho$. Typical basis functions used in radar imaging are pixel basis functions. The domain of the data $(\omega, s)$ is discretized based on the system parameters used when acquiring the data. The slow-time variable $s$ is discretized at a sampling rate determined by the pulse repetition interval. The frequency variable is typically sampled uniformly based on the bandwidth of the transmitted waveforms. For a fixed slow-time sample $s_i$, we form $\tilde d_i = \tilde F_i\,\rho$. Stacking measurement vectors and matrices for all slow-time variables, we get

$$d = F\rho \tag{4.56}$$

where $d$ is a stacking of all $\tilde d_i$, and $F$ is a stacking of all $\tilde F_i$.

The most common approach to numerical optimization for SAR image formation is within the compressed sensing framework and focuses on promoting sparsity in the solution. This assumption is based on either representing the SAR image in a sparsifying basis or frame, or using the standard Euclidean basis and assuming the scene consists of a few strong scatterers. These methods offer the ability to achieve high resolution SAR imagery with limited data, suppressing noise and clutter, enhancing features for automatic target recognition tasks, autofocus and ground moving target imaging.

The sparsity based applications in radar can be roughly categorized into three classes. The first class includes those that utilize greedy algorithms [84, 112]. In [3, 76] greedy algorithms are used for wide-swath SAR imaging with reduced measurements and 2D inverse synthetic aperture imaging. Although these algorithms are straightforward to implement and have convergence properties under some conditions, they generally converge to a local minimum that is not the sparsest solution.

The second class of algorithms are the basis pursuit type algorithms, which employ convex relaxation of cardinality and rank minimization [21, 40, 98, 113]. This approach has been applied to radar imaging, the autofocus problem and ground moving target imaging in [26, 92–94, 130]. While these algorithms in general perform better than greedy ones, they require fine tuning of the regularization parameters. The basis pursuit type rank minimization class of algorithms emerges naturally in passive radar imaging problems and offers the advantage of avoiding limiting underlying assumptions [78, 79, 148].

The third class of algorithms includes those that are formulated in a Bayesian framework, and offers a number of advantages over the other two classes. First, noise and clutter are ubiquitous in radar signals; therefore, a statistical approach is better suited to modeling and processing radar signals than deterministic approaches. Secondly, the Bayesian approach offers a rich class of prior models as well as statistical quantification of results. Finally, the Bayesian approach does not require explicit regularization and hence avoids the need for fine tuning of regularization parameters. Bayesian compressed sensing has been applied to radar in [63, 85, 136, 147].

In the following sections, we provide an overview of convex relaxation based compressed sensing, low-rank matrix recovery methods, Bayesian compressed sensing and their application to radar imaging.

Fig. 4.11 Images reconstructed using the CV data dome data set and the IRTA and ISTA algorithm described in Sect. 4.4 for SNR of 20 dB. (a) Reconstruction of car using deterministic FBP. (b) Reconstructed car image using iterated reweighted least-squares method. (c) Reconstructed car image using the iterated shrinkage thresholding method. Adapted from [138]

4.4.1 Compressed Sensing in SAR Imaging

Compressed sensing involves efficiently acquiring and reconstructing a signal by solving an underdetermined linear system. The reconstruction relies on the assumptions that the signal is sparse in some transform domain and that the underlying linear system is incoherent. These assumptions also allow the signal to be recovered from fewer samples than required by the Nyquist sampling rate. The recovery of a sparse signal is formulated as a minimization of its cardinality or $\ell_0$-norm. However, $\ell_0$-norm minimization is generally NP-hard. In [17, 40], the connection between $\ell_0$ minimization and $\ell_1$ minimization is shown. In the presence of additive noise, the recovery of a sparse signal is formulated as follows:

$$P_9:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \frac{1}{2}\|F\rho - d\|_2^2 + \lambda\|\rho\|_1. \tag{4.57}$$

Effective solutions with performance guarantees were developed for this problem from different perspectives including robust statistics, optimization theory and compressive sensing theory [18, 19, 40, 58, 113]. A popular and simple approach is to use forward-backward splitting, which performs a gradient descent update of the $\ell_2$-norm term of the objective in the forward step, and then performs soft-thresholding using the shrinkage operator in the backward step [5]. The use of the $\ell_1$ norm to impose sparsity for the inverse problem in radar imaging is not new to compressed sensing [13, 37]. The $\ell_1$-norm has been used in regularizing the SAR image reconstruction problem, reducing the required number of measurements, and enhancing features for automatic target recognition (ATR) tasks, autofocus and ground moving target imaging [35, 70, 92, 106, 115].

In Fig. 4.12 we demonstrate the performance of some $\ell_1$ minimization methods using numerical solvers in the SparseLab Matlab toolbox. We use Least Angle Regression (LARS) [44], Least absolute shrinkage selection operator (LASSO) [109], and Polytope Faces Pursuit [91]. The performance is inferior to the analytical methods, which can be explained by the use of the discretized forward model matrix. Compressed sensing results depend on the columns of the measurement matrix having low coherence or satisfying the Restricted Isometry Property (RIP) with a small RIP constant. Generally, the discretized SAR forward model does not satisfy these requirements; thus, random sampling and specific waveforms are used to obtain these conditions. In the rest of this section we discuss approaches taken to improve the properties of the forward matrix.

Fig. 4.12 Reconstruction methods discussed in Sect. 4.4.1 of the phantom shown in (a) at 30 dB SNR. (a) Matlab Least-Squares (LSQ). (b) Least Angle Regression (LARS). (c) Least absolute shrinkage selection operator (LASSO). (d) Polytope Faces Pursuit (PFP). Adapted from [138]

One of the most well-known applications of compressed sensing involves sampling below the Nyquist rate. This is a desirable attribute in SAR imaging as it can be costly and difficult to collect measurements. In [4] the linear frequency modulated or chirp waveform is used to obtain a measurement matrix with low coherence. Thus, assuming the underlying scene is sparse, a good reconstruction is obtained. Imaging is performed by recasting the problem as a linear program and solving it numerically using standard methods. While [4] rely on the coherence of the measurement matrix, [131] use random sampling in slow-time to obtain a measurement matrix which satisfies the RIP condition with high probability, and obtain an algorithm that requires a sampling rate below Nyquist. Chai et al. [14, 28] develop an $\ell_1$-norm minimization method for array imaging using rectangular and circular arrays under the assumption that the scene contains only a few strong scatterers. Furthermore, they study the performance in terms of the mutual coherence and show asymptotically that as the size of the array and pixel spacing get larger compared to the wavelength, the mutual coherence tends to zero. While this analysis offers guidance in designing the system so that the forward matrix has desirable properties, performance guarantees are task dependent since the results are asymptotic.

The application of basis pursuit compressed sensing methods is not restricted to the case when the scene is sparse. They can also be applied if a SAR scene has a sparse representation in some basis or frame [26, 89, 92, 102, 106]. In [7] the Discrete Gabor Transform is used as a sparsifying dictionary, then Problem P9 given in (4.57) is solved in the transform domain for polarimetric SAR imaging. For SAR imaging and moving target detection, [93] argue that the standard Euclidean basis is appropriate when the transmitted signal is a chirp waveform.
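Since several of the approaches above are judged by the coherence of the measurement matrix, a small sketch of how mutual coherence can be computed may be helpful. It uses the standard definition, the largest absolute normalized inner product between two distinct columns, and assumes a dense matrix stored as a list of rows with no all-zero columns; the function name `mutual_coherence` and the example matrix are illustrative assumptions, not taken from the cited works.

```python
import math

def mutual_coherence(F):
    """Largest |<f_a, f_b>| / (||f_a|| ||f_b||) over distinct columns of F.

    F is a list of rows with real or complex entries; columns are assumed
    to be nonzero so that the normalization is well defined.
    """
    n_rows, n_cols = len(F), len(F[0])
    cols = [[F[i][j] for i in range(n_rows)] for j in range(n_cols)]
    norms = [math.sqrt(sum(abs(v) ** 2 for v in c)) for c in cols]
    mu = 0.0
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            # Inner product with conjugation on the first argument.
            inner = sum(x.conjugate() * y for x, y in zip(cols[a], cols[b]))
            mu = max(mu, abs(inner) / (norms[a] * norms[b]))
    return mu

# Hypothetical 2x3 matrix: the third column is the sum of the first two.
mu_example = mutual_coherence([[1, 0, 1], [0, 1, 1]])
```

Values close to zero indicate nearly orthogonal columns, while values close to one indicate that some pair of columns is almost parallel, which makes sparse recovery harder.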

4.4.1.1 Total Variation Regularization in SAR Imaging

Typical remote sensing or photographic images are not necessarily sparse; however, they are often piecewise constant, making the presence of edges sparse. Total variation (TV) is a measure of the magnitude of the gradient of an image. For a piecewise smooth image, its TV is expected to be small. In this section, we review total variation as a regularization method promoting piecewise smooth image reconstruction. The isotropic total variation of $f \in \mathbb{C}^{K\times N}$ is defined as [101]

$$TV(f) = \sqrt{\sum_{i=1}^{N-1}\sum_{j=1}^{K-1} |f(i,j) - f(i,j+1)|^2 + |f(i+1,j) - f(i,j)|^2}. \tag{4.58}$$

In words, the total variation of $f$ is the $\ell_2$ norm of $\nabla f$, where $\nabla$ represents a discrete gradient operator. More compactly, in vector notation we have $TV(f) = \|\nabla f\|_2$. TV, when viewed as a function from $\mathbb{R}^{K\times N}$ to $\mathbb{R}_+$, is a sublinear function and therefore convex [61]. Thus, the regularized linear least squares problem with TV regularization is convex; specifically,

$$P_{10}:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \frac{1}{2}\|F\rho - d\|_2^2 + \lambda\,TV(\rho) \tag{4.59}$$

is convex. Numerous algorithms have been proposed for addressing the problem (4.59) in general image processing applications; see for instance [29, 47] and the references therein.

In [90] a random sampling method based on jitter is used to obtain a forward model with low coherence, and (4.57) is used to reconstruct the SAR image. This model includes the use of tight frames to obtain a sparse representation. When speckle is present, Patel et al. [90] recommend additional regularization using TV as defined in (4.58). However, in practice SAR images are complex valued, so it is common to replace $TV(\rho)$ with $TV(|\rho|)$. Total variation regularization promotes piecewise continuity in the solution. With the addition of this term, (4.57) becomes Problem P11 with $p = 1$. The additional regularization term makes the problem non-convex and more challenging to optimize. The use of total variation to suppress the effects of clutter has also been studied in [24, 25, 92]. In [24] the following problem is posed:

$$P_{11}:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \frac{1}{2}\|F\rho - d\|_2^2 + \lambda_1\|\rho\|_p^p + \lambda_2\,TV(|\rho|) \tag{4.60}$$

where $0 < p \le 1$. When $p \le 1$ this term is non-smooth; thus one approach to make the objective function of (4.60) smooth is to replace the $\ell_p$-norm with

$$\|\rho\|_p^p \approx \sum_{i=1}^{N}\big(|\rho_i|^2 + \epsilon\big)^{p/2} \tag{4.61}$$

where $\epsilon > 0$ is a small constant. The use of regularization with the $\ell_p$-norm, $p \le 1$, of the gradient of the scene promotes the enhancement of useful features such as piecewise continuity and sparsity. The resulting image possesses strong edges and suppressed noise and speckle. The optimization problem P11 is solved using a quasi-Newton method with an approximation for the Hessian that uses the underlying structure of the problem. Cetin et al. [25] consider P11 with $\lambda_1 = 0$, which is solved using a half-quadratic minimization method.
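The definitions in (4.58) and (4.61) translate directly into a few lines of code. The sketch below follows (4.58) literally, taking the square root of the summed squared forward differences over the interior index range, and treats the image as a list of rows with real or complex entries. The function names and the small test image are illustrative assumptions.

```python
import math

def total_variation(f):
    """TV(f) as written in (4.58): the l2 norm of the forward differences."""
    rows, cols = len(f), len(f[0])
    acc = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            acc += abs(f[i][j] - f[i][j + 1]) ** 2  # horizontal difference
            acc += abs(f[i + 1][j] - f[i][j]) ** 2  # vertical difference
    return math.sqrt(acc)

def magnitude(f):
    """Pixel-wise magnitude |rho|, used when TV(rho) is replaced by TV(|rho|)."""
    return [[abs(v) for v in row] for row in f]

def smoothed_lp(rho, p, eps):
    """Smooth surrogate of ||rho||_p^p from (4.61)."""
    return sum((abs(r) ** 2 + eps) ** (p / 2.0) for r in rho)

# Hypothetical complex image with unit magnitude everywhere but varying phase.
phase_only = [[1j, 1.0], [-1.0, 1j]]
tv_complex = total_variation(phase_only)
tv_magnitude = total_variation(magnitude(phase_only))
```

The phase-only example shows why the choice between $TV(\rho)$ and $TV(|\rho|)$ matters for complex SAR images: the magnitude image is constant, so `tv_magnitude` is zero, while the phase variation alone makes `tv_complex` positive.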

4.4.2 Low-Rank Matrix Recovery Methods

In this section, we discuss a low-rank matrix recovery (LRMR) formulation applicable to passive and intensity-only imaging [27, 78]. In LRMR passive SAR image formation the forward model is formed by correlating bistatic measurements at two different receivers [78]. The result of this correlation is a non-linear problem in the unknown scene reflectivity; however, the problem can be reformulated using the well-known lifting technique [22]. Using this observation one obtains a linear model acting on a rank-one unknown positive definite operator formed by the tensor product of the scene reflectivity with itself [78]. Thus, in this section we refer to the forward model and unknown as $F$ and $\rho$ as we have before; however, it should be noted that in this case these parameters correspond to those used in [78]⁴. When the unknown $\rho$ is low-rank, a natural approach is to pose image formation as a rank minimization problem with a data fidelity constraint. However, since rank minimization is an NP-hard combinatorial problem, it is typical to use the nuclear norm $\|\cdot\|_*$, defined as the sum of the singular values of $\rho$, as a heuristic. This regularization function is convex, which means that when a linear or quadratic data fidelity constraint is used the problem is convex. Furthermore, the nuclear norm is non-smooth; thus, a popular approach to optimization is to use forward-backward splitting methods [34]. In fact, following the theoretical results of compressed sensing, one can develop equivalent recovery results based on the RIP condition for low-rank matrices [98], and more general low-dimensional structures [31]. Using the nuclear norm heuristic, when the unknown has low rank a natural formulation is the following problem:

$$P_{12}:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \frac{1}{2}\|F[\rho] - d\|_2^2 + \lambda\|\rho\|_*. \tag{4.62}$$

⁴ The passive SAR forward model is formed by correlating two sets of bistatic measurements in fast-time. The resulting forward model is $\tilde F = F_bF_b$ where $F_b$ is (4.1) with phase (4.4). The unknown is then modeled as $\tilde\rho = \rho\otimes\rho$, where $\otimes$ denotes the tensor product. For notational simplicity and uniformity, $\tilde F \to F$ and $\tilde\rho \to \rho$ in this section.

This problem and certain popular variants are convex and can be solved efficiently using a variety of methods [16, 81]. In [78] the unknown Kronecker scene is formed by taking a tensor product of a function with itself, resulting in a rank-one and positive semidefinite operator. The nuclear norm can therefore be replaced with the trace, since all the eigenvalues are non-negative, and sparsity of the point spectrum is favoured using trace regularization. Thus, problem P12 can be recast as the following constrained problem:

$$P_{13}:\quad \hat\rho = \operatorname{Argmin}_{\rho}\, J(\rho) = \frac{1}{2}\|F[\rho] - d\|_2^2 + \lambda\operatorname{Tr}[\rho] \quad\text{such that}\quad \rho \succeq 0 \tag{4.63}$$

where $\rho \succeq 0$ restricts $\rho$ to the positive semidefinite cone. The simplest forward-backward splitting approach to solve this is a projected gradient descent defined by the following iteration:

$$\rho_{k+1} = P_{PSD}\Big(\rho_k - \alpha_k\big(F^\dagger(F[\rho_k] - d) + \lambda I_\Omega\big)\Big) \tag{4.64}$$

where $P_{PSD}$ is a projection onto the positive semidefinite cone and is realized as a hard thresholding of the negative eigenvalues. If the step size $\alpha_k$ is fixed to be strictly less than the inverse of the Lipschitz constant of the gradient of the objective functional, then the iterates converge; alternatively, a line search can be used to obtain faster convergence. In [78] the iteration in (4.64) is used along with a Nesterov momentum update to accelerate convergence. Furthermore, under certain conditions the passive SAR forward operator and its adjoint are Fourier integral operators, which means that gradient based methods can be implemented in a computationally efficient manner using fast Fourier integral operator algorithms [20, 38, 64].
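The projection $P_{PSD}$ in (4.64) is described as a hard thresholding of the negative eigenvalues. A minimal sketch of that single step is given below for a real symmetric 2×2 matrix $[[a, b], [b, c]]$, where the eigen-decomposition is available in closed form. The restriction to real 2×2 matrices and the function name `project_psd_2x2` are assumptions made to keep the example self-contained; the lifted passive SAR unknown is a large complex operator and would need a general eigen-solver.

```python
import math

def project_psd_2x2(a, b, c):
    """Project the symmetric matrix [[a, b], [b, c]] onto the PSD cone.

    Negative eigenvalues are set to zero and the matrix is rebuilt from the
    remaining eigenpairs. Returns the entries (a, b, c) of the projection.
    """
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    lam_hi, lam_lo = mean + radius, mean - radius  # eigenvalues, lam_hi >= lam_lo
    if lam_lo >= 0.0:
        return (a, b, c)  # already positive semidefinite
    if lam_hi <= 0.0:
        return (0.0, 0.0, 0.0)  # both eigenvalues are thresholded away
    # Exactly one positive eigenvalue: keep lam_hi * v v^T with v its eigenvector.
    # Two algebraically valid eigenvector formulas; pick the better conditioned one.
    if abs(lam_hi - c) >= abs(lam_hi - a):
        v1, v2 = lam_hi - c, b
    else:
        v1, v2 = b, lam_hi - a
    scale = lam_hi / (v1 * v1 + v2 * v2)
    return (scale * v1 * v1, scale * v1 * v2, scale * v2 * v2)

# Hypothetical example: [[1, 2], [2, 1]] has eigenvalues 3 and -1.
projected = project_psd_2x2(1.0, 2.0, 1.0)
```

For the example, only the eigenvalue 3 with eigenvector proportional to $(1, 1)$ survives, so the projection has all entries equal to 1.5.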


4.4.3 Bayesian Compressed Sensing

From a Bayesian perspective, the sparse signal recovery problem can be viewed as a maximum a posteriori (MAP) estimation problem from observations embedded in white Gaussian noise with a sparsity-promoting prior model. The use of sparsity-promoting or robust prior models also has a long history in the area of Bayesian image reconstruction (see [9, 69] and references therein). A Bayesian approach coupled with a sparsity-promoting prior is sometimes referred to as Bayesian compressed sensing [67]. This approach has found success in SAR imaging, specifically in the presence of noise and clutter [23, 63, 85, 99, 134]. The Laplacian prior [2, 69] leads to the $\ell_1$ problem in (4.57). The Laplacian prior has a long history in signal processing problems [32, 75, 103] and has also been used extensively within the MAP formulation for SAR imaging [26, 99, 133]. In this case, however, the posterior does not have a closed form. An alternative approach is to use conjugate priors which, with an appropriate choice of parameters, produce a distribution that is sharply peaked with its mode at zero. Coupled with a hierarchical Bayesian model, this approach yields sparsity-promoting models with tractable posterior distributions. The relevance vector machine (RVM) offers an alternative approach to obtaining a sparsity-inducing prior, utilizing the hierarchical Bayes formulation [110]. This approach has been used for SAR imaging [121]. Assuming that the elements of $\rho$ are independent and have zero mean, the scene reflectivity is distributed according to

$$\rho \mid \alpha \sim \prod_i \mathcal{N}(\rho_i \mid 0, \alpha_i^{-1}), \tag{4.65}$$

where $\mathcal{N}$ denotes the Gaussian distribution. Here $\alpha$ and $\alpha^{-1}$ are vectors such that the elements of $\alpha^{-1}$ are the reciprocals of the elements of $\alpha$ and equal the variance of each pixel. The vector $\alpha$ is referred to as the “precision” variable; its elements are independent and distributed according to

$$\alpha \mid a, b \sim \prod_i \Gamma(\alpha_i \mid a, b), \tag{4.66}$$

where $\Gamma$ denotes the Gamma distribution with parameters $a$ and $b$. The Gamma distribution is chosen to model the unknown precision (the inverse variance), since it is the conjugate prior for the precision of a Gaussian. Combining (4.65) and (4.66) and marginalizing over $\alpha$, the distribution of the scene given the parameters $a$ and $b$, $\rho \mid a, b$, becomes a Student-t distribution. The parameters $a$ and $b$ can be chosen such that this distribution is strongly peaked at $\rho = 0$, thus promoting sparsity. The advantage of this type of prior is that it allows for efficient sampling of the posterior using iterative algorithms. Finally, the noise is assumed to be zero-mean Gaussian with variance $\sigma^2$, whose inverse is assigned the Gamma prior $\Gamma(\sigma^{-2} \mid d, c)$. If the hyperparameters $\alpha$ and $\sigma^2$ are known, then the posterior distribution of $\rho$ is a multivariate Gaussian parametrized by

$$\mu = \sigma^{-2}\,\Sigma F^H d, \qquad \Sigma = (\sigma^{-2} F^H F + A)^{-1}, \tag{4.67}$$

where $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, and $A = \operatorname{diag}(\alpha_i)$. The parameters $\alpha$ and $\sigma^2$ can be estimated by maximizing the posterior distribution. The MAP estimates of $\alpha$ and the noise variance $\sigma^2$ are

$$\alpha_i = \frac{\beta_i}{\mu_i^2}, \qquad \sigma^2 = \frac{\|d - F\mu\|_2^2}{N - \sum_i \beta_i}, \tag{4.68}$$

where $\beta_i = 1 - \alpha_i \Sigma_{ii}$ and $N$ is the total number of samples in fast- and slow-time. Since the hyperparameters and the posterior depend on each other, the estimation is performed iteratively, alternating between the calculation of (4.67) and (4.68) until a stopping criterion is met. Efficient algorithms also exist to carry out this procedure [48, 111]. Another prior for the scene reflectivity that promotes sparsity is the spike-and-slab prior [60, 147]:

$$\rho \mid \pi, \alpha \sim \prod_i \big[(1 - \pi_i)\,\delta(\rho_i) + \pi_i\,\mathcal{N}(\rho_i \mid 0, \alpha_i^{-1})\big], \tag{4.69}$$


where $\pi_i$ is the prior probability of a non-zero element. This model can be difficult to use in practice, but it has an equivalent representation that is more tractable. Specifically, let $\rho = \theta \circ \gamma$, where $\circ$ denotes elementwise multiplication, $\theta$ is a Gaussian distributed random vector, and $\gamma$ is Bernoulli distributed, such that

$$\theta \sim \prod_i \mathcal{N}(0, \beta_i^{-1}) \quad \text{and} \quad \gamma \sim \prod_i \operatorname{Bern}(\pi_i),$$

where $i$ denotes the $i$-th index of the vector. As in the case of the RVM method, it is common to choose a Gamma distribution for the hyperparameter $\beta$, since it is the conjugate prior for the precision of a Gaussian distribution. This model promotes group sparsity and has been used in constructing Bayesian compressed sensing methods for SAR imaging [135, 136].
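The alternation between (4.67) and (4.68) in the RVM approach can be sketched as below. This is a minimal illustration under stated assumptions, not one of the fast algorithms of [48, 111]: the data are taken to be real-valued, `F` is a small dense matrix so that the covariance can be formed by direct inversion, the initial values of `alpha` and `sigma2` and the cap `alpha_max` are hypothetical choices, and a fixed iteration count replaces a stopping criterion.

```python
import numpy as np

def rvm_estimate(F, d, num_iters=200, alpha_max=1e8):
    """Alternate between (4.67) and (4.68) for real-valued F (N x M) and d (N,)."""
    N, M = F.shape
    alpha = np.ones(M)                 # initial precisions (assumed starting point)
    sigma2 = 0.1 * np.var(d)           # initial noise variance (assumed starting point)
    FtF, Ftd = F.T @ F, F.T @ d
    for _ in range(num_iters):
        # (4.67): posterior covariance and mean for the current hyperparameters
        Sigma = np.linalg.inv(FtF / sigma2 + np.diag(alpha))
        mu = Sigma @ Ftd / sigma2
        # (4.68): beta_i = 1 - alpha_i * Sigma_ii, clipped to [0, 1] against round-off
        beta = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)
        # precision update, capped so that pruned elements stay numerically finite
        alpha = np.minimum(beta / np.maximum(mu ** 2, 1e-16), alpha_max)
        # noise variance update
        sigma2 = np.sum((d - F @ mu) ** 2) / (N - beta.sum())
    return mu, alpha, sigma2

# Toy usage: a sparse scene with three non-zero elements observed in noise.
rng = np.random.default_rng(1)
N, M = 60, 30
F = rng.standard_normal((N, M))
rho = np.zeros(M)
rho[[3, 11, 20]] = [2.0, -3.0, 1.5]
d = F @ rho + 0.05 * rng.standard_normal(N)
mu, alpha, sigma2 = rvm_estimate(F, d)
print(np.round(mu[[3, 11, 20]], 2), sigma2)
```

Large entries of `alpha` correspond to elements whose prior variance has been driven toward zero, which is the mechanism by which the hierarchical prior produces sparse estimates.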

4.5 Optimization in SAR Related Problems

The use of optimization methods in SAR is not restricted to imaging of stationary scenes; we restricted our attention to that case because a full review of optimization in radar would be a book on its own. In this section we discuss the application of numerical optimization methods to problems related to SAR imaging. Specifically, we discuss how the methods of the previous section have been applied to SAR interferometry, moving target detection, recovery of missing data, and the autofocus problem. While the details of these problems vary, they can all be posed as inverse problems, for which numerical optimization is a popular solution approach. Thus, in the rest of this section we use notation corresponding to the linear model $d = F\rho$, but note that the meaning of these terms varies for each problem.

• SAR interferometry (InSAR) is a method to form three-dimensional SAR imagery using the phase difference between two different complex SAR images [11, 46, 52, 139, 144, 150]. The process can be modeled as a linear mapping of the scene reflectivity, which is a function of the elevation and can be obtained by solving an inverse problem. The previously discussed methods are applicable to this problem; the choice of regularization depends on the terrain. The effects of multiple scattering are more apparent in InSAR imagery than in classical SAR imagery, and man-made structures in urban environments call for different regularization than forested areas. Specifically, in urban environments the InSAR scenes are sparse, since scattering off the ground, facades or roofs dominates the image [152]; this is not the case in forested environments [1]. Based on this observation, [11, 12, 149] use the compressed sensing methods discussed in the previous section for urban environments and pose Problem $P_9$, discussed in Sect. 4.4.1. This approach is modified in [151, 152], where a hard thresholding is applied to obtain an estimate of the support set of the image, and a regularized least-squares solution is then computed restricted to the columns of the measurement matrix corresponding to the support set. This can be carried out analytically using the FBP solution of Problem $P_2$ in Sect. 4.3.1.1. The performance of these methods is analyzed using the RIP constant for two scatterers. It is shown that the RIP constant decreases as the distance between the scatterers increases [149]. Alternatively, [12] considers the coherence of the measurement matrix as a performance criterion. In the case of forested terrain the scene is not directly sparse; however, it may be regarded as sparse in a different basis or frame. A wavelet transform is used in [1] to obtain a sparse representation of the scene. TV regularization is added to the sparse reconstruction methods, and Problem $P_{11}$, given in Sect. 4.4.1.1, is solved.

• When there are moving targets in the scene, neglecting this motion causes blur, often called smearing [42, 66, 106]. The velocity of the target appears as a Doppler scaling or shifting in the phase function and as a coupling between the position and velocity variables [79, 106, 118, 126]. However, the velocity space can be regarded as sparse, since each scatterer can only be travelling at a single velocity. Estimating these variables jointly is difficult; therefore, it is common to estimate position and velocity separately. In the setting of passive radar we take this approach, utilizing the low-rank and sparse structure to solve the inverse problem using numerical optimization methods [79]. First, an estimate of the scatterer positions is obtained using the LRMR approach described in Sect. 4.4.2. Then the velocity is estimated for each scatterer using greedy methods. A similar approach is taken in [106], where the problem of position and velocity estimation is posed as an $\ell_0$-norm minimization that includes a binary constraint used to indicate the true velocity. Since the problem is nonconvex and mixed-integer, it is not tractable. Thus, a two-step convex relaxation is posed: first an $\ell_1$-norm minimization is performed; in the second step the maximum element is retained and the others are set to zero. The second step is equivalent to enforcing the single-velocity constraint at each pixel.

• In Sect. 4.4.1 we discussed how compressed sensing theory can be used to reduce the sampling rate requirement for SAR imaging. Here we discuss alternative approaches to this problem. The methods discussed here do not directly leverage CS theory to sample at lower rates, but instead use these ideas to develop alternative solutions that outperform conventional imaging methods using fewer measurements [36, 73, 137]. In [107] the problem of interrupted SAR is addressed, and performance is evaluated in the context of coherent change detection. Using multiple passes, a group sparsity optimization problem is posed with a constraint coupling the measured data at different passes to form imagery. The method is shown experimentally to be more robust to limited measurements. An alternative approach is to use LRMR and the special case of matrix completion (MC) to synthesize missing data [36, 132, 137]. In [137] a matrix completion formulation is proposed for sparse scenes when sampling is performed at irregular intervals. Specifically, the authors note that the received data, in matrix form, has low rank after range compression and migration correction. This allows MC to be used to interpolate new data. Taking a similar approach, [132] considers the Swerling target model and assumes sparsity of the scatterers with respect to the number of measurements. Under these assumptions the received data matrix is low-rank. It is shown that a random uniform sampling scheme results in a matrix satisfying the coherence requirements of MC theory, allowing for synthesis of new data.

• In Sect. 4.2 the SAR forward model depends on accurate knowledge of the antenna position. Typically, error is introduced in the antenna position due to hardware limitations; this is especially true for small unmanned aerial vehicles. Inexact knowledge of the position of the SAR platform, or propagation medium characteristics that deviate from the ideal, lead to phase errors in the data and a poorly focussed image [68, 82, 83]. The autofocus procedure aims to reduce or remove the effect of this phase error. The phase error is modeled as a unit-amplitude complex exponential, which is the kernel of a filter $\mathcal{W}$. Using this approach we can view the measured data as $d = WF\rho$, where $W$ is the matrix representation of the filter $\mathcal{W}$. Thus, the autofocus problem can be posed as a joint optimization over the unknown scene vector $\rho$ and $W$. The resulting problem fits into the general form of a regularized least-squares problem with a constant modulus constraint on the filter kernel, which is an extensively used approach [70, 71, 88, 115]. In [77] an alternative formulation is posed in which the problem is reformulated as a constant modulus quadratic program (CMQP) and solved efficiently as a semidefinite program (SDP) [10].
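As an illustration of the matrix completion idea in the missing-data item above, the sketch below fills in unobserved entries of a low-rank data matrix by repeated singular value thresholding with a decreasing threshold. It is not the algorithm of [132] or [137]; the function names, the threshold schedule (`decay`, `tau_min`), and the iteration count are hypothetical choices made for a small dense example.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: soft-threshold the singular values of X by tau."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vh

def complete_matrix(D, mask, num_iters=300, decay=0.9, tau_min=1e-3):
    """Estimate a low-rank matrix from the entries of D where mask is True.
    The threshold starts at the largest singular value of the observed data
    and is reduced geometrically (an assumed continuation schedule)."""
    X = np.zeros_like(D)
    tau = np.linalg.norm(np.where(mask, D, 0.0), 2)
    for _ in range(num_iters):
        filled = np.where(mask, D, X)   # observed samples, current estimate elsewhere
        X = svt(filled, tau)            # shrink singular values toward low rank
        tau = max(decay * tau, tau_min)
    return X

# Toy usage: a rank-2 "data matrix" with roughly 30% of its entries missing.
rng = np.random.default_rng(2)
L_true = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
mask = rng.random((30, 30)) < 0.7
D_obs = np.where(mask, L_true, 0.0)
L_hat = complete_matrix(D_obs, mask)
print(np.linalg.norm((L_hat - L_true)[~mask]) / np.linalg.norm(L_true[~mask]))
```

The printed value is the relative error on the entries that were not observed. Starting with a large threshold keeps the early iterates low-rank; a fixed small threshold from the start would follow the same update but typically needs many more iterations.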

4.6 Conclusion

We presented a review of recent developments in SAR image formation from an optimization perspective. We began with a discussion of GRTs as SAR received signal models. GRTs are more versatile than their alternatives, avoid limiting assumptions, provide a unifying mathematical framework for different SAR modalities and configurations, and offer advantages in computational efficiency. We next presented SAR image reconstruction methods in two broad classes: analytic and numerical optimization based methods. In both classes, the majority of SAR image reconstruction methods can be viewed as solutions to constrained least-squares problems. Analytic image reconstruction methods take advantage of the underlying mathematical structure of GRTs and obtain approximate solutions to the underlying optimization problems analytically. This approach results in FBP or BPF operators for image reconstruction. When the optimization functional is non-quadratic, it typically results in sequences of FBP or BPF operators in which the underlying filters can be designed to obtain approximate minimizers. The optimization problem can be posed in either a deterministic or a statistical setting. This approach has the advantage of computational efficiency. We then presented large-scale numerical optimization based SAR image reconstruction methods. This approach uses a discretized SAR forward model and poses the optimization problems in finite-dimensional vector spaces. The most common approach to SAR image reconstruction within an optimization framework is based on compressed sensing to promote sparsity. We reviewed basis-pursuit type and Bayesian compressive sensing based reconstruction methods. In this approach, compressed sensing offers a theoretical framework to study performance guarantees and to explore other degrees of freedom in imaging based on novel sampling and forward matrix design.

In both analytic and numerical optimization based approaches, the statistical approach has the inherent advantages of avoiding regularization parameter tuning, employing flexible and appropriate statistical models, and obtaining statistical error bounds. In both approaches, the use of sparsity-promoting, novel, and flexible prior models offers substantial improvements in image quality over conventional methods.

Acknowledgements This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-16-1-0234, and by the National Science Foundation (NSF) under Grant No. CCF-1421496.

References

1. Aguilera E, Nannini M, Reigber A (2013) Wavelet-based compressed sensing for SAR tomography of forested areas. IEEE Trans Geosci Remote Sens 51(12):5283–5295
2. Alliney S, Ruzinsky SA (1994) An algorithm for the minimization of mixed l1 and l2 norms with application to Bayesian estimation. IEEE Trans Signal Process 42(3):618–627
3. Alonso MT, Lopez-Dekker P, Mallorqui JJ (2010) A novel strategy for radar imaging based on compressive sensing. IEEE Trans Geosci Remote Sens 48(12):4285–4295
4. Baraniuk R, Steeghs P (2007) Compressive radar imaging. In: 2007 IEEE radar conference, pp 128–133
5. Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
6. Besag J (1993) Towards Bayesian image analysis. J Appl Stat 20(5–6):107–119
7. Biondi F (2014) SAR tomography optimization by interior point methods via atomic decomposition – the convex optimization approach. In: 2014 IEEE geoscience and remote sensing symposium, pp 1879–1882
8. Blake A (1989) Comparison of the efficiency of deterministic and stochastic algorithms for visual reconstruction. IEEE Trans Pattern Anal Mach Intell 11(1):2–12
9. Bouman C, Sauer K (1993) A generalized Gaussian image model for edge-preserving MAP estimation. IEEE Trans Image Process 2(3):296–310
10. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York

11. Budillon A, Evangelista A, Schirinzi G (2009) SAR tomography from sparse samples. In: 2009 IEEE international geoscience and remote sensing symposium, vol 4, pp IV-865–IV-868
12. Budillon A, Evangelista A, Schirinzi G (2011) Three-dimensional SAR focusing from multipass signals using compressive sampling. IEEE Trans Geosci Remote Sens 49(1):488–499
13. Burns JW, Subotic NS, Pandelis D (1997) Adaptive decomposition in electromagnetics. In: Antennas and propagation society international symposium, IEEE, 1997 digest, vol 3, pp 1984–1987
14. Chai A, Moscoso M, Papanicolaou G (2013) Robust imaging of localized scatterers using the singular value decomposition and ℓ1 minimization. Inverse Problems 29(2):025016
15. Cafforio C, Prati C, Rocca F (1991) SAR data focusing using seismic migration techniques. IEEE Trans Aerosp Electron Syst 27(2):194–207
16. Cai J-F, Candes EJ, Shen Z (2010) A singular value thresholding algorithm for matrix completion. SIAM J Optim 20(4):1956–1982
17. Candes EJ (2008) The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique 346(9):589–592
18. Candes EJ, Tao T (2006) Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans Inf Theory 52(12):5406–5425
19. Candes EJ, Tao T (2007) The Dantzig selector: Statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
20. Candes EJ, Demanet L, Ying L (2007) Fast computation of Fourier integral operators. SIAM J Sci Comput 29(6):2464–2493
21. Candes EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772
22. Candes EJ, Strohmer T (2013) PhaseLift: Exact and stable recovery from magnitude measurements via convex programming. Comm Pure Appl Math 66:1241–1274
23. Carlin M, Rocca P, Oliveri G, Massa A (2013) Bayesian compressive sensing as applied to directions-of-arrival estimation in planar arrays. JECE 2013:1:1–1:1
24. Cetin M, Karl WC (2001) Feature-enhanced synthetic aperture radar image formation based on nonquadratic regularization. IEEE Trans Image Process 10(4):623–631
25. Cetin M, Karl WC, Willsky AS (2006) Feature-preserving regularization method for complex-valued inverse problems with application to coherent imaging. Opt Eng 45(1):017003
26. Cetin M, Stojanovic I, Onhon O, Varshney K, Samadi S, Karl WC, Willsky AS (2014) Sparsity-driven synthetic aperture radar imaging: Reconstruction, autofocusing, moving targets, and compressed sensing. IEEE Signal Process Mag 31(4):27–40
27. Chai A, Moscoso M, Papanicolaou G (2011) Array imaging using intensity-only measurements. Inverse Problems 27(1):015005
28. Chai A, Moscoso M, Papanicolaou G (2014) Imaging strong localized scatterers with sparsity promoting optimization. SIAM J Imaging Sci 7(2):1358–1387
29. Chambolle A, Pock T (2011) A first-order primal-dual algorithm for convex problems with applications to imaging. J Math Imaging Vis 40(1):120–145
30. Chan TF, Esedoglu S (2005) Aspects of total variation regularized l1 function approximation. SIAM J Appl Math 65(5):1817–1837
31. Chandrasekaran V, Recht B, Parrilo PA, Willsky AS (2012) The convex geometry of linear inverse problems. Found Comput Math 12(6):805–849
32. Claerbout JF, Muir F (1973) Robust modeling with erratic data. Geophysics 38(5):826–844
33. Combettes PL, Wajs VR (2005) Signal recovery by proximal forward-backward splitting. Multisc Model Simul 4(4):1168–1200. https://doi.org/10.1137/050626090
34. Combettes PL, Pesquet J-C (2011) Proximal splitting methods in signal processing. In: Bauschke HH, Burachik RS, Combettes PL, Elser V, Luke DR, Wolkowicz H (eds) Fixed-point algorithms for inverse problems in science and engineering. Springer, New York, pp 185–212. https://doi.org/10.1007/978-1-4419-9569-8_10
35. Cetin M, Stojanović I, Onhon NO, Varshney KR, Samadi S, Karl WC, Willsky AS (2014) Sparsity-driven synthetic aperture radar imaging: Reconstruction, autofocusing, moving targets, and compressed sensing. IEEE Signal Process Mag 31(4):27–40
36. Dao M, Nguyen L, Tran TD (2013) Temporal rate up-conversion of synthetic aperture radar via low-rank matrix recovery. In: 20th IEEE international conference on image processing, pp 2358–2362
37. Delaney AH, Bresler Y (1996) A fast and accurate Fourier algorithm for iterative parallel-beam tomography. IEEE Trans Image Process 5(5):740–753
38. Demanet L, Ferrara M, Maxwell N, Poulson J, Ying L (2012) A butterfly algorithm for synthetic aperture radar imaging. SIAM J Imaging Sci 5(1):203–243
39. Doerry AW (2012) Basics of polar-format algorithm for processing synthetic aperture radar images. Sandia National Laboratories report SAND2012-3369, Unlimited Release
40. Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
41. Duistermaat JJ, Hormander L (1972) Fourier integral operators. II. Acta Mathematica 128(1):183–269

42. Duman K, Yazici B (2015) Moving target artifacts in bistatic synthetic aperture radar images. IEEE Trans Comput Imaging 1(1):30–43
43. Dungan KE, Austin C, Nehrbass J, Potter LC (2010) Civilian vehicle radar data domes. Proc SPIE 7699:76990P–76990P-12. http://dx.doi.org/10.1117/12.850151
44. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
45. Elad M (2010) Sparse and redundant representations: from theory to applications in signal and image processing, 1st edn. Springer Publishing Company, Incorporated
46. Ertin E, Moses RL, Potter LC (2010) Interferometric methods for three-dimensional target reconstruction with multipass circular SAR. IET Radar Sonar Navigation 4(3):464–473
47. Esser E, Zhang X, Chan TF (2010) A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J Imaging Sci 3(4):1015–1046
48. Faul AC, Tipping ME (2001) Analysis of sparse Bayesian learning. Adv Neural Inf Process Syst, MIT Press 14:383–389
49. Ferrara M, Parker JT, Cheney M (2013) Resolution optimization with irregularly sampled Fourier data. Inverse Problems 29(5):054007
50. Figueiredo MAT, Nowak RD (2003) An EM algorithm for wavelet-based image restoration. IEEE Trans Image Process 12(8):906–916. doi:10.1109/TIP.2003.814255
51. Figueiredo MAT, Bioucas-Dias JM, Nowak RD (2007) Majorization-minimization algorithms for wavelet-based image restoration. IEEE Trans Image Process 16(12):2980–2991. doi:10.1109/TIP.2007.909318
52. Fornaro G, Lombardini F, Serafino F (2005) Three-dimensional multipass SAR focusing: experiments with long-term spaceborne data. IEEE Trans Geosci Remote Sens 43(4):702–714
53. Geman S, McClure D (1985) Bayesian image analysis: an application to single photon emission tomography. Am Stat Assoc, pp 12–18
54. Geman D, Reynolds G (1992) Constrained restoration and the recovery of discontinuities. IEEE Trans Pattern Anal Mach Intell 14(3):367–383
55. Gorham LA, Rigling BD (2016) Scene size limits for polar format algorithm. IEEE Trans Aerosp Electron Syst 52(1):73–84
56. Gorodnitsky IF, Rao BD (1997) Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Trans Signal Process 45(3):600–616
57. Green PJ (1990) Bayesian reconstructions from emission tomography data using a modified EM algorithm. IEEE Trans Med Imaging 9(1):84–93
58. Haupt J, Nowak R (2006) Signal reconstruction from noisy random projections. IEEE Trans Inf Theory 52(9):4036–4048

59. Hebert T, Leahy R (1989) A generalized EM algorithm for 3-D Bayesian reconstruction from Poisson data using Gibbs priors. IEEE Trans Med Imaging 8(2):194–202
60. Hernandez-Lobato D, Hernandez-Lobato JM, Dupont P et al (2013) Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. J Mach Learn Res 14(1):1891–1945
61. Hiriart-Urruty JB, Lemarechal C (2004) Fundamentals of convex analysis. Springer
62. Hormander L (1971) Fourier integral operators. I. Acta Mathematica 127(1):79–183
63. Hou X, Zhang L, Gong C, Xiao L, Sun J, Qian X (2014) SAR image Bayesian compressive sensing exploiting the interscale and intrascale dependencies in directional lifting wavelet transform domain. Neurocomputing 133:358–368
64. Hu J, Fomel S, Demanet L, Ying L (2013) A fast butterfly algorithm for generalized Radon transforms. Geophysics 78(4):U41–U51
65. Hunter DR, Lange K (2004) A tutorial on MM algorithms. Am Stat 58(1):30–37. http://dx.doi.org/10.1198/0003130042836
66. Jao JK (2001) Theory of synthetic aperture radar imaging of a moving target. IEEE Trans Geosci Remote Sens 39(9):1984–1992
67. Ji S, Xue Y, Carin L (2008) Bayesian compressive sensing. IEEE Trans Signal Process 56(6):2346–2356
68. Jakowatz Jr CV, Wahl DE, Eichel PH, Ghiglia DC, Thompson PA (1996) Spotlight-mode synthetic aperture radar: a signal processing approach. Springer
69. Kaipio J, Somersalo E (2006) Statistical and computational inverse problems, vol 160. Springer
70. Kelly S, Yaghoobi M, Davies M (2014) Sparsity-based autofocus for undersampled synthetic aperture radar. IEEE Trans Aerosp Electron Syst 50(2):972–986
71. Kelly SI, Du C, Rilling G, Davies ME (2012) Advanced image formation and processing of partial synthetic aperture radar data. IET Signal Process 6(5):511–520
72. Kirk JC (1975) A discussion of digital processing in synthetic aperture radar. IEEE Trans Aerosp Electron Syst 11(3):326–337
73. Kragh TJ, Kharbouch AA (2006) Monotonic iterative algorithms for SAR image restoration. In: 2006 international conference on image processing, pp 645–648
74. Krishnan V, Swoboda J, Yarman CE, Yazici B (2010) Multistatic synthetic aperture radar image formation. IEEE Trans Image Process 19(5):1290–1306
75. Levy S, Fullagar PK (1981) Reconstruction of a sparse spike train from a portion of its spectrum and application to high-resolution deconvolution. Geophysics 46(9):1235–1243

76. Li G, Zhang H, Wang X, Xia X-G (2012) ISAR 2D imaging of uniformly rotating targets via matching pursuit. IEEE Trans Aerosp Electron Syst 48(2):1838–1846
77. Liu KH, Wiesel A, Munson DC (2013) Synthetic aperture radar autofocus via semidefinite relaxation. IEEE Trans Image Process 22(6):2317–2326
78. Mason E, Son I-Y, Yazici B (2015) Passive synthetic aperture radar imaging using low-rank matrix recovery methods. IEEE J Sel Top Signal Process 9(8):1570–1582
79. Mason E, Yazici B (2016) Moving target imaging using sparse and low-rank structure. In: SPIE defense, security, and sensing, International Society for Optics and Photonics, vol 9843, pp 98430D–98430D-10
80. Milman AS (1993) SAR imaging by migration. Int J Remote Sens 14(10):1965–1979
81. Mohan K, Fazel M (2012) Iterative reweighted algorithms for matrix rank minimization. J Mach Learn Res 13(Nov):3441–3473
82. Morrison RL, Do MN, Munson DC (2007) SAR image autofocus by sharpness optimization: A theoretical study. IEEE Trans Image Process 16(9):2309–2321
83. Munson DC, O’Brien JD, Jenkins WK (1983) A tomographic formulation of spotlight-mode synthetic aperture radar. Proc IEEE 71(8):917–925
84. Needell D, Vershynin R (2010) Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE J Sel Top Signal Process 4(2):310–316
85. Newstadt G, Zelnio E, Hero A (2014) Moving target inference with Bayesian models in SAR imagery. IEEE Trans Aerosp Electron Syst 50(3):2004–2018
86. Nilsson S (1997) Application of fast backprojection techniques for some inverse problems in integral geometry. PhD thesis, Linkoping Studies in Science and Technology
87. Nolan CJ, Cheney M (2002) Synthetic aperture inversion. Inverse Problems 18(1):221
88. Onhon N, Cetin M (2012) A sparsity-driven approach for joint SAR imaging and phase error correction. IEEE Trans Image Process 21(4):2075–2088
89. Patel VM, Easley GR, Chellappa R, Nasrabadi NM (2014) Separated component-based restoration of speckled SAR images. IEEE Trans Geosci Remote Sens 52(2):1019–1029
90. Patel VM, Easley GR, Healy Jr DM, Chellappa R (2010) Compressed synthetic aperture radar. IEEE J Sel Top Signal Process 4(2):244–254
91. Plumbley MD (2007) On polar polytopes and the recovery of sparse representations. IEEE Trans Inf Theory 53(9):3188–3195
92. Potter LC, Ertin E, Parker JT, Cetin M (2010) Sparsity and compressed sensing in radar imaging. Proc IEEE 98(6):1006–1020
93. Prünte L (2010) Application of compressed sensing to SAR/GMTI-data. In: 8th European conference on synthetic aperture radar, pp 1–4


94. Prünte L (2013) GMTI from multichannel SAR images using compressed sensing under off-grid conditions. In: 2013 14th international radar symposium (IRS), vol 1, pp 95–100
95. Ramm AG, Katsevich AI (1996) The Radon transform and local tomography. CRC Press
96. Raney RK (1971) Synthetic aperture imaging radar and moving targets. IEEE Trans Aerosp Electron Syst 7(3):499–505
97. Raney RK, Runge H, Bamler R, Cumming IG, Wong FH (1994) Precision SAR processing using chirp scaling. IEEE Trans Geosci Remote Sens 32(4):786–799
98. Recht B, Fazel M, Parrilo P (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev 52(3):471–501
99. Ren X, Chen L, Yang J (2014) 3D imaging algorithm for down-looking MIMO array SAR based on Bayesian compressive sensing. Int J Antennas Prop 2014, Article ID 612326, 9 pp. doi:10.1155/2014/612326
100. Rigling BD, Moses RL (2005) Taylor expansion of the differential range for monostatic SAR. IEEE Trans Aerosp Electron Syst 41(1):60–64
101. Rudin LI, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1):259–268
102. Samadi S, Cetin M, Masnadi-Shirazi MA (2011) Sparse representation-based synthetic aperture radar imaging. IET Radar Sonar Navigation 5(2):182–193
103. Santosa F, Symes WW (1986) Linear inversion of band-limited reflection seismograms. SIAM J Sci Statist Comput 7(4):1307–1330
104. Soumekh M (1990) A system model and inversion for synthetic aperture radar imaging. In: International conference on acoustics, speech, and signal processing, vol 4, pp 1873–1876
105. Stevenson RL, Delp EJ (1990) Viewpoint invariant recovery of visual surfaces from sparse data. In: Proceedings third international conference on computer vision, pp 309–312
106. Stojanovic I, Karl WC (2010) Imaging of moving targets with multi-static SAR using an overcomplete dictionary. IEEE J Sel Top Signal Process 4(1):164–176
107. Stojanovic I, Novak L, Karl WC (2014) Interrupted SAR persistent surveillance via group sparse reconstruction of multipass data. IEEE Trans Aerosp Electron Syst 50(2):987–1003
108. Thibault J-B, Sauer KD, Bouman CA, Hsieh J (2007) A three-dimensional statistical approach to improved image quality for multislice helical CT. J Med Phys 34(11):4526–4544
109. Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B (Methodological), 267–288
110. Tipping ME (2001) Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res 1:211–244

111. Tipping ME, Faul AC (2003) Fast marginal likelihood maximisation for sparse Bayesian models. In: Proceedings on the ninth international workshop on artificial intelligence and statistics, pp 3–6
112. Tropp JA (2004) Greed is good: algorithmic results for sparse approximation. IEEE Trans Inf Theory 50(10):2231–2242
113. Tropp JA (2009) Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans Inf Theory 55(2):917–918
114. Ulander LMH, Hellsten H, Stenstrom G (2003) Synthetic-aperture radar processing using fast factorized back-projection. IEEE Trans Aerosp Electron Syst 39(3):760–776
115. Uğur S, Arıkan O (2012) SAR image reconstruction and autofocus by compressed sensing. Digital Signal Process 22(6):923–932
116. Vu VT, Sjögren TK, Pettersson MI (2011) Fast backprojection algorithm for UWB bistatic SAR. In: 2011 IEEE RadarCon (RADAR), IEEE, pp 431–434
117. Wacks S, Yazici B (2014) Bistatic Doppler-SAR DPCA imaging of ground moving targets. In: 2014 Radar conference, IEEE, pp 1071–1074
118. Wacks S, Yazici B (2014) Passive synthetic aperture hitchhiker imaging of ground moving targets – part 1: Image formation and velocity estimation. IEEE Trans Image Process 23(6):2487–2500
119. Wacks S, Yazici B (2014) Passive synthetic aperture hitchhiker imaging of ground moving targets – part 2: Performance analysis. IEEE Trans Image Process 23(9):4126–4138
120. Walker JL (1980) Range-Doppler imaging of rotating objects. IEEE Trans Aerosp Electron Syst 16(1):23–52
121. Wang F-F, Zhang Y-R, Zhang H-M, Hai L, Chen G (2015) Through wall detection with relevance vector machine. Prog Electromagn Res M 42:169–177
122. Wang L, Son I-Y, Yazici B (2010) Passive imaging using distributed apertures in multiple-scattering environments. Inverse Problems 26(6):065002
123. Wang L, Yarman CE, Yazici B (2011) Doppler-Hitchhiker: A novel passive synthetic aperture radar using ultranarrowband sources of opportunity. IEEE Trans Geosci Remote Sens 49(10):3521–3537
124. Wang L, Yazici B (2012) Bistatic synthetic aperture radar imaging using narrowband continuous waveforms. IEEE Trans Image Process 21(8):3673–3686
125. Wang L, Yazici B (2012) Bistatic synthetic aperture radar imaging using ultranarrowband continuous waveforms. IEEE Trans Image Process 21(8):3673–3686
126. Wang L, Yazici B (2012) Passive imaging of moving targets exploiting multiple scattering using sparse distributed apertures. Inverse Problems 28(12):125009
127. Wang L, Yazici B (2012) Passive imaging of moving targets using sparse distributed apertures. SIAM J Imaging Sci 5(3):769–808

E. Mason et al. 128. Wang L, Yazici B (2013) Ground moving target imaging using ultranarrowband continuous wave synthetic aperture radar. IEEE Trans Geosci Remote Sens 51(9):4893–4910 129. Wang L, Yazici B (2014) Bistatic synthetic aperture radar imaging of moving targets using ultranarrowband continuous waveforms. SIAM J Imaging Sci 7(2):824–866 130. Wei SJ, Zhang XL, Shi J (2012) An autofocus approach for model error correction in compressed sensing SAR imaging. In: 2012 IEEE International Geoscience and Remote Sensing Symposium, pp 3987–3990 131. Wei SJ, Zhang XL, Shi J, Xiang G (2010) Sparse reconstruction for SAR imaging based on compressed sensing. Progress In: Electromagnetics Research 109:63–81 132. Weng Z, Wang X (2012) Low-rank matrix completion for array signal processing. In: 2012 IEEE international conference on acoustics, speech, and signal processing, pp 2697–2700 133. Wu J, Liu F, Jiao LC, Wang X (2011) Compressive sensing sar image reconstruction based on bayesian framework and evolutionary computation. IEEE Trans Image Process 20(7):1904–1911 134. Wu Q, Zhang YD, Amin MG, Himed B (2014) Multi-static passive SAR imaging based on bayesian compressive sensing. In: Proc. SPIE, vol 9109, pp 910902–910902–9 135. Wu Q, Zhang YD, Amin MG, Himed B (2015) High-resolution passive SAR imaging exploiting structured bayesian compressive sensing. IEEE J Sel Top Signal Process 9(8):1484–1497 136. Wu Q, Zhang YD, Amin MG, Himed B (2015) Structured Bayesian compressive sensing exploiting spatial location dependence. In: 2015 IEEE international conference on acoustics, speech, and signal processing, pp 3831–3835 137. Yang D, Liao G, Zhu S, Yang X, Zhang X (2014) Sar imaging with undersampled data via matrix completion. IEEE Geoscience and Remote Sensing Letters 11(9):1539–1543 138. Yanik HC (2014) Analytic methods for SAR image formation in the presence of noise and clutter. PhD thesis, Rensselaer Polytechnic Institute 139. 
Yanik HC, Yazici B (2012) Mono-static synthetic aperture radar interferometry with arbitrary flight trajectories. In: SPIE Defense, Security, and Sensing, International Society for Optics and Photonics, vol 8394, pp 83940B–83940B–8 140. Yanik H, Yazici B (2015) Synthetic aperture inversion for statistically nonstationary target and clutter scenes. SIAM J Imaging Sci 8(3):1658–1686 141. Yarman CE, Wang L, Yazici B (2010) Doppler synthetic aperture hitchhiker imaging. Inverse Problems 26(6):065006 142. Yarman CE, Yazici B (2008) Synthetic aperture hitchhiker imaging. IEEE Transactions on Imaging Processing, pp 2156–2173

4 Optimization Methods for Synthetic Aperture Radar Imaging 143. Yarman CE, Yazici B, Cheney M (2008) Bistatic synthetic aperture radar imaging for arbitrary flight trajectories. IEEE Trans Image Process 17(1):84–93 144. Yazici B, Yanik HC (2014) Synthetic aperture radar interferometry by using ultra-narrowband continuous waveforms. In: SPIE Defense, Security, and Sensing, International Society for Optics and Photonics, vol 9093, pp 909305–909305–8 145. Yazici B, Cheney M, Yarman CE (2006) Synthetic-aperture inversion in the presence of noise and clutter. Inverse Problems 22(5): 1705 146. Yegulalp AF (1999) Fast backprojection algorithm for synthetic aperture radar. In: Radar Conference, 1999. The Record of the 1999 IEEE, pp 60–65 147. Yu L, Sun H, Barbot J-P, Zheng G (2012) Bayesian compressive sensing for cluster structured sparse signals. Signal Processing 92(1):259–269

103

148. Zhang S, Zhang Y, Wang W-Q, Hu C, Yeo TS (2016) Sparsity-inducing super-resolution passive radar imaging with illuminators of opportunity. Remote Sensing 8(11):929 149. Zhu XX, Bamler R (2010) Tomographic sar inversion by l1 -norm regularization the compressive sensing approach. IEEE Trans Geosci Remote Sens 48(10):3839–3846 150. Zhu XX, Bamler R (2010) Very high resolution spaceborne SAR tomography in urban environment. IEEE Trans Geosci Remote Sens 48(12):4296–4308 151. Zhu XX, Bamler R (2012) Super-resolution power and robustness of compressive sensing for spectral estimation with application to spaceborne tomographic SAR. IEEE Trans Geosci Remote Sens 50(1):247–258 152. Zhu XX, Bamler R (2014) Superresolving SAR tomography for multidimensional imaging of urban areas: Compressive sensing-based tomosar inversion. IEEE Signal Process Mag 31(4):51–58

5 Computational Spectral and Ultrafast Imaging via Convex Optimization

Figen S. Oktem, Liang Gao, and Farzad Kamalabadi

5.1 Introduction

Multidimensional optical imaging, that is, capturing light in more than two dimensions (unlike conventional photography), is an emerging field with widespread applications in physics, chemistry, biology, medicine, astronomy, and remote sensing [1]. The measured multidimensional data, including the spatial, spectral, and temporal distributions of light, provide unprecedented information about the chemical, physical, and biological properties of targeted objects. While the objective of conventional photography is to measure only the two-dimensional spatial distribution of light, the objective of multidimensional optical imaging is to form images of a scene as a function of more than two variables. That is, the goal is to obtain a datacube of high dimensions: three for spatial coordinates $(x, y, z)$, one for wavelength ($\lambda$), and one for emission time ($t$). However, obtaining this high-dimensional datacube with inherently two-dimensional detectors poses intrinsic limitations on the spatio-spectral-temporal extent of the technique.

Conventional techniques circumvent this limitation by sequentially scanning a series of two-dimensional measurements to construct the high-dimensional datacube. For example, in spectral imaging, the three-dimensional datacube $(x, y, \lambda)$ is typically obtained either by using a spectrometer with a long slit and scanning the scene spatially, or by using an imager with a series of spectral filters and scanning the scene spectrally. In the former case only a thin slice of the scene is observed at a time, whereas in the latter case only one spectral band is observed at a time. As a result, these scanning-based conventional methods sacrifice light throughput and are best suited to scenes that remain stationary during the scanning process; for dynamic scenes, they are subject to temporal artifacts. The conventional multidimensional imaging techniques also have inherent limitations on the attainable temporal, spatial, and spectral resolutions, imposed by their reliance on purely physical measurement systems.

On the other hand, recent developments in computational imaging techniques offer the prospect of transcending these physical limitations by combining information from different multiplexed measurements and/or by incorporating additional prior (statistical or structural) knowledge about the objects of interest, in the form of spatial, spectral, and temporal distributions or properties, into the image formation process.

In this chapter, recent multidimensional imaging techniques for spectral and temporal imaging are described that overcome the temporal, spectral, and spatial resolution limitations of conventional scanning-based systems. Each development is based on the computational imaging paradigm, which involves distributing the imaging task between a physical and a computational system and then digitally forming the image datacube of interest from multiplexed measurements by solving an inverse problem. The development of each computational imaging technique requires the following three steps. First, a novel optical system is designed with the goal of overcoming the resolution limitations of conventional systems. Second, the associated inverse problem for image reconstruction is formulated by combining multiplexed measurements with an image formation model based on either an estimation-theoretic framework or a variational formulation. Third, computationally efficient algorithms drawn from convex optimization are designed to solve the resulting inverse problems.

F.S. Oktem: Department of Electrical and Electronics Engineering, Middle East Technical University, 06800 Cankaya, Ankara, Turkey
L. Gao: Department of Electrical and Computer Engineering and Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
F. Kamalabadi: Department of Electrical and Computer Engineering, Coordinated Science Laboratory, and Department of Statistics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

© Springer International Publishing AG 2017. V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_5

5.2 The General Image Formation Model and Image Reconstruction Approach

5.2.1 General Forward Problem Formulation

In many multidimensional imaging systems, the relationship between the measured (sensor) data and the unknown source can be adequately characterized by a linear observation model. The observation/source relationship can thus be formulated by the Fredholm integral equation of the first kind [2]:

$$Y(\mathbf{r}) = \int_{\Omega} h(\mathbf{r}; \mathbf{r}')\, X(\mathbf{r}')\, d\mathbf{r}' \tag{5.1}$$

where, for a two-dimensional observation geometry, $\mathbf{r} = (r, s, t)$ with $r$ and $s$ denoting the two spatial variables, $t$ denoting time, and $\Omega \subset \mathbb{R}^2$ denoting the two-dimensional region of support. Also, $Y(\mathbf{r})$ and $X(\mathbf{r})$ denote the measured sensor data (observation) and the unknown source (the desired state to be imaged), respectively. The observation kernel, or system point spread (response) function, is denoted by $h(\mathbf{r}; \mathbf{r}')$.

Note that the above image formation model assumes a linear relationship between the observation and the source, and that explicit knowledge of the system response function is assumed, as obtained from the characteristics of the imaging sensor and the imaging scenario, i.e., the medium through which the electromagnetic radiation propagates between the source and the detector. While the presence of absorption or scattering effects in the propagation would imply a nonlinear relationship, here we focus on the commonly encountered case where these effects are negligible and can be omitted from the model. Even in the presence of nonlinear effects, in many cases the nonlinear relationship may be well approximated by a linear model through linearization, so the linear model remains relevant. In that case, the system response function is determined mainly by the specific geometric configuration of the sensor with respect to the source to be reconstructed. The system function in the above model, while linear, is most commonly not shift-invariant [3].

An algebraic formulation of the imaging model requires a discretization of the above integral equation. In practice, the observations are often a discrete sequence of measured data, $\{y_i\}_{i=1}^{M}$. For example, in optical measurements with a CCD, the observations are inherently discrete in space due to the array structure of the detector, and intrinsically discrete in time due to the finite integration time of the photon counting device and the subsequent readout operation for each time-stamped measurement. Furthermore, for a nonanalytical solution, the unknown field $X(\mathbf{r})$ must be discretized. In what follows, it is assumed that the unknown field can be adequately represented by a weighted sum of $N$ basis functions $\{\phi_j(\mathbf{r})\}_{j=1}^{N}$ as follows:

$$X(\mathbf{r}) = \sum_{j=1}^{N} x_j\, \phi_j(\mathbf{r}) \tag{5.2}$$

For instance, $\{\phi_j(\mathbf{r})\}_{j=1}^{N}$ are often chosen to be the set of unit-height boxes corresponding to a 2-D (or 3-D when time is explicitly included in the model) array of pixels. In that case, if a $g \times f$ pixel array is used and the number of discrete time samples is $k$, for example, then $N = g \times f \times k$ and the discretized field is completely described by the set of coefficients $\{x_j\}_{j=1}^{N}$, corresponding to the pixel values. Collecting all the observations into a vector $\mathbf{y}$ of length $M$, and the unknown image coefficients into a vector $\mathbf{x}$ of length $N$, results in the following observation model in the form of a matrix equation:

$$\mathbf{y} = \mathbf{H}\mathbf{x} \tag{5.3}$$

where $\mathbf{H} \in \mathbb{R}^{M \times N}$ is the linear operator relating the unknown field to the observations. Its entries are inner products of the basis functions with the corresponding observation kernel:

$$(\mathbf{H})_{ij} = \int_{\Omega} h_i(\mathbf{r}')\, \phi_j(\mathbf{r}')\, d\mathbf{r}', \qquad 1 \le i \le M,\; 1 \le j \le N \tag{5.4}$$

where $h_i(\mathbf{r}') = h(\mathbf{r}_i; \mathbf{r}')$ denotes the kernel function corresponding to the $i$-th observation.

In practice, there is inherent noise in the measurement process. The exact form of the noise depends on the type of measurement. For example, in optical observations where a detector such as a CCD is used, the measurement noise has a component from the incident photon counting process, which is characterized by a Poisson distribution, as well as background and thermal noise, and a component due to readout effects, which may be modeled as additive noise.

For most applications of practical interest here, the noise may be well approximated with an additive Gaussian component. Therefore, the linear model above may be expressed more completely by adding the noise or measurement uncertainty component, denoted by $\mathbf{w}$, i.e.,

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w} \tag{5.5}$$

The noise component $\mathbf{w}$ also accounts for mismatch between reality and the model, resulting from the discretization process as well as other approximations.
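The discretization in Eqs. (5.2)–(5.5) can be sketched numerically. The following is a minimal one-dimensional illustration, not the chapter's implementation: the Gaussian kernel, the sensor positions, the midpoint quadrature, and the function names `gaussian_kernel` and `build_forward_matrix` are all assumptions introduced here for illustration.

```python
import numpy as np

def gaussian_kernel(r, r_prime, width=0.05):
    """Hypothetical blur kernel h(r; r'), used only for illustration."""
    return np.exp(-0.5 * ((r - r_prime) / width) ** 2) / (width * np.sqrt(2.0 * np.pi))

def build_forward_matrix(sensor_pos, pixel_edges, kernel, n_quad=32):
    """Approximate (H)_ij of Eq. (5.4) for unit-height box basis functions.

    Pixel j covers [pixel_edges[j], pixel_edges[j + 1]]; the integral of
    h(r_i; r') over that pixel is evaluated with a midpoint rule.
    """
    M, N = len(sensor_pos), len(pixel_edges) - 1
    H = np.zeros((M, N))
    for j in range(N):
        a, b = pixel_edges[j], pixel_edges[j + 1]
        dr = (b - a) / n_quad
        nodes = a + (np.arange(n_quad) + 0.5) * dr  # quadrature nodes in pixel j
        for i in range(M):
            H[i, j] = dr * np.sum(kernel(sensor_pos[i], nodes))
    return H

rng = np.random.default_rng(0)
N, M = 64, 48                                   # unknowns and measurements
pixel_edges = np.linspace(0.0, 1.0, N + 1)
sensor_pos = np.linspace(0.0, 1.0, M)
H = build_forward_matrix(sensor_pos, pixel_edges, gaussian_kernel)

x = np.zeros(N)                                 # pixel coefficients x_j of Eq. (5.2)
x[20:28] = 1.0
x[45] = 2.0
sigma_w = 0.01
w = sigma_w * rng.standard_normal(M)            # additive Gaussian noise of Eq. (5.5)
y = H @ x + w
print(H.shape, y.shape)
```

Each row of `H` is one observation kernel integrated against every pixel, so a kernel that varies from row to row (a shift-variant system) needs no change to this structure.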

5.2.2 Image Reconstruction with Nonsmooth Regularization

Regardless of the specifics of the data acquisition scenario, the task amounts to estimating the $N$-element vector $\mathbf{x}$ given the $M$-element vector of measurements $\mathbf{y}$, the $M \times N$ forward model matrix $\mathbf{H}$, and all known information, statistical or otherwise, regarding the $M$-element measurement noise vector $\mathbf{w}$. However, in typical computed imaging scenarios the resulting inverse problems are ill-posed and ill-conditioned, since the kernel in the Fredholm equation is a smoothing operator that attenuates some (usually the high-frequency) singular modes. As a consequence, even modest amounts of noise in the measurements may result in significant reconstruction artifacts [2, 4–7]. It is therefore often necessary to incorporate additional constraints on the solution space in a systematic way. In other words, the original ill-posed and ill-conditioned problem must be replaced with a better-conditioned inverse problem that is close to the original one. A systematic approach to regularization leads to the minimization of an appropriately formulated cost function [8]. This approach derives from the use of prior knowledge concerning the unknown solution in a least squares setting. The prior information can be introduced in a deterministic way [4–6], or in a statistical setting [7], which is related to the Bayes paradigm of [9].


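A general formulation for the cost function (the objective function) can be expressed as:

$$\Phi(\mathbf{x}) = \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_{\mathbf{W}}^2 + \sum_i \lambda_i C_i(\mathbf{x}) \tag{5.6}$$

where $\|\mathbf{y} - \mathbf{H}\mathbf{x}\|_{\mathbf{W}}^2$ denotes the weighted residual norm, i.e., $(\mathbf{y} - \mathbf{H}\mathbf{x})^T \mathbf{W} (\mathbf{y} - \mathbf{H}\mathbf{x})$, $C_i$ and $\lambda_i$ are the $i$-th regularization functional and regularization parameter, respectively, and $\mathbf{W}$ is an appropriate weight, all to be chosen according to the specifics of the problem. The first term controls data fidelity (i.e., how faithful the reconstruction is to the data), whereas the second term (the regularization term) controls how well the reconstruction matches our prior knowledge of the solution.

Tikhonov (quadratic) regularization [10] is perhaps the most common regularization technique and is equivalent to maximum a posteriori (MAP) estimation assuming Gaussian statistics for both the unknown image and the noise [11]. Assuming $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_w)$ and $\mathbf{x} \sim \mathcal{N}(\mathbf{x}_0, \boldsymbol{\Sigma}_x)$, where $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ represents the normal distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, the MAP estimate is:

$$\begin{aligned}
\hat{\mathbf{x}}_{\mathrm{MAP}} &= \arg\min_{\mathbf{x} \in \mathbb{R}^N} \left[ -\log p(\mathbf{y} \mid \mathbf{x}) - \log p(\mathbf{x}) \right] \\
&= \arg\min_{\mathbf{x} \in \mathbb{R}^N} \left[ \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_{\boldsymbol{\Sigma}_w^{-1}}^2 + \|\mathbf{x} - \mathbf{x}_0\|_{\boldsymbol{\Sigma}_x^{-1}}^2 \right] \\
&= \mathbf{x}_0 + \left( \mathbf{H}^T \boldsymbol{\Sigma}_w^{-1} \mathbf{H} + \boldsymbol{\Sigma}_x^{-1} \right)^{-1} \mathbf{H}^T \boldsymbol{\Sigma}_w^{-1} (\mathbf{y} - \mathbf{H}\mathbf{x}_0)
\end{aligned} \tag{5.7}$$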
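The closed-form expression in the last line of Eq. (5.7) can be checked numerically. The sketch below is illustrative only: the function name `map_estimate`, the random test problem, and the covariance choices are assumptions made here. It evaluates the estimate and confirms that the gradient of the quadratic MAP cost vanishes at it.

```python
import numpy as np

def map_estimate(H, y, Sigma_w, Sigma_x, x0):
    """Gaussian MAP estimate, last line of Eq. (5.7):
    x0 + (H^T Sw^-1 H + Sx^-1)^-1 H^T Sw^-1 (y - H x0)."""
    Sw_inv_H = np.linalg.solve(Sigma_w, H)           # Sw^-1 H
    A = H.T @ Sw_inv_H + np.linalg.inv(Sigma_x)      # H^T Sw^-1 H + Sx^-1
    b = Sw_inv_H.T @ (y - H @ x0)                    # H^T Sw^-1 (y - H x0), Sw symmetric
    return x0 + np.linalg.solve(A, b)

# Hypothetical test problem (sizes and covariances chosen for illustration)
rng = np.random.default_rng(1)
M, N = 30, 20
H = rng.standard_normal((M, N))
x_true = rng.standard_normal(N)
sigma_w = 0.1
y = H @ x_true + sigma_w * rng.standard_normal(M)

Sigma_w = sigma_w ** 2 * np.eye(M)   # IID noise covariance
Sigma_x = np.eye(N)                  # prior covariance
x0 = np.zeros(N)                     # prior mean

x_map = map_estimate(H, y, Sigma_w, Sigma_x, x0)

# Gradient of ||y - Hx||^2_{Sw^-1} + ||x - x0||^2_{Sx^-1} evaluated at x_map
grad = (-2.0 * H.T @ np.linalg.solve(Sigma_w, y - H @ x_map)
        + 2.0 * np.linalg.solve(Sigma_x, x_map - x0))
print(np.max(np.abs(grad)))
```

Linear solves are used in place of explicit inverses wherever the inverse is only applied to a vector or matrix, which is the usual choice for numerical stability.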

The connection between this MAP formulation of Tikhonov regularization and the variational form just discussed becomes apparent by assuming independent identically distributed (IID) Gaussian noise with variance $\sigma_w^2$ and taking $\boldsymbol{\Sigma}_x = \frac{1}{\alpha^2} (\mathbf{L}^T \mathbf{L})^{-1}$ and $\mathbf{x}_0 = \mathbf{0}$, hence arriving at the well-known Tikhonov regularization functional:

$$\begin{aligned}
\hat{\mathbf{x}}_{\mathrm{Tik}} &= \arg\min_{\mathbf{x} \in \mathbb{R}^N} \left[ \frac{1}{\sigma_w^2} \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 + \alpha^2 \|\mathbf{L}(\mathbf{x} - \mathbf{x}_0)\|_2^2 \right] \\
&= \arg\min_{\mathbf{x} \in \mathbb{R}^N} \left[ \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 + \lambda \|\mathbf{L}\mathbf{x}\|_2^2 \right] \\
&= \left( \frac{1}{\sigma_w^2} \mathbf{H}^T \mathbf{H} + \alpha^2 \mathbf{L}^T \mathbf{L} \right)^{-1} \frac{1}{\sigma_w^2} \mathbf{H}^T \mathbf{y}
\end{aligned} \tag{5.8}$$

where $\mathbf{L}$ is a positive definite regularization matrix (often a derivative operator) and $\lambda = (\alpha \sigma_w)^2$, where $\sigma_w^2$ is the variance of the noise samples. A special case is $\mathbf{L} = \mathbf{I}$, for which $\lambda$ is the inverse of the signal-to-noise ratio. Although we assumed IID noise, more general forms of noise covariance have also been applied in different imaging applications.

Although the choice of a quadratic regularization functional leads to an optimization problem with a stable solution, it generally results in a reconstruction that is globally smooth. This is because the quadratic penalty treats all structures in the image equally: in suppressing noise it also penalizes large gradients (edges), and thus deters the formation of sharp edges. One way to modify Tikhonov regularization so that sharp structures are preserved is to change the norm used in the regularization term from $\ell_2$ to $\ell_p$ [12]. This results in the following cost function:

$$\Phi(\mathbf{x}) = \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 + \lambda \|\mathbf{L}\mathbf{x}\|_p^p \tag{5.9}$$

where $\|\mathbf{L}\mathbf{x}\|_p^p = \sum_i |(\mathbf{L}\mathbf{x})_i|^p$. Note that, unlike the $\ell_2$ case, no closed-form solution exists when other norms are used, so the cost function must be optimized numerically by iterative methods regardless of the size of the problem. Also, note that the cost function is no longer quadratic in $\mathbf{x}$ and the corresponding optimization problem is, in general, nonlinear. Furthermore, when $p \le 1$, the cost function is not differentiable at the origin, so a smooth approximation to the $\ell_p$ norm is often used. However, when $1 \le p \le 2$ the norms are still convex functions of the argument, and thus the resulting optimization problems can be solved using standard convex optimization techniques or fixed-point algorithms.

Despite the additional difficulty of solving the optimization problem compared with the quadratic case, using these norms in the side constraint has been shown to produce superior results when the underlying image is not globally smooth, since these norms penalize large values less severely than the $\ell_2$ norm.
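As an illustration of a fixed-point approach to Eq. (5.9), the sketch below applies iteratively reweighted least squares to a smoothed version of the cost, in which $|u|^p$ is replaced by $(u^2 + \epsilon)^{p/2}$. Several choices are assumptions made here rather than taken from the chapter: the smoothing constant `eps`, the non-square first-difference matrix used for $\mathbf{L}$, the denoising setup $\mathbf{H} = \mathbf{I}$, and the function names. The iteration starts from the Tikhonov solution of Eq. (5.8) with $\mathbf{x}_0 = \mathbf{0}$.

```python
import numpy as np

def first_difference(n):
    """(n-1) x n forward-difference matrix, a discrete derivative operator."""
    return np.diff(np.eye(n), axis=0)

def smoothed_cost(x, H, y, L, lam, p, eps):
    """||y - Hx||_2^2 + lam * sum_i ((Lx)_i^2 + eps)^(p/2), a smooth surrogate of Eq. (5.9)."""
    r = y - H @ x
    return r @ r + lam * np.sum(((L @ x) ** 2 + eps) ** (p / 2.0))

def irls_lp(H, y, L, lam, p=1.0, eps=1e-6, n_iter=50):
    """Fixed-point (reweighted least squares) iteration for the smoothed l_p cost, 1 <= p <= 2."""
    HtH, Hty = H.T @ H, H.T @ y
    # Initial guess: Tikhonov solution (H^T H + lam L^T L)^-1 H^T y
    x = np.linalg.solve(HtH + lam * L.T @ L, Hty)
    costs = [smoothed_cost(x, H, y, L, lam, p, eps)]
    for _ in range(n_iter):
        d = ((L @ x) ** 2 + eps) ** (p / 2.0 - 1.0)   # weights from the current iterate
        A = HtH + 0.5 * lam * p * (L.T * d) @ L       # H^T H + (lam p / 2) L^T diag(d) L
        x = np.linalg.solve(A, Hty)
        costs.append(smoothed_cost(x, H, y, L, lam, p, eps))
    return x, np.array(costs)

# Hypothetical piecewise-constant denoising problem
rng = np.random.default_rng(2)
n = 100
x_true = np.concatenate([np.zeros(30), np.ones(40), 0.3 * np.ones(30)])
H = np.eye(n)
y = H @ x_true + 0.1 * rng.standard_normal(n)
L = first_difference(n)

x_hat, costs = irls_lp(H, y, L, lam=0.5, p=1.0)
print(costs[0], costs[-1])
```

For $p \le 2$ each reweighted quadratic majorizes the smoothed cost at the current iterate, so the recorded costs should not increase from one iteration to the next.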


Measures with even more drastic penalty terms have also been used, for example with $p < 1$, which result in non-convex functions. When used in the data fidelity term, these measures provide robustness to outliers in the data and to uncertainty in the model, and are connected to the notions of robust statistics. Of particular recent interest is the case $p = 1$, which, although still convex, is nondifferentiable at the origin. The resulting cost function then becomes:

$$\Phi(\mathbf{x}) = \|\mathbf{y} - \mathbf{H}\mathbf{x}\|_2^2 + \lambda \|\mathbf{L}\mathbf{x}\|_1 \tag{5.10}$$

where $\|\mathbf{L}\mathbf{x}\|_1 = \sum_i |(\mathbf{L}\mathbf{x})_i|$. When a discrete approximation to the gradient operator is used for $\mathbf{L}$, the result is the well-known "total variation" regularization, which has achieved significant popularity due to its superior results for reconstructing images with significant structure [12]. This is because, instead of setting a limit on individual changes, this cost function limits only the overall variation in the underlying image, allowing edges to form where they fit the data. It can be shown that this cost function has a statistical interpretation in terms of the Laplacian distribution [7], where

$$p(\mathbf{x}) \propto \prod_i e^{-\lambda |(\mathbf{L}\mathbf{x})_i|} \tag{5.11}$$

which has heavier tails than the Gaussian prior associated with the Tikhonov case. The $p = 1$ setting is also significant in the context of sparse signal recovery and compressed sensing. By convex relaxation of the sparsity-inducing $\ell_0$ norm to the $\ell_1$ norm, the combinatorial sparse signal recovery problem is often converted to the convex optimization problem in Eq. (5.10).

There are various algorithmic approaches, with different convergence guarantees and convergence rates, for solving optimization problems in the form of Eq. (5.9). A commonly used class of algorithms is gradient-based algorithms, which are often obtained through an approximation model followed by a fixed-point approach [13, 14]. Examples include half-quadratic regularization methods [15], iterative shrinkage algorithms [16] (such as TwIST [17]), and gradient-projection algorithms (such as GPSR [18]). Another important class of methods, specialized to the $\ell_1$-norm case, is interior-point methods, such as the available l1-ls code [19] and the l1-magic package [20]. There are also homotopy methods [21–23]. Another important strand of research focuses on greedy pursuit methods [14] for sparse reconstruction with the $\ell_0$ norm. There are also nonconvex optimization approaches that result from relaxing the $\ell_0$ norm to a norm other than $\ell_1$ [24].
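To make the iterative shrinkage idea concrete, the following sketch implements the basic iterative shrinkage-thresholding iteration for Eq. (5.10) in the special case $\mathbf{L} = \mathbf{I}$. This is a plain textbook variant, not TwIST or GPSR; the random sensing matrix, the sparsity level, the value of `lam`, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(H, y, lam, n_iter=500):
    """Iterative shrinkage-thresholding for ||y - Hx||_2^2 + lam * ||x||_1 (L = I)."""
    # Step size 1 / Lipschitz constant of the gradient 2 H^T (Hx - y)
    step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)
    x = np.zeros(H.shape[1])
    costs = []
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ x - y)                    # gradient of the data term
        x = soft_threshold(x - step * grad, step * lam)   # gradient step, then shrinkage
        costs.append(np.sum((y - H @ x) ** 2) + lam * np.sum(np.abs(x)))
    return x, np.array(costs)

# Hypothetical underdetermined problem with a sparse unknown
rng = np.random.default_rng(3)
M, N, k = 40, 100, 5
H = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k)
y = H @ x_true + 0.01 * rng.standard_normal(M)

x_hat, costs = ista(H, y, lam=0.05)
print(costs[0], costs[-1], np.count_nonzero(x_hat))
```

With this step size the cost is non-increasing across iterations; accelerated or two-step variants mainly change how quickly it decreases.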

5.3 Computational Spectral Imaging

5.3.1 Introduction

Spectral imaging, the simultaneous imaging and spectroscopy of a radiating scene, is a fundamental diagnostic technique in the physical sciences with applications in diverse fields such as physics, chemistry, biology, medicine, astronomy, and remote sensing. In this imaging modality, also known as imaging spectroscopy, multispectral imaging, or hyperspectral imaging, the intensity of the light radiated from each spatial point in the scene is sensed as a function of wavelength. The measured spectrum provides information for uniquely identifying the physical, chemical, and biological properties of targeted objects. This makes spectral imaging a useful diagnostic tool in various applications including remote sensing of astrophysical plasmas, environmental monitoring, resource management, biomedical diagnostics, industrial inspection, and surveillance, among many others. The objective of spectral imaging is to form images of a scene as a function of wavelength. For a two-dimensional scene, this requires simultaneously obtaining a three-dimensional dataset $I(x, y, \lambda)$, with one spectral and two spatial dimensions. However, obtaining this three-dimensional spectral datacube with inherently two-dimensional detectors poses intrinsic limitations on the spatio-spectral extent of the technique [25].


Conventional spectral imaging techniques rely on a scanning process to build up the three-dimensional (3D) datacube from a series of two-dimensional (2D) measurements that are acquired sequentially. Typically this is done by using a spectrometer with a long slit and scanning the scene spatially, or by using an imager with a series of spectral filters and scanning the scene spectrally. In the former case (referred to as rastering or push broom) only a thin slice of the scene is observed at a time, whereas in the latter case only one spectral band is observed at a time. Similarly, Fourier and Hadamard transform based spectrometers perform scanning in a transform domain (through their movable parts) to build up the three-dimensional datacube [26]. One important disadvantage of these conventional spectral imaging techniques is that the number of scans (hence measurements) increases proportionally with the desired spatial and spectral resolutions [27]. This leads to long acquisition times as well as hardware complexity. The resulting temporal artifacts in dynamic scenes and the low light throughput are other undesired aspects [25, 28]. Moreover, since the conventional spectral imaging techniques rely purely on physical systems, there are inherent physical limitations on their performance, such as the temporal, spatial, and spectral resolutions. For these reasons, in many applications the spectral imaging information provided by these conventional techniques is limited.

To overcome these limitations, computational spectral imaging has emerged as an effective approach that passes some of the burden to a computational system. In these approaches, the three-dimensional datacube is represented in terms of voxels, and the values of the voxels are reconstructed from indirect (multiplexed) measurements. The added computational part provides the flexibility to combine information from different multiplexed measurements, as well as to incorporate additional prior knowledge about the images of interest into the image formation process in the form of regularization. The common approach is to take multiplexed measurements using a novel optical configuration, and then formulate an inverse problem for image reconstruction by combining the multiplexed measurements with an image formation model and additional prior information. Computationally efficient algorithms drawn from convex optimization are then designed to solve the resulting inverse problems.

Representative computational spectral imaging techniques include computed tomography imaging spectrometry (CTIS) [29, 30], compressive coded aperture snapshot spectral imaging (CASSI) [31, 32], compressive hyperspectral imaging by separable spectral and spatial operators (CHISSS) [33], and photon-sieve spectral imaging (PSSI) [34, 35]. Motivated by the limitations of conventional spectral imaging systems, each of these techniques aims at achieving one of the following: (i) enabling fast/snapshot spectral imaging by reducing the number of measurements, (ii) enabling higher spatial or spectral resolution by overcoming the physical limitations on resolution, and (iii) reducing the hardware complexity or cost.

5.3.2 Overview of Computational Spectral Imaging Approaches

Computed tomography imaging spectrometry (CTIS) aims at enabling snapshot spectral imaging by reducing the number of measurements [29, 30, 36–38]. Instead of directly measuring every voxel of the 3D datacube, the image datacube is reconstructed from fewer measurements through a tomographic approach. The multiplexed measurements in this approach correspond to 2D projections of the 3D image datacube, where the projections are obtained through spectrally dispersed images of the scene, each dispersed in a different direction and by a different amount. This is achieved by using an optical configuration similar to a conventional slit (pushbroom) spectrometer, with the major differences of allowing an instantaneous 2D field of view (FOV), rather than a 1D FOV limited by a slit, and simultaneously obtaining multiple measurements with different diffraction orders. For image reconstruction, a multiplicative algebraic reconstruction algorithm [39] is used, which is an expectation-maximization algorithm applied to a concave cost function, hence ensuring global convergence. No additional prior information is, however, incorporated into this image reconstruction.

These tomographic techniques require a large set of projections (i.e., dispersed images) to be captured at once, hence demanding a large detector area. The resulting cost generally limits either the FOV or the resolution. In addition, these techniques cannot fully sample the Fourier domain representation of the datacube, which results from the limited angle of projections and the projection-slice theorem [30]. The unsampled Fourier volume, known as the missing cone, makes unique reconstruction of the datacube impossible. Consequently, the resulting tomographic reconstruction suffers from artifacts unless the data deficiency arising from the missing cone problem is fully compensated for with additional prior knowledge. For these reasons, CTIS has not yet shown sufficient performance to see widespread use [25, 40].

A recent work compensates, to some extent, for this missing cone problem by incorporating prior knowledge about the discreteness of the spectra into the image formation framework via a parametric model [25]. While an optical configuration similar to a CTIS system is used, the image datacube is represented in terms of pixels along the spatial dimensions and with a parametric model along the spectral dimension. The resulting nonconvex optimization problem is solved by developing an efficient dynamic programming algorithm, which yields parameter estimates that are close to the global optimum. Hence, with the additional prior knowledge that the spectra consist of discrete lines, it becomes possible to reconstruct the 3D image datacube with a significantly smaller number of projections (i.e., dispersed images) compared to the CTIS system, hence allowing a smaller detector area and better reconstructions. However, this parametric approach is applicable only to settings where the spectrum is discrete and the dispersed images of the scene resulting from different spectral lines do not overlap at the detector.

Another important class of computational spectral imaging techniques is based on compressed sensing (compressive sampling) [41–46]. Instead of a parametric prior, these approaches use a sparsity prior, which requires the image datacube to be sparse in some transform domain. Ultimately, they aim to perform spectral imaging with a reduced number of measurements based on the theory of compressed sensing (CS). For this, the two requirements that CS theory relies on should be satisfied: sparsity and incoherence. Because natural spectral images generally have correlation among neighboring pixels and along the spectral dimension, sparse representations are possible using a proper transformation [28, 47]. For incoherence, the measurement system should be designed so that the sensing waveforms, in contrast to the datacube, have a dense representation in the chosen transform domain (equivalently, satisfy the restricted isometry property) [28, 44, 48]. This incoherence principle enables the underdetermined inverse problem arising from compressed measurements to have a unique solution. The image datacube can then be reconstructed from the measurements by using sparse signal recovery algorithms. Various optical configurations have been suggested for compressive spectral imaging to take the compressed measurements in accordance with the incoherence principle.

Some of the compressive spectral imaging systems are inspired by the compressive single-pixel camera architecture [33, 49–51]. In these works, the single photosensor in the single-pixel camera architecture is replaced with a spectrometer. A spatial light modulator is used to randomly choose pixels from the scene, and only the light reflected from these pixels enters the spectrometer. As a result, each measurement contains the superimposed spectra of the randomly chosen pixels from the scene. While in the earlier approaches the measurements are compressed only along the spatial dimension, an optical configuration that enables compression along both the spatial and spectral dimensions was developed later [33]. This latter approach is known as compressive hyperspectral imaging by separable spectral and spatial operators (CHISSS). In CHISSS, image reconstruction is performed using the two-step iterative shrinkage/thresholding (TwIST) algorithm [17], which is a convex optimization approach. These compressive spectral imagers require multiple successive acquisitions, instead of a single-shot measurement, rendering them not ideal for dynamic scenes; but they offer advantages over their conventional counterparts, such as improved reconstruction and resolution and reduced acquisition time and storage requirements. However, depending on the complexity of the object to be imaged, they still generally require long acquisition times that can be comparable to those of the scanning-based conventional techniques. Moreover, while the spatial resolution is limited physically by the resolution of the spatial modulator, the spectral resolution is determined by the spectrometer used.

An alternative approach to compressive spectral imaging, known as the coded aperture snapshot spectral imager (CASSI), involves coded apertures [27, 32, 52]. In the most popular of these approaches, an optical configuration similar to a conventional slit spectrometer is used, with the difference of allowing an instantaneous 2D FOV by widening the slit and placing a randomly coded mask (a coded aperture) on this opening. As a result, similar to a CTIS system, the measurements correspond to projections of the 3D spectral datacube; the only difference is that these projections now contain randomness. Moreover, by changing the mask, it is also possible to take more than one measurement in a non-snapshot setting [53]. Different variants of the CASSI approach have also been suggested, such as one involving a dual disperser design [31] and a system based on a digital micromirror device (DMD) [54].
For image reconstruction, convex optimization approaches such as TwIST and gradient projection for sparse reconstruction (GPSR) are used. The spatial resolution of CASSI systems is limited by the minimum structure in the coded aperture, while the spectral resolution is determined both by the coded aperture and by the amount of spectral dispersion. However, for the CASSI systems developed to date, it has been observed that the spatial resolution is much worse than the diffraction-limited resolution and that the spectral resolution is worse than expected [1]. The main reasons include the considerable sensitivity of the image reconstruction performance to errors in the measurement system model, as well as the impossibility of perfect calibration given the large number of optical elements contained in the system [40].

In none of these computational spectral imaging systems has the main motivation been to improve the spatial and spectral resolutions of conventional systems beyond their physical limitations. The photon-sieve spectral imaging (PSSI) approach has been developed with this motivation [34]. While the other approaches use gratings or prisms for dispersion and many other optical components for imaging, the PSSI system relies only on a photon sieve for both imaging and dispersion. This simplified optical configuration offers diffraction-limited high spatial resolution as well as higher spectral resolution than conventional spectral imagers employing wavelength filters. In the following section, we discuss the PSSI approach in more detail.

5.3.3 Computational Photon-Sieve Spectral Imaging (PSSI)

The photon-sieve spectral imaging technique uses an optical configuration that consists only of a diffractive imaging element known as a photon sieve. Photon sieves provide an alternative to lenses and mirrors, and are obtained by replacing the open zones of Fresnel zone plates with a large number of circular holes [55]. They have been proposed as an image-forming device superior to the Fresnel zone plate [55], especially for use at UV and x-ray wavelengths, where refractive lenses are not available due to strong absorption of materials, and reflective mirrors are difficult to manufacture to achieve near diffraction-limited resolution [56–58]. Photon sieves, on the other hand, offer diffraction-limited high spatial resolution with relaxed manufacturing tolerances [55, 57–65]. As a result, this new class of lightweight and low-cost diffractive imaging elements opens up new possibilities for high-resolution imaging and spectroscopy.

Another important property of the photon sieve for spectral imaging is its wavelength-dependent focal length, and hence its dispersive character. Unlike the photon-sieve imaging systems developed earlier in the literature, the photon-sieve spectral imaging modality takes advantage of chromatic aberration. That is, the fact that different wavelengths are focused at different distances from the photon sieve is exploited to develop the technique. The idea is to use a photon-sieve imaging system to focus each spectral component by a different amount and to take measurements that are superpositions of the images of the different spectral components (with each individual image being either in focus or out of focus).

Fig. 5.1 The photon sieve spectral imaging system

The photon sieve spectral imaging (PSSI) system is illustrated in Fig. 5.1. Here $d_s$ and $d_k$ denote the distances from the object and the $k$th measurement plane to the plane where the photon sieve resides ($k = 1, \ldots, K$). We consider polychromatic radiation from the object, which can be regarded as composed of $P$ different wavelength components, each with a different wavelength $\lambda_p$ ($p = 1, \ldots, P$) and mutually incoherent with the others [66]. Because the focal length of the photon sieve depends on the wavelength, each wavelength component is focused at a different distance from the photon sieve. Hence, on a plane where one of the spectral components is focused, there also exist defocused images of all other spectral components. As a result, if, for example, a measurement is taken at a plane where one spectral component is in focus, the focused image of this spectral component overlaps with the defocused images of all other components. A total of $K$ such measurements are recorded by the imaging system. To create this measurement diversity, one can either use a moving detector (so that measurements can be taken at different distances using the same photon-sieve element) or a spatial light modulator to dynamically change the photon-sieve design and hence the amount of focusing (so that a different wavelength can be focused each time onto a detector held at a fixed position). The image formation model mathematically relates the intensities of the individual spectral components to the measurements as follows [34]:

$$t_k(x, y) = \sum_{p=1}^{P} s_p(x, y) * g_{\lambda_p, d_k}(x, y). \tag{5.12}$$


Here $t_k(x, y)$ is the observation corresponding to the $k$th measurement. As seen, each measurement consists of $P$ terms arising from the $P$ different spectral components. The term $s_p(x, y)$ represents the intensity of the spectral component with wavelength $\lambda_p$. The spatial intensity distribution of each spectral component at wavelength $\lambda_p$ is convolved at distance $d_k$ with the incoherent point-spread function (PSF) of the photon sieve, $g_{\lambda_p, d_k}(x, y)$, which is given under the Fresnel approximation by [67]

$$g_{\lambda_p, d_k}(x, y) = \left| \, i \frac{\lambda_p}{\Delta_k} \, e^{-i \pi \frac{x^2 + y^2}{\Delta_k \lambda_p d_k^2}} * A\!\left( \frac{x}{\lambda_p d_k}, \frac{y}{\lambda_p d_k} \right) \right|^2. \tag{5.13}$$

Here $\Delta_k = 1/d_s + 1/d_k$, $*$ denotes 2D convolution, and $A(f_x, f_y)$ is the Fourier transform of the aperture (transmission) function of the photon sieve. The aperture function, $a(x, y)$, of the photon sieve is defined as the ratio of the transmitted field amplitude to the incident field amplitude at every point $(x, y)$ on the photon sieve.

Note that the PSF $g_{\lambda_p, d_k}$ of the system is band limited, and hence so are the measurements $t_k$ [34]. This results from the inherent diffraction limit [68]. By aiming to recover the same band-limited versions of the original spectral images (that is, to achieve diffraction-limited spatial resolution), each continuous band-limited function can be replaced with its discrete representation in a sinc basis. With this, the continuous convolution operation in (5.12) reduces to a discrete convolution:

$$t_k[m, n] = \sum_{p=1}^{P} s_p[m, n] * g_{\lambda_p, d_k}[m, n], \tag{5.14}$$

where $t_k[m, n]$, $s_p[m, n]$, and $g_{\lambda_p, d_k}[m, n]$ are uniformly sampled versions of their continuous forms, e.g., $t_k[m, n] = t_k(m\Delta, n\Delta)$ for some $\Delta$ smaller than the Nyquist sampling interval. We assume that $t_k[m, n]$, i.e., the uniformly sampled version of the $k$th observation, is approximately the same as the detector measurements with a pixel size of $\Delta$ (i.e., the averaged intensity over a pixel). We also assume that the size of the input objects is limited to the detector range as determined by $N \times N$ pixels, i.e., $m, n = 0, \ldots, N - 1$.

As expected, each measurement consists of superimposed images of different wavelengths, with each individual image being either in focus or out of focus. The goal in the inverse problem is to recover the unknown intensities of the different spectral components from these superimposed measurements. In the inverse problem framework, the following matrix-vector form obtained from the above image-formation model will be used [34]:

$$\tilde{t} = H s + n, \tag{5.15}$$

where

$$\tilde{t} = \begin{bmatrix} \tilde{t}_1 \\ \vdots \\ \tilde{t}_K \end{bmatrix}, \qquad s = \begin{bmatrix} s_1 \\ \vdots \\ s_P \end{bmatrix}, \qquad H = \begin{bmatrix} H_{1,1} & \cdots & H_{1,P} \\ \vdots & & \vdots \\ H_{K,1} & \cdots & H_{K,P} \end{bmatrix}. \tag{5.16}$$

Here $\tilde{t}_k$ is the lexicographically ordered noisy $k$th measurement vector, whereas $\tilde{t}$ is the overall noisy measurement vector obtained by combining all $K$ measurements. Similarly, the vector $s_p$ contains the intensity of the spectral component with wavelength $\lambda_p$, and the vector $s$ is obtained by concatenating the intensities of all the spectral components. The matrix $H_{k,p}$ is an $N^2 \times N^2$ block Toeplitz matrix corresponding to the discrete convolution operation with $g_{\lambda_p, d_k}$, and $H$ is the overall system matrix of size $KN^2 \times PN^2$. Lastly, the vector $n = [n_1^T | \ldots | n_K^T]^T$ is the additive noise vector with $(n_k)_i \sim N(0, \sigma_k^2)$, representing white Gaussian noise that is uncorrelated across both different pixels $i$ and measurements $k$.

In the inverse problem, the goal is to recover the unknown spectral intensities, $s$, from their noisy, superimposed, and blurred measurements, $\tilde{t}$. This inverse problem can be viewed as a multiframe (multichannel) image deconvolution problem involving multiple objects. (Note that each measurement is composed of focused/defocused images of different spectral components.) The term "multiframe" refers to the availability of multiple blurred images through different measurements. Here we consider a stochastic inversion approach based on MAP estimation to incorporate prior statistical knowledge of the targeted scenes. This results in the following regularized linear least-squares problem:

$$\min_{\hat{s}} \; \| \tilde{t} - H \hat{s} \|_W^2 + \beta \| L \hat{s} \|_p^p. \tag{5.17}$$

Here $W$ is a weighting matrix chosen according to the standard deviation of the noise in the different measurements, $L$ is a discrete approximation to the gradient operator, and $\beta$ is the regularization parameter arising from the statistical prior model of $\hat{s}$, which is taken as a generalized Gaussian of the form $p(\hat{s}) \propto \prod_i \exp(-\beta |(L\hat{s})_i|^p)$. The above regularization term allows reconstructions ranging from globally smooth objects to sharper objects, depending on the prior knowledge in a specific application. For example, choosing $p = 2$ results in a quadratic regularization term and leads to globally smooth reconstructions by penalizing large variations in the images. In situations where the underlying images are not globally smooth and vary from pixel to pixel without strong correlations, as in many practical applications, this type of regularization is not appropriate. Various works on inverse problems have indicated that variation from pixel to pixel is best preserved by using total variation regularization [69] with $p = 1$. This regularization with the $\ell_1$-norm penalizes only the total variation in the reconstructed images, and hence allows preservation of edges and rapidly changing structures when they fit the data.

When $1 \le p < 2$, the resulting problem is a nonlinear but convex optimization problem, and it becomes non-smooth (non-differentiable) when $p = 1$. Since there is no closed-form solution, numerical techniques are used to find the solution. Here we use a fixed-point iterative algorithm [69] adapted to this problem. This algorithm is a special case of the "half-quadratic regularization" method [15], which obtains the reconstruction as the solution of a series of approximating quadratic problems.
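The idea of solving a series of approximating quadratic problems can be illustrated with a small sketch. The code below is not the authors' implementation: it is a generic lagged fixed-point (iteratively reweighted least-squares) iteration for a cost of the form (5.17), written under several assumptions that are ours: a 1D signal, a hypothetical Gaussian blur matrix `H`, identity weighting (`W = I`), a first-difference matrix for `L`, and a small smoothing constant `eps` that keeps the weights finite. The names `lp_cost` and `lp_fixed_point` are illustrative only.

```python
import numpy as np

def lp_cost(s, H, t, L, beta, p, eps=1e-6):
    """Smoothed version of the cost in (5.17) with W = I (assumption)."""
    return np.sum((t - H @ s) ** 2) + beta * np.sum(((L @ s) ** 2 + eps) ** (p / 2))

def lp_fixed_point(H, t, L, beta, p=1.5, n_iter=30, eps=1e-6):
    """Lagged fixed-point iteration: each pass solves one quadratic problem."""
    s = np.zeros(H.shape[1])
    for _ in range(n_iter):
        # Freeze the non-quadratic penalty at the current iterate.
        w = 0.5 * p * ((L @ s) ** 2 + eps) ** (p / 2 - 1)
        # Normal equations of the quadratic surrogate:
        # (H^T H + beta L^T diag(w) L) s = H^T t
        A = H.T @ H + beta * (L.T * w) @ L
        s = np.linalg.solve(A, H.T @ t)
    return s

# Toy problem: blurred, noisy piecewise-constant signal (all values hypothetical).
rng = np.random.default_rng(0)
n = 64
s_true = np.zeros(n)
s_true[20:40] = 1.0
idx = np.arange(n)
H = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
H /= H.sum(axis=1, keepdims=True)          # row-normalized Gaussian blur
L = np.diff(np.eye(n), axis=0)             # first-difference operator
t = H @ s_true + 0.01 * rng.standard_normal(n)

s_hat = lp_fixed_point(H, t, L, beta=0.01, p=1.5)
```

Because each surrogate is quadratic, every pass reduces to a linear solve; for large images one would presumably replace the dense solve with an iterative solver such as conjugate gradients.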

We now present a sample reconstruction with $\ell_p$-norm regularization. We consider a polychromatic input source generating two quasi-monochromatic radiations at close (but different) EUV wavelengths: $\lambda_1 = 33.4$ nm and $\lambda_2 = 33.5$ nm (i.e., $P = 2$). Moreover, the photon sieve system records the intensities at the two focal planes, $f_1$ and $f_2$, corresponding to the wavelengths $\lambda_1$ and $\lambda_2$ (i.e., $K = 2$). At the first focal plane, the measurement then consists of a focused image of the first source overlapped with a defocused image of the second source, and vice versa at the other focal plane. For the photon sieve, a sample design in [58] for EUV solar imaging is considered, with an outer diameter of 25 mm and a smallest hole diameter of 5 µm. This results in a photon sieve with first-order focal lengths of $f_1 = 3.742$ m and $f_2 = 3.731$ m, and an Abbe diffraction resolution limit of 5 µm on the detector plane [56, 68]. The pixel size on the detector is then chosen as 2.5 µm to match the diffraction-limited resolution of the system (i.e., corresponding to the Nyquist rate). Solar EUV scenes of size $128 \times 128$ are used as the inputs to the photon sieve system. Using the forward model in (5.14), we generated a data set corresponding to $\tilde{t}$ at a signal-to-noise ratio of 30 dB. Figure 5.3 shows the resulting measurements at the two focal planes together with the contributions of each source and the corresponding PSFs of the system. The reconstructed images with $\ell_p$-norm regularization are shown in Fig. 5.2 for the two sources, together with the diffraction-limited versions of the original scenes for comparison. This suggests that the proposed system achieves near diffraction-limited resolution, with the PSNR between the reconstructions and the diffraction-limited images larger than 42 dB.
For this experiment, $p = 1.5$ is chosen since this corresponds to the optimal $p$ parameter for reconstruction. The above experiment illustrates that the PSSI technique can achieve diffraction-limited spatial resolution as well as unprecedented spectral resolution compared to conventional spectral imagers with wavelength filters. Note that the sources have wavelengths 33.4 nm and 33.5 nm, resulting in a spectral resolution of 0.1 nm, which is less than 0.3% of the central wavelength of each source. Such a high spectral resolution is not possible to achieve with state-of-the-art EUV wavelength filters, which can at best have a spectral resolution of 10% of the central wavelength [70]. This becomes an issue when this 10% spectral band contains more than one spectral line, in which case resolving the individual lines is not possible [34].

Fig. 5.2 Estimated intensities of the first and second sources (top left)–(bottom left), and diffraction-limited images of the original sources (top right)–(bottom right), respectively. Courtesy of NASA/SDO and the AIA, EVE and HMI science teams
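The two-source, two-plane experiment above can be mimicked with a short simulation of the discrete forward model (5.14) followed by additive noise at 30 dB. This is only a sketch under stated assumptions: the Gaussian PSFs are hypothetical stand-ins for the photon-sieve PSF of (5.13), the convolution is circular (FFT-based), the sources are random test images rather than solar scenes, and the SNR is taken as mean signal power over noise variance. The last lines check the quoted focal lengths under the assumed zone-plate relation $f = D w / \lambda$, with $D$ the outer diameter and $w$ the smallest hole diameter; this relation is our assumption, not a formula stated in the text.

```python
import numpy as np

def circ_conv2(img, psf):
    """Circular 2D convolution via the FFT; `psf` is centered in its array."""
    kernel = np.fft.fft2(np.fft.ifftshift(psf))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * kernel))

def gaussian_psf(n, width):
    """Hypothetical unit-sum stand-in for the photon-sieve PSF of (5.13)."""
    x = np.arange(n) - n // 2
    g = np.exp(-0.5 * (x[:, None] ** 2 + x[None, :] ** 2) / width ** 2)
    return g / g.sum()

def pssi_forward(sources, psfs):
    """Eq. (5.14): t_k = sum_p s_p * g_{p,k}; psfs[k][p] is the PSF of source p at plane k."""
    return [sum(circ_conv2(s, g) for s, g in zip(sources, row)) for row in psfs]

def add_noise(t, snr_db, rng):
    """Add white Gaussian noise; variance = mean signal power / SNR (assumed definition)."""
    sigma = np.sqrt(np.mean(t ** 2) / 10 ** (snr_db / 10))
    return t + sigma * rng.standard_normal(t.shape)

rng = np.random.default_rng(0)
n = 32
sources = [rng.random((n, n)), rng.random((n, n))]        # P = 2 test images
focused, defocused = gaussian_psf(n, 0.7), gaussian_psf(n, 3.0)
psfs = [[focused, defocused],    # plane 1: source 1 in focus, source 2 defocused
        [defocused, focused]]    # plane 2: source 2 in focus, source 1 defocused
clean = pssi_forward(sources, psfs)                        # K = 2 measurements
noisy = [add_noise(t, 30.0, rng) for t in clean]

# Focal lengths under the assumed relation f = D * w / wavelength.
D, w = 25e-3, 5e-6
focal_lengths = [D * w / lam for lam in (33.4e-9, 33.5e-9)]
```

Since the stand-in PSFs sum to one, each simulated measurement carries the total intensity of both sources, which gives a quick sanity check of the operator before any reconstruction is attempted.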

Fig. 5.3 Measured intensities at the first and second focal planes (a)–(e), the underlying images of the first source at the first and second focal planes (b)–(f), the underlying images of the second source at the first and second focal planes (c)–(g), and sampled and zoomed point-spread functions of the system for the focused and defocused cases (d)–(h), respectively

5.4 Computational Ultrafast (Temporal) Imaging

5.4.1 Introduction

The frame rate of a conventional electronic photon detector array, such as a CCD or CMOS sensor, is fundamentally limited by the onboard storage and the data transfer speed from the camera to the host computer. Restricted by these bottlenecks, the frame rate of a state-of-the-art high-speed digital camera is up to a few tens of millions of frames per second (fps). To break the data storage and transfer bandwidth limitation, compressed-sensing-based ultrafast imaging has emerged as an effective approach, attracting growing interest in the past decade. In conventional multidimensional optical imaging, the number of camera pixels must be greater than twice the number of datacube voxels at the Nyquist sampling condition [1]. By contrast, in compressed optical imaging, the camera pixels can be fewer if the scene can be considered sparse in a given domain. Specifically, rather than measuring the entire event datacube $(x, y, t)$ ($x$, $y$, spatial coordinates; $t$, time), compressed ultrafast imaging acquires light signals at only selected event voxels and therefore does not generate massive image data, which would otherwise require tremendous hardware resources if measured using a conventional high-speed imager. Given sparsity constraints, the original event datacube can be reasonably estimated using a typical convex minimization approach, such as two-step iterative shrinkage/thresholding (TwIST) [17] or gradient projection for sparse reconstruction (GPSR) [18].
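To make the role of sparsity concrete, the sketch below recovers a sparse vector from fewer measurements than unknowns by minimizing $\frac{1}{2}\|y - Ax\|^2 + \lambda \|x\|_1$. It deliberately uses plain iterative shrinkage/thresholding (ISTA), a simpler relative of the TwIST and GPSR algorithms cited above, not either of those algorithms. The random Gaussian sensing matrix, the problem sizes, and the value of `lam` are all illustrative assumptions.

```python
import numpy as np

def soft(x, thr):
    """Soft-thresholding, the proximal operator of thr * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def objective(A, y, x, lam):
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))

def ista(A, y, lam, n_iter=500):
    """Plain ISTA with step 1/||A||_2^2 (not TwIST or GPSR)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step on the data term, then shrinkage.
        x = soft(x + step * A.T @ (y - A @ x), step * lam)
    return x

rng = np.random.default_rng(0)
m, n, k = 80, 200, 5                       # 80 measurements, 200 unknowns, 5 nonzeros
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = 1.0
y = A @ x_true                             # compressed, noise-free measurements
x_hat = ista(A, y, lam=0.01)
```

In the imaging systems discussed next, `A` is replaced by the physical measurement operator and the sparsity is typically imposed in a transform or gradient domain rather than on the signal itself.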

5.4.2 Overview of Compressed Ultrafast Imaging Techniques

Representative compressed ultrafast imaging methods encompass the programmable pixel compressive camera (P2C2) [71], compressed ultrafast photography (CUP) [72], coded aperture compressed temporal imaging (CACTI) [73], and smart pixel imaging (SPI) [74, 75]. We briefly review P2C2, CACTI, and SPI in this section, and detail CUP and its recent advances in Sect. 5.4.3.

The programmable pixel compressive camera (P2C2) is based on per-pixel modulation [72, 76]. P2C2 uses a liquid crystal on silicon (LCOS) device to encode the input scene with a random binary pattern and then relays the resultant image to a time-integration device, such as a CCD. The LCOS pixels are mapped one-to-one to the CCD pixels, functioning as per-pixel shutters. The light intensity measured at each CCD pixel is thus an integration of the incident light modulated by its own shutter. During exposure, the LCOS pixels are modulated at a rate higher than the CCD's frame rate. Image reconstruction requires solving the inverse problem of the above image formation process. Compared with the global-shutter coding architecture employed in a flutter shutter video camera [77], the per-pixel coding design used in P2C2 results in a less ill-conditioned measurement matrix and therefore higher reconstruction quality. However, the spatial resolution of P2C2 is generally worse than the diffraction limit because of the spatiotemporal multiplexing at the CCD. In addition, the imaging speed of P2C2 is fundamentally limited by the modulation rate of the LCOS.

Similar to P2C2, coded aperture compressed temporal imaging (CACTI) [73] first spatially encodes the input image with a pseudo-random binary pattern by using an absorption mask, and then relays the resultant image to a CCD, where photons are spatiotemporally integrated. However, unlike P2C2, CACTI introduces temporal modulation to the detected images by mechanically translating the mask along the vertical axis using a piezo element. Again, the image reconstruction process of CACTI requires solving the inverse problem. Llull et al. adapted both a generalized alternating projection (GAP) algorithm [78] and the TwIST algorithm [17] to estimate the event datacube. Compared with TwIST, which performs best for a scene that can be considered sparse in the gradient domain, GAP requires no prior knowledge of the object and can use one of several bases, such as wavelets or the discrete cosine transform, to represent a sparse signal. However, in cases where TwIST can be applied, experimental results show that TwIST-reconstructed videos generally provide greater visual quality. The frame rate of the reconstructed video is up to 4500 fps, limited by the moving speed of the mask and the CCD's pixel size.

P2C2 and CACTI share a common thread in that the spatial encoding is accomplished with an optical architecture. By contrast, smart pixel imaging (SPI) [74, 75] with computational imaging arrays transfers this encoding process to the digital domain by using a digital pixel focal plane array, thereby minimizing the signal-to-noise loss caused by physical encoding elements, such as the LCOS in P2C2 and the absorption mask in CACTI. In SPI, each detector pixel is modulated by a time-varying, pseudo-random, dual-binary signal ($-1$, $1$ or $1$, $0$) at a rate of up to 100 MHz. The image formation model using such a digital pixel focal plane array is similar to that in P2C2. However, in SPI the temporal modulation is introduced in the digital domain, rather than in the real image domain as in P2C2. By employing algorithms such as TwIST, Fernandez-Cull et al. demonstrated that SPI could yield a reasonable estimate of the original dynamic scenes [74].
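The per-pixel shutter measurement described for P2C2 can be written in a few lines. The sketch below is a schematic model only, assuming an idealized video cube `video[t, y, x]` and binary shutter states `shutters[t, y, x]`; both names and the array layout are ours. A global (flutter) shutter corresponds to the special case in which every pixel shares the same on/off sequence.

```python
import numpy as np

def coded_exposure(video, shutters):
    """One coded frame: each pixel integrates the light passed by its own shutter."""
    return np.sum(shutters * video, axis=0)

rng = np.random.default_rng(0)
n_frames, height, width = 8, 4, 4
video = rng.random((n_frames, height, width))           # hypothetical scene frames

# Per-pixel coding: an independent random binary sequence at every pixel.
per_pixel = (rng.random(video.shape) < 0.5).astype(float)
frame_per_pixel = coded_exposure(video, per_pixel)

# Global (flutter) shutter: one binary sequence shared by all pixels.
code = (rng.random(n_frames) < 0.5).astype(float)
global_shutter = np.broadcast_to(code[:, None, None], video.shape)
frame_global = coded_exposure(video, global_shutter)
```

Stacking the rows of this linear map for all pixels gives the measurement matrix whose conditioning the text compares between the per-pixel and global-shutter designs.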

5.4.3 Compressed Ultrafast Photography (CUP)

5.4.3.1 Operating Principle and Convex Optimization

A streak camera is a one-dimensional (1D) ultrafast imager with a rate of up to one trillion fps [79]. However, to image a two-dimensional (2D) scene, the object must be scanned in the direction perpendicular to the entrance slit of the streak camera. More problematically, the event itself must be strictly repeated at each scanning position. For unrepeatable events, such as optical rogue waves [80] or nuclear explosions, a streak camera is therefore inapplicable. To overcome this limitation, compressed ultrafast photography (CUP) introduces the paradigm of compressed sensing [81] into streak photography, enabling 2D ultrafast imaging at an unprecedented speed.

Figure 5.4 shows a typical CUP system. A photographic lens images the object onto an intermediate image plane. This intermediate image is then relayed to a spatial encoding element, a digital micromirror device (DMD), by a tube lens and a microscope objective. A random binary pattern is displayed on the DMD, encoding the image in the spatial domain. The DMD consists of tens of thousands of micromirrors, which can be individually turned on or off. The light reflected from the "on" micromirrors is collected by the same microscope objective and forms an image at the entrance port of the streak camera. In CUP, the entrance slit of the streak camera is fully opened, thereby allowing a 2D image input. The encoded image is temporally sheared along the vertical axis by the sweeping voltage in the streak camera and finally measured by a CCD at the output. The image formation process in CUP can be expressed as

$$E(m, n) = TSC\, I(x, y, t), \tag{5.18}$$

where $E(m, n)$ is the light energy measured at the CCD pixel $(m, n)$, $I(x, y, t)$ is the event datacube, $C$ is the spatial encoding operator depicting the function of the DMD, $S$ is the temporal shearing operator depicting the function of the streak camera, and $T$ is the spatiotemporal integration operator depicting the function of the CCD camera.

The forward model embedded in CUP's image formation process in Eq. (5.18) can be discretized as follows. The intensity distribution of the dynamic scene, $I(x, y, t)$, is first imaged onto the DMD. Given unit magnification and ideal optical imaging (i.e., the point-spread function approaches a delta function), the intensity distribution of the resultant image at the DMD is identical to that of the original scene. The DMD displays a pseudo-randomly distributed, square, binary-valued pattern, encoding the image in the spatial domain. The image immediately after the DMD is

$$I_c(x', y', t) = \sum_{i,j} I(x', y', t)\, C_{i,j}\, \mathrm{rect}\!\left[ \frac{x'}{d'} - \left(i + \frac{1}{2}\right), \frac{y'}{d'} - \left(j + \frac{1}{2}\right) \right]. \tag{5.19}$$

Here, $C_{i,j}$ is an element of the matrix representing the coded pattern, $i, j$ are the matrix element indices, and $d'$ is the DMD pixel size. For each dimension, the rectangular function is defined as

$$\mathrm{rect}(x) = \begin{cases} 1, & \text{if } |x| \le \frac{1}{2}, \\ 0, & \text{else.} \end{cases} \tag{5.20}$$

This encoded image is then passed to the entrance port of the streak camera. By applying a voltage ramp, the streak camera acts as a shearing operator along the vertical axis ($y''$ axis in Fig. 5.4) on the input image. If we again assume ideal optics with unit magnification, the sheared image can be expressed as

$$I_s(x'', y'', t) = I_c(x'', y'' - vt, t), \tag{5.21}$$

where $v$ is the shearing velocity of the streak camera. $I_s(x'', y'', t)$ is then spatially integrated over each camera pixel and temporally integrated over the exposure time. The optical energy, $E(m, n)$, measured at pixel $(m, n)$ is

$$E(m, n) = \int dt \int dx'' \int dy''\, I_s(x'', y'', t)\, \mathrm{rect}\!\left[ \frac{x''}{d''} - \left(m + \frac{1}{2}\right), \frac{y''}{d''} - \left(n + \frac{1}{2}\right) \right]. \tag{5.22}$$

Fig. 5.4 System configuration of CUP. DMD, digital micromirror device. Figure reprinted with permission from [72]

Here, $d''$ is the camera pixel size. Accordingly, we can voxelize the input scene, $I(x, y, t)$, into $I_{i,j,k}$ as follows:

$$I(x, y, t) = \sum_{i,j,k} I_{i,j,k}\, \mathrm{rect}\!\left[ \frac{x}{d''} - \left(i + \frac{1}{2}\right), \frac{y}{d''} - \left(j + \frac{1}{2}\right), \frac{t}{\Delta_t} - \left(k + \frac{1}{2}\right) \right], \tag{5.23}$$

where $\Delta_t = d''/v$. If the DMD pixels are mapped one-to-one to the camera pixels (i.e., $d' = d''$) and perfectly registered, combining Eqs. (5.19)–(5.23) yields

$$E(m, n) = \frac{d''^3}{v} \sum_{k=0}^{n-1} C_{m,n-k}\, I_{m,n-k,k}. \tag{5.24}$$

Here $C_{m,n-k} I_{m,n-k,k}$ represents a coded, sheared scene. Image reconstruction is the solution of the inverse problem of Eq. (5.24). Given sparsity in the spatiotemporal gradient domain, the original event datacube can be estimated by forming an appropriate penalized cost and solving the following optimization problem:

$$\arg\min_{I} \left\{ \frac{1}{2} \| E - TSC\, I \|^2 + \beta \Phi(I) \right\}. \tag{5.25}$$

Here $\Phi(I)$ is the regularization function in the form of total variation, and $\beta$ is the regularization parameter. In CUP, the two-step iterative shrinkage/thresholding (TwIST) algorithm [17] is employed for iterative minimization.

As an example, Fig. 5.5 shows light reflection, refraction, and racing in two different media captured by the CUP system. Furthermore, by mounting a dichroic filter on a mirror with a small tilt angle and placing the resultant spectral separation module in front of the streak camera (Fig. 5.6a), CUP demonstrated two-color ultrafast imaging of laser excitation and the induced fluorescence decay (Fig. 5.6b–c). Because the minimization process (Eq. 5.25) forces sparsity in the spatiotemporal gradient domain, CUP's spatial and temporal resolutions are generally degraded relative to the diffraction limit and the streak camera's original temporal resolution, respectively [72].

Fig. 5.5 CUP of laser pulse. (a) Reflection, (b) Refraction, and (c) Racing in two different media. Figure reprinted with permission from [72]

Fig. 5.6 Two-color CUP. (a) Spectral separation unit. (b) Time-lapse laser excitation and fluorescence emission decay. (c) Plots of light signals in the dashed box in (b). Figure reprinted with permission from [72]
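The discrete model (5.24) amounts to three array operations: mask, shift, and sum. The sketch below implements it together with its adjoint, which any gradient-based solver for (5.25) would need. It is an illustration under assumptions of ours rather than the CUP authors' code: the datacube is stored as `I[x, y, t]`, the shear is exactly one pixel per time bin, the detector is made tall enough to hold every sheared frame, and the constant factor $d''^3/v$ is set to one.

```python
import numpy as np

def cup_forward(I, C):
    """E = T S C I for a datacube I[x, y, t] and mask C[x, y] (unit scale factor)."""
    nx, ny, nt = I.shape
    E = np.zeros((nx, ny + nt - 1))
    for k in range(nt):
        # Encode slice k, shear it by k pixels along y, and integrate on the detector.
        E[:, k:k + ny] += C * I[:, :, k]
    return E

def cup_adjoint(E, C, nt):
    """Adjoint of cup_forward: un-shear each time slice and re-apply the mask."""
    nx, ny = C.shape
    I = np.empty((nx, ny, nt))
    for k in range(nt):
        I[:, :, k] = C * E[:, k:k + ny]
    return I

# A single bright voxel at (x, y, t) = (2, 3, 1) lands on detector pixel (2, 3 + 1).
I = np.zeros((4, 5, 3))
I[2, 3, 1] = 1.0
C = np.ones((4, 5))
E = cup_forward(I, C)
```

Checking that `cup_forward` and `cup_adjoint` satisfy the inner-product identity $\langle TSC\,I, E \rangle = \langle I, (TSC)^T E \rangle$ on random inputs is a cheap way to validate an operator pair before plugging it into an iterative solver.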

Fig. 5.7 Time-of-flight CUP. Figure reprinted with permission from [82]

5.4.3.2 Time-of-Flight CUP and Convex Optimization with a Spatial Constraint

CUP is a passive device, requiring no active illumination to achieve an ultrafast frame rate. However, by adding a pulsed light source and measuring the time-of-flight signals, we can convert CUP into a volumetric imager and enable the acquisition of an $(x, y, z)$ datacube ($z$, depth) within a single camera exposure [82]. Figure 5.7 shows the system setup of time-of-flight CUP. Laser pulses of picosecond duration impinge on an engineered diffuser, uniformly illuminating the scene. A CUP system images the scattered photons and measures the time-of-flight signals. The depths ($z$) of the objects are calculated by

$$z = c\, t_{\mathrm{ToF}} / 2, \tag{5.26}$$

where $c$ is the speed of light in air, and $t_{\mathrm{ToF}}$ is the time of flight of the scattered photons from the scene. In addition, the system adds a reference camera to capture an additional time-integrated image. The image formation in time-of-flight CUP is thus formulated as

$$\begin{bmatrix} E \\ E_0 \end{bmatrix} = \begin{bmatrix} TSC \\ T_0 (1 - \eta) \end{bmatrix} I. \tag{5.27}$$

Here $T_0$ depicts the spatiotemporal integration of the reference camera, $\eta$ is the percentage of light that is measured by CUP, and $E$ and $E_0$ are the light energies measured by the streak camera and the reference camera, respectively. The original event datacube $I$ is then estimated by a concatenated convex minimization:

$$\arg\min_{I} \left\{ \frac{1}{2} \left\| \begin{bmatrix} E \\ E_0 \end{bmatrix} - \begin{bmatrix} TSC \\ T_0 (1 - \eta) \end{bmatrix} I \right\|^2 + \beta \Phi(I) \right\}. \tag{5.28}$$

Accordingly, we must adapt the TwIST algorithm to a concatenated form to solve the optimization problem in Eq. (5.28). The depth imaging results using time-of-flight CUP are shown in Fig. 5.8. Compared with the original CUP, the additional spatial constraint imposed by the reference camera improves the quality of the reconstructed images [83], particularly when the objects are static during the pulsed illumination.
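The depth conversion in Eq. (5.26) is simple enough to check numerically. The sketch below turns a reconstructed $(x, y, t)$ datacube into a depth map under assumptions that are ours and not from the text: the arrival time at each pixel is taken as the time bin with the brightest sample, the time origin coincides with the emission of the laser pulse, and the vacuum speed of light is used as a stand-in for the speed of light in air. The frame interval of 20 ps is a hypothetical value.

```python
import numpy as np

C_LIGHT = 299_792_458.0  # m/s; vacuum value, used here as an approximation for air

def depth_from_tof(t_tof):
    """Eq. (5.26): the round-trip time of flight maps to depth z = c * t / 2."""
    return C_LIGHT * np.asarray(t_tof, dtype=float) / 2.0

def depth_map(cube, frame_interval):
    """Depth per pixel from a datacube cube[x, y, k] (peak-time estimate, an assumption)."""
    arrival = np.argmax(cube, axis=2) * frame_interval
    return depth_from_tof(arrival)

# Two hypothetical scatterers returning light in time bins 10 and 50.
cube = np.zeros((2, 2, 100))
cube[0, 0, 10] = 1.0
cube[1, 1, 50] = 1.0
z = depth_map(cube, frame_interval=20e-12)
```

With this hypothetical 20 ps frame interval, one time bin corresponds to about 3 mm of depth, which indicates how the temporal resolution of the reconstruction bounds the depth resolution.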

5.4.3.3 Dual-Channel CUP and Convex Optimization with Complementary Encoding The original CUP, referred to as G1-CUP, retains approximately 50% throughput at the spatial encoding device, DMD, because the light reflected from only “on” micromirrors is collected by the following optics, limiting its application in low light imaging such as fluorescence microscopy. To improve the light throughput, the second-generation (G2) CUP system

5 Computational Spectral and Ultrafast Imaging

123

Fig. 5.8 Depth imaging by time-of-flight CUP. (a) Object. (b) Reconstructed depth map. Figure reprinted with permission from [82]

adopts a dual-channel imaging configuration as shown in Fig. 5.9a. Akin to G1-CUP, a photographic lens first images the scene onto the spatial encoding device, DMD. However, unlike G1CUP, the light rays reflected from both “on” and “off” micromirrors are collected simultaneously by a stereoscopic objective, forming two pupils at the objective’s back aperture. Another two lenses reimage the light emitting from these two pupils and create two complementary images on the entrance port of the streak camera. Finally, the temporally-sheared images (Figs. 5.9b,c) are measured by a CCD at the output port of the streak camera. The image formation in dual-channel CUP is described by

$$\begin{bmatrix} E_1 \\ E_2 \end{bmatrix} = \begin{bmatrix} TSC_1 \\ TSC_2 \end{bmatrix} I, \quad \text{and} \quad C_1 + C_2 = 1. \tag{5.29}$$

Here C1 and C2 are spatial encoding operators indicating the functions of complementary encoding patterns as seen by the two imaging channels. The image reconstruction is the solution of the inverse problem of Eq. (5.29). The convex minimization process can be written as:

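$$\arg\min_{I} \left\{ \frac{1}{2} \left\| \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} - \begin{bmatrix} TSC_1 \\ TSC_2 \end{bmatrix} I \right\|^2 + \Phi(I) \right\}. \tag{5.30}$$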

Again, the optimization can be performed by adapting the TwIST algorithm in a concatenated form. Figure 5.10a,b shows the imaging results of an obliquely incident laser wavefront sweeping across a Siemens star pattern on a flat screen, measured by G1-CUP and G2-CUP, respectively. Figure 5.10c,d shows the corresponding time-integrated images obtained by summing all the temporal frames. Moreover, a reference image was captured without spatial encoding and temporal shearing (Fig. 5.10e). Compared with G1-CUP, G2-CUP reduces the reconstruction artifacts and therefore considerably improves the image quality. It is noteworthy that the dual-channel imaging described herein is essentially equivalent to increasing the number of measurements in conventional compressed optical imaging. However, since the two complementary images are acquired in parallel, G2-CUP is still a snapshot ultrafast imager and therefore provides the full throughput advantage [84].
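The structure of Eq. (5.29) can be illustrated with a toy forward model. The following is only a sketch under stated assumptions, not the authors' implementation: the helper `cup_forward`, the array sizes, and the one-pixel-per-frame shear are all hypothetical choices made for illustration. It applies the encoding $C$, the shearing $S$, and the integration $T$ to a small random datacube for two complementary masks and stacks the two measurements.

```python
import numpy as np

def cup_forward(cube, mask):
    """Apply T S C to a datacube of shape (n_t, n_y, n_x).

    C: multiply every frame by the spatial mask.
    S: shear frame t by t pixels along y (one pixel per frame is an assumption).
    T: integrate all sheared frames on the detector.
    """
    n_t, n_y, n_x = cube.shape
    measurement = np.zeros((n_y + n_t - 1, n_x))
    for t in range(n_t):
        measurement[t:t + n_y, :] += mask * cube[t]
    return measurement

rng = np.random.default_rng(0)
cube = rng.random((5, 8, 8))                     # toy dynamic scene I
c1 = (rng.random((8, 8)) < 0.5).astype(float)    # pattern seen via "on" micromirrors
c2 = 1.0 - c1                                    # complementary pattern, so C1 + C2 = 1

e1 = cup_forward(cube, c1)
e2 = cup_forward(cube, c2)
stacked = np.concatenate([e1, e2], axis=0)       # left-hand side of (5.29)

# With complementary codes, the two channels together collect all the light.
unencoded = cup_forward(cube, np.ones((8, 8)))
print(np.allclose(e1 + e2, unencoded))
```

Because the model is linear in the mask, the sum of the two channels equals the measurement of an unencoded, sheared and integrated scene, which is one way to see why no light is discarded at the DMD in the dual-channel configuration.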


Fig. 5.9 Dual-channel CUP system. (a) System layout. The green dashed line indicates the incident light. The red dashed line indicates the reflected light from the DMD. (b)–(c) Complementary images formed at the streak camera

Fig. 5.10 Comparison among single-channel and dual-channel CUP images. (a) Representative temporal frames acquired from single-channel CUP. (b) Representative temporal frames acquired from dual-channel CUP. (c) Time-integrated image from reconstructed dynamic scenes acquired by single-channel CUP. (d) Time-integrated image from reconstructed dynamic scenes acquired by dual-channel CUP. (e) Ground truth image captured without introducing spatial encoding and temporal shearing. The arrows in (a) and (b) indicate the wavefront’s in-plane propagation direction

5.5 Conclusions

In this chapter, computational approaches to multi-dimensional imaging (spectral and ultrafast) have been illustrated to yield enhancements in desirable imaging attributes such as spatial, temporal, and spectral resolutions. The superior performance over conventional imaging systems is obtained through a computational imaging approach, which involves distributing the imaging task between a physical and a computational system and then digitally forming the image datacube of interest from multiplexed measurements by means of solving an inverse problem via convex optimization techniques. Such computational approaches signal the promise of exciting future directions toward further enhancements in multidimensional imaging capabilities, enabling an array of unprecedented applications in physical and life sciences.

References

1. Gao L, Wang LV (2016) A review of snapshot multidimensional optical imaging: Measuring photon tags in parallel. Physics Reports 616:1–37
2. Groetsch CW (1993) Inverse Problems in the Mathematical Sciences. Vieweg
3. Kamalabadi F (2010) Multidimensional image reconstruction in astronomy. IEEE Signal Process Mag 27(1):86–96
4. Tikhonov AN, Arsenin VY (1977) Solutions of Ill-Posed Problems. Winston, Washington, DC
5. Hanke M, Engl HW, Neubauer A (1996) Regularization of Inverse Problems. Kluwer, Dordrecht
6. Bertero M, Boccacci P (1998) Introduction to Inverse Problems in Imaging. IOP Publishing, Bristol
7. Kaipio J, Somersalo E (2005) Statistical and Computational Inverse Problems. Springer, New York
8. Hansen PC (2010) Discrete Inverse Problems: Insight and Algorithms, vol 7. SIAM
9. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell PAMI-6(6):721–741
10. Tikhonov AN (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics 4:1035–1038
11. Karl WC (2000) Handbook of Image and Video Processing, Regularization in Image Restoration and Reconstruction. Plenum Press
12. Vogel CR (2002) Computational Methods for Inverse Problems. SIAM
13. Beck A, Teboulle M (2009) Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, pp 42–88
14. Tropp JA, Wright SJ (2010) Computational methods for sparse solution of linear inverse problems. Proc IEEE 98(6):948–958
15. Geman D, Yang C (1995) Nonlinear image recovery with half-quadratic regularization. IEEE Trans Image Process 4(7):932–946
16. Zibulevsky M, Elad M (2010) L1-L2 optimization in signal and image processing. IEEE Signal Process Mag 27(3):76–88
17. Bioucas-Dias JM, Figueiredo MA (2007) A new twist: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Trans Image Process 16(12):2992–3004
18. Figueiredo MA, Nowak RD, Wright SJ (2007) Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J Sel Top Signal Process 1(4):586–597
19. Kim SJ, Koh K, Lustig M, Boyd S, Gorinevsky D (2007) An interior-point method for large-scale l1 regularized least squares. IEEE J Sel Top Signal Process 1(4):606–617
20. Candes E, Romberg J (2005) l1-magic: Recovery of sparse signals via convex programming. Tech Rep, California Inst Technol, Pasadena, CA
21. Efron B, Hastie T, Johnstone I, Tibshirani R et al (2004) Least angle regression. Ann Statist 32(2):407–499
22. Osborne MR, Presnell B, Turlach BA (2000) A new approach to variable selection in least squares problems. IMA J Numer Anal 20(3):389–403
23. Donoho DL, Tsaig Y (2008) Fast solution of l1 norm minimization problems when the solution may be sparse. IEEE Trans Inf Theory 54(11):4789–4812
24. Chartrand R (2007) Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process Lett 14(10):707–710
25. Oktem FS, Kamalabadi F, Davila JM (2014) A parametric estimation approach to instantaneous spectral imaging. IEEE Trans Image Process 23(12):5707–5721. https://doi.org/10.1109/TIP.2014.2363903
26. Shepherd GG (2002) Spectral Imaging of the Atmosphere, vol. 82. Academic Press
27. Arce G, Brady D, Carin L, Arguello H, Kittle D (2014) Compressive coded aperture spectral imaging: An introduction. IEEE Signal Process Mag 31(1):105–115
28. Willett R, Duarte M, Davenport M, Baraniuk R (2014) Sparsity and structure in hyperspectral imaging: Sensing, reconstruction, and target detection. IEEE Signal Process Mag 31(1):116–126
29. Okamoto T, Yamaguchi I (1991) Simultaneous acquisition of spectral image information. Opt Lett 16(16):1277–1279
30. Descour M, Dereniak E (1995) Computed-tomography imaging spectrometer: experimental calibration and reconstruction results. Appl Opt 34(22):4817–4826
31. Gehm ME, John R, Brady DJ, Willett RM, Schulz TJ (2007) Single-shot compressive spectral imaging with a dual-disperser architecture. Opt Express 15(21):14013–14027
32. Wagadarikar A, John R, Willett R, Brady D (2008) Single disperser design for coded aperture snapshot spectral imaging. Appl Opt 47(10):B44–B51
33. August Y, Vachman C, Rivenson Y, Stern A (2013) Compressive hyperspectral imaging by random separable projections in both the spatial and the spectral domains. Appl Opt 52(10):D46–D54
34. Oktem FS, Kamalabadi F, Davila JM (2014) High-resolution computational spectral imaging with photon sieves. In: 2014 IEEE international conference on image processing, pp 5122–5126
35. Oktem FS (2014) Computational imaging and inverse techniques for high-resolution and instantaneous spectral imaging. PhD thesis, University of Illinois at Urbana-Champaign
36. Kankelborg CC, Thomas RJ (2001) Simultaneous imaging and spectroscopy of the solar atmosphere: advantages and challenges of a 3-order slitless spectrograph. In: Proc SPIE 4498:16–26
37. Ford BK, Volin CE, Murphy SM, Lynch RM, Descour MR (2001) Computed tomography-based spectral imaging for fluorescence microscopy. Biophys J 80(2):986–993
38. Hagen N, Dereniak EL (2008) Analysis of computed tomographic imaging spectrometers. I. Spatial and spectral resolution. Appl Opt 47(28):F85–F95
39. Shepp LA, Vardi Y (1982) Maximum likelihood reconstruction for emission tomography. IEEE Trans Med Imaging 1(2):113–122
40. Hagen N, Kudenov MW (2013) Review of snapshot spectral imaging technologies. Opt Eng 52(9):090901–090901
41. Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
42. Candès EJ et al (2006) Compressive sampling. In: Proceedings on the International Congress of Mathematicians vol 3, pp 1433–1452, Madrid, Spain
43. Baraniuk RG (2007) Compressive sensing. IEEE Signal Process Mag 24(4):118–121
44. Candès EJ, Wakin MB (2008) An introduction to compressive sampling. IEEE Signal Process Mag 25(2):21–30
45. Romberg J (2008) Imaging via compressive sampling [introduction to compressive sampling and recovery via convex programming]. IEEE Signal Process Mag 25(2):14–20
46. Fornasier M, Rauhut H (2011) Compressive sensing. In: Handbook of mathematical methods in imaging, pp 187–228. Springer
47. Eismann MT (2012) Hyperspectral remote sensing. SPIE, Bellingham
48. Willett RM, Marcia RF, Nichols JM (2011) Compressed sensing for practical optical imaging systems: a tutorial. Opt Eng 50(7):072601–072601
49. Sun T, Takhar D, Laska J, Duarte M, Bansal V, Baraniuk R, Kelly K (2008) Realization of confocal and hyperspectral microscopy via compressive sensing. In: APS Meeting Abstracts 1, p 36008
50. Sun T, Kelly K (2009) Compressive sensing hyperspectral imager. In: Computational Optical Sensing and Imaging, p CTuA5, Optical Society of America
51. Duarte MF, Davenport MA, Takhar D, Laska JN, Sun T, Kelly KE, Baraniuk RG et al (2008) Single-pixel imaging via compressive sampling. IEEE Signal Process Mag 25(2):83–91
52. Wagadarikar AA, Pitsianis NP, Sun X, Brady DJ (2009) Video rate spectral imaging using a coded aperture snapshot spectral imager. Opt Express 17(8):6368–6388
53. Kittle D, Choi K, Wagadarikar A, Brady DJ (2010) Multiframe image estimation for coded aperture snapshot spectral imagers. Appl Opt 49(36):6824
54. Wu Y, Mirza IO, Arce GR, Prather DW (2011) Development of a digital-micromirror-device-based multishot snapshot spectral imaging system. Opt Lett 36(14):2692–2694
55. Kipp L, Skibowski M, Johnson R, Berndt R, Adelung R, Harm S, Seemann R (2001) Sharper images by focusing soft x-rays with photon sieves. Nature 414(6860):184–188
56. Attwood D (2000) Soft x-rays and extreme ultraviolet radiation: principles and applications. Cambridge University Press, Cambridge
57. Gorenstein P, Phillips JD, Reasenberg RD (2005) Refractive/diffractive telescope with very high angular resolution for X-ray astronomy. In: Proc SPIE, Optics for EUV, X-Ray, and Gamma-Ray Astronomy II, 5900:590018
58. Davila J (2011) High-resolution solar imaging with a photon sieve. In: SPIE Optical Engineering+ Applications, International Society for Optics and Photonics, pp 81480O–81480O
59. Menon R, Gil D, Barbastathis G, Smith HI (2005) Photon-sieve lithography. J Opt Soc Am A 22(2):342–345
60. Andersen G (2010) Membrane photon sieve telescopes. Appl Opt 49:6391–6394
61. Andersen G, Asmolova O, McHarg MG, Quiller T, Maldonado C (2016) FalconSAT-7: a membrane space solar telescope. In: Proc SPIE, Space Telescopes and Instrumentation, 9904:99041P
62. Zhou C, Dong X, Shi L, Wang C, Du C (2009) Experimental study of a multiwavelength photon sieve designed by random-area-divided approach. Appl Opt 48(8):1619–1623
63. Artzner GE, Delaboudiniere JP, Song X (2003) Photon sieves as euv telescopes for solar orbiter. In: Proc SPIE 4853:159
64. Andersen G (2005) Large optical photon sieve. Opt Lett 30(22):2976–2978
65. Andersen G, Tullson D (2007) Broadband antihole photon sieve telescope. Appl Opt 46(18):3706–3708
66. Blahut RE (2004) Theory of remote image formation. Cambridge University Press, Cambridge
67. Oktem FS, Davila JM, Kamalabadi F (2014) Image formation model for photon sieves. In: 2013 IEEE International Conference on Image Processing (ICIP), IEEE, pp 2373–2377
68. Goodman JW (2005) Introduction to Fourier Optics, 3rd edn. Roberts, Englewood, Colorado
69. Vogel CR, Oman ME (1998) Fast, robust total variation-based reconstruction of noisy, blurred images. IEEE Trans Image Process 7(6):813–824
70. Lemen JR, Title AM, Akin DJ, Boerner PF, Chou C, Drake JF, Duncan DW, Edwards CG, Friedlaender FM, Heyman GF et al (2011) The atmospheric imaging assembly (AIA) on the solar dynamics observatory (SDO). Solar Physics, pp 1–24
71. Reddy D, Veeraraghavan A, Chellappa R (2011) P2c2: Programmable pixel compressive camera for high speed imaging. In: IEEE conference on computer vision and pattern recognition, IEEE, pp 329–336
72. Gao L, Liang J, Li C, Wang LV (2014) Single-shot compressed ultrafast photography at one hundred billion frames per second. Nature 516(7529):74–77
73. Llull P, Liao X, Yuan X, Yang J, Kittle D, Carin L, Sapiro G, Brady DJ (2013) Coded aperture compressive temporal imaging. Opt Express 21(9):10526–10545
74. Fernandez-Cull C, Tyrrell BM, D’Onofrio R, Bolstad A, Lin J, Little JW, Blackwell M, Renzi M, Kelly M (2014) Smart pixel imaging with computational-imaging arrays. In: SPIE Defense+ Security, International Society for Optics and Photonics, pp 90703D–90703D
75. Shepard RH, Fernandez-Cull C, Raskar R, Shi B, Barsi C, Zhao H (2014) Optical design and characterization of an advanced computational imaging system. In: SPIE Optical Engineering+ Applications, International Society for Optics and Photonics, pp 92160A–92160A
76. Liu D, Gu J, Hitomi Y, Gupta M, Mitsunaga T, Nayar SK (2014) Efficient space-time sampling with pixel-wise coded exposure for high-speed imaging. IEEE Trans Pattern Anal Mach Intell 36(2):248–260
77. Holloway J, Sankaranarayanan AC, Veeraraghavan A, Tambe S (2012) Flutter shutter video camera for compressive sensing of videos. In: IEEE International Conference on Computational Photography (ICCP), IEEE, pp 1–9
78. Liao X, Li H, Carin L (2014) Generalized alternating projection for weighted-2,1 minimization with applications to model-based compressive sensing. SIAM J Imaging Sci 7(2):797–823
79. K.K. Hamamatsu Photonics (2013) Guides to streak cameras. https://www.hamamatsu.com
80. Solli D, Ropers C, Koonath P, Jalali B (2007) Optical rogue waves. Nature 450(7172):1054–1057
81. Eldar YC, Kutyniok G (2012) Compressed sensing: theory and applications. Cambridge University Press, New York
82. Liang J, Gao L, Hai P, Li C, Wang LV (2015) Encrypted three-dimensional dynamic imaging using snapshot time-of-flight compressed ultrafast photography. Scientific Reports 5:15504
83. Zhu L, Chen Y, Liang J, Xu Q, Gao L, Ma C, Wang LV (2016) Space-and intensity-constrained reconstruction for compressed ultrafast photography. Optica 3(7):694–697
84. Hagen N, Kester RT, Gao L, Tkaczyk TS (2012) Snapshot advantage: a review of the light collection improvement for parallel high-dimensional measurement systems. Opt Eng 51(11):111702–1

6 Discriminative Sparse Representations

He Zhang and Vishal M. Patel

6.1 Introduction

Sparse and redundant signal representations have recently drawn much interest in vision, signal and image processing [20, 53, 54, 71, 96]. This is due in part to the fact that signals and images of interest can be sparse or compressible in some dictionary. The dictionary can be either based on a mathematical model of the data or it can be learned directly from the data. It has been observed that learning a dictionary directly from training data, rather than using a predetermined dictionary such as wavelet or Fourier, usually leads to better representation and hence can provide improved results in many practical applications such as image restoration and classification [53, 71, 96]. In this chapter, we summarize approaches to object recognition based on sparse representation (SR) and dictionary learning (DL). We first highlight the idea behind sparse representation. Then, we outline the Sparse Representation-based Classification (SRC) algorithm [97] and present its applications in robust biometrics recognition [60, 61, 97]. Finally, we present several supervised, unsupervised, semi-supervised and weakly supervised dictionary learning algorithms for object representation and recognition.

H. Zhang • V.M. Patel, Rutgers University, 94 Brett Road, Piscataway, NJ 08854, USA

© Springer International Publishing AG 2017 V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_6

6.2 Sparse Representation

Representing a signal involves the choice of a basis, where the signal is uniquely represented as a linear combination of the basis elements. When an orthogonal basis is used, the representation coefficients are simply found by computing inner products of the signal with the basis elements. In the non-orthogonal case, the coefficients are found by taking the inner products of the signal with the bi-orthogonal basis. Due to the limitations of orthogonal and bi-orthogonal bases in representing complex signals, overcomplete dictionaries have been used. An overcomplete dictionary has more elements, also known as atoms, than the dimension of the signal. Consider the dictionary $B = [b_1, \ldots, b_L] \in \mathbb{R}^{N \times L}$, where $L \ge N$ and the columns of $B$ are the dictionary atoms. Representing $x \in \mathbb{R}^N$ using $B$ entails solving the following optimization problem

$$\hat{\alpha} = \arg\min_{\alpha'} C(\alpha') \quad \text{subject to} \quad x = B\alpha', \tag{6.1}$$

for some cost function $C(\alpha)$. In practice, since we want the sorted coefficients to decay quickly, sparsity of the representation is usually enforced. This can be done by choosing $C(\alpha) = \|\alpha\|_p$, $0 \le p \le 1$, where $\|\cdot\|_p$ denotes the $\ell_p$-norm defined as

$$\|x\|_p = \Big( \sum_i |x_i|^p \Big)^{1/p},$$

and the $\ell_0$-norm is defined as the limit as $p \to 0$ of the $\ell_p$-norm

$$\|x\|_0 = \lim_{p \to 0} \|x\|_p^p = \lim_{p \to 0} \sum_i |x_i|^p.$$

In general, the $\ell_0$-norm counts the number of non-zero elements in a vector

$$\|x\|_0 = \#\{i : x_i \ne 0\}. \tag{6.2}$$

Fig. 6.1 The behavior of the scalar function $|x|^p$ for various values of $p$ (curves shown for $p = 2$, $1$, $0.5$ and $0.01$). As $p$ goes to zero, $|x|^p$ becomes the delta function, which is zero for $x = 0$ and 1 elsewhere

Figure 6.1 shows the behavior of the scalar weight functions $|\alpha|^p$ for various values of $p$. Note that as $p$ goes to zero, $|\alpha|^p$ becomes a count of the nonzeros in $\alpha$. Hence, by setting $C(\alpha) = \|\alpha\|_0$, one can look for the sparsest solution to the underdetermined linear system of equations $x = B\alpha$. The optimization problem in this case becomes the following

$$(P_0) \qquad \hat{\alpha} = \arg\min_{\alpha'} \|\alpha'\|_0 \quad \text{subject to} \quad x = B\alpha'. \tag{6.3}$$

As it turns out, this problem is NP-hard, so no polynomial-time algorithm for it is known. As a result, other alternatives are usually sought. The approach often taken in practice is to instead solve the following $\ell_1$-minimization problem

$$(P_1) \qquad \hat{\alpha} = \arg\min_{\alpha'} \|\alpha'\|_1 \quad \text{subject to} \quad x = B\alpha'. \tag{6.4}$$

$(P_1)$ is a convex optimization problem, and it is the one closest to $(P_0)$ that can be solved by standard optimization tools. Problem (6.4) is often referred to as Basis Pursuit [12]. It has been shown that for $B$ with incoherent columns, whenever $(P_0)$ has a sufficiently sparse solution, that solution is unique and is equal to the solution of $(P_1)$. Define the mutual coherence of the matrix $B$ as follows:

Definition 1 ([9]). The mutual coherence of a given matrix $B$ is the largest absolute normalized inner product between different columns from $B$. Denoting the $k$th column in $B$ by $b_k$, the mutual coherence is given by

$$\tilde{\mu}(B) = \max_{1 \le k, j \le L,\; k \ne j} \frac{|b_k^T b_j|}{\|b_k\|_2 \, \|b_j\|_2}.$$

With this definition, one can state the following theorem.

Theorem 1 ([19, 28]). For the system of linear equations $x = B\alpha$ ($B \in \mathbb{R}^{N \times L}$ full rank with $L \ge N$), if a solution $\alpha$ exists obeying

$$\|\alpha\|_0 < \frac{1}{2} \left( 1 + \frac{1}{\tilde{\mu}(B)} \right),$$

that solution is both the unique solution of $(P_1)$ and the unique solution of $(P_0)$.

In the rest of the section, we will show how variants of (6.1) can be used to develop robust algorithms for object classification.
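To make Definition 1 and the bound of Theorem 1 concrete, the following minimal sketch computes the mutual coherence of a small dictionary and the resulting sparsity bound. The function names and the toy dictionary (an identity basis concatenated with a normalized 4×4 Hadamard basis) are assumptions chosen only for illustration.

```python
import numpy as np

def mutual_coherence(B):
    """Largest absolute normalized inner product between distinct columns of B."""
    Bn = B / np.linalg.norm(B, axis=0, keepdims=True)
    gram = np.abs(Bn.T @ Bn)
    np.fill_diagonal(gram, 0.0)          # exclude the k == j terms
    return gram.max()

def sparsity_bound(B):
    """Right-hand side of the inequality in Theorem 1: (1 + 1/mu) / 2."""
    return 0.5 * (1.0 + 1.0 / mutual_coherence(B))

# Toy dictionary (an assumption for illustration): identity plus normalized Hadamard atoms.
h2 = np.array([[1.0, 1.0], [1.0, -1.0]])
hadamard4 = np.kron(h2, h2) / 2.0        # orthonormal 4x4 Hadamard basis
B = np.hstack([np.eye(4), hadamard4])    # N = 4, L = 8

mu = mutual_coherence(B)
bound = sparsity_bound(B)
print(mu, bound)
```

For this dictionary the coherence is 0.5 and the bound is 1.5, so the theorem only guarantees uniqueness for representations with a single nonzero coefficient. Larger, more incoherent dictionaries give a larger admissible sparsity level.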

6.2.1 Sparse Representation-based Classification

In object recognition, given a set of labeled training samples, the task is to identify the class to which a test sample belongs. Following [97], we briefly describe the use of sparse representations for biometric recognition; however, this framework can be applied to a general object recognition problem. Suppose that we are given $L$ distinct classes and a set of $n$ training images per class. One can extract an $N$-dimensional vector of features from each of these images. Let $B_k = [x_{k1}, \ldots, x_{kj}, \ldots, x_{kn}]$ be an $N \times n$ matrix of features from the $k$th class, where $x_{kj}$ denotes the feature vector from the $j$th training image of the $k$th class. Define a new matrix, or dictionary, $B$ as the concatenation of training samples from all the classes:

$$B = [B_1, \ldots, B_L] \in \mathbb{R}^{N \times (n \cdot L)} = [x_{11}, \ldots, x_{1n} \mid x_{21}, \ldots, x_{2n} \mid \cdots \mid x_{L1}, \ldots, x_{Ln}].$$

We consider an observation vector $y \in \mathbb{R}^N$ of unknown class as a linear combination of the training vectors as

$$y = \sum_{i=1}^{L} \sum_{j=1}^{n} \alpha_{ij} x_{ij} \tag{6.5}$$

with coefficients $\alpha_{ij} \in \mathbb{R}$. The above equation can be written more compactly as

$$y = B\alpha, \tag{6.6}$$

where

$$\alpha = [\alpha_{11}, \ldots, \alpha_{1n} \mid \alpha_{21}, \ldots, \alpha_{2n} \mid \cdots \mid \alpha_{L1}, \ldots, \alpha_{Ln}]^T \tag{6.7}$$

and $(\cdot)^T$ denotes the transposition operation. We assume that, given sufficient training samples of the $k$th class, $B_k$, any new test image $y \in \mathbb{R}^N$ that belongs to the same class will lie approximately in the linear span of the training samples from class $k$. This implies that most of the coefficients not associated with class $k$ in (6.7) will be close to zero. Hence, $\alpha$ is a sparse vector.

In order to represent an observed vector $y \in \mathbb{R}^N$ as a sparse vector $\alpha$, one needs to solve the system of linear equations (6.6). Typically $n \cdot L \gg N$, and hence the system of linear equations (6.6) is under-determined and has no unique solution. As mentioned earlier, if $\alpha$ is sparse enough and $B$ satisfies certain properties, then the sparsest $\alpha$ can be recovered by solving the following optimization problem

$$\hat{\alpha} = \arg\min_{\alpha'} \|\alpha'\|_1 \quad \text{subject to} \quad y = B\alpha'. \tag{6.8}$$

When noisy observations are given, Basis Pursuit DeNoising (BPDN) can be used to approximate $\alpha$

$$\hat{\alpha} = \arg\min_{\alpha'} \|\alpha'\|_1 \quad \text{subject to} \quad \|y - B\alpha'\|_2 \le \varepsilon, \tag{6.9}$$


where we have assumed that the observations are of the following form

$$y = B\alpha + \eta \tag{6.10}$$

with $\|\eta\|_2 \le \varepsilon$. Given an observation vector $y$ from one of the $L$ classes in the training set, one can compute its coefficients $\hat{\alpha}$ by solving either (6.8) or (6.9). One can perform classification based on the fact that high values of the coefficients $\hat{\alpha}$ will be associated with the columns of $B$ from a single class. This can be done by comparing how well the different parts of the estimated coefficients, $\hat{\alpha}$, represent $y$. The minimum of the representation error, or residual error, can then be used to identify the correct class. The residual error of class $k$ is calculated by keeping the coefficients associated with that class and setting the coefficients not associated with class $k$ to zero. This can be done by introducing a characteristic function, $\Pi_k : \mathbb{R}^{n \cdot L} \to \mathbb{R}^{n \cdot L}$, that selects the coefficients associated with the $k$th class as follows

$$r_k(y) = \|y - B\,\Pi_k(\hat{\alpha})\|_2. \tag{6.11}$$

Here $\Pi_k$ acts as a mask with value one at locations corresponding to class $k$ and zero at all other entries. The class, $d$, which is associated with an observed vector, is then declared as the one that produces the smallest approximation error

$$d = \arg\min_k r_k(y). \tag{6.12}$$

The sparse representation-based classification method is summarized in Algorithm 1.

For classification, it is important to be able to detect and then reject the test samples of poor quality. To decide whether a given test sample has good quality, one can use the notion of Sparsity Concentration Index (SCI) proposed in [97]. The SCI of a coefficient vector $\alpha \in \mathbb{R}^{L \cdot n}$ is defined as

$$\mathrm{SCI}(\alpha) = \frac{L \cdot \max_i \|\Pi_i(\alpha)\|_1 / \|\alpha\|_1 - 1}{L - 1}. \tag{6.13}$$

SCI takes values between 0 and 1. SCI values close to 1 correspond to the case where the test image can be approximately represented by using only images from a single class; the test vector then has enough discriminating features of its class and is of high quality. If $\mathrm{SCI} = 0$, the coefficients are spread evenly across all classes, so the test vector is not similar to any of the classes and is of poor quality. A threshold can be chosen to reject the images with poor quality. For instance, a test image can be rejected if $\mathrm{SCI}(\hat{\alpha}) < \tau$ and otherwise accepted as valid, where $\tau$ is some chosen threshold between 0 and 1.
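The SRC steps, namely sparse coding, the class-wise residuals (6.11), the decision rule (6.12) and the SCI (6.13), can be sketched as follows. This is a hedged illustration rather than the implementation of [97]: instead of the constrained problems (6.8) or (6.9) it solves the unconstrained surrogate $\tfrac{1}{2}\|y - B\alpha\|_2^2 + \lambda\|\alpha\|_1$ with a plain iterative soft-thresholding loop, and the function names, the value of `lam`, and the synthetic subspace data are assumptions.

```python
import numpy as np

def ista_l1(B, y, lam=0.01, n_iter=1000):
    """Minimize 0.5*||y - B a||_2^2 + lam*||a||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2            # 1 / Lipschitz constant of the gradient
    a = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = a - step * (B.T @ (B @ a - y))             # gradient step on the data term
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)   # soft threshold
    return a

def src_classify(B, labels, y, n_classes, lam=0.01):
    """Return (predicted class, residuals r_k(y), SCI), following (6.11)-(6.13)."""
    alpha = ista_l1(B, y, lam)
    residuals = np.empty(n_classes)
    class_l1 = np.empty(n_classes)
    for k in range(n_classes):
        alpha_k = np.where(labels == k, alpha, 0.0)    # Pi_k(alpha)
        residuals[k] = np.linalg.norm(y - B @ alpha_k)
        class_l1[k] = np.abs(alpha_k).sum()
    total = np.abs(alpha).sum()
    sci = 0.0
    if total > 0:
        sci = (n_classes * class_l1.max() / total - 1.0) / (n_classes - 1.0)
    return int(np.argmin(residuals)), residuals, sci

# Synthetic data (an assumption): each class lives in its own random 3-D subspace.
rng = np.random.default_rng(0)
N, n, L, d = 60, 6, 4, 3          # feature dim, samples per class, classes, subspace dim
bases = [np.linalg.qr(rng.standard_normal((N, d)))[0] for _ in range(L)]
B = np.hstack([U @ rng.standard_normal((d, n)) for U in bases])
B /= np.linalg.norm(B, axis=0, keepdims=True)
labels = np.repeat(np.arange(L), n)

true_class = 2
y = bases[true_class] @ rng.standard_normal(d)
y /= np.linalg.norm(y)
pred, residuals, sci = src_classify(B, labels, y, L)
print(pred, np.round(residuals, 3), round(float(sci), 3))
```

In a real system the returned SCI would be compared with the chosen threshold before the class label is accepted; a dedicated $\ell_1$ solver could replace the simple loop used here.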

Algorithm 1: Sparse Representation-based Classification (SRC) Algorithm
Input: $B \in \mathbb{R}^{N \times (n \cdot L)}$, $y \in \mathbb{R}^N$.
1. Solve the BP (6.8) or BPDN (6.9) problem.
2. Compute the residual using (6.11).
3. Identify $y$ using (6.12).
Output: Class label of $y$.

6.2.2 Robust Biometrics Recognition via Sparse Representation

To illustrate the effectiveness of the SRC algorithm for face and iris biometrics, we highlight some of the results presented in [60, 97], and [61]. The recognition rates achieved by the SRC method for face recognition with different features and dimensions on the extended Yale B dataset [26] are summarized in Table 6.1. As can be seen from Table 6.1, the SRC method achieves the best recognition rate of 98.09% with randomfaces of dimension 504.

Table 6.1 Recognition rates (in %) of SRC Algorithm [97] on the extended Yale B database

| Dimension  | 30    | 56    | 120   | 504   |
|------------|-------|-------|-------|-------|
| Eigen      | 86.5  | 91.63 | 93.95 | 96.77 |
| Laplacian  | 87.49 | 91.72 | 93.95 | 96.52 |
| Random     | 82.60 | 91.47 | 95.53 | 98.09 |
| Downsample | 74.57 | 86.16 | 92.13 | 97.10 |
| Fisher     | 86.91 | –     | –     | –     |

Table 6.2 Recognition results with partial facial features [97]

|           | Right eye | Nose  | Mouth  |
|-----------|-----------|-------|--------|
| Dimension | 5,040     | 4,270 | 12,936 |
| SRC       | 93.7%     | 87.3% | 98.3%  |
| NN        | 68.8%     | 49.2% | 72.7%  |
| NS        | 78.6%     | 83.7% | 94.4%  |
| SVM       | 85.8%     | 70.8% | 95.3%  |

Partial face features have been very popular in recovering the identity of a human face [87, 97]. The recognition results on partial facial features such as an eye, nose, and mouth on the same dataset are summarized in Table 6.2. The SRC algorithm achieves the best recognition performance of 93.7%, 87.3% and 98.3% on eye, nose and mouth features, respectively, and it outperforms competing methods such as Nearest Neighbor (NN), Nearest Subspace (NS) and Support Vector Machines (SVM). These results show that SRC can provide good recognition performance even when only partial face features are provided (Fig. 6.2).

One of the main difficulties in iris biometrics is that iris images acquired from a partially cooperating subject often suffer from blur, occlusion due to eyelids, and specular reflections. As a result, the performance of existing iris recognition systems degrades significantly on these images. Hence, it is essential to select good images before they are input to the recognition algorithm. To this end, an SR-based algorithm for iris biometrics that can select and recognize iris images in a single step was proposed in [61]. The block diagram of the SR-based method for iris recognition is shown in Fig. 6.3.

In Fig. 6.4, we display the iris images having the least SCI value for the blur, occlusion and segmentation error experiments performed on the real iris images in the University of Notre Dame ND dataset [7]. As can be observed, the low-SCI images suffer from high amounts of distortion. The recognition performance of the SR-based method for iris biometrics [61] is summarized in Table 6.3. As can be seen from the table, SRC provides better recognition performance than NN and Libor Masek’s iris identification source code [44].

Fig. 6.2 Examples of partial facial features. (a) Eye. (b) Nose. (c) Mouth

Fig. 6.3 Block diagram of the method proposed in [61] for the selection and recognition of iris images. Blocks: Input Iris Image → Iris Segmentation → Feature Extraction → Sparse Representation → Compute SCI → SCI > Threshold? (No: Reject Image; Yes: Compute Reconstruction Error → Select Minimizer)

6.3 Dictionary Learning

Instead of using a pre-determined dictionary $B$, as in (6.1), one can directly learn it from the data [51]. In this section, we will highlight some of the methods for learning dictionaries and present their applications in object representation and classification [14, 53, 57, 71, 73, 84, 96].

6.3.1 Dictionary Learning Algorithms

Several algorithms have been developed for the task of learning a dictionary. Two of the most well-known algorithms are the method of optimal directions (MOD) [23] and the KSVD algorithm [2]. Given a set of examples $X = [x_1, \ldots, x_n]$, the goal of the KSVD and MOD algorithms is to find a dictionary $B$ and a sparse matrix $\Gamma$ that minimize the following representation error

$$(\hat{B}, \hat{\Gamma}) = \arg\min_{B, \Gamma} \|X - B\Gamma\|_F^2 \quad \text{subject to} \quad \|\gamma_i\|_0 \le T_0 \;\; \forall i,$$

where $\gamma_i$ represent the columns of $\Gamma$ and $T_0$ denotes the sparsity level. Both MOD and KSVD are iterative methods and alternate between sparse-coding and dictionary update steps. First, a dictionary $B$ with $\ell_2$-normalized columns is initialized. Then, the main iteration is composed of the following two stages:

• Sparse coding: In this step, $B$ is fixed and the following optimization problem is solved to compute the representation vector $\gamma_i$ for each example $x_i$

$$i = 1, \ldots, n, \qquad \min_{\gamma_i} \|x_i - B\gamma_i\|_2^2 \quad \text{s.t.} \quad \|\gamma_i\|_0 \le T_0. \tag{6.14}$$

Approximate solutions of the above problem can be obtained by using any sparse coding algorithm, such as Orthogonal Matching Pursuit (OMP) [58, 92] and BP [12].
• Dictionary update: This is where the MOD and KSVD algorithms differ. The MOD algorithm updates all the atoms simultaneously by solving the following optimization problem

$$\arg\min_{B} \|X - B\Gamma\|_F^2, \tag{6.15}$$

whose solution is given by $B = X\Gamma^T (\Gamma\Gamma^T)^{-1}$. Even though the MOD algorithm is very effective and usually converges in a few iterations, it suffers from the high complexity of the matrix inversion, as discussed in [2].

In the case of KSVD, the dictionary update is performed atom-by-atom in an efficient way rather than using a matrix inversion. The term to be minimized in Eq. (6.15) can be rewritten as

$$\|X - B\Gamma\|_F^2 = \Big\| X - \sum_j b_j \gamma_T^j \Big\|_F^2 = \Big\| \Big( X - \sum_{j \ne j_0} b_j \gamma_T^j \Big) - b_{j_0} \gamma_T^{j_0} \Big\|_F^2,$$

where $\gamma_T^j$ represents the $j$th row of $\Gamma$. Since we want to update $b_{j_0}$ and $\gamma_T^{j_0}$, the first term in the above equation

$$E_{j_0} = X - \sum_{j \ne j_0} b_j \gamma_T^j$$

can be precomputed. The optimal $b_{j_0}$ and $\gamma_T^{j_0}$ are found by an SVD decomposition. In particular, while fixing the cardinalities of all representations, a subset of the columns of $E_{j_0}$ is taken into consideration. This way of updating leads to a substantial speedup in the convergence of the training algorithm compared to the MOD method.

Fig. 6.4 Iris images with low SCI values in the ND dataset. Note that the images in (a), (b) and (c) suffer from high amounts of blur, occlusion and segmentation errors, respectively

Table 6.3 Recognition rate on ND dataset [61]

| Image quality | NN    | Masek’s implementation | SRC   |
|---------------|-------|------------------------|-------|
| Good          | 97.5  | 98.33                  | 99.17 |
| Blurred       | 96.01 | 95.42                  | 96.28 |
| Occluded      | 89.54 | 85.03                  | 90.30 |
| Seg. Error    | 82.09 | 78.57                  | 91.36 |

Both the MOD and the KSVD dictionary learning algorithms are described in detail in Algorithm 2.
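One iteration of the two dictionary update rules can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`omp`, `sparse_code`, `mod_update`, `ksvd_update`), the random data and the sizes are hypothetical, and the MOD step uses a least-squares solve in place of the explicit inverse so that unused atoms do not cause a singular matrix.

```python
import numpy as np

def omp(B, x, T0):
    """Greedy sparse coding with at most T0 nonzeros (Orthogonal Matching Pursuit)."""
    residual, support = x.copy(), []
    gamma = np.zeros(B.shape[1])
    for _ in range(T0):
        k = int(np.argmax(np.abs(B.T @ residual)))
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        residual = x - B[:, support] @ coef
    if support:
        gamma[support] = coef
    return gamma

def sparse_code(B, X, T0):
    """Sparse coding stage: code every column of X independently."""
    return np.column_stack([omp(B, X[:, i], T0) for i in range(X.shape[1])])

def mod_update(X, G):
    """MOD: least-squares dictionary for fixed codes G, followed by atom normalization."""
    B = np.linalg.lstsq(G.T, X.T, rcond=None)[0].T   # X G^T (G G^T)^{-1} when invertible
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0                           # leave unused (zero) atoms untouched
    return B / norms, G * norms[:, None]              # rescale codes so B G is unchanged

def ksvd_update(X, B, G):
    """K-SVD: update each atom and its row of codes by a rank-one SVD fit."""
    B, G = B.copy(), G.copy()
    for k in range(B.shape[1]):
        omega = np.flatnonzero(G[k])                  # examples that use atom k
        if omega.size == 0:
            continue
        # restricted error matrix without the contribution of atom k
        E = X[:, omega] - B @ G[:, omega] + np.outer(B[:, k], G[k, omega])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        B[:, k] = U[:, 0]
        G[k, omega] = s[0] * Vt[0]
    return B, G

rng = np.random.default_rng(0)
N, P, n, T0 = 16, 24, 200, 3
X = rng.standard_normal((N, n))
B = X[:, :P] / np.linalg.norm(X[:, :P], axis=0)      # initialize with normalized examples
G = sparse_code(B, X, T0)
err0 = np.linalg.norm(X - B @ G)
B_mod, G_mod = mod_update(X, G)
err_mod = np.linalg.norm(X - B_mod @ G_mod)
B_ksvd, G_ksvd = ksvd_update(X, B, G)
err_ksvd = np.linalg.norm(X - B_ksvd @ G_ksvd)
print(err0, err_mod, err_ksvd)
```

Both updates keep the sparsity pattern of the codes fixed, so neither can increase the representation error for the given codes. A full training loop would alternate `sparse_code` with one of the two updates until the error stops decreasing.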

6.3.2 Discriminative Dictionary Learning

Dictionaries can be trained for both reconstruction and discrimination applications. In the late nineties, Etemad and Chellappa proposed a Linear Discriminant Analysis (LDA) based basis selection and feature extraction algorithm for classification using wavelet packets [24]. Recently, similar algorithms for simultaneous sparse signal representation and discrimination have also been proposed in [34, 38, 55, 69, 98]. The basic idea in learning a discriminative dictionary is to add an LDA type of discrimination on the sparse coefficients, which essentially enforces separability among dictionary atoms of different classes. Some of the other methods for learning discriminative dictionaries include [37, 41–43, 62, 98, 104]. Additional techniques may be found within these references.

Algorithm 2: The MOD and KSVD Dictionary Learning Algorithms

Objective: Find the best dictionary to represent the samples $X = [x_1, \dots, x_n]$ as sparse compositions, by solving the following optimization problem:

$$\arg\min_{B,\Gamma} \|X - B\Gamma\|_F^2 \quad \text{subject to} \quad \forall i\ \ \|\gamma_i\|_0 \le T_0.$$

Input: Initial dictionary $B^{(0)} \in \mathbb{R}^{N\times P}$ with normalized columns, signal matrix $X = [x_1, \dots, x_n]$ and sparsity level $T_0$.

1. Sparse coding stage: Use any pursuit algorithm to approximate the solution of
   $$\hat{\gamma}_i = \arg\min_{\gamma} \|x_i - B\gamma\|_2^2 \quad \text{subject to} \quad \|\gamma\|_0 \le T_0,$$
   obtaining the sparse representation vector $\hat{\gamma}_i$ for $1 \le i \le n$. These form the matrix $\Gamma$.
2. Dictionary update stage:
   MOD: Update the dictionary by the formula
   $$B = X\Gamma^T\left(\Gamma\Gamma^T\right)^{-1}.$$
   KSVD: For each column $k = 1, \dots, P$ in $B^{(J-1)}$, update it as follows:
   - Define the group of examples that use this atom, $\omega_k = \{i \mid 1 \le i \le n,\ \gamma_T^{k}(i) \neq 0\}$.
   - Compute the overall representation error matrix $E_k$ by
     $$E_k = X - \sum_{j\neq k} b_j\gamma_T^{j}.$$
   - Restrict $E_k$ by choosing only the columns corresponding to $\omega_k$, and obtain $E_k^R$.
   - Apply the SVD decomposition $E_k^R = U\Delta V^T$. Select the updated dictionary column $\hat{b}_k$ to be the first column of $U$. Update the coefficient vector $\gamma_R^{k}$ to be the first column of $V$ multiplied by $\Delta(1,1)$.
3. Set $J = J + 1$.

Output: Trained dictionary $B$ and sparse coefficient matrix $\Gamma$.

6.3.2.1 Information Theoretic Dictionary Learning

A dictionary learning method based on the information maximization principle was proposed in [64] for action recognition. The objective function in [64] maximizes the mutual information between what has been learned and what remains to be learned in terms of appearance information and class distribution for each dictionary item. A Gaussian Process (GP) model is proposed for sparse representation to optimize the dictionary objective function. The sparse coding property allows a kernel with compact support in the GP to realize a very efficient dictionary learning process. Hence, an action video can be described by a set of compact and discriminative action attributes.

Given the initial dictionary $B^o$, the objective is to compress it into a dictionary $B^*$ of size $k$, which encourages the signals from the same class to have very similar sparse representations. Let $L$ denote the labels of $M$ discrete values, $L \in [1, M]$. Given a set of dictionary atoms $B^*$, define $P(L \mid B^*) = \frac{1}{|B^*|}\sum_{b_i \in B^*} P(L \mid b_i)$. For simplicity, denote $P(L \mid b^*)$ as $P(L_{b^*})$, and $P(L \mid B^*)$ as $P(L_{B^*})$. To enhance the discriminative power of the learned dictionary, the following objective function is considered:

$$\arg\max_{B^*}\ I(B^*; B^o \setminus B^*) + \lambda\, I(L_{B^*}; L_{B^o \setminus B^*}), \tag{6.16}$$

where $\lambda \ge 0$ is the parameter to regularize the emphasis on appearance or label information and $I$ denotes mutual information. One can approximate (6.16) as

$$\arg\max_{b^* \in B^o \setminus B^*}\ \big[H(b^* \mid B^*) - H(b^* \mid \bar{B}^*)\big] + \lambda\big[H(L_{b^*} \mid L_{B^*}) - H(L_{b^*} \mid L_{\bar{B}^*})\big], \tag{6.17}$$

where $H$ denotes entropy. One can easily notice that the above formulation also forces the classes associated with $b^*$ to be most different from the classes already covered by the selected atoms $B^*$; at the same time, the classes associated with $b^*$ are most representative among the classes covered by the remaining atoms. Thus the learned dictionary is not only compact, but also covers all classes to maintain discriminability.

In Fig. 6.5, we present the recognition accuracy on the Keck gesture dataset with different dictionary sizes and over different global and local features [64]. A leave-one-person-out setup is used. That is, sequences performed by one person are left out, and the average accuracy is reported. The initial dictionary size $|B^o|$ is chosen to be twice the dimension of the input signal, and sparsity 10 is used in this set of experiments. As can be seen, the mutual information-based method, denoted as MMI-2, outperforms the other methods.

Fig. 6.5 Recognition accuracy on the Keck gesture dataset with different features and dictionary sizes (shape and motion are global features; STIP is a local feature) [64]. The recognition accuracy using the initial dictionary $D^o$: (a) 0.23. (b) 0.42. In all cases, the MMI-2 (red line) outperforms the rest. (Both panels plot recognition accuracy against dictionary size $|D^*|$ for MMI-2, MMI-1, k-means, Liu-Shah and ME.)

Sparse representation over a dictionary with coherent atoms suffers from the multiple representation problem. A compact dictionary consists of incoherent atoms, and encourages similar signals, which are more likely to be from the same class, to be consistently described by a similar set of atoms with similar coefficients [64]. A discriminative dictionary encourages signals from different classes to be described by either a different set of atoms, or the same set of atoms but with different coefficients [34, 42, 69]. Both aspects are critical for classification using sparse representation. The reconstructive requirement on a compact and discriminative dictionary enhances the robustness of the discriminant sparse representation [69]. Hence, learning reconstructive, compact and discriminative dictionaries is important for classification using sparse representation.

Motivated by this observation, Qiu et al. [62] proposed a general information theoretic approach to learning dictionaries that are simultaneously reconstructive, discriminative and compact. Suppose that we are given a set of $n$ signals (images) in an $N$-dimensional feature space, $X = [x_1, \dots, x_n]$, $x_i \in \mathbb{R}^N$. Given that the signals are from $p$ distinct classes and $N_c$ signals are from the $c$th class, $c \in \{1, \dots, p\}$, we denote $X = \{X_c\}_{c=1}^{p}$, where $X_c = [x_1^c, \dots, x_{N_c}^c]$ are the signals in the $c$th class. Define $\Gamma = \{\Gamma_c\}_{c=1}^{p}$, where $\Gamma_c = [\gamma_1^c, \dots, \gamma_{N_c}^c]$ is the sparse representation of $X_c$. Given $X$ and an initial dictionary $B^o$ with $\ell_2$-normalized columns, a compact, reconstructive and discriminative dictionary $B^*$ is learned via maximizing the mutual information between $B^*$ and the unselected atoms $B^o \setminus B^*$ in $B^o$, between the sparse codes $\Gamma_{B^*}$ associated with $B^*$ and the signal class labels $C$, and finally between the signals $X$ and $B^*$, i.e.,

$$\arg\max_{B}\ \lambda_1 I(B; B^o \setminus B) + \lambda_2 I(\Gamma_B; C) + \lambda_3 I(X; B), \tag{6.18}$$

where $\{\lambda_1, \lambda_2, \lambda_3\}$ are the parameters to balance the contributions from the compactness, discriminability and reconstruction terms, respectively.

A two-stage approach is adopted to satisfy (6.18). In the first stage, each term in (6.18) is maximized in a unified greedy manner and involves a closed-form evaluation; thus atoms can be greedily selected from the initial dictionary while satisfying (6.18). In the second stage, the selected dictionary atoms are updated using a simple gradient ascent method to further maximize $\lambda_2 I(\Gamma_B; C) + \lambda_3 I(X; B)$.
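The step from (6.16) to (6.17) can be made explicit for the appearance term. The following is only a sketch, assuming the greedy setting in which a single atom $b^*$ is added to $B^*$ at a time, and writing $F(B^*) = I(B^*; B^o \setminus B^*)$ and $\bar{B}^* = B^o \setminus (B^* \cup b^*)$ for the atoms that remain after the addition. Both the symbol $F$ and this reading of $\bar{B}^*$ are assumptions consistent with the description above, not statements taken from [64].

```latex
\begin{align*}
F(B^*) &= H(B^*) - H(B^* \mid \bar{B}^* \cup b^*), \\
F(B^* \cup b^*) &= H(B^* \cup b^*) - H(B^* \cup b^* \mid \bar{B}^*), \\
F(B^* \cup b^*) - F(B^*)
  &= \big[H(B^* \cup b^*) - H(B^*)\big]
   - \big[H(B^* \cup b^* \mid \bar{B}^*) - H(B^* \mid \bar{B}^* \cup b^*)\big] \\
  &= H(b^* \mid B^*) - \big[H(\bar{B}^* \cup b^*) - H(\bar{B}^*)\big] \\
  &= H(b^* \mid B^*) - H(b^* \mid \bar{B}^*).
\end{align*}
```

The third line uses the chain rule $H(A \mid C) = H(A \cup C) - H(C)$ on both conditional entropies, so the joint entropy of all atoms cancels. The label term follows in the same way with $b^*$ and $B^*$ replaced by $L_{b^*}$ and $L_{B^*}$. Each greedy step therefore favors an atom that is hard to predict from the atoms already selected and easy to predict from the atoms left behind.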

To illustrate how the discriminability of dictionary atoms selected by the information theoretic dictionary selection (ITDS) method can be further enhanced using the information theoretic dictionary update (ITDU) method, consider Fig. 6.6. The Extended YaleB face dataset [26] and the USPS handwritten digits dataset [1] are used for illustration. Sparsity 2 is adopted for visualization, as the non-zero sparse coefficients of each image can then be plotted as a 2-D point. In Fig. 6.6, with a common set of atoms shared over all classes, the sparse coefficients of all samples become points in the same 2-D coordinate space. Different classes are represented by different colors. The original images are also shown and placed at the coordinates defined by their non-zero sparse coefficients. The atoms to be updated in Figs. 6.6a and 6.6d are selected using ITDS. It can be seen from Fig. 6.6 that the ITDU method makes the sparse coefficients of different classes more discriminative, leading to significantly improved classification accuracy [62].

Fig. 6.6 Information-theoretic dictionary update with global atoms shared over classes. For a better visual representation, sparsity 2 is chosen and a randomly selected subset of all samples is shown. The recognition rates associated with (a) before update, (b) after 100 updates, and (c) convergence after 489 updates are 30.63%, 42.34% and 51.35%. The recognition rates associated with (d) before update, (e) after 50 updates, and (f) convergence after 171 updates are 73.54%, 84.45% and 87.75%. Note that the ITDU effectively enhances the discriminability of the set of common atoms [62]

6.3.2.2 Discriminative KSVD

A discriminative version of the KSVD algorithm was proposed in [104], in which a linear classifier is incorporated into the optimization formulation. Let $X \in \mathbb{R}^{M\times N}$ be training samples from $C$ classes. Then, the following optimization problem is proposed in [104]:

$$\langle B, \Gamma, W\rangle = \arg\min_{B,\Gamma,W} \|X - B\Gamma\|_2^2 + \lambda_1\|L - W\Gamma\|_2^2 \quad \text{subject to} \quad \|\gamma_i\|_0 < T, \tag{6.19}$$

where $\gamma_i$ is the $i$th column of $\Gamma$, $W \in \mathbb{R}^{C\times K}$ is the learned linear transform for the linear classifier $L = W\Gamma$, $L \in \mathbb{R}^{C\times N}$ is the label matrix, $\lambda_1$ is a regularization parameter and $T$ is the sparsity level. The optimization problem (6.19) can be rewritten as follows:

$$\langle B, \Gamma, W\rangle = \arg\min_{B,\Gamma,W} \left\| \begin{pmatrix} X \\ \sqrt{\lambda_1}\,L \end{pmatrix} - \begin{pmatrix} B \\ \sqrt{\lambda_1}\,W \end{pmatrix}\Gamma \right\|_2^2 \quad \text{subject to} \quad \|\gamma_i\|_0 < T. \tag{6.20}$$

As a result, one can update $B$, $W$ and $\Gamma$ simultaneously using the KSVD algorithm [2]. During the testing stage, the sparse coefficients $\gamma_t$ of a new test sample $x_t$ based on the normalized learned dictionary $B_n$,

$$B_n = \{b_{n1}, b_{n2}, \dots, b_{nK}\} = \left\{\frac{b_1}{\|b_1\|_2}, \frac{b_2}{\|b_2\|_2}, \dots, \frac{b_K}{\|b_K\|_2}\right\}, \tag{6.21}$$

are found, and then the class label of $x_t$ is determined based on the index of the largest value in the vector $W_n\gamma_t$, where

$$W_n = \{w_{n1}, w_{n2}, \dots, w_{nK}\} = \left\{\frac{w_1}{\|b_1\|_2}, \frac{w_2}{\|b_2\|_2}, \dots, \frac{w_K}{\|b_K\|_2}\right\}. \tag{6.22}$$
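The equivalence between (6.19) and (6.20), and the normalization in (6.21) and (6.22), can be checked numerically. The sketch below is illustrative only: all matrices are random stand-ins rather than learned quantities, the matrix norm is taken to be the Frobenius norm, and `classify` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, C, K, lam1 = 6, 40, 3, 10, 0.5      # feature dim, samples, classes, atoms, weight

X = rng.standard_normal((M, N))            # training samples as columns
labels = rng.integers(0, C, size=N)
L = np.eye(C)[:, labels]                   # C x N one-hot label matrix
B = rng.standard_normal((M, K))            # stand-in dictionary
W = rng.standard_normal((C, K))            # stand-in linear classifier
Gamma = rng.standard_normal((K, N))        # stand-in codes (the identity holds for any codes)

# Objective of (6.19): reconstruction term plus weighted classification term.
joint = np.linalg.norm(X - B @ Gamma) ** 2 + lam1 * np.linalg.norm(L - W @ Gamma) ** 2

# Objective of (6.20): one reconstruction problem over stacked matrices.
X_aug = np.vstack([X, np.sqrt(lam1) * L])
B_aug = np.vstack([B, np.sqrt(lam1) * W])
stacked = np.linalg.norm(X_aug - B_aug @ Gamma) ** 2

# Normalization of (6.21)-(6.22): both B and W are scaled by the atom norms of B.
norms = np.linalg.norm(B, axis=0)
B_n, W_n = B / norms, W / norms

def classify(gamma_t):
    """Class index of the largest entry of W_n @ gamma_t."""
    return int(np.argmax(W_n @ gamma_t))

gamma = Gamma[:, 0]
rescaled = norms * gamma                   # codes of the same signal w.r.t. B_n
```

Because both matrices are divided by the same atom norms, a code that is rescaled to compensate for the normalized dictionary yields the same classifier scores as before normalization.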

6.3.2.3 Fisher Discrimination Dictionary Learning

In [98], Yang et al. proposed a discriminative DL method based on the Fisher discrimination criterion [3]. In particular, a set of class-specific dictionaries $B = [B_1, \dots, B_C]$ to sparsely represent each input sample is sought, where the corresponding sparse coefficients $\Gamma = [\Gamma_1, \dots, \Gamma_C]$ are enforced to have small within-class scatter but large between-class scatter. The following optimization problem is proposed in [98]:

$$\langle B, \Gamma\rangle = \arg\min_{B,\Gamma} \sum_{c=1}^{C} r(X_c, B, \Gamma_c) + \lambda_1\|\Gamma\|_1 + \lambda_2 f_d(\Gamma), \tag{6.23}$$

where the function $r(\cdot)$ represents the part that learns the class-specific dictionaries and $f_d(\cdot)$ enforces the Fisher discrimination criterion on the sparse coefficient vectors. Here, $\lambda_1$ and $\lambda_2$ are regularization parameters. Specifically, the first part is expressed as

$$r(X_c, B, \Gamma_c) = \|X_c - B\Gamma_c\|_2^2 + \|X_c - B_c\Gamma_c^{c}\|_2^2 + \sum_{i=1,\, i\neq c}^{C} \|B_i\Gamma_c^{i}\|_2^2, \tag{6.24}$$

where the first term enforces the whole dictionary $B$ to sparsely represent the class-specific samples $X_c$. The second term imposes a better representation of $X_c$ by its corresponding dictionary $B_c$. Finally, the last term ensures that the class-specific samples $X_c$ cannot be well represented by the other sub-class dictionaries. Here, $\Gamma_c^{i}$ denotes the representation coefficients of $X_c$ over $B_i$.

The second term, $f_d(\cdot)$, is the discriminative part that includes the Fisher criterion [3] in the optimization. The goal of this Fisher criterion is to minimize the within-class scatter $S_w$ and maximize the between-class scatter $S_b$, where $S_w$ and $S_b$ are defined as follows:

$$S_w(\Gamma) = \sum_{c=1}^{C} \sum_{\gamma_i \in \Gamma_c} (\gamma_i - \mu_c)(\gamma_i - \mu_c)^T, \tag{6.25}$$

$$S_b(\Gamma) = \sum_{c=1}^{C} n_c(\mu_c - \mu)(\mu_c - \mu)^T, \tag{6.26}$$

where $\mu$ and $\mu_c$ are the mean vectors of $\Gamma$ and $\Gamma_c$, respectively, and $n_c$ is the number of samples in class $c$. Using these definitions, $f_d$ can be rewritten as follows:

$$f_d(\Gamma) = \mathrm{tr}\big(S_w(\Gamma) - S_b(\Gamma)\big) + \eta\|\Gamma\|_2^2, \tag{6.27}$$

where the last part, $\|\Gamma\|_2^2$, is the regularization term and $\eta$ is the regularization parameter. The optimization details for solving (6.23) can be found in [98].
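To make the discriminative term concrete, the following sketch evaluates the scatter matrices (6.25) and (6.26) and the value of $f_d$ in (6.27) for a small set of synthetic codes. It is an illustration under assumptions: `fisher_term` is a hypothetical helper, the codes are synthetic clusters rather than learned sparse coefficients, and the regularizer is taken as the sum of squared entries of $\Gamma$.

```python
import numpy as np

def fisher_term(Gamma, labels, eta=1.0):
    """Return (f_d, S_w, S_b) for codes Gamma (one column per sample)."""
    K = Gamma.shape[0]
    mu = Gamma.mean(axis=1, keepdims=True)            # global mean code
    S_w, S_b = np.zeros((K, K)), np.zeros((K, K))
    for c in np.unique(labels):
        G_c = Gamma[:, labels == c]
        mu_c = G_c.mean(axis=1, keepdims=True)        # class mean code
        D = G_c - mu_c
        S_w += D @ D.T                                 # within-class scatter (6.25)
        S_b += G_c.shape[1] * (mu_c - mu) @ (mu_c - mu).T   # between-class scatter (6.26)
    f_d = np.trace(S_w - S_b) + eta * np.sum(Gamma ** 2)    # value of (6.27)
    return f_d, S_w, S_b

# Two tight clusters of codes around different centers.
rng = np.random.default_rng(0)
center0, center1 = np.array([[2.0], [0.0], [0.0]]), np.array([[0.0], [2.0], [0.0]])
Gamma = np.hstack([center0 + 0.1 * rng.standard_normal((3, 20)),
                   center1 + 0.1 * rng.standard_normal((3, 20))])
labels_true = np.repeat([0, 1], 20)                   # labels matching the clusters
labels_mixed = np.arange(40) % 2                      # labels that mix both clusters

f_true, S_w, S_b = fisher_term(Gamma, labels_true)
f_mixed, _, _ = fisher_term(Gamma, labels_mixed)

# Total scatter around the global mean, for comparison with S_w + S_b.
centered = Gamma - Gamma.mean(axis=1, keepdims=True)
S_t = centered @ centered.T
```

For a fixed set of codes, the within-class and between-class scatters sum to the total scatter, so $f_d$ is smaller for the labeling that matches the cluster structure. This is the behavior that (6.23) rewards.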

6.3.2.4 Label Consistent KSVD

Another approach for learning discriminative dictionaries for object recognition was proposed in [37]. The following optimization problem is proposed in [37]:

$$\langle B, \Gamma, S\rangle = \arg\min_{B,\Gamma,S} \|X - B\Gamma\|_2^2 + \lambda_1\|Q - S\Gamma\|_2^2 \quad \text{subject to} \quad \|\gamma_i\|_0 < T, \tag{6.28}$$

where $T$ controls the sparsity of the coefficient vectors, $\lambda_1$ is a regularization parameter, $S \in \mathbb{R}^{K\times K}$ is a linear transformation matrix and $Q \in \mathbb{R}^{K\times N}$ is a pre-defined discriminative sparse code matrix, whose $(i, j)$ entry is 1 if the $i$th dictionary atom and the $j$th training sample are from the same class, and 0 otherwise. The last term, $\|Q - S\Gamma\|_2^2$, imposes discrimination on the sparse coefficients. To directly involve the classification error in the optimization framework, (6.28) is modified as follows:

$$\langle B, \Gamma, S, W\rangle = \arg\min_{B,\Gamma,S,W} \|X - B\Gamma\|_2^2 + \lambda_1\|Q - S\Gamma\|_2^2 + \lambda_2\|L - W\Gamma\|_2^2 \quad \text{subject to} \quad \|\gamma_i\|_0 < T, \tag{6.29}$$

where $\lambda_2$ is a regularization parameter and $\|L - W\Gamma\|_2^2$ is the term for classification error, similar to the one proposed in [104], in which $L$ is the label matrix and $W$ is the learned transformation matrix. With this, (6.29) can be reformulated as

$$\langle B, \Gamma, S, W\rangle = \arg\min_{B,\Gamma,S,W} \left\| \begin{pmatrix} X \\ \sqrt{\lambda_1}\,Q \\ \sqrt{\lambda_2}\,L \end{pmatrix} - \begin{pmatrix} B \\ \sqrt{\lambda_1}\,S \\ \sqrt{\lambda_2}\,W \end{pmatrix}\Gamma \right\|_2^2 \quad \text{subject to} \quad \|\gamma_i\|_0 < T, \tag{6.30}$$

which can be solved using any one of the standard dictionary learning algorithms such as KSVD [2]. The optimization details for solving (6.30) can be found in [37].
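The discriminative sparse code matrix $Q$ is fully determined by the class of each atom and of each training sample, so it can be built directly from two label vectors. The sketch below assumes that every atom has been assigned to a class beforehand; the function name and the toy labels are hypothetical.

```python
import numpy as np

def label_consistent_codes(atom_labels, sample_labels):
    """Q[i, j] = 1 if atom i and training sample j share a class, else 0 (K x N)."""
    atoms = np.asarray(atom_labels)[:, None]
    samples = np.asarray(sample_labels)[None, :]
    return (atoms == samples).astype(float)

atom_labels = [0, 0, 1, 1, 2, 2]        # K = 6 atoms, two per class
sample_labels = [0, 1, 1, 2]            # N = 4 training samples
Q = label_consistent_codes(atom_labels, sample_labels)

# Stacked data matrix of (6.30), using random stand-ins for X and one-hot labels L.
rng = np.random.default_rng(0)
lam1, lam2 = 0.5, 0.25
X = rng.standard_normal((5, 4))
L = np.eye(3)[:, sample_labels]
X_aug = np.vstack([X, np.sqrt(lam1) * Q, np.sqrt(lam2) * L])
```

Each column of `Q` marks the atoms that a sample of that class is encouraged to use. The stacked matrix has one block of rows per term of (6.29).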

6.3.3 Non-Linear Kernel Dictionary Learning

Linear representations are almost always inadequate for representing nonlinear data arising in many practical applications. For example, many types of descriptors in computer vision have intrinsic nonlinear similarity measure functions. The most popular ones include the spatial pyramid descriptor [39], which uses a pyramid match kernel, and the region covariance descriptor [93], which uses a Riemannian metric as the similarity measure between two descriptors. Both of these distance measures are highly non-linear. Unfortunately, traditional sparse coding methods are based on linear models. This inevitably leads to poor performance on many datasets, e.g., object classification on the Caltech-101 dataset [36], even when discriminative power is taken into account during training. This has motivated researchers to study non-linear kernel sparse representation and dictionary learning for object recognition [18, 25, 49, 75, 80, 81, 83, 99, 103].

To make the data in an input space separable, the data is implicitly mapped into a high-dimensional kernel feature space by using some nonlinear mapping associated with a kernel function. The kernel function, $\kappa : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$, is defined as the inner product

$$\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j)\rangle, \tag{6.31}$$

where $\Phi : \mathbb{R}^N \to \mathcal{F} \subset \mathbb{R}^{\tilde{N}}$ is an implicit mapping projecting the vector $x$ into a higher dimensional space $\mathcal{F}$. Some commonly used kernels include polynomial kernels $\kappa(x, y) = (\langle x, y\rangle + c)^d$ and Gaussian kernels

$$\kappa(x, y) = \exp\left(-\frac{\|x - y\|^2}{c}\right),$$

where $c$ and $d$ are the parameters. One can learn a non-linear dictionary $B$ in the feature space $\mathcal{F}$ by solving the following optimization problem:

$$\arg\min_{B,\Gamma} \|\Phi(X) - B\Gamma\|_F^2 \quad \text{s.t.} \quad \|\gamma_i\|_0 \le T_0,\ \forall i, \tag{6.32}$$

where $B \in \mathbb{R}^{\tilde{N}\times K}$ is the sought dictionary, $\Gamma \in \mathbb{R}^{K\times n}$ is a matrix whose $i$th column is the sparse vector $\gamma_i$ corresponding to the sample $x_i$, with a maximum of $T_0$ non-zero entries, and, with an abuse of notation, we denote $\Phi(X) = [\Phi(x_1), \dots, \Phi(x_n)]$. It was shown in [49] that there exists an optimal solution $B^*$ to the problem (6.32) that has the following form:

$$B^* = \Phi(X)A \tag{6.33}$$

for some $A \in \mathbb{R}^{n\times K}$. Moreover, this solution has the smallest Frobenius norm among all optimal solutions. As a result, one can seek an optimal


dictionary through optimizing $A$ instead of $B$. By substituting (6.33) into (6.32), the problem can be re-written as follows:

$$\arg\min_{A,\Gamma} \|\Phi(X) - \Phi(X)A\Gamma\|_F^2 \quad \text{s.t.} \quad \|\gamma_i\|_0 \le T_0,\ \forall i. \tag{6.34}$$

In order to see the advantage of this formulation over the original one, we will examine the objective function. Through some manipulation, the cost function can be re-written as

$$\|\Phi(X) - \Phi(X)A\Gamma\|_F^2 = \mathrm{tr}\big((I - A\Gamma)^T K(X, X)(I - A\Gamma)\big), \tag{6.35}$$

where $K(X, X)$ is a kernel matrix whose elements are computed from $\kappa(i, j) = \Phi(x_i)^T\Phi(x_j)$. It is apparent that the objective function is computationally tractable, since it only involves a matrix of finite dimension, $K \in \mathbb{R}^{n\times n}$, instead of a possibly infinite-dimensional dictionary. An important property of this formulation is that the computation of $K$ only requires dot products. Therefore, we are able to employ Mercer kernel functions to compute these dot products without carrying out the mapping $\Phi$.

To solve the above optimization problem for learning non-linear dictionaries, variants of the MOD and K-SVD algorithms in the feature space have been proposed [47, 49]. The procedure essentially involves two stages: sparse coding and dictionary update in the feature space. For sparse coding, one can adapt the non-linear version of the orthogonal matching pursuit algorithm [49]. Once the sparse codes are found in the feature space, the dictionary atoms are updated in an efficient way [47, 49].

The optimization problem (6.34) is purely generative. It does not explicitly promote discrimination, which is important for many classification tasks. Even when the data is transformed into a high-dimensional feature space using the kernel trick, the data from different classes may still overlap. Hence, generative dictionaries may lead to poor classification performance even when the data is non-linearly mapped to a feature space. To overcome this, a method for designing non-linear dictionaries that are simultaneously generative and discriminative was proposed in [79].
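The identity (6.35) can be verified numerically for a kernel whose feature map is known in closed form. The sketch below uses the homogeneous polynomial kernel $\kappa(x, y) = \langle x, y\rangle^2$, for which $\Phi(x)$ can be taken as the vectorized outer product of $x$ with itself. The function names and the random $A$ and $\Gamma$ are illustrative assumptions, not part of the algorithms in [47, 49].

```python
import numpy as np

def poly_kernel(X, Y, c=0.0, d=2):
    """Polynomial kernel matrix with entries (<x_i, y_j> + c)^d; samples are columns."""
    return (X.T @ Y + c) ** d

def gaussian_kernel(X, Y, c=1.0):
    """Gaussian kernel matrix with entries exp(-||x_i - y_j||^2 / c)."""
    sq = (X ** 2).sum(axis=0)[:, None] + (Y ** 2).sum(axis=0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-sq / c)

def kernel_objective(K_mat, A, Gamma):
    """Right-hand side of (6.35): tr((I - A Gamma)^T K (I - A Gamma))."""
    R = np.eye(K_mat.shape[0]) - A @ Gamma
    return np.trace(R.T @ K_mat @ R)

rng = np.random.default_rng(0)
N, n, K = 3, 15, 5
X = rng.standard_normal((N, n))
A = rng.standard_normal((n, K))            # stand-in coefficient matrix of (6.33)
Gamma = rng.standard_normal((K, n))        # stand-in codes

# Explicit feature map for <x, y>^2: Phi(x) = vec(x x^T), of dimension N^2.
Phi = np.stack([np.outer(x, x).ravel() for x in X.T], axis=1)
explicit = np.linalg.norm(Phi - Phi @ A @ Gamma) ** 2        # left-hand side of (6.35)
implicit = kernel_objective(poly_kernel(X, X), A, Gamma)     # uses only the n x n kernel

# With a Gaussian kernel the feature map is never formed, yet the objective is computable.
K_gauss = gaussian_kernel(X, X, c=2.0)
gauss_value = kernel_objective(K_gauss, A, Gamma)
```

The explicit and implicit evaluations agree up to rounding, while the second needs only the $n \times n$ kernel matrix. For the Gaussian kernel the first route is not available at all, which is the practical motivation for optimizing over $A$.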

Figure 6.7 presents an important comparison in terms of the discriminative power of learning a discriminative dictionary in the feature space, where a kernel LDA type of discriminative term has been included in the objective function. A scatter plot of the sparse coefficients obtained using different approaches shows that such a discriminative dictionary is able to learn the underlying non-linear sparsity of the data and also provides a more discriminative representation. See [47, 49, 79] for more details on the design of non-linear kernel dictionaries. One disadvantage of some of the previous kernel DL algorithms, such as [49] and [79], is that they require storing and handling a very large kernel matrix. This in turn leads to very high computational cost and restricts these methods to a small number of training samples. In order to deal with this issue, [27] proposed an efficient approach based on the Nyström method to "linearize" the kernel dictionary learning methods. It was shown in [27] that this method can significantly reduce the complexity of many kernel DL algorithms without sacrificing recognition accuracy.

Fig. 6.7 A synthetic example showing the significance of learning a discriminative dictionary in feature space for classification. (a) Synthetic data which consists of linearly non-separable 3D points on a sphere. Different classes are represented by different colors. (b) Sparse coefficients from K-SVD projected onto learned SVM hyperplanes. (c) Sparse coefficients from a non-linear dictionary projected onto learned SVM hyperplanes. (d) Sparse coefficients from a non-linear discriminative kernel dictionary projected onto learned SVM hyperplanes [79]

6.3.4 Joint Dimensionality Reduction and Dictionary Learning

Signals are usually assumed to lie on a low-dimensional manifold embedded in a high-dimensional space. Dealing with the high dimension is not practical for either learning or inference tasks. As an example of the effect of dimension on learning, Stone [90] showed that, under certain regularity assumptions, including that the samples are independent and identically distributed, the optimal rate of convergence for nonparametric regression decreases exponentially with the dimension of the data. As the dimension increases, the Euclidean distances between feature vectors become closer to each other, making the inference task harder. This is known as the concentration phenomenon [5]. To address these issues, various linear and non-linear dimensionality reduction (DR) techniques have been developed [40]. Some examples include PCA [59], ISOMAP [91], LLE [70], Laplacian Eigenmaps [4], etc. In general, these techniques map data to a lower-dimensional space such that non-informative or irrelevant information in the data is discarded.

As we saw in the previous sections, dictionary learning methods have been popular for representing and processing signals. However, the current algorithms for finding a good dictionary have some drawbacks. The learning of $B$ is challenging due to the high-dimensional nature of the training data, as well as the lack of training samples. Therefore, DR seems to be a natural solution. Unfortunately, the current DR techniques are not designed to respect and promote the underlying sparse structures of data. Therefore, they cannot help the process of learning the dictionary $B$. An interesting framework, called sparse embedding (SE), that brings the strengths of both dimensionality reduction and sparse learning together was recently proposed in [48]. In this framework, the dimension of signals is reduced in a way such that the sparse structures of signals are promoted. The algorithm simultaneously learns a dictionary in the reduced space, yet allows the recovery of the dictionary in the original domain. This empowers the algorithm with two important advantages: (1) the ability to remove the distracting part of the signal that negatively interferes with the learning process, and (2) learning in the reduced space with smaller computational complexity. In addition, the framework is able to


handle sparsity in a non-linear model through the use of Mercer kernels. Different from the classical approaches, the algorithm embeds input signals into a low-dimensional space and simultaneously learns an optimized dictionary. Let $\mathcal{M}$ denote the mapping that transforms input signals into the output space. In general, $\mathcal{M}$ can be non-linear. However, for simplicity of notation, we temporarily restrict our discussion to linear transformations. As a result, the mapping $\mathcal{M}$ is characterized using a matrix $P \in \mathbb{R}^{d\times N}$. One can learn the mapping together with the dictionary through minimizing some appropriate cost function $\mathcal{C}_X$:

$$\{P^*, B^*, \Gamma^*\} = \arg\min_{P,B,\Gamma} \mathcal{C}_X(P, B, \Gamma). \tag{6.36}$$

This cost function $\mathcal{C}_X$ needs to have several desirable properties. First, it has to promote sparsity within the reduced space. At the same time, the transformation $P$ resulting from optimizing $\mathcal{C}_X$ has to retain useful information present in the original signals. A second criterion is also needed in order to prevent the pathological case in which everything is mapped to the origin, which obtains the sparsest solution but is obviously of no interest. Towards this end, the following optimization was proposed in [48]:

$$\{P^*, B^*, \Gamma^*\} = \arg\min_{P,B,\Gamma} \Big(\|PX - B\Gamma\|_F^2 + \lambda\|X - P^TPX\|_F^2\Big) \quad \text{subject to} \quad PP^T = I \ \text{and}\ \|\gamma_i\|_0 \le T_0,\ \forall i, \tag{6.37}$$

where $I \in \mathbb{R}^{d\times d}$ is the identity matrix, $\lambda$ is a positive constant, and the dictionary is now in the reduced space, i.e., $B = [b_1, b_2, \dots, b_K] \in \mathbb{R}^{d\times K}$. The first term of the cost function promotes sparsity of signals in the reduced space. The second term is the amount of energy discarded by the transformation $P$, or the difference between the low-dimensional approximations and the original signals. In fact, the second term is closely related to PCA: by removing the first term, it can be shown that the rows of the optimal $P$ coincide with the principal components associated with the largest eigenvalues, when the data are centered.
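The PCA connection of the second term in (6.37) can be illustrated numerically. The sketch below assumes centered synthetic data and compares the projection built from the leading eigenvectors of $XX^T$ with an arbitrary orthonormal projection. The helper names `discarded_energy` and `se_cost` are hypothetical, and the snippet evaluates the cost only; it does not implement the two-step solver of [48].

```python
import numpy as np

def discarded_energy(P, X):
    """Second term of (6.37) without lambda: ||X - P^T P X||_F^2."""
    return np.linalg.norm(X - P.T @ P @ X) ** 2

def se_cost(P, B, Gamma, X, lam):
    """Full cost of (6.37) for given P (d x N), B (d x K) and Gamma (K x n)."""
    return np.linalg.norm(P @ X - B @ Gamma) ** 2 + lam * discarded_energy(P, X)

rng = np.random.default_rng(0)
N, n, d, K, lam = 6, 300, 2, 4, 0.7
X = rng.standard_normal((N, N)) @ rng.standard_normal((N, n))   # correlated features
X -= X.mean(axis=1, keepdims=True)                               # center the data

# Rows of P_pca: eigenvectors of X X^T for the d largest eigenvalues.
evals, evecs = np.linalg.eigh(X @ X.T)          # eigenvalues in ascending order
P_pca = evecs[:, ::-1][:, :d].T                 # d x N with orthonormal rows

# An arbitrary competitor that also satisfies P P^T = I.
Q_mat, _ = np.linalg.qr(rng.standard_normal((N, d)))
P_rand = Q_mat.T

B = rng.standard_normal((d, K))                 # stand-in dictionary in the reduced space
Gamma = np.zeros((K, n))                        # trivial codes, only to evaluate the cost
cost = se_cost(P_pca, B, Gamma, X, lam)
```

The energy discarded by the PCA projection equals the sum of the $N - d$ smallest eigenvalues of $XX^T$, and no other orthonormal projection of the same size discards less. The first term of (6.37) is what moves the learned $P$ away from plain PCA.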

A simple two-step procedure for finding the projection mapping and learning linear as well as non-linear dictionaries was proposed in [48]. It was shown that sparse embedding can capture the meaningful structure of data and can perform significantly better than many competitive algorithms on signal recovery and object classification tasks.

6.3.5 Unsupervised Dictionary Learning

Dictionary learning techniques for unsupervised clustering have also gained some traction in recent years. In [89], a method for simultaneously learning a set of dictionaries that optimally represent each cluster is proposed. To improve the accuracy of sparse coding, this approach was later extended by adding a block incoherence term to the optimization problem [65]. Additional sparsity-motivated subspace clustering methods include [22, 66, 88]. In particular, a scale and in-plane rotation invariant clustering approach, which extends the dictionary learning and sparse representation framework to the clustering and retrieval of images, was proposed in [16, 17]. Figure 6.8 presents an overview of this approach [16]. Given a database of images $\{x_j\}_{j=1}^{N}$ and the number of clusters $K$, the Radon transform [33] is used to find scale and rotation invariant features. The method then uses sparse representation methods to simultaneously cluster the data and learn dictionaries for each cluster. One of the main features of this method is that it is effective for both texture and shape-based images. Various experiments in [16, 17] demonstrated the effectiveness of this approach in image retrieval, where significant improvements in performance are achieved.

Fig. 6.8 Overview of simultaneous scale and in-plane rotation invariant clustering and dictionary learning method [16]

6.3.6 Dictionary Learning from Partially Labeled Data

The performance of a supervised classification algorithm is often dependent on the quality and diversity of training images, which are mainly hand-labeled. However, labeling images is expensive and time consuming due to the significant human effort involved. On the other hand, one can easily obtain large amounts of unlabeled images from public image datasets like Flickr or by querying image search engines like Bing. This has motivated researchers to develop semi-supervised algorithms, which utilize both labeled and unlabeled data for learning classifier models. Such methods have demonstrated improved performance when the amount of labeled data is limited. See [11] for an excellent survey of recent efforts on semi-supervised learning.

Two of the most popular methods for semi-supervised learning are Co-Training [6] and Semi-Supervised Support Vector Machines (S3VM) [86]. Co-Training assumes the presence of multiple views for each feature and uses the confident samples in one view to update the other. However, in applications such as image classification, one often has just a single feature vector and hence it is difficult to apply Co-Training. Semi-supervised support vector machines consider the labels of the unlabeled data as additional unknowns and jointly optimize over the classifier parameters and the unknown labels in the SVM framework [10]. An interesting method to learn discriminative dictionaries for classification in a semi-supervised manner was proposed in [81, 84].


Fig. 6.9 Block diagram illustrating semi-supervised dictionary learning [84]

Figure 6.9 shows the block diagram of this method [84], which uses both labeled and unlabeled data. While learning a dictionary, a probability distribution over class labels is maintained for each unlabeled sample. The discriminative part of the cost is made proportional to the confidence in the assigned label of the participating training sample. This makes the method robust to label assignment errors. This method was later made non-linear using kernel methods in [81]. See [84] and [81] for more details on the optimization of the partially labeled dictionary learning.

Fig. 6.10 Each face is associated with three names out of which only one is the true name [13]

6.3.7 Dictionary Learning from Ambiguously Labeled Data

In many practical image and video applications, one has access only to partially labeled data. For example, given a picture with multiple faces and a caption stating the names of the people, the challenge is learning the mapping between the faces and the names (see Fig. 6.10). The problem of learning accurate object identities where each example is associated with multiple labels, with only one being correct, is known as partially or ambiguously labeled learning. Dictionary learning algorithms for processing such ambiguously labeled data have been proposed in [13, 15]. Figure 6.11a shows the block diagram of this dictionary learning method proposed in [15].

Given ambiguously labeled training samples (e.g., faces), the algorithm consists of two main steps: confidence update and dictionary update. The confidence for the labels of each sample is defined as the probability distribution over its ambiguous labels. In the confidence update phase, the confidence is updated for each sample based on its residual error when the sample is projected onto the class-specific dictionaries. Then, with the confidence values fixed, the dictionaries are updated using the KSVD algorithm [2]. Two approaches for updating the dictionary are proposed: dictionary learning with hard decision (DLHD), and dictionary learning with soft decision (DLSD). In DLHD, the class dictionaries are learned using the KSVD algorithm directly from clusters that are determined by a hard decision on the ambiguous labels. In DLSD, the class dictionaries are learned using a weighted KSVD algorithm, where the weighting parameters are computed by a soft decision rule based on the current confidence values. In the testing stage, a novel test image is projected onto the span of the atoms in each learned dictionary. The resulting residual vectors are then used for classification. This algorithm was made non-linear using kernel methods in [13].

Fig. 6.11 Overview of the ambiguously labeled dictionary learning framework [15]. (a) Block diagram. (b) An illustration of how multiple labeled samples are collected to learn intermediate dictionaries, which are then used to update the confidence for each sample $x_i$
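The testing stage described above, projecting a test sample onto the span of each class dictionary and comparing the residuals, can be sketched as follows. This is a minimal illustration that assumes a plain least-squares projection and random stand-in dictionaries. The function names are hypothetical, and the confidence update rule of [15] is not reproduced here.

```python
import numpy as np

def residual_norm(B_c, x):
    """Norm of the part of x that lies outside the span of the columns of B_c."""
    coef, *_ = np.linalg.lstsq(B_c, x, rcond=None)
    return np.linalg.norm(x - B_c @ coef)

def classify_by_residual(dictionaries, x):
    """Index of the class dictionary with the smallest projection residual."""
    return int(np.argmin([residual_norm(B_c, x) for B_c in dictionaries]))

rng = np.random.default_rng(0)
N, K = 20, 4
dictionaries = [rng.standard_normal((N, K)) for _ in range(3)]   # one per class

# A test sample generated from the atoms of class 1, plus a little noise.
x_test = dictionaries[1] @ rng.standard_normal(K) + 0.01 * rng.standard_normal(N)
predicted = classify_by_residual(dictionaries, x_test)
residuals = [residual_norm(B_c, x_test) for B_c in dictionaries]
```

The sample is assigned to the class whose atoms explain it best. The same residuals are the natural quantity to feed into a confidence update during training.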

6.3.8 Multiple Instance Dictionary Learning

A multi-class, multiple instance learning (MIL) algorithm using the dictionary learning framework, where the data is given in the form of bags, was recently proposed in [82, 85]. In MIL, each bag contains multiple samples, called instances, out of which at least one belongs to the class of the bag. A bag is positive if at least one of its instances is a positive example; otherwise the bag is negative. A noisy-OR model-based method for learning MIL dictionaries was proposed in [82, 85].

Figure 6.12a provides the motivation behind the proposed method. In this figure, instances from 1 negative bag and 3 positive bags are shown. They can be imagined as intersecting at different locations. From the problem definition, the negative bag contains only negative class samples; hence the region around the negative instances is very likely to be a negative concept, even if it intersects with positive bags. However, the intersection of positive bags is likely to belong to the positive concept. Traditional diverse density-based approaches can find only one positive concept that is close to the intersection of positive bags and away from the negative bags. Since one point in the feature space cannot describe the positive class distribution, these approaches tend to compute different positive concepts with multiple initializations. It was shown in [82, 85] that the multiple concepts are naturally captured by dictionary atoms and can lead to better performance.

Figure 6.12b shows an overview of this MIL dictionary learning method. An interesting property of this method is that in the case where there is only one instance in each bag, the resulting method reduces to the traditional dictionary learning problem. A kernelized version of the MIL DL algorithm has also been proposed in [82]. Various experiments using popular vision-related MIL datasets as well as the UNBC-McMaster Shoulder Pain Expression Archive database showed that this MIL-based DL method can perform significantly better than many existing MIL methods.

Fig. 6.12 (a) Motivation behind the MIL-based dictionary learning framework proposed in [82], showing instances from a negative bag and instances from 3 positive bags. (b) An overview of the MIL DL framework proposed in [82]: positive and negative input bags enter the MIL dictionary learning stage, which iterates between updating the sparse coefficients and updating the dictionary, and outputs a dictionary of positive and negative atoms

6.3.9

Domain Adaptive Dictionary Learning

When designing dictionaries for face recognition tasks, we are often confronted with situations where conditions in the training set are different from those present during testing. For example, in the case of face recognition, more than one familiar view may be available for training. Such training faces may be obtained from live or recorded video sequences, where a range of views is observed. However, the test images can contain conditions that are not necessarily present in the training images, such as a face in a different pose. The problem of transforming a dictionary trained from one visual domain to another can be viewed as a problem of domain adaptation [46, 50, 56, 63, 76, 77]. Several dictionary-based methods have been proposed in the literature to deal with this domain shift problem in visual recognition. A function learning framework for the task of transforming a dictionary learned from one visual domain to the other, while maintaining a domain-invariant sparse representation of a signal, was proposed in [63]. Domain dictionaries are modeled by a linear or non-linear parametric function. The dictionary function parameters and domain-invariant sparse codes are then jointly learned by solving an optimization problem. In [50], a domain adaptive dictionary learning framework was proposed by generating a set of intermediate dictionaries which smoothly connect the source and target domains. One of the important properties of this approach is that it allows the synthesis of data associated with the intermediate domains while exploiting the discriminative power of generative dictionaries. The intermediate data can then be used to build a classifier for recognition under domain shifts.


H. Zhang and V.M. Patel

In [76], a domain adaptive dictionary learning framework is proposed for learning a single dictionary to optimally represent both source and target data. As the features may not be correlated well in the original space, one can project data from both domains onto a common low-dimensional space while maintaining the manifold structure of the data. Learning the dictionary on a low-dimensional space makes the algorithm faster, and irrelevant information in the original features can be discarded. Moreover, joint learning of the dictionary and projections ensures that the common internal structure of data in both domains is extracted, which can be represented well by sparse linear combinations of dictionary atoms. In what follows, we briefly review the generalized domain adaptive dictionary learning framework proposed in [76]. An overview of this method is shown in Fig. 6.13.

[Figure 6.13: dictionary learning stage; the source domain and the target domain are mapped by $W_1$ and $W_2$ onto a common latent subspace, in which a shared discriminative dictionary $D$ is learned.]

Fig. 6.13 Overview of domain adaptive latent space dictionary learning framework [76]. Note that D in this figure corresponds to the dictionary B

The classical dictionary learning approach minimizes the representation error of the given set of data samples subject to a sparsity constraint (6.14). Now, consider a special case where we have data from two domains, $X_1 \in \mathbb{R}^{n_1 \times N_1}$ and $X_2 \in \mathbb{R}^{n_2 \times N_2}$. We wish to learn a shared $K$-atom dictionary $B \in \mathbb{R}^{n \times K}$ and mappings $W_1 \in \mathbb{R}^{n \times n_1}$, $W_2 \in \mathbb{R}^{n \times n_2}$ onto a common low-dimensional space, which will minimize the representation error in the projected space. Formally, we desire to minimize the following cost function:

$$C_1(B, W_1, W_2, \Gamma_1, \Gamma_2) = \|W_1 X_1 - B\Gamma_1\|_F^2 + \|W_2 X_2 - B\Gamma_2\|_F^2 \tag{6.38}$$

subject to sparsity constraints on $\Gamma_1$ and $\Gamma_2$. We further assume that the rows of the projection matrices $W_1$ and $W_2$ are orthogonal and normalized to unit norm. This prevents the solution from becoming degenerate, leads to an efficient scheme for optimization, and makes the kernelization of the algorithm possible. In order to make sure that the projections do not lose too much of the information available in the original domains after projecting onto the latent space, a PCA-like regularization term is added which preserves energy in the original signal, given as

$$C_2(W_1, W_2) = \|X_1 - W_1^T W_1 X_1\|_F^2 + \|X_2 - W_2^T W_2 X_2\|_F^2. \tag{6.39}$$

It is easy to show after some algebraic manipulations that the costs $C_1$ and $C_2$, after ignoring the constant terms in $X$, can be written as

$$C_1(B, \tilde{W}, \tilde{\Gamma}) = \|\tilde{W}\tilde{X} - B\tilde{\Gamma}\|_F^2, \tag{6.40}$$

$$C_2(\tilde{W}) = -\mathrm{trace}\big((\tilde{W}\tilde{X})(\tilde{W}\tilde{X})^T\big), \tag{6.41}$$

where

$$\tilde{W} = [W_1 \;\; W_2], \qquad \tilde{X} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}, \qquad \tilde{\Gamma} = [\Gamma_1 \;\; \Gamma_2].$$

The minus sign in (6.41) follows from the row-orthonormality constraint $W_i W_i^T = I$, under which $\|X_i - W_i^T W_i X_i\|_F^2 = \|X_i\|_F^2 - \mathrm{trace}\big((W_i X_i)(W_i X_i)^T\big)$. Hence, the overall optimization is given as

$$\{B^*, \tilde{W}^*, \tilde{\Gamma}^*\} = \arg\min_{B, \tilde{W}, \tilde{\Gamma}} C_1(B, \tilde{W}, \tilde{\Gamma}) + \lambda C_2(\tilde{W}) \quad \text{s.t. } W_i W_i^T = I,\ i = 1, 2, \ \text{and } \|\tilde{\gamma}_j\|_0 \le T_0,\ \forall j, \tag{6.42}$$

where $\lambda$ is a positive constant. See [76] for the details regarding the optimization of the above problem.

In order to show the effectiveness of this method, a pose alignment experiment was done in [76] using the CMU Multi-PIE dataset [29]. The Multi-PIE dataset [29] is a comprehensive face dataset of 337 subjects, with images taken across 15 poses, 20 illuminations, 6 expressions and 4 different sessions. For the purpose of this experiment, 129 subjects common to both Session 1 and Session 2 were used. The experiment was done on 5 poses, ranging from frontal to 75°. Frontal faces were taken as the source domain, while different off-frontal poses were taken as the target domains. Dictionaries were trained using illuminations {1, 4, 7, 12, 17} from the source and the target poses, in Session 1, per subject. All the illumination images from Session 2, for the target pose, were taken as probe images. Pose alignment is challenging due to the highly non-linear changes induced by 3-D rotation of the face. Images at the extreme pose of 60° were taken as the target pose. First, a shared discriminative dictionary was learned. Then, given the probe image, it was projected onto the latent subspace and reconstructed using the dictionary. The reconstruction was back-projected onto the source pose domain to give the aligned image. Figure 6.14 shows the synthesized images for various conditions. The best alignment is achieved when K is equal to 5. It can be seen from rows 2 and 3 that the dictionary-based method is robust even at high levels of noise and missing pixels. Moreover, de-noised and in-painted synthesized images are produced, as shown in rows 2 and 3 of Fig. 6.14, respectively. This experiment clearly shows the effectiveness of the domain adaptive dictionary learning method for pose alignment [76].

Fig. 6.14 Examples of pose-aligned images. Synthesis in various conditions demonstrates the robustness of the domain adaptive dictionary learning method [76]
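The step from the PCA-like regularizer (6.39) to its trace form (6.41) relies on the row-orthonormality of the projections. The following sketch checks this identity numerically for a single domain. It is only an illustration: the helper names and the random test data are assumptions introduced here, not part of the method in [76].

```python
import numpy as np

def random_row_orthonormal(n, d, rng):
    """Return an n x d matrix W with orthonormal rows (W @ W.T = I), n <= d."""
    # Reduced QR of a d x n Gaussian matrix gives d x n orthonormal columns.
    Q, _ = np.linalg.qr(rng.standard_normal((d, n)))
    return Q.T

def pca_like_cost(W, X):
    """Energy lost by projecting X with W and back-projecting with W.T."""
    return np.linalg.norm(X - W.T @ W @ X, "fro") ** 2

def trace_form(W, X):
    """Constant term ||X||_F^2 minus trace((WX)(WX)^T)."""
    WX = W @ X
    return np.linalg.norm(X, "fro") ** 2 - np.trace(WX @ WX.T)

rng = np.random.default_rng(0)
W = random_row_orthonormal(3, 7, rng)   # projection onto a 3-D latent space
X = rng.standard_normal((7, 10))        # 10 samples of dimension 7
gap = abs(pca_like_cost(W, X) - trace_form(W, X))  # should be near zero
```

Because the term $\|X\|_F^2$ does not depend on $W$, minimizing the reconstruction energy is equivalent to maximizing the trace term, which is why only the negative trace appears in the overall objective.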

6.4

Analysis Dictionary Learning

So far, the DL methods discussed in this chapter are so-called synthesis DL approaches where, given a data matrix $X$, it is decomposed into a dictionary $B$ and sparse codes $\Gamma$. In recent years, an alternative form of DL known as analysis DL has also gained a lot of interest in the literature [31, 45, 68, 72, 74]. Figure 6.15 presents a brief comparison of the two dictionary learning models. Different from (6.14), the analysis dictionary learning methods aim to learn an analysis dictionary $W \in \mathbb{R}^{m \times l}$ by solving

$$\min_{W, C} \|WX - C\|_F^2 \quad \text{s.t. } W \in \mathcal{W},\ \|c_i\|_0 \le T_0,\ i = 1, 2, \ldots, n, \tag{6.43}$$

where $X = [X_1, X_2, \ldots, X_C] \in \mathbb{R}^{l \times n}$ is the training data from $C$ classes, $T_0$ is the sparsity level, and $\mathcal{W}$ is a set of constraints on $W$ that makes the solution non-trivial. As indicated in [74], $\mathcal{W}$ can be the set of matrices with either relatively small Frobenius norm or unit row-wise norm. In particular, the following optimization problem is solved in [74]:

$$\min_{C, W} \|WX - C\|_F^2 + \lambda \phi(W) \quad \text{such that } \|w_i\|_2 = 1\ \ \forall i = 1, \ldots, m,\ \ \|C\|_0 \le T_0, \tag{6.44}$$

where $w_i$ is the $i$th row of the analysis dictionary $W$, $\lambda > 0$ is a hyperparameter and

$$\phi(W) = \begin{cases} -\log(\det(W^T W)) & \text{if } m \ge l, \\ -\log(\det(W W^T)) & \text{if } m < l. \end{cases} \tag{6.45}$$

The overall cost function is non-convex; however, one can follow a strategy of alternate minimization to optimize the cost [74]. This can be done in the following two steps:

• Update sparse code, C: Fixing $W$, the solution for $C$ can be obtained by a simple thresholding. The optimal solution for $C$ is given by retaining the top $T_0$ coefficients (largest in magnitude) in each column of $WX$. We can also relax the $\ell_0$-norm constraint to an $\ell_1$-norm penalty to make the problem convex. In this case, one can solve the following problem

$$\arg\min_{C} \|WX - C\|_F^2 + \beta\|C\|_1, \tag{6.46}$$

where $\beta$ is a regularizing parameter. Problem (6.46) can be solved by applying a soft thresholding scheme as follows

[...]

Then we have

$$(\alpha, \theta) = \arg\min_{\alpha, \theta} \|x - D\alpha\|_2^2 + 4\sigma_n^2 \log(\theta + \epsilon) + \sigma_n^2 \sum_i \frac{(\alpha_i - \mu_i)^2}{\theta_i^2}, \tag{7.5}$$

where $\log(\theta + \epsilon) = \sum_i \log(\theta_i + \epsilon)$ is adopted for its notational simplicity. Note that the matrix form of the original GSM model is $\alpha = \Lambda\beta$ and $\mu = \Lambda\gamma$, where $\Lambda = \mathrm{diag}(\theta_i) \in \mathbb{R}^{K \times K}$ is a diagonal matrix characterizing the variance field for the chosen image patch. Accordingly, the optimization problem in Eq. (7.5) can be translated from the $(\alpha, \theta)$ domain to the $(\beta, \theta)$ domain, namely

$$(\beta, \theta) = \arg\min_{\beta, \theta} \|x - D\Lambda\beta\|_2^2 + 4\sigma_n^2 \log(\theta + \epsilon) + \sigma_n^2 \|\beta - \gamma\|_2^2. \tag{7.6}$$

Note that the above formulation of sparse coding is for a single image patch. When a collection of similar patches is available, it is plausible to formulate a simultaneous sparse coding (SSC) problem as follows. Based on the observation that the sparse coefficients $\alpha$ of similar patches should be characterized by the same prior (i.e., $P(\alpha \mid \mu)$ with the same $\theta$ and $\mu$), one can generalize the optimization problem in Eq. (7.7) from vector form to matrix form:

$$(B, \theta) = \arg\min_{B, \theta} \|X - D\Lambda B\|_F^2 + 4\sigma_n^2 \log(\theta + \epsilon) \tag{7.7}$$

$$\qquad\qquad + \sigma_n^2 \|B - \Gamma\|_F^2, \tag{7.8}$$

where $X = [x_1, \ldots, x_m]$ is the data matrix containing the collection of $m$ similar patches and $A = \Lambda B$ is the matrix representation of $\alpha = \Lambda\beta$. Similarly, we have $\Gamma = [\gamma_1, \ldots, \gamma_m] \in \mathbb{R}^{K \times m}$ and $B = [\beta_1, \ldots, \beta_m] \in \mathbb{R}^{K \times m}$ respectively, wherein $\gamma_j = \gamma$, $j = 1, 2, \ldots, m$.

7.3.3

Solving Simultaneous Sparse Coding via Alternating Minimization

Next, we will show how to solve the SSC problem in Eq. (7.8) by alternating minimization [12]. The key motivation behind the method of alternating minimization is that the two subproblems, minimization of $B$ for a fixed $\theta$ and minimization of $\theta$ for a fixed $B$, can both be efficiently solved (actually in analytical forms). Therefore, it is plausible to iteratively solve the optimization problem in Eq. (7.8) by alternating between these two subproblems.

A. Solving θ for a Fixed B

The first subproblem

$$\theta = \arg\min_{\theta} \|X - D\Lambda B\|_F^2 + 4\sigma_n^2 \log(\theta + \epsilon) \tag{7.9}$$

can be rewritten into

$$\theta = \arg\min_{\theta} \Big\|X - \sum_{i=1}^{K} d_i \beta^i \theta_i\Big\|_F^2 + 4\sigma_n^2 \log(\theta + \epsilon) = \arg\min_{\theta} \|\tilde{x} - \tilde{D}\theta\|_2^2 + 4\sigma_n^2 \log(\theta + \epsilon), \tag{7.10}$$

where $\tilde{x} \in \mathbb{R}^{nm}$ denotes the vectorization of the data matrix $X$, the matrix $\tilde{D} = [\tilde{d}_1, \tilde{d}_2, \ldots, \tilde{d}_K] \in \mathbb{R}^{mn \times K}$ has columns $\tilde{d}_i$ given by the vectorization of the rank-one matrix $d_i \beta^i$, and $\beta^i \in \mathbb{R}^m$ denotes the $i$th row of the matrix $B$. Note that when the dictionary $D$ is orthogonal (e.g., DCT or PCA basis), Eq. (7.9) can be further simplified into

$$\theta = \arg\min_{\theta} \|A - \Lambda B\|_F^2 + 4\sigma_n^2 \log(\theta + \epsilon), \tag{7.11}$$

where we have used $X = DA$. Although $\log(\theta + \epsilon)$ is non-convex, we can efficiently solve the problem using an approximation method for a local minimum. Let $f(\theta) = \sum_{i=1}^{K} \log(\theta_i + \epsilon)$; one can approximate $f(\theta)$ by its first-order Taylor expansion, i.e.,

7 Sparsity Based Nonlocal Image Restoration: An Alternating Optimization Approach


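$$f(\theta^{(k+1)}) = f(\theta^{(k)}) + \langle \nabla f(\theta^{(k)}), \theta - \theta^{(k)} \rangle, \tag{7.12}$$

where $\theta^{(k)}$ denotes the solution obtained after the $k$th iteration. By using the fact that the $i$th component of $\nabla f(\theta^{(k)})$ is $1/(\theta_i^{(k)} + \epsilon)$ and ignoring the constants in Eq. (7.12), Eq. (7.11) can be solved by iteratively minimizing

$$\theta^{(k+1)} = \arg\min_{\theta} \|A - \Lambda B\|_F^2 + 4\sigma_n^2 \|W\theta\|_1, \tag{7.13}$$

where $W = \mathrm{diag}\big(1/(\theta_i^{(k)} + \epsilon)\big)$ is the reweighting matrix (similar to iterative reweighted $l_1$ minimization [49]). Since both $\Lambda$ and $W$ are diagonal, we can decompose the minimization problem in Eq. (7.13) into $K$ parallel scalar optimization problems, which admit highly efficient implementations. Let $\alpha^i \in \mathbb{R}^{1 \times m}$ and $\beta^i \in \mathbb{R}^{1 \times m}$ denote the $i$th row of the matrices $A \in \mathbb{R}^{K \times m}$ and $B \in \mathbb{R}^{K \times m}$, respectively. Eq. (7.13) can be rewritten as

$$\theta^{(k+1)} = \arg\min_{\theta} \sum_{i=1}^{K} \|(\alpha^i)^T - (\beta^i)^T \theta_i\|_2^2 + 4\sigma_n^2 \sum_{i=1}^{K} \frac{\theta_i}{\theta_i^{(k)} + \epsilon}, \tag{7.14}$$

which can be conveniently decomposed into a sequence of independent scalar optimization problems

$$\theta_i^{(k+1)} = \arg\min_{\theta_i} \|(\alpha^i)^T - (\beta^i)^T \theta_i\|_2^2 + 4\sigma_n^2 \frac{\theta_i}{\theta_i^{(k)} + \epsilon}. \tag{7.15}$$

Now one can see that this is a standard $l_2$-$l_1$ optimization problem. Setting the derivative of Eq. (7.15) to zero for $\theta_i > 0$ gives the closed-form solution

$$\theta_i^{(k+1)} = \frac{1}{\beta^i (\beta^i)^T} \big[\beta^i (\alpha^i)^T - \tau\big]_+, \tag{7.16}$$

where the threshold is $\tau = \dfrac{2\sigma_n^2}{\theta_i^{(k)} + \epsilon}$ and $[\cdot]_+$ denotes the soft shrinkage operator.

B. Solving B for a Fixed θ

The second subproblem takes the following form:

$$B = \arg\min_{B} \|X - D\Lambda B\|_F^2 + \sigma_n^2 \|B - \Gamma\|_F^2. \tag{7.17}$$

Since both terms are $l_2$, setting the gradient with respect to $B$ to zero yields the normal equations $(\hat{D}^T \hat{D} + \sigma_n^2 I)B = \hat{D}^T X + \sigma_n^2 \Gamma$, so the closed-form solution to Eq. (7.17) is given by the classical Wiener filtering

$$B = (\hat{D}^T \hat{D} + \sigma_n^2 I)^{-1} (\hat{D}^T X + \sigma_n^2 \Gamma), \tag{7.18}$$

where $\hat{D} = D\Lambda$. Note that when $D$ is orthogonal, Eq. (7.18) can be further simplified into

$$B = (\Lambda^T \Lambda + \sigma_n^2 I)^{-1} (\Lambda^T A + \sigma_n^2 \Gamma), \tag{7.19}$$

where $\Lambda^T \Lambda + \sigma_n^2 I$ is a diagonal matrix and therefore its inverse can be easily computed. By alternately solving the subproblems of Eqs. (7.9) and (7.17) to update $\Lambda$ and $B$, we can reconstruct the image data matrix $X$ as

$$\hat{X} = D\hat{\Lambda}\hat{B}, \tag{7.20}$$

where $\hat{\Lambda}$ and $\hat{B}$ denote the final estimates of $\Lambda$ and $B$.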
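The alternating scheme of this section can be sketched for the orthogonal-dictionary case, where the data enter only through $A = D^T X$. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the initialization of $\theta$ by the row-wise root mean square of $A$, and the default values of `eps` and `n_iter` are choices made here. The threshold in `update_theta` and the right-hand side in `update_B` are obtained by setting the derivatives of the objectives in Eqs. (7.15) and (7.17) to zero.

```python
import numpy as np

def objective(A, B, theta, Gamma, sigma, eps):
    """SSC objective for orthogonal D: fit + log penalty + prior on B."""
    fit = np.sum((A - theta[:, None] * B) ** 2)
    log_pen = 4.0 * sigma**2 * np.sum(np.log(theta + eps))
    prior = sigma**2 * np.sum((B - Gamma) ** 2)
    return fit + log_pen + prior

def update_theta(A, B, theta, sigma, eps):
    """Reweighted shrinkage step, one scalar problem per row i."""
    num = np.sum(B * A, axis=1)           # beta^i (alpha^i)^T
    den = np.sum(B * B, axis=1)           # beta^i (beta^i)^T
    tau = 2.0 * sigma**2 / (theta + eps)  # from the derivative of the scalar objective
    shrunk = np.maximum(num - tau, 0.0)
    safe_den = np.where(den > 0, den, 1.0)
    # If a row of B is all zeros the scalar objective is minimized at theta_i = 0.
    return np.where(den > 0, shrunk / safe_den, 0.0)

def update_B(A, theta, Gamma, sigma):
    """Row-wise Wiener-type update; the matrix to invert is diagonal."""
    t = theta[:, None]
    return (t * A + sigma**2 * Gamma) / (t**2 + sigma**2)

def ssc_gsm_orthogonal(A, Gamma, sigma, eps=1e-2, n_iter=10):
    """Alternate between the theta step and the B step; record the objective."""
    theta = np.sqrt(np.mean(A**2, axis=1))  # assumed initialization
    B = update_B(A, theta, Gamma, sigma)
    history = [objective(A, B, theta, Gamma, sigma, eps)]
    for _ in range(n_iter):
        theta = update_theta(A, B, theta, sigma, eps)
        B = update_B(A, theta, Gamma, sigma)
        history.append(objective(A, B, theta, Gamma, sigma, eps))
    return theta, B, history
```

Since the linearized log term upper-bounds the concave log penalty and the B step is an exact minimization, the recorded objective values should not increase from one iteration to the next.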

7.3.4

Application into Image Denoising: From Patch-based to Whole Image

Now we will generalize the solution of the SSC problem from a single image data matrix $X$ to whole-image reconstruction. Assume a standard degradation model $y = Hx + w$, where $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^M$ denote the original and degraded images respectively, $H \in \mathbb{R}^{M \times N}$ is the degradation matrix, and $w$ is additive white Gaussian noise following $N(0, \sigma_n^2)$. The whole-image reconstruction problem is given by

$$(x, \{B_l\}, \{\theta_l\}) = \arg\min_{x, \{B_l\}, \{\theta_l\}} \|y - Hx\|_2^2 + \sum_{l=1}^{L} \Big\{ \|\tilde{R}_l x - D\Lambda_l B_l\|_F^2 + \sigma_n^2 \|B_l - \Gamma_l\|_F^2 + 4\sigma_n^2 \log(\theta_l + \epsilon) \Big\}, \tag{7.21}$$


X. Li et al.

where $\tilde{R}_l x \doteq [R_{l_1} x, R_{l_2} x, \ldots, R_{l_m} x] \in \mathbb{R}^{n \times m}$ denotes the data matrix formed by the group of image patches similar to the $l$th exemplar patch $x_l$ (including $x_l$ itself), $R_l \in \mathbb{R}^{n \times N}$ denotes the matrix extracting the $l$th patch $x_l$ from $x$, and $L$ is the total number of exemplars extracted from the reconstructed image $x$. Similar to the previous section, we propose to solve the whole-image reconstruction problem in Eq. (7.21) by alternately solving the following two subproblems.

A. Solving x for a Fixed $\{B_l\}, \{\theta_l\}$

Let $\hat{X}_l = D\Lambda_l B_l$. When $\{B_l\}$ and $\{\theta_l\}$ are fixed, so is $\{\hat{X}_l\}$. Therefore, Eq. (7.21) reduces to the following $l_2$-optimization problem

$$x = \arg\min_{x} \|y - Hx\|_2^2 + \sum_{l=1}^{L} \|\tilde{R}_l x - \hat{X}_l\|_F^2, \tag{7.22}$$

which admits the following closed-form solution

$$x = \Big(H^T H + \sum_{l=1}^{L} \tilde{R}_l^T \tilde{R}_l\Big)^{-1} \Big(H^T y + \sum_{l=1}^{L} \tilde{R}_l^T \hat{X}_l\Big), \tag{7.23}$$

where $\tilde{R}_l^T \tilde{R}_l \doteq \sum_{j=1}^{m} R_j^T R_j$, $\tilde{R}_l^T \hat{X}_l \doteq \sum_{j=1}^{m} R_j^T \hat{x}_{l_j}$, and $\hat{x}_{l_j}$ denotes the $j$th column of the matrix $\hat{X}_l$. Note that for the image denoising application, where $H = I$, the matrix to be inverted in Eq. (7.23) reduces to $I + \sum_{l} \tilde{R}_l^T \tilde{R}_l$, which is diagonal, so its inverse can be computed easily. In fact, similar to the K-SVD approach, Eq. (7.23) can then be computed by weighted averaging of the reconstructed patch sets $\hat{X}_l$.

B. Solving $\{B_l\}, \{\theta_l\}$ for a Fixed x

When $x$ is fixed, the first term in Eq. (7.21) disappears and the subproblem boils down to a sequence of patch-level SSC problems formed for each exemplar, i.e.,

$$(B_l, \theta_l) = \arg\min_{B_l, \theta_l} \|X_l - D\Lambda_l B_l\|_F^2 + \sigma_n^2 \|B_l - \Gamma_l\|_F^2 + 4\sigma_n^2 \log(\theta_l + \epsilon), \tag{7.24}$$

where we use $X_l = \tilde{R}_l x$. This is exactly the problem we have studied in the previous section. Putting things together, we have obtained a complete image denoising algorithm, summarized in Algorithm 1. We have compared Algorithm 1 against three current state-of-the-art methods: BM3D Image Denoising with Shape-Adaptive PCA (BM3D-SAPCA) [50] (an enhanced version of BM3D denoising [39] in which local spatial adaptation is achieved by shape-adaptive PCA), learned simultaneous sparse coding (LSSC) [43], and nonlocally centralized sparse representation (NCSR) denoising [45]. As can be seen from Table 7.1, the proposed algorithm achieves denoising performance that is highly competitive with other leading algorithms. For the collection of 12 test images, BM3D-SAPCA and ours are mostly the two best-performing methods: on average, ours falls behind BM3D-SAPCA by less than 0.2 dB for three out of six noise levels but delivers at least comparable results for the other three. We note that the complexity of BM3D-SAPCA is much higher than that of the original BM3D; by contrast, our pure Matlab implementation of Algorithm 1 (without any C-coded optimization) still runs reasonably fast. It takes around 20 seconds to denoise a 256 × 256 image on a PC with an Intel i7-2600 processor at 3.4 GHz. Figures 7.1 and 7.2 include the visual comparison of denoising results for two typical images (lena and house) at moderate ($\sigma_w = 20$) and heavy ($\sigma_w = 100$) noise levels respectively. It can be observed from Fig. 7.1 that BM3D-SAPCA and SSC-GSM seem to deliver the best visual quality at the moderate noise level; by contrast, the images restored by LSSC and NCSR both suffer from noticeable artifacts, especially around the smooth areas close to the hat. When the noise contamination is severe, the superiority of SSC-GSM over the competing approaches is easier to justify: as can be seen from Fig. 7.2, SSC-GSM achieves the most visually pleasant restoration of the house image, especially when one inspects the zoomed portions of the roof regions closely.


Algorithm 1: SSC-GSM based Image Restoration

• Initialization:
  (a) Set the initial estimate as $\hat{x} = y$ for image denoising and deblurring, or initialize $\hat{x}$ by bicubic interpolation for image super-resolution;
  (b) Set the algorithm parameters;
  (c) Obtain the data matrices $\{X_l\}$ from $\hat{x}$ (through kNN search) for each exemplar and compute the PCA basis $\{D_l\}$ for each $X_l$.
• Outer loop (solve Eq. (7.21) by alternating optimization): iterate on $k = 1, 2, \ldots, k_{max}$
  (a) Image-to-patch transformation: obtain the data matrices $\{X_l\}$ for each exemplar;
  (b) Estimate the biased means $\mu$ using the nonlocal means method for each $X_l$;
  (c) Inner loop (solve Eq. (7.24) for each data matrix $X_l$): iterate on $j = 1, 2, \ldots, J$;
    (I) update $\theta_l$ for fixed $B_l$ using Eq. (7.16);
    (II) update $B_l$ for fixed $\theta_l$ using Eq. (7.19);
  (d) Reconstruct the $X_l$ from $\theta_l$ and $B_l$ using Eq. (7.20);
  (e) If $\mathrm{mod}(k, k_0) = 0$, update the PCA basis $\{D_l\}$ for each $X_l$;
  (f) Patch-to-image transformation: obtain the reconstructed $\hat{x}^{(k+1)}$ from the $\{X_l\}$ by solving Eq. (7.23).
• Output: $\hat{x}^{(k+1)}$.
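Step (f) of Algorithm 1, the patch-to-image transformation of Eq. (7.23), reduces to a per-pixel weighted average in the denoising case $H = I$. The sketch below illustrates this on a 1-D signal. It assumes that each reconstructed patch is supplied together with its start index; the helper name `aggregate_patches` is hypothetical and the example is not taken from the authors' Matlab code.

```python
import numpy as np

def aggregate_patches(y, patches, positions):
    """Solve (I + sum R^T R) x = y + sum R^T p for 1-D patches when H = I.

    y         : 1-D degraded signal of length N
    patches   : list of 1-D reconstructed patches
    positions : list of start indices, one per patch
    """
    numerator = np.asarray(y, dtype=float).copy()  # y + sum_j R_j^T p_j
    weights = np.ones_like(numerator)              # diagonal of I + sum_j R_j^T R_j
    for patch, start in zip(patches, positions):
        stop = start + len(patch)
        numerator[start:stop] += patch
        weights[start:stop] += 1.0
    # The system matrix is diagonal, so the inverse is an element-wise division.
    return numerator / weights
```

Each sample of the output is the average of the observed value and of every reconstructed patch value that covers it, which is the weighted averaging mentioned after Eq. (7.23).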

Table 7.1 The PSNR (dB) results by different denoising methods. In each cell, the results of the four denoising methods are reported in the following order: top left, BM3D-SAPCA [50]; top right, LSSC [43]; bottom left, NCSR [45]; bottom right, proposed SSC-GSM

| Image | $\sigma_w = 5$ | 10 | 15 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|
| Lena | 38.86 38.70 | 36.07 35.81 | 34.43 34.09 | 33.20 32.92 | 29.07 28.42 | 25.37 25.66 |
| | 38.68 38.85 | 35.83 35.96 | 34.14 34.23 | 32.88 33.08 | 28.95 29.05 | 25.96 25.91 |
| Monarch | 38.69 38.49 | 34.74 34.57 | 32.46 32.34 | 30.92 30.69 | 26.28 25.68 | 22.31 22.05 |
| | 38.53 38.74 | 34.48 34.82 | 32.15 32.52 | 30.58 30.84 | 25.59 26.02 | 21.82 22.52 |
| Barbara | 38.38 38.36 | 35.07 34.98 | 33.27 33.02 | 31.97 31.72 | 27.51 27.10 | 23.05 23.30 |
| | 38.44 38.65 | 34.95 35.27 | 32.96 33.32 | 31.53 32.06 | 27.13 27.60 | 23.56 24.05 |
| Boat | 37.50 37.35 | 34.10 33.90 | 32.29 32.03 | 31.02 30.74 | 26.89 26.60 | 23.71 23.64 |
| | 37.34 37.42 | 33.99 33.95 | 32.17 32.11 | 30.87 30.82 | 26.76 26.79 | 23.94 23.90 |
| C. Man | 38.54 38.17 | 34.52 34.12 | 32.31 31.99 | 30.86 30.48 | 26.59 26.16 | 22.91 22.89 |
| | 38.24 38.39 | 34.14 34.28 | 31.96 32.03 | 30.54 30.50 | 26.36 26.29 | 23.14 23.23 |
| Couple | 37.60 37.44 | 34.13 33.94 | 32.20 31.95 | 30.83 30.56 | 26.48 26.21 | 23.19 23.22 |
| | 37.41 37.51 | 33.96 33.94 | 32.06 31.98 | 30.70 30.63 | 26.31 26.41 | 23.34 23.36 |
| F. Print | 36.67 36.81 | 32.65 32.70 | 30.46 30.46 | 28.97 28.99 | 24.53 24.53 | 21.07 21.29 |
| | 36.71 36.84 | 32.57 32.63 | 30.31 30.36 | 28.78 28.87 | 24.21 24.50 | 21.18 21.54 |
| Hill | 37.31 37.17 | 33.84 33.69 | 32.06 31.86 | 30.85 30.61 | 27.13 26.86 | 24.10 24.13 |
| | 37.16 37.23 | 33.68 33.70 | 31.89 31.89 | 30.71 30.69 | 26.99 27.05 | 24.30 24.24 |
| House | 40.13 39.91 | 37.06 36.80 | 35.31 35.11 | 34.03 33.97 | 29.53 29.63 | 25.20 25.65 |
| | 40.00 40.02 | 37.05 36.79 | 35.32 35.03 | 34.16 34.00 | 29.90 30.36 | 25.63 26.70 |
| Man | 37.99 37.78 | 34.18 33.96 | 32.12 31.89 | 30.73 30.52 | 26.84 26.60 | 23.86 23.97 |
| | 37.84 37.91 | 34.03 34.06 | 31.98 31.99 | 30.60 30.60 | 26.72 26.76 | 24.00 24.02 |
| Peppers | 38.30 38.06 | 34.94 34.66 | 33.01 32.70 | 31.61 31.26 | 26.94 26.53 | 23.05 22.64 |
| | 38.15 38.22 | 34.80 34.83 | 32.87 32.87 | 31.47 31.41 | 26.87 26.82 | 23.14 23.34 |
| Straw | 35.81 35.87 | 31.46 31.50 | 29.13 29.13 | 27.52 27.50 | 22.79 22.48 | 19.42 19.23 |
| | 35.92 36.04 | 31.39 31.56 | 28.95 29.16 | 27.36 27.51 | 22.67 22.84 | 19.50 19.52 |
| Average | 37.98 37.84 | 34.40 34.22 | 32.42 32.21 | 31.04 30.83 | 26.71 26.44 | 23.10 23.14 |
| | 37.87 37.98 | 34.24 34.32 | 32.23 32.29 | 30.85 30.92 | 26.54 26.71 | 23.29 23.53 |


Fig. 7.1 Denoising performance comparison on the Lena image with moderate noise corruption. (a) Original image; (b) Noisy image ($\sigma_n = 20$); denoised images by (c) BM3D-SAPCA [50] (PSNR = 33.20 dB, SSIM = 0.8803); (d) LSSC [43] (PSNR = 32.88 dB, SSIM = 0.8742); (e) NCSR [45] (PSNR = 32.92 dB, SSIM = 0.8760); (f) Proposed SSC-GSM (PSNR = 33.08 dB, SSIM = 0.8787)

Fig. 7.2 Denoising performance comparison on the House image with strong noise corruption. (a) Original image; (b) Noisy image ($\sigma_n = 100$); denoised images by (c) BM3D-SAPCA [50] (PSNR = 25.20 dB, SSIM = 0.6767); (d) LSSC [43] (PSNR = 25.63 dB, SSIM = 0.7389); (e) NCSR [45] (PSNR = 25.65 dB, SSIM = 0.7434); (f) Proposed SSC-GSM (PSNR = 26.70 dB, SSIM = 0.7430)

7.4

Nonlocal Compressed Sensing via Low-rank Methods

7.4.1

Compressed Sensing via l1-optimization: Ten Years After

The theory of compressed sensing (CS) established in 2006 [17, 51] originated from the same motivation as sparse coding discussed in the previous section. A standard formulation of the CS problem pursues the sparsest signal $x$ that satisfies the observation constraint $y = \Phi x$, namely

$$x = \arg\min_{x} \|x\|_0, \quad \text{s.t. } y = \Phi x, \tag{7.25}$$

where $\Phi \in \mathbb{C}^{M \times N}$, $M < N$, is the measurement matrix and $l_0$ is a pseudo-norm counting the number of non-zero entries in $x$. However, since $l_0$-minimization is computationally intractable (i.e., NP-hard), it is often suggested that one can replace the nonconvex $l_0$ norm by its convex $l_1$ counterpart, i.e.,

$$x = \arg\min_{x} \|x\|_1, \quad \text{s.t. } y = \Phi x. \tag{7.26}$$

One of the key results in [17] is to show that solving this $l_1$-optimization problem can recover a $K$-sparse signal from $M = O(K \log(N/K))$ random measurements. The optimization problem in Eq. (7.26) is convex and calls for a well-known linear programming technique such as basis pursuit [17, 51]. Alternatively, using a Lagrangian multiplier $\lambda$, one can convert the constrained optimization problem in Eq. (7.26) to an equivalent unconstrained optimization problem:

$$x = \arg\min_{x} \|y - \Phi x\|_2^2 + \lambda \|x\|_1, \tag{7.27}$$

which can be efficiently solved by various methods (e.g., the iterative shrinkage algorithm [52], the split Bregman algorithm [53], the homotopy-based method [24] and the alternative direction multiplier method (ADMM) [54]). Recent advances have also shown that better CS recovery performance can be obtained by replacing the $l_1$ norm with a non-convex $l_p$ $(0 < p < 1)$ norm, though at the price of higher computational complexity [23]. Deterministic annealing (DA) [55, 56] (a.k.a. graduated non-convexity [22]) represents a strategy of modifying the nonconvex cost function $J(x)$ in such a way that the global optimum can be reached through a sequence of convex cost functions $J_p(x)$ (the auxiliary variable $p$ parameterizes the graduation process). When the iteration starts from a small parameter favoring a smooth cost function, it is relatively easy to find a favorable local optimum. As the cost function becomes more jagged after more iterations, the local optimum will be gradually driven toward the global minimum with some probability. Similar ideas of improving sparsity by reweighted $l_1$-optimization [49] also exist in the literature.

In addition to the standard $l_p$ $(0 \le p \le 1)$ sparsity, the other line of algorithmic development toward CS is to formulate CS as a Bayesian inference problem. When compared against noise removal, CS can be formulated in a similar fashion, except that the observation model (a.k.a. the likelihood term in Bayesian image restoration [57]) changes from noisy to incomplete data. Such a Bayesian perspective allows us to leverage the image priors developed for image restoration in the development of CS algorithms. A similar observation has been made in the recent work called model-based image reconstruction with plug-and-play priors [58]; but in fact, adapting image denoising algorithms to solve the CS problem is a standard procedure, e.g., BM3D-CS [59] is based on the prior work of BM3D denoising [39]. Despite the impressive performance of BM3D-CS, its theoretical foundation remains weak, i.e., does it also admit any variational interpretation?
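The unconstrained problem of Eq. (7.27) can be solved by iterative shrinkage, one of the methods listed above. The following is a minimal sketch of the basic iterative shrinkage-thresholding scheme, assuming a real-valued measurement matrix (the text allows complex $\Phi$) and a fixed step size equal to the reciprocal of the Lipschitz constant $2\|\Phi\|_2^2$ of the gradient; the function names and the default iteration count are choices made for this illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft shrinkage: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Phi, y, lam, n_iter=200):
    """Minimize ||y - Phi x||_2^2 + lam * ||x||_1 by iterative shrinkage."""
    # Step size 1/L, where L = 2 * ||Phi||_2^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ x - y)   # gradient of the data-fidelity term
        x = soft_threshold(x - step * grad, lam * step)
    return x
```

With $\Phi = I$ the iteration reaches the closed-form solution, soft shrinkage of $y$ with threshold $\lambda/2$, after a single step, which gives a simple sanity check.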

7.4.2

Compressed Sensing via Nonlocal Low-Rank (NLR) Regularization

There are many ways of characterizing self-similarity in signals of our interest; the low-rank method has been applied to an emerging new field "at the heels of CS": matrix completion with noise [60]. To see how self-similarity leads to a low-rank constraint, let us assume that a sufficient number of similar patches can be found for any exemplar patch of size $\sqrt{n} \times \sqrt{n}$ at the position $i$, denoted by $\hat{x}_i \in \mathbb{C}^n$. Then we can obtain a data matrix $X_i = [x_{i_0}, x_{i_1}, \ldots, x_{i_{m-1}}]$, $X_i \in \mathbb{C}^{n \times m}$, for each exemplar patch $x_i$, where each column of $X_i$ denotes a patch similar to $x_i$ (including $x_i$ itself). Under the assumption that these image patches have similar structures, the formed data matrix $X_i$ has a low-rank property (e.g., identical repetition of the exemplar patch will always produce a rank-one matrix regardless of the dimension of the data matrix). In practice, the data matrix $X_i$ may be noisy or incomplete, which could lead to a deviation from the low-rank constraint. One possible solution is to model the data matrix $X_i$ as $X_i = L_i + W_i$, where $L_i$ and $W_i$ denote the low-rank matrix and the noise term respectively. Therefore, the matrix completion problem with noise boils down to a rank minimization one.

In general, rank minimization is an NP-hard problem, just like $l_0$-optimization. A computationally more tractable approach is to use the nuclear norm $\|\cdot\|_*$ (sum of singular values) as a convex surrogate of the rank. Using the nuclear norm, the original rank-minimization problem can be efficiently solved by a technique named singular value thresholding (SVT) [61]. Despite the good theoretical guarantee of correctness [62], we conjecture that nonconvex optimization toward rank minimization could lead to better recovery results, just like what has been observed in the studies of $l_p$-optimization. Therefore, in [63], we have considered a smooth but non-convex surrogate of the rank instead of the nuclear norm. We will first highlight the choice of surrogate function and then discuss its implications for the optimization algorithms. It was shown in [64] that for a symmetric positive semidefinite matrix $X \in \mathbb{R}^{n \times n}$, the rank minimization problem can be approximately solved by minimizing the following functional:

$$E(X, \varepsilon) \doteq \log\det(X + \varepsilon I), \tag{7.28}$$

where $\varepsilon$ denotes a small constant value. Since this function $E(X, \varepsilon)$ approximates the sum of the logarithm of singular values (up to a scale), the function $E(X, \varepsilon)$ is smooth yet non-convex. Figure 7.3 shows the comparison of the nonconvex surrogate function, the rank, and the nuclear norm in the scalar case. It can be observed from Fig. 7.3 that the surrogate function $E(X, \varepsilon)$ can indeed better approximate the rank than the nuclear norm.

[Figure 7.3: curves of $\|x\|_0 = \mathrm{rank}(x)$, $\|x\|_1$ and $\log(\varepsilon + |x|)$ plotted against $x$.]

Fig. 7.3 Comparison of $L(x, \varepsilon)$, $\mathrm{rank}(x) = \|x\|_0$ and the nuclear norm $= \|x\|_1$ in the case of a scalar

For a general matrix $L_i \in \mathbb{C}^{n \times m}$, $n \le m$, that is neither square nor positive semidefinite, we can invoke the strategy of diagonalization and modify Eq. (7.28) into

$$L(L_i, \varepsilon) \doteq \log\det\big((L_i L_i^\top)^{1/2} + \varepsilon I\big) = \log\det\big(U\Sigma^{1/2}U^{-1} + \varepsilon I\big) = \log\det\big(\Sigma^{1/2} + \varepsilon I\big), \tag{7.29}$$

where $\Sigma$ is a diagonal matrix whose diagonal elements are the eigenvalues of the matrix $L_i L_i^\top$, i.e., $L_i L_i^\top = U\Sigma U^{-1}$, and $\Sigma^{1/2}$ is a diagonal matrix whose diagonal elements are the singular values of the matrix $L_i$. Therefore, one can see that $L(L_i, \varepsilon)$ is a surrogate function of $\mathrm{rank}(L_i)$ obtained by setting $X = (L_i L_i^\top)^{1/2}$. We then propose the following low-rank approximation problem for solving $L_i$:

$$L_i = \arg\min_{L_i} \|X_i - L_i\|_F^2 + \lambda L(L_i, \varepsilon). \tag{7.30}$$

For each exemplar image patch, we can approximate the matrix $X_i$ with a low-rank matrix $L_i$ by solving Eq. (7.30). To apply the above surrogate function to CS, we propose to enforce the low-rank property over the sets of nonlocal similar patches for each extracted exemplar patch, along with the constraint of linear measurements. More specifically, we have adopted the following global objective functional for CS recovery:

$$(\hat{x}, \hat{L}_i) = \arg\min_{x, L_i} \|y - \Phi x\|_2^2 + \eta \sum_i \big\{ \|\tilde{R}_i x - L_i\|_F^2 + \lambda L(L_i, \varepsilon) \big\}, \tag{7.31}$$

where $\tilde{R}_i x \doteq [R_{i_0} x, R_{i_1} x, \ldots, R_{i_{m-1}} x]$ denotes the matrix formed by the set of similar patches for every exemplar patch $x_i$. We note that the above nonlocal low-rank regularization (NLR) can exploit both the group sparsity of similar patches and the nonconvexity of rank minimization (thus achieving better recovery than previous methods). Next, we will show that the optimization problem formulated in Eq. (7.31) can be efficiently solved by the method of alternative minimization.

7.4.3

Optimization Algorithm for CS Image Recovery

In this section, we show how to solve the above-formulated optimization problem by alternately minimizing the objective functional with respect to the whole image $x$ and the low-rank data matrices $L_i$.

A. Low-rank Matrix Optimization via Iterative Singular Value Thresholding (ISVT)

The first subproblem is to update the low-rank data matrix $L_i$ based on an initial estimate of the unknown image $x$. After grouping a set of similar patches for each exemplar patch $x_i$, we consider the following minimization problem for each data matrix $L_i$:

$$L_i = \arg\min_{L_i} \eta\|\tilde{R}_i x - L_i\|_F^2 + \lambda L(L_i, \varepsilon). \tag{7.32}$$

Noting that $L(L_i, \varepsilon)$ approximately sums up the logarithm of the singular values (up to a scale), we can rewrite Eq. (7.32) into

$$\min_{L_i} \eta\|X_i - L_i\|_F^2 + \lambda \sum_{j=1}^{n_0} \log(\sigma_j + \varepsilon), \tag{7.33}$$


where $X_i = \tilde{R}_i x$, $n_0 = \min\{n, m\}$, and $\sigma_j$ is the $j$th singular value of $L_i$. Though $\sum_{j=1}^{n_0} \log(\sigma_j + \varepsilon)$ is non-convex, we can solve it efficiently using an annealing-based local minimization method. Let $f(\sigma) = \sum_{j=1}^{n_0} \log(\sigma_j + \varepsilon)$. Then $f(\sigma)$ can be approximated by its first-order Taylor expansion as

$$f(\sigma) \approx f(\sigma^{(k)}) + \langle \nabla f(\sigma^{(k)}), \sigma - \sigma^{(k)} \rangle, \tag{7.34}$$

where $\sigma^{(k)}$ is the solution after the $k$th iteration. Therefore, Eq. (7.33) can be solved by iteratively solving

$$L_i^{(k+1)} = \arg\min_{L_i} \|X_i - L_i\|_F^2 + \frac{\lambda}{\eta} \sum_{j=1}^{n_0} \frac{\sigma_j}{\sigma_j^{(k)} + \varepsilon}, \tag{7.35}$$

where we have used the fact that the $j$th component of $\nabla f(\sigma^{(k)})$ is $1/(\sigma_j^{(k)} + \varepsilon)$ and ignored the constants in Eq. (7.34). For simplicity of notation, we can rewrite Eq. (7.35) as

$$L_i^{(k+1)} = \arg\min_{L_i} \frac{1}{2} \|X_i - L_i\|_F^2 + \tau \varphi(L_i, w^{(k)}), \tag{7.36}$$

where $\tau = \lambda/(2\eta)$ and $\varphi(L_i, w) = \sum_{j=1}^{n_0} w_j \sigma_j$ denotes the weighted nuclear norm with weights $w_j^{(k)} = 1/(\sigma_j^{(k)} + \varepsilon)$. Note that the weights are ascending because the singular values $\sigma_j$ are ordered in descending order. When the weighted nuclear norm is convex, the optimal solution to a problem of the form (7.36) is known to be given by a weighted singular value thresholding operator, known as the proximal operator. However, the weighted nuclear norm is a convex function if and only if the weights are descending, while in our case the weights are ascending. It follows that one cannot expect to find a global minimizer for (7.36). Nevertheless, one can still show that weighted singular value thresholding produces one (possibly local) minimizer to (7.36) (for the detailed proof, see [63]):

Theorem 1 (Proximal Operator of Weighted Nuclear Norm). For each $X \in \mathbb{C}^{n \times m}$ and $0 \le w_1 \le \cdots \le w_{n_0}$, $n_0 = \min\{m, n\}$, a minimizer to

$$\min_{L} \frac{1}{2} \|X - L\|_F^2 + \tau \varphi(L, w) \tag{7.37}$$

is given by the weighted singular value thresholding operator $S_{w,\tau}(X)$:

$$S_{w,\tau}(X) := U (\Sigma - \tau\, \mathrm{diag}(w))_+ V^\top, \tag{7.38}$$

where $U \Sigma V^\top$ is the SVD of $X$ and $(x)_+ = \max\{x, 0\}$.

Based on the above theorem, we can obtain the reconstructed matrix in the $(k+1)$th iteration by

$$L_i^{(k+1)} = U (\tilde{\Sigma} - \tau\, \mathrm{diag}(w^{(k)}))_+ V^\top, \tag{7.39}$$

where $U \tilde{\Sigma} V^\top$ is the SVD of $X_i$ and $w_j^{(k)} = 1/(\sigma_j^{(k)} + \varepsilon)$. In our implementation, we set $w^{(0)} = [1, 1, \ldots, 1]^\top$, so the first iteration is equivalent to solving an unweighted nuclear-norm minimization problem. When $L_i$ is a vector, $\log\det(\cdot)$ leads to the well-known reweighted $\ell_1$-norm [49]. In [49] it has been shown that the reweighted $\ell_1$-norm performs much better than the $\ell_1$-norm in approximating the $\ell_0$-norm and often leads to better image recovery results. The experiments that follow indicate that the $\log\det(\cdot)$ surrogate can likewise lead to better CS recovery results than the nuclear norm.

B. Image Recovery via Alternating Direction Method of Multipliers (ADMM)

The second subproblem is the update of the image estimate $x$ based on the updated $L_i$. This problem can be formulated as follows:

$$x = \arg\min_{x} \|y - \Phi x\|_2^2 + \eta \sum_i \|\tilde{R}_i x - L_i\|_F^2. \tag{7.40}$$

In fact, Eq. (7.40) is a quadratic optimization problem admitting a closed-form solution, i.e.,

$$x = \Big(\Phi^H \Phi + \eta \sum_i \tilde{R}_i^\top \tilde{R}_i\Big)^{-1} \Big(\Phi^H y + \eta \sum_i \tilde{R}_i^\top L_i\Big), \tag{7.41}$$

where the superscript $H$ denotes the Hermitian transpose operation, $\tilde{R}_i^\top L_i \doteq \sum_{r=0}^{m-1} R_r^\top x_{i_r}$ and $\tilde{R}_i^\top \tilde{R}_i \doteq \sum_{r=0}^{m-1} R_r^\top R_r$. In Eq. (7.41), the matrix to be inverted is large. Hence, inverting it directly is computationally impractical. In practice, Eq. (7.41) can be computed by using the conjugate gradient (CG) algorithm.
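As a rough illustration of the CG route for Eq. (7.41), the sketch below solves a system of the form $(\Phi^H \Phi + \eta W)x = \Phi^H y + \eta b$ using only matrix-vector products, so the large matrix is never formed or inverted. Everything concrete here is an assumption made for the example rather than taken from the chapter: the small random `Phi`, the diagonal `W` standing in for $\sum_i \tilde{R}_i^\top \tilde{R}_i$, the vector `b_patch` standing in for $\sum_i \tilde{R}_i^\top L_i$, and the function names.

```python
import numpy as np

def conjugate_gradient(apply_A, b, n_iter=500, tol=1e-8):
    """Solve A x = b for a Hermitian positive-definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - A x for the start x = 0
    p = r.copy()                      # search direction
    rs_old = np.vdot(r, r).real
    for _ in range(n_iter):
        if np.sqrt(rs_old) < tol:     # stop once the residual norm is tiny
            break
        Ap = apply_A(p)
        alpha = rs_old / np.vdot(p, Ap).real
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.vdot(r, r).real
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy stand-ins (all hypothetical): a small random Phi, a diagonal W playing the
# role of sum_i R_i^T R_i, and b_patch playing the role of sum_i R_i^T L_i.
rng = np.random.default_rng(0)
N, M, eta = 64, 24, 0.5
Phi = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
W = rng.integers(1, 5, size=N).astype(float)
b_patch = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)

def apply_A(v):
    # (Phi^H Phi + eta * W) v, computed without forming the N x N matrix
    return Phi.conj().T @ (Phi @ v) + eta * W * v

rhs = Phi.conj().T @ y + eta * b_patch
x_cg = conjugate_gradient(apply_A, rhs)
```

In a real reconstruction `apply_A` would wrap the measurement operator and the patch-count image; the point of the matrix-free form is that only products with $\Phi$ and $\Phi^H$ are needed.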


When the measurement matrix $\Phi$ is a partial Fourier transform matrix, which has direct connections with MRI, we can derive a much faster algorithm to solve Eq. (7.40) by using the alternating direction method of multipliers (ADMM) [54, 65]. The advantage of ADMM is that we can split Eq. (7.40) into two sub-problems, both admitting closed-form solutions. By applying ADMM to Eq. (7.40), we obtain

$$(x, z, \mu) = \arg\min_{x} \|y - \Phi x\|_2^2 + \beta \Big\|x - z + \frac{\mu}{2\beta}\Big\|_2^2 + \eta \sum_i \|\tilde{R}_i z - L_i\|_F^2, \tag{7.42}$$

where $z \in \mathbb{C}^N$ is an auxiliary variable, $\mu \in \mathbb{C}^N$ is the Lagrangian multiplier, and $\beta$ is a positive constant. The optimization of Eq. (7.42) consists of the following iterations:

$$\begin{aligned}
z^{(l+1)} &= \arg\min_{z} \beta^{(l)} \Big\|x^{(l)} - z + \frac{\mu^{(l)}}{2\beta^{(l)}}\Big\|_2^2 + \eta \sum_i \|\tilde{R}_i z - L_i\|_F^2, \\
x^{(l+1)} &= \arg\min_{x} \|y - \Phi x\|_2^2 + \beta^{(l)} \Big\|x - z^{(l+1)} + \frac{\mu^{(l)}}{2\beta^{(l)}}\Big\|_2^2, \\
\mu^{(l+1)} &= \mu^{(l)} + \beta^{(l)} (x^{(l+1)} - z^{(l+1)}), \\
\beta^{(l+1)} &= \rho \beta^{(l)},
\end{aligned} \tag{7.43}$$

where $\rho > 1$ is a constant. For fixed $x^{(l)}$, $\mu^{(l)}$ and $\beta^{(l)}$, $z^{(l+1)}$ admits the following closed-form solution:

$$z^{(l+1)} = \Big(\eta \sum_i \tilde{R}_i^\top \tilde{R}_i + \beta^{(l)} I\Big)^{-1} \Big(\beta^{(l)} x^{(l)} + \frac{\mu^{(l)}}{2} + \eta \sum_i \tilde{R}_i^\top L_i\Big). \tag{7.44}$$
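A minimal sketch of the update in Eq. (7.44) for a 1D signal may help fix ideas. It relies on $\sum_i \tilde{R}_i^\top \tilde{R}_i$ being diagonal (a per-pixel count of covering patches), so the matrix inverse becomes an element-wise division. The 1D setting, the flattened list of patch start positions `starts`, and the function name `z_update` are assumptions introduced for this example only.

```python
import numpy as np

def z_update(x, mu, beta, eta, starts, patches):
    """Element-wise form of Eq. (7.44) for a 1D signal.

    starts  : start index of every extracted patch (all groups flattened)
    patches : array of shape (len(starts), p) with the low-rank patch estimates
    """
    p = patches.shape[1]
    counts = np.zeros(x.shape[0])                                  # diagonal of sum R^T R
    accum = np.zeros(x.shape[0], dtype=np.result_type(x, patches)) # sum R^T L
    for s, patch in zip(starts, patches):
        counts[s:s + p] += 1.0        # one more patch covers these pixels
        accum[s:s + p] += patch       # put the patch back at its location
    return (beta * x + mu / 2 + eta * accum) / (eta * counts + beta)

# Hypothetical toy data: 12 samples, five overlapping patches of length 4.
rng = np.random.default_rng(1)
N, p = 12, 4
starts = [0, 2, 4, 8, 3]
patches = rng.standard_normal((len(starts), p))
x = rng.standard_normal(N)
mu = rng.standard_normal(N)
z = z_update(x, mu, beta=0.7, eta=1.3, starts=starts, patches=patches)
```

Pixels that no patch covers simply receive $x + \mu/(2\beta)$, since their count is zero.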

Note that the term $\sum_i \tilde{R}_i^\top \tilde{R}_i$ is a diagonal matrix. Each entry on its diagonal corresponds to an image pixel location, and its value is the number of overlapping patches that cover that pixel location. The term $\sum_i \tilde{R}_i^\top L_i$ denotes the patch averaging result, i.e., averaging all of the collected similar patches for each exemplar patch. Therefore, Eq. (7.44) can easily be computed in one step. For fixed $z^{(l+1)}$, $\mu^{(l)}$ and $\beta^{(l)}$, the $x$-subproblem can be solved by computing

$$(\Phi^H \Phi + \beta^{(l)} I) x = \Phi^H y + \beta^{(l)} z^{(l+1)} - \frac{\mu^{(l)}}{2}. \tag{7.45}$$

When $\Phi$ is a partial Fourier transform matrix $\Phi = DF$, where $D$ and $F$ denote the down-sampling matrix and the Fourier transform matrix, respectively, Eq. (7.45) can be easily solved by transforming the problem from image space to Fourier space. By substituting $\Phi = DF$ into Eq. (7.45) and applying the Fourier transform to each side of Eq. (7.45), we obtain

$$F\big((DF)^H DF + \beta^{(l)} I\big) F^H F x = F (DF)^H y + F\Big(\beta^{(l)} z^{(l+1)} - \frac{\mu^{(l)}}{2}\Big), \tag{7.46}$$

which, since $F$ is unitary ($F F^H = I$), can be further simplified into

$$F x = \big(D^\top D + \beta^{(l)} I\big)^{-1} \Big(D^\top y + F\Big(\beta^{(l)} z^{(l+1)} - \frac{\mu^{(l)}}{2}\Big)\Big). \tag{7.47}$$

Then $x^{(l+1)}$ can be computed by applying the inverse Fourier transform to the right-hand side of Eq. (7.47), which yields Eq. (7.48).


Algorithm 2: CS via Nonlocal Low-Rank (NLR) Regularization

• Initialization:
  (a) Estimate an initial image $\hat{x}$ using a standard CS recovery method (e.g., a DCT/wavelet-based recovery method);
  (b) Set parameters $\lambda$, $\eta$, $\tau = \lambda/(2\eta)$, $\beta$, $K$, $J$, and $L$;
  (c) Initialize $w_i = [1, 1, \ldots, 1]^\top$, $x^{(1)} = \hat{x}$, $\mu^{(1)} = 0$;
  (d) Group a set of similar patches $G_i$ for each exemplar patch using $x^{(1)}$;
• Outer loop: for $k = 1, 2, \ldots, K$ do
  (a) Patch dataset $X_i$ construction: group a set of similar patches for each exemplar patch $x_i$ using $x^{(k)}$;
  (b) Set $L_i^{(0)} = X_i$;
  (c) Inner loop (low-rank approximation, solving Eq. (7.33)): for $j = 1, 2, \ldots, J$ do
      (I) If $k > K_0$, update the weights $w_i = 1/(\sigma(L_i^{(j-1)}) + \varepsilon)$;
      (II) Singular value thresholding via Eq. (7.39): $L_i^{(j)} = S_{w_i,\tau}(X_i)$;
      (III) Output $L_i = L_i^{(j)}$ if $j = J$.
      End for
  (d) Inner loop (solving Eq. (7.40)): for $l = 1, 2, \ldots, L$ do
      (I) Compute $z^{(l+1)}$ via Eq. (7.44);
      (II) Compute $x^{(l+1)}$ via Eq. (7.48);
      (III) $\mu^{(l+1)} = \mu^{(l)} + \beta^{(l)} (x^{(l+1)} - z^{(l+1)})$;
      (IV) $\beta^{(l+1)} = \rho \beta^{(l)}$;
      (V) Output $x^{(k)} = x^{(l+1)}$ if $l = L$.
      End for
  (e) If $\mathrm{mod}(k, T) = 0$, update the patch grouping;
  (f) Output the reconstructed image $\hat{x} = x^{(k)}$ if $k = K$.
  End for

$$x^{(l+1)} = F^H \Big\{ \big(D^\top D + \beta^{(l)} I\big)^{-1} \Big(D^\top y + F\Big(\beta^{(l)} z^{(l+1)} - \frac{\mu^{(l)}}{2}\Big)\Big) \Big\}. \tag{7.48}$$
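The closed form in Eq. (7.48) can be sketched in a few lines, under the assumption that $F$ is the unitary 2D DFT (NumPy's `norm="ortho"`) and that $D$ simply selects k-space locations, so that $D^\top D$ is a 0/1 mask and $D^\top y$ is the zero-filled k-space data. The function name, the mask density, and the toy image below are illustrative choices, not part of the chapter.

```python
import numpy as np

def x_update_fourier(y_zero_filled, mask, z, mu, beta):
    """Eq. (7.48) for Phi = D F with a unitary 2D FFT.

    y_zero_filled : D^T y, the measured k-space samples placed on the full grid
    mask          : diagonal of D^T D (1 where k-space is sampled, 0 elsewhere)
    """
    rhs = y_zero_filled + np.fft.fft2(beta * z - mu / 2, norm="ortho")
    # (D^T D + beta I) is diagonal in k-space, so the inverse is a division.
    return np.fft.ifft2(rhs / (mask + beta), norm="ortho")

# Hypothetical toy problem on an 8 x 8 complex image.
rng = np.random.default_rng(0)
shape = (8, 8)
mask = (rng.random(shape) < 0.3).astype(float)
truth = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
y_zf = mask * np.fft.fft2(truth, norm="ortho")
z = rng.standard_normal(shape) + 0j
mu = rng.standard_normal(shape) + 0j
beta = 0.1
x_new = x_update_fourier(y_zf, mask, z, mu, beta)
```

The cost is two FFTs per update, which is why this case is much cheaper than the general CG solve.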

With updated $x$ and $z$, $\mu$ and $\beta$ can be readily computed according to Eq. (7.43). After obtaining an improved estimate of the unknown image, the low-rank matrices $L_i$ can be updated by Eq. (7.39). The updated $L_i$ is then used to improve the estimate of $x$ by solving Eq. (7.40). This process is iterated until convergence. The overall procedure is summarized in Algorithm 2.
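The low-rank update of Eq. (7.39), as used in step (c) of Algorithm 2, can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it ignores the $K_0$ gating of the weight update, uses NumPy's SVD (which returns $V^H$, the appropriate factor for complex data), and the function names, the value of `eps`, and the toy matrix are assumptions.

```python
import numpy as np

def weighted_svt(X, w, tau):
    """S_{w,tau}(X) = U (Sigma - tau * diag(w))_+ V^H, cf. Eq. (7.38)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    s_thr = np.maximum(s - tau * w, 0.0)   # shrink each singular value by tau * w_j
    return (U * s_thr) @ Vh                # scale the columns of U, then recombine

def low_rank_update(X, tau, eps=1e-3, n_iter=3):
    """Reweighted thresholding: unit weights first, then w_j = 1 / (sigma_j + eps)."""
    w = np.ones(min(X.shape))
    L = X
    for _ in range(n_iter):
        L = weighted_svt(X, w, tau)
        sigma = np.linalg.svd(L, compute_uv=False)
        w = 1.0 / (sigma + eps)            # small singular values get large weights
    return L

# Hypothetical patch matrix: rank 2 plus a little noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((36, 2)) @ rng.standard_normal((2, 20))
X = A + 0.1 * rng.standard_normal(A.shape)
L = low_rank_update(X, tau=2.0)
```

After the first pass, singular values that were thresholded to zero receive weights of about $1/\varepsilon$ and therefore stay at zero, while the dominant ones are shrunk only slightly.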

7.4.4 Experimental Results and Discussions

In this section, we report our experimental results of CS on some complex-valued real-world MR images. Two sets of brain images of size 256 × 256, acquired on a 1.5T Philips Achieva system, are used in our experiments. The CS-MRI acquisition processes are simulated by sub-sampling the k-space data (2D-FFT of brain images). We have considered two sub-sampling strategies: random sub-sampling [24] and pseudo-radial sampling. Two test MR images and examples of the sub-sampling masks are shown in Fig. 7.4. We have compared the proposed method against the conventional CS-MRI method of [66] (denoted as SparseMRI), a TV-based method, the reweighted-TV (denoted as ReTV) method [49], and the baseline zero-filling reconstruction method.

In Figs. 7.5 and 7.6, we compare the MR magnitude images reconstructed by the tested CS-MRI recovery methods for variable-density 2D random sampling. In both figures, the sampling rate is 0.2N, i.e., 5-fold under-sampling of the k-space data. We can see that the SparseMRI method cannot preserve the sharp edges and fine image details. The ReTV method outperforms the SparseMRI method in terms of PSNR; however, visual artifacts can still be clearly observed. By contrast, the proposed NLR-CS-baseline preserves the edges and local structures better than the ReTV method, and the proposed NLR-CS further outperforms the NLR-CS-baseline (by 2.47 dB on the Head image and 0.95 dB on the Brain image). In Fig. 7.7 we plot the PSNR curves as a function of the sampling rate for both random and pseudo-radial sampling schemes. As can be seen from Fig. 7.7, the PSNR results of the proposed NLR-CS are much higher than those of the competing methods, especially at low sampling rates, which implies that the proposed NLR-CS can better remove artifacts and preserve important image structures at large speed-up factors.

Fig. 7.4 Sub-sampling masks and test MR images. (a) random sub-sampling mask; (b) pseudo-radial sub-sampling mask; (c) Head MR image; (d) Brain MR image

Fig. 7.5 Reconstructed Head MR images using 0.2N k-space samples (5-fold under-sampling, random sampling). (a) original MR image (magnitude); (b) SparseMRI method [66] (22.45 dB); (c) ReTV method [49] (27.31 dB); (d) proposed NLR-CS-baseline method (30.84 dB); (e) proposed NLR-CS method (33.31 dB)

Fig. 7.6 Reconstructed Brain MR images using 0.2N k-space samples (5-fold under-sampling, random sampling). (a) original MR image (magnitude); (b) SparseMRI method [66] (30.20 dB); (c) ReTV method [49] (33.49 dB); (d) proposed NLR-CS-baseline (35.39 dB); (e) proposed NLR-CS method (36.34 dB)

Fig. 7.7 PSNR results as the sampling rate varies, for the Head and Brain images (vertical axis: PSNR in dB; horizontal axis: sampling rate; curves: Zero-filling, SparseMRI, TV, ReTV, NLR-CS-baseline, NLR-CS). (a)–(b) Cartesian random sampling; (c)–(d) pseudo-radial sampling

7.5 Conclusions

In this chapter, we have revisited several connections between image restoration and convex optimization, including sparse coding, compressed sensing and low-rank methods. It is shown that the Bayesian formulation of image restoration problems is equivalent to the regularized formulation despite the apparent difference in the choice of mathematical languages. We have used two specific applications, namely image denoising and compressed sensing, to demonstrate how simultaneous sparse coding and nonlocal regularization both admit a nonconvex optimization-based formulation, which can lead to novel insights into why BM3D and BM3D-CS achieve excellent performance. The nonconvexity arises from the unknown clustering relationship of similar patches in natural images. In both scenarios, we have shown how a nonconvex optimization problem, though its global minimum is difficult to obtain, can be approximately solved by the method of alternating minimization. Our experimental results have shown improved denoising and CS performance over the current state of the art.

References

1. Buades A, Coll B, Morel J-M (2005) A non-local algorithm for image denoising. CVPR 2:60–65
2. Foi A, Katkovnik V, Egiazarian K (2007) Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Trans Image Process 16:1395–1411
3. Dong W, Li X, Zhang L, Shi G (2011) Sparsity-based image denoising via dictionary learning and structural clustering. IEEE conference on computer vision and pattern recognition, pp 457–464
4. Rudin L, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D 60:259–268
5. El-Fallah AI, Ford GE (1997) Mean curvature evolution and surface area scaling in image filtering. IEEE Trans Image Process 6:750–753
6. Biemond J, Lagendijk R, Mersereau R (1990) Iterative methods for image deblurring. Proc IEEE 78(5):856–883
7. Bertsekas DP (2014) Constrained optimization and Lagrange multiplier methods. Academic Press, Boston
8. Tikhonov A, Arsenin V (1977) Solutions of ill-posed problems. Wiley, New York
9. Chan T, Vese L (2001) Active contours without edges 10:266–277
10. Chan TF, Shen J (2001) Nontexture inpainting by curvature-driven diffusions. J Vis Comm Image Rep 12(4):436–449
11. Aubert G, Vese L (1997) A variational method in image recovery. SIAM J Numer Anal 34(5):1948–1979
12. Wang Y, Yang J, Yin W, Zhang Y (2008) A new alternating minimization algorithm for total variation image reconstruction. SIAM J Imag Sci 1(3):248–272
13. Beck A, Teboulle M (2009) Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans Image Process 18(11):2419–2434
14. Chambolle A (2004) An algorithm for total variation minimization and applications. J Math Imaging Vis 20(1):89–97
15. Perona P, Malik J (1990) Scale space and edge detection using anisotropic diffusion. IEEE Trans Pattern Anal Mach Intell 12(7):629–639
16. Sapiro G (2001) Geometric partial differential equations and image analysis. Cambridge University Press, New York
17. Candès EJ, Romberg JK, Tao T (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory 52(2):489–509
18. Boyd SP, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York
19. Zibulevsky M, Elad M (2010) L1-L2 optimization in signal and image processing. IEEE Signal Process Mag 27:76–88
20. Yang A, Sastry S, Ganesh A, Ma Y (2010) Fast l1-minimization algorithms and an application in robust face recognition: a review. In: 2010 17th IEEE International Conference on Image Processing (ICIP), pp 1849–1852
21. Goldstein T, Osher S (2009) The split Bregman method for L1 regularized problems. SIAM J Imag Sci 2(2):323–343
22. Blake A, Zisserman A (1987) Visual reconstruction. MIT Press, Cambridge, MA, USA
23. Chartrand R (2007) Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process Lett 14:707–710
24. Trzasko J, Manduca A (2009) Highly undersampled magnetic resonance image reconstruction via homotopic l0-minimization. IEEE Trans Med Imaging 28(1):106–121
25. Wiener N (1949) Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. J Am Stat Assoc 47:258
26. Srivastava A, Lee A, Simoncelli E, Zhu S (2003) On advances in statistical modeling of natural images. J Math Imaging Vis 18:17–33
27. Nahi N (1972) Role of recursive estimation in statistical image enhancement. Proc IEEE 60(7):872–877
28. Lev A, Zucker SW, Rosenfeld A (1977) Iterative enhancement of noisy images. IEEE Trans Syst Man Cybern 7(6):435–442
29. Lee J-S (1980) Digital image enhancement and noise filtering by use of local statistics. IEEE Trans Pattern Anal Mach Intell 2:165–168
30. Carlson CR, Adelson EH, Anderson CH et al (1985) Improved system for coring an image-representing signal. WO Patent 1,985,002,081
31. Carlson CR, Adelson EH, Anderson CH (1985) System for coring an image-representing signal. US Patent 4,523,230
32. Donoho D, Johnstone I (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81:425–455
33. Simoncelli EP, Adelson EH (1996) Noise removal via Bayesian wavelet coring. In: IEEE international conference on image processing, pp 379–382
34. Kozintsev I, Mihcak MK, Ramchandran K (1999) Local statistical modeling of wavelet image coefficients and its application to denoising. In: IEEE international conference on acoustics, speech, and signal processing, pp 3253–3256
35. Li X, Orchard M (2000) Spatially adaptive image denoising under overcomplete expansion. In: IEEE international conference on image processing, pp 300–303
36. Chang S, Yu B, Vetterli M (2000) Spatially adaptive wavelet thresholding with context modeling for image denoising. IEEE Trans Image Process 9(9):1522–1531
37. Portilla J, Strela V, Wainwright M, Simoncelli E (2003) Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans Image Process 12:1338–1351
38. Andrews DF, Mallows CL (1974) Scale mixtures of normal distributions. J R Stat Soc Ser B (Methodological) 36(1):99–102
39. Dabov K, Foi A, Katkovnik V, Egiazarian K (2007) Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Trans Image Process 16:2080–2095
40. Dabov K, Foi A, Katkovnik V, Egiazarian K (2006) Image denoising with block-matching and 3D filtering. In: Proc SPIE Electronic Imaging: Algorithms and Systems V, vol 6064. San Jose, CA, USA
41. Zhu S, Ma K-K (2000) New diamond search algorithm for fast block-matching motion estimation. IEEE Trans Image Process 9(2):287–290
42. Daubechies I (1996) Where do wavelets come from? A personal point of view. Proc IEEE 84(4):510–513
43. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2009) Non-local sparse models for image restoration. In: 2009 IEEE 12th International Conference on Computer Vision, pp 2272–2279
44. Dong W, Shi G, Li X (2013) Nonlocal image restoration with bilateral variance estimation: a low-rank approach. IEEE Trans Image Process 22(2):700–711
45. Dong W, Zhang L, Shi G, Li X (2013) Nonlocally centralized sparse representation for image restoration. IEEE Trans Image Process 22(4):1620–1630
46. Galatsanos N, Katsaggelos A (1992) Methods for choosing the regularization parameter and estimating the noise variance in image restoration and their relation. IEEE Trans Image Process 1:322–336
47. Lyu S, Simoncelli E (2009) Modeling multiscale subbands of photographic images with fields of Gaussian scale mixtures. IEEE Trans Pattern Anal Mach Intell 31(4):693–706
48. Box GE, Tiao GC (2011) Bayesian inference in statistical analysis, vol 40. Wiley, New York
49. Candes E, Wakin M, Boyd S (2008) Enhancing sparsity by reweighted l1 minimization. J Fourier Anal Appl 14(5):877–905
50. Dabov K, Foi A, Katkovnik V, Egiazarian K (2009) BM3D image denoising with shape-adaptive principal component analysis. Proceedings on SPARS'09, Signal Processing with Adaptive Sparse Structured Representations, p 6
51. Donoho D (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
52. Daubechies I, Defrise M, De Mol C (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm Pure Appl Math 57(11):1413–1457
53. Zhang X, Burger M, Osher S (2011) A unified primal-dual algorithm framework based on Bregman iteration. J Sci Comput 46(1):20–46
54. Lin Z, Chen M, Ma Y (2009) The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Coordinated Science Laboratory Report no. UILU-ENG-09-2215
55. Rose K, Gurewitz E, Fox G (1990) A deterministic annealing approach to clustering. Pattern Recogn Lett 11(9):589–594
56. Rose K (1998) Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc IEEE 86:2210–2239
57. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
58. Venkatakrishnan SV, Bouman CA, Wohlberg B (2013) Plug-and-play priors for model based reconstruction. In: IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp 945–948
59. Egiazarian K, Foi A, Katkovnik V (2007) Compressed sensing image reconstruction via recursive spatially adaptive filtering. In: IEEE international conference on image processing, vol 1, San Antonio, TX, USA
60. Candes E, Plan Y (2010) Matrix completion with noise. Proc IEEE 98(6):925–936
61. Cai J, Candes E, Shen Z (2010) A singular value thresholding algorithm for matrix completion. SIAM J Optim 20:1956
62. Candes E, Li X, Ma Y, Wright J (2011) Robust principal component analysis. J ACM 58(3):11
63. Dong W, Li X, Shi G, Ma Y, Huang F (2014) Compressive sensing via nonlocal low-rank regularization. IEEE Trans Image Process 23(8):3618–3632
64. Fazel M, Hindi H, Boyd SP (2003) Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In: IEEE Proceedings on the American Control Conference 3:2156–2162
65. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122
66. Lustig M, Donoho D, Pauly J (2007) Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Reson Med 58(6):1182–1195

8 Sparsity Constrained Estimation in Image Processing and Computer Vision

Vishal Monga, Hojjat Seyed Mousavi, and Umamahesh Srinivas

V. Monga • H.S. Mousavi: The Pennsylvania State University, University Park, PA 16802, USA
U. Srinivas: Apple Inc., Cupertino, CA 95014, USA

8.1 Introduction

8.1.1 Estimation Problems in Image Processing and Computer Vision

Several real-world signal and image processing problems can be linked by a common underlying theme, that of signal estimation. As a typical scenario, we have access to some observed data, acquired via an imaging sensor for example. The goal is to determine an unknown signal/image from these observations, the challenge being that the observations could either be incomplete or corrupted by noise and other signal distortions. The Bayesian strategy of leveraging contextual awareness, either as prior beliefs about the signal/parameters being estimated or by modeling the inherent structure of the problem, has emerged as an intuitive way of recovering the unknown signal from uncertain observations. Statistical signal processing theory is replete with Bayesian formulations of estimation problems. To state this formally, suppose we observe a signal $y$, which is a noisy version of the desired signal $x$. A widely used estimator, known as the maximum a posteriori (MAP) estimator, determines the estimate $\hat{x}$ that maximizes the posterior probability:

$$\hat{x} = \arg\max_{x} P(x|y) = \arg\max_{x} P(y|x) P(x). \tag{8.1}$$

It is important to note that awareness of context is captured by the $P(x)$ term in (8.1). Equivalently, the prior $P(x)$ may also be interpreted as a regularizer to smooth the optimal solution [1]. It is well-known that enforcing a Gaussian prior leads to the $\ell_2$ regularized problem. Similarly, the choice of a Laplacian prior leads to the $\ell_1$ regularizer, which is used extensively to enforce, a priori, a notion of sparsity in signal processing problems. In fact, significant research effort over the past two decades has been dedicated to the problem of sparse signal recovery, wherein the sparsity of information (or, seen differently, redundancy) inherent in signal and image content has been exploited as prior information [2]. It is appropriate at this juncture to trace a brief history of estimation problems in image processing before looking ahead to the main contributions in this chapter.
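Returning briefly to the correspondence between priors and regularizers stated above, a short derivation makes it explicit. It rests on an assumption that the text does not spell out, namely that $y = x + n$ with i.i.d. Gaussian noise of variance $\sigma^2$; the symbols $\sigma$, $\sigma_x$, $b$ and $\lambda$ are introduced here purely for illustration. Taking the negative logarithm of (8.1) gives

$$\hat{x} = \arg\min_{x} \big\{ -\log P(y|x) - \log P(x) \big\} = \arg\min_{x} \Big\{ \frac{1}{2\sigma^2} \|y - x\|_2^2 - \log P(x) \Big\}.$$

For a Gaussian prior $P(x) \propto \exp(-\|x\|_2^2 / (2\sigma_x^2))$ and a Laplacian prior $P(x) \propto \exp(-\|x\|_1 / b)$, multiplying the objective by $2\sigma^2$ yields, respectively,

$$\hat{x} = \arg\min_{x} \|y - x\|_2^2 + \lambda \|x\|_2^2, \quad \lambda = \frac{\sigma^2}{\sigma_x^2}, \qquad \text{and} \qquad \hat{x} = \arg\min_{x} \|y - x\|_2^2 + \lambda \|x\|_1, \quad \lambda = \frac{2\sigma^2}{b}.$$

In both cases the regularization weight grows with the noise level and shrinks as the prior becomes broader.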

© Springer International Publishing AG 2017 V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_8


Image Denoising: Arguably one of the most important classical inverse problems in imaging is image denoising. It refers to the process of recovering a clean, noise-free image from noisy observations. Images acquired by imaging sensors are generally contaminated by noise. Imperfect instruments and problems with the data acquisition process can degrade the quality of the acquired data. Furthermore, transmission errors and compression can also introduce noise in signals/images. Thus, denoising is often a crucial pre-processing step before further analysis of images. In image denoising, wavelet-based approaches have demonstrated superior performance due to properties such as sparsity and multi-resolution structure. In 1995, Donoho proposed a simple soft thresholding technique [3]. Different ways to compute the parameters for the thresholding of wavelet coefficients have since been proposed. Data adaptive thresholds [4] were introduced by Fodor et al. for optimum threshold selection. Subsequently, translation invariant methods [5] substantially improved the perceptual quality of images. Recently, probabilistic models using the statistical properties of the images or their wavelet coefficients have gained more attention, and much effort has been devoted to Bayesian denoising in the wavelet domain [6, 7]. Tree structured and hidden Markov models have also been investigated in image denoising [8]. Ongoing research continues to focus on using various statistical models to model the statistical behavior of the image data. For instance, sparsifying priors and learning methods were applied in image denoising [9, 10] and claimed to outperform competing techniques.

Image Inpainting: Another classical estimation problem in image processing is inpainting, derived from the ancient art of restoring images by professional image restorers. Digital inpainting is the process of restoring lost or damaged parts of an image and reconstructing them based on the background information. This recovery of missing information in images has many applications, ranging from image coding to wireless image transmission, special effects and image restoration. The basic idea behind inpainting algorithms is to fill in lost regions with available information from their local neighborhood. Image inpainting is an ill-posed problem. It is therefore necessary to introduce image priors [11] in order to condition the problem better and develop tractable solutions. Diffusion-based methods for inpainting introduce smoothness priors via parametric models or partial differential equations (PDEs) to diffuse (or propagate) local structures from the exterior to the interior of the missing parts. These methods are generally suitable for completing and inpainting small regions, but are not well suited for restoring the texture of large areas [12, 13]. A second family of inpainting methods, known as exemplar-based methods, exploits image statistics and self-similarity priors [14] to recover the texture of the missing area. Unsurprisingly, with the advent of sparse representation theory, sparsifying priors have also been exploited for solving the ill-posed inpainting problem. In this case, the image is assumed to be sparse in a specific basis domain such as the discrete cosine transform (DCT). It is assumed that the image patches share the same sparse representations in both known and unknown regions. Adaptive sparse representations are used in [15] by assuming locally uniform regions present in the image. Exploiting both local and nonlocal sparsity is considered in [16], in combination with Bayesian model averaging. As a logical extension, hybrid solutions using diffusion, exemplar or sparsity methods have also emerged [17].

Image Super-resolution: The limitations of digital imaging sensors pose challenges in capturing images at a desired high quality and resolution, which is an important requirement in many practical applications such as medical imaging, computer vision, video surveillance, etc. This has necessitated solutions to the problem of image super-resolution, the process of enhancing the spatial resolution of an image from its low resolution observation(s). The key idea is to capture diversity in low-resolution captures [18–21]. Super-resolution is fundamentally an ill-posed problem, since multiple high resolution images can be mapped to the same low resolution image. Hence, prior information is necessary to yield realistic and robust solutions. This information can be captured using prior knowledge of the scene, statistical distribution of pixels, smoothness and edge information, etc. [22–24]. Of late, sparsity-based methods have been proposed for single image super-resolution. Yang et al. proposed the application of a sparsity-inducing prior and sparse coding to recover the high resolution image [25–27]. In this framework, a large collection of image patches is obtained, and high and low resolution dictionaries are jointly learned under sparsity constraints so that the same sparse code can represent both the low and high resolution image patches. Zeyde et al. extended this method to develop a local Sparse-Land model on image patches [28]. Timofte et al. recently proposed the anchored neighborhood regression method, which supports the use of learned dictionaries in conjunction with neighbor embedding methods [29, 30].

Image Classification: Traditionally, image classification is not viewed as an estimation problem. However, when formulated as the recovery of class label information (the unknown desired "signal") from a given image, the connection to estimation appears much less tenuous. Further, a simple extension of the idea of priors to per-class priors, with separately learned parameters, encourages the development of Bayesian-type strategies for classification. A seminal recent contribution to pattern classification by Wright et al. [31] has introduced the idea of using the sparse recovery framework for classification by an ingenious choice of dictionaries. We will elaborate on this topic in Sect. 8.3. Needless to say, several variants of priors have since been proposed to capture the class-specific sparse behavior of the recovered signals.

It must be noted that the appropriate selection of priors and the tractability of the resulting optimization problem are both crucial for the development of Bayesian approaches to signal/image estimation and their effective application to practical problems. In this chapter, our goal is to introduce the reader to formulations of, and solutions to, novel flavors of estimation problems, which enforce sparsity-inducing priors in an optimization framework. The framework is developed in all generality for signal estimation, and we demonstrate customized applications to popular image processing problems.

8.1.2 Sparsity Constrained Optimization: Overview

The emerging significance of sparse representations in signal and image processing problems is evident from their prevalence in approaches to the popular applications discussed above as well as several other problems. We devote this section to a brief overview of sparse signal representation theory from an optimization perspective. The value of parsimony in signal representation has been recognized for a long time now. It is well-known that a large class of signals, including audio and images, can be expressed naturally in a compact manner with respect to a well-chosen basis. Among the most widely applicable of such bases are the Fourier and wavelet bases. In fact, this idea has been leveraged successfully in commercial signal compression algorithms [32]. Evidence for the prevalence of sparsity in human perception [33] has motivated research on sparse coding for images for a variety of applications such as acquisition [34], compression [32], and modeling [35].

Typically, sparse models assume that a signal can be efficiently represented as a sparse linear combination of atoms in a given or learned dictionary [31, 36, 37]. Sparsity in signals enables us to develop efficient algorithms for extracting information from data and is often a natural prior assumption in inverse problems. Sparse reconstruction or representation has spawned a variety of applications in image/signal classification [31, 38, 39], dictionary learning [40–42], signal recovery [43–45], image denoising and inpainting [46] and medical imaging [47, 48]. Over the past decade, sparsity has emerged as a dominant theme in applications involving big data.


Compressive sensing (CS) [34] can be viewed as a formalization of the quest for economical signal acquisition and representation. The central problem in CS is to recover a signal $x \in \mathbb{R}^n$ given a vector of linear measurements $y \in \mathbb{R}^m$ ($m \ll n$) of the form

$$ y = Ax + n, \tag{8.2} $$

where $A \in \mathbb{R}^{m \times n}$ is the measurement matrix (or dictionary) and the vector $n \in \mathbb{R}^m$ models the additive noise. Assuming $x$ is compressible, it can be recovered by solving the following combinatorial optimization problem [49]:

$$ (P_0) \qquad \min_{x} \|x\|_0 \quad \text{subject to} \quad y = Ax, \tag{8.3} $$

where $\|x\|_0$ is the $\ell_0$-"norm" that counts the number of non-zero (active) entries in $x$. This is an NP-hard non-convex optimization problem, and its solution requires an intractable combinatorial search in the solution space. A common alternative is to solve the following convex relaxation:

$$ (P_1) \qquad \min_{x} \|x\|_1 \quad \text{subject to} \quad y = Ax, \tag{8.4} $$

where $\|x\|_1 = \sum_{i} |x_i|$. Unlike $(P_0)$, this problem is convex, and there are well-known methods in statistics to solve it efficiently, e.g. the lasso [50, 51]. The programs $(P_0)$ and $(P_1)$ differ only in the choice of objective function, with the latter using the $\ell_1$-norm as a substitute for the $\ell_0$ sparsity measure. There are conditions guaranteeing a formal equivalence between the NP-hard problem $(P_0)$ and its relaxed version $(P_1)$ [52]. In recent years, many sparse recovery algorithms have been proposed to solve the convex problem $(P_1)$. The regularized version of $(P_1)$,

$$ \min_{x} \|y - Ax\|_2^2 + \lambda \|x\|_1, \tag{8.5} $$

or its variants have received the most attention in the literature. Moreover, there has recently been a lot of interest in sparsity-constrained or regularized problems of the following form:

$$ \min_{x} F(x) + G(x), \tag{8.6} $$

where $F(x)$ is a convex and smooth (differentiable) function of $x$, while $G(x)$ is a non-smooth and possibly non-convex function of $x$ that encourages sparsity in the signal. Many solutions have been proposed for the above sparsity-constrained optimization, such as iterative shrinkage methods [53], recovery by separable approximation [44], and re-weighted $\ell_1$ methods [54]. In statistics, generalizations of the lasso have been proposed for different problems. For example, Yuan and Lin [55] proposed the grouped lasso framework. Zou and Hastie proposed the elastic net, which has an additional $\ell_2$-norm smoothing regularizer [56]. Zou et al. introduced the adaptive or weighted lasso framework [57] and the adaptive elastic net [58]. Other flavors of (8.5) and corresponding solutions are provided in [59–64]. Sparsity-constrained optimization has received considerable attention in the signal and image processing community as well. Greedy algorithms or adaptive methods have been proposed for sparsity-promoting optimization problems involving different regularizers, such as the $\ell_1$-norm and the $\ell_0$ pseudo-norm [45, 65, 66], Bayesian-based methods [67–70], or general sparse approximation algorithms such as SpaRSA, ADMM, and FISTA [44, 53, 71, 72].
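As an illustration of the iterative shrinkage idea applied to the regularized problem (8.5), the following is a minimal sketch rather than the implementation of any cited method. The function names `soft_threshold`, `ista` and `lasso_cost`, the fixed step size, and the iteration count are assumptions made for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_cost(A, y, x, lam):
    """Objective of (8.5): ||y - A x||_2^2 + lam * ||x||_1."""
    return np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x))

def ista(A, y, lam, n_iter=500):
    """Iterative shrinkage sketch for (8.5), started from the zero vector."""
    # Lipschitz constant of the gradient of the smooth term ||y - A x||_2^2.
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)                     # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)          # gradient step, then shrinkage
    return x
```

With the step size fixed at the reciprocal of the Lipschitz constant, each iteration does not increase the objective, so the returned cost is never worse than that of the all-zero starting point.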

8.1.3

Chapter Overview

In the remainder of this chapter, we formulate and solve new flavors of sparsity-constrained optimization problems built on the family of spike-and-slab priors. First, we provide a Bayesian setup for sparse recovery using spike-and-slab priors and give a probabilistic interpretation to a well-known sparsity-constrained optimization problem. This optimization problem is generally very hard to solve, and existing solutions rely on simplification, relaxation, etc. In this chapter, however, we provide an iterative solution that addresses the original non-convex problem through a sequence of convex refinements. We provide theoretical support for this method and show its effectiveness in a range of applications in image and signal processing. We conclude the chapter by providing experimental results for sparse recovery, image reconstruction, image classification and object categorization. Specifically for image classification, we introduce the notion of class-specific priors together with the class-specific dictionary $A$ already proposed in [31]. This leads to class-specific optimization problems that offer better discriminability. This formulation represents a consummation of ideas developed for model-based compressive sensing into a general framework for sparse model-based classification.

8.2

Sparsity Constrained Estimation Using Spike-and-Slab Priors

Sparse recovery problems can be categorized into two sets of approaches. The first category employs sparsity-inducing regularizers in conjunction with the reconstruction error term to obtain a sparse approximation of the signal. Examples include [31, 45, 66, 73]. The second category may be identified with model-based compressive sensing [74], wherein a set of priors is introduced on the sparse signal to capture both sparsity and structure [47, 68, 74–76]. Published work has demonstrated the value of adding structural constraints and prior information to sparse recovery frameworks, both for representation purposes and classification [37, 38, 77, 78]. Such information can be incorporated by using priors and structured variable selection [56, 79], exploiting group information [80, 81], or considering the simultaneous encoding of multiple signals [73, 82, 83]. In this section, we focus on sparse representation methods from a Bayesian perspective. We give a probabilistic interpretation to some of the existing sparse representation-based methods and provide a new methodology for capturing sparsity using sparsifying priors. In particular, we focus on the setup of [84], where a variant of the spike-and-slab prior is employed to encourage sparsity. The optimization problem resulting from this model is known to be a hard non-convex problem, whose existing solutions involve simplifying assumptions and/or relaxations [38, 47, 84]. In this chapter, we present an approach to directly solve the resulting optimization problem in its general form and without simplification or relaxation. This approach can be seen as a logical evolution of $\ell_1$-reweighted methods [43, 54] and has the following main characteristics:

• It is an Iterative Convex Refinement (ICR) approach for sparse signal recovery, which refines the solution at each iteration via convex optimization. Essentially, the sequence of solutions from these convex problems approaches a sub-optimal solution of the hard non-convex problem.
• ICR has two variants: an unconstrained version, and a version with a non-negativity constraint on sparse coefficients, which may be required in some real-world problems such as image recovery.

Later in this chapter, we will illustrate the effectiveness of this solution in varied applications such as image classification, signal reconstruction, and image recovery.

8.2.1

The Spike-and-Slab Prior

As mentioned earlier, the theme of this chapter is to develop optimization approaches around sparsity-inducing priors for estimation problems in image processing. Introducing priors for capturing sparsity is a particular example of the Bayesian narrative, wherein signal recovery can be enhanced by exploiting contextual and prior information. As suggested by [85, 86], sparsity can be enforced via the following optimization problem:

$$ \max_{x} P_x(x) \quad \text{subject to} \quad \|y - Ax\|_2 < \epsilon, \tag{8.7} $$

where $P_x$ is the probability density function of $x$ that models sparsity. The most common example is the i.i.d. Laplacian prior, which is equivalent to $\ell_1$-norm minimization [76, 85].
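To make the Laplacian case concrete, the following short derivation sketches why maximizing an i.i.d. Laplacian prior amounts to $\ell_1$-norm minimization. The scale parameter $b > 0$ is a symbol introduced here for illustration only.

```latex
P_x(x) \;=\; \prod_{i=1}^{p} \frac{1}{2b}\exp\!\left(-\frac{|x_i|}{b}\right)
\quad\Longrightarrow\quad
-\log P_x(x) \;=\; p\,\log(2b) \;+\; \frac{1}{b}\,\|x\|_1 .
```

Since $p\log(2b)$ is a constant and $1/b > 0$, maximizing $P_x(x)$ over the feasible set of (8.7) is the same as minimizing $\|x\|_1$ over that set.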


The choice of sparsity-promoting priors that can capture the joint distribution of coefficients (both structure and sparsity) is a challenging task. Examples of such priors are the Laplacian [76], generalized Pareto [87], and spike-and-slab [88] priors. Amongst these, the spike-and-slab prior is particularly well-suited to promote sparsity. It is widely used in sparse recovery and Bayesian inference for variable selection and regression [47, 75, 77, 89–92]. In fact, the spike-and-slab prior is acknowledged to be the gold standard for sparse inference in the Bayesian set-up [92]. It is characterized by modeling each coefficient $x_i$ as a mixture of two densities as follows:

$$ x_i \sim (1 - w_i)\,\delta_0 + w_i\,P_i(x_i), \tag{8.8} $$

where $\delta_0$ is the Dirac function at zero (the spike), and $P_i$ (the slab) is an appropriate prior distribution for nonzero values of $x_i$ (e.g. Gaussian). The weight $w_i \in [0, 1]$ controls the structural sparsity of the signal. If $w_i$ is chosen close to zero, $x_i$ tends to remain zero. Conversely, when $w_i$ is chosen close to 1, $P_i$ is the dominant distribution, encouraging $x_i$ to take a non-zero value.
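The mixture in (8.8) can be illustrated by drawing samples from it. The sketch below assumes a zero-mean Gaussian slab with standard deviation `slab_std`; the function name `sample_spike_and_slab` and its parameters are hypothetical and chosen only for this example.

```python
import numpy as np

def sample_spike_and_slab(w, slab_std=1.0, rng=None):
    """Draw one sample x with x_i ~ (1 - w_i) * delta_0 + w_i * N(0, slab_std^2).

    Returns the sample and the boolean activity pattern (True where the slab was used).
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    active = rng.random(w.shape) < w               # slab chosen with probability w_i
    slab = rng.normal(0.0, slab_std, w.shape)      # Gaussian slab draws (assumed slab)
    return np.where(active, slab, 0.0), active

# Small weights give mostly zeros, large weights give mostly non-zeros.
x, active = sample_spike_and_slab(np.array([0.05, 0.05, 0.95, 0.95]),
                                  rng=np.random.default_rng(0))
```

Setting every $w_i$ to 0 returns the all-zero vector, while setting every $w_i$ to 1 reduces the prior to the slab alone.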

8.2.2

Bayesian Sparse Recovery Framework

Inference from the posterior density for the spike-and-slab model is ill-defined because the Dirac delta function is unbounded. Some ways to handle this issue include approximations [92], such as approximating the spike term with a very narrow Gaussian [89], or approximating the whole posterior with a product of Gaussian and Bernoulli density functions [47, 93–96]. Here, the focus is on the setup of Yen et al. [84], which is an approximate spike-and-slab prior for inducing sparsity on $x$. Inspired by Bayesian compressive sensing [68, 77], we employ a hierarchical Bayesian framework for signal recovery. The measurement vector $y$ is conditionally modeled as a Gaussian distribution based on the additive noise model ($y = Ax + n$), while the sparse vector $x$ is modeled by the spike-and-slab prior. More precisely, the Bayesian formulation is as follows:

$$ y \mid A, x, \gamma, \sigma^2 \sim \mathcal{N}(Ax, \sigma^2 I) \tag{8.9} $$

$$ x \mid \gamma, \lambda, \sigma^2 \sim \prod_{i=1}^{p} \left[ \gamma_i\, \mathcal{N}(0, \sigma^2 \lambda^{-1}) + (1 - \gamma_i)\, I(x_i = 0) \right] \tag{8.10} $$

$$ \gamma \mid \kappa \sim \prod_{i=1}^{p} \mathrm{Bernoulli}(\kappa_i) \tag{8.11} $$

where $\mathcal{N}(\cdot)$ represents the Gaussian distribution. Also note that in (8.10) each coefficient of $x$ is modeled on the framework proposed in [84], and $\gamma_i$ is the latent variable indicating whether $x_i$ is active (non-zero) or not. Since $\gamma_i$ is a binary variable, conditioned on $\gamma_i = 0$, $x_i$ is equal to 0 with probability one. On the other hand, conditioned on $\gamma_i = 1$, $x_i$ follows a normal distribution with mean 0 and variance $\sigma^2 \lambda^{-1}$. Motivated by the MAP estimation technique in [38, 84], the optimal $x, \gamma$ are obtained by the following MAP estimate:

$$ (x^*, \gamma^*) = \arg\max_{x, \gamma} \left\{ f(x, \gamma \mid A, y, \kappa, \lambda, \sigma^2) \right\}. \tag{8.12} $$

For the remainder of the analysis, without loss of generality we assume that $|y_i| \le 1$, $i = 1, \ldots, q$, $|x_i| \le 1$, $i = 1, \ldots, p$, and that the columns of $A$ have unit norm. We begin with the following lemma from [97], which essentially gives a Bayesian interpretation to a well-known sparsity-constrained optimization problem.

Lemma 1. Using the introduced Bayesian model, the MAP estimation in (8.12) is equivalent to the following minimization problem:

$$ (x^*, \gamma^*) = \arg\min_{x, \gamma} \; \|y - Ax\|_2^2 + \lambda \|x\|_2^2 + \sum_{i=1}^{p} \rho_i \gamma_i \tag{8.13} $$

where $\rho_i \triangleq \sigma^2 \log\left( \dfrac{2\pi\sigma^2 (1 - \kappa_i)^2}{\lambda \kappa_i^2} \right)$.

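Proof. Here, we provide a sketch of the proof; readers are encouraged to refer to [97] for additional details. To perform the MAP estimation, note that the posterior probability is given by:

$$ f(x, \gamma, \sigma^2 \mid A, y, \kappa, \lambda) \propto f(y \mid A, x, \gamma, \sigma^2)\, f(x \mid \gamma, \sigma^2, \lambda)\, f(\gamma \mid \kappa). \tag{8.14} $$

The optimal $x^*, \gamma^*$ are obtained by MAP estimation as:

$$ (x^*, \gamma^*) = \arg\min_{x, \gamma} \left\{ -2 \log f(x, \gamma, \sigma^2 \mid A, y, \kappa, \lambda) \right\}. \tag{8.15} $$

We now separately evaluate each term on the right-hand side of (8.14). According to (8.9) we have:

$$ f(y \mid A, x, \gamma, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{q/2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - Ax)^T (y - Ax) \right\} $$

$$ \Rightarrow\; -2 \log f(y \mid A, x, \gamma, \sigma^2) = q \log \sigma^2 + q \log(2\pi) + \frac{1}{\sigma^2} \|y - Ax\|_2^2. $$

Since $\gamma_i$ is the indicator variable and only takes the values 0 and 1, we can rewrite (8.10) in the following form:

$$ x \mid \gamma, \lambda, \sigma^2 \sim \prod_{i=1}^{p} \left( \mathcal{N}(0, \sigma^2 \lambda^{-1}) \right)^{\gamma_i} \left( I(x_i = 0) \right)^{1 - \gamma_i} $$

$$ \Rightarrow\; -2 \log f(x \mid \gamma, \sigma^2, \lambda) = \frac{\lambda \|x\|_2^2}{\sigma^2} + \log\frac{2\pi\sigma^2}{\lambda} \sum_{i=1}^{p} \gamma_i - 2 \sum_{i=1}^{p} (1 - \gamma_i) \log I(x_i = 0). $$

In fact, the final term on the right-hand side evaluates to zero, since $I(x_i = 0) = 1 \Rightarrow \log I(x_i = 0) = 0$, and $I(x_i = 0) = 0 \Rightarrow x_i \neq 0 \Rightarrow \gamma_i = 1 \Rightarrow (1 - \gamma_i) = 0$. Finally, (8.11) implies that

$$ f(\gamma \mid \kappa) = \prod_{i=1}^{p} \kappa_i^{\gamma_i} (1 - \kappa_i)^{1 - \gamma_i} $$

$$ \Rightarrow\; -2 \log f(\gamma \mid \kappa) = -2 \sum_{i=1}^{p} \left[ \log \kappa_i^{\gamma_i} + \log (1 - \kappa_i)^{1 - \gamma_i} \right] = \sum_{i=1}^{p} \gamma_i \log\left( \frac{1 - \kappa_i}{\kappa_i} \right)^{2} - 2 \sum_{i=1}^{p} \log(1 - \kappa_i). $$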


Plugging all these expressions back into (8.15), we obtain:

$$ (x^*, \gamma^*) = \arg\min_{x, \gamma} \left\{ q \log \sigma^2 + \frac{1}{\sigma^2} \|y - Ax\|_2^2 + \frac{\lambda \|x\|_2^2}{\sigma^2} + \log\frac{2\pi\sigma^2}{\lambda} \sum_{i=1}^{p} \gamma_i + \sum_{i=1}^{p} \gamma_i \log\left( \frac{1 - \kappa_i}{\kappa_i} \right)^{2} \right\} \tag{8.16} $$

Essentially, for fixed $\sigma^2$, the cost function reduces to:

$$ L(x, \gamma) = \|y - Ax\|_2^2 + \lambda \|x\|_2^2 + \sum_{i=1}^{p} \rho_i \gamma_i \tag{8.17} $$

where $\rho_i \triangleq \sigma^2 \log\left( \dfrac{2\pi\sigma^2 (1 - \kappa_i)^2}{\lambda \kappa_i^2} \right)$. ∎

Remark: Note that we are particularly interested in solving (8.13), which has broad applicability in recovery and regression [84], image classification and restoration [38, 98], and sparse coding [69, 99]. This is a non-convex mixed-integer program involving the binary indicator variable $\gamma$ and is not easily solvable using conventional optimization algorithms. It is worth mentioning that this is a more general formulation than the framework proposed in [38] or [84], where the authors simplified the optimization problem by assuming the same $\kappa$ for each coefficient $\gamma_i$. This assumption changes the last term in (8.13) to $\rho \|x\|_0$, since $\sum_{i=1}^{p} \rho_i \gamma_i = \rho \sum_{i=1}^{p} \gamma_i = \rho \|x\|_0$. The resulting optimization will then be of the following form:

$$ L(x, \gamma) = \|y - Ax\|_2^2 + \lambda \|x\|_2^2 + \rho_{\sigma^2, \lambda, \kappa} \|x\|_0, \tag{8.18} $$

which is well known for its non-convex $\ell_0$-norm term. We have thus given a Bayesian interpretation of the $\ell_0$-norm sparsity-constrained optimization problem. Many solutions to this problem have been proposed, including but not limited to greedy pursuit-based methods and convex relaxations, among which the majorization-minimization method [84] is particularly effective. Further, a relaxation of the $\ell_0$-norm to the $\ell_1$-norm reduces the problem to the well-known elastic net [58]:

$$ L(x, \gamma) = \|y - Ax\|_2^2 + \lambda \|x\|_2^2 + \rho_{\sigma^2, \lambda, \kappa} \|x\|_1. \tag{8.19} $$

The framework in (8.13) therefore offers greater generality in capturing the sparsity of $x$. As an example, consider the scenario in a reconstruction or classification problem where some dictionary (training) columns are more important than others [100]. It is then possible to encourage their contribution to the linear model by assigning higher values to the corresponding $\kappa_i$'s, which in turn makes it more likely that the $i$th coefficient $x_i$ becomes activated.
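The role of the per-coefficient penalties can be checked numerically. The sketch below evaluates $\rho_i$ from its definition in Lemma 1 and the cost in (8.13); the function names `rho_from_prior` and `spike_slab_cost` are hypothetical, and the parameter values are arbitrary choices for illustration.

```python
import numpy as np

def rho_from_prior(kappa, sigma2, lam):
    """rho_i = sigma^2 * log( 2*pi*sigma^2 * (1 - kappa_i)^2 / (lam * kappa_i^2) )."""
    kappa = np.asarray(kappa, dtype=float)
    return sigma2 * np.log(2.0 * np.pi * sigma2 * (1.0 - kappa) ** 2 / (lam * kappa ** 2))

def spike_slab_cost(A, y, x, gamma, lam, rho):
    """Cost of (8.13): ||y - A x||_2^2 + lam * ||x||_2^2 + sum_i rho_i * gamma_i."""
    resid = y - A @ x
    return resid @ resid + lam * (x @ x) + np.sum(rho * gamma)

# A larger kappa_i (prior probability of being active) gives a smaller penalty rho_i.
rho = rho_from_prior([0.1, 0.5, 0.9], sigma2=0.01, lam=1.0)
```

Because $\rho_i$ decreases as $\kappa_i$ grows, raising $\kappa_i$ lowers the price of activating coefficient $i$, which is the mechanism described in the paragraph above. With a common $\rho$ and $\gamma_i = 1$ exactly where $x_i \neq 0$, the last term reduces to $\rho \|x\|_0$ as in (8.18).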

8.2.3

Iterative Convex Refinement (ICR)

The solutions proposed for sparsity-constrained optimization using spike-and-slab priors are mostly based on simplifying assumptions as in [78], approximations [66, 84], or relaxation [58]. Recognizing that (8.13) is a hard non-convex problem, an intuitively motivated iterative approach is presented here, such that a convex optimization problem is solved at each step and the sequence of solutions converges to a sub-optimal solution. We first present the solution to (8.13) for the case where the entries of $x$ are non-negative. This case has application in some real-world problems dealing with only non-negative entries. Then, we present the ICR algorithm in its general form with no constraints. The central idea of the Iterative Convex Refinement (ICR) algorithm (see Algorithm 1) is to generate a sequence of optimization problems, each of which refines the solution of the previous iteration by solving a modified convex problem. At iteration $n$ of ICR, the indicator variable $\gamma_i$ is replaced with the normalized ratio $x_i / \mu_i^{(n-1)}$, and the convex optimization problem in (8.20) is solved, which is a simple quadratic program with a non-negativity constraint. Note that $\mu_i^{(n-1)}$ is intuitively the average value of the optimal $x_i$'s obtained from iterations 1 up to $(n-1)$ and is rigorously defined in (8.22). The motivation for this substitution is that, if the sequence of solutions $x^{(n)}$ converges to a point in $\mathbb{R}^p$ we also


Algorithm 1: Iterative Convex Refinement (ICR)
Require: $A$, $y$, and the parameters of (8.13).
Initialize: $\mu^{(0)} = A^T y$, iteration index $n = 1$.
while stopping criterion not met do
  (1) Solve the convex optimization problem at iteration $n$:
      (Non-negative) For non-negative ICR solve

$$ x^{(n)} = \arg\min_{x} \; \|y - Ax\|_2^2 + \lambda \|x\|_2^2 + \sum_{i=1}^{p} \rho_i \frac{x_i}{\mu_i^{(n-1)}} \quad \text{subject to} \quad x \ge 0 \tag{8.20} $$

for all $n > N_j$ we have

$$ \left| \frac{1}{|\mu_j^{(n+1)}|} - \frac{1}{|\mu_j^{(n)}|} \right| \le \frac{c}{n+1} \tag{8.28} $$

where $c$ is some positive constant.

Proof. Assume $|\mu_j^{(n)}| \ge \alpha_j$. First, note that it is straightforward to see that the difference of consecutive average values has the following property:

$$ -\frac{1}{n+1} \le |\mu_j^{(n+1)}| - |\mu_j^{(n)}| \le \frac{1}{n+1}. $$

Now, let $N_j = 2/\alpha_j$; then for all $n > N_j$ we have:

$$ |\mu_j^{(n)}| - \frac{1}{n+1} \le |\mu_j^{(n+1)}| \le |\mu_j^{(n)}| + \frac{1}{n+1} \tag{8.29} $$

where the left-hand side is positive, since

$$ |\mu_j^{(n)}| - \frac{1}{n+1} \ge \alpha_j - \frac{1}{N_j} = \frac{\alpha_j}{2} = \delta > 0. $$

Using this fact and (8.29) we infer that:

$$ \frac{1}{|\mu_j^{(n)}| + \frac{1}{n+1}} \le \frac{1}{|\mu_j^{(n+1)}|} \le \frac{1}{|\mu_j^{(n)}| - \frac{1}{n+1}} $$

$$ \Rightarrow\; \frac{1}{|\mu_j^{(n)}| + \frac{1}{n+1}} - \frac{1}{|\mu_j^{(n)}|} \le \frac{1}{|\mu_j^{(n+1)}|} - \frac{1}{|\mu_j^{(n)}|} \le \frac{1}{|\mu_j^{(n)}| - \frac{1}{n+1}} - \frac{1}{|\mu_j^{(n)}|} $$

$$ \Rightarrow\; \frac{-\frac{1}{n+1}}{|\mu_j^{(n)}| \left( |\mu_j^{(n)}| + \frac{1}{n+1} \right)} \le \frac{1}{|\mu_j^{(n+1)}|} - \frac{1}{|\mu_j^{(n)}|} \le \frac{\frac{1}{n+1}}{|\mu_j^{(n)}| \left( |\mu_j^{(n)}| - \frac{1}{n+1} \right)}. $$

In the last expression, we have:

$$ \text{RHS} = \frac{1}{(n+1)\, |\mu_j^{(n)}| \left( |\mu_j^{(n)}| - \frac{1}{n+1} \right)} \le \frac{1}{(n+1)\, \delta\, |\mu_j^{(n)}|}, \qquad \text{LHS} = \frac{-1}{(n+1)\, |\mu_j^{(n)}| \left( |\mu_j^{(n)}| + \frac{1}{n+1} \right)} \ge \frac{-1}{(n+1)\, \delta\, |\mu_j^{(n)}|}. $$

Therefore,

$$ \left| \frac{1}{|\mu_j^{(n+1)}|} - \frac{1}{|\mu_j^{(n)}|} \right| \le \frac{1}{(n+1)\, \delta\, \alpha_j}, \qquad n > N_j. \qquad ∎ $$

Another interpretation of this lemma is that as the number of iterations grows, the cost functions at each iteration of ICR get closer to each other. In view of these two lemmas, we can show that the sequence of optimal cost function values obtained from the ICR algorithm forms a Quasi-Cauchy sequence [101]. In other words, this is a sequence of bounded values such that the difference between two consecutive terms gets smaller.

Theorem 1. For sufficiently large $n$, the sequence of optimal cost function values obtained from ICR, i.e. $a_n = f_n(x^{(n)})$, forms a Quasi-Cauchy sequence:

$$ \left| f_{n+1}(x^{(n+1)}) - f_n(x^{(n)}) \right| \le \frac{c'}{n} \tag{8.30} $$

Proof. Before providing the proof, note that we can assume that for a sufficiently large $N_0$, if $n \ge N_0$, then $|\mu_j^{(n)}|$ is either always less than $\alpha_j$ or always greater. This is because, according to Lemma 1, if $|\mu_j^{(n)}|$ once becomes smaller than $\alpha_j$ for some $n$, it will remain less than $\alpha_j$ for all the following iterations. Therefore, let $n_j$, $j = 1, \ldots, p$, be the iteration index such that for all $n > n_j$, $|\mu_j^{(n)}| < \alpha_j$. Note that some $n_j$'s may be equal to infinity, which means that the corresponding $|\mu_j^{(n)}|$ never becomes smaller than $\alpha_j$. For those $j$ with $n_j = \infty$, let $N_j$ be the same as $N_j$ defined in the proof of Lemma 2. With these definitions, we now proceed to prove the theorem. We first show that for $n > N_0 = \max(\max_j n_j, \max_j N_j)$, the sequence $f_n(x^{(n)})$ has the following property:

$$ \begin{aligned} \left| f_{n+1}(x^{(n)}) - f_n(x^{(n)}) \right| &= \left| \sum_{i=1}^{p} \rho_i \left( \frac{1}{|\mu_i^{(n)}|} - \frac{1}{|\mu_i^{(n-1)}|} \right) |x_i^{(n)}| \right| \\ &\le \sum_{i:\, |\mu_i^{(n-1)}| < \alpha_i} \rho_i \left| \frac{1}{|\mu_i^{(n)}|} - \frac{1}{|\mu_i^{(n-1)}|} \right| |x_i^{(n)}| + \sum_{j:\, |\mu_j^{(n-1)}| \ge \alpha_j} \rho_j \left| \frac{1}{|\mu_j^{(n)}|} - \frac{1}{|\mu_j^{(n-1)}|} \right| |x_j^{(n)}| \\ &\le \frac{c'}{n}. \end{aligned} \tag{8.31} $$

Next, we show that for $n > N_0$, $a_n = f_n(x^{(n)})$ is Quasi-Cauchy. Since the minimum value $f_{n+1}(x^{(n+1)})$ is smaller than $f_{n+1}(x^{(n)})$, we can write:

$$ f_{n+1}(x^{(n+1)}) - f_n(x^{(n)}) \le f_{n+1}(x^{(n)}) - f_n(x^{(n)}) \le \frac{c'}{n} $$

where we used (8.31) for $n > N_0$. With the same reasoning, for $n > N_0$ we have:

$$ f_{n+1}(x^{(n+1)}) - f_n(x^{(n)}) \ge f_{n+1}(x^{(n+1)}) - f_n(x^{(n+1)}) \ge -\frac{c'}{n} $$

Therefore,

$$ \left| f_{n+1}(x^{(n+1)}) - f_n(x^{(n)}) \right| \le \frac{c'}{n} $$

for $n > N_0$. ∎

Combining this theorem with a reasonable stopping criterion guarantees the termination of the ICR algorithm. The stopping criterion used in this case is the norm of the difference between the solutions $x^{(n)}$ in consecutive iterations. At termination, where the solution converges, the ratio $x_i^{(n)} / \mu_i^{(n)}$ will be zero for zero coefficients and will approach 1 for nonzero coefficients, which matches the value of $\gamma_i$ in both cases.

Remark: Although the analytical results show a decay of order $\frac{1}{n}$ in the difference between consecutive optimal cost function values, ICR shows much faster convergence in practice. In addition, the property that was proved in Lemma 1 can make the ICR algorithm much faster.
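The unconstrained refinement loop described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weighted term $\sum_i \rho_i |x_i| / |\mu_i^{(n-1)}|$ follows the cost $f_n$ used in the proof of Theorem 1, the update of $\mu$ is assumed to be a plain running average of the iterates, the inner convex problem is solved with a simple proximal-gradient routine, $\rho$ is assumed positive, and a small `floor` guards against division by zero. All function and parameter names are hypothetical.

```python
import numpy as np

def weighted_l1_ridge(A, y, lam, w, n_iter=300):
    """Proximal-gradient solver (started at zero) for the convex problem
    min_x ||y - A x||_2^2 + lam * ||x||_2^2 + sum_i w_i * |x_i|,  with w_i >= 0."""
    L = 2.0 * (np.linalg.norm(A, 2) ** 2 + lam)      # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y) + 2.0 * lam * x
        v = x - grad / L
        x = np.sign(v) * np.maximum(np.abs(v) - w / L, 0.0)   # weighted soft-thresholding
    return x

def icr(A, y, lam, rho, n_outer=20, tol=1e-6, floor=1e-10):
    """Sketch of unconstrained ICR: gamma_i is replaced by |x_i| / |mu_i^(n-1)|."""
    mu = A.T @ y                                     # mu^(0) = A^T y, as in Algorithm 1
    x_prev = np.zeros(A.shape[1])
    x = x_prev
    for n in range(1, n_outer + 1):
        w = rho / np.maximum(np.abs(mu), floor)      # weights rho_i / |mu_i^(n-1)|
        x = weighted_l1_ridge(A, y, lam, w)
        mu = ((n - 1) * mu + x) / n                  # ASSUMED running average of x^(1..n)
        if np.linalg.norm(x - x_prev) < tol:         # stop when consecutive solutions agree
            break
        x_prev = x
    return x
```

A coefficient whose running average reaches zero receives a very large weight and therefore stays at zero in later iterations, which mirrors the behavior attributed to Lemma 1 above. A non-negative variant could replace the soft-thresholding step with a projection onto $x \ge 0$ after subtracting the weights.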

8.3

Applications

8.3.1

Sparse Signal Recovery

In this section, we present example experimental results supporting the effectiveness of the ICR method in the sparse recovery problem. Two experimental scenarios are considered: (1) simulated data and (2) a real-world image recovery problem. In each case, comparisons are shown against well-known alternatives.

Simulated Data: A typical experiment for sparse recovery is set up as in [66, 84] with a randomly generated Gaussian matrix $A \in \mathbb{R}^{q \times p}$ and a sparse vector $x_0 \in \mathbb{R}^p$. Based on $A$ and $x_0$, we form the observation vector $y \in \mathbb{R}^q$ according to the additive noise model $y = Ax_0 + n$ with $\sigma = 0.01$. The competing methods for sparse recovery are:

1. SpaRSA [44, 102], which is a powerful method to solve problems of the form (8.13). This method has been specifically proposed to solve this general type of problem, where the sum of a smooth function and a non-smooth (possibly non-convex) function is to be minimized.
2. The framework of Yen et al., i.e. the Majorization-Minimization (MM) algorithm [84].
3. Elastic Net [58], which is an $\ell_1$ relaxation of [84].
4. The FOCUSS algorithm [43], which is a reweighted $\ell_1$ algorithm for the sparse recovery problem [54].
5. The expectation propagation approach for spike-and-slab recovery (SS-EP) [93].
6. Variational Garrote (VG) [95].
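The simulated setup above can be sketched in a few lines. The following is an illustrative generator only: it assumes that $\sigma$ is the noise standard deviation, that columns of $A$ are normalized to unit norm (as assumed in Sect. 8.2.2), and that non-zero amplitudes are drawn uniformly from $[-1, 1]$. The function name `make_sparse_problem` is hypothetical.

```python
import numpy as np

def make_sparse_problem(q, p, k, sigma=0.01, rng=None):
    """Generate (A, x0, y) with y = A x0 + n for a k-sparse x0 (illustrative assumptions)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((q, p))
    A /= np.linalg.norm(A, axis=0)                     # unit-norm columns
    x0 = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)     # random support of size k
    x0[support] = rng.uniform(-1.0, 1.0, size=k)       # ASSUMPTION: amplitudes in [-1, 1]
    y = A @ x0 + sigma * rng.standard_normal(q)        # additive Gaussian noise
    return A, x0, y

# Small-scale setting used for Table 8.1: p = 64, q = 32, sparsity level 10.
A, x0, y = make_sparse_problem(q=32, p=64, k=10, rng=np.random.default_rng(0))
```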

Initialization for all methods is consistent, as suggested in [102]. Table 8.1 reports the experimental results for a small-scale problem. The reason for reporting results on a small-scale problem is to be able to use the IBM ILOG CPLEX optimizer [103], which is a very powerful optimization toolbox for solving many different optimization problems. It can also find the global solution to non-convex and mixed-integer programming problems. We used this feature of CPLEX to compare ICR's solution with the global minimizer. To obtain the results in Table 8.1, we chose $p = 64$, $q = 32$, and a sparsity level of 10 for $x_0$. We generated 1000 realizations of $A$, $x_0$ and $n$ and recovered $x$ using the different methods. Two types of figures of merit are used to evaluate the sparse recovery methods. First, we compare the methods in terms of cost function value averaged over realizations, which is a direct measure of the quality of the solution to (8.13). Second, we compare performance from the sparse recovery viewpoint, using the following figures of merit: the mean square error (MSE) with respect to the global solution ($x_g$) obtained by the CPLEX optimizer, and the Support Match (SM) measure, which indicates how closely the support of each solution matches that of $x_g$. However, cost function values and comparisons with the global solution are not provided for SS-EP and VG, since they are not direct solutions to the optimization problem in (8.13).
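The two recovery figures of merit can be computed as sketched below. The exact definition of the support match is not spelled out in the text, so this example assumes it is the percentage of indices on which the zero/non-zero patterns of the two vectors agree; the function names and the tolerance `tol` are likewise assumptions.

```python
import numpy as np

def mse(x_hat, x_ref):
    """Mean square error between a recovered vector and a reference vector."""
    return np.mean((np.asarray(x_hat) - np.asarray(x_ref)) ** 2)

def support_match(x_hat, x_ref, tol=1e-8):
    """ASSUMED definition: percentage of indices whose zero/non-zero status agrees."""
    s_hat = np.abs(np.asarray(x_hat)) > tol
    s_ref = np.abs(np.asarray(x_ref)) > tol
    return 100.0 * np.mean(s_hat == s_ref)

x_ref = np.array([0.0, 1.0, 0.0, 2.0])
x_hat = np.array([0.0, 0.9, 0.1, 0.0])
print(mse(x_hat, x_ref), support_match(x_hat, x_ref))
```

In this toy example the supports agree on two of the four indices, so the support match is 50%.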


Table 8.1 Comparison of methods for $p = 64$ and $q = 32$

Method           SpaRSA    MM        E-Net     FOCUSS    ICR
Avg f(x*)        2.05E-2   1.52E-2   3.33E-2   3.89E-2   1.45E-2
MSE vs x_g       1.07E-3   5.49E-3   2.45E-4   1.55E-4   8.45E-5
SM vs x_g (%)    81.57     80.25     70.20     90.53     97.13

Table 8.2 Comparison of methods for $p = 512$ and $q = 128$. Methods compared: SpaRSA, Elastic-Net, VG, MM, FOCUSS, SS-EP, ICR

Avg f(x*)   MSE vs x0   Sparsity level   SM vs x0 (%)   Time (sec)

8.75E-2 1.05E-3 54.12 90.47 1.36

1.02E-1 1.89E-3 79.32 84.12 3.38

8.31E-2 3.69E-4 21.37 94.17 3.01

6.72E-2 2.39E-4 28.88 95.41 3.15

9.77E-2 8.35E-4 77.81 86.81 1.40

– 3.65E-4 16.82 89.45 0.82

– 3.50E-4 21.68 93.90 5.37

Fig. 8.1 Convergence of ICR (right) and ICR-NN (left). MSE vs # of iteration

As can be seen from Table 8.1, ICR outperforms the competing methods in many different aspects. In particular, from the first row we infer that ICR provides a better solution to (8.13), since it achieves a lower cost on average. Moreover, the significantly higher support match (SM = 97.13%) shows that ICR's solution agrees much more closely with the global solution in locating the non-zeros. Finally, the ICR solution is also the closest to the global solution obtained from the CPLEX optimizer in the MSE sense (by more than one order of magnitude relative to SpaRSA and MM).

Next, we present results for a typical larger-scale problem. We chose $p = 512$, $q = 128$, set the sparsity level of $x_0$ to 30, and carried out the same experiment as before. Because of the scale of the problem, the global solution is now unavailable, and we therefore compare the results against $x_0$, which is the "ground truth". Results are reported in Table 8.2. Table 8.2 additionally reports the average sparsity level of the solution, and it can be seen that the sparsity level of ICR is the closest to the true sparsity level of $x_0$. In all other figures of merit, viz. the cost function value (averaged over realizations), MSE and support match vs $x_0$, ICR is again the best. Figure 8.1 additionally shows the mean square error of the solution versus the true solution in two specific examples. The convergence plots for ICR and ICR-NN are shown as a function of the number of iterations.

Further, we show more experimental results from the ICR framework to support its significance in comparison with other popular methods for the spike-and-slab sparse recovery problem. Following the same experimental setup for synthetic data, we illustrate the performance of ICR in comparison with the other methods as the sparsity level of $x_0$ ($\|x_0\|_0$) changes. We vary the true sparsity level from 5 non-zero elements in $x_0$ up to 75 and compare the MSE and support match percentage of the solutions from each method. The length of the sparse signal is chosen to be $p = 512$ and the number of observations is $q = 128$. Figure 8.2 shows the MSE plotted against the sparsity level; once again the merits of ICR are readily apparent. This figure also illustrates that the support of ICR's solution is the closest to the support of $x_0$. A match of more than 90% between the support of ICR's solution and that of $x_0$ over a wide range of sparsity levels makes ICR very valuable for variable selection problems, especially in the Bayesian framework. Figure 8.3 shows the actual sparsity level of the solution for different methods. The dashed line corresponds to the true level of sparsity, and ICR's solution is the closest to it, implying that the level of sparsity of ICR's solution matches that of $x_0$ more closely than the other methods do. This also supports the results obtained from Fig. 8.2. Figure 8.4 illustrates the mean square errors (MSEs) and support match (SM) obtained from different methods under different SNRs. The chosen values for $\sigma$ are 0.05, 0.01, 0.001 and 0.0001.

Fig. 8.2 Comparison of MSE (left) and Support Match (SM) (right) obtained by each method versus sparsity level of $x_0$

Fig. 8.3 Comparison of average sparsity level obtained by each method versus sparsity level of $x_0$. Dashed line shows the true level of sparsity

Image Reconstruction: We now apply the ICR algorithm to real data for the reconstruction of handwritten digit images from the well-known MNIST dataset [104]. The MNIST dataset contains 60000 digit images (0 to 9) of size $28 \times 28$ pixels. Most pixels in these images are inactive (zero) and only a few take non-zero values. Thus, these images are naturally sparse and fit the spike-and-slab model. The experiment is set up such that a sparse signal $x$ (vectorized image) is to be reconstructed from a smaller set of random measurements $y$.


Fig. 8.4 Comparison of MSE (left) and Support Match (SM) (right) obtained by each method under various noise levels

For any particular image, we assume that the random measurements (150 measurements) are obtained by a Gaussian measurement matrix $A \in \mathbb{R}^{150 \times 784}$ with added noise according to (8.2). We compare our result against the following image recovery methods for sparse images: (1) SALSA-TV, which uses the variable splitting proposed by Figueiredo et al. [105] combined with Total Variation (TV) regularizers [106]. (2) Bayesian Image Reconstruction (BIR) [69], based on a more recent version of the Bayesian image reconstruction method [70] proposed by Hero et al. (3) The Adaptive Elastic Net method [58], which is commonly used in sparse image recovery problems. (4) Finally, the result of the non-negative ICR (ICR-NN) is shown, which explicitly enforces a non-negativity constraint on $x$; in this case $x$ corresponds to the intensity of the reconstructed image pixels. Recovered images are shown in Fig. 8.5, and the corresponding average reconstruction error (MSE) for the whole database appears next to each method. Clearly, ICR and ICR-NN outperform the other methods both visually and in terms of MSE. It is also intuitively satisfying that ICR-NN, which captures the non-negativity constraint natural to this problem, provides the best result overall.

8.3.2

Image Classification

As the second application of sparsity-constrained estimation using spike-and-slab priors, we present an extension of the framework discussed above for the task of image classification. In order to motivate the discussion, we first present a quick summary of related work in sparse representation-based classification, followed by a discussion of our contribution. We conclude the section with experiments on popular image classification databases to demonstrate the performance benefits of our approach.

Fig. 8.5 Examples of reconstructed images from the MNIST dataset using different methods. The number next to each method is the average MSE

8.3.2.1 Sparse Representation-based Classification: Overview

A significant contribution to the development of algorithms for image classification is a recent sparse representation-based classification (SRC) framework [31], which exploits the discriminative capability of sparse representations. The key idea of designing class-specific dictionaries combines the analytical framework of compressive sensing and insight into human vision models based on overcomplete dictionaries. Suppose we have sets of training image vectors from multiple classes, collected into dictionaries $A_i$, $i = 1, \ldots, K$. Let there be $N_i$ training images (each in $\mathbb{R}^n$) corresponding to class $C_i$, $i = 1, \ldots, K$. The cumulative training dictionary is $A = [A_1\ A_2\ \ldots\ A_K] \in \mathbb{R}^{n \times T}$, with $T = \sum_{i=1}^{K} N_i$. A test image $y \in \mathbb{R}^n$ is now expressed as a linear combination of the training images: $y \simeq A_1 x_1 + A_2 x_2 + \ldots + A_K x_K = Ax$, where $x$ is ideally a sparse vector. The classifier seeks the sparsest representation by solving the following problem:

$$ \hat{x} = \arg\min_{x} \|x\|_1 \quad \text{subject to} \quad \|y - Ax\|_2 \le \epsilon. \tag{8.32} $$

Here, sparsity manifests itself due to the class-specific design of dictionaries. Class assignment involves a comparison of class-specific reconstruction errors. Compared to (8.4), this formulation has two modifications: a dictionary with class-specific ordering of training images, and an inequality constraint to account for signal noise. The robustness of the sparse feature to real-world image distortions [31, 107] is rooted in optimization theory. It has resulted in the widespread application of SRC in practical classification tasks.
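The class-assignment step mentioned above, a comparison of class-specific reconstruction errors, can be sketched as follows. The sketch assumes that a sparse code `x_hat` has already been obtained (for instance by solving (8.32)) and that `labels` gives the class of each dictionary column. The residual rule shown is the usual one for SRC-style classifiers, and the function name `src_classify` is hypothetical.

```python
import numpy as np

def src_classify(A, labels, y, x_hat):
    """Assign y to the class whose coefficients alone reconstruct it best.

    A      : (n, T) dictionary whose columns are training images
    labels : length-T array with the class of each column
    x_hat  : length-T sparse code of y with respect to A
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        x_c = np.where(labels == c, x_hat, 0.0)       # keep only class-c coefficients
        residuals.append(np.linalg.norm(y - A @ x_c)) # class-specific reconstruction error
    residuals = np.array(residuals)
    return classes[int(np.argmin(residuals))], residuals

# Toy example: the code is concentrated on the class-1 columns.
A = np.eye(4)
labels = [0, 0, 1, 1]
y = np.array([0.0, 0.0, 1.0, 1.0])
x_hat = np.array([0.1, 0.0, 0.9, 1.0])
predicted, residuals = src_classify(A, labels, y, x_hat)
```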

8.3.2.2 Structured Sparse Priors for Image Classification (SSPIC)

We propose the following formulation:

$$ \hat{x}_{C_i} = \arg\max_{x} f_{C_i}(x) \quad \text{s.t.} \quad \|y - Ax\|_2 \le \epsilon, \qquad i = 1, \ldots, K. \tag{8.33} $$

We introduce the notion of class-specific priors $f_{C_i}$ together with the class-specific dictionary $A$ already proposed in [31]. This leads to class-specific optimization problems that offer better discriminability compared to the traditional SRC formulation. This formulation represents a consummation of ideas developed for model-based compressive sensing into a general framework for sparse model-based classification. While the framework is more general in its ability to model class-specific sparsity, here we demonstrate an example of one such family of structured priors that model sparsity, namely the spike-and-slab prior. Figure 8.6 offers a set-theoretic viewpoint to illustrate the central idea of the SSPIC framework. We consider a binary classification problem for ease of explanation; the idea easily extends to multi-class scenarios. Figure 8.6a represents the SRC problem. $S_{\mathrm{rec}}^{C_i}$ is the sub-level set of all vectors that lead to a reconstruction error less than a specific tolerance $\epsilon$ for class $i$. $S_{\mathrm{sparse}}$ is the set of all vectors with only a few non-zero coefficients. The vectors that lie in the intersection of these two sets (shaded region in Fig. 8.6a) are exactly the set of solutions to (8.32). Crucially, the sparsity prior is identical across both classes. Figures 8.6b and 8.6c describe our idea for binary classification. Now we have two sub-level sets $S_{\mathrm{rec},\epsilon_1}$ and $S_{\mathrm{rec},\epsilon_2}$, which correspond to vectors $x$ leading to a reconstruction error not greater than $\epsilon_1$ and $\epsilon_2$ respectively ($\epsilon_1 > \epsilon_2$).


V. Monga et al.

Fig. 8.6 Set-theoretic comparison: (a) the SRC [31] approach, wherein the sparsity prior is identical across all classes, and (b)–(c) our proposed framework: structured sparsity using class-specific priors. In (b), the test vector $y_{C_0}$ is actually from class $C_0$ (truth), while in (c), the test vector $y_{C_1}$ is from $C_1$. Note that this illustration is for a binary classification problem. It naturally generalizes to multi-class problems as described in Sect. 8.3.2.2

Our contribution is the introduction of the two sets $S^{C_i}_{\mathrm{struct}}$, $i = 0, 1$. These sets enforce additional class-specific structure on the sparse coefficients. They may be interpreted as the set of all vectors $x$ that have a likelihood (as defined by $f_{C_i}$) greater than a well-chosen threshold. Our illustration ties together the parallel ideas of the prior $f_{C_i}$ and the set $S^{C_i}_{\mathrm{struct}}$. This is neither new nor ambiguous, however; the quantitative equivalence between prior maximization and norm minimization is well known.

Let us first consider Fig. 8.6b, where the sample test vector $y_{C_0}$ is in fact from class $C_0$. The two sets $S^{C_0}_{\mathrm{struct}}$ and $S^{C_1}_{\mathrm{struct}}$ are defined by priors $f_{C_0}$ and $f_{C_1}$ respectively, which simultaneously encode sparsity and class-specific structure. For a relaxed reconstruction error tolerance $\epsilon_1$, both these sets have non-empty intersection with $S^{C_0}_{\mathrm{rec},\epsilon_1}$. As a result, both the class-specific optimization problems in (8.33) are feasible and this increases the possibility of the test vector being misclassified. However, as the error bound is tightened to $\epsilon_2$, we see that only $S^{C_0}_{\mathrm{struct}}$ intersects with $S^{C_0}_{\mathrm{rec},\epsilon_2}$, and the solution to (8.33) correctly identifies the class of the test vector as $C_0$. An analogous argument holds in Fig. 8.6c, where the test vector $y_{C_1}$ is now from $C_1$. As the reconstruction error tolerance is reduced, only $S^{C_1}_{\mathrm{struct}}$ intersects with $S^{C_1}_{\mathrm{rec},\epsilon_2}$.

The remainder of this section is devoted to the analytical development of this idea. Our goal is to begin with the general formulation in (8.33), choose a family of class-specific priors $f_{C_i}$, and arrive at the equivalent constrained optimization formulation that reveals a Bayesian interpretation of SRC. Owing to its proven success in modeling sparsity, the spike-and-slab prior is an excellent initial choice for the $f_{C_i}$ in this framework.

What benefit does the Bayesian approach buy? A limitation of SRC is its requirement of abundant training (a highly overcomplete dictionary $A$). However, an outstanding challenge in many real-world classification problems is that statistically significant training is often unavailable. The use of well designed or well chosen class-specific priors alleviates the burden on the number of training images required, as we shall see in Sect. 8.3.2.3.

8 Sparsity Constrained Estimation in Image Processing and Computer Vision

It is worth re-emphasizing at this juncture that the same class-specific dictionary proposed in SRC [31] is employed in our approach. Consequently, the sparse coefficient vector corresponding to a test image is naturally discriminative. Based on the assumptions underlying SRC, it is expected that the sparse vector that is the outcome of (8.32) ideally has non-zero coefficients belonging to a particular class. In general, our choice of class-specific priors guides the sparse recovery problem towards solutions that are more discriminative than those obtained by solving SRC (8.32).

We return to the MAP formulation arrived at earlier in Sect. 8.2.2 (and [38]):

$$(x^*, \gamma^*, \sigma^{2*}) = \arg\min_{x,\gamma,\sigma^2} \left\{ \frac{1}{\sigma^2}(y - Ax)^T(y - Ax) + m\log\sigma^2 + \log\frac{2\pi\sigma^2}{\lambda}\sum_{i=1}^{n}\gamma_i + \sum_{i=1}^{n}\gamma_i\log\frac{(1-\kappa_i)^2}{\kappa_i^2} + \frac{\lambda}{\sigma^2}x^T x + \frac{2\alpha_2}{\sigma^2} + 2(\alpha_1 + 1)\log\sigma^2 \right\}. \tag{8.34}$$

Several such model parameter sets may be trained, one per class, to offer an immediate extension of the recovery framework for classification tasks. The formulation in (8.34) truly captures the interaction between coefficients by learning a parameter per coefficient. However, we observe that (8.34) comprises a collection of terms that: (i) lacks direct interpretation in terms of modeling sparsity, and (ii) leads to a difficult optimization problem. So, we introduce the simplifying assumption of choosing a single scalar $\kappa$ per class. With this simplification that encourages class-specific structure, we now have:

$$(x^*, \gamma^*, \sigma^{2*}) = \arg\min_{x,\gamma,\sigma^2} \left\{ \frac{1}{\sigma^2}(y - Ax)^T(y - Ax) + m\log\sigma^2 + \log\frac{2\pi\sigma^2}{\lambda}\sum_{i=1}^{n}\gamma_i + \sum_{i=1}^{n}\gamma_i\log\frac{(1-\kappa)^2}{\kappa^2} + \frac{\lambda}{\sigma^2}x^T x + \frac{2\alpha_2}{\sigma^2} + 2(\alpha_1 + 1)\log\sigma^2 \right\}. \tag{8.35}$$

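Upon further simplification,

$$L(x, \gamma, \sigma^2) = \frac{1}{\sigma^2}\left\{ \|y - Ax\|_2^2 + \lambda\|x\|_2^2 + \rho_{\sigma^2,\lambda,\kappa}\|x\|_0 \right\} + (m + 2\alpha_1 + 2)\log\sigma^2 + \frac{2\alpha_2}{\sigma^2}, \tag{8.36}$$

where $\rho_{\sigma^2,\lambda,\kappa} := \sigma^2 \log\left(\frac{2\pi\sigma^2(1-\kappa)^2}{\lambda\kappa^2}\right)$. In fact, we obtain multiple such cost functions $L_0(x, \gamma, \sigma^2)$ and $L_1(x, \gamma, \sigma^2)$, one corresponding to each class. Consequently, we solve multiple optimization problems and obtain class-specific optimal solutions (assuming fixed $\sigma^2$),

$$\hat{x}_{C_i} = \arg\min_{x} L_i(x; \sigma^2) = \arg\min_{x} \|y - Ax\|_2^2 + \lambda_{C_i}\|x\|_2^2 + \rho_{\sigma^2,\lambda_{C_i},\kappa_{C_i}}\|x\|_0. \tag{8.37}$$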
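The step from (8.35) to (8.36) can be made explicit. The following is a sketch under the assumption, taken from the spike-and-slab model of Sect. 8.2.2, that the indicator $\gamma_i$ equals 1 exactly when $x_i \neq 0$, so that $\sum_i \gamma_i = \|x\|_0$; the symbol names follow the notation assumed for Sect. 8.2.2 ($\sigma^2$, $\lambda$, $\kappa$, $\alpha_1$, $\alpha_2$):

```latex
\begin{align*}
\log\frac{2\pi\sigma^2}{\lambda}\sum_{i=1}^{n}\gamma_i
  + \sum_{i=1}^{n}\gamma_i\log\frac{(1-\kappa)^2}{\kappa^2}
  &= \log\!\left(\frac{2\pi\sigma^2(1-\kappa)^2}{\lambda\kappa^2}\right)\|x\|_0
   = \frac{1}{\sigma^2}\,\rho_{\sigma^2,\lambda,\kappa}\,\|x\|_0, \\
m\log\sigma^2 + 2(\alpha_1+1)\log\sigma^2
  &= (m + 2\alpha_1 + 2)\log\sigma^2 .
\end{align*}
```

Collecting the three terms that carry the factor $1/\sigma^2$ (data fidelity, the $\ell_2$ term, and the $\ell_0$ term) then gives the braces in (8.36); the remaining terms depend on $\sigma^2$ only and drop out of the minimization over $x$ once $\sigma^2$ is fixed.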


The proposed framework extends to multi-class classification in a natural manner, by defining multiple such structured priors, one per class, and solving the corresponding optimization problems. Different sets of data-dependent parameters $\rho_{\sigma^2,\lambda_{C_i},\kappa_{C_i}}$ and $\lambda_{C_i}$ are learned from the training images of each class. The general form of the classification rule for multiple ($K$) classes is as follows:

$$\mathrm{Class}(y) = \arg\min_{i \in \{1,\ldots,K\}} L_i(\hat{x}_{C_i}; \sigma^2). \tag{8.38}$$
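The rule in (8.37)–(8.38) can be illustrated with a short sketch. Note the substitution: the class-specific problems are solved here with a simple proximal-gradient heuristic (gradient step on the data term followed by the exact elementwise proximal map of the $\ell_2$ plus $\ell_0$ penalty), which is not the ICR algorithm developed in this chapter and only reaches a stationary point. The function names and the dictionary `class_params`, which maps each class label to an assumed pair $(\lambda_{C_i}, \rho_{C_i})$ with $\rho_{C_i} \ge 0$, are hypothetical.

```python
import numpy as np

def cost(A, y, x, lam, rho):
    """Objective of (8.37): ||y - Ax||_2^2 + lam*||x||_2^2 + rho*||x||_0."""
    r = y - A @ x
    return r @ r + lam * (x @ x) + rho * np.count_nonzero(x)

def prox_l0_l2(z, t, lam, rho):
    """Elementwise minimizer of 0.5*(x - z)^2 + t*(lam*x^2 + rho*[x != 0])."""
    shrunk = z / (1.0 + 2.0 * t * lam)                  # best non-zero candidate
    keep = z ** 2 > 2.0 * t * rho * (1.0 + 2.0 * t * lam)  # non-zero beats zero
    return np.where(keep, shrunk, 0.0)

def solve_class_problem(A, y, lam, rho, n_iter=200):
    """Proximal-gradient heuristic for one class-specific problem (not ICR)."""
    t = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)         # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)                  # gradient of ||y - Ax||^2
        x = prox_l0_l2(x - t * grad, t, lam, rho)
    return x

def sspic_classify(A, y, class_params):
    """Return the class with the smallest cost L_i, as in (8.38), and all costs."""
    costs = {}
    for c, (lam, rho) in class_params.items():
        x_c = solve_class_problem(A, y, lam, rho)
        costs[c] = cost(A, y, x_c, lam, rho)
    return min(costs, key=costs.get), costs
```

With the step size set to the reciprocal of the Lipschitz constant, each iteration does not increase the objective, so every returned cost is at most the cost of the all-zero vector, $\|y\|_2^2$.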

To summarize our analytical contribution, we initially choose a sparsity-inducing spike-and-slab prior per class and perform MAP estimation. With reasonable simplifications on model structure, we obtain the final formulation (8.37), which explicitly captures sparsity in the form of an $\ell_0$-norm minimization term. This is intuitively satisfying, since we are looking for sparse vectors $x$. Thereby, we offer a Bayesian perspective to SRC. It is useful to compare and contrast the resulting optimization in (8.37) with the SRC formulation in (8.32):

1. SRC enforces sparsity by the use of an $\ell_1$-norm term. In (8.37), structure on sparse coefficients is enforced by a weighted combination of $\ell_0$- and $\ell_2$-terms. The $\ell_2$-term in particular is known to encourage smoothness and group behavior [108].
2. Even more crucially, the weights on the $\ell_0$- and $\ell_2$-terms are class-specific. That is, the (per class) spike-and-slab priors provide additional discriminative ability over class-specific dictionaries in SRC.

For fixed $\sigma^2$, each optimization problem reduces to:

$$\hat{x}_{C_i} = \arg\min_{x} \|y - Ax\|_2^2 + \lambda_{C_i}\|x\|_2^2 + \rho_{\sigma^2,\lambda_{C_i},\kappa_{C_i}}\|x\|_0. \tag{8.39}$$

This may also be interpreted as

$$\hat{x}_{C_i} = \arg\min_{x} \lambda_{C_i}\|x\|_2^2 + \rho_{\sigma^2,\lambda_{C_i},\kappa_{C_i}}\|x\|_0 \quad \text{subject to} \quad \|y - Ax\|_2 \le \epsilon. \tag{8.40}$$

This version of the optimization problem can be compared with the set-theoretic illustration in Fig. 8.6, after identifying the equivalence between maximizing $f_{C_i}$ and minimizing the corresponding $L_i(x; \sigma^2)$ term obtained by analytical derivation.

8.3.2.3 Experimental Validation

In this section, we evaluate the performance of the SSPIC method in two important image classification problems: (i) face recognition and (ii) object categorization. Our goal is to demonstrate the practical benefits of learning priors in a class-specific manner for sparsity-based classification. Accordingly, in each case our primary comparison is with SRC. In addition, we compare SSPIC with other algorithms acknowledged to deliver excellent classification accuracy in these problems.

Face Recognition: The problem of automatically verifying a person's identity by comparing a live face capture against a stored database of human face images has witnessed considerable research activity over the past two decades. The rich diversity of facial image captures, due to varying illumination conditions, spatial resolution, pose, facial expressions, occlusion and disguise, offers a major challenge to the success of automatic human face recognition systems. A survey of face recognition methods in literature is provided in [109–111].

We perform experiments on two popular collections of human face images: the Extended Yale B database and the AR database. The Extended Yale B database [112] consists of 2,414 perfectly-aligned frontal face images of size 192 × 168 of 38 individuals, 64 images per individual, under various conditions of illumination. For each subject we randomly choose 32 images in Subsets 1 and 2, which were taken under less extreme lighting conditions, as the training data. The remaining images are used for testing. All training and test samples are downsampled to size 32 × 28. In Sect. 8.3.2.6, we test the algorithms on the AR database [113], which contains images of human faces occluded by sunglasses and scarves.


We compare our algorithm against six popular face recognition algorithms:

1. SRC: Sparse Representation-based Classification [31]
2. RSC: Robust Sparse Coding for face recognition [114]
3. Eigen-NS: Eigenfaces [115] as features with Nearest Subspace [116] classifier
4. Eigen-SVM: Eigenfaces with Support Vector Machine [117] classifier
5. Fisher-NS: Fisherfaces [118] as features with Nearest Subspace classifier
6. Fisher-SVM: Fisherfaces with SVM classifier.

It must be noted that the performance of a similar set of approaches is compared in [31]. The comparison with the Eigenfaces approach represents a baseline comparison prior to the advent of SRC. Robust sparse coding (RSC) [114] has recently emerged as a competitive approach for face recognition.

Remark: In the results to follow, we test the SSPIC approach under a variety of challenging scenarios for face recognition, such as pixel corruption, disguise, the presence of outliers, and limited training. We assume reasonably good geometric alignment of test and training images, as was also done in SRC [31]. While we acknowledge that misregistration is an important practical problem in face recognition, our focus is on showing the practical benefits of SSPIC broadly in image classification, face recognition being one such application. Pre-processing and alignment have been incorporated into sparsity-based classification to deliver practical benefits [119,120]. Similar strategies can also be included in order to make SSPIC robust to registration errors.

Next, we present the results corresponding to five different experimental settings that collectively bring out the practical benefits of SSPIC.

8.3.2.4 Test Images with Geometric Alignment

Overall recognition rates (the ratio of the total number of correctly classified images to the total number of test images, expressed as a percentage) are reported in Table 8.3. Experiments in [31] reveal the robustness of SRC to distortions under the assumption that the test images are well-calibrated. Here, we see that on calibrated test images (with minimal registration errors) SSPIC matches RSC at 97.3% and marginally improves on SRC (97.1%). The improvements offered by SRC, RSC and SSPIC over the other techniques are significant, validating the claim that the sparse representations (with respect to class-specific dictionaries) are sufficiently discriminative.

Table 8.3 Overall recognition rates using calibrated test images from the Extended Yale B database

Method        Recognition rate (%)
SSPIC         97.3
SRC           97.1
RSC           97.3
Eigen-NS      89.5
Eigen-SVM     91.9
Fisher-NS     84.7
Fisher-SVM    92.6

Table 8.4 Overall recognition rate when test images are subjected to random pixel corruption (Extended Yale B Database)

Method        Recognition rate (%)
SSPIC         95.0
SRC           93.2
RSC           93.7
Eigen-NS      54.3
Eigen-SVM     58.5
Fisher-NS     56.2
Fisher-SVM    59.9

8.3.2.5 Recognition Under Random Pixel Corruption

We randomly corrupt 50% of the image pixels in each test image. The overall recognition rates are shown in Table 8.4. In continuation of a trend first reported in [31], the sparsity-based techniques (SRC, RSC and SSPIC) demonstrate appreciable robustness to pixel corruption, whereas the other competing methods suffer drastic degradation in performance.

So far we have seen that the three sparsity-based methods have comparable classification accuracy. Next we characterize the distribution of the classification rates for this experiment to bring out a significant feature of SSPIC. The classification accuracy is modeled as a random variable whose value emerges as the outcome given a random selection of training images (for the case of sufficient training). The experiment is repeated 10 times, and the mean classifier accuracy and standard deviation are plotted in Fig. 8.7. The mean value, or average classification rate, is the highest for SSPIC; this is consistent with the results in Table 8.4. Further, the variance is the lowest for SSPIC, indicating improved robustness against the particular choice of training samples. The nearest subspace and SVM-based methods have significantly higher variance; only the Fisher-SVM performance is shown in the figure.

Fig. 8.7 Robustness to choice of training. Different sets of training images are chosen randomly in each run. Classifier accuracy is modeled as a Gaussian (pdf plotted in figure) whose mean and standard deviation are, respectively, the mean accuracy and standard deviation over 10 random runs
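The corruption step used in this experiment can be sketched as follows. The text only states that 50% of the pixels are corrupted; replacing them with values drawn uniformly from [0, 255) is an assumption made for this illustration, as are the function name and its return values.

```python
import numpy as np

def corrupt_pixels(image, fraction=0.5, rng=None):
    """Replace a random `fraction` of pixels with uniform noise in [0, 255).

    Returns the corrupted copy and a boolean mask marking corrupted pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = image.astype(float).ravel()            # astype returns a copy of the input
    n_corrupt = int(round(fraction * flat.size))
    idx = rng.choice(flat.size, size=n_corrupt, replace=False)  # distinct locations
    flat[idx] = rng.uniform(0.0, 255.0, size=n_corrupt)
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return flat.reshape(image.shape), mask.reshape(image.shape)
```

For a 32 × 28 test image, `fraction=0.5` corrupts 448 of the 896 pixels and leaves the rest untouched.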

8.3.2.6 Recognition Under Disguise

We evaluate the sensitivity of SSPIC to the presence of disguise (representative of real-life scenarios) using the AR Face Database [113]. We choose a subset of the database containing 20 male and 20 female subjects chosen randomly. For training, we consider 8 clean images (with no occlusions) per subject. These images may, however, capture different facial expressions. Faces with two different types of disguise are used for testing purposes: subjects wearing sunglasses and subjects partially covering their face with a scarf. Accordingly, we present two sets of results. In each scenario, we use 6 images per subject for testing, leading to a total of 240 test images each for sunglasses and scarves. The performance of SSPIC is compared with the six other approaches in Table 8.5. It can be seen that SSPIC, RSC and SRC comprehensively outperform the other methods, yet again highlighting the robustness of sparse features.

Table 8.5 Overall recognition rate when test subjects wear disguise (AR Database)

Method        Recognition rate (%)    Recognition rate (%)
              Sunglasses              Scarves
SSPIC         95.1                    91.6
SRC           93.5                    90.1
RSC           93.1                    90.7
Eigen-NS      47.2                    29.6
Eigen-SVM     53.5                    34.5
Fisher-NS     57.9                    41.7
Fisher-SVM    61.7                    43.6

8.3.2.7 Outlier Rejection

We return to the Extended Yale B database for this experiment. Images from 19 of the 38 classes are included in the training set, and faces from the other 19 classes are treated as outliers. For training, 15 samples per class from Subsets 1 and 2 are used (19 × 15 = 285 images in total), while 500 images are randomly chosen for testing, among which 250 are inliers and the other 250 are outliers. For SSPIC, we use a threshold $\delta$ on the likelihood ratios being compared in (8.38). If the maximum value among the log-likelihood ratios does not exceed $\delta$, the corresponding test image is labeled an outlier. For SRC, the Sparsity Concentration Index [31] is used as the criterion for outlier rejection. For the other approaches under comparison, which use the nearest subspace and SVM classifiers, reconstruction residuals are compared to a threshold to decide outlier rejection.

The receiver operating characteristic (ROC) curves for all of the approaches are shown in Fig. 8.8. The probability of detection is the ratio between the number of detected inliers and the total number of inliers, while the false alarm rate is the number of outliers that are detected as inliers divided by the total number of outliers. Here too, we observe that SSPIC exhibits the best performance.

Fig. 8.8 Face recognition using the Extended Yale B database: Outlier rejection performance represented via receiver operating characteristic (ROC) curves
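The two rates that define each ROC point can be computed directly from per-image scores. The sketch below assumes a generic score for which larger values indicate an inlier (for example, the maximum log-likelihood ratio compared against $\delta$); the function names and the threshold sweep are illustrative choices, not taken from the original experiments.

```python
import numpy as np

def roc_point(scores, is_inlier, delta):
    """Detection and false-alarm rates when an image is accepted iff score > delta."""
    scores = np.asarray(scores, dtype=float)
    is_inlier = np.asarray(is_inlier, dtype=bool)
    accepted = scores > delta
    # detected inliers / total inliers
    p_detect = np.sum(accepted & is_inlier) / np.sum(is_inlier)
    # outliers accepted as inliers / total outliers
    p_false_alarm = np.sum(accepted & ~is_inlier) / np.sum(~is_inlier)
    return p_detect, p_false_alarm

def roc_curve(scores, is_inlier):
    """Sweep delta over -inf and every observed score (ascending) to trace the curve."""
    thresholds = np.concatenate(([-np.inf], np.unique(scores)))
    return [roc_point(scores, is_inlier, d) for d in thresholds]
```

The sweep starts at the point where everything is accepted and ends at the point where nothing is accepted; intermediate thresholds trade detection against false alarms.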

8.3.2.8 Effect of Training Size on Performance

This experiment demonstrates the most significant benefit of SSPIC over SRC. As discussed earlier, the success of sparsity-based approaches in general is dependent on the availability of a generous dictionary of training images. We believe that the performance of SRC with reduced training has not been investigated adequately before, although lack of sufficient training is an important concern in some practical problems.

We use the same sets of 32 training and test images per class from the Extended Yale B database as described earlier. However, we successively use smaller subsets (8, 16 and 24) of the 32 training images in this experiment. All images are assumed to be calibrated, i.e., there are minimal registration errors. The performance as a function of training set size is plotted in Fig. 8.9. As the number of training images decreases, it is reasonable to expect the performance of all methods to suffer. However, by virtue of learning priors in a class-specific manner, SSPIC exhibits a more graceful decay in performance. The sparse linear representation model assumption central to SRC may not hold when very few training images are considered, causing its performance to drop significantly.

Fig. 8.9 Face recognition using the Extended Yale B database: Classification accuracy as a function of training set size

Object Categorization: We now consider the problem of categorizing objects present in natural images. Here we use the popular Caltech-101 dataset. We report the overall classification accuracy for SSPIC and representative competing approaches from literature. We also evaluate the performance of the competing algorithms as a function of the size of the training set.

Caltech-101 Dataset: The Caltech-101 dataset [121] contains 9144 images categorized into 101 classes. The number of images per category varies from 31 to 800. This dataset has been collected using Google Image Search, and many of the classes contain a significant amount of intra-class variation in appearance. Sample images are shown in Fig. 8.10. We partition the entire dataset into collections of 5, 10, 15, 20, and 30 training images per class and use a maximum of 50 test images per class. We compare the performance of SSPIC with four other methods:

1. SRC: Sparse Representation-based Classification [31]
2. SPM: Spatial Pyramid Matching [122]
3. SVM-KNN: Support Vector Machines + Nearest Neighbor [123]
4. LLC: Locality-constrained Linear Coding [124].

8.3.2.9 Average Classification Accuracy

Table 8.6 reports the average classification accuracy over the 101 classes for the case when 30 training images per class are utilized. The experiment is repeated 10 times with different randomly selected training images and the average accuracy is reported, in order to mitigate the effect of selection bias. Recognition rates using SRC are significantly poorer than those of the other competing algorithms. SSPIC gives the second-best classification rate and significantly improves upon the performance of SRC. This underlines the effectiveness of carefully choosing priors in a class-specific manner. It must be noted that the other methods use elaborate feature extraction techniques customized for object categorization, while SRC and SSPIC are based on the often simplistic linear representation model for images.

Fig. 8.10 Sample images from the Caltech-101 dataset [121]

Table 8.6 Average classification accuracy on the Caltech-101 dataset

Method        Average accuracy (%)
SSPIC         71.03
SRC           58.20
SPM           64.60
SVM-KNN       66.20
LLC           73.44

8.3.2.10 Effect of Training Size on Performance

Similar to the experiments for the face recognition problem, we investigate the performance of the competing approaches as the number of training images per class varies. Specifically, we consider the cases of 5, 10, 15, 20, and 30 training images. These are typical of experimental conditions investigated in literature for the Caltech-101 dataset. Figure 8.11 shows the plots of classification accuracy as a function of training set size. Clearly, SRC has the worst performance in each of the training regimes. Here we identify two benefits of SSPIC: (i) the ability to learn better discriminative models using class-specific priors, which leads to significant improvements over SRC, and (ii) the effectiveness of class-specific priors in capturing discriminative information even under limited training.
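The per-class sampling protocol used in these experiments (a fixed number of training images per class and at most 50 test images per class, redrawn at random for each run) can be sketched as below. The function name and arguments are hypothetical; the original experiments may have drawn the splits differently.

```python
import numpy as np

def split_per_class(labels, n_train, max_test=50, rng=None):
    """Randomly pick n_train training and up to max_test test indices per class."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle this class
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:n_train + max_test])    # disjoint from training
    return np.array(train_idx), np.array(test_idx)
```

Repeating the split with a fresh random generator state for each of the 10 runs, and for each training size in (5, 10, 15, 20, 30), yields the accuracies that are averaged in the reported results.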

Fig. 8.11 Classification accuracy as a function of training set size for the Caltech-101 dataset

8.4 Concluding Remarks

Over the past decade, sparsity has become one of the most prevalent themes in signal processing and big data applications. In this chapter, we address sparse signal recovery in a Bayesian framework, where sparsity is enforced on reconstruction coefficients via probabilistic priors. In particular, we focus on employing spike-and-slab priors to encourage sparsity. The optimization problem resulting from this model has broad applicability in recovery and regression problems. It is, however, known to be a hard non-convex problem, whose existing solutions involve simplifying assumptions and/or relaxations. As an effective and efficient solution, we develop an Iterative Convex Refinement (ICR) approach that aims to solve the aforementioned optimization problem directly, allowing for greater generality in the sparse structure, which circumvents the trade-off between computational complexity and performance. Essentially, ICR solves a sequence of convex optimization problems such that the sequence of solutions converges to a suboptimal solution of the original hard optimization problem.

Subsequently, we focus on the development of a Bayesian perspective to sparse representation-based classification via the introduction of class-specific priors. In conjunction with class-specific dictionaries as proposed in [31], this leads to class-specific optimization problems whose solutions are more discriminative than reconstruction error comparisons. These class-specific optimization problems can benefit from the ICR framework in order to solve them efficiently. An important practical benefit of this formulation is the robustness of the framework to limited training. This is experimentally demonstrated for two challenging practical problems: face recognition and object categorization.

References

1. Tikhonov AN, Arsenin VY (1977) Solutions of Ill-posed Problems. Winston, New York
2. Candès EJ, Romberg JK, Tao T (2006) Stable signal recovery from incomplete and inaccurate measurements. Comm Pure Appl Math 59(8):1207–1223
3. Donoho DL (1995) De-noising by soft-thresholding. IEEE Trans Inf Theory 41(3):613–627
4. Fodor IK, Kamath C (2003) Denoising through wavelet shrinkage: An empirical study. J Elec Imag 12(1):151–160
5. Coifman RR, Donoho DL (1995) Translation-invariant de-noising. In: Wavelets and statistics. Lecture Notes in Statistics Series, vol 103. Springer, pp 125–150
6. Simoncelli EP (1999) Bayesian denoising of visual images in the wavelet domain. In: Bayesian inference in wavelet-based models. Springer, pp 291–308
7. Deledalle C-A, Denis L, Tupin F (2009) Iterative weighted maximum likelihood denoising with probabilistic patch-based weights. IEEE Trans Image Process 18(12):2661–2672
8. Romberg JK, Choi H, Baraniuk RG (2001) Bayesian tree-structured image modeling using wavelet-domain hidden Markov models. IEEE Trans Image Process 10(7):1056–1068
9. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 15(12):3736–3745

10. Dong W, Li X, Zhang L, Shi G (2011) Sparsity-based image denoising via dictionary learning and structural clustering. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 457–464
11. Guillemot C, Le Meur O (2014) Image inpainting: Overview and recent advances. IEEE Signal Process Mag 31(1):127–144
12. Bertalmio M, Sapiro G, Caselles V, Ballester C (2000) Image inpainting. In: Proc ACM Conf Comput Graph Interactive Tech. ACM, pp 417–424
13. Telea A (2004) An image inpainting technique based on the fast marching method. J Graphics Tools 9(1):23–34
14. Liang L, Liu C, Xu Y-Q, Guo B, Shum H-Y (2001) Real-time texture synthesis by patch-based sampling. ACM Trans Graph 20(3):127–150
15. Guleryuz OG (2006) Nonlinear approximation based image recovery using adaptive sparse reconstructions and iterated denoising-part I: Theory. IEEE Trans Image Process 15(3):539–554
16. Li X (2011) Image recovery via hybrid sparse representations: A deterministic annealing approach. IEEE J Sel Topics Signal Process 5(5):953–962
17. Starck J-L, Elad M, Donoho DL (2005) Image decomposition via the combination of sparse representations and a variational approach. IEEE Trans Image Process 14(10):1570–1582
18. Kim SP, Bose NK, Valenzuela HM (1990) Recursive reconstruction of high resolution image from noisy undersampled multiframes. IEEE Trans Acoust, Speech, Signal Process 38(6):1013–1027
19. Freeman WT, Jones TR, Pasztor EC (2002) Example-based super-resolution. IEEE Comput Graph Appln 22(2):56–65
20. Farsiu S, Robinson MD, Elad M, Milanfar P (2004) Fast and robust multiframe super resolution. IEEE Trans Image Process 13(10):1327–1344
21. Park SC, Park MK, Kang MG (2003) Super-resolution image reconstruction: A technical overview. IEEE Signal Process Mag 20(3):21–36
22. Tappen MF, Russell BC, Freeman WT (2003) Exploiting the sparse derivative prior for super-resolution and image demosaicing. In: IEEE Workshop on Statistical and Computational Theories of Vision
23. Fattal R (2007) Image upsampling via imposed edge statistics. ACM Trans Graph 26(3), Article No. 95. http://doi.acm.org/10.1145/1276377.1276496
24. Dai S, Han M, Xu W, Wu Y, Gong Y (2007) Soft edge smoothness prior for alpha channel super resolution. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 1–8
25. Yang J, Wright J, Huang T, Ma Y (2008) Image super-resolution as sparse representation of raw image patches. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 1–8
26. Yang J, Wright J, Huang T, Ma Y (2010) Image super-resolution via sparse representation. IEEE Trans Image Process 19(11):2861–2873


27. Yang J, Wang Z, Lin Z, Cohen S, Huang T (2012) Coupled dictionary training for image super-resolution. IEEE Trans Image Process 21(8):3467–3478
28. Zeyde R, Elad M, Protter M (2012) On single image scale-up using sparse-representations. In: Curves and Surfaces. Springer, pp 711–730
29. Timofte R, De Smet V, Van Gool L (2013) Anchored neighborhood regression for fast example-based super-resolution. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 1920–1927
30. Timofte R, De Smet V, Van Gool L (2014) A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Proc Asian Conf Comput Vision, pp 111–126
31. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Machine Intell 31(2):210–227
32. Taubman DS, Marcellin MW (2001) JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic, Norwell, MA
33. Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381:607–609
34. Candès E, Romberg J, Tao T (2006) Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory 52(2):489–509
35. Lustig M, Donoho DL, Pauly JL (2007) Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn Reson Med 58:1182–1195
36. Kreutz-Delgado K, Murray JF, Rao BD, Engan K, Lee T-W, Sejnowski TJ (2003) Dictionary learning algorithms for sparse representation. Neural Comput 15(2):349–396
37. Sprechmann P, Ramirez I, Sapiro G, Eldar YC (2011) C-HiLasso: A collaborative hierarchical sparse modeling framework. IEEE Trans Signal Process 59(9):4183–4198
38. Srinivas U, Suo Y, Dao M, Monga V, Tran TD (2015) Structured sparse priors for image classification. IEEE Trans Image Process 24(6):1763–1776
39. Mousavi HS, Srinivas U, Monga V, Suo Y, Dao M, Tran TD (2014) Multi-task image classification via collaborative, hierarchical spike-and-slab priors. In: Proc IEEE Int Conf Image Process, pp 4236–4240
40. Suo Y, Dao M, Tran T, Mousavi H, Srinivas U, Monga V (2014) Group structured dirty dictionary learning for classification. In: Proc IEEE Int Conf Image Process, pp 150–154
41. Anaraki FP, Hughes SM (2013) Compressive K-SVD. In: Proc IEEE Int Conf Acoust, Speech, Signal Process, pp 5469–5473
42. Sadeghi M, Babaie-Zadeh M, Jutten C (2013) Dictionary learning for sparse representation: A novel approach. IEEE Signal Process Lett 20(12):1195–1198

43. Gorodnitsky IF, Rao BD (1997) Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans Signal Process 45(3):600–616
44. Wright SJ, Nowak RD, Figueiredo MA (2009) Sparse reconstruction by separable approximation. IEEE Trans Signal Process 57(7):2479–2493
45. Tropp JA, Gilbert AC (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans Inf Theory 53(12):4655–4666
46. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 15(12):3736–3745
47. Andersen MR, Winther O, Hansen LK (2014) Bayesian inference for structured spike and slab priors. In: Adv Neural Inf Process Syst, pp 1745–1753
48. Srinivas U, Mousavi HS, Monga V, Hattel A, Jayarao B (2014) Simultaneous sparsity model for histopathological image representation and classification. IEEE Trans Med Imag 33(5):1163–1179
49. Baraniuk RG (2007) Compressive sensing. IEEE Signal Process Mag 24(4):118–121
50. Tibshirani R (2011) Regression shrinkage and selection via the Lasso: A retrospective. J R Stat Soc Ser B (Methodological) 73(3):273–282
51. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288
52. Donoho DL (2006) For most large underdetermined systems of linear equations the minimal 1-norm solution is also the sparsest solution. Commun Pure Appl Math 59(6):797–829
53. Becker S, Bobin J, Candès EJ (2011) Nesta: A fast and accurate first-order method for sparse recovery. SIAM J Imag Sci 4(1):1–39
54. Candes EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted L1 minimization. J Fourier Anal Appl 14(5–6):877–905
55. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Methodological) 68(1):49–67
56. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Methodological) 67(2):301–320
57. Zou H (2006) The adaptive lasso and its oracle properties. J Amer Stat Assoc 101(476):1418–1429
58. Zou H, Zhang HH (2009) On the adaptive elastic-net with a diverging number of parameters. Ann Stat 37(4):1733–1751
59. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused Lasso. J R Stat Soc Ser B (Methodological) 67(1):91–108
60. Yuan M, Lin Y (2007) Model selection and estimation in the gaussian graphical model. Biometrika 94(1):19–35

61. Candes E, Tao T (2007) The dantzig selector: Statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
62. Tibshirani RJ, Hoefling H, Tibshirani R (2011) Nearly-isotonic regression. Technometrics 53(1):54–61
63. Candes EJ, Tao T (2010) The power of convex relaxation: Near-optimal matrix completion. IEEE Trans Inf Theory 56(5):2053–2080
64. Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11(Aug):2287–2322
65. Mousavi A, Maleki A, Baraniuk RG (2013) Asymptotic analysis of LASSO's solution path with implications for approximate message passing. arXiv preprint arXiv:1309.5979
66. Mohimani H, Babaie-Zadeh M, Jutten C (2009) A fast approach for overcomplete sparse decomposition based on smoothed norm. IEEE Trans Signal Process 57(1):289–301
67. Wipf DP, Rao BD (2004) Sparse Bayesian learning for basis selection. IEEE Trans Signal Process 52(8):2153–2164
68. Ji S, Xue Y, Carin L (2008) Bayesian compressive sensing. IEEE Trans Signal Process 56(6):2346–2356
69. Lu X, Wang Y, Yuan Y (2013) Sparse coding from a Bayesian perspective. IEEE Trans Neur Netw Learn Sys 24(6):929–939
70. Dobigeon N, Hero AO, Tourneret J-Y (2009) Hierarchical bayesian sparse image reconstruction with application to MRFM. IEEE Trans Image Process 18(9):2059–2070
71. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122
72. Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202
73. Tropp JA (2006) Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing 86(3):589–602
74. Baraniuk RG, Cevher V, Duarte MF, Hegde C (2010) Model-based compressive sensing. IEEE Trans Inf Theory 56(4):1982–2001
75. He L, Carin L (2009) Exploiting structure in wavelet-based Bayesian compressive sensing. IEEE Trans Signal Process 57(9):3488–3497
76. Babacan S, Molina R, Katsaggelos A (2010) Bayesian compressive sensing using Laplace priors. IEEE Trans Image Process 19(1):53–63
77. Suo Y, Dao M, Tran T, Srinivas U, Monga V (2013) Hierarchical sparse modeling using spike and slab priors. In: Proc IEEE Int Conf Acoust, Speech, Signal Process, pp 3103–3107

8 Sparsity Constrained Estimation in Image Processing and Computer Vision 78. Srinivas U, Suo Y, Dao M, Monga V, Tran TD (2013) Structured sparse priors for image classification. In: Proc IEEE Int Conf Image Process, pp 3211–3215 79. Jenatton R, Audibert J-Y, Bach F (2011) Structured variable selection with sparsity-inducing norms. J Mach Learn Res 12:2777–2824 80. Huang J, Zhang T, Metaxas D (2011) Learning with structured sparsity. J Mach Learn Res 12: 3371–3412 81. Huang J, Zhang T et al (2010) The benefit of group sparsity. Ann Stat 38(4):1978–2004 82. Cotter SF, Rao BD, Engan K, Kreutz-Delgado K (2005) Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Trans Signal Process 53(7):2477–2488 83. Eldar YC, Rauhut H (2010) Average case analysis of multichannel sparse recovery using convex relaxation. IEEE Trans Inf Theory 56(1):505–519 84. Yen T-J (2011) A majorization–minimization approach to variable selection using spike and slab priors. Ann Stat 39(3):1748–1775 85. Cevher V, Indyk P, Carin L, Baraniuk RG (2010) Sparse signal recovery and acquisition with graphical models. IEEE Signal Process Mag 27(6):92–103 86. Cevher V (2009) Learning with compressible priors. In: Adv Neural Inf Process Syst, pp 261–269 87. Cevher V, Indyk P, Carin L, Baraniuk RG (2010) Sparse signal recovery and acquisition with graphical models. IEEE Signal Process Mag 27(6):92–103 88. Mitchell TJ, Beauchamp JJ (1988) Bayesian variable selection in linear regression. J Amer Stat Assoc 83(404):1023–1032 89. George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J Amer Stat Assoc 88(423):881–889 90. Ishwaran H, Rao JS (2005) Spike and slab variable selection: Frequentist and Bayesian strategies. Ann Stat 730–773 91. Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M (2008) High-dimensional sparse factor modeling: Applications in gene expression genomics. J Amer Stat Assoc 103(484): 1438–1456 92. 
Titsias MK, Lázaro-Gredilla M (2011) Spike and slab variational inference for multi-task and multiple kernel learning. In: Adv Neural Inf Process Syst, pp 2339–2347 93. Hernández-Lobato JM, Hernández-Lobato D, Suárez A (2014) Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning pp 1–51 94. Hernández-Lobato D, Hernández-Lobato JM, Dupont P (2013) Generalized spike-and-slab priors for bayesian group feature selection using expectation propagation. J Mach Learn Res 14(1):1891–1945 95. Kappen HJ, Gómez V (2014) The variational garrote. Machine Learning 96(3):269–294

205

96. Vila J, Schniter P (2011) Expectation-maximization Bernoulli-Gaussian approximate message passing. In: Proc IEEE Asilomar Conf Signal, Syst, Comput, pp 799–803 97. Mousavi, HS, Monga V, Tran TD (2015) Iterative convex refinement for sparse recovery. IEEE Signal Process Lett 22(11):1903–1907 98. Chouzenoux E, Jezierska A, Pesquet J-C, Talbot H (2013) A majorize-minimize subspace approach for `2  `0 image regularization. SIAM J Imag Sci 6(1):563–591 99. Chaari L, Batatia H, Dobigeon N, Tourneret J-Y (2014) A hierarchical sparsity-smoothness bayesian model for `0 C`1 C`2 regularization. In: Proc IEEE Int Conf Acoust, Speech, Signal Process, pp 1901– 1905 100. Mohammadi M, Fatemizadeh E, Mahoor M (2014) PCA-based dictionary building for accurate facial expression recognition via sparse representation. J Vis Commun Image Represent 25(5):1082–1092 101. Burton D, Coleman J (2010) Quasi-Cauchy sequences. Am Math Monthly 117(4):328–333 102. Wright SJ, Nowak RD, Figueiredo M (2014) SpaRSA software. [Online]. Available: http://www. lx.it.pt/~mtf/SpaRSA/ 103. IBM (2014) ILOG CPLEX optimization studio. [Online]. Available: http://www-01.ibm.com/ software/commerce/optimization/cplex-optimizer/ 104. LeCun Y, Cortes C, Burges CJ (2014) MNIST dataset. [Online]. Available: http://yann.lecun.com/ exdb/mnist/ 105. Afonso MV, Bioucas-Dias JM, Figueiredo MA (2010) Fast image recovery using variable splitting and constrained optimization. IEEE Trans Image Process 19(9):2345–2356 106. Chambolle A (2004) An algorithm for total variation minimization and applications. J Math Imaging Vis 20(1–2):89–97 107. Pillai JK, Patel VM, Chellappa R, Ratha NK (2011) Secure and robust iris recognition using random projections and sparse representations. IEEE Trans Pattern Anal Machine Intell 33(9):1877–1893 108. Majumdar A, Ward RK (2009) Classification via group sparsity promoting regularization. In: Proc IEEE Int Conf Acoust, Speech, Signal Process, pp 861–864 109. 
Zhao W, Chellappa R, Phillips PJ, Rosenfeld A (2003) Face recognition: A literature survey. ACM Comput Surv 35(4):399–458 110. Zhang X, Gao Y (2009) Face recognition across pose: A review. Pattern Recognition 42(11): 2876– 2896 111. Li SZ, Jain AK (eds) (2011) Handbook of Face Recognition, 2nd edn. Springer 112. Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans Pattern Anal Machine Intell 23(6):643–660

206 113. Martinez AM, Benavente R (1998) The AR face database. CVC Technical Report, 24 114. Yang M, Zhang L, Yang J, Zhang D (2011) Robust sparse coding for face recognition. In: Proc IEEE Conf Comput Vision Pattern Recogn, pp 625–632 115. M. Turk and A. Pentland (1991) Eigenfaces for recognition. J Cogn Neurosci 3(1):71–86 116. Ho J, Yang M, Lim J, Lee K, Kriegman D (2003) Clustering appearances of objects under varying illumination conditions. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 11–18 117. Vapnik VN (1995) The nature of statistical learning theory. New York, USA: Springer 118. Belhumeur PN, Hespanha JP, Kriegman DJ (1997) Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans Pattern Anal Machine Intell 19(7): 711–720 119. Huang J, Huang X, Metaxas D (2008) Simultaneous image transformation and sparse representation recovery. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 1–8

V. Monga et al. 120. Wagner A, Wright J, Ganesh A, Zhou Z, Mobahi H, Ma Y (2012) Towards a practical face recognition system: Robust alignment and illumination by sparse representation. IEEE Trans Pattern Anal Machine Intell 34(2):372–386 121. Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 59–70 122. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 2169–2178 123. Zhang H, Berg A, Maire M, Malik J (2006) SVMKNN: Discriminative nearest neighbor classification for visual category recognition. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 2126– 2136 124. Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for image classification. In: Proc IEEE Int Conf Comput Vision Pattern Recogn, pp 3360–3367

9

Optimization Problems Associated with Manifold-Valued Curves with Applications in Computer Vision

Rushil Anirudh, Pavan Turaga, and Anuj Srivastava

9.1

Introduction and Motivation

Many applications need to represent, compare, and manipulate manifold-valued curves, or Riemannian trajectories, while respecting constraints on geometry and temporality. A further desired characteristic is the flexibility to operate at reduced bit budgets for storage and at lower compute cost, which makes these methods applicable in resource-constrained environments. Features lying on Riemannian manifolds are commonly encountered in vision and robotics, in part due to the large number of sensors from which data is acquired. This is particularly true for human actions, where a gamut of sensors is used today, such as traditional video cameras in security, depth cameras such as Microsoft Kinect for gaming/HCI, accelerometers and gyroscopes in fitness trackers, and computational or non-traditional imaging devices. Some examples of popularly used Riemannian features are shape silhouettes on Kendall's shape space [48], pairwise transformations of skeletal joints on $SE(3) \times SE(3) \times \cdots \times SE(3)$ [5, 49], the parameters of a linear dynamical system represented as points on the Grassmann manifold [43], and histograms of oriented optical flow (HOOF) on a hyper-sphere [10].

Respecting the geometry is essential for such data, since they do not obey conventional Euclidean properties. In this regard, generalizing statistical models to non-Euclidean spaces has become an active field of research in the past decade. Temporal information also needs to be taken into account for applications such as human action recognition in order to achieve speed invariance: two sequences that are misaligned in time induce unwanted distortions in the distance metric. Accounting for warping reduces the intra-class distance and improves the inter-class distance. Ignoring it leads to artificially inflated variance in the dataset, and hence to poor statistical models. While the alignment problem has been studied for Euclidean curves for many years, it has only recently been addressed for Riemannian trajectories. The most common way to solve the misalignment problem is dynamic time warping (DTW), which originally found its use in speech processing [7]. For human actions, [46, 56] address this problem using different strategies for features in the Euclidean space. However, DTW behaves as a similarity measure instead of a true distance metric, in that it does not naturally allow the estimation of statistical measures such as the mean and variance of action trajectories.

An ideal representation would be highly discriminative of different classes while factoring out temporal warping to reduce the variability within classes, and would also enable low-dimensional coding at the sequence level. Learning such a representation is complicated when the extracted features are non-Euclidean (i.e. they do not obey conventional properties of the Euclidean space). We utilize a recent development in statistics [39], the transport square-root velocity function (TSRVF) representation, to provide a warp-invariant representation of Riemannian trajectories. Exploiting this, we propose a framework to generalize the dictionary learning problem to Riemannian trajectories. In other words, we are interested in a parameterization of Riemannian trajectories: for $N$ actions $A_i(t)$, $i = 1, \dots, N$, our goal is to learn $\mathcal{F}$ such that $\mathcal{F}(x) = A_i$, where $x \in \mathbb{R}^k$ is the set of parameters. Such a model allows us to compare actions by simply comparing them in their parametric space with respect to $\mathcal{F}$, with significantly faster distance computations, while still being able to reconstruct the original actions. In this work, we learn two different kinds of functions, using PCA and dictionary learning, which have attractive properties for recognition and visualization.

For data lying on these manifolds, standard notions of distance, statistics, quantization, etc. need modification to account for the non-linearity of the underlying space. As a result, basic computations such as the geodesic distance or the sample mean are computationally involved and often require iterative procedures, which further increases the computational load and can make them impractical. We also propose a geometry-based symbolic approximation framework, as a result of which low-bandwidth transmission and accurate real-time analysis for recognition or searching through sequential data become fairly straightforward. The framework generalizes to Riemannian manifolds a popular indexing technique for mining and searching vector-space time series data, known as Symbolic Aggregate Approximation (SAX) [29]. The main idea is to replace Riemannian trajectories with abstract symbols or prototypes that can be learned offline. Symbolic approximation is a combination of discretization and quantization on manifold spaces, which allows us to approximate distance metrics between sequences quickly and efficiently. Another advantage is that extremely fast search is possible because the search is limited to the symbolic space. Further, to enable efficient searching techniques, we develop prototypes or symbols which divide the space into equiprobable regions by proposing the first manifold generalization of a conscience-based competitive learning algorithm [12]. Using these prototypes, we demonstrate that signals or sequences on manifolds can be approximated effectively such that the resulting metric remains close to the metric on the original feature space, thereby guaranteeing accurate recognition and search. While this framework is applicable to general high-dimensional feature sequences, we demonstrate its utility on activity recognition. Generally speaking, the ideal symbolic representation is expected to have two key properties: (1) it should model the data accurately with a low approximation error, and (2) it should enable the efficient use of existing data structures and algorithms developed for string searching.

In this chapter, we present different optimization problems in the context of Riemannian trajectories with applications in computer vision. First, we present a generalization of dictionary learning to Riemannian trajectories such that each atom in the dictionary is an entire representative action sequence. Such a model not only greatly reduces the dimensionality of the data, but also improves classification performance and makes it easy to visualize the action space. Next, we propose indexing manifold sequences using a new clustering algorithm that partitions the dataset into equally likely clusters. This results in extremely quick comparisons between sequences, improving the matching time by two orders of magnitude without sacrificing quality.

R. Anirudh, Lawrence Livermore National Laboratory, California, USA
P. Turaga, Arizona State University, Tempe, AZ, USA
A. Srivastava, Florida State University, Tallahassee, FL, USA

© Springer International Publishing AG 2017. V. Monga (ed.), Handbook of Convex Optimization Methods in Imaging Science, DOI 10.1007/978-3-319-61609-4_9


Outline of the Chapter

The next section discusses related literature on Riemannian trajectories and their compact representations. In Sect. 9.2 we briefly introduce differential geometry and discuss geometric properties of the manifolds considered in this chapter. This is followed by a preliminary dictionary learning problem in Sect. 9.2.4, which is posed in the Euclidean space using extrinsic representations of manifold-valued data. In Sect. 9.3, the TSRVF representation is formally introduced as a way to work directly with manifold-valued curves instead of approximating them in the Euclidean space. Using this representation, we show how to learn a K-SVD or PCA basis of Riemannian trajectories in Algorithm 1. In Sect. 9.4, we discuss an indexing method for Riemannian trajectories that matches sequences very efficiently without sacrificing recognition performance. Finally, we conclude the chapter in Sect. 9.5 with some ideas for future work.

9.1.1

Related Work

Elastic Metrics for Riemannian Trajectories

There have been many studies that solve or work around the misalignment problem, which is commonly encountered in action recognition. The most popular way is to use DTW [7]. Shape-based elastic metrics have also been successful in matching Euclidean curves [37, 38]. However, in this chapter we are interested in modeling manifold-valued curves instead of Euclidean curves. A recent development in statistics [39] proposed the transport square-root velocity function (TSRVF) framework, which provides a way to represent trajectories on Riemannian manifolds such that the distance between two trajectories is invariant to identical time-warpings. The representation itself lies on a tangent space and is therefore Euclidean; this is discussed further in Sect. 9.3. The representation was then applied to the problem of visual speech recognition by warping trajectories on the space of SPD matrices [40]. A more recent work [54] has addressed the arbitrariness of the reference point in the TSRVF representation by developing a purely intrinsic approach that redefines the TSRVF at the starting point of each trajectory. A version of the representation for Euclidean trajectories, known as the Square-Root Velocity Function (SRVF), was recently applied to skeletal action recognition using joint locations in $\mathbb{R}^3$ with promising results [13]. Rate invariance for activities has been addressed before [46, 56]; for example, [46] models the space of all possible warpings of an action sequence, where the nonlinearity is the space of warp functions. Such techniques can align sequences correctly, even when features are multi-modal [56]. However, most of these techniques are used for recognition, which can be achieved with a similarity measure, whereas we are interested in a representation that serves a more general purpose: (1) to provide an effective metric for comparison, recognition, retrieval, etc., and (2) to provide a framework for efficient lower-dimensional coding which also enables recovery back to the original feature space.

Dimensionality Reduction for Manifold Valued Data

Principal component analysis has been used extensively in statistics for dimensionality reduction of linear data. It has also been extended to model a wide variety of data types. For high-dimensional data in $\mathbb{R}^n$, manifold learning (or non-linear dimensionality reduction) techniques [34, 41] attempt to identify the underlying low-dimensional manifold while preserving specific properties of the original space. Using a robust metric, one could theoretically use such techniques for coding, but the algorithms have impractical memory requirements for very high-dimensional data, on the order of $10^4$ to $10^5$ dimensions, and they do not provide a way of reconstructing the original manifold data. For data already lying on a known manifold, geometry-aware mapping of SPD matrices [22] constructs a lower-dimensional SPD manifold, and principal geodesic analysis (PGA) [16] identifies the primary geodesics along which there is maximum variability of data points. We are interested in identifying the variability of sequences instead. Recently, dictionary learning methods for data lying on Riemannian manifolds have been proposed [11, 23, 25, 53] and could potentially be used to code sequential data without explicitly taking time into consideration. Coding data on Riemannian manifolds is still a new idea with some progress in the past few years; for example, the Vector of Locally Aggregated Descriptors (VLAD) has recently been extended to Riemannian manifolds [15].

9.2

Mathematical Preliminaries

In this section we introduce some preliminaries on differential geometry and the geometric properties of the manifolds of interest, namely the product space $SE(3) \times \cdots \times SE(3)$, the Grassmann manifold, and the space of symmetric positive definite (SPD) matrices. For a more expanded discussion on Riemannian manifolds, we refer the reader to [1, 8].

9.2.1

Preliminaries in Differential Geometry

A topological space $\mathcal{M}$ is called a manifold if it is locally Euclidean, i.e. for each point $p \in \mathcal{M}$ there exists an open neighborhood $U$ of $p$ and a mapping $\phi: U \to \mathbb{R}^n$ such that $\phi(U)$ is open in $\mathbb{R}^n$ and $\phi: U \to \phi(U)$ is a diffeomorphism. The pair $(U, \phi)$ is called a coordinate chart for the points that fall in $U$. To analyze sequences on manifolds, one needs to understand tangent spaces and exponential mappings. A tangent space at a point of a manifold $\mathcal{M}$ is obtained by considering the velocities of differentiable curves passing through the given point, i.e. for a point $p \in \mathcal{M}$, a differentiable curve passing through it is represented as $\alpha: (-\delta, \delta) \to \mathcal{M}$ such that $\alpha(0) = p$. The velocity $\dot{\alpha}(0)$ refers to the velocity of the curve at $p$. This vector has the same dimension as the manifold and is a tangent vector to $\mathcal{M}$ at $p$. The set of all such tangent vectors is called the tangent space to $\mathcal{M}$ at $p$, denoted by $T_p(\mathcal{M})$, and it is always a vector space. We refer the interested reader to [1, 8] for a more detailed study of this topic.

The distance between two points on a manifold is measured by means of the 'length' of the shortest curve connecting the points, known as the geodesic. The notion of length is formalized by defining a Riemannian metric, which is a map $\langle \cdot, \cdot \rangle$ that associates to each point $p \in \mathcal{M}$ a symmetric, bilinear, positive definite form on the tangent space $T_p(\mathcal{M})$. The Riemannian metric allows one to compute the infinitesimal length of tangent vectors along a curve. The length of the entire curve is then obtained by integrating the infinitesimal lengths of tangents along the curve, i.e. given $p, q \in \mathcal{M}$, the distance between them is the infimum of the lengths of all smooth paths on $\mathcal{M}$ which start at $p$ and end at $q$:

$$d(p, q) = \inf_{\{\gamma: [0,1] \to \mathcal{M} \,\mid\, \gamma(0) = p,\ \gamma(1) = q\}} L[\gamma], \quad \text{where} \tag{9.1}$$

$$L[\gamma] = \int_0^1 \sqrt{\langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle}\, dt. \tag{9.2}$$

For a point $p \in \mathcal{M}$, the exponential map $\exp_p: T_p(\mathcal{M}) \to \mathcal{M}$ is defined by $\exp_p(v) = \alpha_v(1)$, where $\alpha_v$ is a specific geodesic in the direction of the tangent vector $v$. The inverse mapping $\exp_p^{-1}: \mathcal{M} \to T_p(\mathcal{M})$, called the inverse exponential map at a 'pole' $p$, takes a point on the manifold and returns the tangent vector of the geodesic connecting it to the pole, on the tangent space of the pole. For the special case of $\mathbb{R}^n$, these concepts reduce to the familiar notions of addition and subtraction. The exponential map for two points $a$ and $b$ is given by $\exp_a(\beta b) = a + \beta b$, where $\beta$ is a scalar that determines the extent to which one must travel from $a$ to reach $b$; the inverse exponential map is computed as $\exp_a^{-1}(b) = b - a$.
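To make the exponential and inverse exponential maps concrete beyond the Euclidean case, the following minimal sketch uses the unit sphere, the kind of space on which the HOOF features mentioned earlier are said to lie. The closed-form expressions are the standard ones for the sphere; the function names `sphere_exp` and `sphere_log` are illustrative and not taken from the chapter.

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere: start at p, travel along tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p.copy()
    return np.cos(norm_v) * p + np.sin(norm_v) * (v / norm_v)

def sphere_log(p, q):
    """Inverse exponential map: tangent vector at the pole p of the geodesic from p to q.

    Not defined when q is antipodal to p (the geodesic is then not unique).
    """
    cos_theta = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_theta)           # geodesic distance d(p, q)
    if theta < 1e-12:
        return np.zeros_like(p)
    u = q - cos_theta * p                  # part of q orthogonal to p
    return theta * u / np.linalg.norm(u)

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
v = sphere_log(p, q)        # tangent vector at p, with length equal to d(p, q)
q_back = sphere_exp(p, v)   # travelling along v from p returns to q
```

Here the length of `v` equals the geodesic distance between `p` and `q`, and `v` is orthogonal to `p`, i.e. it lies in the tangent space at the pole.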

9.2.2

Product Space of the Special Euclidean Group

For action recognition, we represent a stick figure as a combination of relative transformations between joints, as proposed in [49]. The resulting feature for each skeleton is interpreted as a point on the product space $SE(3) \times \cdots \times SE(3)$. The skeletal representation explicitly models the 3D geometric relationships between various body parts using rotations and translations in 3D space [49]. These transformation matrices lie on the curved space known as the special Euclidean group $SE(3)$. Therefore the set of all transformations lies on the product space $SE(3) \times \cdots \times SE(3)$. The special Euclidean group, denoted by $SE(3)$, is a Lie group containing the set of all $4 \times 4$ matrices of the form

$$P(R, \vec{v}) = \begin{bmatrix} R & \vec{v} \\ 0 & 1 \end{bmatrix}, \tag{9.3}$$

where $R$ denotes the rotation matrix, which is a point on the special orthogonal group $SO(3)$, and $\vec{v}$ denotes the translation vector, which lies in $\mathbb{R}^3$. The $4 \times 4$ identity matrix $I_4$ is an element of $SE(3)$ and is the identity element of the group. The tangent space of $SE(3)$ at $I_4$ is called its Lie algebra, denoted here as $se(3)$. It can be identified with $4 \times 4$ matrices of the form¹

$$\hat{\xi} = \begin{bmatrix} \hat{\omega} & \vec{v} \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -\omega_3 & \omega_2 & v_1 \\ \omega_3 & 0 & -\omega_1 & v_2 \\ -\omega_2 & \omega_1 & 0 & v_3 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \tag{9.4}$$

where $\hat{\omega}$ is a $3 \times 3$ skew-symmetric matrix and $\vec{v} \in \mathbb{R}^3$. An equivalent representation is $\xi = [\omega_1, \omega_2, \omega_3, v_1, v_2, v_3]^T \in \mathbb{R}^6$. For the exponential and inverse exponential maps, we use the expressions provided on pp. 413–414 of [31]; we reproduce them here for completeness. The exponential map is given by

$$\exp(\hat{\xi}) = \begin{bmatrix} I & \vec{v} \\ 0 & 1 \end{bmatrix} \ (\omega = 0) \quad \text{and} \quad \exp(\hat{\xi}) = \begin{bmatrix} e^{\hat{\omega}} & A\vec{v} \\ 0 & 1 \end{bmatrix} \ (\omega \neq 0), \tag{9.5}$$

where $e^{\hat{\omega}}$ is given explicitly by Rodrigues's formula,

$$e^{\hat{\omega}} = I + \frac{\hat{\omega}}{\|\omega\|} \sin\|\omega\| + \frac{\hat{\omega}^2}{\|\omega\|^2} (1 - \cos\|\omega\|),$$

and

$$A = I + \frac{\hat{\omega}}{\|\omega\|^2} (1 - \cos\|\omega\|) + \frac{\hat{\omega}^2}{\|\omega\|^3} (\|\omega\| - \sin\|\omega\|).$$

The inverse exponential map is given by

$$\hat{\xi} = \log \begin{bmatrix} R & \vec{v} \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} \hat{\omega} & A^{-1}\vec{v} \\ 0 & 0 \end{bmatrix}, \tag{9.6}$$

where $\hat{\omega} = \log R$, and

$$A^{-1} = I - \frac{1}{2}\hat{\omega} + \frac{2\sin\|\omega\| - \|\omega\|(1 + \cos\|\omega\|)}{2\|\omega\|^2 \sin\|\omega\|}\, \hat{\omega}^2, \quad \omega \neq 0;$$

when $\omega = 0$, then $A = I$. These tools extend trivially to the product space: for example, the identity element of the product space is simply $(I_4, I_4, \dots, I_4)$ and the Lie algebra is $\mathfrak{m} = se(3) \times \cdots \times se(3)$. Parallel transport on the product space is performed component-wise, by parallel transporting on each component space. Let $T_O(SO(3))$ denote the tangent space at $O \in SO(3)$; then the parallel transport of a $W \in T_O(SO(3))$ from $O$ to $I_{3 \times 3}$ is given by $O^T W$. For more details on the properties of the special Euclidean group, we refer the interested reader to [31].

¹ We follow the notation described on p. 411 of [31] to denote the vector space representation ($\xi \in \mathbb{R}^6$) and the equivalent Lie algebra representation ($\hat{\xi} \in se(3)$).
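The exponential and inverse exponential maps in (9.5) and (9.6) translate almost directly into code. The sketch below is one possible NumPy rendering under a few assumptions: the helper names `hat`, `se3_exp`, and `se3_log` are illustrative, the rotation logarithm uses the standard formula $\hat{\omega} = \frac{\theta}{2\sin\theta}(R - R^T)$ with $\theta = \|\omega\|$, and the rotation angle is assumed to lie in $[0, \pi)$.

```python
import numpy as np

def hat(w):
    """Map w in R^3 to the 3x3 skew-symmetric matrix w_hat used in (9.4)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map (9.5): xi = [w1, w2, w3, v1, v2, v3] -> 4x4 element of SE(3)."""
    w, v = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    P = np.eye(4)
    if theta < 1e-10:                       # pure translation case (omega = 0)
        P[:3, 3] = v
        return P
    W = hat(w)
    # Rodrigues's formula for the rotation block
    R = np.eye(3) + (np.sin(theta) / theta) * W \
        + ((1 - np.cos(theta)) / theta**2) * (W @ W)
    A = np.eye(3) + ((1 - np.cos(theta)) / theta**2) * W \
        + ((theta - np.sin(theta)) / theta**3) * (W @ W)
    P[:3, :3] = R
    P[:3, 3] = A @ v
    return P

def se3_log(P):
    """Inverse exponential map (9.6); assumes the rotation angle is in [0, pi)."""
    R, t = P[:3, :3], P[:3, 3]
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-10:
        return np.concatenate([np.zeros(3), t])
    W = (theta / (2 * np.sin(theta))) * (R - R.T)     # w_hat = log R
    w = np.array([W[2, 1], W[0, 2], W[1, 0]])
    coeff = (2 * np.sin(theta) - theta * (1 + np.cos(theta))) \
        / (2 * theta**2 * np.sin(theta))
    A_inv = np.eye(3) - 0.5 * W + coeff * (W @ W)
    return np.concatenate([w, A_inv @ t])

xi = np.array([0.3, -0.2, 0.5, 1.0, 2.0, -1.0])
P = se3_exp(xi)          # point on SE(3)
xi_back = se3_log(P)     # recovers xi up to floating-point error
```

For the product space, the same two functions would simply be applied to each $SE(3)$ component in turn.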

9.2.3

Grassmann Manifold as a Shape Space

To visualize the action space, we also use shape silhouettes of an actor for different activities. These are interpreted as points on the Grassmann manifold. To obtain a shape as a point, we first obtain a landmark representation of the silhouette by uniformly sampling the shape. Let $L = [(x_1, y_1); (x_2, y_2); \dots; (x_m, y_m)]$ be an $m \times 2$ matrix that defines $m$ points on the silhouette whose centroid has been translated to zero. The affine shape space [18] is useful to remove small variations in camera locations or the pose of the subject. Affine transforms of a base shape $L_{base}$ can be expressed as $L_{affine}(A) = L_{base} A^T$, and this multiplication by a full-rank matrix on the right preserves the column space of the matrix $L_{base}$. Thus the 2D subspace of $\mathbb{R}^m$ spanned by the columns of the shape, i.e. $\mathrm{span}(L_{base})$, is invariant to affine transforms of the shape. Subspaces such as these can be identified as points on a Grassmann manifold [6, 44].

Denoted by $G_{k,m-k}$, the Grassmann manifold is the space whose points are $k$-dimensional hyperplanes (containing the origin) in $\mathbb{R}^m$. An equivalent definition of the Grassmann manifold is as follows: to each $k$-plane $\nu$ in $G_{k,m-k}$ corresponds a unique $m \times m$ orthogonal projection matrix $P$, which is idempotent and of rank $k$. If the columns of a tall $m \times k$ matrix $Y$ form an orthonormal basis of $\nu$, then $YY^T = P$. The set of all possible projection matrices, $\mathbb{P}$, is then diffeomorphic to $G$. The identity element of $\mathbb{P}$ is defined as $Q = \mathrm{diag}(I_k, 0_{m-k,m-k})$, where $0_{a,b}$ is an $a \times b$ matrix of zeros and $I_k$ is the $k \times k$ identity matrix. The Grassmann manifold $G$ (or $\mathbb{P}$) is a quotient space of the orthogonal group $O(m)$. Therefore, the geodesic on this manifold can be made explicit by lifting it to a particular geodesic in $O(m)$ [36]. The tangent $X$ to the lifted geodesic curve in $O(m)$ then defines the velocity associated with the curve in $\mathbb{P}$. The tangent space of $O(m)$ at the identity is $o(m)$, the space of $m \times m$ skew-symmetric matrices $X$. Moreover, in $o(m)$ the Riemannian metric is just the inner product $\langle X_1, X_2 \rangle = \mathrm{trace}(X_1 X_2^T)$, which is inherited by $\mathbb{P}$ as well. The geodesics in $\mathbb{P}$ passing through the point $Q$ (at time $t = 0$) are of the type $\alpha: (-\epsilon, \epsilon) \to \mathbb{P}$, $\alpha(t) = \exp(tX)\, Q \exp(-tX)$, where $X$ is a skew-symmetric matrix belonging to the set $M$, with

$$M = \left\{ \begin{bmatrix} 0 & A \\ -A^T & 0 \end{bmatrix} : A \in \mathbb{R}^{k \times (m-k)} \right\} \subset o(m). \tag{9.7}$$

Therefore the geodesic between $Q$ and any point $P$ is completely specified by an $X \in M$ such that $\exp(X)\, Q \exp(-X) = P$. We can construct a geodesic between any two points $P_1, P_2 \in \mathbb{P}$ by rotating them to $Q$ and some $P \in \mathbb{P}$. Readers are referred to [36] for more details on the exponential and logarithmic maps of $G_{k,m-k}$.
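The affine invariance of the projection-matrix representation is easy to check numerically. The sketch below is illustrative only: the landmarks are random numbers rather than a real silhouette, and `shape_to_projection` is a hypothetical helper name. It maps an $m \times 2$ landmark matrix to the orthogonal projection onto its column space and confirms that an affine transform of the shape yields the same point on the Grassmann manifold.

```python
import numpy as np

def shape_to_projection(L):
    """Map an m x 2 landmark matrix to the m x m projection onto its column space."""
    L = L - L.mean(axis=0)                             # translate the centroid to zero
    U, _, _ = np.linalg.svd(L, full_matrices=False)    # U: m x 2, orthonormal columns
    return U @ U.T                                     # P = U U^T

rng = np.random.default_rng(0)
L_base = rng.standard_normal((50, 2))      # synthetic stand-in for silhouette landmarks
A = np.array([[1.5, 0.4],
              [-0.3, 0.8]])                # an arbitrary full-rank 2 x 2 matrix
L_affine = L_base @ A.T                    # affine transform of the base shape

P_base = shape_to_projection(L_base)
P_affine = shape_to_projection(L_affine)   # same subspace, hence the same P
```

Both matrices are idempotent with trace 2, consistent with a rank-2 orthogonal projection.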


9.2.4

A Preliminary Example: Dictionary Learning for Human Actions Using Extrinsic Representations

A commonly used trick to work around the non-linear nature of manifold-valued data is to use extrinsic representations. For certain spaces, isometries may exist that enable accurate representations, but this is not always the case. We will show how even a simple linear model for actions can have attractive generative properties [3]. Human actions tend to evolve smoothly over time, and in feature space these are typically geodesic-like trajectories. In the Euclidean case, one can think of actions as evolving over hyperlines emanating from a single point, the starting pose, or more generally, different action classes lie on different subspaces, each with its own starting pose. To cluster such data that lie along hyperlines, He et al. [24] proposed the K-hyperline clustering algorithm, an iterative procedure that performs a least squares fit of K one-dimensional linear subspaces to the training data. Such a clustering is also closely related to dictionary learning [42]. This can be generalized such that the elementary features in the dictionary correspond to the 1D affine subspaces that represent human activities. The proposed dictionary is learned with features that are extracted per frame from the videos in an action dataset. Each dictionary atom consists of a tuple: a point and a direction in space. We also introduce new constraints to the traditional sparse coding problem, and adapt it to the heterogeneous dictionary. We show that this can be an effective generative model for human actions. It is interesting to note that even with extrinsic features, and a simple linear model to approximate temporal information, the resulting sparse codes show promise for action recognition and for reconstructing unseen actions.

Extrinsic Representation of the Grassmann Manifold

We use the shape silhouette feature obtained from the UMD Actions Dataset [46]. As described earlier, these can be represented as affine invariant subspaces, which in turn are interpreted as points on the Grassmann manifold. With each set of landmarks on the shape, we generate an $m \times m$ projection matrix $P = UU^T$, where $L = USV^T$ is the rank-2 SVD. Let $P_v$ be the vectorized form of $P$; we use $P_v$ as a feature to learn our dictionary. To recover the shape from this vector, we re-obtain the projection matrix $P$ and perform a rank-2 SVD on it. The feature corresponding to a shape at time $t$ is then generated as $P_v(t) = \mu_j + \beta(t) d_j$, parameterized by $\beta(t)$, which determines how far one must travel from $\mu_j$ along the direction $d_j$.
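As a rough illustration of this extrinsic pipeline, the following sketch vectorizes projection matrices, forms a feature of the form $\mu_j + \beta d_j$, and recovers a rank-2 basis from it with an SVD. Everything here is synthetic and the helper names are hypothetical; in particular, the point $\mu_j$ and direction $d_j$ are built from two random shapes rather than learned from data, and the recovered basis determines the shape only up to an affine transform.

```python
import numpy as np

def projection_from_landmarks(L):
    """P = U U^T from the rank-2 SVD of the centered m x 2 landmark matrix."""
    U, _, _ = np.linalg.svd(L - L.mean(axis=0), full_matrices=False)
    return U @ U.T

def recover_basis(p_vec, m, k=2):
    """Reshape a vectorized feature to m x m and keep its top-k left singular vectors."""
    P = p_vec.reshape(m, m)
    U, _, _ = np.linalg.svd(P)
    return U[:, :k]

rng = np.random.default_rng(1)
m = 40
P0 = projection_from_landmarks(rng.standard_normal((m, 2)))
P1 = projection_from_landmarks(rng.standard_normal((m, 2)))

mu = P0.reshape(-1)                  # vectorized starting pose (plays the role of mu_j)
d = P1.reshape(-1) - mu
d /= np.linalg.norm(d)               # unit direction in the embedding space (d_j)

feature = mu + 0.3 * d               # P_v(t) = mu_j + beta(t) d_j with beta(t) = 0.3
basis = recover_basis(feature, m)    # orthonormal basis of the recovered 2D subspace
basis_at_mu = recover_basis(mu, m)   # at beta = 0 this spans the original subspace
```

Note that `feature` is in general not itself a projection matrix; the rank-2 SVD acts as the step that maps it back to a valid subspace.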

9.2.4.1 Dictionary Learning and Sparse Coding Problem

When a dictionary is constructed using K-hyperline clustering, each atom corresponds to a linear subspace. We generalize this dictionary to be a collection of affine subspaces, where each atom is described by a point and an associated direction in space. To learn such a dictionary, we propose a 1D affine subspace clustering algorithm. In this method, we add to K-hyperline clustering the step of calculating the sample mean $\mu_j$ of the $j$th cluster, along with the least-squares fit of a 1D subspace $d_j$. The algorithm is described in Table 9.1. To identify the cluster membership, we project a data sample onto each dictionary atom and choose the one that results in the least representation error. The projection is performed as

$$P_H(x) = \mu + \hat{\beta} d, \quad \text{where} \quad \hat{\beta} = \arg\min_{\beta} \|x - \mu - \beta d\|_2^2. \tag{9.8}$$
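The projection in (9.8) and the alternating membership and update steps of the 1D affine subspace clustering can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the random initialization of memberships (instead of initializing the means and directions), the re-seeding of near-empty clusters, and the synthetic data are all assumptions. Directions are taken to be unit-norm, in which case the least-squares coefficient is $d^T(x - \mu)$.

```python
import numpy as np

def project_affine(x, mu, d):
    """Project x onto the 1D affine subspace {mu + beta * d}; d is assumed unit-norm."""
    beta = d @ (x - mu)                    # least-squares solution of (9.8)
    return mu + beta * d, beta

def affine_subspace_clustering(X, K, n_iter=50, seed=0):
    """X: (T, n) samples. Returns means (K, n), unit directions (K, n), labels (T,)."""
    rng = np.random.default_rng(seed)
    T, n = X.shape
    labels = rng.integers(0, K, size=T)    # random initial memberships (assumption)
    mus = np.zeros((K, n))
    dirs = np.zeros((K, n))
    for _ in range(n_iter):
        # Update step: sample mean and first principal component of each cluster.
        for j in range(K):
            members = X[labels == j]
            if len(members) < 2:           # re-seed clusters that became too small
                members = X[rng.choice(T, size=2, replace=False)]
            mus[j] = members.mean(axis=0)
            _, _, Vt = np.linalg.svd(members - mus[j], full_matrices=False)
            dirs[j] = Vt[0]
        # Membership step: assign each sample to the atom with least projection error.
        resid = X[:, None, :] - mus[None, :, :]             # (T, K, n)
        betas = np.einsum('tkn,kn->tk', resid, dirs)        # (T, K)
        errors = np.linalg.norm(resid - betas[:, :, None] * dirs[None, :, :], axis=2)
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):              # memberships stopped changing
            break
        labels = new_labels
    return mus, dirs, labels

# Synthetic data: noisy samples along two lines in R^3.
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, size=(100, 1))
X = np.vstack([
    np.array([0.0, 0.0, 0.0]) + t[:50] * np.array([1.0, 0.0, 0.0]),
    np.array([5.0, 5.0, 5.0]) + t[50:] * np.array([0.0, 1.0, 0.0]),
]) + 0.01 * rng.standard_normal((100, 3))

mus, dirs, labels = affine_subspace_clustering(X, K=2)
```

Like K-means, this alternating scheme only reaches a local optimum, so in practice one might run it from several initializations and keep the solution with the lowest total projection error.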

Note that in this case, the least squares solution for $\beta$ is $d^T(x - \mu)$ (for a unit-norm direction $d$).

Table 9.1 The dictionary learning algorithm

Input
  Features $\{x_1, \dots, x_T\}$ and size of dictionary, $K$.
Output
  Affine subspaces $\{H_1, \dots, H_K\}$ represented using the means $\{\mu_1, \dots, \mu_K\}$ and the directions $\{d_1, \dots, d_K\}$.
  Membership classes $C_1, \dots, C_K$.
Algorithm
  Initialize: $\{\mu_1, \dots, \mu_K\}$ and $\{d_1, \dots, d_K\}$.
  while convergence not reached
    Compute memberships:
      - For each sample $x_i$, compute the projection of $x_i$ onto each $H_j$, denoted by $P_{H_j}(x_i)$.
      - $k = \arg\min_{j \in \{1, \dots, K\}} \|x_i - P_{H_j}(x_i)\|$ and $C_k = C_k \cup \{i\}$.
    Update $H_j$:
      For each cluster $j$, compute $\{\mu_j, d_j\}$ as the sample mean and the first principal component of all samples indexed by $C_j$, respectively.
  end

Sparse Coding

Let us assume that a test sample in $\mathbb{R}^n$ can be represented as a linear combination of a small number of affine subspaces. Assuming that the set of dictionary atoms given by $\{\mu_j, d_j\}_{j=1}^K$ is known, the generative model for a test sample $x$ can be written as

$$x = \sum_{j \in S} \alpha_j \mu_j + \beta_j d_j, \tag{9.9}$$

where $S$ is the set of atoms that participate in the representation of $x$. The solution to (9.9) can be obtained using convex programming. The key consideration is that, for a given $j$, $\mu_j$ and $d_j$ must be chosen together. Furthermore, it is also useful to ensure that the new mean is in the convex hull of the means of $S$. This can be posed and solved as a group Lasso [51],

$$\arg\min_{\alpha, \beta} \|x - (M\alpha + D\beta)\|_2^2 + \lambda \sum_{i=1}^{K} \left\| \begin{bmatrix} \alpha_i \\ \beta_i \end{bmatrix} \right\|_2 \quad \text{s.t.} \quad \alpha_i \geq 0, \ \sum_i \alpha_i = 1, \tag{9.10}$$

where $M = [\mu_j]_{j=1}^K$ and $D = [d_j]_{j=1}^K$.

Experiments

Reconstruction of Unseen Actions. In this experiment, we test the efficiency of the proposed dictionary in modeling unseen actions from test data. Since every action is modeled as a combination of means and directions, an unseen action will typically have a mean that is different from any of the previously learned actions. Hence, we model the new mean as a linear combination of means and find its principal direction as a combination of the known directions. For our experiments, we obtained activities from the Weizmann activity dataset [19], which consists of 90 videos of 10 different actions, each performed by 9 different persons. The classes of actions include running, jumping, walking, side walking etc. In order to evaluate the performance of the proposed sparse coding model, we used the features of all subjects from 6 different activities in the Weizmann dataset for obtaining the dictionary and evaluated the reconstruction error for features from the other 4 activities. The set of unseen testing activities included jack, pjump, skip and wave1. For all our experiments on this dataset we used the histogram of oriented optical flow (HOOF) feature that was introduced in [10]. This feature bins optical flow vectors based on their directions and their primary angle with the


horizontal axis, weighted by their magnitudes. Using magnitudes alone is susceptible to noise and can be very sensitive to scale. Thus all optical flow vectors $v = [x, y]^T$ with direction $\theta = \tan^{-1}(y/x)$ in the range

$$-\frac{\pi}{2} + \pi\,\frac{b-1}{B} \le \theta < -\frac{\pi}{2} + \pi\,\frac{b}{B}$$

will contribute $\sqrt{x^2 + y^2}$ to the sum in bin $b$, where $1 \le b \le B$; typically $B = 30$ is used. Finally, the histogram is normalized to sum up to 1. Using the training activities, we computed $K$ (fixed at 20, 30 and 40) clusters to identify the principal directions and their cluster centroids. For the test activities, we performed sparse coding of the features using the computed centers and directions as the dictionary atoms. The reconstruction-error table (a) compares the average reconstruction error obtained for features from the test activities using different coding schemes. Since more than one atom can be used for representation, the reconstruction error in our model is significantly lower than those obtained with K-means or K-hyperline clustering (Fig. 9.1).
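The binning rule for the HOOF feature can be sketched as follows. This is an illustrative sketch rather than the reference implementation of [10]: the function name `hoof`, the use of `arctan2` followed by folding into $[-\pi/2, \pi/2)$, and the convention that a vector pointing straight up falls into the first bin are assumptions made here.

```python
import numpy as np

def hoof(flow, num_bins=30):
    """Histogram of oriented optical flow for an array of (x, y) flow vectors."""
    flow = np.asarray(flow, dtype=float).reshape(-1, 2)
    x, y = flow[:, 0], flow[:, 1]
    magnitude = np.hypot(x, y)                      # sqrt(x^2 + y^2)
    # Primary angle with the horizontal axis: fold arctan2 into [-pi/2, pi/2),
    # which matches tan^{-1}(y/x) whenever x != 0.
    theta = np.arctan2(y, x)
    theta = np.where(theta >= np.pi / 2, theta - np.pi, theta)
    theta = np.where(theta < -np.pi / 2, theta + np.pi, theta)
    # Zero-based bin index b - 1 for the range stated in the text.
    bins = np.floor((theta + np.pi / 2) / (np.pi / num_bins)).astype(int)
    bins = np.clip(bins, 0, num_bins - 1)
    hist = np.bincount(bins, weights=magnitude, minlength=num_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist      # normalize to sum to 1
```

Because only the angle with the horizontal axis is used, a vector and its mirror image about the vertical axis land in the same bin, so the histogram does not distinguish leftward from rightward motion.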

9.2.4.2 Limitations

It is evident from above that even an extrinsic approach to modeling manifold-valued features and a linear approximation for temporal information holds promise. However, these can quickly become insufficient for several reasons: (a) in many cases the extrinsic embedding may not be clear or accurate; (b) temporal mis-alignment can lead to large distortions; (c) linear approximations only suffice for very simple actions. As a result, a more formal approach is needed that can capture both the non-linearity of the features and the temporal information.

Fig. 9.1 Actions generated by sampling along the learned lines on the UMD actions data set [46]. Some generated actions such as wave, talk on phone, kick appear to be laterally inverted as our representation is affine invariant

| Method | K=20 | K=30 | K=40 |
|---|---|---|---|
| K-means | 0.3295 | 0.3069 | 0.2985 |
| K-hyperline ($d$) | 0.2657 | 0.2485 | 0.2399 |
| $(\mu, d)$ dictionary | 0.1171 | 0.1039 | 0.0956 |

(a) Comparison of reconstruction error obtained using the proposed sparse coding, for different numbers of clusters $K$

| Method | Recognition (%) |
|---|---|
| Proposed dictionary | 98.88 |
| K-means dictionary | 84.44 |
| Guha et al., Multiple Dictionaries [21] | 98.9 |
| Guha et al., Single Dictionary [21] | 96.67 |
| Chaudhry et al. [10] | 95.66 |

(b) Recognition performance (%) using the sparse codes

9.3 Elastic Representations of Riemannian Trajectories

We now address the problem of temporal realignment for Riemannian trajectories. We employ the Transport Square Root Velocity Function (TSRVF), which was recently proposed in [39] as a representation to perform warp invariant comparison between multiple Riemannian trajectories. Using the TSRVF representation for human actions, we propose to learn the latent function space of these Riemannian trajectories in a much lower dimensional space [4, 5]. As we demonstrate in our experiments, such a mapping also provides some robustness to noise, which is essential when dealing with noisy sensors.

Let $\alpha$ denote a smooth trajectory on $M$ and let $\mathcal{M}$ denote the set of all such trajectories: $\mathcal{M} = \{\alpha: [0,1] \to M \mid \alpha \text{ is smooth}\}$. Also define $\Gamma$ to be the set of all orientation preserving diffeomorphisms of $[0,1]$: $\Gamma = \{\gamma: [0,1] \to [0,1] \mid \gamma(0) = 0, \gamma(1) = 1, \gamma \text{ is a diffeomorphism}\}$. It is important to note that $\Gamma$ forms a group under the composition operation. If $\alpha$ is a trajectory on $M$, then $\alpha \circ \gamma$ is a trajectory that follows the same sequence of points as $\alpha$ but at the evolution rate governed by $\gamma$. The group $\Gamma$ acts on $\mathcal{M}$, $\mathcal{M} \times \Gamma \to \mathcal{M}$, according to $(\alpha, \gamma) = \alpha \circ \gamma$.

To construct the TSRVF representation, we require a formulation for parallel transporting a vector between two points $p, q \in M$, denoted by $(v)_{p \to q}$. For cases where $p$ and $q$ do not fall in the cut loci of each other, the geodesic remains unique, and therefore the parallel transport is well defined (Fig. 9.2). The TSRVF [39] for a smooth trajectory $\alpha \in \mathcal{M}$ is the parallel transport of a scaled velocity vector field of $\alpha$ to a reference point $c \in M$ according to:

$$h_\alpha(t) = \begin{cases} \dfrac{\dot{\alpha}(t)_{\alpha(t) \to c}}{\sqrt{|\dot{\alpha}(t)|}} \in T_c(M), & |\dot{\alpha}(t)| \ne 0 \\[2ex] 0 \in T_c(M), & |\dot{\alpha}(t)| = 0 \end{cases} \tag{9.11}$$

where $|\cdot|$ denotes the norm related to the Riemannian metric on $M$ and $T_c(M)$ denotes the tangent space of $M$ at $c$. Since $\alpha$ is smooth, so is the vector field $h_\alpha$. Let $\mathcal{H} \subset T_c(M)^{[0,1]}$ be the set of smooth curves in $T_c(M)$ obtained as TSRVFs of trajectories in $M$, $\mathcal{H} = \{h_\alpha \mid \alpha \in \mathcal{M}\}$.

Distance Between TSRVFs. Since the TSRVFs lie on $T_c(M)$, the distance is measured by the standard $L^2$ norm given by

$$d_h(h_{\alpha_1}, h_{\alpha_2}) = \left( \int_0^1 |h_{\alpha_1}(t) - h_{\alpha_2}(t)|^2 \, dt \right)^{1/2}. \tag{9.12}$$

If a trajectory $\alpha$ is warped by $\gamma$, to result in $\alpha \circ \gamma$, the TSRVF of the warped trajectory is given by:

$$h_{\alpha \circ \gamma}(t) = h_\alpha(\gamma(t)) \sqrt{\dot{\gamma}(t)}. \tag{9.13}$$

The distance between TSRVFs remains unchanged under warping, i.e.

$$d_h(h_{\alpha_1}, h_{\alpha_2}) = d_h(h_{\alpha_1 \circ \gamma}, h_{\alpha_2 \circ \gamma}). \tag{9.14}$$
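The invariance in (9.14) can be checked numerically in the simplest setting. The sketch below assumes the Euclidean special case $M = \mathbb{R}^2$, where parallel transport is the identity and the TSRVF reduces to a scaled velocity; the two test curves, the warping function, and all function names are hypothetical choices made for illustration, and derivatives and integrals are approximated on a uniform grid.

```python
import numpy as np

def srvf(alpha, t):
    """Euclidean special case of (9.11): velocity divided by sqrt(speed)."""
    vel = np.gradient(alpha, t, axis=0, edge_order=2)    # finite-difference velocity
    speed = np.linalg.norm(vel, axis=1)
    h = np.zeros_like(vel)
    nz = speed > 1e-12                                   # h = 0 where the speed vanishes
    h[nz] = vel[nz] / np.sqrt(speed[nz])[:, None]
    return h

def l2_distance(h1, h2, t):
    """Distance (9.12), integrated with the trapezoidal rule."""
    sq = np.sum((h1 - h2) ** 2, axis=1)
    return np.sqrt(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(t)))

def curve1(s):
    return np.stack([np.cos(np.pi * s), np.sin(np.pi * s)], axis=1)

def curve2(s):
    return np.stack([s, s ** 2], axis=1)

t = np.linspace(0.0, 1.0, 2001)
# Assumed warping: gamma(0) = 0, gamma(1) = 1, derivative 1 + 0.3 cos(2 pi t) > 0.
gamma = t + 0.3 * np.sin(2 * np.pi * t) / (2 * np.pi)

d_plain = l2_distance(srvf(curve1(t), t), srvf(curve2(t), t), t)
d_warped = l2_distance(srvf(curve1(gamma), t), srvf(curve2(gamma), t), t)
print(d_plain, d_warped)   # the two values should agree up to discretization error
```

Warping both curves by the same $\gamma$ leaves the distance unchanged because the factor $\sqrt{\dot{\gamma}(t)}$ in (9.13) turns into the Jacobian of the change of variables $s = \gamma(t)$ inside the integral.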


Fig. 9.2 Row wise from top: $S_1$, $S_2$, warped action $\tilde{S}_2$, warped mean, unwarped mean. The TSRVF can enable more accurate estimation of statistical quantities such as the average of two actions $S_1$, $S_2$

The invariance to group action is important as it allows us to compare two trajectories using the optimization problem stated next.

Metric Invariant to Temporal Mis-alignment. Next, we will use $d_h$ to define a metric between trajectories that is invariant to their time warpings. The basic idea is to partition $\mathcal{M}$ using an equivalence relation defined by the action of $\Gamma$ and then to inherit $d_h$ on the quotient space of this equivalence relation. Any two trajectories $\alpha_1, \alpha_2$ are set to be equivalent if there is a warping function $\gamma \in \Gamma$ such that $\alpha_1 = \alpha_2 \circ \gamma$. The distance $d_h$ can be inherited as a metric between the orbits if two conditions are satisfied: (1) the action of $\Gamma$ on $\mathcal{M}$ is by isometries, and (2) the equivalence classes are closed sets. While the first condition has already been verified (see Eq. 9.14), the second condition needs more consideration. In fact, since $\Gamma$ is an open set (under the standard norm), its equivalence classes are also consequently open. This issue is resolved in [39] using a larger, closed set of time-warping functions as follows. Define $\tilde{\Gamma}$ to be the set of all non-decreasing, absolutely continuous functions $\gamma: [0,1] \to [0,1]$ such that $\gamma(0) = 0$ and $\gamma(1) = 1$. This $\tilde{\Gamma}$ is a semi-group with the composition operation. More importantly, the original warping group $\Gamma$ is a dense subset of $\tilde{\Gamma}$ and the elements of $\tilde{\Gamma}$ warp the trajectories in the same way as $\Gamma$, except that they allow for singularities [39]. If we define the equivalence relation using $\tilde{\Gamma}$ instead of $\Gamma$, then orbits are closed and the second condition is satisfied as well. This equivalence relation takes the following form. Any two trajectories $\alpha_1, \alpha_2$ are said to be equivalent if there exists a $\gamma \in \tilde{\Gamma}$ such that $\alpha_1 = \alpha_2 \circ \gamma$. Since $\Gamma$ is dense in $\tilde{\Gamma}$, and since the mapping $\alpha \mapsto (\alpha(0), h_\alpha)$ is bijective, we can rewrite this equivalence relation in terms of the TSRVF as $\alpha_1 \sim \alpha_2$ if (a) $\alpha_1(0) = \alpha_2(0)$, and (b) there exists a sequence $\{\gamma_k\} \in \Gamma$ such that $\lim_{k \to \infty} h_{\alpha_1 \circ \gamma_k} = h_{\alpha_2}$, where convergence is measured under the $L^2$ metric. In other words, two trajectories are said to be equivalent if they have the same starting point, and the TSRVF of one can be time-warped into the TSRVF of the other using a sequence of warpings. We will use the notation $[\alpha]$ to denote the set of all trajectories that are equivalent to a given $\alpha \in \mathcal{M}$. Now, the distance $d_h$ can be inherited on the quotient space, with the result $d_s$ on $\mathcal{M}/\sim$ (or equivalently $\mathcal{H}/\sim$) given by:


$$d_s([\alpha_1], [\alpha_2]) \equiv \inf_{\gamma_1, \gamma_2 \in \tilde{\Gamma}} d_h\big((h_{\alpha_1}, \gamma_1), (h_{\alpha_2}, \gamma_2)\big) = \inf_{\gamma_1, \gamma_2 \in \tilde{\Gamma}} \left( \int_0^1 \left| h_{\alpha_1}(\gamma_1(t)) \sqrt{\dot{\gamma}_1(t)} - h_{\alpha_2}(\gamma_2(t)) \sqrt{\dot{\gamma}_2(t)} \right|^2 dt \right)^{1/2}. \tag{9.15}$$

The interesting part is that we do not have to solve for the optimizers in $\tilde{\Gamma}$ since $\Gamma$ is dense in $\tilde{\Gamma}$ and, for any $\delta > 0$, there exists a $\gamma^*$ such that

$$\left| d_h(h_{\alpha_1}, h_{\alpha_2 \circ \gamma^*}) - d_s([h_{\alpha_1}], [h_{\alpha_2}]) \right| < \delta. \tag{9.16}$$

This $\gamma^*$ may not be unique but any such $\gamma^*$ is sufficient for our purpose. Further, since $\gamma^* \in \Gamma$, it has an inverse that can be used in further analysis. The minimization over $\Gamma$ is solved using dynamic programming. Here one samples the interval $[0,1]$ using $T$ discrete points and then restricts to only piece-wise linear $\gamma$ that pass through the $T \times T$ grid. Further properties of the metric $d_s$ are provided in [39].

Additional Considerations for Human Actions. In the original formulation of the TSRVF [39], a set of trajectories were all warped together to produce the mean trajectory. In the context of analyzing skeletal human actions, several design choices are available to warp different actions and may be chosen to potentially improve performance. For example, warping actions per class may work better for certain kinds of actions that have a very different speed profile; this can be achieved by modifying (9.15) to use class information. Next, since the work here is concerned with skeletal representations of humans, different joints have varying degrees of freedom for all actions. Therefore, in the context of skeletal representations, it is reasonable to assume that different joints require different constraints on warping functions. While it may be harder to explicitly impose different constraints to solve for $\gamma$, it can be easily achieved by solving for $\gamma$ per joint trajectory instead of the entire skeleton.

9.3.1 Riemannian Functional Coding

Typical representations for action recognition tend to be extremely high dimensional, in part because the features are extracted per-frame and stacked. Any computation on such non-Euclidean trajectories can very easily become involved. For example, a recently proposed skeletal representation [49] results in a 38220 dimensional vector for a 15 joint skeletal system when observed for 35 frames. Such features do not take into account the physical constraints of the human body, which translate to giving varying degrees of freedom to different joints. It is therefore reasonable to assume that the true space of actions is much lower dimensional. This is similar to the argument that motivated manifold learning for image data, where the number of observed image pixels may be extremely high, but the object or scene is often considered to lie on a lower dimensional manifold. A lower dimensional embedding will provide a robust, computationally efficient, and intuitive framework for analysis of actions. In this section, we address these issues by studying the statistical properties of trajectories on Riemannian manifolds to extract lower dimensional representations or codes. Elastic representations for Riemannian trajectories are relatively new and the lower dimensional embedding of such sequences has remained unexplored. We employ the transport square-root velocity function (TSRVF) representation, a recent development in statistics [39], to provide a warp invariant representation of the Riemannian trajectories. The TSRVF is also advantageous


as it provides a functional representation that is Euclidean. Exploiting this, we propose to learn the low dimensional embedding with a Riemannian functional variant of popular coding techniques. In other words, we are interested in parameterization of Riemannian trajectories, i.e. for $N$ actions $A_i(t)$, $i = 1, \dots, N$, our goal is to learn $\mathcal{F}$ such that $\mathcal{F}(x) = A_i$, where $x \in \mathbb{R}^k$ is the set of parameters. Such a model will allow us to compare actions by simply comparing them in their parametric space with respect to $\mathcal{F}$, with significantly faster distance computations, while being able to reconstruct the original actions. In this work, we learn two different kinds of functions using PCA and dictionary learning, which have attractive properties for recognition and visualization. For data in the Euclidean domain, dictionary learning methods such as PCA, K-SVD etc. can learn a basis of actions/movements onto which we can project the high dimensional actions. The resulting representation in this new basis is significantly lower dimensional, while still capturing all the information. The challenge, however, is in being able to reliably vectorize these Riemannian trajectories, while taking geometry and the temporal nature into account. For example, without taking geometry into account, PCA is entirely meaningless, whereas ignoring time information (such as temporal mis-alignment) can artificially inflate the variance, leading to a poor embedding. We study two main applications of coding: (1) visualization of high dimensional Riemannian trajectories, and (2) classification. For visualization, one key property is to be able to reconstruct back from the low dimensional space, which is easily done using principal component analysis (PCA). For classification, we show results on discriminative coding methods such as K-SVD and LC-KSVD, in addition to PCA, that learn a dictionary where each atom is a trajectory.
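The encode/reconstruct property that makes PCA convenient for visualization can be sketched on plain Euclidean vectors. In the sketch below each row of `V` stands in for one vectorized trajectory representation; the function names and the data layout are assumptions made for illustration, not the chapter's algorithm.

```python
import numpy as np

def pca_fit(V, d):
    """Return the mean and the top-d principal directions of the rows of V (N x D)."""
    mean = V.mean(axis=0)
    _, _, Vt = np.linalg.svd(V - mean, full_matrices=False)
    return mean, Vt[:d]                  # basis has shape (d, D), orthonormal rows

def pca_encode(V, mean, basis):
    """Map D-dimensional vectors to d-dimensional codes."""
    return (V - mean) @ basis.T

def pca_decode(codes, mean, basis):
    """Reconstruct D-dimensional vectors from their codes."""
    return codes @ basis + mean
```

When the centered data really lie in a $d$-dimensional subspace, decoding recovers the inputs and Euclidean distances between codes equal distances between the original vectors, which is what allows comparisons to be carried out in the low dimensional space.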
More generally, common manifold learning techniques such as Isomap [41], and LLE [34] can also be used to perform coding, while keeping in mind that it is not easy to obtain the original feature from the low dimensional code. Further, the trajectories tend to be extremely high


dimensional (of the order of $10^4$ to $10^5$), therefore most manifold learning techniques have massive memory requirements. Next we describe the algorithm to obtain low dimensional codes using PCA and dictionary learning algorithms.

Dictionary Learning for Riemannian Trajectories. The TSRVF representation allows the evaluation of first and second order statistics on entire sequences of actions and the definition of quantities such as the variability of actions, which we can use to estimate the redundancy in the data, as in Euclidean space. We utilize the TSRVF to obtain the ideal warping between sequences, such that the warped sequence is equivalent to its TSRVF. To obtain a low dimensional embedding, first we represent the sequences as deviations from a reference sequence using tangent vectors. For manifolds such as $SE(3)$ the natural “origin” $I_4$ can be used; in other cases the sequence mean [39] by definition lies equi-distant from all the points and therefore is a suitable candidate. In all our experiments, we found the tangent vectors obtained from the mean sequence to be much more robust and discriminative. Next, we obtain the shooting vectors, which are the tangent vectors one would travel along, starting from the average sequence $\mu(t)$ at $\tau = 0$ to reach the $i$th action $\tilde{\alpha}_i(t)$ at time $\tau = 1$; this is depicted in Fig. 9.3a. Note here that $\tau$ is the time in the sequence space, which is different from $t$, the time in the original manifold space. The combined shooting vectors can be interpreted as a sequence tangent that takes us from one point to another in sequence space, in unit time. Since we are representing each trajectory as a vector field, we can use existing algorithms to perform coding, treating the sequences as points, because we have accounted for the temporal information. Algorithm 1 describes the process to perform coding using a generic coding function represented as $\mathcal{F}: \mathbb{R}^D \to \mathbb{R}^d$, where $d$