
MATRIX ANALYSIS AND APPLICATIONS

This balanced and comprehensive study presents the theory, methods and applications of matrix analysis in a new theoretical framework, allowing readers to understand second-order and higher-order matrix analysis in a completely new light. Alongside the core subjects in matrix analysis, such as singular value analysis, the solution of matrix equations and eigenanalysis, the author introduces new applications and perspectives that are unique to this book. The very topical subjects of gradient analysis and optimization play a central role here. Also included are subspace analysis, projection analysis and tensor analysis, subjects which are often neglected in other books. Having provided a solid foundation to the subject, the author goes on to place particular emphasis on the many applications matrix analysis has in science and engineering, making this book suitable for scientists, engineers and graduate students alike.

XIAN-DA ZHANG is Professor Emeritus in the Department of Automation at Tsinghua University, Beijing. He was a Distinguished Professor at Xidian University, Xi'an, China – a post awarded by the Ministry of Education of China, and funded by the Ministry of Education of China and the Cheung Kong Scholars Programme – from 1999 to 2002. His areas of research include signal processing, pattern recognition, machine learning and related applied mathematics. He has published over 120 international journal and conference papers, and 7 books in Chinese. He taught the graduate course "Matrix Analysis and Applications" at Tsinghua University from 2004 to 2011.

MATRIX ANALYSIS AND APPLICATIONS

XIAN-DA ZHANG
Tsinghua University, Beijing

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi – 110002, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108417419
DOI: 10.1017/9781108277587

© Xian-Da Zhang 2017

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2017
Printed in the United Kingdom by Clays, St Ives plc
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-41741-9 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

To John Zhang, Ellen Zhang and Andrew Wei

Contents

Preface
Notation
Abbreviations
Algorithms

PART I  MATRIX ALGEBRA

1 Introduction to Matrix Algebra
  1.1 Basic Concepts of Vectors and Matrices
    1.1.1 Vectors and Matrices
    1.1.2 Basic Vector Calculus
    1.1.3 Basic Matrix Calculus
    1.1.4 Linear Independence of Vectors
    1.1.5 Matrix Functions
  1.2 Elementary Row Operations and Applications
    1.2.1 Elementary Row Operations
    1.2.2 Gauss Elimination Methods
  1.3 Sets, Vector Subspaces and Linear Mapping
    1.3.1 Sets
    1.3.2 Fields and Vector Spaces
    1.3.3 Linear Mapping
  1.4 Inner Products and Vector Norms
    1.4.1 Inner Products of Vectors
    1.4.2 Norms of Vectors
    1.4.3 Similarity Comparison Between Vectors
    1.4.4 Banach Space, Euclidean Space, Hilbert Space
    1.4.5 Inner Products and Norms of Matrices
  1.5 Random Vectors
    1.5.1 Statistical Interpretation of Random Vectors
    1.5.2 Gaussian Random Vectors
  1.6 Performance Indexes of Matrices
    1.6.1 Quadratic Forms
    1.6.2 Determinants
    1.6.3 Matrix Eigenvalues
    1.6.4 Matrix Trace
    1.6.5 Matrix Rank
  1.7 Inverse Matrices and Pseudo-Inverse Matrices
    1.7.1 Definition and Properties of Inverse Matrices
    1.7.2 Matrix Inversion Lemma
    1.7.3 Inversion of Hermitian Matrices
    1.7.4 Left and Right Pseudo-Inverse Matrices
  1.8 Moore–Penrose Inverse Matrices
    1.8.1 Definition and Properties
    1.8.2 Computation of Moore–Penrose Inverse Matrix
  1.9 Direct Sum and Hadamard Product
    1.9.1 Direct Sum of Matrices
    1.9.2 Hadamard Product
  1.10 Kronecker Products and Khatri–Rao Product
    1.10.1 Kronecker Products
    1.10.2 Generalized Kronecker Products
    1.10.3 Khatri–Rao Product
  1.11 Vectorization and Matricization
    1.11.1 Vectorization and Commutation Matrix
    1.11.2 Matricization of a Vector
    1.11.3 Properties of Vectorization Operator
  1.12 Sparse Representations
    1.12.1 Sparse Vectors and Sparse Representations
    1.12.2 Sparse Representation of Face Recognition
  Exercises

2 Special Matrices
  2.1 Hermitian Matrices
  2.2 Idempotent Matrix
  2.3 Permutation Matrix
    2.3.1 Permutation Matrix and Exchange Matrix
    2.3.2 Generalized Permutation Matrix
  2.4 Orthogonal Matrix and Unitary Matrix
  2.5 Band Matrix and Triangular Matrix
    2.5.1 Band Matrix
    2.5.2 Triangular Matrix
  2.6 Summing Vector and Centering Matrix
    2.6.1 Summing Vector
    2.6.2 Centering Matrix
  2.7 Vandermonde Matrix and Fourier Matrix
    2.7.1 Vandermonde Matrix
    2.7.2 Fourier Matrix
    2.7.3 Index Vectors
    2.7.4 FFT Algorithm
  2.8 Hadamard Matrix
  2.9 Toeplitz Matrix
    2.9.1 Symmetric Toeplitz Matrix
    2.9.2 Discrete Cosine Transform of Toeplitz Matrix
  Exercises

3 Matrix Differential
  3.1 Jacobian Matrix and Gradient Matrix
    3.1.1 Jacobian Matrix
    3.1.2 Gradient Matrix
    3.1.3 Calculation of Partial Derivative and Gradient
  3.2 Real Matrix Differential
    3.2.1 Calculation of Real Matrix Differential
    3.2.2 Jacobian Matrix Identification
    3.2.3 Jacobian Matrix of Real Matrix Functions
  3.3 Real Hessian Matrix and Identification
    3.3.1 Real Hessian Matrix
    3.3.2 Real Hessian Matrix Identification
  3.4 Complex Gradient Matrices
    3.4.1 Holomorphic Function and Complex Partial Derivative
    3.4.2 Complex Matrix Differential
    3.4.3 Complex Gradient Matrix Identification
  3.5 Complex Hessian Matrices and Identification
    3.5.1 Complex Hessian Matrices
    3.5.2 Complex Hessian Matrix Identification
  Exercises

PART II  MATRIX ANALYSIS

4 Gradient Analysis and Optimization
  4.1 Real Gradient Analysis
    4.1.1 Stationary Points and Extreme Points
    4.1.2 Real Gradient Analysis of f(x)
    4.1.3 Real Gradient Analysis of f(X)
  4.2 Gradient Analysis of Complex Variable Function
    4.2.1 Extreme Point of Complex Variable Function
    4.2.2 Complex Gradient Analysis
  4.3 Convex Sets and Convex Function Identification
    4.3.1 Standard Constrained Optimization Problems
    4.3.2 Convex Sets and Convex Functions
    4.3.3 Convex Function Identification
  4.4 Gradient Methods for Smooth Convex Optimization
    4.4.1 Gradient Method
    4.4.2 Conjugate Gradient Method
    4.4.3 Convergence Rates
  4.5 Nesterov Optimal Gradient Method
    4.5.1 Lipschitz Continuous Function
    4.5.2 Nesterov Optimal Gradient Algorithms
  4.6 Nonsmooth Convex Optimization
    4.6.1 Subgradient and Subdifferential
    4.6.2 Proximal Operator
    4.6.3 Proximal Gradient Method
  4.7 Constrained Convex Optimization
    4.7.1 Lagrange Multiplier Method
    4.7.2 Penalty Function Method
    4.7.3 Augmented Lagrange Multiplier Method
    4.7.4 Lagrangian Dual Method
    4.7.5 Karush–Kuhn–Tucker Conditions
    4.7.6 Alternating Direction Method of Multipliers
  4.8 Newton Methods
    4.8.1 Newton Method for Unconstrained Optimization
    4.8.2 Newton Method for Constrained Optimization
  4.9 Original–Dual Interior-Point Method
    4.9.1 Original–Dual Problems
    4.9.2 First-Order Original–Dual Interior-Point Method
    4.9.3 Second-Order Original–Dual Interior-Point Method
  Exercises

5 Singular Value Analysis
  5.1 Numerical Stability and Condition Number
  5.2 Singular Value Decomposition (SVD)
    5.2.1 Singular Value Decomposition
    5.2.2 Properties of Singular Values
    5.2.3 Rank-Deficient Least Squares Solutions
  5.3 Product Singular Value Decomposition (PSVD)
    5.3.1 PSVD Problem
    5.3.2 Accurate Calculation of PSVD
  5.4 Applications of Singular Value Decomposition
    5.4.1 Static Systems
    5.4.2 Image Compression
  5.5 Generalized Singular Value Decomposition (GSVD)
    5.5.1 Definition and Properties
    5.5.2 Algorithms for GSVD
    5.5.3 Two Application Examples of GSVD
  5.6 Low-Rank–Sparse Matrix Decomposition
    5.6.1 Matrix Decomposition Problems
    5.6.2 Singular Value Thresholding
    5.6.3 Robust Principal Component Analysis
  5.7 Matrix Completion
    5.7.1 Matrix Completion Problems
    5.7.2 Matrix Completion Model and Incoherence
    5.7.3 Singular Value Thresholding Algorithm
    5.7.4 Fast and Accurate Matrix Completion
  Exercises

6 Solving Matrix Equations
  6.1 Least Squares Method
    6.1.1 Ordinary Least Squares Methods
    6.1.2 Properties of Least Squares Solutions
    6.1.3 Data Least Squares
  6.2 Tikhonov Regularization and Gauss–Seidel Method
    6.2.1 Tikhonov Regularization
    6.2.2 Regularized Gauss–Seidel Method
  6.3 Total Least Squares (TLS) Methods
    6.3.1 TLS Problems
    6.3.2 TLS Solution
    6.3.3 Performances of TLS Solution
    6.3.4 Generalized Total Least Squares
    6.3.5 Total Least Squares Fitting
    6.3.6 Total Maximum Likelihood Method
  6.4 Constrained Total Least Squares
    6.4.1 Constrained Total Least Squares Method
    6.4.2 Harmonic Superresolution
    6.4.3 Image Restoration
  6.5 Subspace Method for Solving Blind Matrix Equations
  6.6 Nonnegative Matrix Factorization: Optimization Theory
    6.6.1 Nonnegative Matrices
    6.6.2 Nonnegativity and Sparsity Constraints
    6.6.3 Nonnegative Matrix Factorization Model
    6.6.4 Divergences and Deformed Logarithm
  6.7 Nonnegative Matrix Factorization: Optimization Algorithms
    6.7.1 Multiplication Algorithms
    6.7.2 Nesterov Optimal Gradient Algorithm
    6.7.3 Alternating Nonnegative Least Squares
    6.7.4 Quasi-Newton Method
    6.7.5 Sparse Nonnegative Matrix Factorization
  6.8 Sparse Matrix Equation Solving: Optimization Theory
    6.8.1 ℓ1-Norm Minimization
    6.8.2 Lasso and Robust Linear Regression
    6.8.3 Mutual Coherence and RIP Conditions
    6.8.4 Relation to Tikhonov Regularization
    6.8.5 Gradient Analysis of ℓ1-Norm Minimization
  6.9 Sparse Matrix Equation Solving: Optimization Algorithms
    6.9.1 Basis Pursuit Algorithms
    6.9.2 First-Order Augmented Lagrangian Algorithm
    6.9.3 Barzilai–Borwein Gradient Projection Algorithm
    6.9.4 ADMM Algorithms for Lasso Problems
    6.9.5 LARS Algorithms for Lasso Problems
    6.9.6 Covariance Graphical Lasso Method
    6.9.7 Homotopy Algorithm
    6.9.8 Bregman Iteration Algorithms
  Exercises

7 Eigenanalysis
  7.1 Eigenvalue Problem and Characteristic Equation
    7.1.1 Eigenvalue Problem
    7.1.2 Characteristic Polynomial
  7.2 Eigenvalues and Eigenvectors
    7.2.1 Eigenvalues
    7.2.2 Eigenvectors
  7.3 Similarity Reduction
    7.3.1 Similarity Transformation of Matrices
    7.3.2 Similarity Reduction of Matrices
    7.3.3 Similarity Reduction of Matrix Polynomials
  7.4 Polynomial Matrices and Balanced Reduction
    7.4.1 Smith Normal Forms
    7.4.2 Invariant Factor Method
    7.4.3 Conversion of Jordan Form and Smith Form
    7.4.4 Finding Smith Blocks from Jordan Blocks
    7.4.5 Finding Jordan Blocks from Smith Blocks
  7.5 Cayley–Hamilton Theorem with Applications
    7.5.1 Cayley–Hamilton Theorem
    7.5.2 Computation of Inverse Matrices
    7.5.3 Computation of Matrix Powers
    7.5.4 Calculation of Matrix Exponential Functions
  7.6 Application Examples of Eigenvalue Decomposition
    7.6.1 Pisarenko Harmonic Decomposition
    7.6.2 Discrete Karhunen–Loeve Transformation
    7.6.3 Principal Component Analysis
  7.7 Generalized Eigenvalue Decomposition (GEVD)
    7.7.1 Generalized Eigenvalue Decomposition
    7.7.2 Total Least Squares Method for GEVD
    7.7.3 Application of GEVD: ESPRIT
    7.7.4 Similarity Transformation in GEVD
  7.8 Rayleigh Quotient
    7.8.1 Definition and Properties of Rayleigh Quotient
    7.8.2 Rayleigh Quotient Iteration
    7.8.3 Algorithms for Rayleigh Quotient
  7.9 Generalized Rayleigh Quotient
    7.9.1 Definition and Properties
    7.9.2 Effectiveness of Class Discrimination
    7.9.3 Robust Beamforming
  7.10 Quadratic Eigenvalue Problems
    7.10.1 Description of Quadratic Eigenvalue Problems
    7.10.2 Solving Quadratic Eigenvalue Problems
    7.10.3 Application Examples
  7.11 Joint Diagonalization
    7.11.1 Joint Diagonalization Problems
    7.11.2 Orthogonal Approximate Joint Diagonalization
    7.11.3 Nonorthogonal Approximate Joint Diagonalization
  Exercises

8 Subspace Analysis and Tracking
  8.1 General Theory of Subspaces
    8.1.1 Bases of Subspaces
    8.1.2 Disjoint Subspaces and Orthogonal Complement
  8.2 Column Space, Row Space and Null Space
    8.2.1 Definitions and Properties
    8.2.2 Subspace Basis Construction
    8.2.3 SVD-Based Orthonormal Basis Construction
    8.2.4 Basis Construction of Subspaces Intersection
  8.3 Subspace Methods
    8.3.1 Signal Subspace and Noise Subspace
    8.3.2 Multiple Signal Classification (MUSIC)
    8.3.3 Subspace Whitening
  8.4 Grassmann Manifold and Stiefel Manifold
    8.4.1 Equivalent Subspaces
    8.4.2 Grassmann Manifold
    8.4.3 Stiefel Manifold
  8.5 Projection Approximation Subspace Tracking (PAST)
    8.5.1 Basic PAST Theory
    8.5.2 PAST Algorithms
  8.6 Fast Subspace Decomposition
    8.6.1 Rayleigh–Ritz Approximation
    8.6.2 Fast Subspace Decomposition Algorithm
  Exercises

9 Projection Analysis
  9.1 Projection and Orthogonal Projection
    9.1.1 Projection Theorem
    9.1.2 Mean Square Estimation
  9.2 Projectors and Projection Matrices
    9.2.1 Projector and Orthogonal Projector
    9.2.2 Projection Matrices
    9.2.3 Derivatives of Projection Matrix
  9.3 Updating of Projection Matrices
    9.3.1 Updating Formulas for Projection Matrices
    9.3.2 Prediction Filters
    9.3.3 Updating of Lattice Adaptive Filter
  9.4 Oblique Projector of Full Column Rank Matrix
    9.4.1 Definition and Properties of Oblique Projectors
    9.4.2 Geometric Interpretation of Oblique Projectors
    9.4.3 Recursion of Oblique Projectors
  9.5 Oblique Projector of Full Row Rank Matrices
    9.5.1 Definition and Properties
    9.5.2 Calculation of Oblique Projection
    9.5.3 Applications of Oblique Projectors
  Exercises

PART III  HIGHER-ORDER MATRIX ANALYSIS

10 Tensor Analysis
  10.1 Tensors and their Presentation
    10.1.1 Tensors
    10.1.2 Tensor Representation
  10.2 Vectorization and Matricization of Tensors
    10.2.1 Vectorization and Horizontal Unfolding
    10.2.2 Longitudinal Unfolding of Tensors
  10.3 Basic Algebraic Operations of Tensors
    10.3.1 Inner Product, Norm and Outer Product
    10.3.2 Mode-n Product of Tensors
    10.3.3 Rank of Tensor
  10.4 Tucker Decomposition of Tensors
    10.4.1 Tucker Decomposition (Higher-Order SVD)
    10.4.2 Third-Order SVD
    10.4.3 Alternating Least Squares Algorithms
  10.5 Parallel Factor Decomposition of Tensors
    10.5.1 Bilinear Model
    10.5.2 Parallel Factor Analysis
    10.5.3 Uniqueness Condition
    10.5.4 Alternating Least Squares Algorithm
  10.6 Applications of Low-Rank Tensor Decomposition
    10.6.1 Multimodal Data Fusion
    10.6.2 Fusion of Multimodal Brain Images
    10.6.3 Process Monitoring
    10.6.4 Note on Other Applications
  10.7 Tensor Eigenvalue Decomposition
    10.7.1 Tensor–Vector Products
    10.7.2 Determinants and Eigenvalues of Tensors
    10.7.3 Generalized Tensor Eigenvalues Problems
    10.7.4 Orthogonal Decomposition of Symmetric Tensors
  10.8 Preprocessing and Postprocessing
    10.8.1 Centering and Scaling of Multi-Way Data
    10.8.2 Compression of Data Array
  10.9 Nonnegative Tensor Decomposition Algorithms
    10.9.1 Multiplication Algorithm
    10.9.2 ALS Algorithms
  10.10 Tensor Completion
    10.10.1 Simultaneous Tensor Decomposition and Completion
    10.10.2 Smooth PARAFAC Tensor Completion
  10.11 Software
  Exercises

References
Index

Preface

Linear algebra is a vast field of fundamental importance in most areas of pure (and applied) mathematics, while matrices are a key tool for researchers, scientists, engineers and graduate students in the science and engineering disciplines. From the viewpoint of applications, matrix analysis provides a powerful mathematical modeling and computational framework for posing and solving important scientific and engineering problems. It is no exaggeration to say that matrix analysis is one of the most creative and flexible mathematical tools and that it plays an irreplaceable role in physics, mechanics, signal and information processing, wireless communications, machine learning, computer vision, automatic control, system engineering, aerospace, bioinformatics, medical image processing and many other disciplines, and it effectively supports research in them all. At the same time, novel applications in these disciplines have spawned a number of new results and methods of matrix analysis, such as quadratic eigenvalue problems, joint diagonalization, sparse representation and compressed sensing, matrix completion, nonnegative matrix factorization, tensor analysis and so on.

Goal of the Book

The main goal of this book is to help the reader develop the skills and background needed to recognize, formulate and solve linear algebraic problems by presenting systematically the theory, methods and applications of matrix analysis. A secondary goal is to help the reader understand some recent applications, perspectives and developments in matrix analysis.

Structure of the Book

In order to provide a balanced and comprehensive account of the subject, this book covers the core theory and methods in matrix analysis, and places particular emphasis on its typical applications in various science and engineering disciplines. The book consists of ten chapters, spread over three parts.

Part I is on matrix algebra: it contains Chapters 1 through 3 and focuses on the necessary background material. Chapter 1 is an introduction to matrix algebra that is devoted to basic matrix operations. This is followed by a description of the vectorization of matrices, the representation of vectors as matrices, i.e. matricization, and the application of sparse matrices to face recognition. Chapter 2 presents some special matrices used commonly in matrix analysis. Chapter 3 presents the matrix differential, which is an important tool in optimization.

Part II is on matrix analysis: this is the heart of the book, and deals with the topics that are most frequently needed. It covers both theoretical and practical aspects and consists of six chapters, as follows. Chapter 4 is devoted to the gradient analysis of matrices, with applications in smooth and nonsmooth convex optimization, constrained convex optimization, Newton's algorithm and the original–dual interior-point method. In Chapter 5 we describe the singular value analysis of matrices, including singular value decomposition, generalized singular value decomposition, low-rank sparse matrix decomposition and matrix completion. Researchers, scientists, engineers and graduate students from a wide variety of disciplines often have to use matrices for modeling purposes and to solve the resulting matrix equations. Chapter 6 focuses on ways to solve such equations and includes the Tikhonov regularization method, the total least squares method, the constrained total least squares method, nonnegative matrix factorization and the solution of sparse matrix equations. Chapter 7 deals with eigenvalue decomposition, matrix reduction, generalized eigenvalue decomposition, the Rayleigh quotient, the generalized Rayleigh quotient, quadratic eigenvalue problems and joint diagonalization. Chapter 8 is devoted to subspace analysis methods and subspace tracking algorithms in adaptive signal processing. Chapter 9 focuses on orthogonal and oblique projections with their applications.

Part III is on higher-order matrix analysis and consists simply of Chapter 10. In it, matrix analysis is extended from the second-order case to higher orders via a presentation of the basic algebraic operations, representation as matrices, Tucker decomposition, parallel factor decomposition, eigenvalue decomposition of tensors, nonnegative tensor decomposition and tensor completion, together with applications.

Features of the Book

The book introduces a novel theoretical framework for matrix analysis by dividing it into second-order matrix analysis (including gradient analysis, singular value analysis, eigenanalysis, subspace analysis and projection analysis) and higher-order matrix analysis (tensor analysis).

Gradient analysis and optimization play an important role in the book. This is a very topical subject and is central to many modern applications (such as communications, signal processing, pattern recognition, machine learning, radar, big data analysis, multimodal brain image fusion etc.), though quite classical in origin.

Some more contemporary topics of matrix analysis, such as subspace analysis, projection analysis and tensor analysis, which are often missing from other books, are included in our text.

Particular emphasis is placed on typical applications of matrix methods in science and engineering. The 80 algorithms for which summaries are given should help readers learn how to conduct computer experiments using related matrix analysis in their studies and research.

In order to make these methods easy to understand and master, this book adheres to the principle of both interpreting physics problems in terms of mathematics and interpreting mathematical results in terms of physical ideas. Thus some typical or important matrix analysis problems are introduced by modeling a problem from physics, while some important mathematical results are explained and understood by revealing their physical meaning.

Reading the Book

The following diagram gives a schematic organization of this book to illustrate the chapter dependences.

[Chapter dependence diagram. Part I, Matrix Algebra: Chapter 1, Chapter 2 (optional), Chapter 3. Part II, Matrix Analysis: Chapters 4, 5, 6, 7, 8 and 9. Part III, Higher-Order Matrix Analysis: Chapter 10 (optional).]

Chapters 2 and 10 are optional. In particular, Chapter 10 is specifically devoted to readers involved in multi-channel or multi-way data analysis and processing.

Intended Readership

Linear algebra and matrix analysis are used in a very wide range of subjects including physics, statistics, computer science, economics, information science and technology (including signal and image processing, communications, automation control, system engineering and pattern recognition), artificial intelligence, bioinformatics and biomedical engineering, to name just a selection. This book is dedicated to providing individuals in those disciplines with a solid foundation of the fundamental skills needed to develop and apply linear algebra and matrix analysis methods in their work. The only background required of the reader is a good knowledge of advanced calculus, so the book will be suitable for graduate students in science and engineering.

Acknowledgments

The contents of this book reflect the author's collaboration with his own graduate students Jian Li, Zi-Zhe Ding, Yong-Tao Su, Xi-Lin Li, Heng Yang, Xi-Kai Zhao, Qi Lv, Qiu-Beng Gao, Li Zhang, Jian-Jiang Ding, Lu Wu, Feng Zhu, Ling Zhang, Dong-Xia Chang, De-Guang Xie, Chun-Yu Peng, Dao-Ming Zhang, Kun Wang, Xi-Yuan Wang, Zhong Chen, Tian-Xiang Luan, Liang Zheng, Yong Zhang and Yan-Yi Rao, all at Tsinghua University, and Shun-Tian Lou, Xiao-Long Zhu, Ji-Ming Ye, Fang-Ming Han, Xiao-Jun Li and Jian-Feng Chen at Xidian University. Since 2004 we have taught graduate courses on matrix analysis and applications at Tsinghua University. Over the years I have benefited from keen interest, feedback and suggestions from many people, including my own graduate students and students in our courses. I wish to thank Dr. Fang-Ming Han for his contribution to co-teaching and then teaching these courses, and Xi-Lin Li, Lu Wu, Dong-Xia Chang, Kun Wang, Zhong Chen, Xi-Yuan Wang and Yan-Yi Rao for their assistance with the teaching. Kun Wang, Zhong Chen, Liang Zheng and Xi-Yuan Wang kindly provided some illustrations in the book. I am grateful to the countless researchers in linear algebra, matrix analysis, information science and technology for their original contributions, and to the anonymous reviewers for their critical comments and suggestions, which have greatly improved the text. I am most grateful to the Commissioning Editor, David Liu, the Content Manager, Esther Miguéliz, and the copyeditor, Susan Parkinson, for their patience, understanding, suggestions and high-quality content management and copyediting in the course of the book's writing and publication. This book uses some of the contents and materials of my book Matrix Analysis and Applications (Second Edition in Chinese, Tsinghua University Press, 2013). Finally, I am grateful to my wife Xiao-Ying Tang, my son Yuan-Sheng Zhang, my daughter-in-law Lin Yan, my daughter Ye-Wei Zhang and my son-in-law Wei Wei for their support and encouragement in this project.

Notation

Sets

R : real numbers
R^n : real n-vectors (n × 1 real matrices)
R^{m×n} : real m × n matrices
R[x] : real polynomials
R[x]^{m×n} : real m × n polynomial matrices
R^{I×J×K} : real third-order tensors
R^{I1×···×IN} : real Nth-order tensors
R+ : nonnegative real numbers, nonnegative orthant
R++ : positive real numbers
C : complex numbers
C^n : complex n-vectors
C^{m×n} : complex m × n matrices
C[x] : complex polynomials
C[x]^{m×n} : complex m × n polynomial matrices
C^{I×J×K} : complex third-order tensors
C^{I1×···×IN} : complex Nth-order tensors
K : real or complex numbers
K^n : real or complex n-vectors
K^{m×n} : real or complex m × n matrices
K^{I×J×K} : real or complex third-order tensors
K^{I1×···×IN} : real or complex Nth-order tensors
Z : integers
Z+ : nonnegative integers
S^{n×n} : symmetric n × n matrices
S^{n×n}_+ : symmetric positive semi-definite n × n matrices
S^{n×n}_{++} : symmetric positive definite n × n matrices
S^{[m,n]} : symmetric mth-order n-dimensional tensors A ∈ K^{I1×···×Im}, I1 = ··· = Im = n
S^{[m,n]}_+ : symmetric mth-order n-dimensional nonnegative tensors
∀ : for all
x ∈ A : x belongs to the set A, i.e. x is an element of A
x ∉ A : x is not an element of the set A
U → V : U maps to V
U → W : U transforms to W
s.t. : such that
∃ : exists
A ⇒ B : A implies B
A ⊆ B : A is a subset of B
A ⊂ B : A is a proper subset of B
A ∪ B : union of sets A and B
A ∩ B : intersection of sets A and B
A + B : sum set of sets A and B
A − B : set-theoretic difference of sets A and B
X \ A : complement of the set A in the set X
X1 × ··· × Xn : Cartesian product of sets X1, ..., Xn
L : linear manifold
Gr(n, r) : Grassmann manifold
St(n, r) : Stiefel manifold
O_r : orthogonal group
S^⊥ : orthogonal complement of the subspace S
K_m(A, f) : order-m Krylov subspace generated by A and f
Col(A) : column space of the matrix A
Ker(A) : kernel space of the matrix A
Null(A) : null space of the matrix A
nullity(A) : nullity of the matrix A
Range(A) : range space of the matrix A
Row(A) : row space of the matrix A
Span(a1, ..., am) : span of vectors a1, ..., am

Vectors

x* : conjugate of the vector x
x^T : transpose of the vector x
x^H : conjugate transpose (Hermitian conjugate) of the vector x
L(u) : linear transform of the vector u
‖x‖0 : ℓ0-norm, the number of nonzero entries in the vector x
‖x‖1 : ℓ1-norm of the vector x
‖x‖2 : Euclidean norm of the vector x
‖x‖p : ℓp-norm or Hölder norm of the vector x
‖x‖* : nuclear norm of the vector x
‖x‖∞ : ℓ∞-norm of the vector x
⟨x, y⟩ = x^H y : inner product of vectors x and y
x ◦ y = x y^H : outer product of vectors x and y
x ⊥ y : orthogonality of vectors x and y
x > 0 : positive vector, with components xi > 0, ∀ i
x ≥ 0 : nonnegative vector, with components xi ≥ 0, ∀ i
x ≥ y : vector elementwise inequality xi ≥ yi, ∀ i
unvec(x) : matricization of the column vector x
unrvec(x) : row matricization of the column vector x
θi^(m), yi^(m) : Rayleigh–Ritz (RR) values, RR vectors
(θi^(m), yi^(m)) : Ritz pair

Matrices

A ∈ R^{m×n} : real m × n matrix A
A ∈ C^{m×n} : complex m × n matrix A
A[x] ∈ R[x]^{m×n} : real m × n polynomial matrix A
A[x] ∈ C[x]^{m×n} : complex m × n polynomial matrix A
A* : conjugate of A
A^T : transpose of A
A^H : conjugate transpose (Hermitian conjugate) of A
(A, B) : matrix pencil
det(A), |A| : determinant of A
tr(A) : trace of A
rank(A) : rank of A
eig(A) : eigenvalues of the Hermitian matrix A
λi(A) : ith eigenvalue of the Hermitian matrix A
λmax(A) : maximum eigenvalue(s) of the Hermitian matrix A
λmin(A) : minimum eigenvalue(s) of the Hermitian matrix A
λ(A, B) : generalized eigenvalue of the matrix pencil (A, B)
σi(A) : ith singular value of A
σmax(A) : maximum singular value(s) of A
σmin(A) : minimum singular value(s) of A
ρ(A) : spectral radius of A
A^{−1} : inverse of the nonsingular matrix A
A† : Moore–Penrose inverse of A
A ≻ 0 : positive definite matrix A
A ⪰ 0 : positive semi-definite matrix A
A ≺ 0 : negative definite matrix A
A ⪯ 0 : negative semi-definite matrix A
A > O : positive (or elementwise positive) matrix A
A ≥ O : nonnegative (or elementwise nonnegative) matrix A
A ≥ B : matrix elementwise inequality aij ≥ bij, ∀ i, j
‖A‖1 : maximum absolute column-sum norm of A
‖A‖∞ : maximum absolute row-sum norm of A
‖A‖spec : spectrum norm of A, σmax(A)
‖A‖F : Frobenius norm of A
‖A‖∞ : max norm of A, the absolute maximum of all entries of A
‖A‖G : Mahalanobis norm of A
vec A : column vectorization of A
rvec A : row vectorization of A
off(A) : off function of A = [aij], the sum of |aij|^2 over all entries with i ≠ j
diag(A) : diagonal function of A = [aij], the sum of |aii|^2 over the diagonal entries
diag(A) : diagonal vector of A = [aij], namely [a11, ..., ann]^T
Diag(A) : diagonal matrix of A = [aij], namely Diag(a11, ..., ann)
⟨A, B⟩ : inner product of A and B, (vec A)^H vec B
A ⊗ B : Kronecker product of matrices A and B
A ⊙ B : Khatri–Rao product of matrices A and B
A * B : Hadamard product of matrices A and B
A ⊕ B : direct sum of matrices A and B
{A}N : matrix group consisting of matrices A1, ..., AN
{A}N ⊗ B : generalized Kronecker product of {A}N and B
δx, δX : perturbations of the vector x and the matrix X
cond(A) : condition number of the matrix A
In(A) : inertia of a symmetric matrix A
i+(A) : number of positive eigenvalues of A
i−(A) : number of negative eigenvalues of A
i0(A) : number of zero eigenvalues of A
A ∼ B : similarity transformation
A(λ) ≃ B(λ) : balanced transformation
A ≐ B : essentially equal matrices
J = PAP^{−1} : Jordan canonical form of the matrix A
dk(x) : kth determinant divisor of a polynomial matrix A(x)
σk(x) : kth invariant factor of a polynomial matrix A(x)
A(λ) : λ-matrix of the matrix A
S(λ) : Smith normal form of the λ-matrix A(λ)

Special Vectors and Special Matrices

P_S : projector onto the subspace S
P⊥_S : orthogonal projector onto the subspace S
E_{H|S} : oblique projector onto the subspace H along the subspace S
E_{S|H} : oblique projector onto the subspace S along the subspace H
1 : summing vector with all entries 1
0 : null or zero vector with all components 0
ei : basic vector with ei = 1 and all other entries 0
π : extracting vector with the last nonzero entry 1
iN : index vector [0, 1, ..., N − 1]^T
iN,rev : bit-reversed index vector of iN
O : null or zero matrix, with all components zero
I : identity matrix
Kmn : mn × mn commutation matrix
Jn : n × n exchange matrix, Jn = [en, ..., e1]
P : n × n permutation matrix, P = [e_{i1}, ..., e_{in}], i1, ..., in ∈ {1, ..., n}
G : generalized permutation matrix or g-matrix, G = PD
Cn : n × n centering matrix, Cn = In − n^{−1} 1n 1n^T
FN : N × N Fourier matrix with entries F(i, k) = (e^{−j2π/N})^{(i−1)(k−1)}
FN,rev : N × N bit-reversed Fourier matrix
Hn : Hadamard matrix, Hn Hn^T = Hn^T Hn = nIn
A : symmetric Toeplitz matrix [a_{|i−j|}]_{i,j=1}^n
A : complex Toeplitz matrix Toep[a0, a1, ..., an] with a_{−i} = ai*
Qn : n × n real orthogonal matrix, QQ^T = Q^T Q = I
Un : n × n unitary matrix, UU^H = U^H U = I
Q_{m×n} : m × n semi-orthogonal matrix, Q^T Q = In or QQ^T = Im
U_{m×n} : m × n para-unitary matrix, U^H U = In, m > n, or UU^H = Im, m < n
Sb : between-class scatter matrix
Sw : within-class scatter matrix

Tensors

A ∈ K^{I1×···×IN} : Nth-order real or complex tensor
I, E : identity tensor
A_{i::} : horizontal slice matrix of A ∈ K^{I×J×K}
A_{:j:} : lateral slice matrix of A ∈ K^{I×J×K}
A_{::k} : frontal slice matrix of A ∈ K^{I×J×K}
a_{:jk}, a_{i:k}, a_{ij:} : mode-1, mode-2, mode-3 vectors of A ∈ K^{I×J×K}
vec A : vectorization of tensor A
unvec A : matricization of tensor A
A^{(JK×I)}, A^{(KI×J)}, A^{(IJ×K)} : matricizations of tensor A ∈ K^{I×J×K}
⟨A, B⟩ : inner product of tensors, (vec A)^H vec B
‖A‖F : Frobenius norm of tensor A
A = u ◦ v ◦ w : outer product of three vectors u, v, w
X ×n A : Tucker mode-n product of X and A
rank(A) : rank of tensor A
[[G; U^(1), ..., U^(N)]] : Tucker operator of tensor G
G ×1 A ×2 B ×3 C : Tucker decomposition (third-order SVD)
G ×1 U^(1) ··· ×N U^(N) : higher-order SVD of Nth-order tensor
x_{ijk} = Σ_{p=1}^P Σ_{q=1}^Q Σ_{r=1}^R a_{ip} b_{jq} c_{kr} : CP decomposition of the third-order tensor X
Ax^m, Ax^{m−1} : tensor–vector products of A ∈ S^{[m,n]}, x ∈ C^{n×1}
det(A) : determinant of tensor A
λi(A) : ith eigenvalue of tensor A
σ(A) : spectrum of tensor A

Functions and Derivatives

def= : defined to be equal
≃ : asymptotically equal (in scaling sense)
≈ : approximately equal (in numerical value)
f : R^m → R : real function f(x), x ∈ R^m, f ∈ R
f : R^{m×n} → R : real function f(X), X ∈ R^{m×n}, f ∈ R
f : C^m × C^m → R : real function f(z, z*), z ∈ C^m, f ∈ R
f : C^{m×n} × C^{m×n} → R : real function f(Z, Z*), Z ∈ C^{m×n}, f ∈ R
dom f, D : definition domain of function f
E : domain of equality constraint function
I : domain of inequality constraint function
F : feasible set
Bc(c; r), Bo(c; r) : closed, open neighborhoods of c with radius r
Bc(C; r), Bo(C; r) : closed, open neighborhoods of C with radius r
f(z, z*), f(Z, Z*) : functions of complex variables z, or Z
df(z, z*), df(Z, Z*) : complex differentials
D(p‖g) : distance between vectors p and g
D(x, y) : dissimilarity between vectors x and y
DE(x, y) : Euclidean distance between vectors x and y
DM(x, y) : Mahalanobis distance between vectors x and y
D_{Jg}(x, y) : Bregman distance between vectors x and y
Dx, D_{vec X} : row partial derivative operators
Dx f(x) : row partial derivative vector of f(x)
D_{vec X} f(X) : row partial derivative vector of f(X)
DX : Jacobian operator
DX f(X) : Jacobian matrix of the function f(X)
Dz, D_{vec Z} : complex cogradient operators
Dz*, D_{vec Z*} : complex conjugate cogradient operators
Dz f(z, z*) : cogradient vector of complex function f(z, z*)
D_{vec Z} f(Z, Z*) : cogradient vector of complex function f(Z, Z*)
Dz* f(z, z*) : conjugate cogradient vector of f(z, z*)
D_{vec Z*} f(Z, Z*) : conjugate cogradient vector of f(Z, Z*)
DZ, ∇Z : Jacobian, gradient matrix operators
DZ f(Z, Z*) : Jacobian matrices of f(Z, Z*)
∇Z f(Z, Z*) : gradient matrices of f(Z, Z*)
DZ* f(Z, Z*) : conjugate Jacobian matrices of f(Z, Z*)
∇Z* f(Z, Z*) : conjugate gradient matrices of f(Z, Z*)
∇x, ∇_{vec X} : gradient vector operators
∇x f(x) : gradient vector of function f(x)
∇_{vec X} f(X) : gradient vector of function f(X)
∇f(X) : gradient matrix of function f
∇^2 f : Hessian matrix of function f
Hx f(x) : Hessian matrix of function f
H f(z, z*) : full Hessian matrix of f(z, z*)
H_{z,z}, H_{z,z*}, H_{z*,z*} : part Hessian matrices of function f(z, z*)
df, ∂f : differential or subdifferential of function f
g ∈ ∂f : subgradient of function f
Δx : descent direction of function f(x)
Δx_nt : Newton step of function f(x)
max f, min f : maximize, minimize function f
max{x, y} : maximum of x and y
min{x, y} : minimum of x and y
inf : infimum
sup : supremum
Re, Im : real part, imaginary part of complex number
arg : argument of objective function or complex number
PC(y), PC y : projection operator of the vector y onto the subspace C
x+ : nonnegative vector with entries [x+]i = max{xi, 0}
prox_h(u) : proximal operator of function h(x) to point u
prox_h(U) : proximal operator of function h(X) to point U
soft(x, τ), Sτ[x] : soft thresholding operator of real variable x

Probability

soft(x, τ), soft(X, τ) : soft thresholding operator of real variables x, X
Dμ(Σ) : singular value (matrix) thresholding (operation)
IC(x) : indicator function
xLS, XLS : least squares solutions to Ax = b, AX = B
xDLS, XDLS : data least squares solutions to Ax = b, AX = B
xWLS, XWLS : weighted least squares solutions to Ax = b, AX = B
xopt, Xopt : optimal solutions to Ax = b, AX = B
xTik, XTik : Tikhonov solutions to Ax = b, AX = B
xTLS, XTLS : total least squares (TLS) solutions to Ax = b, AX = B
xGTLS, XGTLS : generalized TLS solutions to Ax = b, AX = B
xML, XML : maximum likelihood solutions to Ax = b, AX = B
D_AB^{(α,β)}(P‖G) : alpha–beta (AB) divergence of matrices P and G
Dα(P‖G) : alpha-divergence of matrices P and G
Dβ(P‖G) : beta-divergence of matrices P and G
DKL(P‖G) : Kullback–Leibler divergence of matrices P and G
ln_q(x) : Tsallis logarithm
exp_q(x) : q-exponential
ln_{1−α}(x) : deformed logarithm
exp_{1−α}(x) : deformed exponential
sign(x) : signum function of real valued variable x
SGN(x) : signum multifunction of real valued variable x
shrink(y, α) : shrink operator
R(x) : Rayleigh quotient, generalized Rayleigh quotient
Moff : off-diagonal matrix corresponding to matrix M
z^{−j} x(n) : time-shifting operation on vector x(n)
E{x} = x̄ : expectation (mean) of random vector x
E{xx^H} : autocorrelation matrix of random vector x
E{xy^H} : cross-correlation matrix of random vectors x and y
ρxy : correlation coefficient of random vectors x and y
N(c, Σ) : Gaussian random vector with mean (vector) c and covariance (matrix) Σ
CN(c, Σ) : complex Gaussian random vector with mean (vector) c and covariance (matrix) Σ
f(x1, ..., xm) : joint probability density function of random vector x = [x1, ..., xm]^T
Φx(ω) : characteristic function of random vector x
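
As a brief aside (not part of the book's text), the following minimal NumPy sketch makes a few of the matrix notations above concrete: the Hadamard product A * B, the Kronecker product A ⊗ B, the column vectorization vec A, and the Frobenius norm ‖A‖F. The specific array values and the use of NumPy are assumptions made only for illustration.

    # Minimal sketch (illustrative, with arbitrary values): a few notations from the list above.
    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    B = np.array([[0.0, 1.0],
                  [1.0, 0.0]])

    hadamard = A * B                     # Hadamard (elementwise) product A * B
    kron = np.kron(A, B)                 # Kronecker product A ⊗ B, here a 4 x 4 matrix
    vec_A = A.reshape(-1, order="F")     # vec A stacks the columns: [1, 3, 2, 4]
    fro = np.linalg.norm(A, "fro")       # Frobenius norm ‖A‖F = sqrt(1 + 4 + 9 + 16)

    print(hadamard, kron.shape, vec_A, fro)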

Abbreviations

AB : alpha–beta
ADMM : alternating direction method of multipliers
ALS : alternating least squares
ANLS : alternating nonnegative least squares
APGL : accelerated proximal gradient line
ARNLS : alternating regularization nonnegative least squares
BBGP : Barzilai–Borwein gradient projection
BCQP : bound-constrained quadratic program
BFGS : Broyden–Fletcher–Goldfarb–Shanno
BP : basis pursuit
BPDN : basis pursuit denoising
BSS : blind source separation
CANDECOMP : canonical factor decomposition
CNMF : constrained nonnegative matrix factorization
CoSaMP : compression sampling matching pursuit
CP : CANDECOMP/PARAFAC
DCT : discrete cosine transform
DFT : discrete Fourier transform
DLS : data least squares
DOA : direction of arrival
EEG : electroencephalography
EM : expectation-maximization
EMML : expectation-maximization maximum likelihood
ESPRIT : estimating signal parameters via rotational invariance technique
EVD : eigenvalue decomposition
FAJD : fast approximate joint diagonalization
FAL : first-order augmented Lagrangian
FFT : fast Fourier transform
FISTA : fast iterative soft thresholding algorithm
GEAP : generalized eigenproblem adaptive power
GEVD : generalized eigenvalue decomposition
GSVD : generalized singular value decomposition
GTLS : generalized total least squares
HBM : heavy ball method
HOOI : higher-order orthogonal iteration
HOSVD : higher-order singular value decomposition
ICA : independent component analysis
IDFT : inverse discrete Fourier transform
iid : independent and identically distributed
inf : infimum
KKT : Karush–Kuhn–Tucker
KL : Kullback–Leibler
LARS : least angle regressive
Lasso : least absolute shrinkage and selection operator
LDA : linear discriminant analysis
LMV : Lathauwer–Moor–Vanderwalle
LP : linear programming
LS : least squares
LSI : latent semantic indexing
max : maximize, maximum
MCA : minor component analysis
MIMO : multiple-input–multiple-output
min : minimize, minimum
ML : maximum likelihood
MP : matching pursuit
MPCA : multilinear principal component analysis
MUSIC : multiple signal classification
NeNMF : Nesterov nonnegative matrix factorization
NMF : nonnegative matrix factorization
NTD : nonnegative tensor decomposition
OGM : optimal gradient method
OMP : orthogonal matching pursuit
PARAFAC : parallel factor decomposition
PAST : projection approximation subspace tracking
PASTd : projection approximation subspace tracking via deflation
PCA : principal component analysis
PCG : preconditioned conjugate gradient
PCP : principal component pursuit
pdf : positive definite
PMF : positive matrix factorization
psdf : positive semi-definite
PSF : point-spread function
PSVD : product singular value decomposition
RIC : restricted isometry constant
RIP : restricted isometry property
ROMP : regularization orthogonal matching pursuit
RR : Rayleigh–Ritz
QCLP : quadratically constrained linear programming
QEP : quadratic eigenvalue problem
QP : quadratic programming
QV : quadratic variation
sign : signum
SPC : smooth PARAFAC tensor completion
StOMP : stagewise orthogonal matching pursuit
sup : supremum
SVD : singular value decomposition
SVT : singular value thresholding
TLS : total least squares
TV : total variation
UPCA : unfold principal component analysis
VQ : vector quantization

Algorithms

1.1

Reduced row echelon form

1.2

Solving m × n matrix equations Ax = b

1.3

Equation solving method 1

1.4

Equation solving method 2

1.5

Full-rank decomposition

1.6

Column recursive method

1.7

Trace method

2.1

Fast DCT algorithm for Toeplitz matrix

4.1

Gradient descent algorithm and its variants

4.2

Conjugate gradient algorithm

4.3

Biconjugate gradient algorithm

4.4

PCG algorithm via preprocessor

4.5

PCG algorithm without preprocessor

4.6

Nesterov (first) optimal gradient algorithm

4.7

Nesterov algorithm with adaptive convexity parameter

4.8

Nesterov (third) optimal gradient algorithm

4.9

FISTA algorithm with fixed step

4.10

Newton algorithm via backtracking line search

4.11

Feasible start Newton algorithm

4.12

Infeasible start Newton algorithm

4.13

Feasible start complex Newton algorithm

4.14

Feasible start original–dual interior-point algorithm

4.15

Feasible point algorithm

5.1

PSVD(B,C)

5.2

PSVD of BᵀSC


5.3

GSVD algorithm 1

5.4

GSVD algorithm 2

5.5

Tangent algorithm for GSVD

5.6

Robust PCA via accelerated proximal gradient

5.7

Singular value thresholding for matrix completion

5.8

Truncated nuclear norm minimization via APGL search

5.9

Truncated nuclear norm minimization via ADMM

6.1

TLS algorithm for minimum norm solution

6.2

SVD–TLS algorithm

6.3

TLS algorithm for fitting an m-dimensional hyperplane

6.4

Total maximum likelihood algorithm

6.5

Solving blind matrix equation X = Aθ B

6.6

NeNMF algorithm

6.7

Optimal gradient method OGM(Ak , S)

6.8

Orthogonal matching pursuit algorithm

6.9

Subspace pursuit algorithm

6.10

FAL({λk, εk, τk}k∈Z+, η)

6.11

APG(pk, fk, L, Fk, xk−1, hk, APGSTOP)

6.12

GPSR–BB algorithm

6.13

LARS algorithm with Lasso modification

6.14

Coordinate descent algorithm

6.15

Homotopy algorithm

6.16

Linearized Bregman iterative algorithm

7.1

Similarity reduction of matrix A

7.2

Calculation of matrix powers

7.3

Lanczos algorithm for GEVD

7.4

Tangent algorithm for computing the GEVD

7.5

GEVD algorithm for singular matrix B

7.6

Basic ESPRIT algorithm 1

7.7

TLS–ESPRIT algorithm

7.8

Basic ESPRIT algorithm 2

7.9

Rayleigh quotient iteration algorithm

7.10

Rayleigh quotient iteration for general matrix

7.11

Linearized algorithm for QEP


7.12

Fast approximate joint diagonalization algorithm

8.1

PAST algorithm

8.2

PAST via deflation (PASTd) algorithm

8.3

Bi-Lanczos iteration

8.4

Fast subspace decomposition algorithm

10.1

ALS algorithm for Tucker decomposition

10.2

HOSVD (X , R1 , . . . , RN )

10.3

HOOI (X , R1 , . . . , RN )

10.4

CP–ALS algorithm via Kiers horizontal unfolding

10.5

CP–ALS algorithm via LMV longitudinal unfolding

10.6

CP–ALS (X , R)

10.7

Regularized ALS algorithm CP-RALS (X , R, N, λ)

10.8

Generalized eigenproblem adaptive power method

10.9

Z-eigenpair adaptive power method

10.10

Tensor power method

10.11

Data array compression algorithm

10.12

Multiplication algorithm for NCP decomposition

10.13

ALS algorithm for nonnegative Tucker decomposition

10.14

CP–NALS (X , R)

10.15

Simultaneous tensor decomposition and completion

10.16

SPC algorithm

PART I MATRIX ALGEBRA

1 Introduction to Matrix Algebra

In science and engineering, we often encounter the problem of solving a system of linear equations. Matrices provide the most basic and useful mathematical tool for describing and solving such systems. Matrices not only support basic operations (such as transposition, inner product, outer product, inversion and generalized inversion) but also carry a variety of important scalar functions (e.g., norm, quadratic form, determinant, eigenvalues, rank and trace). There are also special matrix operations, such as the direct sum, direct product, Hadamard product, Kronecker product, vectorization, etc. In this chapter, we begin our introduction to matrix algebra by relating matrices to the problem of solving systems of linear equations.

1.1 Basic Concepts of Vectors and Matrices

First we introduce the basic concepts of and notation for vectors and matrices.

1.1.1 Vectors and Matrices

Let R (or C) denote the set of real (or complex) numbers. An m-dimensional column vector is defined as
$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}. \tag{1.1.1}$$
If the ith component x_i is a real number, i.e., x_i ∈ R for all i = 1, . . . , m, then x is an m-dimensional real vector and is denoted x ∈ R^{m×1} or simply x ∈ R^m. Similarly, if x_i ∈ C for some i, then x is known as an m-dimensional complex vector and is denoted x ∈ C^m. Here, R^m and C^m represent the sets of all real and complex m-dimensional column vectors, respectively. An m-dimensional row vector x = [x_1, . . . , x_m] is represented as x ∈ R^{1×m} or x ∈ C^{1×m}. To save space, an m-dimensional column vector is usually written as the transposed form of a row vector, denoted x = [x_1, . . . , x_m]^T. An m × n matrix is expressed as
$$\mathbf{A} = \begin{bmatrix} a_{11} & \cdots & a_{1n}\\ \vdots & \ddots & \vdots\\ a_{m1} & \cdots & a_{mn} \end{bmatrix} = [a_{ij}]_{i=1,j=1}^{m,n}. \tag{1.1.2}$$

The matrix A with (i, j)th real entry a_ij ∈ R is called an m × n real matrix and is denoted A ∈ R^{m×n}. Similarly, A ∈ C^{m×n} is an m × n complex matrix. An m × n matrix can be represented as A = [a_1, . . . , a_n], where its column vectors are a_j = [a_{1j}, . . . , a_{mj}]^T, j = 1, . . . , n. The system of linear equations
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1,\\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2,\\ &\ \,\vdots\\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m \end{aligned} \tag{1.1.3}$$
can be simply rewritten using vector and matrix symbols as a matrix equation
$$\mathbf{A}\mathbf{x} = \mathbf{b}, \tag{1.1.4}$$
where
$$\mathbf{A} = \begin{bmatrix} a_{11} & \cdots & a_{1n}\\ \vdots & \ddots & \vdots\\ a_{m1} & \cdots & a_{mn} \end{bmatrix},\qquad \mathbf{x} = \begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix},\qquad \mathbf{b} = \begin{bmatrix} b_1\\ \vdots\\ b_m \end{bmatrix}. \tag{1.1.5}$$
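As a quick numerical illustration of the matrix form (1.1.4)-(1.1.5) (an editorial addition, not from the original text, assuming NumPy is available), the following sketch builds a small system Ax = b and solves it; the particular numbers simply reuse the system solved later in Example 1.2.

```python
import numpy as np

# A small 3 x 3 system written in the matrix form Ax = b of (1.1.4)-(1.1.5).
A = np.array([[ 1.0, 1.0,  2.0],
              [ 3.0, 4.0, -1.0],
              [-1.0, 1.0,  1.0]])
b = np.array([6.0, 5.0, 2.0])

x = np.linalg.solve(A, b)      # solves Ax = b for a square, nonsingular A
print(x)                       # -> [1. 1. 2.]
print(np.allclose(A @ x, b))   # verify the solution: True
```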

In modeling physical problems, the matrix A is usually the symbolic representation of a physical system (e.g., a linear system, a filter, or a wireless communication channel). There are three different types of vector in science and engineering [232]:
(1) Physical vector Its elements are physical quantities with magnitude and direction, such as a displacement vector, a velocity vector, an acceleration vector and so forth.
(2) Geometric vector A directed line segment or arrow is usually used to visualize a physical vector. Such a representation is called a geometric vector. For example, $v = \overrightarrow{AB}$ represents the directed line segment with initial point A and terminal point B.
(3) Algebraic vector A geometric vector can be represented in algebraic form. For a geometric vector $v = \overrightarrow{AB}$ on a plane, if its initial point is A = (a_1, a_2) and its terminal point is B = (b_1, b_2), then the geometric vector can be represented in the algebraic form v = [b_1 − a_1, b_2 − a_2]^T. Such a geometric vector described in algebraic form is known as an algebraic vector.
Physical vectors are those often encountered in practical applications, while geometric vectors and algebraic vectors are respectively the visual representation and the algebraic form of physical vectors. Algebraic vectors provide a computational tool for physical vectors. Depending on the different types of element value, algebraic vectors can be divided into the following three types:
(1) Constant vector Its entries are real constant numbers or complex constant numbers, e.g., a = [1, 5, 4]^T.
(2) Function vector It uses functions as entries, e.g., x = [x_1, . . . , x_n]^T.
(3) Random vector Its entries are random variables or signals, e.g., x(n) = [x_1(n), . . . , x_m(n)]^T, where x_1(n), . . . , x_m(n) are m random variables or random signals.
Figure 1.1 summarizes the classification of vectors: vectors divide into physical, geometric and algebraic vectors, and algebraic vectors further divide into constant, function and random vectors. (Figure 1.1 Classification of vectors.)

Now we turn to matrices. An m × n matrix A is called a square matrix if m = n, a broad matrix for m < n, and a tall matrix for m > n. The main diagonal of an n × n matrix A = [a_ij] is the segment connecting the top left to the bottom right corner. The entries located on the main diagonal, a_11, a_22, . . . , a_nn, are known as the (main) diagonal elements. An n × n matrix A = [a_ij] is called a diagonal matrix if all entries off the main diagonal are zero; it is then denoted by
$$\mathbf{D} = \mathrm{Diag}(d_{11}, \ldots, d_{nn}). \tag{1.1.6}$$

In particular, a diagonal matrix I = Diag(1, . . . , 1) is called an identity matrix, and O = Diag(0, . . . , 0) is known as a zero matrix. A vector all of whose components are equal to zero is called a zero vector and is denoted 0 = [0, . . . , 0]^T. An n × 1 vector x = [x_1, . . . , x_n]^T with only one nonzero entry x_i = 1 constitutes a basis vector, denoted e_i; e.g.,
$$\mathbf{e}_1 = \begin{bmatrix} 1\\ 0\\ 0\\ \vdots\\ 0 \end{bmatrix},\quad \mathbf{e}_2 = \begin{bmatrix} 0\\ 1\\ 0\\ \vdots\\ 0 \end{bmatrix},\quad \ldots,\quad \mathbf{e}_n = \begin{bmatrix} 0\\ 0\\ 0\\ \vdots\\ 1 \end{bmatrix}. \tag{1.1.7}$$

Clearly, an n × n identity matrix I can be represented as I = [e_1, . . . , e_n] using basis vectors. In this book we often use the following matrix symbols. A(i, :) means the ith row of A. A(:, j) means the jth column of A. A(p : q, r : s) means the (q − p + 1) × (s − r + 1) submatrix consisting of the pth row to the qth row and the rth column to the sth column of A. For example,
$$\mathbf{A}(3:6, 2:4) = \begin{bmatrix} a_{32} & a_{33} & a_{34}\\ a_{42} & a_{43} & a_{44}\\ a_{52} & a_{53} & a_{54}\\ a_{62} & a_{63} & a_{64} \end{bmatrix}.$$
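The colon notation above maps directly onto array slicing. The following sketch is an editorial addition assuming NumPy; note that Python indexing is 0-based while the book's notation is 1-based.

```python
import numpy as np

A = np.arange(1, 43).reshape(6, 7)   # a 6 x 7 example matrix

row_i = A[2, :]       # A(3, :) in the book's 1-based notation
col_j = A[:, 1]       # A(:, 2)
sub   = A[2:6, 1:4]   # A(3:6, 2:4): rows 3..6 and columns 2..4, a 4 x 3 block
print(sub.shape)      # -> (4, 3)
```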

A matrix A is an m × n block matrix if it can be represented in the form
$$\mathbf{A} = [\mathbf{A}_{ij}] = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} & \cdots & \mathbf{A}_{1n}\\ \mathbf{A}_{21} & \mathbf{A}_{22} & \cdots & \mathbf{A}_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{A}_{m1} & \mathbf{A}_{m2} & \cdots & \mathbf{A}_{mn} \end{bmatrix},$$
where the A_ij are matrices. The notation [A_ij] refers to a matrix consisting of block matrices.

1.1.2 Basic Vector Calculus

Basic vector calculus requires vector addition, vector multiplication by a scalar and vector products. The vector addition of u = [u_1, . . . , u_n]^T and v = [v_1, . . . , v_n]^T is defined as
$$\mathbf{u} + \mathbf{v} = [u_1 + v_1, \ldots, u_n + v_n]^T. \tag{1.1.8}$$

Vector addition has the following two main properties:
• Commutative law u + v = v + u.
• Associative law (u + v) ± w = u + (v ± w) = (u ± w) + v.
The vector multiplication of an n × 1 vector u by a scalar α is defined as
$$\alpha\mathbf{u} = [\alpha u_1, \ldots, \alpha u_n]^T. \tag{1.1.9}$$


The basic property of vector multiplication by a scalar is that it obeys the distributive law:
$$\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}. \tag{1.1.10}$$

The inner product (or dot product or scalar product) of two real or complex n × 1 vectors u = [u_1, . . . , u_n]^T and v = [v_1, . . . , v_n]^T, denoted ⟨u, v⟩, is defined as the real number
$$\langle \mathbf{u}, \mathbf{v}\rangle = \mathbf{u}^T\mathbf{v} = u_1 v_1 + \cdots + u_n v_n = \sum_{i=1}^n u_i v_i, \tag{1.1.11}$$
or the complex number
$$\langle \mathbf{u}, \mathbf{v}\rangle = \mathbf{u}^H\mathbf{v} = u_1^* v_1 + \cdots + u_n^* v_n = \sum_{i=1}^n u_i^* v_i, \tag{1.1.12}$$
where u^H is the complex conjugate transpose, or Hermitian conjugate, of u. The inner product ⟨u, v⟩ = u^H v has several important applications; for example, it may be used to measure the size (or length) of a vector, the distance between vectors, the neighborhood of a vector and so on. We will present these applications later. The outer product (or cross product) of an m × 1 real vector and an n × 1 real vector, denoted u ◦ v, is defined as the m × n real matrix
$$\mathbf{u} \circ \mathbf{v} = \mathbf{u}\mathbf{v}^T = \begin{bmatrix} u_1 v_1 & \cdots & u_1 v_n\\ \vdots & \ddots & \vdots\\ u_m v_1 & \cdots & u_m v_n \end{bmatrix}; \tag{1.1.13}$$
if u and v are complex then the outer product is the m × n complex matrix
$$\mathbf{u} \circ \mathbf{v} = \mathbf{u}\mathbf{v}^H = \begin{bmatrix} u_1 v_1^* & \cdots & u_1 v_n^*\\ \vdots & \ddots & \vdots\\ u_m v_1^* & \cdots & u_m v_n^* \end{bmatrix}. \tag{1.1.14}$$
In signal processing, wireless communications, pattern recognition, etc., for two m × 1 data vectors x(t) = [x_1(t), . . . , x_m(t)]^T and y(t) = [y_1(t), . . . , y_m(t)]^T, the m × m autocorrelation matrix is given by R_xx = E{x(t) ◦ x(t)} = E{x(t)x^H(t)} and the m × m cross-correlation matrix is given by R_xy = E{x(t) ◦ y(t)} = E{x(t)y^H(t)}, where E is the expectation operator. Given the sample data x_i(t), y_i(t), i = 1, . . . , m, t = 1, . . . , N, the sample autocorrelation matrix R̂_xx and the sample cross-correlation matrix R̂_xy can be respectively estimated by
$$\hat{\mathbf{R}}_{xx} = \frac{1}{N}\sum_{t=1}^N \mathbf{x}(t)\mathbf{x}^H(t), \qquad \hat{\mathbf{R}}_{xy} = \frac{1}{N}\sum_{t=1}^N \mathbf{x}(t)\mathbf{y}^H(t). \tag{1.1.15}$$
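A short sketch of these operations, assuming NumPy and using randomly generated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(4) + 1j * rng.standard_normal(4)
v = rng.standard_normal(4) + 1j * rng.standard_normal(4)

inner = np.vdot(u, v)          # <u, v> = u^H v  (vdot conjugates the first argument)
outer = np.outer(u, v.conj())  # u v^H, a 4 x 4 complex matrix

# Sample autocorrelation of N snapshots x(t), stored as the columns of X (m x N):
m, N = 4, 1000
X = rng.standard_normal((m, N)) + 1j * rng.standard_normal((m, N))
R_xx = (X @ X.conj().T) / N    # (1/N) * sum_t x(t) x(t)^H, as in (1.1.15)
```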


1.1.3 Basic Matrix Calculus

Basic matrix calculus requires the matrix transpose, conjugate, conjugate transpose, addition and multiplication.
DEFINITION 1.1 If A = [a_ij] is an m × n matrix, then its transpose A^T is an n × m matrix with the (i, j)th entry [A^T]_ij = a_ji. The conjugate of A is represented as A^* and is an m × n matrix with (i, j)th entry [A^*]_ij = a_ij^*, while the conjugate or Hermitian transpose of A, denoted A^H ∈ C^{n×m}, is defined as
$$\mathbf{A}^H = (\mathbf{A}^*)^T = \begin{bmatrix} a_{11}^* & a_{21}^* & \cdots & a_{m1}^*\\ a_{12}^* & a_{22}^* & \cdots & a_{m2}^*\\ \vdots & \vdots & \ddots & \vdots\\ a_{1n}^* & a_{2n}^* & \cdots & a_{mn}^* \end{bmatrix}. \tag{1.1.16}$$
DEFINITION 1.2 An n × n real (complex) matrix satisfying A^T = A (A^H = A) is called a symmetric matrix (Hermitian matrix).
There are the following relationships between the transpose and conjugate transpose of a matrix:
$$\mathbf{A}^H = (\mathbf{A}^*)^T = (\mathbf{A}^T)^*. \tag{1.1.17}$$
For an m × n block matrix A = [A_ij], its conjugate transpose A^H = [A_ji^H] is an n × m block matrix:
$$\mathbf{A}^H = \begin{bmatrix} \mathbf{A}_{11}^H & \mathbf{A}_{21}^H & \cdots & \mathbf{A}_{m1}^H\\ \mathbf{A}_{12}^H & \mathbf{A}_{22}^H & \cdots & \mathbf{A}_{m2}^H\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{A}_{1n}^H & \mathbf{A}_{2n}^H & \cdots & \mathbf{A}_{mn}^H \end{bmatrix}.$$

The simplest algebraic operations with matrices are the addition of two matrices and the multiplication of a matrix by a scalar.
DEFINITION 1.3 Given two m × n matrices A = [a_ij] and B = [b_ij], matrix addition A + B is defined by [A + B]_ij = a_ij + b_ij. Similarly, matrix subtraction A − B is defined as [A − B]_ij = a_ij − b_ij.
By using this definition, it is easy to verify that the addition and subtraction of two matrices obey the following rules:
• Commutative law A + B = B + A.
• Associative law (A + B) ± C = A + (B ± C) = (A ± C) + B.
DEFINITION 1.4 Let A = [a_ij] be an m × n matrix, and α be a scalar. The product αA is an m × n matrix and is defined as [αA]_ij = αa_ij.


DEFINITION 1.5 Consider an m × n matrix A = [a_ij] and an r × 1 vector x = [x_1, . . . , x_r]^T. The product Ax exists only when n = r and is an m × 1 vector whose entries are given by
$$[\mathbf{A}\mathbf{x}]_i = \sum_{j=1}^n a_{ij}x_j, \qquad i = 1, \ldots, m.$$
DEFINITION 1.6 The matrix product of an m × n matrix A = [a_ij] and an r × s matrix B = [b_ij], denoted AB, exists only when n = r and is an m × s matrix with entries
$$[\mathbf{A}\mathbf{B}]_{ij} = \sum_{k=1}^n a_{ik}b_{kj}, \qquad i = 1, \ldots, m;\ j = 1, \ldots, s.$$

THEOREM 1.1 The matrix product obeys the following rules of operation:
(a) Associative law of multiplication If A ∈ C^{m×n}, B ∈ C^{n×p} and C ∈ C^{p×q}, then A(BC) = (AB)C.
(b) Left distributive law of multiplication For two m × n matrices A and B, if C is an n × p matrix then (A ± B)C = AC ± BC.
(c) Right distributive law of multiplication If A is an m × n matrix, while B and C are two n × p matrices, then A(B ± C) = AB ± AC.
(d) If α is a scalar and A and B are two m × n matrices then α(A + B) = αA + αB.
Proof We will prove only (a) and (b) here, while the proofs of (c) and (d) are left to the reader as an exercise.
(a) Let A_{m×n} = [a_ij], B_{n×p} = [b_ij], C_{p×q} = [c_ij]; then
$$[\mathbf{A}(\mathbf{B}\mathbf{C})]_{ij} = \sum_{k=1}^n a_{ik}(\mathbf{B}\mathbf{C})_{kj} = \sum_{k=1}^n a_{ik}\left(\sum_{l=1}^p b_{kl}c_{lj}\right) = \sum_{l=1}^p\sum_{k=1}^n (a_{ik}b_{kl})c_{lj} = \sum_{l=1}^p [\mathbf{A}\mathbf{B}]_{il}\,c_{lj} = [(\mathbf{A}\mathbf{B})\mathbf{C}]_{ij},$$
which means that A(BC) = (AB)C.
(b) From the rule for matrix multiplication it is known that
$$[\mathbf{A}\mathbf{C}]_{ij} = \sum_{k=1}^n a_{ik}c_{kj}, \qquad [\mathbf{B}\mathbf{C}]_{ij} = \sum_{k=1}^n b_{ik}c_{kj}.$$
Then, according to the matrix addition rule, we have
$$[\mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}]_{ij} = [\mathbf{A}\mathbf{C}]_{ij} + [\mathbf{B}\mathbf{C}]_{ij} = \sum_{k=1}^n (a_{ik} + b_{ik})c_{kj} = [(\mathbf{A} + \mathbf{B})\mathbf{C}]_{ij}.$$
This gives (A + B)C = AC + BC.
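The rules of Theorem 1.1 are easy to check numerically. The following sketch is an editorial addition assuming NumPy; it verifies the associative and left distributive laws on random matrices of compatible sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))
D = rng.standard_normal((3, 4))

print(np.allclose(A @ (B @ C), (A @ B) @ C))      # associative law (a): True
print(np.allclose((A + D) @ B, A @ B + D @ B))    # left distributive law (b): True
```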


Generally speaking, the product of two matrices does not satisfy the commutative law, namely AB ≠ BA. Another important operation on a square matrix is that of finding its inverse. Put x = [x_1, . . . , x_n]^T and y = [y_1, . . . , y_n]^T. The matrix–vector product Ax = y can be regarded as a linear transform of the vector x, where the n × n matrix A is called the linear transform matrix. Let A^{−1} denote the linear inverse transform of the vector y onto x. If A^{−1} exists then one has
$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{y}. \tag{1.1.18}$$

This equation can be viewed as the result of using A^{−1} to premultiply the original linear transform Ax = y, giving A^{−1}Ax = A^{−1}y = x, which means that the linear inverse transform matrix A^{−1} must satisfy A^{−1}A = I. Furthermore, x = A^{−1}y should be invertible as well. In other words, after premultiplying x = A^{−1}y by A, we get Ax = AA^{−1}y, which should be consistent with the original linear transform Ax = y. This means that A^{−1} must also satisfy AA^{−1} = I. On the basis of the discussion above, the inverse matrix can be defined as follows.
DEFINITION 1.7 Let A be an n × n matrix. The matrix A is said to be invertible if there is an n × n matrix A^{−1} such that AA^{−1} = A^{−1}A = I, and A^{−1} is referred to as the inverse matrix of A.
The following are properties of the conjugate, transpose, conjugate transpose and inverse matrices.
1. The matrix conjugate, transpose and conjugate transpose satisfy the distributive law:
(A + B)^* = A^* + B^*,  (A + B)^T = A^T + B^T,  (A + B)^H = A^H + B^H.
2. The transpose, conjugate transpose and inverse matrix of a product of two matrices satisfy the following relationship:
(AB)^T = B^T A^T,  (AB)^H = B^H A^H,  (AB)^{−1} = B^{−1}A^{−1},
in which both A and B are assumed to be invertible.
3. Each of the symbols for the conjugate, transpose and conjugate transpose can be exchanged with the symbol for the inverse:
(A^*)^{−1} = (A^{−1})^*,  (A^T)^{−1} = (A^{−1})^T,  (A^H)^{−1} = (A^{−1})^H.
The notations A^{−*} = (A^{−1})^*, A^{−T} = (A^{−1})^T and A^{−H} = (A^{−1})^H are sometimes used.
4. For any m × n matrix A, the n × n matrix B = A^H A and the m × m matrix C = AA^H are Hermitian matrices.


1.1.4 Linear Independence of Vectors

Consider the system of linear equations (1.1.3). It can be written as the matrix equation Ax = b. Denoting A = [a_1, . . . , a_n], the m equations of (1.1.3) can be written as x_1 a_1 + · · · + x_n a_n = b. This is called a linear combination of the column vectors a_1, . . . , a_n.
DEFINITION 1.8 A set of n vectors, denoted {u_1, . . . , u_n}, is said to be linearly independent if the matrix equation
$$c_1\mathbf{u}_1 + \cdots + c_n\mathbf{u}_n = \mathbf{0} \tag{1.1.19}$$
has only the zero solution c_1 = · · · = c_n = 0. If the above equation holds for a set of coefficients that are not all zero, then the n vectors u_1, . . . , u_n are said to be linearly dependent.
An n × n matrix A is nonsingular if and only if the matrix equation Ax = 0 has only the zero solution x = 0. If Ax = 0 holds for some nonzero solution x ≠ 0, then the matrix A is singular. For an n × n matrix A = [a_1, . . . , a_n], the matrix equation Ax = 0 is equivalent to
$$\mathbf{a}_1 x_1 + \cdots + \mathbf{a}_n x_n = \mathbf{0}. \tag{1.1.20}$$
From the above definition it follows that the matrix equation Ax = 0 has only the zero solution vector, i.e., the matrix A is nonsingular, if and only if the column vectors a_1, . . . , a_n of A are linearly independent. Because of the importance of this result, it is stated as a theorem below.
THEOREM 1.2 An n × n matrix A = [a_1, . . . , a_n] is nonsingular if and only if its n column vectors a_1, . . . , a_n are linearly independent.
To summarize the above discussion, the nonsingularity of an n × n matrix A can be determined in any of the following three ways:
(1) its column vectors are linearly independent;
(2) for the matrix equation Ax = b there exists a unique nonzero solution;
(3) the matrix equation Ax = 0 has only the zero solution.

1.1.5 Matrix Functions

The following are five common matrix functions:


1. Trigonometric matrix functions
$$\sin\mathbf{A} = \sum_{n=0}^{\infty}\frac{(-1)^n\mathbf{A}^{2n+1}}{(2n+1)!} = \mathbf{A} - \frac{1}{3!}\mathbf{A}^3 + \frac{1}{5!}\mathbf{A}^5 - \cdots, \tag{1.1.21}$$
$$\cos\mathbf{A} = \sum_{n=0}^{\infty}\frac{(-1)^n\mathbf{A}^{2n}}{(2n)!} = \mathbf{I} - \frac{1}{2!}\mathbf{A}^2 + \frac{1}{4!}\mathbf{A}^4 - \cdots. \tag{1.1.22}$$
2. Logarithmic matrix function
$$\ln(\mathbf{I} + \mathbf{A}) = \sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{n}\mathbf{A}^n = \mathbf{A} - \frac{1}{2}\mathbf{A}^2 + \frac{1}{3}\mathbf{A}^3 - \cdots. \tag{1.1.23}$$
3. Exponential matrix function [311], [179]
$$e^{\mathbf{A}} = \sum_{n=0}^{\infty}\frac{1}{n!}\mathbf{A}^n = \mathbf{I} + \mathbf{A} + \frac{1}{2}\mathbf{A}^2 + \frac{1}{3!}\mathbf{A}^3 + \cdots, \tag{1.1.24}$$
$$e^{-\mathbf{A}} = \sum_{n=0}^{\infty}\frac{(-1)^n}{n!}\mathbf{A}^n = \mathbf{I} - \mathbf{A} + \frac{1}{2}\mathbf{A}^2 - \frac{1}{3!}\mathbf{A}^3 + \cdots, \tag{1.1.25}$$
$$e^{\mathbf{A}t} = \mathbf{I} + \mathbf{A}t + \frac{1}{2}\mathbf{A}^2 t^2 + \frac{1}{3!}\mathbf{A}^3 t^3 + \cdots. \tag{1.1.26}$$
4. Matrix derivative If the entries a_ij of the matrix A are functions of a parameter t, then the derivative of the matrix A is defined as
$$\dot{\mathbf{A}} = \frac{d\mathbf{A}}{dt} = \begin{bmatrix} \dfrac{da_{11}}{dt} & \dfrac{da_{12}}{dt} & \cdots & \dfrac{da_{1n}}{dt}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{da_{m1}}{dt} & \dfrac{da_{m2}}{dt} & \cdots & \dfrac{da_{mn}}{dt} \end{bmatrix}. \tag{1.1.27}$$
• The derivative of a matrix exponential function is given by
$$\frac{d\,e^{\mathbf{A}t}}{dt} = \mathbf{A}e^{\mathbf{A}t} = e^{\mathbf{A}t}\mathbf{A}. \tag{1.1.28}$$
• The derivative of a matrix product is given by
$$\frac{d}{dt}(\mathbf{A}\mathbf{B}) = \frac{d\mathbf{A}}{dt}\mathbf{B} + \mathbf{A}\frac{d\mathbf{B}}{dt}, \tag{1.1.29}$$
where A and B are matrix functions of the variable t.
5. Matrix integral
$$\int \mathbf{A}\,dt = \begin{bmatrix} \int a_{11}\,dt & \int a_{12}\,dt & \cdots & \int a_{1n}\,dt\\ \vdots & \vdots & \ddots & \vdots\\ \int a_{m1}\,dt & \int a_{m2}\,dt & \cdots & \int a_{mn}\,dt \end{bmatrix}. \tag{1.1.30}$$
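As an illustrative sketch (assuming NumPy and SciPy, and not taken from the book), the exponential matrix function (1.1.24) can be approximated by truncating its power series and compared with a library implementation:

```python
import numpy as np
from scipy.linalg import expm

def expm_series(A, terms=30):
    """Truncated power series I + A + A^2/2! + ... of (1.1.24); a sketch, not production code."""
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n     # term now holds A^n / n!
        E = E + term
    return E

A = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(np.allclose(expm_series(A), expm(A)))   # True: the series converges for any square A
```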

1.2 Elementary Row Operations and Applications

Simple operations related to the rows of a matrix are referred to as elementary row operations; they can efficiently solve matrix equations, find the inverse of a matrix, construct the basis vectors of a vector space and so on.

1.2.1 Elementary Row Operations

When solving an m × n system of linear equations, it is useful to reduce the number of equations. The principle of the reduction process is to keep the solutions of the system of equations unchanged.
DEFINITION 1.9 Two systems of linear equations in n unknowns are said to be equivalent systems of equations if they have the same sets of solutions.
To transform a given m × n matrix equation A_{m×n} x_n = b_m into an equivalent matrix equation, a simple and efficient way is to apply successive elementary operations on the given matrix equation.
DEFINITION 1.10 The following three types of operation on the rows of a system of linear equations are called elementary row operations.
Type I Interchange any two equations, say the pth and qth equations; this is denoted by R_p ↔ R_q.
Type II Multiply the pth equation by a nonzero number α; this is denoted by αR_p → R_p.
Type III Add β times the pth equation to the qth equation; this is denoted by βR_p + R_q → R_q.
Clearly, no elementary row operation changes the solution of a system of linear equations, so after elementary row operations the reduced system of linear equations and the original system of linear equations are equivalent. As a matter of fact, any elementary operation on an m × n system of equations Ax = b is equivalent to the same type of elementary operation on the augmented matrix B = [A, b], where the column vector b is written alongside A. Hence, performing elementary row operations on a system of linear equations Ax = b is in practice implemented by using the same elementary row operations on the augmented matrix B = [A, b].


The discussions above show that if, after a sequence of elementary row operations, the augmented matrix B_{m×(n+1)} becomes another, simpler matrix C_{m×(n+1)}, then the two matrices are row equivalent. For the convenience of solving a system of linear equations, the final row-equivalent matrix should be of echelon form (see below). The leftmost nonzero entry of a nonzero row is called the leading entry of the row. If the leading entry is equal to 1 then it is said to be the leading-1 entry.
DEFINITION 1.11 A matrix is said to be an echelon matrix if it has the following form:
(1) all rows with all entries zero are located at the bottom of the matrix;
(2) the leading entry of each nonzero row always appears to the right of the leading entry of the nonzero row above;
(3) all entries below a leading entry in the same column are equal to zero.
The following are some examples of echelon matrices:
$$\mathbf{A} = \begin{bmatrix} 2 & * & *\\ 0 & 5 & *\\ 0 & 0 & 3\\ 0 & 0 & 0 \end{bmatrix},\quad \mathbf{A} = \begin{bmatrix} 1 & * & *\\ 0 & 3 & *\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix},\quad \mathbf{A} = \begin{bmatrix} 3 & * & *\\ 0 & 0 & 5\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix},\quad \mathbf{A} = \begin{bmatrix} 0 & 1 & *\\ 0 & 0 & 9\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix},$$
where * indicates that the entry can be any value.
DEFINITION 1.12 [232] An echelon matrix B is said to be a reduced row-echelon form (RREF) matrix if the leading entry of each nonzero row is a leading-1 entry, and each leading-1 entry is the only nonzero entry in the column in which it is located.
THEOREM 1.3 Any m × n matrix A is row equivalent to one and only one matrix in reduced row-echelon form.
Proof See [285, Appendix A].

Given an m × n matrix B, Algorithm 1.1 transforms B to a reduced row-echelon form by performing suitable elementary operations. DEFINITION 1.13 [285, p. 15] A pivot position of an m × n matrix A is the position of some leading entry of its echelon form. Each column containing a pivot position is called a pivot column of the matrix A. The following examples show how to perform elementary operations for transforming a matrix to its row-echelon form and reduced row-echelon form, and how to determine the pivot columns of the original matrix.

Algorithm 1.1 Reduced row-echelon form [229]

input: B ∈ R^{m×n}.
1. Select the row that has the first nonzero entry, say R_i. If this happens to be the first row, then go to the next step. Otherwise, perform a Type-I elementary operation R_1 ↔ R_i such that the first entry of the new first row is nonzero.
2. Make a Type-II elementary operation αR_1 → R_1 to get a leading-1 entry for the first row.
3. Perform Type-III elementary operations βR_1 + R_i → R_i, i > 1, to make all entries below the leading-1 entry of the first row equal to 0.
4. For the ith row, i = 2, . . . , m, perform suitable elementary operations similar to the above steps to get a leading-1 entry for the ith row, and make all entries above and below the leading-1 entry of the ith row equal to 0.
output: The reduced row-echelon form of B.
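A minimal sketch of Algorithm 1.1 in code, assuming NumPy; the function name rref and the use of partial pivoting are editorial choices, not the book's. Applying it to the matrix of Example 1.1 below reproduces the reduced row-echelon form obtained there.

```python
import numpy as np

def rref(B, tol=1e-12):
    """Reduce B to reduced row-echelon form (a sketch of Algorithm 1.1 with partial pivoting)."""
    A = np.array(B, dtype=float)
    m, n = A.shape
    r = 0                                    # index of the current pivot row
    for c in range(n):
        p = r + np.argmax(np.abs(A[r:, c]))  # row with the largest entry in column c
        if abs(A[p, c]) < tol:
            continue                         # no pivot in this column
        A[[r, p]] = A[[p, r]]                # Type-I operation
        A[r] = A[r] / A[r, c]                # Type-II operation: leading-1 entry
        for i in range(m):
            if i != r:
                A[i] -= A[i, c] * A[r]       # Type-III operations: clear the column
        r += 1
        if r == m:
            break
    return A

A = np.array([[-3, 6, -1, 1, -7],
              [ 1, -2, 2, 3, -1],
              [ 2, -4, 5, 8, -4]], dtype=float)
print(rref(A))   # rows [1 -2 0 -1 3], [0 0 1 2 -2], [0 0 0 0 0]
```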

EXAMPLE 1.1 Consider the 3 × 5 matrix
$$\mathbf{A} = \begin{bmatrix} -3 & 6 & -1 & 1 & -7\\ 1 & -2 & 2 & 3 & -1\\ 2 & -4 & 5 & 8 & -4 \end{bmatrix}.$$
First, perform the elementary operations (−2)R_2 + R_3 → R_3 and 3R_2 + R_1 → R_1:
$$\begin{bmatrix} 0 & 0 & 5 & 10 & -10\\ 1 & -2 & 2 & 3 & -1\\ 0 & 0 & 1 & 2 & -2 \end{bmatrix}.$$
Then, perform (−2/5)R_1 + R_2 → R_2 and (−1/5)R_1 + R_3 → R_3:
$$\begin{bmatrix} 0 & 0 & 5 & 10 & -10\\ 1 & -2 & 0 & -1 & 3\\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.$$
Finally, perform R_1 ↔ R_2:
$$\begin{bmatrix} \underline{1} & -2 & 0 & -1 & 3\\ 0 & 0 & \underline{5} & 10 & -10\\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \quad\text{(row-echelon form)}.$$
The underlined positions give the pivot positions of the reworked matrix A. Hence, the first and third columns are the two pivot columns of the original matrix A. That is, the pivot columns are given by
$$\begin{bmatrix} -3\\ 1\\ 2 \end{bmatrix}, \qquad \begin{bmatrix} -1\\ 2\\ 5 \end{bmatrix}.$$
After performing the elementary row operation (1/5)R_2 → R_2, the above echelon matrix becomes the reduced row-echelon form:
$$\begin{bmatrix} 1 & -2 & 0 & -1 & 3\\ 0 & 0 & 1 & 2 & -2\\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.$$

1.2.2 Gauss Elimination Methods

Elementary row operations can be used to solve matrix equations and perform matrix inversion.
1. Gauss Elimination Method for Solving Matrix Equations
Consider how to solve an n × n matrix equation Ax = b, where the inverse matrix A^{−1} of the matrix A exists. It is to be hoped that the solution vector x = A^{−1}b can be obtained by using elementary row operations. Form the n × (n + 1) augmented matrix B = [A, b]. Since the solution x = A^{−1}b can be written as a new matrix equation Ix = A^{−1}b, we get a new augmented matrix C = [I, A^{−1}b] associated with the solution x = A^{−1}b. Hence, we can write the solution process for the two matrix equations Ax = b and x = A^{−1}b respectively as follows:
matrix equations:    $\mathbf{A}\mathbf{x} = \mathbf{b} \xrightarrow{\text{elementary row operations}} \mathbf{x} = \mathbf{A}^{-1}\mathbf{b},$
augmented matrices:  $[\mathbf{A}, \mathbf{b}] \xrightarrow{\text{elementary row operations}} [\mathbf{I}, \mathbf{A}^{-1}\mathbf{b}].$
This implies that, after suitable elementary row operations on the augmented matrix [A, b], if the left-hand part of the new augmented matrix is an n × n identity matrix I then the (n + 1)th column gives the solution x = A^{−1}b of the original equation Ax = b directly. This method is called the Gauss or Gauss–Jordan elimination method.

EXAMPLE 1.2 Use the Gauss elimination method to solve
x_1 + x_2 + 2x_3 = 6,
3x_1 + 4x_2 − x_3 = 5,
−x_1 + x_2 + x_3 = 2.
As before, perform elementary row operations on the augmented matrix of the given matrix equation to yield the following results:
$$\begin{bmatrix} 1 & 1 & 2 & 6\\ 3 & 4 & -1 & 5\\ -1 & 1 & 1 & 2 \end{bmatrix} \xrightarrow[R_1 + R_3 \to R_3]{(-3)R_1 + R_2 \to R_2} \begin{bmatrix} 1 & 1 & 2 & 6\\ 0 & 1 & -7 & -13\\ 0 & 2 & 3 & 8 \end{bmatrix} \xrightarrow[(-2)R_2 + R_3 \to R_3]{(-1)R_2 + R_1 \to R_1} \begin{bmatrix} 1 & 0 & 9 & 19\\ 0 & 1 & -7 & -13\\ 0 & 0 & 17 & 34 \end{bmatrix}$$
$$\xrightarrow{\frac{1}{17}R_3 \to R_3} \begin{bmatrix} 1 & 0 & 9 & 19\\ 0 & 1 & -7 & -13\\ 0 & 0 & 1 & 2 \end{bmatrix} \xrightarrow[(-9)R_3 + R_1 \to R_1]{7R_3 + R_2 \to R_2} \begin{bmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & 0 & 1\\ 0 & 0 & 1 & 2 \end{bmatrix}.$$
Hence, the solution to the linear system of equations is given by x_1 = 1, x_2 = 1 and x_3 = 2.
Elementary row operations are also applicable for solving the m × n (where m > n) matrix equation Ax = b, as shown in Algorithm 1.2. To do this, it is necessary to transform the augmented matrix B = [A, b] to its reduced row-echelon form.

Algorithm 1.2 Solving the m × n matrix equation Ax = b [232]
1. Form the augmented matrix B = [A, b].
2. Run Algorithm 1.1 to transform the augmented matrix B to its reduced row-echelon form, which is equivalent to the original augmented matrix.
3. From the reduced row-echelon form write down the corresponding system of linear equations equivalent to the original system of linear equations.
4. Solve the new system of linear equations to yield its general solution.

EXAMPLE 1.3 Solve the system of linear equations
2x_1 + 2x_2 − x_3 = 1,
−2x_1 − 2x_2 + 4x_3 = 1,
2x_1 + 2x_2 + 5x_3 = 5,
−2x_1 − 2x_2 − 2x_3 = −3.
Perform the Type-II operation (1/2)R_1 → R_1 on the augmented matrix:
$$\mathbf{B} = \begin{bmatrix} 2 & 2 & -1 & 1\\ -2 & -2 & 4 & 1\\ 2 & 2 & 5 & 5\\ -2 & -2 & -2 & -3 \end{bmatrix} \to \begin{bmatrix} 1 & 1 & -\tfrac{1}{2} & \tfrac{1}{2}\\ -2 & -2 & 4 & 1\\ 2 & 2 & 5 & 5\\ -2 & -2 & -2 & -3 \end{bmatrix}.$$
Use elementary row operations to make the first entry zero for the ith row, with i = 2, 3, 4:
$$\begin{bmatrix} 1 & 1 & -\tfrac{1}{2} & \tfrac{1}{2}\\ -2 & -2 & 4 & 1\\ 2 & 2 & 5 & 5\\ -2 & -2 & -2 & -3 \end{bmatrix} \to \begin{bmatrix} 1 & 1 & -\tfrac{1}{2} & \tfrac{1}{2}\\ 0 & 0 & 3 & 2\\ 0 & 0 & 6 & 4\\ 0 & 0 & -3 & -2 \end{bmatrix}.$$
Perform the Type-II operation (1/3)R_2 → R_2:
$$\begin{bmatrix} 1 & 1 & -\tfrac{1}{2} & \tfrac{1}{2}\\ 0 & 0 & 3 & 2\\ 0 & 0 & 6 & 4\\ 0 & 0 & -3 & -2 \end{bmatrix} \to \begin{bmatrix} 1 & 1 & -\tfrac{1}{2} & \tfrac{1}{2}\\ 0 & 0 & 1 & \tfrac{2}{3}\\ 0 & 0 & 6 & 4\\ 0 & 0 & -3 & -2 \end{bmatrix}.$$
Then, using elementary row operations, transform to zero all entries directly above and below the leading-1 entry of the second row:
$$\begin{bmatrix} 1 & 1 & -\tfrac{1}{2} & \tfrac{1}{2}\\ 0 & 0 & 1 & \tfrac{2}{3}\\ 0 & 0 & 6 & 4\\ 0 & 0 & -3 & -2 \end{bmatrix} \to \begin{bmatrix} 1 & 1 & 0 & \tfrac{5}{6}\\ 0 & 0 & 1 & \tfrac{2}{3}\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix},$$
from which one has two new linear equations, x_1 + x_2 = 5/6 and x_3 = 2/3. This linear system of equations has infinitely many solutions; its general solution is given by x_1 = 5/6 − x_2, x_3 = 2/3. If x_2 = 1 then a set of particular solutions is given by x_1 = −1/6, x_2 = 1 and x_3 = 2/3.
Any m × n complex matrix equation Ax = b can be written as
$$(\mathbf{A}_r + \mathrm{j}\,\mathbf{A}_i)(\mathbf{x}_r + \mathrm{j}\,\mathbf{x}_i) = \mathbf{b}_r + \mathrm{j}\,\mathbf{b}_i, \tag{1.2.1}$$

where A_r, x_r, b_r and A_i, x_i, b_i are the real and imaginary parts of A, x, b, respectively. Expanding the above equation yields
$$\mathbf{A}_r\mathbf{x}_r - \mathbf{A}_i\mathbf{x}_i = \mathbf{b}_r, \tag{1.2.2}$$
$$\mathbf{A}_i\mathbf{x}_r + \mathbf{A}_r\mathbf{x}_i = \mathbf{b}_i. \tag{1.2.3}$$
The above equations can be combined into
$$\begin{bmatrix} \mathbf{A}_r & -\mathbf{A}_i\\ \mathbf{A}_i & \mathbf{A}_r \end{bmatrix}\begin{bmatrix} \mathbf{x}_r\\ \mathbf{x}_i \end{bmatrix} = \begin{bmatrix} \mathbf{b}_r\\ \mathbf{b}_i \end{bmatrix}. \tag{1.2.4}$$
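The real block form (1.2.4) is easy to exercise numerically. The following sketch is an editorial addition assuming NumPy; it converts a random complex system into the equivalent real system and checks that the recovered complex solution satisfies the original equation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Real block form of (1.2.4): [[Ar, -Ai], [Ai, Ar]] [xr; xi] = [br; bi]
Ablk = np.block([[A.real, -A.imag],
                 [A.imag,  A.real]])
bblk = np.concatenate([b.real, b.imag])

z = np.linalg.solve(Ablk, bblk)
x = z[:n] + 1j * z[n:]
print(np.allclose(A @ x, b))   # True: same solution as the complex system
```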

Thus, m complex-valued equations with n complex unknowns become 2m real-valued equations with 2n real unknowns. In particular, if m = n then we have
complex matrix equation:  $\mathbf{A}\mathbf{x} = \mathbf{b} \xrightarrow{\text{elementary row operations}} \mathbf{x} = \mathbf{A}^{-1}\mathbf{b},$
augmented matrix:  $\left[\begin{array}{cc|c} \mathbf{A}_r & -\mathbf{A}_i & \mathbf{b}_r\\ \mathbf{A}_i & \mathbf{A}_r & \mathbf{b}_i \end{array}\right] \xrightarrow{\text{elementary row operations}} \left[\begin{array}{cc|c} \mathbf{I}_n & \mathbf{O}_n & \mathbf{x}_r\\ \mathbf{O}_n & \mathbf{I}_n & \mathbf{x}_i \end{array}\right].$

This shows that if we write the n × (n + 1) complex augmented matrix [A, b] as a 2n × (2n + 1) real augmented matrix and perform elementary row operations to make its left-hand side become a 2n × 2n identity matrix, then the upper and lower halves of the (2n + 1)th column give respectively the real and imaginary parts of the complex solution vector x of the original complex matrix equation Ax = b.
2. Gauss Elimination Method for Matrix Inversion
Consider the inversion operation on an n × n nonsingular matrix A. This problem can be modeled as an n × n matrix equation AX = I whose solution X is the inverse matrix of A. It is easily seen that the augmented matrix of the matrix equation AX = I is [A, I], whereas the augmented matrix of the solution equation IX = A^{−1} is [I, A^{−1}]. Hence, we have the following relations:
matrix equation:   $\mathbf{A}\mathbf{X} = \mathbf{I} \xrightarrow{\text{elementary row operations}} \mathbf{X} = \mathbf{A}^{-1},$
augmented matrix:  $[\mathbf{A}, \mathbf{I}] \xrightarrow{\text{elementary row operations}} [\mathbf{I}, \mathbf{A}^{-1}].$

This result tells us that if we use elementary row operations on the n × 2n augmented matrix [A, I] so that its left-hand part becomes an n × n identity matrix, then the right-hand part yields the inverse A^{−1} of the given n × n matrix A directly. This is called the Gauss elimination method for matrix inversion.
Suppose that an n × n complex matrix A is nonsingular. Then its inversion can be modeled as the complex matrix equation (A_r + jA_i)(X_r + jX_i) = I. This complex equation can be rewritten as
$$\begin{bmatrix} \mathbf{A}_r & -\mathbf{A}_i\\ \mathbf{A}_i & \mathbf{A}_r \end{bmatrix}\begin{bmatrix} \mathbf{X}_r\\ \mathbf{X}_i \end{bmatrix} = \begin{bmatrix} \mathbf{I}_n\\ \mathbf{O}_n \end{bmatrix}, \tag{1.2.5}$$
from which we get the following relation:
complex matrix equation:  $\mathbf{A}\mathbf{X} = \mathbf{I} \xrightarrow{\text{elementary row operations}} \mathbf{X} = \mathbf{A}^{-1},$
augmented matrix:  $\left[\begin{array}{cc|c} \mathbf{A}_r & -\mathbf{A}_i & \mathbf{I}_n\\ \mathbf{A}_i & \mathbf{A}_r & \mathbf{O}_n \end{array}\right] \xrightarrow{\text{elementary row operations}} \left[\begin{array}{cc|c} \mathbf{I}_n & \mathbf{O}_n & \mathbf{X}_r\\ \mathbf{O}_n & \mathbf{I}_n & \mathbf{X}_i \end{array}\right].$
That is to say, if we perform elementary row operations on the 2n × 3n augmented matrix to transform its left-hand side to a 2n × 2n identity matrix, then the upper and lower halves of the 2n × n matrix on the right give respectively the real and imaginary parts of the inverse matrix A^{−1} of the complex matrix A.
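As an illustrative sketch of the Gauss elimination method for matrix inversion (assuming NumPy; the helper name gauss_jordan_inverse is an editorial choice), row-reducing the augmented matrix [A, I] yields A^{-1} in the right-hand block:

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert a nonsingular matrix by row-reducing the augmented matrix [A, I] (a sketch)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    M = np.hstack([A, np.eye(n)])            # the n x 2n augmented matrix [A, I]
    for c in range(n):
        p = c + np.argmax(np.abs(M[c:, c]))  # partial pivoting for numerical stability
        M[[c, p]] = M[[p, c]]
        M[c] /= M[c, c]
        for i in range(n):
            if i != c:
                M[i] -= M[i, c] * M[c]
    return M[:, n:]                          # the right-hand block is A^{-1}

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.allclose(gauss_jordan_inverse(A), np.linalg.inv(A)))   # True
```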

1.3 Sets, Vector Subspaces and Linear Mapping

The set of all n-dimensional vectors with real (complex) components is called a real (complex) n-dimensional vector space, denoted R^n (C^n). The n-dimensional vectors of real-world problems usually belong to subsets other than the whole set R^n or C^n. These subsets are known as vector subspaces. In this section, we present sets, vector subspaces and the linear mapping of one vector subspace onto another.

1.3.1 Sets

Before introducing definitions of vector spaces and subspaces, it is necessary to present some set concepts. As the name implies, a set is a collection of elements. A set is usually denoted by S = {·}; inside the braces are the elements of the set S. If there are only a few elements in the set S, we write out these elements within the braces, e.g., S = {a, b, c, d}. To describe the composition of a more complex set mathematically, we use the symbol "|" to mean "such that". For example, S = {x | P(x) = 0} reads "the element x in set S such that P(x) = 0". A set with only one element α is called a singleton, denoted {α}.
The following are several common notations for set operations:
∀ denotes "for all · · · ";
x ∈ A reads "x belongs to the set A", i.e., x is an element or member of A;
x ∉ A means that x is not an element of the set A;
∍ denotes "such that";
∃ denotes "there exists";
A ⇒ B reads "condition A results in B" or "A implies B".
As an example, the trivial statement "there is a zero element θ in the set V such that the addition x + θ = x = θ + x holds for all elements x in V" can be concisely expressed in the above notation as follows:
∃ θ ∈ V ∍ x + θ = x = θ + x, ∀ x ∈ V.
Let A and B be two sets; then the sets have the following basic relations. The notation A ⊆ B reads "the set A is contained in the set B" or "A is a subset of B", which implies that each element in A is an element in B, namely x ∈ A ⇒ x ∈ B. If A ⊂ B then A is called a proper subset of B. The notation B ⊃ A reads "B contains A" or "B is a superset of A". The set with no elements is denoted by Ø and is called the null set. The notation A = B reads "the set A equals the set B", which means that A ⊆ B and B ⊆ A, or x ∈ A ⇔ x ∈ B (any element in A is an element in B, and vice versa). The negation of A = B is written as A ≠ B, implying that A does not belong to B, neither does B belong to A.
The union of A and B is denoted as A ∪ B. If X = A ∪ B then X is called the union set of A and B. The union set is defined as follows:
$$X = A \cup B = \{x \in X \mid x \in A \text{ or } x \in B\}. \tag{1.3.1}$$

In other words, the elements of the union set A ∪ B consist of the elements of A and the elements of B. The intersection of the sets A and B is represented by the notation A ∩ B and is defined as follows:
$$X = A \cap B = \{x \in X \mid x \in A \text{ and } x \in B\}. \tag{1.3.2}$$
The set X = A ∩ B is called the intersection set of A and B. Each element of the intersection set A ∩ B consists of elements common to both A and B. The notation Z = A + B means the sum set of sets A and B and is defined as follows:
$$Z = A + B = \{z = x + y \in Z \mid x \in A,\ y \in B\}, \tag{1.3.3}$$
namely, an element z of the sum set Z = A + B consists of the sum of an element x in A and an element y in B. The set-theoretic difference of the sets A and B, denoted "A − B", is also termed the difference set and is defined as follows:
$$X = A - B = \{x \in X \mid x \in A,\ \text{but } x \notin B\}. \tag{1.3.4}$$
That is to say, the difference set A − B is the set of elements of A that are not in B. The difference set A − B is also sometimes denoted by the notation X = A \ B. For example, {C^n \ 0} denotes the set of nonzero vectors in complex n-dimensional vector space. The relative complement of A in X is defined as
$$A^c = X - A = X \setminus A = \{x \in X \mid x \notin A\}. \tag{1.3.5}$$


EXAMPLE 1.4 For the sets A = {1, 2, 3}, B = {2, 3, 4}, we have
A ∪ B = {1, 2, 3, 4},  A ∩ B = {2, 3},  A + B = {3, 5, 7},
A − B = A \ B = {1},  B − A = B \ A = {4}.

If both X and Y are sets, and x ∈ X and y ∈ Y, then the set of all ordered pairs (x, y) is denoted by X × Y and is termed the Cartesian product of the sets X and Y; it is defined as follows:
$$X \times Y = \{(x, y) \mid x \in X,\ y \in Y\}. \tag{1.3.6}$$
Similarly, X_1 × · · · × X_n denotes the Cartesian product of n sets X_1, . . . , X_n, and its elements are the ordered n-tuples (x_1, . . . , x_n):
$$X_1 \times \cdots \times X_n = \{(x_1, \ldots, x_n) \mid x_1 \in X_1, \ldots, x_n \in X_n\}. \tag{1.3.7}$$
If f(X, Y) is a scalar function with real matrices X ∈ R^{n×n} and Y ∈ R^{n×n} as variables, then in linear mapping notation the function can be denoted by the Cartesian product form f : R^{n×n} × R^{n×n} → R.

1.3.2 Fields and Vector Spaces

The previous subsections use the symbols R, R^m, R^{m×n} and C, C^m, C^{m×n}. In this subsection we explain this notation from the viewpoint of vector spaces.
DEFINITION 1.14 [229] A field is an algebraic structure F consisting of a nonempty set F together with two operations: for any two elements α, β in the set F, the addition operation α + β and the multiplication operation αβ are uniquely determined such that, for any α, β, γ ∈ F, the following conditions are satisfied:
(1) α + β = β + α and αβ = βα;
(2) α + (β + γ) = (α + β) + γ and α(βγ) = (αβ)γ;
(3) α(β + γ) = αβ + αγ;
(4) there is an element 0 in F such that α + 0 = α for all α ∈ F;
(5) there is an element −α such that α + (−α) = 0 for any α ∈ F;
(6) there exists a nonzero element 1 ∈ F such that α1 = α for all α ∈ F;
(7) there is an element α^{−1} ∈ F such that αα^{−1} = 1 for any nonzero element α ∈ F.

The set of rational numbers, the set of real numbers and the set of complex numbers are fields, the rational number field Q, the real number field R and the complex number field C. However, the set of integers is not a field. When considering a nonempty set V with vectors as elements, the concept of the field can be generalized to the concept of vector space.


DEFINITION 1.15 A set V with vectors as elements is called a vector space if the addition operation is defined as the addition of two vectors, the multiplication operation is defined as the product of a vector and a scalar in the scalar field S and, for vectors x, y, w in the set V and scalars a_1, a_2 in the scalar field S, the following axioms, also termed postulates or laws, on addition and multiplication hold.
1. Closure properties
(1) If x ∈ V and y ∈ V then x + y ∈ V, namely V is closed under addition. This is called the closure for addition.
(2) If a_1 is a scalar and y ∈ V then a_1 y ∈ V, namely V is closed under scalar multiplication. This is called the closure for scalar multiplication.
2. Axioms on addition
(1) Commutative law for addition: x + y = y + x, ∀ x, y ∈ V.
(2) Associative law for addition: x + (y + w) = (x + y) + w, ∀ x, y, w ∈ V.
(3) Existence of the null vector: There exists a null vector 0 in V such that, for any vector y ∈ V, y + 0 = y holds.
(4) Existence of negative vectors: Given a vector y ∈ V, there is another vector −y ∈ V such that y + (−y) = 0 = (−y) + y.
3. Axioms on scalar multiplication
(1) Associative law for scalar multiplication: a(by) = (ab)y holds for all vectors y ∈ V and all scalars a, b ∈ S.
(2) Right distributive law for scalar multiplication: a(x + y) = ax + ay holds for all vectors x, y ∈ V and any scalar a in S.
(3) Left distributive law for scalar multiplication: (a + b)y = ay + by holds for all vectors y and all scalars a, b ∈ S.
(4) Unity law for scalar multiplication: 1y = y holds for all vectors y ∈ V.
As stated in the following theorem, vector spaces have other useful properties.
THEOREM 1.4 If V is a vector space, then
• The null vector 0 is unique.
• The inverse operation −y is unique for each y ∈ V.
• For every vector y ∈ V, 0y = 0 is true.
• For every scalar a, a0 = 0 holds.
• If ay = 0 then a = 0 or y = 0.
• (−1)y = −y.
Proof See [232, pp. 365–366].


If the vectors in a vector space V are real valued, and the scalar field is a real field, then V is referred to as a real vector space. Similarly, if the vectors in V are complex valued, and the scalar field is a complex field, then V is called a complex vector space. The two most typical vector spaces are the m-dimensional real vector space Rm and the m-dimensional complex vector space Cm . In many cases we are interested only in some given subset of a vector space V other than the whole space V . DEFINITION 1.16 Let W be a nonempty subset of the vector space V . Then W is called a vector subspace of V if all elements in W satisfy the following two conditions: (1) x − y ∈ W for all x, y ∈ W ; (2) αx ∈ W for all α ∈ R (or C) and all x ∈ W . Vector subspaces are very useful in engineering applications such as automatic control, signal processing, pattern recognition, wireless communication and so forth. We will devote our attention to the theory and applications of vector subspaces in Chapter 8, “Subspace Analysis and Tracking”.

1.3.3 Linear Mapping

In the previous subsections we discussed some simple operations on vectors, vector addition and the multiplication of a vector by a scalar, but we have not yet considered the transformation between vectors in two vector spaces. By Wikipedia, mapping is the creation of maps, a graphic symbolic representation of the significant features of a part of the surface of the Earth. In mathematics, mapping is a synonym for mathematical function or for morphism. The notation for mappings follows the usual notation for functions. If V is a subspace in R^m and W is a subspace in R^n then the notation
$$T : V \to W \tag{1.3.8}$$
denotes a general mapping, while
$$T : V \to W \tag{1.3.9}$$
represents a linear mapping or linear transformation, which is a mapping such that the following operations hold:
1. T(v_1 + v_2) = T(v_1) + T(v_2) for any vectors v_1 and v_2 in V;
2. T(αv) = αT(v) for any scalar α.
The main example of a linear transformation is given by matrix multiplication.


Thus, a mapping represents a rule for transforming the vectors in V to corresponding vectors in W. The subspace V is said to be the initial set or domain of the mapping T and W its final set or codomain. When v is some vector in the vector space V, T(v) is referred to as the image of the vector v under the mapping T, or the value of the mapping T at the point v, whereas v is called the original image of T(v).
Given a subspace A of the vector space V, the mapping
$$T(A) = \{T(v) \mid v \in A\} \tag{1.3.10}$$
represents the collection of images of the vectors v in A under the mapping T. In particular, if T(V) represents the collection of transformed outputs of all vectors v in V, i.e.,
$$T(V) = \{T(v) \mid v \in V\}, \tag{1.3.11}$$
then we say that T(V) is the range of the mapping T, denoted by
$$\mathrm{Im}(T) = T(V) = \{T(v) \mid v \in V\}. \tag{1.3.12}$$
DEFINITION 1.17 A mapping T : V → W is one-to-one if it is injective, i.e., T(x) = T(y) implies x = y; equivalently, distinct elements have distinct images. A one-to-one mapping T : V → W has an inverse mapping T^{−1} : W → V. The inverse mapping T^{−1} restores what the mapping T has done. Hence, if T(v) = w then T^{−1}(w) = v, resulting in T^{−1}(T(v)) = v, ∀ v ∈ V and T(T^{−1}(w)) = w, ∀ w ∈ W.

(1.3.13)

for all v1 , v2 ∈ V and all scalars c1 , c2 . If u1 , . . . , up are the input vectors of a system in engineering then T (u1 ), . . . , T (up ) can be viewed as the output vectors of the system. The criterion for identifying whether a system is linear is this: if the system input is the linear expression y = c1 u1 + · · · + cp up then we say that the system is linear only if its output satisfies the linear expression T (y) = T (c1 u1 + · · · + cp up ) = c1 T (u1 ) + · · · + cp T (up ). Otherwise, the system is nonlinear. EXAMPLE 1.5

Determine whether the following transformation T1 , T2 : R3 → R2

26

Introduction to Matrix Algebra

are linear:



x1 + x2



, where x = [x1 , x2 , x3 ]T , x21 − x22   x1 − x2 T2 (x) = , where x = [x1 , x2 , x3 ]T . x2 + x3

T1 (x) =

It is easily seen that the transformation T1 | R3 → R2 does not satisfy a linear relationship and thus is not a linear transformation, while T2 | R3 → R2 is a linear transformation since it satisfies a linear relationship. An interesting special application of linear mappings is the electronic amplifier A ∈ Cn×n with high fidelity (Hi-Fi). By Hi-Fi, one means that there is the following linear relationship between any input signal vector u and the corresponding output signal vector Au of the amplifier: Au = λu,

(1.3.14)

where λ > 1 is the amplification factor or gain. The equation above is a typical characteristic equation of a matrix. Another example is that the product of a matrix by a vector Am×n xn×1 can also be viewed as a linear mapping, T : x → Ax or T (x) = Ax, that transforms the vector x in Cn to a vector y = Ax in Cm . In view of this linear mapping or transformation, the product of a matrix A and a vector is usually called the matrix transformation of the vector, where A is a transformation matrix. The following are intrinsic relationships between a linear space and a linear mapping. THEOREM 1.5 [35, p. 29] Let V and W be two vector spaces, and let T : V → W be a linear mapping. Then the following relationships are true: • if M is a linear subspace in V then T (M ) is a linear subspace in W ; • if N is a linear subspace in W then the linear inverse transform T −1 (N ) is a linear subspace in V . A linear mapping has the following basic properties: if T : V → W is a linear mapping then T (0) = 0

and

T (−x) = −T (x).

(1.3.15)

In particular, for a given linear transformation y = Ax where the transformation matrix A is known, if our task is to obtain the output vector y from the input vector x then Ax = y is said to be a forward problem. Conversely, the problem of finding the input vector x from the output vector y is known as an inverse problem. Clearly, the essence of the forward problem is a matrix–vector calculation, while the essence of the inverse problem is solving a matrix equation.

1.4 Inner Products and Vector Norms

27

1.4 Inner Products and Vector Norms In mathematics, physics and engineering, we often need to measure the size (or length) and neighborhood for a given vector, and to calculate the angle, distance and similarity between vectors. In these cases, inner products and vector norms are two essential mathematical tools.

1.4.1 Inner Products of Vectors Let K represent a field that may be either the real field R or the complex field C, and let V be an n-dimensional vector space Rn or Cn . DEFINITION 1.19 The function x, y is called the inner product of two vectors x and y, and V is known as the inner product vector space, if for all vectors x, y, z ∈ V and scalars α, β ∈ K, the mapping function ·, ·|V × V → K satisfies the following three axioms. (1) Conjugate symmetry x, y = y, x∗ . (2) Linearity in the first argument αx + βy, z = αx, z + βy, z. (3) Nonnegativity x, x ≥ 0 and x, x = 0 ⇔ x = 0 (strict positivity). For a real inner product space, the conjugate symmetry becomes the real symmetry x, y = y, x. Linearity in the first argument contains the homogeneity αx, y = αx, y and the additivity x + y, z = x, z + y, z. Conjugate symmetry and linearity in the first argument imply that x, αy = αy, x∗ = α∗ y, x∗ = α∗ x, y, ∗





x, y + z = y + z, x = y, x + z, x = x, y + x, z.

(1.4.1) (1.4.2)

A vector space with an inner product is called an inner product space. It is easy to verify that the function x, y = xH y =

n 

x∗i yi

(1.4.3)

i=1

is an inner product of two vectors, since the three axioms above are satisfied. The function x, y = xH y is called the canonical inner product of n × 1 vectors x = [x1 , . . . , xn ]T and y = [y1 , . . . , yn ]T . Note that in some literature the following canonical inner product formula is used: n  x, y = xT y∗ = xi yi∗ . i=1

However, one sometimes adopts the weighted inner product x, yG = xH Gy,

(1.4.4)

28

Introduction to Matrix Algebra

where G is a weighting matrix such that xH Gx > 0, ∀ x ∈ Cn . −1 EXAMPLE 1.6 Let {e j2πf n }N n=0 be a sinusoidal sequence with frequency f , and  T 2π 2π en (f ) = 1, e j( n+1 )f , . . . , e j( n+1 )nf be an (n+1)×1 complex sinusoid vector. Then the discrete Fourier transform (DFT) of N data samples x(n), n = 0, 1, . . . , N − 1 can be represented by the canonical inner product as follows:

X(f ) =

N −1 

x(n)e−j2πnf /N = eH N −1 (f )x = eN −1 (f ), x,

n=0

where x = [x(0), x(1), . . . , x(N − 1)]T is the data vector.
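As an editorial illustration (assuming NumPy), each DFT coefficient can indeed be computed as a canonical inner product and matches a standard FFT routine:

```python
import numpy as np

N = 8
x = np.random.default_rng(4).standard_normal(N)

# X(k) = <e_k, x> = e_k^H x with e_k = [1, e^{j2*pi*k/N}, ..., e^{j2*pi*k(N-1)/N}]^T
n = np.arange(N)
X = np.array([np.vdot(np.exp(2j * np.pi * k * n / N), x) for k in range(N)])

print(np.allclose(X, np.fft.fft(x)))   # True: the DFT is a set of canonical inner products
```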

1.4.2 Norms of Vectors

DEFINITION 1.20 Let V be a real or complex vector space. Given a vector x ∈ V, the mapping function p(x) : V → R is called the norm of the vector x ∈ V if, for all vectors x, y ∈ V and any scalar c ∈ K (here K denotes R or C), the following three norm axioms hold:
(1) Nonnegativity: p(x) ≥ 0 and p(x) = 0 ⇔ x = 0.
(2) Homogeneity: p(cx) = |c| · p(x) is true for all complex constants c.
(3) Triangle inequality: p(x + y) ≤ p(x) + p(y).
In a real or complex inner product space V, vector norms have the following properties [35].
1. ‖0‖ = 0 and ‖x‖ > 0, ∀ x ≠ 0.
2. ‖cx‖ = |c| ‖x‖ holds for all vectors x ∈ V and any scalar c ∈ K.
3. Polarization identity: For real inner product spaces we have
$$\langle \mathbf{x}, \mathbf{y}\rangle = \frac{1}{4}\left(\|\mathbf{x} + \mathbf{y}\|^2 - \|\mathbf{x} - \mathbf{y}\|^2\right), \quad \forall\, \mathbf{x}, \mathbf{y}, \tag{1.4.5}$$
and for complex inner product spaces we have
$$\langle \mathbf{x}, \mathbf{y}\rangle = \frac{1}{4}\left(\|\mathbf{x} + \mathbf{y}\|^2 - \|\mathbf{x} - \mathbf{y}\|^2 - \mathrm{j}\,\|\mathbf{x} + \mathrm{j}\,\mathbf{y}\|^2 + \mathrm{j}\,\|\mathbf{x} - \mathrm{j}\,\mathbf{y}\|^2\right), \quad \forall\, \mathbf{x}, \mathbf{y}. \tag{1.4.6}$$
4. Parallelogram law
$$\|\mathbf{x} + \mathbf{y}\|^2 + \|\mathbf{x} - \mathbf{y}\|^2 = 2\left(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2\right), \quad \forall\, \mathbf{x}, \mathbf{y}. \tag{1.4.7}$$
5. Triangle inequality
$$\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|, \quad \forall\, \mathbf{x}, \mathbf{y} \in V. \tag{1.4.8}$$
6. Cauchy–Schwartz inequality
$$|\langle \mathbf{x}, \mathbf{y}\rangle| \le \|\mathbf{x}\|\,\|\mathbf{y}\|. \tag{1.4.9}$$
The equality |⟨x, y⟩| = ‖x‖ ‖y‖ holds if and only if y = c x, where c is some nonzero complex constant.
1. Common Norms of Constant Vectors
The following are several common norms of constant vectors.
(1) ℓ0-norm
$$\|\mathbf{x}\|_0 \overset{\text{def}}{=} \text{number of nonzero entries of } \mathbf{x}. \tag{1.4.10}$$
(2) ℓ1-norm
$$\|\mathbf{x}\|_1 \overset{\text{def}}{=} \sum_{i=1}^m |x_i| = |x_1| + \cdots + |x_m|. \tag{1.4.11}$$
(3) ℓ2-norm or the Euclidean norm
$$\|\mathbf{x}\|_2 = \|\mathbf{x}\|_E \overset{\text{def}}{=} \left(|x_1|^2 + \cdots + |x_m|^2\right)^{1/2}. \tag{1.4.12}$$
(4) ℓ∞-norm
$$\|\mathbf{x}\|_\infty \overset{\text{def}}{=} \max\{|x_1|, \ldots, |x_m|\}. \tag{1.4.13}$$
(5) ℓp-norm or Hölder norm [275]
$$\|\mathbf{x}\|_p \overset{\text{def}}{=} \left(\sum_{i=1}^m |x_i|^p\right)^{1/p}, \quad p \ge 1. \tag{1.4.14}$$

The ℓ0-norm does not satisfy the homogeneity ‖cx‖_0 = |c| ‖x‖_0, and thus is a quasi-norm. However, the ℓ0-norm plays a key role in the representation and analysis of sparse vectors, as will be seen in Section 1.12. The ℓp-norm is a quasi-norm if 0 < p < 1 and a norm if p ≥ 1. Clearly, when p = 1 or p = 2, the ℓp-norm reduces to the ℓ1-norm or the ℓ2-norm, respectively. The most commonly used vector norm is the Euclidean norm. The following are several important applications of the Euclidean norm.
• Measuring the size or length of a vector:
$$\mathrm{size}(\mathbf{x}) = \|\mathbf{x}\|_2 = \sqrt{x_1^2 + \cdots + x_m^2}, \tag{1.4.15}$$
which is called the Euclidean length.
• Defining the ε-neighborhood of a vector x:
$$N_\epsilon(\mathbf{x}) = \{\mathbf{y} \mid \|\mathbf{y} - \mathbf{x}\|_2 \le \epsilon\}, \quad \epsilon > 0. \tag{1.4.16}$$
• Measuring the distance between vectors x and y:
$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{(x_1 - y_1)^2 + \cdots + (x_m - y_m)^2}. \tag{1.4.17}$$
This is called the Euclidean distance.
• Defining the angle θ (0 ≤ θ ≤ 2π) between vectors x and y:
$$\theta \overset{\text{def}}{=} \arccos\left(\frac{\langle \mathbf{x}, \mathbf{y}\rangle}{\sqrt{\langle \mathbf{x}, \mathbf{x}\rangle\langle \mathbf{y}, \mathbf{y}\rangle}}\right) = \arccos\left(\frac{\mathbf{x}^H\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}\right).$$

(1.4.18)
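A short sketch, assuming NumPy, that evaluates the common vector norms of (1.4.10)-(1.4.14) and the Euclidean distance and angle defined above; the sample vectors are arbitrary:

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])
y = np.array([1.0, 2.0, 2.0])

l0   = np.count_nonzero(x)           # l0 "norm": number of nonzero entries -> 2
l1   = np.linalg.norm(x, 1)          # -> 7.0
l2   = np.linalg.norm(x)             # Euclidean length -> 5.0
linf = np.linalg.norm(x, np.inf)     # -> 4.0

dist  = np.linalg.norm(x - y)                                        # Euclidean distance
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))   # angle between x and y
```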

A vector with unit Euclidean length is referred to as a normalized (or standardized) vector. For any nonzero vector x ∈ C^m, x/⟨x, x⟩^{1/2} is the normalized version of the vector and has the same direction as x. The norm ‖x‖ is said to be a unitary invariant norm if ‖Ux‖ = ‖x‖ holds for all vectors x ∈ C^m and all unitary matrices U ∈ C^{m×m}.
PROPOSITION 1.1 [214] The Euclidean norm ‖·‖_2 is unitary invariant.
If the inner product ⟨x, y⟩ = x^H y = 0, this implies that the angle between the vectors is θ = π/2. In this case, the constant vectors x and y are said to be orthogonal, from which we have the following definition.
DEFINITION 1.21 Two constant vectors x and y are said to be orthogonal, denoted x ⊥ y, if their inner product ⟨x, y⟩ = x^H y = 0.
2. Inner Products and Norms of Function Vectors
DEFINITION 1.22 Let x(t), y(t) be two function vectors in the complex vector space C^n, and let the definition field of the function variable t be [a, b] with a < b. Then the inner product of the function vectors x(t) and y(t) is defined as follows:
$$\langle \mathbf{x}(t), \mathbf{y}(t)\rangle \overset{\text{def}}{=} \int_a^b \mathbf{x}^H(t)\mathbf{y}(t)\,dt. \tag{1.4.19}$$

It can be verified that the above function satisfies the three axioms of inner products (see Definition 1.19). Note that the real field R is a one-dimensional inner product space, but it is not a Euclidean space as the real field R is not finite dimensional. The angle between function vectors is defined as follows:
$$\cos\theta \overset{\text{def}}{=} \frac{\langle \mathbf{x}(t), \mathbf{y}(t)\rangle}{\|\mathbf{x}(t)\|\cdot\|\mathbf{y}(t)\|} = \frac{\int_a^b \mathbf{x}^H(t)\mathbf{y}(t)\,dt}{\sqrt{\langle \mathbf{x}(t), \mathbf{x}(t)\rangle\langle \mathbf{y}(t), \mathbf{y}(t)\rangle}}, \tag{1.4.20}$$
where ‖x(t)‖ is the norm of the function vector x(t) and is defined as follows:
$$\|\mathbf{x}(t)\| \overset{\text{def}}{=} \left(\int_a^b \mathbf{x}^H(t)\mathbf{x}(t)\,dt\right)^{1/2}. \tag{1.4.21}$$


Clearly, if the inner product of two function vectors is equal to zero, i.e.,
$$\int_a^b \mathbf{x}^H(t)\mathbf{y}(t)\,dt = 0,$$
then the angle between them is θ = π/2. Hence, two function vectors are said to be orthogonal in [a, b], denoted x(t) ⊥ y(t), if their inner product is equal to zero.
3. Inner Products and Norms of Random Vectors
DEFINITION 1.23 Let x(ξ) and y(ξ) be two n × 1 random vectors of the variable ξ. Then the inner product of random vectors is defined as follows:
$$\langle \mathbf{x}(\xi), \mathbf{y}(\xi)\rangle \overset{\text{def}}{=} \mathrm{E}\{\mathbf{x}^H(\xi)\mathbf{y}(\xi)\}, \tag{1.4.22}$$
where E is the expectation operator E{x(ξ)} = [E{x_1(ξ)}, . . . , E{x_n(ξ)}]^T and the function variable ξ may be time t, circular frequency f, angular frequency ω or a space parameter s, and so on. The square of the norm of a random vector x(ξ) is defined as
$$\|\mathbf{x}(\xi)\|^2 \overset{\text{def}}{=} \mathrm{E}\{\mathbf{x}^H(\xi)\mathbf{x}(\xi)\}.$$

(1.4.23)

In contrast with the case for a constant vector or function vector, an m × 1 random vector x(ξ) and an n × 1 random vector y(ξ) are said to be orthogonal if any entry of x(ξ) is orthogonal to each entry of y(ξ). The following is the definition of random vector orthogonality.
DEFINITION 1.24 An m × 1 random vector x(ξ) and an n × 1 random vector y(ξ) are orthogonal, denoted by x(ξ) ⊥ y(ξ), if their cross-correlation matrix is equal to the m × n null matrix O, i.e.,
$$\mathrm{E}\{\mathbf{x}(\xi)\mathbf{y}^H(\xi)\} = \mathbf{O}. \tag{1.4.24}$$
The following proposition shows that, for any two orthogonal vectors, the square of the norm of their sum is equal to the sum of the squares of the respective vector norms.
PROPOSITION 1.2 If x ⊥ y, then ‖x + y‖^2 = ‖x‖^2 + ‖y‖^2.
Proof From the axioms of vector norms it is known that
$$\|\mathbf{x} + \mathbf{y}\|^2 = \langle \mathbf{x} + \mathbf{y}, \mathbf{x} + \mathbf{y}\rangle = \langle \mathbf{x}, \mathbf{x}\rangle + \langle \mathbf{x}, \mathbf{y}\rangle + \langle \mathbf{y}, \mathbf{x}\rangle + \langle \mathbf{y}, \mathbf{y}\rangle. \tag{1.4.25}$$
Since x and y are orthogonal, we have ⟨x, y⟩ = E{x^H y} = 0. Moreover, from the axioms of inner products it is known that ⟨y, x⟩ = ⟨x, y⟩^* = 0. Substituting these results into Equation (1.4.25), we immediately get
$$\|\mathbf{x} + \mathbf{y}\|^2 = \langle \mathbf{x}, \mathbf{x}\rangle + \langle \mathbf{y}, \mathbf{y}\rangle = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2.$$
This completes the proof of the proposition.

Introduction to Matrix Algebra

This proposition is also referred to as the Pythagorean theorem. On the orthogonality of two vectors, we have the following conclusions. • Mathematical definitions Two vectors x and y are orthogonal if their inner product is equal to zero, i.e., x, y = 0 (for constant vectors and function vectors), or the mathematical expectation of their outer product is equal to a null matrix O, i.e., E{xyH } = O (for random vectors). • Geometric interpretation If two vectors are orthogonal then their angle is π/2, and the projection of one vector onto the other vector is equal to zero. • Physical significance When two vectors are orthogonal, each vector contains no components of the other, that is, there exist no interactions or interference between these vectors. The main points above are useful for adopting flexibly the orthogonality of two vectors in engineering applications.

1.4.3 Similarity Comparison Between Vectors

Clustering and pattern classification are two important techniques in statistical data analysis. By clustering, we mean that a given big data set is divided into several data subsets (categories) in which the data in each subset have common or similar features. Pattern classification involves classifying some unknown stimuli (activations) into a fixed number of categories (classes). The main mathematical tool for clustering and classification is the distance measure. The distance between two vectors p and g, denoted D(p, g), is a measure if it has the following properties:

(1) Nonnegativity and positiveness: D(p, g) ≥ 0, and equality holds if and only if p = g.
(2) Symmetry: D(p, g) = D(g, p).
(3) Triangle inequality: D(p, z) ≤ D(p, g) + D(g, z).

It is easily shown that the Euclidean distance is such a measure. In many applications, data vectors must be reduced to low-dimensional vectors via some transformation or processing method. These low-dimensional vectors are called pattern vectors or feature vectors, owing to the fact that they extract the features of the original data vectors, and they are used directly for pattern clustering and classification. For example, the colors of clouds and the parameters of voice tones are pattern or feature vectors in weather forecasting and voice classification, respectively. The basic rule of clustering and classification is to adopt some distance metric to measure the similarity of two feature vectors. As the name suggests, this similarity is a measure of the degree of similarity between vectors.


Consider a pattern classification problem. For simplicity, suppose that there are M classes of pattern vectors s₁, ..., s_M. Our problem is: given a feature vector x with an unknown pattern, we hope to recognize to which class it belongs. For this purpose, we need to compare the unknown pattern vector x with the M known pattern vectors in order to recognize to which known pattern vector the given vector x is most similar. On the basis of a similarity comparison, we can obtain the pattern or signal classification.

A quantity known as the dissimilarity is used to make a reverse measurement of the similarity between vectors: two vectors with a small dissimilarity are similar. Let D(x, s₁), ..., D(x, s_M) be the dissimilarities between the unknown pattern vector x and the respective known pattern vectors s₁, ..., s_M. As an example, we compare the dissimilarities between x and s₁, s₂. If

$$D(x, s_1) < D(x, s_2), \qquad (1.4.26)$$

then we say the unknown pattern vector x is more similar to s₁ than to s₂, where s₁ and s₂ are known pattern vectors.

The simplest and most intuitive dissimilarity parameter is the Euclidean distance between vectors. The Euclidean distance between the unknown pattern vector x and the ith known pattern vector s_i, denoted D_E(x, s_i), is given by

$$D_E(x, s_i) = \|x - s_i\|_2 = \sqrt{(x - s_i)^T(x - s_i)}. \qquad (1.4.27)$$

Two vectors are exactly the same if their Euclidean distance is equal to zero:

$$D_E(x, y) = 0 \;\Longleftrightarrow\; x = y.$$

If

$$D_E(x, s_i) = \min_k D_E(x, s_k), \qquad k = 1, \ldots, M, \qquad (1.4.28)$$

then s_i ∈ {s₁, ..., s_M} is said to be a nearest neighbour to x. A widely used classification method is nearest neighbour classification, which judges x to belong to the pattern type corresponding to its nearest neighbour.

Another frequently used distance function is the Mahalanobis distance, proposed by Mahalanobis in 1936 [312]. The Mahalanobis distance from the vector x to its mean μ is given by

$$D_M(x, \mu) = \sqrt{(x - \mu)^T C_x^{-1}(x - \mu)}, \qquad (1.4.29)$$

where C_x = Cov(x, x) = E{(x − μ)(x − μ)^T} is the autocovariance matrix of the vector x. The Mahalanobis distance between vectors x ∈ R^n and y ∈ R^n is denoted D_M(x, y) and is defined as [312]

$$D_M(x, y) = \sqrt{(x - y)^T C_{xy}^{-1}(x - y)}, \qquad (1.4.30)$$


where C_{xy} = Cov(x, y) = E{(x − μ_x)(y − μ_y)^T} is the cross-covariance matrix of x and y, while μ_x and μ_y are the means of x and y, respectively. Clearly, if the covariance matrix is the identity matrix, i.e., C = I, then the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix takes a diagonal form then the corresponding Mahalanobis distance is called the normalized Euclidean distance and is given by

$$D_M(x, y) = \sqrt{\sum_{i=1}^{n}\frac{(x_i - y_i)^2}{\sigma_i^2}}, \qquad (1.4.31)$$

in which σ_i is the standard deviation of x_i and y_i over the whole sample set. Let

$$\mu = \frac{1}{M}\sum_{i=1}^{M} s_i, \qquad C = \frac{1}{M}\sum_{i=1}^{M}(s_i - \mu)(s_i - \mu)^T \qquad (1.4.32)$$

be the sample mean vector and the sample covariance matrix of the M known pattern vectors s_i. Then the Mahalanobis distance from the unknown pattern vector x to the ith known pattern vector s_i is defined as

$$D_M(x, s_i) = \sqrt{(x - s_i)^T C^{-1}(x - s_i)}. \qquad (1.4.33)$$

By the nearest neighbour classification method, if

$$D_M(x, s_i) = \min_k D_M(x, s_k), \qquad k = 1, \ldots, M, \qquad (1.4.34)$$

then the unknown pattern vector x is recognized as being in the pattern type to which s_i belongs.

The measure of dissimilarity between vectors is not necessarily limited to distance functions. The cosine function of the acute angle between two vectors,

$$D(x, s_i) = \cos\theta_i = \frac{x^T s_i}{\|x\|_2\,\|s_i\|_2}, \qquad (1.4.35)$$

is an effective measure of the similarity between vectors as well. If cos θ_i > cos θ_j, ∀ j ≠ i, then the unknown pattern vector x is said to be most similar to the known pattern vector s_i. The variant of Equation (1.4.35)

$$D(x, s_i) = \frac{x^T s_i}{x^T x + s_i^T s_i - x^T s_i} \qquad (1.4.36)$$

is referred to as the Tanimoto measure [469] and is widely used in information retrieval, the classification of diseases, animal and plant classifications etc. The signal to be classified as one of a set of objects is called the target signal. Signal classification is usually based on some physical or geometric attributes or concepts relating to the objects. Let X be a target signal and let Ai represent a classification attribute defining the ith object. Then we can adopt the object–concept

1.4 Inner Products and Vector Norms

35

distance D(X, A_i) to describe the dissimilarity between the target signal and the ith object [448], and thus we have a relationship similar to Equation (1.4.26): if

$$D(X, A_i) < D(X, A_j), \qquad \forall\, j \neq i, \qquad (1.4.37)$$

then the target signal X is classified as the ith type of object, C_i, i.e., the object with the minimum object–concept distance D(X, A_i).
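A minimal NumPy sketch of nearest-neighbour classification with the Euclidean and Mahalanobis distances follows; the pattern vectors and feature vector are made-up illustrative values, not data from the text:

```python
import numpy as np

# Known pattern vectors s_1, ..., s_M (rows) and an unknown feature vector x.
S = np.array([[0.0, 0.0], [4.0, 4.0], [8.0, 0.0]])
x = np.array([3.2, 3.5])

# Euclidean nearest neighbour: argmin_i D_E(x, s_i), Eq. (1.4.28).
d_euc = np.linalg.norm(S - x, axis=1)
print("Euclidean NN:", np.argmin(d_euc))

# Mahalanobis distance using the sample mean and covariance of the patterns.
C = np.cov(S, rowvar=False)                    # sample covariance of the patterns
C_inv = np.linalg.inv(C + 1e-9 * np.eye(2))    # tiny ridge in case C is singular
d_mah = [np.sqrt((x - s) @ C_inv @ (x - s)) for s in S]
print("Mahalanobis NN:", np.argmin(d_mah))
```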

1.4.4 Banach Space, Euclidean Space, Hilbert Space

A vector space equipped with a vector norm is called a normed vector space. Let X be a normed vector space. If, for every Cauchy sequence {x_n} in X, there exists an element x in X such that

$$\lim_{n\to\infty} x_n = x,$$

then X is known as a complete vector space. The above convergence of a Cauchy sequence of vectors can be equivalently written as convergence in the norm,

$$\lim_{n\to\infty}\|x_n - x\| = 0.$$

A Banach space is a complete normed vector space. Because the vector norm is a metric that allows the computation of vector length and the distance between vectors, a Banach space is a vector space with a metric that can compute vector length and the distance between vectors and that is complete in the sense that a Cauchy sequence of vectors always converges in the norm.

A Euclidean space is an affine space with the standard Euclidean structure and encompasses the two-dimensional Euclidean plane, the three-dimensional space of Euclidean geometry, and other higher-dimensional spaces. Euclidean spaces have finite dimension [449]. An m-dimensional Euclidean space is usually denoted by R^m, which is assumed to have the standard Euclidean structure. The standard Euclidean structure includes the following:

(1) the standard inner product (also known as the dot product) on R^m,

$$\langle x, y\rangle = x\cdot y = x^T y = x_1 y_1 + \cdots + x_m y_m;$$

(2) the Euclidean length of a vector x on R^m,

$$\|x\| = \sqrt{\langle x, x\rangle} = \sqrt{x_1^2 + \cdots + x_m^2};$$

(3) the Euclidean distance between x and y on R^m,

$$d(x, y) = \|x - y\| = \sqrt{(x_1 - y_1)^2 + \cdots + (x_m - y_m)^2};$$


(4) the (nonreflex) angle θ (0° ≤ θ ≤ 180°) between vectors x and y on R^m,

$$\theta = \arccos\left(\frac{\langle x, y\rangle}{\|x\|\,\|y\|}\right) = \arccos\left(\frac{x^T y}{\|x\|\,\|y\|}\right).$$

A vector space equipped with an inner product is referred to as an inner product space. Hilbert space, named after David Hilbert, is a generalization of Euclidean space: it is no longer limited to the finite-dimensional case. Similarly to Euclidean space, a Hilbert space is an inner product space that allows the length of a vector and the distance and angle between vectors to be measured, and thus it possesses the notion of orthogonality. Furthermore, a Hilbert space is a complete space: all Cauchy sequences converge, so that most concepts from calculus can be extended to Hilbert spaces. Hilbert spaces provide an efficient representation for Fourier series and Fourier transforms based on arbitrary orthogonal series.

1.4.5 Inner Products and Norms of Matrices

The inner products and norms of vectors can be extended to the inner products and norms of matrices. Consider m × n complex matrices A = [a₁, ..., a_n] and B = [b₁, ..., b_n]. We stack A and B respectively into the following mn × 1 vectors according to their columns:

$$a = \mathrm{vec}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}, \qquad b = \mathrm{vec}(B) = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix},$$

where the elongated vector vec(A) is the vectorization of the matrix A. We will discuss the vectorization of matrices in detail in Section 1.11.

DEFINITION 1.25 The inner product of two m × n matrices A and B, denoted ⟨A, B⟩ : C^{m×n} × C^{m×n} → C, is defined as the inner product of the two elongated vectors:

$$\langle A, B\rangle = \langle\mathrm{vec}(A), \mathrm{vec}(B)\rangle = \sum_{i=1}^{n} a_i^H b_i = \sum_{i=1}^{n}\langle a_i, b_i\rangle, \qquad (1.4.38)$$

equivalently written as

$$\langle A, B\rangle = (\mathrm{vec}\,A)^H\,\mathrm{vec}(B) = \mathrm{tr}(A^H B), \qquad (1.4.39)$$

where tr(C) represents the trace of a square matrix C, defined as the sum of its diagonal entries.

Let K represent the real or complex field; thus K^{m×n} denotes either R^{m×n} or C^{m×n}.


DEFINITION 1.26 The norm of the matrix A ∈ K^{m×n}, denoted ‖A‖, is defined as a real-valued function having the following properties:

(a) For any nonzero matrix A ≠ O, its norm is larger than zero, i.e., ‖A‖ > 0 if A ≠ O, and ‖A‖ = 0 if and only if A = O.
(b) ‖cA‖ = |c| ‖A‖ for any c ∈ K.
(c) ‖A + B‖ ≤ ‖A‖ + ‖B‖.
(d) The norm of the product of two matrices is less than or equal to the product of their norms, that is, ‖AB‖ ≤ ‖A‖ ‖B‖.

EXAMPLE 1.7 For the real-valued function f(A) = Σ_{i=1}^{n} Σ_{j=1}^{n} |a_{ij}|, it is easy to verify the following:

(a) f(A) ≥ 0; when A = O, i.e., when a_{ij} ≡ 0, f(A) = 0.
(b) $f(cA) = \sum_{i=1}^{n}\sum_{j=1}^{n}|c\,a_{ij}| = |c|\sum_{i=1}^{n}\sum_{j=1}^{n}|a_{ij}| = |c|\,f(A).$
(c) $f(A + B) = \sum_{i=1}^{n}\sum_{j=1}^{n}|a_{ij} + b_{ij}| \le \sum_{i=1}^{n}\sum_{j=1}^{n}(|a_{ij}| + |b_{ij}|) = f(A) + f(B).$
(d) For the product of two matrices, we have

$$f(AB) = \sum_{i=1}^{n}\sum_{j=1}^{n}\left|\sum_{k=1}^{n} a_{ik}b_{kj}\right| \le \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}|a_{ik}|\,|b_{kj}| \le \left(\sum_{i=1}^{n}\sum_{k=1}^{n}|a_{ik}|\right)\left(\sum_{k=1}^{n}\sum_{l=1}^{n}|b_{kl}|\right) = f(A)\,f(B).$$

Hence, by Definition 1.26, the real function f(A) = Σ_i Σ_j |a_{ij}| is a matrix

norm.

There are three types of common matrix norms: induced norms, entrywise norms and Schatten norms.

1. Induced Norms

The induced norm of an m × n matrix A ∈ K^{m×n} is defined by means of the norm ‖x‖ of the vector x ∈ K^n and the norm ‖Ax‖ of the vector Ax ∈ K^m as

$$\|A\| = \sup\left\{\frac{\|Ax\|}{\|x\|}\;\Big|\; x \in K^n \text{ with } x \neq 0\right\} \qquad (1.4.40)$$
$$\phantom{\|A\|} = \sup\left\{\|Ax\|\;\Big|\; x \in K^n \text{ with } \|x\| = 1\right\}. \qquad (1.4.41)$$

The induced norm is also called the operator norm. A common induced norm is the p-norm:

$$\|A\|_p \overset{\mathrm{def}}{=} \sup_{x\neq 0}\frac{\|Ax\|_p}{\|x\|_p}, \qquad (1.4.42)$$


where p = 1, 2, .... The p-norm is also referred to as the Minkowski p-norm or ℓ_p-norm. In particular, when p = 1 or ∞ the corresponding induced norms are respectively

$$\|A\|_1 = \max_{1\le j\le n}\sum_{i=1}^{m}|a_{ij}|, \qquad (1.4.43)$$
$$\|A\|_\infty = \max_{1\le i\le m}\sum_{j=1}^{n}|a_{ij}|. \qquad (1.4.44)$$

That is, ‖A‖₁ and ‖A‖_∞ are simply the maximum absolute column sum and maximum absolute row sum of the matrix, respectively. The induced norms ‖A‖₁ and ‖A‖_∞ are also called the absolute column sum norm and the absolute row sum norm, respectively.

EXAMPLE 1.8 For the matrix

$$A = \begin{bmatrix} 1 & -2 & 3 \\ -4 & 5 & -6 \\ 7 & -8 & -9 \\ -10 & 11 & 12 \end{bmatrix},$$

its absolute column sum norm and row sum norm are respectively given by

$$\|A\|_1 = \max\{1 + |-4| + 7 + |-10|,\; |-2| + 5 + |-8| + 11,\; 3 + |-6| + |-9| + 12\} = \max\{22, 26, 30\} = 30,$$
$$\|A\|_\infty = \max\{1 + |-2| + 3,\; |-4| + 5 + |-6|,\; 7 + |-8| + |-9|,\; |-10| + 11 + 12\} = \max\{6, 15, 24, 33\} = 33.$$

Another common induced matrix norm is the spectral norm, obtained with p = 2 and denoted ‖A‖₂ = ‖A‖_spec. It is defined as

$$\|A\|_2 = \|A\|_{\mathrm{spec}} = \sqrt{\lambda_{\max}(A^H A)} = \sigma_{\max}(A), \qquad (1.4.45)$$

i.e., the spectral norm is the largest singular value of A or, equivalently, the square root of the largest eigenvalue of the positive semi-definite matrix A^H A.

2. Entrywise Norms

Another type of matrix norm, the entrywise norms, treats an m × n matrix as a column vector of size mn and uses one of the familiar vector norms. Let a = [a₁₁, ..., a_{m1}, a₁₂, ..., a_{m2}, ..., a_{1n}, ..., a_{mn}]^T = vec(A) be the mn × 1 elongated vector of the m × n matrix A. If we use the ℓ_p-norm definition for the elongated vector a then we obtain the ℓ_p-norm of the matrix A as follows:

$$\|A\|_p = \|a\|_p = \|\mathrm{vec}(A)\|_p \overset{\mathrm{def}}{=} \left(\sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^p\right)^{1/p}. \qquad (1.4.46)$$


Since this kind of matrix norm is represented by the matrix entries, it is named the entrywise norm. The following are three typical entrywise matrix norms.

(1) ℓ₁-norm (p = 1):

$$\|A\|_1 \overset{\mathrm{def}}{=} \sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|; \qquad (1.4.47)$$

(2) Frobenius norm (p = 2):

$$\|A\|_F \overset{\mathrm{def}}{=} \left(\sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^2\right)^{1/2}; \qquad (1.4.48)$$

(3) max norm or ℓ_∞-norm (p = ∞):

$$\|A\|_\infty = \max_{i=1,\ldots,m;\; j=1,\ldots,n}\{|a_{ij}|\}. \qquad (1.4.49)$$

The Frobenius norm is the extension of the Euclidean norm of a vector to the elongated vector a = [a₁₁, ..., a_{m1}, a₁₂, ..., a_{1n}, ..., a_{mn}]^T. The Frobenius norm can also be written in terms of the trace function as follows:

$$\|A\|_F = \langle A, A\rangle^{1/2} = \sqrt{\mathrm{tr}(A^H A)} = \sqrt{\sum_{i=1}^{\min\{m,n\}}\sigma_i^2(A)}. \qquad (1.4.50)$$

Given an m × n matrix A, its Frobenius norm weighted by a positive definite matrix Ω, denoted ‖A‖_Ω, is defined by

$$\|A\|_\Omega = \sqrt{\mathrm{tr}(A^H\Omega A)}. \qquad (1.4.51)$$

This norm is usually called the Mahalanobis norm.

From ‖A‖₂ = σ_max(A) and ‖A‖_F = (Σ_{i=1}^{min{m,n}} σ_i²(A))^{1/2} one gets a relationship between the induced norm and the entrywise norm with p = 2:

$$\|A\|_2 \le \|A\|_F, \qquad (1.4.52)$$

and the equality holds if and only if the matrix A is a rank-1 matrix or a zero matrix, since Σ_{i=1}^{min{m,n}} σ_i²(A) = σ_max²(A) in this case. The Frobenius norm ‖A‖_F is very useful in numerical linear algebra and matrix analysis, since it is more easily calculated than the induced norm ‖A‖₂.

3. Schatten Norms

The common Schatten norms are the Schatten p-norms, which are defined by the vector of singular values of a matrix.


If the singular values of a matrix A ∈ C^{m×n} are denoted by σ_i then the Schatten p-norms are defined by

$$\|A\|_p = \left(\sum_{i=1}^{\min\{m,n\}}\sigma_i^p\right)^{1/p}, \qquad p = 1, 2, \infty. \qquad (1.4.53)$$

The Schatten p-norms share the same notation as the induced and entrywise p-norms, but they are different. The case p = 2 yields the Frobenius norm introduced before. The most familiar Schatten p-norm is the Schatten norm with p = 1. In order to avoid confusion, this norm will be denoted by ‖A‖_* or ‖A‖_tr rather than ‖A‖₁, and is called the nuclear norm (also known as the trace norm); it is defined as follows:

$$\|A\|_* = \|A\|_{\mathrm{tr}} = \mathrm{tr}\big(\sqrt{A^H A}\big) = \sum_{i=1}^{\min\{m,n\}}\sigma_i, \qquad (1.4.54)$$

where tr(A) is the trace of the square matrix A.

If A and B are m × n matrices then their matrix norms have the following properties:

$$\|A + B\|^2 + \|A - B\|^2 = 2\left(\|A\|^2 + \|B\|^2\right) \quad\text{(parallelogram law)}, \qquad (1.4.55)$$
$$\|A + B\|\,\|A - B\| \le \|A\|^2 + \|B\|^2. \qquad (1.4.56)$$

The following are relationships between the inner products and norms [214].

(1) Cauchy–Schwartz inequality:

$$|\langle A, B\rangle|^2 \le \|A\|^2\,\|B\|^2, \qquad (1.4.57)$$

where the equality holds if and only if A = cB for some complex constant c.

(2) Pythagoras' theorem:

$$\langle A, B\rangle = 0 \;\Longrightarrow\; \|A + B\|^2 = \|A\|^2 + \|B\|^2. \qquad (1.4.58)$$

(3) Polarization identity:

$$\mathrm{Re}\,\langle A, B\rangle = \tfrac{1}{4}\left(\|A + B\|^2 - \|A - B\|^2\right), \qquad (1.4.59)$$
$$\mathrm{Re}\,\langle A, B\rangle = \tfrac{1}{2}\left(\|A + B\|^2 - \|A\|^2 - \|B\|^2\right), \qquad (1.4.60)$$

where Re⟨A, B⟩ represents the real part of the inner product ⟨A, B⟩. A short numerical illustration of the matrix norms introduced in this subsection is given below.
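The following NumPy sketch evaluates the main matrix norms for the matrix of Example 1.8 and checks the relation (1.4.52); it is only an illustration and assumes NumPy's built-in norm routines:

```python
import numpy as np

A = np.array([[1, -2, 3], [-4, 5, -6], [7, -8, -9], [-10, 11, 12]], dtype=float)

print(np.linalg.norm(A, 1))        # induced 1-norm: max absolute column sum -> 30
print(np.linalg.norm(A, np.inf))   # induced inf-norm: max absolute row sum -> 33
print(np.linalg.norm(A, 2))        # spectral norm: largest singular value
print(np.linalg.norm(A, 'fro'))    # Frobenius norm
print(np.linalg.norm(A, 'nuc'))    # nuclear (trace) norm: sum of singular values

# ||A||_2 <= ||A||_F, with equality only for rank-1 (or zero) matrices.
print(np.linalg.norm(A, 2) <= np.linalg.norm(A, 'fro'))   # True
```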

1.5 Random Vectors In engineering applications, the measured data are usually random variables. A vector with random variables as its entries is called a random vector. In this section, we discuss the statistics and properties of random vectors by focusing on Gaussian random vectors.


1.5.1 Statistical Interpretation of Random Vectors

In the statistical interpretation of random vectors, the first-order and second-order statistics of random vectors are the most important. Given a random vector x(ξ) = [x₁(ξ), ..., x_m(ξ)]^T, the mean vector of x(ξ), denoted μ_x, is defined as follows:

$$\mu_x \overset{\mathrm{def}}{=} \mathrm{E}\{x(\xi)\} = \begin{bmatrix}\mathrm{E}\{x_1(\xi)\} \\ \vdots \\ \mathrm{E}\{x_m(\xi)\}\end{bmatrix} = \begin{bmatrix}\mu_1 \\ \vdots \\ \mu_m\end{bmatrix}, \qquad (1.5.1)$$

where E{x_i(ξ)} = μ_i represents the mean of the random variable x_i(ξ).

The autocorrelation matrix of the random vector x(ξ) is defined by

$$R_x \overset{\mathrm{def}}{=} \mathrm{E}\{x(\xi)\,x^H(\xi)\} = \begin{bmatrix} r_{11} & \cdots & r_{1m} \\ \vdots & \ddots & \vdots \\ r_{m1} & \cdots & r_{mm}\end{bmatrix}, \qquad (1.5.2)$$

where r_{ii} (i = 1, ..., m) denotes the autocorrelation function of the random variable x_i(ξ):

$$r_{ii} \overset{\mathrm{def}}{=} \mathrm{E}\{x_i(\xi)\,x_i^*(\xi)\} = \mathrm{E}\{|x_i(\xi)|^2\}, \qquad i = 1, \ldots, m, \qquad (1.5.3)$$

whereas r_{ij} represents the cross-correlation function of x_i(ξ) and x_j(ξ):

$$r_{ij} \overset{\mathrm{def}}{=} \mathrm{E}\{x_i(\xi)\,x_j^*(\xi)\}, \qquad i, j = 1, \ldots, m,\; i \neq j. \qquad (1.5.4)$$

Clearly, the autocorrelation matrix is a complex-conjugate-symmetric matrix (i.e., a Hermitian matrix).

The autocovariance matrix of the random vector x(ξ), denoted C_x, is defined as follows:

$$C_x \overset{\mathrm{def}}{=} \mathrm{Cov}(x, x) = \mathrm{E}\{(x(\xi) - \mu_x)(x(\xi) - \mu_x)^H\} = \begin{bmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & \ddots & \vdots \\ c_{m1} & \cdots & c_{mm}\end{bmatrix}, \qquad (1.5.5)$$

where the diagonal entries

$$c_{ii} \overset{\mathrm{def}}{=} \mathrm{E}\{|x_i(\xi) - \mu_i|^2\}, \qquad i = 1, \ldots, m, \qquad (1.5.6)$$

represent the variance σ_i² of the random variable x_i(ξ), i.e., c_{ii} = σ_i², whereas the other entries

$$c_{ij} \overset{\mathrm{def}}{=} \mathrm{E}\{[x_i(\xi) - \mu_i][x_j(\xi) - \mu_j]^*\} = \mathrm{E}\{x_i(\xi)\,x_j^*(\xi)\} - \mu_i\mu_j^* = c_{ji}^*, \qquad (1.5.7)$$

express the covariance of the random variables x_i(ξ) and x_j(ξ). The autocovariance matrix is also Hermitian.


The relationship between the autocorrelation and autocovariance matrices is given by

$$C_x = R_x - \mu_x\mu_x^H. \qquad (1.5.8)$$

By generalizing the autocorrelation and autocovariance matrices, one obtains the cross-correlation matrix of the random vectors x(ξ) and y(ξ),

$$R_{xy} \overset{\mathrm{def}}{=} \mathrm{E}\{x(\xi)\,y^H(\xi)\} = \begin{bmatrix} r_{x_1,y_1} & \cdots & r_{x_1,y_m} \\ \vdots & \ddots & \vdots \\ r_{x_m,y_1} & \cdots & r_{x_m,y_m}\end{bmatrix}, \qquad (1.5.9)$$

and the cross-covariance matrix

$$C_{xy} \overset{\mathrm{def}}{=} \mathrm{E}\{[x(\xi) - \mu_x][y(\xi) - \mu_y]^H\} = \begin{bmatrix} c_{x_1,y_1} & \cdots & c_{x_1,y_m} \\ \vdots & \ddots & \vdots \\ c_{x_m,y_1} & \cdots & c_{x_m,y_m}\end{bmatrix}, \qquad (1.5.10)$$

where r_{x_i,y_j} = E{x_i(ξ) y_j*(ξ)} is the cross-correlation of the random variables x_i(ξ) and y_j(ξ), and c_{x_i,y_j} = E{[x_i(ξ) − μ_{x_i}][y_j(ξ) − μ_{y_j}]*} is the cross-covariance of x_i(ξ) and y_j(ξ). It is easily seen that there exists the following relationship between the cross-covariance and cross-correlation matrices:

$$C_{xy} = R_{xy} - \mu_x\mu_y^H. \qquad (1.5.11)$$

In real-world applications, a data vector x = [x(0), x(1), ..., x(N − 1)]^T with nonzero mean μ_x usually needs to undergo a zero-mean normalization:

$$x \leftarrow x = [x(0) - \mu_x,\, x(1) - \mu_x,\, \ldots,\, x(N-1) - \mu_x]^T, \qquad \mu_x = \frac{1}{N}\sum_{n=0}^{N-1}x(n).$$

After zero-mean normalization, the correlation matrices and covariance matrices are equal, i.e., R_x = C_x and R_{xy} = C_{xy}. Some properties of these matrices are as follows; a small numerical check appears after this list.

1. The autocorrelation matrix is Hermitian, i.e., R_x^H = R_x.
2. The autocorrelation matrix of the linear combination vector y = Ax + b satisfies R_y = A R_x A^H.
3. The cross-correlation matrix is not Hermitian but satisfies R_{xy}^H = R_{yx}.
4. R_{(x_1+x_2)y} = R_{x_1 y} + R_{x_2 y}.
5. If x and y have the same dimension, then R_{x+y} = R_x + R_{xy} + R_{yx} + R_y.
6. R_{Ax,By} = A R_{xy} B^H.
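A minimal NumPy sketch of these second-order statistics, estimated from samples of a made-up complex random vector (all numerical choices are illustrative assumptions), with a check of the relationship (1.5.8):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 3, 100_000                      # vector dimension and number of samples

# N realizations of a complex random vector with a known mean vector.
mu = np.array([1.0, -0.5, 2.0]) + 0j
X = mu + (rng.standard_normal((N, m)) + 1j * rng.standard_normal((N, m)))

mu_hat = X.mean(axis=0)                # sample mean vector
R_hat = X.T @ X.conj() / N             # sample autocorrelation  E{x x^H}
Xc = X - mu_hat
C_hat = Xc.T @ Xc.conj() / N           # sample autocovariance matrix

# Relationship (1.5.8): C_x = R_x - mu_x mu_x^H (exact for sample estimates).
print(np.allclose(C_hat, R_hat - np.outer(mu_hat, mu_hat.conj())))   # True
```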


The cross-correlation function describes the degree of correlation of two random variables x_i(ξ) and x_j(ξ). Generally speaking, the larger the cross-correlation function is, the greater is the degree of correlation of the two random variables. The degree of correlation of two random variables x(ξ) and y(ξ) can be measured by their correlation coefficient,

$$\rho_{xy} \overset{\mathrm{def}}{=} \frac{c_{xy}}{\sigma_x\sigma_y} = \frac{\mathrm{E}\{(x(\xi) - \bar{x})(y(\xi) - \bar{y})^*\}}{\sqrt{\mathrm{E}\{|x(\xi) - \bar{x}|^2\}\,\mathrm{E}\{|y(\xi) - \bar{y}|^2\}}}. \qquad (1.5.12)$$

Here c_{xy} = E{(x(ξ) − x̄)(y(ξ) − ȳ)*} is the cross-covariance of the random variables x(ξ) and y(ξ), while σ_x² and σ_y² are respectively the variances of x(ξ) and y(ξ). Applying the Cauchy–Schwartz inequality to Equation (1.5.12), we have

$$0 \le |\rho_{xy}| \le 1. \qquad (1.5.13)$$

The correlation coefficient ρ_{xy} measures the degree of similarity of two random variables x(ξ) and y(ξ). The closer ρ_{xy} is to zero, the weaker the degree of similarity of the random variables x(ξ) and y(ξ); the closer |ρ_{xy}| is to 1, the more similar x(ξ) and y(ξ) are. In particular, the two extreme values 0 and 1 of the correlation coefficient have interesting physical meanings. The case ρ_{xy} = 0 means that the cross-covariance c_{xy} = 0, which implies that there are no correlated components between the random variables x(ξ) and y(ξ). Thus, if ρ_{xy} = 0, the random variables x(ξ) and y(ξ) are said to be uncorrelated. Since this uncorrelation is defined in a statistical sense, it is usually referred to as statistical uncorrelation. It is easy to verify that if x(ξ) = c y(ξ), where c is a complex number, then |ρ_{xy}| = 1. Up to a fixed amplitude scaling factor |c| and a phase φ(c), the random variables x(ξ) and y(ξ) are the same, so that x(ξ) = c y(ξ) = |c| e^{jφ(c)} y(ξ). Such a pair of random variables is said to be completely correlated or coherent.

DEFINITION 1.27 Two random vectors x(ξ) = [x₁(ξ), ..., x_m(ξ)]^T and y(ξ) = [y₁(ξ), ..., y_n(ξ)]^T are said to be statistically uncorrelated if their cross-covariance matrix C_{xy} = O_{m×n} or, equivalently, ρ_{x_i,y_j} = 0, ∀ i, j.

The random variables x(ξ) and y(ξ) are orthogonal if their cross-correlation is equal to zero, namely

$$r_{xy} = \mathrm{E}\{x(\xi)\,y^*(\xi)\} = 0. \qquad (1.5.14)$$

Similarly, two random vectors x(ξ) = [x1 (ξ), . . . , xm (ξ)]T and y(ξ) = [y1 (ξ), . . . , yn (ξ)]T are said to be orthogonal if any entry xi (ξ) of x(ξ) is orthogonal to any entry yj (ξ) of y(ξ), i.e., rxi ,yj = E{xi (ξ)yj (ξ)} = 0, i = 1, . . . , m, j = 1, . . . , n. Clearly, this implies that the cross-correlation matrix of the two random vectors is equal to the zero matrix, i.e., Rxy = Om×n . DEFINITION 1.28 The m × 1 random vector x(ξ) is said to be orthogonal to the n × 1 random vector y(ξ), if their cross-correlation matrix Rxy = Om×n .


Note that for the zero-mean normalized m × 1 random vector x(ξ) and n × 1 random vector y(ξ), their statistical uncorrelation and orthogonality are equivalent, as their cross-covariance and cross-correlation matrices are equal, i.e., Cxy = Rxy .
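As a small illustration of the correlation coefficient (1.5.12), the following sketch estimates ρ from samples of two artificially correlated real random variables; the coefficients 0.8 and 0.6 are arbitrary choices for this example:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
x = rng.standard_normal(N)
y = 0.8 * x + 0.6 * rng.standard_normal(N)   # correlated with x by construction

c_xy = np.mean((x - x.mean()) * (y - y.mean()))   # sample cross-covariance
rho = c_xy / (x.std() * y.std())                  # sample correlation coefficient
print(rho)                                        # close to 0.8; always |rho| <= 1
```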

1.5.2 Gaussian Random Vectors

DEFINITION 1.29 If each of its entries x_i(ξ), i = 1, ..., m, is a Gaussian random variable then the random vector x(ξ) = [x₁(ξ), ..., x_m(ξ)]^T is called a Gaussian (normal) random vector.

As stated below, the representations of the probability density functions of real and complex Gaussian random vectors are slightly different.

Let x ∼ N(x̄, Γ_x) denote a real Gaussian or normal random vector with mean vector x̄ = [x̄₁, ..., x̄_m]^T and covariance matrix Γ_x = E{(x − x̄)(x − x̄)^T}. If each entry of the Gaussian random vector is independent identically distributed (iid) then its covariance matrix Γ_x = E{(x − x̄)(x − x̄)^T} = Diag(σ₁², ..., σ_m²), where σ_i² = E{(x_i − x̄_i)²} is the variance of the Gaussian random variable x_i. Under the condition that all entries are statistically independent of each other, the probability density function of a Gaussian random vector x ∼ N(x̄, Γ_x) is the joint probability density function of its m random variables, i.e.,

$$f(x) = f(x_1, \ldots, x_m) = f(x_1)\cdots f(x_m) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\left(-\frac{(x_1 - \bar{x}_1)^2}{2\sigma_1^2}\right)\cdots\frac{1}{\sqrt{2\pi\sigma_m^2}}\exp\left(-\frac{(x_m - \bar{x}_m)^2}{2\sigma_m^2}\right)$$
$$\phantom{f(x)} = \frac{1}{(2\pi)^{m/2}\sigma_1\cdots\sigma_m}\exp\left(-\frac{(x_1 - \bar{x}_1)^2}{2\sigma_1^2} - \cdots - \frac{(x_m - \bar{x}_m)^2}{2\sigma_m^2}\right)$$

or

$$f(x) = \frac{1}{(2\pi)^{m/2}|\Gamma_x|^{1/2}}\exp\left(-\frac{1}{2}(x - \bar{x})^T\Gamma_x^{-1}(x - \bar{x})\right). \qquad (1.5.15)$$

If the entries are not statistically independent of each other, then the probability density function of the Gaussian random vector x ∼ N(x̄, Γ_x) is still given by Equation (1.5.15), but the exponential term becomes [372], [397]

$$(x - \bar{x})^T\Gamma_x^{-1}(x - \bar{x}) = \sum_{i=1}^{m}\sum_{j=1}^{m}[\Gamma_x^{-1}]_{i,j}(x_i - \mu_i)(x_j - \mu_j), \qquad (1.5.16)$$

where [Γ_x^{-1}]_{ij} represents the (i, j)th entry of the inverse matrix Γ_x^{-1} and μ_i = E{x_i} is the mean of the random variable x_i. The characteristic function of a real Gaussian random vector is given by

$$\Phi_x(\omega_1, \ldots, \omega_m) = \exp\left(\mathrm{j}\,\omega^T\mu_x - \frac{1}{2}\omega^T\Gamma_x\omega\right), \qquad (1.5.17)$$


where ω = [ω₁, ..., ω_m]^T.

If x_i ∼ CN(μ_i, σ_i²), then x = [x₁, ..., x_m]^T is called a complex Gaussian random vector, denoted x ∼ CN(μ_x, Γ_x), where μ_x = [μ₁, ..., μ_m]^T and Γ_x are respectively the mean vector and the covariance matrix of the random vector x. If x_i = u_i + j v_i and the random vectors [u₁, v₁]^T, ..., [u_m, v_m]^T are statistically independent of each other then the probability density function of the complex Gaussian random vector x is given by [397, p. 35-5]

$$f(x) = \prod_{i=1}^{m}f(x_i) = \left(\pi^m\prod_{i=1}^{m}\sigma_i^2\right)^{-1}\exp\left(-\sum_{i=1}^{m}\frac{|x_i - \mu_i|^2}{\sigma_i^2}\right) = \frac{1}{\pi^m|\Gamma_x|}\exp\left\{-(x - \mu_x)^H\Gamma_x^{-1}(x - \mu_x)\right\}, \qquad (1.5.18)$$

where Γ_x = Diag(σ₁², ..., σ_m²). The characteristic function of the complex Gaussian random vector x is determined by

$$\Phi_x(\omega) = \exp\left(\mathrm{j}\,\mathrm{Re}(\omega^H\mu_x) - \frac{1}{4}\omega^H\Gamma_x\omega\right). \qquad (1.5.19)$$

A Gaussian random vector x has the following important properties.

(1) The probability density function of x is completely described by its mean vector and covariance matrix.
(2) If two Gaussian random vectors x and y are statistically uncorrelated then they are also statistically independent.
(3) Given a Gaussian random vector x with mean vector μ_x and covariance matrix Γ_x, the random vector y obtained by the linear transformation y(ξ) = A x(ξ) is also a Gaussian random vector, and its probability density function is given by

$$f(y) = \frac{1}{(2\pi)^{m/2}|\Gamma_y|^{1/2}}\exp\left(-\frac{1}{2}(y - \mu_y)^T\Gamma_y^{-1}(y - \mu_y)\right) \qquad (1.5.20)$$

for real Gaussian random vectors, and

$$f(y) = \frac{1}{\pi^m|\Gamma_y|}\exp\left\{-(y - \mu_y)^H\Gamma_y^{-1}(y - \mu_y)\right\} \qquad (1.5.21)$$

for complex Gaussian random vectors. In array processing, wireless communications, multiple-channel signal processing etc., one normally uses multiple sensors or array elements to receive multipath signals. In most cases it can be assumed that the additive noise in each sensor is a white Gaussian noise and that these white Gaussian noises are statistically uncorrelated.


EXAMPLE 1.9 Consider a real Gaussian noise vector x(t) = [x₁(t), ..., x_m(t)]^T whose entries are all real Gaussian noise processes that are statistically uncorrelated. If these white Gaussian noises have the same variance σ² then

$$c_{x_i,x_j} = r_{x_i,x_j} = \begin{cases}\sigma^2, & i = j,\\ 0, & i \neq j,\end{cases} \qquad (1.5.22)$$

and thus the autocovariance matrix of x(t) is

$$C_x = R_x = \mathrm{E}\{x(t)\,x^T(t)\} = \begin{bmatrix} r_{x_1,x_1} & \cdots & r_{x_1,x_m} \\ \vdots & \ddots & \vdots \\ r_{x_m,x_1} & \cdots & r_{x_m,x_m}\end{bmatrix} = \sigma^2 I.$$

Hence the statistical expression of a real white Gaussian noise vector is given by

$$\mathrm{E}\{x(t)\} = 0 \quad\text{and}\quad \mathrm{E}\{x(t)\,x^T(t)\} = \sigma^2 I. \qquad (1.5.23)$$

EXAMPLE 1.10 Consider the complex Gaussian random vector x(t) = [x₁(t), ..., x_m(t)]^T whose components are complex white Gaussian noises and are statistically uncorrelated. If x_i(t), i = 1, ..., m, have zero mean and the same variance σ², then the real part x_{R,i}(t) and the imaginary part x_{I,i}(t) are two real white Gaussian noises that are statistically independent and have the same variance. This implies that

$$\mathrm{E}\{x_{R,i}(t)\} = 0, \qquad \mathrm{E}\{x_{I,i}(t)\} = 0,$$
$$\mathrm{E}\{x_{R,i}^2(t)\} = \mathrm{E}\{x_{I,i}^2(t)\} = \tfrac{1}{2}\sigma^2,$$
$$\mathrm{E}\{x_{R,i}(t)\,x_{I,i}(t)\} = 0,$$
$$\mathrm{E}\{x_i(t)\,x_i^*(t)\} = \mathrm{E}\{x_{R,i}^2(t)\} + \mathrm{E}\{x_{I,i}^2(t)\} = \sigma^2.$$

From the above conditions we know that

$$\mathrm{E}\{x_i^2(t)\} = \mathrm{E}\{(x_{R,i}(t) + \mathrm{j}\,x_{I,i}(t))^2\} = \mathrm{E}\{x_{R,i}^2(t)\} - \mathrm{E}\{x_{I,i}^2(t)\} + \mathrm{j}\,2\,\mathrm{E}\{x_{R,i}(t)\,x_{I,i}(t)\} = \tfrac{1}{2}\sigma^2 - \tfrac{1}{2}\sigma^2 + 0 = 0.$$

Since x₁(t), ..., x_m(t) are statistically uncorrelated, we have

$$\mathrm{E}\{x_i(t)\,x_k(t)\} = 0, \qquad \mathrm{E}\{x_i(t)\,x_k^*(t)\} = 0, \qquad i \neq k.$$

Summarizing the above conditions, we conclude that the statistical expression of the complex white Gaussian noise vector x(t) is given by

$$\mathrm{E}\{x(t)\} = 0, \qquad (1.5.24)$$
$$\mathrm{E}\{x(t)\,x^H(t)\} = \sigma^2 I, \qquad (1.5.25)$$
$$\mathrm{E}\{x(t)\,x^T(t)\} = O. \qquad (1.5.26)$$


Note that there is a difference between the statistical representations of the real and complex white Gaussian noise vectors. We will adopt the symbols x(t) ∼ N (0, σ 2 I) and x(t) ∼ CN (0, σ 2 I) to represent respectively the real and complex zero-mean white Gaussian noise vectors x(t).
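The following sketch generates samples of a complex white Gaussian noise vector and checks (1.5.25) and (1.5.26) empirically; the dimension, sample size and variance are arbitrary choices for this illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
m, N, sigma2 = 4, 200_000, 2.0

# Complex white Gaussian noise: real and imaginary parts are independent,
# each with variance sigma^2 / 2, so that E{|x_i|^2} = sigma^2.
X = np.sqrt(sigma2 / 2) * (rng.standard_normal((N, m))
                           + 1j * rng.standard_normal((N, m)))

R = X.T @ X.conj() / N     # should approach sigma^2 * I          (1.5.25)
P = X.T @ X / N            # pseudo-correlation, should approach O (1.5.26)
print(np.allclose(R, sigma2 * np.eye(m), atol=0.05))   # True
print(np.allclose(P, np.zeros((m, m)), atol=0.05))     # True
```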

1.6 Performance Indexes of Matrices

An m × n matrix is a multivariate representation with mn components. In mathematics, one often needs such a multivariate representation to be described by a scalar. The performance indexes of a matrix are mathematical tools of this kind. In previous sections we have discussed two such performance indexes of matrices: the inner product and the norm. In this section, we present several other important performance indexes: the quadratic form, determinant, eigenvalues, trace and rank of a matrix.

1.6.1 Quadratic Forms

The quadratic form of an n × n matrix A is defined as x^H A x, where x may be any n × 1 nonzero vector. For example,

$$x^T A x = [x_1, x_2, x_3]\begin{bmatrix} 1 & 4 & 2 \\ -1 & 7 & 5 \\ -1 & 6 & 3\end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3\end{bmatrix} = x_1^2 + 7x_2^2 + 3x_3^2 + 3x_1x_2 + x_1x_3 + 11x_2x_3.$$

If x = [x₁, ..., x_n]^T and a_{ij} is the (i, j)th entry of the n × n matrix A, the quadratic form of A can be expressed as

$$x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_ix_j = \sum_{i=1}^{n} a_{ii}x_i^2 + \sum_{i=1}^{n}\sum_{\substack{j=1\\ j\neq i}}^{n} a_{ij}x_ix_j = \sum_{i=1}^{n} a_{ii}x_i^2 + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}(a_{ij} + a_{ji})x_ix_j.$$

From this formula, it is easy to see that the matrices

$$A = \begin{bmatrix} 1 & 4 & 2 \\ -1 & 7 & 5 \\ -1 & 6 & 3\end{bmatrix}, \quad B = A^T = \begin{bmatrix} 1 & -1 & -1 \\ 4 & 7 & 6 \\ 2 & 5 & 3\end{bmatrix}, \quad C = \begin{bmatrix} 1.0 & 1.5 & 0.5 \\ 1.5 & 7.0 & 5.5 \\ 0.5 & 5.5 & 3.0\end{bmatrix}, \quad D = \begin{bmatrix} 1 & 114 & 52 \\ -111 & 7 & 2 \\ -51 & 9 & 3\end{bmatrix}, \quad \ldots$$


have the same quadratic form, i.e.,

$$x^T A x = x^T B x = x^T C x = x^T D x = x_1^2 + 7x_2^2 + 3x_3^2 + 3x_1x_2 + x_1x_3 + 11x_2x_3.$$

That is to say, for any quadratic form function

$$f(x_1, \ldots, x_n) = \sum_{i=1}^{n}\alpha_{ii}x_i^2 + \sum_{i=1}^{n}\sum_{\substack{j=1\\ j\neq i}}^{n}\alpha_{ij}x_ix_j,$$

there exist many different matrices A such that their quadratic forms x^T A x = f(x₁, ..., x_n) are the same. However, there is only one symmetric matrix A^T = A satisfying x^T A x = f(x₁, ..., x_n), whose entries are a_{ij} = a_{ji} = ½(α_{ij} + α_{ji}), where i = 1, ..., n, j = 1, ..., n. Hence, in order to ensure the uniqueness of the definition, it is necessary to assume A to be a real symmetric matrix or a complex Hermitian matrix when discussing the quadratic form of an n × n matrix A. This assumption ensures that any quadratic form function is real-valued, since

$$(x^H A x)^* = (x^H A x)^H = x^H A^H x = x^H A x$$

holds for any Hermitian matrix A and any nonzero vector x.

One of the basic advantages of a real-valued function is its suitability for comparison with a zero value. If the quadratic form x^H A x > 0 is positive definite then the Hermitian matrix A is also positive definite. Similarly, one can define the positive semi-definiteness, negative definiteness and negative semi-definiteness of a Hermitian matrix.

DEFINITION 1.30 A Hermitian matrix A is said to be:

(1) positive definite, denoted A ≻ 0, if the quadratic form x^H A x > 0, ∀ x ≠ 0;
(2) positive semi-definite, denoted A ⪰ 0, if the quadratic form x^H A x ≥ 0, ∀ x ≠ 0;
(3) negative definite, denoted A ≺ 0, if the quadratic form x^H A x < 0, ∀ x ≠ 0;
(4) negative semi-definite, denoted A ⪯ 0, if the quadratic form x^H A x ≤ 0, ∀ x ≠ 0;
(5) indefinite if x^H A x > 0 for some nonzero vectors x and x^H A x < 0 for other nonzero vectors x.

For example, the real symmetric matrix

$$R = \begin{bmatrix} 3 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 3\end{bmatrix}$$

is positive definite, as the quadratic form x^H R x = 2x₁² + x₂² + 2x₃² + (x₁ − x₂)² + (x₂ − x₃)² > 0 unless x₁ = x₂ = x₃ = 0.

One sentence summary: As a performance index, the quadratic form of a Hermitian matrix describes its positive definiteness.
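A short NumPy sketch checking the positive definiteness of the matrix R above, both through its eigenvalues and by sampling the quadratic form (the random test vectors are arbitrary):

```python
import numpy as np

R = np.array([[3, -1, 0], [-1, 3, -1], [0, -1, 3]], dtype=float)

# For a Hermitian matrix, positive definiteness means all eigenvalues > 0.
print(np.linalg.eigvalsh(R))          # all positive

# Spot-check the quadratic form x^T R x > 0 for a few random nonzero x.
rng = np.random.default_rng(3)
for _ in range(5):
    x = rng.standard_normal(3)
    print(x @ R @ x > 0)              # True
```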


1.6.2 Determinants

The reader will recall that the determinant of a 2 × 2 matrix [a_{ij}] is given by

$$\det[a_{ij}] = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22}\end{vmatrix} \overset{\mathrm{def}}{=} a_{11}a_{22} - a_{12}a_{21}.$$

The determinant of an n × n matrix A, denoted det(A) or |A|, is written as follows:

$$\det(A) = |A| = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn}\end{vmatrix}. \qquad (1.6.1)$$

After removing the ith row and the jth column from the matrix A, the determinant of the remaining matrix, taken with the sign (−1)^{i+j}, is known as the cofactor A_{ij} of the entry a_{ij}. In particular, when j = i, A_{ii} is known as the principal minor of A. Letting A_{ij} denote the (n − 1) × (n − 1) submatrix obtained by removing the ith row and the jth column from the n × n matrix A, the cofactor A_{ij} is related to the determinant of the submatrix A_{ij} as follows:

$$A_{ij} = (-1)^{i+j}\det(A_{ij}). \qquad (1.6.2)$$

The determinant of an n × n matrix A is equal to the sum of the products of the entries of any one of its rows (say the ith row) or columns (say the jth column) with their corresponding cofactors, namely

$$\det(A) = a_{i1}A_{i1} + \cdots + a_{in}A_{in} = \sum_{j=1}^{n} a_{ij}(-1)^{i+j}\det(A_{ij}), \qquad (1.6.3)$$
$$\det(A) = a_{1j}A_{1j} + \cdots + a_{nj}A_{nj} = \sum_{i=1}^{n} a_{ij}(-1)^{i+j}\det(A_{ij}). \qquad (1.6.4)$$

Hence the determinant of A can be calculated recursively: an nth-order determinant can be computed from (n − 1)th-order determinants, while each (n − 1)th-order determinant can be calculated from (n − 2)th-order determinants, and so forth. For a 3 × 3 matrix A, its determinant is recursively given by

$$\det(A) = \det\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33}\end{bmatrix} = a_{11}A_{11} + a_{12}A_{12} + a_{13}A_{13}$$
$$= a_{11}(-1)^{1+1}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33}\end{vmatrix} + a_{12}(-1)^{1+2}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33}\end{vmatrix} + a_{13}(-1)^{1+3}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32}\end{vmatrix}$$
$$= a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31}).$$

This is the diagonal method for a third-order determinant.
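The recursive cofactor expansion (1.6.3) can be written directly as a short function; the sketch below is a didactic illustration (far slower than library routines) and the test matrix is the one used in Section 1.6.1:

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row, Eq. (1.6.3)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)  # drop row 0, col j
        total += A[0, j] * (-1) ** j * det_cofactor(minor)
    return total

A = np.array([[1, 4, 2], [-1, 7, 5], [-1, 6, 3]])
print(det_cofactor(A), np.linalg.det(A))   # the two values agree
```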


DEFINITION 1.31 A matrix with nonzero determinant is known as a nonsingular matrix.

1. Determinant Equalities [307]

1. If two rows (or columns) of a matrix A are exchanged then the absolute value of det(A) remains unchanged, but its sign is reversed.
2. If some row (or column) of a matrix A is a linear combination of other rows (or columns), then det(A) = 0. In particular, if some row (or column) is proportional or equal to another row (or column), or there is a zero row (or column), then det(A) = 0.
3. The determinant of an identity matrix is equal to 1, i.e., det(I) = 1.
4. Any square matrix A and its transpose A^T have the same determinant, i.e., det(A) = det(A^T); however, det(A^H) = (det(A))^*.
5. The determinant of a Hermitian matrix is real-valued, since

$$\det(A) = \det(A^H) = (\det(A))^*. \qquad (1.6.5)$$

6. The determinant of the product of two square matrices is equal to the product of their determinants, i.e.,

$$\det(AB) = \det(A)\det(B), \qquad A, B \in \mathbb{C}^{n\times n}. \qquad (1.6.6)$$

7. For any constant c and any n × n matrix A, det(cA) = c^n det(A).
8. If A is nonsingular then det(A^{-1}) = 1/det(A).
9. For matrices A_{m×m}, B_{m×n}, C_{n×m}, D_{n×n}, the determinant of the block matrix is

$$\det\begin{bmatrix} A & B \\ C & D\end{bmatrix} = \det(A)\det(D - CA^{-1}B) \qquad (1.6.7)$$

for A nonsingular, and

$$\det\begin{bmatrix} A & B \\ C & D\end{bmatrix} = \det(D)\det(A - BD^{-1}C) \qquad (1.6.8)$$

for D nonsingular.

10. The determinant of a triangular (upper or lower triangular) matrix A is equal to the product of its main diagonal entries:

$$\det(A) = \prod_{i=1}^{n} a_{ii}.$$

The determinant of a diagonal matrix A = Diag(a₁₁, ..., a_{nn}) is also equal to the product of its diagonal entries.


Here we give a proof of Equation (1.6.7):

$$\det\begin{bmatrix} A & B \\ C & D\end{bmatrix} = \det\left(\begin{bmatrix} A & O \\ C & D - CA^{-1}B\end{bmatrix}\begin{bmatrix} I & A^{-1}B \\ O & I\end{bmatrix}\right) = \det(A)\det(D - CA^{-1}B).$$

We can prove Equation (1.6.8) in a similar way.

2. Determinant Inequalities [307]

1. Cauchy–Schwartz inequality: If A, B are m × n matrices, then |det(A^H B)|² ≤ det(A^H A) det(B^H B).
2. Hadamard inequality: For an m × m matrix A, one has

$$\det(A) \le \prod_{i=1}^{m}\left(\sum_{j=1}^{m}|a_{ij}|^2\right)^{1/2}.$$

3. Fisher inequality: For A_{m×m}, B_{m×n}, C_{n×n}, one has

$$\det\begin{bmatrix} A & B \\ B^H & C\end{bmatrix} \le \det(A)\det(C).$$

4. Minkowski inequality: If A_{m×m} ≠ O_{m×m} and B_{m×m} ≠ O_{m×m} are positive semi-definite then

$$\sqrt[m]{\det(A + B)} \ge \sqrt[m]{\det(A)} + \sqrt[m]{\det(B)}.$$

5. The determinant of a positive definite matrix A is larger than 0, i.e., det(A) > 0.
6. The determinant of a positive semi-definite matrix A is larger than or equal to 0, i.e., det(A) ≥ 0.
7. If the m × m matrix A is positive semi-definite, then (det(A))^{1/m} ≤ m^{-1} tr(A).
8. If the matrices A_{m×m} and B_{m×m} are positive semi-definite then det(A + B) ≥ det(A) + det(B).
9. If A_{m×m} is positive definite and B_{m×m} is positive semi-definite then det(A + B) ≥ det(A).
10. If A_{m×m} is positive definite and B_{m×m} is negative semi-definite then det(A + B) ≤ det(A).

One sentence summary: As a performance index, the value of the determinant of a matrix determines whether it is singular.


1.6.3 Matrix Eigenvalues

Consider the output of a linear transformation L whose input is an n × 1 nonzero vector u. If the output differs from the input only by a scale factor λ, i.e.,

$$L u = \lambda u, \qquad u \neq 0, \qquad (1.6.9)$$

then the scalar λ and the vector u are known as the eigenvalue and the corresponding eigenvector of the linear transformation L. Since Lu = λu implies that the input vector keeps its "direction" unchanged, the vector u must depict an inherent feature of the linear transformation L. This is the reason why the vector u is called an eigenvector of the linear transformation. In this sense, the eigenvalue λ can be regarded as the "gain" of the linear transformation L when u is the input.

When the linear transformation L takes the form of an n × n matrix A, the expression (1.6.9) can be extended to the definition of the eigenvalues and eigenvectors of the matrix A: if the linear algebraic equation

$$A u = \lambda u \qquad (1.6.10)$$

has a nonzero n × 1 solution vector u, then the scalar λ is called an eigenvalue of the matrix A, and u is its eigenvector corresponding to λ.

The matrix equation (1.6.10) can be written equivalently as

$$(A - \lambda I)u = 0. \qquad (1.6.11)$$

Since the above equation must hold for a nonzero vector u, the condition for it to have a nonzero solution is that the determinant of the matrix A − λI is equal to zero, i.e.,

$$\det(A - \lambda I) = 0. \qquad (1.6.12)$$

This equation is known as the characteristic equation of the matrix A. The characteristic equation (1.6.12) reflects the following facts.

• If (1.6.12) holds for λ = 0 then det(A) = 0. This implies that as long as the matrix A has a zero eigenvalue, this matrix must be a singular matrix.
• All the eigenvalues of a zero matrix are zero, and for any singular matrix there exists at least one zero eigenvalue.

Clearly, if the same scalar x ≠ 0 that is not an eigenvalue of A is subtracted from all n diagonal entries of an n × n singular matrix A, then the matrix A − xI must be nonsingular, since |A − xI| ≠ 0.

Let eig(A) represent the eigenvalues of the matrix A. The basic properties of eigenvalues are listed below:

1. eig(AB) = eig(BA).
2. An m × n matrix A has at most min{m, n} different eigenvalues.
3. If rank(A) = r, then the matrix A has at most r different nonzero eigenvalues.
4. The eigenvalues of the inverse matrix satisfy eig(A^{-1}) = 1/eig(A).


5. Let I be the identity matrix; then

$$\mathrm{eig}(I + cA) = 1 + c\,\mathrm{eig}(A), \qquad (1.6.13)$$
$$\mathrm{eig}(A - cI) = \mathrm{eig}(A) - c. \qquad (1.6.14)$$

PROPOSITION 1.3 All eigenvalues of a positive definite matrix are positive real values.

Proof Suppose that A is a positive definite matrix; then the quadratic form x^H A x > 0 holds for any nonzero vector x. If λ is any eigenvalue of A, i.e., Au = λu, then u^H A u = u^H λ u, and thus λ = u^H A u/(u^H u) must be a positive real number, as it is the ratio of two positive real numbers (u^H u > 0 for any nonzero vector u).

From λ = u^H A u/(u^H u) it is directly seen that positive definite and nonpositive definite matrices can be described by their eigenvalues as follows.

(1) Positive definite matrix: its eigenvalues are positive real numbers.
(2) Positive semi-definite matrix: its eigenvalues are nonnegative.
(3) Negative definite matrix: its eigenvalues are negative.
(4) Negative semi-definite matrix: its eigenvalues are nonpositive.
(5) Indefinite matrix: it has both positive and negative eigenvalues.

If A is a positive definite or positive semi-definite matrix then

$$\det(A) \le \prod_i a_{ii}. \qquad (1.6.15)$$

This inequality is called the Hadamard inequality [214, p. 477]. The characteristic equation (1.6.12) suggests two methods for improving the numerical stability and the accuracy of solutions of the matrix equation Ax = b, as described below. 1. Method for Improving the Numerical Stability Consider the matrix equation Ax = b, where A is usually positive definite or nonsingular. However, owing to noise or errors, A may sometimes be close to singular. We can alleviate this difficulty as follows. If λ is a small positive number then −λ cannot be an eigenvalue of A. This implies that the characteristic equation |A − xI| = |A − (−λ)I| = |A + λI| = 0 cannot hold for any λ > 0, and thus the matrix A + λI must be nonsingular. Therefore, if we solve (A + λI)x = b instead of the original matrix equation Ax = b, and λ takes a very small positive value, then one can overcome the singularity of A to improve greatly the numerical stability of solving Ax = b. This method of solving (A + λI)x = b, with λ > 0, instead of Ax = b is the well-known Tikhonov regularization method for solving nearly singular matrix equations.


2. Method for Improving the Accuracy For a matrix equation Ax = b, with the data matrix A nonsingular but containing additive interference or observation noise, if we choose a very small positive scalar λ to solve (A − λI)x = b instead of Ax = b then the influence of the noise of the data matrix A on the solution vector x will be greatly decreased. This is the basis of the well-known total least squares (TLS) method. We will present the Tikhonov method and the TLS method in detail in Chapter 6. As is seen in the Tikhonov and TLS methods, the diagonal entries of a matrix play a more important role in matrix analysis than its off-diagonal entries. One sentence summary: As a performance index, the eigenvalues of a matrix describe its singularness, positive definiteness and the special structure of its diagonal entries.
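The following is a minimal NumPy sketch of the Tikhonov idea described under method 1; the nearly singular matrix, the noise level and λ are artificial values chosen only to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(4)

# A nearly singular data matrix and a noisy right-hand side.
A = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-8]])
b = np.array([2.0, 2.0]) + 1e-6 * rng.standard_normal(2)

x_plain = np.linalg.solve(A, b)                  # noise is strongly amplified
lam = 1e-3
x_tik = np.linalg.solve(A + lam * np.eye(2), b)  # solve (A + lam*I) x = b instead
print(x_plain)   # entries become very large
print(x_tik)     # close to the well-behaved solution [1, 1]
```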

1.6.4 Matrix Trace

DEFINITION 1.32 The sum of the diagonal entries of an n × n matrix A is known as its trace, denoted tr(A):

$$\mathrm{tr}(A) = a_{11} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}. \qquad (1.6.16)$$

The following are some properties of the matrix trace.

1. Trace Equality [307]

1. If both A and B are n × n matrices then tr(A ± B) = tr(A) ± tr(B).
2. If both A and B are n × n matrices and c₁ and c₂ are constants then tr(c₁A ± c₂B) = c₁ tr(A) ± c₂ tr(B). In particular, tr(cA) = c tr(A).
3. tr(A^T) = tr(A), tr(A^*) = (tr(A))^* and tr(A^H) = (tr(A))^*.
4. If A ∈ C^{m×n}, B ∈ C^{n×m} then tr(AB) = tr(BA).
5. If A is an m × n matrix then tr(A^H A) = 0 implies that A is the m × n zero matrix.
6. x^H A x = tr(Axx^H) and y^H x = tr(xy^H).
7. The trace of an n × n matrix is equal to the sum of its eigenvalues, namely tr(A) = λ₁ + ··· + λ_n.
8. The trace of a block matrix satisfies

$$\mathrm{tr}\begin{bmatrix} A & B \\ C & D\end{bmatrix} = \mathrm{tr}(A) + \mathrm{tr}(D),$$

where A ∈ C^{m×m}, B ∈ C^{m×n}, C ∈ C^{n×m} and D ∈ C^{n×n}.


9. For any positive integer k, we have

$$\mathrm{tr}(A^k) = \sum_{i=1}^{n}\lambda_i^k. \qquad (1.6.17)$$

By the trace equality tr(UV) = tr(VU), it is easy to see that

$$\mathrm{tr}(A^H A) = \mathrm{tr}(AA^H) = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}a_{ij}^* = \sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^2. \qquad (1.6.18)$$

Moreover, if we substitute U = A, V = BC or U = AB, V = C into the trace equality tr(UV) = tr(VU), we obtain

$$\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB). \qquad (1.6.19)$$

Similarly, if we let U = A, V = BCD or U = AB, V = CD or U = ABC, V = D, respectively, we obtain

$$\mathrm{tr}(ABCD) = \mathrm{tr}(BCDA) = \mathrm{tr}(CDAB) = \mathrm{tr}(DABC). \qquad (1.6.20)$$

Moreover, if A and B are m × m matrices and B is nonsingular, then

$$\mathrm{tr}(BAB^{-1}) = \mathrm{tr}(B^{-1}AB) = \mathrm{tr}(ABB^{-1}) = \mathrm{tr}(A). \qquad (1.6.21)$$

2. Trace Inequality [307]

1. If A ∈ C^{m×n} then tr(A^H A) = tr(AA^H) ≥ 0.
2. Schur inequality: tr(A²) ≤ tr(A^T A).
3. If A, B are two m × n matrices then

$$\mathrm{tr}\big[(A^T B)^2\big] \le \mathrm{tr}(A^T A)\,\mathrm{tr}(B^T B) \quad\text{(Cauchy–Schwartz inequality)},$$
$$\mathrm{tr}\big[(A^T B)^2\big] \le \mathrm{tr}(A^T A\,B^T B),$$
$$\mathrm{tr}\big[(A^T B)^2\big] \le \mathrm{tr}(AA^T BB^T).$$

4. tr[(A + B)(A + B)^T] ≤ 2[tr(AA^T) + tr(BB^T)].
5. If A and B are two m × m symmetric matrices then tr(AB) ≤ ½ tr(A² + B²).

The Frobenius norm of an m × n matrix A can also be defined using the trace of the n × n matrix A^H A or that of the m × m matrix AA^H, as follows [311, p. 10]:

$$\|A\|_F = \sqrt{\mathrm{tr}(A^H A)} = \sqrt{\mathrm{tr}(AA^H)}. \qquad (1.6.22)$$

One sentence summary: As a performance index, the trace of a matrix reflects the sum of its eigenvalues.
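A quick NumPy check of the cyclic trace equalities and of the trace form (1.6.22) of the Frobenius norm, using arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# tr(AB) = tr(BA) and the cyclic property tr(ABC) = tr(BCA), Eq. (1.6.19).
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))

# ||A||_F^2 = tr(A^H A), Eq. (1.6.22); A is real here, so A^H = A^T.
print(np.isclose(np.linalg.norm(A, 'fro') ** 2, np.trace(A.T @ A)))
```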


1.6.5 Matrix Rank THEOREM 1.6 [433] Among a set of p-dimensional (row or column) vectors, there are at most p linearly independent (row or column) vectors. THEOREM 1.7 [433] For an m × n matrix A, the number of linearly independent rows and the number of linearly independent columns are the same. From this theorem we have the following definition of the rank of a matrix. DEFINITION 1.33 The rank of an m × n matrix A is defined as the number of its linearly independent rows or columns. It needs to be pointed out that the matrix rank gives only the number of linearly independent rows or columns; it gives no information on the locations of these independent rows or columns. The matrix equation Am×n xn×1 = bm×1 is said to be consistent, if it has at least one exact solution. A matrix equation with no exact solution is said to be inconsistent. Matrix equations can be divided into three types. (1) Well-determined equation If m = n and rank(A) = n, i.e., the matrix A is nonsingular, then the matrix equation Ax = b is said to be well-determined. (2) Under-determined equation The matrix equation Ax = b is said to be underdetermined if the number of linearly independent equations is less than the number of independent unknowns. (3) Over-determined equation The matrix equation Ax = b is said to be overdetermined if the number of linearly independent equations is larger than the number of independent unknowns. The terms “well-determined”, “under-determined” and “over-determined” have the following meanings. • Meaning of well-determined equation The number of independent equations and the number of independent unknowns are the same so that the solution of this system of equations is uniquely determined. The exact solution of a well-determined matrix equation Ax = b is given by x = A−1 b. A well-determined equation is a consistent equation. • Meaning of under-determined equation The number of independent equations is less than the number of independent unknowns, which implies that the number of equations is not enough for determining a unique solution. As a matter of fact, such a system of linear equations has an infinitely many solutions. Hence, any under-determined matrix equation is a consistent equation. • Meaning of over-determined equation Since the number of independent equations is larger than the number of independent unknowns, the number of independent equations appears surplus for determining the unique solution. An


over-determined matrix equation Ax = b has no exact solution and thus is an inconsistent equation that may in some cases have an approximate solution, for example, a least squares solution. A matrix A with rank(A) = rA has rA linearly independent column vectors. The linear combinations of the rA linearly independent column vectors constitute a vector space, called the column space or the range or the manifold of A. The column space Col(A) or the range Range(A) is rA -dimensional. Hence the rank of a matrix can be defined by using the dimension of its column space or range, as described below. DEFINITION 1.34 The dimension of the column space Col(A) or the range Range(A) of an m × n matrix A is defined as the rank of the matrix, namely rA = dim(Col(A)) = dim(Range(A)).

(1.6.23)

The following statements about the rank of the matrix A are equivalent: (1) rank(A) = k; (2) there are k and not more than k columns of A that combine to give a linearly independent set; (3) there are k and not more than k rows of A that combine to give a linearly independent set; (4) there is a k × k submatrix of A with nonzero determinant, but all the (k + 1) × (k + 1) submatrices of A have zero determinant; (5) the dimension of the column space Col(A) or the range Range(A) equals k; (6) k = n − dim[Null(A)], where Null(A) denotes the null space of the matrix A. THEOREM 1.8 [433]

The rank of the product matrix AB satisfies the inequality rank(AB) ≤ min{rank(A), rank(B)}.

(1.6.24)

LEMMA 1.1 If premultiplying an m×n matrix A by an m×m nonsingular matrix P, or postmultiplying it by an n × n nonsingular matrix Q, then the rank of A is not changed, namely rank(PAQ) = rank(A). LEMMA 1.2

rank[A, B] ≤ rank(A) + rank(B).

LEMMA 1.3

rank(A + B) ≤ rank[A, B] ≤ rank(A) + rank(B).

LEMMA 1.4 For an m × n matrix A and an n × q matrix B, the rank inequality rank(AB) ≥ rank(A) + rank(B) − n is true. 1. Properties of the Rank of a Matrix 1. The rank is a positive integer. 2. The rank is equal to or less than the number of columns or rows of the matrix.


3. If the rank of an n × n matrix A is equal to n then A is nonsingular, or we say that A is a full rank matrix.
4. If rank(A_{m×n}) < min{m, n} then A is said to be a rank-deficient matrix.
5. If rank(A_{m×n}) = m (< n) then the matrix A is a full row rank matrix.
6. If rank(A_{m×n}) = n (< m) then the matrix A is a full column rank matrix.
7. Premultiplying any matrix A by a full column rank matrix, or postmultiplying it by a full row rank matrix, leaves the rank of the matrix A unchanged.

2. Rank Equalities

(1) If A ∈ C^{m×n}, then rank(A^H) = rank(A^T) = rank(A^*) = rank(A).
(2) If A ∈ C^{m×n} and c ≠ 0, then rank(cA) = rank(A).
(3) If A ∈ C^{m×m} and C ∈ C^{n×n} are nonsingular then rank(AB) = rank(B) = rank(BC) = rank(ABC) for B ∈ C^{m×n}. That is, after premultiplying and/or postmultiplying by a nonsingular matrix, the rank of B remains unchanged.
(4) For A, B ∈ C^{m×n}, rank(A) = rank(B) if and only if there exist nonsingular matrices X ∈ C^{m×m} and Y ∈ C^{n×n} such that B = XAY.
(5) rank(AA^H) = rank(A^H A) = rank(A).
(6) If A ∈ C^{m×m} then rank(A) = m ⇔ det(A) ≠ 0 ⇔ A is nonsingular.

3. Rank Inequalities

• rank(A) ≤ min{m, n} for any m × n matrix A.
• If A, B ∈ C^{m×n} then rank(A + B) ≤ rank(A) + rank(B).
• If A ∈ C^{m×k} and B ∈ C^{k×n} then rank(A) + rank(B) − k ≤ rank(AB) ≤ min{rank(A), rank(B)}.

One sentence summary: As a performance index, the matrix rank describes the linear independence of the rows (or columns) of a matrix, which reflects the full rank or rank deficiency of the matrix.

Table 1.1 summarizes the five important performance indexes discussed above and how they describe matrix performance; a short numerical illustration follows the table.

Table 1.1 Performance indexes of matrices

  Performance index   Matrix property determined by the performance index
  quadratic form      positive definiteness and non-negative definiteness
  determinant         singularity
  eigenvalues         singularity and positive definiteness
  trace               sum of diagonal entries, sum of eigenvalues
  rank                linear independence of rows (or columns)
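The sketch below computes the rank of a deliberately rank-deficient matrix and checks the equality rank(AA^T) = rank(A); the matrix entries are illustrative only:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],      # = 2 * row 1, so the matrix is rank deficient
              [1.0, 0.0, 1.0]])

print(np.linalg.matrix_rank(A))                                       # 2
print(np.linalg.matrix_rank(A @ A.T) == np.linalg.matrix_rank(A))     # True
```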


1.7 Inverse Matrices and Pseudo-Inverse Matrices Matrix inversion is an important aspect of matrix calculus. In particular, the matrix inversion lemma is often used in signal processing, system sciences, automatic control, neural networks and so on. In this section we discuss the inverse of a full-rank square matrix and the pseudo-inverse of a non-square matrix with full row (or full column) rank. Regarding the inversion of a nonsquare or rank-deficient matrix, we will discuss this in the next section.

1.7.1 Definition and Properties of Inverse Matrices

An n × n matrix is called nonsingular if it has n linearly independent column vectors and n linearly independent row vectors. A nonsingular matrix can also be defined from the viewpoint of a linear system: a linear transformation or a square matrix A is said to be nonsingular if it produces a zero output only for a zero input; otherwise it is singular.

If a matrix is nonsingular then its inverse must exist. Conversely, a singular matrix has no inverse. The n × n matrix B such that BA = AB = I is called the inverse matrix of A, denoted B = A^{-1}. If A^{-1} exists then the matrix A is said to be nonsingular or invertible. On the nonsingularity or invertibility of an n × n matrix A, the following statements are equivalent [214]:

(1) A is nonsingular;
(2) A^{-1} exists;
(3) rank(A) = n;
(4) all rows of A are linearly independent;
(5) all columns of A are linearly independent;
(6) det(A) ≠ 0;
(7) the dimension of the range of A is n;
(8) the dimension of the null space of A is equal to zero;
(9) Ax = b is a consistent equation for every b ∈ C^n;
(10) Ax = b has a unique solution for every b;
(11) Ax = 0 has only the trivial solution x = 0.

1. Properties of the Inverse Matrix A^{-1} [25], [214]

1. A^{-1}A = AA^{-1} = I.
2. A^{-1} is unique.
3. The determinant of the inverse matrix is equal to the reciprocal of the determinant of the original matrix, i.e., |A^{-1}| = 1/|A|.
4. The inverse matrix A^{-1} is nonsingular.


5. The inverse matrix of an inverse matrix is the original matrix, i.e., (A^{-1})^{-1} = A.
6. The inverse matrix of a Hermitian matrix A = A^H satisfies (A^H)^{-1} = (A^{-1})^H = A^{-H}.
7. If A^H = A then (A^{-1})^H = A^{-1}. That is to say, the inverse matrix of any Hermitian matrix is a Hermitian matrix as well.
8. (A^*)^{-1} = (A^{-1})^*.
9. If A and B are invertible then (AB)^{-1} = B^{-1}A^{-1}.
10. If A = Diag(a₁, ..., a_m) is a diagonal matrix then its inverse matrix is A^{-1} = Diag(a₁^{-1}, ..., a_m^{-1}).

11. Let A be nonsingular. If A is an orthogonal matrix then A−1 = AT , and if A is a unitary matrix then A−1 = AH .

1.7.2 Matrix Inversion Lemma

LEMMA 1.5 Let A be an n × n invertible matrix, and let x and y be two n × 1 vectors such that (A + xy^H) is invertible; then

$$(A + xy^H)^{-1} = A^{-1} - \frac{A^{-1}x\,y^H A^{-1}}{1 + y^H A^{-1}x}. \qquad (1.7.1)$$

Lemma 1.5 is called the matrix inversion lemma, and was presented by Sherman and Morrison [438], [439] in 1949 and 1950. The matrix inversion lemma can be extended to an inversion formula for a sum of matrices:

$$(A + UBV)^{-1} = A^{-1} - A^{-1}UB(B + BVA^{-1}UB)^{-1}BVA^{-1} = A^{-1} - A^{-1}U(I + BVA^{-1}U)^{-1}BVA^{-1} \qquad (1.7.2)$$

or

$$(A - UV)^{-1} = A^{-1} + A^{-1}U(I - VA^{-1}U)^{-1}VA^{-1}. \qquad (1.7.3)$$

The above formula was obtained by Woodbury in 1950 [513] and is called the Woodbury formula. Taking U = u, B = β and V = v^H, the Woodbury formula gives the result

$$(A + \beta uv^H)^{-1} = A^{-1} - \frac{\beta\,A^{-1}u\,v^H A^{-1}}{1 + \beta\,v^H A^{-1}u}. \qquad (1.7.4)$$

In particular, if we let β = 1 then Equation (1.7.4) reduces to formula (1.7.1), the matrix inversion lemma of Sherman and Morrison. As a matter of fact, before Woodbury obtained the inversion formula (1.7.2),


Duncan [139] in 1944 and Guttman [192] in 1946 had obtained the following inversion formula:

$$(A - UD^{-1}V)^{-1} = A^{-1} + A^{-1}U(D - VA^{-1}U)^{-1}VA^{-1}. \qquad (1.7.5)$$

This formula is called the Duncan–Guttman inversion formula [391], [392].

In addition to the Woodbury formula, the inverse matrix of a sum of matrices also has the following forms [206]:

$$(A + UBV)^{-1} = A^{-1} - A^{-1}(I + UBVA^{-1})^{-1}UBVA^{-1} \qquad (1.7.6)$$
$$\phantom{(A + UBV)^{-1}} = A^{-1} - A^{-1}UB(I + VA^{-1}UB)^{-1}VA^{-1} \qquad (1.7.7)$$
$$\phantom{(A + UBV)^{-1}} = A^{-1} - A^{-1}UBV(I + A^{-1}UBV)^{-1}A^{-1} \qquad (1.7.8)$$
$$\phantom{(A + UBV)^{-1}} = A^{-1} - A^{-1}UBVA^{-1}(I + UBVA^{-1})^{-1}. \qquad (1.7.9)$$

The following are inversion formulas for block matrices. When the matrix A is invertible, one has [22]

$$\begin{bmatrix} A & U \\ V & D\end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}U(D - VA^{-1}U)^{-1}VA^{-1} & -A^{-1}U(D - VA^{-1}U)^{-1} \\ -(D - VA^{-1}U)^{-1}VA^{-1} & (D - VA^{-1}U)^{-1}\end{bmatrix}. \qquad (1.7.10)$$

If the matrices A and D are both invertible, then [216], [217]

$$\begin{bmatrix} A & U \\ V & D\end{bmatrix}^{-1} = \begin{bmatrix} (A - UD^{-1}V)^{-1} & -A^{-1}U(D - VA^{-1}U)^{-1} \\ -D^{-1}V(A - UD^{-1}V)^{-1} & (D - VA^{-1}U)^{-1}\end{bmatrix} \qquad (1.7.11)$$

or [139]

$$\begin{bmatrix} A & U \\ V & D\end{bmatrix}^{-1} = \begin{bmatrix} (A - UD^{-1}V)^{-1} & -(A - UD^{-1}V)^{-1}UD^{-1} \\ -(D - VA^{-1}U)^{-1}VA^{-1} & (D - VA^{-1}U)^{-1}\end{bmatrix}. \qquad (1.7.12)$$
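A short NumPy check of the Sherman–Morrison update (1.7.1) on a randomly generated real test matrix (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
x = rng.standard_normal((n, 1))
y = rng.standard_normal((n, 1))

A_inv = np.linalg.inv(A)

# Rank-1 update of the inverse, Eq. (1.7.1); real case, so y^H = y^T.
num = A_inv @ x @ y.T @ A_inv
den = 1.0 + (y.T @ A_inv @ x).item()
update = A_inv - num / den

print(np.allclose(update, np.linalg.inv(A + x @ y.T)))   # True
```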

1.7.3 Inversion of Hermitian Matrices

Consider the inversion of an (m + 1) × (m + 1) nonsingular Hermitian matrix R_{m+1}. First, an (m + 1) × (m + 1) Hermitian matrix can always be written in the block form

$$R_{m+1} = \begin{bmatrix} R_m & r_m \\ r_m^H & \rho_m\end{bmatrix}, \qquad (1.7.13)$$

where ρ_m is the (m + 1, m + 1)th entry of R_{m+1}, and R_m is an m × m Hermitian matrix.


Now we consider how to compute the inverse matrix R_{m+1}^{-1} using the inverse matrix R_m^{-1}. For this purpose, let

$$Q_{m+1} = \begin{bmatrix} Q_m & q_m \\ q_m^H & \alpha_m\end{bmatrix} \qquad (1.7.14)$$

be the inverse matrix of R_{m+1}. Then we have

$$R_{m+1}Q_{m+1} = \begin{bmatrix} R_m & r_m \\ r_m^H & \rho_m\end{bmatrix}\begin{bmatrix} Q_m & q_m \\ q_m^H & \alpha_m\end{bmatrix} = \begin{bmatrix} I_m & 0_m \\ 0_m^H & 1\end{bmatrix}, \qquad (1.7.15)$$

which gives the following four equations:

$$R_m Q_m + r_m q_m^H = I_m, \qquad (1.7.16)$$
$$r_m^H Q_m + \rho_m q_m^H = 0_m^H, \qquad (1.7.17)$$
$$R_m q_m + r_m\alpha_m = 0_m, \qquad (1.7.18)$$
$$r_m^H q_m + \rho_m\alpha_m = 1. \qquad (1.7.19)$$

If Rm is invertible, then from Equation (1.7.18), it follows that qm = −αm R−1 m rm .

(1.7.20)

Substitute this result into Equation (1.7.19) to yield αm =

1 . −1 ρm − rH m Rm r m

(1.7.21)

After substituting Equation (1.7.21) into Equation (1.7.20), we can obtain qm =

−R−1 m rm . H ρm − rb R−1 m rm

(1.7.22)

Then, substitute Equation (1.7.22) into Equation (1.7.16): −1 H −1 Qm = R−1 m − Rm r m q m = Rm +

−1 H R−1 m rm (Rm rm ) . −1 ρm − rH m R m rm

(1.7.23)

In order to simplify (1.7.21)–(1.7.23), put bm = [b0 , b1 , . . . , bm−1 ]T = −R−1 m rm , def

(m)

def

βm = ρm −

(m)

−1 rH m Rm r m

(m)

= ρm +

rH m bm .

(1.7.24) (1.7.25)

Then (1.7.21)–(1.7.23) can be respectively simplified to αm =

1 , βm

qm =

1 bm , βm

Qm = R−1 m +

1 bm bH m. βm

Substituting the these results into (1.7.15), it is immediately seen that     −1 1 b m bH Rm 0m bm −1 m . + Rm+1 = Qm+1 = bH 1 0H 0 βm m m

(1.7.26)

1.7 Inverse Matrices and Pseudo-Inverse Matrices

63

This formula for calculating the (m + 1) × (m + 1) inverse Rm+1 from the m × m inverse Rm is called the block inversion lemma for Hermitian matrices [352]. Given an M × M nonsingular Hermitian matrix H, using the block inversion lemma for m = 1, . . . , M − 1 we can find the inverse matrix H−1 in a recursive way.

1.7.4 Left and Right Pseudo-Inverse Matrices From a broader perspective, any n × m matrix G may be called the inverse of a given m × n matrix A, if the product of G and A is equal to the identity matrix I. According to the different possible structures of the m × n matrix A, for the matrix G to be such that the product of it and A is equal to the identity matrix there exist three possibilities as shown in the next example. EXAMPLE 1.11 Consider the inverses of the following three matrices, ⎡ ⎤ ⎡ ⎤   2 −2 −1 4 8 1 3 1 . A1 = ⎣ 1 1 −2 ⎦ , A2 = ⎣ 5 −7 ⎦ , A3 = 2 5 1 1 0 −1 −2 3 For the matrix A1 , there is a unique matrix ⎡ ⎤ −1 −2 5 G = ⎣ −1 −1 3 ⎦ −1 −2 4 such that GA1 = A1 G = I3×3 . In this case, the matrix G is actually the inverse matrix of A1 , i.e., G = A−1 1 . In the case of the matrix A2 , there exist an infinite number of 2 × 3 matrices L such that LA2 = I2×2 , for example  7    2 0 0 3 7 68 17 , ... L= , L= 0 2 5 0 2 5 For the matrix A3 , there are no 3 × 2 matrices G3 such that G3 A3 = I3×3 but there exist more than one 3 × 2 matrix R such that A3 R = I2×2 , for example ⎡ ⎤ ⎡ ⎤ 1 1 −1 1 R = ⎣ −1 0 ⎦, R = ⎣ 0 0 ⎦, ... 3 −1 2 −1 Summarizing the above discussion, we see that, in addition to the inverse matrix A satisfying AA−1 = A−1 A = I, there are two other forms of inverse matrix that satisfy only LA = I or AR = I. −1

DEFINITION 1.35 [433] The matrix L satisfying LA = I but not AL = I is called the left inverse of the matrix A. Similarly, the matrix R satisfying AR = I but not RA = I is said to be the right inverse of A.

64

Introduction to Matrix Algebra

Remark A matrix A ∈ Cm×n has a left inverse only when m ≥ n and a right inverse only when m ≤ n, respectively. As shown in Example 1.11, for a given m × n matrix A, when m > n it is possible that other n × m matrices L to satisfy LA = In , while when m < n it is possible that other n × m matrices R to satisfy AR = Im . That is to say, the left or right inverse of a given matrix A is usually not unique. Let us consider the conditions for a unique solution of the left and right inverse matrices. Let m > n and let the m × n matrix A have full column rank, i.e., rank(A) = n. In this case, the n × n matrix AH A is invertible. It is easy to verify that L = (AH A)−1 AH

(1.7.27)

satisfies LA = I but not AL = I. This type of left inverse matrix is uniquely determined, and is usually called the left pseudo-inverse matrix of A. On the other hand, if m < n and A has the full row rank, i.e., rank(A) = m, then the m × m matrix AAH is invertible. Define R = AH (AAH )−1 .

(1.7.28)

It is easy to see that R satisfies AR = I rather than RA = I. The matrix R is also uniquely determined and is usually called the right pseudo-inverse matrix of A. The left pseudo-inverse matrix is closely related to the least squares solution of overdetermined equations, while the right pseudo-inverse matrix is closely related to the least squares minimum-norm solution of under-determined equations. A detailed discussion is given in Chapter 6. The following are the dimensional recursions for the left and right pseudo-inverse matrices [542]. Consider an n × m matrix Fm (where n > m) and its left pseudo-inverse matrix −1 H F†m = (FH Fm . Let Fm = [Fm−1 , fm ], where fm is the mth column of the m Fm ) matrix Fm and rank(Fm ) = m; then a recursive computation formula for F†m is given by ⎤ ⎡ † −1 Fm−1 − F†m−1 fm eH m Δm ⎦, F†m = ⎣ (1.7.29) H −1 em Δm H −1 ; the initial recursion value where em = (In − Fm−1 F†m−1 )fm and Δ−1 m = (fm em ) † H H is F1 = f1 /(f1 f1 ). Given a matrix Fm ∈ Cn×m , where n < m, and again writing Fm = [Fm−1 , fm ], H −1 the right pseudo-inverse matrix F†m = FH has the following recursive m (Fm Fm ) formula: ⎡ † ⎤ Fm−1 − Δm F†m−1 fm cm ⎦. F†m = ⎣ (1.7.30) Δm cH m

1.8 Moore–Penrose Inverse Matrices

65

† H H Here cH m = fm (In − Fm−1 Fm−1 ) and Δm = cm fm . The initial recursion value is F†1 = f1H /(f1H f1 ).

1.8 Moore–Penrose Inverse Matrices In the above section we discussed the inverse matrix A−1 of a nonsingular square H −1 H matrix A, the left pseudo-inverse matrix A−1 A of an m × n (m > L = (A A) n) matrix A with full column rank, and the right pseudo-inverse matrix A−1 R = AH (AAH )−1 of an m × n (m < n) matrix A with full row rank. A natural question to ask is: does there exist an inverse matrix for an m × n rank-deficient matrix? If such a inverse matrix exists, what conditions should it meet?

1.8.1 Definition and Properties Consider an m × n rank-deficient matrix A, regardless of the size of m and n but with rank(A) = k < min{m, n}. The inverse of an m × n rank-deficient matrix is said to be its generalized inverse matrix and is an n × m matrix. Let A† denote the generalized inverse of the matrix A. From the matrix rank property rank(AB) ≤ min{rank(A), rank(B)}, it follows that neither AA† = Im×m nor A† A = In×n hold, because either the m × m matrix AA† or the n × n matrix A† A is rank-deficient, the maximum of their ranks is rank(A) = k and is less than min{m, n}. Since AA† = Im×m and A† A = In×n , it is necessary to consider using a product of three matrices to define the generalized inverse of a rank-deficient matrix A. For this purpose, let us consider solving the linear matrix equation Ax = y. If A† is the generalized inverse matrix of A then Ax = y ⇒ x = A† y. Substituting x = A† y into Ax = y, we have AA† y = y, and thus AA† Ax = Ax. Since this equation should hold for any nonzero vector x, the following condition should essentially be satisfied: AA† A = A.

(1.8.1)

Unfortunately, the matrix A† meeting the condition (1.8.1) is not unique. To overcome this difficulty we must add other conditions. The condition AA† A = A in (1.8.1) can only guarantee that A† is a generalized inverse matrix of A; it cannot conversely guarantee that A is also a generalized inverse matrix of A† , whereas A and A† should essentially be mutually generalized inverses. This is one of the main reasons why the definition AA† A = A does not uniquely determine A† . Now consider the solution equation of the original matrix equation Ax = y, i.e., x = A† y. Our problem is: given any y = 0, find x. Clearly, x = A† y can be written as x = A† Ax, yielding A† y = A† AA† y. Since A† y = A† AA† y should hold for

66

Introduction to Matrix Algebra

any nonzero vector y, the condition A† AA† = A†

(1.8.2)

must be satisfied as well. This shows that if A† is the generalized inverse of the rank-deficient matrix A, then the two conditions in Equations (1.8.1) and (1.8.2) must be satisfied at the same time. However, these two conditions are still not enough for a unique definition of A† . If an m × n matrix A is of full column rank or full row rank, we certainly hope that the generalized inverse matrix A† will include the left and right pseudoinverse matrices as two special cases. Although the left pseudo-inverse matrix L = (AH A)−1 AH of the m × n full column rank matrix A is such that LA = In×n rather than AL = Im×m , it is clear that AL = A(AH A)−1 AH = (AL)H is an m × m Hermitian matrix. Coincidentally, for any right pseudo-inverse matrix R = AH (AAH )−1 , the matrix product RA = AH (AAH )−1 A = (RA)H is an n×n Hermitian matrix. In other words, in order to guarantee that A† exists uniquely for any m × n matrix A, the following two conditions must be added: AA† = (AA† )H , †



H

A A = (A A) .

(1.8.3) (1.8.4)

On the basis of the conditions (1.8.1)–(1.8.4), one has the following definition of A† . DEFINITION 1.36 [384] Let A be any m × n matrix; an n × m matrix A† is said to be the Moore–Penrose inverse of A if A† meets the following four conditions (usually called the Moore–Penrose conditions): (a) AA† A = A; (b) A† AA† = A† ; (c) AA† is an m × m Hermitian matrix, i.e., AA† = (AA† )H ; (d) A† A is an n × n Hermitian matrix, i.e., A† A = (A† A)H . Remark 1 From the projection viewpoint, Moore [330] showed in 1935 that the generalized inverse matrix A† of an m × n matrix A must meet two conditions, but these conditions are not convenient for practical use. After two decades, Penrose [384] in 1955 presented the four conditions (a)–(d) stated above. In 1956, Rado [407] showed that the four conditions of Penrose are equivalent to the two conditions of Moore. Therefore the conditions (a)–(d) are called the Moore–Penrose conditions, and the generalized inverse matrix satisfying the Moore–Penrose conditions is referred to as the Moore–Penrose inverse of A. Remark 2

In particular, the Moore–Penrose condition (a) is the condition that

1.8 Moore–Penrose Inverse Matrices

67

the generalized inverse matrix A† of A must meet, while (b) is the condition that the generalized inverse matrix A = (A† )† of the matrix A† must meet. Depending on to what extent the Moore–Penrose conditions are met, the generalized inverse matrix can be classified as follows [174]. (1) The generalized inverse matrix A† satisfying all four conditions is the Moore– Penrose inverse of A. (2) The matrix G = A† satisfying conditions (a) and (b) is said to be the selfreflexive generalized inverse of A. (3) The matrix A† satisfying conditions (a), (b) and (c) is referred to as the regularized generalized inverse of A. (4) The matrix A† satisfying conditions (a), (b) and (d) is called the weak generalized inverse of A. It is easy to verify that the inverse matrices and the various generalized inverse matrices described early are special examples of the Moore–Penrose inverse matrices. • The inverse matrix A−1 of an n × n nonsingular matrix A meets all four Moore– Penrose conditions. • The left pseudo-inverse matrix (AH A)−1 AH of an m × n (m > n) matrix A meets all four Moore–Penrose conditions. • The right pseudo-inverse matrix AH (AAH )−1 of an m × n (m < n) matrix A meets all four Moore–Penrose conditions. • The left inverse matrix Ln×m such that LAm×n = In is a weak generalized inverse matrix satisfying the Moore–Penrose conditions (a), (b) and (d). • The right inverse matrix Rn×m such that Am×n R = Im is a regularized generalized inverse matrix satisfying the Moore–Penrose conditions (a), (b) and (c). The inverse matrix A−1 , the left pseudo-inverse matrix (AH A)−1 AH and the right pseudo-inverse matrix AH (AAH )−1 are uniquely determined, respectively. Similarly, any Moore–Penrose inverse matrix is uniquely determined as well. The Moore–Penrose inverse of any m × n matrix A can be uniquely determined by [51] A† = (AH A)† AH

(if m ≥ n)

(1.8.5)

A† = AH (AAH )†

(if m ≤ n).

(1.8.6)

or [187]

From Definition 1.36 it is easily seen that both A† = (AH A)† AH and A† = AH (AAH )† meet the four Moore–Penrose conditions. From [307], [311], [400], [401] and [410], the Moore–Penrose inverse matrices A† have the following properties.

68

Introduction to Matrix Algebra

1. For an m × n matrix A, its Moore–Penrose inverse A† is uniquely determined. 2. The Moore–Penrose inverse of the complex conjugate transpose matrix AH is given by (AH )† = (A† )H = A†H = AH† . 3. The generalized inverse of a Moore–Penrose inverse matrix is equal to the original matrix, namely (A† )† = A. 4. If c = 0 then (cA)† = c−1 A† . 5. If D = Diag(d11 , . . . , dnn ), then D† = Diag(d†11 , . . . , d†nn ), where d†ii = d−1 ii (if † dii = 0) or dii = 0 (if dii = 0). 6. The Moore–Penrose inverse of an m × n zero matrix Om×n is an n × m zero matrix, i.e., O†m×n = On×m . 7. The Moore–Penrose inverse of the n × 1 vector x is an 1 × n vector and is given by x† = (xH x)−1 xH . 8. If AH = A and A2 = A then A† = A. 9. If A = BC, B is of full column rank, and C is of full row rank then A† = C† B† = CH (CCH )−1 (BH B)−1 BH . 10. (AAH )† = (A† )H A† and (AAH )† (AAH ) = AA† . 11. If the matrices Ai are mutually orthogonal, i.e., AH i Aj = O, i = j, then (A1 + · · · + Am )† = A†1 + · · · + A†m . 12. Regarding the ranks of generalized inverse matrices, one has rank(A† ) = rank(A) = rank(AH ) = rank(A† A) = rank(AA† ) = rank(AA† A) = rank(A† AA† ). 13. The Moore–Penrose inverse of any matrix Am×n can be determined by A† = (AH A)† AH or A† = AH (AAH )† . In particular, the inverse matrices of the full rank matrices are as follows: • If A is of full column rank then A† = (AH A)−1 AH , i.e., the Moore–Penrose inverse of a matrix A with full column rank reduces to the left pseudo-inverse matrix of A. • If A is of full row rank then A† = AH (AAH )−1 , i.e., the Moore–Penrose inverse of a matrix A with full row rank reduces to the right pseudo-inverse matrix of A. • If A is nonsingular then A† = A−1 , that is, the Moore–Penrose inverse of a nonsingular matrix A reduces to the inverse A−1 of A. 14. For a matrix Am×n , even if AA† = Im , A† A = In , AH (AH )† = In and (AH )† AH = Im , the following results are true: • • • •

A† AAH = AH and AH AA† = AH ; AA† (A† )H = (A† )H and (AH )† A† A = (A† )H ; (AH )† AA = A and AAH (AH )† = A; AH (A† )H A† = A† and A† (A† )H AH = A† .

1.8 Moore–Penrose Inverse Matrices

69

1.8.2 Computation of Moore–Penrose Inverse Matrix Consider an m × n matrix A with rank r, where r ≤ min(m, n). The following are four methods for computing the Moore–Penrose inverse matrix A† . 1. Equation-Solving Method [384] Step 1 Solve the matrix equations AAH XH = A and AH AY = AH to yield the solutions XH and Y, respectively. Step 2

Compute the generalized inverse matrix A† = XAY.

The following are two equation-solving algorithms for Moore–Penrose inverse matrices. Algorithm 1.3

Equation-solving method 1 [187]

1. Compute the matrix B = AAH . 2. Solve the matrix equation B2 XH = B to get the matrix XH . 3. Calculate the Moore–Penrose inverse matrix B† = (AAH )† = XBXH . 4. Compute the Moore–Penrose inverse matrix A† = AH B† .

Algorithm 1.4

Equation-solving method 2 [187]

1. Calculate the matrix B = AH A. 2. Solve the matrix equation B2 XH = X to obtain the matrix XH . 3. Compute the Moore–Penrose inverse matrix B† = (AH A)† = XBXH . 4. Calculate the Moore–Penrose inverse matrix A† = B† AH .

Algorithm 1.3 computes A† = AH (AAH )† and Algorithm 1.4 computes A† = (AH A)† AH . If the number of columns of the matrix Am×n is larger than the number of its rows then the dimension of the matrix product AAH is less than the dimension of AH A. In this case Algorithm 1.3 needs less computation. If, on the contrary, the number of rows is larger than the number of columns then one should select Algorithm 1.4. 2. Full-Rank Decomposition Method DEFINITION 1.37 Let a rank-deficient matrix Am×n have rank r < min{m, n}. If A = FG, where Fm×r is of full column rank and Gr×n is of full row rank, then A = FG is called the full-rank decomposition of the matrix A. LEMMA 1.6 [433] A matrix A ∈ Cm×n with rank(A) = r can be decomposed into A = FG, where F ∈ Cm×r and G ∈ Cr×n have full-column rank and full-row rank, respectively.

70

Introduction to Matrix Algebra

If A = FG is a full-rank decomposition of the m × n matrix A, then A† = G† F† = GH (GGH )−1 (FH F)−1 FH meets the four Moore–Penrose conditions, so the n × m matrix A† must be the Moore–Penrose inverse matrix of A. Elementary row operations easily implement the full-rank decomposition of a rank-deficient matrix A ∈ Cm×n , as shown in Algorithm 1.5. Algorithm 1.5

Full-rank decomposition

1. Use elementary row operations to get the reduced row-echelon form of A. 2. Use the pivot columns of A as the column vectors of a matrix F. 3. Use the nonzero rows in the reduced row-echelon to get the row vectors of a matrix G. 4. The full-rank decomposition is given by A = FG.

EXAMPLE 1.12

In Example 1.1, via elementary ⎡ −3 6 −1 1 A = ⎣ 1 −2 2 3 2 −4 5 8

row operations on the matrix ⎤ −7 −1 ⎦ , −4

we obtained its reduced row-echelon form, ⎡ ⎤ 1 −2 0 −1 3 ⎣ 0 0 1 2 −2 ⎦ . 0 0 0 0 0 The pivot columns of A are the first and third columns, so we have ⎡ ⎤   −3 −1 1 −2 0 −1 3 ⎣ ⎦ . F= 1 2 , G= 0 0 1 2 −2 2 5 Hence we get the full-rank ⎡ −3 6 −1 1 ⎣ 1 −2 2 3 2 −4 5 8

decomposition A = FG: ⎤ ⎤ ⎡  −7 −3 −1  1 −2 0 −1 3 . −1 ⎦ = ⎣ 1 2 ⎦ 0 0 1 2 −2 2 5 −4 3. Recursive Methods

Block the matrix Am×n into Ak = [Ak−1 , ak ], where Ak−1 consists of the first k − 1 columns of A and ak is the kth column of A. Then, the Moore–Penrose inverse A†k of the block matrix Ak can be recursively calculated from A†k−1 . When k = n, we get the Moore–Penrose inverse matrix A† . Such a recursive algorithm was presented by Greville in 1960 [188]. Algorithm 1.6 below is available for all matrices. However, when the number of

1.9 Direct Sum and Hadamard Product Algorithm 1.6

71

Column recursive method

−1 H initialize: A†1 = a†1 = (aH a1 . Put k = 2. 1 a1 )

repeat 1. Compute dk = A†k−1 ak .  −1 H † (1 + dH dk Ak−1 , k dk ) 2. Compute bk = (a − Ak−1 dk )† , k

A†k−1 − dk bk † 3. Compute Ak = . bk

if dk − Ak−1 dk = 0; if ak − Ak−1 dk = 0.

4. exit if k = n. return k ← k + 1. output: A† .

rows of A is small, in order to reduce the number of recursions it is recommended that one should first use the column recursive algorithm 1.6 for finding the Moore– Penrose matrix (AH )† = AH† , and then compute A† = (AH† )H . 4. Trace Method The trace method for computing the Moore–Penrose inverse matrix is shown in Algorithm 1.7. Algorithm 1.7

Trace method [397]

input: A ∈ Rm×n and rank(A) = r. initialize: B = AT A, and set C1 = I, k = 1. repeat 1. Compute Ck+1 = k−1 tr(Ck B)I − Ck B. 2. exit if k = r − 1. return k ← k + 1. output: A† =

r Ck AT . tr(Ck B)

1.9 Direct Sum and Hadamard Product 1.9.1 Direct Sum of Matrices DEFINITION 1.38 [186] The direct sum of an m × m matrix A and an n × n matrix B, denoted A ⊕ B, is an (m + n) × (m + n) matrix and is defined as follows:   A Om×n . (1.9.1) A⊕B= On×m B

72

Introduction to Matrix Algebra

From the above definition, it is easily shown that the direct sum of matrices has the following properties [214], [390]: 1. If c is a constant then c (A ⊕ B) = cA ⊕ cB. 2. The direct sum does not satisfy exchangeability, i.e., A ⊕ B = B ⊕ A unless A = B. 3. If A, B are two m × m matrices and C and D are two n × n matrices then (A ± B) ⊕ (C ± D) = (A ⊕ C) ± (B ⊕ D), (A ⊕ C)(B ⊕ D) = AB ⊕ CD. 4. If A, B, C are m × m, n × n, p × p matrices, respectively, then A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C = A ⊕ B ⊕ C. 5. If Am×m and Bn×n are respectively orthogonal matrices then A ⊕ B is an (m + n) × (m + n) orthogonal matrix. 6. The complex conjugate, transpose, complex conjugate transpose and inverse matrices of the direct sum of two matrices are given by (A ⊕ B)∗ = A∗ ⊕ B∗ , (A ⊕ B)T = AT ⊕ BT , (A ⊕ B)H = AH ⊕ BH , (A ⊕ B)−1 = A−1 ⊕ B−1

(if A−1 and B−1 exist).

7. The trace, rank and determinant of the direct sum of N matrices are as follows: N −1  N −1 ,  tr Ai = tr(Ai ), i=0

rank

N −1 ,

det

i=0

=

Ai

i=0

N −1 ,

i=0



rank(Ai ),

i=0

 Ai

N −1 

=

N −1 (

det(Ai ).

i=0

1.9.2 Hadamard Product DEFINITION 1.39 The Hadamard product of two m × n matrices A = [aij ] and B = [bij ] is denoted A ∗ B and is also an m × n matrix, each entry of which is defined as the product of the corresponding entries of the two matrices: [A ∗ B]ij = aij bij . That is, the Hadamard product is a mapping Rm×n × Rm×n → Rm×n .

(1.9.2)

1.9 Direct Sum and Hadamard Product

73

The Hadamard product is also known as the Schur product or the elementwise product. The following theorem describes the positive definiteness of the Hadamard product and is usually known as the Hadamard product theorem [214]. THEOREM 1.9 If two m × m matrices A and B are positive definite (positive semi-definite) then their Hadamard product A∗B is positive definite (positive semidefinite) as well. COROLLARY 1.1 (Fejer theorem) [214] An m × m matrix A = [aij ] is positive semi-definite if and only if m m   aij bij ≥ 0 i=1 j=1

holds for all m × m positive semi-definite matrices B = [bij ]. The following theorems describe the relationship between the Hadamard product and the matrix trace. THEOREM 1.10 [311, p. 46] Let A, B, C be m × n matrices, 1 = [1, . . . , 1]T be n an n × 1 summing vector and D = Diag(d1 , . . . , dm ), where di = j=1 aij ; then     tr AT (B ∗ C) = tr (AT ∗ BT )C , (1.9.3) 1T AT (B ∗ C)1 = tr(BT DC).

(1.9.4)

THEOREM 1.11 [311, p. 46] Let A, B be two n × n positive definite square matrices, and 1 = [1, . . . , 1]T be an n × 1 summing vector. Suppose that M is an n × n diagonal matrix, i.e., M = Diag(μ1 , . . . , μn ), while m = M1 is an n × 1 vector. Then one has tr(AMBT M) = mT (A ∗ B)m,

(1.9.5)

tr(AB ) = 1 (A ∗ B)1,

(1.9.6)

MA ∗ B M = M(A ∗ B )M.

(1.9.7)

T

T

T

T

From the above definition, it is known that Hadamard products obey the exchange law, the associative law and the distributive law of the addition: A ∗ B = B ∗ A,

(1.9.8)

A ∗ (B ∗ C) = (A ∗ B) ∗ C,

(1.9.9)

A ∗ (B ± C) = A ∗ B ± A ∗ C.

(1.9.10)

The properties of Hadamard products are summarized below [311]. 1. If A, B are m × n matrices then (A ∗ B)T = AT ∗ BT ,

(A ∗ B)H = AH ∗ BH ,

(A ∗ B)∗ = A∗ ∗ B∗ .

74

Introduction to Matrix Algebra

2. The Hadamard product of a matrix Am×n and a zero matrix Om×n is given by A ∗ Om×n = Om×n ∗ A = Om×n . 3. If c is a constant then c (A ∗ B) = (cA) ∗ B = A ∗ (c B). 4. The Hadamard product of two positive definite (positive semi-definite) matrices A, B is positive definite (positive semi-definite) as well. 5. The Hadamard product of the matrix Am×m = [aij ] and the identity matrix Im is an m × m diagonal matrix, i.e., A ∗ Im = Im ∗ A = Diag(A) = Diag(a11 , . . . , amm ). 6. If A, B, D are three m × m matrices and D is a diagonal matrix then (DA) ∗ (BD) = D(A ∗ B)D. 7. If A, C are two m × m matrices and B, D are two n × n matrices then (A ⊕ B) ∗ (C ⊕ D) = (A ∗ C) ⊕ (B ∗ D). 8. If A, B, C, D are all m × n matrices then (A + B) ∗ (C + D) = A ∗ C + A ∗ D + B ∗ C + B ∗ D. 9. If A, B, C are m × n matrices then     tr AT (B ∗ C) = tr (AT ∗ BT )C . The Hadamard product of n×n matrices A and B obeys the following inequality: (1) Oppenheim inequality [23, p. 144]

If A and B are positive semi-definite, then

|A ∗ B| ≥ a11 · · · ann |B|.

(1.9.11)

(2) If A and B are positive semi-definite, then [327] |A ∗ B| ≥ |AB|.

(1.9.12)

(3) Eigenvalue inequality [23, p.144] If A and B are positive semi-definite and ˆ ,...,λ ˆn λ1 , . . . , λn are the eigenvalues of the Hadamard product A ∗ B, while λ 1 are the eigenvalues of the matrix product AB, then n ( i=k

λi ≥

n (

ˆi, λ

k = 1, . . . , n.

(1.9.13)

i=k

(4) Rank inequality [327] rank(A ∗ B) ≤ rank(A) rank(B).

(1.9.14)

The Hadamard products of matrices are useful in lossy compression algorithms (such as JPEG).

1.10 Kronecker Products and Khatri–Rao Product

75

1.10 Kronecker Products and Khatri–Rao Product The Hadamard product described in the previous section is a special product of two matrices. In this section, we discuss two other special products of two matrices: the Kronecker products and the Khatri–Rao product.

1.10.1 Kronecker Products Kronecker products are divided into right and left Kronecker products. DEFINITION 1.40 (Right Kronecker product) [32] Given an m × n matrix A = [a1 , . . . , an ] and another p × q matrix B, their right Kronecker product A ⊗ B is an mp × nq matrix defined by ⎡ ⎤ a11 B a12 B · · · a1n B ⎢ a21 B a22 B · · · a2n B ⎥ ⎢ ⎥ (1.10.1) A ⊗ B = [aij B]m,n . .. .. ⎥ . .. i=1,j=1 = ⎢ ⎣ .. . . . ⎦ am1 B am2 B · · · amn B DEFINITION 1.41 (Left Kronecker product) [186], [413] For an m × n matrix A and a p × q matrix B = [b1 , . . . , bq ], their left Kronecker product A ⊗ B is an mp × nq matrix defined by ⎡ ⎤ Ab11 Ab12 · · · Ab1q ⎢Ab21 Ab22 · · · Ab2q ⎥ ⎢ ⎥ (1.10.2) [A ⊗ B]left = [Abij ]p,q = ⎢ . .. .. ⎥ . .. i=1,j=1 ⎣ .. . . . ⎦ Abp1

Abp2

···

Abpq

Clearly, the left or right Kronecker product is a mapping Rm×n ×Rp×q → Rmp×nq . It is easily seen that if we adopt the right Kronecker product form then the left Kronecker product can be written as [A ⊗ B]left = B ⊗ A. Since the right Kronecker product form is the one generally adopted, this book uses the right Kronecker product hereafter unless otherwise stated. In particular, when n = 1 and q = 1, the Kronecker product of two matrices reduces to the Kronecker product of two column vectors a ∈ Rm and b ∈ Rp : ⎤ ⎡ a1 b ⎢ . ⎥ a ⊗ b = [ai b]m (1.10.3) i=1 = ⎣ .. ⎦ . am b The result is an mp × 1 vector. Evidently, the outer product of two vectors x ◦ y = xyT can be also represented using the Kronecker product as x ◦ y = x ⊗ yT . The Kronecker product is also known as the direct product or tensor product [307].

76

Introduction to Matrix Algebra

Summarizing results from [32], [60] and other literature, Kronecker products have the following properties. 1. The Kronecker product of any matrix and a zero matrix is equal to the zero matrix, i.e., A ⊗ O = O ⊗ A = O. 2. If α and β are constants then αA ⊗ βB = αβ(A ⊗ B). 3. The Kronecker product of an m×m identity matrix and an n×n identity matrix is equal to an mn × mn identity matrix, i.e., Im ⊗ In = Imn . 4. For matrices Am×n , Bn×k , Cl×p , Dp×q , we have (AB) ⊗ (CD) = (A ⊗ C)(B ⊗ D).

(1.10.4)

5. For matrices Am×n , Bp×q , Cp×q , we have A ⊗ (B ± C) = A ⊗ B ± A ⊗ C,

(1.10.5)

(B ± C) ⊗ A = B ⊗ A ± C ⊗ A.

(1.10.6)

6. The inverse and generalized inverse matrix of Kronecker products satisfy (A ⊗ B)−1 = A−1 ⊗ B−1 ,

(A ⊗ B)† = A† ⊗ B† .

(1.10.7)

7. The transpose and the complex conjugate transpose of Kronecker products are given by (A ⊗ B)T = AT ⊗ BT ,

(A ⊗ B)H = AH ⊗ BH .

(1.10.8)

8. The rank of the Kronecker product is rank(A ⊗ B) = rank(A)rank(B).

(1.10.9)

9. The determinant of the Kronecker product det(An×n ⊗ Bm×m ) = (det A)m (det B)n .

(1.10.10)

10. The trace of the Kronecker product is tr(A ⊗ B) = tr(A)tr(B).

(1.10.11)

11. For matrices Am×n , Bm×n , Cp×q , Dp×q , we have (A + B) ⊗ (C + D) = A ⊗ C + A ⊗ D + B ⊗ C + B ⊗ D.

(1.10.12)

12. For matrices Am×n , Bp×q , Ck×l , it is true that (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

(1.10.13)

13. For matrices Am×n , Bk×l , Cp×q , Dr×s , there is the following relationship: (A ⊗ B) ⊗ (C ⊗ D) = A ⊗ B ⊗ C ⊗ D.

(1.10.14)

14. For matrices Am×n , Bp×q , Cn×r , Dq×s , Er×k , Fs×l , we have (A ⊗ B)(C ⊗ D)(E ⊗ F) = (ACE) ⊗ (BDF).

(1.10.15)

1.10 Kronecker Products and Khatri–Rao Product

77

15. The following is a special example of Equation (1.10.15) (see [32], [186]): A ⊗ D = (AIp ) ⊗ (Iq D) = (A ⊗ Iq )(Ip ⊗ D),

(1.10.16)

where Ip ⊗ D is a block diagonal matrix (for the right Kronecker product) or a sparse matrix (for the left Kronecker product), while A ⊗ Iq is a sparse matrix (for the right Kronecker product) or a block diagonal matrix (for the left Kronecker product). 16. For matrices Am×n , Bp×q , we have exp(A ⊗ B) = exp(A) ⊗ exp(B). 17. Let A ∈ Cm×n and B ∈ Cp×q then [311, p. 47] Kpm (A ⊗ B) = (B ⊗ A)Kqn ,

(1.10.17)

Kpm (A ⊗ B)Knq = B ⊗ A,

(1.10.18)

Kpm (A ⊗ B) = B ⊗ A,

(1.10.19)

Kmp (B ⊗ A) = A ⊗ B,

(1.10.20)

where K is a commutation matrix (see Subsection 1.11.1).

1.10.2 Generalized Kronecker Products The generalized Kronecker product is the Kronecker product of a matrix group consisting of more matrices and another matrix. DEFINITION 1.42 [390] Given N matrices Ai ∈ Cm×n , i = 1, . . . , N that constitute a matrix group {A}N , the Kronecker product of the matrix group {A}N and an N × l matrix B, known as the generalized Kronecker product, is defined by ⎤ ⎡ A1 ⊗ b 1 ⎥ ⎢ .. {A}N ⊗ B = ⎣ (1.10.21) ⎦, . A N ⊗ bN where bi is the ith row vector of the matrix B. EXAMPLE 1.13

Put

⎧ ⎫ 1 ⎪ ⎪ ⎪ ⎪ 1 ⎪ ⎬ ⎨ 2 −1 ⎪ {A}2 =   , ⎪ ⎪ 2 −j ⎪ ⎪ ⎪ ⎪ ⎭ ⎩ 1 j

 B=

1 1

2 −1

 .

Then the generalized Kronecker product {A}2 ⊗ B is given by ⎤ ⎡ ⎡   ⎤ 1 1 1 1 2 2 ⊗ [1, 2] ⎥ ⎢ ⎢ 2 −1 4 −2 ⎥ ⎥ ⎢ 2 −1 ⎢ ⎥. {A}2 ⊗ B = ⎢   ⎥=⎣ 2 −j −2 j ⎦ ⎦ ⎣ 2 −j ⊗ [1, −1] 1 j −1 −j 1 j

78

Introduction to Matrix Algebra

It should be noted that the generalized Kronecker product of two matrix groups {A} and {B} is still a matrix group rather than a single matrix. EXAMPLE 1.14

Let

⎧ ⎫ 1 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ 1 −1 ⎬ {A}2 =   , ⎪ ⎪ 1 −j ⎪ ⎪ ⎪ ⎪ ⎭ ⎩ 1 j

# [1, 1] . [1, −1]

" {B} =

The generalized Kronecker product is given by ⎫ ⎧ ⎧  1 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⊗ [1, 1] ⎪ ⎪ ⎬ ⎪ ⎨ ⎨ 1 −1 {A}2 ⊗ {B} =  =   ⎪ ⎪ 1 −j ⎪ ⎪ ⎪ ⎪ ⎪ ⊗ [1, −1]⎪ ⎭ ⎪ ⎩ ⎩ 1 j

1 1

1 1 −1 1

1 1

−j −1 j −1

⎫ ⎪ ⎪ ⎪ ⎬  . j ⎪ ⎪ ⎪ ⎭ −j

1 −1

Generalized Kronecker products have important applications in filter bank analysis and the derivation of fast algorithms in the Haar transform and the Hadamard transform [390]. Based on the generalized Kronecker product, we can derive the fast Fourier transform (FFT) algorithm, which we will discuss in Subsection 2.7.4 in Chapter 2.

1.10.3 Khatri–Rao Product DEFINITION 1.43 The Khatri–Rao product of two matrices with the same number of columns, G ∈ Rp×n and F ∈ Rq×n , is denoted by F  G and is defined as [244], [409] F  G = [f1 ⊗ g1 , f2 ⊗ g2 , . . . , fn ⊗ gn ] ∈ Rpq×n .

(1.10.22)

Thus the Khatri–Rao product consists of the Kronecker product of the corresponding column vectors of two matrices. Hence the Khatri–Rao product is called the columnwise Kronecker product. More generally, if A = [A1 , . . . , Au ] and B = [B1 , . . . , Bu ] are two block matrices, and the submatrices Ai and Bi have the same number of columns, then A  B = [A1 ⊗ B1 , A2 ⊗ B2 , . . . , Au ⊗ Bu ].

(1.10.23)

Khatri–Rao products have a number of properties [32], [301]. 1. The basic properties of the Khatri–Rao product itself are follows. • Distributive law (A + B)  D = A  D + B  D. • Associative law A  B  C = (A  B)  C = A  (B  C). • Commutative law A  B = Knn (B  A), where Knn is a commutation matrix; see the next section.

1.11 Vectorization and Matricization

79

2. The relationship between the Khatri–Rao product and the Kronecker product is (A ⊗ B)(F  G) = AF  BG.

(1.10.24)

3. The relationships between the Khatri–Rao product and the Hadamard product are (A  B) ∗ (C  D) = (A ∗ C)  (B ∗ D),

(1.10.25)

(A  B) (A  B) = (A A) ∗ (B B),  † (A  B)† = (AT A) ∗ (BT B) (A  B)T .

(1.10.26)

T

T

T

(1.10.27)

More generally, (A  B  C)T (A  B  C) = (AT A) ∗ (BT B) ∗ (CT C),  † (A  B  C)† = (AT A) ∗ (BT B) ∗ (CT C) (A  B  C)T .

(1.10.28) (1.10.29)

1.11 Vectorization and Matricization There exist functions or operators that transform a matrix into a vector or vice versa. These functions or operators are the vectorization of a matrix and the matricization of a vector.

1.11.1 Vectorization and Commutation Matrix The vectorization of a matrix A ∈ Rm×n , denoted vec(A), is a linear transformation that arranges the entries of A = [aij ] as an mn × 1 vector via column stacking: vec(A) = [a11 , . . . , am1 , . . . , a1n , . . . , amn ]T .

(1.11.1)

A matrix A can be also arranged as a row vector by stacking the rows; this is known as the row vectorization of the matrix, denoted rvec(A), and is defined as (1.11.2) rvec(A) = [a11 , . . . , a1n , . . . , am1 , . . . , amn ].   a11 a12 , vec(A) = [a11 , a21 , a12 , a22 ]T and For instance, given a matrix A = a21 a22 rvec(A) = [a11 , a12 , a21 , a22 ]. Clearly, there exist the following relationships between the vectorization and the row vectorization of a matrix: rvec(A) = (vec(AT ))T ,

vec(AT ) = (rvec A)T .

(1.11.3)

One obvious fact is that, for a given m × n matrix A, the two vectors vec(A) and vec(AT ) contain the same entries but the orders of their entries are different. Interestingly, there is a unique mn × mn permutation matrix that can transform

80

Introduction to Matrix Algebra

vec(A) into vec(AT ). This permutation matrix is known as the commutation matrix; it is denoted Kmn and is defined by Kmn vec(A) = vec(AT ).

(1.11.4)

Similarly, there is an nm × nm permutation matrix transforming vec(AT ) into vec(A). Such a commutation matrix, denoted Knm , is defined by Knm vec(AT ) = vec(A).

(1.11.5)

From (1.11.4) and (1.11.5) it can be seen that Knm Kmn vec(A) = Knm vec(AT ) = vec(A). Since this formula holds for any m × n matrix A, we have Knm Kmn = Imn or K−1 mn = Knm . The mn × mn commutation matrix Kmn has the following properties [310]. 1. Kmn vec(A) = vec(AT ) and Knm vec(AT ) = vec(A), where A is an m × n matrix. 2. KTmn Kmn = Kmn KTmn = Imn , or K−1 mn = Knm . 3. KTmn = Knm . 4. Kmn can be represented as a Kronecker product of the essential vectors: Kmn =

n 

(eTj ⊗ Im ⊗ ej ).

j=1

5. K1n = Kn1 = In . 6. Knm Kmn vec(A) = Knm vec(AT ) = vec(A). 7. The eigenvalues of the commutation matrix Knn are 1 and −1 and their multiplicities are respectively 12 n(n + 1) and 12 n(n − 1). 8. The rank of the commutation matrix is given by rank(Kmn ) = 1+d(m−1, n−1), where d(m, n) is the greatest common divisor of m and n and d(n, 0) = d(0, n) = n. 9. Kmn (A ⊗ B)Kpq = B ⊗ A, and thus can be equivalently written as Kmn (A ⊗ B) = (B ⊗ A)Kqp , where A is an n × p matrix and B is an m × q matrix. In particular, Kmn (An×n ⊗ Bm×m ) = (B ⊗ A)Kmn .  T 10. tr(Kmn (Am×n ⊗ Bm×n )) = tr(AT B) = vec(BT ) Kmn vec(A). The construction of an mn × mn commutation matrix is as follows. First let ⎤ ⎡ K1 ⎥ ⎢ (1.11.6) Kmn = ⎣ ... ⎦ , Ki ∈ Rn×mn , i = 1, . . . , m; Km

1.11 Vectorization and Matricization

then the (i, j)th entry of the first submatrix K1 is given by + 1, j = (i − 1)m + 1, i = 1, . . . , n, K1 (i, j) = 0, otherwise.

81

(1.11.7)

Next, the ith submatrix Ki (i = 2, . . . , m) is constructed from the (i − 1)th submatrix Ki−1 as follows: Ki = [0, Ki−1 (1 : mn − 1)],

i = 2, . . . , m,

(1.11.8)

where Ki−1 (1 : mn−1) denotes a submatrix consisting of the first (mn−1) columns of the n × m submatrix Ki−1 . EXAMPLE 1.15

For m = 2, n = 4, ⎡ 1 0 ⎢0 0 ⎢ ⎢0 0 ⎢ ⎢ ⎢0 0 K24 = ⎢ ⎢0 1 ⎢ ⎢0 0 ⎢ ⎣0 0 0 0

we have 0 1 0 0 0 0 0 0

0 0 0 0 0 1 0 0

0 0 1 0 0 0 0 0

0 0 0 0 0 0 1 0

0 0 0 1 0 0 0 0

⎤ 0 0⎥ ⎥ 0⎥ ⎥ ⎥ 0⎥ ⎥. 0⎥ ⎥ 0⎥ ⎥ 0⎦ 1

0 0 1 0 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 0 1 0

0 1 0 0 0 0 0 0

0 0 0 1 0 0 0 0

0 0 0 0 0 1 0 0

⎤ 0 0⎥ ⎥ 0⎥ ⎥ ⎥ 0⎥ ⎥. 0⎥ ⎥ 0⎥ ⎥ 0⎦ 1

0 1 0 0 0 0 0 0

0 0 0 1 0 0 0 0

0 0 0 0 0 1 0 0

⎤⎡ ⎤ ⎡ ⎤ a11 a11 0 ⎢ ⎥ ⎢ ⎥ 0⎥ ⎢a21 ⎥ ⎥ ⎢a12 ⎥ ⎢ ⎥ ⎢ ⎥ 0⎥ ⎥ ⎢a31 ⎥ ⎢a21 ⎥ ⎥⎢ ⎥ ⎢ ⎥ 0⎥ ⎢a41 ⎥ ⎢a22 ⎥ ⎥ ⎢ ⎥ = ⎢ ⎥ = vec(AT ). 0⎥ ⎢a12 ⎥ ⎢a31 ⎥ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ 0⎥ ⎥ ⎢a22 ⎥ ⎢a32 ⎥ 0⎦ ⎣a32 ⎦ ⎣a41 ⎦ 1 a42 a42

If m = 4, n = 2 then ⎡

K42

1 ⎢0 ⎢ ⎢0 ⎢ ⎢ ⎢0 =⎢ ⎢0 ⎢ ⎢0 ⎢ ⎣0 0

Hence we get ⎡

1 ⎢0 ⎢ ⎢0 ⎢ ⎢ ⎢0 K42 vec(A) = ⎢ ⎢0 ⎢ ⎢0 ⎢ ⎣0 0

0 0 1 0 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 0 1 0

82

Introduction to Matrix Algebra

1.11.2 Matricization of a Vector The operation for transforming an mn × 1 vector a = [a1 , . . . , amn ]T into an m × n matrix A is known as the matricization of the column vector a, denoted unvecm,n (a), and is defined as ⎡ ⎤ a1 am+1 · · · am(n−1)+1 ⎢ a2 am+2 · · · am(n−1)+2 ⎥ ⎢ ⎥ (1.11.9) Am×n = unvecm,n (a) = ⎢ . ⎥. .. .. .. ⎣ .. ⎦ . . . a2m

am

···

amn

Clearly, the (i, j)th entry Aij of the matrix A is given by the kth entry ak of the vector a as follows: Aij = ai+(j−1)m ,

i = 1, . . . , m, j = 1, . . . , n.

(1.11.10)

Similarly, the operation for transforming a 1 × mn row vector b = [b1 , . . . , bmn ] into an m×n matrix B is called the matricization of the row vector b. This matricization is denoted unrvecm,n (b), and is defined as follows: ⎡ ⎤ b1 b2 ··· bn ⎢ bn+1 bn+2 · · · b2n ⎥ ⎢ ⎥ (1.11.11) Bm×n = unrvecm,n (b) = ⎢ .. .. .. ⎥ . .. ⎣ . . . . ⎦ b(m−1)n+1 b(m−1)n+2 · · · bmn This is equivalently represented in element form as Bij = bj+(i−1)n ,

i = 1, . . . , m, j = 1, . . . , n.

(1.11.12)

It can be seen from the above definitions that there are the following relationships between matricization (unvec) and column vectorization (vec) or row vectorization (rvec): ⎡ ⎤ A11 · · · A1n ⎢ .. .. ⎥ −vec T .. −→ ⎣ . . . ⎦← −−− [A11 , . . . , Am1 , . . . , A1n , . . . , Amn ] , unvec Am1 · · · Amn ⎤ ⎡ A11 · · · A1n rvec ⎢ .. .. ⎥ − .. −→ ⎣ . . . ⎦← −−− [A11 , . . . , A1n , . . . , Am1 , . . . , Amn ], unvec Am1 · · · Amn which can be written as unvecm,n (a) = Am×n



vec(Am×n ) = amn×1 ,

(1.11.13)

unrvecm,n (b) = Bm×n



rvec(Bm×n ) = b1×mn .

(1.11.14)

1.11 Vectorization and Matricization

83

1.11.3 Properties of Vectorization Operator The vectorization operator has the following properties [68], [207], [311]. 1. The vectorization of a transposed matrix is given by vec(AT ) = Kmn vec(A) for A ∈ Cm×n . 2. The vectorization of a matrix sum is given by vec(A + B) = vec(A) + vec(B). 3. The vectorization of a Kronecker product [311, p. 184] is given by vec(X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec(X) ⊗ vec(Y)).

(1.11.15)

4. The trace of a matrix product is given by tr(AT B) = (vec(A))T vec(B), H

H

(1.11.16)

tr(A B) = (vec(A)) vec(B),

(1.11.17)

tr(ABC) = (vec(A)) (Ip ⊗ B) vec(C),

(1.11.18)

T

while the trace of the product of four matrices is determined by [311, p. 31] tr(ABCD) = (vec(DT ))T (CT ⊗ A) vec(B) = (vec(D))T (A ⊗ CT ) vec(BT ). 5. The Kronecker product of two vectors a and b can be represented as the vectorization of their outer product baT as follows: a ⊗ b = vec(baT ) = vec(b ◦ a).

(1.11.19)

6. The vectorization of the Hadamard product is given by vec(A ∗ B) = vec(A) ∗ vec(B) = Diag(vec(A)) vec(B),

(1.11.20)

where Diag(vec(A)) is a diagonal matrix whose entries are the vectorization function vec(A). 7. The relation of the vectorization function to the Khatri–Rao product [60] is as follows: vec(Um×p Vp×p Wp×n ) = (WT  U)diag(V),

(1.11.21)

where diag(V) = [v11 , . . . , vpp ]T and vii are the diagonal entries of V. 8. The relation of the vectorization of the matrix product Am×p Bp×q Cq×n to the Kronecker product [428, p. 263] is given by vec(ABC) = (CT ⊗ A) vec(B),

(1.11.22)

vec(ABC) = (Iq ⊗ AB) vec(C) = (C B ⊗ Im ) vec(A), T

T

vec(AC) = (Ip ⊗ A) vec(C) = (C ⊗ Im ) vec(A). T

(1.11.23) (1.11.24)

84

Introduction to Matrix Algebra

EXAMPLE 1.16 Consider the matrix equation AXB = C, where A ∈ Rm×n , X ∈ Rn×p , B ∈ Rp×q and C ∈ Rm×q . By using the vectorization function property vec(AXB) = (BT ⊗ A) vec(X), the vectorization vec(AXB) = vec(C) of the original matrix equation can be rewritten as the Kronecker product form [410] (BT ⊗ A)vec(X) = vec(C), and thus vec(X) = (BT ⊗ A)† vec(C). Then by matricizing vec(X), we get the solution matrix X of the original matrix equation AXB = C. EXAMPLE 1.17 Consider solving the matrix equation AX + XB = Y, where all the matrices are n × n. By the vectorization operator property vec(ADB) = (BT ⊗A) vec(D), we have vec(AX) = vec(AXI) = (In ⊗A)vec(X) and vec(XB) = vec(IXB) = (BT ⊗In )vec(X), and thus the original matrix equation AX+XB = Y can be rewritten as (In ⊗ A + BT ⊗ In ) vec(X) = vec(Y), from which we get vec(X) = (In ⊗ A + BT ⊗ In )† vec(Y). Then, by matricizing vec(X), we get the solution X.

1.12 Sparse Representations A sparse linear combination of prototype signal-atoms (see below) representing a target signal is known as a sparse representation.

1.12.1 Sparse Vectors and Sparse Representations A vector or matrix most of whose elements are zero is known as a sparse vector or a sparse matrix. A signal vector y ∈ Rm can be decomposed into at most m orthogonal basis (vectors) gk ∈ Rm , k = 1, . . . , m. Such an orthogonal basis is called a complete orthogonal basis. In the signal decomposition y = Gc =

m 

c i gi ,

(1.12.1)

i=1

the coefficient vector c must be nonsparse. If the signal vector y ∈ Rm is decomposed into a linear combination of n vectors ai ∈ Rm , i = 1, . . . , n (where n > m), i.e., y = Ax =

n 

x i ai

(n > m),

(1.12.2)

i=1

then the n (> m) vectors ai ∈ Rm , i = 1, . . . , n cannot be an orthogonal basis set. In order to distinguish them from true basis vectors, nonorthogonal column vectors are called atoms or frames. Because the number n of atoms is larger than the dimension m of the vector space Rm , such a set of atoms is said to be overcomplete.

1.12 Sparse Representations

85

A matrix consisting of overcomplete atoms, A = [a1 , . . . , an ] ∈ Rm×n (n > m), is referred to as a dictionary. For a dictionary (matrix) A ∈ Rm×n , it is usually assumed that (1) the row number m of A is less than its column number n; (2) the dictionary A has full-row rank, i.e., rank(A) = m; (3) all columns of A have unit Euclidean norms, i.e., aj 2 = 1, j = 1, . . . , n. The overcomplete signal decomposition formula (1.12.2) is a under-determined equation that has infinitely many solution vectors x. There are two common methods for solving this type of under-determined equations. 1. Classical method (finding the minimum 2 -norm solution) min x 2

subject to Ax = y.

(1.12.3)

The advantage of this method is that the solution is unique and is the minimum norm solution, also called the minimum energy (sum of squared amplitudes of each component) solution or the shortest distance (from the origin) solution. However, because each entry of this solution vector usually takes a nonzero value, it does not meet the requirements of sparse representation in many practical applications. 2. Modern method (finding the minimum 0 -norm solution) min x 0

subject to Ax = y,

(1.12.4)

where the 0 -norm x 0 is defined as the number of nonzero elements of the vector x. The advantage of this method is that it selects a sparse solution vector, which makes it available for many practical applications. The disadvantage of this method is that its computation is more complex. If in the observation data there exist errors or background noise then the minimum 0 -norm solution becomes min x 0

subject to Ax − y 2 ≤ ,

(1.12.5)

where > 0 is very small. When the coefficient vector x is sparse, the signal decomposition y = Ax is known as a sparse decomposition of signals. The columns of the dictionary matrix A are called explanatory variables; the signal vector y is known as the response variable or target signal; Ax is said to be the linear response predictor, while x can be regarded as the sparse representation of the target signal y corresponding to the dictionary A. Equation (1.12.4) is a sparse representation problem of the target signal y corresponding to the dictionary A, while Equation (1.12.5) is a sparse approximation problem of the target signal. Given a positive integer K, if the 0 -norm of the vector x is less than or equal to

86

Introduction to Matrix Algebra

K, i.e., x 0 ≤ K, then x is said to be K-sparse. For a given signal vector y and a dictionary A, if the coefficient vector x satisfying Ax = y has the minimum 0 -norm then x is said to be the sparsest representation of the target signal y corresponding to the dictionary A. Sparse representation is a type of linear inverse problem. In communications and information theory, the matrix A ∈ Rm×N and the vector x ∈ RN represent respectively the coding matrix and the plaintext to be sent, and the observation vector y ∈ Rm is the ciphertext. The linear inverse problem becomes a decoding problem: how to recover the plaintext x from the ciphertext y.

1.12.2 Sparse Representation of Face Recognition As a typical application of sparse representation, let us consider a face recognition problem. Suppose that there are c classes of target faces that the R1 × R2 matrix representation results for each image of a given face have been vectorized to an m × 1 vector d (where m = R1 × R2 is the number of samples creating each facial image, e.g., m = 512 × 512), and suppose that each column vector is normalized to the unit Euclidean norm. Hence Ni training images of the ith known face under different types of illumination can be represented as an m × Ni matrix Di = [di,1 , di,2 , . . . , di,Ni ] ∈ Rm×Ni . Given a large enough training set Di , then a new image y of the ith known face, shot under another illumination, can be represented as a linear combination of the known training images, i.e., y ≈ Di αi , where αi ∈ Rm is a coefficient vector. The problem is that in practical applications we do not usually know to which target collection of known faces the new experimental samples belong, so we need to make face recognition: determine to which target class these samples belong. If we know or make a rough guess that the new testing sample is a certain unknown target face in the c classes of known target faces, a dictionary consisting of training samples for c classes can be written as a training data matrix as follows: D = [D1 , . . . , Dc ] = [D1,1 , . . . , D1,N1 , . . . , Dc,1 , . . . , Dc,Nc ] ∈ Rm×N , (1.12.6) c where N = i=1 Ni denotes the total number of the training images of the c target faces. Hence the face image y to be recognized can be represented as the linear combination ⎤ ⎡ 0N 1 ⎢ . ⎥ ⎢ .. ⎥ ⎥ ⎢ ⎥ ⎢0 ⎢ Ni−1 ⎥ ⎥ ⎢ (1.12.7) y = Dα0 = [D1,1 . . . D1,N1 . . . Dc,1 . . . Dc,Nc ] ⎢ αi ⎥ , ⎥ ⎢ ⎢0Ni+1 ⎥ ⎥ ⎢ ⎢ .. ⎥ ⎣ . ⎦ 0N c

Exercises

87

where 0Nk , k = 1, . . . , i − 1, i + 1, . . . , c is an Nk × 1 zero vector. The face recognition becomes a problem for solving a matrix equation or a linear inverse problem: given a data vector y and a data matrix D, solve for the solution vector α0 of the matrix equation y = Dα0 . It should be noted that, since m < N in the general case, the matrix equation y = Dα0 is under-determined and has infinitely many solutions, among which the sparsest solution is the solution of interest. Since the solution vector must be a sparse vector, the face recognition problem can be described as an optimization problem: min α0 0

subject to y = Dα0 .

(1.12.8)

This is a typical 0 -norm minimization problem. We will discuss the solution of this problem in Chapter 6.

Exercises 1.1 1.2 1.3

1.4

1.5

T

Let x = [x1 , . . . , xm ] , y = [y1 , . . . , yn ]T , z = [z1 , . . . , zk ]T be complex vectors. Use the matrix form to represent the outer products x ◦ y and x ◦ y ◦ z. Show the associative law of matrix addition (A + B) + C = A + (B + C) and the right distributive law of matrix product (A + B)C = AC + BC. Given the system of linear equations ⎧ + ⎪ ⎨ 2y1 − y2 = x1 , 3z1 − z2 = y1 , and , y1 + 2y2 = x2 , ⎪ 5z1 + 2z2 = y2 , ⎩ −2y1 + 3y2 = x3 , use z1 , z2 to represent x1 , x2 , x3 . By using elementary row operations, simplify the matrix ⎡ 0 0 0 0 2 8 4 ⎢ 0 0 0 1 4 9 7 A=⎢ ⎣ 0 3 −11 −3 −8 −15 −32 0 −2 −8 1 6 13 21

⎤ ⎥ ⎥ ⎦

into its reduced-echelon form. Use elementary row operations to solve the system of linear equations 2x1 − 4x2 + 3x3 − 4x4 − 11x5 = 28, −x1 + 2x2 − x3 + 2x4 + 5x5 = −13, −3x3 + 2x4 + 5x5 = −10, 3x1 − 5x2 + 10x3 − 7x4 + 12x5 = 31.

88

1.6

Introduction to Matrix Algebra

Set 13 + 23 + · · · + n3 = a1 n + a2 n2 + a3 n3 + a4 n4 .

1.7

Find the constants a1 , a2 , a3 , a4 . (Hint: Let n = 1, 2, 3, 4, to get a system of linear equations.) Suppose that F : R3 → R2 is a transformation defined as ⎡ ⎤   x1 2x1 − x2 ⎣ , x = x2 ⎦ . F (x) = x2 + 5x3 x3

Determine whether F is a linear transformation. 1.8 Show that xT Ax = tr(xT Ax) and xT Ax = tr(AxxT ). 1.9 Given A ∈ Rn×n , demonstrate the Schur inequality tr(A2 ) ≤ tr(AT A), where equality holds if and only if A is a symmetric matrix. 1.10 The roots λ satisfying |A − λI| = 0 are known as the eigenvalues of the matrix A. Show that if λ is a single eigenvalue of A then there is at least one determinant |Ak − λI| = 0, where Ak is the matrix remaining after the kth row and kth column of A have been removed. 1.11 The equation of a straight line can be represented as ax + by = −1. Show that the equation of the line through two points (x1 , y1 ) and (x2 , y2 ) is given by   1 x y     1 x y  = 0. 1 1  1 x y  2

2

1.12 The equation of a plane can be represented as ax + by + cz = −1. Show that the equation of the plane through the three points (xi , yi , zi ), i = 1, 2, 3, is determined by   1 x y z     1 x1 y1 z1     1 x2 y2 z2  = 0.   1 x y z  3 3 3 1.13 Given a vector set {x1 , . . . , xp } with xp = 0, let a1 , a2 , . . . , ap−1 be any constants, and let yi = xi +ai xp , i = 1, . . . , p−1. Show that the vectors in the set {y1 , . . . , yp−1 } are linearly independent, if and only if the vectors x1 , . . . , xp are linearly independent. 1.14 Without expanding the determinants, prove the following results:     a b c   a + b b + c c + a      2 d e f  =  d + e e + f f + d x y z  x + y y + z z + x 

Exercises

and

 0  2 a b

a 0 c

  c b  b + a a+c c  =  b a 0  a

89

 c  b  . c + b

1.15 Suppose that An×n is positive definite and Bn×n is positive semi-definite; show that det(A + B) ≥ det(A). 1.16 Let A12×12 satisfy A5 = 3A. Find all the possible values of |A|. 1.17 Given a block matrix X = [A, B], show that  T   A A A T B . |x|2 = |AAT + BBT | =  T B A BT B  1.18 Let A2 = A. Use two methods to show that rank(I−A) = n−rank(A): (a) use the property that the matrix trace and the matrix rank are equal; (b) consider linearly independent solutions of the matrix equation (I − A)x = 0. 1.19 Given the matrix ⎤ ⎡ 1 1 1 2 A = ⎣ −1 0 2 −3 ⎦ , 2 4 8 5 find the rank of A. (Hint: Transform the matrix A into its echelon form.) 1.20 Let A be an m × n matrix. Show that rank(A) ≤ m and rank(A) ≤ n. 1.21 Consider the system of linear equations x1 + 3x2 − x3 = a1 , x1 + 2x2 = a2 , 3x1 + 7x2 − x3 = a3 . (a) Determine the necessary and sufficient condition for the above system of linear equations to be consistent. (b) Consider three cases: (i) a1 = 2, a2 = 2, a3 = 6; (ii) a1 = 1, a2 = 0, a3 = −2; (iii) a1 = 0, a2 = 1, a3 = 2. Determine whether the above linear system of equations is consistent. If it is, then give the corresponding solution. 1.22 Determine that for what values of α, the system of linear equations (α + 3)x1 + x2 + 2x3 = α, 3(α + 1)x1 + αx2 + (α + 3)x3 = 3, αx1 + (α − 1)x2 + x3 = α, has a unique solution, no solution or infinitely many solutions. In the case

90

Introduction to Matrix Algebra

where the system of linear equations has infinitely many solutions, find a general solution. 1.23 Set a1 = [1, 1, 1, 3]T , a2 = [−1, −3, 5, 1]T , a3 = [3, 2, −1, p + 2]T and a4 = [−2, −6, 10, p]T . (a) For what values of p are the above four vectors linearly independent? Use a linear combination of a1 , . . . , a4 to express a = [4, 1, 6, 10]T . (b) For what values of p are the above four vectors linearly dependent? Find the rank of the matrix [a1 , a2 , a3 , a4 ] in these cases. 1.24 Given



⎤ a a1 = ⎣ 2 ⎦ , 10



⎤ −2 a2 = ⎣ 1 ⎦ , 5



⎤ −1 a3 = ⎣ 1 ⎦ , 4

⎡ ⎤ 1 b = ⎣b⎦ , c

find respectively the values of a, b, c such that the following conditions are satisfied: (a) b can be linearly expressed by a1 , a2 , a3 and this expression is unique; (b) b cannot be linearly expressed by a1 , a2 , a3 ; (c) b can be linearly expressed by a1 , a2 , a3 , but not uniquely. Find a general expression for b in this case. 1.25 Show that A nonsingular





 A B det = det(A) det(D − CA−1 B). C D

1.26 Verify that the vector group ⎧⎡ 1 ⎪ ⎪ ⎨⎢ ⎢ 0 ⎣ 1 ⎪ ⎪ ⎩ 2

⎤⎫ ⎤ ⎡ −1 ⎪ −1 ⎪ ⎥ ⎢ 1 ⎥ ⎢ −2 ⎥⎬ ⎥ ⎥,⎢ ⎥,⎢ ⎦ ⎣ 1 ⎦ ⎣ 1 ⎦⎪ ⎪ ⎭ 0 0 ⎤ ⎡

is a set of orthogonal vectors. 1.27 Suppose that A = A2 is an idempotent matrix. Show that all its eigenvalues are either 1 or 0. 1.28 If A is an idempotent matrix, show that each of AH , IA and I − AH is a idempotent matrix. 1.29 Given the 3 × 5 matrix ⎡ ⎤ 1 3 2 5 7 A = ⎣ 2 1 0 6 1⎦ , 1 1 2 5 4 use the MATLAB functions orth(A) and null(A) to find an orthogonal basis for the column space Span(A) and for the null space Null(A).

Exercises

91

1.30 Let x and y be any two vectors in the Euclidean space Rn . Show the Cauchy– Schwartz inequality |xT y| ≤ x

y . (Hint: Observe that x − cy 2 ≥ 0 holds for all scalars c.) 1.31 Define a transformation H : R2 → R2 as     x x1 + x 2 − 1 , x= 1 . H(x) = 3x1 x2 Determine whether H is a linear transformation. 1.32 Put ⎡ ⎤ ⎡ ⎤ 1 0.5 0.2 0.3 P = ⎣ 0.3 0.8 0.3 ⎦ , x0 = ⎣0⎦ . 0 0.2 0 0.4 Assume that the state vector of a particular system can be described by the Markov chain xk+1 = Pxk , k = 0, 1, . . . Compute the state vectors x1 , . . . , x15 and analyze the change in the system over time. 1.33 Given a matrix function ⎡ ⎤ 2x −1 x 2 ⎢ 4 x 1 −1 ⎥ ⎥, A(x) = ⎢ ⎣ 3 2 x 5 ⎦ 1

−2

3

x

find d3 |A(x)|/dx3 . (Hint: Expand the determinant |A(x)| according to any row or column; only the terms x4 and x3 are involved.) 1.34 Show that if A1 is nonsingular then   A 1 A2  −1   A3 A4  = |A1 ||A4 − A3 A1 A2 |. 1.35 Determine the positive definite status of the following quadratics: (a) f = −2x21 − 8x22 − 6x23 + 2x1 x2 + 2x1 x3 , (b) f = x21 + 4x22 + 9x23 + 15x24 − 2x1 x2 + 4x1 x3 + 2x1 x4 − 6x2 x4 − 12x3 x4 . 1.36 Show that (a) if B is a real-valued nonsingular matrix then A = BBT is positive definite. (b) if |C| = 0 then A = CCH is positive definite. 1.37 Suppose that A and B are n × n real matrices. Show the following Cauchy– Schwartz inequalities for the trace function: (a) (tr(AT B))2 ≤ tr(AT ABT B), where the equality holds if and only if AB is a symmetric matrix; (b) (tr(AT B))2 ≤ tr(AAT BBT ), where the equality holds if and only if AT B is a symmetric matrix. 1.38 Show that tr(ABC) = tr(BCA) = tr(CAB).

92

Introduction to Matrix Algebra

1.39 Assuming that the inverse matrices appearing below exist, show the following results: (a) (b) (c) (d)

(A−1 + I)−1 = A(A + I)−1 . (A−1 + B−1 )−1 = A(A + B)−1 B = B(A + B)−1 A. (I + AB)−1 A = A(I + BA)−1 . A − A(A + B)−1 A = B − B(A + B)−1 B.

1.40 Show that an eigenvalue of the inverse of a matrix A is equal to the reciprocal of the corresponding eigenvalue of the original matrix, i.e., eig(A−1 ) = 1/eig(A). 1.41 Verify that A† = (AH A)† AH and A† = AH (AAH )† meet the four Moore– Penrose conditions. 1.42 Find the Moore–Penrose inverse of the matrix A = [1, 5, 7]T . 1.43 Prove that A(AT A)−2 AT is the Moore–Penrose inverse of the matrix AAT . 1.44 Given the matrix ⎡ ⎤ 1 0 −2 ⎢ 0 1 −1 ⎥ ⎥, X=⎢ ⎣ −1 1 1 ⎦ 2

1.45

1.46

1.47 1.48

−1

2

use respectively the recursive method (Algorithm 1.6) and the trace method (Algorithm 1.7) find the Moore–Penrose inverse matrix X† . Consider the mapping UV = W, where U ∈ Cm×n , V ∈ Cn×p , W ∈ Cm×p and U is a rank-deficient matrix. Show that V = U† W, where U† ∈ Cn×m is the Moore–Penrose inverse of U. Show that if Ax = b is consistent then its general solution is x = A† b + (I − A† A)z, where A† is the Moore–Penrose inverse of A and z is any constant vector. Show that the linear mapping T : V → W has the properties T (0) = 0 and T (−x) = −T (x). Show that, for any m × n matrix A, its vectorization function is given by vec(A) = (In ⊗ A)vec(In ) = (AT ⊗ In )vec(Im ).

1.49 Show that the necessary and sufficient condition for A ⊗ B to be nonsingular is that both the matrices A, B are nonsingular. Furthermore, show that (A ⊗ B)−1 = A−1 ⊗ B−1 . 1.50 Show that (A ⊗ B)† = A† ⊗ B† . 1.51 If A, B, C are m × m matrices and CT = C, show that (vec(C))T (A ⊗ B) vec(C) = (vec(C))T (B ⊗ A) vec(C). 1.52 Show that tr(ABCD) = (vec(DT ))T (CT ⊗ A) vec(B) = (vec(D)T )(A ⊗ CT ) vec(BT ).

Exercises

93

 1.53 Let A, B ∈ Rm×n , show tr(AT B) = (vec(A))T vec(B) = vec(A∗B), where  vec(·) represents the sum of all elements of the column vector functions. 1.54 Let xi and xj be two column vectors of the matrix X, their covariance matrix is Cov xi , xTj = Mij . Hence the variance–covariance matrix of the vectorization function vec(X), denoted Var(vec(X)) is a block matrix with the submatrix Mij , i.e., Var(vec(X)) = {Mij }. For the special cases of Mij = mij V, if mij is the entry of the matrix M, show that: (a) Var(vec(X)) = M ⊗ V; (b) Var(vec(TX)) = M ⊗ TVTT ; (c) Var(vec(XT )) = V ⊗ M. 1.55 Consider two n × n matrices A and B. (a) Let d = [d1 , . . . , dn ]T and D = Diag(d1 , . . . , dn ). Show that dT (A∗B)d = tr(ADBT D). (b) If both A and B are the positive definite matrices, show that the Hadamard product A ∗ B is a positive definite matrix. This property is known as the positive definiteness of the Hadamard product. 1.56 Show that A ⊗ B = vec(BAT ). 1.57 Show that vec(PQ) = (QT ⊗ P) vec(I) = (QT ⊗ I) vec(P) = (I ⊗ P) vec(Q). 1.58 Verify the following result: tr(XA) = (vec(AT ))T vec(X) = (vec(XT ))T vec(A). 1.59 Let A be an m × n matrix, and B be a p × q matrix. Show (A ⊗ B)Knq = Kmp (B ⊗ A),

Kpm (A ⊗ B)Knq = B ⊗ A,

where Kij is the commutation matrix. 1.60 For any m × n matrix A and p × 1 vector b, show the following results. (a) (b) (c) (d)

Kpm (A ⊗ b) = b ⊗ A. Kmp (b ⊗ A) = A ⊗ b. (A ⊗ b)Knp = bT ⊗ A. (bT ⊗ A)Kpn = A ⊗ b.

1.61 Show that the Kronecker product of the m × m identity matrix and the n × n identity matrix yields the mn × mn identity matrix, namely Im ⊗ In = Imn . 1.62 Set x ∈ RI×JK , G ∈ RP ×QR , A ∈ RI×P , B ∈ RJ×Q , C ∈ RK×R and AT A = IP , BT B = IQ , CT C = IR . Show that if X = AG(C ⊗ B)T and X and A, B, C are given then the matrix G can be recovered or reconstructed from G = AT X(C ⊗ B). 1.63 Use vec(UVW) = (WT ⊗ U) vec(V) to find the vectorization representation of X = AG(C ⊗ B)T . 1.64 Apply the Kronecker product to solve for the unknown matrix X in the matrix equation AXB = C.

94

Introduction to Matrix Algebra

1.65 Apply the Kronecker product to solve for the unknown matrix X in the generalized continuous-time Lyapunov equation LX + XN = Y, where the dimensions of all matrices are n × n and the matrices L and N are invertible. 1.66 Show the following results: (A  B) ∗ (C  D) = (A ∗ C)  (B ∗ D), (A  B  C)T (A  B  C) = (AT A) ∗ (BT B) ∗ (CT C),  † (A  B)† = AT A) ∗ (BT B) (A  B)T . 1.67 Try extending the 0 -norm of a vector to the 0 -norm of an m × n matrix. Use this to define whether a matrix is K-sparse.

2 Special Matrices

In real-world applications, we often meet matrices with special structures. Such matrices are collectively known as special matrices. Understanding the internal structure of these special matrices is helpful for using them and for simplifying their representation. This chapter will focus on some of the more common special matrices with applications.

2.1 Hermitian Matrices In linear algebra, the adjoint of a matrix A refers to its conjugate transpose and is commonly denoted by AH . If AH = A, then A is said to be self-adjoint. A self-adjoint matrix A = AH ∈ Cn×n is customarily known as a Hermitian matrix. A matrix A is known as anti-Hermitian if A = −AH . The centro-Hermitian matrix R is an n × n matrix whose entries have the sym∗ . metry rij = rn−j+1,n−i+1 Hermitian matrices have the following properties. 1. A is a Hermitian matrix, if and only if xH Ax is a real number for all complex vectors x. 2. For all A ∈ Cn×n , the matrices A + AH , AAH and AH A are Hermitian. 3. If A is a Hermitian matrix, then Ak are Hermitian matrices for all k = 1, 2, 3, . . . If A is a nonsingular Hermitian matrix then its inverse, A−1 , is Hermitian as well. 4. If A and B are Hermitian matrices then αA + βB is Hermitian for all real numbers α and β. 5. If A and B are Hermitian then AB + BA and j(AB − BA) are also Hermitian. 6. If A and B are anti-Hermitian matrices then αA + βB is anti-Hermitian for all real numbers α and β. 7. For all A ∈ Cn×n , the matrix A − AH is anti-Hermitian. √ 8. If A is a Hermitian matrix then j A (j = −1) is anti-Hermitian. 9. If A is an anti-Hermitian matrix, then j A is Hermitian. 95

96

Special Matrices

Positive definition criterion for a Hermitian matrix An n × n Hermitian matrix A is positive definite, if and only if any of the following conditions is satisfied. • • • •

The quadratic form xH Ax > 0, ∀ x = 0. The eigenvalues of A are all larger than zero. There is an n × n nonsingular matrix R such that A = RH R. There exists an n × n nonsingular matrix P such that PH AP is positive definite.

Let z be a Gaussian random vector with zero mean. Then its correlation matrix Rzz = E{zzH } is Hermitian, and is always positive definite, because the quadratic form xH Rzz x = E{|xH z|2 } > 0. Positive definite and positive semi-definite matrices obey the following inequalities [214, Section 8.7]: (1) Hadamard inequality If an m × m matrix A = [aij ] is positive definite then det(A) 

m (

aii

i=1

and equality holds if and only if A is a diagonal matrix.   A B is position definite, where (2) Fischer inequality If a block matrix P = BH C submatrices A and C are the square nonzero matrices, then det(P)  det(A) det(C). (3) Oppenheim inequality For two m × m positive semi-definite matrices A and B, m ( det(A) bii  det(A  B), i=1

where A  B is the Hadamard product of A and B. (4) Minkowski inequality For two m × m positive definite matrices A and B, one has    n det(A + B) ≥ n det(A) + n det(B).

2.2 Idempotent Matrix In some applications one uses a multiple product of an n×n matrix A, resulting into three special matrices: an idempotent matrix, a unipotent matrix and a tripotent matrix. DEFINITION 2.1

A matrix An×n is idempotent, if A2 = AA = A.

2.2 Idempotent Matrix

97

The eigenvalues of any idempotent matrix except for the identity matrix I take only the values 1 and 0, but a matrix whose eigenvalues take only 1 and 0 is not necessarily an idempotent matrix. For example, the matrix ⎡ ⎤ 11 3 3 1⎣ B= 1 1 1 ⎦ 8 −12 −4 −4 has three eigenvalues 1, 0, 0, but it is not an idempotent matrix, because ⎡ ⎤ 11 3 3 1 B2 = ⎣ 0 0 0 ⎦ = B. 8 −11 −3 −3 An idempotent matrix has the following useful properties [433], [386]: 1. 2. 3. 4. 5. 6. 7. 8. 9.

The eigenvalues of an idempotent matrix take only the values 1 and 0. All idempotent matrices A = I are singular. The rank and trace of any idempotent matrix are equal, i.e., rank(A) = tr(A). If A is an idempotent matrix then AH is also an idempotent matrix, i.e., A H AH = A H . If A is an idempotent n × n matrix then In − A is also an idempotent matrix, and rank(In − A) = n − rank(A). All symmetric idempotent matrices are positive semi-definite. Let the n × n idempotent matrix A have the rank rA ; then A has rA eigenvalues equal to 1 and n − rA eigenvalues equal to zero. A symmetric idempotent matrix A can be expressed as A = LLT , where L satisfies LT L = IrA and rA is the rank of A. All the idempotent matrices A are diagonalizable, i.e.,   IrA O −1 , (2.2.1) U AU = Σ = O O where rA = rank(A) and U is a unitary matrix.

DEFINITION 2.2 I.

A matrix An×n is called unipotent or involutory, if A2 = AA =

If an n×n matrix A is unipotent, then the function f (·) has the following property [386]: 1 f (sI + tA) = [(I + A)f (s + t) + (I − A)f (s − t)]. (2.2.2) 2 There exists a relationship between an idempotent matrix and an unipotent matrix: the matrix A is an unipotent matrix if and only if 12 (A + I) is an idempotent matrix. DEFINITION 2.3

A matrix An×n is called a nilpotent matrix, if A2 = AA = O.

98

Special Matrices

If A is a nilpotent matrix, then the function f (·) has the following property [386]: f (sI + tA) = If (s) + tAf (s),

(2.2.3)

where f (s) is the first-order derivative of f (s). DEFINITION 2.4

A matrix An×n is called tripotent, if A3 = A.

It is easily seen that if A is a tripotent matrix then −A is also a tripotent matrix. It should be noted that a tripotent matrix is not necessarily an idempotent matrix, although an idempotent matrix must be a triponent matrix (because if A2 = A then A3 = A2 A = AA = A). In order to show this point, consider the eigenvalues of a tripotent matrix. Let λ be an eigenvalue of the tripotent matrix A, and u the eigenvector corresponding to λ, i.e., Au = λu. Premultiplying both sides of Au = λu by A, we have A2 u = λAu = λ2 u. Premultiplying A2 u = λ2 u by A, we immediately have A3 u = λ2 Au = λ3 u. Because A3 = A is a tripotent matrix, the above equation can be written as Au = λ3 u, and thus the eigenvalues of a tripotent matrix satisfies the relation λ = λ3 . That is to say, the eigenvalues of a tripotent matrix have three possible values −1, 0, +1; which is different from the eigenvalues of an idempotent matrix, which take only the values 0 and +1. In this sense, an idempotent matrix is a special tripotent matrix with no eigenvalue equal to −1.

2.3 Permutation Matrix The n × n identity matrix I = [e1 , . . . , en ] consists of the basis vectors in an order such that the matrix is diagonal. On changing the order of the basis vectors, we obtain four special matrices: the permutation matrix, the commutation matrix, the exchange matrix and the selection matrix.

2.3.1 Permutation Matrix and Exchange Matrix DEFINITION 2.5 A square matrix is known as a permutation matrix, if each of its rows and columns have one and only one nonzero entry equal to 1. A permutation matrix P has the following properties [60]. (1) PT P = PPT = I, i.e., a permutation matrix is orthogonal. (2) PT = P−1 . (3) PT AP and A have the same diagonal entries but their order may be different.

2.3 Permutation Matrix

EXAMPLE 2.1

Given the 5 × 4 matrix ⎡ a11 a12 ⎢a21 a22 ⎢ A=⎢ ⎢a31 a32 ⎣a 41 a42 a51 a52

For the permutation matrices ⎡ 0 0 ⎢0 1 P4 = ⎢ ⎣1 0 0 0 we have



a51 ⎢a31 ⎢ P5 A = ⎢ ⎢a21 ⎣a 41 a11

a52 a32 a22 a42 a12

0 0 0 1

a53 a33 a23 a43 a13



1 0⎥ ⎥, 0⎦ 0 ⎤ a54 a34 ⎥ ⎥ a24 ⎥ ⎥, a ⎦ 44

a14

a13 a23 a33 a43 a53

99

⎤ a14 a24 ⎥ ⎥ a34 ⎥ ⎥. a ⎦ 44

a54



0 ⎢0 ⎢ P5 = ⎢ ⎢0 ⎣0 1

0 0 1 0 0 ⎡

a13 ⎢a23 ⎢ AP4 = ⎢ ⎢a33 ⎣a 43 a53

0 1 0 0 0

0 0 0 1 0 a12 a22 a32 a42 a52

⎤ 1 0⎥ ⎥ 0⎥ ⎥, 0⎦ 0 a14 a24 a34 a44 a54

⎤ a11 a21 ⎥ ⎥ a31 ⎥ ⎥. a ⎦ 41

a51

That is, premultiplying an m × n matrix A by an m × m permutation matrix is equivalent to rearranging the rows of A, and postmultiplying the matrix A by an n × n permutation matrix is equivalent to rearranging the columns of A. The new sequences of rows or columns depend on the structures of the permutation matrices. A p × q permutation matrix can be a random arrangement of q basis vectors e1 , . . . , eq ∈ Rp×1 . If the basis vectors are arranged according to some rule, the permutation matrix may become one of three special permutation matrices: a commutation matrix, an exchange matrix or a shift matrix. 1. Commutation Matrix As stated in Section 1.11, the commutation matrix Kmn is defined as the special permutation matrix such that Kmn vec Am×n = vec(AT ). The role of this matrix is to commute the entry positions of an mn × 1 vector in such a way that the new vector Kmn vec(A) becomes vec(AT ), so this special permutation matrix is known as the commutation matrix. 2. Exchange Matrix The exchange matrix is usually represented using the notation J, and is defined as ⎡ ⎤ 0 1 ⎢ 1 ⎥ ⎢ ⎥ (2.3.1) J=⎢ ⎥ · ⎣ ⎦ ·· 1 0

100

Special Matrices

which has every entry on the cross-diagonal line equal to 1, whereas all other entries in the matrix are equal to zero. The exchange matrix is called the reflection matrix or backward identity matrix, since it can be regarded as a backward arrangement of the basis vectors, namely J = [en , en−1 , . . . , e1 ]. Premultiplying an m × n matrix A by the exchange matrix Jm , we can invert the order of the rows of A, i.e., we exchange all row positions with respect to its central horizontal axis, obtaining ⎡ ⎤ am1 am2 · · · amn ⎢ .. .. .. .. ⎥ ⎢ . . . ⎥ (2.3.2) Jm A = ⎢ . ⎥. ⎣a a ··· a ⎦ 21

22

a11

a12

···

2n

a1n

Similarly, postmultiplying A by the exchange matrix Jn , we invert the order of the columns of A, i.e., we exchange all column positions with respect to its central vertical axis: ⎡ ⎤ a1n · · · a12 a11 ⎢ a2n · · · a22 a21 ⎥ ⎢ ⎥ AJn = ⎢ . (2.3.3) .. .. .. ⎥ . . ⎣ . . . . ⎦ amn

···

am2

am1

It is easily seen that J2 = JJ = I,

JT = J.

(2.3.4)

Here J2 = I is the involutory property of the exchange matrix, and JT = J expresses its symmetry. That is, an exchange matrix is both an involutory matrix (J2 = I) and a symmetric matrix (JT = J). As stated in Chapter 1, the commutation matrix has the property KTmn = K−1 mn = Knm . Clearly, if m = n then KTnn = K−1 nn = Knn . This implies that K2nn = Knn Knn = Inn ,

(2.3.5)

KTnn

(2.3.6)

= Knn .

That is to say, like the exchange matrix, a square commutation matrix is both an involutory matrix and a symmetric matrix. In particular, when premultiplying an m × n matrix A by an m × m exchange matrix Jm and postmultiplying an n × n exchange matrix Jn , we have ⎡ ⎤ am,n am,n−1 ··· am,1 ⎢am−1,n am−1,n−1 · · · am−1,1 ⎥ ⎢ ⎥ (2.3.7) Jm AJn = ⎢ . .. .. .. ⎥ . ⎣ .. . . . ⎦ a1,n−1 ··· a1,1 a1,n

2.3 Permutation Matrix

101

Similarly, the operation Jc inverts the element order of the column vector c and c J inverts the element order of the row vector cT . T

3. Shift Matrix The n × n shift matrix is defined as ⎡ 0 ⎢0 ⎢ ⎢ P = ⎢ ... ⎢ ⎣0 1

1 0 .. .

0 1 .. .

··· ··· .. .

0 0

0 0

··· ···

⎤ 0 0⎥ ⎥ .. ⎥ . .⎥ ⎥ 1⎦

(2.3.8)

0

In other words, the entries of the shift matrix are pi,i+1 = 1 (1  i  n − 1) and pn1 = 1; all other entries are equal to zero. Evidently, a shift matrix can be represented using basis vectors as P = [en , e1 , . . . , en−1 ]. Given an m × n matrix A, the operation Pm A will make A’s first row move to its lowermost row, namely ⎡ ⎤ a21 a22 · · · a2n ⎢ .. .. .. .. ⎥ ⎢ . . . ⎥ Pm A = ⎢ . ⎥. ⎣a a ··· a ⎦ m1

m2

a11

a12

···

mn

a1n

Similarly, postmultiplying A by an n × n shift matrix Pn , we have ⎡ ⎤ a1n a11 · · · a1,n−1 ⎢ a2n a21 · · · a2,n−1 ⎥ ⎢ ⎥ APn = ⎢ . .. .. .. ⎥ . ⎣ .. . . . ⎦ amn

am1

···

am,n−1

The role of Pn is to move the rightmost column of A to the leftmost position. It is easy to see that, unlike a square exchange matrix Jn or a square commutation matrix Knn , a shift matrix is neither involutory nor symmetric.

2.3.2 Generalized Permutation Matrix 1. Generalized Permutation Matrix Consider the observation-data model x(t) = As(t) =

n 

ai si (t),

(2.3.9)

i=1

where s(t) = [s1 (t), . . . , sn (t)]T represents a source vector and A = [a1 , . . . , an ], with ai ∈ Rm×1 , i = 1, · · · , n, is an m × n constant-coefficient matrix (m ≥ n)

102

Special Matrices

that represents the linear mixing process of n sources and is known as the mixture matrix. The mixture matrix A is assumed to be of full-column rank. The question is how to recover the source vector s(t) = [s1 (t), . . . , sn (t)]T using only the m-dimensional observation-data vector x(t). This is the well-known blind source separation (BSS) problem. Here, the terminology “blind” has two meanings: (a) the sources s1 (t), . . . , sn (t) are unobservable; (b) how the n sources are mixed is unknown (i.e., the mixture matrix A is unknown). The core issue of the BSS problem is to identify the generalized inverse of the mixture matrix, A† = (AT A)−1 AT , because the source vector s(t) is then easily recovered using s(t) = A† x(t). Unfortunately, in mixture-matrix identification there exist two uncertainties or ambiguities. (1) If the ith and jth source signals are exchanged in order, and the ith and the jth columns of the mixture matrix A are exchanged at the same time, then the observation data vector x(t) is unchanged. This shows that the source signal ordering cannot be identified from only the observation-data vector x(t). Such an ambiguity is known as the ordering uncertainty of separated signals. (2) Using only the observation data vector x(t), it is impossible to identify accurately the amplitudes, si (t), i = 1, . . . , n, of the source signals, because x(t) =

n  ai αi si (t), α i=1 i

(2.3.10)

where the αi are unknown scalars. This ambiguity is called the amplitude uncertainty of the separate signals. The two uncertainties above can be described by a generalized permutation matrix. DEFINITION 2.6 An m × m matrix G is known as a generalized permutation matrix (or a g-matrix), if each row and each column has one and only one nonzero entry. It is easy to show that a square matrix is a g-matrix if and only if it can be decomposed into the product G of a permutation matrix P and a nonsingular diagonal matrix D: G = PD.

(2.3.11)

For instance, ⎡

0 0 ⎢0 0 ⎢ G=⎢ ⎢0 γ ⎣0 0 ρ 0

0 β 0 0 0

⎤ ⎡ 0 0 α ⎢0 0 0⎥ ⎥ ⎢ ⎢ 0 0⎥ ⎥ = ⎢0 ⎣0 ⎦ λ 0 1 0 0

0 0 1 0 0

0 1 0 0 0

0 0 0 1 0

⎤⎡ 1 ρ ⎢ 0⎥ ⎥⎢ ⎥ 0⎥ ⎢ ⎢ 0⎦ ⎣ 0 0

0

⎤ ⎥ ⎥ ⎥. ⎥ ⎦

γ β λ α

2.3 Permutation Matrix

103

By the above definition, if premultiplying (or postmultiplying) a matrix A by a g-matrix, then the rows (or columns) of A will be rearranged, and all entries of each row (or column) of A will be multiplied by a scalar factor. For example, ⎡

0 0 ⎢0 0 ⎢ ⎢0 γ ⎢ ⎣0 0 ρ 0

0 β 0 0 0

0 0 0 λ 0

⎤⎡ α a11 ⎢a21 0⎥ ⎥⎢ ⎢ 0⎥ ⎥ ⎢a31 ⎦ 0 ⎣a41 0 a51

a12 a22 a32 a42 a52

a13 a23 a33 a43 a53

⎤ ⎡ αa51 a14 ⎢βa31 a24 ⎥ ⎥ ⎢ ⎢ a34 ⎥ ⎥ = ⎢ γa21 ⎣ λa ⎦ a44 41 a54 ρa11

αa52 βa32 γa22 λa42 ρa12

αa53 βa33 γa23 λa43 ρa13

⎤ αa54 βa34 ⎥ ⎥ γa24 ⎥ ⎥. λa44 ⎦ ρa14

Using the g-matrix, Equation (2.3.10) can be equivalently written as x(t) = AG−1 (Gs)(t) = AG−1˜s(t),

(2.3.12)

where ˜s(t) = Gs(t). Therefore, the task of BSS is now to identify the product AG−1 of the matrix A and the generalized inverse matrix G−1 rather than the generalized inverse A† , which greatly eases the solution of the BSS problem. That is to say, the solution of the BSS is now given by ˜s(t) = (AG−1 )† x(t) = GA† x(t). If we let W = GA† be a de-mixing or separation matrix, then adaptive BSS is used to update adaptively both the de-mixing matrix W(t) and the reconstructed source signal vector ˜s(t) = W(t)x(t). 2. Selection Matrix As the name implies, the selection matrix is a type of matrix that can select certain rows or columns of a matrix. Let xi = [xi (1), . . . , xi (N )]T and, given an m × N matrix ⎡ ⎤ x1 (1) x1 (2) · · · x1 (N ) ⎢ x2 (1) x2 (2) · · · x2 (N ) ⎥ ⎢ ⎥ X=⎢ . .. .. .. ⎥ , ⎣ .. . . . ⎦ xm (1)

xm (2)

···

xm (N )

we have ⎡ ⎢ ⎢ J1 X = ⎢ ⎣ ⎡

x1 (1) x2 (1) .. .

x1 (2) x2 (2) .. .

··· ··· .. .

xm−1 (1)

xm−1 (2)

···

x2 (1) ⎢ x3 (1) ⎢ J2 X = ⎢ . ⎣ .. xm (1)

x2 (2) x3 (2) .. .

··· ··· .. .

xm (2)

···

x1 (N ) x2 (N ) .. .

⎤ ⎥ ⎥ ⎥ ⎦

if

J1 = [Im−1 , 0m−1 ],

xm−1 (N ) ⎤ x2 (N ) x3 (N ) ⎥ ⎥ .. ⎥ if J2 = [0m−1 , Im−1 ]. . ⎦ xm (N )

104

Special Matrices

That is, the matrix J1 X selects the uppermost m − 1 rows of the original matrix X, and the matrix J2 X selects the lowermost m − 1 rows of X. Similarly, we have ⎡ ⎤ x1 (1) x1 (2) · · · x1 (N − 1)   ⎢ x2 (1) x2 (2) · · · x2 (N − 1) ⎥ IN −1 ⎢ ⎥ if J , = XJ1 = ⎢ . ⎥ .. .. .. 1 0N −1 ⎣ .. ⎦ . . . xm (1)

xm (2)

···

x1 (2) ⎢ x2 (2) ⎢ XJ2 = ⎢ . ⎣ .. xm (2)

x1 (3) x2 (3) .. .

··· ··· .. .

xm (3)

···



xm (N − 1) ⎤ x1 (N ) x2 (N ) ⎥ ⎥ .. ⎥ if . ⎦ xm (N )

 0N −1 J2 = . IN −1 

In other words, the matrix XJ1 selects the leftmost N − 1 columns of the matrix X, and the matrix XJ2 selects the rightmost N − 1 columns of X.

2.4 Orthogonal Matrix and Unitary Matrix The vectors x1 , . . . , xk ∈ Cn constitute an orthogonal set if xH i xj = 0, 1 ≤ i < j ≤ k. Moreover, if the vectors are normalized, i.e., x 22 = xH x = 1, i = 1, . . . , k, then i i the orthogonal set is known as an orthonormal set. THEOREM 2.1

A set of orthogonal nonzero vectors is linearly independent.

Proof Suppose that {x1 , . . . , xk } is an orthogonal set, and 0 = α1 x1 + · · · + αk xk . Then we have 0 = 0H 0 =

k  k  i=1 j=1

αi∗ αj xH i xj =

k 

|αi |2 xH i xi .

i=1

k

2 H Since the vectors are orthogonal, and xH i xi > 0, i=1 |αi | xi xi = 0 implies that 2 all |αi | = 0 and so all αi = 0, and thus x1 , . . . , xk are linearly independent.

DEFINITION 2.7

A real square matrix Q ∈ Rn×n is orthogonal, if QQT = QT Q = I.

(2.4.1)

A complex square matrix U ∈ Cn×n is unitary matrix if UUH = UH U = I.

(2.4.2)

DEFINITION 2.8 A real m×n matrix Qm×n is called semi-orthogonal if it satisfies only QQT = Im or QT Q = In . Similarly, a complex matrix Um×n is known as para-unitary if it satisfies only UUH = Im or UH U = In . THEOREM 2.2 [214]

If U ∈ Cn×n then the following statements are equivalent:

2.4 Orthogonal Matrix and Unitary Matrix

• • • • •

105

U is a unitary matrix; U is nonsingular and UH = U−1 ; UUH = UH U = I; UH is a unitary matrix; the columns of U = [u1 , . . . , un ] constitute an orthonormal set, namely + 1, i = j, H ui uj = δ(i − j) = 0, i = j;

• the rows of U constitute an orthonormal set; • for all x ∈ Cn , the Euclidean lengths of y = Ux and x are the same, namely

y 2 = x 2 . If a linear transformation matrix A is unitary, then the linear transformation Ax is known as a unitary transformation. A unitary transformation has the following properties. (1) The vector inner product is invariant under unitary transformation: x, y = Ax, Ay, since Ax, Ay = (Ax)H Ay = xH AH Ay = xH y = x, y. (2) The vector norm is invariant under unitary transformation, i.e., Ax 2 = x 2 , since Ax 2 = Ax, Ax = x, x = x 2 . (3) The angle between the vectors Ax and Ay is also invariant under unitary transformation, namely cos θ =

x, y Ax, Ay = .

Ax

Ay

x

y

(2.4.3)

This is the result of combining the former two properties. The determinant of a unitary matrix is equal to 1, i.e., | det(A)| = 1,

if A is unitary.

(2.4.4)

DEFINITION 2.9 The matrix B ∈ Cn×n such that B = UH AU is said to be unitarily equivalent to A ∈ Cn×n . If U is a real orthogonal matrix then we say that B is orthogonally equivalent to A. DEFINITION 2.10 AAH .

The matrix A ∈ Cn×n is known as a normal matrix if AH A =

The following list summarizes the properties of a unitary matrix [307]. 1. Am×m is a unitary matrix ⇔ the columns of A are orthonormal vectors. 2. Am×m is a unitary matrix ⇔ the rows of A are orthonormal vectors. 3. Am×m is real that A is a unitary matrix ⇔ A is orthogonal.

106

Special Matrices

4. Am×m is a unitary matrix ⇔ AAH = AH A = Im ⇔ AT is a unitary matrix ⇔ AH is a unitary matrix ⇔ A∗ is a unitary matrix ⇔ A−1 is a unitary matrix ⇔ Ai is a unitary matrix, i = 1, 2, . . . 5. Am×m , Bm×m are unitary matrices ⇒ AB is a unitary matrix. 6. If Am×m is a unitary matrix then • • • • • • •

| det(A)| = 1. rank(A) = m. A is normal, i.e., AAH = AH A. λ is an eigenvalue of A ⇒ |λ| = 1. For a matrix Bm×n , AB F = B F . For a matrix Bn×m , BA F = B F . For a vector xm×1 , Ax 2 = x 2 .

7. If Am×m and Bn×n are two unitary matrices then • A ⊕ B is a unitary matrix. • A ⊗ B is a unitary matrix. DEFINITION 2.11 An N × N diagonal matrix whose diagonal entries take the values either +1 or −1 is known as a signature matrix. DEFINITION 2.12 Let J be an N ×N signature matrix. Then the N ×N matrix Q such that QJQT = J is called the J-orthogonal matrix or the hypernormal matrix. From the above definition we know that when a signature matrix equals the identity matrix, i.e., J = I, the J-orthogonal matrix Q reduces to an orthogonal matrix. Clearly, a J-orthogonal matrix Q is nonsingular, and the absolute value of its determinant is equal to 1. Any N × N J-orthogonal matrix Q can be equivalently defined by QT JQ = J.

(2.4.5)

Substituting the definition formula QJQT = J into the right-hand side of Equation (2.4.5), it immediately follows that QT JQ = QJQT .

(2.4.6)

This symmetry is known as the hyperbolic symmetry of the signature matrix J. The matrix vvT Q=J−2 T (2.4.7) v Jv is called the hyperbolic Householder matrix [71]. In particular, if J = I then the hyperbolic Householder matrix reduces to the Householder matrix.

2.5 Band Matrix and Triangular Matrix

107

2.5 Band Matrix and Triangular Matrix A triangular matrix is one of the standard forms of matrix decomposition; the triangular matrix itself is a special example of the band matrix.

2.5.1 Band Matrix A matrix A ∈ Cm×n such that aij = 0, |i − j| > k, is known as a band matrix. In particular, if aij = 0, ∀ i > j + p, then A is said to have under bandwidth p, and if aij = 0, ∀ j > i + q, then A is said to have upper bandwidth q. The following is an example of a 7 × 5 band matrix with under bandwidth 1 and upper bandwidth 2: ⎤ ⎡ × × × 0 0 ⎢× × × × 0 ⎥ ⎥ ⎢ ⎢ 0 × × × ×⎥ ⎥ ⎢ ⎥ ⎢ ⎢ 0 0 × × ×⎥ , ⎥ ⎢ ⎢ 0 0 0 × ×⎥ ⎥ ⎢ ⎣ 0 0 0 0 ×⎦ 0 0 0 0 0 where × denotes any nonzero entry. A special form of band matrix is of particular interest. This is the tridiagonal matrix. A matrix A ∈ Cn×n is tridiagonal, if its entries aij = 0 when |i − j| > 1. Clearly, the tridiagonal matrix is a square band matrix with upper bandwidth 1 and under bandwidth 1. However, the tridiagonal matrix is a special example of the Hessenberg matrix, which is defined below. An n × n matrix A is known as an upper Hessenberg matrix if it has the form ⎤ ⎡ a11 a12 a13 · · · a1n ⎥ ⎢a ⎢ 21 a22 a23 · · · a2n ⎥ ⎥ ⎢ ⎢ 0 a32 a33 · · · a3n ⎥ ⎥ ⎢ A=⎢ 0 0 a43 · · · a4n ⎥ . ⎥ ⎢ .. .. .. ⎥ ⎢ .. .. . ⎣ . . . . ⎦ 0 0 0 · · · ann A matrix A is known as a lower Hessenberg matrix if AT is a upper Hessenberg matrix.

2.5.2 Triangular Matrix Two common special band matrices are the upper triangular matrix and the lower triangular matrix. The triangular matrix is a canonical form in matrix decomposition.

108

Special Matrices

A square matrix U = [uij ] with uij = 0, i > j, is called an upper triangular matrix, and its general form is ⎤ ⎡ u11 u12 · · · u1n ⎢ 0 u22 · · · u2n ⎥ ⎥ ⎢ U=⎢ . .. .. ⎥ ⇒ |U| = u11 u22 · · · unn . .. ⎣ .. . . . ⎦ 0

···

0

unn

A square matrix L = [lij ] such that lij = 0, i < j, is known as the lower triangular matrix, and its general form is ⎡ ⎤ l11 0 ⎢ l12 l22 ⎥ ⎢ ⎥ L=⎢ . ⎥ ⇒ |L| = l11 l22 · · · lnn . .. .. ⎣ .. ⎦ . . ln1

ln2

···

lnn

In fact, a triangular matrix is not only an upper Hessenberg matrix but also a lower Hessenberg matrix. Summarizing the definitions on triangular matrixes, a square matrix A = [aij ] is said to be (1) (2) (3) (4) (5) (6)

lower triangular if aij = 0 (i < j); strict lower triangular if aij = 0 (i ≤ j); unit triangular if aij = 0 (i < j), aii = 1 (∀ i); upper triangular if aij = 0 (i > j); strict upper triangular if aij = 0 (i ≥ j); unit upper triangular if aij = 0 (i > j), aii = 1 (∀ i). 1. Properties of Upper Triangular Matrices

1. The product of upper triangular matrices is also an upper triangular matrix; i.e., if U1 , . . . , Uk are upper triangular matrices then U = U1 · · · Uk is also an upper triangular matrix. 2. The determinant of an upper triangular matrix U = [uij ] is equal to the product of its diagonal entries: det(U) = u11 · · · unn =

n (

uii .

i=1

3. The inverse of an upper triangular matrix is also an upper triangular matrix. 4. The kth power Uk of an upper triangular matrix Un×n is also an upper triangular matrix, and its ith diagonal entry is equal to ukii . 5. The eigenvalues of an upper triangular matrix Un×n are u11 , . . . , unn . 6. A positive definite Hermitian matrix A can be decomposed into A = TH DT, where T is a unit upper triangular complex matrix, and D is a real diagonal matrix.

2.6 Summing Vector and Centering Matrix

109

2. Properties of Lower Triangular Matrices 1. A product of lower triangular matrices is also a lower triangular matrix, i.e., if L1 , . . . , Lk are lower triangular matrices then L = L1 · · · Lk is also a lower triangular matrix. 2. The determinant of a lower triangular matrix is equal to the product of its diagonal entries, i.e., det(L) = l11 · · · lnn =

n (

lii .

i=1

3. The inverse of a lower triangular matrix is also a lower triangular matrix. 4. The kth power Lk of a lower triangular matrix Ln×n is also a lower triangular k matrix, and its ith diagonal entry is lii . 5. The eigenvalues of a lower triangular matrix Ln×n are l11 , . . . , lnn . 6. A positive definite matrix An×n can be decomposed into the product of a lower triangular matrix Ln×n and its transpose, i.e., A = LLT . This decomposition is called the Cholesky decomposition of A. The lower triangular matrix L such that A = LLT is sometimes known as the square root of A. More generally, any matrix B satisfying B2 = A

(2.5.1)

is known as the square root matrix of A, denoted A1/2 . It is should be noted that the square root matrix of a square matrix A is not necessarily unique. 3. Block Triangular Matrices If, for a block triangular matrix, the matrix blocks on its diagonal line or crossdiagonal line are invertible, then its inverse is given respectively by 

A O B C

−1

−1 A B C O  −1 A B O C



 A−1 O , = −C−1 BA−1 C−1   O C−1 , = B−1 −B−1 AC−1   −1 −A−1 BC−1 A . = O C−1 

(2.5.2) (2.5.3) (2.5.4)

2.6 Summing Vector and Centering Matrix A special vector and a special matrix used frequently in statistics and data processing are the summing vector and centering matrix.

110

Special Matrices

2.6.1 Summing Vector A vector with all entries 1 is known as a summing vector and is denoted 1 = [1, . . . , 1]T . It is called a summing vector because the sum of n scalars can be expressed as the inner product of a summing vector and another vector: given an m-vector x = [x1 , . . . , xm ]T , the sum of its components can be expressed as m m T T i=1 xi = 1 x or i=1 xi = x 1. EXAMPLE 2.2 expressed as

If we let x = [a, b, −c, d]T then the sum a + b − c + d can be ⎡

⎤ a ⎢b⎥ T ⎥ a + b − c + d = [1 1 1 1] ⎢ ⎣−c⎦ = 1 x = 1, x. d

In some calculations we may meet summing vectors with different dimensions. In this case, we usually indicate by a subscript the dimension of each summing vector as a subscript in order to avoid confusion. For example, 13 = [1, 1, 1]T . Consider this product of the summing vector and a matrix: ⎡ ⎤ 4 −1 1T3 X3×2 = [1 1 1] ⎣ −4 3 ⎦ = [1 1] = 1T2 . 1 −1 The inner product of a summing vector and itself is an integer equal to its dimension, namely 1, 1 = 1Tn 1n = n.

(2.6.1)

The outer product of two summing vectors is a matrix all of whose entries are equal to 1; for example,     1 1 1 1 T = J2×3 . [1 1 1] = 12 13 = 1 1 1 1 More generally, 1p 1Tq = Jp×q

(a matrix with all entries equal to 1).

(2.6.2)

It is easy to verify that Jm×p Jp×n = pJm×n ,

Jp×q 1q = q1p ,

1Tp Jp×q = p1Tq .

(2.6.3)

In particular, for an n × n matrix Jn , we have Jn = 1n 1Tn ,

J2n = nJn .

(2.6.4)

Hence, selecting ¯ = 1J , J n n n

(2.6.5)

2.6 Summing Vector and Centering Matrix

111

¯2 = J ¯ . That is, J ¯ is idempotent. we have J n n n

2.6.2 Centering Matrix The matrix ¯ = I − 1J C n = In − J n n n n

(2.6.6)

is known as a centering matrix. It is easily verified that a centering matrix is not only symmetric but also idempotent, namely Cn = CTn = C2n . In addition, a centering matrix has the following properties: 0 Cn 1 = 0, Cn Jn = Jn Cn = 0.

(2.6.7)

(2.6.8)

The summing vector 1 and the centering matrix C are very useful in mathematical statistics [433, p. 67]. First, the mean of a set of data x1 , . . . , xn can be expressed using a summing vector as follows: 1 1 1 1 x = (x1 + · · · + xn ) = xT 1 = 1T x, n i=1 i n n n n

x ¯=

(2.6.9)

where x = [x1 , . . . , xn ]T is the data vector. Next, using the definition formula (2.6.6) of the centering matrix and its property shown in (2.6.8), it follows that ¯ = x − 1 11T x = x − x Cx = x − Jx ¯1 n ¯ , . . . , xn − x ¯ ]T . = [x1 − x

(2.6.10)

In other words, the role of the linear transformation matrix C is to subtract the mean of n data from the data vector x. This is the mathematical meaning of the centering matrix. Moreover, for the inner product of the vector Cx, we have ¯ , . . . , xn − x ¯][x1 − x ¯ , . . . , xn − x ¯ ]T Cx, Cx = (Cx)T Cx = [x1 − x n  = (xi − x ¯ )2 . i=1

From Equation (2.6.8) it is known that CT C = CC = C, and thus the above

112

Special Matrices

equation can be simplified to xT Cx =

n 

(xi − x ¯ )2 .

(2.6.11)

i=1

The right-hand side of (2.6.11) is the well-known covariance of the data x1 , . . . , xn . That is to say, the covariance of a set of data can be reexpressed in the quadratic form of the centering matrix as the kernel xT Cx.

2.7 Vandermonde Matrix and Fourier Matrix Consider two special matrices whose entries in each row (or column) constitute a geometric series. These two special matrices are the Vandermonde matrix and the Fourier matrix, and they have wide applications in engineering. In fact, the Fourier matrix is a special example of the Vandermonde matrix.

2.7.1 Vandermonde Matrix The n × n Vandermonde matrix is a matrix taking the special form ⎡

1

x1

⎢ ⎢ 1 x2 A=⎢ ⎢ .. .. ⎣. . 1 xn or



1 ⎢ x1 ⎢ ⎢ 2 A=⎢ ⎢ x1 ⎢ .. ⎣ . xn−1 1



x21

···

x22 .. .

··· .. .

⎥ ⎥ xn−1 2 ⎥ .. ⎥ . ⎦

x2n

···

xn−1 n

1 x2

··· ···

x22 .. .

··· .. .

xn−1 2

···

xn−1 1

⎤ 1 xn ⎥ ⎥ ⎥ x2n ⎥ ⎥. .. ⎥ . ⎦ n−1 xn

(2.7.1)

(2.7.2)

That is to say, the entries of each row (or column) constitute a geometric series. The Vandermonde matrix has a prominent property: if the n parameters x1 , . . . , xn are different then the Vandermonde matrix is nonsingular, because its determinant is given by [32, p. 193] det(A) =

n ( i,j=1, i>j

(xi − xj ).

(2.7.3)

2.7 Vandermonde Matrix and Fourier Matrix

113

Given an n × n complex Vandermonde matrix ⎡

1 a1 .. .

1 a2 .. .

··· ··· .. .

1 an .. .

an−1 1

an−1 2

···

an−1 n

⎢ ⎢ A=⎢ ⎣

⎤ ⎥ ⎥ ⎥, ⎦

ak ∈ C,

(2.7.4)

its inverse is given by [339] ⎡

⎤T

⎢ σ (a , . . . , aj−1 , aj+1 , . . . , an ) ⎥ i+j n−j 1 ⎥ A−1 = ⎢ j−1 n ⎣(−1) ⎦   (aj − ak ) (ak − aj ) k=1

k=j+1

,

(2.7.5)

i=1,...,n j=1,...,n

where σ0 (a1 , . . . , aj−1 , aj+1 , . . . , an ) = 1, σ1 (a1 , . . . , aj−1 , aj+1 , . . . , an ) = σi (a1 , . . . , aj−1 , aj+1 , . . . , an ) =

n  k=1,k=j n 

ak , ak · · · ak+i−1 , i = 2, . . . , n,

k=1,k=j

with ai = 0, i > n. EXAMPLE 2.3 In the extended Prony method, the signal model is assumed to be the superposition of p exponential functions, namely x ˆn =

p 

bi zin ,

n = 0, 1, . . . , N − 1

(2.7.6)

i=1

is used as the mathematical model for fitting the data x0 , x1 , . . . , xN −1 . In general, bi and zi are assumed to be complex numbers, and bi = Ai exp(j θi ),

zi = exp[(αi + j 2πfi )Δt],

where Ai is the amplitude of the ith exponential function, θi is its phase (in radians), αi its damping factor, fi its oscillation frequency (in Hz), and Δt denotes the sampling time interval (in seconds). The matrix form of Equation (2.7.6) is ˆ Φb = x ˆ = [ˆ in which b = [b0 , b1 , . . . , bp ]T , x x0 , x ˆ1 , . . . , x ˆN −1 ]T , and Φ is a complex Vander-

114

Special Matrices

monde matrix given by ⎡

1 ⎢ z1 ⎢ ⎢ 2 Φ=⎢ ⎢ z1 ⎢ .. ⎣ . z1N −1

1 z2

1 z3

··· ···

z22 .. .

z32 .. .

··· .. .

z2N −1

z3N −1

···

⎤ 1 zp ⎥ ⎥ ⎥ zp2 ⎥ ⎥. .. ⎥ . ⎦

(2.7.7)

zpN −1

N −1 By minimizing the square error = ˆn |2 we get a least squares n=1 |xn − x ˆ , as follows: solution of the matrix equation Φb = x b = [ΦH Φ]−1 ΦH x.

(2.7.8)

It is easily shown that the computation of ΦH Φ in Equation (2.7.8) can be greatly simplified so that, without doing the multiplication operation on the Vandermonde matrix, we can get directly ⎡ ⎤ γ11 γ12 · · · γ1p ⎢γ21 γ22 · · · γ2p ⎥ ⎢ ⎥ . .. .. .. ⎥ , (2.7.9) ΦH Φ = ⎢ ⎢ .. . . . ⎥ ⎣ ⎦ γp1

γp2

···

γpp

where γij =

(zi∗ zj )N − 1 . (zi∗ zj ) − 1

(2.7.10)

The N × p matrix Φ shown in Equation (2.7.7) is one of two Vandermonde matrices widely applied in signal processing; another common complex Vandermonde matrix is given by ⎡ ⎤ 1 1 ··· 1 ⎢ eλ1 eλ 2 ··· eλ d ⎥ ⎢ ⎥ ⎢ ⎥. .. .. .. .. (2.7.11) Φ=⎢ ⎥ . . . . ⎣ ⎦ eλ1 (N −1) eλ2 (N −1) · · · eλd (N −1)

2.7.2 Fourier Matrix The Fourier transform of the discrete-time signals x0 , x1 , . . . , xN −1 is known as the discrete Fourier transform (DFT) or spectrum of these signals, and is defined as Xk =

N −1  n=0

xn e−j 2πnk/N =

N −1  n=0

xn wnk ,

k = 0, 1, . . . , N − 1.

(2.7.12)

2.7 Vandermonde Matrix and Fourier Matrix

This equation can be expressed in ⎤ ⎡ ⎡ 1 1 X0 ⎢ X1 ⎥ ⎢ 1 w ⎥ ⎢ ⎢ ⎢ . ⎥ = ⎢. .. ⎣ .. ⎦ ⎣ .. . XN −1 1 wN −1

115

matrix form as ⎤⎡

··· ··· .. .

wN −1 .. .

···

w(N −1)(N −1)

1

⎥⎢ ⎥⎢ ⎥⎢ ⎦⎣

x0 x1 .. .

⎤ ⎥ ⎥ ⎥ ⎦

(2.7.13)

xN −1

or simply as ˆ = Fx, x

(2.7.14)

ˆ = [X0 , X1 , . . . , XN −1 ]T are respectively the where x = [x0 , x1 , . . . , xN −1 ]T and x discrete-time signal vector and the spectrum vector, whereas ⎤ ⎡ 1 1 ··· 1 ⎥ ⎢1 w ··· wN −1 ⎥ ⎢ ⎥ , w = e−j 2π/N ⎢ .. .. .. (2.7.15) F = ⎢ .. ⎥ . . . ⎦ ⎣. 1 wN −1 · · · w(N −1)(N −1) is known as the Fourier matrix with (i, k)th entry F (i, k) = w(i−1)(k−1) . Clearly, each row and each column of the Fourier matrix constitute a respective geometric series, and thus the Fourier matrix can be regarded as an N × N special Vandermonde matrix. From the definition it can be seen that the Fourier matrix is symmetric, i.e., FT = F. Equation (2.7.14) shows that the DFT of a discrete-time signal vector can be expressed by the matrix F, which is the reason why F is termed the Fourier matrix. Moreover, from the definition of the Fourier matrix it is easy to verify that FH F = FFH = N I.

(2.7.16)

Since the Fourier matrix is a special Vandermonde matrix that is nonsingular, it follows from FH F = N I that the inverse of the Fourier matrix is given by F−1 =

1 H 1 F = F∗ . N N

(2.7.17)

Hence, from Equation (2.7.2) it can be seen immediately that ˆ= x = F−1 x which can ⎡ x0 ⎢ x1 ⎢ ⎢ . ⎣ ..

be written as ⎡ ⎤ 1 ⎢1 ⎥ 1 ⎢ ⎥ ⎢. ⎥= . ⎦ N⎢ ⎣.

xN −1

1 w∗ .. .

1 (wN −1 )∗

··· ··· .. . ···

1 ∗ ˆ, F x N ⎤⎡

1 (w

(2.7.18)

N −1 ∗

)

.. . (w(N −1)(N −1) )∗

⎤ X0 ⎥⎢ ⎥ ⎢ X1 ⎥ ⎥⎢ . ⎥ , ⎥⎣ . ⎥ . ⎦ ⎦ XN −1

(2.7.19)

116

Special Matrices

thus xn =

N −1 1  Xk e j2πnk/N , N

n = 0, 1, . . . , N − 1.

(2.7.20)

k=0

This is the formula for the inverse discrete Fourier transform (IDFT). From the definition it is easy to see that the n × n Fourier matrix F has the following properties [307]. (1) The Fourier matrix is symmetric, i.e., FT = F. (2) The inverse of the Fourier matrix is given by F−1 = N −1 F∗ . (3) F2 = P = [e1 , eN , eN −1 , . . . , e2 ] (the permutation matrix), where ek is the basis vector whose kth entry equals 1 and whose other entries are zero. (4) F4 = I.

2.7.3 Index Vectors Given an N × 1 vector x = [x0 , x1 , . . . , xN −1 ]T , we will consider the subscript representation of its elements in binary code. Define the index vector [390] ⎡ ⎤ 0 ⎢ 1 ⎥ ⎢ ⎥ (2.7.21) iN = ⎢ ⎥ , N = 2n , .. ⎣ ⎦ . N − 1 in which i denotes the binary representation of the integer i, with i = 0, 1, . . . , N − 1. EXAMPLE 2.4

For N = 22 = 4 and N = 23 = 8, we have respectively ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 0 00 0 ⎢1⎥ ⎢01⎥ ⎢1⎥ ⎢ ⎥ ⇔ i4 = ⎢ ⎥ = ⎢ ⎥ , ⎣2⎦ ⎣10⎦ ⎣2⎦ 3 11 3 ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ 000 0 0 ⎢1⎥ ⎢1⎥ ⎢001⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢2⎥ ⎢2⎥ ⎢010⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢3⎥ ⎢3⎥ ⎢011⎥ ⎥. ⎢ ⎥ ⇔ i8 = ⎢ ⎥ = ⎢ ⎢4⎥ ⎢4⎥ ⎢100⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢5⎥ ⎢5⎥ ⎢101⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎣6⎦ ⎣6⎦ ⎣110⎦ 111 7 7

2.7 Vandermonde Matrix and Fourier Matrix

117

Importantly, an index vector can be repressed as the Kronecker product of binary codes 0 and 1. Let a, b, c, d be binary codes; then the binary Kronecker product of binary codes is defined as ⎡ ⎤ ac     ⎢ad⎥ a c ⊗ =⎢ ⎥ , b 2 d 2 ⎣ bc ⎦ bd 2

(2.7.22)

where xy denotes the arrangement in order of the binary code x and y rather than their product. EXAMPLE 2.5 The binary Kronecker product representations of the index vectors i4 and i8 are as follows: ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 01 00 00 0     ⎢01 10 ⎥ ⎢01⎥ ⎢1⎥ 00 01 ⎥ ⎢ ⎥ ⎢ ⎥ ⊗ =⎢ i4 = ⎣1 0 ⎦ = ⎣10⎦ = ⎣2⎦ 11 10 1 0 11 3 11 10 and ⎤ ⎡ ⎤ ⎤ ⎡ 0 000 02 01 00 ⎢0 0 1 ⎥ ⎢001⎥ ⎢1⎥ 2 1 0 ⎥ ⎢ ⎥ ⎥ ⎢ ⎡ ⎤ ⎢ ⎢0 1 0 ⎥ ⎢010⎥ ⎢2⎥ 01 00 ⎥ ⎢ ⎥ ⎢ 2 1 0⎥ ⎢         ⎥ ⎢ ⎥ ⎥ ⎢ ⎢01 10 ⎥ ⎢ 01 00 02 02 ⎢02 11 10 ⎥ ⎢011⎥ ⎢3⎥ ⎢ ⎥ = ⊗ ⊗ = ⊗⎣ i8 = = ⎥ = ⎢ ⎥. ⎢ ⎢ ⎥ 12 11 10 12 11 00 ⎦ ⎢12 01 00 ⎥ ⎢100⎥ ⎢4⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢12 01 10 ⎥ ⎢101⎥ ⎢5⎥ 11 10 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎣12 11 00 ⎦ ⎣110⎦ ⎣6⎦ 7 111 12 11 10 ⎡

More generally, the binary Kronecker product representation of the index vector iN is given by        0N −2 0 0 0N −1 ⊗ ⊗ ··· ⊗ 1 ⊗ 0 = 1N −1 1N −2 11 10 ⎤ ⎡ ⎤ ⎡ 0 0N −1 · · · 01 00 ⎢0N −1 · · · 01 10 ⎥ ⎢ 1 ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ .. .. = =⎢ ⎥. ⎢ ⎥ . . ⎥ ⎢ ⎥ ⎢ ⎣ ⎣1 ⎦ N − 2⎦ ···1 0 

iN

N −1

1 0

1N −1 · · · 11 10

(2.7.23)

N − 1

On the other hand, the bit-reversed index vector of iN is denoted iN,rev and is

118

Special Matrices

defined as follows:

⎡ iN,rev

0rev 1rev .. .



⎢ ⎥ ⎢ ⎥ ⎢ ⎥ =⎢ ⎥, ⎢ ⎥ ⎣N − 2 ⎦ rev N − 1rev

(2.7.24)

where irev represents the bit-reversed result of the binary code of the integer i, i = 0, 1, . . . , N − 1. For instance, if N = 8, then the reversed results of the binary code 001 and 011 are respectively 100 and 110. Interestingly, on reversing the order of the binary Kronecker product of the index vector iN , we obtain directly the bit-reversed index vector       01 0N −1 00 iN,rev = ⊗ ⊗ ··· ⊗ . (2.7.25) 10 11 1N −1 EXAMPLE 2.6 by

The bit-reversed index vectors of i4 and i8 are respectively given

i4,rev

and

i8,rev

⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 00 01 00 0     ⎢ ⎥ ⎢ ⎥ ⎢ 0 0 00 11 ⎥ ⎢10⎥ ⎢2⎥ ⎥ = 0 ⊗ 1 =⎢ = = ⎣ 10 11 10 01 ⎦ ⎣01⎦ ⎣1⎦ 11 3 10 11

⎤ ⎡ ⎤ ⎤ ⎡ 0 000 00 01 02 ⎢00 01 12 ⎥ ⎢100⎥ ⎢4⎥ ⎥ ⎢ ⎥ ⎥ ⎢ ⎢ ⎡ ⎤ ⎢0 1 0 ⎥ ⎢010⎥ ⎢2⎥ 00 01 0 1 2⎥ ⎥ ⎢ ⎥ ⎢ ⎢   ⎢       ⎥ ⎢ ⎥ ⎥ ⎢ ⎢00 11 ⎥ 110 0 01 02 00 0 1 1 ⎥ ⎢6⎥ ⎢ ⎥ ⎢ 0 1 2 2 ⎥⊗ = ⊗ ⊗ =⎢ = = ⎥ = ⎢ ⎥. ⎢ ⎥ ⎢ ⎣1 0 ⎦ ⎢10 01 02 ⎥ ⎢001⎥ ⎢1⎥ 10 11 12 12 0 1 ⎥ ⎢ ⎥ ⎥ ⎢ ⎢ ⎢10 01 12 ⎥ ⎢101⎥ ⎢5⎥ 10 11 ⎥ ⎢ ⎥ ⎥ ⎢ ⎢ ⎣10 11 02 ⎦ ⎣011⎦ ⎣3⎦ 7 111 10 11 12 ⎡

Notice that the intermediate result of a binary Kronecker product should be written in conventional binary code sequence; for example, 00 01 12 should be written as 100.

2.7.4 FFT Algorithm ˆ = Fx shown in Equation (2.7.14). We now consider a fast algorithm for the DFT x To this end, perform Type-I elementary row operation on the augmented matrix ˆ ]. Clearly, during the elementary row operations, the subscript indices of the [F, x

2.7 Vandermonde Matrix and Fourier Matrix

119

row vectors of the Fourier matrix F and the subscript indices of the elements of the ˆ are the same. Hence, we have output vector x ˆ rev = FN,rev x. x

(2.7.26)

Therefore, the key step in deriving the N -point FFT algorithm is to construct the N × N bit-reversed Fourier matrix FN,rev . Letting ⎧ ⎫ 1 1 ⎪ ⎪ ⎪ ⎪ ⎪   ⎨ 1 −1 ⎪ ⎬ 1 1 , (2.7.27) {A}2 =   , B= ⎪ 1 −1 1 −j ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ⎭ 1 j the N × N bit-reversed Fourier matrix can be recursively constructed by FN,rev = {A}N/2 ⊗ {A}N/4 ⊗ · · · ⊗ {A}2 ⊗ B, where + {A}N/2 =

{A}N/4 {R}N/4

0 ,

{R}N/4

⎧ ⎫ ⎪ ⎨ R1 ⎪ ⎬ .. = . ⎪ ⎪ ⎩ ⎭ RN/4

(2.7.28)

(2.7.29)

together with  Rk = EXAMPLE 2.7

 1 e−j(2k−1)π/N . 1 −e−j(2k−1)π/N

k = 1, . . . ,

N . 2

(2.7.30)

When N = 4, we have the 4 × 4 bit-reversed Fourier matrix ⎡

⎤ ⎡ 1 1 ⊗ [1, 1] ⎥ ⎢ 1 −1 ⎥ ⎢  ⎥=⎣ 1 −j ⎦ −j ⊗ [1, −1] 1 j j

1 1 1 −1

⎢ ⎢ F4,rev = {A}2 ⊗ B = ⎢ ⎣ 1 1



⎤ 1 1 1 −1 ⎥ ⎥ . (2.7.31) −1 j ⎦ −1 −j

From Equation (2.7.26), we have ⎡

⎤ ⎡ X0 ⎢ X2 ⎥ ⎢ ⎢ ⎥=⎢ ⎣X ⎦ ⎣ 1 X3

1 1 1 −1 1 −j 1 j

This is just the four-point FFT algorithm.

⎤⎡ ⎤ 1 1 x0 ⎢x 1 ⎥ 1 −1 ⎥ ⎥⎢ ⎥. −1 j ⎦ ⎣x 2 ⎦ −1 −j x3

(2.7.32)

120

Special Matrices

EXAMPLE 2.8

When N = 8, from (2.7.29) we have ⎧   ⎫ ⎪ ⎪ 1 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 1 −1 ⎪ ⎪ ⎪ ⎪ ⎪  ⎪  ⎪ ⎪ ⎪ ⎪ 1 −j ⎪ ⎪ ⎪ ⎪ ⎪ # ⎪ " ⎨ ⎬ 1 j {A}2 , {A}4 = =   ⎪ {R}2 1 e−jπ/4 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 1 −e−jπ/4 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪   ⎪ ⎪ ⎪ ⎪ −j3π/4 ⎪ ⎪ 1 e ⎪ ⎪ ⎪ ⎩ 1 −e−j3π/4 ⎪ ⎭

(2.7.33)

and thus the 8 × 8 bit-reversed Fourier matrix is given by F8,rev = {A}4 ⊗ ({A}2 ⊗ B) = {A}4 ⊗ F4,rev ⎧   ⎫ ⎪ ⎪ 1 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 1 −1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪   ⎪ ⎪ ⎡ ⎤ ⎪ ⎪ 1 −j ⎪ ⎪ 1 1 1 1 ⎪ ⎪ ⎪ ⎪ ⎨ ⎬ ⎢ 1 j 1 −1 ⎥ ⎢ 1 −1 ⎥ ⊗ =   ⎣ 1 −j −1 −jπ/4 ⎪ ⎪ j ⎦ 1 e ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ −jπ/4 ⎪ ⎪ 1 j −1 −j ⎪ ⎪ ⎪ 1 −e ⎪ ⎪ ⎪  ⎪ ⎪ ⎪ ⎪ 1 ⎪ e−j3π/4 ⎪ ⎪ ⎪ ⎪ ⎩ 1 −e−j3π/4 ⎪ ⎭ ⎡ 1 1 1 1 1 1 ⎢1 −1 1 −1 1 −1 ⎢ ⎢1 −j −1 j 1 −j ⎢ ⎢ j −1 −j 1 j ⎢1 =⎢ ⎢1 e−jπ/4 −j e−j3π/4 −1 −e−jπ/4 ⎢ ⎢1 −e−jπ/4 −j −e−j3π/4 −1 e−jπ/4 ⎢ −jπ/4 ⎣1 e−j3π/4 j e −1 −e−j3π/4 −j3π/4 −jπ/4 1 −e j −e −1 e−j3π/4 From Equation (2.7.26) it ⎡ ⎤ ⎡ X0 1 1 ⎢X ⎥ ⎢1 −1 ⎢ 4⎥ ⎢ ⎢X ⎥ ⎢ 1 −j ⎢ 2⎥ ⎢ ⎢ ⎥ ⎢ j ⎢ X6 ⎥ ⎢ 1 ⎢ ⎥=⎢ ⎢ X1 ⎥ ⎢ 1 e−jπ/4 ⎢ ⎥ ⎢ ⎢X5 ⎥ ⎢1 −e−jπ/4 ⎢ ⎥ ⎢ ⎣X3 ⎦ ⎣1 e−j3π/4 1 −e−j3π/4 X7

⎤ 1 1 1 −1 ⎥ ⎥ ⎥ −1 j ⎥ ⎥ −1 −j ⎥ −j3π/4 ⎥. ⎥ j −e −j3π/4 ⎥ ⎥ j e ⎥ −jπ/4 ⎦ −j −e −j e−jπ/4 (2.7.34)

immediately follows that 1 1 1 −1 −1 j −1 −j −j e−j3π/4 −j −e−j3π/4 j e−jπ/4 j −e−jπ/4

1 1 1 −1 1 −j 1 j −1 −e−jπ/4 −1 e−jπ/4 −1 −e−j3π/4 −1 e−j3π/4

⎤⎡ ⎤ x0 1 1 ⎢x ⎥ 1 −1 ⎥ ⎥⎢ 1 ⎥ ⎥⎢x ⎥ −1 j ⎥⎢ 2 ⎥ ⎥⎢ ⎥ −1 −j ⎥⎢x3 ⎥ −j3π/4 ⎥⎢ ⎥ . ⎥⎢x4 ⎥ j −e ⎥⎢ ⎥ ⎢ ⎥ j e−j3π/4 ⎥ ⎥⎢x5 ⎥ −jπ/4 ⎦⎣ ⎦ −j −e x6 −j e−jπ/4 x7

2.8 Hadamard Matrix

121

(2.7.35) This is the eight-point FFT algorithm. Similarly, we can derive other N (= 16, 32, . . .)-point FFT algorithms.

2.8 Hadamard Matrix The Hadamard matrix is an important special matrix in communication, information theory, signal processing and so on. DEFINITION 2.13 Hn ∈ Rn×n is known as a Hadamard matrix if its entries take the values either +1 or −1 such that Hn HTn = HTn Hn = nIn .

(2.8.1)

The properties of the Hadamard matrix are as follows. (1) After multiplying the entries of any row or column of the Hadamard matrix by −1, the result is still a Hadamard matrix. Hence, we can obtain a special Hadamard matrix, with all entries of the first row and of the first column equal to +1. Such a special Hadamard matrix is known as a standardized Hadamard matrix. (2) An n × n Hadamard matrix exists only when n = 2k , where k is an integer. √ (3) (1/ n)Hn is an orthonormal matrix. (4) The determinant of the n × n Hadamard matrix Hn is det(Hn ) = nn/2 . Standardizing a Hadamard matrix will greatly help to construct a Hadamard matrix with higher-dimension. The following theorem gives a general construction approach for the standardized orthonomal Hadamard matrix. THEOREM 2.3 Putting n = 2k , where k = 1, 2, . . ., the standardized orthonormal Hadamard matrix has the following general construction formula: ¯  ¯ 1 Hn/2 Hn/2 ¯ Hn = √ , (2.8.2) ¯ ¯ 2 H −H n/2

where ¯ = √1 H 2 2



1 1 1 −1

n/2

 .

(2.8.3)

¯ is clearly a standardizing orthonormal Hadamard matrix, since Proof First, H 2 T ¯ ¯ ¯ ¯ ¯ k be the standardizing orthonormal Hadamard H2 H2 = H2 HT2 = I2 . Next, let H 2 k T ¯ ¯ Tk = I k k . Hence, for n = 2k+1 , it is ¯ ¯ kH matrix for n = 2 , namely H2k H2k = H 2 2 ×2 2 easily seen that   ¯ 2k H ¯ k 1 H 2 ¯ H2k+1 = √ ¯ k −H ¯ k 2 H 2

2

122

Special Matrices

satisfies the orthogonality condition, i.e.,   T ¯ Tk ¯ 2k ¯ k H H H 1 2 2 T ¯ k+1 = ¯ k+1 H H 2 2 ¯ Tk H ¯ Tk −H ¯ k 2 H 2

2

2

 ¯ k H 2 = I2k+1 ×2k+1 . ¯ k −H 2

¯ Tk+1 = I k+1 k+1 . Moreover, since H ¯ k ¯ k+1 H Similarly, it is easy to show that H 2 2 ×2 2 2 ¯ is standardized, H2k+1 is standardized as well. Hence, the theorem holds also for n = 2k+1 . By mathematical induction, the theorem is proved. A nonstandardizing Hadamard matrix can be written as Hn = Hn/2 ⊗ H2 = H2 ⊗ · · · ⊗ H2 (n = 2k ), (2.8.4)   1 1 . where H2 = 1 −1 Evidently, there is the following relationship between the standardizing and nonstandardizing Hadamard matrices: ¯ = √1 H . H n n n EXAMPLE 2.9

(2.8.5)

When n = 23 = 8, we have the Hadamard matrix       1 1 1 1 1 1 ⊗ ⊗ H8 = 1 −1 1 −1 1 −1 ⎤ ⎡ 1 1 1 1 1 1 1 1 ⎢ 1 −1 1 −1 1 −1 1 −1 ⎥ ⎥ ⎢ ⎢ 1 1 −1 −1 1 1 −1 −1 ⎥ ⎥ ⎢ ⎥ ⎢ 1 1 −1 −1 1 ⎥ ⎢ 1 −1 −1 =⎢ ⎥. ⎢ 1 1 1 1 −1 −1 −1 −1 ⎥ ⎢ ⎥ ⎢ 1 −1 1 −1 −1 1 −1 1 ⎥ ⎢ ⎥ ⎣ 1 1 −1 −1 −1 −1 1 1 ⎦ 1 −1 −1 1 −1 1 1 −1

Figure 2.1 shows the waveforms of the eight rows of the Hadamard matrix φ0 (t), φ1 (t), . . . , φ7 (t). It is easily seen that the Hadamard waveforms φ0 (t), φ1 (t), . . . , φ7 (t) are orthog1 onal to each other, namely 0 φi (t)φj (t)dt = δ(i − j). If H is a Hadamard matrix, the linear transformation Y = HX is known as the Hadamard transform of the matrix X. Since the Hadamard matrix is a standardizing orthonormal matrix, and its entries are only +1 or −1, the Hadamard matrix is the only orthonormal transformation using addition and subtraction alone. The Hadamard matrix may be used for mobile communication coding. The resulting code is called the Hadamard (or Walsh–Hadamard) code. In addition, owing to the orthogonality between its row vectors, they can be used to simulate the spread waveform vector of each user in a code division multiple access (CDMA) system.

2.9 Toeplitz Matrix φ0 (t) 1

φ1 (t) 1

1 t

123

φ2 (t) 1

φ3 (t) 1

1 t

1 t

1 t

−1

−1

−1

−1

φ4 (t) 1

φ5 (t) 1

φ6 (t) 1

φ7 (t) 1

1 t

1 t

1 t

−1

−1

−1

1 t −1

Figure 2.1 Waveform of each row of the Hadamard matrix H8 .

2.9 Toeplitz Matrix In the early twentieth century, Toeplitz [467] presented a special matrix in his paper on the bilinear function relating to the Laurent series. All entries on any diagonal of this special matrix take the same values, namely ⎡

a0 ⎢ a1 ⎢ ⎢a 2 A=⎢ ⎢ ⎢. ⎣ ..

a−1 a0

a−2 a−1

a1

a0

··· ··· .. .

.. .

..

..

an

an−1

. ···

. a1

⎤ a−n a−n+1 ⎥ ⎥ .. ⎥ n . ⎥ ⎥ = [ai−j ]i,j=0 . ⎥ a−1 ⎦

(2.9.1)

a0

This form of matrix with A = [ai−j ]ni,j=0 is known as the Toeplitz matrix. Clearly, an (n + 1) × (n + 1) Toeplitz matrix is determined completely by the entries a0 , a−1 , . . . , a−n of its first row and the entries a0 , a1 , . . . , an of its first column.

2.9.1 Symmetric Toeplitz Matrix The entries of a symmetric Toeplitz matrix A = [a|i−j| ]ni,j=0 satisfy the symmetry relations a−i = ai , i = 1, 2, . . . , n. Thus, a symmetric Toeplitz matrix is completely described by the entries of its first row. An (n + 1) × (n + 1) symmetric Toeplitz matrix is usually denoted Toep[a0 , a1 , . . . , an ].

124

Special Matrices

If the entries of a complex Toeplitz matrix ⎡ a0 a∗1 a∗2 ⎢ ⎢ a1 a0 a∗1 ⎢ ⎢ A = ⎢ a2 a1 a0 ⎢ ⎢ .. .. . .. ⎣. . an

an−1

···

satisfy a−i = a∗i , i.e., ⎤ ··· a∗n ⎥ · · · a∗n−1 ⎥ ⎥ .. ⎥ .. , . . ⎥ ⎥ ⎥ .. . a∗ ⎦

(2.9.2)

1

a1

a0

then it is known as a Hermitian Toeplitz matrix. Furthermore, an (n + 1) × (n + 1) Toeplitz matrix with the special structure ⎤ ⎡ 0 −a∗1 −a∗2 · · · −a∗n ⎥ ⎢ ⎢ a1 0 −a∗1 · · · −a∗n−1 ⎥ ⎥ ⎢ .. ⎥ ⎢ .. (2.9.3) AS = ⎢ a 2 . a1 0 . ⎥ ⎥ ⎢ ⎥ ⎢ .. .. . . .. .. ⎣. . −a∗ ⎦ an

an−1

···

1

a1

0

is called a skew-Hermitian Toeplitz matrix, whereas ⎡ ⎤ a0 −a∗1 −a∗2 · · · −a∗n ⎢ ⎥ ⎢ a1 a0 −a∗1 · · · −a∗n−1 ⎥ ⎢ ⎥ .. ⎥ ⎢ .. A = ⎢ a2 ⎥ . a a . 1 0 ⎢ ⎥ ⎢ .. ⎥ .. . . . . ∗ ⎣. . . . −a1 ⎦ an an−1 · · · a1 a0

(2.9.4)

is known as a skew-Hermitian-type Toeplitz matrix. Toeplitz matrices have the following properties [397]. (1) The linear combination of several Toeplitz matrices is also a Toeplitz matrix. (2) If the entries of a Toeplitz matrix A satisfy aij = a|i−j| , then A is a symmetric Toeplitz matrix. (3) The transpose AT of the Toeplitz matrix A is also a Toeplitz matrix. (4) The entries of a Toeplitz matrix are symmetric relative to its cross-diagonal lines. In statistical signal processing and other related fields, it is necessary to solve the matrix equation Ax = b with symmetric Toeplitz matrix A. Such a system of linear equations is called a Toeplitz system. By using the special structure of the Toeplitz matrix, a class of Levinson recursive algorithms has been proposed for solving a Toeplitz system of linear equations. For a real positive-definite Toeplitz matrix, its classical Levinson recursion [291] involved a large redundancy in the calculation. In order to reduce the computational

2.9 Toeplitz Matrix

125

redundancy, Delsarte and Genin presented successively a split Levinson algorithm [120] and a split Schur algorithm [121]. Later, Krishna and Morgera [265] extended the split Levinson recursion to a Toeplitz system of complex linear equations. Although these algorithms are recursive, their computational complexity is O(n2 ). In addition to the Levinson recursive algorithms, one can apply an FFT algorithm to solve the Toeplitz system of linear equations, and this class of algorithms only requires a computational complexity O(n log2 n), which is faster than the Levinson recursive algorithms. In view of this, FFT algorithms for solving the Toeplitz system of linear equations are called fast Toeplitz algorithms in most of the literature. These fast Toeplitz algorithms include the Kumar algorithm [269], the Davis algorithm [117] and so on, while some publications refer to this class of FFT algorithms as superfast algorithms. The computation complexity of the Kumar algorithm is O(n log2 n). In particular, the conjugate gradient algorithm in [89] requires only a computational complexity O(n log n) for solving the Toeplitz system of complex linear equations.

2.9.2 Discrete Cosine Transform of Toeplitz Matrix An N ×N Toeplitz matrix can be computed fast using an N th-order discrete cosine transform (DCT). The N th-order DCT can be expressed as an N × N matrix T, where the (m, l)th −1 entry of T = [tm,l ]N m,l=0 is defined as ) π * tm,l = τm cos m(2l + 1) for m, l = 0, 1, . . . , N − 1, (2.9.5) 2N + 1/N , m = 0, (2.9.6) τm =  2/N , m = 1, 2, . . . , N − 1. From the above definition, it is easy to see that T−1 = TT , namely T is an ˆ = TATT orthogonal matrix. Hence, given any N × N matrix A, its DCT matrix A and the original matrix A have the same eigenvalues. Algorithm 2.1 shows the fast DCT algorithm for a Toeplitz matrix. The common way of computing Toeplitz matrix eigenvalues is to adopt some procedure (such as the Given rotation) to transform the nondiagonal entries of A to zero, namely to diagonalize A. As mentioned above, since T is an orthogonal ˆ = TATT and A have the same eigenvalues. Hence, we can diagonalize the matrix, A ˆ DCT A instead of A itself to find the eigenvalues of A. In some cases, the diagonal ˆ are already exact estimates of the eigenvalues of A; entries of the DCT matrix A see e.g. [185]. ˆ can be used as an approxSome examples show [357] that the fast DCT matrix A ˆ are focused imation of the Toeplitz matrix A because the main components of A on a much smaller matrix partition that corresponds to the steady part of A. From

126

Special Matrices

Algorithm 2.1

Fast DCT algorithm for a Toeplitz matrix [357]

−1 N −1 input: Toeplitz matrix A = [al,k ]N l,k=0 = [ak−l ]l,k=0 .

1. For m = 0, 1, . . . , N − 1, compute N −1 π xm,0 = . wm(2l+1) al,0 , w = exp −j 2N l=0

(Complexity: one 2N -point DFT.) 2. For n = −(N − 1), . . . , (N − 1), compute N −1 N −1 v1 (n) = w2nk ak+1 , v2 (n) = w2nk a1−N +k . k=0

k=0

(Complexity: two 2N -point DFTs.) 3. For m = 0, 1, . . . , N − 1, compute xm,N = (−1)m xm,0 + w−m ((−1)m v1 (n) − v2 (n)). (Complexity: N -times multiplication.) 4. For n = −(N − 1), . . . , (N − 1), compute N −1 N −1 kw2nk ak+1 , u2 (n) = kw2nk a1−N +k . u1 (n) = k=0

k=0

(Complexity: two 2N -point DFTs.) 5. For m = 0, 1, . . . , N − 1; n = −(N − 1), . . . , (N − 1), compute   1 ym,n = −2n xm,0 w−2n − xm,N w−2n (−1)n + wm (v1 (n)−(−1)m v2 (n)) , w − w2m ym,−m = xm,0 + (N − 1)(−1)m xm,N − w−m (u1 (−m) − (−1)m u2 (−m)). (Complexity: several multiplications are needed for each value.) 6. For m = 0, 1, . . . , N − 1; n = −(N − 1), . . . , (N − 1), compute   ∗ ym,n + ym,−n . a ˆm,n = τm τn Re wn 2 (Complexity: several multiplications are required for each value.)

ˆ is easily found the viewpoint of a decision about rank, that of the DCT matrix A seen but the rank of A is difficult to ascertain. Notice that, since A and its DCT ˆ have the same eigenvalues, their ranks are the same. A

Exercises 2.1 2.2 2.3 2.4

Let A be an idempotent and symmetric matrix. Show that A is semi-definite. Let A be an n × n antisymmetric matrix such that A = −AT . Show that if n is an odd number then the determinant of A must be equal to zero. Show that if A = −AT then A + I is nonsingular. Assume that A = −AT . Show that the Cayley transform T = (I−A)(I+A)−1 is an orthogonal matrix.

Exercises

2.5

Prove the following properties of an idempotent matrix. (a) (b) (c) (d)

2.6

127

The eigenvalues of any idempotent matrix take only the values 1 and 0. All idempotent matrices (except for the identity matrix) A are singular. If A is an idempotent matrix then rank(A) = tr(A). If A is an idempotent matrix then AH is also an idempotent matrix.

If A is an idempotent matrix, show that (a) the matrices Ak and A have the same eigenvalues. (b) Ak and A have the same rank.

2.7

2.8 2.9 2.10 2.11 2.12 2.13

Let A be an orthogonal matrix, and let A + I be nonsingular. Show that the matrix A can be expressed as the Cayley transform A = (I − S)(I + S)−1 , where S is a real antisymmetric matrix such that A = −AT . Let P be an n×n permutation matrix. Show that there exists a positive integer k such that Pk = I. (Hint: Consider the matrix sequence P, P2 , P3 , . . .) Assume that P and Q are two n × n permutation matrices. Show that PQ is also an n × n permutation matrix. Show that, for any matrix A, there is a triangular matrix T such that TA is unitary. Show that, for a given matrix A, there is a matrix J, whose diagonal entries take the values +1 or −1, such that JA + I is nonsingular. Show that if H = A + j B is a Hermitian matrix and A is nonsingular then | det(H)|2 = |A|2 |I + A−1 BA−1 B|. Let A, S be n × n matrices and let S be nonsingular. (a) Show that (S−1 AS)2 = S−1 A2 S and (S−1 AS)3 = S−1 A3 S. (b) Use mathematical induction to show that (S−1 AS)k = S−1 Ak S, where k is a positive integer.

2.14 Let A be an n × n real matrix. Show that B = (A + AT )/2 is a symmetric matrix, while C = (A − AT )/2 is an antisymmetric matrix. 2.15 Let yi = yi (x1 , . . . , xn ), i = 1, . . . , n be n functions of x1 , . . . , xn . The matrix J = J(y, x) = [∂yi /∂xj ] is called the Jacobian matrix of the function yi (x1 , . . . , xn ), i = 1, . . . , n, and its determinant is the Jacobian determinant; y = [y1 , . . . , yn ]T and x = [x1 , . . . , xn ]T . Show that J(z, y)J(y, x) = J(z, x). 2.16 Let n × n matrices X and Y be symmetric, and let Y = AXAT . Show that the Jacobian determinant |J(Y, X)| = |A|n+1 . 2.17 [32, p. 265] An n × n matrix M is known as a Markov matrix if its entries n satisfy the condition mij ≥ 0, i=1 mij = 1, j = 1, 2, . . . , n. Assuming that P and Q are two Markov matrices, show that (a) for a constant 0 ≤ λ ≤ 1, the matrix λP + (1 − λ)Q is a Markov matrix. (b) the matrix product PQ is a Markov matrix as well. 2.18 Show the following properties of the Fourier matrix F:

128

Special Matrices

(a) F−1 = N −1 F∗ . (b) F2 = [e1 , eN , eN −1 , . . . , e2 ]. (c) F4 = I. 2.19 The square matrix A satisfying AAH = AH A is known as a normal matrix. Show that if A is a normal matrix then the matrix A − λI is also a normal matrix. 2.20 Two matrices A and B such that their product satisfies the commutative law AB = BA are called exchangeable matrices. Show that if A and B are exchangeable then the condition for AH and B to be exchangeable is that A is a normal matrix. 2.21 Let A be an n × n complex matrix with eigenvalues λ1 , . . . , λn . Show that A is a normal matrix if and only if one of the following conditions holds: (a) (b) (c) (d)

A = B + jC where B and C are Hermitian and exchangeable matrices. The eigenvalues of AAH are |λ1 |2 , . . . , |λn |2 . The eigenvalues of A + AH are λ1 + λ∗1 , . . . , λn + λ∗n . A = UΣUH , where U is a unitary matrix and Σ is a diagonal matrix.

3 Matrix Differential

The matrix differential is a generalization of the multivariate function differential. The matrix differential (including the matrix partial derivative and gradient) is an important operation tool in matrix analysis and calculation and has wide applications in statistics, optimization, manifold computation, geometric physics, differential geometry, econometrics and many engineering disciplines.

3.1 Jacobian Matrix and Gradient Matrix In the first half of this chapter we discuss the partial derivatives of a real scalar function, real vector function and real matrix function with respect to a real vector or matrix variable. In order to facilitate understanding, the symbols for variables and functions are first introduced below. • x = [x1 , . . . , xm ]T ∈ Rm denotes a real vector variable. • X = [x1 , . . . , xn ] ∈ Rm×n denotes a real matrix variable. • f (x) ∈ R is a real scalar function with an m × 1 real vector x as variable, denoted f : Rm → R, such as f (x) = tr(xxT ). • f (X) ∈ R is a real scalar function whose variable is an m × n real matrix X, denoted f : Rm×n → R; e.g., f (X) = det(XT X). • f (x) ∈ Rp is a p-dimensional column vector function with an m × 1 real vector x as variable, denoted f : Rm → Rp ; e.g., f (x) = bx + c. • f (X) ∈ Rp is a p-dimensional real column vector function whose variable is an m × n real matrix X, denoted f : Rm×n → Rp , such as f (X) = vec(XT X). • F(x) ∈ Rp×q is a p × q real matrix function whose variable is an m × 1 real vector x, denoted F : Rm → Rp×q ; e.g., F(x) = xxT . • F(X) ∈ Rp×q is a p × q real matrix function whose variable is an m × n real matrix X, denoted F : Rm×n → Rp×q ; e.g., F(X) = XT X. Table 3.1 summarizes the classification of the real functions above. 129

130

Matrix Differential

Table 3.1 Classification of real functions Function type

Variable x ∈ Rm

Variable X ∈ Rm×n

scalar function f ∈ R

f (x) f : Rm → R f (x) f : Rm → Rp F(x) F : Rm → Rp×q

f (X) f : Rm×n → R f (X) f : Rm×n → Rp F(X) F : Rm×n → Rp×q

vector function f ∈ Rp matrix function F ∈ Rp×q

In this section we discuss the partial derivatives of a real scalar function and a real matrix function.

3.1.1 Jacobian Matrix DEFINITION 3.1 The row partial derivative operator with respect to an m × 1 vector or an m × n matrix X are respectively defined as   ∂ ∂ ∂ def (3.1.1) = ,..., Dx = ∂ xT ∂x1 ∂xm and Dvec X

  ∂ ∂ ∂ ∂ ∂ . = = ,..., ,..., ,..., ∂(vec X)T ∂x11 ∂xm1 ∂x1n ∂xmn

def

The Jacobian operator with respect to an m × n matrix X is defined as ⎡ ⎤ ∂ ∂ ··· ⎢ ∂x11 ∂xm1 ⎥ ⎢ . ∂ def .. ⎥ . ⎢ ⎥ .. DX = = ⎢ .. . ⎥. T ∂X ⎣ ∂ ∂ ⎦ ··· ∂x1n

(3.1.2)

(3.1.3)

∂xmn

DEFINITION 3.2 The partial derivative vector resulting from the operation of Dx on a real scalar function f (x) with m × 1 vector variable x is an 1 × m row vector and is given by   ∂f (x) ∂f (x) ∂f (x) . (3.1.4) = ,..., Dx f (x) = ∂ xT ∂x1 ∂xm The partial derivative matrix ⎡ ∂f (X) ··· ⎢ ∂x11 ∂f (X) ⎢ . .. =⎢ DX f (X) = . ⎢ .. ∂ XT ⎣ ∂f (X) ···

DEFINITION 3.3

∂x1n



∂f (X) ∂xm1 ⎥

⎥ ⎥ ∈ Rn×m ⎥ ∂f (X) ⎦ .. .

∂xmn

(3.1.5)

3.1 Jacobian Matrix and Gradient Matrix

131

is known as the Jacobian matrix of the real scalar function f (X), and the partial derivative vector ∂f (X) ∂(vec X)T   ∂f (X) ∂f (X) ∂f (X) ∂f (X) = ,..., ,..., ,..., ∂x11 ∂xm1 ∂x1n ∂xmn

Dvec X f (X) =

(3.1.6)

is the row partial derivative vector of the real scalar function f (X) with matrix variable X. There is the following relationship between the Jacobian matrix and the row partial derivative vector:  T Dvec X f (X) = rvecDX f (X) = vecDTX f (X) . (3.1.7) That is to say, the row vector partial derivative Dvec X f (X) is equal to the transpose of the column vectorization vec(DTX f (X)) of the Jacobian matrix DTX f (X). This important relation is the basis of the Jacobian matrix identification. As a matter of fact, the Jacobian matrix is more useful than the row partial derivative vector. The following theorem provides a specific expression for the Jacobian matrix of a p × q real-valued matrix function F(X) with m × n matrix variable X. DEFINITION 3.4 [311] Let the vectorization of a p × q matrix function F(X) be given by def

vecF(X) = [f11 (X), . . . , fp1 (X), . . . , f1q (X), . . . , fpq (X)]T ∈ Rpq .

(3.1.8)

Then the pq × mn Jacobian matrix of F(X) is defined as def

DX F(X) = whose specific expression DX F(X) ⎡ ⎤ ⎡ ∂f11 ∂f11 ⎢ ∂(vec X)T ⎥ ⎢ ∂x11 · · · ⎢ ⎥ ⎢ . .. .. ⎢ ⎥ ⎢ . . ⎢ ⎥ . . ⎢ ∂fp1 ⎥ ⎢ ∂fp1 ⎢ ⎥ ⎢ ⎢ ··· ⎢ ∂(vec X)T ⎥ ⎢ ∂x ⎢ ⎥ ⎢ 11 ⎢ ⎥ ⎢ .. .. .. ⎢ ⎥=⎢ . . . ⎢ ⎥ ⎢ ⎢ ∂f1q ⎥ ⎢ ∂f1q ⎢ ⎥ ··· ⎢ ∂(vec X)T ⎥ ⎢ ⎢ ∂x11 ⎢ ⎥ .. .. .. ⎢ ⎥ ⎢ . . . ⎢ ⎥ ⎢ ⎣ ⎣ ∂fpq ⎦ ∂fpq ··· T ∂(vec X)

∂x11

∂ vecF(X) ∈ Rpq×mn ∂(vecX)T

(3.1.9)

is given by ∂f11 ∂xm1

··· .. .

∂f11 ∂x1n

··· .. .

∂fp1 ∂xm1

··· .. .

∂fp1 ∂x1n

··· .. .

∂f1q ∂xm1

··· .. .

∂f1q ∂x1n

··· .. .

∂fpq ∂xm1

···

∂fpq ∂x1n

···

.. . .. . .. .

.. . .. . .. .



∂f11 ∂xmn ⎥

⎥ ⎥ ⎥ ∂fp1 ⎥ ⎥ ∂xmn ⎥ .. ⎥ ⎥ . ⎥. ∂f1q ⎥ ⎥ ∂xmn ⎥ ⎥ .. ⎥ . ⎥ ⎦ .. .

∂fpq ∂xmn

(3.1.10)

132

Matrix Differential

3.1.2 Gradient Matrix The partial derivative operator in column form is known as the gradient vector operator. DEFINITION 3.5 The gradient vector operators with respect to an m × 1 vector x and to an m × n matrix X are respectively defined as T  ∂ ∂ def ∂ ∇x = ,..., , (3.1.11) = ∂x ∂x1 ∂xm and ∇vec X

T  ∂ ∂ ∂ ∂ ∂ = ,..., ,..., ,..., . = ∂ vec X ∂x11 ∂x1n ∂xm1 ∂xmn

def

DEFINITION 3.6 X is defined as

(3.1.12)

The gradient matrix operator with respect to an m × n matrix ⎡ def

∇X =

∂ ⎢ ∂x11

⎢ . ∂ .. =⎢ ∂X ⎢ ⎣ ∂

∂xm1

··· ..

.

···



∂ ∂x1n ⎥

.. .

∂ ∂xmn

⎥ ⎥. ⎥ ⎦

(3.1.13)

DEFINITION 3.7 The gradient vectors of functions f (x) and f (X) are respectively defined as T  ∂f (x) ∂f (x) def ∂f (x) ∇x f (x) = ,..., = , (3.1.14) ∂x1 ∂xm ∂x T  ∂f (X) ∂f (X) ∂f (X) def ∂f (X) ,..., ,..., ,..., . (3.1.15) ∇vec X f (X) = ∂x11 ∂xm1 ∂x1n ∂xmn DEFINITION 3.8

The gradient matrix of the function f (X) is defined as ⎡ ⎤ ∂f (X) ∂f (X) ··· ⎢ ∂x11 ∂x1n ⎥ ⎢ . .. ⎥ . ⎢ ⎥ ∂f (X) .. ∇X f (X) = ⎢ .. (3.1.16) . ⎥= ∂X . ⎣ ∂f (X) ∂f (X) ⎦ ··· ∂xm1

∂xmn

Clearly, there is the following relationship between the gradient matrix and the gradient vector of a function f (X): ∇X f (X) = unvec∇vec X f (X).

(3.1.17)

Comparing Equation (3.1.16) with Equation (3.1.5), we have ∇X f (X) = DTX f (X).

(3.1.18)

That is to say, the gradient matrix of a real scalar function f (X) is equal to the transpose of its Jacobian matrix.

3.1 Jacobian Matrix and Gradient Matrix

133

As pointed out by Kreutz-Delgado [262], in manifold computation [3], geometric physics [430], [162], differential geometry [452] and so forth, the row partial derivative vector and the Jacobian matrix are the “most natural” choices. As we will see later, the row partial derivative vector and the Jacobian matrix are also the most natural choices for matrix derivatives. However, in optimization and many engineering applications, the gradient vector and the gradient matrix are a more natural choice than the row partial derivative vector and the Jacobian matrix. An obvious fact is that, given a real scalar function f (x), its gradient vector is directly equal to the transpose of the partial derivative vector. In this sense, the partial derivative in row vector form is a covariant form of the gradient vector, so the row partial derivative vector is also known as the cogradient vector. Similarly, the Jacobian matrix is sometimes called the cogradient matrix. The cogradient is a covariant operator [162] that itself is not the gradient, but is related to the gradient. For this reason, the partial derivative operator ∂/∂xT and the Jacobian operator ∂/∂ XT are known as the (row) partial derivative operator, the covariant form of the gradient operator or the cogradient operator. The direction of the negative gradient −∇x f (x) is known as the gradient flow direction of the function f (x) at the point x, and is expressed as x˙ = −∇x f (x) or

˙ = −∇ X vec X f (X).

(3.1.19)

From the definition formula of the gradient vector the following can be deduced. (1) In the gradient flow direction, the function f (x) decreases at the maximum descent rate. On the contrary, in the oppositive direction (i.e., the positive gradient direction), the function increases at the maximum ascent rate. (2) Each component of the gradient vector gives the rate of change of the scalar function f (x) in the component direction. DEFINITION 3.9 For a real matrix function F(X) ∈ Rp×q with matrix variable X ∈ Rm×n , its gradient matrix is defined as T  ∂ (vec F(X))T ∂ vecF(X) ∇X F(X) = . (3.1.20) = ∂ vecX ∂(vec X)T Obviously, one has ∇X F(X) = (DX F(X))T .

(3.1.21)

In other words, the gradient matrix of a real matrix function is a transpose of its Jacobian matrix.

3.1.3 Calculation of Partial Derivative and Gradient The gradient computation of a real function with respect to its matrix variable has the following properties and rules [307].

134

Matrix Differential

1. If f (X) = c, where c is a real constant and X is an m × n real matrix, then the gradient ∂c/∂X = Om×n . 2. Linear rule If f (X) and g(X) are respectively real-valued functions of the matrix variable X, and c1 and c2 are two real constants, then ∂f (X) ∂g(X) ∂[c1 f (X) + c2 g(X)] . = c1 + c2 ∂X ∂X ∂X

(3.1.22)

3. Product rule If f (X), g(X) and h(X) are real-valued functions of the matrix variable X then ∂(f (X)g(X)) ∂f (X) ∂g(X) = g(X) + f (X) ∂X ∂X ∂X

(3.1.23)

and ∂(f (X)g(X)h(X)) ∂f (X) ∂g(X) = g(X)h(X) + f (X)h(X) ∂X ∂X ∂X ∂h(X) . + f (X)g(X) ∂X 4. Quotient rule

If g(X) = 0 then

1 ∂(f (X)/g(X)) = 2 ∂X g (X)



 ∂f (X) ∂g(X) g(X) . − f (X) ∂X ∂X

(3.1.24)

(3.1.25)

5. Chain rule If X is an m × n matrix and y = f (X) and g(y) are respectively the real-valued functions of the matrix variable X and of the scalar variable y then ∂g(f (X)) dg(y) ∂f (X) . = ∂X dy ∂X

(3.1.26)

As an extension, if g(F(X)) = g(F), where F = [fkl ] ∈ Rp×q and X = [xij ] ∈ R then the chain rule is given by [386]   p q ∂g(F)   ∂g(F) ∂fkl ∂g(F) = = . (3.1.27) ∂ X ij ∂xij ∂fkl ∂xij m×n

k=1 l=1

When computing the partial derivative of the functions f (x) and f (X), we have to make the following basic assumption. Independence assumption Given a real-valued function f , we assume that the m m×n and the matrix variable X = [xij ]m,n vector variable x = [xi ]m i=1 ∈ R i=1,j=1 ∈ R do not themselves have any special structure; namely, the entries of x and X are independent. The independence assumption can be expressed in mathematical form as follows: + 1, i = j, ∂xi = δij = (3.1.28) ∂xj 0, otherwise,

3.1 Jacobian Matrix and Gradient Matrix

and ∂xkl = δki δlj = ∂xij

+

1,

k = i and l = j,

0,

otherwise.

135

(3.1.29)

Equations (3.1.28) and (3.1.29) are the basic formulas for partial derivative computation. Below we will give a few examples. EXAMPLE 3.1 xT Ax.

Find the Jacobian matrix of the real-valued function f (x) =

Solution Because xT Ax =

n n

akl xk xl , by using Equation (3.1.28) we can find

k=1 l=1

the ith component of the row partial derivative vector ∂ xT Ax/∂xT , given by   T n n n n   ∂  ∂x Ax = a x x = x a + xl ail , kl k l k ki ∂xT ∂xi i k=1 l=1

k=1

l=1

from which we immediately get the row partial derivative vector Df (x) = xT A + xT AT = xT (A + AT ) and the gradient vector ∇X f (x) = (Df (x))T= (A + AT )x. EXAMPLE 3.2 Find the Jacobian matrix of the real function f (x) = aT XXT b, where X ∈ Rm×n and a, b ∈ Rn×1 . Solution Noticing that T

T

a XX b =

m  m 

 ak

k=1 l=1

n 

 xkp xlp

bl

p=1

and using Equation (3.1.29), we can see easily that   m m n ∂f (X)    ∂ak xkp xlp bl ∂f (X) = = ∂ XT ij ∂xji ∂xji k=1 l=1 p=1  m  m  n   ∂xkp ∂xlp ak xlp bl = + ak xkp bl ∂xji ∂xji p=1 =

=

k=1 l=1 m  m  n 

aj xli bl +

i=1 l=1 j=1 n m    T

n m  m  

ak xki bj

k=1 i=1 j=1

   X b i aj + XT a i bj ,

i=1 j=1

which yields respectively the Jacobian matrix and the gradient matrix, DX f (X) = XT (baT + abT ) and EXAMPLE 3.3

∇X f (X) = (baT + abT )X.

Consider the objective function f (X) = tr(XB), where X and B

136

Matrix Differential

are m × n and n × m real matrices, respectively. Find the Jacobian matrix and the gradient matrix of f (X) = tr(XB). Solution First, the (k, l)th entry of the matrix product XB is given by [XB]kl = n m n p=1 xkp bpl , so the trace of the matrix product is tr(XB) = p=1 l=1 xlp bpl . Hence, using Equation (3.1.29), it is easily found that m n    n m    ∂xlp ∂ tr(XB) ∂ = = x b bpl = bij , lp pl T ∂X ∂xji p=1 ∂xji ij p=1 l=1

l=1

that is, ∂ tr(XB)/∂ XT = B. Moreover, since tr(BX) = tr(XB), we see that the n × m Jacobian matrix and the m × n gradient matrix are respectively given by DX tr(XB) = DX tr(BX) = B and

∇X tr(XB) = ∇X tr(BX) = BT .

The following are examples of computing the Jacobian matrix and the gradient matrix of a real matrix function. EXAMPLE 3.4 Setting F(X) = X ∈ Rm×n and then computing the partial derivative directly yields ∂xkl ∂fkl = = δlj δki , ∂xij ∂xij which yields the Jacobian matrix DX X = In ⊗ Im = Imn ∈ Rmn×mn . EXAMPLE 3.5 Let F(X) = AXB, where A ∈ Rp×m , X ∈ Rm×n , B ∈ Rn×q . From the partial derivative m n ∂ ( u=1 v=1 aku xuv bvl ) ∂fkl ∂[AXB]kl = = = bjl aki , ∂xij ∂xij ∂xij we get respectively the pq × mn Jacobian matrix and the mn × pq gradient matrix: DX (AXB) = BT ⊗ A and

∇X (AXB) = B ⊗ AT .

EXAMPLE 3.6 Let F(X) = AXT B, where A ∈ Rp×n , X ∈ Rm×n , B ∈ Rm×q . Compute the partial derivative m n ∂ ( u=1 v=1 aku xvu bvl ) ∂fkl ∂[AXT B]kl = = = bil akj ∂xij ∂xij ∂xij to obtain respectively the pq×mn Jacobian matrix and the mn×pq gradient matrix: DX (AXT B) = (BT ⊗ A)Kmn

and

∇X (AXT B) = Knm (B ⊗ AT ),

where Kmn and Knm are commutation matrices.

3.1 Jacobian Matrix and Gradient Matrix

137

Let X ∈ Rm×n , then n ∂ ( u=1 xku xlu ) ∂fkl ∂[XXT ]kl = = = δli xkj + xlj δki , ∂xij ∂xij ∂xij n ∂ ( u=1 xuk xul ) ∂[XT X]kl ∂fkl = = = xil δkj + δlj xik . ∂xij ∂xij ∂xij

EXAMPLE 3.7

Hence the Jacobian matrices are given by DX (XXT ) = (Im ⊗ X)Kmn + (X ⊗ Im ) = (Kmm + Im2 )(X ⊗ Im ) ∈ Rmm×mn , DX (XT X) = (XT ⊗ In )Kmn + (In ⊗ XT ) = (Knn + In2 )(In ⊗ XT ) ∈ Rnn×mn and the gradient matrices are given by ∇X (XXT ) = (XT ⊗ Im )(Kmm + Im2 ) ∈ Rmn×mm , ∇X (XT X) = (In ⊗ X)(Knn + In2 ) ∈ Rmn×nn in which we have used (Ap×m ⊗Bq×n )Kmn = Kpq (Bq×n ⊗Ap×m ) and KTmn = Knm . EXAMPLE 3.8 by

The partial derivatives of the products of three matrices are given    ∂ p xpk [BX]pl + [XT B]kp xpl ∂[XT BX]kl = ∂xij ∂xij   = δpi δkj [BX]pl + δpi δlj [XT B]kp p

∂[XBXT ]kl ∂xij

= [BX]il δkj + δlj [XT B]ki    ∂ p xkp [BXT ]pl + [XB]kp xlp = ∂xij * ) δki δpj [BXT ]pl + δpj δli [XB]kp = p

= [BXT ]jl δki + δli [XB]kj , and thus we obtain respectively the Jacobian matrix and the gradient matrix:     DX (XT BX) = (BX)T ⊗ In Kmn + In ⊗ (XT B) ∈ Rnn×mn , DX (XBXT ) = (XBT ) ⊗ Im + (Im ⊗ (XB)) Kmn ∈ Rmm×mn ;   ∇X (XT BX) = Knm ((BX) ⊗ In ) + In ⊗ (BT X) ∈ Rmn×nn ,   ∇X (XBXT ) = (BXT ) ⊗ Im + Knm Im ⊗ (XB)T ∈ Rmn×mm . Table 3.2 summarizes the partial derivative, the Jacobian matrix and the gradient matrix for several typical matrix functions.

138

Matrix Differential

Table 3.2 Partial derivative, Jacobian matrix and gradient matrix of F(X) F(X)

∂fkl /∂xij

DX F(X)

∇X F(X)

AXB

bjl aki

BT ⊗ A

B ⊗ AT

AT XB

bjl aik

B T ⊗ AT

B⊗A

T

blj aki

B⊗A

B T ⊗ AT

AT XBT

blj aik

B ⊗ AT

BT ⊗ A

AXT B

bil akj

(BT ⊗ A)Kmn

Knm (B ⊗ AT )

AT XT B

bil ajk

(BT ⊗ AT )Kmn

Knm (B ⊗ A)

AXT BT

bli akj

(B ⊗ A)Kmn

Knm (BT ⊗ AT )

AT XT B T

bli ajk

(B ⊗ AT )Kmn

Knm (BT ⊗ A)

XXT

δli xkj + xlj δki

(Kmm + Im2 )(X ⊗ Im )

(XT ⊗ Im )(Kmm + Im2 )

XT X

xil δkj + δlj xik

(Knn + In2 )(In ⊗ XT )   (BX)T ⊗ In Kmn   + In ⊗ (XT B)

(In ⊗ X)(Knn + In2 )

(XBT ) ⊗ Im

(BXT ) ⊗ Im   +Knm Im ⊗ (XB)T

AXB

XT BX

XBXT

[BX]il δkj + δlj [XT B]ki [BXT ]jl δki + δli [XB]kj

+ (Im ⊗ (XB)) Kmn

Knm ((BX) ⊗ In )   + In ⊗ (BT X)

If X = x ∈ Rm×1 then from Table 3.2 we obtain Dx (xxT ) = Im2 (x ⊗ Im ) + Kmm (x ⊗ Im ) = (x ⊗ Im ) + (Im ⊗ x),

(3.1.30)

since Kmm (x ⊗ Im ) = (Im ⊗ x)Km1 and Km1 = Im . Similarly, we can obtain Dx (xxT ) = (K11 + I1 )(I1 ⊗ xT ) = 2xT .

(3.1.31)

It should be noted that direct computation of partial derivatives ∂fkl /∂xij can be used to find the Jacobian matrices and the gradient matrices of many matrix functions, but, for more complex functions (such as the inverse matrix, the Moore– Penrose inverse matrix and the exponential functions of a matrix), direct computation of their partial derivatives is more complicated and difficult. Hence, naturally, we want to have an easily remembered and effective mathematical tool for computing the Jacobian matrices and the gradient matrices of real scalar functions and real matrix functions. Such a mathematical tool is the matrix differential, which is the topic of the next section.

3.2 Real Matrix Differential

139

3.2 Real Matrix Differential The matrix differential is an effective mathematical tool for computing the partial derivatives of a function with respect to its variables. This section introduces the first-order real matrix differential together with its theory, calculation methods and applications.

3.2.1 Calculation of Real Matrix Differential The differential of an m × n matrix X = [xij ] is known as the matrix differential, denoted dX, and is defined as dX = [dxij ]m,n i=1,j=1 . EXAMPLE 3.9

Consider the differential of the scalar function tr(U). We have   n n   d(tr U) = d uii = duii = tr(dU), i=1

i=1

namely d(tr U) = tr(dU). EXAMPLE 3.10 given by

An entry of the matrix differential of the matrix product UV is

      [d(UV)]ij = d [UV]ij = d uik vkj = d(uik vkj ) =



k



(duik )vkj + uik dvkj =

k

 k

k

(duik )vkj +



uik dvkj

k

= [(dU)V]ij + [U(dV)]ij . Hence, we have for the matrix differential d(UV) = (dU)V + U(dV). The real matrix differential has the following two basic properties. (1) Transpose The differential of a matrix transpose is equal to the transpose of the matrix differential, i.e., d(XT ) = (dX)T = dXT . (2) Linearity d(αX + βY) = α dX + β dY. The following summarizes common computation formulas for the matrix differential [311, pp. 148–154]. 1. The differential of a constant matrix is a zero matrix, namely dA = O. 2. The matrix differential of the product αX is given by d(αX) = α dX. 3. The matrix differential of a transposed matrix is equal to the transpose of the original matrix differential, namely d(XT ) = (dX)T = dXT . 4. The matrix differential of the sum (or difference) of two matrices is given by d(U ± V) = dU ± dV. 5. The matrix differential of the matrix product is d(AXB) = A(dX)B.

140

Matrix Differential

6. The matrix differentials of the functions UV and UVW, where U = F(X), V = G(X), W = H(X), are respectively given by d(UV) = (dU)V + U(dV) d(UVW) = (dU)VW + U(dV)W + UV(dW).

(3.2.1) (3.2.2)

7. The differential of the matrix trace d(tr X) is equal to the trace of the matrix differential dX, namely d(tr X) = tr(dX).

(3.2.3)

In particular, the differential of the trace of the matrix function F(X) is given by d(tr F(X)) = tr(dF(X)). 8. The differential of the determinant of X is d|X| = |X|tr(X−1 dX).

(3.2.4)

In particular, the differential of the determinant of the matrix function F(X) is given by d|F(X)| = |F(X)|tr(F−1 (X)dF(X)). 9. The matrix differential of the Kronecker product is given by d(U ⊗ V) = (dU) ⊗ V + U ⊗ dV.

(3.2.5)

10. The matrix differential of the Hadamard product is given by d(U ∗ V) = (dU) ∗ V + U ∗ dV.

(3.2.6)

11. The matrix differential of the inverse matrix is d(X−1 ) = −X−1 (dX)X−1 .

(3.2.7)

12. The differential of the vectorization function vec X is equal to the vectorization of the matrix differential, i.e., d vec X = vec(dX).

(3.2.8)

13. The differential of the matrix logarithm is d log X = X−1 dX.

(3.2.9)

In particular, d log F(X) = F−1 (X) dF(X). 14. The matrix differentials of X† , X† X and XX† , where X† is the Moore–Penrose inverses of X, are d(X† ) = − X† (dX)X† + X† (X† )T (dXT )(I − XX† ) + (I − X† X)(dXT )(X† )T X† ,  T d(X† X) = X† (dX)(I − X† X) + X† (dX)(I − X† X) , *T ) d(XX† ) = (I − XX† )(dX)X† + (I − XX† )(dX)X† .

(3.2.10) (3.2.11) (3.2.12)

3.2 Real Matrix Differential

141

3.2.2 Jacobian Matrix Identification In multivariate calculus, the multivariate function f (x1 , . . . , xm ) is said to be differentiable at the point (x1 , . . . , xm ), if a change in f (x1 , . . . , xm ) can be expressed as Δf (x1 , . . . , xm ) = f (x1 + Δx1 , . . . , xm + Δxm ) − f (x1 , . . . , xm ) = A1 Δx1 + · · · + Am Δxm + O(Δx1 , . . . , Δxm ), where A1 , . . . , Am are independent of Δx1 , . . . , Δxm , respectively, and O(Δx1 , . . . , Δxm ) denotes the second-order and the higher-order terms in Δx1 , . . . , Δxm . In this case, the partial derivative ∂f /∂x1 , . . . , ∂f /∂xm must exist, and ∂f = A1 , ∂x1

...

,

∂f = Am . ∂xm

The linear part of the change Δf (x1 , . . . , xm ), A1 Δx1 + · · · + Am Δxm =

∂f ∂f dx1 + · · · + dx , ∂x1 ∂xm m

is said to be the differential or first-order differential of the multivariate function f (x1 , . . . , xm ) and is denoted by df (x1 , . . . , xm ) =

∂f ∂f dx + · · · + dx . ∂x1 1 ∂xm m

(3.2.13)

The sufficient condition for a multivariate function f (x1 , . . . , xm ) to be differentiable at the point (x1 , . . . , xm ) is that the partial derivatives ∂f /∂x1 , . . . , ∂f /∂xm exist and are continuous. 1. Jacobian Matrix Identification for a Real Function f (x) Consider a scalar function f (x) with variable x = [x1 , . . . , xm ]T ∈ Rm . By regarding the elements x1 , . . . , xm as m variables, and using Equation (3.2.13), we can obtain the differential of the scalar function f (x) directly, as follows: ⎤ ⎡  dx1  ∂f (x) ∂f (x) ∂f (x) ∂f (x) ⎢ . ⎥ df (x) = dx1 + · · · + dxm = ... ⎣ .. ⎦ ∂x1 ∂xm ∂x1 ∂xm dxm or df (x) = where

∂f (x) ∂f (x) dx = (dx)T , ∂ xT ∂x

  ∂f (x) ∂f (x) ∂f (x) = , . . . , and dx = [dx1 , . . . , dxm ]T . ∂ xT ∂x1 ∂xm

(3.2.14)

Equation (3.2.14) is a vector form of the differential rule that implies an important application. If we let A = ∂f (x)/∂ xT then the first-order differential can be denoted as a trace: df (x) = (∂f (x)/∂ xT )dx = tr(A dx). This shows that there

142

Matrix Differential

is an equivalence relationship between the Jacobian matrix of the scalar function f (x) and its matrix differential, namely df (x) = tr(A dx)



Dx f (x) =

∂f (x) = A. ∂ xT

(3.2.15)

In other words, if the differential of the function f (x) is denoted as df (x) = tr(A dx) then the matrix A is just the Jacobian matrix of the function f (x). 2. Jacobian Matrix Identification for a Real Function f (X) Consider a scalar function f (X) with variable X = [x1 , . . . , xn ] ∈ Rm×n . Denoting xj = [x1j , . . . , xmj ]T , j = 1, . . . , n, then from Equation (3.2.2) it is easily known that df (X) =

∂f (X) ∂f (X) dx1 + · · · + dxn ∂ x1 ∂xn

⎤ dx11 ⎢ . ⎥ ⎢ .. ⎥ ⎥ ⎢ ⎥ ⎢  ⎢ dxm1 ⎥  ⎥ ∂f (X) ∂f (X) ∂f (X) ⎢ ∂f (X) ⎢ .. ⎥ = ... ... ... . ⎥ ⎢ ∂x11 ∂xm1 ∂x1n ∂xmn ⎢ ⎥ ⎢ dx1n ⎥ ⎥ ⎢ ⎢ .. ⎥ ⎣ . ⎦ dxmn =



∂f (X) d vecX = Dvec X f (X) dvecX. ∂ (vecX)T

(3.2.16)

By the relationship between the row  T partial derivative vector and the Jacobian matrix, Dvec X f (X) = vec DTX f (X) , Equation (3.2.16) can be written as df (X) = (vec AT )T d(vec X), where



∂f (X) ⎢ ∂x11

A = DX f (X) =

∂f (X) ⎢ . =⎢ ⎢ .. ∂ XT ⎣ ∂f (X) ∂x1n

··· ..

.

···

(3.2.17) ⎤

∂f (X) ∂xm1 ⎥

⎥ ⎥ ⎥ ∂f (X) ⎦ .. .

(3.2.18)

∂xmn

is the Jacobian matrix of the scalar function f (X). Using the relationship between the vectorization operator vec and the trace function tr(BT C) = (vec B)T vec C, and letting B = AT and C = dX, then Equation (3.2.17) can be expressed in the trace form as df (X) = tr(AdX).

(3.2.19)

3.2 Real Matrix Differential

143

This can be regarded as the canonical form of the differential of a scalar function f (X). The above discussion shows that once the matrix differential of a scalar function df (X) is expressed in its canonical form, we can identify the Jacobian matrix and/or the gradient matrix of the scalar function f (X), as shown below. PROPOSITION 3.1 If a scalar function f (X) is differentiable at the point X then the Jacobian matrix A can be directly identified as follows [311]: df (x) = tr(Adx)



Dx f (x) = A,

(3.2.20)

df (X) = tr(AdX)



DX f (X) = A.

(3.2.21)

Proposition 3.1 motivates the following effective approach to directly identifying the Jacobian matrix DX f (X) of the scalar function f (X): Step 1 Find the differential df (X) of the real function f (X), and denote it in the canonical form as df (X) = tr(A dX). Step 2

The Jacobian matrix is directly given by A.

It has been shown [311] that the Jacobian matrix A is uniquely determined: if there are A1 and A2 such that df (X) = Ai dX, i = 1, 2, then A1 = A2 . Since the gradient matrix is the transpose of the Jacobian matrix for a given real function f (X), Proposition 3.1 implies in addition that df (X) = tr(AdX) ⇔ ∇X f (X) = AT .

(3.2.22)

Because the Jacobian matrix A is uniquely determined, the gradient matrix is unique determined as well. 3. Jacobian Matrices of Trace Functions EXAMPLE 3.11 Consider the quadratic form f (x) = xT Ax, where A is a square constant matrix. Writing f (x) = tr(f (x)) and then using the expression for the differential of a matrix product, it is easily obtained that   df (x) = d tr(xT Ax) = tr (dx)T Ax + xT Adx     = tr (dxT Ax)T + xT Adx = tr xT AT dx + xT Adx   = tr xT (A + AT )dx . By Proposition 3.1, we can obtain the gradient vector directly: ∇x (xT Ax) =

T ∂ xT Ax  T = x (A + AT ) = (AT + A)x. ∂x

(3.2.23)

Clearly, if A is a symmetric matrix, then ∇x (xT Ax) = ∂xT Ax/∂x = 2Ax.

144

Matrix Differential

For tr(XT X), since tr(AT B) = tr(BT A), we have     d(tr(XT X)) = tr d(XT X) = tr (dX)T X + XT dX     = tr (dX)T X + tr(XT dX) = tr 2XT dX .

EXAMPLE 3.12

On the one hand, from Proposition 3.1 it is known that the gradient matrix is given by ∂ tr(XT X) = (2XT )T = 2X. ∂X

(3.2.24)

On the other hand, the differential of the trace of three-matrix product tr(XT AX) is given by     d tr(XT AX) = tr d(XT AX) = tr (dX)T AX + XT AdX   = tr (dX)T AX + tr(XT AdX)   = tr (AX)T dX + tr(XT AdX)   = tr XT (AT + A)dX , which yields the gradient matrix T ∂ tr(XT AX)  T T = X (A + A) = (A + AT )X. ∂X

(3.2.25)

Given a trace including an inverse matrix, tr(AX−1 ), since     d tr(AX−1 ) = tr d(AX−1 ) = tr AdX−1     = −tr AX−1 (dX)X−1 = −tr X−1 AX−1 dX ,

EXAMPLE 3.13

the gradient matrix is given by ∂ tr(AX−1 )/∂ X = −(X−1 AX−1 )T . The following are the main points in applying Proposition 3.1: (1) Any scalar function f (X) can always be written in the form of a trace function, because f (X) = tr(f (X)). (2) No matter where dX appears initially in the trace function, we can place it in the rightmost position via the trace property tr(C(dX)B) = tr(BC dX), giving the canonical form df (X) = tr(A dX). Table 3.3 summarizes the differential matrices and Jacobian matrices of several typical trace functions [311]; note that A−2 = A−1 A−1 . 4. Jacobian Matrices of Determinant Functions From the matrix differential d|X| = |X| tr(X−1 dX) and Proposition 3.1, it can immediately be seen that the gradient matrix of the determinant |X| is ∂|X| = |X|(X−1 )T = |X| X−T . ∂X

(3.2.26)

3.2 Real Matrix Differential

145

Table 3.3 Differential matrices and Jacobian matrices of trace functions ∂f (X)/∂XT

Trace function f (X) df (X) tr(X) tr(X

tr(IdX)

−1

−tr(X

)

tr(AX)

−2

I dX)

−X−2

tr(AdX)

A

tr(X )

2tr(XdX)

2X

tr(XT X)

2tr(XT dX)   tr XT (A + AT )dX   tr (A + AT )XT dX

2XT

tr ((AX + XA)dX)   −tr X−1 AX−1 dX   −tr X−1 BAX−1 dX   −tr (X + A)−2 dX

AX + XA

tr((AXB + BXA)dX)   tr (AXT B + AT XT BT )dX   tr XT (BA + AT BT )dX   tr (BA + AT BT )XT dX

AXB + BXA

2

tr(XT AX) tr(XAXT ) tr(XAX) tr(AX

−1

)

tr(AX−1 B)   tr (X + A)−1 tr(XAXB) T

tr(XAX B) tr(AXXT B) tr(AXT XB)

XT (A + AT ) (A + AT )XT −X−1 AX−1 −X−1 BAX−1 −(X + A)−2 AXT B + AT XT BT XT (BA + AT BT ) (BA + AT BT )XT

For the logarithm of the determinant, log |X|, its matrix differential is d log |X| = |X|−1 d|X| = |X|−1 tr(|X|X−1 dX) = tr(X−1 dX),

(3.2.27)

hence the gradient matrix of the determinant logarithm function log |X| is determined by ∂ log |X| (3.2.28) = X−T . ∂X d|X| = |X|tr(X−1 dX) it is known that Consider the determinant of X2 . From  d|X2 | = d|X|2 = 2|X| d|X| = 2|X|2 tr X−1 dX . By applying Proposition 3.1, we have ∂|X|2 (3.2.29) = 2|X|2 (X−1 )T = 2|X|2 X−T . ∂X More generally, the matrix differential of |Xk | is as follows: d|Xk | = |Xk | tr(X−k dXk ) = |Xk |tr(X−k (kXk−1 )dX) = k|Xk | tr(X−1 dX). Hence ∂|Xk | = k|Xk | X−T . ∂X

(3.2.30)

146

Matrix Differential

Letting X ∈ Rm×n and rank(X) = m, i.e., XXT is invertible, then we have ) * d|XXT | = |XXT | tr (XXT )−1 d(XXT ) * ) ** ) ) = |XXT | tr (XXT )−1 (dX)XT + tr (XXT )−1 X(dX)T ) ) * ) ** = |XXT | tr XT (XXT )−1 dX + tr XT (XXT )−1 dX ) * = tr 2|XXT | XT (XXT )−1 dX . By Proposition 3.1, we get the gradient matrix ∂|XXT | = 2|XXT | (XXT )−1 X. ∂X Similarly, set X ∈ Rm×n . If rank(X) = n, i.e., XT X is invertible, then   d|XT X| = tr 2|XT X|(XT X)−1 XT dX , T

T

T

(3.2.31)

(3.2.32)

−1

and hence ∂|X X|/∂ X = 2|X X| X(X X) . Table 3.4 summarizes the real matrix differentials and the Jacobian matrices of several typical determinant functions. Table 3.4 Differentials and Jacobian matrices of determinant functions f (X)

df (X)

∂f (X)/∂X

|X|

|X| tr(X−1 dX)

|X|X−1

log |X|

tr(X−1 dX)

X−1

|X−1 |

−|X−1 | tr(X−1 dX)   2|X|2 tr X−1 dX

−|X−1 |X−1

|X2 |

k|X|k tr(X−1 dX)   |XXT | 2|XXT | tr XT (XXT )−1 dX   |XT X| 2|XT X| tr (XT X)−1 XT dX   log |XT X| 2tr (XT X)−1 XT dX   |AXB| |AXB| tr B(AXB)−1 AdX  |XAXT | tr AXT (XAXT )−1

|XAXT |  +(XA)T (XAT XT )−1 dX  |XT AX|tr (XT AX)−T (AX)T

|XT AX|  +(XT AX)−1 XT A dX |Xk |

2|X|2 X−1 k|X|k X−1 2|XXT |XT (XXT )−1 2|XT X|(XT X)−1 XT 2(XT X)−1 XT |AXB|B(AXB)−1 A |XAXT | AXT (XAXT )−1 +(XA)T (XAT XT )−1 |XT AX| (XT AX)−T (AX)T

+(XT AX)−1 XT A

3.2 Real Matrix Differential

147

3.2.3 Jacobian Matrix of Real Matrix Functions Let fkl = fkl (X) be the entry of the kth row and lth column of the real matrix function F(X); then dfkl (X) = [dF(X)]kl represents the differential of the scalar function fkl (X). From Equation (3.2.16) we have ⎤ ⎡ dx11 ⎢ . ⎥ ⎢ .. ⎥ ⎥ ⎢ ⎥ ⎢  ⎢ dxm1 ⎥  ⎢ ∂fkl (X) ∂fkl (X) ∂fkl (X) ∂fkl (X) ⎢ . ⎥ .. ⎥ dfkl (X) = ... ... ... ⎥. ∂x11 ∂xm1 ∂x1n ∂xmn ⎢ ⎥ ⎢ ⎢ dx1n ⎥ ⎥ ⎢ ⎢ .. ⎥ ⎣ . ⎦ dxmn The above result can be rewritten in vectorization form as follows: dvecF(X) = AdvecX,

(3.2.33)

where dvecF(X) = [df11 (X), . . . , dfp1 (X), . . . , df1q (X), . . . , dfpq (X)]T , d vecX = [dx11 , . . . , dxm1 , . . . , dx1n , . . . , dxmn ] and

⎡ ∂f (X) 11 ⎢ ∂x11 ⎢ .. ⎢ . ⎢ ⎢ ∂fp1 (X) ⎢ ⎢ ∂x 11 ⎢ ⎢ .. A=⎢ . ⎢ ⎢ ∂f1q (X) ⎢ ⎢ ∂x11 ⎢ .. ⎢ ⎢ . ⎣ ∂fpq (X) ∂x11

=

··· .. . ··· .. . ··· .. . ···

∂f11 (X) ∂xm1

···

.. .

.. .

∂fp1 (X) ∂xm1

···

.. .

.. .

∂f1q (X) ∂xm1

···

.. .

.. .

∂fpq (X) ∂xm1

···

T

(3.2.34) (3.2.35)

∂f11 (X) ∂x1n

···

.. .

.. .

∂fp1 (X) ∂x1n

···

.. .

.. .

∂f1q (X) ∂x1n

···

.. .

.. .

∂fpq (X) ∂x1n

···

∂f11 (X) ⎤ ∂xmn ⎥

⎥ ⎥ ⎥ ∂fp1 (X) ⎥ ⎥ ∂xmn ⎥ ⎥ ⎥ .. ⎥ . ⎥ ∂f1q (X) ⎥ ⎥ ∂xmn ⎥ ⎥ .. ⎥ ⎥ . ⎦ .. .

∂fpq (X) ∂xmn

∂ vecF(X) . ∂(vec X)T

(3.2.36)

In other words, the matrix A is the Jacobian matrix DX F(X) of the matrix function F(X). Let F(X) ∈ Rp×q be a matrix function including X and XT as variables, where X ∈ Rm×n . Then the first-order matrix differential is given by d vec F(X) = A d(vecX) + Bd(vec XT ),

A, B ∈ Rpq×mn .

148

Matrix Differential

Since d(vecXT ) = Kmn dvecX, the above equation can be rewritten as d vec F(X) = (A + BKmn )d vec X.

(3.2.37)

The above discussion can be summarized in the following proposition. PROPOSITION 3.2 Given a matrix function F(X) : Rm×n → Rp×q , then its pq × mn Jacobian matrix can be determined from d vecF(X) = A dvecX + Bd(vecX)T ⇔

DX F(X) =

∂ vecF(X) = A + BKmn ∂ (vecX)T

(3.2.38)

and its mn × pq gradient matrix can be identified using ∇X F(X) = (DX F(X))T = AT + Knm BT .

(3.2.39)

Importantly, because dF(X) = A(dX)B



d vecF(X) = (BT ⊗ A)dvecX,

dF(X) = C(dXT )D



dvecF(X) = (DT ⊗ C)Kmn dvecX,

identification given in Proposition 3.2 can be simplified to the differential of F(X). THEOREM 3.1 Given a matrix function F(X) : Rm×n → Rp×q , then its pq × mn Jacobian matrix can be identified as follows: [311] dF(X) = A(dX)B + C(dXT )D, ⇔

DX F(X) =

∂ vec F(X) = (BT ⊗ A) + (DT ⊗ C)Kmn , ∂(vec X)T

(3.2.40)

the mn × pq gradient matrix can be determined from ∂(vec F(X))T (3.2.41) = (B ⊗ AT ) + Knm (D ⊗ CT ). ∂ vec X Table 3.5 summarizes the matrix differentials and Jacobian matrices of some real functions. ∇X F(X) =

EXAMPLE 3.14 Given a matrix function AXT B whose matrix differential is d(AXT B) = A(dXT )B, the Jacobian matrix of AXT B is DX (AXT B) = (BT ⊗ A)Kmn ,

(3.2.42)

whose transpose yields the gradient matrix of AXT B. EXAMPLE 3.15 For the function XT BX, its matrix differential is d(XT BX) = XT BdX + d(XT )BX, which yields the Jacobian matrix of XT BX as follows:   (3.2.43) DX (XT BX) = I ⊗ (XT B) + (BX)T ⊗ I Kmn ; the transpose gives the gradient matrix of XT BX.

3.2 Real Matrix Differential

149

Table 3.5 Matrix differentials and Jacobian matrices of real functions Functions

Matrix differential

Jacobian matrix

f (x) : R → R

df (x) = Adx

A∈R

f (x) : R

df (x) = Adx

A ∈ R1×m

f (X) : Rm×n → R

df (X) = tr(AdX)

A ∈ Rn×m

f (x) : Rm → Rp

df (x) = Adx

A ∈ Rp×m

f (X) : Rm×n → Rp

df (X) = Ad(vecX)

A ∈ Rp×mn

F(x) : Rm → Rp×q

d vecF(x) = Adx

A ∈ Rpq×m

F(X) : Rm×n → Rp×q

dF(X) = A(dX)B

(BT ⊗ A) ∈ Rpq×mn

F(X) : Rm×n → Rp×q

dF(X) = C(dXT )D

(DT ⊗ C)Kmn ∈ Rpq×mn

m

→R

Table 3.6 Jacobian matrices of some matrix functions F(X)

dF(X)

Jacobian matrix

X X

X dX + (dX )X

(In ⊗ XT ) + (XT ⊗ In )Kmn

XXT

X(dXT ) + (dX)XT

(Im ⊗ X)Kmn + (X ⊗ Im )

AXT B

A(dXT )B

(BT ⊗ A)Kmn

XT BX

XT B dX + (dXT )BX

I ⊗ (XT B) + ((BX)T ⊗ I)Kmn

AXT BXC

A(dXT )BXC + AXT B(dX)C

((BXC)T ⊗ A)Kmn +CT ⊗ (AXT B)

AXBXT C

A(dX)BXT C + AXB(dXT )C

(BXT C)T ⊗ A +(CT ⊗ (AXB))Kmn

X−1

−X−1 (dX)X−1

−(X−T ⊗ X−1 )

k

k

T

Xk log X exp(X)

T

T

Xj−1 (dX)Xk−j

(XT )k−j ⊗ Xj−1

j=1

j=1

X−1 dX

I ⊗ X−1





k=0

k 1 Xj (dX)Xk−j (k + 1)! j=0

k=0

k 1 (XT )k−j ⊗ Xj (k + 1)! j=0

Table 3.6 lists some matrix functions and their Jacobian matrices. It should be noted that the matrix differentials of some matrix functions may not be representable in the canonical form required by Theorem 3.1, but nevertheless

150

Matrix Differential

can be expressed in the canonical form in Proposition 3.2. In such cases, we must use Proposition 3.2 to identify the Jacobian matrix. EXAMPLE 3.16 Let F(X, Y) = X ⊗ Y be the Kronecker product of two matrices X ∈ Rp×m and Y ∈ Rn×q . Consider the matrix differential dF(X, Y) = (dX) ⊗ Y + X ⊗ (dY). By the vectorization formula vec(X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ vec Y) we have vec(dX ⊗ Y) = (Im ⊗ Kqp ⊗ In )(d vecX ⊗ vec Y) = (Im ⊗ Kqp ⊗ In )(Ipm ⊗ vec Y)d vecX,

(3.2.44)

vec(X ⊗ dY) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ d vecY) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ Inq )d vecY.

(3.2.45)

Hence, the Jacobian matrices are respectively as follows: DX (X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(Ipm ⊗ vec Y),

(3.2.46)

DY (X ⊗ Y) = (Im ⊗ Kqp ⊗ In )(vec X ⊗ Inq ).

(3.2.47)

The analysis and examples in this section show that the first-order real matrix differential is indeed an effective mathematical tool for identifying the Jacobian matrix and the gradient matrix of a real function. The operation of this tool is simple and easy to master.

3.3 Real Hessian Matrix and Identification In real-world problems, we have to employ not only the first-order derivative but also the second-order derivative of a given real function in order to obtain more information about it. In the above section we presented the Jacobian matrix and the gradient matrix, which are two useful representations of the first-order derivative of a real-valued function. Here we discuss the second-order derivative of a real function. The main problem with which we are concerned is how to obtain the Hessian matrix of a real function.

3.3.1 Real Hessian Matrix Consider a real vector x ∈ Rm×1 , the second-order derivative of a real function f (x) is known as the Hessian matrix, denoted H[f (x)], and is defined as   ∂f (x) ∂ ∂ 2 f (x) ∈ Rm×m , = (3.3.1) H[f (x)] = ∂ x∂ xT ∂ x ∂ xT which is simply represented as H[f (x)] = ∇2x f (x) = ∇x (Dx f (x)),

(3.3.2)

3.3 Real Hessian Matrix and Identification

151

where Dx is the cogradient operator. Hence, the (i, j)th entry of the Hessian matrix is defined as    2  ∂f (x) ∂ f (x) ∂ , (3.3.3) = [Hf (x)]ij = ∂ x ∂ xT ij ∂xi ∂xj thus for the matrix itself we have ⎡

∂2f ⎢ ∂x1 ∂x1

H[f (x)] =

···

⎢ ∂ 2 f (x) ⎢ .. = ⎢ . ⎢ ∂ x ∂ xT ⎣ ∂2f

..

.. .

.

···

∂xm ∂x1



∂2f ∂x1 ∂xm ⎥

∂2f ∂xm ∂xm

⎥ ⎥ ⎥ ∈ Rm×m . ⎥ ⎦

(3.3.4)

This shows that the Hessian matrix of a real scalar function f (x) is an m × m matrix that consists of m2 second-order partial derivatives of f (x) with respect to the entries xi of the vector variable x. By the definition, it can be seen that the Hessian matrix of a real scalar function f (x) is a real symmetric matrix, i.e, T

(H[f (x)]) = H[f (x)].

(3.3.5)

The reason is that, for a second-differentiable continuous function f (x), its second∂2f ∂2f order derivative is independent of the order of differentiation, i.e., = . ∂xi ∂xj

∂xj ∂xi

By mimicking the definition of the Hessian matrix of a real scalar function f (x), we can define the Hessian matrix of a real scalar function f (X): H[f (X)] =

∂ 2 f (X) = ∇X (DX f (X)) ∈ Rmn×mn , ∂ vecX∂(vecX)T

which can be written in element form as ⎡ ∂2f ∂2f ∂2f ⎢ ∂x11 ∂x11 · · · ∂x11 ∂xm1 · · · ∂x11 ∂x1n ⎢ .. .. .. .. .. ⎢ ⎢ . . . . . ⎢ ⎢ ∂2f ∂2f ∂2f ⎢ ··· ··· ⎢ ∂xm1 ∂x11 ∂xm1 ∂xm1 ∂xm1 ∂x1n ⎢ .. .. .. .. .. ⎢ . . . . . ⎢ ⎢ ∂2f ∂2f ⎢ ∂2f ··· ··· ⎢ ∂x1n ∂xm1 ∂x1n ∂x1n ⎢ ∂x1n ∂x11 ⎢ . . . .. .. ⎢ .. .. .. . . ⎢ ⎣ ∂2f ∂2f ∂2f ··· ··· ∂xmn ∂x11

From

∂2f ∂xij ∂xkl

=

∂xmn ∂xm1

∂2f ∂xkl ∂xij

∂xmn ∂x1n

···



∂2f ∂x11 ∂xmn ⎥

⎥ ⎥ ⎥ ⎥ ⎥ ∂2f ⎥ ··· ∂xm1 ∂xmn ⎥ ⎥ .. .. ⎥. . . ⎥ ⎥ 2 ∂ f ⎥ ··· ⎥ ∂x1n ∂xmn ⎥ ⎥ .. .. ⎥ . . ⎥ 2 ⎦ ∂ f ··· .. .

(3.3.6)

.. .

(3.3.7)

∂xmn ∂xmn

it is easily seen that the Hessian matrix of a real scalar

152

Matrix Differential

function f (X) is a real symmetric matrix: (H[f (X)])T = H[f (X)].

(3.3.8)

3.3.2 Real Hessian Matrix Identification Let us discuss how to identify the Hessian matrices for real scalar functions f (x) and f (X). First, consider the case of a real scalar function f (x). In many cases, it may involve considerable trouble to compute a Hessian matrix from its definition formula. A simpler method is based on the relationship between the second-order differential matrix and the Hessian matrix. Noting that the differential dx is not a function of the vector x, we have d2 x = d(dx) = 0.

(3.3.9)

Keeping this point in mind, from Equation (3.2.14) it is easy to find the secondorder differential d2 f (x) = d(df (x)):   ∂ df (x) ∂ ∂f 2 (x) ∂f (x) d2 f (x) = dxT dx = (dx)T dx = dxT T ∂x ∂x ∂x ∂ x ∂ xT which can simply be written as d2 f (x) = (dx)T H[f (x)]dx,

(3.3.10)

where H[f (x)] =

∂f 2 (x) ∂ x ∂ xT

(3.3.11)

is the Hessian matrix of the function f (x) and the (i, j)th entry of H[f (x)] is given by   ∂f (x) ∂f 2 (x) ∂ = . (3.3.12) hij = ∂xi ∂xj ∂xi ∂xj Noting that the matrix A ∈ R1×m in the first-order differential df (x) = Adx is usually a real row vector and that its differential is still a real row vector, we have dA = (dx)T B ∈ R1×m , where B ∈ Rm×m . Hence, the second-order differential of the function f (x) is a quadratic function, d2 f (x) = d(Adx) = (dx)T Bdx.

(3.3.13)

By comparing Equation (3.3.13) with Equation (3.3.10), it can be seen that the Hessian matrix Hx [f (x)] = B. In order to ensure that the Hessian matrix is real symmetric, we take 1 (3.3.14) H[f (x)] = (BT + B). 2

3.3 Real Hessian Matrix and Identification

153

Now consider the case of a real scalar function f (X). From Equation (3.2.16) it follows that the second-order differential of the scalar function f (X) is given by ∂ df (X) ∂ vec X   ∂f (X) ∂ T = d(vec X) d vec X ∂ vec X ∂(vec X)T ∂f 2 (X) = (d vecX)T d vecX, ∂ vecX ∂(vec X)T

d2 f (X) = d(vec X)T

namely d2 f (X) = d(vecX)T H[f (X)]dvecX.

(3.3.15)

Here H[f (X)] =

∂ 2 f (X) ∂ vecX ∂(vecX)T

(3.3.16)

is the Hessian matrix of f (X). Formula (3.3.15) is called the second-order (matrix) differential rule for f (X). For the real scalar function f (X), the matrix A in the first-order differential df (X) = Ad vecX is usually a real row vector of the variable matrix X and the differential of A is still a real row vector; thus we have dA = d(vec X)T B

∈ R1×mn ,

where B ∈ Rmn×mn . Hence, the second-order differential of f (X) is a quadratic function d2 f (X) = d(vec X)T BdvecX.

(3.3.17)

By comparing Equation (3.3.17) with Equation (3.3.15), it follows that the Hessian matrix of the function f (X) is H[f (X)] =

1 T (B + B), 2

(3.3.18)

because a real Hessian matrix must be symmetric. The results above can be summarized in the following proposition. PROPOSITION 3.3 For a scalar function f (·) with vector x or matrix X as function variable, there is the following second-order identification relation between the second-order differential and the Hessian matrix: d2 f (x) = (dx)T Bdx



d2 f (X) = d(vecX)T Bdvec X



1 T (B + B), 2 1 H[f (X)] = (BT + B). 2 H[f (x)] =

(3.3.19) (3.3.20)

154

Matrix Differential

More generally, we have the second-order identification theorem for matrix function as follows. THEOREM 3.2 [311, p. 115] Let F : S → Rm×p be a matrix function defined on a set S ∈ Rn×q , and twice differentiable at an interior point C of S. Then d2 vec(F(C; X)) = (Imp ⊗ d vecX)T B(C)dvecX

(3.3.21)

for all X ∈ Rn×q , if and only if H[F(C)] = 12 (B(C) + (B(C)) v ), where B, (B )v ∈ Rpmqn×mn are given by ⎡ T⎤ ⎡ ⎤ B11 B11 ⎢ . ⎥ ⎢ . ⎥ ⎢ .. ⎥ ⎢ .. ⎥ ⎢ ⎢ ⎥ ⎥ ⎢BT ⎥ ⎢B ⎥ ⎢ p1 ⎥ ⎢ p1 ⎥ ⎢ . ⎥ ⎢ . ⎥

⎢ ⎥ ⎥ (3.3.22) B=⎢ ⎢ .. ⎥ , (B )v = ⎢ .. ⎥ ⎢ T⎥ ⎢ ⎥ ⎢B1q ⎥ ⎢B1q ⎥ ⎢ ⎢ ⎥ ⎥ ⎢ .. ⎥ ⎢ .. ⎥ ⎣ . ⎦ ⎣ . ⎦ Bpq BTpq with Bij ∈ Rmn×mn , ∀ i = 1, . . . , p, j = 1, . . . , q. Theorem 3.2 gives the following second-order identification formula for a matrix function: d2 vecF(X) = (Imp ⊗ d vecX)T Bd vecX 1 ⇔ H[F(X)] = [B + (B )v ]. 2

(3.3.23)

In particular, if F ∈ Rm×p takes the form of a scalar function f or a vector function f ∈ Rm×1 , and the variable matrix X ∈ Rn×q a scalar x or a vector x ∈ Rn×1 , then the above second-order identification formula yields the following identification formulas: f (x): d2 [f (x)] = β(dx)2 ⇔ H[f (x)] = β, f (x): d2 [f (x)] = (dx)T Bdx ⇔ H[f (x)] = 12 (B + BT ), f (X): d2 [f (X)] = d(vec X)T BdvecX ⇔ H[f (X)] = 12 (B + BT ); f (x): d2 [f (x)] = b(dx)2 ⇔ H[f (x)] = b, f (x): d2 [f (x)] = (Im ⊗ dx)T Bdx ⇔ H[f (x)] = 12 [B + (B )v ], f (X): d2 [f (X)] = (Im ⊗ d(vec X))T BdvecX ⇔ H[f (X)] = 12 [B + (B )v ]; F(x): d2 [F(x)] = B(dx)2 ⇔ H[F(x)] = vec B, F(x): d2 [vecF] = (Imp ⊗ dx)T Bdx ⇔ H[F(x)] = 12 [B + (B )v ]. The Hessian matrix identification based on Theorem 3.2 requires a row vectorization operation on the second-order matrix differential. An interesting question is

3.3 Real Hessian Matrix and Identification

155

whether we can directly identify the Hessian matrix from the second-order matrix differential without such a row vectorization operation. As mentioned earlier, the first-order differential of a scalar function f (X) can be written in the canonical form of trace function as df (X) = tr(AdX), where A = A(X) ∈ Rn×m is generally a matrix function of the variable matrix X ∈ Rm×n . Without loss of generality, it can be assumed that the differential matrix of the matrix function A = A(X) takes one of two forms: dA = B(dX)C,

or

dA = U(dX)T V,

B, C ∈ Rn×m

U ∈ Rn×n , V ∈ Rm×m .

(3.3.24) (3.3.25)

Notice that the above two differential matrices include (dX)C, BdX and (dX)T V, U(dX)T as two special examples. Substituting (3.3.24) and (3.3.25) into df (X) = tr(AdX), respectively, it is known that the second-order differential d2 f (X) = tr(dAdX) of the real scalar function f (X) takes one of two forms d2 f (X) = tr(B(dX)CdX)   d2 f (X) = tr V(dX)U(dX)T .

or

(3.3.26) (3.3.27)

By the trace property tr(ABCD) = (vec DT )T (CT ⊗ A)vec B, we have tr(B(dX)CdX) = (vec(dX)T )T (CT ⊗ B)vec(dX) = (d(Kmn vec X))T (CT ⊗ B)dvecX 

tr VdXU(dX)

 T

= (d(vec X))T Knm (CT ⊗ B)dvecX,

(3.3.28)

= (vec(dX)) (U ⊗ V)vec(dX) T

T

= (d(vec X))T (UT ⊗ V)dvecX,

(3.3.29)

KTmn = Knm. in which we have used the results Kmn vec Am×n = vec(ATm×n ) and  The expressions d2 f (X) = tr(B(dX)CdX) and d2 f (X) = tr V(dX)U(dX)T can be viewed as two canonical forms for the second-order matrix differentials of the real function f (X).   Each of the second-order differentials tr V(dX)U(dX)T and tr(B(dX)CdX) can be equivalently expressed as the the canonical quadratic form required by Proposition 3.3, so this can be equivalently stated as the following theorem for Hessian matrix identification. THEOREM 3.3 [311, p. 192] Let the real function f (X) with the m×n real matrix X as variable be second-order differentiable. Then there is the following relation between the second-order differential matrix d2 f (X) and the Hessian matrix H[f (X)]:   d2 f (X) = tr V(dX)U(dX)T 1 ⇔ H[f (X)] = (UT ⊗ V + U ⊗ VT ), (3.3.30) 2

156

Matrix Differential

where U ∈ Rn×n , V ∈ Rm×m , or



d2 f (X) = tr(B(dX)CdX) 1 H[f (X)] = Knm (CT ⊗ B + BT ⊗ C), 2

(3.3.31)

where Knm is the commutation matrix. Theorem 3.3 shows that the basic problem in the Hessian matrix identification is how to express the second-order matrix differential of a given real scalar function as one of the two canonical forms required by Theorem 3.3. The following three examples show how to find the Hessian matrix of a given real function using Theorem 3.3. EXAMPLE 3.17 Consider the real function f (X) = tr(X−1 ), where X is an n × n matrix. First, computing the first-order differential   df (X) = −tr X−1 (dX)X−1 and then using d(trU) = tr(dU), we get the second-order differential     d2 f (X) = −tr (dX−1 )(dX)X−1 − tr X−1 (dX)(dX−1 )   = 2 tr X−1 (dX)X−1 (dX)X−1   = 2 tr X−2 (dX)X−1 dX . Now, by Theorem 3.3, we get the Hessian matrix H[f (X)] = EXAMPLE 3.18 tial is

  ∂ 2 tr(X−1 ) = Knn X−T ⊗ X−2 + (X−2 )T ⊗ X−1 . ∂ vecX∂(vecX)T For the real function f (X) = tr(XT AX), its first-order differen  df (X) = tr XT (A + AT )dX .

Thus, the second-order differential is   d2 f (X) = tr (A + AT )(dX)(dX)T . By Theorem 3.3, the Hessian matrix is given by H[f (X)] =

∂ 2 tr(XT AX) = I ⊗ (A + AT ). ∂ vecX∂(vecX)T

EXAMPLE 3.19 The first-order differential of the function log |Xn×n | is d log |X| = tr(X−1 dx), and the second-order differential is −tr(X−1 (dx)X−1 dx). From Theorem 3.3, the Hessian matrix is obtained as follows: H[f (X)] =

∂ 2 log |X| = −Knn (X−T ⊗ X−1 ). ∂ vecX∂ (vecX)T

3.4 Complex Gradient Matrices

157

3.4 Complex Gradient Matrices In array processing and wireless communications, a narrowband signal is usually expressed as a complex equivalent baseband signal, and thus the transmit and receive signals with system parameters are expressed as complex vectors. In these applications, the objective function of an optimization problem is a real-valued function of a complex vector or matrix. Hence, the gradient of the objective function is a complex vector or matrix. Obviously, this complex gradient has the following two forms: (1) complex gradient the gradient of the objective function with respect to the complex vector or matrix variable itself; (2) conjugate gradient the gradient of the objective function with respect to the complex conjugate vector or matrix variable.

3.4.1 Holomorphic Function and Complex Partial Derivative Before discussing the complex gradient and conjugate gradient, it is necessary to recall the relevant facts about complex functions. For convenience, we first list the standard symbols for complex variables and complex functions. • z = [z1 , . . . , zm ]T ∈ Cm is a complex variable vector whose complex conjugate is z∗ . • Z = [z1 , . . . , zn ] ∈ Cm×n is a complex variable matrix with complex conjugate Z∗ . • f (z, z∗ ) ∈ C is a complex scalar function of m × 1 complex variable vectors z and z∗ , denoted f : Cm × Cm → C. • f (Z, Z∗ ) ∈ C is a complex scalar function of m × n complex variable matrices Z and Z∗ , denoted f : Cm×n × Cm×n → C. • f (z, z∗ ) ∈ Cp is a p × 1 complex vector function of m × 1 complex variable vectors z and z∗ , denoted f : Cm × Cm → Cp . • f (Z, Z∗ ) ∈ Cp is a p × 1 complex vector function of m × n complex variable matrices Z and Z∗ , denoted f : Cm×n × Cm×n → Cp . • F(z, z∗ ) ∈ Cp×q is a p × q complex matrix function of m × 1 complex variable vectors z and z∗ , denoted F : Cm × Cm → Cp×q . • F(Z, Z∗ ) ∈ Cp×q is a p × q complex matrix function of m × n complex variable matrices Z and Z∗ , denoted F : Cm×n × Cm×n → Cp×q . Table 3.7 summarizes the classification of complex-valued functions.

158

Matrix Differential

Table 3.7 Classification of complex-valued functions Function

Variables z, z ∗ ∈ C

Variables z, z∗ ∈ Cm

Variables Z, Z∗ ∈ Cm×n

f ∈C

f (z, z ∗ ) f :C×C→C

f (z, z∗ ) Cm × Cm → C

f (Z, Z∗ ) Cm×n × Cm×n → C

f ∈ Cp

f (z, z ∗ ) f : C × C → Cp

f (z, z∗ ) Cm × Cm → Cp

f (Z, Z∗ ) Cm×n × Cm×n → Cp

F ∈ Cp×q

F(z, z ∗ ) F : C × C → Cp×q

F(z, z∗ ) Cm × Cm → Cp×q

F(Z, Z∗ ) Cm×n × Cm×n → Cp×q

DEFINITION 3.10 [264] Let D ⊆ C be the definition domain of the function f : D → C. The function f (z) with complex variable z is said to be a complex analytic function in the domain D if f (z) is complex differentiable, namely lim

Δz→0

f (z + Δz) − f (z) exists for all z ∈ D. Δz

The terminology “complex analytic” is commonly replaced by the completely synonymous terminology “holomorphic”. Hence, a complex analytic function is usually called a holomorphic function. It is noted that a complex function is (real) analytic in the real-variable x-domain and y-domain, but it is not necessarily holomorphic in the complex variable domain z = x + jy, i.e., it may be complex nonanalytic. It is assumed that the complex function f (z) can be expressed in terms of its real part u(x, y) and imaginary part v(x, y) as f (z) = u(x, y) + jv(x, y), where z = x + jy and both u(x, y) and v(x, y) are real functions. For a holomorphic scalar function, the following four statements are equivalent [155]. (1) The complex function f (z) is a holomorphic (i.e., complex analytic) function. (2) The derivative f (z) of the complex function exists and is continuous. (3) The complex function f (z) satisfies the Cauchy–Riemann condition ∂v ∂u = ∂x ∂y

and

∂v ∂u =− . ∂x ∂y

(3.4.1)

(4) All derivatives of the complex function f (z) exist, and f (z) has a convergent power series. The Cauchy–Riemann condition is also known as the Cauchy–Riemann equations, and its direct result is that the function f (z) = u(x, y) + jv(x, y) is a holomorphic function only when both the real functions u(x, y) and v(x, y) satisfy the Laplace

3.4 Complex Gradient Matrices

159

equation at the same time: ∂ 2 u(x, y) ∂ 2 u(x, y) + =0 ∂x2 ∂y 2

and

∂ 2 v(x, y) ∂ 2 v(x, y) + = 0. ∂x2 ∂y 2

(3.4.2)

A real function g(x, y) is called a harmonic function, if it satisfies the Laplace equation ∂ 2 g(x, y) ∂ 2 g(x, y) + = 0. (3.4.3) ∂x2 ∂y 2 A complex function f (z) = u(x, y) + jv(x, y) is not a holomorphic function if either of the real functions u(x, y) and v(x, y) does not meet the Cauchy–Riemann condition or the Laplace equation. Although the power function z n , the exponential function ez , the logarithmic function ln z, the sine function sin z and the cosine function cos z are holomorphic functions, i.e., analytic functions in the complex plane, many commonly used functions are not holomorphic, as shown in the following examples. • For the complex function f (z) = z ∗ = x − jy = u(x, y) + jv(x, y), its real part u(x, y) = x and imaginary part v(x, y) = −y obviously do not meet the Cauchy– Riemann condition

∂u ∂v . = ∂x ∂y

• Any nonconstant real function f (z) ∈ R does not satisfy the Cauchy–Riemann conditions

∂u ∂u ∂v ∂v and = = − , because the imaginary part v(x, y) of f (z) = ∂x ∂y ∂y ∂x 

u(x, y) + jv(x, y) is zero. In particular, the real function f (z) = |z| = x2 + y 2 is nondifferentiable, while for f (z) = |z|2 = x2 + y 2 = u(x, y) + jv(x, y), its real part u(x, y) = x2 + y 2 is not a harmonic function because the Laplace condition is not satisfied, namely,

∂ 2 u(x, y) ∂ 2 u(x, y) + = 0. 2 ∂x ∂y 2

• The complex functions f (z) = Re z = x and f (z) = Im z = y do not meet the Cauchy–Riemann condition. Since many common complex functions f (z) are not holomorphic, a natural question to ask is whether there is a general representation form that can ensure that any complex function is holomorphic. To answer this question, it is necessary to recall the definition on the derivative of a complex number z and its conjugate z ∗ in complex function theory. The formal partial derivatives of complex numbers are defined as   ∂ 1 ∂ ∂ , (3.4.4) = −j ∂z 2 ∂x ∂y   ∂ 1 ∂ ∂ . (3.4.5) = +j ∗ ∂z 2 ∂x ∂y The formal partial derivatives above were presented by Wirtinger [511] in 1927, so they are sometimes called Wirtinger partial derivatives.

160

Matrix Differential

Regarding the partial derivatives of the complex variable z = x + jy, there is a basic assumption on the independence of the real and imaginary parts: ∂x =0 ∂y

and

∂y = 0. ∂x

(3.4.6)

By the definition of the partial derivative and by the above independence assumption, it is easy to find that     1 ∂y ∂z ∂x ∂y 1 ∂x ∂x ∂y + j = + j = + j + j ∂z ∗ ∂z ∗ ∂z ∗ 2 ∂x ∂y 2 ∂x ∂y 1 1 = (1 + 0) + j (0 + j), 2 2     ∂x ∂y 1 ∂x ∂x 1 ∂y ∂y ∂z ∗ = −j = −j −j −j ∂z ∂z ∂z 2 ∂x ∂y 2 ∂x ∂y 1 1 = (1 − 0) − j (0 − j). 2 2 This implies that ∂z ∗ ∂z =0 and = 0. (3.4.7) ∗ ∂z ∂z Equation (3.4.7) reveals a basic result in the theory of complex variables: the complex variable z and the complex conjugate variable z ∗ are independent variables. In the standard framework of complex functions, a complex function f (z) (where def

z = x + jy) is written in the real polar coordinates r = (x, y) as f (r) = f (x, y). However, in the framework of complex derivatives, under the independence assumption of the real and imaginary parts, the complex function f (z) is written as def

f (c) = f (z, z ∗ ) in the conjugate coordinates c = (z, z ∗ ) instead of the polar coordinates r = (x, y). Hence, when finding the complex partial derivative ∇z f (z, z ∗ ) and the complex conjugate partial derivative ∇z∗ f (z, z ∗ ), the complex variable z and the complex conjugate variable z ∗ are regarded as two independent variables:  ∂f (z, z ∗ )  ∗ ∇z f (z, z ) = ,  ∗ ∂z z =const  (3.4.8) ∂f (z, z ∗ )  ∗ ∇z ∗ f (z, z ) = . ∂z ∗ z=const This implies that when any nonholomorphic function f (z) is written as f (z, z ∗ ), it becomes holomorphic, because, for a fixed z ∗ value, the complex function f (z, z ∗ ) is analytic on the whole complex plane z = x + jy; and, for a fixed z value, the complex function f (z, z ∗ ) is analytic on the whole complex plane z ∗ = x − jy, see, e.g., [155], [263]. EXAMPLE 3.20 Given a real function f (z, z ∗ ) = |z|2 = zz ∗ , its first-order partial derivatives ∂|z|2 /∂z = z ∗ and ∂|z|2 /∂z ∗ = z exist and are continuous. This is to say, although f (z) = |z|2 itself is not a holomorphic function with respect to z, the

3.4 Complex Gradient Matrices

161

function f (z, z ∗ ) = |z|2 = zz ∗ is analytic on the whole complex plane z = x + jy (when z ∗ is fixed as a constant) and is analytic on the whole complex plane z ∗ = x − jy (when z is fixed as a constant). Table 3.8 gives for comparison the nonholomorphic and holomorphic representation forms of complex functions. Table 3.8 Nonholomorphic and holomorphic functions Functions Coordinates Representation

Nonholomorphic  def r = (x, y) ∈ R × R z = x + jy

Holomorphic  def c = (z, z ∗ ) ∈ C × C z = x + jy, z ∗ = x − jy

f (r) = f (x, y)

f (c) = f (z, z ∗ )

The following are common formulas and rules for the complex partial derivatives [263]: (1) The conjugate partial derivative of the complex conjugate function ∂f ∗ (z, z ∗ ) = ∂z ∗



∂f (z, z ∗ ) ∂z

∗ .

(2) The partial derivative of the conjugate of the complex function ∂f ∗ (z, z ∗ ) = ∂z



∂f (z, z ∗ ) ∂z ∗

∗

∂f ∗ (z, z ∗ ) : ∂z ∗

(3.4.9) ∂f ∗ (z, z ∗ ) : ∂z

.

(3.4.10)

(3) Complex differential rule df (z, z ∗ ) =

∂f (z, z ∗ ) ∂f (z, z ∗ ) ∗ dz . dz + ∂z ∂z ∗

(3.4.11)

(4) Complex chain rule ∂h(g(z, z ∗ )) ∂h(g(z, z ∗ )) ∂g(z, z ∗ ) ∂h(g(z, z ∗ )) ∂g ∗ (z, z ∗ ) = + , ∂z ∂g(z, z ∗ ) ∂z ∂g ∗ (z, z ∗ ) ∂z ∂h(g(z, z ∗ )) ∂g(z, z ∗ ) ∂h(g(z, z ∗ )) ∂g ∗ (z, z ∗ ) ∂h(g(z, z ∗ )) = + . ∂z ∗ ∂g(z, z ∗ ) ∂z ∗ ∂g ∗ (z, z ∗ ) ∂z ∗

(3.4.12) (3.4.13)

3.4.2 Complex Matrix Differential The concepts of the complex function f (z) and the holomorphic function f (z, z ∗ ) of a complex scalar z can be easily extended to the complex matrix function F(Z) and the holomorphic complex matrix function F(Z, Z∗ ).

162

Matrix Differential

On the holomorphic complex matrix functions, the following statements are equivalent [68]. (1) The matrix function F(Z) is a holomorphic function of the complex matrix variable Z. ∂ vecF(Z) (2) The complex matrix differential d vecF(Z) = d vecZ. T

∂(vecZ) ∂ vecF(Z) = O (the zero matrix) holds. (3) For all Z, ∂(vec Z∗ )T ∂ vecF(Z) ∂ vecF(Z) +j = O holds. (4) For all Z, ∂(vecReZ)T ∂(vecImZ)T

The complex matrix function F(Z, Z∗ ) is obviously a holomorphic function, and its matrix differential is ∂ vecF(Z, Z∗ ) ∂ vecF(Z, Z∗ ) d vecZ + d vecZ∗ . ∂(vecZ)T ∂(vecZ∗ )T

d vecF(Z, Z∗ ) =

(3.4.14)

The partial derivative of the holomorphic function F(Z, Z∗ ) with respect to the real part ReZ of the matrix variable Z is given by ∂ vecF(Z, Z∗ ) ∂ vecF(Z, Z∗ ) ∂ vecF(Z, Z∗ ) = + , T ∂(vecReZ) ∂(vecZ)T ∂(vecZ∗ )T and the partial derivative of F(Z, Z∗ ) with respect to the imaginary part ImZ of the matrix variable Z is as follows:   ∂ vecF(Z, Z∗ ) ∂ vecF(Z, Z∗ ) ∂ vecF(Z, Z∗ ) . = j − ∂(vecImZ)T ∂(vecZ)T ∂(vecZ∗ )T The complex matrix differential d Z = [dZij ]m,n i=1,j=1 has the following properties [68]. 1. 2. 3. 4. 5.

Transpose dZT = d(ZT ) = (dZ)T . Hermitian transpose dZH = d(ZH ) = (dZ)H . Conjugate dZ∗ = d(Z∗ ) = (dZ)∗ . Linearity (additive rule) d(Y + Z) = dY + dZ. Chain rule If F is a function of Y, while Y is a function of Z, then d vecF = where

∂ vecF ∂ vecY ∂ vecF d vecY = d vecZ, ∂(vecY)T ∂(vecY)T ∂(vecZ)T

∂ vecF ∂ vecY and are the normal complex partial derivative and ∂(vecY)T ∂(vec Z)T

the generalized complex partial derivative, respectively. 6. Multiplication rule d(UV) = (dU)V + U(dV) dvec(UV) = (VT ⊗ I)dvecU + (I ⊗ U)dvecV. 7. Kronecker product d(Y ⊗ Z) = dY ⊗ Z + Y ⊗ dZ.

3.4 Complex Gradient Matrices

163

8. Hadamard product d(Y ∗ Z) = dY ∗ Z + Y ∗ dZ. In the following we derive the relationship between the complex matrix differential and the complex partial derivative. First, the complex differential rule for scalar variables, df (z, z ∗ ) =

∂f (z, z ∗ ) ∂f (z, z ∗ ) ∗ dz dz + ∂z ∂z ∗

(3.4.15)

is easily extended to a complex differential rule for the multivariate real scalar ∗ )): function f (·) = f ((z1 , z1∗ ), . . . , (zm , zm ∂f (·) ∂f (·) ∗ ∂f (·) ∗ ∂f (·) dz1 + · · · + dzm + dz1 + · · · + dzm ∗ ∂z1 ∂zm ∂z1∗ ∂zm ∂f (·) ∂f (·) ∗ = dz + dz . (3.4.16) ∂ zT ∂ zH

df (·) =

This complex differential rule is the basis of the complex matrix differential. In particular, if f (·) = f (z, z∗ ), then df (z, z∗ ) =

∂f (z, z∗ ) ∂f (z, z∗ ) ∗ dz + dz T ∂z ∂ zH

or, simply denoted, df (z, z∗ ) = Dz f (z, z∗ ) dz + Dz∗ f (z, z∗ ) dz∗ . ∗ T ] and Here dz = [dz1 , . . . , dzm ]T , dz∗ = [dz1∗ , . . . , dzm    ∂f (z, z∗ )  ∂f (z, z∗ ) ∂f (z, z∗ ) Dz f (z, z∗ ) = , = , . . . , ∂ zT z∗ =const ∂z1 ∂zm    ∂f (z, z∗ )  ∂f (z, z∗ ) ∂f (z, z∗ ) ∗ Dz∗ f (z, z ) = = ,..., ∗ ∂ zH z=const ∂z1∗ ∂zm

(3.4.17)

(3.4.18) (3.4.19)

are respectively the cogradient vector and the conjugate cogradient vector of the real scalar function f (z, z∗ ), while   ∂ def ∂ ∂ Dz = , (3.4.20) = , . . . , ∂ zT ∂z1 ∂zm   ∂ def ∂ ∂ (3.4.21) = , . . . , D z∗ = ∗ ∂ zH ∂z1∗ ∂zm are termed the cogradient operator and the conjugate cogradient operator of complex vector variable z ∈ Cm , respectively. Let z = x + jy = [z1 , . . . , zm ]T ∈ Cm with x = [x1 , . . . , xm ]T ∈ Rm , y = [y1 , . . . , ym ]T ∈ Rm , namely zi = xi + jyi , i = 1, . . . , m; the real part xi and the imaginary part yi are two independent variables.

164

Matrix Differential

Applying the complex partial derivative operators   ∂ 1 ∂ ∂ = −j Dzi = , ∂zi 2 ∂xi ∂yi   ∂ ∂ 1 ∂ Dzi∗ = ∗ = +j ∂zi 2 ∂xi ∂yi

(3.4.22) (3.4.23)

to each element of the row vector zT = [z1 , . . . , zm ], we obtain the following complex cogradient operator   ∂ ∂ 1 ∂ Dz = (3.4.24) = − j ∂ zT 2 ∂ xT ∂ yT and the complex conjugate cogradient operator   ∂ ∂ 1 ∂ Dz∗ = = +j T . ∂ zH 2 ∂ xT ∂y

(3.4.25)

Similarly, the complex gradient operator and the complex conjugate gradient operator are respectively defined as T  ∂ def ∂ ∂ ∇z = = ,..., , ∂z ∂z1 ∂zm T  ∂ def ∂ ∂ ∇z ∗ = = , . . . , . ∗ ∂ z∗ ∂z1∗ ∂zm

(3.4.26) (3.4.27)

Hence, the complex gradient vector and the complex conjugate gradient vector of the real scalar function f (z, z∗ ) are respectively defined as  ∂f (z, z∗ )  ∗ ∇z f (z, z ) = = (Dz f (z, z∗ ))T (3.4.28) ∂ z z∗ = const vector  ∂f (z, z∗ )  ∇z∗ f (z, z∗ ) = = (Dz∗ f (z, z∗ ))T . (3.4.29) ∂ z∗ z = const vector Applying the complex partial derivative operator to each element of the complex vector z = [z1 , . . . , zm ]T , we get the complex gradient operator   ∂ ∂ 1 ∂ (3.4.30) ∇z = = −j ∂z 2 ∂x ∂y and the complex conjugate gradient operator   ∂ ∂ 1 ∂ ∇z∗ = . = + j ∂ z∗ 2 ∂x ∂y

(3.4.31)

By the definitions of the complex gradient operator and the complex conjugate

3.4 Complex Gradient Matrices

165

gradient operator, it is not difficult to obtain     ∂ zT ∂ xT ∂ yT 1 ∂ xT ∂ xT 1 ∂ yT ∂ yT = +j = −j +j −j = Im×m , ∂z ∂z ∂z 2 ∂x ∂y 2 ∂x ∂y     1 ∂ yT ∂ zT ∂ xT ∂ yT 1 ∂ xT ∂ xT ∂ yT + j = Om×m . = + j = + j + j ∂ z∗ ∂z∗ ∂ z∗ 2 ∂x ∂y 2 ∂x ∂y When finding the above two equations, we used

∂ xT ∂ xT = Om×m and = Im×m , ∂x ∂y

∂ yT ∂ yT = Om×m , because the real part x and the imaginary part y of = Im×m , ∂y ∂x

the complex vector variable z are independent. Summarizing the above results and their conjugate, transpose and complex conjugate transpose, we have the following important results: ∂ zT = I, ∂z ∂ zT = O, ∂ z∗

∂ zH = I, ∂ z∗ ∂ zH = O, ∂z

∂z = I, ∂ zT ∂z = O, ∂ zH

∂ z∗ = I, ∂ zH ∂ z∗ = O. ∂ zT

(3.4.32) (3.4.33)

The above results reveal an important fact of the complex matrix differential: under the basic assumption that the real part and imaginary part of a complex vector are independent, the complex vector variable z and its complex conjugate vector variable z∗ can be viewed as two independent variables. This important fact is not surprising because the angle between z and z∗ is π/2, i.e., they are orthogonal to each other. Hence, we can summarize the rules for using the cogradient operator and gradient operator as follows. (1) When using the complex cogradient operator ∂/∂ zT or the complex gradient operator ∂/∂ z, the complex conjugate vector variable z∗ is handled as a constant vector. (2) When using the complex conjugate cogradient operator ∂/∂ zH or the complex conjugate gradient operator ∂/∂ z∗ , the vector variable z is handled as a constant vector. We might regard the above rules as embodying the independent rule of complex partial derivative operators: when applying the complex partial derivative operator, the complex cogradient operator, the complex conjugate cogradient operator, the complex gradient operator or the complex conjugate gradient operator, the complex vector variables z and z∗ can be handled as independent vector variables, namely, when one vector is the variable of a given function, the other can be viewed as a constant vector. At this point, let us consider the real scalar function f (Z, Z∗ ) with variables Z, Z∗ ∈ Cm×n . Performing the vectorization of Z and Z∗ , respectively, from Equation (3.4.17) we get the first-order complex differential rule for the real scalar

166

Matrix Differential

function f (Z, Z∗ ): ∂f (Z, Z∗ ) dvecZ + ∂(vecZ)T ∂f (Z, Z∗ ) = dvecZ + ∂(vecZ)T

df (Z, Z∗ ) =

∂f (Z, Z∗ ) dvecZ∗ ∂(vecZ∗ )T ∂f (Z, Z∗ ) dvecZ∗ , ∂(vecZ∗ )T

(3.4.34)

where   ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) , = ,..., ,..., ,..., ∂(vec Z)T ∂Z11 ∂Zm1 ∂Z1n ∂Zmn   ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) . = , . . . , , . . . , , . . . , ∗ ∗ ∗ ∗ ∂(vec Z∗ )T ∂Z11 ∂Zm1 ∂Z1n ∂Zmn Now define the complex cogradient vector and the complex conjugate cogradient vector: ∂f (Z, Z∗ ) , ∂(vec Z)T ∂f (Z, Z∗ ) Dvec Z∗ f (Z, Z∗ ) = . ∂(vec Z∗ )T Dvec Z f (Z, Z∗ ) =

(3.4.35) (3.4.36)

The complex gradient vector and the complex conjugate gradient vector of the function f (Z, Z∗ ) are respectively defined as ∂f (Z, Z∗ ) , ∂ vec Z ∂f (Z, Z∗ ) . ∇vec Z∗ f (Z, Z∗ ) = ∂ vec Z∗ ∇vec Z f (Z, Z∗ ) =

(3.4.37) (3.4.38)

The conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) has the following properties [68]: (1) The conjugate gradient vector of the function f (Z, Z∗ ) at an extreme point is equal to the zero vector, i.e., ∇vec Z∗ f (Z, Z∗ ) = 0. (2) The conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) and the negative conjugate gradient vector −∇vec Z∗ f (Z, Z∗ ) point in the direction of the steepest ascent and steepest descent of the function f (Z, Z∗ ), respectively. (3) The step length of the steepest increase slope is ∇vec Z∗ f (Z, Z∗ ) 2 . (4) The conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) is the normal to the surface f (Z, Z∗ ) = const. Hence, the conjugate gradient vector ∇vec Z∗ f (Z, Z∗ ) and the negative conjugate gradient vector −∇vec Z∗ f (Z, Z∗ ) can be used separately in gradient ascent algorithms and gradient descent algorithms. Furthermore, the complex Jacobian matrix and the complex conjugate Jacobian

3.4 Complex Gradient Matrices

167

matrix of the real scalar function f (Z, Z∗ ) are respectively as follows:  ∂f (Z, Z∗ )  DZ f (Z, Z ) = ∂ ZT Z∗ = const matrix ⎡ ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ⎤ ··· ∂Zm1 ⎥ ⎢ ∂Z11 ⎢ ⎥ .. .. .. =⎢ ⎥, . . . ⎣ ⎦ ∗ ∗ ∂f (Z, Z ) ∂f (Z, Z ) ··· ∂Z1n ∂Zmn  ∗  ) ∂f (Z, Z def  DZ∗ f (Z, Z∗ ) = ∂ ZH Z = const matrix ⎡ ⎤ ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) · · · ∗ ∗ ⎢ ∂Z11 ⎥ ∂Zm1 ⎢ ⎥ . . . ⎥. . . . =⎢ . . . ⎢ ⎥ ⎣ ∂f (Z, Z∗ ) ∗ ⎦ ∂f (Z, Z ) ··· ∗ ∗ ∗

def

∂Z1n

(3.4.39)

(3.4.40)

∂Zmn

Similarly, the complex gradient matrix and the complex conjugate gradient matrix of the real scalar function f (Z, Z∗ ) are respectively given by  ∂f (Z, Z∗ )  ∇Z f (Z, Z ) =  ∗ ∂Z Z = const matrix ⎡ ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) ⎤ ··· ∂Z1n ⎥ ⎢ ∂Z11 ⎢ ⎥ .. .. .. =⎢ ⎥, . . . ⎣ ⎦ ∗ ∗ ∂f (Z, Z ) ∂f (Z, Z ) ··· ∂Zm1 ∂Zmn  ∗  ) ∂f (Z, Z def  ∇Z∗ f (Z, Z∗ ) = ∂ Z∗ Z = const matrix ⎡ ⎤ ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) · · · ∗ ∗ ⎥ ⎢ ∂Z11 ∂Z1n ⎥ ⎢ . . . ⎥. . . . =⎢ . . . ⎥ ⎢ ⎣ ∂f (Z, Z∗ ) ∗ ⎦ ∂f (Z, Z ) ··· ∗ ∗ ∗

def

∂Zm1

(3.4.41)

(3.4.42)

∂Zmn

Summarizing the definitions above, there are the following relations among the various complex partial derivatives of the real scalar function f (Z, Z): (1) The conjugate gradient (cogradient) vector is equal to the complex conjugate of the gradient (cogradient) vector; and the conjugate Jacobian (gradient) matrix is equal to the complex conjugate of the Jacobian (gradient) matrix. (2) The gradient (conjugate gradient) vector is equal to the transpose of the co-

168

Matrix Differential

gradient (conjugate cogradient) vector, namely ∇vec Z f (Z, Z∗ ) = DTvec Z f (Z, Z∗ ), ∗

∇vec Z∗ f (Z, Z ) =

(3.4.43)

DTvec Z∗ f (Z, Z∗ ).

(3.4.44)

(3) The cogradient (conjugate cogradient) vector is equal to the transpose of the vectorization of Jacobian (conjugate Jacobian) matrix: Dvec Z f (Z, Z) = (vecDZ f (Z, Z∗ )) , T



(3.4.45)

T

Dvec Z∗ f (Z, Z) = (vecDZ∗ f (Z, Z )) .

(3.4.46)

(4) The gradient (conjugate gradient) matrix is equal to the transpose of the Jacobian (conjugate Jacobian) matrix: ∇Z f (Z, Z∗ ) = DTZ f (Z, Z∗ ), ∗

∇Z∗ f (Z, Z ) =

(3.4.47)

DTZ∗ f (Z, Z∗ ).

(3.4.48)

The following are the rules of operation for the complex gradient. 1. If f (Z, Z∗ ) = c (a constant), then its gradient matrix and conjugate gradient matrix are equal to the zero matrix, namely ∂c/∂ Z = O and ∂c/∂ Z∗ = O. 2. Linear rule If f (Z, Z∗ ) and g(Z, Z∗ ) are scalar functions, and c1 and c2 are complex numbers, then ∂(c1 f (Z, Z∗ ) + c2 g(Z, Z∗ )) ∂f (Z, Z∗ ) ∂g(Z, Z∗ ) = c1 + c2 . ∗ ∗ ∂Z ∂Z ∂ Z∗ 3. Multiplication rule ∗ ∗ ∂f (Z, Z∗ )g(Z, Z∗ ) ∗ ∂f (Z, Z ) ∗ ∂g(Z, Z ) = g(Z, Z ) + f (Z, Z ) . ∂Z∗ ∂ Z∗ ∂ Z∗

If g(Z, Z∗ ) = 0 then   ∗ ∗ ∂f /g 1 ∗ ∂f (Z, Z ) ∗ ∂g(Z, Z ) g(Z, Z ) . = 2 − f (Z, Z ) ∂ Z∗ g (Z, Z∗ ) ∂ Z∗ ∂ Z∗

4. Quotient rule

If h(Z, Z∗ ) = g(F(Z, Z∗ ), F∗ (Z, Z∗ )) then the quotient rule becomes ∂h(Z, Z∗ ) ∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vec F(Z, Z∗ )) = T ∂ vec Z ∂ vec Z ∂ (vec F(Z, Z∗ ))

T

∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vec F∗ (Z, Z∗ )) , T ∂ vec Z ∂ (vec F∗ (Z, Z∗ )) T

+

(3.4.49)

∂h(Z, Z∗ ) ∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vecF(Z, Z∗ )) = T ∂ vec Z∗ ∂ vec Z∗ ∂ (vec F(Z, Z∗ ))

T

∂g(F(Z, Z∗ ), F∗ (Z, Z∗ )) ∂ (vec F∗ (Z, Z∗ )) . T ∂ vec Z∗ ∂ (vecF∗ (Z, Z∗ )) T

+

(3.4.50)

3.4 Complex Gradient Matrices

169

3.4.3 Complex Gradient Matrix Identification If we let A = DZ f (Z, Z∗ ) and

B = DZ∗ f (Z, Z∗ ),

(3.4.51)

then ∂f (Z, Z∗ ) = rvecDZ f (Z, Z∗ ) = rvecA = (vec(AT ))T , ∂(vec Z)T ∂f (Z, Z∗ ) = rvecDZ∗ f (Z, Z∗ ) = rvecB = (vec(BT ))T . ∂(vec Z)H

(3.4.52) (3.4.53)

Hence, the first-order complex matrix differential formula (3.4.34) can be rewritten as df (Z, Z∗ ) = (vec(AT ))T dvec Z + (vec(BT ))T dvecZ∗ .

(3.4.54)

Using tr(CT D) = (vec C)T vecD, Equation (3.4.54) can be written as df (Z, Z∗ ) = tr(AdZ + BdZ∗ ).

(3.4.55)

From Equations (3.4.51)–(3.4.55) we can obtain the following proposition for identifying the complex Jacobian matrix and the complex gradient matrix. PROPOSITION 3.4 Given a scalar function f (Z, Z∗ ) : Cm×n × Cm×n → C, its complex Jacobian and gradient matrices can be respectively identified by + DZ f (Z, Z∗ ) = A, ∗ ∗ df (Z, Z ) = tr(AdZ + BdZ ) ⇔ (3.4.56) DZ∗ f (Z, Z∗ ) = B + ∇Z f (Z, Z∗ ) = AT , ∗ ∗ df (Z, Z ) = tr(AdZ + BdZ ) ⇔ (3.4.57) ∇Z∗ f (Z, Z∗ ) = BT . That is to say, the complex gradient matrix and the complex conjugate gradient matrix are respectively identified as the transposes of the matrices A and B. This proposition shows that the key to identifying a complex Jacobian or gradient matrix is to write the matrix differential of the function f (Z, Z∗ ) in the canonical form df (Z, Z∗ ) = tr(AdZ+BdZ∗ ). In particular, if f (Z, Z∗ ) is a real function then B = A∗ . EXAMPLE 3.21

The matrix differential of tr(ZAZ∗ B) is given by

d(tr(ZAZ∗ B)) = tr((dZ)AZ∗ B) + tr(ZA(dZ∗ )B) = tr(AZ∗ BdZ) + tr(BZAdZ∗ ) from which it follows that the gradient matrix and the complex conjugate gradient

170

Matrix Differential

matrix of tr(ZAZ∗ B) are respectively given by ∇Z tr(ZAZ∗ B) = (AZ∗ B)T = BT ZH AT , ∇Z∗ tr(ZAZ∗ B) = (BZA)T = AT ZT BT . EXAMPLE 3.22

From the complex matrix differentials



d|ZZ | = |ZZ∗ |tr((ZZ∗ )−1 d(ZZ∗ )) = |ZZ∗ |tr(Z∗ (ZZ∗ )−1 dZ) + |ZZ∗ |tr((ZZ∗ )−1 ZdZ∗ ), d|ZZH | = |ZZH |tr((ZZH )−1 d(ZZH )) = |ZZH |tr(ZH (ZZH )−1 dZ) + |ZZH |tr(ZT (Z∗ ZT )−1 dZ∗ ), we get the complex gradient matrix and the complex conjugate gradient matrix as follows: ∇Z |ZZ∗ | = |ZZ∗ |(ZH ZT )−1 ZH ,

∇Z∗ |ZZ∗ | = |ZZ∗ |ZT (ZH ZT )−1 ,

∇Z |ZZH | = |ZZH |(Z∗ ZT )−1 Z∗ ,

∇Z∗ |ZZH | = |ZZH |(ZZH )−1 Z.

Table 3.9 lists the complex gradient matrices of several trace functions. Table 3.9 Complex gradient matrices of trace functions f (Z, Z∗ )

df

∂f /∂Z

∂f /∂ Z∗

tr(AZ)

tr(AdZ)

AT

O

H



T

tr(AZ )

tr(A dZ )

T

tr(ZAZ B) tr(ZAZB) ∗

tr(ZAZ B)

O

T

T

T

T

tr((AZ B + A Z B )dZ) tr((AZB + BZA)dZ) ∗

A

T

T

B ZA + BZA

O

T

O

(AZB + BZA) ∗

tr(AZ BddZ + BZAddZ )

T

H

B Z A

T

AT ZT BT

tr(ZAZH B) tr(AZH BdZ + AT ZT BT dZ∗ ) BT Z∗ AT tr(AZ k

−1

tr(Z )

)

−1

−tr(Z k tr(Z

AZ

k−1

−1

dZ)

dZ)

−T

−Z

T

BZA −T

A Z

T k−1

k(Z )

O O

Table 3.10 lists the complex gradient matrices of several determinant functions. If f (z, z∗ ) = [f1 (z, z∗ ), . . . , fn (z, z∗ )]T is an n × 1 complex vector function with m × 1 complex vector variable then ⎤ ⎡ ⎤ ⎤ ⎡ ⎡ df1 (z, z∗ ) Dz f1 (z, z∗ ) Dz∗ f1 (z, z∗ ) ⎥ ⎢ ⎥ ⎥ ∗ ⎢ ⎢ .. .. .. ⎦=⎣ ⎦ dz + ⎣ ⎦ dz , ⎣ . . . dfn (z, z∗ )

Dz fn (z, z∗ )

Dz∗ fn (z, z∗ )

which can simply be written as df (z, z∗ ) = Dz f (z, z∗ )dz + Dz∗ f (z, z∗ )dz∗ ,

(3.4.58)

3.4 Complex Gradient Matrices

171

Table 3.10 Complex gradient matrices of determinant functions f (Z, Z∗ ) df

∂f /∂ Z∗

∂f /∂Z

|Z|tr(Z−1 dZ) |Z|Z−T T T T −1 2|ZZ |tr(Z (ZZ ) dZ) 2|ZZT |(ZZT )−1 Z 2|ZT Z|tr((ZT Z)−1 ZT dZ) 2|ZT Z|Z(ZT Z)−1 ZZ∗ |tr(Z∗ (ZZ∗ )−1 dZ |ZZ∗ |(ZH ZT )−1 ZH + (ZZ∗ )−1 ZdZ∗ )

|Z| |ZZT | |ZT Z| |ZZ∗ |

|Z∗ Z|tr((Z∗ Z)−1 Z∗ dZ

|Z∗ Z|

|ZZH |tr(ZH (ZZH )−1 dZ + ZT (Z∗ ZT )−1 dZ∗ ) |ZH Z|tr((ZH Z)−1 ZH dZ

|ZH Z|

+ (ZT Z∗ )−1 ZT dZ∗ ) k|Z|k tr(Z−1 dZ)

|Zk |

|ZZ∗ |ZT (ZH ZT )−1

|Z∗ Z|ZH (ZT ZH )−1 |Z∗ Z|(ZT ZH )−1 ZT

+ Z(Z∗ Z)−1 dZ∗ )

|ZZH |

O O O

|ZZH |(Z∗ ZT )−1 Z∗

|ZZH |(ZZH )−1 Z

|ZH Z|Z∗ (ZT Z∗ )−1

|ZH Z|Z(ZH Z)−1

k|Z|k Z−T

O

where df (z, z∗ ) = [df1 (z, z∗ ), . . . , dfn (z, z∗ )]T , while ⎡ Dz f (z, z∗ ) =

∂f1 (z, z∗ ) ⎢ ∂z1

∂ f (z, z∗ ) ⎢ =⎢ ⎢ ∂ zT ⎣ ∂f

.. .

n (z, z



) ∂z1 ⎡ ∂f1 (z, z∗ ) ⎢ ∂z1∗

∂ f (z, z∗ ) ⎢ .. Dz∗ f (z, z∗ ) = =⎢ . ⎢ ∂ zH ⎣ ∂f (z, z∗ ) n ∂z1∗

··· ..

.

··· ··· ..

.

···



∂f1 (z, z∗ ) ∂zm ⎥

⎥ ⎥, ⎥ ∗ ⎦ ∂fn (z, z ) .. .

(3.4.59)

∂zm ⎤ ∂f1 (z, z∗ ) ∗ ∂zm ⎥

⎥ ⎥ ⎥ ∗ ⎦ ∂fn (z, z ) .. .

(3.4.60)

∗ ∂zm

are respectively the complex Jacobian matrix and the complex conjugate Jacobian matrix of the vector function f (z, z∗ ). The above result is easily extended to the case of complex matrix functions as follows. For a p×q matrix function F(Z, Z∗ ) with m×n complex matrix variable Z, if F(Z, Z∗ ) = [f1 (Z, Z∗ ), . . . , fq (Z, Z∗ )], then dF(Z, Z∗ ) = [df1 (Z, Z∗ ), . . . , dfq (Z, Z∗ )], and (3.4.58) holds for the vector functions fi (Z, Z∗ ), i = 1, . . . , q. This implies that ⎤ ⎡ ⎤ ⎤ ⎡ df1 (Z, Z∗ ) Dvec Z f1 (Z, Z∗ ) Dvec Z∗ f1 (Z, Z∗ ) ⎥ ⎢ ⎥ ⎥ ⎢ ⎢ .. .. .. ∗ ⎦=⎣ ⎦ dvecZ + ⎣ ⎦ dvecZ , (3.4.61) ⎣ . . . ⎡

dfq (Z, Z∗ )

Dvec Z fq (Z, Z∗ )

Dvec Z∗ fq (Z, Z∗ )

172

Matrix Differential

where Dvec Z fi (Z, Z∗ ) =

∂ fi (Z, Z∗ ) ∈ Cp×mn , ∂(vecZ)T

Dvec Z∗ fi (Z, Z∗ ) =

∂ fi (Z, Z∗ ) ∈ Cp×mn . ∂(vecZ∗ )T

Equation (3.4.61) can be simply rewritten as dvecF(Z, Z∗ ) = AdvecZ + BdvecZ∗ ∈ Cpq ,

(3.4.62)

where dvecF(Z, Z∗ )) = [df11 (Z, Z∗ ), . . . , dfp1 (Z, Z∗ ), . . . , df1q (Z, Z∗ ), . . . , dfpq (Z, Z∗ )]T , dvecZ = [dZ11 , . . . , dZm1 , . . . , dZ1n , . . . , dZmn ]T , ∗ ∗ ∗ ∗ dvecZ∗ = [dZ11 , . . . , dZm1 , . . . , dZ1n , . . . , dZmn ]T ,

while

⎡ ∂f (Z, Z∗ ) 11 ⎢ ∂Z11 ⎢ .. ⎢ . ⎢ ⎢ ∂fp1 (Z, Z∗ ) ⎢ ⎢ ∂Z11 ⎢ .. A=⎢ . ⎢ ⎢ ∂f (Z, Z∗ ) ⎢ 1q ⎢ ∂Z 11 ⎢ ⎢ .. ⎢ . ⎣ ∗ ∂fpq (Z, Z ) ∂Z11

··· .. . ··· .. . ··· .. . ···

∂f11 (Z, Z∗ ) ∂Zm1

.. .

··· .. .

.. .

··· .. .

.. .

··· .. .

∂fp1 (Z, Z∗ ) ∂Zm1 ∂f1q (Z, Z∗ ) ∂Zm1 ∂fpq (Z, Z∗ ) ∂Zm1

∂ vec F(Z, Z∗ ) = Dvec Z F(Z, Z∗ ), ∂(vec Z)T ⎡ ∂f (Z, Z∗ ) ∂f11 (Z, Z∗ ) 11 ··· ∗ ∗ ∂Zm1 ⎢ ∂Z11 ⎢ . . . ⎢ .. .. .. ⎢ ⎢ ∂fp1 (Z, Z∗ ) ∂fp1 (Z, Z∗ ) ⎢ ··· ⎢ ∂Z ∗ ∗ ∂Zm1 11 ⎢ ⎢ . . . .. .. .. B=⎢ ⎢ ⎢ ∂f1q (Z, Z∗ ) ∂f1q (Z, Z∗ ) ⎢ ··· ∗ ∗ ⎢ ∂Z11 ∂Zm1 ⎢ .. .. .. ⎢ ⎢ . . . ⎣ ∂f (Z, Z∗ ) ∂fpq (Z, Z∗ ) pq ··· ∗ ∗

···

∂f11 (Z, Z∗ ) ∂Z1n

.. .

··· .. .

.. .

··· .. .

.. .

··· .. .

∂fp1 (Z, Z∗ ) ∂Z1n ∂f1q (Z, Z∗ ) ∂Z1n ∂fpq (Z, Z∗ ) ∂Z1n

···



∂f11 (Z, Z∗ ) ∂Zmn ⎥

⎥ ⎥ ⎥ ∗ ⎥ ∂fp1 (Z, Z ) ⎥ ∂Zmn ⎥ ⎥ .. ⎥ . ⎥ ∂f1q (Z, Z∗ ) ⎥ ⎥ ∂Zmn ⎥ ⎥ ⎥ .. ⎥ . ∗ ⎦ .. .

∂fpq (Z, Z ) ∂Zmn

=

∂Z11

=



∂Zm1

(3.4.63) ··· .. . ··· .. . ··· .. . ···

∂ vec F(Z, Z ) = Dvec Z∗ F(Z, Z∗ ). ∂(vec Z∗ )T

∂f11 (Z, Z∗ ) ∗ ∂Z1n

···

∂fp1 (Z, Z∗ ) ∗ ∂Z1n

···

∂f1q (Z, Z∗ ) ∗ ∂Z1n

···

∂fpq (Z, Z∗ ) ∗ ∂Z1n

···

.. . .. . .. .

.. . .. . .. .



∂f11 (Z, Z∗ ) ∗ ∂Zmn ⎥

⎥ ⎥ ⎥ ∗ ⎥ ∂fp1 (Z, Z ) ⎥ ⎥ ∗ ∂Zmn ⎥ ⎥ .. ⎥ . ⎥ ∗ ⎥ ∂f1q (Z, Z ) ⎥ ∗ ⎥ ∂Zmn ⎥ .. ⎥ ⎥ . ⎦ ∂f (Z, Z∗ ) .. .

pq

∗ ∂Zmn

(3.4.64)

3.4 Complex Gradient Matrices

173

Obviously the matrices A and B are the complex Jacobian matrix and the complex conjugate Jacobian matrix, respectively. The complex gradient matrix and the complex conjugate gradient matrix of the matrix function F(Z, Z∗ ) are respectively defined as ∂(vec F(Z, Z∗ ))T = (Dvec Z F(Z, Z∗ ))T , ∂ vec Z ∂(vec F(Z, Z∗ ))T ∇vec Z∗ F(Z, Z∗ ) = = (Dvec Z∗ F(Z, Z∗ ))T . ∂ vec Z∗ ∇vec Z F(Z, Z∗ ) =

(3.4.65) (3.4.66)

In particular, for a real scalar function f (Z, Z∗ ), Equation (3.4.62) reduces to Equation (3.4.34). Summarizing the above discussions, from Equation (3.4.62) we get the following proposition. PROPOSITION 3.5 For a complex matrix function F(Z, Z∗ ) ∈ Cp×q with Z, Z∗ ∈ Cm×n , its complex Jacobian matrix and the conjugate Jacobian matrix can be identified as follows: dvecF(Z, Z∗ ) = Advec Z + BdvecZ∗ + Dvec Z F(Z, Z∗ ) = A, ⇔ Dvec Z∗ F(Z, Z∗ ) = B,

(3.4.67)

and the complex gradient matrix and the conjugate gradient matrix can be identified by dvecF(Z, Z∗ ) = AdvecZ + BdvecZ∗ + ∇vec Z F(Z, Z∗ ) = AT , ⇔ ∇vec Z∗ F(Z, Z∗ ) = BT .

(3.4.68)

If d(F(Z, Z∗ )) = A(dZ)B + C(dZ∗ )D, then the vectorization result is given by dvecF(Z, Z∗ ) = (BT ⊗ A)dvecZ + (DT ⊗ C)dvecZ∗ . By Proposition 3.5 we have the following identification formula: dF(Z, Z∗ ) = A(dZ)B + C(dZ∗ )D + Dvec Z F(Z, Z∗ ) = BT ⊗ A, ⇔ Dvec Z∗ F(Z, Z∗ ) = DT ⊗ C.

(3.4.69)

Similarly, if dF(Z, Z∗ ) = A(dZ)T B + C(dZ∗ )T D then we have the result dvecF(Z, Z∗ ) = (BT ⊗ A)dvecZT + (DT ⊗ C)dvecZH = (BT ⊗ A)Kmn dvecZ + (DT ⊗ C)Kmn dvecZ∗ , where we have used the vectorization property vecXTm×n = Kmn vecX. By Prop-

174

Matrix Differential

osition 3.5, the following identification formula is obtained: dF(Z, Z∗ ) = A(dZ)T B + C(dZ∗ )T D + Dvec Z F(Z, Z∗ ) = (BT ⊗ A)Kmn , ⇔ Dvec Z∗ F(Z, Z∗ ) = (DT ⊗ C)Kmn .

(3.4.70)

The above equation shows that, as in the vector case, the key to identifying the gradient matrix and conjugate gradient matrix of a matrix function F(Z, Z∗ ) is to write its matrix differential into the canonical form dF(Z, Z∗ ) = A(dZ)T B + C(dZ∗ )T D. Table 3.11 lists the corresponding relationships between the first-order complex matrix differential and the complex Jacobian matrix, where z ∈ Cm , Z ∈ Cm×n , F ∈ Cp×q . Table 3.11 Complex matrix differential and complex Jacobian matrix Function First-order matrix differential

Jacobian matrix

∂f ∂f =b = a, ∂z ∂z ∗ ∂f ∂f f (z, z∗ ) df (z, z∗ ) = aT dz + bT dz∗ = aT , = bT ∂zT ∂ zH ∂f ∂f f (Z, Z∗ ) df (Z, Z∗ ) = tr(AdZ + BdZ∗ ) = A, =B ∂ ZT ∂ ZH ∂ vec F ∂ vec F F(Z, Z∗ ) d(vec F) = Ad(vec Z) + Bd(vec Z∗ ) = A, =B ∂(vec Z)T ∂(vec Z∗ )T ⎧ ∂ vec F T ⎪ ⎪ ⎨ ∂(vec Z)T = B ⊗ A ∗ dF = A(dZ)B + C(dZ )D, ∂ vec F ⎪ ⎪ ⎩ = DT ⊗ C ∂(vec Z∗ )T ⎧ ∂vec F T ⎪ ⎪ ⎨ ∂(vec Z)T = (B ⊗ A)Kmn T ∗ T dF = A(dZ) B + C(dZ ) D ⎪ ∂ vecF ⎪ = (DT ⊗ C)Kmn ⎩ ∂ (vec Z∗ )T f (z, z ∗ )

df (z, z ∗ ) = adz + bdz ∗

3.5 Complex Hessian Matrices and Identification The previous section analyzed the first-order complex derivative matrices (the complex Jacobian matrix and complex gradient matrix) and their identification. In this section we discuss the second-order complex derivative matrices of the function f (Z, Z∗ ) and their identification. These second-order complex matrices are a full Hessian matrix and four part Hessian matrices.

3.5 Complex Hessian Matrices and Identification

175

3.5.1 Complex Hessian Matrices The differential of the real function f = f (Z, Z∗ ) can be equivalently written as df = (DZ f )dvecZ + (DZ∗ f )dvecZ∗ ,

(3.5.1)

where DZ f =

∂f , ∂(vec Z)T

DZ∗ f =

∂f . ∂(vec Z∗ )T

(3.5.2)

Noting that the differentials of both DZ f and DZ∗ f are row vectors, we have T ∂ DZ f ∂ DZ f ∗ d(DZ f ) = dvecZ dvecZ + ∂ vecZ ∂ vecZ∗ ∂2f ∂2f ∗ T = (dvecZ)T + (dvecZ ) ∂ vecZ∂(vecZ)T ∂vecZ∗ ∂ (vecZ)T T  ∂ DZ∗ f ∂ DZ∗ f d(DZ∗ f ) = dvecZ∗ dvecZ + ∂ vecZ ∂(vecZ)T ∂2f ∂2f = (dvecZ)T + (d(vec Z∗ )T ) . ∗ T ∗ ∂ vecZ∂(vec Z ) ∂ vecZ ∂(vecZ∗ )T 

Since dvecZ is not a function of vecZ, and dvecZ∗ is not a function of vecZ∗ , it follows that d2 (vec Z) = d(dvecZ) = 0,

d2 (vecZ∗ ) = d(dvecZ∗ ) = 0.

(3.5.3)

Hence, the second-order differential of f = f (Z, Z∗ ) is given by d2 f = d(DZ f )d(vecZ) + d(DZ∗ f )dvecZ∗ ∂2f dvecZ ∂ vecZ∂(vecZ)T ∂2f dvecZ + (dvecZ∗ )T ∗ ∂ vecZ ∂(vecZ)T ∂2f dvecZ∗ + (dvecZ)T ∂ vecZ∂(vec Z∗ )T ∂2f + (dvecZ∗ )T dvecZ∗ , ∗ ∂ vecZ ∂(vec Z∗ )T

= (dvecZ)T

which can be written as    HZ∗ ,Z d2 f = (dvecZ∗ )T , (dvecZ)T HZ,Z H    dvecZ dvecZ , H = dvecZ∗ dvecZ∗

HZ∗ ,Z∗ HZ,Z∗



dvecZ dvecZ∗



(3.5.4)

176

Matrix Differential

where

 HZ∗ ,Z H= HZ,Z

HZ∗ ,Z∗ HZ,Z∗

 (3.5.5)

is known as the full complex Hessian matrix of the function f (Z, Z∗ ), and ⎫ ∂ 2 f (Z, Z∗ ) ⎪ ⎪ , HZ∗ ,Z = ⎪ ⎪ ∂ vecZ∗ ∂(vec Z)T ⎪ ⎪ ⎪ ⎪ 2 ∗ ⎪ ∂ f (Z, Z ) ⎪ ⎪ ⎪ , HZ∗ ,Z∗ = ∗ ∗ T ∂ vecZ ∂(vec Z ) ⎬ (3.5.6) ⎪ ∂ 2 f (Z, Z∗ ) ⎪ ⎪ , HZ,Z = ⎪ ∂ vecZ∂(vec Z)T ⎪ ⎪ ⎪ ⎪ ⎪ 2 ∗ ⎪ ∂ f (Z, Z ) ⎪ ⎪ ⎭ , HZ,Z∗ = ∗ T ∂ vecZ∂(vec Z ) are respectively the part complex Hessian matrices of the function f (Z, Z∗ ); of these, HZ∗ ,Z is the main complex Hessian matrix of the function f (Z, Z∗ ). By the above definition formulas, it is easy to show that the Hessian matrices of a scalar function f (Z, Z∗ ) have the following properties. 1. The part complex Hessian matrices HZ∗ ,Z and HZ,Z∗ are Hermitian, and are complex conjugates: HZ∗ ,Z = HH Z∗ ,Z ,

HZ,Z∗ = HH Z,Z∗ ,

HZ∗ ,Z = H∗Z,Z∗ .

(3.5.7)

2. The other two part complex Hessian matrices are symmetric and complex conjugate each other: HZ,Z = HTZ,Z ,

HZ∗ ,Z∗ = HTZ∗ ,Z∗

HZ,Z = H∗Z∗ ,Z∗ .

(3.5.8)

3. The complex full Hessian matrix is Hermitian, i.e., H = HH . From Equation (3.5.4) it can be seen that, because the second-order differential of the real function f (Z, Z∗ ), d2 f , is a quadratic function, the positive definiteness of the full Hessian matrix is determined by d2 f : (1) (2) (3) (4)

The The The The

full full full full

Hessian Hessian Hessian Hessian

matrix matrix matrix matrix

is is is is

positive definite if d2 f > 0 holds for all vec Z. positive semi-definite if d2 f ≥ 0 holds for all vecZ. negative definite if d2 f < 0 holds for all vecZ. negative semi-definite if d2 f ≤ 0 holds for all vecZ.

For a real function f = f (z, z∗ ), its second-order differential is given by 

dz d f= dz∗ 2

H

 dz , H dz∗ 

(3.5.9)

3.5 Complex Hessian Matrices and Identification

177

in which H is the full complex Hessian matrix of the function f (z, z∗ ), and is defined as   Hz∗ ,z Hz∗ ,z∗ H= . (3.5.10) Hz,z Hz,z∗ The four part complex Hessian matrices are given by ⎫ ∂ 2 f (z, z∗ ) ⎪ Hz∗ ,z = ,⎪ ⎪ ⎪ ∂ z∗ ∂ zT ⎪ ⎪ 2 ∗ ⎪ ⎪ ∂ f (z, z ) ⎪ ⎬ Hz∗ ,z∗ = ,⎪ ∗ H ∂z ∂z ∂ 2 f (z, z∗ ) ⎪ ⎪ Hz,z = ,⎪ ⎪ ∂ z∂ zT ⎪ ⎪ ⎪ 2 ∗ ⎪ ∂ f (z, z ) ⎪ ⎪ Hz,z∗ = .⎭ ∂ z∂ zH

(3.5.11)

Clearly, the full complex Hessian matrix is Hermitian, i.e., H = HH ; and the part complex Hessian matrices have the following relationships: Hz∗ ,z = HH z∗ ,z , Hz,z =

HTz,z ,

Hz,z∗ = HH z,z∗ ,

Hz∗ ,z = H∗z,z∗ ,

(3.5.12)

HTz∗ ,z∗

H∗z∗ ,z∗ .

(3.5.13)

Hz∗ ,z∗ =

Hz,z =

The above analysis show that, in complex Hessian matrix identification, we just need to identify two of the part Hessian matrices HZ,Z and HZ,Z∗ , since HZ∗ ,Z∗ = H∗Z,Z and HZ∗ ,Z = H∗Z,Z∗ . 3.5.2 Complex Hessian Matrix Identification By Proposition 3.5 it is known that the complex differential of a real scalar function f (Z, Z∗ ) can be written in the canonical form df (Z, Z∗ ) = tr(AdZ+A∗ dZ∗ ), where A = A(Z, Z∗ ) is usually a complex matrix function of the matrix variables Z and Z∗ . Hence, the complex matrix differential of A can be expressed as dA = C(dZ)D + E(dZ∗ )F

or dA = D(dZ)T C + F(dZ∗ )T E.

(3.5.14)

Substituting (3.5.14) into the second-order complex differential d2 f (Z, Z∗ ) = d(df (Z, Z∗ )) = tr(dAdZ + dA∗ dZ∗ ),

(3.5.15)

we get d2 f (Z, Z∗ ) = tr(C(dZ)DdZ) + tr(E(dZ∗ )FdZ) + tr(C∗ (dZ∗ )D∗ dZ∗ ) + tr(E∗ (dZ)F∗ dZ∗ ),

(3.5.16)

or d2 f (Z, Z∗ ) = tr(D(dZ)T CdZ) + tr(F(dZ∗ )T EdZ) + tr(D∗ (dZ∗ )T C∗ dZ∗ ) + tr(F∗ (dZ)T E∗ dZ∗ ).

(3.5.17)

178

Matrix Differential

Case 1 Hessian matrix identification based on (3.5.16) By the trace property tr(XYUV) = (vec(VT ))T (UT ⊗ X)vecY, the vectorization property vecZT = Kmn vecZ and the commutation matrix property KTmn = Knm , it is easy to see that tr(C(dZ)DdZ) = (Kmn vecdZ)T (DT ⊗ C)vecdZ = (dvecZ)T Knm (DT ⊗ C)dvecZ. Similarly, we have tr(E(dZ∗ )FdZ) = (dvecZ)T Knm (FT ⊗ E)dvecZ∗ , tr(E∗ (dZ)F∗ dZ∗ ) = (dvecZ∗ )T Knm (FH ⊗ E∗ )dvecZ, tr(C∗ (dZ∗ )D∗ dZ∗ ) = (dvecZ∗ )T Knm (DH ⊗ C∗ )dvecZ∗ . Then, on the one hand, Equation (3.5.16) can be written as follows: d2 f (Z, Z∗ ) = (dvecZ)T Knm (DT ⊗ C)dvecZ + (dvecZ)T Knm (FT ⊗ E)dvecZ∗ + (dvecZ∗ )T Knm (FH ⊗ E∗ )dvecZ + (dvecZ∗ )T Knm (DH ⊗ C∗ )dvecZ.

(3.5.18)

On the other hand, Equation (3.5.4) can be rewritten as follows: d2 f = (dvecZ)T HZ,Z dvecZ + (dvecZ)T HZ,Z∗ dvecZ∗ + (dvecZ∗ )T HZ∗ ,Z dvecZ) + (dvecZ∗ )T HZ∗ ,Z∗ dvecZ∗ . By comparing (3.5.19) with (3.5.18), it immediately follows that  1 HZ,Z = H∗Z∗ ,Z∗ = Knm (DT ⊗ C) + (Knm (DT ⊗ C))H , 2  1 ∗ HZ,Z∗ = HZ∗ ,Z = Knm (FT ⊗ E) + (Knm (FT ⊗ E))H . 2 Case 2 Hessian matrix identification based on (3.5.17)

(3.5.19)

(3.5.20) (3.5.21)

By the trace property tr(XYUV) = (vecVT )T (UT ⊗ X)vec(Y), it is easy to see that the second-order differential formula (3.5.17) can be rewritten as d2 f (Z, Z∗ ) = (dvecZ)T (DT ⊗ C)dvecZ + (dvecZ∗ )T (FT ⊗ E)dvecZ + (dvecZ)T (FH ⊗ E∗ )dvecZ∗ + (dvecZ∗ )T (DH ⊗ C∗ )dvecZ. Comparing (3.5.22) with (3.5.19), we have  1 T HZ,Z = H∗Z∗ ,Z∗ = D ⊗ C + (DT ⊗ C)H , 2  1 T ∗ HZ,Z∗ = HZ∗ ,Z = (F ⊗ E) + (FT ⊗ E)H . 2

(3.5.22)

(3.5.23) (3.5.24)

Exercises

179

The results above can be summarized in the following theorem on complex Hessian matrix identification. THEOREM 3.4 Let f (Z, Z∗ ) be a differentiable real function of m × n matrix variables Z, Z∗ . If the second-order differential d2 f (Z, Z∗ ) has the canonical form (3.5.16) or (3.5.17) then the complex Hessian matrix can be identified by  1 Knm (DT ⊗ C) + (Knm (DT ⊗ C))T tr(C(dZ)DdZ) ⇔ HZ,Z = 2  1 ∗ tr(E(dZ)FdZ ) ⇔ HZ,Z∗ = Knm (FT ⊗ E) + (Knm (FT ⊗ E))H 2 or    1 T tr C(dZ)D(dZ)T ⇔ HZ,Z = D ⊗ C + (DT ⊗ C)T 2    1 T ∗ T tr E(dZ)F(dZ ) ⇔ HZ,Z∗ = (F ⊗ E) + (FT ⊗ E)H 2 together with HZ∗ ,Z∗ = H∗Z,Z and HZ∗ ,Z = H∗Z,Z∗ . The complex conjugate gradient matrix and the full complex Hessian matrix play an important role in the optimization of an objective function with a complex matrix as variable.

Exercises Show that the matrix differential of the Kronecker product is given by d(X ⊗ Y) = (dX)Y + X ⊗ dY. 3.2 Prove that d(X∗Y) = (dX)Y+X∗dY, where X∗Y represents the Hadamard product of X and Y. 3.3 Show that d(UVW) = (dU)VW + U(dV)W + UV(dW). More generally, show that d(U1 U2 · · · Uk ) = (dU1 )U2 · · · Uk + · · · + U1 · · · Uk−1 (dUk ). 3.4 Compute d(tr(Ak )), where k is an integer. 3.5 Let the eigenvalues of the matrix X ∈ Rn×n be λ1 , . . . , λn . Use |X| = λ1 · · · λn to show that d|X| = |X|tr(X−1 dX). (Hint: Rewrite the determinant as |X| = exp(log λ1 + · · · + log λn ).) 3.6 Find the Jacobian matrices and the gradient matrices of the determinant logarithm functions log |XT AX|, log |XAXT | and log |XAX|. 3.7 Show the result dtr(XT X) = 2tr(XT dX). 3.8 Show that the matrix differential of a Kronecker product d(Y ⊗ Z) is dY ⊗ Z + Y ⊗ dZ.   3.9 Given the trace function tr (XT CX)−1 A , find its differential and gradient matrix. 3.10 Show that  dA−1 −1 il =− A−1 ij Akl . dAjk j 3.1

k

180

3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18

Matrix Differential

 (Hint: Write A−1 A = I as j A−1 ij Ajk = δik .) Given the determinant logarithms log |XT AX|, log |XAXT | and log |XAX|, find their Hessian matrices. Identify the Jacobian matrices of the matrix functions AXB and AX−1 B. Find the gradient matrices of the functions tr(AX−1 B) and tr(AXT BXC). Given the determinant functions |XT AX|, |XAXT |, |XAX| and |XT AXT |, determine their gradient matrices. Given a matrix X ∈ Cn×n with eigenvalues λ1 , . . . , λn , use |X| = λ1 · · · λn to prove that d|X| = |X|tr(X−1 dX). Compute d|A−k | and dtr(A−k ), where k is an integer. Identify the Jacobian matrices and the gradient matrices of the determinant logarithms log |XT AX|, log |XAXT | and log |XAX|, respectively. Show that if F is a matrix function and second differentiable then d2 log |F| = −tr(F−1 dF)2 + tr(F−1 )d2 F.

3.19 Use d(X−1 X) to find the differential of the inverse matrix, dX−1 . 3.20 Show that, for a nonsingular matrix X ∈ Rn×n , its inverse matrix X−1 is infinitely differentiable and its rth-order differential is given by d(r) (X−1 ) = (−1)r r!(X−1 dX)r X−1 ,

r = 1, 2, . . .

3.21 Let X ∈ Rm×n , and let its Moore–Penrose inverse matrix be X† . Show that  T d(X† X) = X† (dX)(In − X† X) + X† (dX)(In − X† X) ,  T d(XX† ) = (Im − XX† )(dX)X† + (Im − XX† )(dX)X† . 3.22 3.23 3.24 3.25

(Hint: The matrices X† X and XX† are idempotent and symmetric.) Find the Hessian matrices of the real scalar functions f (x) = aT x and f (x) = xT Ax. Given a real scalar function f (X) = tr(AXBXT ), show that its Hessian matrix is given by H[f (X)] = BT ⊗ A + B ⊗ AT . Compute the complex matrix differential dtr(XAXH B). Given a complex-valued function f (X, X∗ ) = |XAXH B|, find its gradient matrix ∇X f (X, X∗ ) = =

∂f (X, X∗ ) . ∂ X∗

∂f (X, X∗ ) and conjugate gradient matrix ∇X∗ f (X, X∗ ) ∂X

3.26 Given a function f (x) = (|aH x|2 − 1)2 , find its conjugate gradient vector ∂f (x) . ∂ x∗

3.27 Let f (X) = log |det(X)|, find its conjugate gradient vector

∂f (X) . ∂ X∗

3.28 Let f (Z, Z∗ ) = |ZZH |, find its gradient matrix and its four part Hessian matrices.

PART II MATRIX ANALYSIS

4 Gradient Analysis and Optimization

Optimization theory mainly studies the extreme values (the maximal or minimal values) of an objective function, which is usually some real-valued function of a real vector or matrix variable. In many engineering applications, however, the variable of an objective function is a complex vector or matrix. Optimization theory mainly considers (1) the existence conditions for an extremum value (gradient analysis); (2) the design of optimization algorithms and convergence analysis.

4.1 Real Gradient Analysis Consider an unconstrained minimization problem for f (x) : Rn → R described by min f (x),

(4.1.1)

x ∈S

where S ∈ Rn is a subset of the n-dimensional vector space Rn ; the vector variable x ∈ S is called the optimization vector and is chosen so as to fulfil Equation (4.1.1). The function f : Rn → R is known as the objective function; it represents the cost or price paid by selecting the optimization vector x, and thus is also called the cost function. Conversely, the negative cost function −f (x) can be understood as the value function or utility function of x. Hence, solving the optimization problem (4.1.1) corresponds to minimizing the cost function or maximizing the value function. That is to say, the minimization problem min f (x) and the maximization x ∈S

problem max{−f (x)} of the negative objective function are equivalent. x ∈S

The above optimization problem has no constraint conditions and thus is called an unconstrained optimization problem. Most nonlinear programming methods for solving unconstrained optimization problems are based on the idea of relaxation and approximation [344]. • Relaxation The sequence {ak }∞ k=0 is known as a relaxation sequence, if ak+1 ≤ ak , ∀ k ≥ 0. Hence, in the process of solving iteratively the optimization problem (4.1.1) it is necessary to generate a relaxation sequence of the cost function f (xk+1 ) ≤ f (xk ), k = 0, 1, . . . 183

184

Gradient Analysis and Optimization

• Approximation Approximating an objective function means using a simpler objective function instead of the original objective function. Therefore, by relaxation and approximation, we can achieve the following. (1) If the objective function f (x) is lower bounded in the definition domain S ∈ Rn then the sequence {f (xk )}∞ k=0 must converge. (2) In any case, we can improve upon the initial value of the objective function f (x). (3) The minimization of a nonlinear objective function f (x) can be implemented by numerical methods to a sufficiently high approximation accuracy.

4.1.1 Stationary Points and Extreme Points The stationary points and extreme points of an objective function play a key role in optimization. The analysis of stationary points depends on the gradient (i.e., the first-order gradient) vector of the objective function, and the analysis of the extreme points depends on the Hessian matrix (i.e., the second-order gradient) of the objective function. Hence the gradient analysis of an objective function is divided into first-order gradient analysis (stationary-point analysis) and secondorder gradient analysis (extreme-point analysis). In optimization, a stationary point or critical point of a differentiable function of one variable is a point in the domain of the function where its derivative is zero. Informally, a stationary point is a point where the function stops increasing or decreasing (hence the name); it could be a saddle point, however, is usually expected to be a global minimum or maximum point of the objective function f (x). DEFINITION 4.1 A point x in the vector subspace S ∈ Rn is known as a global minimum point of the function f (x) if f (x ) ≤ f (x),

∀ x ∈ S, x = x .

(4.1.2)

A global minimum point is also called an absolute minimum point. The value of the function at this point, f (x ), is called the global minimum or absolute minimum of the function f (x) in the vector subspace S. Let D denote the definition domain of a function f (x), where D may be the whole vector space Rn or some subset of it. If f (x ) < f (x),

∀ x ∈ D,

(4.1.3)

then x is said to be a strict global minimum point or a strict absolute minimum point of the function f (x). Needless to say, the ideal goal of minimization is to find the global minimum of

4.1 Real Gradient Analysis

185

a given objective function. However, this desired object is often difficult to achieve. The reasons are as follows. (1) It is usually difficult to know the global or complete information about a function f (x) in the definition domain D. (2) It is usually impractical to design an algorithm for identifying a global extreme point, because it is almost impossible to compare the value f (x ) with all values of the function f (x) in the definition domain D. In contrast, it is much easier to obtain local information about an objective function f (x) in the vicinity of some point c. At the same time, it is much simpler to design an algorithm for comparing the function value at some point c with other function values at the points near c. Hence, most minimization algorithms can find only a local minimum point c; the value of the objective function at c is the minimum of its values in the neighborhood of c. DEFINITION 4.2 Given a point c ∈ D and a positive number r, the set of all points x satisfying x − c 2 < r is said to be an open neighborhood with radius r of the point c, denoted Bo (c; r) = {x| x ∈ D, x − c 2 < r}.

(4.1.4)

Bc (c; r) = {x| x ∈ D, x − c 2 ≤ r}, then we say that Bc (c; r) is a closed neighborhood of c.

(4.1.5)

If

DEFINITION 4.3 Consider a scalar r > 0, and let x = c + Δx be a point in the definition domain D. If f (c) ≤ f (c + Δx),

∀ 0 < Δx 2 ≤ r,

(4.1.6)

then the point c and the function value f (c) are called a local minimum point and a local minimum (value) of the function f (x), respectively. If ∀ 0 < Δx 2 ≤ r,

f (c) < f (c + Δx),

(4.1.7)

then the point c and the function value f (c) are known as a strict local minimum point and a strict local minimum of the function f (x), respectively. If f (c) ≤ f (x),

∀ x ∈ D,

(4.1.8)

then the point c and the function value f (c) are said to be a global minimum point and a global minimum of the function f (x) in the definition domain D, respectively. If f (c) < f (x),

∀ x ∈ D, x = c,

(4.1.9)

then the point c and the function value f (c) are referred to as a strict global minimum point and a strict global minimum of the function f (x) in the definition domain D, respectively.

186

Gradient Analysis and Optimization

DEFINITION 4.4 Consider a scalar r > 0, and let x = c + Δx be a point in the definition domain D. If f (c) ≥ f (c + Δx),

∀ 0 < Δx 2 ≤ r,

(4.1.10)

then the point c and the function value f (c) are called a local maximum point and a local maximum (value) of the function f (x), respectively. If ∀ 0 < Δx 2 ≤ r,

f (c) > f (c + Δx),

(4.1.11)

then the point c and the function value f (c) are known as a strict local maximum point and a strict local maximum of the function f (x), respectively. If f (c) ≥ f (x),

∀ x ∈ D,

(4.1.12)

then the point c and the function value f (c) are said to be a global maximum point and a global maximum of the function f (x) in the definition domain D, respectively. If f (c) > f (x),

∀ x ∈ D, x = c,

(4.1.13)

then the point c and the function value f (c) are referred to as a strict global maximum point and a strict global maximum of the function f (x) in the definition domain D, respectively. It should be emphasized that the minimum points and the maximum points are collectively called the extreme points of the function f (x), and the minimal value and the maximal value are known as the extrema or extreme values of f (x). A local minimum (maximum) point and a strict local minimum (maximum) point are sometimes called a weak local minimum (maximum) point and a strong local minimum (maximum) point, respectively. In particular, if some point x0 is a unique local extreme point of a function f (x) in the neighborhood B(c; r) then it is said to be an isolated local extreme point. In order to facilitate the understanding of the stationary points and other extreme points, Figure 4.1 shows the curve of a one-variable function f (x) with definition domain D = (0, 6). The points x = 1 and x = 4 are a local minimum point and a local maximum point, respectively; x = 2 and x = 3 are a strict local maximum point and a strict local minimum point, respectively; all these points are extrema, while x = 5 is only a saddle point.

4.1.2 Real Gradient Analysis of f (x) In practical applications, it is still very troublesome to compare directly the value of an objective function f (x) at some point with all possible values in its neighborhood. Fortunately, the Taylor series expansion provides a simple method for coping with this difficulty.

4.1 Real Gradient Analysis

187

Figure 4.1 Extreme points of a one-variable function defined over the interval (0, 6).

The second-order Taylor series approximation of f (x) at the point c is given by 1 T f (c + Δx) = f (c) + (∇f (c)) Δx + (Δx)T H[f (c)]Δx, 2 where

(4.1.14)

 ∂f (c) ∂f (x)  , ∇f (c) = = ∂c ∂ x x=c  ∂ 2 f (x)  ∂ 2 f (c) = H[f (c)] = ∂c∂cT ∂ x∂ xT 

(4.1.15) (4.1.16)

x=c

are respectively the gradient vector and the Hessian matrix of f (x) at the point c. From (4.1.14) we get the following two necessary conditions of a local minimizer of f (x). THEOREM 4.1 (First-order necessary conditions) [353] If x∗ is a local extreme point of a function f (x) and f (x) is continuously differentiable in the neighborhood B(x∗ ; r) of the point x∗ , then  ∂f (x)  ∇x∗ f (x∗ ) = = 0. (4.1.17) ∂ x x=x ∗

THEOREM 4.2 (Second-order necessary conditions) [311], [353] If x∗ is a local minimum point of f (x) and the second-order gradient ∇2x f (x) is continuous in an open neighborhood Bo (x∗ ; r) of x∗ , then   ∂f (x)  ∂ 2 f (x)  2 = 0 and ∇x∗ f (x∗ ) =  0, (4.1.18) ∇x∗ f (x∗ ) = ∂ x x=x ∂ x∂ xT x=x ∗



i.e., the gradient vector of f (x) at the point x∗ is a zero vector and the Hessian matrix ∇2x f (x) at the point x∗ is a positive semi-definite matrix. If Equation (4.1.18) in Theorem 4.2 is replaced by   ∂f (x)  ∂ 2 f (x)  2 = 0 and ∇ f (x ) = 0 ∇x∗ f (x∗ ) = ∗ x∗ ∂ x x=x ∂ x∂ xT x=x ∗



(4.1.19)

188

Gradient Analysis and Optimization

then Theorem 4.2 yields the second-order necessary condition for x∗ to be a local maximum point of the function f (x). In Equation (4.1.19), ∇2x∗ f (x∗ )  0 means that the Hessian matrix at the point x∗ , ∇2x∗ f (x∗ ), is negative semi-definite. It should be emphasized that Theorem 4.2 provides only a necessary condition, not a sufficient condition, for there to be a local minimum point of the function f (x). However, for an unconstrained optimization algorithm, we usually hope to determine directly whether its convergence point x∗ is in fact an extreme point of the given objective function f (x). The following theorem gives the answer to this question. THEOREM 4.3 (Second-order sufficient conditions) [311], [353] Let ∇2x f (x) be continuous in an open neighborhood of x∗ . If the conditions   ∂f (x)  ∂ 2 f (x)  2 = 0 and ∇x∗ f (x∗ ) =  0 (4.1.20) ∇x∗ f (x∗ ) = ∂ x x=x ∂ x∂ xT x=x ∗



are satisfied then x∗ is a strict local minimum point of the objection function f (x). Here ∇2x f (x∗ )  0 means that the Hessian matrix ∇2x∗ f (x∗ ) at the point x∗ is a positive definite matrix. Note 1 Second-order necessary conditions for a local maximum point: If x∗ is a local maximum point of f (x), and the second-order gradient ∇2x f (x) is continuous in an open neighborhood Bo (x∗ ; r) of x∗ , then   ∂f (x)  ∂ 2 f (x)  2 = 0 and ∇x∗ f (x∗ ) =  0. (4.1.21) ∇x∗ f (x∗ ) = ∂ x x=x ∂ x∂ xT x=x ∗



Note 2 Second-order sufficient conditions for a strict local maximum point: suppose that ∇2x f (x) is continuous in an open neighborhood of x∗ ; if the conditions   ∂f (x)  ∂ 2 f (x)  2 = 0 and ∇x∗ f (x∗ ) = ≺ 0 (4.1.22) ∇x∗ f (x∗ ) = ∂ x x=x ∂ x∂ xT x=x ∗



are satisfied then x∗ is a strict local maximum point of the objection function f (x). If ∇f (x∗ ) = 0 but ∇2x∗ f (x∗ ) is an indefinite matrix, then x∗ is a saddle point, not an extreme point, of f (x) in the neighborhood B(x∗ ; r).

4.1.3 Real Gradient Analysis of f (X) Now consider a real function f (X) : Rm×n → R with matrix variable X ∈ Rm×n . First, perform the vectorization of the matrix variable to get an mn × 1 vector vecX. Let D be the subset of the matrix space Rm×n which is the definition domain of the m × n matrix variable X, i.e., X ∈ D.

4.1 Real Gradient Analysis

189

DEFINITION 4.5 A neighborhood B(X∗ ; r) with center point vecX∗ and the radius r is defined as B(X∗ ; r) = {X|X ∈ Rm×n , vecX − vecX∗ 2 < r}.

(4.1.23)

From Equation (4.1.16), the second-order Taylor series approximation formula of the function f (X) at the point X∗ is given by T  ∂f (X∗ ) f (X∗ + ΔX) = f (X∗ ) + vec(ΔX) ∂ vecX∗ 1 ∂ 2 f (X∗ ) + (vec(ΔX))T vec(ΔX) 2 ∂ vecX∗ ∂ (vecX∗ )T  T = f (X∗ ) + ∇vec X∗ f (X∗ ) vec(ΔX) 1 + (vec(ΔX))T H[f (X∗ )]vec(ΔX), (4.1.24) 2 where  ∂f (X)  ∈ Rmn ∇vec X∗ f (X∗ ) = ∂ vecX X=X ∗   ∂ 2 f (X)  H[f (X∗ )] = ∈ Rmn×mn T ∂ vecX∂(vec X) X=X ∗

are the gradient vector and the Hessian matrix of f (X) at the point X∗ , respectively. Comparing Equation (4.1.24) with Equation (4.1.14), we have the following conditions similar to Theorems 4.1–4.3. (1) First-order necessary condition for a minimizer If X∗ is a local extreme point of a function f (X), and f (X) is continuously differentiable in the neighborhood B(X∗ ; r) of the point X∗ , then  ∂f (X)  = 0. (4.1.25) ∇vec X∗ f (X∗ ) = ∂ vecX X=X ∗

(2) Second-order necessary conditions for a local minimum point If X∗ is a local minimum point of f (X), and the Hessian matrix H[f (X)] is continuous in an open neighborhood Bo (X∗ ; r) then   ∂ 2 f (X)  ∇vec X∗ f (X∗ ) = 0 and  0. (4.1.26) T ∂ vecX∂ (vecX) X=X ∗

(3) Second-order sufficient conditions for a strict local minimum point It is assumed that H[f (x)] is continuous in an open neighborhood of X∗ . If the conditions   ∂ 2 f (X)  ∇vec X∗ f (X∗ ) = 0 and 0 (4.1.27) ∂ vecX∂ (vecX)T X=X ∗

are satisfied, then X∗ is a strict local minimum point of f (X).

190

Gradient Analysis and Optimization

(4) Second-order necessary conditions for a local maximum point If X∗ is a local maximum point of f (X), and the Hessian matrix H[f (X)] is continuous in an open neighborhood Bo (X∗ ; r), then   ∂ 2 f (X)  ∇vec X∗ f (X∗ ) = 0 and  0. (4.1.28) T ∂ vecX∂ (vecX) X=X ∗

(5) Second-order sufficient conditions for a strict local maximum point It is assumed that H[f (X)] is continuous in an open neighborhood of X∗ . If the conditions   ∂ 2 f (X)  ∇vec X∗ f (X∗ ) = 0 and ≺0 (4.1.29) T ∂ vecX∂ (vecX) X=X ∗

are satisfied then X∗ is a strict local maximum point of f (X). The point X∗ is only a saddle point of f (X), if ∇vec X∗ f (X∗ ) = 0

and

  ∂ 2 f (X)  is indefinite. T ∂ vecX∂(vecX) X=X ∗

4.2 Gradient Analysis of Complex Variable Function In many engineering applications (such as wireless communications, radar and sonar), the signals encountered in practice are real-valued functions of complex vector variables. In this section we discuss the extreme point of a complex variable function and gradient analysis in unconstrained optimization.

4.2.1 Extreme Point of Complex Variable Function Consider the unconstrained optimization of a real-valued function f (z, z∗ ) : Cn × Cn → R. From the first-order differential ∂f (z, z∗ ) ∂f (z, z∗ ) ∗ dz + dz ∂ zT ∂zH     ∂f (z, z∗ ) ∂f (z, z∗ ) dz = , dz∗ ∂ zT ∂zH

df (z, z∗ ) =

(4.2.1)

and the second-order differential ⎡

∂ 2 f (z, z∗ ) ⎢ ∂ z∗ ∂ zT

d2 f (z, z∗ ) = [dzH , dzT ] ⎣

∂ 2 f (z, z∗ ) ∂ z∂ zT



∂ 2 f (z, z∗ )   ∂ z∗ ∂ zH ⎥ dz ∂ 2 f (z, z∗ ) ∂ z∂ zH



dz∗

,

(4.2.2)

4.2 Gradient Analysis of Complex Variable Function

191

it can be easily seen that the Taylor series approximation of the function f (z, z∗ ) at the point c is given by    ∂f (c, c∗ ) ∂f (c, c∗ ) Δc ∗ ∗ f (z, z ) ≈ f (c, c ) + , Δc∗ ∂c ∂c∗ ⎡ 2 ⎤ ∂ f (c, c∗ ) ∂ 2 f (c, c∗ )   ∗ T 1 ∂ c∗ ∂ cH ⎥ Δc H T ⎢ ∂c ∂c + [Δc , Δc ] ⎣ 2 ⎦ Δc∗ 2 ∂ f (c, c∗ ) ∂ 2 f (c, c∗ ) ∂ c∂ cT

∂ c∂ cH

1 = f (c, c∗ ) + (∇f (c, c∗ ))T Δ˜ c + (Δ˜ c, (4.2.3) c)H H(f (c, c∗ ))Δ˜ 2   Δc with Δc = z − c and Δc∗ = z∗ − c∗ ; while the gradient vector where Δ˜ c= Δc∗ is ⎡ ∗ ⎤ ⎢ ∇f (c, c∗ ) = ⎣ and the Hessian matrix is

∂f (z, z ) ⎥ ∂z ∈ C2n , ⎦ ∂f (z, z∗ ) ∂ z∗ x=c



∂ 2 f (z, z∗ ) ⎢ ∂ z∗ ∂ zT

H[f (c, c∗ )] = ⎣

∂ 2 f (z, z∗ ) ∂ z∂ zT

(4.2.4)



∂ 2 f (z, z∗ ) ∂ z∗ ∂ zH ⎥



∂ 2 f (z, z∗ ) ∂ z∂ zH x=c

∈ C2n×2n .

(4.2.5)

Consider the neighborhood of the point c B(c; r) = {x|0 < Δc 2 = x − c 2 ≤ r}, where Δc 2 is small enough so that the second-order term in Equation (4.2.3) can be ignored. This yields the first-order approximation to the function, as follows: f (z, z∗ ) ≈ f (c, c∗ ) + (∇f (c, c∗ ))T Δ˜ c

(for Δc 2 small enough).

(4.2.6)



Clearly, in order that f (c, c ) takes a minimal value or a maximal value, it is required that ∇f (c, c∗ )) = 02n×1 . This condition is equivalent to ∂f (c, c∗ )/∂ c = 0n×1 and ∂f (c, c∗ )/∂ c∗ = 0n×1 , or simply the condition that the conjugate derivative vector of the function f (z, z∗ ) at the point c is a zero vector, namely  ∂f (z, z∗ )  ∂f (c, c∗ ) = = 0n×1 . (4.2.7) ∂ c∗ ∂ z∗ z=c The Taylor series approximation to the real-valued function f (z, z∗ ) at its stationary point c is given to second-order by 1 f (z, z∗ ) ≈ f (c, c∗ ) + (Δ˜ c. (4.2.8) c)H H[f (c, c∗ )]Δ˜ 2 From Equations (4.2.7) and (4.2.8) we have the following conditions on extreme points, similar to Theorems 4.1–4.3.

192

Gradient Analysis and Optimization

(1) First-order necessary condition for a local extreme point If z∗ is a local extreme point of a function f (z, z∗ ), and f (z, z∗ ) is continuously differentiable in the neighborhood B(c; r) of the point c, then  ∂f (z, z∗ )  = 0. (4.2.9) ∇c f (c, c∗ ) =  ∂z z=c (2) Second-order necessary conditions for a local minimum point If c is a local minimum point of f (z, z∗ ), and the Hessian matrix H[f (z, z∗ )] is continuous in an open neighborhood Bo (c; r) of the point c, then ⎤ ⎡ 2 ∂ f (z,z∗ ) ∂ 2 f (z,z∗ ) ∗ T ∗ H ∂f (c, c∗ ) ∂z ∂z ∂z ∂z ⎦ = 0 and H[f (c, c∗ )] = ⎣ 2  0. (4.2.10) ∂ f (z,z∗ ) ∂ 2 f (z,z∗ ) ∂c∗ ∂z∂ zT

∂z∂zH

z=c

(3) Second-order sufficient conditions for a strict local minimum point Suppose that H[f (z, z∗ )] is continuous in an open neighborhood of c. If the conditions ∂f (c, c∗ ) = 0 and ∂c∗

H[f (c, c∗ )]  0

(4.2.11)

are satisfied then c is a strict local minimum point of f (z, z∗ ). (4) Second-order necessary conditions for a local maximum point If c is a local maximum point of f (z, z∗ ), and the Hessian matrix H[f (z, z∗ )] is continuous in an open neighborhood Bo (c; r) of the point c, then ∂f (c, c∗ ) = 0 and ∂c∗

H[f (c, c∗ )]  0.

(4.2.12)

(5) Second-order sufficient conditions for a strict local maximum point Suppose that H[f (z, z∗ )] is continuous in an open neighborhood of c. If the conditions ∂f (c, c∗ ) = 0 and ∂c∗

H[f (c, c∗ )] ≺ 0.

(4.2.13)

are satisfied then c is a strict local maximum point of f (z, z∗ ). ∂f (c, c∗ )

If = 0, but the Hessian matrix H[f (c, c∗ )] is indefinite, then c is only ∂ c∗ a saddle point of f (z, z∗ ), not an extremum. For a real-valued function f (Z, Z∗ ) : Cm×n × Cm×n → R with complex matrix variable Z, it is necessary to perform the vectorization of the matrix variables Z and Z∗ respectively, to get vecZ and vecZ∗ . From the first-order differential T T   ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) df (Z, Z∗ ) = vec(dZ) + vec(dZ∗ ) ∂vec Z ∂vec Z∗    ∂f (Z, Z∗ ) ∂f (Z, Z∗ ) vec(dZ) = (4.2.14) , ∂(vecZ)T ∂(vec Z∗ )T ) vec(dZ)∗

4.2 Gradient Analysis of Complex Variable Function

and the second-order differential ⎡ ∂ 2 f (Z,Z∗ )  H ∂(vec Z∗ )∂(vec Z)T vec(dZ) ⎣ d2 f (Z, Z∗ ) = ∗ 2 vec(dZ ) ∂ f (Z,Z∗ )

193

∂ 2 f (Z,Z∗ ) ∂(vec Z∗ )∂(vec Z∗ )T ∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z∗ )T )

∂(vec Z)∂(vec Z)T



  ⎦ vec(dZ) vec(dZ∗ ) (4.2.15)

it follows that the second-order series approximation of f (Z, Z∗ ) at the point C is given by ˜ f (Z, Z∗ ) = f (C, C∗ ) + (∇f (C, C∗ ))T vec(ΔC) 1 ˜ ˜ H H[f (C, C∗ )]vec(ΔC), + (vec(ΔC)) 2 where

 ˜ = ΔC

(4.2.16)

   Z−C ΔC = ∈ C2m×n Z ∗ − C∗ ΔC∗ ⎡ ∂f (Z,Z∗ ) ⎤

∇f (C, C∗ ) = ⎣

∂(vec Z) ∂f (Z,Z∗ ) ∂(vec Z∗ )



(4.2.17)

∈ C2mn×1 ,

(4.2.18)

Z=C

and H[f (C, C∗ )] = H[f (Z, Z∗ )]|Z=C ⎡ ∂ 2 f (Z,Z∗ ) =⎣  =

∂(vec Z∗ )∂(vec Z)T

∂ 2 f (Z,Z∗ ) ∂(vec Z∗ )∂(vec Z∗ )T

∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z)T

∂ 2 f (Z,Z∗ ) ∂(vec Z)∂(vec Z∗ )T

H HZZ

Z∗ Z

H HZZ∗

Z∗ Z∗



∈ C2mn×2mn .

⎤ ⎦ Z=C

(4.2.19)

Z=C

Notice that ∂f (Z, Z∗ ) = 0mn×1 , ∂vec Z ∗ ∂f (Z, Z ) ⇔ = Om×n . ∂Z∗

∇f (C, C∗ ) = 02mn×1 ⇔ ∂f (Z, Z∗ ) = 0mn×1 ∂vec Z∗

On comparing Equation (4.2.16) with Equation (4.2.3), we have the following conditions for the extreme points of f (Z, Z∗ ). (1) First-order necessary condition for a local extreme point If C is a local extreme point of a function f (Z, Z∗ ), and f (Z, Z∗ ) is continuously differentiable in the neighborhood B(C; r) of the point C, then  ∂f (Z, Z∗ )  ∗ = O. (4.2.20) ∇f (C, C ) = 0 or ∂Z∗ Z=C

194

Gradient Analysis and Optimization

(2) Second-order necessary conditions for a local minimum point If C is a local minimum point of f (Z, Z∗ ), and the Hessian matrix H[f (Z, Z∗ )] is continuous in an open neighborhood Bo (C; r) of the point C, then  ∂f (Z, Z∗ )  = O and H[f (C, C∗ )]  0. (4.2.21) ∂Z∗ Z=C (3) Second-order sufficient conditions for a strict local minimum point Suppose that H[f (Z, Z∗ )] is continuous in an open neighborhood of C. If the conditions  ∂f (Z, Z∗ )  = O and H[f (C, C∗ )]  0 (4.2.22) ∂Z∗ Z=C are satisfied, then C is a strict local minimum point of f (Z, Z∗ ). (4) Second-order necessary conditions for a local maximum point If C is a local maximum point of f (Z, Z∗ ), and the Hessian matrix H[f (Z, Z∗ )] is continuous in an open neighborhood Bo (C; r) of the point C, then  ∂f (Z, Z∗ )  = O and H[f (C, C∗ )]  0. (4.2.23) ∂Z∗ Z=C (5) Second-order sufficient conditions for a strict local maximum point Suppose that H[f (Z, Z∗ )] is continuous in an open neighborhood of C, if the conditions  ∂f (Z, Z∗ )  = O and H[f (C, C∗ )] ≺ 0 (4.2.24) ∂Z∗ Z=C are satisfied, then C is a strict local maximum point of f (Z, Z∗ ).  ∂f (Z, Z∗ )  If = O, but the Hessian matrix H[f (C, C∗ )] is indefinite, then C  ∗ ∂Z

Z=C

is only a saddle point of f (Z, Z∗ ). Table 4.1 summarizes the extreme-point conditions for three complex variable functions. The Hessian matrices in Table 4.1 are defined as follows: ⎡ 2 ∗ 2 ∗ ⎤ ∂ f (z, z )

⎢ ∂ z ∗ ∂z H[f (c, c )] = ⎣ 2 ∗ ∗

∂ f (z, z ) ∂z∂z ⎡ 2 ∂ f (z, z∗ ) ⎢ ∂ z∗ ∂ zT

H[f (c, c∗ )] = ⎣

∂ f (z, z ) ∂ z ∗ ∂z ∗ ⎥



∈ C2×2 ,



∈ C2n×2n ,

∂ 2 f (z, z ∗ ) ∂ z ∂z ∗ z=c ⎤ ∂ 2 f (z, z∗ ) ∂ z∗ ∂ zH ⎥

∂ 2 f (z, z∗ ) ∂ 2 f (z, z∗ ) ∂ z∂ zT ∂ z∂ zH z=c ⎡ ⎤ 2 ∗ ∂ f (Z, Z ) ∂ 2 f (Z, Z∗ ) ⎢ ∂(vecZ∗ )∂ (vecZ)T ∂(vecZ∗ )∂(vecZ∗ )T ⎥

H[f (C, C∗ )] = ⎢ ⎣

∂ 2 f (Z, Z∗ ) ∂(vecZ)∂(vecZ)T

⎥ ∈ C2mn×2mn . ⎦ ∂ 2 f (Z, Z∗ ) ∂(vecZ)∂(vecZ∗ )T Z = C

4.2 Gradient Analysis of Complex Variable Function

195

Table 4.1 Extreme point conditions for the complex variable functions f (z, z ∗ ) : C → R f (z, z∗ ) : Cn → R f (Z, Z∗ ) : Cm×n → R    ∂f (z, z ∗ )  ∂f (z, z∗ )  ∂f (Z, Z∗ )  = 0 = 0 =O ∂z ∗ z=c ∂ z∗ z=c ∂Z∗ Z=C

Stationary point Local minimum

H[f (c, c∗ )] 0

H[f (c, c∗ )] 0

H[f (C, C∗ )] 0

Strict local minimum

H[f (c, c∗ )] 0

H[f (c, c∗ )] 0

H[f (C, C∗ )] 0

Local maximum

H[f (c, c∗ )] 0

H[f (c, c∗ )] 0

H[f (C, C∗ )] 0

Strict local maximum H[f (c, c∗ )] ≺ 0

H[f (c, c∗ )] ≺ 0

H[f (C, C∗ )] ≺ 0

H[f (c, c∗ )] indef.

H[f (C, C∗ )] indef.

Saddle point

H[f (c, c∗ )] indef.

4.2.2 Complex Gradient Analysis

Given a real-valued objective function f(w, w∗) or f(W, W∗), the gradient analysis of its unconstrained minimization problem can be summarized as follows.
(1) The conjugate gradient matrix determines a closed-form solution of the minimization problem.
(2) The sufficient conditions for local minimum points are determined by the conjugate gradient vector and the Hessian matrix of the objective function.
(3) The negative direction of the conjugate gradient vector determines the steepest-descent iterative algorithm for solving the minimization problem.
(4) The Hessian matrix gives the Newton algorithm for solving the minimization problem.
In the following, we discuss the above gradient analysis.

1. Closed-Form Solution of the Unconstrained Minimization Problem
By letting the conjugate gradient vector (or matrix) be equal to a zero vector (or matrix), we can find a closed-form solution of the given unconstrained minimization problem.

EXAMPLE 4.1 For an over-determined matrix equation Az = b, define the error sum of squares

J(z) = ‖Az − b‖₂² = (Az − b)ᴴ(Az − b) = zᴴAᴴAz − zᴴAᴴb − bᴴAz + bᴴb

as the objective function. Let its conjugate gradient vector ∇_{z∗}J(z) = AᴴAz − Aᴴb be equal to a zero vector. Clearly, if AᴴA is nonsingular then

z = (AᴴA)⁻¹Aᴴb.    (4.2.25)

This is the LS solution of the over-determined matrix equation Az = b.

EXAMPLE 4.2 For an over-determined matrix equation Az = b, define the log-likelihood function

l(ẑ) = C − (1/σ²) eᴴe = C − (1/σ²)(b − Aẑ)ᴴ(b − Aẑ),    (4.2.26)

where C is a real constant. Then the conjugate gradient of the log-likelihood function is given by

∇_{ẑ∗} l(ẑ) = ∂l(ẑ)/∂ẑ∗ = (1/σ²) Aᴴb − (1/σ²) AᴴAẑ.

By setting ∇_{ẑ∗} l(ẑ) = 0, we obtain AᴴAz_opt = Aᴴb, where z_opt is the solution maximizing the log-likelihood function l(ẑ). Hence, if AᴴA is nonsingular then

z_opt = (AᴴA)⁻¹Aᴴb.    (4.2.27)

This is the ML solution of the matrix equation Az = b. Clearly, the ML solution and the LS solution are the same for the matrix equation Az = b.

2. Sufficient Condition for a Local Minimum Point
The sufficient conditions for a strict local minimum point are given by Equation (4.2.11). In particular, for a convex function f(z, z∗), any local minimum point z₀ is a global minimum point of the function. If the convex function f(z, z∗) is differentiable, then the stationary point z₀ such that ∂f(z, z∗)/∂z∗ |_{z=z₀} = 0 is a global minimum point of f(z, z∗).
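As a quick numerical illustration of the closed-form solutions (4.2.25) and (4.2.27) (my own sketch, not from the text), the following NumPy fragment forms z = (AᴴA)⁻¹Aᴴb by zeroing the conjugate gradient and cross-checks it against a library least-squares solver; the matrix A and vector b are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3)) + 1j * rng.standard_normal((8, 3))   # over-determined, complex
b = rng.standard_normal(8) + 1j * rng.standard_normal(8)

# Zeroing the conjugate gradient A^H A z - A^H b = 0 gives z = (A^H A)^{-1} A^H b
AH = A.conj().T
z_closed = np.linalg.solve(AH @ A, AH @ b)

# Cross-check against NumPy's least-squares solver
z_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(z_closed, z_lstsq))   # expected: True (up to rounding)
```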

3. Steepest Descent Direction of a Real Objective Function
There are two choices in determining a stationary point C of an objective function f(Z, Z∗) with a complex matrix as variable:

∂f(Z, Z∗)/∂Z |_{Z=C} = O_{m×n}   or   ∂f(Z, Z∗)/∂Z∗ |_{Z=C} = O_{m×n}.    (4.2.28)

The question is which gradient to select when designing a learning algorithm for an optimization problem. To answer this question, it is necessary to introduce the definition of the curvature direction.

DEFINITION 4.6 [158] If H is the Hessian operator acting on a nonlinear function f(x), a vector p is said to be:
(1) the direction of positive curvature of the function f, if pᴴHp > 0;
(2) the direction of zero curvature of the function f, if pᴴHp = 0;
(3) the direction of negative curvature of the function f, if pᴴHp < 0.
The scalar pᴴHp is called the curvature of the function f along the direction p. The curvature direction is the direction of the maximum rate of change of the objective function.

THEOREM 4.4 [57] Let f(z) be a real-valued function of the complex vector z. By regarding z and z∗ as independent variables, the curvature direction of the real objective function f(z, z∗) is given by the conjugate gradient vector ∇_{z∗}f(z, z∗).

Theorem 4.4 shows that each component of ∇_{z∗}f(z, z∗), or ∇_{vec Z∗}f(Z, Z∗), gives the rate of change of the objective function f(z, z∗), or f(Z, Z∗), in this direction:
• the conjugate gradient vector ∇_{z∗}f(z, z∗), or ∇_{vec Z∗}f(Z, Z∗), gives the fastest growing direction of the objective function;
• the negative conjugate gradient vector −∇_{z∗}f(z, z∗), or −∇_{vec Z∗}f(Z, Z∗), provides the steepest decreasing direction of the objective function.
Hence, along the negative conjugate gradient direction, the objective function reaches its minimum point at the fastest rate. This optimization algorithm is called the gradient descent algorithm or steepest descent algorithm. As a geometric interpretation of Theorem 4.4, Figure 4.2 shows the gradient and the conjugate gradient of the function f(z) = |z|² at the points c₁ and c₂, where ∇_{c∗ᵢ}f = ∂f(z)/∂z∗ |_{z=cᵢ}, i = 1, 2.

Figure 4.2 The gradient and the conjugate gradient of f(z) = |z|² at the points c₁ and c₂.

Clearly, only the negative direction of the conjugate gradient −∇_{z∗}f(z) points to the global minimum z = 0 of the objective function. Hence, in minimization problems, the negative direction of the conjugate gradient is used as the updating direction:

z_k = z_{k−1} − μ ∇_{z∗}f(z),   μ > 0.    (4.2.29)
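To make the update (4.2.29) concrete, here is a small NumPy sketch (an illustration, not the book's code) that minimizes J(z) = ‖Az − b‖₂² by stepping along the negative conjugate gradient ∇_{z∗}J(z) = Aᴴ(Az − b); the fixed step size μ is a heuristic choice assumed small enough for convergence.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 4)) + 1j * rng.standard_normal((20, 4))
b = rng.standard_normal(20) + 1j * rng.standard_normal(20)

mu = 1.0 / np.linalg.norm(A.conj().T @ A, 2)   # heuristic step size (assumption)
z = np.zeros(4, dtype=complex)
for k in range(500):
    conj_grad = A.conj().T @ (A @ z - b)       # conjugate gradient vector of J(z)
    z = z - mu * conj_grad                     # update (4.2.29)

# The iterates approach the LS/ML solution (A^H A)^{-1} A^H b
z_ls = np.linalg.solve(A.conj().T @ A, A.conj().T @ b)
print(np.linalg.norm(z - z_ls))                # should be small
```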

That is to say, the correction −μ∇_{z∗}f(z) applied to the iterate at each step is proportional to the negative conjugate gradient of the objective function. Equation (4.2.29) is called the learning algorithm for updating the approximate solution of the optimization problem. Because the direction of the negative conjugate gradient vector always points in the decreasing direction of the objective function, this type of learning algorithm is called a steepest descent algorithm. The constant μ in a steepest descent algorithm is referred to as the learning step (step size) and determines the rate at which the iterates converge to the optimal solution.

4. Newton Algorithm for Solving Minimization Problems
The conjugate gradient vector contains only first-order information about an objective function. If, further, we adopt the second-order derivative information provided by the Hessian matrix of the objective function, then we obtain an optimization algorithm with better performance. An optimization algorithm based on the Hessian matrix is the well-known Newton method. The Newton method is a simple and efficient unconstrained optimization algorithm, and is widely applied.

4.3 Convex Sets and Convex Function Identification

In the above section we discussed unconstrained optimization problems. This section presents constrained optimization theory. The basic idea of solving a constrained optimization problem is to transform it into an unconstrained optimization problem.

4.3.1 Standard Constrained Optimization Problems

Consider the standard form of a constrained optimization problem

min_x f₀(x)   subject to   fᵢ(x) ≤ 0, i = 1, . . . , m,   Ax = b,    (4.3.1)

which can be written as

min_x f₀(x)   subject to   fᵢ(x) ≤ 0, i = 1, . . . , m,   hᵢ(x) = 0, i = 1, . . . , q.    (4.3.2)

The variable x in constrained optimization problems is called the optimization variable or decision variable and the function f₀(x) is the objective function or cost function, while

fᵢ(x) ≤ 0, x ∈ I,   and   hᵢ(x) = 0, x ∈ E,    (4.3.3)


are known as the inequality constraint and the equality constraint, respectively; I and E are respectively the definition domains of the inequality constraint functions and the equality constraint functions, i.e.,

I = ⋂_{i=1}^{m} dom fᵢ   and   E = ⋂_{i=1}^{q} dom hᵢ.    (4.3.4)

The inequality constraint and the equality constraint are collectively called explicit constraints. An optimization problem without explicit constraints (i.e., m = q = 0) reduces to an unconstrained optimization problem. The inequality constraints fᵢ(x) ≤ 0, i = 1, . . . , m, and the equality constraints hᵢ(x) = 0, i = 1, . . . , q, denote m + q strict requirements or stipulations restricting the possible choices of x. The objective function f₀(x) represents the cost paid by selecting x. Conversely, the negative objective function −f₀(x) can be understood as the value or benefit obtained by selecting x. Hence, solving the constrained optimization problem (4.3.2) amounts to choosing x, subject to the m + q strict requirements, such that the cost is minimized or the value is maximized.

The optimal value of the constrained optimization problem (4.3.2) is denoted p⋆ and is defined as the infimum of the objective function f₀(x):

p⋆ = inf{f₀(x) | fᵢ(x) ≤ 0, i = 1, . . . , m, hᵢ(x) = 0, i = 1, . . . , q}.    (4.3.5)

If p⋆ = +∞, then the constrained optimization problem (4.3.2) is said to be infeasible, i.e., no point x can meet the m + q constraint conditions. If p⋆ = −∞ then the constrained optimization problem (4.3.2) is lower unbounded.

The following are the key steps in solving the constrained optimization (4.3.2).
(1) Search for the feasible points of a given constrained optimization problem.
(2) Search for the point at which the constrained optimization problem reaches its optimal value.
(3) Avoid turning the original constrained optimization into a lower unbounded problem.

A point x meeting all the inequality constraints and equality constraints is called a feasible point. The set of all feasible points is called the feasible domain or feasible set, denoted F, and is defined as

F ≜ I ∩ E = {x | fᵢ(x) ≤ 0, i = 1, . . . , m, hᵢ(x) = 0, i = 1, . . . , q}.    (4.3.6)

The points not in the feasible set are referred to as infeasible points. The intersection of the definition domain dom f₀ of the objective function and the feasible domain F,

D = dom f₀ ∩ ⋂_{i=1}^{m} dom fᵢ ∩ ⋂_{i=1}^{q} dom hᵢ = dom f₀ ∩ F,    (4.3.7)


is called the definition domain of the constrained optimization problem. A feasible point x is optimal if f₀(x) = p⋆.

It is usually difficult to solve a constrained optimization problem, and particularly so when the number of decision variables in x is large. This is due to the following reasons [211].
(1) A constrained optimization problem may have local optimal solutions in its definition domain.
(2) It is very difficult to find a feasible point.
(3) The stopping criteria used in general unconstrained optimization algorithms usually fail in a constrained optimization problem.
(4) The convergence rate of a constrained optimization algorithm is usually poor.
(5) Numerical problems can make a constrained minimization algorithm either stop completely or wander, so that the algorithm cannot converge in the normal way.

The above difficulties in solving constrained optimization problems can be overcome by using the convex optimization technique. In essence, convex optimization is the minimization (or maximization) of a convex (or concave) function under the constraint of a convex set. Convex optimization is a fusion of optimization, convex analysis and numerical computation.
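As a small illustration of the standard form (4.3.2) (my own toy example, not from the text), the sketch below solves a problem with one inequality and one equality constraint using SciPy's general-purpose solver. Note that SciPy's "ineq" convention is g(x) ≥ 0, so an inequality fᵢ(x) ≤ 0 is passed as −fᵢ(x) ≥ 0; the objective and constraints here are made up.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f0(x) = (x1 - 1)^2 + (x2 - 2)^2
# subject to f1(x) = x1 + x2 - 2 <= 0 and h1(x) = x1 - x2 = 0.
f0 = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
constraints = [
    {"type": "ineq", "fun": lambda x: -(x[0] + x[1] - 2.0)},  # -f1(x) >= 0
    {"type": "eq",   "fun": lambda x: x[0] - x[1]},           # h1(x) = 0
]
result = minimize(f0, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(result.x)   # feasible minimizer, approximately [1, 1]
```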

4.3.2 Convex Sets and Convex Functions

First, we introduce two basic concepts in convex optimization.

DEFINITION 4.7 A set S ⊆ Rⁿ is convex if the line segment connecting any two points x, y ∈ S lies in the set S, namely

x, y ∈ S, θ ∈ [0, 1]  ⇒  θx + (1 − θ)y ∈ S.    (4.3.8)

Figure 4.3 gives a schematic diagram of a convex set and a nonconvex set.

Figure 4.3 Convex set and nonconvex set.

Many familiar sets are convex, e.g., the unit ball S = {x | ‖x‖₂ ≤ 1}. However, the unit sphere S = {x | ‖x‖₂ = 1} is not a convex set, because the line segment connecting two points on the sphere does not itself lie on the sphere.
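A quick numerical illustration of Definition 4.7 (my own toy check, not from the text): sample pairs of points from a candidate set and test whether their convex combinations stay in the set. The unit ball passes the sampled test, while the unit sphere fails it.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_ball_point(dim=3):
    v = rng.standard_normal(dim)
    return (v / np.linalg.norm(v)) * rng.uniform() ** (1.0 / dim)   # point inside the unit ball

def rand_sphere_point(dim=3):
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)                                    # point on the unit sphere

def seems_convex(sample_point, in_set, n_pairs=200):
    """Sample-based check of theta*x + (1-theta)*y in S for x, y in S (Definition 4.7)."""
    for _ in range(n_pairs):
        x, y, theta = sample_point(), sample_point(), rng.uniform()
        if not in_set(theta * x + (1 - theta) * y):
            return False
    return True

print(seems_convex(rand_ball_point,   lambda z: np.linalg.norm(z) <= 1 + 1e-12))        # True
print(seems_convex(rand_sphere_point, lambda z: abs(np.linalg.norm(z) - 1.0) < 1e-9))   # False
```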


A convex set has the following important properties [344, Theorem 2.2.4]. Letting S₁ ⊆ Rⁿ and S₂ ⊆ Rᵐ be two convex sets, and A(x) : Rⁿ → Rᵐ be the linear operator such that A(x) = Ax + b, then:
1. The intersection S₁ ∩ S₂ = {x ∈ Rⁿ | x ∈ S₁, x ∈ S₂} (m = n) is a convex set.
2. The sum of sets S₁ + S₂ = {z = x + y | x ∈ S₁, y ∈ S₂} (m = n) is a convex set.
3. The direct sum S₁ ⊕ S₂ = {(x, y) ∈ Rⁿ⁺ᵐ | x ∈ S₁, y ∈ S₂} is a convex set.
4. The conic hull K(S₁) = {z ∈ Rⁿ | z = βx, x ∈ S₁, β ≥ 0} is a convex set.
5. The affine image A(S₁) = {y ∈ Rᵐ | y = A(x), x ∈ S₁} is a convex set.
6. The inverse affine image A⁻¹(S₂) = {x ∈ Rⁿ | x = A⁻¹(y), y ∈ S₂} is a convex set.
7. The following convex hull is a convex set:
   conv(S₁, S₂) = {z ∈ Rⁿ | z = αx + (1 − α)y, x ∈ S₁, y ∈ S₂; α ∈ [0, 1]}.

The most important property of a convex set is an extension of Property 1: the intersection of any number (even an uncountable number) of convex sets is still a convex set. For instance, the intersection of the unit ball and the nonnegative orthant Rⁿ₊, S = {x | ‖x‖₂ ≤ 1, xᵢ ≥ 0}, retains convexity. However, the union of two convex sets is not usually convex. For example, S₁ = {x | ‖x‖₂ ≤ 1} and S₂ = {x | ‖x − 3·1‖₂ ≤ 1} (where 1 is a vector with all entries equal to 1) are convex sets, but their union S₁ ∪ S₂ is not convex, because the segment connecting any two points in the two balls is not in S₁ ∪ S₂.

Given a vector x ∈ Rⁿ and a constant ρ > 0, then

Bo(x, ρ) = {y ∈ Rⁿ | ‖y − x‖₂ < ρ},    (4.3.9)
Bc(x, ρ) = {y ∈ Rⁿ | ‖y − x‖₂ ≤ ρ},    (4.3.10)

are called the open ball and the closed ball with center x and radius ρ.

A convex set S ⊆ Rⁿ is known as a convex cone if all rays starting at the origin and all segments connecting any two points of these rays are still within the set, i.e.,

x, y ∈ S, λ, μ ≥ 0  ⇒  λx + μy ∈ S.    (4.3.11)

The nonnegative orthant Rⁿ₊ = {x ∈ Rⁿ | x ⪰ 0} is a convex cone. The set of positive semi-definite matrices, Sⁿ₊ = {X ∈ Rⁿˣⁿ | X ⪰ 0}, is a convex cone as well, since any positive combination of positive semi-definite matrices is still positive semi-definite. Hence, Sⁿ₊ is called the positive semi-definite cone.

DEFINITION 4.8 The vector function f(x) : Rⁿ → Rᵐ is known as an affine function if it has the form

f(x) = Ax + b.    (4.3.12)


Similarly, the matrix function F(x) : Rn → Rp×q is called an affine function, if it has the form F(x) = A0 + x1 A1 + · · · + xn An ,

(4.3.13)

where Aᵢ ∈ Rᵖˣq. An affine function is sometimes loosely referred to as a linear function (see below).

DEFINITION 4.9 [211] Given n × 1 vectors xᵢ ∈ Rⁿ and real numbers θᵢ ∈ R, then y = θ₁x₁ + · · · + θₖxₖ is called:
(a) a linear combination for any real numbers θᵢ;
(b) an affine combination if Σᵢ θᵢ = 1;
(c) a convex combination if Σᵢ θᵢ = 1 and all θᵢ ≥ 0;
(d) a conic combination if θᵢ ≥ 0, i = 1, . . . , k.

An intersection of sets has the following properties [211].
1. The intersection of subspaces is also a subspace.
2. The intersection of affine sets is also an affine set.
3. The intersection of convex sets is also a convex set.
4. The intersection of convex cones is also a convex cone.

DEFINITION 4.10 [344] Given a convex set S ⊆ Rⁿ and a function f : S → R, then we have the following.
• The function f : Rⁿ → R is convex if and only if S = dom f is a convex set and, for all vectors x, y ∈ S and each scalar α ∈ (0, 1), the function satisfies the Jensen inequality

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).    (4.3.14)

• The function f(x) is known as a strictly convex function if and only if S = dom f is a convex set and, for all vectors x, y ∈ S with x ≠ y and each scalar α ∈ (0, 1), the function satisfies the strict inequality

f(αx + (1 − α)y) < αf(x) + (1 − α)f(y).    (4.3.15)

In optimization, it is often required that an objective function is strongly convex. A strongly convex function has the following three equivalent definitions.
(1) The function f(x) is strongly convex if [344]

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (μ/2) α(1 − α) ‖x − y‖₂²    (4.3.16)

holds for all vectors x, y ∈ S and scalars α ∈ [0, 1] and μ > 0.


(2) The function f(x) is strongly convex if [43]

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ μ ‖x − y‖₂²    (4.3.17)

holds for all vectors x, y ∈ S and some scalar μ > 0.
(3) The function f(x) is strongly convex if [344]

f(y) ≥ f(x) + [∇f(x)]ᵀ(y − x) + (μ/2) ‖y − x‖₂²    (4.3.18)

holds for some scalar μ > 0.

In the above three definitions, the constant μ (> 0) is called the convexity parameter of the strongly convex function f(x). The relationship between convex functions, strictly convex functions and strongly convex functions is

strongly convex function ⇒ strictly convex function ⇒ convex function.    (4.3.19)

DEFINITION 4.11 [460] A function f(x) is quasi-convex if, for all vectors x, y ∈ S and a scalar α ∈ [0, 1], the inequality

f(αx + (1 − α)y) ≤ max{f(x), f(y)}    (4.3.20)

holds. A function f(x) is strongly quasi-convex if, for all vectors x, y ∈ S with x ≠ y and a scalar α ∈ (0, 1), the strict inequality

f(αx + (1 − α)y) < max{f(x), f(y)}    (4.3.21)

holds. A function f(x) is referred to as strictly quasi-convex if the strict inequality (4.3.21) holds for all vectors x, y ∈ S with f(x) ≠ f(y) and a scalar α ∈ (0, 1).

4.3.3 Convex Function Identification

Consider an objective function f(x) : S → R defined in the convex set S. A natural question to ask is how to determine whether the function is convex. Convex-function identification methods are divided into first-order gradient and second-order gradient identification methods.

DEFINITION 4.12 [432] Given a convex set S ⊆ Rⁿ, the mapping function F(x) : S → Rⁿ is called:
(a) a monotone function in the convex set S if

⟨F(x) − F(y), x − y⟩ ≥ 0,   ∀ x, y ∈ S;    (4.3.22)

(b) a strictly monotone function in the convex set S if

⟨F(x) − F(y), x − y⟩ > 0,   ∀ x, y ∈ S and x ≠ y;    (4.3.23)

(c) a strongly monotone function in the convex set S if

⟨F(x) − F(y), x − y⟩ ≥ μ ‖x − y‖₂²,   ∀ x, y ∈ S.    (4.3.24)


If the gradient vector of a function f(x) is taken as the mapping function, i.e., F(x) = ∇ₓf(x), then we have the following first-order necessary and sufficient conditions for identifying a convex function.

1. First-Order Necessary and Sufficient Conditions

THEOREM 4.5 [432] Let f : S → R be a differentiable function defined in the convex set S of the n-dimensional vector space Rⁿ. Then, for all vectors x, y ∈ S, we have

f(x) convex            ⇔  ⟨∇ₓf(x) − ∇ₓf(y), x − y⟩ ≥ 0,
f(x) strictly convex   ⇔  ⟨∇ₓf(x) − ∇ₓf(y), x − y⟩ > 0,  x ≠ y,
f(x) strongly convex   ⇔  ⟨∇ₓf(x) − ∇ₓf(y), x − y⟩ ≥ μ ‖x − y‖₂².

THEOREM 4.6 [53] If the function f : S → R is differentiable in a convex definition domain then f is convex if and only if

f(y) ≥ f(x) + ⟨∇ₓf(x), y − x⟩.    (4.3.25)

2. Second-Order Necessary and Sufficient Conditions

THEOREM 4.7 [311] Let f : S → R be a twice differentiable function defined in the convex set S ⊆ Rⁿ. Then f(x) is a convex function if and only if the Hessian matrix is positive semi-definite, namely

Hₓ[f(x)] = ∂²f(x)/(∂x ∂xᵀ) ⪰ 0,   ∀ x ∈ S.    (4.3.26)

Remark Let f : S → R be a twice differentiable function defined in the convex set S ⊆ Rⁿ; then f(x) is strictly convex if the Hessian matrix is positive definite, namely

Hₓ[f(x)] = ∂²f(x)/(∂x ∂xᵀ) ≻ 0,   ∀ x ∈ S.    (4.3.27)

Whereas the sufficient condition for a strict minimum point requires the Hessian matrix to be positive definite only at one point c, Equation (4.3.27) requires the Hessian matrix to be positive definite at all points in the convex set S.

The following basic properties are useful for identifying the convexity of a given function [211]; a simple numerical check based on Theorem 4.7 is sketched after this list.
1. The function f : Rⁿ → R is convex if and only if f̃(t) = f(x₀ + th) is convex for t ∈ R and all vectors x₀, h ∈ Rⁿ.
2. If α₁, α₂ ≥ 0, and f₁(x) and f₂(x) are convex functions, then f(x) = α₁f₁(x) + α₂f₂(x) is also convex.
3. If p(y) ≥ 0 and q(x, y) is a convex function at the point x ∈ S then ∫ p(y) q(x, y) dy is convex at all points x ∈ S.


4. If f(x) is a convex function then its affine transformation f(Ax + b) is also convex.

It should be pointed out that, apart from the ℓ₀-norm, the vector norms for all other ℓₚ (p ≥ 1),

‖x‖ₚ = (Σ_{i=1}^{n} |xᵢ|ᵖ)^{1/p},  p ≥ 1;   ‖x‖∞ = maxᵢ |xᵢ|,    (4.3.28)

are convex functions.
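The second-order condition of Theorem 4.7 suggests a simple (non-rigorous) numerical screen for convexity, sketched below: sample points in the domain and check that the smallest eigenvalue of a finite-difference Hessian is nonnegative at each sample. The test function and sampling box are arbitrary choices for illustration.

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Central-difference approximation of the Hessian of f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

def looks_convex(f, dim, n_samples=200, box=2.0, tol=-1e-6):
    """Return False if some sampled Hessian has a clearly negative eigenvalue."""
    rng = np.random.default_rng(0)
    for _ in range(n_samples):
        x = rng.uniform(-box, box, dim)
        if np.linalg.eigvalsh(numerical_hessian(f, x)).min() < tol:
            return False
    return True   # consistent with convexity on the sampled points (not a proof)

f = lambda x: np.log(1.0 + np.exp(x[0] + 2 * x[1])) + x @ x   # convex example
print(looks_convex(f, dim=2))   # expected: True
```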

4.4 Gradient Methods for Smooth Convex Optimization

Convex optimization is divided into first-order algorithms and second-order algorithms. Focusing on smooth functions, this section presents first-order algorithms for smooth convex optimization: the gradient method, the gradient-projection method, the conjugate gradient method and the Nesterov optimal gradient method.

4.4.1 Gradient Method

Consider the smooth convex optimization problem of an objective function f : Q → R, described by

min_{x∈Q} f(x),    (4.4.1)

where x ∈ Q ⊂ Rⁿ, and f(x) is smooth and convex. As shown in Figure 4.4, the vector x can be regarded as the input of a black box, the outputs of which are the function value f(x) and the gradient ∇f(x).

Figure 4.4 The first-order black-box method.

Let x_opt denote the optimal solution of min f(x). First-order black-box optimization uses only f(x) and its gradient vector ∇f(x) to find a vector y ∈ Q such that f(y) − f(x_opt) ≤ ε, where ε is a given accuracy. A solution y satisfying this condition is known as an ε-suboptimal solution of the objective function f(x). The first-order black-box optimization method includes two basic tasks:
(1) the design of a first-order iterative optimization algorithm;


(2) the convergence-rate or complexity analysis of the optimization algorithm.

First, we introduce optimization-algorithm design. The descent method is the simplest first-order optimization method. Its basic idea for solving the unconstrained minimization problem of a convex function min f(x) is as follows: when Q = Rⁿ, use the optimization sequence

x_{k+1} = x_k + μ_k Δx_k,   k = 1, 2, . . .    (4.4.2)

to search for the optimal point x_opt. In the above equation, k = 1, 2, . . . represents the iteration number and μ_k ≥ 0 is the step size or step length of the kth iteration; Δx denotes a vector in Rⁿ known as the step direction or search direction of the objective function f(x), and Δx_k = x_{k+1} − x_k represents the search direction at the kth iteration. Because the minimization-algorithm design requires the objective function to decrease during the iteration process, i.e.,

f(x_{k+1}) < f(x_k),   ∀ x_k ∈ dom f,    (4.4.3)

this method is called the descent method.

From the first-order approximation of the objective function at the point x_k,

f(x_{k+1}) ≈ f(x_k) + (∇f(x_k))ᵀ Δx_k,    (4.4.4)

it follows that if

(∇f(x_k))ᵀ Δx_k < 0    (4.4.5)

then f(x_{k+1}) < f(x_k). Hence the search direction Δx_k such that (∇f(x_k))ᵀ Δx_k < 0 is called the descent step or descent direction of the objective function f(x) at the kth iteration. Obviously, in order to make (∇f(x_k))ᵀ Δx_k < 0, we should take

Δx_k = −∇f(x_k) cos θ,    (4.4.6)

where 0 ≤ θ < π/2 is the acute angle between the descent direction and the negative gradient direction −∇f(x_k). The case θ = 0 implies Δx_k = −∇f(x_k). That is to say, the negative gradient direction of the objective function f at the point x is directly taken as the search direction. In this case, the length of the descent step ‖Δx_k‖₂ = ‖∇f(x_k)‖₂ takes its maximum value, so the descent direction Δx_k has the maximal descent rate, and the corresponding descent algorithm,

x_{k+1} = x_k − μ_k ∇f(x_k),   k = 1, 2, . . . ,    (4.4.7)

is called the steepest descent method.

The steepest descent method can be explained using a quadratic approximation to the function. The second-order Taylor series expansion is given by

f(y) ≈ f(x) + (∇f(x))ᵀ(y − x) + (1/2)(y − x)ᵀ ∇²f(x) (y − x).    (4.4.8)

If t⁻¹I is used to replace the Hessian matrix ∇²f(x) then

f(y) ≈ f(x) + (∇f(x))ᵀ(y − x) + (1/(2t)) ‖y − x‖₂².    (4.4.9)

This is the quadratic approximation to the function f(x) at the point y. It is easily found that the gradient vector ∇f(y) = ∂f(y)/∂y = ∇f(x) + t⁻¹(y − x). Letting ∇f(y) = 0, the solution is given by y = x − t∇f(x). Then, letting y = x_{k+1} and x = x_k, we immediately get the updating formula of the steepest descent algorithm, x_{k+1} = x_k − t∇f(x_k).

The steepest descent direction Δx = −∇f(x) uses only first-order gradient information about the objective function f(x). If second-order information, i.e., the Hessian matrix ∇²f(x_k) of the objective function, is used then we may find a better search direction. In this case, the optimal descent direction Δx is the solution minimizing the second-order Taylor approximation of f(x):

min_{Δx} { f(x + Δx) = f(x) + (∇f(x))ᵀ Δx + (1/2)(Δx)ᵀ ∇²f(x) Δx }.    (4.4.10)

At the optimal point, the gradient vector of the objective function with respect to the parameter vector Δx has to equal zero, i.e.,

∂f(x + Δx)/∂Δx = ∇f(x) + ∇²f(x) Δx = 0   ⇔   Δx_nt = −(∇²f(x))⁻¹ ∇f(x),    (4.4.11)

where Δxnt is called the Newton step or the Newton descent direction and the corresponding optimization method is the Newton or Newton–Raphson method. Algorithm 4.1 summarizes the gradient descent algorithm and its variants. Algorithm 4.1

Gradient descent algorithm and its variants

given: An initial point x1 ∈ dom f and the allowed error  > 0. Put k = 1. repeat 1. Compute the gradient ∇f (xk ) (and the Hessian matrix ∇2 f (xk )). 2. Choose the descent direction  − ∇f (xk ) (steepest descent method), Δxk = −1  2 ∇f (xk ) (Newton method). − ∇ f (xk ) 3. Choose a step μk > 0, and update xk+1 = xk + μk Δxk . 4. exit if |f (xk+1 ) − f (xk )| ≤ . return k ← k + 1. output: x ← xk .
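For concreteness, here is a minimal NumPy sketch of the steepest descent branch of Algorithm 4.1 with a fixed step size; the step-size value, stopping tolerance and quadratic test problem are my own illustrative choices, not prescriptions from the text.

```python
import numpy as np

def gradient_descent(grad_f, f, x1, mu=0.1, eps=1e-8, max_iter=10_000):
    """Steepest descent variant of Algorithm 4.1 with a fixed step size mu."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_iter):
        dx = -grad_f(x)              # steepest descent direction
        x_new = x + mu * dx          # x_{k+1} = x_k + mu_k * dx_k
        if abs(f(x_new) - f(x)) <= eps:
            return x_new
        x = x_new
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
print(gradient_descent(grad_f, f, x1=np.zeros(2)))   # approximately the solution of Ax = b
```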


According to different choices of the step μk , gradient descent algorithms have the following common variants [344]. (1) Before the gradient algorithm is run, a step sequence {μk }∞ k=0 is chosen, e.g., √ μk = μ (fixed step) or μk = h/ k + 1. (2) Full relaxation Take μk = arg min f (xk − μ∇f (xk )). μ≥0

(3) Goldstein–Armijo rule Find xk+1 = xk − μ∇f (xk ) such that α∇f (xk ), xk − xk+1  ≤ f (xk ) − f (xk+1 ), β∇f (xk ), xk − xk+1  ≥ f (xk ) − f (xk+1 ), where 0 < α < β < 1 are two fixed parameters. Note that the variable vector x is unconstrained in the gradient algorithms, i.e., x ∈ Rn . When choosing a constrained variable vector x ∈ C, where C ⊂ Rn , then the update formula in the gradient algorithm should be replaced by a formula containing the projected gradient: xk+1 = PC (xk − μk ∇f (xk )).

(4.4.12)

This algorithm is called the projected gradient method or the gradient-projection method. In (4.4.12), PC (y) is known as the projection operator and is defined as 1

x − y 22 . 2 x∈C The projection operator can be equivalently expressed as PC (y) = arg min

PC (y) = PC y,

(4.4.13)

(4.4.14)

where PC is the projection matrix onto the subspace C. If C is the column space of the matrix A, then PA = A(AT A)−1 AT .

(4.4.15)

We will discuss the projection matrix in detail in Chapter 9 (on projection analysis). In particular, if C = Rn , i.e., the variable vector x is unconstrained, then the projection operator is equal to the identity matrix, namely PC = I, and thus PRn (y) = Py = y,

∀ y ∈ Rn .

In this case, the projected gradient algorithm simplifies to the actual gradient algorithm. The following are projections of the vector x onto several typical sets [488]. 1. The projection onto the affine set C = {x|Ax = b} with A ∈ Rp×n , rank(A) = p is PC (x) = x + AT (AAT )−1 (b − Ax).

(4.4.16)

If p ' n or AAT = I, then the projection PC (x) gives a low cost result.


 2. The projection onto the hyperplane C = {xaT x = b} (where a = 0) is PC (x) = x +

b − aT x a.

a 22

(4.4.17)

3. The projection onto the nonnegative orthant C = Rn+ is PC (x) = (x)+



[(x)+ ]i = max{xi , 0}.

4. The projection onto the half-space C = {x|aT x ≤ b} (where a = 0): + b − aT x a, if aT x > b, x+ a22 PC (x) = x, if aT x ≤ b.

(4.4.18)

(4.4.19)

5. The projection onto the rectangular set C = [a, b] (where ai ≤ xi ≤ bi ) is + ai , PC (x) =

if xi ≤ ai ,

xi ,

if ai ≤ xi ≤ bi ,

bi ,

if xi ≥ bi .

(4.4.20)

6. The projection onto the second-order cones C = {(x, t)| x 2 ≤ t, x ∈ Rn } is ⎧ ⎪ (x, t), if x 2 ≤ t, ⎪ ⎪ ⎪   ⎨ t + x2 x PC (x) = (4.4.21) , if − t < x 2 < t, 2x2 t ⎪ ⎪ ⎪ ⎪ ⎩ (0, 0), if x 2 ≤ −t, x = 0. 7. The projection onto the Euclidean ball C = {x| x 2 ≤ 1} is + 1 x, if x 2 > 1, PC (x) = x2 x, if x 2 ≤ 1.

(4.4.22)

8. The projection onto the 1 -norm ball C = {x| x 1 ≤ 1} is + xi − λ, if xi > λ, PC (x)i =

0,

if − λ ≤ xi ≤ λ,

xi + λ,

if xi < −λ,

(4.4.23)

where λ = 0 if x 1 ≤ 1; otherwise, λ is the solution of the equation n  max{| xi | − λ, 0} = 1. i=1

9. The projection onto the positive semi-definite cones C = Sn+ is n  PC (X) = max{0, λi }qi qTi , where X =

n i=1

i=1

λi qi qTi is the eigenvalue decomposition of X.

(4.4.24)

210

Gradient Analysis and Optimization

4.4.2 Conjugate Gradient Method Consider the iteration solution of the matrix equation Ax = b, where A ∈ Rn×n is a nonsingular matrix. This matrix equation can be equivalently written as x = (I − A)x + b.

(4.4.25)

This inspires the following iteration algorithm: xk+1 = (I − A)xk + b.

(4.4.26)

This algorithm is known as the Richardson iteration and can be rewritten in the more general form xk+1 = Mxk + c,

(4.4.27)

where M is an n × n matrix, called the iteration matrix. An iteration with the form of Equation (4.4.27) is known as a stationary iterative method and is not as effective as a nonstationary iterative method. Nonstationary iteration methods comprise a class of iteration methods in which xk+1 is related to the former iteration solutions xk , xk−1 , . . . , x0 . The most typical nonstationary iteration method is the Krylov subspace method, described by xk+1 = x0 + Kk ,

(4.4.28)

Kk = Span(r0 , Ar0 , . . . , Ak−1 r0 )

(4.4.29)

where

is called the kth iteration Krylov subspace; x0 is the initial iteration value and r0 denotes the initial residual vector. Krylov subspace methods have various forms, of which the three most common are the conjugate gradient method, the biconjugate gradient method and the preconditioned conjugate gradient method. 1. Conjugate Gradient Algorithm The conjugate gradient method uses r0 = Ax0 − b as the initial residual vector. The applicable object of the conjugate gradient method is limited to the symmetric positive definite equation, Ax = b, where A is an n × n symmetric positive definite matrix. The nonzero vector combination {p0 , p1 , . . . , pk } is called A-orthogonal or Aconjugate if pTi Apj = 0,

∀ i = j.

(4.4.30)

This property is known as A-orthogonality or A-conjugacy. Obviously, if A = I then the A-conjugacy reduces to general orthogonality. All algorithms adopting the conjugate vector as the update direction are known as conjugate direction algorithms. If the conjugate vectors p0 , p1 , . . . , pn−1 are not


predetermined, but are updated by the gradient descent method in the updating process, then we say that the minimization algorithm for the objective function f (x) is a conjugate gradient algorithm. Algorithm 4.2 gives a conjugate gradient algorithm. Algorithm 4.2

Conjugate gradient algorithm [179], [243]

input: A = AT ∈ Rn×n , b ∈ Rn , the largest iteration step number kmax and the allowed error . initialization: Choose x0 ∈ Rn , and let r = Ax0 − b and ρ0 = r22 . repeat 1. If k = 1 then p = r. Otherwise, let β = ρk−1 /ρk−2 and p = r + βp. 2. w = Ap. 3. α = ρk−1 /(pT w). 4. x = x + αp. 5. r = r − αw. 6. ρk = r22 . √ 7. exit if ρk < b2 or k = kmax . return k ← k + 1. output: x ← xk .
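A direct NumPy transcription of Algorithm 4.2 might look as follows. This is a sketch following the structure of the listing, with the standard residual convention r = b − Ax (the listing initializes r = Ax₀ − b; the two differ only in sign bookkeeping); the test problem is mine.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, k_max=1000):
    """Conjugate gradient for Ax = b with A symmetric positive definite (cf. Algorithm 4.2)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                  # residual
    rho = r @ r
    p = r.copy()
    for k in range(k_max):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        if k > 0:
            p = r + (rho / rho_old) * p      # new A-conjugate search direction
        w = A @ p
        alpha = rho / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
        rho_old, rho = rho, r @ r
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # the two should agree
```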

From Algorithm 4.2 it can be seen that in the iteration process of the conjugate gradient algorithm, the solution of the matrix equation Ax = b is given by xk =

k  i=1

αi pi =

k  ri−1 , ri−1  i=1

pi , Api 

pi ,

(4.4.31)

that is, xk belongs to the kth Krylov subspace xk ∈ Span{p1 , p2 , . . . , pk } = Span{r0 , Ar0 , . . . , Ak−1 r0 }. The fixed-iteration method requires the updating of the iteration matrix M, but no matrix needs to be updated in Algorithm 4.2. Hence, the Krylov subspace method is also called the matrix-free method [243]. 2. Biconjugate Gradient Algorithm If the matrix A is not a real symmetric matrix, then we can use the biconjugate gradient method of Fletcher [156] for solving the matrix equation Ax = b. As the ¯ , that are A-conjugate in name implies, there are two search directions, p and p this method: ¯ Ti Apj = pTi A¯ p pj = 0,

i = j,

¯rTi rj = rTi ¯rTj ¯ Tj ¯rTi pj = rTi p

= 0,

i = j,

= 0,

j < i.

(4.4.32)


Algorithm 4.3 shows a biconjugate gradient algorithm. Algorithm 4.3

Biconjugate gradient algorithm [156], [230]

¯1 = ¯ initialization: p1 = r1 , p r1 . repeat 1. αk = ¯ rTk rk /(¯ pTk Apk ). 2. rk+1 = rk − αk Apk . ¯k. 3. ¯rk+1 = ¯rk − αk AT p rTk+1 rk+1 /(¯rTk rk ). 4. βk = ¯ 5. pk+1 = rk+1 + βk pk . ¯ k+1 = ¯rk+1 + βk p ¯k. 6. p 7. exit if k = kmax . return k ← k + 1. ¯ k+1 ← x ¯ k + αk p ¯k. output: x ← xk + αk pk , x

3. Preconditioned Conjugate Gradient Algorithm Consider the symmetric indefinite saddle point problem      x f A BT = , B O q g where A is an n × n real symmetric positive definite matrix, B is an m × n real matrix with full row rank m (≤ n), and O is an m × m zero matrix. Bramble and Pasciak [56] developed a preconditioned conjugate gradient iteration method for solving the above problems. The basic idea of the preconditioned conjugate gradient iteration is as follows: through a clever choice of the scalar product form, the preconditioned saddle matrix becomes a symmetric positive definite matrix. To simplify the discussion, it is assumed that the matrix equation with large condition number Ax = b needs to be converted into a new symmetric positive definite equation. To this end, let M be a symmetric positive definite matrix that can approximate the matrix A and for which it is easier to find the inverse matrix. Hence, the original matrix equation Ax = b is converted into M−1 Ax = M−1 b such that two matrix equations have the same solution. However, in the new matrix equation M−1 Ax = M−1 b there exists a hidden danger: M−1 A is generally not either symmetric or positive definite even if both M and A are symmetric positive definite. Therefore, it is unreliable to use the matrix M−1 directly as the preprocessor of the matrix equation Ax = b. Let S be the square root of a symmetric matrix M, i.e., M = SST , where S is symmetric positive definite. Now, use S−1 instead of M−1 as the preprocessor to


ˆ, convert the original matrix equation Ax = b to S−1 Ax = S−1 b. If x = S−T x then the preconditioned matrix equation is given by ˆ = S−1 b. S−1 AS−T x

(4.4.33)

Compared with the matrix M−1 A, which is not symmetric positive definite, S AS−T must be symmetric positive definite if A is symmetric positive definite. The symmetry of S−1 AS−T is easily seen, its positive definiteness can be verified as follows: by checking the quadratic function it easily follows that yT (S−1 AS−T )y = zT Az, where z = S−T y. Because A is symmetric positive definite, we have zT Az > 0, ∀ z = 0, and thus yT (S−1 AS−T )y > 0, ∀ y = 0. That is to say, S−1 AS−T must be positive definite. At this point, the conjugate gradient method can be applied to solve the matrix ˆ , and then we can recover x via x = S−T x ˆ. equation (4.4.33) in order to get x Algorithm 4.4 shows a preconditioned conjugate gradient (PCG) algorithm with a preprocessor, developed in [440]. −1

Algorithm 4.4

PCG algorithm with preprocessor

input: A, b, preprocessor S−1 , maximal number of iterations kmax , and allowed error  < 1. initialization: k = 0, r0 = Ax − b, d0 = S−1 r0 , δnew = rT0 d0 , δ0 = δnew . repeat 1. qk+1 = Adk . 2. α = δnew /(dTk qk+1 ). 3. xk+1 = xk + αdk . 4. If k can be divided exactly by 50, then rk+1 = b − Axk . Otherwise, update rk+1 = rk − αqk+1 . 5. sk+1 = S−1 rk+1 . 6. δold = δnew . 7. δnew = rTk+1 sk+1 . 8. exit if k = kmax or δnew < 2 δ0 . 9. β = δnew /δold . 10. dk+1 = sk+1 + βdk . return k ← k + 1. output: x ← xk .

The use of a preprocessor can be avoided, because there are correspondences ˆ = S−1 b, as between the variables of the matrix equations Ax = b and S−1 AS−T x follows [243]: ˆk, xk = S−1 x

rk = Sˆrk ,

ˆk, pk = S−1 p

zk = S−1 ˆrk .

On the basis of these correspondences, it easy to develop a PCG algorithm without preprocessor [243], as shown in Algorithm 4.5.

214 Algorithm 4.5

Gradient Analysis and Optimization PCG algorithm without preprocessor

input: A = AT ∈ Rn×n , b ∈ Rn , maximal iteration number kmax , and allowed error . initialization: x0 ∈ Rn , r0 = Ax0 − b, ρ0 = r22 and M = SST . Put k = 1. repeat 1. zk = Mrk−1 . 2. τk−1 = zTk rk−1 . 3. If k = 1, then β = 0, p1 = z1 . Otherwise, β = τk−1 /τk−2 , pk ← zk + βpk . 4. wk = Apk . 5. α = τk−1 /(pTk wk ). 6. xk+1 = xk + αpk . 7. rk = rk−1 − αwk . 8. ρk = rTk rk . √ 9. exit if ρk < b2 or k = kmax . return k ← k + 1. output: x ← xk .

A wonderful introduction to the conjugate gradient method is presented in [440].

Regarding the complex matrix equation Ax = b, where A ∈ Cⁿˣⁿ, x ∈ Cⁿ, b ∈ Cⁿ, we can write it as the following real matrix equation:

[ A_R  −A_I ] [ x_R ]   [ b_R ]
[ A_I   A_R ] [ x_I ] = [ b_I ].    (4.4.34)

If A = A_R + jA_I is a Hermitian positive definite matrix, then Equation (4.4.34) is a symmetric positive definite matrix equation. Hence, (x_R, x_I) can be solved by adopting the conjugate gradient algorithm or the preconditioned conjugate gradient algorithm. The preconditioned conjugate gradient method is a widely used technique for solving partial differential equations, and it has important applications in optimal control [208]. As a matter of fact, it is usually necessary to apply the preconditioned conjugate gradient algorithm to solve the KKT equation or the Newton equation in optimization problems in order to improve the numerical stability of computing the optimization search direction. In addition to the three conjugate gradient algorithms described above, there is a projected conjugate gradient algorithm [179]. This algorithm requires the use of a seed space K_k(A, r₀) to solve the matrix equations Ax_q = b_q, q = 1, 2, . . .

The main advantage of gradient methods is that the computation of each iteration is very simple; however, their convergence is slow. Hence, an important question is how to accelerate a gradient method so as to get an optimal gradient method. To this end, it is necessary first to analyze the convergence rate of the gradient methods.
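The real embedding (4.4.34) is easy to set up in code. The sketch below (illustrative, with a random Hermitian positive definite test matrix) builds the block system, solves it, and reassembles the complex solution; since the block matrix is symmetric positive definite, the conjugate gradient method could equally be applied to it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = B.conj().T @ B + n * np.eye(n)          # Hermitian positive definite test matrix
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

AR, AI = A.real, A.imag
block = np.block([[AR, -AI],
                  [AI,  AR]])               # real embedding (4.4.34); symmetric positive definite
rhs = np.concatenate([b.real, b.imag])

y = np.linalg.solve(block, rhs)             # CG would also apply to this SPD system
x = y[:n] + 1j * y[n:]
print(np.allclose(A @ x, b))                # expected: True
```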


4.4.3 Convergence Rates By the convergence rate is meant the number of iterations needed for an optimization algorithm to make the estimated error of the objective function achieve the required accuracy, or given an iteration number K, the accuracy that an optimization algorithm reaches. The inverse of the convergence rate is known as the complexity of an optimization algorithm. Let x be a local or global minimum point. The estimation error of an optimization algorithm is defined as the difference between the value of an objective function at iteration point xk and its value at the global minimum point, i.e., δk = f (xk ) − f (x ). We are naturally interested in the convergence problems of an optimization algorithm: (1) Given an iteration number K, what is the designed accuracy

lim δk ?

1≤k≤K

(2) Given an allowed accuracy , how many iterations does the algorithm need to achieve the designed accuracy min δk ≤ ? k

When analyzing the convergence problems of optimization algorithms, we often focus on the speed of the updating sequence {xk } at which the objective function argument converges to its ideal minimum point x . In numerical analysis, the speed at which a sequence reaches its limit is called the convergence rate. 1. Q-Convergence Rates Assume that a sequence {xk } converges to x . If there are a real number α ≥ 1 and a positive constant μ independent of the iterations k such that

xk+1 − x 2 k→∞ xk − x α 2

μ = lim

(4.4.35)

then we say that {xk } has an α-order Q-convergence rate. The Q-convergence rate means the quotient convergence rate. It has the following typical values [360]: 1. When α = 1, the Q-convergence rate is called the limit-convergence rate of {xk }:

xk+1 − x 2 . k→∞ xk − x 2

μ = lim

(4.4.36)

According to the value of μ, the limit-convergence rate of the sequence {xk } can be divided into three types: (1) sublinear convergence rate, α = 1, μ = 1; (2) linear convergence rate, α = 1, μ ∈ (0, 1); (3) superlinear convergence rate, α = 1, μ = 0 or 1 < α < 2, μ = 0. 2. When α = 2, we say that the sequence {xk } has a quadratic convergence rate.


3. When α = 3, we say that the sequence {xk } has a cubic convergence rate. If {xk } is sublinearly convergent, and lim

k→∞

xk+2 − xk+1 2 = 1,

xk+1 − xk 2

then the sequence {xk } is said to have a logarithmic convergence rate. Sublinear convergence comprises a class of slow convergence rates; the linear convergence comprises a class of fast convergence; and the superlinear convergence and quadratic convergence comprise classes of respectively very fast and extremely fast convergence. When designing an optimization algorithm, we often require it at least to be linearly convergent, preferably to be quadratically convergent. The ultra-fast cubic convergence rate is in general difficult to achieve. 2. Local Convergence Rates The Q-convergence rate is a limited convergence rate. When we evaluate an optimization algorithm, a practical question is: how many iterations does it need to achieve the desired accuracy? The answer depends on the local convergence rate of the output objective function sequence of a given optimization algorithm. The local convergence rate of the sequence {xk } is denoted rk and is defined by 2 2 2 xk+1 − x 2 2 2 . rk = 2 (4.4.37) x k − x 2 2 The complexity of an optimization algorithm is defined as the inverse of the local convergence rate of the updating variable. The following is a classification of local convergence rates [344]. (1) Sublinear rate This convergence rate is given by a power function of the iteration number k and is usually denoted as   1 . (4.4.38) f (xk ) − f (x ) ≤ = O √ k (2) Linear rate This convergence rate is expressed as an exponential function of the iteration number k, and is usually defined as   1 f (xk ) − f (x ) ≤ = O . (4.4.39) k (3) Quadratic rate This convergence rate is a bi-exponential function of the iteration number k and is usually measured by   1 f (xk ) − f (x ) ≤ = O . (4.4.40) k2


For example, in order to achieve the approximation accuracy f (xk ) − f (x ) ≤ = 0.0001, optimization algorithms with sublinear, linear and quadratic rates need to run about 108 , 104 and 100 iterations, respectively. THEOREM 4.8 [344] Let = f (xk ) − f (x ) be the estimation error of the objective function given by the updating sequence of the gradient method xk+1 = xk − α∇f (xk ). For a convex function f (x), the upper bound of the estimation error = f (xk ) − f (x ) is given by f (xk ) − f (x ) ≤

2L x0 − x 22 . k+4

(4.4.41)

This theorem shows that the local convergence rate of the gradient method xk+1 = xk − α∇f (xk ) is the linear rate O(1/k). Although this is a fast convergence rate, it is not by far optimal, as will be seen in the next section.

4.5 Nesterov Optimal Gradient Method Let Q ⊂ R be a convex set in the vector space Rn . Consider the unconstrained optimization problem min f (x). n

x∈Q

4.5.1 Lipschitz Continuous Function DEFINITION 4.13 [344] We say that an objective function f (x) is Lipschitz continuous in the definition domain Q if |f (x) − f (y)| ≤ L x − y 2 ,

∀ x, y ∈ Q

(4.5.1)

holds for some Lipschitz constant L > 0. Similarly, we say that the gradient vector ∇f (x) of a differentiable function f (x) is Lipschitz continuous in the definition domain Q if

∇f (x) − ∇f (y) 2 ≤ L x − y 2 ,

∀ x, y ∈ Q

(4.5.2)

holds for some Lipschitz constant L > 0. Hereafter a Lipschitz continuous function with the Lipschitz constant L will be denoted as an L-Lipschitz continuous function. The function f (x) is said to be continuous at the point x0 , if limx→x0 f (x) = f (x0 ). When we say that f (x) is a continuous function, this means the f (x) is continuous at every point x ∈ Q. A Lipschitz continuous function f (x) must be a continuous function, but a continuous function is not necessarily a Lipschitz continuous function.


√ EXAMPLE 4.3 The function f (x) = 1/ x is not a Lipschitz continuous function in the open interval (0, 1). If it is assumed func  continuous   to be a Lipschitz 1  1 1  1 tion, then it must satisfy |f (x1 ) − f (x2 )| =  √ − √  ≤ L − , namely x1 x2 x1 x2   1   1 2  √ + √  ≤ L. However, when the point x → 0, and x1 = 1/n , x2 = 9/n2 , we x1 x2 √ have L ≥ n/4 → ∞. Hence, the continuous function f (x) = 1/ x is not a Lipschitz continuous function in the open interval (0, 1). An everywhere differentiable function is called a smooth function. A smooth function must be continuous, but a continuous function is not necessarily smooth. A typical example occurs when the continuous function has “sharp” point at which it is nondifferentiable. Hence, a Lipschitz continuous function is not necessarily differentiable, but a function f (x) with Lipschitz continuous gradient in the definition domain Q must be smooth in Q, because the definition of a Lipschitz continuous gradient stipulates that f (x) is differentiable in the definition domain. In convex optimization, the notation CLk,p (Q) (where Q ⊆ Rn ) denotes the Lipschitz continuous function class with the following properties [344]: • The function f ∈ CLk,p (Q) is k times continuously differentiable in Q. • The pth-order derivative of the function f ∈ CLk,p (Q) is L-Lipschitz continuous:

f (p) (x) − f (p) (y) ≤ L x − y 2 ,

∀ x, y ∈ Q.

CLk,p (Q)

is said to be a differentiable function. Obviously, it If k = 0, then f ∈ always has p ≤ k. If q > k then CLq,p (Q) ⊆ CLk,p (Q). For example, CL2,1 (Q) ⊆ CL1,1 (Q). The following are three common function classes CLk,p (Q): (1) f (x) ∈ CL0,0 (Q) is L-Lipschitz continuous but nondifferentiable in Q; (2) f (x) ∈ CL1,0 (Q) is L-Lipschitz continuously differentiable in Q, but its gradient is not; (3) f (x) ∈ CL1,1 (Q) is L-Lipschitz continuously differentiable in Q, and its gradient ∇f (x) is L-Lipschitz continuous in Q. The basic property of the CLk,p (Q) function class is that if f1 ∈ CLk,p (Q), f2 ∈ 1 k,p CL2 (Q) and α, β ∈ R then (Q), αf1 + βf2 ∈ CLk,p 3

where L3 = |α| L1 + |β| L2 . Among all Lipschitz continuous functions, CL1,1 (Q), with the Lipschitz continuous gradient, is the most important function class, and is widely applied in convex optimization. On the CL1,1 (Q) function class, one has the following two lemmas [344]. LEMMA 4.1

The function f (x) belongs to CL2,1 (Rn ) if and only if

f

(x) F ≤ L,

∀ x ∈ Rn .

(4.5.3)

4.5 Nesterov Optimal Gradient Method

LEMMA 4.2

219

If f (x) ∈ CL1,1 (Q) then

|f (y) − f (x) − ∇f (x), y − x| ≤

L

y − x 22 , 2

∀ x, y ∈ Q.

(4.5.4)

Since the function CL2,1 must also be a CL1,1 function, Lemma 4.1 directly provides a simple way to determine whether a function belongs to the CL1,1 function class. Below we give some examples of the applications of Lemma 4.1 to determine the CL1,1 (Q) function class. EXAMPLE 4.4 The linear function f (x) = a, x + b belongs to the C01,1 function class; namely, the gradient of any linear function is not Lipschitz continuous because f (x) = a,

f

(x) = O



f

(x) 2 = 0.

EXAMPLE 4.5 The quadratic function f (x) = 12 xT Ax + aT x + b belongs to the 1,1 C A (Rn ) function class, because F

∇f (x) = Ax + a,

∇2 f (x) = A



f

(x) 2 . = A F .

1,1 EXAMPLE 4.6 The logarithmic function f (x) = ln(1 + ex ) belongs to the C1/2 (R) function class, because   1 + e2x  1 ex ex 1 



≤ , 1 − f (x) = , f (x) = ⇒ |f (x)| = 1 + ex (1 + ex )2 2 (1 + ex )2  2

where, in order to find f

(x), we let y = f (x), yielding y = f

(x) = y(ln y) . √ EXAMPLE 4.7 The function f (x) = 1 + x2 belongs to the C11,1 (R) function class, because f (x) = √

x , 1 + x2

f

(x) =

1 (1 + x2 )3/2



|f

(x)| ≤ 1.

Lemma 4.2 is a key inequality for analyzing the convergence rate of a CL1,1 function f (x) in gradient algorithms. By Lemma 4.2, it follows [536] that L − xk 22

x 2 k+1 L L ≤ f (xk ) + ∇f (xk ), x − xk  + x − xk 22 − x − xk+1 22 2 2 L L 2 2 ≤ f (x) + x − xk 2 − x − xk+1 2 . 2 2

f (xk+1 ) ≤ f (xk ) + ∇f (xk ), xk+1 − xk  +

Put x = x and δk = f (xk ) − f (x ); then L L

x − xk+1 22 ≤ −δk+1 + x − xk 22 2 2 k+1  L ≤ ··· ≤ − δi + x − x0 22 . 2 i=1

0≤


From the estimation errors of the projected gradient algorithm δ1 ≥ δ2 ≥ · · · ≥ δk+1 , it follows that −(δ1 +· · ·+δk+1 ) ≤ −(k+1)δk+1 , and thus the above inequality can be simplified to 0≤

L L

x − xk+1 22 ≤ −(k + 1)δk + x − x0 22 , 2 2

from which the upper bound of the convergence rate of the projected gradient algorithm is given by [536] δk = f (xk ) − f (x ) ≤

L x − x0 22 . 2(k + 1)

(4.5.5)

This shows that, as for the (basic) gradient method, the local convergence rate of the projected gradient method is O(1/k).

4.5.2 Nesterov Optimal Gradient Algorithms THEOREM 4.9 [343] Let f (x) be a convex function with L-Lipschitz gradient, if the updating sequence {xk } meets the condition xk ∈ x0 + Span{x0 , . . . , xk−1 } then the lower bound of the estimation error = f (xk ) − f (x ) achieved by any first-order optimization method is given by f (xk ) − f (x ) ≥

3L x0 − x 22 , 32(k + 1)2

(4.5.6)

where Span{u0 , . . . , uk−1 } denotes the linear subspace spanned by u0 , . . . , uk−1 , x0 is the initial value of the gradient method and f (x ) denotes the minimal value of the function f . Theorem 4.9 shows that the optimal convergence rate of any first-order optimization method is the quadratic rate O(1/k 2 ). Since the convergence rate of gradient methods is the linear rate O(1/k), and the convergence rate of the optimal first-order optimization methods is the quadratic rate O(1/k 2 ), the gradient methods are far from optimal. The heavy ball method (HBM) can efficiently improve the convergence rate of the gradient methods. The HBM is a two-step method: let y0 and x0 be two initial vectors and let αk and βk be two positive valued sequences; then the first-order method for solving an unconstrained minimization problem minn f (x) can use two-step updates [396]: x∈R

yk = βk yk−1 − ∇f (xk ), xk+1 = xk + αk yk .

0 (4.5.7)


In particular, letting y0 = 0, the above two-step updates can be rewritten as the one-step update xk+1 = xk − αk ∇f (xk ) + βk (xk − xk−1 ),

(4.5.8)

where xk − xk−1 is called the momentum. As can be seen in (4.5.7), the HBM treats the iterations as a point mass with momentum xk − xk−1 , which thus tends to continue moving in the direction xk − xk−1 . Let yk = xk + βk (xk − xk−1 ), and use ∇f (yk ) instead of ∇f (xk ). Then the update equation (4.5.7) becomes 0 yk = xk + βk (xk − xk−1 ), xk+1 = yk − αk ∇f (yk ). This is the basic form of the optimal gradient method proposed by Nesterov in 1983 [343], usually called the Nesterov (first) optimal gradient algorithm and shown in Algorithm 4.6. Algorithm 4.6

Nesterov (first) optimal gradient algorithm [343]

initialization: Choose x−1 = 0, x0 ∈ Rn and α0 ∈ (0, 1). Set k = 0, q = μ/L. repeat 1. Compute αk+1 ∈ (0, 1) from the equation α2k+1 = (1 − αk+1 )α2k + qαk+1 . α (1 − αk ) 2. Set βk = k2 . αk + αk+1 3. Compute yk = xk + βk (xk − xk−1 ). 4. Compute xk+1 = yk − αk+1 ∇f (yk ). 5. exit if xk converges. return k ← k + 1. output: x ← xk .

It is easily seen that the Nesterov optimal gradient algorithm is intuitively like the heavy ball formula (4.5.7) but it is not identical; i.e., it uses ∇f (y) instead of ∇f (x). In the Nesterov optimal gradient algorithm, the estimation sequence {xk } is an approximation solution sequence, and {yk } is a searching point sequence. THEOREM 4.10 [343] Let f be a convex function with L-Lipschitz gradient. The Nesterov optimal gradient algorithm achieves f (xk ) − f (x∗ ) ≤

CL xk − x∗ 22 . (k + 1)2

(4.5.9)


Clearly, the convergence rate of the Nesterov optimal gradient method is the optimal rate for the first-order gradient methods up to constants. Hence, the Nesterov gradient method is indeed an optimal first-order minimization method. It has been shown [180] that the use of a strong convexity constant μ∗ is very effective in reducing the number of iterations of the Nesterov algorithm. A Nesterov algorithm with adaptive convexity parameter, using a decreasing sequence μk → μ∗ , was proposed in [180]. It is shown in Algorithm 4.7. Algorithm 4.7

Nesterov algorithm with adaptive convexity parameter

input: Lipschitz constant L, and the convexity parameter guess μ∗ ≤ L. given: x0 , v0 = x0 , γ0 > 0, β > 1, μ0 ∈ [μ∗ , γ0 ). Set k = 0, θ ∈ [0, 1], μ+ = μ0 . repeat 1. dk = vk − xk . 2. yk = xk + θk dk . 3. exit if ∇f (yk ) = 0. 4. Steepest descent step: xk+1 = yk − ν∇f (yk ). If L is known, then ν ≥ 1/L. 5. If (γk − μ∗ ) < β(μ+ − μ∗ ), then choose μ+ ∈ [μ∗ , γ/β]. ∇f (yk )2 . 6. Compute μ ˜= 2[f (yk ) − f (xk+1 )]

7. If μ+ ≥ μ ˜, then μ+ = max{μ∗ , μ ˜/10}.

8. Compute αk as the largest root of Aα2 + Bα + C = 0. 9. γk+1 = (1 − αk )γk + αk μ+ . 1 ((1 − αk )γk vk + αk (μ+ yk − ∇f (yk ))). 10. vk+1 = γk+1 11. return k ← k + 1. output: yk as an optimal solution.

The coefficients A, B, C in step 8 are given by )μ * G = γk

vk − yk 2 + (∇f (yk ))T (vk − yk ) , 2 1 A = G + ∇f (yk ) 2 + (μ − γk )(f (xk ) − f (yk )), 2   B = (μ − γk ) f (xk+1 ) − f (xk ) − γk (f (yk ) − f (xk )) − G,   C = γk f (xk+1 ) − f (xk ) , with μ = μ+ It is suggested that γ0 = L if L is known, μ0 = max{μ∗ , γ0 /100} and β = 1.02. The Nesterov (first) optimal gradient algorithm has two limitations: (1) yk may be out of the definition domain Q, and thus the objective function f (x) is required to be well-defined at every point in Q;


(2) this algorithm is available only for the minimization of the Euclidean norm

x 2 . To cope with these limitations, Nesterov developed two other optimal gradient algorithms [346], [345]. Denote by d(x) a prox-function of the domain Q. It is assumed that d(x) is continuous and strongly convex on Q. Depending on whether

x 1 or x 2 is being minimized, there are different choices for the prox-function [345]: n n   |xi |, d(x) = ln n + xi ln |xi |, (4.5.10) using x 1 = i=1

using x 2 =

 n

i=1

1/2 x2i

,

d(x) =

i=1

2 n  1 1 xi − . 2 i=1 n

Define the ith element of the n × 1 mapping vector VQ (u, g) as [345]  −1 n (i) −gi −gi uj e , i = 1, . . . , n. VQ (u, g) = ui e

(4.5.11)

(4.5.12)

j=1

Algorithm 4.8 shows the Nesterov third optimal gradient algorithm. Algorithm 4.8

Nesterov (third) optimal gradient algorithm [345]

given: Choose x0 ∈ Rn and α ∈ (0, 1). Set k = 0.   initialization: y0 = argminx L d(x) + 12 (f (x0 ) + f  (x0 ), x − x0 ) : x ∈ Q . α repeat    i+1 L d(x) + α 2 i=0 k

1. Find zk = argminx 2 k+3

(f (xi ) + ∇f (xi ), x − xi ) : x ∈ Q .

and xk+1 = τk zk + (1 − τk )yk .

α ˆ k+1 (i) = VQ(i) zk , τk ∇f (xk+1 ) , i = 1, . . . , n. 3. Use (4.5.12) to compute x

2. Set τk =

L

ˆ k+1 + (1 − τk )yk . 4. Set yk+1 = τk x 5. exit if xk converges. return k ← k + 1. output: x ← xk .

4.6 Nonsmooth Convex Optimization The gradient methods require that for the objective function f (x), the gradient ∇f (x) exists at the point x; and the Nesterov optimal gradient method requires, further, that the objective function has L-Lipschitz continuous gradient. Hence, the gradient methods in general and the Nesterov optimal gradient method in particular are available only for smooth convex optimization. As an important extension


of gradient methods, this section focuses upon the proximal gradient method, which is available for nonsmooth convex optimization. The following are two typical examples of nonsmooth convex minimization.
(1) Basis pursuit de-noising seeks a sparse near-solution to an under-determined matrix equation [95]:

min_{x∈Rⁿ} { ‖x‖₁ + (λ/2) ‖Ax − b‖₂² }.    (4.6.1)

This problem is essentially equivalent to the Lasso (sparse least squares) problem [462]:

min_{x∈Rⁿ} (λ/2) ‖Ax − b‖₂²   subject to   ‖x‖₁ ≤ K.    (4.6.2)

(2) Robust principal component analysis approximates an input data matrix D by a sum of a low-rank matrix L and a sparse matrix S:

min { ‖L‖∗ + λ‖S‖₁ + (γ/2) ‖L + S − D‖_F² }.    (4.6.3)

The key challenge of the above minimization problems is the nonsmoothness induced by the nondifferentiable ℓ₁-norm ‖·‖₁ and nuclear norm ‖·‖∗. Consider the following composite optimization problem

min_{x∈E} { F(x) = f(x) + h(x) },

(4.6.4)

where E ⊂ Rn is a finite-dimensional real vector space and h : E → R is a convex function, but is nondifferentiable or nonsmooth in E. f : Rn → R is a continuous smooth convex function, and its gradient is LLipschitz continuous:

∇f (x) − ∇f (y) 2 ≤ L x − y 2 ,

∀ x, y ∈ Rn .

To address the challenge of nonsmoothness, there are two problems to be solved: • how to cope with the nonsmoothness; • how to design a Nesterov-like optimal gradient method for nonsmooth convex optimization.
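Before developing the machinery below, it may help to see the flavor of a proximal (soft-thresholding) gradient step on the basis pursuit de-noising problem (4.6.1). The following ISTA-style sketch is an illustration with made-up data and a fixed step 1/L, not the book's algorithm.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, n_iter=500):
    """Minimize ||x||_1 + (lam/2) ||Ax - b||_2^2 by iterative soft thresholding."""
    L = lam * np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the smooth term's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = lam * A.T @ (A @ x - b)        # gradient of the smooth term
        x = soft_threshold(x - grad / L, 1.0 / L)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60); x_true[[5, 17, 40]] = [2.0, -1.5, 3.0]
b = A @ x_true
print(np.nonzero(np.abs(ista(A, b, lam=10.0)) > 1e-3)[0])   # mostly recovers the true support
```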

4.6.1 Subgradient and Subdifferential Because the gradient vector of a nonsmooth function h(x) does not exist everywhere, neither a gradient method nor the Nesterov optimal gradient method is available. A natural question to ask is whether a nonsmooth function has some class of “generalized gradient” similar to a gradient vector. For a twice continuous differentiable function f (x), its second-order approximation is given by f (x + Δx) ≈ f (x) + (∇f (x))T Δx + (Δx)T HΔx.


If the Hessian matrix H is positive semi-definite or positive definite then we have the inequality f (x + Δx) ≥ f (x) + (∇f (x))T Δx, or f (y) ≥ f (x) + (∇f (x))T (y − x),

∀ x, y ∈ dom f (x).

(4.6.5)

Although a nonsmooth function h(x) does not have a gradient vector ∇h(x), it is possible to find another vector g instead of the gradient vector ∇f (x) such that the inequality (4.6.5) still holds. DEFINITION 4.14 A vector g ∈ Rn is said to be a subgradient vector of the function h : Rn → R at the point x ∈ Rn if h(y) ≥ h(x) + gT (y − x),

∀ x, y ∈ dom h.

(4.6.6)

The set of all subgradient vectors of the function h at the point x is known as the subdifferential of the function h at the point x, denoted ∂h(x), and is defined as  def   ∂h(x) = gh(y) ≥ h(x) + gT (y − x), ∀ y ∈ dom h . (4.6.7) When h(x) is differentiable, we have ∂h(x) = {∇h(x)}, as the gradient vector of a smooth function is unique, so we can view the gradient operator ∇h of a convex and differentiable function as a point-to-point mapping, i.e., ∇h maps each point x ∈ dom h to the point ∇h(x). In contrast, the subdifferential operator ∂h, defined in Equation (4.6.7), of a closed proper convex function h can be viewed as a pointto-set mapping, i.e., ∂h maps each point x ∈ dom h to the set ∂h(x).1 Any point g ∈ ∂h(x) is called a subgradient of h at x. Generally speaking, a function h(x) may have one or several subgradient vectors at some point x. The function h(x) is said to be subdifferentiable at the point x if it at least has one subgradient vector. More generally, the function h(x) is said to be subdifferentiable in the definition domain dom h, if it is subdifferentiable at all points x ∈ dom h. The basic properties of the subdifferential are as follows [344], [54]. 1. Convexity ∂h(x) is always a closed convex set, even if h(x) is not convex. 2. Nonempty and boundedness If x ∈ int(dom h) then the subdifferential ∂h(x) is nonempty and bounded. 3. Nonnegative factor If α > 0, then ∂(αh(x)) = α∂h(x). 4. Subdifferential If h is convex and differentiable at the point x then the subdifferential is a singleton ∂h(x) = {∇h(x)}, namely its gradient is its unique subgradient. Conversely, if h is a convex function and ∂h(x) = {g} then h is differentiable at the point x and g = ∇h(x). 1

A proper convex function f (x) takes values in the extended real number line such that f (x) < +∞ for at least one x and f (x) > −∞ for every x.

226

Gradient Analysis and Optimization

5. Minimum point of nondifferentiable function The point x is a minimum point of the convex function h if and only if h is subdifferentiable at x and 0 ∈ ∂h(x ).

(4.6.8)

This condition is known as the first-order optimality condition of the nonsmooth convex function h(x). If h is differentiable, then the first-order optimality condition 0 ∈ ∂h(x) simplifies to ∇h(x) = 0. 6. Subdifferential of a sum of functions If h1 , . . . , hm are convex functions then the subdifferential of h(x) = h1 (x) + · · · + hm (x) is given by ∂h(x) = ∂h1 (x) + · · · + ∂hm (x). 7. Subdifferential of affine transform If φ(x) = h(Ax + b) then the subdifferential ∂φ(x) = AT ∂h(Ax + b). 8. Subdifferential of pointwise maximal function Let h be a pointwise maximal function of the convex functions h1 , . . . , hm , i.e., h(x) = max hi (x); then i=1,...,m

∂h(x) = conv (∪ {∂hi (x)|hi (x) = h(x)}) . That is to say, the subdifferential of a pointwise maximal function h(x) is the convex hull of the union set of subdifferentials of the “active function” hi (x) at the point x. ˜ The subgradient vector g of the function h(x) is denoted as g = ∇h(x). EXAMPLE 4.8 The function h(x) = |x| is nonsmooth at x = 0 as its gradient ∇|x| does not exist there. To find the subgradient vectors of h(x) = |x|, rewrite the function as h(s, x) = |x| = sx from which it follows that the gradient of the function f (s, x) is given by ∂f (s, x)/∂x = s; s = −1 if x < 0 and s = +1 if x > 0. Moreover, if x = 0 then the subgradient vector should satisfy Equation (4.6.6), i.e., |y| ≥ gy or g ∈ [−1, +1]. Hence, the subdifferential of the function |x| is given by ⎧ ⎪ x < 0, ⎪ ⎨{−1}, ∂|x| = {+1}, (4.6.9) x > 0, ⎪ ⎪ ⎩[−1, 1], x = 0. EXAMPLE 4.9

Consider the function h(x) =

n i=1

|aTi x − bi |. Write

I− (x) = {i|ai , x − bi < 0}, I+ (x) = {i|ai , x − bi > 0}, I0 (x) = {i|ai , x − bi = 0}.

4.6 Nonsmooth Convex Optimization

Let h(x) =

n i=1

|aTi x − bi | =

n

T i=1 si (ai x

⎧ ⎪ ⎪ ⎨−1, si = 1, ⎪ ⎪ ⎩[−1, 1],

Hence the subdifferential of h(x) = ∂h(x) =

n  i=1

n i=1



s i ai =

− bi ); then x ∈ I− (x), x ∈ I+ (x), x ∈ I0 (x).

|aTi x − bi | is given by 

ai −

i∈I+ (x)

227



ai +

i∈I− (x)

α i ai ,

(4.6.10)

i∈I0 (x)

where αi = [−1, 1]. EXAMPLE 4.10 The 1 -norm h(x) = x 1 = |x1 | + · · · + |xm | is a special example n of the function h(x) = i=1 |aTi x − bi | with ai = ei and bi = 0, where ei is the basis vector with nonzero entry ei = 1. Hence, by Equation (4.6.10), we have    ei − ei + αi ei , x = 0, ∂h(x) = i∈I+ (x)

i∈I− (x)

i∈I0 (x)

where I− (x) = {i|xi < 0}, I+ (x) = {i|xi > 0} and I0 (x) = {i|xi = 0}. For x = 0, by Definition 4.14 the subgradient should satisfy y 1 ≥ 0 1 + gT (y − 0) with h(0) = 0, i.e., y 1 ≥ gT y, yielding gi = max1≤i≤n |yi | ≤ 1, y = 0. Therefore, the subdifferential of the 1 -norm h(x) = x 1 is given by ⎧   ⎨ ei − ei + αi ei , x = 0, xi 0 ⎩ {y ∈ Rn | max1≤i≤n |yi | ≤ 1}, x = 0, y = 0. EXAMPLE 4.11 x = 0, and thus

The Euclidean norm h(x) = x 2 is differentiable away from

∂ x 2 = {∇ x 2 } = {∇(xT x)1/2 } = {x/ x 2 },

∀ x = 0.

At the point x = 0, by Equation (4.6.6) we have g ∈ ∂h(0)



y 2 ≥ 0 + gT (y − 0)



g, y ≤ 1,

y 2

Hence, the subdifferential of the function x 2 is given by + x = 0, {x/ x 2 }, ∂ x 2 = n {y ∈ R | y 2 ≤ 1}, x = 0, y = 0.

∀ y = 0.

(4.6.12)

Let X ∈ Rm×n be any matrix with singular value decomposition X = UΣVT . Then the subdifferential of the nuclear norm of the matrix X (i.e., the sum of all singular values) is given by [79], [292], [503]    ∂ X ∗ = UVT +WW ∈ Rm×n , UT W = O, WV = O, W spec ≤ 1 . (4.6.13)

228

Gradient Analysis and Optimization

4.6.2 Proximal Operator Let Ci = dom fi (x), i = 1, . . . , P be a closed convex set of the m-dimensional 5P Euclidean space Rm and C = i=1 Ci be the intersection of these closed convex sets. Consider the combinatorial optimization problem min x∈C

P 

fi (x),

(4.6.14)

i=1

where the closed convex sets Ci , i = 1, . . . , P express the constraints imposed on the combinatorial optimization solution x. The types of intersection C can be divided into the following three cases [73]: (1) The intersection C is nonempty and “small” (all members of C are quite similar). (2) The intersection C is nonempty and “large” (the differences between the members of C are large). (3) The intersection C is empty, which means that the imposed constraints of intersecting sets are mutually contradictory. It is difficult to solve the combinatorial optimization problem (4.6.14) directly. But, if + 0, x ∈ Ci , f1 (x) = x − x0 , fi (x) = ICi (x) = +∞, x ∈ / Ci , then the original combinatorial optimization problem can be divided into separate problems x∈

min  P

i=2

Ci

x − x0 .

(4.6.15)

Differently from the combinatorial optimization problem (4.6.14), the separated optimization problems (4.6.15) can be solved using the projection method. In particular, when the Ci are convex sets, the projection of a convex objective function onto the intersection of these convex sets is closely related to the proximal operator of the objective function. DEFINITION 4.15 [331]

or

The proximal operator of a convex function h(x) is defined as " # 1 2 proxh (u) = arg min h(x) + x − u 2 2 x

(4.6.16)

" # 1 2 proxμh (u) = arg min h(x) +

x − u 2 2μ x

(4.6.17)

with scale parameter μ > 0.

4.6 Nonsmooth Convex Optimization

229

In particular, for a convex function h(X), its proximal operator is defined as " # 1 2 (4.6.18) proxμh (U) = arg min h(X) +

X − U F . 2μ X The proximal operator has the following important properties [488]: 1. Existence and uniqueness The proximal operator proxh (u) always exists, and is unique for all x. 2. Subgradient characterization There is the following correspondence between the proximal mapping proxh (u) and the subgradient ∂h(x): x = proxh (u)



x − u ∈ ∂h(x).

(4.6.19)

3. Nonexpansive mapping The proximal operator proxh (u) is a nonexpansive ˆ = proxh (ˆ mapping with constant 1: if x = proxh (u) and x u) then ˆ ) ≥ x − x ˆ 22 . ˆ )T (u − u (x − x 4. Proximal operator of a separable sum function If h : Rn1 × Rn2 → R is a separable sum function, i.e., h(x1 , x2 ) = h1 (x1 ) + h2 (x2 ), then proxh (x1 , x2 ) = (proxh1 (x1 ), proxh2 (x2 )). 5. Scaling and translation of argument proxh (x) =

If h(x) = f (αx + b), where α = 0, then

 1 proxα2 f (αx + b) − b . α

6. Proximal operator of the conjugate function If h∗ (x) is the conjugate function of the function h(x) then, for all μ > 0, the proximal operator of the conjugate function is given by proxμh∗ (x) = x − μproxh/μ (x/μ). If μ = 1 then the above equation simplifies to x = proxh (x) + proxh∗ (x).

(4.6.20)

This decomposition is called the Moreau decomposition. An operator closely related to the proximal operator is the soft thresholding operator of a real variable. DEFINITION 4.16 The action of the soft thresholding operator on a real variable x ∈ R is denoted Sτ [x] or soft(x, τ ) and is defined as ⎧ ⎪ ⎪ ⎨x − τ, x > τ, soft(x, τ ) = Sτ [x] = 0, (4.6.21) |x| ≤ τ, ⎪ ⎪ ⎩x + τ, x < −τ.

230

Gradient Analysis and Optimization

Here τ > 0 is called the soft threshold value of the real variable x. The action of the soft thresholding operator can be equivalently written as soft(x, τ ) = (x − τ )+ − (−x − τ )+ = max{x − τ, 0} − max{−x − τ, 0} = (x − τ )+ + (x + τ )− = max{x − τ, 0} + min{x + τ, 0}. This operator is also known as the shrinkage operator, because it can reduce the variable x, the elements of the vector x and the matrix X to zero, thereby shrinking the range of elements. Hence, the action of the soft thresholding operator is sometimes written as [29], [52] soft(x, τ ) = (|x| − τ )+ sign(x) = (1 − τ /|x|)+ x.

(4.6.22)

The soft thresholding operation on a real vector x ∈ Rn , denoted soft(x, τ ), is defined as a vector with entries soft(x, τ )i = max{xi − τ, 0} + min{xi + τ, 0} + x − τ, x > τ, i

=

i

0, xi + τ,

|xi | ≤ τ, xi < −τ.

(4.6.23)

The soft thresholding operation on a real matrix X ∈ Rm×n , denoted soft(X), is defined as an m × n real matrix with entries soft(X, τ )ij = max{xij − τ, 0} + min{xij + τ, 0} + x − τ, x > τ, ij

=

ij

0, xij + τ,

|xij | ≤ τ, xij < −τ.

(4.6.24)

Given a function h(x), our goal is to find an explicit expression for " # 1 2

x − u 2 . proxμh (u) = arg min h(x) + 2μ x THEOREM 4.11 [344, p. 129] Let x denote an optimal solution of the minimization problem min φ(x). If the function φ(x) is subdifferentiable then x = x∈dom

arg min φ(x) or φ(x ) = min φ(x) if and only if 0 ∈ ∂φ(x ). x∈domφ

x∈domφ

By the above theorem, the first-order optimality condition of the function φ(x) = h(x) +

1

x − u 22 2μ

(4.6.25)

0 ∈ ∂h(x ) +

1 (x − u), μ

(4.6.26)

is given by

4.6 Nonsmooth Convex Optimization

because ∂ we have

231

1 1

x − u 22 = (x − u). Hence, if and only if 0 ∈ ∂h(x ) + μ−1 (x − u), 2μ μ

" # 1 2 x = proxμh (u) = arg min h(x) +

x − u 2 . 2μ x

(4.6.27)

From Equation (4.6.26) we have 0 ∈ μ∂h(x ) + (x − u)



u ∈ (I + μ∂h)x



(I + μ∂h)−1 u ∈ x .

Since x is only one point, (I + μ∂h)−1 u ∈ x should read as (I + μ∂h)−1 u = x and thus 0 ∈ μ∂h(x ) + (x − u)



x = (I + μ∂h)−1 u



proxμh (u) = (I + μ∂h)−1 u.

This shows that the proximal operator proxμh and the subdifferential operator ∂h are related as follows: proxμh = (I + μ∂h)−1 .

(4.6.28)

The (point-to-point) mapping (I + μ∂h)−1 is known as the resolvent of the subdifferential operator ∂h with parameter μ > 0, so the proximal operator proxμh is the resolvent of the subdifferential operator ∂h. Notice that the subdifferential ∂h(x) is a point-to-set mapping, for which neither direction is unique, whereas the proximal operation proxμh (u) is a point-to-point mapping: proxμh (u) maps any point u to a unique point x. EXAMPLE 4.12 Consider a linear function h(x) = aT x + b, where a ∈ Rn , b ∈ R. By the first-order optimality condition   ∂ 1 1 aT x + b +

x − u 22 = a + (x − u) = 0, ∂x 2μ μ we get proxμh (u) = x = u − μa. EXAMPLE 4.13 For the quadratic function h(x) = 12 xT Ax − bT x + c, where A ∈ Rn×n is positive definite. The first-order optimality condition of proxμh (u) is   1 T ∂ 1 1 x Ax − bT x + c +

x − u 22 = Ax − b + (x − u) = 0, ∂x 2 2μ μ with x = x = proxμh (u). Hence, we have proxμh (u) = (A + μ−1 I)−1 (μ−1 u + b) = u + (A + μ−1 I)−1 (b − Au). EXAMPLE 4.14

If h is the indicator function " 0, h(x) = IC (x) = +∞,

x ∈ C, x∈ / C,

232

Gradient Analysis and Optimization

where C is a closed nonempty convex set, then the proximal operator of h(x) reduces to Euclidean projection onto C, i.e., PC (u) = arg minx∈C x − u 2 . Table 4.2 lists the proximal operators of several typical functions [109], [373].

Table 4.2 Proximal operators of several typical functions Functions

Proximal operators

h(x) = φ(x − z)

proxμh (u) = z + proxμφ (u − z)

h(x) = φ(x/ρ)

proxh (u) = ρproxφ/ρ2 (u/ρ)

h(x) = φ(−x)

proxμh (u) = −proxμφ (−u)



proxμh (u) = u − proxμφ (u)

h(x) = φ (x)

proxμh (u) = PC (u) = arg minx∈C x − u2

h(x) = IC (x)

proxμh (u) = u − μPC (u/μ)  ui − μ, ui > μ, |ui | ≤ μ, (proxμh (u))i = 0, ui + μ, ui < −μ.  (1 − μ/u2 )u, u2 ≥ μ, proxμh (u) = 0, u2 < μ.

T

h(x) = supy∈C y x h(x) = x1 h(x) = x2

proxμh (u) = u − μ a

h(x) = aT x + b h(x) = h(x) =

proxμh (x) = u + (A + μ−1 I)−1 (b − Au)    (proxμh (u))i = 12 ui + u2i + 4μ

1 T x Ax + bT x 2  − n i=1 log xi

In the following, we present the proximal gradient method for solving nonsmooth convex optimization problems.

4.6.3 Proximal Gradient Method Consider a typical form of nonsmooth convex optimization, min J(x) = f (x) + h(x), x

(4.6.29)

where f (x) is a convex, smooth (i.e., differentiable) and L-Lipschitz function, and h(x) is a convex but nonsmooth function (such as x 1 , X ∗ and so on). 1. Quadratic Approximation Consider the quadratic approximation of an L-Lipschitz smooth function f (x)

4.6 Nonsmooth Convex Optimization

233

around the point xk : 1 f (x) = f (xk ) + (x − xk )T ∇f (xk ) + (x − xk )T ∇2 f (xk )(x − xk ) 2 L T ≈ f (xk ) + (x − xk ) ∇f (xk ) + x − xk 22 , 2 where ∇2 f (xk ) is approximated by a diagonal matrix LI. Minimize J(x) = f (x) + h(x) via iteration to yield xk+1 = arg min {f (x) + h(x)} x " # L ≈ arg min f (xk ) + (x − xk )T ∇f (xk ) + x − xk 22 + h(x) 2 x + 2  22 0 2 L2 1 2 = arg min h(x) + 2x − xk − ∇f (xk ) 2 2 2 L x 2   1 = proxL−1 h xk − ∇f (xk ) . L In practical applications, the Lipschitz constant L of f (x) is usually unknown. Hence, a question of how to choose L in order to accelerate the convergence of xk arises. To this end, let μ = 1/L and consider the fixed point iteration xk+1 = proxμh (xk − μ∇f (xk )) .

(4.6.30)

This iteration is called the proximal gradient method for solving the nonsmooth convex optimization problem, where the step μ is chosen to equal some constant or is determined by linear searching. 2. Forward–Backward Splitting To derive a proximal gradient algorithm for nonsmooth convex optimization, let A = ∂h and B = ∇f denote the subdifferential operator and the gradient operator, respectively. Then, the first-order optimality condition for the objective function h(x) + f (x), 0 ∈ (∂h + ∇f )x , can be written in operator form as 0 ∈ (A + B)x , and thus we have 0 ∈ (A + B)x



(I − μB)x ∈ (I + μA)x



(I + μA)−1 (I − μB)x = x .

(4.6.31)

Here (I − μB) is a forward operator and (I + μA)−1 is a backward operator. The backward operator (I + μ∂h)−1 is sometimes known as the resolvent of ∂h with parameter μ. By proxμh = (I + μA)−1 and (I − μB)xk = xk − μ∇f (xk ), Equation (4.6.31) can be written as the fixed-point iteration xk+1 = (I + μA)−1 (I − μB)xk = proxμh (xk − μ∇f (xk )) .

(4.6.32)

234

Gradient Analysis and Optimization

If we let yk = (I − μB)xk = xk − μ∇f (xk )

(4.6.33)

then Equation (4.6.32) reduces to xk+1 = (I + μA)−1 yk = proxμh (yk ).

(4.6.34)

In other words, the proximal gradient algorithm can be split two iterations: (1) forward iteration yk = (I − μB)xk = xk − μ∇f (xk ); (2) backward iteration xk+1 = (I + μA)−1 yk = proxμh (yk ). The forward iteration is an explicit iteration which is easily computed, whereas the backward iteration is an implicit iteration. EXAMPLE 4.15

For the 1 -norm function h(x) = x 1 , the backward iteration is xk+1 = proxμ x 1 (yk ) = soft · 1 (yk , μ) = [soft · 1 (yk (1), μ), . . . , soft · 1 (yk (n), μ)]T

(4.6.35)

can be obtained by the soft thresholding operation [163]: xk+1 (i) = (soft · 1 (yk , μ))i = sign(yk (i)) max{|yk (i)| − μ, 0},

(4.6.36)

for i = 1, . . . , n. Here xk+1 (i) and yk (i) are the ith entries of the vectors xk+1 and yk , respectively. EXAMPLE 4.16

For the 2 -norm function h(x) = x 2 , the backward iteration is xk+1 = proxμ x 2 (yk ) = soft · 2 (yk , μ) = [soft · 2 (yk (1), μ), . . . , soft · 2 (yk (n), μ)]T ,

(4.6.37)

where soft · 2 (yk (i), μ) = max{|yk (i)| − μ, 0}

yk (i) ,

yk 2

i = 1, . . . , n.

(4.6.38)

min{m,n} EXAMPLE 4.17 For the nuclear norm of matrix X ∗ = i=1 σi (X), the corresponding proximal gradient method becomes   Xk = proxμ · ∗ Xk−1 − μ∇f (Xk−1 ) . (4.6.39) If W = UΣVT is the SVD of the matrix W = Xk−1 − μ∇f (Xk−1 ) then proxμ · ∗ (W) = UDμ (Σ)VT ,

(4.6.40)

where Dμ (Σ) is called the singular value thresholding operation, defined as " σi (X) − μ, if σi (X) > μ, [Dμ (Σ)]i = (4.6.41) 0, otherwise.

4.6 Nonsmooth Convex Optimization

235

EXAMPLE 4.18 If the nonsmooth function h(x) = IC (x) is an indicator function, then the proximal gradient iteration xk+1 = proxμh (xk − μ∇f (xk )) reduces to the gradient projection iteration xk+1 = PC (xk − μ∇f (xk )) .

(4.6.42)

A comparison between other gradient methods and the proximal gradient method is summarized below. • The update formulas are as follows: general gradient method, Newton method,

xk+1 = xk − μ∇f (xk );

xk+1 = xk − μH−1 (xk )∇f (xk );

proximal gradient method,

xk+1 = proxμh (xk − μ∇f (xk )) .

General gradient methods and the Newton method use a low-level (explicit) update, whereas the proximal gradient method uses a high-level (implicit) operation proxμh . • General gradient methods and the Newton method are available only for smooth and unconstrained optimization problems, while the proximal gradient method is available for smooth or nonsmooth and/or constrained or unconstrained optimization problems. • The Newton method allows modest-sized problems to be addressed; the general gradient methods are applicable for large-size problems, and, sometimes, distributed implementations, and the proximal gradient method is available for all large-size problems and distributed implementations. In particular, the proximal gradient method includes the general gradient method, the projected gradient method and the iterative soft thresholding method as special cases. (1) Gradient method If h(x) = 0, as proxh (x) = x the proximal gradient algorithm (4.6.30) reduces to the general gradient algorithm xk = xk−1 − μk ∇f (xk−1 ). That is to say, the gradient algorithm is a special case of the proximal gradient algorithm when the nonsmooth convex function h(x) = 0. (2) Projected gradient method For h(x) = IC (x), an indictor function, the minimization problem (4.6.4) becomes the unconstrained minimization min f (x). x∈C

Because proxh (x) = PC (x), the proximal gradient algorithm reduces to   (4.6.43) xk = PC xk−1 − μk ∇f (xk−1 ) 2 22 = arg min 2u − xk−1 + μk ∇f (xk−1 )22 . (4.6.44) u∈C

This is just the projected gradient method.

236

Gradient Analysis and Optimization

(3) Iterative soft thresholding method When h(x) = x 1 , the minimization problem (4.6.4) becomes the unconstrained minimization problem min (f (x) + x 1 ). In this case, the proximal gradient algorithm becomes   (4.6.45) xk = proxμk h xk−1 − μk ∇f (xk−1 ) . This is called the iterative soft thresholding ⎧ ⎪ ⎨ ui − μ, proxμh (u)i = 0, ⎪ ⎩ ui + μ,

method, for which ui > μ, − μ ≤ ui ≤ μ, ui < −μ.

From the viewpoint of convergence, the proximal gradient method is suboptimal. A Nesterov-like proximal gradient algorithm was developed by Beck and Teboulle [29], who called it the fast iterative soft thresholding algorithm (FISTA), as shown in Algorithm 4.9. Algorithm 4.9

FISTA algorithm with fixed step [29]

input: The Lipschitz constant L of ∇f (x). Initialize: y1 = x0 ∈ Rn , t1 = 1. repeat

  1 1. Compute xk = proxL−1 h yk − ∇f (yk ) . L  2 1 + 1 + 4tk . 2. Compute tk+1 = 2   t −1 (xk − xk−1 ). 3. Compute yk+1 = xk + k tk+1 4. exit if xk is converged.

return k ← k + 1. output: x ← xk .

THEOREM 4.12 [29] Let {xk } and {yk } be two sequences generated by the FISTA algorithm; then, for any iteration k ≥ 1, F (xk ) − F (x ) ≤

2L xk − x 22 , (k + 1)2

∀ x ∈ X ,

where x and X represent respectively the optimal solution and the optimal solution set of min (F (x) = f (x) + h(x)). Theorem 4.12 shows that to achieve the -optimal solution F (¯ x)  − F (x ) ≤ , the √ FISTA algorithm needs at most (C/ − 1) iterations, where C = 2L x0 − x 22 . Since it has the same fast convergence rate as the optimal first-order algorithm, the FISTA is indeed an optimal algorithm.

4.7 Constrained Convex Optimization

237

4.7 Constrained Convex Optimization Consider the constrained minimization problem min f0 (x) x

subject to

fi (x) ≤ 0, 1 ≤ i ≤ m, hi (x) = 0, 1 ≤ i ≤ q.

(4.7.1)

If the objective function f0 (x) and the inequality constraint functions fi (x), i = 1, . . . , m are convex, and the equality constraint functions hi (x) have the affine form h(x) = Ax − b, then Equation (4.7.1) is called a constrained convex optimization problem. The basic idea in solving a constrained optimization problem is to transform it into an unconstrained optimization problem. The transformation methods are of three types: (1) the Lagrange multiplier method; (2) the penalty function method; (3) the augmented Lagrange multiplier method. 4.7.1 Lagrange Multiplier Method Consider a simpler form of the constrained optimization problem (4.7.1) min f (x) subject to

Ax = b,

(4.7.2)

where x ∈ Rn , A ∈ Rm×n and the objective function f : Rn → R is convex. The Lagrange multiplier method transforms the constrained optimization problem (4.7.2) into an unconstrained optimization problem with objective function L(x, λ) = f (x) + λT (Ax − b),

(4.7.3)

where λ ≥ 0 is known as the dual variable or the Lagrange multiplier vector, all of whose entries must be larger than or equal to zero. The dual objective function of the original optimization problem (4.7.2) is given by g(λ) = inf L(x, λ) = −f ∗ (−AT λ) − bT λ, x

(4.7.4)

in which f ∗ is the convex conjugate of f . By the Lagrange multiplier method, the original equality constrained minimization problem (4.7.2) becomes the dual maximization problem   maxm g(λ) = −f ∗ (−AT λ) − bT λ . (4.7.5) λ∈R

The dual ascent method uses gradient ascent to solve the maximization problem (4.7.5). The dual ascent method consists of two steps: xk+1 = arg min L(x, λk ),

(4.7.6)

λk+1 = λk + μk (Axk+1 − b),

(4.7.7)

x

238

Gradient Analysis and Optimization

where (4.7.6) is the minimization step of the original variable x whereas (4.7.7) is the updating step of the dual variable λ, with step size μk . Because the dual variable λ ≥ 0 can be viewed as a “price” vector, the updating of the dual variable is also called the price-ascent or price-adjustment step. The object of price-ascent is to maximize the revenue function λk . The dual ascent method contains two aspects. (1) The update of the dual variable λ adopts the gradient ascent method. (2) The choice of step size μk should ensure the ascent of the dual objective function, namely g(λk+1 ) > g(λk ).

4.7.2 Penalty Function Method The penalty function method is a widely used constrained optimization method, and its basic idea is this: by using the penalty function and/or the barrier function, a constrained optimization becomes an unconstrained optimization of the composite function consisting of the original objective function and the constraint conditions. For the standard constrained optimization problem (4.7.1), the feasible set is defined as the set of points satisfying all the inequality and equality constraints, namely    F = xfi (x) ≤ 0, i = 1, . . . , m, hi (x) = 0, i = 1, . . . , q . (4.7.8) The set of points satisfying only the strict inequality constraints fi (x) < 0 and the equality constraints hi (x) = 0 is denoted by    relint(F) = xfi (x) < 0, i = 1, . . . , m, hi (x) = 0, i = 1, . . . , q (4.7.9) and is called the relative feasible interior set or the relative strictly feasible set. The points in the feasible interior set are called relative interior points. The penalty function method transforms the original constrained optimization problem into the unconstrained optimization problem   (4.7.10) min Lρ (x) = f0 (x) + ρp(x) , x∈S

where the coefficient ρ is a penalty parameter that reflects the intensity of “punishment” via the weighting of the penalty function p(x); the greater ρ is, the greater the value of the penalty term. The transformed optimization problem (4.7.10) is often called the auxiliary optimization problem of the original constrained optimization problem. The main property of the penalty function is as follows: if p1 (x) is the penalty for the closed set F1 , and p2 (x) is the penalty for the closed set F2 , then p1 (x) + p2 (x) is the penalty for the intersection F1 ∩ F2 . The following are two common penalty functions.

4.7 Constrained Convex Optimization

239

(1) Exterior penalty function p(x) = ρ1

m  

q  r |hi (x)|2 , max{0, fi (x)} + ρ2

i=1

(4.7.11)

i=1

where r is usually 1 or 2. (2) Interior penalty function p(x) = ρ1

m  i=1



q  1 |hi (x)|2 . log(−fi (x)) + ρ2 fi (x) i=1

(4.7.12)

In the exterior penalty function, if fi (x) ≤ 0, ∀ i = 1, . . . , m and hi (x) = 0, ∀ i = 1, . . . , q, then p(x) = 0, i.e., the penalty function has no effect on the interior points in the feasible set int(F). On the contrary, if, for some iteration point xk , the inequality constraint fi (xk ) > 0, i ∈ {1, . . . , m}, and/or h(xk ) = 0, i ∈ {1, . . . , q}, then the penalty term p(xk ) = 0, that is, any point outside the feasible set F is “punished”. Hence, the penalty function defined in (4.7.11) is known as an exterior penalty function. The role of (4.7.12) is equivalent to establishing a fence on the feasible set boundary bnd(F) to block any point in the feasible interior set int(F) from crossing the boundary of the feasible set bnd(F). Because the points in the relative feasible interior set relint(F) are slightly punished, the penalty function given in (4.7.12) is called the interior penalty function; it is also known as the barrier function. The following are three typical barrier functions for a closed set F [344]: (1) the power-function barrier function φ(x) =

m i=1



1 , p ≥ 1; (fi (x))p

m 1 log(−fi (x)); fi (x) i=1   m 1 exp − . (3) the exponential barrier function φ(x) = fi (x) i=1

(2) the logarithmic barrier function φ(x) = −

In other words, the logarithmic barrier function in (4.7.12) can be replaced by the power-function barrier function or the exponential barrier function. m  1 When p = 1, the power-function barrier function φ(x) = − is called the i=1

fi (x)

inverse barrier function and was presented by Carroll in 1961 [86]; whereas φ(x) = μ

m  i=1

1 log(−fi (x))

(4.7.13)

is known as the classical Fiacco–McCormick logarithmic barrier function [152], where μ is the barrier parameter. A summary of the features of the external penalty function follows.

240

Gradient Analysis and Optimization

(1) The exterior penalty function method is usually known as the penalty function method, and the interior penalty function method is customarily called the barrier method. (2) The exterior penalty function method punishes all points outside of the feasible set, and its solution satisfies all the inequality constraints fi (x) ≤ 0, i = 1, . . . , m, and all the equality constraints hi (x) = 0, i = 1, . . . , q, and is thus an exact solution to the constrained optimization problem. In other words, the exterior penalty function method is an optimal design scheme. In contrast, the interior penalty function method or barrier method blocks all points on the boundary of the feasible set, and the solution that is found satisfies only the strict inequality fi (x) < 0, i = 1, . . . , m, and the equality constraints hi (x) = 0, i = 1, . . . , q, and hence is only approximate. That is to say, the interior penalty function method is a suboptimal design scheme. (3) The exterior penalty function method can be started using an unfeasible point, and its convergence is slow, whereas the interior penalty function method requires the initial point to be a feasible interior point, so its selection is difficult; but it has a good convergence and approximation performance. In evolutionary computations the exterior penalty function method is normally used, while the interior function method is NP-hard owing to the feasible initial point search. Engineering designers, especially process controllers, prefer to use the interior penalty function method, because this method allows the designers to observe the changes in the objective function value corresponding to the design points in the feasible set in the optimization process. However, this facility cannot be provided by any exterior penalty function method. In the strict sense of the penalty function classification, all the above kinds of penalty function belong to the “death penalty”: the infeasible solution points x ∈ S \ F (the difference set of the search space S and the feasible set F ) are completely ruled out by the penalty function p(x) = +∞. If the feasible search space is convex, or it is the rational part of the whole search space, then this death penalty works very well [325]. However, for genetic algorithms and evolutionary computations, the boundary between the feasible set and the infeasible set is unknown and thus it is difficult to determine the precise position of the feasible set. In these cases, other penalty functions should be used [529]: static, dynamic, annealing, adaptive or co-evolutionary penalties.

4.7.3 Augmented Lagrange Multiplier Method According to [278], [41], the main deficiencies of the Lagrange multiplier method are as follows.

4.7 Constrained Convex Optimization

241

(1) Only when a constrained optimization problem has a locally convex structure, is the dual unconstrained optimization problem well-defined, so that the updating of the Lagrange multiplier λk+1 = λk + αk h(xk ) can be implemented. (2) The convergence of the Lagrangian objective function is more time-consuming, as the updating of the Lagrange multiplier is an ascent iteration and its convergence is only moderately fast. According to [41], the deficiency in the penalty function method is that its convergence is slow, and a large penalty parameter easily results in an ill-conditioned unconstrained optimization problem and thus causes numerical instability of the optimization algorithm. A simple and efficient way to overcome the disadvantages of the Lagrange multiplier method and the penalty function method is to combine them into an augmented Lagrange multiplier method. Consider the augmented Lagrangian function L(x, λ, ν) = f0 (x) +

m 

λi fi (x) +

i=1 T

q 

νi hi (x)

i=1

= f0 (x) + λ f (x) + ν T h(x),

(4.7.14)

where λ = [λ1 , . . . , λm ]T and ν = [ν1 , . . . , νq ]T are the Lagrange multiplier vector and the penalty parameter vector, respectively, whereas f (x) = [f1 (x), . . . , fm (x)]T and h(x) = [h1 (x), . . . , hq (x)]T are the inequality constraint vector and the equality constraint vector, respectively. If on the one hand ν = 0 then the augmented Lagrangian function simplifies to the Lagrangian function: L(x, λ, 0) = L(x, λ) = f0 (x) +

m 

λi fi (x).

i=1

This implies that the augmented Lagrange multiplier method reduces to the Lagrange multiplier method for an inequality constrained-optimization problem. On the other hand, if λ = 0 and ν = ρh(x) with ρ > 0 then the augmented Lagrangian function reduces to the penalty function: L(x, 0, ν) = L(x, ν) = f0 (x) + ρ

q 

|hi (x)|2 .

i=1

This implies that the augmented Lagrange multiplier method includes the penalty function method for a equality constrained optimization problem as a special case. The above two facts show that the augmented Lagrange multiplier method combines the Lagrange multiplier method and the penalty function method.

242

Gradient Analysis and Optimization

4.7.4 Lagrangian Dual Method As in the Lagrange multiplier method, the Lagrange multiplier vector λ of the augmented Lagrange multiplier method has the nonnegative constraint λ ≥ 0. Under this constraint, let us consider how to solve the unconstrained minimization problem + 0 q m   λi fi (x) + νi hi (x) . (4.7.15) min L(x, λ, ν) = min f0 (x) + λ≥0, ν

i=1

i=1

The vector x is known as the optimization variable or decision variable or primal variable, while λ is the dual variable. The original constrained optimization problem (4.7.1) is simply called the original problem, and the unconstrained optimization problem (4.7.15) is simply known as the dual problem. Owing to the nonnegativity of the Lagrange multiplier vector λ, the augmented Lagrangian function L(x, λ, ν) may tend to negative infinity when some λi equals a very large positive number. Therefore, we first need to maximize the original augmented Lagrangian function, to get + 0 q m   J1 (x) = max f0 (x) + λi fi (x) + νi hi (x) . (4.7.16) λ≥0, ν

i=1

i=1

The problem with using the unconstrained maximization (4.7.16) is that it is impossible to avoid violation of the constraint fi (x) > 0. This may result in J1 (x) equalling positive infinity; namely, " f0 (x), if x meets all original constraints, J1 (x) = (4.7.17) (f0 (x), +∞), otherwise. From the above equation it follows that in order to minimize f0 (x) subject to all inequality and equality constraints, we should minimize J1 (x) to get the primal cost function JP (x) = min J1 (x) = min max L(x, λ, ν). x

x

λ≥0, ν

(4.7.18)

This is a minimax problem, whose solution is the supremum of the Lagrangian function L(x, λ, ν), namely   q m   λi fi (x) + νi hi (x) . (4.7.19) JP (x) = sup f0 (x) + i=1

i=1

From (4.7.19) and (4.7.17) it follows that the optimal value of the original constrained minimization problem is given by p = JP (x ) = min f0 (x) = f0 (x ), x

which is simply known as the optimal primal value.

(4.7.20)

4.7 Constrained Convex Optimization

243

However, the minimization of a nonconvex objective function cannot be converted into the minimization of another convex function. Hence, if f0 (x) is a convex function then, even if we designed an optimization algorithm that can find a local ˜ of the original cost function, there is no guarantee that x ˜ is a global extremum x extremum point. Fortunately, the minimization of a convex function f (x) and the maximization of the concave function −f (x) are equivalent. On the basis of this dual relation, it is easy to obtain a dual method for solving the optimization problem of a nonconvex objective function: convert the minimization of a nonconvex objective function into the maximization of a concave objective function. For this purpose, construct another objective function from the Lagrangian function L(x, λ, ν): J2 (λ, ν) = min L(x, λ, ν) x + = min f0 (x) + x

m  i=1

λi fi (x) +

q 

0 νi hi (x) .

(4.7.21)

i=1

From the above equation it is known that + if x meets all the original constraints, min f0 (x), x min L(x, λ, ν) = x (−∞, min f0 (x)), otherwise. x

(4.7.22) Its maximization function JD (λ, ν) = max J2 (λ, ν) = max min L(x, λ, ν) λ≥0, ν

λ≥0, ν

x

(4.7.23)

is called the dual objective function for the original problem. This is a maximin problem for the Lagrangian function L(x, λ, ν). Since the maximin of the Lagrangian function L(x, λ, ν) is its infimum, we have   q m   λi fi (x) + νi hi (x) . (4.7.24) JD (λ, ν) = inf f0 (x) + i=1

i=1

The dual objective function defined by Equation (4.7.24) has the following characteristics. (1) The function JD (λ, ν) is an infimum of the augmented Lagrangian function L(x, λ, ν). (2) The function JD (λ, ν) is a maximizing objective function, and thus it is a value or utility function rather than a cost function. (3) The function JD (λ, ν) is lower unbounded: its lower bound is −∞. Hence, JD (λ, ν) is a concave function of the variable x even if f0 (x) is not a convex function.

244

Gradient Analysis and Optimization

THEOREM 4.13 [353, p. 16] Any local minimum point x of an unconstrained convex optimization function f (x) is a global minimum point. If the convex function f (x) is differentiable then the stationary point x such that ∂f (x)/∂x = 0 is a global minimum point of f (x). Theorem 4.13 shows that any extreme point of the concave function is a global extreme point. Therefore, the algorithm design of the standard constrained minimization problem (4.7.1) becomes the design of an unconstrained maximization algorithm for the dual objective function JD (λ, ν). Such a method is called the Lagrangian dual method.

4.7.5 Karush–Kuhn–Tucker Conditions The optimal value of the dual objective function is simply called the optimal dual value, denoted d = JD (λ , ν ).

(4.7.25)

From (4.7.22) and (4.7.23) it is immediately known that d ≤ min f0 (x) = p . x

(4.7.26)

The difference between the optimal primal value p and the optimal dual value d , denoted p − d , is known as the duality gap between the original minimization problem and the dual maximization problem. Equation (4.7.26) gives the relationship between the maximin and the minimax of the augmented Lagrangian function L(x, λ, ν). In fact, for any nonnegative realvalued function f (x, y), there is the following inequality relation between the maximin and minimax: max min f (x, y) ≤ min max f (x, y). x

y

y

x

(4.7.27)

If d ≤ p then the Lagrangian dual method is said to have weak duality, while when d = p , we say that the Lagrangian dual method satisfies strong duality. Given an allowed dual gap , the points x and (λ, ν) satisfying p − f0 (x) ≤

(4.7.28)

are respectively called the -suboptimal original point and the -suboptimal dual points of the dual-concave-function maximization problem. Let x and (λ , ν ) represent any original optimal point and dual optimal points with zero dual gap = 0. Since x among all original feasible points x minimizes the augmented Lagrangian objective function L(x, λ , ν ), the gradient vector of L(x, λ , ν ) at the point x must be equal to the zero vector, namely ∇f0 (x ) +

m  i=1

λ i ∇fi (x ) +

q  i=1

νi ∇hi (x ) = 0.

4.7 Constrained Convex Optimization

245

Therefore, the Karush–Kuhn–Tucker (KKT) conditions (i.e., the first-order necessary conditions) of the Lagrangian dual unconstrained optimization problem are given by [353] ⎫ fi (x ) ≤ 0, i = 1, . . . , m (original inequality constraints), ⎪ ⎪ ⎪ ⎪ ⎪ hi (x ) = 0, i = 1, . . . , q (original equality constraints), ⎪ ⎪ ⎪ ⎪ ⎬ λi ≥ 0, i = 1, . . . , m (nonnegativity), (4.7.29) ⎪ λ i fi (x ) = 0, i = 1, . . . , m (complementary slackness), ⎪ ⎪ ⎪ ⎪ q m ⎪   ⎪ ⎪ ∇f0 (x )+ λi ∇fi (x ) + νi ∇hi (x ) = 0 (stationary point).⎪ ⎭ i=1

i=1

A point x satisfying the KKT conditions is called a KKT point. Remark 1 The first KKT condition and the second KKT condition are the original inequality and equality constraint conditions, respectively. Remark 2 The third KKT condition is the nonnegative condition of the Lagrange multiplier λi , which is a key constraint of the Lagrangian dual method. Remark 3 The fourth KKT condition (complementary slackness) is also called dual complementary and is another key constraint of the Lagrangian dual method. This condition implies that, for a violated constraint fi (x) > 0, the corresponding Lagrange multiplier λi must equal zero, and thus we can avoid completely any violated constraint. Thus the role of this condition is to establish a barrier fi (x) = 0, i = 1, . . . , m on the boundary of inequality constraints, that prevents the occurrence of constraint violation fi (x) > 0. Remark 4 The fifth KKT condition is the condition for there to be a stationary point at minx L(x, λ, ν). Remark 5 If the inequality constraint fi (x) ≤ 0, i = 1, . . . , m, in the constrained optimization (4.7.1) becomes ci (x) ≥ 0, i = 1, . . . , m, then the Lagrangian function should be modified to L(x, λ, ν) = f0 (x) −

m  i=1

λi ci (x) +

q 

νi hi (x),

i=1

and all inequality constrained functions fi (x) in the KKT condition formula (4.7.29) should be replaced by −ci (x). In the following, we discuss the necessary modifications of the KKT conditions under some assumptions. DEFINITION 4.17 For inequality constraints fi (x) ≤ 0, i = 1, . . . , m, if fi (¯ x) = 0 ¯ then the ith constraint is said to be an active constraint at the at the point x ¯ . If fi (¯ point x x) < 0 then the ith constraint is called an inactive constraint at ¯ . If fi (¯ the point x x) > 0 then the ith constraint is known as a violated constraint

246

Gradient Analysis and Optimization

¯ . The index set of all active constraints at the point x ¯ is denoted at the point x ¯. A(¯ x) = {i|fi (¯ x) = 0} and is referred to as the active set of the point x Let m inequality constraints fi (x), i = 1, . . . , m, have k active constraints fA1 (x ), . . . , fAk (x ) and m − k inactive constraints at some KKT point x . In order to satisfy the complementarity in the KKT conditions λi fi (x ) = 0, the Lagrange multipliers λ i corresponding to the inactive constraints fi (x ) < 0 must be equal to zero. This implies that the last KKT condition in (4.7.29) becomes ∇f0 (x ) +



λ i ∇fi (x ) +







∂f0 (x ) ⎢ ∂x1 ⎥

⎢ ⎢ ⎢ ⎣ ∂f

∂h1 (x ) ⎢ ∂x1

⎥ ⎢ ⎥+⎢ ⎥ ⎢ ⎣ ∂h  ⎦ (x ) .. .

0

∂xn



νi ∇hi (x ) = 0

i=1

i∈A

or

q 

.. .

1 (x ∂xn



∂fA1 (x )

⎢ ∂x1 ⎢ .. = −⎢ . ⎢ ⎣ ∂f (x ) A1

∂xn



)

··· ..

.

···



∂hq (x ) ⎡ ⎤ ∂x1 ⎥ ν1

··· ..

⎥⎢ . ⎥ ⎥⎣ . ⎦ ⎥ .  ⎦ ∂hq (x ) νq .. .

.

···

∂xn ⎤ ∂fAk (x ) ⎡ ⎤ ∂x1 ⎥ λA1 

⎥⎢ . ⎥ ⎥⎣ . ⎦ . ⎥ ∂fAk (x ) ⎦ λ Ak .. .

∂xn

namely ∇f0 (x ) + (Jh (x ))T ν = −(JA (x ))T λ A ,

(4.7.30)

where Jh (x ) is the Jacobian matrix of the equality constraint function hi (x) = 0, i = 1, . . . , q at the point x and ⎡ ⎤ ∂fA1 (x ) ∂fA1 (x ) ···   ⎢ ∂x1 ∂xn ⎥ ⎢ ⎥ . .. . ⎢ ⎥ ∈ Rk×n , . . JA (x ) = ⎢ (4.7.31) . . . ⎥ ⎣ ∂fAk (x )  ⎦ ∂fAk (x ) ···   ∂x1

λ A

=

[λ A1 , . . . , λ Ak ]

∂xn

∈R , k

(4.7.32)

are the Jacobian matrix and Lagrange multiplier vector respectively of the active constraint function. x) of the active constraint Equation (4.7.30) shows that if the Jacobian matrix JA (¯ ¯ is of full row rank, then the actively constrained function at the feasible point x Lagrange multiplier vector can be uniquely determined:   x)JA (¯ x)T )−1 JA (¯ x) ∇f0 (¯ x) + (Jh (¯ x))T ν . (4.7.33) λ A = −(JA (¯ In optimization-algorithm design we always want strong duality to be set up. A simple method for determining whether strong duality holds is Slater’s theorem.

4.7 Constrained Convex Optimization

247

Define the relative interior of the feasible domain of the original inequality constraint function F as follows: relint(F) = {x|fi (x) < 0, i = 1, . . . , m, hi (x) = 0, i = 1, . . . , p}.

(4.7.34)

¯ ∈ relint(F) is called a A point in the relative interior of the feasible domain x relative interior point. In an optimization process, the constraint restriction that iterative points should be in the interior of the feasible domain is known as the Slater condition. Slater’s theorem says that if the Slater condition is satisfied, and the original inequality optimization problem (4.7.1) is a convex optimization problem, then the optimal value d of the dual unconstrained optimization problem (4.7.23) is equal to the the optimal value p of the original optimization problem, i.e., strong duality holds. The following summarizes the relationships between the original constrained optimization problem and the Lagrangian dual unconstrained convex optimization problem. (1) Only when the inequality constraint functions fi (x), i = 1, . . . , m, are convex, and the equality constraint functions hi (x), i = 1, . . . , q, are affine, can an original constrained optimization problem be converted into a dual unconstrained maximization problem by the Lagrangian dual method. (2) The maximization of a concave function is equivalent to the minimization of the corresponding convex function. (3) If the original constrained optimization is a convex problem then the points ˜ ν) ˜ and (λ, ˜ satisfying the KKT condiof the Lagrangian objective function x tions are the original and dual optimal points, respectively. In other words, the optimal solution d of the Lagrangian dual unconstrained optimization is the optimal solution p of the original constrained convex optimization problem. (4) In general, the optimal solution of the Lagrangian dual unconstrained optimization problem is not the optimal solution of the original constrained optimization problem but only a -suboptimal solution, where = f0 (x ) − JD (λ , ν ). Consider the constrained optimization problem with equality and inequality constraints min f (x) subject to x

Ax ≤ h, Bx = b.

(4.7.35)

Let the nonnegative vector s ≥ 0 be a slack variable vector such that Ax + s = h. Hence, the inequality constraint Ax ≤ h becomes the equality constraint Ax + s − h = 0. Taking the penalty function φ(g(x)) = 12 g(x) 22 , the augmented Lagrangian objective function is given by Lρ (x, s, λ, ν) = f (x) + λT (Ax + s − h) + ν T (Bx − b)  ρ + Bx − b 22 + Ax + s − h 22 , 2

(4.7.36)

248

Gradient Analysis and Optimization

where the two Lagrange multiplier vectors λ ≥ 0 and ν ≥ 0 are nonnegative vectors, and the penalty parameter ρ > 0. The dual gradient ascent method for solving the original optimization problem (4.7.35) is given by xk+1 = arg min Lρ (x, sk , λk , νk ),

(4.7.37)

sk+1 = arg min Lρ (xk+1 , s, λk , νk ),

(4.7.38)

λk+1 = λk + ρk (Axk+1 + sk+1 − h),

(4.7.39)

νk+1 = νk + ρk (Bxk+1 − b),

(4.7.40)

x

s≥0

where the gradient vectors are ∂Lρ (xk+1 , sk+1 , λk , νk ) = Axk+1 + sk+1 − h, ∂λk ∂Lρ (xk+1 , sk+1 , λk , νk ) = Bxk+1 − b. ∂νk Equations (4.7.37) and (4.7.38) are respectively the updates of the original variable x and the intermediate variable s, whereas Equations (4.7.39) and (4.7.40) are the dual gradient ascent iterations of the Lagrange multiplier vectors λ and ν, taking into account the inequality constraint Ax ≥ h and the equality constraint Bx = b, respectively.

4.7.6 Alternating Direction Method of Multipliers In applied statistics and machine learning we often encounter large-scale equality constrained optimization problems, where the dimension of x ∈ Rn is very large. If the vector x may be decomposed into several subvectors, i.e., x = (x1 , . . . , xr ), and the objective function may also be decomposed into f (x) =

r  i=1

fi (xi ),

r where xi ∈ Rni and i=1 ni = n, then a large-scale optimization problem can be transformed into a few distributed optimization problems. The alternating direction multiplier method (ADMM) is a simple and effective method for solving distributed optimization problems. The ADMM method decomposes an optimization problem into smaller subproblems; then, their local solutions are restored or reconstructed into a large-scale optimization solution of the original problem. The ADMM was proposed by Gabay and Mercier [167] and Glowinski and Marrocco [176] independently in the mid-1970s.

4.7 Constrained Convex Optimization

249

Corresponding to the composition of the objective function f (x), the equality constraint matrix is blocked as follows: A = [A1 , . . . , Ar ],

Ax =

r 

Ai x i .

i=1

Hence the augmented Lagrangian objective function can be written as [52] Lρ (x, λ) =

r 

Li (xi , λ)

i=1

=

r  

fi (xi ) + λ Ai xi T



i=1

22 2 r 2 ρ2 2 2 − λ b + 2 (Ai xi ) − b2 . 2 2 2 i=1 T

2

Applying the dual ascent method to the augmented Lagrangian objective function yields a decentralized algorithm for parallel computing [52]: = arg min Li (xi , λk ), xk+1 i xi ∈Rni

λk+1 = λk + ρk

 r 

i = 1, . . . , r,

(4.7.41)

 Ai xk+1 i

−b .

(4.7.42)

i=1

Here the updates xi (i = 1, . . . , r) can be run independently in parallel. Because xi , i = 1, . . . , r are updated in an alternating or sequential manner, this augmented multiplier method is known as the “alternating direction” method of multipliers. In practical applications, the simplest decomposition is the objective function decomposition of r = 2: min {f (x) + h(z)}

subject to Ax + Bz = b,

(4.7.43)

where x ∈ Rn , z ∈ Rm , A ∈ Rp×n , B ∈ Rp×m , b ∈ Rp . The augmented Lagrangian cost function of the optimization problem (4.7.43) is given by Lρ (x, z, λ) = f (x) + h(z) + λT (Ax + Bz − b) ρ + Ax + Bz − b 22 . 2

(4.7.44)

It is easily seen that the optimality conditions are divided into the original feasibility condition Ax + Bz − b = 0

(4.7.45)

and two dual feasibility conditions, 0 ∈ ∂f (x) + AT λ + ρAT (Ax + Bz − b) = ∂f (x) + AT λ,

(4.7.46)

0 ∈ ∂h(z) + B λ + ρB (Ax + Bz − b) = ∂h(z) + B λ,

(4.7.47)

T

T

T

250

Gradient Analysis and Optimization

where ∂f (x) and ∂h(z) are the subdifferentials of the subobjective functions f (x) and h(z), respectively. The updates of the ADMM for the optimization problem min Lρ (x, z, λ) are as follows: xk+1 = arg min Lρ (x, zk , λk ),

(4.7.48)

zk+1 = arg min Lρ (xk+1 , z, λk ),

(4.7.49)

λk+1 = λk + ρk (Axk+1 + Bzk+1 − b).

(4.7.50)

x∈Rn

z∈Rm

The original feasibility cannot be strictly satisfied; its error rk = Axk + Bzk − b

(4.7.51)

is known as the original residual (vector) in the kth iteration. Hence the update of the Lagrange multiplier vector can be simply written as λk+1 = λk + ρk rk+1 .

(4.7.52)

Likewise, neither can dual feasibility be strictly satisfied. Because xk+1 is the minimization variable of Lρ (x, zk , λk ), we have 0 ∈ ∂f (xk+1 ) + AT λk + ρAT (Axk+1 + Bzk − b) = ∂f (xk+1 ) + AT (λk + ρrk+1 + ρB(zk − zk+1 )) = ∂f (xk+1 ) + AT λk+1 + ρAT B(zk − zk+1 ). Comparing this result with the dual feasibility formula (4.7.46), it is easy to see that sk+1 = ρAT B(zk − zk+1 )

(4.7.53)

is the error vector of the dual feasibility, and hence is known as the dual residual vector in the (k + 1)th iteration. The stopping criterion of the ADMM is as follows: the original residual and the dual residual in the (k + 1)th iteration should be very small, namely [52]

rk+1 2 ≤ pri ,

sk+1 2 ≤ dual ,

(4.7.54)

where pri and dual are respectively the allowed disturbances of the primal feasibility and the dual feasibility. If we let ν = (1/ρ)λ be the Lagrange multiplier vector scaled by the ratio 1/ρ, called the scaled dual vector, then Equations (4.7.48)–(4.7.50) become [52]   xk+1 = arg min f (x) + (ρ/2) Ax + Bzk − b + νk 22 , (4.7.55) n x∈R   zk+1 = arg min h(z) + (ρ/2) Axk+1 + Bz − b + νk 22 , (4.7.56) z∈Rm

νk+1 = νk + Axk+1 + Bzk+1 − b = νk + rk+1 .

(4.7.57)

4.8 Newton Methods

251

The scaled dual vector has an interesting interpretation [52]: from the residual at the kth iteration, rk = Axk + Bzk − b, it is easily seen that νk = ν0 +

k 

ri .

(4.7.58)

i=1

That is, the scaled dual vector is the running sum of the original residuals of all k iterations. Equations (4.7.55)–(4.7.57) are referred to as the scaled ADMM, while Equations (4.7.48)–(4.7.50) are the ADMM with no scaling.

4.8 Newton Methods The first-order optimization algorithms use only the zero-order information f (x) and the first-order information ∇f (x) about an objective function. It is known [305] that if the objective function is twice differentiable then the Newton method based on the Hessian matrix is of quadratic or more rapid convergence.

4.8.1 Newton Method for Unconstrained Optimization For an unconstrained optimization minn f (x), if the Hessian matrix H = ∇2 f (x) x∈R

is positive definite then, from the Newton equation ∇2 f (x)Δx = −∇f (x), we get the Newton step Δx = −(∇2 f (x))−1 ∇f (x), which results in the gradient descent algorithm xk+1 = xk − μk (∇2 f (xk ))−1 ∇f (xk ).

(4.8.1)

This is the well-known Newton method. The Newton method may encounter the following two thorny issues in applications. (1) The Hessian matrix H = ∇2 f (x) is difficult to find. (2) Even if H = ∇2 f (x) can be found, its inverse H−1 = (∇2 f (x))−1 may be numerically unstable. There are the following three methods for resolving the above two thorny issues. 1. Truncated Newton Method Instead of using the inverse of the Hessian matrix directly, an iteration method for solving the Newton matrix equation ∇2 f (x)Δxnt = −∇f (x) is used to find an approximate solution for the Newton step Δxnt . The iteration method for solving an Newton matrix equation approximately is known as the truncated Newton method [122], where the conjugate gradient algo-

252

Gradient Analysis and Optimization

rithm and the preconditioned conjugate gradient algorithm are two popular algorithms for such a case. The truncated Newton method is especially useful for large-scale unconstrained and constrained optimization problems and interior-point methods. 2. Modified Newton Method When the Hessian matrix is not positive definite, the Newton matrix equation can be modified to [158] (∇2 f (x) + E)Δxnt = −∇f (x), (4.8.2) where E is a positive semi-definite matrix that is usually taken as a diagonal matrix such that ∇2 f (x) + E is symmetric positive definite. Such a method is called the modified Newton method. The typical modified Newton method takes E = δI, where δ > 0 is small. 3. Quasi-Newton Method Using a symmetric positive definite matrix Bk to approximate the inverse of the Hessian matrix H−1 k in the Newton method, we have the following recursion: xk+1 = xk − Bk ∇f (xk ).

(4.8.3)

The Newton method using a symmetric matrix to approximate the Hessian matrix is called the quasi-Newton method. Usually a symmetric matrix ΔHk = Hk+1 − Hk is used to approximate the Hessian matrix, and we denote ρk = f (xk+1 ) − f (xk ) and δk = xk+1 − xk . According to different approximations for the Hessian matrix, the quasi-Newton procedure employs the following three common methods [344]. (1) Rank-1 updating method ΔHk =

(δk − Hk ρk )(δk − Hk ρk )T . δk − Hk ρk , ρk 

(2) DFP (Davidon–Fletcher–Powell) method ΔHk =

δk δkT H ρ ρT H − k k k k. ρk , δk  Hk ρk , ρk 

(3) BFGS (Broyden–Fletcher–Goldfarb–Shanno) method ΔHk =

H ρ ρT H Hk ρk δkT + δk ρTk Hk − βk k k k k , Hk ρk , ρk  Hk ρk , ρk 

where βk = 1 + ρk , δk /Hk ρk , ρk . An important property of the DFP and BFGS methods is that they preserve the positive definiteness of the matrices.

4.8 Newton Methods

253

In the iterative process of the various Newton methods, it is usually required that a search is made of the optimal point along the direction of the straight line {x+μΔx | μ ≥ 0}. This step is called linear search. However, this choice of procedure can only minimize the objective function approximately, i.e., make it sufficiently small. Such a search for approximate minimization is called an inexact line search. In this search the step size μk on the search line {x + μΔx | μ ≥ 0} must decrease the objective function f (xk ) sufficiently to ensure that f (xk + μΔxk ) < f (xk ) + αμ(∇f (xk ))T Δxk ,

α ∈ (0, 1).

(4.8.4)

The inequality condition (4.8.4) is sometimes called the Armijo condition [42], [157], [305], [353]. In general, for a larger step size μ, the Armijo condition is usually not met. Hence, Newton algorithms can start the search from μ = 1; if the Armijo condition is not met then the step size μ needs to be reduced to β μ, with the correction factor β ∈ (0, 1). If for step size β μ the Armijo condition is still not met, then the step size needs to be reduced again until the Aemijo condition is met. Such a search method is known as backtracking line search. Algorithm 4.10 shows the Newton algorithm via backtracking line search. Algorithm 4.10

Newton algorithm via backtracking line search

initialization: x1 ∈ dom f (x), and parameters α ∈ (0, 0.5), β ∈ (0, 1). Put k = 1. repeat 1. Compute bk = ∇f (xk ) and Hk = ∇2 f (xk ). 2. Solve the Newton equation Hk Δxk = −bk . 3. Update xk+1 = xk + μΔxk . 4. exit if |f (xk+1 ) − f (xk )| < . return k ← k + 1. output: x ← xk .

The backtracking line search method can ensure that the objective function satisfies f(x_{k+1}) < f(x_k), while the step size μ need not become too small. If the Newton equation in step 2 is solved by one of the other methods above, then Algorithm 4.10 becomes the truncated Newton method, the modified Newton method or the quasi-Newton method, respectively.

Now consider the minimization min f(z), where z ∈ Cⁿ and f : Cⁿ → R. By the second-order Taylor series approximation of a function of two variables,

$$f(x+h_1, y+h_2) = f(x,y) + h_1\frac{\partial f(x,y)}{\partial x} + h_2\frac{\partial f(x,y)}{\partial y} + \frac{1}{2!}\left(h_1^2\frac{\partial^2 f(x,y)}{\partial x\,\partial x} + 2h_1h_2\frac{\partial^2 f(x,y)}{\partial x\,\partial y} + h_2^2\frac{\partial^2 f(x,y)}{\partial y\,\partial y}\right),$$

it is easily seen that the second-order Taylor series approximation of the holomorphic function f(z, z*) is given by

$$f(z+\Delta z,\ z^*+\Delta z^*) = f(z,z^*) + \big[(\nabla_z f(z,z^*))^T,\ (\nabla_{z^*} f(z,z^*))^T\big]\begin{bmatrix}\Delta z\\ \Delta z^*\end{bmatrix} + \frac{1}{2}\big[(\Delta z)^H,\ (\Delta z)^T\big]\begin{bmatrix}\frac{\partial^2 f(z,z^*)}{\partial z^*\,\partial z^T} & \frac{\partial^2 f(z,z^*)}{\partial z^*\,\partial z^H}\\ \frac{\partial^2 f(z,z^*)}{\partial z\,\partial z^T} & \frac{\partial^2 f(z,z^*)}{\partial z\,\partial z^H}\end{bmatrix}\begin{bmatrix}\Delta z\\ \Delta z^*\end{bmatrix}. \qquad (4.8.5)$$

From the first-order optimality condition

$$\frac{\partial f(z+\Delta z, z^*+\Delta z^*)}{\partial\begin{bmatrix}\Delta z\\ \Delta z^*\end{bmatrix}^*} = \begin{bmatrix}0\\ 0\end{bmatrix},$$

we obtain immediately the equation for the complex Newton step:

$$\begin{bmatrix} H_{z_k^*, z_k} & H_{z_k^*, z_k^*}\\ H_{z_k, z_k} & H_{z_k, z_k^*}\end{bmatrix}\begin{bmatrix}\Delta z_{{\rm nt},k}\\ \Delta z_{{\rm nt},k}^*\end{bmatrix} = -\begin{bmatrix}\nabla_{z_k} f(z_k, z_k^*)\\ \nabla_{z_k^*} f(z_k, z_k^*)\end{bmatrix}, \qquad (4.8.6)$$

where

$$H_{z_k^*, z_k} = \frac{\partial^2 f(z_k, z_k^*)}{\partial z_k^*\,\partial z_k^T},\qquad H_{z_k^*, z_k^*} = \frac{\partial^2 f(z_k, z_k^*)}{\partial z_k^*\,\partial z_k^H},\qquad H_{z_k, z_k} = \frac{\partial^2 f(z_k, z_k^*)}{\partial z_k\,\partial z_k^T},\qquad H_{z_k, z_k^*} = \frac{\partial^2 f(z_k, z_k^*)}{\partial z_k\,\partial z_k^H} \qquad (4.8.7)$$

are the part-Hessian matrices of the holomorphic function f(z_k, z_k*), respectively. Hence, the updating formula of the complex Newton method is described by

$$\begin{bmatrix} z_{k+1}\\ z_{k+1}^*\end{bmatrix} = \begin{bmatrix} z_k\\ z_k^*\end{bmatrix} + \mu\begin{bmatrix}\Delta z_{{\rm nt},k}\\ \Delta z_{{\rm nt},k}^*\end{bmatrix}. \qquad (4.8.8)$$

4.8.2 Newton Method for Constrained Optimization

First, consider the equality constrained optimization problem with a real variable:

$$\min_{x} f(x)\quad\text{subject to}\quad Ax = b, \qquad (4.8.9)$$

where f : Rⁿ → R is a convex, twice differentiable function, while A ∈ R^{p×n} with rank(A) = p and p < n. Let Δx_nt denote the Newton search direction. Then the second-order Taylor approximation of the objective function f(x) is given by

$$f(x+\Delta x_{\rm nt}) = f(x) + (\nabla f(x))^T\Delta x_{\rm nt} + \frac{1}{2}(\Delta x_{\rm nt})^T\nabla^2 f(x)\Delta x_{\rm nt},$$

subject to A(x + Δx_nt) = b and hence AΔx_nt = 0. In other words, the Newton search direction can be determined by the equality constrained optimization problem:

$$\min_{\Delta x_{\rm nt}}\left\{ f(x) + (\nabla f(x))^T\Delta x_{\rm nt} + \frac{1}{2}(\Delta x_{\rm nt})^T\nabla^2 f(x)\Delta x_{\rm nt}\right\}\quad\text{subject to}\quad A\Delta x_{\rm nt} = 0.$$


Letting λ be the Lagrange multiplier vector multiplying the equality constraint AΔx_nt = 0, we have the Lagrangian objective function

$$L(\Delta x_{\rm nt}, \lambda) = f(x) + (\nabla f(x))^T\Delta x_{\rm nt} + \frac{1}{2}(\Delta x_{\rm nt})^T\nabla^2 f(x)\Delta x_{\rm nt} + \lambda^T A\Delta x_{\rm nt}. \qquad (4.8.10)$$

By the first-order optimality condition ∂L(Δx_nt, λ)/∂Δx_nt = 0 and the constraint condition AΔx_nt = 0, it is easily obtained that

$$\nabla f(x) + \nabla^2 f(x)\Delta x_{\rm nt} + A^T\lambda = 0\quad\text{and}\quad A\Delta x_{\rm nt} = 0,$$

which can be merged into

$$\begin{bmatrix}\nabla^2 f(x) & A^T\\ A & O\end{bmatrix}\begin{bmatrix}\Delta x_{\rm nt}\\ \lambda\end{bmatrix} = \begin{bmatrix}-\nabla f(x)\\ O\end{bmatrix}. \qquad (4.8.11)$$

Let x⋆ be the optimal solution to the equality constrained optimization problem (4.8.9), and let the corresponding optimal value of the objective function f(x) be

$$p^\star = \inf\{f(x)\mid Ax = b\} = f(x^\star). \qquad (4.8.12)$$

As a stopping criterion for the Newton algorithm, one can use [53]

$$\lambda^2(x) = (\Delta x_{\rm nt})^T\nabla^2 f(x)\Delta x_{\rm nt}. \qquad (4.8.13)$$

Algorithm 4.11 below takes a feasible point as the initial point, and thus is called the feasible-start Newton algorithm. This algorithm has the following features:
(1) It is a descent method, because backtracking line search ensures that the objective function decreases at each iteration and that the equality constraint is satisfied at each iteration point x_k.
(2) It requires a feasible point as an initial point. However, in many cases it is not easy to find a feasible initial point. In these cases, it is necessary to consider an infeasible-start Newton method.
If x_k is an infeasible point, consider the equality constrained optimization problem

$$\min_{\Delta x_k}\left\{ f(x_k+\Delta x_k) = f(x_k) + (\nabla f(x_k))^T\Delta x_k + \frac{1}{2}(\Delta x_k)^T\nabla^2 f(x_k)\Delta x_k\right\}\quad\text{subject to}\quad A(x_k+\Delta x_k) = b.$$

Let λ_{k+1} (= λ_k + Δλ_k) be the Lagrange multiplier vector corresponding to the equality constraint A(x_k + Δx_k) = b. Then the Lagrangian objective function is

$$L(\Delta x_k, \lambda_{k+1}) = f(x_k) + (\nabla f(x_k))^T\Delta x_k + \frac{1}{2}(\Delta x_k)^T\nabla^2 f(x_k)\Delta x_k + \lambda_{k+1}^T\big(A(x_k+\Delta x_k) - b\big).$$


Algorithm 4.11

Feasible-start Newton algorithm [53]

given: A feasible initial point x_1 ∈ dom f with Ax_1 = b, tolerance ε > 0, parameters α ∈ (0, 0.5), β ∈ (0, 1). Put k = 1.
repeat
  1. Compute the gradient vector ∇f(x_k) and the Hessian matrix ∇²f(x_k).
  2. Use the preconditioned conjugate gradient algorithm to solve
     [[∇²f(x_k), Aᵀ], [A, O]] [Δx_{nt,k}; λ_nt] = [−∇f(x_k); O].
  3. Compute λ²(x_k) = (Δx_{nt,k})ᵀ∇²f(x_k)Δx_{nt,k}.
  4. exit if λ²(x_k) < ε.
  5. Let μ = 1.
  6. while not converged do
  7.   Update μ ← βμ.
  8.   break if f(x_k + μΔx_{nt,k}) < f(x_k) + αμ(∇f(x_k))ᵀΔx_{nt,k}.
  9. end while
  10. Update x_{k+1} ← x_k + μΔx_{nt,k}.
return k ← k + 1.
output: x ← x_k.
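The following sketch mirrors the structure of Algorithm 4.11, except that the KKT system (4.8.11) is solved by a dense direct solve in place of the preconditioned conjugate gradient algorithm; the callables f, grad_f, hess_f and the feasible starting point are assumed to be supplied by the user.

```python
import numpy as np

def feasible_start_newton(f, grad_f, hess_f, A, x1, eps=1e-8,
                          alpha=0.3, beta=0.8, max_iter=50):
    """Equality-constrained Newton method started from a feasible point
    (A @ x1 == b), mirroring the structure of Algorithm 4.11."""
    x = np.asarray(x1, dtype=float)
    p, n = A.shape
    for _ in range(max_iter):
        g, H = grad_f(x), hess_f(x)
        # KKT system (4.8.11): [[H, A^T], [A, 0]] [dx; lam] = [-g; 0]
        K = np.block([[H, A.T], [A, np.zeros((p, p))]])
        rhs = np.concatenate([-g, np.zeros(p)])
        dx = np.linalg.solve(K, rhs)[:n]
        if dx @ H @ dx < eps:               # stopping criterion (4.8.13)
            break
        mu = 1.0
        while f(x + mu * dx) >= f(x) + alpha * mu * g @ dx:
            mu *= beta                      # backtracking (Armijo) line search
            if mu < 1e-16:
                break
        x = x + mu * dx                     # A dx = 0, so feasibility is preserved
    return x
```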

From the optimality conditions ∂L(Δx_k, λ_{k+1})/∂Δx_k = 0 and ∂L(Δx_k, λ_{k+1})/∂λ_{k+1} = 0, it is easily found that

$$\nabla f(x_k) + \nabla^2 f(x_k)\Delta x_k + A^T\lambda_{k+1} = 0,\qquad A\Delta x_k = -(Ax_k - b),$$

which can be merged into

$$\begin{bmatrix}\nabla^2 f(x_k) & A^T\\ A & O\end{bmatrix}\begin{bmatrix}\Delta x_k\\ \lambda_{k+1}\end{bmatrix} = -\begin{bmatrix}\nabla f(x_k)\\ Ax_k - b\end{bmatrix}. \qquad (4.8.14)$$

Substituting λ_{k+1} = λ_k + Δλ_k into the above equation, we have

$$\begin{bmatrix}\nabla^2 f(x_k) & A^T\\ A & O\end{bmatrix}\begin{bmatrix}\Delta x_k\\ \Delta\lambda_k\end{bmatrix} = -\begin{bmatrix}\nabla f(x_k) + A^T\lambda_k\\ Ax_k - b\end{bmatrix}. \qquad (4.8.15)$$

Algorithm 4.12 shows an infeasible-start Newton algorithm based on (4.8.15).

Algorithm 4.12

Infeasible-start Newton algorithm [53]

given: An initial point x_1 ∈ Rⁿ, any initial Lagrange multiplier λ_1 ∈ Rᵖ, an allowed tolerance ε > 0, α ∈ (0, 0.5), and β ∈ (0, 1). Put k = 1.
repeat
  1. Compute the gradient vector ∇f(x_k) and the Hessian matrix ∇²f(x_k).
  2. Compute r(x_k, λ_k) via Equation (4.8.16).
  3. exit if Ax_k = b and ‖r(x_k, λ_k)‖₂ < ε.
  4. Adopt the preconditioned conjugate gradient algorithm to solve the KKT equation (4.8.15), yielding the Newton step (Δx_k, Δλ_k).
  5. Let μ = 1.
  6. while not converged do
  7.   Update μ ← βμ.
  8.   break if f(x_k + μΔx_k) < f(x_k) + αμ(∇f(x_k))ᵀΔx_k.
  9. end while
  10. Newton update
      [x_{k+1}; λ_{k+1}] = [x_k; λ_k] + μ [Δx_k; Δλ_k].
return k ← k + 1.
output: x ← x_k, λ ← λ_k.
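A corresponding sketch of the infeasible-start method follows (NumPy; dense solve in place of the preconditioned conjugate gradient algorithm). It forms the dual and primal residuals of (4.8.16), solves the KKT equation (4.8.15) for (Δx_k, Δλ_k) and applies the backtracking rule of steps 5-9; practical codes often backtrack on the residual norm instead of on f.

```python
import numpy as np

def infeasible_start_newton(f, grad_f, hess_f, A, b, x1, lam1,
                            eps=1e-8, alpha=0.3, beta=0.8, max_iter=50):
    """Infeasible-start Newton method for min f(x) s.t. Ax = b, following the
    structure of Algorithm 4.12 (dense solve in place of PCG)."""
    x = np.asarray(x1, dtype=float)
    lam = np.asarray(lam1, dtype=float)
    p, n = A.shape
    for _ in range(max_iter):
        g, H = grad_f(x), hess_f(x)
        r_dual = g + A.T @ lam                        # dual residual of (4.8.16)
        r_pri = A @ x - b                             # primal (original) residual
        if np.linalg.norm(r_pri) < eps and \
           np.linalg.norm(np.concatenate([r_dual, r_pri])) < eps:
            break
        # KKT equation (4.8.15) for the Newton step (dx, dlam)
        K = np.block([[H, A.T], [A, np.zeros((p, p))]])
        step = np.linalg.solve(K, -np.concatenate([r_dual, r_pri]))
        dx, dlam = step[:n], step[n:]
        mu = 1.0
        while f(x + mu * dx) >= f(x) + alpha * mu * g @ dx:
            mu *= beta                                # backtracking line search
            if mu < 1e-16:
                break
        x, lam = x + mu * dx, lam + mu * dlam
    return x, lam
```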

Equation (4.8.15) reveals one of the stopping criteria of the Newton algorithms. Define the residual vector

$$r(x_k, \lambda_k) = \begin{bmatrix} r_{\rm dual}(x_k, \lambda_k)\\ r_{\rm pri}(x_k, \lambda_k)\end{bmatrix} = \begin{bmatrix}\nabla f(x_k) + A^T\lambda_k\\ Ax_k - b\end{bmatrix}, \qquad (4.8.16)$$

where r_dual(x_k, λ_k) = ∇f(x_k) + Aᵀλ_k and r_pri(x_k, λ_k) = Ax_k − b represent the dual and original (primal) residual vectors, respectively. Obviously, one stopping criterion of the infeasible-start Newton algorithm can be summarized as follows:

$$\begin{bmatrix} r_{\rm dual}(x_k, \lambda_k)\\ r_{\rm pri}(x_k, \lambda_k)\end{bmatrix} \approx \begin{bmatrix}0\\ 0\end{bmatrix} \;\Leftrightarrow\; \begin{bmatrix}\Delta x_k\\ \Delta\lambda_k\end{bmatrix} \approx \begin{bmatrix}0\\ 0\end{bmatrix} \;\Leftrightarrow\; \|r(x_k, \lambda_k)\|_2 < \epsilon \qquad (4.8.17)$$

holds for a very small disturbance error ε > 0. Another stopping criterion of the infeasible-start Newton algorithm is that the equality constraint must be met. This stopping criterion can ensure that the convergence point of the Newton algorithm is feasible, although its initial point and most iteration points are allowed to be infeasible.

Now consider the equality constrained minimization problem with a complex variable:

$$\min_{z} f(z)\quad\text{subject to}\quad Az = b, \qquad (4.8.18)$$


where z ∈ Cⁿ, f : Cⁿ → R is convex and twice differentiable, and A ∈ C^{p×n} with rank(A) = p and p < n. Let (Δz_k, Δz_k*) denote the complex search direction at the kth iteration. Then the second-order Taylor expansion of f(z_k, z_k*) is given by

$$f(z_k+\Delta z_k,\ z_k^*+\Delta z_k^*) = f(z_k, z_k^*) + (\nabla_{z_k} f(z_k,z_k^*))^T\Delta z_k + (\nabla_{z_k^*} f(z_k,z_k^*))^T\Delta z_k^* + \frac{1}{2}\big[(\Delta z_k)^H,\ (\Delta z_k)^T\big]\begin{bmatrix}\frac{\partial^2 f(z_k,z_k^*)}{\partial z_k^*\,\partial z_k^T} & \frac{\partial^2 f(z_k,z_k^*)}{\partial z_k^*\,\partial z_k^H}\\ \frac{\partial^2 f(z_k,z_k^*)}{\partial z_k\,\partial z_k^T} & \frac{\partial^2 f(z_k,z_k^*)}{\partial z_k\,\partial z_k^H}\end{bmatrix}\begin{bmatrix}\Delta z_k\\ \Delta z_k^*\end{bmatrix},$$

subject to A(z_k + Δz_k) = b or, if z_k is a feasible point, AΔz_k = 0. In other words, the complex Newton search direction can be determined by the equality constrained optimization problem:

$$\min_{\Delta z_k, \Delta z_k^*} f(z_k+\Delta z_k,\ z_k^*+\Delta z_k^*)\quad\text{subject to}\quad A\Delta z_k = 0. \qquad (4.8.19)$$

Let λ_{k+1} = λ_k + Δλ_k ∈ Rᵖ be a Lagrange multiplier vector. Then the equality constrained minimization problem (4.8.19) can be transformed into the unconstrained minimization problem

$$\min_{\Delta z_k, \Delta z_k^*;\,\Delta\lambda_k}\left\{ f(z_k+\Delta z_k,\ z_k^*+\Delta z_k^*) + (\lambda_k+\Delta\lambda_k)^T A\Delta z_k\right\}.$$

It is easy to obtain the complex Newton equation

$$\begin{bmatrix}\frac{\partial^2 f(z_k,z_k^*)}{\partial z_k^*\,\partial z_k^T} & \frac{\partial^2 f(z_k,z_k^*)}{\partial z_k^*\,\partial z_k^H} & A^T\\ \frac{\partial^2 f(z_k,z_k^*)}{\partial z_k\,\partial z_k^T} & \frac{\partial^2 f(z_k,z_k^*)}{\partial z_k\,\partial z_k^H} & O\\ A & O & O\end{bmatrix}\begin{bmatrix}\Delta z_k\\ \Delta z_k^*\\ \Delta\lambda_k\end{bmatrix} = -\begin{bmatrix}\nabla_{z_k} f(z_k,z_k^*) + A^T\lambda_k\\ \nabla_{z_k^*} f(z_k,z_k^*)\\ 0\end{bmatrix}. \qquad (4.8.20)$$

Define the residual vector

$$r(z_k, z_k^*, \lambda_k) = \begin{bmatrix}\nabla_{z_k} f(z_k, z_k^*) + A^T\lambda_k\\ \nabla_{z_k^*} f(z_k, z_k^*)\end{bmatrix}. \qquad (4.8.21)$$

The feasible-start complex Newton algorithm is shown in Algorithm 4.13. This algorithm is easily extended to the infeasible-start complex Newton algorithm. To this end, the equality constrained minimization problem corresponding to Equation (4.8.19) becomes

$$\min_{\Delta z_k, \Delta z_k^*} f(z_k+\Delta z_k,\ z_k^*+\Delta z_k^*)\quad\text{subject to}\quad A(z_k+\Delta z_k) = b. \qquad (4.8.22)$$

Algorithm 4.13

Feasible-start complex Newton algorithm

given: A feasible initial point z_1 ∈ dom f with Az_1 = b, tolerance ε > 0, parameters α ∈ (0, 0.5), β ∈ (0, 1). Put k = 1.
repeat
  1. Compute the residual vector r(z_k, z_k*, λ_k) via Equation (4.8.21).
  2. exit if ‖r(z_k, z_k*, λ_k)‖₂ < ε.
  3. Compute ∇_z f(z_k, z_k*), ∇_{z*} f(z_k, z_k*) and the full Hessian matrix
     H_k = [[∂²f(z_k,z_k*)/∂z_k*∂z_kᵀ, ∂²f(z_k,z_k*)/∂z_k*∂z_kᴴ], [∂²f(z_k,z_k*)/∂z_k∂z_kᵀ, ∂²f(z_k,z_k*)/∂z_k∂z_kᴴ]].
  4. Use the preconditioned conjugate gradient algorithm to solve the Newton equation (4.8.20), yielding the Newton step (Δz_{nt,k}, Δz*_{nt,k}, Δλ_{nt,k}).
  5. Let μ = 1.
  6. while not converged do
  7.   Update μ ← βμ.
  8.   break if f(z_k + μΔz_{nt,k}, z_k* + μΔz*_{nt,k}) satisfies the Armijo-type descent condition (cf. (4.8.4)).


0. The relationship xi zi = μ, ∀ i = 1, . . . , m, is

262

Gradient Analysis and Optimization

also called the perturbed complementarity condition or complementary slackness. Obviously, if μ → 0 then xi zi → 0. Replacing the complementarity condition with perturbed complementarity, the KKT equations can be equivalently written as Ax = b,

x ≥ 0,

A y + z = c,

z ≥ 0,

T

(4.9.11)

T

x z = nμ. DEFINITION 4.18 [417] If the original problem has a feasible solution x > 0 and the dual problem has also a solution (y, z) with z > 0 then the original–dual optimization problem is said to meet the interior-point condition (IPC). If the interior-point condition is satisfied then the solution of Equation (4.9.11) is denoted (x(μ), y(μ), z(μ)) and is called the μ-center of the original problem (P ) and the dual problem (D). The set of all μ-centers is known as the center path of the original–dual problem. Substituting xk = xk−1 + Δxk , yk = yk−1 + Δyk and zk = zk−1 + Δzk into Equation (4.9.11), we have [416] AΔxk = b − Axk−1 , A Δyk + Δzk = c − AT yk−1 − zk−1 , T

zTk−1 Δxk + xTk Δzk−1 = nμ − xTk−1 zk−1 , equivalently written as ⎡

A ⎣ O zTk−1

O AT 0T

⎤⎡ ⎤ ⎡ ⎤ O Δxk b − Axk−1 I ⎦ ⎣Δyk ⎦ = ⎣c − AT yk−1 − zk−1 ⎦ . T xk−1 Δzk nμ − xTk−1 zk−1

(4.9.12)

The interior-point method based on the first-order optimal condition is called the first-order interior-point method, whose key step is to solve the KKT equation (4.9.12) to obtain the Newton step (Δxk , Δyk , Δzk ). This Newton step can be implemented by finding the inverse matrix directly, but an approach with better numerical performance is to use an iteration method for solving the KKT equation (4.9.12). Notice that since the leftmost matrix is not real symmetric, the conjugate gradient method and the preconditioned conjugate gradient method are unable to be used. Algorithm 4.14 gives a feasible-start original–dual interior-point algorithm. The above algorithm requires (x0 , y0 , z0 ) to be a feasible point. This feasible point can be determined by using Algorithm 4.15.

4.9 Original–Dual Interior-Point Method Algorithm 4.14

263

Feasible-start original–dual interior-point algorithm [416]

input: Accuracy parameter  > 0, updating parameter θ, 0 < θ < 1, feasible point (x0 , y0 , z0 ). initialization: Take μ0 = xT0 z0 /n, and set k = 1. repeat 1. Solve the KKT equation (4.9.12) to get the solution (Δxk , Δyk , Δzk ). 2. Update μk = (1 − θ)μk−1 . 3. Update the original variable, dual variable and Lagrange multiplier: xk = xk−1 + μk Δxk , yk = yk−1 + μk Δyk , zk = zk−1 + μk Δzk . exit if xTk zk < . return k ← k + 1. output: (x, y, z) ← (xk , yk , zk )

Algorithm 4.15

Feasible-point algorithm [416]

input: Accuracy parameter  > 0, penalty update parameter θ (0 < θ < 1) and threshold parameter τ > 0. initialization: x0 > 0, z0 > 0, y0 , xT0 z0 = nμ0 . Put k = 1. repeat 1. Compute the residuals rk−1 = b − AT xk−1 , rk−1 = c − AT yk−1 − zk−1 . c b 2. μ-update νk−1 = μk−1 /μ0 . 3. Solve the ⎤ ⎡ ⎤ ⎡ KKT equation ⎤ ⎡ Δf xk θνk−1 r0b A O O ⎥⎢ ⎥ ⎢ ⎥ ⎢ AT I ⎦ ⎣Δf yk ⎦ = ⎣ θνk−1 r0c ⎦. ⎣ O T T T f T zk−1 0 xk−1 Δ zk nμ − xk−1 zk−1 4. Update (xk , yk , zk ) = (xk−1 , yk−1 , zk−1 ) + (Δf xk , Δf yk , Δf zk ). 5. exit if max{xTk zk , b − AT xk , c − AT yk − zk } ≤ . return k ← k + 1. output: (x, y, z) ← (xk , yk , zk ).

4.9.3 Second-Order Original–Dual Interior-Point Method In the first-order original–dual interior-point method there exists the following shortcomings. • The matrix in the KKT equation is not symmetric, so it is difficult to apply the conjugate gradient method or the preconditioned conjugate gradient method for solving the KKT equation. • The KKT equation uses only the first-order optimal condition; it does not use the second-order information provided by the Hessian matrix. • It is difficult to ensure that iteration points are interior points. • It is inconvenient to extend this method to the general nonlinear optimization problem.

264

Gradient Analysis and Optimization

In order to overcome the above shortcomings, the original–dual interior-point method for nonlinear optimization consists of the following three basic elements. (1) Barrier function Limit the variable x to interior-points. (2) Newton method Solve the KKT equation efficiently. (3) Backtracking line search Determine a suitable step size. This interior-point method, combining the barrier function with the Newton method, is known as the second-order original–dual interior-point method. Define a slack variable z ∈ Rm which is a nonnegative vector z ≥ 0 such that h(x)−z = 0. Then the inequality constrained original problem (P ) can be expressed as the equality constrained optimization problem min f (x) subject to

h(x) = z, z ≥ 0.

(4.9.13)

To remove the inequality z ≥ 0, we introduce the Fiacco–McCormick logarithmic barrier m  log zi , (4.9.14) bμ (x, z) = f (x) − μ i=1

where μ > 0 is the barrier parameter. For very small μ, the barrier function bμ (x, z) and the original objective function f (x) have the same role at points for which the constraints are not close to zero. The equality constrained optimization problem (4.9.13) can be expressed as the unconstrained optimization problem + 0 m  min Lμ (x, z, λ) = f (x) − μ log zi − λT (h(x) − z) , (4.9.15) x, z, λ

i=1

where Lμ (x, z, λ) is the Lagrangian objective function and λ ∈ Rm is the Lagrange multiplier or the dual variable. Denote Z = Diag(z1 , . . . , zm ), Let ∇x Lμ =

Λ = Diag(λ1 , . . . , λm ).

(4.9.16)

∂Lμ ∂L ∂Lμ = 0, ∇z Lμ = Tμ = 0 and ∇λ Lμ = = 0. Then, the first∂ xT ∂z ∂λT

order optimal condition is given by ∇f (x) − (∇h(x))T λ = 0, −μ1 + ZΛ1 = 0, h(x) − z = 0. Here 1 is an m-dimensional vector with all entries equal to 1.

(4.9.17)

4.9 Original–Dual Interior-Point Method

265

In order to derive the Newton equation for the unconstrained optimization problem (4.9.15), consider the Lagrangian objective function Lμ (x + Δx, z + Δz, λ + Δλ) = f (x + Δx) − μ

m 

log(zi + Δzi )

i=1

  − (λ + Δλ)T h(x + Δx) − z − Δz , where 1 f (x + Δx) = f (x) + (∇f (x))T Δx + (Δx)T ∇2 f (x)Δx, 2 1 hi (x + Δx) = hi (x) + (∇hi (x))T Δx + (Δx)T ∇2 hi (x)Δx, 2 for i = 1, . . . , m. Letting ∇x Lμ =

∂Lμ ∂Lμ ∂Lμ = 0, ∇z Lμ = = 0 and ∇λ Lμ = = 0, ∂(Δx)T ∂(Δz)T ∂(Δλ)T

the Newton equation is given by [490] ⎡ ⎤ ⎤⎡ ⎤ ⎡ H(x, λ) O −(A(x))T Δx −∇f (x) + (∇h(x))T λ ⎣ O ⎦, ⎦ ⎣ Δz ⎦ = ⎣ μ1 − ZΛ1 Λ Z z − h(x) A(x) −I O Δλ

(4.9.18)

where H(x, λ) = ∇ f (x) − 2

m 

λi ∇2 hi (x),

A(x) = ∇h(x).

(4.9.19)

i=1

In (4.9.18), Δx is called the optimality direction and Δz the centrality direction, while Δλ is known as the feasibility direction [490]. The triple (Δx, Δz, Δλ) gives the Newton step, i.e. the update search direction of the interior-point method. From (4.9.18) it is easy to get [490] ⎡ ⎤⎡ ⎤ ⎡ ⎤ −H(x, λ) O (A(x))T Δx α ⎣ (4.9.20) O −Z−1 Λ −I ⎦ ⎣ Δz ⎦ = ⎣−β ⎦ , A(x) −I O Δλ γ where α = ∇f (x) − (∇h(x))T λ, β = μZ

−1

1 − λ,

γ = z − h(x).

(4.9.21) (4.9.22) (4.9.23)

The above three variables have the following meanings. (1) The vector γ measures the original infeasibility: γ = 0 means that x is a feasible point satisfying the equality constraint h(x) − z = 0; otherwise, x is an infeasible point.

266

Gradient Analysis and Optimization

(2) The vector α measures the dual infeasibility: α = 0 means that the dual variable λ satisfies the first-order optimal condition, and thus is feasible; otherwise, λ violates the first-order optimal condition and hence is infeasible. (3) The vector β measures the complementary slackness: β = 0, i.e., μZ−1 1 = λ, means that the complementary slackness zi λi = μ, ∀ i = 1, . . . , m and complementarity is completely satisfied when μ = 0. The smaller is the deviation of μ from zero, the smaller is the complementary slackness; the larger μ is, the larger is the complementary slackness. Equation (4.9.20) can be decomposed into the following two parts: Δz = Λ−1 Z(β − Δλ) and



−H(x, λ) A(x)

(A(x))T ZΛ−1



   α Δx = . γ + ZΛ−1 β Δλ

(4.9.24)

(4.9.25)

Importantly, the original Newton equation (4.9.20) becomes a sub-Newton equation (4.9.25) with smaller dimension. The Newton equation (4.9.25) is easy to solve iteratively via the preconditioned conjugate gradient algorithm. Once the Newton step (Δx, Δz, Δλ) is found from Equation (4.9.24) and the solution of Equation (4.9.25), we can perform the following updating formulae xk+1 = xk + ηΔxk , zk+1 = zk + ηΔzk ,

(4.9.26)

λk+1 = λk + ηΔλk . Here η is the common step size of the updates. 1. Modification of Convex Optimization Problems The key of the interior-point method is to ensure that each iteration point xk is an interior point such that h(xk ) > 0. However, for nonquadratic convex optimization, to simply select step size is insufficient to ensure the positiveness of the nonnegative variables. For this reason, it is necessary to modify the convex optimization problem appropriately. A simple modification of the convex optimization problem is achieved by introducing the merit function. Unlike the general cost function minimization, the minimization of the merit function has two purposes: as is necessary, to promote the appearance of iterative points close to the local minimum point of the objective function, but also to ensure that the iteration points are feasible points. Hence the main procedure using the merit function is twofold: (1) transform the constrained optimization problem into an unconstrained optimization problem;

4.9 Original–Dual Interior-Point Method

267

(2) reduce the merit function in order to move the iterative point closer to optimal solution of the original constrained optimization problem. Consider the classical Fiacco–McCormick merit function [152] ψρ,μ (x, z) = f (x) − μ

m 

ρ log zi + h(x) − z 22 . 2 i=1

(4.9.27)

Define the dual normal matrix (4.9.28) N(x, λ, z) = H(x, λ) + (A(x))T Z−1 ΛA(x), m let b(x, λ) = f (x) − i=1 λi log zi represent the barrier function and let the dual normal matrix N(x, λ, z) be positive definite. Then it can be shown [490] that the search direction (Δx, Δλ) determined by Equation (4.9.20) has the following properties. • If γ = 0 then



∇x b(x, λ) ∇λ b(x, λ)

 Δx ≤ 0. Δλ

T 

• There is a ρmin ≥ 0 such that for each ρ > ρmin , the following inequality holds:  T   Δx ∇x ψρ,μ (x, z) ≤ 0. Δλ ∇z ψρ,μ (x, z) In the above two cases, the equality holds if and only if (x, z) satisfies (4.9.17) for some λ. 2. Modification of Nonconvex Optimization Problems For a nonconvex optimization problem, the Hessian matrix H(x, λ) may not be positive semi-definite, hence the dual normal matrix N(x, λ, z) may not be positive definite. In this case, the Hessian matrix needs to be added a very small perturba˜ tion. That is to say, using H(x, λ) = H(x, λ) + δI to replace the Hessian matrix H(x, λ) in the dual normal matrix, we obtain ˜ N(x, λ, z) = H(x, λ) + δI + (A(x))T Z−1 ΛA(x). Although interior-point methods usually take only a few iterations to converge, they are difficult to handle for a higher-dimensional matrix because the complexity of computing the step direction is O(m6 ), where m is the dimension of the matrix. As a result, on a typical personal computer (PC), generic interior-point solvers cannot handle matrices with dimensions larger than m = 102 . However, image and video processing often involve matrices of dimension m = 104 to 105 and web search and bioinformatics can easily involve matrices of dimension m = 106 and beyond. In these real applications, generic interior point solvers are too limited, and only first-order gradient methods are practical [298].

268

Gradient Analysis and Optimization

Exercises 4.1

4.2

Let y be a real-valued measurement data vector given by y = αx+v, where α is a real scalar, x represents a real-valued deterministic process, the additive noise vector v has zero-mean, and covariance matrix Rv = E{vvT }. Find an optimal filter vector w such that α ˆ = wT y is an unbiased estimator with minimum variance. Let f (t) be a given function. Consider the minimization of the quadratic function: 1 2 min Q(x) = min (f (t) − x0 − x1 t − · · · − xn tn ) dt. 0

4.3

4.4

Determine whether the matrix associated with the set of linear equations is ill-conditioned. Given a function f (X) = log | det(X)|, where X ∈ Cn×n and | det(X)| is the absolute value of the determinant det(X). Find the conjugate gradient matrix df /dX∗ . Consider the equation y = Aθ + e, where e is an error vector. Define the def weighted sum of squared errors Ew = eH We, where W is a Hermitian positive definite matrix which weights the errors. (a) Find the solution for the parameter vector θ such that Ew is minimized. This solution is called the weighted least squares estimate of θ. (b) Use the LDLH decomposition W = LDLH to prove that the weighted least squares criterion is equivalent to the prewhitening of the error or data vector.

4.5 4.6

Let a cost function be f (w) = wH Re w and let the constraint on the filter be Re(wH x) = b, where b is a constant. Find the optimal filter vector w. Explain whether the following constrained optimization problems have solutions: (a) min (x1 + x2 ) subject to x21 + x22 = 2, 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1; (b) min (x1 + x2 ) subject to x21 + x22 ≤ 1, x1 + x2 = 4; (c) min (x1 x2 ) subject to x1 + x2 = 3.

4.7

4.8 4.9

Consider the constrained optimization problem min{(x − 1)(y + 1)} subject to x − y = 0. Use the Lagrange multiplier method to show that the minimum point is (1, 1) and the Lagrange multiplier λ = 1. For a Lagrangian function ψ(x, y) = (x − 1)(y + 1) − 1(x − y), show that ψ(x, y) has a saddle point at (0, 0), i.e., the point (0, 0) cannot minimize ψ(x, y). Solve the constrained optimization problem min{J(x, y, z) = x2 + y 2 + z 2 } subject to 3x + 4y − z = 25. The steepest descent direction of an objective function J(M) is given by dJ(M) − . Given a positive definite matrix W, determine which of the gradient dM

matrices −W

dJ(M) dM

and −

dJ(M) W dM

is the steepest descent direction of J(M),

Exercises

269

and write down the corresponding gradient descent algorithm. 4.10 Let an observation data vector be produced by the linear regression model y = Xβ + , E{ } = 0, E{ T } = σ 2 I. We want to design a filter matrix A whose output vector e = Ay satisfies E{e− } = 0, and minimizes E{(e− )T (e− )}. Show that this optimization problem is equivalent to   min tr(AT A) − 2tr(A) subject to AX = O, where O is a zero matrix. 4.11 Show that the solution matrix of the optimization problem   min tr(AT A) − 2tr(A) ˆ = I − XX† . subject to AX = O (zero matrix) is given by A 4.12 Show that the unconstrained minimization problem 3 4 min (y − Xβ)T (V + XXT )† (y − Xβ) and the constrained minimization problem   min (y − Xβ)V† (y − Xβ) subject to

(I − VV† )Xβ = (I − VV† )y

have the same solution vector β. 4.13 Solve the constrained optimization problem   min tr(AT A) − 2tr(A) subject to Ax = 0. A

4.14 Show that the constrained optimization problem 1 min xT x x 2

subject to Cx = b

has the unique solution x∗ = C† b. 4.15 Consider matrices Y ∈ Rn×m , Z ∈ Rn×(n−m) and let their columns constitute a linearly independent set. If a solution vector subject to Ax = b is expressed as x = YxY + ZxZ , where xY and xZ are some m × 1 and (n − m) × 1 vectors, respectively, show that x = Y(AY)−1 b + ZxZ .   4.16 If a constrained optimization problem is min tr(AVAT ) subject to AX = W, show that A = W(XT V0† X)† XT V†0 +Q(I−V0 V0† ), where V0 = V+XXT and Q is any matrix. 4.17 Assume for a matrix X that rank(X) ≤ r. Show that, for all semi-orthogonal matrices A such that AT A = Ir and all matrices Z ∈ Rn×r , the following result is true: 4 3 min tr (X − ZAT )(X − ZAT )T = 0. X

270

Gradient Analysis and Optimization

4.18 For p > 1 and ai ≥ 0, i = 1, . . . , n, show that for each set of nonnegative n real numbers {x1 , . . . , xn } such that i=1 xqi = 1, where q = p/(p − 1), the relationship 1/p  n n   p ai xi ≤ ai i=1

i=1

is satisfied, and equality holds if and only if a1 = · · · = an = 0 or xqi = p 1/p api ( k=1 apk ) , i = 1, . . . , n. This relationship is called the representation p 1/p theorem of ( i=1 api ) . 4.19 Show that the solution matrix of the constrained optimization problem   min tr(AT A) − 2tr(A) subject to AX = O (the zero matrix) ˆ = I − XX† . (Hint: The matrix constraint condition AX = O is given by A is equivalent to the scalar constraint condition tr(LAX) = 0, ∀ L.)

5 Singular Value Analysis

Beltrami (1835–1899) and Jordan (1838–1921) are recognized as the founders of singular value decomposition (SVD): Beltrami in 1873 published the first paper on SVD [34]. One year later, Jordan published his own independent derivation of SVD [236]. Now, the SVD and its generalizations are one of the most useful and efficient tools for numerical linear algebra, and are widely applied in statistical analysis, and physics and in applied sciences, such as signal and image processing, system theory, control, communication, computer vision and so on. This chapter presents first the concept of the stability of numerical computations and the condition number of matrices in order to introduce the necessity of the SVD of matrices; then we discuss SVD and generalized SVD together with their numerical computation and applications.

5.1 Numerical Stability and Condition Number In many applications such as information science and engineering, it is often necessary to consider an important problem: in the actual observation data there exist some uncertainties or errors, and, furthermore, numerical calculation of the data is always accompanied by error. What is the impact of these errors? Is a particular algorithm numerically stable for data processing? In order to answer these questions, the following two concepts are extremely important: (1) the numerical stability of various kinds of algorithm; (2) the condition number or perturbation analysis of the problem of interest. Let f be some application problem, let d ∈ D be the data without noise or disturbance, where D denotes a data group, and let f (d ) ∈ F represent the solution of f , where F is a solution set. Given observed data d ∈ D, we want to evaluate f (d). Owing to background noise and/or observation error, f (d) is usually different from f (d ). If f (d) is “close” to f (d ) then the problem f is “well-conditioned”. On the contrary, if f (d) is obviously different from f (d ) even when d is very close to d then we say that the problem f is “ill-conditioned”. If there is no further 271

272

Singular Value Analysis

information about the problem f , the term “approximation” cannot describe the situation accurately. In perturbation theory, a method or algorithm for solving a problem f is said to be numerically stable if its sensitivity to a perturbation is not larger than the sensitivity inherent in the original problem. A stable algorithm can guarantee that the solution f (d) with a slight perturbation Δd of d approximates closely the solution without the perturbation. More precisely, f is stable if the approximate solution f (d) is close to the solution f (d ) without perturbation for all d ∈ D close to d . A mathematical description of numerical stability is discussed below. First, consider the well-determined linear equation Ax = b, where the n × n coefficient matrix A has known entries and the n × 1 data vector b is also known, while the n × 1 vector x is an unknown parameter vector to be solved. Naturally, we will be interested in the stability of this solution: if the coefficient matrix A and/or the data vector b are perturbed, how will the solution vector x be changed? Does it maintain a certain stability? By studying the influence of perturbations of the coefficient matrix A and/or the data vector b, we can obtain a numerical value, called the condition number, describing an important characteristic of the coefficient matrix A. For the convenience of analysis, we first assume that there is only a perturbation δb of the data vector b, while the matrix A is stable. In this case, the exact solution vector x will be perturbed to x + δ x, namely A(x + δ x) = b + δ b.

(5.1.1)

δ x = A−1 δ b,

(5.1.2)

This implies that

since Ax = b. By the property of the Frobenius norm, Equation (5.1.2) gives

δ x 2 ≤ A−1 F δ b 2 .

(5.1.3)

Similarly, from the linear equation Ax = b we get

b 2 ≤ A F x 2 .

(5.1.4)

From Equations (5.1.3) and (5.1.4) it is immediately clear that   δ b 2

δ x 2 ≤ A F A−1 F .

x 2

b 2

(5.1.5)

Next, consider the influence of simultaneous perturbations δx and δA. In this case, the linear equation becomes (A + δA)(x + δ x) = b.

5.1 Numerical Stability and Condition Number

273

From the above equation it can be derived that   δx = (A + δA)−1 − A−1 b = {A−1 (A − (A + δA)) (A + δA)−1 }b = −A−1 δA(A + δA)−1 b = −A−1 δA(x + δx)

(5.1.6)

from which we have

δx 2 ≤ A−1 F δA F x + δx 2 , namely   δA F

δx 2 ≤ A F A−1 F .

x + δx 2

A F

(5.1.7)

Equations (5.1.5) and (5.1.7) show that the relative error of the solution vector x is proportional to the numerical value cond(A) = A F A−1 F ,

(5.1.8)

where cond(A) is called the condition number of the matrix A and is sometimes denoted κ(A). If a small perturbation of a coefficient matrix A causes only a small perturbation of the solution vector x, then the matrix A is said to be a well-conditioned. If, however, a small perturbation of A causes a large perturbation of x then the matrix A is ill-conditioned. The condition number characterizes the influence of errors in the entries of A on the solution vector; hence it is an important index measuring the numerical stability of a set of linear equations. Further, consider an over-determined linear equation Ax = b, where A is an m × n matrix with m > n. An over-determined equation has a unique linear least squares solution given by AH Ax = AH b, i.e., x = (AH A)−1 AH b. It is easy to show (see Subsection 5.2.2) that   2  cond AH A = cond(A) .

(5.1.9)

(5.1.10)

From Equations (5.1.5) and (5.1.7) it can be seen that the effects of the datavector error δ b and the coefficient-matrix error δA on the solution-vector error for the over-determined equation (5.1.9) are each proportional to the square of the condition number of A. EXAMPLE 5.1

Consider the matrix



⎤ 1 1 A = ⎣δ 0⎦ , 0 δ

274

Singular Value Analysis

where δ is small. The condition number of A is of the same order of magnitude as δ −1 . Since   1 + δ2 1 H B=A A= , 1 1 + δ2 the condition number of AH A is of order δ −2 . In addition, if we use the QR decomposition A = QR to solve an over-determined equation Ax = b then cond(Q) = 1,

cond(A) = cond(QH A) = cond(R),

(5.1.11)

since QH Q = I. In this case the effects of the errors of b and A will be proportional to the condition number of A, as shown in Equations (5.1.5) and (5.1.7), respectively. The above fact tells us that, when solving an over-determined matrix equation, the QR decomposition method has better numerical stability (i.e., a smaller condition number) than the least squares method. If the condition number of A is large then the matrix equation Ax = b is illconditioned. In this case, for a data vector b close to the ideal vector b without perturbation, the solution associated with b will be far from the ideal solution associated with b . A more effective approach than QR decomposition for solving this class of illconditioned linear equations is the total least squares method (which will be presented in Chapter 6), whose basis is the singular value decomposition of matrices. This is the topic of the next section.

5.2 Singular Value Decomposition (SVD) Singular value decomposition is one of the most basic and most important tools in modern numerical analysis (especially in numerical computations). This section presents the definition and geometric interpretation of SVD together with the properties of singular values.

5.2.1 Singular Value Decomposition The SVD was proposed first by Beltrami in 1873 for a real square matrix A [34]. Consider the bilinear function f (x, y) = xT Ay,

A ∈ Rn×n ;

by introducing the linear transformation x = Uξ and y = Vη Beltrami transformed the bilinear function into f (x, y) = ξ T Sη, where S = UT AV.

(5.2.1)

5.2 Singular Value Decomposition (SVD)

275

Beltrami observed that if both U and V are constrained to be orthogonal matrices then in their choice of entries there exist n2 − n degrees of freedom, respectively. Beltrami proposed using these degrees of freedom to make all off-diagonal entries of S zero, so that the matrix S = Σ = Diag(σ1 , σ2 , . . . , σn ) becomes diagonal. Then premultiplying (5.2.1) by U, postmultiplying it by VT and making use of the orthonormality of U and V, we immediately get A = UΣVT .

(5.2.2)

This is the singular value decomposition (SVD) of a real square matrix A obtained by Beltrami in 1873 [34]. In 1874, Jordan independently derived the SVD of real square matrices [236]. The history of the invention of SVD can be found in MacDuffee’s book [309, p. 78] or in the paper of Stewart [454], which reviews the early development of SVD in detail. Later, Autonne in 1902 [19] extended the SVD to complex square matrices, while Eckart and Young [141] in 1939 extended it further to general complex rectangular matrices; therefore, the SVD theorem for a complex rectangular matrix is usually called the Autonee–Eckart–Young theorem, stated below. THEOREM 5.1 (SVD) Let A ∈ Rm×n (or Cm×n ); then there exist orthogonal (or unitary) matrices U ∈ Rm×m (or U ∈ Cm×m ) and V ∈ Rn×n (or V ∈ Cn×n ) such that   Σ1 O T H , (5.2.3) A = UΣV (or UΣV ), Σ = O O where Σ1 = Diag(σ1 , . . . , σr ) with diagonal entries σ1 ≥ · · · ≥ σr > 0,

r = rank(A).

(5.2.4)

The above theorem was first shown by Eckart and Young [141] in 1939, but the proof by Klema and Laub [250] is simpler. The numerical values σ1 , . . . , σr together with σr+1 = · · · = σn = 0 are called the singular values of the matrix A. EXAMPLE 5.2 A singular value σi of an m×n matrix A is called a single singular value of A if σi = σj , ∀ j = i. Here are several explanations and remarks on singular values and SVD. 1. The SVD of a matrix A can be rewritten as the vector form A=

r 

σi ui viH .

(5.2.5)

i=1

This expression is sometimes said to be the dyadic decomposition of A [179].

276

Singular Value Analysis

2. Suppose that the n × n matrix V is unitary. Postmultiply (5.2.3) by V to get AV = UΣ, whose column vectors are given by " σi ui , i = 1, 2, . . . , r, Avi = (5.2.6) 0, i = r + 1, r + 2, . . . , n. Therefore the column vectors vi of V are known as the right singular vectors of the matrix A, and V is called the right singular-vector matrix of A. 3. Suppose that the m × m matrix U is unitary. Premultiply (5.2.3) by UH to yield UH A = ΣV, whose column vectors are given by " σi viT , i = 1, 2, . . . , r, H ui A = (5.2.7) 0, i = r + 1, r + 2, . . . , n. The column vectors ui are called the left singular vectors of the matrix A, and U is the left singular-vector matrix of A. 4. When the matrix rank r = rank(A) < min{m, n}, because σr+1 = · · · = σh = 0 with h = min{m, n}, the SVD formula (5.2.3) can be simplified to A = Ur Σr VrH ,

(5.2.8)

where Ur = [u1 , . . . , ur ], Vr = [v1 , . . . , vr ] and Σr = Diag(σ1 , . . . , σr ). Equation (5.2.8) is called the truncated singular value decomposition or the thin singular value decomposition of the matrix A. In contrast, Equation (5.2.3) is known as the full singular value decomposition. H 5. Premultiplying (5.2.6) by uH i , and noting that ui ui = 1, it is easy to obtain uH i Avi = σi ,

i = 1, 2, . . . , min{m, n},

which can be written in matrix form as   Σ1 O , Σ1 = Diag(σ1 , . . . , σr ). UH AV = Σ = O O

(5.2.9)

(5.2.10)

Equations (5.2.3) and (5.2.10) are two definitive forms of SVD. 6. From (5.2.3) it follows that AAH = UΣ2 UH . This shows that the singular value σi of an m × n matrix A is the positive square root of the corresponding nonnegative eigenvalue of the matrix product AAH . 7. If the matrix Am×n has rank r then • the leftmost r columns of the m × m unitary matrix U constitute an orthonormal basis of the column space of the matrix A, i.e., Col(A) = Span{u1 , . . . , ur }; • the leftmost r columns of the n×n unitary matrix V constitute an orthonormal basis of the row space of A or the column space of AH , i.e., Row(A) = Span{v1 , . . . , vr }; • the rightmost n − r columns of V constitute an orthonormal basis of the null space of the matrix A, i.e., Null(A) = Span{vr+1 , . . . , vn };

5.2 Singular Value Decomposition (SVD)

277

• the rightmost m − r columns of U constitute an orthonormal basis of the null space of the matrix AH , i.e., Null(AH )=Span{ur+1 , . . . , um }.

5.2.2 Properties of Singular Values THEOREM 5.2 (Eckart–Young theorem) [141] Cm×n are given by σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0,

If the singular values of A ∈

r = rank(A),

then σk =

min ( E spec |rank(A + E) ≤ k − 1) ,

E∈Cm×n

k = 1, . . . , r,

(5.2.11)

and there is an error matrix Ek with Ek spec = σk such that rank(A + Ek ) = k − 1,

k = 1, . . . , r.

(5.2.12)

The Eckart–Young theorem shows that the singular value σk is equal to the minimum spectral norm of the error matrix Ek such that the rank of A + Ek is k − 1. An important application of the Eckart–Young theorem is to the best rank-k approximation of the matrix A, where k < r = rank(A). Define k  Ak = σi ui viH , k < r; (5.2.13) i=1

then Ak is the solution of the optimization problem Ak = arg min A − X 2F ,

k < r,

(5.2.14)

rank(X)=k

and the squared approximation error is given by 2 + · · · + σr2 .

A − Ak 2F = σk+1

(5.2.15)

THEOREM 5.3 [215], [226] Let A be an m × n matrix with singular values σ1 ≥ · · · ≥ σr , where r = min{m, n}. If the p × q matrix B is a submatrix of A with singular values γ1 ≥ · · · ≥ γmin{p,q} then σi ≥ γ i ,

i = 1, . . . , min{p, q}

(5.2.16)

and γi ≥ σi+(m−p)+(n−q) ,

i ≤ min{p + q − m, p + q − n}.

(5.2.17)

This is the interlacing theorem for singular values. The singular values of a matrix are closely related to its norm, determinant and condition number.

278

Singular Value Analysis

1. Relationship between Singular Values and Norms The spectral norm of a matrix A is equal to its maximum singular value, namely

A spec = σ1 .

(5.2.18)

Since the Frobenius norm A F of a matrix A is of unitary invariance, i.e.,

UH AV F = A F , we have 

A F =

n m  

1/2 |aij |

2

= UH AV F = Σ F =

 σ12 + · · · + σr2 . (5.2.19)

i=1 j=1

That is to say, the Frobenius norm of any matrix is equal to the positive square root of the sum of all nonzero squared singular values of the matrix. 2. Relationship between Singular Values and Determinant Let A be an n × n matrix. Since the absolute value of the determinant of any unitary matrix is equal to 1, from Theorem 5.1 it follows that | det(A)| = | det(UΣVH )| = | det Σ| = σ1 · · · σn .

(5.2.20)

If all the σi are nonzero then | det(A)| =  0, which shows that A is nonsingular. If there is at least one singular value σi = 0 (i > r) then det(A) = 0, namely A is singular. This is the reason why the σi , i = 1, . . . , min{m, n} are known as the singular values. 3. Relationship between Singular Values and Condition Number For a given m × n matrix A, its condition number can be defined by its singular values as cond(A) = σ1 /σp ,

p = min{m, n}.

(5.2.21)

From the definition formula (5.2.21) it can be seen that the condition number is a positive number larger than or equal to 1 since σ1 ≥ σp . Obviously, because there is at least one singular value σp = 0, the condition number of a singular matrix is infinitely large. If the condition number is not infinite but is large then the matrix A is said to be nearly singular. This implies that when the condition number is large, the linear dependence between the row (or column) vectors is strong. On the other hand, the condition number of an orthogonal or unitary matrix is equal to 1 by Equation (5.1.8). In this sense, the orthogonal or unitary matrix is of “ideal condition”. For an over-determined linear equation Ax = b, the SVD of AH A is AH A = VΣ2 VH ,

(5.2.22)

i.e., the maximum and minimum singular values of the matrix AH A are respectively

5.2 Singular Value Decomposition (SVD)

279

the squares of the maximum and minimum singular values of A, and thus cond(AH A) =

σ12 = (cond(A))2 . σn2

(5.2.23)

In other words, the condition number of the matrix AH A is the square of the condition number of A. This implies that, for a given data matrix X, its SVD has better numerical stability than the EVD of the covariance matrix XXH . The following are equality relationships for singular values [307]. 1. Am×n and its Hermitian matrix AH have the same nonzero singular values. 2. The nonzero singular values of a matrix Am×n are the positive square root of the nonzero eigenvalues of AAH or AH A. 3. There is a single singular value σ > 0 of a matrix Am×n if and only if σ 2 is a single eigenvalue of AAH or AH A. 4. If p = min{m, and σ1 , . . . , σp are the singular values of an m × n matrix A,  H n},  p then tr A A = i=1 σi2 . 5. The absolute value of the determinant of an n × n matrix is equal to the product of its singular values, namely | det(A)| = σ1 · · · σn . 6. The spectral norm of a matrix A is equal to its maximum singular value, namely

A spec = σmax . 7. If m ≥ n then, for a matrix Am×n , one has * )  σmin (A) = min (xH AH Ax)1/2  xH x = 1, x ∈ Cn , * )  σmax (A) = max (xH AH Ax)1/2  xH x = 1, x ∈ Cn . 8. If an m × m matrix A is nonsingular, then   1/2  1 xH (A−1 )H A−1 x  n . = max  x = 0, x ∈ C  σmin (A) xH x 9. If σ1 , . . . , σp are the nonzero singular values of A ∈ Cm×n , with p = min{m, n},

then the matrix

O AH

A O

has 2p nonzero singular values σ1 , . . . , σp , −σ1 , . . . ,

−σp and |m − n| zero singular values.

10. If A = U ΣO1 O VH is the SVD of the m×n matrix A then the Moore–Penrose O inverse matrix of A is given by   −1 Σ1 O † UH . A =V O O 11. When P and Q are respectively an m × m and an n × n unitary matrix, the SVD of PAQH is given by ˜ V ˜H PAQH = UΣ

(5.2.24)

˜ = PU and V ˜ = QV. That is to say, the two matrices PAQH and A have with U

280

Singular Value Analysis

the same singular values, i.e., the singular values have the unitary invariance, but their singular vectors are different. The following summarizes the inequality relations of singular values [214], [215], [275], [93], [307]. (1) If A and B are m × n matrices then, for p = min{m, n}, one has σi+j−1 (A + B) ≤ σi (A) + σj (B),

1 ≤ i, j ≤ p, i + j ≤ p + 1.

In particular, when j = 1, σi (A + B) ≤ σi (A) + σ1 (B) holds for i = 1, . . . , p. (2) Given Am×n and Bm×n , one has σmax (A + B) ≤ σmax (A) + σmax (B). (3) If A and B are the m × n matrices then p 

[σj (A + B) − σj (A)]2 ≤ B 2F ,

p = min{m, n}.

j=1

(4) If the singular values of A satisfies σ1 (A) ≥ · · · ≥ σm (A) then k  j=1

[σm−k+j (A)]2 ≤

k 

aH j aj ≤

j=1

k 

[σj (A)]2 ,

k = 1, 2, . . . , m.

j=1

(5) If p = min{m, n}, and the singular values of Am×n and Bm×n are arranged as σ1 (A) ≥ · · · ≥ σp (A), σ1 (B) ≥ · · · ≥ σp (B) and σ1 (A + B) ≥ · · · ≥ σp (A + B), then σi+j−1 (ABH ) ≤ σi (A)σj (B),

1 ≤ i, j ≤ p, i + j ≤ p + 1.

(6) If B is an m × (n − 1) matrix obtained by deleting any column of an m × n matrix A, and their singular values are arranged in descending order, then σ1 (A) ≥ σ1 (B) ≥ σ2 (A) ≥ σ2 (B) ≥ · · · ≥ σh (A) ≥ σh (B) ≥ 0, where h = min{m, n − 1}. (7) The maximum singular value of a matrix Am×n satisfies the inequality  σmax (A) ≥

1 tr(AH A) n

1/2 .

5.2.3 Rank-Deficient Least Squares Solutions In applications of singular value analysis, it is usually necessary to use a low-rank matrix to approximate a noisy or disturbed matrix. The following theorem gives an evaluation of the quality of approximation.

5.2 Singular Value Decomposition (SVD)

281

p THEOREM 5.4 Let A = σi ui viT be the SVD of A ∈ Rm×n , where p = i=1 rank(A). If k < p and Ak = ki=1 σi ui viT is the rank-k approximation of A, then the approximation quality can be respectively measured by the spectral norm or the Frobenius norm, min rank(B)=k

A − B spec = A − Ak spec = σk+1 ,

or

 min rank(B)=k

A − B F = A − Ak F =

q 

(5.2.25)

1/2 σi2

(5.2.26)

i=k+1

with q = min{m, n}. Proof

See, e.g., [140], [328], [226].

In signal processing and system theory, the most common system of linear equations Ax = b is over-determined and rank-deficient; namely, the row number m of the matrix A ∈ Cm×n is larger than its column number n and r = rank(A) < n. Let the SVD of A be given by A = UΣVH , where Σ = Diag(σ1 , . . . , σr , 0, . . . , 0). Consider (5.2.27) G = VΣ† UH , where Σ† = Diag(1/σ1 , . . . , 1/σr , 0, . . . , 0). By the property of singular values it is known that G is the Moore–Penrose inverse matrix of A. Hence ˆ = Gb = VΣ† UH b x

(5.2.28)

which can be represented as xLS =

r 

(uH i b/σi )vi .

(5.2.29)

i=1

The corresponding minimum residual is given by 2 2 ρLS = AxLS − b 2 = 2[ur+1 , . . . , um ]H b22 .

(5.2.30)

The approach to solving the least squares problem via SVD is simply known as the SVD method. Although, in theory, when i > r the singular values σi = 0, the computed singular values σˆi , i > r, are not usually equal to zero and sometimes even have quite a large perturbation. In these cases, an estimate of the matrix rank r is required. In signal processing and system theory, the rank estimate rˆ is usually called the effective rank. Effective-rank determination is carried out by one of the following two common methods. 1. Normalized Singular Value Method Compute the normalized singular values σ ¯i =

σ ˆi , σ ˆ1

(5.2.31)

282

Singular Value Analysis

and select the largest integer i satisfying the criterion σ ¯i ≥ as an estimate of the effective rank rˆ. Obviously, this criterion is equivalent to choosing the maximum integer i satisfying σ ˆi ≥ σ ˆ1

(5.2.32)

as rˆ; here is a very small positive number, e.g., = 0.1 or = 0.05. 2. Norm Ratio Method Let an m × n matrix Ak be the rank-k approximation to the original m × n matrix A. Define the Frobenius norm ratio as 

Ak F σ12 + · · · + σk2 ν(k) = = 2 , h = min{m, n}, (5.2.33)

A F σ1 + · · · + σh2 and choose the minimum integer k satisfying ν(k) ≥ α

(5.2.34)

as the effective rank estimate rˆ, where α is at the threshold of 1, e.g., α = 0.997 or 0.998. After the effective rank rˆ has been determined via the above two criteria, ˆ LS = x

rˆ 

(ˆ uH σi )ˆ vi i b/ˆ

(5.2.35)

i=1

can be regarded as a reasonable approximation to the LS solution xLS . Clearly, this solution is the LS solution of the linear equation Arˆx = b, where Arˆ =

rˆ 

σi ui viH .

(5.2.36)

i=1

In LS problems, using Arˆ instead of A corresponds to filtering out the smaller singular values. This filtering is effective for a noisy matrix A. It is easily seen ˆ LS given by Equation (5.2.35) still contains n parameters. that the LS solution x However, by the rank-deficiency of the matrix A we know that the unknown parameter vector x contains only r independent parameters; the other parameters are linearly dependent on r independent parameters. In many engineering applications, we naturally want to find r independent parameters other than the n parameters containing redundancy components. In other words, our objective is to estimate only the principal parameters and to eliminate minor components. This problem can be solved via the low-rank total least squares (TLS) method, which will be presented in Chapter 6.

5.3 Product Singular Value Decomposition (PSVD)

283

5.3 Product Singular Value Decomposition (PSVD) In the previous section we discussed the SVD of a single matrix. This section presents the SVD of the product of two matrices.

5.3.1 PSVD Problem By the product singular value decomposition (PSVD) is meant the SVD of a matrix product BT C. Consider the matrix product A = BT C,

B ∈ Cp×m ,

C ∈ Cp×n ,

rank(B) = rank(C) = p.

(5.3.1)

In principle, the PSVD is equivalent to the direct SVD of the matrix product. However, if we first compute the product of two matrices, and then calculate the PSVD then the smaller singular values are often considerably disturbed. To illustrate this, look at the following example [135]. EXAMPLE 5.3

Let



BT =

1 ξ −1 ξ

1 B C= √ 2 T

 , 

1 C= √ 2 1−ξ −1 − ξ



1 1 −1 1  1+ξ . −1 + ξ

 , (5.3.2)

Clearly, C is an orthogonal matrix and the two columns [1, −1]T and [ξ, ξ]T of BT are orthogonal to each √ other. The real singular values of the matrix product BT C √ are σ1 = 2 and σ2 = 2|ξ|. However, if |ξ| is smaller than the cutoff error , then

1 1 1 the floating point calculation result of Equation (5.3.2) is BT C = √ −1 −1 2 √ whose singular values are σ ˆ1 = 2 and σ ˆ2 = 0. In this case, σ ˆ1 = σ1 and σ ˆ 2 ≈ σ2 . If |ξ| > 1/ , then the floating point calculation result of the matrix product is

√ 1 −ξ ξ T B C = √ −ξ ξ , whose computed singular values are σ ˆ1 = 2 |ξ| and σ ˆ2 = 0. 2

ˆ2 is obviously different from σ2 . In this case σ ˆ1 is clearly larger than σ1 , and σ Laub et al. [283] pointed out that when a linear system is close to uncontrollable and unobservable an accurate calculation for the small singular values is very important because, if a nonzero small singular value is calculated as zero, it may lead to a wrong conclusion: a minimum-phase system could be determined as a nonminimum phase system. The above examples show that to use the direct SVD of the matrix product BT C is not numerically desirable. Hence, it is necessary to consider how to compute the SVD of A = BT C such that the PSVD has an accuracy close to that of B and C. This is the so-called product singular value decomposition problem. Such a PSVD was first proposed by Fernando and Hammarling in 1988 [151], and is described in the following theorem.

284

Singular Value Analysis

THEOREM 5.5 (PSVD) [151] Let BT ∈ Cm×p , C ∈ Cp×n . Then there are two unitary matrices U ∈ Cm×m , V ∈ Cn×n and a nonsingular matrix Q ∈ Cp×p such that ⎤ ⎡ I ⎦, UBH Q = ⎣ OB (5.3.3) ΣB ⎤ ⎡ OC ⎦, Q−1 CVH = ⎣ (5.3.4) I ΣC where ΣB = Diag(s1 , . . . , sr ), 1 > s1 ≥ · · · ≥ sr > 0, ΣC = Diag(t1 , . . . , tr ), 1 > t1 ≥ · · · ≥ tr > 0, and s2i + t2i = 1, i = 1, . . . , r. By Theorem 5.5, we can deduce that UBH CVH = Diag(OC , OB , ΣB ΣC ). Hence the singular values of the matrix product BH C consist of both zero and nonzero singular values given by si ti , i = 1, . . . , r.

5.3.2 Accurate Calculation of PSVD The basic idea of the accurate calculation algorithm for PSVD in [135] is as follows: after any matrix A is multiplied by an orthogonal matrix, the singular values of A remain unchanged. Hence, for matrices B and C, if we let B = TBU,

C = (TT )−1 CV,

(5.3.5)

where T is nonsingular and U and V are orthogonal matrices, then B T C = UT BT TT (TT )−1 CV = UT (BT C)V and BT C have the exactly same singular values (including zero singular values). Given B ∈ Rp×m and C ∈ Rp×n , p ≤ min{m, n}, denote the row vectors of B as τ bi , i = 1, 2, . . . , p. Drmac’s PSVD algorithm is shown in Algorithm 5.1.   Σ⊕O , so the SVD of BT C After performing Algorithm 5.1, we get V, QF and O is given by U(BT C)VT = Σ, where U=



 T V I

 Σ⊕O . Σ= O 

QTF ,

(5.3.6)

The above algorithm for computing the PSVD of two matrices has been extended to the accurate calculation of the PSVD of three matrices [137]; see Algorithm 5.2.

5.4 Applications of Singular Value Decomposition Algorithm 5.1

285

PSVD(B, C) [135]

input: B ∈ Rp×m , C ∈ Rp×n , p ≤ min{m, n}. Denote Bτ = [bτ1 , . . . , bτp ]. 1. Compute Bτ = Diag(bτ1 2 , . . . , bτp 2 ). Let B1 = B†τ B, C1 = Bτ C. 2. Calculate the QR decomposition of CT1 ⎡ ⎤ R CT1 Π = Q ⎣ ⎦ , O where R ∈ Rr×p , rank(R) = r and Q is a (n − r) × p orthogonal matrix. 3. Compute F = BT1 ΠRT . 4. Compute the QR decomposition of F ⎤ ⎡ RF ⎦. FΠF = QF ⎣ O 5. Compute the SVD Σ = VT RF W. output: The SVD of BT C is ⎡ ⎤ ⎡ Σ⊕O VT ⎣ ⎦=⎣ O

⎤ I

   ⎦ QTF BT C Q(W ⊕ In−p ) ,

where A ⊕ D represents the direct sum of A and D.

Algorithm 5.2

PSVD of BT SC [137]

input: B ∈ Rp×m , S ∈ Rp×q , C ∈ Rq×n , p ≤ m, q ≤ n. 1. Compute Br = Diag(bτ1 2 , . . . , bτp 2 ) and Cr = Diag(cτ1 2 , . . . , cτq 2 ). where bτi and cτj are the ith row vectors of B and C, respectively. 2. Set B1 = B†τ B, C1 = C†τ C, S1 = Bτ SCτ . 3. Compute LU decomposition Π1 S1 Π2 = LU, where L ∈ Rp×p , U ∈ Rp×q . 4. Compute M = LT Π1 B1 and N = UΠT2 C1 . 5. Compute the SVD of MT N via Algorithm 5.1 to get Q, QF , V and W. output:

The SVD of BT SC is ⎤ ⎡ ⎤ ⎡    Σ⊕O VT ⎦ QTF BT SC Q(W ⊕ In−p ) . ⎣ ⎦=⎣ I O

5.4 Applications of Singular Value Decomposition The SVD has been widely applied for solving many engineering problems. This section presents two typical application examples of SVD, in static system modeling and in image compression.

286

Singular Value Analysis

5.4.1 Static Systems Taking an electronic device as an example, we consider the SVD of a static system of voltages v1 and v2 and currents i1 and i2 : ⎡ ⎤ v  1    ⎥ 1 −1 0 0 ⎢ v ⎢ 2⎥ = 0 . (5.4.1) 0 0 1 1 ⎣ i1 ⎦ 0 6 78 9 i F

2

In this model of a static system, the entries of the matrix F limits the allowable values of v1 , v2 , i1 , i2 . If the measurement devices for voltages and currents have the same accuracy (for example 1%) then we can easily detect whether any set of measurements is a solution of Equation (5.4.1) within the desired range of accuracy. Let us assume that another method gives a static system model different from Equation (5.4.1): ⎡ ⎤ v1     6 6 ⎥ ⎢ 1 −1 10 10 ⎢v 2 ⎥ = 0 . (5.4.2) ⎣i ⎦ 0 0 0 1 1 1 i2 Obviously, only when the current measurements are very accurate, will a set of v1 , v2 , i1 , i2 satisfy Equation (5.4.2) with suitable accuracy; for the general case where current measurements have perhaps 1% measurement errors, Equation (5.4.2) is quite different from the static system model (5.4.1). In this case, the voltage relationships given by Equations (5.4.1) and (5.4.2) are v1 −v2 = 0 and v1 −v2 +104 = 0, respectively. However, from the algebraic viewpoint, Equations (5.4.1) and (5.4.2) are completely equivalent. Therefore, we hope to have some means to compare several algebraically equivalent model representations in order to determine which we want, i.e., one that can be used for the common static system model available in general rather than special circumstances. The basic mathematical tool for solving this problem is the SVD. More generally, consider a matrix equation for a static system including n resistors [100]:   v = 0. (5.4.3) F i Here F is an m × n matrix. In order to simplify the representation, some constant compensation terms are removed. Such a representation is very versatile, and can come from physical devices (e.g., linearized physical equations) and network equations. The action of the matrix F on the accurate and nonaccurate data parts can be analyzed using SVD. Let the SVD of F be F = UT ΣV.

(5.4.4)

5.4 Applications of Singular Value Decomposition

287

Using a truncated SVD, the nonprecision parts of F can be removed, and thus the SVD of the matrix F will provide a design equation that is algebraically equivalent to but numerically more reliable than the original static system model. Since U is an orthogonal matrix, from Equations (5.4.3) and (5.4.4) we have   v = 0. (5.4.5) ΣV i If the diagonal matrix Σ is blocked,  Σ=

Σ1 O

 O , O

then making a corresponding blocking of the orthogonal matrix V,   A B , V= C D where [A, B] includes the top r rows of V, Equation (5.4.5) can be rewritten as     Σ1 O A B v = 0. O O C D i Hence we obtain the new system model   v [A, B] = 0, i

(5.4.6)

which is algebraically equivalent to, but numerically much more reliable than, the original system model.

5.4.2 Image Compression
The rapid development of science and technology and the popularity of network applications have produced a very large amount of digital information that needs to be stored, processed and transmitted. Image information, as an important multimedia resource, involves a large amount of data: for example, a 1024 × 768 24-bit BMP image occupies about 2.25 MB. Such large amounts of image data put considerable pressure on storage capacity, on the channel bandwidth of communication trunks and on the processing speed of computers. Clearly, it is not realistic to solve this problem simply by increasing memory capacity, channel bandwidth and processing speed; hence, image compression is very necessary.

Image compression is feasible because image data are highly correlated. There is a large correlation between adjacent pixels in most images, so an image has a lot of spatial redundancy. There is also a large correlation between successive frames of an image sequence (i.e., there is time redundancy). Furthermore, if the same code length is used to represent symbols with different probabilities of occurrence, then many bits are wasted; that is, the image representation has symbol redundancy. Spatial redundancy, time redundancy and symbol redundancy are the three main factors exploited in image compression. It is also important that a certain amount of distortion is allowed in image compression.

Because an image has a matrix structure, the SVD can be applied to image compression. An important function of the SVD is to greatly reduce the dimension of a matrix or image. In image and video processing the frames of an image are arranged into the columns of an image matrix $A\in\mathbb R^{m\times n}$. If the rank of $A$ is $r=\operatorname{rank}(A)$ and its truncated SVD is given by
$$
A = \sum_{i=1}^{r} \sigma_i u_i v_i^T,
$$
then the original image matrix $A$ can be represented by an $m\times r$ orthogonal matrix $U_r=[u_1,\ldots,u_r]$, an $n\times r$ orthogonal matrix $V_r=[v_1,\ldots,v_r]$ and a vector composed of the singular values of $A$, $\sigma=[\sigma_1,\ldots,\sigma_r]^T$. In other words, the matrices $U_r$ and $V_r$ contain $mr$ and $nr$ data entries, respectively. So the $m\times n$ original image matrix $A$ (with $mn$ data entries in total) is compressed into $U_r$, $V_r$ and the vector $\sigma$ (with $(m+n+1)r$ data entries in total), and the image compression ratio is given by
$$
\rho = \frac{mn}{r(m+n+1)}. \tag{5.4.7}
$$
Hence, the basic idea of image compression based on the SVD is to find the SVD of the image matrix and to use $k$ singular values and the corresponding left and right singular vectors to reconstruct the original image. If $k \geq r$ then the corresponding image compression is known as lossless compression; conversely, if $k<r$ then the image compression is said to be lossy compression. In the general case, the number of selected singular values should meet the condition
$$
k(m+n+1) \ll mn. \tag{5.4.8}
$$
Thus, during the transfer of an image, we need only transfer the $k(m+n+1)$ data entries of the selected singular values and the corresponding left and right singular vectors rather than $mn$ data entries. At the receiving end, after receiving these $k(m+n+1)$ data entries, we can reconstruct the original image matrix through
$$
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T. \tag{5.4.9}
$$
The error between the reconstructed image $A_k$ and the original image $A$ is measured by
$$
\|A-A_k\|_F^2 = \sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2. \tag{5.4.10}
$$
The contribution of a singular value $\sigma_i$ to the image can be defined as
$$
\epsilon_i = \frac{\sigma_i^2}{\sigma_1^2 + \cdots + \sigma_r^2}. \tag{5.4.11}
$$
For a given image, the larger a singular value is, the larger is its contribution to the image information; conversely, the smaller a singular value is, the smaller is its contribution. For example, if $\sum_{i=1}^{k}\epsilon_i$ is close to 1 then the main information about the image is contained in $A_k$. On the basis of satisfying the visual requirements, if a suitable number $k\,(<r)$ of the larger singular values is selected then the original image $A$ can be approximately recovered from $A_k$. The smaller $k$ is, the smaller is the amount of data needed to approximate $A$, and thus the larger is the compression ratio; as $k\to r$, the compressed image matrix $A_k$ becomes more and more similar to $A$. In some applications, if the compression ratio is specified in advance, then the number $k$ of singular values to be used can be found, and the corresponding contribution of the compression can be computed from $\sum_{i=1}^{k}\epsilon_i$.
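As an illustration of Equations (5.4.7)–(5.4.11), the following Python/NumPy sketch computes a rank-$k$ approximation of an image matrix together with its compression ratio, reconstruction error and cumulative contribution. The function name and the toy data are assumptions made here for demonstration, not part of the text.

```python
import numpy as np

def svd_compress(A, k):
    """Rank-k SVD approximation of an image matrix A, returning the
    reconstruction A_k, the compression ratio, the reconstruction error
    ||A - A_k||_F and the cumulative contribution of the kept singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]            # A_k = sum_{i<=k} sigma_i u_i v_i^T
    m, n = A.shape
    ratio = (m * n) / (k * (m + n + 1))             # cf. (5.4.7) with k kept values
    err = np.sqrt(np.sum(s[k:] ** 2))               # cf. (5.4.10)
    contrib = np.sum(s[:k] ** 2) / np.sum(s ** 2)   # sum of epsilon_i, cf. (5.4.11)
    return A_k, ratio, err, contrib

# Toy example with a random low-rank "image" (illustrative data only):
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 10)) @ rng.standard_normal((10, 48))
A_k, ratio, err, contrib = svd_compress(A, k=5)
print(ratio, err, contrib)
```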

5.5 Generalized Singular Value Decomposition (GSVD) Previous sections presented the SVD of a single matrix and the SVD of the product of two matrices BT C. In this section we discuss the SVD of the matrix pencil or pair (A, B). This decomposition is called generalized SVD.

5.5.1 Definition and Properties
The generalized singular value decomposition (GSVD) method was proposed by Van Loan in 1976 [483].

THEOREM 5.6 [483] If $A\in\mathbb C^{m\times n}$, $m\ge n$, and $B\in\mathbb C^{p\times n}$, there exist two unitary matrices $U\in\mathbb C^{m\times m}$ and $V\in\mathbb C^{p\times p}$ together with a nonsingular matrix $Q\in\mathbb C^{n\times n}$ such that
$$
UAQ = [\,\Sigma_A,\ O\,],\qquad
\Sigma_A = \begin{bmatrix} I_r & & \\ & S_A & \\ & & O_A \end{bmatrix}\in\mathbb R^{m\times k}, \tag{5.5.1}
$$
$$
VBQ = [\,\Sigma_B,\ O\,],\qquad
\Sigma_B = \begin{bmatrix} O_B & & \\ & S_B & \\ & & I_{k-r-s} \end{bmatrix}\in\mathbb R^{p\times k}, \tag{5.5.2}
$$
where the zero blocks in $[\Sigma_A, O]$ and $[\Sigma_B, O]$ have $n-k$ columns,
$$
S_A = \mathrm{Diag}(\alpha_{r+1},\ldots,\alpha_{r+s}),\qquad S_B = \mathrm{Diag}(\beta_{r+1},\ldots,\beta_{r+s}),
$$
$$
1 > \alpha_{r+1} \ge \cdots \ge \alpha_{r+s} > 0,\qquad 0 < \beta_{r+1} \le \cdots \le \beta_{r+s} < 1,\qquad \alpha_i^2+\beta_i^2 = 1,\quad i=r+1,\ldots,r+s,
$$
with integers
$$
k = \operatorname{rank}\begin{bmatrix} A\\ B\end{bmatrix},\qquad
r = \operatorname{rank}\begin{bmatrix} A\\ B\end{bmatrix} - \operatorname{rank}(B), \tag{5.5.3}
$$
$$
s = \operatorname{rank}(A) + \operatorname{rank}(B) - \operatorname{rank}\begin{bmatrix} A\\ B\end{bmatrix}. \tag{5.5.4}
$$

There is a variety of methods for proving this theorem; they can be found in Van Loan [483], Paige and Saunders [370], Golub and Van Loan [179] and Zha [540]. Using the method of [483], the diagonal entries of the matrices $\Sigma_A$ in Equation (5.5.1) and $\Sigma_B$ in Equation (5.5.2) constitute singular value pairs $(\alpha_i,\beta_i)$. Since $O_A$ is an $(m-r-s)\times(k-r-s)$ zero matrix and $O_B$ is a $(p-k+r)\times r$ zero matrix, the first $k$ singular value pairs are respectively given by
$$
\alpha_i = 1,\ \beta_i = 0,\qquad i=1,\ldots,r,
$$
$$
(\alpha_i,\beta_i)\ \text{equal to the entries of } S_A \text{ and } S_B,\qquad i=r+1,\ldots,r+s,
$$
$$
\alpha_i = 0,\ \beta_i = 1,\qquad i=r+s+1,\ldots,k.
$$
Hence the first $k$ generalized singular values are given by
$$
\gamma_i = \begin{cases} \alpha_i/\beta_i = 1/0 = \infty, & i=1,\ldots,r,\\[2pt] \alpha_i/\beta_i, & i=r+1,\ldots,r+s,\\[2pt] \alpha_i/\beta_i = 0/1 = 0, & i=r+s+1,\ldots,k.\end{cases}
$$
These $k$ singular value pairs $(\alpha_i,\beta_i)$ are called the nontrivial generalized singular value pairs of the matrix pencil $(A,B)$; the ratios $\alpha_i/\beta_i$, $i=1,2,\ldots,k$, are known as the nontrivial generalized singular values of $(A,B)$ and include infinite, finite and zero values. On the contrary, the other $n-k$ pairs of generalized singular values, corresponding to the zero blocks in Equations (5.5.1) and (5.5.2), are called the trivial generalized singular value pairs of the matrix pencil $(A,B)$.

The number of columns of the matrix $A$ is restricted to be not greater than its number of rows in Theorem 5.6. Paige and Saunders [370] generalized Theorem 5.6 to give a GSVD of any matrix pencil $(A,B)$ in which $A$ and $B$ have the same number of columns.


THEOREM 5.7 (GSVD2) [370] Let $A\in\mathbb C^{m\times n}$ and $B\in\mathbb C^{p\times n}$. Then, for the block matrix
$$
K = \begin{bmatrix} A\\ B\end{bmatrix},\qquad t = \operatorname{rank}(K),
$$
there are unitary matrices $U\in\mathbb C^{m\times m}$, $V\in\mathbb C^{p\times p}$, $W\in\mathbb C^{t\times t}$, $Q\in\mathbb C^{n\times n}$ such that
$$
U^H A Q = \Sigma_A\,[\underbrace{W^H R}_{t},\ \underbrace{O}_{\,n-t}\,],\qquad
V^H B Q = \Sigma_B\,[\underbrace{W^H R}_{t},\ \underbrace{O}_{\,n-t}\,],
$$
where
$$
\Sigma_A = \begin{bmatrix} I_A & & \\ & D_A & \\ & & O_A\end{bmatrix}\in\mathbb R^{m\times t},\qquad
\Sigma_B = \begin{bmatrix} O_B & & \\ & D_B & \\ & & I_B\end{bmatrix}\in\mathbb R^{p\times t}, \tag{5.5.5}
$$
and $R\in\mathbb C^{t\times t}$ is nonsingular; its singular values are equal to the nonzero singular values of the matrix $K$. The matrix $I_A$ is an $r\times r$ identity matrix and $I_B$ is a $(t-r-s)\times(t-r-s)$ identity matrix, where the values of $r$ and $s$ depend on the given data; $O_A$ and $O_B$ are respectively $(m-r-s)\times(t-r-s)$ and $(p-t+r)\times r$ zero matrices that may have no rows or no columns (i.e., $m-r-s=0$, $t-r-s=0$ or $p-t+r=0$ may occur), whereas
$$
D_A = \mathrm{Diag}(\alpha_{r+1},\alpha_{r+2},\ldots,\alpha_{r+s}),\qquad D_B = \mathrm{Diag}(\beta_{r+1},\beta_{r+2},\ldots,\beta_{r+s})
$$
satisfy
$$
1 > \alpha_{r+1} \ge \alpha_{r+2} \ge \cdots \ge \alpha_{r+s} > 0,\qquad 0 < \beta_{r+1} \le \beta_{r+2} \le \cdots \le \beta_{r+s} < 1
$$
and
$$
\alpha_i^2 + \beta_i^2 = 1,\qquad i = r+1, r+2, \ldots, r+s.
$$
Proof See [218].

The following are a few remarks on the GSVD.

Remark 1 From the above definition it can be deduced that
$$
AB^{-1} = U\Sigma_A [W^H R,\ O]\,Q^H \cdot Q\,[W^H R,\ O]^{-1}\Sigma_B^{-1}V^H = U\Sigma_A\Sigma_B^{-1}V^H.
$$
This shows the equivalence between the GSVD of the matrix pencil $(A,B)$ and the SVD of the matrix product $AB^{-1}$, because the generalized left and right singular-vector matrices $U$ and $V$ of $(A,B)$ are equal to the left and right singular-vector matrices $U$ and $V$ of $AB^{-1}$, respectively.


Remark 2 From the equivalence between the GSVD of (A, B) and the SVD of AB−1 it is obvious that if B = I is an identity matrix, then the GSVD simplifies to the general SVD. This result can be directly obtained from the definition of the GSVD, because all singular values of the identity matrix are equal to 1, and thus the generalized singular values of the matrix pencil (A, I) are equivalent to the singular values of A. Remark 3 Because AB−1 has the form of a quotient, and the generalized singular values are the quotients of the singular values of the matrices A and B, the GSVD is sometimes called the quotient singular value decomposition (QSVD).

5.5.2 Algorithms for GSVD
If either $A$ or $B$ is an ill-conditioned matrix, then the calculation of $AB^{-1}$ will usually lead to large numerical errors, so the SVD of $AB^{-1}$ is not recommended for computing the GSVD of the matrix pencil $(A,B)$. A natural question to ask is whether we can get the GSVD of $(A,B)$ directly, without computing $AB^{-1}$. This is entirely possible, because we have the following theorem.

THEOREM 5.8 [179] If $A\in\mathbb C^{m_1\times n}$ ($m_1\ge n$) and $B\in\mathbb C^{m_2\times n}$ ($m_2\ge n$) then there exists a nonsingular matrix $X\in\mathbb C^{n\times n}$ such that
$$
X^H(A^H A)X = D_A = \mathrm{Diag}(\alpha_1,\alpha_2,\ldots,\alpha_n),\qquad \alpha_k\ge 0,
$$
$$
X^H(B^H B)X = D_B = \mathrm{Diag}(\beta_1,\beta_2,\ldots,\beta_n),\qquad \beta_k\ge 0.
$$
The quantities $\sigma_k=\sqrt{\alpha_k/\beta_k}$ are called the generalized singular values of the matrix pencil $(A,B)$, and the columns $x_k$ of $X$ are known as the generalized singular vectors associated with the generalized singular values $\sigma_k$.

The above theorem suggests various algorithms for computing the GSVD of the matrix pencil $(A,B)$. In particular, we are interested in seeking a generalized singular-vector matrix $X$ with $D_B=I$, because in this case the generalized singular values $\sigma_k$ are given directly by $\sqrt{\alpha_k}$. Algorithms 5.3 and 5.4 give two GSVD algorithms. The main difference between them is that the former requires computation of the matrix products $A^H A$ and $B^H B$, while the latter avoids this computation. Because forming these matrix products may increase the condition number, Algorithm 5.4 has better numerical performance than Algorithm 5.3.

In 1998, Drmac [136] developed a tangent algorithm for computing the GSVD, as shown in Algorithm 5.5. This algorithm is divided into two phases: in the first phase the matrix pencil $(A,B)$ is reduced to a matrix $F$; in the second phase the SVD of $F$ is computed. The theoretical basis of the tangent algorithm is as follows: the GSVD is invariant

Algorithm 5.3 GSVD algorithm 1 [484]
input: $A\in\mathbb C^{m_1\times n}$, $B\in\mathbb C^{m_2\times n}$.
1. Compute $S_1=A^H A$ and $S_2=B^H B$.
2. Calculate the eigenvalue decomposition $U_2^H S_2 U_2 = D = \mathrm{Diag}(\gamma_1,\ldots,\gamma_n)$.
3. Compute $Y = U_2 D^{-1/2}$ and $C = Y^H S_1 Y$.
4. Compute the EVD $Q^H C Q = \mathrm{Diag}(\alpha_1,\ldots,\alpha_n)$, where $Q^H Q = I$.
output: Generalized singular-vector matrix $X = YQ$ and generalized singular values $\sqrt{\alpha_k}$, $k=1,\ldots,n$.

Algorithm 5.4 GSVD algorithm 2 [484]
input: $A\in\mathbb C^{m_1\times n}$, $B\in\mathbb C^{m_2\times n}$.
1. Compute the SVD $U_2^H B V_2 = D = \mathrm{Diag}(\gamma_1,\ldots,\gamma_n)$.
2. Calculate $Y = V_2 D^{-1}$, where $D^{-1} = \mathrm{Diag}(1/\gamma_1,\ldots,1/\gamma_n)$.
3. Compute $C = AY$.
4. Compute the SVD $U_1^H C V_1 = D_A = \mathrm{Diag}(\alpha_1,\ldots,\alpha_n)$.
output: Generalized singular-vector matrix $X = YV_1$ and generalized singular values $\alpha_k$, $k=1,\ldots,n$.

Algorithm 5.5 Tangent algorithm for GSVD [136]
input: $A = [a_1,\ldots,a_n]\in\mathbb R^{m\times n}$, $B\in\mathbb R^{p\times n}$, $m\ge n$, $\operatorname{rank}(B)=n$.
1. Compute $\Delta_A = \mathrm{Diag}(\|a_1\|_2,\ldots,\|a_n\|_2)$, $A_c = A\Delta_A^{-1}$, $B_1 = B\Delta_A^{-1}$.
2. Compute the Householder QR decomposition $\begin{bmatrix} R\\ O\end{bmatrix} = Q^T B_1\Pi$.
3. Solve the matrix equation $FR = A_c\Pi$ to get $F = A_c\Pi R^{-1}$.
4. Compute the SVD $\begin{bmatrix}\Sigma\\ O\end{bmatrix} = V^T F U$.
5. Compute $X = \Delta_A^{-1}\Pi R^{-1} U$ and $W = Q\begin{bmatrix} U & O\\ O & I_{p-n}\end{bmatrix}$.
output: The GSVD of $(A,B)$ reads
$$
V^T A X = \begin{bmatrix}\Sigma\\ O\end{bmatrix},\qquad W^T B X = \begin{bmatrix} I\\ O\end{bmatrix}.
$$
That is, the generalized singular values of $(A,B)$ are given by the diagonal entries of $\Sigma$, and the generalized left and right singular-vector matrices of $(A,B)$ are given by $V$ and $W$, respectively.

under an equivalence transformation, namely
$$
(A, B)\ \to\ (A', B') = (U^T A S,\ V^T B S), \tag{5.5.6}
$$


where $U$, $V$ are any orthogonal matrices and $S$ is any nonsingular matrix. Hence, by the definition, the two matrix pencils $(A,B)$ and $(A',B')$ have the same GSVD.
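The steps of Algorithm 5.4 translate directly into NumPy. The sketch below is an illustrative implementation of that algorithm (the function name and test data are assumptions made here, and $B$ is assumed to have full column rank); the two printed lines should agree up to floating-point error, since $\|Ax_k\|_2/\|Bx_k\|_2$ equals the generalized singular value $\alpha_k$.

```python
import numpy as np

def gsvd_via_svd(A, B):
    """Sketch of 'GSVD algorithm 2': generalized singular values/vectors of
    the pencil (A, B) without forming A @ inv(B).  Assumes rank(B) = n."""
    # Step 1: SVD of B, B = U2 @ diag(gamma) @ V2^H.
    U2, gamma, V2h = np.linalg.svd(B, full_matrices=False)
    # Step 2: Y = V2 @ diag(1/gamma), so that B @ Y has orthonormal columns.
    Y = V2h.conj().T / gamma
    # Step 3: C = A @ Y.
    C = A @ Y
    # Step 4: SVD of C, C = U1 @ diag(alpha) @ V1^H.
    U1, alpha, V1h = np.linalg.svd(C, full_matrices=False)
    # Generalized singular-vector matrix X and generalized singular values alpha.
    X = Y @ V1h.conj().T
    return alpha, X

# Quick check on random data (illustrative only).
rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 4)), rng.standard_normal((6, 4))
alpha, X = gsvd_via_svd(A, B)
print(alpha)
print(np.linalg.norm(A @ X, axis=0) / np.linalg.norm(B @ X, axis=0))
```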

5.5.3 Two Application Examples of GSVD
The following are two application examples of the GSVD.

1. Multi-Microphone Speech Enhancement
A noisy speech signal sampled by multiple microphones at a discrete time $k$ can be described by the observation model $y(k) = x(k) + v(k)$, where $x(k)$ and $v(k)$ are the speech signal vector and the additive noise vector, respectively. Letting $R_{yy}=E\{y(k)y^T(k)\}$ and $R_{vv}=E\{v(k)v^T(k)\}$ represent the autocorrelation matrices of the observation data vector $y$ and the additive noise vector $v$, respectively, one can make the joint diagonalization
$$
R_{yy} = Q\,\mathrm{Diag}(\sigma_1^2,\ldots,\sigma_M^2)\,Q^T,\qquad
R_{vv} = Q\,\mathrm{Diag}(\eta_1^2,\ldots,\eta_M^2)\,Q^T. \tag{5.5.7}
$$
Doclo and Moonen [127] showed in 2002 that, in order to achieve multi-microphone speech enhancement, the $M\times M$ optimal filter matrix $W(k)$ with minimum mean-square error (MMSE) is given by
$$
W(k) = R_{yy}^{-1}(k)R_{xx}(k) = R_{yy}^{-1}(k)\bigl(R_{yy}(k)-R_{vv}(k)\bigr)
= Q^{-T}\,\mathrm{Diag}\!\left(1-\frac{\eta_1^2}{\sigma_1^2},\ 1-\frac{\eta_2^2}{\sigma_2^2},\ \ldots,\ 1-\frac{\eta_M^2}{\sigma_M^2}\right)Q^T. \tag{5.5.8}
$$
Let $Y(k)$ be the $p\times M$ speech-data matrix containing $p$ speech-data vectors recorded during speech-and-noise periods, and let $V(k')$ be the $q\times M$ additive noise matrix containing $q$ noise data vectors recorded during noise-only periods:
$$
Y(k) = \begin{bmatrix} y^T(k-p+1)\\ \vdots\\ y^T(k-1)\\ y^T(k)\end{bmatrix},\qquad
V(k') = \begin{bmatrix} v^T(k'-q+1)\\ \vdots\\ v^T(k'-1)\\ v^T(k')\end{bmatrix}, \tag{5.5.9}
$$
where $p$ and $q$ are typically larger than $M$. The GSVD of the matrix pencil $(Y(k), V(k'))$ is defined as
$$
Y(k) = U_Y\Sigma_Y Q^T,\qquad V(k') = U_V\Sigma_V Q^T, \tag{5.5.10}
$$
where $\Sigma_Y=\mathrm{Diag}(\sigma_1,\ldots,\sigma_M)$, $\Sigma_V=\mathrm{Diag}(\eta_1,\ldots,\eta_M)$, $U_Y$ and $U_V$ are orthogonal matrices, and $Q$ is an invertible but not necessarily orthogonal matrix containing the


generalized singular vectors and $\sigma_i/\eta_i$ are the generalized singular values. Substituting these results into Equation (5.5.8), the estimate of the optimal filter matrix is given by [127]
$$
\hat W = Q^{-T}\,\mathrm{Diag}\!\left(1-\frac{p}{q}\frac{\eta_1^2}{\sigma_1^2},\ \ldots,\ 1-\frac{p}{q}\frac{\eta_M^2}{\sigma_M^2}\right)Q^T. \tag{5.5.11}
$$

2. Dimension Reduction of Clustered Text Data
In signal processing, pattern recognition, image processing, information retrieval and so on, dimension reduction is imperative for efficiently manipulating massive quantities of data. Suppose that a given $m\times n$ term-document matrix $A\in\mathbb R^{m\times n}$ is already partitioned into $k$ clusters:
$$
A = [A_1,\ldots,A_k],\qquad A_i\in\mathbb R^{m\times n_i},\qquad \sum_{i=1}^{k} n_i = n.
$$
Our goal is to find a lower-dimensional representation of the term-document matrix $A$ such that the cluster structure existing in the $m$-dimensional space is preserved in the lower-dimensional representation. The centroid $c_i$ of each cluster matrix $A_i$ is defined as the average of the columns of $A_i$, i.e.,
$$
c_i = \frac{1}{n_i}A_i\mathbf 1_i,\qquad \text{where } \mathbf 1_i = [1,\ldots,1]^T\in\mathbb R^{n_i\times 1},
$$
and the global centroid is given by
$$
c = \frac{1}{n}A\mathbf 1,\qquad \text{where } \mathbf 1 = [1,\ldots,1]^T\in\mathbb R^{n\times 1}.
$$
Denote the $i$th cluster as $A_i=[a_{i1},\ldots,a_{in_i}]$, and let $N_i=\{i1,\ldots,in_i\}$ denote the set of column indices that belong to cluster $i$. Then the within-class scatter matrix $S_w$ is defined as [218]
$$
S_w = \sum_{i=1}^{k}\sum_{j\in N_i}(a_j-c_i)(a_j-c_i)^T
$$
and the between-class scatter matrix $S_b$ is defined as
$$
S_b = \sum_{i=1}^{k} n_i(c_i-c)(c_i-c)^T.
$$
Defining the matrices
$$
H_w = [A_1-c_1\mathbf 1_1^T,\ \ldots,\ A_k-c_k\mathbf 1_k^T]\in\mathbb R^{m\times n},\qquad
H_b = [\sqrt{n_1}(c_1-c),\ \ldots,\ \sqrt{n_k}(c_k-c)]\in\mathbb R^{m\times k}, \tag{5.5.12}
$$
the corresponding scatter matrices can be represented as
$$
S_w = H_wH_w^T\in\mathbb R^{m\times m},\qquad S_b = H_bH_b^T\in\mathbb R^{m\times m}.
$$
It is assumed that the lower-dimensional representation of the term-document matrix $A\in\mathbb R^{m\times n}$ consists of $l$ vectors $u_i\in\mathbb R^{m\times 1}$ with $\|u_i\|_2=1$, where $i=1,\ldots,l$ and $l<\min\{m,n,k\}$. Consider the generalized Rayleigh quotient
$$
\lambda_i = \frac{u_i^T S_b u_i}{u_i^T S_w u_i} = \frac{\|H_b^T u_i\|_2^2}{\|H_w^T u_i\|_2^2}; \tag{5.5.13}
$$
its numerator and denominator denote the between-class scatter and the within-class scatter of the vector $u_i$, respectively. As a measure of cluster quality, the between-class scatter should be as large as possible, while the within-class scatter should be as small as possible. Hence, for each $u_i$, the generalized Rayleigh quotient, namely
$$
u_i^T S_b u_i = \lambda_i\,u_i^T S_w u_i\qquad\text{or}\qquad u_i^T H_bH_b^T u_i = \lambda_i\,u_i^T H_wH_w^T u_i, \tag{5.5.14}
$$
should be maximized. This is the generalized eigenvalue decomposition of the cluster matrix pencil $(S_b, S_w)$. Equation (5.5.14) can be equivalently written as
$$
H_b^T u_i = \sigma_i H_w^T u_i,\qquad \sigma_i = \sqrt{\lambda_i}, \tag{5.5.15}
$$
which is just the GSVD of the matrix pencil $(H_b^T, H_w^T)$. From the above analysis, the dimension reduction method for a given term-document matrix $A\in\mathbb R^{m\times n}$ can be summarized as follows (a short code sketch is given after this list).
(1) Compute the matrices $H_b$ and $H_w$ using Equation (5.5.12).
(2) Perform the GSVD of the matrix pencil $(H_b, H_w)$ to get the nonzero generalized singular values $\sigma_i$, $i=1,\ldots,l$, and the corresponding left generalized singular vectors $u_1,\ldots,u_l$.
(3) The lower-dimensional representation of $A$ is then obtained with the projection matrix $[u_1,\ldots,u_l]^T$.
The GSVD has also been applied to the comparative analysis of genome-scale expression data sets of two different organisms [13], to discriminant analysis [219], etc.
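The following Python sketch illustrates the procedure. Rather than calling a dedicated GSVD routine, it solves the equivalent symmetric generalized eigenproblem (5.5.14) with scipy.linalg.eigh; the ridge term eps*I added to $S_w$, the function name and the toy data are assumptions made here so that the example is self-contained.

```python
import numpy as np
from scipy.linalg import eigh

def cluster_dimension_reduction(A, labels, l, eps=1e-8):
    """Build H_b, H_w of (5.5.12) and return the l leading generalized
    eigenvectors of (S_b, S_w) together with the reduced representation.
    eps*I is an assumed regularizer making S_w positive definite."""
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)                  # global centroid
    Hb_cols, Hw_blocks = [], []
    for lab in np.unique(labels):
        Ai = A[:, labels == lab]
        ci = Ai.mean(axis=1, keepdims=True)            # cluster centroid
        Hb_cols.append(np.sqrt(Ai.shape[1]) * (ci - c))
        Hw_blocks.append(Ai - ci)
    Hb, Hw = np.hstack(Hb_cols), np.hstack(Hw_blocks)
    Sb, Sw = Hb @ Hb.T, Hw @ Hw.T
    w, U = eigh(Sb, Sw + eps * np.eye(m))              # ascending eigenvalues
    G = U[:, ::-1][:, :l]                              # l leading directions
    return G.T @ A, G                                  # reduced data, projection

# Illustrative usage with two random clusters.
rng = np.random.default_rng(2)
A = np.hstack([rng.normal(0, 1, (20, 30)), rng.normal(3, 1, (20, 40))])
labels = np.array([0] * 30 + [1] * 40)
A_low, G = cluster_dimension_reduction(A, labels, l=1)
print(A_low.shape)   # (1, 70)
```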

5.6 Low-Rank–Sparse Matrix Decomposition In the fields of applied science and engineering (such as image, voice and video processing, bioinformatics, network search and e-commerce), a data set is often high-dimensional, and its dimension can be even up to a million. It is important to discover and utilize the lower-dimensional structures in a high-dimensional data set.


5.6.1 Matrix Decomposition Problems
In many areas of engineering and applied science (such as machine learning, control, systems engineering, signal processing, pattern recognition and computer vision), a data (image) matrix $A\in\mathbb R^{m\times n}$ usually has the following special characteristics.
(1) High dimension: $m$ and $n$ are usually very large.
(2) Low rank: $r=\operatorname{rank}(A)\ll\min\{m,n\}$.
(3) Sparsity: some observations are grossly corrupted, so that $A$ becomes $A+E$; some entries of the error matrix $E$ may be very large, but most are equal to zero.
(4) Incomplete data: some observations may also be missing.
The following are several typical application areas with the above special characteristics [90], [77].

1. Graphical modeling In many applications, because a small number of characteristic factors can explain most of the statistics of the observation data, a high-dimensional covariance matrix is commonly approximated by a low-rank matrix. In the graphical model [284] the inverse of the covariance matrix (also called the information matrix) is assumed to be sparse with respect to some graph. Thus, in the statistical model setting, the data matrix is often decomposed into the sum of a low-rank matrix and a sparse matrix, describing the roles of unobserved hidden variables and of the graphical model, respectively.

2. Composite system identification In system identification a composite system is often represented by the sum of a low-rank Hankel matrix and a sparse Hankel matrix, where the sparse Hankel matrix corresponds to a linear time-invariant system with sparse impulse response, and the low-rank Hankel matrix corresponds to the minimum-realization system with a small model order.

3. Face recognition Since images of a convex, Lambertian surface under different types of lighting span a lower-dimensional subspace [27], lower-dimensional models are very effective for image data. In particular, human face images can be well approximated by a lower-dimensional subspace. Therefore, retrieving this subspace correctly is crucial in many applications such as face recognition and alignment. However, actual face images often suffer from shadows and highlighting effects (specularities, or saturations in brightness), and hence it is necessary to remove such defects in face images by using low-rank matrix decomposition.

4. Video surveillance Given a sequence of surveillance video frames, it is often necessary to identify activities standing out from the background. If the video frames are stacked as columns of a data matrix $D$ and a low-rank and sparse decomposition $D = A + E$ is made, then the low-rank matrix $A$ naturally corresponds to the stationary background and the sparse matrix $E$ captures the moving objects in the foreground.

5. Latent semantic indexing Web search engines often need to index and retrieve the contents of a huge set of documents. A popular scheme is latent semantic indexing (LSI) [124]; its basic idea is to encode the relevance of a term or word in a document (for example, by its frequency) and to use these relevance indicators as the elements of a document-versus-term matrix $D$. If $D$ can be decomposed into the sum of a low-rank matrix $A$ and a sparse matrix $E$, then $A$ could capture the common words used in all the documents while $E$ captures the few key words that best distinguish each document from the others.

The above examples show that for many applications it is not enough to decompose a data matrix into the sum of a low-rank matrix and an error (or perturbation) matrix. A better approach is to decompose a data matrix $D$ into the sum of a low-rank matrix $A$ and a sparse matrix $E$, i.e., $D = A + E$, in order to recover its low-rank component. Such a decomposition is called a low-rank–sparse matrix decomposition.

5.6.2 Singular Value Thresholding
Singular value thresholding is a key operation in matrix recovery and matrix completion. Before discussing in detail how to solve matrix recovery (and matrix completion) problems, it is necessary to introduce singular value thresholding.

Consider the truncated SVD of a low-rank matrix $W\in\mathbb R^{m\times n}$:
$$
W = U\Sigma V^T,\qquad \Sigma = \mathrm{Diag}(\sigma_1,\ldots,\sigma_r), \tag{5.6.1}
$$
where $r=\operatorname{rank}(W)\ll\min\{m,n\}$, $U\in\mathbb R^{m\times r}$, $V\in\mathbb R^{n\times r}$. Let the threshold value $\tau\ge 0$; then
$$
\mathcal D_\tau(W) = U\mathcal D_\tau(\Sigma)V^T \tag{5.6.2}
$$
is called the singular value thresholding (SVT) of the matrix $W$, where
$$
\mathcal D_\tau(\Sigma) = \mathrm{soft}(\Sigma,\tau) = \mathrm{Diag}\bigl((\sigma_1-\tau)_+,\ldots,(\sigma_r-\tau)_+\bigr)
$$
is known as the soft thresholding of $\Sigma$ and
$$
(\sigma_i-\tau)_+ = \begin{cases}\sigma_i-\tau, & \text{if } \sigma_i>\tau,\\ 0, & \text{otherwise},\end{cases} \tag{5.6.3}
$$
is the soft thresholding operation. The relationship between the SVT and the SVD is as follows.
• If the soft threshold value $\tau=0$ then the SVT reduces to the truncated SVD (5.6.1).
• For a soft threshold value $\tau>0$, all singular values are soft thresholded, which changes only the magnitudes of the singular values; it does not change the left and right singular-vector matrices $U$ and $V$.
A proper choice of the soft threshold value $\tau$ can effectively set some singular values to zero. In this sense the SVT transform is also called a singular value shrinkage operator. Note that if the soft threshold value $\tau$ is larger than most of the singular values, then the rank of $\mathcal D_\tau(W)$ will be much smaller than the rank of the original matrix $W$.

The following are two key points for applying the SVT:
(1) How should an optimization problem be written in a standard form suitable for the SVT?
(2) How should the soft threshold value $\tau$ be chosen?
The answers to these two questions depend on the following proposition and theorem, respectively.

PROPOSITION 5.1 Let $\langle X, Y\rangle = \operatorname{tr}(X^TY)$ denote the inner product of the matrices $X\in\mathbb R^{m\times n}$ and $Y\in\mathbb R^{m\times n}$. Then we have
$$
\|X\pm Y\|_F^2 = \|X\|_F^2 \pm 2\langle X, Y\rangle + \|Y\|_F^2. \tag{5.6.4}
$$
Proof By the relationship between the Frobenius norm and the trace of a matrix, $\|A\|_F = \sqrt{\operatorname{tr}(A^HA)}$, we have
$$
\|X\pm Y\|_F^2 = \operatorname{tr}[(X\pm Y)^T(X\pm Y)] = \operatorname{tr}(X^TX \pm X^TY \pm Y^TX + Y^TY)
= \|X\|_F^2 \pm 2\langle X,Y\rangle + \|Y\|_F^2,
$$
where we have used the trace property $\operatorname{tr}(Y^TX) = \operatorname{tr}(X^TY)$.

THEOREM 5.9 For each soft threshold value $\mu>0$ and matrix $W\in\mathbb R^{m\times n}$, the SVT operator obeys [75]
$$
U\,\mathrm{soft}(\Sigma,\mu)\,V^T = \arg\min_X\Bigl\{\mu\|X\|_* + \tfrac12\|X-W\|_F^2\Bigr\} = \operatorname{prox}_{\mu\|\cdot\|_*}(W) \tag{5.6.5}
$$
and [75]
$$
\mathrm{soft}(W,\mu) = \arg\min_X\Bigl\{\mu\|X\|_1 + \tfrac12\|X-W\|_F^2\Bigr\} = \operatorname{prox}_{\mu\|\cdot\|_1}(W), \tag{5.6.6}
$$
where $U\Sigma V^T$ is the SVD of $W$ and the soft thresholding (shrinkage) operator is
$$
[\mathrm{soft}(W,\mu)]_{ij} = \begin{cases} w_{ij}-\mu, & w_{ij}>\mu,\\ w_{ij}+\mu, & w_{ij}<-\mu,\\ 0, & \text{otherwise},\end{cases}
$$
with $w_{ij}\in\mathbb R$ the $(i,j)$th entry of $W\in\mathbb R^{m\times n}$.
Theorem 5.9 tells us that, when solving a matrix recovery or matrix completion problem, the underlying question is how to transform the corresponding optimization problem into the normalized form shown in Equations (5.6.5) and/or (5.6.6).
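In code, the SVT operator of (5.6.2) and the entrywise shrinkage operator of Theorem 5.9 take only a few lines. The following Python/NumPy sketch (the function names are mine) can be reused by the iterative algorithms in the following sections.

```python
import numpy as np

def soft(W, mu):
    """Entrywise soft thresholding (shrinkage) operator of Theorem 5.9."""
    return np.sign(W) * np.maximum(np.abs(W) - mu, 0.0)

def svt(W, tau):
    """Singular value thresholding D_tau(W) = U soft(Sigma, tau) V^T of (5.6.2)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Illustrative check: svt(W, tau) minimizes tau*||X||_* + 0.5*||X - W||_F^2,
# so its singular values are those of W reduced by tau (and floored at zero).
rng = np.random.default_rng(3)
W = rng.standard_normal((6, 4))
X = svt(W, tau=0.5)
print(np.linalg.svd(X, compute_uv=False))
```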

5.6.3 Robust Principal Component Analysis
Given an observation or data matrix $D = A + E$, where $A$ and $E$ are unknown but $A$ is known to be low rank and $E$ is known to be sparse, the aim is to recover $A$. Because the low-rank matrix $A$ can be regarded as the principal component of the data matrix $D$, and the sparse matrix $E$ may contain a few gross errors or outlying observations, SVD-based principal component analysis (PCA) usually breaks down. Hence, this problem is a robust principal component analysis problem [298], [514]. So-called robust PCA is able to correctly recover the underlying low-rank structure in the data matrix, even in the presence of gross errors or outlying observations. This problem is also called the principal component pursuit (PCP) problem [90], since it tracks down the principal components of the data matrix $D$ by minimizing a combination of the nuclear norm $\|A\|_*$ and the weighted $\ell_1$-norm $\mu\|E\|_1$:
$$
\min_{A,E}\ \{\|A\|_* + \mu\|E\|_1\}\qquad\text{subject to}\qquad D = A + E. \tag{5.6.7}
$$
By the above minimization, an initially unknown low-rank matrix $A$ and unknown sparse matrix $E$ can be recovered from the data matrix $D\in\mathbb R^{m\times n}$. Here, $\|A\|_*=\sum_{i}^{\min\{m,n\}}\sigma_i(A)$ represents the nuclear norm of $A$, i.e., the sum of all its singular values, which reflects the cost of the low-rank matrix $A$; $\|E\|_1=\sum_{i=1}^{m}\sum_{j=1}^{n}|E_{ij}|$ is the sum of the absolute values of all entries of the additive error matrix $E$, some of whose entries may be arbitrarily large; and the role of the constant $\mu>0$ is to balance the contradictory requirements of low rank and sparsity. The unconstrained minimization form of the robust PCA or PCP problem is
$$
\min_{A,E}\ \Bigl\{\|A\|_* + \mu\|E\|_1 + \tfrac12\|A+E-D\|_F^2\Bigr\}. \tag{5.6.8}
$$
It can be shown [77] that, under rather weak assumptions, the PCP estimate can exactly recover the low-rank matrix $A$ from the data matrix $D = A+E$ with a gross but sparse error matrix $E$.

To solve the robust PCA or PCP problem, consider a more general family of optimization problems of the form
$$
F(X) = f(X) + \mu h(X), \tag{5.6.9}
$$
where $f(X)$ is a convex, smooth (i.e., differentiable) and $L$-Lipschitz function and $h(X)$ is a convex but nonsmooth function (such as $\|X\|_1$, $\|X\|_*$ and so on). Instead of directly minimizing the composite function $F(X)$, we minimize its separable quadratic approximation $Q(X,Y)$, formed at a specially chosen point $Y$:
$$
Q(X,Y) = f(Y) + \langle\nabla f(Y), X-Y\rangle + \tfrac12(X-Y)^T\nabla^2 f(Y)(X-Y) + \mu h(X)
= f(Y) + \langle\nabla f(Y), X-Y\rangle + \tfrac L2\|X-Y\|_F^2 + \mu h(X), \tag{5.6.10}
$$
where $\nabla^2 f(Y)$ is approximated by $L\,I$. When minimizing $Q(X,Y)$ with respect to $X$, the term $f(Y)$ may be regarded as a negligible constant. Hence, we have
$$
X_{k+1} = \arg\min_X\Bigl\{\mu h(X) + \tfrac L2\Bigl\|X - Y_k + \tfrac1L\nabla f(Y_k)\Bigr\|_F^2\Bigr\}
= \operatorname{prox}_{\mu L^{-1}h}\Bigl(Y_k - \tfrac1L\nabla f(Y_k)\Bigr). \tag{5.6.11}
$$
If we let $f(Y) = \tfrac12\|A+E-D\|_F^2$ and $h(X) = \|A\|_* + \lambda\|E\|_1$, then the Lipschitz constant is $L=2$ and $\nabla f(Y) = A+E-D$. Hence
$$
Q(X,Y) = \mu h(X) + f(Y) = (\mu\|A\|_* + \mu\lambda\|E\|_1) + \tfrac12\|A+E-D\|_F^2
$$
reduces to the quadratic approximation of the robust PCA problem. By Theorem 5.9, one has
$$
A_{k+1} = \operatorname{prox}_{(\mu/2)\|\cdot\|_*}\Bigl(Y_k^A - \tfrac12(A_k+E_k-D)\Bigr) = U\,\mathrm{soft}(\Sigma,\mu/2)\,V^T,
$$
$$
E_{k+1} = \operatorname{prox}_{(\mu\lambda/2)\|\cdot\|_1}\Bigl(Y_k^E - \tfrac12(A_k+E_k-D)\Bigr) = \mathrm{soft}\Bigl(W_k^E, \tfrac{\mu\lambda}2\Bigr),
$$
where $U\Sigma V^T$ is the SVD of $W_k^A$ and
$$
W_k^A = Y_k^A - \tfrac12(A_k+E_k-D), \tag{5.6.12}
$$
$$
W_k^E = Y_k^E - \tfrac12(A_k+E_k-D). \tag{5.6.13}
$$
Algorithm 5.6 gives a robust PCA algorithm, which has the following convergence property.

THEOREM 5.10 [298] Let $F(X) = F(A,E) = \mu\|A\|_* + \mu\lambda\|E\|_1 + \tfrac12\|A+E-D\|_F^2$. Then, for all $k>k_0=C_1/\log(1/\eta)$ with $C_1=\log(\mu_0/\mu)$, one has
$$
F(X_k) - F(X^*) \le \frac{4\,\|X_{k_0}-X^*\|_F^2}{(k-k_0+1)^2}, \tag{5.6.14}
$$
where $X^*$ is a solution of the robust PCA problem $\min_X F(X)$.

Algorithm 5.6 Robust PCA via accelerated proximal gradient [298], [514]
input: Data matrix $D\in\mathbb R^{m\times n}$, $\lambda$, allowed tolerance $\epsilon$.
initialization: $A_0, A_{-1}\leftarrow O$; $E_0, E_{-1}\leftarrow O$; $t_0, t_{-1}\leftarrow 1$; $\bar\mu\leftarrow\delta\mu_0$.
repeat
1. $Y_k^A \leftarrow A_k + \dfrac{t_{k-1}-1}{t_k}(A_k - A_{k-1})$.
2. $Y_k^E \leftarrow E_k + \dfrac{t_{k-1}-1}{t_k}(E_k - E_{k-1})$.
3. $W_k^A \leftarrow Y_k^A - \tfrac12(A_k + E_k - D)$.
4. $(U,\Sigma,V) = \mathrm{svd}(W_k^A)$.
5. $r = \max\{j : \sigma_j > \mu_k/2\}$.
6. $A_{k+1} = \sum_{i=1}^{r}\bigl(\sigma_i - \tfrac{\mu_k}2\bigr)u_iv_i^T$.
7. $W_k^E \leftarrow Y_k^E - \tfrac12(A_k + E_k - D)$.
8. $E_{k+1} = \mathrm{soft}\bigl(W_k^E, \tfrac{\lambda\mu_k}2\bigr)$.
9. $t_{k+1} \leftarrow \dfrac{1+\sqrt{4t_k^2+1}}2$.
10. $\mu_{k+1} \leftarrow \max(\eta\mu_k, \bar\mu)$.
11. $S_{k+1}^A = 2(Y_k^A - A_{k+1}) + (A_{k+1} + E_{k+1} - Y_k^A - Y_k^E)$.
12. $S_{k+1}^E = 2(Y_k^E - E_{k+1}) + (A_{k+1} + E_{k+1} - Y_k^A - Y_k^E)$.
13. exit if $\|S_{k+1}\|_F^2 = \|S_{k+1}^A\|_F^2 + \|S_{k+1}^E\|_F^2 \le \epsilon$.
return $k\leftarrow k+1$.
output: $A\leftarrow A_k$, $E\leftarrow E_k$.
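Algorithm 5.6 uses acceleration (the $t_k$ sequence) and a continuation scheme for $\mu_k$. The sketch below implements only the plain, non-accelerated proximal-gradient version of the same two proximal steps, with a fixed $\mu$; it is an illustrative simplification written here, not a faithful transcription of the algorithm, and the parameter values are assumptions.

```python
import numpy as np

def robust_pca_pg(D, lam, mu, n_iter=200):
    """Plain proximal-gradient sketch for
    mu*||A||_* + mu*lam*||E||_1 + 0.5*||A + E - D||_F^2  (cf. Theorem 5.10).
    Each iteration applies the two proximal steps of Theorem 5.9 with step
    size 1/L = 1/2; no acceleration or mu-continuation is used."""
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(n_iter):
        R = A + E - D                                  # gradient of the smooth term
        U, s, Vt = np.linalg.svd(A - 0.5 * R, full_matrices=False)
        A = (U * np.maximum(s - mu / 2, 0.0)) @ Vt     # SVT step, threshold mu/2
        W = E - 0.5 * R
        E = np.sign(W) * np.maximum(np.abs(W) - lam * mu / 2, 0.0)  # soft threshold
    return A, E

# Illustrative usage on synthetic low-rank plus sparse data.
rng = np.random.default_rng(4)
L0 = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))
S0 = (rng.random((50, 40)) < 0.05) * rng.normal(0, 10, (50, 40))
A_hat, E_hat = robust_pca_pg(L0 + S0, lam=1 / np.sqrt(50), mu=0.1)
print((np.linalg.svd(A_hat, compute_uv=False) > 1e-3).sum(), np.count_nonzero(E_hat))
```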

5.7 Matrix Completion Finding a low-rank matrix to fit or approximate a higher-dimensional data matrix is a fundamental task in data analysis. In a slew of applications, two popular empirical approaches have been used to represent the target rank-k matrix. (1) Low-rank decomposition A noisy data matrix X ∈ Rm×n is decomposed as the sum of two matrices A + E, where A ∈ Rm×n is a low-rank matrix with the same rank k as the target matrix to be estimated and E ∈ Rm×n is a fitting or approximating error matrix. (2) Low-rank bilinear modeling: The data matrix X ∈ Rm×n is modeled in a bilinear form X = UVT , where U ∈ Rm×k and V ∈ Rn×k are full-column rank matrices. When most entries of a data matrix are missing, its low-rank decomposition problem is called matrix completion, as presented by Cand`es and Recht in 2009 [79]. The low-rank bilinear modeling of a matrix is known as matrix sensing and was developed by Recht et al. [412] and Lee and Bresler [288].

5.7 Matrix Completion

303

Specifically, the matrix completion problem and the matrix sensing problem can be described as follows. • Matrix completion involves completing a low-rank matrix by taking into account, i.e., observing, only a few of its elements. • Matrix sensing refers to the problem of recovering a low-rank matrix B ∈ Rm×n , given d linear measurements bi = tr(ATi B) and measurement matrices Ai , i = 1, . . . , d. In fact, matrix completion is a special case of the matrix sensing problem where each observed entry in the matrix completion problem represents a single-element measurement matrix Ai = Ai . This section focuses upon matrix completion.

5.7.1 Matrix Completion Problems Matrix completion problems stem from the widespread practical applications. 1. Dynamic scene reconstruction It is a classical and hot issue in computer vision and graphics to make a dynamic scene reconstruction from multiview video sequences. Three-dimensional (3-D) motion estimation from multiview video sequences is of vital importance in achieving high-quality dynamic scene reconstruction and has great application potential in the virtual reality and future movie industry. Because 3-D motion vectors can be treated as an incomplete matrix according to their local distributions and frequently there are outliers in the known motion information, 3-D motion estimation is naturally a matrix completion problem. 2. Radar imaging For high-resolution radar imaging, the resulting high sampling rate poses difficulties for raw data transmission and storage, which results in the undersampling of stepped-frequency-radar data. Radar change imaging aims to recover the changed portion of an imaged region using the data collected before and after a change. Hence, radar change imaging with undersampled data can be cast as a matrix completion problem [45]. Tomographic synthetic aperture radar (TomoSAR) is a new SAR imaging modality, but the massive baselines involved make TomoSAR imaging an expensive and difficult task. However, the use of limited baselines with fixed-elevation aperture length to improve the recovery quality is worth researching. By exploiting multibaseline two-dimensional (2D) focused SAR image data, the missing-data compensation method, based on matrix completion, can solve the 3-D focused TomoSAR imaging problem [46]. 3. MIMO radar Multiple-input multiple-output (MIMO) radar has received significant attention owing to its improved resolution and target estimation performance as compared with conventional radars. For a sufficiently large number of transmission and reception antennas and a small number of targets, the data


matrix is low rank, and it can be recovered from a small number of its entries via matrix completion. This implies that, at each reception antenna, matched filtering does not need to be performed with all the transmit waveforms, but rather with just a small number of randomly selected waveforms from the waveform dictionary [237]. 4. Ranking and collaborative filtering Predicting the user’s preferences is increasingly important in e-commerce and advertisement. Companies now routinely collect user rankings for various products (such as movies, books or games). Socalled ranking and collaborative filtering employs incomplete user rankings on some products for predicting any given user’s preferences on another product. The best known ranking and collaborative filtering is perhaps the Netflix Prize for movie ranking. In this case the data matrix is incomplete, as the data collection process often lacks control or sometimes a small portion of the available rankings could be noisy and even have been subject to tampering. Therefore, it necessary to infer a low-rank matrix L from an incomplete and corrupted matrix [79], [78]. 5. Global positioning Owing to power constraints, sensors in global positioning networks may only be able to construct reliable distance estimates from their immediate neighbors, and hence only a partially observed distance matrix can be formed. However, the problem is to infer all the pairwise distances from just a few observed data so that the locations of the sensors can be reliably estimated. This reduces to a matrix completion problem, where the unknown matrix is of rank 2 (if the sensors are located in the plane) or 3 (if they are located in space) [79], [78]. 6. Remote sensing In array signal processing, it is necessary to estimate the direction of arrival (DOA) of incident signals in a coherent radio-frequency environment. The DOA is frequently estimated by using the covariance matrix for the observed data. However, in remote sensing applications, one may not be able to estimate or transmit all entries of the covariance matrix owing to power constraints. To infer a full covariance matrix from just a few observed partial correlations, one needs to solve a matrix completion problem in which the unknown signal covariance matrix has low rank since it is equal to the number of incident waves, which is usually much smaller than the number of sensors [79], [78]. 7. Multi-label image classification In multi-label image classification, multiple labels are assigned to a given image; thus is useful in web-based image analysis. For example, keywords can be automatically assigned to an uploaded web image so that annotated images may be searched directly using text-based image retrieval systems. An ideal multi-label algorithm should be able to handle missing features, or cases where parts of the training data labels are unknown, and it should be robust against outliers and background noise. To this end, Luo et

5.7 Matrix Completion

305

al. [306] have developed recently a multiview matrix completion for handling multiview features in semi-supervised multi-label image classification. 8. Computer vision and graphics Many problems can be formulated as missingvalue estimation problems, e.g., image in-painting, video decoding, video inpainting, scan completion and appearance acquisition completion [300]. The differences between matrix completion and matrix recovery (or low-rank– sparse matrix decomposition) are given below. (1) Only a small number of matrix elements is known in matrix completion, while the whole data matrix is known in matrix recovery. (2) Matrix completion reconstructs a higher-dimensional matrix, while matrix recovery extracts the lower-dimensional features of a higher-dimensional matrix. (3) The capability of matrix completion is limited by noncoherence and the sampling rate, while the capability of the matrix recovery is limited by real matrix rank and data sampling method.

5.7.2 Matrix Completion Model and Incoherence Let D ∈ R be a higher-dimensional incomplete data matrix, as discussed above: only a small amount of entries are known or observed and the index set of these known or sample entries is Ω, namely only the matrix entries Dij , (i, j) ∈ Ω are known or observed. The support of the index set is called the base number, denoted |Ω|. The base number is the number of sample elements, and the ratio of the number of samples to the dimension number of the data matrix, p = |Ω|/(mn), is called the sample density of the data matrix. In some typical applications (e.g., the Netflix recommendation system), the sample density is only 1% or less. As mentioned earlier, the mathematical problem of recovering a low-rank matrix from an incomplete data matrix is called matrix completion, and can be represented as ˆ = arg min rank(X) X (5.7.1) m×n

X

subject to PΩ (X) = PΩ (D) or

Xij = Dij ,

for (i, j) ∈ Ω,

(5.7.2)

where Ω ⊂ {1, . . . , m} × {1, . . . , n} is a subset of the complete set of entries and PΩ : Rm×n → Rm×n is the projection onto the index set Ω: " Dij , (i, j) ∈ Ω, [PΩ (D)]ij = (5.7.3) 0, otherwise. Unfortunately, the rank minimization problem (5.7.1) is NP-hard owing to the


nonconvexity and discontinuous nature of the rank function. It has been shown [412] that the nuclear norm of a matrix is the tightest convex lower bound of its rank function. Hence, a popular alternative is the convex relaxation of the rank function: ˆ = arg min X ∗ X

subject to Xij = Dij ,

(i, j) ∈ Ω,

(5.7.4)

X

where X ∗ is the nuclear norm of the matrix X, defined as the sum of its singular values. It was observed in [79] that if the entries of D are equal to zero in most rows or columns then it is impossible to complete D unless all of its entries are observed. An interesting question is how to guarantee the unique completion of an incomplete data matrix. The answer to this question depends on an additional property known as the incoherence of the data matrix D. Suppose that the SVD of an m × n matrix M with rank r is M = UΣVT . The matrix M is said to satisfy the standard incoherence condition with parameter μ0 if [98] : # " 2 T 2 μ0 r 2 2 , max U ei 2 ≤ 1≤i≤m m (5.7.5) : # " 2 2 μ0 r max 2VT ej 22 ≤ , 1≤j≤n n where ei ∈ Rm×1 and ej ∈ Rn×1 are the standard basis vectors, with only nonzero ith and jth entries, respectively. The following theorem shows that only the standard incoherence condition is required for the unique recovery of an incomplete data matrix. THEOREM 5.11 [98] Suppose that a data matrix D satisfies the standard incoherence condition (5.7.5) with parameter μ0 . There exist universal constants c0 , c1 , c2 > 0 such that if p ≥ c0

μ0 r log2 (m + n) , min{m, n}

(5.7.6)

then the matrix D is the unique optimal solution to (5.7.4) with probability at least 1 − c1 (m + n)−c2 .

5.7.3 Singular Value Thresholding Algorithm Consider the matrix completion problem # " 1 2 subject to PΩ (X) = PΩ (D), min τ X ∗ + X F 2

(5.7.7)

5.7 Matrix Completion

307

with Lagrangian objective function 1 L(X, Y) = τ X ∗ + X 2F + Y, PΩ (D) − PΩ (X), 2

(5.7.8)

where the inner product A, B = tr(AT B) = B, A. Notice that Y, PΩ (D) and PΩ (Y) 2F = PΩ (Y), PΩ (Y) are constants with respect to X. We delete the former term and add the latter term to yield " # 1 min L(X, Y) = min τ X ∗ + X 2F + Y, PΩ (D − X) 2 X X " # 1 1 = min τ X ∗ + X 2F − X, PΩ (Y) + PΩ (Y) 2F . (5.7.9) 2 2 X Here we have Y, PΩ X = tr(YT PΩ X) = tr(XT PΩT Y) = tr(XT PΩ Y) = X, PΩ Y, because PΩT = PΩ . By Theorem 5.9, from Equation (5.7.9) we have " # 1 1 2 2 Xk+1 = arg min τ X ∗ + X F − X, PΩ Yk  + PΩ Yk F 2 2 X " # 1 = arg min X ∗ + X − PΩ Yk 2F . (5.7.10) 2 X Because Yk = PΩ (Yk ), ∀ k ≥ 0, Theorem 5.9 gives Xk+1 = Dτ (PΩ Yk ) = proxτ · ∗ (Yk ) = Usoft(Σ, τ )VT ,

(5.7.11)

where UΣVT is the SVD of Yk . Moreover, from Equation (5.7.8), the gradient of the Lagrangian function at the point Y is given by ∂L(X, Y) = PΩ (X − D). ∂YT Hence the gradient descent update of Yk is Yk+1 = Yk + μPΩ (D − Xk+1 ), where μ is the step size. Equations (5.7.11) and (5.7.12) give the iteration sequence [75] 0 Xk+1 = Usoft(Σ, τ )(VT ), Yk+1 = Yk + μPΩ (D − Xk+1 ).

(5.7.12)

(5.7.13)

The above iteration sequence has the following two features [75]: (1) Sparsity For each k ≥ 0, the entries of Yk outside Ω are equal to zero, and thus Yk is a sparse matrix. This fact can be used to evaluate the shrink function rapidly.


(2) Low-rank property The matrix Xk has low rank, and hence the iteration algorithm requires only a small amount of storage since we need to keep only the principal factors in the memory. Algorithm 5.7 gives a singular value thresholding algorithm for matrix completion. Algorithm 5.7

Singular value thresholding for matrix completion [75]

given: Dij , (i, j) ∈ Ω, step size δ, tolerance , parameter τ , increment , maximum iteration count kmax . initialization: Define k0 as the integer such that

τ δ PΩ (D) 2

∈ (k0 − 1, k0 ].

0

Set Y = k0 δPΩ (D). Put r0 and k = 1. repeat 1. Compute the SVD (U, Σ, V) = svd(Y k−1 ). 2. Set rk = max{j|σj > τ }. rk 3. Update Xk = j=1 (σj − τ )uj vjT . 4. exit if PΩ (Xk − D)F /PΩ DF ≤  or k = kmax .  k−1 k ), if (i, j) ∈ Ω, Yij + δ(Dij − Xij 5. Compute Yijk = 0, if (i, j) ∈ / Ω. return k ← k + 1. output: Xopt = Xk .
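A minimal implementation of the iteration (5.7.13) is sketched below. The parameters tau and delta (the step size $\mu$ in the text), the iteration count and the synthetic data are illustrative choices made here, and the kick-start initialization of Algorithm 5.7 is omitted for brevity.

```python
import numpy as np

def svt_complete(D, mask, tau, delta, n_iter=500):
    """Sketch of the SVT iteration (5.7.13) for matrix completion:
    X_{k+1} = D_tau(Y_k),  Y_{k+1} = Y_k + delta * P_Omega(D - X_{k+1}).
    `mask` is a boolean array marking the observed entries in Omega."""
    Y = np.zeros_like(D)
    X = np.zeros_like(D)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt     # singular value thresholding
        Y = Y + delta * mask * (D - X)              # update only on the observed set
    return X

# Illustrative usage: recover a synthetic rank-2 matrix from 40% of its entries.
rng = np.random.default_rng(5)
D = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
mask = rng.random(D.shape) < 0.4
X_hat = svt_complete(D, mask, tau=5 * np.sqrt(D.size), delta=1.2)
print(np.linalg.norm(X_hat - D) / np.linalg.norm(D))   # relative recovery error
```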

5.7.4 Fast and Accurate Matrix Completion As compared with the nuclear norm, a better convex relaxation of the rank function is truncated nuclear norm regularization, proposed recently in [221]. DEFINITION 5.1 [221] Given a matrix X ∈ Rm×n with r = rank(X), its truncated nuclear norm X ∗,r is defined as the sum of min{m, n} − r minimum singular min{m,n} values, i.e., X ∗,r = i=r+1 σi . Using the truncated nuclear norm instead of the nuclear norm, the standard matrix completion (5.7.4) becomes the truncated nuclear norm minimization min X ∗,r X

subject to PΩ (X) = PΩ (D).

(5.7.14)

Unlike the convex nuclear norm, the truncated nuclear norm X ∗,r is nonconvex, and thus (5.7.14) is difficult to solve. Fortunately, the following theorem can be used to overcome this difficulty.

5.7 Matrix Completion

309

THEOREM 5.12 [221] Given any matrix X ∈ Rm×n and r ≤ min{m, n}, for two matrices A ∈ Rr×m , B ∈ Rr×n , if AAT = Ir×r and BBT = Ir×r then tr(AXBT ) ≤

r 

σi (X).

(5.7.15)

i=1

A simple choice is A = [u1 , . . . , ur ]T , B = [v1 , . . . , vr ]T , and X = is the truncated SVD of X. Under this choice, one has + 0 r  T tr(AXB ) = max σi (X) . AAT =I, BBT =I

r i=1

σi (X)ui viT

(5.7.16)

i=1

Hence the truncated nuclear norm minimization (5.7.14) can be relaxed as " # min X ∗ − max tr(AXBT ) subject to PΩ (X) = PΩ (D). (5.7.17) X

AAT =I,BBT =I

If Al and Bl are found then the above optimization problem becomes 3 4 subject to PΩ (X) = PΩ (D), min X ∗ − tr(Al XBTl )

(5.7.18)

X

and its Lagrangian objective function is L(X) = X ∗ − tr(Al XBTl ) +

λ

PΩ X − PΩ D 2F , 2

(5.7.19)

where λ is a Lagrange multiplier. Let f (X) =

λ

PΩ X − PΩ D 2F − tr(Al XBTl ) 2

and

h(X) = X ∗ ,

then f (X) is a convex, differentiable and L-Lipschitz function, and h(X) is a convex but nonsmooth function. Consider the quadratic approximation of f (X) at a given point Y: L fˆ(X) = f (Y) + X − Y, ∇f (Y) + X − Y 2F . 2 Then, the Lagrangian objective function L(X) = h(X) + fˆ(X) can be reformulated as L Q(X, Y) = X ∗ + f (Y) + X − Y, ∇f (Y) + X − Y 2F . (5.7.20) 2 In minimizing Q(X, Y) with respect to X, the second additive term f (Y) is cons1 tant and can be ignored. On the other hand, another constant term, ∇f (Y) 2F = 2L


2 L 1 ∇f (Y) , 2 L F

can be added in Q(X, Y). Then, we have

min Q(X, Y) X " # L 1 L 1 2 2 = min X ∗ + X − Y F + LX − Y, ∇f (Y) + ∇f (Y) F 2 L 2 L X " # L 1 (by Proposition 5.1) = min X ∗ + X − Y + ∇f (Y) 2F 2 L X which yields Xk+1

" # L 1 2 = arg min X ∗ + X − Yk + ∇f (Yk ) F 2 L X   −1 = proxL−1 · ∗ Yk − L ∇f (Yk ) = Usoft(Σ, t)VT

(by Theorem 5.9),

(5.7.21)

where t = 1/L and UΣVT is the SVD of the matrix Yk − tk ∇f (Yk ). 1 From f (Y) = λ PΩ Y − PΩ D 2F − tr(Al YBTl ) it is easy to see that ∇f (Y) = 2 λ(PΩ Y − PΩ D) − ATl Bl , and thus   (5.7.22) (U, Σ, V) = svd Yk + tk (ATl Bl − λ(PΩ Yk − PΩ D)) . Algorithm 5.8 give the accelerated proximal gradient line (APGL) search algorithm for truncated nuclear norm minimization that was presented in [221]. Algorithm 5.8

Truncated nuclear norm minimization via APGL search

input: Al , Bl ; DΩ = PΩ D and tolerance . initialization: t1 = 1, X1 = DΩ , Y1 = X1 . repeat

  1. Compute Wk+1 = Yk + tk ATl Bl − λ(PΩ Yk − PΩ D) . 2. Perform the SVD (U, Σ, V) = svd(Wk+1 ). min{m,n} max{σi − tk , 0}ui viT . 3. Compute Xk+1 = i=1 4. exit if Xk+1 − Xk F ≤ . 5. Compute tk+1 =

1+

1 + 4t2k 2

6. Update Yk+1 = Xk+1 +

.

tk − 1 (Xk+1 tk+1

− Xk ).

return k ← k + 1. output: Xopt = Xk .

However, the truncated nuclear norm minimization problem (5.7.19) can also be written in the standard form of the alternating-direction method of multipliers

5.7 Matrix Completion

(ADMM): 3 4 min X ∗ − tr(Al WBTl ) X,W

subject to X = W, PΩ W = PΩ D.

311

(5.7.23)

The corresponding augmented Lagrangian objective function is given by L(X, Y, W, β) = X ∗ − tr(Al WBTl ) +

β

X − W 2F + Y, X − W, 2

where β > 0 is the penalty parameter. Ignoring the constant term tr(Al WBTl ) and adding the constant term β −1 Y 2F , we obtain " # β 1 2 2 Xk+1 = arg min X ∗ + X − Wk F + Yk , X − Wk  + Yk F 2 β X " # β 1 = arg min X ∗ + X − (Wk − Yk ) 2F 2 β X = proxβ −1 ·|∗ (Wk − β −1 Yk ) = Usoft(Σ, β −1 )VT ,

(5.7.24)

where UΣVT is the SVD of the matrix Wk − β −1 Yk . Similarly, ignoring the constant term Xk+1 ∗ and adding two constant terms Xk+1 , ATl Bl  and (2β)−1 Yk − ATl Bl 2F , we have Wk+1 = arg min L(Xk+1 , W, Yk , β) PΩ W=PΩ D

" β = arg min −tr(Al WBTl ) + Xk+1 − W 2F + Xk+1 , ATl Bl  2 PΩ W=PΩ D # 1

Yk − ATl Bl 2F +Yk , Xk+1 − W + 2β + 2  22 0 2 β2 1 T 2 W − Xk+1 − (Yk + Al Bl ) 2 . (5.7.25) = arg min 2 2 2 β PΩ W=PΩ D F Its closed solution is given by Wk+1 = Xk+1 −

1 (Yk + ATl Bl ), β

Wk+1 = PΩ (Wk+1 ) + PΩ (D).

(5.7.26) (5.7.27)

Algorithm 5.9 lists the ADMM algorithm for truncated nuclear norm minimization. Unlike traditional nuclear norm heuristics, which take into account all the singular values, the truncated nuclear norm regularization approach in [221] aims to minimize the sum of the smallest min{m, n} − r singular values, where r is the matrix rank. This helps to give a better approximation to the matrix rank when the original matrix has a low-rank structure.


Algorithm 5.9

Truncated nuclear norm minimization via ADMM [221]

input: Al , Bl ; Dij , (i, j) ∈ Ω, tolerance  and singular value threshold τ . initialization: X1 = DΩ , W1 = X1 , Y1 = X1 and β = 1. repeat 1. Compute the SVD (U, Σ, V) = svd(Wk − β −1 Yk ). min{m,n} max{σi − β −1 , 0}ui viT . 2. Compute Xk+1 = i=1 3. exit if Xk+1 − Xk F ≤ . 4. Update Wk+1 = Xk+1 − β −1 (Yk + ATl Bl ). 5. Update Wk+1 = PΩ (Wk+1 ) + PΩ (D). 6. Compute Yk+1 = Yk + β(Xk+1 − Wk+1 ). return k ← k + 1. output: Xopt = Xk .
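The two-layer structure (an outer refresh of $(A_l, B_l)$ and an inner ADMM loop) can be sketched in a few lines of Python/NumPy. The code below is an illustrative sketch in the spirit of Algorithm 5.9 rather than a faithful transcription: beta, the iteration counts and the initialization are assumptions made here, and the W-update is written directly as the minimizer of the augmented Lagrangian in W followed by re-imposing the observed entries.

```python
import numpy as np

def svt_step(M, thr):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - thr, 0.0)) @ Vt

def tnn_admm(D, mask, r, beta=1.0, outer=10, inner=50):
    """Truncated-nuclear-norm matrix completion sketch: the outer loop takes
    (A_l, B_l) from the top-r singular vectors of the current iterate; the
    inner ADMM loop alternates X, W and multiplier updates."""
    X = D * mask
    for _ in range(outer):
        U, _, Vt = np.linalg.svd(X, full_matrices=False)
        Al, Bl = U[:, :r].T, Vt[:r, :]              # A_l = [u_1..u_r]^T, B_l = [v_1..v_r]^T
        W, Y = X.copy(), np.zeros_like(X)
        for _ in range(inner):
            X = svt_step(W - Y / beta, 1.0 / beta)  # X-update: SVT with threshold 1/beta
            W = X + (Y + Al.T @ Bl) / beta          # W-update: unconstrained minimizer
            W = np.where(mask, D, W)                # keep the observed entries fixed
            Y = Y + beta * (X - W)                  # dual (multiplier) update
    return X

# Illustrative usage: complete a synthetic rank-3 matrix from 50% of its entries.
rng = np.random.default_rng(6)
D = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
mask = rng.random(D.shape) < 0.5
X_hat = tnn_admm(D, mask, r=3)
print(np.linalg.norm(X_hat - D) / np.linalg.norm(D))
```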

Exercises 5.1

Given the matrix

5.2

find its SVD. Use the SVD of the matrix



⎤ 1 1 A = ⎣1 1⎦ , 0 0



1 A = ⎣0 1

⎤ 0 1⎦ 1

5.4

to find its Moore–Penrose inverse A† , and verify that it satisfies the four Moore–Penrose conditions. Prove that if A is a normal matrix then each of its singular values is the modulus of the corresponding eigenvalue of AT A. Compute the SVD of the following two matrices: ⎤ ⎡   1 −1 3 4 5 ⎣ ⎦ . A= and A = 3 −3 2 1 7 −3 3

5.5

Given the matrix

5.3



−149 A = ⎣ 537 −27

−50 180 9

⎤ −154 546 ⎦ , −25

find the singular values of A and the left and right singular vectors associated with the smallest singular value.

Exercises

5.6

5.7 5.8 5.9

313

Given a complex matrix A = B + jC, where B, C ∈ Rm×n , use the SVD of B −C the block matrix to represent the SVD of A. C B Let A = xpH + yqH , where x ⊥ y and p ⊥ q. Find the Frobenius norm

A F . (Hint: Compute AH A, and find the singular values of A.) Let A = UΣVH be the SVD of A. What is the relationship between the singular values of AH and A? Given an n × n (where n ≥ 3) matrix ⎡ ⎤ 1 a ··· a ⎢a 1 · · · a ⎥ ⎢ ⎥ A = ⎢. . . ⎥, . . ... ⎦ ⎣ .. .. a a ··· 1

use the SVD to find the value of a such that rank(A) = n − 1. 5.10 Show that if A is a square matrix then | det(A)| is equal to the product of the singular values of A. 5.11 Suppose that A is a invertible matrix. Find the SVD of A−1 . 5.12 Show that if A is an n × n positive definite matrix then the singular values of A and its eigenvalues are the same. 5.13 Let A be an m × n matrix and P be an m × m positive definite matrix. Prove that the singular values of PA and A are the same. What is the relationship between the left and right singular vectors? 5.14 Let A be an m × n matrix and λ1 , . . . , λn be the eigenvalues of the matrix AT A with the corresponding eigenvectors u1 , . . . , un . Prove that the singular values σi of A are equal to the norms Aui , namely σi = Aui , i = 1, . . . , n. 5.15 Let λ1 , . . . , λn and u1 , . . . , un be the eigenvalues and eigenvectors of the matrix AT A, respectively. Suppose that A has r nonzero singular values. Show that {Au1 , . . . , Aur } is an orthogonal basis set of the column space Col(A) and that rank(A) = r. 5.16 Let B, C ∈ Rm×n . Find the relationship between the singular values and the singular vectors  of the complex matrix A = B + j C and those of the real B −C . block matrix C B 5.17 Use the singular vectorsof the matrix A ∈ Rm×n (m ≥ n) to represent the O AT . eigenvectors of A O   ai ∈ Rn+1 to the straight line 5.18 Prove that the distance from the point bi xT a = b is equal to ; |aTi x − bi |2 di = . xT x + 1


5.19 Adopt the MATLAB function [U, S, V] = svd(X) to solve the matrix equation Ax = b, where ⎡ ⎤ ⎡ ⎤ 1 1 1 1 ⎢ 3 1 3⎥ ⎢4⎥ ⎥ ⎢ ⎥ A=⎢ ⎣1 0 1⎦ , b = ⎣3⎦ . 2 2 1 2 5.20 Suppose that the observation data of a computer simulation are generated by √ √ x(n) = 20 sin(2π 0.2n) + 2 sin(2π 0.215n) + w(n), where w(n) is a Gaussian white noise with mean 0 and covariance 1 and n = 1, 2, . . . , 128. Perform 50 independent runs. Determine the effective rank of the autocorrelation matrix ⎡ ⎤ r(0) r(−1) ··· r(−2p) ⎢ r(1) r(0) · · · r(−2p + 1)⎥ ⎢ ⎥ R=⎢ . ⎥, .. . .. .. ⎣ .. ⎦ . . r(M )

r(M − 1)

where M = 50, p = 10, and r(k) =

···

r(M − 2p)

128−k  1 x(i)x(i 128 i=1

+ k) represents sample

autocorrelation functions of the observation data (unknown observation data are taken as zero). 5.21 [179] Apply the SVD to show that if A ∈ Rm×n (m ≥ n) then there are Q ∈ Rm×n and P ∈ Rn×n such that A = QP, where QT Q = In , and P is symmetric and nonnegative definite. This decomposition is sometimes called the polar decomposition, because it is similar to the complex decomposition z = |z|e j arg(z) .

6 Solving Matrix Equations

In a wide variety of science and engineering disciplines, we often encounter three main types of matrix equation: (1) Over-determined matrix equations (ODMEs) Am×n xn×1 = bm×1 with m > n; the data matrix A and the data vector b are known, but one or both may have errors or interference. (2) Blind matrix equations (BMEs) Only the data matrix X is known, while the coefficient matrix A and the parameter matrix S are unknowns in the equation Am×n Sn×p = Xm×p . (3) Under-determined sparse matrix equations (UDSMEs) Ax = b with m < n; the data vector b and the data matrix A are known, but the data vector b is a sparse vector. In this chapter we discuss the methods for solving these three types of matrix equation: ⎧ ⎧ ⎪ Least squares ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨Data least squares ⎪ ⎪ ⎪ ⎪ ODMEs Tikhonov method and Gauss–Seidel methods ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Total least squares ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ⎪ Constrained total least squares ⎪ ⎪ ⎪ ⎪ + ⎪ ⎪ ⎪ Subspace methods ⎪ Matrix ⎨BMEs Nonnegative matrix decomposition equations ⎪ ⎪ ⎪ ⎧ ⎪ ⎪ ⎪ ⎪ Orthogonal matching pursuit ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Gradient projection algorithm ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ Lasso and LARS methods ⎪ ⎪ UDSMEs ⎪ ⎪ ⎪ ⎪ Homotopy algorithm ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Augmented Lagrange multiplier ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ⎩ Bregman iteration algorithms 315


6.1 Least Squares Method Linear parameter estimation is widely used in science and technology problems, and the least squares method is the most common.

6.1.1 Ordinary Least Squares Methods Consider an over-determined matrix equation Ax = b, where b is an m × 1 data vector, A is an m × n data matrix, and m > n. Suppose that in the data vector there is additive observation error or noise, i.e., b = b0 + e, where b0 and e are the errorless data vector and the additive error vector, respectively. In order to counterbalance the influence of the error e on the matrix equation solution, we introduce a correction vector Δb and use it to “perturb” the data vector b. Our goal is to make the correction term Δb that compensates the uncertainty existing in the data vector b as small as possible so that b + Δb → b0 and hence the following transformation is achieved: ⇒

Ax = b + Δb

Ax = b0 .

(6.1.1)

That is to say, if we select the correction vector Δb = Ax − b directly, and make Δb as small as possible, we can achieve the solution of errorless matrix equation Ax = b0 . This idea for solving the matrix equation can be described by the optimization problem   min Δb 2 = Ax − b 22 = (Ax − b)T (Ax − b) . (6.1.2) x

Such a method is known as the ordinary least squares (OLS) method, usually called simply the least squares (LS) method. As a matter of fact, the correction vector Δb = Ax − b is just the error vector of both sides of the matrix equation Ax = b. Hence the central idea of the LS method is to find the solution x such that the sum of the squared error Ax − b 22 is minimized, namely ˆ LS = arg min Ax − b 22 . x x

(6.1.3)

In order to derive an analytic solution for x, expand Equation (6.1.2) to get φ = xT AT Ax − xT AT b − bT Ax + bT b. Find the derivative of φ with respect to x and let the result equal zero; thus dφ = 2AT Ax − 2AT b = 0. dx That is to say, the solution x must satisfy AT Ax = AT b.

(6.1.4)

6.1 Least Squares Method

317

The above equation is “identifiable” or “unidentifiable” depending on the rank of the m × n matrix A. • Identifiable When Ax = b is an over-determined equation, it has a unique solution xLS = (AT A)−1 AT b

if rank(A) = n,

(6.1.5)

or xLS = (AT A)† AT b

if rank(A) < n,

(6.1.6)

where B† denotes the Moore–Penrose inverse matrix of B. In parameter estimation theory the unknown parameter vector x is said to be uniquely identifiable if it is uniquely determined. • Unidentifiable For an under-determined equation Ax = b, if rank(A) = m < n, then different solutions for x give the same value of Ax. Clearly, although the data vector b can provide some information about Ax, we cannot distinguish the different parameter vectors x corresponding to the same Ax. Such a unknown vector x is said to be unidentifiable.

6.1.2 Properties of Least Squares Solutions In parameter estimation, an estimate θˆ of the parameter vector θ is called an unbiased estimator, if its mathematical expectation is equal to the true unknown paˆ = θ. Further, an unbiased estimator is called the optimal rameter vector, i.e., E{θ} unbiased estimator if it has the minimum variance. Similarly, for an over-determined matrix equation Aθ = b + e with noisy data vector b, if the mathematical expectation of the LS solution θˆLS is equal to the true parameter vector θ, i.e., E{θˆLS } = θ, and has the minimum variance, then θˆLS is the optimal unbiased solution. THEOREM 6.1 (Gauss–Markov theorem)

Consider the set of linear equations

Ax = b + e,

(6.1.7)

where the m × n matrix A and the n × 1 vector x are respectively a constant matrix and the parameter vector; b is the m × 1 vector with random error vector e = [e1 , . . . , em ]T whose mean vector and covariance matrix are respectively E{e} = 0,

Cov(e) = E{eeH } = σ 2 I.

ˆ if and only if Then the n × 1 parameter vector x is the optimal unbiased solution x rank(A) = n. In this case the optimal unbiased solution is given by the LS solution ˆ LS = (AH A)−1 AH b, x

(6.1.8)

Var(ˆ xLS ) ≤ Var(˜ x),

(6.1.9)

and its covariance satisfies

318

Solving Matrix Equations

˜ is any other solution of the matrix equation Ax = b + e. where x Proof

From the assumption condition E{e} = 0, it follows that E{b} = E{Ax} − E{e} = Ax.

(6.1.10)

Using Cov(e) = E{eeH } = σ 2 I, and noticing that Ax is statistically uncorrelated with the error vector e, we have E{bbH } = E{(Ax − e)(Ax − e)H } = E{AxxH AH } + E{eeH } = AxxH AH + σ 2 I.

(6.1.11)

Since rank(A) = n and the matrix product AH A is nonsingular, we get E{ˆ xLS } = E{(AH A)−1 AH b} = (AH A)−1 AH E{b} = (AH A)−1 AH Ax = x, ˆ LS = (AH A)−1 AH b is an unbiased solution to the matrix i.e., the LS solution x equation Ax = b + e. ˆ LS has the minimum variance. To this end we suppose that Next, we show that x ˜ that is given by x has an alternative solution x ˜=x ˆ LS + Cb + d, x where C and d are a constant matrix and a constant vector, respectively. The ˜ is unbiased, i.e., solution x E{˜ x} = E{ˆ xLS } + E{Cb} + d = x + CAx + d = x + CAx + d,

∀ x,

if and only if CA = O (the zero matrix),

d = 0.

(6.1.12)

Using these constraint conditions, it can be seen that E{Cb} = CE{b} = CAθ = 0. Then we have 3  H 4 Cov(˜ x) = Cov(ˆ xLS + Cb) = E (ˆ xLS − x) + Cb (ˆ xLS − x) + Cb     xLS − x)(Cb)H + E Cb(ˆ xLS − x)H = Cov(ˆ xLS ) + E (ˆ   + E CbbH CH . (6.1.13)

6.1 Least Squares Method

319

However, from (6.1.10)–(6.1.12), we have       E (ˆ xLS − x)(Cb)H = E (AH A)−1 AH bbH CH − E xbH CH     = (AH A)−1 AH E bbH CH − xE bH CH   = (AH A)−1 AH AxxH AH + σ 2 I CH − xxH AH CH = O,   H E Cb(ˆ xLS − x) = O, = E (ˆ xLS − x)(Cb)H     H H  H H = CE bb C = C AxxH AH + σ 2 I CH E Cbb C 

 H

= σ 2 CCH . Hence, (6.1.13) can be simplified to Cov(˜ x) = Cov(ˆ xLS ) + σ 2 CCH .

(6.1.14)

By the property tr(A + B) = tr(A) + tr(B) of the trace function, and noticing that for a random vector x with zero mean, tr(Cov(x)) = Var(x), we can rewrite (6.1.14) as xLS ), Var(˜ x) = Var(ˆ xLS ) + σ 2 tr(CCH ) ≥ Var(ˆ ˆ LS has the minimum variance, and hence because tr(CCH ) ≥ 0. This shows that x is the optimal unbiased solution to the matrix equation Ax = b + e. Notice that the condition in the Gauss–Markov theorem Cov(e) = σ 2 I implies that all components of the additive noise vector e are mutually uncorrelated and have the same variance σ 2 . Only in this case is the LS solution unbiased and optimal. Now consider the relationship between the LS solution and the maximum likelihood (ML) solution under the conditions of the Gauss–Markov theorem. If the additive noise vector e = [e1 , . . . , em ]T is a complex Gaussian random vector with independently identical distribution (iid) then from (1.5.18) it is known that the pdf of e is given by f (e) =

  1 exp −(e − μe )H Γ−1 e (e − μe ) , π m |Γe |

(6.1.15)

2 . where |Γe | = σ12 · · · σm Under the conditions of the Gauss–Markov theorem (i.e., all the iid Gaussian random variables of the error vector have zero mean and the same variance σ 2 ), the pdf of the additive error vector reduces to     1 1 H 1 1 2 f (e) = (6.1.16) exp − 2 e e = exp − 2 e 2 , (πσ 2 )m σ (πσ 2 )m σ

whose likelihood function is L(e) = log f (e) = −

1 1

e 22 = − m 2(m+1) Ax − b 22 . π m σ 2(m+1) π σ

(6.1.17)

320

Solving Matrix Equations

Thus the ML solution of the matrix equation Ax = b is given by " # 1 2 ˆ ML = arg max − m 2(m+1) Ax − b 2 x x π σ " # 1 2 ˆ LS .

Ax − b 2 = x = arg min x 2

(6.1.18)

That is to say, under the conditions of the Gauss–Markov theorem, the ML solution ˆ LS are equivalent for the matrix equation Ax = b. ˆ ML and the LS solution x x It is easily seen that when the error vector e is a zero-mean Gaussian random vector but its entries have differing variances since the covariance matrix Γe = σ 2 I, the ML solution in this case is given by " #  H −1  1 ˆ ML = arg max exp −e Γe e , x 2 x π m σ12 · · · σm ˆ LS ; namely, the LS solution is no which is obviously not equal to the LS solution x longer optimal in the case of differing variances.

6.1.3 Data Least Squares Consider again an over-determined matrix equation Ax = b. Unlike in an ordinary LS problem, it is assumed that the data vector b does not contain the observed error or noise; on the contrary, the data matrix A = A0 + E contains the observed error or noise, and every element of the error matrix E obeys an independent Gaussian distribution with the zero-mean and equal variance. The perturbation matrix ΔA is used to correct the erroneous-data matrix A in such a way that A + ΔA = A0 + E + ΔA → A0 . As with the ordinary LS method, by forcing (A + ΔA)x = b to compensate the errors in the data matrix, one can obtain (A + ΔA)x = b ⇒ A0 x = b. In this case, the optimal solution of x is given by ˆ DLS = arg min ΔA 22 x x

subject to

b ∈ Range(A + ΔA).

(6.1.19)

This is the data least squares (DLS) method. In the above equation, the constraint condition b ∈ Range(A + ΔA) implies that, for every given exact data vector b ∈ Cm and the erroneous data matrix A ∈ Cm×n , we can always find a vector x ∈ Cn such that (A+ΔA)x = b. Hence the two constraint conditions b ∈ Range(A+ΔA) and (A + ΔA)x = b are equivalent. By using the Lagrange multiplier method, the constrained data LS problem (6.1.19) becomes the unconstrained optimization problem   min L(x) = tr(ΔA(ΔA)H ) + λH (Ax + ΔAx − b) . (6.1.20)

6.2 Tikhonov Regularization and Gauss–Seidel Method

321

Let the conjugate gradient matrix ∂L(x)/(∂ΔAH ) = O (the zero matrix). Then we have ΔA = −λxH . Substitute ΔA = −λxH into the constraint condition (A + ΔA)x = b to yield λ = (Ax − b)/(xH x), and thus ΔA = −(Ax − b)xH /(xH x). Therefore, the objective function is given by   (Ax − b)xH x(Ax − b)H 2 H . J(x) = ΔA 2 = tr(ΔA(ΔA) ) = tr xH x xH x Then, using the trace function property tr(BC) = tr(CB), we immediately have (Ax − b)H (Ax − b) J(x) = , (6.1.21) xH x which yields (Ax − b)H (Ax − b) ˆ DLS = arg min . (6.1.22) x x xH x This is the data least squares solution of the over-determined equation Ax = b.

6.2 Tikhonov Regularization and Gauss–Seidel Method Suppose that we are solving an over-determined matrix equation Am×n xn×1 = bm×1 (where m > n). The ordinary LS method and the data LS method are based on two basic assumptions: (1) the data matrix A is nonsingular or of full column rank; (2) the data vector b or the data matrix A includes additive noise or error. In this section we discuss the regularization method for solving matrix equations with rank deficiency or noise.

6.2.1 Tikhonov Regularization When m = n and A is nonsingular, the solution of the matrix equation Ax = b ˆ = A−1 b; and when m > n and Am×n is of full column rank, the is given by x ˆ LS = A† b = (AH A)−1 AH b. solution is x A problem is that the data matrix A is often rank deficient in engineering applicaˆ = A−1 b or x ˆ LS = (AH A)−1 AH b may diverge; tions. In these cases, the solution x even if a solution exists, it may be a bad approximation to the unknown vector x. If, however, we were lucky and happened to find a reasonable approximation of ˆ ≤ A−1 Aˆ ˆ ≤ A† Aˆ x, the error estimate x − x x − b or x − x x − b would be greatly disappointing [347]. By observation, it can easily be seen that the problem lies in the inversion of the covariance matrix AH A of the rank-deficient data matrix A. As an improvement of the LS cost function 12 Ax − b 22 , Tikhonov [464] in 1963 proposed the regularized least squares cost function  1 J(x) = (6.2.1)

Ax − b 22 + λ x 22 , 2

322

Solving Matrix Equations

where λ ≥ 0 is the regularization parameter. The conjugate gradient of the cost function J(x) with respect to the argument x is given by  ∂J(x) ∂  = (Ax − b)H (Ax − b) + λxH x = AH Ax − AH b + λx. ∂xH ∂xH Let ∂J(x)/∂ xH = 0; then ˆ Tik = (AH A + λI)−1 AH b. x

(6.2.2)

This method using (AH A + λI)−1 instead of the direct inverse of the covariance matrix (AH A)−1 is called the Tikhonov regularization method (or simply the regularized method). In signal processing and image processing, the regularization method is sometimes known as the relaxation method. The Tikhonov-regularization based solution to the matrix equation Ax = b is called the Tikhonov regularization solution, denoted xTik . The idea of the Tikhonov regularization method is that by adding a very small perturbation λ to each diagonal entry of the covariance matrix AH A of a rank deficient matrix A, the inversion of the singular covariance matrix AH A becomes the inversion of a nonsingular matrix AH A + λI, thereby greatly improving the numerical stability of the solution process for the rank-deficient matrix equation Ax = b. Obviously, if the data matrix A is of full column rank but includes error or noise, we must adopt the opposite method to Tikhonov regularization and add a very small negative perturbation, −λ, to each diagonal entry of the covariance matrix AH A. Such a method using a very small negative perturbation matrix −λI is called the anti-Tikhonov regularization method or anti-regularized method, and the solution is given by ˆ = (AH A − λI)−1 AH b. x

(6.2.3)

The total least squares method is a typical anti-regularized method and will be discussed later. As stated above, the regularization parameter λ should take a small value so that (AH A + λI)−1 can better approximate (AH A)−1 and yet can avoid the singularity of AH A; hence the Tikhonov regularization method can obviously improve the numerical stability in solving the singular and ill-conditioned equations. This is so because the matrix AH A is positive semi-definite, so the eigenvalues of AH A + λI lie in the interval [λ, λ + A 2F ], which gives a condition number cond(AH A + λI) ≤ (λ + A 2F )/λ.

(6.2.4)

Clearly, as compared with the condition number cond(AH A) ≤ ∞, the condition number of the Tikhonov regularization is a great improvement. In order to improve further the solution of singular and ill-conditioned matrix equations, one can adopt iterative Tikhonov regularization [347].

6.2 Tikhonov Regularization and Gauss–Seidel Method

323

Let the initial solution vector x0 = 0 and the initial residual vector r0 = b; then the solution vector and the residual vector can be updated via the following iteration formulas: xk = xk−1 + (AH A + λI)−1 AH rk−1 , (6.2.5) rk = b − Axk , k = 1, 2, . . . Let A = UΣVH be the SVD of the matrix A, then AH A = VΣ2 VH and thus the ordinary LS solution and the Tikhonov regularization solution are respectively given by ˆ LS = (AH A)−1 AH b = VΣ−1 UH b, x H

ˆ Tik = (A A + x

2 σmin I)−1 AH b

(6.2.6) 2

= V(Σ +

2 σmin I)−1 ΣUH b,

(6.2.7)

where σmin is the minimal nonzero singular value of A. If the matrix A is singular or ill-conditioned, i.e., σn = 0 or σn ≈ 0, then some diagonal entry of Σ−1 will be 1/σn = ∞, which thereby makes the LS solution diverge. On the contrary, the SVD ˆ Tik has good numerical stability because based Tikhonov regularization solution x the diagonal entries of the matrix   σn σ1 2 2 −1 (Σ + δ I) Σ = Diag (6.2.8) ,..., 2 2 2 σ12 + σmin σn + σmin 2 lie in the interval [0, σ1 (σ12 + σmin )−1 ]. When the regularization parameter λ varies in the definition interval [0, ∞), the family of solutions for a regularized LS problem is known as its regularization path. A Tikhonov regularization solution has the following important properties [248].

ˆ Tik = (AH A + λI)−1 AH b 1. Linearity The Tikhonov regularization LS solution x is a linear function of the observed data vector b. 2. Limit characteristic when λ → 0 When the regularization parameter λ → 0, the Tikhonov regularization LS solution converges to the ordinary LS solution ˆ LS = A† b = (AH A)−1 AH b. The ˆ Tik = x or the Moore–Penrose solution lim x λ→0 ˆ Tik has the minimum 2 -norm among all the feasible points solution point x meeting AH (Ax − b) = 0: ˆ Tik = x

arg min AT (b−Ax)=0

x 2 .

(6.2.9)

3. Limit characteristic when λ → ∞ When λ → ∞, the optimal solution of the ˆ Tik = 0. Tikhonov regularization LS problem converges to the zero vector: lim x λ→∞

4. Regularization path When the regularization parameter λ varies in [0, ∞), the optimal solution of the Tikhonov regularization LS problem is a smooth function of the regularization parameter, i.e., when λ decreases to zero, the optimal solution converges to the Moore–Penrose solution and when λ increases, the optimal solution converges to the zero vector.

324

Solving Matrix Equations

Tikhonov regularization can effectively prevent the divergence of the LS solution ˆ LS = (AT A)−1 AT b when A is rank deficient; thereby it obviously improves the x convergence property of the LS algorithm and the alternating LS algorithm, and so is widely applied.

6.2.2 Regularized Gauss–Seidel Method Let Xi ⊆ Rni be the feasible set of the ni × 1 vector xi . Consider the nonlinear minimization problem   (6.2.10) min f (x) = f (x1 , . . . , xm ) , x∈X

where x ∈ X = X1 × X2 × · · · × Xm ⊆ Rn is the Cartesian product of a closed m  nonempty convex set Xi ⊆ Rni , i = 1, . . . , m, and ni = n. i=1

Equation (6.2.10) is an unconstrained optimization problem with m coupled variable vectors. An efficient approach for solving this class of coupled optimization problems is the block nonlinear Gauss–Seidel method, called simply the GS method [42], [190]. In every iteration of the GS method, m − 1 variable vectors are regarded as known, and the remaining variable vector is minimized. This idea constitutes the basic framework of the GS method for solving nonlinear unconstrained optimization problem (6.2.10). (1) Initialize m − 1 variable vectors xi , i = 2, . . . , m, and let k = 0. (2) Find the solution of the separated suboptimization problem   k k xk+1 = arg min f xk+1 , . . . , xk+1 i = 1, . . . , m. (6.2.11) 1 i i−1 , y, xi+1 , . . . , xm , y∈Xi

At the (k + 1)th iteration of updating the vector xi , the vectors x1 , . . . , xi−1 have been updated as xk+1 , . . . , xk+1 1 i−1 , so these updated subvectors and the vectors xki+1 , . . . , xkm yet be updated are to be regarded as known. (3) Test whether the m variable vectors are all convergent. If they are convergent then output the optimization results (xk+1 , . . . , xk+1 m ); otherwise, let k ← k + 1 1, return to Equation (6.2.11) and continue to iterate until the convergence criterion is met. If the objective function f (x) of the optimization (6.2.10) is an LS error function (for example Ax − b 22 ), then the GS method is customarily called the alternating least squares (ALS) method. EXAMPLE 6.1 Consider the full-rank decomposition of an m × n known data matrix X = AB, where the m × r matrix A is of full column rank and the r × n matrix

6.2 Tikhonov Regularization and Gauss–Seidel Method

325

B is of full row rank. Let the cost function of the matrix full-rank decomposition be 1 f (A, B) = X − AB 2F . (6.2.12) 2 Then the ALS algorithm first initializes the matrix A. At the (k + 1)th iteration, from the fixed matrix Ak we update the LS solution of B as follows: Bk+1 = (ATk Ak )−1 ATk X.

(6.2.13)

Next, from the transpose of the matrix decomposition XT = BT AT , we can immediately update the LS solution of AT : ATk+1 = (Bk+1 BTk+1 )−1 Bk+1 XT .

(6.2.14)

The above two kinds of LS procedures are performed alternately. Once the ALS algorithm converges, the optimization results of the matrix decomposition can be obtained. To analyze the convergence of the GS method, we first introduce the concepts of the limit point and the critical point. Let S be a subset in topological space X; a point x in the space X is called a limit point of the subset S if every neighbourhood of x contains at least one point in S apart from x itself. In other words, the limit point x is a point in topological space X approximated by a point in S (other than x). A point on a function curve at which the derivative is equal to zero or does not exist plays an important role in the optimization problem. A point x is known as a critical point of the function f (x) if x lies in the definition domain of the function and the derivative f (x) = 0 or does not exist. The geometric interpretation of a critical point is that the tangent of the point on the curve is horizontal or vertical, or does not exist.

3 Consider a real function f (x) = x4 − 4x2 whose derivative √ f (x)√= 4x − 8x.

From f (x) = 0 we get three critical points of f (x): x = 0, − 2 and 2. The above concepts on the limit point and critical points of a scalar variable are easily extended to a vector variable. DEFINITION 6.1 The vector x ∈ Rn is said to be a limit point of the vector n ∞ sequence {xk }∞ k=1 in the vector space R if there is a subsequence of {xk }k=1 that converges to x. DEFINITION 6.2 Let f : X → R (where X ⊂ Rn ) be a real function; the vector ¯ ∈ Rn is known as a critical point of the function f (x) if the following condition x is satisfied: ¯ ) ≥ 0, gT (¯ x)(y − x

∀ y ∈ X,

(6.2.15)

where gT (¯ x) represents the transpose of the vector function g(¯ x) and either g(¯ x) =

326

Solving Matrix Equations

∇f (¯ x) is the gradient vector of a continuous differentiable function f (x) at the ¯ or g(¯ point x x) ∈ ∂f (¯ x) is the subgradient vector of a nondifferentiable nonsmooth ¯. function f (x) at the point x ¯ is an interior point in X then the critical-point condition (6.2.15) If X = Rn or x reduces to the stationary-point condition 0 = ∇f (¯ x) (for a continuous differentiable objective function) or 0 ∈ ∂f (¯ x) (for a nonsmooth objective function) for the unconstrained minimization problem min f (x). Let xk = (xk1 , . . . , xkm ) denote the iteration results generated by the GS algorithm; it is expected that the iterative sequence {xk }∞ k=1 has limit points and that each limit point is a critical point of the objective function f . For the optimization problem (6.2.10) this convergence performance of the GS algorithm depends on the quasi-convexity of the objective function f . Put α ∈ (0, 1) and yi = xi . The objective function f (x) in the ALS problem (6.2.10) is called the quasi-convex with respect to xi ∈ Xi if f (x1 , . . . , xi−1 , αxi + (1 − α)yi , xi+1 , . . . , xm ) ≤ max{f (x), f (x1 , . . . , xi−1 , yi , xi+1 , . . . , xm )}. The function f (x) is known as strictly quasi-convex, if f (x1 , . . . , xi−1 , αxi + (1 − α)yi , xi+1 , . . . , xm ) < max{f (x), f (x1 , . . . , xi−1 , yi , xi+1 , . . . , xm )}. The following give the convergence performance of the GS algorithm under different assumption conditions. THEOREM 6.2 [190] Let the function f be strictly quasi-convex with respect to the vectors xi (i = 1, . . . , m − 2) in X, and let the sequence generated by the GS ¯ of {xk } is a critical point method, {xk }, have limit points; then every limit point x of the optimization problem (6.2.10). THEOREM 6.3 [42] For the objective function f in Equation (6.2.10), it is assumed that, for every i and x ∈ X, the minimum point of the optimization algorithm min f (x1 , . . . , xi−1 , y, xi+1 , . . . , xm )

y∈Xi

is uniquely determined. If {xk } is the iterative sequence generated by the GS algorithm then every limit point of xk is a critical point. However, in practical applications, the objective function in the optimization (6.2.10) often does not satisfy the conditions of the above two theorems. For example, by Theorem 6.2, in the case of a rank-deficient matrix A, the quadratic objective function Ax − b 22 is not quasi-convex, so the convergence is not be guaranteed.

6.2 Tikhonov Regularization and Gauss–Seidel Method

327

The sequence generated by the GS method contains limit points, but they may not be the critical points of the optimization problem (6.2.10). The fact that the GS algorithm may not converge was observed by Powell in 1973 [399], who called it the “circle phenomenon” of the GS method. Recently, a lot of simulation experiences have shown [338], [293] that even it converges, the iterative process of the ALS method easily falls into a “swamp”: an unusually large number of iterations leads to a very slow convergence rate. In particular, as long as the column rank of one variable matrix in a set of m variable matrices is deficient, or, although the m variable matrices are of full-column rank, there is collinearity among the column vectors of some of the variable matrices, one easily observes this swamp phenomenon [338], [293]. A simple and effective way for avoiding the circle and swamp phenomena of the GS method is to make a Tikhonov regularization of the objective function in the optimization problem (6.2.10); namely, the separated suboptimization algorithm (6.2.11) is regularized as   1 k k k 2 xk+1 = arg min f (xk+1 , . . . , xk+1 1 i i−1 , y, xi+1 , . . . , xm ) + τi y − xi 2 , 2 y∈Xi (6.2.16) where i = 1, . . . , m. The above algorithm is called the proximal point version of the GS methods [18], [44], abbreviated as the PGS method. The role of the regularization term y − xki 22 is to force the updated vector k+1 xi = y to be close to xki ; this avoids any violent shock in the iterative process and so prevents the divergence of the algorithm. The GS or PGS method is said to be well defined, if each suboptimization problem has an optimal solution [190]. THEOREM 6.4 [190] If the PGS method is well-defined, and in the sequence {xk } ¯ of {xk } is a critical point of the there exist limit points, then every limit point x optimization problem (6.2.10). This theorem shows that the convergence performance of the PGS method is better than that of the GS method. Many simulation experiments show [293] that under the condition of achieving the same error, the iterative number of the GS method in a swamp iteration is unusually large whereas the PGS method tends to converge quickly. The PGS method is also called the regularized Gauss–Seidel method in some literature. The ALS method and the regularized ALS method have important applications in nonnegative matrix decomposition and tensor analysis; these will be discussed in later chapters.

328

Solving Matrix Equations

6.3 Total Least Squares (TLS) Methods Although its original name is different, the total least squares (TLS) method has a long history. The earliest ideas about TLS can be traced back to the paper of Pearson in 1901 [381] who considered an approximate method for solving the matrix equation Ax = b when in both A and b there exist errors. However, as late as 1980, Golub and Van Loan [178] for the first time gave an overall treatment from the point of view of numerical analysis and formally referred to this method as the total least squares. In mathematical statistics the method is called orthogonal regression or errors-in-variables regression [175]. In system identification the TLS method is called the characteristic vector method or the Koopmans–Levin method [486]. Now the TLS method is widely used in statistics, physics, economics, biology and medicine, signal processing, automatic control, system science and many other disciplines and fields.

6.3.1 TLS Problems Let A0 and b0 represent an unobservable error-free data matrix and an error-free data vector, respectively. The actual observed data matrix and data vector are respectively given by A = A0 + E,

b = b0 + e,

(6.3.1)

where E and e express the error data matrix and the error data vector, respectively. The basic idea of the TLS is not only to use a correction vector Δb to perturb the data vector b, but also to use a correction matrix ΔA to perturb the data matrix A, thereby making a joint compensation for errors or noise in both A and b: b + Δb = b0 + e + Δb → b0 , A + ΔA = A0 + E + ΔA → A0 . The purpose is to suppress the influence of the observation error or noise on the matrix equation solution in order to transform the solution of a noisy matrix equation into the solution of an error-free matrix equation: (A + ΔA)x = b + Δb



A0 x = b 0 .

(6.3.2)

Naturally, we want the correction data matrix and the correction data vectors to be as small as possible. Hence the TLS problem can be expressed as the constrained optimization problem TLS:

min

ΔA,Δb,x

[ΔA, Δb] 2F

subject to

(A + ΔA)x = b + Δb

(6.3.3)

or TLS:

min D 2F z

subject to Dz = −Bz,

(6.3.4)

6.3 Total Least Squares (TLS) Methods

where D = [ΔA, Δb], B = [A, b] and z =

x −1



329

is an (n + 1) × 1 vector.

Under the assumption z 2 = 1, from D 2 ≤ D F and D 2 = sup Dz 2

z 2 =1

we have

min D 2F z

=

min Dz 22 z

and thus we can rewrite (6.3.4) as

TLS:

min Dz 22

subject to

z = 1

(6.3.5)

TLS:

min Bz 22

subject to

z = 1,

(6.3.6)

z

or z

since Dz = −Bz.

6.3.2 TLS Solution There are two possible cases in the TLS solution of over-determined equations. Case 1

Single smallest singular value

The singular value σn of B is significantly larger than the singular value σn+1 , i.e., there is only one smallest singular value. Equation (6.3.6) shows that the TLS problem can be summarized as follows. Find a perturbation matrix D ∈ Cm×(n+1) with the minimum norm squared such that B + D is of non full rank (if full rank then there is only the trivial solution z = 0). The TLS problem (6.3.6) is easily solved via the Lagrange multiplier method. To this end, define the objective function J(z) = Bz 22 + λ(1 − zH z),

(6.3.7) ∂J(z)

= where λ is the Lagrange multiplier. Noticing that Bz 22 = zH BH Bz, from ∂z∗ 0 it follows that BH Bz = λz.

(6.3.8)

This shows that the Lagrange multiplier should be selected as the smallest eigenvalue λmin of the matrix BH B = [A, b]H [A, b] (i.e., the square of the smallest singular value of B), while the TLS solution vector z is the

eigenvector corresponding to x λmin . In other words, the TLS solution vector −1 is the solution of the Rayleigh quotient minimization problem ⎧ ⎫ H

x x ⎨ [A, b]H [A, b] −1

Ax − b 22 ⎬ −1 min J(x) = = . (6.3.9) H

x x x ⎩

x 22 + 1 ⎭ −1

−1

Let the SVD of the m × (n + 1) augmented matrix B be B = UΣVH ; let its singular values be arranged in the order σ1 ≥ · · · ≥ σn+1 and let their corresponding

330

Solving Matrix Equations

right singular vectors be v1 , . . . , vn+1 . Then, according to the above analysis, the TLS solution is z = vn+1 , namely ⎡ ⎤ v(1, n + 1) 1 ⎢ ⎥ .. xTLS = − (6.3.10) ⎣ ⎦, . v(n + 1, n + 1) v(n, n + 1) where v(i, n + 1) is the ith entry of (n + 1)th column of V. Remark If the augmented data matrix is given by B = [−b, A] then the TLS solution is provided by ⎡ ⎤ v(2, n + 1) 1 ⎢ ⎥ .. xTLS = (6.3.11) ⎣ ⎦. . v(1, n + 1) v(n + 1, n + 1) Case 2

Multiple smallest singular values

In this case, there are multiple smallest singular values of B, i.e., the smallest singular values are repeated or are very close. Let σ1 ≥ σ2 ≥ · · · ≥ σp > σp+1 ≈ · · · ≈ σn+1 ,

(6.3.12)

and vi be any column in the subspace S = Span{vp+1 , vp+2 , . . . , vn+1 }. Then above   y any right singular vector vp+i = p+i gives a TLS solution αp+i xi = −yp+i /αp+i ,

i = 1, . . . , n + 1 − p.

Hence there are n+1−p TLS solutions. The interesting solution is the TLS solution that is unique in some sense. The possible unique TLS solutions are of two kinds: (1) the minimum norm solution, composed of n parameters; (2) the optimal LS approximate solution, containing only p parameters. 1. Minimum Norm Solution The minimum norm solution of the matrix equation Am×n xn×1 = bm×1 is a TLS solution with n parameters. The TLS algorithm for finding the minimum norm solution was proposed by Golub and Van Loan [178]; it is given as Algorithm 6.1. Remark If the augmented data matrix is given by B = [−b, A], then the Householder transformation in step 3 becomes ⎡ ⎤ .. ⎢ α .. 0 · · · 0 ⎥ ⎥ V1 Q = ⎢ (6.3.13) ⎣ - - ..- - - - - - ⎦ , .. y . ×

6.3 Total Least Squares (TLS) Methods Algorithm 6.1

331

TLS algorithm for minimum norm solution

input: A ∈ Cm×n , b ∈ Cm , α > 0. repeat 1. Compute B = [A, b] = UΣVH , and save V and all singular values. 2. Determine the number p of principal singular values. 3. Put V1 = [vp+1 , . . . , vn+1 ], and compute the Householder transformation

. y .. × V1 Q = - - ...- - - - - - , α .. 0 · · · 0 where α is a scalar, and × denotes the irrelevant block. 4. exit if α = 0. return p ← p − 1. output: xTLS = −y/α.

and the output is given by xTLS = y/α. It should be noticed that, like the unknown parameter vector x of the original matrix equation Ax = b, the minimum norm solution xTLS contains n parameters. From this fact it is seen that even if the effective rank of B is p < n, the minimum norm solution still assumes that n unknown parameters of the vector x are mutually independent. As a matter of fact, because both the augmented matrix B = [A, b] and the original data matrix A have the same rank, the rank of A is also p. This implies that only p columns of A are linearly independent and thus that the number of the principal parameters in the original matrix equation Ax = b is p rather than n. To sum up, the minimum norm solution of the TLS problem contains some redundant parameters that is linearly dependent on other parameters. In signal processing and system theory, the unique TLS solution with no redundant parameters is more interesting, since it is the optimal LS approximate solution. 2. Optimal Least Squares Approximate Solution ˆ be an optimal proximation with rank p of the First let the m × (n + 1) matrix B augmented matrix B, i.e., ˆ = UΣp VH , B

(6.3.14)

where Σp = Diag(σ1 , . . . , σp , 0, . . . , 0). ˆ (p) be a submatrix of the m × (n + 1) optimal Then let the m × (p + 1) matrix B j ˆ defined as approximate matrix B, ˆ (p) : submatrix consisting of the jth to the (j + p)th columns of B. ˆ B j

(6.3.15)

332

Solving Matrix Equations

ˆ (p) , B ˆ (p) , . . . , B ˆ (p) Clearly, there are (n + 1 − p) submatrices B 1 2 n+1−p . As stated before, the fact that the efficient rank of B is equal to p means that p components are linearly independent in the parameter vector x. Let the (p + 1) × 1

(p) vector be a = x−1 , where x(p) is the column vector consisting of the p linearly independent unknown parameters of the vector x. Then, the original TLS problem becomes that of solving the following n + 1 − p TLS problems: ˆ (p) a = 0 , B j

j = 1, 2, . . . , n + 1 − p

or equivalently that of solving the synthetic TLS problem ⎡ ⎤ ˆ : p + 1) B(1 ⎢ ⎥ .. ⎣ ⎦ a = 0, . ˆ + 1 − p : n + 1) B(n

(6.3.16)

(6.3.17)

ˆ : p + i) = B ˆ (p) is defined in (6.3.15). It is not difficult to show that where B(i i ˆ : p + i) = B(i

p 

σk uk (vki )H ,

(6.3.18)

k=1

where vki is a windowed segment of the kth column vector of V, defined as vki = [v(i, k), v(i + 1, k), . . . , v(i + p, k)]T .

(6.3.19)

Here v(i, k) is the (i, k)th entry of V. According to the least squares principle, finding the LS solution of Equation (6.3.17) is equivalent to minimizing the measure (or cost) function ˆ : p + 1)a]H B(1 ˆ : p + 1)a + [B(2 ˆ : p + 2)a]H B(2 ˆ : p + 2)a f (a) = [B(1 ˆ + 1 − p : n + 1)a]H B(n ˆ + 1 − p : n + 1)a + · · · + [B(n  n+1−p  ˆ : p + i) a. ˆ : p + i)]H B(i (6.3.20) = aH [B(i i=1

Define the (p + 1) × (p + 1) matrix S(p) =

n+1−p 

ˆ : p + i), ˆ : p + i)]H B(i [B(i

(6.3.21)

i=1

then the measure function can be simply written as f (a) = aH S(p) a.

(6.3.22)

The minimal variable a of the measure function f (a) is given by ∂f (a)/∂a∗ = 0 below: S(p) a = αe1 ,

(6.3.23)

6.3 Total Least Squares (TLS) Methods

333

where e1 = [1, 0, . . . , 0]T and the constant α > 0 represents the error energy. From (6.3.21) and (6.3.18) we have S(p) =

p 

n+1−p 

j=1

i=1

σj2 vji (vji )H .

(6.3.24)

Solving the matrix equation (6.3.23) is simple. If we let S−(p) be the inverse matrix S(p) , then the solution vector a depends only on the first column of the inverse matrix S−(p) . it is easily seen that the ith entry of x(p) = [xTLS (1), . . . , xTLS (p)]T (p)

in the TLS solution vector a = x−1 is given by xTLS (i) = −S−(p) (i, 1)/S−(p) (p + 1, 1),

i = 1, . . . , p.

(6.3.25)

This solution is known as the optimal least squares approximate solution. Because the number of parameters in this solution and the effective rank are the same, it is also called a low-rank TLS solution [74]. Notice that if the augmented matrix B = [−b, A] then xTLS (i) = S−(p) (i + 1, 1)/S−(p) (1, 1),

i = 1, 2, . . . , p.

(6.3.26)

In summary, the algorithm for the low-rank TLS solution is given in Algorithm 6.2. The basic idea of this algorithm was proposed by Cadzow in 1982 [74]. Algorithm 6.2

SVD-TLS algorithm

input: A ∈ Cm×n , b ∈ Cn . 1. Compute the SVD B = [A, b] = UΣVH , and save V. 2. Determine the effective rank p of B. 3. Use (6.3.24) and (6.3.19) to calculate the (p + 1) × (p + 1) matrix S(p) . 4. Compute S−(p) and xTLS (i) = −S−(p) (i, 1)/S−(p) (p + 1, 1), i = 1, . . . , p. output: xTLS .

6.3.3 Performances of TLS Solution The TLS has two interesting interpretations: one is its geometric interpretation [178] and the other is a closed solution interpretation [510]. 1. Geometric Interpretation of TLS Solution Let aTi be the ith row of the matrix A and bi be the ith entry of the vector b. Then the TLS solution xTLS is the minimal vector such that 0 + n 

Ax − b 22 |aTi x − bi |2 , (6.3.27) min = x

x 22 + 1 xT x + 1 i=1

334

Solving Matrix Equations ! " a

where |aTi x−bi |/(xT x+1) is the distance from the point b i ∈ Rn+1 to the nearest i point in the subspace Px , which is defined as "  # a n×1 T :a∈R , b ∈ R, b = x a . (6.3.28) Px = b Hence the TLS solution can be expressed using the subspace Px [178]: sum of the ! " a

squared distances from the TLS solution point b i to points in the subspace Px is i minimized. Figure 6.1 shows, for comparison, the TLS solution and the LS solution. b

b = xa (a2 , b2 ) (a3 , b3 )

(a1 , b1 )

a LS solution TLS solution

Figure 6.1 LS solution and TLS solution.

In Figure 6.1, each dotted line, which is a vertical distance parallel to the b-axis, is an LS solution; and each solid line, which starts at the point (ai , bi ) and is the vertical distance to the straight line b = xa, is a TLS solution. From this geometric interpretation it can be concluded that the TLS solution is better than the LS solution, because the residual error of curve fitting given by the TLS solution is smaller. 2. Closed Solution of TLS Problems If the singular values of the augmented matrix B are σ1 ≥ · · · ≥ σn+1 then the TLS solution can be expressed as [510] 2 I)−1 AH b. xTLS = (AH A − σn+1

(6.3.29)

Compared with the Tikhonov regularization method, the TLS is a kind of antiregularization method and can be interpreted as a least squares procedure with 2 I from the covariance matrix noise removal: it first removes the noise term σn+1 T T 2 A A and then finds the inverse matrix of A A − σn+1 I to get the LS solution. Letting the noisy data matrix be A = A0 + E, its covariance matrix AH A = H H A0 A 0 + E H A 0 + A H 0 E + E E. Obviously, when the error matrix E has zero mean, the mathematical expectation of the covariance matrix is given by H H H E{AH A} = E{AH 0 A0 } + E{E E} = A0 A0 + E{E E}.

6.3 Total Least Squares (TLS) Methods

335

If the column vectors of the error matrix are statistically independent and have the 2 same variance, i.e., E{ET E} = σ 2 I, then the smallest eigenvalue λn+1 = σn+1 of H the (n + 1) × (n + 1) covariance matrix A A is the square of the singular value of 2 the error matrix E. Because the square of the singular value σn+1 happens to reflect 2 the common variance σ of each column vector of the error matrix, the covariance H 2 matrix AH 0 A0 of the error-free data matrix can be retrieved from A A − σn+1 I, T 2 H namely as A A − σn+1 I = A0 A0 . In other words, the TLS method can effectively restrain the influence of the unknown error matrix. It should be pointed out that the main difference between the TLS method and the Tikhonov regularization method for solving the matrix equation Am×n xn = bm is that the TLS solution contains only p = rank([A, b]) principal parameters, and excludes the redundant parameters, whereas the Tikhonov regularization method can only provide all n parameters including the redundant parameters.

6.3.4 Generalized Total Least Squares The ordinary LS, the data LS, Tikhonov regularization and the TLS method can be derived and explained by a unified theoretical framework. Consider the minimization problem 32 2 2 2 4 2[ΔA, Δb]22 + λ2x22 (6.3.30) min F 2 ΔA,Δb,x

subject to (A + αΔA)x = b + βΔb, where α and β are the weighting coefficients of respectively the perturbation ΔA of the data matrix A and the perturbation Δb of the data vector b, and λ is the Tikhonov regularization parameter. The above minimization problem is called the generalized total least squares (GTLS) problem. 1. Comparison of Optimization Problems (1) Ordinary least squares: α = 0, β = 1 and λ = 0, which gives min

ΔA,Δb,x

Δb 22

subject to Ax = b + Δb.

(6.3.31)

(2) Data least squares: α = 1, β = 0 and λ = 0, which gives min

ΔA 2F

(A + ΔA)x = b.

(6.3.32)

(3) Tikhonov regularization: α = 0, β = 1 and λ > 0, which gives   min

Δb 22 + λ x 22 subject to Ax = b + Δb.

(6.3.33)

ΔA,Δb,x

ΔA,Δb,x

subject to

336

Solving Matrix Equations

(4) Total least squares: α = β = 1 and λ = 0, which gives min

ΔA,Δb,x

ΔA, Δb 22

subject to

(A + ΔA)x = b + Δb.

(6.3.34)

2. Comparison of Solution Vectors The constraint condition (A + αΔA)x = (b + βΔb) can be represented as    αx  −1 −1 = 0. [α A, β b] + [ΔA, Δb] −β  αx , the above equation becomes Letting D = [ΔA, Δb] and z = −β   Dz = − α−1 A, β −1 b z. 

(6.3.35)

Under the assumption zH z = 1, we have 2 22 min D 2F = min Dz 22 = min 2[α−1 A, β −1 b]z22 , and thus the solution to the GTLS problem (6.3.30) can be rewritten as +2 0 2 2[α−1 A, β −1 b]z22 2 2 ˆ GTLS = arg min + λ x 2 . x z zH z

(6.3.36)

Noting that zH z = α2 xH x + β 2 and [α−1 A, β −1 b]z = [α−1 A, β −1 b]



 αx = Ax − b, −β

the solution to the GTLS problem (6.3.36) is given by " #

Ax − b 22 2 ˆ GTLS = arg min + λ x x 2 . x α2 x 22 + β 2

(6.3.37)

When the weighting coefficients α, β and the Tikhonov regularization parameter λ take appropriate values, the GTLS solution (6.3.37) gives the following results: ˆ LS = arg min Ax − b 22 x x

Ax − b 22 x

x 22   = arg min Ax − b 22 + λ x 22

ˆ DLS = arg min x ˆ Tik x

x

ˆ TLS = arg min x x

Ax − b 22

x 22 + 1

(α = 0, β = 1, λ = 0),

(6.3.38)

(α = 1, β = 0, λ = 0),

(6.3.39)

(α = 0, β = 1, λ > 0),

(6.3.40)

(α = 1, β = 1, λ = 0).

(6.3.41)

6.3 Total Least Squares (TLS) Methods

337

3. Comparison of Perturbation Methods (1) Ordinary LS method This uses the possible small correction term Δb to perturb the data vector b in such a way that b − Δb ≈ b0 , and thus compensates for the observed noise e in b. The correction vector is selected as Δb = Ax − b, ˆ LS = (AH A)−1 AH b. and the analytical solution is x (2) Data LS method The correction term ΔA = (Ax − b)xH /(xH x) compensates the observed error matrix E in the data matrix A. The data LS solution is ˆ DLS = arg min x x

(Ax − b)H (Ax − b) . xH x

(3) Tikhonov regularization method This adds the same perturbation term λ > 0 to every diagonal entry of the matrix AH A to avoid the numerical instability ˆ Tik = (AH A + of the LS solution (AH A)−1 AH b. The analytical solution is x −1 H λI) A b. (4) TLS method By subtracting the perturbation matrix λI, the noise or perturbation in the covariance matrix of the original data matrix is restrained. There are three kinds of TLS solution: the minimum norm solution, the antiˆ TLS = (AH A − λI)−1 AH b and the regularization solution with n components x SVD -TLS solution with only p = rank([A, b]) principal parameters. 4. Comparison of Application Ranges (1) The LS method is applicable for a data matrix A with full column rank and a data vector b containing iid Gaussian errors. (2) The data LS method is applicable for a data vector b without error and a data matrix A that has full column rank and iid Gaussian error column vectors. (3) The Tikhonov regularization method is applicable for a data matrix A with deficient column rank. (4) The TLS method is applicable for a data matrix A with full column rank, where both A and the data vector b contain iid Gaussian error.

6.3.5 Total Least Squares Fitting In the numerical analysis of science and engineering problems, it is usually necessary to fit a curve or a curved surface to a given set of points. Because these data points have generally been observed, they inevitably contain errors or are contaminated by noise. In such cases the TLS method is expected to provide better fitting results than the ordinary LS method. Consider a data fitting process: given n data points (x1 , y1 ), . . . , (xn , yn ), we want to fit a straight line to these points. Assume the straight line equation is ax+by−c = 0. If the straight line goes through the point (x0 , y0 ) then c = ax0 +by0 .

338

Solving Matrix Equations

Now consider fitting a straight line through the center x ¯, y¯ of n known data points n n 1 1 xi , y¯ = y. (6.3.42) x ¯= n i=1 n i=1 i Substituting c = a¯ x + b¯ y into ax + by − c = 0, the straight line equation can be written as a(x − x ¯) + b(y − y¯) = 0

(6.3.43)

or equivalently in the slope form m(x − x ¯) + (y − y¯) = 0.

(6.3.44)

The parameter vector [a, b]T is called the normal vector of the fitting straight line, and −m = −a/b is its slope. Then, the straight line fitting problem becomes that of finding the normal vector [a, b]T or the slope parameter m. Evidently, if we are substituting n known data points into a straight line equation then it cannot strictly be satisfied, and thus there are fitting errors. The procedure in LS fitting is to minimize the squared sum of the fitting errors; i.e., the cost function of LS fitting is (1)

n 

(2)

i=1 n 

¯, y¯) = DLS (m, x DLS (m, x ¯, y¯) =

((xi − x ¯) + m(yi − y¯))2 ,

(6.3.45)

(m(xi − x ¯) − (yi − y¯))2 .

(6.3.46)

i=1 (i)

Letting ∂DLS (m, x ¯, y¯)/(∂m) = 0, i = 1, 2, we can find the slope m of the straight line. Then, substituting m into Equation (6.3.44), we can get the fitting equation of the straight line. Unlike LS fitting, TLS fitting minimize the sum of the squared distances from the known data points to the linear equation a(x − x0 ) + b(y − y0 ) = 0. The distance d of a point (p, q) from the straight line ax+by −c = 0 is determined by (ap + bq − c)2 (a(p − x0 ) + b(q − y0 ))2 d2 = = . (6.3.47) a2 + b2 a 2 + b2 Then, the sum of the squared distances from the known n data points to the line a(x − x ¯) + b(y − y¯) = 0 is given by D(a, b, x ¯, y¯) =

n  (a(xi − x ¯) + b(yi − y¯))2 . a 2 + b2 i=1

(6.3.48)

LEMMA 6.1 [350] For a data point (x0 , y0 ) on the line a(x − x ¯)Cb(y − y¯) = 0 we have the relationship D(a, b, x ¯, y¯) ≤ D(a, b, x0 , y0 ),

(6.3.49)

6.3 Total Least Squares (TLS) Methods

339

and the equality holds if and only if x0 = x ¯ and y0 = y¯. Lemma 6.1 shows that the TLS best-fit line must pass through the center of n data points in order to minimize the deviation D. Consider how to minimize the deviation D. To this end, we write D as the norm squared of the product of the 2 × 1 unit vector t = (a2 + b2 )−1/2 [a, b]T and the n × 2 matrix M, i.e., 22 2⎡ ⎤ 2 x1 − x ¯ y1 − y¯  2 2 2 a 2 2⎢ .. ⎥ √ 1 (6.3.50) D(a, b, x ¯, y¯) = Mt 22 = 2⎣ ... 2 , ⎦ . 2 a 2 + b2 b 2 2 2 x −x ¯ yn − y¯ n 2 where



¯ x1 − x ⎢ .. M=⎣ . ¯ xn − x

⎤ y1 − y¯ .. ⎥ . . ⎦

(6.3.51)

yn − y¯

From Equation (6.3.50) one can directly obtain the following result. ¯, y¯) reaches a minimum PROPOSITION 6.1 [350] The distance squared sum D(a, b, x for the unit normal vector t = (a2 + b2 )−1/2 [a, b]T . In this case, the mappingt →  1

Mt 2 reaches the minimum value in the unit sphere S = t ∈ R2  t 2 = 1 . Proposition 6.1 shows that the distance squared sum D(a, b, x ¯, y¯) has a minimum. The following theorem provides a way for finding this minimal distance squared sum. THEOREM 6.5 [350] If the 2 × 1 normal vector t is the eigenvector corresponding to the smallest eigenvalue of the 2 × 2 matrix MT M, then the distance squared sum D(a, b, x ¯, y¯) takes the minimum value σ22 . Here we gives a proof of the above theorem that is simpler than the proof in [350]. Using the fact that t is a unit vector, we have t 2 = 1, and thus the distance squared sum D(a, b, x ¯, y¯) can be written as D(a, b, x ¯, y¯) =

tT MT Mt . tT t

(6.3.52)

This is the typical Rayleigh quotient form. Clearly, the condition that D(a, b, x ¯, y¯) takes a minimum value is that the normal vector t is the eigenvector corresponding to the smallest eigenvalue of the matrix MT M. EXAMPLE 6.2 point to get

Given three data points (2, 1), (2, 4), (5, 1), compute their central 1 3

x ¯ = (2 + 2 + 5) = 3,

1 3

y¯ = (1 + 4 + 1) = 2.

340

Solving Matrix Equations

By subtracting the mean values from the data points, we get the zero-mean data matrix ⎤ ⎤ ⎡ ⎡ −1 −1 2−3 1−2 M = ⎣2 − 3 4 − 2⎦ = ⎣ −1 2 ⎦, 5−3

1−2

and hence

 T

M M= whose EVD is given by ⎡ 1 M M=⎣ T



2 1 −√ 2

1 √ 2 1 √ 2



 ⎦ 9 0

−1

2 6 −3 −3 6

⎡  0 ⎣ 3

1 √ 2 1 √ 2



1

−√

2 1 √ 2

⎤ ⎦=



6 −3

−3 6

 .

√ √ Therefore the normal vector t = [a, b]T = [1/ 2, 1/ 2]T . The TLS best-fit equation is as follows: a(x − x ¯) + b(y − y¯) = 0



1 √ (x 2

− 3) +

1 √ (y 2

− 2) = 0,

i.e., y = −x + 5. In this case, the distance squared sum is 2⎡ ⎤ ⎡ 1 ⎤ 22 2 −1 −1 2 √ 2 2 ⎦2 2 2 2 = 3. ⎣ ⎦ ⎣ ¯, y¯) = Mt 2 = 2 −1 DTLS (a, b, x 2 2 1 √ 2 2 2 −1 2 2 In contrast with the TLS fitting, the cost function of the LS fitting takes the form  1 [m(xi − x ¯) + (yi − y¯)]2 2 m + 1 i=1 3

(1)

DLS (m, x ¯, y¯) = =

1 [(−m − 1)2 + (−m + 2)2 + (2m − 1)2 ]. m2 + 1

(1)

From ∂DLS (m, x ¯, y¯)/(∂m) = 6m − 3 = 0 we get m = 1/2, i.e., its slope is −1/2. In 1 this case, the LS best-fit equation is (x − 3) + (y − 2) = 0, i.e., x + 2y − 7 = 0, 2

(1)

¯, y¯) = 3.6. and the corresponding distance squared sum DLS (m, x Similarly, if instead the LS fitting uses the cost function  1 (m(yi − y¯) + (xi − x ¯))2 m2 + 1 i=1 3

(2)

DLS (m, x ¯, y¯) = =

  1 (−m − 1)2 + (2m − 1)2 + (−m + 2)2 , +1

m2

(2)

then the minimal slope such that DLS (m, x ¯, y¯) is minimized is m =

1 , 2

i.e., the

6.3 Total Least Squares (TLS) Methods

341

best-fit equation is 2x − y − 4 = 0, and the corresponding distance squared sum is (2) DLS (m, x ¯, y¯) = 3.6. Figure 6.2 plots the results of the TLS fitting method and the two LS fitting methods. y ◦ (2, 4)

(3, 2) (2, 1) ◦

◦ (5, 1) x

LS fitting

TLS fitting

LS fitting

Figure 6.2 The LS fitting lines and the TLS fitting line. (1)

(2)

For this example, DLS (m, x ¯, y¯) = DLS (m, x ¯, y¯) > DTLS (a, b, x ¯, y¯), i.e., the two LS fittings have the same fitting deviations, which are larger than the fitting deviation of the TLS. It can be seen that the TLS fitting is thus more accurate than the LS fitting. Theorem 6.5 is easily extended to higher-dimensional cases. Let n data vectors xi = [x1i , . . . , xmi ]T , i = 1, . . . , n represent m-dimensional data points, and 1 x = [¯ x1 , . . . , x ¯ m ]T (6.3.53) n i=1 i n be the mean (i.e., central) vector, where x ¯j = i=1 xji . Now use an m-dimensional normal vector r = [r1 , . . . , rm ]T to fit a hyperplane x to the known data vector, i.e., x satisfies the normal equation n

¯= x

¯ , r = 0. x − x From the n × m matrix ⎡ ⎤ ⎡ ¯ x1 − x ¯1 x11 − x ⎢ .. ⎥ ⎢ . .. M=⎣ . ⎦=⎣ ¯ xn − x

¯1 xn1 − x

(6.3.54)

x12 − x ¯2 .. .

··· .. .

⎤ x1m − x ¯m ⎥ .. ⎦, .

xn2 − x ¯2

···

xnm − x ¯m

(6.3.55)

one can fit the m-dimensional hyperplane, as shown in Algorithm 6.3. It should be noted that if the smallest eigenvalues of the matrix MT M (or the smallest singular values of M) have multiplicity, then correspondingly there are multiple eigenvectors, leading to the result that the hyperplane fitting problem has multiple solutions. The occurrence of this kind of situation shows that a linear data-fitting model may not be appropriate, and thus one should try other, nonlinear, fitting models.

342 Algorithm 6.3

Solving Matrix Equations TLS algorithm for fitting m-dimensional syperplane [350]

input: n data vectors x1 , . . . , xn . ¯= 1 1. Compute the mean vector x n

n

xi .

i=1

2. Use Equation (6.3.55) to form the n × m matrix M. 3. Compute the minimal eigenpair (λ, u) of MT M. Let r = u. ¯ , r = 0. output: Hyperplane x − x

Total least squares methods have been widely used in the following areas: signal processing [241], biomedical signal processing [482], image processing [348], frequency domain system identification [393], [429], variable-error modeling [481], [457], subspace identification of linear systems [485], radar systems [150], astronomy [58], communications [365], fault detection [223] and so on.

6.3.6 Total Maximum Likelihood Method In the previous sections, we presented the LS method, the DLS method, Tikhonov regularization and the TLS method for solving the matrix equation Ax = b + w, where the data matrix A is assumed to be deterministic and known. But, in some important applications, A is random and unobservable. Consider the matrix equation Ax = b + w,

(6.3.56)

where A ∈ Rm×n is the data matrix, x ∈ Rn is an unknown deterministic vector and w ∈ Rm is an unknown perturbation or noise vector subject to the Gaussian 2 distribution N (0, σw ). The data matrix A = [aij ] is a random matrix with random Gaussian variables whose distribution is assumed to be 2 aij , σA ), aij = N (¯

¯ [¯ aij ] = A.

(6.3.57)

Recently, under the assumption that the measurement matrix consists of random Gaussian variables, an alternative approach to solving the matrix equation Ax = b + w has been proposed, [508], [509]. This technique is referred to as the total maximum likelihood (TML) method in [30]. The TML solution of the random matrix equation Ax = b + w is determined by [508] ˆ TML = arg min log p(b; x), x x

(6.3.58)

6.3 Total Least Squares (TLS) Methods

343



 ¯ σ 2 ( x 2 + σ 2 )I . Hence, the solution can be rewritten as where b ∼ N Ax, w A " # ¯ 2  2 

b − Ax 2 2 2 ˆ TML = arg min x . (6.3.59) 2 x 2 + σ 2 + m log σA x 2 + σw x σA w 2 THEOREM 6.6

For any t ≥ 0, let f (t) =

min Ax − b 22 ,

x: x 22 =t

(6.3.60)

and denote the optimal argument by x(t). Then, the maximum likelihood estimator of x in the random matrix equation (6.3.56) is x(t∗ ), where t∗ is the solution to the following unimodal optimization problem: " # f (t) 2 2 + m log(σ t + σ ) . (6.3.61) min A w 2 t + σ2 t≥0 σA w Proof

See [508].

Theorem 6.6 allows for an efficient solution of the ML problem. This is so because [508]: (1) there are standard methods for evaluating f (t) in (6.3.60) for any t ≥ 0; (2) the line search in (6.3.61) is unimodal in t ≥ 0, and thus any simple onedimensional search algorithm can efficiently find its global minima. For the solution of the optimization problem (6.3.60), one has the following lemma: LEMMA 6.2

[159], [508] The solution of optimization problem in (6.3.60), min Ax − b 22 ,

(6.3.62)

x: x 22 =t

has the solution

 T  ¯ + α∗ I † Ab, ¯ ¯ A x(t) = A

¯ is the unique root of the equation ¯ T A) where α∗ ≥ −λmin (A 22 2 † 2 ¯T ¯ ¯ 2

x(t) 2 = 2 A A + α∗ I Ab 2 = t. 2

(6.3.63)

(6.3.64)

2

¯ one can calculate (A ¯TA ¯ + ¯TA, Remark Using the eigenvalue decomposition of A ∗ †¯ 2 α I) Ab 2 for different values of α. The monotonicity of this squared norm enables us to find the α that satisfies (6.3.64) using a simple line search. The above discussion is summarized as Algorithm 6.4. The TML method has been extended to the case in which the measurement matrix is structured, so that the perturbations are not arbitrary but rather follow a fixed pattern. For this case, a structured TML (STML) strategy was proposed in [30].

344

Solving Matrix Equations

Algorithm 6.4

Total maximum likelihood algorithm [508]

¯ t0 , Δt. input: b, A, initialization: k = 1. repeat 1. Compute tk = t0 + Δt. 2. Use one-dimensional line search to solve the equation # #2 † # ¯T ¯ ¯ # # A A + αI Ab # = tk 2

in order to get the optimal solution α∗ . ¯ + α∗ I)† Ab. ¯ ¯TA 3. xk = (A 2 ¯ 4. f (tk ) = b − Ax(t k )2 .

5. J(tk ) =

f (tk ) 2 t + σ2 σA w k

2 2 + m log(σA tk + σ w ).

6. exit if J(tk ) > J(tk−1 ). return k ← k + 1. output: xTML ← xk−1 .

It is interesting and enlightening to compare the TML with the GTLS from the viewpoint of optimization: (1) The optimization solutions of the GTLS problem and of the TML problem are respectively given by " #

b − Ax 22 2 ˆ GTLS = arg min (6.3.65) + λ x GTLS : x 2 , x α2 x 22 + β 2 " # ¯ 2  2 

b − Ax 2 2 2 ˆ TML = arg min + m log σ

x + σ . (6.3.66) TML : x A 2 w 2 x 2 + σ 2 x σA w 2 They consist of a cost function and a penalty function. Taking α = σA and β = σw and replacing the deterministic data matrix A in the GTLS by the ¯ of the random data matrix A, the cost function of the deterministic term A GTLS problem becomes the cost function of the TML problem. (2) The penalty function in the GTLS problem is the regularization term λ x 22 , whereas the penalty function in the TML problem is the barrier function term 2 2 m log(σA

x 22 + σw ).

6.4 Constrained Total Least Squares The data LS method and the TLS method for solving the matrix equation Ax = b consider the case where the data matrix includes observed errors or noise, but the two methods assume that the errors are iid random varibles and have the same variances. However, in some important applications, the noise coefficients of the data

6.4 Constrained Total Least Squares

345

matrix A may be statistically correlated or, although statistically independent, they have different variances. In this section we discuss the solution of over-determined matrix equations when the column vectors of the noise matrix are statistically correlated.

6.4.1 Constrained Total Least Squares Method The matrix equation Am×n xn = bm can be rewritten as     x x = 0, = 0 or C [A, b] −1 −1

(6.4.1)

where C = [A, b] ∈ Cm×(n+1) is the augmented data matrix. Consider the noise matrix D = [E, e] of [A, b]. If the column vectors of the noise matrix are statistically correlated then the column vectors of the correction matrix should also be statistically correlated in order to use an augmented correction matrix ΔC = [ΔA, Δb] to suppress the effects of the noise matrix D = [E, e]. A simple way to make the column vectors of the correction matrix ΔC statistically correlated is to let every column vector be linearly dependent on the same vector (e.g., u): ΔC = [G1 u, . . . , Gn+1 u] ∈ Rm×(n+1) ,

(6.4.2)

where Gi ∈ Rm×m , i = 1, . . . , n+1, are known matrices, while u is to be determined. The constrained TLS problem can be stated as follows [1]: determine a solution vector x and a minimum-norm perturbation vector u such that     x = 0, (6.4.3) C + [G1 u, . . . , Gn+1 u] −1 which can be equivalently expressed as the constrained optimization problem     x T = 0, (6.4.4) min u Wu subject to C + [G1 u, . . . , Gn+1 u] u,x −1 where W is a weighting matrix and is usually diagonal or the identity matrix. The correction matrix ΔA is constrained as ΔA = [G1 u, . . . , Gn u], while the correction vector Δb is constrained as Δb = Gn+1 u. In constrained TLS problems, the linear correlation structure between column vectors of the augmented correction matrix [ΔA, Δb] is kept by selecting appropriate matrices Gi (i = 1, . . . , n + 1). The key to the constrained TLS method is how to choose appropriate matrices G1 , . . . , Gn+1 , depending on the application. THEOREM 6.7 [1]

Let Wx =

n  i=1

xi Gi − Gn+1 .

(6.4.5)

346

Solving Matrix Equations

Then the constrained TLS solution is given by +  H  0 x x H H † C (Wx Wx ) C , min F (x) = x −1 −1

(6.4.6)

where Wx† is the Moore–Penrose inverse matrix of Wx . A complex form of the Newton method was proposed in [1] for calculating the constrained TLS solution. The matrix F (x) is regarded as a complex analytic function with 2n complex variables x1 , . . . , xn , x∗1 , . . . , x∗n . Then the Newton recursive formulas are given by x = x0 + (A∗ B−1 A − B∗ )−1 (a∗ − A∗ B−1 a), where

⎫ T  ⎪ ∂F ∂F ∂F ∂F ⎪ a= , ,..., = complex gradient of F,⎪ = ⎪ ⎪ ∂x ∂x1 ∂x2 ∂xn ⎪ ⎪ ⎬ 2 ∂ F A= = nonconjugate complex Hessian matrix of F, ⎪ ⎪ ∂x∂xT ⎪ ⎪ ⎪ 2 ⎪ ∂ F ⎪ ⎭ B= = conjugate complex Hessian matrix of F. ∗ T ∂x ∂x

(6.4.7)

(6.4.8)

The (k, l)th entries of the two n × n part Hessian matrices are defined as  2     ∂ F ∂F ∂F ∂2F 1 ∂F ∂F , (6.4.9) = = −j −j ∂x∂xT k,l ∂xk ∂xl 4 ∂xkR ∂xkI ∂xlR ∂xlI      ∂F ∂F ∂2F ∂2F 1 ∂F ∂F . (6.4.10) = = + j − j ∂x∗ ∂xT k,l ∂x∗k ∂xl 4 ∂xkR ∂xkI ∂xlR ∂xlI Here xkR and xkI are respectively the real part and the imaginary part of xk . Put   x H −1 , (6.4.11) u = (Wx Wx ) C −1   ˜ = CIn+1,n − G WH u, . . . , G WH u , B (6.4.12) 1 n x x  H  H ˜ G = G1 u, . . . , Gn u , (6.4.13) where In+1,n is a (n + 1) × n diagonal matrix with diagonal entries equal to 1. Hence, a, A and B can be calculated as follows:   ˜ T, a = uH B (6.4.14)  H H T  −1 H H H H −1 ˜− G ˜ W (W W ) B ˜ , ˜ W W W B A = −G (6.4.15) x x x x x x   H   ˜ T +G ˜ (W WH )−1 B ˜ H WH (W WH )−1 W − I G. ˜ B= B (6.4.16) x x x x x x It is shown in [1] that the constrained TLS estimate is equivalent to the constrained maximum likelihood estimate.

6.4 Constrained Total Least Squares

347

6.4.2 Harmonic Superresolution Abatzoglou et al. [1] presented applications of the constrained TLS method in harmonic superresolution. Assume that L narrowband wavefront signals (harmonics) irradiate a uniform linear array of N elements. The array output signals satisfy the forward linear prediction equation [271]   x = 0, k = 1, 2, . . . , M, (6.4.17) Ck −1 where M is the number of snapshots and ⎡ yk (1) yk (2) ··· ⎢ y (2) y (3) ··· k k ⎢ ⎢ . . .. ⎢ .. .. . ⎢ ⎢ ⎢ yk (N − L) yk (N − L + 1) · · · Ck = ⎢ ⎢ ---------------------⎢ ∗ ⎢yk (L + 1) yk∗ (L) ··· ⎢ ⎢ .. .. .. ⎣ . . . ∗ ∗ yk (N − 1) ··· yk (N )

yk (L + 1) yk (L + 2) .. .



⎥ ⎥ ⎥ ⎥ ⎥ ⎥ yk (N ) ⎥ ⎥. ------- ⎥ ⎥ yk∗ (1) ⎥ ⎥ ⎥ .. ⎦ . ∗ yk (N − L)

(6.4.18)

Here yk (i) is the output observation of the kth array element at the time i. The matrix Ck is referred to as the data matrix at the kth snapshot. Combining all the data matrices into one data matrix C, we get ⎤ ⎡ C1 ⎥ ⎢ C = ⎣ ... ⎦ . (6.4.19) CM Then, the harmonic superresolution problem can be summarized as follows: use the constrained TLS method to solve the matrix equation   x = 0, C −1 while the estimation result of Wx is given by   0 ˆ = W1 W , x 0 W2 where

⎡ ⎢ W1 = ⎣

x1 0

x2 .. .

··· .. . x1

xL x2

−1 .. .

..

···

xL

0 .

⎤ ⎥ ⎦

−1

348

Solving Matrix Equations

and

⎡ ⎢ W2 = ⎣

−1

··· .. . −1

xL .. .

0

x2 xL

x1 .. .

0 ..

···

x2

.

⎤ ⎥ ⎦.

x1

Moreover,   H x x CH (Wx WxH )−1 C −1 −1    H  M x x ˆ H )−1 C ˆ W . CH ( W = m x x m −1 −1 

F (x) =

(6.4.20)

m=1

In order to estimate the angles of arrival φi of the L space signals at the linear uniform array, the constrained TLS method consists of the following three steps [1]. (1) Use the Newton method to minimize F (x) in (6.4.20), yielding x. (2) Find the roots zi , i = 1, . . . , L, of the linear-prediction-coefficient polynomial L 

xk z k−1 − z L = 0.

(6.4.21)

k=1

(3) Estimate the angles of arrival φi = arg(zi ), i = 1, . . . , L.

6.4.3 Image Restoration It is important to be able to recover lost information from degraded image data. The aim of image restoration is to find an optimal solution for the original image in the case where there is known record data and some prior knowledge. Let the N × 1 point-spread function (PSF) be expressed by ¯ + Δh, h=h

(6.4.22)

¯ and Δh ∈ RN are respectively the known part and (unknown) error part of where h the PSF. The error components Δh(i), i = 0, 1, . . . , N − 1, of Δh = [Δh(0), Δh(1), . . . , Δh(N −1)]T are independent identically distributed (iid) noises with zero mean and the same variance σh . The observed degraded image is expressed by g, and the imaging equation can be expressed as g = Hf + Δg,

(6.4.23)

where f and Δg ∈ RN are respectively the original image and the additive noise of the observed image. The additive noise vector Δg = [Δg(0), Δg(1), . . . , Δg(N −1)]T is an iid random vector and is statistically independent of the PSF error component

6.4 Constrained Total Least Squares

349

¯ Δh. The matrix H ∈ RN ×N denotes the PSF, which consists of the known part H and the noise part: ¯ + ΔH. H=H (6.4.24) The TLS solution of Equation (6.4.23) is 2 22 2 2 ˆ g ˆ ]2 , f = arg min 2[H, g] − [H, ˆ g]∈ RN ×(N +1) [H,ˆ

(6.4.25)

F

ˆ is where the constraint condition of g ˆ ˆ ∈ Range(H). g

(6.4.26)

By defining an unknown regularized noise vector T  Δh(0) Δh(N − 1) Δg(0) Δg(N − 1) u= ,..., , ,..., , σh σh σg σg

(6.4.27)

Mesarovic et al. [323] proposed the constrained TLS-based image restoration algorithm ¯ − g + Lu = 0, f = arg min u 2 subject to Hf (6.4.28) f

2

where L is an N × 2N matrix defined as ⎡ σh f (N − 1) · · · σh f (1) ⎢ σh f (0) ⎢ ⎢ σh f (1) σh f (0) · · · σh f (2) L=⎢ ⎢ . . .. .. .. .. ⎢ . . ⎣ σh f (N − 1)

σh f (N − 2)

···

σh f (0)

.. . .. . .. . .. .

σg

0

···

0 .. .

σg .. .

··· .. .

0

0

···

⎤ 0⎥ ⎥ 0⎥ ⎥ . (6.4.29) .. ⎥ .⎥ ⎦ σg

¯ In the case where there is a given data vector g and a part of the PSF matrix H is known, solving for the original image f in Equation (6.4.23) is a typical inverse problem: the solution of the image restoration problem corresponds mathematically to the existence and uniqueness of the inverse transformation of Equation (6.4.23). If the inverse transformation does not exist then the image restoration is called a singular inverse problem. Moreover, even if the inverse transformation exists, its solution may be not unique. For a practical physical problem, such a nonunique solution is not acceptable. In this case, the image restoration is said to be an illconditioned inverse problem. This implies that a very small perturbation in the observed data vector g may lead to a large perturbation in the image restoration [16], [465]. An effective way to overcome the ill-conditioned problem in image restoration is to use the regularization method [465], [123], which yields the regularized constrained TLS algorithm [123], [168]. The basic idea of the regularized constrained TLS image restoration algorithm is to introduce a regularizing operator Q and a regularizing parameter λ > 0, and

350

Solving Matrix Equations

to replace the objective function by the sum of two complementary functions; i.e., (6.4.28) becomes   f = arg min u 22 + λ Qf 22 (6.4.30) f

subject to ¯ − g + Lu = 0. Hf The choice of regularizing parameter λ needs to take into account both the fidelity to the observed data and the smoothness of the solution. In order to improve further the performance of the regularized constrained TLS image restoration algorithm, Chen et al. [97] proposed a method for adaptively choosing the regularizing parameter λ, this method is called the adaptively regularized constrained TLS image restoration. The solution of this algorithm is given by   f = arg min u 22 + λ(f ) Qf 22 (6.4.31) f

subject to ¯ − g + Lu = 0. Hf

6.5 Subspace Method for Solving Blind Matrix Equations Consider a blind matrix equation X = AS,

(6.5.1)

where X ∈ CN ×M is a complex matrix whose entries are the observed data and the two complex matrices A ∈ CN ×d and S ∈ Cd×M are unknown. The question is: when only X is known, the solution of the unknown matrix S of blind matrix equation can be found? The answer is yes, but it is necessary to assume two conditions, that the matrix A is of full column rank and the matrix S is of full row rank. These assumptions are often satisfied in engineering problems. For example, in array signal processing, full column rank of the matrix A means that the directions of arrival (DOA) of all source signals are mutually independent, while full row rank of the matrix S requires that each source signal is independently transmitted. Assume that N is the data length, d is the number of sources, M is the number of sensors, usually M ≥ d and N > M . Define the truncated SVD of the data matrix X as ˆΣ ˆV ˆ H, X=U

(6.5.2)

ˆ is a d × d diagonal matrix consisting of the d principal singular values where Σ of X.

6.5 Subspace Method for Solving Blind Matrix Equations

351

ˆ i.e., the matrices A and U ˆ span the same signal subSince Col(A) = Col(U), space, we have ˆ = AT, U (6.5.3) where T is a d × d nonsingular matrix. Let W be a d × N complex matrix that represents a neural network or filter. Premultiply Equation (6.5.1) by W: WX = WAS.

(6.5.4)

Adjusting the matrix W so that WA = Id , the solution of the above equation is given by S = WX.

(6.5.5)

In order to find W, we compute ˆ = WAT = T WU and thus obtain ˆ H. W = TU

(6.5.6)

Summarizing the above discussion, we have the following method for solving the blind matrix (6.5.1): ⎫ Data model X = AS, ⎪ ⎪ ⎪ ⎪ ⎪ ˆ ˆ ˆ ⎬ Truncated SVD X = UΣV, (6.5.7) ˆ = AT for T, ⎪ Solving U ⎪ ) * ⎪ ⎪ ⎭ ˆ T X.⎪ Solution of matrix equation S = TU This method is called the subspace method because it is based on the signal subspace. Hence, the key problem in solving a blind matrix equation is how to find the ˆ = AT in the case where both A and T are unknown. nonsingular matrix T from U Here we consider an example of solving the blind matrix equation (6.5.1) in wireless communications. Without taking into consideration the multipath transmission in wireless communication, the matrix in Equation (6.5.1) is given by [493] X = Aθ B, with

⎡ ⎢ ⎢ Aθ = [a(θ1 ), . . . , a(θd )] = ⎢ ⎣ B = Diag(β1 , β2 , . . . , βd ),

(6.5.8)

1 θ1 .. .

1 θ2 .. .

··· ··· .. .

1 θd .. .

θ1N −1

θ2N −1

···

θdN −1

⎤ ⎥ ⎥ ⎥, ⎦

352

Solving Matrix Equations

in which θi and βi are the unknown direction of arrival and the unknown attenuation coefficient of the ith user, respectively. Define the diagonal matrix Θ = Diag(θ1 , . . . , θd )

(6.5.9)

and the (M − 1) × M selection matrices J1 = [IM −1 , 0],

J2 = [0, IM −1 ]

which select respectively the upper M − 1 rows and the lower M − 1 rows of the matrix Aθ . It is easy to see that (J1 Aθ )Θ = J2 Aθ .

(6.5.10)

ˆ = XT = A BT. U θ

(6.5.11)

Hence

In order to find the nonsingular matrix T, if we premultiply (6.5.11) by the selection matrices J1 and J2 , respectively, and let A θ = J1 Aθ then, using (6.5.10), we get ˆ = (J1 Aθ )BT = A BT, ˆ 1 = J1 U U θ ˆ = (J2 Aθ )BT = A ΘBT. ˆ 2 = J2 U U θ

(6.5.12) (6.5.13)

Since B and Θ are diagonal matrices, we have ΘB = BΘ and thus ˆ = A BΘT = A BTT−1 ΘT = U ˆ T−1 ΘT, U θ θ 2 1 which can be written as −1 ˆ ˆ †U U ΘT, 1 2 =T

(6.5.14)

ˆ )−1 U ˆ H is the generalized inverse of the matrix U ˆ † = (U ˆ HU ˆ . where U 1 1 1 1 1 Because Θ is a diagonal matrix, it is easily seen that Equation (6.5.14) is a typical similarity transformation. Hence, through this similarity transformation of ˆ ˆ †U the matrix U 1 2 , we can obtain the nonsingular matrix T. The above discussion is summarized in Algorithm 6.5 for solving the blind matrix equation X = Aθ B. Although we describe only the case of single-path transmission, the subspace method for solving the matrix equation AS = X is also applicable to the case of multipath transmission. The difference is in the form of the matrix A, so that the method of finding the nonsingular matrix T is also different. The interested reader may refer to [493].

6.6 Nonnegative Matrix Factorization: Optimization Theory Algorithm 6.5

353

Solving blind matrix equation X = Aθ B

input: X ∈ CN ×M . ˆΣ ˆV ˆ H. 1. Compute the truncated SVD X = U ˆ 1 = UJ ˆ 1 and U ˆ 2 = UJ ˆ 2. 2. Calculate U ˆ †U ˆ 3. Make the similarity transformation T−1 U 1 2 T to get T.  H  ˆ T X. output: B = U

6.6 Nonnegative Matrix Factorization: Optimization Theory A matrix with nonnegative real entries is called a nonnegative matrix. Consider the nonnegative blind matrix equation X = AS, where a known matrix X and two unknown matrices A and S are nonnegative matrices. This blind matrix equation is widespread in engineering application problems.

6.6.1 Nonnegative Matrices An n × n matrix A or an n × 1 vector a is said to be

DEFINITION 6.3

(1) positive (or elementwise positive), denoted A > O or a > 0, if all its entries are positive; (2) nonnegative (or elementwise nonnegative), denoted A ≥ O or a ≥ 0, if its all entries are nonnegative. We use the notation A > B (A ≥ B) to mean that A − B is a positive (nonnegative) matrix, i.e., Aij > Bij (Aij ≥ Bij ) for all i, j. DEFINITION 6.4 We say that a matrix A ∈ Rn×n is reducible if there exists a permutation matrix P such that   A11 A12 , (6.6.1) B = PAPT = O A21 where A11 ∈ Rr×r , A22 ∈ Rn−r,n−r , A12 ∈ Rr×(n−r) and O is an (n − r) × r zero matrix. DEFINITION 6.5 ducible.

A matrix A ∈ Rn×n is said to be irreducible, if it is not re-

DEFINITION 6.6 An n×n nonnegative matrix A is said to be regular or primitive, if there is a k ≥ 1 such that all the entries of Ak are positive. EXAMPLE 6.3

Any positive matrix A is regular since, for k = 1, A > O.

354

EXAMPLE 6.4

Solving Matrix Equations

The following matrices are not regular:       1 1 1 0 0 1 A= , A= and A = 0 1 1 1 1 0

because, for any k ≥ 1, Ak ≥ O. But, the matrix   1 1 A= 1 0 is regular, since



1 A= 0 EXAMPLE 6.5

1 1

2



 2 1 = > O. 1 1

The following matrix is ⎡ 1 A = ⎣1 0

regular: ⎤ 0 1 0 0⎦ 1 0

because, for k = 4, we have ⎡

2 A 4 = ⎣2 1

⎤ 1 1 1 1⎦ > O. 1 1

There is the following relationship between an irreducible matrix A and a regular matrix. LEMMA 6.3 If A is a nonnegative and irreducible n×n matrix then (I+A)n−1 > O, i.e., I + A is regular or primitive. Proof [355] It suffices to prove that (I + A)n−1 x > O for any x ≥ 0, x = 0. Define the sequence xk+1 = (I + A)xk ≥ 0, k = 0, 1, . . . , n − 2, x0 = x. Since xk+1 = xk + Axk , xk+1 has no more zero entries than xk . In order to prove by contradiction that xk+1 has fewer zero entries than xk , suppose that xk+1 and xk have exactly the same number of zero entries. Then, there exists a permutation matrix P such that     y z Pxk+1 = , Pxk = , y, z ∈ Rm , y, z > 0, 1 ≤ m < n. 0 0 Then

  y = P(xk + Axk ) = Pxk + PAPT Pxk 0      A11 A12 z z . + = 0 A21 A22 0

Pxk+1 =

This implies that A21 = O, which contradicts the assumption condition that A is

6.6 Nonnegative Matrix Factorization: Optimization Theory

355

irreducible. Thus, x0 = x has at most n − 1 zero entries, xk has at most n − k − 1 zero entries, and hence xn−1 = (I + A)n−1 x0 has no zero entry, i.e., xn−1 is a positive vector, which completes the proof. THEOREM 6.8 (Perron–Frobenius theorem for regular matrices) Suppose that A ∈ Rn×n is a nonnegative and regular matrix, i.e., Ak > 0 for some k ≥ 1; then • there is a Perron–Frobenius (PF) eigenvalue λPF of A that is real and positive, together with positive left and right eigenvectors; • any other eigenvalue λ satisfies |λ| < λPF ; • the eigenvalue λPF is simple, i.e., has multiplicity 1, and corresponds to a 1 × 1 Jordan block. The Perron–Frobenius theorem was proved by Oskar Perron in 1907 [385] for positive matrices and was extended by Georg Frobenius in 1912 [165] to nonnegative and irreducible matrices. The Perron–Frobenius theorem has important applications to probability theory (Markov chains), the theory of dynamical systems, economics, population dynamics (the Leslie population-age-distribution model), power control and so on. Interested readers can refer to the books [39], [324]. In particular, the Perron–Frobenius theorem has been extended to tensors [39].

6.6.2 Nonnegativity and Sparsity Constraints In many engineering applications it is necessary to impose two constraints on the data: a nonnegativity constraint and a sparsity constraint. 1. Nonnegativity Constraint As the name suggests, the nonnegativity constraint constrains the data to be nonnegative. A lot of actual data are nonnegative, and so constitute the nonnegative data matrices. Such nonnegative matrices exist widely in daily life. The following are four important actual examples of nonnegative matrices [277]. 1. In document collections, documents are stored as vectors. Each element of a document vector is a count (possibly weighted) of the number of times a corresponding term appears in that document. Stacking document vectors one after the other creates a nonnegative term-by-document matrix that represents the entire document collection numerically. 2. In image collections, each image is represented by a vector, and each element of the vector corresponds to a pixel. The intensity and color of a pixel is given by a nonnegative number, thereby creating a nonnegative pixel-by-image matrix. 3. For item sets or recommendation systems, the information for a purchase history of customers or ratings on a subset of items is stored in a nonnegative sparse matrix.

356

Solving Matrix Equations

4. In gene expression analysis, gene-by-experiment matrices are formed from observations of the gene sequences produced under various experimental conditions. In addition, in pattern recognition and signal processing, for some particular pattern or target signal a linear combination of all the feature vectors (an “allcombination”) may not be appropriate. On the contrary, a part-combination of some feature vectors is more suitable. For example, in face recognition, the combination of the specific parts of the eye, nose, mouth and so on is often more effective. In an all-combination, the positive and negative combination coefficients emphasize respectively the positive and negative effects of some features, while a zero combination coefficient implies a characteristics that does not play a role. In a part-combination, there are only two kinds of characteristics: those play or do not play a role. Therefore, in order to emphasize the role of some main features, it is natural to add nonnegative constraints to the elements in the coefficient vector. 2. Sparsity Constraint By a sparsity constraint is meant an assumption that the data are not dense but sparse, i.e., most data take a zero value, only a few take nonzero values. A matrix for which most entries are zero and only a few entries are nonzero is called a sparse matrix, while a sparse matrix whose entries take only nonnegative nonzero values is known as a nonnegative sparse matrix. For example, in the commodity recommendation system, a matrix that is composed of the customer’s purchases or scores is a nonnegative sparse matrix. In economics, a lot of variables and data (such as volume and price) are not only sparse but also nonnegative. Sparsity constraints can increase the effectiveness of investment portfolios, while nonnegative constraints can improve investment efficiency and also reduce the investment risks [436], [538]. Although many natural signals and images are not themselves sparse, after a certain transformation they are sparse in the transform domain. For example, the discrete cosine transform (DCT) of a face and a medical image is a set of typical sparse data. The short-time Fourier transform (STFT) of a speech signal is also sparse in the time domain.

6.6.3 Nonnegative Matrix Factorization Model The basic problem of linear data analysis is as follows. By an appropriate transformation or factorization, a higher-dimensional original data vector can be represented as a linear combination of a set of low-dimensional vectors. Because the nature or character of the original data vector is extracted, it can be used for pattern recognition. So these low-dimensional vectors are often called “pattern vectors” or “basis vectors” or “feature vectors”.

6.6 Nonnegative Matrix Factorization: Optimization Theory

357

In the process of data analysis, modeling and processing, the two basic requirements of a pattern vector must be considered. (1) Interpretability The components of a pattern vector should have definite physical or physiological meaning. (2) Statistical fidelity When the data are consistent and do not have too much error or noise, the components of a pattern vector should be able to explain the variance of the data (in the main energy distribution). Vector quantization (VQ) and principal component analysis (PCA) are two widely used unsupervised learning algorithms; they adopt fundamentally different ways to encoding of data. 1. Vector Quantization Method The vector quantization (VQ) method uses a stored prototype vector as the code vector. Let cn be a k-dimensional code vector and let there be N such stored code vectors, i.e., cn = [cn,1 , . . . , cn,k ]T ,

n = 1, . . . , N.

The collection of N code vectors {c1 , . . . , cN } comprises a codebook. The subset consisting of the data vectors closest to the stored pattern vectors or the code vectors cn is called the encoding region of the code vectors cn , denoted Sn and defined as    Sn = x  x − cn 2 ≤ x − cn 2 , ∀ n = 1, . . . , N . (6.6.2) The formulation of the vector quantization problem is as follows: given M k ×1 data vectors xi = [xi,1 , . . . , xi,k ]T , i = 1, . . . , M , determine the encoding region of these vectors, i.e., their corresponding code vectors. Let X = [x1 , . . . , xM ] ∈ Rk×M be a data matrix and C = [c1 , . . . , cN ] ∈ Rk×N denote a codebook matrix. Then the VQ of the data matrix can be described by the model X = CS,

(6.6.3)

where S = [s1 , . . . , sM ] ∈ RN ×M is a quantization coefficient matrix, its columns are called quantization coefficient vectors. From the viewpoint of optimization, the VQ optimization criterion is “winnertake-all”. According to this criterion, the input data is clustered into mutually exclusive patterns [172]. From the viewpoint of encoding, the VQ is a “grandmother cell coding”: all data are explained or clustered simply by a basis vector [506]. Specifically, each quantization coefficient vector is an N × 1 basis vector with only a single entry 1 and the rest zero entries. Therefore, if the (i, j)th entry of the codebook matrix is equal to 1 then the data vector xj is judged to be closest to the code vector ci , that is, the data vector corresponds to only one code vector. The

358

Solving Matrix Equations

VQ method can capture the nonlinear structure of the input data but its capturing ability is weak, because the data vector and the code vector are in one-to-one correspondence in this method. If the dimension of the data matrix is large, a large number of code vectors are needed to represent the input data. 2. Principal Component Analysis Method The linear data model is a widely used data model. Principal component analysis (PCA), linear discriminant analysis (LDA), independent component analysis (ICA) and other multivariate data analysis methods adopt this linear data model. As we saw, the VQ method uses code vectors. Similarly, the base matrix A in the PCA method consists of a set of orthogonal basis vectors ai corresponding to the principal components. These basis vectors are called the pattern or feature vectors. For a data vector x, PCA uses the principle of shared constraints to make the optimization and uses a linear combination of the pattern vectors x = As to represent the input data. From the encoding point of view, PCA provides a distributed encoding, as compared with the grandmother-cell-encoding-based VQ method, thus the PCA method requires only a small group of basic vectors to represent higher-dimensional data. The disadvantages of the PCA method are as follows. 1. It cannot capture any nonlinear structure in the input data. 2. The basis vectors can be statistically interpreted as the directions of maximum difference, but many directions do not have a clear visual interpretation. The reason is that the entries of the basis matrix A and the quantization coefficient vector s can take zero, positive or negative signs. Since the basis vectors are used in the linear combination, and this combination involves a complex cancellation of positive and negative numbers, many individual basis vectors lose their intuitive physical meaning due to this cancellation and do not have an explanatory role for nonnegative data (such as the pixel values of a color image). On the one hand, the entries of a nonnegative pattern vector should all be nonnegative values, but on the other hand mutually orthogonal eigenvectors must contain negative entries: if all the entries of the eigenvector u1 corresponding to the maximum eigenvalue are nonnegative, then any other eigenvector orthogonal to u1 must contain at least one negative entry, otherwise the orthogonality condition of two vectors u1 , uj  = 0, j = 1, cannot be met. This fact indicates that the mutually orthogonal eigenvectors cannot be used as pattern vectors or basis vectors in nonnegative data analysis. In the PCA, LDA and ICA methods, the coefficient vector elements usually take positive or negative values; few take a zero value. That is to say, in these methods, all the basis vectors are involved in the fitting or regression of the data vectors.

6.6 Nonnegative Matrix Factorization: Optimization Theory

359

3. Nonnegative Matrix Factorization Method Unlike in the VQ, PCA, LDA and ICA methods, the basis vectors and the elements of the coefficient vector are treated as nonnegative constraints in nonnegative matrix factorization (NMF). It can be seen that the number of basis vectors involved in the fitting or regression of a data vector is certainly less in the NMF. From this perspective, NMF basis vectors have the role of extracting the principal basis vectors. Another prominent advantage of NMF is that the nonnegative constraint on combination factors facilitates the generation of sparse coding, so that many encoded values are zero. In biology, the human brain encodes information in this sparse coding way [153]. Therefore, as an alternative method of linear data analysis, we should use nonnegative instead of orthogonal constraints on the basis vectors. The NMF method gives a multilinear nonnegative approximation to the data. K×1 Let x(j) = [x1 (j), . . . , xI (j)]T ∈ RI×1 and s(j) = [s1 (j), . . . , sK (j)]T ∈ R+ + denote respectively the nonnegative data vector and K-dimensional nonnegative coefficient vector measured by I sensors at the discrete time j, where R+ denotes the nonnegative quadrant. The mathematical model for nonnegative vectors is described by ⎡ ⎤ ⎡ ⎤ ⎤⎡ x1 (j) s1 (j) a11 · · · a1K ⎢ .. ⎥ ⎢ .. .. ⎥ ⎢ .. ⎥ or x(j) = As(j), .. (6.6.4) ⎣ . ⎦=⎣ . . . ⎦⎣ . ⎦ xI (j)

aI1

···

aIK

sK (j)

in which A = [a1 , . . . , aK ] ∈ RI×K is the basis matrix and ak , k = 1, . . . , K, are the basis vectors. Since the measurement vectors x(j) at different times are expressed in the same set of basis vectors, so these I-dimensional basis vectors can be imagined as the building blocks of data representation, while the element sk (j) of the K-dimensional coefficient vector represents the strength of the kth basis vector (building block) ak in the data vector x(j), reflecting the contribution of ak in the fitting or regression of x. Therefore, the elements sk (j) of the coefficient vectors are often referred to as fitting coefficients, regression coefficients or combination coefficients: (1) sk (j) > 0 represents a positive contribution of the basis vector ak to the additive combination; (2) sk (j) = 0 represents a null contribution of the corresponding basis vector, i.e., it is not involved in the fitting or regression; (3) sk (j) < 0 implies a negative contribution of the basis vector ak . If the nonnegative data vectors measured at the discrete times j = 1, . . . , J are arranged as a nonnegative observation matrix, then [x(1), . . . , x(J)] = A[s(1), . . . , s(J)]



X = AS.

(6.6.5)

360

Solving Matrix Equations

The matrix S is called the coefficient matrix. It is essentially an encoding of the basis matrix. The problem of solving the blind nonnegative matrix equation X = AS can be described as: given a nonnegative matrix X ∈ RI×J (its entries xij ≥ 0) with a + I×r low rank r < min{I, J}, find the basis matrix A ∈ R+ and the coefficient matrix S ∈ Rr×J for X such that + X = AS + N (6.6.6) or Xij = [AS]ij + Nij =

r 

aik skj + nij ,

(6.6.7)

k=1

where N ∈ RI×J is the approximate error matrix. The coefficient matrix S is also called the encoding variable matrix, its entries are unknown hidden nonnegative components. The problem of decomposing a nonnegative data matrix into the product of a nonnegative basis matrix and a nonnegative coefficient matrix is called nonnegative matrix factorization (NMF) and was proposed by Lee and Seung in 1999 in Nature [286]. The NMF is essentially a multilinear nonnegative data representation. If the data matrix X is a positive matrix then the basis matrix A and the coefficient matrix S are required to be positive matrices. Such a matrix factorization is called positive matrix factorization (PMF) and was proposed by Paatero and Tapper in 1994 [367]. Equation (6.6.6) shows that when the data-matrix rank r = rank(X) < min{I, J}, the nonnegative matrix approximation AS can be regarded as a compression and de-noising form of the data matrix X. Table 6.1 shows the connections and differences between VQ, PCA and NMF. Table 6.1 Comparison of VQ, PCA and NMF

Constraints Form

VQ

PCA

NMF

winner-take-all  codebook matrix C coeff. matrix S

all share  basis matrix A coeff. matrix S  combination X = AS

minority share  basis matrix A coeff. matrix S NMF X = AS

Model

clustering X = CS

Analysis ability

nonlinear

linear

multilinear

Encoding way

grandmother cell

distributive

distributive + nonnegative

Machine learning

single mode

distributed

parts combination

6.6 Nonnegative Matrix Factorization: Optimization Theory

361

In the following, we address the features of NMF. 1. Distributed nonnegative encoding Nonnegative matrix factorization does not allow negative elements in the factorization matrices A and S. Unlike the single constraint of the VQ, the nonnegative constraint allows the use of a combination of basis images, or eigenfaces, to represent a face image. Unlike the PCA, the NMF allows only additive combinations; because the nonzero elements of A and S are all positive, the occurrence of any subtraction between basis images can be avoided. In terms of optimization criteria, the VQ adopts the “winnertake-all” constraint and the PCA is based on the “all-sharing” constraint, while the NMF adopts a “group-sharing” constraint together with a nonnegative constraint. From the encoding point of view, the NMF is a distributed nonnegative encoding, which often leads to sparse encoding. 2. Parts combination The NMF gives the intuitive impression that it is not a combination of all features; rather, just some of the features, (simply called the parts), are combined into a (target) whole. From the perspective of machine learning, the NMF is a kind of machine learning method based on the combination of parts which has the ability to extract the main features. 3. Multilinear data analysis capability The PCA uses a linear combination of all the basis vectors to represent the data and can extract only the linear structure of the data. In contrast, the NMF uses combinations of different numbers of and different labels of the basis vectors (parts) to represent the data and can extract its multilinear structure; thereby it has certain nonlinear data analysis capabilities.

6.6.4 Divergences and Deformed Logarithm The NMF is essentially an optimization problem and often uses a divergence as its cost function. The distance D(p g) between probability densities p and q is called the divergence if it meets only the nonnegativity and positive definite condition D(p g) ≥ 0 (the equality holds if and only if p = g). According to the prior knowledge of the statistical noise distribution, common divergences in NMF are the Kullback–Leibler divergence and the alpha–beta (AB) divergence and so on. 1. Squared Euclidean Distance Consider the NMF approximation X ≈ AS. The distance between the nonnegative measurement matrix X and the NMF AS is denoted D(X AS). When the approximation error follows a normal distribution, the NMF generally uses the squared

362

Solving Matrix Equations

Euclidean distance of the error matrix as the cost function: 1  2 (xij − [AS]ij ) . 2 i=1 j=1 I

DE (X AS) = X − AS 22 =

J

(6.6.8)

In many applications, in addition to the nonnegative constraint aik ≥ 0, skj ≥ 0, ∀ i, j, k, it is usually required that the expectation solution of S or of A has some characteristics associated with the cost functions JS (S) or JA (A) respectively, which yields the cost function DE (X AS) =

1

X − AS 22 + αA JA (A) + αS JS (S) 2

(6.6.9)

called the squared Euclidean distance between parameters αS and αA . 2. AB Divergence Let P, G ∈ RI×J be a nonnegative measurement matrix and its approximation matrix; then their alpha–beta divergence is simply called the AB divergence, and is defined as (see e.g., [14], [101], [102]) (α,β)

DAB (P G) ⎧ J ) I * ⎪ 1 α β α+β α+β ⎪ β α ⎪ − p , α, β, α + β = 0, g − p − g ⎪ ij ij ij ij α+β α+β ⎪ αβ i=1 j=1 ⎪ ⎪ ⎪ ) * ⎪1  J I  ⎪ pα ij α α ⎪ ⎪ pα , α=  0, β = 0, α − pij + gij ij ln gij ⎪ α2 ⎪ i=1 j=1  ⎪  ⎨ * ) α −1 J I gα gij 1  = ln pij −1 , α = −β = 0, α + α 2 p ⎪ α ij ij ⎪ i=1 j=1   ⎪ ⎪  J I  ⎪ gβ 1 ⎪ β β β ⎪ gij , α = 0, β = 0, ln pij ⎪ β2 β − gij + pij ⎪ ⎪ ij i=1 j=1 ⎪ ⎪ J I 2 ⎪ 1  ⎪ ⎪ α = 0, β = 0. ln pij − ln gij , ⎩2 i=1 j=1

The AB divergence is a constant function with α and β as parameters, and contains most divergences as special cases. (1) Alpha divergence When α + β = 1, the AB divergence reduces to the alpha divergence (α divergence): (α,1−α)

Dα (P G) = DAB

(P G)

  1 1−α − αpij + (α − 1)gij , pα ij gij α(α − 1) i=1 j=1 I

=

J

(6.6.10)

where α = 0 and α = 1. The following are some common α divergences [102]: • When α = 2, the α divergence reduces to the Pearson χ2 distance. • When α = 0.5, the α divergence reduces to the Hellinger distance.

6.6 Nonnegative Matrix Factorization: Optimization Theory

363

• When α = −1, the α divergence reduces to the Neyman χ2 distance. • As α → 0, the limit of the α divergence is the KL divergence of G from P, i.e., lim Dα (P G) = DKL (G P).

α→0

• As α → 1, the limit of the α is the KL divergence of P from G, i.e., lim Dα (P G) = DKL (P G).

α→1

(2) Beta divergence When α = 1, the AB divergence reduces to the beta divergence (β divergence): (1,β)

Dβ (P G) = DAB (P G)  J  I 1 β 1  β 1+β pij gij , − − =− p1+β g β i=1 j=1 1 + β ij 1 + β ij

(6.6.11)

where β = 0. In particular, if β = 1 then 1  = (p − gij )2 2 i=1 j=1 ij I

Dβ=1 (P G) =

(1,1) DAB (P G)

J

(6.6.12)

reduces to the squared Euclidean distance. (3) Kullback–Leibler (KL) divergence When α = 1 and β = 0, the AB divergence reduces to the standard KL divergence, i.e., (1,0)

DAB (P G) = Dβ=0 (P G) = DKL (P G). (4) Itakura–Saito (IS) divergence When α = 1 and β = −1, the AB divergence gives the standard Itakura–Saito divergence   J I   gij pij (1,−1) ln + −1 . (6.6.13) DIS (P G) = DAB (P G) = pij gij i=1 j=1 3. Kullback–Leibler Divergence Let φ : D → R be a continuous differentiable convex function defined in the closed convex set D ⊆ RK + . The Bregman distance between two vectors x, g ∈ D associated with the function φ is denoted Bφ (x g), and defined as def

Bφ (x g) = φ(x) − φ(g) − ∇φ(x), x − g.

(6.6.14)

Here ∇φ(x) is the gradient of the function φ at x. In particular, if φ(x) =

K 

xi ln xi

(6.6.15)

i=1

then the Bregman distance is called the Kullback–Leibler (KL) divergence, denoted

364

Solving Matrix Equations

DKL (x g). In probability and information theory, the KL divergence is also called the information divergence, the information gain, or the relative entropy. For two probability distribution matrices of a stochastic process P, G ∈ RI×J , if they are nonnegative then their KL divergence DKL (P G) is defined as   J I   pij pij ln (6.6.16) DKL (P G) = − pij + gij . gij i=1 j=1 Clearly, we have DKL (P G) ≥ 0 and DKL (P G) = DKL (G P), i.e., the KL divergence does not have symmetry. 4. Entropy, Tsallis Statistics and Deformed Logarithms In mathematical statistics and information theory, for a set of probabilities {pi }  such that i pi = 1, the Shannon entropy is defined as  S=− pi log2 pi , (6.6.17) i

while the Boltzmann–Gibbs entropy is defined as  pi ln pi . SBG = −k

(6.6.18)

i

For two independent systems A and B, their joint probability density p(A, B) = p(A)p(B), and their Shannon entropy S(·) and Boltzmann–Gibbs entropy SBG (·) have the additivity property S(A + B) = S(A) + S(B),

SBG (A + B) = SBG (A) + SBG (B),

so the Shannon entropy and the Boltzmann–Gibbs entropy are each extensive entropies. In physics, the Tsallis entropy is an extension of the standard Boltzmann–Gibbs entropy. The Tsallis entropy was proposed by Tsallis in 1988, and is also called the q-entropy [473]; it is defined as q 1 − i (pi )q Sq (pi ) = . (6.6.19) q−1 The Boltzmann–Gibbs entropy is the limit of the Tsallis entropy when q → 1, i.e., SBG = lim Dq (pi ). q→1

The Tsallis entropy has only pseudo-additivity: Sq (A + B) = Sq (A) + Sq (B) + (1 − q)Sq (A)Sq (B). Hence it is a nonextensive or nonadditive entropy [474]. The mathematical statistics defined by the Tsallis entropy is often called the Tsallis mathematical statistics. The major mathematical tools of Tsallis mathematical

6.6 Nonnegative Matrix Factorization: Optimization Theory

365

statistics are the q-logarithm and the q-exponential. In particular, the important q-Gaussian distribution is defined by the q-exponential. For nonnegative real numbers q and x, the function ⎧ (1−q) −1 ⎨x , q = 1 1−q (6.6.20) lnq x = ⎩ln x, q=1 is called the Tsallis logarithm of x [473], also called the q-logarithm. For all x ≥ 0, the Tsallis logarithm is an analytic, increasing, concave function. The inverse function of the q-logarithm is called the q-exponential, defined as ⎧ ⎪ (1 + (1 − q)x)1/(1−q) , 1 + (1 − q)x > 0, ⎪ ⎪ ⎪ ⎨0, q < 1, expq (x) = (6.6.21) ⎪ +∞, q > 1, ⎪ ⎪ ⎪ ⎩ exp(x), q = 1. The relationship between the q-exponential and the Tsallis logarithm is given by expq (lnq x) = x,

(6.6.22)

lnq expq (x) = x.

(6.6.23)

The probability density distribution f (x) is known as a q-Gaussian distribution, if √ β f (x) = expq (−βx2 ), (6.6.24) Cq where expq (x) = [1 + (1 − q)x]1/(1−q) is the q-exponential; the normalized factor Cq is given by √  1  ⎧ 2 π Γ 1−q ⎪ ⎪  1  , −∞ < q < 1, √ ⎪ ⎪ (3 − q) 1 − q Γ 1−q ⎪ ⎨√ Cq = π, q = 1, ⎪ √  3−q  ⎪ ⎪ π Γ ⎪ 2(q−1) ⎪√  1 , 1 < q < 3. ⎩ q − 1Γ

q−1

When q → 1, the limit of the q-Gaussian distribution is the Gaussian distribution; the q-Gaussian distribution is widely applied in statistical mechanics, geology, anatomy, astronomy, economics, finance and machine learning. Compared with the Gaussian distribution, a prominent feature of the q-Gaussian distribution with 1 < q < 3 is its obvious tailing. Owing to this feature, the qlogarithm (i.e., the Tsallis logarithm) and the q-exponential are very well suitable for using the AB-divergence as the cost function of the NMF optimization problems. In order to facilitate the application of the q-logarithm and the q-exponential in

366

Solving Matrix Equations

NMF, define the deformed logarithm

+

φ(x) = ln1−α x =

(xα − 1)/α,

α = 0,

ln x,

α = 0.

The inverse transformation of the deformed logarithm, ⎧ exp(x), α = 0, ⎪ ⎨ −1 1/α φ (x) = exp1−α (x) = (1 + αx) , α = 0, 1 + αx ≥ 0, ⎪ ⎩ 0, α = 0, 1 + αx < 0

(6.6.25)

(6.6.26)

is known as the deformed exponential. The deformed logarithm and the deformed exponential have important applications in the optimization algorithms of NMF.

6.7 Nonnegative Matrix Factorization: Optimization Algorithms The NMF is a minimization problem with nonnegative constraints. By the classification of Berry et al. [40], the NMF has three kinds of basic algorithm: (1) multiplication algorithms; (2) gradient descent algorithms; (3) alternating least squares (ALS) algorithms. These algorithms belong to the category of first-order optimization algorithms, where the multiplication algorithms are essentially gradient descent algorithms as well, but with a suitable choice of step size, a multiplication algorithm can transform the subtraction update rule of the general gradient descent method into a multiplicative update. Later, on the basis of the classification of Berry et al. [40], Cichocki et al. [104] added the quasi-Newton method (a second-order optimization algorithm) and the multilayer decomposition method. In the following, the above five representative methods are introduced in turn.

6.7.1 Multiplication Algorithms The gradient descent algorithm is a widely applied optimization algorithm; its basic idea is that a correction term is added during the updating of the variable to be optimized. The step size is a key parameter for this kind of algorithm, and determines the magnitude of the correction. Consider the general updating rule of the gradient descent algorithm for the unconstrained minimization problem min f (X): xij ← xij − ηij ∇f (xij ),

i = 1, . . . , I; j = 1, . . . , J,

(6.7.1)

6.7 Nonnegative Matrix Factorization: Optimization Algorithms

367

where xij is the entry of the variable matrix X, and ∇f (xij ) = [∂f (X)/∂X]ij is the gradient matrix of the cost function f (X) at the point xij . If the step size ηij is chosen in such a way that the addition term xij in the additive update rule is eliminated, then the original gradient descent algorithm with additive operations becomes the gradient descent algorithm with multiplication operations. This multiplication algorithm was proposed by Lee and Seung [287] for the NMF but is available for other many optimization problems as well. 1. Multiplication Algorithm for Squared Euclidean Distance Minimization Consider an unconstrained problem min DE (X AS) = 12 X − AS 22 using the typical squared Euclidean distance as the cost function; its gradient descent algorithm is given by ∂DE (X AS) , ∂aik ∂DE (X AS) − ηkj , ∂skj

aik ← aik − μik

(6.7.2)

skj ← skj

(6.7.3)

where   ∂DE (X AS) = − (X − AS)ST ik , ∂aik   ∂DE (X AS) = − AT (X − AS) kj ∂skj are the gradients of the cost function with respect to the entry aik of the variable matrix A and the entry skj of the variable matrix S, respectively. If we choose skj aik , ηkj = , (6.7.4) μik = T T [A AS]kj [ASS ]ik then the gradient descent algorithm becomes a multiplication algorithm: aik ← aik

[XST ]ik , [ASST ]ik

skj ← skj

[AT X]kj , [AT AS]kj

i = 1, . . . , I, k = 1, . . . , K, k = 1, . . . , K, j = 1, . . . , J.

(6.7.5) (6.7.6)

Regarding on multiplication algorithms, there are four important remarks to be made. Remark 1 The theoretical basis of the multiplication algorithm is the auxiliary function in the expectation-maximization (EM) algorithm. Hence, the multiplication algorithm is practically an expectation maximization maximum likelihood (EMML) algorithm [102].

368

Solving Matrix Equations

Remark 2 The elementwise multiplication algorithm above is easily rewritten as a multiplication algorithm in matrix form: < = A ← A ∗ (XST ) * (ASST ) , (6.7.7)   T (6.7.8) S ← S ∗ (A X) * (AT AS) , where B ∗ C represents the componentwise product (Hadamard product) of two matrices, while B * C denotes the componentwise division of two matrices, namely [B ∗ C]ik = bik cik ,

[B * C]ik = bik /cik .

(6.7.9)

Remark 3 In the gradient descent algorithm, a fixed step length or an adaptive step length is usually taken; this is independent of the index of the updated variable. In other words, the step size may change with the time but at a given update time. In the same update time, the different entries of the variable matrix are updated with the same step size. It contrast, the multiplication algorithm adopts different steps μik for the different entries of the variable matrix. Therefore, this kind of step length is adaptive to the matrix entries. This is an important reason why the multiplication algorithm can improve on the performance of the gradient descent algorithm. Remark 4 The divergence D(X AS) is nonincreasing in the different multiplicative update rules [287], and this nonincreasing property is likely to have the result that the algorithm cannot converge to a stationary point [181]. For the regularized NMF cost function J(A, S) =

1

X − AS 22 + αA JA (A) + αS JS (S), 2

(6.7.10)

since the gradient vectors are respectively given by ∂J (A) ∂DE (X AS) + αA A , ∂aik ∂aik ∂J (S) ∂DE (X AS) ∇skj J(A, S) = + αA S , ∂skj ∂skj ∇aik J(A, S) =

the multiplication algorithm should be modified to [287] < = [XST ]ik − αA ∇JA (aik ) + , aik ← aik T [ASS ]ik + < = [AT X]kj − αS ∇JS (skj ) + skj ← skj , [AT AS]kj +

(6.7.11)

(6.7.12)

where ∇JA (aik ) = ∂JA (A)/∂aik , ∇JS (skj ) = ∂JS (S)/∂skj , and [u]+ = max{u, }. The parameter is usually a very small positive number that prevents the emergence

6.7 Nonnegative Matrix Factorization: Optimization Algorithms

369

of a zero denominator, as needed to ensure the convergence and numerical stability of the multiplication algorithms. The elementwise multiplication algorithm above can also be rewritten in the matrix form ) * A ← A ∗ (XST − αA ΨA ) * (ASST + I) , (6.7.13)  T  S ← S ∗ (A X − αS ΨS ) * (AT AS + I) , (6.7.14) where ΨA = ∂JA (A)/∂A and ΨS = ∂JS (S)/∂ S are two gradient matrices, while I is the identity matrix. 2. Multiplication Algorithms for KL Divergence Minimization Consider the minimization of the KL divergence   I  J  xi 1 j 1 xi1 j1 ln DKL (X AS) = − xi1 j1 + [AS]i1 j1 . [AS]i1 j1 i =1 j =1 1

(6.7.15)

1

Since J  ∂DKL (X AS) =− ∂aik j=1 I  ∂DKL (X AS) =− ∂skj i=1

 

skj xij + skj [AS]ij aik xij + aik [AS]ij

 ,  ,

the gradient descent algorithm is given by ⎛  ⎞ J  skj xij aik ← aik − μik × ⎝− + skj ⎠ , [AS] ij j=1   I   aik xij . + aik skj ← skj − ηkj × − [AS]ij i=1 Selecting μik = J

1

j=1 skj

,

ηkj = I

1

i=1

aik

,

the gradient descent algorithm can be rewritten as a multiplication algorithm [287]: J j=1 skj xik /[AS]ik aik ← aik , (6.7.16) J j=1 skj I a x /[AS]ik skj ← skj i=1ikI ik . (6.7.17) i=1 aik

370

Solving Matrix Equations

The corresponding matrix form is ) *  * ) A ← A * 1I ⊗ (S1K )T ∗ (X * (AS)) ST , ) *  * ) S ← S * (AT 1I ) ⊗ 1K ∗ AT (X * (AS)) ,

(6.7.18) (6.7.19)

where 1I is an I × 1 vector whose entries are all equal to 1. 3. Multiplication Algorithm for AB Divergence Minimization For AB divergence (where α, β, α + β = 0) we have (α,β)

DAB (X AS)

 J  I α β 1  α β α+β α+β xij [AS]ij − ; − =− x [AS]ij αβ i=1 j=1 α + β ij α+β

(6.7.20)

its gradient is given by * ∂DAB (X AS) 1 ) 1−β α+β−1 skj xα =− [AS] − [AS] s ij kj , ij ij ∂aik α j=1

(6.7.21)

* ∂DAB (X AS) 1 ) 1−β α+β−1 aik xα =− [AS] − [AS] a ij ik . ij ij ∂skj α i=1

(6.7.22)

J

(α,β)

I

(α,β)

Hence, letting the step sizes be μik = J

αaik

α+β−1 j=1 skj [AS]ij

ηkj = I i=1

αskj aik [AS]α+β−1 ij

, ,

the multiplication algorithm for the AB divergence minimization is given by J β−1 α j=1 skj xij [AS]ij , aik ← aik J α+β−1 j=1 skj [AS]ij I β−1 α i=1 aik xij [AS]ij skj ← skj I . α+β−1 i=1 aik [AS]ij In order to speed up the convergence of the algorithm, one can use instead the following updating rules [101]  J  β−1 1/α α j=1 skj xij [AS]ij , (6.7.23) aik ← aik J α+β−1 j=1 skj [AS]ij  I  β−1 1/α α i=1 aik xij [AS]ij , (6.7.24) skj ← skj I α+β−1 i=1 aik [AS]ij

6.7 Nonnegative Matrix Factorization: Optimization Algorithms

371

where the positive relaxation parameter 1/α is used to improve the convergence of the algorithm. When β = 1 − α, from the multiplication algorithm for the AB divergence minimization we get the multiplication algorithm for the α divergence minimization,  J 1/α α j=1 (xij /[AS]ij ) skj aik ← aik , (6.7.25) J j=1 skj  I 1/α α i=1 aik (xij /[AS]ij ) . (6.7.26) skj ← skj I i=1 aik For the general cases of α = 0 and/or β = 0, the gradient of the AB divergence is given by [101] J (α,β)  xij ∂DAB (X AS) =− [AS]λ−1 , ij skj ln1−α ∂aik [AS] ij j=1

(6.7.27)

I (α,β)  xij ∂DAB (X AS) =− [AS]λ−1 , ij aik ln1−α ∂skj [AS] ij i=1

(6.7.28)

where λ = α + β. Then, by the relationship between the deformed logarithm and the deformed exponential exp1−α (ln1−α x) = x, it is easy to see that the gradient algorithm for AB divergence minimization is given by   (α,β) ∂DAB (X AS) , (6.7.29) aik ← exp1−α ln1−α aik − μik ∂ ln1−α aik   (α,β) ∂DAB (X AS) . (6.7.30) skj ← exp1−α ln1−α skj − ηkj ∂ ln1−α skj Choosing μik = J ηkj =

a2α−1 ik

,

λ−1 j=1 skj [AS]ij s2α−1 kj , I λ−1 a i=1 ik [AS]ij

then the multiplication algorithm for AB divergence-based NMF is as follows [101]: ⎛ ⎞ J  skj [AS]λ−1 x ij ij ⎠ aik ← aik exp1−α ⎝ ln1−α , (6.7.31) J λ−1 [AS] s [AS] ij ij j=1 kj j=1  I   aik [AS]λ−1 xij ij . (6.7.32) ln1−α skj ← skj exp1−α I λ−1 [AS]ij i=1 aik [AS]ij i=1 78 9 6 78 96 weighting coefficient

α-zooming

372

Solving Matrix Equations

By α-zooming is meant that, owing to the zoom function of the parameter α, the relative errors of updating the elements of the basis matrix A = [aik ] and the coefficient matrix S = [skj ] are mainly controlled by the deformed logarithm ln1−α (xij /[AS]ij ), as follows: (1) When α > 1, the deformed logarithm ln1−α (xij /[AS]ij ) has the “zoom-out” function and emphasizes the role of the larger ratio xij /[AS]ij . Because the smaller ratio is reduced, it can be more easily neglected. (2) When α < 1, the “zoom-in” function of the deformed logarithm highlights the smaller ratio xij /[AS]ij . Because the AB divergence contains several divergences as special cases, the AB multiplication algorithm for NMF is correspondingly available for a variety of NMF algorithms. In the updating formulas of various multiplication algorithms for NMF, it is usually necessary to add a very small perturbation (such as = 10−9 ) in order to prevent the denominator becoming zero.

6.7.2 Nesterov Optimal Gradient Algorithm The gradient descent NMF algorithm in fact consists of two gradient descent algorithms, ∂f (Ak , Sk ) , ∂Ak ∂f (Ak , Sk ) = Sk − μ S . ∂Sk

Ak+1 = Ak − μA Sk+1

In order to ensure the nonnegativity of Ak and Sk , all elements of the updated matrices Ak+1 and Sk+1 need to be projected onto the nonnegative quadrant in each updating step, which constitutes the projected gradient algorithm for NMF [297]:   ∂f (Ak , Sk ) Ak+1 = Ak − μA , (6.7.33) ∂Ak +   ∂f (Ak , Sk ) . (6.7.34) Sk+1 = Sk − μS ∂Sk + Consider the factorization of a low-rank nonnegative matrix X ∈ Rm×n 1 min X − AS 2F 2

subject to A ∈ Rm×r , S ∈ Rr×n + + ,

(6.7.35)

where r = rank(X) < min{m, n}. Since Equation (6.7.35) is a nonconvex minimization problem, one can use alternating nonnegative least squares to represent the local solutions of the minimization

6.7 Nonnegative Matrix Factorization: Optimization Algorithms

373

problem (6.7.35): Sk+1 ATk+1

" # 1 2 = arg min F (Ak , S) = X − Ak S F , 2 S≥O " # 1 T T T T T 2 = arg min F (Sk+1 , A ) = X − Sk+1 A F . 2 A≥O

(6.7.36) (6.7.37)

Here k denotes the kth data block. Recently, Guan et al. [191] showed that the NMF objective function satisfies the two conditions of the Nesterov optimal gradient method: (1) the objective function F (At , S) = 12 X − At S 2F is a convex function; (2) the gradient of the objective function F (At , S) is Lipschitz continuous with Lipschitz constant L = ATt At F . On the basis of the above results, a Nesterov nonnegative matrix factorization (NeNMF) algorithm was proposed in [191], as shown in Algorithm 6.6, where OGM(At , S) is given in Algorithm 6.7. Algorithm 6.6

NeNMF algorithm [191]

input: Data matrix X ∈ Rm×n , 1 ≤ r ≤ min{m, n}. + initialization: A1 ≥ O, S1 ≥ O, k = 1. repeat 1. Update Sk+1 = OGM(Ak , S) and Ak+1 = OGM(STk+1 , AT ). P 2. exit if ∇P S F (Ak , Sk ) = O and ∇A F (Ak , Sk ) = O, where  % $ P [Sk ]ij > 0, [∇S F (Ak , Sk )]ij ,   ∇S F (Ak , Sk ) ij = min 0, [∇S F (Ak , Sk )]ij , [Sk ]ij = 0.  % $ P [Ak ]ij > 0, [∇A F (Ak , Sk )]ij ,   ∇A F (Ak , Sk ) ij = min 0, [∇A F (Ak , Sk )]ij , [Ak ]ij = 0.

return k ← k + 1. r×n output: Basis matrix A ∈ Rm×r , coefficient matrix S ∈ R+ . +

Remark If Ak → STk and Sk → ATk in Algorithm 6.7, and the Lipschitz constant is replaced by L = Sk STk F , then Algorithm 6.7 will give the optimal gradient method OGM(STk+1 , AT ), yielding the output ATk+1 = ATt . It has been shown [191] that owing to the introduction of structural information, when minimizing one matrix with another matrix held fixed, the NeNMF algorithm can converge at the rate O(1/k 2 ).

374

Solving Matrix Equations Optimal gradient method OGM(Ak , S) [191]

Algorithm 6.7

input: Ak and Sk . initialization: Y0 = Sk ≥ O, α0 = 1, L = ATk Ak F , t = 0. repeat

  1. Update St = P+ Yt − L−1 ∇S F (Ak , Yt ) .   1 2. Update αt+1 = 1 + 4αt2 + 1 . 2 α −1 (St − St−1 ). 3. update Yt+1 = St + t αt+1 P 4. exit if ∇S F (Ak , St ) = O.

return t ← t + 1. output: Sk+1 = St .

6.7.3 Alternating Nonnegative Least Squares The alternating least squares method was used for NMF first by Paatero and Tapper [367]. The resulting method is called the alternating nonnegative least squares (ANLS) method. The NMF optimization problem for XI×J = AI×K SK×J , described by 1 min X − AS 2F A,S 2

subject to A, S ≥ O,

can be decomposed into two ANLS subproblems [367] " # 1 (A fixed), ANLS1 : min f1 (S) = AS − X 2F S≥O 2 " # 1 T T T T 2 (S fixed). ANLS2 : min f2 (A ) = S A − X F A≥O 2

(6.7.38)

(6.7.39) (6.7.40)

These two ANLS subproblems correspond to using the LS method to solve alternately the matrix equations AS = X and ST AT = XT , whose LS solutions are respectively given by   (6.7.41) S = P+ (AT A)† AT X ,   T † T T A = P+ (SS ) SX . (6.7.42) If A and/or S are singular in the iteration process, the algorithm cannot converge. In order to overcome the shortcoming of the ALS algorithm due to numerical stability, Langville et al. [277] and Pauca et al. [378] independently proposed a constrained nonnegative matrix factorization (CNMF): " #  1 2 2 2 CNMF : min subject to A, S ≥ O,

X − AS F + α A F + β S F A,S 2 (6.7.43)

6.7 Nonnegative Matrix Factorization: Optimization Algorithms

375

where α ≥ 0 and β ≥ 0 are two regularization parameters that suppress A 2F and

S 2F , respectively. This constrained NMF is a typical application of the regularization least squares method of Tikhonov [464]. The regularization NMF problem can be decomposed into two alternating regularization nonnegative least squares (ARNLS) problems " # 1 1 2 2 ARNLS1 : min J1 (S) = AS − X F + β S F (A fixed), (6.7.44) 2 2 S∈RJ×K + " # 1 1 J2 (AT ) = ST AT − XT 2F + α A 2F (S fixed), ARNLS2 : min 2 2 A∈RI×J + (6.7.45) equivalently written as

+

2   22 0 2 12 A X 2 √ ARNLS1 : min J1 (S) = 2 , S− 2 OJ×K 2F βIJ 2 S∈RJ×K + + 2 T   T 22 0 2 12 S X T T 2 2 J2 (A ) = 2 √ . ARNLS2 : min A − I×J αIJ OJ×I 2F 2 A∈R+

(6.7.46)

(6.7.47)

From the matrix differentials  1  dJ1 (S) = d tr((AS − X)T (AS − X)) + β tr(ST S) 2   = tr (ST AT A − XT A + β ST )dS ,  1  dJ2 (AT ) = d tr((AS − X)(AS − X)T ) + αtr(AT A) 2) * = tr (ASST − XST + αA)dAT ,

we get the gradient matrices ∂J1 (S) = −AT X + AT AS + βS, ∂S ∂J2 (AT ) = −SXT + SST AT + αAT . ∂AT

(6.7.48) (6.7.49)

From ∂J1 (S)/∂S = O and ∂J2 (AT )/∂AT = O, it follows that the solutions of the two regularization least squares problems are respectively given by (AT A + βIJ )S = AT X or T

T

T

(SS + αIJ )A = SX

or

S = (AT A + βIJ )−1 AT X, T

T

A = (SS + αIJ )

−1

T

SX .

(6.7.50) (6.7.51)

The LS method for solving the above two problems is called the alternate constrained least squares method, whose basic framework is as follows [277]. (1) Use the nonzero entries to initialize A ∈ RI×K .

376

Solving Matrix Equations

(2) Find iteratively the regularization LS solutions (6.7.50) and (6.7.51), and force the matrices S and A to be nonnegative: skj = [S]kj = max{0, skj }

and

aik = [A]ik = max{0, aik }.

(6.7.52)

(3) Now normalize each column of A and each row of S to the unit Frobenius norm. Then return to (2), and repeat the iterations until some convergence criteria are met. A better way, however, is to use the multiplication algorithms to solve two alternating least squares problems. From the gradient formulas (6.7.48) and (6.7.49), it immediately follows that the alternating gradient algorithm is given by   skj ← skj + ηkj AT X − AT AS − βS kj ,   aik ← aik + μik XST − ASST − αA ik . Choosing ηkj = 

skj T A AS +

βS

 , kj

μik = 

aik  , ASST + αA ik

the gradient algorithm becomes the multiplication algorithm  T  A X kj  , skj ← skj  T A AS + βS kj   XST ik  . aik ← aik  ASST + αA ik

(6.7.53) (6.7.54)

As long as the matrices A and S are initialized using nonnegative values, the above-described iteration ensures the nonnegativity of the two matrices. By [378], choosing step sizes ηkj = 

skj  T A AS

, kj

μik = 

aik  ASST ik

gives the multiplication algorithm  AT X − βS kj  ← skj  T , A AS kj +   XST − αA ik  . ← aik  ASST ik + 

skj aik

(6.7.55) (6.7.56)

Notice that the above algorithm cannot ensure the nonnegativity of the matrix entries, owing to the subtraction operations in the numerators.

6.7 Nonnegative Matrix Factorization: Optimization Algorithms

377

6.7.4 Quasi-Newton Method 1. Basic Quasi-Newton Method [539] Consider solving the over-determined matrix equation STK×J ATI×K = XTI×K (where J + I) for the unknown matrix AT , an effective way is the quasi-Newton method. The Hessian matrix HA = ∇2A (DE ) = II×I ⊗SST ∈ RIK×IK of the cost function DE (X AS) = (1/2) X − AS 22 is a block diagonal matrix whose block matrix on the diagonal is SST . Then the quasi-Newton method takes the update     , A ← A − ∇A DE (X AS) H−1 A here ∇A (DE (X AS)) = (AS − X)ST is the gradient matrix of the cost function DE (X AS). Hence, the quasi-Newton algorithm is given by <  −1 = A ← A − (AS − X)ST SST . In order to prevent the matrix SST from being singular or having a large condition number, one can use the relaxation method <  −1 = A ← A − (AS − X)ST SST + λIK×K , where λ is a Tikhonov regularization parameter. 2. Multilayer Decomposition Method [105], [106] The basic idea of the multilayer decomposition method is as follows. The largest error is in the first decomposition result X ≈ A(1) S(1) ; hence S(1) is regarded as a new data matrix, and then the NMF of the second layer S(1) ≈ A(2) S(2) is constructed. The second layer decomposition result is still subject to error, and so it is necessary to construct the third layer NMF S(2) ≈ A(3) S(3) , and to continue this process to form an L-layer NMF: X ≈ A(1) S(1) ∈ RI×J S(1) ≈ A(2) S(2) ∈ RK×J .. . S(L−1) ≈ A(L) S(L) ∈ RK×J

(where A(1) ∈ RI×K ), (where A(2) ∈ RK×K ),

(where A(L) ∈ RK×K ).

The decomposition of every layer is made by NMF, and any algorithm described above can be adopted. The final multilayer decomposition result is X ≈ A(1) A(2) · · · A(L) S(L) , from which the NMF is given by A = A(1) A(2) · · · A(L) and S = S(L) .

378

Solving Matrix Equations

6.7.5 Sparse Nonnegative Matrix Factorization
When a sparse representation of the data is expected to be obtained by NMF, it is necessary to consider NMF with a sparseness constraint. Given a vector x ∈ R^n, Hoyer [220] proposed using the ratio of the ℓ1-norm and the ℓ2-norm,

sparseness(x) = (√n − ‖x‖_1/‖x‖_2) / (√n − 1),    (6.7.57)

as the measure of the sparseness of the vector. Clearly, if x has only one nonzero element then its sparseness is equal to 1; the sparseness of the vector is zero if and only if the absolute values of all the elements of x are equal. The sparseness of any vector lies in the interval [0, 1].

An NMF with sparseness constraints is defined as follows [220]: given a nonnegative data matrix X ∈ R_+^{I×J}, find the nonnegative basis matrix A ∈ R_+^{I×K} and the nonnegative coefficient matrix S ∈ R_+^{K×J} such that

L(A, S) = ‖X − AS‖²_F    (6.7.58)

is minimized and A and S satisfy, respectively, the following sparseness constraints:

sparseness(a_k) = S_a,    sparseness(s_k) = S_s,    k = 1, . . . , K.

Here a_k and s_k are respectively the kth column of the nonnegative matrix A and the kth row of S, K is the number of components, and S_a and S_s are respectively the (expected) sparseness of the columns of A and of the rows of S.
The following is the basic framework of nonnegative matrix factorization with sparseness constraints [220]:
1. Use two random positive matrices to initialize A and S.
2. If adding a sparseness constraint to the matrix A then:
   • let A ← A − μ_A (AS − X) S^T;
   • project every column vector of A onto a new nonnegative column vector with unchanged ℓ2-norm but with its ℓ1-norm set so as to achieve the expected sparseness.
3. If not adding a sparseness constraint to A then A ← A ⊛ (X S^T) ⊘ (A S S^T), where ⊛ and ⊘ denote elementwise (Hadamard) multiplication and division.
4. If adding a sparseness constraint to the matrix S then:
   • let S ← S − μ_S A^T (AS − X);
   • project every row vector of S onto a new nonnegative row vector with unchanged ℓ2-norm but with its ℓ1-norm set so as to achieve the expected sparseness.
5. If not adding a sparseness constraint to S then S ← S ⊛ (A^T X) ⊘ (A^T A S).
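For reference, a small NumPy sketch of Hoyer's sparseness measure (6.7.57); the function name and the guard for the all-zero vector are illustrative additions, not specified in the text.

```python
import numpy as np

def hoyer_sparseness(x):
    """Sparseness in the sense of (6.7.57): 1 for a 1-sparse vector, 0 when all |x_i| are equal."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
    if l2 == 0:                      # convention for the all-zero vector (an assumption, not from the text)
        return 0.0
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

print(hoyer_sparseness([0, 0, 3, 0]))   # -> 1.0
print(hoyer_sparseness([1, 1, 1, 1]))   # -> 0.0
```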

In the following, we consider a regularized NMF with a sparseness constraint on the columns of S:

min_{A,S} (1/2) { ‖AS − X‖²_F + α‖A‖²_F + β Σ_{j=1}^{J} ‖S_{:,j}‖²_1 }   subject to A, S ≥ O,    (6.7.59)

where S_{:,j} is the jth column of the matrix S. The sparse NMF problem (6.7.59) can be decomposed into two alternating LS subproblems [247]:

min_{S ∈ R_+^{K×J}} J_3(S) = (1/2) ‖ [A; √β 1_K^T] S − [X; O_{1×J}] ‖²_F,    (6.7.60)
min_{A ∈ R_+^{I×K}} J_4(A^T) = (1/2) ‖ [S^T; √α I_K] A^T − [X^T; O_{K×I}] ‖²_F,    (6.7.61)

where [A; √β 1_K^T] denotes the matrix A stacked on top of the row vector √β 1_K^T, and similarly for the other stacked matrices. From the matrix differentials

dJ_3(S) = (1/2) d tr( (AS − X)^T (AS − X) + β S^T 1_K 1_K^T S )
        = tr( (S^T A^T A − X^T A + β S^T E_K) dS ),
dJ_4(A^T) = dJ_2(A^T) = tr( (A S S^T − X S^T + α A) dA^T ),

it follows that the gradient matrices of the objective functions of the sparse nonnegative LS problems are given by

∂J_3(S)/∂S = −A^T X + A^T A S + β E_K S,    (6.7.62)
∂J_4(A^T)/∂A^T = −S X^T + S S^T A^T + α A^T.    (6.7.63)

Here E_K = 1_K 1_K^T is a K × K matrix with all entries equal to 1. Then, the alternating sparse nonnegative LS solutions are respectively given by

(A^T A + β E_K) S = A^T X,    (S S^T + α I_K) A^T = S X^T,

namely

S = (A^T A + β E_K)^{-1} A^T X,    A^T = (S S^T + α I_K)^{-1} S X^T.

However, the gradient algorithm for the sparse NMF is

[S]_kj ← [S]_kj + η_kj [A^T X − A^T A S − β E_K S]_kj,    (6.7.64)
[A^T]_ki ← [A^T]_ki + μ_ki [S X^T − S S^T A^T − α A^T]_ki.    (6.7.65)


Choosing the step sizes

η_kj = [S]_kj / [A^T A S + β E_K S]_kj,    μ_ki = [A^T]_ki / [S S^T A^T + α A^T]_ki,

the gradient algorithm becomes the multiplication algorithm

[S]_kj ← [S]_kj [A^T X]_kj / [A^T A S + β E_K S]_kj,    (6.7.66)
[A^T]_ki ← [A^T]_ki [S X^T]_ki / [S S^T A^T + α A^T]_ki.    (6.7.67)

This is the alternating LS multiplication algorithm for a sparse NMF [247].

The NMF is widely used, and here we present a few typical applications.

In image processing the data can be represented as an m × n nonnegative matrix X, each of whose columns is an image described by m nonnegative pixel values. Then NMF gives two factor matrices A and S such that X ≈ AS, where A is the basis matrix and S is the coding matrix. In pattern recognition applications such as face recognition, each column of the basis matrix A can be regarded as one key part of the whole face, such as nose, eye, ear etc., and each column of the coding matrix S gives a weight by which the corresponding image can be reconstructed as a linear combination of different parts. Then, NMF can discover the common basis hidden behind an observed image and hence perform face recognition. However, for an image such as a face or brain, etc., the standard version of NMF does not necessarily provide a correct part-of-whole representation. In such cases, it is necessary to add the sparsity constraint to the NMF used for face or brain recognition in order to improve the part-of-whole representation of the image.

The main purpose of clustering text, image or biology data is to discover patterns from data automatically. Given an m × n nonnegative matrix X = [x_1, . . . , x_n], suppose that there are r cluster patterns. Our problem is to determine to which cluster the m × 1 data vector x_j belongs. Via NMF, we can find two factor matrices A ∈ R_+^{m×r} and S ∈ R_+^{r×n} such that X ≈ AS, where r is the cluster number. The factor matrix A can be regarded as a cluster centroid matrix, while S is a cluster membership indicator matrix. The NMF model X = AS can be rewritten as

[x_1, . . . , x_n] = [a_1, . . . , a_r] [ S_11 ⋯ S_1n ; ⋮ ⋱ ⋮ ; S_r1 ⋯ S_rn ]    (6.7.68)

or

x_j = Σ_{i=1}^{r} a_i S_ij,    j = 1, . . . , n.    (6.7.69)


This shows that if S_kj is the largest entry in the jth column of S then the jth data vector x_j can be assigned to the kth cluster pattern.

Nonnegative matrix factorization has been successfully applied in different fields such as metagenes and molecular pattern discovery in bioinformatics [70], text or document clustering [379], [434], the analysis of financial data [134] and so on.

K-means is one of the most famous and traditional methods in clustering analysis, and probabilistic latent semantic indexing (PLSI) is one of the state-of-the-art unsupervised learning models in data mining. The NMF method is closely related to K-means and PLSI [544], as follows.
(1) A soft K-means model can be rewritten as a symmetric NMF model, and hence K-means and NMF are equivalent. This equivalence justifies the use of NMF for data clustering, but it does not mean that K-means and NMF will generate identical cluster results, since they employ different algorithms.
(2) The PLSI and NMF methods optimize the same objective function (the KL divergence), but the PLSI has additional constraints. The algorithms for the two models can generate equivalent solutions, but they are different in essence.
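A minimal sketch of this clustering rule: factorize X with any NMF routine and assign each column to the row index with the largest coefficient. The use of the earlier multiplicative-update sketch and the variable names are illustrative assumptions.

```python
import numpy as np

# X: m x n nonnegative data matrix, r: assumed number of cluster patterns
X = np.abs(np.random.default_rng(2).standard_normal((50, 40)))
A, S = regularized_nmf_multiplicative(X, K=4, alpha=0.0, beta=0.0)   # r = 4 clusters

# column j belongs to cluster k where S[k, j] is the largest entry of column j of S
labels = np.argmax(S, axis=0)
print(labels[:10])
```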

6.8 Sparse Matrix Equation Solving: Optimization Theory
In sparse representation and compressed sensing (also known as compressive sensing, compressive sampling, or sparse sampling), it is necessary to solve an under-determined sparse matrix equation Ax = b, where x is a sparse vector with only a few nonzero entries. In this section we discuss the use of optimization theory for solving sparse matrix equations.

6.8.1 ℓ1-Norm Minimization
A full row rank under-determined matrix equation A_{m×n} x_{n×1} = b_{m×1} with m < n has infinitely many solutions. Suppose that we seek the sparsest solution, with the fewest nonzero entries. Can it ever be unique? How do we find the sparsest solution? This section answers the first question; the second question will be discussed in the next section.
For any positive number p > 0, the ℓp-norm of the vector x is defined as

‖x‖_p = ( Σ_{i ∈ support(x)} |x_i|^p )^{1/p}.    (6.8.1)

Thus the ℓ0-norm of an n × 1 vector x can be defined as

‖x‖_0 = lim_{p→0} ‖x‖_p^p = lim_{p→0} Σ_{i=1}^{n} |x_i|^p = Σ_{i=1}^{n} 1(x_i ≠ 0) = #{i | x_i ≠ 0}.    (6.8.2)


Thus if ‖x‖_0 ≪ n then x is sparse. As described in Chapter 1, the core problem of sparse representation is the ℓ0-norm minimization

(P0)   min_x ‖x‖_0   subject to   b = Ax,    (6.8.3)

where A ∈ R^{m×n}, x ∈ R^n, b ∈ R^m. As an observation signal is usually contaminated by noise, the equality constraint in the above optimization problem should be relaxed to ℓ0-norm minimization with an inequality constraint, which allows a certain error perturbation ε ≥ 0:

min_x ‖x‖_0   subject to   ‖Ax − b‖_2 ≤ ε.    (6.8.4)

A key term coined and defined in [130] that is crucial for the study of uniqueness is the spark of the matrix A.

DEFINITION 6.7 [130] Given a matrix A, its spark, σ = spark(A), is defined as the smallest number σ such that there exists a subgroup of σ columns from A that are linearly dependent.

The spark gives a simple criterion for the uniqueness of a sparse solution of an under-determined system of linear equations Ax = b, as shown below.

THEOREM 6.9 [182], [130], [69] If a system of linear equations Ax = b has a solution x obeying ‖x‖_0 < spark(A)/2, this solution is necessarily the sparsest solution.

To solve the optimization problem (P0) directly, one must screen out all nonzero elements in the coefficient vector x. However, this method is intractable, or nondeterministic polynomial-time hard (NP-hard), since the search space is too large [313], [118], [337].
The index set of the nonzero elements of the vector x = [x_1, . . . , x_n]^T is called its support, denoted support(x) = {i | x_i ≠ 0}. The length of the support (i.e., the number of nonzero elements) is measured by the ℓ0-norm

‖x‖_0 = |support(x)|.    (6.8.5)

A vector x ∈ R^n is said to be K-sparse if ‖x‖_0 ≤ K, where K ∈ {1, . . . , n}. The set of K-sparse vectors is denoted by

Σ_K = { x ∈ R^{n×1} : ‖x‖_0 ≤ K }.    (6.8.6)

If x̂ ∈ Σ_K then the vector x̂ ∈ R^n is known as the K-sparse approximation or K-term approximation of x ∈ R^n.
A given set of L m × 1 real input vectors {y_1, . . . , y_L} constitutes the data matrix Y = [y_1, . . . , y_L] ∈ R^{m×L}. The sparse coding problem involves determining n m × 1 basis vectors a_1, . . . , a_n ∈ R^m and, for each input vector y_l, an n × 1


sparse weighting vector, or coefficient vector, s_l ∈ R^n, such that a weighted linear combination of a few of these basis vectors approximates the original input vector, namely

y_l = Σ_{i=1}^{n} s_{l,i} a_i = A s_l,    l = 1, . . . , L,    (6.8.7)

where s_{l,i} is the ith entry of the sparse weighting vector s_l. Sparse coding can be regarded as a form of neural coding: because the weight vector is sparse, only a small number of neurons (basis vectors) are strongly excited by each input vector, and different input vectors excite different neurons.
The vector x̂ is said to be the optimal K-sparse approximation of x in the ℓp-norm sense if the ℓp-norm of the approximation error vector x − x̂ reaches the infimum, i.e.,

‖x − x̂‖_p = inf_{z ∈ Σ_K} ‖x − z‖_p.

Obviously, there is a close relationship between the ℓ0-norm definition formula (6.8.5) and the ℓp-norm definition formula (6.8.1): as p → 0, ‖x‖_0 = lim_{p→0} ‖x‖_p^p.
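A quick numerical check of this limit (with illustrative values): for a fixed sparse vector, ‖x‖_p^p approaches the number of nonzero entries as p shrinks.

```python
import numpy as np

x = np.array([0.0, 2.5, 0.0, -0.3, 0.0, 1.0])    # three nonzero entries, so ||x||_0 = 3
for p in (1.0, 0.5, 0.1, 0.01):
    print(p, np.sum(np.abs(x[x != 0]) ** p))      # tends to 3 as p -> 0
```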

Because ‖x‖_p is a convex function if and only if p ≥ 1, the ℓ1-norm is the convex objective function closest to the ℓ0-norm. Thus, from the viewpoint of optimization, the ℓ1-norm is said to be the convex relaxation of the ℓ0-norm. Therefore, the ℓ0-norm minimization problem (P0) can be transformed into the convex relaxed ℓ1-norm minimization problem

(P1)   min_x ‖x‖_1   subject to   b = Ax.    (6.8.8)

This is a convex optimization problem because the ℓ1-norm ‖x‖_1, as the objective function, is itself convex and the equality constraint b = Ax is affine.
In an actual situation with observation noise, the equality constrained optimization problem (P1) can be relaxed to the inequality constrained optimization problem

(P10)   min_x ‖x‖_1   subject to   ‖b − Ax‖_2 ≤ ε.    (6.8.9)

The ℓ1-norm minimization in Equation (6.8.9) is also called basis pursuit (BP). This is a quadratically constrained linear programming (QCLP) problem.
If x_1 is the solution to (P1) and x_0 is the solution to (P0), then [129]

‖x_1‖_1 ≤ ‖x_0‖_1,    (6.8.10)

because x_0 is only a feasible solution to (P1), while x_1 is the optimal solution to (P1). The direct relationship between x_0 and x_1 is given by Ax_1 = Ax_0.
Similarly to the inequality constrained ℓ0-norm minimization expression (6.8.4),


the inequality constrained ℓ1-norm minimization expression (6.8.9) also has two variants.
(1) Since x is constrained as a K-sparse vector, the inequality constrained ℓ1-norm minimization becomes the inequality constrained ℓ2-norm minimization

(P11)   min_x (1/2)‖b − Ax‖²_2   subject to   ‖x‖_1 ≤ q.    (6.8.11)

This is a quadratic programming (QP) problem.
(2) Using the Lagrange multiplier method, the inequality constrained ℓ1-norm minimization problem (P11) becomes

(P12)   min_{λ,x} { (1/2)‖b − Ax‖²_2 + λ‖x‖_1 }.    (6.8.12)

This optimization is called basis pursuit de-noising (BPDN) [96].
The optimization problems (P10) and (P11) are respectively called error constrained ℓ1-norm minimization and ℓ1-penalty minimization [471]. The ℓ1-penalty minimization is also known as regularized ℓ1 linear programming or ℓ1-norm regularized least squares. The Lagrange multiplier acts as a regularization parameter that controls the sparseness of the sparse solution: the greater λ is, the sparser the solution x is. When the regularization parameter λ is large enough, the solution x is a zero vector. With the gradual decrease of λ, the sparsity of the solution vector x also gradually decreases. As λ tends to 0, the solution vector x becomes the vector minimizing ‖b − Ax‖²_2. That is to say, λ > 0 balances the twin objectives of minimizing the squared-error cost function (1/2)‖b − Ax‖²_2 and the ℓ1-norm cost function ‖x‖_1:

J(λ, x) = (1/2)‖b − Ax‖²_2 + λ‖x‖_1.    (6.8.13)

6.8.2 Lasso and Robust Linear Regression
Minimizing the squared error usually leads to sensitive solutions. Many regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov regularization [465] and Lasso [462], [143] are two widely known and cited algorithms, as pointed out in [515].
The problem of regression analysis is a fundamental problem within many fields, such as statistics, supervised machine learning, optimization and so on. In order to reduce the computational complexity of solving the optimization problem (P1) directly, consider a linear regression problem: given an observed data vector b ∈ R^m and an observed data matrix A ∈ R^{m×n}, find a fitting coefficient vector x ∈ R^n such that

b̂_i = x_1 a_{i1} + x_2 a_{i2} + · · · + x_n a_{in},    i = 1, . . . , m,    (6.8.14)


or

b̂ = Σ_{i=1}^{n} x_i a_i = Ax,    (6.8.15)

where

A = [a_1, . . . , a_n] = [ a_11 ⋯ a_1n ; ⋮ ⋱ ⋮ ; a_m1 ⋯ a_mn ],
x = [x_1, . . . , x_n]^T,    b = [b_1, . . . , b_m]^T.

As a preprocessing of the linear regression, it is assumed that

Σ_{i=1}^{m} b_i = 0,    Σ_{i=1}^{m} a_ij = 0,    Σ_{i=1}^{m} a²_ij = 1,    j = 1, . . . , n,    (6.8.16)

and that the column vectors of the input matrix A are linearly independent. The preprocessed input matrix A is called an orthonormal input matrix, its column vectors ai are known as a predictors and the vector x is simply called the coefficient vector. Tibshirani [462] in 1996 proposed the least absolute shrinkage and selection operator (Lasso) algorithm for solving the linear regression. The basic idea of the Lasso is as follows: under the constraint that the 1 -norm of the prediction vector does not exceed an upper bound q, the squared sum of the prediction errors is minimized, namely Lasso :

min b − Ax 22 x

subject to x 1 =

n 

|xi | ≤ q.

(6.8.17)

i=1

Obviously, the Lasso model and the QP problem (6.8.11) have exactly the same form. The bound q is a tuning parameter. When q is large enough, the constraint has no effect on x and the solution is just the usual multiple linear least squares regression of x on ai1 , . . . , ain and bi , i = 1, . . . , m. However, for smaller values of q (q ≥ 0), some of the coefficients xj will take the value zero. Choosing a suitable q will lead to a sparse coefficient vector x. The regularized Lasso problem is given by   Regularized Lasso: min b − Ax 22 + λ x 1 . (6.8.18) x

The Lasso problem involves both the 1 -norm constrained fitting for statistics and data mining. The Lasso has the following two basic functions.

386

Solving Matrix Equations

(1) Contraction function The Lasso shrinks the range of parameters to be estimated; only a small number of selected parameters are estimated at each step. (2) Selection function The Lasso automatically selects a very small part of the variables for linear regression, yielding a spare solution. Therefore, the Lasso is a shrinkage and selection method for linear regression. It has a close relation to the soft thresholding of wavelet coefficients, forward stagewise regression, boosting methods and so on. The Lasso achieves a better prediction accuracy by shrinkage and variable selection. In practical linear regression, the observed matrix is usually corrupted by some potentially malicious perturbation. In order to find the optimal solution in the worst case, robust linear regression solves the following minimax problem [515] min max b − (A + ΔA)x 2 ,

x∈Rn ΔA∈ U

where ΔA = [δ1 , . . . , δn ] denotes the perturbation of the matrix A and  3 4 def  U = [δ1 , . . . , δn ] δi 2 ≤ ci , i = 1, . . . , n

(6.8.19)

(6.8.20)

is called the uncertainty set, or the set of admissible perturbations of the matrix A. This uncertainty set is said to be featurewise uncoupled, in contrast with coupled uncertainty sets, which require the perturbations of different features to satisfy some joint constraints. Uncoupled 1 norm-bounded uncertainty sets lead to an easily solvable optimization problem, as shown in the following theorem. THEOREM 6.10 [515] The robust regression problem (6.8.19) with uncertainty set U given by (6.8.20) is equivalent to the following 1 -regularized regression problem: + 0 n  minn b − Ax 2 + ci |xi | . (6.8.21) x∈R

i=1

With the 1 -norm changed to an arbitrary norm · a , Theorem 6.10 can be extended to the following result. THEOREM 6.11 [515]

Let  3 4 def  Ua = [δ1 , . . . , δn ] δi a ≤ ci , i = 1, . . . , n .

(6.8.22)

Then the robust regression problem (6.8.19) with uncertainty set Ua is equivalent to the following arbitrary norm · a -regularized regression problem: " # (6.8.23) minn max b − (A + ΔA)x a . x∈R

ΔA∈Ua

6.8 Sparse Matrix Equation Solving: Optimization Theory

THEOREM 6.12

387

The optimization problems

minn ( b − Ax 2 + λ1 x 1 )

x∈R

and

  minn b − Ax 22 + λ2 x 1

x∈R

(6.8.24)

are equivalent, up to changes in the tradeoff parameters λ1 and λ2 . Proof

See Appendix A in [515].

Taking ci = λ and orthonormalizing the column vector ai of A for all i = 1, . . . , n, the 1 regularized regression problem (6.8.21) becomes min ( b − Ax 2 + λ x 1 ) .

x∈Rn

(6.8.25)

By Theorem 6.12, this optimization problem is equivalent to the regularized Lasso problem (6.8.18). That is to say, a robust linear regression problem is equivalent to a regularized Lasso problem. Regarding the solution to Lasso as the solution of a robust least squares problem has two important consequences [515]. (1) Robustness provides a connection of the regularizer to a physical property, namely, protection from noise. (2) Perhaps most significantly, robustness is a strong property that can itself be used as an avenue for investigating different properties of the solution: robustness of the solution can explain why the solution is sparse; moreover, a robust solution is, by definition, the optimal solution under a worst-case perturbation. The Lasso has many generalizations; here are a few examples. In machine learning, sparse kernel regression is referred to as generalized Lasso [418]. The multidimensional shrinkage-thresholding method is known as group Lasso [402]. The sparse multiview feature selection method via low-rank analysis is called MRMLasso (multiview rank minimization-based Lasso) [528]. Distributed Lasso solves distributed sparse linear regression problems [316].

6.8.3 Mutual Coherence and RIP Conditions The 1 -norm minimization problem (P1 ) is a convex relaxation of the 0 -norm minimization (P0 ). Unlike the 0 -norm minimization problem, the 1 -norm minimization problem has the tractability. A natural question to ask is “what is the relationship between these two kinds of optimization problems?” DEFINITION 6.8 (see [313], [131], [69]) The mutual coherence of a given matrix A is the largest absolute normalized inner product between different columns from A. Denoting the kth column in A by ak , the mutual coherence is thus defined as μ(A) =

max

1≤k,j≤n,k=j

|aTk aj | .

ak 2 aj 2

(6.8.26)

388

Solving Matrix Equations

THEOREM 6.13 [130], [69]

For any matrix A ∈ Rm×n , one has spark(A) ≥ 1 +

1 . μ(A)

(6.8.27)

LEMMA 6.4 [130], [69] For an under-determined equation Ax = b (A ∈ Rm×n with full rank), if there exists a solution x such that   1 1 1+ (6.8.28)

x 0 < 2 μ(A) then x is not only the unique solution of (P1 ) but also the unique solution of (P0 ). DEFINITION 6.9 A matrix A is said to satisfy a Kth-order restricted isometry property (RIP) condition, if

x 0 ≤ K



(1 − δK ) x 22 ≤ AK x 22 ≤ (1 + δK ) x 22 ,

(6.8.29)

where 0 ≤ δK < 1 is a constant related to the sparseness of A, while AK is a submatrix consisting of any K columns of the dictionary matrix A. The RIP condition was presented by Cand`es and Tao [81] in 2006 and was refined by Foucart and Lai [160] in 2009. When the RIP condition is satisfied, the nonconvex 0 -norm minimization (P0 ) is equivalent to the convex 1 -norm minimization (P1 ). Let I = {i|xi = 0} ⊂ {1, . . . , n} denote the support of the nonzero elements of the sparse vector x; then |I| denotes the length of the support, i.e., the number of nonzero elements of the sparse vector x. The Kth-order RIP condition with parameter δK is denoted by RIP(K, δK ), where δK is called the restricted isometry constant (RIC) and is defined as the infimum of all the parameters δ such that RIP(K, δK ) holds, i.e.,    δK = inf δ (1 − δ) z 22 ≤ AI z 22 ≤ (1 + δ) z 22 , ∀ |I| ≤ K, ∀ z ∈ R|I| . (6.8.30) By Definition 6.9 it is known that if the matrix AK is orthogonal, then δK = 0 since AK x 2 = x 2 . Therefore, a nonzero value of the RIC δK of a matrix can be used to evaluate the nonorthogonality of the matrix. It is shown in [76] that a more compact lower bound of the RIC is δK < 0.307.

(6.8.31)

Under this condition, if there is no noise then a K-sparse signal can be exactly recovered by 1 -norm minimization and, in the case of noise, the K-sparse signals can be robustly estimated by 1 -norm minimization. The RIC meets the monotonicity condition [76] δK ≤ δK 1

if K ≤ K1 .

(6.8.32)

6.8 Sparse Matrix Equation Solving: Optimization Theory

389

6.8.4 Relation to Tikhonov Regularization In Tikhonov regularization LS problems, when using the 1 -norm of the unknown coefficient vector x in the regularization term instead of the 2 -norm, we have the 1 regularization LS problem " # 1 2 (6.8.33) min

b − Ax 2 + λ x 1 . x 2 The 1 regularization LS problem always has a solution, but not necessarily a unique solution. The following are the 1 regularization properties, which reflect the similarity and differences between 1 -regularization and Tikhonov regularization [248]. 1. Nonlinearity The solution vector x of a Tikhonov regularization problem is a linear function of the observed data vector b, but the solution vector of the 1 -regularization problem is not. 2. Regularization path When the regularization parameter λ varies in the interval [0, ∞), the optimal solution of the Tikhonov regularization problem is a smooth function of the regularization parameter. However, the solution family of the 1 -regularization problem has a piecewise linear solution path [143]: there are regularization parameters λ1 , . . . , λk (where 0 = λk < · · · < λ1 = λmax ) such that the solution vector of the 1 -regularization problem is piecewise linear: x1 =

λ − λi+1 (i) λi − λ (i+1) x − x . λi − λi+1 1 λi − λi+1 1

(6.8.34)

(i)

Here λi+1 ≤ λ ≤ λi , i = 1, . . . , k − 1, and x1 is the solution vector of the 1 -regularization problem when the regularization parameter is λi , while x1 is (1)

the optimal solution of the 1 -regularization problem. Therefore, x1 = 0 and x1 = 0 when λ ≥ λ1 . 3. Limit characteristic of λ → 0 When λ → 0, the limit point of the solution of the Tikhonov regularization problem has the smallest 2 -norm x 2 among all the feasible points such that AH (b − Ax) = 0. In contrast the limit of the solution of the 1 -regularization problem when λ → 0 has the smallest 1 -norm

x 1 among all the feasible points such that AH (b − Ax) = 0. 4. Limit characteristics of λ ≥ λmax When λ → ∞, the optimal solution of the Tikhonov regularization problem converges to a zero vector. However, only if λ ≥ λmax = AH b ∞ ,

(6.8.35)

does the solution of the 1 -regularization problem converge to a zero vector. In the above equation, AH b ∞ = max{[AH b]i } is the ∞ -norm of the vector. The most fundamental difference between the 1 -regularization LS problem and the Tikhonov regularization LS problem is that the solution of the former is usually a sparse vector while the coefficients of the solution of the latter are generally nonzero.

390

Solving Matrix Equations

6.8.5 Gradient Analysis of 1 -Norm Minimization The signum function of a real-valued variable ⎧ ⎪ ⎪ ⎨+1, sign(t) = 0, ⎪ ⎪ ⎩−1,

t ∈ R is defined as t > 0, (6.8.36)

t = 0, t < 0.

The signum multifunction of t ∈ R, denoted SGN(t), also called the set-valued function, is defined as [194] ⎧ ⎪ t > 0, ⎪{+1}, ∂|t| ⎨ = [−1, +1], t = 0, (6.8.37) SGN(t) = ⎪ ∂t ⎪ ⎩{−1}, t < 0. The signum multifunction is also the subdifferential of |t|. For the 1 -norm optimization problem " # 1 min J(λ, x) = min

b − Ax 22 + λ x 1 , x x 2

(6.8.38)

the gradient vector of the objective function is given by ∂J(λ, x) = −AT (b − Ax) + λ∇x x 1 ∂x = −c + λ∇x x 1 .

∇x J(λ, x) =

(6.8.39)

Here c = AT (b − Ax) is the vector of residual correlation vector, and ∇x x 1 = [∇x1 x 1 , . . . , ∇xn x 1 ]T is the gradient vector of the 1 -norm x 1 , with ith entry ⎧ ⎪ xi > 0, ⎪ ⎨{+1}, ∂ x 1 ∇xi x 1 = = {−1}, (6.8.40) xi < 0, ⎪ ∂xi ⎪ ⎩[−1, +1], x = 0, i

for i = 1, . . . , n. From (6.8.39) it can be seen that the stationary point of the 1 -norm minimization problem (P12 ) is determined by the condition ∇x J(λ, x) = −c + λ∇x x 1 = 0, i.e., c = λ∇x x 1 .

(6.8.41)

Letting c = [c(1), . . . , c(n)]T and substituting (6.8.40) into (6.8.41), the stationarypoint condition can be rewritten as ⎧ ⎪ xi > 0, ⎪{+λ}, ⎨ (6.8.42) c(i) = {−λ}, xi < 0, ⎪ ⎪ ⎩[−λ, λ], x = 0, i

for i = 1, . . . , n. Since 1 -norm minimization is a convex optimization problem, the

6.9 Sparse Matrix Equation Solving: Optimization Algorithms

391

above stationary-point condition is in practice a sufficient and necessary condition for the optimal solution of the 1 -norm minimization. The stationary-point condition (6.8.42) can be represented using the vector of residual correlations as c(I) = λsign(x) and

|c(I c )| ≤ λ,

(6.8.43)

in which I c = {1, . . . , n}\I is the complementary set of the support I. This shows that the amplitude of the residual correlation is equal to λ and the sign is consistent with the sign of the corresponding element of the vector x. Equation (6.8.43) can be equivalently written as |c(j)| = λ,

∀j ∈ I

and

|c(j)| ≤ λ

∀ j ∈ I c.

(6.8.44)

That is to say, the absolute value of the residual correlation within the support is equal to λ, while that outside the support is less than or equal to λ, namely

c ∞ = max{c(j)} = λ.

6.9 Sparse Matrix Equation Solving: Optimization Algorithms The common basic idea behind 1 -norm minimization algorithms is as follows: by recognizing the support region of sparse vectors, the solution of an underdetermined sparse matrix equation is transformed into the solution of an overdetermined (nonsparse) matrix equation.

6.9.1 Basis Pursuit Algorithms The orthogonal matching pursuit method is a kind of greedy stepwise least squares method for fitting sparse models. The general method for finding the whole optimal solution of an under-determined matrix equation Am×n xn×1 = bm×1 (m ' n) with sparseness s is first to find the ˜ s×1 = b (usually LS solutions of all the over-determined matrix equations Am×s x m + s) and then to determine the optimal solution among these solutions. Because there are Csm possible combinations of the over-determined equations, the overall solution process is time-consuming and tedious. The basic idea of a greedy algorithm [470] is not to seek a global optimal solution but to try to find a local optimal solution in some sense as quickly as possible. The greedy method cannot get a global optimal solution for all problems, but it may provide a global optimal solution or its approximation in most cases. Typical greedy algorithms use the following matching pursuit methods: 1. Basic matching pursuit (MP) method This was proposed by Mallat and Zhang [313] in 1993; its basic idea is not to minimize some cost function, but to iteratively construct a sparse solution x using a linear combination of a few column

392

2.

3.

4.

5.

Solving Matrix Equations

vectors (called atoms) of a dictionary matrix A to make a sparse approximation to x to get Ax = b. In each iteration the column vector in the dictionary matrix A that is most similar to the present residual vector r = Ax − b is selected as a new column of the action set. If the residual is decreasing with iteration, the convergence of the MP algorithm can be guaranteed. Orthogonal matching pursuit (OMP) Matching pursuit can construct a residual vector orthogonal to every column vector of the selected dictionary matrix at each iteration, but the residual vector generally is not orthogonal to the previously selected column vectors of the dictionary matrix. The OMP [377], [118], [173] ensures that the residual vector is orthogonal to all the previously selected columns vectors after each iteration, and thus guarantees the optimality of the iteration process, and reduces the number of iterations, making the performance more robust. The complexity of the OMP algorithm is O(mn), and it can produce a coefficient vector with the sparseness K ≤ m/(2 log n). Regularization orthogonal matching pursuit (ROMP) On the basis of the OMP algorithm, the regularization procedure is added [340], [341] by first selecting multiple atoms from the relevant atoms as the candidate set and then selecting a fraction of atoms from the candidate set in accordance with the regularization principle. Finally, these atoms are incorporated into the final support set to achieve a fast and effective choice of atoms. Stagewise orthogonal matching pursuit (StOMP) This is a reduced form of the OMP algorithm [133]. At the expense of approximation accuracy, it further improves the computation speed and has complexity O(n), which is more suitable for solving large-scale sparse approximation problems. Compression sampling matching pursuit (CoSaMP) This is an improvement of ROMP [342]. As compared with the OMP algorithm, the approximate accuracy of the CoSaMP algorithm is higher, its complexity O(n log2 n) is lower and the sparseness of the sparse solution is K ≤ m/(2 log(1 + n/K)). Algorithm 6.8 shows the orthogonal matching pursuit algorithm. The following are three different stopping criteria [472].

(1) Stop running after a fixed number of iterations. (2) Stop running when the residual energy (the sum of the squared amplitudes of the residual vector’s components) rk 2 is less than a given value , i.e.,

rk 2 ≤ . (3) Stop running when no column of A has any significant residual energy rk , i.e.,

AH rk ∞ ≤ . A subspace pursuit algorithm for compressed-sensing signal reconstruction is shown in Algorithm 6.9. At the kth iteration, the OMP, stagewise OMP and regular OMP methods combine the index of the new candidate column of the dictionary matrix A with the

6.9 Sparse Matrix Equation Solving: Optimization Algorithms Algorithm 6.8

393

Orthogonal matching pursuit algorithm [377], [118], [69]

task: Approximate the solution of (P0 ): argmin x0 subject to Ax = b. x

input: A ∈ Rm×n , b ∈ Rm and the error threshold . initialization: Initial solution x0 = 0, initial residual r0 = Ax0 − b = b, initial solution support Ω0 = support(x0 ) = ∅. Put k = 1. repeat 1. Sweep: use zj = aTj rk−1 /a2j to compute (j) = min aj zj − rk−1 22 , ∀j. zj

/ Ωk−1 , and 2. Update support: find a minimizer j0 of (j) satisfying (j0 ) ≤ (j), ∀ j ∈ update Ωk = Ωk−1 ∪ j0 . −1 H 3. Update provisional solution: compute xk = (AH AΩk b. Ωk AΩk ) 4. Update residual: calculate rk = b − AΩk xk . 5. exit if the stopping criterion is satisfied. return k ← k + 1.  xk (i), i ∈ Ωk , output: The sparse coefficient vector x(i) = 0, i∈ / Ωk .

Algorithm 6.9

Subspace pursuit algorithm [115]

task: Approximate the solution of (P0 ): argmin x0 subject to Ax = b. x

input: Sparsity K, A ∈ Rm×n , b ∈ Rm . initialization: 1. Ω0 = {K| indexes corresponding to the largest-magnitude entries in AT b}. 2. r0 = b − AΩ0 A†Ω0 b. repeat ˜ k = Ωk−1 & {K| indexes corresponding to the largest magnitude entries in the 1. Ω vector ATΩk−1 rk−1 }. 2. Compute xp = A†Ω˜ b. k

3. Ωk = {K| indexes corresponding to the largest magnitude entries in xp }. 4. rk = b − AΩk A†Ωk b. 5. exit if rk 2 ≤ rk−1 2 . 6. Let Ωk = Ωk−1 . return k ← k + 1. output: x = xp .

index set at the (k − 1)th iteration Ωk−1 . Once a candidate is selected, it will remain in the list of selected columns until the algorithm ends. Unlike these methods, for a K-sparse signal, Algorithm 6.9 keeps the index set of K candidate columns unchanged and allows the candidate column to be continuously updated during the iteration process.

394

Solving Matrix Equations

6.9.2 First-Order Augmented Lagrangian Algorithm In practice, solving the LP problem (P1 ) is hard, because the matrix A is large and dense and an LP problem is often ill-conditioned. In typical compressed sensing (CS) applications, the dimension of A is large (n ≈ 106 ), hence general interiorpoint methods are not practical for solving (P1 ) in CS applications owing to the need to factorize an m × n matrix A. However, the LP problem (P1 ) can be efficiently solved in the Lagrangian form (see e.g., [154], [194], [20]) " # 1 2 (6.9.1) min λ x 1 + Ax − b 2 x∈Rn 2 with penalty parameter λ , 0. An augmented Lagrange function for the above LP problem is given by [20] 1 J(x) = λ x 1 − λθ T (Ax − b) + Ax − b 22 , 2

(6.9.2)

where θ is a Lagrange multiplier vector for the constraints Ax = b. Algorithm 6.10 shows the first-order augmented Lagrangian (FAL) algorithm. This algorithm solves the LP problem by inexactly solving a sequence of optimization problems of the form " # 1 T 2 min λk x 1 − λk θk (Ax − b) + Ax − b 2 , (6.9.3) 2 x∈Rn | x 1 ≤ηk for an appropriately chosen sequence of {λk , θk , ηk }, k = 0, 1, . . . The APG stopping criterion APGstop is given by step 8 in Algorithm 6.10, and the FAL stopping criterion FALstop is given by 2 2  FALstop = 2u(l) − u(l−1) 2∞ ≤ γ (6.9.4) for noiseless measurements or

2 (l) 2 2u − u(l−1) 2 2 2 2 ≤γ FALstop = 2u(l−1) 2 2

(6.9.5)

for noisy measurements, where γ is the threshold value. Algorithm FAL produces xsol = u(l) when FALstop is true. The APG function used in step 9 in Algorithm 6.10 is given by Algorithm 6.11.

6.9.3 Barzilai–Borwein Gradient Projection Algorithm Consider the standard bound-constrained quadratic program (BCQP) problem " # 1 T T min q(x) = x Ax − b x subject to v ≤ x ≤ u. (6.9.6) z 2

6.9 Sparse Matrix Equation Solving: Optimization Algorithms Algorithm 6.10

395

FAL ({λk , k , τk }k∈Z+ , η) [20]

  task: Find x = argmin λk x1 − λk θkT (Ax − b) + 12 Ax − b22 . x

input: x0 , L = σmax (AAT ). initialization: θ1 = 0, k = 0. 1. while (FALSTOP not true) do 2.

k ← k + 1.

3.

pk = λk x1 , fk (x) = 12 Ax − b − λk θk 22 .

4.

hk (x) = 12 x − xk−1 22 .

5.

ηk ← η + (λk /2)θk 22 .

6.

Fk = {x ∈ Rn | x1 ≤ ηk }.

7.

k,max ← σmax (A)(ηk + xk−1 2 )

8.

APGSTOP = { ≥ k,max } or {∃ g ∈ ∂Pk (x)|v with g2 ≤ τk }.

9.

xk ← APG (pk , fk , L, Fk , xk−1 , hk , APGSTOP).

10.



2/k .

θk+1 ← θk − (Axk − b)/λk .

11. end while 12. output: x ← xk .

Algorithm 6.11

APG (pk , fk , L, Fk , xk−1 , hk , APGSTOP) [20]

initialization: u0 = x0 , w0 ← argminx∈p h(x), θ0 ← 1, l ← 0. 1. while (APGSTOP not true) do

3.

vl ← (1 − θl )ul + θl wl .      l −1 wl+1 ← argmin p(z) + [∇f (vl )]T z + (L/c)h(z)z ∈ F . i=0 θi

4.

ˆ l+1 ← (1 − θl )ul + θl wl+1 . u

5.

7.

Hl (x) ← p(x) + [∇f (vl )]T x + x − vl 22 . 2  ul+1 ← argmin{Hl (x)x ∈ F }.

1  4 θl+1 ← θl − 4θl2 − θl2 .

8.

l ← l + 1.

2.

6.

L

2

9. end while 10. return ul or vl depending on APGSTOP.

Gradient projection (GP) or projected gradient methods provide an alternative way of solving large-scale BCQP problems.

396

Solving Matrix Equations

Let Ω be the feasible set of (6.9.6) such that    Ω = x ∈ R n v ≤ x ≤ u ,

(6.9.7)

and let P denote the projection operator onto Ω, that is P(x) = mid(v, x, u),

(6.9.8)

where mid(v, x, u) is the vector whose ith component is the median of the set {vi , xi , ui }. Suppose that a feasible point xk is generated, the gradient projection method computes the next point as [114] xk+1 = P(xk − αk gk ),

(6.9.9)

where αk > 0 is some step size and gk = Axk − b. Barzilai and Borwein [26] proposed a method for greatly improving the effectiveness of gradient projection methods by appropriate choices of the step size αk . Two useful choices of the step size were given in [26]: αkBB1 =

sTk−1 sk−1 , sTk−1 yk−1

αkBB2 =

sTk−1 yk−1 , T y yk−1 k−1

(6.9.10)

where sk−1 = xk − xk−1 and yk−1 = gk − gk−1 . Gradient projection using the Barzilai–Borwein step size is called the Barzilai– Borwein gradient projection (BBGP) method. One of its features is that it is nonmonotonic, i.e., q(xk ) may increase on some iterations. Nevertheless, the BBGP method is simple and easy to implement and avoids any matrix factorization. Interestingly, the convex unconstrained optimization problem " # 1 (6.9.11)

b − Ax 22 + τ x 1 , x ∈ Rn , A ∈ Rm×n , b ∈ Rm min x 2 can be transformed into a bound-constrained quadratic program (BCQP) problem. To this end, define ui = (xi )+ and vi = (−xi )+ for all i = 1, . . . , n, where (x)+ = max{x, 0} denotes the positive-part operator. Then, the coefficient vector x can be split into a difference between nonnegative vectors: x = u − v,

u ≥ 0,

v ≥ 0.

(6.9.12)

The above equation can also be written as v ≤ x ≤ u, where, for vectors a, b, a ≤ b means ai ≤ bi , ∀ i. Obviously, x 1 = 1Tn u + 1Tn v, where 1n = [1, . . . , 1]T is an n × 1 summing vector consisting of n ones. By setting x = u − v, the convex unconstrained optimization problem (6.9.11) can be rewritten as the following BCQP problem [154]: " # 1 T T min c z + z Bz ≡ F (z) subject to z ≥ 0. (6.9.13) z 2

6.9 Sparse Matrix Equation Solving: Optimization Algorithms

397

Here   u z= , v



 −AT b , c = τ 12n + AT b



 AT A −AT A B= . −AT A AT A

(6.9.14)

Algorithm 6.12 shows the GPSR-BB algorithm (Barzilai–Borwein gradient projection for sparse reconstruction), which was proposed in [154]. Algorithm 6.12 GPSR-BB algorithm [154]     task: Solve min 12 b − Ax22 + τ x1 or min cT z + 12 zT Bz ≡ F (z) . x

z≥0

input: B and F (z). initialization: Given z0 , choose αmin , αmax , α0 ∈ [αmin , αmax ], and set k = 0. repeat 1. Compute δk = (zk − αk ∇F (zk ))+ − zk . 2. Compute γk = δkT Bδk . 3. Line search: If γk = 0, let λk = 1, otherwise   λk = mid 0, δkT ∇F (zk )/γk , 1 . 4. Update zk+1 = zk + λk δk . 5. exit if zk+1 − (zk+1 − α∇F ¯ (zk+1 ))+ 2 ≤ tol, where tol is a small parameter and α ¯ is a positive constant. 6. if γk = 0, let αk+1 = αmax , otherwise   αk+1 = mid αmin , δk 22 /γk , αmax . 7. return k ← k + 1. output: z = zk+1 .

6.9.4 ADMM Algorithms for Lasso Problems The alternating direction method of multipliers (ADMM) form of the Lasso problem  min 12 Ax − b 22 + λ x 1 is given by " # 1 2 min subject to x − z = 0. (6.9.15)

Ax − b 2 + λ z 1 2 The corresponding ADMM algorithm is [52]: xk+1 = (AT A + ρI)−1 (AT b + ρzk − yk ), k+1

= Sλ/ρ (x

k+1

k

z y

k+1

= y + ρ(x

k

+ y /ρ),

k+1

−z

k+1

(6.9.16) (6.9.17)

),

(6.9.18)

398

Solving Matrix Equations

where the soft thresholding operator S is defined as ⎧ ⎪ ⎨ x − λ/ρ, x > λ/ρ, Sλ/ρ (x) = 0, |x| ≤ λ/ρ, ⎪ ⎩ x + λ/ρ, x < λ/ρ. The Lasso problem can be generalized to " # 1 min

Ax − b 22 + λ Fx 1 , 2

(6.9.19)

(6.9.20)

where F is an arbitrary linear transformation. The above problem is called the generalized Lasso problem. The ADMM form of the generalized Lasso problem can be written as " # 1 min subject to Fx − z = 0. (6.9.21)

Ax − b 22 + λ z 1 2 The corresponding ADMM algorithm can be described as [52] xk+1 = (AT A + ρFT F)−1 (AT b + ρFT zk − yk ), zk+1 = Sλ/ρ (Fx y

k+1

k

k+1

= y + ρ(Fx

+ yk /ρ),

k+1

−z

k+1

(6.9.22) (6.9.23)

).

Consider the group Lasso problem [537] + 0 n  1 2 min

xi 2

Ax − b 2 + λ 2 i=1

(6.9.24)

(6.9.25)

N with feature groups Ax = i=1 Ai xi , where Ai ∈ Rm×ni , xi ∈ Rni , i = 1, . . . , N N and i=1 ni = n. Another difference from the generalized Lasso is the Euclideannorm (not squared) regularization of x. The group Lasso is also called the sum-ofnorms regularization [356]. N Put Axk = N −1 i=1 Ai xki . The ADMM algorithm for the group Lasso is given by [52] 3ρ 4 k k k k 2 ¯ xk+1 (6.9.26) = arg min x − A x − z + Ax + y

+ λ x

A i i i i i 2 , 2 i 2 xi 1 ¯k+1 = (6.9.27) (b + ρAxk+1 + ρyk ), z N +ρ ¯k+1 ). yk+1 = yk + Axk+1 − z (6.9.28) 6.9.5 LARS Algorithms for Lasso Problems An efficient approach for solving Lasso problems is the least angle regressive (LARS) algorithm [143]. Algorithm 6.13 shows the LARS algorithm with Lasso modification.

6.9 Sparse Matrix Equation Solving: Optimization Algorithms Algorithm 6.13

399

LARS algorithm with Lasso modification [143]

given: The data vector b ∈ Rm and the input matrix A ∈ Rm×n . ˆ = 0 and AΩ = A. Put k = 1. initialization: Ω0 = ∅, b 0

repeat ˆ k−1 ). ˆk = ATΩk−1 (b − b 1. Compute the correlation vector c

ck (j)| = C} with 2. Update the active set Ωk = Ωk−1 ∪ {j (k) | |ˆ ck (n)|}. C = max{|ˆ ck (1)|, . . . , |ˆ ck (j)). 3. Update the input matrix AΩk = [sj aj , j ∈ Ωk ], where sj = sign(ˆ 4. Find the direction of the current minimum angle GΩk = ATΩk AΩk ∈ Rk×k , αΩk = (1Tk GΩk 1k )−1/2 ,

k wΩk = αΩk G−1 Ωk 1k ∈ R ,

μk = A Ω k w Ω k ∈ R k . 5. Compute b = ATΩk μk = [b1 , . . . , bm ]T and estimate the coefficient vector T ˆ k = (ATΩk AΩk )−1 ATΩk = G−1 x Ωk AΩk .  +  x C − cˆk (j) C + cˆk (j) , , γ˜ = min − wj 6. Compute γˆ = minc j∈Ωk

αΩ − b j k

αΩ + bj k

j∈Ωk

j

+

, where wj is the jth

of entry wΩk = [w1 , . . . , wn ]T and min{·}+ denotes the positive minimum term. If

there is not positive term then min{·}+ = ∞.

7. If γ˜ < γˆ then the fitted vector and the active set are modified as follows: ˆk = b ˆ k−1 + γ˜ μk , Ωk = Ωk − {˜j}, b where the removed index ˜j is the index j ∈ Ωk such that γ˜ is a minimum. ˆ k and Ωk are modified as follows: Conversely, if γˆ < γ˜ then b ˆk = b ˆ k−1 + γˆ μk , Ωk = Ωk ∪ {ˆj}, b where the added index ˆj is the index j ∈ Ωk such that γˆ is a minimum. 8. exit If some stopping criterion is satisfied. 9. return k ← k + 1. ˆk. output: The coefficient vector x = x

The LARS algorithm is a stepwise regression method, and its basic idea is this: while ensuring that the current residual is equal to the selected correlation coefficient, select as the solution path the projection of the current residual onto the space constituted by the selected variables. Then, continue to search on this solution path, absorbing the new added variable, and adjusting the solution path. Suppose that Ωk is the active set of variables at the beginning of the kth iteration step, and let xΩk be the coefficient vector for these variables at this step. Letting the initial active set Ω0 = ∅, the initial value of the residual vector is

400

Solving Matrix Equations

given by r = b. At the first iteration find the column (i.e., the predictor) of the (1) input matrix A that is correlated with the residual vector r = b; denote it ai , and enter its index in the active set Ω1 . Thus, one obtains the first regression-variable set AΩ1 . The basic step of the LARS algorithm is as follows: suppose that after k−1 LARS steps the regression variable set is AΩk−1 . Then, one has the vector expression  −1/2 T wΩk−1 = 1TΩk−1 (ATΩk−1 AΩk−1 )−1 1Ωk−1 (AΩk−1 AΩk−1 )−1 1Ωk−1 .

(6.9.29)

Hence AΩk−1 wΩk−1 is the solution path of the LARS algorithm in the current regression variable set Ωk−1 , and wΩk−1 is the path for which x continues to search. The core idea of the LARS algorithm with Lasso modification, as shown in Algorithm 6.13, is “one at a time”: at the every iteration step, one must add or remove a variable. Specifically, from the existing regression variable set and the current residuals, a solution path is determined. The maximum possible step forward on this path is denoted by γˆ and the maximum step needed to find a new variable is denoted by γ˜ . If γ˜ < γˆ then the new variable xj (γ) corresponding to the LARS estimate is not a Lasso estimate, and hence this variable should be deleted from the regression variable set. Conversely, if γ˜ > γˆ then the new estimate xj (γ) corresponding to the LARS estimate should become a new Lasso estimate, and hence this variable should be added to the regression set. After deleting or adding a variable, one should stop searching on the original solution path. Then, one recomputes the correlation coefficient of the current residual and current new variable set to determine a new solution path, and continues the “one at a time” form of the LARS iteration steps. This process is repeated until all Lasso estimates are obtained by using the LARS algorithm.

6.9.6 Covariance Graphical Lasso Method The estimation of a covariance matrix via a sample of vectors drawn from a multivariate Gaussian distribution is among the most fundamental problems in statistics [48]. Suppose n samples of p normal random vectors are given by x1 , . . . , xn ∼ Np (0, Σ). The log likelihood is [48] L(Σ) = −

n np n − log det Σ − tr(Σ−1 S), 2 2 2

n where S = n−1 i=1 xi xTi is the sample covariance matrix. This approximates the true covariance matrix Σ = [σij ] with σij = E{xTi xj }. Bien and Tibshirani [48] in 2011 proposed a covariance graphical Lasso method using a Lasso penalty on the elements of the covariance matrix. The basic version of the covariance graphical Lasso problem minimizes the ob-

6.9 Sparse Matrix Equation Solving: Optimization Algorithms

401

jective function [48], [498]:   min g(Σ) = min ρ Σ 1 + log(det Σ) + tr(SΣ−1 ) Σ

Σ

(6.9.30)

over the space of positive definite matrices M+ with shrinkage parameter ρ ≥ 0. Then, covariance graphical Lasso estimation is used to solve the optimization problem (6.9.30). The covariance graphical Lasso method is particularly useful owing to the following two facts [498]: (1) by using the 1 -norm term, the covariance graphical Lasso is able to set some offdiagonal elements of Σ exactly equal to zero at the minimum point of (6.9.30); (2) the zeros in Σ encode the marginal independence structures among the components of a multivariate normal random vector with covariance matrix Σ. However, the objective function is not convex, which makes the optimization challenging. To address this challenge, Bien and Tibshirani [48] proposed a majorize– minimize approach to minimize (6.9.30) approximately, and Wang [498] in 2014 proposed a coordinate descent algorithm for covariance graphical Lasso. In comparison with the majorize–minimize algorithm of Bien and Tibshirani, Wang’s coordinate descent algorithm is simpler to implement, substantially faster to run and numerically more stable as shown in experimental results. The idea of coordinate descent is simple: to update one column and one row of Σ at a time while holding all the rest of the elements in Σ fixed. For example, to update the last column and row, Σ and S are partitioned as follows:     Σ11 σ12 S11 s12 Σ= , S = , (6.9.31) T σ12 σ22 sT12 s22 where Σ11 and S11 are respectively the covariance matrix and the sample covariance matrix of the first p − 1 variables, and σ12 and s12 are respectively the covariance and the sample covariance between the first p − 1 variables and the last variable while σ22 and s22 respectively are the variance and the sample variance of the last variable. T Σ−1 Letting β = σ12 and γ = σ22 − σ12 11 σ12 , we have   −1 T −1 −1 −1 −1 T −1 + Σ ββ Σ γ −Σ β γ Σ 11 11 11 11 Σ−1 = . (6.9.32) −1 −β T Σ−1 γ −1 11 γ Therefore, the three terms in (6.9.30) can be represented as follows [498]: β Σ 1 = 2ρ β 1 + ρ(β T Σ−1 11 β + γ) + c1 , log(det Σ) = log γ + c2 , −1 −1 −1 tr(SΣ−1 ) = β T Σ−1 − 2sT12 Σ−1 + s22 γ −1 + c3 . 11 SΣ11 βγ 11 βγ

402

Solving Matrix Equations

Dropping the constants c1 , c2 , c3 , we get the objection function in (6.9.30): −1 T −1 −1 f (β, γ) = 2ρ β 1 + ρ(β T Σ−1 11 β + γ) + β Σ11 SΣ11 βγ −1 − 2sT12 Σ−1 + s22 γ −1 + log γ. 11 βγ

(6.9.33)

This objective function of the covariance graphical Lasso method can be written as two separated objective functions f1 (γ) = log γ + aγ −1 + ργ,

(6.9.34)

f2 (β) = 2ρ β 1 + β Vβ − 2u β, T

T

(6.9.35)

−1 −1 −1 T −1 −1 where a = β T Σ−1 Σ11 SΣ−1 11 SΣ11 β − 2s12 Σ11 β + s22 , V = γ 11 + ρΣ11 and u = −1 −1 γ Σ11 s12 . From (6.9.34) it can be seen that " a, ρ = 0, (6.9.36) γ = arg min f1 (γ) =    γ − 1 + 1 + 4aρ /(2ρ), ρ = 0.

Equation (6.9.35) shows that f2 (β) is still a Lasso problem and can be efficiently solved by coordinate descent algorithms. For j ∈ {1, . . . , p − 1}, the minimum point of (6.9.35) along the coordinate direction in which βj varies is given by  > p−1  v βˆ , ρ v , j = 1, . . . , p − 1 (6.9.37) βˆ = soft u − j

j

kj k

jj

k=1,k=j

where soft(x, ρ) = sign(x)(|x| − ρ)+ is the soft thresholding operator. A coordinate descent algorithm for covariance graphical Lasso is proposed in [498] based on the above results, as shown in Algorithm 6.14.

6.9.7 Homotopy Algorithm In topology, the concept of homotopy describes a “continuous change” between two objects. The homotopy algorithm is a kind of searching algorithm that starts from a simple solution and uses iterative calculation to find the desired complex solution. Therefore, the key of a homotopy algorithm is to determine its initial simple solution appropriately. Consider the relationship between the 1 -norm minimization problem (P1 ) and the unconstrained 2 -minimization problem (P12 ). Suppose that there is a corresponding unique solution xλ for every minimization problem (P12 ) with λ ∈ [0, ∞). Then the set {xλ |λ ∈ [0, ∞)} determines a solution path and has xλ = 0 for a ˜ λ of (P12 ) converges to the sufficiently large λ; and when λ → 0, the solution x solution of the 1 -norm minimization problem (P1 ). Hence, xλ = 0 is the initial solution of the homotopy algorithm for solving the minimization problem (P1 ). The homotopy algorithm for solving the unconstrained 2 -norm minimization

6.9 Sparse Matrix Equation Solving: Optimization Algorithms Algorithm 6.14

403

Coordinate descent algorithm [498]

task: Find Σ = argmin{ρΣ1 + log(det Σ) + tr(SΣ−1 )}. Σ

input: S = Y T Y, ρ.





initialization: Given Σ0 = S and partition S = ⎣

S11

s12

sT12

s22

repeat





1. Σk+1 = Σk , and Partition Σk+1 = ⎣

Σ11

σ12

T σ12

σ22

⎦. Put k = 0.

⎦.

2. Put β = σ12 . −1 T SΣ−1 3. a = β T Σ−1 11 β − 2s12 Σ11 β + s22 .  11 a, ρ = 0,  4. γ = (−1 + 1 + 4aρ)/(2ρ), ρ = 0. −1 −1 5. V = γ −1 Σ−1 11 SΣ11 + ρΣ11 .

6. u = γ −1 Σ−1 11 s12 .

'  vjj , j = 1, . . . , p. 7. Compute βˆj = soft uj − k=j vkj βˆk , ρ ⎡ ⎤−1 Σ−1 + Σ−1 ββ T Σ−1 γ −1 −Σ−1 βγ −1 11 11 11 11 ⎦ . 8. Calculate Σk = ⎣ −1 −β T Σ−1 γ −1 11 γ 9. exit if Σk+1 − Σk 2 ≤ . 10. return k ← k + 1. output: Σk .

problem (P12 ) starts from the initial value x0 = 0, and runs in an iteration form to compute the solutions xk at the kth step for k = 1, 2, . . . In the whole calculation, the following active set is kept unchanged:    (6.9.38) I = j  |ck (j)| = ck ∞ = λ . Algorithm 6.15 shows the homotopy algorithm for solving the 1 -norm minimization problem. With the reduction of λ, the objective function of (P11 ) will go through a homotopy process from the 1 -norm constraint to the 2 -norm objective function. It has been shown [143], [363], [132] that the homotopy algorithm is an efficient solver for the 1 -norm minimization problem (P1 ).

6.9.8 Bregman Iteration Algorithms The sparse optimization models for solving the matrix equation Au = b discussed above can be summarized as follows:

404

Solving Matrix Equations

Algorithm 6.15

Homotopy algorithm [132]

input: Observed vector b ∈ Rm , input matrix A and parameter λ. initialization: x0 = 0, c0 = AT b. repeat 1. Use (6.9.38) to form the active set I, and put ck (I) = [ck (i), i ∈ I] and AI = [ai , i ∈ I]. 2. Compute the residual correlation vector ck (I) = ATI (b − AI xk ). 3. Solve the matrix equation ATI AI dk (I) = sign(ck (I)) for dk (I). 4. Compute



 λ − ck (i) λ + ck (i) , , i∈I 1 − φTi vk 1 + φTi vk   x (i) . γk− = min − k i∈I dk (i) 5. Determine the breakpoint γk = min{γk+ , γk− }. γk+ = minc +

6. Update the solution vector xk = xk−1 + γk dk . 7. exit if xk ∞ = 0. 8. return k ← k + 1. output: x = xk .

(1) 0 norm minimization model min u 0 subject to Au = b; u

(2) basis-pursuit (BP)/compressed sensing model min u 1 subject to Au = b; u

(3) basis-pursuit de-noising model min u 1 subject to Au − b 2 < ; u

(4) Lasso model min Au − b 2 subject to u 1 ≤ s. u

Among these models, the BP model is a relaxed form of the 0 -norm minimization model, and the Lasso model is a linear prediction representation equivalent to the basis pursuit de-noising model. Now, consider the general form of the above four optimization models: uk+1 = arg min {J(u) + λH(u)} , u

(6.9.39)

where J : X → R and H : X → R are nonnegative convex functions (X is a closed convex set), but J is a nonsmooth, while H is differentiable. Figuratively speaking, a function is of the bounded-variation type if the oscillation of its graphics (i.e. its swing or variation) is manageable or “tamed” in a certain range. The bounded-variation norm of a vector u, denoted u BV , is defined as [8]  |∇u|dx, (6.9.40)

u BV = u 1 + J0 (u) = u 1 + Ω

6.9 Sparse Matrix Equation Solving: Optimization Algorithms

405

where J0 (u) denotes the total variation of u. Let J(u) = u BV and H(u) = 12 u − f 22 . Then the optimization model (6.9.39) becomes a total variation or Rudin–Osher–Fatemi (ROF) de-noising model [420]: " # λ (6.9.41) uk+1 = arg min u BV + u − f 22 . 2 u A well-known iteration method for solving the optimization problem (6.9.39) is Bregman iteration, based on the Bregman distance [59]. DEFINITION 6.10 Let J(u) be a convex function, let the vectors u, v ∈ X and let g ∈ ∂J(v) be the subgradient vector of the function J at the point v. The Bregman distance between the points u and v, denoted DJg (u, v), is defined as DJg (u, v) = J(u) − J(v) − g, u − v.

(6.9.42)

The Bregman distance is not a distance in traditional sense, because DJg (u, v) = However, it has the following properties that make it an efficient tool for solving the 1 -norm regularization problem: DJg (v, u).

(1) For all u, v ∈ X and g ∈ ∂J(v), the Bregman distance is nonnegative, i.e., DJg (u, v) ≥ 0. (2) The Bregman distance for the same point is zero, i.e., DJg (v, v) = 0. (3) The Bregman distance can measure the closeness of two points u and v since DJg (u, v) ≥ DJg (w, v) for any point w on the line connecting u and v. Consider the first-order Taylor series approximation of the nonsmooth function J(u) at the kth iteration point uk , denoted J(u) = J(uk ) + gk , u − uk . The approximate error is measured by the Bregman distance DJgk (u, uk ) = J(u) − J(uk ) − gk , u − uk .

(6.9.43)

Early in 1965, Bregman proposed that the unconstrained optimization problem (6.9.39) can be modified as [59]   g uk+1 = arg min DJk (u) + λH(u) u

= arg min {J(u) − gk , u − uk  + λH(u)} . u

(6.9.44)

This is the well-known Bregman iteration. In what follows we give the Bregman iteration algorithm and its generalization. The objective function of the Bregman iterative optimization problem is denoted by L(u) = J(u)−gk , u−uk +λH(u). By the stationary-point condition 0 ∈ ∂L(u) it follows that 0 ∈ ∂J(u) − gk + λ∇H(u). Hence, at the (k + 1)th iterative point uk+1 one has gk+1 = gk − λ∇H(uk+1 ),

gk+1 ∈ ∂J(uk+1 ).

(6.9.45)

406

Solving Matrix Equations

Equations (6.9.44) and (6.9.45) constitute the Bregman iterative algorithm, which was proposed by Osher et al. in 2005 [364] for image processing. THEOREM 6.14 [364] Suppose that J and H are convex functions and that H is differentiable. If a solution to Equation (6.9.44) exists then the following convergence results are true. • The function H is monotonically decreasing in the whole iterative process, namely H(uk+1 ) ≤ H(uk ). • The function H will converge to the optimal solution H(u ), since H(uk ) ≤ H(u ) + J(u )/k. The Bregman iterative algorithm has two versions [532]: Version 1 k = 0, u0 = 0, g0 = 0. Iterations: " # 1 g uk+1 = arg min DJk (u, uk ) + Au − b 22 , 2 u gk+1 = gk − AT (Auk+1 − b). If uk does not converge then let k ← k + 1 and return to the iteration. Version 2 k = 0, b0 = 0, u0 = 0. Iterations: bk+1 = b + (bk − Auk ). " # 1 uk+1 = arg min J(u) + Au − bk+1 22 . 2 u If uk does not converge then let k ← k + 1 and return to the iteration. It has been shown [532] that the above two versions are equivalent. The Bregman iterative algorithm provides an efficient tool for optimization, but, g at every step, one must minimize the objective function DJk (u, uk )+H(u). In order to improve the computational efficiency of the Bregman iterative algorithm, Yin et al. [532] proposed a linearized Bregman iterative algorithm. The basic idea of the linearized Bregman iteration is as follows: for the Bregman iteration, use a first-order Taylor series expansion to linearize the nonlinear function H(u) to H(u) = H(uk )+∇H(uk ), u−uk  at the point uk . Then, the optimization problem (6.9.39) with λ = 1 becomes   g uk+1 = arg min DJk (u, uk ) + H(uk ) + ∇H(uk ), u − uk  . u

Note that the first-order Taylor series expansion is exact only for the neighborhood of u at the point uk , and the additive constant term H(uk ) can be made negative

6.9 Sparse Matrix Equation Solving: Optimization Algorithms

407

in the optimization with respect to u, so a more exact expression of the above optimization problem is " # 1 gk 2 uk+1 = arg min DJ (u, uk ) + ∇H(uk ), u − uk  + u − uk 2 . (6.9.46) 2δ u Importantly, the above equation can be equivalently written as " # 1 g uk+1 = arg min DJk (u, uk ) + u − (uk − δ∇H(uk )) 22 , 2δ u

(6.9.47)

because (6.9.46) and (6.9.47) differ by only a constant term that is independent of u. In particular, if H(u) = 12 Au − b 22 then from ∇H(u) = AT (Au − b) one can write (6.9.47) as " #  2 1 2 g 2u − uk − δAT (Auk − b) 22 . uk+1 = arg min DJk (u, uk ) + (6.9.48) 2 2δ u Consider the objective function in (6.9.48) as L(u) = J(u) − J(uk ) − gk , u − uk  +

 2 1 2 2u − uk − δAT (Auk − b) 22 . 2 2δ

By the subdifferential stationary-point condition 0 ∈ ∂L(u) it is known that *  1) 0 ∈ ∂J(u) − gk + u − uk − δAT (Auk − b) . δ Denoting gk+1 ∈ ∂J(uk+1 ), from the above equation one has [532] gk+1 = gk − AT (Auk − b) − =

k 

AT (b − Aui ) −

i=1

uk+1 − uk = ··· δ

uk+1 . δ

(6.9.49)

Letting vk =

k 

AT (b − Aui ),

(6.9.50)

i=1

one can obtain two important iterative formulas. First, from (6.9.49) and (6.9.50) we get the update formula of the variable u at the kth iteration: uk+1 = δ(vk − gk+1 ).

(6.9.51)

Then, Equation (6.9.50) directly yields an iterative formula for the intermediate variable vk : vk+1 = vk + AT (b − Auk+1 ).

(6.9.52)

408

Solving Matrix Equations

Equations (6.9.51) and (6.9.52) constitute the linearized Bregman iterative algorithm [532] for solving the optimization problem " # 1 2 (6.9.53) min J(u) + Au − b 2 . u 2 In particular, for the optimization problem " # 1 min μ u 1 + Au − b 22 , u 2 we have

⎧ ⎪ ⎪ ⎨{+1}, ∂( u 1 )i = [−1, +1], ⎪ ⎪ ⎩{−1},

(6.9.54)

if ui > 0, (6.9.55)

if ui = 0, if ui < 0.

Hence Equation (6.9.51) can be written in the component form as uk+1 (i) = δ(vk (i) − gk+1 (i)) = δ shrink(vk (i), μ),

i = 1, . . . , n,

(6.9.56)

where ⎧ ⎪ ⎪ ⎨y − α, shrink(y, α) = sign(y) max{|y| − α, 0} = 0, ⎪ ⎪ ⎩y + α,

y ∈ (α, ∞), y ∈ [−α, α], y ∈ (−∞, −α)

is the shrink operator. The above results can be summarized as the linearized Bregman iterative algorithm for solving the basis-pursuit de-noising or total variation de-noising problem [532], see Algorithm 6.16. Consider the optimization problem uk+1 = arg min { Au 1 + H(u)} . u

(6.9.57)

Introducing the intermediate variable z = Au, the unconstrained optimization problem (6.9.39) can be written as the constrained optimization (uk+1 , zk+1 ) = arg min { z 1 + H(u)} u,z

subject to z = Au.

(6.9.58)

By adding an 2 -norm penalty term, this constrained optimization problem becomes the unconstrained optimization problem " # λ 2 (6.9.59) (uk+1 , zk+1 ) = arg min z 1 + H(u) + z − Au 2 . 2 u,z

Exercises

409

Linearized Bregman iterative algorithm [532]

Algorithm 6.16

  task: Solve basis-pursuit/total variation de-noising problem min μu1 + 12 Au − b22 . u

input: Input matrix A ∈ Rm×n and observed vector b ∈ Rm . initialization: k = 0, u0 = 0, v0 = 0. repeat 1. for i = 1, . . . , n do 2. uk+1 (i) = δ shrink(vk (i), μ). 3. end for 4. vk+1 = vk + AT (b − Auk+1 ). 5. exit if uk+1 − uk 2 ≤ . return k ← k + 1. output: uk .

The split Bregman iterative algorithm is given by [177] " # λ uk+1 = arg min H(u) + zk − Au − bk 22 , 2 u " # λ zk+1 = arg min z 1 + z − Auk+1 − bk 22 , 2 z bk+1 = bk + [Auk+1 − zk+1 ].

(6.9.60) (6.9.61) (6.9.62)

The three iterations of the split Bregman iteration algorithm have the following characteristics. (1) The first iteration is a differentiable optimization problem that can be solved by the Gauss–Seidel method. (2) The second iteration can be solved efficiently using the shrink operator. (3) The third iteration is a explicit calculation.

Exercises 6.1

Consider the matrix equation Ax + = x, where is an additive color noise vector satisfying the conditions E{ } = 0 and E{ T } = R. Let R be known, and use the weighting error function Q(x) = T W as the cost function for ˆ WLS . Such a method is called the weighting finding the optimal estimate x least squares method. Show that ˆ WLS = (AT WA)−1 AT Wx, x where the optimal choice of the weighting matrix W is Wopt = R−1 .

410

6.2

Solving Matrix Equations

Given an over-determined linear equation ZTt Xt = ZTt Yt x with Zt ∈ R(t+1)×K is known as the instrumental variable matrix and t + 1 > K. ˆ , find an (a) Letting the estimate of the parameter vector x at time t be x expression for it. This is called the instrumental variable method. (b) Setting       Yt Zt Xt Yt+1 = , Zt+1 = , Xt+1 = , bt+1 zt+1 xt+1 find a recursive computation formula for xt+1 .

6.3

6.4

Consider the matrix equation y = Aθ + e, where e is an error vector. Define the weighting-error squared sum Ew = eH We, where the weighting matrix W is Hermitian positive definite. Find a solution such that Ew is minimized. Such a solution is called the weighted least squares (WLS) solution. Let λ > 0, and let Ax = b is an over-determined matrix equation. Show that the anti-Tikhonov regularized optimization problem # " 1 1 min

Ax − b 22 − λ x 22 2 2

6.5

has optimal solution x = (AH A − λI)−1 AH b. [179] The TLS solution of the linear equation Ax = b can be also expressed as min

b+e∈Range(A+E)

D[E, e]T F ,

E ∈ Rm×n ,

e ∈ Rm ,

where D = Diag(d1 , . . . , dm ) and T = Diag(t1 , . . . , tn+1 ) are nonsingular. Show the following results. (a) If rank(A) < n then the above TLS problem has one solution if and only if b ∈ Range(A). (b) If rank(A) = n, AT D2 b = 0, |tn+1 | Db 2 ≥ σn (DAT1 ), where T1 = Diag(t1 , . . . , tn ) then the TLS problem has no solution. Here σn (C) denotes the nth singular value of the matrix C. 6.6

6.7 6.8

Consider the TLS in the previous problem. Show that if C = D[A, b]T = [A1 , d] and σn (C) > σn+1 (C) then the TLS solution satisfies (AT1 A1 − 2 σn+1 (C)I)x = AT1 d. Given the data points (1, 3), (3, 1), (5, 7), (4, 6), (7, 4), find respectively TLS and LS linear fittings and analyze their squared sum of distances. Prove that the solution of the optimization problem   min tr(AT A) − 2tr(A) subject to AX = O A

ˆ = I − XX† , where O is a null matrix. is given by A

Exercises

6.9

411

Consider the harmonic recovery problem in additive white noise x(n) =

p 

Ai sin(2π fi n + φi ) + e(n),

i=1

where Ai , fi and φi are the amplitude, frequency and phase of the ith harmonic, and e(n) is the additive white noise. The above harmonic process obeys the modified Yule–Walker (MYW) equation Rx (k) +

2p 

ai Rx (k − i) = 0,

∀ k,

i=1

and the harmonic frequency can be recovered from   1 Imzi , i = 1, 2, . . . , p, fi = arctan 2π Rezi where zi is a root in the conjugate root pair (zi , zi∗ ) of the characteristic 2p polynomial A(z) = 1 + i=1 ai z −i . If √ √ x(n) = 20 sin(2π 0.2n) + 2 sin(2π0.213n) + e(n), where e(n) is a Gaussian white noise with mean 0 and variance 1, take n = 1, 2, . . . , 128 and use the LS method and SVD-TLS algorithm to estimate the AR parameters ai of the ARMA model and the harmonic frequencies f1 and f2 . Assume that there are 40 MYW equations, and that p = 2 and p = 3 respectively in the LS method, while the number of unknown parameters is 14 in the TLS algorithm. When using the TLS algorithm, the number of harmonics is determined by the number of leading singular values and then the roots of the characteristic polynomial are computed. Compare the computer simulation results of the LS method and the SVD-TLS algorithm. 6.10 The TLS problem for solving the matrix equation Ax = b can be equivalently represented as min

b+Δb∈Range(A+ΔA)

C[ΔA, Δb]T 2F ,

ΔA ∈ Rm×n ,

Δb ∈ Rm ,

where C = Diag(c1 , . . . , dm ) and T = Diag(t1 , . . . , tn+1 ). Show the following results. (a) If rank(A) < n then the above TLS problem has a solution if and only if b ∈ Range(A). (b) If rank(A) = n, AT Cb = 0 and |tn+1 | Cb 2 ≥ σn (CAT1 ) with T1 = Diag(t1 , . . . , tn ) then the TLS problem has no solutions, σn (CAT1 ) is the nth singular value of the matrix CAT1 .

412

Solving Matrix Equations

6.11 Show the nonnegativity of the KL divergence  I  K   pij DKL (P Q) = − pij + qij pij log qij i=1 j=1 and show that the KL divergence is equal to zero if and only if P = Q. 6.12 Letting DE (X AS) = 12 X − AS 22 , show that ∂DE (X AS) = −(X − AS)ST , ∂A ∂DE (X AS) ∇DE (X AS) = = −AT (X − AS). ∂S ∇DE (X AS) =

7 Eigenanalysis

Focusing on the eigenanalysis of matrices, i.e. analyzing matrices by means of eigenvalues, in this chapter we first discuss the eigenvalue decomposition (EVD) of matrices and then present various generalizations of EVD: generalized eigenvalue decomposition, the Rayleigh quotient, the generalized Rayleigh quotient, quadratic eigenvalue problems and the joint diagonalization of multiple matrices. In order to facilitate readers’ understanding, this chapter will present some typical applications.

7.1 Eigenvalue Problem and Characteristic Equation The eigenvalue problem not only is a very interesting theoretical problem but also has a wide range of applications.

7.1.1 Eigenvalue Problem If L[w] = w holds for any nonzero vector w, then L is called an identity transformation. More generally, if, when a linear operator acts on a vector, the output is a multiple of this vector, then the linear operator is said to have an input reproducing characteristic. This input reproducing has two cases. (1) For any nonzero input vector, the output of a linear operator is always completely the same as the input vector (e.g., it is the identity operator). (2) For some special input vector only, the output vector of a linear operator is the same as the input vector up to a constant factor. DEFINITION 7.1 Suppose that a nonzero vector u is the input of the linear operator L. If the output vector is the same as the input vector u up to a constant factor λ, i.e., L[u] = λu,

u = 0,

(7.1.1)

then the vector u is known as an eigenvector of the linear operator L and the scalar λ is the corresponding eigenvalue of L. 413

414

Eigenanalysis

The most commonly used linear operator in engineering applications is undoubtedly a linear time-invariant system. From the above definition, it is known that if each eigenvector u is regarded as an input of the linear time-invariant system, then the eigenvalue λ associated with eigenvector u is equivalent to the gain of the linear system L with u as the input. Only when the input of the system is an eigenvector u, its output is the same as the input up to a constant factor. Thus an eigenvector can be viewed as describing the characteristics of the system, and hence is also called a characteristic vector. This is a physical explanation of eigenvectors in relation to linear systems. If a linear transformation w = L[x] can be represented as w = Ax, then A is called a standard matrix of the linear transformation. Clearly, if A is a standard matrix of a linear transformation, then its eigenvalue problem (7.1.1) can be written as Au = λu,

u = 0.

(7.1.2)

The scalar λ is known as an eigenvalue of the matrix A, and the vector u as an eigenvector associated with the eigenvalue λ. Equation (7.1.2) is sometimes called the eigenvalue–eigenvector equation. EXAMPLE 7.1 Consider a linear time-invariant system h(k) with transfer function ∞ H(e jω ) = k=−∞ h(k)e−jωk . When inputting a complex exponential or harmonic signal ejωn , the system output is given by ∞ 

L[e jωn ] =

k=−∞

h(n − k)e jωk =

∞ 

h(k)e jω(n−k) = H(e jω )e jωn .

k=−∞

Letting u(ω) = [1, e jω , . . . , e jω(N −1) ]T be an input vector, the system output is given by ⎡ ⎡ ⎤ ⎤ 1 1 ⎢ e jω ⎥ ⎢ e jω ⎥ ⎢ ⎥ ⎥ jω ⎢ = H(e ) L⎢ ⎢ ⎥ ⎥ ⇒ L[u(ω)] = H(e jω )u(ω). .. .. ⎣ ⎣ ⎦ ⎦ . . e jω(N −1)

e jω(N −1)

This shows that the harmonic signal vector u(ω) = [1, e jω , . . . , e jω(N −1) ]T is an eigenvector of the linear time-invariant system, and the system transfer function H(e jω ) is the eigenvalue associated with u(ω). From Equation (7.1.2) it is easily seen that if A ∈ Cn×n is a Hermitian matrix then its eigenvalue λ must be a real number, and A = UΣUH ,

(7.1.3)

where U = [u1 , . . . , un ]T is a unitary matrix, and Σ = Diag(λ1 , . . . , λn ). Equation (7.1.3) is called the eigenvalue decomposition (EVD) of A.

7.1 Eigenvalue Problem and Characteristic Equation

415

Since an eigenvalue λ and its associated eigenvector u often appear in pairs, the two-tuple (λ, u) is called an eigenpair of the matrix A. Although an eigenvalue can take a zero value, an eigenvector cannot be a zero vector. Equation (7.1.2) means that the linear transformation Au does not “change the direction” of the input vector u. Hence the linear transformation Au is a mapping “keeping the direction unchanged”. In order to determine the vector u, we can rewrite (7.1.2) as (A − λI)u = 0.

(7.1.4)

If the above equation is assumed to hold for certain nonzero vectors u, then the only condition for a nonzero solution to Equation (7.1.4) to exist is that, for those vectors u, the determinant of the matrix A − λI is equal to zero, namely det(A − λI) = 0.

(7.1.5)

It should be pointed out that some eigenvalues λ may take the same value. The repetition number of an eigenvalue is said to be the eigenvalue multiplicity. For example, all n eigenvalues of an n × n identity matrix I are equal to 1, so its multiplicity is n. It is easily concluded that if the eigenvalue problem (7.1.4) has a nonzero solution u = 0, then the scalar λ must make the n × n matrix A − λI singular. Hence the eigenvalue problem solving consists of the following two steps: (1) find all scalars λ (eigenvalues) such that the matrix A − λI is singular; (2) given an eigenvalue λ such that the matrix A − λI is singular, find all nonzero vectors u satisfying (A − λI)u = 0; these are the eigenvector(s) corresponding to the eigenvalue λ.

7.1.2 Characteristic Polynomial As discussed above, the matrix (A − λI) is singular if and only if its determinant det(A − λI) = 0, i.e., (A − λI) singular



det(A − λI) = 0,

(7.1.6)

where the matrix A − λI is called the characteristic matrix of A. The determinant   a11 − x a12 ··· a1n    a21 a22 − x · · · a2n   p(x) = det(A − xI) =  .  .. .. ..   .. . . .     a a · · · a − x nn n1 n2 = pn xn + pn−1 xn−1 + · · · + p1 x + p0

(7.1.7)

416

Eigenanalysis

is known as the characteristic polynomial of A, and p(x) = det(A − xI) = 0

(7.1.8)

is said to be the characteristic equation of A. The roots of the characteristic equation det(A − xI) = 0 are known as the eigenvalues, characteristic values, latent values, the characteristic roots or latent roots. Obviously, computing the n eigenvalues λi of an n × n matrix A and finding the n roots of the nth-order characteristic polynomial p(x) = det(A − xI) = 0 are two equivalent problems. An n × n matrix A generates an nth-order characteristic polynomial. Likewise, each nth-order polynomial can also be written as the characteristic polynomial of an n × n matrix. THEOREM 7.1 [32]

Any polynomial p(λ) = λn + a1 λn−1 + · · · + an−1 λ + an

can be written as the characteristic polynomial of the n × n matrix ⎤ ⎡ −a1 −a2 · · · −an−1 −an ⎢ −1 0 ··· 0 0 ⎥ ⎥ ⎢ ⎢ 0 −1 · · · 0 0 ⎥ A=⎢ ⎥, ⎢ . .. .. .. ⎥ .. ⎣ .. . . . . ⎦ 0 0 ··· −1 0 namely p(λ) = det(λI − A).

7.2 Eigenvalues and Eigenvectors In this section we discuss the computation and properties of the eigenvalues and eigenvectors of an n × n matrix A.

7.2.1 Eigenvalues Even if an n × n matrix A is real, the n roots of its characteristic equation may be complex, and the root multiplicity can be arbitrary or even equal to n. These roots are collectively referred to as the eigenvalues of the matrix A. Regarding eigenvalues, it is necessary to introduce the following terminology [422, p. 15]. (1) An eigenvalue λ of a matrix A is said to have algebraic multiplicity μ, if λ occurs μ times as a root of the characteristic polynomial det(A − zI) = 0. (2) If the algebraic multiplicity of an eigenvalue λ is 1, then it is called a single eigenvalue. A nonsingle eigenvalue is said to be a multiple eigenvalue.

7.2 Eigenvalues and Eigenvectors

417

(3) An eigenvalue λ of A is said to have geometric multiplicity γ if the number of linearly independent eigenvectors corresponding to the eigenvalue is γ. In other words, the geometric multiplicity γ is the dimension of the eigenspace Null(A − λI). (4) A matrix A is known as a derogatory matrix if there is at least one of its eigenvalues with geometric multiplicity greater than 1. (5) An eigenvalue is referred to as semi-single, if its algebraic multiplicity is equal to its geometric multiplicity. A semi-single eigenvalue is also called a defective eigenvalue. It is well known that any nth-order polynomial p(x) can be written in the factorized form p(x) = a(x − x1 )(x − x2 ) · · · (x − xn ).

(7.2.1)

The n roots of the characteristic polynomial p(x), denoted x1 , x2 , . . . , xn , are not necessarily different from each other, and also are not necessarily real. In general the eigenvalues of a matrix A are different from each other, but a characteristic polynomial has multiple roots then we say that the matrix A has degenerate eigenvalues. It has already been noted that even if A is a real matrix, its eigenvalues may be complex. Taking the Givens rotation matrix   cos θ − sin θ A= sin θ cos θ as an example, its characteristic equation is   cos θ − λ − sin θ   = (cos θ − λ)2 + sin2 θ = 0. det(A − λI) =  sin θ cos θ − λ However, if θ is not an integer multiple of π then sin2 θ > 0. In this case, the characteristic equation cannot give a real value for λ, i.e., the two eigenvalues of the rotation matrix are complex, and the two corresponding eigenvectors are complex vectors. The eigenvalues of an n × n matrix (not necessarily Hermitian) A have the following properties [214]. 1. An n×n matrix A has a total of n eigenvalues, where multiple eigenvalues count according to their multiplicity. 2. If a nonsymmetric real matrix A has complex eigenvalues and/or complex eigenvectors, then they must appear in the form of a complex conjugate pair. 3. If A is a real symmetric matrix or a Hermitian matrix then its all eigenvalues are real numbers. 4. The eigenvalues of a diagonal matrix and a triangular matrix satisfy the following:

418

Eigenanalysis

• if A = Diag(a11 , . . . , ann ) then its eigenvalues are given by a11 , . . . , ann ; • if A is a triangular matrix then all its diagonal entries are eigenvalues. 5. Given an n × n matrix A: • • • •

if if if if

λ is an eigenvalue of A then λ is also an eigenvalue of AT ; λ is an eigenvalue of A then λ∗ is an eigenvalue of AH ; λ is an eigenvalue of A then λ + σ 2 is an eigenvalue of A + σ 2 I; λ is an eigenvalue of A then 1/λ is an eigenvalue of its inverse matrix A−1 .

6. All eigenvalues of an idempotent matrix A2 = A are 0 or 1. 7. If A is a real orthogonal matrix, then all its eigenvalues are located on the unit circle. 8. The relationship between eigenvalues and the matrix singularity is as follows: • if A is singular then it has at least one zero eigenvalue; • if A is nonsingular then its all eigenvalues are nonzero. 9. The relationship between eigenvalues and trace: the sum of all eigenvalues of A n is equal to its trace, namely i=1 λi = tr(A). 10. A Hermitian matrix A is positive definite (semi-definite) if and only if its all eigenvalues are positive (nonnegative). 11. The relationship between eigenvalues and determinant: the product of all eigen?n values of A is equal to its determinant, namely i=1 λi = det(A) = |A|. 12. If the eigenvalues of A are different from each other then there must be a similar matrix S (see Subsection 7.3.1) such that S−1 AS = D (diagonal matrix); the diagonal entries of D are the eigenvalues of A. 13. The relationship between eigenvalues and rank is as follows: • if an n × n matrix A has r nonzero eigenvalues then rank(A) ≥ r; • if 0 is a single eigenvalue of an n × n matrix A then rank(A) = n − 1; • if rank(A − λI) ≤ n − 1 then λ is an eigenvalue of the matrix A. 14. The geometric multiplicity of any eigenvalue λ of an n × n matrix A cannot be greater than the algebraic multiplicity of λ. 15. The Cayley–Hamilton theorem: if λ1 , λ2 , . . . , λn are the eigenvalues of an n × n matrix A then n ( (A − λi I) = 0. i=1

16. On the eigenvalues of similar matrices: • if λ is an eigenvalue of an n × n matrix A and another n × n matrix B is nonsingular then λ is also an eigenvalue of the matrix B−1 AB; • if λ is an eigenvalue of an n × n matrix A and another n × n matrix B is unitary then λ is also an eigenvalue of the matrix BH AB; • if λ is an eigenvalue of an n × n matrix A and another n × n matrix B is orthogonal then λ is also an eigenvalue of BT AB.

7.2 Eigenvalues and Eigenvectors

419

17. The eigenvalues λi of the correlation matrix R = E{x(t)xH (t)} of a random vector x(t) = [x1 (t), . . . , xn (t)]T are bounded by the maximum power Pmax = maxi E{|xi (t)|2 } and the minimum power Pmin = mini E{|xi (t)|2 } of the signal components, namely (7.2.2) Pmin ≤ λi ≤ Pmax . 18. The eigenvalue spread of the correlation matrix R of a random vector x(t) is λ (7.2.3) X (R) = max . λmin 19. The matrix products Am×n Bn×m and Bn×m Am×n have the same nonzero eigenvalues. 20. If an eigenvalue of a matrix A is λ then the corresponding eigenvalue of the matrix polynomial f (A) = An + c1 An−1 + · · · + cn−1 A + cn I is given by f (λ) = λn + c1 λn−1 + · · · + cn−1 λ + cn .

(7.2.4)

21. If λ is an eigenvalue of a matrix A then eλ is an eigenvalue of the matrix exponential function eA .

7.2.2 Eigenvectors If a matrix An×n is a complex matrix, and λ is one of its eigenvalues then the vector v satisfying (A − λI)v = 0

or Av = λv

(7.2.5)

is called the right eigenvector of the matrix A associated with the eigenvalue λ, while the vector u satisfying uH (A − λI) = 0T

or uH A = λuH

(7.2.6)

is known as the left eigenvector of A associated with the eigenvalue λ. If a matrix A is Hermitian then its all eigenvalues are real, and hence from Equation (7.2.5) it is immediately known that ((A − λI)v)T = vT (A − λI) = 0T , yielding v = u; namely, the right and left eigenvectors of any Hermitian matrix are the same. It is useful to compare the similarities and differences of the SVD and the EVD of a matrix. (1) The SVD is available for any m × n (where m ≥ n or m < n) matrix, while the EVD is available only for square matrices. (2) For an n × n non-Hermitian matrix A, its kth singular value is defined as the spectral norm of the error matrix Ek making the rank of the original matrix A decreased by 1:   : rank(A + E) ≤ k − 1 , k = 1, . . . , min{m, n},

E σk = min spec m×n E∈ C

(7.2.7)

420

Eigenanalysis

while its eigenvalues are defined as the roots of the characteristic polynomial det(A − λI) = 0. There is no inherent relationship between the singular values and the eigenvalues of the same square matrix, but each nonzero singular value of an m × n matrix A is the positive square root of some nonzero eigenvalue of the n × n Hermitian matrix AH A or the m × m Hermitian matrix AAH . (3) The left singular vector ui and the right singular vector vi of an m × n matrix A associated with the singular value σi are defined as the two vectors satisfying H H uH i Avi = σi , while the left and right eigenvectors are defined by u A = λi u and Avi = λi vi , respectively. Hence, for the same n × n non-Hermitian matrix A, there is no inherent relationship between its (left and right) singular vectors and its (left and right) eigenvectors. However, the left singular vector ui and right singular vector vi of a matrix A ∈ Cm×n are respectively the eigenvectors of the m × m Hermitian matrix AAH and of the n × n matrix AH A. From Equation (7.1.2) it is easily seen that after multiplying an eigenvector u of a matrix A by any nonzero scalar μ, then μu is still an eigenvector of A. For convenience, it is generally assumed that eigenvectors have unit norm, i.e., u 2 = 1. Using eigenvectors we can introduce the condition number of any single eigenvalue. DEFINITION 7.2 [422, p. 93] any matrix A is defined as

The condition number of a single eigenvalue λ of

cond(λ) =

1 , cos θ(u, v)

(7.2.8)

where θ(u, v) represents the acute angle between the left and right eigenvectors associated with the eigenvalue λ. DEFINITION 7.3 The set of all eigenvalues λ ∈ C of a matrix A ∈ Cn×n is called the spectrum of the matrix A, denoted λ(A). The spectral radius of a matrix A, denoted ρ(A), is a nonnegative real number and is defined as ρ(A) = max |λ| : λ ∈ λ(A).

(7.2.9)

DEFINITION 7.4 The inertia of a symmetric matrix A ∈ Rn×n , denoted In(A), is defined as the triplet In(A) = (i+ (A), i− (A), i0 (A)), where i+ (A), i− (A) and i0 (A) are respectively the numbers of the positive, negative and zero eigenvalues of A (each multiple eigenvalue is counted according to its multiplicity). Moreover, the quality i+ (A) − i− (A) is known as the signature of A. The following summarize the properties of the eigenpair (λ, u) [214]. 1. If (λ, u) is an eigenpair of a matrix A then (cλ, u) is an eigenpair of the matrix cA, where c is a nonzero constant.

7.2 Eigenvalues and Eigenvectors

421

2. If (λ, u) is an eigenpair of a matrix A then (λ, cu) is also an eigenpair of the matrix A, where c is a nonzero constant. 3. If (λi , ui ) and (λj , uj ) are two eigenpairs of a matrix A, and λi = λj , then the eigenvector ui is linearly independent of uj . 4. The eigenvectors of a Hermitian matrix associated with different eigenvalues are orthogonal to each other, namely uH i uj = 0 for λi = λj . 5. If (λ, u) is an eigenpair of a matrix A then (λk , u) is the eigenpair of the matrix Ak . 6. If (λ, u) is an eigenpair of a matrix A then (eλ , u) is an eigenpair of the matrix exponential function eA . 7. If λ(A) and λ(B) are respectively the eigenvalues of the matrices A and B, and u(A) and u(B) are respectively the eigenvectors associated with λ(A) and λ(B), then λ(A)λ(B) is the eigenpair of the matrix Kronecker product A ⊗ B, and u(A) ⊗ u(B) is the eigenvector associated with the eigenvalue λ(A)λ(B). The SVD of an m × n matrix A can be transformed to the EVD of the corresponding matrix. There are two main methods to achieve this transformation. Method 1 The nonzero singular value σi of a matrix Am×n is the positive square root of the nonzero eigenvalue λi of the m × m matrix AAT or the n × n matrix AT A, and the left singular vector ui and the right singular vector vi of A associated with σi are the eigenvectors of AAT and AT A associated with the nonzero eigenvalue λi . Method 2 The SVD of a matrix Am×n is transformed to the EVD of an (m + n) × (m + n) augmented matrix   O A . (7.2.10) AT O THEOREM 7.2 (Jordan–Wielandt theorem) [455, Theorem I.4.2] If σ1 ≥ σ2 ≥ · · · ≥ σp−1 ≥ σp are the singular values of Am×n (where p = min{m, n}) then the augmented matrix in (7.2.10) has the eigenvalues −σ1 , . . . , −σp , 0, . . . , 0, σp , . . . , σ1 , 6 78 9 |m−n|

and the eigenvectors associated with ±σj are given by   uj , j = 1, 2, . . . , p. ±vj If m = n then we have   uj , n + 1 ≤ j ≤ m, m > n 0

 or

 0 , m + 1 ≤ j ≤ n, m < n. vj

422

Eigenanalysis

On the eigenvalues of the matrix sum A + B, one has the following result. THEOREM 7.3 (Weyl theorem) [275] Let A, B ∈ Cn×n be Hermitian matrices, and let their eigenvalues be arranged in ascending order: λ1 (A) ≤ λ2 (A) ≤ · · · ≤ λn (A), λ1 (B) ≤ λ2 (B) ≤ · · · ≤ λn (B), λ1 (A + B) ≤ λ2 (A + B) ≤ · · · ≤ λn (A + B). Then

⎧ λi (A) + λ1 (B), ⎪ ⎪ ⎪ ⎨ λi−1 (A) + λ2 (B), λi (A + B) ≥ .. ⎪ ⎪ . ⎪ ⎩ λ1 (A) + λi (B), ⎧ λi (A) + λn (B), ⎪ ⎪ ⎪ ⎨ λi+1 (A) + λn−1 (B), λi (A + B) ≤ .. ⎪ ⎪ . ⎪ ⎩ λn (A) + λi (B),

(7.2.11)

(7.2.12)

where i = 1, 2, . . . , n. In particular, when A is a real symmetric matrix and B = αzzT (see below), then one has the following interlacing eigenvalue theorem [179, Theorem 8.1.8]. THEOREM 7.4 Let A ∈ Rn×n be a symmetric matrix with eigenvalues λ1 , . . . , λn satisfying λ1 ≥ λ2 ≥ · · · ≥ λn , and let z ∈ Rn be a vector with norm z = 1. Suppose that α is a real number, and the eigenvalues of the matrix A + αzzT are arranged as ξ1 ≥ ξ2 ≥ · · · ≥ ξn . Then one has ξ 1 ≥ λ1 ≥ ξ 2 ≥ λ2 ≥ · · · ≥ ξ n ≥ λn

(for α > 0)

(7.2.13)

λ 1 ≥ ξ 1 ≥ λ 2 ≥ ξ 2 ≥ · · · ≥ λn ≥ ξ n

(for α < 0),

(7.2.14)

or

and, whether α > 0 or α < 0, the following result is true: n 

(ξi − λi ) = α.

(7.2.15)

i=1

7.3 Similarity Reduction The normalized representation of a matrix is called the canonical form (or normal or standard form) of the matrix. In most fields, a canonical form specifies a unique representation while a normal form simply specifies the form without the

7.3 Similarity Reduction

423

requirement of uniqueness. The simplest canonical form of a matrix is its diagonalized representation. However, many matrices are not diagonalizable. The problem is that in many applications it is necessary to reduce a given matrix to as simple a form as possible. In these applications the similarity reduction of a matrix is a very natural choice. The standard form of the similarity reduction is the Jordan canonical form. Therefore, the core problem of matrix similarity reduction is how to obtain the Jordan canonical form. There are two different ways to find the Jordan canonical form of a matrix. (1) For a constant matrix, use a direct similarity reduction of the matrix itself. (2) First make a balanced reduction of the polynomial matrix corresponding to the original constant matrix, and then transform the Smith normal form of the balanced reduction into the Jordan canonical form of the similarity reduction of the original matrix. In this section we discuss the implementation of the first way, and in the next section we describe similarity reduction based on the second way.

7.3.1 Similarity Transformation of Matrices The theoretical basis and mathematical tools of matrix similarity reduction are provided by the similarity transformation of matrices. Let P ∈ Cm×m be a nonsingular matrix, and use it to make the linear transformation of a matrix A ∈ Cm×m : B = P−1 AP.

(7.3.1)

Suppose that an eigenvalue of the linear transformation B is λ and that a corresponding eigenvector is y, i.e., By = λy.

(7.3.2)

Substitute (7.3.1) into (7.3.2) to get P−1 APy = λy or A(Py) = λ(Py). Letting x = Py or y = P−1 x, we have Ax = λx.

(7.3.3)

Comparing (7.3.2) with (7.3.3), it can be seen that the two matrices A and B = P−1 AP have the same eigenvalues and the eigenvector y is a linear transformation of the eigenvector x of the matrix A, i.e., y = P−1 x. Because the eigenvalues of the two matrices A and B = P−1 AP are the same and their eigenvectors have a linear transformation relationship, the matrices A and B are said to be similar. DEFINITION 7.5 If there is a nonsingular matrix P ∈ Cm×m such that B = P−1 AP then the matrix B ∈ Cm×m is said to be similar to the matrix A ∈ Cm×m and P is known as the similarity transformation matrix.

424

Eigenanalysis

“B is similar to A” is denoted as B ∼ A. Similar matrices have the following basic properties. (1) Reflexivity A ∼ A, i.e., any matrix is similar to itself. (2) Symmetry If A is similar to B, then B is also similar to A. (3) Transitivity If A is similar to B, and B is similar to C, then A is similar to C, i.e., A ∼ C. In addition, similar matrices have also the following important properties. 1. Similar matrices B ∼ A have the same determinant, i.e., |B| = |A|. 2. If the matrix P−1 AP = T (an upper triangle matrix) then the diagonal entries of T give the eigenvalues λi of A. 3. Two similar matrices have exactly the same eigenvalues. 4. For the similar matrix B = P−1 AP we have B2 = P−1 APP−1 AP = P−1 A2 P, and thus Bk = P−1 Ak P. This is to say, if B ∼ A then Bk ∼ Ak . This property is called the power property of similar matrices. 5. If both B = P−1 AP and A are invertible then B−1 = P−1 A−1 P, i.e., when two matrices are similar, their inverse matrices are similar as well. If P is a unitary matrix then B = P−1 AP is called the unitary similarity transformation of A. The following example shows how to perform the diagonalization of an m × m matrix A. EXAMPLE 7.2

Find the similarity transformation of the 3 × 3 real matrix ⎡ ⎤ 1 1 1 A = ⎣ 0 3 3 ⎦. −2 1 1

Solution Direct computation gives the characteristic polynomial    λ−1 −1 −1   |λI − A| =  0 λ−3 −3  = λ(λ − 2)(λ − 3).  2 −1 λ−1  On solving the characteristic equation |λI − A| = 0, the three eigenvalues of A are given by λ = 0, 2, 3. (a) For the eigenvalue λ = 0, we have (0I − A)x = 0, i.e., x1 + x2 + x3 = 0, 3x2 + 3x3 = 0, 2x1 − x2 − x3 = 0,

7.3 Similarity Reduction

425

with solution x1 = 0 and x2 = −x3 , where x3 is arbitrary. Hence, the eigenvector associated with the eigenvalue λ = 0 is ⎡ ⎤ ⎡ ⎤ 0 0 x = ⎣ −a ⎦ = a ⎣ −1 ⎦ , a = 0. a 1 Taking a = 1 yields the eigenvector x1 = [0, −1, 1]T . (b) For λ = 2, the characteristic polynomial is given by (2I − A)x = 0, i.e., x1 − x2 − x3 = 0, x2 + 3x3 = 0, 2x1 − x2 + x3 = 0, whose solution is x1 = −2x3 , x2 = −3x3 , where x3 is arbitrary. Thus we get the eigenvector ⎡ ⎤ ⎡ ⎤ −2a −2 x = ⎣ −3a ⎦ = ⎣ −3 ⎦ , a = 1. a 1 (c) Similarly, the eigenvector associated with the eigenvalue λ = 3 is x3 = [1, 2, 0]T . The three eigenvectors constitute the similarity transformation matrix, which is given, along with its inverse, by ⎡ ⎤ ⎡ ⎤ 0 −2 1 1 −0.5 0.5 −1 P = ⎣ −1 −3 2 ⎦ , P = ⎣ −1 0.5 0.5 ⎦ . 1 1 0 −1 1.0 1.0 Therefore, the diagonalized matrix corresponding to A is given by ⎡ ⎤⎡ ⎤⎡ ⎤ 1 −0.5 0.5 1 1 1 0 −2 1 P−1 AP = ⎣ −1 0.5 0.5 ⎦ ⎣ 0 3 3 ⎦ ⎣ −1 −3 2 ⎦ −1 1.0 1.0 −2 1 1 1 1 0 ⎡ ⎤⎡ ⎤ 1 −0.5 0.5 0 −4 3 ⎣ ⎦ ⎣ = −1 0.5 0.5 0 −6 6 ⎦ −1 1.0 1.0 0 2 0 ⎡ ⎤ 0 0 0 = ⎣ 0 2 0 ⎦, 0

0

3

which is the diagonal matrix consisting of the three different eigenvalues of the matrix A. DEFINITION 7.6 An m × m real matrix A is said to be a diagonalization matrix if it is similar to a diagonal matrix.

426

Eigenanalysis

The following theorem gives the necessary and sufficient condition for the diagonalization of the matrix A ∈ Cm×m . THEOREM 7.5 An m × m matrix A is diagonalizable if and only if A has m linearly independent eigenvectors. The following theorem gives the necessary and sufficient condition for all eigenvectors to be linearly independent, and thus it is the necessary and sufficient condition for matrix diagonalization. THEOREM 7.6 (Diagonalization theorem) [433, p. 307] Given a matrix A ∈ Cm×m whose eigenvalues λk have the algebraic multiplicity dk , k = 1, . . . , p, where p k=1 dk = m. The matrix A has m linearly independent eigenvectors if and only if rank(A − λk I) = m − dk , k = 1, . . . , p. In this case, the matrix U in AU = UΣ is nonsingular, and A is diagonalized as U−1 AU = Σ.

7.3.2 Similarity Reduction of Matrices A nondiagonalizable matrix A ∈ Cm×m with multiple eigenvalues can be reduced to the Jordan canonical form via a similarity transformation. DEFINITION 7.7 Let a matrix A with r = rank(A) have d different nonzero eigenvalues, and let the nonzero eigenvalue λi have multiplicity mi , i.e., m1 + · · · + md = r. If the similarity transformation has the form P−1 AP = J = Diag(J1 , . . . , Jd , 0, . . . , 0), 6 78 9

(7.3.4)

m−r

then J is called the Jordan canonical form of the matrix A and Ji , i = 1, . . . , d, are known as Jordan blocks. A Jordan block corresponding to an eigenvalue λi with multiplicity 1 is of first order and is a 1 × 1 Jordan block matrix J1×1 = λi . A kth-order Jordan block is defined as ⎤ ⎡ λ 1 0 ⎥ ⎢ . ⎥ ⎢ λ .. ⎥ ∈ Ck×k , ⎢ Jk×k = ⎢ (7.3.5) ⎥ . . . 1⎦ ⎣ 0 λ in which the k entries on the main diagonal line are λ, and the k − 1 entries on the subdiagonal on the right of main diagonal are equal to 1, while all other entries are equal to zero. For example, second-order and third-order Jordan blocks are as

7.3 Similarity Reduction

follows:  J2×2 =

⎤ λ 1 0 = ⎣0 λ 1⎦ . 0 0 λ ⎡



λ 1 , 0 λ

427

J3×3

The Jordan canonical form J has the following properties. 1. J is an upper bidiagonal matrix. 2. J is a diagonal matrix in the special case of n Jordan blocks of size nk = 1. 3. J is unique up to permutations of the blocks (so it is called the Jordan canonical form). 4. J can have multiple blocks with the same eigenvalue. A multiple eigenvalue λ with algebraic multiplicity mi may have one or more Jordan blocks, depending on the geometric multiplicity of λ. For example, if a 3 × 3 matrix A has an eigenvalue λ0 with algebraic multiplicity 3, then the Jordan canonical form of the matrix A may have three forms: ⎤ ⎡ ⎤ ⎡ 0 0 0 J1×1 λ0 0 J1 = ⎣ 0 J1×1 0 ⎦ = ⎣ 0 λ0 0 ⎦ (geometric multiplicity 3), 0 0 J1×1 0 0 λ0 ⎤ ⎡   0 λ0 0 J1×1 0 ⎣ = 0 λ0 1 ⎦ (geometric multiplicity 2), J2 = 0 J2×2 0 0 λ0 ⎤ ⎡ 0 λ0 1 J3 = J3×3 = ⎣ 0 λ0 1 ⎦ (geometric multiplicity 1), 0 0 λ0 because the meaning of the geometric multiplicity α of the eigenvalue λ is that the number of linearly independent eigenvectors corresponding to λ is α. Jordan canonical forms with different permutation orders of the Jordan blocks are regarded as the same Jordan canonical form, as stated above. For example, the second-order Jordan canonical form J2 can also be arranged as ⎤ ⎡   0 λ0 1 J2×2 0 = ⎣ 0 λ0 0 ⎦ . J2 = 0 J1×1 0 0 λ0 In practical applications, a nondiagonalizable m × m matrix usually has multiple eigenvalues. Because the number of Jordan blocks associated with an eigenvalue λi is equal to its geometric multiplicity αi , a natural question is how to determine αi for a given eigenvalue λi ? The method for determining the geometric multiplicity (i.e. the number of Jordan blocks) of a given m × m matrix A is as follows.

428

Eigenanalysis

(1) The number of Jordan blocks with order greater than 1 (i.e., ≥ 2) is determined by α1 = rank(λi I − A) − rank(λi I − A)2 . (2) The number of Jordan blocks with order greater than 2 (i.e., ≥ 3) is given by α2 = rank(λi I − A)2 − rank(λi I − A)3 . (3) More generally, the number of Jordan blocks with the order greater than k − 1 (i.e., ≥ k) is determined by αk−1 = rank(λi I − A)k−1 − rank(λi I − A)k . (4) The sum of the orders of the Jordan blocks corresponding to an eigenvalue λi is equal to its algebraic multiplicity. EXAMPLE 7.3

Find the Jordan canonical form of a 3 × 3 matrix ⎡ ⎤ 1 0 4 A = ⎣ 2 −1 4 ⎦. −1 0 −3

Solution From the characteristic determinant   λ−1 0 −4  |λI − A| =  −2 λ+1 −4  1 0 λ+3

(7.3.6)

    = (λ + 1)3 = 0,  

we find that get the eigenvalue of the matrix A is λ = −1 with (algebraic) multiplicity 3. For λ = −1, since ⎡

−2 0 λI − A = ⎣−2 0 1 0

⎤ −4 −4⎦ , 2



⎤ 0 0 0 (λI − A)2 = ⎣0 0 0⎦ , 0 0 0

the number of Jordan blocks with order ≥ 2 is given by rank(λI − A) − rank(λI − A)2 = 1 − 0 = 1, i.e., there is one Jordan block with order ≥ 2. Moreover, since rank(λI − A)2 − rank(λI − A)3 = 0 − 0 = 0, there is no third-order Jordan block. Therefore, the eigenvalue −1 with multiplicity 3 corresponds to two Jordan blocks, of which one Jordan block has order 2, and another has order 1. In other words, the Jordan canonical form of the given matrix A is  J J = 1×1 0

0 J2×2





⎤ −1 0 0 = ⎣ 0 −1 1 ⎦ . 0 0 −1

The Jordan canonical form P−1 AP = J can be written as AP = PJ. When A has d different eigenvalues λi with multiplicity mi , i = 1, . . . , d, and m1 +· · ·+md = r with r = rank(A), the Jordan canonical form is J = Diag(J1 , . . . , Jd , 0, . . . , 0),

7.3 Similarity Reduction

429

so AP = PJ simplifies to ⎡

J1 ⎢ .. A[P1 , . . . , Pd ] = [P1 , . . . , Pd ] ⎣ . O

··· .. . ···

⎤ O .. ⎥ .⎦ Jd

= [P1 J1 , . . . , Pd Jd ], where the Jordan blocks are given by ⎡ ⎤ λi 1 · · · 0 ⎢ .⎥ ⎢ 0 . . . . . . .. ⎥ ⎥ ∈ Rmi ×mi , Ji = ⎢ ⎢. . ⎥ .. λ ⎣ .. 1⎦ i 0 ··· 0 λi

(7.3.7)

i = 1, . . . , d.

Letting Pi = [pi,1 , . . . , pi,mi ], Equation (7.3.7) gives Api,j = pi,j−1 + λi pi,j ,

i = 1, . . . , d, j = 1, . . . , mi ,

(7.3.8)

where pi,0 = 0. Equation (7.3.8) provides a method of computation for the similarity transformation matrix P. Algorithm 7.1 shows how to find the Jordan canonical form of a matrix A ∈ m×m C via similarity reduction. Algorithm 7.1

Similarity reduction of matrix A

input: A ∈ Rm×m . 1. Solve |λI − A| = 0 for λi with multiplicity mi , where i = 1, . . . , d, and m1 + · · · + md = m. 2. Solve Equation (7.3.8) for Pi = [pi,1 , . . . , pi,mi ], i = 1, . . . , d. 3. Set P = [P1 , . . . , Pd ], and find its inverse matrix P−1 .

4. Compute P−1 AP = J. output: Similarity transformation matrix P and Jordan canonical form J.

EXAMPLE 7.4 matrix

Use similarity reduction to find the Jordan canonical form of the ⎡

1 0 A = ⎣ 2 −1 −1 0

⎤ 4 4 ⎦. −3

Solution From |λI − A| = 0 we get the single eigenvalue λ = −1 with algebraic multiplicity 3.

430

Eigenanalysis

Let j = 1, and solve Equation (7.3.8): ⎤⎡ ⎡ ⎤ ⎡ ⎤ p11 1 0 4 p11 ⎣ 2 −1 4 ⎦ ⎣ p21 ⎦ = − ⎣ p21 ⎦ −1 0 −3 p31 p31 from which we have p11 = −p31 ; p21 can take an arbitrary value. Take p11 = p31 = 0 and p21 = 1. Solving Equation (7.3.8) for j = 2, ⎡ ⎤⎡ ⎤ ⎡ ⎤ ⎤ ⎡ 1 0 4 p12 0 p12 ⎣ 2 −1 4 ⎦ ⎣ p22 ⎦ = ⎣ 1 ⎦ − ⎣ p22 ⎦ , −1 0 −3 p32 p32 0 we get p12 = −2p32 ; p22 can take an arbitrary value. Take p12 = 2, p22 = 2 and p32 = −1. Similarly, from Equation (7.3.8) with j = 3, ⎡ ⎤⎡ ⎤ ⎡ ⎤ ⎤ ⎡ 1 0 4 p13 2 p13 ⎣ 2 −1 4 ⎦ ⎣ p23 ⎦ = ⎣ 2 ⎦ − ⎣ p23 ⎦ , −1

0 −3

p33

−1

p33

we obtain p13 +2p33 = 1; p32 can take an arbitrary value. Take p13 = 1, p23 = 0 and p33 = 0. Hence, the similarity transformation matrix and its inverse are respectively as follows: ⎡ ⎤ ⎡ ⎤ 0 2 1 0 1 2 P=⎣ 1 (7.3.9) 2 0 ⎦ , P−1 = ⎣ 0 0 −1 ⎦ . 0 −1 0 1 0 2 Therefore the similarity reduction of A is given by ⎡ ⎤ −1 0 0 J = P−1 AP = ⎣ 0 −1 1 ⎦. 0 0 −1

(7.3.10)

7.3.3 Similarity Reduction of Matrix Polynomials Consider a polynomial f (x) = an xn + an−1 xn−1 + · · · + a1 x + a0 .

(7.3.11)

When an = 0, n is called the order of the polynomial f (x). An nth-order polynomial is known as a monic polynomial if the coefficient of xn is equal to 1. Given a matrix A ∈ Cm×m and a polynomial function f (x), we say that f (A) = an An + an−1 An−1 + · · · + a1 A + a0 I is an nth-order matrix polynomial of A.

(7.3.12)

7.3 Similarity Reduction

431

Suppose that the characteristic polynomial p(A) = λI − A has d different roots which give the eigenvalues λ1 , . . . , λd of the matrix A and that the multiplicity of the eigenvalue λi is mi , i.e., m1 + · · · + md = m. If the Jordan canonical form of the matrix A is J, i.e., A = PJP−1 = PDiag(J1 (λ1 ), . . . , Jd (λd ))P−1 ,

(7.3.13)

then f (A) = an An + an−1 An−1 + · · · + a1 A + a0 I = an (PJP−1 )n + an−1 (PJP−1 )n−1 + · · · + a1 (PJP−1 ) + a0 I = P(an Jn + an−1 Jn−1 + · · · + a1 J + a0 I)P−1 = Pf (J)P−1

(7.3.14)

is known as the similarity reduction of the matrix polynomial f (A), where f (J) = an Jn + an−1 Jn−1 + · · · + a1 J + a0 I ⎡ ⎤n ⎡ O J1 J1 ⎢ ⎥ ⎢ . .. .. = an ⎣ ⎦ + · · · + a1 ⎣ . O =

Jp

Diag(an Jn1

O

+ · · · + a1 J1 +

O





⎥ ⎢ ⎦ + a0 ⎣

Jd

a0 I1 , . . . , an Jnd

O

I1 .. O

.

⎤ ⎥ ⎦

Id

+ · · · + a1 Jd + a0 Id )

= Diag(f (J1 ), . . . , f (Jd )).

(7.3.15)

Substitute Equation (7.3.15) into Equation (7.3.14) to yield f (A) = PDiag(f (J1 ), . . . , f (Jd ))P−1 ,

(7.3.16)

where f (Ji ) ∈ Cmi ×mi is the Jordan representation of the matrix function f (A) associated with the eigenvalue λi and is defined as ⎤ ⎡ 1

1 f (λi ) f (λi ) f (λ) · · · f (mi −1) (λi ) 2! (mi − 1)! ⎥ ⎢ ⎥ ⎢ 1

(mi −2) ⎢ f (λi ) f (λi ) ··· f (λi ) ⎥ ⎥ ⎢ (mi − 2)! ⎥ , (7.3.17) f (Ji ) = ⎢ .. .. .. ⎥ ⎢ . . ⎥ ⎢ . ⎥ ⎢

⎦ ⎣ f (λi ) f (λi ) 0 f (λi ) where f (k) (x) = dk f (x)/dxk is the kth-order derivative of the function f (x). Equation (7.3.16) is derived under the assumption that f (A) is a the matrix polynomial. However, using (PAP−1 )k = PJk P−1 , it is easy to see that Equation (7.3.16) applies to other common matrix functions f (A) as well. (1) Powers of a matrix AK = PJK P−1 = Pf (J)P−1 . def

432

Eigenanalysis

In this case, f (x) = xK . (2) Matrix logarithm

 ∞ ∞   (−1)n−1 (−1)n−1 n ln(I + A) = A =P An P−1 n n n=1 n=1 def

= Pf (J)P−1 . In this case, f (x) = ln(1 + x). (3) Sine and cosine functions

 ∞ ∞   (−1)n (−1)n 2n+1 2n+1 sin A = P−1 =P A J (2n + 1)! (2n + 1)! n=0 n=0 def

= Pf (J)P−1 ,

 ∞ ∞   (−1)n (−1)n 2n cos A = A =P J2n P−1 = Pf (J)P−1 , (2n)! (2n)! n=0 n=0 def

where f1 (x) = sin x and f2 (x) = cos x, respectively. (4) Matrix exponentials  ∞ ∞   1 1 n A def n e = P−1 = Pf (J)P−1 , A =P J n! n! n=0 n=0  ∞ ∞   1 1 −A def n n n n e P−1 = Pf (J)P−1 . = (−1) A = P (−1) J n! n! n=0 n=0 Here the scalar functions corresponding to the matrix exponentials f1 (A) = eA and f2 (A) = e−A are respectively f1 (x) = ex and f2 (x) = e−x . (5) Matrix exponential functions  ∞ ∞   1 1 n n At def n n e = P−1 = Pf (J)P−1 , A t =P J t n! n! n=0 n=0  ∞ ∞   1 1 −At def n n n n n n e P−1 = Pf (J)P−1 , = (−1) A t = P (−1) J t n! n! n=0 n=0 where the scalar functions corresponding to the matrix exponential functions f1 (A) = eAt and f2 (A) = e−At are respectively f1 (x) = ext and f2 (x) = e−xt . The following example shows how to compute matrix functions using (7.3.16) and (7.3.17). EXAMPLE 7.5

Given the matrix ⎡

⎤ 1 0 4 A = ⎣ 2 −1 4 ⎦, −1 0 −3

7.3 Similarity Reduction

433

compute (a) the matrix polynomial f (A) = A4 −3A3 +A−I, (b) the matrix power f (A) = A1000 , (c) the matrix exponential f (A) = eA , (d) the matrix exponential function f (A) = eAt and (e) the matrix trigonometric function sin A. Solution In Example 7.4, we found the similarity transformation matrix P, its inverse matrix P−1 and the Jordan canonical form of the above matrix, as follows: ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 0 1 2 −1 0 0 0 2 1 P = ⎣1 1 ⎦. 2 0 ⎦, P−1 = ⎣ 0 0 −1 ⎦, J = P−1 AP = ⎣ 0 −1 1 0 2 0 0 −1 0 −1 0 Thus we have the matrix polynomial f (A) = Pf (J)P−1 ⎤⎡ ⎡ ⎤⎡ 0 0 2 1 f (−1) 0 0 =⎣ 1 2 0 ⎦⎣ 0 f (−1) f (−1) ⎦ ⎣ 0 0 0 f (−1) 1 0 −1 0 ⎤ ⎡



0 4f (−1) f (−1) + 2f (−1) ⎦. =⎣ 2f (−1) f (−1) 4f (−1) −f (−1) 0 f (−1) − 2f (−1)

1 0 0

⎤ 2 −1 ⎦ 2 (7.3.18)

(a) To compute the matrix polynomial f (A) = A4 − 3A3 + A − I, consider the corresponding polynomial function f (x) = x4 −3x3 +x−1 with first-order derivative f (x) = 4x3 − 9x2 + 1. Hence, for the triple eigenvalue λ = −1, f (−1) = 5 and the first-order derivative f (−1) = −12. Substituting f (−1) = 5 and f (−1) = −12 into Equation (7.3.18), we have ⎡ ⎤ −19 0 −48 f (A) = A4 − 3A3 + A − I = ⎣ −24 5 −48 ⎦ . 12 0 29 (b) To compute the matrix power f (A) = A1000 , consider the corresponding polynomial function f (x) = x1000 , whose first-order derivative f (x) = 1000x999 . Substitute f (−1) = 1 and f (−1) = −1000 into Equation (7.3.18) to yield directly ⎡ ⎤ −1999 0 −4000 A1000 = ⎣ −2000 1 −4000 ⎦ . 1000 0 2001 (c) To compute the matrix exponential f (A) = eA , consider the polynomial functions f (x) = ex and f (x) = ex ; we have f (−1) = e−1 and f (−1) = e−1 . From these values and Equation (7.3.18), we immediately get ⎤ ⎡ −1 0 4e−1 3e eA = ⎣ 2e−1 e−1 4e−1 ⎦ . −e−1 0 −e−1

434

Eigenanalysis

(d) To compute the matrix exponential function f (A) = eAt , consider the polynomial functions f (x) = ext and f (x) = text ; it is well known that f (−1) = e−t , f (−1) = te−t . Substituting these two values into (7.3.18), we get ⎤ ⎡ −t 0 4te−t e + 2te−t eAt = ⎣ 2te−t e−t 4te−t ⎦ . −t −t −te 0 e − 2te−t (e) To compute the matrix trigonometric function sin A, consider the polynomial function f (x) = sin x, whose first-order derivative is f (x) = cos x. Substituting f (−1) = sin(−1) and f (−1) = cos(−1) into Equation (7.3.18), we immediately get ⎡ ⎤ sin(−1) + 2 cos(−1) 0 4 cos(−1) ⎦. sin A = ⎣ 2 cos(−1) sin(−1) 4 cos(−1) − cos(−1) 0 sin(−1) − 2 cos(−1) 7.4 Polynomial Matrices and Balanced Reduction The matrix polynomial and its similarity reduction presented in the above section can be used to calculate matrix powers and the matrix functions, but the determination of the Jordan canonical forms requires three key steps. (1) Solve the characteristic equation |λI − A| = 0 to get the eigenvalues λ1 , . . . , λm of A ∈ Cm×m . (2) Solve Api,j = pi,j−1 + λi pi,j , i = 1, . . . , d, j = 1, . . . , mi to determine the similarity transformation sub-matrices Pi = [pi,1 , . . . , pi,mi ], i = 1, . . . , d. (3) Find the inverse matrix P−1 = [P1 , . . . , Pd ]−1 . This section presents an alternative way to determine the Jordan canonical form of A ∈ Cm×m . The basic starting point of this idea is to transform the similarity reduction of the constant matrix A into a balanced reduction of the characteristic polynomial matrix λI − A and then to transform its Smith normal form into the Jordan canonical form of A.

7.4.1 Smith Normal Forms DEFINITION 7.8 An m × n matrix A(x) = [aij (x)], with polynomials of the argument x as entries, is called a polynomial matrix. Denote R[x]m×n as the collection of m × n real polynomial matrices and C[x]m×n as the collection of m × n complex polynomial matrices. It is easily to verify that C[x]m×n and R[x]m×n are respectively linear spaces in the complex field C and the real field R. The polynomial matrix and the matrix polynomial are two different concepts: the matrix polynomial f (A) = an An + an−1 An−1 + · · · + a1 A + a0 I is a polynomial

7.4 Polynomial Matrices and Balanced Reduction

435

with the matrix A as its argument, while the polynomial matrix A(x) is a matrix with polynomials of x as entries aij (x). DEFINITION 7.9 Let A(λ) ∈ C[λ]m×n ; then r is called the rank of the polynomial matrix A(x), denoted r = rank[A(x)], if any k ≥ (r + 1)th-order minor is equal to zero and there is at least one kth-order minor that is the nonzero polynomial in C[x]. In particular, if m = rank[A(x)], then an mth-order polynomial matrix A(x) ∈ C[x]m×m is said to be a full-rank polynomial matrix or a nonsingular polynomial matrix. DEFINITION 7.10 Let A(x) ∈ C[x]m×m . If there is another polynomial matrix B(x) ∈ C[x]m×m such that A(x)B(x) = B(x)A(x) = Im×m , then the polynomial matrix A(x) is said to be invertible, and B(x) is called the inverse of the polynomial matrix A(x), denoted B(x) = A−1 (x). For a constant matrix, its nonsingularity and invertibility are two equivalent concepts. However, the nonsingularity and invertibility of a polynomial matrix are two different concepts: nonsingularity is weaker than invertibility, i.e., a nonsingular polynomial matrix is not necessarily invertible while an invertible polynomial matrix must be nonsingular. A constant matrix A ∈ Cm×m can be reduced to the Jordan canonical form J via a similarity transformation. A natural question to ask is: what is the standard form of polynomial matrix reduction? To answer this question, we first discuss how to reduce a polynomial matrix. DEFINITION 7.11 Let A ∈ C[x]m×n , P ∈ Rm×m and d(x) ∈ R[x]. Elementary row matrices are of the following three types. The Type-I elementary row matrix Pij = [e1 , . . . , ei−1 , ej , ei+1 , . . . , ej−1 , ei , ej+1 , . . . , em ] is such that Pij A interchanges rows i and j of A. The Type-II elementary row matrix Pi (α) = [e1 , . . . , ei−1 , αei , ei+1 , . . . , em ] is such that Pi (α)A multiplies row i of A by α = 0. The Type-III elementary row matrix Pij [d(x)] = [e1 , . . . , ei , . . . , ej−1 , ej + d(x)ei , . . . , em ] is such that Pij [d(x)]A adds d(x) times row i to row j. Similarly, letting Q ∈ Rn×n , we can define three types of elementary column matrices, Qij , Qi (α) and Qij [d(x)], respectively, as follows.

436

Eigenanalysis

DEFINITION 7.12 Two m × n polynomial matrices A(x) and B(x) are said to be balanced, denoted A(x) ∼ = B(x), if there is a sequence of elementary row matrices P1 (x), . . . , Ps (x) and a sequence of elementary column matrices Q1 (x), . . . , Qt (x) such that (7.4.1) B(x) = Ps (x) · · · P1 (x)A(x)Q1 (x) · · · Qt (x) = P(x)A(x)Q(x), where P(x) is the product of the elementary row matrices and Q(x) is the product of the elementary column matrices. Balanced matrices have the following basic properties. (1) Reflexivity Any polynomial matrix A(x) is balanced with respect to itself: A(x) ∼ = A(x). (2) Symmetry B(x) ∼ = A(x) ⇔ A(x) ∼ = B(x). (3) Transitivity If C(x) ∼ = B(x) and B(x) ∼ = A(x) then C(x) ∼ = A(x). The balance of polynomial matrices is also known as their equivalence. A reduction which transforms a polynomial matrix into another, simpler, balanced polynomial matrix is known as balanced reduction. It is self-evident that, given a polynomial matrix A(x) ∈ C[x]m×n , it is desirable that it is balanced to the simplest possible form. This simplest polynomial matrix is the well-known Smith normal form. THEOREM 7.7 If the rank of a polynomial matrix A(x) ∈ C[x]m×n is equal to r, then A(x) can be balanced to the Smith normal form S(x) as follows: A(x) ∼ (7.4.2) = S(x) = Diag[σ1 (x), σ2 (x), . . . , σr (x), 0, . . . , 0], where σi (x) divides σi+1 (x) for 1 ≤ i ≤ r − 1. EXAMPLE 7.6

Given a polynomial matrix ⎡ x+1 2 A(x) = ⎣ 1 x 1 1

⎤ −6 −3 ⎦ , x−4

perform the elementary transformations of A(x) to reduce it. Solution



x+1 A(x) = ⎣ 1 1 ⎡ x−1 →⎣ 0 1 ⎡ x−1 →⎣ 0 x−1

⎤ ⎡ ⎤ ⎡ ⎤ 2 −6 x − 1 0 −2x + 2 x−1 0 0 x −3 ⎦ → ⎣ 1 x −3 ⎦ → ⎣ 1 x −1 ⎦ 1 x−4 1 1 x−4 1 1 x−2 ⎤ ⎡ ⎤ x−1 0 0 0 0 x−1 0 ⎦ x − 1 −x + 1⎦ → ⎣ 0 1 1 x−1 1 x−2 ⎤ ⎤ ⎡ x−1 0 0 0 0 ⎦→⎣ 0 ⎦ = B(x). x−1 0 x−1 0 2 2 0 0 (x − 1) x − 1 (x − 1)

7.4 Polynomial Matrices and Balanced Reduction

437

The matrix B(x) is not in Smith normal form, because |B(x)| = (x−1)4 is a fourthorder polynomial of x while |A(x)| is clearly a third-order polynomial of x. Since b11 , b22 , b33 contain the first-order factor x − 1, the first row, say, of B(x) can be divided by this factor. After this elementary row transformation, the matrix B(x) is further reduced to ⎤ ⎡ 1 0 0 ⎦. A(x) ∼ 0 = ⎣0 x − 1 2 0 0 (x − 1) This is the Smith normal form of the polynomial matrix xI − A. COROLLARY 7.1 The Smith normal form of a polynomial matrix A(x) ∈ C[x]m×n is unique. That is, every polynomial matrix A(x) can be balanced to precisely one matrix in Smith normal form. The major shortcoming of the elementary transformation method for reducing a polynomial matrix A(x) to the Smith normal form is that a sequence of elementary transformations requires a certain skill that is not easy to program, and is sometimes more trouble, than just using the matrix A(x) as it stands.

7.4.2 Invariant Factor Method To overcome the shortcoming of the elementary transformation method, it is necessary to introduce an alternative balanced reduction method. Extract any k rows and any k columns of an m × m matrix A to obtain a square matrix whose determinant is the kth minor of A. DEFINITION 7.13 The determinant rank of A is defined to be the largest integer r for which there exists a nonzero r × r minor of A. DEFINITION 7.14 Given a polynomial matrix A(x) ∈ C[x]m×n , if its rank is rank[A(x)] = r then, for a natural number k ≤ r, there is at least one nonzero kth minor of A(x). The greatest common divisor of all nonzero kth minors, denoted dk (x), is known as the kth determinant divisor of the polynomial matrix A(x). THEOREM 7.8 The kth determinant divisors dk (x) = 0 for 1 ≤ k ≤ r with r = rank(A). Also, dk−1 (x) divides dk (x), denoted dk−1 |dk , for 1 ≤ k ≤ r. Proof (from [319]) Let r = rank(A). Then there exists an r × r nonzero minor of A and hence dr (x) = 0. Then, because each r × r minor is a linear combination of (r − 1) × (r − 1) minors of A, it follows that some (r − 1) × (r − 1) minor of A is also nonzero and thus dr−1 (x) = 0. Also, dr−1 (x) divides each minor of size r − 1 and consequently divides each minor of size r, and hence dr−1 (x) divides dr (x), the greatest common divisor of all minors of size r. This argument can be repeated with r replaced by r − 1 and so on.

438

Eigenanalysis

The kth determinant divisors dk (x), k = 1, . . . , r in Theorem 7.8 are the polynomials with leading coefficient 1 (i.e., monic polynomials). DEFINITION 7.15 If rank(A(x)) = r then the invariant factors of A(x) are defined as σi (x) = di (x)/di−1 (x), 1 ≤ i ≤ r, where d0 (x) = 1. The numbers of the various minors of a given m × m polynomial matrix are as follows: 1 1 × Cm = m2 first minors; (1) Cm 2 2 (2) Cm × Cm second minors; =2 < m(m − 1) · · · (m − k + 1) k k (3) Cm × Cm = kth minors, where k = 1, . . . , m − 1; 2 × 3 × ··· × k

m = 1 mth minor. (4) Cm

For example, for the polynomial matrix ⎤ ⎡ x 0 1 A(x) = ⎣x2 + 1 x 0 ⎦, x − 1 −x x + 1 its various minors are as follows: The number of first minors is 9, in which the number of nonzero minors is 7. The number of second minors is C32 × C32 = 9 and all are nonzero minors. The number of third minors is just 1, and |A(x)| = 0. EXAMPLE 7.7

Given a polynomial matrix ⎡ 0 x(x − 1) A(x) = ⎣x 0 0 0

⎤ 0 x + 1 ⎦, −x + 1

find its Smith normal form. Solution The nonzero first minors are given by |x(x − 1)| = x2 − x,

|x| = x,

|x + 1| = x + 1,

| − x + 1| = −x + 1.

So, the first determinant divisor is d1 (x) = 1. The nonzero second minors,        0 x(x − 1)  = −x2 (x − 1), x x + 1  = −x(x − 2),    0 −x + 2 x 0   x(x − 1) 0   = x(x − 1)(x + 1),  0 x + 1 have greatest common divisor x, so the second determinant divisor is d2 (x) = x.

The third determinant,

$$\begin{vmatrix} 0 & x(x-1) & 0 \\ x & 0 & x+1 \\ 0 & 0 & -x+2 \end{vmatrix} = x^2(x-1)(x-2),$$

yields the third determinant divisor d_3(x) = x^2(x − 1)(x − 2) directly. Hence the invariant factors are

$$\sigma_1(x) = \frac{d_1(x)}{d_0(x)} = 1, \qquad \sigma_2(x) = \frac{d_2(x)}{d_1(x)} = x, \qquad \sigma_3(x) = \frac{d_3(x)}{d_2(x)} = x(x-1)(x-2).$$

Thus the Smith normal form is

$$S(x) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & x & 0 \\ 0 & 0 & x(x-1)(x-2) \end{bmatrix}.$$
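As a quick check, the hypothetical invariant_factors helper sketched after Definition 7.15 reproduces this result (the helper is assumed to be in scope).

# Verify Example 7.7 with the invariant_factors() sketch above.
A7 = sp.Matrix([[0, x*(x - 1), 0],
                [x, 0, x + 1],
                [0, 0, -x + 2]])
print(invariant_factors(A7, x))   # expected invariant factors: 1, x, x(x - 1)(x - 2)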

The most common polynomial matrix is λI − A, where λ is an eigenvalue of the m × m matrix A. In automatic control, signal processing, system engineering and so on, the polynomial matrix λI − A is often called the λ-matrix of A, and is denoted A(λ) = λI − A.

EXAMPLE 7.8  Find the Smith normal form of the λ-matrix of the 3 × 3 matrix

$$A(\lambda) = \lambda I - A = \begin{bmatrix} \lambda+1 & 2 & -6 \\ 1 & \lambda & -3 \\ 1 & 1 & \lambda-4 \end{bmatrix}.$$

Solution  Every entry is itself a first minor, and the greatest common divisor of the entries is 1; hence the first determinant divisor is d_1(λ) = 1. The second minors are, respectively,

$$\begin{vmatrix} \lambda+1 & 2 \\ 1 & \lambda \end{vmatrix} = (\lambda-1)(\lambda+2), \qquad \begin{vmatrix} \lambda+1 & -6 \\ 1 & -3 \end{vmatrix} = -3(\lambda-1), \qquad \begin{vmatrix} 2 & -6 \\ \lambda & -3 \end{vmatrix} = 6(\lambda-1),$$
$$\begin{vmatrix} \lambda+1 & 2 \\ 1 & 1 \end{vmatrix} = \lambda-1, \qquad \begin{vmatrix} \lambda+1 & -6 \\ 1 & \lambda-4 \end{vmatrix} = (\lambda-1)(\lambda-2), \qquad \begin{vmatrix} 2 & -6 \\ 1 & \lambda-4 \end{vmatrix} = 2(\lambda-1),$$
$$\begin{vmatrix} 1 & \lambda \\ 1 & 1 \end{vmatrix} = -(\lambda-1), \qquad \begin{vmatrix} 1 & -3 \\ 1 & \lambda-4 \end{vmatrix} = \lambda-1, \qquad \begin{vmatrix} \lambda & -3 \\ 1 & \lambda-4 \end{vmatrix} = (\lambda-1)(\lambda-3).$$

Their greatest common divisor is λ − 1, so the second determinant divisor is d_2(λ) = λ − 1.


The only third minor is

$$\begin{vmatrix} \lambda+1 & 2 & -6 \\ 1 & \lambda & -3 \\ 1 & 1 & \lambda-4 \end{vmatrix} = (\lambda-1)^3,$$

which gives the third determinant divisor d_3(λ) = (λ − 1)^3. From the above results, the invariant factors are

$$\sigma_1(\lambda) = \frac{d_1(\lambda)}{d_0(\lambda)} = 1, \qquad \sigma_2(\lambda) = \frac{d_2(\lambda)}{d_1(\lambda)} = \lambda - 1, \qquad \sigma_3(\lambda) = \frac{d_3(\lambda)}{d_2(\lambda)} = (\lambda-1)^2,$$

which gives the Smith normal form

$$S(\lambda) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \lambda-1 & 0 \\ 0 & 0 & (\lambda-1)^2 \end{bmatrix}.$$
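The same hypothetical helpers can be applied directly to a λ-matrix; the snippet below (my own check, with invariant_factors assumed in scope) reproduces the invariant factors of Example 7.8.

# Invariant factors of the lambda-matrix from Example 7.8.
import sympy as sp

lam = sp.symbols('lamda')
A8 = sp.Matrix([[lam + 1, 2, -6],
                [1, lam, -3],
                [1, 1, lam - 4]])
print(invariant_factors(A8, lam))   # expected: 1, lamda - 1, (lamda - 1)**2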

EXAMPLE 7.9  Find the invariant factors and the Smith normal form of the λ-matrix

$$A(\lambda) = \begin{bmatrix} \lambda-1 & 0 & -4 \\ -2 & \lambda+1 & -4 \\ 1 & 0 & \lambda+3 \end{bmatrix}.$$

Solution  Its determinant divisors are as follows.
(1) The greatest common divisor of the nonzero first minors is 1, so the first determinant divisor is d_1(λ) = 1.
(2) The greatest common divisor of the nonzero second minors is λ + 1, which yields directly d_2(λ) = λ + 1.
(3) The determinant

$$\begin{vmatrix} \lambda-1 & 0 & -4 \\ -2 & \lambda+1 & -4 \\ 1 & 0 & \lambda+3 \end{vmatrix} = (\lambda+1)^3$$

gives directly the third determinant divisor d_3(λ) = (λ + 1)^3.

Hence the invariant factors are

$$\sigma_1(\lambda) = \frac{d_1}{d_0} = 1, \qquad \sigma_2(\lambda) = \frac{d_2}{d_1} = \lambda + 1, \qquad \sigma_3(\lambda) = \frac{d_3}{d_2} = (\lambda+1)^2.$$

From these invariant factors we have the Smith normal form

$$S(\lambda) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \lambda+1 & 0 \\ 0 & 0 & (\lambda+1)^2 \end{bmatrix}.$$
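Again, the result can be checked with the hypothetical helpers sketched earlier (invariant_factors and the symbol lam assumed in scope).

# Invariant factors of the lambda-matrix from Example 7.9.
A9 = sp.Matrix([[lam - 1, 0, -4],
                [-2, lam + 1, -4],
                [1, 0, lam + 3]])
print(invariant_factors(A9, lam))   # expected: 1, lamda + 1, (lamda + 1)**2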


7.4.3 Conversion of Jordan Form and Smith Form

We have presented the similarity reduction of matrices and the balanced reduction of polynomial matrices, respectively. An interesting issue is the relationship between the Jordan canonical form of similarity reduction and the Smith normal form of balanced reduction. Via the linear transformation

$$A(\lambda) = \lambda I - A = \begin{bmatrix} \lambda-a_{11} & -a_{12} & \cdots & -a_{1m} \\ -a_{21} & \lambda-a_{22} & \cdots & -a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ -a_{m1} & -a_{m2} & \cdots & \lambda-a_{mm} \end{bmatrix}, \tag{7.4.3}$$

a constant matrix A easily becomes a polynomial matrix A(λ).

THEOREM 7.9  Let A and B be two m × m (constant) matrices. Then

$$A \sim B \;\Leftrightarrow\; (xI_m - A) \cong (xI_m - B) \;\Leftrightarrow\; xI_m - A \text{ and } xI_m - B \text{ have the same Smith normal form}.$$

Proof [319]  For the forward implication: if P^{-1}AP = B, where P ∈ R^{m×m} is nonsingular, then P^{-1}(xI_m − A)P = xI_m − P^{-1}AP = xI_m − B. For the backward implication: if xI_m − A and xI_m − B are balanced then they have the same invariant factors and hence the same nontrivial invariant factors; that is, A and B have the same invariant factors, and hence are similar.

Theorem 7.9 establishes the relationship between the Jordan canonical form J of the matrix A and the Smith normal form of the λ-matrix λI − A. Note that if the polynomial matrix A(x) is not of the form xI − A then Theorem 7.9 no longer holds: A ∼ B does not imply A(x) ≅ B(x) when A(x) ≠ xI − A.

Let the Jordan canonical form of a matrix A be J, i.e., A ∼ J. Hence, by Theorem 7.9, we have

$$\lambda I - A \cong \lambda I - J. \tag{7.4.4}$$

If the Smith normal form of the λ-matrix λI − A is S(λ), i.e.,

$$\lambda I - A \cong S(\lambda), \tag{7.4.5}$$

then from Equations (7.4.4) and (7.4.5) and the transitivity of balanced matrices, it immediately follows that

$$S(\lambda) \cong \lambda I - J. \tag{7.4.6}$$

This is the relationship between the Jordan canonical form J of the matrix A and the Smith normal form S(λ) of the λ-matrix λI − A.
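As a concrete illustration of Theorem 7.9 and (7.4.6), the sketch below (my own, using sympy's jordan_form together with the hypothetical invariant_factors helper above) recovers the constant matrix behind Example 7.8 and confirms that λI − A and λI − J have the same invariant factors, hence the same Smith normal form.

# Illustration of Theorem 7.9 / (7.4.6).
import sympy as sp

lam = sp.symbols('lamda')
A = sp.Matrix([[-1, -2, 6],
               [-1, 0, 3],
               [-1, -1, 4]])            # recovered from A(lamda) = lamda*I - A in Example 7.8

P, J = A.jordan_form()                  # A = P * J * P**-1

print(invariant_factors(lam * sp.eye(3) - A, lam))   # [1, lamda - 1, (lamda - 1)**2]
print(invariant_factors(lam * sp.eye(3) - J, lam))   # the same list, as Theorem 7.9 predicts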


Let J = Diag(J_1, . . . , J_d, 0, . . . , 0) denote the Jordan canonical form of a constant matrix A ∈ C^{m×m}. From Equation (7.4.6) it is known that the Smith normal form of the λ-matrix λI − A (equivalently, of λI − J) should also have d Smith blocks:

$$S(\lambda) = \mathrm{Diag}[S_1(\lambda), \ldots, S_d(\lambda), 0, \ldots, 0], \tag{7.4.7}$$

where the Smith block S_i(λ) = Diag(σ_{i,1}, . . . , σ_{i,m_i}) and m_1 + · · · + m_d = r. Thus the Smith block S_i and the Jordan block J_i have the same dimensions, m_i × m_i. Therefore, the conversion between the Jordan canonical form and the Smith normal form reduces to the conversion between the Jordan block and the Smith block corresponding to the same eigenvalue λ_i.

7.4.4 Finding Smith Blocks from Jordan Blocks

Our problem is: given the Jordan blocks J_i, i = 1, . . . , d, of a constant matrix A, find the Smith blocks S_i(λ), i = 1, . . . , d, of the λ-matrix λI − A. According to the respective eigenvalue multiplicities, there are different corresponding relationships between Jordan blocks and Smith blocks.

(1) The first-order Jordan block corresponding to the single eigenvalue λ_i is J_1 = [λ_i]. From Equation (7.4.6) it is known that J_1 = [λ_i] ⇒ S_1(λ) = [λ − λ_i].

(2) For the eigenvalue λ_i with algebraic multiplicity m_i = 2 there are two possible second-order Jordan structures:

$$J_{21} = \begin{bmatrix} \lambda_i & 0 \\ 0 & \lambda_i \end{bmatrix} \;\Rightarrow\; S_{21}(\lambda) \cong \lambda I_2 - J_{21} = \begin{bmatrix} \lambda-\lambda_i & 0 \\ 0 & \lambda-\lambda_i \end{bmatrix},$$

$$J_{22} = \begin{bmatrix} \lambda_i & 1 \\ 0 & \lambda_i \end{bmatrix} \;\Rightarrow\; S_{22}(\lambda) \cong \lambda I_2 - J_{22} = \begin{bmatrix} \lambda-\lambda_i & -1 \\ 0 & \lambda-\lambda_i \end{bmatrix} \cong \begin{bmatrix} 1 & 0 \\ 0 & (\lambda-\lambda_i)^2 \end{bmatrix},$$

because the determinant divisors of λI_2 − J_{22} are d_1 = 1 and d_2 = (λ − λ_i)^2, and hence σ_1(λ) = d_1/d_0 = 1 and σ_2(λ) = d_2/d_1 = (λ − λ_i)^2.

(3) For the third-order Jordan structures we have

$$J_{31} = \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 0 \\ 0 & 0 & \lambda_i \end{bmatrix} \;\Rightarrow\; S_{31}(\lambda) \cong \lambda I_3 - J_{31} \cong \begin{bmatrix} \lambda-\lambda_i & 0 & 0 \\ 0 & \lambda-\lambda_i & 0 \\ 0 & 0 & \lambda-\lambda_i \end{bmatrix},$$

$$J_{32} = \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 1 \\ 0 & 0 & \lambda_i \end{bmatrix} \;\Rightarrow\; S_{32}(\lambda) \cong \lambda I_3 - J_{32} \cong \begin{bmatrix} 1 & 0 & 0 \\ 0 & \lambda-\lambda_i & 0 \\ 0 & 0 & (\lambda-\lambda_i)^2 \end{bmatrix}$$

and

$$J_{33} = \begin{bmatrix} \lambda_i & 1 & 0 \\ 0 & \lambda_i & 1 \\ 0 & 0 & \lambda_i \end{bmatrix} \;\Rightarrow\; S_{33}(\lambda) \cong \lambda I_3 - J_{33} \cong \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & (\lambda-\lambda_i)^3 \end{bmatrix},$$

because the invariant factors of λI_3 − J_{32} are

σ_1 = d_1/d_0 = 1,  σ_2 = d_2/d_1 = λ − λ_i,  σ_3 = d_3/d_2 = (λ − λ_i)^2,

while the invariant factors of λI_3 − J_{33} are

σ_1 = 1,  σ_2 = 1,  σ_3 = (λ − λ_i)^3.

Summarizing the above analysis, we have the corresponding relationship between the Jordan canonical form and the Smith normal form:

$$J_1 = [\lambda_i] \;\Leftrightarrow\; S_1(\lambda) = [\lambda - \lambda_i]; \tag{7.4.8}$$

$$J_{21} = \begin{bmatrix} \lambda_i & 0 \\ 0 & \lambda_i \end{bmatrix} \;\Leftrightarrow\; S_{21}(\lambda) = \begin{bmatrix} \lambda-\lambda_i & 0 \\ 0 & \lambda-\lambda_i \end{bmatrix}, \tag{7.4.9}$$

$$J_{22} = \begin{bmatrix} \lambda_i & 1 \\ 0 & \lambda_i \end{bmatrix} \;\Leftrightarrow\; S_{22}(\lambda) = \begin{bmatrix} 1 & 0 \\ 0 & (\lambda-\lambda_i)^2 \end{bmatrix}; \tag{7.4.10}$$

and

$$J_{31} = \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 0 \\ 0 & 0 & \lambda_i \end{bmatrix} \;\Leftrightarrow\; S_{31}(\lambda) = \begin{bmatrix} \lambda-\lambda_i & 0 & 0 \\ 0 & \lambda-\lambda_i & 0 \\ 0 & 0 & \lambda-\lambda_i \end{bmatrix}, \tag{7.4.11}$$

$$J_{32} = \begin{bmatrix} \lambda_i & 0 & 0 \\ 0 & \lambda_i & 1 \\ 0 & 0 & \lambda_i \end{bmatrix} \;\Leftrightarrow\; S_{32}(\lambda) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \lambda-\lambda_i & 0 \\ 0 & 0 & (\lambda-\lambda_i)^2 \end{bmatrix}, \tag{7.4.12}$$

$$J_{33} = \begin{bmatrix} \lambda_i & 1 & 0 \\ 0 & \lambda_i & 1 \\ 0 & 0 & \lambda_i \end{bmatrix} \;\Leftrightarrow\; S_{33}(\lambda) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & (\lambda-\lambda_i)^3 \end{bmatrix}. \tag{7.4.13}$$
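The pattern behind (7.4.8)–(7.4.13) can be checked mechanically: for a single n × n Jordan block J the Smith block of λI − J is diag(1, . . . , 1, (λ − λ_i)^n). A small sketch (my own, reusing the hypothetical invariant_factors helper defined earlier) follows.

# Check of the single-Jordan-block pattern behind (7.4.8)-(7.4.13).
import sympy as sp

lam, lam_i = sp.symbols('lamda lamda_i')

def jordan_block(eig, n):
    """n x n upper Jordan block with eigenvalue eig."""
    return sp.Matrix(n, n, lambda r, c: eig if r == c else (1 if c == r + 1 else 0))

for n in (1, 2, 3):
    J = jordan_block(lam_i, n)
    print(n, invariant_factors(lam * sp.eye(n) - J, lam))
# n = 1: [lamda - lamda_i]
# n = 2: [1, (lamda - lamda_i)**2]
# n = 3: [1, 1, (lamda - lamda_i)**3]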

7.4.5 Finding Jordan Blocks from Smith Blocks

Conversely, given the Smith normal form S(λ) of a λ-matrix λI − A, how do we find the Jordan canonical form J of the matrix A? Equations (7.4.8)–(7.4.13) can also be used to find the Jordan blocks from the Smith blocks. The question is how to separate the Smith blocks within the Smith normal form. Using the block division of the Smith normal form, we exclude the zero-order invariant factors (those equal to 1) and then separate the Smith blocks according to the following two cases.


Case 1  Single eigenvalue

If some ith invariant factor depends only on one eigenvalue λ_k, i.e., σ_i(λ) = (λ − λ_k)^{n_i} with n_i ≥ 1, then

$$S_i(\lambda) = \begin{bmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & & (\lambda-\lambda_k)^{n_i} \end{bmatrix} \;\Leftrightarrow\; J_i = \begin{bmatrix} \lambda_k & 1 & & \\ & \lambda_k & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_k \end{bmatrix} \in \mathbb{C}^{n_i \times n_i}. \tag{7.4.14}$$

For example,

$$S(\lambda) = \begin{bmatrix} 1 & & \\ & \lambda-\lambda_0 & \\ & & (\lambda-\lambda_0)^2 \end{bmatrix} \;\Leftrightarrow\; \begin{cases} S_1(\lambda) = \lambda - \lambda_0, \\[2pt] S_2(\lambda) = \begin{bmatrix} 1 & 0 \\ 0 & (\lambda-\lambda_0)^2 \end{bmatrix} \end{cases} \;\Leftrightarrow\; \begin{cases} J_1 = \lambda_0, \\[2pt] J_2 = \begin{bmatrix} \lambda_0 & 1 \\ 0 & \lambda_0 \end{bmatrix} \end{cases} \;\Leftrightarrow\; J = \begin{bmatrix} \lambda_0 & 0 & 0 \\ 0 & \lambda_0 & 1 \\ 0 & 0 & \lambda_0 \end{bmatrix}.$$

Case 2  Multiple eigenvalues

If the kth invariant factor is a product of factors depending on s eigenvalues,

$$\sigma_k(\lambda) = (\lambda - \lambda_{j_1})^{n_{k,1}} (\lambda - \lambda_{j_2})^{n_{k,2}} \cdots (\lambda - \lambda_{j_s})^{n_{k,s}}, \qquad n_{k,i} \ge 1, \; i = 1, \ldots, s,$$

then the invariant factor σ_k(λ) can be decomposed into s single-eigenvalue factors σ_{k1}(λ) = (λ − λ_{j_1})^{n_{k,1}}, . . . , σ_{ks}(λ) = (λ − λ_{j_s})^{n_{k,s}}. Then, using σ_{ki}(λ) = (λ − λ_{j_i})^{n_{k,i}}, the Jordan block corresponding to each factor can be obtained from the associated Smith block.

EXAMPLE 7.10  The third invariant factor σ_3(λ) = (λ − λ_1)(λ − λ_2)^2 is decomposed into two single-eigenvalue factors σ_{31}(λ) = λ − λ_1 and σ_{32}(λ) = (λ − λ_2)^2; hence

$$S(\lambda) = \begin{bmatrix} 1 & & \\ & 1 & \\ & & (\lambda-\lambda_1)(\lambda-\lambda_2)^2 \end{bmatrix} \;\Leftrightarrow\; \begin{cases} S_1(\lambda) = \lambda - \lambda_1, \\[2pt] S_2(\lambda) = \begin{bmatrix} 1 & 0 \\ 0 & (\lambda-\lambda_2)^2 \end{bmatrix} \end{cases} \;\Leftrightarrow\; \begin{cases} J_1 = \lambda_1, \\[2pt] J_2 = \begin{bmatrix} \lambda_2 & 1 \\ 0 & \lambda_2 \end{bmatrix} \end{cases} \;\Leftrightarrow\; J = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 1 \\ 0 & 0 & \lambda_2 \end{bmatrix}.$$

EXAMPLE 7.11  The nonzero-order invariant factor of the 5 × 5 Smith normal form is (λ − λ_1)^2 (λ − λ_2)^3, which is decomposed into two single-eigenvalue factors σ_{51}(λ) = (λ − λ_1)^2 and σ_{52}(λ) = (λ − λ_2)^3. Hence we have

$$S(\lambda) = \mathrm{Diag}[1, 1, 1, 1, (\lambda - \lambda_1)^2 (\lambda - \lambda_2)^3] \;\Leftrightarrow\; \begin{cases} S_1(\lambda) = \begin{bmatrix} 1 & 0 \\ 0 & (\lambda-\lambda_1)^2 \end{bmatrix}, \\[6pt] S_2(\lambda) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & (\lambda-\lambda_2)^3 \end{bmatrix} \end{cases}$$

$$\Leftrightarrow\; \begin{cases} J_1 = \begin{bmatrix} \lambda_1 & 1 \\ 0 & \lambda_1 \end{bmatrix}, \\[6pt] J_2 = \begin{bmatrix} \lambda_2 & 1 & 0 \\ 0 & \lambda_2 & 1 \\ 0 & 0 & \lambda_2 \end{bmatrix} \end{cases} \;\Leftrightarrow\; J = \begin{bmatrix} \lambda_1 & 1 & 0 & 0 & 0 \\ 0 & \lambda_1 & 0 & 0 & 0 \\ 0 & 0 & \lambda_2 & 1 & 0 \\ 0 & 0 & 0 & \lambda_2 & 1 \\ 0 & 0 & 0 & 0 & \lambda_2 \end{bmatrix}.$$
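The two cases above amount to factoring each nonunit invariant factor into single-eigenvalue factors (λ − λ_j)^n and replacing every such factor by an n × n Jordan block. Below is a small sketch of that rule (my own code, not the book's algorithm), under the assumption that each invariant factor splits into monic linear factors, as in Examples 7.10 and 7.11.

# Sketch of Cases 1 and 2: invariant factors -> single-eigenvalue factors -> Jordan blocks.
import sympy as sp

lam, l1, l2 = sp.symbols('lamda lamda_1 lamda_2')

def jordan_from_invariant_factors(sigmas, var):
    """Assumes every nonunit invariant factor splits into monic linear factors."""
    blocks = []
    for sigma in sigmas:
        if sigma == 1:
            continue                                    # skip zero-order invariant factors
        _, factors = sp.factor_list(sigma, var)
        for fac, exp in factors:                        # fac = var - eigenvalue
            eig = next(iter(sp.roots(sp.Poly(fac, var))))
            blocks.append(sp.Matrix(exp, exp, lambda r, c:
                                    eig if r == c else (1 if c == r + 1 else 0)))
    return sp.diag(*blocks)                             # block order is arbitrary

# Example 7.11: invariant factors 1, 1, 1, 1, (lamda - lamda_1)**2 * (lamda - lamda_2)**3.
print(jordan_from_invariant_factors([1, 1, 1, 1, (lam - l1)**2 * (lam - l2)**3], lam))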

In addition to determinant divisors and invariant factors, there is another kind of divisor of a polynomial matrix.

DEFINITION 7.16  Decompose an invariant factor into the product of nonzero-order factors, i.e., write σ_k(λ) = (λ − λ_{k1})^{c_{k1}} · · · (λ − λ_{k l_k})^{c_{k l_k}}, where c_{kj} > 0, j = 1, . . . , l_k; then every factor (λ − λ_{kj})^{c_{kj}} is called an elementary divisor of A(λ).

The elementary divisors are arranged in no particular order. All the elementary divisors of a polynomial matrix A(λ) together constitute the elementary divisor set of A(λ).

EXAMPLE 7.12  Given a polynomial matrix A(λ) ∈ C[λ]^{5×5} with rank 4 and Smith normal form S(λ) = Diag(1, λ, λ^2(λ − 1)(λ + 1), λ^2(λ − 1)(λ + 1)^3, 0), find the elementary divisors of A(λ).

Solution  From the second invariant factor σ_2(λ) = λ it is known that λ is an elementary divisor of A(λ). From the third invariant factor σ_3(λ) = λ^2(λ − 1)(λ + 1) it follows that three further elementary divisors of A(λ) are λ^2, λ − 1 and λ + 1. From the fourth invariant fact